GPU Optimization Engineer | SF

Smallest Inc.

San Francisco, CA, United States

Full Time

Expires On: 03/05/2026

Role

We’re hiring a GPU Optimization Engineer who understands GPUs at a deep, architectural level: someone who knows how to squeeze every last millisecond out of a model, which GPU constraints matter, and how to restructure models for real-world inference performance. You’ll work across CUDA kernels, model graph optimizations, hardware-specific tuning, and porting models across GPU architectures. Your work will directly affect the latency, throughput, and reliability of Smallest’s real-time speech models.

What You’ll Do

  • Optimize model architectures (ASR, TTS, SLMs) for maximum performance on specific GPU hardware
  • Profile models end-to-end to identify GPU bottlenecks — memory bandwidth, kernel launch overhead, fusion opportunities, quantization constraints
  • Design and implement custom kernels (CUDA/Triton/tinygrad) for performance-critical model sections
  • Perform operator fusion, graph optimization, and kernel-level scheduling improvements
  • Tune models to fit GPU memory limits while maintaining quality
  • Benchmark and calibrate inference across NVIDIA, AMD, and potentially emerging accelerators
  • Port models across GPU chipsets (NVIDIA → AMD / edge GPUs / new compute backends)
  • Work with TensorRT, ONNX Runtime, and custom runtimes for deployment
  • Partner with the research and infra teams to ensure the entire stack is optimized for real-time workloads

Requirements

  • Strong understanding of GPU architecture — SMs, warps, memory hierarchy, occupancy tuning
  • Hands‑on experience with CUDA, kernel writing, and kernel‑level debugging
  • Experience with kernel fusion and model graph optimizations
  • Familiarity with TensorRT, ONNX, Triton, tinygrad, or similar inference engines
  • Strong proficiency in PyTorch and Python
  • Deep understanding of model architectures (transformers, convs, RNNs, attention, diffusion blocks)
  • Experience profiling GPU workloads using Nsight, nvprof, or similar tools
  • Strong problem‑solving abilities with a performance‑first mindset

Great to Have

  • Experience with quantization (INT8, FP8, hybrid formats)
  • Experience with audio/speech models (ASR, TTS, SSL, vocoders)
  • Contributions to open‑source GPU stacks or inference runtimes
  • Published work related to systems‑level model optimization

Who Will Succeed in This Role

Someone who:

  • thinks in kernels, not just layers
  • knows which optimizations are merely theoretical and which have practical impact
  • understands GPU boundaries (memory, bandwidth, latency) and how to work around them
  • is excited by the challenge of ultra‑low latency and large‑scale real‑time inference
  • loves debugging at the CUDA + model level