Research Engineer, Infrastructure, Numerics
Thinking Machines Lab is seeking a Research Engineer specializing in Infrastructure and Numerics to join our team in San Francisco. Our mission is to empower humanity by advancing collaborative general intelligence, making AI accessible for diverse needs and goals. Our team comprises scientists and engineers who have contributed to widely used AI products and open-source projects.
In this role, you will design and optimize distributed training infrastructure for large-scale language models, focusing on performance, stability, and reproducibility across multi-GPU and multi-node setups. Responsibilities include implementing low-precision numerics that improve efficiency without compromising model quality; developing kernels and communication primitives that use hardware-level support for mixed and low-precision arithmetic; and collaborating with research teams to co-design model architectures and training recipes that fit emerging numeric formats and stability constraints.
The ideal candidate holds a Bachelor's degree or equivalent experience in computer science, electrical engineering, statistics, machine learning, physics, robotics, or a related field. A strong understanding of deep learning frameworks such as PyTorch or JAX and their underlying system architectures is essential. Candidates should thrive in a collaborative environment, possess a proactive mindset, and demonstrate strong engineering skills, including the ability to contribute performant, maintainable code and debug complex codebases in areas like floating-point numerics, low-precision arithmetic, and distributed systems.
Preferred qualifications include familiarity with distributed frameworks such as PyTorch/XLA, DeepSpeed, or Megatron-LM; experience implementing FP8, INT8, or block floating-point formats; prior contributions to open-source deep learning infrastructure; publications or projects related to numerical optimization or systems for large models; experience training large-scale AI models; and a track record of improving research productivity through infrastructure design or process improvements.
This position is based in San Francisco, California. The expected annual salary range for this role is $350,000 to $475,000 USD, depending on background, skills, and experience. We sponsor visas and are committed to working through the visa process for the right candidate.