Thinking Machines Lab is seeking a Research Engineer to design and build core systems that enable scalable, efficient training of large models for deployment and research. This role is crucial in ensuring that experimentation and training processes are fast and reliable, allowing research teams to focus on scientific advancements without system bottlenecks. The ideal candidate will possess deep systems and performance expertise, coupled with a curiosity for machine learning at scale.

Key responsibilities include designing, implementing, and optimizing distributed training systems that scale across thousands of GPUs and nodes for large-scale training workloads. The role involves developing high-performance optimizations to maximize throughput and efficiency, as well as creating reusable frameworks and libraries to improve training reproducibility, reliability, and scalability for new model architectures. Establishing standards for reliability, maintainability, and security to ensure systems are robust under rapid iteration is also essential. Collaboration with researchers and engineers to build scalable infrastructure is a key aspect of the position.

Candidates should have a bachelor's degree or equivalent experience in computer science, electrical engineering, statistics, machine learning, physics, robotics, or a related field. Strong engineering skills are required, with the ability to contribute performant, maintainable code and debug complex codebases. An understanding of deep learning frameworks such as PyTorch and JAX, along with their underlying system architectures, is necessary. The role requires thriving in a highly collaborative environment involving various cross-functional partners and subject matter experts, as well as a proactive mindset to take initiative across different stacks and teams to ensure successful project delivery.

The position is based in San Francisco, California, with an expected annual salary range of $350,000 to $475,000 USD, depending on background, skills, and experience. Thinking Machines Lab offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed. Visa sponsorship is available, and for the right candidate the company commits to working through the visa process together.

Thinking Machines Lab is an artificial intelligence research and product company dedicated to empowering humanity through advancing collaborative general intelligence. The company is building a future where everyone has access to the knowledge and tools to make AI work for their unique needs and goals. The team comprises scientists, engineers, and builders who have created widely used AI products, including ChatGPT and Character.ai, open-weight models like Mistral, and popular open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

Research Engineer, Infrastructure, Training Systems
