Site Reliability Engineer (SRE)
Thinking Machines Lab is seeking a Site Reliability Engineer (SRE) to enhance the reliability and resilience of Tinker, our fine-tuning API that enables researchers and developers to customize advanced AI models to their specific needs. As part of our mission to make AI systems more accessible and customizable, the SRE will collaborate closely with platform engineers and research teams to ensure robust system performance.
In this role, you will define and manage end-to-end reliability processes, including continuous integration and deployment (CI/CD) workflows, production monitoring, and incident response. Key responsibilities include defining service level objectives (SLOs) for distributed training systems, designing comprehensive monitoring solutions, leading incident response to ensure rapid recovery, and improving multi-tenant isolation and resource scheduling to optimize utilization without compromising reliability or data separation.
Candidates should hold a bachelor's degree in computer science, engineering, or a related field, or have equivalent experience. Required qualifications include experience with distributed systems, cloud infrastructure, or site reliability engineering; proficiency in writing software that addresses reliability challenges through tooling and automation; experience with production incident response and systematic reliability improvements; and strong communication skills with a proven track record of cross-team coordination. Preferred qualifications include deep experience operating production cloud services at scale, familiarity with distributed training frameworks, experience building checkpoint and recovery systems for long-running distributed jobs, and expertise in managing Kubernetes clusters that handle heterogeneous GPU workloads.
This position is based in San Francisco, California, with an expected annual salary range of $350,000 to $475,000, depending on background, skills, and experience. Thinking Machines Lab offers generous health, dental, and vision benefits, unlimited paid time off, paid parental leave, and relocation support as needed. We also sponsor visas and are committed to working through the visa process with qualified candidates.