Site Reliability Engineer (SRE)

🇬🇧 London, United Kingdom
Posted 14 months ago
Expires June 9, 2026
Full Time · On-site · Engineering

As a Site Reliability Engineer (SRE) at xAI, you will join a dedicated team responsible for the backend services that power our products, including and its API. Our mission is to create AI systems that accurately understand the universe and aid humanity in its pursuit of knowledge. We are a small, highly motivated team focused on engineering excellence, operating with a flat organizational structure where all members are hands-on contributors.

In this role, you will focus on writing and maintaining highly scalable and reliable services capable of efficiently processing tens of thousands of queries per second. These services are hosted across multiple Kubernetes clusters, both on-premises and in the cloud. Your day-to-day responsibilities will include managing these clusters, ensuring system reliability, and optimizing performance to meet the demands of our products.

We are seeking candidates with expert knowledge of Kubernetes, continuous deployment systems such as Buildkite and ArgoCD, and monitoring and alerting tools like Prometheus, Grafana, and PagerDuty. Proficiency in infrastructure-as-code tools such as Pulumi or Terraform is essential. Familiarity with systems programming languages like Rust, C++, or Go, as well as experience with traffic management and HTTP proxies such as NGINX and Envoy, is also required.

At xAI, we value individuals who appreciate challenging themselves and thrive on curiosity. We expect all employees to have strong communication skills, enabling them to concisely and accurately share knowledge with their teammates. Leadership is given to those who show initiative and consistently deliver excellence.
