Member of Technical Staff - Compute Infrastructure
The Compute Infrastructure team at xAI designs, builds, and operates massive-scale clusters and orchestration platforms that power frontier AI training, inference, and agent workloads. In this role, you will push container orchestration beyond existing systems like Kubernetes, manage exascale compute resources, optimize high-performance training runs and production serving, and collaborate closely with research and systems teams to deliver reliable, ultra-scalable infrastructure for xAI's next-generation models and applications.
Key responsibilities:
- Build and manage massive-scale clusters that host, persist, train, and serve AI workloads with extreme reliability and performance.
- Design, develop, and extend an in-house container orchestration platform with superior scalability, isolation, resource efficiency, and fault tolerance compared to off-the-shelf solutions.
- Collaborate with research teams to architect and optimize compute clusters for large-scale training runs, inference services, and real-time applications.
- Profile, debug, and resolve complex system-level performance bottlenecks, resource contention, scheduling issues, and reliability problems across the full stack.
- Own end-to-end infrastructure initiatives with first-principles design, rigorous testing, automation, and continuous optimization to support frontier AI compute demands.
The ideal candidate will have:
- Deep expertise in virtualization technologies (KVM, Xen, QEMU) and advanced containerization/sandboxing (Kata, Firecracker, gVisor, Sysbox, or equivalent).
- Strong proficiency in systems programming languages such as C/C++ and Rust.
- A proven track record of profiling, debugging, and optimizing complex system-level performance issues, with deep knowledge of Linux kernel internals, resource management, scheduling, memory management, and low-level engineering.
- Hands-on experience building or significantly enhancing distributed compute platforms, orchestration systems, or high-performance infrastructure at scale.
- The ability to thrive in a fast-paced, meritocratic environment with full ownership, high standards, and a focus on rigorous execution.
Preferred qualifications:
- Experience in Linux kernel development, hypervisor extensions, or low-level systems programming for compute-intensive workloads.
- A proven track record operating or designing large-scale AI training/inference clusters (GPU/TPU scale).
- Experience with custom runtimes, isolation techniques, or bespoke platforms for specialized AI compute.
- Familiarity with performance tools, tracing, and debugging in production distributed environments.
The base salary for this position ranges from $180,000 to $440,000 USD. In addition to base salary, xAI offers a comprehensive total rewards package, including equity; comprehensive medical, vision, and dental coverage; access to a 401(k) retirement plan; short- and long-term disability insurance; life insurance; and various other discounts and perks.