Network Engineer - AI/HPC
xAI is seeking a Network Engineer specializing in AI and High-Performance Computing (HPC) to join our team. You will help build and optimize large-scale GPU clusters in support of our mission to develop AI systems that accurately understand the universe and aid humanity in its pursuit of knowledge. Our team values engineering excellence, curiosity, and a hands-on approach within a flat organizational structure.
In this role, you will develop and optimize RoCEv2 networks at hyperscale, ensuring peak performance and availability. Day to day, you will work closely with NCCL, build metric dashboards, and tune configurations to maximize performance. You will also design the next generation of our backend and frontend networks so that GPU infrastructure can expand with minimal engineering intervention. The position requires significant travel to Memphis for capacity building, participation in an on-call rotation, and involvement in scaling and maintenance efforts.
The ideal candidate will have at least 10 years of experience designing and operating large-scale networks, including a minimum of 5 years in the Ethernet AI/HPC domain. A deep understanding of Ethernet congestion control is essential, and knowledge of InfiniBand is a plus. Proficiency with AI training and inference workloads, including the ability to use and debug NCCL, is required. Expertise in developing performance and operations metrics to optimize training and inference traffic is crucial. Experience using Python to automate repetitive tasks and analyze large datasets is also necessary.
Compensation for this role includes a base salary ranging from $180,000 to $440,000, complemented by equity options. Our comprehensive benefits package includes medical, vision, and dental coverage, access to a 401(k) retirement plan, short- and long-term disability insurance, life insurance, and various other discounts and perks.
At xAI, we foster a culture that encourages people to challenge themselves and thrive on curiosity. We operate with a flat organizational structure, where all employees are expected to be hands-on and contribute directly to the company's mission. Leadership is awarded to those who demonstrate initiative and consistently deliver excellence. Strong communication skills are valued because they enable concise and accurate knowledge sharing among teammates.