Hyphen Connect is seeking an LLM Pre-training & Distributed Systems Engineer to join our AI Infrastructure team in the San Francisco Bay Area. The role centers on running large-scale machine learning training jobs and optimizing the distributed infrastructure that supports our AI initiatives.

The successful candidate will orchestrate distributed training runs across 1,000+ GPUs using frameworks such as PyTorch, DeepSpeed, or Megatron-LM. Key duties include optimizing networking (InfiniBand/RDMA), managing memory to prevent out-of-memory errors, and automating checkpointing and failure recovery during long training runs.

Applicants must possess deep expertise in 3D parallelism, encompassing data, tensor, and pipeline parallelism. Experience managing SLURM or Kubernetes-based GPU clusters is essential. A strong background in systems engineering, with proficiency in C++, CUDA, and Python, is also required.

Hyphen Connect offers a dynamic work environment at the forefront of AI and machine learning. Employees have the opportunity to work on innovative projects with a team of dedicated professionals, fostering both personal and professional growth.

LLM Pre-training & Distributed Engineer (AI Infrastructure)
