LLM Pre-training & Distributed Systems Engineer (AI Infrastructure)
Hyphen Connect is seeking a highly skilled LLM Pre-training & Distributed Systems Engineer to join our AI Infrastructure team. This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure to support our advanced AI initiatives.
The successful candidate will be responsible for orchestrating distributed training runs across 1,000+ GPUs using frameworks such as PyTorch, DeepSpeed, or Megatron-LM. Key tasks include optimizing networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors, as well as automating checkpointing and failure recovery during extended training periods.
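To illustrate the checkpointing and failure-recovery automation this role involves, here is a minimal, framework-agnostic sketch. All function names and the simulated-failure hook are hypothetical; a real run would save framework-specific state (model, optimizer, data-loader position) rather than a plain dict, but the atomic-write-then-resume pattern is the same.

```python
import os
import pickle

def save_checkpoint(state, path):
    """Atomically persist state: write to a temp file, then rename,
    so a crash mid-write never corrupts the last good checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX filesystems

def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

def train(total_steps, ckpt_path, ckpt_every=100, fail_at=None):
    """Resume from the latest checkpoint and continue training.
    `fail_at` simulates a node failure to demonstrate recovery."""
    state = load_checkpoint(ckpt_path) or {"step": 0, "loss_sum": 0.0}
    step = state["step"]
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        state["loss_sum"] += 1.0 / (step + 1)  # stand-in for a real update
        step += 1
        state["step"] = step
        if step % ckpt_every == 0:
            save_checkpoint(state, ckpt_path)
    save_checkpoint(state, ckpt_path)
    return state
```

On a restart after failure, the job simply resumes from the last durable checkpoint and loses at most `ckpt_every` steps of work, which is the property long multi-week runs depend on.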
Applicants should possess deep expertise in 3D parallelism, encompassing data, tensor, and pipeline parallelism. Experience managing GPU clusters using SLURM or Kubernetes is essential. A strong background in systems engineering, with proficiency in C++, CUDA, and Python, is also required.
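The 3D-parallelism expertise above rests on rank-layout arithmetic: every GPU's flat global rank must map to a (data, pipeline, tensor) coordinate that determines which communication groups it joins. The sketch below is hypothetical (function names are ours) and assumes one common Megatron-style convention, with tensor-parallel ranks innermost so they land on the same node's fast interconnect:

```python
def rank_to_3d(rank, tp_size, pp_size, dp_size):
    """Map a flat global rank to (dp, pp, tp) coordinates, assuming
    tensor-parallel ranks vary fastest, then pipeline, then data."""
    assert 0 <= rank < tp_size * pp_size * dp_size
    tp = rank % tp_size
    pp = (rank // tp_size) % pp_size
    dp = rank // (tp_size * pp_size)
    return dp, pp, tp

def world_size(tp_size, pp_size, dp_size):
    """Total GPU count is the product of the three parallelism degrees."""
    return tp_size * pp_size * dp_size
```

For example, 8-way tensor x 4-way pipeline x 32-way data parallelism yields a 1,024-GPU job, matching the cluster scale described in this role; ranks 0-7 then form one tensor-parallel group sharing a node.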
Hyphen Connect offers a dynamic work environment focused on cutting-edge AI technologies. Employees benefit from opportunities for professional growth, collaboration with industry experts, and the chance to contribute to groundbreaking projects in the AI infrastructure domain.