LLM Pre-training & Distributed Systems Engineer (AI Infrastructure)
Hyphen Connect is seeking a highly skilled LLM Pre-training & Distributed Systems Engineer to join our AI Infrastructure team. In this role you will orchestrate large-scale machine learning training runs and optimize the distributed infrastructure that supports our advanced AI initiatives.
As an LLM Pre-training & Distributed Systems Engineer, you will orchestrate distributed training runs across 1,000+ GPUs using frameworks such as PyTorch, DeepSpeed, or Megatron-LM. Your responsibilities will include optimizing networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors, as well as automating checkpointing and failure recovery during month-long training runs.
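To give candidates a flavor of the checkpointing and failure-recovery work described above, here is a minimal framework-agnostic sketch. The file layout, `train` loop, and helper names are illustrative assumptions, not Hyphen Connect's actual stack; in production the state would be model and optimizer shards, not a small JSON blob.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write the checkpoint atomically: temp file + rename, so a crash
    mid-write can never corrupt the last good checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX filesystems

def load_checkpoint(path):
    """Resume from the most recent checkpoint if one exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}

def train(path, total_steps, ckpt_every=100):
    """Toy training loop: after a restart, load_checkpoint lets the run
    pick up from the last saved step instead of starting over."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["step"] += 1  # stand-in for a real optimizer step
        if state["step"] % ckpt_every == 0:
            save_checkpoint(path, state)
    return state["step"]
```

On a month-long run, the same pattern applies at cluster scale: every rank (or a designated writer) persists sharded state at intervals, and the job scheduler relaunches failed workers into the resume path.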
The ideal candidate will possess deep expertise in 3D parallelism (Data, Tensor, Pipeline) and experience managing SLURM or Kubernetes-based GPU clusters. A strong systems engineering background with proficiency in C++, CUDA, and Python is essential for success in this role.
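As a flavor of the tensor-parallel axis of 3D parallelism, the toy sketch below shards a weight matrix column-wise across simulated ranks, in the style Megatron-LM uses for MLP layers. It is pure Python with a single process standing in for multiple GPUs, and the helper names are illustrative; in a real cluster the concatenation step would be an all-gather over NCCL.

```python
def matmul(x, w):
    """Naive matrix multiply: x is (m x k), w is (k x n), both lists of rows."""
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

def split_columns(w, ranks):
    """Shard the weight matrix column-wise, one shard per rank."""
    per = len(w[0]) // ranks
    return [[row[r * per:(r + 1) * per] for row in w] for r in range(ranks)]

def column_parallel_matmul(x, w, ranks):
    """Each rank multiplies the full input by its own column shard;
    concatenating the partial outputs reproduces the unsharded result."""
    partials = [matmul(x, shard) for shard in split_columns(w, ranks)]
    # Concatenate each output row across ranks (all-gather in a real job).
    return [sum((p[i] for p in partials), []) for i in range(len(x))]
```

The other two axes compose with this: data parallelism replicates the sharded model over batches, and pipeline parallelism assigns contiguous layers to different groups of ranks.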
Hyphen Connect offers a competitive compensation package, including benefits and perks designed to support our employees' well-being and professional growth.
Joining Hyphen Connect means becoming part of a dynamic team dedicated to advancing AI infrastructure. We provide opportunities for professional development and encourage innovation in a collaborative environment.