LLM Pre-training & Distributed Systems Engineer (AI Infrastructure)
Hyphen Connect is seeking a highly skilled LLM Pre-training & Distributed Systems Engineer to join our AI Infrastructure team. In this role you will orchestrate large-scale machine learning training runs and optimize the distributed infrastructure that supports our advanced AI initiatives.
As an LLM Pre-training & Distributed Systems Engineer, you will orchestrate distributed training runs across 1,000+ GPUs using frameworks such as PyTorch, DeepSpeed, or Megatron-LM. Your responsibilities will include optimizing networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors, as well as automating checkpointing and failure recovery during month-long training runs.
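To give candidates a flavor of the checkpointing and failure-recovery work described above, here is a minimal framework-agnostic sketch. The file layout, `train` loop, and helper names are illustrative assumptions, not Hyphen Connect's actual stack; in production the state would be model and optimizer shards, not a small JSON blob.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write the checkpoint atomically: temp file + rename, so a crash
    mid-write can never corrupt the last good checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX filesystems

def load_checkpoint(path):
    """Resume from the most recent checkpoint if one exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}

def train(path, total_steps, ckpt_every=100):
    """Toy training loop: after a restart, load_checkpoint lets the run
    pick up from the last saved step instead of starting over."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["step"] += 1  # stand-in for a real optimizer step
        if state["step"] % ckpt_every == 0:
            save_checkpoint(path, state)
    return state["step"]
```

On a month-long run, the same pattern applies at cluster scale: every rank (or a designated writer) persists sharded state at intervals, and the job scheduler relaunches failed workers into the resume path.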
The ideal candidate will possess deep expertise in 3D parallelism (Data, Tensor, Pipeline) and experience managing SLURM or Kubernetes-based GPU clusters. A strong systems engineering background with proficiency in C++, CUDA, and Python is essential for success in this role.
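As a flavor of the tensor-parallel axis of 3D parallelism, the toy sketch below shards a weight matrix column-wise across simulated ranks, in the style Megatron-LM uses for MLP layers. It is pure Python with a single process standing in for multiple GPUs, and the helper names are illustrative; in a real cluster the concatenation step would be an all-gather over NCCL.

```python
def matmul(x, w):
    """Naive matrix multiply: x is (m x k), w is (k x n), both lists of rows."""
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

def split_columns(w, ranks):
    """Shard the weight matrix column-wise, one shard per rank."""
    per = len(w[0]) // ranks
    return [[row[r * per:(r + 1) * per] for row in w] for r in range(ranks)]

def column_parallel_matmul(x, w, ranks):
    """Each rank multiplies the full input by its own column shard;
    concatenating the partial outputs reproduces the unsharded result."""
    partials = [matmul(x, shard) for shard in split_columns(w, ranks)]
    # Concatenate each output row across ranks (all-gather in a real job).
    return [sum((p[i] for p in partials), []) for i in range(len(x))]
```

The other two axes compose with this: data parallelism replicates the sharded model over batches, and pipeline parallelism assigns contiguous layers to different groups of ranks.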
Hyphen Connect offers a competitive compensation package, including benefits and perks designed to support our employees' well-being and professional growth.
Joining Hyphen Connect means becoming part of a dynamic team dedicated to advancing AI infrastructure. We provide opportunities for professional development and encourage innovation in a collaborative environment.