Synthetic Data Engineer (AI Data/Training)
Hyphen Connect is seeking a Synthetic Data Engineer to design and implement domain-specific synthetic data generation pipelines. The role is central to maintaining the quality of the data that feeds the organization's training loops.
Key responsibilities include designing synthetic data generation pipelines using self-instruct and constitutional prompting, implementing automated quality scoring and de-duplication systems, and managing data pipelines that feed directly into supervised fine-tuning (SFT) and direct preference optimization (DPO) training loops.
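For candidates who want a concrete picture of the quality scoring and de-duplication work, the following is a minimal sketch in Python and not a description of Hyphen Connect's actual systems. The `Sample` type, the `quality_score` heuristic, and the `min_score` threshold are all hypothetical and chosen for illustration. A production pipeline would more likely use near-duplicate detection and a learned or model-based scorer.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical instruction/response pair produced by a generation step."""
    instruction: str
    response: str


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies match.
    return re.sub(r"\s+", " ", text.strip().lower())


def fingerprint(sample: Sample) -> str:
    # Exact-match fingerprint over the normalized pair (no fuzzy matching).
    key = normalize(sample.instruction) + "\x1f" + normalize(sample.response)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def quality_score(sample: Sample) -> float:
    # Assumed heuristic: reward responses that are reasonably long and
    # not dominated by repeated words. Returns a value in [0, 1].
    words = sample.response.split()
    if not words:
        return 0.0
    length_score = min(len(words) / 20.0, 1.0)
    diversity = len({w.lower() for w in words}) / len(words)
    return 0.5 * length_score + 0.5 * diversity


def filter_samples(samples, min_score=0.5):
    # Keep the first copy of each fingerprint, then apply the score threshold.
    seen = set()
    kept = []
    for sample in samples:
        fp = fingerprint(sample)
        if fp in seen:
            continue
        seen.add(fp)
        if quality_score(sample) >= min_score:
            kept.append(sample)
    return kept
```

The samples that pass this filter would then be serialized into whatever format the downstream SFT or DPO training job expects.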
The ideal candidate will have proven experience building large-scale data pipelines with tools such as Airflow, Spark, or Ray. Deep knowledge of prompt engineering for data generation and familiarity with dataset distillation and bias mitigation are also required.
Hyphen Connect offers a collaborative work environment focused on innovation in AI data and training. Employees have opportunities for professional growth and development within a dynamic team.