As a multimodal engineer on the Imagine Model Team at xAI, you will develop cutting-edge AI experiences beyond text, focusing on high-fidelity understanding and generation across image and video modalities, while incorporating audio to enhance visual content. Your work will span data curation, modeling, training, inference serving, and product integration, covering both pretraining and post-training phases. Collaborating closely with product teams, you will push model frontiers and deliver exceptional end-to-end user experiences.

Key responsibilities include creating and driving engineering agendas to advance multimodal capabilities, emphasizing image and video generation, editing, understanding, controllable/long-horizon synthesis, agentic planning, reinforcement learning training, and world simulation, including audio integration for richer video experiences. You will improve data quality through annotation, filtering, augmentation, synthetic generation, captioning, and in-depth data studies, particularly for visual and audio data. Additionally, you will design evaluation frameworks, metrics, benchmarks, and reward models tailored to image, video, and audio quality and coherence. You will also implement efficient algorithms for state-of-the-art model performance, including real-time inference, distillation, and scalable serving for visual content, and develop scalable data collection and processing pipelines for multimodal datasets. Collaborating cross-functionally to integrate AI solutions into production and iterating rapidly based on user feedback are essential.

Required qualifications include a track record of leading studies that significantly improve neural network capabilities and performance through better data or modeling; experience in data-driven experiment design, systematic analysis, and iterative model debugging; and experience developing or working with large-scale distributed machine learning systems. You must also be able to deliver optimal end-to-end user experiences and be a hands-on contributor with initiative, excellence, a strong work ethic, prioritization skills, and excellent communication.

Preferred skills include experience in supervised fine-tuning, reinforcement learning, evaluations, human or synthetic data collection, or agentic systems. Proficiency in Python, JAX/XLA, PyTorch, Rust/C++, Spark, Ray, and related large-scale frameworks is advantageous. Domain expertise in multimodal applications such as graphics engines, rendering techniques, image and video understanding and generation, world models, real-time simulation, or controllable/long-horizon visual content creation is beneficial. Experience with agentic reinforcement learning training, controllable/long-horizon generation, or multimodal agents that reason and act across modalities, especially in visual domains, is also preferred.

Compensation for this role ranges from $180,000 to $440,000 USD. The total rewards package at xAI includes equity; comprehensive medical, vision, and dental coverage; access to a 401(k) retirement plan; short- and long-term disability insurance; life insurance; and various other discounts and perks.

Member of Technical Staff - Imagine Model
