Principal Reliability Scientist
Graphcore is seeking a Principal Reliability Scientist to lead reliability activities across its complex, high-performance AI systems. This role is integral to the Quality team within Manufacturing Operations, which ensures product robustness and lifecycle performance across Graphcore's hardware portfolio. The team collaborates closely with technology research, chip, board, system design, platform, and operations teams to translate reliability insights into actionable improvements.
The Principal Reliability Scientist will define and refine reliability requirements at the silicon, board, and system levels in partnership with research and design teams, and will apply advanced reliability methodologies to innovative systems, including the challenges posed by liquid-cooled architectures and fluid dynamics. Responsibilities include designing and executing experiments to generate high-quality reliability and performance data; analyzing experimental, field, and manufacturing data to quantify reliability metrics such as mean time between failures (MTBF), mean time to repair (MTTR), reliability, availability, and serviceability (RAS) characteristics, and soft error rates; and collaborating with design teams to influence architecture and component selection based on reliability considerations.
Candidates should have a strong background in reliability engineering or reliability science within semiconductor, hardware, or complex systems environments. Experience with physics-of-failure approaches in high-performance computing, AI hardware, or related domains is essential. Proficiency in reliability modeling, experimental design, and statistical data analysis is required, along with the ability to interpret experimental reliability data to drive engineering decisions. Familiarity with key reliability metrics and methods, such as MTBF, MTTR, RAS, and failure rate analysis, is also necessary.
Preferred qualifications include experience with liquid cooling systems, fluid dynamics, or thermally complex hardware environments; knowledge of soft error mechanisms and soft error rate (SER) modeling; and experience contributing to reliability strategy, processes, or tooling improvements.
Graphcore offers a dynamic work environment with opportunities for professional growth. The company values collaboration and continuous learning, providing employees with the chance to work on cutting-edge technology in the AI and machine learning space.