Principal Hardware Diagnostics Engineer
Graphcore is seeking a Principal Hardware Diagnostics Engineer to design and develop diagnostics software for monitoring hardware health and diagnosing system-level issues across its AI infrastructure platforms. This role involves building diagnostics agents, tools, and analytics frameworks that enable engineers and automation systems to identify, isolate, and resolve hardware issues across blade-level servers and rack-scale clusters.
Key responsibilities include designing and developing automated hardware diagnostics solutions for blade-level servers and rack-scale AI systems, architecting and implementing diagnostic agents, monitoring tools, and analytics frameworks to track hardware telemetry, and collaborating with hardware teams to integrate low-level diagnostic modules into monitoring systems. The engineer will also develop diagnostics tools capable of detecting hardware health conditions and isolating failures, create diagnostic modules used for internal validation and production data center operations, and provide detailed hardware fault information to system engineers to accelerate troubleshooting. Additionally, the role involves defining remediation workflows and providing insights for hardware fault scenarios across nodes and clusters, and collaborating with firmware, networking, and cloud platform teams to integrate diagnostics across the system stack.
The ideal candidate will have a Bachelor's, Master's, or PhD in Computer Science, Computer Engineering, or a related discipline, along with strong software engineering experience in Python, C++, or C#. Experience developing diagnostics or monitoring systems for hardware platforms, experience working with distributed systems or cloud infrastructure, and strong knowledge of Linux environments and system-level diagnostics tools are essential. The candidate should also have experience collaborating with CM/ODM partners on manufacturing diagnostics and fault isolation, strong analytical and debugging skills, and excellent communication and collaboration abilities.
Desirable qualifications include experience working with AI hardware platforms or accelerator-based computing systems, familiarity with hyperscale data center infrastructure, experience building cluster-level monitoring or diagnostics systems, and experience interacting with internal or external customers during diagnostics solution development.
Graphcore offers a competitive benefits package and is committed to building an inclusive work environment that makes it a great home for everyone. The company provides a flexible approach to interviews and encourages candidates to discuss any reasonable adjustments they may require.