Senior Systems Engineer – Performance & Reliability
The Senior Systems Engineer – Performance & Reliability at Graphcore is responsible for evaluating and ensuring the performance and reliability of large-scale Linux systems, ranging from single racks to data center-scale infrastructure. The role involves designing and executing measurements that show how these systems behave and whether they can be trusted for production deployment.
Key responsibilities include developing and running measurements on distributed systems, analyzing performance and reliability data, and improving the systems that execute these measurements at scale. Day-to-day work includes running measurements on large-scale Linux clusters with tools such as pytest, assessing the performance of compute, network, and machine learning workloads, and analyzing the variability and repeatability of results.
The ideal candidate should have strong software engineering experience across multiple projects or systems over several years, proficiency in Python, and experience working in Linux-based environments, particularly with distributed or high-performance systems. Familiarity with automation and CI/CD systems (e.g., GitLab CI, Jenkins, GitHub Actions) is essential. The candidate should be capable of designing, implementing, and running experiments that yield meaningful results, interpreting these results accurately, and communicating findings effectively to support decision-making.
Graphcore offers a competitive salary, flexible working arrangements, a generous annual leave policy, private medical insurance, a health cash plan, a dental plan, pension matching up to 5%, life assurance, and income protection. Additional benefits include a generous parental leave policy, an employee assistance program covering health and mental well-being, and a range of healthy food and snacks at the central Bristol office, which also features an in-house barista bar.
Graphcore is committed to building an inclusive work environment that values diversity and encourages individuals from different backgrounds and experiences to apply. The company fosters a culture where everyone has the opportunity to make an impact on the company, its products, and the future of artificial intelligence.