Operations Engineer, Fleet Reliability
As an Operations Engineer at fal, you will play a critical role in maintaining and enhancing the reliability of our GPU clusters, which are integral to our generative media ecosystem. This position involves hands-on management of hardware and software components to ensure optimal performance and uptime.
Your primary responsibilities will include provisioning, validating, and troubleshooting GPU nodes across B300, H200, and H100 clusters. You will address hardware and software issues in compute, network, and storage systems, monitor fleet health, and take remediation actions as necessary. You will also develop and refine runbooks to streamline operational processes.
The ideal candidate will have experience administering Linux systems in critical environments, troubleshooting GPU node issues involving NVLink, NCCL, InfiniBand (IB), drivers, and firmware, and using observability systems such as Grafana and Prometheus. Proficiency in scripting or programming in Bash, Python, or Go is also required.
This role offers the opportunity to work in a dynamic and innovative environment and to contribute to the development of cutting-edge AI products. You will be part of a team that values curiosity, problem-solving, and automation, with ample opportunities for professional growth.