Software Engineer, Infrastructure
As a Software Engineer, Infrastructure at FAL, you will develop and maintain the software and processes that keep a large fleet of GPU servers healthy and productive. The role involves managing thousands of servers through provisioning, health monitoring, error detection, and recovery, and collaborating with partners to resolve issues that automation cannot handle.
Your key responsibilities will include building and maintaining a Python-based fleet tracking system that manages the full lifecycle of servers, automating server provisioning and health checks, and creating metrics and dashboards for hardware health monitoring. You will also use AI for tool development and automation, implement OS-level security measures, manage distributed and local storage systems, and tune Linux systems for AI workloads. Finally, you will develop automated error detection and recovery processes and work with partners to address technical issues.
The ideal candidate will have more than three years of experience managing large-scale server fleets, strong software engineering skills in Python, and deep knowledge of Linux systems. They will also have experience with configuration management and infrastructure-as-code tools, a solid understanding of storage technologies, familiarity with hardware diagnostics, and experience building internal tools for infrastructure visibility. Excellent communication skills and a proactive, ownership-driven approach to work are essential.
FAL offers a competitive compensation package of $180,000 to $250,000, plus equity and benefits. Additional perks include relocation assistance to San Francisco; health, dental, and vision insurance; and regular team events and offsites.
Joining FAL is an opportunity to take on challenging work with ample room to learn and grow. The company fosters a collaborative environment that encourages innovation and continuous improvement, making it a strong fit for professionals seeking to advance their careers in infrastructure engineering.