Fal is seeking an experienced Site Reliability Engineer to ensure the reliability and availability of its customer-facing systems, including Kubernetes clusters, deployment pipelines, and networking infrastructure. As a key member of the team, you will play a pivotal role in maintaining and enhancing the performance of Fal's generative media ecosystem, which powers the next generation of AI products.

In this role, you will own and operate Kubernetes infrastructure, including cluster lifecycle management, upgrades, networking, and multi-tenant isolation for customer workloads. You will build and maintain CI/CD pipelines and deployment infrastructure, leverage AI to automate analysis and resolution of production issues, and improve software development speed, reliability, and maintainability. Additionally, you will build dashboards, set up alerting and anomaly detection systems, define and enforce Service Level Objectives (SLOs), and develop incident response processes. You will also manage and improve networking, load balancing, and service mesh configurations, and drive reliability improvements through automation, runbooks, and chaos engineering.

The ideal candidate will have over 5 years of experience managing critical production systems and software development workflows. Strong production experience in setting up and operating Kubernetes at scale using infrastructure-as-code tools like Terraform and Ansible is essential. Deep knowledge of Linux networking, container networking (CNI plugins, VXLAN, BGP), and DNS is required. Experience in building CI/CD systems and GitOps workflows (FluxCD, ArgoCD), proficiency in Python and either Go or Bash for tooling and automation, and strong experience with logging, monitoring, and alerting tools (Prometheus, Grafana, Loki, Thanos, VictoriaMetrics, Datadog) are also necessary. Excellent communication skills and the ability to drive technical decisions across teams are important, along with a self-starter attitude, quick execution, ownership, and a constant drive for improvement.

Fal offers a competitive compensation package ranging from $180,000 to $250,000, plus equity and benefits. The company provides interesting and challenging work and numerous learning and growth opportunities, and is currently hiring in downtown San Francisco. Relocation assistance to San Francisco is available, along with health, dental, and vision insurance in the U.S. and regular team events and offsites.

Joining Fal means becoming part of a dynamic team that is shaping the future of AI and generative media. The company fosters a collaborative and innovative culture, offering significant opportunities for professional growth and development. If you are passionate about building reliable systems and want to make a substantial impact in a rapidly evolving industry, this role provides an excellent platform to do so.

Software Engineer, Site Reliability
