Senior Software Engineer, Observability Insights
CoreWeave is seeking a Senior Software Engineer to lead the Observability Insights initiative, focusing on developing product experiences and agentic interfaces atop the company's foundational telemetry layer. This role is pivotal in enabling both CoreWeave and its customers to understand, troubleshoot, and optimize complex AI systems by delivering core components such as multi-tenant APIs, managed Grafana experiences, and MCP-based tool servers. The engineer will collaborate closely with product managers and engineering leadership to shape the end-to-end observability experience, significantly influencing how users interact with cutting-edge AI infrastructure.
Key responsibilities include designing and developing highly available, multi-tenant APIs that expose telemetry and derived insights in a developer-centric manner. The role involves modernizing user interactions with data by building agentic experiences, including MCP servers, agentic tools, and API gateways that safely expose foundational telemetry. Additionally, the engineer will build agentic observability capabilities to enable workflows for guided debugging, workload optimization, and incident summarization, empowering both internal teams and customers. Maintaining the health of telemetry data pipelines, focusing on correlation primitives and aggregation services for root cause analysis and performance detection, is also a critical aspect of the role. The engineer will work to improve the performance, security, reliability, and scalability of insights services, including SLO ownership and latency optimization, while participating in the team's on-call rotation. Collaboration with internal engineering teams to embed observability best practices and custom tooling into their systems is essential, as is contributing to the overall observability strategy and influencing the direction of the platform.
The ideal candidate will have six or more years of experience in software or infrastructure engineering, with a focus on building production-grade backend systems and distributed APIs. A customer-obsessed mindset is crucial, with a preference for adopting a product lens when building developer-facing surfaces like SDKs and CLIs. Proficiency in reliability engineering concepts, including evaluation datasets for LLMs, error budgets for platform services, and fault-tolerant design for multi-tenant systems, is required. Familiarity with observability systems such as ClickHouse, Loki, VictoriaMetrics, Prometheus, and Grafana is important. Experience in building agentic applications or LLM features, with a pragmatic approach to grounding, tool calling, and operational safety, is highly valued. The candidate should be comfortable using Go as the primary programming language and able to work with Python components when the agentic layers require it. The candidate is expected to work with a passionate team of engineers in an iterative, high-trust agile environment to ensure the collection-to-insights pipeline functions end-to-end.
Preferred qualifications include experience operating Kubernetes clusters at scale with the ability to debug real-world AI workloads. Hands-on experience with logging, tracing, and metrics platforms in production and at scale, with a deep understanding of cardinality, indexing, and query performance, is advantageous. Experience running distributed systems or API services at cloud scale, including event streaming or data pipeline management, is beneficial. Familiarity with building services or products with LLMs, MCP, and agentic frameworks like LangChain and AgentCore is also preferred.
The base salary range for this role is $165,000 to $242,000, determined based on job-related knowledge, skills, experience, and market location. In addition to the base salary, the total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program, all based on eligibility. Benefits include medical, dental, and vision insurance fully paid for by CoreWeave; company-paid life insurance; voluntary supplemental life insurance; short- and long-term disability insurance; a flexible spending account; a health savings account; tuition reimbursement; the ability to participate in the Employee Stock Purchase Program (ESPP); mental wellness benefits through Spring Health; family-forming support provided by Carrot; paid parental leave; flexible, full-service childcare support with Kinside; a 401(k) with a generous employer match; flexible PTO; catered lunch each day in office and data center locations; a casual work environment; and a work culture focused on innovative disruption.