Product Infrastructure Engineer - Site Reliability

Company: Zyphra
Location: Palo Alto
Posted on: February 15, 2026

Job Description:

Zyphra is an artificial intelligence company based in Palo Alto, California.

The Role:

As an Infrastructure Engineer - Site Reliability, you’ll be responsible for designing and maintaining the systems that keep Zyphra’s infrastructure robust, observable, secure, and scalable. Your work will be essential to ensuring the reliability and reproducibility of ML workloads, the safety and control of deployments, and the long-term maintainability of our compute environments.

You’ll work across:
- Building and improving observability systems (monitoring, logging, alerting)
- Designing resilient build and deployment systems across research and production environments
- Implementing secure release processes with strong auditability and rollback support
- Collaborating closely with ML engineers, DevOps, and infra teams to improve system reliability and performance
- Leading incident response, root-cause analysis, and postmortems with a focus on learning and prevention

This role is ideal for someone who loves building systems that make other teams faster, safer, and more productive.

Requirements:
- Experience in high-performance compute environments, such as ML clusters or GPU farms
- Background in infrastructure as code (e.g., Ansible, Terraform)
- Familiarity with software release engineering for ML/AI systems is a plus
- Experience designing reliable environments for experimental workloads and reproducible runs
- Knowledge of compliance and audit standards in deployment and system security
- Experience with load testing, fault injection, and chaos engineering to harden systems under stress
- Passion for building tooling that makes infrastructure invisible and reliable for end users

Bonus Qualifications:
- Experience with infrastructure as code (e.g., Ansible, Terraform)
- Prior work supporting ML/AI infrastructure, including GPU management and workload optimization
- Exposure to backend development for ML model serving (e.g., vLLM, Ray, SGLang)
- Experience working with cloud platforms such as AWS, Azure, or GCP
- Familiarity with containers (Docker, Apptainer) and their integration with scheduling systems (Slurm, Kubernetes)

Why Work at Zyphra:
- Our research methodology is to make grounded, methodical steps toward ambitious goals. Deep research and engineering excellence are equally valued
- We strongly value new and crazy ideas and are very willing to bet big on them
- We move as quickly as we can; we aim to keep the bar to impact as low as possible
- We all enjoy what we do and love discussing AI

Benefits and Perks:
- Comprehensive medical, dental, vision, and FSA plans
- Competitive compensation and 401(k)
- Relocation and immigration support on a case-by-case basis
- On-site meals prepared by a dedicated culinary team; Thursday Happy Hours
- In-person team in Palo Alto, CA, with a collaborative, high-energy environment

If you are excited to bring reliability best practices to the frontier of AI infrastructure, this job is for you. Apply Today!

Keywords: Zyphra, Novato, Product Infrastructure Engineer - Site Reliability, IT / Software / Systems, Palo Alto, California

Other IT / Software / Systems Jobs

Senior ML Engineer
Description: THE OPPORTUNITY We are Amplifier, and we have built the world's first Large Acoustic Model (LAM), a foundation model that uses human voice to detect health conditions. (more...)
Company: Amplifier Health
Location: San Francisco
Posted on: 02/16/2026

Senior Software Engineer, AI Product
Description: At Vanta, our mission is to help businesses earn and prove trust. We believe that security should be monitored and verified continuously, and we empower companies to practice better security and prove (more...)
Company: Vanta
Location: Campbell
Posted on: 02/16/2026

Machine Learning Engineer - Robotics, Platforms for Vision Language Action Foundation Models
Description: At Toyota Research Institute (TRI), we're on a mission to improve the quality of human life. We're developing new tools and capabilities to amplify the human experience. (more...)
Company: Toyota Research Institute
Location: Los Altos
Posted on: 02/16/2026

Chief of Staff to CEO
Description: Included Health is hiring a Chief of Staff to the CEO who is equal parts strategic partner and operator. You'll run the Office of the CEO and act as a force multiplier (more...)
Company: Included Health
Location: San Francisco
Posted on: 02/16/2026

Senior Software Engineer, Enterprise Platform
Description: At Vanta, our mission is to help businesses earn and prove trust. We believe that security should be monitored and verified continuously, and we empower companies to practice better security and prove (more...)
Company: Vanta
Location: Campbell
Posted on: 02/16/2026

Senior Site Reliability Engineer | Patreon
Description: Patreon is seeking a Senior Site Reliability Engineer to join our dynamic technology team in San Francisco, CA. In this pivotal role, you will be instrumental in ensuring (more...)
Company: Ziphire.hr
Location: Sacramento
Posted on: 02/16/2026

Software Engineer, Custodian Data
Description: Do you have a passion for finance & investing? Are you excited by the challenge of modeling industry-critical data and making it highly available? Do you enjoy solving complex technical problems and (more...)
Company: Ridgeline
Location: San Ramon
Posted on: 02/16/2026

Senior Building Controls Technician
Description: Please note that this is for an upcoming position. We are, however, accepting applications for this anticipated need. If you are interested in joining The Building People, (more...)
Company: The Building People
Location: San Francisco
Posted on: 02/16/2026

Director, Workplace Technology & Applications
Description: Are you a hands-on technology leader who thrives on empowering others through tools, systems, and data? Do you enjoy rolling up your sleeves to solve business problems at the intersection of IT, operations, (more...)
Company: Ridgeline
Location: San Ramon
Posted on: 02/16/2026

Principal Product Manager, AI Platform - Remote
Description: Circle (NYSE: CRCL) is one of the world's leading internet financial platform companies, building the foundation of a more open, global economy through digital assets, payment applications, and programmable (more...)
Company: Circle
Location: Campbell
Posted on: 02/16/2026
