Product Infrastructure Engineer - Site Reliability
Company: Zyphra
Location: Palo Alto
Posted on: February 15, 2026
Job Description:
Zyphra is an artificial intelligence company based in Palo Alto,
California.

The Role:
As an Infrastructure Engineer - Site Reliability, you’ll be
responsible for designing and maintaining the systems that keep
Zyphra’s infrastructure robust, observable, secure, and scalable.
Your work will be essential to ensuring the reliability and
reproducibility of ML workloads, the safety and control of
deployments, and the long-term maintainability of our compute
environments.

You’ll work across:
- Building and improving observability systems (monitoring,
  logging, alerting)
- Designing resilient build and deployment systems across research
  and production environments
- Implementing secure release processes with strong auditability
  and rollback support
- Collaborating closely with ML engineers, DevOps, and infra teams
  to improve system reliability and performance
- Leading incident response, root-cause analysis, and postmortems
  with a focus on learning and prevention

This role is ideal for someone who loves building systems that make
other teams faster, safer, and more productive.

Requirements:
- Experience in high-performance compute
  environments, such as ML clusters or GPU farms
- Background in infrastructure as code (e.g., Ansible, Terraform)
- Familiarity with software release engineering for ML/AI systems
  is a plus
- Experience designing reliable environments for experimental
  workloads and reproducible runs
- Knowledge of compliance and audit standards in deployment and
  system security
- Experience with load testing, fault injection, and chaos
  engineering to harden systems under stress
- Passion for building tooling that makes infrastructure invisible
  and reliable for end users

Bonus Qualifications:
- Experience with infrastructure as code (e.g., Ansible, Terraform)
- Prior work supporting ML/AI infrastructure, including GPU
  management and workload optimization
- Exposure to backend development for ML model serving (e.g., vLLM,
  Ray, SGLang)
- Experience working with cloud platforms such as AWS, Azure, or GCP
- Familiarity with containers (Docker, Apptainer) and their
  integration with scheduling systems (Slurm, Kubernetes)

Why Work at Zyphra:
- Our research methodology is to take grounded, methodical steps
  toward ambitious goals. Deep research and engineering excellence
  are equally valued.
- We strongly value new and crazy ideas and are very willing to bet
  big on them.
- We move as quickly as we can; we aim to keep the bar to impact as
  low as possible.
- We all enjoy what we do and love discussing AI.

Benefits and Perks:
- Comprehensive medical, dental, vision, and FSA plans
- Competitive compensation and 401(k)
- Relocation and immigration support on a case-by-case basis
- On-site meals prepared by a dedicated culinary team; Thursday
  Happy Hours
- In-person team in Palo Alto, CA, with a collaborative, high-energy
  environment

If you are excited to bring reliability best practices to the
frontier of AI infrastructure, this job is for you. Apply Today!
Keywords: Zyphra, Novato, Product Infrastructure Engineer - Site Reliability, IT / Software / Systems, Palo Alto, California