About Vida
Vida is a publicly traded, early-stage AI company building an AI Agent Operating System that enables businesses and partners to create, deliver, and manage AI workforces. Our platform powers AI agents that communicate, automate workflows, connect with business software, and help teams operate more efficiently across channels and systems.
We are at an important stage of product and company growth. The core platform is expanding across voice, messaging, computer-use agents, reseller tooling, integrations, analytics, reliability, security, and operational controls. As the platform continues to scale, infrastructure plays a critical role in making Vida reliable, secure, observable, and ready for increasingly complex AI agent workloads.
About the Role
We are hiring a DevOps Engineer to help own the infrastructure foundation for Vida’s AI Agent Operating System.
This is a hands-on infrastructure engineering role. You will design, build, automate, and operate the cloud systems that support Vida’s agents, platform services, observability stack, and AI/LLM gateway infrastructure.
You should be excited to work in an early-stage environment where infrastructure decisions directly affect product reliability, customer trust, security posture, and our ability to scale. This role requires strong judgment, high ownership, and the ability to move between architecture, implementation, automation, debugging, and day-to-day operations.
You will partner closely with engineering and product to improve platform reliability, reduce operational risk, and build systems that can support real-world AI agent workloads across voice, messaging, integrations, and computer-use automation.
This is an ideal role for an infrastructure engineer who wants broad ownership, technical depth, and the opportunity to help build foundational cloud and AI platform systems for a fast-moving AI company.
What You’ll Work On
Vida’s platform is powerful and technically complex. Your job will be to make the infrastructure behind it reliable, secure, observable, and scalable.
Examples of infrastructure problems you may work on include:
- Designing and operating production AWS environments across VPC, EC2, EKS, ELB/NLB, Transit Gateway, Route 53, IAM, ACM, SSM, ECR, GuardDuty, and Secrets Manager.
- Building secure multi-VPC networking patterns with hub-and-spoke architecture and route domain isolation.
- Operating production EKS clusters, platform add-ons, autoscaling, RBAC, namespace boundaries, and upgrade workflows.
- Managing infrastructure as code with Terraform, Terraform Cloud, reusable modules, and policy-controlled CI/CD authentication.
- Building observability systems across Prometheus, Grafana, Alertmanager, kube-state-metrics, Node Exporter, and centralized monitoring patterns.
- Implementing Kubernetes platform services through Helm, including load balancing, autoscaling, external secrets, metrics, and monitoring.
- Improving platform security through private access patterns, least-privilege IAM, workload identity, SSM-only access, mTLS VPN, encrypted storage, and secure secret delivery.
- Supporting AI/LLM infrastructure such as LiteLLM, model gateway telemetry, traffic control, autoscaling, and service-level monitoring.
What You’ll Do
Cloud Infrastructure & AWS
- Architect, operate, and improve AWS environments supporting Vida’s production platform.
- Design secure networking patterns across VPCs, EKS clusters, private services, Transit Gateway, ELB/NLB, Route 53, and Global Accelerator.
- Implement least-privilege IAM, workload identity through IRSA/OIDC, IMDSv2 enforcement, secure access controls, and production-ready AWS security patterns.
- Use ACM, SSM, ECR, GuardDuty, Secrets Manager, and related AWS services to improve security, reliability, and operational efficiency.
- Improve cross-region latency, traffic routing, and availability through Anycast, geo-routing, and resilient ingress patterns.
Infrastructure as Code & Automation
- Build and maintain production infrastructure using Terraform, HCL, Terraform Cloud remote state, policy controls, and OIDC-based CI authentication.
- Create and standardize reusable Terraform modules for networking, EKS, observability, security, and platform services.
- Build reproducible AMIs with Packer, including hardened Docker and runtime dependencies.
- Automate post-provisioning and configuration management with Ansible, dynamic EC2 inventory, SSM, Jinja2 templates, and multi-play orchestration.
- Create deterministic Terraform-to-Ansible handoff patterns for infrastructure provisioning and application bootstrap.
- Automate Helm lifecycle operations and environment-specific configuration rollout.
Kubernetes, EKS & Service Mesh
- Deploy, operate, and upgrade Amazon EKS clusters using managed node groups and production-safe workflows.
- Manage core EKS add-ons including VPC CNI, CoreDNS, kube-proxy, pod identity agent, metrics-server, external-secrets, cluster-autoscaler, aws-load-balancer-controller, and kube-prometheus-stack.
- Implement autoscaling, RBAC boundaries, namespace segmentation, and Kubernetes CRDs such as ServiceMonitor, PodMonitor, ExternalSecret, and ClusterSecretStore.
- Implement and operate Istio service mesh components, including base, control plane, ingress gateway, sidecar injection, and VirtualService routing policies.
- Integrate Istio ingress with internal NLB patterns for private service exposure and service-to-service traffic governance.
Observability, Reliability & AI Platform
- Build and improve observability using Prometheus Operator, Grafana, Alertmanager, Node Exporter, kube-state-metrics, dashboards, and alerting standards.
- Design federated monitoring patterns from in-cluster EKS workloads to centralized hub Prometheus infrastructure.
- Integrate AI platform telemetry, including LiteLLM Prometheus callbacks, ServiceMonitor, model traffic visibility, latency tracking, and runtime anomaly detection.
- Establish dashboards and alerts for cluster health, workload SLOs, infrastructure saturation, runtime errors, and service reliability.
- Deploy and operate LiteLLM on Kubernetes using multi-replica architecture, HPA, Istio ingress, secure secret integration, and service-level monitoring.
- Build scalable AI gateway patterns that support secure model access, traffic control, observability, and high availability.
Security & Operational Controls
- Implement zero-trust infrastructure patterns including private EKS API endpoints, SSM-only operator access, and no public bastion exposure.
- Enforce mTLS on Client VPN with ACM-issued certificates, and encrypt gp3 EBS volumes by default.
- Integrate AWS Secrets Manager with External Secrets Operator for secure Kubernetes secret delivery.
- Reduce operational risk through identity controls, segmentation, policy-driven access, secure defaults, and automated guardrails.
- Partner with engineering to improve incident readiness, debugging workflows, runtime visibility, and production change safety.
What We’re Looking For
- 5+ years of experience in DevOps, infrastructure engineering, platform engineering, site reliability engineering, cloud engineering, or a related role.
- Deep hands-on experience with AWS production environments, especially VPC, EC2, EKS, ELB/NLB, Transit Gateway, Route 53, IAM, ACM, SSM, ECR, GuardDuty, and Secrets Manager.
- Strong experience with Terraform, Terraform Cloud, reusable modules, remote state, CI/CD authentication, and policy-controlled infrastructure workflows.
- Strong Kubernetes and EKS experience, including managed node groups, core add-ons, Helm, RBAC, autoscaling, namespaces, and production upgrades.
- Experience operating observability stacks with Prometheus, Grafana, Alertmanager, Node Exporter, kube-state-metrics, dashboards, and alerting.
- Experience implementing security-first cloud patterns, including least-privilege IAM, IRSA/OIDC, IMDSv2, private access, encrypted storage, and secure secret management.
- Comfort working with service mesh concepts and technologies such as Istio, ingress gateways, VirtualService routing, sidecar injection, and internal load balancer patterns.
- Strong automation skills using HCL, YAML, Jinja2, Bash, Python, and infrastructure scripting.
- Ability to troubleshoot complex distributed systems across cloud infrastructure, Kubernetes, networking, observability, and application runtime layers.
- High ownership, strong judgment, and a bias toward automation, documentation, reliability, and secure defaults.
- Comfort working in an early-stage company environment where priorities can change quickly and infrastructure work has direct customer impact.
Nice to Have
- Experience building or operating AI/LLM infrastructure, model gateways, LiteLLM, or similar AI platform services.
- Experience with federated Prometheus architectures or centralized observability across multiple clusters and environments.
- Experience with Packer-based AMI pipelines and hardened Linux images, especially Ubuntu ARM64.
- Experience with hub-and-spoke AWS network architecture, Transit Gateway route domain isolation, Global Accelerator, Anycast, or geo-routing.
- Experience with External Secrets Operator, AWS Secrets Manager, ClusterSecretStore, and Kubernetes-native secret delivery.
- Experience supporting voice, communications, automation, telephony, reseller, or multi-tenant SaaS platforms.
- Experience designing production readiness standards, SLOs, incident response practices, and operational runbooks.
- Interest in AI agents, computer-use automation, and the infrastructure required to operate AI workforces at scale.
What Success Looks Like
In your first few months, you will help Vida improve the reliability, security, and observability of our core infrastructure, with a focus on production AWS, EKS, Terraform, Kubernetes platform services, and AI gateway operations.
You will make the platform easier to operate, safer to change, more visible in production, and better prepared for customer growth.
Over time, you will become a key owner of Vida’s infrastructure foundation, helping define how we scale secure, reliable, observable AI agent systems across customers, resellers, workloads, and environments.
Why Join Vida
- Work on foundational infrastructure for AI agents that communicate, operate, and complete real business work.
- Own high-impact cloud, Kubernetes, security, observability, and AI platform systems at an early-stage AI company.
- Help build the infrastructure layer behind Vida’s AI Agent Operating System and AI workforce platform.
- Partner closely with engineering and product on technical architecture, reliability, and production operations.
- Build systems for real customers, real workflows, and real operational impact.
- Help shape how scalable AI agent infrastructure is built and operated.
Compensation
The base salary for this role will be determined based on experience, location, technical depth, and scope fit.
In addition to salary, this role includes meaningful equity participation and standard company benefits. We view this as a senior, high-leverage role with broad company impact.
How to Apply
Please submit your resume using the link provided.