FP Complete · VP Platform Engineering, AI R&D

Multi-Model Agent Platform

Problem

Client AI initiatives were slow to ship: model deployment was manual, onboarding new engineers took weeks, and there was no secure, shared substrate for running multi-model LLM and agent workloads in production.

Architecture

A Kubernetes/AWS platform running multi-model LLM and generative-AI pipelines, fronted by SSO-secured agent infrastructure and an internal developer platform with self-service onboarding. CI/CD is standardized on GitHub Actions, and the platform operates under defined SLAs and incident-response processes, with full observability (Prometheus, Grafana, OpenTelemetry).
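
To make the multi-model idea concrete, here is a minimal sketch of a routing layer that dispatches a prompt to one of several registered model backends and records per-call latency. It is illustrative only and not the platform's actual code: `ModelRouter`, `ModelBackend`, and the `echo`/`upper` stand-in backends are hypothetical names. It uses only the Python standard library, and an in-memory list stands in for a real metrics pipeline.

```python
from dataclasses import dataclass, field
from time import perf_counter
from typing import Callable, Dict, List

# Hypothetical type: a backend is any callable that maps a prompt to a completion.
ModelBackend = Callable[[str], str]


@dataclass
class ModelRouter:
    """Routes a prompt to one of several registered model backends (illustrative)."""

    backends: Dict[str, ModelBackend] = field(default_factory=dict)
    latencies_s: Dict[str, List[float]] = field(default_factory=dict)

    def register(self, name: str, backend: ModelBackend) -> None:
        """Add a backend under a name and start an empty latency record for it."""
        self.backends[name] = backend
        self.latencies_s[name] = []

    def call(self, name: str, prompt: str) -> str:
        """Dispatch the prompt to the named backend and record how long it took."""
        if name not in self.backends:
            raise KeyError(f"unknown model: {name}")
        start = perf_counter()
        try:
            return self.backends[name](prompt)
        finally:
            # Record latency even when the backend raises, so failures stay visible.
            self.latencies_s[name].append(perf_counter() - start)


# Stand-in backends (assumptions); a real deployment would call model servers here.
router = ModelRouter()
router.register("echo", lambda prompt: prompt)
router.register("upper", lambda prompt: prompt.upper())
```

Calling `router.call("upper", "hi")` dispatches to the second backend and appends one latency sample for it. In a setup like the one described, the latency list would presumably be replaced by an exported metric or trace span rather than kept in memory.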

What I built

Outcomes

Model deployment time: −40%
Engineer onboarding time: −50%
CI/CD cycle time: −60%
System uptime: +15%
Mean time to recovery: −40%

Stack

Kubernetes (EKS) · AWS · Rust · Python · GitHub Actions · Prometheus/Grafana · OpenTelemetry