MLOps monitoring,
made legible.
A real-time monitoring surface for ML pipelines — drift detection, deployment health, and incident replay across model lifecycles.
- HTML · CSS
- Next.js
- Tailwind
- Sector: AI · MLOps
- Users: ML engineers, SREs
- Scope: Product UX · data UX · system tokens
- Surface: Web app
• 01 · Overview
Overview.
Designed a unified monitoring surface for an ML platform team running 40+ production models. The role: principal product designer working with platform engineering and SRE leadership.
• 02 · Challenge
The challenge.
Ops engineers were debugging in three places (Datadog for infra, MLflow for experiments, a custom dashboard for drift) and stitching context together manually after every page. P1 incidents averaged 38 minutes from alert to root cause.
• 03 · Process
Process.
Started by shadowing on-call rotations to map the actual debugging path. Ran 12 interviews with stakeholders and on-call engineers. Built information-architecture sketches against three real incident retrospectives, then prototyped at three fidelities: flow, mid-fi, and hi-fi tokens.
- 12 stakeholder + on-call interviews
- Three retrospective walk-throughs replayed in prototypes
- Token-first design system aligned with platform engineering
• 04 · Solution
Solution.
A single canvas with three switching contexts: model health, deployment, drift. Status grammar borrowed from incident response (P1/P2/P3) so SRE and ML engineers shared a vocabulary. Replay scrubs through any incident in under 10 seconds.
- Unified canvas: one screen, three contexts
- Drift detector with rule-based and statistical thresholds
- Incident replay: scrub the timeline 60 min around an alert
- Tokens shipped to engineering as a Tailwind preset
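To make the drift detector's two kinds of threshold concrete, here is a minimal sketch of how a rule-based check and a statistical check could feed the shared P1/P2/P3 status grammar. Everything in it is an assumption for illustration, not the shipped implementation: the names `classifyDrift`, `psi`, and `DriftThresholds`, the extra `OK` state, the choice of the Population Stability Index as the statistic, the null-rate rule, the default cut-offs (0.1 and 0.25 are common PSI rules of thumb), and the mapping of each breach to a severity.

```typescript
// Hypothetical severity labels mirroring the P1/P2/P3 status grammar,
// plus an assumed "OK" state for "no drift detected".
type Severity = "P1" | "P2" | "P3" | "OK";

interface DriftThresholds {
  maxNullRate: number; // rule-based: allowed share of missing values
  psiWarn: number;     // statistical: PSI above this maps to P3
  psiAlert: number;    // statistical: PSI above this maps to P2
}

// Assumed defaults, not values taken from the project.
const DEFAULT_THRESHOLDS: DriftThresholds = {
  maxNullRate: 0.05,
  psiWarn: 0.1,
  psiAlert: 0.25,
};

// Population Stability Index over two binned distributions,
// each given as per-bin proportions that sum to 1.
function psi(expected: number[], actual: number[], eps = 1e-6): number {
  if (expected.length !== actual.length) {
    throw new Error("expected and actual must have the same number of bins");
  }
  let total = 0;
  for (let i = 0; i < expected.length; i++) {
    // Clamp to eps so empty bins do not produce log(0) or division by zero.
    const e = Math.max(expected[i] ?? 0, eps);
    const a = Math.max(actual[i] ?? 0, eps);
    total += (a - e) * Math.log(a / e);
  }
  return total;
}

function classifyDrift(
  expected: number[],
  actual: number[],
  nullRate: number,
  t: DriftThresholds = DEFAULT_THRESHOLDS
): Severity {
  // Rule-based check runs first: in this sketch a hard data-quality
  // breach escalates straight to P1.
  if (nullRate > t.maxNullRate) return "P1";

  // Statistical check: compare the live distribution to the baseline.
  const score = psi(expected, actual);
  if (score > t.psiAlert) return "P2";
  if (score > t.psiWarn) return "P3";
  return "OK";
}
```

Calling `classifyDrift([0.5, 0.5], [0.7, 0.3], 0)` exercises the statistical path only, while any `nullRate` above `maxNullRate` short-circuits to the rule-based result. Running the rule first keeps the hard limits predictable for on-call engineers regardless of how the statistic behaves.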
• 05 · Results & metrics
Results.
- 38 → 11 min mean time to root cause
- 72% fewer false-positive alerts
- 100% on-call adoption in week one