CI Failure Triage: A Repeatable Playbook
Thu Aug 14 2025
When CI goes red, panic wastes cycles. Apply a triage routine.
1. Categorize Failure Type
- Lint / Type
- Unit test
- Integration / e2e
- Build / packaging
- Flaky (non-deterministic)
2. Reproduce Locally
Run the exact script from logs. Match Node + dependency versions.
3. Minimize Noise
Re-run failing test file in isolation. Use --runInBand
if race suspected.
4. Check Recent Commits
git log -n 5 --oneline
— look for risky changes (infra, deps, env).
5. Inspect Caches
Out-of-date caches cause weirdness. Bust them: clear node_modules, build artifacts.
6. For Flakes
- Add retries temporarily.
- Increase timeouts only after profiling.
- Capture artifacts (screenshots, logs) for failing runs.
7. Root Cause Template
Symptom: Trigger commit: Root cause: Fix: Safeguard (test / lint / monitor):
8. Add a Guard
After fix, write a test to fail if regression recurs.
9. Communicate
Post concise summary in team channel with impact and mitigation.
10. Continuous Improvement
Track categories of failures monthly -> prioritize investments.
Stable CI accelerates shipping.