CI Failure Triage: A Repeatable Playbook

Thu Aug 14 2025

When CI goes red, panic wastes cycles. Apply a triage routine.

1. Categorize Failure Type

  • Lint / Type
  • Unit test
  • Integration / e2e
  • Build / packaging
  • Flaky (non-deterministic)

2. Reproduce Locally

Run the exact script from logs. Match Node + dependency versions.

3. Minimize Noise

Re-run failing test file in isolation. Use --runInBand if race suspected.

4. Check Recent Commits

git log -n 5 --oneline — look for risky changes (infra, deps, env).

5. Inspect Caches

Out-of-date caches cause weirdness. Bust them: clear node_modules, build artifacts.

6. For Flakes

  • Add retries temporarily.
  • Increase timeouts only after profiling.
  • Capture artifacts (screenshots, logs) for failing runs.

7. Root Cause Template

Symptom: Trigger commit: Root cause: Fix: Safeguard (test / lint / monitor):

8. Add a Guard

After fix, write a test to fail if regression recurs.

9. Communicate

Post concise summary in team channel with impact and mitigation.

10. Continuous Improvement

Track categories of failures monthly -> prioritize investments.

Stable CI accelerates shipping.