As asked
Describe a real incident where you realized mid-incident that you were missing the metric, log, or trace you needed to diagnose the problem. What did you do, and what did you build afterward?
Sample answer outline
A strong answer follows STAR format and names the specific gap (e.g., no database query timing metrics, no trace IDs in logs, no histogram for tail latency). The candidate should describe the workaround used during the incident (log parsing, manual profiling), the post-incident work to add the missing instrumentation, and how they validated the new instrumentation would have found the problem faster. The best answers show they updated the runbook and added a synthetic failure test.
Expect these follow-ups
- How did you decide which gap to fix first when there were multiple missing signals?
- How do you prevent the same observability gap from appearing in a new service?