# Debugging and Troubleshooting While On-Call
Act as an on-call engineer systematically debugging an issue during an incident.
## Debugging Framework
**Systematic Approach**:
1. Gather information
2. Reproduce the issue
3. Form hypotheses
4. Test hypotheses
5. Fix or escalate
6. Verify resolution
---
## Initial Investigation (0-15 minutes)
### Step 1: Gather Context
**Information to Collect**:
- [ ] When did the issue start? [Time]
- [ ] What was working before? [State]
- [ ] What changed recently? [Deployments/configs]
- [ ] How many users affected? [Scale]
- [ ] What error messages? [Logs/metrics]
**Quick Checks**:
- [ ] Status page: [Any known issues?]
- [ ] Recent deployments: [Last [X] hours]
- [ ] Monitoring dashboards: [Any anomalies?]
- [ ] Alert history: [Related alerts?]
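A minimal command-line sketch of these quick checks, assuming a Linux host with a systemd-managed service named `myapp` (the service name, log source, and time window are placeholders for your stack):
```bash
# Quick context sweep -- service name and time window are assumptions, adjust for your stack
systemctl status myapp --no-pager                   # is the service up? when did it last (re)start?
journalctl -u myapp --since "1 hour ago" --no-pager | grep -i error | tail -20   # recent errors
uptime                                              # load average at a glance
df -h / && free -m                                  # disk and memory headroom
```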
**Question Template**:
- What exactly is broken?
- When did it start breaking?
- Who/what is affected?
- What changed recently?
- What should be happening instead?
---
## Information Gathering Tools
### Logs
**Where to Look**:
- [ ] Application logs: [Location]
- [ ] Error logs: [Location]
- [ ] Access logs: [Location]
- [ ] System logs: [Location]
**Useful Commands**:
```bash
# Show the 100 most recent error lines
grep -i error /var/log/app.log | tail -100
# Page through the 50 most recent error lines
grep -i error /var/log/app.log | tail -50 | less
# Filter errors to a specific time window (here 14:00-14:59 on 2024-01-01)
grep "2024-01-01 14:" /var/log/app.log | grep -i error
# Count total error lines
grep -i error /var/log/app.log | wc -l
```
### Metrics
**Metrics to Check**:
- [ ] Error rate: [Current vs. baseline]
- [ ] Response time: [Current vs. baseline]
- [ ] CPU/Memory: [Current usage]
- [ ] Request rate: [Traffic patterns]
- [ ] Database connections: [Pool status]
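If dashboards are slow or unavailable, a rough host-level view of the same signals can come straight from the box (commands assume a Linux host):
```bash
# Host-level snapshot of the metrics above (Linux; adjust for your platform)
top -b -n 1 | head -15     # CPU and memory, top consumers
free -m                    # memory in MB
df -h                      # disk usage per filesystem
ss -s                      # socket summary, a rough proxy for connection counts
```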
### Monitoring Dashboards
**Dashboards to Review**:
- [ ] [Dashboard name]: [What it shows]
- [ ] [Dashboard name]: [What it shows]
- [ ] [Dashboard name]: [What it shows]
**Key Questions**:
- What metrics are abnormal?
- When did metrics start changing?
- Are there any patterns?
- Which services are affected?
---
## Hypothesis Formation
### Common Issue Categories
**1. Recent Changes**
- Hypothesis: Recent deployment/config change caused issue
- Evidence to check: [Deployment logs, config changes]
- How to verify: [Rollback, compare configs]
**2. Dependency Issues**
- Hypothesis: External dependency/service is down
- Evidence to check: [Dependency health, API responses]
- How to verify: [Test dependency directly]
**3. Resource Exhaustion**
- Hypothesis: Out of CPU/memory/disk/connections
- Evidence to check: [Resource metrics, limits]
- How to verify: [Check resource usage, limits]
**4. Configuration Issues**
- Hypothesis: Incorrect configuration
- Evidence to check: [Config files, environment variables]
- How to verify: [Compare with known good config]
**5. Data Issues**
- Hypothesis: Corrupted or missing data
- Evidence to check: [Database integrity, data checks]
- How to verify: [Query data, check backups]
**6. Network Issues**
- Hypothesis: Network connectivity problems
- Evidence to check: [Network metrics, connectivity tests]
- How to verify: [Ping tests, traceroute]
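As an illustration, categories 2 and 6 can often be checked directly from an affected host; the URL and hostnames below are placeholders:
```bash
# Hypothesis 2: is the external dependency healthy? (URL is a placeholder)
curl -sS -o /dev/null -w "%{http_code} in %{time_total}s\n" https://api.example.com/health

# Hypothesis 6: basic reachability to an internal dependency (hostname is a placeholder)
ping -c 4 db.internal.example.com
traceroute db.internal.example.com
```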
---
## Investigation Workflow
### Workflow Template
**Step 1: Reproduce the Issue**
- [ ] Can I reproduce it?
- [ ] What steps reproduce it?
- [ ] Is it consistent or intermittent?
- [ ] What's the error message/behavior?
**Step 2: Check the Obvious**
- [ ] Is service running? [Status check]
- [ ] Are dependencies healthy? [Health checks]
- [ ] Are there recent errors? [Error logs]
- [ ] Has anything changed? [Recent changes]
**Step 3: Narrow Down Scope**
- [ ] Is it affecting all users or only a subset?
- [ ] Is it affecting all features or only specific ones?
- [ ] Is it affecting all regions or a specific region?
- [ ] Is it affecting all servers or a specific server?
**Step 4: Form Initial Hypothesis**
- [ ] Based on evidence, what's most likely?
- [ ] What would explain all the symptoms?
- [ ] What's the simplest explanation?
**Step 5: Test Hypothesis**
- [ ] How can I verify this hypothesis?
- [ ] What evidence would confirm it?
- [ ] What test can I run?
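For example, if the hypothesis is "errors started with the 14:05 deploy", comparing error counts just before and just after that time is a quick test. The log path, timestamp format, and deploy time below are all assumptions:
```bash
# Errors per minute around a suspected 14:05 deploy
# (assumes log lines start with "YYYY-MM-DD HH:MM:SS"; adjust to your format)
for minute in 14:00 14:01 14:02 14:03 14:04 14:05 14:06 14:07 14:08 14:09; do
  total=$(grep -c "2024-01-01 ${minute}" /var/log/app.log)
  errors=$(grep "2024-01-01 ${minute}" /var/log/app.log | grep -ci error)
  echo "${minute}: ${errors} errors / ${total} lines"
done
```
A clear jump at or after 14:05 supports the hypothesis; a flat distribution argues for a different cause.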
---
## Common Debugging Strategies
### Strategy 1: Divide and Conquer
- [ ] Split the system into components
- [ ] Test each component independently
- [ ] Identify which component is failing
- [ ] Drill into that component until the fault is isolated
### Strategy 2: Binary Search
- [ ] Check the midpoint of the request path or pipeline
- [ ] If things are already broken at the midpoint, the fault is in the first half
- [ ] If things are still correct at the midpoint, the fault is in the second half
- [ ] Repeat on the suspect half until the fault is isolated (for code changes, see the sketch below)
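When the suspect is a code change rather than a component in the request path, `git bisect` applies the same binary-search idea across commits. A minimal sketch, assuming you know one good and one bad commit (the tag name is a placeholder):
```bash
# Binary search over commits between a known-good and a known-bad state
git bisect start
git bisect bad HEAD            # current commit exhibits the issue
git bisect good v1.4.2         # last version known to work (tag is a placeholder)
# git now checks out a midpoint commit; test it, then mark it:
#   git bisect good    (or)    git bisect bad
# repeat until git names the first bad commit, then clean up:
git bisect reset
```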
### Strategy 3: Compare Working vs. Broken
- [ ] What's different between working and broken?
- [ ] Compare configs: [Working vs. broken]
- [ ] Compare code versions: [Working vs. broken]
- [ ] Compare environments: [Working vs. broken]
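A minimal sketch of the config comparison, assuming SSH access to both hosts; hostnames and the config path are placeholders:
```bash
# Diff the config on a known-good host against the broken one
ssh good-host   "cat /etc/myapp/config.yaml" > /tmp/good.yaml
ssh broken-host "cat /etc/myapp/config.yaml" > /tmp/broken.yaml
diff -u /tmp/good.yaml /tmp/broken.yaml
```
The same pattern works for comparing installed package versions (`dpkg -l`, `pip freeze`) between the two hosts.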
### Strategy 4: Check Logs Systematically
- [ ] Start from when issue began
- [ ] Follow log trail chronologically
- [ ] Look for error patterns
- [ ] Trace request flow through logs
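Tracing one request end to end is often the fastest way to see where it stops; the request ID and log paths below are placeholders:
```bash
# Follow a single request across services by its request/correlation ID
grep -h "req-7f3a9c" /var/log/api.log /var/log/worker.log /var/log/db-proxy.log | sort
# (-h drops filenames so sort gives chronological order, assuming lines start with timestamps)
```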
### Strategy 5: Use Debugging Tools
- [ ] Debugger: [Breakpoints, step through]
- [ ] Profilers: [Performance analysis]
- [ ] Tracing: [Distributed tracing]
- [ ] Monitoring: [Real-time metrics]
---
## Debugging Common Issues
### Issue: Service Not Responding
**Investigation Steps**:
1. [ ] Check if service is running: [Command]
2. [ ] Check service logs: [Command]
3. [ ] Check resource usage: [CPU/Memory]
4. [ ] Check network connectivity: [Ping/curl]
5. [ ] Check health endpoint: [URL]
6. [ ] Check recent deployments: [When]
7. [ ] Check dependencies: [Are they up?]
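Filled in with example commands; the service name, port, and health URL are assumptions:
```bash
# Triage for an unresponsive service -- names, port, and URL are placeholders
systemctl status myapp --no-pager             # is the process running?
journalctl -u myapp -n 100 --no-pager         # last 100 log lines: crash? OOM kill? config error?
top -b -n 1 | head -15                        # CPU/memory snapshot
curl -sS -m 5 http://localhost:8080/healthz   # does the health endpoint answer within 5 seconds?
```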
**Common Causes**:
- Service crashed: [Check logs for crash]
- Resource exhaustion: [Check CPU/memory]
- Configuration error: [Check configs]
- Dependency down: [Check dependencies]
### Issue: High Error Rate
**Investigation Steps**:
1. [ ] Check error logs: [Most common errors]
2. [ ] Identify error patterns: [What errors?]
3. [ ] Check when errors started: [Timeline]
4. [ ] Check recent changes: [Deployments/configs]
5. [ ] Check related metrics: [Request rate, latency]
6. [ ] Check database: [Query performance]
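One way to surface the dominant failure mode is to group error lines by message; the log path and the "timestamp-first" line format are assumptions:
```bash
# Rank error messages by frequency (strips the leading date/time fields before grouping)
grep -i error /var/log/app.log \
  | sed -E 's/^[0-9-]+ [0-9:.,]+ *//' \
  | sort | uniq -c | sort -rn | head -10
```
If one message dominates, its first occurrence in the log usually pins down when the problem started.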
**Common Causes**:
- Code bug: [Recent deployment]
- Database issue: [Connection/query problems]
- Dependency issue: [External service down]
- Resource issue: [Out of capacity]
### Issue: Slow Performance
**Investigation Steps**:
1. [ ] Check response times: [Current vs. baseline]
2. [ ] Check resource usage: [CPU/Memory/DB]
3. [ ] Check database queries: [Slow queries]
4. [ ] Check network latency: [Network metrics]
5. [ ] Check request rate: [Traffic spike?]
6. [ ] Check recent changes: [Deployments]
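`curl`'s timing variables give a quick breakdown of where a slow request spends its time; the URL is a placeholder:
```bash
# Where does the time go? DNS, TCP connect, TLS, server think-time, total
curl -sS -o /dev/null \
  -w "dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n" \
  https://myapp.example.com/api/items
```
A large gap between connect/tls and ttfb points at the application or database rather than the network.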
**Common Causes**:
- N+1 queries: [Check database queries]
- Resource exhaustion: [CPU/Memory]
- Slow dependency: [External service]
- Traffic spike: [More requests than capacity]
### Issue: Data Inconsistency
**Investigation Steps**:
1. [ ] Identify scope: [What data is affected?]
2. [ ] Check data integrity: [Database checks]
3. [ ] Check recent changes: [Data migrations/writes]
4. [ ] Check application logic: [How data is written]
5. [ ] Check replication: [If replicated]
6. [ ] Review recent operations: [What changed data?]
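If the data lives in PostgreSQL with streaming replication, replication lag (step 5) can be checked directly on the replica; the host, user, and database names are placeholders and the query assumes PostgreSQL:
```bash
# How far behind is the replica? (PostgreSQL; connection details are placeholders)
psql -h replica.internal -U app -d appdb \
  -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
```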
**Common Causes**:
- Bug in application logic: [Check code]
- Data migration issue: [Check migrations]
- Replication lag: [Check replication]
- Concurrent updates: [Race condition]
---
## Effective Debugging Practices
### Time Management
- [ ] Set time limits for investigation phases
- [ ] Escalate if stuck after [X] minutes
- [ ] Don't go down a rabbit hole on a single theory
- [ ] Take breaks if frustrated
### Documentation
- [ ] Document what you've checked
- [ ] Document hypotheses tested
- [ ] Document findings
- [ ] Document resolution steps
### Collaboration
- [ ] Ask for help when stuck
- [ ] Share findings with team
- [ ] Use pair debugging if helpful
- [ ] Escalate appropriately
### Verification
- [ ] Verify fix actually resolves issue
- [ ] Test related functionality
- [ ] Monitor metrics after fix
- [ ] Confirm no regressions
---
## Debugging Checklist
**Before Starting Investigation**:
- [ ] Understand the problem clearly
- [ ] Gather initial information
- [ ] Set up monitoring/logging access
- [ ] Prepare debugging tools
**During Investigation**:
- [ ] Follow systematic approach
- [ ] Document findings
- [ ] Test hypotheses
- [ ] Don't make assumptions
- [ ] Ask for help when needed
**After Finding Root Cause**:
- [ ] Verify root cause
- [ ] Implement fix
- [ ] Verify fix works
- [ ] Document solution
- [ ] Update runbooks
---
## Escalation Criteria
**When to Escalate**:
- [ ] No progress after [X] minutes
- [ ] Issue exceeds your expertise
- [ ] Need access/permissions you don't have
- [ ] Issue affecting critical systems
- [ ] Customer impact severe
**How to Escalate**:
- [ ] Summarize what you've found
- [ ] Share investigation steps taken
- [ ] Explain what you need help with
- [ ] Provide context and evidence
- [ ] Include customer impact
---
## Common Mistakes to Avoid
**Don't**:
- [ ] Assume you know the cause without evidence
- [ ] Skip systematic investigation
- [ ] Make changes without understanding impact
- [ ] Ignore obvious checks
- [ ] Work in isolation when stuck
- [ ] Forget to document findings
- [ ] Forget to verify fixes
**Do**:
- [ ] Follow systematic approach
- [ ] Gather evidence before acting
- [ ] Document everything
- [ ] Ask for help when needed
- [ ] Verify your fixes
- [ ] Learn from incidents