# Debugging and Troubleshooting While On-Call
Act as an on-call engineer systematically debugging an issue during an incident.
## Debugging Framework
**Systematic Approach**:
1. Gather information
2. Reproduce the issue
3. Form hypotheses
4. Test hypotheses
5. Fix or escalate
6. Verify resolution
---
## Initial Investigation (0-15 minutes)
### Step 1: Gather Context
**Information to Collect**:
- [ ] When did the issue start? [Time]
- [ ] What was working before? [State]
- [ ] What changed recently? [Deployments/configs]
- [ ] How many users affected? [Scale]
- [ ] What error messages? [Logs/metrics]
**Quick Checks**:
- [ ] Status page: [Any known issues?]
- [ ] Recent deployments: [Last [X] hours]
- [ ] Monitoring dashboards: [Any anomalies?]
- [ ] Alert history: [Related alerts?]
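A minimal command-line sketch of these quick checks, assuming a Linux host with a systemd-managed service named `myapp` (the service name, log source, and time window are placeholders for your stack):
```bash
# Quick context sweep -- service name and time window are assumptions, adjust for your stack
systemctl status myapp --no-pager                   # is the service up? when did it last (re)start?
journalctl -u myapp --since "1 hour ago" --no-pager | grep -i error | tail -20   # recent errors
uptime                                              # load average at a glance
df -h / && free -m                                  # disk and memory headroom
```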
**Question Template**:
- What exactly is broken?
- When did it start breaking?
- Who/what is affected?
- What changed recently?
- What should be happening instead?
---
## Information Gathering Tools
### Logs
**Where to Look**:
- [ ] Application logs: [Location]
- [ ] Error logs: [Location]
- [ ] Access logs: [Location]
- [ ] System logs: [Location]
**Useful Commands**:
```bash
# Show the 100 most recent error lines
grep -i error /var/log/app.log | tail -100
# Page through the 50 most recent error lines
grep -i error /var/log/app.log | tail -50 | less
# Filter errors to a specific time window (here 14:00-14:59 on 2024-01-01)
grep "2024-01-01 14:" /var/log/app.log | grep -i error
# Count total error lines
grep -i error /var/log/app.log | wc -l
```
### Metrics
**Metrics to Check**:
- [ ] Error rate: [Current vs. baseline]
- [ ] Response time: [Current vs. baseline]
- [ ] CPU/Memory: [Current usage]
- [ ] Request rate: [Traffic patterns]
- [ ] Database connections: [Pool status]
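If dashboards are slow or unavailable, a rough host-level view of the same signals can come straight from the box (commands assume a Linux host):
```bash
# Host-level snapshot of the metrics above (Linux; adjust for your platform)
top -b -n 1 | head -15     # CPU and memory, top consumers
free -m                    # memory in MB
df -h                      # disk usage per filesystem
ss -s                      # socket summary, a rough proxy for connection counts
```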
### Monitoring Dashboards
**Dashboards to Review**:
- [ ] [Dashboard name]: [What it shows]
- [ ] [Dashboard name]: [What it shows]
- [ ] [Dashboard name]: [What it shows]
**Key Questions**:
- What metrics are abnormal?
- When did metrics start changing?
- Are there any patterns?
- Which services are affected?
---
## Hypothesis Formation
### Common Issue Categories
**1. Recent Changes**
- Hypothesis: Recent deployment/config change caused issue
- Evidence to check: [Deployment logs, config changes]
- How to verify: [Rollback, compare configs]
**2. Dependency Issues**
- Hypothesis: External dependency/service is down
- Evidence to check: [Dependency health, API responses]
- How to verify: [Test dependency directly]
**3. Resource Exhaustion**
- Hypothesis: Out of CPU/memory/disk/connections
- Evidence to check: [Resource metrics, limits]
- How to verify: [Check resource usage, limits]
**4. Configuration Issues**
- Hypothesis: Incorrect configuration
- Evidence to check: [Config files, environment variables]
- How to verify: [Compare with known good config]
**5. Data Issues**
- Hypothesis: Corrupted or missing data
- Evidence to check: [Database integrity, data checks]
- How to verify: [Query data, check backups]
**6. Network Issues**
- Hypothesis: Network connectivity problems
- Evidence to check: [Network metrics, connectivity tests]
- How to verify: [Ping tests, traceroute]
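As an illustration, categories 2 and 6 can often be checked directly from an affected host; the URL and hostnames below are placeholders:
```bash
# Hypothesis 2: is the external dependency healthy? (URL is a placeholder)
curl -sS -o /dev/null -w "%{http_code} in %{time_total}s\n" https://api.example.com/health

# Hypothesis 6: basic reachability to an internal dependency (hostname is a placeholder)
ping -c 4 db.internal.example.com
traceroute db.internal.example.com
```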
---
## Investigation Workflow
### Workflow Template
**Step 1: Reproduce the Issue**
- [ ] Can I reproduce it?
- [ ] What steps reproduce it?
- [ ] Is it consistent or intermittent?
- [ ] What's the error message/behavior?
**Step 2: Check the Obvious**
- [ ] Is service running? [Status check]
- [ ] Are dependencies healthy? [Health checks]
- [ ] Are there recent errors? [Error logs]
- [ ] Has anything changed? [Recent changes]
**Step 3: Narrow Down Scope**
- [ ] Is it affecting all users or only a subset?
- [ ] Is it affecting all features or only specific ones?
- [ ] Is it affecting all regions or a specific region?
- [ ] Is it affecting all servers or a specific server?
**Step 4: Form Initial Hypothesis**
- [ ] Based on evidence, what's most likely?
- [ ] What would explain all the symptoms?
- [ ] What's the simplest explanation?
**Step 5: Test Hypothesis**
- [ ] How can I verify this hypothesis?
- [ ] What evidence would confirm it?
- [ ] What test can I run?
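For example, if the hypothesis is "errors started with the 14:05 deploy", comparing error counts just before and just after that time is a quick test. The log path, timestamp format, and deploy time below are all assumptions:
```bash
# Errors per minute around a suspected 14:05 deploy
# (assumes log lines start with "YYYY-MM-DD HH:MM:SS"; adjust to your format)
for minute in 14:00 14:01 14:02 14:03 14:04 14:05 14:06 14:07 14:08 14:09; do
  total=$(grep -c "2024-01-01 ${minute}" /var/log/app.log)
  errors=$(grep "2024-01-01 ${minute}" /var/log/app.log | grep -ci error)
  echo "${minute}: ${errors} errors / ${total} lines"
done
```
A clear jump at or after 14:05 supports the hypothesis; a flat distribution argues for a different cause.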
---
## Common Debugging Strategies
### Strategy 1: Divide and Conquer
- [ ] Split the system into components
- [ ] Test each component independently
- [ ] Identify which component is failing
- [ ] Drill into that component until the fault is isolated
### Strategy 2: Binary Search
- [ ] Check the midpoint of the request path or pipeline
- [ ] If things are already broken at the midpoint, the fault is in the first half
- [ ] If things are still correct at the midpoint, the fault is in the second half
- [ ] Repeat on the suspect half until the fault is isolated (for code changes, see the sketch below)
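When the suspect is a code change rather than a component in the request path, `git bisect` applies the same binary-search idea across commits. A minimal sketch, assuming you know one good and one bad commit (the tag name is a placeholder):
```bash
# Binary search over commits between a known-good and a known-bad state
git bisect start
git bisect bad HEAD            # current commit exhibits the issue
git bisect good v1.4.2         # last version known to work (tag is a placeholder)
# git now checks out a midpoint commit; test it, then mark it:
#   git bisect good    (or)    git bisect bad
# repeat until git names the first bad commit, then clean up:
git bisect reset
```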
### Strategy 3: Compare Working vs. Broken
- [ ] What's different between working and broken?
- [ ] Compare configs: [Working vs. broken]
- [ ] Compare code versions: [Working vs. broken]
- [ ] Compare environments: [Working vs. broken]
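A minimal sketch of the config comparison, assuming SSH access to both hosts; hostnames and the config path are placeholders:
```bash
# Diff the config on a known-good host against the broken one
ssh good-host   "cat /etc/myapp/config.yaml" > /tmp/good.yaml
ssh broken-host "cat /etc/myapp/config.yaml" > /tmp/broken.yaml
diff -u /tmp/good.yaml /tmp/broken.yaml
```
The same pattern works for comparing installed package versions (`dpkg -l`, `pip freeze`) between the two hosts.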
### Strategy 4: Check Logs Systematically
- [ ] Start from when issue began
- [ ] Follow log trail chronologically
- [ ] Look for error patterns
- [ ] Trace request flow through logs
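Tracing one request end to end is often the fastest way to see where it stops; the request ID and log paths below are placeholders:
```bash
# Follow a single request across services by its request/correlation ID
grep -h "req-7f3a9c" /var/log/api.log /var/log/worker.log /var/log/db-proxy.log | sort
# (-h drops filenames so sort gives chronological order, assuming lines start with timestamps)
```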
### Strategy 5: Use Debugging Tools
- [ ] Debugger: [Breakpoints, step through]
- [ ] Profilers: [Performance analysis]
- [ ] Tracing: [Distributed tracing]
- [ ] Monitoring: [Real-time metrics]
---
## Debugging Common Issues
### Issue: Service Not Responding
**Investigation Steps**:
1. [ ] Check if service is running: [Command]
2. [ ] Check service logs: [Command]
3. [ ] Check resource usage: [CPU/Memory]
4. [ ] Check network connectivity: [Ping/curl]
5. [ ] Check health endpoint: [URL]
6. [ ] Check recent deployments: [When]
7. [ ] Check dependencies: [Are they up?]
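Filled in with example commands; the service name, port, and health URL are assumptions:
```bash
# Triage for an unresponsive service -- names, port, and URL are placeholders
systemctl status myapp --no-pager             # is the process running?
journalctl -u myapp -n 100 --no-pager         # last 100 log lines: crash? OOM kill? config error?
top -b -n 1 | head -15                        # CPU/memory snapshot
curl -sS -m 5 http://localhost:8080/healthz   # does the health endpoint answer within 5 seconds?
```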
**Common Causes**:
- Service crashed: [Check logs for crash]
- Resource exhaustion: [Check CPU/memory]
- Configuration error: [Check configs]
- Dependency down: [Check dependencies]
### Issue: High Error Rate
**Investigation Steps**:
1. [ ] Check error logs: [Most common errors]
2. [ ] Identify error patterns: [What errors?]
3. [ ] Check when errors started: [Timeline]
4. [ ] Check recent changes: [Deployments/configs]
5. [ ] Check related metrics: [Request rate, latency]
6. [ ] Check database: [Query performance]
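One way to surface the dominant failure mode is to group error lines by message; the log path and the "timestamp-first" line format are assumptions:
```bash
# Rank error messages by frequency (strips the leading date/time fields before grouping)
grep -i error /var/log/app.log \
  | sed -E 's/^[0-9-]+ [0-9:.,]+ *//' \
  | sort | uniq -c | sort -rn | head -10
```
If one message dominates, its first occurrence in the log usually pins down when the problem started.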
**Common Causes**:
- Code bug: [Recent deployment]
- Database issue: [Connection/query problems]
- Dependency issue: [External service down]
- Resource issue: [Out of capacity]
### Issue: Slow Performance
**Investigation Steps**:
1. [ ] Check response times: [Current vs. baseline]
2. [ ] Check resource usage: [CPU/Memory/DB]
3. [ ] Check database queries: [Slow queries]
4. [ ] Check network latency: [Network metrics]
5. [ ] Check request rate: [Traffic spike?]
6. [ ] Check recent changes: [Deployments]
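`curl`'s timing variables give a quick breakdown of where a slow request spends its time; the URL is a placeholder:
```bash
# Where does the time go? DNS, TCP connect, TLS, server think-time, total
curl -sS -o /dev/null \
  -w "dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n" \
  https://myapp.example.com/api/items
```
A large gap between connect/tls and ttfb points at the application or database rather than the network.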
**Common Causes**:
- N+1 queries: [Check database queries]
- Resource exhaustion: [CPU/Memory]
- Slow dependency: [External service]
- Traffic spike: [More requests than capacity]
### Issue: Data Inconsistency
**Investigation Steps**:
1. [ ] Identify scope: [What data is affected?]
2. [ ] Check data integrity: [Database checks]
3. [ ] Check recent changes: [Data migrations/writes]
4. [ ] Check application logic: [How data is written]
5. [ ] Check replication: [If replicated]
6. [ ] Review recent operations: [What changed data?]
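If the data lives in PostgreSQL with streaming replication, replication lag (step 5) can be checked directly on the replica; the host, user, and database names are placeholders and the query assumes PostgreSQL:
```bash
# How far behind is the replica? (PostgreSQL; connection details are placeholders)
psql -h replica.internal -U app -d appdb \
  -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
```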
**Common Causes**:
- Bug in application logic: [Check code]
- Data migration issue: [Check migrations]
- Replication lag: [Check replication]
- Concurrent updates: [Race condition]
---
## Effective Debugging Practices
### Time Management
- [ ] Set time limits for investigation phases
- [ ] Escalate if stuck after [X] minutes
- [ ] Don't go down a rabbit hole on a single theory
- [ ] Take breaks if frustrated
### Documentation
- [ ] Document what you've checked
- [ ] Document hypotheses tested
- [ ] Document findings
- [ ] Document resolution steps
### Collaboration
- [ ] Ask for help when stuck
- [ ] Share findings with team
- [ ] Use pair debugging if helpful
- [ ] Escalate appropriately
### Verification
- [ ] Verify fix actually resolves issue
- [ ] Test related functionality
- [ ] Monitor metrics after fix
- [ ] Confirm no regressions
---
## Debugging Checklist
**Before Starting Investigation**:
- [ ] Understand the problem clearly
- [ ] Gather initial information
- [ ] Set up monitoring/logging access
- [ ] Prepare debugging tools
**During Investigation**:
- [ ] Follow systematic approach
- [ ] Document findings
- [ ] Test hypotheses
- [ ] Don't make assumptions
- [ ] Ask for help when needed
**After Finding Root Cause**:
- [ ] Verify root cause
- [ ] Implement fix
- [ ] Verify fix works
- [ ] Document solution
- [ ] Update runbooks
---
## Escalation Criteria
**When to Escalate**:
- [ ] No progress after [X] minutes
- [ ] Issue exceeds your expertise
- [ ] Need access/permissions you don't have
- [ ] Issue affecting critical systems
- [ ] Customer impact severe
**How to Escalate**:
- [ ] Summarize what you've found
- [ ] Share investigation steps taken
- [ ] Explain what you need help with
- [ ] Provide context and evidence
- [ ] Include customer impact
---
## Common Mistakes to Avoid
**Don't**:
- [ ] Assume you know the cause without evidence
- [ ] Skip systematic investigation
- [ ] Make changes without understanding impact
- [ ] Ignore obvious checks
- [ ] Work in isolation when stuck
- [ ] Forget to document findings
- [ ] Forget to verify fixes
**Do**:
- [ ] Follow systematic approach
- [ ] Gather evidence before acting
- [ ] Document everything
- [ ] Ask for help when needed
- [ ] Verify your fixes
- [ ] Learn from incidents