Create DevOps Runbook
Develop operational runbooks for common tasks, troubleshooting procedures, and standard operational procedures for DevOps teams.
v3
Last updated: November 6, 2025
management
Engineering Manager
runbook
devops
# Create DevOps Runbook
Act as an Engineering Manager creating a DevOps runbook for operational procedures.
## Runbook Context
- **Service/System**: [Name]
- **Environment**: [Production/Staging/Development]
- **Owner**: [Team/person]
- **Last Updated**: [Date]
## Runbook Structure
### 1. Service Overview
**Purpose**: [What this service does]
**Key Components**:
- [ ] [Component 1]: [Description]
- [ ] [Component 2]: [Description]
- [ ] [Component 3]: [Description]
**Dependencies**:
- [ ] [Dependency 1]: [How it's used]
- [ ] [Dependency 2]: [How it's used]
**Architecture**:
- [Diagram or description]
- [Data flow]
- [Key integrations]
---
### 2. Health Checks
**Service Health Endpoint**:
```
GET /health
Expected Response: {"status": "healthy"}
```
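As a minimal sketch, the check above could be scripted as follows, assuming the service returns JSON in exactly the shape shown. The helper names `is_healthy` and `check_health`, the `base_url` parameter, and the 5-second timeout are illustrative assumptions, not part of any specific service.

```python
import json
import urllib.request


def is_healthy(body: str) -> bool:
    """Return True only if body is a JSON object with status == "healthy"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def check_health(base_url: str, timeout: float = 5.0) -> bool:
    """GET <base_url>/health and evaluate the response (hypothetical helper)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and is_healthy(body)
    except (OSError, ValueError):
        # Connection failures, HTTP errors, timeouts, and malformed URLs
        # are all treated as "unhealthy" in this sketch.
        return False
```

Calling `check_health("https://[service-host]")` with the real host filled in returns a single boolean that a cron job or deployment script can act on.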
**Key Health Indicators**:
- [ ] API response time: [Target < X ms]
- [ ] Error rate: [Target < X%]
- [ ] Database connectivity: [Check]
- [ ] External dependencies: [Check]
**Monitoring Dashboards**:
- [ ] [Dashboard URL] - [What it shows]
- [ ] [Dashboard URL] - [What it shows]
**Alert Thresholds**:
- [ ] Error rate > [X]%: [Alert]
- [ ] Response time > [X]ms: [Alert]
- [ ] CPU > [X]%: [Alert]
- [ ] Memory > [X]%: [Alert]
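The thresholds above can be evaluated mechanically once the `[X]` placeholders are filled in. The following is a sketch only: the metric names and the numbers in `EXAMPLE_THRESHOLDS` are made-up placeholders, and a real setup would normally express these rules in the monitoring system itself.

```python
def breached_thresholds(
    metrics: dict[str, float], thresholds: dict[str, float]
) -> list[str]:
    """Return the names of metrics strictly above their threshold, sorted."""
    return sorted(
        name
        for name, limit in thresholds.items()
        if name in metrics and metrics[name] > limit
    )


# Placeholder values for illustration only; substitute the runbook's targets.
EXAMPLE_THRESHOLDS = {
    "error_rate_pct": 1.0,
    "response_time_ms": 500.0,
    "cpu_pct": 80.0,
    "memory_pct": 85.0,
}
```

A metric that is missing from `metrics` is skipped rather than treated as a breach; whether missing data should itself alert is a decision worth recording in the runbook.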
---
### 3. Common Operations
**Deployment Procedure**:
1. [ ] Back up current version
2. [ ] Run pre-deployment checks
3. [ ] Deploy to staging: [Command]
4. [ ] Verify staging deployment
5. [ ] Deploy to production: [Command]
6. [ ] Monitor deployment metrics
7. [ ] Verify production deployment
**Rollback Procedure**:
1. [ ] Identify version to roll back to
2. [ ] Execute rollback: [Command]
3. [ ] Verify rollback success
4. [ ] Monitor system health
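The deployment and rollback procedures above share one control flow: deploy, verify, and roll back if either step fails. A minimal sketch of that flow is shown here, assuming the actual `[Command]` placeholders are wrapped in callables. The function name and return values are hypothetical.

```python
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[], None],
    verify: Callable[[], bool],
    rollback: Callable[[], None],
) -> str:
    """Run deploy, then verify; roll back if either step fails.

    Returns "deployed" on success and "rolled_back" otherwise.
    """
    try:
        deploy()
        if verify():
            return "deployed"
    except Exception as exc:  # any deploy or verify error triggers rollback
        print(f"deployment failed: {exc}")
    rollback()
    return "rolled_back"
```

Keeping the three steps as injected callables lets the same wrapper be exercised against staging first and production second, matching the order in the procedure.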
**Scaling Procedures**:
- **Scale Up**: [Commands/Steps]
- **Scale Down**: [Commands/Steps]
- **Auto-scaling**: [Configuration]
**Backup Procedures**:
- **Manual Backup**: [Commands]
- **Backup Verification**: [Commands]
- **Backup Restoration**: [Commands]
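One common form of backup verification is comparing a checksum recorded at backup time against the file as it exists now. The sketch below assumes the backup is a single file and that a SHA-256 digest was stored alongside it; both are assumptions, and the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(path: Path, expected_sha256: str) -> bool:
    """Compare the file's digest with the recorded one, ignoring case."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

A matching checksum only shows the file is unchanged since the digest was taken. It does not prove the backup can be restored, so a periodic test restore is still needed.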
---
### 4. Troubleshooting Guide
**Issue: High Error Rate**
- [ ] Check error logs: [Command]
- [ ] Review recent deployments
- [ ] Check dependency health
- [ ] Review system metrics
- [ ] Solution: [Common fixes]
**Issue: Slow Response Times**
- [ ] Check CPU/Memory usage
- [ ] Review database query performance
- [ ] Check network latency
- [ ] Review recent changes
- [ ] Solution: [Common fixes]
**Issue: Service Unavailable**
- [ ] Check service status: [Command]
- [ ] Review infrastructure status
- [ ] Check logs for errors
- [ ] Verify dependencies
- [ ] Solution: [Common fixes]
**Issue: Database Connection Errors**
- [ ] Check database status
- [ ] Verify connection strings
- [ ] Check network connectivity
- [ ] Review connection pool settings
- [ ] Solution: [Common fixes]
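For the "Check network connectivity" step, a quick way to separate network problems from database problems is a plain TCP connection attempt. This is a minimal sketch: `can_connect` is a hypothetical helper, and the host, port, and timeout values would come from the service's own configuration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts, and DNS failures all land here.
        return False
```

A `True` result confirms only that the port is reachable. Authentication, connection string, and pool exhaustion issues still need the other checks in the list.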
---
### 5. Emergency Procedures
**Service Down - Emergency Response**:
1. [ ] Acknowledge incident
2. [ ] Notify team: [Method]
3. [ ] Check service status: [Command]
4. [ ] Review recent deployments
5. [ ] Execute rollback if needed: [Command]
6. [ ] Monitor recovery
**Data Corruption - Emergency Response**:
1. [ ] Stop data writes
2. [ ] Assess corruption scope
3. [ ] Restore from backup: [Command]
4. [ ] Verify data integrity
5. [ ] Resume operations
**Security Incident - Emergency Response**:
1. [ ] Isolate affected systems
2. [ ] Notify security team
3. [ ] Preserve evidence
4. [ ] Assess impact
5. [ ] Deploy patches if needed
---
### 6. Maintenance Tasks
**Regular Maintenance**:
- **Daily**: [Tasks]
- **Weekly**: [Tasks]
- **Monthly**: [Tasks]
- **Quarterly**: [Tasks]
**Log Rotation**:
- [ ] Configuration: [Location]
- [ ] Retention: [Duration]
- [ ] Rotation: [Frequency]
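If the service logs through Python's standard `logging` module, the rotation and retention settings above might map onto a `TimedRotatingFileHandler` roughly as follows. This is a sketch under that assumption; the daily rotation, the 14-day default, and the `build_rotating_handler` name are placeholders, and many teams use an external tool for rotation instead.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def build_rotating_handler(
    log_path: str, retention_days: int = 14
) -> TimedRotatingFileHandler:
    """Rotate at midnight and keep `retention_days` rotated files."""
    handler = TimedRotatingFileHandler(
        log_path,
        when="midnight",             # Rotation: daily
        interval=1,
        backupCount=retention_days,  # Retention: older files are deleted
        delay=True,                  # do not open the file until first write
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    return handler
```

The handler is attached with `logging.getLogger().addHandler(build_rotating_handler("[log path]"))` once the real path is known.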
**Certificate Renewal**:
- [ ] Certificates: [List]
- [ ] Renewal process: [Steps]
- [ ] Monitoring: [How to monitor]
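Certificate monitoring usually reduces to one number: days remaining before expiry. The sketch below computes it from a `notAfter` string in the format returned by `ssl.SSLSocket.getpeercert()`. The function name is hypothetical, and any alerting window (for example 30 days) is an assumption to be set per certificate.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days from `now` until the certificate's notAfter timestamp.

    `not_after` looks like "Jan  1 00:00:00 2030 GMT"; `now` must be
    timezone-aware. A negative result means the certificate has expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the calculation easy to test against fixed dates.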
**Database Maintenance**:
- [ ] Backup verification: [Schedule]
- [ ] Index optimization: [Schedule]
- [ ] Vacuum/cleanup: [Schedule]
---
### 7. Access & Permissions
**Required Access**:
- [ ] [Service/System]: [Access level]
- [ ] [Service/System]: [Access level]
**Access Request Process**:
- [ ] Request via [Method]
- [ ] Approval required from [Role]
- [ ] Access granted via [Method]
**SSH/Remote Access**:
- [ ] Jump host: [Hostname]
- [ ] SSH command: [Command]
- [ ] Key management: [Process]
---
### 8. Documentation & Resources
**Documentation Links**:
- [ ] Architecture docs: [URL]
- [ ] API docs: [URL]
- [ ] Deployment guide: [URL]
**Related Runbooks**:
- [ ] [Related runbook]: [Link]
- [ ] [Related runbook]: [Link]
**Contact Information**:
- [ ] On-call engineer: [Contact]
- [ ] Team lead: [Contact]
- [ ] Escalation: [Contact]
---
## Runbook Best Practices
**Keep Updated**:
- Review quarterly
- Update after incidents
- Update after major changes
**Test Procedures**:
- Test runbook procedures regularly
- Verify commands work
- Update outdated steps
**Clear & Concise**:
- Use step-by-step format
- Include exact commands
- Provide context for decisions