Incident Response Runbook
Last Updated: 2026-01-18
Version: 1.0
Overview
This runbook provides procedures for responding to incidents in Cesivi Server.
Related:
- _docs/SLO_DEFINITIONS.md
- monitoring/prometheus/alerts.yml
Incident Classification
| Severity | Response Time | Examples |
|---|---|---|
| P0 - Critical | < 15 min | Service down |
| P1 - High | < 1 hour | High error rate |
| P2 - Medium | < 4 hours | Cache issues |
1. High Error Rate
Alert: HighErrorRate (>1% for 5 min)
Severity: P1 High
Symptoms
- Error rate >1% in Grafana
- Users reporting 500 errors
- Error logs with correlation IDs
Investigation
- Check error patterns:
  grep '"Level":"Error"' MockData/Logs/Server/cesivi-*.log | tail -50
- Check database:
  dotnet run --project Cesivi.Cli health
- Check cache (if using Redis):
  redis-cli ping
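To put a rough number on the error patterns found above, the share of Error-level lines in a log stream can be computed with a small helper. This is a minimal sketch: `error_rate_percent` is a hypothetical function name, and it measures the fraction of log lines at Error level, which is only a proxy for the request error rate that the HighErrorRate alert tracks.

```shell
# Hypothetical helper: reads structured log lines on stdin and prints the
# percentage of lines that contain "Level":"Error" (two decimal places).
error_rate_percent() {
  awk '
    /"Level":"Error"/ { errors++ }
    { total++ }
    END {
      # Avoid dividing by zero when the stream is empty.
      if (total == 0) { print "0.00"; exit }
      printf "%.2f\n", (errors * 100) / total
    }
  '
}
```

Possible usage, assuming the log path from the grep command above: `cat MockData/Logs/Server/cesivi-*.log | error_rate_percent`.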
Resolution
- Database down: Restart database, verify health
- Cache down: Restart Redis, clear cache
- Code error: Rollback deployment
- Load spike: Scale horizontally
Escalation
- Unresolved after 15 min → Senior engineer
- Data corruption → Database admin
2. High Latency
Alert: HighLatency (P95 >500ms)
Severity: P1 High
Symptoms
- P95 latency >500ms
- Slow responses
- Timeouts
Investigation
- Check latency: Via Prometheus or Grafana
- Identify slow endpoints
- Check database query performance
- Check cache hit rate
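When Prometheus or Grafana is unavailable, a P95 figure can be approximated from raw latency samples. The sketch below uses the nearest-rank method on one value per line from stdin. `p95` is a hypothetical helper, and how the latency values are extracted from Cesivi Server logs is left open because the log field names are not given here.

```shell
# Hypothetical helper: prints the 95th percentile (nearest-rank method)
# of numeric values supplied one per line on stdin.
p95() {
  sort -n | awk '
    { values[NR] = $1 }
    END {
      if (NR == 0) exit 1
      # rank = ceil(0.95 * N), computed with integer arithmetic
      rank = int((NR * 95 + 99) / 100)
      print values[rank]
    }
  '
}
```

A result above 500 (with values in milliseconds) would match the HighLatency alert threshold.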
Resolution
- Slow queries: Add indexes, optimize CAML
- Low cache hit: Increase cache size/TTL
- Resource contention: Scale horizontally
3. Service Down
Alert: ServiceDown
Severity: P0 Critical
Symptoms
- Service unreachable
- Health check failing
- No response
Investigation
- Check service:
  systemctl status cesivi # Systemd
  docker ps # Docker
  kubectl get pods # K8s
- Check logs:
  journalctl -u cesivi -n 100
- Check resources:
  free -h
  df -h
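The disk check above can be narrowed to the filesystems that are actually close to full. This is a minimal sketch that parses `df -P` output from stdin. `mounts_over_threshold` is a hypothetical helper, the threshold is an assumption to tune per host, and mount points containing spaces are not handled.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints each mount
# point whose usage percentage is strictly above the given threshold.
mounts_over_threshold() {
  threshold="$1"
  awk -v limit="$threshold" '
    NR > 1 {                 # skip the header row
      used = $5              # capacity column, e.g. "95%"
      sub(/%/, "", used)
      if (used + 0 > limit + 0) print $6 " " used "%"
    }
  '
}
```

Possible usage: `df -P | mounts_over_threshold 90`.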
Resolution
- Crashed: Restart service, check logs
- Out of memory: Increase limits, restart
- Disk full: Clean logs, verify space
- Port conflict: Stop conflicting process or change port
4. Database Connection Failures
Severity: P0 Critical
Investigation
- Test connectivity: connect with psql or sqlcmd
- Check service:
  systemctl status postgresql
- Check pool: review the connection pool logs
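Before opening a database client, it can help to confirm that the database port accepts TCP connections at all, which separates a stopped or unreachable server from a credentials or pool problem. The sketch below relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command. `port_open` is a hypothetical helper, and the host and port in the usage line are assumptions (5432 is the PostgreSQL default).

```shell
# Hypothetical helper: returns 0 if a TCP connection to host:port succeeds
# within 3 seconds, non-zero otherwise. Requires bash and coreutils timeout.
port_open() {
  host="$1"
  port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}
```

Possible usage: `port_open localhost 5432 && echo "port reachable" || echo "port closed or filtered"`.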
Resolution
- Service down: Start database
- Wrong credentials: Update connection string
- Pool exhausted: Increase pool size
5. Cache Failures (Redis)
Severity: P2 Medium
Investigation
- Check Redis:
  systemctl status redis
- Test:
  redis-cli ping
- Check memory:
  redis-cli info memory
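The memory check above returns many fields. The two that matter for a "memory full" diagnosis are `used_memory` and `maxmemory`. The following sketch turns them into a usage percentage. `redis_memory_percent` is a hypothetical helper; a `maxmemory` of 0 means no limit is configured, so no percentage applies.

```shell
# Hypothetical helper: reads `redis-cli info memory` output on stdin and
# prints used_memory as a percentage of maxmemory (one decimal place),
# or "unlimited" when maxmemory is 0 or absent.
redis_memory_percent() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "unlimited"; exit }
      printf "%.1f\n", (used * 100) / max
    }
  '
}
```

Possible usage: `redis-cli info memory | redis_memory_percent`.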
Resolution
- Redis down: Restart Redis
- Memory full: Flush expired keys, increase maxmemory
- Failover: Service continues with InMemory (degraded)
Recovery Procedures
Service Restart
systemctl restart cesivi # Systemd
docker-compose restart # Docker
kubectl rollout restart deployment/cesivi-server # K8s
Restore from Backup
# 1. Stop service
# 2. Restore files/database from backup
# 3. Restart service
# 4. Verify: curl http://localhost:5001/health
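A service often needs a few seconds after a restart or restore before its health endpoint responds, so a single curl call can fail spuriously. One way to handle this is a small retry loop, sketched below. `wait_until` is a hypothetical helper, and the attempt count and delay are assumptions.

```shell
# Hypothetical helper: runs a command up to <attempts> times, sleeping
# <delay> seconds after each failure. Returns 0 on the first success and
# 1 if every attempt fails.
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

Possible usage with the health URL from step 4: `wait_until 30 2 curl -fsS http://localhost:5001/health`.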
Rollback Deployment
git revert <commit> # Git
docker pull cesivi:previous-tag # Docker
kubectl rollout undo deployment/cesivi-server # K8s
Escalation
| Scenario | Time | Escalate To |
|---|---|---|
| P0 unresolved | 15 min | Incident Commander |
| Data loss | Immediate | Senior DBA |
| P1 unresolved | 1 hour | Senior Engineer |
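The table above can be encoded as a lookup for use in on-call tooling. This is a sketch under stated assumptions: `escalation_target` and the `data-loss` keyword are hypothetical names, and elapsed time is passed in whole minutes.

```shell
# Hypothetical helper: prints who to escalate to for a given scenario and
# number of minutes unresolved, or prints nothing if no escalation is due.
escalation_target() {
  scenario="$1"
  elapsed_min="${2:-0}"
  case "$scenario" in
    data-loss) echo "Senior DBA" ;;   # immediate, regardless of elapsed time
    P0) [ "$elapsed_min" -ge 15 ] && echo "Incident Commander" ;;
    P1) [ "$elapsed_min" -ge 60 ] && echo "Senior Engineer" ;;
  esac
  return 0
}
```

For example, `escalation_target P0 20` prints `Incident Commander`, while `escalation_target P1 30` prints nothing.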
Process
- Gather context (time, impact, steps taken, correlation IDs)
- Escalate via PagerDuty/Email/Slack
- Provide updates (every 15 min for P0, 30 min for P1)
Postmortem Template
After P0/P1 incidents:
# Incident: [Title]
Date: YYYY-MM-DD
Duration: X hours
Severity: P0/P1
## Timeline
- HH:MM - Started
- HH:MM - Detected
- HH:MM - Resolved
## Root Cause
[Analysis]
## Resolution
[Steps]
## Action Items
1. Prevent: [Action]
2. Detect: [Action]
3. Respond: [Action]
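To avoid retyping the template after each incident, it can be generated with the title, severity, and date filled in. A minimal sketch follows. `new_postmortem` is a hypothetical helper, and the date defaults to today when omitted.

```shell
# Hypothetical helper: prints the postmortem template with the title,
# severity, and date (YYYY-MM-DD, defaulting to today) filled in.
new_postmortem() {
  title="$1"
  severity="$2"
  incident_date="${3:-$(date +%F)}"
  cat <<EOF
# Incident: $title
Date: $incident_date
Duration: X hours
Severity: $severity
## Timeline
- HH:MM - Started
- HH:MM - Detected
- HH:MM - Resolved
## Root Cause
[Analysis]
## Resolution
[Steps]
## Action Items
1. Prevent: [Action]
2. Detect: [Action]
3. Respond: [Action]
EOF
}
```

Possible usage: `new_postmortem "Redis failover" P1 > postmortem.md`, where the output file name is an assumption.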
Created: 2026-01-18
Author: PLAN-152 Phase 3.3