Incident Response Runbook

Last Updated: 2026-01-18
Version: 1.0

Overview

This runbook provides procedures for responding to incidents in Cesivi Server.

Related:
- _docs/SLO_DEFINITIONS.md
- monitoring/prometheus/alerts.yml

Incident Classification

Severity	Response Time	Examples
P0 - Critical	< 15 min	Service down
P1 - High	< 1 hour	High error rate
P2 - Medium	< 4 hours	Cache issues

1. High Error Rate

Alert: HighErrorRate (>1% for 5 min)
Severity: P1 High

Symptoms

Error rate >1% in Grafana
Users reporting 500 errors
Error logs with correlation IDs

Investigation

Check error patterns: grep '"Level":"Error"' MockData/Logs/Server/cesivi-*.log | tail -50
Check database: dotnet run --project Cesivi.Cli health
Check cache: redis-cli ping (if using Redis)
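
To put a rough number on the error pattern, the grep above can be extended into a small summary. This is a minimal sketch that assumes the logs hold one JSON entry per line with a "Level" field, as the grep pattern suggests. `error_summary` is a hypothetical helper name, and the share of Error-level log lines is only a proxy for the HTTP error rate that the HighErrorRate alert measures.

```shell
# Sketch: summarize Error-level entries in JSON-lines logs read from stdin.
# Assumption: one log entry per line, with "Level":"Error" marking errors.
error_summary() {
  awk '
    /"Level":"Error"/ { errors++ }   # count error entries
    { total++ }                      # count every entry
    END {
      rate = (total > 0) ? 100 * errors / total : 0
      printf "errors=%d total=%d rate=%.2f%%\n", errors, total, rate
    }'
}
```

Possible usage: cat MockData/Logs/Server/cesivi-*.log | error_summary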

Resolution

Database down: Restart database, verify health
Cache down: Restart Redis, clear cache
Code error: Rollback deployment
Load spike: Scale horizontally

Escalation

15 min unresolved → Senior engineer
Data corruption → Database admin

2. High Latency

Alert: HighLatency (P95 >500ms)
Severity: P1 High

Symptoms

P95 latency >500ms
Slow responses
Timeouts

Investigation

Check latency: Via Prometheus or Grafana
Identify slow endpoints
Check database query performance
Check cache hit rate
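
The latency check can also be run from a terminal against the Prometheus HTTP API. The sketch below is illustrative only: the metric name `http_request_duration_seconds_bucket`, the `PROMETHEUS_URL` variable, and the default address `http://localhost:9090` are assumptions, not values taken from this deployment. Substitute the histogram that the HighLatency rule in monitoring/prometheus/alerts.yml actually uses.

```shell
# Sketch: build and run a P95 latency query against Prometheus.
# Assumptions: histogram metric name and Prometheus address are hypothetical.

# Print a PromQL expression for the 95th percentile over a time window.
p95_query() {
  local metric="${1:-http_request_duration_seconds_bucket}"
  local window="${2:-5m}"
  printf 'histogram_quantile(0.95, sum by (le) (rate(%s[%s])))\n' "$metric" "$window"
}

# Send the expression to the instant-query endpoint and print the JSON reply.
query_p95() {
  local base_url="${PROMETHEUS_URL:-http://localhost:9090}"
  curl -fsS -G "${base_url}/api/v1/query" \
    --data-urlencode "query=$(p95_query "$@")"
}
```

If the histogram is recorded in seconds, a result above 0.5 corresponds to the 500ms alert threshold.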

Resolution

Slow queries: Add indexes, optimize CAML
Low cache hit: Increase cache size/TTL
Resource contention: Scale horizontally

3. Service Down

Alert: ServiceDown
Severity: P0 Critical

Symptoms

Service unreachable
Health check failing
No response

Investigation

Check service: systemctl status cesivi / docker ps / kubectl get pods
Check logs: journalctl -u cesivi -n 100
Check resources: free -h, df -h
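
When scanning `df` output under pressure, it can help to filter for the file systems that are actually close to full. The following is a minimal sketch; `disks_over` is a hypothetical helper, the 90% default is an arbitrary assumption, and mount points containing spaces are not handled.

```shell
# Sketch: list mount points at or above a usage threshold.
# Reads `df -P` output on stdin; $1 is the threshold in percent (default 90).
disks_over() {
  awk -v t="${1:-90}" '
    NR > 1 {                      # skip the header row
      use = $5                    # Capacity column, e.g. "95%"
      sub(/%/, "", use)
      if (use + 0 >= t) print $6, use "%"
    }'
}
```

Possible usage: df -P | disks_over 90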

Resolution

Crashed: Restart service, check logs
Out of memory: Increase limits, restart
Disk full: Clean logs, verify space
Port conflict: Stop conflicting process or change port

4. Database Connection Failures

Severity: P0 Critical

Investigation

Test connectivity: psql / sqlcmd
Check service: systemctl status postgresql
Check pool: Check connection pool logs
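
For a PostgreSQL backend, `pg_isready` gives a quick connectivity answer without needing credentials. The sketch below assumes PostgreSQL client tools are installed and that the database listens on localhost:5432; both are assumptions, and the helper names are hypothetical. It does not apply to SQL Server, where `sqlcmd` would be used instead.

```shell
# Sketch: translate the documented pg_isready exit codes into a message.
describe_pg_status() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections (for example, still starting up)" ;;
    2) echo "no response" ;;
    3) echo "no attempt made (check parameters)" ;;
    *) echo "unknown status $1" ;;
  esac
}

# Probe the server and report the result; host and port are assumed defaults.
check_postgres() {
  local host="${1:-localhost}"
  local port="${2:-5432}"
  local status=0
  pg_isready -h "$host" -p "$port" >/dev/null 2>&1 || status=$?
  describe_pg_status "$status"
  return "$status"
}
```

A "no response" result points at the service or network, while "rejecting connections" usually means the server is up but not yet ready.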

Resolution

Service down: Start database
Wrong credentials: Update connection string
Pool exhausted: Increase pool size

5. Cache Failures (Redis)

Severity: P2 Medium

Investigation

Check Redis: systemctl status redis
Test: redis-cli ping
Check memory: redis-cli info memory
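
The memory check is easier to read as a percentage of the configured limit. This sketch parses the `used_memory` and `maxmemory` fields from `redis-cli info memory`; `redis_memory_percent` is a hypothetical helper name. A `maxmemory` of 0 means no limit is configured, so no percentage is printed in that case.

```shell
# Sketch: report Redis memory use as a percentage of maxmemory.
# Reads `redis-cli info memory` output on stdin (lines end in CRLF).
redis_memory_percent() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { limit = $2 }
    END {
      if (limit + 0 == 0) { print "maxmemory unlimited"; exit }
      printf "%d\n", 100 * used / limit
    }'
}
```

Possible usage: redis-cli info memory | redis_memory_percent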

Resolution

Redis down: Restart Redis
Memory full: Flush expired keys, increase maxmemory
Failover: Service continues with InMemory (degraded)

Recovery Procedures

Service Restart

systemctl restart cesivi          # Systemd
docker-compose restart            # Docker
kubectl rollout restart deployment/cesivi-server  # K8s

Restore from Backup

# 1. Stop service
# 2. Restore files/database from backup
# 3. Restart service
# 4. Verify: curl http://localhost:5001/health
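
A freshly restarted service may need a few seconds before the health endpoint responds, so the verification step can be wrapped in a retry loop. This is a minimal sketch: `wait_for_health` is a hypothetical helper, and the attempt count and delay are arbitrary assumptions. It treats any successful HTTP status as healthy and does not inspect the response body.

```shell
# Sketch: poll the health endpoint until it answers or attempts run out.
# Arguments: URL, number of attempts, seconds between attempts.
wait_for_health() {
  local url="${1:-http://localhost:5001/health}"
  local attempts="${2:-10}"
  local delay="${3:-3}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors such as 503
    if curl -fsS -o /dev/null "$url"; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after ${attempts} attempts" >&2
  return 1
}
```

Possible usage: wait_for_health http://localhost:5001/health 10 3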

Rollback Deployment

git revert <commit>              # Git
docker pull cesivi:previous-tag  # Docker
kubectl rollout undo deployment/cesivi-server  # K8s

Escalation

Scenario	Time	Escalate To
P0 unresolved	15 min	Incident Commander
Data loss	Immediate	Senior DBA
P1 unresolved	1 hour	Senior Engineer

Process

Gather context (time, impact, steps taken, correlation IDs)
Escalate via PagerDuty/Email/Slack
Provide updates (every 15 min for P0, 30 min for P1)
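
The context listed above can be collected into a consistent message before paging someone. The sketch below is one possible shape, not an established tool: both function names and the message layout are hypothetical, and only the update intervals (15 min for P0, 30 min for P1) come from this runbook.

```shell
# Sketch: map a severity to the update cadence defined in this runbook.
update_interval_minutes() {
  case "$1" in
    P0) echo 15 ;;
    P1) echo 30 ;;
    *)  echo "no update cadence defined for severity '$1'" >&2; return 1 ;;
  esac
}

# Print an escalation summary: severity, impact, steps taken, correlation IDs.
escalation_summary() {
  local severity="$1"
  local impact="$2"
  local steps="$3"
  local correlation_ids="$4"
  local interval
  interval="$(update_interval_minutes "$severity")" || return 1
  printf 'Severity: %s\n' "$severity"
  printf 'Time (UTC): %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf 'Impact: %s\n' "$impact"
  printf 'Steps taken: %s\n' "$steps"
  printf 'Correlation IDs: %s\n' "$correlation_ids"
  printf 'Next update in: %s min\n' "$interval"
}
```

Possible usage: escalation_summary P0 "Service unreachable" "Restarted service" "<correlation-id>"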

Postmortem Template

After P0/P1 incidents:

# Incident: [Title]
Date: YYYY-MM-DD
Duration: X hours
Severity: P0/P1

## Timeline
- HH:MM - Started
- HH:MM - Detected
- HH:MM - Resolved

## Root Cause
[Analysis]

## Resolution
[Steps]

## Action Items
1. Prevent: [Action]
2. Detect: [Action]
3. Respond: [Action]

Created: 2026-01-18
Author: PLAN-152 Phase 3.3