Incident Response Runbook

Last Updated: 2026-01-18
Version: 1.0

Overview

This runbook provides procedures for responding to incidents in Cesivi Server.

Related:

  • _docs/SLO_DEFINITIONS.md
  • monitoring/prometheus/alerts.yml

Incident Classification

| Severity      | Response Time | Examples        |
|---------------|---------------|-----------------|
| P0 - Critical | < 15 min      | Service down    |
| P1 - High     | < 1 hour      | High error rate |
| P2 - Medium   | < 4 hours     | Cache issues    |
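
The response-time targets above can be turned into a concrete deadline once an incident has a detection timestamp. The following is a minimal sketch, not part of Cesivi Server; the names `RESPONSE_TARGETS`, `response_deadline`, and `is_breached` are hypothetical and only the durations come from the table.

```python
from datetime import datetime, timedelta

# Durations taken from the classification table; the dict name is hypothetical.
RESPONSE_TARGETS = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=4),
}


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Return the time by which a response should have started."""
    return detected_at + RESPONSE_TARGETS[severity]


def is_breached(severity: str, detected_at: datetime, now: datetime) -> bool:
    """True once the target is missed ("< 15 min" means 15:00 is already late)."""
    return now >= response_deadline(severity, detected_at)
```

For example, a P0 detected at 12:00 has a deadline of 12:15, and the target counts as breached from 12:15 onward.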

1. High Error Rate

Alert: HighErrorRate (>1% for 5 min)
Severity: P1 High

Symptoms

  • Error rate >1% in Grafana
  • Users reporting 500 errors
  • Error logs with correlation IDs
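
To make the alert condition ">1% for 5 min" concrete, here is a small sketch of a sustained-threshold check. It assumes one `(errors, total)` sample per minute, which is an illustrative simplification; the actual rule lives in monitoring/prometheus/alerts.yml and may be evaluated differently. The function names are hypothetical.

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def alert_firing(samples, threshold=0.01, window=5):
    """samples: list of (errors, total) per minute, oldest first.

    Fires only when every one of the last `window` samples is strictly
    above the threshold, mirroring ">1% for 5 min".
    """
    if len(samples) < window:
        return False
    return all(error_rate(e, t) > threshold for e, t in samples[-window:])
```

A single minute at or below 1% within the window keeps the alert from firing, which is why brief spikes do not page anyone.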

Investigation

  1. Check error patterns: grep '"Level":"Error"' MockData/Logs/Server/cesivi-*.log | tail -50
  2. Check database: dotnet run --project Cesivi.Cli health
  3. Check cache: redis-cli ping (if using Redis)
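
When the `grep` output in step 1 is too noisy, grouping the errors can show which failure dominates. The sketch below assumes each log line is a JSON object with a `Level` field, as the grep pattern suggests; the `Message` field used for grouping is an assumption and should be swapped for whatever field the logs actually carry (for example the correlation ID).

```python
import json
from collections import Counter


def summarize_errors(lines, key="Message", top=5):
    """Count Error-level entries grouped by `key`, most frequent first."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or non-JSON lines
        if isinstance(entry, dict) and entry.get("Level") == "Error":
            counts[entry.get(key, "<missing>")] += 1
    return counts.most_common(top)
```

It can be fed with the lines of the files matched by MockData/Logs/Server/cesivi-*.log, for instance via `glob.glob` and `open`.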

Resolution

  • Database down: Restart database, verify health
  • Cache down: Restart Redis, clear cache
  • Code error: Rollback deployment
  • Load spike: Scale horizontally

Escalation

  • 15 min unresolved → Senior engineer
  • Data corruption → Database admin

2. High Latency

Alert: HighLatency (P95 >500ms)
Severity: P1 High

Symptoms

  • P95 latency >500ms
  • Slow responses
  • Timeouts

Investigation

  1. Check latency: View P95 latency in Prometheus or Grafana
  2. Identify slow endpoints
  3. Check database query performance
  4. Check cache hit rate
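
If dashboards are unavailable, steps 1, 2, and 4 can be approximated offline from raw samples. This is a rough sketch using the nearest-rank percentile; the `(endpoint, latency_ms)` input shape and all function names are assumptions, and Prometheus histograms will generally give slightly different numbers.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slow_endpoints(samples, threshold_ms=500.0):
    """samples: iterable of (endpoint, latency_ms).

    Returns {endpoint: p95} for endpoints whose P95 exceeds the threshold.
    """
    by_endpoint = defaultdict(list)
    for endpoint, latency_ms in samples:
        by_endpoint[endpoint].append(latency_ms)
    p95 = {ep: percentile(v, 95) for ep, v in by_endpoint.items()}
    return {ep: v for ep, v in p95.items() if v > threshold_ms}


def cache_hit_rate(hits, misses):
    """Fraction of lookups served from cache; 0.0 when there were none."""
    total = hits + misses
    return hits / total if total else 0.0
```

An endpoint with a single slow outlier among twenty requests stays under the threshold here, since P95 ignores the top 5% of samples.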

Resolution

  • Slow queries: Add indexes, optimize CAML
  • Low cache hit: Increase cache size/TTL
  • Resource contention: Scale horizontally

3. Service Down

Alert: ServiceDown
Severity: P0 Critical

Symptoms

  • Service unreachable
  • Health check failing
  • No response

Investigation

  1. Check service: systemctl status cesivi / docker ps / kubectl get pods
  2. Check logs: journalctl -u cesivi -n 100
  3. Check resources: free -h, df -h
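
The disk part of step 3 can also be scripted so it is easy to run across several mount points. A minimal sketch using the standard library follows; the 90% threshold and the function names are assumptions, not values from this runbook.

```python
import shutil


def used_fraction(path):
    """Fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total if usage.total else 0.0


def low_space_paths(paths, threshold=0.90):
    """Return the paths whose filesystem is at or above the threshold."""
    return [p for p in paths if used_fraction(p) >= threshold]
```

Calling `low_space_paths(["/", "/var/log"])` would list the mount points worth cleaning first; the paths shown are examples only.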

Resolution

  • Crashed: Restart service, check logs
  • Out of memory: Increase limits, restart
  • Disk full: Clean logs, verify space
  • Port conflict: Stop conflicting process or change port

4. Database Connection Failures

Severity: P0 Critical

Investigation

  1. Test connectivity: psql / sqlcmd
  2. Check service: systemctl status postgresql
  3. Check pool: Review connection pool logs
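
A plain TCP check helps separate "database service down" from "wrong credentials" or "pool exhausted": if the port accepts connections but `psql` or `sqlcmd` still fails, the service itself is probably up. The sketch below is a generic reachability probe; the host and port must be supplied by the operator (5432 and 1433 are the usual PostgreSQL and SQL Server defaults, not values confirmed by this runbook).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """True when a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # refused, unreachable, timed out, or DNS failure
```

For example, `can_connect("localhost", 5432)` returning False points to the "Service down" branch below.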

Resolution

  • Service down: Start database
  • Wrong credentials: Update connection string
  • Pool exhausted: Increase pool size

5. Cache Failures (Redis)

Severity: P2 Medium

Investigation

  1. Check Redis: systemctl status redis
  2. Test: redis-cli ping
  3. Check memory: redis-cli info memory

Resolution

  • Redis down: Restart Redis
  • Memory full: Flush expired keys, increase maxmemory
  • Failover: Service continues with InMemory (degraded)
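
To decide whether "Memory full" applies, the output of `redis-cli info memory` can be parsed and compared against the configured limit. This sketch relies on the `used_memory` and `maxmemory` fields that Redis reports in that section; the function names are hypothetical.

```python
def parse_redis_info(text):
    """Parse `redis-cli info` output (key:value lines, '#' headers) into a dict."""
    info = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def memory_usage_fraction(info):
    """used_memory / maxmemory, or None when maxmemory is 0 (no limit set)."""
    used = int(info["used_memory"])
    limit = int(info.get("maxmemory", 0))
    return used / limit if limit else None
```

A result close to 1.0 suggests raising `maxmemory` or reviewing the eviction policy; `None` means Redis has no configured limit and memory pressure would come from the host instead.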

Recovery Procedures

Service Restart

systemctl restart cesivi          # Systemd
docker-compose restart            # Docker
kubectl rollout restart deployment/cesivi-server  # K8s

Restore from Backup

# 1. Stop service
# 2. Restore files/database from backup
# 3. Restart service
# 4. Verify: curl http://localhost:5001/health
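
A service often needs a few seconds after restart before the health endpoint answers, so a single `curl` can give a false negative. The following sketch polls until the check passes; the attempt count, delay, and function names are assumptions, and only the URL comes from step 4.

```python
import http.client
import time
import urllib.request


def http_ok(url, timeout=5.0):
    """True when the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        return False  # connection errors, timeouts, and non-2xx responses


def wait_until_healthy(check, attempts=10, delay_s=3.0, sleep=time.sleep):
    """Call check() up to `attempts` times.

    Returns the 1-based attempt that succeeded, or None if none did.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay_s)
    return None
```

Usage would look like `wait_until_healthy(lambda: http_ok("http://localhost:5001/health"))`; a `None` result means the restore should be treated as failed.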

Rollback Deployment

git revert <commit>                            # Git
docker pull cesivi:previous-tag                # Docker
kubectl rollout undo deployment/cesivi-server  # K8s

Escalation

| Scenario      | Time      | Escalate To        |
|---------------|-----------|--------------------|
| P0 unresolved | 15 min    | Incident Commander |
| Data loss     | Immediate | Senior DBA         |
| P1 unresolved | 1 hour    | Senior Engineer    |

Process

  1. Gather context (time, impact, steps taken, correlation IDs)
  2. Escalate via PagerDuty/Email/Slack
  3. Provide updates (every 15 min for P0, 30 min for P1)
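
The escalation rules and update cadence above can be encoded as a small lookup, which is handy for a chat bot or an on-call checklist. This is an illustrative sketch only; the names are hypothetical and the thresholds are copied from this section.

```python
# (minutes unresolved, escalation target) per severity, from the table above.
ESCALATION = {
    "P0": (15, "Incident Commander"),
    "P1": (60, "Senior Engineer"),
}
# Status update interval in minutes, from step 3 of the process.
UPDATE_INTERVAL_MIN = {"P0": 15, "P1": 30}


def escalation_target(severity, minutes_unresolved, data_loss=False):
    """Return who to escalate to right now, or None if no rule applies yet."""
    if data_loss:
        return "Senior DBA"  # data loss escalates immediately
    rule = ESCALATION.get(severity)
    if rule and minutes_unresolved >= rule[0]:
        return rule[1]
    return None


def update_times(severity, duration_min):
    """Minutes since incident start at which status updates are due."""
    step = UPDATE_INTERVAL_MIN[severity]
    return list(range(step, duration_min + 1, step))
```

For a P1 lasting 90 minutes, updates would be due at 30, 60, and 90 minutes, with escalation to a Senior Engineer at the 60-minute mark.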

Postmortem Template

After P0/P1 incidents:

# Incident: [Title]
Date: YYYY-MM-DD
Duration: X hours
Severity: P0/P1

## Timeline
- HH:MM - Started
- HH:MM - Detected
- HH:MM - Resolved

## Root Cause
[Analysis]

## Resolution
[Steps]

## Action Items
1. Prevent: [Action]
2. Detect: [Action]
3. Respond: [Action]
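
When filling in the Duration and Timeline fields, the intervals can be derived from the three timestamps instead of being estimated by hand. A minimal sketch, assuming timestamps are written as `YYYY-MM-DD HH:MM`; the function and key names are hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"  # assumed timestamp format


def incident_metrics(started, detected, resolved):
    """Return detection and resolution intervals in whole minutes."""
    s, d, r = (datetime.strptime(x, FMT) for x in (started, detected, resolved))
    if not s <= d <= r:
        raise ValueError("expected started <= detected <= resolved")
    return {
        "time_to_detect_min": int((d - s).total_seconds() // 60),
        "time_to_resolve_min": int((r - d).total_seconds() // 60),
        "duration_min": int((r - s).total_seconds() // 60),
    }
```

A long time to detect points at the "Detect" action item, while a long time to resolve points at "Respond".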

Created: 2026-01-18
Author: PLAN-152 Phase 3.3