Service Level Objectives (SLOs)
Last Updated: 2026-01-18
Status: Production SLOs for Cesivi Server
Review Schedule: Quarterly
Overview
This document defines the Service Level Objectives (SLOs) for Cesivi Server. SLOs are internal targets for service performance and reliability that guide operational decisions and alerting thresholds.
Key Concepts:
- SLI (Service Level Indicator): A quantitative measure of service behavior
- SLO (Service Level Objective): Target value or range for an SLI
- Error Budget: The allowed amount of unreliability
Measurement Period: 30-day rolling window
1. Availability SLO
Target
99.9% uptime (30-day rolling window)
Definition
The percentage of requests the service answers without a server error, used as a request-based measure of uptime.
Availability = (successful_requests / total_requests) * 100
Successful requests: HTTP 2xx, 3xx, 4xx (client errors don't count against availability)
Failed requests: HTTP 5xx (server errors)
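The formula above can be sketched in a few lines of Python. This is an illustrative calculation only, not part of Cesivi Server: the function name, the sample counts, and the choice to report 100% when there is no traffic are all assumptions.

```python
def availability_percent(status_counts: dict[int, int]) -> float:
    """Return availability as a percentage, counting only HTTP 5xx as failures."""
    total = sum(status_counts.values())
    if total == 0:
        # Assumption: with no traffic nothing failed, so report full availability.
        return 100.0
    failed = sum(n for status, n in status_counts.items() if 500 <= status <= 599)
    return (total - failed) / total * 100


# Hypothetical request counts per status code for one measurement window.
counts = {200: 9_800, 304: 100, 404: 90, 500: 7, 503: 3}
print(f"Availability: {availability_percent(counts):.2f}%")
```

The 404 responses count as successful here, matching the rule that client errors do not reduce availability.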
Error Budget
0.1% downtime = 43.2 minutes per month
Breakdown:
- Monthly: 43.2 minutes
- Weekly: 10.08 minutes
- Daily: 1.44 minutes
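The breakdown follows directly from the window length. A minimal sketch of the arithmetic, assuming a 30-day month as in the measurement period; the function name is hypothetical:

```python
def error_budget_minutes(slo_percent: float, window_days: float) -> float:
    """Minutes of allowed unavailability for an SLO over a window of whole days."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


# Reproduce the monthly, weekly, and daily figures for the 99.9% target.
for label, days in [("Monthly", 30), ("Weekly", 7), ("Daily", 1)]:
    print(f"{label}: {error_budget_minutes(99.9, days):.2f} minutes")
```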
Prometheus Query
(sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))) * 100
Rationale
99.9% availability is appropriate for a production SharePoint mock server supporting development, testing, and production workflows.
2. Latency SLO
Targets
- P95 < 500ms for list operations
- P95 < 1s for file downloads
- P99 < 2s for all operations
Definition
Time from receiving a request to sending the complete response.
- P95: 95% of requests complete faster than this threshold
- P99: 99% of requests complete faster than this threshold
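To make the percentile definitions concrete, the sketch below computes P95 and P99 from raw latency samples with the nearest-rank method. This is one common definition chosen for illustration; Prometheus estimates quantiles from histogram buckets instead, so the values will not match exactly. The function name and sample data are hypothetical.

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 1 ms, 2 ms, ..., 100 ms.
latencies_ms = [float(ms) for ms in range(1, 101)]
print("P95:", percentile(latencies_ms, 95), "ms")
print("P99:", percentile(latencies_ms, 99), "ms")
```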
Prometheus Queries
List Operations P95:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{endpoint=~".*/_api/web/lists/.*"}[5m])) by (le)) < 0.5
File Downloads P95:
histogram_quantile(0.95, sum(rate(cesivi_file_download_duration_seconds_bucket[5m])) by (le)) < 1.0
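Both queries rely on `histogram_quantile`, which estimates a quantile from cumulative bucket counts rather than from individual requests. The Python sketch below approximates that estimate for classic histograms using linear interpolation within the matching bucket. It is a simplified model for intuition, not the Prometheus implementation, and the bucket boundaries and counts are made up.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Assumes buckets are sorted by upper bound, the last bound is +Inf,
    and at least one finite bucket exists.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Interpolate linearly between the bucket's lower and upper bounds.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for request durations in seconds.
buckets = [(0.1, 50), (0.25, 80), (0.5, 96), (1.0, 99), (math.inf, 100)]
p95 = estimate_quantile(0.95, buckets)
print(f"Estimated P95: {p95:.3f}s, meets 0.5s target: {p95 < 0.5}")
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the 0.5s and 1s targets.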
Rationale
- 500ms for list operations: Interactive UI operations should feel responsive
- 1s for file downloads: File operations can tolerate slightly higher latency
- 2s for P99: Allows for occasional slow requests
Exclusions
- File uploads/downloads >10MB
- Bulk CSOM operations (>50 operations)
- Initial startup requests (first 60 seconds)
3. Error Rate SLO
Target
< 1% server errors (5xx responses)
Definition
Percentage of requests resulting in server errors.
Error Rate = (5xx_responses / total_requests) * 100
Prometheus Query
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100 < 1.0
Rationale
A 1% error rate ceiling allows for occasional transient failures while maintaining high reliability.
Breakdown by type (per-status ceilings; the combined 5xx rate must still stay under 1%):
- 500 Internal Server Error: <0.5%
- 502 Bad Gateway: <0.3%
- 503 Service Unavailable: <0.2%
- 504 Gateway Timeout: <0.1%
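The per-status limits and the overall target can be checked together. The sketch below is illustrative: the limits are copied from the breakdown above, while the function name and sample counts are hypothetical.

```python
# Per-status ceilings (percent of all requests), taken from the breakdown above.
STATUS_LIMITS = {500: 0.5, 502: 0.3, 503: 0.2, 504: 0.1}
OVERALL_LIMIT = 1.0  # combined 5xx ceiling, in percent


def error_rate_report(status_counts: dict[int, int]) -> dict[str, tuple[float, bool]]:
    """Map each tracked status, plus '5xx' overall, to (rate_percent, within_limit)."""
    total = sum(status_counts.values())
    if total == 0:
        return {}
    report = {}
    for status, limit in STATUS_LIMITS.items():
        rate = status_counts.get(status, 0) / total * 100
        report[str(status)] = (rate, rate < limit)
    server_errors = sum(n for s, n in status_counts.items() if 500 <= s <= 599)
    overall = server_errors / total * 100
    report["5xx"] = (overall, overall < OVERALL_LIMIT)
    return report


# Hypothetical counts for one evaluation window.
counts = {200: 9_950, 500: 20, 502: 10, 503: 15, 504: 5}
for name, (rate, ok) in error_rate_report(counts).items():
    print(f"{name}: {rate:.2f}% {'OK' if ok else 'VIOLATION'}")
```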
4. Throughput SLO
Target
Support >100 requests/second under normal load
Definition
Number of requests the service can handle per second.
Prometheus Query
sum(rate(http_requests_total[1m])) > 100
Rationale
100 req/s is sufficient for development/testing. Production can scale horizontally.
Capacity planning:
- Single instance: 100-200 req/s
- Multi-instance (3 replicas): 300-600 req/s
- Kubernetes auto-scaling: 1000+ req/s
Load testing targets:
- Sustained: 100 req/s for 1 hour
- Burst: 300 req/s for 5 minutes
- Peak: 500 req/s for 30 seconds
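A rough way to size replicas for the load-test targets is sketched below. It assumes throughput scales linearly with replica count and plans conservatively with the lower bound of the single-instance range (100 req/s). Both are simplifying assumptions, and the function name is hypothetical.

```python
import math

# Conservative per-instance capacity: lower bound of the 100-200 req/s range.
PER_INSTANCE_RPS = 100


def replicas_needed(target_rps: float, per_instance_rps: float = PER_INSTANCE_RPS) -> int:
    """Replicas required for a target rate, assuming linear horizontal scaling."""
    if per_instance_rps <= 0:
        raise ValueError("per-instance capacity must be positive")
    return max(1, math.ceil(target_rps / per_instance_rps))


for label, rps in [("Sustained", 100), ("Burst", 300), ("Peak", 500)]:
    print(f"{label} ({rps} req/s): {replicas_needed(rps)} replica(s)")
```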
5. Data Integrity SLO
Target
99.99% data consistency (zero data loss tolerance)
Definition
All write operations must be persisted correctly and retrievable.
Validation
- Write-read consistency checks
- Cross-storage provider validation
- ACID transaction integrity
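A write-read consistency check can be as simple as writing a payload, reading it back, and comparing digests. The sketch below uses a hypothetical in-memory store as a stand-in; the real Cesivi storage providers and their interfaces are not described in this document.

```python
import hashlib


class InMemoryStore:
    """Hypothetical stand-in for a storage provider, used only for illustration."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def write(self, key: str, content: bytes) -> None:
        self._data[key] = content

    def read(self, key: str) -> bytes:
        return self._data[key]


def write_read_consistent(store, key: str, content: bytes) -> bool:
    """Write content, read it back, and compare SHA-256 digests."""
    expected = hashlib.sha256(content).hexdigest()
    store.write(key, content)
    actual = hashlib.sha256(store.read(key)).hexdigest()
    return expected == actual


store = InMemoryStore()
print(write_read_consistent(store, "documents/report.txt", b"quarterly numbers"))
```

Running the same check against two providers and comparing the digests would be one way to approach cross-storage validation.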
Rationale
Data integrity is critical for a SharePoint mock server that supports workflows and migration.
6. Cache Performance SLO
Target
Cache hit rate >70% for frequently accessed data
Definition
Percentage of requests served from cache.
Cache Hit Rate = (cache_hits / (cache_hits + cache_misses)) * 100
Prometheus Query
(sum(rate(cesivi_cache_hit_total[5m])) / (sum(rate(cesivi_cache_hit_total[5m])) + sum(rate(cesivi_cache_miss_total[5m])))) * 100 > 70
Rationale
70% hit rate balances memory usage with performance.
Target by cache type:
- CAML Parse Cache: >80%
- Reflection Cache: >95%
- Object Path Cache: >70%
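The per-cache targets can be evaluated with the same hit-rate formula. In the sketch below the targets come from the list above, while the dictionary keys and hit/miss counts are hypothetical.

```python
# Minimum hit rate (percent) per cache type, taken from the targets above.
CACHE_TARGETS = {"caml_parse": 80.0, "reflection": 95.0, "object_path": 70.0}


def hit_rate_percent(hits: int, misses: int) -> float:
    """Cache hit rate as a percentage; 0.0 when there were no lookups."""
    lookups = hits + misses
    return hits / lookups * 100 if lookups else 0.0


# Hypothetical (hits, misses) counters for one window.
counters = {"caml_parse": (850, 150), "reflection": (990, 10), "object_path": (600, 400)}
for cache, (hits, misses) in counters.items():
    rate = hit_rate_percent(hits, misses)
    status = "OK" if rate > CACHE_TARGETS[cache] else "BELOW TARGET"
    print(f"{cache}: {rate:.1f}% {status}")
```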
Error Budget Management
Monthly Allocation
| SLO | Target | Error Budget | Budget per 30 Days |
|---|---|---|---|
| Availability | 99.9% | 0.1% | 43.2 min of downtime |
| Error Rate | <1% | 1% | ~432 min of errors |
| Latency | P95 <500ms | 5% slow requests | ~2,160 min |
Policy
Error budget consumption:
- >50% remaining: Normal operations
- 25-50% remaining: Review changes, consider rollback
- <25% remaining: Freeze non-critical releases
- Exhausted: Emergency freeze, postmortem required
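The policy tiers map naturally onto a small lookup. The sketch below is illustrative: the function names are hypothetical, and the handling of the exact 25% and 50% boundaries (both treated as the review tier) is an assumption the policy does not spell out.

```python
def remaining_budget_percent(budget_minutes: float, consumed_minutes: float) -> float:
    """Share of the error budget still unspent, floored at 0%."""
    if budget_minutes <= 0:
        raise ValueError("budget must be positive")
    return max(0.0, (1 - consumed_minutes / budget_minutes) * 100)


def budget_action(remaining_percent: float) -> str:
    """Return the policy action for the remaining error budget."""
    if remaining_percent <= 0:
        return "Emergency freeze, postmortem required"
    if remaining_percent < 25:
        return "Freeze non-critical releases"
    if remaining_percent <= 50:
        # Assumption: exactly 25% and exactly 50% fall in the review tier.
        return "Review changes, consider rollback"
    return "Normal operations"


# Example: 30 of the 43.2 monthly availability minutes already consumed.
remaining = remaining_budget_percent(43.2, 30.0)
print(f"{remaining:.1f}% remaining: {budget_action(remaining)}")
```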
SLO Review
Schedule
- Weekly: Error budget review
- Monthly: Full SLO compliance
- Quarterly: Strategic alignment
Adjustment Criteria
- Tighten: Meeting the SLO with >90% of the error budget remaining for 3 months
- Loosen: Frequently exhausting budget
- New SLO: New capabilities or requirements
Operational Implications
Alerting
Based on SLO thresholds (monitoring/prometheus/alerts.yml):
- Critical: SLO violation
- Warning: Approaching violation
- Info: Error budget consumption
Incident Response
See _docs/INCIDENT_RESPONSE.md for procedures.
Capacity Planning
SLOs guide infrastructure decisions:
- Availability: Redundancy planning
- Latency: Performance optimization
- Throughput: Horizontal scaling
- Data Integrity: Backup procedures
Version: 1.0
Created: 2026-01-18
Author: PLAN-152 Phase 3.1
Next Review: 2026-04-18