Service Level Objectives (SLOs)

Last Updated: 2026-01-18
Status: Production SLOs for Cesivi Server
Review Schedule: Quarterly

Overview

This document defines the Service Level Objectives (SLOs) for Cesivi Server. SLOs are internal targets for service performance and reliability that guide operational decisions and alerting thresholds.

Key Concepts:

SLI (Service Level Indicator): A quantitative measure of service behavior
SLO (Service Level Objective): Target value or range for an SLI
Error Budget: The allowed amount of unreliability

Measurement Period: 30-day rolling window

1. Availability SLO

Target

99.9% availability (30-day rolling window)

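Definition

The percentage of requests that the service answers without a server-side failure. Availability is measured per request rather than by wall-clock uptime, so the downtime figures in this document are time equivalents of the request-based budget.

Availability = (successful_requests / total_requests) * 100

Successful requests: HTTP 2xx, 3xx, 4xx (client errors don't count against availability)
Failed requests: HTTP 5xx (server errors)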
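As a minimal sketch of the availability formula above, the following Python function computes the percentage from per-status request counts. The function name `availability_percent` and the sample counts are illustrative assumptions, not part of Cesivi Server.

```python
from typing import Mapping, Optional


def availability_percent(status_counts: Mapping[int, int]) -> Optional[float]:
    """Return availability in percent from a {status_code: request_count} mapping.

    Only 5xx responses count as failures; 2xx, 3xx, and 4xx count as successful.
    Returns None when there is no traffic, because the ratio is undefined.
    """
    total = sum(status_counts.values())
    if total == 0:
        return None
    failed = sum(n for code, n in status_counts.items() if 500 <= code <= 599)
    return (total - failed) / total * 100


# Hypothetical sample: 10,000 requests, 5 of them server errors.
sample_counts = {200: 9_950, 304: 20, 404: 25, 500: 4, 503: 1}
sample_availability = availability_percent(sample_counts)
```

With this sample, 5 of 10,000 requests fail, so the result is approximately 99.95%, which meets the 99.9% target.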

Error Budget

0.1% downtime = 43.2 minutes per month

Breakdown:

Monthly: 43.2 minutes
Weekly: 10.08 minutes
Daily: 1.44 minutes
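The budget figures above follow from multiplying the window length by the allowed failure fraction. A small sketch of that arithmetic, assuming a 30-day month (the helper name `error_budget_minutes` is hypothetical):

```python
def error_budget_minutes(slo_percent: float, window_days: float) -> float:
    """Return the time-equivalent error budget, in minutes, for a window."""
    window_minutes = window_days * 24 * 60
    allowed_failure_fraction = (100.0 - slo_percent) / 100.0
    return window_minutes * allowed_failure_fraction


# Reproduce the monthly, weekly, and daily breakdown for a 99.9% target.
for label, days in (("Monthly", 30), ("Weekly", 7), ("Daily", 1)):
    print(f"{label}: {error_budget_minutes(99.9, days):.2f} minutes")
```

The same helper can be reused for other targets, for example a 99% target over 30 days yields roughly 432 minutes.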

Prometheus Query

(sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))) * 100

Rationale

99.9% availability is appropriate for a production SharePoint mock server supporting development, testing, and production workflows.

2. Latency SLO

Targets

P95 < 500ms for list operations
P95 < 1s for file downloads
P99 < 2s for all operations

Definition

Time from receiving a request to sending the complete response.

P95: 95% of requests complete faster than this threshold
P99: 99% of requests complete faster than this threshold

Prometheus Queries

List Operations P95:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{endpoint=~".*/_api/web/lists/.*"}[5m])) by (le)) < 0.5

File Downloads P95:

histogram_quantile(0.95, sum(rate(cesivi_file_download_duration_seconds_bucket[5m])) by (le)) < 1.0
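Prometheus's `histogram_quantile` estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The following is a simplified sketch of that idea, not the Prometheus implementation; the function name, bucket boundaries, and counts are made-up sample data.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound_seconds, cumulative_count) pairs, with the last
    pair using math.inf as its upper bound (the "+Inf" bucket).
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical list-operation latencies: 100 requests in total.
sample_buckets = [(0.1, 50), (0.25, 80), (0.5, 96), (1.0, 99), (math.inf, 100)]
p95_seconds = estimate_quantile(0.95, sample_buckets)
meets_list_target = p95_seconds < 0.5
```

Because the estimate is interpolated, its precision depends on how closely the bucket boundaries bracket the SLO thresholds (0.5s, 1s, and 2s here).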

Rationale

500ms for list operations: Interactive UI operations should feel responsive
1s for file downloads: File operations can tolerate slightly higher latency
2s for P99: Allows for occasional slow requests

Exclusions

File uploads/downloads >10MB
Bulk CSOM operations (>50 operations)
Initial startup requests (first 60 seconds)
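The exclusions above can be read as a filter applied before latency samples are counted. A minimal sketch, assuming each request is described by a hypothetical `RequestSample` record and that "10MB" means 10 MiB:

```python
from dataclasses import dataclass

MAX_TRANSFER_BYTES = 10 * 1024 * 1024  # assumption: "10MB" interpreted as 10 MiB
MAX_CSOM_OPERATIONS = 50
STARTUP_GRACE_SECONDS = 60


@dataclass
class RequestSample:
    """Hypothetical per-request record; field names are illustrative."""
    transfer_bytes: int           # size of the uploaded or downloaded file, 0 if none
    csom_operation_count: int     # number of operations in a CSOM batch, 0 if none
    seconds_since_startup: float  # server uptime when the request arrived


def counts_toward_latency_slo(sample: RequestSample) -> bool:
    """Return False for requests that the latency SLO excludes."""
    if sample.transfer_bytes > MAX_TRANSFER_BYTES:
        return False
    if sample.csom_operation_count > MAX_CSOM_OPERATIONS:
        return False
    if sample.seconds_since_startup < STARTUP_GRACE_SECONDS:
        return False
    return True
```

For example, a 20 MiB download or a request arriving 30 seconds after startup would be left out of the P95 and P99 calculations.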

3. Error Rate SLO

Target

< 1% server errors (5xx responses)

Definition

Percentage of requests resulting in server errors.

Error Rate = (5xx_responses / total_requests) * 100

Prometheus Query

(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100 < 1.0

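Rationale

A 1% error rate allows for occasional transient failures while maintaining high reliability. This threshold is evaluated over a 5-minute window, so it bounds short bursts of errors; the 30-day availability SLO (99.9%) remains the stricter long-term limit on 5xx responses.

Breakdown by type:

500 Internal Server Error: <0.5%
502 Bad Gateway: <0.3%
503 Service Unavailable: <0.2%
504 Gateway Timeout: <0.1%

The per-type limits sum to 1.1%, so the overall <1% target still applies when several error types occur together.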
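One way to check both the overall target and the per-status breakdown is sketched below. The function name `error_rate_violations` and the sample counts are assumptions for illustration; the limits are the ones listed above.

```python
from typing import Dict, List

TOTAL_5XX_LIMIT_PERCENT = 1.0
PER_STATUS_LIMIT_PERCENT = {500: 0.5, 502: 0.3, 503: 0.2, 504: 0.1}


def error_rate_violations(status_counts: Dict[int, int]) -> List[str]:
    """Return labels of the error-rate limits that are not met.

    Every limit is a strict "<" target, so a rate equal to the limit
    is reported as a violation.
    """
    total = sum(status_counts.values())
    if total == 0:
        return []
    violations = []
    server_errors = sum(n for code, n in status_counts.items() if 500 <= code <= 599)
    if server_errors / total * 100 >= TOTAL_5XX_LIMIT_PERCENT:
        violations.append("5xx")
    for code, limit in PER_STATUS_LIMIT_PERCENT.items():
        if status_counts.get(code, 0) / total * 100 >= limit:
            violations.append(str(code))
    return violations


# Hypothetical 5-minute sample: the overall rate is fine, but 500s are too frequent.
sample_window = {200: 9_900, 500: 60, 502: 10, 503: 5}
sample_violations = error_rate_violations(sample_window)
```

In this sample the overall 5xx rate is about 0.75%, which is within the target, while HTTP 500 alone is about 0.6% and exceeds its 0.5% limit.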

4. Throughput SLO

Target

Support >100 requests/second under normal load

Definition

Number of requests the service can handle per second.

Prometheus Query

sum(rate(http_requests_total[1m])) > 100

Rationale

100 req/s is sufficient for development/testing. Production can scale horizontally.

Capacity planning:

Single instance: 100-200 req/s
Multi-instance (3 replicas): 300-600 req/s
Kubernetes auto-scaling: 1000+ req/s

Load testing targets:

Sustained: 100 req/s for 1 hour
Burst: 300 req/s for 5 minutes
Peak: 500 req/s for 30 seconds
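The capacity figures above scale linearly with the number of replicas. A rough sizing sketch, assuming the conservative end of the single-instance range (100 req/s) and ignoring load-balancer overhead; `replicas_needed` is a hypothetical helper:

```python
import math

PER_INSTANCE_RPS = 100  # assumption: lower bound of the 100-200 req/s single-instance range


def replicas_needed(target_rps: float, per_instance_rps: float = PER_INSTANCE_RPS) -> int:
    """Return how many replicas are needed to serve target_rps, at least one."""
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    return max(1, math.ceil(target_rps / per_instance_rps))


# Replica counts for the three load testing targets listed above.
load_test_replicas = {
    "sustained": replicas_needed(100),
    "burst": replicas_needed(300),
    "peak": replicas_needed(500),
}
```

Under these assumptions the sustained, burst, and peak targets need 1, 3, and 5 replicas respectively; measured per-instance throughput should replace the assumed constant where available.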

5. Data Integrity SLO

Target

99.99% data consistency (zero data loss tolerance)

Definition

All write operations must be persisted correctly and retrievable.

Validation

Write-read consistency checks
Cross-storage provider validation
ACID transaction integrity
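A write-read consistency check can be sketched as writing a payload, reading it back, and comparing digests. The `InMemoryStore` class below is a stand-in for illustration only; it is not a Cesivi Server storage provider, and the real provider interface may differ.

```python
import hashlib
from typing import Dict


class InMemoryStore:
    """Hypothetical storage provider used only to demonstrate the check."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def write(self, key: str, data: bytes) -> None:
        self._blobs[key] = bytes(data)

    def read(self, key: str) -> bytes:
        return self._blobs[key]


def verify_write_read(store, key: str, data: bytes) -> bool:
    """Write data, read it back, and report whether the SHA-256 digests match."""
    expected = hashlib.sha256(data).hexdigest()
    store.write(key, data)
    actual = hashlib.sha256(store.read(key)).hexdigest()
    return expected == actual


def consistency_percent(results) -> float:
    """Return the share of successful checks, in percent."""
    results = list(results)
    return 100.0 * sum(results) / len(results) if results else 100.0
```

Running `verify_write_read` against each storage provider with the same payloads gives a simple form of cross-storage validation, and `consistency_percent` turns the results into a figure comparable with the 99.99% target.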

Rationale

Data integrity is critical for a SharePoint mock server that supports workflows and migration.

6. Cache Performance SLO

Target

Cache hit rate >70% for frequently accessed data

Definition

Percentage of requests served from cache.

Cache Hit Rate = (cache_hits / (cache_hits + cache_misses)) * 100

Prometheus Query

(sum(rate(cesivi_cache_hit_total[5m])) / (sum(rate(cesivi_cache_hit_total[5m])) + sum(rate(cesivi_cache_miss_total[5m])))) * 100 > 70

Rationale

70% hit rate balances memory usage with performance.

Target by cache type:

CAML Parse Cache: >80%
Reflection Cache: >95%
Object Path Cache: >70%
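The hit-rate formula and the per-cache targets above can be combined into a simple check. This is a sketch under assumptions: the dictionary keys and the sample hit/miss counts are invented for illustration and do not correspond to actual metric labels.

```python
from typing import Dict, List, Optional, Tuple

# Per-cache targets from the list above, in percent (keys are illustrative).
HIT_RATE_TARGETS = {"caml_parse": 80.0, "reflection": 95.0, "object_path": 70.0}


def hit_rate_percent(hits: int, misses: int) -> Optional[float]:
    """Cache Hit Rate = hits / (hits + misses) * 100, or None without lookups."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total * 100


def caches_below_target(counters: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return the caches whose hit rate does not exceed their target."""
    below = []
    for name, (hits, misses) in counters.items():
        rate = hit_rate_percent(hits, misses)
        if rate is not None and rate <= HIT_RATE_TARGETS[name]:
            below.append(name)
    return below


# Hypothetical counters as (hits, misses) per cache.
sample_counters = {
    "caml_parse": (850, 150),
    "reflection": (940, 60),
    "object_path": (720, 280),
}
sample_below = caches_below_target(sample_counters)
```

In this sample only the reflection cache misses its target: 94% is high in absolute terms but below the stricter 95% threshold.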

Error Budget Management

Monthly Allocation

SLO	Target	Error Budget	Time Equivalent (30 days)
Availability	99.9%	0.1%	43.2 min
Error Rate	<1%	1%	~432 min
Latency (list operations)	P95 < 500ms	5% of requests may be slower	~2,160 min

Policy

Error budget consumption:

>50% remaining: Normal operations
25-50% remaining: Review changes, consider rollback
<25% remaining: Freeze non-critical releases
Exhausted: Emergency freeze, postmortem required
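The policy tiers above map the remaining share of the error budget to an action. A minimal sketch of that mapping, with hypothetical function names:

```python
def remaining_fraction(budget_minutes: float, consumed_minutes: float) -> float:
    """Return the share of the error budget still available (can be negative)."""
    return (budget_minutes - consumed_minutes) / budget_minutes


def budget_policy(remaining: float) -> str:
    """Map the remaining error budget (as a fraction, 1.0 = untouched) to an action."""
    if remaining <= 0:
        return "Emergency freeze, postmortem required"
    if remaining < 0.25:
        return "Freeze non-critical releases"
    if remaining <= 0.50:
        return "Review changes, consider rollback"
    return "Normal operations"


# Example: 21.6 of the 43.2 availability minutes have been consumed this window.
sample_action = budget_policy(remaining_fraction(43.2, 21.6))
```

With half of the availability budget consumed, the sample falls in the 25-50% tier, so the action is to review changes and consider rollback.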

SLO Review

Schedule

Weekly: Error budget review
Monthly: Full SLO compliance
Quarterly: Strategic alignment

Adjustment Criteria

Tighten: Meeting SLO with >90% of the error budget remaining for 3 months
Loosen: Frequently exhausting budget
New SLO: New capabilities or requirements

Operational Implications

Alerting

Based on SLO thresholds (monitoring/prometheus/alerts.yml):

Critical: SLO violation
Warning: Approaching violation
Info: Error budget consumption

Incident Response

See _docs/INCIDENT_RESPONSE.md for procedures.

Capacity Planning

SLOs guide infrastructure decisions:

Availability: Redundancy planning
Latency: Performance optimization
Throughput: Horizontal scaling
Data Integrity: Backup procedures

Version: 1.0
Created: 2026-01-18
Author: PLAN-152 Phase 3.1
Next Review: 2026-04-18