Service Level Objectives (SLOs)

Last Updated: 2026-01-18
Status: Production SLOs for Cesivi Server
Review Schedule: Quarterly


Overview

This document defines the Service Level Objectives (SLOs) for Cesivi Server. SLOs are internal targets for service performance and reliability that guide operational decisions and alerting thresholds.

Key Concepts:

  • SLI (Service Level Indicator): A quantitative measure of service behavior
  • SLO (Service Level Objective): Target value or range for an SLI
  • Error Budget: The allowed amount of unreliability

Measurement Period: 30-day rolling window


1. Availability SLO

Target

99.9% availability (30-day rolling window)

Definition

The percentage of requests the service answers without a server error during the measurement window. Availability is measured per request rather than by wall-clock uptime.

Availability = (successful_requests / total_requests) * 100

  • Successful requests: HTTP 2xx, 3xx, 4xx (client errors don't count against availability)
  • Failed requests: HTTP 5xx (server errors)
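
The classification above can be illustrated with a small calculation. This is a minimal sketch, not part of Cesivi Server: the function name `availability_percent` and the sample counts are assumptions made for illustration.

```python
from collections import Counter


def availability_percent(status_counts):
    """Return availability as a percentage; only 5xx responses count as failures."""
    total = sum(status_counts.values())
    if total == 0:
        raise ValueError("no requests observed in the window")
    failed = sum(n for status, n in status_counts.items() if 500 <= status <= 599)
    return (total - failed) / total * 100


# Hypothetical sample: 10,000 requests, 8 of them server errors.
# The 404 responses are client errors, so they still count as successful.
sample = Counter({200: 9500, 304: 300, 404: 192, 500: 5, 503: 3})
print(f"{availability_percent(sample):.2f}%")
```

With these sample counts the result is just above the 99.9% target, even though 192 requests returned 404.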

Error Budget

0.1% downtime = 43.2 minutes per month

Breakdown:

  • Monthly: 43.2 minutes
  • Weekly: 10.08 minutes
  • Daily: 1.44 minutes
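
The budget figures follow directly from the target and the window length. The sketch below shows the arithmetic; the helper name `error_budget_minutes` is an assumption, and a month is taken to be the 30-day measurement period.

```python
def error_budget_minutes(slo_percent, window_days):
    """Minutes of allowed unavailability for an SLO target over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


# 30 days matches the measurement period; 7 and 1 give the weekly and daily shares.
for label, days in [("Monthly", 30), ("Weekly", 7), ("Daily", 1)]:
    print(f"{label}: {error_budget_minutes(99.9, days):.2f} minutes")
```

For a 99.9% target this reproduces the 43.2, 10.08, and 1.44 minute figures listed above.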

Prometheus Query

(sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))) * 100

Rationale

99.9% availability is appropriate for a production SharePoint mock server supporting development, testing, and production workflows.


2. Latency SLO

Targets

  • P95 < 500ms for list operations
  • P95 < 1s for file downloads
  • P99 < 2s for all operations

Definition

Time from receiving a request to sending the complete response.

  • P95: 95% of requests complete faster than this threshold
  • P99: 99% of requests complete faster than this threshold
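
To make the percentile definitions concrete, here is a minimal sketch using the nearest-rank method on raw latency samples. The function name and sample data are assumptions. Note that Prometheus `histogram_quantile` estimates percentiles from histogram buckets instead, so its results are approximations of what this computes on raw samples.

```python
import math


def percentile_nearest_rank(samples, p):
    """Return the smallest sample that at least p percent of samples do not exceed."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
p95 = percentile_nearest_rank(latencies_ms, 95)
p99 = percentile_nearest_rank(latencies_ms, 99)
print(p95 < 500, p99 < 2000)  # compare against the list-operation P95 and overall P99 targets
```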

Prometheus Queries

List Operations P95:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{endpoint=~".*/_api/web/lists/.*"}[5m])) by (le)) < 0.5

File Downloads P95:

histogram_quantile(0.95, sum(rate(cesivi_file_download_duration_seconds_bucket[5m])) by (le)) < 1.0

Rationale

  • 500ms for list operations: Interactive UI operations should feel responsive
  • 1s for file downloads: File operations can tolerate slightly higher latency
  • 2s for P99: Allows for occasional slow requests

Exclusions

  • File uploads/downloads >10MB
  • Bulk CSOM operations (>50 operations)
  • Initial startup requests (first 60 seconds)

3. Error Rate SLO

Target

< 1% server errors (5xx responses)

Definition

Percentage of requests resulting in server errors.

Error Rate = (5xx_responses / total_requests) * 100

Prometheus Query

(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100 < 1.0

Rationale

1% error rate allows for occasional transient failures while maintaining high reliability.

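Breakdown by type:

  • 500 Internal Server Error: <0.5%
  • 502 Bad Gateway: <0.3%
  • 503 Service Unavailable: <0.2%
  • 504 Gateway Timeout: <0.1%

These limits apply to each status code individually. Because they sum to 1.1%, the overall <1% target still applies on top of them.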
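
The overall target and the per-status limits can be checked together. The following is an illustrative sketch only; the names `LIMITS_PERCENT` and `error_rate_report` and the sample counts are assumptions, not part of the service.

```python
# Per-status limits and the overall target, taken from this section.
LIMITS_PERCENT = {500: 0.5, 502: 0.3, 503: 0.2, 504: 0.1}
OVERALL_LIMIT_PERCENT = 1.0


def error_rate_report(status_counts):
    """Map each tracked status (and "5xx" overall) to (rate_percent, within_limit)."""
    total = sum(status_counts.values())
    if total == 0:
        raise ValueError("no requests observed in the window")
    report = {}
    for status, limit in LIMITS_PERCENT.items():
        rate = status_counts.get(status, 0) / total * 100
        report[status] = (rate, rate < limit)
    overall = sum(n for s, n in status_counts.items() if 500 <= s <= 599) / total * 100
    report["5xx"] = (overall, overall < OVERALL_LIMIT_PERCENT)
    return report


# Hypothetical 5-minute sample of 10,000 requests.
sample = {200: 9905, 500: 40, 502: 35, 503: 15, 504: 5}
for key, (rate, ok) in error_rate_report(sample).items():
    print(key, f"{rate:.2f}%", "ok" if ok else "over limit")
```

In this sample the overall rate stays under 1% while 502 responses exceed their individual limit, which is why both checks are worth running.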


4. Throughput SLO

Target

Support >100 requests/second under normal load

Definition

Number of requests the service can handle per second.

Prometheus Query

sum(rate(http_requests_total[1m])) > 100

Rationale

100 req/s is sufficient for development/testing. Production can scale horizontally.

Capacity planning:

  • Single instance: 100-200 req/s
  • Multi-instance (3 replicas): 300-600 req/s
  • Kubernetes auto-scaling: 1000+ req/s

Load testing targets:

  • Sustained: 100 req/s for 1 hour
  • Burst: 300 req/s for 5 minutes
  • Peak: 500 req/s for 30 seconds
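
As a rough sizing aid, the capacity figures can be turned into a replica estimate. This sketch assumes linear scaling and uses the conservative end of the single-instance range (100 req/s); the function name `replicas_needed` is hypothetical.

```python
import math

# Conservative end of the 100-200 req/s single-instance range (an assumption for sizing).
PER_INSTANCE_RPS = 100


def replicas_needed(target_rps, per_instance_rps=PER_INSTANCE_RPS):
    """Estimate replicas for a target load, assuming throughput scales linearly."""
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    return max(1, math.ceil(target_rps / per_instance_rps))


# Load testing targets from this section.
for name, rps in [("Sustained", 100), ("Burst", 300), ("Peak", 500)]:
    print(f"{name}: {rps} req/s -> {replicas_needed(rps)} replica(s)")
```

Real scaling is rarely perfectly linear, so load test results should take precedence over this estimate.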


5. Data Integrity SLO

Target

99.99% data consistency (zero data loss tolerance)

Definition

All write operations must be persisted correctly and retrievable.

Validation

  • Write-read consistency checks
  • Cross-storage provider validation
  • ACID transaction integrity
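
A write-read consistency check could look roughly like the sketch below. It is not the actual validation code: `InMemoryStore` is a hypothetical stand-in for a storage provider, and the `write`/`read` interface is assumed for illustration.

```python
import hashlib


class InMemoryStore:
    """Hypothetical stand-in for a storage provider."""

    def __init__(self):
        self._data = {}

    def write(self, key, payload):
        self._data[key] = bytes(payload)

    def read(self, key):
        return self._data[key]


def write_read_consistent(store, key, payload):
    """Write a payload, read it back, and compare SHA-256 digests."""
    expected = hashlib.sha256(payload).hexdigest()
    store.write(key, payload)
    actual = hashlib.sha256(store.read(key)).hexdigest()
    return expected == actual


def consistency_percent(results):
    """Share of checks that passed, as a percentage."""
    if not results:
        raise ValueError("no checks were run")
    return sum(results) / len(results) * 100


store = InMemoryStore()
results = [write_read_consistent(store, f"doc-{i}", f"payload {i}".encode()) for i in range(100)]
print(f"{consistency_percent(results):.2f}%")
```

Cross-storage validation could reuse the same digest comparison by writing through one provider and reading through another.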

Rationale

Data integrity is critical for a SharePoint mock server that supports workflows and migration.


6. Cache Performance SLO

Target

Cache hit rate >70% for frequently accessed data

Definition

Percentage of requests served from cache.

Cache Hit Rate = (cache_hits / (cache_hits + cache_misses)) * 100

Prometheus Query

(sum(rate(cesivi_cache_hit_total[5m])) / (sum(rate(cesivi_cache_hit_total[5m])) + sum(rate(cesivi_cache_miss_total[5m])))) * 100 > 70

Rationale

70% hit rate balances memory usage with performance.

Target by cache type:

  • CAML Parse Cache: >80%
  • Reflection Cache: >95%
  • Object Path Cache: >70%
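
The hit-rate formula and per-cache targets can be combined into a simple check. This is a sketch under assumptions: the cache keys (`caml_parse`, `reflection`, `object_path`) and the sample counters are made up for illustration and do not correspond to known metric labels.

```python
# Per-cache targets from this section; the dictionary keys are hypothetical names.
CACHE_TARGETS_PERCENT = {"caml_parse": 80, "reflection": 95, "object_path": 70}


def hit_rate_percent(hits, misses):
    """Cache Hit Rate = hits / (hits + misses) * 100."""
    total = hits + misses
    if total == 0:
        raise ValueError("no cache lookups observed")
    return hits / total * 100


def meets_target(cache, hits, misses):
    """True if the observed hit rate is strictly above the cache's target."""
    return hit_rate_percent(hits, misses) > CACHE_TARGETS_PERCENT[cache]


# Hypothetical counters as (hits, misses).
observed = {"caml_parse": (850, 150), "reflection": (940, 60), "object_path": (720, 280)}
for cache, (hits, misses) in observed.items():
    status = "ok" if meets_target(cache, hits, misses) else "below target"
    print(cache, f"{hit_rate_percent(hits, misses):.1f}%", status)
```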


Error Budget Management

Monthly Allocation

| SLO | Target | Error Budget | Allowed Downtime |
|---|---|---|---|
| Availability | 99.9% | 0.1% | 43.2 min |
| Error Rate | <1% | 1% | ~432 min errors |
| Latency | <500ms | 5% slow | ~2,160 min |

Policy

Error budget consumption:

  • >50% remaining: Normal operations
  • 25-50% remaining: Review changes, consider rollback
  • <25% remaining: Freeze non-critical releases
  • Exhausted: Emergency freeze, postmortem required
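
The policy tiers can be expressed as a lookup on the remaining budget. A minimal sketch, assuming the budget is tracked in minutes; the function names are hypothetical, and treating exactly 50% remaining as the "review" tier is an interpretation of the ranges above.

```python
def budget_remaining_percent(budget_minutes, consumed_minutes):
    """Percentage of the error budget still unspent, floored at zero."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    return max(0.0, (budget_minutes - consumed_minutes) / budget_minutes * 100)


def policy_action(remaining_percent):
    """Map remaining error budget to the policy tier defined in this section."""
    if remaining_percent <= 0:
        return "Emergency freeze, postmortem required"
    if remaining_percent < 25:
        return "Freeze non-critical releases"
    if remaining_percent <= 50:
        return "Review changes, consider rollback"
    return "Normal operations"


# Availability budget of 43.2 minutes with hypothetical consumption values.
for consumed in (10.8, 30.0, 40.0, 50.0):
    remaining = budget_remaining_percent(43.2, consumed)
    print(f"{consumed} min used -> {remaining:.1f}% remaining: {policy_action(remaining)}")
```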


SLO Review

Schedule

  • Weekly: Error budget review
  • Monthly: Full SLO compliance
  • Quarterly: Strategic alignment

Adjustment Criteria

  • Tighten: Meeting the SLO with >90% of the error budget remaining for 3 months
  • Loosen: Frequently exhausting budget
  • New SLO: New capabilities or requirements

Operational Implications

Alerting

Based on SLO thresholds (monitoring/prometheus/alerts.yml):

  • Critical: SLO violation
  • Warning: Approaching violation
  • Info: Error budget consumption

Incident Response

See _docs/INCIDENT_RESPONSE.md for procedures.

Capacity Planning

SLOs guide infrastructure decisions:

  • Availability: Redundancy planning
  • Latency: Performance optimization
  • Throughput: Horizontal scaling
  • Data Integrity: Backup procedures


Version: 1.0
Created: 2026-01-18
Author: PLAN-152 Phase 3.1
Next Review: 2026-04-18