Cesivi Server - Observability Guide¶

Version: 1.0
Last Updated: 2026-01-11
Author: Cesivi Team

Table of Contents¶

Overview
Observability Stack
Health Checks
Prometheus Metrics
OpenTelemetry Distributed Tracing
Configuration
Docker Compose Setup
Kubernetes Setup
Grafana Dashboards
Troubleshooting
Best Practices
Additional Resources

Overview¶

Cesivi Server provides comprehensive observability for production deployments through:

Health Checks - /health, /ready, /live endpoints for load balancer probes
Prometheus Metrics - Detailed performance and usage metrics at /metrics
OpenTelemetry Tracing - Distributed request tracing across services
Graceful Shutdown - Clean termination with request draining

This guide covers setup, configuration, and best practices for monitoring Cesivi Server in production.

Observability Stack¶

Components¶

Component	Purpose	Port	URL
Cesivi Server	Main application	5000	http://localhost:5000
Health Endpoints	Load balancer probes	5000	/health, /ready, /live
Metrics Endpoint	Prometheus scraping	5000	/metrics
Prometheus	Metrics collection	9090	http://localhost:9090
Jaeger	Trace visualization	16686	http://localhost:16686
Grafana (optional)	Dashboards	3000	http://localhost:3000

Architecture¶

┌──────────────────────────────────────────────────────────┐
│ Monitoring Stack                                         │
├──────────────────────────────────────────────────────────┤
│                                                           │
│  ┌─────────────┐    Scrape /metrics    ┌─────────────┐  │
│  │ Prometheus  │◄─────────────────────┤   Cesivi    │  │
│  │   (9090)    │   Every 15s           │   (5000)    │  │
│  └──────┬──────┘                        └──────┬──────┘  │
│         │                                      │         │
│         │ Query                        Send traces       │
│         ▼                                      ▼         │
│  ┌─────────────┐                        ┌─────────────┐  │
│  │  Grafana    │                        │   Jaeger    │  │
│  │   (3000)    │                        │  (16686)    │  │
│  └─────────────┘                        └─────────────┘  │
│                                                           │
│  ┌────────────────────────────────────────────────────┐  │
│  │ Load Balancer                                      │  │
│  │ - Probes /health, /ready, /live every 10s         │  │
│  │ - Removes unhealthy instances from rotation       │  │
│  └────────────────────────────────────────────────────┘  │
│                                                           │
└──────────────────────────────────────────────────────────┘

Health Checks¶

Cesivi Server exposes three health check endpoints for different purposes:

`/health` - Overall Health¶

Returns 200 OK if the application is running.

Use case: Basic "is the server alive" check

Response (Healthy):

{
  "status": "Healthy",
  "description": "Server is healthy",
  "timestamp": "2026-01-11T10:30:00.000Z"
}

Response (Unhealthy):

HTTP 503 Service Unavailable

Example:

curl http://localhost:5000/health
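
For scripted checks, the status codes above can be mapped to a simple verdict. The following is a minimal Python sketch and not part of Cesivi Server: the names `interpret_status` and `probe` are hypothetical, and treating any status other than 200 or 503 as "unknown" is an assumption, since this guide only documents those two codes.

```python
import urllib.error
import urllib.request
from typing import Optional


def interpret_status(status_code: Optional[int]) -> str:
    """Map an HTTP status code (None means no response) to a verdict."""
    if status_code is None:
        return "unresponsive"
    if status_code == 200:
        return "healthy"
    if status_code == 503:
        return "unhealthy"
    return "unknown"  # assumption: other codes are not defined by this guide


def probe(url: str, timeout: float = 3.0) -> str:
    """Request a health endpoint and interpret the result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return interpret_status(response.status)
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx/5xx responses such as 503.
        return interpret_status(error.code)
    except OSError:
        # URLError, refused connections, and timeouts are OSError subclasses.
        return interpret_status(None)
```

Calling `probe("http://localhost:5000/health")` against a running server should return "healthy"; the same function works for /ready and /live.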

`/ready` - Readiness Check¶

Checks if the server is ready to accept traffic (verifies dependencies: Redis, SQL Server, storage).

Use case: Load balancer routing decisions (don't route traffic if dependencies are down)

Response (Ready):

{
  "status": "Ready",
  "description": "All dependencies are healthy",
  "checks": {
    "redis": "Healthy",
    "sqlserver": "Healthy",
    "storage": "Healthy"
  },
  "timestamp": "2026-01-11T10:30:00.000Z"
}

Response (Not Ready):

HTTP 503 Service Unavailable
The JSON body lists the failed checks

Example:

curl http://localhost:5000/ready
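
When /ready returns 503, the failing dependency can be read from the response body. The sketch below assumes the 503 body keeps the same "checks" map as the 200 response shown above, with a value other than "Healthy" for each failed check. That shape, the "NotReady" status string, and the function name `failed_checks` are assumptions for illustration only.

```python
import json
from typing import List


def failed_checks(body: str) -> List[str]:
    """Return the names of dependency checks that are not "Healthy"."""
    payload = json.loads(body)
    checks = payload.get("checks", {})
    return sorted(name for name, state in checks.items() if state != "Healthy")


# Hypothetical 503 body, modeled on the documented 200 response.
sample = """
{
  "status": "NotReady",
  "checks": {"redis": "Unhealthy", "sqlserver": "Healthy", "storage": "Healthy"}
}
"""
print(failed_checks(sample))  # ['redis']
```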

Kubernetes Readiness Probe:

readinessProbe:
  httpGet:
    path: /ready
    port: 5000
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

`/live` - Liveness Check¶

Checks if the application is responsive (not deadlocked or hung).

Use case: Pod/container restart decisions (restart if app is hung)

Response (Live):

{
  "status": "Live",
  "description": "Server is responsive",
  "timestamp": "2026-01-11T10:30:00.000Z"
}

Response (Dead):

HTTP 503 Service Unavailable (or timeout)

Example:

curl http://localhost:5000/live

Kubernetes Liveness Probe:

livenessProbe:
  httpGet:
    path: /live
    port: 5000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
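
The probe settings determine how long a hung instance can stay in service before Kubernetes decides to restart it. The following is a rough Python estimate, assuming every probe against a hung process fails by running into the timeout; it ignores the container's termination grace period and startup time. The function name is hypothetical.

```python
from typing import Tuple


def restart_window_seconds(period: int, timeout: int, failure_threshold: int) -> Tuple[int, int]:
    """Estimate (fastest, slowest) seconds from a hang to the restart decision."""
    # Fastest: the hang starts just before a probe fires, so the first
    # failure is counted almost immediately.
    fastest = (failure_threshold - 1) * period + timeout
    # Slowest: the hang starts just after a successful probe, so a full
    # period passes before the first failing probe is sent.
    slowest = failure_threshold * period + timeout
    return fastest, slowest


# Values from the liveness probe above.
print(restart_window_seconds(period=10, timeout=3, failure_threshold=3))  # (23, 33)
```

With the settings shown, a hung instance is detected after roughly 23 to 33 seconds. Lowering periodSeconds shortens that window at the cost of more probe traffic.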

Prometheus Metrics¶

Cesivi Server exposes comprehensive Prometheus metrics at /metrics for monitoring performance, usage, and health.

Accessing Metrics¶

# View all metrics
curl http://localhost:5000/metrics

# Query specific metric
curl http://localhost:5000/metrics | grep spm_csom_requests_total
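
The /metrics output uses the Prometheus text exposition format, one sample per line, so it can also be inspected from a script. The sketch below sums one metric across all label combinations. It is a simplified parser rather than a full implementation of the format, and the sample lines (including the `status="success"` value) are made up for illustration.

```python
def sum_metric(exposition_text: str, metric_name: str) -> float:
    """Sum every sample of metric_name across all label combinations."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            name, _, rest = line.partition(" ")
        if name != metric_name:
            continue
        # The first field after the labels is the value; an optional
        # timestamp may follow it.
        total += float(rest.split()[0])
    return total


# Made-up sample output; real label values will differ.
sample = """\
# TYPE spm_csom_requests_total counter
spm_csom_requests_total{operation="query",status="success",server_id="spm-server-1"} 120
spm_csom_requests_total{operation="query",status="error",server_id="spm-server-1"} 3
spm_csom_requests_in_flight{server_id="spm-server-1"} 2
"""
print(sum_metric(sample, "spm_csom_requests_total"))  # 123.0
```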

Metric Categories¶

1. CSOM Request Metrics¶

Metric	Type	Labels	Description
`spm_csom_requests_total`	Counter	operation, status, server_id	Total CSOM requests processed
`spm_csom_request_duration_seconds`	Histogram	operation, server_id	CSOM request processing time
`spm_csom_requests_in_flight`	Gauge	server_id	Number of active CSOM requests

Example:

# CSOM request rate (requests per second)
rate(spm_csom_requests_total[5m])

# CSOM error rate
rate(spm_csom_requests_total{status="error"}[5m])

# CSOM request duration (95th percentile)
histogram_quantile(0.95, rate(spm_csom_request_duration_seconds_bucket[5m]))
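
The percentile query above is an estimate interpolated from cumulative bucket counts, not an exact measurement. The Python sketch below shows the general idea behind `histogram_quantile` under simplifying assumptions: linear interpolation inside a bucket and a lower bound of 0 for the first bucket. The bucket boundaries and counts are made up, and Prometheus itself handles more edge cases than this.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q (0 < q <= 1) from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite bound, since no upper limit is known.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up cumulative counts for buckets le=0.1, 0.5, 1.0, +Inf.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))  # approximately 0.8125
```

Because the result depends on where the bucket boundaries sit, a reported P95 is only as precise as the bucket that contains it.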

2. REST API Metrics¶

Metric	Type	Labels	Description
`spm_rest_requests_total`	Counter	endpoint, method, status, server_id	Total REST requests processed
`spm_rest_request_duration_seconds`	Histogram	endpoint, method, server_id	REST request processing time

Example:

# REST API request rate by endpoint
sum by (endpoint) (rate(spm_rest_requests_total[5m]))

# Slow REST endpoints (95th percentile > 1s)
histogram_quantile(0.95, sum by (le, endpoint) (rate(spm_rest_request_duration_seconds_bucket[5m]))) > 1

3. SOAP Service Metrics¶

Metric	Type	Labels	Description
`spm_soap_requests_total`	Counter	service, operation, status, server_id	Total SOAP requests processed
`spm_soap_request_duration_seconds`	Histogram	service, operation, server_id	SOAP request processing time

Example:

# SOAP request rate by service
sum by (service) (rate(spm_soap_requests_total[5m]))

4. Session Metrics¶

Metric	Type	Labels	Description
`spm_active_sessions_total`	Gauge	server_id	Number of active sessions
`spm_sessions_created_total`	Counter	server_id	Total sessions created
`spm_sessions_expired_total`	Counter	server_id	Total sessions expired

Example:

# Session churn rate
rate(spm_sessions_created_total[5m]) + rate(spm_sessions_expired_total[5m])

5. Cache Metrics¶

Metric	Type	Labels	Description
`spm_cache_hits_total`	Counter	cache_type, server_id	Total cache hits
`spm_cache_misses_total`	Counter	cache_type, server_id	Total cache misses
`spm_cache_hit_ratio`	Gauge	cache_type, server_id	Cache hit ratio (0.0 to 1.0)
`spm_cache_items_total`	Gauge	cache_type, server_id	Number of items in cache

Example:

# Cache hit ratio over time
spm_cache_hit_ratio

# Cache effectiveness by type
sum by (cache_type) (rate(spm_cache_hits_total[5m])) /
sum by (cache_type) (rate(spm_cache_hits_total[5m]) + rate(spm_cache_misses_total[5m]))
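
In the second query the per-second factor cancels out, so the result is the share of lookups served from cache during the window. The same arithmetic is shown below in Python for two scrapes of the hit and miss counters. The counter values are made up, and the sketch ignores counter resets, which `rate()` handles in Prometheus.

```python
from typing import Optional


def windowed_hit_ratio(hits_start: float, misses_start: float,
                       hits_end: float, misses_end: float) -> Optional[float]:
    """Hit ratio for the interval between two scrapes of the counters."""
    hits = hits_end - hits_start
    misses = misses_end - misses_start
    lookups = hits + misses
    # No lookups in the window means the ratio is undefined, not zero.
    return hits / lookups if lookups > 0 else None


# Made-up counter values: 90 new hits and 10 new misses in the window.
print(windowed_hit_ratio(1000, 200, 1090, 210))  # 0.9
```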

6. Storage Metrics¶

Metric	Type	Labels	Description
`spm_storage_operations_total`	Counter	operation, status, server_id	Total storage operations
`spm_storage_operation_duration_seconds`	Histogram	operation, server_id	Storage operation duration
`spm_storage_list_items_total`	Gauge	server_id	Total list items in storage
`spm_storage_files_total`	Gauge	server_id	Total files in storage
`spm_storage_size_bytes`	Gauge	server_id	Total storage size in bytes

Example:

# Storage operation latency
histogram_quantile(0.99, rate(spm_storage_operation_duration_seconds_bucket[5m]))

# Storage growth rate in bytes per second (deriv() because this metric is a gauge; rate() is for counters)
deriv(spm_storage_size_bytes[1h])

7. Health Check Metrics¶

Metric	Type	Labels	Description
`spm_health_check_status`	Gauge	check_name, server_id	Health check status (1=healthy, 0=unhealthy)
`spm_health_check_duration_seconds`	Histogram	check_name, server_id	Health check duration

Example:

# Unhealthy dependencies
spm_health_check_status == 0

8. Distributed State Metrics¶

Metric	Type	Labels	Description
`spm_distributed_lock_acquisitions_total`	Counter	lock_key, status, server_id	Lock acquisition attempts
`spm_distributed_lock_hold_duration_seconds`	Histogram	lock_key, server_id	Lock hold time
`spm_pubsub_messages_published_total`	Counter	channel, server_id	Pub/sub messages published
`spm_pubsub_messages_received_total`	Counter	channel, server_id	Pub/sub messages received

Example:

# Lock contention (timeouts)
rate(spm_distributed_lock_acquisitions_total{status="timeout"}[5m])

9. Authentication Metrics¶

Metric	Type	Labels	Description
`spm_authentication_attempts_total`	Counter	auth_type, status, server_id	Authentication attempts
`spm_authentication_duration_seconds`	Histogram	auth_type, server_id	Authentication processing time

Example:

# Authentication failure rate
rate(spm_authentication_attempts_total{status="failure"}[5m])

10. Server Info Metrics¶

Metric	Type	Labels	Description
`spm_server_info`	Gauge	version, environment, server_id, storage_provider, distributed_state_provider	Server metadata (always 1.0)

Example:

# Server inventory
spm_server_info

OpenTelemetry Distributed Tracing¶

Cesivi Server supports OpenTelemetry for distributed request tracing across services.

Configuration¶

appsettings.json:

{
  "OpenTelemetry": {
    "OtlpEndpoint": "http://jaeger:4318"
  }
}

Trace Sources¶

The following components are automatically instrumented:

ASP.NET Core - HTTP request/response traces
HTTP Client - Outgoing HTTP calls (plugins, remote event receivers)
CSOM Processor - CSOM request processing
REST/SOAP APIs - API endpoint traces

Custom Spans¶

Add custom spans to your code:

using Cesivi.Server.Observability;

// Start a custom activity
using var activity = CesiviActivitySource.Activity.StartActivity("CustomOperation");

// Add custom tags
activity?.SetTag("custom.key", "value");
activity?.SetTag("item.id", itemId);

// Process...

// Activity automatically ends when disposed

Viewing Traces¶

Jaeger UI: http://localhost:16686

Select the "Cesivi Server" service
Find traces by:
  Operation name
  Tags (e.g., http.url, sp.webapp)
  Duration
  Status code

Trace Context Propagation¶

Cesivi Server automatically propagates trace context across:

HTTP requests (W3C Trace Context headers)
Distributed state operations (Redis pub/sub)
Remote event receivers
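
For HTTP requests, W3C Trace Context carries the trace in a `traceparent` header with four dash-separated fields: version, trace ID, parent span ID, and flags. The Python sketch below parses that header, which can help when checking whether a downstream service received the context. It accepts only the four-field form used by version 00 and says nothing about how Cesivi Server carries context over Redis pub/sub. The function name is hypothetical.

```python
import re
from typing import Dict, Optional

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[Dict[str, object]]:
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest flag bit = sampled
    }


# Example header value from the W3C Trace Context specification.
parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(parsed["trace_id"], parsed["sampled"])
```

The trace ID printed here is the value to search for in the Jaeger UI.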

Configuration¶

appsettings.json¶

{
  "ServerMetrics": {
    "ServerId": "spm-server-1",
    "Environment": "production"
  },
  "OpenTelemetry": {
    "OtlpEndpoint": "http://jaeger:4318"
  },
  "Storage": {
    "Provider": "SqlServer"
  },
  "DistributedState": {
    "Provider": "Redis"
  }
}

Environment Variables¶

Variable	Purpose	Default
`CESIVI_SERVER_ID`	Server identifier for metrics	Machine name
`CESIVI_ENVIRONMENT`	Environment name (dev/staging/prod)	Development
`ASPNETCORE_ENVIRONMENT`	ASP.NET Core environment	Development
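
A deployment script can predict which `server_id` and environment values an instance will report by applying the defaults from the table. The Python sketch below mirrors only the table: how these variables interact with the `ServerMetrics` section of appsettings.json is not covered here, and treating an empty variable as unset is an assumption. The function name is hypothetical.

```python
import os
import socket
from typing import Dict, Mapping


def resolve_metric_identity(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Apply the documented defaults for the metric identity variables."""
    return {
        # Default: machine name.
        "server_id": env.get("CESIVI_SERVER_ID") or socket.gethostname(),
        # Default: Development.
        "environment": env.get("CESIVI_ENVIRONMENT") or "Development",
    }


print(resolve_metric_identity({"CESIVI_SERVER_ID": "spm-server-1"}))
```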

Docker Compose Setup¶

The docker-compose.multiserver.yml file includes the full observability stack.

Starting the Stack¶

# Start all services (3 Cesivi servers + Redis + SQL Server + Nginx + Prometheus + Jaeger)
docker-compose -f docker-compose.multiserver.yml up -d

# View logs
docker-compose -f docker-compose.multiserver.yml logs -f

# Stop services
docker-compose -f docker-compose.multiserver.yml down

Accessing Services¶

Service	URL	Purpose
Cesivi (via Nginx)	http://localhost:8080	Load-balanced access
Prometheus	http://localhost:9090	Metrics dashboard
Jaeger UI	http://localhost:16686	Trace visualization

Prometheus Queries¶

Prometheus UI: http://localhost:9090

Example queries:

# Total request rate across all servers
sum(rate(spm_csom_requests_total[5m]))

# Request rate by server
sum by (server_id) (rate(spm_csom_requests_total[5m]))

# Error rate
sum(rate(spm_csom_requests_total{status="error"}[5m]))

# Request duration (95th percentile)
histogram_quantile(0.95,
  sum by (le) (rate(spm_csom_request_duration_seconds_bucket[5m]))
)
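
The same queries can be run from a script through the Prometheus HTTP API (`/api/v1/query`). The Python sketch below builds the request URL and flattens an instant-vector response into a dictionary keyed by one label. The response body is a made-up sample in the shape the API returns, and both function names are hypothetical.

```python
import json
from typing import Dict
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def vector_by_label(response_body: str, label: str) -> Dict[str, float]:
    """Flatten an instant-vector response into {label value: sample value}."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = {}
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        result[series["metric"].get(label, "")] = float(value)
    return result


# Made-up response for the "request rate by server" query.
sample_response = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"server_id": "spm-server-1"}, "value": [1736591400, "4.2"]},
        {"metric": {"server_id": "spm-server-2"}, "value": [1736591400, "3.8"]},
    ]},
})
print(instant_query_url("http://localhost:9090",
                        "sum by (server_id) (rate(spm_csom_requests_total[5m]))"))
print(vector_by_label(sample_response, "server_id"))
```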

Kubernetes Setup¶

ServiceMonitor for Prometheus Operator¶

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cesivi-metrics
  namespace: cesivi
spec:
  selector:
    matchLabels:
      app: cesivi
  endpoints:
  - port: http
    path: /metrics
    interval: 15s

Deployment with Probes¶

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cesivi
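spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: cesivi
  template:
    metadata:
      labels:
        app: cesivi
    spec: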
      containers:
      - name: cesivi
        image: cesivi:latest
        ports:
        - containerPort: 5000
          name: http
        livenessProbe:
          httpGet:
            path: /live
            port: 5000
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 5000
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3

Grafana Dashboards¶

Installing Grafana¶

Docker Compose:

Add to docker-compose.multiserver.yml:

  grafana:
    image: grafana/grafana:10.2.3
    container_name: spm-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - prometheus
    networks:
      - spm-network

volumes:
  grafana-data:
    driver: local

Connecting to Prometheus¶

Open Grafana: http://localhost:3000
Login: admin / admin
Add Data Source:
Type: Prometheus
URL: http://prometheus:9090
Save & Test

Dashboard Panels¶

1. Request Rate¶

sum(rate(spm_csom_requests_total[5m]))

2. Error Rate¶

sum(rate(spm_csom_requests_total{status="error"}[5m])) /
sum(rate(spm_csom_requests_total[5m])) * 100

3. Request Duration (P95)¶

histogram_quantile(0.95,
  sum by (le) (rate(spm_csom_request_duration_seconds_bucket[5m]))
)

4. Active Sessions¶

sum(spm_active_sessions_total)

5. Cache Hit Ratio¶

spm_cache_hit_ratio

Importing Dashboards¶

Save the dashboard JSON to grafana-dashboards/cesivi.json and import it via the Grafana UI.

Troubleshooting¶

Metrics Not Appearing¶

Problem: /metrics endpoint returns empty or no metrics

Solutions:

Check if metrics are being recorded:

curl http://localhost:5000/metrics | grep spm_

Verify server ID is set:

curl http://localhost:5000/metrics | grep spm_server_info

Check Program.cs configuration:
  CesiviMetrics.InitializeServerInfo() is called
  app.UseMetricsMiddleware() is registered

Prometheus Not Scraping¶

Problem: Prometheus targets show "DOWN" status

Solutions:

Check target health in Prometheus:
  Open http://localhost:9090/targets
  Look for errors

Verify network connectivity:

docker exec spm-prometheus wget -O- http://spm-server-1:5000/metrics

Check Prometheus config:

docker exec spm-prometheus cat /etc/prometheus/prometheus.yml

Jaeger Not Receiving Traces¶

Problem: No traces appear in Jaeger UI

Solutions:

Verify OTLP endpoint configuration:

{
  "OpenTelemetry": {
    "OtlpEndpoint": "http://jaeger:4318"
  }
}

Check Jaeger logs:
```
docker logs spm-jaeger
```

Verify trace export:

# Check if traces are being sent
docker logs spm-server-1 | grep -i telemetry

Health Checks Failing¶

Problem: /ready returns 503

Solutions:

Check dependency health manually:

# Redis
docker exec spm-redis redis-cli ping

# SQL Server
docker exec spm-sqlserver /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P 'Cesivi2024!' -Q 'SELECT 1'

Check application logs:

docker logs spm-server-1 | grep -i health

Verify HealthCheckService configuration in Program.cs

High Memory Usage¶

Problem: Metrics collection causing high memory usage

Solutions:

Reduce metric cardinality:
  Limit the number of unique label values
  Use metric normalization (already implemented in MetricsMiddleware)

Adjust Prometheus retention:

command:
  - '--storage.tsdb.retention.time=7d'  # Default: 15d

Check for metric leaks:

curl http://localhost:5000/metrics | wc -l
# Should be <10,000 lines for normal operation
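
A raw line count does not show which metric is responsible for the growth. The Python sketch below counts series per metric name in the /metrics output so the largest contributors stand out. It is a simplified reading of the text format, the sample input is made up, and histogram series appear under their `_bucket`, `_sum`, and `_count` names.

```python
from collections import Counter


def series_per_metric(exposition_text: str) -> Counter:
    """Count how many series (sample lines) each metric name exposes."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or space (no labels).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


# Made-up sample; in practice pass the body returned by /metrics.
sample = """\
spm_rest_requests_total{endpoint="/a",method="GET",status="200",server_id="s1"} 5
spm_rest_requests_total{endpoint="/b",method="GET",status="200",server_id="s1"} 7
spm_rest_requests_total{endpoint="/c",method="GET",status="200",server_id="s1"} 1
spm_active_sessions_total{server_id="s1"} 4
"""
print(series_per_metric(sample).most_common(1))  # [('spm_rest_requests_total', 3)]
```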

Best Practices¶

1. Metric Naming¶

Use a consistent prefix: spm_
Use descriptive names: spm_csom_request_duration_seconds (not spm_req_dur)
Include units and type suffixes in the name: _seconds and _bytes for units, _total for counters

2. Label Cardinality¶

Keep label values bounded (don't use unbounded IDs)
Use normalization to reduce cardinality
Avoid high-cardinality labels like user_id, item_id
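
Normalization here means collapsing variable parts of a label value into placeholders before the metric is recorded. The Python sketch below illustrates the idea for an `endpoint` label. It is not the MetricsMiddleware implementation; the paths, the placeholder names, and the function name are hypothetical.

```python
import re

_GUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
_NUMBER = re.compile(r"(?<=/)\d+(?=/|$)")  # a path segment made only of digits


def normalize_endpoint(path: str) -> str:
    """Collapse IDs in a request path so the label stays low-cardinality."""
    path = path.split("?", 1)[0]      # drop the query string
    path = _GUID.sub("{guid}", path)  # GUIDs first, since they contain digits
    return _NUMBER.sub("{id}", path)


print(normalize_endpoint("/api/lists/42/items/1001?select=Title"))
# /api/lists/{id}/items/{id}
```

With this in place, thousands of distinct item URLs map to a single label value, so the number of series stays bounded by the number of routes.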

3. Health Check Design¶

/health - Fast, no dependencies (basic liveness)
/ready - Checks dependencies (routing decision)
/live - Checks responsiveness (restart decision)

4. Alert Rules¶

Create Prometheus alert rules for critical metrics:

groups:
  - name: cesivi
    rules:
    - alert: HighErrorRate
      expr: rate(spm_csom_requests_total{status="error"}[5m]) > 0.1
      for: 5m
      annotations:
        summary: "High CSOM error rate"

    - alert: SlowRequests
      expr: histogram_quantile(0.95, rate(spm_csom_request_duration_seconds_bucket[5m])) > 5
      for: 5m
      annotations:
        summary: "CSOM requests are slow"

Additional Resources¶

Prometheus Documentation: https://prometheus.io/docs/
Grafana Documentation: https://grafana.com/docs/
OpenTelemetry Documentation: https://opentelemetry.io/docs/
Jaeger Documentation: https://www.jaegertracing.io/docs/

For questions or issues, please file a GitHub issue or contact the Cesivi team.
