Cesivi Server - Observability Guide
Version: 1.0
Last Updated: 2026-01-11
Author: Cesivi Team
Table of Contents
- Overview
- Observability Stack
- Health Checks
- Prometheus Metrics
- OpenTelemetry Distributed Tracing
- Configuration
- Docker Compose Setup
- Kubernetes Setup
- Grafana Dashboards
- Troubleshooting
- Best Practices
- Additional Resources
Overview
Cesivi Server provides comprehensive observability for production deployments through:
- Health Checks - /health, /ready, /live endpoints for load balancer probes
- Prometheus Metrics - Detailed performance and usage metrics at /metrics
- OpenTelemetry Tracing - Distributed request tracing across services
- Graceful Shutdown - Clean termination with request draining
This guide covers setup, configuration, and best practices for monitoring Cesivi Server in production.
Observability Stack
Components
| Component | Purpose | Port | URL |
|---|---|---|---|
| Cesivi Server | Main application | 5000 | http://localhost:5000 |
| Health Endpoints | Load balancer probes | 5000 | /health, /ready, /live |
| Metrics Endpoint | Prometheus scraping | 5000 | /metrics |
| Prometheus | Metrics collection | 9090 | http://localhost:9090 |
| Jaeger | Trace visualization | 16686 | http://localhost:16686 |
| Grafana (optional) | Dashboards | 3000 | http://localhost:3000 |
Architecture
┌──────────────────────────────────────────────────────────┐
│                     Monitoring Stack                     │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌─────────────┐   Scrape /metrics    ┌─────────────┐    │
│  │ Prometheus  │◄─────────────────────┤ SPM Server  │    │
│  │   (9090)    │      Every 15s       │   (5000)    │    │
│  └──────┬──────┘                      └──────┬──────┘    │
│         │                                    │           │
│         │ Query                  Send traces │           │
│         ▼                                    ▼           │
│  ┌─────────────┐                      ┌─────────────┐    │
│  │   Grafana   │                      │   Jaeger    │    │
│  │   (3000)    │                      │   (16686)   │    │
│  └─────────────┘                      └─────────────┘    │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │                   Load Balancer                    │  │
│  │  - Probes /health, /ready, /live every 10s         │  │
│  │  - Removes unhealthy instances from rotation       │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
└──────────────────────────────────────────────────────────┘
Health Checks
Cesivi Server exposes three health check endpoints for different purposes:
/health - Overall Health
Returns 200 OK if the application is running.
Use case: Basic "is the server alive" check
Response (Healthy):
{
  "status": "Healthy",
  "description": "Server is healthy",
  "timestamp": "2026-01-11T10:30:00.000Z"
}
Response (Unhealthy):
- HTTP 503 Service Unavailable
Example:
curl http://localhost:5000/health
/ready - Readiness Check
Checks if the server is ready to accept traffic (verifies dependencies: Redis, SQL Server, storage).
Use case: Load balancer routing decisions (don't route traffic if dependencies are down)
Response (Ready):
{
  "status": "Ready",
  "description": "All dependencies are healthy",
  "checks": {
    "redis": "Healthy",
    "sqlserver": "Healthy",
    "storage": "Healthy"
  },
  "timestamp": "2026-01-11T10:30:00.000Z"
}
Response (Not Ready):
- HTTP 503 Service Unavailable
- JSON body includes failed checks
Example:
curl http://localhost:5000/ready
Kubernetes Readiness Probe:
readinessProbe:
  httpGet:
    path: /ready
    port: 5000
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
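From a client's point of view, the /ready contract above can be sketched in Python. This is an illustration, not part of Cesivi Server: the function name `failed_checks` is hypothetical, and it assumes that the 503 body carries the same "checks" object as the 200 body, which the guide implies but does not show.

```python
import json


def failed_checks(status_code, body):
    """Return the names of dependency checks that are not "Healthy".

    Assumption: a 503 body carries the same "checks" object as the 200 body.
    """
    if status_code == 200:
        return []
    try:
        checks = json.loads(body)["checks"]
        return sorted(name for name, state in checks.items() if state != "Healthy")
    except (ValueError, KeyError, TypeError, AttributeError):
        # No usable body (proxy error page, truncated response): flag the endpoint itself.
        return ["<unparseable /ready response>"]


# Hypothetical 503 payload, modelled on the documented 200 response.
sample_503 = json.dumps({
    "status": "Unhealthy",
    "checks": {"redis": "Unhealthy", "sqlserver": "Healthy", "storage": "Healthy"},
})
print(failed_checks(503, sample_503))  # ['redis']
```

Treating an unparseable body as a failure keeps the caller from mistaking a proxy error page for a healthy instance.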
/live - Liveness Check
Checks if the application is responsive (not deadlocked or hung).
Use case: Pod/container restart decisions (restart if app is hung)
Response (Live):
{
  "status": "Live",
  "description": "Server is responsive",
  "timestamp": "2026-01-11T10:30:00.000Z"
}
Response (Dead):
- HTTP 503 Service Unavailable (or timeout)
Example:
curl http://localhost:5000/live
Kubernetes Liveness Probe:
livenessProbe:
  httpGet:
    path: /live
    port: 5000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
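The probe settings interact: Kubernetes acts only after failureThreshold consecutive failures, so a hung instance is detected roughly periodSeconds × failureThreshold after it stops responding. The sketch below simulates that rule in Python. The function name is hypothetical, and the model is simplified: it ignores timeouts and scheduling jitter.

```python
def first_failure_time(results, initial_delay=30, period=10, failure_threshold=3):
    """Return the time (seconds after container start) at which the probe is
    considered failed, or None if the threshold is never reached.

    results[i] is the outcome of the i-th probe. Probes are assumed to run at
    initial_delay, initial_delay + period, and so on.
    """
    consecutive = 0
    for i, ok in enumerate(results):
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= failure_threshold:
            return initial_delay + i * period
    return None


# Isolated failures do not trip the probe; three in a row do.
print(first_failure_time([True, False, True, False, False, False]))  # 80
```

With the liveness values above, a single slow response never restarts the pod, while a genuine hang is acted on within about 30 seconds.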
Prometheus Metrics
Cesivi Server exposes Prometheus metrics at /metrics that cover performance, usage, and health.
Accessing Metrics
# View all metrics
curl http://localhost:5000/metrics
# Query specific metric
curl http://localhost:5000/metrics | grep spm_csom_requests_total
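When grep is not enough, the text returned by /metrics can be parsed into structured samples. The following Python sketch handles the common case of the Prometheus text exposition format (one sample per line, optional labels in braces). The sample payload and the "GetItems" label value are invented for illustration, and the parser makes no attempt to cover every escaping rule.

```python
import re

SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text, prefix="spm_"):
    """Yield (name, labels, value) for samples whose name starts with prefix."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines, HELP and TYPE
            continue
        match = SAMPLE_LINE.match(line)
        if not match or not match.group(1).startswith(prefix):
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_PAIR.findall(raw_labels or ""))
        yield name, labels, float(value)


# Invented payload in the exposition format, using metric names from this guide.
SAMPLE = """\
# HELP spm_csom_requests_total Total CSOM requests processed
# TYPE spm_csom_requests_total counter
spm_csom_requests_total{operation="GetItems",status="success",server_id="spm-server-1"} 1027
spm_csom_requests_in_flight{server_id="spm-server-1"} 3
process_cpu_seconds_total 12.5
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels.get("server_id"), value)
```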
Metric Categories
1. CSOM Request Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| spm_csom_requests_total | Counter | operation, status, server_id | Total CSOM requests processed |
| spm_csom_request_duration_seconds | Histogram | operation, server_id | CSOM request processing time |
| spm_csom_requests_in_flight | Gauge | server_id | Number of active CSOM requests |
Example:
# CSOM request rate (requests per second)
rate(spm_csom_requests_total[5m])
# CSOM error rate
rate(spm_csom_requests_total{status="error"}[5m])
# CSOM request duration (95th percentile)
histogram_quantile(0.95, rate(spm_csom_request_duration_seconds_bucket[5m]))
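The 95th-percentile query is an estimate, not an exact value: histogram_quantile interpolates linearly inside the bucket that contains the requested rank. The Python sketch below reproduces that interpolation for classic cumulative buckets. The bucket boundaries and counts are invented, since the guide does not list the bucket layout Cesivi Server uses.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Follows the linear interpolation Prometheus applies to classic histograms.
    The highest bucket must be the +Inf bucket.
    """
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]) or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    lower_bound, count_below = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = cumulative - count_below
            if in_bucket == 0:
                return upper_bound
            return lower_bound + (upper_bound - lower_bound) * (rank - count_below) / in_bucket
        lower_bound, count_below = upper_bound, cumulative
    return math.nan


# Invented bucket counts: 50 requests <= 0.1s, 90 <= 0.5s, 98 <= 1s, 100 total.
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, example))
```

Because the result depends on bucket boundaries, a P95 that sits near a bucket edge can be off by most of that bucket's width.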
2. REST API Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| spm_rest_requests_total | Counter | endpoint, method, status, server_id | Total REST requests processed |
| spm_rest_request_duration_seconds | Histogram | endpoint, method, server_id | REST request processing time |
Example:
# REST API request rate by endpoint
sum by (endpoint) (rate(spm_rest_requests_total[5m]))
# Slow REST endpoints (P95 > 1s)
histogram_quantile(0.95, sum by (le, endpoint) (rate(spm_rest_request_duration_seconds_bucket[5m]))) > 1
3. SOAP Service Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| spm_soap_requests_total | Counter | service, operation, status, server_id | Total SOAP requests processed |
| spm_soap_request_duration_seconds | Histogram | service, operation, server_id | SOAP request processing time |
Example:
# SOAP request rate by service
sum by (service) (rate(spm_soap_requests_total[5m]))
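To make the query above concrete, the Python sketch below computes a per-second rate from raw counter samples and then sums the rates per service. It is a simplification: the real rate() function also extrapolates to the edges of the range window, which this sketch does not. The service and operation names are invented.

```python
from collections import defaultdict


def increase(samples):
    """Total increase over (timestamp, value) samples, tolerating counter resets."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again at 0.
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    span = samples[-1][0] - samples[0][0]
    return increase(samples) / span if span > 0 else 0.0


def sum_by(series, label):
    """Group series by one label and add up their rates, like sum by (label)."""
    totals = defaultdict(float)
    for labels, samples in series:
        totals[labels[label]] += simple_rate(samples)
    return dict(totals)


# Invented series: (labels, [(timestamp_seconds, counter_value), ...])
series = [
    ({"service": "Lists", "operation": "GetListItems"}, [(0, 100), (60, 160), (120, 220)]),
    ({"service": "Lists", "operation": "UpdateListItems"}, [(0, 10), (60, 40), (120, 70)]),
    ({"service": "Webs", "operation": "GetWeb"}, [(0, 500), (60, 530), (120, 12)]),
]
print(sum_by(series, "service"))
```

The third series drops from 530 to 12, which the sketch treats as a restart rather than a negative rate.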
4. Session Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| spm_active_sessions_total | Gauge | server_id | Number of active sessions |
| spm_sessions_created_total | Counter | server_id | Total sessions created |
| spm_sessions_expired_total | Counter | server_id | Total sessions expired |
Example:
# Session churn rate
rate(spm_sessions_created_total[5m]) + rate(spm_sessions_expired_total[5m])
5. Cache Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| spm_cache_hits_total | Counter | cache_type, server_id | Total cache hits |
| spm_cache_misses_total | Counter | cache_type, server_id | Total cache misses |
| spm_cache_hit_ratio | Gauge | cache_type, server_id | Cache hit ratio (0.0 to 1.0) |
| spm_cache_items_total | Gauge | cache_type, server_id | Number of items in cache |
Example:
# Cache hit ratio over time
spm_cache_hit_ratio
# Cache effectiveness by type
sum by (cache_type) (rate(spm_cache_hits_total[5m])) /
sum by (cache_type) (rate(spm_cache_hits_total[5m]) + rate(spm_cache_misses_total[5m]))
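The second query derives the hit ratio from the two counters instead of reading the gauge, so it reflects only traffic inside the window. The arithmetic is sketched below in Python with invented counter readings. The helper name is hypothetical.

```python
def windowed_hit_ratio(hits_then, misses_then, hits_now, misses_now):
    """Hit ratio for the interval between two counter readings.

    Returns None when the cache saw no lookups in the interval, where the
    PromQL division would yield NaN.
    """
    new_hits = hits_now - hits_then
    new_misses = misses_now - misses_then
    lookups = new_hits + new_misses
    return new_hits / lookups if lookups > 0 else None


# Invented readings taken five minutes apart: 90 new hits, 10 new misses.
print(windowed_hit_ratio(100, 50, 190, 60))  # 0.9
```

Here the lifetime ratio (190 of 250 lookups, 0.76) understates how well the cache is doing right now, which is the reason to prefer the windowed form for dashboards.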
6. Storage Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| spm_storage_operations_total | Counter | operation, status, server_id | Total storage operations |
| spm_storage_operation_duration_seconds | Histogram | operation, server_id | Storage operation duration |
| spm_storage_list_items_total | Gauge | server_id | Total list items in storage |
| spm_storage_files_total | Gauge | server_id | Total files in storage |
| spm_storage_size_bytes | Gauge | server_id | Total storage size in bytes |
Example:
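# Storage operation latency (99th percentile)
histogram_quantile(0.99, rate(spm_storage_operation_duration_seconds_bucket[5m]))
# Storage growth rate in bytes per second (deriv, not rate, because spm_storage_size_bytes is a gauge)
deriv(spm_storage_size_bytes[1h])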
7. Health Check Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| spm_health_check_status | Gauge | check_name, server_id | Health check status (1=healthy, 0=unhealthy) |
| spm_health_check_duration_seconds | Histogram | check_name, server_id | Health check duration |
Example:
# Unhealthy dependencies
spm_health_check_status == 0
8. Distributed State Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| spm_distributed_lock_acquisitions_total | Counter | lock_key, status, server_id | Lock acquisition attempts |
| spm_distributed_lock_hold_duration_seconds | Histogram | lock_key, server_id | Lock hold time |
| spm_pubsub_messages_published_total | Counter | channel, server_id | Pub/sub messages published |
| spm_pubsub_messages_received_total | Counter | channel, server_id | Pub/sub messages received |
Example:
# Lock contention (timeouts)
rate(spm_distributed_lock_acquisitions_total{status="timeout"}[5m])
9. Authentication Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| spm_authentication_attempts_total | Counter | auth_type, status, server_id | Authentication attempts |
| spm_authentication_duration_seconds | Histogram | auth_type, server_id | Authentication processing time |
Example:
# Authentication failure rate
rate(spm_authentication_attempts_total{status="failure"}[5m])
10. Server Info Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| spm_server_info | Gauge | version, environment, server_id, storage_provider, distributed_state_provider | Server metadata (value is always 1; the information is in the labels) |
Example:
# Server inventory
spm_server_info
OpenTelemetry Distributed Tracing
Cesivi Server supports OpenTelemetry for distributed request tracing across services.
Configuration
appsettings.json:
{
  "OpenTelemetry": {
    "OtlpEndpoint": "http://jaeger:4318"
  }
}
Trace Sources
The following components are automatically instrumented:
- ASP.NET Core - HTTP request/response traces
- HTTP Client - Outgoing HTTP calls (plugins, remote event receivers)
- CSOM Processor - CSOM request processing
- REST/SOAP APIs - API endpoint traces
Custom Spans
Add custom spans to your code:
using Cesivi.Server.Observability;
// Start a custom activity
using var activity = CesiviActivitySource.Activity.StartActivity("CustomOperation");
// Add custom tags
activity?.SetTag("custom.key", "value");
activity?.SetTag("item.id", itemId);
// Process...
// Activity automatically ends when disposed
Viewing Traces
Jaeger UI: http://localhost:16686
- Select the "Cesivi Server" service
- Find traces by:
  - Operation name
  - Tags (e.g., http.url, sp.webapp)
  - Duration
  - Status code
Trace Context Propagation
Cesivi Server automatically propagates trace context across:
- HTTP requests (W3C Trace Context headers)
- Distributed state operations (Redis pub/sub)
- Remote event receivers
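For HTTP, W3C Trace Context means a traceparent header of the form version-traceid-parentid-flags. The Python sketch below parses such a header and derives the header a downstream call would carry. It illustrates the header format only and says nothing about how Cesivi Server implements propagation. The function names are hypothetical, and only version 00 is accepted.

```python
import re
import secrets

TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return trace context fields from a version-00 traceparent header, or None."""
    match = TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid; other versions are out of scope for this sketch.
    if version != "00" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming):
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        return None
    flags = "01" if ctx["sampled"] else "00"
    return "00-{}-{}-{}".format(ctx["trace_id"], secrets.token_hex(8), flags)


# Example header value taken from the W3C Trace Context specification.
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(incoming))
print(child_traceparent(incoming))
```

The trace ID staying constant across hops is what lets Jaeger stitch spans from several servers into one trace.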
Configuration
appsettings.json
{
  "ServerMetrics": {
    "ServerId": "spm-server-1",
    "Environment": "production"
  },
  "OpenTelemetry": {
    "OtlpEndpoint": "http://jaeger:4318"
  },
  "Storage": {
    "Provider": "SqlServer"
  },
  "DistributedState": {
    "Provider": "Redis"
  }
}
Environment Variables
| Variable | Purpose | Default |
|---|---|---|
| CESIVI_SERVER_ID | Server identifier for metrics | Machine name |
| CESIVI_ENVIRONMENT | Environment name (dev/staging/prod) | Development |
| ASPNETCORE_ENVIRONMENT | ASP.NET Core environment | Development |
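The defaults in the table amount to a simple fallback rule, sketched below in Python for use in deployment scripts or tests. The function name is hypothetical. How these variables interact with the ServerMetrics section of appsettings.json is not stated in this guide, so the sketch covers the environment variables only.

```python
import os
import socket


def resolve_server_identity(env=None):
    """Apply the documented defaults for the two Cesivi-specific variables.

    env defaults to the process environment; pass a dict to test other values.
    """
    env = os.environ if env is None else env
    return {
        "server_id": env.get("CESIVI_SERVER_ID") or socket.gethostname(),
        "environment": env.get("CESIVI_ENVIRONMENT") or "Development",
    }


print(resolve_server_identity({"CESIVI_SERVER_ID": "spm-server-1"}))
```

Setting CESIVI_SERVER_ID explicitly is worth the effort in containers, where the machine name is usually a random container ID that changes on every restart and would create a new server_id label value each time.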
Docker Compose Setup
The docker-compose.multiserver.yml file includes the full observability stack.
Starting the Stack
# Start all services (3 SPM + Redis + SQL + Nginx + Prometheus + Jaeger)
docker-compose -f docker-compose.multiserver.yml up -d
# View logs
docker-compose -f docker-compose.multiserver.yml logs -f
# Stop services
docker-compose -f docker-compose.multiserver.yml down
Accessing Services
| Service | URL | Purpose |
|---|---|---|
| Cesivi (via Nginx) | http://localhost:8080 | Load-balanced access |
| Prometheus | http://localhost:9090 | Metrics dashboard |
| Jaeger UI | http://localhost:16686 | Trace visualization |
Prometheus Queries
Prometheus UI: http://localhost:9090
Example queries:
# Total request rate across all servers
sum(rate(spm_csom_requests_total[5m]))
# Request rate by server
sum by (server_id) (rate(spm_csom_requests_total[5m]))
# Error rate
sum(rate(spm_csom_requests_total{status="error"}[5m]))
# Request duration (95th percentile)
histogram_quantile(0.95,
sum by (le) (rate(spm_csom_request_duration_seconds_bucket[5m]))
)
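The same queries can be run from scripts through the Prometheus HTTP API, whose instant-query endpoint is /api/v1/query. The Python sketch below builds the request URL and flattens an instant-vector response into a dictionary. The response body shown is invented sample data in the documented response shape, and the helper names are hypothetical. Fetching the URL (for example with urllib.request.urlopen) is left out so the block runs without a live Prometheus.

```python
import json
import urllib.parse


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return "{}/api/v1/query?{}".format(base_url.rstrip("/"), query)


def vector_to_dict(response_body, label="server_id"):
    """Map one label's value to the sample value for an instant-vector result."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query failed: {}".format(payload.get("error", "unknown error")))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector, got {}".format(data["resultType"]))
    # Each sample's value is [unix_timestamp, "value as string"].
    return {s["metric"].get(label, ""): float(s["value"][1]) for s in data["result"]}


PROMQL = "sum by (server_id) (rate(spm_csom_requests_total[5m]))"

# Invented response for the query above.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"server_id": "spm-server-1"}, "value": [1768127400, "12.5"]},
            {"metric": {"server_id": "spm-server-2"}, "value": [1768127400, "9"]},
        ],
    },
})

print(build_query_url("http://localhost:9090", PROMQL))
print(vector_to_dict(SAMPLE_RESPONSE))
```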
Kubernetes Setup
ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cesivi-metrics
  namespace: cesivi
spec:
  selector:
    matchLabels:
      app: cesivi
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
Deployment with Probes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cesivi
spec:
  replicas: 3
  selector:              # required by apps/v1; must match the pod template labels
    matchLabels:
      app: cesivi
  template:
    metadata:
      labels:
        app: cesivi
    spec:
      containers:
        - name: cesivi
          image: cesivi:latest
          ports:
            - containerPort: 5000
              name: http
          livenessProbe:
            httpGet:
              path: /live
              port: 5000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 5000
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
Grafana Dashboards
Installing Grafana
Docker Compose:
Add to docker-compose.multiserver.yml (merge into the existing services and volumes sections):
services:
  grafana:
    image: grafana/grafana:10.2.3
    container_name: spm-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin  # change for anything beyond local testing
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - prometheus
    networks:
      - spm-network
volumes:
  grafana-data:
    driver: local
Connecting to Prometheus
- Open Grafana: http://localhost:3000
- Log in: admin / admin
- Add Data Source:
  - Type: Prometheus
  - URL: http://prometheus:9090
  - Save & Test
Dashboard Panels
1. Request Rate
sum(rate(spm_csom_requests_total[5m]))
2. Error Rate
sum(rate(spm_csom_requests_total{status="error"}[5m])) /
sum(rate(spm_csom_requests_total[5m])) * 100
3. Request Duration (P95)
histogram_quantile(0.95,
sum by (le) (rate(spm_csom_request_duration_seconds_bucket[5m]))
)
4. Active Sessions
sum(spm_active_sessions_total)
5. Cache Hit Ratio
spm_cache_hit_ratio
Importing Dashboards
Save dashboard JSON to grafana-dashboards/cesivi.json and import via Grafana UI.
Troubleshooting
Metrics Not Appearing
Problem: The /metrics endpoint returns an empty response or no spm_ metrics
Solutions:
- Check if metrics are being recorded:
  curl http://localhost:5000/metrics | grep spm_
- Verify server ID is set:
  curl http://localhost:5000/metrics | grep spm_server_info
- Check Program.cs configuration:
  - CesiviMetrics.InitializeServerInfo() is called
  - app.UseMetricsMiddleware() is registered
Prometheus Not Scraping
Problem: Prometheus targets show "DOWN" status
Solutions:
- Check target health in Prometheus:
  - Open http://localhost:9090/targets
  - Look for errors
- Verify network connectivity:
  docker exec spm-prometheus wget -O- http://spm-server-1:5000/metrics
- Check Prometheus config:
  docker exec spm-prometheus cat /etc/prometheus/prometheus.yml
Jaeger Not Receiving Traces
Problem: No traces appear in Jaeger UI
Solutions:
- Verify OTLP endpoint configuration. Port 4318 is the OTLP/HTTP port and 4317 is the OTLP/gRPC port, so the port must match the protocol the exporter uses:
  { "OpenTelemetry": { "OtlpEndpoint": "http://jaeger:4318" } }
- Check Jaeger logs:
  docker logs spm-jaeger
- Verify trace export:
  # Check if traces are being sent
  docker logs spm-server-1 | grep -i telemetry
Health Checks Failing
Problem: /ready returns 503
Solutions:
- Check dependency health manually:
  # Redis
  docker exec spm-redis redis-cli ping
  # SQL Server
  docker exec spm-sqlserver /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P 'Cesivi2024!' -Q 'SELECT 1'
- Check application logs:
  docker logs spm-server-1 | grep -i health
- Verify HealthCheckService configuration in Program.cs
High Memory Usage
Problem: Metrics collection causing high memory usage
Solutions:
- Reduce metric cardinality:
  - Limit number of unique label values
  - Use metric normalization (already implemented in MetricsMiddleware)
- Adjust Prometheus retention:
  command:
    - '--storage.tsdb.retention.time=7d'  # Default: 15d
- Check for metric leaks:
  curl http://localhost:5000/metrics | wc -l
  # Should be <10,000 lines for normal operation
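A raw line count says that something is large but not what. The Python sketch below counts series per metric name in the /metrics text, which usually points straight at the label responsible. The function name is hypothetical and the sample payload is invented. Note that each histogram contributes separate _bucket, _sum and _count names.

```python
import re
from collections import Counter

METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def series_per_metric(text):
    """Count exposed series (one per sample line) for each metric name."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines, HELP and TYPE
            continue
        match = METRIC_NAME.match(line)
        if match:
            counts[match.group(0)] += 1
    return counts


# Invented payload: one metric has a series per endpoint value.
SAMPLE_METRICS = """\
# TYPE spm_rest_requests_total counter
spm_rest_requests_total{endpoint="/api/items/1",method="GET",status="200",server_id="spm-server-1"} 4
spm_rest_requests_total{endpoint="/api/items/2",method="GET",status="200",server_id="spm-server-1"} 2
spm_rest_requests_total{endpoint="/api/items/3",method="GET",status="200",server_id="spm-server-1"} 7
spm_active_sessions_total{server_id="spm-server-1"} 12
"""

for name, count in series_per_metric(SAMPLE_METRICS).most_common(5):
    print(count, name)
```

In the sample, the endpoint label holds raw item IDs, which is the unbounded-label pattern this section warns about.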
Best Practices
1. Metric Naming
- Use consistent prefix: spm_
- Use descriptive names: spm_csom_request_duration_seconds (not spm_req_dur)
- Include the unit or type suffix in the name: _seconds, _bytes, _total (for counters)
2. Label Cardinality
- Keep label values bounded (don't use unbounded IDs)
- Use normalization to reduce cardinality
- Avoid high-cardinality labels like user_id, item_id
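Normalization here means collapsing the variable parts of a value before it becomes a label. The Python sketch below shows the idea for request paths. It is not the logic in MetricsMiddleware, which this guide does not show, and the paths, placeholders and function name are all invented for illustration.

```python
import re

GUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
NUMERIC_SEGMENT = re.compile(r"(?<=/)\d+(?=/|$)")


def normalize_endpoint(path):
    """Replace IDs in a request path with placeholders to bound label values."""
    path = path.split("?", 1)[0]           # query strings never belong in labels
    path = GUID.sub("{guid}", path)        # GUIDs first, they contain digits too
    return NUMERIC_SEGMENT.sub("{id}", path)


# Hypothetical request paths.
requests = [
    "/api/items/12345/versions/7?select=Title",
    "/api/items/67890/versions/2",
    "/api/lists/3f2504e0-4f89-11d3-9a0c-0305e82c3301/items",
]
print(sorted({normalize_endpoint(p) for p in requests}))
```

Three distinct paths collapse to two label values, and the count stays at two no matter how many items or lists exist.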
3. Health Check Design
- /health - Fast, no dependencies (basic liveness)
- /ready - Checks dependencies (routing decision)
- /live - Checks responsiveness (restart decision)
4. Alert Rules
Create Prometheus alert rules for critical metrics:
groups:
  - name: cesivi
    rules:
      - alert: HighErrorRate
        expr: rate(spm_csom_requests_total{status="error"}[5m]) > 0.1
        for: 5m
        annotations:
          summary: "High CSOM error rate"
      - alert: SlowRequests
        expr: histogram_quantile(0.95, rate(spm_csom_request_duration_seconds_bucket[5m])) > 5
        for: 5m
        annotations:
          summary: "CSOM requests are slow"
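The for: 5m clause delays firing until the expression has been true continuously for five minutes, and any evaluation below the threshold resets the clock. The Python sketch below simulates that state machine. It assumes a 60-second evaluation interval, which these rules do not specify, and the function name is hypothetical.

```python
def alert_states(values, threshold, for_seconds=300, interval_seconds=60):
    """Return the alert state at each evaluation: inactive, pending or firing."""
    states = []
    active_since = None
    for i, value in enumerate(values):
        now = i * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now      # first breach: alert becomes pending
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None         # any recovery resets the timer
            states.append("inactive")
    return states


# Error rate stays above 0.1 for six evaluations: fires on the sixth.
print(alert_states([0.2] * 6, threshold=0.1))
# A single dip restarts the pending period.
print(alert_states([0.2, 0.2, 0.05, 0.2], threshold=0.1))
```

This is why short error spikes never page anyone with these rules, and also why a sustained problem takes at least five minutes to surface.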
Additional Resources
- Prometheus Documentation: https://prometheus.io/docs/
- Grafana Documentation: https://grafana.com/docs/
- OpenTelemetry Documentation: https://opentelemetry.io/docs/
- Jaeger Documentation: https://www.jaegertracing.io/docs/
For questions or issues, please file a GitHub issue or contact the Cesivi team.