
Cesivi Server - Observability Guide

Version: 1.0
Last Updated: 2026-01-11
Author: Cesivi Team


Table of Contents

  1. Overview
  2. Observability Stack
  3. Health Checks
  4. Prometheus Metrics
  5. OpenTelemetry Distributed Tracing
  6. Configuration
  7. Docker Compose Setup
  8. Kubernetes Setup
  9. Grafana Dashboards
  10. Troubleshooting
  11. Best Practices
  12. Additional Resources

Overview

Cesivi Server provides comprehensive observability for production deployments through:

  • Health Checks - /health, /ready, /live endpoints for load balancer probes
  • Prometheus Metrics - Detailed performance and usage metrics at /metrics
  • OpenTelemetry Tracing - Distributed request tracing across services
  • Graceful Shutdown - Clean termination with request draining

This guide covers setup, configuration, and best practices for monitoring Cesivi Server in production.


Observability Stack

Components

Component          | Purpose              | Port  | URL
Cesivi Server      | Main application     | 5000  | http://localhost:5000
Health Endpoints   | Load balancer probes | 5000  | /health, /ready, /live
Metrics Endpoint   | Prometheus scraping  | 5000  | /metrics
Prometheus         | Metrics collection   | 9090  | http://localhost:9090
Jaeger             | Trace visualization  | 16686 | http://localhost:16686
Grafana (optional) | Dashboards           | 3000  | http://localhost:3000

Architecture

┌──────────────────────────────────────────────────────────┐
│ Monitoring Stack                                         │
├──────────────────────────────────────────────────────────┤
│                                                           │
│  ┌─────────────┐    Scrape /metrics    ┌─────────────┐  │
│  │ Prometheus  │◄─────────────────────┤ SPM Server │  │
│  │   (9090)    │   Every 15s           │   (5000)    │  │
│  └──────┬──────┘                        └──────┬──────┘  │
│         │                                      │         │
│         │ Query                        Send traces       │
│         ▼                                      ▼         │
│  ┌─────────────┐                        ┌─────────────┐  │
│  │  Grafana    │                        │   Jaeger    │  │
│  │   (3000)    │                        │  (16686)    │  │
│  └─────────────┘                        └─────────────┘  │
│                                                           │
│  ┌────────────────────────────────────────────────────┐  │
│  │ Load Balancer                                      │  │
│  │ - Probes /health, /ready, /live every 10s         │  │
│  │ - Removes unhealthy instances from rotation       │  │
│  └────────────────────────────────────────────────────┘  │
│                                                           │
└──────────────────────────────────────────────────────────┘

Health Checks

Cesivi Server exposes three health check endpoints for different purposes:

/health - Overall Health

Returns 200 OK if the application is running.

Use case: Basic "is the server alive" check

Response (Healthy):

{
  "status": "Healthy",
  "description": "Server is healthy",
  "timestamp": "2026-01-11T10:30:00.000Z"
}

Response (Unhealthy): HTTP 503 Service Unavailable

Example:

curl http://localhost:5000/health


/ready - Readiness Check

Checks if the server is ready to accept traffic (verifies dependencies: Redis, SQL Server, storage).

Use case: Load balancer routing decisions (don't route traffic if dependencies are down)

Response (Ready):

{
  "status": "Ready",
  "description": "All dependencies are healthy",
  "checks": {
    "redis": "Healthy",
    "sqlserver": "Healthy",
    "storage": "Healthy"
  },
  "timestamp": "2026-01-11T10:30:00.000Z"
}

Response (Not Ready):

  • HTTP 503 Service Unavailable
  • JSON body includes failed checks

Example:

curl http://localhost:5000/ready

Kubernetes Readiness Probe:

readinessProbe:
  httpGet:
    path: /ready
    port: 5000
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
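
A monitoring script can use the body of a 503 response to report which dependency failed. The sketch below assumes the not-ready body keeps the same "checks" object shown in the Ready response, and that any value other than "Healthy" marks a failure. The exact failure payload is not documented here, so the sample body and the helper name failed_checks are illustrative assumptions.

```python
import json


def failed_checks(body: str) -> list[str]:
    """Return the names of dependency checks that are not "Healthy".

    Assumes the /ready body carries a "checks" object mapping each
    dependency name to a status string, as in the Ready example.
    """
    payload = json.loads(body)
    checks = payload.get("checks", {})
    return sorted(name for name, status in checks.items() if status != "Healthy")


# Hypothetical not-ready body, modeled on the documented Ready response.
sample = """
{
  "status": "NotReady",
  "checks": {"redis": "Unhealthy", "sqlserver": "Healthy", "storage": "Healthy"}
}
"""
print(failed_checks(sample))
```

Sorting the result keeps log output stable when several dependencies fail at once.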


/live - Liveness Check

Checks if the application is responsive (not deadlocked or hung).

Use case: Pod/container restart decisions (restart if app is hung)

Response (Live):

{
  "status": "Live",
  "description": "Server is responsive",
  "timestamp": "2026-01-11T10:30:00.000Z"
}

Response (Dead): HTTP 503 Service Unavailable (or timeout)

Example:

curl http://localhost:5000/live

Kubernetes Liveness Probe:

livenessProbe:
  httpGet:
    path: /live
    port: 5000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
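
The probe settings determine how long a hung instance can linger before Kubernetes restarts it. The estimate below is a rough sketch only: it ignores kubelet scheduling jitter and assumes every failed probe runs to its full timeout. The function name is hypothetical.

```python
def detection_window_seconds(period: int, timeout: int, threshold: int) -> tuple[int, int]:
    """Estimate (best, worst) seconds until `threshold` consecutive probes fail.

    Best case: the application hangs just before a probe fires.
    Worst case: the application hangs just after a probe succeeded.
    """
    best = (threshold - 1) * period + timeout
    worst = threshold * period + timeout
    return best, worst


# Values from the liveness probe: periodSeconds=10, timeoutSeconds=3, failureThreshold=3.
best, worst = detection_window_seconds(period=10, timeout=3, threshold=3)
print(f"restart triggered after roughly {best}-{worst} s")
```

Lowering periodSeconds or failureThreshold shortens this window, at the cost of more restarts caused by brief stalls.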


Prometheus Metrics

Cesivi Server exposes comprehensive Prometheus metrics at /metrics for monitoring performance, usage, and health.

Accessing Metrics

# View all metrics
curl http://localhost:5000/metrics

# Query specific metric
curl http://localhost:5000/metrics | grep spm_csom_requests_total
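
For scripted checks, the text returned by /metrics can be parsed directly. The following is a deliberately simplified sketch of reading the Prometheus text exposition format: it skips comment lines and does not handle escaped quotes inside label values. The sample values are made up, and a client library parser is the safer choice for anything beyond quick diagnostics.

```python
import re

# Metric name, optional {labels}, then the sample value (a timestamp may follow).
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text: str, metric: str) -> list[tuple[dict[str, str], float]]:
    """Return (labels, value) pairs for every sample of `metric` in `text`."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match and match.group(1) == metric:
            labels = dict(LABEL_RE.findall(match.group(2) or ""))
            samples.append((labels, float(match.group(3))))
    return samples


# Made-up scrape output for illustration only.
scrape = '''
# TYPE spm_csom_requests_total counter
spm_csom_requests_total{operation="query",status="success",server_id="spm-server-1"} 120
spm_csom_requests_total{operation="query",status="error",server_id="spm-server-1"} 3
spm_csom_requests_in_flight{server_id="spm-server-1"} 2
'''
for labels, value in parse_samples(scrape, "spm_csom_requests_total"):
    print(labels["status"], value)
```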

Metric Categories

1. CSOM Request Metrics

Metric                            | Type      | Labels                       | Description
spm_csom_requests_total           | Counter   | operation, status, server_id | Total CSOM requests processed
spm_csom_request_duration_seconds | Histogram | operation, server_id         | CSOM request processing time
spm_csom_requests_in_flight       | Gauge     | server_id                    | Number of active CSOM requests

Example:

# CSOM request rate (requests per second)
rate(spm_csom_requests_total[5m])

# CSOM error rate
rate(spm_csom_requests_total{status="error"}[5m])

# CSOM request duration (95th percentile)
histogram_quantile(0.95, rate(spm_csom_request_duration_seconds_bucket[5m]))
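
Percentiles computed from histogram buckets are estimates, not exact measurements. The sketch below is a simplified re-implementation of the linear interpolation that histogram_quantile performs, written to show why a reported p95 can never exceed the highest finite bucket bound. The bucket bounds and counts are invented; the actual bucket layout of the spm_ histograms is not documented here.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs sorted by bound,
    with at least one finite bucket followed by a +Inf bucket, mirroring
    the `le` label on *_bucket series.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Invented data: 100 requests, bucket bounds in seconds.
buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
print(f"p50 = {histogram_quantile(0.5, buckets):.2f} s")
print(f"p95 = {histogram_quantile(0.95, buckets):.2f} s")
```

In this data the slowest 10 requests all fall in the +Inf bucket, so p95 is reported as 1.00 s regardless of how slow they really were. Bucket bounds therefore need to extend past the latency thresholds used in alerts.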

2. REST API Metrics

Metric                            | Type      | Labels                              | Description
spm_rest_requests_total           | Counter   | endpoint, method, status, server_id | Total REST requests processed
spm_rest_request_duration_seconds | Histogram | endpoint, method, server_id         | REST request processing time

Example:

# REST API request rate by endpoint
sum by (endpoint) (rate(spm_rest_requests_total[5m]))

# Slow REST endpoints (95th percentile above 1s)
histogram_quantile(0.95, sum by (le, endpoint) (rate(spm_rest_request_duration_seconds_bucket[5m]))) > 1

3. SOAP Service Metrics

Metric                            | Type      | Labels                                | Description
spm_soap_requests_total           | Counter   | service, operation, status, server_id | Total SOAP requests processed
spm_soap_request_duration_seconds | Histogram | service, operation, server_id         | SOAP request processing time

Example:

# SOAP request rate by service
sum by (service) (rate(spm_soap_requests_total[5m]))

4. Session Metrics

Metric                     | Type    | Labels    | Description
spm_active_sessions_total  | Gauge   | server_id | Number of active sessions
spm_sessions_created_total | Counter | server_id | Total sessions created
spm_sessions_expired_total | Counter | server_id | Total sessions expired

Example:

# Session churn rate
rate(spm_sessions_created_total[5m]) + rate(spm_sessions_expired_total[5m])

5. Cache Metrics

Metric                 | Type    | Labels                | Description
spm_cache_hits_total   | Counter | cache_type, server_id | Total cache hits
spm_cache_misses_total | Counter | cache_type, server_id | Total cache misses
spm_cache_hit_ratio    | Gauge   | cache_type, server_id | Cache hit ratio (0.0 to 1.0)
spm_cache_items_total  | Gauge   | cache_type, server_id | Number of items in cache

Example:

# Cache hit ratio over time
spm_cache_hit_ratio

# Cache effectiveness by type
sum by (cache_type) (rate(spm_cache_hits_total[5m])) /
sum by (cache_type) (rate(spm_cache_hits_total[5m]) + rate(spm_cache_misses_total[5m]))
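
The counter-based expression measures the hit ratio over the query window only, which can differ from a ratio accumulated since startup. The arithmetic can be sketched as follows, using made-up counter readings and ignoring counter resets (which rate() would compensate for). The function name is illustrative.

```python
from typing import Optional


def windowed_hit_ratio(hits: tuple[float, float], misses: tuple[float, float]) -> Optional[float]:
    """Hit ratio over a window, from (start, end) readings of two counters.

    Returns None when there were no lookups in the window, which is the
    case where the PromQL division would yield NaN.
    """
    hit_delta = hits[1] - hits[0]
    miss_delta = misses[1] - misses[0]
    lookups = hit_delta + miss_delta
    return hit_delta / lookups if lookups > 0 else None


# Made-up counter readings taken five minutes apart.
print(windowed_hit_ratio(hits=(1000, 1450), misses=(200, 250)))
```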

6. Storage Metrics

Metric                                 | Type      | Labels                       | Description
spm_storage_operations_total           | Counter   | operation, status, server_id | Total storage operations
spm_storage_operation_duration_seconds | Histogram | operation, server_id         | Storage operation duration
spm_storage_list_items_total           | Gauge     | server_id                    | Total list items in storage
spm_storage_files_total                | Gauge     | server_id                    | Total files in storage
spm_storage_size_bytes                 | Gauge     | server_id                    | Total storage size in bytes

Example:

# Storage operation latency
histogram_quantile(0.99, rate(spm_storage_operation_duration_seconds_bucket[5m]))

# Storage growth rate in bytes per second (deriv, because the metric is a gauge)
deriv(spm_storage_size_bytes[1h])

7. Health Check Metrics

Metric                            | Type      | Labels                | Description
spm_health_check_status           | Gauge     | check_name, server_id | Health check status (1=healthy, 0=unhealthy)
spm_health_check_duration_seconds | Histogram | check_name, server_id | Health check duration

Example:

# Unhealthy dependencies
spm_health_check_status == 0

8. Distributed State Metrics

Metric                                     | Type      | Labels                      | Description
spm_distributed_lock_acquisitions_total    | Counter   | lock_key, status, server_id | Lock acquisition attempts
spm_distributed_lock_hold_duration_seconds | Histogram | lock_key, server_id         | Lock hold time
spm_pubsub_messages_published_total        | Counter   | channel, server_id          | Pub/sub messages published
spm_pubsub_messages_received_total         | Counter   | channel, server_id          | Pub/sub messages received

Example:

# Lock contention (timeouts)
rate(spm_distributed_lock_acquisitions_total{status="timeout"}[5m])

9. Authentication Metrics

Metric                              | Type      | Labels                       | Description
spm_authentication_attempts_total   | Counter   | auth_type, status, server_id | Authentication attempts
spm_authentication_duration_seconds | Histogram | auth_type, server_id         | Authentication processing time

Example:

# Authentication failure rate
rate(spm_authentication_attempts_total{status="failure"}[5m])

10. Server Info Metrics

Metric          | Type  | Labels                                                                        | Description
spm_server_info | Gauge | version, environment, server_id, storage_provider, distributed_state_provider | Server metadata (always 1.0)

Example:

# Server inventory
spm_server_info


OpenTelemetry Distributed Tracing

Cesivi Server supports OpenTelemetry for distributed request tracing across services.

Configuration

appsettings.json:

{
  "OpenTelemetry": {
    "OtlpEndpoint": "http://jaeger:4318"
  }
}

Trace Sources

The following components are automatically instrumented:

  1. ASP.NET Core - HTTP request/response traces
  2. HTTP Client - Outgoing HTTP calls (plugins, remote event receivers)
  3. CSOM Processor - CSOM request processing
  4. REST/SOAP APIs - API endpoint traces

Custom Spans

Add custom spans to your code:

using Cesivi.Server.Observability;

// Start a custom activity
using var activity = CesiviActivitySource.Activity.StartActivity("CustomOperation");

// Add custom tags
activity?.SetTag("custom.key", "value");
activity?.SetTag("item.id", itemId);

// Process...

// Activity automatically ends when disposed

Viewing Traces

Jaeger UI: http://localhost:16686

  1. Select the "Cesivi Server" service
  2. Find traces by:
       • Operation name
       • Tags (e.g., http.url, sp.webapp)
       • Duration
       • Status code

Trace Context Propagation

Cesivi Server automatically propagates trace context across:

  • HTTP requests (W3C Trace Context headers)
  • Distributed state operations (Redis pub/sub)
  • Remote event receivers
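
For HTTP, propagation relies on the W3C traceparent header, which has the form version-traceid-parentid-flags. The sketch below shows how such a header can be validated and how a child header for an outgoing call keeps the trace ID while generating a new span ID. It illustrates the header format only and handles version 00 alone; it is not Cesivi Server's propagation code, which the OpenTelemetry instrumentation performs automatically. TraceContext and both function names are hypothetical.

```python
import re
import secrets
from typing import NamedTuple, Optional

TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid according to the specification.
    if version != "00" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceContext(trace_id, parent_id, sampled=bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceContext) -> str:
    """Build the header for an outgoing call: same trace ID, new span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
context = parse_traceparent(incoming)
print(context)
print(child_traceparent(context))
```

When traces break between services, comparing the trace ID in the incoming and outgoing headers is a quick way to find the hop that drops the context.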

Configuration

appsettings.json

{
  "ServerMetrics": {
    "ServerId": "spm-server-1",
    "Environment": "production"
  },
  "OpenTelemetry": {
    "OtlpEndpoint": "http://jaeger:4318"
  },
  "Storage": {
    "Provider": "SqlServer"
  },
  "DistributedState": {
    "Provider": "Redis"
  }
}

Environment Variables

Variable               | Purpose                             | Default
CESIVI_SERVER_ID       | Server identifier for metrics       | Machine name
CESIVI_ENVIRONMENT     | Environment name (dev/staging/prod) | Development
ASPNETCORE_ENVIRONMENT | ASP.NET Core environment            | Production (ASP.NET Core default when unset)
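
Deployment scripts sometimes need to predict which identity an instance will report in its metrics labels. The sketch below applies the documented defaults for the two Cesivi variables. It treats an empty value as unset, which is an assumption, and it does not model how these variables interact with the ServerMetrics section of appsettings.json, since that precedence is not described here. The function name is hypothetical.

```python
import os
import socket
from typing import Mapping


def resolve_identity(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Apply the documented defaults for the server identity variables."""
    return {
        # Falls back to the machine name when CESIVI_SERVER_ID is not set.
        "server_id": env.get("CESIVI_SERVER_ID") or socket.gethostname(),
        # Falls back to "Development" when CESIVI_ENVIRONMENT is not set.
        "environment": env.get("CESIVI_ENVIRONMENT") or "Development",
    }


print(resolve_identity({"CESIVI_SERVER_ID": "spm-server-2"}))
```

Setting CESIVI_SERVER_ID explicitly for each replica keeps the server_id label stable across container restarts, where the machine name may change.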

Docker Compose Setup

The docker-compose.multiserver.yml file includes the full observability stack.

Starting the Stack

# Start all services (3 SPM + Redis + SQL + Nginx + Prometheus + Jaeger)
docker-compose -f docker-compose.multiserver.yml up -d

# View logs
docker-compose -f docker-compose.multiserver.yml logs -f

# Stop services
docker-compose -f docker-compose.multiserver.yml down

Accessing Services

Service            | URL                    | Purpose
Cesivi (via Nginx) | http://localhost:8080  | Load-balanced access
Prometheus         | http://localhost:9090  | Metrics dashboard
Jaeger UI          | http://localhost:16686 | Trace visualization

Prometheus Queries

Prometheus UI: http://localhost:9090

Example queries:

# Total request rate across all servers
sum(rate(spm_csom_requests_total[5m]))

# Request rate by server
sum by (server_id) (rate(spm_csom_requests_total[5m]))

# Error rate
sum(rate(spm_csom_requests_total{status="error"}[5m]))

# Request duration (95th percentile)
histogram_quantile(0.95,
  sum by (le) (rate(spm_csom_request_duration_seconds_bucket[5m]))
)
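
The same queries can be run from a script through the Prometheus HTTP API (the /api/v1/query endpoint). The sketch below builds the request URL and extracts per-server values from an instant-vector response. The helper names are illustrative, and grouping by server_id assumes a query such as the "Request rate by server" example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def vector_values(response: dict) -> dict[str, float]:
    """Map each series' server_id label to its value in a vector result."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response.get('error', 'unknown error')}")
    return {
        series["metric"].get("server_id", ""): float(series["value"][1])
        for series in response["data"]["result"]
    }


def run_query(base_url: str, promql: str, timeout: float = 5.0) -> dict[str, float]:
    """Execute the query against a live Prometheus and return per-server values."""
    with urlopen(query_url(base_url, promql), timeout=timeout) as resp:
        return vector_values(json.load(resp))
```

With the stack running, a call such as run_query("http://localhost:9090", 'sum by (server_id) (rate(spm_csom_requests_total[5m]))') would return one value per server.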

Kubernetes Setup

ServiceMonitor for Prometheus Operator

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cesivi-metrics
  namespace: cesivi
spec:
  selector:
    matchLabels:
      app: cesivi
  endpoints:
  - port: http
    path: /metrics
    interval: 15s

Deployment with Probes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cesivi
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: cesivi
        image: cesivi:latest
        ports:
        - containerPort: 5000
          name: http
        livenessProbe:
          httpGet:
            path: /live
            port: 5000
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 5000
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3

Grafana Dashboards

Installing Grafana

Docker Compose:

Add to docker-compose.multiserver.yml:

  grafana:
    image: grafana/grafana:10.2.3
    container_name: spm-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin  # change for anything beyond local testing
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - prometheus
    networks:
      - spm-network

volumes:
  grafana-data:
    driver: local

Connecting to Prometheus

  1. Open Grafana: http://localhost:3000
  2. Login: admin / admin
  3. Add Data Source:
       • Type: Prometheus
       • URL: http://prometheus:9090
  4. Save & Test

Dashboard Panels

1. Request Rate

sum(rate(spm_csom_requests_total[5m]))

2. Error Rate

sum(rate(spm_csom_requests_total{status="error"}[5m])) /
sum(rate(spm_csom_requests_total[5m])) * 100

3. Request Duration (P95)

histogram_quantile(0.95,
  sum by (le) (rate(spm_csom_request_duration_seconds_bucket[5m]))
)

4. Active Sessions

sum(spm_active_sessions_total)

5. Cache Hit Ratio

spm_cache_hit_ratio

Importing Dashboards

Save the dashboard JSON to grafana-dashboards/cesivi.json and import it through the Grafana UI.


Troubleshooting

Metrics Not Appearing

Problem: /metrics endpoint returns empty or no metrics

Solutions:

  1. Check if metrics are being recorded:

    curl http://localhost:5000/metrics | grep spm_
    

  2. Verify server ID is set:

    curl http://localhost:5000/metrics | grep spm_server_info
    

  3. Check Program.cs configuration:
       • CesiviMetrics.InitializeServerInfo() is called
       • app.UseMetricsMiddleware() is registered

Prometheus Not Scraping

Problem: Prometheus targets show "DOWN" status

Solutions:

  1. Check target health in Prometheus:
       • Open http://localhost:9090/targets
       • Look for errors

  2. Verify network connectivity:

    docker exec spm-prometheus wget -O- http://spm-server-1:5000/metrics

  3. Check Prometheus config:

    docker exec spm-prometheus cat /etc/prometheus/prometheus.yml
    


Jaeger Not Receiving Traces

Problem: No traces appear in Jaeger UI

Solutions:

  1. Verify OTLP endpoint configuration:

    {
      "OpenTelemetry": {
        "OtlpEndpoint": "http://jaeger:4318"
      }
    }
    

  2. Check Jaeger logs:

    docker logs spm-jaeger
    

  3. Verify trace export:

    # Check if traces are being sent
    docker logs spm-server-1 | grep -i telemetry
    


Health Checks Failing

Problem: /ready returns 503

Solutions:

  1. Check dependency health manually:

    # Redis
    docker exec spm-redis redis-cli ping
    
    # SQL Server
    docker exec spm-sqlserver /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P 'Cesivi2024!' -Q 'SELECT 1'
    

  2. Check application logs:

    docker logs spm-server-1 | grep -i health
    

  3. Verify HealthCheckService configuration in Program.cs


High Memory Usage

Problem: Metrics collection causing high memory usage

Solutions:

  1. Reduce metric cardinality:
       • Limit the number of unique label values
       • Use metric normalization (already implemented in MetricsMiddleware)

  2. Adjust Prometheus retention:

    command:
      - '--storage.tsdb.retention.time=7d'  # Default: 15d

  3. Check for metric leaks:

    curl http://localhost:5000/metrics | wc -l
    # Should be <10,000 lines for normal operation
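
When the line count is high, the next question is which metric is responsible. A minimal sketch, assuming the /metrics output has been saved to a string: it counts sample lines per metric name so the largest contributors stand out. Histogram series appear under their _bucket, _sum, and _count names. The sample text and function name are illustrative.

```python
import re
from collections import Counter

NAME_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def series_per_metric(text: str) -> Counter:
    """Count exposed sample lines per metric name, ignoring comments."""
    counts: Counter = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = NAME_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up scrape output for illustration only.
scrape = '''
# HELP spm_rest_requests_total Total REST requests processed
spm_rest_requests_total{endpoint="/a",method="GET",status="200",server_id="s1"} 5
spm_rest_requests_total{endpoint="/b",method="GET",status="200",server_id="s1"} 7
spm_active_sessions_total{server_id="s1"} 3
'''
for name, count in series_per_metric(scrape).most_common(5):
    print(f"{count:6d}  {name}")
```

A metric whose series count keeps growing between scrapes usually has a label with unbounded values.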
    


Best Practices

1. Metric Naming

  • Use consistent prefix: spm_
  • Use descriptive names: spm_csom_request_duration_seconds (not spm_req_dur)
  • Include units in name: _seconds, _bytes, _total

2. Label Cardinality

  • Keep label values bounded (don't use unbounded IDs)
  • Use normalization to reduce cardinality
  • Avoid high-cardinality labels like user_id, item_id
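
Normalization means mapping many concrete values onto a small, fixed set before they are used as label values. The sketch below illustrates the idea for an endpoint label by replacing numeric and GUID path segments with a placeholder. It is a hypothetical illustration, not the logic of MetricsMiddleware, and the paths are invented.

```python
import re

GUID_RE = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def normalize_endpoint(path: str) -> str:
    """Replace numeric and GUID path segments with a fixed placeholder."""
    segments = []
    for segment in path.split("/"):
        if segment.isdigit() or GUID_RE.fullmatch(segment):
            segments.append("{id}")  # collapse every ID into one label value
        else:
            segments.append(segment)
    return "/".join(segments)


print(normalize_endpoint("/api/lists/42/items/7"))
print(normalize_endpoint("/api/lists/3f2504e0-4f89-11d3-9a0c-0305e82c3301/items"))
```

With this mapping, requests for any number of items contribute to a single series per route instead of one series per item.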

3. Health Check Design

  • /health - Fast, no dependencies (basic liveness)
  • /ready - Checks dependencies (routing decision)
  • /live - Checks responsiveness (restart decision)

4. Alert Rules

Create Prometheus alert rules for critical metrics:

groups:
  - name: cesivi
    rules:
    - alert: HighErrorRate
      expr: rate(spm_csom_requests_total{status="error"}[5m]) > 0.1
      for: 5m
      annotations:
        summary: "High CSOM error rate"

    - alert: SlowRequests
      expr: histogram_quantile(0.95, rate(spm_csom_request_duration_seconds_bucket[5m])) > 5
      for: 5m
      annotations:
        summary: "CSOM requests are slow"

Additional Resources

  • Prometheus Documentation: https://prometheus.io/docs/
  • Grafana Documentation: https://grafana.com/docs/
  • OpenTelemetry Documentation: https://opentelemetry.io/docs/
  • Jaeger Documentation: https://www.jaegertracing.io/docs/

For questions or issues, please file a GitHub issue or contact the Cesivi team.