Multi-Server Deployment Guide
PLAN-129 Phase 2.3: Distributed State Infrastructure for Multi-Server Deployments
This guide explains how to deploy Cesivi Server in a multi-server configuration with load balancing, using Redis for distributed state and SQL Server (or other backends) for shared storage.
Table of Contents
- Overview
- Architecture
- Prerequisites
- Quick Start with Docker Compose
- Configuration
- Deployment Scenarios
- Monitoring & Health Checks
- Troubleshooting
- Performance Tuning
- Security Considerations
Overview
Cesivi Server supports multi-server deployments with:
- Distributed Session State: Sessions persist across server restarts and are accessible from any server instance
- Distributed Cache: User permissions and authentication tokens are cached, with pub/sub invalidation
- Distributed Locks: Coordination for critical operations
- Pub/Sub: Cache invalidation notifications propagate to all servers
- Queue: Background job processing across servers
- Load Balancing: No sticky sessions required - any server can handle any request
Key Benefits:
- High Availability: Server failures don't lose user sessions
- Horizontal Scalability: Add more servers as load increases
- Zero Downtime Deployments: Rolling updates without session loss
- Fault Tolerance: Redis replication and SQL Server clustering support
Architecture
┌──────────────┐
│ Load Balancer│ (Nginx/HAProxy/Azure LB)
│ (Round-Robin)│
└──────┬───────┘
       │
   ┌───┴────┬────────┬────────┐
   │        │        │        │
┌──▼──┐  ┌──▼──┐  ┌──▼──┐  ┌──▼──┐
│SPM 1│  │SPM 2│  │SPM 3│  │ ... │  Cesivi Server instances
└──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘
   │        │        │        │
   └────────┴────────┴────────┘
            │
       ┌────▼────┐
       │  Redis  │ ← Distributed State (session/cache/locks/pub-sub/queue)
       │ Primary │
       └────┬────┘
            │ (optional replication)
       ┌────▼────┐
       │  Redis  │
       │ Replica │
       └─────────┘
            │
   ┌────────┴────────┐
   │                 │
┌──▼───────┐  ┌──────▼────┐
│SQL Server│  │ File Share│ ← Shared Storage (persistent data)
│(Primary) │  │ (NFS/S3)  │
│(Replica) │  │           │
└──────────┘  └───────────┘
Components:
- Load Balancer: Distributes requests across SPM instances (round-robin, least-connections, etc.)
- Cesivi Server Instances: Stateless application servers (scale horizontally)
- Redis: Stores session state, cache, provides pub/sub and distributed locks
- Shared Storage: SQL Server, PostgreSQL, or file share for persistent SharePoint data
Prerequisites
Infrastructure Requirements
Per Server Instance:
- CPU: 2+ cores
- RAM: 4GB+ (8GB recommended)
- Disk: 10GB+ (depends on storage backend)
Redis:
- CPU: 1-2 cores
- RAM: 2GB+ (size based on session/cache volume)
- Disk: 1GB+ for persistence (AOF/RDB)
SQL Server (if used):
- CPU: 2+ cores
- RAM: 4GB+
- Disk: 50GB+ (depends on data volume)
Software Requirements
- Docker & Docker Compose (for containerized deployments) OR
- .NET 10.0 SDK/Runtime (for manual deployments)
- Redis 7.0+
- SQL Server 2019+ or PostgreSQL 13+ (optional)
- Load balancer (Nginx, HAProxy, Azure Load Balancer, etc.)
Quick Start with Docker Compose
The easiest way to test multi-server deployment is using the provided docker-compose.multiserver.yml:
# 1. Start all services (3 SPM instances + Redis + SQL Server + Nginx)
docker-compose -f docker-compose.multiserver.yml up -d
# 2. Check service health
docker-compose -f docker-compose.multiserver.yml ps
# 3. Access Cesivi Server through load balancer
curl http://localhost:8080/health
# 4. View logs
docker-compose -f docker-compose.multiserver.yml logs -f spm-server-1
# 5. Stop all services
docker-compose -f docker-compose.multiserver.yml down
What's Included:
- 3x Cesivi Server instances (ports 5001, 5002, 5003)
- 1x Redis (port 6379) - Distributed state backend
- 1x SQL Server Express (port 1433) - Shared storage (optional)
- 1x Nginx (port 8080) - Load balancer
Testing Session Persistence:
# Create a session on any server
curl -c cookies.txt http://localhost:8080/api/session/create
# Subsequent requests will be load-balanced but session persists
curl -b cookies.txt http://localhost:8080/api/session/get
# Stop server 1, session still works on servers 2 and 3
docker stop spm-server-1
curl -b cookies.txt http://localhost:8080/api/session/get
Configuration
Environment Variables
Distributed State Configuration:
# Redis provider (multi-server)
Cesivi__DistributedState__Provider=Redis
Cesivi__DistributedState__ConnectionString=redis-host:6379,password=YourPassword,ssl=true
# InMemory provider (single-server)
Cesivi__DistributedState__Provider=InMemory
Storage Configuration:
# InMemory (development/testing)
Cesivi__Storage__Provider=InMemory
# SQL Server (production)
Cesivi__Storage__Provider=SqlServer
Cesivi__Storage__ConnectionString=Server=sql-host,1433;Database=Cesivi;User=sa;Password=YourPassword;TrustServerCertificate=True
# PostgreSQL (production)
Cesivi__Storage__Provider=PostgreSQL
Cesivi__Storage__ConnectionString=Host=pg-host;Port=5432;Database=Cesivi;Username=postgres;Password=YourPassword
Farm Configuration:
# Enable farm mode (multi-server); ServerId must be unique per server
Cesivi__Farm__Enabled=true
Cesivi__Farm__ServerId=spm-server-1
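The variable names above follow the standard .NET configuration convention, in which a double underscore in an environment variable stands for the ":" separator of a hierarchical key. The following Python sketch only illustrates that naming rule; the helper names env_to_config_key and nest are hypothetical and are not part of Cesivi Server.

```python
def env_to_config_key(env_name: str) -> str:
    """Translate a .NET-style environment variable name into a config key.

    The .NET environment-variable provider treats "__" as the hierarchy
    separator ":" so that names stay valid on every platform.
    """
    return env_name.replace("__", ":")


def nest(env: dict[str, str]) -> dict:
    """Build the nested structure that a set of flat variables describes."""
    root: dict = {}
    for name, value in env.items():
        node = root
        *parents, leaf = env_to_config_key(name).split(":")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return root


settings = nest({
    "Cesivi__DistributedState__Provider": "Redis",
    "Cesivi__Farm__Enabled": "true",
    "Cesivi__Farm__ServerId": "spm-server-1",
})
print(settings["Cesivi"]["Farm"]["ServerId"])  # prints spm-server-1
```

The nested result has the same shape as the "Cesivi" section of an appsettings JSON file, which is why either form can be used to supply the same setting.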
appsettings.Production.json
{
  "Cesivi": {
    "DistributedState": {
      "Provider": "Redis",
      "ConnectionString": "redis-cluster.example.com:6379,password=SecurePassword123!,ssl=true,abortConnect=false"
    },
    "Storage": {
      "Provider": "SqlServer",
      "ConnectionString": "Server=sql-cluster.example.com,1433;Database=Cesivi;User Id=spm_app;Password=AppPassword123!;TrustServerCertificate=True;Encrypt=True"
    },
    "Farm": {
      "Enabled": true,
      "ServerId": "${HOSTNAME}"
    }
  },
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Cesivi": "Information",
      "Microsoft.AspNetCore": "Warning"
    }
  }
}
Distributed DataProtection + OIDC server-side state (PLAN-1772)
Running more than one instance behind a load balancer requires a shared ASP.NET Core DataProtection key ring. Otherwise each instance mints its own keys, and auth cookies or OIDC correlation cookies minted on instance A fail to validate on instance B.
- Cesivi.Server: Cesivi:DataProtection:Provider=PostgreSql|SqlServer|SharedDirectory (reuses the configured storage connection string, or an explicit UNC/SMB/NFS KeysPath). Fail-fast startup gates reject an unshareable FileSystem provider when Cluster:Mode=Cluster, and reject a DataProtection backend that doesn't match the configured StorageProvider (e.g. SqlServer DataProtection with a non-SQL storage backend).
- Cesivi.WebUI: WebUI:DataProtectionKeysPath (a shared path all instances point at) plus WebUI:DistributedState:Provider=Redis, so the OIDC state blob (the ServerSideStateDataFormat) is stored server-side in shared Redis instead of per-process memory. This is what lets a login started on instance A complete on instance B (and avoids the Keycloak HTTP 431 you get from an oversized state query parameter). The default InMemory provider is correct only for a single WebUI instance.
Full reference (complete config templates, nginx upstream template with no session affinity, health probes, acceptance checklist): _docs_dev/multi-instance-deploy.md.
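The two fail-fast gates described for Cesivi.Server can be read as a small validation rule. The Python sketch below is an illustration of that rule only, not the server's actual startup code. The function name check_data_protection is hypothetical, provider names are compared case-insensitively as an assumption, and SharedDirectory is assumed to be exempt from the storage-match check because it is path-based.

```python
def check_data_protection(dp_provider: str, storage_provider: str, cluster_mode: str) -> list[str]:
    """Return the reasons a configuration would be rejected (empty list = accepted)."""
    errors = []
    dp = dp_provider.lower()
    storage = storage_provider.lower()

    # Gate 1: a per-machine key directory cannot be shared between instances.
    if cluster_mode.lower() == "cluster" and dp == "filesystem":
        errors.append("FileSystem key storage is not shareable in Cluster mode")

    # Gate 2: database-backed key rings reuse the storage connection string,
    # so both settings have to name the same database engine.
    # Assumption: SharedDirectory is path-based and skips this check.
    if dp in {"postgresql", "sqlserver"} and dp != storage:
        errors.append(
            f"DataProtection provider {dp_provider} does not match storage provider {storage_provider}"
        )

    return errors


print(check_data_protection("SqlServer", "SqlServer", "Cluster"))  # prints []
print(check_data_protection("FileSystem", "SqlServer", "Cluster"))
```

Returning a list of reasons instead of raising on the first problem makes it possible to report every misconfiguration in one startup failure message.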
Deployment Scenarios
Scenario 1: Docker Swarm / Kubernetes
Docker Swarm Stack:
# spm-stack.yml
version: '3.8'
services:
  spm-server:
    image: cesivi/server:latest
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: on-failure
    environment:
      - Cesivi__DistributedState__Provider=Redis
      - Cesivi__DistributedState__ConnectionString=${REDIS_CONN_STRING}
      - Cesivi__Storage__Provider=SqlServer
      - Cesivi__Storage__ConnectionString=${SQL_CONN_STRING}
    networks:
      - spm-network
    ports:
      - target: 5000
        published: 5000
        protocol: tcp
        mode: host
networks:
  spm-network:
    driver: overlay
Deploy:
docker stack deploy -c spm-stack.yml spm
Kubernetes Deployment:
# spm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spm-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: spm-server
  template:
    metadata:
      labels:
        app: spm-server
    spec:
      containers:
        - name: spm-server
          image: cesivi/server:latest
          ports:
            - containerPort: 5000
          env:
            - name: Cesivi__DistributedState__Provider
              value: "Redis"
            - name: Cesivi__DistributedState__ConnectionString
              valueFrom:
                secretKeyRef:
                  name: spm-secrets
                  key: redis-connection
            - name: Cesivi__Storage__Provider
              value: "SqlServer"
            - name: Cesivi__Storage__ConnectionString
              valueFrom:
                secretKeyRef:
                  name: spm-secrets
                  key: sql-connection
          livenessProbe:
            httpGet:
              path: /health/live
              port: 5000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 5000
            initialDelaySeconds: 10
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: spm-server
spec:
  type: LoadBalancer
  selector:
    app: spm-server
  ports:
    - protocol: TCP
      port: 80
      targetPort: 5000
Deploy:
kubectl apply -f spm-deployment.yaml
Scenario 2: Azure App Service
- Create Azure Resources:
  - Azure App Service (multiple instances with auto-scaling)
  - Azure Cache for Redis (Standard/Premium tier)
  - Azure SQL Database (Standard/Premium tier)
- Configure App Service:
# Set environment variables
az webapp config appsettings set -n spm-app -g spm-rg --settings \
Cesivi__DistributedState__Provider=Redis \
Cesivi__DistributedState__ConnectionString="spm-cache.redis.cache.windows.net:6380,password=...,ssl=true" \
Cesivi__Storage__Provider=SqlServer \
Cesivi__Storage__ConnectionString="Server=tcp:spm-sql.database.windows.net,1433;Database=Cesivi;..."
# Enable auto-scaling
az monitor autoscale create -n spm-autoscale -g spm-rg \
--resource spm-app \
--min-count 2 \
--max-count 10 \
--count 3
Scenario 3: AWS ECS Fargate
{
  "family": "spm-server",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "spm-server",
      "image": "cesivi/server:latest",
      "portMappings": [
        {
          "containerPort": 5000,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {
          "name": "Cesivi__DistributedState__Provider",
          "value": "Redis"
        },
        {
          "name": "Cesivi__DistributedState__ConnectionString",
          "value": "spm-cache.abc123.0001.use1.cache.amazonaws.com:6379"
        },
        {
          "name": "Cesivi__Storage__Provider",
          "value": "SqlServer"
        }
      ],
      "secrets": [
        {
          "name": "Cesivi__Storage__ConnectionString",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789:secret:spm/sql-connection"
        }
      ],
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:5000/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3
      }
    }
  ]
}
Monitoring & Health Checks
Health Endpoints
Cesivi Server provides several health check endpoints:
# Basic health check (returns 200 OK if healthy)
GET /health
# Detailed health check (includes dependencies)
GET /health/ready
# Liveness probe (for Kubernetes)
GET /health/live
# Metrics endpoint (Prometheus format)
GET /metrics
Example Response:
{
  "status": "Healthy",
  "totalDuration": "00:00:00.0234567",
  "entries": {
    "DistributedState": {
      "status": "Healthy",
      "description": "Redis connection healthy",
      "data": {
        "provider": "Redis",
        "connected": true
      }
    },
    "Storage": {
      "status": "Healthy",
      "description": "SQL Server connection healthy",
      "data": {
        "provider": "SqlServer",
        "connected": true
      }
    }
  }
}
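A monitoring script can use a response of this shape to report which dependency is failing. The sketch below assumes the payload layout shown in the example response; the function name unhealthy_entries and the failing sample payload are hypothetical and were written for illustration, not captured from a running server.

```python
import json


def unhealthy_entries(payload: str) -> list[str]:
    """Return the names of dependency checks whose status is not "Healthy"."""
    report = json.loads(payload)
    return [
        name
        for name, entry in report.get("entries", {}).items()
        if entry.get("status") != "Healthy"
    ]


# Hypothetical response for an instance that has lost its Redis connection.
sample = """
{
  "status": "Unhealthy",
  "entries": {
    "DistributedState": {"status": "Unhealthy", "description": "Redis connection failed"},
    "Storage": {"status": "Healthy", "description": "SQL Server connection healthy"}
  }
}
"""

print(unhealthy_entries(sample))  # prints ['DistributedState']
```

Checking each entry rather than only the top-level status tells an operator whether the distributed state backend or the storage backend is at fault.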
Logging
Recommended Logging Configuration:
{
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft": "Warning",
        "Cesivi.Server.Services.DistributedSessionStateStore": "Debug",
        "Cesivi.Server.Authentication.DistributedTokenCache": "Debug",
        "Cesivi.Server.Authorization.DistributedUserContextCache": "Debug"
      }
    },
    "WriteTo": [
      {
        "Name": "Console",
        "Args": {
          "outputTemplate": "[{Timestamp:HH:mm:ss} {Level:u3}] {SourceContext} {Message:lj}{NewLine}{Exception}"
        }
      },
      {
        "Name": "File",
        "Args": {
          "path": "/var/log/spm/log-.txt",
          "rollingInterval": "Day",
          "retainedFileCountLimit": 7
        }
      }
    ]
  }
}
Metrics & Monitoring
Key Metrics to Monitor:
- Request Rate: Requests/second per server
- Response Time: P50, P95, P99 latencies
- Error Rate: 4xx/5xx responses
- Session Count: Active sessions (from Redis)
- Cache Hit Rate: User context cache efficiency
- Redis Memory Usage: Track for capacity planning
- SQL Connection Pool: Active connections
- Server CPU/RAM: Resource utilization
Prometheus Metrics:
# Request rate
rate(http_requests_total[5m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m])
# Response time (p95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Active sessions
redis_sessions_active
# Cache hit rate
rate(usercontext_cache_hits_total[5m]) / rate(usercontext_cache_requests_total[5m])
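The cache hit rate query divides one per-second rate by another, so it reduces to the ratio of the two counter increases over the window. The sketch below works through that arithmetic in simplified form; it ignores the counter-reset handling and extrapolation that Prometheus applies in rate(), and the sample counter values are hypothetical.

```python
def counter_rate(start: float, end: float, window_seconds: float) -> float:
    """Per-second increase of a monotonically increasing counter over a window."""
    return (end - start) / window_seconds


def cache_hit_ratio(hits_start, hits_end, requests_start, requests_end, window_seconds=300.0):
    """Ratio of the two per-second rates; the window length cancels out."""
    request_rate = counter_rate(requests_start, requests_end, window_seconds)
    if request_rate == 0:
        return None  # no lookups in the window, so the ratio is undefined
    return counter_rate(hits_start, hits_end, window_seconds) / request_rate


# Hypothetical counter samples taken five minutes apart:
# 900 additional hits out of 1,000 additional lookups.
ratio = cache_hit_ratio(hits_start=9_000, hits_end=9_900,
                        requests_start=10_000, requests_end=11_000)
print(f"{ratio:.0%}")  # prints 90%
```

The explicit zero check matters in practice: on an idle server the PromQL expression divides zero by zero and yields no usable value, which dashboards and alerts need to tolerate.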
Troubleshooting
Issue: Sessions Not Persisting Across Servers
Symptoms:
- User logged out when request goes to different server
- Session data lost randomly
Diagnosis:
# Check Redis connection from SPM server
docker exec spm-server-1 curl http://localhost:5000/health
# Check Redis directly (SCAN iterates incrementally; KEYS can block a busy production instance)
docker exec spm-redis redis-cli -a YourPassword --scan --pattern "session:*"
# Check distributed state provider configuration
docker exec spm-server-1 env | grep DistributedState
Solutions:
- Verify Cesivi__DistributedState__Provider=Redis is set
- Check Redis connection string is correct
- Ensure Redis is accessible from all SPM instances
- Check firewall rules (port 6379)
- Verify Redis password if authentication is enabled
Issue: Cache Invalidation Not Propagating
Symptoms:
- User permissions cached incorrectly after changes
- Stale data visible on some servers but not others
Diagnosis:
# Enable debug logging for pub/sub
Logging__LogLevel__Cesivi.Server.Authorization.DistributedUserContextCache=Debug
# Check Redis pub/sub channels
docker exec spm-redis redis-cli -a YourPassword PUBSUB CHANNELS
Solutions:
- Verify pub/sub is working: PUBLISH usercontext:invalidated "test"
- Check subscription logs in SPM server logs
- Ensure Redis supports pub/sub (should be enabled by default)
- Check network latency between servers and Redis
Issue: High Redis Memory Usage
Symptoms:
- Redis memory usage growing continuously
- Out of memory errors
Diagnosis:
# Check Redis memory stats
docker exec spm-redis redis-cli -a YourPassword INFO memory
# Check key count and expiration
docker exec spm-redis redis-cli -a YourPassword DBSIZE
docker exec spm-redis redis-cli -a YourPassword TTL "session:abc123"
Solutions:
- Verify TTL is set correctly on keys
- Enable Redis eviction policy: maxmemory-policy allkeys-lru
- Increase Redis memory limit
- Reduce session/cache TTL if appropriate
- Monitor for memory leaks in application code
Issue: SQL Server Connection Pool Exhausted
Symptoms:
- Timeout errors when accessing storage
- "Timeout expired. The timeout period elapsed..." errors
Diagnosis:
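-- Check SQL Server connections (run in SSMS or sqlcmd)
-- program_name reflects the Application Name connection-string keyword, so this filter assumes it contains 'Cesivi'
SELECT * FROM sys.dm_exec_sessions WHERE program_name LIKE '%Cesivi%';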
# Check connection pool size in SPM logs
grep "Connection pool" /var/log/spm/*.log
Solutions:
- Increase connection pool size: Max Pool Size=200 in connection string
- Verify connections are being properly disposed
- Check for long-running queries causing connection blocking
- Enable connection pooling: Pooling=true (default)
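When raising Max Pool Size, keep in mind that the pool is per application process, so the worst-case connection count grows with the number of instances. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the default limit is SQL Server's maximum of 32,767 user connections, which managed or smaller database tiers may cap well below.

```python
# SQL Server's upper bound for the "user connections" setting.
SQL_SERVER_MAX_USER_CONNECTIONS = 32_767


def worst_case_connections(instances: int, max_pool_size: int) -> int:
    """Upper bound on open connections: every instance owns its own pool."""
    return instances * max_pool_size


def fits(instances: int, max_pool_size: int,
         server_limit: int = SQL_SERVER_MAX_USER_CONNECTIONS, reserve: int = 0) -> bool:
    """True if all pools can fill up while leaving `reserve` connections for other clients."""
    return worst_case_connections(instances, max_pool_size) + reserve <= server_limit


print(worst_case_connections(3, 200))   # prints 600
print(worst_case_connections(10, 200))  # prints 2000
```

Three instances with Max Pool Size=200 can open up to 600 connections, and a scale-out to ten instances raises that to 2,000, so the database limit should be checked against the largest instance count the deployment may reach.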
Performance Tuning
Redis Optimization
Connection Pooling:
# Cesivi Server automatically pools Redis connections
# Configure connection multiplexer settings:
Cesivi__DistributedState__ConnectionString=redis-host:6379,abortConnect=false,connectTimeout=5000,syncTimeout=5000
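The connection string above appears to follow the StackExchange.Redis style: comma-separated tokens, where a token without "=" is an endpoint and a token with "=" is an option. The sketch below splits such a string for inspection. It is a simplified reading of the format and not the library's own parser (for example, it does not handle values that contain commas), and the function name is hypothetical.

```python
def parse_redis_connection_string(value: str) -> dict:
    """Split a comma-separated Redis connection string into endpoints and options."""
    endpoints = []
    options = {}
    for token in value.split(","):
        token = token.strip()
        if not token:
            continue
        if "=" in token:
            # Split on the first "=" only, so values such as passwords may contain "=".
            key, _, val = token.partition("=")
            options[key.strip()] = val.strip()
        else:
            endpoints.append(token)
    return {"endpoints": endpoints, "options": options}


parsed = parse_redis_connection_string(
    "redis-host:6379,abortConnect=false,connectTimeout=5000,syncTimeout=5000"
)
print(parsed["endpoints"])               # prints ['redis-host:6379']
print(parsed["options"]["syncTimeout"])  # prints 5000
```

In StackExchange.Redis the connectTimeout and syncTimeout values are milliseconds, so the example allows five seconds for each.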
Redis Configuration:
# redis.conf optimizations
# Maximum memory (e.g., 2GB)
maxmemory 2gb
# Eviction policy (remove least recently used keys when out of memory)
maxmemory-policy allkeys-lru
# Persistence (RDB + AOF for durability)
save 900 1
save 300 10
save 60 10000
appendonly yes
appendfsync everysec
# Networking
tcp-backlog 511
timeout 0
tcp-keepalive 300
# Performance
maxclients 10000
SQL Server Optimization
Connection String:
Server=sql-host;Database=Cesivi;User=spm_app;Password=...;
Min Pool Size=10;Max Pool Size=200;Pooling=true;
Connection Timeout=30;Command Timeout=300;
TrustServerCertificate=True;Encrypt=True;MultipleActiveResultSets=True
Indexing:
-- Create indexes for frequently queried columns
CREATE NONCLUSTERED INDEX IX_ListItems_ListId ON ListItems(ListId);
CREATE NONCLUSTERED INDEX IX_ListItems_Created ON ListItems(Created);
CREATE NONCLUSTERED INDEX IX_Files_Url ON Files(Url);
CREATE NONCLUSTERED INDEX IX_Users_LoginName ON Users(LoginName);
-- Update statistics regularly
UPDATE STATISTICS ListItems WITH FULLSCAN;
UPDATE STATISTICS Files WITH FULLSCAN;
Load Balancer Tuning
Nginx:
# nginx.conf optimizations
worker_processes auto;
worker_rlimit_nofile 65535;
events {
    worker_connections 10000;
    use epoll;
}
http {
    # Connection pooling to backends
    upstream spm_backend {
        least_conn;  # Use least-connections instead of round-robin for better distribution
        server spm-server-1:5000 max_fails=3 fail_timeout=30s;
        server spm-server-2:5000 max_fails=3 fail_timeout=30s;
        server spm-server-3:5000 max_fails=3 fail_timeout=30s;
        keepalive 64;  # Connection pool to backends
    }
    # Caching (optional, for static assets)
    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=spm_cache:10m max_size=1g;
    # Timeouts
    proxy_connect_timeout 60s;
    proxy_send_timeout 60s;
    proxy_read_timeout 60s;
    keepalive_timeout 65s;
    # Buffering
    proxy_buffering on;
    proxy_buffer_size 4k;
    proxy_buffers 8 4k;
    proxy_busy_buffers_size 8k;
}
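The difference between round-robin and least_conn can be seen in a few lines. The sketch below is a simplified model of the two selection rules and not Nginx's implementation (Nginx also takes server weights into account); the in-flight request counts are hypothetical.

```python
from itertools import cycle


def pick_least_conn(active: dict[str, int]) -> str:
    """Choose the backend with the fewest in-flight requests (ties go to the first listed)."""
    return min(active, key=active.get)


backends = ["spm-server-1", "spm-server-2", "spm-server-3"]

# Hypothetical snapshot: server 1 is busy with slow requests.
active = {"spm-server-1": 5, "spm-server-2": 1, "spm-server-3": 3}

round_robin = cycle(backends)
print(next(round_robin))        # prints spm-server-1, regardless of its load
print(pick_least_conn(active))  # prints spm-server-2, the least busy backend
```

Round-robin spreads request counts evenly, while least-connections spreads concurrent work evenly, which tends to help when request durations vary widely.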
Security Considerations
Network Security
- TLS/SSL Termination: Terminate HTTPS at load balancer
- Private Networks: Run Redis and SQL Server on private network (no public IP)
- Firewall Rules: Restrict access to management ports
- VPN/Bastion: Use VPN or bastion host for administrative access
Redis Security
# redis.conf security settings
# Require password
requirepass YourStrongPasswordHere123!
# Disable dangerous commands
rename-command FLUSHDB ""
rename-command FLUSHALL ""
rename-command CONFIG ""
rename-command SHUTDOWN ""
# Bind to private IP only
bind 10.0.1.100
# Enable TLS (Redis 6+)
tls-port 6380
tls-cert-file /path/to/redis.crt
tls-key-file /path/to/redis.key
tls-ca-cert-file /path/to/ca.crt
SQL Server Security
- Least Privilege: Use dedicated SQL user with minimal permissions
- Encryption: Enable TLS encryption (Encrypt=True)
- Firewall: Restrict SQL Server port 1433 to SPM servers only
- Backup Encryption: Encrypt SQL Server backups
- Audit Logging: Enable SQL Server audit logging
Application Security
- Secrets Management: Use Azure Key Vault, AWS Secrets Manager, etc.
- Environment Variables: Never commit credentials to git
- Token Rotation: Implement token rotation for bearer tokens
- Rate Limiting: Implement rate limiting at load balancer
- CORS: Configure CORS policies appropriately
Summary
Multi-server deployment of Cesivi Server enables:
- ✅ High Availability: No single point of failure
- ✅ Horizontal Scalability: Scale out as needed
- ✅ Session Persistence: Sessions survive server restarts
- ✅ Cache Coordination: Pub/sub invalidation across servers
- ✅ Production-Ready: Enterprise-grade architecture
Next Steps:
- Review Configuration section for your environment
- Choose a Deployment Scenario
- Set up Monitoring
- Test failover scenarios
- Implement Security Considerations
For questions or issues, see TROUBLESHOOTING or API_REFERENCE.