Cesivi Operations Manual

This document describes operational procedures for monitoring, troubleshooting, backing up, and scaling the Cesivi server in production environments.

Table of Contents

Monitoring & Observability
Troubleshooting
Backup & Recovery
Scaling & High Availability
Performance Optimization
Security

Monitoring & Observability

Health Check Endpoint

Endpoint: GET /_vti_bin/diagnostics

Returns server, request, and cache metrics as JSON. Example response:

{
  "status": "Healthy",
  "server": {
    "uptime": { "seconds": 3600, "formatted": "01:00:00" },
    "memoryUsageMB": "245.32",
    "threadCount": 42,
    "dotnetVersion": "10.0.0"
  },
  "requests": {
    "total": 15234,
    "successful": 15102,
    "failed": 132,
    "errorRate": "0.87%",
    "byEndpoint": { ... }
  },
  "cache": {
    "statistics": {
      "HitRate": 62.5,
      "MissRate": 37.5,
      "TotalEntries": 156,
      "Evictions": 23
    }
  }
}

Key Performance Indicators (KPIs)

Metric	Healthy Range	Warning	Critical	Action
Memory Usage	< 500MB	500-800MB	> 800MB	Restart service or scale
CPU Usage	< 60%	60-80%	> 80%	Scale horizontally
Error Rate	< 1%	1-5%	> 5%	Check logs, investigate errors
Cache Hit Rate	> 60%	40-60%	< 40%	Review cache config
Response Time (P95)	< 10ms	10-50ms	> 50ms	Profile, optimize, scale
Thread Count	< 100	100-200	> 200	Check for thread leaks
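The ranges above can be checked from a shell. The sketch below is not part of Cesivi: classify and the three wrapper functions are hypothetical names, the boundaries are copied from the table, and a value exactly on a range edge (for example, 800 MB) is graded as Warning.

```bash
#!/usr/bin/env bash
# Hypothetical helpers that grade a metric against the KPI table.

# classify VALUE WARN CRIT DIRECTION
#   DIRECTION "high": larger values are worse (memory, error rate)
#   DIRECTION "low":  smaller values are worse (cache hit rate)
classify() {
  awk -v v="$1" -v w="$2" -v c="$3" -v d="$4" 'BEGIN {
    if (d == "high") {
      if (v > c) print "critical"; else if (v >= w) print "warning"; else print "healthy"
    } else {
      if (v < c) print "critical"; else if (v <= w) print "warning"; else print "healthy"
    }
  }'
}

memory_status()     { classify "$1" 500 800 high; }    # memory in MB
error_rate_status() { classify "${1%\%}" 1 5 high; }   # accepts "0.87%" or "0.87"
cache_hit_status()  { classify "$1" 60 40 low; }        # hit rate in percent
```

For example, memory_status "$(curl -s http://localhost:5000/_vti_bin/diagnostics | jq -r '.server.memoryUsageMB')" grades the 245.32 MB from the sample response as healthy. CPU usage and P95 response time are not part of the diagnostics payload shown above, so those values have to come from the host or your monitoring system.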

Monitoring Setup

Application Insights (Azure)

{
  "ApplicationInsights": {
    "ConnectionString": "your-connection-string-here",
    "EnableAdaptiveSampling": true,
    "EnableDependencyTracking": true
  }
}

Prometheus Metrics

When metrics are enabled, Cesivi exposes them at /metrics. A matching Prometheus scrape configuration:

# prometheus.yml
scrape_configs:
  - job_name: 'Cesivi'
    static_configs:
      - targets: ['localhost:5000']
    metrics_path: '/metrics'

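Logging Configuration

Serilog structured logging is enabled by default.

To change log levels or sinks, edit appsettings.Production.json. With "rollingInterval": "Day", the File sink starts a new file each day (for example, logs/Cesivi-20250106.log) and keeps the last 7: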

{
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft": "Warning",
        "System": "Warning"
      }
    },
    "WriteTo": [
      { "Name": "Console" },
      {
        "Name": "File",
        "Args": {
          "path": "logs/Cesivi-.log",
          "rollingInterval": "Day",
          "retainedFileCountLimit": 7
        }
      }
    ]
  }
}

Alert Thresholds

Recommended Alerts:

High Error Rate - Error rate > 5% for 5 minutes
High Memory - Memory > 800MB for 10 minutes
Service Down - Health check fails 3 consecutive times
Slow Responses - P95 response time > 100ms for 5 minutes
Low Cache Hit Rate - Cache hit rate < 40% for 15 minutes
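Each alert above fires only when a condition holds for a sustained window, not on a single sample. Assuming one sample per minute (and one sample per health check for the Service Down rule), "for 5 minutes" means the last five samples all breach the threshold. The sketch below shows that logic; sustained_breach and the sample arrays in the comments are hypothetical, and most alerting systems (for example, the for: clause of a Prometheus alerting rule) implement this for you.

```bash
#!/usr/bin/env bash
# Hypothetical helper for "condition held for N consecutive samples" alerts.

# sustained_breach DIRECTION THRESHOLD COUNT SAMPLE...
# Succeeds when the last COUNT samples are all above (or all below) THRESHOLD.
sustained_breach() {
  local direction=$1 threshold=$2 count=$3 s
  shift 3
  (( $# >= count )) || return 1          # not enough history yet
  for s in "${@: -$count}"; do           # only the most recent COUNT samples
    awk -v s="$s" -v t="$threshold" -v d="$direction" \
      'BEGIN { exit !((d == "above") ? (s > t) : (s < t)) }' || return 1
  done
  return 0
}

# Examples (one sample per minute; arrays are hypothetical):
#   High Error Rate:    sustained_breach above 5 5 "${error_rate_samples[@]}"
#   Low Cache Hit Rate: sustained_breach below 40 15 "${hit_rate_samples[@]}"
#   Service Down:       sustained_breach above 0 3 "${check_failed_flags[@]}"   # 1 = failed
```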

Troubleshooting

Common Issues and Solutions

1. Service Won't Start

Symptoms:
- Service fails to start
- Port binding errors
- Permission denied errors

Diagnosis:

# Check if port 5000 is already in use (sudo is needed to show the owning process)
sudo ss -tlnp | grep :5000

# Check service logs
journalctl -u Cesivi -n 100

# Verify executable permissions
ls -la /opt/Cesivi/Cesivi

Solutions:
- Stop the process holding port 5000: sudo kill $(sudo lsof -t -i:5000)
- Fix permissions: sudo chmod +x /opt/Cesivi/Cesivi
- Check firewall rules: sudo ufw status

2. High Memory Usage

Symptoms:
- Memory usage > 1GB
- Out of memory errors
- Performance degradation

Diagnosis:

# Check memory usage
ps aux | grep Cesivi

# Monitor in real time (-d, joins multiple matching PIDs into the list top expects)
top -p "$(pgrep -d, -f Cesivi)"

# Capture a memory dump and inspect it for leaks
dotnet-dump collect -p $(pgrep -f Cesivi)
dotnet-dump analyze <dump-file>

Solutions:
- Reduce the cache size in configuration
- Implement cache eviction policies
- Restart the service: sudo systemctl restart Cesivi
- Scale horizontally if high memory usage persists

3. Slow Response Times

Symptoms:
- Response times > 50ms
- Timeouts
- Poor user experience

Diagnosis:

# Run the performance benchmark (a Release build gives representative timings)
cd tools/PerformanceBenchmark
dotnet run -c Release

# Check disk I/O
iostat -x 1

# Profile with dotnet-trace
dotnet-trace collect -p $(pgrep -f Cesivi) --duration 00:00:30

Solutions:
- Enable response caching
- Optimize the MockData structure (split large files)
- Upgrade to SSD/NVMe storage
- Enable compression
- Scale horizontally

4. SOAP Service Errors

Symptoms:
- SOAP fault responses
- Invalid XML errors
- Serialization failures

Diagnosis:

# Check SOAP request/response in logs
grep "SOAP" logs/Cesivi-*.log

# Test SOAP endpoint
curl -X POST http://localhost:5000/_vti_bin/Lists.asmx \
  -H "Content-Type: text/xml" \
  -d '<soap:Envelope>...</soap:Envelope>'

Solutions:
- Validate the SOAP envelope structure
- Check XML namespace declarations
- Verify the SOAPAction header
- Review error logs for stack traces
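A complete request is easier to test than a placeholder envelope. The sketch below assumes Cesivi follows SharePoint's Lists.asmx contract, where operations live in the http://schemas.microsoft.com/sharepoint/soap/ namespace and SOAP 1.1 expects a quoted SOAPAction header naming the operation. soap_action and soap_envelope are hypothetical helper names, and GetListCollection is used only because it takes no parameters.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for building a SOAP 1.1 request to Lists.asmx.
SP_SOAP_NS="http://schemas.microsoft.com/sharepoint/soap/"

# soap_action OPERATION -> quoted SOAPAction header value
soap_action() {
  printf '"%s%s"\n' "$SP_SOAP_NS" "$1"
}

# soap_envelope OPERATION -> envelope for an operation that takes no parameters
soap_envelope() {
  cat <<EOF
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <$1 xmlns="$SP_SOAP_NS" />
  </soap:Body>
</soap:Envelope>
EOF
}

# Usage: post the request; a mismatched SOAPAction and body element is a
# common cause of SOAP faults.
#   curl -s -X POST http://localhost:5000/_vti_bin/Lists.asmx \
#     -H "Content-Type: text/xml; charset=utf-8" \
#     -H "SOAPAction: $(soap_action GetListCollection)" \
#     --data-binary "$(soap_envelope GetListCollection)"
```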

5. REST API 404 Errors

Symptoms:
- REST endpoints return 404
- OData queries fail
- Invalid route errors

Diagnosis:

# Look for routing-related entries in the logs
grep "api" logs/Cesivi-*.log

# Test the endpoint (dGVzdDp0ZXN0 is the Base64 encoding of test:test)
curl -H "Authorization: Basic dGVzdDp0ZXN0" http://localhost:5000/_api/web

Solutions:
- Verify the site context in the URL (e.g., /sites/sitename/_api/web)
- Check authentication headers
- Review the routing middleware configuration
- Ensure SharePointRoutingMiddleware is registered
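The two most common causes above, a missing site segment and a malformed Authorization header, can be ruled out by building both values explicitly. This is a sketch: api_url and basic_auth are hypothetical helpers, and the /sites/<name>/_api/web layout is taken from the example URL in the list above.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for building Cesivi REST requests.

# basic_auth USER PASSWORD -> "Basic <base64 of USER:PASSWORD>"
basic_auth() {
  printf 'Basic %s\n' "$(printf '%s:%s' "$1" "$2" | base64 | tr -d '\n')"
}

# api_url BASE_URL [SITE] -> REST root of the root web, or of /sites/SITE
api_url() {
  local base=${1%/} site=${2:-}
  if [[ -n $site ]]; then
    printf '%s/sites/%s/_api/web\n' "$base" "$site"
  else
    printf '%s/_api/web\n' "$base"
  fi
}

# Usage:
#   curl -s -H "Authorization: $(basic_auth test test)" \
#     "$(api_url http://localhost:5000 sitename)"
```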

6. Low Cache Hit Rate

Symptoms:
- Cache hit rate < 40%
- Increased disk I/O
- Slower response times

Diagnosis:

# Check cache statistics
curl -s http://localhost:5000/_vti_bin/diagnostics | jq '.cache.statistics'

Solutions:
- Increase cache expiration time
- Implement cache warming on startup
- Review cache key generation logic
- Add more cacheable endpoints

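Diagnostic Tools

These tools are installed as .NET global tools (for example, dotnet tool install --global dotnet-counters) and must run as root or as the same user as the Cesivi process.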

dotnet-counters - Real-time metrics

dotnet-counters monitor -p $(pgrep -f Cesivi)

dotnet-trace - Performance profiling

dotnet-trace collect -p $(pgrep -f Cesivi) --duration 00:01:00 -o trace.nettrace
dotnet-trace convert trace.nettrace --format speedscope

dotnet-dump - Memory dump analysis

dotnet-dump collect -p $(pgrep -f Cesivi)
dotnet-dump analyze <dump-file>
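pgrep -f Cesivi matches every process whose command line contains "Cesivi" (a running backup-Cesivi.sh or journalctl -u Cesivi, for example), while the -p options above expect a single PID. Assuming Cesivi runs as the systemd unit Cesivi, a hypothetical helper can ask systemd for the unit's main PID instead:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the main PID of the Cesivi systemd unit.
cesivi_pid() {
  local pid
  pid=$(systemctl show -p MainPID --value Cesivi 2>/dev/null)
  if [[ -z $pid || $pid == 0 ]]; then      # systemd reports 0 when the unit is not running
    echo "Cesivi is not running under systemd" >&2
    return 1
  fi
  printf '%s\n' "$pid"
}

# Usage:
#   dotnet-counters monitor -p "$(cesivi_pid)"
#   dotnet-dump collect -p "$(cesivi_pid)"
```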

Backup & Recovery

Backup Strategy

1. MockData Backup (Critical)

Daily automated backup:

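#!/bin/bash
# /opt/scripts/backup-Cesivi.sh
# Abort on any error, unset variable, or failed pipeline stage
set -euo pipefail

BACKUP_DIR="/backup/Cesivi"
DATE=$(date +%Y%m%d-%H%M%S)
SOURCE="/opt/Cesivi/@MockData"

# Create backup (make sure the target directory exists first)
mkdir -p "$BACKUP_DIR"
tar -czf "$BACKUP_DIR/mockdata-$DATE.tar.gz" -C "$SOURCE" .

# Keep last 30 days
find "$BACKUP_DIR" -name "mockdata-*.tar.gz" -mtime +30 -delete

# Upload to S3 (optional; skipped when the AWS CLI is not installed)
if command -v aws >/dev/null 2>&1; then
  aws s3 cp "$BACKUP_DIR/mockdata-$DATE.tar.gz" s3://your-bucket/Cesivi/
fi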

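Make the script executable (sudo chmod +x /opt/scripts/backup-Cesivi.sh), then schedule it to run daily at 02:00, for example in root's crontab (sudo crontab -e):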

0 2 * * * /opt/scripts/backup-Cesivi.sh

2. Configuration Backup

# Backup configuration files
cp /opt/Cesivi/appsettings.Production.json /backup/config/
cp /etc/systemd/system/Cesivi.service /backup/config/

Restore Procedures

Full Restore

# 1. Stop service
sudo systemctl stop Cesivi

# 2. Restore MockData
sudo rm -rf /opt/Cesivi/@MockData/*
sudo tar -xzf /backup/Cesivi/mockdata-20250106-020000.tar.gz \
  -C /opt/Cesivi/@MockData

# 3. Restore configuration
sudo cp /backup/config/appsettings.Production.json \
  /opt/Cesivi/

# 4. Fix permissions
sudo chown -R Cesivi:Cesivi /opt/Cesivi

# 5. Start service
sudo systemctl start Cesivi

# 6. Verify
curl http://localhost:5000/_vti_bin/diagnostics
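A single curl right after systemctl start can fail simply because the service is still starting. The sketch below polls the diagnostics endpoint once per second until it reports "status": "Healthy" or a timeout expires. wait_healthy is a hypothetical helper, and it matches the status with grep so that it does not depend on jq.

```bash
#!/usr/bin/env bash
# Hypothetical helper: wait until Cesivi's diagnostics endpoint reports Healthy.

# wait_healthy URL [TIMEOUT_SECONDS]
wait_healthy() {
  local url=$1 timeout=${2:-60} i
  for (( i = 0; i < timeout; i++ )); do
    # -f makes HTTP errors fail; the pattern accepts compact or spaced JSON
    if curl -fs "$url" | grep -q '"status": *"Healthy"'; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Usage after "systemctl start Cesivi":
#   wait_healthy http://localhost:5000/_vti_bin/diagnostics 60 || echo "Cesivi is not healthy" >&2
```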

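Point-in-Time Recovery

# List available backups
ls -lh /backup/Cesivi/

# Restore a specific backup; stop the service and clear the directory first
# so that files created after the backup was taken do not survive the restore
sudo systemctl stop Cesivi
sudo rm -rf /opt/Cesivi/@MockData/*
sudo tar -xzf /backup/Cesivi/mockdata-20250106-140000.tar.gz \
  -C /opt/Cesivi/@MockData
sudo chown -R Cesivi:Cesivi /opt/Cesivi/@MockData
sudo systemctl start Cesivi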
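The backup script names archives mockdata-YYYYMMDD-HHMMSS.tar.gz. Those fixed-width timestamps sort in time order, so picking the newest backup taken at or before a target time reduces to a string comparison. The hypothetical helpers below sketch that selection, plus an integrity check worth running before a restore (tar -tzf reads the whole archive, so a truncated or corrupt file fails).

```bash
#!/usr/bin/env bash
# Hypothetical helpers for choosing and checking a MockData backup.

# backup_at DIR TARGET -> newest mockdata-*.tar.gz taken at or before TARGET
# (TARGET uses the script's YYYYMMDD-HHMMSS format, e.g. 20250106-120000)
backup_at() {
  local dir=$1 target=$2 f stamp best= best_stamp=
  for f in "$dir"/mockdata-*.tar.gz; do
    [[ -e $f ]] || continue                 # the glob matched nothing
    stamp=${f##*/mockdata-}
    stamp=${stamp%.tar.gz}
    [[ $stamp > $target ]] && continue      # taken after the target time
    if [[ -z $best_stamp || $stamp > $best_stamp ]]; then
      best=$f
      best_stamp=$stamp
    fi
  done
  [[ -n $best ]] && printf '%s\n' "$best"   # fails when no backup qualifies
}

# verify_backup FILE -> succeeds if the archive reads cleanly and holds files
verify_backup() {
  local listing
  listing=$(tar -tzf "$1" 2>/dev/null) || return 1   # corrupt, truncated, or missing
  grep -qvE '^(\./)?$' <<<"$listing"                 # something besides "./"
}

# Usage:
#   file=$(backup_at /backup/Cesivi 20250106-140000) && verify_backup "$file" && echo "$file"
```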

Disaster Recovery Plan

RTO (Recovery Time Objective): < 1 hour
RPO (Recovery Point Objective): < 24 hours

Recovery Steps:

1. Provision a new server (manually or through automation)
2. Install the .NET runtime and dependencies
3. Deploy the Cesivi application
4. Restore the latest MockData backup
5. Restore the configuration
6. Start the service and verify it
7. Point DNS or the load balancer at the new server

Scaling & High Availability

Horizontal Scaling

Load Balancer Configuration (Nginx)

upstream Cesivi_cluster {
    zone Cesivi_cluster 64k;    # shared-memory zone; required by health_check
    least_conn;
    server 192.168.1.10:5000 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:5000 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:5000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name Cesivi.company.com;

    location / {
        proxy_pass http://Cesivi_cluster;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Active health checks require NGINX Plus; open-source nginx rejects
        # this directive and relies on the passive max_fails/fail_timeout
        # settings in the upstream block. Uncomment on NGINX Plus:
        # health_check interval=10s fails=3 passes=2 uri=/_vti_bin/diagnostics;
    }
}

Shared Storage for MockData

Option 1: NFS

# Mount NFS share
sudo mount -t nfs 192.168.1.100:/exports/mockdata /opt/Cesivi/@MockData

Option 2: Azure Files/AWS EFS

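# Azure Files (SMB 3.1.1); uid/gid make the mounted files owned by the
# Cesivi service user, which otherwise could not write to the share
sudo mount -t cifs //storageaccount.file.core.windows.net/mockdata \
  /opt/Cesivi/@MockData \
  -o vers=3.1.1,credentials=/etc/smbcredentials,uid=Cesivi,gid=Cesivi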

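Session Management

For stateless operation, ensure that:
- No session state is kept in memory
- Shared state, if needed, lives in a distributed cache such as Redis
- Sticky sessions are enabled on the load balancer only if some state must stay in memory (in open-source nginx, replace least_conn with ip_hash)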

Vertical Scaling

Increase resources:

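# Edit the systemd service file
sudo nano /etc/systemd/system/Cesivi.service

# Raise the limits in the [Service] section
# (MemoryMax= replaces the deprecated MemoryLimit=; CPUQuota=400% allows four CPUs)
MemoryMax=4G
CPUQuota=400%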

# Reload and restart
sudo systemctl daemon-reload
sudo systemctl restart Cesivi

Kubernetes Auto-Scaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
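metadata:
  name: cesivi-hpa        # object names must be lowercase (RFC 1123)
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cesivi          # must match the Deployment's name exactly
  minReplicas: 3
  maxReplicas: 10
  # Utilization is a percentage of each container's resource requests,
  # so the Deployment must set CPU and memory requests
  metrics: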
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
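The HPA sizes the Deployment with the formula from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization), evaluated per metric. The largest result wins and is clamped to minReplicas..maxReplicas, and a ratio within the default 10% tolerance of 1.0 causes no change. The sketch below reproduces that arithmetic for the CPU and memory targets above; desired_replicas is a hypothetical helper, and it ignores the scale-down stabilization window, not-ready pods, and other behavior settings.

```bash
#!/usr/bin/env bash
# Hypothetical helper reproducing the HPA replica calculation.

# desired_replicas CURRENT MIN MAX TARGET:CURRENT_UTILIZATION...
desired_replicas() {
  local current=$1 min=$2 max=$3
  shift 3
  awk -v cur="$current" -v min="$min" -v max="$max" -v pairs="$*" 'BEGIN {
    n = split(pairs, metric, " ")
    best = 0
    for (i = 1; i <= n; i++) {
      split(metric[i], tv, ":")          # tv[1] = target %, tv[2] = current %
      ratio = tv[2] / tv[1]
      if (ratio >= 0.9 && ratio <= 1.1) {
        want = cur                       # inside the default 10% tolerance: no change
      } else {
        want = cur * ratio
        if (want > int(want)) want = int(want) + 1   # ceil for positive values
      }
      if (want > best) best = want       # the largest proposal across metrics wins
    }
    if (best < min) best = min
    if (best > max) best = max
    print best
  }'
}

# Usage: 3 replicas at 95% CPU (target 70) and 60% memory (target 80)
#   desired_replicas 3 3 10 70:95 80:60
```

In that usage example, CPU proposes ceil(3 * 95 / 70) = 5 replicas and memory proposes ceil(3 * 60 / 80) = 3, so the HPA scales to 5.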

Performance Optimization

See PERFORMANCE.md for a detailed performance optimization guide.

Quick Wins

Enable Response Caching

{
  "ResponseCaching": {
    "Enabled": true,
    "Duration": 300
  }
}

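Enable Compression

Compressing HTTPS responses can expose secrets to CRIME/BREACH-style attacks when a response also reflects attacker-controlled input, which is why ASP.NET Core leaves EnableForHttps off by default.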

{
  "ResponseCompression": {
    "EnableForHttps": true,
    "Providers": ["brotli", "gzip"]
  }
}

Optimize Logging

{
  "Serilog": {
    "MinimumLevel": {
      "Default": "Warning"
    }
  }
}

Security

Security Hardening Checklist

[ ] Enable HTTPS only (disable HTTP)
[ ] Implement proper authentication validation
[ ] Use secure headers (HSTS, CSP, X-Frame-Options)
[ ] Enable rate limiting
[ ] Restrict CORS origins
[ ] Use firewall rules to limit access
[ ] Apply security updates regularly
[ ] Encrypt data at rest
[ ] Implement audit logging
[ ] Use secrets management (not hardcoded credentials)
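The secure-headers item can be spot-checked from the command line. The hypothetical helper below reads raw response headers on stdin and prints each of the three headers named in the checklist that is missing, matching header names case-insensitively as HTTP requires.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list checklist security headers missing from a response.
missing_security_headers() {
  local headers h
  # Normalize: drop CRs and lowercase, since header names are case-insensitive
  headers=$(tr -d '\r' | tr '[:upper:]' '[:lower:]')
  for h in strict-transport-security content-security-policy x-frame-options; do
    grep -q "^$h:" <<<"$headers" || printf '%s\n' "$h"
  done
}

# Usage (GET request; headers go to stdout, the body is discarded):
#   curl -s -D - -o /dev/null https://Cesivi.company.com/_vti_bin/diagnostics | missing_security_headers
```

Check over HTTPS: browsers ignore Strict-Transport-Security when it arrives over plain HTTP.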

HTTPS Configuration

See DEPLOYMENT_GUIDE.md for SSL/TLS setup instructions.

Audit Logging

Enable audit logging in appsettings.json:

{
  "Audit": {
    "Enabled": true,
    "LogPath": "logs/audit.log",
    "Events": ["Login", "Create", "Update", "Delete"]
  }
}

Operational Runbook

Daily Tasks

[ ] Check health endpoint status
[ ] Review error logs for anomalies
[ ] Monitor disk space usage
[ ] Verify backup completion

Weekly Tasks

[ ] Review performance metrics
[ ] Analyze cache hit rates
[ ] Update dependencies if needed
[ ] Test disaster recovery procedure

Monthly Tasks

[ ] Review and archive old logs
[ ] Perform security audit
[ ] Review capacity planning
[ ] Update documentation

Emergency Contacts

Role	Contact	Escalation
On-Call Engineer	oncall@company.com	Level 1
DevOps Team	devops@company.com	Level 2
Platform Lead	platform-lead@company.com	Level 3

Support Resources

Documentation: /docs
GitHub Issues: https://github.com/yourusername/Cesivi/issues
Internal Wiki: https://wiki.company.com/Cesivi
Slack Channel: #Cesivi-support