Cesivi Operations Manual

This document provides operational procedures for monitoring, troubleshooting, backup, and scaling the Cesivi server in production environments.

Table of Contents

  1. Monitoring & Observability
  2. Troubleshooting
  3. Backup & Recovery
  4. Scaling & High Availability
  5. Performance Optimization
  6. Security

Monitoring & Observability

Health Check Endpoint

Endpoint: GET /_vti_bin/diagnostics

Returns comprehensive server metrics in JSON format:

{
  "status": "Healthy",
  "server": {
    "uptime": { "seconds": 3600, "formatted": "01:00:00" },
    "memoryUsageMB": "245.32",
    "threadCount": 42,
    "dotnetVersion": "10.0.0"
  },
  "requests": {
    "total": 15234,
    "successful": 15102,
    "failed": 132,
    "errorRate": "0.87%",
    "byEndpoint": { ... }
  },
  "cache": {
    "statistics": {
      "HitRate": 62.5,
      "MissRate": 37.5,
      "TotalEntries": 156,
      "Evictions": 23
    }
  }
}
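For scripted checks, the response can be reduced to the few fields that the KPIs rely on. The following is a minimal sketch, assuming the payload has the shape shown above and that the server listens on localhost:5000 (the port used elsewhere in this manual). The names fetch_diagnostics and summarize are illustrative and not part of Cesivi.

```python
import json
from urllib.request import urlopen

# Assumed URL; adjust host and port to your deployment.
DIAGNOSTICS_URL = "http://localhost:5000/_vti_bin/diagnostics"


def fetch_diagnostics(url=DIAGNOSTICS_URL, timeout=5):
    """Fetch and decode the diagnostics JSON document."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize(payload):
    """Reduce the diagnostics payload to numeric fields.

    memoryUsageMB and errorRate are strings in the sample response,
    so they are converted to floats here.
    """
    return {
        "healthy": payload["status"] == "Healthy",
        "memory_mb": float(payload["server"]["memoryUsageMB"]),
        "threads": payload["server"]["threadCount"],
        "error_rate_pct": float(payload["requests"]["errorRate"].rstrip("%")),
        "cache_hit_rate_pct": payload["cache"]["statistics"]["HitRate"],
    }
```

Calling summarize(fetch_diagnostics()) against a running instance yields a flat dictionary that is easier to compare against thresholds than the nested response.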

Key Performance Indicators (KPIs)

| Metric              | Healthy Range | Warning   | Critical | Action                        |
|---------------------|---------------|-----------|----------|-------------------------------|
| Memory Usage        | < 500MB       | 500-800MB | > 800MB  | Restart service or scale      |
| CPU Usage           | < 60%         | 60-80%    | > 80%    | Scale horizontally            |
| Error Rate          | < 1%          | 1-5%      | > 5%     | Check logs, investigate errors |
| Cache Hit Rate      | > 60%         | 40-60%    | < 40%    | Review cache config           |
| Response Time (P95) | < 10ms        | 10-50ms   | > 50ms   | Profile, optimize, scale      |
| Thread Count        | < 100         | 100-200   | > 200    | Check for thread leaks        |
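The ranges above can be encoded directly so that a script can label each reading. This is a sketch only: the metric keys (memory_mb, cpu_pct, and so on) are hypothetical names chosen for this example, and the boundaries are copied from the KPI ranges, with the boundary values themselves treated as warnings.

```python
# (warning_at, critical_at) pairs taken from the KPI ranges.
# For these metrics a larger value is worse.
HIGHER_IS_WORSE = {
    "memory_mb": (500, 800),
    "cpu_pct": (60, 80),
    "error_rate_pct": (1, 5),
    "p95_ms": (10, 50),
    "threads": (100, 200),
}
# Cache hit rate is the one metric where a smaller value is worse.
LOWER_IS_WORSE = {
    "cache_hit_rate_pct": (60, 40),
}


def classify(metric, value):
    """Return 'healthy', 'warning', or 'critical' for one reading."""
    if metric in HIGHER_IS_WORSE:
        warning_at, critical_at = HIGHER_IS_WORSE[metric]
        if value > critical_at:
            return "critical"
        if value >= warning_at:
            return "warning"
        return "healthy"
    # Raises KeyError for metric names that are not listed above.
    warning_at, critical_at = LOWER_IS_WORSE[metric]
    if value < critical_at:
        return "critical"
    if value <= warning_at:
        return "warning"
    return "healthy"
```

For example, classify("memory_mb", 245.32) gives "healthy", while a cache hit rate of 39.9 is reported as "critical".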

Monitoring Setup

Application Insights (Azure)

{
  "ApplicationInsights": {
    "InstrumentationKey": "your-key-here",
    "EnableAdaptiveSampling": true,
    "EnableDependencyTracking": true
  }
}

Prometheus Metrics

Cesivi exposes metrics at /metrics (if enabled):

# prometheus.yml
scrape_configs:
  - job_name: 'Cesivi'
    static_configs:
      - targets: ['localhost:5000']
    metrics_path: '/metrics'

Logging Configuration

Serilog structured logging is enabled by default.

Edit appsettings.Production.json:

{
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft": "Warning",
        "System": "Warning"
      }
    },
    "WriteTo": [
      { "Name": "Console" },
      {
        "Name": "File",
        "Args": {
          "path": "logs/Cesivi-.log",
          "rollingInterval": "Day",
          "retainedFileCountLimit": 7
        }
      }
    ]
  }
}

Alert Thresholds

Recommended Alerts:

  1. High Error Rate - Error rate > 5% for 5 minutes
  2. High Memory - Memory > 800MB for 10 minutes
  3. Service Down - Health check fails 3 consecutive times
  4. Slow Responses - P95 response time > 100ms for 5 minutes
  5. Low Cache Hit Rate - Cache hit rate < 40% for 15 minutes
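Each of these alerts fires only after its condition has held for a sustained period. A minimal sketch of that logic follows, assuming the monitor samples at a fixed interval so that "for 5 minutes" at one sample per minute becomes five consecutive breaches. The class name is illustrative; most monitoring systems provide this behaviour natively.

```python
class ConsecutiveBreachAlert:
    """Fire once a condition has held for `required` consecutive samples."""

    def __init__(self, required):
        self.required = required
        self.streak = 0

    def observe(self, breached):
        """Record one sample; return True while the alert is firing."""
        # A single good sample resets the streak to zero.
        self.streak = self.streak + 1 if breached else 0
        return self.streak >= self.required


# "Service Down": health check fails 3 consecutive times.
service_down = ConsecutiveBreachAlert(required=3)
samples = (True, True, False, True, True, True)
firing = [service_down.observe(failed) for failed in samples]
```

In this run the recovery at the third sample resets the count, so the alert fires only on the final sample.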

Troubleshooting

Common Issues and Solutions

1. Service Won't Start

Symptoms:

  • Service fails to start
  • Port binding errors
  • Permission denied errors

Diagnosis:

# Check if port is already in use
netstat -tulpn | grep :5000

# Check service logs
journalctl -u Cesivi -n 100

# Verify executable permissions
ls -la /opt/Cesivi/Cesivi

Solutions:

  • Kill process using port 5000: kill $(lsof -t -i:5000)
  • Fix permissions: chmod +x /opt/Cesivi/Cesivi
  • Check firewall rules: sudo ufw status

2. High Memory Usage

Symptoms:

  • Memory usage > 1GB
  • Out of memory errors
  • Performance degradation

Diagnosis:

# Check memory usage
ps aux | grep Cesivi

# Monitor in real-time
top -p $(pgrep -f Cesivi)

# Check for memory leaks
dotnet-dump collect -p $(pgrep -f Cesivi)
dotnet-dump analyze <dump-file>

Solutions:

  • Reduce cache size in configuration
  • Implement cache eviction policies
  • Restart service: systemctl restart Cesivi
  • Scale horizontally if persistent

3. Slow Response Times

Symptoms:

  • Response times > 50ms
  • Timeouts
  • Poor user experience

Diagnosis:

# Run performance benchmark
cd tools/PerformanceBenchmark
dotnet run

# Check disk I/O
iostat -x 1

# Profile with dotnet-trace for 30 seconds (duration format is dd:hh:mm:ss)
dotnet-trace collect -p $(pgrep -f Cesivi) --duration 00:00:00:30

Solutions:

  • Enable response caching
  • Optimize MockData structure (split large files)
  • Upgrade to SSD/NVMe storage
  • Enable compression
  • Scale horizontally

4. SOAP Service Errors

Symptoms:

  • SOAP fault responses
  • Invalid XML errors
  • Serialization failures

Diagnosis:

# Check SOAP request/response in logs
grep "SOAP" logs/Cesivi-*.log

# Test SOAP endpoint
curl -X POST http://localhost:5000/_vti_bin/Lists.asmx \
  -H "Content-Type: text/xml" \
  -d '<soap:Envelope>...</soap:Envelope>'

Solutions:

  • Validate SOAP envelope structure
  • Check XML namespace declarations
  • Verify SOAPAction header
  • Review error logs for stack traces
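Many SOAP faults come from a malformed envelope or a SOAPAction header that does not match the operation. The sketch below builds a well-formed SOAP 1.1 envelope and the matching headers with the standard library. It uses the SharePoint Lists service namespace and the GetListCollection operation as an example; whether a given Cesivi build implements that operation is an assumption to verify against your deployment.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"      # SOAP 1.1
SP_SOAP = "http://schemas.microsoft.com/sharepoint/soap/"   # SharePoint Lists service


def build_envelope(operation):
    """Return a SOAP 1.1 envelope (bytes) wrapping an empty operation element."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    ET.SubElement(body, f"{{{SP_SOAP}}}{operation}")
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


def soap_headers(operation):
    """Headers for the call; SOAPAction is the quoted namespace plus operation."""
    return {
        "Content-Type": "text/xml; charset=utf-8",
        "SOAPAction": f'"{SP_SOAP}{operation}"',
    }
```

The bytes returned by build_envelope("GetListCollection") can be written to a file and sent with curl --data-binary @file, together with the headers from soap_headers, to rule out client-side envelope problems.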

5. REST API 404 Errors

Symptoms:

  • REST endpoints return 404
  • OData queries fail
  • Invalid route errors

Diagnosis:

# Check routing configuration
grep "api" logs/Cesivi-*.log

# Test endpoint
curl -H "Authorization: Basic dGVzdDp0ZXN0" http://localhost:5000/_api/web

Solutions:

  • Verify site context in URL (e.g., /sites/sitename/_api/web)
  • Check authentication headers
  • Review routing middleware configuration
  • Ensure SharePointRoutingMiddleware is registered
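When checking authentication headers, it helps to know that the Authorization value in the test request is simply the Base64 encoding of test:test. A short sketch for building and decoding such values follows; the helper names are illustrative.

```python
import base64


def basic_auth_header(username, password):
    """Build an HTTP Basic Authorization header value."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def decode_basic_auth(header_value):
    """Recover (username, password) from a Basic header value."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic authorization header")
    decoded = base64.b64decode(token).decode("utf-8")
    # Only the first colon separates the user name from the password.
    username, _, password = decoded.partition(":")
    return username, password
```

Decoding a header captured from a failing client quickly shows whether the credentials it sends are the ones you expect.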

6. Low Cache Hit Rate

Symptoms:

  • Cache hit rate < 40%
  • Increased disk I/O
  • Slower response times

Diagnosis:

# Check cache statistics
curl http://localhost:5000/_vti_bin/diagnostics | jq '.cache.statistics'

Solutions:

  • Increase cache expiration time
  • Implement cache warming on startup
  • Review cache key generation logic
  • Add more cacheable endpoints
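Cache warming can also be approximated from outside the process by requesting frequently used endpoints once after each restart. The following is a sketch under that assumption, not a description of how Cesivi warms its cache internally. The list of paths is hypothetical and should come from your own traffic data.

```python
from urllib.request import urlopen


def http_get(url, timeout=5):
    """Fetch a URL and discard the body; raises OSError on failure."""
    with urlopen(url, timeout=timeout) as response:
        response.read()


def warm_cache(base_url, paths, fetch=http_get):
    """Request each path once and report which requests succeeded."""
    results = {}
    for path in paths:
        try:
            fetch(base_url + path)
            results[path] = True
        except OSError:
            # URLError and HTTPError are both OSError subclasses.
            results[path] = False
    return results
```

A call such as warm_cache("http://localhost:5000", ["/_api/web"]) could run from a post-start hook. The fetch parameter exists so the function can be exercised without a live server.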

Diagnostic Tools

dotnet-counters - Real-time metrics

dotnet-counters monitor -p $(pgrep -f Cesivi)

dotnet-trace - Performance profiling

dotnet-trace collect -p $(pgrep -f Cesivi) --duration 00:00:01:00
dotnet-trace convert trace.nettrace --format speedscope

dotnet-dump - Memory dump analysis

dotnet-dump collect -p $(pgrep -f Cesivi)
dotnet-dump analyze <dump-file>


Backup & Recovery

Backup Strategy

1. MockData Backup (Critical)

Daily automated backup:

#!/bin/bash
# /opt/scripts/backup-Cesivi.sh
# Abort on errors, unset variables, and failed pipeline stages
set -euo pipefail

BACKUP_DIR="/backup/Cesivi"
DATE=$(date +%Y%m%d-%H%M%S)
SOURCE="/opt/Cesivi/@MockData"

# Make sure the destination directory exists
mkdir -p "$BACKUP_DIR"

# Create backup
tar -czf "$BACKUP_DIR/mockdata-$DATE.tar.gz" -C "$SOURCE" .

# Keep last 30 days
find "$BACKUP_DIR" -name "mockdata-*.tar.gz" -mtime +30 -delete

# Upload to S3 (optional)
aws s3 cp "$BACKUP_DIR/mockdata-$DATE.tar.gz" s3://your-bucket/Cesivi/

Schedule with cron:

0 2 * * * /opt/scripts/backup-Cesivi.sh
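A scheduled backup is only useful if the archive can be read back. The sketch below checks that an archive opens and lists at least one member. The function name and the idea of running it after the nightly job are suggestions, not part of the shipped tooling.

```python
import tarfile


def verify_backup(archive_path):
    """Return the number of members in a gzip-compressed tar archive.

    Walking all member headers forces the gzip stream to be read up
    to the end-of-archive marker, which surfaces most truncation or
    corruption as an exception. An archive with no members is
    rejected explicitly.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        members = archive.getmembers()
    if not members:
        raise ValueError(f"{archive_path} contains no files")
    return len(members)
```

This is a structural check only. It does not compare file contents with the source directory, so a periodic test restore is still worthwhile.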

2. Configuration Backup

# Backup configuration files
cp /opt/Cesivi/appsettings.Production.json /backup/config/
cp /etc/systemd/system/Cesivi.service /backup/config/

Restore Procedures

Full Restore

# 1. Stop service
sudo systemctl stop Cesivi

# 2. Restore MockData (archive names follow mockdata-YYYYMMDD-HHMMSS.tar.gz)
sudo rm -rf /opt/Cesivi/@MockData/*
sudo tar -xzf /backup/Cesivi/mockdata-20250106-020000.tar.gz \
  -C /opt/Cesivi/@MockData

# 3. Restore configuration
sudo cp /backup/config/appsettings.Production.json \
  /opt/Cesivi/

# 4. Fix permissions
sudo chown -R Cesivi:Cesivi /opt/Cesivi

# 5. Start service
sudo systemctl start Cesivi

# 6. Verify
curl http://localhost:5000/_vti_bin/diagnostics

Point-in-Time Recovery

# List available backups
ls -lh /backup/Cesivi/

# Restore specific backup
sudo tar -xzf /backup/Cesivi/mockdata-20250106-140000.tar.gz \
  -C /opt/Cesivi/@MockData
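Choosing the right archive for a point-in-time restore means finding the newest backup taken at or before the target moment. A small sketch follows, assuming archive names carry the mockdata-YYYYMMDD-HHMMSS.tar.gz timestamp produced by the backup script. The function names are illustrative.

```python
import re
from datetime import datetime

# Matches names such as mockdata-20250106-140000.tar.gz
BACKUP_NAME = re.compile(r"^mockdata-(\d{8}-\d{6})\.tar\.gz$")


def backup_time(filename):
    """Return the timestamp encoded in a backup filename, or None."""
    match = BACKUP_NAME.match(filename)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d-%H%M%S")


def latest_backup_before(filenames, target):
    """Pick the newest backup taken at or before `target`, or None."""
    best_name, best_time = None, None
    for name in filenames:
        taken = backup_time(name)
        if taken is None or taken > target:
            continue  # not a backup archive, or taken after the target
        if best_time is None or taken > best_time:
            best_name, best_time = name, taken
    return best_name
```

Passing os.listdir("/backup/Cesivi") and the desired datetime returns the archive name to hand to the tar command above.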

Disaster Recovery Plan

  1. RTO (Recovery Time Objective): < 1 hour
  2. RPO (Recovery Point Objective): < 24 hours

Recovery Steps:

  1. Provision new server (manual or automated)
  2. Install .NET runtime and dependencies
  3. Deploy Cesivi application
  4. Restore latest MockData backup
  5. Restore configuration
  6. Start service and verify
  7. Update DNS/load balancer

Scaling & High Availability

Horizontal Scaling

Load Balancer Configuration (Nginx)

upstream Cesivi_cluster {
    # Shared-memory zone, required by the health_check directive below
    zone Cesivi_cluster 64k;
    least_conn;
    server 192.168.1.10:5000 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:5000 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:5000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name Cesivi.company.com;

    location / {
        proxy_pass http://Cesivi_cluster;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Active health check (NGINX Plus only; on open-source nginx remove
        # this line and rely on the passive max_fails/fail_timeout checks)
        health_check interval=10s fails=3 passes=2 uri=/_vti_bin/diagnostics;
    }
}
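The least_conn directive sends each new request to the backend that currently has the fewest active connections. The idea can be illustrated in a few lines. This is a simplified model for intuition, not nginx's implementation, which also accounts for server weights and breaks ties with weighted round-robin.

```python
def pick_least_conn(active):
    """Return the backend with the fewest active connections.

    `active` maps backend address -> current connection count.
    Ties go to the first backend in insertion order.
    """
    return min(active, key=active.get)


# Connection counts are made-up sample values.
active = {
    "192.168.1.10:5000": 4,
    "192.168.1.11:5000": 1,
    "192.168.1.12:5000": 3,
}
target = pick_least_conn(active)
active[target] += 1  # the chosen backend now holds one more connection
```

Compared with plain round-robin, this keeps slow requests from piling up on one instance, which suits workloads with uneven response times.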

Shared Storage for MockData

Option 1: NFS

# Mount NFS share
sudo mount -t nfs 192.168.1.100:/exports/mockdata /opt/Cesivi/@MockData

Option 2: Azure Files/AWS EFS

# Azure Files
sudo mount -t cifs //storageaccount.file.core.windows.net/mockdata \
  /opt/Cesivi/@MockData \
  -o credentials=/etc/smbcredentials

Session Management

To keep instances stateless:

  • Keep no in-memory session state
  • Use a distributed cache (Redis) if shared state is needed
  • Enable sticky sessions on the load balancer only if required

Vertical Scaling

Increase resources:

# Update systemd service file
sudo nano /etc/systemd/system/Cesivi.service

# Increase limits in the [Service] section
# (MemoryMax= replaces the deprecated MemoryLimit= on cgroup v2 systems)
MemoryMax=4G
CPUQuota=400%

# Reload and restart
sudo systemctl daemon-reload
sudo systemctl restart Cesivi

Kubernetes Auto-Scaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: Cesivi-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: Cesivi
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
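To reason about how this autoscaler will react, it helps to apply the documented Kubernetes scaling rule by hand: desired replicas = ceil(current replicas × current utilization / target utilization), evaluated per metric, with the largest result used. The sketch below mirrors that rule with the targets and replica bounds from the manifest above and the default 10% tolerance. It omits stabilization windows and pod readiness handling, so treat it as an approximation.

```python
import math


def desired_replicas(current_replicas, utilization, target,
                     min_replicas=3, max_replicas=10, tolerance=0.1):
    """Desired replica count for a single metric."""
    ratio = utilization / target
    # Within the tolerance band the autoscaler leaves the count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def hpa_decision(current_replicas, cpu_pct, memory_pct):
    """With several metrics, the largest proposed count is used."""
    return max(
        desired_replicas(current_replicas, cpu_pct, 70),
        desired_replicas(current_replicas, memory_pct, 80),
    )
```

For instance, three replicas at 140% average CPU utilization scale to six, while 72% CPU stays at three because it falls inside the tolerance band.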

Performance Optimization

See PERFORMANCE.md for a detailed performance optimization guide.

Quick Wins

  1. Enable Response Caching

    {
      "ResponseCaching": {
        "Enabled": true,
        "Duration": 300
      }
    }
    

  2. Enable Compression

    {
      "ResponseCompression": {
        "EnableForHttps": true,
        "Providers": ["brotli", "gzip"]
      }
    }
    

  3. Optimize Logging

    {
      "Serilog": {
        "MinimumLevel": {
          "Default": "Warning"
        }
      }
    }
    


Security

Security Hardening Checklist

  • [ ] Enable HTTPS only (disable HTTP)
  • [ ] Implement proper authentication validation
  • [ ] Use secure headers (HSTS, CSP, X-Frame-Options)
  • [ ] Enable rate limiting
  • [ ] Restrict CORS origins
  • [ ] Use firewall rules to limit access
  • [ ] Regular security updates
  • [ ] Encrypt data at rest
  • [ ] Implement audit logging
  • [ ] Use secrets management (not hardcoded credentials)

HTTPS Configuration

See DEPLOYMENT_GUIDE.md for SSL/TLS setup instructions.

Audit Logging

Enable audit logging in appsettings.json:

{
  "Audit": {
    "Enabled": true,
    "LogPath": "logs/audit.log",
    "Events": ["Login", "Create", "Update", "Delete"]
  }
}

Operational Runbook

Daily Tasks

  • [ ] Check health endpoint status
  • [ ] Review error logs for anomalies
  • [ ] Monitor disk space usage
  • [ ] Verify backup completion

Weekly Tasks

  • [ ] Review performance metrics
  • [ ] Analyze cache hit rates
  • [ ] Update dependencies if needed
  • [ ] Test disaster recovery procedure

Monthly Tasks

  • [ ] Review and archive old logs
  • [ ] Perform security audit
  • [ ] Review capacity planning
  • [ ] Update documentation

Emergency Contacts

| Role             | Contact                   | Escalation |
|------------------|---------------------------|------------|
| On-Call Engineer | oncall@company.com        | Level 1    |
| DevOps Team      | devops@company.com        | Level 2    |
| Platform Lead    | platform-lead@company.com | Level 3    |

Support Resources

  • Documentation: /docs
  • GitHub Issues: https://github.com/yourusername/Cesivi/issues
  • Internal Wiki: https://wiki.company.com/Cesivi
  • Slack Channel: #Cesivi-support