Cesivi Operations Manual¶
This document provides operational procedures for monitoring, troubleshooting, backup, and scaling the Cesivi server in production environments.
Table of Contents¶
- Monitoring & Observability
- Troubleshooting
- Backup & Recovery
- Scaling & High Availability
- Performance Optimization
- Security
Monitoring & Observability¶
Health Check Endpoint¶
Endpoint: GET /_vti_bin/diagnostics
Returns comprehensive server metrics in JSON format:
{
  "status": "Healthy",
  "server": {
    "uptime": { "seconds": 3600, "formatted": "01:00:00" },
    "memoryUsageMB": "245.32",
    "threadCount": 42,
    "dotnetVersion": "10.0.0"
  },
  "requests": {
    "total": 15234,
    "successful": 15102,
    "failed": 132,
    "errorRate": "0.87%",
    "byEndpoint": { ... }
  },
  "cache": {
    "statistics": {
      "HitRate": 62.5,
      "MissRate": 37.5,
      "TotalEntries": 156,
      "Evictions": 23
    }
  }
}
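The derived fields in the sample response appear to follow directly from the raw counters: 132 failed requests out of 15234 is 0.87%, and 3600 seconds formats as 01:00:00. The sketch below reproduces both, which can help when cross-checking a dashboard against the endpoint. The function names `error_rate` and `format_uptime` are hypothetical, and the formulas are inferred from the sample values, not taken from the Cesivi source.

```shell
#!/bin/bash
# Hypothetical helpers; formulas are inferred from the sample response above.

# error_rate FAILED TOTAL -> percentage with two decimals, e.g. "0.87%"
error_rate() {
  awk -v f="$1" -v t="$2" 'BEGIN {
    if (t == 0) print "0.00%"                 # avoid division by zero
    else        printf "%.2f%%\n", 100 * f / t
  }'
}

# format_uptime SECONDS -> HH:MM:SS
format_uptime() {
  local s="$1"
  printf '%02d:%02d:%02d\n' $((s / 3600)) $((s % 3600 / 60)) $((s % 60))
}
```

For example, `error_rate 132 15234` prints `0.87%` and `format_uptime 3600` prints `01:00:00`, matching the sample.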
Key Performance Indicators (KPIs)¶
| Metric | Healthy Range | Warning | Critical | Action |
|---|---|---|---|---|
| Memory Usage | < 500MB | 500-800MB | > 800MB | Restart service or scale |
| CPU Usage | < 60% | 60-80% | > 80% | Scale horizontally |
| Error Rate | < 1% | 1-5% | > 5% | Check logs, investigate errors |
| Cache Hit Rate | > 60% | 40-60% | < 40% | Review cache config |
| Response Time (P95) | < 10ms | 10-50ms | > 50ms | Profile, optimize, scale |
| Thread Count | < 100 | 100-200 | > 200 | Check for thread leaks |
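The thresholds in this table can be applied mechanically in a monitoring script. The following is a minimal sketch, assuming that a value exactly on a boundary (for example 500MB of memory or a 60% hit rate) counts as Warning, which the table leaves open. `classify_kpi` is a hypothetical helper, not part of Cesivi.

```shell
#!/bin/bash
# classify_kpi VALUE WARN CRIT [lower|higher]
#   lower  (default): smaller values are better (memory, CPU, error rate, P95)
#   higher:           larger values are better (cache hit rate)
# Hypothetical helper; boundary values are treated as "warning" by assumption.
classify_kpi() {
  awk -v v="$1" -v w="$2" -v c="$3" -v d="${4:-lower}" 'BEGIN {
    v += 0; w += 0; c += 0                          # force numeric comparison
    if (d == "higher") { v = -v; w = -w; c = -c }   # flip so that larger is always worse
    if (v > c)       print "critical"
    else if (v >= w) print "warning"
    else             print "healthy"
  }'
}
```

With the sample diagnostics values, `classify_kpi 245.32 500 800` reports `healthy` for memory and `classify_kpi 62.5 60 40 higher` reports `healthy` for the cache hit rate.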
Monitoring Setup¶
Application Insights (Azure)¶
{
  "ApplicationInsights": {
    "InstrumentationKey": "your-key-here",
    "EnableAdaptiveSampling": true,
    "EnableDependencyTracking": true
  }
}
Prometheus Metrics¶
Cesivi exposes metrics at /metrics (if enabled):
# prometheus.yml
scrape_configs:
  - job_name: 'Cesivi'
    static_configs:
      - targets: ['localhost:5000']
    metrics_path: '/metrics'
Logging Configuration¶
Serilog structured logging is enabled by default.
Edit appsettings.Production.json:
{
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft": "Warning",
        "System": "Warning"
      }
    },
    "WriteTo": [
      { "Name": "Console" },
      {
        "Name": "File",
        "Args": {
          "path": "logs/Cesivi-.log",
          "rollingInterval": "Day",
          "retainedFileCountLimit": 7
        }
      }
    ]
  }
}
Alert Thresholds¶
Recommended Alerts:
- High Error Rate - Error rate > 5% for 5 minutes
- High Memory - Memory > 800MB for 10 minutes
- Service Down - Health check fails 3 consecutive times
- Slow Responses - P95 response time > 100ms for 5 minutes
- Low Cache Hit Rate - Cache hit rate < 40% for 15 minutes
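The Service Down rule depends on consecutive failures, not on the total number of failures in a window. The sketch below shows one way to count them from a stream of probe results. The `ok`/`fail` line format and the function names `trailing_failures` and `service_down` are assumptions made for illustration.

```shell
#!/bin/bash
# trailing_failures: reads one probe result per line ("ok" or "fail") from stdin
# and prints how many failures occur in a row at the end of the stream.
# Hypothetical helper; the input format is an assumption.
trailing_failures() {
  local run=0 result
  while read -r result || [ -n "$result" ]; do
    if [ "$result" = "fail" ]; then
      run=$((run + 1))
    else
      run=0                      # any success resets the streak
    fi
  done
  echo "$run"
}

# service_down THRESHOLD: succeeds when the stream ends with at least
# THRESHOLD consecutive failures (3 in the alert list above).
service_down() {
  [ "$(trailing_failures)" -ge "$1" ]
}
```

A probe loop could append one line per check, for example `curl -fsS -o /dev/null http://localhost:5000/_vti_bin/diagnostics && echo ok || echo fail`, and pipe the recent results into `service_down 3`.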
Troubleshooting¶
Common Issues and Solutions¶
1. Service Won't Start¶
Symptoms:
- Service fails to start
- Port binding errors
- Permission denied errors
Diagnosis:
# Check if port is already in use
netstat -tulpn | grep :5000
# Check service logs
journalctl -u Cesivi -n 100
# Verify executable permissions
ls -la /opt/Cesivi/Cesivi
Solutions:
- Kill process using port 5000: kill $(lsof -t -i:5000)
- Fix permissions: chmod +x /opt/Cesivi/Cesivi
- Check firewall rules: sudo ufw status
2. High Memory Usage¶
Symptoms:
- Memory usage > 1GB
- Out of memory errors
- Performance degradation
Diagnosis:
# Check memory usage
ps aux | grep Cesivi
# Monitor in real-time
top -p $(pgrep -f Cesivi)
# Check for memory leaks
dotnet-dump collect -p $(pgrep -f Cesivi)
dotnet-dump analyze <dump-file>
Solutions:
- Reduce cache size in configuration
- Implement cache eviction policies
- Restart service: systemctl restart Cesivi
- Scale horizontally if persistent
3. Slow Response Times¶
Symptoms:
- Response times > 50ms
- Timeouts
- Poor user experience
Diagnosis:
# Run performance benchmark
cd tools/PerformanceBenchmark
dotnet run
# Check disk I/O
iostat -x 1
# Profile with dotnet-trace for 30 seconds (--duration uses the dd:hh:mm:ss format)
dotnet-trace collect -p $(pgrep -f Cesivi) --duration 00:00:00:30
Solutions:
- Enable response caching
- Optimize MockData structure (split large files)
- Upgrade to SSD/NVMe storage
- Enable compression
- Scale horizontally
4. SOAP Service Errors¶
Symptoms:
- SOAP fault responses
- Invalid XML errors
- Serialization failures
Diagnosis:
# Check SOAP request/response in logs
grep "SOAP" logs/Cesivi-*.log
# Test SOAP endpoint
curl -X POST http://localhost:5000/_vti_bin/Lists.asmx \
-H "Content-Type: text/xml" \
-d '<soap:Envelope>...</soap:Envelope>'
Solutions:
- Validate SOAP envelope structure
- Check XML namespace declarations
- Verify SOAPAction header
- Review error logs for stack traces
5. REST API 404 Errors¶
Symptoms:
- REST endpoints return 404
- OData queries fail
- Invalid route errors
Diagnosis:
# Check routing configuration
grep "api" logs/Cesivi-*.log
# Test endpoint
curl -H "Authorization: Basic dGVzdDp0ZXN0" http://localhost:5000/_api/web
Solutions:
- Verify site context in URL (e.g., /sites/sitename/_api/web)
- Check authentication headers
- Review routing middleware configuration
- Ensure SharePointRoutingMiddleware is registered
6. Low Cache Hit Rate¶
Symptoms:
- Cache hit rate < 40%
- Increased disk I/O
- Slower response times
Diagnosis:
# Check cache statistics
curl http://localhost:5000/_vti_bin/diagnostics | jq '.cache.statistics'
Solutions:
- Increase cache expiration time
- Implement cache warming on startup
- Review cache key generation logic
- Add more cacheable endpoints
Diagnostic Tools¶
dotnet-counters - Real-time metrics
dotnet-counters monitor -p $(pgrep -f Cesivi)
dotnet-trace - Performance profiling
dotnet-trace collect -p $(pgrep -f Cesivi) --duration 00:00:01:00 -o trace.nettrace
dotnet-trace convert trace.nettrace --format Speedscope
dotnet-dump - Memory dump analysis
dotnet-dump collect -p $(pgrep -f Cesivi)
dotnet-dump analyze <dump-file>
Backup & Recovery¶
Backup Strategy¶
1. MockData Backup (Critical)¶
Daily automated backup:
#!/bin/bash
# /opt/scripts/backup-Cesivi.sh
# Abort on any error so a failed archive is never rotated in or uploaded
set -euo pipefail
BACKUP_DIR="/backup/Cesivi"
DATE=$(date +%Y%m%d-%H%M%S)
SOURCE="/opt/Cesivi/@MockData"
# Create backup (archives are named mockdata-YYYYMMDD-HHMMSS.tar.gz)
mkdir -p "$BACKUP_DIR"
tar -czf "$BACKUP_DIR/mockdata-$DATE.tar.gz" -C "$SOURCE" .
# Delete backups older than 30 days
find "$BACKUP_DIR" -name "mockdata-*.tar.gz" -mtime +30 -delete
# Upload to S3 (optional)
aws s3 cp "$BACKUP_DIR/mockdata-$DATE.tar.gz" s3://your-bucket/Cesivi/
Schedule with cron:
0 2 * * * /opt/scripts/backup-Cesivi.sh
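A scheduled backup is only useful if the archive can be read back. The sketch below checks that an archive exists, is non-empty, and can be listed by `tar`. `verify_backup` is a hypothetical helper, and listing the archive is a basic integrity check only: it does not compare the contents against the source directory.

```shell
#!/bin/bash
# verify_backup ARCHIVE: succeeds only if the gzip-compressed tar archive
# exists, is not empty, and can be listed without errors.
# Hypothetical helper, not part of Cesivi.
verify_backup() {
  local archive="$1" listing
  [ -s "$archive" ] || return 1                            # missing or zero-byte file
  listing=$(tar -tzf "$archive" 2>/dev/null) || return 1   # unreadable or corrupt archive
  [ -n "$listing" ]                                        # at least one entry
}
```

In the backup script this could run right after the `tar -czf` step, for example `verify_backup "$BACKUP_DIR/mockdata-$DATE.tar.gz" || exit 1`, so that a broken archive is reported before old backups are pruned.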
2. Configuration Backup¶
# Backup configuration files
cp /opt/Cesivi/appsettings.Production.json /backup/config/
cp /etc/systemd/system/Cesivi.service /backup/config/
Restore Procedures¶
Full Restore¶
# 1. Stop service
sudo systemctl stop Cesivi
# 2. Restore MockData (substitute the actual archive name, mockdata-YYYYMMDD-HHMMSS.tar.gz)
sudo rm -rf /opt/Cesivi/@MockData/*
sudo tar -xzf /backup/Cesivi/mockdata-20250106-020000.tar.gz \
  -C /opt/Cesivi/@MockData
# 3. Restore configuration
sudo cp /backup/config/appsettings.Production.json \
/opt/Cesivi/
# 4. Fix permissions
sudo chown -R Cesivi:Cesivi /opt/Cesivi
# 5. Start service
sudo systemctl start Cesivi
# 6. Verify
curl http://localhost:5000/_vti_bin/diagnostics
Point-in-Time Recovery¶
# List available backups
ls -lh /backup/Cesivi/
# Restore a specific backup (stop the service and clear @MockData first, as in Full Restore)
sudo tar -xzf /backup/Cesivi/mockdata-20250106-140000.tar.gz \
  -C /opt/Cesivi/@MockData
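Choosing the archive for a point-in-time restore means finding the newest backup taken at or before the target time. Because the backup script names archives `mockdata-YYYYMMDD-HHMMSS.tar.gz`, the timestamp can be read from the file name. The sketch below assumes every archive in the directory follows that pattern; `pick_backup` is a hypothetical helper.

```shell
#!/bin/bash
# pick_backup DIR TARGET: prints the newest mockdata archive in DIR whose
# timestamp is at or before TARGET (format YYYYMMDD-HHMMSS).
# Fails (non-zero status) when no archive qualifies.
# Hypothetical helper; assumes all archives use the mockdata-YYYYMMDD-HHMMSS name.
pick_backup() {
  local dir="$1" target="$2" file stamp best=""
  target=${target%-*}${target#*-}        # 20250106-140000 -> 20250106140000
  for file in "$dir"/mockdata-*.tar.gz; do
    [ -e "$file" ] || continue           # the glob matched nothing
    stamp=${file##*/mockdata-}           # strip directory and prefix
    stamp=${stamp%.tar.gz}               # strip extension
    case $stamp in
      *-*) ;;                            # expected YYYYMMDD-HHMMSS form
      *)   continue ;;                   # skip names without a time part
    esac
    stamp=${stamp%-*}${stamp#*-}
    if [ "$stamp" -le "$target" ]; then
      best=$file                         # glob order is ascending, so the last match is the newest
    fi
  done
  [ -n "$best" ] && printf '%s\n' "$best"
}
```

For example, `pick_backup /backup/Cesivi 20250106-235959` would print the last archive taken on or before the end of 6 January 2025, which can then be passed to `tar -xzf`.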
Disaster Recovery Plan¶
- RTO (Recovery Time Objective): < 1 hour
- RPO (Recovery Point Objective): < 24 hours
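The RPO can be checked continuously by comparing the age of the newest backup with the 24-hour objective. A minimal sketch, assuming timestamps are given as Unix epoch seconds; `rpo_met` is a hypothetical helper.

```shell
#!/bin/bash
# rpo_met LAST_BACKUP_EPOCH NOW_EPOCH [RPO_SECONDS]
# Succeeds when the newest backup is younger than the RPO (default: 24 hours).
# Hypothetical helper; timestamps are Unix epoch seconds.
rpo_met() {
  local last="$1" now="$2" rpo="${3:-86400}"
  [ $((now - last)) -lt "$rpo" ]
}
```

On systems with GNU coreutils, the inputs could come from `stat -c %Y` on the newest archive and from `date +%s`.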
Recovery Steps:
1. Provision new server (manual or automated)
2. Install .NET runtime and dependencies
3. Deploy Cesivi application
4. Restore latest MockData backup
5. Restore configuration
6. Start service and verify
7. Update DNS/load balancer
Scaling & High Availability¶
Horizontal Scaling¶
Load Balancer Configuration (Nginx)¶
upstream Cesivi_cluster {
    zone Cesivi_cluster 64k;  # Shared memory zone, required for active health checks
    least_conn;
    server 192.168.1.10:5000 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:5000 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:5000 max_fails=3 fail_timeout=30s;
}
server {
    listen 80;
    server_name Cesivi.company.com;
    location / {
        proxy_pass http://Cesivi_cluster;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # Active health check (NGINX Plus only; open-source NGINX relies on
        # the passive max_fails/fail_timeout checks configured above)
        health_check interval=10s fails=3 passes=2 uri=/_vti_bin/diagnostics;
    }
}
Shared Storage for MockData¶
Option 1: NFS
# Mount NFS share
sudo mount -t nfs 192.168.1.100:/exports/mockdata /opt/Cesivi/@MockData
Option 2: Azure Files/AWS EFS
# Azure Files
sudo mount -t cifs //storageaccount.file.core.windows.net/mockdata \
/opt/Cesivi/@MockData \
-o credentials=/etc/smbcredentials
Session Management¶
For stateless operation, ensure:
- No in-memory session state
- Use distributed cache (Redis) if needed
- Enable sticky sessions on load balancer if required
Vertical Scaling¶
Increase resources:
# Update systemd service file
sudo nano /etc/systemd/system/Cesivi.service
# Increase limits in the [Service] section
# (MemoryMax= replaces the deprecated MemoryLimit= on cgroup v2 systems)
MemoryMax=4G
CPUQuota=400%
# Reload and restart
sudo systemctl daemon-reload
sudo systemctl restart Cesivi
Kubernetes Auto-Scaling¶
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cesivi-hpa  # Kubernetes object names must be lowercase
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cesivi
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
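To estimate how this autoscaler reacts to a given load, the replica count can be approximated with the scaling rule from the Kubernetes documentation: desired = ceil(current replicas × current utilization / target utilization), clamped to the configured minimum and maximum. The sketch below is a simplification that handles one metric at a time and ignores the controller's tolerance band and stabilization window; `desired_replicas` is a hypothetical helper, and its defaults of 3 and 10 mirror the manifest above.

```shell
#!/bin/bash
# desired_replicas CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL [MIN] [MAX]
# Utilization values are whole percentages. Hypothetical helper that
# approximates the HPA rule for a single metric.
desired_replicas() {
  local current="$1" util="$2" target="$3" min="${4:-3}" max="${5:-10}" desired
  # Integer ceiling of current * util / target
  desired=$(( (current * util + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}
```

For example, three replicas at 90% CPU against the 70% target give `desired_replicas 3 90 70`, which prints `4`. When several metrics are configured, as in the manifest, the autoscaler uses the largest of the per-metric results.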
Performance Optimization¶
See PERFORMANCE.md for a detailed performance optimization guide.
Quick Wins¶
- Enable Response Caching
  { "ResponseCaching": { "Enabled": true, "Duration": 300 } }
- Enable Compression
  { "ResponseCompression": { "EnableForHttps": true, "Providers": ["brotli", "gzip"] } }
- Optimize Logging
  { "Serilog": { "MinimumLevel": { "Default": "Warning" } } }
Security¶
Security Hardening Checklist¶
- [ ] Enable HTTPS only (disable HTTP)
- [ ] Implement proper authentication validation
- [ ] Use secure headers (HSTS, CSP, X-Frame-Options)
- [ ] Enable rate limiting
- [ ] Restrict CORS origins
- [ ] Use firewall rules to limit access
- [ ] Regular security updates
- [ ] Encrypt data at rest
- [ ] Implement audit logging
- [ ] Use secrets management (not hardcoded credentials)
HTTPS Configuration¶
See DEPLOYMENT_GUIDE.md for SSL/TLS setup instructions.
Audit Logging¶
Enable audit logging in appsettings.json:
{
  "Audit": {
    "Enabled": true,
    "LogPath": "logs/audit.log",
    "Events": ["Login", "Create", "Update", "Delete"]
  }
}
Operational Runbook¶
Daily Tasks¶
- [ ] Check health endpoint status
- [ ] Review error logs for anomalies
- [ ] Monitor disk space usage
- [ ] Verify backup completion
Weekly Tasks¶
- [ ] Review performance metrics
- [ ] Analyze cache hit rates
- [ ] Update dependencies if needed
- [ ] Test disaster recovery procedure
Monthly Tasks¶
- [ ] Review and archive old logs
- [ ] Perform security audit
- [ ] Review capacity planning
- [ ] Update documentation
Emergency Contacts¶
| Role | Contact | Escalation |
|---|---|---|
| On-Call Engineer | oncall@company.com | Level 1 |
| DevOps Team | devops@company.com | Level 2 |
| Platform Lead | platform-lead@company.com | Level 3 |
Support Resources¶
- Documentation: /docs
- GitHub Issues: https://github.com/yourusername/Cesivi/issues
- Internal Wiki: https://wiki.company.com/Cesivi
- Slack Channel: #Cesivi-support