Archive Cluster Deployment Guide

Introduced: v1.2 (PLAN-1624)

This guide covers deploying Cesivi in two-node cluster mode for high-availability archive workloads. Both nodes share a single FileSystem DataRoot over a network path; leader election and the SignalR backplane run through a shared Garnet instance.


When to use cluster mode

Use a two-node cluster when you need:

  • High availability — if one node is patched or fails, the second continues serving reads and the leader-only background services fail over automatically.
  • Regulatory durability — a second node with a separate process boundary satisfies some compliance frameworks that require active/standby operation (HIPAA §164.308(a)(7), ISO 27001 A.17.1).
  • Throughput relief — two nodes share inbound REST/CSOM request load while reading from the same archive store.

Not required for: single-tenant pilot deployments, development environments, or archives with fewer than ~1 M items where a single node is sufficient.


Architecture overview

                    ┌──────────────────────────────┐
                    │      Shared FileSystem        │
                    │  DataRootPath (SMB / NFS /    │
                    │  DFS-N; NOT replicated NTFS)  │
                    └──────────┬───────────────────┘
                               │ read/write
                 ┌─────────────┼─────────────┐
                 │             │             │
          ┌──────▼──────┐  ┌──▼──────────┐  │
          │  Node A     │  │  Node B     │  │
          │  (leader*)  │  │  (follower) │  │
          │  port 5100  │  │  port 5101  │  │
          └──────┬──────┘  └──────┬──────┘  │
                 │                │         │
                 └───────┬────────┘         │
                         │                  │
                 ┌───────▼────────┐         │
                 │  Garnet         │◄────────┘
                 │  (leader elect  │  SignalR
                 │  + SignalR bus) │  backplane
                 │  port 6380      │
                 └────────────────┘

* Leader role is dynamic — any node can hold it.

All FileSystem-backed stores (farm registry, identity snapshots, audit WORM log, integrity records, retention records, legal holds) write through the shared DataRootPath. Because both nodes read and write the same directory, a change made by one node is visible to the other without any replication step.


Prerequisites

| Component | Requirement |
| --- | --- |
| Shared network path | SMB share (\\fileserver\cesivi-data), NFS mount, or DFS namespace. NOT replicated NTFS (DFS-R, Robocopy sync, etc.): only one writer at a time is safe for the WORM log. |
| Garnet | Single Garnet instance (or Redis-compatible server) reachable from both nodes. Default port 6380. |
| Cesivi.exe | Same version on both nodes. |
| .NET runtime | .NET 10.0 (both nodes). |
| Network | Both nodes must reach each other's health endpoints for load-balancer health checks. |

Configuration

appsettings.json (identical on both nodes)

{
  "Cesivi": {
    "DataRootPath": "\\\\fileserver\\cesivi-data",
    "Cluster": {
      "Enabled": true,
      "GarnetConnectionString": "fileserver:6380"
    },
    "SignalR": {
      "UseBackplane": true
    },
    "Audit": {
      "ChainVerification": {
        "IntervalHours": 6
      },
      "Reaper": {
        "IntervalHours": 24,
        "RequireHoldCheckBeforeReap": true
      }
    }
  }
}
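
Because the same appsettings.json must be deployed to both nodes, a small pre-flight check can catch a missing or mistyped cluster key before startup. The following is a minimal Python sketch, not part of Cesivi: the function names `check_cluster_config` and `configs_match` are hypothetical, and the host:port rule simply mirrors the format shown above.

```python
import json


def check_cluster_config(settings: dict) -> list[str]:
    """Return a list of problems found in a parsed appsettings.json (empty if none)."""
    cesivi = settings.get("Cesivi", {})
    cluster = cesivi.get("Cluster", {})
    problems = []

    if not cesivi.get("DataRootPath"):
        problems.append("Cesivi:DataRootPath is not set")
    if cluster.get("Enabled") is not True:
        problems.append("Cesivi:Cluster:Enabled must be true on all nodes")

    # Assumption: the connection string is a plain host:port pair, as in this guide.
    host, sep, port = cluster.get("GarnetConnectionString", "").rpartition(":")
    if not (sep and host and port.isdigit()):
        problems.append("Cesivi:Cluster:GarnetConnectionString must be host:port")

    if cesivi.get("SignalR", {}).get("UseBackplane") is not True:
        problems.append("Cesivi:SignalR:UseBackplane must be true for cross-node notifications")
    return problems


def configs_match(node_a: dict, node_b: dict) -> bool:
    """True when both nodes carry an identical Cesivi section."""
    return node_a.get("Cesivi") == node_b.get("Cesivi")


if __name__ == "__main__":
    with open("appsettings.json", encoding="utf-8") as fh:
        for problem in check_cluster_config(json.load(fh)):
            print("WARN:", problem)
```

Running the script in the directory that holds appsettings.json prints one warning per problem and nothing when the cluster keys look consistent.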

Key configuration options

| Key | Default | Notes |
| --- | --- | --- |
| Cesivi:DataRootPath | R:/MockData | Must resolve to the same physical directory on every node. Use a UNC path or a consistently mounted drive letter. |
| Cesivi:Cluster:Enabled | false | Set true on all nodes. Activates leader election via Garnet. |
| Cesivi:Cluster:GarnetConnectionString | (none) | host:port of the shared Garnet instance. Both nodes must point to the same instance. |
| Cesivi:SignalR:UseBackplane | false | Set true to fan SignalR hub messages across nodes via Garnet. Required for cross-node real-time notifications (integrity walk progress, change notifications). |
| Cesivi:Audit:ChainVerification:IntervalHours | 6 | Interval for WormChainVerificationService. Leader-only. |
| Cesivi:Audit:Reaper:IntervalHours | 24 | Interval for WormSegmentReaper. Leader-only. |

Leader-only background services

The following services run only on the leader node. When the leader is killed or steps down, the new leader picks them up automatically within one leader-election cycle (typically < 30 s):

| Service | What it does | Config key |
| --- | --- | --- |
| WormChainVerificationService | Walks the WORM hash-chain for each farm; emits chain_verified or chain_break audit events | Cesivi:Audit:ChainVerification:IntervalHours |
| WormSegmentReaper | Purges WORM log segments that have passed their retention window and have no active legal hold | Cesivi:Audit:Reaper:IntervalHours |
| IntegrityVerificationService (scheduled pass) | Runs the SHA-256 integrity sample-pass on a timer | Cesivi:Integrity:SamplePassIntervalHours |
| ACL recalculation background jobs | Re-evaluates inherited permissions after group or policy changes | (internal) |

REST-triggered operations (e.g. POST /_api/archive/integrity/sites/{id}/walks/run) run on whichever node receives the request — they do not require leader status.
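
To make the chain verifier's job concrete, here is a generic hash-chain walk in Python. It is only an illustration of the technique: this guide does not describe the WORM record layout, so the record fields (`prev_hash`, `payload`, `hash`), the `GENESIS` value, and the choice of SHA-256 for chaining are all assumptions. Only the two result names, chain_verified and chain_break, come from the table above.

```python
import hashlib
from typing import Optional

GENESIS = "0" * 64  # hypothetical "previous hash" for the first record


def record_hash(prev_hash: str, payload: bytes) -> str:
    """Hash a record together with its predecessor's hash."""
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append(chain: list, payload: bytes) -> None:
    """Append a record that links back to the current tail of the chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev_hash": prev, "payload": payload,
                  "hash": record_hash(prev, payload)})


def verify_chain(chain: list) -> tuple:
    """Return ("chain_verified", None) or ("chain_break", index of first bad record)."""
    prev = GENESIS
    for index, rec in enumerate(chain):
        linked = rec["prev_hash"] == prev
        intact = rec["hash"] == record_hash(prev, rec["payload"])
        if not (linked and intact):
            return ("chain_break", index)
        prev = rec["hash"]
    return ("chain_verified", None)


chain: list = []
for event in (b"site archived", b"hold placed", b"hold released"):
    append(chain, event)
print(verify_chain(chain))
```

Modifying the payload of any record after it was appended makes `verify_chain` report chain_break at that record's index, because the stored hash no longer matches the recomputed one.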


Health and observability

Endpoints

| Endpoint | What to check |
| --- | --- |
| GET /healthz | Returns 200 OK when the node is up. Use as load-balancer probe. |
| GET /_api/archive/dashboard | 7-section KPI dashboard. FS-backed KPIs (sites_archived, identity.snapshots, legal_hold.active_holds, retention.items_under_retention) are identical on both nodes. importer.items_total is per-node (in-memory ring buffer). |
| GET /_api/archive/audit-events?farmId=X&eventType=WormConfigChanged | Shows WORM chain verifier and reaper events from the shared log. |

Observability checklist

  • [ ] Both /healthz endpoints return 200.
  • [ ] archive_mode.sites_archived on Node A equals Node B (cross-node FS consistency check).
  • [ ] legal_hold.active_holds matches on both nodes.
  • [ ] No chain_break events in audit-events?eventType=WormConfigChanged for any farm.
  • [ ] Garnet: monitor memory usage and replication lag (if running Garnet cluster).
  • [ ] FileSystem mount: check that DataRootPath is accessible and writable from both nodes (Test-Path / df -h).
  • [ ] SignalR backplane: confirm WalkProgress events arrive on the non-triggering node when an integrity walk is started on the other.
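
For the FileSystem mount item in the checklist, a write-and-read-back probe is a slightly stronger test than Test-Path or df -h, since it exercises write access as well as visibility. A minimal Python sketch follows; the probe file name is hypothetical, and you should confirm that writing a temporary file at the top of DataRootPath is acceptable in your environment before running it on each node.

```python
import uuid
from pathlib import Path


def probe_data_root(data_root: str) -> bool:
    """Write, read back, and delete a small probe file directly under data_root."""
    root = Path(data_root)
    if not root.is_dir():
        return False

    probe = root / f".cesivi-probe-{uuid.uuid4().hex}"  # hypothetical file name
    token = uuid.uuid4().hex
    try:
        probe.write_text(token, encoding="utf-8")
        return probe.read_text(encoding="utf-8") == token
    except OSError:
        return False
    finally:
        # Best-effort cleanup; the probe may not exist if the write failed.
        try:
            probe.unlink()
        except OSError:
            pass
```

Calling `probe_data_root(r"\\fileserver\cesivi-data")` on each node returns True only when the path exists and the node can create, read, and remove a file there.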

Operator runbook

Graceful node drain (for patching)

  1. Remove the node from the load-balancer rotation.
  2. Wait for in-flight requests to drain (confirm that the node, including its /healthz endpoint, no longer receives traffic).
  3. Stop the Cesivi.exe process: Stop-Process -Name Cesivi (Windows) or systemctl stop cesivi (Linux).
  4. If the drained node was the leader, Garnet leader election elects the remaining node within ~30 s. Verify the remaining node now serves WormConfigChanged events.
  5. Apply the patch / update.
  6. Start Cesivi.exe on the patched node. It rejoins as follower.
  7. Re-add to load-balancer rotation.
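
Steps 6 and 7 can be gated on the node actually answering its health probe. The sketch below polls GET /healthz until it returns 200 or a timeout passes. It is an illustration only: the helper names are hypothetical, and the base URL in the usage note assumes the example port from the architecture diagram.

```python
import time
import urllib.request


def healthz_ok(base_url: str, timeout: float = 2.0) -> bool:
    """True when GET <base_url>/healthz answers with HTTP 200."""
    url = base_url.rstrip("/") + "/healthz"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


def wait_until(check, timeout: float, interval: float = 1.0,
               clock=time.monotonic, sleep=time.sleep) -> bool:
    """Call check() every `interval` seconds until it is true or `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

For example, `wait_until(lambda: healthz_ok("http://node-b:5101"), timeout=60)` returns True as soon as the restarted node is up, where `node-b` is a placeholder host name. The `clock` and `sleep` parameters exist so the loop can be exercised without real delays.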

Kill the leader (emergency)

If the leader process becomes unresponsive:

  1. Kill the process: taskkill /F /IM Cesivi.exe (Windows) or kill -9 <pid> (Linux).
  2. The follower detects the Garnet lease expiry and becomes leader within one lease timeout (default 30 s).
  3. Background services (chain verifier, reaper) resume on the new leader at their next scheduled interval.
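  4. REST-triggered operations are unaffected by the leadership change: they run on whichever node receives the request, so the surviving node keeps serving them immediately.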
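
The failover timing in step 2 follows from how a lease works: the leader must keep renewing a key that expires on its own, so a killed leader loses the role once its last renewal runs out. The guide does not state which Garnet commands Cesivi uses, so the Python sketch below models the idea with an in-memory stand-in and an explicit clock. On a Redis-compatible server the same pattern is commonly built on `SET key value NX PX ttl`. The class and node names are hypothetical.

```python
from typing import Optional

LEASE_TTL = 30.0  # seconds; mirrors the default lease timeout quoted above


class LeaseStore:
    """In-memory stand-in for a single lease key with a time-to-live."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def try_acquire(self, node: str, now: float, ttl: float = LEASE_TTL) -> bool:
        """Take the lease if free or expired, or renew it if `node` already holds it."""
        free = self.holder is None or now >= self.expires_at
        if free or self.holder == node:
            self.holder = node
            self.expires_at = now + ttl
            return True
        return False


store = LeaseStore()
assert store.try_acquire("node-a", now=0.0)       # A becomes leader
assert not store.try_acquire("node-b", now=10.0)  # B stays follower
assert store.try_acquire("node-a", now=20.0)      # A renews; lease now ends at t=50
# A is killed right after this renewal; B keeps polling.
assert not store.try_acquire("node-b", now=45.0)  # lease still valid
assert store.try_acquire("node-b", now=50.0)      # lease expired, B takes over
```

In this model the worst-case gap without a leader is one full lease timeout, measured from the old leader's last successful renewal.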

Verify data integrity after failover

GET /_api/archive/dashboard

Compare sites_archived, identity.snapshots, legal_hold.active_holds, and retention.items_under_retention across both nodes. If the counts diverge, check filesystem mount health — a disconnected mount is the most common cause.
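
The comparison can be scripted once the dashboard JSON from each node has been fetched. The sketch below assumes the response is a nested JSON object in which a dotted KPI name such as legal_hold.active_holds maps to `{"legal_hold": {"active_holds": ...}}`, and that sites_archived lives under archive_mode as written in the observability checklist. Both are assumptions about the response shape; adjust `SHARED_KPIS` to match the real payload.

```python
from typing import Any

# FS-backed KPIs expected to match across nodes (paths are assumed, see above).
SHARED_KPIS = (
    "archive_mode.sites_archived",
    "identity.snapshots",
    "legal_hold.active_holds",
    "retention.items_under_retention",
)


def lookup(doc: dict, dotted: str) -> Any:
    """Follow a dotted path through nested dicts; return None if any part is missing."""
    current: Any = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def diverging_kpis(node_a: dict, node_b: dict, keys=SHARED_KPIS) -> dict:
    """Map each KPI whose value differs between the nodes to its (a, b) pair."""
    diffs = {}
    for key in keys:
        a, b = lookup(node_a, key), lookup(node_b, key)
        if a != b:
            diffs[key] = (a, b)
    return diffs
```

An empty result means the shared counters agree. importer.items_total is deliberately left out because it is per-node. Note that a KPI missing from both responses compares as equal, so check the paths against a real response first.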

Adding a third node

Not yet supported. The current leader-election model supports exactly two nodes. A three-node Raft variant is on the v1.3 roadmap. Do not add a third node in v1.2: the Garnet-based lease protocol does not handle split-brain between three concurrent nodes.


Known limitations (v1.2)

| Limitation | Notes |
| --- | --- |
| Max two nodes | See above. |
| Single Garnet instance | No Garnet HA in v1.2. If Garnet becomes unavailable, leader-only background services stop but REST endpoints continue serving from the shared FS. |
| WORM log is single-writer | Both nodes may read the WORM log concurrently, but only one should write (ensured by file-lock on the active segment). Do not place DataRootPath on a path where concurrent writes from two OS-level processes can bypass file locking (e.g. some NAS appliances with opportunistic locking disabled). |
| importer.items_total is per-node | This dashboard counter reflects only events emitted by the local AuditEventSink. It is intentionally not shared; use the total field from GET /_api/archive/audit-events for cross-node totals. |

Document Version: 1.0
Last Updated: 2026-05-28 (PLAN-1624, v1.2 Cluster-Mode Archive Validation)


See also: Archive Mode

See also: Archive Audit Log — WORM Substrate

See also: Archive Integrity Verification

See also: Archive Legal Hold

See also: Archive Retention Enforcement

See also: Archive Admin Bundle — ControlCenter Quick Tour

See also: Archive Tools Operator Guide

See also: Tutorial G — SharePoint On-Premises Retirement Archive