Archive Cluster Deployment Guide

Introduced: v1.2 (PLAN-1624)

This guide covers deploying Cesivi in two-node cluster mode for high-availability archive workloads. Both nodes share a single FileSystem DataRoot over a network path; leader election and the SignalR backplane run through a shared Garnet instance.

When to use cluster mode

Use a two-node cluster when you need:

- High availability: if one node is patched or fails, the second continues serving reads and the leader-only background services fail over automatically.
- Regulatory durability: a second node with a separate process boundary satisfies some compliance frameworks that require active/standby operation (HIPAA §164.308(a)(7), ISO 27001 A.17.1).
- Throughput relief: two nodes share inbound REST/CSOM request load while reading from the same archive store.

Not required for: single-tenant pilot deployments, development environments, or archives with fewer than ~1 M items where a single node is sufficient.

Architecture overview

                    ┌──────────────────────────────┐
                    │      Shared FileSystem        │
                    │  DataRootPath (SMB / NFS /    │
                    │  DFS-N; NOT replicated NTFS)  │
                    └──────────┬───────────────────┘
                               │ read/write
                 ┌─────────────┼─────────────┐
                 │             │             │
          ┌──────▼──────┐  ┌──▼──────────┐  │
          │  Node A     │  │  Node B     │  │
          │  (leader*)  │  │  (follower) │  │
          │  port 5100  │  │  port 5101  │  │
          └──────┬──────┘  └──────┬──────┘  │
                 │                │         │
                 └───────┬────────┘         │
                         │                  │
                 ┌───────▼────────┐         │
                 │  Garnet         │◄────────┘
                 │  (leader elect  │  SignalR
                 │  + SignalR bus) │  backplane
                 │  port 6380      │
                 └────────────────┘

* Leader role is dynamic — any node can hold it.

All FileSystem-backed stores (farm registry, identity snapshots, audit WORM log, integrity records, retention records, legal holds) write through the shared DataRootPath. Both nodes see every change immediately with no replication delay.

Prerequisites

Component	Requirement
Shared network path	SMB share (`\\fileserver\cesivi-data`), NFS mount, or DFS namespace. Do not use replicated NTFS (DFS-R, Robocopy sync, etc.): each replica is a separate copy, so the file lock that keeps the WORM log single-writer is not shared between the nodes.
Garnet	Single Garnet instance (or Redis-compatible server) reachable from both nodes. Default port 6380.
Cesivi.exe	Same version on both nodes.
.NET runtime	.NET 10.0 (both nodes).
Network	Each node's health endpoint (`/healthz`) must be reachable from the load balancer and from the other node.
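
The two infrastructure prerequisites above (a writable shared path and a reachable Garnet port) can be checked from each node before Cesivi is started. The following is a minimal pre-flight sketch, not part of the product: the function names `check_writable` and `check_tcp` are hypothetical, and the script only proves that a temporary file can be created and that a TCP connection opens. It says nothing about file-locking behaviour on the share.

```python
import os
import socket
import tempfile


def check_writable(path: str) -> bool:
    """Return True if a temporary file can be created (and removed) under path."""
    try:
        # The file is deleted automatically when the context manager exits.
        with tempfile.NamedTemporaryFile(dir=path, prefix=".cesivi-preflight-"):
            pass
        return True
    except OSError:
        # Covers a missing path, a disconnected mount, and permission errors.
        return False


def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run on each node, `check_writable(r"\\fileserver\cesivi-data")` and `check_tcp("fileserver", 6380)` would exercise the example share and the default Garnet port from the table above; substitute your own host names.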

Configuration

`appsettings.json` (identical on both nodes)

{
  "Cesivi": {
    "DataRootPath": "\\\\fileserver\\cesivi-data",
    "Cluster": {
      "Enabled": true,
      "GarnetConnectionString": "fileserver:6380"
    },
    "SignalR": {
      "UseBackplane": true
    },
    "Audit": {
      "ChainVerification": {
        "IntervalHours": 6
      },
      "Reaper": {
        "IntervalHours": 24,
        "RequireHoldCheckBeforeReap": true
      }
    }
  }
}

Key configuration options

Key	Default	Notes
`Cesivi:DataRootPath`	`R:/MockData`	Must resolve to the same physical directory on every node. Use a UNC path or a consistently mounted drive letter.
`Cesivi:Cluster:Enabled`	`false`	Set `true` on all nodes. Activates leader election via Garnet.
`Cesivi:Cluster:GarnetConnectionString`	(none)	`host:port` of the shared Garnet instance. Both nodes must point to the same instance.
`Cesivi:SignalR:UseBackplane`	`false`	Set `true` to fan SignalR hub messages across nodes via Garnet. Required for cross-node real-time notifications (integrity walk progress, change notifications).
`Cesivi:Audit:ChainVerification:IntervalHours`	`6`	Interval for `WormChainVerificationService`. Leader-only.
`Cesivi:Audit:Reaper:IntervalHours`	`24`	Interval for `WormSegmentReaper`. Leader-only.
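
Because the cluster-related keys must agree on both nodes, it can help to compare the two `appsettings.json` files before rollout. The sketch below is an illustrative helper, not a Cesivi tool: `lookup` and `validate` are hypothetical names, and the only rules it encodes are the ones stated in the table above (cluster mode enabled everywhere, a `host:port` Garnet connection string, and matching shared values).

```python
import re

# Keys that this guide says must be identical on every node.
MUST_MATCH = [
    "Cesivi:DataRootPath",
    "Cesivi:Cluster:GarnetConnectionString",
    "Cesivi:SignalR:UseBackplane",
]


def lookup(cfg: dict, key: str):
    """Resolve a colon-separated key against nested JSON; None if absent."""
    node = cfg
    for part in key.split(":"):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def validate(node_a: dict, node_b: dict) -> list:
    """Return a list of problems; an empty list means the pair looks consistent."""
    problems = []
    for name, cfg in (("A", node_a), ("B", node_b)):
        if lookup(cfg, "Cesivi:Cluster:Enabled") is not True:
            problems.append(f"node {name}: Cesivi:Cluster:Enabled should be True")
        conn = lookup(cfg, "Cesivi:Cluster:GarnetConnectionString")
        if not isinstance(conn, str) or not re.fullmatch(r"[^:\s]+:\d+", conn):
            problems.append(f"node {name}: GarnetConnectionString is not host:port")
    for key in MUST_MATCH:
        if lookup(node_a, key) != lookup(node_b, key):
            problems.append(f"{key} differs between nodes")
    return problems
```

Load each file with `json.load` and pass the two dictionaries to `validate`. Note that the comparison is textual: two different drive letters that point at the same share would be reported as a mismatch, which is the conservative outcome.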

Leader-only background services

The following services run only on the leader node. When the leader is killed or steps down, the new leader picks them up automatically within one leader-election cycle (typically < 30 s):

Service	What it does	Config key
`WormChainVerificationService`	Walks the WORM hash-chain for each farm; emits `chain_verified` or `chain_break` audit events	`Cesivi:Audit:ChainVerification:IntervalHours`
`WormSegmentReaper`	Purges WORM log segments that have passed their retention window and have no active legal hold	`Cesivi:Audit:Reaper:IntervalHours`
`IntegrityVerificationService` (scheduled pass)	Runs the SHA-256 integrity sample-pass on a timer	`Cesivi:Integrity:SamplePassIntervalHours`
ACL recalculation background jobs	Re-evaluates inherited permissions after group or policy changes	(internal)

REST-triggered operations (e.g. `POST /_api/archive/integrity/sites/{id}/walks/run`) run on whichever node receives the request; they do not require leader status.
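
To make the chain walk performed by `WormChainVerificationService` more concrete, here is a generic hash-chain sketch. It is an assumption-laden illustration, not the Cesivi record format: the field names (`prev`, `payload`, `hash`), the all-zero genesis value, and the use of SHA-256 for the chain are all hypothetical. Only the two outcome names, `chain_verified` and `chain_break`, come from the table above.

```python
import hashlib

GENESIS = "0" * 64  # assumed "previous hash" for the first record in a chain


def record_hash(prev_hash: str, payload: bytes) -> str:
    """SHA-256 over the previous record's hash followed by this record's payload."""
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append(chain: list, payload: bytes) -> None:
    """Append a record that commits to the hash of the record before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev, "payload": payload, "hash": record_hash(prev, payload)})


def verify(chain: list):
    """Walk the chain; return ("chain_verified", None) or ("chain_break", index)."""
    prev = GENESIS
    for i, rec in enumerate(chain):
        # A record is valid only if it links to its predecessor and its own
        # hash still matches its payload.
        if rec["prev"] != prev or rec["hash"] != record_hash(prev, rec["payload"]):
            return ("chain_break", i)
        prev = rec["hash"]
    return ("chain_verified", None)
```

Because every record commits to its predecessor, altering any payload invalidates that record's hash, so the walk reports the first index at which the chain no longer holds.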

Health and observability

Endpoints

Endpoint	What to check
`GET /healthz`	Returns `200 OK` when the node is up. Use as load-balancer probe.
`GET /_api/archive/dashboard`	7-section KPI dashboard. FS-backed KPIs (`sites_archived`, `identity.snapshots`, `legal_hold.active_holds`, `retention.items_under_retention`) are identical on both nodes. `importer.items_total` is per-node (in-memory ring buffer).
`GET /_api/archive/audit-events?farmId=X&eventType=WormConfigChanged`	Shows WORM chain verifier and reaper events from the shared log.

Observability checklist

[ ] Both `/healthz` endpoints return 200.
[ ] `archive_mode.sites_archived` on Node A equals Node B (cross-node FS consistency check).
[ ] `legal_hold.active_holds` matches on both nodes.
[ ] No `chain_break` events in `audit-events?eventType=WormConfigChanged` for any farm.
[ ] Garnet: monitor memory usage and replication lag (if running a Garnet cluster).
[ ] FileSystem mount: check that `DataRootPath` is accessible and writable from both nodes (`Test-Path` / `df -h`).
[ ] SignalR backplane: confirm `WalkProgress` events arrive on the non-triggering node when an integrity walk is started on the other.
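
The first checklist item can be automated with a small probe. This is a sketch under assumptions: the node base URLs are placeholders (only ports 5100 and 5101 come from the architecture diagram), the function names are hypothetical, and the probe treats anything other than HTTP 200 as unhealthy.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET, or 0 if the node is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with a 4xx/5xx status
    except OSError:
        return 0  # URLError, timeouts, and socket errors all derive from OSError


def probe(nodes: dict, fetch=http_status) -> dict:
    """Map node name -> True when GET <base>/healthz returns 200."""
    return {
        name: fetch(base.rstrip("/") + "/healthz") == 200
        for name, base in nodes.items()
    }
```

A call such as `probe({"A": "http://node-a:5100", "B": "http://node-b:5101"})` returns one boolean per node. The `fetch` parameter exists so the probe can be exercised without a live cluster.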

Operator runbook

Graceful node drain (for patching)

1. Remove the node from the load-balancer rotation.
2. Wait for in-flight requests to drain (check that `/healthz` stops receiving traffic).
3. Stop the Cesivi.exe process: `Stop-Process -Name Cesivi` (Windows) or `systemctl stop cesivi` (Linux).
4. If the drained node was the leader, the remaining node is elected leader through Garnet within ~30 s. Verify that the remaining node now serves `WormConfigChanged` events.
5. Apply the patch or update.
6. Start Cesivi.exe on the patched node. It rejoins as a follower.
7. Re-add the node to the load-balancer rotation.

Kill the leader (emergency)

If the leader process becomes unresponsive:

1. Kill the process: `taskkill /F /IM Cesivi.exe` (Windows) or `kill -9 <pid>` (Linux).
2. The follower detects the Garnet lease expiry and becomes leader within one lease timeout (default 30 s).
3. Background services (chain verifier, reaper) resume on the new leader at their next scheduled interval.
4. REST-triggered operations do not require leader status, so they continue immediately on the surviving node.
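
The failover timing above follows from how an expiring lease behaves. This guide does not document Cesivi's actual lease protocol, so the following is only a generic model of the pattern: `LeaseStore` is a hypothetical in-memory stand-in for the single expiring key that a Redis-compatible store such as Garnet would hold, and time is passed in explicitly so the behaviour is deterministic.

```python
LEASE_TTL = 30.0  # seconds; mirrors the default lease timeout quoted above


class LeaseStore:
    """In-memory stand-in for one expiring key in a Redis-compatible store."""

    def __init__(self):
        self.owner = None
        self.expires_at = 0.0

    def try_acquire(self, node: str, now: float, ttl: float) -> bool:
        """Take the lease if it is free or expired, or renew it if already ours."""
        if self.owner is None or now >= self.expires_at or self.owner == node:
            self.owner = node
            self.expires_at = now + ttl
            return True
        return False


def replay(heartbeats, ttl=LEASE_TTL):
    """Replay (time, node) acquire attempts and record who leads after each one."""
    store = LeaseStore()
    timeline = []
    for t, node in heartbeats:
        store.try_acquire(node, t, ttl)
        timeline.append((t, store.owner))
    return timeline
```

In this model, if node A last renews at t = 20 and is then killed, node B's attempts fail until the lease expires at t = 50. The follower therefore takes over at most one lease timeout after the leader's final renewal, which is the bound stated in step 2.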

Verify data integrity after failover

GET /_api/archive/dashboard

Compare sites_archived, identity.snapshots, legal_hold.active_holds, and retention.items_under_retention across both nodes. If the counts diverge, check filesystem mount health — a disconnected mount is the most common cause.
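
The comparison can be scripted once both dashboard responses have been fetched and parsed as JSON. The response shape is not specified in this guide, so the sketch below assumes each KPI is reachable either as a flat key or as a dotted path through nested objects; `get_path` and `diverging` are hypothetical helper names.

```python
# FS-backed KPIs that this guide says should match on both nodes.
SHARED_KPIS = [
    "sites_archived",
    "identity.snapshots",
    "legal_hold.active_holds",
    "retention.items_under_retention",
]


def get_path(doc: dict, dotted: str):
    """Look up a KPI as a flat key first, then as a dotted path through nested dicts."""
    if dotted in doc:
        return doc[dotted]
    node = doc
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def diverging(a: dict, b: dict, keys=SHARED_KPIS) -> dict:
    """Return {kpi: (value_on_a, value_on_b)} for every shared KPI that differs."""
    out = {}
    for key in keys:
        va, vb = get_path(a, key), get_path(b, key)
        if va != vb:
            out[key] = (va, vb)
    return out
```

An empty result means the shared counters agree. `importer.items_total` is deliberately left out of `SHARED_KPIS` because it is per-node and expected to differ.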

Adding a third node

Not yet supported. The current leader-election model supports exactly two nodes, and the Garnet-based lease protocol does not handle split-brain among three concurrent nodes, so do not add a third node in v1.2. A three-node Raft variant is on the v1.3 roadmap.

Known limitations (v1.2)

Limitation	Notes
Max two nodes	See "Adding a third node" above.
Single Garnet instance	No Garnet HA in v1.2. If Garnet becomes unavailable, leader-only background services stop but REST endpoints continue serving from the shared FS.
WORM log is single-writer	Both nodes may read the WORM log concurrently, but only one should write (ensured by file-lock on the active segment). Do not place `DataRootPath` on a path where concurrent writes from two OS-level processes can bypass file locking (e.g. some NAS appliances with opportunistic locking disabled).
`importer.items_total` is per-node	This dashboard counter reflects only events emitted by the local `AuditEventSink`. It is intentionally not shared — use the `total` field from `GET /_api/archive/audit-events` for cross-node totals.

Document Version: 1.0
Last Updated: 2026-05-28 (PLAN-1624, v1.2 Cluster-Mode Archive Validation)

See also: Archive Mode

See also: Archive Audit Log — WORM Substrate

See also: Archive Integrity Verification

See also: Archive Legal Hold

See also: Archive Retention Enforcement

See also: Archive Admin Bundle — ControlCenter Quick Tour

See also: Archive Tools Operator Guide

See also: Tutorial G — SharePoint On-Premises Retirement Archive