Disaster Recovery

The disaster recovery specification defines durability guarantees, high availability architectures, and recovery procedures for OJS backends.

Durability Levels

Level	Description	Data Loss Risk
0	Memory-only	Jobs lost on process restart
1	Single-node persistent	Jobs survive process restart, lost on disk failure
2	Replicated persistent	Jobs survive node failure

Production deployments SHOULD use Level 2 (replicated persistent).

Active-passive: One primary node handles all traffic. A standby replica takes over if the primary fails.

Active-active: Multiple nodes handle traffic simultaneously. This requires conflict resolution for concurrent operations.

Synchronous replication: Writes are not acknowledged until they have been replicated. No acknowledged write is lost on failover, at the cost of higher write latency.

Asynchronous replication: Writes are acknowledged immediately and replicated in the background. Latency is lower, but recently written jobs may be lost on failover.
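The difference between the two acknowledgment behaviors can be sketched with a toy in-memory model. This is an illustration only: `Primary`, `Replica`, `enqueue`, and `flush` are hypothetical names, not part of OJS, and a real backend would replicate over the network.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory standby that stores replicated job IDs."""
    jobs: list = field(default_factory=list)

    def apply(self, job_id: str) -> None:
        self.jobs.append(job_id)


@dataclass
class Primary:
    """Hypothetical primary; `synchronous` selects the acknowledgment behavior."""
    replica: Replica
    synchronous: bool
    jobs: list = field(default_factory=list)
    pending: list = field(default_factory=list)  # writes not yet replicated

    def enqueue(self, job_id: str) -> str:
        self.jobs.append(job_id)
        if self.synchronous:
            # Replicate first, acknowledge afterwards.
            self.replica.apply(job_id)
        else:
            # Acknowledge now; replication happens later in flush().
            self.pending.append(job_id)
        return "ack"

    def flush(self) -> None:
        """Stand-in for the background replication loop."""
        while self.pending:
            self.replica.apply(self.pending.pop(0))


sync_primary = Primary(Replica(), synchronous=True)
sync_primary.enqueue("job-1")

async_primary = Primary(Replica(), synchronous=False)
async_primary.enqueue("job-2")
# If async_primary failed at this point, before flush(), "job-2" would exist
# only on the failed node even though the client already received "ack".
```

In this model the window between `enqueue` and `flush` is exactly the data-loss window that asynchronous replication accepts in exchange for lower latency.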

In synchronous mode, OJS requires that visibility timeouts and job state transitions be replicated before they are acknowledged.

When failover occurs:

In-flight jobs on the failed node are recovered via heartbeat timeout.
Clients reconnect to the new primary (via DNS, load balancer, or service discovery).
Jobs in active state on the failed node transition to available after visibility timeout.
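The last step, returning active jobs to the available state once their visibility timeout has passed, might look like the following minimal sketch. The `Job` fields and the `reclaim_expired` function are assumptions made for illustration; a real backend would run this check against its own storage and clock.

```python
from dataclasses import dataclass


@dataclass
class Job:
    id: str
    state: str               # "available" or "active"
    visible_at: float = 0.0  # time after which an active job may be reclaimed


def reclaim_expired(jobs: list, now: float) -> list:
    """Move active jobs whose visibility timeout has passed back to available.

    Returns the IDs of the jobs that were reclaimed.
    """
    reclaimed = []
    for job in jobs:
        if job.state == "active" and job.visible_at <= now:
            job.state = "available"
            reclaimed.append(job.id)
    return reclaimed


jobs = [
    Job("a", "active", visible_at=100.0),  # its worker was lost with the failed node
    Job("b", "active", visible_at=500.0),  # timeout not reached yet
    Job("c", "available"),
]
recovered = reclaim_expired(jobs, now=130.0)
```

Here only job `a` is reclaimed; job `b` stays active until its own timeout expires, so a worker that is still healthy is not interrupted.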

Backends MUST implement split-brain prevention to avoid dual-primary scenarios:

Fencing tokens: Each primary holds a monotonically increasing token. Stale primaries are fenced out.
Distributed locking: Consensus-based leader election (e.g., the Raft algorithm, or a coordination service such as etcd or ZooKeeper).
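As a rough sketch of the fencing-token idea, the storage layer can remember the highest token it has seen and reject any write that carries a lower one. `FencedStore` and `StaleTokenError` are hypothetical names used only for this example, and how tokens are issued (for instance by the leader election service) is assumed rather than shown.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class FencedStore:
    """Hypothetical storage layer that rejects writes from stale primaries."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.writes = []

    def write(self, token: int, payload: str) -> None:
        if token < self.highest_token:
            raise StaleTokenError(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.writes.append(payload)


store = FencedStore()
store.write(1, "from old primary")  # primary elected with token 1
store.write(2, "from new primary")  # after failover, the new primary holds token 2

try:
    # The old primary wakes up and still believes it is the leader.
    store.write(1, "late write from old primary")
    fenced = False
except StaleTokenError:
    fenced = True
```

The check lives in the store rather than in the primaries, because a stale primary cannot be trusted to know that it has been replaced.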

Backends SHOULD support restoring to a specific point in time using:

Backends SHOULD support exporting jobs in a portable OJS JSON format for cross-backend migration.

When a backend is partially available:

Mode	Behavior
Read-only	Accept queries but reject new jobs
Queue-level degradation	Some queues available, others unavailable
Circuit breaker	Stop accepting traffic, return 503
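One way to read the table is as an admission check applied to each incoming request. The sketch below is an assumption-laden illustration: the `Mode` enum, the `admit` function, and the operation names are hypothetical, and the table only specifies a 503 for the circuit breaker, so the 503 used for the other rejections is a guess.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    READ_ONLY = "read-only"
    QUEUE_DEGRADED = "queue-level degradation"
    CIRCUIT_OPEN = "circuit breaker"


def admit(mode, operation, queue, unavailable_queues=frozenset()):
    """Return an HTTP-style status code for a request under the given mode."""
    if mode is Mode.CIRCUIT_OPEN:
        return 503  # stop accepting traffic entirely
    if mode is Mode.READ_ONLY and operation == "enqueue":
        return 503  # assumed status; the table only says new jobs are rejected
    if mode is Mode.QUEUE_DEGRADED and queue in unavailable_queues:
        return 503  # assumed status for an unavailable queue
    return 200


down = frozenset({"emails"})
read_only_query = admit(Mode.READ_ONLY, "query", "emails")
read_only_enqueue = admit(Mode.READ_ONLY, "enqueue", "emails")
degraded_bad_queue = admit(Mode.QUEUE_DEGRADED, "enqueue", "emails", down)
degraded_good_queue = admit(Mode.QUEUE_DEGRADED, "enqueue", "reports", down)
circuit_open = admit(Mode.CIRCUIT_OPEN, "query", "reports")
```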

Metric	Description
`ojs.replication.lag`	Replication delay in milliseconds
`ojs.failover.count`	Number of failover events
`ojs.backup.age`	Time since last backup
`ojs.backup.size`	Size of last backup

Event	Description
`system.failover.started`	Failover initiated
`system.failover.completed`	New primary active
`system.replication.lag_warning`	Replication lag exceeds threshold
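The relationship between the `ojs.replication.lag` metric and the `system.replication.lag_warning` event can be sketched as a simple threshold check. The function name, the `emit` callback, the payload keys, and the 500 ms threshold are all assumptions for illustration; the source does not specify a threshold or an event payload.

```python
def check_replication_lag(lag_ms, threshold_ms, emit):
    """Emit a lag warning event when the observed lag exceeds the threshold.

    Returns True if a warning was emitted.
    """
    if lag_ms > threshold_ms:
        emit(
            "system.replication.lag_warning",
            {"ojs.replication.lag": lag_ms, "threshold_ms": threshold_ms},
        )
        return True
    return False


events = []


def record(name, data):
    """Hypothetical event sink that collects events in a list."""
    events.append((name, data))


check_replication_lag(120.0, 500.0, record)  # below threshold: no event
check_replication_lag(900.0, 500.0, record)  # above threshold: one warning
```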