Disaster Recovery

The disaster recovery specification defines durability guarantees, high availability architectures, and recovery procedures for OJS backends.

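| Level | Description            | Data Loss Risk                                     |
|-------|------------------------|----------------------------------------------------|
| 0     | Memory-only            | Jobs lost on process restart                       |
| 1     | Single-node persistent | Jobs survive process restart, lost on disk failure |
| 2     | Replicated persistent  | Jobs survive node failure                          |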

Production deployments SHOULD use Level 2 (replicated persistent).

In an active-passive architecture, one primary node handles all traffic, and a standby replica takes over if the primary fails.

  • RPO: Depends on replication lag (synchronous = 0, asynchronous = seconds)
  • RTO: Failover detection + promotion time (typically 10–30 seconds)
  • Complexity: Low
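
The RPO and RTO bullets above reduce to simple arithmetic. The following is a minimal sketch under stated assumptions: the configuration fields (`heartbeat_interval_s`, `missed_heartbeats`, `promotion_s`, and so on) are hypothetical names chosen for illustration, not settings defined by OJS, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivePassiveConfig:
    # All fields are illustrative assumptions, not OJS-defined settings.
    synchronous: bool            # True if writes wait for the standby
    replication_lag_s: float     # observed asynchronous replication lag, seconds
    heartbeat_interval_s: float  # how often the standby probes the primary
    missed_heartbeats: int       # missed probes before declaring failure
    promotion_s: float           # time needed to promote the standby


def estimate_rpo_s(cfg: ActivePassiveConfig) -> float:
    """Worst-case window of lost writes: zero when synchronous, else the lag."""
    return 0.0 if cfg.synchronous else cfg.replication_lag_s


def estimate_rto_s(cfg: ActivePassiveConfig) -> float:
    """Failover detection time plus promotion time."""
    detection_s = cfg.heartbeat_interval_s * cfg.missed_heartbeats
    return detection_s + cfg.promotion_s


cfg = ActivePassiveConfig(
    synchronous=False,
    replication_lag_s=2.0,
    heartbeat_interval_s=5.0,
    missed_heartbeats=3,
    promotion_s=5.0,
)
print(estimate_rpo_s(cfg))  # 2.0
print(estimate_rto_s(cfg))  # 20.0
```

With these example values the estimated RTO falls inside the typical 10–30 second range; shortening the heartbeat interval lowers it at the cost of more false-positive failovers.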

In an active-active architecture, multiple nodes handle traffic simultaneously, which requires conflict resolution for concurrent operations.

  • RPO: 0 (all writes are durable across nodes)
  • RTO: Near-zero (remaining nodes continue serving)
  • Complexity: High

With synchronous replication, writes are not acknowledged until they have been replicated. This guarantees zero data loss but adds latency.

With asynchronous replication, writes are acknowledged immediately and replicated in the background. This lowers latency, but recently written jobs may be lost on failover.

OJS requires that visibility timeouts and job state transitions are replicated before acknowledgment in synchronous mode.
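
The difference between the two replication modes comes down to when the acknowledgment is returned. The sketch below illustrates this with in-memory stand-ins; `Primary`, `Replica`, and `flush` are hypothetical names for illustration only and do not correspond to any OJS backend API.

```python
from collections import deque


class Replica:
    """In-memory stand-in for a standby node."""

    def __init__(self) -> None:
        self.log = []

    def apply(self, record: dict) -> None:
        self.log.append(record)


class Primary:
    """Hypothetical primary that replicates to a single replica."""

    def __init__(self, replica: Replica, synchronous: bool) -> None:
        self.replica = replica
        self.synchronous = synchronous
        self.log = []
        self.pending = deque()  # records acknowledged but not yet replicated

    def write(self, record: dict) -> bool:
        """Store a record locally and return True once it is acknowledged."""
        self.log.append(record)
        if self.synchronous:
            # Synchronous: replicate first, then acknowledge.
            self.replica.apply(record)
        else:
            # Asynchronous: acknowledge now, replicate in the background.
            self.pending.append(record)
        return True

    def flush(self) -> None:
        """Background replication step used in asynchronous mode."""
        while self.pending:
            self.replica.apply(self.pending.popleft())


sync_replica = Replica()
Primary(sync_replica, synchronous=True).write({"job": "a", "state": "active"})
print(len(sync_replica.log))   # 1

async_replica = Replica()
async_primary = Primary(async_replica, synchronous=False)
async_primary.write({"job": "b", "state": "active"})
print(len(async_replica.log))  # 0 (this write is lost if the primary fails now)
async_primary.flush()
print(len(async_replica.log))  # 1
```

In the asynchronous case, anything still in `pending` at the moment of failover is exactly the data counted by the RPO.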

When failover occurs:

  1. In-flight jobs on the failed node are recovered via heartbeat timeout.
  2. Clients reconnect to the new primary (via DNS, load balancer, or service discovery).
  3. Jobs in active state on the failed node transition to available after visibility timeout.
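
The recovery of in-flight jobs in the steps above can be sketched as a periodic sweep on the new primary. This is a minimal illustration, assuming each job record carries a `visible_at` deadline; the `Job` fields and the `reclaim_expired` helper are hypothetical, not part of the OJS specification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    id: str
    state: str                          # "available" or "active"
    visible_at: Optional[float] = None  # deadline after which an active job is reclaimable


def reclaim_expired(jobs: List[Job], now: float) -> List[str]:
    """Move active jobs whose visibility timeout has elapsed back to 'available'."""
    reclaimed = []
    for job in jobs:
        expired = job.visible_at is not None and job.visible_at <= now
        if job.state == "active" and expired:
            job.state = "available"
            job.visible_at = None
            reclaimed.append(job.id)
    return reclaimed


jobs = [
    Job("a", "active", visible_at=100.0),  # worker died with the failed node
    Job("b", "active", visible_at=200.0),  # still within its timeout
    Job("c", "available"),
]
print(reclaim_expired(jobs, now=150.0))  # ['a']
```

Because the sweep depends on `visible_at` being present on the new primary, this is one reason visibility timeouts need to be replicated along with job state.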

Backends MUST implement split-brain prevention to avoid dual-primary scenarios:

  • Fencing tokens: Each primary holds a monotonically increasing token. Stale primaries are fenced out.
  • Distributed locking: Consensus-based leader election (e.g., Raft, etcd, ZooKeeper).
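
The fencing-token idea can be illustrated with a store that rejects any write carrying a token lower than the highest it has seen. This is a sketch only: `FencedStore` and `StaleTokenError` are hypothetical names, and a real backend would obtain tokens from its leader-election mechanism rather than passing integers by hand.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class FencedStore:
    """Storage layer that fences out primaries holding stale tokens."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> None:
        if token < self.highest_token:
            raise StaleTokenError(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


store = FencedStore()
store.write(1, "job-1", "active")      # old primary, token 1
store.write(2, "job-1", "completed")   # new primary after failover, token 2
try:
    store.write(1, "job-1", "active")  # old primary wakes up and retries
except StaleTokenError as exc:
    print("rejected:", exc)            # rejected: token 1 < 2
print(store.data["job-1"])             # completed
```

The check lives in the storage layer, so a stale primary that still believes it is the leader cannot overwrite state written by its successor.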

Backends SHOULD support restoring to a specific point in time using:

  • Redis: RDB snapshots + AOF replay
  • PostgreSQL: WAL archiving + PITR
  • DynamoDB: Point-in-time recovery (PITR)

Backends SHOULD support exporting jobs in a portable OJS JSON format for cross-backend migration.

When a backend is partially available:

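| Mode                    | Behavior                                   |
|-------------------------|--------------------------------------------|
| Read-only               | Accept queries but reject new jobs         |
| Queue-level degradation | Some queues available, others unavailable  |
| Circuit breaker         | Stop accepting traffic, return 503         |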
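
The degraded modes listed above can be expressed as a single admission check. The sketch below is illustrative: the `Mode` enum, the `admit` function, and the `"query"`/`"enqueue"` operation names are assumptions, and the only response code taken from the text is 503 for the circuit breaker.

```python
from enum import Enum
from typing import AbstractSet


class Mode(Enum):
    NORMAL = "normal"
    READ_ONLY = "read-only"
    QUEUE_DEGRADED = "queue-level degradation"
    CIRCUIT_OPEN = "circuit breaker"


def admit(mode: Mode, operation: str, queue: str,
          unavailable_queues: AbstractSet[str] = frozenset()) -> bool:
    """Decide whether to serve a request; operation is 'query' or 'enqueue'."""
    if mode is Mode.CIRCUIT_OPEN:
        return False  # caller responds with 503
    if mode is Mode.READ_ONLY:
        return operation == "query"  # reject new jobs, keep serving reads
    if mode is Mode.QUEUE_DEGRADED:
        return queue not in unavailable_queues
    return True


print(admit(Mode.READ_ONLY, "query", "emails"))                   # True
print(admit(Mode.READ_ONLY, "enqueue", "emails"))                 # False
print(admit(Mode.QUEUE_DEGRADED, "enqueue", "emails", {"emails"}))  # False
print(admit(Mode.CIRCUIT_OPEN, "query", "emails"))                # False
```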

| Metric              | Description                       |
|---------------------|-----------------------------------|
| ojs.replication.lag | Replication delay in milliseconds |
| ojs.failover.count  | Number of failover events         |
| ojs.backup.age      | Time since last backup            |
| ojs.backup.size     | Size of last backup               |

| Event                          | Description                       |
|--------------------------------|-----------------------------------|
| system.failover.started        | Failover initiated                |
| system.failover.completed      | New primary active                |
| system.replication.lag_warning | Replication lag exceeds threshold |
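
As an illustration of how the lag metric and the lag warning event relate, the sketch below records `ojs.replication.lag` and emits `system.replication.lag_warning` when a threshold is exceeded. The function name, the threshold parameter, and the event payload fields are assumptions; the specification text here names only the metric and the event.

```python
from typing import Dict, List, Tuple


def check_replication_lag(lag_ms: float,
                          threshold_ms: float) -> Tuple[Dict[str, float], List[dict]]:
    """Return the metric sample and any events triggered by the current lag."""
    metrics = {"ojs.replication.lag": lag_ms}
    events = []
    if lag_ms > threshold_ms:
        events.append({
            "type": "system.replication.lag_warning",
            # Payload fields below are illustrative, not defined by OJS.
            "lag_ms": lag_ms,
            "threshold_ms": threshold_ms,
        })
    return metrics, events


metrics, events = check_replication_lag(lag_ms=750.0, threshold_ms=500.0)
print(metrics)            # {'ojs.replication.lag': 750.0}
print(events[0]["type"])  # system.replication.lag_warning
```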