Stateless HA Scheduling: Solving the Leader Election Problem at the Broker Layer
In modern web development, background jobs and scheduled tasks are the backbone of asynchronous operations—handling everything from daily email reports to outbox processing and periodic data cleanups.
However, running scheduled tasks (cron jobs) reliably in a clustered, high-availability (HA) environment is a notoriously difficult problem. Traditionally, developers have had to choose between heavy distributed locking protocols or complex, stateful leader-election daemons.
With the release of our updated background jobs engine, we took a step back and asked: What if we completely eliminated the stateful scheduler process, the database locks, and the leader election code?
Here is how we built a stateless, broker-deduplicated scheduling system that scales horizontally with zero administrative baggage.
The Traditional Dilemma of HA Scheduling
If you run multiple replicas of a background worker or api server, each running a standard scheduler loop (like Celery Beat), you face the double-trigger problem. When the clock hits 9:00 AM, every replica wakes up and triggers the daily report task, resulting in duplicate emails sent to your users.
Historically, frameworks have solved this in one of two ways:
- The Singleton Scheduler (Active/Passive): You run exactly one instance of the scheduler daemon. To prevent a single point of failure (SPOF), you implement a leader election consensus protocol (like Raft or Consul) so that a hot-standby node takes over if the primary node crashes.
- Pessimistic Locking:
All scheduler nodes wake up, query a database, and try to acquire a distributed lock (using Redis
SETNXor databaseSELECT ... FOR UPDATE). The node that secures the lock runs the schedule, while the others back off.
Both approaches introduce significant technical debt: they couple your application core to stateful locking engines, require careful management of lock TTLs, and can fail spectacularly under network partitions (split-brain scenarios) or server clock drift.
In cloud-native environments like Kubernetes, the alternative of spinning up a separate CronJob container every minute is also a well-known anti-pattern—introducing pod startup latency, API server strain, and substantial resource waste just to execute a 1-second task.
Shifting the Lock: Broker-Level Deduplication
Our new architecture resolves this by shifting the coordination responsibility from the application layer to the backing service (the message broker) where it belongs.
Instead of locking the scheduler, we allow all schedulers to trigger concurrently and let the broker discard the duplicates.
+---------------+
| Worker Pod 1 | --(Publish tick:job:09-00)--> +-----------------+
+---------------+ | |
| NATS JetStream | --> [Worker Pool]
+---------------+ | Deduplication | (Executes Once)
| Worker Pod 2 | --(Publish tick:job:09-00)--> +-----------------+
+---------------+ |
v
(Silently Discarded)
This is made possible by NATS JetStream’s native Message Deduplication capabilities.
1. The Agnostic Port
In the core interfaces (framework-m-core), we updated the JobQueueProtocol to accept an optional deduplication_id:
class JobQueueProtocol(Protocol):
async def enqueue(
self,
job_name: str,
deduplication_id: str | None = None,
**kwargs: Any
) -> str: ...
This interface remains completely clean and has no knowledge of NATS, Redis, or Celery.
2. The Dumb Ticker Loop
Instead of running a separate scheduler daemon, the scheduler loop is embedded directly inside the worker process (m worker) via a pluggable TickerProtocol.
Every worker replica runs a lightweight, stateless async loop (InProcessTickerAdapter) that wakes up on minute boundaries. When an hourly or weekly job is due, all replicas simultaneously generate the exact same deduplication ID based on the target time slice:
# Generates "tick:send_daily_report:1700000000"
deduplication_id = f"tick:{schedule.job_name}:{timestamp}"
Each replica then publishes the job to the queue with this ID.
3. Native Deduplication
When using NATS JetStream, the NATS adapter maps the deduplication_id directly to the standard Nats-Msg-Id header.
NATS JetStream is configured with a 2-minute deduplication window. The first message that hits NATS is successfully queued. When the other 49 replicas publish the same tick milliseconds later, NATS sees the duplicate Nats-Msg-Id and silently discards the publishes.
Note: We are still fundamentally depending on a consensus algorithm like Raft. NATS JetStream relies on its own robust, enterprise-grade Raft implementation to replicate stream state consistently across NATS cluster nodes. The key distinction is that the application code no longer has to implement or manage this consensus. By delegating deduplication to the broker, we outsource the distributed systems complexity entirely to the backing service.
The NATS Configuration "Hiccup" & Code-Based Stream Sync
Moving scheduling to the broker layer sounds simple, but real-world engineering always brings hiccups.
In NATS JetStream, deduplication is a stream-level setting (duplicate_window). If a stream (like our background job stream framework_m_jobs) already exists on a server, trying to startup a worker with a modified configuration (such as adding a deduplication window) causes NATS to reject the connection:
nats.js.errors.BadRequestError: stream name already in use with a different configuration
This is the messaging queue equivalent of a database schema mismatch. If not handled, it forces developers to manually delete streams or face startup crashes.
The Solution: NATS Schema Migrations
To keep our "0-cliff" setup promise, we implemented a NATS Schema Sync startup hook.
Before the worker or event bus adapters initialize, a central bootstrap utility connects to the NATS server, inspects the existing streams, and compares them with the required configurations in the codebase.
If it detects a mismatch (e.g. the server stream has no duplicate window, but our code expects a 120-second window), the hook dynamically calls js.update_stream(config) to migrate the stream configuration cleanly in-place without losing any currently queued messages.
High-Scale Optimization: The --no-scheduler Flag
In a small deployment, running the embedded scheduler loop on every worker replica is completely fine. However, in a large enterprise deployment, you might scale your worker pool to 50 or 100 replicas.
Having 50 pods wake up simultaneously every minute to publish identical tick messages to NATS is unnecessary. Although NATS JetStream will safely discard the duplicate publishes, this still creates unnecessary CPU, network socket, and NATS server overhead.
To optimize high-scale deployments, the m worker command supports disabling the embedded scheduler loop:
m worker --no-scheduler
This allows a highly optimized, dual-pool scaling strategy in production:
- The Ticker Pool (HA): Run exactly
2replicas ofm worker(with the scheduler loop active). This provides perfect high-availability failover. If one dies, the other continues ticking, and NATS deduplicates the few duplicate ticks they send. - The Worker Pool (Scale): Run
48replicas ofm worker --no-scheduler. These nodes run no internal clock loops and generate no NATS tick traffic; they concentrate 100% of their resources on executing tasks from the queue.
This gives operators maximum scalability and resource isolation without sacrificing the zero-cliff local development setup where a single m worker command handles both roles.
Why this Architecture Wins
- Active/Active Symmetry: Every single worker replica is identical. There is no active/passive state to coordinate. If some replicas crash, the remaining ones continue ticking and executing tasks without interruption.
- High-Frequency Tickers: Because the ticker is stateless and runs in-process inside the worker, it is extremely lightweight. We can easily run sub-minute or high-frequency ticks (e.g., executing reconciliation loops every 5 seconds) with near-zero overhead.
- No Database Baggage: No database tables are queried or locked to maintain scheduling state.
- Broker Agnostic: The core application logic only knows about the abstract
JobQueueProtocolanddeduplication_id. If tomorrow we swap NATS for Redis, the Redis adapter can implement the same deduplication logic using a standardSET NXkey with a TTL, requiring zero changes to your schedule configurations.
By pushing coordination down to the message broker, we've stripped away layers of complex locking code, resulting in a clean codebase and a far more resilient deployment.
