ADR-0003: Use NATS for WebSocket Backplane
- Status: Accepted
- Date: 2025-12-29
- Deciders: @anshpansuriya14
- Supersedes: N/A
- Superseded by: N/A
Context
Framework M needs real-time updates pushed to connected clients:
- Document changes (real-time collaboration)
- Notifications
- Background job completion
- System alerts
The Distributed WebSocket Problem: In Kubernetes/multi-pod deployments, a client connects to Pod A but an event occurs on Pod B. How does the event reach the client?
Forces at play:
- Horizontal Scaling: Must work across multiple server instances
- Low Latency: Real-time means `<100ms` delivery
- Consistency: Same infrastructure as jobs/events
- Simplicity: Minimal operational overhead
Decision
We will use NATS as the WebSocket backplane for distributing messages across server instances.
┌─────────┐    ┌─────────┐    ┌─────────┐
│  Pod A  │    │  Pod B  │    │  Pod C  │
│ (WS:✓)  │    │         │    │         │
└────┬────┘    └────┬────┘    └────┬────┘
     │              │              │
     └──────────────┼──────────────┘
                    │
            ┌───────▼───────┐
            │     NATS      │
            │  (Backplane)  │
            └───────────────┘
Flow:
- Event occurs on Pod B
- Pod B publishes to NATS: `ws.user.123`
- NATS broadcasts to all subscribers
- Pod A (holding user 123's WebSocket) receives and pushes to client
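The flow above can be sketched without a running NATS server. The example below is a minimal illustration, not the Framework M implementation: `InMemoryBackplane`, `Pod`, `connect`, and `emit` are hypothetical names, and the in-memory class only stands in for the fan-out that NATS would provide. In a real deployment a NATS client would take its place.

```python
from typing import Callable

Handler = Callable[[str, str], None]  # (subject, payload)


class InMemoryBackplane:
    """Hypothetical stand-in for NATS: fans every message out to all subscribers."""

    def __init__(self) -> None:
        self._handlers: list[Handler] = []

    def subscribe_all(self, handler: Handler) -> None:
        # Plays the role of a pod subscribing to the whole ws.> hierarchy.
        self._handlers.append(handler)

    def publish(self, subject: str, payload: str) -> None:
        for handler in list(self._handlers):
            handler(subject, payload)


class Pod:
    """A server instance that only knows about its own WebSocket connections."""

    def __init__(self, name: str, backplane: InMemoryBackplane) -> None:
        self.name = name
        self.backplane = backplane
        # user_id -> messages pushed to that user (a list stands in for the socket)
        self.sockets: dict[str, list[str]] = {}
        backplane.subscribe_all(self._on_message)

    def connect(self, user_id: str) -> None:
        self.sockets[user_id] = []

    def emit(self, user_id: str, payload: str) -> None:
        # An event on this pod is published to the backplane, never delivered directly.
        self.backplane.publish(f"ws.user.{user_id}", payload)

    def _on_message(self, subject: str, payload: str) -> None:
        prefix = "ws.user."
        if not subject.startswith(prefix):
            return
        outbox = self.sockets.get(subject[len(prefix):])
        if outbox is not None:  # only the pod holding the socket delivers
            outbox.append(payload)


backplane = InMemoryBackplane()
pod_a, pod_b, pod_c = (Pod(name, backplane) for name in ("A", "B", "C"))
pod_a.connect("123")                  # user 123's WebSocket lives on Pod A
pod_b.emit("123", "job 42 finished")  # the event occurs on Pod B
```

Every pod receives the message, but only Pod A finds a matching local socket and delivers it. Pods B and C drop it, which is the cost of the simple subscribe-to-everything model.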
Consequences
Positive
- Unified Infrastructure: NATS already used for jobs (ADR-0001) and events (ADR-0002)
- Simple Pub/Sub: No complex consumer groups needed
- Low Latency: NATS is optimized for fast messaging
- No State Management: NATS handles routing, so pods keep no shared routing state
Negative
- Single Point of Failure: NATS cluster must be highly available
- No Built-in Presence: Must implement "who's online" separately
Neutral
- Each pod subscribes to `ws.>` pattern on startup
- User-targeted messages use `ws.user.{user_id}` subject
- Broadcast messages use `ws.broadcast.*`
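To make the subject scheme concrete, here is a small sketch of how these patterns behave under NATS wildcard rules, where `*` matches exactly one token and `>` matches one or more trailing tokens. The helpers `subject_matches`, `user_subject`, and `broadcast_subject` are hypothetical and written only for illustration, as is the `alerts` topic. NATS performs this matching itself, so application code would not normally need it.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS-style subscription pattern matches a subject."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the last pattern token and needs at least one token left.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


def _token(value: str) -> str:
    # Assumption: IDs are embedded as a single token, so separators,
    # wildcards, and whitespace are rejected rather than escaped.
    if not value or any(c in value for c in ".*> \t\n"):
        raise ValueError(f"not a valid subject token: {value!r}")
    return value


def user_subject(user_id: str) -> str:
    return f"ws.user.{_token(user_id)}"


def broadcast_subject(topic: str) -> str:
    return f"ws.broadcast.{_token(topic)}"


# The pod-wide subscription sees both user-targeted and broadcast traffic.
print(subject_matches("ws.>", user_subject("123")))
print(subject_matches("ws.>", broadcast_subject("alerts")))
# The broadcast pattern matches one trailing token only.
print(subject_matches("ws.broadcast.*", "ws.broadcast.alerts.critical"))
```

One consequence worth noting: a user ID that contains a `.` would split into several tokens and change which patterns match, so validating or encoding IDs before building subjects is a reasonable precaution.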
Alternatives Considered
| Option | Pros | Cons |
|---|---|---|
| Chosen: NATS | Already deployed, fast, simple | Requires HA setup |
| Redis Pub/Sub | Proven, simple API | Another system, no persistence |
| Kafka | Durable, ordered | Overkill for ephemeral WS messages |
| Sticky Sessions | No backplane needed | Breaks on pod restart, poor scaling |
References
- NATS Pub/Sub
- Related: ADR-0001, ADR-0002