ADR-0003: Use NATS for WebSocket Backplane

  • Status: Accepted
  • Date: 2025-12-29
  • Deciders: @anshpansuriya14
  • Supersedes: N/A
  • Superseded by: N/A

Context

Framework M needs real-time updates pushed to connected clients:

  • Document changes (real-time collaboration)
  • Notifications
  • Background job completion
  • System alerts

The Distributed WebSocket Problem: In Kubernetes/multi-pod deployments, a client connects to Pod A but an event occurs on Pod B. How does the event reach the client?

Forces at play:

  • Horizontal Scaling: Must work across multiple server instances
  • Low Latency: Real-time here means delivery in under 100 ms
  • Consistency: Same infrastructure as jobs/events
  • Simplicity: Minimal operational overhead

Decision

We will use NATS as the WebSocket backplane for distributing messages across server instances.

┌─────────┐    ┌─────────┐    ┌─────────┐
│  Pod A  │    │  Pod B  │    │  Pod C  │
│ (WS:✓)  │    │         │    │         │
└────┬────┘    └────┬────┘    └────┬────┘
     │              │              │
     └──────────────┼──────────────┘
            ┌───────▼───────┐
            │     NATS      │
            │  (Backplane)  │
            └───────────────┘

Flow:

  1. Event occurs on Pod B
  2. Pod B publishes to NATS: ws.user.123
  3. NATS delivers the message to every pod subscribed to a matching subject
  4. Pod A (holding user 123's WebSocket) receives and pushes to client
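
The flow above can be sketched in a few lines. This is a minimal in-memory simulation, not the Framework M implementation: `InMemoryBackplane` is a hypothetical stand-in for NATS that fans every message out to all subscribers, `Pod` and its methods are illustrative names, and a plain list stands in for each WebSocket connection.

```python
from typing import Callable, Dict, List

Handler = Callable[[str, str], None]


class InMemoryBackplane:
    """Hypothetical stand-in for NATS: delivers each message to every subscriber."""

    def __init__(self) -> None:
        self._handlers: List[Handler] = []

    def subscribe(self, handler: Handler) -> None:
        self._handlers.append(handler)

    def publish(self, subject: str, payload: str) -> None:
        for handler in list(self._handlers):
            handler(subject, payload)


class Pod:
    """One server instance holding some users' WebSocket connections locally."""

    USER_PREFIX = "ws.user."

    def __init__(self, name: str, backplane: InMemoryBackplane) -> None:
        self.name = name
        self.backplane = backplane
        # user_id -> messages pushed to that user (a list stands in for a socket)
        self.sockets: Dict[str, List[str]] = {}
        backplane.subscribe(self._on_message)

    def connect(self, user_id: str) -> None:
        self.sockets[user_id] = []

    def emit_to_user(self, user_id: str, payload: str) -> None:
        # Step 2: publish to the backplane instead of looking for a local socket.
        self.backplane.publish(f"{self.USER_PREFIX}{user_id}", payload)

    def _on_message(self, subject: str, payload: str) -> None:
        # Step 4: only the pod that holds the user's connection pushes the message.
        if not subject.startswith(self.USER_PREFIX):
            return
        socket = self.sockets.get(subject[len(self.USER_PREFIX):])
        if socket is not None:
            socket.append(payload)


backplane = InMemoryBackplane()
pod_a, pod_b, pod_c = (Pod(name, backplane) for name in ("A", "B", "C"))
pod_a.connect("123")                             # user 123 is connected to Pod A
pod_b.emit_to_user("123", "document updated")    # the event occurs on Pod B
```

After the last line, only `pod_a` has recorded a message for user 123. Pods B and C receive the same message from the backplane but drop it because they hold no connection for that user.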

Consequences

Positive

  • Unified Infrastructure: NATS is already used for jobs (ADR-0001) and events (ADR-0002)
  • Simple Pub/Sub: No complex consumer groups needed
  • Low Latency: NATS is optimized for fast messaging
  • No State Management: NATS handles routing; pods share no routing state and track only their own local connections

Negative

  • Single Point of Failure: NATS cluster must be highly available
  • No Built-in Presence: Must implement "who's online" separately

Neutral

  • Each pod subscribes to the ws.> wildcard subject on startup
  • User-targeted messages use the subject ws.user.{user_id}
  • Broadcast messages use subjects matching ws.broadcast.*
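
To make the subject conventions above concrete, here is a small sketch of NATS-style wildcard matching, where `*` matches exactly one token and `>` matches one or more trailing tokens. The helpers `subject_matches` and `user_subject` are hypothetical names written for illustration; they are not part of any NATS client library, which performs this matching on the server side.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a dot-separated subject matches a NATS-style pattern."""
    pattern_tokens = pattern.split(".")
    subject_tokens = subject.split(".")
    for i, token in enumerate(pattern_tokens):
        if token == ">":
            # '>' is only valid as the last token and needs at least one token to match.
            return i == len(pattern_tokens) - 1 and len(subject_tokens) > i
        if i >= len(subject_tokens):
            return False
        if token != "*" and token != subject_tokens[i]:
            return False
    # Without a trailing '>', both sides must have the same number of tokens.
    return len(pattern_tokens) == len(subject_tokens)


def user_subject(user_id: str) -> str:
    """Build the user-targeted subject described above (hypothetical helper)."""
    return f"ws.user.{user_id}"


# The pod-wide subscription sees both user-targeted and broadcast messages.
pod_subscription = "ws.>"
sees_user_message = subject_matches(pod_subscription, user_subject("123"))
sees_broadcast = subject_matches(pod_subscription, "ws.broadcast.alerts")

# '*' covers exactly one token after ws.broadcast.
one_level = subject_matches("ws.broadcast.*", "ws.broadcast.alerts")
two_levels = subject_matches("ws.broadcast.*", "ws.broadcast.alerts.critical")
```

Because every pod subscribes to `ws.>`, each pod receives every `ws` message and decides locally whether it holds a matching connection. Under these matching rules, `ws.broadcast.*` would not match a bare `ws.broadcast` subject or one with more than one trailing token.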

Alternatives Considered

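| Option          | Pros                           | Cons                                |
|-----------------|--------------------------------|-------------------------------------|
| Chosen: NATS    | Already deployed, fast, simple | Requires HA setup                   |
| Redis Pub/Sub   | Proven, simple API             | Another system, no persistence      |
| Kafka           | Durable, ordered               | Overkill for ephemeral WS messages  |
| Sticky Sessions | No backplane needed            | Breaks on pod restart, poor scaling |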

References