Solana DeFi WebSocket Systems Design

Your WebSocket Is a Hint: Reconciling Perp Positions Without Trusting the Stream

How we keep a local view of exchange-held perpetual futures positions correct when every channel feeding it can lie, lag, or go silent — and why the boring polling loop is the load-bearing part.


We run autonomous trading agents that hold perpetual futures positions on an external venue (Imperial.space, a perps aggregator on Solana). Our orchestration service keeps a local copy of every agent’s positions, because everything downstream depends on it: risk checks, liquidation-distance monitoring, lifecycle bookkeeping, the UI.

That local copy has one honest description: it is a cache. The truth lives on the venue’s servers. And the moment you accept that framing, the design question stops being “how do we process position updates?” and becomes “how does a cache stay correct when every channel feeding it is unreliable?”

The channels you actually get

Like most venues, Imperial offers two ways to learn about your positions:

  • a REST endpoint (/positions) that returns the full current state, and
  • a WebSocket that pushes updates in real time.

The instinctive architecture is: take a snapshot once at startup, then apply WS deltas forever. It’s also a trap, because a WebSocket can fail in four distinct ways:

  1. Drop — connection dies, you reconnect, fine.
  2. Reconnect with a gap — everything that happened while you were down is simply gone.
  3. Duplicates and reordering — the same event twice, or out of order.
  4. Connected-but-silently-stale — the worst one. The socket is open, TCP is happy, and nothing is flowing. There is no error. Silence looks identical to “nothing happened.”

If your correctness depends on receiving any particular WS message, then a missed message can mean a missed liquidation. That is not a latency bug; that’s lost money.

A live discovery that simplified everything

When we probed the venue’s WebSocket, we found something that initially looked like a limitation and turned out to be a gift: the stream carries no state at all.

It never says “position X is now closed at price Y.” It sends coalesced invalidation pings — {"type":"positions_updated"} — that mean exactly one thing: something changed, go look. (It also silently ignores malformed frames, with no error response. The app-level heartbeat is the only liveness signal you get.)
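
Since the heartbeat is the only liveness signal, deciding when to tear down and reconnect can be as small as a timestamp comparison. The following is a minimal sketch, not the production code: `HeartbeatWatch` and its silence threshold are hypothetical names and values chosen for illustration, and the venue's actual heartbeat cadence is not stated here. It only restores latency sooner; nothing about correctness should hang on it.

```rust
use std::time::{Duration, Instant};

/// Hypothetical watchdog: remembers when the last app-level heartbeat arrived.
pub struct HeartbeatWatch {
    last_seen: Instant,
    max_silence: Duration, // assumed threshold; tune to the venue's heartbeat cadence
}

impl HeartbeatWatch {
    pub fn new(now: Instant, max_silence: Duration) -> Self {
        Self { last_seen: now, max_silence }
    }

    /// Call whenever a heartbeat frame arrives on the socket.
    pub fn on_heartbeat(&mut self, now: Instant) {
        self.last_seen = now;
    }

    /// True when the socket has been silent for longer than the threshold
    /// and should be closed and reconnected.
    pub fn should_reconnect(&self, now: Instant) -> bool {
        now.duration_since(self.last_seen) > self.max_silence
    }
}
```

Passing `now` in explicitly, rather than reading the clock inside the type, keeps the check deterministic and easy to test.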

So even on the happy path, the only way to learn the new state is to fetch the full REST snapshot. Which collapses the architecture beautifully: there is exactly one way state ever enters our system — fetch a snapshot, apply it to the store. The WebSocket and a timer are just two different reasons to run that same fetch.

No delta-application logic. No replay buffers. No “what did I miss while disconnected” reconstruction. One code path, exercised constantly.
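
To make "one code path" concrete, here is a minimal synchronous sketch. All names (`Trigger`, `Store`, `reconcile`, the `Position` fields) are assumptions for illustration, and the `fetch` closure stands in for the REST read of /positions. Ordering guards are left out of this sketch to keep the shape visible.

```rust
use std::collections::HashMap;

/// Why a reconcile was requested (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Trigger {
    TimerTick,
    WsPing,
}

/// Simplified stand-in for one position row returned by the venue.
#[derive(Debug, Clone, PartialEq)]
pub struct Position {
    pub market: String,
    pub size: i64,
}

/// Local cache: agent id -> last applied snapshot.
#[derive(Default)]
pub struct Store {
    pub positions: HashMap<String, Vec<Position>>,
    pub last_trigger: Option<Trigger>, // kept for diagnostics only
}

/// The single entry point for state. The trigger is recorded but never
/// changes what is fetched or how it is applied.
pub fn reconcile<F>(store: &mut Store, agent: &str, trigger: Trigger, fetch: F) -> usize
where
    F: Fn(&str) -> Vec<Position>,
{
    let snapshot = fetch(agent);
    let count = snapshot.len();
    store.positions.insert(agent.to_string(), snapshot); // full replace, no deltas
    store.last_trigger = Some(trigger);
    count
}
```

Calling `reconcile` with `Trigger::TimerTick` or `Trigger::WsPing` leaves the store in the same state for the same venue response, which is the point: the trigger is a reason, not a data source.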

The backstop: a poll that doesn’t wait for permission

Here is the load-bearing component, and it’s the least glamorous one. On a fixed interval — 12 seconds by default — a sweep walks every agent and fetches its full position snapshot. We call it the reconcile backstop, and its defining property is:

It runs even when the WebSocket is perfectly healthy.

It is not a fallback that activates when failure is detected. It is always on. That distinction matters more than it looks: a fallback needs failure-detection logic, and failure-detection logic can itself fail (remember failure mode #4 — the silently stale socket that looks exactly like a quiet market). The backstop has no such dependency. It doesn’t ask whether the WS is fine. It just goes and looks.

The consequence is a clean separation of duties:

  • WebSocket = latency. A fill is reflected locally in ~100ms.
  • Backstop = correctness. If every WS message in the world is lost, the local view is still at most one poll interval behind venue truth.

Losing the WebSocket degrades latency, never correctness. The poll interval becomes your worst-case reaction time — a deliberate, tunable dial rather than an accident.
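
A rough sketch of what "always on" looks like in code, under assumed names (`Backstop`, `tick`, `sweep` are hypothetical). The thing to notice is what is missing: nothing in the scheduler takes WebSocket health as an input.

```rust
use std::time::{Duration, Instant};

/// Hypothetical scheduler for the reconcile backstop.
pub struct Backstop {
    interval: Duration,
    next_due: Instant,
}

impl Backstop {
    pub fn new(now: Instant, interval: Duration) -> Self {
        Self { interval, next_due: now + interval }
    }

    /// Returns true when a sweep is due, and schedules the next one.
    pub fn tick(&mut self, now: Instant) -> bool {
        if now < self.next_due {
            return false;
        }
        // Schedule from `now` so a late tick does not cause a burst of catch-up sweeps.
        self.next_due = now + self.interval;
        true
    }
}

/// Walks every agent. `reconcile_agent` stands in for
/// "fetch the full snapshot and apply it to the store".
pub fn sweep<F: FnMut(&str)>(agents: &[String], mut reconcile_agent: F) {
    for agent in agents {
        reconcile_agent(agent.as_str());
    }
}
```

A driver loop would call `tick` with the current time and run `sweep` whenever it returns true. With a 12-second interval, a sweep starts at most 12 seconds after the previous one was scheduled, whatever the socket is doing.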

(One practical note for multi-agent setups: attach each agent’s auth token to its reconcile reads if the venue keys rate limits per-wallet when authenticated. Otherwise N agents share one per-IP budget and your safety net rate-limits itself.)

The race you create by being safe

Two independent triggers — timer ticks and WS pings — now race to fetch and apply snapshots. That buys robustness and creates exactly one hazard: a slow old fetch finishing after a fast new one.

Walk through it:

  1. The timer starts fetch A. The store stamps it generation 5. The network is slow.
  2. Meanwhile, the position closes. A WS ping triggers fetch B (generation 6), which returns quickly and applies. The store now says closed.
  3. A’s response finally arrives, carrying the older world in which the position is still open.

Without a guard, applying A resurrects a closed position. Stale data overwrites fresh data, and your risk engine starts reasoning about a position that doesn’t exist. This is the classic read-after-read race, and it shows up in any system where concurrent fetches feed one cache.

The fix is two small rules in the store:

Rule 1 — stamp at fetch start, not completion. A snapshot’s generation is taken before the request goes out, because that’s the moment that bounds how fresh its data can possibly be. A response that took 30 seconds to arrive is still generation-5 data.

Rule 2 — apply only if newer. A snapshot is applied only if its generation is strictly greater than the last applied one. Fetch A (5 < 6) is dropped on the floor. Duplicates die by the same comparison, for free.

pub async fn apply_snapshot(&self, agent: &str, generation: u64, positions: Vec<Lifecycle>) -> bool {
    let entry = self.entry(agent).await;
    let mut state = entry.state.lock().await;   // single writer per agent
    if generation <= state.applied_generation {
        return false;                            // stale or duplicate: dropped
    }
    state.applied_generation = generation;
    state.positions = positions.into_iter().map(|p| (p.lifecycle_key(), p)).collect();
    true
}

The mutex in there is the third, quieter rule: one writer per agent. All applies for an agent serialize through a per-agent lock, so a timer apply and a WS-triggered apply can never interleave halfway and leave a chimera of two snapshots.
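
The race and both rules can be replayed in a few lines. This is a synchronous, std-only sketch rather than the async store shown above: `begin_fetch`, the single global counter, and the simplified `Vec<String>` of open positions are all assumptions made for illustration, and one lock covers every agent here purely for brevity.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Per-agent state guarded by the store's lock.
#[derive(Default)]
struct AgentState {
    applied_generation: u64,
    open_positions: Vec<String>, // simplified stand-in for lifecycles
}

/// Hypothetical std-only store used to replay the race.
pub struct Store {
    next_generation: AtomicU64,
    agents: Mutex<HashMap<String, AgentState>>,
}

impl Store {
    pub fn new() -> Self {
        Self { next_generation: AtomicU64::new(0), agents: Mutex::new(HashMap::new()) }
    }

    /// Rule 1: take the stamp before the request goes out.
    pub fn begin_fetch(&self) -> u64 {
        self.next_generation.fetch_add(1, Ordering::SeqCst) + 1
    }

    /// Rule 2: apply only if strictly newer than what is already applied.
    pub fn apply_snapshot(&self, agent: &str, generation: u64, open: Vec<String>) -> bool {
        let mut agents = self.agents.lock().unwrap();
        let state = agents.entry(agent.to_string()).or_default();
        if generation <= state.applied_generation {
            return false; // stale or duplicate: dropped
        }
        state.applied_generation = generation;
        state.open_positions = open;
        true
    }

    pub fn open_positions(&self, agent: &str) -> Vec<String> {
        let agents = self.agents.lock().unwrap();
        agents.get(agent).map(|s| s.open_positions.clone()).unwrap_or_default()
    }
}
```

Replaying the walk-through: call `begin_fetch` twice to stamp A and then B, apply B's empty snapshot first, then try to apply A's snapshot containing the open position. The second apply returns `false` and the store still reports no open positions.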

Two venue quirks worth stealing defenses for

Identity is not what it looks like. The venue’s on-chain position account (positionPda) looked like the natural primary key — until a live session showed two different position lifecycles, opened hours apart, reusing the same account. The venue recycles position accounts per wallet/market/side. The actual unique key is the lifecycle’s UUID, with (account, openedAt) as a fallback. Lesson: never assume an external system’s most prominent identifier is a unique key. Verify with live data.
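
One way to encode that keying rule is shown below as a sketch. The field names (`uuid`, `position_pda`, `opened_at`) and the `LifecycleKey` enum are assumptions for illustration, not the venue's schema or our exact types.

```rust
use std::collections::HashMap;

/// Simplified position record with assumed field names.
#[derive(Debug, Clone)]
pub struct Lifecycle {
    pub uuid: Option<String>,
    pub position_pda: String,
    pub opened_at: u64, // unix seconds
}

/// Primary key for a lifecycle: the UUID when present,
/// otherwise the (account, openedAt) pair.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum LifecycleKey {
    Uuid(String),
    AccountOpenedAt(String, u64),
}

impl Lifecycle {
    pub fn lifecycle_key(&self) -> LifecycleKey {
        match &self.uuid {
            Some(id) => LifecycleKey::Uuid(id.clone()),
            None => LifecycleKey::AccountOpenedAt(self.position_pda.clone(), self.opened_at),
        }
    }
}

/// Index a snapshot by lifecycle key, so a recycled account
/// never collapses two lifecycles into one entry.
pub fn index(snapshot: Vec<Lifecycle>) -> HashMap<LifecycleKey, Lifecycle> {
    snapshot.into_iter().map(|p| (p.lifecycle_key(), p)).collect()
}
```

Keeping the two key shapes as separate enum variants means a UUID can never accidentally compare equal to an account address.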

Absence of evidence isn’t evidence of absence. The venue’s indexer lags writes by seconds to tens of seconds. Read /positions immediately after a confirmed order and you may get an empty list. So every reconcile read retries with backoff, and “zero positions right after a write” is never treated as proof that nothing exists — the snapshot reflects what the indexer has seen, and the next tick refines it.
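
A sketch of one possible shape for such a read follows. Everything here is an assumption for illustration: the `ReadOutcome` type, the `wrote_recently` flag, the 250 ms base delay, and the 4 s cap. The sleep function is injected so the logic can be exercised without actually waiting.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(cap)
}

/// Outcome of a reconcile read. `Inconclusive` means "no usable answer yet";
/// the caller leaves the store alone and lets the next tick try again.
#[derive(Debug, PartialEq)]
pub enum ReadOutcome<T> {
    Snapshot(Vec<T>),
    Inconclusive,
}

pub fn read_with_retry<T, F, S>(
    mut fetch: F,
    mut sleep: S,
    max_attempts: u32,
    wrote_recently: bool,
) -> ReadOutcome<T>
where
    F: FnMut() -> Result<Vec<T>, String>,
    S: FnMut(Duration),
{
    for attempt in 0..max_attempts {
        match fetch() {
            // An empty list right after a write may just be indexer lag: retry.
            Ok(rows) if rows.is_empty() && wrote_recently => {}
            Ok(rows) => return ReadOutcome::Snapshot(rows),
            Err(_) => {} // transient failure: retry
        }
        if attempt + 1 < max_attempts {
            sleep(backoff_delay(attempt, Duration::from_millis(250), Duration::from_secs(4)));
        }
    }
    ReadOutcome::Inconclusive
}
```

In production the `sleep` argument would be a real delay (for example `std::thread::sleep` in synchronous code). An empty list with no recent write is still accepted as a genuine snapshot, since "no positions" is a perfectly valid state.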

Why this adds up to correctness

The whole design reduces to one invariant:

The store always holds some complete, recent snapshot of venue truth, and it only ever moves forward in time.

Each property is carried by one mechanism:

Property | Carried by
Complete | Full snapshots: there are no deltas to mis-apply
Recent | The backstop: recency holds regardless of WS health
Forward-only | The generation guard: stale fetches can’t regress state
Atomic | The per-agent single writer: no interleaved half-applies

Everything downstream — most importantly the liquidation-distance checks — reads a view that is wrong by at most one poll interval, and never wrong in the dangerous “confidently stale” way. “We missed the WebSocket message” can never become “we missed the liquidation.”

Glossary

  • Reconciliation — periodically replacing a local view with authoritative remote state, instead of trusting that incremental updates were all received and applied correctly.
  • Backstop — a safety mechanism that runs unconditionally behind a faster primary path, so the system stays correct even if the primary silently fails. The defining property: it does not wait for a failure signal, so there is no failure-detection logic that can itself fail.
  • Snapshot — a complete, self-contained read of remote state. Applying one requires no history; it fully replaces what came before.
  • Invalidation signal — a notification that state changed, carrying none of the new state itself. It can only trigger a fetch; it can never be applied directly, which means losing one can never corrupt anything.
  • Generation — a counter handed out when a fetch starts, stamping each snapshot with its place in time. Comparing generations orders snapshots by how fresh their data can possibly be — regardless of the order their responses arrive in.
  • Monotonic (apply) — state may only move forward in time. An update carrying older information than what is already applied is rejected, so stale data can never overwrite fresh data.
  • Single writer — all mutations of one entity’s state are serialized through one lock (or actor/queue), making each apply atomic and eliminating interleaving races between concurrent update paths.
  • Idempotent — safe to do twice. Snapshot application is naturally idempotent: re-applying the same snapshot changes nothing, which is exactly what makes at-least-once triggering (timer + WS + reconnects) safe.

The takeaway

If you remember one thing: treat the push channel as a hint and the poll as the truth-keeper, and make the store refuse to move backwards. Push gives you speed. Poll gives you a guarantee. Generations keep the two from fighting. None of it is exotic: a counter, a comparison, a mutex, and a timer that doesn’t ask permission. Together they turn “we hope we didn’t miss a message” into “we are wrong by at most one poll interval (twelve seconds by default, plus however long the fetch itself takes), by construction.”