One Token, Eight Callers: Single-Flighting JWT Mints in an Async Rust Service

A war story about a 401 that appeared four hundred milliseconds after startup, the cache-stampede pattern hiding behind it, and why “force refresh” is the subtle part of any single-flight design.

Our trading-agent orchestrator authenticates each agent against an external perps venue (Imperial.space) and holds the resulting JWT in memory. The venue’s auth handshake is wallet-based: sign the message

imperial:mobile-connect:{wallet}:{nonce}

with the agent’s key, exchange the signature for a one-time code, exchange the code for a JWT. The nonce is the current Unix time in milliseconds — effectively JavaScript’s Date.now() — and the venue enforces both freshness and uniqueness: replay a nonce and you get a 401.
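To see why a millisecond nonce is fragile, it helps to write the message construction down. This is a minimal sketch, not the venue's client: now_ms() is named in this post, but its signature here is an assumption, and auth_message() is a hypothetical helper.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Current Unix time in milliseconds, the same quantity Date.now() returns.
/// (Assumed signature; the post only names the function.)
fn now_ms() -> u128 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is set before 1970")
        .as_millis()
}

/// Hypothetical helper: builds the message the agent signs.
fn auth_message(wallet: &str, nonce_ms: u128) -> String {
    format!("imperial:mobile-connect:{wallet}:{nonce_ms}")
}
```

Two calls to auth_message() with the same wallet inside the same millisecond produce byte-identical messages, and so byte-identical signatures. Nothing in the message distinguishes one caller from another.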

The JWT lives for 30 days, so re-authentication is rare and cheap. We built a small per-agent token manager: a map of agent_id → token, a token_for() that returns the cached token or mints a new one, and a low-cadence background sweep that proactively re-mints tokens approaching expiry. Simple, boring, correct.

Almost.

The 401 that took 400 milliseconds to appear

First full live run of the service. The logs, lightly trimmed:

15:42:39.874  INFO  perps connectivity started            agents=["dev_1"]
15:42:40.249  INFO  perps /ws connected + subscribed      agent_id="dev_1"
15:42:40.576  WARN  proactive perps re-auth failed        agent_id="dev_1"
                    error=minting JWT (a 401 may indicate local clock skew...)
15:42:41.240  INFO  perps JWT minted/refreshed            agent_id="dev_1"

A 401, a few hundred milliseconds after startup, that healed itself less than a second later. Everything worked. It was the kind of warning that is easy to shrug at.

Here’s what actually happened. At startup, two independent components woke up at the same instant, and both needed a token for the same agent:

the reconciliation loop, doing its first position fetch (it attaches the JWT to reads so they count against the agent’s own rate budget), and
the proactive re-auth sweep, whose interval timer fires immediately on start.

Both checked the cache. Both found it empty. Both started a mint. Both built the auth message with now_ms() as the nonce — in the same millisecond. Identical wallet, identical nonce, identical message: the venue accepted the first signature and rejected the second as a replay. Hence the 401.

This is a cache stampede (or thundering herd): N concurrent callers all observe a cold cache and all pay the cost of filling it. Usually the cost is wasted work — N identical database queries, N identical HTTP calls. Our version had a twist that made it visible: the venue’s nonce-uniqueness rule turned redundant work into an outright failure. In a way, we got lucky. Most stampedes silently waste capacity; ours filed a bug report against itself in the logs.

And it scales badly in exactly the dimension we care about: with N agents and more components needing tokens (the risk engine and order paths come next), startup concurrency only grows.
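The collision can be reproduced in a few lines with a stand-in for the venue. This is a sketch under assumptions: MockVenue and AuthError are hypothetical names, the real venue also enforces freshness (ignored here), and the returned string is a placeholder rather than a real JWT.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the venue's replay check:
/// each (wallet, nonce) pair is accepted exactly once.
#[derive(Default)]
struct MockVenue {
    seen: HashSet<(String, u128)>,
}

#[derive(Debug, PartialEq)]
enum AuthError {
    /// Models the venue's 401 for a replayed nonce.
    Unauthorized,
}

impl MockVenue {
    fn authenticate(&mut self, wallet: &str, nonce_ms: u128) -> Result<String, AuthError> {
        // HashSet::insert returns false when the value was already present.
        if !self.seen.insert((wallet.to_string(), nonce_ms)) {
            return Err(AuthError::Unauthorized);
        }
        Ok(format!("jwt-for-{wallet}-{nonce_ms}"))
    }
}
```

Calling authenticate() twice with the same wallet and nonce succeeds once and then fails, which is the shape of the startup log: one mint accepted, one rejected, and a later retry with a new nonce succeeding.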

The fix: single-flight, per agent

The classic answer is single-flight: when multiple callers want the same expensive thing concurrently, exactly one does the work, and everyone else waits for and shares the result. (Go users know this as golang.org/x/sync/singleflight.)

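In async Rust the minimal version is a per-key mutex plus a re-check after acquiring it. This is double-checked locking, done with an async-aware mutex whose guard can be held across await points (the code below awaits lock(); tokio::sync::Mutex is one such type, though any async mutex would do):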

pub struct JwtManager {
    es: Arc<dyn PerpsEsClient>,
    /// Per-agent token map — never a single global token.
    jwts: RwLock<HashMap<String, JwtInfo>>,
    /// Per-agent single-flight guards: concurrent mints for one agent
    /// coalesce onto one auth handshake.
    mint_locks: RwLock<HashMap<String, Arc<Mutex<()>>>>,
}

The mint path takes the agent’s lock, and — crucially — re-checks the cache after acquiring it:

async fn mint(&self, agent_id: &str, prior: Option<String>) -> Result<String> {
    let lock = self.mint_lock(agent_id).await;
    let _guard = lock.lock().await;

    // Re-check under the lock: did a concurrent flight already refresh?
    if let Some(current) = self.jwts.read().await.get(agent_id)
        && prior.as_deref() != Some(current.token.as_str())
    {
        return Ok(current.token.clone());
    }

    let jwt = self.es.authenticate(agent_id).await?;   // the one real handshake
    let token = jwt.token.clone();
    self.jwts.write().await.insert(agent_id.to_string(), jwt);
    Ok(token)
}

Eight concurrent token_for() calls on a cold cache now produce exactly one venue handshake. The first caller takes the lock and mints; the other seven queue on the mutex, wake up, find a token in the cache that differs from what they saw before, and return it. One nonce, no replay, no 401.

Note the lock is per agent, not global. Agent A re-authenticating must never block agent B — the whole token store is a per-agent map by design, and the single-flight guard follows the same key.
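The mint_lock() helper is not shown above. One plausible shape for it is a get-or-create over the lock map. The sketch below is a synchronous, std-only analogue (std::sync locks instead of async ones, a hypothetical MintLocks wrapper type), so treat it as an illustration of the idea rather than the service's code.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

/// Hypothetical wrapper around the per-agent lock map.
#[derive(Default)]
struct MintLocks {
    locks: RwLock<HashMap<String, Arc<Mutex<()>>>>,
}

impl MintLocks {
    /// Returns the lock for `agent_id`, creating it on first use.
    fn mint_lock(&self, agent_id: &str) -> Arc<Mutex<()>> {
        {
            // Fast path: the lock usually exists, so a shared read suffices.
            let map = self.locks.read().unwrap();
            if let Some(lock) = map.get(agent_id) {
                return Arc::clone(lock);
            }
        } // read guard dropped here, before the write lock is taken

        // Slow path: entry() looks the key up again under the write lock,
        // so two racing callers still end up sharing one Arc.
        let mut map = self.locks.write().unwrap();
        map.entry(agent_id.to_string()).or_default().clone()
    }
}
```

Because each agent gets its own Arc<Mutex<()>>, holding the lock for dev_1 leaves dev_2 free to mint. The map-level lock is held only long enough to look up or insert an entry, never for the duration of a handshake.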

The subtle part: what does “force refresh” mean?

The naive re-check would be: if the cache has a valid-looking token, return it. That’s wrong, and the reason is the second entry point into mint.

Besides token_for() (cache miss / nearing expiry), there’s force_reauth() — called when an authenticated request just came back 401, meaning the cached token is bad even though its expiry timestamp looks fine (revoked server-side, clock skew, whatever). For that caller, “the cache has a token” is precisely not good enough: the cache has the token they just watched fail.

The naive re-check creates a TOCTOU hole: the force-refresher checks the cache, sees the bad token, takes the lock, re-checks, sees the same bad token, and the naive rule (“cache has something → return it”) hands the poison right back.

The fix is to make every caller carry a witness: the token it observed when it decided a mint was needed (prior in the code above — None for “cache was empty”). The rule under the lock becomes:

If the cached token differs from the one you came in holding, someone else minted while you waited — take theirs. If it’s the same token you already decided was missing/stale/bad, the job is still yours — mint.

This gives both call sites the semantics they actually want from one rule:

Cold-start stampede: caller arrives with prior = None, finds a token under the lock, None ≠ token, returns it. Coalesced.
Force-refresh after a 401: caller arrives with prior = bad_token. If a concurrent flight already replaced it, bad_token ≠ new_token; the new one was minted after the token that failed, so take it. If nothing changed, prior == current, so mint a genuinely new token. The poison is never returned.
Two concurrent force-refreshes: both hold the same bad witness; the first mints, the second sees a different token and coalesces. One handshake, not two.

The witness comparison is doing quiet but real work: it encodes “fresh relative to what I knew” instead of “fresh according to a timestamp” — and timestamps were exactly the thing the 401 proved unreliable.
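Stripped of locks and I/O, the rule is a pure function of two values, which makes the three cases easy to check side by side. This is a sketch: Decision, decide(), and decide_naive() are hypothetical names, and tokens are plain string slices here.

```rust
/// What a caller should do once it holds the agent's mint lock.
#[derive(Debug, PartialEq)]
enum Decision<'a> {
    /// Another flight replaced the token while we waited: use theirs.
    Coalesce(&'a str),
    /// The cache is empty or still holds what we came in with: mint.
    Mint,
}

/// Witness rule: coalesce only if the cache holds a token that differs
/// from the one the caller observed (`prior`).
fn decide<'a>(prior: Option<&str>, current: Option<&'a str>) -> Decision<'a> {
    match current {
        Some(token) if prior != Some(token) => Decision::Coalesce(token),
        _ => Decision::Mint,
    }
}

/// Naive rule, for contrast: any cached token is good enough.
fn decide_naive<'a>(current: Option<&'a str>) -> Decision<'a> {
    match current {
        Some(token) => Decision::Coalesce(token),
        None => Decision::Mint,
    }
}
```

With these definitions, decide(Some("bad"), Some("bad")) yields Mint, while decide_naive(Some("bad")) yields Coalesce("bad"): the naive rule hands the failed token straight back to the caller that reported it.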

Testing it without flaky sleeps

Race-condition tests rot when they depend on timing. Two things kept these deterministic:

The stampede test is timing-independent by construction. Spawn eight tasks calling token_for() against a mock venue with a 50ms mint delay, and count handshakes. Every interleaving funnels through the same lock-and-recheck, so the assertion (mints == 1, all eight tokens identical) holds regardless of scheduling.
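A dependency-free analogue of that stampede test looks roughly like this. It is a sketch under assumptions: it uses OS threads and std::sync::Mutex in place of async tasks and an async mutex, it handles a single agent, and CountingVenue, Manager, and stampede() are hypothetical names. The 50 ms delay widens the window in which callers pile up, but the assertion does not depend on it.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical mock venue: counts handshakes and takes 50 ms per mint.
#[derive(Default)]
struct CountingVenue {
    mints: AtomicUsize,
}

impl CountingVenue {
    fn authenticate(&self) -> String {
        thread::sleep(Duration::from_millis(50));
        let n = self.mints.fetch_add(1, Ordering::SeqCst) + 1;
        format!("jwt-{n}")
    }
}

/// Single-agent, thread-based analogue of the token manager.
#[derive(Default)]
struct Manager {
    venue: CountingVenue,
    cache: Mutex<Option<String>>,
    mint_lock: Mutex<()>,
}

impl Manager {
    fn token_for(&self) -> String {
        // The cache guard is a temporary and is released at the end of this statement.
        let seen = self.cache.lock().unwrap().clone();
        match seen {
            Some(token) => token,
            None => self.mint(None),
        }
    }

    fn mint(&self, prior: Option<String>) -> String {
        let _guard = self.mint_lock.lock().unwrap();
        // Re-check under the lock: did another flight refresh while we waited?
        let current = self.cache.lock().unwrap().clone();
        if let Some(token) = current {
            if prior.as_deref() != Some(token.as_str()) {
                return token;
            }
        }
        let token = self.venue.authenticate(); // the one real handshake
        *self.cache.lock().unwrap() = Some(token.clone());
        token
    }
}

/// Spawns `callers` threads against a cold cache; returns (handshakes, tokens).
fn stampede(callers: usize) -> (usize, Vec<String>) {
    let mgr = Arc::new(Manager::default());
    let handles: Vec<_> = (0..callers)
        .map(|_| {
            let mgr = Arc::clone(&mgr);
            thread::spawn(move || mgr.token_for())
        })
        .collect();
    let tokens: Vec<String> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    (mgr.venue.mints.load(Ordering::SeqCst), tokens)
}
```

Whichever thread wins the mint lock performs the handshake. Every other thread either queues on the lock and finds a token that differs from its None witness, or arrives late and reads the cache directly. Both paths return the same token without touching the venue.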

The witness logic is tested without concurrency at all. Since the coalescing rule is just “compare prior to the cache under the lock,” you can drive it sequentially: mint with witness t0, then call again with the same stale witness t0 — it must coalesce onto the replacement instead of minting again. No sleeps, no spawns, no luck.

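// Setup (not shown): `mgr` is the manager, `venue` the mock venue, and
// `tokens` holds tokens already returned after one earlier handshake.
let t0 = tokens.pop().unwrap();
let t1 = mgr.mint("dev_1", Some(t0.clone())).await.unwrap();  // witness matches cache: re-mints (2nd handshake)
let t2 = mgr.mint("dev_1", Some(t0)).await.unwrap();          // stale witness: coalesces onto t1
assert_eq!(t1, t2);
assert_eq!(venue.mints(), 2, "second flight must coalesce");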

Glossary

Cache stampede / thundering herd — N concurrent callers observe a cold (or expired) cache entry and all independently do the expensive fill. Usually wastes work; in our case the venue’s nonce-uniqueness turned it into a hard failure.
Single-flight — a coalescing pattern: among concurrent requests for the same key, exactly one performs the work; the rest wait and share its result.
Double-checked locking — check, take the lock, check again. The second check is the whole point: the world may have changed while you waited for the lock.
Witness (prior) — the value a caller observed when deciding to act, carried into the critical section. Comparing it against current state distinguishes “someone already did my job” from “my job is still pending” — and, here, “fresh token” from “the same token I just watched fail.”
TOCTOU (time-of-check to time-of-use) — a race where state changes between checking a condition and acting on it. The witness comparison under the lock closes ours.
Nonce — a number used once. The venue’s auth nonce is the current epoch millisecond, which makes same-millisecond concurrent mints inherently colliding — the venue-specific reason the stampede failed loudly instead of silently.
Per-key locking — one lock per entity (here, per agent) instead of one global lock, so coalescing within an agent never serializes across agents.

The takeaway

The stampede itself is textbook. Two details are worth carrying to your own systems:

A failure-on-duplicate dependency turns silent waste into a visible bug. Our venue’s nonce rule made the stampede log a warning on day one instead of quietly doubling auth traffic for months. If you’re building an API, uniqueness checks do your clients this favor; if you’re consuming one, treat such 401s as a signal to go look for concurrency, not just clock skew.
Single-flight is easy; force-refresh is where it gets interesting. “Return the cached value if present” is the wrong re-check the moment any caller exists whose problem is the cached value. Carry a witness, compare under the lock, and both kinds of caller get exactly the semantics they need from one rule.