I wrote about AI having ‘durable sessions’ to support async agentic applications, and in the comments everyone said:
“Token streaming over SSE is easy”.
…so I figured I’d dig into that claim.
Agents used to be a thing you talked to synchronously. Now they’re a thing that runs in the background while you work. When you make that change, the transport breaks.
But a lot of folks are saying: “No, you can just use Server-Sent Events (SSE) with Last-Event-ID to get a durable stream, it’s easy”. So let’s look at that.
How token streaming typically works over SSE
Typically, the client makes an HTTP POST request to the server with the prompt and the current conversation history. The server runs the agent and streams the response tokens back over SSE on that same held-open connection.
Once streaming is finished, the server stores the full response in the database. The ‘conversation history’ in the database is really there to re-hydrate the client on page refresh; it’s not used for anything on the server.
Apps end up with behaviour like in this gif: if you refresh the page, the in-progress token stream is lost, and the response only becomes available some time later, once the token stream has finished and the ‘full’ response is stored in the database.
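To make the server side of that concrete, here is a small sketch of the SSE wire format a server writes for each token event. The helper name and event name are hypothetical; the `id`/`event`/`data` framing and the blank-line terminator are what the SSE format itself specifies.

```typescript
// Hypothetical helper: format one SSE frame. Each frame carries an
// optional id, an event name, and a data payload, and is terminated
// by a blank line -- this is the wire format the server writes per token.
function sseFrame(id: string, event: string, data: unknown): string {
  return `id: ${id}\nevent: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// One frame per token delta:
const frame = sseFrame("1", "delta", { text: "Hello" });
// frame === 'id: 1\nevent: delta\ndata: {"text":"Hello"}\n\n'
```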

What even are tokens?
The responses you get back from LLM providers or AI SDKs have slightly different structures and formats, but pretty much all follow a similar pattern.
Some kind of ‘start’ event, some ‘delta’ events that contain text or tool call requests, and then some kind of ’end’ event.
To get the full response text, you either concatenate the text deltas together, or some of the APIs will give you the ‘full’ response as its own event type at the end.
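The client-side pattern can be sketched like this. The event names here are illustrative, not any one provider’s (Anthropic, for example, calls its deltas `content_block_delta`):

```typescript
// Illustrative event shapes -- each provider uses its own names and fields.
type StreamEvent =
  | { type: "start" }
  | { type: "delta"; text: string }
  | { type: "end"; fullText?: string };

// Fold a stream of events into the full response text.
function collectResponse(events: StreamEvent[]): string {
  let text = "";
  for (const e of events) {
    if (e.type === "delta") text += e.text;
    // Some APIs also hand you the complete text on the final event,
    // which supersedes the concatenated deltas.
    if (e.type === "end" && e.fullText !== undefined) return e.fullText;
  }
  return text;
}
```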
The Vercel AI SDK, the OpenAI Responses API, and the Anthropic API each use their own event names and framing, but all follow this same start/delta/end shape.
What you’ll notice is that each ‘line’ of these responses contains quite a lot of metadata for not very much text-delta data.
For example, a single event (line) from the Anthropic API contains 125 characters for just 5 characters of text delta.
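To make that concrete, here is an event modeled on the shape of Anthropic’s `content_block_delta` streaming events. The exact fields below are an assumption (check the official docs); the point is the metadata-to-payload ratio:

```typescript
// Modeled on the shape of an Anthropic streaming event -- the fields are
// illustrative, but the framing-to-payload ratio is the point.
const frame =
  "event: content_block_delta\n" +
  'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}';

const deltaText = "Hello";
const ratio = frame.length / deltaText.length;
// Well over 20 characters of framing and metadata per character of text.
```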
Why does this matter? Well, we will come back to that in a sec.
How do you make your SSE stream resumable?
Well, Last-Event-ID is the answer. The idea is that if every ’event’ in your server-sent events
stream has a unique ID, then the client can keep track of the last event it received, and when the
connection drops, it can reconnect and tell the server to resume from that last event.
Maybe you decide that each ‘response’ will have an ID, say abcxyz, and each token, in order, will have an index. So abcxyz:0 is the first token, abcxyz:1 is the second token, and so on.
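The resume path can be sketched like this, assuming every token of a response has been kept somewhere the handling replica can reach. The function name is hypothetical; the IDs follow the abcxyz:index scheme above:

```typescript
// Hypothetical resume handler: given all tokens streamed so far and the
// client's Last-Event-ID header, return only the events the client missed.
function resumeFrom(
  responseId: string,
  tokens: string[],          // all tokens streamed so far, in order
  lastEventId: string | null // the client's Last-Event-ID header, if any
): Array<{ id: string; text: string }> {
  // "abcxyz:7" -> the client has everything up to index 7; no header
  // means replay from the beginning.
  const lastIndex = lastEventId ? Number(lastEventId.split(":")[1]) : -1;
  return tokens
    .map((text, i) => ({ id: `${responseId}:${i}`, text }))
    .filter((_, i) => i > lastIndex);
}
```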
Most applications are built on an architecture like the one above, with a number of stateless, horizontally scalable server replicas that can handle client requests, and data stored in the database.
So all the tokens, from each LLM response, need to be stored in the database or in some token cache
in case the ‘resume’ request with the Last-Event-ID is routed to a different server replica than
the one that started the stream.
Now, you need to write every token to the database.
Remember that each token event has a lot of metadata for not that much text delta? Well, that means a lot of database writes for not much text.
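A toy illustration of that write amplification, with an array standing in for the real database:

```typescript
// Toy illustration of the write amplification: every delta event becomes
// a row, and all of those rows are dead weight the moment the full
// response lands. The 'db' here is just an array standing in for storage.
const db: Array<{ id: string; payload: string }> = [];

// A 400-token response means 400 rows of mostly metadata...
const deltas = Array.from({ length: 400 }, (_, i) => ({
  id: `abcxyz:${i}`,
  payload: JSON.stringify({ type: "delta", index: i, text: "token" }),
}));
for (const d of deltas) db.push(d);

// ...all superseded the moment the full response is written.
db.push({ id: "abcxyz:full", payload: "the full response" });
```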
The other problem is that you don’t know whether the client is going to disconnect or drop the connection, so you have to do all that database work on the off-chance that it might. And as soon as the LLM finishes the response, the individual tokens are useless, because the ‘full’ response supersedes them all.
Finally, cancellations are really hard to do with this architecture too. Maybe before you assumed
that a dropped connection from the client meant the server could cancel the request. But now, you
assume that the client will come back with a Last-Event-ID to resume the stream, so a dropped
connection doesn’t mean a cancellation. Instead, if you want to build cancellations, you need to
thread those through the database as well, as the replica that is handling the LLM response might
not be the replica that receives the cancellation request from the client. It’s all a mess of data
being routed through the database.
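A sketch of what threading cancellation through shared state looks like. A Map stands in for the database here, and both function names are hypothetical:

```typescript
// Cancellation has to go through shared state, because the replica
// running the agent may not be the one that receives the cancel request.
// A Map stands in for the database row/flag.
const cancelFlags = new Map<string, boolean>();

// Any replica can handle the client's cancel request:
function requestCancel(responseId: string): void {
  cancelFlags.set(responseId, true);
}

// The replica streaming the response has to keep checking the flag
// between tokens -- yet more database traffic per response:
function shouldContinue(responseId: string): boolean {
  return cancelFlags.get(responseId) !== true;
}
```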
How do you do multi-device and multi-user?
Other folks are saying: “Maybe I’m missing something, but once you’ve got durable state, don’t you get durable transport more or less ‘for free’ with SSE and Last-Event-ID?”
Well, the durable transport that I described before is much more than one client on one device: it supports multiple clients and multiple devices. So how do you do multi-device?
You’re already some of the way there, because you’re storing the individual tokens of the response in the database, so you can at least serve those to multiple devices: a second device can request the same conversation and receive the history, then the token stream for any in-flight responses.
But then the problem becomes:
If device A sends a new prompt and starts receiving a token stream response, how does device B know there’s a new prompt and response that it should render?
… well device B doesn’t know.
Folks would tell you that you need all your clients to poll the server for new data. In “Patterns for building realtime features” we discussed why polling sucks: it’s a trade-off between added latency if you poll infrequently and hammering your servers with traffic if you poll frequently. Neither is great.
Is it still easy?
I admit that all these problems are solvable, but I contest that they are ’easy’. They are janky and inefficient solutions to work around the fact that HTTP is just not a good transport for streaming LLM tokens and for building async agentic applications.
Stop reading here if you just wanted to see the problems with SSE token streaming. Because I’m going to talk about what I think is better, and that is probably too ‘commercial’ for some folks.
I work for Ably, and I’m building a dedicated transport for AI applications that supports token streaming really well. It’s based on the pub/sub pattern in the diagram above.
The key differences from HTTP based SSE streaming are:
- The pub/sub channel still exists even if a client drops the connection to it, so the server can keep publishing tokens, and those tokens are available to the client the moment it reconnects. This decouples the connection lifetime from the agent lifecycle.
- Multiple users or multiple devices can connect to the same pub/sub channel and get the exact same token stream. The channel makes sure that clients get the tokens in realtime, and the Ably SDKs automatically handle reconnection, rewind for missed tokens, and history.
- The channel automatically compacts the token-deltas into full responses, so clients who are catching up only get 1 message for each full response instead of streaming every single token.
- Cancellations, interrupts, and steering are easy, because the channel routes the interrupt published by the client straight to the server process that’s running the agent. No more routing cancellations or follow-ups through the database; the channel handles the routing for you automatically.
So yes, you can build all these things if you want. But I’m not convinced they are ’easy’ or that SSE over HTTP is a suitable transport for streaming tokens from LLMs and building async agentic applications.