AI token streaming isn't about SSE vs WebSockets

At Ably, we’ve solved production token streaming, so you don’t have to. And the hard part isn’t SSE or WebSockets.

Ask an agentic coding tool or chatbot “how to stream AI tokens to a client in production” and it’ll devote a section of its answer to SSE vs WebSockets. But that’s not the real question, or the real answer.

In a pure comparison of SSE and WebSockets as the transport, SSE is the simpler choice, and the better choice for most use cases. The architecture you should build for production token streaming looks like the diagram below. It separates the ‘prompt’ request from the ‘response’ stream, and it adds a token cache/data store that holds the tokens so that clients can resume and reconnect.

fig. 01

ChatGPT’s design

flowchart LR
    H[human] -->|1. prompt request| S[POST /messages] --> L[llm]:::accent
    L -.->|sse tokens| S
    DB[(token cache)]
    S -.->|store tokens| DB
    H -.->|2. stream response|SS
    SS[GET /streams/:id] 
    SS -.->|read tokens| DB
    classDef accent stroke:#7eb8a4,stroke-width:1.5px

The LLM response tokens are threaded through some datastore. The request and response are separated.

The WebSockets version looks almost exactly the same, except the client opens a WebSocket connection and sends the prompt as a message, and the server responds with token messages on the same connection. The token cache/data store is still needed for resume and reconnection.

fig. 02

WebSocket design

flowchart LR
    H[human] -->|open websocket| S[Server] --> L[llm]:::accent
    L -.->|sse tokens| S
    DB[(token cache)]
    S -.->|store tokens| DB
    S -.->|read tokens| DB
    S -.->|stream tokens on websocket| H
    classDef accent stroke:#7eb8a4,stroke-width:1.5px

The same as the SSE design, but with a WebSocket connection.

Why this works with SSE

Most people’s system design is based around the idea that servers are stateless, and all the state is stored in a database. This lets the servers scale horizontally, because any server can handle any request. There’s generally a load balancer in front of the servers that routes requests to them. Usually that load balancer spreads the load across servers, rather than using any kind of sticky session or session affinity.

Turns out, SSE drops into this architecture really nicely. The client makes a POST request to the server with a prompt, and gets a stream ID back. The client then connects to that stream and gets the token streaming response. Any server can handle the original prompt request, and any server can handle the stream response because tokens are threaded through the database.
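
The split described above can be sketched in a few lines. This is a minimal illustration, not a production server: the `Map` stands in for a shared database or cache, and the names `handlePostMessage`, `appendToken`, and `handleGetStream` are hypothetical. The point is that the two handlers share nothing except the store, which is why any server behind the load balancer can run either one.

```javascript
// Stand-in for a shared database or cache. In production every server
// would talk to the same external store rather than local memory.
const tokenStore = new Map(); // streamId -> array of { position, delta }
let nextStreamId = 1;

// Handler for POST /messages: registers a stream and returns its ID.
function handlePostMessage(prompt) {
  const streamId = `stream-${nextStreamId++}`;
  tokenStore.set(streamId, []);
  // A real server would start the LLM call for `prompt` here and call
  // appendToken as each token arrives.
  return { streamId };
}

// Called by whichever server is consuming the LLM output.
function appendToken(streamId, delta) {
  const tokens = tokenStore.get(streamId);
  tokens.push({ position: tokens.length + 1, delta });
}

// Handler for GET /streams/:id: reads only from the store.
function handleGetStream(streamId) {
  const tokens = tokenStore.get(streamId);
  if (!tokens) throw new Error(`unknown stream: ${streamId}`);
  return tokens.map((t) => t.delta);
}

const { streamId } = handlePostMessage("say hello");
appendToken(streamId, "hello");
appendToken(streamId, " world");
console.log(handleGetStream(streamId).join("")); // prints "hello world"
```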

The WebSocket version is basically the same, except the client opens a WebSocket connection and sends the prompt as a message on the connection. The response tokens are sent back on the same connection. The connection is longer-lived, and you have to build your own message/request framing layer to define the shape of the messages sent back and forth on the WebSocket. So WebSockets are more complex to build and maintain, and don’t really add any value in this architecture.
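
To make the “framing layer” concrete, here is a rough sketch of the kind of envelope you end up defining yourself on a WebSocket. Everything in it is an assumption for illustration: the set of message types, the `requestId` field, and the helper names `encodeFrame` and `decodeFrame` are made up, not part of any WebSocket standard. With SSE plus plain HTTP requests, the HTTP method, URL, and SSE event fields already do this job.

```javascript
// Hypothetical envelope: every WebSocket message is a JSON object with a
// type, a request ID to correlate responses with prompts, and a payload.
const MESSAGE_TYPES = new Set(["prompt", "text-delta", "done", "error"]);

function encodeFrame(type, requestId, payload) {
  if (!MESSAGE_TYPES.has(type)) throw new Error(`unknown frame type: ${type}`);
  return JSON.stringify({ type, requestId, payload });
}

function decodeFrame(raw) {
  const frame = JSON.parse(raw);
  if (!MESSAGE_TYPES.has(frame.type)) {
    throw new Error(`unknown frame type: ${frame.type}`);
  }
  if (typeof frame.requestId !== "string") {
    throw new Error("missing requestId");
  }
  return frame;
}

// The string in `wire` is what would be passed to the socket's send().
const wire = encodeFrame("text-delta", "req-1", { delta: "hello", position: 1 });
const frame = decodeFrame(wire);
console.log(frame.type, frame.payload.delta); // prints "text-delta hello"
```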

So it seems like an obvious choice to go with SSE for streaming your response tokens.

What other features do you need in a production deployment?

So far we’ve got a design for streaming tokens to the client in the ‘happy path’, where nothing goes wrong. The architecture can support the failure cases and recovery from them, but we haven’t built the features that do so yet. So what else do you need to do to make those features possible?

Reconnection, resume, and recovery

The client needs to be able to reconnect to the stream if the connection drops, and resume from the last token that was sent to the client. This is where the token cache/data store comes in. The server needs to store the tokens so that a reconnecting client can pick up where it was before. But the client also needs to be able to indicate where it got to before the disconnect, and that needs to correlate with some position on the server.

{
  "type": "text-delta",
  "delta": "hello",
  "position": 1
}
{
  "type": "text-delta",
  "delta": " world",
  "position": 2
}
{
  "type": "text-delta",
  "delta": "!",
  "position": 3
}

Every token needs a sequence ID in the stream, so that when the client reconnects it can say “the last token I got was at position 2”, and the server can then send tokens starting at position 3. The SSE spec has a Last-Event-ID header built in for this purpose, but you still need to build the plumbing to support it on the server and client side.
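
A minimal sketch of that plumbing on the server side, assuming tokens are stored in the shape shown above. The helper names `formatSseEvent` and `tokensAfter` are hypothetical. The `id:` field is the part that matters: a browser `EventSource` remembers the last `id` it received and sends it back in the `Last-Event-ID` request header when it reconnects.

```javascript
// Serialise one token as an SSE event. The id field carries the position
// so the client can report it back on reconnect.
function formatSseEvent(token) {
  return `id: ${token.position}\ndata: ${JSON.stringify(token)}\n\n`;
}

// Return the tokens a reconnecting client has not seen yet.
// lastEventId is the raw header value (a string), or undefined on the
// first connection, in which case the whole stream is returned.
function tokensAfter(tokens, lastEventId) {
  const parsed = Number.parseInt(lastEventId ?? "0", 10);
  const from = Number.isNaN(parsed) ? 0 : parsed;
  return tokens.filter((t) => t.position > from);
}

const stored = [
  { type: "text-delta", delta: "hello", position: 1 },
  { type: "text-delta", delta: " world", position: 2 },
  { type: "text-delta", delta: "!", position: 3 },
];

// Client reconnects with "Last-Event-ID: 2", so only position 3 is sent.
const resumed = tokensAfter(stored, "2");
console.log(resumed.map(formatSseEvent).join(""));
```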

Detecting dropped connections

The server and client need to be able to detect when the connection has dropped, so that the client can attempt to reconnect. To do this, the server has to send a heartbeat message every so often (e.g. every 10 seconds) to keep the connection alive and to detect if the client has dropped. The client also needs to have a timeout for how long it will wait for a message before it considers the connection dropped and attempts to reconnect.
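
Here is one way the heartbeat and the client-side timeout could fit together. The 25 second client timeout, the `heartbeat` event name, and the `ConnectionWatchdog` class are all assumptions for this sketch. The heartbeat is sent as a named event with a `data` line so that it is delivered to client listeners. Time is passed in explicitly so the logic is easy to test; a real client would use `Date.now()` and a timer.

```javascript
const HEARTBEAT_INTERVAL_MS = 10_000; // server sends a heartbeat this often
const CLIENT_TIMEOUT_MS = 25_000; // assumed: two missed heartbeats plus slack

// Server side: the bytes written to the SSE response every interval.
function heartbeatEvent() {
  return "event: heartbeat\ndata: {}\n\n";
}

// Client side: tracks when the last message of any kind arrived.
class ConnectionWatchdog {
  constructor(timeoutMs, now) {
    this.timeoutMs = timeoutMs;
    this.lastSeen = now;
  }

  // Call for every message received, whether heartbeat or token.
  touch(now) {
    this.lastSeen = now;
  }

  // True once the connection has been silent for longer than the timeout,
  // at which point the client should close it and reconnect.
  isDropped(now) {
    return now - this.lastSeen > this.timeoutMs;
  }
}

const watchdog = new ConnectionWatchdog(CLIENT_TIMEOUT_MS, 0);
watchdog.touch(10_000); // a heartbeat arrives at t = 10s
console.log(watchdog.isDropped(30_000)); // false: only 20s of silence
console.log(watchdog.isDropped(40_000)); // true: 30s of silence
```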

Handle cancellations

The client needs to be able to cancel the stream if the user decides they don’t want to wait for the response anymore. Lots of chatbot interfaces have a stop button that will interrupt the streaming of the current response. The client can make a cancel request to the server, for the stream ID that they want to cancel. The server then needs to stop any response generation associated with that stream.

The easy way to do this is to put the cancel signal in the token cache, so that readers of the token cache know that the stream is cancelled. But writers to the token cache now also need to read from it before each write, to check whether there’s a new cancel signal in there. This becomes pretty awkward if the token cache is actually some kind of queue, rather than a simple key-value store or relational database.
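
A sketch of that read-before-write check, using an in-memory `Map` as a stand-in for a key-value token cache. The function names are hypothetical, and a real store would need the check and the write to happen atomically (for example in a transaction), which this single-process sketch gets for free.

```javascript
const cache = new Map(); // streamId -> { cancelled, tokens }

function openStream(streamId) {
  cache.set(streamId, { cancelled: false, tokens: [] });
}

// Handler for the client's cancel request.
function cancelStream(streamId) {
  const stream = cache.get(streamId);
  if (stream) stream.cancelled = true;
}

// The writer checks the cancel flag before every write. It returns false
// when the stream is cancelled, so the caller knows to abort the LLM call.
function writeToken(streamId, delta) {
  const stream = cache.get(streamId);
  if (!stream || stream.cancelled) return false;
  stream.tokens.push(delta);
  return true;
}

openStream("s1");
writeToken("s1", "hello");
cancelStream("s1"); // the user presses the stop button
const accepted = writeToken("s1", " world");
console.log(accepted, cache.get("s1").tokens); // false [ 'hello' ]
```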

Token rollup and compaction

There are two places where token rollup matters: on reconnect, and on completed response.

The completed response case is the easiest. Once the server has finished streaming the response, those tokens need to be compacted into the full response and stored somewhere as that full response. Later ‘history’ requests for earlier messages in the chatbot conversation should serve the compacted full response and not stream that response token-by-token.

The harder case is on reconnect. If the client reconnects and has missed 10, 100, or 1000 tokens, those tokens need to be rolled up into a compacted response that can be sent to the client in one message. There’s no point forcing the client to consume those tokens one by one just because that’s how they are stored in the token cache. Again, this becomes harder if that cache is actually some kind of queue, because you won’t always know how many tokens are in the queue for a given stream.
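
Both cases come down to the same operation: concatenating a run of deltas into one message. A minimal sketch, assuming tokens are stored in position order in the shape used earlier; `rollUp` is a hypothetical helper name. Calling it with position 0 compacts the whole response, which covers the completed-response case.

```javascript
// Roll every token after lastSeenPosition into a single delta. The result
// keeps the position of the last token it covers, so the client's resume
// point stays correct. Returns null if nothing was missed.
function rollUp(tokens, lastSeenPosition) {
  const missed = tokens.filter((t) => t.position > lastSeenPosition);
  if (missed.length === 0) return null;
  return {
    type: "text-delta",
    delta: missed.map((t) => t.delta).join(""),
    position: missed[missed.length - 1].position,
  };
}

const tokens = [
  { type: "text-delta", delta: "hello", position: 1 },
  { type: "text-delta", delta: " world", position: 2 },
  { type: "text-delta", delta: "!", position: 3 },
];

console.log(rollUp(tokens, 1)); // one message with delta " world!" at position 3
console.log(rollUp(tokens, 0).delta); // prints "hello world!"
```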

Multi-user or multi-device

The human user needs to be able to pick up on another device, or another user needs to be able to join the existing chat. Multi-device is actually the simpler case here, because there is still a single human operating that chat conversation. When they switch to a new device, that device can fetch the history, and it knows to connect to a new response stream because it is the one that made the new prompt request.

The multi-user case is much harder. If you’ve got two users both connected to a chat conversation, and user A makes a prompt request in that chat, user A will know they need to connect to the stream for that prompt request and receive the response tokens. But user B, who is also in that chat, won’t know to connect to that stream, and won’t get the response tokens. You could solve this by making the conversation and the stream response the same entity. That is, when the users first connect to the chat they are actually connecting to the response stream for that chat, and all streamed responses to any prompt requests submitted in that chat are streamed over the chat-wide response stream. This gives both users access to the responses, but not the requests. To solve the requests, you now need to send prompt requests for the chat over the same stream too, so that new prompts are shared with the other users in the same chat.
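
As a rough, single-process illustration of the “conversation is the stream” idea, the sketch below keeps one ordered log per conversation and fans every event, prompt or token, out to every subscriber. The `ConversationStream` class and the event shapes are assumptions made for this example. A real deployment would need the log and the fan-out to work across servers, which is exactly the hard part.

```javascript
class ConversationStream {
  constructor() {
    this.events = []; // ordered log of prompts and response tokens
    this.subscribers = new Set();
  }

  // New subscribers first receive the history, then live events.
  subscribe(onEvent) {
    for (const event of this.events) onEvent(event);
    this.subscribers.add(onEvent);
    return () => this.subscribers.delete(onEvent); // unsubscribe function
  }

  // Prompts and tokens share one position sequence for the conversation.
  publish(event) {
    const stamped = { ...event, position: this.events.length + 1 };
    this.events.push(stamped);
    for (const notify of this.subscribers) notify(stamped);
    return stamped;
  }
}

const conversation = new ConversationStream();
const seenByA = [];
const seenByB = [];
conversation.subscribe((e) => seenByA.push(e));
conversation.subscribe((e) => seenByB.push(e));

// User A sends a prompt; user B sees both the prompt and the response.
conversation.publish({ type: "prompt", from: "user-a", text: "say hello" });
conversation.publish({ type: "text-delta", delta: "hello" });
console.log(seenByB.map((e) => e.type)); // [ 'prompt', 'text-delta' ]
```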

There’s a massive gap between demos and production

When you look at the architecture that you need to build, and the features you need to support for a production deployment of token streaming, you can see that there’s a massive gap between the simple demos you see on GitHub and in blog posts, and what you actually need to do.

Most folks fall into one of two camps: either a) they don’t know about this gap because they’ve not built a production deployment of token streaming before, or b) they know about the gap but they underestimate how much work it is to fill it.

I know and care about this gap because my day job is solving AI token streaming problems. We’ve solved these problems with a pub/sub product that automatically supports multi-user, multi-device, reconnect, resume, cancellation, token rollup, history, and token compaction. It’s a product based on our realtime pub/sub system, and all you have to do is:

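import * as Ably from 'ably'

// Assumes ABLY_API_KEY is set, and that conversationId, llm, and prompt
// are provided by your application.
const ably = new Ably.Realtime({ key: process.env.ABLY_API_KEY })

const channel = ably.channels.get(conversationId)

for await (const token of llm.generateTokens(prompt)) {
  // 'token' is an illustrative event name; publish takes (name, data)
  await channel.publish('token', token)
}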