I work on AI token streaming at Ably.
Mental model
Most people’s mental model of AI applications is stuck in November 2022. It’s a mental model of a chat window, a back-and-forth conversation, and a clever-seeming response from an LLM. That mental model is now two generations out of date.
Conversational
The conversational generation of AI applications came first. ChatGPT launched in November 2022, and through the first half of 2023 the chat product category evolved. In early 2024 Google Gemini joined the race, and the Claude 3 family of models launched. These products are all part of the conversational generation of AI applications, and it’s this generation that still matches most people’s mental models. The core interaction of a conversational app is a text box at the bottom of the screen: you type a question or instruction, and the AI replies in the same window, in prose. This is also the design of most AI code libraries’ examples. It’s the design that uses HTTP request/response and SSE-streamed responses, and it fits well into companies’ existing technologies and architectures. This mental model is closer to instant messaging than anything else, which is why some of the first areas of disruption were the ones where users were already interacting with a chat-box: customer support and search.
In the conversational generation of AI applications, there’s no sense that the AI is doing anything for you. You are consulting the AI and it’s responding to you: answering your questions, asking you questions. Most people’s workflows relied on copy-pasting information in and out of the conversation. The AI’s response is essentially the whole product in this first generation of AI applications.
Delegative
The next generation is delegative: you delegate a task to the AI and it takes actions to fulfil that task. In summer 2024 AI applications got the ability to call tools, and operating external systems became table stakes. There were many iterations of this through 2023 and into early 2024 with ChatGPT plugins, GPT Actions, and Devin, but through 2024 we got the ability to build Artifacts (side-panel rendering of code, documents, diagrams), use Tools (being able to operate external systems), and connect over MCP (now the de facto standard for connecting AI to external systems). These product advancements changed AI products from conversational to delegative. It started to feel like the AI was actually doing something: pulling in context from external systems over MCP, and rendering documents that you could download and not just copy-paste. We even saw the release of Computer use in late 2024, with Claude able to navigate a screen and take actions.
The big step for delegative AI applications came in 2025, when the human relationship to the work shifted. AI products started to become agentic systems that humans delegate tasks to. Claude Code launched in early 2025, and in the Claude 4 generation of models tool use became consistently good across long sessions. The models could reliably use and execute tools without going off the rails. This is the big shift from conversational to delegative, and it’s incredibly subtle. In the conversational era, humans were consulting the models to augment and improve their own work. Humans were still the executor, and the model was the assistant (you still see the ‘assistant’ naming in AI model APIs). In the delegative era, humans are now the supervisors who delegate work to agents. The agents take the actions based on the instructions or goal set by the human. The unit of work, and of value, shifts from a prompt and response to a task that the agent is fulfilling. This is how we see delegative applications today.
But delegative applications are also a lot harder to build than the original request-response architectures that engineers built for the conversational generation of AI applications. There are now long-running processes operating agentic loops, making tool calls, and performing multiple tasks over multiple turns. Each turn in the agentic loop can be quite expensive, so engineers and architects started looking for mechanisms to make this agentic execution durable. Temporal and other durable execution frameworks started to take off, as they make the agentic loop durable, simplify the stateful aspect of the execution, automatically retry, and snapshot expensive computation or lookups.
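To make the retry-and-snapshot idea concrete, here is a minimal sketch of a durable step. It is not Temporal’s API or any real framework’s: `CheckpointStore` and `durable_step` are hypothetical names, and the in-memory dictionary stands in for what would be a database in practice. The point is only that a step’s result is recorded once, so a restarted agent loop skips work it has already paid for.

```python
import json
from typing import Any, Callable, Dict


class CheckpointStore:
    """Hypothetical stand-in for a durable store (a database in practice)."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}

    def has(self, key: str) -> bool:
        return key in self._results

    def get(self, key: str) -> Any:
        return json.loads(self._results[key])

    def put(self, key: str, value: Any) -> None:
        # Serialise so that only JSON-safe snapshots are stored.
        self._results[key] = json.dumps(value)


def durable_step(
    store: CheckpointStore,
    run_id: str,
    step_name: str,
    fn: Callable[[], Any],
    max_attempts: int = 3,
) -> Any:
    """Run fn at most once per (run_id, step_name), retrying on failure."""
    key = f"{run_id}:{step_name}"
    if store.has(key):
        # Already completed in an earlier attempt or process: reuse the snapshot.
        return store.get(key)

    last_error: Exception | None = None
    for _ in range(max_attempts):
        try:
            result = fn()
        except Exception as exc:  # a real framework would filter retryable errors
            last_error = exc
            continue
        store.put(key, result)
        return result

    raise RuntimeError(
        f"step {step_name!r} failed after {max_attempts} attempts"
    ) from last_error
```

A tool call that fails once and then succeeds is retried transparently, and calling the same step again for the same run returns the stored result without invoking the tool.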
Durable execution frameworks help with the computation, but they don’t help with the transport of the AI-generated responses. As the agents become asynchronous long-running processes, managing the connection between the agent and the client or human who made the request becomes a nightmare. The original HTTP request-response model doesn’t scale well to long-running async processes. A dropped connection becomes a real pain to recover from. Engineers resort to storing all the AI-generated response fragments in a database, adding sorting and ordering keys, and trying to build resumable SSE streams over those responses. These are infrastructure problems, often completely adjacent and unrelated to the AI applications that these engineers are actually trying to build.
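The workaround described above usually looks something like the following sketch. `FragmentLog`, `Fragment`, and `to_sse` are hypothetical names, and the list-in-a-dictionary is a stand-in for a database table keyed by response and sequence number. The one piece that is standard behaviour is the SSE `id:` field: a browser `EventSource` sends the last `id` it saw back as the `Last-Event-ID` header when it reconnects, and that is what the server uses to decide where to resume.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Fragment:
    seq: int   # ordering key, increasing by one per fragment of a response
    text: str


class FragmentLog:
    """Hypothetical in-memory stand-in for a fragments table in a database."""

    def __init__(self) -> None:
        self._by_response: Dict[str, List[Fragment]] = {}

    def append(self, response_id: str, text: str) -> Fragment:
        fragments = self._by_response.setdefault(response_id, [])
        fragment = Fragment(seq=len(fragments) + 1, text=text)
        fragments.append(fragment)
        return fragment

    def read_after(self, response_id: str, last_seq: int = 0) -> List[Fragment]:
        """Return the fragments a reconnecting client has not seen yet, in order."""
        fragments = self._by_response.get(response_id, [])
        return [f for f in fragments if f.seq > last_seq]


def to_sse(fragment: Fragment) -> str:
    """Format a fragment as one SSE event, using seq as the event id."""
    # Multi-line payloads need one "data:" line per line of text.
    data_lines = "".join(f"data: {line}\n" for line in fragment.text.split("\n"))
    return f"id: {fragment.seq}\n{data_lines}\n"
```

On reconnect, the handler would parse `Last-Event-ID` into an integer and replay `read_after(response_id, last_seq)` before tailing new fragments. Even this toy version shows how much plumbing sits between the model and the screen, and it still ignores multiple devices, expiry, and the race between replaying and tailing.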
Collaborative
The next generation of AI applications that we are starting to see is collaborative. Claude Design is the first good example of this collaborative generation of AI applications. Back in 2024, when Anthropic released Artifacts (documents, diagrams, and code in a side-panel) and OpenAI released Canvas (their equivalent), we started to see the seeds of collaborative experiences. The idea was that the output of the model shouldn’t be stuck in a scrolling chat history; it should be elevated to something the conversational and delegative experiences could work on. Claude Design is the cleanest example of what makes the collaborative experience different from the previous generations. You open Claude Design, describe what you want, and Claude renders the draft directly into a workspace that you can actually edit. You can change colours, text, sizes. You can play with different design ideas through tweaks. The delegative and conversational generations of AI applications were still based almost entirely on a chat interface. But the Claude Design interface gives you many more input parameters beyond chat, allowing you to actually change elements of the design rather than just describing the changes that you want in a chat-box.
Claude Design’s different input and collaboration modes (through text, tweaks, and direct edits) are brilliant, but the next move is for the surface you’re interacting with to become dynamic itself. Claude Design as it exists today is still a workspace with a fixed shape, but with the seeds of collaborative controls that go beyond a chat interface. Next come generative UIs, where the interfaces and the controls that let you collaborate don’t exist until you ask for them. We’re starting to see this in MCP Apps for embedded interactive surfaces. The chat-box stops being where the work happens and starts being where the work gets requested, with the actual interface assembled in real time around the task that you and the AI then collaborate on.
Engineering challenges
So far, I’ve hinted at how software engineers are working to solve the challenges presented in each of the generations of AI applications. The new generative interfaces are a genuine engineering nightmare under traditional web architecture. The problems that existed for long-running agentic applications in the delegative era compound now that it’s not just tool use and responses that need to stream to the UI; it’s the actual UI that needs to stream. Scaling existing HTTP long-polling, SSE, or request-response models on stateless servers connected to stateful caches and databases becomes an actual issue. None of the AI libraries are even close to a solution for this.
Engineering teams have spent the best part of 15 years optimising stacks for low-latency request-response and stateless horizontal scaling. The new mode wants the opposite: persistent connections, server-pushed state, long-running compute, and a kind of client-server session affinity that load balancers were specifically built to avoid. Building this on top of an HTTP-based REST API is possible but painful. Building native support requires rethinking the architecture from the connection layer up.
The developer mindshare is actually the harder problem here. Too many engineers, designers, product managers, and executives still have a mental model of AI shaped entirely by ChatGPT in late 2022. They think the interface is the chat box and the output is LLM-generated text. It’s taking a while for the builders in the industry to catch up and get real hands-on experience of the problems of building delegative and collaborative AI applications. For too many, the problems don’t exist, or aren’t problems, until they have run into them directly. You see these engineers diving into the HN comments with “can’t you just use X” or “what about Y”. But the best engineering teams are already tackling these engineering challenges. And at Ably we’re already working with them.
I work on Ably’s AI Transport product. It started as a pub/sub replacement for token streaming, with built-in token compaction, conversation history, and rewind support. It originated as a product for the delegative experiences where engineering teams were starting to build long-running async agents, and where traditional HTTP streaming tied an AI session to a single fragile connection. AI Transport decouples that connection from the long-running process by providing a shared, realtime, durable pub/sub transport to both the agent and the client. The same transport allows bi-directional messaging, so users can steer and prompt the agentic loop.
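The shape of that decoupling can be shown with a toy channel. To be clear, this is not the Ably SDK or AI Transport’s API: `Channel`, `publish`, `subscribe`, and the `rewind` parameter here are hypothetical, in-memory illustrations of the pattern. The agent publishes without knowing whether anyone is connected, a client that attaches late replays what it missed, and the client can publish back on the same channel to steer the agent.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (name, data)


class Channel:
    """Hypothetical in-memory pub/sub channel with history. Not the Ably SDK."""

    def __init__(self) -> None:
        self._history: List[Message] = []
        self._subscribers: List[Callable[[Message], None]] = []

    def publish(self, name: str, data: str) -> None:
        message = (name, data)
        # History is kept regardless of who is currently connected.
        self._history.append(message)
        for callback in list(self._subscribers):
            callback(message)

    def subscribe(self, callback: Callable[[Message], None], rewind: int = 0) -> None:
        # Replay up to `rewind` past messages before delivering live ones.
        if rewind > 0:
            for message in self._history[-rewind:]:
                callback(message)
        self._subscribers.append(callback)


channel = Channel()

# The agent listens only for steering messages from the human.
agent_inbox: List[Message] = []
channel.subscribe(lambda m: agent_inbox.append(m) if m[0] == "steer" else None)

# The agent streams tokens before any client has connected.
channel.publish("token", "Hello")
channel.publish("token", " world")

# A client connects late (or reconnects) and rewinds to catch up.
client_seen: List[Message] = []
channel.subscribe(client_seen.append, rewind=10)

# The same channel carries the human's steering message back to the agent.
channel.publish("steer", "be more concise")
```

Neither side holds a reference to the other’s connection; both only know the channel. That is the property a single HTTP stream cannot give you.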
Before I worked on the AI Transport product, I worked on the LiveObjects product. LiveObjects is a realtime collaborative state product, also built on top of pub/sub channels. It’s a set of CRDT datatypes that persist state on a pub/sub channel and allow multiple clients to collaborate on that state. Changes to the state are fanned out to all collaborating parties in realtime.
The two hardest problems to solve when building delegative AI applications, or AI applications with generative UI, are a durable connection/session between the agent and client, and live shared persistent state that the agent and human can collaborate on. I’ve worked directly on both of those products at Ably.
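For readers who haven’t met CRDTs, here is a minimal sketch of one common datatype, a last-writer-wins map. It is a textbook illustration under my own assumptions, not LiveObjects’ implementation or API: `LWWMap` is a hypothetical class, timestamps are plain integers supplied by the caller, and ties are broken by comparing client ids. What it demonstrates is the key property: two replicas that apply the same writes in different orders converge on the same state.

```python
from typing import Any, Dict, Tuple

Entry = Tuple[int, str, Any]  # (timestamp, client_id, value)


class LWWMap:
    """Hypothetical last-writer-wins map; an illustration, not LiveObjects."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def set(self, key: str, value: Any, timestamp: int, client_id: str) -> None:
        current = self._entries.get(key)
        # The later timestamp wins; client_id breaks ties deterministically.
        if current is None or (timestamp, client_id) > current[:2]:
            self._entries[key] = (timestamp, client_id, value)

    def merge(self, other: "LWWMap") -> None:
        """Fold another replica's entries into this one."""
        for key, (timestamp, client_id, value) in other._entries.items():
            self.set(key, value, timestamp, client_id)

    def to_dict(self) -> Dict[str, Any]:
        return {key: entry[2] for key, entry in self._entries.items()}


# A human and an agent edit their own replicas concurrently.
human, agent = LWWMap(), LWWMap()
human.set("title", "Draft", timestamp=1, client_id="human")
agent.set("title", "Launch plan", timestamp=2, client_id="agent")
agent.set("colour", "blue", timestamp=3, client_id="agent")
human.set("colour", "green", timestamp=3, client_id="human")

# Exchanging state in either order leaves both replicas identical.
human.merge(agent)
agent.merge(human)
```

After the exchange both replicas hold the later title, and the tied `colour` write is resolved the same way on each side, so nobody needs a central lock or a "who saved last" dialog.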
Both products make it incredibly easy to build the new generation of AI applications. AI Transport gives you a durable connection and session between the agent and client, so you don’t have to worry about dropped connections, or trying to build a Frankenstein solution on top of HTTP. LiveObjects gives you a set of CRDT datatypes that allow you to build live shared state that the agent and human can collaborate on, without having to try and force a collaborative, generative, dynamic application into a stateless REST API.
The next generation of AI applications is coming, and I feel for the engineering teams who are still trying to shoehorn those applications into a system design that just doesn’t work for them.
Sources:
- https://jakobnielsenphd.substack.com/p/2026-predictions
- https://steadman.ai/newsletters/david/ai-usage-spectrum.html
- Anthropic / OpenAI / Google blog posts and release notes (release notes pages for each product).