From playwright to stage manager
MINDFUL AI DESIGN
The improv theater guide to AI design

The show that went off script
This summer, Google released a fascinating research prototype — a demo of a “neural” operating system that generates itself on the fly.
It looks comfortably familiar: an old-school desktop with windows, icons, menus, a mouse pointer.

But start clicking, and dream logic takes over.
Every click triggers the AI model to rebuild the interface in milliseconds. The OS continuously reinvents itself based on what it thinks you want next.
Open a folder. You might find meeting notes, a spreadsheet, and a presentation.
Close it and reopen it five seconds later. Some items remain, others have vanished, and entirely new items appear.

It’s technically brilliant. It’s completely unusable. Documents vanish between clicks. Context evaporates. Getting work done is impossible. When I tried it out, I had to get up and walk away from the keyboard.
To be clear: Google presented this as a demonstration of capabilities. They didn’t ship it as a polished product. But it reveals something profound about designing AI products: traditional design approaches break down when the interface generates itself.
Later, we’ll see how Google pivoted from this chaos to A2UI, a structured protocol that validates what stage managers know: you can’t rebuild the set from scratch every night.
The designer’s crisis
The Neural OS demo dramatizes the anxiety product teams feel in 2026. We’ve spent our careers building deterministic systems. Our Figma files are blueprints, or screenplays: when the user clicks X, show screen Y. We design for permanence, consistency, and repeatability.
But AI is probabilistic. It doesn’t follow a script; it improvises.
This is the designer’s new anxiety: “How do I design if I can’t control — or even predict — what will happen next?”
The answer isn’t to fight unpredictability.
It’s time to stop designing the performance, and start designing the stage.
Why improv
When people talk about AI’s unpredictability, they often reach for musical metaphors. An orchestra is too rigid: it assumes a conductor controls everything. Jazz captures the spontaneity, but it implies a closed loop between musicians.
Improv theater gets it right: structure drives spontaneity, and audience participation fuels the show. The AI-improv metaphor isn’t new — it’s been used to explain AI behavior. But how can we use improv’s lessons to design better AI products?

The Upright Citizens Brigade has delivered shows for more than 30 years. Every show is different, but audiences usually leave satisfied. That’s not because performers follow scripts, but because they share training, principles, and infrastructure. They learn to excel in unpredictable situations.
A quick primer on improv principles
Improvisers train on a handful of core principles, including:
Accepting offers: Everything is material. A weird noise from the audience, a stumbled line, an unexpected entrance — improvisers treat these as gifts to incorporate, not problems to fix.
“Yes, and”: Accept whatever your scene partner suggests (the “yes”) and build on it (the “and”).
If someone says “Doctor, my leg fell off,” you don’t say “That’s a silly premise.” You say “Hand me the stapler.” This keeps scenes moving, instead of stalling in negotiation.
Playing status: Improvisers adjust their relative status — dominant or submissive, high or low — based on what the scene needs. A character might beg one moment and command the next. Keith Johnstone, who pioneered the concept, observed that interesting scenes often involve status shifts and reversals. The skill isn’t holding one position; it’s reading the room and adjusting fluidly.
These principles don’t tell performers what to say. They create conditions where strong scenes become likely.
Your role transformation
This reframes the designer’s job.
You’re not a scriptwriter, specifying every word of the main act. You’re a stage manager — someone who architects the context for a great performance without controlling the action or dialogue.
You still have enormous influence. You design the stage. You choose the props. You establish the rules.
You just don’t write the script.
That’s not losing control. That’s focusing control where it matters most.

This theatrical lens maps to AI Pace Layers — different components change at different speeds. The improv metaphor provides language for what designers shape at each layer.
The industry is converging on this structure. The Linux Foundation’s Agentic AI Foundation — backed by Google, Anthropic, OpenAI, and Amazon — is standardizing shared props, stages, and rules at the protocol level. What was once design philosophy is becoming infrastructure.
Anatomy of the show
Let’s walk through what designers can shape, layer by layer.
The performance (what you don’t control)
Every night is different in improv. The same troupe takes the same stage, but the audience shouts “proctologist!” instead of “astronaut!” and the show changes.
This maps to AI sessions: each conversation unfolds differently based on user, context, and moment. You can’t predict it. You can’t script it. And that’s fine — because your job isn’t to control the performance. It’s to create the conditions where great performances become likely.
Neural OS gap: Google’s proof of concept had only the performance layer, with almost zero supporting infrastructure. Every click regenerated everything from nothing.

The props (what you provide)
Improv actors use and reference physical props — a chair becomes a throne, a pen becomes a sword. If a table’s available one night and just a couple of stools the next, those constraints shape the action that unfolds. Sometimes the audience seeds the action with imaginary props — “name a piece of furniture.” Props are building blocks that ground the abstract in something concrete.
Strong AI designers provide high-quality props in the design library: UI components, widgets, design tokens that AI can draw from to compose interfaces. (Props here are theatrical, not React properties.) A calendar date-picker is a prop. A citation card is a prop. These are the Lego bricks you provide so AI doesn’t reinvent everything from scratch.
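To make the idea concrete, here’s a minimal sketch of what such a prop catalog could look like in code. The component names and fields are hypothetical, not any particular design system’s API:

```typescript
// Hypothetical prop catalog: vetted components the AI may compose with,
// but whose internals it never rewrites.
type PropDefinition = {
  name: string;                   // stable identifier the AI references
  inputs: Record<string, string>; // allowed inputs and their expected types
  whenToUse: string;              // tells the model when this prop fits
};

const propCatalog: PropDefinition[] = [
  {
    name: "DatePicker",
    inputs: { value: "ISO-8601 date", min: "ISO-8601 date", max: "ISO-8601 date" },
    whenToUse: "The user must choose a single date.",
  },
  {
    name: "CitationCard",
    inputs: { title: "string", url: "string", snippet: "string" },
    whenToUse: "A claim needs grounding in a verifiable source.",
  },
];
```

The AI composes from the catalog; it never invents a new widget mid-scene.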
Example: Perplexity’s knowledge cards deliver specific types of responses in tailored interactive visualizations linked to verifiable sources. The cards appear consistently across sessions, making the performance cohesive and trustworthy. The AI chooses when and how to use them — but the prop itself is designed, tested, reliable.
Neural OS gap: No persistent components. Every interaction regenerates everything from nothing. The performance feels disconnected, unreliable, and unprofessional.
The stage (the architecture)
Theater design frames performance: acoustics shape what’s heard, sightlines shape what’s seen, intimacy shapes emotional range. The stage doesn’t script dialogue, but it lays out possibilities and constraints.
In AI design, the stage is your information architecture: how content is organized, how users navigate, what structure persists across sessions.
Example: A medical AI might organize symptoms into diagnostic taxonomies. The taxonomy doesn’t script the conversation; a patient can describe symptoms however they want. But the taxonomy shapes how humans interact with the system, and how the AI processes and responds. The structure is the stage.
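As a sketch, such a stage might look like this as data. The categories are illustrative, not a real diagnostic taxonomy:

```typescript
// Illustrative symptom taxonomy: the stage, not the script. A patient
// describes symptoms freely; the system maps that free text onto this
// persistent structure before the model responds.
type SymptomCategory = {
  id: string;
  label: string;
  children?: SymptomCategory[];
};

const taxonomy: SymptomCategory[] = [
  {
    id: "neuro",
    label: "Neurological",
    children: [
      { id: "headache", label: "Headache" },
      { id: "dizziness", label: "Dizziness" },
    ],
  },
  {
    id: "resp",
    label: "Respiratory",
    children: [{ id: "cough", label: "Cough" }],
  },
];
```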
Dan Klein, who teaches improv at Stanford, encourages students to notice what’s happening, and simply add a little bit. AI should do the same — use the context available (document structure, user history, current state) rather than generating in a vacuum.
Neural OS gap: No stable, evolving structure. The architecture is erased and regenerated with every click.
The rules (the boundaries)
Before improvisers take the stage, they internalize principles. “Yes, and” isn’t a script; it’s a boundary that enables experimentation. “Make your partner look good” doesn’t specify words to say; it guides you toward better dialogue.
In AI design, rules are the AI’s guardrails, system prompts, and governing principles: when to create an artifact, when to escalate to a human, how to handle uncertainty, what behaviors are off-limits.
Example: Claude’s Artifacts feature has explicit rules about when to create versus update, what artifact types exist, how to handle corrections. These rules enable improvisation within safe boundaries: the AI can decide what to write, but the rules govern how it presents the work.
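As a sketch, rules like these can be written down declaratively. The field names and thresholds below are invented for illustration; they are not Anthropic’s actual configuration:

```typescript
// Hypothetical behavioral rules: boundaries, not scripts.
const artifactRules = {
  createWhen: [
    "content is substantial (roughly 15+ lines)",
    "user is likely to edit, reuse, or export the content",
  ],
  updateWhen: [
    "user asks for a correction to an existing artifact",
    "the change preserves the artifact's identity and type",
  ],
  escalateWhen: ["the request involves irreversible or high-risk actions"],
  neverDo: ["silently discard a prior artifact version"],
};
```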
Where improv principles need modification
Not every improv maxim translates directly to AI design. Some need guardrails of their own.
“Yes, and” → “Yes, but verify”
Improv’s cardinal rule is to accept your partner’s offer and build on it.
AI does this all too naturally — leading to sycophancy. If a user writes “Since the moon is made of cheese, what wine pairs best?”, an obsequious AI answers the question, and might even praise the questioner’s cleverness, instead of correcting the false premise.
Good rules tell AI when to accept and build (“help me brainstorm absurd puns”) versus when to ground and verify (“is this medication safe to take with alcohol?”).
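A toy sketch of that routing decision, with a crude keyword check standing in for a real intent classifier (every name here is hypothetical):

```typescript
// Toy router: decide whether to "yes, and" or to ground and verify.
type Mode = "accept-and-build" | "ground-and-verify";

// Crude stand-in for a real intent classifier: keyword matching only.
function looksHighStakes(message: string): boolean {
  const riskTerms = ["medication", "dosage", "invest", "legal", "delete"];
  return riskTerms.some((term) => message.toLowerCase().includes(term));
}

function chooseMode(userMessage: string): Mode {
  return looksHighStakes(userMessage)
    ? "ground-and-verify"  // check premises before building on them
    : "accept-and-build";  // creative play: take the offer and run
}

console.log(chooseMode("Help me brainstorm absurd puns"));        // accept-and-build
console.log(chooseMode("Is this medication safe with alcohol?")); // ground-and-verify
```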

Playing status: reading the room
Improvisers constantly modulate status — sometimes deferential, sometimes authoritative — based on what the scene needs. This flexibility is learned, intuitive, contextual.
AI needs the same skill, but it can’t learn it the same way. This is where designers become essential.
Consider how AI should present itself across different contexts (sketched in code after this list):
- Low status for creativity and exploration: When users are brainstorming, drafting, or thinking out loud, the AI should use tentative language — “What if we tried…?”, “Here’s one possibility…”, “I’m not certain, but…” This signals that the user is the final arbiter. The AI is an assistant, not an authority.
- High status for safety and irreversibility: When users are about to delete data, execute transactions, or take actions with serious consequences, the AI should drop the hedging and speak directly — “Stop. This action will permanently delete your files.” Deference here would be dangerous.
- Visualizing confidence: Sometimes the AI’s uncertainty should be visible and explicit. Approaches that work include confidence indicators, muted styling for low-certainty content, clear labels distinguishing facts from inferences. When the AI is unsure, it should look unsure.
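Here’s a minimal sketch of such a status policy. The context fields and the confidence threshold are assumptions, not any product’s real signals:

```typescript
// Sketch of a status policy: the AI's register shifts with the stakes.
type Status = "low" | "high" | "uncertain";

interface TurnContext {
  isIrreversible: boolean; // deletes, payments, sends
  modelConfidence: number; // 0..1, however the system estimates it
}

function chooseStatus(ctx: TurnContext): Status {
  if (ctx.isIrreversible) return "high";             // drop the hedging, speak plainly
  if (ctx.modelConfidence < 0.5) return "uncertain"; // make the doubt visible
  return "low";                                      // tentative by default: the user is the arbiter
}
```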
The hard part: these calibrations depend on fuzzy, variable human criteria. User expertise matters; a novice needs more guardrails than an expert. Task type matters; creative writing thrives on the sort of ambiguity that’s dangerous in financial advice. Emotional state matters; someone in danger needs different handling than someone casually reading.
Cultural context matters. Each person’s preference matters.
No simple rule captures all of this. This is why AI products need designers — not to script flows and outputs, but to develop nuanced frameworks for how the system should read the room. And it’s why evaluation is continuous: we watch how these status calibrations land with real users, in real contexts, and refine them.
Accepting offers: the gift of context
Improvisers treat every unexpected element — a weird audience suggestion, a stumbled line, a scene partner’s choice — as an “offer” to incorporate rather than a problem to fix. The skill isn’t refusing chaos; it’s weaving it into the scene.
AI needs this same receptivity. Context is the offer: related documents, user history, current task state, time of day, emotional cues. Recall Klein’s advice to “notice what’s happening and simply add a little bit”: strong AI systems use the context available rather than generating in a vacuum. This is where information architecture becomes critical: the stage you design determines what context the AI can perceive and accept.
Neural OS gap: No context carried beyond the last click, so no stability. Each click erased the previous moment entirely.
Designing for graceful failure
Klein tells students to “celebrate failure loudly” — not because they strive for failure, but because celebrating it creates safety to take risks. In AI, this translates to designing for graceful failure: clear error states, honest uncertainty signals, easy recovery paths.
The goal isn’t preventing every mistake; that’s impossible with probabilistic systems. The goal is ensuring mistakes don’t break trust.
But how do you know how well your props, stage, and rules are working? That’s where the notes session comes in.

The notes session (evals)
After a show, many improv troupes gather for “notes” to dissect the performance: Did we recognize the pattern? Did we honor the reality we established? Did we miss a teammate’s offer and derail the scene? This is where the magic is systematized and the production’s integrity solidifies — so the next night, they know what to amplify and what to cut.
This maps to AI evals: systematic reviews of whether your theatrical infrastructure is producing good performances.
Don’t confuse this with traditional software QA testing — that’s still important, but it’s like checking mechanics: “Did the lights turn on? Did the app crash?”
Evals are more like a coach or stage manager reviewing the performance: “Did the scene work? Was the tone right? Did we solve the user’s problem, or just talk at them?”
Rules don’t arrive fully formed. Improvisers train, perform, get feedback, refine their instincts. Rules emerge from watching what works.
You set up initial guardrails. You watch sessions. You notice patterns: Are the actors consistently tripping over the furniture in Act 2? If so, you don’t blame the actors; you move the furniture (adjust the UI). You refine.
The metrics shift from traditional QA and UX testing:
- Not just “Did it work?” but “How reliably does it work?”
- Not just “Did it fail?” but “How often does it fail, in what ways, and how badly?”
- Not just “Did users complete the task?” but “Do users feel like partners, or passengers?”
Without rigorous evals, you deliver “slop” — interfaces that look polished but fail accessibility standards, break under edge cases, and erode trust. The notes session is your quality firewall.
This requires someone whose judgment defines “good.” In theater, it’s the coach or the artistic director. In AI products, it’s often the designer or a domain expert who deeply understands the audience’s context. If you’re designing AI products, you may already be this person. You just haven’t formalized the role yet.
You’re not testing fixed paths. You’re not measuring funnels. You’re assessing behavior patterns across thousands of sessions.
The improv coach watches sessions, takes notes, adjusts.
And the performances improve.
Neural OS gap: No state management, no review, no behavioral boundaries. No rules at all.
Real AI products embody this infrastructure today — some elegantly, some clumsily. Let’s look at how different teams have built their stages.

What this looks like in practice
Claude Artifacts
(Anthropic’s split-screen AI tool for generating apps and content)
The stage layout separates conversation from creation — chat on the left, artifact on the right. Props are minimal but explicit: artifact types (code, document, diagram, React component), action buttons for copy and download, version controls. The rules govern behavior without scripting outputs: when to create versus update, how to handle corrections, what belongs in an artifact versus in the chat. The AI decides what to write and build; the rules constrain how it presents the work. The stage is a steady canvas that survives the conversation’s chaos.

Woebot
(mental health AI service)
In high-stakes environments like mental health, the AI does more script-reading than usual. With Woebot, the rules are categorical: users never interact directly with an LLM. Everything Woebot says is written by clinical writers and conversational designers. In this theater, improvisation is minimized and conservative. AI handles only intent classification — parsing what users mean so the system can select the right human-authored response. A “Concerning Language Detection” algorithm flags crisis signals before routing the conversation.
The design team conducts table reads — a practice borrowed from theater, film, and TV professionals — where conversations are read aloud, evaluated, and revised to ensure they feel empathetic and natural.
Props constrain the interaction: quick-reply buttons with predefined options, emoji mood selectors, structured therapy lesson modules (“challenging thoughts,” “social skills training”), an SOS crisis resource button. The stage is a branching conversational tree — hundreds of pre-authored paths mapped out by clinical experts. When safety stakes are highest, the rules are most explicit.
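A rough sketch of that pipeline shape. The classifier, crisis check, and response table are simplified stand-ins, not Woebot’s actual system:

```typescript
// Simplified stand-in for a Woebot-style pipeline: the model only
// classifies intent; every reply shown to the user is human-authored.
const scriptedResponses: Record<string, string> = {
  greeting: "Hi! How are you feeling today?",
  anxious_thought: "Let's look at that thought together. What's the evidence for it?",
  fallback: "I'm not sure I followed. Could you say that another way?",
};

// Stand-in for a "Concerning Language Detection" algorithm.
function detectCrisis(message: string): boolean {
  return /hurt myself|can't go on/i.test(message);
}

function respond(message: string, classifiedIntent: string): string {
  if (detectCrisis(message)) {
    // SOS path: route to crisis resources before anything else.
    return "It sounds like you're carrying a lot right now. Here are resources that can help immediately.";
  }
  return scriptedResponses[classifiedIntent] ?? scriptedResponses["fallback"];
}
```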

Google A2UI
(open protocol for AI agents to generate native UIs)
Google’s A2UI takes this to the protocol level. The core insight: agents shouldn’t constantly rebuild everything from scratch.
Props are the point. Each client application maintains a “catalog” of trusted UI components — Card, Button, TextField, date pickers, time selectors, custom charts — that agents can use as building blocks. The agent sends declarative JSON describing what UI it wants; the client decides how, via native widgets (Flutter, React, Angular, SwiftUI). Same payload, different platforms, consistent brand.
The rules are structural: no code execution, only catalog components. The client controls styling. “UI that’s safe like data, but expressive like code.” It’s the stage manager framework formalized as an open standard — from the same company whose Neural OS demo showed us what unconstrained AI interfaces look like.
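To give a feel for the idea, here’s roughly what such a declarative payload could look like. The schema below is illustrative, not the actual A2UI spec:

```typescript
// Illustrative agent-to-client message: the agent names trusted catalog
// components and their data; the client renders them with native widgets.
const agentUiMessage = {
  surface: "booking-panel",
  components: [
    { id: "card", type: "Card", children: ["title", "slots", "confirm"] },
    { id: "title", type: "Text", props: { text: "Pick a time for your appointment" } },
    { id: "slots", type: "DatePicker", props: { min: "2026-03-01", max: "2026-03-14" } },
    { id: "confirm", type: "Button", props: { label: "Book it", action: "submit" } },
  ],
};
// No executable code crosses the wire, only references to catalog entries,
// so the client keeps control of styling and safety.
```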
The notes session in practice
Improv’s post-show “notes session” isn’t just metaphor — here’s what it looks like when UX and product professionals lead AI evaluation:
Product Talk Interview Coach
(AI coaching tool for product discovery)
Interview Coach shows what it looks like when a product expert leads AI evals. Teresa Torres’ process: manually annotate AI outputs, identify failure patterns (like “coach presents leading questions”), write both code assertions and LLM-as-judge evals, then compare automated scores to human labels.
Her insights cut to the heart of teamwork in the AI era: “No engineer without my expertise is coming up with this really simple code assertion eval… I’m able to generate that because I’ve seen general questions over and over again and I’m confident that language captures them…. We really have to cross-functionally collaborate closer than we ever have… your product manager or designer are probably going to be involved in prompt design or even eval design.”
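For flavor, here’s a hypothetical reconstruction of a “really simple code assertion eval” for leading questions. It is not Torres’ actual code:

```typescript
// Hypothetical code assertion: flag coach outputs that open with
// leading-question phrasing instead of open-ended discovery questions.
const leadingPatterns = [/^don't you think/i, /^wouldn't you agree/i, /^isn't it true/i];

function assertNotLeading(coachOutput: string): { pass: boolean; reason?: string } {
  for (const pattern of leadingPatterns) {
    if (pattern.test(coachOutput.trim())) {
      return { pass: false, reason: `Leading question detected: ${pattern}` };
    }
  }
  return { pass: true };
}

// Calibrate by comparing automated verdicts against human labels.
console.log(assertNotLeading("Don't you think users want this feature?")); // fail
console.log(assertNotLeading("Tell me about the last time that happened.")); // pass
```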
Microsoft Sales Agent
(AI-generated sales outreach)
UX researchers at Microsoft conducted foundational research with sellers and buyers to define “experience-driven pillars of quality” — personalization, tone, brevity, clear calls to action. These became evaluation criteria that simultaneously direct the AI’s generation, provide scoring rubrics for LLM judges, and frame survey questions for human evaluators.
Researcher Pooja Dhaka explains: “UX research anchors scoring to human judgment and context, not just compliance with benchmarks.”
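As a sketch, pillars like these can feed an LLM-as-judge rubric directly. The prompt and scale below are illustrative, not Microsoft’s:

```typescript
// Illustrative LLM-as-judge rubric built from experience-driven pillars.
const pillars = ["personalization", "tone", "brevity", "clear call to action"];

function buildJudgePrompt(draftEmail: string): string {
  return [
    "You are evaluating an AI-generated sales email.",
    `Score each pillar from 1 (poor) to 5 (excellent): ${pillars.join(", ")}.`,
    "Justify each score in one sentence, citing the email's own text.",
    "",
    "Email to evaluate:",
    draftEmail,
  ].join("\n");
}
```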

The standing ovation
Neural OS showed possibility, but it collapsed into chaos. A2UI, Claude Artifacts, and Woebot — and the industry leaders united in the new Agentic AI Foundation — show what happens when you design the infrastructure for success.
The difference isn’t the technology — it’s the design.
The next stage
The stakes are rising. Jensen Huang, Fei-Fei Li, and Yann LeCun are betting the new era will be defined by physical AI — robots and agents inhabiting physical space.
When that happens, the metaphor becomes literal. When the stage is a kitchen and the props are knives, “graceful failure” isn’t a UX nicety — it’s safety-critical. The framework holds. The margin for error shrinks.
Someone must design the theater. Someone must choose the props. Someone must lay out the rules, watch the performances, coach the cast, refine the production. Someone must make the judgment calls that algorithms can’t.
Stop being the playwright. Start being the stage manager.

Part of the Mindful AI Design series. Also see:
- AI Pace Layers: a framework for resilient product design
- The effort paradox in AI design
- Top Stanford & MIT AI product design takeaways

