Co-constructing intent with AI agents

How can we move beyond the confines of a simple input field to help agents evolve from mere “tools” into true “partners” that can perceive our unspoken intentions and spark new ways of thinking?

When we share our vague, half-formed ideas with AI agents, are we looking for the predictable, standard answers we expect, or a genuine surprise we didn’t see coming?

As the capabilities of AI agents evolve at a breathtaking pace, we increasingly expect them to be intuitive and understanding.

Yet the reality often falls short. Vague or abstract questions typically yield only generic, catch-all answers, trapping us in a loop of rephrasing and refining; we may eventually land on a satisfactory result, but only after a frustrating cycle of trial and error.

An image contrasting a simple, direct question (“What are the most popular tourist attractions in Japan?”) with a complex, ambiguous request (“Help me plan a trip to Japan that the whole family will enjoy.”), showing the clarifying questions needed for the latter.
Clear questions vs. Vague or abstract questions

This has become the dominant mode of human-AI interaction. But is this the future we really want?

True connection often sparks in a dialogue between equals. The conversations that leave a lasting impression aren’t the simple question-and-answer exchanges. Instead, they are the ones where our true intent gradually surfaces through a back-and-forth of clarifying questions, mutual confirmations, and shared moments of insight.

If AI agents are to make the leap from “tool” to “partner,” perhaps we need to reimagine their role. Should they merely provide answers on command, or could they become true companions in our explorations — ones that provoke our thoughts and, ultimately, help us discover what we truly want?

Speed ≠ Understanding

Imagine sending a morning greeting to your family on a hazy, half-awake morning. Your finger instinctively finds the “Send” button. A light tap, and it’s done. This simple, natural action — how different would it have been just a few decades ago?

You would have had to carefully type out lines of code on a screen, where a single typo or an extra space would cause the computer to rebuke you with an even more cryptic string of garbled text.

At its core, the difference between these two experiences lies in the challenge of “translating” between fuzzy human intent and precise computer instructions. Back in the 1980s, Don Norman defined this challenge with two concepts: the “Gulf of Execution”, which separates our thoughts from the machine’s commands, and the “Gulf of Evaluation”, which separates the machine’s feedback from our ability to understand it.

The narrower these gulfs, the more seamless the process of conveying our intent and interpreting the results. Decades of progress in human-computer interaction, from the command-line interface to the graphical user interface, have been dedicated to bridging this divide.

Diagram showing the user and the world. Arrow from user to world is the Gulf of Execution: “How do I use this system?” Arrow from world to user is the Gulf of Evaluation: “What’s the current system state?”
Whitenton, K. (2018, March 12). The two UX gulfs: evaluation and execution. Nielsen Norman Group.

Today’s AI-powered applications are following the same trajectory, continuously lowering the barrier to execution and making it ever faster to communicate in natural language.

But does this speed truly mean our intent is being understood?

Are AI agents really helping us explore?

“What’s the capital of the Netherlands?”

“The capital of the Netherlands is Amsterdam, but the seat of government, central ministries, parliament, and foreign embassies are all in The Hague…”

“How high is Mount Everest?”

“The latest official height of Mount Everest is 8,848.86 meters, a result from a joint survey by China and Nepal in 2020…”

“Generate a working timer demo.”

“Sure, here’s a simple countdown timer in Python. The user can input the number of seconds, and the script will count down and print a message at the end…”
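For illustration only, here is a minimal sketch of the kind of script such a reply might contain (the exact code any given agent produces will differ):

```python
import time

def countdown(seconds: int) -> None:
    """Count down from `seconds` to zero, printing each remaining second."""
    for remaining in range(seconds, 0, -1):
        print(f"{remaining} second(s) left...")
        time.sleep(1)  # wait one second between updates
    print("Time's up!")

if __name__ == "__main__":
    # The user enters the duration, as the reply above describes.
    countdown(int(input("Enter the number of seconds: ")))
```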

Yes. For tasks with a clear intent, a specific goal, and low-context requirements — like summarizing articles, translating text, or doing basic math — AI agents deliver accurate and prompt answers. But when faced with the messy, complex challenges of real life, does that efficiency hold up?

Impeccable, but useless

“Help me plan a trip to Japan that the whole family will love.”

“Of course. To plan a trip to Japan the whole family can enjoy, here are some suggestions you can tailor to your needs. The milder weather in spring (March-May) and autumn (September-November) is ideal for family travel. Determine the trip’s duration based on everyone’s vacation time. If you’re traveling between cities, consider a Japan Rail Pass, but you’ll need to calculate if it’s cost-effective for your specific itinerary…”

Our conversation with the agent seems free of any “Gulf of Execution.” We had no trouble sending the request, and the agent promptly returned a standard-issue response. But how much is that response really worth to us?

Your parents want to unwind in a hot spring, the kids are dying to go to Universal Studios, and your partner is hoping for a more niche, authentic cultural experience. Everyone has different hopes for the trip, but the agent’s generic advice fails to address any of them. So why didn’t we just give the agent all these details from the start?

The slot machine conversation trap

When we turn to AI with these kinds of vague, complex problems, 99% of the time we are funneled into a single input box. It’s the dominant interface for AI today, a model that originated with ChatGPT’s goal of giving people the most direct path to experiencing the power of large language models.

Screenshot showing the home screens of four AI chat tools: ChatGPT (top left), Claude (top right), Gemini (bottom left), and DeepSeek (bottom right).
The predominant way we interact with AI is almost entirely centered around the input field.

However, the thought of cramming every detail into that tiny box — everyone’s preferences, the family budget, and all the nuances from memory — and then endlessly editing it, is just exhausting.

“This is too much trouble, just simplify it.”

Our brains are wired for shortcuts. To get our vague idea out quickly, we subconsciously strip away all the context, preferences, and other details that are hard to articulate, compressing everything into the oversimplified phrase “make the family happy.” We toss it into the input box and pin all our hopes on the agent’s abilities.

Then, like a gambler, we pull the lever and pray for a lucky spin that happens to read our minds. To increase its “hit rate” with such pitiful context, the agent can only flex its capabilities, calling on every tool at its disposal to generate a broad, catch-all answer.

The result isn’t a helpful guide that inspires new thinking, but an undigested information dump. This interaction becomes less like a conversation and more like a slot machine, defined by uncertainty. It invisibly adds to our cognitive load and pushes us further away from discovering what we really need.

Even as AI agents have evolved to handle high-dimensional, ambiguous, and exploratory tasks, the way we communicate with them remains a low-dimensional channel, ill-suited to expressing our own complex thoughts.

“However, difficulties in obtaining the desired outcome arise from both the AI’s interpretation and the translation of intentions into prompts. An evolution in the user experience of AI systems is necessary, integrating GUI-like characteristics with intent-based interaction.”

On the usability of generative AI: Human generative AI

Stop guessing, start exploring the real problem

Let’s revisit the original idea. If you truly wanted to plan a trip to make your whole family happy, how would you do it without an AI? You’d probably engage in a series of exploratory actions — reflecting, researching, and running “what-if” scenarios to find a plan that balances everyone’s different needs.

Our daily reality isn’t about clear instructions and direct execution; it’s about navigating vague and messy challenges. Whether planning a family vacation or kicking off a new project at work, the hardest problem we face is often how to transform a fuzzy impulse into a clear and valuable goal.

So how can we design our interactions with AI to help us explore these vague, fragile impulses? How can we build a more coherent, natural dialogue instead of getting stuck in a constant guessing game?

“Good design is thorough down to the last detail. Nothing must be arbitrary or left to chance. Care and accuracy in the design process show respect towards the user.” — Dieter Rams

Like partners: The power of co-constructing intent

“Do you think this potted plant would look better somewhere else?”

“Oh? What’s on your mind? I thought you liked it where it was.”

“It’s not that I don’t… I just feel like nothing looks right lately. I guess I’m just looking for a change of scenery.”

When we talk things over with friends, partners, or family, we rarely expect an immediate, clear-cut answer. The conversation often begins with a vague impulse or a half-formed idea.

They might build on your thought: “How about by the window? The sunlight might help it thrive.” Or they might probe deeper, sensing the motive behind the question: “Have you been feeling a bit drained lately? It sounds like you want to move more than just the plant — maybe you’re looking to bring something new into your life.”

Human conversation is a dynamic, exploratory journey. It’s not about simply transferring information. It’s about two people taking a fuzzy idea and, through a back-and-forth exchange, co-discovering, refining, and even shaping it into something entirely new — uncharted territory neither had imagined at the start. This is a process of Intent Co-construction.

As our relationship with AI evolves from “tool” to “partner,” we find ourselves sharing more of these ambiguous intentions. To meet this changing need, how can we learn from our human relationships to design interactions that foster deep connection and co-construct intent with our AI counterparts?

Claude webpage with headline “Meet Claude, your thinking partner.”
Anthropic’s official introduction: Meet Claude, your thinking partner

Reading between the lines with multimodality

Picture a perfect sunny weekend. You’re driving with the windows down, your favorite album playing, on your way to that new park you’ve been wanting to visit.

You tell your voice assistant your destination. It instantly displays three routes, color-coded by time and traffic, and helpfully highlights the one its algorithm deems fastest.

You subconsciously take its advice, but halfway there, something feels wrong.

While it may be the shortest path physically, the route involves constant lane changes on streets barely wide enough for one car. You’re flanked by parked cars whose doors could swing open at any moment and kids who might dart into the road. Your nerves are frayed, your palms are sweating on the wheel, and you find yourself muttering about the cramped, crowded conditions, nearly rear-ending an e-bike.

Through it all, the navigation remains indifferent, stubbornly sticking to its original recommendation.

Yes, multimodal inputs allow us to give clearer commands. But when our initial command is incomplete, we still end up with a generic solution. A true partner would think:

“They seem stressed by this complex route. Should I suggest a longer but easier alternative?”

“I’m detecting swearing and frequent hard braking. Is this road too difficult for them to handle?”

The real breakthrough isn’t just understanding what users say, but how they say it — combining their words with environmental cues and situational context. Do they type fluently or constantly backspace? Do they circle a data point with confidence or hesitation? These subconscious signals often reveal our true state of mind.

A screenshot of Hume AI, showing a podcast on the left and emotion and behavior analysis on the right, displaying three people with their emotions labeled as joy, awe, and curiosity based on the analysis.
Hume AI can analyze the emotion in a speaker’s voice and respond with empathetic intelligence.

The AI we need isn’t just one that can process text, voice, images, and gestures simultaneously. We need a partner that, while respecting our privacy, can keenly and continuously read between the lines, detecting the unspoken truth in the dissonance between these multimodal signals.

“To design the best UX, pay attention to what users do, not what they say. Self-reported claims are unreliable, as are user speculations about future behavior. Users do not know what they want.”

Jakob Nielsen

Now, let’s take this one step further. Imagine an AI that, through multimodal sensing, has perfectly understood our true intent. If it simply serves up a flawless answer like a data report, is that really the best way for us to learn and grow?

Information as a flowing process

Let’s rewind and take that drive to the park again. This time, instead of an AI, your co-pilot is a living, breathing friend.

When you reach that same algorithm-approved turnoff, you tense up at the sight of the narrow lane. Your friend notices immediately and guides you through the challenge:

“This road looks rough. Let me guide you to a better one.”

“Turn right just after that coffee shop up ahead.”

“We’re almost there. See the people with picnic blankets?”

The journey is seamless. You realize your friend didn’t necessarily give you more information than the AI, but they delivered the right information at the right time, in a way that made sense in the moment.

Similarly, AI-generated information can be delivered through diverse mediums; text is by no means the only way. Think about a recent conversation that stuck with you. Was it memorable for its dictionary-like volume of facts? More likely, you were captivated by how the story was told — in a way that helped you visualize it. This power of visualization is rooted in metaphor.

“…we often think we use metaphors to explain ideas, but I believe good metaphors don’t explain but rather transform how our minds engage with ideas, opening entirely new ways of thinking.”

The Secret of Good Metaphors

Files that look like paper, directories that look like folders, icons for calculators, notepads, and clocks — back in the earliest days of personal computing, designers used graphical metaphors based on familiar physical objects to make strange and complex command lines feel intuitive and accessible.

A screenshot of the Apple Lisa 2’s black-and-white graphical user interface, showing the cursor selecting the “Copy” command from the “Edit” menu.
Apple Lisa 2 (1984): Features like desktop icons, the menu bar, and graphical windows significantly lowered the barrier to entry for personal computers

Metaphors work by tapping into our past experiences and connecting them to something new, bridging the gap to understanding. So, how does this apply to AI output?

Think about how we typically use an AI to explore a complex topic. We might ask it a direct question, have it synthesize industry reports, or feed it a pile of research to summarize. Even with the AI’s best efforts, clicking open a result to find a wall of text can feel overwhelming.

We can’t see its thought process. We don’t know if it considered all the angles we did. We don’t know where to begin. What we truly need isn’t just a final answer, but to feel like a friend is walking us through their thinking — transforming information delivery from a static report into a guided process of shared discovery.

A mind map in Chinese that visually lays out an AI’s entire thinking process for analyzing and solving a complex problem.
Metaso: Visualizes its entire thinking process on a canvas as it works on a problem.

But what if, even after seeing the process, the answer is still too abstract?

We naturally understand information through different forms: charts for trends, diagrams for processes, and stories told through sound and images. Any good communication orchestrates different dimensions of information into a presentation that conveys meaning more effectively.

A screenshot of the Google NotebookLM interface demonstrating how source materials on art history are automatically transformed into a narrated video overview titled “The World of Surrealism”.
Google NotebookLM can transform source materials into various easy-to-digest formats, such as narrated video overviews, conversational podcasts, and interactive mind maps. This shifts learning from a process of passive consumption to a dynamic, co-creative experience.

However, there’s a risk. When an AI uses carefully crafted metaphors to present an output that is clear, beautiful, and logically flawless, it can feel like an unchallengeable final answer.

Is that how our conversations with human partners work?

When a friend shares an idea, we don’t just agree. Our responses are filled with questions, doubts, and counter-arguments. Sometimes, a single insightful comment can change the direction of an entire project. A meaningful dialogue is less about the period at the end of a sentence and more about the comma or the question mark that keeps the conversation going.

Progressive construction through dialogue and memory

“Let’s go hiking this weekend. I want to challenge myself.”

“Sounds good! But remember last time? You said your knee was bothering you halfway up. Are you sure? We could find an easier trail.”

“I’m fine, my knee’s all better.”

“Don’t push yourself…”

A true partner remembers your past knee injury. They remember you’re directionally challenged and that you’re not a fan of reading long texts. This long-term memory allows your interactions to build on a shared history, moving beyond simple Q&A into a state of mutual understanding where you can anticipate each other’s needs without lengthy explanations.

A composite image explaining the “Dia Memory” feature, showing a person recording their environment with a phone on the left, and a corresponding personalized AI chat suggestion on the right.
Google’s Project Astra remembers what it sees and hears in real time, allowing it to answer contextual questions like, “Where did I leave my glasses?” The Dia browser’s memory feature continuously learns from your browsing history to develop a genuine understanding of your tastes

For an AI to co-construct intent like a partner, persistent memory is not just a feature — it’s essential.

Agent failures aren’t only model failures; they are context failures.

The New Skill in AI is Not Prompting, It’s Context Engineering
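In practice, this “context engineering” often means deliberately folding what the agent remembers into the prompt it reasons over. Here is a minimal, purely illustrative sketch; the UserMemory store and build_prompt helper are hypothetical names, not any real product’s API:

```python
from dataclasses import dataclass, field

@dataclass
class UserMemory:
    """A toy long-term memory: durable facts the agent has learned about the user."""
    facts: list[str] = field(default_factory=list)

    def remember(self, fact: str) -> None:
        self.facts.append(fact)

    def as_context(self) -> str:
        return "\n".join(f"- {fact}" for fact in self.facts)

def build_prompt(memory: UserMemory, user_message: str) -> str:
    """Fold remembered facts into the prompt so the reply builds on shared history."""
    return (
        "Known about this user:\n"
        f"{memory.as_context()}\n\n"
        f"User says: {user_message}\n"
        "If the request conflicts with, or is under-specified relative to, "
        "what you know, ask a clarifying question before answering."
    )

memory = UserMemory()
memory.remember("Injured their knee on the last hike")
memory.remember("Prefers short, visual explanations over long walls of text")

print(build_prompt(memory, "Let's go hiking this weekend. I want to challenge myself."))
```

With that context in place, the agent’s natural first move is the partner-like one from the dialogue above: “Are you sure? We could find an easier trail.”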

But memory alone isn’t enough; we need to use it to foster deeper exploration. As we said from the start, the goal isn’t to get an instant answer, but to refine our intentions and formulate better, more insightful questions.

A screenshot of a conversation with an AI tutor, where the AI asks clarifying questions about the user’s math background and goals before explaining Bayes’ theorem to personalize the lesson.
ChatGPT Study Mode. When given a task, its first instinct isn’t to jump straight to an answer. Instead, it begins by asking the user clarifying questions to better define the problem

When a vague idea or question surfaces, we want an AI that is more than an answer machine. We want a true thinking partner: one that can reach beyond the immediate context, draw on our shared history to initiate meaningful dialogue, and guide us as we peel back the layers of our own thoughts. In this progressive, co-constructive process, it helps us finally articulate what we truly intend.

Where co-construction ends, we begin

Deeper insights through multimodality, dynamic presentations that clarify information, and a back-and-forth conversational loop that feels like chatting with a friend… As our dialogue with an AI becomes deeper and more meaningful, so too does our understanding of the problem, and our own intent becomes clearer.

But is that the end of the journey?

In the film Her, through countless conversations with the AI “Samantha,” Theodore is compelled to confront his emotions, his past failed marriage, and his own conflicting fear and desire to reconnect. Throughout this process, Samantha’s curiosity, learning, and gentle challenges to his preconceptions help him see himself with new clarity, allowing him to truly feel and face his life again.

A still from the movie ‘Her,’ where the protagonist, Theodore, smiles gently while looking out a window, listening to his AI companion through a wireless earbud.
Screenshot via ‘Her’

The world of ‘Her’ is not some distant future; in many ways, it is a portrait of our present moment. In a future where AI companions will be a long-term presence in our lives, their ultimate purpose may not be to replace human connection, but to act as a catalyst for our own growth.

The ultimate value of co-constructive interaction is not just to help us understand ourselves more deeply. It is to act as an engine, converting that profound self-awareness into the motivation and clarity needed to achieve our potential in the real world.

Of course, times change, but the fundamentals do not. This has always been the goal of the pioneers of human-computer interaction:

Boosting mankind’s capability for coping with complex, urgent problems.

Doug Engelbart

References

  1. Johnson, Jeff. Designing with the mind in mind: simple guide to understanding user interface design guidelines. Morgan Kaufmann, 2020.
  2. Whitenton, K. (2018, March 12). The two UX gulfs: evaluation and execution. Nielsen Norman Group. https://www.nngroup.com/articles/two-ux-gulfs-evaluation-execution/
  3. DOC — The secret of good metaphors. (n.d.). https://www.doc.cc/articles/good-metaphors
  4. Nielsen, J., Gibbons, S., & Mugunthan, T. (2024, January 30). Accordion Editing and Apple Picking: Early Generative-AI User Behaviors. Nielsen Norman Group. https://www.nngroup.com/articles/accordion-editing-apple-picking/
  5. Varanasi, L. (2025, May 25). Meta chief AI scientist Yann LeCun says current AI models lack 4 key human traits. Business Insider. https://www.businessinsider.com/meta-yann-lecun-ai-models-lack-4-key-human-traits-2025-5?utm_source=chatgpt.com
  6. Perry, T. S., & Voelcker, J. (2023, August 7). How the Graphical User Interface Was Invented. IEEE Spectrum. https://spectrum.ieee.org/graphical-user-interface
  7. Gerlich, M. (2025). AI tools in society: Impacts on cognitive offloading and the future of critical thinking. Societies, 15(1), 6.
  8. Ravera, A., & Gena, C. (2025). On the usability of generative AI: Human generative AI. arXiv preprint arXiv:2502.17714.

