ConvApparel: Why Your AI User Simulators Are Still Bad and How to Fix Them


Google Research just dropped ConvApparel, and honestly, it’s about time someone took a hard look at how terrible LLM-based user simulators actually are.

We’ve all seen the pattern. You build a conversational AI agent, test it against another LLM playing the role of a user, everything looks great, and then real humans show up and the whole thing falls apart. The agent forgets constraints, rambles, or just doesn’t handle frustration well. The agent has flaws, sure, but the deeper problem is the simulator that never exposed them.

The realism gap nobody wants to talk about

LLMs are trained to be helpful assistants. They’re polite, patient, and have encyclopedic knowledge of whatever domain you throw at them. That’s great when you want a chatbot that answers questions. It’s terrible when you need a simulator that acts like a real, flawed, easily annoyed human.

Think about it. You’re testing a new booking agent. Your simulator never gets frustrated when the agent asks for the same information three times. It never says “I already told you that.” It never just hangs up. Real users do all of those things, and if your agent only sees polite, patient simulators, it’s going to get wrecked in production.

The ConvApparel paper calls this the “realism gap.” I’d call it a chasm. The simulators we have today are like flight simulators that only show perfect weather and smooth landings. No turbulence, no bird strikes, no engine failures. You can’t train a good pilot that way, and you can’t train a good conversational agent that way either.

The counterfactual problem

Here’s where it gets interesting. The researchers point out a subtle but critical issue: how do you test whether a simulator has actually learned human behavior, or whether it’s just repeating patterns from its training data?

Most simulators are trained on conversations with a specific agent. When you change the agent — maybe you’re testing a new, intentionally frustrating one to see how the system handles edge cases — the simulator should adapt. A real human would get annoyed if the agent suddenly became useless. But an LLM simulator might just keep being polite because that’s what it learned from the training data.

This is the counterfactual validation problem. ConvApparel introduces a clever way to measure this: they built a dataset with both “Good” agents (helpful, normal) and “Bad” agents (intentionally unhelpful, frustrating). Then they check whether simulators react differently to each. If a simulator treats a bad agent the same as a good one, it’s not simulating humans — it’s just regurgitating training patterns.
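To make the idea concrete, here is a minimal sketch of what such a check could look like. It is not the paper's actual metric. The per-conversation `frustrated` flag, the function names, and the toy numbers are all assumptions made up for illustration. The idea is to measure how much a behavior shifts when the agent goes from Good to Bad, once for humans and once for the simulator, and then compare the two shifts.

```python
from statistics import mean

def frustration_rate(conversations):
    """Fraction of conversations in which the user showed frustration.

    Each conversation is a dict with a hypothetical boolean 'frustrated'
    label (an assumption for this sketch, not a ConvApparel field).
    """
    return mean(1.0 if c["frustrated"] else 0.0 for c in conversations)

def counterfactual_shift(good_convs, bad_convs):
    """How much the frustration rate rises when the agent turns bad."""
    return frustration_rate(bad_convs) - frustration_rate(good_convs)

# Toy data, invented for illustration only.
human_good = [{"frustrated": False}] * 9 + [{"frustrated": True}] * 1
human_bad = [{"frustrated": False}] * 4 + [{"frustrated": True}] * 6
sim_good = [{"frustrated": False}] * 10
sim_bad = [{"frustrated": False}] * 9 + [{"frustrated": True}] * 1

human_shift = counterfactual_shift(human_good, human_bad)
sim_shift = counterfactual_shift(sim_good, sim_bad)

# A simulator whose shift is far smaller than the human shift is
# treating the bad agent almost like the good one.
print(f"human shift: {human_shift:.2f}, simulator shift: {sim_shift:.2f}")
```

In this toy case the humans get noticeably more frustrated with the bad agent while the simulator barely reacts, which is exactly the failure mode described above.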

What ConvApparel actually is

The dataset itself is a collection of human-AI conversations in the context of conversational recommender systems (think: an AI helping you pick a movie or a restaurant). They used a dual-agent setup where real users were randomly assigned to either a helpful agent or a deliberately bad one. This gives them a spectrum of human behavior from satisfied to profoundly annoyed.

The evaluation framework has three pillars:

  • Population-level statistics: do simulators produce the same aggregate behaviors as real humans?
  • Human-likeness scoring: do individual conversations look like they came from a real person?
  • Counterfactual validation: do simulators react appropriately to out-of-distribution agent behavior?
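As a rough illustration of the first pillar, here is one way a population-level comparison could be sketched. The statistic (conversation length in turns) and the distance (total variation) are my own choices for the example, not necessarily what the paper uses, and the turn counts are invented.

```python
from collections import Counter

def distribution(values):
    """Empirical distribution: maps each observed value to its frequency."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    0.0 means the distributions are identical; 1.0 means they do not overlap.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical turns-per-conversation counts, invented for illustration.
human_turns = [3, 4, 4, 5, 8, 2, 6, 4]
sim_turns = [4, 4, 4, 4, 5, 5, 4, 4]

gap = total_variation(distribution(human_turns), distribution(sim_turns))
print(f"total variation distance: {gap:.2f}")
```

The toy simulator here clusters tightly around four or five turns, while the humans are all over the place. An aggregate metric like this catches that kind of mismatch even when every individual simulated conversation looks plausible on its own.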

This is more thorough than most evaluation approaches I’ve seen, which tend to stop at checking whether the simulator’s outputs look vaguely human. ConvApparel actually tests whether the simulator can handle situations it wasn’t trained on.

Why this matters for building better agents

If you’re building conversational AI, you’re probably using LLM-based simulators for testing. You’re also probably frustrated by how often your agents fail in production. The ConvApparel framework gives you a way to measure that gap and start closing it.

One thing I appreciate about this work is that they don’t just identify the problem — they provide a concrete dataset and evaluation methodology. You can actually use ConvApparel to benchmark your simulators and see where they fall short.

The bottom line: your simulators are probably too polite, too knowledgeable, and too patient. ConvApparel gives you the tools to measure that and start building simulators that actually behave like real humans. Your agents will thank you when they don’t crash and burn in production.
