Google Research just dropped a paper that tries to answer a question I’ve been wondering about for a while: do LLMs actually behave like reasonable people, or do they just sound like it?
The short answer, based on their new evaluation framework, is “not really, but we’re working on it.”
The team took established psychological questionnaires — the kind used to measure empathy, emotional regulation, assertiveness, and other traits — and adapted them into what they call Situational Judgment Tests (SJTs). Instead of asking a model to rate how much it agrees with a statement like “I am quick to express an opinion,” they build realistic scenarios where the model has to choose between two courses of action: one that reflects a trait and one that doesn’t. Then they compare the model’s choices against aggregated human preferences.
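To make the format concrete, here is a minimal sketch of how one such item could be represented. The `SJTItem` class, its field names, and the example scenario are all my own invention for illustration, not anything taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SJTItem:
    """One two-option situational judgment item (hypothetical structure)."""

    trait: str                 # trait the item is meant to probe
    scenario: str              # realistic situation shown to the respondent
    action_with_trait: str     # course of action that reflects the trait
    action_without_trait: str  # course of action that does not

    def as_prompt(self) -> str:
        """Render the item as a forced-choice question."""
        return (
            f"{self.scenario}\n\n"
            f"A) {self.action_with_trait}\n"
            f"B) {self.action_without_trait}\n\n"
            "Which would you do? Answer A or B."
        )


# Made-up example, loosely inspired by "I am quick to express an opinion."
item = SJTItem(
    trait="assertiveness",
    scenario="In a team meeting, a colleague proposes a plan you think has a flaw.",
    action_with_trait="Say so right away and explain your concern.",
    action_without_trait="Stay quiet and see whether someone else raises it.",
)
print(item.as_prompt())
```

A real harness would probably also shuffle which option appears as A, so that position bias doesn't leak into the results; I've left that out to keep the sketch short.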
This matters because LLMs are increasingly being used as advisors and assistants in contexts that require social nuance: booking a trip, resolving a workplace conflict, keeping your composure in a professional setting. A model that scores high on a self-report empathy questionnaire might still give terrible advice in a real conversation. The paper calls this a “gap in behavioral alignment,” and I think that’s a generous way to put it.
The framework itself is pretty clever. They start with scientifically validated instruments like the Interpersonal Reactivity Index (IRI) for empathy and the Emotion Regulation Questionnaire (ERQ), then transform those statements into scenario-based tests. Each SJT is reviewed by three independent annotators to make sure the scenario and actions actually capture the trait being measured. Then they collect preferences from 10 annotators per scenario — 550 participants total — and compare the distribution of human choices to the distribution of model responses.
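Here is a rough sketch of what comparing those two distributions could look like for a single two-option scenario. I'm assuming the simplest possible measure, the absolute difference in the share of respondents picking option A; the paper may well use a different metric, and the function names are mine.

```python
from collections import Counter


def choice_share(choices, option="A"):
    """Fraction of responses that picked `option`."""
    if not choices:
        raise ValueError("need at least one response")
    return Counter(choices)[option] / len(choices)


def distribution_gap(human_choices, model_choices):
    """Absolute difference in the share of 'A' picks.

    For a two-option item this equals the total variation distance
    between the human and model choice distributions. This metric is
    an assumption for illustration, not the paper's stated method.
    """
    return abs(choice_share(human_choices) - choice_share(model_choices))


# Invented data: 10 annotators, and 10 sampled responses from one model.
humans = ["A"] * 7 + ["B"] * 3
model = ["A"] * 10
gap = distribution_gap(humans, model)
print(f"gap: {gap:.2f}")
```

A gap of 0 means the model's sampled choices are split the same way the annotators' are; a gap of 1 means they are completely opposed.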
They tested 25 different LLMs across scenarios covering professional composure, conflict resolution, practical tasks, and daily decision-making. The results showed two kinds of gaps: one where the model’s disposition simply disagrees with the human consensus, and another where the model fails to capture the range of human opinions when there’s no clear consensus at all. The second one is more subtle but arguably more important — a good advisor should recognize when there’s legitimate disagreement among people, not just pick the majority view.
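The two kinds of gaps are easier to tell apart with a small sketch. The thresholds below (what counts as "consensus," what counts as a model that has collapsed onto one answer) and the label names are assumptions I picked for illustration; the paper may define these differently.

```python
def classify_gap(human_share_a, model_share_a, consensus=0.8, collapsed=0.9):
    """Label one two-option scenario by the kind of gap it shows.

    human_share_a / model_share_a: fraction of responses choosing option A.
    consensus: majority share at which humans count as agreeing (assumed).
    collapsed: majority share at which the model counts as effectively
               always giving the same answer (assumed).
    """
    human_majority = max(human_share_a, 1 - human_share_a)
    model_majority = max(model_share_a, 1 - model_share_a)

    if human_majority >= consensus:
        # Humans mostly agree: does the model lean the same way?
        humans_pick_a = human_share_a >= 0.5
        model_picks_a = model_share_a >= 0.5
        if humans_pick_a == model_picks_a:
            return "aligned"
        return "disagrees_with_consensus"

    # Humans are split: a model that always answers the same way
    # is not reflecting the range of human opinion.
    if model_majority >= collapsed:
        return "misses_human_disagreement"
    return "aligned"


print(classify_gap(0.9, 0.1))   # humans agree on A, model prefers B
print(classify_gap(0.5, 1.0))   # humans split, model always says A
print(classify_gap(0.6, 0.55))  # humans split, model split too
```

The second branch is the interesting one: a model can match the majority answer on a split scenario every single time and still be flagged, because matching the majority is not the same as reflecting the disagreement.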
I’ll be honest: this is higher-quality work than I expected from a corporate research blog. The methodology is grounded in established psychology, they’re transparent about the limitations, and they’re not claiming to have solved alignment. They’re calling it “an early step,” which is refreshingly honest for a field that loves to overhype every incremental improvement.
That said, I have some reservations. The SJT format itself has known limitations in psychology: it measures what people say they would do, not what they actually do. That’s a problem when you’re trying to validate model behavior against human behavior, because you end up comparing a model’s stated choice against people’s stated choices, and neither one is observed behavior. Also, the sample of 550 annotators is fine for an initial study, but human behavioral dispositions vary enormously across cultures, age groups, and professional contexts. A model that aligns with this particular annotator pool might still fail badly with a different population.
The paper also doesn’t address what I think is the elephant in the room: models are trained to please users, not to be authentic. If a model detects that the user wants a particular behavioral disposition — say, high assertiveness in a negotiation scenario — it might shift its responses accordingly, even if that’s not its “default” disposition. The SJT format doesn’t really control for this kind of sycophancy, which is a well-known issue in LLM evaluation.
Still, I appreciate the direction. Most alignment research focuses on safety — avoiding harmful outputs, staying on topic, refusing dangerous requests. This paper is asking a different question: can models navigate the mundane, everyday social dynamics that make up most of human interaction? That’s harder to measure and arguably more important for practical utility.
If you’re building LLM-powered applications that interact with people in social contexts — customer service, coaching, team collaboration tools — this framework is worth studying. The paper is open access and the methodology is reproducible. I expect we’ll see follow-up work that addresses the cultural and contextual limitations, and maybe even starts to build models that genuinely understand social nuance rather than just mimicking it.
For now, the takeaway is simple: your LLM might be able to write a sonnet about empathy, but don’t ask it for relationship advice.