Simula: A Smarter Way to Generate Synthetic Data by Thinking First

We all know the drill: big AI models get smarter because they feast on internet-scale data. But what happens when you need a model for something niche, new, or privacy-sensitive? You hit a wall. Real-world data for those scenarios is expensive to collect, slow to curate, and often just not there.

Google Research’s latest paper, “Reasoning-Driven Synthetic Data Generation and Evaluation” (published in TMLR), tackles this head-on with a framework called Simula. The core idea is refreshingly different: instead of generating synthetic data sample by sample, they treat the whole dataset as a design problem.

From sample-level to dataset-level thinking

Most synthetic data methods today rely on manual prompts, evolutionary algorithms, or seed data from the target distribution. Those approaches have real problems. They don’t scale well because they depend on human effort or existing data. They’re black boxes, so you can’t explain why certain samples were generated. And control is messy: tweak one parameter and everything shifts.

Simula flips this. It reframes synthetic data generation as “mechanism design.” Instead of optimizing one data point at a time, you design the dataset as a whole. This gives you independent control over three key variables: coverage, complexity, and quality. You can dial up coverage to hit the long tail of a domain without accidentally making everything too complex.
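
To make the "independent control" point concrete, here is a minimal Python sketch. It is not Simula's implementation: the names DatasetSpec and plan_dataset, the topic strings, and the difficulty levels are all made up for illustration. It only shows the general idea that coverage (which topics get sampled) and complexity (how hard each sample should be) can be separate knobs at the dataset level.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSpec:
    """Hypothetical dataset-level spec; each knob is set independently."""
    topics: tuple              # coverage: which parts of the domain to hit
    complexity_weights: dict   # complexity: target mix of difficulty levels
    n_samples: int
    seed: int = 0


def plan_dataset(spec: DatasetSpec) -> list:
    """Return one (topic, complexity) request per sample to generate."""
    # Complexity gets its own random stream, so changing the topic list
    # cannot change the sequence of difficulty levels.
    level_rng = random.Random(spec.seed)
    levels = list(spec.complexity_weights)
    weights = list(spec.complexity_weights.values())
    plan = []
    for i in range(spec.n_samples):
        # Round-robin over topics: every topic is visited before any repeats.
        topic = spec.topics[i % len(spec.topics)]
        level = level_rng.choices(levels, weights=weights, k=1)[0]
        plan.append((topic, level))
    return plan


mix = {"easy": 1, "hard": 1}
narrow = DatasetSpec(topics=("phishing",), complexity_weights=mix, n_samples=8)
wide = DatasetSpec(
    topics=("phishing", "ransomware", "supply chain", "insider threat"),
    complexity_weights=mix,
    n_samples=8,
)
```

Planning both specs yields the same sequence of difficulty levels, even though the wide spec touches four topics instead of one. That is the property the paragraph above describes: more coverage without a side effect on complexity.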

Reasoning first, data second

The secret sauce is “reasoning-first.” Simula doesn’t start with random sampling. It uses reasoning models to map out the conceptual space of a target domain into deep, hierarchical taxonomies. Think of it as a scaffold for sampling.

The process is recursive: the system proposes sub-categories, evaluates them, merges duplicates, and filters weak ones. A critic model keeps things honest. The result is a dense taxonomy (a tree for cyber threat intelligence, say) that ensures your dataset covers the obscure edge cases, not just the obvious ones.
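
The propose, merge, filter loop described above can be sketched in a few lines of Python. This is my own toy version, not the paper's algorithm: the Node class, the expand function, and the two stand-in callables (toy_propose for the reasoning model, toy_score for the critic) are assumptions, and "merging duplicates" is reduced here to a case-insensitive exact match.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def expand(
    node: Node,
    propose: Callable[[str], List[str]],   # stand-in for a reasoning model
    score: Callable[[str, str], float],    # stand-in for a critic model
    depth: int,
    min_score: float = 0.5,
) -> Node:
    """Recursively grow a taxonomy: propose, merge duplicates, filter, recurse."""
    if depth == 0:
        return node
    seen = set()
    for candidate in propose(node.name):
        key = candidate.strip().lower()
        if key in seen:
            continue  # duplicate of an earlier proposal: keep the first one
        seen.add(key)
        if score(node.name, candidate) < min_score:
            continue  # critic rejects a weak sub-category
        child = Node(candidate.strip())
        node.children.append(expand(child, propose, score, depth - 1, min_score))
    return node


def leaves(node: Node) -> List[str]:
    """Collect leaf names; these are the slots a sampler would draw from."""
    if not node.children:
        return [node.name]
    return [leaf for child in node.children for leaf in leaves(child)]


# Toy data in place of model calls (hypothetical).
TOY_PROPOSALS = {
    "cyber threats": ["Malware", "malware", "Phishing", "Misc"],
    "Malware": ["Ransomware", "Spyware"],
    "Phishing": ["Spear phishing", "Smishing"],
}


def toy_propose(parent: str) -> List[str]:
    return TOY_PROPOSALS.get(parent, [])


def toy_score(parent: str, candidate: str) -> float:
    return 0.0 if candidate == "Misc" else 1.0


tree = expand(Node("cyber threats"), toy_propose, toy_score, depth=2)
print(leaves(tree))  # ['Ransomware', 'Spyware', 'Spear phishing', 'Smishing']
```

The duplicate "malware" is merged away and the vague "Misc" bucket is filtered out before the tree goes one level deeper. With real models behind propose and score, the same skeleton would apply, but the merge step would likely need semantic matching rather than string comparison.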

This is seedless and agentic. No need for human-curated examples to kick things off. And as reasoning models get better (which they will), Simula’s generation capabilities improve naturally. That’s a nice property — your synthetic data pipeline gets smarter over time without you having to rebuild it.

Four steps, fine-grained control

Simula breaks generation into four steps:

Global Diversification: Build the taxonomy scaffold. This is where you control coverage.
Local Specification: For each node in the taxonomy, specify what kind of data you want, such as complexity, format, and difficulty level.
Sample Generation: Generate individual samples using the specifications. This can use LLMs, diffusion models, or whatever fits.
Quality Control: Filter and refine generated samples against predefined quality metrics.
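
The four steps above can be sketched as a pipeline of pluggable functions. This is an illustration under my own assumptions, not code from the paper: run_pipeline and every toy_ component are hypothetical, the "generator" just formats a string where a real one would call a model, and quality control is reduced to an accept/reject filter with no refinement pass.

```python
import random
from typing import Callable, Dict, List

Spec = Dict[str, str]
Sample = Dict[str, str]


def run_pipeline(
    build_taxonomy: Callable[[], List[str]],         # 1. global diversification
    specify: Callable[[str, random.Random], Spec],   # 2. local specification
    generate: Callable[[Spec], Sample],              # 3. sample generation
    accept: Callable[[Sample], bool],                # 4. quality control
    samples_per_node: int = 2,
    seed: int = 0,
) -> List[Sample]:
    rng = random.Random(seed)
    dataset = []
    for node in build_taxonomy():
        for _ in range(samples_per_node):
            spec = specify(node, rng)
            sample = generate(spec)
            if accept(sample):
                dataset.append(sample)
    return dataset


# Toy components (all hypothetical); in practice each would wrap a model call.
def toy_taxonomy() -> List[str]:
    return ["ransomware", "spear phishing", "smishing"]


def toy_specify(node: str, rng: random.Random) -> Spec:
    return {"topic": node, "difficulty": rng.choice(["easy", "hard"]), "format": "qa"}


def toy_generate(spec: Spec) -> Sample:
    text = f"[{spec['difficulty']}] Question about {spec['topic']}"
    return {**spec, "text": text}


def toy_accept(sample: Sample) -> bool:
    return len(sample["text"]) > 10


dataset = run_pipeline(toy_taxonomy, toy_specify, toy_generate, toy_accept)

# Swapping one component leaves the other three untouched.
hard_only = run_pipeline(
    toy_taxonomy, toy_specify, toy_generate,
    accept=lambda s: s["difficulty"] == "hard",
)
```

Because each step is just a function argument, replacing the quality filter (as hard_only does) requires no change to the taxonomy, the specification step, or the generator.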

What I like about this decomposition is that you can swap out components. Don’t like the underlying model? Replace it. Need a different quality metric? Plug it in. It’s modular in a way that most synthetic data pipelines aren’t.

The real-world angle

Google positions this as a solution for “privacy-sensitive or data-scarce domains.” That’s fair, but I’d argue the real win is for domains where you need to proactively generate edge cases: think safety testing, rare disease diagnosis, or fraud detection. You don’t want to wait for failures to happen in the wild. Simula lets you stress-test systems against scenarios that haven’t occurred yet.

The paper also points out that synthetic data enables “programmable workflows.” Treat data like code: version it, reproduce it, inspect it. That’s a massive operational improvement over static real-world datasets that slow down development cycles.
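
Here is one way "treat data like code" could look in practice, as a hedged Python sketch. Nothing here comes from the paper: dataset_version, generate_dataset, and the config keys are names I made up, and the generator is a toy that draws from a seeded random stream. With a real model in the loop, reproducibility would also depend on pinning the model version and using deterministic decoding.

```python
import hashlib
import json
import random


def dataset_version(config: dict) -> str:
    """Short, stable ID derived from the generation config (key order ignored)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def generate_dataset(config: dict) -> list:
    """Toy deterministic generator standing in for a real pipeline."""
    rng = random.Random(config["seed"])
    return [
        {
            "id": i,
            "topic": rng.choice(config["topics"]),
            "difficulty": rng.choice(config["difficulties"]),
        }
        for i in range(config["n_samples"])
    ]


config = {
    "seed": 7,
    "topics": ["ransomware", "smishing", "insider threat"],
    "difficulties": ["easy", "hard"],
    "n_samples": 5,
}

version = dataset_version(config)      # version it
first = generate_dataset(config)
second = generate_dataset(config)      # reproduce it: same config, same rows
print(version, first == second)
for row in first:                      # inspect it
    print(row)
```

The config is the source of truth: commit it alongside your training code, and the version ID tells you which dataset a given model was trained on.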

Not without caveats

Simula is clever, but it’s not magic. The quality of the generated data still depends on the underlying reasoning model. If your model’s reasoning is weak, your taxonomies will be shallow and your samples will be mediocre. There’s also the question of evaluation: how do you know your synthetic dataset is actually representative of the real world? The paper proposes metrics, but I’d want to see more empirical validation on downstream tasks.

Still, this is a thoughtful step forward. Most synthetic data research focuses on generation techniques. Simula focuses on dataset design. That shift in perspective is worth paying attention to.

If you’re building AI for a niche domain and hitting data scarcity, this framework is worth a look. It’s not a silver bullet, but it’s a smarter way to think about the problem.