Google's New Gemini 3.1 Flash TTS Lets You Direct AI Speech Like a Director

Google just dropped Gemini 3.1 Flash TTS, and honestly, this is the most interesting text-to-speech release I’ve seen from them in a while. Not because the voice quality got a bump — though it did — but because they finally let you reach into the audio engine and tweak things with natural language commands.

What’s new beyond better voice quality

The model itself sounds more natural than previous versions. On the Artificial Analysis TTS leaderboard, which runs thousands of blind human preference tests, 3.1 Flash TTS scored an Elo of 1,211. That’s solid enough to land it in what Artificial Analysis calls the “most attractive quadrant” for the balance of quality and cost. But raw quality isn’t the headline here.

The real story is audio tags. You can now embed instructions directly into the text input to control vocal style, pacing, and delivery. Think of it like stage directions for AI speech. You write the script, drop in tags like [whisper] or [fast] or [excited], and the model follows. It’s not a new idea — some smaller TTS tools have tried similar approaches — but Google’s implementation feels more robust because it’s baked into a foundation model that already handles 70+ languages natively.

How audio tags actually work in practice

I got my hands on the preview in Google AI Studio. The workflow is dead simple: you type or paste your text, insert tags wherever you want the delivery to shift, and hit generate. The model interprets the tags and adjusts prosody, speed, and emotional tone accordingly. It supports multi-speaker dialogue natively, so you can tag different characters in a conversation and get distinct voices without setting up separate configurations.
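To make the multi-speaker idea concrete, here is a minimal sketch of how a tagged dialogue script could be assembled before you paste it into AI Studio. It is plain string building with no API call. The `Line` class, the `build_dialogue` helper, the speaker names, and the `Name: text` layout are all my own assumptions for illustration; only the square-bracket tag style comes from the examples above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    """One line of dialogue (hypothetical structure for this sketch)."""
    speaker: str
    text: str
    tag: Optional[str] = None  # e.g. "whisper"; None means no delivery tag


def build_dialogue(lines: List[Line]) -> str:
    """Render one 'Speaker: [tag] text' row per line of dialogue."""
    rows = []
    for line in lines:
        prefix = f"[{line.tag}] " if line.tag else ""
        rows.append(f"{line.speaker}: {prefix}{line.text}")
    return "\n".join(rows)


script = build_dialogue([
    Line("Ana", "Did you hear that?", tag="whisper"),
    Line("Ben", "We need to leave, now!", tag="excited"),
    Line("Ana", "Fine. After you."),
])
print(script)
```

Keeping the script as structured data first means you can change a character’s delivery in one place and regenerate the text, instead of hand-editing tags in a long transcript.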

Here’s a quick example from the demo: a single paragraph with alternating [calm] and [urgent] tags produces two clearly different vocal tones. The transition isn’t jarring either — it’s smooth enough to pass for a human reading with intention. That’s the kind of control developers have been asking for, and it’s refreshing to see Google ship it without overcomplicating the interface.
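If you want to reproduce that kind of alternating paragraph yourself, the input might be put together along these lines. This is a rough sketch, not the demo’s actual text: the sentences are invented, `tag_segments` is a hypothetical helper, and I’m assuming each tag applies to the text that follows it until the next tag appears.

```python
import re


def tag_segments(segments):
    """Join (tag, text) pairs into a single tagged paragraph."""
    return " ".join(f"[{tag}] {text.strip()}" for tag, text in segments)


paragraph = tag_segments([
    ("calm", "The flight is on schedule."),
    ("urgent", "Boarding closes in two minutes."),
    ("calm", "Your gate is B12."),
])

# Quick sanity check: list the tags in the order they appear.
tags = re.findall(r"\[(\w+)\]", paragraph)

print(paragraph)
print(tags)
```

The regex pass at the end is only a local check that the tags landed where you intended before you spend a generation on it.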

Availability and the SynthID watermark

As of today, Gemini 3.1 Flash TTS is rolling out in preview across three surfaces:

Gemini API and Google AI Studio for developers
Vertex AI for enterprise customers
Google Vids for Workspace users

Every piece of generated audio carries a SynthID watermark. That’s Google’s inaudible digital signature that tags content as AI-generated. It’s not foolproof against determined bad actors, but it’s a meaningful step for accountability, especially as synthetic voices become nearly indistinguishable from human recordings.

One thing that bugs me

Google positions this as “the next generation of expressive AI speech,” and the tech is genuinely impressive. But the preview is limited. You can’t export the fine-tuned voice settings easily — you have to work inside AI Studio or the API. If you want to use the same voice configuration across different applications, you’re stuck rebuilding it each time. That’s a workflow blocker for anyone building production systems. I’d like to see saved voice profiles in a future update.
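Until something like that exists, one stopgap is to keep your own profile file and apply it to text before sending it anywhere. The sketch below is purely a local workaround and not a Gemini feature: the JSON layout, the `name` and `default_tags` fields, and all three helper functions are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_profile(path: Path, profile: dict) -> None:
    """Write the profile to disk as pretty-printed JSON."""
    path.write_text(json.dumps(profile, indent=2), encoding="utf-8")


def load_profile(path: Path) -> dict:
    """Read a profile back from disk."""
    return json.loads(path.read_text(encoding="utf-8"))


def apply_profile(profile: dict, text: str) -> str:
    """Prefix the text with the profile's default tags, in order."""
    prefix = "".join(f"[{tag}] " for tag in profile["default_tags"])
    return prefix + text


# Hypothetical profile: both field names are made up for this sketch.
narrator = {"name": "narrator", "default_tags": ["calm"]}

with tempfile.TemporaryDirectory() as tmp:
    profile_path = Path(tmp) / "narrator.json"
    save_profile(profile_path, narrator)
    loaded = load_profile(profile_path)

tagged = apply_profile(loaded, "Welcome back.")
print(tagged)
```

Because the profile is just a JSON file, it can live in version control and be shared across applications, which covers at least the tag side of the problem.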

Bottom line

Gemini 3.1 Flash TTS isn’t just a quality bump. The audio tags give developers real, granular control over AI speech in a way that feels natural rather than technical. Combined with 70+ language support and SynthID watermarking, this is a strong release. Go play with it in AI Studio and see if you can break the tag system — I’m curious how far you can push it before the model gets confused.
