Google Research just dropped WAXAL, a big open dataset for speech tech in 27 African languages. I’ve been watching the speech tech gap for years: most voice assistants and transcription tools work great in English, Mandarin, or Spanish, but if you speak a language from Sub-Saharan Africa, you’re basically locked out. The region is home to over 2,000 languages and hundreds of millions of speakers.
WAXAL is their attempt to fix that. It’s been in the works since 2021, and they collaborated with African universities and community groups to make it happen. The dataset comes in two parts: WAXAL-ASR with about 1,846 hours of transcribed natural speech for automatic speech recognition, and WAXAL-TTS with over 565 hours of high-fidelity recordings for text-to-speech synthesis. That’s a lot of data, especially considering how scarce African language speech datasets usually are.
What I like about WAXAL-ASR is that they didn’t just have people read scripted sentences. Instead, participants described images from Google’s Open Images dataset, covering 50+ topics, in unscripted, spontaneous speech. This captures real speech patterns, including tonal variations and code-switching, which scripted recordings often miss. That’s smart, because many African languages are tonal and meaning often depends heavily on context.
The TTS side is equally interesting. Local community members worked in pairs, drafting 10,000–20,000 word scripts, then alternating between reading and recording. Some even built custom studio boxes with project funding to get professional-grade acoustics. That level of community involvement isn’t common in big tech datasets, and it shows in the quality.
Both datasets are released under a Creative Commons CC-BY-4.0 license, which means anyone can use them for research or commercial projects, as long as they give credit. That’s huge. Most speech datasets are locked behind restrictive licenses or paywalls. WAXAL is genuinely open.
Now, 27 languages is a start, but Africa has over 2,000. WAXAL covers languages like Swahili, Yoruba, Igbo, Hausa, Amharic, Zulu, and others, which together are spoken by over 100 million people across 26+ countries. It’s still a fraction of what’s needed. Google says they intend to expand it over time, which is good, because the gap is enormous.
One caveat: the dataset is large, but not huge by global standards. For comparison, English ASR datasets can run 10,000+ hours. 1,846 hours across 27 languages averages out to about 68 hours per language if the hours were split evenly. That’s enough to train decent models, but probably not state-of-the-art ones without augmentation. Still, it’s a massive improvement over the near-zero that existed before.
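To make that back-of-the-envelope math concrete, here is a minimal Python sketch. It only uses the figures quoted above (1,846 ASR hours, 27 languages, a rough 10,000-hour English corpus) and assumes the hours are split evenly across languages, which the announcement does not claim. The function and variable names are mine, purely for illustration.

```python
# Figures quoted in this post. The even split across languages is an
# assumption for illustration, not something the announcement states.
ASR_HOURS = 1846
NUM_LANGUAGES = 27
ENGLISH_REFERENCE_HOURS = 10_000  # the rough "10,000+ hours" comparison point


def average_hours(total_hours: float, num_languages: int) -> float:
    """Return mean hours per language, assuming an even split."""
    if num_languages <= 0:
        raise ValueError("num_languages must be positive")
    return total_hours / num_languages


avg = average_hours(ASR_HOURS, NUM_LANGUAGES)
ratio = ENGLISH_REFERENCE_HOURS / avg

print(f"{avg:.1f} hours per language on average")
print(f"a 10,000-hour corpus is roughly {ratio:.0f}x that")
```

In practice the per-language totals will differ, so some languages will sit above this average and others below it. The real breakdown is worth checking before planning a training run.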
The paper and dataset are linked in their announcement. If you’re working on African language speech tech, this is probably the most important resource to drop this year. I’m curious to see what the research community builds with it.