Google Research just released something that actually made me stop scrolling: Groundsource. It’s a framework that takes unstructured news reports and, using Gemini, turns them into structured, historical disaster data. The first output is a dataset of 2.6 million flash flood events spanning more than 150 countries, going back to the year 2000.
Let me be direct: this is the kind of applied AI work that doesn’t get enough attention. Everyone’s obsessed with the next multimodal model or whatever chatbot can write poetry. Meanwhile, there’s a genuine data crisis in climate modeling, particularly for fast-moving disasters like flash floods.
Why flash floods are a data desert
Earthquakes have global sensor networks. Hurricanes get tracked by satellites and aircraft. But flash floods? They’re notoriously underdocumented. The existing archives have real problems:
- The Global Flood Database (GFD) relies on satellites, which means cloud cover blocks the view and revisit times miss short-duration events.
- The Dartmouth Flood Observatory (DFO) is valuable but tends to capture only large, long-lasting floods.
- GDACS, a UN-European Commission system, has about 10,000 entries. That sounds like a lot until you realize it’s mostly high-impact events.
Ten thousand records is a drop in the bucket for training global-scale AI. If you’re trying to build a model that can predict flash floods in Jakarta, Nairobi, or Houston, you need orders of magnitude more data. And you need it to include the small, localized events that never make it into official hazard databases.
This is where Groundsource comes in.
How Groundsource works
The core idea is simple: news articles contain a massive amount of unstructured information about historical events. Government reports, local bulletins, even wire service dispatches. The problem is scale. No human team could read millions of articles and extract structured data consistently.
Groundsource uses Gemini to do exactly that. It processes news reports, identifies flood-related events, and extracts location, date, severity, and other relevant details. The output is a structured record for each event. The system then validates and cross-references entries to filter out noise and duplicates.
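To make the "structured record, then filter duplicates" step concrete, here’s a minimal sketch in Python. To be clear, this is my own illustration, not Google’s code: the `FloodRecord` fields, the `normalize` helper, and the one-day merge window are all assumptions, and the real pipeline presumably does something far more sophisticated (geocoding, fuzzy matching, and so on).

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class FloodRecord:
    # Hypothetical schema; the actual Groundsource fields may differ.
    location: str
    event_date: date
    severity: str
    source: str


def normalize(location: str) -> str:
    """Lowercase and collapse whitespace so ' Jakarta' matches 'jakarta'."""
    return " ".join(location.lower().split())


def deduplicate(records: list[FloodRecord], window_days: int = 1) -> list[FloodRecord]:
    """Keep the earliest record per location among reports dated close together.

    Assumption: two reports describe the same event if they share a
    normalized location and their dates are at most window_days apart.
    """
    kept: list[FloodRecord] = []
    last_seen: dict[str, date] = {}
    ordered = sorted(records, key=lambda r: (normalize(r.location), r.event_date))
    for rec in ordered:
        key = normalize(rec.location)
        previous = last_seen.get(key)
        if previous is None or rec.event_date - previous > timedelta(days=window_days):
            kept.append(rec)
        # Track the latest report so a chain of follow-up articles stays merged.
        last_seen[key] = rec.event_date
    return kept
```

Feed it two Jakarta reports dated a day apart, a third Jakarta report a week later, and one Nairobi report, and three records survive: the two close Jakarta reports collapse into one. Exact string matching on location is the weak point here; it’s only meant to show the shape of the problem.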
The result? 2.6 million records. That’s not a typo. Against GDACS’s roughly 10,000 entries, that’s about 260 times the size of the inventory.
What I find particularly clever is that this approach captures the local events that satellite-based systems miss. A flash flood that inundates a single neighborhood in a developing country might never trigger a satellite alert, but it will likely be reported in local news. Groundsource picks that up.
What’s in the dataset
The first Groundsource dataset covers urban flash floods from 2000 to 2025, spanning 150+ countries. It’s openly available, which is the right call. Google could have kept this proprietary, but they’re releasing it under an open-access license.
The chart they published shows exponential growth in digitized news, and a corresponding rise in the flood events captured by the pipeline, with the greatest density in the 2020-2025 period. That makes sense: more news is being published digitally, and more of it is accessible to the system. It also means the rise in recorded events reflects growth in coverage, not necessarily a rise in the number of floods.
I would have liked to see more validation metrics. How accurate is the extraction? What’s the false positive rate? The paper addresses some of this, but I’d want to see independent verification before I’d trust the dataset for critical applications. Still, even with some noise, 2.6 million records is a transformative resource.
The bigger picture
Groundsource isn’t just about floods. The methodology is generalizable. The same approach could be applied to other hazards: wildfires, heatwaves, landslides, even disease outbreaks. If you can find news reports about it, you can probably build a historical dataset.
This is the kind of thing that makes AI genuinely useful. Not generating marketing copy or summarizing meetings, but filling critical data gaps that have real human consequences. Better historical data means better predictive models, which means more accurate warnings, which means fewer people die.
I’m also interested in the downstream implications. Insurance companies, urban planners, and emergency response organizations all need this kind of data. Having it openly available could level the playing field for researchers and organizations in developing countries that can’t afford expensive proprietary datasets.
There are obvious limitations. News coverage isn’t uniform across the world. A flood in a wealthy country with a robust media ecosystem will generate more reports than an equivalent event in a region with limited press freedom or fewer news outlets. The dataset will have geographic biases baked in. The paper acknowledges this, but it’s worth keeping in mind.
Still, this is a net positive. I’ve seen too many AI projects that solve problems nobody has. Groundsource addresses a genuine, well-documented need. If you work in climate modeling, hydrology, or disaster risk reduction, go check out the dataset. This is the kind of research that actually moves the needle.