If you’ve been following Arabic LLM evaluation for a while, you’ve probably felt the same unease I have. More benchmarks keep popping up, more leaderboards rank models, but something feels off. Are we actually measuring Arabic language capability, or just measuring how well models game flawed tests?
A team from TII built QIMMA (قمّة, Arabic for “summit”) to answer that question properly. Instead of the usual approach — grab some benchmarks, run models, publish scores — they did something painfully obvious that nobody else bothered to do: they checked whether the benchmarks themselves were any good first.
What they found should worry anyone building on Arabic LLMs
The validation pipeline is straightforward but thorough. Two strong LLMs — Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B — independently score every sample against a 10-point rubric. Samples that fail get flagged. If both models agree it’s bad, it’s out. If only one flags it, human annotators — native Arabic speakers — make the final call.
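To make the decision rule concrete, here is a minimal sketch of the triage step as I read it from the description above. It is not QIMMA's actual code: the names `Verdict`, `triage`, and `PASS_THRESHOLD` are hypothetical, and the cutoff of 7 out of 10 is a placeholder I picked for illustration, since the pass mark isn't stated here.

```python
from enum import Enum


class Verdict(Enum):
    KEEP = "keep"
    DROP = "drop"
    HUMAN_REVIEW = "human_review"


# Hypothetical pass mark on the 10-point rubric; the real cutoff is an assumption.
PASS_THRESHOLD = 7


def triage(score_a: int, score_b: int, threshold: int = PASS_THRESHOLD) -> Verdict:
    """Combine two independent judge scores into a single decision.

    score_a and score_b are the rubric scores (0-10) from the two judge models.
    """
    flagged_a = score_a < threshold
    flagged_b = score_b < threshold

    if flagged_a and flagged_b:
        # Both judges flag the sample: discard it.
        return Verdict.DROP
    if flagged_a or flagged_b:
        # The judges disagree: a native-speaker annotator makes the final call.
        return Verdict.HUMAN_REVIEW
    # Neither judge flags the sample: keep it.
    return Verdict.KEEP


if __name__ == "__main__":
    for scores in [(9, 8), (3, 4), (9, 4)]:
        print(scores, triage(*scores).value)
```

The appeal of this design is that human time is spent only on disagreements, which is presumably a small slice of 52,000+ samples.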
This matters because even widely used Arabic benchmarks have systematic problems: translation artifacts from English originals, annotation inconsistencies, encoding errors, and culturally misaligned questions. It is the kind of noise that quietly inflates or deflates model scores without anyone noticing.
What’s actually in QIMMA
The final suite is substantial: 109 subsets from 14 source benchmarks, totaling over 52,000 samples across 7 domains (cultural knowledge, STEM, legal, medical, safety, poetry and literature, and coding). The content is 99% native Arabic; the only exception is code evaluation, which is language-agnostic by nature.
It’s also the first Arabic leaderboard to include code evaluation, using Arabic-adapted versions of HumanEval+ and MBPP+. That’s a genuinely useful addition if you care about practical model capability rather than just multiple-choice trivia.
The uncomfortable part
The fact that QIMMA found systematic quality issues in established benchmarks isn’t surprising to anyone who’s worked with Arabic NLP data. But it’s still uncomfortable to see it quantified. These aren’t edge cases — they’re recurring problems in resources the community has been treating as ground truth.
The full paper goes into the specific failure patterns, and honestly, some of it reads like a catalog of everything that can go wrong when you translate benchmarks without cultural adaptation. Questions that make sense in English become nonsensical in Arabic. Answers that are correct in one dialect are wrong in another. Encoding issues that silently corrupt evaluation runs.
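The encoding problem is the easiest of these to check for yourself. Below is a rough sketch of the kind of sanity check I mean. It is my own heuristic, not something taken from the QIMMA pipeline, and the function name `encoding_red_flags` and the flag labels are made up for illustration. It looks for three common symptoms: the Unicode replacement character, the "Ø"/"Ù" pattern that appears when Arabic UTF-8 bytes are decoded as Latin-1, and text that is not NFC-normalized.

```python
import unicodedata


def encoding_red_flags(text: str) -> list[str]:
    """Return heuristic warnings about likely encoding damage in a sample."""
    flags = []

    # U+FFFD appears when a decoder hit bytes it could not interpret.
    if "\ufffd" in text:
        flags.append("replacement_character")

    # Arabic letters start with bytes 0xD8-0xDB in UTF-8; decoded as Latin-1,
    # 0xD8 and 0xD9 show up as "Ø" and "Ù". Their presence in supposedly
    # Arabic text is a strong hint of mojibake (heuristic, not proof).
    if "Ø" in text or "Ù" in text:
        flags.append("possible_mojibake")

    # Un-normalized text can make string-equality answer matching fail silently.
    if unicodedata.normalize("NFC", text) != text:
        flags.append("not_nfc_normalized")

    return flags


if __name__ == "__main__":
    clean = "\u0642\u0645\u0651\u0629"  # the Arabic word for "summit"
    broken = clean.encode("utf-8").decode("latin-1")  # simulate a bad decode
    print(encoding_red_flags(clean))
    print(encoding_red_flags(broken))
```

A check like this won't catch everything, but it is cheap enough to run over a whole benchmark before trusting any score computed from it.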
What this means for practitioners
If you’re building Arabic applications on top of LLMs, QIMMA’s rankings are probably more reliable than anything else available right now. But more importantly, the methodology itself should make you think twice about trusting any benchmark score that hasn’t been validated.
The leaderboard is open source, the outputs are public, and the validation pipeline is documented. That’s more than I can say for most Arabic evaluation efforts. Whether other leaderboards will adopt similar quality checks — or keep publishing questionable numbers — remains to be seen.
I’d bet on the latter until the community starts demanding better.