Granite 4.1 LLMs: A Hands-On Look at How IBM Built Them

IBM’s Granite team just dropped a detailed technical walkthrough of how they built the Granite 4.1 LLMs, and honestly, it’s refreshing to see a company actually talk about data curation instead of just bragging about compute budgets.

Granite 4.1 is a family of dense, decoder-only models at 3B, 8B, and 30B parameters, trained from scratch on roughly 15 trillion tokens. The big claim here? The 8B instruct model matches or beats the previous Granite 4.0-H-Small (a 32B-A9B MoE model, meaning 32B total parameters with about 9B active per token) despite being simpler and smaller. That’s not nothing.

Architecture Choices

The architecture isn’t revolutionary, but it’s solid: Grouped Query Attention, Rotary Position Embeddings, SwiGLU activations, RMSNorm, and shared input/output embeddings. The 3B uses 40 layers with embedding size 2560, while the 8B and 30B both use an embedding size of 4096 but differ in layer count (40 and 64, respectively). KV heads stay at 8 across all sizes, which keeps the KV cache, and therefore inference memory, manageable.
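To make the memory point concrete, here is a rough back-of-the-envelope sketch of KV cache size under Grouped Query Attention. The layer counts and the 8 KV heads come from the numbers above; the head dimension of 128 and the 2-byte (bf16/fp16) cache entries are my assumptions, not something the writeup states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    name: str
    num_layers: int
    hidden_size: int
    num_kv_heads: int


# Shapes as quoted in the text above (8B = 40 layers, 30B = 64 layers).
SHAPES = {
    "3B": ModelShape("3B", 40, 2560, 8),
    "8B": ModelShape("8B", 40, 4096, 8),
    "30B": ModelShape("30B", 64, 4096, 8),
}


def kv_cache_bytes(shape: ModelShape, seq_len: int,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence.

    head_dim=128 and bytes_per_value=2 are ASSUMPTIONS for illustration.
    The leading factor of 2 accounts for storing both K and V.
    """
    per_token = 2 * shape.num_layers * shape.num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len


if __name__ == "__main__":
    for shape in SHAPES.values():
        gib = kv_cache_bytes(shape, seq_len=128 * 1024) / 2**30
        print(f"{shape.name}: {gib:.1f} GiB of KV cache at 128K tokens")
```

Under those assumptions the 8B model needs about 20 GiB of cache for a single 128K-token sequence. The cache scales linearly with the number of KV heads, so keeping that count at 8 is what stops it from growing with model width.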

What’s more interesting is how they handled training. Instead of one monolithic run, they split pre-training into five distinct phases.

Phase 1: General Pre-Training (10T tokens)

Phase 1 is the standard stuff: ~59% CommonCrawl, 20% code, 7% math, 10.5% technical content, 2% multilingual, and 1.5% domain-specific. Power learning rate schedule with warmup. Nothing flashy, but it establishes broad language understanding.
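If you want to see what those percentages mean in tokens, the arithmetic is easy to sketch. The source names, the `token_budget` and `sample_sources` helpers, and the idea of drawing each training document's source by weight are my own illustration; the writeup only gives the (approximate) percentages and the 10T total.

```python
import math
import random

PHASE1_TOKENS = 10e12  # 10T tokens, per the phase heading

# Approximate shares as quoted above; the keys are hypothetical labels.
PHASE1_MIX = {
    "commoncrawl": 0.59,
    "code": 0.20,
    "math": 0.07,
    "technical": 0.105,
    "multilingual": 0.02,
    "domain_specific": 0.015,
}


def token_budget(mix: dict[str, float], total_tokens: float) -> dict[str, float]:
    """Split a token budget across sources according to mixture weights."""
    total_weight = sum(mix.values())
    if not math.isclose(total_weight, 1.0, abs_tol=1e-9):
        raise ValueError(f"mixture weights sum to {total_weight}, expected 1.0")
    return {name: weight * total_tokens for name, weight in mix.items()}


def sample_sources(mix: dict[str, float], n: int, seed: int = 0) -> list[str]:
    """Draw n source names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(mix)
    return rng.choices(names, weights=[mix[name] for name in names], k=n)


if __name__ == "__main__":
    for name, tokens in token_budget(PHASE1_MIX, PHASE1_TOKENS).items():
        print(f"{name:16s} {tokens / 1e12:.2f}T tokens")
    print(sample_sources(PHASE1_MIX, n=8))
```

Even at 7%, math gets on the order of 0.7T tokens in this phase, which is a useful baseline to keep in mind when reading the later mixes.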

Phase 2: Math/Code Pivot (2T tokens)

This is where things get interesting. They crank up math from 7% to 35% (a 5x increase) and code from 20% to 30%. CommonCrawl drops to 12%, but it’s a high-quality subset. They also start introducing 9% synthetic data. The goal is obvious: push reasoning capabilities early.

Phase 3: High-Quality Annealing (2T tokens)

Phase 3 marks the shift to mid-training with an exponential decay learning rate. The data mix becomes more balanced: CommonCrawl-HQ, math, and code each at ~16.67%, but now they’re blending in 12.5% long chain-of-thought data and 12% instruction tuning data (language + code). This is where the model starts learning to follow instructions.

Phase 4: Refinement (0.5T tokens)

Phase 4 is a shorter refinement stage with linear LR decay to zero. CommonCrawl-HQ jumps to 40%, code and math sit at 20% each, and they keep some chain-of-thought and instruction data. It’s basically a quality-focused polish.
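The three learning rate shapes mentioned so far (power schedule with warmup in Phase 1, exponential decay in Phase 3, linear decay to zero in Phase 4) are easier to compare side by side. This is a sketch of the generic shapes only: the exact functional forms, the exponent, and every hyperparameter below are assumptions of mine, not values from IBM's writeup.

```python
def warmup_power(step: int, peak_lr: float, warmup_steps: int,
                 exponent: float = 0.5) -> float:
    """Linear warmup to peak_lr, then power-law decay (assumed form)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr * (warmup_steps / (step + 1)) ** exponent


def exponential_decay(step: int, total_steps: int,
                      start_lr: float, end_lr: float) -> float:
    """Geometric interpolation from start_lr down to end_lr."""
    frac = min(step / total_steps, 1.0)
    return start_lr * (end_lr / start_lr) ** frac


def linear_to_zero(step: int, total_steps: int, start_lr: float) -> float:
    """Straight-line decay that reaches exactly zero at total_steps."""
    frac = min(step / total_steps, 1.0)
    return start_lr * (1.0 - frac)


if __name__ == "__main__":
    # Illustrative numbers only.
    for step in (0, 50, 100, 500, 1000):
        print(
            step,
            f"{warmup_power(step, 3e-4, warmup_steps=100):.2e}",
            f"{exponential_decay(step, 1000, 3e-4, 3e-5):.2e}",
            f"{linear_to_zero(step, 1000, 3e-5):.2e}",
        )
```

The practical difference is in the tail: a power-law schedule never really finishes, which suits a long general phase, while the linear schedule drives the rate all the way to zero so the final checkpoint settles.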

Phase 5: Long Context Extension (LCE)

The final phase extends the context window from 4K to 512K tokens through staged extensions: 32K, then 128K, then 512K. For 128K and 512K, they use 80% books + 20% code repository data (only for 8B and 30B). After each stage, they do a model merge to preserve short-context performance. RULER benchmarks show the 8B base hitting 83.6 at 32K, 79.1 at 64K, and 73.0 at 128K. Not bad for a dense model.
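The writeup doesn't spell out how the merge is done, so treat the following as one plausible reading rather than IBM's method: a plain linear interpolation between the pre-extension checkpoint and the long-context checkpoint. The `merge_checkpoints` helper, the `alpha` weight, and the use of plain Python lists in place of real tensors are all assumptions for illustration.

```python
def merge_checkpoints(base: dict[str, list[float]],
                      extended: dict[str, list[float]],
                      alpha: float = 0.5) -> dict[str, list[float]]:
    """Interpolate two checkpoints parameter by parameter.

    Result = (1 - alpha) * base + alpha * extended.
    alpha=0 keeps the short-context weights, alpha=1 keeps the
    long-context weights. ASSUMPTION: simple linear merging.
    """
    if base.keys() != extended.keys():
        raise ValueError("checkpoints must have the same parameter names")
    merged = {}
    for name in base:
        a, b = base[name], extended[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]
    return merged


if __name__ == "__main__":
    short_ctx = {"layer0.weight": [0.0, 2.0]}
    long_ctx = {"layer0.weight": [2.0, 4.0]}
    print(merge_checkpoints(short_ctx, long_ctx, alpha=0.5))
```

The intuition is that the merged weights stay close to a model that was already good at short contexts while picking up some of what the extension stage learned.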

Supervised Fine-Tuning

For instruction tuning, they curated ~4.1 million samples using an LLM-as-Judge framework. This is becoming standard practice, but the scale here is notable. They specifically focused on math, coding, instruction following, and general chat.
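For readers who haven't seen the pattern, LLM-as-Judge filtering boils down to scoring each candidate sample and keeping the ones above a threshold. Here is a minimal sketch of that loop. Everything in it is hypothetical: the `filter_with_judge` helper, the 0-to-1 score scale, the threshold, and especially `toy_judge`, which is a trivial stand-in for a real model call and says nothing about the judge IBM used.

```python
from typing import Callable

Sample = dict[str, str]  # expects "prompt" and "response" keys
Judge = Callable[[str, str], float]  # returns a quality score in [0, 1]


def filter_with_judge(samples: list[Sample], judge: Judge,
                      min_score: float) -> list[dict]:
    """Keep samples whose judge score is at least min_score."""
    kept = []
    for sample in samples:
        score = judge(sample["prompt"], sample["response"])
        if score >= min_score:
            kept.append({**sample, "judge_score": score})
    return kept


def toy_judge(prompt: str, response: str) -> float:
    """PLACEHOLDER judge: rewards longer answers, capped at 1.0.

    A real pipeline would call a language model with a rubric here.
    """
    return min(len(response.split()) / 5.0, 1.0)


if __name__ == "__main__":
    candidates = [
        {"prompt": "What is 2 + 2?", "response": "4"},
        {"prompt": "What is 2 + 2?", "response": "Adding two and two gives four."},
    ]
    for sample in filter_with_judge(candidates, toy_judge, min_score=0.8):
        print(sample)
```

Passing the judge in as a function keeps the filtering logic independent of whichever model and rubric end up doing the scoring.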

Reinforcement Learning

They used on-policy GRPO with DAPO loss (from Yu et al., 2025). GRPO drops PPO’s separate value model and instead scores each sampled response relative to the other responses in its group, which makes the training setup simpler. The multi-stage RL pipeline is designed to strengthen specific capabilities one at a time without regressing on the others.
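The core trick in GRPO is small enough to show in a few lines: sample several responses per prompt, then normalize each reward against the group's mean and standard deviation to get an advantage, with no learned critic involved. The sketch below also includes an asymmetrically clipped per-token objective in the spirit of DAPO. The clip values and the use of population standard deviation are illustrative assumptions on my part, and nothing here reflects Granite's actual training configuration.

```python
import statistics


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_token_objective(ratio: float, advantage: float,
                            eps_low: float = 0.2,
                            eps_high: float = 0.28) -> float:
    """PPO-style clipped surrogate with separate lower and upper bounds.

    ratio is new_policy_prob / old_policy_prob for one token.
    eps_low and eps_high are ILLUSTRATIVE values.
    """
    clipped_ratio = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 0.0]  # e.g. pass/fail from a verifier
    advantages = group_relative_advantages(rewards)
    print(advantages)
    print(clipped_token_objective(ratio=1.5, advantage=advantages[0]))
```

One consequence worth noticing: if every response in a group gets the same reward, all advantages are zero and that prompt contributes no gradient.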

My Take

Granite 4.1 isn’t trying to compete with GPT-4 or Claude. It’s a family of efficient, open-source models (Apache 2.0) designed for enterprise deployment. The 8B model matching a 32B MoE is genuinely impressive, even allowing for the fact that the MoE only activates about 9B parameters per token, and it suggests that data quality can compensate for raw parameter count.

The five-phase training pipeline is over-engineered for most teams, but the core lesson is clear: progressive data refinement works. Start broad, pivot to reasoning, then anneal with high-quality data, and finally extend context. It’s a blueprint that smaller teams could adapt with fewer phases.

One thing I wish they’d covered more: the LLM-as-Judge framework details. What model did they use? What criteria? That’s the kind of practical info that helps others replicate the approach.

Still, this is a solid release from IBM. No hype, just good engineering.