OpenMed, the open-source healthcare AI group, just published Part II of their protein engineering series. They built a pipeline that goes from a protein concept to synthesis-ready DNA in an afternoon. And the mRNA models behind it, which cover 25 species, cost $165 in compute to train.
That number caught my attention. Not because $165 is cheap for training (it is), but because they actually compared architectures instead of just picking the latest shiny thing. Most bio-AI papers skip this. They pick one model, train it, publish. OpenMed ran the experiment properly.
What They Built
The pipeline has three stages:
- Protein folding: ESMFold v1 on 30 protein chains. Average pTM of 0.79. Nothing groundbreaking here, but it works as a batch pipeline.
- Sequence design: ProteinMPNN on scaffold 7K00. 42% sequence recovery. Again, established tools.
- mRNA optimization: This is where they actually did original work. Trained multiple transformer variants on 250k coding sequences, then scaled to 381k sequences across 25 species.
The folding and design parts are basically glue code around existing models. The mRNA optimization is the real contribution. And it’s where things get interesting.
The Architecture Shootout
They tested five models head-to-head on codon-level language modeling. Codons are nucleotide triplets, so the vocabulary is a 64-token alphabet, and each species has its own usage biases among synonymous codons. That makes the task different from modeling amino acid sequences, and different from natural language.
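To make the 64-token alphabet concrete, here is a minimal sketch of codon-level tokenization. The token ids and the `tokenize_cds` helper are my own illustration, not OpenMed's tokenizer, which may order codons differently and add special tokens.

```python
from itertools import product

# All 64 codons in a fixed order; the list index serves as a hypothetical token id.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_TO_ID = {codon: i for i, codon in enumerate(CODONS)}

def tokenize_cds(cds: str) -> list[int]:
    """Split a coding sequence into codon token ids (assumes only A/C/G/T/U)."""
    cds = cds.upper().replace("U", "T")  # accept RNA or DNA spelling
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [CODON_TO_ID[cds[i:i + 3]] for i in range(0, len(cds), 3)]

print(len(CODONS))                # 64
print(tokenize_cds("ATGGCCTAA"))  # [14, 37, 48]
```

A nine-nucleotide sequence becomes three tokens, so a codon model sees sequences one third the length of the raw DNA. Ambiguous bases such as `N` would raise a `KeyError` in this sketch.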
| Model | Parameters | Architecture |
|---|---|---|
| CodonBERT (baseline) | 6M | BERT-tiny (6 layers) |
| ModernBERT-base | 90M | ModernBERT (22 layers, RoPE) |
| CodonRoBERTa-base | 92M | RoBERTa (12 layers) |
| CodonRoBERTa-large | 312M | RoBERTa (24 layers) |
| CodonRoBERTa-large-v2 | 312M | RoBERTa (24 layers, refined) |
The choice of RoBERTa was deliberate. Meta’s ESM-2 (which powers ESMFold) is itself a RoBERTa variant. The hypothesis was that the same architecture that learns amino acid patterns might transfer well to codon patterns.
What Won and Why
CodonRoBERTa-large-v2 was the clear winner: perplexity of 4.10 and a Spearman CAI correlation of 0.40. That significantly outperformed ModernBERT.
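For a sense of scale on that perplexity number: perplexity is the exponential of the average negative log-probability a model assigns to the correct token, so a model guessing uniformly over 64 codons scores 64. A quick sketch, assuming the standard definition (I have not checked exactly how OpenMed computes it over masked positions):

```python
import math

def perplexity(true_token_probs):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that guesses uniformly over 64 codons.
print(perplexity([1 / 64] * 10))

# A model that always puts probability 0.25 on the right codon.
print(perplexity([0.25, 0.25, 0.25]))
```

Read that way, a perplexity of 4.10 means the model is on average about as uncertain as if it were choosing among roughly four equally likely codons.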
I was surprised ModernBERT didn’t do better. ModernBERT has all the latest NLP innovations: rotary position embeddings, efficient attention, long context support. But for codon sequences, the older RoBERTa architecture just worked better. The lesson: architectural innovations from NLP don’t always transfer to biological sequences.
The v2 model used better hyperparameters: learning rate scheduling, a longer warmup, and a different masking strategy. Small changes, big impact. This is the kind of detail most papers gloss over.
Scaling to 25 Species for $165
They trained 4 production models in 55 GPU-hours. Total cost: $165. That’s using cloud GPU instances, not some special deal.
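The arithmetic behind those figures is worth spelling out. This assumes a flat hourly rate and that the 55 GPU-hours covers all four models; the per-model numbers are averages, not reported values.

```python
total_cost_usd = 165
gpu_hours = 55
models = 4

rate_per_gpu_hour = total_cost_usd / gpu_hours   # implied hourly rate
hours_per_model = gpu_hours / models             # average, not reported per model
cost_per_model = total_cost_usd / models         # average, not reported per model

print(rate_per_gpu_hour)  # 3.0
print(hours_per_model)    # 13.75
print(cost_per_model)     # 41.25
```

An implied $3 per GPU-hour is in the range of on-demand cloud pricing, which fits the "no special deal" claim.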
Each model was species-conditioned, meaning you can specify the target organism and the model optimizes codons for that specific species. As far as I know, no other open-source project offers this. The closest alternatives are commercial tools like GeneArt or IDT’s Codon Optimization Tool, which are black boxes with per-gene pricing.
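The post doesn't say how the conditioning is implemented. One common way to condition a sequence model is to prepend a tag token for the target organism, and the sketch below shows that idea. Treat it as an assumption on my part, not OpenMed's method; the tag names and vocabulary layout are hypothetical.

```python
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
SPECIES = ["human", "mouse", "e_coli"]  # hypothetical subset of the 25 species

# Codons take ids 0..63; species tags are appended after them.
VOCAB = {codon: i for i, codon in enumerate(CODONS)}
VOCAB.update({f"<{name}>": len(CODONS) + i for i, name in enumerate(SPECIES)})

def encode(cds: str, species: str) -> list[int]:
    """Prefix the codon ids with a species tag id (illustrative scheme only)."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return [VOCAB[f"<{species}>"]] + [VOCAB[c] for c in codons]

print(encode("ATGTAA", "mouse"))  # [65, 14, 48]
```

With a scheme like this, the same coding sequence gets a different first token per host, which gives the model a handle for learning host-specific codon preferences.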
The training data covered 25 species including human, mouse, zebrafish, E. coli, yeast, and several plants. They used coding sequences from NCBI’s RefSeq database.
Where It Falls Short
Let me be honest about the limitations:
- The folding and design stages are just wrappers around existing tools. Useful, but not novel.
- 25 species is a good start, but the real world needs hundreds. Therapeutic proteins get expressed in CHO cells, HEK293, E. coli, yeast, insect cells, and plants. Each has different codon preferences.
- The evaluation metric (CAI correlation) is useful but incomplete. CAI only measures similarity to natural codon usage. It doesn’t directly measure expression levels. You’d need wet-lab validation to really know if these optimizations work.
- The perplexity of 4.10 is good, but probably not state-of-the-art. Commercial tools may well do better; they just don’t publish their numbers.
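Since CAI carries so much of the evaluation, it helps to see what it measures. In the classic definition, each codon gets a weight equal to its frequency divided by the frequency of the most-used synonymous codon, and CAI is the geometric mean of those weights over the sequence. The sketch below uses made-up counts for two amino acids; a real reference table covers all sense codons for a specific host.

```python
import math

# Hypothetical codon counts from a reference gene set (numbers are invented).
REFERENCE_COUNTS = {
    "K": {"AAA": 30, "AAG": 10},
    "E": {"GAA": 20, "GAG": 20},
}

def relative_adaptiveness(counts):
    """w(codon) = count / max count among codons for the same amino acid."""
    weights = {}
    for codon_counts in counts.values():
        best = max(codon_counts.values())
        for codon, n in codon_counts.items():
            weights[codon] = n / best
    return weights

def cai(cds, weights):
    """Geometric mean of w over the codons that appear in the weight table."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    return math.exp(sum(logs) / len(logs))

WEIGHTS = relative_adaptiveness(REFERENCE_COUNTS)
print(cai("AAAGAA", WEIGHTS))  # every codon is the preferred one
print(cai("AAGGAG", WEIGHTS))  # one rarer codon pulls the score down
```

Nothing in this calculation involves measured protein output, which is exactly the limitation: a high CAI says the sequence looks like the reference genes, not that it expresses well. The sketch also skips edge cases such as zero counts or a sequence with no scorable codons.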
What I’d Do Differently
If I were building this pipeline, I’d want to see:
- Direct comparison with codon usage tables (the traditional approach). Does the learned model actually beat simple frequency-based methods?
- Wet-lab validation. Even a small experiment with 10-20 sequences in E. coli would make this much more compelling.
- More species. 25 is fine for a proof of concept, but practical applications need coverage for common expression hosts.
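For reference, the frequency-based baseline I have in mind is tiny. A minimal sketch, assuming a per-host usage table (the frequencies below are placeholders, not real values for any organism) and the simplest possible rule of always picking the most frequent codon:

```python
# Hypothetical usage table: amino acid -> {codon: relative frequency}.
USAGE = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.74, "AAG": 0.26},
    "E": {"GAA": 0.68, "GAG": 0.32},
}

def back_translate_most_frequent(protein, usage):
    """Pick the most frequent codon for each amino acid in the protein."""
    return "".join(max(usage[aa], key=usage[aa].get) for aa in protein)

print(back_translate_most_frequent("MKE", USAGE))  # ATGAAAGAA
```

If a 312M-parameter model can't beat a lookup like this on downstream expression, the extra machinery is hard to justify. That is why the comparison matters.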
The Bigger Picture
This pipeline represents something important: the cost of doing serious bio-AI work is dropping fast. $165 for a multi-species model that would have cost $10,000+ three years ago. The democratization of protein engineering is real.
OpenMed has released all the code and model weights. If you’re working on therapeutic proteins, mRNA vaccines, or recombinant protein production, this is worth trying. The pipeline is rough around the edges, but it works.
And that’s the point. You don’t need a million-dollar compute budget to do meaningful work in protein AI anymore. You just need good ideas and $165.