I’ve been messing with on-device AI in browser extensions lately, and the folks at Hugging Face just dropped a really practical write-up about their Transformers.js demo extension powered by Gemma 4 E2B. They shared it on the HF blog, and honestly, it’s the kind of stuff you wish existed before you start building.
They ran into all the usual Manifest V3 headaches – service worker lifecycles, model caching across origins, keeping the UI snappy while inference runs in the background. I’ve dealt with enough of these to know that the architecture decisions they made are worth a closer look.
What’s actually going on here
This isn’t a toy. The extension runs a full text generation model (Gemma 4, quantized to q4f16) and a separate embedding model (MiniLM-L6) entirely in the browser. No cloud calls, no API keys. Everything happens inside a Chrome extension’s background service worker, which is… let’s say, a constrained environment.
The source is on GitHub (nico-martin/gemma4-browser-extension) and the extension itself is live on the Chrome Web Store. I pulled it apart to see how they handled the hard parts.
Architecture: three runtimes, one brain
They split the extension into three distinct contexts, which is the right call for MV3:
- Background service worker – this is where the brains live. Model initialization, agent logic, tool execution, conversation state. It’s the single coordinator for everything.
- Side panel – the chat UI. Thin, reactive, sends requests to the background and renders whatever comes back.
- Content script – the page-level bridge. Extracts DOM content, applies highlights, but doesn’t touch models at all.
This split avoids the classic mistake of loading models in multiple places. The background owns the models, the UI just talks to it through typed messages. Conversation history lives in the background too – the UI sends AGENT_GENERATE_TEXT, the background appends the message, runs inference, and emits MESSAGES_UPDATE back. Clean.
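To make that flow concrete, here is a minimal sketch of a background-owned conversation. Only the two message names (AGENT_GENERATE_TEXT and MESSAGES_UPDATE) come from the write-up. The payload fields, the ConversationCoordinator class and the injected generate function are my own assumptions, not the extension's actual code.

```typescript
// Only the two `type` strings come from the write-up; all fields are assumed.
type ChatMessage = { role: "user" | "assistant"; content: string };
type UiRequest = { type: "AGENT_GENERATE_TEXT"; text: string };
type BackgroundEvent = { type: "MESSAGES_UPDATE"; messages: ChatMessage[] };

// Stand-in for the real model call, injected so the sketch runs anywhere.
type Generate = (history: ChatMessage[]) => Promise<string>;

// Hypothetical coordinator: the background owns the history, the UI only renders it.
class ConversationCoordinator {
  private messages: ChatMessage[] = [];
  private generate: Generate;
  private emit: (event: BackgroundEvent) => void;

  constructor(generate: Generate, emit: (event: BackgroundEvent) => void) {
    this.generate = generate;
    this.emit = emit;
  }

  async handle(request: UiRequest): Promise<void> {
    // Append the user message and publish right away so the UI can show it.
    this.messages.push({ role: "user", content: request.text });
    this.publish();

    // Run inference on a copy of the history, then publish the reply.
    const reply = await this.generate([...this.messages]);
    this.messages.push({ role: "assistant", content: reply });
    this.publish();
  }

  private publish(): void {
    // Send a snapshot so listeners never see later mutations.
    this.emit({ type: "MESSAGES_UPDATE", messages: [...this.messages] });
  }
}
```

In a real extension, emit would probably wrap chrome.runtime.sendMessage and handle would be called from a chrome.runtime.onMessage listener. Keeping both injectable means the logic can be tested without a browser.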
The hard part: messaging under MV3
Manifest V3 service workers can be suspended and restarted at any point. That means you can’t just assume your model stays loaded. The team handled this with explicit lifecycle management:
- CHECK_MODELS to see what’s cached
- INITIALIZE_MODELS to download and set up
- DOWNLOAD_PROGRESS events back to the UI
All messages are typed through enums in a shared types file. All traffic goes through the background: the side panel and the content script each talk to it, but never directly to each other. Both are specialized workers that request actions and render results.
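Here is a rough sketch of how those lifecycle messages could be typed and routed. The three message names are from the write-up. The payload shapes, the ModelStore interface and handleLifecycleMessage are hypothetical, and I use a const object where the repository reportedly uses enums.

```typescript
// Message names come from the write-up; every payload field below is an assumption.
const MessageType = {
  CHECK_MODELS: "CHECK_MODELS",
  INITIALIZE_MODELS: "INITIALIZE_MODELS",
  DOWNLOAD_PROGRESS: "DOWNLOAD_PROGRESS",
} as const;

type LifecycleRequest =
  | { type: typeof MessageType.CHECK_MODELS }
  | { type: typeof MessageType.INITIALIZE_MODELS };

type DownloadProgress = {
  type: typeof MessageType.DOWNLOAD_PROGRESS;
  modelId: string;
  percent: number;
};

type LifecycleResponse = { cached: string[] } | { started: true };

// Hypothetical abstraction over whatever actually caches and loads the models.
interface ModelStore {
  cachedModelIds(): string[];
  startDownload(onProgress: (modelId: string, percent: number) => void): void;
}

function handleLifecycleMessage(
  msg: LifecycleRequest,
  store: ModelStore,
  emit: (event: DownloadProgress) => void,
): LifecycleResponse {
  switch (msg.type) {
    case MessageType.CHECK_MODELS:
      // Answer from the cache, never from in-memory flags that a restart wipes.
      return { cached: store.cachedModelIds() };
    case MessageType.INITIALIZE_MODELS:
      // Forward each progress tick to the UI as a DOWNLOAD_PROGRESS event.
      store.startDownload((modelId, percent) =>
        emit({ type: MessageType.DOWNLOAD_PROGRESS, modelId, percent }),
      );
      return { started: true };
    default: {
      // Compile-time exhaustiveness check: adding a request type breaks this line.
      const unreachable: never = msg;
      throw new Error(`Unhandled message: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

The point of the discriminated union is that adding a new message type forces you to handle it. That matters when a restarted worker can receive any message first.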
One thing I really like: they use the extension origin (chrome-extension://) for model caching instead of per-website origins. This means one shared cache for the entire extension install, which is way more predictable than having each tab’s content script try to download models independently.
Two models, two jobs
The model split is pragmatic:
- Gemma 4 E2B (text-generation, q4f16) handles reasoning and tool decisions. This is the heavy lifter for the chat interface.
- all-MiniLM-L6-v2 (feature-extraction, fp32) generates vector embeddings for semantic search across page content and browsing history.
Both run in the background via pipelines. Text generation uses the new DynamicCache class for consistent KV caching across generations. Embeddings get normalized vectors for similarity search.
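Normalized vectors matter because cosine similarity between unit-length vectors is just a dot product, which keeps ranking cheap. Below is a small sketch of that ranking step. The Embedded type and the function names are mine, and in practice the vectors would come from the embedding pipeline (which, as far as I know, accepts a normalize option in Transformers.js) rather than being built by hand.

```typescript
// Hypothetical record: an id (for example a page chunk) plus its embedding.
type Embedded = { id: string; vector: number[] };

// Scale a vector to unit length; a zero vector is returned unchanged.
function l2Normalize(v: number[]): number[] {
  const norm = Math.hypot(...v);
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

// Dot product; equals cosine similarity when both inputs are unit length.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Rank items by similarity to the query and keep the best k.
function topK(
  query: number[],
  items: Embedded[],
  k: number,
): { id: string; score: number }[] {
  return items
    .map((item) => ({ id: item.id, score: dot(query, item.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A brute-force scan like this is usually fine for a few thousand page chunks. Beyond that you would want some kind of index, which is outside what the write-up covers.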
This uses more memory than I’d like for a single extension, but the tradeoff makes sense – you get both reasoning and semantic search without any server calls. The embedding model is small enough that it barely registers.
What I’d watch out for
Service worker suspension is real. If Chrome decides to kill your background worker while a model is mid-download, you need to handle that gracefully. The team’s explicit lifecycle checks are the right approach, but I’d add a persistence layer for download state – IndexedDB, or chrome.storage.session if surviving worker restarts within one browser session is enough – so you don’t restart downloads from scratch every time.
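This is roughly what I mean by persisting download state. It is my suggestion, not something the extension does. KeyValueStore is a hypothetical adapter you would write over IndexedDB or chrome.storage.session, and the state key, file names and fetchFile callback are all made up for illustration.

```typescript
// Hypothetical adapter over IndexedDB or chrome.storage.session.
interface KeyValueStore {
  get(key: string): Promise<unknown>;
  set(key: string, value: unknown): Promise<void>;
}

type DownloadState = { completedFiles: string[] };

const STATE_KEY = "model-download-state"; // made-up key name

// Read persisted state, falling back to "nothing downloaded" on missing or bad data.
async function loadState(store: KeyValueStore): Promise<DownloadState> {
  const raw = await store.get(STATE_KEY);
  if (
    raw !== null &&
    typeof raw === "object" &&
    Array.isArray((raw as DownloadState).completedFiles)
  ) {
    return raw as DownloadState;
  }
  return { completedFiles: [] };
}

// Download only files not yet recorded as complete, persisting after each one.
// Returns the files fetched during this call.
async function resumeDownload(
  files: string[],
  store: KeyValueStore,
  fetchFile: (file: string) => Promise<void>,
): Promise<string[]> {
  const state = await loadState(store);
  const done = new Set(state.completedFiles);
  const fetched: string[] = [];
  for (const file of files) {
    if (done.has(file)) continue; // finished before the worker was killed
    await fetchFile(file);
    done.add(file);
    fetched.push(file);
    await store.set(STATE_KEY, { completedFiles: Array.from(done) });
  }
  return fetched;
}
```

This tracks whole files only. Resuming inside a partially downloaded file would need range requests, and I have not checked whether the model host supports them.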
Also, the side panel approach means users need to open the panel explicitly. That’s fine for a chat assistant, but if you wanted passive features (like automatic page analysis), you’d need a different pattern – probably a popup or badge-based interaction.
Final thoughts
This is a solid reference implementation for anyone wanting to run local AI in a Chrome extension. The architecture choices are clearly shaped by MV3’s quirks, and the source is clean enough to fork and modify.
If you’re building something similar, steal their messaging pattern. It’s the cleanest part of the whole thing.