Hugging Face just added DeepInfra to their Inference Providers lineup, and honestly, this is one of those moves that makes you wonder why it took this long.
DeepInfra has been quietly building one of the most cost-effective serverless inference platforms out there. They’ve got over 100 models, support everything from LLMs to text-to-image and video generation, and their pricing per token has consistently undercut bigger names. Now it’s directly integrated into the Hugging Face Hub, which means you can use it without bouncing between dashboards.
What’s actually supported right now
The initial integration covers conversational and text-generation tasks. That’s the bread-and-butter stuff: you can hit models like DeepSeek V4, Kimi-K2.6, GLM-5.1, and a bunch of other popular open-weight LLMs directly from the model page widget. Text-to-image, text-to-video, embeddings, and other tasks are coming soon. I’d expect that rollout to happen relatively quickly, since DeepInfra already supports those workloads on its own platform.
How the routing works
There are two ways to use this, and the distinction matters.
First, you can set your own DeepInfra API key in your Hugging Face account settings. In that case, requests go directly from your client to DeepInfra’s servers. You get billed by DeepInfra, and you’re subject to whatever rate limits and quotas you’ve set up with them.
Second, you can let Hugging Face route the request. You authenticate with your HF token, and the charges hit your HF account instead. Hugging Face says they’re passing through the provider costs without markup, which is refreshingly honest. No hidden fees, no platform tax. They mention they might add revenue-sharing agreements in the future, but for now, you’re paying the same rate you’d pay if you went directly to DeepInfra.
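The difference between the two modes comes down to which credential the client ends up sending. Here is a minimal sketch of that choice in code. It assumes you pass the key to the client yourself rather than storing it in account settings, and DEEPINFRA_API_KEY is a hypothetical variable name I picked for illustration:

```python
import os


def pick_credentials(env):
    """Return (billing_mode, api_key) for the two routing modes.

    "direct": your own DeepInfra key, billed by DeepInfra.
    "routed": your HF token, billed to your Hugging Face account.
    """
    # DEEPINFRA_API_KEY is a hypothetical name, not an official variable.
    provider_key = env.get("DEEPINFRA_API_KEY")
    if provider_key:
        return "direct", provider_key
    return "routed", env["HF_TOKEN"]


if __name__ == "__main__":
    mode, _key = pick_credentials(os.environ)
    print(f"billing mode: {mode}")
```

Whichever key comes back is what you would hand to your client as its API key. Keeping the decision in one small function makes it obvious which account a given run will be billed to.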
This is actually a pretty good deal for PRO users. Hugging Face PRO gives you $2 in inference credits every month, and those credits work across all providers. That’s not a ton of compute, but it’s enough to prototype or run a small benchmark without opening your wallet. Free users get a small quota too, but don’t expect to run heavy workloads on it.
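To get a feel for what a $2 credit covers, here is a back-of-the-envelope sketch. The $0.50-per-million-token price is a made-up placeholder, not a DeepInfra quote, and real pricing usually differs for input and output tokens, so treat the result as an order-of-magnitude guess:

```python
def tokens_for_budget(budget_usd, usd_per_million_tokens):
    """Number of tokens a budget covers at a flat per-million-token price."""
    if usd_per_million_tokens <= 0:
        raise ValueError("price must be positive")
    return int(budget_usd / usd_per_million_tokens * 1_000_000)


# Hypothetical price for illustration only: $0.50 per million tokens.
print(tokens_for_budget(2.00, 0.50))  # prints 4000000
```

Plug in the actual per-token rate from the provider's pricing page for the model you care about before drawing any conclusions.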
The SDK integration is smooth
If you’re using the Hugging Face SDKs – huggingface_hub >= 1.11.2 for Python or @huggingface/inference for JavaScript – the integration is seamless. You just specify the model with a :deepinfra suffix and authenticate with your HF token. The router handles the rest.
Here’s a quick Python example using the OpenAI-compatible client:
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro:deepinfra",
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that returns the nth Fibonacci number using memoization."
        }
    ],
)

print(completion.choices[0].message)
And the JS equivalent is equally straightforward:
import { OpenAI } from "openai";

const client = new OpenAI({
  baseURL: "https://router.huggingface.co/v1",
  apiKey: process.env.HF_TOKEN,
});

const chatCompletion = await client.chat.completions.create({
  model: "deepseek-ai/DeepSeek-V4-Pro:deepinfra",
  messages: [
    {
      role: "user",
      content: "Write a Python function that returns the nth Fibonacci number using memoization.",
    },
  ],
});

console.log(chatCompletion.choices[0].message);
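If you’d rather stay inside huggingface_hub than use the OpenAI client, the InferenceClient can take the provider as a constructor argument. This is a minimal sketch, not the official snippet: I’m assuming "deepinfra" is the provider name the SDK accepts (the examples above only confirm the :deepinfra model suffix), and build_request and ask are helper names I made up:

```python
import os


def build_request(prompt, model="deepseek-ai/DeepSeek-V4-Pro"):
    """Keyword arguments for one chat completion (model ID as used above)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def ask(prompt):
    """Send one prompt through Hugging Face routing and return the reply text."""
    # Imported here so build_request stays usable without the SDK installed.
    from huggingface_hub import InferenceClient

    # Assumption: "deepinfra" is the provider identifier the SDK expects.
    client = InferenceClient(provider="deepinfra", api_key=os.environ["HF_TOKEN"])
    completion = client.chat.completions.create(**build_request(prompt))
    return completion.choices[0].message.content


if __name__ == "__main__":
    print(ask("Write a haiku about serverless inference."))
```

Splitting the payload from the call keeps the request easy to inspect or log before you spend any credits on it.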
Agent harnesses get it too
This is the part that actually excites me. Hugging Face Inference Providers are already integrated into most major agent frameworks – Pi, OpenCode, Hermes Agents, OpenClaw, and more. That means you can plug a DeepInfra-hosted model into your agent setup without writing any glue code. Just point your agent harness at the HF router endpoint and pick your model.
The model widget experience
On the Hugging Face model pages, you’ll now see DeepInfra listed alongside other providers in the inference widget. The ordering respects your personal preferences, which you can set in your account settings. If you’ve configured DeepInfra as your preferred provider for a model they support, the widget will default to it.

What I’d like to see next
DeepInfra’s pricing has always been competitive, and the model selection is solid. But the real test will be latency and reliability under load. I’ve seen serverless inference providers struggle when traffic spikes, especially on popular models. When Hugging Face routes the request, its routing layer adds another hop, so I’m curious whether that introduces noticeable overhead compared with calling DeepInfra directly with your own key.
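If you want to check the latency question yourself, a tiny timing harness is enough to start. This is a rough sketch rather than a rigorous benchmark: measure_latency is a helper I made up, and the sleep call is a stand-in for a function that sends one real request:

```python
import statistics
import time


def measure_latency(call, runs=5):
    """Time call() `runs` times; return (median, worst) in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), max(samples)


# Stand-in workload; swap in a function that performs one chat completion.
median_s, worst_s = measure_latency(lambda: time.sleep(0.01), runs=3)
print(f"median={median_s:.3f}s worst={worst_s:.3f}s")
```

Running it once against the routed setup and once with your own DeepInfra key, same model and same prompt, gives a first estimate of what the extra hop costs. A handful of runs won’t tell you anything about behavior under load, though.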
The other thing is the “additional tasks” promise. Text-to-image and embeddings are table stakes at this point. If DeepInfra can deliver those with the same cost efficiency they’ve shown for LLMs, this becomes a genuinely compelling one-stop shop for inference.
For now, this is a solid addition to the Hugging Face ecosystem. If you’re already using DeepInfra, you can keep your existing setup and just add the HF integration as an alternative access point. If you’re new to them, this is a low-friction way to test their service without committing to a separate account.
Go check it out, run some benchmarks, and see if the pricing holds up for your use case. I’ll be doing the same.