AI & Robotics News

Hugging Face: 5 ways enterprises can slash AI costs without sacrificing performance

August 19, 2025

Hugging Face on Smarter AI Model Development

Enterprises seem to accept it as a basic fact: AI models require a significant amount of compute; they simply have to find ways to obtain more of it.

But it doesn’t have to be that way, according to Sasha Luccioni, AI and climate lead at Hugging Face. What if there’s a smarter way to use AI? What if, instead of striving for more (often unnecessary) compute and ways to power it, enterprises focused on improving model performance and accuracy?

Ultimately, model makers and enterprises are focusing on the wrong issue: They should be computing smarter, not harder or doing more, Luccioni says.

“There are smarter ways of doing things that we’re currently under-exploring, because we’re so blinded by: We need more FLOPS, we need more GPUs, we need more time,” she said.

Here are five key learnings from Hugging Face that can help enterprises of all sizes use AI more efficiently.

1. Right-size the model to the task

Avoid defaulting to giant, general-purpose models for every use case. Task-specific or distilled models can match, or even surpass, larger models in terms of accuracy for targeted workloads — at a lower cost and with reduced energy consumption.

Luccioni, in fact, has found in testing that a task-specific model uses 20 to 30 times less energy than a general-purpose one. “Because it’s a model that can do that one task, as opposed to any task that you throw at it, which is often the case with large language models,” she said.

Distillation is key here; a full model could initially be trained from scratch and then refined for a specific task. DeepSeek R1, for instance, is “so huge that most organizations can’t afford to use it” because you need at least 8 GPUs, Luccioni noted. By contrast, distilled versions can be 10, 20 or even 30X smaller and run on a single GPU.
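
To make the single-GPU point concrete, weight memory can be roughly estimated as parameter count times bytes per parameter. The sketch below is illustrative only: the parameter count for the full model, the 20X shrink factor, the 16-bit precision and the 80 GB GPU capacity are all assumptions that do not come from Luccioni, and the estimate ignores activations and cache memory.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory needed just to hold the weights, in GB (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def fits_on_single_gpu(memory_gb: float, gpu_memory_gb: float = 80.0) -> bool:
    """True if the weights alone fit in one GPU of the assumed capacity."""
    return memory_gb <= gpu_memory_gb


# Assumed figures for illustration, not taken from the article.
FULL_PARAMS = 671e9   # assumed size of the full model
SHRINK_FACTOR = 20    # "10, 20 or even 30X smaller"; 20 picked as a midpoint

full_gb = weight_memory_gb(FULL_PARAMS)
distilled_gb = weight_memory_gb(FULL_PARAMS / SHRINK_FACTOR)

print(f"full model weights:      {full_gb:.1f} GB")
print(f"distilled model weights: {distilled_gb:.1f} GB")
print("distilled fits on one GPU:", fits_on_single_gpu(distilled_gb))
```

Under these assumptions, the full model's weights need many GPUs' worth of memory while the distilled version fits on one, which is the gap Luccioni describes.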

In general, open-source models help with efficiency, she noted, as they don’t need to be trained from scratch. That’s compared to just a few years ago, when enterprises were wasting resources because they couldn’t find the model they needed; nowadays, they can start out with a base model and fine-tune and adapt it.

“It provides incremental shared innovation, as opposed to siloed, everyone’s training their models on their datasets and essentially wasting compute in the process,” said Luccioni.

It’s becoming clear that companies are quickly getting disillusioned with gen AI, as costs are not yet proportionate to the benefits. Generic use cases, such as writing emails or transcribing meeting notes, are genuinely helpful. However, task-specific models still require “a lot of work” because out-of-the-box models don’t cut it and are also more costly, said Luccioni.

This is the next frontier of added value. “A lot of companies do want a specific task done,” Luccioni noted. “They don’t want AGI, they want specific intelligence. And that’s the gap that needs to be bridged.”

2. Make efficiency the default

Adopt “nudge theory” in system design: Set conservative reasoning budgets, limit always-on generative features and require opt-in for high-cost compute modes.

In behavioral science, “nudge theory” is a behavioral change management approach designed to influence human behavior subtly. The “canonical example,” Luccioni noted, is adding cutlery to takeout: Having people decide whether they want plastic utensils, rather than automatically including them with every order, can significantly reduce waste.

“Just getting people to opt into something versus opting out of something is actually a very powerful mechanism for changing people’s behavior,” said Luccioni.

Features that are switched on by default are often unnecessary: They increase usage and, therefore, costs, because models do more work than they need to. For instance, popular search engines such as Google automatically place a gen AI summary at the top of the results. Luccioni also noted that, when she recently used OpenAI’s GPT-5, the model automatically worked in full reasoning mode on “very simple questions.”

“For me, it should be the exception,” she said. “Like, ‘What’s the meaning of life?’ Then sure, I want a gen AI summary. But with ‘What’s the weather like in Montreal?’ or ‘What are the opening hours of my local pharmacy?’ I do not need a generative AI summary, yet it’s the default. I think that the default mode should be no reasoning.”
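
One way to picture an opt-in default is a request configuration in which every costly feature starts switched off. The sketch below is hypothetical: `RequestOptions`, `plan_request` and the field names are invented for illustration and do not correspond to any real product's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestOptions:
    """Hypothetical per-request settings; costly features default to off."""
    reasoning: bool = False        # extended reasoning only when requested
    ai_summary: bool = False       # generative summary only when requested
    max_reasoning_tokens: int = 0  # conservative budget; 0 disables reasoning


def plan_request(query: str, options: RequestOptions = RequestOptions()) -> dict:
    """Decide how to serve a query; the cheap path is used unless opted in."""
    budget = options.max_reasoning_tokens if options.reasoning else 0
    return {
        "query": query,
        "mode": "reasoning" if budget > 0 else "direct",
        "reasoning_budget": budget,
        "ai_summary": options.ai_summary,
    }


# A simple lookup takes the cheap path by default.
simple = plan_request("What's the weather like in Montreal?")

# A user who explicitly opts in gets the expensive mode, within a budget.
deep = plan_request(
    "What's the meaning of life?",
    RequestOptions(reasoning=True, ai_summary=True, max_reasoning_tokens=1024),
)
print(simple["mode"], deep["mode"])
```

The design choice is that the caller has to ask for the expensive behavior rather than remember to turn it off.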

3. Optimize hardware utilization

Use batching, adjust precision and fine-tune batch sizes for the specific hardware generation to minimize wasted memory and power draw.

For instance, enterprises should ask themselves: Does the model need to be on all the time? Will people be pinging it in real time, 100 requests at once? In that case, always-on optimization is necessary, Luccioni noted. However, in many others, it’s not; the model can be run periodically to optimize memory usage, and batching can ensure optimal memory utilization.
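
The periodic, batched pattern described above can be sketched as a job that drains a queue of pending requests in fixed-size groups instead of keeping a model loaded for one-off calls. Everything here is an assumption for illustration: `run_model` stands in for whatever inference call an enterprise actually uses, and the batch size of 8 is arbitrary.

```python
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, group
        yield batch


def run_periodic_job(
    queue: Iterable[str],
    run_model: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Process everything queued since the last run, one batch at a time."""
    results: List[str] = []
    for batch in batched(queue, batch_size):
        results.extend(run_model(batch))
    return results


def fake_model(batch: List[str]) -> List[str]:
    """Stand-in for a real inference call; it just upper-cases the inputs."""
    return [text.upper() for text in batch]


pending = ["invoice 1", "invoice 2", "invoice 3"]
print(run_periodic_job(pending, fake_model, batch_size=2))
```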

“It’s kind of like an engineering challenge, but a very specific one, so it’s hard to say, ‘Just distill all the models,’ or ‘change the precision on all the models,’” said Luccioni.

In one of her recent studies, she found that the optimal batch size depends on the hardware, even down to the specific type or version. Increasing the batch size by just one can raise energy use because the model then needs more memory.

“This is something that people don’t really look at, they’re just like, ‘Oh, I’m gonna maximize the batch size,’ but it really comes down to tweaking all these different things, and all of a sudden it’s super efficient, but it only works in your specific context,” Luccioni explained.
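
A minimal sketch of that kind of tuning, assuming an energy reading is available for each candidate batch size: pick the size with the lowest energy per item rather than simply the largest one. The `measure_energy_joules` callable is hypothetical, and the numbers below are made-up placeholders, not results from Luccioni's study.

```python
from typing import Callable, Dict, Iterable


def best_batch_size(
    candidates: Iterable[int],
    measure_energy_joules: Callable[[int], float],
) -> int:
    """Return the candidate batch size with the lowest energy per item."""
    energy_per_item: Dict[int, float] = {}
    for batch_size in candidates:
        if batch_size <= 0:
            raise ValueError("batch sizes must be positive")
        total = measure_energy_joules(batch_size)  # energy for one full batch
        energy_per_item[batch_size] = total / batch_size
    if not energy_per_item:
        raise ValueError("no candidate batch sizes given")
    return min(energy_per_item, key=energy_per_item.__getitem__)


# Placeholder readings in joules per batch, invented for illustration only.
# Note the jump from 16 to 17, mimicking a step up in memory use.
PLACEHOLDER_JOULES = {8: 80.0, 16: 128.0, 17: 204.0, 32: 320.0}

chosen = best_batch_size(PLACEHOLDER_JOULES, PLACEHOLDER_JOULES.__getitem__)
print("most efficient batch size:", chosen)
```

In practice the measurement would have to be repeated on each hardware type, since, as Luccioni notes, the efficient setting only holds in a specific context.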

4. Incentivize energy transparency

It always helps when people are incentivized; to this end, Hugging Face earlier this year launched AI Energy Score. It’s a novel way to promote more energy efficiency, utilizing a 1- to 5-star rating system, with the most efficient models earning a “five-star” status.

It could be considered the “Energy Star for AI,” and was inspired by the potentially-soon-to-be-defunct federal program, which set energy efficiency specifications and branded qualifying appliances with an Energy Star logo.

“For a couple of decades, it was really a positive motivation, people wanted that star rating, right?” said Luccioni. “Something similar with Energy Score would be great.”

Hugging Face has a leaderboard up now, which it plans to update with new models (DeepSeek, GPT-oss) in September, and continually do so every 6 months or sooner as new models become available. The goal is that model builders will consider the rating as a “badge of honor,” Luccioni said.

5. Rethink the “more compute is better” mindset

Instead of chasing the largest GPU clusters, begin with the question: “What is the smartest way to achieve the result?” For many workloads, smarter architectures and better-curated data outperform brute-force scaling.

“I think that people probably don’t need as many GPUs as they think they do,” said Luccioni. Instead of simply going for the biggest clusters, she urged enterprises to rethink the tasks GPUs will be completing and why they need them, how they performed those types of tasks before, and what adding extra GPUs will ultimately get them.

“It’s kind of this race to the bottom where we need a bigger cluster,” she said. “It’s thinking about what you’re using AI for, what technique do you need, what does that require?”

Author: Taryn Plumb
Source: VentureBeat
Reviewed By: Editorial Team
