It’s getting harder to know which AI models are winning the AI race, says Hugging Face co-founder

- Hugging Face’s Thomas Wolf says it’s getting harder to tell which AI model is the best as traditional AI benchmarks become saturated. Going forward, Wolf said, the AI industry could rely on two new benchmark approaches: agency-based and use case-specific.
Thomas Wolf, co-founder and chief scientist of Hugging Face, believes new ways of measuring AI models will be needed.
Wolf told the audience at Brainstorm AI in London that as AI models become more advanced, it is getting harder to know which one is performing best.
“It’s getting harder to tell you what the best model is,” he said. “All of them seem to be very close in reality.”
“The benchmark world has evolved a lot. We had these very academic benchmarks that almost measure the knowledge of the model. The most well-known was MMLU (Massive Multitask Language Understanding), which was essentially a set of graduate- or PhD-level questions that the model had to answer,” he said. “Now all of these benchmarks are saturated.”
Over the past year, popular AI benchmarks such as MMLU, GLUE, and HellaSwag have reached saturation and are increasingly gamed, while a growing chorus of voices from academia, industry, and policy argues they do not reflect real-world utility.
In a paper published in February, titled “Can You Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation,” researchers from the European Commission’s Joint Research Centre found “systemic flaws in current benchmarking practices,” including misaligned incentives, weak construct validity, the gaming of results, and data contamination.
Going forward, Wolf said, the AI industry could rely on two major benchmark approaches entering 2025: one that evaluates a model’s agency, its ability to carry out tasks, and another tailored to each model’s specific use case.
Hugging Face is already working on the latter.
The company’s new program, YourBench, aims to help users decide which model to use for a particular task. Users feed a handful of documents into the program, which automatically generates a benchmark specific to the type of work they want to do. They can then apply that benchmark to different models and see which one is best suited to their use case.
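The workflow described above can be sketched in miniature. This is a hypothetical illustration of use case-specific benchmarking in the spirit of what the article describes, not YourBench’s actual API: the function names, the question-generation rule, and the stand-in “models” are all assumptions for the sake of the example.

```python
# Illustrative sketch (assumed names and logic, not YourBench's real API):
# build a tiny benchmark from a user's own documents, then score candidate
# models on it to see which fits the use case best.

def build_benchmark(documents):
    """Turn 'term: definition' lines in the documents into QA pairs."""
    qa_pairs = []
    for doc in documents:
        for line in doc.splitlines():
            if ":" in line:
                term, definition = line.split(":", 1)
                qa_pairs.append((f"What is {term.strip()}?", definition.strip()))
    return qa_pairs

def score_model(model, qa_pairs):
    """Fraction of the benchmark questions the model answers exactly."""
    correct = sum(1 for question, answer in qa_pairs if model(question) == answer)
    return correct / len(qa_pairs)

if __name__ == "__main__":
    docs = ["SLA: 99.9 percent uptime", "RPO: 15 minutes of data loss"]
    bench = build_benchmark(docs)
    # Two stand-in "models" (simple lookup tables) that look identical on a
    # generic benchmark but differ on this domain-specific one.
    model_a = {"What is SLA?": "99.9 percent uptime",
               "What is RPO?": "15 minutes of data loss"}.get
    model_b = {"What is SLA?": "99.9 percent uptime"}.get
    print(score_model(model_a, bench))  # 1.0
    print(score_model(model_b, bench))  # 0.5
```

The point of the sketch is the shape of the idea: the benchmark is derived from the user’s own material, so a model that tops generic leaderboards can still lose on the tasks a particular user actually cares about.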
“Just because all of these models perform the same on these academic benchmarks doesn’t mean they’re all exactly the same,” Wolf said.
Open source’s “ChatGPT moment”
Founded in 2016 by Wolf, Clément Delangue, and Julien Chaumond, Hugging Face has long been a champion of open-source AI.
Often called the GitHub of machine learning, the company offers an open-source platform that lets developers, researchers, and companies build, share, and deploy machine learning models, datasets, and applications at scale. Users can also browse models and datasets uploaded by others.
Wolf told the Brainstorm AI audience that Hugging Face’s “business model is really aligned with open source,” and that the company’s goal is to get as many people as possible to join this kind of open community and share models with one another.
Wolf predicted that open-source AI will continue to thrive, especially after DeepSeek’s success earlier this year.
After its release, the Chinese-made AI model DeepSeek R1 sent shockwaves through the AI world.
Wolf said DeepSeek was open-source AI’s “ChatGPT moment.”
“Just as ChatGPT was the moment the whole world discovered AI, DeepSeek was the moment the whole world discovered this open-source community,” he said.
This story was originally featured on Fortune.com.