
MLPerf 3.0 benchmark adds LLMs and shows dramatic rise in AI training performance

As the hype and momentum behind generative AI continue to grow, so too does the performance of the underlying systems that enable machine learning (ML) training.

MLCommons today announced the latest set of results for its MLPerf Training 3.0 benchmark, which aims to provide an industry-standard set of measurements for ML model training performance. MLCommons is an open engineering consortium focused on ML benchmarks, datasets and best practices to accelerate the development of AI. The group maintains a series of ML benchmarks, including MLPerf Inference, which was last updated in April. Its MLPerf Training 2.1 results were released in November 2022.

The big new inclusion with MLPerf Training 3.0 is the introduction of testing for training large language models (LLMs), specifically starting with GPT-3. The addition of LLMs to the benchmark suite comes at a critical time as organizations build out generative AI technologies.

Overall, the latest round of training benchmarks includes more than 250 different performance results from 16 vendors, including ASUSTek, Microsoft Azure, Dell, Fujitsu, GIGABYTE, H3C, IEI, Intel and Habana Labs, Krai, Lenovo, Nvidia, CoreWeave + Nvidia, Quanta Cloud Technology, Supermicro and xFusion.


ML capabilities outpacing Moore’s Law

Fundamentally, the MLPerf Training 3.0 benchmark results show a significant boost in performance across the board, revealing how ML capabilities are outpacing Moore’s Law.

“As an industry, Moore’s Law is what kind of drives us forward; that is the barometer by which many people are used to thinking about progress in electronics,” MLCommons executive director David Kanter said during a press briefing. “The performance gains that we’ve seen since 2018 are something in the neighborhood of 30 to 50X, which is incredible, and that’s about 10X faster than Moore’s Law.”

Looking specifically at the MLPerf Training data over the past year alone, Kanter said that all the results have seen gains of between 5% on the low end and 54% on the top end.
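As a rough back-of-the-envelope check on those figures (treating Moore’s Law as a doubling roughly every two years, an assumption not stated in the article), the reported gains since 2018 can be compared with what density scaling alone would predict:

```python
# Rough sanity check of the "about 10X faster than Moore's Law" figure.
# Assumption (not from the article): Moore's Law modeled as a 2x gain
# roughly every two years.

years = 2023 - 2018                    # window Kanter cites (since 2018)
moores_law_gain = 2 ** (years / 2)     # ~5.7x from doubling alone
reported_low, reported_high = 30, 50   # "30 to 50X" reported gains

print(f"Moore's Law alone over {years} years: ~{moores_law_gain:.1f}x")
print(f"Reported MLPerf gains: {reported_low}-{reported_high}x, i.e. "
      f"~{reported_low / moores_law_gain:.0f}x to "
      f"~{reported_high / moores_law_gain:.0f}x beyond Moore's Law")
```

Doubling alone would predict roughly a 5.7x gain over that window, so 30 to 50X works out to somewhere in the neighborhood of 5 to 9 times beyond it, broadly consistent with Kanter’s “about 10X faster” characterization.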

Why ML training keeps getting faster


There are a number of reasons why ML training keeps getting faster, and at a rate that is outpacing Moore’s Law.

One of the primary levers for faster training is improved silicon, something industry vendors including Nvidia and Intel have been aggressively iterating on. Kanter noted that when the MLPerf benchmarks got started, the most advanced silicon used a 16-nanometer process. In contrast, today the most advanced is at 5 nanometers, offering roughly an order of magnitude more density, and correspondingly more performance, as a result.
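As a simple illustration of why the node shrink matters (an idealized estimate, not a figure from the article), moving from a 16nm to a 5nm feature size implies roughly a (16/5)² increase in transistor density:

```python
# Idealized density scaling between process nodes.
# Assumption: density scales with the inverse square of the feature size;
# real processes (and marketing-driven node names) only approximate this.

old_node_nm, new_node_nm = 16, 5
density_gain = (old_node_nm / new_node_nm) ** 2
print(f"Ideal density gain from {old_node_nm}nm to {new_node_nm}nm: "
      f"~{density_gain:.0f}x")   # ~10x
```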

Beyond this hardware are algorithms and software. Kanter noted that vendors and researchers are constantly developing new and more efficient ways to execute operations. Additionally, there are general improvements in the development toolchain, with foundational components such as code compilers. Then there’s the matter of scale: building bigger systems with more communication bandwidth.

Nvidia has been building out its InfiniBand-based connectivity in recent years to support high-speed communication bandwidth. For its part, Intel has been working to improve Ethernet to support increased performance for ML operations.

“We demonstrated that with [Intel] Xeon you can get 97 to 100% scaling with a finely tuned standard Ethernet fabric,” Jordan Plawner, Intel’s senior director of AI products, said during the MLCommons press call.
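Scaling efficiency of the kind Plawner cites is conventionally reported as measured speedup divided by the ideal linear speedup for the node count. A minimal sketch, using hypothetical throughput numbers purely for illustration:

```python
# Scaling efficiency: how close a multi-node training run gets to ideal
# linear speedup. The throughput numbers below are hypothetical.

def scaling_efficiency(single_node_throughput: float,
                       cluster_throughput: float,
                       num_nodes: int) -> float:
    """Percent of ideal linear scaling achieved by the cluster."""
    speedup = cluster_throughput / single_node_throughput
    return 100.0 * speedup / num_nodes

# e.g. one node at 1,000 samples/s; 16 nodes together at 15,600 samples/s
print(f"{scaling_efficiency(1_000, 15_600, 16):.1f}% scaling efficiency")  # 97.5%
```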

Benchmarking LLM training not an easy task

The move to integrate an LLM training benchmark specifically for GPT-3 was no small task for MLCommons. GPT-3 is a 175-billion-parameter model; in contrast, the BERT natural language processing (NLP) model is much smaller, at 340 million parameters.
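For a sense of that gap, the ratio of the two parameter counts works out to roughly 500x, and because training compute grows with both model size and the amount of data processed, the difference in training cost is larger still. A trivial check:

```python
# Parameter counts as cited in the article.
gpt3_params = 175e9   # GPT-3: 175 billion parameters
bert_params = 340e6   # BERT: 340 million parameters

print(f"GPT-3 has ~{gpt3_params / bert_params:.0f}x "
      f"more parameters than BERT")   # ~515x
```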

“This is by far and away the most computationally demanding of our benchmarks,” Kanter said.

Even for Nvidia, running the LLM benchmark evaluation took a notable amount of effort. In a briefing, Nvidia’s director of AI benchmarking and cloud, Dave Salvator, explained that his company made a joint submission alongside cloud platform provider CoreWeave for the benchmark. The evaluation used 3,484 GPUs across multiple MLPerf Training 3.0 benchmarks.

Salvator noted that CoreWeave announced the general availability of its massive GPU instances back at the Nvidia GTC event in March. He added that CoreWeave was a first mover in making its HGX H100 instances generally available.

“Through this collaboration, we either set or broke records on pretty much every workload,” Salvator said. “What’s also interesting about this is that the instance is a live commercial instance.”

The same CoreWeave HGX H100 instances used for the MLPerf benchmarks are also being used by startup Inflection AI, which has developed its own personal AI, called Pi. Salvator noted that Inflection AI also assisted Nvidia and CoreWeave with some of the fine-tuning of the GPU instances.

“The test results that we’re getting at MLPerf are not some sort of sterile air gapped laboratory that is not a real world environment,” Salvator said. “This is a very real-world commercially available instance where we’re seeing those results, and we have a customer like Inflection AI who’s working on a cutting edge LLM and using that very same instance and seeing great results.”



Author: Sean Michael Kerner
Source: VentureBeat
