NVIDIA Blackwell Up To 2.2x Faster Than Hopper In MLPerf v4.1 AI Training Benchmarks, New World Records Set & Hopper Now Even Better

Hassan Mujtaba

NVIDIA has shared the first benchmarks of its Blackwell GPUs in MLPerf v4.1 AI Training workloads, delivering a 2.2x gain over Hopper.

NVIDIA Demolishes The Competition With Blackwell GPUs, Delivering Up To A 2.2x Gain In MLPerf v4.1 AI Training Benchmarks Versus Hopper

Back in August, NVIDIA's Blackwell made its debut in the MLPerf v4.1 AI Inference benchmarks, showcasing strong performance uplifts versus the last-gen Hopper chips and also against the competition. Today, NVIDIA is sharing the first Blackwell benchmarks in the MLPerf v4.1 AI Training workloads which showcase stunning results.

NVIDIA states that demand for compute in the AI segment is increasing at an exponential rate with the launch of new models, which requires both accelerated training and accelerated inference capabilities. The inference workloads were benchmarked a few months ago, and now it's time to look at the training tests, which cover the following workloads:

  • Llama 2 70B (LLM Fine-Tuning)
  • Stable Diffusion (Text-to-Image)
  • DLRMv2 (Recommender)
  • BERT (NLP)
  • RetinaNet (Object Detection)
  • GPT-3 175B (LLM Pre-Training)
  • R-GAT (Graph Neural Network)

These are some of the most popular and diverse use cases for evaluating the AI training performance of accelerators, and all of them are covered in the MLPerf Training v4.1 tests. Each benchmark measures time-to-train (in minutes) to a required quality target, and the 125+ MLCommons members and affiliates behind the consortium help keep the tests aligned with the market.
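
As a quick illustration of how those time-to-train figures turn into the speedup claims quoted throughout this article, here is a minimal Python sketch; the numbers below are placeholders, not actual MLPerf v4.1 submission data:

```python
# Hypothetical time-to-train results in minutes (placeholders only,
# not real MLPerf v4.1 submissions).
results = {
    "Llama 2 70B (LLM Fine-Tuning)": {"hopper": 30.0, "blackwell": 13.6},
    "GPT-3 175B (LLM Pre-Training)": {"hopper": 50.0, "blackwell": 25.0},
}

# An MLPerf speedup is simply the ratio of time-to-train figures
# measured at a comparable system scale.
for workload, t in results.items():
    speedup = t["hopper"] / t["blackwell"]
    print(f"{workload}: {speedup:.1f}x faster on Blackwell")
```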

Starting with Hopper: the H100 GPUs are now 1.3x faster in per-GPU LLM pre-training performance than in their first submission and continue to offer the highest AI training performance available on every benchmark. With Hopper, NVIDIA also made the largest at-scale MLPerf submission yet, using 11,616 Hopper H100 GPUs connected at data-center scale through NVLink, NVSwitch, ConnectX-7 SuperNICs, and Quantum InfiniBand switches.

Since launch, the NVIDIA Hopper GPUs have scaled up in performance thanks to continued software optimizations within the CUDA AI stack, now offering 6x the performance of the HGX A100 and a 70% uplift over the June 2023 HGX H100 submission in GPT-3 175B training, with 512 GPUs used in each submission.

Rounding out its recent Hopper inference benchmarks, the chips offer 1.9x higher performance in Llama 3.1, 3x faster time-to-first-token (TTFT) with GH200 NVL32, and 1.5x higher throughput in Llama 3.1 405B, which once again shows the continued innovation in the software stack.

There's a reason why the competition's new chips are having a hard time keeping up with Hopper, let alone Blackwell.

That brings us to Blackwell, the heart of the next-gen AI data centers. Right off the bat, NVIDIA has claimed seven per-accelerator records using its Nyx AI supercomputer, which is built from DGX B200 systems.

This supercomputer offers 2.2x faster Llama 2 70B (fine-tuning) performance versus the Hopper H100, 2x faster GPT-3 175B (pre-training) performance versus the Hopper H100, and it smashes through the rest of the MLPerf Training v4.1 workload suite as well.

With Blackwell, NVIDIA is not only doubling the performance but also bringing an advanced set of technologies, which we detailed in the full deep dive provided during Hot Chips 2024. What's more, NVIDIA's partners are showcasing outstanding performance using their Hopper-based systems, and a total of 11 partner submissions have been made, which shows the momentum surrounding the Hopper and Blackwell GPUs.

The first Blackwell training submission to the MLCommons Consortium — which creates standardized, unbiased and rigorously peer-reviewed testing for industry participants — highlights how the architecture is advancing generative AI training performance. For instance, the architecture includes new kernels that make more efficient use of Tensor Cores. Kernels are optimized, purpose-built math operations like matrix-multiplies that are at the heart of many deep learning algorithms.
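
To make the kernel point concrete, here is a minimal PyTorch sketch of the kind of low-precision matrix-multiply that sits at the core of transformer training; this is an illustration of the general technique, not NVIDIA's actual MLPerf kernels:

```python
import torch

# A single matrix-multiply of the kind that dominates LLM training.
# On recent NVIDIA GPUs, low-precision GEMMs like this are dispatched
# by the cuBLAS backend to Tensor Core kernels.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Shapes loosely modeled on one transformer feed-forward projection
# (illustrative sizes, not taken from any benchmark config).
x = torch.randn(4096, 4096, device=device, dtype=torch.bfloat16)
w = torch.randn(4096, 4096, device=device, dtype=torch.bfloat16)

y = x @ w  # runs as a Tensor Core GEMM kernel on supported hardware
print(y.shape)  # torch.Size([4096, 4096])
```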

Blackwell's higher per-GPU compute throughput and its significantly larger, faster high-bandwidth memory allow it to run the GPT-3 175B benchmark on fewer GPUs while achieving excellent per-GPU performance.

Taking advantage of higher-bandwidth HBM3e memory, NVIDIA ran the GPT-3 LLM benchmark on just 64 Blackwell GPUs without compromising per-GPU performance; the same benchmark needed 256 Hopper GPUs to achieve the same result.

via NVIDIA
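
The implied per-GPU gain in that 64-versus-256 comparison can be worked out directly. A back-of-the-envelope sketch, assuming comparable scaling efficiency at both cluster sizes:

```python
# NVIDIA's GPT-3 175B claim: 64 Blackwell GPUs match the time-to-train
# of 256 Hopper GPUs on this benchmark.
hopper_gpus = 256
blackwell_gpus = 64

# If total work and training time are equal, each Blackwell GPU is doing
# roughly 256/64 = 4x the work per unit time (assuming similar scaling
# efficiency at both cluster sizes).
per_gpu_speedup = hopper_gpus / blackwell_gpus
print(f"Implied per-GPU throughput gain: {per_gpu_speedup:.0f}x")  # 4x
```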

NVIDIA also sheds some light on its yearly cadence, which isn't just about building new chips as fast as possible, but also about validating them at data center scale and deploying them faster at super-cluster scale.

The green team makes it clear that it isn't just a company that makes chips; it is a data center solution and systems provider at scale.

This is why the company has already shared its next-gen AI roadmap, featuring Blackwell Ultra as the 2025 follow-up to Blackwell with more memory (288 GB HBM3e) and more compute horsepower. The Blackwell Ultra platform is expected to use the B300 naming convention.

The follow-up to that comes in the form of Rubin, which arrives in its standard flavor with 8-stack (8S) HBM4 in 2026, followed by a 12-stack (12S) HBM4 variant in 2027. Lastly, NVIDIA confirms that Blackwell is now in full mass production, so expect it to deliver record-smashing revenue and performance figures in the coming quarters.
