Nvidia Pascal Architecture Detailed – DX12 Async Compute & Scheduling Improved, CUDA Core Clusters Entirely Redesigned

Khalid Moammer

Today at the 2016 GPU Technology Conference Nvidia announced the Tesla P100, the company's most ambitious graphics card to date. The P100 is built around the most powerful and most complex GPU Nvidia has ever made, code named GP100. This flagship Pascal GPU is an engineering marvel, and in this piece we'll provide an overview of the Pascal architecture and all the details that Nvidia has revealed about this spectacular graphics chip.

Nvidia Tesla P100 accelerator

This overview is derived from an excellent talk given by Nvidia's Senior Architect, Lars Nyland, and its Chief Technologist for GPU Computing Software, Mark Harris.
So let's get straight to it!

The Five "Miracles" Of Nvidia's GP100 GPU & Tesla P100 Accelerator

At his keynote earlier today, Jen-Hsun Huang, Nvidia's Co-Founder & CEO, jokingly said that Nvidia never relies on more than one technical miracle with a given architecture. Despite that, with the GP100 GPU the company was successful in creating its most ambitious and most miraculous graphics chip to date, by relying on not one but five technological "miracles".

Nvidia GTC-12
Jen-Hsun summarized these miracles in the slide above. They are:

- Next generation Pascal graphics architecture.
- TSMC's 16nm FinFET manufacturing process technology.
- Next generation, vertically stacked High Bandwidth Memory (HBM2).
- The company's brand new revolution in platform atomics, the high speed NVLink GPU interconnect.
- And finally, the workload that GP100 was designed for and excels at, AI.

Nvidia's Pascal Architecture & The GP100 GPU, Opening The Taps

It has been a long tradition at Nvidia to introduce major performance and power efficiency advancements with each of its next generation graphics architectures, and Pascal is no exception. The basic building block of every Pascal GPU is called the SM, short for streaming multiprocessor. Maxwell before Pascal had the SMM, Streaming Maxwell Multiprocessor, as its building block, and Kepler before both had the SMX. The streaming multiprocessor is the engine that "creates, manages, schedules and executes instructions from many threads in parallel."

The GP100 GPU comprises 3840 CUDA cores, 240 texture units and a 4096-bit memory interface, arranged in eight 512-bit segments. The 3840 CUDA cores are spread across six Graphics Processing Clusters, or GPCs for short. Each of these has 10 Pascal Streaming Multiprocessors.
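The headline figures above follow directly from how the chip is tiled. As a rough sketch (the function and constant names are ours, not Nvidia's; the 64 cores per SM figure is the one Nvidia quotes for Pascal), the arithmetic looks like this:

```python
# Figures quoted for the full GP100 die.
GPCS = 6                  # Graphics Processing Clusters
SMS_PER_GPC = 10          # Pascal SMs in each GPC
FP32_CORES_PER_SM = 64    # FP32 CUDA cores in each Pascal SM
MEMORY_SEGMENTS = 8       # memory interface segments
BITS_PER_SEGMENT = 512    # width of each segment in bits


def total_cuda_cores(gpcs, sms_per_gpc, cores_per_sm):
    """Total FP32 CUDA cores for a GPU built from identical SMs."""
    return gpcs * sms_per_gpc * cores_per_sm


full_die_sms = GPCS * SMS_PER_GPC
full_die_cores = total_cuda_cores(GPCS, SMS_PER_GPC, FP32_CORES_PER_SM)
memory_bus_bits = MEMORY_SEGMENTS * BITS_PER_SEGMENT

print(full_die_sms, full_die_cores, memory_bus_bits)  # 60 3840 4096
```

The same function with 56 enabled SMs instead of 60 gives 3584 cores, which is the figure Nvidia lists for the Tesla P100 product.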

Nvidia Pascal GP100 GPU Block Diagram

Each Pascal streaming multiprocessor includes 64 FP32 CUDA cores, half that of Maxwell. Within each Pascal streaming multiprocessor there are two 32 CUDA core partitions, each with its own warp scheduler, two dispatch units and a fairly large instruction buffer, matching that of Maxwell.

The GP100 GPU is enormous, coming in at roughly 610mm² and 15 billion transistors, nearly double the transistor count of the GM200 GPU powering Nvidia's GTX Titan X and GTX 980 Ti graphics cards. GP100 also has significantly more streaming multiprocessors, or CUDA core blocks, than GM200, because each Pascal SM comprises only 64 CUDA cores as opposed to 128 in Maxwell.

Additionally, each Pascal SM has the same number of registers as Maxwell's 128 CUDA core SMM. This translates to each Pascal CUDA core having access to twice the registers. This in turn means that not only does GP100 have more threads than Nvidia's prior large GPUs, but each thread has access to more registers and thus a lot more throughput.
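To put a number on the "twice the registers" claim, here is a small sketch. It assumes the 65,536 32-bit registers per SM (a 256 KB register file) that Nvidia lists for both Maxwell and Pascal; the helper name is made up for illustration.

```python
REGISTERS_PER_SM = 65536   # 32-bit registers per SM, same on Maxwell and Pascal
BYTES_PER_REGISTER = 4


def registers_per_core(registers_per_sm, cores_per_sm):
    """Registers available per CUDA core if the file is split evenly."""
    return registers_per_sm // cores_per_sm


maxwell_regs = registers_per_core(REGISTERS_PER_SM, 128)  # Maxwell SMM
pascal_regs = registers_per_core(REGISTERS_PER_SM, 64)    # Pascal SM
register_file_kb = REGISTERS_PER_SM * BYTES_PER_REGISTER // 1024

print(maxwell_regs, pascal_regs, register_file_kb)  # 512 1024 256
```

Registers are actually allocated per thread rather than per core, so treat the per-core split as a simple way to compare the two designs, not a description of how the hardware hands them out.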

As always the goal was to deliver higher performance and improved power efficiency. As such Pascal builds on the changes that were implemented into Maxwell after Kepler.

The Pascal Streaming Multiprocessor

The combined 14MB of register files and 4MB of overall shared memory across the GP100 GPU result in a twofold increase in overall bandwidth inside the chip compared to GM200.

Chief Technologist, GPU Computing Software Mark Harris
A higher ratio of shared memory, registers, and warps per SM in GP100 allows the SM to more efficiently execute code. There are more warps for the instruction scheduler to choose from, more loads to initiate, and more per-thread bandwidth to shared memory (per thread).

NVIDIA Pascal GP100 SM

According to Nvidia, the end result is that each Pascal SM requires less power and area to manage data transfers, even compared to a Kepler SMX, which improves both performance and power efficiency. Pascal also includes an updated scheduler that not only improves SM utilization (editorial note: better async compute performance, anyone?) but is also more intelligent and power efficient. Finally, each warp scheduler can dispatch two instructions per clock.

Nvidia's Senior Architect, Lars Nyland, admits that the 16nm FinFET process played an important role in realizing the team's power efficiency goals, but maintains that numerous architectural improvements further reduced the energy footprint of the architecture.


The table below is a high-level comparison of the Tesla P100's specifications with those of previous generation Tesla accelerators.

| Tesla Products | Tesla K40 | Tesla M40 | Tesla P100 |
|---|---|---|---|
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) |
| SMs | 15 | 24 | 56 |
| TPCs | 15 | 24 | 28 |
| FP32 CUDA Cores / SM | 192 | 128 | 64 |
| FP32 CUDA Cores / GPU | 2880 | 3072 | 3584 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 |
| Base Clock | 745 MHz | 948 MHz | 1328 MHz |
| GPU Boost Clock | 810/875 MHz | 1114 MHz | 1480 MHz |
| Compute Performance - FP32 | 5.04 TFLOPS | 6.82 TFLOPS | 10.6 TFLOPS |
| Compute Performance - FP64 | 1.68 TFLOPS | 0.21 TFLOPS | 5.3 TFLOPS |
| Texture Units | 240 | 192 | 224 |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 |
| Memory Size | Up to 12 GB | Up to 24 GB | 16 GB |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB |
| Register File Size / SM | 256 KB | 256 KB | 256 KB |
| Register File Size / GPU | 3840 KB | 6144 KB | 14336 KB |
| TDP | 235 Watts | 250 Watts | 300 Watts |
| Transistors | 7.1 billion | 8 billion | 15.3 billion |
| GPU Die Size | 551 mm² | 601 mm² | 610 mm² |
| Manufacturing Process | 28-nm | 28-nm | 16-nm |
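The peak compute figures in the table can be reproduced from the core counts and boost clocks. The sketch below assumes the usual convention that each CUDA core retires one fused multiply-add per clock, counted as two floating point operations; the function name is ours.

```python
def peak_tflops(cores, boost_mhz, flops_per_core_per_clock=2):
    """Theoretical peak throughput in TFLOPS.

    Assumes one fused multiply-add (2 FLOPs) per core per clock.
    """
    return cores * flops_per_core_per_clock * boost_mhz * 1e6 / 1e12


p100_fp32 = peak_tflops(3584, 1480)  # Tesla P100, FP32 cores
p100_fp64 = peak_tflops(1792, 1480)  # Tesla P100, FP64 cores
k40_fp32 = peak_tflops(2880, 875)    # Tesla K40 at its higher boost clock
k40_fp64 = peak_tflops(960, 875)

print(round(p100_fp32, 1), round(p100_fp64, 1))  # 10.6 5.3
print(round(k40_fp32, 2), round(k40_fp64, 2))    # 5.04 1.68
```

These are theoretical ceilings at the boost clock. Sustained throughput in real workloads depends on memory bandwidth, occupancy and how long the card can hold its boost frequency.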

A quick look at the table above shows one of the wonderful advantages of FinFET besides the area and power improvements: much faster transistor switching speeds. This has clearly translated to significantly higher clock speeds for Nvidia with the Pascal GP100 GPU compared to its 28nm predecessors. The Tesla P100 features a boost frequency of 1480MHz, very nearly touching 1.5GHz.

That's a whopping 33% gain in clock speeds over Maxwell. Considering how the GeForce GTX 900 series graphics cards can be overclocked to 1.5GHz and beyond with ease, I have very little doubt that we'll see enthusiasts pushing their GeForce Pascal graphics cards to 2GHz and beyond with little effort.

Serious Compute Is Back! GP100 Features A 1:2 Ratio Of FP64 to FP32

GP100 features double precision compute performance at half the rate of single precision compute, a ratio Nvidia has not offered since its Fermi based Tesla accelerators. The Kepler based GK110 featured a ratio of 3:1, and Maxwell was almost completely stripped of double precision with a ratio of 32:1. That is, for every block of 32 FP32 CUDA cores there was only 1 FP64 CUDA core. Pascal brings Nvidia back to the HPC (High Performance Computing) space, where double precision rules the roost.
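The ratios quoted above fall straight out of the per-SM core counts Nvidia publishes for each architecture. A quick sketch (the dictionary and helper names are ours):

```python
from fractions import Fraction

# (FP32 cores per SM, FP64 cores per SM) as published by Nvidia.
SM_CORES = {
    "GK110 (Kepler)": (192, 64),
    "GM200 (Maxwell)": (128, 4),
    "GP100 (Pascal)": (64, 32),
}


def fp32_per_fp64(fp32_cores, fp64_cores):
    """How many FP32 cores exist for every FP64 core."""
    return Fraction(fp32_cores, fp64_cores)


ratios = {name: fp32_per_fp64(*cores) for name, cores in SM_CORES.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}:1 FP32 to FP64")
# GK110 (Kepler): 3:1 FP32 to FP64
# GM200 (Maxwell): 32:1 FP32 to FP64
# GP100 (Pascal): 2:1 FP32 to FP64
```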

This is an area where AMD's Hawaii GPU was simply uncontested since it launched in late 2013, being the only GPU from either company on the market to sport a 2:1 ratio of FP32 to FP64.

Interestingly, the changes that Nvidia has been implementing in its streaming multiprocessors over the past several years, starting with the 192 CUDA core Kepler SMX in 2012, moving to the 128 CUDA core Maxwell SMM and finally to Pascal, have been morphing the company's graphics architecture into something much closer to AMD's GCN, whose basic building block, the Compute Unit, has 64 GCN cores.

The similarities don't end there either. With Pascal, Nvidia is renewing its focus on double precision compute, an area that GCN has traditionally excelled at. The updates that Nvidia has made to Pascal's scheduler are also a clear indicator that it's moving its architecture in the same direction that AMD has taken with GCN, which supports advanced hardware scheduling implementations and unique asynchronous compute engines.

Nvidia has also confirmed that Pascal is compliant with IEEE 754-2008 single and double precision arithmetic and supports FMA (fused multiply-add) operations as well as denormalized values at full speed.


FP16 At Double The Rate of FP32 Is A BIG Deal For Deep Learning

If you watched Jen-Hsun's GTC keynote earlier today you will know that it was all about deep learning. This field is already reshaping the future, along with how we as humans perceive our own intelligence and the limits of AI. I'm not going to deep dive on deep learning, no pun intended, in this particular overview. If you're interested in learning more about this subject, Nvidia's Tim Dettmers has written a great piece about it titled "Deep Learning in a Nutshell: Core Concepts" that you should check out.

Deep learning workloads represent a perfect scenario where mixed precision can be leveraged to pretty much double the performance. These workloads inherently require less precision, and using FP16 instructions results in very significant reductions in memory usage that allow deep learning to run on considerably larger networks, essentially allowing machines to learn much more effectively.

Because each Pascal CUDA core can run two FP16 operations at once and each 32-bit register can store two FP16 values at once, the GP100 GPU can effectively do FP16 compute work at twice the speed of FP32, and this is where that doubling in performance comes from.
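To make the packing idea concrete, here is a small CPU-side illustration of two FP16 values sharing one 32-bit word. It is only a sketch of the data layout, not of how the hardware executes it, and the helper names are ours. It leans on Python's `struct` module, which has supported the half precision `e` format since Python 3.6.

```python
import struct


def pack_half2(lo, hi):
    """Pack two FP16 values into one 32-bit word (lo in the low 16 bits)."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]


def unpack_half2(word):
    """Split a 32-bit word back into its two FP16 values."""
    return struct.unpack("<2e", struct.pack("<I", word))


def add_half2(a, b):
    """Element-wise add of two packed pairs: one operation, two FP16 results."""
    a_lo, a_hi = unpack_half2(a)
    b_lo, b_hi = unpack_half2(b)
    return pack_half2(a_lo + b_lo, a_hi + b_hi)


x = pack_half2(1.5, -2.0)
y = pack_half2(0.25, 4.0)
result = unpack_half2(add_half2(x, y))

print(hex(x))   # 0xc0003e00
print(result)   # (1.75, 2.0)
```

On the GPU the equivalent pair lives in a single register and both halves are processed by one instruction, which is where the doubled FP16 rate comes from. The flip side is that code only sees the benefit when its data is arranged in pairs.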

Nvidia Bringing Improved Memory Coherency With Pascal

Memory coherency is an essential attribute of modern accelerators. It allows data to flow freely and to be shared without unnecessary copies or wasteful, energy burning protocols. AMD, Nvidia's principal rival, has pushed the development of memory coherency in its IP much more aggressively, even compared to Intel, the much larger rival of both companies, primarily because AMD has built its future around heterogeneous computing. Memory coherency was essential to creating a truly heterogeneous APU, or accelerated processing unit (a processor that includes a CPU and a GPU).

In fact, some of the very early work of the HSA (Heterogeneous System Architecture) Foundation, which AMD founded early in the decade, was to realize the goal of truly coherent shared memory. As such, AMD's GCN (Graphics Core Next) architecture was designed with this in mind. Intel, for very similar reasons, quickly caught up and successfully introduced memory coherency to its chips. Maxwell marked Nvidia's entrance to the party, and naturally it's been improved upon even further with Pascal.

Pascal now supports native atomic FP64 add instructions in global memory, something that Maxwell could only do with compare-and-swap loops. Enabling this functionality via a native instruction inherently improves performance. The addition is a logical one. Pascal's double precision compute capability far outstrips Maxwell's, so extending atomic support to FP64 makes perfect sense.
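For readers unfamiliar with the compare-and-swap approach that Pascal's native instruction replaces, here is a rough CPU-side emulation of the pattern. Everything in it is illustrative: `Word64` is a made-up toy memory cell, and a lock stands in for the hardware's atomicity. The structure of the retry loop is the point.

```python
import struct
import threading


class Word64:
    """Toy 64-bit memory cell with an atomic compare-and-swap (hypothetical)."""

    def __init__(self, bits=0):
        self._bits = bits
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        return self._bits

    def compare_and_swap(self, expected, new):
        """Store `new` only if the cell still holds `expected`; return old bits."""
        with self._lock:
            old = self._bits
            if old == expected:
                self._bits = new
            return old


def f64_to_bits(x):
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bits_to_f64(b):
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def atomic_add_f64(cell, value):
    """Emulate an FP64 atomic add with a compare-and-swap retry loop."""
    old = cell.load()
    while True:
        assumed = old
        new = f64_to_bits(bits_to_f64(assumed) + value)
        old = cell.compare_and_swap(assumed, new)
        if old == assumed:            # nobody changed the cell in between
            return bits_to_f64(old)   # previous value, like a fetch-and-add


def worker(cell):
    for _ in range(1000):
        atomic_add_f64(cell, 0.5)


cell = Word64(f64_to_bits(0.0))
threads = [threading.Thread(target=worker, args=(cell,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = bits_to_f64(cell.load())
print(total)  # 2000.0
```

Every failed comparison means the addition has to be recomputed and retried, so under heavy contention the loop burns cycles that a single native instruction avoids.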

| GPU | Kepler GK110 | Maxwell GM200 | Pascal GP100 | Volta GV100 |
|---|---|---|---|---|
| Compute Capability | 3.5 | 5.3 | 6.0 | 7.0 |
| Threads / Warp | 32 | 32 | 32 | 32 |
| Max Warps / Multiprocessor | 64 | 64 | 64 | 64 |
| Max Threads / Multiprocessor | 2048 | 2048 | 2048 | 2048 |
| Max Thread Blocks / Multiprocessor | 16 | 32 | 32 | 32 |
| Max 32-bit Registers / SM | 65536 | 65536 | 65536 | 65536 |
| Max Registers / Block | 65536 | 32768 | 65536 | 65536 |
| Max Registers / Thread | 255 | 255 | 255 | 255 |
| Max Thread Block Size | 1024 | 1024 | 1024 | 1024 |
| CUDA Cores / SM | 192 | 128 | 64 | 64 |
| Shared Memory Size / SM Configurations (bytes) | 16K/32K/48K | 96K | 64K | 96K |

Next Generation Memory Technology - HBM2 Is Key

Both GPU makers have talked at great length about how detrimental the slow progression of memory standards has been to the steady growth of the performance of modern parallel processors. GPU design had reached a point where any additional compute performance was offset by the energy spent on the memory ecosystem necessary to feed the GPU with the bandwidth it needs to deliver its intended performance.

Hot Chips 2012 - Die Stacking & The System by Bryan Black, AMD's head of the die stacking program

The issue of memory bandwidth and HBM's role in tackling this challenge came to the forefront four years ago, when AMD's Bryan Black publicly spoke about it for the first time. Fast forward to today and we have HBM products on the market from AMD, with second generation HBM to be deployed in the not too distant future on GPUs from both vendors. HBM2 should therefore see much more widespread use in the industry as adoption and volume pick up.

Second generation stacked High Bandwidth Memory plays an instrumental role in allowing the GP100 GPU to reach its performance potential. Simply put, without HBM, GP100 wouldn't exist, and high performance GPU design would fall demonstrably behind Moore's law.

NVIDIA Confirmed To Be Supplied With Second-Gen HBM From Both Samsung And SK Hynix

HBM2 allows Nvidia to tackle two challenges with the Tesla P100: having enough bandwidth to keep the execution engines fed, and having enough memory capacity overall to do the actual work. This is especially true, again, in deep learning workloads, where massive data sets eat through the capacity of the frame buffer.

In addition to delivering significantly higher density and bandwidth compared to GDDR5, HBM2 is also considerably more power efficient. The Tesla P100 package includes four 4-Hi HBM2 stacks, for a total of 16 GB of memory and 720 GB/s of peak bandwidth. That's three times as much bandwidth as Nvidia's previous flagship accelerator, the Tesla M40.

Interestingly, the Tesla P100's bandwidth figure is below that of the JEDEC HBM2 spec that SK Hynix and Samsung both adhere to, which dictates that every 4-Hi HBM2 stack should operate at a 2GHz clock speed to deliver 256GB/s of bandwidth, for a total of 1TB/s across four stacks. The HBM2 modules on the Tesla P100 actually operate well below the spec at only 1.4GHz. Considering that the Tesla P100 is rated at a surprisingly high TDP of 300W, 50 watts more than the Tesla M40, this could perhaps be a conscious decision on the part of Nvidia to reduce the overall power of the package.
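The bandwidth numbers above can be sanity checked with simple arithmetic. The sketch below rests on two assumptions of ours: that the 4096-bit interface is split evenly into 1024 bits per stack, and that the quoted "GHz" figures are effective per-pin data rates in Gbit/s.

```python
def hbm_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak bandwidth in GB/s: bus width (bits) x per-pin rate (Gbit/s) / 8."""
    return bus_width_bits * data_rate_gbps / 8


STACKS = 4
BITS_PER_STACK = 4096 // STACKS  # assumed even split: 1024 bits per stack

per_stack_spec = hbm_bandwidth_gbs(BITS_PER_STACK, 2.0)
four_stacks_spec = hbm_bandwidth_gbs(BITS_PER_STACK * STACKS, 2.0)
p100_actual = hbm_bandwidth_gbs(BITS_PER_STACK * STACKS, 1.4)

print(per_stack_spec)         # 256.0
print(four_stacks_spec)       # 1024.0
print(round(p100_actual, 1))  # 716.8
```

The 716.8 GB/s result lines up with the roughly 720 GB/s Nvidia quotes for the Tesla P100, which suggests the 1.4GHz figure is the right one to plug in.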

SC15 (Supercomputing 2015) - Dr. Stephen W. Keckler, Senior Director of Architectural Research

Final Thoughts

Well, there you have it folks. This year's GTC definitely did not disappoint. It's been a double whammy for enthusiasts. First came Nvidia's announcement of its most powerful GPU yet, the Pascal flagship that everyone has been eager to hear more about. Second was the surprisingly deep level of detail that the company revealed about its next generation Pascal architecture. The detailed specs that Nvidia released for its GP100 GPU have also been a pleasant treat.

We can't wait to see what Nvidia has in store for us with its Pascal powered, next generation GeForce GPUs. We're certainly hoping that Nvidia's preparing just as strong of a double whammy this summer with its GP104 based GTX 980 and GTX 970 successors.

Full slide deck [Nvidia GTC 2016 - Jen-Hsun Huang Keynote]

