The first benchmark results of the NVIDIA GP100 GPU accelerator have been revealed (via Exxact Corp). Featured on the Tesla P100 graphics board, the GP100 GPU is aimed at hyperscale servers and high-performance computing (HPC) in general. The Tesla P100 is already shipping to NVIDIA's priority customers, which include supercomputing companies, and one of those organizations has decided to show us the first performance numbers of the GP100 GPU. (Source: PCGamesHardware via Videocardz)
At GTC 2016, NVIDIA announced the Tesla P100, their most advanced hyperscale GPU to date.
NVIDIA Tesla P100 Accelerator Benched in HPC Workloads - GP100 GPU First Tests Unveiled
The benchmarks we will be looking at come from a tool known as AMBER, which stands for Assisted Model Building with Energy Refinement. The tool was co-developed by Ross Walker from the San Diego Supercomputer Center and Scott Le Grand from Amazon Web Services. The name Amber refers to two things: a set of molecular mechanical force fields for simulating biomolecules, and a package of molecular simulation programs that includes source code and demos.
"Amber" refers to two things: a set of molecular mechanical force fields for the simulation of biomolecules (which are in the public domain, and are used in a variety of simulation programs); and a package of molecular simulation programswhich includes source code and demos. Amber is distributed in two parts: AmberTools16 and Amber16. via Ambermd.org

All of these benchmarks are part of HPC simulations and have nothing to do with general performance in gaming applications. They do, however, give us an overview of how well the GP100 GPU performs in such tasks against a range of other NVIDIA GPUs such as GP104, GM200 and GK110. The following configuration was used in the benchmark run (a small environment-check sketch follows the list):
Exxact AMBER Certified 2U GPU Workstation:
- CPU = Dual 8-Core Intel E5-2650v3 (2.3GHz), 64 GB DDR4 RAM
- (note the cheaper 6 Core E5-2620v3 and v4 CPUs would also give the same performance for GPU runs)
- MPICH v3.1.4 - GNU v4.8.5 - Centos 7.2
- CUDA Toolkit NVCC v7.5 (8.0RC1 for GTX-1080 and P100)
- NVIDIA Driver Linux 64 - 361.43
- Precision Model = SPFP (GPU), Double Precision (CPU)
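For anyone trying to reproduce a comparable setup, here is a minimal sketch that reports the same kind of version information listed above. It assumes a Linux box where the NVIDIA driver, the CUDA toolkit (nvcc), GCC and MPICH are already installed and on the PATH; adjust the commands to your own toolchain.

```python
# Minimal environment report for reproducing a comparable AMBER benchmark setup.
# Assumes a Linux system with the NVIDIA driver, CUDA toolkit (nvcc), GCC and
# MPICH installed and on PATH; this only prints versions, it runs no benchmark.
import subprocess

def run(cmd):
    """Run a command and return its stdout, or a note if the tool is missing."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()
    except (OSError, subprocess.CalledProcessError) as exc:
        return f"not available ({exc})"

print("GPU / driver :", run(["nvidia-smi", "--query-gpu=name,driver_version",
                             "--format=csv,noheader"]))
print("CUDA (nvcc)  :", run(["nvcc", "--version"]).splitlines()[-1])
print("Compiler     :", run(["gcc", "--version"]).splitlines()[0])
print("MPI          :", run(["mpichversion"]).splitlines()[0])
```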
Now there are a few things to note before we look at the benchmarks. These tests were conducted before either the Tesla P100 or the GTX 1080 had publicly launched. They also used the SPFP precision model on the GPUs, which means the GPUs relied on their single precision throughput for the bulk of the math, while the CPU runs used the double precision (FP64) model.
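To give a rough idea of what a hybrid precision scheme in the spirit of SPFP means in practice, here is an illustrative sketch of the general single-precision-compute / fixed-point-accumulation idea. This is a toy example, not AMBER's actual implementation, and the scaling factor is an arbitrary choice for illustration.

```python
# Toy sketch of a hybrid precision scheme in the spirit of SPFP:
# individual contributions are computed in single precision (FP32),
# but accumulated into a 64-bit fixed-point integer so the summation
# doesn't lose precision the way a plain FP32 accumulator can.
# This is NOT AMBER's actual code, just an illustration of the idea.
import numpy as np

SCALE = 2 ** 40  # fixed-point scaling factor (illustrative choice)

def accumulate_fixed_point(contributions_fp32):
    acc = np.int64(0)
    for f in contributions_fp32:                # each term computed in FP32
        acc += np.int64(np.float64(f) * SCALE)  # accumulate as a 64-bit integer
    return float(acc) / SCALE

rng = np.random.default_rng(0)
forces = rng.normal(scale=1e-3, size=100_000).astype(np.float32)

print("FP32 accumulator      :", np.sum(forces, dtype=np.float32))
print("Fixed-point (SPFP-ish):", accumulate_fixed_point(forces))
print("FP64 reference        :", np.sum(forces, dtype=np.float64))
```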

So how did the Amber team manage to get these cards before their announcement? Amber's developers work closely with NVIDIA, and since the program was developed and written with NVIDIA's help to accelerate research simulations, the team got first-hand access to these cards. This also means that both cards were engineering samples and should not be compared to the final retail versions, whose performance should be better optimized. The NVIDIA Pascal GPUs also ran a pre-release version of CUDA 8.0.
At the time of writing, the GTX-1080 and P100 (DGX-1) cards had not been publicly released. The benchmarks here are from pre-release hardware, so they represent a lower bound on the performance. It is hoped that, with access to released hardware, optimization of AMBER 16 specific to the Pascal architecture will be possible, resulting in improved performance (the Pascal hardware benchmarks made use of a pre-release version of CUDA 8.0). So without further ado, let's take a look at the benchmarks:
NVIDIA Tesla P100 GP100 GPU Benchmarks:
In the benchmarks provided below, we can see that a single Tesla P100 delivers enough throughput to outperform a quad Titan X configuration. We also note that in some cases the GeForce GTX 1080 is around as fast as the GP100 GPU, which comes down to the fact that GP104 is roughly a 9 TFLOPs graphics chip, not far from the 10.6 TFLOPs output of the Tesla P100 accelerator. That changes when multiple boards are used: the Tesla P100 is the fastest without a doubt, and with a proper NVLink implementation in the final models now shipping to customers, we could see even bigger gains.
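As a rough sanity check on those throughput figures, peak FP32 compute for these chips can be approximated as 2 FLOPs (one fused multiply-add) per CUDA core per clock. A quick back-of-the-envelope calculation using the publicly listed core counts and boost clocks:

```python
# Back-of-the-envelope peak FP32 throughput: 2 FLOPs (one FMA) per CUDA core per clock.
# Core counts and boost clocks are the publicly listed figures; sustained throughput
# in a real AMBER run will of course sit below these theoretical peaks.
def peak_fp32_tflops(cuda_cores, boost_clock_mhz):
    return 2 * cuda_cores * boost_clock_mhz * 1e6 / 1e12

cards = {
    "Tesla P100 (SXM2)": (3584, 1480),   # ~10.6 TFLOPs
    "Tesla P100 (PCIe)": (3584, 1329),   #  ~9.5 TFLOPs
    "GeForce GTX 1080":  (2560, 1733),   #  ~8.9 TFLOPs
}

for name, (cores, clock) in cards.items():
    print(f"{name}: {peak_fp32_tflops(cores, clock):.1f} TFLOPs FP32")
```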

The NVIDIA DGX-1 is a supercomputing rack capable of delivering up to 170 TFLOPs of compute performance.
The NVIDIA DGX-1 system uses up to 8 Tesla P100 boards and costs $129,000 US. The system includes the following specifications (a quick check of the headline FP16 figure follows the list):
- Up to 170 teraflops of half-precision (FP16) peak performance
- Eight Tesla P100 GPU accelerators, 16GB memory per GPU
- NVLink Hybrid Cube Mesh
- 7TB SSD DL Cache
- Dual 10GbE, Quad InfiniBand 100Gb networking
- 3U – 3200W
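That 170 TFLOPs headline figure follows directly from the per-board FP16 rating of the Tesla P100 (SXM2), as listed in the spec table further down:

```python
# Quick sanity check of the DGX-1's headline FP16 figure:
# 8 x Tesla P100 (SXM2) boards at ~21.2 TFLOPs FP16 each.
boards = 8
fp16_per_board_tflops = 21.2   # Tesla P100 SXM2 peak FP16 (see the spec table below)
print(f"DGX-1 aggregate FP16: ~{boards * fp16_per_board_tflops:.0f} TFLOPs")  # ~170 TFLOPs
```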
NVIDIA Pascal GP100 With Tesla P100 Graphics Board Benchmarks (Image Credits: Ambermd)
The following tests are too small to effectively scale to multiple modern GPUs, and since we are looking at pre-release hardware, NVLink isn't yet fine-tuned to make full use of all the Tesla P100 hardware (up to 4 boards in the benchmarks provided below).
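To illustrate why small simulations stop benefiting from extra GPUs, here is a toy strong-scaling model with made-up numbers (an assumption for illustration, not measured AMBER behaviour): per-step compute time shrinks with the number of GPUs, while a fixed per-step communication and launch overhead does not.

```python
# Toy strong-scaling model: step time = compute_time / n_gpus + fixed overhead.
# Purely illustrative numbers, not measured AMBER behaviour; the point is that
# once per-GPU compute time approaches the fixed overhead, extra GPUs stop helping.
def speedup(compute_ms, overhead_ms, n_gpus):
    t1 = compute_ms + overhead_ms
    tn = compute_ms / n_gpus + overhead_ms
    return t1 / tn

for label, compute_ms in [("large simulation", 50.0), ("small simulation", 2.0)]:
    scaling = [f"{speedup(compute_ms, overhead_ms=0.5, n_gpus=n):.2f}x" for n in (1, 2, 4)]
    print(f"{label}: speedup on 1/2/4 GPUs = {', '.join(scaling)}")
```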
NVIDIA Pascal GP100 With Tesla P100 Graphics Board Benchmarks (Image Credits: Ambermd)
For those expecting gaming benchmarks, we already made it clear that these results have nothing to do with general application performance. These workloads are specific to the HPC sector, and that is what the GP100 GPU has been designed to handle. We have heard rumors that NVIDIA is preparing a more cost-effective 16nm FinFET based GP102 GPU, which might launch later this year as a flagship Titan product with specs similar to the Tesla P100's. We don't have any confirmation, but we will update you as more news comes our way.
NVIDIA Tesla Graphics Cards Comparison:
NVIDIA Tesla Graphics Card | Tesla K40 (PCI-Express) | Tesla M40 (PCI-Express) | Tesla P100 (PCI-Express) | Tesla P100 (SXM2) | Tesla V100 (PCI-Express) | Tesla V100 (SXM2) | Tesla V100S (PCIe) |
---|---|---|---|---|---|---|---|
GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) | GV100 (Volta) | GV100 (Volta) | GV100 (Volta) |
Process Node | 28nm | 28nm | 16nm | 16nm | 12nm | 12nm | 12nm |
Transistors | 7.1 Billion | 8 Billion | 15.3 Billion | 15.3 Billion | 21.1 Billion | 21.1 Billion | 21.1 Billion |
GPU Die Size | 551 mm2 | 601 mm2 | 610 mm2 | 610 mm2 | 815mm2 | 815mm2 | 815mm2 |
SMs | 15 | 24 | 56 | 56 | 80 | 80 | 80 |
TPCs | 15 | 24 | 28 | 28 | 40 | 40 | 40 |
CUDA Cores Per SM | 192 | 128 | 64 | 64 | 64 | 64 | 64 |
CUDA Cores (Total) | 2880 | 3072 | 3584 | 3584 | 5120 | 5120 | 5120 |
Texture Units | 240 | 192 | 224 | 224 | 320 | 320 | 320 |
FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 | 32 | 32 | 32 |
FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1792 | 2560 | 2560 | 2560 |
Base Clock | 745 MHz | 948 MHz | 1190 MHz | 1328 MHz | 1230 MHz | 1297 MHz | TBD |
Boost Clock | 875 MHz | 1114 MHz | 1329MHz | 1480 MHz | 1380 MHz | 1530 MHz | 1601 MHz |
FP16 Compute | N/A | N/A | 18.7 TFLOPs | 21.2 TFLOPs | 28.0 TFLOPs | 30.4 TFLOPs | 32.8 TFLOPs |
FP32 Compute | 5.04 TFLOPs | 6.8 TFLOPs | 10.0 TFLOPs | 10.6 TFLOPs | 14.0 TFLOPs | 15.7 TFLOPs | 16.4 TFLOPs |
FP64 Compute | 1.68 TFLOPs | 0.2 TFLOPs | 4.7 TFLOPs | 5.30 TFLOPs | 7.0 TFLOPs | 7.80 TFLOPs | 8.2 TFLOPs |
Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 |
Memory Size | 12 GB GDDR5 @ 288 GB/s | 24 GB GDDR5 @ 288 GB/s | 16 GB HBM2 @ 732 GB/s 12 GB HBM2 @ 549 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 1134 GB/s |
L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 4096 KB | 6144 KB | 6144 KB | 6144 KB |
TDP | 235W | 250W | 250W | 300W | 250W | 300W | 250W |