Boston Labs – NVIDIA® Tesla® K80 GPU – What You Need to Know

NVIDIA Tesla

The NVIDIA Tesla Accelerated Computing Platform is the leading platform for accelerating big data analytics and scientific computing. With advanced system management features, accelerated communication technology, and support from popular infrastructure management software, the Tesla platform gives high performance computing professionals the tools to easily build, test, and deploy accelerated applications in the datacenter, achieving up to 10x better performance than CPU-only solutions.

Tesla K80

At last year's Supercomputing Conference, NVIDIA introduced the Tesla K40. Based on the GK110B variant of NVIDIA's Kepler GPU, it was the first Kepler Tesla card with all 2880 CUDA cores enabled. At this year's show, SC'14, NVIDIA introduced the Tesla K80, its fastest and densest Tesla card yet.
A dual-GPU board that combines 24 GB of memory with blazing fast memory bandwidth and up to 2.91 Tflops of double precision performance with NVIDIA GPU Boost, the Tesla K80 is designed for the most demanding computational tasks. It is ideal for single and double precision workloads that not only require leading compute performance but also demand high data throughput.

NVIDIA GPU Boost

NVIDIA GPU Boost is a feature available on NVIDIA® GeForce®, NVIDIA® Quadro® and NVIDIA® Tesla® graphics processing units (GPUs) that boosts application performance by increasing GPU core and memory clock rates when sufficient power and thermal headroom are available.

The K40's base clock is 745MHz, and it can be boosted up to 810MHz or 875MHz. GPU Boost has been taken further on the Tesla K80: the K80 has a base clock of 562MHz but can climb up to 875MHz, in 13MHz increments. Another new feature of the K80 is Autoboost, which ensures that, out of the box, the Tesla K80 always tries to achieve the best possible GPU clock and maximize performance automatically.
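
Clock control is exposed programmatically through NVML, the management library that ships with the NVIDIA driver (the same API nvidia-smi is built on). The following is a minimal sketch rather than a production tool: device index 0 and the K80's 2505 MHz memory clock are assumptions about the setup.

```cpp
// Minimal NVML sketch: list the boost clocks a K80 supports and turn on
// Autoboost. Assumes the NVML headers/library from the NVIDIA driver;
// error handling is reduced to one macro for brevity.
#include <cstdio>
#include <nvml.h>

#define CHECK(call) do { nvmlReturn_t r = (call); if (r != NVML_SUCCESS) { \
    fprintf(stderr, "NVML error: %s\n", nvmlErrorString(r)); return 1; } } while (0)

int main()
{
    CHECK(nvmlInit());

    nvmlDevice_t dev;
    CHECK(nvmlDeviceGetHandleByIndex(0, &dev));   // assumed: a K80 at index 0

    // The K80's GDDR5 runs at 2505 MHz; ask which graphics clocks are
    // available at that memory clock (562..875 MHz, 13 MHz apart).
    unsigned int clocks[64], count = 64;
    CHECK(nvmlDeviceGetSupportedGraphicsClocks(dev, 2505, &count, clocks));
    for (unsigned int i = 0; i < count; i++)
        printf("supported graphics clock: %u MHz\n", clocks[i]);

    // Autoboost: let the card climb to the highest sustainable clock.
    CHECK(nvmlDeviceSetAutoBoostedClocksEnabled(dev, NVML_FEATURE_ENABLED));

    nvmlShutdown();
    return 0;
}
```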

K80 vs K40 specifications

Features | Tesla K80 | Tesla K40
GPU | 2x Kepler GK210 | 1x Kepler GK110B
Core clock | 562 MHz | 745 MHz
Boost clock(s) | up to 875 MHz | 810 MHz, 875 MHz
Peak double precision floating point performance | 2.91 Tflops (GPU Boost clocks), 1.87 Tflops (base clocks) | 1.66 Tflops (GPU Boost clocks), 1.43 Tflops (base clocks)
Peak single precision floating point performance | 8.74 Tflops (GPU Boost clocks), 5.6 Tflops (base clocks) | 5 Tflops (GPU Boost clocks), 4.29 Tflops (base clocks)
Memory bandwidth (ECC off) | 480 GB/sec (240 GB/sec per GPU) | 288 GB/sec
Memory size (GDDR5) | 24 GB (12 GB per GPU) | 12 GB
CUDA cores | 4992 (2496 per GPU) | 2880
TDP | 300 W | 235 W
Cooling | Passive | Passive/Active
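
The peak Flops figures follow directly from the clocks and core counts above. As a worked check of the K80 column, assuming the commonly cited GK210 configuration of 13 active SMX units per GPU with 64 FP64 units each, each retiring one fused multiply-add (2 flops) per cycle:

```latex
% K80 peak double precision at the boost (875 MHz) and base (562 MHz) clocks:
\[
2\ \text{GPUs} \times 13 \times 64 \times 2 \times 0.875\ \mathrm{GHz} \approx 2.91\ \mathrm{Tflops}
\]
\[
2\ \text{GPUs} \times 13 \times 64 \times 2 \times 0.562\ \mathrm{GHz} \approx 1.87\ \mathrm{Tflops}
\]
```

The single precision peaks work the same way with 192 FP32 cores per SMX, which is also where the table's CUDA core count comes from (13 × 192 = 2496 per GPU).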

K80 vs K40 Performance

First, we measured the system's double precision (DP) floating point performance, in floating point operations per second (Flops), with the CUDA version of HPL. This HPL build uses both CPUs and GPUs, offloading the bulk of the workload onto the NVIDIA cards; a sketch of that core operation follows the list below. The system used included:

  • 2x Intel Xeon E5-2660 v3 CPUs
  • 128GB DDR4 2133MHz Memory
  • HPL binaries with CUDA 5.5
  • 2x NVIDIA GPU Cards
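
CUDA HPL was supplied as prebuilt binaries, so the sketch below is not that code; it is a hedged illustration of the DGEMM operation that dominates HPL's runtime, timed with CUDA events. The matrix size n = 8192 is an arbitrary choice.

```cpp
// Times one large double precision GEMM on the GPU and reports achieved
// Gflops; DGEMM is the kernel HPL offloads to the card. Illustrative
// sketch only, with error handling omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int n = 8192;                       // arbitrary illustrative size
    const double alpha = 1.0, beta = 0.0;
    const size_t bytes = (size_t)n * n * sizeof(double);

    double *A, *B, *C;
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    cudaMemset(A, 0, bytes);                  // contents don't affect timing
    cudaMemset(B, 0, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Warm-up call so one-time setup costs are not timed.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("DGEMM: %.1f Gflops\n", 2.0 * n * n * n / (ms * 1e6));  // 2n^3 flops

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```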

It must be noted that ECC was turned off on all of the cards (K80s and K40s) in order to have the maximum amount of GDDR5 memory available on each card. In addition, the K80s were set to the 745 MHz boost clock, as it produced the best performance and matched the clock of the K40s.
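
That setup can be scripted through NVML as well. A sketch under the same assumptions as before (all devices are K80s with a 2505 MHz memory clock; ECC changes require root privileges and only take effect after the GPU is reset or the machine rebooted):

```cpp
// Benchmark prep sketch: disable ECC to free GDDR5 capacity and pin the
// application clocks to the 745 MHz boost step used for the HPL runs.
// Assumes every device is a K80; error handling omitted for brevity.
#include <nvml.h>

int main()
{
    if (nvmlInit() != NVML_SUCCESS) return 1;

    unsigned int n = 0;
    nvmlDeviceGetCount(&n);
    for (unsigned int i = 0; i < n; i++) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(i, &dev);

        // Pending ECC mode; becomes current after the next GPU reset.
        nvmlDeviceSetEccMode(dev, NVML_FEATURE_DISABLED);

        // Memory clock 2505 MHz, graphics clock 745 MHz.
        nvmlDeviceSetApplicationsClocks(dev, 2505, 745);
    }

    nvmlShutdown();
    return 0;
}
```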

The results are shown in Figure 1.


Figure 1: Performance comparison with CUDA HPL

K80 vs K40 Memory Bandwidth

In addition to the HPL benchmark, which measures raw system performance, NVIDIA's bandwidthTest was used to measure the memory bandwidth of each NVIDIA card. The results are shown in the following graphs: one shows the Host-to-Device (CPU-to-GPU) memory bandwidth, while the other shows the Device-to-Device (GPU-to-GPU) memory bandwidth. In both cases, the bandwidth is measured for a range of message sizes.
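
bandwidthTest ships with the CUDA samples; the sketch below is a reduced version of what it measures, not the shipped source: a pinned host-to-device copy and a device-to-device copy of a single message size, timed with CUDA events. The 32 MB size and the 100-copy average are arbitrary choices.

```cpp
// Reduced bandwidthTest-style measurement: time 100 copies of one
// message size, pinned host-to-device and device-to-device, and report
// GB/s. Error handling omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

static double time_copy(void *dst, const void *src, size_t bytes, cudaMemcpyKind kind)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < 100; i++)              // average over 100 copies
        cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes * 100 / 1e9) / (ms / 1e3);   // GB/s, counting each byte once
}

int main()
{
    const size_t bytes = 32u << 20;            // 32 MB message, arbitrary

    void *host, *d_a, *d_b;
    cudaMallocHost(&host, bytes);              // pinned, as bandwidthTest uses
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);

    printf("Host-to-Device:   %.1f GB/s\n",
           time_copy(d_a, host, bytes, cudaMemcpyHostToDevice));
    printf("Device-to-Device: %.1f GB/s\n",
           time_copy(d_b, d_a, bytes, cudaMemcpyDeviceToDevice));

    cudaFreeHost(host);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```

One difference worth noting: the shipped sample counts device-to-device traffic twice (once for the read and once for the write, since both hit device memory), while this sketch counts each byte once.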


Figure 2: Host-to-Device Bandwidth

Figure 3: Device-to-Device Bandwidth

In both cases, it can be seen that the K80 provides double the memory bandwidth of the K40. One notable fact is that the Device-to-Device bandwidth is roughly 20 times larger than the Host-to-Device one: the Host-to-Device test has to move data across the PCIe bus, while the Device-to-Device test measures a memory copy between two buffers in the same GPU's memory, with no transfer to the host's memory at all.

Lastly, in Figure 3 we can see a point where the curve drops, just after the 716.8 KB transfer size, at 819.2 KB. This happens because the GPU's L2 cache is 1536 KB, which is enough to hold two buffers of 768 KB each, but at 819.2 KB the test allocates a total of 1638.4 KB (two buffers of 819.2 KB), which no longer fits in the cache, so the measured bandwidth decreases. Beyond that size, the GPU's DRAM bandwidth dominates instead of the cache bandwidth.
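
The boundary is easy to reproduce by sweeping the device-to-device copy size across 768 KB. A self-contained sketch, taking the 1536 KB L2 figure above as given and bracketing the drop seen in Figure 3:

```cpp
// Sweep the device-to-device copy size across the L2 capacity boundary.
// Two 768 KB buffers (1536 KB total) still fit in L2; two 819.2 KB
// buffers (1638.4 KB) no longer do, so bandwidth drops to DRAM levels.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Transfer sizes in bytes, bracketing the drop seen in Figure 3.
    const size_t sizes[] = {614400, 716800, 786432, 819200, 1048576};

    void *src, *dst;
    cudaMalloc(&src, 2 << 20);                 // 2 MB each, enough headroom
    cudaMalloc(&dst, 2 << 20);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (size_t bytes : sizes) {
        cudaEventRecord(start);
        for (int i = 0; i < 100; i++)
            cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%7.1f KB: %7.1f GB/s\n",
               bytes / 1000.0, bytes * 100.0 / 1e9 / (ms / 1e3));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```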