Boston Labs – Nvidia® K40® GPU – What You Need to Know

Nvidia

Fifteen years ago Nvidia introduced the world to the power of GPUs for graphics.  Over the years since, they have brought the GPU to high-performance computing.  Nvidia are the world's leading manufacturer of GPUs and provide extensive support for their use through the CUDA programming model.

At SC13 Nvidia announced the latest generation of the Tesla series of GPUs, the K40.  The K40 provides a range of performance improvements alongside new features.  Together with the latest release of CUDA, version 5.5, this performance can be leveraged to solve scientific problems in all subject areas.

K40 and K20 Specifications

| Specification     | Tesla K40         | Tesla K20X        | Tesla K20         |
|-------------------|-------------------|-------------------|-------------------|
| Stream Processors | 2880              | 2688              | 2496              |
| Core Clock        | 745MHz            | 732MHz            | 706MHz            |
| Boost Clock(s)    | 810MHz, 875MHz    | N/A               | N/A               |
| Shader Clock      | N/A               | N/A               | N/A               |
| Memory Clock      | 6GHz GDDR5        | 5.2GHz GDDR5      | 5.2GHz GDDR5      |
| Memory Bus Width  | 384-bit           | 384-bit           | 320-bit           |
| VRAM              | 12GB              | 6GB               | 5GB               |
| Single Precision  | 4.29 TFLOPS       | 3.95 TFLOPS       | 3.52 TFLOPS       |
| Double Precision  | 1.43 TFLOPS (1/3) | 1.31 TFLOPS (1/3) | 1.17 TFLOPS (1/3) |
| Transistor Count  | 7.1B              | 7.1B              | 7.1B              |
| TDP               | 235W              | 235W              | 225W              |
| Cooling           | Active/Passive    | Passive           | Active/Passive    |
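
The peak throughput figures in the specifications follow from the core count and core clock. As a rough check, the sketch below assumes each stream processor completes one fused multiply-add (two floating-point operations) per cycle, and that double precision runs at one third of the single-precision rate, as the (1/3) annotation indicates. The memory bandwidth figure is not listed in the specifications; it is derived here from the listed memory clock and bus width. The function names are illustrative only.

```python
# Values copied from the specifications:
# (stream processors, core clock in MHz, effective memory clock in GHz, bus width in bits)
SPECS = {
    "Tesla K40":  (2880, 745, 6.0, 384),
    "Tesla K20X": (2688, 732, 5.2, 384),
    "Tesla K20":  (2496, 706, 5.2, 320),
}


def peak_sp_tflops(cores, clock_mhz):
    """Peak single precision: cores x clock x 2 FLOPs per cycle (one FMA, an assumption)."""
    return cores * clock_mhz * 1e6 * 2 / 1e12


def peak_dp_tflops(cores, clock_mhz):
    """Peak double precision, taken as one third of the single-precision rate."""
    return peak_sp_tflops(cores, clock_mhz) / 3


def memory_bandwidth_gbs(mem_clock_ghz, bus_bits):
    """Peak memory bandwidth in GB/s: effective clock x bus width in bytes."""
    return mem_clock_ghz * bus_bits / 8


for name, (cores, clock, mem_clock, bus) in SPECS.items():
    print(
        f"{name}: {peak_sp_tflops(cores, clock):.2f} TFLOPS SP, "
        f"{peak_dp_tflops(cores, clock):.2f} TFLOPS DP, "
        f"{memory_bandwidth_gbs(mem_clock, bus):.0f} GB/s"
    )
```

For the K40 and K20 this reproduces the listed single and double precision values. The K20X comes out at roughly 3.94 TFLOPS, marginally below the 3.95 quoted, which suggests the listed clock is rounded.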

Nvidia Boost Technology

The K40 is limited to a power budget of 235W, but some codes will not fully utilize this.  Nvidia Boost Technology gives the user the ability to harness the unused power headroom to increase performance.

The K40 allows the clock speed to be set to one of three levels: base, boost1 and boost2.  The two boost levels increase the core clock speed from the base level of 745MHz to either 810MHz or 875MHz, maximising the performance of the GPU.
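
To put the boost levels in perspective, the sketch below computes the theoretical uplift of each boost clock over the base clock, using the 810MHz and 875MHz values from the K40 specifications. It also builds, without running, the `nvidia-smi` command typically used to select an application clock. The 3004MHz memory clock passed to `-ac` is an assumption based on Nvidia's published K40 settings, not something stated in this article; confirm the clocks your card reports with `nvidia-smi -q -d SUPPORTED_CLOCKS` first. The helper names are illustrative only.

```python
BASE_MHZ = 745
BOOST_MHZ = {"boost1": 810, "boost2": 875}
MEMORY_MHZ = 3004  # assumed K40 memory clock; verify with SUPPORTED_CLOCKS


def uplift_percent(boost_mhz, base_mhz=BASE_MHZ):
    """Theoretical clock uplift of a boost level over the base clock, in percent."""
    return (boost_mhz / base_mhz - 1) * 100


def application_clock_command(graphics_mhz, memory_mhz=MEMORY_MHZ, gpu_index=0):
    """Return (but do not run) the nvidia-smi arguments that set application clocks."""
    return ["nvidia-smi", "-i", str(gpu_index), "-ac", f"{memory_mhz},{graphics_mhz}"]


for level, mhz in BOOST_MHZ.items():
    print(f"{level}: {mhz}MHz, +{uplift_percent(mhz):.1f}% over base")
    print("  " + " ".join(application_clock_command(mhz)))
```

The clock uplift is an upper bound of roughly 9% for boost1 and 17% for boost2; the gain a given code sees depends on how far it is limited by compute rather than memory. Setting application clocks generally requires administrator privileges, and `nvidia-smi -rac` returns the card to its default clocks.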

HPL Performance Comparison 

[Figure: HPL performance, K20 vs K40]

The K40 showed a 30% improvement in HPL performance over the equivalent K20.  HPL was run on identical systems using CentOS 6.4 and a single GPU.  The K40 HPL benchmark used the base clock frequency.

Real World Testing

As part of our Nvidia Test Drive Cluster, a large number of users have tested a variety of real-world codes on the K40 GPUs.  The results were compared to their current hardware configurations.  In most cases there was a significant improvement in performance.  A few comments from test drives are shown below:

The increase in global memory to 12GB of the K40 allowed for significant opportunity to mitigate host-to-device memory transfers

The boost functionality provided a 20% increase in compute performance

This test demonstrates the Big O complexity of brytlyt’s RIP algorithm is linear and therefore scales extremely well