STREAM Results on Large-Scale SMP with Numascale

Very quick post this week – no fluff to parse through 🙂

 

In the lab we set up 8 AMD nodes, each with 2x 6380 CPUs and 256GB of memory, all connected via Numascale adapters. On its own, each box delivered ~60GB/s of memory bandwidth as measured by STREAM. When we scaled the test up across the whole system, we were pleased to see virtually linear scaling, as shown below:

 


[boston@numademo stream_numascale]$ OMP_NUM_THREADS=128 GOMP_CPU_AFFINITY=0-255:2 ./stream_c.exe.gnu.178g
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 8000000000 (elements), Offset = 0 (elements)
Memory per array = 61035.2 MiB (= 59.6 GiB).
Total memory required = 183105.5 MiB (= 178.8 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 128
Number of Threads counted = 128
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 307665 microseconds.
(= 307665 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           486693.8    0.275797     0.262999     0.295733
Scale:          485919.8    0.268221     0.263418     0.276273
Add:            466143.8    0.426430     0.411890     0.475929
Triad:          473972.0    0.435202     0.405087     0.523107
-------------------------------------------------------------
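
As a rough sanity check on the "virtually linear" claim, the sketch below compares each kernel's best rate from the run above against an ideal of 8 x 60GB/s. The 60GB/s per-node figure is only the approximate value quoted earlier, so a result slightly above 100% says more about that estimate than about the hardware. The function and variable names are made up for illustration.

```python
# Figures taken from this post: ~60GB/s per node, 8 nodes, and the
# "Best Rate MB/s" column of the 128-thread run.
PER_NODE_MBPS = 60_000.0  # approximate single-node STREAM figure (assumption: exactly 60GB/s)
NODES = 8

best_rate_mbps = {
    "Copy": 486693.8,
    "Scale": 485919.8,
    "Add": 466143.8,
    "Triad": 473972.0,
}


def scaling_efficiency(measured_mbps, per_node_mbps=PER_NODE_MBPS, nodes=NODES):
    """Measured bandwidth as a fraction of perfect linear scaling."""
    return measured_mbps / (per_node_mbps * nodes)


for kernel, rate in best_rate_mbps.items():
    print(f"{kernel:<6} {rate / 1000:7.1f} GB/s  {scaling_efficiency(rate):6.1%} of ideal")
```

With these inputs every kernel lands between roughly 97% and 101% of the ideal 480GB/s.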

For those of you interested in learning a little more about what exactly the STREAM benchmark is, and what its four kernels (copy, scale, add and triad) actually do under the hood, please take a look at this great article by Jeff Layton over at Admin Magazine (HPC): <a href=http://www.admin-magazine.com/HPC/Articles/Finding-Bandwidth-Bottlenecks-with-Stream>Finding Bandwidth Bottlenecks with STREAM</a>
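
For a quick feel of the four kernels, here is a minimal Python sketch of what each one computes, plus the arithmetic STREAM uses to turn a kernel time into MB/s. It is an illustration only: the real benchmark is C with OpenMP running over very large arrays, and the helper names below are hypothetical. The byte counting assumes 8-byte elements, as reported in the output above.

```python
# Illustrative only: plain Python lists stand in for STREAM's large arrays.

def copy(a, b, c, scalar):
    c[:] = a                                        # c[j] = a[j]

def scale(a, b, c, scalar):
    b[:] = [scalar * x for x in c]                  # b[j] = scalar * c[j]

def add(a, b, c, scalar):
    c[:] = [x + y for x, y in zip(a, b)]            # c[j] = a[j] + b[j]

def triad(a, b, c, scalar):
    a[:] = [y + scalar * z for y, z in zip(b, c)]   # a[j] = b[j] + scalar * c[j]

# Number of arrays each kernel reads or writes per element.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def stream_rate_mbps(arrays_touched, n_elements, seconds, bytes_per_element=8):
    """Bandwidth in MB/s (1 MB = 10**6 bytes) for one kernel pass."""
    return 1.0e-6 * arrays_touched * bytes_per_element * n_elements / seconds


# Reproduce the Copy and Triad rates from the "Min time" column of the run.
N = 8_000_000_000
print(f"Copy:  {stream_rate_mbps(ARRAYS_TOUCHED['Copy'], N, 0.262999):.1f} MB/s")
print(f"Triad: {stream_rate_mbps(ARRAYS_TOUCHED['Triad'], N, 0.405087):.1f} MB/s")
```

The recomputed figures agree with the reported best rates to within the rounding of the printed times, which is why add and triad (three arrays each) take roughly 1.5x as long as copy and scale for about the same bandwidth.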