It’s a Multicore World

John Urbanic
Pittsburgh Supercomputing Center
Parallel Computing Scientist
Moore's Law abandoned serial programming around 2004

![Graph showing SPEC CINT Performance (log scale) with labels for CPU92, CPU95, CPU2000, CPU2006, and CPU2017. The graph indicates a historic performance growth rate and an advent of multicore, with a 7 Years Behind and a 12x increase.](image-url)
Moore’s Law is not at all dead…

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>Feature Size</td>
<td>90nm</td>
<td>65nm</td>
<td>45nm</td>
<td>32nm</td>
<td>22nm</td>
<td>16nm</td>
<td>11nm</td>
<td>8nm</td>
</tr>
<tr>
<td>Integration Capacity</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>(Billions of Transistors)</td>
<td>2</td>
<td>4</td>
<td>8</td>
<td>16</td>
<td>32</td>
<td>64</td>
<td>128</td>
<td>256</td>
</tr>
</tbody>
</table>

![Transistor for 90nm Process](Source: Intel)

![Influenza Virus](Source: CDC)
At end of day, we keep using all those new transistors.
That Power and Clock Inflection Point in 2004…
didn’t get better.

Fun fact: At 100+ Watts and <1V, currents are beginning to exceed 100A at the point of load!
Not a new problem, just a new scale...

Cray-2 with cooling tower in foreground, circa 1985
And how to get more performance from more transistors with the same power.

**RULE OF THUMB**

<table>
<thead>
<tr>
<th>Frequency Reduction</th>
<th>Power Reduction</th>
<th>Performance Reduction</th>
</tr>
</thead>
<tbody>
<tr>
<td>15%</td>
<td>45%</td>
<td>10%</td>
</tr>
</tbody>
</table>

A 15% Reduction In Voltage Yields

**SINGLE CORE**

- Area = 1
- Voltage = 1
- Freq = 1
- Power = 1
- Perf = 1

**DUAL CORE**

- Area = 2
- Voltage = 0.85
- Freq = 0.85
- Power = 1
- Perf = ~1.8

A 15% Reduction In Voltage Yields
# Single Socket Parallelism

<table>
<thead>
<tr>
<th>Processor</th>
<th>Year</th>
<th>Vector</th>
<th>Bits</th>
<th>SP FLOPs / core / cycle</th>
<th>Cores</th>
<th>FLOPs/cycle</th>
</tr>
</thead>
<tbody>
<tr>
<td>Pentium III</td>
<td>1999</td>
<td>SSE</td>
<td>128</td>
<td>3</td>
<td>1</td>
<td>3</td>
</tr>
<tr>
<td>Pentium IV</td>
<td>2001</td>
<td>SSE2</td>
<td>128</td>
<td>4</td>
<td>1</td>
<td>4</td>
</tr>
<tr>
<td>Core</td>
<td>2006</td>
<td>SSE3</td>
<td>128</td>
<td>8</td>
<td>2</td>
<td>16</td>
</tr>
<tr>
<td>Nehalem</td>
<td>2008</td>
<td>SSE4</td>
<td>128</td>
<td>8</td>
<td>10</td>
<td>80</td>
</tr>
<tr>
<td>Sandybridge</td>
<td>2011</td>
<td>AVX</td>
<td>256</td>
<td>16</td>
<td>12</td>
<td>192</td>
</tr>
<tr>
<td>Haswell</td>
<td>2013</td>
<td>AVX2</td>
<td>256</td>
<td>32</td>
<td>18</td>
<td>576</td>
</tr>
<tr>
<td>KNC</td>
<td>2012</td>
<td>AVX512</td>
<td>512</td>
<td>32</td>
<td>64</td>
<td>2048</td>
</tr>
<tr>
<td>KNL</td>
<td>2016</td>
<td>AVX512</td>
<td>512</td>
<td>64</td>
<td>72</td>
<td>4608</td>
</tr>
<tr>
<td>Skylake</td>
<td>2017</td>
<td>AVX512</td>
<td>512</td>
<td>96</td>
<td>28</td>
<td>2688</td>
</tr>
</tbody>
</table>
Prototypical Application: Serial Weather Model
First Parallel Weather Modeling Algorithm: Richardson in 1917

Courtesy John Burkhardt, Virginia Tech
Weather Model: Shared Memory (OpenMP)

Four meteorologists in the same room sharing the map.

Fortran:

```fortran
!$omp parallel do
do i = 1, n
   a(i) = b(i) + c(i)
enddo
```

C/C++:

```c
#pragma omp parallel for
for(i=1; i<=n; i++)
   a[i] = b[i] + c[i];
```
Weather Model: Accelerator (OpenACC)

1 meteorologists coordinating 1000 math savants using tin cans and a string.

```c
#pragma acc kernels
for (i=0; i<N; i++)  {
    double t = (double)((i+0.05)/N);
    pi += 4.0/(1.0+t*t);
}

__global__ void saxpy_kernel( float a, float* x, float* y, int n ){
    int i;
    i = blockIdx.x*blockDim.x + threadIdx.x;
    if( i <= n ) x[i] = a*x[i] + y[i];
}
```
call MPI_Send( numbertosend, 1, MPI_INTEGER, index, 10, MPI_COMM_WORLD, errcode)
.
.
call MPI_Recv( numbertoreceive, 1, MPI_INTEGER, 0, 10, MPI_COMM_WORLD, status, errcode)
.
.
call MPI_Barrier(MPI_COMM_WORLD, errcode)
.
50 meteorologists using a telegraph.
The pieces fit like this…

- OpenMP
- OpenACC
- MPI
Many Levels and Types of Parallelism

- Vector (SIMD)
- Instruction Level (ILP)
  - Instruction pipelining
  - Superscaler (multiple instruction units)
  - Out-of-order
  - Register renaming
  - Speculative execution
  - Branch prediction
- Multi-Core (Threads)
- SMP/Multi-socket
- Accelerators: GPU & MIC
- Clusters
- MPPs

Compiler (not your problem)

Also Important
- ASIC/FPGA/DSP
- RAID/IO
Cores, Nodes, Processors, PEs?

- The most unambiguous way to refer to the smallest useful computing device is as a Processing Element, or PE.
- This is usually the same as a single core.
- “Processors” usually have more than one core – as per the previous list.
- “Nodes” is commonly used to refer to an actual physical unit, most commonly a circuit board or blade with a network connection. These often have multiple processors.

I will try to use the term PE consistently here, but I may slip up myself. Get used to it as you will quite often hear all of the above terms used interchangeably where they shouldn’t be.
MPPs (Massively Parallel Processors)

Distributed memory at largest scale. Shared memory at lower level.

Summit (ORNL)
- 122 PFlops Rmax and 187 PFlops Rpeak
- IBM Power 9, 22 core, 3GHz CPUs
- 2,282,544 cores
- NVIDIA Volta GPUs
- EDR Infiniband

Sunway TaihuLight (NSC, China)
- 93 PFlops Rmax and 125 PFlops Rpeak
- Sunway SW26010 260 core, 1.45GHz CPU
- 10,649,600 cores
- Sunway interconnect
From a document you should read if you are interested in this:

<table>
<thead>
<tr>
<th>#</th>
<th>Site</th>
<th>Manufacturer</th>
<th>Computer</th>
<th>CPU Interconnect [Accelerator]</th>
<th>Cores</th>
<th>Rmax (Tflops)</th>
<th>Rpeak (Tflops)</th>
<th>Power (MW)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>DOE/SC/ORNL United States</td>
<td>IBM</td>
<td>Summit</td>
<td>Power9 22C 3.0 GHz Dual-rail Infiniband EDR NVIDIA V100</td>
<td>2,282,544</td>
<td>122,300</td>
<td>187,659</td>
<td>8.8</td>
</tr>
<tr>
<td>2</td>
<td>National Super Computer Center in Guangzhou China</td>
<td>NRCPC</td>
<td>Sunway TaihuLight</td>
<td>Sunway SW26010 260C 1.45GHz</td>
<td>10,649,600</td>
<td>93,014</td>
<td>125,435</td>
<td>15.3</td>
</tr>
<tr>
<td>3</td>
<td>DOE/NNSA/LLNL United States</td>
<td>IBM</td>
<td>Sierra</td>
<td>Power9 3.1 GHz 22C Infiniband EDR NVIDIA V100</td>
<td>1,572,480</td>
<td>71,610</td>
<td>119,193</td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>National Super Computer Center in Guangzhou China</td>
<td>NUDT</td>
<td>Tianhe-2 (MilkyWay-2)</td>
<td>Intel Xeon E5-2692 2.2 GHz TH Express-2 Intel Xeon Phi 31S1P</td>
<td>4,981,760</td>
<td>61,444</td>
<td>100,678</td>
<td>18.4</td>
</tr>
<tr>
<td>5</td>
<td>AIST Japan</td>
<td>Fujitsu</td>
<td>AI Bridging Cloud Primergy</td>
<td>Xeon 6148 2.4GHz Infiniband EDR NVIDIA V100</td>
<td>391,680</td>
<td>19,880</td>
<td>32,576</td>
<td>1.6</td>
</tr>
<tr>
<td>6</td>
<td>Swiss National Supercomputing Centre (CSCS) Switzerland</td>
<td>Cray</td>
<td>Piz Daint</td>
<td>Xeon E5-2690 2.6 GHz Aries NVIDIA P100</td>
<td>361,760</td>
<td>19,590</td>
<td>25,326</td>
<td>2.2</td>
</tr>
<tr>
<td>7</td>
<td>DOE/SC/Oak Ridge National Laboratory United States</td>
<td>Cray</td>
<td>Titan</td>
<td>Opteron 6274 2.2 GHz Gemini NVIDIA K20x</td>
<td>560,640</td>
<td>17,590</td>
<td>27,112</td>
<td>8.2</td>
</tr>
<tr>
<td>8</td>
<td>DOE/NNSA/LLNL United States</td>
<td>IBM</td>
<td>Sequoia BlueGene/Q</td>
<td>Power BQC 1.6 GHz Custom</td>
<td>1,572,864</td>
<td>17,173</td>
<td>20,132</td>
<td>7.8</td>
</tr>
<tr>
<td>9</td>
<td>DOE/NNSA/LANL/SNL United States</td>
<td>Cray</td>
<td>Trinity</td>
<td>Xeon E5-2698v3 2.3 GHz Aries Intel Xeon Phi 7250</td>
<td>979,968</td>
<td>14,137</td>
<td>43,902</td>
<td>7.8</td>
</tr>
<tr>
<td>10</td>
<td>DOE/SC/LBNL/NERSC United States</td>
<td>Cray</td>
<td>Cori</td>
<td>Xeon E5-2698v3 2.3 GHz Aries Intel Xeon Phi 7250</td>
<td>622,336</td>
<td>14,014</td>
<td>27,880</td>
<td>3.9</td>
</tr>
</tbody>
</table>

OpenACC is a first class API!
Amdahl’s Law

- If there is x% of serial component, speedup cannot be better than 100/x.

- If you decompose a problem into many parts, then the parallel time cannot be less than the largest of the parts.

- If the critical path through a computation is T, you cannot complete in less time than T, no matter how many processors you use.

- Amdahl's law used to be cited by the knowledgeable as a limitation.

- These days it is mostly raised by the uninformed.

- Massive scaling is commonplace:
  - Science Literature
  - Web (map reduce everywhere)
  - Data Centers (Spark, etc.)
  - Machine Learning (GPUs and others)
In Conclusion...

- OpenMP
- OpenACC
- MPI