# Intro To Parallel Computing

#### John Urbanic

Parallel Computing Scientist
Pittsburgh Supercomputing Center

### Purpose of this talk

- This is the 50,000 ft. view of the parallel computing landscape. We want to orient you a bit before parachuting you down into the trenches.
- This talk bookends our technical content along with the Outro to Parallel Computing talk.
   The Intro has a strong emphasis on hardware, as this dictates the reasons that the software has the form and function that it has. Hopefully our programming constraints will seem less arbitrary.
- The Outro talk can discuss alternative software approaches in a meaningful way because you will then have one base of knowledge against which we can compare and contrast.
- The plan is that you walk away knowing not just MPI, etc., but where it fits into the world of High Performance Computing.

### FLOPS we need: Climate change analysis



#### **Simulations**

- Cloud resolution, quantifying uncertainty, understanding tipping points, etc., will drive climate to exascale platforms
- New math, models, and systems support will be needed

#### **Extreme data**

- "Reanalysis" projects need 100x more computing to analyze observations
- Machine learning and other analytics are needed today for petabyte data sets
- Combined simulation/observation will empower policy makers and scientists

### **Exascale combustion simulations**

- Goal: 50% improvement in engine efficiency
- Center for Exascale Simulation of Combustion in Turbulence (ExaCT)
  - Combines simulation and experimentation
  - Uses new algorithms, programming models, and computer science











### Modha Group at IBM Almaden



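Recent simulations achieve unprecedented scale of

| Date           | System        | S                    |
|----------------|---------------|----------------------|
| December, 2006 | Almaden       | $128 \times 10^9$    |
| April, 2007    | Watson        | $448 \times 10^9$    |
| March, 2009    | WatsonShaheen | $6.1 \times 10^{12}$ |
| May, 2009      | LLNL Dawn     | $20 \times 10^{12}$  |
| June, 2012     | LLNL Sequoia  | $220 \times 10^{12}$ |

BG/L BG/P BG/P BG/Q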

### 'Nuff Said

There is an appendix with many more important exascale challenge applications at the end of our Outro To Parallel Computing talk.

And many of you doubtless brought your own immediate research concerns. Great!

### Moore's Law abandoned serial programming around 2004



### Moore's Law is not dead yet. Maybe.

#### Intel process technology capabilities











| High Volume<br>Manufacturing                         | 2004 | 2006 | 2008 | 2010 | 2012        | 2014 | 2016 | 2018 | 2020 |
|------------------------------------------------------|------|------|------|------|-------------|------|------|------|------|
| Feature Size                                         | 90nm | 65nm | 45nm | 32nm | <b>22nm</b> | 16nm | 14nm | 10nm | 7nm  |
| Integration Capacity<br>(Billions of<br>Transistors) | 2    | 4    | 8    | 16   | 32          | 64   | 128  | 256  | 000  |



**Transistor for 90nm Process** 

Source: Intel



**Influenza Virus** 

Source: CDC

#### But, at the end of the day, we keep getting more transistors.



Data source: Wikipedia (https://en.wikipedia.org/wiki/Transistor\_count)
The data visualization is available at OurWorldinData.org. There you find more visualizations and research on this topic.

That Power and Clock Inflection Point in 2004... didn't get better.



## Not a new problem, just a new scale...



Cray-2 with cooling tower in foreground, circa 1985

And how to get more performance from more transistors with the same power.



#### **RULE OF THUMB**

| Frequency | Power     | Performance |
|-----------|-----------|-------------|
| Reduction | Reduction | Reduction   |
| 15%       | 45%       | 10%         |





Area = 1

Voltage = 1

Freq = 1

Power = 1

Perf = 1

#### **DUAL CORE**



Area = 2

Voltage = 0.85

Freq = 0.85

Power = 1

Perf =  $\sim 1.8$ 
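The rule of thumb above can be checked with a little arithmetic. The Python sketch below uses only the figures from the table (a 15% frequency reduction saves roughly 45% power and costs roughly 10% performance per core) and assumes, as an idealization not stated on the slide, that work scales perfectly across two cores. The function name `scale_cores` is hypothetical.

```python
def scale_cores(n_cores, power_per_core, perf_per_core):
    """Return (total_power, total_perf) relative to one full-speed core.

    Assumes perfect scaling across cores, which real codes rarely achieve.
    """
    return n_cores * power_per_core, n_cores * perf_per_core


# Rule-of-thumb values taken from the table: 15% lower frequency
# gives about 45% less power and about 10% less performance per core.
POWER_AT_REDUCED_CLOCK = 1.0 - 0.45
PERF_AT_REDUCED_CLOCK = 1.0 - 0.10

single = scale_cores(1, 1.0, 1.0)
dual = scale_cores(2, POWER_AT_REDUCED_CLOCK, PERF_AT_REDUCED_CLOCK)

print(f"single core: power={single[0]:.2f}, perf={single[1]:.2f}")
print(f"dual core:   power={dual[0]:.2f}, perf={dual[1]:.2f}")
```

Under these assumptions, two slower cores land at about 1.1 times the power (which the slide rounds to 1) for about 1.8 times the performance.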

### **Single Socket Parallelism**

| Processor   | Year | Vector | Bits | SP FLOPs / core /<br>cycle | Cores | FLOPs/cycle |
|-------------|------|--------|------|----------------------------|-------|-------------|
| Pentium III | 1999 | SSE    | 128  | 3                          | 1     | 3           |
| Pentium IV  | 2001 | SSE2   | 128  | 4                          | 1     | 4           |
| Core        | 2006 | SSE3   | 128  | 8                          | 2     | 16          |
| Nehalem     | 2008 | SSE4   | 128  | 8                          | 10    | 80          |
| Sandybridge | 2011 | AVX    | 256  | 16                         | 12    | 192         |
| Haswell     | 2013 | AVX2   | 256  | 32                         | 18    | 576         |
| KNC         | 2012 | AVX512 | 512  | 32                         | 64    | 2048        |
| KNL         | 2016 | AVX512 | 512  | 64                         | 72    | 4608        |
| Skylake     | 2017 | AVX512 | 512  | 96                         | 28    | 2688        |
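
The last column of the table is simply the per-core figure multiplied by the core count. The sketch below reproduces that column in Python and then converts one entry to a peak rate. The clock frequency used is a hypothetical value chosen for illustration, not a figure from the table, and the function names are made up for this example.

```python
# (processor, SP FLOPs per core per cycle, cores), copied from the table above.
PROCESSORS = [
    ("Pentium III", 3, 1),
    ("Pentium IV", 4, 1),
    ("Core", 8, 2),
    ("Nehalem", 8, 10),
    ("Sandybridge", 16, 12),
    ("Haswell", 32, 18),
    ("KNC", 32, 64),
    ("KNL", 64, 72),
    ("Skylake", 96, 28),
]


def flops_per_cycle(flops_per_core, cores):
    """Single-socket FLOPs per clock cycle."""
    return flops_per_core * cores


def peak_gflops(flops_per_core, cores, clock_ghz):
    """Theoretical peak in GFLOPS for a given clock (in GHz)."""
    return flops_per_cycle(flops_per_core, cores) * clock_ghz


for name, per_core, cores in PROCESSORS:
    print(f"{name:12s} {flops_per_cycle(per_core, cores):5d} FLOPs/cycle")

# ASSUMPTION: 2.0 GHz is an illustrative clock, not a published specification.
print(f"Skylake at an assumed 2.0 GHz: {peak_gflops(96, 28, 2.0):.0f} GFLOPS peak")
```

Reaching anything close to such a peak requires using both the vector units and every core, which is why the table tracks both.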

### **Putting It All Together**



Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2017 by K. Rupp.

## **Parallel Computing**

One woman can make a baby in 9 months.

Can 9 women make a baby in 1 month?

But 9 women can make 9 babies in 9 months.

# Prototypical Application: Serial Weather Model



# First Parallel Weather Modeling Algorithm: Richardson in 1917



Courtesy John Burkhardt, Virginia Tech

# Weather Model: Shared Memory (OpenMP)



Four meteorologists in the

### V100 GPU and SM



Volta GV100 GPU with 84 Streaming Multiprocessor (SM) units

Volta GV100 SM

# Weather Model: Accelerator (OpenACC)



1 meteorologist coordinating 1000 math savants using tin cans and a string.

# Weather Model: Distributed Memory (MPI)



50 meteorologists using a telegraph.
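
The telegraph picture corresponds to domain decomposition: each meteorologist owns one region of the map and only exchanges the values along its borders. The sketch below simulates that idea in plain Python rather than calling a real MPI library. The "ranks" are just list slices, the "messages" are edge values copied between them, and all function names are hypothetical.

```python
def serial_step(grid):
    """One smoothing step: each interior cell becomes the mean of itself and its neighbors."""
    new = grid[:]
    for i in range(1, len(grid) - 1):
        new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return new


def split(grid, n_ranks):
    """Divide the grid into contiguous chunks, one per simulated rank."""
    size = len(grid) // n_ranks
    chunks = [grid[r * size:(r + 1) * size] for r in range(n_ranks - 1)]
    chunks.append(grid[(n_ranks - 1) * size:])
    return chunks


def distributed_step(chunks):
    """One step in which each rank sees only its chunk plus one halo cell per side."""
    n_ranks = len(chunks)
    new_chunks = []
    for r, local in enumerate(chunks):
        # "Messages": the edge values this rank would receive from its neighbors.
        left_halo = chunks[r - 1][-1] if r > 0 else None
        right_halo = chunks[r + 1][0] if r < n_ranks - 1 else None
        new_local = local[:]
        for i in range(len(local)):
            left = local[i - 1] if i > 0 else left_halo
            right = local[i + 1] if i < len(local) - 1 else right_halo
            if left is None or right is None:
                continue  # cell on the global boundary: left unchanged
            new_local[i] = (left + local[i] + right) / 3.0
        new_chunks.append(new_local)
    return new_chunks


grid = [float(i * i % 7) for i in range(12)]
new_chunks = distributed_step(split(grid, 4))
merged = [value for chunk in new_chunks for value in chunk]
print("matches serial result:", merged == serial_step(grid))
```

In a real distributed-memory code each chunk would live in a different process, and the two halo reads would become explicit sends and receives.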

### The pieces fit like this...



### Cores, Nodes, Processors, PEs?

- A "core" can run an independent thread of code. Hence the temptation to refer to it as a processor.
- "Processor" refers to a physical chip. Today these almost always have more than one core.
- "Node" refers to an actual physical unit with a network connection, usually a circuit board or "blade" in a cabinet. These often have multiple processors.
- To avoid ambiguity, it is precise to refer to the smallest useful computing device as a Processing Element, or PE. On normal processors this corresponds to a core.

I will try to use the term PE consistently myself, but I may slip up. Get used to it as you will quite often hear all of the above terms used interchangeably where they shouldn't be. Context usually makes it clear.

#### **Many Levels and Types of Parallelism**

- Vector (SIMD)
- Instruction Level (ILP)
  - Instruction pipelining
  - Superscalar (multiple instruction units)
  - Out-of-order
  - Register renaming
  - Speculative execution
  - Branch prediction



OpenMP 4/5 can help!



- Multi-Core (Threads)
- SMP/Multi-socket
- Accelerators: GPU & MIC
- Clusters
- MPPs

#### Also Important

- ASIC/FPGA/DSP
- RAID/IO

### MPPs (Massively Parallel Processors)

Distributed memory at largest scale. Shared memory at lower level.

#### **Summit (ORNL)**

- 122 PFlops Rmax and 187 PFlops Rpeak
- IBM Power 9, 22 core, 3GHz CPUs
- 2,282,544 cores
- NVIDIA Volta GPUs
- EDR Infiniband



#### **Sunway TaihuLight (NSC, China)**

- 93 PFlops Rmax and 125 PFlops Rpeak
- Sunway SW26010 260 core, 1.45GHz CPU
- 10,649,600 cores
- Sunway interconnect
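
Rmax is the performance measured on the LINPACK benchmark and Rpeak is the theoretical peak, so their ratio gives a rough sense of how much of a machine the benchmark can actually use. The sketch below computes that ratio in Python from the figures quoted above. The function name `linpack_efficiency` is hypothetical.

```python
# (Rmax, Rpeak) in PFlops, copied from the two system descriptions above.
SYSTEMS = {
    "Summit": (122.0, 187.0),
    "Sunway TaihuLight": (93.0, 125.0),
}


def linpack_efficiency(rmax, rpeak):
    """Fraction of theoretical peak achieved on the benchmark."""
    return rmax / rpeak


for name, (rmax, rpeak) in SYSTEMS.items():
    print(f"{name:18s} {linpack_efficiency(rmax, rpeak):.1%} of peak")
```

Real applications usually see a smaller fraction of peak than LINPACK does, so this ratio is best read as an upper bound on what to expect.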



### Top 10 Systems as of June 2020

| Rank | Site                                                            | Manufacturer | Computer                 | CPU<br>Interconnect<br>[Accelerator]                               | Cores      | Rmax (Tflops) | Rpeak (Tflops) | Power (MW) |
|------|-----------------------------------------------------------------|--------------|--------------------------|--------------------------------------------------------------------|------------|---------------|----------------|------------|
| 1    | RIKEN Center for Computational<br>Science<br>Japan              | Fujitsu      | Fugaku                   | ARM 8.2A+ 48C 2.2GHz<br>Torus Fusion Interconnect                  | 7,299,072  | 415,530       | 513,854        | 28.3       |
| 2    | DOE/SC/ORNL<br>United States                                    | IBM          | Summit                   | Power9 22C 3.0 GHz<br>Dual-rail Infiniband EDR<br>NVIDIA V100      | 2,414,592  | 148,600       | 200,794        | 10.1       |
| 3    | DOE/NNSA/LLNL<br>United States                                  | IBM          | Sierra                   | Power9 3.1 GHz 22C<br>Infiniband EDR<br>NVIDIA V100                | 1,572,480  | 94,640        | 125,712        | 7.4        |
| 4    | National Super Computer Center<br>in Wuxi<br>China              | NRCPC        | Sunway TaihuLight        | Sunway SW26010 260C<br>1.45GHz                                     | 10,649,600 | 93,014        | 125,435        | 15.3       |
| 5    | National Super Computer Center<br>in Guangzhou<br>China         | NUDT         | Tianhe-2<br>(MilkyWay-2) | Intel Xeon E5-2692 2.2 GHz<br>TH Express-2<br>Intel Xeon Phi 31S1P | 4,981,760  | 61,444        | 100,678        | 18.4       |
| 6    | Eni S.p.A<br>Italy                                              | Dell         | HPc5                     | Xeon 24C 2.1 GHz<br>Infiniband HDR<br>NVIDIA V100                  | 669,760    | 35,450        | 51,720         | 2.2        |
| 7    | NVIDIA Corporation<br>United States                             | NVIDIA       | Selene                   | EPYC 64C 2.25GHz<br>Infiniband HDR<br>NVIDIA A100                  | 272,800    | 27,580        | 34,568         | 1.3        |
| 8    | Texas Advanced Computing<br>Center/Univ. of Texas<br>United States | Dell      | Frontera                 | Intel Xeon 8280 28C 2.7 GHz<br>InfiniBand HDR                      | 448,448    | 23,516        | 38,745         |            |
| 9    | Cineca<br>Italy                                                 | IBM          | Marconi100               | Power9 16C 3.0 GHz<br>Infiniband EDR<br>NVIDIA V100                | 347,776    | 21,640        | 29,354         | 1.5        |
| 10   | Swiss National Supercomputing<br>Centre (CSCS)<br>Switzerland   | Cray         | Piz Daint<br>Cray XC50   | Xeon E5-2690 2.6 GHz<br>Aries<br>NVIDIA P100                       | 387,872    | 21,230        | 27,154         | 2.4        |

### **Networks**

### 3 characteristics sum up the network:

#### Latency

The time to send a 0 byte packet of data on the network

#### Bandwidth

The rate at which a very large packet of information can be sent







#### Topology

The configuration of the network that determines how processing units are directly connected.
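
Latency and bandwidth combine into a common first-order cost model for sending a message: time = latency + size / bandwidth. The Python sketch below applies that model. The latency and bandwidth values are hypothetical round numbers chosen for illustration, not measurements of any real network, and the function names are made up for this example.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order estimate of the time to deliver one message."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def half_performance_size(latency_s, bandwidth_bytes_per_s):
    """Message size at which latency and transmission time are equal."""
    return latency_s * bandwidth_bytes_per_s


# ASSUMPTION: illustrative values only (1 microsecond, 10 GB/s).
LATENCY = 1e-6
BANDWIDTH = 10e9

for size in (8, 10_000, 1_000_000_000):
    t = transfer_time(size, LATENCY, BANDWIDTH)
    print(f"{size:>13,d} bytes: {t * 1e6:12.3f} microseconds")

print(f"break-even size: {half_performance_size(LATENCY, BANDWIDTH):,.0f} bytes")
```

Small messages are dominated by latency and large ones by bandwidth, which is why parallel codes tend to batch many small sends into fewer large ones.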

## **Ethernet with Workstations**



# **Complete Connectivity**
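
Complete connectivity wires every pair of nodes directly, so the link count grows as n(n-1)/2. The short Python sketch below makes that growth concrete. The node counts are arbitrary examples and the function name `complete_links` is hypothetical.

```python
def complete_links(n_nodes):
    """Number of point-to-point links when every pair of nodes is wired directly."""
    return n_nodes * (n_nodes - 1) // 2


for n in (4, 16, 1024):
    print(f"{n:5d} nodes -> {complete_links(n):7d} links, {n - 1} ports per node")
```

The quadratic growth in links, and the linear growth in ports per node, is why this topology is only practical at very small scale.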



# Crossbar



# **Binary Tree**



# **Fat Tree**



### **Other Fat Trees**







Atlas @ LLNL





Tsubame @ Tokyo Inst. of Tech

### **Dragonfly**

A newer innovation in network design is the dragonfly topology, which benefits from advanced hardware capabilities like:

- High-Radix Switches
- Adaptive Routing
- Optical Links



Purple links are optical, and blue are electrical.

### **3-D Torus**



Torus simply means that "ends" are connected. This means A is really connected to B and the cube has no real boundary.
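
The wraparound can be expressed with modular arithmetic. The Python sketch below lists the six neighbors of a node in a 3-D torus and counts hops between two nodes. The coordinates and dimensions are arbitrary examples, and the function names are hypothetical.

```python
def torus_neighbors(coord, dims):
    """Return the six directly connected neighbors of a node in a 3-D torus."""
    x, y, z = coord
    nx, ny, nz = dims
    return [
        ((x - 1) % nx, y, z), ((x + 1) % nx, y, z),
        (x, (y - 1) % ny, z), (x, (y + 1) % ny, z),
        (x, y, (z - 1) % nz), (x, y, (z + 1) % nz),
    ]


def torus_hops(a, b, dims):
    """Minimum number of hops between two nodes, taking wraparound links into account."""
    hops = 0
    for ai, bi, n in zip(a, b, dims):
        direct = abs(ai - bi)
        hops += min(direct, n - direct)  # go the short way around the ring
    return hops


dims = (4, 4, 4)
print(torus_neighbors((0, 0, 0), dims))
print("hops from corner to opposite corner:", torus_hops((0, 0, 0), (3, 3, 3), dims))
```

Because the ends are joined, the two "corners" of the cube are only three hops apart rather than nine.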

### Parallel IO (RAID...)

- There are increasing numbers of applications for which many PB of data need to be written.
- Checkpointing is also becoming very important due to MTBF issues (a whole 'nother talk).
- Build a large, fast, reliable filesystem from a collection of smaller drives.
- Supposed to be transparent to the programmer.
- Increasingly mixing in SSD.
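
One way a collection of drives can be made more reliable than any single drive is parity: store the XOR of the data blocks on an extra drive, and any one lost block can be rebuilt from the rest. The Python sketch below illustrates that idea only. It is an assumption for illustration, since the slide does not say which RAID level or filesystem is meant, and the block contents and function name are hypothetical.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


# Three data "drives", each holding one 8-byte block of a striped file.
data_drives = [b"weather ", b"model ou", b"tput 001"]

# A fourth "drive" holds the parity of the three data blocks.
parity = xor_blocks(data_drives)

# Simulate losing drive 1 and rebuilding it from the survivors plus parity.
lost = 1
survivors = [block for i, block in enumerate(data_drives) if i != lost]
recovered = xor_blocks(survivors + [parity])

print("recovered block matches original:", recovered == data_drives[lost])
```

Striping the blocks across drives is what provides the speed, and the parity block is what provides the tolerance to a single failure.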



# The Future Is Now!

**Exascale Computing and you.** 

## Welcome to 2021: the year of Exascale!

exa =  $10^{18}$  = 1,000,000,000,000,000,000 = quintillion 64-bit precision floating point operations per second





Cray Red Storm (2004)

## **Sustaining Performance Improvements**





# **USA: ECP by the Numbers**

- **7 years, \$1.7B:** A seven-year, \$1.7B R&D effort that launched in 2016
- **6 core DOE labs:** Six core DOE National Laboratories: Argonne, Lawrence Berkeley, Lawrence Livermore, Oak Ridge, Sandia, Los Alamos. Staff from most of the 17 DOE national laboratories take part in the project
- **3 focus areas:** Three technical focus areas: Hardware and Integration, Software Technology, Application Development, supported by a Project Management Office
- **100 R&D teams, 1000 researchers:** More than 100 top-notch R&D teams. Hundreds of consequential milestones delivered on schedule and within budget since project inception

## The Plan



| Lab      | Vendor     | System     |
|----------|------------|------------|
| LLNL     | Cray       |            |
| LANL/SNL | TBD        | CROSSROADS |
| LLNL     | IBM/NVIDIA | SIERRA     |
| LANL/SNL | Cray/Intel |            |
| LLNL     | IBM BG/Q   | SEQUOIA    |

# **System Designs**

| System               | Performance | Power    | Interconnect                      | Node                                  |
|----------------------|-------------|----------|-----------------------------------|---------------------------------------|
| Aurora<br>(ANL)      | > 1 EF      |          | 100 GB/s Cray Slingshot Dragonfly | 2 Intel Xeon CPU +<br>6 Intel Xe GPUs |
| El Capitan<br>(LLNL) | > 1.5 EF    | 30-40 MW | 100 GB/s Cray Slingshot Dragonfly | AMD Epyc CPU +<br>4 Radeon GPUs       |
| Frontier<br>(ORNL)   | > 1.5 EF    |          | 100 GB/s Cray Slingshot Dragonfly | AMD Epyc CPU +<br>4 Radeon GPUs       |
| Perlmutter<br>(LBNL) |             |          | Cray Slingshot Dragonfly          | 2 AMD Epyc CPU +<br>4 Volta GPUs      |

## Two Additional Boosts to Improve Flops/Watt and **Reach Exascale Target**

Third Boost: SiPh (2020 – 2024)



First boost: many-core/accelerator



#### It is not just "exaflops" – we are changing the whole computational model

Current programming systems have WRONG optimization targets

#### **Old Constraints**

- Peak clock frequency as primary limiter for performance improvement
- Cost: FLOPs are biggest cost for system: optimize for compute
- Concurrency: Modest growth of parallelism by adding nodes
- Memory scaling: maintain byte per flop capacity and bandwidth
- Locality: MPI+X model (uniform costs within node & between nodes)
- Uniformity: Assume uniform system performance
- Reliability: It's the hardware's problem

#### **New Constraints**

- Power is primary design constraint for future HPC system design
- Cost: Data movement dominates: optimize to minimize data movement
- Concurrency: Exponential growth of parallelism within chips
- **Memory Scaling:** Compute growing 2x faster than capacity or bandwidth
- Locality: must reason about data locality and possibly topology
- Heterogeneity: Architectural and performance non-uniformity increase
- Reliability: Cannot count on hardware protection alone









Fundamentally breaks our current programming paradigm and computing ecosystem

## **End of Moore's Law Will Lead to New Architectures**



## It would only be the 6th paradigm.



## We can do better. We have a role model.

- Straightforward extrapolation results in a real-time human brain scale simulation at about 1 - 10 Exaflop/s with 4 PB of memory
- Current predictions envision Exascale computers in 2022+ with a power consumption of at best 20 - 30 MW
- The human brain takes 20W
- Even under best assumptions in 2020 our brain will still be a million times more power efficient
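
The "million times" figure follows directly from the power numbers in the bullets above. The Python sketch below does the division. It assumes, as the first bullet does, that the exascale machine and the brain are doing comparable work, and the function name `power_ratio` is hypothetical.

```python
BRAIN_WATTS = 20.0  # from the bullet above


def power_ratio(machine_megawatts, brain_watts=BRAIN_WATTS):
    """How many times more power the machine draws than the brain."""
    return machine_megawatts * 1e6 / brain_watts


for megawatts in (20.0, 30.0):
    print(f"{megawatts:.0f} MW machine: {power_ratio(megawatts):,.0f}x the brain's power")
```

At the 20 MW end of the prediction the ratio is a factor of one million, and at 30 MW it is one and a half million.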






in very little time. Performing a billion operations, on the other hand, could take minutes or hours, though it's still possible provided you are patient. Performing a trillion operations, however, will basically take forever. So a fair rule of thumb is that the calculations we can perform on a computer are ones that can be done with *about a billion operations or less*.

Mark Newman


# Where are those 10 or 12 orders of magnitude?

### How do we get there from here?

BTW, that's a bigger gap than



VS.



IBM 709 12 kiloflops

## Why you should be (extra) motivated.

- This parallel computing thing is no fad.
- The laws of physics are drawing this roadmap.
- If you get on board (the right bus), you can ride this trend for a long, exciting trip.

Let's learn how to use these things!

## In Conclusion...

