# TECHNOLOGICAL UPDATE

**NVIDIA**.

SEPT 2022

# NVIDIA H100

Unprecedented Performance, Scalability, and Security for Every Data Center

HIGHEST AI AND HPC PERFORMANCE

4PF FP8 (6X) | 2PF FP16 (3X) | 1PF TF32 (3X) | 60TF FP64 (3X) | 3TB/s (1.5X), 80GB HBM3 memory

TRANSFORMER MODEL OPTIMIZATIONS

6X faster on largest transformer models

HIGHEST UTILIZATION EFFICIENCY AND SECURITY

7 fully isolated and secured instances, guaranteed QoS | 2<sup>nd</sup> Gen MIG | Confidential Computing

FASTEST, SCALABLE INTERCONNECT

900 GB/s GPU-2-GPU connectivity (1.5X) up to 256 GPUs with NVLink Switch | 128GB/s PCI Gen5



FP8, FP16, and TF32 performance figures include sparsity. X-factors are relative to A100.

### HOPPER TECHNOLOGICAL BREAKTHROUGHS



Dynamic Programming Instructions: HGX H100 4-GPU vs dual-socket 32-core Ice Lake. HGX H100 performance projections.


# **NVIDIA MODULUS**

**Physics Machine Learning Platform** 



INDUSTRIAL HPC NETL: 10,000X faster build of high-fidelity surrogate models

# MILLION-X SPEEDUP FOR INNOVATION AND DISCOVERY












# H100 SUPERCHARGES NVIDIA AI

### NVIDIA Furthers AI Inference Performance Leadership

- H100 tops all data center tests: up to 4.5X higher performance than A100
- 12 NVIDIA Partners Submitted
- A100 still delivers good performance and has improved by 6X since June 2020 thanks to software enhancements




# **GTC SESSION TO FOLLOW**

#### GTC22 Fall

- <u>A Deep Dive into the Latest HPC Software [A41133]</u>
- CUDA: New Features and Beyond [A41100]
- Developing HPC Applications with Standard C++, Fortran, and Python [A41087]

#### GTC22 Spring

- C++ Standard Parallelism [S41960]
- Future of Standard and CUDA C++ [S41961]
- <u>Shifting through the Gears of GPU Programming: Understanding Performance and Portability Trade-offs</u> [S41620]
- From Directives to DO CONCURRENT: A Case Study in Standard Parallelism [S41318]
- Evaluating Your Options for Accelerated Numerical Computing in Pure Python [S41645]
- How to Develop Performance Portable Codes using the Latest Parallel Programming Standards [S41618]

# GENERAL PURPOSE MEETS HIGH PERFORMANCE

GRACE HELPS TO FILL THE GAP OF THE NOT-YET-ACCELERATED FRONTIER

- There are over 1 billion lines of FORTRAN in HPC workloads today.
  - Fortran was first released in 1957.
- The vast majority of applications can be accelerated.
  - ...and a lot of what your customers run already is.
- Most will be accelerated someday.
- Grace provides an NVIDIA solution for every HPC workload <u>today.</u>



# **NVIDIA GRACE PLATFORM**

### Grace Hopper Superchip Giant Scale AI & HPC



For accelerated applications where CPU performance and system memory bandwidth are critical, as AI models keep growing and GPUs keep getting faster

### Grace CPU Superchip CPU Computing



Applications that are not accelerated yet but where absolute performance, energy efficiency, and datacenter density matter, such as in scientific computing, data analytics, and hyperscale computing applications

### **GRACE CPU SUPERCHIP** HIGHER HPC PERFORMANCE AT A FRACTION OF THE POWER




| Specifications | Grace SuperChip                                                                         |
|----------------|-----------------------------------------------------------------------------------------|
| Architecture   | Armv9, SVE2 with 4x 128b pipeline/core                                                  |
| Cores / Speed  | 144 cores                                                                               |
| Memory         | LPDDR5x soldered down, 1TB/s BW<br>Up to 1TB per superchip                              |
| Cache          | L1: 64KB i-cache + 64KB d-cache per core<br>L2: 1MB per core<br>L3: 240MB per superchip |
| Power          | 500W including LPDDR5x memory                                                           |
| Interfaces     | Up to 8x PCIe Gen5 x16 HS interface                                                     |
| Process Node   | TSMC 4N                                                                                 |

### GRACE SIMPLIFIES BUILDING AND RUNNING COMPUTE INFRASTRUCTURE

*[Figure: Grace Superchip diagram. Labels recovered from the image: 144 cores; Grace 0, Grace 1; NUMA 0, NUMA 1; 3GHz; 900 GB/s; 256GB; RAM 128 GB, 128 GB; 500W CPU+MEM; NIC.]*

**GRACE SUPERCHIP** 

2 NUMA

NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.

### X86 2-SOCKET SERVER



# **PROCESSOR SOC/NOC**

Augmented custom ISA logic to support memory movement

- Monolithic SoC
  - Up to 120MB shared L3 cache
  - 3TB/s on-die mesh bisection BW
- Extensive set of Core and un-Core perf counters
- Thermal monitoring and power management
- DVFS support with multiple voltage domains
- Individual core power and clock gating support
- Tx and Rx paths optimized for 400Gbps fabric
- Armv9 ISA virtualization and security support
- Custom SoC-level logic support for GPUDirect, CPU-GPU memory movement, and synchronization





# NVIDIA GRACE VS. FUJITSU A64FX

A64FX is an outlier in every way - your Grace experience will be different

#### MAINSTREAM LEADERSHIP HPC







#### **EXTREME HPC CODESIGN**

### **ARM IN HPC**

A Growing Ecosystem




### THE NVIDIA HPC SOFTWARE PLATFORM

Simplified Status Assessment




### **NVIDIA HPC SDK**

Available at developer.nvidia.com/hpc-sdk, on NGC, via Spack, and in the Cloud



Develop for the NVIDIA Platform: GPU, CPU and Interconnect Libraries | Accelerated C++ and Fortran | Directives | CUDA

### x86\_64 | **AArch64** | OpenPOWER

7-8 Releases Per Year | Freely Available

### ACCELERATED STANDARD LANGUAGES

Parallel performance for wherever your code runs





### 35% BETTER PERFORMANCE ON AWS GRAVITON3

Amazon EC2 C7g instances

"We benchmarked 35% better performance on AWS Graviton3-based C7g instances compared to the previous generation instances. With the support of LS-DYNA on the AWS Graviton3 processor, Ansys customers will get the best of both worlds – access to a world-class multiphysics solver without compromising on speed, and lower energy and costs."

-- Prith Banerjee, Chief Technology Officer - Ansys

# "Porting to Arm is boring."

- Simon McIntosh-Smith ... and many more

### **PORTING TO ARM**

Most HPC applications recompile easily and work "out of the box"



### USE STANDARDS-COMPLIANT MULTI-PLATFORM COMPILERS

You're not porting to Arm. You're porting *away* from ifort, xlf, etc!

- Use any portable multi-platform compiler: NVIDIA, GCC, LLVM, etc.
- Use the most recent compiler possible. If using GCC, version 11+ is strongly recommended.
- Beware of non-standard build systems
  - icc, ifort, xlf, etc. may be hard-coded into the build system
  - Be explicit about which compiler to use. Don't let the build system make assumptions
- Beware of non-standard default compilers
  - Make sure default compiler commands (cc, fc, gcc, etc.) invoke a recent cross-platform compiler
  - Use `mpicc -show` or similar to verify that MPI compiler wrappers are invoking the right compiler
- Log the build, then check the log afterward
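One way to act on "be explicit about which compiler to use" is to have the build record what actually compiled the code. The sketch below is a minimal illustration, not part of the original deck: it reports the compiler family and target architecture from predefined macros. The function names are hypothetical, and the macro names (`__NVCOMPILER`, `__INTEL_LLVM_COMPILER`, `__ibmxl__`, and so on) are taken from vendor documentation as best understood here, so verify them against the compilers you use.

```c
/* Report the architecture the compiler is generating code for. */
const char *target_arch(void) {
#if defined(__aarch64__)
    return "aarch64";
#elif defined(__x86_64__)
    return "x86_64";
#elif defined(__powerpc64__)
    return "ppc64";
#else
    return "unknown";
#endif
}

/* Report the compiler family. Order matters: several compilers also
 * define __GNUC__ for compatibility, so GCC is checked last. */
const char *compiler_family(void) {
#if defined(__NVCOMPILER)
    return "NVIDIA HPC compiler";
#elif defined(__INTEL_COMPILER) || defined(__INTEL_LLVM_COMPILER)
    return "Intel";
#elif defined(__ibmxl__) || defined(__xlC__)
    return "IBM XL";
#elif defined(__clang__)
    return "Clang/LLVM";
#elif defined(__GNUC__)
    return "GCC";
#else
    return "unknown";
#endif
}
```

Printing both strings at program start (or from a small probe program built through the same MPI wrapper) puts the answer in the build or run log, which complements checking `mpicc -show` by hand.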


### **PORTING ASSEMBLY AND X86 VECTOR INTRINSICS**

Translate intrinsics to port functionality, then focus on performance tuning

- For a quick fix, use a drop-in header-based intrinsics translator
  - SIMD Everywhere (SIMDe): <u>https://github.com/simd-everywhere/simde</u>
  - SSE2NEON: <u>https://github.com/DLTcollab/sse2neon</u>
  - Quick tutorial using BWA-MEM2: <u>https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41702/</u>
- Follow Arm's documentation on rewriting x86 vector intrinsics
  - Porting and Optimizing HPC Applications for Arm SVE [https://developer.arm.com/documentation/101726/latest]
  - Coding for NEON [https://developer.arm.com/documentation/101725/0300/Coding-for-Neon]
- Arm assembly is simpler than x86
  - Arm processors have a simpler and more general register set than x86. When porting code, assign a one-to-one mapping from each x86 register to an Arm register.
  - Complex x86 instructions will become multiple Arm instructions
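As an illustration of what a manual intrinsics port looks like, the sketch below keeps an x86 SSE path and an Arm NEON path side by side behind preprocessor guards, with a scalar loop as tail and fallback. The function name `add_f32` is hypothetical; the intrinsics used (`_mm_loadu_ps`, `_mm_add_ps`, `_mm_storeu_ps`, `vld1q_f32`, `vaddq_f32`, `vst1q_f32`) are standard SSE and NEON calls. Header-based translators such as SIMDe or SSE2NEON automate this kind of mapping; this example only shows the idea by hand.

```c
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>   /* Arm NEON intrinsics */
#elif defined(__SSE__)
#include <xmmintrin.h>  /* x86 SSE intrinsics */
#endif

/* c[i] = a[i] + b[i], four floats at a time where a SIMD path exists. */
void add_f32(const float *a, const float *b, float *c, size_t n) {
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);    /* NEON counterpart of _mm_loadu_ps */
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(c + i, vaddq_f32(va, vb));  /* _mm_add_ps then _mm_storeu_ps */
    }
#elif defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
#endif
    for (; i < n; ++i) {  /* scalar tail, and fallback on other targets */
        c[i] = a[i] + b[i];
    }
}
```

Once the port is functionally correct, the scalar loop alone is often worth benchmarking: a recent compiler may auto-vectorize it well enough that the hand-written paths can be dropped.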

### **ARM SIMD PROGRAMMING APPROACHES**

Follow these recommendations in order, e.g. prefer auto-vectorization over intrinsics

- Compilers
  - Auto-vectorization: NVIDIA, GCC, LLVM, ACfL
  - Compiler directives, e.g. OpenMP
    - `#pragma omp parallel for simd`
    - `#pragma vector always`
- Libraries:
  - NVIDIA Math Libraries
  - Arm Performance Library (ArmPL)
  - Open Source Scientific Libraries (BLIS, FFTW, PETSc, etc.)
- Intrinsics (ACLE):
  - Arm C Language Extensions for SVE
  - Arm Scalable Vector Extensions and Application to Machine Learning
- Assembly:
  - SVE ISA Specification: The Scalable Vector Extension for Armv8-A

### ARM SCALABLE VECTOR EXTENSION (SVE)

An ISA feature that Arm partners can implement at any vector length from 128 to 2048 bits, in 128-bit increments

### How SVE works

The hardware sets the vector length



In software, vectors have no length

The *exact same* binary code runs on hardware with different vector lengths

A + B = C
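A vector-length-agnostic loop for A + B = C might look like the sketch below, written with the Arm C Language Extensions (ACLE) for SVE. The function name `vla_add` is hypothetical. No vector width appears in the source: `svcntw()` asks the hardware how many 32-bit lanes it has, and `svwhilelt_b32` builds the predicate for the current block. A scalar fallback is included so the file also builds on targets without SVE.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* c[i] = a[i] + b[i] without hard-coding a vector width. */
void vla_add(const float *a, const float *b, float *c, int64_t n) {
#if defined(__ARM_FEATURE_SVE)
    /* svcntw() returns the number of 32-bit lanes in this hardware's vectors. */
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);  /* active lanes: i + lane < n */
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, c + i, svadd_f32_x(pg, va, vb));
    }
#else
    for (int64_t i = 0; i < n; ++i) {  /* fallback for non-SVE targets */
        c[i] = a[i] + b[i];
    }
#endif
}
```

With GCC or Clang, the SVE path is enabled by a target option such as `-march=armv8-a+sve`. The resulting binary should run unchanged on 128-bit and 512-bit SVE implementations, which is the property the slide describes.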



### SVE improves auto-vectorization



Gather-load and scatter-store



Per-lane predication

`for (i = 0; i < n; ++i)`

| Instruction | Lane 0 | Lane 1 | Lane 2 | Lane 3 |
|-------------|--------|--------|--------|--------|
| INDEX i     | n-2    | n-1    | n      | n+1    |
| CMPLT n     | 1      | 1      | 0      | 0      |

Predicate-driven loop control and management
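The INDEX / CMPLT table can be read as follows: each lane holds a loop index, and the predicate bit for a lane is 1 only while that index is below `n`. The sketch below models this in plain scalar C for an assumed 4-lane vector. `LANES`, `while_lt`, and `predicated_add` are hypothetical names, and the code is a model of the idea, not SVE code.

```c
#define LANES 4  /* illustrative width; real SVE hardware picks its own */

/* Model of INDEX followed by CMPLT: lane k holds i + k, and its predicate
 * bit is 1 while that index is below n. Returns the number of active lanes. */
int while_lt(long i, long n, int pred[LANES]) {
    int active = 0;
    for (int k = 0; k < LANES; ++k) {
        pred[k] = (i + k < n) ? 1 : 0;
        active += pred[k];
    }
    return active;
}

/* Predicate-driven loop: inactive lanes are skipped, so the final partial
 * block needs no separate scalar tail loop. */
void predicated_add(const float *a, const float *b, float *c, long n) {
    int pred[LANES];
    for (long i = 0; while_lt(i, n, pred) > 0; i += LANES) {
        for (int k = 0; k < LANES; ++k) {
            if (pred[k]) {
                c[i + k] = a[i + k] + b[i + k];
            }
        }
    }
}
```

Calling `while_lt(n - 2, n, pred)` reproduces the table row: the first two lanes are active and the last two are masked off.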



Vector partitioning and software-managed speculation



Extended floating-point horizontal reductions
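Two of the loop shapes named on this slide are shown below as plain C, as a rough illustration of code that benefits from these features. The function names are hypothetical. Whether a given compiler actually vectorizes each loop depends on its version and flags, so treat this as a sketch to test with your own toolchain.

```c
#include <stddef.h>

/* Indirect read: x[idx[i]] is a candidate for a gather-load. */
void gather_copy(const float *x, const int *idx, float *y, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = x[idx[i]];
    }
}

/* Sum reduction: the simd reduction clause permits the compiler to
 * combine partial sums across lanes (a horizontal reduction). */
float sum_f32(const float *x, size_t n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; ++i) {
        sum += x[i];
    }
    return sum;
}
```

Compiler vectorization reports (for example `-fopt-info-vec` with GCC or `-Rpass=loop-vectorize` with Clang) show whether each loop was vectorized on a given target.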



### SVE SUPPORT IS MATURE

Arm has been actively posting SVE open source patches upstream since 2016, beginning with the first public announcement of SVE at Hot Chips 2016.

### Available upstream

| Project           | SVE support                                              |
|-------------------|----------------------------------------------------------|
| GNU Binutils 2.28 | Released Feb 2017, includes SVE assembler & disassembler |
| GCC 8             | Full assembly, disassembly and basic auto-vectorization  |
| LLVM 7            | Full assembly, disassembly                               |
| QEMU 3            | User space SVE emulation                                 |
| GDB 8.2           | HPC use cases fully included                             |
| LLVM              | Since Nov 2016, as presented at LLVM conference          |
| Linux kernel      | Since Mar 2017, LWN article on SVE support               |

# **INSIDE GRACE WATCH PARTY**

Ask us for an invite to the "Inside Grace" watch party, organized by Filippo Spiga









