System Architecture

NRP is a heterogeneous, nationally distributed, open system that features CPUs, FP32- and FP64-optimized GPUs, and FPGAs, arranged into two types of subsystems: a "high-performance" subsystem at SDSC and two "FP32-optimized" subsystems at UNL and MGHPCC. Together, these subsystems are specialized for a wide range of data science, simulation, and machine learning/artificial intelligence workloads, with data access provided through a federated, national-scale content delivery network (CDN). The NRP HPC subsystem features a novel, extremely low-latency fabric from GigaIO that allows dynamic composition of hardware, including FPGAs, GPUs, and NVMe storage. Each of the three sites (SDSC, UNL, and MGHPCC) includes ~1 PB of usable disk space. The three storage systems function as data origins of the CDN, which will provide data access anywhere in the country within an RTT of ~10 ms via network caches. NRP's data infrastructure supports a national "Bring Your Own Resource" (BYOR) program through which campuses can add compute, data, and storage resources to NRP. The system can also be scaled out via "Bring Your Own Device" (BYOD) programs.
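
The CDN behavior described above (data origins at the three sites plus network caches with a target of roughly 10 ms RTT) can be illustrated with a short sketch. This is not NRP's actual client tooling; it simply shows one way a client could prefer the lowest-latency cache and fall back to an origin, and all hostnames below are hypothetical placeholders.

```python
# Sketch only: choose a CDN cache by measured RTT, falling back to the origin.
# Hostnames are hypothetical placeholders, not real NRP endpoints.
import socket
import time

TARGET_RTT_S = 0.010  # ~10 ms target described for the CDN


def measure_rtt(host: str, port: int = 443, timeout: float = 1.0) -> float:
    """Approximate RTT as the time to complete a TCP connection."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        return time.monotonic() - start


def pick_endpoint(caches: list[str], origin: str) -> str:
    """Return the fastest reachable cache if it meets the latency target, else the origin."""
    best_host, best_rtt = origin, float("inf")
    for host in caches:
        try:
            rtt = measure_rtt(host)
        except OSError:
            continue  # unreachable cache; try the next one
        if rtt < best_rtt:
            best_host, best_rtt = host, rtt
    return best_host if best_rtt <= TARGET_RTT_S else origin


if __name__ == "__main__":
    caches = ["cache-west.example.org", "cache-central.example.org"]  # hypothetical
    origin = "origin-sdsc.example.org"                                # hypothetical
    print(pick_endpoint(caches, origin))
```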

NRP HPC Subsystem (located at SDSC)

NVIDIA A100 HGX GPU Servers
  HGX A100 servers: 8
  NVIDIA GPUs per server: 8
  HBM2 memory per GPU: 80 GB
  Host CPUs (2 per server): AMD EPYC 7742
  Host CPU memory (per server): 512 GB @ 3200 MHz
  FabreX Gen4 network adapters (per server): 8
  Solid-state disks (2 per server): 1 TB
Xilinx Alveo FPGA Servers
  GigaIO Gen4 pooling appliances: 4
  FPGAs per appliance: 8
  FPGA model: Alveo U55C
High Core-Count CPU Servers
  Number of servers: 2
  Processors (2 per server): AMD EPYC 7742
  Memory (per server): 1 TB @ 3200 MHz
  FabreX network adapter (per server): 1
  Mellanox ConnectX-6 network adapter (per server): 1
Low Core-Count CPU Servers
  Number of servers: 2
  Processors (2 per server): AMD EPYC 7F72
  Memory (per server): 1 TB @ 3200 MHz
  FabreX network adapter (per server): 1
  Mellanox ConnectX-6 network adapter (per server): 1
Network Infrastructure
  GigaIO FabreX 24-port Gen4 PCIe switches: 18
  GigaIO FabreX network adapters: 36
  Mellanox ConnectX-6 network adapters: 10
FabreX-Connected NVMe Resource
  GigaIO Gen3 NVMe pooling appliances: 4
  Capacity per NVMe resource: 122 TB
Ancillary Systems
  Home file system: 1.6 PB
  Service nodes: 2
  Data caches: 8 × 50 TB each
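
For a quick sense of scale, the per-component counts above can be rolled up into aggregate figures. The sketch below is illustrative arithmetic only, derived directly from the numbers listed for the SDSC subsystem (and it assumes each of the 4 NVMe pooling appliances provides the 122 TB listed per resource).

```python
# Back-of-the-envelope totals for the SDSC HPC subsystem, taken from the table above.
hgx_servers = 8
gpus_per_server = 8
hbm_per_gpu_gb = 80

fpga_appliances = 4
fpgas_per_appliance = 8

nvme_appliances = 4
nvme_capacity_tb = 122   # assumed per appliance

data_caches = 8
cache_capacity_tb = 50

total_gpus = hgx_servers * gpus_per_server          # 64 A100 GPUs
total_hbm_tb = total_gpus * hbm_per_gpu_gb / 1000   # ~5.1 TB of HBM2
total_fpgas = fpga_appliances * fpgas_per_appliance  # 32 Alveo U55C FPGAs
total_nvme_tb = nvme_appliances * nvme_capacity_tb   # 488 TB of composable NVMe
total_cache_tb = data_caches * cache_capacity_tb     # 400 TB of data cache

print(f"{total_gpus} A100 GPUs, {total_hbm_tb:.1f} TB HBM2, "
      f"{total_fpgas} FPGAs, {total_nvme_tb} TB NVMe, {total_cache_tb} TB cache")
```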

NRP FP32 Subsystem (1 each at UNL and MGHPCC)

NVIDIA A10 GPU Servers
  GPU servers: 18
  NVIDIA A10 GPUs per server: 8
  Host CPUs (2 per server): AMD EPYC 7502
  Host CPU memory: 512 GB @ 3200 MHz
  Node-local NVMe: 8 TB
  Network adapters: 1 × 1 Gbps; 2 × 10 Gbps
Ancillary Systems
  Service nodes (2 per site): AMD EPYC 7402P
  Home file system: 1.6 PB
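
The same kind of back-of-the-envelope arithmetic applies to each FP32-optimized subsystem; the sketch below derives per-site and combined A10 GPU counts and node-local NVMe capacity from the table above.

```python
# Illustrative totals for the FP32-optimized subsystems at UNL and MGHPCC.
sites = 2               # UNL and MGHPCC
servers_per_site = 18
a10_per_server = 8
nvme_per_server_tb = 8

a10_per_site = servers_per_site * a10_per_server      # 144 A10 GPUs per site
a10_total = sites * a10_per_site                      # 288 A10 GPUs across both sites
nvme_per_site_tb = servers_per_site * nvme_per_server_tb  # 144 TB node-local NVMe per site

print(f"{a10_per_site} A10 GPUs/site, {a10_total} total, {nvme_per_site_tb} TB NVMe/site")
```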