Monday, May 13, 2024

Aurora links 63,744 GPUs with Cray Slingshot Interconnects

The Aurora supercomputer at the U.S. Department of Energy’s Argonne National Laboratory has officially surpassed the exascale threshold, achieving over a quintillion calculations per second, as announced today at the ISC High Performance 2024 conference in Hamburg, Germany.

Built by Intel and Hewlett Packard Enterprise (HPE), Aurora features a groundbreaking architecture that includes 63,744 graphics processing units (GPUs), making it the world's largest GPU-powered system, with more interconnect endpoints than any other system to date.



Aurora Architecture Highlights

Processing Units

  • Intel CPUs: Aurora is equipped with Intel Xeon CPU Max Series processors, next-generation Xeon Scalable parts with on-package high-bandwidth memory.
  • Intel GPUs: The system includes Intel Data Center GPU Max Series accelerators (codenamed Ponte Vecchio), which are designed for high-performance computing (HPC) and artificial intelligence (AI) workloads.
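
Applications typically target these GPUs through Intel's oneAPI/SYCL toolchain. The sketch below is illustrative only and is not Aurora-specific code; the array size and values are placeholders. It shows the basic pattern of offloading a simple vector addition to a GPU queue.

    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
        // Select an available GPU device for the work queue.
        sycl::queue q{sycl::gpu_selector_v};

        const size_t n = 1 << 20;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

        {
            // Buffers let the SYCL runtime manage host/device transfers.
            sycl::buffer<float> A(a.data(), sycl::range<1>(n));
            sycl::buffer<float> B(b.data(), sycl::range<1>(n));
            sycl::buffer<float> C(c.data(), sycl::range<1>(n));

            q.submit([&](sycl::handler& h) {
                sycl::accessor ra(A, h, sycl::read_only);
                sycl::accessor rb(B, h, sycl::read_only);
                sycl::accessor wc(C, h, sycl::write_only);
                // Each work-item computes one element on the GPU.
                h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                    wc[i] = ra[i] + rb[i];
                });
            });
        } // Buffers go out of scope here, copying results back to the host.

        return c[0] == 3.0f ? 0 : 1;
    }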

Performance

  • Exascale Performance: Aurora delivers performance exceeding one exaFLOP (10^18 floating-point operations per second), placing it among the first exascale systems in the world and making it capable of performing more than a quintillion calculations per second.
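
As a rough back-of-the-envelope illustration (counting only the GPUs and assuming the full machine is in use), sustaining 10^18 floating-point operations per second across 63,744 GPUs works out to about 1.6 × 10^13 operations per second, or roughly 16 teraFLOPS, per GPU.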

Memory

  • High-Bandwidth Memory: Aurora incorporates high-bandwidth memory (HBM) for both its CPUs and GPUs, which enhances data transfer rates and overall computational efficiency.
  • Unified Memory Architecture: The system uses a unified memory architecture that allows for seamless data sharing between CPUs and GPUs, reducing latency and improving performance.
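
One common way this kind of shared CPU–GPU address space is expressed in code is SYCL's unified shared memory. The sketch below is a generic illustration of that idea, not a description of Aurora's exact memory mechanism.

    #include <sycl/sycl.hpp>

    int main() {
        sycl::queue q{sycl::gpu_selector_v};

        const size_t n = 1 << 20;
        // malloc_shared returns memory accessible from both host and device,
        // so no explicit copies are required in either direction.
        double* x = sycl::malloc_shared<double>(n, q);
        for (size_t i = 0; i < n; ++i) x[i] = 1.0;

        // The kernel runs on the GPU and writes directly into shared memory.
        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
            x[i] *= 2.0;
        }).wait();

        double first = x[0];   // the host reads the result without a copy
        sycl::free(x, q);
        return first == 2.0 ? 0 : 1;
    }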

Interconnect

  • Cray Slingshot: Aurora uses the HPE Cray Slingshot high-speed interconnect, which offers advanced network capabilities, low latency, and high bandwidth. The Slingshot interconnect is based on Ethernet technology rather than InfiniBand.
  • Per-Link Throughput: Each link in the Slingshot network provides up to 200 gigabits per second (Gb/s) of bandwidth, equivalent to about 25 gigabytes per second. This high per-link throughput ensures rapid data transfer for the vast data sets and intensive computations typical of HPC workloads.
  • Scalability: Slingshot's architecture scales to very large node counts, providing the high aggregate bandwidth needed to support the thousands of nodes in an exascale deployment.
  • Adaptive Routing: The network dynamically selects optimal paths to avoid congestion and improve efficiency.
  • Quality of Service (QoS): Multiple QoS levels allow critical traffic to be prioritized.
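
Traffic on this fabric is typically generated by message-passing collectives spanning many nodes. The sketch below is a generic MPI example, not Aurora-specific code, showing the kind of reduction whose latency and congestion behavior features such as adaptive routing and QoS are meant to improve; the buffer size is a placeholder.

    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Each rank computes a partial sum over its local data.
        std::vector<double> local(1 << 20, 1.0);
        double partial = 0.0;
        for (double v : local) partial += v;

        // MPI_Allreduce combines the partial sums across the interconnect
        // so that every rank receives the global total.
        double total = 0.0;
        MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            std::printf("global sum across %d ranks: %.0f\n", size, total);

        MPI_Finalize();
        return 0;
    }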

Storage

  • Lustre File System: Aurora is expected to use the Lustre parallel file system, providing fast and scalable storage solutions that can handle the immense data throughput generated by exascale computing workloads.
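
Parallel file systems such as Lustre are usually driven through collective I/O, where many ranks write disjoint slices of one shared file. The sketch below is a generic MPI-IO example, not Aurora's I/O stack; the file name "checkpoint.dat" and the buffer size are hypothetical.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Each rank writes its own contiguous slice of a shared file; on a
        // parallel file system the data is striped across storage targets.
        const int count = 1 << 20;
        std::vector<double> data(count, static_cast<double>(rank));

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        MPI_Offset offset = static_cast<MPI_Offset>(rank) * count * sizeof(double);
        MPI_File_write_at_all(fh, offset, data.data(), count, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }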

The installation team, comprising staff from Argonne, Intel, and HPE, is focused on system validation, verification, and scaling up. They are addressing various hardware and software issues as the system approaches full-scale operations.

“Aurora is fundamentally transforming how we do science for our country,” Argonne Laboratory Director Paul Kearns said. “It will accelerate scientific discovery by combining high performance computing and AI to fight climate change, develop life-saving medical treatments, create new materials, understand the universe and so much more.”