Wednesday, May 29, 2024

Arista to align compute and network domains as a single managed AI entity

Arista Networks, in collaboration with NVIDIA, hosted a technology demonstration showcasing AI Data Centers that integrate compute and network domains into a single managed AI entity. This initiative aims to help customers configure, manage, and monitor AI clusters uniformly across key components, including networks, NICs, and servers. By demonstrating this unified approach, Arista and NVIDIA highlight the potential for a multi-vendor, interoperable ecosystem that enables tighter control and coordination between AI networking and compute infrastructure.

The technology demonstration introduced an Arista EOS-based remote AI agent that enables the combined AI cluster to be managed as a single solution. With EOS running on the network, the remote AI agent extends its reach to servers and SuperNICs, enabling real-time tracking and reporting of performance issues between hosts and the network. This integration ensures that performance degradation or failures can be quickly isolated and mitigated, optimizing end-to-end quality of service (QoS) within the AI Data Center.
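
As a rough illustration of the host-side half of that idea, the sketch below polls standard Linux NIC counters and flags error or drop growth between samples. It is not Arista's EOS agent or any NVIDIA SuperNIC API; the interface name and the plain-text report are assumptions made for the example.

    # Illustrative only: a minimal host-side telemetry loop, not Arista's EOS agent.
    # It polls standard Linux NIC counters and flags error/drop growth between samples.
    import json, time
    from pathlib import Path

    IFACE = "eth0"          # hypothetical host/SuperNIC interface name
    STATS = Path(f"/sys/class/net/{IFACE}/statistics")
    WATCHED = ["rx_dropped", "tx_dropped", "rx_errors", "tx_errors"]

    def read_counters():
        # Each counter is exposed as a small text file containing an integer.
        return {name: int((STATS / name).read_text()) for name in WATCHED}

    def main(poll_seconds=5):
        prev = read_counters()
        while True:
            time.sleep(poll_seconds)
            cur = read_counters()
            delta = {k: cur[k] - prev[k] for k in WATCHED if cur[k] > prev[k]}
            if delta:
                # In a real deployment this report would flow to the network-side
                # management plane; here it is simply printed as JSON.
                print(json.dumps({"iface": IFACE, "deltas": delta, "ts": time.time()}))
            prev = cur

    if __name__ == "__main__":
        main()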

As AI clusters and large language models (LLMs) grow in complexity and size, the need for uniform controls across AI servers and network switches becomes critical. The demonstration addressed the challenges of managing disparate components such as GPUs, NICs, switches, optics, and cables. By providing a single point of control and visibility, the Arista EOS-based solution helps prevent misconfigurations and misalignments that can adversely affect job completion times. Additionally, the coordinated management and monitoring of compute and network resources ensure efficient congestion management, minimizing packet drops and optimizing GPU utilization.
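
To illustrate why a single point of control and visibility matters, the toy check below compares a switch port profile against the attached host NIC profile and reports misalignments (for example, a mismatched MTU or ECN setting) before a job runs. The field names and values are assumptions made for the sketch, not an Arista or NVIDIA schema.

    # Illustrative only: a toy consistency check between a switch port profile and
    # the attached host NIC profile. The field names and values are assumptions,
    # not an Arista or NVIDIA schema; the point is catching misalignments early.
    from typing import Dict, List

    def find_mismatches(switch_port: Dict, host_nic: Dict) -> List[str]:
        issues = []
        for key in ("mtu", "ecn_enabled", "pfc_priorities"):
            if switch_port.get(key) != host_nic.get(key):
                issues.append(
                    f"{key}: switch={switch_port.get(key)!r} vs host={host_nic.get(key)!r}"
                )
        return issues

    # Hypothetical profiles for one switch port and its attached SuperNIC.
    switch_port = {"mtu": 9214, "ecn_enabled": True, "pfc_priorities": [3]}
    host_nic    = {"mtu": 1500, "ecn_enabled": True, "pfc_priorities": [3]}

    for issue in find_mismatches(switch_port, host_nic):
        print("misalignment:", issue)   # e.g. mtu: switch=9214 vs host=1500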

Highlights of the demo

  • Collaboration between Arista Networks and NVIDIA for AI Data Centers.
  • Unified management of AI clusters across networks, NICs, and servers.
  • Demonstration of a multi-vendor, interoperable ecosystem.
  • Introduction of an Arista EOS-based remote AI agent.
  • Real-time tracking and reporting of performance issues.
  • Optimization of end-to-end QoS within the AI Data Center.
  • Single point of control and visibility for AI clusters.
  • Efficient congestion management and optimization of GPU utilization.

“Arista aims to improve efficiency of communication between the discovered network and GPU topology to improve job completion times through coordinated orchestration, configuration, validation, and monitoring of NVIDIA accelerated compute, NVIDIA SuperNICs, and Arista network infrastructure,” said John McCool, Chief Platform Officer for Arista Networks.

