Tuesday, May 23, 2023

DriveNets designs network fabric for large-scale AI workloads

DriveNets introduced a networking solution designed to maximize the utilization of AI infrastructure and improve the performance of large-scale AI workloads. DriveNets Network Cloud-AI aims for a 30% improvement in job completion time (JCT) for large-scale AI workloads, substantially improving resource utilization, while also supporting standard Ethernet, which allows for vendor interoperability and choice.

"AI compute resources are extremely costly and must be fully utilized to avoid 'idle cycles' as they await networking tasks," said Ido Susan, DriveNets co-founder and CEO. "Leveraging our experience supporting the world's largest networks, we have developed DriveNets Network Cloud-AI. Network Cloud-AI has already achieved up to a 30% reduction in idle time in recent trials, enabling exponentially higher AI throughput compared to a standard Ethernet solution. This reduction also means the network effectively 'pays for itself' through more efficient use of AI resources." 

"Network Cloud-AI provides balanced fabric connectivity between all GPUs in a cluster just as InfiniBand does," said Susan. "The difference is that Network Cloud-AI interfaces with servers on standard Ethernet. InfiniBand uses proprietary equipment which creates vendor lock on the networking and GPU level." 

DriveNets Network Cloud-AI is based on the Open Compute Project's (OCP) Distributed Disaggregated Chassis (DDC) architecture, which is built on a distributed leaf-and-spine model designed to support high-scale service provider networks. Highlights:

  • Scale - connects up to 32,000 GPUs in a single AI cluster, at speeds ranging from 100G to 800G, with perfect load balancing
  • Maximum utilization - equally distributes traffic across the AI network fabric, ensuring maximum network utilization and zero packet loss under the highest loads
  • Shortest JCT - supports congestion-free operations through end-to-end traffic scheduling, avoids flow collisions and jitter, and provides zero-impact failover with sub-10ms automatic path convergence
  • Openness - is an Ethernet-based solution that avoids proprietary approaches and supports vendor interoperability with a variety of white box manufacturers (ODMs), Network Interface Cards (NICs), and AI accelerator ASICs
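
To make the "flow collisions" point in the highlights above concrete, here is a minimal Python sketch comparing two generic load-balancing models: per-flow hashing (as in conventional ECMP) and an idealized even spread of traffic across all fabric links. It is not DriveNets' implementation. The workload, the CRC32 hash, the link count, and the function names are all assumptions chosen for illustration.

```python
import zlib

def ecmp_link_loads(flows, num_links):
    """Per-flow hashing: every byte of a flow is pinned to one link."""
    loads = [0.0] * num_links
    for src, dst, size in flows:
        # Illustrative hash only; real switches hash packet header fields.
        link = zlib.crc32(f"{src}->{dst}".encode()) % num_links
        loads[link] += size
    return loads

def sprayed_link_loads(flows, num_links):
    """Idealized even distribution: total traffic is split equally over all links."""
    total = sum(size for _, _, size in flows)
    return [total / num_links] * num_links

# Hypothetical workload: 16 equal-sized GPU-to-GPU flows over 8 fabric links.
flows = [(f"gpu{i}", f"gpu{(i + 5) % 16}", 100.0) for i in range(16)]
NUM_LINKS = 8

ecmp = ecmp_link_loads(flows, NUM_LINKS)
spray = sprayed_link_loads(flows, NUM_LINKS)

# The busiest link bounds how long the slowest flow takes to finish.
print("Per-flow hashing, busiest link load:", max(ecmp))
print("Even distribution, busiest link load:", max(spray))
```

Under per-flow hashing, two or more flows can land on the same link while other links sit idle, so the busiest link can never carry less than the even-distribution figure and usually carries more. Because a training step waits for its slowest flow, that imbalance is one way network behavior can lengthen job completion time.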

DriveNets said early trials by leading hyperscalers, running Network Cloud-AI on white boxes with Broadcom's Jericho family chipset, achieved up to a 30% improvement in JCT compared to other Ethernet solutions.

"Ethernet has proven time and again to be the best choice for all networking needs by enabling an open, healthy, and competitive ecosystem," said Ram Velaga, senior vice president and general manager, Core Switching Group, Broadcom. "Large-scale training and inference of AI models will benefit from networks that can perform at 100% utilization like DriveNets Network Cloud-AI. Broadcom's Jericho3-AI delivers an Ethernet network with perfect load balancing and end-to-end congestion management, resulting in significant reduction in job completion time compared to any other alternative."

https://drivenets.com/solutions/ai-networking/