Sunday, April 14, 2024

Blueprint: Reimagining Data Center Networks for AI

by Aniket Khosla, Spirent’s VP of Wireline Product Management

By now, you’ve probably seen plenty of examples of artificial intelligence in action, whether it’s creating new code with ChatGPT, enhancing web searches with Google Gemini, or generating full-motion videos on demand. Even so, we’re just scratching the surface of AI’s potential to transform applications and industries. Language processing, image recognition, AI-driven recommendation systems, automation: it’s all incredibly exciting. In the data centers where these AI applications run, however, processing these workloads brings extraordinary new challenges. 

To process evolving AI workloads, AI cluster networks must provide unprecedented throughput, extremely low latency, and support for new traffic patterns such as micro data bursts. Even more challenging, the traditional approach for improving network performance—throwing more hardware at the problem—won’t work. The characteristics of AI workloads and their exponential growth demand an entirely new kind of data center network. 

The industry is still in the process of determining exactly what tomorrow’s AI-optimized networks will look like, and several open questions remain. Among the most significant: the role that Ethernet will play in front- and back-end networks, and the pace at which data center operators will adopt next-generation Ethernet speeds and protocols. 

Why is AI so different and challenging for data center networks? And what do operators and the vendors supporting them need to know to make informed decisions? Let’s take a closer look. 

Inside the Cluster

To understand the impact of AI workloads on data center networks, let’s review what’s happening behind the scenes. There are two basic phases of AI processing:
  • Training involves ingesting vast amounts of data for an AI model to learn from. By analyzing patterns and relationships between inputs and outputs, the model learns to make increasingly accurate predictions.  
  • Inferencing is where AI models put their training into practice. The model receives an input (such as a query from a chatbot user), and then performs some classification or prediction action to generate an output. 
Both phases require large amounts of compute in the form of specialized (and expensive) processing units. These “xPUs” could be CPUs, GPUs, Field Programmable Gate Arrays (FPGAs), or other types of accelerators. For AI inferencing in particular, though, the network connecting all those xPUs (along with servers and storage) must deliver extremely high bandwidth and extremely low latency while ensuring no packet loss. 
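
To make the two phases concrete, here is a deliberately tiny Python sketch of a toy single-parameter model. It is an illustration only; production models have billions of parameters and train across many xPUs. The training loop adjusts the parameter from example data, and inferencing simply applies what was learned to a new input.

```python
# Toy illustration of the two phases of AI processing.
# A single-parameter linear model (y ~ w * x) fit by gradient descent;
# real models have billions of parameters and train across many xPUs.

# --- Training: learn from example (input, output) pairs ---
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]  # (x, y) samples
w = 0.0                 # the model's single parameter, initially untrained
learning_rate = 0.01

for _ in range(500):
    for x, y in data:
        error = w * x - y                 # prediction error on this sample
        w -= learning_rate * error * x    # nudge the parameter to reduce it

# --- Inferencing: apply the trained model to a new input ---
def predict(x):
    return w * x

print(f"learned parameter w = {w:.2f}")        # converges to roughly 2.0
print(f"prediction for x = 5: {predict(5.0):.2f}")
```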

These requirements are already pushing the limits of hyperscale data centers, where the number and variety of AI workloads have skyrocketed. And that’s just today’s AI models—which are growing 1,000x more complex every three years. Operators can’t meet these demands by scaling data center fabrics as they have in the past. They need to fundamentally re-architect them. 

AI Fabric Requirements

Today, operators are investing in massive numbers of xPUs for AI workloads. How many they’ll ultimately require—and the scale of the network connecting them—will depend on future AI applications. But it seems likely they’ll need fabrics that can support tens of thousands of xPUs and models with trillions of dense parameters. 

For AI training, operators should be able to keep pace with workloads through typical refresh cycles, which will push front-end networks to 800 Gbps and beyond over the next several years. Inferencing, however, is another matter. Here, they need scalable back-end infrastructures capable of connecting thousands of xPUs and delivering: 
  • Massive scale: Compared to training, AI inferencing can generate 5x more traffic per accelerator and require 5x higher network bandwidth. Back-end networks must also be able to support thousands of synchronized jobs in parallel and more data- and compute-intensive workloads.  
  • Extremely low latency: For real-time AI inferencing (from an AI vision application flagging a production flaw to an AI chatbot responding to a customer), every millisecond counts. These workloads must progress through large numbers of nodes, and any delay in this flow can impede scalability or cause timeouts. Depending on the application, such delays could result in poor user experiences, expensive manufacturing errors, or worse. 
  • No packet loss: AI applications running in data centers are highly sensitive to packet loss, which increases latency and makes the network less predictable. AI cluster fabrics therefore must be lossless. 
These back-end network requirements are already affecting AI user experiences. According to Meta, roughly a third of elapsed time for AI workloads is spent waiting on the network.  
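
To see why the fabric so often becomes the limiting factor, consider a rough back-of-envelope sketch. Every figure in it is an illustrative assumption rather than a measurement, and it models a ring-style all-reduce, in which each of N accelerators moves roughly 2 × (N − 1)/N of the payload per synchronization step and no accelerator can proceed until the slowest exchange completes.

```python
# Rough, illustrative sizing of per-accelerator traffic for one synchronized
# exchange (ring all-reduce). Every number here is an assumption chosen for
# illustration, not a measurement from any operator.

NUM_ACCELERATORS = 1024        # xPUs participating in a single job
PAYLOAD_BYTES = 10e9           # bytes each accelerator must share per step
LINK_GBPS = 400                # per-accelerator network link speed

# In a ring all-reduce, each participant sends (and receives) roughly
# 2 * (N - 1) / N times the payload per synchronization step.
bytes_on_wire = 2 * (NUM_ACCELERATORS - 1) / NUM_ACCELERATORS * PAYLOAD_BYTES

link_bytes_per_second = LINK_GBPS * 1e9 / 8
ideal_seconds = bytes_on_wire / link_bytes_per_second

print(f"traffic per accelerator per step: {bytes_on_wire / 1e9:.1f} GB")
print(f"best-case network time per step at {LINK_GBPS}G: {ideal_seconds * 1e3:.0f} ms")
# Every accelerator waits for this exchange before the next step can start,
# so congestion, loss, or jitter on any one link stalls the whole job.
```

Under these assumptions, the exchange alone takes hundreds of milliseconds per step. Doubling the link speed to 800G halves that network floor, which is part of why back-end port speeds are moving so quickly.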
 

Evolving Interfaces

Given these extreme demands, how should data center operators and the vendors supporting them respond? What’s the best infrastructure approach to sustain both current AI workloads and those on the horizon? This is still an open question. 

Even in new back-end data center infrastructures, we see significant variation, with Google, Microsoft, and Amazon all taking different paths. Based on what operators are saying, however, and the interfaces they’re investing in, we’re starting to get a clearer picture. 

In front-end networks used for data ingestion, Dell’Oro Group forecasts that by 2027, one third of all Ethernet ports will be 800 Gbps or higher. In back-end networks, where operators need higher throughput and lower latency immediately, things are moving more quickly. Nearly all back-end ports will be 800G and above by 2027, with bandwidth growing at triple-digit rates.

While most operators will continue using Ethernet in front-end networks, back-end infrastructures will vary. Depending on the AI applications they’re supporting, some operators will want a lossless technology like InfiniBand. Others will prefer the familiarity and economics of standardized Ethernet paired with a technology like RoCEv2 (RDMA over Converged Ethernet, version 2), which facilitates lossless, low-latency flows. Still others will use both InfiniBand and Ethernet. 
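
Part of what makes lossless operation so valuable is that a synchronized job runs at the pace of its slowest flow. The toy model below is a simplified illustration, not a RoCEv2 or InfiniBand simulator; it assumes each job step waits on N parallel flows, that a packet drop on a lossy fabric costs a fixed retransmission delay, and then compares average step times against a fabric that avoids drops through flow control.

```python
import random

# Toy model: a synchronized job step waits for N parallel flows to finish.
# On a lossy fabric, a flow that drops a packet pays a retransmission delay;
# a lossless fabric (e.g. Ethernet with PFC/ECN under RoCEv2, or InfiniBand)
# avoids that penalty. All parameters are illustrative assumptions.

NUM_FLOWS = 512            # parallel flows per synchronization step
BASE_MS = 10.0             # nominal flow completion time
RETX_PENALTY_MS = 50.0     # added delay if a flow hits a packet drop
LOSS_PROB = 0.01           # chance a given flow is hit by a drop (lossy fabric)
STEPS = 1000

def step_time(loss_prob):
    # The step finishes only when the slowest flow finishes.
    return max(
        BASE_MS + (RETX_PENALTY_MS if random.random() < loss_prob else 0.0)
        for _ in range(NUM_FLOWS)
    )

random.seed(0)
lossy = sum(step_time(LOSS_PROB) for _ in range(STEPS)) / STEPS
lossless = sum(step_time(0.0) for _ in range(STEPS)) / STEPS

print(f"avg step time, lossy fabric:    {lossy:.1f} ms")
print(f"avg step time, lossless fabric: {lossless:.1f} ms")
# With 512 flows and a 1% per-flow drop chance, almost every step has at
# least one delayed flow, so the whole job slows toward the penalty case.
```

With hundreds of flows per step, even a small per-flow drop probability means nearly every step pays the penalty, which is why flow control and congestion control matter as much as raw link speed, whichever transport an operator chooses.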

For now, there is no single “right” answer. Apart from considerations like price and deployment size, data center operators will need to weigh multiple factors based on the AI workloads they expect to support, including: 
  • Bandwidth and latency requirements
  • Whether model training will be performed in-house or outsourced
  • Standardized versus proprietary technologies 
  • Comfort with future roadmaps for technologies under consideration 

Looking Ahead

Despite the uncertainty, vendors developing connectivity solutions for AI clusters have little choice but to push ahead on accelerated timelines; customer need is simply too great. Even as 400G Ethernet deployments grow, vendors are manufacturing 800G chipsets as quickly as possible, and work on the 1.6-Tbps Ethernet standard is progressing. 

In the meantime, rigorous testing will become even more important—and demand new test and emulation tools designed for the speeds and scale of AI infrastructures. Vendors will need the ability to validate new Ethernet products, perform high-speed timing and synchronization testing, assure interoperability of multi-vendor components, and more. Given the exorbitant costs of building AI clusters for lab environments, vendors will also need tools to emulate cluster behavior with lifelike accuracy. 
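
As a greatly simplified illustration of the kind of timing measurement involved, the sketch below uses only Python’s standard library to measure round-trip latency against a local UDP echo server and report median and tail values. It is a toy, not a test tool: real AI-fabric validation relies on purpose-built platforms that generate line-rate traffic and timestamp packets with far greater precision.

```python
import socket
import statistics
import threading
import time

# Minimal round-trip latency measurement against a local UDP echo server.
# Purely illustrative: real AI-fabric testing uses dedicated hardware that
# drives line-rate traffic and timestamps packets at much finer precision.

def echo_server(sock):
    while True:
        data, peer = sock.recvfrom(2048)
        if data == b"stop":
            break
        sock.sendto(data, peer)          # echo the probe straight back

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))            # OS picks a free port
addr = server.getsockname()
threading.Thread(target=echo_server, args=(server,), daemon=True).start()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.settimeout(1.0)

samples_us = []
for i in range(1000):
    payload = f"probe-{i}".encode()
    start = time.perf_counter()
    client.sendto(payload, addr)
    client.recvfrom(2048)                # wait for the echo
    samples_us.append((time.perf_counter() - start) * 1e6)

client.sendto(b"stop", addr)
samples_us.sort()
print(f"min RTT:    {samples_us[0]:.1f} us")
print(f"median RTT: {statistics.median(samples_us):.1f} us")
print(f"p99 RTT:    {samples_us[int(len(samples_us) * 0.99)]:.1f} us")
```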

The good news is that testing solutions are evolving as quickly as data center networks themselves. As new questions arise about the future of AI networking, the industry will be ready to answer them.