by Aniket Khosla, Spirent’s VP of Wireline Product Management
By now, you’ve probably seen plenty of examples of artificial intelligence in action, whether it’s generating new code with ChatGPT, enhancing web search with Google Gemini, or creating full-motion videos on demand. Even so, we’re just scratching the surface of AI’s potential to transform applications and industries. Language processing, image recognition, AI-driven recommendation systems, automation: it’s all incredibly exciting. In the data centers where these AI applications run, however, processing these workloads brings extraordinary new challenges.
To process evolving AI workloads, AI cluster networks must provide unprecedented throughput, extremely low latency, and support for new traffic patterns such as micro data bursts. Even more challenging, the traditional approach for improving network performance—throwing more hardware at the problem—won’t work. The characteristics of AI workloads and their exponential growth demand an entirely new kind of data center network.
The industry is still in the process of determining exactly what tomorrow’s AI-optimized networks will look like, and several open questions remain. Among the most significant: the role that Ethernet will play in front- and back-end networks, and the pace at which data center operators will adopt next-generation Ethernet speeds and protocols.
Why is AI so different and challenging for data center networks? And what do operators and vendors supporting them need to know to make informed decisions? Let’s take a closer look.
Inside the Cluster
To understand the impact of AI workloads on data center networks, let’s review what’s happening behind the scenes. There are two basic phases of AI processing:
- Training involves ingesting vast amounts of data for an AI model to train on. By analyzing patterns and relationships between inputs and outputs, the model learns how to make increasingly accurate predictions.
- Inferencing is where AI models put their training into practice. The model receives an input (such as a query from a chatbot user) and then performs a classification or prediction to generate an output. (A simplified sketch of both phases follows this list.)
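To make the two phases concrete, here is a deliberately minimal Python sketch. It is not tied to any particular framework or to the cluster-scale workloads discussed in this article; it simply trains a one-parameter model by gradient descent and then runs inference on a new input.

```python
# Minimal illustration of the two phases of AI processing.
# Training fits a single weight to example data; inferencing applies
# the trained weight to a new input. Real AI models do the same thing
# across billions of parameters spread over thousands of xPUs.

training_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, expected output)

# --- Training phase: learn patterns from the data ---
weight = 0.0
learning_rate = 0.01
for epoch in range(200):
    for x, y_true in training_data:
        y_pred = weight * x                    # forward pass
        gradient = 2 * (y_pred - y_true) * x   # derivative of squared error
        weight -= learning_rate * gradient     # update the model

# --- Inferencing phase: apply the trained model to a new input ---
new_input = 5.0
prediction = weight * new_input
print(f"learned weight ~ {weight:.3f}, prediction for {new_input}: {prediction:.3f}")
```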
These workloads are already pushing the limits of hyperscale data centers, where the number and variety of AI jobs have skyrocketed. And that’s just today’s AI models, which are growing 1,000x more complex every three years. Operators can’t meet these demands by scaling data center fabrics as they have in the past. They need to fundamentally re-architect them.
AI Fabric Requirements
Today, operators are investing in massive numbers of xPUs for AI workloads. How many they’ll ultimately require—and the scale of the network connecting them—will depend on future AI applications. But needing a fabric that can support tens of thousands of xPUs and trillions of dense parameters seems likely.
For AI training, operators should be able to keep pace with workloads through typical refresh cycles, which will push front-end networks to 800 Gbps and beyond over the next several years. Inferencing, however, is another matter. Here, they need scalable back-end infrastructures capable of connecting thousands of xPUs and delivering:
- Massive scale: Compared to training, AI inferencing can generate 5x more traffic per accelerator and require 5x higher network bandwidth. Back-end networks must also be able to support thousands of synchronized jobs in parallel and more data- and compute-intensive workloads.
- Extremely low latency: For real-time AI inferencing (from an AI vision application flagging a production flaw to an AI chatbot responding to a customer), every millisecond counts. These workloads must progress through large numbers of nodes, and any delay in this flow can impede scalability or cause timeouts. Depending on the application, such delays could result in poor user experiences, expensive manufacturing errors, or worse.
- No packet loss: AI applications running in data centers are highly sensitive to packet loss, which increases latency and makes the network less predictable. AI cluster fabrics therefore must be lossless. (A back-of-the-envelope sketch of these demands follows this list.)
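The Python sketch below puts rough numbers on why scale and latency dominate back-end design. Every figure (cluster size, link rate, data volume, hop count, per-hop delay) is an illustrative assumption, not a measurement or a vendor specification.

```python
# Back-of-the-envelope arithmetic for an AI back-end fabric.
# All figures are illustrative assumptions, not measured values.

XPUS              = 32_000   # accelerators in the cluster (assumed)
LINK_RATE_GBPS    = 800      # per-xPU network interface speed (assumed)
DATA_PER_XPU_GB   = 40       # data each xPU exchanges per synchronized step (assumed)
HOPS              = 5        # switch hops a flow traverses (assumed)
PER_HOP_DELAY_US  = 10       # queuing/serialization delay per hop (assumed)

# Aggregate fabric demand if every xPU drives its link at line rate
aggregate_tbps = XPUS * LINK_RATE_GBPS / 1_000
print(f"Aggregate demand: {aggregate_tbps:,.0f} Tbps")

# Time for one xPU to move its share of data over its own link
transfer_ms = (DATA_PER_XPU_GB * 8) / LINK_RATE_GBPS * 1_000
print(f"Per-xPU transfer time at line rate: {transfer_ms:.0f} ms")

# Added delay from traversing the fabric; in a synchronized job,
# the slowest path sets the pace for every other node
path_delay_ms = HOPS * PER_HOP_DELAY_US / 1_000
print(f"Per-hop delays add ~{path_delay_ms:.2f} ms to every synchronized step")
```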
Evolving Interfaces
Given these extreme demands, how should data center operators and the vendors supporting them respond? What’s the best infrastructure approach to sustain both current AI workloads and those on the horizon? This is still an open question.
Even in new back-end data center infrastructures, we see significant variation, with Google, Microsoft, and Amazon all taking different paths. Based on what operators are saying, however, and the interfaces they’re investing in, we’re starting to get a clearer picture.
In front-end networks used for data ingestion, Dell’Oro Group forecasts that by 2027, one third of all Ethernet ports will be 800 Gbps or higher. In back-end networks, where operators need higher throughput and lower latency immediately, things are moving more quickly. Nearly all back-end ports will be 800G and above by 2027, with bandwidth growing at a triple-digit rate.
While most operators will continue using Ethernet in front-end networks, back-end infrastructures will vary. Depending on the AI applications they’re supporting, some operators will want a lossless technology like InfiniBand. Others will prefer the familiarity and economics of standardized Ethernet in conjunction with technology like the RoCEv2 (RDMA over Converged Ethernet, version 2) protocol, which facilitates lossless, low-latency flows. Still others will use both InfiniBand and Ethernet.
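For readers curious about what “lossless” Ethernet entails in practice, RoCEv2 deployments typically rely on mechanisms such as Priority Flow Control, which pause a sender before switch buffers overflow. The sketch below is a simplified, illustrative estimate of the headroom buffer a switch port needs to absorb data still in flight after a pause frame is sent; the link rate, cable length, MTU, and response-time figures are assumptions, not vendor guidance.

```python
# Simplified headroom estimate for a lossless (PFC-based) Ethernet port.
# After a pause frame is sent, data already on the wire and packets
# mid-transmission still arrive and must be buffered without loss.
# All figures are illustrative assumptions.

LINK_RATE_GBPS   = 800     # port speed
CABLE_LENGTH_M   = 100     # assumed cable run inside the data center
PROPAGATION_M_US = 200     # roughly 200 meters per microsecond in fiber
MTU_BYTES        = 9216    # jumbo frame size (assumed)
RESPONSE_TIME_US = 1.0     # assumed time to process and react to the pause

# Round-trip time on the cable plus time to react to the pause frame
round_trip_us = 2 * (CABLE_LENGTH_M / PROPAGATION_M_US) + RESPONSE_TIME_US

# Bytes that arrive during that window at full line rate
bytes_in_flight = LINK_RATE_GBPS * 1e9 / 8 * round_trip_us * 1e-6

# Worst case, a maximum-size frame is mid-transmission at each end
headroom_bytes = bytes_in_flight + 2 * MTU_BYTES
print(f"Headroom needed: ~{headroom_bytes / 1024:.0f} KiB per lossless priority")
```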
For now, there is no single “right” answer. Apart from considerations like price and deployment size, data center operators will need to weigh multiple factors based on the AI workloads they expect to support, including:
- Bandwidth and latency requirements
- Whether model training will be performed in-house or outsourced
- Standardized versus proprietary technologies
- Comfort with future roadmaps for technologies under consideration
Looking Ahead
Despite the uncertainty, vendors developing connectivity solutions for AI clusters have little choice but to push ahead under accelerated timelines. The customer’s need is just too great. Even as 400G Ethernet deployments grow, vendors are manufacturing 800G chipsets as quickly as possible, and work on the 1.6-Tbps Ethernet standard is progressing.
In the meantime, rigorous testing will become even more important—and demand new test and emulation tools designed for the speeds and scale of AI infrastructures. Vendors will need the ability to validate new Ethernet products, perform high-speed timing and synchronization testing, assure interoperability of multi-vendor components, and more. Given the exorbitant costs of building AI clusters for lab environments, vendors will also need tools to emulate cluster behavior with lifelike accuracy.
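As one illustration of why lifelike emulation matters, the toy Python model below (not a Spirent product or API, and built entirely on assumed parameters) generates the kind of synchronized micro-burst traffic pattern AI training jobs produce: long quiet periods punctuated by every sender transmitting at once, which stresses a fabric very differently than a smooth, constant-rate load.

```python
# Toy generator for a synchronized micro-burst traffic pattern of the kind
# an AI cluster emulator must reproduce. All parameters are illustrative.
import random

SENDERS          = 1_000   # emulated xPUs that transmit in lockstep
STEP_INTERVAL_MS = 50.0    # compute time between synchronized exchanges
BURST_MS         = 2.0     # how long each exchange lasts
JITTER_MS        = 0.1     # small skew between senders
STEPS            = 5       # number of training steps to emulate

events = []
for step in range(STEPS):
    step_start = step * STEP_INTERVAL_MS
    for sender in range(SENDERS):
        start = step_start + random.uniform(0, JITTER_MS)
        events.append((start, start + BURST_MS, sender))

# During a burst, nearly every sender is active at once; between bursts,
# the fabric is almost idle -- a pattern that stresses buffers and latency.
for t in (1.0, 25.0, 51.0, 75.0):   # probe times in milliseconds
    active = sum(1 for start, end, _ in events if start <= t < end)
    print(f"t={t:6.1f} ms: {active:4d} of {SENDERS} senders transmitting")
```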
The good news is that testing solutions are evolving as quickly as data center networks themselves. As new questions arise about the future of AI networking, the industry will be ready to answer them.