by Aniket Khosla, Spirent’s VP of Wireline Product Management
To process evolving AI workloads, AI cluster networks must provide unprecedented throughput, extremely low latency, and support for new traffic patterns such as micro data bursts. Even more challenging, the traditional approach for improving network performance—throwing more hardware at the problem—won’t work. The characteristics of AI workloads and their exponential growth demand an entirely new kind of data center network.
Inside the Cluster
AI clusters handle two fundamentally different kinds of workloads:
- Training involves feeding vast amounts of data to an AI model. By analyzing patterns and relationships between inputs and outputs, the model learns to make increasingly accurate predictions.
- Inferencing is where AI models put their training into practice. The model receives an input (such as a query from a chatbot user) and performs a classification or prediction to generate an output, as the brief sketch below illustrates.
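To make the two workloads concrete, here is a minimal sketch. It uses PyTorch purely for illustration; the article does not prescribe any particular framework, and the toy model and data are invented for the example.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 2)                 # a toy "model"
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# --- Training: ingest data, compare predictions to labels, adjust weights ---
inputs = torch.randn(32, 16)             # a batch of training data
labels = torch.randint(0, 2, (32,))      # the known outputs
opt.zero_grad()
loss = loss_fn(model(inputs), labels)    # how wrong are the predictions?
loss.backward()                          # learn from the error...
opt.step()                               # ...by updating the weights

# --- Inferencing: apply the trained model to a new input ---
with torch.no_grad():                    # no learning, just prediction
    query = torch.randn(1, 16)           # e.g., an embedded user query
    prediction = model(query).argmax(dim=1)
```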
These workloads are already pushing the limits of hyperscale data centers, where the number and variety of AI jobs have skyrocketed. And that’s just today’s AI models, which are growing 1,000x more complex every three years. Operators can’t meet these demands by scaling data center fabrics as they have in the past. They need to fundamentally re-architect them.
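To put that figure in perspective, a quick calculation shows what 1,000x every three years implies on an annual basis (the only input is the article’s own growth figure):

```python
# If model complexity grows 1,000x every three years, the implied
# annual growth factor is the cube root of 1,000, roughly 10x per year.
annual_factor = 1000 ** (1 / 3)
print(f"~{annual_factor:.1f}x per year")            # ~10.0x per year
print(f"~{annual_factor**6:,.0f}x over six years")  # ~1,000,000x
```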
AI Fabric Requirements
For AI training, operators should be able to keep pace using typical hardware refresh cycles, which will push front-end networks to 800 Gbps and beyond over the next several years. Inferencing, however, is another matter. Here, they need scalable back-end infrastructures capable of connecting thousands of xPUs and delivering:
- Massive scale: Compared to training, AI inferencing can generate 5x more traffic per accelerator and require 5x higher network bandwidth. Back-end networks must also be able to support thousands of synchronized jobs in parallel and more data- and compute-intensive workloads.
- Extremely low latency: For real-time AI inferencing (from an AI vision application flagging a production flaw to an AI chatbot responding to a customer), every millisecond counts. These workloads must progress through large numbers of nodes, and any delay in this flow can impede scalability or cause timeouts. Depending on the application, such delays could result in poor user experiences, expensive manufacturing errors, or worse.
- No packet loss: AI applications running in data centers are highly sensitive to packet loss, which increases latency and makes the network less predictable. AI cluster fabrics must therefore be lossless; the sizing sketch after this list shows one way to reason about what that demands of switch buffers.
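One common way Ethernet fabrics achieve losslessness is priority flow control (PFC), where a congested switch asks its upstream neighbor to pause. The article doesn’t name a specific mechanism, so treat the sketch below, including its link speed, cable length, and frame-size numbers, as an illustrative back-of-the-envelope estimate of the headroom buffer a port must reserve to absorb in-flight data after a pause is sent:

```python
# Back-of-the-envelope PFC headroom estimate: after a switch sends a
# pause frame, data already on the wire (plus frames in progress at
# both ends) keeps arriving and must fit in reserved buffer.
# All numbers below are illustrative assumptions, not from the article.

LINK_RATE_BPS = 800e9          # 800 Gbps port
CABLE_M = 100                  # cable length in meters
PROP_DELAY_S_PER_M = 5e-9      # ~5 ns/m signal propagation in fiber
MTU_BYTES = 9216               # jumbo frame size

# Round trip: the pause travels upstream, and in-flight data travels back.
rtt_s = 2 * CABLE_M * PROP_DELAY_S_PER_M

# Data in flight during that round trip, in bytes.
in_flight = LINK_RATE_BPS / 8 * rtt_s

# Worst case: a frame had just started at each end when the pause fired.
headroom_bytes = in_flight + 2 * MTU_BYTES

print(f"RTT: {rtt_s * 1e6:.2f} us")
print(f"Headroom needed: {headroom_bytes / 1024:.0f} KiB per port")
```

Even with a short 100 m cable, an 800 Gbps port needs on the order of 100+ KiB of reserved headroom, which is one reason buffer sizing becomes harder as link rates climb.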
Evolving Interfaces
Even in new back-end data center infrastructures, we see significant variation, with Google, Microsoft, and Amazon all taking different paths. Based on what operators are saying, however, and the interfaces they’re investing in, we’re starting to get a clearer picture.
In front-end networks used for data ingestion, Dell’Oro Group forecasts that by 2027, one third of all Ethernet ports will be 800 Gbps or higher. In back-end networks, where operators need higher throughput and lower latency immediately, things are moving more quickly: nearly all back-end ports will be 800 Gbps and above by 2027, with bandwidth growing at a triple-digit annual rate.
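Triple-digit growth compounds quickly. As a rough illustration, assuming (purely for the arithmetic) a 100 percent annual rate, the low end of triple digits:

```python
# Compounding at a triple-digit annual rate: even the minimum (100%/yr)
# doubles back-end bandwidth every year. The rate is an assumption for
# illustration; the forecast only says "triple-digit".
cagr = 1.00  # 100% year-over-year growth
bandwidth = 1.0
for year in range(1, 5):
    bandwidth *= (1 + cagr)
    print(f"Year {year}: {bandwidth:.0f}x the starting bandwidth")
# Year 4: 16x the starting bandwidth
```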
Which interfaces an operator invests in will depend on several factors:
- Bandwidth and latency requirements
- Whether model training will be performed in-house or outsourced
- Standardized versus proprietary technologies
- Comfort with future roadmaps for technologies under consideration
Looking Ahead
As these architectures take shape, rigorous testing will become even more important, and it will demand new test and emulation tools designed for the speeds and scale of AI infrastructures. Vendors will need the ability to validate new Ethernet products, perform high-speed timing and synchronization testing, assure interoperability of multi-vendor components, and more. Given the exorbitant cost of building full AI clusters in lab environments, vendors will also need tools that emulate cluster behavior with lifelike accuracy.
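To give a flavor of what emulating cluster behavior involves, the toy simulation below models one of the traffic patterns mentioned earlier, a synchronized micro-burst converging on a single switch port. All parameters are invented for illustration; this is a simplified sketch, not a model of any real test tool:

```python
# Toy incast/micro-burst simulation: N synchronized senders burst at a
# single egress port. The port drains at line rate; anything that does
# not fit in the buffer is lost. Purely illustrative numbers.

def simulate_burst(senders, burst_bytes, buffer_bytes, line_rate_bps,
                   burst_duration_s):
    """Return bytes dropped when `senders` nodes burst simultaneously."""
    arriving = senders * burst_bytes                # total offered load
    drained = line_rate_bps / 8 * burst_duration_s  # egress drains this much
    overflow = arriving - drained - buffer_bytes    # what the buffer can't hold
    return max(0.0, overflow)

# A synchronized all-to-one step, e.g., in a collective operation:
dropped = simulate_burst(
    senders=256,                   # xPUs finishing a step at the same instant
    burst_bytes=1024 * 1024,       # 1 MiB result per sender
    buffer_bytes=32 * 1024 * 1024, # 32 MiB of buffer behind this port
    line_rate_bps=800e9,           # 800 Gbps egress
    burst_duration_s=50e-6,        # bursts land within a 50 us window
)
print(f"Bytes dropped: {dropped:,.0f}")
# With these numbers the buffer overflows, illustrating why mechanisms
# that pause or slow senders matter so much for AI fabrics.
```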