Sunday, April 14, 2024

Blueprint: Reimagining Data Center Networks for AI

by Aniket Khosla, Spirent’s VP of Wireline Product Management

By now, you’ve probably seen plenty of examples of artificial intelligence in action, whether it’s generating new code with ChatGPT, getting enhanced web search results from Google Gemini, or creating full-motion videos on demand. Even so, we’re just scratching the surface of AI’s potential to transform applications and industries. Language processing, image recognition, AI-driven recommendation systems, automation: it’s all incredibly exciting. In the data centers where these AI applications run, however, processing these workloads brings extraordinary new challenges.

To process evolving AI workloads, AI cluster networks must provide unprecedented throughput, extremely low latency, and support for new traffic patterns such as micro data bursts. Even more challenging, the traditional approach for improving network performance—throwing more hardware at the problem—won’t work. The characteristics of AI workloads and their exponential growth demand an entirely new kind of data center network. 

The industry is still in the process of determining exactly what tomorrow’s AI-optimized networks will look like, and several open questions remain. Among the most significant: the role that Ethernet will play in front- and back-end networks, and the pace at which data center operators will adopt next-generation Ethernet speeds and protocols. 

Why is AI so different and challenging for data center networks? And what do operators, and the vendors supporting them, need to know to make informed decisions? Let’s take a closer look.

Inside the Cluster

To understand the impact of AI workloads on data center networks, let’s review what’s happening behind the scenes. There are two basic phases of AI processing:
  • Training involves ingesting vast amounts of data for an AI model to train on. By analyzing patterns and relationships between inputs and outputs, the model learns how to make increasingly accurate predictions.  
  • Inferencing is where AI models put their training into practice. The model receives an input (such as a query from a chatbot user), and then performs some classification or prediction action to generate an output. 
Both phases require large amounts of compute in the form of specialized (and expensive) processing units. These “xPUs” could be CPUs, GPUs, Field Programmable Gate Arrays (FPGAs), or other types of accelerators. For AI inferencing in particular, though, the network connecting all those xPUs (plus servers and storage) must deliver extremely high bandwidth with extremely low latency, while ensuring no packet loss.
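To make the distinction concrete, here’s a minimal Python sketch (a toy linear model, nothing like a production-scale AI system): training iteratively adjusts weights by comparing predictions against known outputs, while inferencing is a single forward pass with the learned weights.

```python
# Toy illustration of the two phases described above, using plain NumPy:
# "training" fits weights by gradient descent; "inferencing" is a single
# forward pass with the learned weights.
import numpy as np

rng = np.random.default_rng(0)

# --- Training: learn y ~= 3x + 1 from noisy samples ---
x = rng.uniform(-1, 1, size=(1000, 1))
y = 3 * x + 1 + 0.05 * rng.normal(size=(1000, 1))

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    pred = w * x + b
    grad_w = 2 * np.mean((pred - y) * x)   # dLoss/dw for mean squared error
    grad_b = 2 * np.mean(pred - y)         # dLoss/db
    w -= lr * grad_w
    b -= lr * grad_b

# --- Inferencing: apply the trained model to a new input ---
query = 0.5
print(f"learned w={w:.2f}, b={b:.2f}, prediction for x=0.5: {w * query + b:.2f}")
```

Production training runs do this across thousands of xPUs at once, which is exactly what puts the network under stress.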

These requirements are already pushing the limits of hyperscale data centers, where the number and variety of AI workloads have skyrocketed. And that’s just today’s AI models—which are growing 1,000x more complex every three years. Operators can’t meet these demands by scaling data center fabrics as they have in the past. They need to fundamentally re-architect them. 

AI Fabric Requirements

Today, operators are investing in massive numbers of xPUs for AI workloads. How many they’ll ultimately require—and the scale of the network connecting them—will depend on future AI applications. But it seems likely they’ll need fabrics that can support tens of thousands of xPUs and models with trillions of dense parameters.
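For a rough sense of why the numbers get so large, consider a quick back-of-envelope calculation (the 16-bit weights and 80 GB of accelerator memory below are illustrative assumptions, not figures from any vendor):

```python
# Back-of-envelope arithmetic (illustrative assumptions, not vendor specs):
# how many accelerators does it take just to *hold* a trillion-parameter model?
PARAMS = 1e12           # one trillion parameters
BYTES_PER_PARAM = 2     # assuming 16-bit (fp16/bf16) weights
HBM_PER_XPU_GB = 80     # assuming an 80 GB high-bandwidth-memory accelerator

weight_bytes = PARAMS * BYTES_PER_PARAM
xpus_for_weights = weight_bytes / (HBM_PER_XPU_GB * 1e9)
print(f"Weights alone: {weight_bytes / 1e12:.0f} TB "
      f"-> at least {xpus_for_weights:.0f} xPUs just to hold them")
# Optimizer state, activations, and data/pipeline parallelism multiply this
# figure many times over -- hence clusters of tens of thousands of xPUs.
```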

For AI training, operators should be able to support workloads well enough with typical refresh cycles, which will push front-end networks to 800Gbps and beyond over the next several years. Inferencing, however, is another matter. Here, they need scalable back-end infrastructures capable of connecting thousands of xPUs and delivering: 
  • Massive scale: Compared to training, AI inferencing can generate 5x more traffic per accelerator and require 5x higher network bandwidth. Back-end networks must also be able to support thousands of synchronized jobs in parallel and more data- and compute-intensive workloads.  
  • Extremely low latency: For real-time AI inferencing (from an AI vision application flagging a production flaw to an AI chatbot responding to a customer), every millisecond counts. These workloads must progress through large numbers of nodes, and any delay in this flow can impede scalability or cause timeouts. Depending on the application, such delays could result in poor user experiences, expensive manufacturing errors, or worse. 
  • No packet loss: AI applications running in data centers are highly sensitive to packet loss, which increases latency and makes the network less predictable. AI cluster fabrics therefore must be lossless. 
These back-end network requirements are already affecting AI user experiences. According to Meta, roughly a third of elapsed time for AI workloads is spent waiting on the network.  
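Meta’s figure is easier to appreciate once you consider how synchronized jobs behave: each step finishes only when the slowest of its parallel flows does. The toy Python simulation below (with invented latency numbers, purely to illustrate the tail-latency effect) shows how even a 1% chance of a slow flow comes to dominate step time as parallelism grows:

```python
# Hedged Monte Carlo sketch of why tail latency dominates synchronized AI jobs:
# each step waits for the *slowest* of N parallel network flows, so rare slow
# packets (loss, retransmits) set the pace for everyone.
import random

def flow_latency_ms(p_slow=0.01):
    """Typical flow takes ~1 ms; 1% of flows hit a 50 ms hiccup (retransmit)."""
    return 50.0 if random.random() < p_slow else 1.0

def mean_step_time(n_flows, trials=2000):
    return sum(
        max(flow_latency_ms() for _ in range(n_flows)) for _ in range(trials)
    ) / trials

for n in (1, 10, 100, 1000):
    print(f"{n:>5} parallel flows -> mean step time {mean_step_time(n):6.1f} ms")
```

At 1,000 parallel flows, nearly every step encounters at least one slow flow, which is why lossless, predictable fabrics matter so much at cluster scale.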
 

Evolving Interfaces

Given these extreme demands, how should data center operators and the vendors supporting them respond? What’s the best infrastructure approach to sustain both current AI workloads and those on the horizon? This is still an open question.

Even in new back-end data center infrastructures, we see significant variation, with Google, Microsoft, and Amazon all taking different paths. Based on what operators are saying, however, and the interfaces they’re investing in, we’re starting to get a clearer picture.

In front-end networks used for data ingestion, Dell’Oro Group forecasts that by 2027, one third of all Ethernet ports will be 800 Gbps or higher. In back-end networks, where operators need higher throughput and lower latency immediately, things are moving more quickly: nearly all back-end ports will be 800G and above by 2027, with bandwidth growing at a triple-digit rate.

While most operators will continue using Ethernet in front-end networks, back-end infrastructures will vary. Depending on the AI applications they’re supporting, some operators will want a lossless technology like InfiniBand. Others will prefer the familiarity and economics of standardized Ethernet in conjunction with technology like the RoCEv2 (RDMA over Converged Ethernet, version 2) protocol, which facilitates lossless, low-latency flows. Still others will use both InfiniBand and Ethernet.
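For readers unfamiliar with RoCEv2, the core idea is that RDMA traffic rides inside ordinary UDP/IP/Ethernet, so it can cross standard switches; receiving NICs recognize the IANA-assigned UDP destination port 4791 and hand the payload to their RDMA engines. Here’s a minimal Scapy sketch of that encapsulation (the addresses and DSCP marking are placeholders, and the zeroed bytes stand in for the 12-byte Base Transport Header rather than forming a working RDMA exchange):

```python
# Minimal Scapy sketch of RoCEv2's encapsulation: InfiniBand transport
# traffic (Base Transport Header + payload) rides inside ordinary
# UDP/IP/Ethernet, using IANA-assigned UDP destination port 4791.
from scapy.all import Ether, IP, UDP, Raw

rocev2_frame = (
    Ether(src="02:00:00:00:00:01", dst="02:00:00:00:00:02")
    / IP(src="10.0.0.1", dst="10.0.0.2", tos=0x68)  # DSCP marking is deployment-specific (placeholder)
    / UDP(sport=49152, dport=4791)                  # 4791 = RoCEv2
    / Raw(load=b"\x00" * 12)                        # placeholder for the 12-byte BTH
)
rocev2_frame.show()
```

In practice, the “lossless” behavior comes not from RoCEv2 alone but from pairing it with Ethernet congestion controls such as Priority Flow Control (PFC) and ECN-based congestion notification.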

For now, there is no single “right” answer. Apart from considerations like price and deployment size, data center operators will need to weigh multiple factors based on the AI workloads they expect to support, including: 
  • Bandwidth and latency requirements
  • Whether model training will be performed in-house or outsourced
  • Standardized versus proprietary technologies 
  • Comfort with future roadmaps for technologies under consideration 

Looking Ahead

Despite the uncertainty, vendors developing connectivity solutions for AI clusters have little choice but to push ahead under accelerated timelines. The customer’s need is just too great. Even as 400G Ethernet deployments grow, vendors are manufacturing 800G chipsets as quickly as possible, and work on the 1.6-Tbps Ethernet standard is progressing. 

In the meantime, rigorous testing will become even more important—and demand new test and emulation tools designed for the speeds and scale of AI infrastructures. Vendors will need the ability to validate new Ethernet products, perform high-speed timing and synchronization testing, assure interoperability of multi-vendor components, and more. Given the exorbitant costs of building AI clusters for lab environments, vendors will also need tools to emulate cluster behavior with lifelike accuracy. 
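To give a flavor of what emulating AI traffic patterns involves, here’s a toy Python sketch of the micro-burst shape mentioned earlier: datagrams sent back-to-back, then an idle gap. Real test equipment does this at hardware line rate with nanosecond-precision timing; the loopback target below is just a placeholder.

```python
# Toy sketch of micro-burst emulation: blast a burst of UDP datagrams
# back-to-back, pause, repeat, and report the achieved burst rate. Real
# AI-fabric testing needs hardware-rate generators; this only illustrates
# the on/off traffic shape. Target host/port are placeholders.
import socket
import time

TARGET = ("127.0.0.1", 9000)   # placeholder sink
PKT = b"\x00" * 1400           # near-MTU payload
BURST_PKTS, BURSTS, GAP_S = 500, 10, 0.01

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for _ in range(BURSTS):
    start = time.perf_counter()
    for _ in range(BURST_PKTS):
        sock.sendto(PKT, TARGET)
    elapsed = time.perf_counter() - start
    gbps = BURST_PKTS * len(PKT) * 8 / elapsed / 1e9
    print(f"burst of {BURST_PKTS} pkts in {elapsed * 1e3:.2f} ms (~{gbps:.2f} Gbps)")
    time.sleep(GAP_S)          # idle gap between micro-bursts
```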

The good news is that testing solutions are evolving as quickly as data center networks themselves. As new questions arise about the future of AI networking, the industry will be ready to answer them. 

Tech Update: NTT Research - Innovations for a Better Future

As demand for network and computing capacity continues to rise, AI is driving the need for a revolutionary new infrastructure that is faster, more sustainable, and more secure, one capable of supporting cutting-edge innovations. Implementing NTT's IOWN concept, which aims to shift networks and computing from electronic to photonic technologies, will play a crucial role in developing a more sustainable and scalable AI infrastructure.

Kazuhiro Gomi, President and CEO of NTT Research, provides updates from the Upgrade 2024 event in San Francisco:

- The role of photonics technology as a potential solution to the overheating problem in IT infrastructure, especially as AI demands significant computational power and energy.

- The steps towards integrating photonics technology into our systems, starting with replacing connections between processors and memory, and eventually leading to the development of optical computers.

- The importance of investing in fundamental basic research, as it is the starting point for big innovations, and how NTT Research allocates a significant portion of its operating income towards this type of research.

https://youtu.be/Won83-wrh7U

Have a tech update that you want to brief us on? Contact info@nextgeninfra.io!

Check out other Tech Updates on our YouTube Channel (subscribe today): https://www.youtube.com/@NextGenInfra and check out our latest reports at: https://nextgeninfra.io/


Tech Update: Pioneering the Future with Photonic Computing & IOWN

NTT Research is making big bets on photonics. Chris Shaw, Chief Marketing Officer of NTT Research, explains:

- The concept of the Innovative Optical and Wireless Network (IOWN), a big bet NTT Research made five years ago on its vision of the future.

- The successful deployment of an All Photonics Network (APN) in Osaka, Japan, and its subsequent expansion to the UK and the US, marking the beginning of a global APN deployment.

- The next steps towards photonic computing, including the development of photonic hardware and chipsets, and how these advancements can lead to faster, more efficient communication in the world of AI and more sustainable hardware. 

https://youtu.be/mIx73LcenhQ

Have a tech update that you want to brief us on? Contact info@nextgeninfra.io!

Tech Update: The Battle for AI will be Won at the Edge

How critical is edge infrastructure to the future of AI? Devin Yaung, SVP of Global Enterprise IoT at NTT, discusses:

- Devin discusses the importance of edge computing to AI and ML platforms, emphasizing that these platforms are only as good as the data you feed them. He explains how the edge, being closest to the physical environment, provides the real-time information these platforms need to offer true insights.

- He shares how NTT is helping clients maximize their investments in ERP, AI, and ML by providing data from the edge. He gives an example of a client who was able to significantly increase the value they got from their ERP platform by gaining visibility into their data through a real-time inventory management solution provided by NTT.

- Looking ahead, Devin reveals that NTT will continue to build out the edge and develop advanced platforms on AI, machine learning, and analytics. He encourages viewers to stay updated with NTT's progress through their social media and website.

https://youtu.be/R7b2VuNgnjM

Join Devin as he explores the intersection of AI and edge computing, and learn what NTT is doing at the edge.

Have a tech update that you want to brief us on? Contact info@nextgeninfra.io!

Singtel teams with Huawei on Fibre To The Room

Singtel has partnered with Huawei to launch FibreEverywhere, a home broadband solution featuring Huawei’s transparent fibre optic cables that provide unprecedented speed and reliability to devices such as laptops and smartphones in every room – a first in Singapore.

The Fibre To The Room solution can also be easily installed without homeowners needing to make any structural adjustments to their homes.

This new connectivity solution provides more than 1Gbps Wi-Fi speeds to every room – up to 2-3 times faster than basic Wi-Fi 6 routers. The installation of the new fibre optic cables and routers is available at a promotional price from S$109 per month for connectivity in two rooms, and S$20 per month for every additional room.


https://www.singtel.com/about-us/media-centre/news-releases/singtel-first-in-singapore-to-launch-next-generation-fibre