Monday, March 18, 2024

NVIDIA Blackwell and 5th gen NVLink advance AI

NVIDIA unveiled its Blackwell platform, named in honor of David Harold Blackwell, a mathematician who specialized in game theory and statistics. The platform succeeds the NVIDIA Hopper architecture launched two years ago.

Blackwell leverages six new technologies to enable AI training and real-time LLM inference for models scaling up to 10 trillion parameters.

Blackwell Highlights

  • 208 billion transistors 
  • Manufactured using a custom-built TSMC 4NP process with two reticle-limit GPU dies
  • The two GPU dies are connected by a 10 TB/s chip-to-chip link into a single, unified GPU
  • Blackwell introduces a 2nd-gen Transformer Engine with new micro-tensor scaling support and NVIDIA’s advanced dynamic-range management algorithms
  • Blackwell will support double the compute and model sizes with new 4-bit floating point AI inference capabilities
  • 5th gen NVLink delivers 1.8 TB/s bidirectional throughput per GPU, ensuring seamless high-speed communication among up to 576 GPUs for the most complex LLMs
  • Blackwell-powered GPUs include a dedicated engine for reliability, availability and serviceability
  • Support for new native interface encryption protocols
  • A dedicated decompression engine supports the latest formats
  • The NVIDIA GB200 Grace Blackwell Superchip connects two NVIDIA B200 Tensor Core GPUs to the NVIDIA Grace CPU over a 900 GB/s ultra-low-power NVLink chip-to-chip interconnect
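The per-GPU NVLink figure above implies a very large aggregate across a full NVLink domain. A quick back-of-the-envelope sketch in Python (the constant names are mine; the values come from the bullets above):

```python
# Figures taken from the Blackwell highlights above; names are illustrative.
NVLINK_PER_GPU_TBPS = 1.8   # 5th-gen NVLink, bidirectional, per GPU
MAX_GPUS_PER_DOMAIN = 576   # largest NVLink domain quoted above

# Aggregate bidirectional NVLink bandwidth across a full 576-GPU domain
aggregate_tbps = NVLINK_PER_GPU_TBPS * MAX_GPUS_PER_DOMAIN
print(aggregate_tbps)  # roughly 1036.8 TB/s
```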

Building Bigger Systems

For the highest AI performance, GB200-powered systems can be connected with the NVIDIA Quantum-X800 InfiniBand and Spectrum-X800 Ethernet platforms, also announced today, which deliver advanced networking at speeds up to 800 Gb/s.

The GB200 is a key component of the NVIDIA GB200 NVL72, a multi-node, liquid-cooled, rack-scale system for the most compute-intensive workloads. It combines 36 Grace Blackwell Superchips, comprising 72 Blackwell GPUs and 36 Grace CPUs interconnected by fifth-generation NVLink. GB200 NVL72 also includes NVIDIA BlueField-3 data processing units to enable cloud network acceleration, composable storage, zero-trust security and GPU compute elasticity in hyperscale AI clouds. For LLM inference workloads, the GB200 NVL72 provides up to a 30x performance increase over the same number of NVIDIA H100 Tensor Core GPUs, and reduces cost and energy consumption by up to 25x. The platform acts as a single GPU with 1.4 exaflops of AI performance and 30 TB of fast memory, and is a building block for the newest DGX SuperPOD.
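The NVL72 composition quoted above can be sanity-checked with simple arithmetic (illustrative Python; the names are mine, the counts are from the text):

```python
# GB200 NVL72 composition, per the description above.
SUPERCHIPS = 36
GPUS_PER_SUPERCHIP = 2   # two B200 GPUs per GB200 Grace Blackwell Superchip
CPUS_PER_SUPERCHIP = 1   # one Grace CPU per Superchip

gpus = SUPERCHIPS * GPUS_PER_SUPERCHIP   # 72 Blackwell GPUs
cpus = SUPERCHIPS * CPUS_PER_SUPERCHIP   # 36 Grace CPUs
print(gpus, cpus)  # 72 36
```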

NVIDIA will also offer the HGX B200, a server board that links eight B200 GPUs through NVLink to support x86-based generative AI platforms. HGX B200 supports networking speeds up to 400Gb/s through the NVIDIA Quantum-2 InfiniBand and Spectrum-X Ethernet networking platforms.

Global Network of Blackwell Partners

AWS, Google Cloud, Microsoft Azure and Oracle Cloud Infrastructure will be among the first cloud service providers to offer Blackwell-powered instances, as will NVIDIA Cloud Partner program companies Applied Digital, CoreWeave, Crusoe, IBM Cloud and Lambda. 

Sovereign AI clouds will also provide Blackwell-based cloud services and infrastructure, including Indosat Ooredoo Hutchison, Nebius, Nexgen Cloud, Oracle EU Sovereign Cloud, the Oracle US, UK, and Australian Government Clouds, Scaleway, Singtel, Northern Data Group's Taiga Cloud, Yotta Data Services’ Shakti Cloud and YTL Power International.

Cisco, Dell, Hewlett Packard Enterprise, Lenovo and Supermicro are expected to deliver a wide range of servers based on Blackwell products, as are Aivres, ASRock Rack, ASUS, Eviden, Foxconn, GIGABYTE, Inventec, Pegatron, QCT, Wistron, Wiwynn and ZT Systems.

Additionally, a growing network of software makers, including Ansys, Cadence and Synopsys — global leaders in engineering simulation — will use Blackwell-based processors to accelerate their design and simulation software.

NVIDIA rolls out 800G Networking Switches

NVIDIA introduced its X800 series of 800G InfiniBand and Ethernet networking switches designed for massive-scale AI workloads, including those using NVIDIA's newly unveiled Blackwell architecture-based products.

Quantum-X800 Platform: Features the NVIDIA Quantum Q3400 switch and the NVIDIA ConnectX-8 SuperNIC, delivering 5 times higher bandwidth and a 9 times increase in In-Network Computing capabilities compared to previous generations. Advanced features include:

  • The Q3400-RA 4U switch—the first to utilize 200 Gbps-per-lane serializer/deserializer (SerDes) technology. It offers 144 ports of 800 Gbps distributed over 72 OSFP cages and a dedicated management port for NVIDIA UFM (Unified Fabric Manager) connectivity
  • With this very high radix, a two-level fat tree topology can connect up to 10,368 network interface cards (NICs)
  • NVIDIA SHARP v4, Message Passing Interface (MPI) tag matching, MPI_Alltoall, and programmable cores boost NVIDIA In-Network Computing
  • Adaptive routing: The switch and ConnectX-8 SuperNIC, working together, maximize bandwidth and ensure network resilience for AI fabrics
  • Telemetry-based congestion control provides noise isolation for multi-tenant AI workloads
  • The Q3400 is air-cooled and compatible with standard 19-inch rack cabinets. A parallel liquid-cooled system, Q3400-LD, fitting an Open Compute Project (OCP) 21-inch rack, is offered as well.
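The 10,368-NIC figure quoted above follows directly from the 144-port radix. A quick sketch of the two-level fat-tree arithmetic (illustrative; the function name is mine):

```python
def max_endpoints_two_level_fat_tree(radix: int) -> int:
    """Max NICs in a non-blocking two-level (leaf/spine) fat tree.

    Each leaf switch splits its ports evenly: radix/2 down to NICs and
    radix/2 up to spines; a full spine layer can then serve `radix` leaf
    switches, for radix * radix/2 endpoints in total.
    """
    return radix * radix // 2

print(max_endpoints_two_level_fat_tree(144))  # 10368, matching the figure above
```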

The NVIDIA ConnectX-8 SuperNIC delivers 800 Gbps networking with performance isolation for multi-tenant generative AI clouds. It provides 800 Gbps data throughput with PCI Express (PCIe) Gen6, offering up to 48 lanes for various use cases such as PCIe switching inside NVIDIA GPU systems. It also supports advanced NVIDIA In-Network Computing, MPI_Alltoall, and MPI tag-matching hardware engines, as well as fabric enhancement features like quality of service and congestion control. The ConnectX-8 SuperNIC, featuring single-port OSFP224 and dual-port quad small form-factor pluggable (QSFP) 112 connectors, is compatible with various form factors, including OCP 3.0 and Card Electromechanical (CEM) PCIe x16. The ConnectX-8 SuperNIC also supports NVIDIA Socket Direct 16-lane auxiliary card expansion.

Spectrum-X800 Platform: Tailored for AI cloud and enterprise infrastructure, offering optimized performance for faster processing and analysis of AI workloads. It includes the Spectrum SN5600 800 Gbps switch and the NVIDIA BlueField-3 SuperNIC. Highlights:

  • The SN5600 switch boasts 64 ports of 800G OSFP and 51.2 terabits per second (Tbps) of switching capacity
  • RDMA over Converged Ethernet (RoCE) adaptive routing: Spectrum-X800 features adaptive routing for lossless networks, closely integrating the switch and SuperNIC to boost bandwidth and resilience in AI fabrics
  • Programmable congestion control: Spectrum-X800 uses advanced congestion control techniques to enhance noise isolation in multi-tenant AI environments
  • Software Support: NVIDIA provides a suite of network acceleration libraries and software to enhance the performance of trillion-parameter AI models, including the NVIDIA Collective Communications Library (NCCL) for extending GPU computing tasks to the Quantum-X800 network.
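The NCCL collectives mentioned above (such as all-reduce) aggregate data across GPUs over the network fabric. A toy pure-Python simulation of a ring all-reduce, one pattern such libraries commonly use, is sketched below (illustrative only; real NCCL operates on GPU buffers with chunked, overlapped transfers):

```python
def ring_allreduce(values):
    """Toy ring all-reduce: every rank ends with the sum of all ranks' values.

    values[i] is rank i's local scalar. At each of the n-1 ring steps, every
    rank adds the contribution forwarded from one more neighbor around the ring.
    """
    n = len(values)
    result = list(values)
    for step in range(1, n):
        for rank in range(n):
            result[rank] += values[(rank - step) % n]
    return result

print(ring_allreduce([1, 2, 3, 4]))  # [10, 10, 10, 10]
```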

“NVIDIA Networking is central to the scalability of our AI supercomputing infrastructure,” said Gilad Shainer, senior vice president of Networking at NVIDIA. “NVIDIA X800 switches are end-to-end networking platforms that enable us to achieve trillion-parameter-scale generative AI essential for new AI infrastructures.”

Initial adopters of Quantum InfiniBand and Spectrum-X Ethernet include Microsoft Azure and Oracle Cloud Infrastructure.

“AI is a powerful tool to turn data into knowledge. Behind this transformation is the evolution of data centers into high-performance AI engines with increased demands for networking infrastructure,” said Nidhi Chappell, Vice President of AI Infrastructure at Microsoft Azure. “With new integrations of NVIDIA networking solutions, Microsoft Azure will continue to build the infrastructure that pushes the boundaries of cloud AI.”

Big Clouds endorse NVIDIA Blackwell platform

AWS will offer NVIDIA Grace Blackwell GPU-based Amazon EC2 instances and NVIDIA DGX Cloud to accelerate performance of building and running inference on multi-trillion-parameter LLMs. Plans also include the integration of the AWS Nitro System, Elastic Fabric Adapter encryption, and AWS Key Management Service with Blackwell encryption to provide end-to-end control of training data and model weights. Specifically, AWS will offer the NVIDIA Blackwell platform, featuring GB200 NVL72, with 72 Blackwell GPUs and 36 Grace CPUs interconnected by fifth-generation NVIDIA NVLink. When connected with Amazon’s networking (EFA), and supported by advanced virtualization (AWS Nitro System) and hyper-scale clustering (Amazon EC2 UltraClusters), customers can scale to thousands of GB200 Superchips.

In addition, Project Ceiba, a collaboration between NVIDIA and AWS to build one of the world’s fastest AI supercomputers hosted exclusively on AWS, is available for NVIDIA’s own research and development. This first-of-its-kind supercomputer is being built with the new NVIDIA GB200 NVL72 and fifth-generation NVLink, scaling to 20,736 B200 GPUs connected to 10,368 NVIDIA Grace CPUs.

“The deep collaboration between our two organizations goes back more than 13 years, when together we launched the world’s first GPU cloud instance on AWS, and today we offer the widest range of NVIDIA GPU solutions for customers,” said Adam Selipsky, CEO at AWS. “NVIDIA’s next-generation Grace Blackwell processor marks a significant step forward in generative AI and GPU computing. When combined with AWS’s powerful Elastic Fabric Adapter Networking, Amazon EC2 UltraClusters’ hyper-scale clustering, and our unique Nitro system’s advanced virtualization and security capabilities, we make it possible for customers to build and run multi-trillion parameter large language models faster, at massive scale, and more securely than anywhere else. Together, we continue to innovate to make AWS the best place to run NVIDIA GPUs in the cloud.”

Google Cloud confirmed plans to adopt the new NVIDIA Grace Blackwell AI computing platform, as well as the NVIDIA DGX Cloud service on Google Cloud. Additionally, the NVIDIA H100-powered DGX Cloud platform is now generally available on Google Cloud. 

Microsoft will be one of the first organizations to deploy NVIDIA Grace Blackwell GB200 and advanced NVIDIA Quantum-X800 InfiniBand networking to Azure.

Microsoft is also announcing the general availability of its Azure NC H100 v5 virtual machine (VM), based on the NVIDIA H100 NVL platform and aimed at midrange training and inferencing. The NC series offers customers two classes of VMs, with one or two NVIDIA H100 94GB PCIe Tensor Core GPUs, and supports NVIDIA Multi-Instance GPU (MIG) technology, which allows customers to partition each GPU into up to seven instances, providing flexibility and scalability for diverse AI workloads.
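As a rough illustration of the MIG partitioning described above, an even seven-way split of the H100 NVL's 94 GB works out to roughly 13 GB per instance. This even split is a simplification: real MIG profiles come in fixed, not necessarily uniform, sizes.

```python
# Illustrative arithmetic only; actual MIG profile sizes are fixed by NVIDIA
# and need not divide memory evenly.
TOTAL_MEMORY_GB = 94   # H100 NVL memory, per the paragraph above
MAX_INSTANCES = 7      # MIG partitioning limit, per the paragraph above

per_instance_gb = TOTAL_MEMORY_GB / MAX_INSTANCES
print(round(per_instance_gb, 1))  # about 13.4 GB per instance
```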

In addition, NVIDIA GPUs and NVIDIA Triton Inference Server™ help serve AI inference predictions in Microsoft Copilot for Microsoft 365. 

“Together with NVIDIA, we are making the promise of AI real, helping drive new benefits and productivity gains for people and organizations everywhere,” said Satya Nadella, chairman and CEO, Microsoft. “From bringing the GB200 Grace Blackwell processor to Azure, to new integrations between DGX Cloud and Microsoft Fabric, the announcements we are making today will ensure customers have the most comprehensive platforms and tools across every layer of the Copilot stack, from silicon to software, to build their own breakthrough AI capability.”

“AI is transforming our daily lives — opening up a world of new opportunities,” said Jensen Huang, founder and CEO of NVIDIA. “Through our collaboration with Microsoft, we’re building a future that unlocks the promise of AI for customers, helping them deliver innovative solutions to the world.”

Oracle has expanded collaboration with NVIDIA to deliver sovereign AI solutions to customers around the world. Oracle’s distributed cloud, AI infrastructure, and generative AI services, combined with NVIDIA’s accelerated computing and generative AI software, are enabling governments and enterprises to deploy "AI factories." Oracle’s cloud services leverage a range of NVIDIA’s stack, including NVIDIA accelerated computing infrastructure and the NVIDIA AI Enterprise software platform, including newly announced NVIDIA NIM™ inference microservices, which are built on the foundation of NVIDIA inference software such as NVIDIA TensorRT, NVIDIA TensorRT-LLM, and NVIDIA Triton Inference Server.

NVIDIA launches 6G Research Cloud AI platform

NVIDIA unveiled its 6G Research Cloud platform, which aims to advance AI in radio access networks.

The NVIDIA 6G Research Cloud platform consists of three foundational elements:

  • NVIDIA Aerial Omniverse Digital Twin for 6G: A reference application and developer sample that enables physically accurate simulations of complete 6G systems, from a single tower to city scale. It incorporates software-defined RAN and user-equipment simulators, along with realistic terrain and object properties. Using the Omniverse Aerial Digital Twin, researchers will be able to simulate and build base-station algorithms based on site-specific data and to train models in real time to improve transmission efficiency.
  • NVIDIA Aerial CUDA-Accelerated RAN: A software-defined, full-RAN stack that offers significant flexibility for researchers to customize, program and test 6G networks in real time.
  • NVIDIA Sionna Neural Radio Framework: A framework that provides seamless integration with popular frameworks like PyTorch and TensorFlow, leveraging NVIDIA GPUs for generating and capturing data and training AI and machine learning models at scale. This also includes NVIDIA Sionna, the leading link-level research tool for AI/ML-based wireless simulations.

Key adopters and ecosystem partners of this platform include industry leaders like Ansys, Arm, ETH Zurich, Fujitsu, Keysight, Nokia, Northeastern University, Rohde & Schwarz, Samsung, SoftBank Corp., and Viavi, showcasing broad support for NVIDIA's vision.

Orange picks Infinera’s GX Series for international network

Orange has selected Infinera’s GX Series compact modular networking platform to optimize its network infrastructure across Europe.

The network deployment project will leverage Infinera’s latest generation of coherent engines and optical line system. Orange and Infinera recently completed a network in the U.S., and will now connect Paris, Marseille, and Bordeaux in France. Spanning 3,000 kilometers (km), this network includes multiple hub cities for international connectivity via submarine landing stations on France’s Atlantic and Mediterranean coasts and connection points for European data centers and additional Orange affiliates.

“We are excited about this new partnership with Infinera,” said Aurélien Vigano, VP of International Transmission Networks at Orange. “By partnering with Infinera, we are glad to add another key supplier to our network in order to reinforce our core backbone infrastructure across France to double our route between three strategic connectivity hubs, adding resilience and capacity to our network. With Infinera’s flexible GX platform, we will be able to seamlessly reinforce our network as new technologies become available, enabling us to keep pace with rapidly growing customer demands, while providing the best customer experience.”

“Infinera looks forward to our long-term partnership with Orange to deliver our GX Series solution on Orange’s new and existing optical transport routes, expanding Orange’s offerings to network operators and wholesale carriers, with resilient and reliable global connectivity capability,” said Nick Walden, Senior Vice President, Worldwide Sales, Infinera.

Windstream Wholesale upgrades key routes in Virginia

Windstream Wholesale completed two pivotal route enhancement projects across Virginia using FLEX DWDM (FLEX) technology.

The FLEX upgrade extends across Windstream’s ICON network from Ashburn through Richmond, Virginia, and into the Virginia Beach cable landing station (CLS).  The route segment from Ashburn to Richmond serves as a vital connection between key data centers while the extension from Richmond to the Virginia Beach subsea CLS opens new avenues for international connectivity.

The upgrades leverage Ciena’s 6500 Reconfigurable Line System (RLS).

Windstream Wholesale’s FLEX/ICON upgrades and expansions enable 400G wave service, Managed Spectrum service, and greater visibility into network health and performance with technologies like inline optical time-domain reflectometer (OTDR) engineered into the route. Inline OTDR dramatically reduces mean time to repair (MTTR) in the event of a network interference or fiber cut, providing the ability to locate the interference without physically deploying resources. FLEX also provides cost benefits, enabling the direct connection of key data centers to core long-haul network nodes and eliminating the need to bookend equipment to jump between long-haul and regional networks.
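The inline-OTDR fault location described above comes from timing a reflected pulse: distance is half the round-trip travel distance in the fiber. A back-of-the-envelope sketch (the fiber group index and the sample timing are my illustrative assumptions, not Windstream's parameters):

```python
C_KM_PER_S = 299_792.458   # speed of light in vacuum, km/s
GROUP_INDEX = 1.468        # assumed group index for standard single-mode fiber

def fault_distance_km(round_trip_s: float) -> float:
    """Distance to a reflective event from OTDR round-trip time."""
    v = C_KM_PER_S / GROUP_INDEX   # propagation speed in the fiber
    return v * round_trip_s / 2    # halve: the pulse travels out and back

# Example: a reflection arriving 500 microseconds after launch sits
# roughly 51 km down the fiber.
print(round(fault_distance_km(500e-6), 1))
```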

“We are proud to announce the completion of these two crucial route initiatives,” said John Nishimoto, Windstream Wholesale senior vice president of product, marketing, and strategy. “These enhancements represent our steadfast commitment to technology leadership and exceeding customer expectations as we continue to drive innovation to meet today’s connectivity demands.”

Windstream Wholesale understands the critical role connectivity plays in today’s digital landscape. By investing in infrastructure expansion and leveraging state-of-the-art technologies like FLEX, the company remains at the forefront of delivering reliable, high-performance network solutions.

Dell'Oro: Broadband Access Equipment Sales Dipped 9% in 2023

The total global revenue for the Broadband Access equipment market decreased to $17.5 billion in 2023, down 9 percent year-over-year (Y/Y), according to a new report from Dell'Oro Group. Spending on Cable equipment dropped 3 percent overall, though spending on Remote PHY Devices (RPDs) jumped 21 percent, following a 99 percent Y/Y increase in 2022.

"Cable operators continue to modernize their networks through the deployment of Distributed Access Architectures," said Jeff Heynen, Vice President with Dell'Oro Group. "Though macroeconomic and supply chain issues continue to cloud the short-term horizon, there is no question that the goals of delivering more bandwidth and pushing edge processing and automation closer to subscribers are driving cable operator spending," explained Heynen.

Additional highlights from the 4Q 2023 Broadband Access and Home Networking quarterly report:

  • Total PON equipment spending was down 7 percent from 2022, driven by a 10 percent decline in spending on PON OLTs.
  • Spending on Fixed Wireless CPE increased 7 percent in 2023, driven once again by 5G Sub-6 GHz unit shipments in North America.
  • Total Wi-Fi 6E and Wi-Fi 7 Router and Broadband CPE unit shipments reached 10 million in 2023, with a significant ramp expected for Wi-Fi 7 units in 2024.