Monday, March 13, 2023

Microsoft deploys NVIDIA's H100 Tensor Core GPUs + Quantum-2 InfiniBand

Microsoft announced new massively scalable virtual machines that integrate the latest NVIDIA H100 Tensor Core GPUs and NVIDIA Quantum-2 InfiniBand networking. Azure’s new ND H100 v5 virtual machine for AI developers can scale across thousands of GPUs.

The ND H100 v5 VM series is supported by the following hardware (a short usage sketch follows the list):

  • 8x NVIDIA H100 Tensor Core GPUs interconnected via next gen NVSwitch and NVLink 4.0
  • 400 Gb/s NVIDIA Quantum-2 CX7 InfiniBand per GPU with 3.2 Tb/s per VM in a non-blocking fat-tree network
  • NVSwitch and NVLink 4.0 with 3.6 TB/s bisectional bandwidth between the 8 local GPUs within each VM
  • 4th Gen Intel Xeon Scalable processors
  • PCIe Gen5 host-to-GPU interconnect with 64 GB/s bandwidth per GPU
  • 16 channels of 4800 MHz DDR5 DIMMs
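
To make the topology concrete, here is a minimal sketch (not Microsoft's or NVIDIA's code), assuming PyTorch with CUDA/NCCL support: it launches one process per local H100 and runs a collective all-reduce. Within a single ND H100 v5 VM, NCCL carries the traffic over NVLink/NVSwitch; when the job spans multiple VMs, it uses the per-GPU Quantum-2 InfiniBand links. The file name and sizes are illustrative.

```python
# allreduce_check.py -- illustrative sketch, not vendor-provided code.
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # 1 GiB of fp32 per GPU; all eight local GPUs (and any remote ranks)
    # participate in the sum.
    x = torch.ones(256 * 1024 * 1024, device="cuda")
    dist.all_reduce(x)

    if dist.get_rank() == 0:
        print(f"all-reduce across {dist.get_world_size()} GPUs completed; "
              f"x[0] = {x[0].item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Run it on one VM with `torchrun --nproc_per_node=8 allreduce_check.py`; adding `--nnodes` and a rendezvous endpoint scales the same script across VMs over the InfiniBand fabric.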

“Co-designing supercomputers with Azure has been crucial for scaling our demanding AI training needs, making our research and alignment work on systems like ChatGPT possible.”—Greg Brockman, President and Co-Founder of OpenAI. 

In a blog post, Microsoft's John Roach disclosed that since 2019, Microsoft and OpenAI have been working in close collaboration to build supercomputing resources for training increasingly powerful AI models. The Azure infrastructure they built included thousands of NVIDIA AI-optimized GPUs linked together in a high-throughput, low-latency network based on NVIDIA Quantum InfiniBand.

https://news.microsoft.com/source/features/ai/how-microsofts-bet-on-azure-unlocked-an-ai-revolution/

https://azure.microsoft.com/en-us/blog/azure-previews-powerful-and-scalable-virtual-machine-to-help-customers-accelerate-ai/