Friday, March 24, 2017

Microsoft's Project Olympus provides an opening for ARM

A key observation from this year's Open Compute Summit is that the hyper-scale cloud vendors are indeed calling the shots in terms of hardware design for their data centres. This extends all the way from the chassis configurations to storage, networking, protocol stacks and now customised silicon.

To recap, Facebook's newly refreshed server line-up now has seven models, each optimised for a different workload: Type 1 (web); Type 2 - flash (database); Type 3 - HDD (database); Type 4 (Hadoop); Type 5 (photos); Type 6 (multi-service); and Type 7 (cold storage). Racks of these servers are populated with a ToR (top-of-rack) switch followed by sleds containing either compute or storage resources.

In comparison, Microsoft, which also delivered a keynote at this year's OCP Summit, is taking a slightly different approach with its Project Olympus universal server. The idea, likewise, is to reduce the cost and complexity of its Azure rollout in hyper-scale data centres around the world, but to do so with a single universal server platform design. Project Olympus uses either a 1 RU or 2 RU chassis plus interchangeable modules that adapt the server to different workloads or electrical inputs. Significantly, it is the first OCP server to support both Intel and ARM-based CPUs.

Not surprisingly, Intel is looking to continue its role as the mainstay CPU supplier for data centre servers. Project Olympus will use the next-generation Intel Xeon processors, code-named Skylake, and with Altera's FPGA technology now in-house, Intel is well placed to supply more silicon accelerators for Azure data centres. Jason Waxman, GM of Intel's Data Center Group, showed off a prototype Project Olympus server integrating Arria 10 FPGAs. Meanwhile, in a keynote presentation, Microsoft Distinguished Engineer Leendert van Doorn confirmed that ARM processors are now part of Project Olympus.

Microsoft showed Olympus versions running Windows Server on Cavium's ThunderX2 and Qualcomm's 10 nm Centriq 2400, which offers 48 cores. AMD is another CPU partner for Olympus, although its upcoming server processor, code-named Naples, is based on AMD's x86 "Zen" architecture rather than ARM. In addition, there are other ARM licensees waiting in the wings with designs aimed at data centres, including MACOM (AppliedMicro's X-Gene 3 processor) and Nephos, a spin-out from MediaTek. For Cavium and Qualcomm, the case for ARM-powered servers comes down to optimised performance for certain workloads, and in OCP Summit presentations, both companies cited web indexing and search as among the first applications Microsoft is using to test their processors.

Project Olympus is also putting forward an OCP design aimed at accelerating AI in Microsoft's next-generation cloud infrastructure. Microsoft, together with NVIDIA and Ingrasys, is proposing a hyper-scale GPU accelerator chassis for AI. The design, code-named HGX-1, packages eight of NVIDIA's latest Pascal GPUs connected via NVIDIA's NVLink technology. NVLink can scale to provide extremely high connectivity between as many as 32 GPUs - conceivably four HGX-1 boxes linked as one. A standardised AI chassis would enable Microsoft to rapidly roll out the same technology to all of its Azure data centres worldwide.
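As a back-of-envelope illustration of the scaling involved, the short sketch below tallies GPU counts and per-GPU NVLink bandwidth for one and four HGX-1 chassis. The link figures (four NVLink links per GPU at 20 GB/s per direction) come from NVIDIA's published Tesla P100 specifications rather than from the HGX-1 announcement, so treat them as assumptions:

```python
# Back-of-envelope NVLink capacity for an HGX-1 style chassis.
# Assumes Tesla P100 NVLink figures: 4 links per GPU at 20 GB/s
# per direction per link (P100 datasheet values, not OCP spec values).

LINKS_PER_GPU = 4
GBPS_PER_LINK_PER_DIR = 20  # GB/s in each direction

def chassis_summary(gpus_per_chassis: int, chassis_count: int) -> None:
    total_gpus = gpus_per_chassis * chassis_count
    per_gpu_bw = LINKS_PER_GPU * GBPS_PER_LINK_PER_DIR * 2  # bidirectional
    print(f"{chassis_count} chassis x {gpus_per_chassis} GPUs = {total_gpus} GPUs, "
          f"{per_gpu_bw} GB/s bidirectional NVLink per GPU")

chassis_summary(8, 1)  # a single HGX-1
chassis_summary(8, 4)  # four linked HGX-1 boxes, 32 GPUs in all
```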

In tests published a few months ago, NVIDIA said its earlier DGX-1 server, which combines Pascal-powered Tesla P100 GPUs with an NVLink implementation, was delivering 170 times the performance of standard Xeon E5 CPUs when running Microsoft's Cognitive Toolkit.

Meanwhile, Intel has introduced the second generation of its Rack Scale Design for OCP. This release brings improvements to the management software for integrating OCP systems into a hyper-scale data centre and adds open APIs to the Snap open-source telemetry framework, so that other partners can contribute to the management of each rack as an integrated system. The concept of easier data centre management was illustrated in an OCP keynote by Yahoo Japan, which delivers an astonishing 62 billion page views per day and remains the most popular website in that nation. The Yahoo Japan presentation focused on an OCP-compliant data centre it operates in the state of Washington, its only overseas facility. The remote data centre is run by a skeleton crew that, thanks to streamlined OCP designs, can perform most hardware maintenance tasks, such as replacing a disk drive, memory module or CPU, in under two minutes.
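For a flavour of what rack-level management through such open APIs looks like, here is a minimal Python sketch that walks a Redfish-style REST endpoint of the kind Intel's Rack Scale Design builds on. The host name and credentials are placeholders, and the exact RSD resource layout may differ:

```python
# Minimal sketch: enumerate systems via a Redfish-style management API.
# The host and credentials below are hypothetical placeholders;
# /redfish/v1/Systems is the standard Redfish systems collection path.
import requests

BASE = "https://rsd-podm.example.com"  # hypothetical management host
AUTH = ("admin", "password")           # placeholder credentials

def list_systems() -> None:
    root = requests.get(f"{BASE}/redfish/v1/Systems",
                        auth=AUTH, verify=False).json()
    for member in root.get("Members", []):
        system = requests.get(BASE + member["@odata.id"],
                              auth=AUTH, verify=False).json()
        print(system.get("Id"), system.get("PowerState"),
              system.get("Status", {}).get("Health"))

if __name__ == "__main__":
    list_systems()
```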

One further note on Intel's OCP efforts concerns its 100 Gbit/s CWDM4 silicon photonics modules, which the company says are now ramping up in shipment volume. These are lower-cost 100 Gbit/s optical interfaces with a reach of up to 2 km, aimed at cross-data-centre connectivity.

On the OCP-compliant storage front, not everything is flash; spinning HDDs are still in play. Seagate recently announced a 12 Tbyte 3.5-inch HDD engineered to accommodate workloads of 550 Tbytes per year. The company claims a mean time between failures (MTBF) of 2.5 million hours, and the drive is designed to operate 24/7 for five years. These 12 Tbyte drives enable a single 42U rack to hold over 10 Pbytes of storage, quite an amazing density considering how much bandwidth would be required to move that volume of data.
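As a quick sanity check on that figure, assume hypothetical dense 4U JBOD enclosures holding roughly 90 drives each (the enclosure format is my assumption, not Seagate's):

```python
# Rough rack-density check for 12 TB drives in a 42U rack.
# The 4U / 90-drive enclosure is an assumed dense JBOD form factor,
# not a figure from the article.

DRIVE_TB = 12
ENCLOSURE_U = 4
DRIVES_PER_ENCLOSURE = 90
RACK_U = 42

enclosures = RACK_U // ENCLOSURE_U           # 10 enclosures fit in 42U
drives = enclosures * DRIVES_PER_ENCLOSURE   # 900 drives
capacity_pb = drives * DRIVE_TB / 1000       # 10.8 PB

print(f"{enclosures} enclosures, {drives} drives, {capacity_pb:.1f} PB per rack")
```

Ten such enclosures holding 900 drives comes to 10.8 Pbytes, consistent with the "over 10 Pbytes" claim.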


Google did not make a keynote appearance at this year's OCP Summit, but it had its own event underway in nearby San Francisco. The Google Cloud Next event gave the company an even bigger stage on which to present its vision for cloud services and the infrastructure needed to support them.