A key observation from this year's
Open Compute Summit is that the hyper-scale cloud vendors are indeed calling the shots on hardware design for their data centres. This extends all the way from chassis configurations to storage, networking, protocol stacks and now customised silicon.
To recap, Facebook's newly
refreshed server line-up now has 7 models, each optimised for different
workloads: Type 1 (Web); Type 2 (flash database); Type 3 (HDD database); Type 4 (Hadoop); Type 5 (photos); Type 6 (multi-service); and Type 7 (cold storage). Racks of these servers are populated with a ToR switch followed by sleds carrying either compute or storage resources.
In comparison, Microsoft,
which was also a keynote presenter at this year's OCP Summit, is taking a
slightly different approach with its Project Olympus universal server. Here the
idea is also to reduce the cost and complexity of its Azure rollout in hyper-scale data centres around the world, but to do so using a universal server platform design. Project Olympus uses either a 1 RU or 2 RU chassis and a set of modules for adapting the server to different workloads or electrical inputs. Significantly,
it is the first OCP server to support both Intel and ARM-based CPUs.
Not surprisingly, Intel
is looking to continue its role as the mainstay CPU supplier for data centre
servers. Project Olympus will use Intel's next-generation Xeon processors, code-named Skylake, and with FPGA capability now in-house following the Altera acquisition, Intel is well placed to supply more silicon accelerators for Azure data centres. Jason Waxman, GM of
Intel's Data Center Group, showed off a prototype Project Olympus server
integrating Arria 10 FPGAs. Meanwhile, in a keynote presentation, Microsoft
Distinguished Engineer Leendert van Doorn confirmed that ARM processors are now
part of Project Olympus.
Microsoft showed Olympus versions
running Windows Server on Cavium's
ThunderX2 and Qualcomm's 10 nm Centriq 2400, which offers 48 cores. AMD is another CPU partner for Olympus, contributing its x86 "Zen"-based server processor, code-named Naples. In addition, there are other ARM licensees
waiting in the wings with designs aimed at data centres, including MACOM
(AppliedMicro's X-Gene 3 processor) and Nephos, a spin-out from MediaTek. For
Cavium and Qualcomm, the case for ARM-powered servers comes down to optimised
performance for certain workloads, and in OCP Summit presentations, both companies cited web indexing and search as among the first applications Microsoft is using to test their processors.
Through Project Olympus, Microsoft is also putting forward an OCP design aimed at accelerating AI in its next-gen cloud infrastructure. Together with NVIDIA and Ingrasys, Microsoft is proposing a hyper-scale GPU accelerator chassis for AI. The design, code-named HGX-1, will package eight of NVIDIA's latest Pascal GPUs connected via NVIDIA's NVLink technology. NVLink can scale to interconnect as many as 32 GPUs, conceivably four HGX-1 boxes linked as one. A standardised AI chassis would enable Microsoft to rapidly roll out the same technology to all of its Azure data centres worldwide.
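On such a chassis, the NVLink fabric appears to software as direct GPU peer-to-peer access. As a rough, hypothetical illustration (this is not Microsoft's or NVIDIA's own tooling), a few lines of PyTorch can enumerate the GPUs on a node and report which pairs can address each other directly:

    # Hypothetical sketch: list the GPUs visible on a node and report which
    # pairs support direct peer-to-peer access, the capability NVLink
    # provides between the eight GPUs of an HGX-1-class chassis.
    import torch

    n = torch.cuda.device_count()  # would report 8 on a fully populated box
    for i in range(n):
        peers = [j for j in range(n)
                 if j != i and torch.cuda.can_device_access_peer(i, j)]
        print(f"GPU {i} ({torch.cuda.get_device_name(i)}): peers {peers}")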
In tests published a few months ago, NVIDIA said its earlier DGX-1 server, which uses Pascal-powered Tesla P100 GPUs and an NVLink implementation, was delivering 170 times the performance of standard Xeon E5 CPUs when running Microsoft's Cognitive Toolkit.
Meanwhile, Intel has introduced the
second generation of its Rack Scale Design for OCP. This brings improvements in
the management software for integrating OCP systems in a hyper-scale data centre
and also adds open APIs to the Snap open-source telemetry framework so that other partners can contribute to the management of each rack as an integrated system (a sketch of this style of rack-level API appears after this paragraph). This concept of easier data centre management was illustrated in an OCP
keynote by Yahoo Japan, which amazingly delivers 62 billion page views per day
to its users and remains the most popular website in that nation. The Yahoo
Japan presentation focused on an OCP-compliant data centre it operates in the
state of Washington, its only overseas data centre. The remote facility is staffed by only a skeleton crew which, thanks to streamlined OCP designs, can perform most hardware maintenance tasks, such as replacing a disk drive, memory module or CPU, in under two minutes.
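Rack Scale Design's management interfaces are Redfish-conformant REST APIs, which is what makes this style of lights-out operation scriptable. As a minimal sketch, assuming a placeholder endpoint and credentials rather than any real deployment, the chassis inventory such an API exposes can be walked with a few lines of Python:

    # Hypothetical sketch: list the chassis resources behind a Redfish-style
    # rack-management endpoint. The URL and credentials are placeholders.
    import requests

    BASE = "https://rack-manager.example.net/redfish/v1"  # placeholder

    with requests.Session() as s:
        s.auth = ("admin", "password")  # placeholder credentials
        resp = s.get(f"{BASE}/Chassis", timeout=10)
        resp.raise_for_status()
        for member in resp.json().get("Members", []):
            print(member["@odata.id"])  # e.g. /redfish/v1/Chassis/Rack1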
One further note on Intel’s OCP
efforts relates to its 100 Gbit/s CWDM4 silicon photonics modules, which it
states are now ramping in shipment volume. These are lower-cost 100 Gbit/s optical interfaces that reach up to 2 km, making them suited to cross-data-centre connectivity.
On the OCP-compliant storage front, not everything is flash; spinning HDDs are still in play. Seagate has recently announced a 12 Tbyte 3.5-inch HDD engineered to accommodate workloads of 550 Tbytes annually. The company claims an MTBF (mean time between failures) of 2.5 million hours, and the drive is designed to operate 24/7 for five years. These 12 Tbyte drives enable a single 42 U rack to hold over 10 Pbytes of storage, quite an amazing density considering how much bandwidth would be required to move this volume of data.
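The 10 Pbyte figure is easy to sanity-check. Assuming, purely for illustration, dense top-loading JBOD enclosures of roughly 90 drives per 4 U, the back-of-envelope arithmetic in Python looks like this:

    # Back-of-envelope check of the 10+ Pbyte per rack claim. The 90-drive
    # 4 U JBOD figure is an assumption used for illustration only.
    DRIVE_TB = 12          # capacity of Seagate's new drive
    DRIVES_PER_4U = 90     # assumed dense top-loading JBOD enclosure
    ENCLOSURES = 42 // 4   # ten 4 U enclosures fit in a 42 U rack

    total_tb = DRIVE_TB * DRIVES_PER_4U * ENCLOSURES
    print(f"~{total_tb / 1000:.1f} Pbytes per rack")  # ~10.8 Pbytes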
Google did not make a keynote
appearance at this year’s OCP Summit, but had its own event underway in nearby
San Francisco. The Google Cloud Next event gave the company an even bigger
stage on which to present its vision for cloud services and the infrastructure needed to support them.