WAN economics to date have not made Google happy, said Urs Hoelzle, SVP of Technical Infrastructure and Google Fellow, speaking at the Open Networking Summit 2012 in Santa Clara, California. Ideally, the cost per bit should fall as the network scales, but that has not held true for a backbone as massive as Google's, which demands increasingly expensive hardware and manual management of very complex software. The goal, he said, should be to manage the WAN as a fabric rather than as a collection of individual boxes, something current equipment and protocols do not allow. Google's ambition is to build a WAN that is higher performance, more fault tolerant and cheaper.
Some notes from his presentation:
- Google currently operates two WAN backbones. I-Scale is the Internet-facing backbone that carries user traffic and must have bulletproof performance. G-Scale is the internal backbone that carries traffic between Google's data centers worldwide. The G-Scale network has been used to experiment with SDN.
- Google chose to pursue SDN in order to separate hardware from software. This enables it to choose hardware based on necessary features and to choose software based on protocol requirements.
- SDN provides logically centralized network control. The goal is to be more deterministic, more efficient and more fault-tolerant.
- SDN enables better centralized traffic engineering, such as the ability for the network to converge quickly to a target optimum after a link failure (a toy sketch of this idea appears after these notes).
- Deterministic behavior should simplify capacity planning compared with over-provisioning for worst-case variability (an illustrative utilization comparison appears after these notes).
- The SDN controller uses modern server hardware, giving it more flexibility than conventional routers.
- For testing, switches can be virtualized while running the real OpenFlow stack, and real monitoring and alerting servers can be attached. Testing is vastly simplified.
- The move to SDN is really about picking the right tool for the right job.
- Google's OpenFlow WAN activity really started moving in 2010. Less than two years later, Google is now running the G-Scale network on OpenFlow-controlled switches. 100% of its production data center to data center traffic is now on this new SDN-powered network.
- Google built its own OpenFlow switch because none were commercially available. The switch was built from merchant silicon and scales to hundreds of nonblocking 10GE ports.
- Google's practice is to simplify every software stack and hardware element as much as possible, removing anything that is not absolutely necessary.
- Multiple switch chassis are used in each domain.
- Google is using open-source routing stacks for BGP and IS-IS.
- The OpenFlow-controlled switches look like regular routers. BGP, IS-IS and OSPF now interface with the OpenFlow controller to program the switch state (a simplified translation sketch appears after these notes).
- All data center backbone traffic is now carried by the new network. The old network is turned off.
- Google started rolling out centralized traffic engineering in January.
- Google is already seeing higher network utilization and gaining the benefit of flexible management of end-to-end paths for maintenance.
- Over the past six months, the new network has seen a high degree of stability with minimal outages.
- The new SDN-powered network is meeting the company's SLAs.
- It is still too early to quantify the economics.
- A key benefit is the unified view of the network fabric: higher QoS awareness and predictability.
- The OpenFlow protocol is really barebones at this point, but it is good enough for real world networks at Google scale.
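The centralized traffic-engineering point lends itself to a small illustration. The Python sketch below is a toy, not Google's system: the topology, capacities, demands and the greedy path-splitting heuristic are all assumptions. It only shows how a controller that holds the full topology and demand matrix can recompute every path assignment in one pass when a link fails, rather than waiting for a distributed protocol to reconverge hop by hop.

```python
"""Toy sketch of centralized traffic engineering (not Google's actual system).
All sites, capacities and demands below are hypothetical."""


def simple_paths(src, dst, links, path=None):
    """Enumerate loop-free paths from src to dst (fine for a toy graph)."""
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for (u, v) in links:
        if u == src and v not in path:
            yield from simple_paths(v, dst, links, path)


def place_demands(links, demands):
    """Greedily split each demand across its shortest available paths."""
    residual = dict(links)
    placement = {}
    for (src, dst), want in demands.items():
        left = want
        for p in sorted(simple_paths(src, dst, links), key=len):
            hops = list(zip(p, p[1:]))
            take = min(min(residual[h] for h in hops), left)
            if take > 0:
                placement[(src, dst, tuple(p))] = take
                for h in hops:
                    residual[h] -= take
                left -= take
            if left == 0:
                break
    return placement


if __name__ == "__main__":
    # Directed links with capacities in Gb/s (hypothetical).
    links = {
        ("A", "B"): 100, ("B", "A"): 100,
        ("B", "C"): 100, ("C", "B"): 100,
        ("A", "C"): 100, ("C", "A"): 100,
    }
    # Site-to-site demands in Gb/s (hypothetical).
    demands = {("A", "C"): 120, ("A", "B"): 30}

    print("before failure:", place_demands(links, demands))
    # Fail the direct A->C link; the controller recomputes globally.
    degraded = {l: c for l, c in links.items() if l != ("A", "C")}
    print("after failure: ", place_demands(degraded, demands))
```

A production controller would replace the greedy heuristic with a proper optimization, but the structural point is the global recomputation from a single vantage point.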
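The remark about deterministic planning versus over-provisioning can also be made concrete with arithmetic. The numbers below are purely illustrative and do not come from the talk; they only show why provisioning for worst-case variability keeps average utilization low, while centrally planned traffic can run links closer to capacity.

```python
# Illustrative only: none of these numbers come from the talk.
# With distributed, non-deterministic control you provision for worst-case
# bursts, so average utilization = mean traffic / provisioned capacity.
mean_traffic_gbps = 40
worst_case_headroom = 2.5        # provision 2.5x the mean for variability
provisioned = mean_traffic_gbps * worst_case_headroom
print(f"over-provisioned utilization: {mean_traffic_gbps / provisioned:.0%}")   # 40%

# With deterministic, centrally scheduled traffic the controller knows the
# demand matrix, so the same link can be planned much closer to capacity.
planned_headroom = 1.1           # keep ~10% slack for failures/maintenance
planned = mean_traffic_gbps * planned_headroom
print(f"centrally planned utilization: {mean_traffic_gbps / planned:.0%}")      # 91%
```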
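Finally, the note about BGP/IS-IS/OSPF feeding an OpenFlow controller can be pictured as a translation step from routing state to flow entries. The sketch below is a stand-in, not Google's software and not a real OpenFlow library: the RIB, the port map and the FlowRule class are hypothetical, and the code only shows the shape of the translation, with longer prefixes mapped to higher-priority rules.

```python
"""Sketch of routing protocols programming switch state via an OpenFlow
controller. Every name below is a hypothetical stand-in."""
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class FlowRule:
    """Simplified OpenFlow-style entry: match a destination prefix, forward out a port."""
    match_dst: str
    out_port: int
    priority: int


# Routes as a BGP/IS-IS stack might hand them over: prefix -> next hop.
RIB = {
    "10.1.0.0/16": "peer-nyc",
    "10.2.0.0/16": "peer-ams",
    "0.0.0.0/0":   "peer-nyc",
}

# Controller's view of which switch port reaches each next hop (hypothetical).
NEXT_HOP_PORT = {"peer-nyc": 1, "peer-ams": 2}


def rib_to_flow_table(rib, port_map):
    """Compile RIB entries into flow rules; longer prefixes get higher
    priority so the switch approximates longest-prefix matching."""
    rules = []
    for prefix, next_hop in rib.items():
        rules.append(FlowRule(match_dst=prefix,
                              out_port=port_map[next_hop],
                              priority=ip_network(prefix).prefixlen))
    return sorted(rules, key=lambda r: r.priority, reverse=True)


if __name__ == "__main__":
    for rule in rib_to_flow_table(RIB, NEXT_HOP_PORT):
        print(rule)
```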
http://www.opennetsummit.org/