On the evening of March 8th, AT&T
experienced a widespread outage of 911 emergency calling service for mobile
customers across a significant portion of the U.S. Some callers were simply
unable to reach an emergency operator. Media reports suggested the outage affected at least 14 states and Washington, DC. AT&T Mobility confirmed that service interruptions had prevented callers from reaching 911 emergency centres, but did not disclose the extent of the problem or its cause.
During the outage, which lasted several hours, FCC chairman Ajit Pai reached out directly to Randall Stephenson, AT&T's CEO, and took to Twitter to express his alarm at the situation. The FCC has since launched an investigation to track down the root
cause of the outage. Until the report is published, it is not known which
systems failed, whether the issue cascaded from one facility to the next, how
quickly the problem was detected, or how the network recovered.
AT&T is not the only major U.S.
carrier in the news for 911 connectivity trouble. Another major story concerns
911 'ghost' calls from T-Mobile subscribers in the Dallas area. The alarming situation, as reported by The Washington Post, has meant that T-Mobile users calling 911 have been placed on hold for extended periods of time. At one point in March, 442 callers in Dallas reportedly were placed on hold for an average of 38 minutes. The technical fault in the city's 911 centre is being blamed for the deaths of at least two people. Worse, the problem apparently has happened before, perhaps dating back several months, and there has not been a sufficient effort to fix it. Whether the faulty equipment is ultimately found to be in the city's emergency response centre, in the carrier network, or in some interface between the two, the result is that the public has been placed in danger by diminished networking standards.
For big public cloud providers,
recent months have not been great for reliability
On February 28th, Amazon Web
Services suffered a widespread outage of its S3 web-based storage service. The anomaly involved 'high error rates' with S3 in the US-EAST-1 region, which brought down many high-visibility websites, including Business Insider, Quora, Slack and others. While other S3 regions were not impacted, the number of websites now
relying on AWS infrastructure is remarkably high and rising. In fact,
SimilarTech.com calculates that 165,344 websites and 137,396 unique domains are
now running on AWS S3. On the positive side, Amazon publishes up-to-the-minute
information on service availability worldwide. The company is also quite
responsive in posting technical updates as the service is being restored. Rather
than waiting months for a fault-finding report, AWS has posted technical
assessments within hours of resolution. For this latest S3 outage, the blame was attributed to human error: an S3 team member, following an established playbook, executed a command intended to remove a small number of servers from one of the S3 subsystems used by the S3 billing process. The command had unintended consequences, removing more capacity than intended. A full restart of the affected subsystems was required, resulting in hours of downtime for customers, many of whom run real-time, mission-critical applications.
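The incident illustrates why destructive operational commands increasingly sit behind guardrails. As a purely illustrative sketch, not a description of Amazon's actual tooling, the snippet below shows how a capacity-removal routine might refuse a request that would take a subsystem below a minimum server count; the names (remove_capacity, MIN_SERVERS, Subsystem) are hypothetical.

```python
# Illustrative sketch only: a guardrail for a capacity-removal command.
# The names below (Subsystem, MIN_SERVERS, remove_capacity) are hypothetical
# and do not correspond to any real AWS tool.

MIN_SERVERS = {"index": 20, "placement": 10, "billing": 5}  # assumed floors


class Subsystem:
    def __init__(self, name, servers):
        self.name = name
        self.servers = list(servers)


def remove_capacity(subsystem, count):
    """Remove `count` servers, refusing to drop below the configured floor."""
    floor = MIN_SERVERS.get(subsystem.name, 1)
    remaining = len(subsystem.servers) - count
    if remaining < floor:
        raise ValueError(
            f"Refusing to remove {count} servers from '{subsystem.name}': "
            f"only {remaining} would remain, below the minimum of {floor}."
        )
    removed, subsystem.servers = subsystem.servers[:count], subsystem.servers[count:]
    return removed


# A mistyped count (say 48 instead of 8) is rejected rather than executed.
billing = Subsystem("billing", [f"srv-{i}" for i in range(50)])
try:
    remove_capacity(billing, 48)
except ValueError as err:
    print(err)
```

Amazon's own post-incident summary described adding protections along these lines, including removing capacity more slowly and blocking removals that would take a subsystem below its minimum required capacity.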
Meanwhile, on March 16th Microsoft
Azure experienced a storage incident that disrupted services in 26 of the
public cloud’s 28 regions. The disruption has since been characterised as two separate incidents: the first had a global impact, while the second was confined to the US East region. In September, Azure had experienced a different sort of problem, a DNS error that impacted many of its cloud services worldwide for several hours.
None of the outages cited above
appear to have been caused by malicious intent, which is of course a prime
concern for network reliability, especially given that DDoS attacks continue to
grow in size and sophistication. For instance, the October 2016 Mirai botnet
attack on Dyn's DNS infrastructure reportedly involved tens of millions of
discrete IP addresses from IoT devices.
Are capex budgets sufficient for
maintaining five-nines?
For decades, the expectation has
been that emergency calls would always get connected, even on Mother's Day,
when traffic volumes spike to the highest levels of the year, or if key
equipment were to fail. Five-nines (99.999%) reliability translates to system downtime of no more than about 5.26 minutes per year, and the standard was achieved and maintained through excellence in engineering and in management; many aspired to six-nines (99.9999%) reliability, the equivalent of roughly 31.5 seconds of downtime per year.
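These downtime budgets follow directly from the availability figures. The short calculation below is a generic sketch, not anything carrier-specific, converting an availability target into an annual downtime allowance:

```python
# Convert an availability target into an annual downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def annual_downtime_minutes(availability):
    """Allowed downtime per year, in minutes, for a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


print(annual_downtime_minutes(0.99999))        # five nines -> ~5.26 minutes
print(annual_downtime_minutes(0.999999) * 60)  # six nines  -> ~31.5 seconds
```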
AT&T is currently undergoing a
historic transformation to a virtualised network architecture, and the company
talks about its Network 3.0 cloud-centric vision as a guiding force for itself
and the rest of the telecom industry. Earlier this month, AT&T stated that it has already converted 34% of its network functions to SDN and is on track to reach 75% by 2020. The interim virtualisation goal for year-end 2017 is 55%. It
is unclear if or when the 911 connectivity systems will become part of this
transformation.
One of the touted benefits of the new virtualised architecture is rapid, easy fail-over. Pods of generic x86-based servers should provide better than 1-to-1 redundancy, in contrast to the closed, purpose-built legacy systems. On the other hand, every component of the traditional systems was designed for high availability.
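A rough back-of-the-envelope calculation, using availability figures assumed purely for illustration, shows why pooled redundancy can compete with purpose-built hardware: if each commodity server in a pod is independently available 99.9% of the time, the probability that every server in even a small pod is down at once becomes vanishingly small.

```python
# Rough illustration with assumed figures: availability of a pod of N
# redundant commodity servers, where the service survives as long as at
# least one server is up and failures are treated as independent.


def pod_availability(per_server_availability, n_servers):
    """Probability that at least one of n independent servers is up."""
    return 1 - (1 - per_server_availability) ** n_servers


commodity = 0.999  # assumed availability of a single x86 server (three nines)
for n in (1, 2, 3):
    print(f"{n} server(s): {pod_availability(commodity, n):.9f}")
# 1 server(s): 0.999000000  (three nines)
# 2 server(s): 0.999999000  (about six nines)
# 3 server(s): 0.999999999
```

The obvious caveat is that real failures are rarely independent: a fault in shared software, power, or orchestration can take down an entire pod at once, which is precisely the kind of correlated failure the regional cloud outages above demonstrate.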
One question the FCC investigation may address is whether sufficient capex is being dedicated to
maintaining the legacy systems until the new architecture is fully deployed and
proven to be equally reliable. Last summer, the Communications Workers of
America (CWA) reached out to regulators in New York, New Jersey, Maryland,
Delaware, Pennsylvania, Virginia, and Washington, DC, arguing that Verizon had been under-investing in its copper access network since at least 2008. The complaint alleged that Verizon's spending on its Fios fibre infrastructure came at the expense of maintenance for its ageing copper networks, which still serve some 8 million customers, to whom the company still has a statutory
obligation to provide safe and reliable service.
Public cloud providers have no such
regulatory requirements to achieve five-nines, but they do maintain service
level agreements with their customers. Hour-long outages are costly in themselves, and the competitive damage can be costlier still. In the future, as billions of devices come online, from self-driving cars and delivery drones in flight to in-home medical equipment, the need for always-on networking will be more acute than ever.