Monday, July 22, 2024

Network configuration error blamed for AT&T wireless outage

The Federal Communications Commission (FCC) has issued a comprehensive report detailing the causes and impacts of a nationwide AT&T wireless service outage on February 22, 2024. The outage, which lasted over 12 hours, prevented customers from using voice and data services, blocking more than 92 million phone calls and over 25,000 attempts to reach 911. The report also includes recommendations to prevent similar incidents in the future.

FCC Chairwoman Jessica Rosenworcel emphasized the gravity of the situation, stating, “When you sign up for wireless service, you expect it to be available when you need it – especially for emergencies.” The outage not only disrupted consumer communications but also affected public safety personnel using FirstNet. The FCC’s Public Safety and Homeland Security Bureau promptly launched an investigation, revealing several key findings about the outage’s extensive impact and the subsequent corrective actions taken by AT&T.

The outage affected users in all 50 states, Washington, D.C., Puerto Rico, and the U.S. Virgin Islands, impacting over 125 million devices. During the critical early hours of the outage, all 4G voice and 5G data services were unavailable to AT&T customers, including FirstNet subscribers. FirstNet was particularly affected: its device registrations approached normal levels only after AT&T restored the dedicated network elements that connect FirstNet to the carrier’s broader network.

The FCC report highlights the technical cause of the outage, pinpointing two key steps that led to the network configuration error. An incorrect configuration was first made by an AT&T employee, followed by another employee loading this erroneous network change. This sequence revealed insufficient oversight and controls within AT&T’s processes, allowing the misconfiguration to propagate. Additionally, the network’s inability to handle the sudden influx of re-registration requests once the error was corrected prolonged the outage significantly.
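
As a rough illustration of why the re-registration surge prolonged the outage, the sketch below models the backlog as a queue drained at a fixed rate. The 125 million device figure comes from the report as summarized above; the function name and the per-minute capacity values are hypothetical, chosen only for illustration, and are not figures from AT&T or the FCC.

```python
import math


def minutes_to_clear_backlog(devices: int, registrations_per_minute: int) -> int:
    """Return the minutes needed to re-register every device at a fixed rate."""
    if devices < 0:
        raise ValueError("devices must be non-negative")
    if registrations_per_minute <= 0:
        raise ValueError("registrations_per_minute must be positive")
    # Round up: a partially used final minute still counts as a minute.
    return math.ceil(devices / registrations_per_minute)


AFFECTED_DEVICES = 125_000_000  # "over 125 million devices", per the article

# Hypothetical registration capacities, NOT taken from the FCC report.
for capacity in (250_000, 500_000, 1_000_000):
    minutes = minutes_to_clear_backlog(AFFECTED_DEVICES, capacity)
    print(f"{capacity:>9,} registrations/min -> about {minutes / 60:.1f} hours")
```

In this simplified model, doubling the registration capacity halves the recovery time. It ignores retries and failed attempts, which would likely make real congestion worse than the sketch suggests.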

Key Points:

  • The outage blocked over 92 million voice calls and prevented more than 25,000 calls to 911.
  • AT&T prioritized restoring FirstNet services but delayed notifying FirstNet users about the outage.
  • The outage was caused by a network configuration error that insufficient oversight and controls failed to catch.
  • The downstream network element lacked controls to mitigate the error, triggering Protection Mode and disconnecting all users.
  • System limitations caused registration congestion, prolonging the outage even after the initial error was corrected.
  • AT&T has since implemented additional technical controls, carried out forensic work, and adopted peer review procedures to prevent similar issues.

Corrective Actions by AT&T:

  • Implemented additional technical controls within 48 hours of the outage.
  • Scanned and updated network elements to prevent similar errors.
  • Enhanced network robustness and resilience through ongoing forensic work.
  • Adopted new procedures ensuring maintenance work requires completed peer reviews.
  • Improved registration systems to handle higher capacity and quicker recovery from Protection Mode.

The full FCC report can be accessed here.