So it was not an April Fool's joke. Microsoft has just disclosed the root cause of the recent Azure outage, which lasted about an hour and was caused by a flood of Domain Name System (DNS) queries combined with a code defect.
As a reminder, users reported that the Azure portal, Azure services, Dynamics 365, and Xbox Live were inaccessible during a global outage on April 1. The American giant states in its incident report that most services were restored by 22:30 UTC (0:30, Paris time). The outage was already known to involve DNS, but the company's final root cause analysis, released on Sunday, sheds new light on the cause: an unprecedented code failure in its DNS service, compounded by excessive retries from DNS clients.
“Azure DNS servers have seen an abnormal increase in DNS queries from around the world targeting a set of domains hosted on Azure,” Microsoft says. “Normally Azure’s caching and traffic shaping layers should mitigate this push. In this incident, a specific sequence of events exposed a code defect in our DNS service that reduced the efficiency of our DNS Edge caches.”
Drowned in DNS queries
Microsoft’s DNS service was overwhelmed when DNS clients retried their requests, which put additional pressure on the service. Microsoft notes that retries from DNS clients are considered legitimate DNS traffic, so this traffic was not dropped by its volumetric mitigation systems, and the availability of its DNS service fell in several regions.
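To illustrate why retries weigh so heavily on a service that is already saturated, here is a toy model in Python. It is only a sketch and not Microsoft's code: the function name `simulate_retry_load`, its parameters, and the figures used are all hypothetical, and the model simply assumes that every unanswered query is retried once on the next tick, up to a fixed number of retries.

```python
def simulate_retry_load(base_qps, capacity_qps, max_retries, ticks):
    """Return the offered load (queries per tick) seen by a server over time.

    Hypothetical model: each tick brings base_qps fresh queries, the server
    answers at most capacity_qps of everything it receives, and every
    unanswered query is retried on the next tick until it has been retried
    max_retries times, after which the client gives up.
    """
    # attempts[k] = queries making their (k+1)-th attempt during this tick
    attempts = [float(base_qps)] + [0.0] * max_retries
    history = []
    for _ in range(ticks):
        offered = sum(attempts)
        history.append(offered)
        # Fraction of queries left unanswered because the server is saturated
        if offered > capacity_qps:
            fail_fraction = 1.0 - capacity_qps / offered
        else:
            fail_fraction = 0.0
        # Fresh queries arrive; failed queries move on to their next attempt.
        # Queries that fail on their final attempt are dropped.
        attempts = [float(base_qps)] + [a * fail_fraction for a in attempts[:-1]]
    return history


if __name__ == "__main__":
    # Invented numbers: 120 fresh queries per tick against a capacity of 100
    for tick, load in enumerate(simulate_retry_load(120, 100, 3, 6)):
        print(f"tick {tick}: offered load = {load:.1f}")
```

In this model, a server that is only slightly over capacity sees its offered load climb tick after tick, because each retry is indistinguishable from a fresh, legitimate query. That is the kind of traffic the article says the volumetric mitigation systems let through.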
Microsoft says it mitigated the problem by updating the logic of its volumetric spike mitigation system to protect the DNS service against excessive client retries. The Redmond firm apologized to affected customers and explained that it has fixed the code defect so that all requests can be handled efficiently in cache. It has also improved the automatic detection and mitigation of abnormal traffic patterns.
This latest outage was not as long as the one that hit Azure in mid-March. That outage lasted 14 hours and was attributed to an error in the rotation of the keys used to support OpenID in Azure AD.
Source: ZDNet.com