DNS issue on non-enterprise CDN

Incident Report for Netlify

Postmortem

We attempted to double our CDN capacity so that the network could naturally absorb more requests and attacks without impacting customers. A bug in our configuration management system resulted in a bad configuration being pushed into the monitoring system. This caused us to remove most of our nodes on the regular ADN from the DNS record that serves that ADN. This was reverted and service was restored by 16:10UTC.
Since these incidents, we have revamped the runbooks and deployment/configuration scripts to be more detailed in the mitigation procedures and added detailed validation to the monitoring system's configuration. We have also doubled the ADN capacity to be better prepared for future high load situations.

Posted Jan 16, 2020 - 22:06 UTC

Resolved

This incident has been resolved.

Posted Jan 10, 2020 - 16:43 UTC

Monitoring

We deployed a change which misconfigured DNS for our non-enteprise CDN for 5 minutes from 16:06 UTC to 16:11 UTC, causing an outage on all sites on that CDN. We repaired the config immediately and do not expect recurrence.

Posted Jan 10, 2020 - 16:42 UTC