Increased errors and latency affecting multiple services

Incident Report for Netlify

Postmortem

Summary

On March 9th, 2022, over two hours and thirty-six minutes between 12:45 and 15:21 UTC, some customers experienced failed or slow requests to our platform, longer build times, or API failures. The impact was intermittent and lasted about two hours in total during that window. During this time, customers could have experienced issues with their sites serving correctly. When cached content was unavailable, our service needed to reach back to our databases to update the cache and serve the new content. Because the root issue of the incident was database latency, in some cases we were unable to refresh the cache in a timely manner. For content that was not cached, such as dynamically generated content, content from new deploys, and password-protected content, these errors and latency were observed as failed requests affecting your visitors.
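
For readers who want a concrete picture of the cache behavior described above, here is a minimal sketch of a read-through cache with a request deadline. It is an illustration only and does not represent Netlify's actual implementation: the class name `ReadThroughCache`, the helper `simulated_db`, and the timeout values are all assumptions made for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def simulated_db(delay_s):
    """Hypothetical stand-in for a database that takes `delay_s` seconds per lookup."""
    def fetch(key):
        time.sleep(delay_s)
        return f"content for {key}"
    return fetch


class ReadThroughCache:
    """Illustrative read-through cache; names and structure are assumptions."""

    def __init__(self, fetch_from_db, timeout_s):
        self.fetch_from_db = fetch_from_db  # assumed: callable(key) -> content
        self.timeout_s = timeout_s          # assumed per-request deadline
        self.store = {}                     # in-memory cache
        self.pool = ThreadPoolExecutor(max_workers=4)

    def get(self, key):
        # Cache hit: served without touching the database at all.
        if key in self.store:
            return self.store[key]
        # Cache miss: reach back to the database, but only wait up to the deadline.
        future = self.pool.submit(self.fetch_from_db, key)
        try:
            value = future.result(timeout=self.timeout_s)
        except FutureTimeout as exc:
            # With a slow database, uncached content becomes a failed request.
            raise RuntimeError(f"timed out refreshing {key!r}") from exc
        self.store[key] = value
        return value
```

In this sketch, a key that is already in `store` keeps being served even while the simulated database is slow, whereas a key that has never been cached (for example, content from a new deploy) fails once the lookup exceeds the deadline.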

Summary of impact:

Periods of failed or slow web requests, increased API request latency and errors, and unavailable password-protected sites:

12:45 - 12:55 UTC
13:10 - 14:30 UTC
14:46 - 15:21 UTC
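
As a quick check of how the three windows above relate to the overall incident duration, the arithmetic can be sketched as follows. The variable and function names are chosen for this example only.

```python
from datetime import datetime

# Impact windows as listed in this report (all times UTC, same day).
windows = [("12:45", "12:55"), ("13:10", "14:30"), ("14:46", "15:21")]


def minutes_between(start, end):
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Sum of the individual windows: 10 + 80 + 35 = 125 minutes (about two hours).
impact_minutes = sum(minutes_between(s, e) for s, e in windows)

# First start to last end: 156 minutes (two hours and thirty-six minutes).
span_minutes = minutes_between(windows[0][0], windows[-1][1])

print(impact_minutes, span_minutes)
```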

We’re genuinely sorry for the impact on our customers and everyone who relies on them. We want to provide the best service possible and we take any service disruption seriously. Our vision is to build a better web, and we strive to provide world-class service at every tier. Below we will provide more insight into the specifics of what occurred, as well as the measures we’re already taking to mitigate the risk of future incidents. We understand the serious nature of this event and we are committed to sharing any new information we uncover as we learn more.

Impact and Resolution Steps

On March 9th, beginning at 12:45 UTC, we encountered an issue with our production databases following planned system maintenance that was not expected to affect performance. After the maintenance concluded, we encountered connectivity issues with our databases, which impacted requests served by the API as well as standard customer builds.

Timeline

[2022-03-09 09:00 - 01:00 UTC]: Planned maintenance conducted

[2022-03-09 12:45 UTC]: Latency alerts first occur indicating high latency on the API and database. The team begins investigating

[2022-03-09 12:51 UTC]: Latencies resolved

[2022-03-09 13:02 UTC]: Scheduled maintenance status updated to complete

[2022-03-09 13:10 UTC]: Latencies on database connections increased again, impacting the API, build systems, and the serving of uncached content

[2022-03-09 15:21 UTC]: Latencies resolved

[2022-03-09 15:43 UTC]: The source of the latency was identified and we began developing a mitigation

[2022-03-09 16:07 UTC]: Impact is mitigated following code change to resolve database connection issues

[2022-03-09 16:47 UTC]: Additional improvements are made to our database to improve performance

[2022-03-09 16:54 UTC]: Status updated to Monitoring after mitigations were applied and our monitoring indicated that requests and builds were operational

[2022-03-09 17:34 UTC]: Incident declared resolved

Next Steps

We are conducting an in-depth analysis of the processes and practices we followed during this incident. Additionally, we will review the testing procedures conducted prior to starting the operation. This RCA represents our initial response immediately following the outage.
We are preparing new runbooks to debug connections to our database systems. This work will be done by March 31, 2022.
We will evaluate options to improve request handling and provide redundancy.
We will review our incident practices to improve our response approach and reduce time to resolve. This work will be done by March 31, 2022.
We will review and update our maintenance and incident communication runbooks to ensure they effectively reflect our commitment to providing timely, accurate, and transparent communication for incident handling. This work will be done by March 31, 2022.
We continue to invest in foundational changes to improve the scalability and resiliency of our platform. We’ll continue to update our community forum as we make progress toward this goal.
We have engaged our external database consultants to help us continue to debug and understand the underlying cause of the latency.

Posted Mar 10, 2022 - 18:28 UTC

Resolved

Service has been stable for over an hour now. The team continues to monitor closely, and we will post a public write-up about the incident as soon as we can. If any further large-scale issues occur, we will open a new incident on our status page.

Posted Mar 09, 2022 - 17:34 UTC

Monitoring

We have not observed further substantial errors in the past 45 minutes. Our team continues to monitor closely and work on additional mitigations. We will be writing a public root cause analysis describing what led to the issue and how we've resolved it.

Posted Mar 09, 2022 - 16:53 UTC

Update

We have been stable (few errors, but latency still higher than usual) since our last status update 30 minutes ago, but the team continues to work on the situation and add new mitigations. We will continue to provide updates as the situation develops.

Posted Mar 09, 2022 - 16:20 UTC

Update

The team continues to work to stabilize and resolve the service degradation. Customers may see intermittent errors and latency affecting customer sites, our API, and UI.

Posted Mar 09, 2022 - 15:43 UTC

Identified

Our mitigation was not fully effective and we are seeing more latency, timeouts, and errors for serving uncached content, API responses, and builds for all customers. Our team is working hard on a fix.

Posted Mar 09, 2022 - 15:13 UTC

Monitoring

The degradation is mitigated. We are continuing to monitor.

Posted Mar 09, 2022 - 14:43 UTC

Update

We’re still experiencing high latency with uncached content and the API, and an increased error rate with builds. The team continues to investigate this issue.

Posted Mar 09, 2022 - 14:15 UTC

Update

We are experiencing high latency with uncached content and the API, and an increased error rate with builds.

Posted Mar 09, 2022 - 13:38 UTC

Investigating

We are experiencing high latency with the API origin and an increased error rate with builds.

Posted Mar 09, 2022 - 13:32 UTC

This incident affected: High-Performance Edge Network, Standard Edge Network, Origin Servers, Build Pipeline, Netlify Application UI, and API.