Postmortem on DNS Name Server Migration Interruption
This is a postmortem of the 55-minute global service interruption that users experienced on March 23, 2025, during our DNS migration from AWS Route 53 to Cloudflare.

tl;dr
On March 23, 2025, while migrating our DNS services from AWS Route 53 to Cloudflare, we experienced an intermittent service interruption lasting 55 minutes, from 22:00 UTC to 22:55 UTC. This disruption impacted all users worldwide. We understand that this caused significant inconvenience, and we appreciate your patience during this time.
Root Cause
The root cause of these interruptions was the way DNS caching works, in particular how recursive resolvers cache and refresh a domain's Name Server (NS) delegation records. During the migration, we observed two main issues:
- Some providers continued to serve outdated delegation data.
- Others temporarily served cached negative ("not found") responses.
We tried to speed up the update by purging the caches of Google Public DNS and Cloudflare’s 1.1.1.1. However, DNS is a distributed system, and resolvers around the world refresh their information at different intervals, depending on their local policies and Time-to-Live (TTL) settings. As a result, we experienced periods of intermittent unavailability.
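To illustrate why resolvers refresh at different times, here is a minimal sketch of how a resolver's local policy can stretch a published TTL. The `ResolverPolicy` class, its floor and ceiling values, and the 60-second TTL are hypothetical and chosen for illustration only; they are not measurements from this incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolverPolicy:
    """Hypothetical local caching policy of a recursive resolver."""
    name: str
    min_ttl: int  # seconds; floor the resolver applies to any TTL
    max_ttl: int  # seconds; ceiling the resolver applies to any TTL


def effective_ttl(record_ttl: int, policy: ResolverPolicy) -> int:
    """Clamp the published TTL into the resolver's [min_ttl, max_ttl] range."""
    return max(policy.min_ttl, min(record_ttl, policy.max_ttl))


def stale_until(cached_at: int, record_ttl: int, policy: ResolverPolicy) -> int:
    """Return the second, relative to cutover at t=0, when the entry expires."""
    return cached_at + effective_ttl(record_ttl, policy)


if __name__ == "__main__":
    published_ttl = 60  # illustrative lowered TTL, not the value we used
    policies = [
        ResolverPolicy("honours-ttl", min_ttl=0, max_ttl=86400),
        ResolverPolicy("floors-at-5-min", min_ttl=300, max_ttl=86400),
        ResolverPolicy("floors-at-1-hour", min_ttl=3600, max_ttl=86400),
    ]
    for policy in policies:
        # Assume each resolver cached the old NS set 10 seconds before cutover.
        print(policy.name, stale_until(-10, published_ttl, policy))
```

Under these assumed policies, the same record stays cached anywhere from under a minute to almost an hour after cutover, which is the kind of spread that produces intermittent rather than uniform failures.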
Proactive Measures and Unforeseen DNS Propagation Challenges
To prepare for this migration and minimize potential disruptions, we undertook several key preparatory steps:
- A week before the scheduled cutover, we reduced the TTL of existing DNS records in Route 53, including the TTL of the NS record itself. This was intended to shorten the time resolvers would cache old information.
- We meticulously replicated all DNS records to ensure complete functional parity between our old DNS configuration in Route 53 and the new configuration in Cloudflare. This ensured that the correct information would be available once resolvers were updated.
- We updated the registrar to point to the new Cloudflare nameservers, completed all necessary verification steps, and initiated cache purges to further accelerate the propagation of the updated DNS records.
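The record parity check in the second step can be sketched as a comparison of two zone exports. This is a minimal illustration rather than the tooling we used: the `(name, type, value)` tuple format, the `diff_zones` helper, and the decision to skip NS and SOA records (which are expected to differ between providers) are all assumptions.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Record = Tuple[str, str, str]  # (name, type, value), a hypothetical export format
RecordKey = Tuple[str, str]    # (name, type)

# NS and SOA normally differ between providers, so they are skipped here.
# This is a simplification: it also skips NS delegations of subdomains.
IGNORED_TYPES = frozenset({"NS", "SOA"})


def normalize(records: Iterable[Record]) -> Dict[RecordKey, FrozenSet[str]]:
    """Group values by (name, type); names and types are case-insensitive."""
    grouped: Dict[RecordKey, Set[str]] = {}
    for name, rtype, value in records:
        rtype = rtype.upper()
        if rtype in IGNORED_TYPES:
            continue
        key = (name.lower().rstrip("."), rtype)
        # Values are only stripped of whitespace, since TXT data is case-sensitive.
        grouped.setdefault(key, set()).add(value.strip())
    return {key: frozenset(values) for key, values in grouped.items()}


def diff_zones(old: Iterable[Record], new: Iterable[Record]) -> Dict[str, List[RecordKey]]:
    """Report record sets that are missing, extra, or changed in the new zone."""
    old_sets, new_sets = normalize(old), normalize(new)
    return {
        "missing": sorted(k for k in old_sets if k not in new_sets),
        "extra": sorted(k for k in new_sets if k not in old_sets),
        "changed": sorted(
            k for k in old_sets if k in new_sets and old_sets[k] != new_sets[k]
        ),
    }


if __name__ == "__main__":
    # Example data uses reserved documentation names and addresses.
    route53 = [("example.com.", "A", "192.0.2.1"), ("example.com.", "TXT", "v=spf1 -all")]
    cloudflare = [("example.com", "A", "192.0.2.1")]
    print(diff_zones(route53, cloudflare))
```

An empty result in all three lists would indicate parity for the compared record types. It says nothing about what resolvers have cached.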
Despite these proactive measures, we experienced challenges. Specifically, certain Internet Service Providers (ISPs) and DNS resolvers retained the old NS records beyond the expected TTL or, in some cases, temporarily cached an NXDOMAIN response, which browsers such as Chrome surface as DNS_PROBE_FINISHED_NXDOMAIN.
These behaviors resulted in intermittent resolution failures for some users.
To lessen the impact, we strategically scheduled the cutover to coincide with our period of lowest traffic volume.
Negative caching contributed to the interruptions. When a resolver receives a transient “not found” or error response, it caches this negative result for a period determined by the negative TTL. This is a standard mechanism to reduce load on authoritative servers.
However, in our migration scenario, this meant that some resolvers, upon encountering a temporary inability to resolve the domain, cached this failure, leading to service unavailability for those users for the duration of the negative TTL. While these behaviors are inherent to DNS and contribute to its overall performance and reliability, they presented challenges during this specific migration.
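For reference, RFC 2308 defines the negative TTL as the smaller of the SOA record's own TTL and the SOA MINIMUM field. The sketch below applies that rule; the optional `resolver_cap` parameter models a resolver-side limit and is an assumption, as are all of the example values.

```python
from typing import Optional


def negative_cache_ttl(
    soa_record_ttl: int,
    soa_minimum: int,
    resolver_cap: Optional[int] = None,
) -> int:
    """Seconds a resolver may cache a negative answer for a zone.

    Per RFC 2308 this is min(SOA TTL, SOA MINIMUM). `resolver_cap` is a
    hypothetical local limit that some resolvers apply on top of that.
    """
    ttl = min(soa_record_ttl, soa_minimum)
    if resolver_cap is not None:
        ttl = min(ttl, resolver_cap)
    return ttl


if __name__ == "__main__":
    # Illustrative values only, not the SOA settings of our zone.
    print(negative_cache_ttl(soa_record_ttl=900, soa_minimum=3600))
    print(negative_cache_ttl(soa_record_ttl=86400, soa_minimum=10800, resolver_cap=300))
```

One practical consequence is that lowering the TTLs of ordinary records does not shorten negative caching; that is controlled by the SOA values of the zone that answered.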
Strategic Upgrade: Rationale Behind Cloudflare DNS Migration
Our decision to migrate our DNS infrastructure to Cloudflare was driven by a strategic objective to enhance performance and scalability. Cloudflare’s extensive edge network provides several key advantages: faster global DNS resolution, quicker SSL termination, and the ability to leverage Cloudflare Argo Smart Routing for optimized packet routing and congestion avoidance.
This infrastructure upgrade is a core component of our ongoing efforts to improve Novu’s global performance, security, and resilience.
Post-Incident Review: Enhancing Migration Protocols
We are committed to a thorough post-incident analysis to identify areas for improvement in our migration procedures and communication protocols.
Throughout the 55-minute disruption, we kept users informed with real-time updates on novustatus.com as the situation evolved.
We aim to prevent similar incidents and ensure a smoother user experience during future infrastructure changes. We'll keep taking care of your notifications so you can focus on what matters.
Happy notifying 🔔