Novu

Postmortem on DNS Name Server Migration Interruption

This is a postmortem of the 55-minute global service interruption that users experienced on March 23, 2025, during our DNS migration from AWS Route 53 to Cloudflare.

tl;dr

On March 23, 2025, while migrating our DNS services from AWS Route 53 to Cloudflare, we experienced an intermittent service interruption lasting 55 minutes, from 22:00 UTC to 22:55 UTC. This disruption impacted all users worldwide. We understand that this caused significant inconvenience, and we appreciate your patience during this time.

Root Cause

The root cause of these interruptions was DNS caching behavior. Recursive resolvers cache a domain's Name Server (NS) delegation and refresh it on their own schedule, so they did not all switch to the new name servers at the same time. During the migration, we observed two main issues:

  • Some providers continued to serve outdated delegation data.
  • Others temporarily served cached negative ("not found") responses.

We tried to speed up the update by flushing our records from the caches of Google Public DNS and Cloudflare’s 1.1.1.1. However, DNS is a distributed system, and resolvers around the world refresh their information at different intervals, depending on their local policies and on the Time-to-Live (TTL) of the records they have cached. As a result, we experienced periods of intermittent unavailability.
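To illustrate why resolvers switch over at different times, here is a minimal sketch of the caching model described above. The resolver names, cache timestamps, and TTL values are hypothetical and chosen for illustration only; they are not measurements from this incident.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ResolverCache:
    """Hypothetical model of one resolver's cached copy of the old NS delegation."""

    name: str
    cached_at: int  # seconds relative to the cutover (negative = before the cutover)
    ttl: int        # TTL, in seconds, that the resolver applied to the cached NS set

    def stale_until(self) -> int:
        """Seconds after the cutover at which this resolver drops the old NS set."""
        return max(0, self.cached_at + self.ttl)


def propagation_window(resolvers: list[ResolverCache]) -> int:
    """Worst case: seconds after the cutover until every resolver has refreshed."""
    return max((r.stale_until() for r in resolvers), default=0)


# Illustrative values only (assumptions, not incident data).
resolvers = [
    ResolverCache("flushed-public-resolver", cached_at=-300, ttl=300),
    ResolverCache("isp-honoring-ttl", cached_at=-10, ttl=300),
    ResolverCache("isp-with-longer-ttl", cached_at=-60, ttl=3600),
]

for r in resolvers:
    print(f"{r.name}: serves old NS records for {r.stale_until()} s after cutover")
print("propagation window:", propagation_window(resolvers), "s")
```

In this model, the resolver that applies the longest TTL to the old delegation determines how long the overall window lasts, regardless of how quickly the others refresh.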

Proactive Measures and Unforeseen DNS Propagation Challenges

To prepare for this migration and minimize potential disruptions, we undertook several key preparatory steps:

  • A week before the scheduled cutover, we reduced the TTL of existing DNS records in Route 53, including the TTL of the NS record itself. This was intended to shorten the time resolvers would cache old information.
  • We meticulously replicated all DNS records to ensure complete functional parity between our old DNS configuration in Route 53 and the new configuration in Cloudflare. This ensured that the correct information would be available once resolvers were updated.
  • We updated the registrar to point to the new Cloudflare nameservers, completed all necessary verification steps, and initiated cache purges to further accelerate the propagation of the updated DNS records.
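A record parity check like the one in the second step can be sketched as a set comparison between the two zones. This is an illustrative sketch, not the tooling we used: it assumes both zones have already been exported as (name, type, value) tuples, and the `example.com` records are placeholders.

```python
from __future__ import annotations

# Record types whose value is a hostname and can be compared case-insensitively.
HOSTNAME_TYPES = {"CNAME", "MX", "NS"}


def normalize(record: tuple[str, str, str]) -> tuple[str, str, str]:
    """Normalize a (name, type, value) record so cosmetic differences are ignored."""
    name, rtype, value = record
    rtype = rtype.upper()
    name = name.rstrip(".").lower()
    value = value.strip()
    if rtype in HOSTNAME_TYPES:
        # Hostnames are case-insensitive; the trailing dot is optional in exports.
        value = value.rstrip(".").lower()
    return (name, rtype, value)


def diff_zones(old: list[tuple[str, str, str]], new: list[tuple[str, str, str]]) -> dict:
    """Return records present in only one of the two zones."""
    old_set = {normalize(r) for r in old}
    new_set = {normalize(r) for r in new}
    return {
        "missing_in_new": sorted(old_set - new_set),
        "extra_in_new": sorted(new_set - old_set),
    }


# Placeholder data: the TXT record was not copied to the new zone.
old_zone = [
    ("example.com.", "A", "192.0.2.10"),
    ("api.example.com.", "CNAME", "lb.example.net."),
    ("example.com.", "TXT", "v=spf1 -all"),
]
new_zone = [
    ("example.com", "a", "192.0.2.10"),
    ("api.example.com", "CNAME", "LB.example.net"),
]

report = diff_zones(old_zone, new_zone)
print(report)
```

An empty report on both sides is the condition to reach before changing the nameservers at the registrar.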

Despite these proactive measures, we experienced challenges. Specifically, certain Internet Service Providers (ISPs) and DNS resolvers retained the old NS records beyond the expected TTL or, in some cases, temporarily cached an NXDOMAIN ("domain does not exist") response, which browsers such as Chrome report as DNS_PROBE_FINISHED_NXDOMAIN.

These behaviors resulted in intermittent resolution failures for some users.

To lessen the impact, we strategically scheduled the cutover to coincide with our period of lowest traffic volume.

Negative caching contributed to the interruptions. When a resolver receives a transient “not found” or error response, it caches this negative result for a period determined by the negative TTL. This is a standard mechanism to reduce load on authoritative servers.

However, in our migration scenario, this meant that some resolvers, upon encountering a temporary inability to resolve the domain, cached this failure, leading to service unavailability for those users for the duration of the negative TTL. While these behaviors are inherent to DNS and contribute to its overall performance and reliability, they presented challenges during this specific migration.
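As a rough illustration of how long a cached failure can persist: under RFC 2308, the negative TTL is the smaller of the zone's SOA record TTL and the SOA MINIMUM field. The sketch below applies that rule. The function names, the optional resolver-side cap, and the numeric values are assumptions for illustration, not values from our zone.

```python
from __future__ import annotations


def negative_cache_ttl(soa_ttl: int, soa_minimum: int) -> int:
    """Negative-cache TTL per RFC 2308: min(SOA record TTL, SOA MINIMUM field)."""
    return min(soa_ttl, soa_minimum)


def failure_visible_until(
    cached_at: int,
    soa_ttl: int,
    soa_minimum: int,
    resolver_cap: int | None = None,
) -> int:
    """Time (in seconds) until a cached negative answer expires.

    resolver_cap models a resolver that limits negative caching to its own
    maximum; this is an assumption, since such caps vary by implementation.
    """
    ttl = negative_cache_ttl(soa_ttl, soa_minimum)
    if resolver_cap is not None:
        ttl = min(ttl, resolver_cap)
    return cached_at + ttl


# Illustrative values only: a failure cached 120 s after the cutover.
print(negative_cache_ttl(soa_ttl=3600, soa_minimum=900))
print(failure_visible_until(cached_at=120, soa_ttl=3600, soa_minimum=900))
print(failure_visible_until(cached_at=120, soa_ttl=3600, soa_minimum=900, resolver_cap=300))
```

Lowering the SOA MINIMUM value ahead of a migration, alongside the record TTLs, shortens the time a resolver holds on to a transient failure.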

Strategic Upgrade: Rationale Behind Cloudflare DNS Migration

Our decision to migrate our DNS infrastructure to Cloudflare was driven by a strategic objective to enhance performance and scalability. Cloudflare’s extensive edge network provides several key advantages: faster global DNS resolution, quicker SSL termination, and the ability to leverage Cloudflare Argo Smart Routing for optimized packet routing and congestion avoidance.

This infrastructure upgrade is a core component of our ongoing efforts to improve Novu’s global performance, security, and resilience.

Post-Incident Review: Enhancing Migration Protocols

We are committed to a thorough post-incident analysis to identify areas for improvement in our migration procedures and communication protocols.

Throughout the 55-minute window of disruption, we kept users informed with real-time updates on novustatus.com as the situation evolved.

We aim to prevent similar incidents and ensure a smoother user experience during future infrastructure changes. We'll keep taking care of the notifications so you can focus on what matters.

Happy notifying 🔔
