AWS US EAST-1 Outage Incident Summary

Falit Jain
October 21, 2025
5 min read

TL;DR

On October 19–20, 2025, AWS US-EAST-1 (N. Virginia) experienced a roughly 15-hour outage triggered by DNS resolution failures for the regional DynamoDB endpoint.

This caused cascading failures across 142 AWS services, including critical ones such as EC2, Lambda, S3, DynamoDB, and CloudWatch.

Popular apps like Snapchat, Reddit, Fortnite, Venmo, Duolingo, and Signal were also affected.

Full recovery was achieved by October 20, 3:01 PM PDT.

Incident Title

[RESOLVED] Increased Error Rates and Latencies – AWS US-EAST-1 Region

Incident Link: AWS Service Health Dashboard (https://health.aws.amazon.com/health/status?path=service-history)

Overall Metrics

  • Total Time of Impact: ~15 hours 12 minutes
  • Mean Time to Resolve (MTTR): ~15 hours
  • Popular External Apps Affected: Snapchat, Reddit, Fortnite, Venmo, Duolingo, Signal

Popular Products Affected

During the outage, many widely used consumer applications and platforms experienced downtime or degraded performance, including:

  • Snapchat (social media/messaging)
  • Reddit (online community platform)
  • Fortnite (Epic Games – online gaming)
  • Venmo (peer-to-peer payments)
  • Duolingo (language learning)
  • Signal (encrypted messaging)

This highlights the far-reaching downstream effects of an AWS regional outage on globally popular applications.

Incident Timeline

  • Start: Oct 19, 11:49 PM PDT
  • End: Oct 20, 3:01 PM PDT
  • Duration: ~15 hours

Key Events

  • 11:49 PM – 2:24 AM: DNS resolution issues for DynamoDB endpoints triggered cascading failures across IAM, DynamoDB Global Tables, and other services.
  • 2:24 AM: DNS issue fixed, but EC2 instance launch subsystem impaired due to DynamoDB dependency.
  • ~8:00 AM: Network Load Balancer (NLB) health subsystem degraded, causing connectivity failures for Lambda, DynamoDB, CloudWatch, and others.
  • 9:38 AM: NLB health checks recovered; AWS throttled EC2 launches, SQS queue processing, and async Lambda invocations to stabilize.
  • 12:15 PM – 2:48 PM: Gradual recovery for EC2, ECS, Lambda, Glue, Redshift, Connect, and others. Residual backlogs remained in analytics/reporting pipelines.
  • 3:01 PM: All AWS services reported as recovered.

Root Cause Analysis (RCA)

The outage was triggered by DNS resolution failures in DynamoDB endpoints for US-EAST-1. This cascaded into:

  • Failure of IAM updates and DynamoDB Global Tables.
  • EC2’s internal subsystem for instance launches failing due to reliance on DynamoDB.
  • Impairment of Network Load Balancer health subsystem, spreading connectivity failures to Lambda, CloudWatch, and other core services.
  • A need to throttle EC2 launches, Lambda invocations, and SQS processing in order to stabilize recovery.

By Oct 20, 3:01 PM PDT, throttling was lifted and services were fully restored, with a few residual backlogs processed later.
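
For teams diagnosing similar symptoms, the DNS failure mode described above can be spot-checked from the client side. Below is a minimal sketch, assuming Python 3 and the public regional endpoint name; it is an illustration for dependency health checks, not an AWS-provided tool.

```python
# A minimal sketch (not AWS tooling): check whether the regional DynamoDB
# endpoint resolves, which is the symptom described in the RCA above.
import socket

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"  # regional endpoint named in this incident

def resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one IP address."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        ips = sorted({info[4][0] for info in infos})
        print(f"{hostname} -> {ips}")
        return True
    except socket.gaierror as exc:
        print(f"DNS resolution failed for {hostname}: {exc}")
        return False

if __name__ == "__main__":
    resolves(ENDPOINT)
```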

Affected Components (142 Services)

Categories of impacted services:

  • Compute & Networking: EC2, ECS, Lambda, EKS, ELB, VPC, Auto Scaling
  • Databases & Storage: DynamoDB, RDS, Redshift, ElastiCache, S3, EFS, FSx
  • Messaging & Events: SQS, SNS, EventBridge, CloudTrail
  • Monitoring & Security: CloudWatch, IAM, STS, Config, GuardDuty, Secrets Manager
  • Analytics & AI/ML: Glue, Athena, SageMaker, Kinesis, Comprehend, Rekognition, Textract, Translate
  • Application Services: API Gateway, AppSync, AppFlow, Cognito, Connect, WorkMail, WorkSpaces
  • Other: IoT services, Chime, GameLift, Elastic Beanstalk, DMS, Directory Service, DataZone, Managed Grafana, Managed Prometheus

Popular consumer products disrupted due to this outage:
Snapchat, Reddit, Fortnite, Venmo, Duolingo, Signal

Learnings

  1. Single Service Dependencies Can Cascade
    • DynamoDB DNS failures impacted IAM, EC2, and downstream systems.
    • Engineering takeaway: design for graceful degradation and isolation of failures.
  2. Monitoring Systems Can Be a Single Point of Failure
    • When the NLB health-check subsystem failed, the impact spread across dependent services.
    • Lesson: replicate and decouple internal monitoring systems.
  3. Throttling as a Recovery Tool
    • AWS stabilized recovery using throttles on EC2, SQS, and Lambda.
    • Lesson: build backpressure and throttling playbooks into incident response (see the sketch after this list).
  4. Backlog Processing Is a Recovery Step
    • Analytics/reporting pipelines lagged even after service recovery.
    • Lesson: have clear backlog-clearing strategies to restore downstream systems faster.
  5. Multi-AZ/Multi-Region Resilience
    • Heavy reliance on a single region (US-EAST-1) increased customer impact.
    • Lesson: implement multi-region strategies for critical workloads.
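
For the throttling point in learning 3, here is a minimal sketch of the same idea applied on the client side, using a hypothetical token-bucket helper rather than any AWS API; the rate and capacity values are purely illustrative.

```python
# A minimal sketch of client-side backpressure (hypothetical helper, not an AWS API):
# a token bucket that caps how fast work is pushed to a recovering dependency.
import threading
import time

class TokenBucket:
    """Allow roughly `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                # Refill tokens for the time elapsed since the last update.
                self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(0.05)  # wait briefly instead of hammering the dependency

# Usage: wrap calls to a struggling downstream, e.g. a queue drain or table scan.
limiter = TokenBucket(rate=5, capacity=10)  # illustrative limits, tune to your workload
for item in range(20):
    limiter.acquire()
    # process(item)  # placeholder for the real downstream call
```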

Recommended Customer Actions

  • Flush DNS Caches: If you continue to see DynamoDB or IAM errors, clear DNS caches to avoid stale entries.
  • Retry Failed Requests: Ensure your client/service retry logic is enabled for transient AWS API failures (see the configuration sketch after this list).
  • Review Throttling & Backpressure: Validate that your applications handle AWS throttling gracefully during outages.
  • Monitor Backlog-Clearing: Check Lambda queues, EventBridge rules, and analytics pipelines for delayed event processing.
  • Evaluate Multi-Region Deployments: Consider cross-region redundancy for mission-critical services to reduce blast radius of single-region failures.
  • Audit Dependencies: Identify if your systems rely indirectly on DynamoDB or EC2 launches, and document those dependencies for future incident planning.
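
For the retry bullet above, here is a minimal configuration sketch, assuming boto3/botocore as the AWS client; the attempt count is an example to tune against your own error budget.

```python
# A minimal sketch of the retry action above, assuming boto3/botocore is the client
# in use; "standard" and "adaptive" are documented botocore retry modes.
import boto3
from botocore.config import Config

retry_config = Config(
    retries={
        "max_attempts": 10,  # total attempts, including the initial call
        "mode": "adaptive",  # retries throttling/transient errors with client-side rate limiting
    }
)

# Attach the config to any client whose calls should ride out transient failures.
dynamodb = boto3.client("dynamodb", region_name="us-east-1", config=retry_config)
print(dynamodb.list_tables().get("TableNames", []))
```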

Current Status (as of Oct 20, 3:01 PM PDT)

All AWS services have recovered to normal operations.

How Can You Stay Ahead and Get Timely Alerts?

Outages like this show how third-party dependencies can ripple through your systems. With Pagerly Monitor, you can:

  • Monitor third-party services (AWS, OpenAI, Heroku, Sure, etc.) in real time.
  • Get paged automatically when critical services degrade.
  • Create incidents instantly inside Slack/Teams.
  • Connect your favorite tools (Jira, Intercom, GitHub, PagerDuty, Opsgenie, and more) for a seamless response.

Stay ahead of outages — let Pagerly help you detect, escalate, and resolve issues faster.
