AWS US EAST-1 Outage Incident Summary

Falit Jain
October 21, 2025
5 min read

TL;DR

On October 19–20, 2025, AWS US-EAST-1 (N. Virginia) experienced a roughly 15-hour outage triggered by DNS resolution failures for the regional DynamoDB endpoint.

This caused cascading failures across 142 AWS services, including critical ones such as EC2, Lambda, S3, DynamoDB, and CloudWatch.

Popular apps like Snapchat, Reddit, Fortnite, Venmo, Duolingo, and Signal were also affected.

Full recovery was achieved by October 20, 3:01 PM PDT.

Incident Title

[RESOLVED] Increased Error Rates and Latencies – AWS US-EAST-1 Region

Incident Link: AWS Service Health Dashboard (https://health.aws.amazon.com/health/status?path=service-history)

Overall Metrics

  • Total Time of Impact: ~15 hours 12 minutes
  • Mean Time to Resolve (MTTR): ~15 hours
  • Popular External Apps Affected: Snapchat, Reddit, Fortnite, Venmo, Duolingo, Signal

Popular Products Affected

During the outage, many widely used consumer applications and platforms experienced downtime or degraded performance, including:

  • Snapchat (social media/messaging)
  • Reddit (online community platform)
  • Fortnite (Epic Games – online gaming)
  • Venmo (peer-to-peer payments)
  • Duolingo (language learning)
  • Signal (encrypted messaging)

This highlights the far-reaching downstream effects of an AWS regional outage on globally popular applications.

Incident Timeline

  • Start: Oct 19, 11:49 PM PDT
  • End: Oct 20, 3:01 PM PDT
  • Duration: ~15 hours

Key Events

  • 11:49 PM – 2:24 AM: DNS resolution issues for DynamoDB endpoints triggered cascading failures across IAM, DynamoDB Global Tables, and other services.
  • 2:24 AM: DNS issue fixed, but EC2 instance launch subsystem impaired due to DynamoDB dependency.
  • ~8:00 AM: Network Load Balancer (NLB) health subsystem degraded, causing connectivity failures for Lambda, DynamoDB, CloudWatch, and others.
  • 9:38 AM: NLB health checks recovered; AWS throttled EC2 launches, SQS queue processing, and async Lambda invocations to stabilize.
  • 12:15 PM – 2:48 PM: Gradual recovery for EC2, ECS, Lambda, Glue, Redshift, Connect, and others. Residual backlogs remained in analytics/reporting pipelines.
  • 3:01 PM: All AWS services reported as recovered.

Root Cause Analysis (RCA)

The outage was triggered by DNS resolution failures in DynamoDB endpoints for US-EAST-1. This cascaded into:

  • Failure of IAM updates and DynamoDB Global Tables.
  • EC2’s internal subsystem for instance launches failing due to reliance on DynamoDB.
  • Impairment of Network Load Balancer health subsystem, spreading connectivity failures to Lambda, CloudWatch, and other core services.
  • A need to throttle EC2 launches, Lambda invocations, and SQS processing in order to stabilize recovery.

By Oct 20, 3:01 PM PDT, throttling was lifted and services were fully restored, with a few residual backlogs processed later.
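
For teams diagnosing similar symptoms, the DNS failure mode described above can be spot-checked from the client side. Below is a minimal sketch, assuming Python 3 and the public regional endpoint name; it is an illustration for dependency health checks, not an AWS-provided tool.

```python
# A minimal sketch (not AWS tooling): check whether the regional DynamoDB
# endpoint resolves, which is the symptom described in the RCA above.
import socket

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"  # regional endpoint named in this incident

def resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one IP address."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        ips = sorted({info[4][0] for info in infos})
        print(f"{hostname} -> {ips}")
        return True
    except socket.gaierror as exc:
        print(f"DNS resolution failed for {hostname}: {exc}")
        return False

if __name__ == "__main__":
    resolves(ENDPOINT)
```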

Affected Components (142 Services)

Categories of impacted services:

  • Compute & Networking: EC2, ECS, Lambda, EKS, ELB, VPC, Auto Scaling
  • Databases & Storage: DynamoDB, RDS, Redshift, ElastiCache, S3, EFS, FSx
  • Messaging & Events: SQS, SNS, EventBridge, CloudTrail
  • Monitoring & Security: CloudWatch, IAM, STS, Config, GuardDuty, Secrets Manager
  • Analytics & AI/ML: Glue, Athena, SageMaker, Kinesis, Comprehend, Rekognition, Textract, Translate
  • Application Services: API Gateway, AppSync, AppFlow, Cognito, Connect, WorkMail, WorkSpaces
  • Other: IoT services, Chime, GameLift, Elastic Beanstalk, DMS, Directory Service, DataZone, Managed Grafana, Managed Prometheus

Popular consumer products disrupted due to this outage:
Snapchat, Reddit, Fortnite, Venmo, Duolingo, Signal

Learnings

  1. Single Service Dependencies Can Cascade
    • DynamoDB DNS failures impacted IAM, EC2, and downstream systems.
    • Engineering takeaway: design for graceful degradation and isolation of failures.
  2. Monitoring Systems Can Be a Single Point of Failure
    • When the NLB health-check subsystem failed, the impact spread across dependent services.
    • Lesson: replicate and decouple internal monitoring systems.
  3. Throttling as a Recovery Tool
    • AWS stabilized recovery using throttles on EC2, SQS, and Lambda.
    • Lesson: build backpressure and throttling playbooks into incident response (see the sketch after this list).
  4. Backlog Processing Is a Recovery Step
    • Analytics/reporting pipelines lagged even after service recovery.
    • Lesson: have clear backlog-clearing strategies to restore downstream systems faster.
  5. Multi-AZ/Multi-Region Resilience
    • Heavy reliance on a single region (US-EAST-1) increased customer impact.
    • Lesson: implement multi-region strategies for critical workloads.
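
For the throttling point in learning 3, here is a minimal sketch of the same idea applied on the client side, using a hypothetical token-bucket helper rather than any AWS API; the rate and capacity values are purely illustrative.

```python
# A minimal sketch of client-side backpressure (hypothetical helper, not an AWS API):
# a token bucket that caps how fast work is pushed to a recovering dependency.
import threading
import time

class TokenBucket:
    """Allow roughly `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                # Refill tokens for the time elapsed since the last update.
                self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(0.05)  # wait briefly instead of hammering the dependency

# Usage: wrap calls to a struggling downstream, e.g. a queue drain or table scan.
limiter = TokenBucket(rate=5, capacity=10)  # illustrative limits, tune to your workload
for item in range(20):
    limiter.acquire()
    # process(item)  # placeholder for the real downstream call
```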

Recommended Customer Actions

  • Flush DNS Caches: If you continue to see DynamoDB or IAM errors, clear DNS caches to avoid stale entries.
  • Retry Failed Requests: Ensure your client/service retry logic is enabled for transient AWS API failures (see the configuration sketch after this list).
  • Review Throttling & Backpressure: Validate that your applications handle AWS throttling gracefully during outages.
  • Monitor Backlog-Clearing: Check Lambda queues, EventBridge rules, and analytics pipelines for delayed event processing.
  • Evaluate Multi-Region Deployments: Consider cross-region redundancy for mission-critical services to reduce blast radius of single-region failures.
  • Audit Dependencies: Identify if your systems rely indirectly on DynamoDB or EC2 launches, and document those dependencies for future incident planning.
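
For the retry bullet above, here is a minimal configuration sketch, assuming boto3/botocore as the AWS client; the attempt count is an example to tune against your own error budget.

```python
# A minimal sketch of the retry action above, assuming boto3/botocore is the client
# in use; "standard" and "adaptive" are documented botocore retry modes.
import boto3
from botocore.config import Config

retry_config = Config(
    retries={
        "max_attempts": 10,  # total attempts, including the initial call
        "mode": "adaptive",  # retries throttling/transient errors with client-side rate limiting
    }
)

# Attach the config to any client whose calls should ride out transient failures.
dynamodb = boto3.client("dynamodb", region_name="us-east-1", config=retry_config)
print(dynamodb.list_tables().get("TableNames", []))
```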

Current Status (as of Oct 20, 3:01 PM PDT)

All AWS services have recovered to normal operations.

How Can You Stay Ahead and Get Timely Alerts?

Outages like this show how third-party dependencies can ripple through your systems. With Pagerly Monitor, you can:

  • Monitor third-party services (AWS, OpenAI, Heroku, Sure, etc.) in real time.
  • Get paged automatically when critical services degrade.
  • Create incidents instantly inside Slack/Teams.
  • Connect your favorite tools (Jira, Intercom, GitHub, PagerDuty, Opsgenie, and more) for a seamless response.

Stay ahead of outages — let Pagerly help you detect, escalate, and resolve issues faster.
