Welcome to this lesson on chaos engineering with AWS Fault Injection Simulator (FIS). I’m Nasia Ullas, and I’ll guide you through designing, executing, and analyzing fault-injection experiments to strengthen your system resilience.
As modern architectures grow in complexity, unexpected failures can lead to significant downtime costs:
- 44% of organizations report that 1 hour of downtime costs between $1 million and $5 million.
- In 2021, Facebook incurred an estimated $80 million+ in losses from roughly seven hours of downtime.
- In July 2024, a faulty CrowdStrike update caused a “blue screen of death” outage that impacted airlines, banks, healthcare providers, and countless other businesses worldwide.
Chaos engineering is the practice of intentionally injecting faults into a system to uncover weaknesses and validate its ability to withstand real-world disruptions. In this course, we’ll leverage AWS Fault Injection Simulator (FIS) to conduct controlled experiments in your AWS environment.
Course Outline
We’ll cover seven high-level modules, each focusing on different AWS services and fault types:
- Module 1: Basic FIS Experiments
  Configure IAM, create experiment templates, execute tests, and monitor results with dashboards.
- Module 2: Sample Application & Steady-State Metrics
  Deploy a reference application and define baseline performance metrics.
- Module 3: Disk Fill Scenario on EC2
  Simulate disk saturation on EC2 instances and analyze its impact on application behavior.
- Module 4: Aurora Reader Reboot
  Inject a reboot fault into an Aurora reader node and observe recovery processes.
- Module 5: Fargate Load Stress Test
  Apply CPU and memory stress to a serverless Fargate task and evaluate performance under high load.
- Module 6: EKS Memory Stress & Pod Deletion
  Perform memory saturation tests and pod-deletion experiments in your EKS cluster to validate self-healing.
- Module 7: Availability Zone Power Interruption
  Simulate a power outage in an entire Availability Zone to assess multi-AZ resilience.
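To give a feel for what an experiment template contains before Module 1 covers it properly, here is a minimal Python sketch that builds the request body for an Aurora reader reboot like the one in Module 4. It is an illustration under assumptions, not the course's own template: the function name, the target and action labels (`readerInstance`, `rebootReader`), the tag, and all ARNs are hypothetical placeholders. The field names follow the FIS `CreateExperimentTemplate` request shape, and the CloudWatch alarm stop condition is one way to halt an experiment automatically if the system degrades too far.

```python
import json
import uuid


def build_reader_reboot_template(db_instance_arn, role_arn, alarm_arn):
    """Build a request body for an FIS experiment template that reboots one DB instance.

    All three ARNs are supplied by the caller; nothing here contacts AWS.
    """
    return {
        # Idempotency token so a retried request does not create a duplicate template.
        "clientToken": str(uuid.uuid4()),
        "description": "Reboot an Aurora reader and observe recovery",
        # IAM role that FIS assumes to run the actions (hypothetical ARN in the demo).
        "roleArn": role_arn,
        # Stop the experiment if this CloudWatch alarm goes into ALARM state.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "targets": {
            # "readerInstance" is an arbitrary label chosen for this sketch.
            "readerInstance": {
                "resourceType": "aws:rds:db",
                "resourceArns": [db_instance_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            # "rebootReader" is likewise an arbitrary label.
            "rebootReader": {
                "actionId": "aws:rds:reboot-db-instances",
                # Map the action's target key to the target label defined above.
                "targets": {"DBInstances": "readerInstance"},
            }
        },
        "tags": {"Name": "aurora-reader-reboot"},
    }


if __name__ == "__main__":
    # Placeholder ARNs for illustration only.
    template = build_reader_reboot_template(
        "arn:aws:rds:us-east-1:111122223333:db:example-reader",
        "arn:aws:iam::111122223333:role/example-fis-role",
        "arn:aws:cloudwatch:us-east-1:111122223333:alarm:example-alarm",
    )
    print(json.dumps(template, indent=2))
```

With AWS credentials configured, a dictionary like this could be passed to boto3 as `boto3.client("fis").create_experiment_template(**template)`, and the returned template ID handed to `start_experiment`. Keeping the template as plain data makes it easy to review and version before anything is injected into a live environment.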
Conclusion
By the end of this course, you’ll have a solid understanding of how to:
- Design robust failure scenarios for cloud applications.
- Execute controlled experiments safely.
- Analyze results to strengthen your system’s resilience.
Let’s get started and build more reliable, fault-tolerant architectures with AWS FIS!