6. Create Your Hypothesis
A well-defined hypothesis clarifies what you expect to happen when a fault is injected. To formulate it:- Identify the affected components
Pinpoint services, instances, or containers targeted by your fault injection. - Describe the expected behavior
Determine how your application should respond under fault conditions. - Define success metrics
Choose key indicators—latency, error rate, throughput—to validate resilience.
A precise hypothesis narrows your experiment’s scope and sets clear success criteria.
7. Design the Experiment
Use AWS FIS to control scope, duration, and safety checks. Configure the following:| Configuration | Description | Example |
|---|---|---|
| Target Resources | Apply tags to focus your fault injection on specific AWS resources. | Tag EC2 instances with env=staging. |
| Duration | Specify how long the fault remains active before auto rollback. | PT5M (5 minutes) |
| Stop Conditions | Define thresholds to abort the experiment if they’re violated. | CPU > 80% for 2 minutes |
8. Run the Experiment
- Start in lower environments
Validate your hypothesis in development or staging before touching production.
Always begin in a non-production account or VPC to avoid unintended impact.
- Validate resilience
Monitor your application as the fault is injected. Check dashboards and alerts to ensure behavior aligns with your hypothesis. - Promote to production
Once confirmed, rerun the experiment against production workloads with the same configuration. - Mark success
A successful run demonstrates that your architecture can withstand the injected fault without violating SLAs.
9. Conduct a Post-Mortem
A structured post-mortem transforms insights into improvements:| Step | Action |
|---|---|
| Analyze Impact | Review logs, metrics, traces, and user experience during the experiment. |
| Blameless Review | Host a session focused on learning, not finger-pointing. |
| Document Findings | Update runbooks, architecture diagrams, and automation scripts based on lessons learned. |
| CI/CD Integration | Automate FIS experiments in your CI/CD pipeline to continuously validate resilience. |
Maintain a blameless culture in your post-mortems to encourage transparent learning and innovation.
