Q. Last restore test you ran (what failed, what changed after)
==============================================================

Across my experience from IBM to AWS environments at OpenText and Diehl Metering, I treat disaster recovery as something that must be tested in real scenarios, not just designed.

One practical restore scenario I handled was in a pre-production AWS environment where an incorrect Terraform change was merged by an apprentice, which unintentionally overrode parts of the infrastructure and caused a major disruption.

Restore Scenario
----------------
- Pre-production AWS environment managed via Terraform  
- Incorrect PR merge led to unintended infrastructure changes  
- Required restoring environment to last known stable state  

What Failed
-----------
- Infrastructure state was impacted due to incorrect Terraform apply  
- Some resources were modified or recreated unexpectedly  
- Dependency issues appeared between services after the change  

Actions Taken
-------------
- Identified last known good Terraform state from remote backend (S3)  
- Reverted Terraform code to previous stable version  
- Re-ran Terraform apply to bring infrastructure back to expected state  

- Validated services after restore:
  - Checked application connectivity  
  - Verified dependent services and configurations  

- Analysed root cause of failure:
  - Lack of strict validation before merge  
  - Insufficient guardrails for production-like environments  

What Changed After
-------------------
- Introduced stricter PR review and approval process  
- Added validation checks in CI/CD before Terraform apply  
- Restricted who can approve infrastructure changes  
- Improved separation between environments (pre-prod vs prod controls)  

- Strengthened Terraform workflow:
  - Better plan review visibility  
  - Clear rollback approach using previous state  

Outcome
-------
- Successfully restored infrastructure to stable state  
- Reduced risk of similar issues through stronger controls  
- Improved confidence in recovery using Terraform state  

This reinforced the importance of treating Terraform state and version control as a recovery mechanism, not just a deployment tool.