Q. Last restore test you ran (what failed, what changed after)
==============================================================

Across my experience from IBM to AWS environments at OpenText and Diehl Metering, I have treated disaster recovery as something that must be tested in real scenarios, not just designed. One practical restore I handled was in a pre-production AWS environment, where an apprentice merged an incorrect Terraform change that unintentionally overrode parts of the infrastructure and caused a major disruption.

Restore Scenario
----------------

- Pre-production AWS environment managed via Terraform
- An incorrect PR merge led to unintended infrastructure changes
- The environment had to be restored to its last known stable state

What Failed
-----------

- The incorrect Terraform apply left the infrastructure state out of line with the intended configuration
- Some resources were modified or recreated unexpectedly
- Dependency issues appeared between services after the change

Actions Taken
-------------

- Identified the last known good Terraform state in the remote backend (S3)
- Reverted the Terraform code to the previous stable version
- Re-ran Terraform apply to bring the infrastructure back to the expected state
- Validated services after the restore:
  - Checked application connectivity
  - Verified dependent services and configurations
- Analysed the root cause of the failure:
  - Lack of strict validation before merge
  - Insufficient guardrails for production-like environments

What Changed After
------------------

- Introduced a stricter PR review and approval process
- Added validation checks in CI/CD before Terraform apply
- Restricted who can approve infrastructure changes
- Improved separation between environments (pre-prod vs prod controls)
- Strengthened the Terraform workflow:
  - Better plan review visibility
  - Clear rollback approach using the previous state

Outcome
-------

- Successfully restored the infrastructure to a stable state
- Reduced the risk of similar issues through stronger controls
- Improved confidence in recovery using Terraform state

This reinforced the importance of treating Terraform state and version control as a recovery mechanism, not just a deployment tool.
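
To make the "validation checks in CI/CD before Terraform apply" point concrete, here is a minimal sketch of one such check. It is an illustration only, not the check from the actual pipeline. It assumes the plan has been exported with `terraform show -json`, whose output lists each planned change under `resource_changes` with a `change.actions` array. The function name `risky_changes` and the file names used below are hypothetical.

```python
import json
import sys
from typing import List


def risky_changes(plan: dict) -> List[str]:
    """Return the addresses of resources the plan would delete or replace.

    `plan` is the parsed output of `terraform show -json <planfile>`.
    """
    risky = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            # A delete combined with a create is a replacement;
            # a delete on its own is a plain destroy.
            kind = "replace" if "create" in actions else "delete"
            risky.append(f"{rc['address']} ({kind})")
    return risky


def main(path: str) -> int:
    """Print risky changes and return a non-zero exit code if any exist."""
    with open(path) as fh:
        plan = json.load(fh)
    found = risky_changes(plan)
    for item in found:
        print(f"BLOCKED: {item}")
    return 1 if found else 0


# Only runs when invoked as a script with exactly one argument (the plan JSON).
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

In a pipeline this could run between plan and apply: `terraform plan -out=tfplan`, then `terraform show -json tfplan > plan.json`, then `python check_plan.py plan.json` (the script name is an assumption). A non-zero exit code stops the job, so any plan that deletes or recreates resources needs an explicit human approval before it is applied.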