Q. Last restore test you ran (what failed, what changed after)

Across my experience from IBM to AWS environments at OpenText and Diehl Metering, I treat disaster recovery as something that must be tested in real scenarios, not just designed.

One practical restore scenario I handled was in a pre-production AWS environment where an incorrect Terraform change was merged by an apprentice, which unintentionally overrode parts of the infrastructure and caused a major disruption.

Restore Scenario

Pre-production AWS environment managed via Terraform
Incorrect PR merge led to unintended infrastructure changes
Required restoring environment to last known stable state

What Failed

Infrastructure state was impacted due to incorrect Terraform apply
Some resources were modified or recreated unexpectedly
Dependency issues appeared between services after the change

Actions Taken

Identified last known good Terraform state from remote backend (S3)
Reverted Terraform code to previous stable version
Re-ran Terraform apply to bring infrastructure back to expected state
Validated services after restore: - Checked application connectivity - Verified dependent services and configurations
Analysed root cause of failure: - Lack of strict validation before merge - Insufficient guardrails for production-like environments

What Changed After

Introduced stricter PR review and approval process
Added validation checks in CI/CD before Terraform apply
Restricted who can approve infrastructure changes
Improved separation between environments (pre-prod vs prod controls)
Strengthened Terraform workflow: - Better plan review visibility - Clear rollback approach using previous state

Outcome

Successfully restored infrastructure to stable state
Reduced risk of similar issues through stronger controls
Improved confidence in recovery using Terraform state

This reinforced the importance of treating Terraform state and version control as a recovery mechanism, not just a deployment tool.