Terraform Cloud/Enterprise State Management - When things go wrong
Managing infrastructure with Terraform can be a breeze when everything goes according to plan. However, we all know that the dreaded “something” can and will go wrong, especially when working with complex infrastructures. In these cases, it’s essential to have a plan in place for managing Terraform state in a break glass scenario. This is particularly important when your Terraform state is managed by Terraform Cloud or Enterprise.
In this blog post, we will discuss how to manage Terraform state when something goes wrong during a Kubernetes cluster deployment using Terraform and a Helm chart. We will cover how to manually taint, remove, and import elements from the state to fix the issue and reapply the configuration.
The Scenario
Let’s say you have used Terraform to build a Kubernetes cluster on Google Cloud. Subsequently, you used a Terraform-driven Helm chart to deploy your application onto the cluster. Unfortunately, something goes wrong during the Helm roll-out, and now Terraform cannot apply or destroy the infrastructure. You are receiving a timeout error while waiting for a particular element in the state to complete.
In this example, we’ll assume the state element we need to replace is module.my-app.helm_release.deployment. We’ve identified the cause of the issue by exploring the pod logs and have corrected the configuration in the Helm chart. Now we need to reapply the configuration by forcing Terraform to recreate that element.
Creating an API Key
Since the Terraform state is managed by Terraform Cloud, we need an API token to reach it from the command line. User API tokens belong to your account rather than to a single workspace. To create one, follow these steps:
- Log in to your Terraform Cloud account.
- Click on the “User Settings” dropdown and select “Tokens”.
- Click on “Create API token” and give it a name.
- Click “Generate” to create the token.
Make sure to store the token in a secure location. We’ll use it later to initialize the backend locally. Pro-Tip: set up Dynamic Creds using HashiCorp Vault and have the token fully managed that way instead.
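One way to keep the token out of files is to hand it to the CLI through an environment variable. The following is a minimal sketch, assuming Terraform 1.2 or later, which reads variables named TF_TOKEN_<hostname> with dots written as underscores and hyphens as double underscores. The helper names tf_token_var_name and load_tf_token are mine, not part of Terraform.

```shell
#!/bin/sh
# Sketch: pass the API token to Terraform via the environment, not a file.
# Assumes Terraform 1.2+; both function names are hypothetical helpers.

# Build the variable name for a hostname: hyphens become double
# underscores and dots become single underscores.
tf_token_var_name() {
  printf 'TF_TOKEN_%s\n' "$(printf '%s' "$1" | sed -e 's/-/__/g' -e 's/\./_/g')"
}

# Read the token from stdin without echoing it (when stdin is a terminal)
# and export it for the current shell session only.
load_tf_token() {
  host="${1:-app.terraform.io}"
  var="$(tf_token_var_name "$host")"
  printf 'API token for %s: ' "$host" >&2
  stty -echo 2>/dev/null
  IFS= read -r token
  stty echo 2>/dev/null
  printf '\n' >&2
  export "$var=$token"
  unset token
}
```

After sourcing the file, running load_tf_token sets the variable for that shell only, so the token disappears when the session ends.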
Initializing the Backend Locally
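To manage the state in a break glass scenario, we need to initialize the Terraform backend locally. To do this, follow these steps:
- Install the Terraform CLI on your local machine.
- Create a new directory for your Terraform code and navigate to it. Commands that only read or edit state need nothing more than the backend file below, but terraform apply evaluates the configuration in the working directory, so work from a full copy of your code if you plan to apply.
- Create a new file called backend.tf with the following contents:
terraform {
  backend "remote" {
    hostname     = "app.terraform.io"
    organization = "<your-organization>"

    workspaces {
      name = "<my_staging_app_environment>"
    }
  }
}
- Replace <your-organization> with your Terraform Cloud organization name and <my_staging_app_environment> with the name of your workspace.
- Run terraform login and paste your API token when prompted. Note that terraform init does not ask for the token itself; without stored credentials it fails and points you to terraform login. By default, terraform login saves the token in plain text in ~/.terraform.d/credentials.tfrc.json.
- Run terraform init to initialize the backend.
* Alternatively, you can put the token in a credentials block in the Terraform CLI configuration file, or in the token argument of the backend config. Personally, I prefer never to see a credential like that in clear text in a file, ever.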
*** Be aware: the token you have generated can perform any action your account is entitled to. Use it with caution, and revoke it when it is no longer needed.
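The backend file from the steps above can also be written by a small shell helper. This is a sketch under assumptions rather than a definitive script: write_backend_config is a hypothetical name, and the directory, organization, and workspace arguments are placeholders. It writes only backend.tf, so point it at a directory that already holds your configuration if you intend to run terraform apply from there.

```shell
#!/bin/sh
# Sketch: write the remote backend configuration into a working directory.
# write_backend_config is a hypothetical helper; all arguments are placeholders.
write_backend_config() {
  dir="$1"; org="$2"; workspace="$3"
  mkdir -p "$dir" || return 1
  # Unquoted heredoc so $org and $workspace are expanded into the file.
  cat > "$dir/backend.tf" <<EOF
terraform {
  backend "remote" {
    hostname     = "app.terraform.io"
    organization = "$org"

    workspaces {
      name = "$workspace"
    }
  }
}
EOF
}

# Usage (commented out so that sourcing this file has no side effects):
#   write_backend_config ./my-app-code my-org my_staging_app_environment
#   cd ./my-app-code && terraform init
```

If the directory already declares a backend elsewhere, Terraform will reject the duplicate, so check for an existing backend block first.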
Tainting the Element
Now that we have initialized the backend, we can force Terraform to replace the element that needs to be reconfigured. To do this, run the following command from the directory that holds your configuration:
terraform apply -replace="module.my-app.helm_release.deployment"
This plans the replacement of the module.my-app.helm_release.deployment element and, once you approve the plan, destroys and recreates it in the same run, ensuring that it matches the current configuration in the Helm chart.
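Before approving a replacement, it can help to confirm that the address really exists in state and to preview the change. The sketch below uses terraform state list and terraform plan -replace, both standard CLI commands. The helper name preview_replace is hypothetical, and the exact-match check assumes the resource has no count or for_each index in its address.

```shell
#!/bin/sh
# Sketch: confirm the address is in state, then preview the replacement.
# preview_replace is a hypothetical helper name.
preview_replace() {
  address="$1"
  # terraform state list prints one resource address per line;
  # -x and -F require an exact, literal match of the whole line.
  if ! terraform state list | grep -qxF -- "$address"; then
    echo "address not found in state: $address" >&2
    return 1
  fi
  # plan -replace shows the planned destroy and create without changing anything.
  terraform plan -replace="$address"
}

# Usage (commented out so that sourcing this file has no side effects):
#   preview_replace 'module.my-app.helm_release.deployment' &&
#     terraform apply -replace='module.my-app.helm_release.deployment'
```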
** Note: an earlier version of this post recommended approaching this problem with the taint command, which has since been deprecated. `terraform apply -replace` is now the recommended approach. Ref: https://developer.hashicorp.com/terraform/cli/commands/taint
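If replacing the element is not enough, the remove-and-import route mentioned in the introduction might look like the sketch below. terraform state pull, terraform state rm, and terraform import are standard CLI commands; the helper name forget_and_reimport, the backup filename, and the import ID are assumptions of mine. The Helm provider documents import IDs in the form namespace/release-name, but verify that against the provider version you run. The import step also needs the resource's configuration and any required variables available locally, and the backup file may contain secrets, so treat it accordingly.

```shell
#!/bin/sh
# Sketch: back up state, make Terraform forget a resource, then re-adopt it.
# forget_and_reimport is a hypothetical helper; import_id is provider-specific.
forget_and_reimport() {
  address="$1"; import_id="$2"
  backup="state-backup-$(date +%Y%m%d%H%M%S).tfstate"
  # Save a local copy of the remote state before changing it.
  terraform state pull > "$backup" || return 1
  # Remove the resource from state only; the real release is left untouched.
  terraform state rm "$address" || return 1
  # Bring the existing release back under Terraform management.
  terraform import "$address" "$import_id"
}

# Usage (commented out; default/my-app is a placeholder import ID):
#   forget_and_reimport 'module.my-app.helm_release.deployment' default/my-app
```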
Conclusion
When something goes wrong during a Terraform deployment, it’s essential to have a plan in place for managing the Terraform state in a break glass scenario. In this blog post, we have shown how to manually taint, remove, and import elements from the state to fix issues and reapply the configuration. There are probably a couple of approaches to get you out of a bind when things go wrong and you are using Terraform Cloud or Enterprise, but I’ve used this approach in the past and it works great!