Azure Resource Dependencies and Terraform Depends_On

We have deployed a full CI/CD pipeline in Azure DevOps that deploys our infrastructure into Azure using Terraform. We can even tear that infrastructure down with ‘terraform destroy’, so we can break down and rebuild it on demand, achieving full CI/CD. Sounds idyllic, right?

There was one slight problem. When we destroyed the infrastructure, the Azure DevOps pipeline failed with this error:

Error waiting for removal of Application Security Group for NIC "windowsNetworkInterface" (Resource Group "<name>"): Code="OperationNotAllowed" Message="Operation 'startTenantUpdate' is not allowed on VM '<name>' since the VM is marked for deletion. You can only retry the Delete operation (or wait for an ongoing one to complete)." Details=[]

Essentially, ‘terraform destroy’ was failing because it tried to remove the Application Security Group from the NIC while the virtual machine was already marked for deletion.

When creating our environment, we were standing up virtual machines with NSGs (Network Security Groups) and ASGs (Application Security Groups) attached to them. During our Terraform deployment, the NSGs and ASGs are attached to the virtual machine's network interface (NIC). Our team was hitting a blocker getting this process to work without manual intervention.

The first thing we did was divide and conquer the tasks. I looked into the Azure dependencies, while my colleague Julien Corioland started looking at our Terraform configuration. You can read his journey here. **Spoiler Alert** The resolution goes into Terraform Graph, and it’s a great read!

Azure Dependencies

I remembered that Azure networking enforces an explicit deletion order (like an order of operations) when deleting a virtual machine. An NSG or ASG that is associated with a subnet or network interface (i.e. a NIC on a VM) prevents the VM from being deleted until it is disassociated. To delete the NSG and/or ASG, you must first disassociate it from the resource. You can do this with Azure CLI or PowerShell scripts. To reference the deletion order, visit this link.

After reviewing the Azure documentation, the real question was whether this was an Azure API issue or a Terraform issue.

I went on the hunt to find out whether others had hit a similar issue. Turning to GitHub, I found open (and some closed) issues where other people had experienced something similar. Various issues had been opened with HashiCorp in which others struggled to delete a virtual machine that was tied to an address pool, NSG, LB or ASG. All of these attach to the NIC, and, as we know from the link above, the order in which they are deleted matters.

Terraform didn’t seem to be deleting our resources in the same order in which it built them, nor did it destroy them in the same order every time. According to the issues logged on GitHub, others had also found the ‘destroy’ operation to be unstable for their resources in Azure. While trying to reproduce the issue, we could sometimes delete resources without an error, and other times the deletion would fail.

In our code we were using modules to deploy our resources. Other users reported the issue whether or not modules were being used, and some had experienced intermittent failures on their deletions as well.

Another suggestion was that enforced policies on the subscription could be a contributing factor, but the error was also reproduced outside of the customer's subscription (where, yes, Azure Policy is in place).

The potential pitfall here is that Azure allows resources to be added asynchronously. For example, you can deploy a VM to a backend address pool of a load balancer asynchronously. But when you destroy the resources, the order absolutely matters.

The Fix

At this point I knew that we needed a way for Terraform to ‘know’ that we have dependencies that must be handled in a specific order. We had to find a way to manage a resource with explicit dependencies.

An implicit dependency in Terraform is the preferred and primary way for Terraform to know when there is a relationship between two objects. Let’s take the code snippet here:
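A minimal sketch of such a snippet, assuming a resource group labelled rg (the label used in the explanation that follows). The virtual network label and every name, region and address range are hypothetical values chosen for illustration:

```hcl
# Resource group that the virtual network is placed in.
resource "azurerm_resource_group" "rg" {
  name     = "example-rg" # hypothetical name
  location = "westeurope" # hypothetical region
}

# The references to azurerm_resource_group.rg below are what create
# the implicit dependency on the resource group.
resource "azurerm_virtual_network" "vnet" {
  name                = "example-vnet" # hypothetical name
  address_space       = ["10.0.0.0/16"]
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
}
```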

Terraform knows that the ‘azurerm_resource_group’ has to be created before the ‘azurerm_virtual_network’ because of the reference in the ‘resource_group_name’ argument. By calling ‘azurerm_resource_group.rg.name’ we create the implicit dependency. We could swap the two blocks of code and Terraform would still see the implicit dependency, so the order of your code doesn’t strictly matter in this case.

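In our case, the implicit dependency does not work: the virtual machine references the NIC, but nothing in its arguments references the ASG association. So we need to use an explicit dependency to manually tell Terraform where the dependencies exist, which is where depends_on comes in to save the day.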

It took only a small addition to the virtual machine resource to resolve our failing pipeline:

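resource "azurerm_virtual_machine" "vm" {
  # Other required arguments omitted for brevity.

  # Implicit dependency: the VM references the NIC directly.
  network_interface_ids = [azurerm_network_interface.nic.id]

  # Explicit dependency: the VM is created after the ASG association
  # and, on destroy, deleted before the association is removed.
  depends_on = [
    azurerm_network_interface_application_security_group_association.vm_asg_assoc
  ]
}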

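Once we set the explicit dependency between the virtual machine and the ASG association, the problem was fixed! Terraform destroys resources in the reverse of their dependency order, so the virtual machine is now deleted before the ASG is removed from the NIC.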
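For context, here is a minimal sketch of how the association named in depends_on might be declared. Only the labels nic and vm_asg_assoc come from the snippet above. The resource group, subnet and ASG references, along with all names and values, are assumptions for illustration, and the required arguments can differ between azurerm provider versions:

```hcl
# Hypothetical NIC; the label "nic" matches the reference in the VM block.
resource "azurerm_network_interface" "nic" {
  name                = "example-nic" # hypothetical name
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name

  ip_configuration {
    name                          = "internal"
    subnet_id                     = azurerm_subnet.subnet.id # assumed to be defined elsewhere
    private_ip_address_allocation = "Dynamic"
  }
}

# Attaches the ASG to the NIC. Nothing in the VM block references this
# resource, so Terraform cannot infer an ordering between the two
# without depends_on.
resource "azurerm_network_interface_application_security_group_association" "vm_asg_assoc" {
  network_interface_id          = azurerm_network_interface.nic.id
  application_security_group_id = azurerm_application_security_group.asg.id # assumed to be defined elsewhere
}
```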

If you would like to read more about explicit output dependencies, go here. HashiCorp advises against using them for data resource dependencies, but gives great guidance on using them for resource dependencies.

This issue was a great learning adventure, and I really enjoyed pairing up with my colleague Julien! Next time I’m working with modules and explicit dependencies, I know exactly how to proceed.
