Foundations of CI/CD with Azure Data Factory using Terraform - part 2

Where we left off

In my previous blog post, I explored the possibilities of marrying the descriptions of ADF pipelines and their generated ARM templates with the Terraform scripts that describe the other parts of the cloud infrastructure. At the end of that post, we saw that these two concepts can work together, but there is still some room for improvement, which I will explain in the following sections.

Generate parameters JSON

As mentioned in that blog post, the parameters JSON file of the ARM template contains the pieces of information that vary across different environments, and it can easily be generated using some custom scripting. Let's look at such a JSON file as a reminder below:
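The parameter names and values here are only illustrative placeholders, not the ones from the original project, but they show the typical shape of the file that ADF generates:

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "factoryName": {
      "value": "adf-example-dev"
    },
    "AzureSqlDatabase_connectionString": {
      "value": "Server=tcp:<server-name>.database.windows.net,1433;Initial Catalog=<database-name>;"
    },
    "AzureSqlDatabase_privateEndpoint_properties_privateLinkResourceId": {
      "value": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Sql/servers/<server-name>"
    },
    "AzureSqlDatabase_privateEndpoint_properties_groupId": {
      "value": "sqlServer"
    }
  }
}
```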

As you can see, some parameters are constant values (like the private endpoint group identifiers and the connection string templates using global parameters), and the ARM template already contains them as default values. Others are resource IDs that are known in the Terraform scripts and, when new resources are added to the infrastructure, will only be known after creation (like the private endpoint resource IDs or the factory name). The conclusion is that we can omit the constant values, as they are the default values in the ARM template anyway, and generate the parameters file with Terraform for the rest of them, using a script along the following lines:
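The snippet below is only a minimal sketch of the approach: the resource names (azurerm_private_endpoint.sql, azurerm_private_endpoint.key_vault), the generate_arm_parameters.sh helper and the arm_parameters_environment local (shown a bit further below) are illustrative placeholders.

```hcl
resource "null_resource" "arm_parameters_json" {
  # timestamp() yields a new value on every run, so the provisioner
  # fires on every "terraform apply"
  triggers = {
    always_run = timestamp()
  }

  provisioner "local-exec" {
    # hypothetical helper that writes every ARM_PARAMETER_* environment
    # variable into the parameters JSON file
    command     = "${path.module}/scripts/generate_arm_parameters.sh"
    environment = local.arm_parameters_environment
  }

  # the private endpoints have to exist before their IDs can be written out
  depends_on = [
    azurerm_private_endpoint.sql,
    azurerm_private_endpoint.key_vault,
  ]
}
```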

To make the script reusable, I made the parameters that should end up in the output follow a naming convention with an ARM_PARAMETER_ prefix, and the location of the output file can also be controlled through environment variables.

We need help from the null_resource Terraform resource, which should run every time “terraform apply” is executed. To make that work, we pass the arguments as environment variables and set up the dependencies correctly. In our case, we only depend on the private endpoints, and to force the null_resource to be recreated on every run, we can use its triggers argument and set it to something that is different every time the script is executed. The output of the timestamp() function is a good fit for that.

I created the arm_default_parameters local variable earlier to let us add, in an extensible way, parameters that are not as dynamic as the private endpoints. On top of those, we add the private link resource IDs and the output file name, which are specific to the environment:
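Sketched out, with placeholder names for the azurerm resources and the variable holding the output path, the locals could look something like this:

```hcl
locals {
  # parameters that are not as dynamic as the private endpoints; extend as needed
  arm_default_parameters = {
    "ARM_PARAMETER_factoryName" = azurerm_data_factory.this.name
  }

  # everything the generator script receives: the default parameters,
  # the private link resource IDs and the environment-specific output file
  arm_parameters_environment = merge(
    local.arm_default_parameters,
    {
      "ARM_PARAMETER_sqlServer_privateLinkResourceId" = azurerm_mssql_server.this.id
      "ARM_PARAMETER_keyVault_privateLinkResourceId"  = azurerm_key_vault.this.id

      # consumed by the helper script, not emitted as a parameter
      "OUTPUT_FILE" = var.arm_parameters_output_file
    }
  )
}
```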

Building ARM templates automatically

It was not highlighted in the previous post, but there was one manual step that every developer needed to do after a pull request to the main branch was approved: publishing via Azure Data Factory Studio, which essentially means hitting the button that generates the ARM templates into the release branch configured in the Studio. This is not a huge issue, but it would still be more elegant to have it built into the CI/CD pipeline, so that we do not rely on a human for that step.

Fortunately, there is an NPM package that can do that job for you from the command line. That’s great news, and it is pretty straightforward to use based on its readme (you feed it the root directory where ADF puts its files, the resource ID of the data factory, and the expected output directory); the more interesting part is how to set this up in the pipeline.

My original plan was that when a pull request is opened or updated in Azure DevOps, a pipeline runs and commits the ARM templates to the branch. It is actually doable; you just need to grant the build agent access to the repository. This does its job, but it turned out not to be so nice: when the pipeline pushes to its branch, it triggers itself with that push and cancels its most recent run. It does not get into an endless loop, because the second run does not commit anything to the branch.

Azure DevOps has some tricks to avoid this: you can add the [skip ci] hint to the commit message, there are include/exclude path filters for files, and there is the autocancel feature for the PR trigger in the YAML file. Unfortunately, for different reasons, all of them failed:

  • [skip ci] is not honored by the PR trigger: if the pipeline is a mandatory check for PRs based on the branch policy (which I did set up), honoring the hint would allow developers to skip that check, which is not acceptable. That actually makes sense.
  • Include/exclude path filters: I tried including the directory of my ADF descriptor JSONs, or excluding the build output directory, but it seems the filter restricts not the content of the individual pushes but the whole pull request, so that will not work in our case.
  • Autocancel: it would not help with the triggering itself, but at least it would be nicer not to have every second pipeline run end up cancelled; however, Azure DevOps does not seem to take this setting into account either if the pipeline is a mandatory check.

My suggestion to solve this issue is to use separate branches or, even better, separate repositories: have one repo for the ADF descriptions, and one for all the DevOps stuff (your Terraform templates, pipelines and the ARM templates generated for the ADF pipelines and their setup). That also introduces some complexity in matching the right version of the ADF setup with the ARM templates generated from it, but this can be handled, e.g. with matching tags.

Note that this step does not belong in the Terraform script: we could generate the ARM templates in a similar way as we did with the parameters JSON and apply them with another null_resource, but this is also Infrastructure as Code, and it belongs in the repository, committed.

See my pipeline as an inspiration here – it was created for a single-repository setup:
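The YAML below is only a rough sketch of such a pipeline, assuming the NPM package in question is @microsoft/azure-data-factory-utilities (wrapped as the build script of a package.json in a build/ folder), the ADF JSON descriptions live under adf/ and the generated templates go to arm-templates/. All of these paths and variable names are placeholders, and committing back also requires the build service account to have Contribute permission on the repository.

```yaml
trigger: none   # the pipeline runs as a branch-policy build validation on pull requests

pool:
  vmImage: ubuntu-latest

variables:
  # resource ID of the data factory the JSON descriptions belong to
  factoryId: /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.DataFactory/factories/<factory-name>

steps:
  - checkout: self
    persistCredentials: true       # keep the access token so we can push later

  - script: |
      # switch from the detached PR merge commit to the PR source branch
      BRANCH="$(System.PullRequest.SourceBranch)"
      BRANCH="${BRANCH#refs/heads/}"
      git fetch origin "+refs/heads/$BRANCH:refs/remotes/origin/$BRANCH"
      git checkout -B "$BRANCH" "origin/$BRANCH"
      echo "##vso[task.setvariable variable=sourceBranch]$BRANCH"
    displayName: Check out PR source branch

  - task: NodeTool@0
    displayName: Install Node.js
    inputs:
      versionSpec: 18.x

  - script: |
      npm install
      # "build" maps to the @microsoft/azure-data-factory-utilities entry point in package.json
      npm run build validate "$(Build.SourcesDirectory)/adf" "$(factoryId)"
      npm run build export "$(Build.SourcesDirectory)/adf" "$(factoryId)" "$(Build.SourcesDirectory)/arm-templates"
    workingDirectory: build
    displayName: Generate ARM templates

  - script: |
      git config user.email "build@example.invalid"
      git config user.name "Azure DevOps build"
      git add arm-templates
      # only commit and push when the generated templates actually changed
      git diff --cached --quiet || (git commit -m "Regenerate ARM templates" && git push origin "$(sourceBranch)")
    displayName: Commit generated ARM templates
```

When this push re-triggers the pipeline, the second run finds nothing new to commit, which is exactly the behaviour described above.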

Other to-dos for projects

For a real project, it is also worth noting that besides the required pipeline mentioned in the previous section, it can be a good idea to add a pipeline that checks, before merging, that the ADF descriptions and the ARM templates generated from them match. Also, to make sure that new code does not break anything that already works, I would suggest allowing only the fast-forward and rebase merge types, which are heavily used by a streamlined workflow.

If you want to have your ADF and other resources (SQL database, Key Vault) completely separated from the internet and placed in a private network, you need a self-hosted Azure DevOps build agent in that network. This can be a virtual machine, a virtual machine scale set, some machines in a node pool of a Kubernetes cluster, or even a container in Azure Container Apps.

Conclusion

To sum it up, the complete workflow can be the following (light green boxes are manual steps, dark green ones are automated):

We can see that setting up CI/CD here is somewhat different from what we do for web applications, but some of the same principles can be applied to this kind of scenario. This is really good news for us, because we need to automate as much as we can to make sure that we are building reliable and reproducible cloud systems. I hope I was able to help you set up some automation for your data integration project, and stay tuned for more!