Foundations of CI/CD with Azure Data Factory using Terraform

What is Azure Data Factory?

Azure Data Factory, or simply ADF, is Microsoft's solution for data integration in the cloud. It is a fully managed, serverless service that supports around 90 different connectors that can be used as dataset sources and sinks. Some of them are generic, like HTTP, ODBC, or the file system, while others are service-specific, like Jira, Magento, or ServiceNow. Additionally, it supports the standard Azure services such as Blob Storage, Azure SQL Database, and Azure Cosmos DB.

Azure Data Factory dataset options

It offers a user-friendly interface where pipelines (in ADF terminology, logical groupings of activities triggered by specific rules) can be built without the need for extensive coding. At the same time, data transformation is possible by executing your own stored procedures. ADF serves as the cloud alternative to Microsoft SSIS and can also run SSIS packages within its runtime for migration purposes.

ADF can manage its own virtual network, allowing secure connections to other Azure resources with minimal configuration. In the following sections, we will demonstrate setting up a sample pipeline and use Terraform to configure the surrounding infrastructure.
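As a first taste of what that looks like in Terraform, the minimal sketch below (all names are placeholders, and the azurerm provider is assumed to be configured) creates an ADF instance with its managed virtual network enabled, plus an Azure integration runtime that runs inside that network:

resource "azurerm_resource_group" "rg" {
  name     = "rg-adf-stadiums-demo" # hypothetical name
  location = "westeurope"
}

resource "azurerm_data_factory" "adf" {
  name                            = "adf-stadiums-demo" # hypothetical name
  location                        = azurerm_resource_group.rg.location
  resource_group_name             = azurerm_resource_group.rg.name
  managed_virtual_network_enabled = true

  identity {
    type = "SystemAssigned" # used later for the role assignments
  }
}

# Integration runtime that executes activities inside the managed virtual network.
resource "azurerm_data_factory_integration_runtime_azure" "ir" {
  name                    = "ir-managed-vnet"
  data_factory_id         = azurerm_data_factory.adf.id
  location                = azurerm_resource_group.rg.location
  virtual_network_enabled = true
}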

Azure Data Factory built-in CI/CD capabilities

While pipelines can be created through the UI, the question often arises of how to reuse them across different environments or projects, and how to allow parallel contributions to the same project without affecting the upper environments.

ADF addresses this elegantly by enabling integration with Azure DevOps or GitHub repositories. Feature and bugfix branches can be created and pull requests opened to the main branch directly from ADF Studio, and everything you build becomes part of the repository. Bitbucket and other Git providers aren't supported by this integration.

In the background, ADF commits the changes, described as JSON files, to the feature branch after each save. Pull requests can be reviewed in Azure DevOps or GitHub before approval. In this Git repository mode, changes aren't immediately applied to the ADF instance; they have to be published first. Thus, experimentation and mistakes are possible, though strongly discouraged for production environments.
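Looking ahead to the Terraform setup described later in this article, the same Git integration can also be captured in code: the azurerm provider exposes it as a nested vsts_configuration (Azure DevOps) or github_configuration block on the data factory resource. A hedged sketch for an Azure DevOps repository, with placeholder organization and repository names:

data "azurerm_client_config" "current" {}

# Sketch only: attach the development factory to an Azure DevOps Git repository.
resource "azurerm_data_factory" "adf_dev" {
  name                = "adf-stadiums-dev" # placeholder
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name

  vsts_configuration {
    account_name    = "my-devops-org" # Azure DevOps organization
    project_name    = "stadiums"
    repository_name = "adf-stadiums" # repository that stores the ADF JSON files
    branch_name     = "main"         # collaboration branch
    root_folder     = "/"
    tenant_id       = data.azurerm_client_config.current.tenant_id
  }
}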

Sample application – architecture

Our sample application is dead simple from an ETL point of view – it just grabs an Excel file with a list of stadiums from an Azure Blob Storage container and synchronizes the data into a table of an Azure SQL database. We would like to secure the communication between the different components of this demo application as well as we can: let Azure Data Factory manage its own virtual network, connect the three other components (Storage Account, Key Vault, and SQL Database) via private endpoints, and restrict public access.

Inside ADF, there will be two datasets (a SQL Database table and an Excel file in a storage account), one pipeline (a copy job using the two datasets), and that is it. The following diagram displays the high-level architecture of the demo. The two datasets connect to the Storage Account and the SQL server, and we store the password of the SQL server in the Key Vault. ADF's access rights to the Storage Account are defined by a role assignment, where the role is Storage Blob Data Reader and the principal is the managed identity of ADF. Similarly, the access rights to the Key Vault are granted by a role assignment where the role is Key Vault Secrets User. As the main aim of this sample application is to try out how to move the logic of Data Factory between different environments using an automated process, the Git integration of Data Factory is turned on, so everything we do in the studio is saved to a repository in the form of JSON objects.

High-level architecture of the demo application
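In Terraform, the two role assignments described above are small resources that reuse ADF's system-assigned managed identity. A minimal sketch, assuming the storage account, the Key Vault (with RBAC authorization enabled), and the data factory are defined elsewhere under these hypothetical names:

# ADF reads the Excel file from Blob Storage with this role.
resource "azurerm_role_assignment" "adf_blob_reader" {
  scope                = azurerm_storage_account.stadiums.id
  role_definition_name = "Storage Blob Data Reader"
  principal_id         = azurerm_data_factory.adf.identity[0].principal_id
}

# ADF reads the SQL password secret from the Key Vault with this role.
resource "azurerm_role_assignment" "adf_kv_secrets_user" {
  scope                = azurerm_key_vault.stadiums.id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = azurerm_data_factory.adf.identity[0].principal_id
}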

Adding Terraform to the picture

If you take a closer look at what is committed to the repository used by ADF, you realize it contains everything that is related to ADF – its linked services, datasets, pipelines, settings, global parameters, managed private endpoint connections – but it does not contain anything else: the Key Vault, the SQL database, the Azure Storage Account, or the role assignments described earlier. Why would it? They were not created and configured within ADF Studio; they are merely linked from there. Also, they will not necessarily be in the same resource group or subscription in a real-world scenario, so that is fine.

That was the point where I wanted to see how this works if I implement the rest of the ADF-related infrastructure with Terraform. It turns out to be straightforward, and we can separate the deployment of our resources based on the following guidelines:

  • Everything that is related to ADF will be managed by the ARM templates published via ADF Studio. I configured the publish so that the global parameters are not included in the ARM template, because they vary across environments.
  • Other parts of the solution (the other Azure services, the global parameters of ADF, the managed private endpoint connections) are part of the Terraform scripts, because they are either not part of ADF and merely connect to it, or depend on data that only becomes known at the time of the Terraform deployment.

There are two things to mention at this point that might not be straightforward looking at Microsoft’s documentation:

  • Git integration should only be enabled for environments where actual development happens. If we want to deploy the ARM templates generated by the publish of the studio, then we need to turn it off for those instances of ADF.
  • Managed private endpoints fall into both categories of resources mentioned above: they can be configured via the studio, but the actual resource that they point to is only known when running the Terraform scripts. As they cannot be excluded from the ADF publish, they end up in both infrastructure-as-code solutions (see the Terraform sketch after this list). More notes on this later in the conclusion section.
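On the Terraform side, a managed private endpoint is a single resource. The sketch below points one at the SQL server; the SQL server resource name is hypothetical:

# Managed private endpoint from ADF's managed virtual network to the SQL server.
resource "azurerm_data_factory_managed_private_endpoint" "sql" {
  name               = "pe-sqlserver"
  data_factory_id    = azurerm_data_factory.adf.id
  target_resource_id = azurerm_mssql_server.stadiums.id
  subresource_name   = "sqlServer" # the sub-resource (group ID) to connect to
}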

So, setting up a brand-new environment is a four-step process, while updating an existing one takes only the last three steps:

  1. When adding a new environment, define the variables for both the ARM templates and the Terraform scripts.
  2. Publish your changes from ADF Studio to the release branch created by it.
  3. Update your infrastructure using the Terraform scripts that you have created.
  4. Apply the changes that you published as ARM templates in step 2 (see the sketch below). The variables for that deployment should be unique to the instance that you are updating.
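Step 4 boils down to a single ARM deployment per environment. A hedged sketch using the Azure CLI, assuming the default file names ADF Studio generates in its publish branch and a hypothetical per-environment parameters file:

# Sketch of deploying the published ARM template to the test environment.
az deployment group create `
    --resource-group "rg-adf-stadiums-test" `
    --template-file "ARMTemplateForFactory.json" `
    --parameters "@ARMTemplateParametersForFactory.test.json" `
    --parameters factoryName="adf-stadiums-test"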

At this point, we have a working solution. Kind of…

Private endpoint connection headache

One manual step still crept into the process (but only upon creation): approving the connections of the managed private endpoints that we created in Terraform from ADF's virtual network to the other resources. This turned out not to be straightforward, due to the way the different resources are linked and the lack of access to some of them. When you create a managed private endpoint in an Azure Data Factory that manages its own virtual network, the virtual network and the actual private endpoint are created in a different resource group in a different subscription that you do not have access to. All you have in your Terraform scripts are two things: the ID of the managed private endpoint and the resource that ADF wants to connect to. The relationship between these resources looks like the following:


Relationship between the managed private endpoint, the actual private endpoint, and the target resource

As you have access neither to the actual private endpoint nor to the virtual network, the task is to programmatically find the private endpoint connection on the target resource (the SQL server in the example above) that belongs to the managed private endpoint. Fortunately, if you query the details of the managed private endpoint with the Azure CLI (the datafactory extension gets downloaded when you first use it), it returns the ID of the actual private endpoint, and you can query the private endpoint connections of a resource the same way. By filtering the connection list on that ID, the correct connection can be found and approved. Unfortunately, this cannot be done in pure Terraform, so it is handled by a PowerShell script that can be parametrized with environment variables:
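A hedged reconstruction of such a script is shown below. The environment variable names are placeholders chosen for this article, and the exact property that exposes the actual private endpoint's ID should be verified against the output of your CLI and extension version:

$ErrorActionPreference = "Stop"

# 1. Ask ADF for the managed private endpoint; the response includes the ID of the
#    actual private endpoint living in ADF's own subscription (assumed property path).
$managedPe = az datafactory managed-private-endpoint show `
    --factory-name $env:ADF_NAME `
    --resource-group $env:ADF_RESOURCE_GROUP `
    --managed-virtual-network-name "default" `
    --name $env:MANAGED_PE_NAME | ConvertFrom-Json

$privateEndpointId = $managedPe.properties.privateEndpoint.id

# 2. List the private endpoint connections of the target resource (the SQL server
#    in our example) and pick the one created by that private endpoint.
$connections = az network private-endpoint-connection list `
    --id $env:TARGET_RESOURCE_ID | ConvertFrom-Json

$connection = $connections | Where-Object {
    $_.properties.privateEndpoint.id -eq $privateEndpointId
}

# 3. Approve the connection.
az network private-endpoint-connection approve `
    --id $connection.id `
    --description "Approved from Terraform"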

The local-exec provisioner is used to run a script on the machine executing Terraform after a resource is created. One would think that we can simply attach it to the managed private endpoint resource and be done. That is only true if the whole resource, including the custom script that we run, can be created without any error. Otherwise, if the resource gets created but the script fails, we are stuck: subsequent runs will fail because the endpoint already exists in Azure, yet it is not recorded properly in Terraform's state.

This is where Terraform's null_resource comes to the rescue. It is an empty resource that does nothing except implement the lifecycle of a Terraform resource. If we attach the local-exec provisioner to it instead, the managed private endpoint always gets into the Terraform state properly, and only the approval is retried if the script fails. Putting all of this together:
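A sketch of the resulting Terraform configuration, assuming the managed private endpoint resource from earlier and a hypothetical approve-private-endpoint.ps1 containing the script above:

# Requires the hashicorp/null provider. The endpoint itself is created normally;
# only the approval step is wrapped in a null_resource.
resource "null_resource" "approve_sql_private_endpoint" {
  # Re-run the approval whenever the managed private endpoint is re-created.
  triggers = {
    managed_private_endpoint_id = azurerm_data_factory_managed_private_endpoint.sql.id
  }

  provisioner "local-exec" {
    interpreter = ["pwsh", "-File"]
    command     = "${path.module}/approve-private-endpoint.ps1"

    environment = {
      ADF_NAME           = azurerm_data_factory.adf.name
      ADF_RESOURCE_GROUP = azurerm_resource_group.rg.name
      MANAGED_PE_NAME    = azurerm_data_factory_managed_private_endpoint.sql.name
      TARGET_RESOURCE_ID = azurerm_mssql_server.stadiums.id
    }
  }
}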


It finally works when executed from your local machine!



The next steps

Although this is working, there is still room for improvement:

  • This script can be executed from a local machine, but in a real-world scenario it should be run by an Azure DevOps pipeline, triggered by tagging stable versions of ADF's publish branch. At the same time, the Terraform state should be moved to Azure Blob Storage or some other centralized backend for environments that are used by multiple people (see the backend sketch after this list).
  • The parameters JSON of the ARM template (remember, the one that describes the setup of the Azure Data Factory) contains a mix of static values and values that could be generated from Terraform's outputs. It would be great to automate producing it, since if its content differs from the setup created by Terraform, the solution stops working.
  • We made ADF communicate with the other resources over private connections, but if we want to disable public access for good, we also need to run the Terraform scripts from a virtual network that reaches these resources privately; otherwise even simple Terraform operations would fail (e.g., listing the blob containers of a storage account).
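For the first point, moving the state is mostly a backend configuration; a minimal sketch with placeholder names for the state storage account:

terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"  # placeholders
    storage_account_name = "sttfstatestadiums"
    container_name       = "tfstate"
    key                  = "adf-stadiums.test.tfstate"
  }
}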

Summary

To sum up, Azure Data Factory is well prepared for CI/CD best practices, and it is great to see that Terraform and ARM templates can work together smoothly. If you are looking for a partner who can help you move your data to the cloud, contact us!