Azure Data Factory, or simply ADF, is Microsoft's solution for data integration in the cloud. It is a fully managed, serverless service that supports around 90 different connectors that can be used as datasets (both for input and output). Some of them are generic, such as HTTP, ODBC, or the file system, while others are service-specific, such as Jira, Magento, or ServiceNow. Additionally, it supports a variety of standard Azure services such as Blob Storage, Azure SQL, or Azure Cosmos DB.
It offers a user-friendly interface where pipelines (processes triggered by specific rules, in ADF terminology) can be built without extensive coding. At the same time, data can be transformed, for example by executing a custom stored procedure. ADF serves as a cloud-based alternative to Microsoft SSIS and can also run SSIS packages within its runtime for migration purposes.
ADF can manage its own virtual network, allowing secure connections to other Azure resources with minimal configuration. In the following sections, we will demonstrate how to set up a sample pipeline and configure all of this with Terraform.
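As a first taste of what that looks like, here is a minimal sketch of the factory itself using the azurerm Terraform provider, with the managed virtual network turned on; all resource names are placeholders rather than the exact ones used in the demo:

```hcl
# Sketch: a Data Factory with a managed virtual network and a system-assigned
# identity, plus an Azure integration runtime that runs inside that network.
# All names are placeholders.
resource "azurerm_resource_group" "demo" {
  name     = "rg-adf-demo"
  location = "West Europe"
}

resource "azurerm_data_factory" "demo" {
  name                = "adf-stadiums-demo"
  location            = azurerm_resource_group.demo.location
  resource_group_name = azurerm_resource_group.demo.name

  # Let ADF manage its own virtual network.
  managed_virtual_network_enabled = true

  # The managed identity is what we will later grant access to the Storage
  # Account and the Key Vault.
  identity {
    type = "SystemAssigned"
  }
}

resource "azurerm_data_factory_integration_runtime_azure" "managed_vnet" {
  name                    = "ir-managed-vnet"
  data_factory_id         = azurerm_data_factory.demo.id
  location                = azurerm_resource_group.demo.location
  virtual_network_enabled = true
}
```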
While pipelines can be created through the UI, the question often arises of how to reuse them across different environments or projects, and how to let people contribute to the same project in parallel without affecting the upper environments.
ADF addresses this elegantly by integrating with Azure DevOps or GitHub repositories. Feature and bugfix branches can be created, and pull requests to the main branch can be opened right from ADF Studio, whose content becomes part of the repository. Bitbucket and other Git providers are not supported by this integration.
In the background, ADF commits the changes, described as JSON files, to the feature branch after each save. Pull requests can then be reviewed in Azure DevOps or GitHub before approval. In this Git-backed mode, alterations are not applied to the ADF instance immediately; they have to be published first. Thus, experimentation and mistakes are possible, though strongly discouraged for production environments.
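If the factory itself is deployed with Terraform, this Git integration can be declared there as well. Below is a hedged sketch that repeats the factory resource from the earlier snippet and extends it with a GitHub configuration; the account and repository names are made up:

```hcl
# Sketch: the factory resource from the earlier snippet, extended with Git
# integration. Account and repository details are placeholders.
resource "azurerm_data_factory" "demo" {
  name                            = "adf-stadiums-demo"
  location                        = azurerm_resource_group.demo.location
  resource_group_name             = azurerm_resource_group.demo.name
  managed_virtual_network_enabled = true

  identity {
    type = "SystemAssigned"
  }

  github_configuration {
    account_name    = "my-github-org"
    git_url         = "https://github.com"
    repository_name = "adf-stadiums"
    branch_name     = "main"
    root_folder     = "/"
  }

  # For Azure DevOps, a vsts_configuration block (account_name, project_name,
  # repository_name, branch_name, root_folder, tenant_id) is used instead.
}
```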
Our sample application is dead simple from an ETL point of view: it grabs an Excel file with a list of stadiums from an Azure Blob Storage container and synchronizes the data into a table of an Azure SQL database. We would like to secure the communication between the different components of this demo application as well as we can: let Azure Data Factory manage its own virtual network, connect the three other components (Storage Account, Key Vault, and SQL database) via private endpoints, and restrict public access.
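In Terraform, switching off public access on the three backing services comes down to a single argument each. The sketch below shows only the parts relevant here; the names, SKUs, and the password variable are placeholders, not the full configuration of the demo:

```hcl
# Sketch: the three backing services with public network access disabled.
# Only the arguments relevant to this post are shown; names, SKUs and the
# variable below are placeholders.
data "azurerm_client_config" "current" {}

variable "sql_admin_password" {
  type      = string
  sensitive = true
}

resource "azurerm_storage_account" "stadiums" {
  name                          = "ststadiumsdemo"
  resource_group_name           = azurerm_resource_group.demo.name
  location                      = azurerm_resource_group.demo.location
  account_tier                  = "Standard"
  account_replication_type      = "LRS"
  public_network_access_enabled = false
}

resource "azurerm_key_vault" "stadiums" {
  name                          = "kv-stadiums-demo"
  resource_group_name           = azurerm_resource_group.demo.name
  location                      = azurerm_resource_group.demo.location
  tenant_id                     = data.azurerm_client_config.current.tenant_id
  sku_name                      = "standard"
  enable_rbac_authorization     = true
  public_network_access_enabled = false
}

resource "azurerm_mssql_server" "stadiums" {
  name                          = "sql-stadiums-demo"
  resource_group_name           = azurerm_resource_group.demo.name
  location                      = azurerm_resource_group.demo.location
  version                       = "12.0"
  administrator_login           = "sqladmin"
  administrator_login_password  = var.sql_admin_password
  public_network_access_enabled = false
}
```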
Inside ADF, there will be two datasets (the SQL database table and the Excel file in the storage account), one pipeline (a copy job using the two datasets), and that is it. The following diagram displays the high-level architecture of the demo. The two datasets connect to the Storage Account and the SQL server, and we store the password of the SQL server in the Key Vault. ADF's access rights to the Storage Account are defined by a role assignment, where the role is Storage Blob Data Reader and the principal is the managed identity of ADF. Similarly, access to the Key Vault is granted by another role assignment with the Key Vault Secrets User role. As the main aim of this sample application is to try out how to move the logic of Data Factory between different environments using an automated process, the Git integration of Data Factory is turned on, so everything we do in the studio is saved to a repository in the form of JSON objects.
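The two role assignments from the diagram translate directly into Terraform. This is a sketch that builds on the hypothetical resources from the previous snippets:

```hcl
# Sketch: grant ADF's system-assigned identity the two roles described above.
resource "azurerm_role_assignment" "adf_blob_reader" {
  scope                = azurerm_storage_account.stadiums.id
  role_definition_name = "Storage Blob Data Reader"
  principal_id         = azurerm_data_factory.demo.identity[0].principal_id
}

resource "azurerm_role_assignment" "adf_key_vault_secrets_user" {
  scope                = azurerm_key_vault.stadiums.id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = azurerm_data_factory.demo.identity[0].principal_id
}
```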
If you take a closer look at what is committed to the repository used by ADF, you realize that it contains everything related to ADF (its linked services, datasets, pipelines, settings, global variables, and managed private endpoint connections), but nothing else: the Key Vault, the SQL database, the Azure Storage Account, and the role assignments described earlier are not in it. Why would they be? They were not created or configured within ADF Studio; they are only referenced from there. They will also not necessarily live in the same resource group or subscription in a real-world scenario, so that is fine.
That was the point where I wanted to see how this works if I implement the rest of the ADF-related infrastructure with Terraform. That part is straightforward, and we can separate the deployment of our resources based on the following guidelines:
There are two things to mention at this point that might not be obvious from Microsoft's documentation:
So, setting up a brand-new environment or updating an existing one should always be a four- or three-step process:
At this point, we have a working solution. Kind of…
One manual step still remains in the process (but only upon creation): approving the connection of the managed private endpoint that we created in Terraform from ADF's virtual network to the other resources. This turned out not to be straightforward, due to the link between the different resources and our lack of access to them. When you create a managed private endpoint in Azure Data Factory while it manages its own virtual network, that virtual network and the actual private endpoint are created in a different resource group of a different subscription, which you do not have access to. All you have in your Terraform scripts are two things: the ID of the managed private endpoint and the resource that ADF wants to connect to. The relationship between the private endpoint and these resources looks like the following:
As you do not have access to the actual private endpoint or to the virtual network, the task is to find, programmatically, the private endpoint connection of the target resource (the SQL server in the example above) that is linked to the managed private endpoint. Fortunately, if you query the details of the managed private endpoint with the Azure CLI (its datafactory extension gets downloaded the first time you use it), it returns the ID of the actual private endpoint, and you can list the private endpoint connections of a resource the same way. By filtering the connection list on that ID, the correct connection can be found and approved. Unfortunately, this cannot be done in Terraform, so it ends up in a PowerShell script that can be parametrized with environment variables.
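A sketch of such a script, driving the Azure CLI from PowerShell, is shown below. The environment variable names are ones I made up for illustration, and the exact JSON property paths returned by the CLI are assumptions that may differ between versions of the datafactory extension:

```powershell
# Sketch of the approval flow described above (hypothetical variable names;
# the JSON property paths returned by the CLI may differ between versions
# of the datafactory extension).
$ErrorActionPreference = "Stop"

# Parameters taken from environment variables.
$resourceGroup    = $env:ADF_RESOURCE_GROUP
$factoryName      = $env:ADF_NAME
$managedPeName    = $env:ADF_MANAGED_PE_NAME   # e.g. "pe-sql-stadiums"
$targetResourceId = $env:TARGET_RESOURCE_ID    # e.g. the SQL server's resource ID

# 1. Get the managed private endpoint; its details contain the ID of the
#    actual private endpoint living in the Microsoft-managed subscription.
$managedPe = az datafactory managed-private-endpoint show `
    --resource-group $resourceGroup `
    --factory-name $factoryName `
    --managed-virtual-network-name "default" `
    --name $managedPeName | ConvertFrom-Json
$privateEndpointId = $managedPe.privateEndpoint.id   # assumed property path

# 2. List the private endpoint connections of the target resource and find
#    the one that belongs to our managed private endpoint.
$connections = az network private-endpoint-connection list `
    --id $targetResourceId | ConvertFrom-Json
$connection = $connections | Where-Object {
    $_.properties.privateEndpoint.id -eq $privateEndpointId
}

# 3. Approve it.
az network private-endpoint-connection approve `
    --id $connection.id `
    --description "Approved by deployment script" | Out-Null
```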
The local-exec provisioner is used to run a script on the executor machine after a resource is created. One would think that we can attach it to the managed private endpoint resource in Terraform and be finished. That is only true if the resource can be created without any error, including the custom script that we run. Otherwise, if the resource gets created but the script fails, we are stuck: Terraform marks the freshly created endpoint as tainted, and subsequent runs want to destroy and recreate it instead of simply retrying the approval.
This is when Terraform’s null_resource comes to the rescue. This is an empty resource that does nothing, just implements the lifecycle of a Terraform resource, and if we set the local-exec provisioner to that, the private endpoint gets into the Terraform state properly, but the approval only in the case when it succeeds. Putting all these together:
It finally works when executed from your local machine!
Although this works, there is still room for improvement:
To sum up, Azure Data Factory is very well prepared for CI/CD best practices, and it is great to see that Terraform and ARM templates can work together smoothly. If you are looking for a partner who can help you move your data to the cloud, contact us!