On-demand firewall opening for Azure Data Factory access to Azure Synapse

adf azure security firewall

On Azure, most data services offer a firewall. Unfortunately, the details of those firewalls currently differ from service to service. As soon as a firewall is switched on for a service (e.g. Azure Data Lake Gen2, Azure Synapse, Azure Key Vault), Azure Data Factory can no longer access the resource by default and must be configured accordingly.

In this blog post, I want to demonstrate how to connect ADF and Synapse fairly securely without using a full Managed VNet Runtime of ADF, which would incur extra cost.

Firewalls of Azure data services

Within Azure most of the data services have a built-in firewall. So let's have a look at some examples:

Azure Synapse Firewall

The Synapse firewall offers the setting "Allow Azure services and resources to access this workspace", which gives any other Azure service the option to reach the SQL endpoint of the Synapse instance.

Azure Storage Account Firewall

The firewall of the storage account has more features. It offers four options for restricting network access:

  • Using full VNet integration - obviously requiring the accessing resource to be part of a connected VNet as well
  • Whitelisting specific IP addresses - requires prior knowledge of dedicated IP addresses
  • Whitelisting of specific resource instances - allows connectivity across subscription boundaries and was recently enhanced to support lots of services in addition to Synapse (see docs)
  • "Allow Azure services on the trusted services list" - very simple way to whitelist all resources within a subscription (see docs)

ADF access to Azure data services

If Azure Data Factory should now access the Storage Account in a secure way, this is easily done via the "trusted services" option after assigning an appropriate IAM role. To limit access to a specific ADF instance, one can use the resource instance whitelisting described above. Both options do not incur any additional cost and are easy to set up.

For the Azure Synapse SQL endpoint, this becomes more challenging. As described above, it only offers the setting "Allow Azure services and resources to access this workspace". This has the drawback that it exposes the service a lot wider than just the Azure services within a subscription. Unfortunately, at the moment there is neither resource instance whitelisting nor a trusted services setting. There is the option to whitelist certain IP ranges, though. Now the question becomes: which IP addresses represent ADF?

IP Ranges of Azure Data Factory

All Azure PaaS services use IP ranges owned by Azure. To make these ranges easier to understand, they are labeled with service tags that assign certain CIDR ranges to certain services (see docs).

One can query the details of those ranges with the Azure CLI, here run from PowerShell:

PS C:\Users\f.moeller> az network list-service-tags -l westeurope --query "values[?id == 'DataFactory.WestEurope'].properties.addressPrefixes[]" --output tsv | FindStr "\."

The resulting CIDR prefixes can be translated into IP ranges with the help of online calculators (e.g. jodies.de):

first host last host
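As an alternative to an online calculator, the CIDR-to-range translation can be sketched with Python's standard `ipaddress` module. The function name `cidr_to_range` is hypothetical, and `192.0.2.0/24` is a reserved documentation range used as a placeholder, not an actual Data Factory prefix. Note that this sketch returns the first and last address of the block (including the network and broadcast addresses), which differs by one address on each side from a calculator's "first host" and "last host".

```python
import ipaddress


def cidr_to_range(cidr: str) -> tuple[str, str]:
    """Return the first and last address of a CIDR block as strings."""
    # strict=True rejects prefixes with host bits set, e.g. "192.0.2.1/24".
    network = ipaddress.ip_network(cidr, strict=True)
    return str(network[0]), str(network[-1])


# Placeholder prefix from the reserved documentation range (not an ADF range).
start_ip, end_ip = cidr_to_range("192.0.2.0/24")
print(start_ip, end_ip)
```

The two returned values map directly onto the start and end address that an IP firewall rule expects.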

However, the issue with the above approach is that it still whitelists all Azure Data Factories within a region. As a result, you might want to limit how long the firewall is open, allowing access during pipeline execution but not at any other time. In the next sections, I will show how to dynamically adjust Synapse firewalls from Azure Data Factory.

IP whitelisting with the REST API

Luckily, Synapse offers a REST API to modify its firewall. The details of this can be found in the official documentation.

One just has to create a PUT request against the Azure Synapse instance via the Management API at https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Synapse/workspaces/{workspaceName}/firewallRules/{ruleName}?api-version=2021-03-01.
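To make the shape of that PUT request concrete, here is a minimal Python sketch that only builds the request without sending it. The function name, the placeholder subscription, resource group, workspace and token values are all hypothetical. The request body with `properties.startIpAddress` and `properties.endIpAddress` reflects my understanding of the documented payload and should be checked against the official documentation.

```python
import json
import urllib.request

MANAGEMENT_URL = "https://management.azure.com"
API_VERSION = "2021-03-01"


def build_firewall_rule_request(subscription_id, resource_group, workspace,
                                rule_name, start_ip, end_ip, bearer_token):
    """Build (but do not send) the PUT request that creates a firewall rule."""
    url = (
        f"{MANAGEMENT_URL}/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Synapse/workspaces/{workspace}"
        f"/firewallRules/{rule_name}?api-version={API_VERSION}"
    )
    # Assumed body shape; verify against the REST API documentation.
    body = {"properties": {"startIpAddress": start_ip, "endIpAddress": end_ip}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {bearer_token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical placeholder values; the IPs come from a documentation range.
request = build_firewall_rule_request(
    "00000000-0000-0000-0000-000000000000", "my-rg", "my-synapse",
    "adfload1", "192.0.2.0", "192.0.2.255", "<token>",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it, which requires a valid bearer token for the management endpoint.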

Moreover, the REST API exposes the current state of a firewall rule. This makes it possible to check whether the rule is still Provisioning or whether its creation has already Succeeded. This information can be retrieved with a GET request to https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Synapse/workspaces/{workspaceName}/firewallRules/{ruleName}?api-version=2021-03-01 which is documented as well.
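The wait-for-success logic can be sketched as a small polling loop. To keep the example runnable without Azure access, the GET request is abstracted behind a hypothetical `fetch_state` callable that returns the rule's provisioning state as a string. The handling of a "Failed" state and the timeout values are assumptions of this sketch, not something the post prescribes.

```python
import time
from typing import Callable


def wait_until_succeeded(fetch_state: Callable[[], str],
                         poll_seconds: float = 5.0,
                         max_attempts: int = 60) -> int:
    """Poll until the state is "Succeeded"; return the number of polls used."""
    for attempt in range(1, max_attempts + 1):
        state = fetch_state()
        if state == "Succeeded":
            return attempt
        if state == "Failed":  # assumed terminal error state
            raise RuntimeError("firewall rule provisioning failed")
        time.sleep(poll_seconds)
    raise TimeoutError("firewall rule did not reach 'Succeeded' in time")


# Simulated sequence of states instead of real GET requests.
states = iter(["Provisioning", "Provisioning", "Succeeded"])
polls = wait_until_succeeded(lambda: next(states), poll_seconds=0)
print(polls)
```

In a real client, `fetch_state` would issue the GET request and read the provisioning state from the response body.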

Fortunately, ADF comes with a Web Activity that can call a REST API natively without any further overhead. Since the Web Activity has supported MSI authentication since 2018, authentication is really simple: one just has to assign a corresponding IAM role to the ADF managed identity. For real environments I suggest a custom role; for a demo, the Contributor role will suffice.

ADF pipeline on-demand whitelisting

Firstly, we need a pipeline to whitelist an IP address. Secondly, this pipeline must wait for the whitelisting to succeed, because otherwise the firewall rule is not active yet.

Whitelisting IP address

And a waiting functionality:

Whitelisting IP address -- wait until success

As a last step, we loop over a variable of IP ranges. The variable ranges is defined as an array and contains:

    {"FirewallRuleName":"adfload1","StartIP":"",   "EndIP":""},
    {"FirewallRuleName":"adfload2","StartIP":"",  "EndIP":""},
    {"FirewallRuleName":"adfload3","StartIP":"",  "EndIP":""},
    {"FirewallRuleName":"adfload4","StartIP":"",   "EndIP":""},
    {"FirewallRuleName":"adfload5","StartIP":"",     "EndIP":""},
    {"FirewallRuleName":"adfload6","StartIP":"", "EndIP":""},
    {"FirewallRuleName":"adfload7","StartIP":"", "EndIP":""}

This pipeline first adds all the IP ranges to the firewall, then executes a child pipeline, and afterwards removes the ranges from the firewall again.
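The open, load, close sequence can be illustrated outside of ADF with a short Python sketch. The callables `open_rule`, `close_rule` and `run_load` are hypothetical stand-ins for the whitelisting pipeline, the rule deletion and the child pipeline. The `try`/`finally` cleanup is a design choice of this sketch: the post does not state whether the ADF pipeline also removes the rules when the child pipeline fails. The IP values are placeholders from a documentation range.

```python
from typing import Callable, Iterable, Mapping


def run_with_open_firewall(ranges: Iterable[Mapping[str, str]],
                           open_rule: Callable[[Mapping[str, str]], None],
                           close_rule: Callable[[str], None],
                           run_load: Callable[[], None]) -> None:
    """Open all firewall rules, run the load, then always close the rules."""
    opened = []
    try:
        for rule in ranges:
            open_rule(rule)
            opened.append(rule)
        run_load()
    finally:
        # Remove every rule that was opened, even if the load raised.
        for rule in opened:
            close_rule(rule["FirewallRuleName"])


# Placeholder ranges; real values would come from the service tag query.
ranges = [
    {"FirewallRuleName": "adfload1", "StartIP": "192.0.2.0", "EndIP": "192.0.2.127"},
    {"FirewallRuleName": "adfload2", "StartIP": "192.0.2.128", "EndIP": "192.0.2.255"},
]
log = []
run_with_open_firewall(
    ranges,
    open_rule=lambda rule: log.append(("open", rule["FirewallRuleName"])),
    close_rule=lambda name: log.append(("close", name)),
    run_load=lambda: log.append(("load", "child pipeline")),
)
print(log)
```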

Whitelisting IP address -- loop through ranges


For a fully fledged security setup, one can use the Managed Virtual Network feature of Azure Data Factory. However, for lower run-time costs and easier setup, a solution based on trusted services might be preferred.

The solution shown above allows you to significantly reduce the attack surface on the Synapse SQL endpoint compared to the setting "Allow Azure services and resources to access this workspace". On the one hand, Synapse is opened just for Azure Data Factory of the relevant region. On the other hand, Synapse is only opened while the ETL process is running and closed immediately afterwards.

I hope that in the future a feature like resource instance whitelisting becomes available across all services, which would significantly ease the network security of PaaS services.
