On Azure, most data services offer a firewall. Unfortunately, at the moment the details of those firewalls differ. As soon as a firewall is switched on for such a service (e.g. Azure Data Lake Gen2, Azure Synapse, Azure Key Vault), Azure Data Factory can no longer access the resource by default and must be configured accordingly.
In this blog post, I want to demonstrate how to connect ADF and Synapse fairly securely without resorting to a full Managed VNet Runtime of ADF, which would incur extra cost.
Within Azure most of the data services have a built-in firewall. So let's have a look at some examples:
The Azure Synapse firewall offers the setting "Allow Azure services and resources to access this workspace", which gives any other Azure service the option to reach the SQL endpoint of the Synapse instance.
The firewall of the storage account has more features. It has three different options of private network access:
If Azure Data Factory should now access the Storage Account in a secure way, this is easily done via the "trusted services" option after assigning an appropriate IAM role. If one wants to limit access to a specific ADF instance, one can use the resource instance whitelisting mentioned above. Neither option incurs any additional cost, and both are easy to set up.
In the case of the Azure Synapse SQL endpoint, this becomes more challenging. As described above, it only has the option "Allow Azure services and resources to access this workspace". The drawback is that this exposes the service far more widely than just to the Azure services within a subscription. Unfortunately, at the moment there is neither resource instance whitelisting nor a trusted services setting. There is the option to whitelist certain IP ranges, though. Now the question becomes: which IP addresses represent ADF?
All Azure PaaS services use IP ranges owned by Azure. To make those ranges manageable, they are labeled with service tags that assign certain CIDR ranges to certain services (see docs).
One can query the details of those ranges with the Azure CLI, here run from a PowerShell prompt:
PS C:\Users\f.moeller> az network list-service-tags -l westeurope --query "values[?id == 'DataFactory.WestEurope'].properties.addressPrefixes[]" --output tsv | FindStr "\."
13.69.67.192/28
13.69.107.112/28
13.69.112.128/28
40.74.24.192/26
40.74.26.0/23
40.113.176.232/29
52.236.187.112/28
The above can be translated into IP ranges with the help of online calculators (e.g. jodies.de):
first host | last host |
---|---|
13.69.67.193 | 13.69.67.206 |
13.69.107.113 | 13.69.107.126 |
13.69.112.129 | 13.69.112.142 |
40.74.24.193 | 40.74.24.254 |
40.74.26.1 | 40.74.27.254 |
40.113.176.233 | 40.113.176.238 |
52.236.187.113 | 52.236.187.126 |
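As an alternative to an online calculator, the same translation can be sketched with Python's standard `ipaddress` module. The function name `cidr_to_host_range` is made up for this illustration. Like the table above, it skips the network and broadcast address of each block; if the whole block should be covered, `network[0]` and `network[-1]` would give the boundaries instead.

```python
import ipaddress
from typing import Tuple


def cidr_to_host_range(cidr: str) -> Tuple[str, str]:
    """Return the first and last usable host address of an IPv4 CIDR block."""
    network = ipaddress.ip_network(cidr, strict=True)
    # Index 0 is the network address and index -1 the broadcast address,
    # so the usable hosts run from index 1 to index -2.
    return str(network[1]), str(network[-2])


# The DataFactory.WestEurope prefixes listed above.
prefixes = [
    "13.69.67.192/28",
    "13.69.107.112/28",
    "13.69.112.128/28",
    "40.74.24.192/26",
    "40.74.26.0/23",
    "40.113.176.232/29",
    "52.236.187.112/28",
]

for prefix in prefixes:
    first, last = cidr_to_host_range(prefix)
    print(f"{first} | {last}")
```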
However, the issue with the above approach is that it still whitelists all Azure Data Factories within a region. As a result, you might want to open the firewall only for the duration of a pipeline execution and keep it closed at all other times. In the next sections, I will show how to dynamically adjust Synapse firewalls from Azure Data Factory.
Luckily, Synapse offers a REST API to modify its firewall. The details of this can be found in the official documentation.
One just has to create a `PUT` request against the Azure Synapse instance via the Management API at `https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Synapse/workspaces/{workspaceName}/firewallRules/{ruleName}?api-version=2021-03-01`.
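To make the shape of that call concrete, here is a minimal Python sketch that only builds the `PUT` request without sending it. The helper name `build_firewall_rule_request` and all argument values are placeholders for illustration. The body with `startIpAddress` and `endIpAddress` follows my reading of the firewall rule documentation, and obtaining the bearer token is out of scope here (inside ADF, the Web Activity takes care of that).

```python
import json
import urllib.request

MANAGEMENT_URL = "https://management.azure.com"
API_VERSION = "2021-03-01"


def build_firewall_rule_request(subscription_id, resource_group, workspace,
                                rule_name, start_ip, end_ip, bearer_token):
    """Build (but do not send) the PUT request that creates a firewall rule."""
    url = (
        f"{MANAGEMENT_URL}/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Synapse/workspaces/{workspace}"
        f"/firewallRules/{rule_name}?api-version={API_VERSION}"
    )
    # Assumed request body: the IP range sits under "properties".
    body = {"properties": {"startIpAddress": start_ip, "endIpAddress": end_ip}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {bearer_token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values only; nothing is sent over the network here.
request = build_firewall_rule_request(
    "my-subscription-id", "my-resource-group", "my-workspace",
    "adfload1", "13.69.67.193", "13.69.67.206", "my-token",
)
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would actually send the request.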
Moreover, the REST API exposes the current state of a firewall rule.
This makes it possible to check whether the rule is still Provisioning or whether its creation has already Succeeded.
This information can be retrieved with a `GET` request to `https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Synapse/workspaces/{workspaceName}/firewallRules/{ruleName}?api-version=2021-03-01`, which is documented as well.
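The waiting logic behind this can be sketched in a few lines of Python. This is only an illustration under assumptions: `wait_for_rule` and `provisioning_state` are hypothetical helpers, the state is assumed to sit under `properties.provisioningState` in the `GET` response, and the actual HTTP call is injected as a callable so the sketch stays independent of any HTTP client.

```python
import time
from typing import Callable


def provisioning_state(response_body: dict) -> str:
    """Extract the state from a parsed GET response (assumed response shape)."""
    return response_body["properties"]["provisioningState"]


def wait_for_rule(get_state: Callable[[], str],
                  poll_seconds: float = 10.0,
                  max_polls: int = 30) -> None:
    """Poll until the rule reports Succeeded; fail on anything unexpected."""
    for _ in range(max_polls):
        state = get_state()
        if state == "Succeeded":
            return
        if state != "Provisioning":
            raise RuntimeError(f"unexpected provisioning state: {state}")
        time.sleep(poll_seconds)
    raise TimeoutError("firewall rule did not reach Succeeded in time")


# Demo with canned responses instead of real GET requests.
fake_responses = iter([
    {"properties": {"provisioningState": "Provisioning"}},
    {"properties": {"provisioningState": "Succeeded"}},
])
wait_for_rule(lambda: provisioning_state(next(fake_responses)), poll_seconds=0)
print("rule is active")
```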
Fortunately, ADF comes with a Web Activity that can call a REST API natively without any further overhead. Since the Web Activity has supported MSI authentication since 2018, authentication is really simple. One just has to assign a corresponding IAM role. For real environments I suggest going with a custom role; for a demo, the Contributor role will suffice.
Firstly, we need a pipeline to whitelist an IP address. Secondly, this pipeline must wait for the whitelisting to succeed, because otherwise the firewall rule is not active yet.
And a waiting functionality:
As a last step, we loop over a variable of IP ranges.
The variable `ranges` is defined as an array and contains:
[
{"FirewallRuleName":"adfload1","StartIP":"13.69.67.193", "EndIP":"13.69.67.206"},
{"FirewallRuleName":"adfload2","StartIP":"13.69.107.113", "EndIP":"13.69.107.126"},
{"FirewallRuleName":"adfload3","StartIP":"13.69.112.129", "EndIP":"13.69.112.142"},
{"FirewallRuleName":"adfload4","StartIP":"40.74.24.193", "EndIP":"40.74.24.254"},
{"FirewallRuleName":"adfload5","StartIP":"40.74.26.1", "EndIP":"40.74.27.254"},
{"FirewallRuleName":"adfload6","StartIP":"40.113.176.233", "EndIP":"40.113.176.238"},
{"FirewallRuleName":"adfload7","StartIP":"52.236.187.113", "EndIP":"52.236.187.126"}
]
This pipeline first adds all the IP ranges to the firewall, then executes a child pipeline, and afterwards removes the ranges from the firewall again.
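The control flow of that pipeline can be summarized in a short Python sketch. It is not the ADF pipeline itself: `run_with_open_firewall` and the three injected callables are hypothetical stand-ins for the activities. The `try`/`finally` is my own assumption that the rules should also be removed when the child pipeline fails, which in ADF would roughly correspond to wiring the cleanup on completion rather than on success.

```python
from typing import Callable, Iterable, List, Mapping


def run_with_open_firewall(ranges: Iterable[Mapping[str, str]],
                           add_rule: Callable[[Mapping[str, str]], None],
                           remove_rule: Callable[[str], None],
                           run_child: Callable[[], None]) -> None:
    """Open the firewall for all ranges, run the load, then close it again."""
    opened: List[str] = []
    try:
        for rule in ranges:
            # Record the name first so a partially created rule is cleaned up too.
            opened.append(rule["FirewallRuleName"])
            add_rule(rule)
        run_child()
    finally:
        for name in opened:
            remove_rule(name)


# Demo with print statements instead of real REST calls.
demo_ranges = [
    {"FirewallRuleName": "adfload1", "StartIP": "13.69.67.193", "EndIP": "13.69.67.206"},
    {"FirewallRuleName": "adfload2", "StartIP": "13.69.107.113", "EndIP": "13.69.107.126"},
]
run_with_open_firewall(
    demo_ranges,
    add_rule=lambda rule: print("add", rule["FirewallRuleName"]),
    remove_rule=lambda name: print("remove", name),
    run_child=lambda: print("run child pipeline"),
)
```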
For a fully-fledged security version, one can go with the Managed Virtual Network feature of Azure Data Factory. However, for lower run-time costs and easier setup, a solution based on trusted services might be preferred.
The solution shown above allows you to significantly reduce the attack surface on the Synapse SQL endpoint compared to the setting "Allow Azure services and resources to access this workspace". On the one hand, Synapse is opened just for the Azure Data Factory IP ranges of the relevant region. On the other hand, Synapse is only opened while the ETL process is running and closed immediately afterwards.
I hope that in the future a feature like resource instance whitelisting becomes available across all services, which would make the network security of PaaS services significantly easier.