Yesterday, Microsoft announced and made available a new type of ingestion: the metadata-driven copy task. With this new task, one can create, within minutes, a pipeline that loads data from a wide variety of data sources into many of the data stores supported by ADF.
So far, people have been building their own custom frameworks. Even within the company I work for, Avanade, multiple exist. Some of these frameworks focus on orchestration, for example procfwk by my colleague Paul Andrew; others focus on the data processing itself. The one announced by Microsoft falls into the second category.
This new framework does not introduce any new ADF activities; instead, it creates three pipelines based on standard activities when you run through the wizard.
This is very similar to previous ingestion frameworks based on BIML.
You see it right on the start page of Azure Data Factory.
However, to actually try it, let's set up the following showcase environment:
In the following section, the installation is done with PowerShell (using the Azure CLI). Of course, one could click through the Azure Portal as well; however, that would take significantly more effort and would be harder for you, the reader, to reproduce.
The following resources are needed:
To easily set up this demo environment, I have prepared a PowerShell script for you.
You have to be logged into Azure beforehand (i.e. az login).
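If you are not logged in yet, this does the trick (the subscription name is a placeholder you have to replace):
# log in interactively and pick the subscription the resources should be deployed into
az login
az account set --subscription "<your subscription>"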
# adjust the following
$project="azdg" # please adjust
$environment="d"
$region="euw"
$location="West Europe" # based on $region
$purpose="ingestion"
$incarnation="02"
$sqlpwd="<strongpassword>" # please adjust
# creating variables with resource names for convenience
$rg=$project+$environment+$region+$purpose+$incarnation
$kv=$project+$environment+$purpose+$incarnation+"kv"
$sql=$project+$environment+$purpose+$incarnation+"sql"
$lake=$project+$environment+$purpose+$incarnation+"lake"+"stg"
$df=$project+$environment+$purpose+$incarnation+"df"
$sqldbmeta="metadata"
$sqldbsrc="adventureworks"
$sqladminuser="sqladmin"
$sqlpwdname=$sql+"pwd"
# creating the resources
az group create --location $location --name $rg
az keyvault create --resource-group $rg --name $kv --location $location
az sql server create --resource-group $rg --name $sql --location $location `
--admin-password $sqlpwd --admin-user $sqladminuser
az sql db create --resource-group $rg --name $sqldbmeta --server $sql --service-objective S0
az sql db create --resource-group $rg --name $sqldbsrc --server $sql --service-objective S0 --sample-name "AdventureWorksLT"
az storage account create --name $lake --resource-group $rg --location $location `
--enable-hierarchical-namespace true
az datafactory create --location $location --name $df --resource-group $rg
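If you want to double-check that everything was provisioned, an optional sanity check could look like this:
# optional: list the resources that were just created in the resource group
az resource list --resource-group $rg --query "[].{name:name, type:type}" --out table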
Afterward, we have to configure the services correctly. The following changes are required:
This can be done with the following commands:
az sql server firewall-rule create --resource-group $rg --server $sql --name "Azure Services" --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0
az keyvault secret set --name $sqlpwdname --vault-name $kv --value $sqlpwd
$msiadf=$(az resource list --name $df --query [*].identity.principalId --out tsv)
$dlr=$(az resource list --name $lake --query [*].id --out tsv)
az role assignment create --assignee $msiadf --role 'Storage Blob Data Contributor' --scope $dlr
az keyvault set-policy --name $kv --secret-permissions get list --object-id $msiadf
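Optionally, one can verify that the grants are in place:
# optional: verify the role assignment on the lake and that the secret exists
az role assignment list --assignee $msiadf --scope $dlr --query "[].roleDefinitionName" --out tsv
az keyvault secret show --name $sqlpwdname --vault-name $kv --query "name" --out tsv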
Lastly, we need the appropriate Linked Services configured within ADF:
These can be set up, including the link to the Key Vault and leveraging the MSI permission grants from above, with the following lines of PowerShell:
'{
    "type": "AzureKeyVault",
    "typeProperties": {
        "baseUrl": "https://'+$kv+'.vault.azure.net/"
    }
}' | Out-File ls.json
az datafactory linked-service create --factory-name $df --name "LS_KV_Secrets" --resource-group $rg --properties "@./ls.json"
'{
    "type": "AzureSqlDatabase",
    "typeProperties": {
        "connectionString": "Integrated Security=False;Encrypt=True;Connection Timeout=30;Data Source='+$sql+'.database.windows.net;Initial Catalog='+$sqldbmeta+';User ID='+$sqladminuser+'",
        "password": {
            "type": "AzureKeyVaultSecret",
            "store": {
                "referenceName": "LS_KV_Secrets",
                "type": "LinkedServiceReference"
            },
            "secretName": "'+$sqlpwdname+'"
        }
    }
}' | Out-File ls.json
az datafactory linked-service create --factory-name $df --name "LS_SQL_Metadata" --resource-group $rg --properties "@./ls.json"
'{
    "type": "AzureSqlDatabase",
    "typeProperties": {
        "connectionString": "Integrated Security=False;Encrypt=True;Connection Timeout=30;Data Source='+$sql+'.database.windows.net;Initial Catalog='+$sqldbsrc+';User ID='+$sqladminuser+'",
        "password": {
            "type": "AzureKeyVaultSecret",
            "store": {
                "referenceName": "LS_KV_Secrets",
                "type": "LinkedServiceReference"
            },
            "secretName": "'+$sqlpwdname+'"
        }
    }
}' | Out-File ls.json
az datafactory linked-service create --factory-name $df --name "LS_SQL_AdventureWorks" --resource-group $rg --properties "@./ls.json"
'{
    "type": "AzureBlobFS",
    "typeProperties": {
        "url": "https://'+$lake+'.dfs.core.windows.net"
    }
}' | Out-File ls.json
az datafactory linked-service create --factory-name $df --name "LS_STG_DataLake" --resource-group $rg --properties "@./ls.json"
Remove-Item .\ls.json
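Optionally, one can confirm that all four linked services made it into the factory:
# optional: list the linked services in the data factory
az datafactory linked-service list --factory-name $df --resource-group $rg --query "[].name" --out tsv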
To use the metadata-driven copy task, one has to go through the following wizard.
First, we configure the central control table. Luckily, you have already set up the linked service for it above:
Then, we set up the source database. Fortunately, this has already been provisioned above with the AdventureWorksLT sample data.
By default, all tables are loaded in full. However, it is possible to configure the loading behavior there: any suitable column can be used as the watermark for an incremental load. In AdventureWorksLT, most tables have a ModifiedDate column. If you set the watermark value to 01/01/1970 00:00:00, everything will be loaded on the first execution.
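Conceptually, each incremental run then only fetches rows whose watermark column is newer than the stored value, roughly like this sketch (the table and the literal query are just an illustration, not what ADF generates verbatim):
# illustration only: the delta-load pattern boils down to a per-table query like this
$watermark = '01/01/1970 00:00:00'   # last stored watermark value
$deltaQuery = "SELECT * FROM SalesLT.Customer WHERE ModifiedDate > '$watermark'"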
Then, we set up the target data lake. For this, the needed linked service is already provisioned as well, and I decide to put the ingested data into a subfolder called adventureworks and to store the files with a .csv extension.
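In case the target container does not exist yet, it can be created upfront; the name raw below is just an assumption, use whatever container you select in the wizard (your own login needs data-plane permissions on the storage account for --auth-mode login):
# hypothetical: pre-create the target filesystem/container in the data lake ("raw" is an assumed name)
az storage fs create --name raw --account-name $lake --auth-mode login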
We can then configure the file format settings. As always in ADF, one should set the escape character to " so that embedded double quotes are doubled ("") rather than backslash-escaped, in line with RFC 4180.
Then the task can be named:
After reviewing the settings one last time:
One reaches the final page. There, unfortunately, one has to manually copy the CREATE TABLE and INSERT statements for the metadata table as well as the CREATE PROCEDURE statement for the procedure that updates the watermark column.
In my testing, the "Copy to clipboard" buttons did not work reliably, but one can just copy the statements from the text boxes directly.
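If you prefer the command line over a query editor, the copied statements can also be executed with a bit of PowerShell; this is only a sketch, assuming the SqlServer module is installed and the script was saved as metadata.sql (a file name I made up):
# run the copied script (CREATE TABLE, INSERT and CREATE PROCEDURE) against the metadata database
# metadata.sql is a hypothetical file name for the statements copied from the wizard
Invoke-Sqlcmd -ServerInstance "$sql.database.windows.net" `
    -Database $sqldbmeta -Username $sqladminuser -Password $sqlpwd `
    -InputFile .\metadata.sql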
The setup is finished now.
The result of the wizard is three levels of pipelines nested into each other.
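Their exact names depend on the task name chosen in the wizard; to get a quick look at what was generated, they can be listed from the CLI:
# list the pipelines the wizard created in the factory
az datafactory pipeline list --factory-name $df --resource-group $rg --query "[].name" --out tsv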
The details of those pipelines are described on docs.microsoft.com; digging into them could be the topic of another blog post.
I am a huge fan of metadata-driven ingestion, and I am very happy that this is now becoming a core part of ADF. It will give a lot more productivity to people who have not built their own framework yet. Moreover, for data sources that have a suitable column for incremental loads, it brings this capability to a much broader audience.
As this is still in preview, there are certainly some rough edges: things that are either not implemented yet, that I might have missed, or that the product group is currently aiming to solve differently. My first list of findings is:
- Why are the CREATE and INSERT statements not executed automatically at the end of the wizard, even though the linked service is required earlier in the wizard?
- There is no isActive column in the metadata to temporarily disable the load of a certain table.
- Why is .txt the default file extension instead of .csv if the standard format is Text (CSV)?
- Why is \ the default escape character rather than " as specified in RFC 4180?
What are the community's findings on this task? What do you think? Please comment below.