First look at metadata-driven copy task for Azure Data Factory

adf azure devops metadata-driven

Yesterday, Microsoft announced and made available a new type of ingestion: the metadata-driven copy task. With this new task, one can create within minutes a pipeline that loads data from a wide variety of data sources into many of the data stores ADF supports.

Background

So far, people have been building their own custom frameworks. Even within the company I work for, Avanade, multiple exist. Some of these frameworks focus on orchestration, such as procfwk by my colleague Paul Andrew. Others focus on the data processing itself. The one announced by Microsoft falls into the second category.

This new framework does not introduce any new ADF activities; instead, the wizard generates three pipelines built from standard activities.

This is very similar to previous ingestion frameworks based on BIML.

Where to find it?

You see it right on the start page of Azure Data Factory.

However, to actually try it, let's set up the following showcase environment:

  • An Azure SQL DB with the AdventureWorksLT sample as the source database
  • Another Azure SQL DB to store the metadata of the ingestion framework
  • An Azure Key Vault for proper secret management
  • An Azure Data Lake Storage Gen2 account as the destination

In the following section, the installation is done with PowerShell and the Azure CLI. Of course, one could click through the Azure Portal as well; however, that would take significantly more effort and be harder for you, the readers, to reproduce.

Installing the environment

The following resources are needed:

Required resources as seen in the Azure Portal

To easily set up this demo environment, I have prepared a PowerShell script for you. You have to be logged in to Azure beforehand (i.e. az login).

# adjust the following 
$project="azdg" # please adjust
$environment="d"
$region="euw"
$location="West Europe" # based on $region
$purpose="ingestion"
$incarnation="02"
$sqlpwd="<strongpassword>" # please adjust

# creating variables with resource names for convenience
$rg=$project+$environment+$region+$purpose+$incarnation
$kv=$project+$environment+$purpose+$incarnation+"kv"
$sql=$project+$environment+$purpose+$incarnation+"sql"
$lake=$project+$environment+$purpose+$incarnation+"lake"+"stg"
$df=$project+$environment+$purpose+$incarnation+"df"
$sqldbmeta="metadata"
$sqldbsrc="adventureworks"
$sqladminuser="sqladmin"
$sqlpwdname=$sql+"pwd"

# creating the resources
az group create --location $location --name $rg
az keyvault create --resource-group $rg --name $kv --location $location
az sql server create --resource-group $rg --name $sql --location $location `
  --admin-password $sqlpwd --admin-user $sqladminuser
az sql db create --resource-group $rg --name $sqldbmeta --server $sql --service-objective S0
az sql db create --resource-group $rg --name $sqldbsrc --server $sql --service-objective S0 --sample-name "AdventureWorksLT"
az storage account create --name $lake --resource-group $rg --location $location `
  --enable-hierarchical-namespace true
az datafactory create --location $location --name $df --resource-group $rg
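
Optionally, one can verify that everything came up as expected before moving on. Listing the resource group (reusing the variables defined above) should show the resources from the screenshot above:

az resource list --resource-group $rg --query "[].{name:name, type:type}" --out table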

Afterward, we have to configure the services correctly. The following changes are required:

  • Allow Azure services to access the Azure SQL Database so that ADF can reach it
  • Place the secret to access the Azure SQL Database into the Key Vault
  • Grant ADF access to the Key Vault
  • Grant ADF access to the data lake

This can be done with the following commands:

az sql server firewall-rule create --resource-group $rg --server $sql --name "Azure Services" --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0
az keyvault secret set --name $sqlpwdname --vault-name $kv --value $sqlpwd
$msiadf=$(az resource list --name $df --query [*].identity.principalId --out tsv)
$dlr=$(az resource list --name $lake --query [*].id --out tsv)
az role assignment create --assignee $msiadf --role 'Storage Blob Data Contributor' --scope $dlr
az keyvault set-policy --name $kv --secret-permissions get list --object-id $msiadf
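
As an optional sanity check, one can confirm that the data factory's managed identity really received the role on the data lake:

az role assignment list --assignee $msiadf --scope $dlr --query "[].roleDefinitionName" --out tsv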

Lastly, we need the appropriate Linked Services configured within ADF:

Linked Services in Azure Data Factory

These can be set up, including the linkage to the Key Vault and leveraging the MSI permission grants from above, with the following lines of PowerShell:

'{
  "type": "AzureKeyVault",
  "typeProperties": {
    "baseUrl": "https://'+$kv+'.vault.azure.net/"
  }
}' | Out-File ls.json
az datafactory linked-service create --factory-name $df --name "LS_KV_Secrets" --resource-group $rg --properties "@./ls.json"
'{
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "connectionString": "Integrated Security=False;Encrypt=True;Connection Timeout=30;Data Source='+$sql+'.database.windows.net;Initial Catalog='+$sqldbmeta+';User ID='+$sqladminuser+'",
    "password": {
      "type": "AzureKeyVaultSecret",
      "store": {
        "referenceName": "LS_KV_Secrets",
        "type": "LinkedServiceReference"
      },
      "secretName": "'+$sqlpwdname+'"
    }
  }
}' | Out-File ls.json
az datafactory linked-service create --factory-name $df --name "LS_SQL_Metadata" --resource-group $rg --properties "@./ls.json"
'{
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "connectionString": "Integrated Security=False;Encrypt=True;Connection Timeout=30;Data Source='+$sql+'.database.windows.net;Initial Catalog='+$sqldbsrc+';User ID='+$sqladminuser+'",
    "password": {
      "type": "AzureKeyVaultSecret",
      "store": {
        "referenceName": "LS_KV_Secrets",
        "type": "LinkedServiceReference"
      },
      "secretName": "'+$sqlpwdname+'"
    }
  }
}' | Out-File ls.json
az datafactory linked-service create --factory-name $df --name "LS_SQL_AdventureWorks" --resource-group $rg --properties "@./ls.json"
'{
  "type": "AzureBlobFS",
  "typeProperties": {
    "url": "https://'+$lake+'.dfs.core.windows.net"
  }
}' | Out-File ls.json
az datafactory linked-service create --factory-name $df --name "LS_STG_DataLake" --resource-group $rg --properties "@./ls.json"

Remove-Item .\ls.json
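
If you want to double-check the result without opening ADF Studio, the created linked services can also be listed via the CLI:

az datafactory linked-service list --factory-name $df --resource-group $rg --query "[].name" --out table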

Configuring

To use the metadata-driven copy task, one has to go through the following wizard.

First, we configure the central control table. Luckily, the corresponding linked service has already been set up above: Configuration: Properties

Then, we set up the source database. Conveniently, this has already been provisioned with the AdventureWorksLT sample earlier on. Configuration: Source data store

By default, all tables are loaded in full. However, it is possible to configure the loading behavior and use any column as the watermark for an incremental load. In AdventureWorksLT, most tables have a ModifiedDate column. If you set the watermark value to 01/01/1970 00:00:00, everything will be loaded on the first execution. Configuration: Choose loading behavior
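
To get a feeling for what an incremental load based on ModifiedDate would pick up, you can query the source directly. This is purely illustrative and not required by the generated pipelines; it assumes the SqlServer PowerShell module is installed and that your own client IP is allowed through the SQL firewall (the rule above only opens access to Azure services):

# Illustrative only: rows with a ModifiedDate later than the initial watermark.
Invoke-Sqlcmd -ServerInstance "$sql.database.windows.net" -Database $sqldbsrc `
  -Username $sqladminuser -Password $sqlpwd -Query @"
SELECT COUNT(*) AS RowsNewerThanWatermark
FROM SalesLT.Customer
WHERE ModifiedDate > '1970-01-01 00:00:00';
"@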

Then, we set up the target data lake. The needed linked service is already provisioned here as well, and I decide to put the ingested data into a subfolder called adventureworks and store the files with a .csv extension. Configuration: Destination data store
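
A small aside, in case the target container does not exist yet: it can be created up front with the CLI. The container name ingestion is my own choice for this demo, and --auth-mode login assumes your own user also has data-plane rights on the account (otherwise fall back to --account-key):

# Hypothetical container name; the wizard will ask for container + folder.
az storage fs create --name ingestion --account-name $lake --auth-mode login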

We can then configure the file format settings. As always in ADF, one should set the Escape character to " to align with RFC 4180. Configuration: File format settings
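
To illustrate why this matters: RFC 4180 escapes an embedded double quote by doubling it, which is also what PowerShell's own CSV cmdlets do:

# RFC 4180 style: an embedded " is escaped by doubling it, not with a backslash.
[pscustomobject]@{ ProductName = '26" Mountain Wheel' } | ConvertTo-Csv -NoTypeInformation
# Output (roughly):
# "ProductName"
# "26"" Mountain Wheel"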

Then the task can be named: Configuration: Settings

After reviewing the settings one last time: Configuration: Summary

One reaches the final page. There, unfortunately, one now has to manually copy the CREATE TABLE and metadata INSERT statements for the control table, as well as the CREATE PROCEDURE statement for the procedure that updates the watermark column. In my testing the Copy to clipboard buttons did not work reliably, but one can simply copy the statements from the text boxes directly. Configuration: Deployment
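
If you prefer not to paste the generated SQL into a query editor by hand, a sketch like the following works as well. The file names are my own invention (the wizard only gives you the statements to copy), it again relies on Invoke-Sqlcmd from the SqlServer module, and it assumes your client IP is allowed through the SQL firewall:

# Hypothetical file names; save the scripts from the deployment page into them first.
Invoke-Sqlcmd -ServerInstance "$sql.database.windows.net" -Database $sqldbmeta `
  -Username $sqladminuser -Password $sqlpwd -InputFile .\control-table.sql
Invoke-Sqlcmd -ServerInstance "$sql.database.windows.net" -Database $sqldbmeta `
  -Username $sqladminuser -Password $sqlpwd -InputFile .\update-watermark-procedure.sql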

The setup is finished now.

First Look

The result of the wizard is three levels of pipelines nested into each other: Configuration: Summary
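
The top-level pipeline can also be triggered and monitored from the CLI instead of the ADF Studio. The pipeline name below is a placeholder for the name the wizard generated from your task name, and the sketch assumes the default parameter values set by the wizard are sufficient:

$runId = az datafactory pipeline create-run --factory-name $df --resource-group $rg `
  --name "<your top-level pipeline name>" --query runId --out tsv
az datafactory pipeline-run show --factory-name $df --resource-group $rg --run-id $runId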

The details of those pipelines are described on docs.microsoft.com.

However, those details could be the topic of another blog post.

Review

I am a huge fan of metadata-driven ingestion, and I am very happy that it is now becoming a core part of ADF. This will give a big productivity boost to teams that have not built their own framework yet. Moreover, for data that has a suitable column for incremental loading, it brings that capability to a much broader audience.

As this is still in preview, there are certainly some rough edges that are either not implemented yet, that I might have missed, or where the product group is currently aiming for something else. My first list of findings is:

  • Why is the batching in ADF done manually and not by using the Batch Count of the ForEach activity?
  • Why are the CREATE and INSERT statements not executed automatically at the end of the wizard, even though the linked service is required earlier in the wizard?
  • It would be very cool if one could edit the configuration after the fact, e.g. adjust a delta column or include another table from the source.
  • It would be very nice to have an isActive column in the metadata to temporarily disable the load of a certain table.
  • It would be very nice if the metadata-driven copy task would leverage the metadata database to store execution logs, e.g. how many rows have been ingested in each run from which source.
  • Why are most columns of JSON type rather than using a normalized data model for the metadata?
  • cosmetics
    • Why does it default to .txt instead of .csv if the standard format is Text (CSV)?
    • Why is the File name suffix in the Dataset page rather than the Configuration?
    • Why is \ the default for Escape character rather than " as specified in RFC 4180?

What are the community's findings on the task? What do you think? Please comment below.
