Monkeypox Data Pipeline Using Azure

Image placeholder

The goal is to load the Monkeypox data into a proper data warehouse through a data pipeline. The architecture is shown on the left side. We take the data from Our World in Data; here is the link: https://github.com/owid/monkeypox/blob/main/owid-monkeypox-data.csv . Strictly speaking, Azure Data Factory, Azure Databricks, and Azure Synapse Analytics can each cover parts of the others' functionality, but in this case I want to use all of them to give a clear picture of each tool/resource.

Image placeholder

First of all, we should create a data lake using a Storage Account in Azure. You can find it via the search bar at the top. Leave every configuration at its default; only change the region to the one nearest to you.

Image placeholder

Create a container and a directory folder.

Image placeholder

Next, we should create an Azure Data Factory. The process is the same as creating the storage account. Make sure its region is the same as the storage account's region. From here on I will shorten Azure Data Factory to ADF.

Image placeholder

Launch Azure Data Factory Studio. Choose your Azure Active Directory, subscription, and ADF instance.

Image placeholder

Click the Author menu on the left side, click the plus (+) icon, and choose Pipeline.

Image placeholder

Basically, the pipeline we need migrates the data from the external server into our data lake. Let's move on to the steps. Find 'Copy data' in the Activities bar, then click and drag it into the workspace. Complete the configuration at the bottom: the source is the CSV link above, and the sink stores the data in our container inside the raw-data folder. After that, click the blue Publish All button.
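For reference, here is a rough Python sketch of what this Copy Data activity does under the hood: fetch the CSV over HTTP and land it in the raw-data folder of the data lake. The container name, storage account name, and account key below are placeholders, not the ones from this walkthrough, and the raw.githubusercontent.com link is simply the raw version of the file linked above.

```python
# Sketch only: equivalent of the ADF Copy Data activity (HTTP source -> ADLS sink).
# Requires the requests and azure-storage-file-datalake packages.
import requests
from azure.storage.filedatalake import DataLakeServiceClient

CSV_URL = "https://raw.githubusercontent.com/owid/monkeypox/main/owid-monkeypox-data.csv"

csv_bytes = requests.get(CSV_URL, timeout=60).content  # download the source file

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",  # placeholder
    credential="<storage-account-key>",                            # placeholder
)
file_client = (
    service.get_file_system_client("<container>")                  # placeholder container
           .get_file_client("raw-data/owid-monkeypox-data.csv")
)
file_client.upload_data(csv_bytes, overwrite=True)  # land the CSV in raw-data/
```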

Image placeholder

The pipeline is done. Next, we should set a schedule for when this pipeline runs. You can set it via the Add trigger button > New/Edit. Because I only need to run it one time, I just use the Trigger Now feature to run the pipeline once, right now.

Image placeholder

Voila! We can see that the data we need is now in our data lake storage, which means the pipeline ran successfully.

Image placeholder

In the next phase, we will do the transformation using Azure Databricks. First, we should create a Databricks workspace. The process is the same as for the storage account and ADF. Just make sure it is in the same region as well.

Image placeholder

Then launch the workspace.

Image placeholder

A little bit of explanation: in Databricks we will use Apache Spark for computation, so we need to set how much parallel computing we need and the other machine configuration. Go to Compute and create a new compute. Set the configuration and click the Create compute button.
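If you prefer code over the UI, a similar cluster can be created through the Databricks Clusters REST API. This is only a sketch: the workspace URL, personal access token, Spark runtime version, and node type are placeholder values, not the configuration used in this walkthrough.

```python
# Sketch: create a small cluster via the Databricks Clusters REST API.
import requests

WORKSPACE_URL = "https://<your-databricks-instance>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                         # placeholder

cluster_spec = {
    "cluster_name": "monkeypox-etl",
    "spark_version": "11.3.x-scala2.12",   # an LTS runtime; adjust to your workspace
    "node_type_id": "Standard_DS3_v2",      # small Azure VM size
    "num_workers": 1,                       # degree of parallelism
    "autotermination_minutes": 30,          # shut the cluster down when idle
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # the response contains the new cluster_id
```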

Image placeholder

I almost forgot: we need to create a connection between the Storage Account and Azure Databricks. We should create an app registration to act as the connection, and a key vault to store the credentials needed to access the storage account.

Image placeholder

Go to the app resource that we've created. On the Overview tab there is some information we need to keep; you can save it in Notepad or your code editor, just copy and paste. The information to keep is the Application (client) ID and the Directory (tenant) ID. One important thing: we should grant this app access on the Storage Account (typically the Storage Blob Data Contributor role) and on Azure Databricks as well.

Image placeholder

Still in the app, move to the Certificates & secrets tab. Here we will create a client secret; just put secretkey as the description and leave the other configuration as default.

Image placeholder

After the client secret is created, save its Value (not the Secret ID) in your notepad or code editor; the value is what we will need for authentication.

Image placeholder

We will use all of the information we've saved in Databricks. But anyone who can access Databricks can see that sensitive information if we simply hardcode it in the notebook. So we need to secure the sensitive information while still being able to use it in our notebooks. For that purpose, Azure provides a resource/service called Key Vault.

Image placeholder

First, we should create the Key Vault resource. The process is the same as for the storage account, ADF, and Databricks. Just make sure it is in the same region as well. When creating the key vault, leave every configuration as default.

Image placeholder

Keep the Vault URI and the key vault's Resource ID (from its Properties page) in your notepad or code editor; they will be needed later.

Image placeholder

Move to the Access Control (IAM) tab. Grant yourself the Key Vault Administrator role and grant the app we created previously the Contributor role.

Image placeholder

Move to Access Policies tab. Grant Key Permissions to the app and Secret Permissions to Azure Databricks.

Image placeholder

Move to the Secrets tab and create a secret. Name it whatever you like and paste one of the values from your notepad or code editor into the Secret value box.

Image placeholder

Do this one by one until all three secrets are created there: one for the application (client) ID, one for the directory (tenant) ID, and one for the client secret value.

Image placeholder

Next, we need to make the connection between Key Vault and Azure Databricks. Go to https://{your databricks link}#secrets/createScope. Name the scope whatever you like and fill in the information from the key vault: put the Vault URI into the DNS Name box and the key vault's Resource ID into the Resource ID box. Then click the Create button.

Image placeholder

Go back to the workspace and create a notebook, then type in all the code. The credentials we stored in Key Vault are retrieved with dbutils.secrets.get(scope={the scope name we created in Databricks}, key={the secret name in Key Vault}). All of this code builds the mount/connection between Azure Databricks, the app, and the Storage Account.
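Here is a minimal sketch of that mount code, assuming a scope named monkeypox-scope, secrets named client-id, tenant-id, and client-secret, and a container named monkeypox; all of these names are placeholders, not the ones used in this walkthrough.

```python
# Sketch: mount the ADLS Gen2 container with OAuth, using the credentials
# stored in Key Vault. Scope, secret, container, and account names are placeholders.
client_id     = dbutils.secrets.get(scope="monkeypox-scope", key="client-id")
tenant_id     = dbutils.secrets.get(scope="monkeypox-scope", key="tenant-id")
client_secret = dbutils.secrets.get(scope="monkeypox-scope", key="client-secret")

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

# Mount the container so it is reachable as a normal path from any notebook.
dbutils.fs.mount(
    source="abfss://monkeypox@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/monkeypox",
    extra_configs=configs,
)
```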

Image placeholder

Once the mount is successfully built, we can read any data in the storage account from here; below is the code to read the data using Spark. I also do some transformations here, such as adjusting data types and filling empty values.
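A sketch of that read-and-transform step, continuing from the mount above; the mount point, file name, and column names are assumptions based on the OWID file, and the exact casts and fills are illustrative rather than the ones used in this walkthrough.

```python
# Sketch: read the raw CSV from the mounted container, fix data types,
# and fill empty values. Column names are assumptions; adjust to the schema.
from pyspark.sql import functions as F
from pyspark.sql.types import DateType, DoubleType

df = (
    spark.read
         .option("header", True)
         .csv("/mnt/monkeypox/raw-data/owid-monkeypox-data.csv")
)

df_clean = (
    df.withColumn("date", F.col("date").cast(DateType()))
      .withColumn("total_cases", F.col("total_cases").cast(DoubleType()))
      .withColumn("total_deaths", F.col("total_deaths").cast(DoubleType()))
      .fillna(0, subset=["total_cases", "total_deaths"])
)

df_clean.printSchema()  # confirm the adjusted data types
```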

Image placeholder

After these transformations, we store the data back into the storage account, but in a different directory from the raw data.
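Continuing the sketch above, the write step could look like this; the transformed-data folder name is a placeholder.

```python
# Sketch: write the transformed DataFrame to a separate folder on the same mount.
(
    df_clean.write
            .mode("overwrite")
            .option("header", True)
            .csv("/mnt/monkeypox/transformed-data")
)
```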

Image placeholder

Here is the result in the storage account. Data exported from Spark always comes with some companion files; our data is in the file(s) named 'part-...'. There can be multiple such files, depending on the partition configuration in Spark.
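If a single output file is easier to work with downstream, one option (not used in the original run) is to collapse the partitions before writing, continuing the sketch above:

```python
# Optional: collapse to one partition so Spark writes a single part-* file.
(
    df_clean.coalesce(1)
            .write.mode("overwrite")
            .option("header", True)
            .csv("/mnt/monkeypox/transformed-data")
)
```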

Image placeholder

Azure Synapse Analytics has the same functionality as Azure Data Factory, plus some data-wrangling and analytics features. We will create an Azure Synapse Analytics workspace for the next phase. Make sure it is in the same region as the other resources.

Image placeholder

After the Azure Synapse Analytics workspace is created, click Open Synapse Studio on the Overview tab. It will redirect you to a new tab.

Image placeholder

Basically it has the same UI as Azure Data Factory. Click the plus icon and create a new external table from the data lake.

Image placeholder

Give the table any name you like. In the Linked service and Input file or folder boxes, point it to the storage account location that holds the Spark-exported data.

Image placeholder

Azure Synapse Analytics will automatically read the data with the right data types.

Image placeholder

You can run SQL scripts against the table here.

Image placeholder

You can build charts here as well. Azure Synapse Analytics is a great, easy tool for wrangling and exploring the data. From here you can consume the data and bring it into any visualization platform such as Power BI, Tableau, Looker, etc.