Mapping Synapse data flow with parameterized dynamic source need importing projection dynamically - azure-data-factory

I am trying to build a cloud data warehouse where I have staged the on-prem tables as parquet files in data lake.
I implemented the metadata driven incremental load.
In the above data flow I am trying to implement merge query passing the table name as parameter so that the data flow dynamically locate respective parquet files for full data and incremental data and then go through some ETL steps to implement merge query.
The merge query is working fine. But I found that projection is not correct. As the source files are dynamic, I also want to "import projection" dynamically during the runtime. So that the same data flow can be used to implement merge query for any table.
In the picture, you see it is showing 104 columns (which is a static projection that it imported at the development time). Actually for this table it should be 38 columns.
Can I dynamically (i.e run-time) assign the projection? If so how?
Or anyone has any suggestion regarding this?
Thanking
Muntasir Joarder

Enable Schema drift in your source transformation when the metadata is often changed. This removes or adds columns in the run time.
The source projection displays what has been imported at the run time but it changes based on the source schema at run time.
Refer to this document for more details with examples.

Related

Azure Synapse Pipeline copy data from the BigQuery, where the source schema is hierarchical with nested columns

Please help me with copying data from Google BigQuery to Azure Data Lake Storage Gen2 with Serverless SQL Pool.
I am using Azure Synapse's Copy data pipeline. The issue is I cannot figure out how to handle source table from the BigQuery with hierarchical schema. This result in missing columns and inaccurate datetime value at the sink.
The source is a Google BigQuery table, it is made of Google Cloud Billing export of a project's standard usage cost. The source table's schema is hierarchical with nested columns, such as service.id; service.description; sku.id; sku.description; Project.labels.key; Project.labels.value, etc.
When I click on Preview data from the Source tab of the Copy data pipeline, it only gives me the top of the column hierarchy, for example: It would only show the column name of [service] and with value of {\v":{"f":[{"v":"[service.id]"},{"v":"[service.descrpition]"}]}}
image description: Source with nested columns result in issues with Synapse Copy Data Pipline
I have tried to configure the Copy Pipline with the following:
Source Tab:
Use query - I think the solution lays in here, but I cannot figure out the syntax of selecting the proper columns. I watched a Youtube video from TechBrothersIT How to Pass Parameters to SQL query in Azure Data Factory - ADF Tutorial 2021, but still unable to do it.
Sink Tab:
1.Sink dataset in various format of csv, json and parquet - with csv and parquet getting similar result, and json format failed
2.Sink dataset to Azure SQL Database - failed because it is not supported with Serverless SQL Pool
3.Mapping Tab: note: edited on Jan22 with screenshot to show issue.
Tried with Import schemas, with Sink Tab copy behavior of None, Flatten Hierarchy and Preserve Hierarchy, but still unable to get source column to be recognized as Hierarchical. Unable to get the Collection reference nor the Advanced Editor configurations to show up. Ref: Screenshot of Source columns not detected as Hierarchical MS Doc on Schema and data type mapping in copy activity
I have also tried with the Data flow pipeline, but it does not support Google BigQueryData Flow Pipe Source do not support BigQuery yet
Here are the steps to reproduce / get to my situation:
Register Google cloud, setup billing export (of standard usage cost) to BigQuery.
At Azure Synapse Analytics, create a Linked service with user authentication. Please follow Data Tech's Youtube video
"Google BigQuery connection (or linked service) in Azure Synapse analytics"
At Azure Synapse Analytics, Integrate, click on the "+" sign -> Copy Data Tool
I believe the answer is at the Source tab with Query and Functions, please help me figure this out, or point me to the right direction.
Looking forward to your input. Thanks in advance!
ADF allows you to write the query in google bigquery source dataset. Therefore write the query to unnest the nested columns using unnest operator and then map it to the sink.
I tried to repro this with sample nested table.
img:1 nested table
img:2 sample data of nested table
Script to flatten the nested table:
select
user_id,
a.post_id,
a.creation_date
from `ds1.stackoverflow_nested`
cross join unnest(comments) a
img:3 flattened table.
Use this query in copy activity source dataset.
img:4 Source settings of copy activity.
Then take the sink dataset, do the mapping and execute the ADF pipeline.
Reference:
MS document on google bigquery as a source - ADF
GC document on unnest operator

How to check a table is made from which tables in pyspark

I have a core layer where I have some tables and I want to find out by what tables in the source layer are these tables made up of. Like the tables in core layer are made by joining some of the tables of source layer. I want to generate an excel sheet using code so that I am able to display the core tables are made from which tables.
I am using PySpark on Databricks and the codes are written for creating the tables in notebooks.
Any help on how to approach this will be beneficial.
This is possible when you use Databricks Unity Catalog - as part of it, there is a feature called Data Lineage that tracks what tables & columns were used to create a specific table and who are consumers of it as well. It also includes Lineage API that could be used for exporting of the lineage data.

Azure Data Factory - Dataverse data ingestion and data type mapping

We are performing data ingestion of Dataverse[Common data service apps] Entities into ADLS Gen2 using Azure Data Factory. We see few columns missing from Dataverse source which are not copied into ADLS, specifically with Dataverse Data type - Choice.
Are all Dataverse column data types supported by ADF linked service? Please suggest fix or any workaround.
Are all Dataverse column data types supported by ADF linked service?
Yes, dataverse supports all column data types.
For missing columns, you should consider the below given points:
When you copy data from Dynamics, explicit column mapping from Dynamics to sink is optional. But we highly recommend the mapping to ensure a deterministic copy result.
When the service imports a schema in the authoring UI, it infers the schema. It does so by sampling the top rows from the Dynamics query result to initialize the source column list. In that case, columns with no values in the top rows are omitted. The same behavior also applies to data preview and copy executions if there is no explicit mapping. You can review and add more columns into the mapping, which are honored during copy runtime.
To consume the dataverse choices using ADF, you should use data flow activity and use the derived transformation because choice values are written as an integer label and not a text label to maintain consistency during edits. The integer-to-text label mapping is stored in the Microsoft.Athena.TrickleFeedService/table-EntityMetadata.json file.
Refer this Microsoft official document to implement the same.

Azure data factory: Implementing the SCD2 on txt files

I have flat files in adls source,
for full load we are adding 2 columns Insert and datatimestamp.
For change load we need to Lookup with full data, the data available in full should be taken as Updated and not available data as Insert and copy.
below is the approach I tried to work out, but i'm unable to perform.
Can any one help me on this.
Thanks you and waiting for quick response.
Currently, the feature to update the existing flat file using the Azure data factory sink is not supported. You have to create a new flat file.
You can also use data flow activity to read full and incremental data and load to a new file in sink transformation.

How to get max of a given column from ADF Copy Data activity

I have a copy data activity for on-premise SQL Server as source and ADLS Gen2 as sink. There is a control table to pickup tableName, watermarkDateColumn and the watermarkDatetime to pull incremental data from the source database.
After data is pulled/loaded in sink, I want to get the max of the watermarkDateColumn in my dataset. Can it be obtained from #activity('copyActivity1').output?
I'm not allowed to use one extra lookup activity to query the source table for getting the max(watermarkDateColumn) in pipeline.
Copy activity only could be used for data transmission,not for any other aggregation feature. So #activity('copyActivity1').output won't help. Since you said you can't use lookup activity, i'm afraid your requirement is not available so far.
If you prefer not using additional activities, I suggest you using Data Flow Activity instead which is more flexible.There is built-in aggregation feature in the Data Flow Activity.