Azure Synapse pipeline copy data from BigQuery, where the source schema is hierarchical with nested columns - tsql

Please help me with copying data from Google BigQuery to Azure Data Lake Storage Gen2 with Serverless SQL Pool.
I am using Azure Synapse's Copy data pipeline. The issue is that I cannot figure out how to handle a source table from BigQuery with a hierarchical schema. This results in missing columns and inaccurate datetime values at the sink.
The source is a Google BigQuery table made from the Google Cloud Billing export of a project's standard usage cost. The source table's schema is hierarchical with nested columns, such as service.id; service.description; sku.id; sku.description; Project.labels.key; Project.labels.value, etc.
When I click on Preview data from the Source tab of the Copy data pipeline, it only gives me the top of the column hierarchy. For example, it would only show the column name [service] with a value of {"v":{"f":[{"v":"[service.id]"},{"v":"[service.description]"}]}}
Image description: source with nested columns results in issues with the Synapse Copy Data pipeline
I have tried to configure the Copy pipeline with the following:
Source Tab:
Use query - I think the solution lies here, but I cannot figure out the syntax for selecting the proper columns. I watched TechBrothersIT's YouTube video "How to Pass Parameters to SQL query in Azure Data Factory - ADF Tutorial 2021", but I am still unable to do it.
Sink Tab:
1. Sink dataset in various formats of csv, json and parquet - csv and parquet gave a similar result, and the json format failed.
2. Sink dataset to Azure SQL Database - failed because it is not supported with Serverless SQL Pool.
3. Mapping Tab (note: edited on Jan 22 with a screenshot to show the issue):
Tried Import schemas with the Sink tab copy behavior set to None, Flatten Hierarchy and Preserve Hierarchy, but I am still unable to get the source columns recognized as hierarchical. I cannot get the Collection reference nor the Advanced editor configurations to show up. Ref: screenshot of source columns not detected as hierarchical; MS doc on schema and data type mapping in copy activity.
I have also tried the Data flow pipeline, but it does not support Google BigQuery yet (ref: Data Flow source does not support BigQuery yet).
Here are the steps to reproduce / get to my situation:
Register Google Cloud and set up the billing export (of standard usage cost) to BigQuery.
At Azure Synapse Analytics, create a linked service with user authentication. Please follow Data Tech's YouTube video
"Google BigQuery connection (or linked service) in Azure Synapse analytics"
At Azure Synapse Analytics, Integrate, click on the "+" sign -> Copy Data Tool
I believe the answer is in the Source tab with Query and Functions; please help me figure this out, or point me in the right direction.
Looking forward to your input. Thanks in advance!

ADF allows you to write a query in the Google BigQuery source dataset. Therefore, write a query that unnests the nested columns using the UNNEST operator and then map the result to the sink.
I tried to repro this with a sample nested table.
img 1: nested table
img 2: sample data of the nested table
Script to flatten the nested table:
-- CROSS JOIN UNNEST expands each element of the repeated `comments` field
-- into its own row, aliased here as `a`
select
  user_id,
  a.post_id,
  a.creation_date
from `ds1.stackoverflow_nested`
cross join unnest(comments) a
img 3: flattened table
Use this query in the copy activity source dataset.
img 4: source settings of the copy activity
Then take the sink dataset, do the mapping and execute the ADF pipeline.
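Applied to the billing-export schema from the question, the same pattern could look like the sketch below. This is only a rough sketch: the table name is a placeholder, only a few of the export's columns are listed, and I am assuming service and sku are plain structs (selectable with dot notation) while project.labels is a repeated record that needs UNNEST.
-- Sketch only: placeholder table name, subset of columns
select
  usage_start_time,
  usage_end_time,
  service.id as service_id,
  service.description as service_description,
  sku.id as sku_id,
  sku.description as sku_description,
  project.id as project_id,
  pl.key as project_label_key,
  pl.value as project_label_value,
  cost,
  currency
from `my-project.billing_ds.gcp_billing_export_v1_XXXXXX`
left join unnest(project.labels) as pl
Note that LEFT JOIN UNNEST keeps rows whose labels array is empty, whereas CROSS JOIN UNNEST would drop them.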
Reference:
MS document on Google BigQuery as a source in ADF
Google Cloud document on the UNNEST operator

Related

Mapping Synapse data flow with parameterized dynamic source needs importing projection dynamically

I am trying to build a cloud data warehouse where I have staged the on-prem tables as parquet files in the data lake.
I implemented a metadata-driven incremental load.
In the above data flow I am trying to implement a merge query, passing the table name as a parameter so that the data flow dynamically locates the respective parquet files for the full and incremental data and then goes through some ETL steps to implement the merge query.
The merge query is working fine, but I found that the projection is not correct. As the source files are dynamic, I also want to "import projection" dynamically at run time, so that the same data flow can be used to implement the merge query for any table.
In the picture, you can see it is showing 104 columns (a static projection that was imported at development time). Actually, for this table it should be 38 columns.
Can I dynamically (i.e. at run time) assign the projection? If so, how?
Or does anyone have any suggestions regarding this?
Thanks,
Muntasir Joarder
Enable Schema drift in your source transformation when the metadata changes often. This removes or adds columns at run time.
The source projection shows what was imported at design time, but with schema drift enabled the actual columns are resolved from the source schema at run time.
Refer to this document for more details and examples.

How to Update Table in Snowflake using Azure Data Factory

I have two tables in Snowflake named table1 and table2. Table1 is the source table, which contains incremental data, and table2 is the target table.
So my use case is that I have to take data from table1 and update table2, but this process has to be done using Azure Data Factory.
I tried to create a data flow in ADF, but it didn't allow me to connect to Snowflake directly as it is not in the supported sources list. The native Snowflake connector only supports the Copy Data activity. So as a workaround, I first created a copy activity which copies the data from Snowflake to Azure Blob. Then I used the Azure Blob as the source for a Data Flow to create my SCD1 implementation and saved the output in csv files.
Now my question is how I should update the data in the target table2, because if I directly use the copy activity to copy the csv files into Snowflake it will result in duplicate records on the Snowflake side. For instance, let's say table2 contains a row
id,name,age,data
1234,kristopher,24,somedata
and table1 contains
id,name,age,data
1234,kristopher,24,some-new-data
So now I have the table1 data in csv which has to be loaded into Snowflake. If I load it directly, the result looks something like this:
id,name,age,data
1234,kristopher,24,somedata
1234,kristopher,24,some-new-data
But I only need
1234,kristopher,24,some-new-data
Let me know if more explanation is required. I am new to Azure Data Factory and Snowflake as well.
Thanks
As you have observed, ADF Data Flows currently don't support Snowflake datasets as a source.
You could theoretically follow this design pattern, but it seems like a lot of work for the requirement you have described. An alternative would be to go down the Azure Function route, but again I would trade off the requirement vs. the effort required.
If it didn't have to be in ADF, then a quick approach would be to use a Snowflake Task to schedule some SQL to manage the SCD behavior for you.
I hope this helps.
Best regards,
Dan.
You can put your logic in a Snowflake stored procedure, then execute your stored proc from ADF.
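For what it's worth, the body of that stored procedure (or of the scheduled Task Dan mentions) would typically be a MERGE. A minimal sketch, assuming the csv output has first been copied into a Snowflake staging table (table1_stage below is an assumed name) and using the id/name/age/data columns from the question:
-- Sketch only: table1_stage is an assumed staging table loaded from the csv files
merge into table2 t
using table1_stage s
  on t.id = s.id
when matched then
  update set t.name = s.name, t.age = s.age, t.data = s.data
when not matched then
  insert (id, name, age, data)
  values (s.id, s.name, s.age, s.data);
This avoids the duplicate rows, since matching ids are updated in place instead of appended.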

How to implement scd2 in snowflake tables using Azure Data Factory

I want to implement SCD2 on the Snowflake tables. My source and target tables are present in Snowflake only. The entire process has to be done using Azure Data Factory.
I went through the documentation given by Azure for implementing SCD2 using data flows, but when I tried to create a dataset for the Snowflake connection it shows up as disabled.
Is there any way, or any documentation, where I can see the steps to create SCD2 in ADF with Snowflake tables?
Thanks
vipendra
SCD2 in ADF can be built and managed graphically via data flows. The Snowflake connector for ADF today does not work directly with data flows, yet. So for now, you will need to use the Copy Activity in an ADF pipeline and stage the dimension data in Blob or ADLS, then build your SCD2 logic in data flows using the staged data.
Your pipeline will look something like this:
[Copy Activity Snowflake-to-Blob] -> [Data Flow SCD2 logic Blob-to-Blob] -> [Copy Activity Blob-to-Snowflake]
We are working on direct connectivity to Snowflake from data flows and hope to land that soon.
If your source and target tables are both in Snowflake, you could use Snowflake Streams to do this. There's a blog post covering this in more detail at https://community.snowflake.com/s/article/Building-a-Type-2-Slowly-Changing-Dimension-in-Snowflake-Using-Streams-and-Tasks-Part-1
However, in short, if you have a source table source, you can put a stream on it like so:
create or replace stream source_changes on table source;
This will capture all the changes that are made to the source table. You can then build a view on that stream that establishes how you want to feed those changes into the SCD table. (The blog post uses case statements to put start and end dates on each row in the view.)
From there, you can use a Snowflake Task to automate the process of loading from the stream into the SCD only when the Stream actually has changes.
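As a rough sketch of that last step (object names below are illustrative, and the SELECT body stands in for whatever view/MERGE logic the blog post builds on the stream), a task gated on the stream could look like:
-- Sketch only: my_wh, dim_source_history and source_changes_view are assumed names
create or replace task load_scd2
  warehouse = my_wh
  schedule = '5 MINUTE'
when
  system$stream_has_data('source_changes')
as
  insert into dim_source_history
  select * from source_changes_view;

alter task load_scd2 resume;  -- tasks are created suspended
The WHEN clause means the task run is skipped on schedules where the stream has captured no changes.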

How to get max of a given column from ADF Copy Data activity

I have a copy data activity with on-premises SQL Server as the source and ADLS Gen2 as the sink. There is a control table to pick up tableName, watermarkDateColumn and the watermarkDatetime to pull incremental data from the source database.
After the data is pulled/loaded into the sink, I want to get the max of watermarkDateColumn in my dataset. Can it be obtained from @activity('copyActivity1').output?
I'm not allowed to use an extra lookup activity to query the source table for max(watermarkDateColumn) in the pipeline.
The copy activity can only be used for data transmission, not for any aggregation, so @activity('copyActivity1').output won't help. Since you said you can't use a lookup activity, I'm afraid your requirement can't be met so far.
If you prefer not to use additional activities, I suggest using a Data Flow activity instead, which is more flexible. There is a built-in aggregation feature in the Data Flow activity.

BigQuery view is not working if I use the BigQuery plugin

I've been using the BigQuery plugin under the source category. When I used a BigQuery view, the pipeline threw an error saying views are not allowed. Also, if I used a permanent table in which repeated columns exist, it threw an error of unsupported mode 'repeated' while retrieving its schema. Does anyone have any information on this?
The BigQuery source exports the data from the table into temporary GCS buckets and then reads it in the pipeline. Since BigQuery VIEWs cannot be exported (please see the limitations here - https://cloud.google.com/bigquery/docs/views), the pipeline fails.
Also, the BigQuery source does not currently support repeated columns. The work is in progress - https://issues.cask.co/browse/CDAP-15256. Is this what you are looking for?
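If the VIEW limitation is the immediate blocker, one possible workaround (not a feature of the plugin, just plain BigQuery SQL with placeholder names) is to materialize the view into a regular table that the source can export:
-- Sketch only: my_dataset.my_view and the snapshot table are placeholder names
create or replace table my_dataset.my_view_snapshot as
select * from my_dataset.my_view;
You would then point the BigQuery source at the snapshot table, keeping in mind that repeated columns would still hit the second limitation above.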