Loading many tables in Cloud Data Fusion fails with DAG error - google-cloud-data-fusion

I have an MS SQL Server data source with around 1000 tables, which I need to put into BigQuery. I was hoping to use Data Fusion to load them all into staging tables in BigQuery and then perform transformations on them afterwards. However, as soon as I create a pipeline with two "islands", it gives a DAG error. Is that by design, or just something I'm doing wrong? I can't find anything in the documentation. My pipeline looks like this:
And the error I get when I try to deploy is: "Invalid DAG. There is an island made up of stages BigTest,BigQuery BigTest (no other stages connect to them)."

Each pipeline is a single DAG (directed acyclic graph), and every source and sink must be connected for the configuration to be valid. You can use a multi-table source plugin that brings in multiple tables at once to a landing table in BigQuery.
You can use the multi-table source plugin together with the BigQuery Multi Table sink for your use case.

Related

Azure Synapse pipeline copying data from BigQuery, where the source schema is hierarchical with nested columns

Please help me with copying data from Google BigQuery to Azure Data Lake Storage Gen2 with Serverless SQL Pool.
I am using Azure Synapse's Copy data pipeline. The issue is I cannot figure out how to handle a source table from BigQuery with a hierarchical schema. This results in missing columns and inaccurate datetime values at the sink.
The source is a Google BigQuery table produced by the Google Cloud Billing export of a project's standard usage cost. The source table's schema is hierarchical with nested columns, such as service.id; service.description; sku.id; sku.description; Project.labels.key; Project.labels.value, etc.
When I click on Preview data from the Source tab of the Copy data pipeline, it only gives me the top of the column hierarchy. For example, it would only show the column name [service] with a value of {"v":{"f":[{"v":"[service.id]"},{"v":"[service.description]"}]}}
image description: Source with nested columns results in issues with the Synapse Copy Data pipeline
I have tried to configure the Copy pipeline with the following:
Source Tab:
Use query - I think the solution lies here, but I cannot figure out the syntax for selecting the proper columns. I watched a YouTube video from TechBrothersIT, "How to Pass Parameters to SQL query in Azure Data Factory - ADF Tutorial 2021", but was still unable to do it.
Sink Tab:
1. Sink dataset in various formats (CSV, JSON and Parquet) - CSV and Parquet give similar results, and the JSON format failed
2. Sink dataset to Azure SQL Database - failed because it is not supported with Serverless SQL Pool
3. Mapping tab (note: edited on Jan 22 with a screenshot to show the issue):
Tried Import schemas with the Sink tab copy behavior set to None, Flatten Hierarchy and Preserve Hierarchy, but I am still unable to get the source columns recognized as hierarchical, and unable to get the Collection reference or the Advanced editor configurations to show up. Ref: screenshot of source columns not detected as hierarchical; MS doc on schema and data type mapping in copy activity.
I have also tried the Data flow pipeline, but its source does not support Google BigQuery yet.
Here are the steps to reproduce / get to my situation:
Register Google cloud, setup billing export (of standard usage cost) to BigQuery.
At Azure Synapse Analytics, create a Linked service with user authentication. Please follow Data Tech's YouTube video
"Google BigQuery connection (or linked service) in Azure Synapse analytics"
At Azure Synapse Analytics, Integrate, click on the "+" sign -> Copy Data Tool
I believe the answer is in the Source tab with Query and Functions; please help me figure this out, or point me in the right direction.
Looking forward to your input. Thanks in advance!
ADF allows you to write a query in the Google BigQuery source dataset. Therefore, write a query that unnests the nested columns using the UNNEST operator and then map the result to the sink.
I tried to reproduce this with a sample nested table.
img:1 nested table
img:2 sample data of nested table
Script to flatten the nested table:
select
    user_id,
    a.post_id,
    a.creation_date
from `ds1.stackoverflow_nested`
cross join unnest(comments) a
img:3 flattened table.
Use this query in the copy activity source dataset.
img:4 Source settings of copy activity.
Then take the sink dataset, do the mapping and execute the ADF pipeline.
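For the billing-export table in the question, a similar query can flatten the nested service, sku and project.labels columns before the copy. This is only a sketch: the table name is a placeholder, and the label handling assumes the standard usage cost export schema, where project.labels is a repeated key/value record.
select
    usage_start_time,
    service.id as service_id,
    service.description as service_description,
    sku.id as sku_id,
    sku.description as sku_description,
    l.key as project_label_key,
    l.value as project_label_value,
    cost
-- placeholder table name; substitute your own billing export table
from `my_project.billing_ds.gcp_billing_export_v1_XXXXXX`
-- left join keeps rows whose project has no labels
left join unnest(project.labels) as l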
Reference:
MS document on google bigquery as a source - ADF
GC document on unnest operator

Mapping Synapse data flow with a parameterized dynamic source needs to import the projection dynamically

I am trying to build a cloud data warehouse where I have staged the on-prem tables as parquet files in a data lake.
I implemented a metadata-driven incremental load.
In the above data flow I am trying to implement a merge query, passing the table name as a parameter so that the data flow dynamically locates the respective parquet files for the full and incremental data and then goes through some ETL steps to implement the merge query.
The merge query is working fine, but I found that the projection is not correct. As the source files are dynamic, I also want to "import projection" dynamically at runtime, so that the same data flow can be used to implement the merge query for any table.
In the picture, you can see it shows 104 columns (a static projection imported at development time). For this table it should actually be 38 columns.
Can I assign the projection dynamically (i.e. at runtime)? If so, how?
Or does anyone have any other suggestions regarding this?
Thanks,
Muntasir Joarder
Enable schema drift in your source transformation when the metadata changes often. This removes or adds columns at runtime.
The source projection displays what was imported at design time, but with schema drift enabled the columns are resolved from the actual source schema at runtime.
Refer to this document for more details and examples.

ADF Mapping Dataflow Temp Table issue inside SP call

I have a mapping dataflow inside a foreach activity which I'm using to copy several tables into ADLS; in the dataflow source activity, I call a stored procedure from my Synapse environment. In the SP, I create a small temp table to store some values which I later use when building a query.
When I run the pipeline, I get an error on the mapping dataflow: "SQLServerException: 111212: Operation cannot be performed within a transaction." If I remove the temp table and just do a simple select * from a small table, it returns the data fine; it's only after I bring back the temp table that I get the issue.
Have you guys ever seen this before, and is there a way around this?
If you go through the official MS docs, this error is very well documented.
Failed with an error: "SQLServerException: 111212; Operation cannot be performed within a transaction."
Symptoms
When you use the Azure SQL Database as a sink in the data flow to preview data, debug/trigger run and do other activities, you may find your job fails with the following error message:
{"StatusCode":"DFExecutorUserError","Message":"Job failed due to reason: at Sink 'sink': shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException: 111212;Operation cannot be performed within a transaction.","Details":"at Sink 'sink': shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException: 111212;Operation cannot be performed within a transaction."}
Cause
The error "111212;Operation cannot be performed within a transaction." only occurs in the Synapse dedicated SQL pool. But you
mistakenly use the Azure SQL Database as the connector instead.
Recommendation
Confirm whether your SQL Database is a Synapse dedicated SQL pool. If so, use Azure Synapse Analytics as the connector, as shown in the picture below.
So after running some tests around this issue, it seems that Mapping Dataflows do not like temp tables when I call my stored procedure.
The way I ended up fixing this was to use a CTE instead of a temp table, which, believe it or not, runs a bit faster than the temp table did.
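To illustrate the workaround (the table and column names below are made up, not taken from the original procedure), the temp-table pattern inside the stored procedure can usually be rewritten as a single statement with a CTE, which avoids the transaction issue in the data flow source:
-- original pattern inside the SP: temp table populated first, then joined
create table #lookup (customer_id int, region varchar(20));
insert into #lookup (customer_id, region)
select customer_id, region from dbo.customers where is_active = 1;

select s.sale_id, s.amount, l.region
from dbo.sales s
join #lookup l on l.customer_id = s.customer_id;

-- rewritten pattern: the same lookup expressed as a CTE, no temp table needed
with lookup as (
    select customer_id, region from dbo.customers where is_active = 1
)
select s.sale_id, s.amount, l.region
from dbo.sales s
join lookup l on l.customer_id = s.customer_id;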
#KarthikBhyresh
I looked at that article before, but it wasn't an issue with the sink; I was using Synapse LS as my source and a Data Lake Storage as my sink, so I knew from the beginning that this did not apply to my issue, even though it was the same error number.

Streaming PostgreSQL tables into Google BigQuery

I would like to automatically stream data from an external PostgreSQL database into a Google Cloud Platform BigQuery database in my GCP account. So far, I have seen that one can query external databases (MySQL or PostgreSQL) with the EXTERNAL_QUERY() function, e.g.:
https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries
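For reference, a federated query looks roughly like this; the connection name, table and columns are placeholders and assume a Cloud SQL connection has already been created:
select *
from external_query(
    -- placeholder connection id: project.location.connection_name
    'my-project.us.my_cloudsql_connection',
    'select id, name, created_at from public.customers'
);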
But for that to work, the database has to be in GCP Cloud SQL. I tried to see what options there are for streaming from the external PostgreSQL into a Cloud SQL PostgreSQL database, but I could only find information about replicating it as a one-time copy, not streaming:
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external
The reason why I want this streaming into BigQuery is that I am using Google Data Studio to create reports from the external PostgreSQL, which works great, but GDS can only accept SQL query parameters if the data comes from a Google BigQuery database. E.g. if we have a table with 1M entries and we want a Google Data Studio parameter to be supplied by the user, this turns into:
SELECT * from table WHERE id=#parameter;
which means that the query will be faster and won't hit the 100K record limit in Google Data Studio.
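For context, with a BigQuery data source Data Studio lets a custom query reference a user-supplied parameter directly; the project, dataset, table and parameter names below are assumptions, not from the question:
select *
-- placeholder table; @user_id is a parameter defined on the data source
from `my_project.my_dataset.my_table`
where id = @user_id;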
What's the best way of creating a connection between an external PostgreSQL (read-only access) and Google BigQuery so that when querying via BigQuery, one gets the same live results as querying the external PostgreSQL?
Perhaps you missed the options stated in the Google Cloud user guide?
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external#setup-replication
Notice in this section, it says:
"When you set up your replication settings, you can also decide whether the Cloud SQL replica should stay in-sync with the source database server after the initial import is complete. A replica that should stay in-sync is online. A replica that is only updated once, is offline."
I suspect online mode is what you are looking for.
What you are looking for will require some architecture design based on your needs, and some coding. There isn't a feature to automatically sync your PostgreSQL database with BigQuery (apart from the EXTERNAL_QUERY() functionality, which has some limitations: one connection per database, performance, total number of connections, etc.).
In case you are not looking for the data in real time, what you can do, with Airflow for instance, is have a DAG that connects to all your DBs once per day (using the KubernetesPodOperator, for instance), extracts the data (from the past day) and loads it into BQ. A typical ETL process, but in this case more EL(T). You can run this process more often if you cannot wait a day for the previous day's data.
On the other hand, if streaming is what you are looking for, then I can think of a Dataflow job. I guess you can connect using a JDBC connector.
In addition, depending on how your pipeline is structured, it might be easier to implement (but harder to maintain) if, at the same moment you write to your PostgreSQL DB, you also stream your data into BigQuery.
Not sure if you have tried this already, but instead of adding a parameter, if you add a dropdown filter based on a dimension, Data Studio will push that down to the underlying Postgres db in this form:
SELECT * from table WHERE id=$filter_value;
This should achieve the same results you want without going through BigQuery.

How to implement SCD2 in Snowflake tables using Azure Data Factory

I want to implement SCD2 in Snowflake tables. My source and target tables are both in Snowflake. The entire process has to be done using Azure Data Factory.
I went through the documentation from Azure for implementing SCD2 using data flows, but when I tried to create a dataset for the Snowflake connection it showed as disabled.
Is there any way, or any documentation, where I can see the steps to create SCD2 in ADF with Snowflake tables?
Thanks
vipendra
SCD2 in ADF can be built and managed graphically via data flows. The Snowflake connector for ADF does not yet work directly with data flows. So for now, you will need to use the Copy activity in an ADF pipeline to stage the dimension data in Blob or ADLS, then build your SCD2 logic in data flows using the staged data.
Your pipeline will look something like this:
[Copy Activity Snowflake-to-Blob] -> [Data Flow SCD2 logic Blob-to-Blob] -> [Copy Activity Blob-to-Snowflake]
We are working on direct connectivity to Snowflake from data flows and hope to land that soon.
If your source and target tables are both in Snowflake, you could use Snowflake Streams to do this. There's a blog post covering this in more detail at https://community.snowflake.com/s/article/Building-a-Type-2-Slowly-Changing-Dimension-in-Snowflake-Using-Streams-and-Tasks-Part-1
However, in short, if you have a source table called source, you can put a stream on it like so:
create or replace stream source_changes on table source;
This will capture all the changes made to the source table. You can then build a view on that stream that establishes how you want to feed those changes into the SCD table. (The blog post uses CASE statements to set start and end dates on each row in the view.)
From there, you can use a Snowflake Task to automate loading from the stream into the SCD table, running only when the stream actually has changes.
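As a rough sketch of that last step (the dimension table, its columns, the warehouse and the schedule below are assumptions, not taken from the blog post), a task can poll the stream and apply a simple SCD2-style merge only when new data has arrived:
-- a task that runs only when the stream has data
create or replace task load_scd2
    warehouse = my_wh
    schedule = '5 minute'
    when system$stream_has_data('source_changes')
as
    merge into dim_customer d
    using (
        -- each changed row appears twice: once to close the current version,
        -- once (with a null merge key) to insert the new version
        select id as merge_key, id, name, current_timestamp() as change_ts
        from source_changes where metadata$action = 'INSERT'
        union all
        select null, id, name, current_timestamp()
        from source_changes where metadata$action = 'INSERT'
    ) s
    on d.id = s.merge_key and d.is_current
    when matched then
        update set end_date = s.change_ts, is_current = false
    when not matched and s.merge_key is null then
        insert (id, name, start_date, end_date, is_current)
        values (s.id, s.name, s.change_ts, null, true);

-- tasks are created suspended; resume to start the schedule
alter task load_scd2 resume;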