BigQuery View is not working if I used BigQuery Plugin - google-cloud-data-fusion

I've been using the BigQuery plugin under the Source category. When I use a BigQuery view, the pipeline throws an error saying that views are not allowed. If I use a permanent table that contains repeated columns, it also throws an error, unsupported mode 'repeated', while retrieving the table's schema. Does anyone have any information on this?

The BigQuery source exports the data from the table into a temporary GCS bucket and then reads it in the pipeline. Since BigQuery views cannot be exported (please see the limitations here - https://cloud.google.com/bigquery/docs/views), the pipeline fails.
Also, the BigQuery source currently does not support repeated columns. The work is in progress - https://issues.cask.co/browse/CDAP-15256. Is this what you are looking for?

Related

Azure Synapse Pipeline copy data from the BigQuery, where the source schema is hierarchical with nested columns

Please help me with copying data from Google BigQuery to Azure Data Lake Storage Gen2 with Serverless SQL Pool.
I am using Azure Synapse's Copy data pipeline. The issue is that I cannot figure out how to handle a BigQuery source table with a hierarchical schema. This results in missing columns and inaccurate datetime values at the sink.
The source is a Google BigQuery table containing the Google Cloud Billing export of a project's standard usage cost. The source table's schema is hierarchical with nested columns, such as service.id; service.description; sku.id; sku.description; Project.labels.key; Project.labels.value, etc.
When I click Preview data on the Source tab of the Copy data pipeline, it only gives me the top of the column hierarchy. For example, it shows only the column name [service] with a value of {"v":{"f":[{"v":"[service.id]"},{"v":"[service.description]"}]}}
image description: Source with nested columns results in issues with Synapse Copy Data Pipeline
I have tried to configure the Copy Pipeline with the following:
Source Tab:
Use query - I think the solution lies here, but I cannot figure out the syntax for selecting the proper columns. I watched a YouTube video from TechBrothersIT, "How to Pass Parameters to SQL query in Azure Data Factory - ADF Tutorial 2021", but I am still unable to do it.
Sink Tab:
1. Sink dataset in various formats (csv, json and parquet): csv and parquet give similar results, and the json format failed
2. Sink dataset to Azure SQL Database: failed because it is not supported with Serverless SQL Pool
Mapping Tab: (note: edited on Jan 22 with a screenshot to show the issue)
Tried Import schemas with the Sink tab copy behavior set to None, Flatten Hierarchy and Preserve Hierarchy, but I am still unable to get the source columns recognized as hierarchical. I am unable to get either the Collection reference or the Advanced editor configurations to show up. Ref: Screenshot of Source columns not detected as Hierarchical; MS Doc on Schema and data type mapping in copy activity
I have also tried the Data flow pipeline, but it does not support Google BigQuery as a source yet.
Here are the steps to reproduce / get to my situation:
Register with Google Cloud and set up billing export (of standard usage cost) to BigQuery.
At Azure Synapse Analytics, create a Linked service with user authentication. Please follow Data Tech's Youtube video
"Google BigQuery connection (or linked service) in Azure Synapse analytics"
At Azure Synapse Analytics, Integrate, click on the "+" sign -> Copy Data Tool
I believe the answer is on the Source tab with Query and Functions. Please help me figure this out, or point me in the right direction.
Looking forward to your input. Thanks in advance!
ADF allows you to write a query in the Google BigQuery source dataset. So write a query that flattens the nested columns with the UNNEST operator, and then map the result to the sink.
I tried to repro this with a sample nested table.
img:1 nested table
img:2 sample data of nested table
Script to flatten the nested table:
select
  user_id,
  a.post_id,
  a.creation_date
from `ds1.stackoverflow_nested`
cross join unnest(comments) a
img:3 flattened table.
Use this query in copy activity source dataset.
img:4 Source settings of copy activity.
Then take the sink dataset, do the mapping and execute the ADF pipeline.
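To make the effect of the flattening query above more concrete, here is a minimal Python sketch of what cross join unnest does to each row: every element of the repeated field becomes its own output row, combined with the parent columns. The sample rows and the helper name cross_join_unnest are made up for illustration; only the column names (user_id, comments, post_id, creation_date) come from the query.

```python
# Hypothetical rows shaped like the nested table in the answer: each user
# has a repeated "comments" field holding records.
nested_rows = [
    {"user_id": 1, "comments": [
        {"post_id": 10, "creation_date": "2021-01-01"},
        {"post_id": 11, "creation_date": "2021-01-02"},
    ]},
    {"user_id": 2, "comments": [
        {"post_id": 20, "creation_date": "2021-02-01"},
    ]},
    {"user_id": 3, "comments": []},  # empty array: produces no output rows
]


def cross_join_unnest(rows, repeated_field, child_columns):
    """Yield one flat dict per (parent row, array element) pair."""
    for row in rows:
        # Parent columns are everything except the repeated field itself.
        parent = {k: v for k, v in row.items() if k != repeated_field}
        for element in row.get(repeated_field) or []:
            flat = dict(parent)
            for column in child_columns:
                flat[column] = element[column]
            yield flat


flat_rows = list(
    cross_join_unnest(nested_rows, "comments", ["post_id", "creation_date"])
)
for flat_row in flat_rows:
    print(flat_row)
```

Note that a parent row with an empty array disappears from the result, as it does with a cross join in BigQuery. If such rows need to be kept, a left join against the unnested array is the usual alternative.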
Reference:
MS document on google bigquery as a source - ADF
GC document on unnest operator

Mapping Synapse data flow with parameterized dynamic source need importing projection dynamically

I am trying to build a cloud data warehouse where I have staged the on-prem tables as parquet files in data lake.
I implemented the metadata driven incremental load.
In the above data flow I am trying to implement a merge query, passing the table name as a parameter so that the data flow dynamically locates the respective parquet files for the full data and the incremental data and then goes through some ETL steps to implement the merge.
The merge query is working fine, but I found that the projection is not correct. As the source files are dynamic, I also want to "import projection" dynamically at run time, so that the same data flow can be used to implement the merge query for any table.
In the picture, you can see it is showing 104 columns (a static projection that was imported at development time). For this table it should actually be 38 columns.
Can I assign the projection dynamically (i.e. at run time)? If so, how?
Or does anyone have any suggestions regarding this?
Thanking
Muntasir Joarder
Enable Schema drift in your source transformation when the metadata changes often. With it enabled, columns are added or removed at run time.
The source projection displays what was imported at design time, but the columns that actually flow through follow the source schema at run time.
Refer to this document for more details with examples.

Not all [GA4] BigQuery Export schema fields available via GA4 API?

I was looking for the number of IDFA / user consents collected per app release version.
I saw that it exists in the [GA4] BigQuery Export schema table (device.is_limited_ad_tracking).
But I couldn't find it in the GA4 API.
Is there any alternative?
The API, webUI, and BigQuery export are different sources of data. Not only do they have different schemas (available dimensions and metrics), when compared, the data often will not match. This is by design.
This article compares the data sources:
https://analyticscanvas.com/4-ways-to-export-ga4-data/
This article explains why they don't match:
https://analyticscanvas.com/3-reasons-your-ga4-data-doesnt-match/
In most cases, you'll find the solution is to use the BigQuery export. It has the richest set of data and doesn't have quota limits.

Loading many tables in Cloud Data Fusion fails with DAG error

I have an MS SQL Server data source with around 1000 tables, which I need to put into BigQuery. I was hoping to use Data Fusion to load them all into staging tables in BigQuery, and then perform transformations on them afterwards. However, as soon as I create a pipeline with two "islands" it gives a DAG error. Is that a feature or just something I'm doing wrong? I can't find anything in the documentation. My pipeline looks like this:
And the error I get when I try to deploy is: "Invalid DAG. There is an island made up of stages BigTest,BigQuery BigTest (no other stages connect to them)."
Each pipeline is a single DAG (directed acyclic graph), and all sources and sinks must be connected for the configuration to be valid. You can use a multi-table source plugin that brings in multiple tables at once to a landing table in BQ.
You can use the Multi table plugins and the BQ Multi table sink for your use case.
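As an illustration of the "single connected DAG" rule, the following Python sketch groups pipeline stages into connected components and reports more than one group when the pipeline contains islands. This is only a sketch of the idea, not how Data Fusion validates pipelines internally. The stage names BigTest and BigQuery BigTest come from the error message; OtherSource, BigQuery OtherSource and the helper find_islands are made up.

```python
from collections import defaultdict


def find_islands(stages, connections):
    """Group stages into connected components, ignoring edge direction."""
    neighbours = defaultdict(set)
    for src, dst in connections:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    seen = set()
    islands = []
    for stage in stages:
        if stage in seen:
            continue
        # Depth-first walk collecting everything reachable from this stage.
        component, stack = [], [stage]
        seen.add(stage)
        while stack:
            current = stack.pop()
            component.append(current)
            for nxt in sorted(neighbours[current]):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        islands.append(sorted(component))
    return islands


# Two independent source -> sink pairs, as in the question's pipeline.
stages = ["OtherSource", "BigQuery OtherSource", "BigTest", "BigQuery BigTest"]
connections = [
    ("OtherSource", "BigQuery OtherSource"),
    ("BigTest", "BigQuery BigTest"),
]
islands = find_islands(stages, connections)
print(islands)
if len(islands) > 1:
    print("Invalid: pipeline has", len(islands), "disconnected islands")
```

A pipeline passes this check only when find_islands returns exactly one group, which is why a single multi-table source feeding a single multi-table sink avoids the error.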

PySpark save to Redshift table with "Overwrite" mode results in dropping table?

I am using PySpark in AWS Glue to load data from S3 files into a Redshift table. The code used mode("Overwrite") and failed with an error stating "can't drop table because other object depend on the table". It turned out that there is a view created on top of that table. It seems the "Overwrite" mode actually drops and re-creates the Redshift table and then loads the data. Is there any option that only truncates the table instead of dropping it?
AWS Glue uses the Databricks Spark Redshift connector (it's not documented anywhere, but I verified that empirically). The Spark Redshift connector's documentation mentions:
Overwriting an existing table: By default, this library uses transactions to perform overwrites, which are implemented by deleting the destination table, creating a new empty table, and appending rows to it.
There is a related discussion in line with your question, where truncate was used instead of overwrite, with a combination of Lambda and Glue. Please refer to it for the detailed discussion and code samples. Hope this helps.
regards
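One way to get truncate-then-load behaviour in a Glue job is the preactions connection option, which runs SQL on Redshift before the load. The sketch below only builds the options dictionary so it can be run anywhere. The table name, database name, connection name and S3 path are placeholders, and the helper redshift_truncate_options is made up for illustration.

```python
def redshift_truncate_options(table, database):
    """Build connection options that empty the table before loading it."""
    return {
        "dbtable": table,
        "database": database,
        # Runs on Redshift before the load. TRUNCATE keeps the table
        # definition, so views that depend on it are not affected.
        "preactions": f"TRUNCATE TABLE {table};",
    }


options = redshift_truncate_options("public.target_table", "dev")
print(options)

# Inside a Glue job (not runnable locally) the options would be passed
# along these lines; the connection name and temp dir are placeholders:
#
# glueContext.write_dynamic_frame.from_jdbc_conf(
#     frame=dynamic_frame,
#     catalog_connection="my-redshift-connection",
#     connection_options=options,
#     redshift_tmp_dir="s3://my-bucket/tmp/",
# )
```

Keep in mind that TRUNCATE in Redshift commits immediately, so if the load that follows fails, the table stays empty until the job is rerun.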