I'm trying to get the auto schema from a nested table with Data Fusion, but I get this error:
1) What is the best way to handle nested tables with Data Fusion?
2) What is the way to export the schema from a BigQuery table and use it in Data Fusion?
You are encountering that error because the BigQuery plugins do not yet support the STRUCT type. Adding STRUCT support is an improvement to the BigQuery plugin that is already tracked and prioritized in https://issues.cask.co/browse/CDAP-15256.
I'm trying to create a continuous migration job from AWS S3 to Redshift using AWS Glue.
I want to load object data types into Redshift as the SUPER type directly in AWS Glue.
However, during the call to glueContext.write_dynamic_frame.from_jdbc_conf, if the data contains an object data type, I get the error message "CSV data source does not support struct data type", and I am aware of the cause of the error.
One option would be to apply pyspark.sql.functions.to_json to the object data and later use json_extract_path_text() when querying the objects in Redshift.
But I am hoping there is an approach in AWS Glue that supports a direct transformation and loads object-type data as the SUPER type (the type Amazon Redshift uses to support JSON columns).
Also, I do not want to flatten the objects; I want to keep them as they are, so dynamic_frame.relationalize() is not a suitable solution either.
Any help would be greatly appreciated.
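For reference, this is roughly the to_json fallback I mentioned above, not the direct SUPER load I'm after (a minimal sketch; glueContext and dyf come from the Glue job context, and the payload column name is a placeholder):

```python
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F

# Convert the DynamicFrame to a Spark DataFrame so to_json can be applied.
df = dyf.toDF()

# Serialize the struct column to a JSON string so the JDBC/CSV write path
# accepts it; in Redshift it lands as VARCHAR and can later be queried with
# json_extract_path_text().
df = df.withColumn("payload", F.to_json(F.col("payload")))

# Convert back to a DynamicFrame for write_dynamic_frame.from_jdbc_conf.
dyf_out = DynamicFrame.fromDF(df, glueContext, "dyf_out")
```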
I am creating a data warehouse using Azure Data Factory to extract data from a MySQL table and save it in Parquet format in an ADLS Gen2 filesystem. From there, I use Synapse notebooks to process and load the data into the destination tables.
The initial load is fairly easy using df.write.saveAsTable('orders'); however, I am running into some issues with the incremental loads that follow. In particular, I have not been able to find a way to reliably insert/update information in an existing Synapse table.
Since Spark does not allow DML operations on a table, I have resorted to reading the current table into a Spark DataFrame and inserting/updating records in that DataFrame. However, when I try to save that DataFrame using df.write.saveAsTable('orders', mode='overwrite', format='parquet'), I run into a "Cannot overwrite table 'orders' that is also being read from" error.
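For context, this is roughly the pattern that produces the error (a minimal sketch; the table name, join key, and incremental path are placeholders, and spark is the SparkSession provided by the Synapse notebook):

```python
# New/changed rows extracted by Data Factory into ADLS Gen2 (placeholder path).
incremental_df = spark.read.parquet(
    'abfss://container@account.dfs.core.windows.net/orders_increment/'
)

# Read the current state of the managed table.
existing_df = spark.read.table('orders')

# Emulate an upsert: keep existing rows not present in the increment,
# then append the new/updated rows.
merged_df = (
    existing_df
    .join(incremental_df, on='order_id', how='left_anti')
    .unionByName(incremental_df)
)

# This overwrite fails because 'orders' is still being read lazily via existing_df.
merged_df.write.saveAsTable('orders', mode='overwrite', format='parquet')
```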
A solution indicated by this suggests creating a temporary table and then inserting from it, but that still results in the above error.
Another solution in this post suggests writing the data into a temporary table, dropping the target table, and then renaming the temporary table, but upon doing this, Spark gives me FileNotFound errors regarding metadata.
I know Delta tables can fix this issue pretty reliably, but our company is not yet ready to move over to Databricks.
All suggestions are greatly appreciated.
I was surprised to find that Cloud Datastream does not support enum data types in the source when replicating from PostgreSQL.
Datastream doesn't support replication of columns of the enumerated (ENUM) data type.
As we have quite a few fields created that way, that is not a viable option for us. Is there any good workaround for this limitation?
From the Google issue tracker:
A workaround is to create a generated column which is of type text. DataStream will then sync the text column automatically and happily.
Also:
we're looking into adding support for ENUMs as part of our GA launch of the PostgreSQL source
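A minimal sketch of that generated-column workaround, assuming PostgreSQL 12+ (for stored generated columns) and using placeholder table, column, and connection details; the ENUM column status is mirrored into a text column that Datastream can replicate:

```python
import psycopg2

# Add a stored generated column that mirrors the ENUM column as text,
# so Datastream replicates the text version instead of the ENUM column.
DDL = """
ALTER TABLE orders
    ADD COLUMN status_text text
    GENERATED ALWAYS AS (status::text) STORED;
"""

with psycopg2.connect("dbname=appdb user=postgres password=secret host=localhost") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```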
Apache Flink 1.11.0
Python Table API
Catalog: PostgreSQL
Reading and writing data through the Table API from PostgreSQL catalog tables that contain UUID columns throws an UnsupportedOperationException for the UUID data type.
How do I handle the UUID data type in PyFlink?
I don't know much about PostgreSQL's type system. Is the UUID data type similar to a string? The types supported by PyFlink are essentially the same as the types of Flink's Table API. For details, you can refer to the documentation [1].
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/python/table-api-users-guide/python_types.html
It looks like there isn't an elegant way to handle Postgres UUIDs out of the box yet, but this is something that will be covered once the Table API's type system rework is completed. I found the JIRA issue for this: https://issues.apache.org/jira/browse/FLINK-19869
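In the meantime, one possible workaround sketch (illustrative only; the table name, columns, URL, and credentials are placeholders) is to expose the uuid column as text on the PostgreSQL side, e.g. through a view that selects id::text, and declare it as STRING in a plain JDBC table definition so the UUID type never reaches the Table API:

```python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment, EnvironmentSettings

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(
    env,
    environment_settings=EnvironmentSettings.new_instance()
    .in_streaming_mode()
    .use_blink_planner()
    .build(),
)

# The PostgreSQL view 'users_text' is assumed to cast the uuid column to text,
# so Flink only ever sees a STRING column.
t_env.execute_sql("""
    CREATE TABLE users_text (
        id STRING,
        name STRING
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://localhost:5432/appdb',
        'table-name' = 'users_text',
        'username' = 'postgres',
        'password' = 'secret'
    )
""")

users = t_env.from_path('users_text').select("id, name")
```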
I've been using the BigQuery plugin under the source category. When I used a BigQuery view, the pipeline threw an error saying views are not allowed. Also, if I used a permanent table containing repeated columns, it threw an "unsupported mode 'repeated'" error while retrieving its schema. Does anyone have any information on this?
The BigQuery source exports the data from the table into temporary GCS buckets and then reads it in the pipeline. Since BigQuery views cannot be exported (please see the limitations here: https://cloud.google.com/bigquery/docs/views), the pipeline fails.
Also, the BigQuery source currently does not support repeated columns. The work is in progress: https://issues.cask.co/browse/CDAP-15256. Is this what you are looking for?