Why are new columns added to parquet tables not available from glue pyspark ETL jobs? - pyspark

We've been exploring using Glue to transform some JSON data to Parquet. One scenario we tried was adding a column to the Parquet table, so that partition 1 has columns [A] and partition 2 has columns [A,B]. We then wanted to write further Glue ETL jobs to aggregate the Parquet table, but the new column was not available. When we used glue_context.create_dynamic_frame.from_catalog to load the dynamic frame, our new column was never in the schema.
We tried several configurations for our table crawler: a single schema for all partitions, a single schema for the S3 path, and a schema per partition. We could always see the new column in the Glue table data, but it was always null when we queried it from a Glue job using PySpark. The column was present in the Parquet files when we downloaded some samples, and it was available for querying via Athena.
Why are the new columns not available to PySpark?

This turned out to be a Spark configuration issue. From the Spark docs:
Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by
1. setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
2. setting the global SQL option spark.sql.parquet.mergeSchema to true.
We could enable schema merging in two ways:
1. Set the option on the Spark session: spark.conf.set("spark.sql.parquet.mergeSchema", "true")
2. Set mergeSchema to true in the additional_options when loading the dynamic frame:
source = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="table",
    additional_options={"mergeSchema": "true"}
)
After that the new column was available in the frame's schema.
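To make the effect of schema merging concrete, here is a minimal pure-Python sketch of the idea. It is a simplified model, not Spark's actual implementation: it assumes that without merging the schema is taken from a single file, and that merging unions the columns of all files. The function name merge_schemas and the column types are made up for illustration.

```python
def merge_schemas(file_schemas):
    """Union the columns of several per-file schemas, keeping first-seen order.

    Simplified model: a column that appears with two different types
    is treated as an incompatible schema and raises an error.
    """
    merged = {}
    for schema in file_schemas:
        for column, dtype in schema.items():
            if column in merged and merged[column] != dtype:
                raise ValueError(f"conflicting types for column {column!r}")
            merged.setdefault(column, dtype)
    return merged


# Hypothetical schemas matching the question: partition 1 has [A], partition 2 has [A, B].
partition_1 = {"A": "string"}
partition_2 = {"A": "string", "B": "int"}

# Model of reading without merging: the schema comes from one file only.
schema_without_merge = partition_1
# Model of reading with mergeSchema enabled: columns from all files are combined.
schema_with_merge = merge_schemas([partition_1, partition_2])

print(list(schema_without_merge))  # ['A']
print(list(schema_with_merge))     # ['A', 'B']
```

In this model, column B only shows up once the schemas of both partitions are combined, which matches the behaviour described above.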

Related

Incrementally loading into a Synapse table using Spark

I am creating a data warehouse using Azure Data Factory to extract data from a MySQL table and saving it in parquet format in an ADLS Gen 2 filesystem. From there, I use Synapse notebooks to process and load data into destination tables.
The initial load is fairly easy using df.write.saveAsTable('orders'). However, I am running into some issues doing incremental loads after the initial load. In particular, I have not been able to find a way to reliably insert/update information in an existing Synapse table.
Since Spark does not allow DML operations on a table, I have resorted to reading the current table into a Spark DataFrame and inserting/updating records in that DataFrame. However, when I try to save that DataFrame using df.write.saveAsTable('orders', mode='overwrite', format='parquet'), I run into a Cannot overwrite table 'orders' that is also being read from error.
One suggested solution is to create a temporary table and then insert from it, but that still results in the above error.
Another suggested solution is to write the data into a temporary table, drop the target table, and then rename the temporary table, but when I do this Spark gives me FileNotFound errors regarding metadata.
I know Delta tables can fix this issue pretty reliably, but our company is not yet ready to move over to Databricks.
All suggestions are greatly appreciated.

Loading dataframe from dynamodb - aws glue pyspark

I'm trying to read the records from a DynamoDB table. I have tried using a dynamic frame, but since the table has 8 million records, filtering took too long. I don't need to load all 8 million records into a DataFrame anyway. Instead of applying a filter to the dynamic frame, I want to know whether there is an option to load the DataFrame by passing a query, so that only a few records are loaded and the job runs faster.
You can load a DataFrame by passing a query to spark.sql(), but first you have to run an AWS Glue crawler on the DynamoDB table so that a corresponding table is created in the AWS Glue catalog. You can then read from that catalog table directly using a Spark DataFrame.

AWS Glue Scala Upsert

I am trying to upsert data into an existing S3 bucket from another one using AWS Glue in Scala. Is there a standard way to do this? One of the methods that I found was to use SQL's MERGE statement. What are the advantages and disadvantages of using that?
Thanks
You can't really implement a SQL MERGE on S3, since it's not possible to update existing data objects in place.
A workaround is to load the existing rows in a Glue job, merge them with the incoming dataset, drop obsolete records, and overwrite all objects on S3. If you have a lot of data, it would be more efficient to partition it by some columns and then overwrite only those partitions that should contain new data.
If your goal is to prevent duplicates, you can do something similar: load the existing rows, drop from the incoming dataset those records that already exist on S3 (loaded in the previous step), and then write only the new records to S3.
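The two workarounds above can be sketched in plain Python. This only illustrates the merge logic on small in-memory lists of rows; the function names, the id key and the sample rows are hypothetical. In a Glue job the same logic would normally be expressed with DataFrame operations (for the duplicate-prevention case, a join of type left_anti is one option), which is an assumption about a possible implementation rather than something stated in the answer.

```python
def upsert(existing, incoming, key):
    """Merge incoming rows into existing rows; an incoming row replaces
    an existing row with the same key, and new keys are appended."""
    merged = {row[key]: row for row in existing}
    for row in incoming:
        merged[row[key]] = row
    return list(merged.values())


def only_new(existing, incoming, key):
    """Return the incoming rows whose key is not already present in existing."""
    seen = {row[key] for row in existing}
    return [row for row in incoming if row[key] not in seen]


# Hypothetical data: rows already on S3 and a freshly arrived batch.
existing_rows = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
incoming_rows = [{"id": 2, "value": "B"}, {"id": 3, "value": "c"}]

# Workaround 1: build the full merged dataset, then overwrite the target.
full_dataset = upsert(existing_rows, incoming_rows, "id")

# Workaround 2: write only records that do not exist yet.
rows_to_append = only_new(existing_rows, incoming_rows, "id")
```

The first variant rewrites everything (or every affected partition), while the second only appends, so it never updates rows that already exist.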

AWS Glue, data filtering before loading into a frame, naming s3 objects

I have 3 questions, for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way for me to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically creates a name for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically separated the output into 20 files with 5 rows in each of those files. How can I specify the batch size in a job?
Thanks in advance
This is a question I have also posted in AWS Glue forum as well, here is a link to that: https://forums.aws.amazon.com/thread.jspa?threadID=280743
1. Glue supports the pushdown predicates feature; however, it currently works only with partitioned data on S3. There is a feature request to support it for JDBC connections, though.
2. It's not possible to specify the names of output files. However, it looks like there is an option to rename the files afterwards (note that renaming on S3 means copying an object from one location to another, so it is a costly and non-atomic operation).
3. You can't really control the size of output files. You can reduce the number of files using coalesce, though. Also, starting from Spark 2.2 you can cap the number of records per file by setting the config spark.sql.files.maxRecordsPerFile.
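As a rough sketch of the last point, the helper below combines coalesce with a per-file record cap. It assumes a Spark DataFrame (a Glue DynamicFrame would first need to be converted with toDF()) and uses the maxRecordsPerFile writer option, which is the per-write counterpart of spark.sql.files.maxRecordsPerFile. The function name, the path and the numbers are placeholders.

```python
def write_with_file_limits(df, path, max_files, max_records_per_file):
    """Write a Spark DataFrame as Parquet while limiting the output layout.

    coalesce() reduces the number of partitions to at most max_files, and
    maxRecordsPerFile splits any partition holding more rows than the cap
    into several files, so the final file count can still exceed max_files.
    """
    writer = (
        df.coalesce(max_files)
        .write
        .option("maxRecordsPerFile", max_records_per_file)
    )
    writer.parquet(path)


# Hypothetical usage inside a job, where df is an existing DataFrame:
# write_with_file_limits(df, "s3://example-bucket/output/", 4, 1000000)
```

Neither setting controls file size in bytes directly; they only bound the number of partitions and the number of rows per file.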

Save to Hive Partitioned Table using Spark

The Spark 1.4.0 and higher source code seems to indicate that the subject of this post is not possible (except in a Spark-specific format).
def saveAsTable(tableName: String): Unit = {
* When the DataFrame is created from a non-partitioned [[HadoopFsRelation]] with a single input
* path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC
* and Parquet), the table is persisted in a Hive compatible format, which means other systems
* like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL
* specific format.
I was wondering if there are systematic workarounds for this. Any Hive table worth its salt will be partitioned, for scalability and performance reasons. Therefore this is the normal use case, not a corner case.