Deltalake - Merge is creating too many files per partition - scala

I have a table that ingests new data every day through a merge. I'm currently trying to migrate from ORC to the Delta file format, and I ran into a problem when running the following simple merge operation:
DeltaTable
  .forPath(sparkSession, deltaTablePath)
  .alias(SOURCE)
  .merge(
    rawDf.alias(RAW),
    joinClause // using primary keys and when possible partition keys. Not important here I guess
  )
  .whenNotMatched().insertAll()
  .whenMatched(dedupConditionUpdate).updateAll()
  .whenMatched(dedupConditionDelete).delete()
  .execute()
When the merge is done, every impacted partition has hundreds of new files. As there is one new ingestion per day, this behaviour makes every following merge operation slower and slower.
Versions:
Spark : 2.4
DeltaLake: 0.6.1
Is there a way to repartition before saving, or any other way to improve this?

After searching a bit in Delta core's code, I found an option that repartitions the data on write:
spark.conf.set("spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "true")
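A rough way to see why this option helps (a plain-Python toy model, not Delta's actual code): assume every write task produces one file for each table partition it holds rows for. With Spark's default of 200 shuffle partitions, each impacted table partition can then end up with up to 200 new files. If the rows are first repartitioned by the partition column, all rows of a table partition land in a single task. The function name and the sample data below are made up for illustration.

```python
from collections import defaultdict


def files_per_partition(rows, num_tasks, repartition_by_key):
    """Count output files per table partition in a toy write model.

    rows: list of (row_id, partition_value) pairs.
    Each task writes one file per table partition it holds rows for.
    """
    task_contents = defaultdict(set)  # task index -> partition values held
    for row_id, partition_value in rows:
        if repartition_by_key:
            # All rows of one table partition are sent to the same task.
            task = hash(partition_value) % num_tasks
        else:
            # Rows are spread by the join key (row_id stands in for it here).
            task = row_id % num_tasks
        task_contents[task].add(partition_value)

    counts = defaultdict(int)
    for partition_values in task_contents.values():
        for partition_value in partition_values:
            counts[partition_value] += 1
    return dict(counts)


# 1000 rows touching two daily partitions, written by 200 tasks.
rows = [
    (i, "2020-01-01" if (i // 200) % 2 == 0 else "2020-01-02")
    for i in range(1000)
]
without_repartition = files_per_partition(rows, 200, repartition_by_key=False)
with_repartition = files_per_partition(rows, 200, repartition_by_key=True)
```

In this model the merge without repartitioning writes 200 files into each of the two partitions, while the repartitioned write produces one file per partition.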

You should set the delta.autoOptimize.autoCompact property on the table to enable auto compaction.
The following Databricks documentation page shows how to set it for existing and new tables:
https://docs.databricks.com/delta/optimizations/auto-optimize.html

Related

Is it possible to query the diff between two Apache Iceberg snapshots?

I have two snapshots in my Iceberg history table, and I want to be able to see the difference between them, or at least the columns/rows that were affected by the last snapshot. Is there an easy way of getting this information?
You can use the Java API to get the incremental change log between two snapshot IDs of a table:
table
    .newIncrementalChangelogScan()
    .fromSnapshotExclusive(startSnapshotId)
    .toSnapshot(toSnapshot)
    .caseSensitive(caseSensitive)
    .filter(filterExpression())
    .project(expectedSchema)
    .planTasks();
This returns the full change log.
If you just want to query the incremental data, an easier way is to use Spark or Flink:
spark.read()
    .format("iceberg")
    .option("start-snapshot-id", "10963874102873")
    .option("end-snapshot-id", "63874143573109")
    .load("path/to/table");
This currently returns only the data from append operations; replace, overwrite, and delete operations are not supported.
Enjoy yourself.
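To make the range semantics concrete, here is a small plain-Python model (not Iceberg code) of an append-only incremental read: the start snapshot is exclusive, the end snapshot is inclusive, and only rows added by append snapshots are returned. The Snapshot class, the linear history, and the choice to silently skip non-append snapshots are all assumptions of this sketch; the real reader may reject such snapshots instead.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    snapshot_id: int
    operation: str  # e.g. "append", "overwrite", "delete", "replace"
    added_rows: list = field(default_factory=list)


def incremental_appends(history, start_id, end_id):
    """Rows added by append snapshots in the range (start_id, end_id].

    history must be ordered from oldest to newest snapshot.
    """
    ids = [s.snapshot_id for s in history]
    start, end = ids.index(start_id), ids.index(end_id)
    rows = []
    for snap in history[start + 1 : end + 1]:
        # Assumption of this sketch: non-append snapshots are skipped.
        if snap.operation == "append":
            rows.extend(snap.added_rows)
    return rows


history = [
    Snapshot(1, "append", ["a"]),
    Snapshot(2, "append", ["b"]),
    Snapshot(3, "delete"),
    Snapshot(4, "append", ["c"]),
]
changed = incremental_appends(history, start_id=1, end_id=4)
```

Here the result contains "b" and "c" but not "a", because snapshot 1 is the exclusive lower bound, and the delete in snapshot 3 is not reflected at all.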

Why are new columns added to parquet tables not available from glue pyspark ETL jobs?

We've been exploring using Glue to transform some JSON data to parquet. One scenario we tried was adding a column to the parquet table. So partition 1 has columns [A] and partition 2 has columns [A,B]. Then we wanted to write further Glue ETL jobs to aggregate the parquet table but the new column was not available. Using glue_context.create_dynamic_frame.from_catalog to load the dynamic frame our new column was never in the schema.
We tried several configurations for our table crawler. Using a single schema for all partitions, single schema for s3 path, schema per partition. We could always see the new column in the Glue table data but it was always null if we queried it from a Glue job using pyspark. The column was in the parquet when we downloaded some samples and available for querying via Athena.
Why are the new columns not available to pyspark?
This turned out to be a Spark configuration issue. From the Spark docs:
Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by
setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
setting the global SQL option spark.sql.parquet.mergeSchema to true.
We could enable schema merging in two ways:
Set the option on the Spark session: spark.conf.set("spark.sql.parquet.mergeSchema", "true")
Set mergeSchema to true in the additional_options when loading the dynamic frame:
source = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="table",
    additional_options={"mergeSchema": "true"}
)
After that the new column was available in the frame's schema.
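For intuition, schema merging can be pictured with a plain-Python toy (this is not how Spark implements it): the merged schema is the union of the columns found in all files, and rows from older files get null for columns they never had. The helper names and the sample rows are made up for illustration.

```python
def merge_schemas(partition_schemas):
    """Union of column names, kept in order of first appearance."""
    merged = []
    for schema in partition_schemas:
        for column in schema:
            if column not in merged:
                merged.append(column)
    return merged


def read_with_schema(rows, schema):
    """Project each row onto the schema; absent columns become None."""
    return [{col: row.get(col) for col in schema} for row in rows]


partition_1 = [{"A": 1}, {"A": 2}]    # written before column B existed
partition_2 = [{"A": 3, "B": "x"}]    # written after column B was added

merged_schema = merge_schemas([list(partition_1[0]), list(partition_2[0])])
merged_rows = read_with_schema(partition_1 + partition_2, merged_schema)
```

With the merged schema, column B shows up for every row: null for the rows of partition 1 and "x" for the row of partition 2.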

How to persist data with Spark/ Scala?

I'm performing a batch process using Spark with Scala.
Each day, I need to import a sales file into a Spark dataframe and perform some transformations (a file with the same schema; only the date and the sales values may change).
At the end of the week, I need to use all daily transformations to perform weekly aggregations. Consequently, I need to persist the daily transformations so that Spark does not have to do everything at the end of the week (I want to avoid importing all the data and performing all the transformations at the end of the week).
I would also like a solution that supports incremental updates (upserts).
I went through some options like DataFrame.persist(StorageLevel.DISK_ONLY). I would like to know if there are better options, such as using Hive tables.
What are your suggestions on that?
What are the advantages of using Hive tables over DataFrame.persist?
Many thanks in advance.
You can save the results of your daily transformations in Parquet (or ORC) format, partitioned by day. Then you can run your weekly process on this Parquet data with a query that filters only the data for the last week. Predicate pushdown and partitioning work pretty efficiently in Spark, so only the data selected by the filter is loaded for further processing.
import org.apache.spark.sql.SaveMode

dataframe
  .write
  .mode(SaveMode.Append)
  .partitionBy("day") // assuming you have a day column in your DF
  .parquet(parquetFilePath)
The SaveMode.Append option lets you incrementally add data to the existing Parquet files (as opposed to overwriting them with SaveMode.Overwrite).
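The layout this produces, and why the weekly job stays cheap, can be sketched in plain Python using CSV files in place of Parquet (an assumption made to keep the example dependency-free). Each day gets its own day=YYYY-MM-DD directory, appends only add files, and the weekly job opens only the directories of the requested week. The function names and the sample sales are made up for illustration.

```python
import csv
import tempfile
from datetime import date, timedelta
from pathlib import Path


def append_daily(root, day, rows):
    """Append one day's (product, amount) rows under root/day=YYYY-MM-DD/."""
    part_dir = Path(root) / f"day={day.isoformat()}"
    part_dir.mkdir(parents=True, exist_ok=True)
    # A new file per call mimics append mode: existing files are untouched.
    out_file = part_dir / f"part-{len(list(part_dir.iterdir())):05d}.csv"
    with out_file.open("w", newline="") as f:
        csv.writer(f).writerows(rows)


def weekly_totals(root, week_end):
    """Sum amounts per product over the 7 days ending at week_end (inclusive)."""
    wanted = {(week_end - timedelta(days=n)).isoformat() for n in range(7)}
    totals = {}
    for part_dir in Path(root).iterdir():
        # Partition pruning: directories outside the week are never opened.
        if part_dir.name.split("=", 1)[1] not in wanted:
            continue
        for data_file in part_dir.glob("*.csv"):
            with data_file.open(newline="") as f:
                for product, amount in csv.reader(f):
                    totals[product] = totals.get(product, 0) + int(amount)
    return totals


root = tempfile.mkdtemp()
append_daily(root, date(2020, 1, 1), [("apples", 5)])  # outside the week
append_daily(root, date(2020, 1, 6), [("apples", 3), ("pears", 2)])
append_daily(root, date(2020, 1, 12), [("apples", 4)])
result = weekly_totals(root, date(2020, 1, 12))
```

The file from 2020-01-01 is skipped without being read, which is the effect the day partitioning is meant to achieve.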

Spark Streaming dropDuplicates

Spark 2.1.1 (scala api) streaming json files from an s3 location.
I want to deduplicate any incoming records based on an ID column (“event_id”) found in the json for every record. I do not care which record is kept, even if duplication of the record is only partial. I am using append mode as the data is merely being enriched/filtered, with no group by/window aggregations, via the spark.sql() method. I then use the append mode to write parquet files to s3.
According to the documentation, I should be able to use dropDuplicates without watermarking in order to deduplicate (obviously this is not effective in long-running production). However, this fails with the error:
User class threw exception: org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets
That error seems odd as I am doing no aggregation (unless dropDuplicates or sparkSQL counts as an aggregation?).
I know that duplicates won’t occur more than 3 days apart, so I then tried again with a watermark (by calling .withWatermark() immediately before dropDuplicates). However, it seems to wait until the 3 days are up before writing the data (i.e., since today is July 24, only data up to the same time on July 21 is written to the output).
As there is no aggregation, I want to write every row immediately after the batch is processed, and simply throw away any rows with an event id that has occurred in the previous 3 days. Is there a simple way to accomplish this?
Thanks
In my case, I used to achieve that in two ways with DStreams.
One way:
1. Load tmp_data (it contains 3 days of unique data, see below).
2. Receive batch_data and do a leftOuterJoin with tmp_data.
3. Filter the result of step 2 and output the new unique data.
4. Update tmp_data with the new unique data from step 2's result and drop the old data (more than 3 days old).
5. Save tmp_data to HDFS or wherever.
6. Repeat the steps above for every batch.
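The steps of this first approach can be sketched in plain Python, with a dict standing in for tmp_data and a membership test standing in for the leftOuterJoin plus filter. This is only an illustration of the logic, not DStream code; process_batch and the sample events are hypothetical, and expiring state relative to the newest event time seen is an assumption of the sketch.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=3)


def process_batch(seen, batch):
    """Return (new unique records, updated state).

    seen:  dict event_id -> event_time of ids already emitted (tmp_data)
    batch: list of (event_id, event_time) records
    """
    seen = dict(seen)  # do not mutate the caller's state
    fresh = []
    for event_id, event_time in batch:
        if event_id not in seen:  # stands in for leftOuterJoin + filter
            seen[event_id] = event_time
            fresh.append((event_id, event_time))
    if seen:
        # Drop state older than 3 days, measured from the newest event seen.
        cutoff = max(seen.values()) - RETENTION
        seen = {eid: ts for eid, ts in seen.items() if ts >= cutoff}
    return fresh, seen


state = {}
batch_1 = [
    ("a", datetime(2017, 7, 20)),
    ("b", datetime(2017, 7, 20)),
    ("a", datetime(2017, 7, 20, 1)),  # duplicate inside the same batch
]
batch_2 = [
    ("b", datetime(2017, 7, 21)),      # duplicate of an earlier batch
    ("c", datetime(2017, 7, 24, 12)),
]
out_1, state = process_batch(state, batch_1)
out_2, state = process_batch(state, batch_2)
```

Every new row is emitted as soon as its batch is processed, and the state stays bounded because ids older than the retention window are dropped after each batch.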
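Another way:
1. Create a table in MySQL and set a UNIQUE INDEX on event_id.
2. Receive batch_data and just save event_id + event_time + whatever else you need to MySQL.
3. MySQL rejects the duplicates because of the unique index; use INSERT IGNORE so that they are skipped instead of raising an error.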
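The unique-index approach can be tried out with SQLite from the Python standard library, used here instead of MySQL only so that the example runs without a server. SQLite spells the statement INSERT OR IGNORE, while MySQL's counterpart is INSERT IGNORE. The table name, column names, and save_batch helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT, event_time TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_event_id ON events (event_id)")


def save_batch(conn, batch):
    """Insert a batch and return only the rows that were actually new."""
    inserted = []
    for event_id, event_time in batch:
        cur = conn.execute(
            "INSERT OR IGNORE INTO events (event_id, event_time) VALUES (?, ?)",
            (event_id, event_time),
        )
        if cur.rowcount == 1:  # 0 means the unique index rejected the row
            inserted.append((event_id, event_time))
    conn.commit()
    return inserted


first = save_batch(conn, [("a", "t1"), ("b", "t1"), ("a", "t2")])
second = save_batch(conn, [("b", "t3"), ("c", "t3")])
total = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The database does the deduplication, so the streaming job needs no state of its own; the trade-off is one insert attempt per record and a table that has to be purged of old ids separately.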
The solution we used was a custom implementation of org.apache.spark.sql.execution.streaming.Sink that inserts into a Hive table after dropping duplicates within the batch and performing a left anti join against the previous few days' worth of data in the target Hive table.

Spark-scala how to work with HDFS directory partition

To reduce processing time, I partitioned my data by date so that I only use the data for the required dates (not the complete table). So now my tables are stored in HDFS like this:
src_tbl (main dir)      trg_tbl (main dir)
2016-01-01 (sub dir)    2015-12-30
2016-01-02              2015-12-31
2016-01-03              2016-01-01
                        2016-01-03
Now I want to select min(date) from src_tbl, which will be 2016-01-01,
and from trg_tbl I want to use the data in the directories >= 2016-01-01 (src_tbl's min(date)), which will be the 2016-01-01 and 2016-01-03 data.
How can I select the required partitions or date folders from HDFS using Spark/Scala? After completing the process I need to overwrite the same date directories too.
Details about the process:
I want to choose the correct window of data from the source and target tables (the data for all other dates is not required). Then I want to do join -> lead / lag -> union -> write.
Spark SQL (including the DataFrame/Dataset APIs) is kind of funny in the way it handles partitioned tables with respect to retaining the existing partitioning information from one transformation/stage to the next.
For the initial loading, Spark SQL tends to do a good job of understanding how to retain the underlying partitioning information, provided that information is available in the form of Hive metastore metadata for the table.
So: are these Hive tables?
If so, so far so good: you should see the data loaded partition by partition according to the Hive partitions.
Will the DataFrame/Dataset remember this nice partitioning that is already set up?
Now things get a bit more tricky. The answer depends on whether a shuffle is required or not.
In your case, a simple filter operation, there should not be any need for one. So once again you should see the original partitioning preserved and thus good performance. Please verify that the partitioning is indeed retained.
I will mention that if any aggregate functions are invoked, you can be assured that your partitioning will be lost. Spark SQL will in that case use a HashPartitioner, inducing a full shuffle.
Update: the OP provided more details here: there is a lead/lag and a join involved. From a strictly performance perspective, they are then well advised to avoid Spark SQL and do the operations manually.
To the OP: the only thing I can suggest at this point is to check that
preservesPartitioning=true
is set in your RDD operations. But I am not even sure that this capability is exposed by Spark for lag/lead: please check.
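Independently of how Spark handles the partitioning afterwards, the directory selection described in the question is simple enough to sketch in plain Python. It assumes the sub-directory names have already been listed (for example through the Hadoop FileSystem API) and that they are ISO dates, which compare correctly as strings. The function name is made up for illustration; the directory names are the ones from the question.

```python
def partitions_to_process(src_dirs, trg_dirs):
    """Pick the target date directories at or after the earliest source date."""
    min_src = min(src_dirs)  # ISO dates (YYYY-MM-DD) sort correctly as strings
    return min_src, sorted(d for d in trg_dirs if d >= min_src)


src_tbl = ["2016-01-01", "2016-01-02", "2016-01-03"]
trg_tbl = ["2015-12-30", "2015-12-31", "2016-01-01", "2016-01-03"]
min_date, selected = partitions_to_process(src_tbl, trg_tbl)
```

The selected names are the only target directories that need to be loaded, and they are also the ones to overwrite once the join, lead/lag, and union steps are done.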