Continuously Updating Partitioned Parquet - scala

I have a Spark script that pulls data from a database and writes it to S3 in parquet format. The parquet data is partitioned by date. Because of the size of the table, I'd like to run the script daily and have it just rewrite the most recent few days of data (redundancy because data may change for a couple days).
I'm wondering how I can write the data to S3 in a way that only overwrites the partitions for the days I'm working with. SaveMode.Overwrite unfortunately wipes out everything already in the output path, and the other save modes don't seem to be what I'm looking for.
Snippet of my current write:
table
  .filter(row => row.ts.after(twoDaysAgo)) // update most recent 2 days
  .withColumn("date", to_date(col("ts"))) // add a column with just date
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("date") // use the new date column to partition the parquet output
  .parquet("s3a://some-bucket/stuff") // pick a parent directory to hold the parquets
Any advice would be much appreciated, thanks!

The answer I was looking for was dynamic partition overwrite, detailed in this article. Short answer: adding this line fixed my problem:
sparkConf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")
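For reference, a minimal sketch of how that setting fits into the write from the question, assuming a SparkSession named spark (setting the property at runtime via spark.conf.set is equivalent to the SparkConf line above). The config requires Spark 2.3 or later.

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, to_date}

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")

table
  .filter(row => row.ts.after(twoDaysAgo))
  .withColumn("date", to_date(col("ts")))
  .write
  .mode(SaveMode.Overwrite) // with dynamic mode, only the partitions present in this data are replaced
  .partitionBy("date")
  .parquet("s3a://some-bucket/stuff")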

Related

Deltalake - Merge is creating too many files per partition

I have a table that ingests new data every day through a merge. I'm currently trying to migrate from ORC to the Delta file format, and I stumbled upon a problem when processing the following simple merge operation:
DeltaTable
  .forPath(sparkSession, deltaTablePath)
  .alias(SOURCE)
  .merge(
    rawDf.alias(RAW),
    joinClause // using primary keys and when possible partition keys. Not important here I guess
  )
  .whenNotMatched().insertAll()
  .whenMatched(dedupConditionUpdate).updateAll()
  .whenMatched(dedupConditionDelete).delete()
  .execute()
When the merge is done, every impacted partition has hundreds of new files. As there is one new ingestion per day, this behaviour makes every following merge operation slower and slower.
Versions:
Spark : 2.4
DeltaLake: 0.6.1
Is there a way to repartition before saving? Or any other way to improve this?
After searching a bit in Delta core's code, I found an option that repartitions on write:
spark.conf.set("spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "true")
You should also set the delta.autoOptimize.autoCompact property on the table for auto compaction.
The following page shows how you can set it for existing and new tables:
https://docs.databricks.com/delta/optimizations/auto-optimize.html
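As a hedged sketch per the linked Databricks docs (auto compaction is a Databricks feature, and the table path here is illustrative), the property can be set on an existing Delta table with an ALTER TABLE statement:

// enable the merge-time repartitioning mentioned above
spark.conf.set("spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "true")

// enable auto compaction on an existing Delta table (path is hypothetical)
spark.sql(
  """ALTER TABLE delta.`/mnt/delta/my_table`
    |SET TBLPROPERTIES (delta.autoOptimize.autoCompact = true)""".stripMargin)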

Overwriting the parquet file throws exception in spark

I am trying to read parquet files from an HDFS location, do some transformations, and overwrite the files in the same location. I have to overwrite the files in the same location because I have to run the same code multiple times.
Here is the code I have written
val df = spark.read.option("header", "true").option("inferSchema", "true").parquet("hdfs://master:8020/persist/local/")
//after applying some transformations lets say the final dataframe is transDF which I want to overwrite at the same location.
transDF.write.mode("overwrite").parquet("hdfs://master:8020/persist/local/")
Now the problem is that, because of overwrite mode, Spark deletes the files at the given location before it has finished reading them. So when executing the code I get the following error:
File does not exist: hdfs://master:8020/persist/local/part-00000-e73c4dfd-d008-4007-8274-d445bdea3fc8-c000.snappy.parquet
Any suggestions on how to solve this problem? Thanks.
The simple answer is that you cannot overwrite what you are reading. The reason is that overwrite would need to delete everything; however, since Spark works in parallel, some portions might still be being read at that time. Furthermore, even if everything had been read, Spark would still need the original files to recompute tasks that fail.
Since you need the input for multiple iterations, I would simply make the input and output paths arguments of the function that runs one iteration, and delete the previous iteration's data only once the write is successful (see the sketch below).
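A minimal sketch of that approach, assuming HDFS and illustrative path and function names (not from the original post):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

def runIteration(spark: SparkSession, inputPath: String, outputPath: String): Unit = {
  val df = spark.read.parquet(inputPath)
  val transDF = df // ... apply the transformations here ...

  // write to a *different* location first
  transDF.write.mode("overwrite").parquet(outputPath)

  // only now is it safe to remove the previous iteration's data
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  fs.delete(new Path(inputPath), true)
}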
This is what I tried and it worked. My requirement was almost the same: an upsert.
By the way, spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") was set, and even then the transform job was failing.
Take a backup of the S3 folder (the final curated layer) before every batch operation.
Using the DataFrame operations, first delete the S3 parquet file location before overwriting.
Then append to that location (sketched below).
Previously the entire job was running for 1.5 hours and failing frequently. Now it takes 10-15 minutes for the whole operation.
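A hedged sketch of that delete-then-append step, assuming an illustrative S3 prefix and reusing transDF from the question; the backup itself would be taken outside Spark (e.g. an aws s3 sync before the batch):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val targetPath = "s3a://curated-bucket/final-layer/" // hypothetical location

// delete the existing parquet files at the target location
val fs = FileSystem.get(new URI(targetPath), spark.sparkContext.hadoopConfiguration)
fs.delete(new Path(targetPath), true)

// then append the new data to that location
transDF.write.mode("append").parquet(targetPath)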

save PostgreSQL data in Parquet format

I'm working on a project which needs to generate parquet files from a huge PostgreSQL database. The data size can be gigantic (e.g. 10 TB). I'm very new to this topic and have done some research online, but did not find a direct way to convert the data to parquet files. Here are my questions:
The only feasible solution I saw is to load the Postgres table into Apache Spark via JDBC and save it as a parquet file. But I assume it will be very slow while transferring 10 TB of data.
Is it possible to generate a single huge parquet file of 10 TB? Or is it better to create multiple parquet files?
Hope my question is clear, and I really appreciate any helpful feedback. Thanks in advance!
Use the ORC format instead of the parquet format for this volume.
I assume the data is partitioned, so I think it's a good idea to extract in parallel, taking advantage of the data partitioning (see the sketch below).
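A hedged sketch of such a parallel JDBC extract, assuming a roughly evenly distributed numeric column id to split on; the connection details, bounds, and column names are illustrative:

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "public.big_table")
  .option("user", "etl")
  .option("password", sys.env("PG_PASSWORD"))
  .option("partitionColumn", "id")  // Spark runs one extract query per id range
  .option("lowerBound", "1")
  .option("upperBound", "1000000000")
  .option("numPartitions", "200")   // 200 concurrent extract queries
  .load()

df.write
  .mode("overwrite")
  .partitionBy("created_date")      // hypothetical date column; yields many files, not one 10 TB file
  .orc("s3a://some-bucket/big_table_orc/") // or .parquet(...) if parquet is preferred

Writing many moderately sized, partitioned files is generally preferable to a single 10 TB file.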

How to persist data with Spark/Scala?

I'm performing a batch process using Spark with Scala.
Each day, I need to import a sales file into a Spark dataframe and perform some transformations (a file with the same schema; only the date and the sales values may change).
At the end of the week, I need to use all the daily transformations to perform weekly aggregations. Consequently, I need to persist the daily transformations so that I don't let Spark do everything at the end of the week (I want to avoid importing all the data and performing all the transformations at the end of the week).
I would also like a solution that supports incremental updates (upserts).
I went through some options like Dataframe.persist(StorageLevel.DISK_ONLY). I would like to know if there are better options, like maybe using Hive tables?
What are your suggestions on that ?
What are the advantages of using Hive tables over Dataframe.persist ?
Many thanks in advance.
You can save the results of your daily transformations in parquet (or ORC) format, partitioned by day. Then you can run your weekly process on this parquet data with a query that filters only the data for the last week. Predicate pushdown and partitioning work pretty efficiently in Spark, loading only the data selected by the filter for further processing.
dataframe
  .write
  .mode(SaveMode.Append)
  .partitionBy("day") // assuming you have a day column in your DF
  .parquet(parquetFilePath)
The SaveMode.Append option allows you to incrementally add data to the parquet files (vs. overwriting them with SaveMode.Overwrite).
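And a minimal sketch of the weekly read, assuming the same parquetFilePath, a day partition column, and illustrative aggregation columns:

import org.apache.spark.sql.functions.{col, current_date, date_sub, sum}

val weekly = spark.read
  .parquet(parquetFilePath)
  .filter(col("day") >= date_sub(current_date(), 7)) // partition pruning: only last week's folders are scanned
  .groupBy("product_id")                             // hypothetical grouping column
  .agg(sum("sales").as("weekly_sales"))              // hypothetical measure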

Spark Streaming dropDuplicates

Spark 2.1.1 (Scala API), streaming JSON files from an S3 location.
I want to deduplicate any incoming records based on an ID column (“event_id”) found in the json for every record. I do not care which record is kept, even if duplication of the record is only partial. I am using append mode as the data is merely being enriched/filtered, with no group by/window aggregations, via the spark.sql() method. I then use the append mode to write parquet files to s3.
According to the documentation, I should be able to use dropDuplicates without watermarking in order to deduplicate (obviously this is not effective in long-running production). However, this fails with the error:
User class threw exception: org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets
That error seems odd as I am doing no aggregation (unless dropDuplicates or sparkSQL counts as an aggregation?).
I know that duplicates won't occur outside 3 days of each other, so I then tried it again by adding a watermark (using .withWatermark() immediately before the dropDuplicates). However, it seems to want to wait until 3 days are up before writing the data (i.e. since today is July 24, only data up to the same time on July 21 is written to the output).
As there is no aggregation, I want to write every row immediately after the batch is processed, and simply throw away any rows with an event id that has occurred in the previous 3 days. Is there a simple way to accomplish this?
Thanks
In my case, I used to achieve this in two ways with DStream:
One way (a sketch follows these steps):
load tmp_data (contains the last 3 days of unique data, see below)
receive batch_data and do a leftOuterJoin with tmp_data
filter the result of step 2 and output the new unique data
update tmp_data with the new unique data from step 2's result and drop old data (more than 3 days)
save tmp_data on HDFS or wherever
repeat the above again and again
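A hedged sketch of that first approach; the Event case class, the HDFS paths, and the 3-day retention handling are illustrative, not from the original answer:

case class Event(eventId: String, eventTs: Long, payload: String)

eventStream.foreachRDD { batchRdd =>          // eventStream: DStream[Event] (assumed)
  val sc = batchRdd.sparkContext

  // step 1: load tmp_data = (event_id, last_seen_ts) covering the last 3 days
  val tmpData = sc.objectFile[(String, Long)]("hdfs:///dedup/tmp_data")

  // steps 2-3: leftOuterJoin against tmp_data and keep only ids not seen before
  val newUnique = batchRdd
    .map(e => (e.eventId, e))
    .leftOuterJoin(tmpData)
    .filter { case (_, (_, existing)) => existing.isEmpty }
    .mapValues { case (event, _) => event }

  // ... enrich/filter newUnique.values and write it to the output sink here ...

  // steps 4-5: merge new ids into tmp_data, drop entries older than 3 days, persist
  val cutoff  = System.currentTimeMillis() - 3L * 24 * 60 * 60 * 1000
  val updated = (tmpData ++ newUnique.mapValues(_.eventTs)).filter { case (_, ts) => ts >= cutoff }
  updated.saveAsObjectFile("hdfs:///dedup/tmp_data_" + System.currentTimeMillis())
}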
Another way:
create a table in MySQL and set a UNIQUE INDEX on event_id
receive batch_data and just save event_id + event_time + whatever to MySQL
MySQL will ignore duplicates automatically
The solution we used was a custom implementation of org.apache.spark.sql.execution.streaming.Sink that inserts into a Hive table after dropping duplicates within the batch and performing a left anti join against the previous few days' worth of data in the target Hive table.
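A hedged sketch of the dedup logic such a sink could run per batch; the table name, column names, and the 3-day window are illustrative:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, current_date, date_sub}

def appendDeduped(spark: SparkSession, batch: DataFrame): Unit = {
  // 1. drop duplicates within the incoming micro-batch
  val inBatch = batch.dropDuplicates("event_id")

  // 2. left anti join against the last few days of the target Hive table
  val recentIds = spark.table("events_enriched")
    .filter(col("event_date") >= date_sub(current_date(), 3))
    .select("event_id")

  val newRows = inBatch.join(recentIds, Seq("event_id"), "left_anti")

  // 3. append only the genuinely new rows
  newRows.write.mode("append").insertInto("events_enriched")
}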