Spark range partitioning for an unbalanced dataframe - Scala

I have a dataframe with the following schema:
provider_id: Int,
quadkey18: String,
data: Array[ComplexObject]
I need to save this dataframe partitioned by provider_id and quadkey5.
In general, the data field is quite similar in size across all providers.
However, some providers (certain provider_id values) have a data array 100x-1000x bigger than the others.
I am trying to get a balanced dataset with the following code:
df
  .withColumn("qk_for_range",
    when($"provider" === high_freq_provider_id, substring($"quadkey18", 1, 14))
      .otherwise(substring($"quadkey18", 1, 10)))
  .withColumn("quadkey5", substring($"quadkey18", 1, 5))
  .repartitionByRange(nrPartitions, $"provider", $"qk_for_range")
  .drop("qk_for_range")
  .write
  .partitionBy("provider", "quadkey5")
  .format("parquet")
  .option("compression", "gzip")
  .option("maxRecordsPerFile", maxCountInPartition.toInt)
  .mode(SaveMode.Overwrite)
  .save(exportUrl)
However, I get really huge partition parquet files (~1 GB) when I want smaller files (~200 MB).
I can decrease the "maxRecordsPerFile" option, but in that case I would get a lot of small files for all "light" providers (those that have a small data array per record).
My question is: how do I break down the "fat" partitions?
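One possible way to break them down (a sketch, not a verified answer): keep the range repartition, but add a salt column whose range is widened only for the heavy provider, so its rows spread over more Spark partitions and therefore more output files. saltBuckets is an assumed tuning knob; the column names follow the code above.

import org.apache.spark.sql.functions._

// Sketch: spread only the heavy provider over extra range buckets.
val saltBuckets = 50  // assumed value; tune until the heavy provider's files land near ~200 MB

df
  .withColumn("quadkey5", substring($"quadkey18", 1, 5))
  .withColumn("salt",
    when($"provider" === high_freq_provider_id, (rand() * saltBuckets).cast("int"))
      .otherwise(lit(0)))
  .repartitionByRange(nrPartitions, $"provider", $"quadkey5", $"salt")
  .drop("salt")
  .write
  .partitionBy("provider", "quadkey5")
  .format("parquet")
  .option("compression", "gzip")
  .mode(SaveMode.Overwrite)
  .save(exportUrl)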

Related

Spark performance issues when iteratively adding multiple columns

I am seeing performance issues when iteratively adding columns (around 100) to a Dataframe.
I know that it is more efficient to use select to add multiple columns; however, I have to add the columns in order because column 2 may depend on column 1, and so on.
The columns are being added following a join which resulted in skew, so I have explicitly repartitioned by a salt key to distribute the data evenly across the cluster.
When I ran locally I was seeing OOM errors even for fairly small (100 row, 500 column) datasets.
I was able to get the job running locally by checkpointing after the addition of every x columns, so I suspect Spark lineage issues are causing my problems; however, I am still unable to run the job at scale on the cluster.
Any advice on where to look or on best practice in this scenario would be greatly appreciated.
At a high level my job looks like this:
val df1 = ??? // Millions of rows, ~500 cols, from parquet
val df2 = ??? // 1000 rows, from parquet
val newExpressions = ??? // 100 rows, from Oracle
val joined = df1.join(broadcast(df2), <join expr>)
val newColumns = newExpressions.collectAsList.map(<get columnExpr and columnName>)
val salted = joined.withColumn("salt", rand()).repartition(x, col("salt"))
newColumns.foldLeft(salted) {
  case (df, row) => df.withColumn(row.name, expr(row.expression))
} // Checkpointing after every x columns seems to help
Cheers
Terry
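Since the grouped checkpointing is what made the job run locally, here is a rough sketch of that workaround (assuming a checkpoint directory is configured; addColumnsWithCheckpoints, columns and groupSize are illustrative names, not from the original post):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

// Sketch: truncate the lineage every `groupSize` columns so the plan stays small.
// `columns` is assumed to be a Seq of (columnName, expressionString) pairs.
def addColumnsWithCheckpoints(start: DataFrame,
                              columns: Seq[(String, String)],
                              groupSize: Int = 10): DataFrame =
  columns.grouped(groupSize).foldLeft(start) { case (df, group) =>
    val withGroup = group.foldLeft(df) { case (d, (name, e)) =>
      d.withColumn(name, expr(e))
    }
    withGroup.checkpoint() // requires spark.sparkContext.setCheckpointDir(...) to be set
  }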

Sort data within one output directory created by partitionBy

I have a big geospatial dataset written with partitionBy on quadkey level 5.
In each qk5-level directory there is about 1-50 GB of data, so it doesn't fit into one file. I want to benefit from pushdown filters when doing my geospatial queries, so I want the files within one qk5 partition to be sorted by a higher quadkey resolution (say, quadkey level 10).
Question: Is there a way to sort data within a partitionBy batch?
For example:
qk5=00001/
part1.parquet
part2.parquet
part3.parquet
part4.parquet
...
qk5=33333/
part10000.parquet
part20000.parquet
part30000.parquet
part40000.parquet
I want the data in part1.parquet, part2.parquet, part3.parquet and part4.parquet to be sorted by column 'qk10'.
Here is the current code, but it only provides sorting within each individual Spark partition (and hence each file, e.g. part1.parquet):
// Parquet save
preExportRdd.toDF
  .repartition(partitionsNumber, $"salt")
  .sortWithinPartitions($"qk10")
  .drop("salt")
  .write
  .partitionBy("qk")
  .format("parquet")
  .option("compression", "gzip")
  .mode(SaveMode.Append)
  .save(exportUrl)
The problem is that you don't sort your Dataframe globally by the qk field, which causes the same qk values to be distributed across different Spark partitions.
During the write phase, due to partitionBy("qk"), the output written to a specific physical partition (folder) may arrive from different Spark partitions, which causes your output data to be unsorted.
Try the following instead:
preExportRdd.toDF
  .repartitionByRange(partitionsNumber, $"qk", $"qk10", $"salt")
  .sortWithinPartitions($"qk10")
  .drop("salt")
  .write
  .partitionBy("qk")
  .format("parquet")
  .option("compression", "gzip")
  .mode(SaveMode.Append)
  .save(exportUrl)
repartitionByRange will sort your Dataframe by the provided columns and split the sorted Dataframe into the desired number of partitions.

How to partition dataframes which are written with partitionBy and joined on other keys?

I have a lot of dataframes with customer data and the history for different legal entities, simplified:
val myData = spark.createDataFrame(Seq(
  (1, 1, "a lot of Data", "2010-01-01-10.00.00"),
  (1, 1, "a lot of Data", "2010-01-20-10.31.00"),
  (1, 1, "a lot of Data", "2019-06-16-12.00.00"),
  (2, 5, "a lot of Data", "2010-01-01-10.00.00"),
  (2, 6, "a lot of Data", "2010-01-01-10.00.00"),
  (3, 7, "a lot of Data", "2010-01-01-10.00.00")))
  .toDF("legalentity", "customernumber", "anydata", "changetimestamp")
These dataframes are stored as parquet files and exposed as external Hive tables.
The change timestamp is transformed into a >valid from< / >valid to< range by views, like this:
CREATE VIEW myview AS
SELECT
  legalentity, customernumber, anydata,
  changetimestamp AS valid_from,
  COALESCE(LEAD(changetimestamp) OVER (PARTITION BY legalentity, customernumber ORDER BY changetimestamp ASC), '9999-12-31-00.00.00') AS valid_to
(this is simplified; some additional timestamp transformations are needed inside)
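For reference, a rough DataFrame-API equivalent of that view (a sketch only; the extra timestamp transformations mentioned above are omitted):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, lead, lit}

// Sketch of the validity-interval view: derive valid_from/valid_to per customer.
val w = Window.partitionBy($"legalentity", $"customernumber").orderBy($"changetimestamp")

val myView = myData
  .withColumn("valid_from", $"changetimestamp")
  .withColumn("valid_to",
    coalesce(lead($"changetimestamp", 1).over(w), lit("9999-12-31-00.00.00")))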
There are a lot of joins between the dataframes / hive tables later.
These dataframes are stored this way:
myDf
  .orderBy(col("legalentity"), col("customernumber"))
  .write
  .format("parquet_format")
  .mode(SaveMode.Append)
  .partitionBy("legalentity")
  .save(outputpath)
For legal reasons, the data of different legal entities must be stored in different HDFS paths; that is done by the partitionBy clause, which creates a separate folder for each legal entity.
There are small and big legal entities, some with a huge number of customers and others with only a few.
The number of shuffle partitions is averaged over all legal entities, and that is fine.
Problems:
No additional columns can be used to partition the dataframe:
If we try to speed things up by adding a repartition with more columns than the partitionBy clause used for writing, like:
myDf
  .orderBy(col("legalentity"), col("customernumber"))
  .repartition(col("legalentity"), col("customernumber"))
  .write
  .format("parquet_format")
  .mode(SaveMode.Append)
  .partitionBy("legalentity")
  .save(outputpath)
then the number of shuffle partitions is applied within every legal entity folder.
That results in: partitions = >number of legal entities< * >number of shuffle partitions<
Too many partitions
There are small and big dataframes / tables. All get the same number of shuffle partitions, so small dataframes have partition sizes of 3 MB or less.
If we use a different number of partitions for each table so that the file size gets close to 128 MB, everything slows down.
We get new data every day, which we just append; for that we don't use the number of shuffle partitions, we use repartition(1).
Sometimes we have to reload everything to compact all these partitions, but our processes are not slowed down by the new daily data.
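One option for this kind of skew (a sketch under assumptions, not from the original post): repartition by the write-partition column plus a small, bounded bucket number, so each legalentity folder gets at most a fixed number of files regardless of spark.sql.shuffle.partitions. filesPerEntity is an assumed tuning knob.

import org.apache.spark.sql.functions.{col, hash, lit, pmod}

val filesPerEntity = 8  // assumed per-entity file count instead of inheriting shuffle partitions

myDf
  .withColumn("bucket", pmod(hash(col("customernumber")), lit(filesPerEntity)))
  .repartition(col("legalentity"), col("bucket"))   // each (legalentity, bucket) pair lands in one task
  .sortWithinPartitions(col("legalentity"), col("customernumber"))
  .drop("bucket")
  .write
  .format("parquet")
  .mode(SaveMode.Append)
  .partitionBy("legalentity")
  .save(outputpath)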

Spark Partition Dataset By Column Value

(I am new to Spark) I need to store a large number of rows of data, and then handle updates to those data. We have unique IDs (DB PKs) for those rows, and we would like to shard the data set by uniqueID % numShards, to make equal sized, addressable partitions. Since the PKs (unique IDs) are present both in the data and in the update files, it will be easy to determine which partition will be updated. We intend to shard the data and the updates by the same criteria, and periodically rewrite "shard S + all updates accumulated for shard S => new shard S". (We know how to combine shard S + updates = new shard S.)
If this is our design, we need to (1) shard a DataFrame by one of its columns (say: column K) into |range(K)| partitions where it is guaranteed that all rows in a partition have the same value in column K and (2) be able to find the Parquet file that corresponds to column_K=k, knowing k = row.uniqueID % numShards.
Is this a good design, or does Spark offer something out of the box that makes our task much easier?
Which Spark class/method should we use for partitioning our data? We are looking at RangePartitioner, but the constructor is asking for the number of partitions. We want to specify "use column_K for partitioning, and make one partition for each distinct value k in range(K)", because we have already created column_K = uniqueID % numShards. Which partitioner is appropriate for splitting on the value of one column of a DataFrame? Do we need to create a custom partitioner, or use partitionBy, or repartitionByRange, or...?
This is what we have so far:
import org.apache.spark.sql.functions._

val df = spark.read
  .option("fetchsize", 1000)
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .jdbc(jdbc_url, "SCHEMA.TABLE_NAME", partitions, props)
  .withColumn("SHARD_ID", col("TABLE_PK") % 1024)
  .write
  .parquet("parquet/table_name")
Now we need to specify that this DataFrame should be partitioned by SHARD_ID before it is written out as Parquet files.
This works:
val df = spark.read
  .option("fetchsize", 1000)
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .jdbc(jdbc.getString("url"), "SCHEMA.TABLE_NAME", partitions, props)
  .withColumn("SHARD_ID", col("TABLE_PK") % 1024)
  .write
  .partitionBy("SHARD_ID")
  .parquet("parquet/table_name")
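With that layout, each shard is addressable either by its SHARD_ID=<k> directory or by a filter that Spark prunes down to that directory. A small sketch (the key value is made up; the path and the 1024 modulus follow the code above):

import org.apache.spark.sql.functions.col

// Locate the shard for a given primary key and read only that folder.
val pk = 123456789L            // hypothetical key taken from an update record
val shardId = pk % 1024

val shardDf = spark.read
  .parquet("parquet/table_name")
  .filter(col("SHARD_ID") === shardId)  // partition pruning scans only SHARD_ID=<shardId>/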

Merge multiple small files into a few larger files in Spark

I am using Hive through Spark. I have an INSERT INTO partitioned table query in my Spark code. The input data is 200+ GB. When Spark writes to the partitioned table, it spits out very small files (files in KBs), so the output partitioned table folder now has 5000+ small KB files. I want to merge these into a few larger files of maybe about 200 MB each. I tried using the Hive merge settings, but they don't seem to work.
val result7A = hiveContext.sql("set hive.exec.dynamic.partition=true")
val result7B = hiveContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
val result7C = hiveContext.sql("SET hive.merge.size.per.task=256000000")
val result7D = hiveContext.sql("SET hive.merge.mapfiles=true")
val result7E = hiveContext.sql("SET hive.merge.mapredfiles=true")
val result7F = hiveContext.sql("SET hive.merge.sparkfiles = true")
val result7G = hiveContext.sql("set hive.aux.jars.path=c:\\Applications\\json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar")
val result8 = hiveContext.sql("INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table")
The above Hive settings work with a MapReduce Hive execution and produce files of the specified size. Is there any option to do this in Spark or Scala?
I had the same issue. The solution was to add a DISTRIBUTE BY clause with the partition columns. This ensures that the data for one partition goes to a single reducer. Example in your case:
INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table DISTRIBUTE BY date
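For reference, a DataFrame-API sketch of the same idea (assuming the tables above and dynamic partitioning enabled as in the question): repartitioning by the Hive partition column sends each date's rows to a single task before the insert, which is the same effect as DISTRIBUTE BY date.

import org.apache.spark.sql.functions.col

hiveContext.table("partition_json_table")
  .select(col("a"), col("b"), col("c"), col("date"))  // partition column last, matched by position
  .repartition(col("date"))                           // one task per date, like DISTRIBUTE BY date
  .write
  .insertInto("partition_table")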
You may want to try using the DataFrame.coalesce method; it returns a DataFrame with the specified number of partitions (each of which becomes a file on insertion). So using the number of records you are inserting and the typical size of each record, you can estimate how many partitions to coalesce to if you want files of ~200MB.
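A rough sketch of that estimate (the target file size and average record size are assumed numbers; `result` stands for the DataFrame being inserted):

// Estimate how many partitions to coalesce to for ~200 MB output files.
val targetFileBytes = 200L * 1024 * 1024   // ~200 MB per output file
val approxRowBytes  = 2L * 1024            // assumed average serialized record size
val rowCount        = result.count()
val numFiles = math.max(1L, rowCount * approxRowBytes / targetFileBytes).toInt

val compacted = result.coalesce(numFiles)  // then insert/write `compacted` as before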
The dataframe repartition(1) method works in this case.