I have a lot of dataframes with customer data and the history for different legal entities, simplified:
val myData = spark.createDataFrame(Seq(
  (1, 1, "a lot of Data", "2010-01-01-10.00.00"),
  (1, 1, "a lot of Data", "2010-01-20-10.31.00"),
  (1, 1, "a lot of Data", "2019-06-16-12.00.00"),
  (2, 5, "a lot of Data", "2010-01-01-10.00.00"),
  (2, 6, "a lot of Data", "2010-01-01-10.00.00"),
  (3, 7, "a lot of Data", "2010-01-01-10.00.00")))
  .toDF("legalentity", "customernumber", "anydata", "changetimestamp")
These dataframes are stored as Parquet files and exposed as external Hive tables.
The change timestamp is transformed into a >valid from< / >valid to< range by views, like this:
CREATE VIEW myview AS
SELECT
  legalentity, customernumber, anydata,
  changetimestamp AS valid_from,
  COALESCE(LEAD(changetimestamp) OVER (PARTITION BY legalentity, customernumber ORDER BY changetimestamp ASC), '9999-12-31-00.00.00') AS valid_to
FROM mydata  -- base table name simplified
(This is simplified; some timestamp transformations are needed inside.)
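For reference, here is a rough DataFrame-API equivalent of that view, shown only as a sketch: the column and table names are taken from the example above and the real timestamp transformations are left out.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, lead, lit}

// Same valid_from / valid_to derivation as the view, sketched with the DataFrame API.
// The window mirrors the PARTITION BY / ORDER BY of the SQL version.
val w = Window
  .partitionBy("legalentity", "customernumber")
  .orderBy(col("changetimestamp").asc)

val validityDf = myData
  .withColumn("valid_to",
    coalesce(lead(col("changetimestamp"), 1).over(w), lit("9999-12-31-00.00.00")))
  .withColumnRenamed("changetimestamp", "valid_from")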
There are a lot of joins between the dataframes / hive tables later.
These dataframes are stored this way:
myDf
  .orderBy(col("legalentity"), col("customernumber"))
  .write
  .format("parquet")
  .mode(SaveMode.Append)
  .partitionBy("legalentity")
  .save(outputpath)
For legal reasons the data of different legal entities must be stored under different HDFS paths; this is done by the partitionBy clause, which creates a separate folder for each legal entity.
There are big legal entities with a huge number of customers and small ones with only a few.
The number of shuffle partitions is an average over all legal entities, which is fine.
Problems:
No more columns can be added to partition the dataframe:
If we try to speed everything up by adding a repartition with more columns than the partitionBy clause used for writing, like:
myDf
  .orderBy(col("legalentity"), col("customernumber"))
  .repartition(col("legalentity"), col("customernumber"))
  .write
  .format("parquet")
  .mode(SaveMode.Append)
  .partitionBy("legalentity")
  .save(outputpath)
then the number of shuffle partitions is used inside every legal entity folder, which results in partitions = >number of legal entities< * >number of shuffle partitions<.
Too many partitions:
There are small and big dataframes / tables, but all get the same number of shuffle partitions, so small dataframes end up with partition sizes of 3 MB or less.
If we use a different number of partitions for each table so that the file size gets close to 128 MB, everything is slowed down.
We get new data every day which we just append, but for that we don't use the number of shuffle partitions; we repartition(1).
Sometimes we have to re-load everything to compact all these partitions, but our processes are not slowed down by the new daily data.
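One possible direction, shown only as a sketch (estimatedSizeInBytes is a placeholder for whatever size estimate is available, e.g. the HDFS size of the previous load), is to derive the repartition count per table from the data size instead of relying on the global shuffle partition number:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Sketch: aim for roughly 128 MB output files by sizing the partition count per table.
// estimatedSizeInBytes is an assumed placeholder, not part of the original code.
val targetFileSizeBytes = 128L * 1024 * 1024
val numPartitions = math.max(1, (estimatedSizeInBytes / targetFileSizeBytes).toInt)

myDf
  .repartition(numPartitions, col("legalentity"), col("customernumber"))
  .write
  .format("parquet")
  .mode(SaveMode.Append)
  .partitionBy("legalentity")
  .save(outputpath)

On newer Spark versions the spark.sql.files.maxRecordsPerFile setting can also cap the number of records written per file without touching the partition count.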
Related
I have a dataframe with the following schema:
provider_id: Int,
quadkey18: String,
data: Array[ComplexObject]
I need to save this dataframe partitioned by provider_id and quadkey5.
In general the data field is quite similar in size across all providers.
However, some providers (certain provider_id values) have data arrays that are 100x-1000x bigger than others.
I am trying to get a balanced dataset with the following code:
df
  .withColumn("qk_for_range",
    when($"provider" === high_freq_provider_id, substring($"quadkey18", 1, 14))
      .otherwise(substring($"quadkey18", 1, 10)))
  .withColumn("quadkey5", substring($"quadkey18", 1, 5))
  .repartitionByRange(nrPartitions, $"provider", $"qk_for_range")
  .drop("qk_for_range")
  .write
  .partitionBy("provider", "quadkey5")
  .format("parquet")
  .option("compression", "gzip")
  .option("maxRecordsPerFile", maxCountInPartition.toInt)
  .mode(SaveMode.Overwrite)
  .save(exportUrl)
However, I get really huge partition Parquet files (~1 GB) when I want smaller ones (~200 MB).
I can decrease the "maxRecordsPerFile" option, but in that case I would get a lot of small files for all the "light" providers (those that have a small data array per record).
My question is: how do I break down the "fat" partitions?
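One possible direction, shown only as a sketch (heavyProviderIds and saltBuckets are illustrative assumptions, and the column is named provider_id as in the schema above): spread only the heavy providers over several Spark partitions with a random salt, so their output stays in smaller files while the light providers keep a single writer.

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, floor, lit, rand, substring, when}

// Sketch: salt only the heavy providers so they fan out over more Spark partitions.
val heavyProviderIds = Seq(42)   // placeholder: providers with 100x-1000x bigger data arrays
val saltBuckets = 16             // placeholder: how many Spark partitions a heavy provider spreads over

df
  .withColumn("quadkey5", substring(col("quadkey18"), 1, 5))
  .withColumn("salt",
    when(col("provider_id").isin(heavyProviderIds: _*), floor(rand() * saltBuckets))
      .otherwise(lit(0)))
  .repartition(col("provider_id"), col("quadkey5"), col("salt"))
  .drop("salt")
  .write
  .partitionBy("provider_id", "quadkey5")
  .format("parquet")
  .option("compression", "gzip")
  .mode(SaveMode.Overwrite)
  .save(exportUrl)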
I have a big geospatial dataset partitioned by quadkey level 5.
In each qk5 directory there are about 1-50 GB of data, so it doesn't fit into one file. I want to benefit from pushdown filters when doing my geospatial queries, so I want the files within one qk5 partition to be sorted by a higher qk resolution (let's say quadkey level 10).
Question: Is there a way to sort data within a partitionBy batch?
For example:
qk5=00001/
part1.parquet
part2.parquet
part3.parquet
part4.parquet
...
qk5=33333/
part10000.parquet
part20000.parquet
part30000.parquet
part40000.parquet
I want the data in part1.parquet, part2.parquet, part3.parquet and part4.parquet to be sorted by the column 'qk10'.
Here is the current code, but it only provides sorting within one particular file (e.g. part1.parquet):
// Parquet save
preExportRdd.toDF
.repartition(partitionsNumber, $"salt")
.sortWithinPartitions($"qk10")
.drop("salt")
.write
.partitionBy("qk")
.format("parquet")
.option("compression", "gzip")
.mode(SaveMode.Append)
.save(exportUrl)
The problem is that you don't sort your DataFrame globally by the qk field, which causes the same qk values to be distributed across different Spark partitions.
During the write phase, due to partitionBy("qk"), the output written to a specific physical partition (folder) may arrive from different Spark partitions, which causes your output data to be unsorted.
Try the following instead:
preExportRdd.toDF
.repartitionByRange(partitionsNumber, $"qk", $"qk10", $"salt")
.sortWithinPartitions($"qk10")
.drop("salt")
.write
.partitionBy("qk")
.format("parquet")
.option("compression", "gzip")
.mode(SaveMode.Append)
.save(exportUrl)
repartitionByRange will sort your DataFrame by the provided columns and split the sorted DataFrame into the desired number of partitions.
(I am new to Spark) I need to store a large number of rows of data, and then handle updates to those data. We have unique IDs (DB PKs) for those rows, and we would like to shard the data set by uniqueID % numShards, to make equal sized, addressable partitions. Since the PKs (unique IDs) are present both in the data and in the update files, it will be easy to determine which partition will be updated. We intend to shard the data and the updates by the same criteria, and periodically rewrite "shard S + all updates accumulated for shard S => new shard S". (We know how to combine shard S + updates = new shard S.)
If this is our design, we need to (1) shard a DataFrame by one of its columns (say: column K) into |range(K)| partitions where it is guaranteed that all rows in a partition have the same value in column K and (2) be able to find the Parquet file that corresponds to column_K=k, knowing k = row.uniqueID % numShards.
Is this a good design, or does Spark offer something out of the box that makes our task much easier?
Which Spark class/method should we use for partitioning our data? We are looking at RangePartitioner, but the constructor is asking for the number of partitions. We want to specify "use column_K for partitioning, and make one partition for each distinct value k in range(K)", because we have already created column_K = uniqueID % numShards. Which partitioner is appropriate for splitting on the value of one column of a DataFrame? Do we need to create a custom partitioner, or use partitionBy, or repartitionByRange, or...?
This is what we have so far:
import org.apache.spark.sql.functions._
spark.read
  .option("fetchsize", 1000)
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .jdbc(jdbc_url, "SCHEMA.TABLE_NAME", partitions, props)
  .withColumn("SHARD_ID", col("TABLE_PK") % 1024)
  .write
  .parquet("parquet/table_name")
Now we need to specify that this DataFrame should be partitioned by SHARD_ID before it is written out as Parquet files.
This works:
spark.read
  .option("fetchsize", 1000)
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .jdbc(jdbc.getString("url"), "SCHEMA.TABLE_NAME", partitions, props)
  .withColumn("SHARD_ID", col("TABLE_PK") % 1024)
  .write
  .partitionBy("SHARD_ID")
  .parquet("parquet/table_name")
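For requirement (2), finding the data for a given k, the SHARD_ID=<k> directories written by partitionBy can be read back with a filter on SHARD_ID, which Spark should turn into partition pruning. A minimal sketch (someUniqueId is a placeholder):

import org.apache.spark.sql.functions.col

// Sketch: read only the shard for one key; the filter on the partition column
// prunes down to the matching SHARD_ID=<k> directory instead of scanning everything.
val k = someUniqueId % 1024   // same sharding rule as at write time

val shard = spark.read
  .parquet("parquet/table_name")
  .filter(col("SHARD_ID") === k)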
I am running HDFS and Spark locally and trying to understand how Spark persistence works. My objective is to store a joined dataset in memory and then run queries against it on the fly. However, my queries seem to be redoing the join rather than simply scanning through the persisted pre-joined dataset.
I have created and persisted two dataframes, let's say df1 and df2, by loading in two CSV files from HDFS. I persist a join of the two dataframes in memory:
val result = df1.join(df2, "USERNAME")
result.persist()
result.count()
I then define some operations on top of result:
val result2 = result.select("FOO", "BAR").groupBy("FOO").sum("BAR")
result2.show()
'result2' does not piggyback on the persisted result and redoes the join on its own. Here are the physical plans for result and result2:
== Physical Plan for result ==
InMemoryColumnarTableScan [...], (InMemoryRelation [...], true, 10000, StorageLevel(true, true, false, true, 1), (TungstenProject [...]), None)
== Physical Plan for result2 ==
TungstenAggregate(key=[FOO#2], functions=[(sum(cast(BAR#10 as double)),mode=Final,isDistinct=false)], output=[FOO#2,sum(BAR)#837])
TungstenExchange hashpartitioning(FOO#2)
TungstenAggregate(key=[FOO#2], functions=[(sum(cast(BAR#10 as double)),mode=Partial,isDistinct=false)], output=[FOO#2,currentSum#1311])
InMemoryColumnarTableScan [FOO#2,BAR#10], (InMemoryRelation [...], true, 10000, StorageLevel(true, true, false, true, 1), (TungstenProject [...]), None)
I would naively assume that since the join is already done and partitioned in memory, the second operation would simply consist of aggregation operations on each partition. It should be more expensive to redo the join from scratch. Am I assuming incorrectly or doing something wrong? Also, is this the right pattern for retaining a joined dataset for later querying?
Edit: For the record, the second query became a lot more performant after I turned down the number of shuffle partitions. By default, spark.sql.shuffle.partitions is set to 200. Simply setting it to one on my local instance considerably improved performance.
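For reference, that setting can be changed per session; a minimal sketch (the value 1 only makes sense for a small local run):

// Lower the shuffle partition count for a small local dataset.
// Spark 1.x: via the SQL context
sqlContext.setConf("spark.sql.shuffle.partitions", "1")
// Spark 2.x+: via the session config
// spark.conf.set("spark.sql.shuffle.partitions", "1")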
If we look at the plan, we'll see that Spark actually is making use of the cached data and not redoing the join. Starting from the bottom up:
This is Spark reading the data from your cache:
InMemoryColumnarTableScan [FOO#2,BAR#10], (InMemoryRelation ...
This is Spark aggregating BAR by FOO in each partition - look for mode=Partial:
TungstenAggregate(key=[FOO#2], functions=[(sum(cast(BAR#10 as double)),mode=Partial ...
This is Spark shuffling the data from each partition of the previous step:
TungstenExchange hashpartitioning(FOO#2)
This is Spark aggregating the shuffled partition sums - look for mode=Final:
TungstenAggregate(key=[FOO#2], functions=[(sum(cast(BAR#10 as double)),mode=Final ...
Reading these plans is a bit of a pain so if you have access to the SQL tab of the Spark UI (I think 1.5+), I'd recommend using that instead.
I am using Hive through Spark. I have an INSERT INTO partitioned table query in my Spark code. The input data is 200+ GB. When Spark writes to a partitioned table, it spits out very small files (files in KBs), so the output partitioned table folder now has 5000+ small KB-sized files. I want to merge these into a few large MB-sized files, maybe about 200 MB each. I tried using the Hive merge settings, but they don't seem to work.
val result7A = hiveContext.sql("set hive.exec.dynamic.partition=true")
val result7B = hiveContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
val result7C = hiveContext.sql("SET hive.merge.size.per.task=256000000")
val result7D = hiveContext.sql("SET hive.merge.mapfiles=true")
val result7E = hiveContext.sql("SET hive.merge.mapredfiles=true")
val result7F = hiveContext.sql("SET hive.merge.sparkfiles = true")
val result7G = hiveContext.sql("set hive.aux.jars.path=c:\\Applications\\json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar")
val result8 = hiveContext.sql("INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table")
The above Hive settings work in a MapReduce Hive execution and produce files of the specified size. Is there any option to do this in Spark or Scala?
I had the same issue. The solution was to add a DISTRIBUTE BY clause with the partition columns. This ensures that the data for one partition goes to a single reducer. Example in your case:
INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table DISTRIBUTE BY date
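Since the question also asks for a Spark/Scala option: the DataFrame equivalent of DISTRIBUTE BY is a repartition on the partition column before the insert. A rough sketch under the same table and column names (assumes a Spark version where repartition by column is available, 1.6+; note that insertInto expects the partition column last):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Roughly the same effect as DISTRIBUTE BY date: all rows for one date land in the
// same Spark partition, so each Hive partition is written by a single task
// (fewer, larger files).
hiveContext.table("partition_json_table")
  .select("a", "b", "c", "date")
  .repartition(col("date"))
  .write
  .mode(SaveMode.Append)
  .insertInto("partition_table")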
You may want to try using the DataFrame.coalesce method; it returns a DataFrame with the specified number of partitions (each of which becomes a file on insertion). Using the number of records you are inserting and the typical size of each record, you can estimate how many partitions to coalesce to if you want files of ~200 MB.
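A minimal sketch of that estimate, with placeholder numbers and a placeholder select for the rows being inserted:

import org.apache.spark.sql.SaveMode

// Sketch: estimate how many ~200 MB files are needed, then coalesce to that count.
val toInsert = hiveContext.sql("select a, b, c, date from partition_json_table")  // placeholder

val targetFileBytes = 200L * 1024 * 1024
val approxRecordBytes = 512L   // placeholder: typical serialized record size
val numFiles = math.max(1, ((toInsert.count() * approxRecordBytes) / targetFileBytes).toInt)

toInsert.coalesce(numFiles)
  .write
  .mode(SaveMode.Append)
  .insertInto("partition_table")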
The dataframe repartition(1) method works in this case.