Spark Partition Dataset By Column Value - scala

(I am new to Spark) I need to store a large number of rows of data, and then handle updates to those data. We have unique IDs (DB PKs) for those rows, and we would like to shard the data set by uniqueID % numShards, to make equal sized, addressable partitions. Since the PKs (unique IDs) are present both in the data and in the update files, it will be easy to determine which partition will be updated. We intend to shard the data and the updates by the same criteria, and periodically rewrite "shard S + all updates accumulated for shard S => new shard S". (We know how to combine shard S + updates = new shard S.)
If this is our design, we need to (1) shard a DataFrame by one of its columns (say: column K) into |range(K)| partitions where it is guaranteed that all rows in a partition have the same value in column K and (2) be able to find the Parquet file that corresponds to column_K=k, knowing k = row.uniqueID % numShards.
Is this a good design, or does Spark offer something out of the box that makes our task much easier?
Which Spark class/method should we use for partitioning our data? We are looking at RangePartitioner, but the constructor is asking for the number of partitions. We want to specify "use column_K for partitioning, and make one partition for each distinct value k in range(K)", because we have already created column_K = uniqueID % numShards. Which partitioner is appropriate for splitting on the value of one column of a DataFrame? Do we need to create a custom partitioner, or use partitionBy, or repartitionByRange, or...?
This is what we have so far:
import org.apache.spark.sql.functions._
val df = spark.read
  .option("fetchsize", 1000)
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .jdbc(jdbc_url, "SCHEMA.TABLE_NAME", partitions, props)
  .withColumn("SHARD_ID", col("TABLE_PK") % 1024)
  .write
  .parquet("parquet/table_name")
Now we need to specify that this DataFrame should be partitioned by SHARD_ID before it is written out as Parquet files.

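This works:
import org.apache.spark.sql.functions.col

// The chain ends in a write, which returns Unit, so there is nothing useful to assign to a val.
spark.read
  .option("fetchsize", 1000)
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .jdbc(jdbc.getString("url"), "SCHEMA.TABLE_NAME", partitions, props)
  .withColumn("SHARD_ID", col("TABLE_PK") % 1024)
  .write
  .partitionBy("SHARD_ID") // writes one SHARD_ID=<k> subdirectory per distinct value
  .parquet("parquet/table_name")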
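For the second requirement in the question (finding the files for a given key), partitionBy lays the output out as one subdirectory per value, named column=value. A minimal sketch of the lookup follows. The names ShardPaths, shardId and shardDir are hypothetical, and it assumes that keys are non-negative and that SHARD_ID is written as a plain integer (if the key is read from the database as a decimal, the column may need a cast to an integer type first, otherwise the directory name can differ).

```scala
object ShardPaths {
  // Same remainder arithmetic as the SHARD_ID column above.
  // Assumption: keys are non-negative, so the result lies in [0, numShards).
  def shardId(uniqueId: Long, numShards: Int): Long = {
    require(uniqueId >= 0, "sketch assumes non-negative keys")
    uniqueId % numShards
  }

  // Directory that partitionBy("SHARD_ID") creates for this key's shard.
  def shardDir(baseDir: String, uniqueId: Long, numShards: Int): String =
    s"$baseDir/SHARD_ID=${shardId(uniqueId, numShards)}"
}
```

For example, ShardPaths.shardDir("parquet/table_name", 2053L, 1024) points at parquet/table_name/SHARD_ID=5, which is the directory to rewrite when an update for key 2053 arrives.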

Related

Spark Range partitioning for unbalanced dataframe

I have a DataFrame with the following schema:
provider_id: Int,
quadkey18: String,
data: Array[ComplexObject]
I need to save this DataFrame partitioned by provider_id and quadkey5.
In general, the data field is quite similar in size across all providers.
However, some providers (certain provider_id values) have data arrays 100x-1000x bigger than others.
I am trying to get a balanced dataset with the following code:
df
  .withColumn("qk_for_range",
    when($"provider" === high_freq_provider_id, substring($"quadkey18", 1, 14))
      otherwise substring($"quadkey18", 1, 10))
  .withColumn("quadkey5", substring($"quadkey18", 1, 5))
  .repartitionByRange(nrPartitions, $"provider", $"qk_for_range")
  .drop("qk_for_range")
  .write
  .partitionBy("provider", "quadkey5")
  .format("parquet")
  .option("compression", "gzip")
  .option("maxRecordsPerFile", (maxCountInPartition).toInt)
  .mode(SaveMode.Overwrite)
  .save(exportUrl)
However, I get really huge Parquet files per partition (~1 GB), when I want smaller ones (~200 MB).
I can decrease the "maxRecordsPerFile" option, but then I would get a lot of small files for all "light" providers (those with a small data array per record).
My question is: how do I break down the "fat" partitions?

Sort data within one output directory created by partitionBy

I have a big geospatial dataset partitioned (via partitionBy) by level-5 quadkey (qk5).
In each qk5-level directory there is about 1-50 GB of data, so it doesn't fit into one file. I want to benefit from filter pushdown in my geospatial queries, so I want the files within one qk5 partition to be sorted by a higher quadkey resolution (say, quadkey level 10).
Question: is there a way to sort data within a partitionBy directory?
For example:
qk5=00001/
  part1.parquet
  part2.parquet
  part3.parquet
  part4.parquet
...
qk5=33333/
  part10000.parquet
  part20000.parquet
  part30000.parquet
  part40000.parquet
I want the data across part1.parquet, part2.parquet, part3.parquet, and part4.parquet to be sorted by column 'qk10'.
Here is the current code, but it only sorts within one particular file (e.g. part1.parquet):
// Parquet save
preExportRdd.toDF
  .repartition(partitionsNumber, $"salt")
  .sortWithinPartitions($"qk10")
  .drop("salt")
  .write
  .partitionBy("qk")
  .format("parquet")
  .option("compression", "gzip")
  .mode(SaveMode.Append)
  .save(exportUrl)
The problem is that the DataFrame is not ordered globally by the qk field, so rows with the same qk value are spread across different Spark partitions.
During the write phase, because of partitionBy("qk"), the output written to one physical partition (folder) can come from several Spark partitions. Each file is sorted internally, but the files in a folder are not sorted relative to each other.
Try the following instead:
preExportRdd.toDF
  .repartitionByRange(partitionsNumber, $"qk", $"qk10", $"salt")
  .sortWithinPartitions($"qk10")
  .drop("salt")
  .write
  .partitionBy("qk")
  .format("parquet")
  .option("compression", "gzip")
  .mode(SaveMode.Append)
  .save(exportUrl)
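repartitionByRange splits the DataFrame into the requested number of partitions so that each partition holds a contiguous range of the provided columns. It does not order the rows inside a partition, which is why sortWithinPartitions is still needed.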
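To make the range-partitioning idea concrete, here is a small stand-alone sketch in plain Scala (no Spark). It is a simplification, not Spark's implementation: the names RangePartitionSketch, bounds and partitionOf are hypothetical, and the boundaries are taken from a fully sorted key list, whereas Spark estimates them from a sample.

```scala
object RangePartitionSketch {
  // Pick numPartitions - 1 upper bounds from the sorted keys.
  // Assumption: there are at least as many keys as partitions.
  def bounds(keys: Seq[String], numPartitions: Int): Vector[String] = {
    require(numPartitions >= 1 && keys.size >= numPartitions)
    val sorted = keys.sorted.toVector
    (1 until numPartitions)
      .map(i => sorted(i * sorted.size / numPartitions - 1))
      .toVector
  }

  // A key goes to the partition whose range contains it:
  // its index is the number of bounds strictly below the key.
  def partitionOf(key: String, bs: Vector[String]): Int =
    bs.count(_ < key)
}
```

With the keys 0312, 0001, 0230, 0120, 0011, 0333 and three partitions, the bounds are 0011 and 0230, so 0001 and 0011 land in partition 0, 0120 and 0230 in partition 1, and 0312 and 0333 in partition 2. Equal keys always share a partition, and the partitions cover non-overlapping ranges, which is the property the write step relies on.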

Write each partition data in a single file in S3

We have a use case where we want to partition the DataFrame by a column value and then write each partition to a single file. I did the following:
val df = spark.read.format("csv").load("hdfs:///tmp/PartitionKeyedDataset.csv")
df.repartition($"_c1")
df.rdd.saveAsTextFile("s3://dfdf/test1234")
When I run:
df.rdd.partitions.size
I get only 62 partitions. But the column has 10,214 distinct values (found by running df.select("_c1").distinct.count).
I can't use:
df.write.partitionBy("_c1").save("s3://dfdf/test123")
because this creates a folder per partition value in the destination. We don't want this; we want only the files to be written.
I made a silly mistake by not using a new variable, which is why I saw the same number of partitions. Below is the updated code:
val df = spark.read.format("csv").load("hdfs:///tmp/PartitionKeyedDataset.csv")
val df2 = df.repartition($"_c1") // repartition returns a new DataFrame; df itself is unchanged
df2.rdd.saveAsTextFile("s3://dfdf/test1234")
repartition creates only 200 partitions by default, because the default value of spark.sql.shuffle.partitions is 200. I set this value to the number of unique values in the column I want to partition on.
spark.conf.set("spark.sql.shuffle.partitions", "10214")
After this, I got 10214 partitions, and the write operation created 10214 files in S3.
You need to assign the new DataFrame to a variable and use that instead. In your current code, the repartition call has no effect, because its result is discarded.
val df = spark.read.format("csv").load("hdfs:///tmp/PartitionKeyedDataset.csv")
val df2 = df.repartition($"_c1")
df2.rdd.saveAsTextFile("s3://dfdf/test1234")
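Although it is possible to change the spark.sql.shuffle.partitions setting, that is less flexible: it applies to every shuffle in the session, whereas passing the partition count to repartition affects only this step.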
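One caveat about expecting exactly one distinct value per file: repartitioning by a column assigns rows to partitions by hashing the value, so setting the partition count to the number of distinct values does not by itself guarantee a one-to-one mapping. The plain-Scala sketch below illustrates the effect. It is a stand-in, not Spark's exact hash function; HashBucketSketch and the generated keys are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

object HashBucketSketch {
  // Stand-in for hash partitioning: non-negative hash modulo the bucket count.
  def bucketOf(key: String, numBuckets: Int): Int =
    Math.floorMod(MurmurHash3.stringHash(key), numBuckets)

  // Number of buckets that receive at least one key.
  def nonEmptyBuckets(keys: Seq[String], numBuckets: Int): Int =
    keys.map(bucketOf(_, numBuckets)).distinct.size
}
```

Calling HashBucketSketch.nonEmptyBuckets with 10214 distinct keys and 10214 buckets returns fewer than 10214, because some keys collide while other buckets stay empty. In the Spark job this would show up as some output files holding several values and others being empty, so it is worth checking the written files if a strict one-value-per-file layout is required.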

merge multiple small files in to few larger files in Spark

I am using Hive through Spark. My Spark code runs an INSERT INTO query against a partitioned table. The input data is 200+ GB. When Spark writes to the partitioned table, it produces very small files (a few KB each), so the output table folder now holds 5000+ tiny files. I want to merge these into a few larger files of about 200 MB each. I tried the Hive merge settings, but they don't seem to work.
val result7A = hiveContext.sql("set hive.exec.dynamic.partition=true")
val result7B = hiveContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
val result7C = hiveContext.sql("SET hive.merge.size.per.task=256000000")
val result7D = hiveContext.sql("SET hive.merge.mapfiles=true")
val result7E = hiveContext.sql("SET hive.merge.mapredfiles=true")
val result7F = hiveContext.sql("SET hive.merge.sparkfiles = true")
val result7G = hiveContext.sql("set hive.aux.jars.path=c:\\Applications\\json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar")
val result8 = hiveContext.sql("INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table")
The above Hive settings work when Hive executes on MapReduce, where they produce files of the specified size. Is there an option to do this in Spark or Scala?
I had the same issue. The solution was to add a DISTRIBUTE BY clause with the partition columns. This ensures that the data for one partition goes to a single reducer. Example in your case:
INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table DISTRIBUTE BY date
You may want to try using the DataFrame.coalesce method; it returns a DataFrame with the specified number of partitions (each of which becomes a file on insertion). So using the number of records you are inserting and the typical size of each record, you can estimate how many partitions to coalesce to if you want files of ~200MB.
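The estimate described above can be sketched as a small helper. FileCountEstimate and targetPartitions are hypothetical names, and the record size is an assumption the caller has to supply (for example from a sample). Compression and the output format change the on-disk size, so treat the result as a starting point.

```scala
object FileCountEstimate {
  // Partitions needed so that each holds roughly targetFileBytes of input data.
  def targetPartitions(recordCount: Long, avgRecordBytes: Long, targetFileBytes: Long): Int = {
    require(recordCount >= 0 && avgRecordBytes > 0 && targetFileBytes > 0)
    val totalBytes = recordCount * avgRecordBytes
    // Ceiling division, with at least one partition.
    math.max(1L, (totalBytes + targetFileBytes - 1) / targetFileBytes).toInt
  }
}
```

For one billion records of about 200 bytes each and a 200 MB target, FileCountEstimate.targetPartitions(1000000000L, 200L, 200L * 1024 * 1024) gives 954, which is the value one would pass to coalesce.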
The dataframe repartition(1) method works in this case.

Distributed loading of a wide row into Spark from Cassandra

Let's assume we have a Cassandra cluster with RF = N and a table containing wide rows.
Our table could have an index something like this: pk / ck1 / ck2 / ....
If we create an RDD from a row in the table as follows:
val wide_row = sc.cassandraTable(KS, TABLE).select("c1", "c2").where("pk = ?", PK)
I notice that one Spark node has 100% of the data and the others have none. I assume this is because the spark-cassandra-connector has no way of breaking down the query token range into smaller sub ranges because it's actually not a range -- it's simply the hash of PK.
At this point we could simply call repartition(N) to spread the data across the Spark cluster before processing, but this has the effect of moving data across the network to nodes that already have the data locally in Cassandra (remember RF = N).
What we would really like is to have each Spark node load a subset (slice) of the row locally from Cassandra.
One approach which came to mind is to generate an RDD containing a list of distinct values of the first cluster key (ck1) when pk = PK. We could then use mapPartitions() to load a slice of the wide row based on each value of ck1.
Assuming we already have our list values for ck1, we could write something like this:
val ck1_list = .... // RDD
// create a partition for each value of ck1; repartition returns a new RDD, so keep the result
val ck1_parts = ck1_list.repartition(ck1_list.count().toInt)
val wide_row = ck1_parts.mapPartitions(f)
Within the partition iterator, f(), we would like to call another function g(pk, ck1) which loads the row slice from Cassandra for partition key pk and cluster key ck1. We could then apply flatMap to ck1_list so as to create a fully distributed RDD of the wide row without any shuffling.
So here's the question:
Is it possible to make a CQL call from within a Spark task? Which driver should be used? Can it be set up only once and reused for subsequent tasks?
Any help would be greatly appreciated, thanks.
For the sake of future reference, I will explain how I solved this.
I actually used a slightly different method to the one outlined above, one which does not involve calling Cassandra from inside Spark tasks.
I started off with ck_list, a list of distinct values for the first cluster key when pk = PK. The code is not shown here, but I actually downloaded this list directly from Cassandra in the Spark driver using CQL.
I then transform ck_list into a list of RDDs. Next we combine the RDDs (each one representing a Cassandra row slice) into one unified RDD (wide_row).
The cast on CassandraRDD is necessary because union returns type org.apache.spark.rdd.RDD.
After running the job I was able to verify that the wide_row had x partitions where x is the size of ck_list. A useful side effect is that wide_row is partitioned by the first cluster key, which is also the key I want to reduce by. Hence even more shuffling is avoided.
I don't know if this is the best way to achieve what I wanted, but it certainly works.
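import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

// ck_list: the first cluster key values where pk = PK (fetched in the driver; code not shown)
val wide_row = ck_list.map( ck =>
  sc.cassandraTable(KS, TBL)
    .select("c1", "c2").where("pk = ? and ck1 = ?", PK, ck)
    .asInstanceOf[RDD[CassandraRow]] // RDD needs its element type to compile
).reduce( (x, y) => x.union(y) )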