Secondary sorting by using join in Spark? - scala

In Spark, I want to sort an RDD by two different fields. For example, given the data below, I want to sort the elements by fieldA first and, within each fieldA, by fieldB (secondary sort). Is the method employed in the example good enough? I have tested my code and it works, but is this a reliable way of doing it?
// x is of type (key, fieldA) and y of type (key, fieldB)
val a = x.sortBy(_._2)
// b will be of type (key, (fieldB, fieldA))
val b = y.join(x).sortBy(_._2._1)
So I want an output that looks like the following, for example.
fieldA, fieldB
2, 10
2, 11
2, 13
7, 5
7, 7
7, 8
9, 3
9, 10
9, 10

But is this a reliable way of doing it?
It is not reliable. It depends on the assumption that, during the shuffle, data is processed in the order defined by the order of the partitions. This may happen, but there is no guarantee it will.
In other words, shuffle-based sorting is not stable. There are methods that can achieve the desired result without performing a full shuffle twice, but they are quite low level and, for optimal performance, require a custom Partitioner (see the sketch below).
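For illustration, here is a minimal sketch of that low-level approach (not the code from the question): a composite key (fieldA, fieldB), a hypothetical custom Partitioner that partitions on fieldA only, and repartitionAndSortWithinPartitions, which sorts each partition by the full composite key in a single shuffle.
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Hypothetical partitioner: route records by fieldA only, so all rows sharing
// a fieldA value land in the same partition.
class FieldAPartitioner(val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case (fieldA: Int, _) => ((fieldA.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}

// joined is assumed to have the shape (key, (fieldB, fieldA)), as in the question.
def secondarySort(joined: RDD[(String, (Int, Int))], partitions: Int): RDD[(Int, Int)] =
  joined
    .map { case (_, (fieldB, fieldA)) => ((fieldA, fieldB), ()) }   // composite key
    .repartitionAndSortWithinPartitions(new FieldAPartitioner(partitions))
    .keys                                                           // back to (fieldA, fieldB)
Note that this gives data sorted within each partition, not a total order across partitions; for the latter you would need a range partitioner on fieldA.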

You can use sortBy with a composite key in the following way:
y.join(x).sortBy(r => (r._2._2, r._2._1))
Both sorts happen in a single pass.
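For example, a quick check in spark-shell (a minimal sketch; the sample values simply mirror the question):
// (key, fieldA) and (key, fieldB), as in the question
val x = sc.parallelize(Seq(("k1", 2), ("k2", 7), ("k3", 9), ("k4", 2)))
val y = sc.parallelize(Seq(("k1", 13), ("k2", 5), ("k3", 3), ("k4", 10)))

// join yields (key, (fieldB, fieldA)); sort by fieldA first, then fieldB
val b = y.join(x).sortBy { case (_, (fieldB, fieldA)) => (fieldA, fieldB) }
b.map { case (_, (fieldB, fieldA)) => (fieldA, fieldB) }.collect.foreach(println)
// (2,10)
// (2,13)
// (7,5)
// (9,3)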

Related

Can't write ordered data to parquet in spark

I am working with Apache Spark to generate parquet files. I can partition them by date with no problems, but internally I cannot seem to lay out the data in the correct order.
The order seems to get lost during processing, which means the parquet metadata is not right (specifically, I want the parquet row groups to reflect sorted order so that queries specific to my use case can filter efficiently via the metadata).
Consider the following example:
// note: hbase source is a registered temp table generated from hbase
val transformed = sqlContext.sql(s"SELECT id, sampleTime, ... , toDate(sampleTime) as date FROM hbaseSource")
// Repartition the input set by the date column (in my source there should be 2 distinct dates)
val sorted = transformed.repartition($"date").sortWithinPartitions("id", "sampleTime")
sorted.coalesce(1).write.partitionBy("date").parquet(s"/outputFiles")
With this approach, I do get the right parquet partition structure (by date). And even better, for each date partition, I see a single large parquet file.
/outputFiles/date=2018-01-01/part-00000-4f14286c-6e2c-464a-bd96-612178868263.snappy.parquet
However, when I query the file I see the contents out of order. To be specific, "out of order" seems more like several ordered data-frame partitions have been merged into the file.
The parquet row group metadata shows that the sorted fields are actually overlapping (a specific id, for example, could be located in many row groups):
id: :[min: 54, max: 65012, num_nulls: 0]
sampleTime: :[min: 1514764810000000, max: 1514851190000000, num_nulls: 0]
id: :[min: 827, max: 65470, num_nulls: 0]
sampleTime: :[min: 1514764810000000, max: 1514851190000000, num_nulls: 0]
id: :[min: 1629, max: 61412, num_nulls: 0]
I want the data to be properly ordered inside each file so the metadata min/max in each row group are non-overlapping.
For example, this is the pattern I want to see:
RG 0: id: :[min: 54, max: 100, num_nulls: 0]
RG 1: id: :[min: 100, max: 200, num_nulls: 0]
... where RG = "row group". If I wanted id = 75, the query could find it in one row group.
I have tried many variations of the above code, for example with and without coalesce (I know coalesce is bad, but my idea was to use it to prevent shuffling), and with sort instead of sortWithinPartitions (sort should create a totally ordered sort, but results in many partitions). For example:
val sorted = transformed.repartition($"date").sort("id", "sampleTime")
sorted.write.partitionBy("date").parquet(s"/outputFiles")
This gives me 200 files, which is too many, and they are still not sorted correctly. I can reduce the file count by adjusting the shuffle size, but I would have expected the sort order to be preserved during the write (I was under the impression that the write does not shuffle its input). The order I see is as follows (other fields omitted for brevity):
+----------+----------------+
|        id|      sampleTime|
+----------+----------------+
|     56868|1514840220000000|
|     57834|1514785180000000|
|     56868|1514840220000000|
|     57834|1514785180000000|
|     56868|1514840220000000|
This looks like interleaved sorted partitions. So I think repartition buys me nothing here, and sort seems incapable of preserving order through the write step.
I've read that what I want to do should be possible. I've even tried the approach outlined in the presentation "Parquet performance tuning:
The missing guide" by Ryan Blue (unfortunately it is behind the O'Reilly paywall). That involves using insertInto. In that case, Spark seemed to use an old version of parquet-mr which corrupted the metadata, and I am not sure how to upgrade it.
I am not sure what I am doing wrong. My feeling is that I am misunderstanding the way repartition($"date") and sort work and/or interact.
I would appreciate any ideas. Apologies for the essay. :)
Edit:
Also note that if I do a show(n) on transformed.sort("id", "sampleTime") the data is sorted correctly. So it seems like the problem occurs during the write stage. As noted above, it does seem like the output of the sort is shuffled during the write.
The problem is that when saving to a file format, Spark requires a certain ordering, and if that requirement is not satisfied, Spark will sort the data during the save according to the requirement and will forget your sort. To be more specific, Spark requires this ordering (taken directly from the Spark source code, version 2.4.4):
val requiredOrdering = partitionColumns ++ bucketIdExpression ++ sortColumns
where partitionColumns are the columns by which you partition the data. You are not using bucketing, so bucketIdExpression and sortColumns are not relevant in this example, and the requiredOrdering will be only the partitionColumns. So if this is your code:
val sorted = transformed.repartition($"date").sortWithinPartitions("id", "sampleTime")
sorted.write.partitionBy("date").parquet(s"/outputFiles")
Spark will check whether the data is sorted by date, which it is not, so Spark will forget your sort and will re-sort it by date. On the other hand, if you instead do it like this:
val sorted = transformed.repartition($"date").sortWithinPartitions("date", "id", "sampleTime")
sorted.write.partitionBy("date").parquet(s"/outputFiles")
Spark will check again whether the data is sorted by date, and this time it is (the requirement is satisfied), so Spark will preserve this order and will add no more sorts while saving the data. So I believe it should work this way.
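If you prefer a global sort over sortWithinPartitions, the same reasoning should apply as long as the partition column leads the sort keys; this is a hedged sketch based on the requirement above, not something taken from the answer:
// The required ordering for the write is just the partition columns ("date" here);
// a sort by ("date", "id", "sampleTime") already satisfies that prefix, so the sort
// should survive the save. A global sort still writes one file per shuffle partition,
// which matches the 200 files observed in the question.
val globallySorted = transformed.sort($"date", $"id", $"sampleTime")
globallySorted.write.partitionBy("date").parquet("/outputFiles")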
Just an idea: sort after the coalesce, i.e. ".coalesce(1).sortWithinPartitions()". Also, the expected result looks strange: why is ordered data in Parquet required? Sorting after reading looks more appropriate.

Spark RDD aggregate/fold operation business scenario [duplicate]

This question already has an answer here:
Explanation of fold method of spark RDD
(1 answer)
Closed 4 years ago.
[Edit] Actually, my question is about the business scenario/requirement for the Spark RDD aggregate operation, especially with zeroValue and RDD partitions, not about how it works in Spark. Sorry for the confusion.
I have been learning the various Spark RDD computations. When looking into aggregate/fold, I cannot come up with a business scenario for them.
For example, I am going to calculate the sum of the values in an RDD with fold.
val myrdd1 = sc.parallelize(1 to 10, 2)
myrdd1.fold(1)((x,y) => x + y)
It returns 58.
If we change the number of partitions from 2 to 4, it returns 60, but I expected 55.
I understand that if I do not give a partition number when creating myrdd1, Spark takes a default partition count, which is not known in advance, so the return value will be "unstable".
So I do not know why Spark has this kind of logic. Is there a real business scenario with this kind of requirement?
fold aggregates the data per partition, starting with the zero value from the first parameter list. The per-partition results are combined at the end, again together with the zero value.
Thus, for 2 partitions you correctly received 58:
(1+1+2+3+4+5)+(1+6+7+8+9+10)+1
Likewise, for 4 partitions the correct result is 60:
(1+1+2+3)+(1+4+5+6)+(1+7+8)+(1+9+10)+1
For real-world scenarios, this kind of divide-and-conquer computation is useful wherever you have associative and commutative logic, i.e. when the order of execution doesn't matter, as in mathematical addition. With that, Spark only moves partial aggregation results across the network instead of, for instance, shuffling whole blocks.
You would get the 55 you expect if, instead of fold, you used treeReduce:
"treeReduce" should "compute sum of numbers" in {
val numbersRdd = sparkContext.parallelize(1 to 20, 10)
val sumComputation = (v1: Int, v2: Int) => v1 + v2
val treeSum = numbersRdd.treeReduce(sumComputation, 2)
treeSum shouldEqual(210)
val reducedSum = numbersRdd.reduce(sumComputation)
reducedSum shouldEqual(treeSum)
}
Some time ago I wrote a small post about tree aggregations in RDD: http://www.waitingforcode.com/apache-spark/tree-aggregations-spark/read
I think the result you are getting is as expected; I will try to explain how it works.
You have an `rdd` with 10 elements in two partitions:
val myrdd1 = sc.parallelize(1 to 10, 2)
Let's suppose the two partitions contain p1 = {1,2,3,4,5} and p2 = {6,7,8,9,10}.
Now, as per the documentation, fold operates on each partition.
You get (the default or zero value, which is 1 in your case) +1+2+3+4+5 = 16 and (1 as zero value) +6+7+8+9+10 = 41.
Finally, those results are folded together, again with 1 as the zero value: 1 + 16 + 41 = 58.
Similarly, if you have 4 partitions, fold operates on the four partitions with 1 as the zero value and then combines the four results with another fold, again with zero value 1, giving 60. This matches the documentation:
Aggregate the elements of each partition, and then the results for all
the partitions, using a given associative function and a neutral "zero
value". The function op(t1, t2) is allowed to modify t1 and return it
as its result value to avoid object allocation; however, it should not
modify t2.
This behaves somewhat differently from fold operations implemented for
non-distributed collections in functional languages like Scala. This
fold operation may be applied to partitions individually, and then
fold those results into the final result, rather than apply the fold
to each element sequentially in some defined ordering. For functions
that are not commutative, the result may differ from that of a fold
applied to a non-distributed collection.
For a sum, the zero value should be 0, which gives you the expected result of 55.
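For example, a quick check in spark-shell (a minimal sketch):
val myrdd1 = sc.parallelize(1 to 10, 2)

myrdd1.fold(0)(_ + _)              // 55, independent of the number of partitions
myrdd1.reduce(_ + _)               // 55 as well; no zero value involved
myrdd1.aggregate(0)(_ + _, _ + _)  // 55; separate seqOp and combOp, same neutral zero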
Hope this helps!

What is the fastest way to group values based on multiple key columns in RDD using Scala? [duplicate]

This question already has an answer here:
Spark groupByKey alternative
(1 answer)
Closed 5 years ago.
My data is a file containing over 2 million rows of employee records. Each row has 15 fields of employee features, including name, DOB, SSN, etc. Example:
ID|name|DOB|address|SSN|...
1|James Bond|10/01/1990|1000 Stanford Ave|123456789|...
2|Jason Bourne|05/17/1987|2000 Yale Rd|987654321|...
3|James Bond|10/01/1990|5000 Berkeley Dr|123456789|...
I need to group the data by a number of columns and aggregate the employee's ID (first column) with the same key. The number and name of the key columns are passed into the function as parameters.
For example, if the key columns include "name, DOB, SSN", the data will be grouped as
(James Bond, 10/01/1990, 123456789), List(1,3)
(Jason Bourne, 05/17/1987, 987654321), List(2)
And the final output is
List(1,3)
List(2)
I am new to Scala and Spark. To solve this problem, I read the data as an RDD and, based on my research on StackOverflow, tried groupBy, reduceByKey, and foldByKey to implement the function. Among them, I found groupBy the slowest and foldByKey the fastest. My implementation with foldByKey is:
val buckets = data
  .map(row => (idx.map(i => row(i)) -> (row(0) :: Nil)))
  .foldByKey(List[String]())((acc, e) => acc ::: e)
  .values
My question is: Is there faster implementation than mine using foldByKey on RDD?
Update: I've read posts on StackOverflow and understand that groupByKey may be very slow on a large dataset; that is why I avoided groupByKey and ended up with foldByKey. However, this is not the question I asked. I am looking for an even faster implementation, or the optimal implementation in terms of processing time on a fixed hardware setup. (Processing the 2 million records currently takes ~15 minutes.) I was told that converting the RDD to a DataFrame and calling groupBy can be faster.
Here are some details on each of these first, to understand how they work.
groupByKey runs slowly because all the key-value pairs are shuffled around. This is a lot of unnecessary data being transferred over the network.
reduceByKey works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.
combineByKey can be used when you are combining elements but your return type differs from your input value type.
foldByKey merges the values for each key using an associative function and a neutral "zero value".
So avoid groupByKey. Hoping this helps.
Cheers!
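As a follow-up to the DataFrame route mentioned in the question, here is a minimal sketch; the column names (name, DOB, SSN, ID), the delimiter options, and the input path are assumptions based on the example data, not tested against the real file:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder.appName("group-by-key-columns").getOrCreate()
import spark.implicits._

// Read the pipe-delimited file; assumes the first line is a header, as in the example.
val df = spark.read.option("sep", "|").option("header", "true").csv("/path/to/employees")

// Key columns are passed in as a parameter, as in the question.
val keyCols = Seq("name", "DOB", "SSN")

val grouped = df
  .groupBy(keyCols.map(df(_)): _*)
  .agg(collect_list($"ID").as("ids"))

grouped.select($"ids").show(false)
DataFrames benefit from Catalyst/Tungsten optimizations, so it is worth benchmarking this against the foldByKey version on your own data.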

Spark: How to combine 2 sorted RDDs so that order is preserved after union?

I have 2 sorted RDDs:
val rdd_a = some_pair_rdd.sortByKey().
  zipWithIndex.filter(f => f._2 < n).
  map(f => f._1)

val rdd_b = another_pair_rdd.sortByKey().
  zipWithIndex.filter(f => f._2 < n).
  map(f => f._1)
val all_rdd = rdd_a.union(rdd_b)
In all_rdd, I see that the order is not necessarily maintained as I'd imagined (all elements of rdd_a first, followed by all elements of rdd_b). Is my assumption about the contract of union incorrect, and if so, what should I use to append multiple sorted RDDs into a single RDD?
I'm fairly new to Spark so I could be wrong, but from what I understand, union is a narrow transformation. That is, each executor combines only its local partitions of RDD a with its local partitions of RDD b and then returns that to the driver.
As an example, let's say that you have 2 executors and 2 RDDS.
RDD_A = ["a","b","c","d","e","f"]
and
RDD_B = ["1","2","3","4","5","6"]
Let Executor 1 contain the first half of both RDD's and Executor 2 contain the second half of both RDD's. When they perform the union on their local blocks, it would look something like:
Union_executor1 = ["a","b","c","1","2","3"]
and
Union_executor2 = ["d","e","f","4","5","6"]
So when the executors pass their parts back to the driver you would have ["a","b","c","1","2","3","d","e","f","4","5","6"]
Again, I'm new to Spark and I could be wrong. I'm just sharing based on my understanding of how it works with RDD's. Hopefully we can both learn something from this.
You can't. Spark does not have a merge sort, because you can't make assumptions about the way the RDDs are actually stored on the nodes. If you want things in sorted order after you take the union, you need to sort again, as in the sketch below.
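A minimal sketch of that: make the "rdd_a first, then rdd_b" order explicit as a key and sort once after the union (rdd_a and rdd_b are the already-filtered RDDs from the question):
// Tag each element with (source, position) so the desired order becomes an explicit key.
val tagged_a = rdd_a.zipWithIndex.map { case (v, i) => ((0, i), v) }  // rdd_a first
val tagged_b = rdd_b.zipWithIndex.map { case (v, i) => ((1, i), v) }  // then rdd_b

// One full sort after the union; this does shuffle, there is no free merge.
val all_rdd = tagged_a.union(tagged_b).sortByKey().values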

Grouping by key with Apache Spark but want to apply concat between values instead of using an aggregate function

I'm learning Spark and want to perform the following task: I want to use group by, but the grouping shown below is different from a standard Spark aggregation. Any help will be appreciated.
I have an RDD[(String, String)] with data:
8 kshitij
8 vini
8 mohan
8 guru
5 aashish
5 aakash
5 ram
I want to convert it to an RDD[(String, Set[String])]:
8 Set[kshitij, vini, mohan, guru]
5 Set[aashish, aakash, ram]
As user52045 said in the comments, you can just use groupByKey, which results in an RDD[(String, Iterable[String])]. This comes from PairRDDFunctions, available through an implicit conversion for any RDD of Tuple2.
The only open question is whether you're OK with an Iterable, or whether it has to be a Set, which would require an additional mapValues step, or some customization through aggregateByKey (if you want it in one go). Both options are sketched below.
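A minimal sketch of both options (pairs stands in for the RDD[(String, String)] from the question):
val pairs = sc.parallelize(Seq(
  ("8", "kshitij"), ("8", "vini"), ("8", "mohan"), ("8", "guru"),
  ("5", "aashish"), ("5", "aakash"), ("5", "ram")))

// Option 1: groupByKey, then turn each Iterable into a Set.
val viaGroup = pairs.groupByKey().mapValues(_.toSet)

// Option 2: build the Sets in one pass with aggregateByKey.
val viaAggregate = pairs.aggregateByKey(Set.empty[String])(_ + _, _ ++ _)

viaAggregate.collect.foreach(println)
// e.g. (8,Set(kshitij, vini, mohan, guru))
//      (5,Set(aashish, aakash, ram))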