I'm dealing with a dataset of timestamped records, and I need to group it by an ID and a time window. After grouping, I need to perform a custom aggregation that only makes sense if it is performed in order (specifically, ordered by the timestamps). So far, what I've done is simply call groupBy and agg with my UDAF:
ds.groupBy($"id", window($"timestamp", "1 day", "1 day", "8 hours") as "timeslot")
.agg(myUDAF($"someCol", $"someCol2") as "result")
But even if the original Dataset is in order, there is no guarantee that it will be processed in that same order after all parallelization takes place, right?
More specifically, when implementing a custom Aggregator (from org.apache.spark.sql.expressions), a reduce and a merge method must be defined, whose signatures are reduce(buffer: BUF, data: IN): BUF and merge(a: BUF, b: BUF): BUF. How can I guarantee that input data for reduce always follows an order, and merge buffers also respect that order (i.e., buffer a contains aggregated data from rows that come before the rows from buffer b)?
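One common workaround (a minimal sketch with hypothetical types, not the original myUDAF) is to make the buffer order-insensitive: accumulate (timestamp, value) pairs in reduce and merge, and sort only in finish, where the whole group is available:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical input type; the real columns (someCol, someCol2) would differ.
case class Event(timestamp: Long, value: Double)

object OrderedAgg extends Aggregator[Event, List[Event], Double] {
  def zero: List[Event] = Nil

  // Call order is not guaranteed, so reduce only accumulates.
  def reduce(buf: List[Event], e: Event): List[Event] = e :: buf

  // Merge order is not guaranteed either; concatenation is safe because we sort later.
  def merge(a: List[Event], b: List[Event]): List[Event] = a ::: b

  // The order-sensitive logic runs here, after sorting by timestamp.
  def finish(buf: List[Event]): Double =
    buf.sortBy(_.timestamp).map(_.value).sum   // placeholder for the real sequential computation

  def bufferEncoder: Encoder[List[Event]] = Encoders.kryo[List[Event]]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

The tradeoff is that each group is materialized and sorted in memory on a single executor; in Spark 3.0+ such an Aggregator can be registered for untyped use via org.apache.spark.sql.functions.udaf.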
Related
I have a Spark Dataset, and I would like to group the data and process the groups, yielding zero or one element for each group. Something like:
val resultDataset = inputDataset
.groupBy('x, 'y)
.flatMap(...)
I didn't find a way to apply a function after a groupBy, but it appears I can use groupByKey instead (is it a good idea? is there a better way?):
val resultDataset = inputDataset
.groupByKey(v => (v.x, v.y))
.flatMapGroups(...)
This works, but here is the thing: I would like to process the groups as Datasets. The reason is that I already have convenient functions that work on Datasets and would like to reuse them when calculating the result for each group. But groupByKey.flatMapGroups yields an Iterator over the grouped elements, not a Dataset.
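For reference, a minimal sketch of what the groupByKey route gives you (the Record type and toy data are hypothetical): flatMapGroups hands each group to you as a plain Iterator, so the per-group logic has to use ordinary Scala collections rather than the Dataset API:

import org.apache.spark.sql.SparkSession

// Hypothetical record type and toy data; the real schema would differ.
case class Record(x: Int, y: Int, payload: String)

val spark = SparkSession.builder().master("local[*]").appName("group-sketch").getOrCreate()
import spark.implicits._

val inputDataset = Seq(Record(1, 1, "a"), Record(1, 1, "b"), Record(2, 1, "c")).toDS()

val resultDataset = inputDataset
  .groupByKey(v => (v.x, v.y))
  .flatMapGroups { (key, rows) =>
    // `rows` is a plain Iterator[Record], not a Dataset, so the per-group
    // logic uses ordinary Scala collections.
    val group = rows.toList
    if (group.isEmpty) Iterator.empty
    else Iterator(s"${key._1},${key._2} -> ${group.size} rows")  // placeholder per-group result
  }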
The question: is there a way in Spark to group an input Dataset and map a custom function to each group, while treating the grouped elements as a Dataset? E.g.:
val inputDataset: Dataset[T] = ...
val resultDataset: Dataset[U] = inputDataset
.groupBy(...)
.flatMap((group: Dataset[T]) => {
// using Dataset API to calculate resulting value, e.g.:
group.withColumn(row_number().over(...))....as[U]
})
Note that the grouped data is bounded, and it is OK to process it on a single node. But the number of groups can be very high, so the resulting Dataset needs to be distributed. The point of using the Dataset API to process a group is purely a matter of having a convenient API.
What I tried so far:
creating a Dataset from an Iterator in the mapped function fails with an NPE from the SparkSession (my understanding is that it boils down to the fact that one cannot create a Dataset inside the functions that process a Dataset; see this and this)
to overcome the issue in the first attempt, I tried creating a new SparkSession and building the Dataset within that new session; this also fails with an NPE from SparkSession.newSession
(ab)using repartition('x, 'y).mapPartitions(...), but this also yields an Iterator[T] for each partition, not a Dataset[T]
finally, (ab)using filter: I can collect all distinct values of the grouping criteria into an Array (select.distinct.collect) and iterate over this array to filter the source Dataset, yielding one Dataset per group (similar to the multiplexing idea from this article); although this works, my understanding is that it collects all the data on a single node, so it doesn't scale and will eventually run into memory issues
This question already has an answer here: Explanation of fold method of spark RDD
[Edit] Actually, my question is about the business scenario/requirement for the Spark RDD aggregate operation, especially regarding zeroValue and RDD partitions, not about how it works in Spark. Sorry for the confusion.
I have been learning all kinds of Spark RDD computations. While looking into the RDD aggregate/fold operations, I cannot think of a business scenario for aggregate/fold.
For example, I am going to calculate the sum of the values in an RDD using fold.
val myrdd1 = sc.parallelize(1 to 10, 2)
myrdd1.fold(1)((x,y) => x + y)
It returns 58.
If we change the number of partitions from 2 to 4, it returns 60, but I expected 55.
I understand what Spark does if I do not give the number of partitions when creating myrdd1: it takes a default partition count, which is not known in advance, so the return value will be "unstable".
So I do not understand why Spark has this kind of logic. Is there a real business scenario with this kind of requirement?
fold aggregates the data per partition, starting with the zero value given in the first parameter list. The per-partition results are then combined with the zero value at the end.
Thus, for 2 partitions you correctly received 58:
(1+1+2+3+4+5)+(1+6+7+8+9+10)+1
Likewise, for 4 partitions the correct result is 60:
(1+1+2+3)+(1+4+5+6)+(1+7+8)+(1+9+10)+1
For real-world scenarios, this kind of divide-and-conquer computation is useful wherever your logic is commutative and associative, i.e. where the order in which the operations are executed doesn't matter, as with mathematical addition. Spark then only moves partial aggregation results across the network instead of, for instance, shuffling whole blocks.
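As a concrete illustration (a sketch, not from the original question), aggregate lets you compute a sum and a count in a single pass, so that only one small (sum, count) pair per partition crosses the network:

val nums = sc.parallelize(1 to 10, 4)

// zero value (0, 0) = (running sum, running count); both functions are
// commutative and associative, so partitioning does not affect the result.
val (sum, count) = nums.aggregate((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // fold a value into the partition-local (sum, count)
  (a, b)   => (a._1 + b._1, a._2 + b._2)  // combine (sum, count) pairs across partitions
)
val avg = sum.toDouble / count            // 5.5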
Your expectation of receiving 55 would be met if, instead of fold, you used treeReduce:
"treeReduce" should "compute sum of numbers" in {
val numbersRdd = sparkContext.parallelize(1 to 20, 10)
val sumComputation = (v1: Int, v2: Int) => v1 + v2
val treeSum = numbersRdd.treeReduce(sumComputation, 2)
treeSum shouldEqual(210)
val reducedSum = numbersRdd.reduce(sumComputation)
reducedSum shouldEqual(treeSum)
}
Some time ago I wrote a small post about tree aggregations in RDD: http://www.waitingforcode.com/apache-spark/tree-aggregations-spark/read
I think the result you are getting is as expected; I will try to explain how it works.
You have an `rdd` with 10 elements in two partitions:
val myrdd1 = sc.parallelize(1 to 10, 2)
Let's suppose the two partitions contain p1 = {1,2,3,4,5} and p2 = {6,7,8,9,10}.
Now, as per the documentation, fold operates on each partition.
You get (the default or zero value, which is 1 in your case) + 1+2+3+4+5 = 16 and (1 as the zero value) + 6+7+8+9+10 = 41.
At last, those results are folded with the zero value: 1 + 16 + 41 = 58.
Similarly, if you have 4 partitions, fold operates on each of the four partitions with the zero value 1 and then combines the four results with another fold, again using the zero value 1, which gives 60.
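To see this for yourself, a quick sketch (the name myrdd2 is mine) that prints each partition's contents and repeats the fold with 4 partitions:

val myrdd2 = sc.parallelize(1 to 10, 4)
// Inspect the partition contents to see where each extra zero value is applied.
myrdd2.glom().collect().foreach(p => println(p.mkString(",")))
myrdd2.fold(1)((x, y) => x + y)   // 60: one +1 per partition, plus one +1 in the final combine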
From the RDD.fold documentation:
Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.
This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection.
For a sum, the zero value should be 0, which gives you the correct result of 55.
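For example, a quick sketch with a neutral zero value, where the number of partitions no longer affects the result:

val myrdd3 = sc.parallelize(1 to 10, 4)
myrdd3.fold(0)((x, y) => x + y)   // 55, regardless of the number of partitions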
Hope this helps!
This question already has an answer here: Spark groupByKey alternative
My data is a file containing over 2 million rows of employee records. Each row has 15 fields of employee features, including name, DOB, SSN, etc. Example:
ID|name|DOB|address|SSN|...
1|James Bond|10/01/1990|1000 Stanford Ave|123456789|...
2|Jason Bourne|05/17/1987|2000 Yale Rd|987654321|...
3|James Bond|10/01/1990|5000 Berkeley Dr|123456789|...
I need to group the data by a number of columns and aggregate the employee IDs (the first column) that share the same key. The number and names of the key columns are passed into the function as parameters.
For example, if the key columns include "name, DOB, SSN", the data will be grouped as
(James Bond, 10/01/1990, 123456789), List(1,3)
(Jason Bourne, 05/17/1987, 987654321), List(2)
And the final output is
List(1,3)
List(2)
I am new to Scala and Spark. To solve this problem, I read the data as an RDD and, based on my research on StackOverflow, tried groupBy, reduceByKey, and foldByKey to implement the function. Among them, I found groupBy to be the slowest and foldByKey the fastest. My implementation with foldByKey is:
val buckets = data.map(row => (idx.map(i => row(i)) -> (row(0) :: Nil)))
.foldByKey(List[String]())((acc, e) => acc ::: e).values
My question is: is there a faster implementation than mine using foldByKey on an RDD?
Update: I've read posts on StackOverflow and understand that groupByKey may be very slow on a large dataset. This is why I avoided groupByKey and ended up with foldByKey. However, this is not the question I asked. I am looking for an even faster implementation, i.e. the one with the best processing time on my fixed hardware setup. (Processing the 2 million records currently takes ~15 minutes.) I was told that converting the RDD to a DataFrame and calling groupBy can be faster.
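For reference, a sketch of that DataFrame route; the names df, keyCols, and the ID column are assumptions, not from the original post:

import org.apache.spark.sql.functions.{col, collect_list}

// df: the same records loaded as a DataFrame; keyCols: the key column names,
// e.g. Seq("name", "DOB", "SSN"); both are assumed here.
val grouped = df
  .groupBy(keyCols.map(col): _*)
  .agg(collect_list(col("ID")).as("ids"))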
Here are some details on each of these operations, to understand how they work.
groupByKey runs slowly because all the key-value pairs are shuffled around. That is a lot of unnecessary data being transferred over the network.
reduceByKey works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.
combineByKey can be used when you are combining elements but your return type differs from your input value type.
foldByKey merges the values for each key using an associative function and a neutral "zero value".
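For illustration, a minimal combineByKey sketch for the list-of-IDs use case above, reusing the asker's data and idx (the String element type is an assumption):

val buckets = data
  .map(row => (idx.map(i => row(i)), row(0)))
  .combineByKey(
    (id: String) => List(id),                      // create a combiner from the first ID in a partition
    (acc: List[String], id: String) => id :: acc,  // add an ID to the partition-local list
    (a: List[String], b: List[String]) => a ::: b  // concatenate lists from different partitions
  )
  .values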
So avoid groupByKey. Hope this helps.
Cheers!
I have a dataset which is 1,000,000 rows by about 390,000 columns. The fields are all binary, either 0 or 1. The data is very sparse.
I've been using Spark to process this data. My current task is to filter the data: I only want the data in 1000 columns that have been preselected. This is the current code that I'm using to achieve this task:
val result = bigdata.map(_.zipWithIndex.filter{case (value, index) => selectedColumns.contains(index)})
bigdata is just an RDD[Array[Int]]
However, this code takes quite a while to run. I'm sure there's a more efficient way to filter my dataset that doesn't involve going in and filtering every single row separately. Would loading my data into a DataFrame and manipulating it through the DataFrame API make things faster/easier? Should I be looking into column-store based databases?
You can start by making your filter a little more efficient. Please note that:
your RDD contains Array[Int], which means you can access the nth element of each row in O(1) time
#selectedColumns << #columns
Considering these two facts, it should be obvious that it doesn't make sense to iterate over all elements of each row, not to mention the contains calls. Instead, you can simply map over selectedColumns:
// Optional, if selectedColumns is not already ordered
val orderedSelectedColumns = selectedColumns.toList.sorted.toArray
rdd.map(row => orderedSelectedColumns.map(row))
Comparing time complexity:
zipWithIndex + filter (assuming the best case, where contains is O(1)): O(#rows * #columns)
map: O(#rows * #selectedColumns)
The easiest way to speed up execution is to parallelize it with partitionBy:
bigdata.partitionBy(new HashPartitioner(numPartitions)).foreachPartition(...)
foreachPartition receives an Iterator over which you can map and filter.
numPartitions is a val that you can set to the desired number of parallel partitions.
If I have a file and I do an RDD zipWithIndex per row,
([row1, id1001, name, address], 0)
([row2, id1001, name, address], 1)
...
([row100000, id1001, name, address], 99999)
Will I be able to get the same indices if I reload the file? Since it runs in parallel, could the rows be partitioned differently?
RDDs can be sorted, and so do have an order. This order is used to create the index with .zipWithIndex().
Whether you get the same order each time depends on what the previous calls in your program do. The docs mention that .groupBy() can destroy order or generate different orderings. There may be other calls that do this as well.
I suppose you could always call .sortBy() before calling .zipWithIndex() if you needed to guarantee a specific ordering.
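For example, a minimal sketch; the file path and the choice of the field used for sorting are assumptions, not from the question:

// Sort on a stable, unique field before indexing so that reloading the file
// assigns the same index to the same row.
val indexed = sc.textFile("employees.txt")   // hypothetical path
  .map(_.split('|'))
  .sortBy(fields => fields(0))               // assumes the first field is a unique, stable key
  .zipWithIndex()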
This is explained in the .zipWithIndex() Scala API docs:
public RDD<scala.Tuple2<T,Object>> zipWithIndex() Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type. This method needs to trigger a spark job when this RDD contains more than one partitions.
Note that some RDDs, such as those returned by groupBy(), do not guarantee order of elements in a partition. The index assigned to each element is therefore not guaranteed, and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee the same index assignments, you should sort the RDD with sortByKey() or save it to a file.