[Edit] Actually, my question is about the business scenario/requirement for the Spark RDD aggregate operation, especially with respect to the zeroValue and RDD partitions, not about how it works in Spark. Sorry for the confusion.
I have been learning the various kinds of Spark RDD calculations. While looking into the RDD aggregate/fold operations, I cannot think of a business scenario for aggregate/fold.
For example, I am going to calculate the sum of the values in an RDD using fold.
val myrdd1 = sc.parallelize(1 to 10, 2)
myrdd1.fold(1)((x,y) => x + y)
It returns 58.
If we change the number of partitions from 2 to 4, it returns 60. But I expected 55.
I understand what Spark does if I do not give a partition number when creating myrdd1: it takes a default partition number, which is not known in advance, so the return value will be "unstable".
So I do not understand why Spark has this kind of logic. Is there a real business scenario with this kind of requirement?
fold aggregates the data per partition, starting with the zero value given in the first parameter list. The per-partition results are then combined with the zero value once more at the end.
Thus, for 2 partitions you correctly received 58:
(1+1+2+3+4+5)+(1+6+7+8+9+10)+1
Likewise, for 4 partitions the correct result is 60:
(1+1+2+3)+(1+4+5+6)+(1+7+8)+(1+9+10)+1
For real-world scenarios, this kind of divide-and-conquer computation is useful wherever the logic is associative and commutative, i.e. when the order in which the operations are executed doesn't matter, as with mathematical addition. That way Spark only moves partial aggregation results across the network instead of, for instance, shuffling whole blocks.
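As an illustration of such a scenario (a small sketch of my own, not taken from the question), aggregate can compute an average in a single pass: each partition accumulates a local (sum, count) pair starting from the zero value, and only those small pairs are merged across partitions:
val numbers = sc.parallelize(1 to 10, 4)
// zeroValue (0, 0) = (running sum, running count); it is applied once per partition
// and once more when merging, so it must be neutral for the operation (unlike 1 for a sum)
val (sum, count) = numbers.aggregate((0, 0))(
  (acc, n) => (acc._1 + n, acc._2 + 1),  // fold one element into the local accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)   // merge two partition accumulators
)
val average = sum.toDouble / count       // 5.5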
Your expectation of receiving 55 would be met if, instead of fold, you used treeReduce:
"treeReduce" should "compute sum of numbers" in {
  val numbersRdd = sparkContext.parallelize(1 to 20, 10)
  val sumComputation = (v1: Int, v2: Int) => v1 + v2
  val treeSum = numbersRdd.treeReduce(sumComputation, 2)
  treeSum shouldEqual(210)
  val reducedSum = numbersRdd.reduce(sumComputation)
  reducedSum shouldEqual(treeSum)
}
Some time ago I wrote a small post about tree aggregations in RDD: http://www.waitingforcode.com/apache-spark/tree-aggregations-spark/read
I think the result you are getting is as expected; I will try to explain how it works.
You have an `rdd` with 10 elements in two partitions:
val myrdd1 = sc.parallelize(1 to 10, 2)
Let's suppose the two partitions contain p1 = {1,2,3,4,5} and p2 = {6,7,8,9,10}.
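If you want to check how the elements are actually distributed across the partitions (a quick sketch, not part of the original explanation), glom collects each partition into an array:
myrdd1.glom().collect()
// with 2 partitions, roughly: Array(Array(1, 2, 3, 4, 5), Array(6, 7, 8, 9, 10))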
Now, as per the documentation, fold operates on each partition.
So you get (the default or zero value, which is 1 in your case) 1+1+2+3+4+5 = 16 for p1 and 1+6+7+8+9+10 = 41 for p2.
Finally, fold combines those partition results with the zero value again: 1 + 16 + 41 = 58.
Similarly, if you have 4 partitions, fold operates on each of the four partitions with the default value 1 and then combines the four results with another fold using the default value 1, which results in 60.
Aggregate the elements of each partition, and then the results for all
the partitions, using a given associative function and a neutral "zero
value". The function op(t1, t2) is allowed to modify t1 and return it
as its result value to avoid object allocation; however, it should not
modify t2.
This behaves somewhat differently from fold operations implemented for
non-distributed collections in functional languages like Scala. This
fold operation may be applied to partitions individually, and then
fold those results into the final result, rather than apply the fold
to each element sequentially in some defined ordering. For functions
that are not commutative, the result may differ from that of a fold
applied to a non-distributed collection.
For a sum, the zero value should be 0, which gives you the correct result of 55.
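For instance:
myrdd1.fold(0)((x, y) => x + y)
// returns 55, regardless of the number of partitions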
Hope this helps!
Suppose I have a Spark dataframe called trades which has in its schema a few columns, some dimensions (let's say Product and Type) and some facts (let's say Price and Volume).
Rows in the dataframe which have the same dimension columns belong logically to the same group.
What I need is to map each dimension set (Product, Type) to a numeric value, so as to obtain in the end a dataframe stats which has as many rows as there are distinct dimension sets, and a value - this is the critical part - obtained from all the rows of trades with that (Product, Type). That value must be computed sequentially and in order, because the function applied row by row is neither associative nor commutative, and it cannot be parallelized.
I managed to handle the sequential function I need to apply to each subset by repartitioning each dataframe into a single partition and sorting the rows, so I get exactly what I need.
The thing I am struggling with is how to do the map from trades to stats as a Spark job: in my scenario the master is remote and can leverage multiple executors, while the deploy mode is local and the local machine is poorly equipped.
So I don't want to loop on the driver, but rather push the work down to the cluster.
If this was not Spark, I'd have done something like:
val dimensions = trades.select("Product", "Type").distinct()
val stats = dimensions.map { row =>
  val product = row.getAs[String]("Product")
  val productType = row.getAs[String]("Type")
  val inScope = col("Product") === product and col("Type") === productType
  val tradesInScope = trades.filter(inScope)
  Row(product, productType, callSequentialFunction(tradesInScope))
}
This seemed fine to me, but it's absolutely not working: I am trying to make a nested call on trades, and it seems that is not supported. Indeed, the Spark job compiles, but when actually performing an action I get a NullPointerException, because the dataframe trades is null within the map.
I am new to Spark, and I don't know any other way of achieving the same intent in a valid way. Could you help me?
You get a NullPointerException because you cannot use dataframes within executor-side code; they only live on the driver. Also, your code would not ensure that callSequentialFunction is called sequentially, because map on a dataframe will run in parallel (if you have more than one partition). What you can do is something like this:
val dimensions = trades.select("Product", "Type").distinct().as[(String,String)].collect()
val stats = dimensions.map { case (product, productType) =>
  val inScope = col("Product") === product and col("Type") === productType
  val tradesInScope = trades.filter(inScope)
  (product, productType, callSequentialFunction(tradesInScope))
}
But note that the order in dimensions is somewhat arbitrary, so you should sort dimensions according to your needs.
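For example (a small sketch, assuming you want them ordered by Product and then by Type):
val orderedDimensions = dimensions.sortBy { case (product, productType) => (product, productType) }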
My data is a file containing over 2 million rows of employee records. Each row has 15 fields of employee features, including name, DOB, ssn, etc. Example:
ID|name|DOB|address|SSN|...
1|James Bond|10/01/1990|1000 Stanford Ave|123456789|...
2|Jason Bourne|05/17/1987|2000 Yale Rd|987654321|...
3|James Bond|10/01/1990|5000 Berkeley Dr|123456789|...
I need to group the data by a number of columns and aggregate the employees' IDs (the first column) for rows that share the same key. The number and names of the key columns are passed into the function as parameters.
For example, if the key columns include "name, DOB, SSN", the data will be grouped as
(James Bond, 10/01/1990, 123456789), List(1,3)
(Jason Bourne, 05/17/1987, 987654321), List(2)
And the final output is
List(1,3)
List(2)
I am new to Scala and Spark. What I did to solve this problem was to read the data as an RDD and try groupBy, reduceByKey, and foldByKey to implement the function, based on my research on StackOverflow. Among them, I found groupBy was the slowest and foldByKey the fastest. My implementation with foldByKey is:
val buckets = data.map(row => (idx.map(i => row(i)) -> (row(0) :: Nil)))
  .foldByKey(List[String]())((acc, e) => acc ::: e).values
My question is: Is there faster implementation than mine using foldByKey on RDD?
Update: I've read posts on StackOverflow and understand that groupByKey may be very slow on a large dataset. This is why I avoided groupByKey and ended up with foldByKey. However, this is not the question I asked. I am looking for an even faster implementation, or the optimal implementation in terms of processing time for a fixed hardware setting. (Processing the 2 million records currently takes ~15 minutes.) I was told that converting the RDD to a DataFrame and calling groupBy can be faster.
Here are some details on each of these, to help understand how they work.
groupByKey runs slowly because all the key-value pairs are shuffled around. That is a lot of unnecessary data being transferred over the network.
reduceByKey works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.
combineByKey can be used when you are combining elements but your return type differs from your input value type.
foldByKey merges the values for each key using an associative function and a neutral "zero value".
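Applied to your snippet, a reduceByKey version would look like the sketch below (same data and idx as in your code); since both reduceByKey and foldByKey combine values on each partition before shuffling, don't expect a dramatic difference between the two:
val buckets = data
  .map(row => idx.map(i => row(i)) -> List(row(0)))
  .reduceByKey(_ ::: _)
  .values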
So avoid groupByKey. Hoping this helps.
Cheers !
I am learning Spark and Scala and keep coming across this pattern:
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
While I understand what it does, I don't understand why it is used instead of having something like:
val lines = sc.textFile("data.txt")
val counts = lines.reduceByValue((v1, v2) => v1 + v2)
Given that Spark is designed to process large amounts of data efficiently, it seems counter-intuitive to always have to perform the additional step of converting the elements into key-value pairs and then reducing by key, instead of simply being able to reduce by value.
First, this "additional step" doesn't really cost much (see more details at the end) - it doesn't shuffle the data, and it is performed together with other transformations: transformations can be "pipelined" as long as they don't change the partitioning.
Second - the API you suggest seems very specific for counting - although you suggest reduceByValue will take a binary operator f: (Int, Int) => Int, your suggested API assumes each value is mapped to the value 1 before applying this operator for all identical values - an assumption that is hardly useful in any scenario other than counting. Adding such specific APIs would just bloat the interface and is never going to cover all use cases anyway (what's next - RDD.wordCount?), so it's better to give users minimal building blocks (along with good documentation).
Lastly - if you're not happy with such low-level APIs, you can use Spark SQL's DataFrame API to get some higher-level APIs that will hide these details - that's one of the reasons DataFrames exist:
val linesDF = sc.textFile("file.txt").toDF("line")
val wordsDF = linesDF.explode("line","word")((line: String) => line.split(" "))
val wordCountDF = wordsDF.groupBy("word").count()
EDIT: as requested - some more details about why the performance impact of this map operation is either small or entirely negligible:
First, I'm assuming you are interested in producing the same result as the map -> reduceByKey code would produce (i.e. word count), which means somewhere the mapping from each record to the value 1 must take place, otherwise there's nothing to perform the summing function (v1, v2) => v1 + v2 on (that function takes Ints, they must be created somewhere).
To my understanding, you're just wondering why this has to happen as a separate map operation.
So, we're actually interested in the overhead of adding another map operation.
Consider these two functionally-identical Spark transformations:
val rdd: RDD[String] = ???
/*(1)*/ rdd.map(s => s.length * 2).collect()
/*(2)*/ rdd.map(s => s.length).map(_ * 2).collect()
Q: Which one is faster?
A: They perform the same.
Why? Because as long as two consecutive transformations on an RDD do not change the partitioning (and that's the case in your original example too), Spark will group them together, and perform them within the same task. So, per record, the difference between these two will come down to the difference between:
/*(1)*/ s.length * 2
/*(2)*/ val r1 = s.length; r1 * 2
Which is negligible, especially when you're discussing distributed execution on large datasets, where execution time is dominated by things like shuffling, de/serialization and IO.
I have 2 sorted RDDs:
val rdd_a = some_pair_rdd.sortByKey().
zipWithIndex.filter(f => f._2 < n).
map(f => f._1)
val rdd_b = another_pair_rdd.sortByKey().
zipWithIndex.filter(f => f._2 < n).
map(f => f._1)
val all_rdd = rdd_a.union(rdd_b)
In all_rdd, I see that the order is not necessarily maintained as I'd imagined (that all elements of rdd_a come first, followed by all elements of rdd_b). Is my assumption incorrect (about the contract of union), and if so, what should I use to append multiple sorted RDDs into a single rdd?
I'm fairly new to Spark so I could be wrong, but from what I understand Union is a narrow transformation. That is, each executor joins only its local blocks of RDD a with its local blocks of RDD b and then returns that to the driver.
As an example, let's say that you have 2 executors and 2 RDDs.
RDD_A = ["a","b","c","d","e","f"]
and
RDD_B = ["1","2","3","4","5","6"]
Let Executor 1 contain the first half of both RDD's and Executor 2 contain the second half of both RDD's. When they perform the union on their local blocks, it would look something like:
Union_executor1 = ["a","b","c","1","2","3"]
and
Union_executor2 = ["d","e","f","4","5","6"]
So when the executors pass their parts back to the driver you would have ["a","b","c","1","2","3","d","e","f","4","5","6"]
Again, I'm new to Spark and I could be wrong. I'm just sharing based on my understanding of how it works with RDD's. Hopefully we can both learn something from this.
You can't. Spark does not have a merge sort, because you can't make assumptions about the way that the RDDs are actually stored on the nodes. If you want things in sort order after you take the union, you need to sort again.
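For example (a minimal sketch, assuming the pair RDDs from the question and that you want the combined result ordered by key again):
val all_rdd = rdd_a.union(rdd_b).sortByKey()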
I have a dataset which is 1,000,000 rows by about 390,000 columns. The fields are all binary, either 0 or 1. The data is very sparse.
I've been using Spark to process this data. My current task is to filter the data: I only want data in 1000 columns that have been preselected. This is the current code that I'm using to achieve this task:
val result = bigdata.map(_.zipWithIndex.filter{case (value, index) => selectedColumns.contains(index)})
bigdata is just an RDD[Array[Int]]
However, this code takes quite a while to run. I'm sure there's a more efficient way to filter my dataset that doesn't involve going in and filtering every single row separately. Would loading my data into a DataFrame and manipulating it through the DataFrame API make things faster/easier? Should I be looking into column-store based databases?
You can start by making your filter a little bit more efficient. Please note that:
your RDD contains Array[Int]. It means you can access the nth element of each row in O(1) time
#selectedColumns << #columns
Considering these two facts, it should be obvious that it doesn't make sense to iterate over all the elements of each row, not to mention the contains calls. Instead you can simply map over selectedColumns:
// Optional if selectedColumns are not ordered
val orderedSelectedColumns = selectedColumns.toList.sorted.toArray
rdd.map(row => orderedSelectedColumns.map(row))
Comparing time complexity:
zipWithIndex + filter (assuming the best-case scenario where contains is O(1)) - O(#rows * #columns)
map - O(#rows * #selectedColumns)
The easiest way to speed up execution is to parallelize it with partitionBy:
bigdata.partitionBy(new HashPartitioner(numPartitions)).foreachPartition(...)
foreachPartition receives an Iterator over which you can map and filter.
numPartitions is a val which you can set to the desired number of parallel partitions.
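A rough sketch of that idea (my own, with a couple of adjustments): partitionBy is only defined on key-value RDDs, so each row is keyed first, and mapPartitions is used instead of foreachPartition so that the filtered rows come back as a new RDD rather than just being iterated over:
import org.apache.spark.HashPartitioner

val keyed = bigdata.zipWithIndex.map { case (row, i) => (i, row) }
val result = keyed
  .partitionBy(new HashPartitioner(numPartitions))
  .mapPartitions(_.map { case (_, row) => selectedColumns.map(row) })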