Efficiently take one value for each key out of an RDD[(key, value)] - Scala

My starting point is an RDD[(key,value)] in Scala using Apache Spark. The RDD contains roughly 15 million tuples. Each key has roughly 50 ± 20 values.
Now I'd like to take one value (doesn't matter which one) for each key. My current approach is the following:
HashPartition the RDD by the key. (There is no significant skew)
Group the tuples by key, resulting in RDD[(key, array of values)]
Take the first element of each value array
Basically looks like this:
...
candidates
.groupByKey()
.map(c => (c._1, c._2.head))
...
The grouping is the expensive part. It is still fast because there is no network shuffle and candidates is in memory, but can I do it faster?
My idea was to work on the partitions directly, but I'm not sure what the HashPartitioner actually guarantees. If I take the first tuple of each partition, will I get every key, but possibly multiple tuples for a single key depending on the number of partitions? Or will I miss keys?
Thank you!

How about reduceByKey with a function that returns the first argument? Like this:
candidates.reduceByKey((x, _) => x)
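For example, on a small sample (the data below is made up), this keeps exactly one value per key; and since candidates is already hash-partitioned by key, reduceByKey should be able to reuse that partitioner and avoid a shuffle entirely:

val sample = sc.parallelize(Seq(("k1", "v11"), ("k1", "v12"), ("k2", "v21")))

// One value survives per key; which one is not guaranteed, but the question
// says any value will do.
val onePerKey = sample.reduceByKey((x, _) => x)
onePerKey.collect()  // e.g. Array((k1,v11), (k2,v21))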

Related

What is the fastest way to group values based on multiple key columns in RDD using Scala? [duplicate]

My data is a file containing over 2 million rows of employee records. Each row has 15 fields of employee features, including name, DOB, ssn, etc. Example:
ID|name|DOB|address|SSN|...
1|James Bond|10/01/1990|1000 Stanford Ave|123456789|...
2|Jason Bourne|05/17/1987|2000 Yale Rd|987654321|...
3|James Bond|10/01/1990|5000 Berkeley Dr|123456789|...
I need to group the data by a number of columns and aggregate the employee's ID (first column) with the same key. The number and name of the key columns are passed into the function as parameters.
For example, if the key columns include "name, DOB, SSN", the data will be grouped as
(James Bond, 10/01/1990, 123456789), List(1,3)
(Jason Bourne, 05/17/1987, 987654321), List(2)
And the final output is
List(1,3)
List(2)
I am new to Scala and Spark. To solve this problem I read the data as an RDD and tried groupBy, reduceByKey, and foldByKey, based on my research on StackOverflow. Among them, I found groupBy was the slowest and foldByKey the fastest. My foldByKey implementation is:
val buckets = data.map(row => (idx.map(i => row(i)) -> (row(0) :: Nil)))
  .foldByKey(List[String]())((acc, e) => acc ::: e)
  .values
My question is: is there a faster implementation than my foldByKey version on RDDs?
Update: I've read posts on StackOverflow and understand that groupByKey may be very slow on large datasets. This is why I avoided groupByKey and ended up with foldByKey. However, this is not the question I asked. I am looking for an even faster implementation, or the optimal implementation in terms of processing time with a fixed hardware setting. (Processing 2 million records currently takes ~15 minutes.) I was told that converting the RDD to a DataFrame and calling groupBy can be faster.
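A rough sketch of what that DataFrame version might look like (Spark 2.x; spark, the file path, and the exact column names are assumptions based on the sample header above):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder().getOrCreate()

val df = spark.read
  .option("header", "true")
  .option("delimiter", "|")
  .csv("employees.txt")                        // hypothetical path to the pipe-delimited file

val ids = df
  .groupBy("name", "DOB", "SSN")               // the key columns passed in as parameters
  .agg(collect_list("ID").alias("ids"))        // collect the IDs that share the same key
  .select("ids")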
Here are some details on each of these operations first, to understand how they work.
groupByKey runs slowly because all the key-value pairs are shuffled around. That is a lot of unnecessary data being transferred over the network.
reduceByKey works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.
combineByKey can be used when you are combining elements but your return type differs from your input value type.
foldByKey merges the values for each key using an associative function and a neutral "zero value".
So avoid groupByKey. Hope this helps.
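To make the combineByKey point concrete for this question, here is a rough sketch of the same grouping written with combineByKey, using mutable buffers instead of the repeated list concatenation in the foldByKey version (data and idx are as in the question; whether it is actually faster would need to be measured):

import scala.collection.mutable.ArrayBuffer

val buckets = data
  .map(row => (idx.map(i => row(i)), row(0)))
  .combineByKey(
    (id: String) => ArrayBuffer(id),                                  // the first ID for a key starts a buffer
    (buf: ArrayBuffer[String], id: String) => buf += id,              // further IDs are appended in place
    (b1: ArrayBuffer[String], b2: ArrayBuffer[String]) => b1 ++= b2   // buffers are merged across partitions
  )
  .values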
Cheers !

Spark: How to get the same result with reduceByKey as with groupByKey? Any alternative solution to avoid the shuffle?

I am new to Spark (version 1.1) and Scala. I am converting my existing Hadoop MapReduce code to Spark using Scala and am a bit lost.
I want my mapped RDD to be grouped by key. From what I read online, it is suggested that we should avoid groupByKey and use reduceByKey instead, but when I apply reduceByKey I am not getting the list of values for a given key, as my code expects. Example:
val rdd = sc.parallelize(List(("k1", "v11"), ("k1", "v21"), ("k2", "v21"), ("k2", "v22"), ("k3", "v31") ))
My "values" for actual task are huge, having 300 plus columns in key-values pair
Grouping by a common key will result in a shuffle, which I want to avoid.
I want something like this as output from my mapped RDD: (key, List or Array of values), i.e.
rdd.groupByKey()
which gives me the following output:
(k3,ArrayBuffer(v31))
(k2,ArrayBuffer(v21, v22))
(k1,ArrayBuffer(v11, v21))
But when I use
rdd.reduceByKey((x,y) => x+y)
I get the values concatenated together, as shown below. If there were a pipe ('|') or some other separator, e.g. (k2,v21|v22), my problem would be partly solved, but having an actual list would still be better practice:
(k3,v31)
(k2,v21v22)
(k1,v11v21)
Please help
If you refer to the Spark documentation http://spark.apache.org/docs/latest/programming-guide.html
for groupByKey, it says:
“When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.”
The word Iterable is very important here: the value you get, such as (v21, v22), is an iterable.
Further, it says:
“Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.”
So what I understand from this is: if you want the returned RDD to have iterable values, use groupByKey; and if you want a single aggregated value such as a sum, use reduceByKey.
Now, if instead of (String, String) tuples, i.e. (K1, V1), you had (String, ListBuffer[String]) tuples, i.e. (K1, ListBuffer("V1")), then you could do rdd.reduceByKey((x, y) => x ++= y).
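A rough sketch of that idea with the rdd from the question (note that every value still gets shuffled, so this mostly changes the shape of the result rather than the cost):

import scala.collection.mutable.ListBuffer

val grouped = rdd
  .mapValues(v => ListBuffer(v))     // ("k1", "v11") becomes ("k1", ListBuffer("v11"))
  .reduceByKey((x, y) => x ++= y)    // concatenate the buffers belonging to the same key
grouped.collect()
// e.g. (k1,ListBuffer(v11, v21)), (k2,ListBuffer(v21, v22)), (k3,ListBuffer(v31))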

Filter columns in large dataset with Spark

I have a dataset which is 1,000,000 rows by about 390,000 columns. The fields are all binary, either 0 or 1. The data is very sparse.
I've been using Spark to process this data. My current task is to filter the data--I only want data in 1000 columns that have been preselected. This is the current code that I'm using to achieve this task:
val result = bigdata.map(_.zipWithIndex.filter{case (value, index) => selectedColumns.contains(index)})
bigdata is just an RDD[Array[Int]]
However, this code takes quite a while to run. I'm sure there's a more efficient way to filter my dataset that doesn't involve filtering every single row separately. Would loading my data into a DataFrame and manipulating it through the DataFrame API make things faster/easier? Should I be looking into column-store based databases?
You can start by making your filter a little more efficient. Please note that:
your RDD contains Array[Int], which means you can access the nth element of each row in O(1) time
#selectedColumns << #columns
Considering these two facts, it should be obvious that it doesn't make sense to iterate over all elements of each row, not to mention the contains calls. Instead you can simply map over selectedColumns:
// Optional if selectedColumns is not already sorted
val orderedSelectedColumns = selectedColumns.toList.sorted.toArray

rdd.map(row => orderedSelectedColumns.map(i => row(i)))
Comparing time complexity:
zipWithIndex + filter (assuming the best-case scenario where contains is O(1)) - O(#rows * #columns)
map - O(#rows * #selectedColumns)
The easiest way to speed up execution is to parallelize it with partitionBy:
bigdata.partitionBy(new HashPartitioner(numPartitions)).foreachPartition(...)
foreachPartition receives an Iterator over which you can map and filter.
numPartitions is a val that you can set to the desired number of parallel partitions.
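A rough sketch of that partition-level idea; since bigdata is an RDD[Array[Int]] rather than a pair RDD, repartition stands in for partitionBy here, and mapPartitions is used instead of foreachPartition so the filtered rows come back as an RDD (numPartitions is a hypothetical value, and orderedSelectedColumns is reused from the previous answer):

val numPartitions = 200  // hypothetical; tune to the cluster

val filtered = bigdata
  .repartition(numPartitions)
  .mapPartitions(rows => rows.map(row => orderedSelectedColumns.map(i => row(i))))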

Spark in Scala: How to avoid linear scan for searching a key in each partition?

I have one huge key-value dataset named A, and a set of keys named B as queries. My task is, for each key in B, to return whether the key exists in A and, if it does, to return its value.
I partition A by HashPartitioner(100) first. Currently I can use A.join(B') to solve it, where B' = B.map(x => (x, null)). Or we can use A.lookup() for each key in B.
However, the problem is that both join and lookup on a PairRDD do a linear scan within each partition. This is too slow. What I want is for each partition to be a HashMap, so that we can find a key within each partition in O(1). So the ideal strategy is: when the master machine receives a bunch of keys, it assigns each key to its corresponding partition, then that partition uses its HashMap to find the keys and returns the results to the master.
Is there an easy way to achieve it?
One potential way:
While searching online, I found a similar question here:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201401.mbox/%3CCAMwrk0kPiHoX6mAiwZTfkGRPxKURHhn9iqvFHfa4aGj3XJUCNg#mail.gmail.com%3E
As suggested there, I built the HashMap for each partition using the following code:
import scala.collection.mutable.HashMap

val hashpair = A.mapPartitions(iterator => {
  val hashmap = new HashMap[Long, Double]
  iterator.foreach { case (key, value) => hashmap.getOrElseUpdate(key, value) }
  Iterator(hashmap)
})
Now I get 100 HashMaps (if I have 100 partitions for dataset A). Here I'm lost: I don't know how to run queries, i.e. how to use hashpair to search for the keys in B, since hashpair is not a regular RDD of pairs. Do I need to implement a new RDD and implement RDD methods for hashpair? If so, what is the easiest way to implement join or lookup methods on hashpair?
Thanks all.
You're probably looking for the IndexedRDD:
https://github.com/amplab/spark-indexedrdd
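If pulling in an external library is not an option, a rough core-Spark sketch of the per-partition HashMap idea from the question could look like the following, assuming A is an RDD[(Long, Double)], B is an RDD[Long], and each partition's map fits in memory:

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(100)

// Build one in-memory map per partition of A.
val aMaps = A.partitionBy(partitioner)
  .mapPartitions(iter => Iterator(iter.toMap))

// Route each query key to the partition that would hold it in A.
val bKeyed = B.map(k => (k, ())).partitionBy(partitioner)

// Probe the per-partition map; each lookup within a partition is O(1).
val results = aMaps.zipPartitions(bKeyed) { (mapIter, queryIter) =>
  val lookup = if (mapIter.hasNext) mapIter.next() else Map.empty[Long, Double]
  queryIter.map { case (k, _) => (k, lookup.get(k)) }   // (key, Option[value])
}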

Getting the number of values within a reduceByKey RDD

When the reduceByKey operation is called, it receives the list of values for a particular key. My questions are:
Are the values it receives in sorted order?
Is it possible to know how many values it receives?
I'm trying to calculate the first quartile of the list of values for a key within reduceByKey. Is this possible to do within reduceByKey?
1. No, that would go completely against the whole point of a reduce operation - i.e. to parallelize an operation into an arbitrary tree of sub-operations by taking advantage of associativity and commutativity.
2. You'll need to define a new monoid by composing the integer monoid with whatever it is you're doing. Assuming your operation is op, then
yourRdd.map(kv => (kv._1, (kv._2, 1)))
  .reduceByKey((left, right) => (left._1 op right._1, left._2 + right._2))
will give you an RDD[(KeyType, (ReducedValueType, Int))] where the Int will be the number of values the reduce received for each key.
3. You'll have to be more specific about what you mean by first quartile. Given that the answer to 1 is no, you would need a known bound that defines the first quartile; then you wouldn't need the data to be sorted, because you could simply filter the values by that bound.
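For instance, a concrete instantiation of the pattern from point 2, with addition as the op (the sample data is made up):

val yourRdd = sc.parallelize(Seq(("a", 2.0), ("a", 5.0), ("b", 1.0)))

val reducedWithCounts = yourRdd
  .map { case (k, v) => (k, (v, 1)) }                               // pair each value with a count of 1
  .reduceByKey { case ((v1, n1), (v2, n2)) => (v1 + v2, n1 + n2) }  // reduce the values, sum the counts
reducedWithCounts.collect()  // e.g. (a,(7.0,2)), (b,(1.0,1))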