Selecting specific elements of an RDD1 - scala

I am stuck on a particular scala-spark syntax, and I am hoping you can guide me in the correct direction.
if RDD1 is type Array[((Float, Float, Float), Long)],
RDD1.collect = Array((x1,y1,z1),1), ((x2,y2,z2),2), ((x3,y3,y3),3), ...)
and RDD2 is indices, of type, Array[Long],
RDD2.collect = Array(1, 3, 5...)
What is the best possible way to extract the values from RDD1 whose indices occur in RDD2. i.e,
the output, Array((x1,y1,z1),1), ((x3,y3,y3),3),(x5,y5,y5),5) ...)
Both RDD1 and RDD2 are large enough that I would like to avoid using .collect. Otherwise, the problem is simply finding intersecting elements in 2 scala arrays/lists.
thank you so much for your help!

There is a join function on PairRDD, which is what you want to use here.
// coming in, we have:
// rdd1: RDD[((Float, Float, Float), Long)]
// rdd2: RDD[Long]
val joinReadyRDD1 = rdd1.map { case (values, key) => (key, values) }
val joinReadyRDD2 = rdd1.map { key => (key, ()) }
val joined = joinReadyRDD1.join(joinReadyRDD2).mapValues(_._1)
This returns an RDD[(Long, (Float, Float, Float))] where the Long keys appeared in rdd2.
A side note: If you have a conceptual "key" and "value", put the key first. Take a look at the PairRDDFunctions I linked above -- it's quite a rich API and it all uses RDD[(Key, Value)].

Related

Spark RDD - CountByValue - Map type - order by key

From spark RDD - countByValue is returning Map Datatype and want to sort by key ascending/ descending .
val s = flightsObjectRDD.map(_.dep_delay / 60 toInt).countByValue() // RDD type is action and returning Map datatype
s.toSeq.sortBy(_._1)
The above code is working as expected. But countByValue itself have implicit sorting . How can i implement that way?
You exit the Big Data realm and get into Scala itself. And then into all those structures that are immutable, sorted, hashed and mutable, or a combination of these. I think that is the reason for the -1 initially. Nice folks out there, anyway.
Take this example, the countByValue returns a Map to the Driver, so only of interest for small amounts of data. Map is also (key, value) pair but with hashing and immutable. So we need to manipulate it. This is what you can do. First up you can sort the Map on the key in ascending order.
val rdd1 = sc.parallelize(Seq(("HR",5),("RD",4),("ADMIN",5),("SALES",4),("SER",6),("MAN",8),("MAN",8),("HR",5),("HR",6),("HR",5)))
val map = rdd1.countByValue
val res1 = ListMap(map.toSeq.sortBy(_._1):_*) // ascending sort on key part of Map
res1: scala.collection.immutable.ListMap[(String, Int),Long] = Map((ADMIN,5) -> 1, (HR,5) -> 3, (HR,6) -> 1, (MAN,8) -> 2, (RD,4) -> 1, (SALES,4) -> 1, (SER,6) -> 1)
However, you cannot apply reverse or descending logic on the key as it is hashing. Next best thing is as follows:
val res2 = map.toList.sortBy(_._1).reverse
val res22 = map.toSeq.sortBy(_._1).reverse
res2: List[((String, Int), Long)] = List(((SER,6),1), ((SALES,4),1), ((RD,4),1), ((MAN,8),2), ((HR,6),1), ((HR,5),3), ((ADMIN,5),1))
res22: Seq[((String, Int), Long)] = ArrayBuffer(((SER,6),1), ((SALES,4),1), ((RD,4),1), ((MAN,8),2), ((HR,6),1), ((HR,5),3), ((ADMIN,5),1))
But you cannot apply the .toMap against the .reverse here, as it will hash and lose the sort. So, you must make a compromise.

scala find top-k elements for a keyed sequence

For a sequence of things where the first element constitutes the key:
val things = Seq(("key_1", ("first", 1)),("key_1", ("first_second", 11)), ("key_2", ("second", 2)))
I want to count how often a key occurs and then only keep the top-k elements.
In pandas or a database I would:
count
join the result to the original and filter
In Scala, the first part can be handled by:
things.groupBy(identity).mapValues(_.size)
The first bit here is:
things.groupBy(_._1).mapValues(_.map( _._2 ))
But I am not sure about the second step.
In the case of the example above when looking at the top-1 keys key_1 occurs twice and is selected, therefore.
The desired outputted results are second elements of the top-k key tuples:
Seq(("first", 1),("first_second", 11))
edit
I need a solution which works for 2.11.x.
This approach first groups by the keys to get a map of the keys to original items.
You can also use an OrderedMap or PriorityQueue for more efficient top-N calculation, but if there aren't many elements, then a simple sortBy would work, too, as shown.
def valuesOfNMostFrequentKeys(things: Seq[(String, (String, Int))], N: Int = 1) = {
val grouped: Map[String,Seq[(String, (String, Int))]] = things.groupBy(_._1)
// "map" array of counts per keys to KV Tuples
val countToTuples:Array[(Int, Seq[(String, (String, Int))])] = grouped.map((kv: (String, Seq[(String, (String, Int))])) => (kv._2.size, kv._2)).toArray
// sort by count (first item in tuple) descending and take top N
val sortByCount:Array[(Int, Seq[(String, (String, Int))])] = countToTuples.sortBy(-_._1)
val topN:Array[(Int, Seq[(String, (String, Int))])] = sortByCount.take(N)
// extract inner (String, Int) item from list of keys and values, and flatten
topN.flatMap((kvList: (Int, Seq[(String, (String, Int))])) => kvList._2.map(_._2))
}
valuesOfNMostFrequentKeys(things)
output:
valuesOfNMostFrequentKeys: (things: Seq[(String, (String, Int))], N: Int)Array[(String, Int)]
res44: Array[(String, Int)] = Array((first,1), (first_second,11))
Note above is an Array and you may want to do toSeq -- but this works in Scala 2.11.
It looks like:
things.groupBy(_._1)
.mapValues(e => (e.map(_._2).size, e.map(_._2))).toSeq.map(_._2)
.sortBy(_._1).reverse.take(2).flatMap(_._2)
computes the desired outputs

Filtering RDDs based on value of Key

I have two RDDs that wrap the following arrays:
Array((3,Ken), (5,Jonny), (4,Adam), (3,Ben), (6,Rhonda), (5,Johny))
Array((4,Rudy), (7,Micheal), (5,Peter), (5,Shawn), (5,Aaron), (7,Gilbert))
I need to design a code in such a way that if I provide input as 3 I need to return
Array((3,Ken), (3,Ben))
If input is 6, output should be
Array((6,Rhonda))
I tried something like this:
val list3 = list1.union(list2)
list3.reduceByKey(_+_).collect
list3.reduceByKey(6).collect
None of these worked, can anyone help me out with a solution for this problem?
Given the following that you would have to define yourself
// Provide you SparkContext and inputs here
val sc: SparkContext = ???
val array1: Array[(Int, String)] = ???
val array2: Array[(Int, String)] = ???
val n: Int = ???
val rdd1 = sc.parallelize(array1)
val rdd2 = sc.parallelize(array2)
You can use the union and filter to reach your goal
rdd1.union(rdd2).filter(_._1 == n)
Since filtering by key is something that you would probably want to do in several occasions, it makes sense to encapsulate this functionality in its own function.
It would also be interesting if we could make sure that this function could work on any type of keys, not only Ints.
You can express this in the old RDD API as follows:
def filterByKey[K, V](rdd: RDD[(K, V)], k: K): RDD[(K, V)] =
rdd.filter(_._1 == k)
You can use it as follows:
val rdd = rdd1.union(rdd2)
val filtered = filterByKey(rdd, n)
Let's look at this method a little bit more in detail.
This method allows to filterByKey and RDD which contains a generic pair, where the type of the first item is K and the type of the second type is V (from key and value). It also accepts a key of type K that will be used to filter the RDD.
You then use the filter function, that takes a predicate (a function that goes from some type - in this case K - to a Boolean) and makes sure that the resulting RDD only contains items that respect this predicate.
We could also have written the body of the function as:
rdd.filter(pair => pair._1 == k)
or
rdd.filter { case (key, value) => key == k }
but we took advantage of the _ wildcard to express the fact that we want to act on the first (and only) parameter of this anonymous function.
To use it, you first parallelize your RDDs, call union on them and then invoke the filterByKey function with the number you want to filter by (as shown in the example).

How to replace RDD type of [String] with values of RDD type [String, Int]

Sorry for the confusion in the initial question. Here is a questions with the reproducible example:
I have an rdd of [String] and I have a rdd of [String, Long]. I would like to have an rdd of [Long] based on the match of String of second with String of first. Example:
//Create RDD
val textFile = sc.parallelize(Array("Spark can also be used for compute intensive tasks",
"This code estimates pi by throwing darts at a circle"))
// tokenize, result: RDD[(String)]
val words = textFile.flatMap(line => line.split(" "))
// create index of distinct words, result: RDD[(String,Long)]
val indexWords = words.distinct().zipWithIndex()
As a result, I would like to have an RDD with indexes of words instead of words in "Spark can also be used for compute intensive tasks".
Sorry again and thanks
If I understand you correctly, you're interested in the indices of works that also appear in Spark can also be used for compute intensive tasks.
If so - here are two versions with identical outputs but different performance characteristics:
val lookupWords: Seq[String] = "Spark can also be used for compute intensive tasks".split(" ")
// option 1 - use join:
val lookupWordsRdd: RDD[(String, String)] = sc.parallelize(lookupWords).keyBy(w => w)
val result1: RDD[Long] = indexWords.join(lookupWordsRdd).map { case (key, (index, _)) => index }
// option 2 - assuming list of lookup words is short, you can use a non-distributed version of it
val result2: RDD[Long] = indexWords.collect { case (key, index) if lookupWords.contains(key) => index }
The first option creates a second RDD with the words whose indices we're interested in, uses keyBy to transform it into a PairRDD (with key == value!), joins it with your indexWords RDD and then maps to get the index only.
The second option should only be used if the list of "interesting words" is known not to be too large - so we can keep it as a list (and not RDD), and let Spark serialize it and send to workers for each task to use. We then use collect(f: PartialFunction[T, U]) which applies this partial function to get a "filter" and a "map" at once - we only return a value if the words exists in the list, and if so - we return the index.
I was getting an error of SPARK-5063 and given this answer, I found the solution to my problem:
//broadcast `indexWords`
val bcIndexWords = sc.broadcast(indexWords.collectAsMap)
// select `value` of `indexWords` given `key`
val result = textFile.map{arr => arr.split(" ").map(elem => bcIndexWords.value(elem))}
result.first()
res373: Array[Long] = Array(3, 7, 14, 6, 17, 15, 0, 12)

Summing items within a Tuple

Below is a data structure of List of tuples, ot type List[(String, String, Int)]
val data3 = (List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1)) )
//> data3 : List[(String, String, Int)] = List((id1,a,1), (id1,a,1), (id1,a,1),
//| (id2,a,1))
I'm attempting to count the occurences of each Int value associated with each id. So above data structure should be converted to List((id1,a,3) , (id2,a,1))
This is what I have come up with but I'm unsure how to group similar items within a Tuple :
data3.map( { case (id,name,num) => (id , name , num + 1)})
//> res0: List[(String, String, Int)] = List((id1,a,2), (id1,a,2), (id1,a,2), (i
//| d2,a,2))
In practice data3 is of type spark obj RDD , I'm using a List in this example for testing but same solution should be compatible with an RDD . I'm using a List for local testing purposes.
Update : based on following code provided by maasg :
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
I needed to amend slightly to get into format I expect which is of type
.RDD[(String, Seq[(String, Int)])]
which corresponds to .RDD[(id, Seq[(name, count-of-names)])]
:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => ((id1),(id2,values.sum))}
val counted = result.groupedByKey
In Spark, you would do something like this: (using Spark Shell to illustrate)
val l = List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1))
val rdd = sc.parallelize(l)
val grouped = rdd.groupBy{case (id1,id2,v) => (id1,id2)}
val result = grouped.map{case ((id1,id2),values) => (id1,id2,value.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
Another option would be to map the rdd into a PairRDD and use groupByKey:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
Option 2 is a slightly better option when handling large sets as it does not replicate the id's in the cummulated value.
This seems to work when I use scala-ide:
data3
.groupBy(tupl => (tupl._1, tupl._2))
.mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum))
.values.toList
And the result is the same as required by the question
res0: List[(String, String, Int)] = List((id1,a,3), (id2,a,1))
You should look into List.groupBy.
You can use the id as the key, and then use the length of your values in the map (ie all the items sharing the same id) to know the count.
#vptheron has the right idea.
As can be seen in the docs
def groupBy[K](f: (A) ⇒ K): Map[K, List[A]]
Partitions this list into a map of lists according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will >always force the view and return a new list.
K the type of keys returned by the discriminator function.
f the discriminator function.
returns
A map from keys to lists such that the following invariant holds:
(xs partition f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a list of those elements x for which f(x) equals k.
So something like the below function, when used with groupBy will give you a list with keys being the ids.
(Sorry, I don't have access to an Scala compiler, so I can't test)
def f(tupule: A) :String = {
return tupule._1
}
Then you will have to iterate through the List for each id in the Map and sum up the number of integer occurrences. That is straightforward, but if you still need help, ask in the comments.
The following is the most readable, efficient and scalable
data.map {
case (key1, key2, value) => ((key1, key2), value)
}
.reduceByKey(_ + _)
which will give a RDD[(String, String, Int)]. By using reduceByKey it means the summation will paralellize, i.e. for very large groups it will be distributed and summation will happen on the map side. Think about the case where there are only 10 groups but billions of records, using .sum won't scale as it will only be able to distribute to 10 cores.
A few more notes about the other answers:
Using head here is unnecessary: .mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum)) can just use .mapValues(v =>(v_1, v._2, v.map(_._3).sum))
Using a foldLeft here is really horrible when the above shows .map(_._3).sum will do: val result = grouped.map{case ((id1,id2),values) => (id1,id2,value.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}