Scala: find top-k elements for a keyed sequence

For a sequence of things where the first element constitutes the key:
val things = Seq(("key_1", ("first", 1)),("key_1", ("first_second", 11)), ("key_2", ("second", 2)))
I want to count how often a key occurs and then only keep the top-k elements.
In pandas or a database I would:
count
join the result to the original and filter
In Scala, the counting part can be handled by:
things.map(_._1).groupBy(identity).mapValues(_.size)
and grouping the values by key by:
things.groupBy(_._1).mapValues(_.map(_._2))
But I am not sure about the second step, i.e. keeping only the elements whose key is among the top-k.
In the example above, when looking at the top-1 keys, key_1 occurs twice and is therefore selected.
The desired output is the second elements of the tuples belonging to the top-k keys:
Seq(("first", 1),("first_second", 11))
Edit: I need a solution which works for Scala 2.11.x.
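For reference, a rough sketch (not from the original post) of that count / join / filter recipe in plain Scala 2.11, using the things sequence above:
val counts = things.groupBy(_._1).mapValues(_.size)                // count occurrences per key
val topKeys = counts.toSeq.sortBy(-_._2).take(1).map(_._1).toSet   // keep the top-1 key(s)
things.collect { case (k, v) if topKeys(k) => v }                  // "join" back and filter
// Seq((first,1), (first_second,11))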

This approach first groups by the keys to get a map from each key to the original items.
You could also use a SortedMap or a mutable PriorityQueue for a more efficient top-N calculation (a sketch of a PriorityQueue-based variant follows after the output below), but if there aren't many elements, a simple sortBy works too, as shown.
def valuesOfNMostFrequentKeys(things: Seq[(String, (String, Int))], N: Int = 1) = {
  // group the original items by key
  val grouped: Map[String, Seq[(String, (String, Int))]] = things.groupBy(_._1)
  // map each group to (count, items in that group)
  val countToTuples: Array[(Int, Seq[(String, (String, Int))])] =
    grouped.map((kv: (String, Seq[(String, (String, Int))])) => (kv._2.size, kv._2)).toArray
  // sort by count (first item in tuple) descending and take the top N
  val sortByCount: Array[(Int, Seq[(String, (String, Int))])] = countToTuples.sortBy(-_._1)
  val topN: Array[(Int, Seq[(String, (String, Int))])] = sortByCount.take(N)
  // extract the inner (String, Int) items from each kept group, and flatten
  topN.flatMap((kvList: (Int, Seq[(String, (String, Int))])) => kvList._2.map(_._2))
}
valuesOfNMostFrequentKeys(things)
output:
valuesOfNMostFrequentKeys: (things: Seq[(String, (String, Int))], N: Int)Array[(String, Int)]
res44: Array[(String, Int)] = Array((first,1), (first_second,11))
Note that the result above is an Array and you may want to call toSeq on it -- but this works in Scala 2.11.
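For the PriorityQueue variant mentioned above, here is a hypothetical sketch (not part of the original answer); it only selects the top-N keys and their counts, using a bounded min-heap so that not every key has to be sorted:
import scala.collection.mutable

def topNKeys(things: Seq[(String, (String, Int))], n: Int): List[(String, Int)] = {
  val counts = things.groupBy(_._1).mapValues(_.size).toList
  // order by negated count so the least frequent candidate is the one that gets dequeued
  val heap = mutable.PriorityQueue.empty[(String, Int)](Ordering.by[(String, Int), Int](kc => -kc._2))
  counts.foreach { kc =>
    heap += kc
    if (heap.size > n) heap.dequeue()   // evict the least frequent candidate
  }
  heap.toList                           // the top-N (key, count) pairs, in no particular order
}
Filtering things by the returned keys then recovers the values, as in the function above.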

It looks like:
things.groupBy(_._1)
  .mapValues(e => (e.size, e.map(_._2)))
  .toSeq.map(_._2)
  .sortBy(_._1).reverse
  .take(1)
  .flatMap(_._2)
computes the desired output (take(1) selects the top-1 key; use take(k) for the top-k).

Related

Applying distinct on an RDD considering both the key-value pair and not just the keys

I have 2 pair RDDs on which I am doing a union to give a third RDD.
But the resulting RDD has tuples which are repeated:
rdd3 = {(1,2) , (3,4) , (1,2)}
I want to remove duplicate tuples from rdd3, but only if both the key and the value of the tuple are the same.
How can I do that?
Just invoke the Spark Scala library API directly:
def distinct(): RDD[T]
Remember that it is a generic method with a type parameter.
If you invoke it on your rdd, of type RDD[(Int, Int)], it will give you the distinct pairs of type (Int, Int) in your rdd, just as they are.
If you want to see the internals of this method, see the signature below:
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}
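For intuition, here is the same trick applied by hand to the rdd3 from the question (a sketch, assuming a Spark shell with sc available):
val rdd3 = sc.parallelize(Seq((1, 2), (3, 4), (1, 2)))
rdd3.map(x => (x, null))        // pair every element with a dummy value
  .reduceByKey((x, y) => x)     // collapse duplicates per key, i.e. per (Int, Int) pair
  .map(_._1)                    // drop the dummy value again
  .collect()                    // Array((1,2), (3,4)) -- same result as rdd3.distinct(), order may vary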
You can use distinct, for example:
val data = sc.parallelize(
  Seq(
    ("Foo", "41", "US", "3"),
    ("Foo", "39", "UK", "1"),
    ("Bar", "57", "CA", "2"),
    ("Bar", "72", "CA", "2"),
    ("Baz", "22", "US", "6"),
    ("Baz", "36", "US", "6"),
    ("Baz", "36", "US", "6")
  )
)
Remove duplicates:
val distinctData = data.distinct()
distinctData.collect

Selecting specific elements of RDD1

I am stuck on a particular scala-spark syntax, and I am hoping you can guide me in the correct direction.
If RDD1 is of type Array[((Float, Float, Float), Long)], i.e.
RDD1.collect = Array(((x1,y1,z1),1), ((x2,y2,z2),2), ((x3,y3,z3),3), ...)
and RDD2 holds indices, of type Array[Long], i.e.
RDD2.collect = Array(1, 3, 5, ...)
what is the best way to extract the values from RDD1 whose indices occur in RDD2? I.e. the output should be
Array(((x1,y1,z1),1), ((x3,y3,z3),3), ((x5,y5,z5),5), ...)
Both RDD1 and RDD2 are large enough that I would like to avoid using .collect. Otherwise, the problem is simply finding the intersecting elements of two Scala arrays/lists.
thank you so much for your help!
There is a join function on PairRDD, which is what you want to use here.
// coming in, we have:
// rdd1: RDD[((Float, Float, Float), Long)]
// rdd2: RDD[Long]
val joinReadyRDD1 = rdd1.map { case (values, key) => (key, values) }
val joinReadyRDD2 = rdd2.map { key => (key, ()) }
val joined = joinReadyRDD1.join(joinReadyRDD2).mapValues(_._1)
This returns an RDD[(Long, (Float, Float, Float))] where the Long keys appeared in rdd2.
A side note: if you have a conceptual "key" and "value", put the key first. Take a look at PairRDDFunctions -- it's quite a rich API and it all uses RDD[(Key, Value)].
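A hypothetical end-to-end illustration with tiny made-up example data (names and values are not from the question):
val exampleRDD1 = sc.parallelize(Seq(((1f, 1f, 1f), 1L), ((2f, 2f, 2f), 2L), ((3f, 3f, 3f), 3L)))
val exampleRDD2 = sc.parallelize(Seq(1L, 3L))
exampleRDD1.map { case (values, key) => (key, values) }
  .join(exampleRDD2.map(key => (key, ())))
  .mapValues(_._1)
  .collect()   // Array((1,(1.0,1.0,1.0)), (3,(3.0,3.0,3.0))), order may vary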

How to unpack a map/list in scala to tuples for a variadic function?

I'm trying to create a PairRDD in Spark. For that I need a Tuple2 RDD, like RDD[(String, String)]. However, I have an RDD[Map[String, String]].
I can't work out how to get rid of the iterable so I'm just left with RDD[(String, String)] rather than e.g. RDD[List[(String, String)]].
A simple demo of what I'm trying to make work is this broken code:
val lines = sparkContext.textFile("data.txt")
val pairs = lines.map(s => Map(s -> 1))
val counts = pairs.reduceByKey((a, b) => a + b)
The last line doesn't work because pairs is an RDD[Map[String, Int]] when it needs to be an RDD[(String, Int)].
So how can I get rid of the iterable in pairs above to convert the Map to just a tuple2?
You can actually just run:
val counts = pairs.flatMap(identity).reduceByKey(_ + _)
Note the use of the identity function, which replicates the functionality of flatten on an RDD, and the nifty underscore notation in reduceByKey() for conciseness.
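This works because a Scala Map is an Iterable of (key, value) pairs, so flattening a collection of Maps yields exactly the tuples reduceByKey needs. The same idea shown locally with made-up data and plain Seq.flatten:
Seq(Map("a" -> 1), Map("a" -> 2, "b" -> 3)).flatten
// List((a,1), (a,2), (b,3))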

Scala Spark - Reduce RDD by adding multiple values per key

I have a Spark RDD that is in the format of (String, (Int, Int)) and I would like to add the Int values together to end up with (String, Int) pairs.
This is an example of an element in my RDD:
res13: (String, (Int, Int)) = (9D4669B432A0FD,(1,1))
I would like to end with an RDD of (String, Int) = (9D4669B432A0FD,2)
You should just map each element to its key and the sum of the inner pair:
yourRdd.map(pair => (pair._1, pair._2._1 + pair._2._2))
#marios suggested the following nicer syntax in an edit:
Or if you want to make it a bit more readable:
yourRdd.map{case(str, (x1,x2)) => (str, x1+x2)}
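As a quick local sanity check (not part of the answer), the same pattern applied to the single element from the question:
val pair = ("9D4669B432A0FD", (1, 1))
pair match { case (str, (x1, x2)) => (str, x1 + x2) }   // (9D4669B432A0FD,2)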
Gabor Bakos's answer is correct if the keys are unique. But if you have multiple identical keys and want to reduce them down to unique keys, use reduceByKey.
Example:
val data = Array(("9888wq",(1,2)),("abcd",(1,1)),("abcd",(3,2)),("9888wq",(4,2)))
val rdd= sc.parallelize(data)
val result = rdd.map(x => (x._1,(x._2._1+x._2._2))).reduceByKey((x,y) => x+y)
result.foreach(println)
Output:
(9888wq,9)
(abcd,7)

Summing items within a Tuple

Below is a data structure, a List of tuples of type List[(String, String, Int)]:
val data3 = (List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1)) )
//> data3 : List[(String, String, Int)] = List((id1,a,1), (id1,a,1), (id1,a,1),
//| (id2,a,1))
I'm attempting to sum up the Int values associated with each id. So the above data structure should be converted to List((id1,a,3), (id2,a,1)).
This is what I have come up with, but I'm unsure how to group similar items within a Tuple:
data3.map( { case (id,name,num) => (id , name , num + 1)})
//> res0: List[(String, String, Int)] = List((id1,a,2), (id1,a,2), (id1,a,2), (i
//| d2,a,2))
In practice data3 is a Spark RDD; I'm using a List in this example for local testing purposes, but the same solution should be compatible with an RDD.
Update: based on the following code provided by maasg:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
I needed to amend it slightly to get it into the format I expect, which is of type
RDD[(String, Seq[(String, Int)])]
which corresponds to RDD[(id, Seq[(name, count-of-names)])]:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => ((id1),(id2,values.sum))}
val counted = result.groupByKey
In Spark, you would do something like this: (using Spark Shell to illustrate)
val l = List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1))
val rdd = sc.parallelize(l)
val grouped = rdd.groupBy{case (id1,id2,v) => (id1,id2)}
val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
Another option would be to map the rdd into a PairRDD and use groupByKey:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
Option 2 is a slightly better option when handling large sets, as it does not replicate the ids in the accumulated value.
This seems to work when I use scala-ide:
data3
  .groupBy(tupl => (tupl._1, tupl._2))
  .mapValues(v => (v.head._1, v.head._2, v.map(_._3).sum))
  .values.toList
And the result is the same as required by the question
res0: List[(String, String, Int)] = List((id1,a,3), (id2,a,1))
You should look into List.groupBy.
You can use the id as the key, and then use the length of your values in the map (i.e. all the items sharing the same id) to know the count.
#vptheron has the right idea.
As can be seen in the docs
def groupBy[K](f: (A) ⇒ K): Map[K, List[A]]
Partitions this list into a map of lists according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will always force the view and return a new list.
K the type of keys returned by the discriminator function.
f the discriminator function.
returns
A map from keys to lists such that the following invariant holds:
(xs groupBy f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a list of those elements x for which f(x) equals k.
So something like the function below, when used with groupBy, will give you a map with the ids as keys.
(Sorry, I don't have access to a Scala compiler, so I can't test.)
def f(tuple: (String, String, Int)): String = tuple._1
Then you will have to iterate through the List for each id in the Map and sum up the number of integer occurrences. That is straightforward, but if you still need help, ask in the comments.
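Put together, a sketch of that groupBy-then-sum approach on the data3 list from the question (assuming, as in the example, that each id always carries the same name):
data3
  .groupBy(_._1)                                                       // Map[String, List[(String, String, Int)]]
  .map { case (id, items) => (id, items.head._2, items.map(_._3).sum) }
  .toList
// List((id1,a,3), (id2,a,1))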
The following is the most readable, efficient and scalable:
data.map {
  case (key1, key2, value) => ((key1, key2), value)
}
.reduceByKey(_ + _)
which will give an RDD[((String, String), Int)]. Using reduceByKey means the summation will parallelize, i.e. for very large groups it will be distributed and the summation will happen on the map side. Think about the case where there are only 10 groups but billions of records; using .sum won't scale, as it will only be able to distribute to 10 cores.
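If the flat (id, name, sum) shape from the question is needed, one more map (a straightforward extension, not part of the original answer) unpacks the key afterwards:
data.map { case (key1, key2, value) => ((key1, key2), value) }
  .reduceByKey(_ + _)
  .map { case ((key1, key2), sum) => (key1, key2, sum) }   // RDD[(String, String, Int)]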
A few more notes about the other answers:
Using head here is unnecessary: instead of .mapValues(v => (v.head._1, v.head._2, v.map(_._3).sum)) you can map over the grouped entries and take those fields from the key, e.g. .map { case ((k1, k2), v) => (k1, k2, v.map(_._3).sum) }
Using a foldLeft here is really horrible when the above shows that .map(_._3).sum will do: val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}