How to sort by value in Spark Scala

I have key-value pairs and I need to return the top 10 elements by value in descending order. As you can see from my actual output below, it's giving me the top values by the key instead of the value (in this case, by ASCII character code).
For example:
//Input:
(the, 5),
(is, 10),
(me, 1)
//Expected Output:
(is, 10),
(the, 5),
(me, 1)
//Actual Output:
(the, 5),
(me, 1),
(is, 10)
My function:
def getActiveTaxis(taxiLines: RDD[Array[String]]): Array[(String, Int)] = {
  // Removing setup code for brevity
  val counts = keys.map(x => (x, 1))
  val sortedResult = counts.reduceByKey((a, b) => a + b).sortBy(_._2, false)
  sortedResult.top(10)
}

You should use the take() function instead of top().
take() returns the first N elements of the RDD in its existing order, whereas top() returns the largest N elements after re-sorting by the implicit Ordering[T]; for (String, Int) tuples that ordering compares the key (the String) first, which is why your output is ordered by ASCII code. Since sortBy already ordered the RDD by value, take(10) preserves that order; see the sketch below.
You can refer to the implementation of top() in the Spark source.
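A minimal sketch of the fixed function (the setup code stays elided as in the question, so keys is assumed to come from it):
def getActiveTaxis(taxiLines: RDD[Array[String]]): Array[(String, Int)] = {
  // Setup code elided for brevity, as in the question; `keys` comes from it
  val counts = keys.map(x => (x, 1))
  val sortedResult = counts.reduceByKey((a, b) => a + b).sortBy(_._2, ascending = false)
  sortedResult.take(10) // preserves the value-descending order from sortBy
}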

Related

Spark groupBy X then sortBy Y then get topK

case class Tomato(name:String, rank:Int)
case class Potato(..)
I have Spark 2.4 and a Dataset[(Tomato, Potato)] that I want to groupBy name and take the topK ranks from.
The issue is that groupBy produces an iterator, which is not sortable, and iterator.toList explodes on large datasets.
Iterator solution:
data.groupByKey { case (tomato, _) => tomato.name }
  .flatMapGroups((k, it) => it.toList.sortBy(_.rank).take(topK))
I've also tried aggregation functions, but I could not find a topK or firstK, only first and last.
Another thing I hate about aggregation functions is that they convert the dataset to a dataframe (yuck), so all the types are gone.
Aggregation fn solution, in syntax made up by me:
data.agg(row_number.over(Window.partitionBy("_1.name").orderBy("_1.rank")).take(topK))
There are already several questions on SO that ask for groupBy followed by some other operation, but none want to sort by a key different from the groupBy key and then take the top K.
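For reference, the window-function idea in that made-up syntax can be written for real, though it goes through a DataFrame and drops the static types, exactly what the question laments. A rough sketch, assuming data is a Dataset[(Tomato, Potato)] so its columns are _1 and _2:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical working counterpart of the made-up agg syntax above
val w = Window.partitionBy(col("_1.name")).orderBy(col("_1.rank"))
val topKRows = data
  .withColumn("rn", row_number().over(w)) // rank rows within each name group
  .where(col("rn") <= topK)               // keep the first topK rows per group
  .drop("rn")                             // note: topKRows is a DataFrame, types are gone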
You could go the iterator route without having to create a full list which indeed explodes with big datasets. Something like:
import spark.implicits._

case class Tomato(name: String, rank: Int)
case class Potato(taste: String)
case class MyClass(tomato: Tomato, potato: Potato)

val ordering = Ordering.by[MyClass, Int](_.tomato.rank)

val ds = Seq(
  MyClass(Tomato("tomato1", 1), Potato("tasty")),
  MyClass(Tomato("tomato1", 2), Potato("tastier")),
  MyClass(Tomato("tomato2", 2), Potato("tastiest")),
  MyClass(Tomato("tomato3", 2), Potato("yum")),
  MyClass(Tomato("tomato3", 4), Potato("yummier")),
  MyClass(Tomato("tomato3", 50), Potato("yummiest")),
  MyClass(Tomato("tomato7", 50), Potato("yam"))
).toDS

val k = 2

val output = ds
  .groupByKey {
    case MyClass(tomato, potato) => tomato.name
  }
  .mapGroups { (name, iterator) =>
    // Keep at most k elements at any time: once the buffer exceeds k,
    // sort ascending by rank and drop the smallest element.
    val topK = iterator.foldLeft(Seq.empty[MyClass]) { (accumulator, element) =>
      val newAccumulator = accumulator :+ element
      if (newAccumulator.length > k)
        newAccumulator.sorted(ordering).drop(1)
      else
        newAccumulator
    }
    (name, topK)
  }
output.show(false)
+-------+--------------------------------------------------------+
|_1 |_2 |
+-------+--------------------------------------------------------+
|tomato7|[[[tomato7, 50], [yam]]] |
|tomato2|[[[tomato2, 2], [tastiest]]] |
|tomato1|[[[tomato1, 1], [tasty]], [[tomato1, 2], [tastier]]] |
|tomato3|[[[tomato3, 4], [yummier]], [[tomato3, 50], [yummiest]]]|
+-------+--------------------------------------------------------+
So as you can see, for each Tomato.name key we keep the k elements with the largest Tomato.rank values. The result is a Dataset[(String, Seq[MyClass])].
This is not really optimized for performance: for each group, we iterate over all of its elements and re-sort the k-element buffer at every step past k, which can become quite intensive computationally. But this all depends on the size of your actual case classes, the size of your data, your requirements, ...
Hope this helps!
"The issue is that groupBy produces an iterator, which is not sortable, and iterator.toList explodes on large datasets."
What you could do is come up with a topK() method that takes parameters k, an Iterator[A], and an A => B mapping, and returns an Iterator[A] of the top k elements (ordered by their value of type B) -- all without having to sort the entire iterator:
def topK[A, B : Ordering](k: Int, iter: Iterator[A], f: A => B): Iterator[A] = {
  val orderer = implicitly[Ordering[B]]
  import orderer._
  // Seed with the first k elements, kept sorted descending by f,
  // so the head is always the largest element, i.e. the eviction candidate.
  val listK = iter.take(k).toList
  iter.foldLeft(listK.sortWith(f(_) > f(_))) { (lsK, x) =>
    if (f(x) < f(lsK.head))
      (x :: lsK.tail).sortWith(f(_) > f(_)) // replace the head and re-sort
    else
      lsK
  }.reverse.iterator // ascending by f
}
Note that topK() only involves iterative sorting of lists of size k, under the assumption that k is small compared with the size of the input iterator. If necessary, it could be optimized further to eliminate the re-sorting of the k-element list by only keeping track of its largest element while leaving the rest of the list unsorted; see the heap-based sketch below.
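If you do want that optimization, one hedged sketch (my own variant, not part of the answer above) is to replace the sorted list with a bounded max-heap via scala.collection.mutable.PriorityQueue, so each eviction costs only O(log k):
import scala.collection.mutable

// Hypothetical heap-based variant of topK(): keeps the k smallest elements
// by f without re-sorting a k-element list on each replacement.
def topKHeap[A, B: Ordering](k: Int, iter: Iterator[A], f: A => B): Iterator[A] = {
  // Max-heap on f(a): the head is the largest kept element, i.e. the eviction candidate.
  val heap = mutable.PriorityQueue.empty[A](Ordering.by(f))
  iter.foreach { x =>
    heap.enqueue(x)
    if (heap.size > k) heap.dequeue() // evict the current largest
  }
  heap.dequeueAll.reverse.iterator // ascending by f, same order as topK()
}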
Using your groupByKey approach, the topK() method can be plugged into flatMapGroups as shown below:
case class T(name: String, rank: Int)
case class P(name: String, rank: Int)

val ds = Seq(
  (T("t1", 4), P("p1", 1)),
  (T("t1", 5), P("p2", 2)),
  (T("t1", 1), P("p3", 3)),
  (T("t1", 3), P("p4", 4)),
  (T("t1", 2), P("p5", 5)),
  (T("t2", 4), P("p6", 6)),
  (T("t2", 2), P("p7", 7)),
  (T("t2", 6), P("p8", 8))
).toDF("tomato", "potato").as[(T, P)]

val k = 3

ds.
  groupByKey { case (tomato, _) => tomato.name }.
  flatMapGroups((_, it) => topK[(T, P), Int](k, it, { case (t, p) => t.rank })).
  show
/*
+-------+-------+
| _1| _2|
+-------+-------+
|{t1, 1}|{p3, 3}|
|{t1, 2}|{p5, 5}|
|{t1, 3}|{p4, 4}|
|{t2, 2}|{p7, 7}|
|{t2, 4}|{p6, 6}|
|{t2, 6}|{p8, 8}|
+-------+-------+
*/

reduceByKey RDD Spark Scala

I have an RDD[(String, String, Int)] and want to use reduceByKey to get the result shown below. I don't want to convert it to a DataFrame and then perform a groupBy operation to get the result. The Int is a constant, always with value 1.
Is it possible to use reduceByKey here to get the result? Presenting it in tabular format for easy reading:
Question:

String    String         Int
First     Apple          1
Second    Banana         1
First     Flower         1
Third     Tree           1

Result:

String    String         Int
First     Apple,Flower   2
Second    Banana         1
Third     Tree           1
You cannot use reduceByKey on a Tuple3; you could use reduceByKey if you had an RDD[(String, String)].
Also, once you groupBy, you could then apply reduceByKey, but since the keys are already unique at that point there is nothing to reduce, so we use map to transform the values one to one instead.
So, assuming df is your main table, this piece of code:
val rest = df.groupBy(x => x._1).map(x => {
  val key = x._1          // ex: First
  val groupedData = x._2  // ex: [(First, Apple, 1), (First, Flower, 1)]
  // ex: [(First, Apple, 1), (First, Flower, 1)] => [Apple, Flower] => Apple,Flower
  val concat = groupedData.map(d => d._2).mkString(",")
  // ex: [(First, Apple, 1), (First, Flower, 1)] => [1, 1] => 2
  val sum = groupedData.map(d => d._3).sum
  (key, concat, sum)      // return a Tuple3 again, same format
})
Returns this result:
(Second,Banana,1)
(First,Apple,Flower,2)
(Third,Tree,1)
EDIT
Implementing reduceByKey without the sum: if your dataset looks like this:
val data = List(
  ("First", "Apple"),
  ("Second", "Banana"),
  ("First", "Flower"),
  ("Third", "Tree")
)
val df: RDD[(String, String)] = sparkSession.sparkContext.parallelize(data)
Then, this:
df.reduceByKey((acc, next) => acc + "," + next)
Gives this:
(First,Apple,Flower)
(Second,Banana)
(Third,Tree)
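And if you also want to keep the summed count from the original Tuple3, here is a hedged sketch (an extension of the EDIT above, reusing the same sparkSession): map the Tuple3 into a pair RDD of (key, (value, count)), reduce both fields at once, then map back:
import org.apache.spark.rdd.RDD

val triples: RDD[(String, String, Int)] = sparkSession.sparkContext.parallelize(List(
  ("First", "Apple", 1),
  ("Second", "Banana", 1),
  ("First", "Flower", 1),
  ("Third", "Tree", 1)
))

val combined: RDD[(String, String, Int)] = triples
  .map { case (k, v, n) => (k, (v, n)) }                                 // Tuple3 => pair RDD
  .reduceByKey { case ((v1, n1), (v2, n2)) => (v1 + "," + v2, n1 + n2) } // concat values, sum counts
  .map { case (k, (v, n)) => (k, v, n) }                                 // back to a Tuple3

// Gives: (First,Apple,Flower,2), (Second,Banana,1), (Third,Tree,1)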
Good luck!

Scala - create a new list and update particular element from existing list

I am new to Scala and new to OOP too. How can I update a particular element in a list while creating a new list?
val numbers = List(1, 2, 3, 4, 5)
val result = numbers.map(_ * 2)
I need to update only the third element -> multiply it by 2. How can I do that using map?
You can use zipWithIndex to map the list into a list of tuples, where each element is accompanied by its index. Then, using map with pattern matching, you single out the third element (index = 2):
val numbers = List(1, 2, 3, 4, 5)
val result = numbers.zipWithIndex.map {
  case (v, i) if i == 2 => v * 2
  case (v, _) => v
}
// result: List[Int] = List(1, 2, 6, 4, 5)
Alternatively, you can use patch, which replaces a sub-sequence with a provided one:
numbers.patch(from = 2, patch = Seq(numbers(2) * 2), replaced = 1)
I think the clearest way of achieving this is by using updated(index: Int, elem: Int). For your example, it could be applied as follows:
val result = numbers.updated(2, numbers(2) * 2)
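For numbers = List(1, 2, 3, 4, 5), this yields List(1, 2, 6, 4, 5), the same result as the map-based approaches above.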
list.zipWithIndex creates a list of pairs, with the original element on the left and its index in the list on the right (indices are 0-based, so the "third element" is at index 2).
val result = numbers.zipWithIndex.map {
  case (n, 2) => n * 2
  case (n, _) => n
}
This creates an intermediate list holding the pairs and then maps through it to do your transformation. A slightly more efficient approach is to use an iterator. Iterators are 'lazy', so rather than creating an intermediate container, it generates the pairs one by one and sends them straight to .map:
val result = numbers.iterator.zipWithIndex.map {
  case (n, 2) => n * 2
  case (n, _) => n
}.toList
First and foremost, Scala is a functional language as much as an OOP one. You can replace any element of a list with the updated method (it is a method, not a keyword); see the following example for details:
Signature: updated(index, value)
val numbers = List(1, 2, 3, 4, 5)
print(numbers.updated(2, 10))
Here the 1st argument is the index and the 2nd argument is the new value. Rather than modifying the original list, this code prints a new one:
List(1, 2, 10, 4, 5)

value map is not a member of Int

I have the following simple code. What I want to do is generate, for a given list of (Char, Int) tuples, another list as explained in the example:
For List((a,2),(b,1)), I want to get the list List(List((a,1)), List((a,2)), List((b,1))).
val abba = List(('a', 2), ('b', 2))
abba.map(elt=> for(t<-elt._2) yield (elt,t))
I tested my approach with the snippet of code above, but I got the following error:
Error:(72, 31) value map is not a member of Int
abba.map(elt=> for(t<-elt._2) yield (elt,t))
Any hints on how to solve this problem?
If I interpret your requirement correctly:
"For List((a,2),(b,1)), I want to have a list List(List((a,1)), List((a,2)), List((b,1)))"
you can generate your list-of-lists using flatMap and for/yield with input 1 to elt._2:
val abba = List(('a', 2), ('b', 1))
abba.flatMap( elt => for(t <- 1 to elt._2) yield List((elt._1, t)) )
// res1: List[List[(Char, Int)]] = List(List((a,1)), List((a,2)), List((b,1)))
Or, you can use a case partial function, as below:
abba.flatMap{ case (c, i) => for (j <- 1 to i) yield List((c, j)) }
// res2: List[List[(Char, Int)]] = List(List((a,1)), List((a,2)), List((b,1)))
for(t<-elt._2) yield (elt,t) is equivalent to elt._2.map(t => (elt, t)). elt._2 is the second element of the current tuple, so it's an integer. Integers do not have a map method, so the above can't work.
Looking at your expected output, it looks like you want to iterate t from 1 to elt._2. You can do that using the to method:
for(t <- 1 to elt._2) yield (elt._1, t)
It also looks like you want the first element of the original tuple to be the first element of the new tuples, so I added a ._1 as well.
As a style note, it might be a bit cleaner to use pattern matching on the tuple instead of ._1 and ._2.
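Putting those pieces together in that style (a sketch; note the expected output wraps each pair in its own single-element list, hence flatMap plus the List(...) wrapper):
val abba = List(('a', 2), ('b', 1))
val result = abba.flatMap { case (c, n) =>
  for (t <- 1 to n) yield List((c, t)) // one single-element list per generated pair
}
// result: List(List((a,1)), List((a,2)), List((b,1)))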

Apache Spark Scala : How to maintain order of values while grouping rdd by key

Maybe I am asking a very basic question, apologies for that, but I didn't find the answer on the internet. I have a paired RDD and want to use something like aggregateByKey to concatenate all the values by key. The value that occurs first in the input RDD should come first in the aggregated RDD.
Input RDD[(Int, Int)]
2 20
1 10
2 8
2 25
Output RDD (Aggregated RDD)
2 20 8 25
1 10
I tried aggregateByKey and groupByKey; both give me output, but the order of values is not maintained. So please suggest something for this.
Since groupByKey and aggregateByKey indeed cannot preserve order, you'll have to artificially add a "hint" to each record so that you can order by that hint yourself after the grouping:
import org.apache.spark.rdd.RDD

val input = sc.parallelize(Seq((2, 20), (1, 10), (2, 8), (2, 25)))

val withIndex: RDD[(Int, (Long, Int))] = input
  .zipWithIndex() // adds an index to each record, will be used to order the result
  .map { case ((k, v), i) => (k, (i, v)) } // restructure into (key, (index, value))

val result: RDD[(Int, List[Int])] = withIndex
  .groupByKey()
  .map { case (k, it) => (k, it.toList.sortBy(_._1).map(_._2)) } // order values and remove index
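Collecting the result reproduces the order the question asks for within each key (the ordering of the keys themselves across the RDD is not guaranteed):
result.collect().foreach(println)
// (1,List(10))
// (2,List(20, 8, 25))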