Scala MaxBy's Tuple - scala

I have a Seq of tuples, each representing a word count: (count, word)
For Example:
(5, "Hello")
(3, "World")
My goal is to find the word with the highest count. In case of a tie between two words, I'll pick the word that appears first in the dictionary (i.e. alphabetical order).
val wordCounts = Seq(
  (10, "World"),
  (5, "Something"),
  (10, "Hello")
)
val commonWord = wordCounts.maxBy(_._1)
print(commonWord)
Now, this code segment will return (10, "World"), because this is the first tuple that has the maximum count.
I could use .sortBy and then .head, but I want to be more efficient.
My question: is there any way to change the Ordering used by maxBy in order to achieve the desired outcome?
Note: I prefer not to use .sortBy, because it's O(n*log(n)) rather than O(n). I know I can use .reduce, but I want to check whether I can adjust .maxBy.
Scala Version 2.13

Functions like max, min, maxBy and minBy use an implicit Ordering that defines the comparison between two items. There's a default Ordering for Tuple2, but the problem is that it applies its comparison in the same direction to both elements, while in your case you need greater-than for _._1 and less-than for _._2. You can easily work around this by negating the first element, so this does the trick:
wordCounts.minBy(x => (-x._1, x._2))
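With the wordCounts from the question, this picks the desired tuple:
wordCounts.minBy(x => (-x._1, x._2)) // (10,Hello): the negated counts tie at -10, then "Hello" sorts before "World"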

You can create your own Ordering by using orElse() to combine two Orderings together:
// can't use .orElseBy() because of the .reverse, so this is a bit verbose
val countThenAlphaOrdering =
  Ordering.by[(Int, String), Int](_._1)
    .orElse(Ordering.by[(Int, String), String](_._2).reverse)
Or you can use Ordering.Tuple2 in this case:
val countThenAlphaOrdering = Ordering.Tuple2(Ordering[Int], Ordering[String].reverse)
Then
val wordCounts = Seq(
  (10, "World"),
  (5, "Something"),
  (10, "Hello")
)
wordCounts.max(countThenAlphaOrdering) // (10,Hello): (Int, String)

implicit val WordSorter: Ordering[(Int, String)] = new Ordering[(Int, String)] {
  override def compare(
    x: (Int, String),
    y: (Int, String)
  ) = {
    val iComp = implicitly[Ordering[Int]].compare(x._1, y._1)
    if (iComp == 0)
      -implicitly[Ordering[String]].compare(x._2, y._2)
    else
      iComp
  }
}
val seq = Seq(
  (10, "World"),
  (5, "Something"),
  (10, "Hello")
)
def main(args: Array[String]): Unit = println(seq.max)
You can create your own Ordering[(Int, String)] implicit whose compare method returns the comparison of the numbers in the tuples if it's not zero, and the negated comparison of the strings if the Int comparison is zero. It uses the implicitly defined Ordering[Int] and Ordering[String] for modularity, in case you want to change the behaviour later on. If you don't want to use those, you can just replace
implicitly[Ordering[Int]].compare(x._1, y._1) with x._1.compareTo(y._1) and so on.
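For reference, a minimal sketch of that simplified variant (same logic, just compareTo in place of the implicit orderings):
implicit val WordSorterSimple: Ordering[(Int, String)] = new Ordering[(Int, String)] {
  override def compare(x: (Int, String), y: (Int, String)): Int = {
    val iComp = x._1.compareTo(y._1)       // compare the counts directly
    if (iComp == 0) -x._2.compareTo(y._2)  // on a tie, reverse the alphabetical comparison
    else iComp
  }
}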

Related

Scala getting average of the result

Hi, I am trying to calculate the average running time per genre from this TSV data set of movies:
running time    Genre
1               Documentary,Short
5               Animation,Short
4               Animation,Comedy,Romance
Animation is one type of genre, and the same goes for Short, Comedy and Romance.
I'm new to Scala and I'm confused about how to get an average per genre using Scala without any immutable functions.
I tried the snippet below just to attempt some sort of iteration and collect the runtimes per genre:
val a = list.foldLeft(Map[String, (Int)]()) {
  case (map, arr) => {
    map + (arr.genres.toString -> (arr.runtimeMinutes))
  }
}
Is there any way to calculate the average
Assuming the data was already parsed into something like:
final case class Row(runningTime: Int, genres: List[String])
Then you can follow a declarative approach to compute your desired result:
1. Flatten the List[Row] into a list of pairs, where the first element is a genre and the second element is a running time.
2. Collect all running times for the same genre.
3. Reduce each group to compute its average.
def computeAverageRunningTimePerGenre(data: List[Row]): Map[String, Double] =
  data.flatMap {
    case Row(runningTime, genres) =>
      genres.map(genre => genre -> runningTime)
  }.groupMap(_._1)(_._2).view.mapValues { runningTimes =>
    runningTimes.sum.toDouble / runningTimes.size.toDouble
  }.toMap
Note: There are ways to make this faster, but IMHO it's better to start with the most readable alternative first and then refactor to something more performant if needed.
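For what it's worth, here is one possible single-pass variant (a sketch, not part of the original answer) that accumulates a running sum and count per genre with a single foldLeft:
def computeAverageRunningTimePerGenreSinglePass(data: List[Row]): Map[String, Double] = {
  // one pass over the rows, keeping (sum, count) per genre
  val totals = data.foldLeft(Map.empty[String, (Int, Int)]) { (acc, row) =>
    row.genres.foldLeft(acc) { (m, genre) =>
      val (sum, count) = m.getOrElse(genre, (0, 0))
      m.updated(genre, (sum + row.runningTime, count + 1))
    }
  }
  totals.view.mapValues { case (sum, count) => sum.toDouble / count }.toMap
}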
I tried to break it down as follows:
I modeled your data as a List[(Int, List[String])]:
val data: List[(Int, List[String])] = List(
  (1, List("Documentary", "Short")),
  (5, List("Animation", "Short")),
  (4, List("Animation", "Comedy", "Romance"))
)
I wrote a function to spread the runtime value across each genre so that I have a value for each one:
val spread: ((Int, List[String])) => List[(Int, String)] = t => t._2.map((t._1, _))
// now, if I pass it a tuple, I get:
// spread((23, List("one","two","three"))) == List((23,one), (23,two), (23,three))
So far, so good. Now I can use spread with flatMap to get a flat list of (runtime, genre) pairs:
val flatData = data.flatMap(spread)
flatData: List[(Int, String)] = List(
  (1, "Documentary"),
  (1, "Short"),
  (5, "Animation"),
  (5, "Short"),
  (4, "Animation"),
  (4, "Comedy"),
  (4, "Romance")
)
Now we can use groupBy to summarize by genre:
flatData.groupBy(_._2)
res26: Map[String, List[(Int, String)]] = HashMap(
  "Animation" -> List((5, "Animation"), (4, "Animation")),
  "Documentary" -> List((1, "Documentary")),
  "Comedy" -> List((4, "Comedy")),
  "Romance" -> List((4, "Romance")),
  "Short" -> List((1, "Short"), (5, "Short"))
)
Finally, I can get the results (it took me about 10 tries):
flatData.groupBy(_._2).map(t => (t._1, t._2.map(_._1).foldLeft(0)(_ + _) / t._2.size.toDouble))
res43: Map[String, Double] = HashMap(
  "Animation" -> 4.5,
  "Documentary" -> 1.0,
  "Comedy" -> 4.0,
  "Romance" -> 4.0,
  "Short" -> 3.0
)
The map() after the groupBy() is chunky, but now that I've got it, it's easi(er) to explain. Each entry in the groupBy result is (genre, List((runtime, genre))). So we just map each entry, use foldLeft to sum the runtimes in each list, and divide by the list's size to get the average. You should coerce the calculation to a Double, or you'll get integer division.
I think it would have been good to define a cleaner model for the data like Luis did. That would've made all the tuple notation less obscure. Hey, I am learning, too.
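For comparison, a quick sketch of the same pipeline written against Luis's Row model (just to show how the tuple accessors disappear; the result is the same):
final case class Row(runningTime: Int, genres: List[String])

val rows = data.map { case (time, genres) => Row(time, genres) }
rows.flatMap(r => r.genres.map(g => g -> r.runningTime))
  .groupBy(_._1)
  .map { case (genre, pairs) => genre -> pairs.map(_._2).sum.toDouble / pairs.size }
// Map(Animation -> 4.5, Documentary -> 1.0, Comedy -> 4.0, Romance -> 4.0, Short -> 3.0)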

creating pair RDD in spark using scala

I'm new to Spark, so I need to create an RDD where each value is a pair of just two elements.
Array1 = ((1,1),(1,2),(1,3),(2,1),(2,2),(2,3))
When I execute groupByKey the output is ((1,(1,2,3)),(2,(1,2,3)))
But I need the output to just have two-value pairs with the key. I'm not sure how to get it.
Expected Output = ((1,(1,2)),(1,(1,3)),(1,(2,3)),(2,(1,2)),(2,(1,3)),(2,(2,3)))
The values should only be printed once: there should only be (1,2) and not (2,1),
and likewise (2,3) but not (3,2).
Thanks
You can get the result you require as follows:
// Prior to doing the `groupBy`, you have an RDD[(Int, Int)], x, containing:
// (1,1),(1,2),(1,3),(2,1),(2,2),(2,3)
//
// Can simply map values as below. Result is an RDD[(Int, (Int, Int))].
val x: RDD[(Int, Int)] = sc.parallelize(Seq((1,1),(1,2),(1,3),(2,1),(2,2),(2,3)))
val y: RDD[(Int, (Int, Int))] = x.map(t => (t._1, t)) // Map first value in pair tuple to the tuple
y.collect // Get result as an array
// res0: Array[(Int, (Int, Int))] = Array((1,(1,1)), (1,(1,2)), (1,(1,3)), (2,(2,1)), (2,(2,2)), (2,(2,3)))
That is, the result is a pair RDD that relates the key (the first value of each pair) to the pair (as a tuple). Do not use groupBy, since in this case it will not give you what you want.
If I understand your requirement correctly, you can use groupByKey and flatMapValues to flatten the 2-combinations of the grouped values, as shown below:
val rdd = sc.parallelize(Seq(
  (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)
))
rdd.groupByKey.flatMapValues(_.toList.combinations(2)).
  map{ case (k, v) => (k, (v(0), v(1))) }.
  collect
// res1: Array[(Int, (Int, Int))] =
// Array((1,(1,2)), (1,(1,3)), (1,(2,3)), (2,(1,2)), (2,(1,3)), (2,(2,3)))
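For reference, the combinations step on its own (plain Scala, no Spark needed) yields each unordered pair exactly once, which is why (2,1) and (3,2) never show up:
List(1, 2, 3).combinations(2).toList
// List(List(1, 2), List(1, 3), List(2, 3))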

How to separate array or vector column into multiple columns?

Suppose I have a Spark Dataframe generated as:
val df = Seq(
  (Array(1, 2, 3), Array("a", "b", "c")),
  (Array(1, 2, 3), Array("a", "b", "c"))
).toDF("Col1", "Col2")
It's possible to extract elements at the first index in "Col1" with something like:
val extractFirstInt = udf { (x: Seq[Int], i: Int) => x(i) }
df.withColumn("Col1_1", extractFirstInt($"Col1", lit(1)))
And similarly for the second column "Col2" with e.g.
val extractFirstString = udf { (x: Seq[String], i: Int) => x(i) }
df.withColumn("Col2_1", extractFirstString($"Col2", lit(1)))
But the code duplication is a little ugly -- I need a separate UDF for each underlying element type.
Is there a way to write a generic UDF, that automatically infers the type of the underlying Array in the column of the Spark Dataset? E.g. I'd like to be able to write something like (pseudocode; with generic T)
val extractFirst = udf { (x: Seq[T], i: Int) => x(i) }
df.withColumn("Col1_1", extractFirst($"Col1", lit(1)))
Where somehow the type T would just be automagically inferred by Spark / the Scala compiler (perhaps using reflection if appropriate).
Bonus points if you're aware of a solution that works both with array-columns and Spark's own DenseVector / SparseVector types. The main thing I'd like to avoid (if at all possible) is the requirement of defining a separate UDF for each underlying array-element type I want to handle.
Perhaps frameless could be a solution?
Since manipulating datasets requires an Encoder for a given type, you have to define the type upfront so Spark SQL can create one for you. I think a Scala macro to generate all sorts of Encoder-supported types would make sense here.
As of now, I'd define a generic method and a UDF per type (which is against your wish to find a way to have "a generic UDF, that automatically infers the type of the underlying Array in the column of the Spark Dataset").
def myExtract[T](x: Seq[T], i: Int) = x(i)
// define UDF for extracting strings
val extractString = udf(myExtract[String] _)
Use as follows:
val df = Seq(
  (Array(1, 2, 3), Array("a", "b", "c")),
  (Array(1, 2, 3), Array("a", "b", "c"))
).toDF("Col1", "Col2")
scala> df.withColumn("Col1_1", extractString($"Col2", lit(1))).show
+---------+---------+------+
| Col1| Col2|Col1_1|
+---------+---------+------+
|[1, 2, 3]|[a, b, c]| b|
|[1, 2, 3]|[a, b, c]| b|
+---------+---------+------+
You could explore Dataset (not DataFrame, i.e. Dataset[Row]) instead. That would give you all the type machinery (and perhaps you could avoid any macro development).
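A minimal sketch of that idea; the Cols case class and the spark SparkSession value here are illustrative assumptions, not part of the original answer:
// assumes `spark` is the active SparkSession
import spark.implicits._

case class Cols(Col1: Seq[Int], Col2: Seq[String])

val ds = df.as[Cols]                              // typed view: element types are known to the compiler
val picked = ds.map(r => (r.Col1(1), r.Col2(1)))  // plain Scala code, no per-type UDFs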
As per advice from zero323, I settled on an implementation of the following form:
def extractFirst(df: DataFrame, column: String, into: String) = {
  // extract column of interest
  val col = df.apply(column)
  // figure out the type name for this column
  val schema = df.schema
  val typeName = schema.apply(schema.fieldIndex(column)).dataType.typeName
  // delegate based on column type
  typeName match {
    case "array" => df.withColumn(into, col.getItem(0))
    case "vector" => {
      // construct a udf to extract first element
      // (could almost certainly do better here,
      // but this demonstrates the strategy regardless)
      val extractor = udf {
        (x: Any) => {
          val el = x.getClass.getDeclaredMethod("toArray").invoke(x)
          val array = el.asInstanceOf[Array[Double]]
          array(0)
        }
      }
      df.withColumn(into, extractor(col))
    }
    case _ => throw new IllegalArgumentException("unexpected type '" + typeName + "'")
  }
}
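A rough usage example with the DataFrame from the question (the output column names are only illustrative):
val withFirsts = extractFirst(extractFirst(df, "Col1", "Col1_first"), "Col2", "Col2_first")
withFirsts.show()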

Sum of Values based on key in scala

I am new to Scala. I have a list of integer tuples:
val list = List((1,2,3),(2,3,4),(1,2,3))
val sum = list.groupBy(_._1).mapValues(_.map(_._2)).sum
val sum2 = list.groupBy(_._1).mapValues(_.map(_._3)).sum
How can I do this for N values? I tried the above, but it's not a good way to sum N values based on the key.
I have also tried:
val sum = list.groupBy(_._1).values.sum // error
val sum = list.groupBy(_._1).mapvalues(_.map(_._2).sum (_._3).sum) // error
It's easier to convert these tuples to List[Int] with shapeless and then work with them. Your tuples are actually more like lists anyways. Also, as a bonus, you don't need to change your code at all for lists of Tuple4, Tuple5, etc.
import shapeless._, syntax.std.tuple._
val list = List((1,2,3),(2,3,4),(1,2,3))
list.map(_.toList)                          // convert tuples to list
  .groupBy(_.head)                          // group by first element of list
  .mapValues(_.map(_.tail).map(_.sum).sum)  // sums elements of all tails
Result is Map(2 -> 7, 1 -> 10).
val sum = list.groupBy(_._1).map(i => (i._1, i._2.map(j => j._1 + j._2 + j._3).sum))
> sum: scala.collection.immutable.Map[Int,Int] = Map(2 -> 9, 1 -> 12)
Since a tuple can't be converted to a List in a type-safe way, you need to add the elements one by one, as in j._1 + j._2 + j._3.
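If you would rather leave the key out of the sum (giving Map(2 -> 7, 1 -> 10) like the other answers), just drop j._1:
val sumWithoutKey = list.groupBy(_._1).map(i => (i._1, i._2.map(j => j._2 + j._3).sum))
// Map(2 -> 7, 1 -> 10)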
Using the first element in the tuple as the key and the remaining elements as the values you need, you could do something like this:
val list = List((1,2,3),(2,3,4),(1,2,3))
list: List[(Int, Int, Int)] = List((1, 2, 3), (2, 3, 4), (1, 2, 3))
val sum = list.groupBy(_._1).map { case (k, v) => (k -> v.flatMap(_.productIterator.toList.drop(1).map(_.asInstanceOf[Int])).sum) }
sum: Map[Int, Int] = Map(2 -> 7, 1 -> 10)
I know it's a bit dirty to use asInstanceOf[Int], but when you call .productIterator you get an Iterator[Any].
This will work for any tuple size.

Calculation on consecutive array elements

I have this:
val myInput: ArrayBuffer[(String, String)] = ArrayBuffer(
  (a, timestampAStr),
  (b, timestampBStr),
  ...
)
I would like to calculate the duration between each two consecutive timestamps from myInput and retrieve those like the following:
val myOutput = ArrayBuffer(
  (a, durationFromTimestampAToTimestampB),
  (b, durationFromTimestampBToTimestampC),
  ...
)
This is a pairwise evaluation, which led me to think something with foldLeft() might do the trick, but after giving it a little more thought I could not come up with a solution.
I have already put something together with some for loops and .indices, but it does not seem as clean and concise as it could be. I would appreciate it if somebody had a better option.
You can use zip and sliding to achieve what you want. For example, if you have a collection
scala> List(2,3,5,7,11)
res8: List[Int] = List(2, 3, 5, 7, 11)
The list of differences is res8.sliding(2).map{case List(fst,snd)=>snd-fst}.toList, which you can zip with the original list.
scala> res8.zip(res8.sliding(2).map{case List(fst,snd)=>snd-fst}.toList)
res13: List[(Int, Int)] = List((2,1), (3,2), (5,2), (7,4))
You can zip your array with itself after dropping the first item (to match each item with the one that follows it), and then map to the calculated result:
import scala.collection.mutable.ArrayBuffer

val myInput: ArrayBuffer[(String, String)] = ArrayBuffer(
  ("a", "1000"),
  ("b", "1500"),
  ("c", "2500")
)
val result: ArrayBuffer[(String, Int)] = myInput.zip(myInput.drop(1)).map {
  case ((k1, v1), (k2, v2)) => (k1, v2.toInt - v1.toInt)
}
result.foreach(println)
// prints:
// (a,500)
// (b,1000)