How to find the per-key maximum in a Scala collection - scala

Let us consider that I have a collection of employees as a List of tuples, where t._1 is the department id, t._2 is the salary, and t._3 is the employee's name:
val employees = List((1, 8000, "Sally"), (1, 9999, "Tom"), (2, 5000, "Pam"), (4, 500, "NK"), (4, 999, "Robert"))
Expected result: ((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
I am trying the following, but I'm getting an error:
val maxSal1 = employees.map(t => (t._1, (t._2, t._3))).groupBy(a => a._1).map(k => {
  k._2.foldLeft(0, "dummy")((aa, bb) => {
    if (aa._1 > bb._1) aa else bb
  })
})

Don't overcomplicate things: avoid unnecessary operations and don't carry redundant information around. Just be explicit and spell out the transformations you need at each step. Simplicity is your friend.
employees.groupBy(_._1).values.map(_.maxBy(_._2))

scala> List((1, 8000, "Sally"),(1, 9999, "Tom"), (2, 5000, "Pam"), (4, 500, "NK"), (4, 999, "Robert")).groupBy {
| case (dept, salary, employee) => dept
| }
res6: scala.collection.immutable.Map[Int,List[(Int, Int, String)]] = Map(2 -> List((2,5000,Pam)), 4 -> List((4,500,NK), (4,999,Robert)), 1 -> List((1,8000,Sally), (1,9999,Tom)))
scala> res6.map {
| case (dept, employees) => employees.maxBy(_._2)
| }
res7: scala.collection.immutable.Iterable[(Int, Int, String)] = List((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
But note that maxBy is partial: it throws on an empty collection:
scala> List[Int]().maxBy(x => x)
java.lang.UnsupportedOperationException: empty.maxBy
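(On Scala 2.13 and later there is also maxByOption, which avoids the exception by returning an Option; a quick illustrative sketch:)
List.empty[Int].maxByOption(identity)   // None instead of an exception
List(1, 3, 2).maxByOption(identity)     // Some(3)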
As a side note, I'd use a case class Employee with 3 fields rather than a tuple. I believe it's more readable.
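For reference, a minimal sketch of that case-class variant (the field names here are my own choice, not from the question):
case class Employee(dept: Int, salary: Int, name: String)

val staff = List(
  Employee(1, 8000, "Sally"), Employee(1, 9999, "Tom"),
  Employee(2, 5000, "Pam"), Employee(4, 500, "NK"), Employee(4, 999, "Robert"))

// highest-paid employee per department
val topPaid = staff.groupBy(_.dept).values.map(_.maxBy(_.salary))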

I tried this option and it seems to give the right result:
val maxsal1 = employees.groupBy(_._1).values.map(t => t.foldLeft((0, 1, "dummy"))((aa, bb) => {
  if (aa._2 > bb._2) aa else bb
}))
Output: List((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
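If you want to avoid carrying the (0, 1, "dummy") seed, a reduce over each group does the same job; a small sketch along the same lines:
val maxSal = employees
  .groupBy(_._1)
  .values
  .map(_.reduceLeft((a, b) => if (a._2 > b._2) a else b))  // groups produced by groupBy are never empty, so reduceLeft is safe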

Related

Scala flatten a map with list as key and string as value

I have a peculiar case where I want to declare a simple configuration like so:
val config = List((("a", "b", "c"), ("first")),
                  (("d", "e"), ("second")),
                  (("f"), ("third")))
which, at run time, I would like to turn into a map that maps
"a" -> "first"
"b" -> "first"
"c" -> "first"
"d" -> "second"
"e" -> "second"
"f" -> "third"
Using toMap, I was able to convert the config to a Map
scala> config.toMap
res42: scala.collection.immutable.Map[java.io.Serializable,String] = Map((a,b,c) -> first, (d,e) -> second, f -> third)
But I am not able to figure out how to flatten the grouped keys into individual keys to get the final desired form. How do I solve this?
If you structure your config using List the code is very simple:
val config = List(
  (List("a", "b", "c"), "first"),
  (List("d", "e"), "second"),
  (List("f"), "third"))
config.flatMap{ case (k, v) => k.map(_ -> v) }.toMap
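For what it's worth, a quick check of what this produces (ignoring Map ordering):
val flat = config.flatMap { case (k, v) => k.map(_ -> v) }.toMap
assert(flat == Map("a" -> "first", "b" -> "first", "c" -> "first",
                   "d" -> "second", "e" -> "second", "f" -> "third"))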
You can try the solution below:
val config = List(
  (("a", "b", "c"), ("first")),
  (("d", "e"), ("second")),
  (("f"), ("third")))
val result = config.map {
  case (k, v) =>
    (k.toString().replace(")", "")
      .replace("(", "")
      .split(","), v)
}
val res = result.map {
  case (key, value) => key.map { data =>
    (data, value)
  }.toList
}.flatten.toMap
If you change the config structure to something like the one below, the solution is much simpler:
val config1 = List(
  (List("a", "b", "c"), "first"),
  (List("d", "e"), "second"),
  (List("f"), "third")
)
config1.flatMap {
  case (k, v) => k.map { data => (data, v) }
}.toMap
I think the above answers are good practical answers. If you're in a situation where you have no control over the input and you're stuck with Tuples instead of Lists, I'd do it this way:
val result: Map[String, String] = config.flatMap {
  case (s: String, v)   => List(s -> v)
  case (ks: Product, v) => ks.productIterator.collect { case s: String => s -> v }
  case _                => Nil // prevent throwing on unexpected shapes
}.toMap
This will throw away anything that's not a String in the keys.
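As an illustrative check against the original tuple-based config, the result comes out the same as with the List-based versions:
assert(result == Map("a" -> "first", "b" -> "first", "c" -> "first",
                     "d" -> "second", "e" -> "second", "f" -> "third"))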
Using built-in Spark SQL functions:
import org.apache.spark.sql.functions
import spark.implicits._ // assumes an implicit SparkSession named `spark`

val config = List((Array("a", "b", "c"), ("first")),
                  (Array("d", "e"), ("second")),
                  (Array("f"), ("third"))).toDF(List("col1", "col2"): _*)
config.withColumn("exploded", functions.explode_outer($"col1")).drop("col1").show()
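If you then need a plain Scala Map on the driver, one way (a sketch; it assumes the column names used above and a config small enough to collect) is to collect and pair up the columns:
val asMap: Map[String, String] = config
  .withColumn("exploded", functions.explode_outer($"col1"))
  .drop("col1")
  .collect()
  .map(row => row.getAs[String]("exploded") -> row.getAs[String]("col2"))
  .toMap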

Scala: map+filter instead of foldLeft

Is it possible to replace the foldLeft function with some combination of map and filter in Scala? For example, on this task.
Input is a list of triples (student name, course, grade):
val grades = List(("Hans", "db", 2.3), ("Maria", "prog1", 1.0), ("Maria", "prog2", 1.3), ("Maria", "prog3", 1.7), ("Hans", "prog2", 1.7), ("Josef", "prog1", 1.3), ("Maria", "mathe1", 1.3), ("Josef", "db", 3.3), ("Hans", "prog1", 2.0))
Each student should then be mapped to a list of their courses and grades. With foldLeft it goes like this:
grades.foldLeft(Map[String, List[(String, Double)]]())((acc, e) => acc + (e._1 -> (acc.getOrElse(e._1, List()) ::: List((e._2, e._3))))).toList
Output:
List[(String, List[(String, Double)])] = List((Hans,List((db,2.3), (prog2,1.7), (prog1,2.0))), (Maria,List((prog1,1.0), (prog2,1.3), (prog3,1.7), (mathe1,1.3))), (Josef,List((prog1,1.3), (db,3.3))))
How to achieve the same output using only map and filter functions?
So far I have this, but the output is slightly different:
grades.map(x => (x._1, List())).distinct.flatMap(x => grades.map(z => if(!x._2.contains(z._2, z._3)) (x._1, x._2 ::: List((z._2, z._3)))))
Using groupBy is a nice way to tackle this problem:
grades
  .groupBy(_._1)
  .mapValues(_.map(t => (t._2, t._3)))
(Thanks to m4gic for the improvement to the original version, suggested in a comment.)
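A side note not from the original answer: on Scala 2.13, mapValues on a Map is deprecated in favour of going through a view, so an equivalent formulation would be:
grades
  .groupBy(_._1)
  .view
  .mapValues(_.map { case (_, course, grade) => (course, grade) })
  .toMap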
grades
  .map({ case (name, _, _) => name })
  .distinct
  .map(name => {
    val scores = grades
      .filter({ case (nameInternal, _, _) => name == nameInternal })
      .map({ case (_, subject, score) => (subject, score) })
    (name, scores)
  })
// res0: List[(String, List[(String, Double)])] = List((Hans,List((db,2.3), (prog2,1.7), (prog1,2.0))), (Maria,List((prog1,1.0), (prog2,1.3), (prog3,1.7), (mathe1,1.3))), (Josef,List((prog1,1.3), (db,3.3))))

Scala: How to create a general method that accepts tuples with different arities?

In my application there are many places where I need to take a list of tuples, group it by the first element of the tuple, and remove that element from the rest. For example, given the tuples
(1, "Joe", "Account"), (1, "Tom", "Employer"), (2, "John", "Account"), the result should be Map(1 -> List(("Joe", "Account"), ("Tom", "Employer")), 2 -> List(("John", "Account")))
It is easily implemented as
data.groupBy(_._1).map { case (k, v) => k -> v.map(f => (f._2, f._3)) }
But I am looking for a general solution, because I can have tuples with different arities: 2, 3, 4 or even 7.
I think Shapeless or Scalaz can help me, but my experience with those libraries is limited; please point me to some example.
This is easily implemented using shapeless (for simplicity, I won't generalize it for all collection types). There's a specific type class for tuples, called IsComposite, that can deconstruct them into head and tail:
import shapeless.ops.tuple.IsComposite

def groupTail[P, H, T](tuples: List[P])(
    implicit ic: IsComposite.Aux[P, H, T]): Map[H, List[T]] = {
  tuples
    .groupBy(ic.head)
    .map { case (k, vs) => (k, vs.map(ic.tail)) }
}
This works for your case:
val data =
  List((1, "Joe", "Account"), (1, "Tom", "Employer"), (2, "John", "Account"))

assert {
  groupTail(data) == Map(
    1 -> List(("Joe", "Account"), ("Tom", "Employer")),
    2 -> List(("John", "Account"))
  )
}
As well as for a Tuple4 of different types:
val data2 = List((1, 1, "a", 'a), (1, 2, "b", 'b), (2, 1, "a", 'b))

assert {
  groupTail(data2) == Map(
    1 -> List((1, "a", 'a), (2, "b", 'b)),
    2 -> List((1, "a", 'b))
  )
}
Runnable code is available at Scastie

Unpack list, pair up and calculate difference

I want to calculate the time difference between in and out for each id. The data is in the format:
String,Long,String,List[String]
======================================
in, time0, door1, [id1, id2, id3, id4]
out, time1, door1, [id1, id2, id3]
out, time2, door1, [id4, id5]
In the end it should end up with key-value pairs like:
{(id1, #time1-time0), (id2, #time1-time0), (id3, #time1-time0), (id4, #time2-time0), (id5, N/A)}
What would be a good approach for solving this problem?
EDIT: I have tried the following.
case class Data(direction: String, time:Long, door:String, ids: List[String])
val data = sc.parallelize(Seq(Data("in", 5, "d1", List("id1", "id2", "id3", "id4")),Data("out", 20, "d1", List("id1", "id2", "id3")), Data("out",50, "d1", List("id4", "id5"))))
data.flatMap(x => (x.ids, x))
scala> case class Data( direction: String, time: Long, door: String, ids: List[ String ] )
defined class Data
scala> val data = sc.parallelize( Seq( Data( "in", 5, "d1", List( "id1", "id2", "id3", "id4" ) ), Data( "out", 20, "d1", List( "id1", "id2", "id3" ) ), Data( "out",50, "d1", List( "id4", "id5" ) ) ) )
data: org.apache.spark.rdd.RDD[Data] = ParallelCollectionRDD[0] at parallelize at <console>:14
// Get an RDD entry for each ( id, data ) pair
scala> data.flatMap( x => x.ids.map( id => ( id, x ) ) )
res0: org.apache.spark.rdd.RDD[(String, Data)] = FlatMappedRDD[1] at flatMap at <console>:17
// group by id to get data's with same id's
scala> res0.groupBy( { case ( id, data ) => id } )
res1: org.apache.spark.rdd.RDD[(String, Iterable[(String, Data)])] = ShuffledRDD[3] at groupBy at <console>:19
// convert Iterable[(String, Data)] to List[Data]
scala> res1.map( { case ( id, iter ) => ( id, iter.toList.map( { case ( i, d ) => d } ) ) } )
res2: org.apache.spark.rdd.RDD[(String, List[Data])] = MappedRDD[4] at map at <console>:21
// sort list of data's by data.time
scala> res2.map( { case ( id, list ) => ( id, list.sortBy( d => d.time ) ) } )
res3: org.apache.spark.rdd.RDD[(String, List[Data])] = MappedRDD[5] at map at <console>:23
// get the time diff by doing lastData.time - firstData.time for each id
scala> :paste
// Entering paste mode (ctrl-D to finish)
res3.map( { case ( id, list ) => {
  list match {
    case d :: Nil  => ( id, None )
    case d :: tail => ( id, Some( list.last.time - d.time ) )
    case _         => ( id, None )
  }
} } )
// Exiting paste mode, now interpreting.
res6: org.apache.spark.rdd.RDD[(String, Option[Long])] = MappedRDD[7] at map at <console>:25
Now, res6 has your desired data.
Also, I was not sure how you wanted to use direction, so I did not use it. Modify some of the code to get what you want (I think just the last res3 step needs to change a little), or explain it here and maybe I will give you the answer. If you have any other doubts, ask away.
It can also be achieved in a more concise way, but that would be harder to understand, which is why I have provided the more verbose and simpler code.
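For the record, here is a sketch of that more concise version, chaining the same steps together (illustrative only):
val diffs = data
  .flatMap(x => x.ids.map(id => (id, x)))   // one (id, Data) pair per id
  .groupByKey()                             // all events that mention the id
  .mapValues { ds =>
    val sorted = ds.toList.sortBy(_.time)
    if (sorted.size > 1) Some(sorted.last.time - sorted.head.time) else None
  }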

take top N after groupBy and treat them as RDD

I'd like to get the top N items after groupByKey on an RDD and convert the type of topNPerGroup (below) to RDD[(String, Int)], where the List[Int] values are flattened.
The data is
val data = sc.parallelize(Seq("foo" -> 3, "foo" -> 1, "foo" -> 2,
                              "bar" -> 6, "bar" -> 5, "bar" -> 4))
The top N items per group are computed as:
val topNPerGroup: RDD[(String, List[Int])] = data.groupByKey.map {
  case (key, numbers) =>
    key -> numbers.toList.sortBy(-_).take(2)
}
The result is
(bar,List(6, 5))
(foo,List(3, 2))
which was printed by
topNPerGroup.collect.foreach(println)
What I would like is for topNPerGroup.collect.foreach(println) to generate the expected result:
(bar, 6)
(bar, 5)
(foo, 3)
(foo, 2)
I've been struggling with this same issue recently, but my need was a little different in that I needed the top K values per key with a data set like (key: Int, (domain: String, count: Long)). While your dataset is simpler, there is still a scaling/performance issue with groupByKey, as noted in the documentation:
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or combineByKey will yield much better performance.
In my case I ran into problems very quickly because my Iterable in (K, Iterable<V>) was very large (over 1 million elements), so sorting and taking the top N became very expensive and created potential memory issues.
After some digging (see the references below), here is a full example using combineByKey to accomplish the same task in a way that performs and scales.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object TopNForKey {

  var SampleDataset = List(
    (1, ("apple.com", 3L)),
    (1, ("google.com", 4L)),
    (1, ("stackoverflow.com", 10L)),
    (1, ("reddit.com", 15L)),
    (2, ("slashdot.org", 11L)),
    (2, ("samsung.com", 1L)),
    (2, ("apple.com", 9L)),
    (3, ("microsoft.com", 5L)),
    (3, ("yahoo.com", 3L)),
    (3, ("google.com", 4L)))

  // sort and trim a traversable of (String, Long) tuples by the _2 value of each tuple
  def topNs(xs: TraversableOnce[(String, Long)], n: Int) = {
    var ss = List[(String, Long)]()
    var min = Long.MaxValue
    var len = 0
    xs foreach { e =>
      if (len < n || e._2 > min) {
        ss = (e :: ss).sortBy((f) => f._2)
        min = ss.head._2
        len += 1
      }
      if (len > n) {
        ss = ss.tail
        min = ss.head._2
        len -= 1
      }
    }
    ss
  }

  def main(args: Array[String]): Unit = {
    val topN = 2
    val sc = new SparkContext("local", "TopN For Key")
    val rdd = sc.parallelize(SampleDataset).map((t) => (t._1, t._2))

    // use combineByKey to allow Spark to partition the sorting and "trimming" across the cluster
    val topNForKey = rdd.combineByKey(
      // seed a list for each key to hold your top N's, starting with the first record
      (v) => List[(String, Long)](v),
      // add the incoming value to the accumulating top N list for the key
      (acc: List[(String, Long)], v) => topNs(acc ++ List((v._1, v._2)), topN).toList,
      // merge the top N lists returned from each partition into a new combined top N list
      (acc: List[(String, Long)], acc2: List[(String, Long)]) => topNs(acc ++ acc2, topN).toList)

    // print the results, sorted by key for readability
    topNForKey.sortByKey(true).foreach((t) => {
      println(s"key: ${t._1}")
      t._2.foreach((v) => {
        println(s"----- $v")
      })
    })
  }
}
And what I get in the returning RDD:
(1, List(("stackoverflow.com", 10L),
         ("reddit.com", 15L)))
(2, List(("apple.com", 9L),
         ("slashdot.org", 11L)))
(3, List(("google.com", 4L),
         ("microsoft.com", 5L)))
References
https://www.mail-archive.com/user#spark.apache.org/msg16827.html
https://stackoverflow.com/a/8275562/807318
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
Spark 1.4.0 addresses this.
Take a look at https://github.com/apache/spark/commit/5e6ad24ff645a9b0f63d9c0f17193550963aa0a7
It uses a BoundedPriorityQueue with aggregateByKey:
def topByKey(num: Int)(implicit ord: Ordering[V]): RDD[(K, Array[V])] = {
  self.aggregateByKey(new BoundedPriorityQueue[V](num)(ord))(
    seqOp = (queue, item) => {
      queue += item
    },
    combOp = (queue1, queue2) => {
      queue1 ++= queue2
    }
  ).mapValues(_.toArray.sorted(ord.reverse)) // This is a min-heap, so we reverse the order.
}
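BoundedPriorityQueue is Spark-internal, but the same aggregateByKey shape can be sketched with a plain bounded list on the question's data (an illustrative sketch, not tuned for very large groups):
val topN = 2
val topTwo = data
  .aggregateByKey(List.empty[Int])(
    (acc, v) => (v :: acc).sortBy(-_).take(topN),   // fold each value into a bounded, descending list
    (a, b)   => (a ++ b).sortBy(-_).take(topN))     // merge the per-partition lists
  .flatMapValues(identity)                          // flatten to RDD[(String, Int)]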
Your question is a little confusing, but I think this does what you're looking for:
val flattenedTopNPerGroup =
  topNPerGroup.flatMap({ case (key, numbers) => numbers.map(key -> _) })
and in the REPL it prints out what you want:
flattenedTopNPerGroup.collect.foreach(println)
(foo,3)
(foo,2)
(bar,6)
(bar,5)
Just use topByKey:
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
import org.apache.spark.rdd.RDD
val topTwo: RDD[(String, Int)] = data.topByKey(2).flatMapValues(x => x)
topTwo.collect.foreach(println)
(foo,3)
(foo,2)
(bar,6)
(bar,5)
It is also possible to provide an alternative Ordering (not required here). For example, if you wanted the n smallest values:
data.topByKey(2)(scala.math.Ordering.by[Int, Int](- _))