How to find the per-key maximum in a Scala collection - scala

Let us consider that I have a collection of employees as a List of tuples, where t._1 is the department id, t._2 is the salary, and t._3 is the employee's name:
val employees = List((1, 8000, "Sally"), (1, 9999, "Tom"), (2, 5000, "Pam"), (4, 500, "NK"), (4, 999, "Robert"))
Expected result: ((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
I am trying the following, but I'm getting an error:
val maxSal1 = employees.map(t => (t._1, (t._2, t._3))).groupBy(a => a._1).map(k => {
  k._2.foldLeft(0, "dummy")((aa, bb) => {
    if (aa._1 > bb._1) aa else bb
  })
})

Don't overcomplicate things: avoid unnecessary operations and don't carry redundant information around. Just be explicit and spell out the transformations you need at each step. Simplicity is your friend.
employees.groupBy(_._1).values.map(_.maxBy(_._2))

scala> List((1, 8000, "Sally"),(1, 9999, "Tom"), (2, 5000, "Pam"), (4, 500, "NK"), (4, 999, "Robert")).groupBy {
| case (dept, salary, employee) => dept
| }
res6: scala.collection.immutable.Map[Int,List[(Int, Int, String)]] = Map(2 -> List((2,5000,Pam)), 4 -> List((4,500,NK), (4,999,Robert)), 1 -> List((1,8000,Sally), (1,9999,Tom)))
scala> res6.map {
| case (dept, employees) => employees.maxBy(_._2)
| }
res7: scala.collection.immutable.Iterable[(Int, Int, String)] = List((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
But note that maxBy is partial: it throws on an empty collection:
scala> List[Int]().maxBy(x => x)
java.lang.UnsupportedOperationException: empty.maxBy
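(On Scala 2.13 and later there is also maxByOption, which avoids the exception by returning an Option; a quick illustrative sketch:)
List.empty[Int].maxByOption(identity)   // None instead of an exception
List(1, 3, 2).maxByOption(identity)     // Some(3)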
As a side note, I'd use a case class Employee with 3 fields rather than a tuple. I believe it's more readable.
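For reference, a minimal sketch of that case-class variant (the field names here are my own choice, not from the question):
case class Employee(dept: Int, salary: Int, name: String)

val staff = List(
  Employee(1, 8000, "Sally"), Employee(1, 9999, "Tom"),
  Employee(2, 5000, "Pam"), Employee(4, 500, "NK"), Employee(4, 999, "Robert"))

// highest-paid employee per department
val topPaid = staff.groupBy(_.dept).values.map(_.maxBy(_.salary))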

I tried this option and it seems to give the right result:
val maxsal1 = employees.groupBy(_._1).values.map(t => t.foldLeft((0, 1, "dummy"))((aa, bb) => {
  if (aa._2 > bb._2) aa else bb
}))
Output: List((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
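If you want to avoid carrying the (0, 1, "dummy") seed, a reduce over each group does the same job; a small sketch along the same lines:
val maxSal = employees
  .groupBy(_._1)
  .values
  .map(_.reduceLeft((a, b) => if (a._2 > b._2) a else b))  // groups produced by groupBy are never empty, so reduceLeft is safe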

Related

Scala flatten a map with list as key and string as value

I have a peculiar case where I want to declare a simple configuration like so:
val config = List((("a", "b", "c"), ("first")),
                  (("d", "e"), ("second")),
                  (("f"), ("third")))
which, at run time, I would like to turn into a map that maps
"a" -> "first"
"b" -> "first"
"c" -> "first"
"d" -> "second"
"e" -> "second"
"f" -> "third"
Using toMap, I was able to convert the config to a Map
scala> config.toMap
res42: scala.collection.immutable.Map[java.io.Serializable,String] = Map((a,b,c) -> first, (d,e) -> second, f -> third)
But I am not able to figure out how to flatten the grouped keys into individual keys to get the final desired form. How do I solve this?
If you structure your config using List the code is very simple:
val config = List(
  (List("a", "b", "c"), "first"),
  (List("d", "e"), "second"),
  (List("f"), "third"))
config.flatMap{ case (k, v) => k.map(_ -> v) }.toMap
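For what it's worth, a quick check of what this produces (ignoring Map ordering):
val flat = config.flatMap { case (k, v) => k.map(_ -> v) }.toMap
assert(flat == Map("a" -> "first", "b" -> "first", "c" -> "first",
                   "d" -> "second", "e" -> "second", "f" -> "third"))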
You can try the solution below:
val config = List(
  (("a", "b", "c"), ("first")),
  (("d", "e"), ("second")),
  (("f"), ("third")))
val result = config.map {
  case (k, v) =>
    (k.toString().replace(")", "")
      .replace("(", "")
      .split(","), v)
}
val res = result.map {
  case (key, value) => key.map { data =>
    (data, value)
  }.toList
}.flatten.toMap
If you change the config structure to something like the one below, the solution is much simpler:
val config1 = List(
  (List("a", "b", "c"), "first"),
  (List("d", "e"), "second"),
  (List("f"), "third")
)
config1.flatMap {
  case (k, v) => k.map { data => (data, v) }
}.toMap
I think the above answers are good practical answers. If you're in a situation where you have no control over the input and you're stuck with Tuples instead of Lists, I'd do it this way:
val result: Map[String, String] = config.flatMap {
  case (s: String, v)   => List(s -> v)
  case (ks: Product, v) => ks.productIterator.collect { case s: String => s -> v }
  case _                => Nil // prevent throwing on unexpected shapes
}.toMap
This will throw away anything that's not a String in the keys.
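As an illustrative check against the original tuple-based config, the result comes out the same as with the List-based versions:
assert(result == Map("a" -> "first", "b" -> "first", "c" -> "first",
                     "d" -> "second", "e" -> "second", "f" -> "third"))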
Using built-in Spark SQL functions:
import org.apache.spark.sql.functions
import spark.implicits._ // assumes an implicit SparkSession named `spark`

val config = List((Array("a", "b", "c"), ("first")),
                  (Array("d", "e"), ("second")),
                  (Array("f"), ("third"))).toDF(List("col1", "col2"): _*)
config.withColumn("exploded", functions.explode_outer($"col1")).drop("col1").show()
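If you then need a plain Scala Map on the driver, one way (a sketch; it assumes the column names used above and a config small enough to collect) is to collect and pair up the columns:
val asMap: Map[String, String] = config
  .withColumn("exploded", functions.explode_outer($"col1"))
  .drop("col1")
  .collect()
  .map(row => row.getAs[String]("exploded") -> row.getAs[String]("col2"))
  .toMap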

Scala: map+filter instead of foldLeft

Is it possible to replace the foldLeft function with some combination of map and filter in Scala? For example, on this task.
Input is a list of triples (student name, course, grade):
val grades = List(("Hans", "db", 2.3), ("Maria", "prog1", 1.0), ("Maria", "prog2", 1.3), ("Maria", "prog3", 1.7), ("Hans", "prog2", 1.7), ("Josef", "prog1", 1.3), ("Maria", "mathe1", 1.3), ("Josef", "db", 3.3), ("Hans", "prog1", 2.0))
Each student should then be mapped to a list of their courses and grades. With foldLeft it goes like this:
grades.foldLeft(Map[String, List[(String, Double)]]())((acc, e) => acc + (e._1 -> (acc.getOrElse(e._1, List()) ::: List((e._2, e._3))))).toList
Output:
List[(String, List[(String, Double)])] = List((Hans,List((db,2.3), (prog2,1.7), (prog1,2.0))), (Maria,List((prog1,1.0), (prog2,1.3), (prog3,1.7), (mathe1,1.3))), (Josef,List((prog1,1.3), (db,3.3))))
How to achieve the same output using only map and filter functions?
So far I have this, but the output is slightly different:
grades.map(x => (x._1, List())).distinct.flatMap(x => grades.map(z => if(!x._2.contains(z._2, z._3)) (x._1, x._2 ::: List((z._2, z._3)))))
Using groupBy is a nice way to tackle this problem:
grades
  .groupBy(_._1)
  .mapValues(_.map(t => (t._2, t._3)))
(Thanks to m4gic for the improvement to the original version, suggested in a comment.)
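A side note not from the original answer: on Scala 2.13, mapValues on a Map is deprecated in favour of going through a view, so an equivalent formulation would be:
grades
  .groupBy(_._1)
  .view
  .mapValues(_.map { case (_, course, grade) => (course, grade) })
  .toMap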
grades
  .map({ case (name, _, _) => name })
  .distinct
  .map(name => {
    val scores = grades
      .filter({ case (nameInternal, _, _) => name == nameInternal })
      .map({ case (_, subject, score) => (subject, score) })
    (name, scores)
  })
// res0: List[(String, List[(String, Double)])] = List((Hans,List((db,2.3), (prog2,1.7), (prog1,2.0))), (Maria,List((prog1,1.0), (prog2,1.3), (prog3,1.7), (mathe1,1.3))), (Josef,List((prog1,1.3), (db,3.3))))

Scala: How to create a general method that accepts tuples with different arities?

In my application there are many places where I need to take a list of tuples, group it by the first element of the tuple, and remove that element from the rest. For example, given the tuples
(1, "Joe", "Account"), (1, "Tom", "Employer"), (2, "John", "Account"), the result should be Map(1 -> List(("Joe", "Account"), ("Tom", "Employer")), 2 -> List(("John", "Account")))
It is easily implemented as
data.groupBy(_._1).map { case (k, v) => k -> v.map(f => (f._2, f._3)) }
But I am looking for a general solution, because I can have tuples with different arities: 2, 3, 4 or even 7.
I think Shapeless or Scalaz can help me, but my experience with those libraries is limited; please point me to some example.
This is easily implemented using shapeless (for simplicity, I won't generalize it for all collection types). There's a specific type class for tuples, called IsComposite, that can deconstruct them into head and tail:
import shapeless.ops.tuple.IsComposite

def groupTail[P, H, T](tuples: List[P])(
    implicit ic: IsComposite.Aux[P, H, T]): Map[H, List[T]] = {
  tuples
    .groupBy(ic.head)
    .map { case (k, vs) => (k, vs.map(ic.tail)) }
}
This works for your case:
val data =
  List((1, "Joe", "Account"), (1, "Tom", "Employer"), (2, "John", "Account"))

assert {
  groupTail(data) == Map(
    1 -> List(("Joe", "Account"), ("Tom", "Employer")),
    2 -> List(("John", "Account"))
  )
}
As well as for a Tuple4 of different types:
val data2 = List((1, 1, "a", 'a), (1, 2, "b", 'b), (2, 1, "a", 'b))

assert {
  groupTail(data2) == Map(
    1 -> List((1, "a", 'a), (2, "b", 'b)),
    2 -> List((1, "a", 'b))
  )
}
Runnable code is available at Scastie

Unpack list, pair up and calculate difference

I want to calculate the time difference between in and out for each id. The data is in the format:
String,Long,String,List[String]
======================================
in, time0, door1, [id1, id2, id3, id4]
out, time1, door1, [id1, id2, id3]
out, time2, door1, [id4, id5]
In the end it should end up with key-value pairs like:
{(id1, #time1-time0), (id2, #time1-time0), (id3, #time1-time0), (id4, #time2-time0), (id5, N/A)}
What would be a good approach for solving this problem?
EDIT: I have tried the following.
case class Data(direction: String, time:Long, door:String, ids: List[String])
val data = sc.parallelize(Seq(Data("in", 5, "d1", List("id1", "id2", "id3", "id4")),Data("out", 20, "d1", List("id1", "id2", "id3")), Data("out",50, "d1", List("id4", "id5"))))
data.flatMap(x => (x.ids, x))
scala> case class Data( direction: String, time: Long, door: String, ids: List[ String ] )
defined class Data
scala> val data = sc.parallelize( Seq( Data( "in", 5, "d1", List( "id1", "id2", "id3", "id4" ) ), Data( "out", 20, "d1", List( "id1", "id2", "id3" ) ), Data( "out",50, "d1", List( "id4", "id5" ) ) ) )
data: org.apache.spark.rdd.RDD[Data] = ParallelCollectionRDD[0] at parallelize at <console>:14
// Get an RDD entry for each ( id, data ) pair
scala> data.flatMap( x => x.ids.map( id => ( id, x ) ) )
res0: org.apache.spark.rdd.RDD[(String, Data)] = FlatMappedRDD[1] at flatMap at <console>:17
// group by id to get data's with same id's
scala> res0.groupBy( { case ( id, data ) => id } )
res1: org.apache.spark.rdd.RDD[(String, Iterable[(String, Data)])] = ShuffledRDD[3] at groupBy at <console>:19
// convert Iterable[(String, Data)] to List[Data]
scala> res1.map( { case ( id, iter ) => ( id, iter.toList.map( { case ( i, d ) => d } ) ) } )
res2: org.apache.spark.rdd.RDD[(String, List[Data])] = MappedRDD[4] at map at <console>:21
// sort list of data's by data.time
scala> res2.map( { case ( id, list ) => ( id, list.sortBy( d => d.time ) ) } )
res3: org.apache.spark.rdd.RDD[(String, List[Data])] = MappedRDD[5] at map at <console>:23
// get the time diff by doing lastData.time - firstData.time for each id
scala> :paste
// Entering paste mode (ctrl-D to finish)
res3.map( { case ( id, list ) => {
  list match {
    case d :: Nil  => ( id, None )
    case d :: tail => ( id, Some( list.last.time - d.time ) )
    case _         => ( id, None )
  }
} } )
// Exiting paste mode, now interpreting.
res6: org.apache.spark.rdd.RDD[(String, Option[Long])] = MappedRDD[7] at map at <console>:25
Now, res6 has your desired data.
Also, I was not sure how you wanted to use direction, so I did not use it. Modify some of the code to get what you want (I think just the last res3 step needs to change a little), or explain it here and maybe I will give you the answer. If you have any other doubts, ask away.
It can also be achieved in a more concise way, but that would be harder to understand, which is why I have provided the more verbose and simpler code.
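For the record, here is a sketch of that more concise version, chaining the same steps together (illustrative only):
val diffs = data
  .flatMap(x => x.ids.map(id => (id, x)))   // one (id, Data) pair per id
  .groupByKey()                             // all events that mention the id
  .mapValues { ds =>
    val sorted = ds.toList.sortBy(_.time)
    if (sorted.size > 1) Some(sorted.last.time - sorted.head.time) else None
  }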

take top N after groupBy and treat them as RDD

I'd like to get the top N items after groupByKey on an RDD and convert the type of topNPerGroup (below) to RDD[(String, Int)], where the List[Int] values are flattened.
The data is
val data = sc.parallelize(Seq("foo" -> 3, "foo" -> 1, "foo" -> 2,
                              "bar" -> 6, "bar" -> 5, "bar" -> 4))
The top N items per group are computed as:
val topNPerGroup: RDD[(String, List[Int])] = data.groupByKey.map {
  case (key, numbers) =>
    key -> numbers.toList.sortBy(-_).take(2)
}
The result is
(bar,List(6, 5))
(foo,List(3, 2))
which was printed by
topNPerGroup.collect.foreach(println)
What I would like is for topNPerGroup.collect.foreach(println) to generate the expected result:
(bar, 6)
(bar, 5)
(foo, 3)
(foo, 2)
I've been struggling with this same issue recently, but my need was a little different in that I needed the top K values per key with a data set like (key: Int, (domain: String, count: Long)). While your dataset is simpler, there is still a scaling/performance issue with groupByKey, as noted in the documentation:
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or combineByKey will yield much better performance.
In my case I ran into problems very quickly because my Iterable in (K, Iterable<V>) was very large (over 1 million elements), so sorting and taking the top N became very expensive and created potential memory issues.
After some digging (see the references below), here is a full example using combineByKey to accomplish the same task in a way that performs and scales.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object TopNForKey {

  var SampleDataset = List(
    (1, ("apple.com", 3L)),
    (1, ("google.com", 4L)),
    (1, ("stackoverflow.com", 10L)),
    (1, ("reddit.com", 15L)),
    (2, ("slashdot.org", 11L)),
    (2, ("samsung.com", 1L)),
    (2, ("apple.com", 9L)),
    (3, ("microsoft.com", 5L)),
    (3, ("yahoo.com", 3L)),
    (3, ("google.com", 4L)))

  // sort and trim a traversable of (String, Long) tuples by the _2 value of each tuple
  def topNs(xs: TraversableOnce[(String, Long)], n: Int) = {
    var ss = List[(String, Long)]()
    var min = Long.MaxValue
    var len = 0
    xs foreach { e =>
      if (len < n || e._2 > min) {
        ss = (e :: ss).sortBy((f) => f._2)
        min = ss.head._2
        len += 1
      }
      if (len > n) {
        ss = ss.tail
        min = ss.head._2
        len -= 1
      }
    }
    ss
  }

  def main(args: Array[String]): Unit = {
    val topN = 2
    val sc = new SparkContext("local", "TopN For Key")
    val rdd = sc.parallelize(SampleDataset).map((t) => (t._1, t._2))

    // use combineByKey to allow Spark to partition the sorting and "trimming" across the cluster
    val topNForKey = rdd.combineByKey(
      // seed a list for each key to hold your top N's, starting with the first record
      (v) => List[(String, Long)](v),
      // add the incoming value to the accumulating top N list for the key
      (acc: List[(String, Long)], v) => topNs(acc ++ List((v._1, v._2)), topN).toList,
      // merge the top N lists returned from each partition into a new combined top N list
      (acc: List[(String, Long)], acc2: List[(String, Long)]) => topNs(acc ++ acc2, topN).toList)

    // print the results, sorted by key for readability
    topNForKey.sortByKey(true).foreach((t) => {
      println(s"key: ${t._1}")
      t._2.foreach((v) => {
        println(s"----- $v")
      })
    })
  }
}
And what I get in the returning RDD:
(1, List(("stackoverflow.com", 10L),
         ("reddit.com", 15L)))
(2, List(("apple.com", 9L),
         ("slashdot.org", 11L)))
(3, List(("google.com", 4L),
         ("microsoft.com", 5L)))
References
https://www.mail-archive.com/user#spark.apache.org/msg16827.html
https://stackoverflow.com/a/8275562/807318
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
Spark 1.4.0 addresses this.
Take a look at https://github.com/apache/spark/commit/5e6ad24ff645a9b0f63d9c0f17193550963aa0a7
It uses a BoundedPriorityQueue with aggregateByKey:
def topByKey(num: Int)(implicit ord: Ordering[V]): RDD[(K, Array[V])] = {
  self.aggregateByKey(new BoundedPriorityQueue[V](num)(ord))(
    seqOp = (queue, item) => {
      queue += item
    },
    combOp = (queue1, queue2) => {
      queue1 ++= queue2
    }
  ).mapValues(_.toArray.sorted(ord.reverse)) // This is a min-heap, so we reverse the order.
}
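BoundedPriorityQueue is Spark-internal, but the same aggregateByKey shape can be sketched with a plain bounded list on the question's data (an illustrative sketch, not tuned for very large groups):
val topN = 2
val topTwo = data
  .aggregateByKey(List.empty[Int])(
    (acc, v) => (v :: acc).sortBy(-_).take(topN),   // fold each value into a bounded, descending list
    (a, b)   => (a ++ b).sortBy(-_).take(topN))     // merge the per-partition lists
  .flatMapValues(identity)                          // flatten to RDD[(String, Int)]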
Your question is a little confusing, but I think this does what you're looking for:
val flattenedTopNPerGroup =
  topNPerGroup.flatMap({ case (key, numbers) => numbers.map(key -> _) })
and in the REPL it prints out what you want:
flattenedTopNPerGroup.collect.foreach(println)
(foo,3)
(foo,2)
(bar,6)
(bar,5)
Just use topByKey:
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
import org.apache.spark.rdd.RDD
val topTwo: RDD[(String, Int)] = data.topByKey(2).flatMapValues(x => x)
topTwo.collect.foreach(println)
(foo,3)
(foo,2)
(bar,6)
(bar,5)
It is also possible to provide an alternative Ordering (not required here). For example, if you wanted the n smallest values:
data.topByKey(2)(scala.math.Ordering.by[Int, Int](- _))