How to Reduce by key in "Scala" [Not In Spark]

I am trying to reduce by key in Scala. Is there any method to reduce values based on their keys in plain Scala? [I know we can do it with the reduceByKey method in Spark, but how do we do the same in Scala?]
The input data is:
val File = Source.fromFile("C:/Users/svk12/git/data/retail_db/order_items/part-00000")
  .getLines()
  .toList

val map = File.map(x => x.split(","))
  .map(x => (x(1), x(4)))

map.take(10).foreach(println)
After the above step I am getting the result as:
(2,250.0)
(2,129.99)
(4,49.98)
(4,299.95)
(4,150.0)
(4,199.92)
(5,299.98)
(5,299.95)
Expected Result :
(2,379.99)
(5,599.93)
.......

Starting in Scala 2.13, you can use the groupMapReduce method, which is (as its name suggests) equivalent to a groupBy followed by mapValues and a reduce step:
io.Source.fromFile("file.txt")
  .getLines
  .to(LazyList)
  .map(_.split(','))
  .groupMapReduce(_(1))(_(4).toDouble)(_ + _)
The groupMapReduce stage:
groups the split arrays by their 2nd element (_(1)) (group part of groupMapReduce)
maps each array within each group to its 5th element converted to a Double (_(4).toDouble) (map part of groupMapReduce)
reduces the values within each group by summing them (_ + _) (reduce part of groupMapReduce).
This is a one-pass version of what could otherwise be written as:
seq.groupBy(_(1)).mapValues(_.map(_(4).toDouble).reduce(_ + _))
Also note the conversion from Iterator to LazyList in order to use a collection which provides groupMapReduce (we don't use a Stream, since starting with Scala 2.13, LazyList is the recommended replacement for Stream).
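For a quick feel of groupMapReduce outside of file I/O, here is a minimal sketch on hypothetical in-memory lines shaped like the question's data (the values are made up):
val lines = List("1,2,_,_,250.0", "2,2,_,_,129.99", "3,4,_,_,49.98")
lines
  .map(_.split(','))
  .groupMapReduce(_(1))(_(4).toDouble)(_ + _)
// => Map(2 -> 379.99, 4 -> 49.98)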

It looks like you want the sum of some values from a file. One problem is that file data is read as Strings, so each String has to be converted to a numeric type before it can be summed.
These are the steps you might use.
io.Source.fromFile("so.txt") //open file
.getLines() //read line-by-line
.map(_.split(",")) //each line is Array[String]
.toSeq //to something that can groupBy()
.groupBy(_(1)) //now is Map[String,Array[String]]
.mapValues(_.map(_(4).toInt).sum) //now is Map[String,Int]
.toSeq //un-Map it to (String,Int) tuples
.sorted //presentation order
.take(10) //sample
.foreach(println) //report
This will, of course, throw an exception if any of the file data is not in the required format.
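If malformed rows are possible and throwing is not acceptable, here is a hedged variant of the same pipeline (assuming Scala 2.13+ for toDoubleOption, which goes beyond the original answer) that skips bad rows instead:
io.Source.fromFile("so.txt")
  .getLines()
  .map(_.split(","))
  .flatMap(a => if (a.length > 4) a(4).toDoubleOption.map(v => (a(1), v)) else None) //drop bad rows
  .toSeq
  .groupBy(_._1)
  .map { case (k, vs) => (k, vs.map(_._2).sum) }
  .toSeq
  .sorted
  .take(10)
  .foreach(println)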

There is nothing built-in, but you can write it like this:
def reduceByKey[A, B](items: Traversable[(A, B)])(f: (B, B) => B): Map[A, B] = {
  var result = Map.empty[A, B]
  items.foreach {
    case (a, b) =>
      result += (a -> result.get(a).map(b1 => f(b1, b)).getOrElse(b))
  }
  result
}
There is some room to optimize this (e.g. using a mutable map), but the general idea remains the same.
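A quick usage sketch of the helper above, on the question's sample pairs with the values already converted to Double (an assumption, since the question reads them as Strings):
val sums = reduceByKey(List(("2", 250.0), ("2", 129.99), ("4", 49.98)))(_ + _)
// sums: Map[String,Double] = Map(2 -> 379.99, 4 -> 49.98)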
Another approach is more declarative but less efficient (it creates several intermediate collections; this can be avoided, but at a loss of clarity):
def reduceByKey[A, B](items: Traversable[(A, B)])(f: (B, B) => B): Map[A, B] = {
  items
    .groupBy { case (a, _) => a }
    .mapValues(_.map { case (_, b) => b }.reduce(f))
    // mapValues returns a view; view.force turns it back into a realized map
    .view.force
}

First group the tuples by key (the first element here) and then reduce.
The following code will work:
val reducedList = map.groupBy(_._1).map(l => (l._1, l._2.map(_._2).reduce(_+_)))
print(reducedList)
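Note that in the question's map the values are still Strings, so _ + _ would concatenate rather than add them; convert to Double first. A minimal sketch of the same one-liner with the conversion applied to made-up pairs:
val pairs = List(("2", "250.0"), ("2", "129.99"), ("4", "49.98"))
val reducedList = pairs.groupBy(_._1).map(l => (l._1, l._2.map(_._2.toDouble).reduce(_ + _)))
// => Map(2 -> 379.99, 4 -> 49.98)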

Here is another solution using foldLeft:
val File: List[String] = ???

File.map(x => x.split(","))
  .map(x => (x(1), x(4).toDouble))
  .foldLeft(Map.empty[String, Double]) { case (state, (key, value)) =>
    state.updated(key, state.getOrElse(key, 0.0) + value)
  }
  .toSeq
  .sortBy(_._1)
  .take(10)
  .foreach(println)

Related

groupBy on List as LinkedHashMap instead of Map

I am processing XML using Scala, and I am converting the XML into my own data structures. Currently, I am using plain Map instances to hold (sub-)elements; however, the order of elements from the XML gets lost this way, and I cannot reproduce the original XML.
Therefore, I want to use LinkedHashMap instances instead of Map, but I am using groupBy on the list of nodes, which creates a Map.
For example:
def parse(n: Node): Unit = {
  val leaves: Map[String, Seq[XmlItem]] =
    n.child
      .filter(node => { ... })
      .groupBy(_.label)
      .map((tuple: Tuple2[String, Seq[Node]]) => {
        val items = tuple._2.map(node => {
          val attributes = ...
          if (node.text.nonEmpty)
            XmlItem(Some(node.text), attributes)
          else
            XmlItem(None, attributes)
        })
        (tuple._1, items)
      })
  ...
}
In this example, I want leaves to be of type LinkedHashMap to retain the order of n.child. How can I achieve this?
Note: I am grouping by label/tagname because elements can occur multiple times, and for each label/tagname, I keep a list of elements in my data structures.
Solution
As answered by @jwvh, I am using foldLeft as a substitute for groupBy. Also, I decided to go with LinkedHashMap instead of ListMap.
def parse(n: Node): Unit = {
  val leaves: mutable.LinkedHashMap[String, Seq[XmlItem]] =
    n.child
      .filter(node => { ... })
      .foldLeft(mutable.LinkedHashMap.empty[String, Seq[Node]]) { (m, sn) =>
        m.update(sn.label, m.getOrElse(sn.label, Seq.empty[Node]) ++ Seq(sn))
        m
      }
      .map((tuple: Tuple2[String, Seq[Node]]) => {
        val items = tuple._2.map(node => {
          val attributes = ...
          if (node.text.nonEmpty)
            XmlItem(Some(node.text), attributes)
          else
            XmlItem(None, attributes)
        })
        (tuple._1, items)
      })
  ...
}
To get the rough equivalent of .groupBy() in a ListMap, you could fold over your collection. The problem is that ListMap preserves the order of elements as they were appended, not as they were first encountered.
import collection.immutable.ListMap

List('a', 'b', 'a', 'c').foldLeft(ListMap.empty[Char, Seq[Char]]) {
  case (lm, c) => lm.updated(c, c +: lm.getOrElse(c, Seq()))
}
//res0: ListMap[Char,Seq[Char]] = ListMap(b -> Seq(b), a -> Seq(a, a), c -> Seq(c))
To fix this you can use foldRight instead of foldLeft. The result is the original order of elements as encountered (scanning left to right), but in reverse.
List('a', 'b', 'a', 'c').foldRight(ListMap.empty[Char, Seq[Char]]) {
  case (c, lm) => lm.updated(c, c +: lm.getOrElse(c, Seq()))
}
//res1: ListMap[Char,Seq[Char]] = ListMap(c -> Seq(c), b -> Seq(b), a -> Seq(a, a))
This isn't necessarily a bad thing since a ListMap is more efficient with last and init ops, O(1), than it is with head and tail ops, O(n).
To process the ListMap in the original left-to-right order you could .toList and .reverse it.
List('a', 'b', 'a', 'c').foldRight(ListMap.empty[Char, Seq[Char]]) {
  case (c, lm) => lm.updated(c, c +: lm.getOrElse(c, Seq()))
}.toList.reverse
//res2: List[(Char, Seq[Char])] = List((a,Seq(a, a)), (b,Seq(b)), (c,Seq(c)))
A purely immutable solution would be quite slow, so I'd go with:
import collection.mutable.{ArrayBuffer, LinkedHashMap}

implicit class ExtraTraversableOps[A](seq: collection.TraversableOnce[A]) {
  def orderedGroupBy[B](f: A => B): collection.Map[B, collection.Seq[A]] = {
    val map = LinkedHashMap.empty[B, ArrayBuffer[A]]
    for (x <- seq) {
      val key = f(x)
      map.getOrElseUpdate(key, ArrayBuffer.empty) += x
    }
    map
  }
}
To use, just change .groupBy in your code to .orderedGroupBy.
The returned Map can't be mutated through this return type (though it can be cast to mutable.Map or mutable.LinkedHashMap), so it's safe enough for most purposes (and you could create a ListMap from it at the end if you really wanted one).
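A small usage sketch, reusing the example list from above; the keys come out in first-encounter order rather than hash order:
List('a', 'b', 'a', 'c').orderedGroupBy(identity)
// => LinkedHashMap(a -> ArrayBuffer(a, a), b -> ArrayBuffer(b), c -> ArrayBuffer(c))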

reduce a list in scala by value

How can I concisely reduce a list like the one below
Seq[Temp] = List(Temp(a,1), Temp(a,2), Temp(b,1))
to
List(Temp(a,2), Temp(b,1))
Only keep Temp objects with a unique first parameter, taking the max of the second parameter.
My solution uses a lot of groupBys and reduces, which makes it lengthy.
You have to:
groupBy
sortBy the values in ascending order
take the last one, which is the largest
For example:
scala> final case class Temp (a: String, value: Int)
defined class Temp
scala> val data : Seq[Temp] = List(Temp("a",1), Temp("a",2), Temp("b",1))
data: Seq[Temp] = List(Temp(a,1), Temp(a,2), Temp(b,1))
scala> data.groupBy(_.a).map { case (k, group) => group.sortBy(_.value).last }
res0: scala.collection.immutable.Iterable[Temp] = List(Temp(b,1), Temp(a,2))
or instead of sortBy(fn).last you can maxBy(fn)
scala> data.groupBy(_.a).map { case (k, group) => group.maxBy(_.value) }
res1: scala.collection.immutable.Iterable[Temp] = List(Temp(b,1), Temp(a,2))
You can generate a Map with groupBy, compute the max in mapValues, and convert it back to Temp instances, as in the following example:
case class Temp(id: String, value: Int)

List(Temp("a", 1), Temp("a", 2), Temp("b", 1))
  .groupBy(_.id)
  .mapValues(_.map(_.value).max)
  .map { case (k, v) => Temp(k, v) }
// res1: scala.collection.immutable.Iterable[Temp] = List(Temp(b,1), Temp(a,2))
Worth noting that the solution using maxBy in the other answer is more efficient as it minimizes necessary transformations.
You can do this using foldLeft:
import scala.math.max

data.foldLeft(Map[String, Int]().withDefaultValue(0)) { (map, tmp) =>
  map.updated(tmp.id, max(map(tmp.id), tmp.value))
}.map { case (i, v) => Temp(i, v) }
This is essentially combining the logic of groupBy with the max operation in a single pass.
Note: this may be less efficient than groupBy, because groupBy uses a mutable.Map internally, which avoids constantly re-creating a new map. If you care about performance and are prepared to use mutable data, this is another option:
import scala.collection.mutable
import scala.math.max

val tmpMap = mutable.Map[String, Int]().withDefaultValue(0)
data.foreach(tmp => tmpMap(tmp.id) = max(tmp.value, tmpMap(tmp.id)))
tmpMap.map { case (i, v) => Temp(i, v) }.toList
Use a ListMap if you need to retain the data order, or sort at the end if you need a particular ordering.

Scala - How to iterate over tuples on RDD?

I have an RDD that contains tuples like this
(A, List(2,5,6,7))
(B, List(2,8,9,10))
and I would like to get the index of the first element where a specific condition between value and index holds.
So far I have tried this on a single tuple test and it works fine:
test._2.zipWithIndex.indexWhere { case (v, i) => SOME_CONDITION}
I just can't figure out how to iterate over all the tuples in the RDD. I have tried:
val result= test._._2.zipWithIndex.indexWhere { case (v, i) => SOME_CONDITION}
First, "iterate" is the wrong concept here - it comes from the realm of imperative programming, where you actually iterate over the data structure yourself. Spark uses a functional paradigm, which let's you pass a function to handle each record in the RDD (using some higher-order function like map, foreach...).
In this case, sounds like you want to map each element into a new element.
To map only the right-hand side of your tuples (without changing the left-hand side), you can use mapValues:
// mapValues will map the "values" (of type List[Int]) to new values (of type Int)
rdd.mapValues(list => list.zipWithIndex.indexWhere {
  case (v, i) => someCondition(v, i)
})
Or, alternatively, using plain map:
rdd.map {
  case (key, list) => (key, list.zipWithIndex.indexWhere {
    case (v, i) => someCondition(v, i)
  })
}
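To make this concrete, here is a hypothetical sketch on plain Scala collections (each RDD record is handled the same way), using "the value equals its index plus 4" as a stand-in for someCondition; indexWhere returns -1 when nothing matches:
val data = Seq(("A", List(2, 5, 6, 7)), ("B", List(2, 8, 9, 10)))
data.map { case (key, list) =>
  (key, list.zipWithIndex.indexWhere { case (v, i) => v == i + 4 })
}
// => List((A,1), (B,-1))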

Access a tuple inside a tuple for an anonymous map job in Spark

This post is essentially about how to build joint and marginal histograms from a (String, String) RDD. I posted the code that I eventually used below as the answer.
I have an RDD that contains a set of tuples of type (String, String), and since they aren't unique I want to see how many times each (String, String) combination occurs, so I use countByValue like so:
val PairCount = Pairs.countByValue().toSeq
which gives me output tuples like ((String, String), Long), where the Long is the number of times that the (String, String) tuple appeared.
These Strings can be repeated in different combinations, and I essentially want to run word count on this PairCount variable, so I tried something like this to start:
PairCount.map(x => (x._1._1, x._2))
But the output this spits out is String1->1, String2->1, String3->1, etc.
How do I output a key-value pair from a map job in this case, where the key is going to be one of the String values from the inner tuple, and the value is going to be the Long value from the outer tuple?
Update:
@vitalii gets me almost there. The answer gets me to a Seq[(String,Long)], but what I really need is to turn that into a map so that I can run reduceByKey on it afterwards. When I run
PairCount.flatMap{case((x,y),n) => Seq[x->n]}.toMap
for each unique x I get x->1.
For example, the above line of code generates mom->1 dad->1 even if the tuples out of the flatMap included (mom,30) (dad,59) (mom,2) (dad,14), in which case I would expect toMap to provide mom->30, dad->59, mom->2, dad->14. However, I'm new to Scala so I might be misinterpreting the functionality.
How can I get the Tuple2 sequence converted to a map so that I can reduce on the map keys?
If I understand the question correctly, you need flatMap:
val pairCountRDD = pairs.countByValue() // RDD[((String, String), Int)]
val res: RDD[(String, Int)] = pairCountRDD.flatMap { case ((s1, s2), n) =>
  Seq(s1 -> n, s2 -> n)
}
Update: I didn't quite understand what your final goal is, but here are a few more examples that may help. By the way, the code above is incorrect: I missed the fact that countByValue returns a Map, not an RDD.
val pairs = sc.parallelize(
  List(
    "mom" -> "dad", "dad" -> "granny", "foo" -> "bar", "foo" -> "baz", "foo" -> "foo"
  )
)

// don't use countByValue; if pairs is large you will run out of memory
val pairCountRDD = pairs.map(x => (x, 1)).reduceByKey(_ + _)

val wordCount = pairs.flatMap { case (a, b) => Seq(a -> 1, b -> 1) }.reduceByKey(_ + _)
wordCount.take(10)

// count in how many pairs each word occurs, as keys and as values:
val wordPairCount = pairs.flatMap { case (a, b) =>
  if (a == b) {
    Seq(a -> 1)
  } else {
    Seq(a -> 1, b -> 1)
  }
}.reduceByKey(_ + _)
wordPairCount.take(10)
To get the histograms for the (String, String) RDD I used this code,
val Hist_X  = histogram.map(x => (x._1 -> 1.0)).reduceByKey(_ + _).collect().toMap
val Hist_Y  = histogram.map(x => (x._2 -> 1.0)).reduceByKey(_ + _).collect().toMap
val Hist_XY = histogram.map(x => (x -> 1.0)).reduceByKey(_ + _)
where histogram was the (String,String) RDD

scala flatmap to get correct data structure

I have parsed the data and generated the following RDD:
x [RDD] = (458817,(CompactBuffer(20),CompactBuffer((837063182,0,1433142639864), (676690466,0,1433175090184), (4642913327036075112,1,1433177284025), (464291332,1,1433182403135), (4642913327036075112,0,1433185531150),
(464291332,0,1433186067803), (4642913327036075112,1,1433186266561), (851805971,0,1433190829047),
(6376558263039679112,1,1433203286945), (837063182,0,1433226615856), (8403476884799939112,0,1433287740066),
(764990231,0,1433289484047), (4642913327036075112,0,1433351165901), (464291332,1,1433351892238),
(4642913327036075112,0,1433374808826), (584492430,1,1433436093253))))
Here I am only showing one record from the RDD. My goal is to get the following RDD, where the first element is attached to each inner tuple:
(458817,837063182,0,1433142639864)
(458817,676690466,0,1433175090184)
(458817,464291332,1,1433177284025)
(458817,464291332,1,1433182403135)
(458817,464291332,0,1433185531150)
(458817,464291332,0,1433186067803)
(458817,464291332,1,1433186266561)
(458817,851805971,0,1433190829047)
(458817,637655826,1,1433203286945)
(458817,837063182,0,1433226615856)
By doing a flatMap I lose the first element and don't get access to it:
val r = x.map(l => l._2).flatMap(x => x._2).map(x => (x._1, x._2, x._3, x._4))
This would probably give you the wanted result:
val r = for {
  el <- Seq(x._1)
  (el1, el2, el3) <- x._2._2
} yield (el, el1, el2, el3)
Lift the first element to a Sequence to use it in the for expression.
Pull out the second CompactBuffer and yield the wanted tuples.
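Since x above is a single record, a hedged sketch for applying the same idea across the whole RDD (the name rdd and the exact record type are assumptions based on the sample shown) would keep the key while flattening:
val r = rdd.flatMap { case (id, (_, events)) =>
  events.map { case (e1, e2, e3) => (id, e1, e2, e3) } // events holds the (Long, Int, Long) triples
}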
This gave me the exact structure I wanted.
val s = r.map(x => (x._2._2).map(y => (x._1, y._1, y._2.toInt, y._3, y._4))).flatMap(k => k)