Scala - How to iterate over tuples on RDD?

I have an RDD that contains tuples like this
(A, List(2,5,6,7))
(B, List(2,8,9,10))
and I would like to get the index of the first element where a specific condition between value and index holds.
So far I have tried this on a single tuple test and it works fine:
test._2.zipWithIndex.indexWhere { case (v, i) => SOME_CONDITION}
I just can't figure out how to apply this to all the tuples in the RDD. I have tried:
val result= test._._2.zipWithIndex.indexWhere { case (v, i) => SOME_CONDITION}

First, "iterate" is the wrong concept here: it comes from the realm of imperative programming, where you actually iterate over the data structure yourself. Spark uses a functional paradigm, which lets you pass a function to handle each record in the RDD (using some higher-order function like map, foreach...).
In this case, it sounds like you want to map each element into a new element.
To map only the right-hand side of your tuples (without changing the left-hand side), you can use mapValues:
// mapValues will map the "values" (of type List[Int]) to new values (of type Int)
rdd.mapValues(list => list.zipWithIndex.indexWhere {
case (v, i) => someCondition(v, i)
})
Or, alternatively, using plain map:
rdd.map {
case (key, list) => (key, list.zipWithIndex.indexWhere {
case (v, i) => someCondition(v, i)
})
}
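To make the inner lambda concrete, here is a minimal sketch that runs on a plain Scala List instead of an RDD, so it needs no Spark setup. The condition used (the value exceeds its index by more than 5) is purely hypothetical and stands in for SOME_CONDITION; the same function body could be passed to rdd.mapValues.

```scala
// Hypothetical condition standing in for SOME_CONDITION:
// the value is more than 5 greater than its index.
def someCondition(v: Int, i: Int): Boolean = v > i + 5

// Same shape as the RDD records in the question, held in a local List.
val data = List(("A", List(2, 5, 6, 7)), ("B", List(2, 8, 9, 10)))

// For each tuple, keep the key and replace the list with the index of
// the first (value, index) pair that satisfies the condition.
val firstMatch = data.map { case (key, list) =>
  key -> list.zipWithIndex.indexWhere { case (v, i) => someCondition(v, i) }
}
```

Note that indexWhere returns -1 when no element matches, which is what happens for key A with this particular condition.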

Related

read a file in scala and get key value pairs as Map[String, List[String]]

I am reading a file and getting the records as a Map[String, List[String]] in Spark with Scala. I want to achieve the same thing in pure Scala, without any Spark references (not reading an RDD). What should I change to make it work in a pure Scala way?
rdd
.filter(x => (x != null) && (x.length > 0))
.zipWithIndex()
.map {
case (line, index) =>
val array = line.split("~").map(_.trim)
(array(0), array(1), index)
}
.groupBy(_._1)
.mapValues(x => x.toList.sortBy(_._3).map(_._2))
.collect
.toMap
For the most part it will remain the same except for the groupBy part in rdd. Scala List also has the map, filter, reduce etc. methods. So they can be used in almost a similar fashion.
import scala.io.Source
val lines = Source.fromFile("filename.txt").getLines.toList
Once the file is read and stored in List, the methods can be applied to it.
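For the groupBy part, one simple approach is to sort the tuples by key. That effectively clusters the tuples with the same key together (arr is assumed here to hold the (key, value, index) tuples).
// the comparison function takes two elements and compares their keys
val grouped = scala.util.Sorting.stableSort(arr,
(e1: (String, String, Int), e2: (String, String, Int)) => e1._1 < e2._1)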
There can be better solutions definitely, but this would effectively do the same task.
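Since Scala's List also provides groupBy, the Spark pipeline from the question can be carried over almost step for step. The following is a minimal sketch under that assumption; the in-memory lines value is made-up sample data standing in for the result of Source.fromFile(...).getLines.toList.

```scala
// Made-up sample input; in practice this would come from
// scala.io.Source.fromFile(...).getLines.toList
val lines = List("k1 ~ a", "", "k2 ~ b", "k1 ~ c")

val result: Map[String, List[String]] =
  lines
    .filter(l => l != null && l.nonEmpty)   // drop null and empty lines
    .zipWithIndex                           // remember the original line order
    .map { case (line, index) =>
      val fields = line.split("~").map(_.trim)
      (fields(0), fields(1), index)
    }
    .groupBy(_._1)                          // Map[key, List[(key, value, index)]]
    .map { case (key, rows) =>
      key -> rows.sortBy(_._3).map(_._2)    // values in original line order
    }
```

The final map over the grouped result replaces mapValues, collect and toMap from the RDD version, because List.groupBy already returns a local immutable Map.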
I came up with the approach below:
Source.fromInputStream(getClass.getResourceAsStream(filePath))
.getLines
.filter(lines => (lines != null) && (lines.length > 0))
.map(_.split("~"))
.toList
.groupBy(_(0))
.map { case (key, values) => (key, values.map(_(1))) }

groupBy on List as LinkedHashMap instead of Map

I am processing XML using scala, and I am converting the XML into my own data structures. Currently, I am using plain Map instances to hold (sub-)elements, however, the order of elements from the XML gets lost this way, and I cannot reproduce the original XML.
Therefore, I want to use LinkedHashMap instances instead of Map, however I am using groupBy on the list of nodes, which creates a Map:
For example:
def parse(n:Node): Unit =
{
val leaves:Map[String, Seq[XmlItem]] =
n.child
.filter(node => { ... })
.groupBy(_.label)
.map((tuple:Tuple2[String, Seq[Node]]) =>
{
val items = tuple._2.map(node =>
{
val attributes = ...
if (node.text.nonEmpty)
XmlItem(Some(node.text), attributes)
else
XmlItem(None, attributes)
})
(tuple._1, items)
})
...
}
In this example, I want leaves to be of type LinkedHashMap to retain the order of n.child. How can I achieve this?
Note: I am grouping by label/tagname because elements can occur multiple times, and for each label/tagname, I keep a list of elements in my data structures.
Solution
As answered by @jwvh, I am using foldLeft as a substitute for groupBy. Also, I decided to go with LinkedHashMap instead of ListMap.
def parse(n:Node): Unit =
{
val leaves:mutable.LinkedHashMap[String, Seq[XmlItem]] =
n.child
.filter(node => { ... })
.foldLeft(mutable.LinkedHashMap.empty[String, Seq[Node]])((m, sn) =>
{
m.update(sn.label, m.getOrElse(sn.label, Seq.empty[Node]) ++ Seq(sn))
m
})
.map((tuple:Tuple2[String, Seq[Node]]) =>
{
val items = tuple._2.map(node =>
{
val attributes = ...
if (node.text.nonEmpty)
XmlItem(Some(node.text), attributes)
else
XmlItem(None, attributes)
})
(tuple._1, items)
})
}
To get the rough equivalent to .groupBy() in a ListMap you could fold over your collection. The problem is that ListMap preserves the order of elements as they were appended, not as they were encountered.
import collection.immutable.ListMap
List('a','b','a','c').foldLeft(ListMap.empty[Char,Seq[Char]]){
case (lm,c) => lm.updated(c, c +: lm.getOrElse(c, Seq()))
}
//res0: ListMap[Char,Seq[Char]] = ListMap(b -> Seq(b), a -> Seq(a, a), c -> Seq(c))
To fix this you can foldRight instead of foldLeft. The result is the original order of elements as encountered (scanning left to right) but in reverse.
List('a','b','a','c').foldRight(ListMap.empty[Char,Seq[Char]]){
case (c,lm) => lm.updated(c, c +: lm.getOrElse(c, Seq()))
}
//res1: ListMap[Char,Seq[Char]] = ListMap(c -> Seq(c), b -> Seq(b), a -> Seq(a, a))
This isn't necessarily a bad thing since a ListMap is more efficient with last and init ops, O(1), than it is with head and tail ops, O(n).
To process the ListMap in the original left-to-right order you could .toList and .reverse it.
List('a','b','a','c').foldRight(ListMap.empty[Char,Seq[Char]]){
case (c,lm) => lm.updated(c, c +: lm.getOrElse(c, Seq()))
}.toList.reverse
//res2: List[(Char, Seq[Char])] = List((a,Seq(a, a)), (b,Seq(b)), (c,Seq(c)))
A purely immutable solution would be quite slow, so I'd go with:
import collection.mutable.{ArrayBuffer, LinkedHashMap}
implicit class ExtraTraversableOps[A](seq: collection.TraversableOnce[A]) {
def orderedGroupBy[B](f: A => B): collection.Map[B, collection.Seq[A]] = {
val map = LinkedHashMap.empty[B, ArrayBuffer[A]]
for (x <- seq) {
val key = f(x)
map.getOrElseUpdate(key, ArrayBuffer.empty) += x
}
map
}
} // closes ExtraTraversableOps
To use, just change .groupBy in your code to .orderedGroupBy.
The returned Map can't be mutated using this type (though it can be cast to mutable.Map or to mutable.LinkedHashMap), so it's safe enough for most purposes (and you could create a ListMap from it at the end if really desired).
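As a quick check of the ordering behaviour, here is a self-contained sketch of the same idea written as a plain function rather than an implicit class. It is an illustration, not the answer's exact code: the parameter type Iterable and the sample labels are assumptions chosen so the snippet compiles without further setup.

```scala
import scala.collection.mutable.{ArrayBuffer, LinkedHashMap}

// Order-preserving groupBy: keys appear in the order they are first seen.
def orderedGroupBy[A, B](xs: Iterable[A])(f: A => B): LinkedHashMap[B, ArrayBuffer[A]] = {
  val groups = LinkedHashMap.empty[B, ArrayBuffer[A]]
  xs.foreach { x =>
    // create the bucket on first sight of the key, then append
    groups.getOrElseUpdate(f(x), ArrayBuffer.empty[A]) += x
  }
  groups
}

// Hypothetical element labels, with "b" occurring twice.
val labels = List("b", "a", "b", "c")
val grouped = orderedGroupBy(labels)(identity)
val keysInOrder = grouped.keys.toList
```

Here keysInOrder follows first-encounter order (b, a, c), whereas a plain groupBy gives no ordering guarantee for its keys.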

How to Reduce by key in "Scala" [Not In Spark]

I am trying to reduce by key in Scala. Is there any method to reduce the values based on their keys in plain Scala? (I know we can do this with the reduceByKey method in Spark, but how do we do the same in Scala?)
The input data is:
val File = Source.fromFile("C:/Users/svk12/git/data/retail_db/order_items/part-00000")
.getLines()
.toList
val map = File.map(x => x.split(","))
.map(x => (x(1),x(4)))
map.take(10).foreach(println)
After the above step I am getting the result:
(2,250.0)
(2,129.99)
(4,49.98)
(4,299.95)
(4,150.0)
(4,199.92)
(5,299.98)
(5,299.95)
Expected result:
(2,379.99)
(5,499.93)
.......
Starting with Scala 2.13, you can use the groupMapReduce method, which is (as its name suggests) the equivalent of a groupBy followed by mapValues and a reduce step:
io.Source.fromFile("file.txt")
.getLines.to(LazyList)
.map(_.split(','))
.groupMapReduce(_(1))(_(4).toDouble)(_ + _)
The groupMapReduce stage:
groups the split arrays by their 2nd element (_(1)) (group part of groupMapReduce)
maps each array within each group to its 5th element, converted to Double (_(4).toDouble) (map part of groupMapReduce)
reduces the values within each group by summing them (_ + _) (reduce part of groupMapReduce).
This is a one-pass version of what could otherwise be written as:
seq.groupBy(_(1)).mapValues(_.map(_(4).toDouble).reduce(_ + _))
Also note the conversion from Iterator to LazyList in order to use a collection which provides groupMapReduce (we don't use a Stream since, starting with Scala 2.13, LazyList is the recommended replacement for Stream).
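To try groupMapReduce without a file, here is a small sketch on in-memory lines. It assumes Scala 2.13 or later. The CSV rows are made up for illustration (only the column positions match the question: order id at index 1, subtotal at index 4), and the amounts were chosen so the sums are exact in floating point.

```scala
// Made-up rows: column 1 is the order id, column 4 is the subtotal.
val lines = List(
  "1,2,1073,1,250.0",
  "2,2,502,5,129.5",
  "3,4,365,2,49.5",
  "4,4,1014,4,299.75",
  "5,4,957,1,150.0"
)

val totals: Map[String, Double] =
  lines
    .map(_.split(','))
    // group by order id, map each row to its subtotal, sum per group
    .groupMapReduce(_(1))(_(4).toDouble)(_ + _)
```

With this sample data, totals maps "2" to 379.5 and "4" to 499.25.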
It looks like you want the sum of some values from a file. One problem is that the file's contents are read as strings, so each String has to be converted to a number before it can be summed.
These are the steps you might use.
io.Source.fromFile("so.txt") //open file
.getLines() //read line-by-line
.map(_.split(",")) //each line is Array[String]
.toSeq //to something that can groupBy()
.groupBy(_(1)) //now is Map[String,Seq[Array[String]]]
.mapValues(_.map(_(4).toDouble).sum) //now is Map[String,Double] (the sample values have decimals)
.toSeq //un-Map it to (String,Double) tuples
.sorted //presentation order
.take(10) //sample
.foreach(println) //report
This will, of course, throw if any file data is not in the required format.
There is nothing built-in, but you can write it like this:
def reduceByKey[A, B](items: Traversable[(A, B)])(f: (B, B) => B): Map[A, B] = {
var result = Map.empty[A, B]
items.foreach {
case (a, b) =>
result += (a -> result.get(a).map(b1 => f(b1, b)).getOrElse(b))
}
result
}
There is some space to optimize this (e.g. use mutable maps), but the general idea remains the same.
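One way the mutable-map optimization could look is sketched below. This is an illustration rather than the answer's own code: the parameter type Iterable and the sample pairs are assumptions, and the mutable map stays local to the function so callers still receive an immutable Map.

```scala
import scala.collection.mutable

def reduceByKey[A, B](items: Iterable[(A, B)])(f: (B, B) => B): Map[A, B] = {
  // local mutable accumulator; never escapes this function
  val acc = mutable.Map.empty[A, B]
  items.foreach { case (a, b) =>
    acc(a) = acc.get(a) match {
      case Some(previous) => f(previous, b) // combine with the running value
      case None           => b              // first value seen for this key
    }
  }
  acc.toMap
}

// Made-up (key, amount) pairs for illustration.
val totals = reduceByKey(List(("2", 250.0), ("2", 129.5), ("4", 49.5)))(_ + _)
```

This avoids allocating a new immutable Map on every element while keeping the same signature shape and result.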
Another approach, more declarative but less efficient (it creates several intermediate collections; it can be rewritten to avoid them, but with a loss of clarity):
def reduceByKey[A, B](items: Traversable[(A, B)])(f: (B, B) => B): Map[A, B] = {
items
.groupBy { case (a, _) => a }
.mapValues(_.map { case (_, b) => b }.reduce(f))
// mapValues returns a view, view.force changes it back to a realized map
.view.force
}
First group the tuples by key (the first element here), convert the values to Double, and then reduce.
The following code will work:
val reducedList = map.groupBy(_._1).map(l => (l._1, l._2.map(_._2.toDouble).reduce(_+_)))
print(reducedList)
Here is another solution using foldLeft:
val File : List[String] = ???
File.map(x => x.split(","))
.map(x => (x(1),x(4).toDouble))
.foldLeft(Map.empty[String,Double]){case (state, (key,value)) => state.updated(key,state.getOrElse(key,0.0)+value)}
.toSeq
.sortBy(_._1)
.take(10)
.foreach(println)

How to define case class with a list of tuples and access the tuples in scala

I have a case class with a parameter a, which is a list of Int tuples. I want to iterate over a and define operations on it.
I have tried the following:
case class XType (a: List[(Int, Int)]) {
for (x <- a) {
assert(x._2 >= 0)
}
def op(): XType = {
for ( x <- XType(a))
yield (x._1, x._2)
}
}
However, I am getting the error:
"Value map is not a member of XType."
How can I access the integers of tuples and define operations on them?
You're running into an issue with for comprehensions, which are really another way of expressing things like foreach and map (and flatMap and withFilter/filter). See here and here for more explanation.
Your first for comprehension (the one with asserts) is equivalent to
a.foreach(x => assert(x._2 >= 0))
a is a List, x is an (Int, Int), everything's good.
However, the second one (in op) translates to
XType(a).map(x => (x._1, x._2))
which doesn't make sense--XType doesn't know what to do with map, like the error said.
An instance of XType refers to its a as simply a (or this.a), so a.map(x => x) would be just fine in op (and then turn the result into a new XType).
As a general rule, for comprehensions are handy for nested maps (or flatMaps or whatever), rather than as a 1-1 equivalent for for loops in other languages--just use map instead.
You can access the tuple list like this:
def op(): XType = {
XType(a.map(...))
}
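Putting this together, a complete version of the class might look like the sketch below. The operation performed in op (incrementing the first component of each tuple) is a hypothetical placeholder, since the question does not say what op should compute.

```scala
case class XType(a: List[(Int, Int)]) {
  // Validate on construction: every second component must be non-negative.
  a.foreach { case (_, second) => assert(second >= 0) }

  // Hypothetical operation: add 1 to the first component of each tuple,
  // then wrap the transformed list in a new XType.
  def op(): XType =
    XType(a.map { case (first, second) => (first + 1, second) })
}

val shifted = XType(List((1, 2), (3, 4))).op()
```

Pattern matching with case (first, second) gives the tuple components names, which tends to read better than _1 and _2.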

Scala: Grouping list of tuples

I need to group a list of tuples in a particular way.
For example, if I have
val l = List((1,2,3),(4,2,5),(2,3,3),(10,3,2))
then I need to group the list by the second value and map each group to the set of its first values.
So the result should be
Map(2 -> Set(1,4), 3 -> Set(2,10))
So far, I have come up with this:
l groupBy { p => p._2 } mapValues { v => (v map { vv => vv._1 }).toSet }
This works, but I believe there should be a much more efficient way...
This is similar to this question. Basically, as @serejja said, your approach is correct and also the most concise one. You could use collection.breakOut as the builder factory argument to the last map and thereby save the additional iteration needed to get the Set type:
l.groupBy(_._2).mapValues(_.map(_._1)(collection.breakOut): Set[Int])
You probably shouldn't go beyond this unless you really need to squeeze out more performance.
Otherwise, this is what a general toMultiMap function could look like, which lets you control the collection type of the values:
import collection.generic.CanBuildFrom
import collection.mutable
def toMultiMap[A, K, V, Values](xs: TraversableOnce[A])
(key: A => K)(value: A => V)
(implicit cbfv: CanBuildFrom[Nothing, V, Values]): Map[K, Values] = {
val b = mutable.Map.empty[K, mutable.Builder[V, Values]]
xs.foreach { elem =>
b.getOrElseUpdate(key(elem), cbfv()) += value(elem)
}
b.map { case (k, vb) => (k, vb.result()) } (collection.breakOut)
}
It uses a mutable Map during the building stage, with the values first gathered in a mutable Builder (the builder is provided by the CanBuildFrom instance). After the iteration over all input elements has completed, that mutable map of builders is converted into an immutable map of the values collection type (again using the collection.breakOut trick to get the desired output collection straight away).
Example:
val l = List((1,2,3),(4,2,5),(2,3,3),(10,3,2))
val v = toMultiMap(l)(_._2)(_._1) // uses Vector for values
val s: Map[Int, Set[Int]] = toMultiMap(l)(_._2)(_._1) // uses Set for values
So your annotated result type directs the type inference of the values type. If you do not annotate the result, Scala will pick Vector as default collection type.
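Note that collection.breakOut and CanBuildFrom were removed in Scala 2.13, so the snippets above target 2.12 and earlier. As a version-independent alternative, a single foldLeft can build the Map of Sets in one pass. This is a minimal sketch, not a drop-in replacement for toMultiMap, because the values collection type is fixed to Set here.

```scala
val l = List((1, 2, 3), (4, 2, 5), (2, 3, 3), (10, 3, 2))

// One pass over the list: key by the second component and
// accumulate the first components in a Set per key.
val grouped: Map[Int, Set[Int]] =
  l.foldLeft(Map.empty[Int, Set[Int]]) { case (acc, (first, second, _)) =>
    acc.updated(second, acc.getOrElse(second, Set.empty[Int]) + first)
  }
```

On Scala 2.13 and later, l.groupMap(_._2)(_._1) followed by a conversion of each group to a Set is another option.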