working with RDDS and collections in Scala - scala

I have the below function that takes an array of means of type Likes (type Likes=Int)
and an RDD of numbers of type Likes (likesVector). For each number in the likesVector RDD, it computes the distance from each mean in means array and chooses the mean which has the least distance (val distance = (mean-number).abs). While I expect a result of type Map[Likes,Array[Likes]], I get an empty map. Map[Likes,Array[Likes]] represents (mean->Array of number-nearest numbers).
What is the best way to achieve this? I suspect it has a lot to do with the mutability of Scala collections.
def assignDataPoints(means:Array[Likes],likesVector:RDD[Likes]): Map[Likes,Array[Likes]] ={
var likes_Mean = IntMap(1->1)
var likes_mean_final = mutable.Map.empty[Likes,Array[Likes]]
likesVector.map(dataPoint => {
means.foldLeft(Array.empty[Likes])( (accumulator, mean)=> {
val dist= computeDistance(dataPoint,mean)
val nearestMean = if (dist < accumulator(0)) {
accumulator(0)=dist
accumulator(0)
} else{
accumulator(0)
}
val b= IntMap(nearestMean.toInt -> dataPoint)
println("b:"+ b)
likes_mean_final ++ likes_Mean.intersectionWith(b,(_, av, bv: Likes) => Array(av, bv))
accumulator
})})
likes_mean_final.toMap
}

The reason for your empty map is that you are using ++ operation here in the line:
likes_mean_final ++ likes_Mean.intersectionWith(b,(_, av, bv: Likes) => Array(av, bv))
++ operation returns a new map hence we are creating a new object and not changes the current value
The operation for mutating the current mutable map is ++=. So you should use:
likes_mean_final ++= likes_Mean.intersectionWith(b,(_, av, bv: Likes) => Array(av, bv))
Also no need to use var in your context as you are not reassigning values to the same references.
But you should not use mutability in Scala.

Related

appending elements to list of list in scala

i have created a empty scala mutable list
import scala.collection.mutable.ListBuffer
val list_of_list : List[List[String]] = List.empty
i want to append elements to it as below
filtered_df.collect.map(
r => {
val val_list = List(r(0).toString,r(4).toString,r(5).toString)
list_of_list += val_list
}
)
error that i am getting is
Error:(113, 26) value += is not a member of List[List[String]]
Expression does not convert to assignment because receiver is not assignable.
list_of_list += val_list
Can someone help
Your declaration seems wrong:
val list_of_list : List[List[String]] = List.empty
means that you've declared scala.collection.immutable.List whose operations return a new list without changing the current.
To fix the error you need to change the outer List type to ListBuffer that you imported above the declaration as follows:
val list_of_list : ListBuffer[List[String]] = ListBuffer.empty
Also it looks like you don't to use map here unless you want to modify your data collected from DataFrame, so you can change it to foreach:
filtered_df.collect.foreach {
r => {
val val_list = List(r(0).toString,r(4).toString,r(5).toString)
list_of_list += val_list
}
}
Furthermore you can make it in a functional way without resorting to ListBuffer, by using immutable List and foldRight as follows:
val list_of_list: List[List[String]] =
filtered_df.collect.toList
.foldRight(List.empty[List[String]])((r, acc) => List(r(0).toString,r(4).toString,r(5).toString) :: acc)
toList is used to achieve a stack safety when calling foldRight, because it's not stack safe for Arrays
More info about foldLeft and foldRight
You have to change that val list_of_list to var list_of_list. That alone would not be enough as you also have to change the type of list_of_list into a mutable alternative.

groupBy on List as LinkedHashMap instead of Map

I am processing XML using scala, and I am converting the XML into my own data structures. Currently, I am using plain Map instances to hold (sub-)elements, however, the order of elements from the XML gets lost this way, and I cannot reproduce the original XML.
Therefore, I want to use LinkedHashMap instances instead of Map, however I am using groupBy on the list of nodes, which creates a Map:
For example:
def parse(n:Node): Unit =
{
val leaves:Map[String, Seq[XmlItem]] =
n.child
.filter(node => { ... })
.groupBy(_.label)
.map((tuple:Tuple2[String, Seq[Node]]) =>
{
val items = tuple._2.map(node =>
{
val attributes = ...
if (node.text.nonEmpty)
XmlItem(Some(node.text), attributes)
else
XmlItem(None, attributes)
})
(tuple._1, items)
})
...
}
In this example, I want leaves to be of type LinkedHashMap to retain the order of n.child. How can I achieve this?
Note: I am grouping by label/tagname because elements can occur multiple times, and for each label/tagname, I keep a list of elements in my data structures.
Solution
As answered by #jwvh I am using foldLeft as a substitution for groupBy. Also, I decided to go with LinkedHashMap instead of ListMap.
def parse(n:Node): Unit =
{
val leaves:mutable.LinkedHashMap[String, Seq[XmlItem]] =
n.child
.filter(node => { ... })
.foldLeft(mutable.LinkedHashMap.empty[String, Seq[Node]])((m, sn) =>
{
m.update(sn.label, m.getOrElse(sn.label, Seq.empty[Node]) ++ Seq(sn))
m
})
.map((tuple:Tuple2[String, Seq[Node]]) =>
{
val items = tuple._2.map(node =>
{
val attributes = ...
if (node.text.nonEmpty)
XmlItem(Some(node.text), attributes)
else
XmlItem(None, attributes)
})
(tuple._1, items)
})
To get the rough equivalent to .groupBy() in a ListMap you could fold over your collection. The problem is that ListMap preserves the order of elements as they were appended, not as they were encountered.
import collection.immutable.ListMap
List('a','b','a','c').foldLeft(ListMap.empty[Char,Seq[Char]]){
case (lm,c) => lm.updated(c, c +: lm.getOrElse(c, Seq()))
}
//res0: ListMap[Char,Seq[Char]] = ListMap(b -> Seq(b), a -> Seq(a, a), c -> Seq(c))
To fix this you can foldRight instead of foldLeft. The result is the original order of elements as encountered (scanning left to right) but in reverse.
List('a','b','a','c').foldRight(ListMap.empty[Char,Seq[Char]]){
case (c,lm) => lm.updated(c, c +: lm.getOrElse(c, Seq()))
}
//res1: ListMap[Char,Seq[Char]] = ListMap(c -> Seq(c), b -> Seq(b), a -> Seq(a, a))
This isn't necessarily a bad thing since a ListMap is more efficient with last and init ops, O(1), than it is with head and tail ops, O(n).
To process the ListMap in the original left-to-right order you could .toList and .reverse it.
List('a','b','a','c').foldRight(ListMap.empty[Char,Seq[Char]]){
case (c,lm) => lm.updated(c, c +: lm.getOrElse(c, Seq()))
}.toList.reverse
//res2: List[(Char, Seq[Char])] = List((a,Seq(a, a)), (b,Seq(b)), (c,Seq(c)))
Purely immutable solution would be quite slow. So I'd go with
import collection.mutable.{ArrayBuffer, LinkedHashMap}
implicit class ExtraTraversableOps[A](seq: collection.TraversableOnce[A]) {
def orderedGroupBy[B](f: A => B): collection.Map[B, collection.Seq[A]] = {
val map = LinkedHashMap.empty[B, ArrayBuffer[A]]
for (x <- seq) {
val key = f(x)
map.getOrElseUpdate(key, ArrayBuffer.empty) += x
}
map
}
To use, just change .groupBy in your code to .orderedGroupBy.
The returned Map can't be mutated using this type (though it can be cast to mutable.Map or to mutable.LinkedHashMap), so it's safe enough for most purposes (and you could create a ListMap from it at the end if really desired).

Adding key-value pairs to Map in scala

I am very new to scala, and want to create a hash map with the key being a candidate, and value being number of votes. Something like this: {(1:20),(2:4),(3:42),..}.
I`ve attempted with the following code:
val voteTypeList = textFile.map(x=>x(2)) //String array containing votes: [3,4,2,3,2,1,1,1,9,..]
var voteCount:Map[String,Int] = Map()
voteTypeList.foreach{x=>{
if (voteCount.contains(x)){ //Increment value
var i: Integer = voteCount(x)
voteCount.updated(x, i+1)
// print(voteCount(x))
}
else{ //Create new key-value pair
// println(x)
voteCount += (x -> 1)
}}}
print(voteCount.size)
But the voteCount does not get created and .size returns 0.
Thank you!
The problem you're encountering is caused by using a var to hold an immutable Map. Change that to a val holding a mutable Map and it works.
val voteCount:collection.mutable.Map[String,Int] = collection.mutable.Map()
Having said that, there are a number of other issues with the code that makes it non-idiomatic to the Scala way of doing things.
What you really want is something closer to this.
val voteCount = voteTypeList.groupBy(identity).mapValues(_.length)
Jwvh's answer is idiomatic but non-obvious and, if the list has very large numbers of duplicates, memory-heavy.
You might also consider the literal-minded (yet still Scalarific) way of doing this:
val voteCount = voteTypeList.foldLeft(Map(): Map[Int, Int]) { (v, x) =>
v + (x -> (v.getOrElse(x, 0) + 1))
}

Appending element to list in Scala

val indices: List[Int] = List()
val featValues: List[Double] = List()
for (f <- feat) {
val q = f.split(':')
if (q.length == 2) {
println(q.mkString("\n")) // works fine, displays info
indices :+ (q(0).toInt)
featValues :+ (q(1).toDouble)
}
}
println(indices.mkString("\n") + indices.length) // prints nothing and 0?
indices and featValues are not being filled. I'm at a loss here.
You cannot append anything to an immutable data structure such as List stored in a val (immutable named slot).
What your code is doing is creating a new list every time with one element appended, and then throwing it away (by not doing anything with it) — the :+ method on lists does not modify the list in place (even when it's a mutable list such as ArrayBuffer) but always returns a new list.
In order to achieve what you want, the quickest way (as opposed to the right way) is either to use a var (typically preferred):
var xs = List.empty[Int]
xs :+= 123 // same as `xs = xs :+ 123`
or a val containing a mutable collection:
import scala.collection.mutable.ArrayBuffer
val buf = ArrayBuffer.empty[Int]
buf += 123
However, if you really want to make your code idiomatic, you should instead just use a functional approach:
val indiciesAndFeatVals = feat.map { f =>
val Array(q0, q1) = f.split(':') // pattern matching in action
(q0.toInt, q1.toDouble)
}
which will give you a sequence of pairs, which you can then unzip to 2 separate collections:
val (indicies, featVals) = indiciesAndFeatVals.unzip
This approach will avoid the use of any mutable data structures as well as vars (i.e. mutable slots).

Scala: Grouping list of tuples

I need to group list of tuples in some unique way.
For example, if I have
val l = List((1,2,3),(4,2,5),(2,3,3),(10,3,2))
Then I should group the list with second value and map with the set of first value
So the result should be
Map(2 -> Set(1,4), 3 -> Set(2,10))
By so far, I came up with this
l groupBy { p => p._2 } mapValues { v => (v map { vv => vv._1 }).toSet }
This works, but I believe there should be a much more efficient way...
This is similar to this question. Basically, as #serejja said, your approach is correct and also the most concise one. You could use collection.breakOut as builder factory argument to the last map and thereby save the additional iteration to get the Set type:
l.groupBy(_._2).mapValues(_.map(_._1)(collection.breakOut): Set[Int])
You shouldn't probably go beyond this, unless you really need to squeeze the performance.
Otherwise, this is how a general toMultiMap function could look like which allows you to control the values collection type:
import collection.generic.CanBuildFrom
import collection.mutable
def toMultiMap[A, K, V, Values](xs: TraversableOnce[A])
(key: A => K)(value: A => V)
(implicit cbfv: CanBuildFrom[Nothing, V, Values]): Map[K, Values] = {
val b = mutable.Map.empty[K, mutable.Builder[V, Values]]
xs.foreach { elem =>
b.getOrElseUpdate(key(elem), cbfv()) += value(elem)
}
b.map { case (k, vb) => (k, vb.result()) } (collection.breakOut)
}
What it does is, it uses a mutable Map during building stage, and values gathered in a mutable Builder first (the builder is provided by the CanBuildFrom instance). After the iteration over all input elements has completed, that mutable map of builder values is converted into an immutable map of the values collection type (again using the collection.breakOut trick to get the desired output collection straight away).
Ex:
val l = List((1,2,3),(4,2,5),(2,3,3),(10,3,2))
val v = toMultiMap(l)(_._2)(_._1) // uses Vector for values
val s: Map[Int, Set[Int] = toMultiMap(l)(_._2)(_._1) // uses Set for values
So your annotated result type directs the type inference of the values type. If you do not annotate the result, Scala will pick Vector as default collection type.