Treating an Array of (String, Int) Tuples like a Dictionary - scala

I have a list of tuples that looks like the following:
(("String1", Value1), ("String2", Value2), ...)
Where the string is a String and the value is a Double. Is there a method in Scala to accomplish the following:
1) Search the list for a particular string value.
2) If we have a hit, return the value associated with the string.
3) If we have a miss, return -1.
This tuple sequence was created using collect on an RDD of format RDD[K, V], where the keys were strings and the values were doubles. Originally I planned on using lookup on the RDD, but it appears that this work needs to be done on the driver (hence the collect).

Maybe you can try converting it to a Map first:
scala> val collection = Map(("hello", 1), ("world", 2))
collection: scala.collection.immutable.Map[String,Int] = Map(hello -> 1, world -> 2)
scala> collection getOrElse ("hello", -1)
res3: Int = 1
scala> collection getOrElse ("scala", -1)
res4: Int = -1

val m = list.toMap.withDefaultValue(-1d)
// 1 and 2
m("String1") // Value1
// 3
m("Some other") // -1d

Related

Add element to specific index of list

What is the best way of adding to a specific index of a list in Scala?
This is what I have tried:
case class Level(price: Double)
case class Order(levels: Seq[Level] = Seq())
def process(order: Order) {
  orderBook.levels.updated(0, Level(0.0))
}
I was hoping to insert the new Level at position zero, but it just throws java.lang.IndexOutOfBoundsException: 0. What is the best way of handling this? Is there a better data type other than Seq which should be used for keeping track of indexes in a list?
Is there a better data type other than Seq which should be used for
keeping track of indexes in a list?
Yes, a Vector[T] is recommended when you want random access into the underlying collection:
scala> val vector = Vector(1,2,3)
vector: scala.collection.immutable.Vector[Int] = Vector(1, 2, 3)
scala> vector.updated(0, 5)
res3: scala.collection.immutable.Vector[Int] = Vector(5, 2, 3)
Note that a Vector will also throw an IndexOutOfBoundsException when you try to update an index of an empty Vector. A good way of appending and prepending data is using :+ and +:, respectively:
scala> val vec = Vector[Int]()
vec: scala.collection.immutable.Vector[Int] = Vector()
scala> vec :+ 1 :+ 2
res7: scala.collection.immutable.Vector[Int] = Vector(1, 2)
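For completeness, prepending (i.e. inserting at position zero, as in the question) looks like this; a small REPL sketch with illustrative values (the res number is made up):
scala> 0 +: Vector(1, 2)
res8: scala.collection.immutable.Vector[Int] = Vector(0, 1, 2)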

How to add all the values of a map without using recursion or var

I want to add all the values in a map without using var or any mutable structures. I have tried to do something like this but it doesn't work:
val mymap = ("a" -> 1, "b" -> 2)
val sum_of_alcohol_consumption =
for ((k,v) <- mymap ) yield (sum_of_alcohol_consumption += v)
I have been told that I can use .sum on a list
Please help
Thanks
You can use the .values function of a Map to return an Iterable of its values (all of the Integers) and then call the .sum function on that:
val myMap = Map("a" -> 1, "b" -> 2)
val sum = myMap.values.sum
println(sum) // Outputs: 3
An equivalent answer to the more elegant use of sum is to use a fold operation. sum is implemented in a manner similar to this:
val myMap = Map("a" -> 1, "b" -> 2)
val sumAlcoholConsumption = myMap.values.foldLeft(0)(_ + _)
values returns a sequence of only the values in the map. The first foldLeft argument is the zero value (think of it as the initial value of the accumulator). The second argument is a function that adds the current accumulator value to the current element and returns the sum; it is applied to each value in turn. That said, sum is a lot more convenient.
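To make the accumulator idea concrete, here is roughly how the fold unrolls for the two values above (a sketch of the evaluation order, not the literal library implementation):
// values are 1 and 2, zero value is 0
// foldLeft(0)(_ + _)  ~  (0 + 1) + 2  ==  3
List(1, 2).foldLeft(0)(_ + _) // 3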
To get only the values of the map, Map provides a values function that returns an Iterable; we can apply the sum function to it directly.
scala> val mymap = Map("a" -> 1, "b" -> 2)
mymap: scala.collection.immutable.Map[String,Int] = Map(a -> 1, b -> 2)
scala> mymap.values.sum
res7: Int = 3

Sum of Values based on key in scala

I am new to Scala. I have a List of tuples of Integers:
val list = List((1,2,3),(2,3,4),(1,2,3))
val sum = list.groupBy(_._1).mapValues(_.map(_._2)).sum
val sum2 = list.groupBy(_._1).mapValues(_.map(_._3)).sum
How do I do this for N values? I tried the above but it's not a good way to sum N values based on a key.
Also I have tried this:
val sum =list.groupBy(_._1).values.sum => error
val sum =list.groupBy(_._1).mapvalues(_.map(_._2).sum (_._3).sum) error
It's easier to convert these tuples to List[Int] with shapeless and then work with them. Your tuples are actually more like lists anyways. Also, as a bonus, you don't need to change your code at all for lists of Tuple4, Tuple5, etc.
import shapeless._, syntax.std.tuple._
val list = List((1,2,3),(2,3,4),(1,2,3))
list.map(_.toList) // convert tuples to list
.groupBy(_.head) // group by first element of list
.mapValues(_.map(_.tail).map(_.sum).sum) // sums elements of all tails
Result is Map(2 -> 7, 1 -> 10).
val sum = list.groupBy(_._1).map(i => (i._1, i._2.map(j => j._1 + j._2 + j._3).sum))
> sum: scala.collection.immutable.Map[Int,Int] = Map(2 -> 9, 1 -> 12)
Since a tuple can't be type-safely converted to a List, you need to add the elements one by one, as in j._1 + j._2 + j._3.
Using the first element in the tuple as the key and the remaining elements as the values to sum, you could do something like this:
val list = List((1,2,3),(2,3,4),(1,2,3))
list: List[(Int, Int, Int)] = List((1, 2, 3), (2, 3, 4), (1, 2, 3))
val sum = list.groupBy(_._1).map { case (k, v) => (k -> v.flatMap(_.productIterator.toList.drop(1).map(_.asInstanceOf[Int])).sum) }
sum: Map[Int, Int] = Map(2 -> 7, 1 -> 10)
I know it's a bit dirty to do asInstanceOf[Int], but when you do .productIterator you get an Iterator of Any. This will work for any tuple size.
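For example, the same expression works unchanged on a list of Tuple4 (illustrative values):
val list4 = List((1,2,3,4), (2,3,4,5), (1,2,3,4))
val sum4 = list4.groupBy(_._1).map { case (k, v) => (k -> v.flatMap(_.productIterator.toList.drop(1).map(_.asInstanceOf[Int])).sum) }
// sum4: Map[Int,Int] = Map(2 -> 12, 1 -> 18)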

Spark RDD tuple transformation

I'm trying to transform an RDD of tuples of Strings of this format:
(("abc","xyz","123","2016-02-26T18:31:56"),"15") TO
(("abc","xyz","123"),"2016-02-26T18:31:56","15")
Basically separating out the timestamp string as a separate tuple element. I tried the following but it's still not clean and correct.
val result = rdd.map(r => (r._1.toString.split(",").toVector.dropRight(1).toString, r._1.toString.split(",").toList.last.toString, r._2))
However, it results in
(Vector(("abc", "xyz", "123"),"2016-02-26T18:31:56"),"15")
The expected output I'm looking for is
(("abc", "xyz", "123"),"2016-02-26T18:31:56","15")
This way I can access the elements using r._1, r._2 (the timestamp string) and r._3 in a separate map operation.
Any hints/pointers will be greatly appreciated.
Vector.toString will include the String 'Vector' in its result. Instead, use Vector.mkString(",").
Example:
scala> val xs = Vector(1,2,3)
xs: scala.collection.immutable.Vector[Int] = Vector(1, 2, 3)
scala> xs.toString
res25: String = Vector(1, 2, 3)
scala> xs.mkString
res26: String = 123
scala> xs.mkString(",")
res27: String = 1,2,3
However, if you want to be able to access (abc,xyz,123) as a Tuple and not as a string, you could also do the following:
val res = rdd.map {
  case ((a: String, b: String, c: String, ts: String), d: String) => ((a, b, c), ts, d)
}
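With that shape you can then access the pieces exactly as described in the question, e.g. a quick sketch using the res value from above:
res.map(r => r._2)          // just the timestamp strings
res.map(r => (r._1, r._3))  // (("abc","xyz","123"), "15")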

Why Spark doesn't allow map-side combining with array keys?

I'm using Spark 1.3.1 and I'm curious why Spark doesn't allow using array keys on map-side combining.
A snippet from the combineByKey function:
if (keyClass.isArray) {
  if (mapSideCombine) {
    throw new SparkException("Cannot use map-side combining with array keys.")
  }
}
Basically for the same reason the default partitioner cannot partition array keys.
A Scala Array is just a wrapper around a Java array, and its hashCode doesn't depend on its content:
scala> val x = Array(1, 2, 3)
x: Array[Int] = Array(1, 2, 3)
scala> val h = x.hashCode
h: Int = 630226932
scala> x(0) = -1
scala> x.hashCode() == h
res3: Boolean = true
It means that two arrays with exactly the same content are not equal:
scala> x
res4: Array[Int] = Array(-1, 2, 3)
scala> val y = Array(-1, 2, 3)
y: Array[Int] = Array(-1, 2, 3)
scala> y == x
res5: Boolean = false
As a result, Arrays cannot be used as meaningful keys. If you're not convinced, just check what happens when you use an Array as a key for a Scala Map:
scala> Map(Array(1) -> 1, Array(1) -> 2)
res7: scala.collection.immutable.Map[Array[Int],Int] = Map(Array(1) -> 1, Array(1) -> 2)
If you want to use a collection as a key, you should use an immutable data structure like a Vector or a List.
scala> Map(Array(1).toVector -> 1, Array(1).toVector -> 2)
res15: scala.collection.immutable.Map[Vector[Int],Int] = Map(Vector(1) -> 2)
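In the Spark case that usually means converting the array key to an immutable collection before the byKey operation. A rough sketch, assuming pairs is a hypothetical RDD[(Array[Int], Int)]:
val byVectorKey = pairs.map { case (k, v) => (k.toVector, v) }
val reduced = byVectorKey.reduceByKey(_ + _) // map-side combining works now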
See also:
SI-1607
How does HashPartitioner work?
A list as a key for PySpark's reduceByKey