spark: value histogram is not a member of org.apache.spark.rdd.RDD[Option[Any]] - scala

I'm new to spark and scala and I've come up with a compile error with scala:
Let's say we have a rdd, which is a map like this:
val rawData = someRDD.map{
//some ops
Map(
"A" -> someInt_var1 //Int
"B" -> someInt_var2 //Int
"C" -> somelong_var //Long
)
}
Then, I want to get histogram info of these vars. So, here is my code:
rawData.map{row => row.get("A")}.histogram(10)
And the compile error says:
value histogram is not a member of org.apache.spark.rdd.RDD[Option[Any]]
I'm wondering why rawData.map{row => row.get("A")} is org.apache.spark.rdd.RDD[Option[Any]] and how to transform it to rdd[Int]?
I have tried like this:
rawData.map{row => row.get("A")}.map{_.toInt}.histogram(10)
But it compiles fail:
value toInt is not a member of Option[Any]
I'm totally confused and seeking for help here.

You get Option because Map.get returns an option; Map.get returns None if the key doesn't exist in the Map; And Option[Any] is also related to the miscellaneous data types of the Map's Value, you have both Int and Long, in my case it returns AnyVal instead of Any;
A possible solution is use getOrElse to get rid of Option by providing a default value when the key doesn't exist, and if you are sure A's value is always a int, you can convert it from AnyVal to Int using asInstanceOf[Int];
A simplified example as follows:
val rawData = sc.parallelize(Seq(Map("A" -> 1, "B" -> 2, "C" -> 4L)))
rawData.map(_.get("A"))
// res6: org.apache.spark.rdd.RDD[Option[AnyVal]] = MapPartitionsRDD[9] at map at <console>:27
rawData.map(_.getOrElse("A", 0).asInstanceOf[Int]).histogram(10)
// res7: (Array[Double], Array[Long]) = (Array(1.0, 1.0),Array(1))

Related

Scala Nested HashMaps, how to access Case Class value properties?

New to Scala, continue to struggle with Option related code. I have a HashMap built of Case Class instances that themselves contain hash maps with Case Class instance values. It is not clear to me how to access properties of the retrieved Class instances:
import collection.mutable.HashMap
case class InnerClass(name: String, age: Int)
case class OuterClass(name: String, nestedMap: HashMap[String, InnerClass])
// Load some data...hash maps are mutable
val innerMap = new HashMap[String, InnerClass]()
innerMap += ("aaa" -> InnerClass("xyz", 0))
val outerMap = new HashMap[String, OuterClass]()
outerMap += ("AAA" -> OuterClass("XYZ", innerMap))
// Try to retrieve data
val outerMapTest = outerMap.getOrElse("AAA", None)
val nestedMap = outerMapTest.nestedMap
This produces error: value nestedMap is not a member of Option[ScalaFiddle.OuterClass]
// Try to retrieve data a different way
val outerMapTest = outerMap.getOrElse("AAA", None)
val nestedMap = outerMapTest.nestedMap
This produces error: value nestedMap is not a member of Product with Serializable
Please advise on how I would go about getting access to outerMapTest.nestedMap. I'll eventually need to get values and properties out of the nestedMap HashMap as well.
Since you are using .getOrElse("someKey", None) which returns you a type Product (not the actual type as you expect to be OuterClass)
scala> val outerMapTest = outerMap.getOrElse("AAA", None)
outerMapTest: Product with Serializable = OuterClass(XYZ,Map(aaa -> InnerClass(xyz,0)))
so Product either needs to be pattern matched or casted to OuterClass
pattern match example
scala> outerMapTest match { case x : OuterClass => println(x.nestedMap); case _ => println("is not outerclass") }
Map(aaa -> InnerClass(xyz,0))
Casting example which is a terrible idea when outerMapTest is None, (pattern matching is favored over casting)
scala> outerMapTest.asInstanceOf[OuterClass].nestedMap
res30: scala.collection.mutable.HashMap[String,InnerClass] = Map(aaa -> InnerClass(xyz,0))
But better way of solving it would simply use .get which very smart and gives you Option[OuterClass],
scala> outerMap.get("AAA").map(outerClass => outerClass.nestedMap)
res27: Option[scala.collection.mutable.HashMap[String,InnerClass]] = Some(Map(aaa -> InnerClass(xyz,0)))
For key that does not exist, gives you None
scala> outerMap.get("I dont exist").map(outerClass => outerClass.nestedMap)
res28: Option[scala.collection.mutable.HashMap[String,InnerClass]] = None
Here are some steps you can take to get deep inside a nested structure like this.
outerMap.lift("AAA") // Option[OuterClass]
.map(_.nestedMap) // Option[HashMap[String,InnerClass]]
.flatMap(_.lift("aaa")) // Option[InnerClass]
.map(_.name) // Option[String]
.getOrElse("no name") // String
Notice that if either of the inner or outer maps doesn't have the specified key ("aaa" or "AAA" respectively) then the whole thing will safely result in the default string ("no name").
A HashMap will return None if a key is not found so it is unnecessary to do getOrElse to return None if the key is not found.
A simple solution to your problem would be to use get only as below
Change your first get as
val outerMapTest = outerMap.get("AAA").get
you can check the output as
println(outerMapTest.name)
println(outerMapTest.nestedMap)
And change the second get as
val nestedMap = outerMapTest.nestedMap.get("aaa").get
You can test the outputs as
println(nestedMap.name)
println(nestedMap.age)
Hope this is helpful
You want
val maybeInner = outerMap.get("AAA").flatMap(_.nestedMap.get("aaa"))
val maybeName = maybeInner.map(_.name)
Which if your feeling adventurous you can get with
val name: String = maybeName.get
But that will throw an error if its not there. If its a None
you can access the nestMap using below expression.
scala> outerMap.get("AAA").map(_.nestedMap).getOrElse(HashMap())
res5: scala.collection.mutable.HashMap[String,InnerClass] = Map(aaa -> InnerClass(xyz,0))
if "AAA" didnt exist in the outerMap Map object then the below expression would have returned an empty HashMap as indicated in the .getOrElse method argument (HashMap()).

Typesafe keys for a map

Given the following code:
val m: Map[String, Int] = .. // fetch from somewhere
val keys: List[String] = m.keys.toList
val keysSubset: List[String] = ... // choose random keys
We can define the following method:
def sumValues(m: Map[String, Int], ks: List[String]): Int =
ks.map(m).sum
And call this as:
sumValues(m, keysSubset)
However, the problem with sumValues is that if ks happens to have a key not present on the map, the code will still compile but throw an exception at runtime. Ex:
// assume m = Map("two" -> 2, "three" -> 3)
sumValues(m, 1 :: Nil)
What I want instead is a definition for sumValues such that the ks argument should, at compile time, be guaranteed to only contain keys that are present on the map. As such, my guess is that the existing sumValues type signature needs to accept some form of implicit evidence that the ks argument is somehow derived from the list of keys of the map.
I'm not limited to a scala Map however, as any record-like structure would do. The map structure however won't have a hardcoded value, but something derived/passed on as an argument.
Note: I'm not really after summing the values, but more of figuring out a type signature for sumValues whose calls to it can only compile if the ks argument is provably from the list of keys the map (or record-like structure).
Another solution could be to map only the intersection (i.e. : between m keys and ks).
For example :
scala> def sumValues(m: Map[String, Int], ks: List[String]): Int = {
| m.keys.filter(ks.contains).map(m).sum
| }
sumValues: (m: Map[String,Int], ks: List[String])Int
scala> val map = Map("hello" -> 5)
map: scala.collection.immutable.Map[String,Int] = Map(hello -> 5)
scala> sumValues(map, List("hello", "world"))
res1: Int = 5
I think this solution is better than providing a default value because more generic (i.e. : you can use it not only with sums). However, I guess that this solution is less effective in term of performance because the intersection.
EDIT : As #jwvh pointed out in it message below, ks.intersect(m.keys.toSeq).map(m).sum is, to my opinion, more readable than m.keys.filter(ks.contains).map(m).sum.

How do I append to a listbuffer which is a value of a mutable map in Scala?

val mymap= collection.mutable.Map.empty[String,Seq[String]]
mymap("key") = collection.mutable.ListBuffer("a","b")
mymap.get("key") += "c"
The last line to append to the list buffer is giving error. How the append can be done ?
When you run the code in the scala console:
→$scala
scala> val mymap= collection.mutable.Map.empty[String,Seq[String]]
mymap: scala.collection.mutable.Map[String,Seq[String]] = Map()
scala> mymap("key") = collection.mutable.ListBuffer("a","b")
scala> mymap.get("key")
res1: Option[Seq[String]] = Some(ListBuffer(a, b))
You'll see that mymap.get("key") is an optional type. You can't add a string to the optional type.
Additionally, since you typed mymap to Seq[String], Seq[String] does not have a += operator taking in a String.
The following works:
val mymap= collection.mutable.Map.empty[String,collection.mutable.ListBuffer[String]]
mymap("key") = collection.mutable.ListBuffer("a","b")
mymap.get("key").map(_ += "c")
Using the .map function will take advantage of the optional type and prevent noSuchElementException as Łukasz noted.
To deal with your problems one at a time:
Map.get returns an Option[T] and Option does not provide a += or + method.
Even if you use Map.apply (mymap("key")) the return type of apply will be V (in this case Seq) regardless of what the actual concrete type is (Vector, List, Set, etc.). Seq does not provide a += method, and its + method expects another Seq.
Given that, to get what you want you need to declare the type of the Map to be a mutable type:
import collection.mutable.ListBuffer
val mymap= collection.mutable.Map.empty[String,ListBuffer[String]]
mymap("key") = ListBuffer("a","b")
mymap("key") += "c"
will work as you expect it to.
If you really want to have immutable value, then something like this should also work:
val mymap= collection.mutable.Map.empty[String,Seq[String]]
mymap("key") = Vector("a","b")
val oldValue = mymap.get("key").getOrElse(Vector[String]())
mymap("key") = oldValue :+ "c"
I used Vector here, because adding elements to the end of List is unefficient by design.

Scala: Default Value for a Map of Tuples

Look at the following Map:
scala> val v = Map("id" -> ("_id", "$oid")).withDefault(identity)
v: scala.collection.immutable.Map[String,java.io.Serializable] = Map(id -> (_id,$oid))
The compiler generates a Map[String,java.io.Serializable] and the value of id can be retrieved like this:
scala> v("id")
res37: java.io.Serializable = (_id,$oid)
Now, if I try to access an element that does not exist like this...
scala> v("idx")
res45: java.io.Serializable = idx
... then as expected I get back the key itself... but how do I get back a tuple with the key itself and an empty string like this?
scala> v("idx")
resXX: java.io.Serializable = (idx,"")
I always need to get back a tuple, regardless of whether or not the element exists.
Thanks.
Instead of .withDefault(identity) you can use
val v = Map("id" -> ("_id", "$oid")).withDefault(x => (x, ""))
withDefault takes as a parameter a function that will create the default value when needed.
This will also change the return type from useless Serializable to more useful (String, String).

In Scala, how can I check if a generic HashMap contains a particular key?

I'm trying to do something like the following
def defined(hash: HashMap[T, U], key: [T) {
hash.contains(key)
}
The above does not compile because my syntax is incorrect. Is it possible to check if a HashMap of unknown type contains a given key?
Other than the stray "[" I don't think you have a syntax error. That and you need an "=" before your braces, or the function won't be returning the bool. And since there is only one expression, no need for braces...
import scala.collection.mutable._
object Main extends App {
def defined[T,U](hash: HashMap[T, U], key: T) = hash.contains(key)
val m = new HashMap[String,Int]
m.put("one", 1)
m.put("two", 2)
println(defined(m, "one"))
println(m contains "two")
println(defined(m, "three"))
}
There's no need to define your own; because all Maps, and all HashMaps in particular, are PartialFunctions, you can use the isDefinedAt method:
scala> val map = HashMap(1->(), 2->())
map: scala.collection.mutable.HashMap[Int,Unit] = Map(1 -> (), 2 -> ())
scala> map.isDefinedAt(2)
res9: Boolean = true
scala> map.isDefinedAt(3)
res10: Boolean = false
Also there is the contains method which is particular to MapLike objects but does the same thing.
Possibly your solution would be something like this:
def detect[K,V]( map : Map[K,V], value : K ) : Boolean = {
map.keySet.contains( value )
}
You declare the generic parameters after the method name and then use them as the types at your parameters.