How to read from csv to a ListMap - scala

I have been assigned the task of reading from a CSV file and creating a ListMap variable. The reason for using this specific class is that several existing methods already take a ListMap as an input parameter, and they want one more.
What I have done so far is read from the CSV and create an RDD.
The format of the csv is
"field1,field2"
"value1,value2"
"value3,value4"
This RDD holds tuples of strings, which I would now like to convert to a ListMap. So what I have is a variable of type Array[(String, String)] containing (value1,value2), (value3,value4).
I did this because I find it easy to go from a CSV to tuples. The problem is that I cannot find any way to go from here to a ListMap. It seems easier to get a normal Map, but as I said, the final result is required to be a ListMap.
I have been reading, but I do not really understand this answer or this one.

Based on the sample data you provided, you can use the collectAsMap API to collect the pairs to the driver as a Map:
val rdd = sparkSession.sparkContext.textFile("path to the text file")
.map(line => line.split(","))
.map(array => array(0) -> array(1))
.collectAsMap()
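That's it.
If you need a ListMap specifically, you can add one more step (note that rdd here is already a local Map, not an RDD, and that collectAsMap does not preserve the order of the lines in the file):
import scala.collection.immutable.ListMap
var listMap: ListMap[String, String] = ListMap.empty[String, String]
for (entry <- rdd) {
  listMap += entry
}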
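If the file is small enough to handle on the driver, the whole conversion can also be sketched with plain Scala collections. The snippet below is only an illustration under a few assumptions: the lines are already in memory (for example after a collect()), the first line is a header, and each line may be wrapped in double quotes as in the question. CsvToListMap and parse are hypothetical names.

```scala
import scala.collection.immutable.ListMap

object CsvToListMap {
  // Assumption: every data line holds exactly two comma-separated fields.
  def parse(lines: Seq[String]): ListMap[String, String] =
    lines
      .drop(1)                                          // skip the header line
      .map(_.trim.stripPrefix("\"").stripSuffix("\""))  // drop surrounding quotes
      .map(_.split(",", 2))
      .collect { case Array(k, v) => k -> v }           // ignore malformed lines
      .foldLeft(ListMap.empty[String, String])((acc, kv) => acc + kv)
}
```

Because the entries are added one by one in file order, the resulting ListMap iterates in the same order as the lines of the file.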

Array("foo" -> "bar", "baz" -> "bat").toMap gives you a Map.
If you are looking for a ListMap specifically (for the life of me, can't think of a reason why you would), then you need a breakOut:
import scala.collection.immutable.ListMap
val map: ListMap[String, String] =
  Array("foo" -> "bar", "baz" -> "bat")
    .toMap
    .map(identity)(scala.collection.breakOut)
breakOut is sort of a "collection factory" that lets you implicitly convert between different collection types. Note that it only exists up to Scala 2.12; it was removed in 2.13. You can read more about it here: https://docs.scala-lang.org/tutorials/FAQ/breakout.html
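Since breakOut is not available from Scala 2.13 on, a version-independent alternative is to pass the pairs straight to the ListMap factory. This also keeps the order of the array, which an intermediate toMap does not guarantee for larger inputs. A minimal sketch, with pairs as sample data and PairsToListMap as a hypothetical name:

```scala
import scala.collection.immutable.ListMap

object PairsToListMap {
  val pairs: Array[(String, String)] = Array("foo" -> "bar", "baz" -> "bat")

  // ListMap.apply takes varargs, so the pairs can be expanded with `: _*`.
  val listMap: ListMap[String, String] = ListMap(pairs.toSeq: _*)
}
```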

Related

How to work with a Spark RDD to produce or map to another RDD

I have a key/value RDD. I want to iterate over its entries and create, or map to, another RDD that could have more or fewer entries than the first RDD.
Example:
I have records in accumulo that represent observations of colors in paintings.
An observation entity/object holds data on the painting name and the colors in the painting.
Observation
public String getPaintingName() {return paintingName;}
public List<String> getObservedColors() {return colorList;}
I pull the observations from accumulo into my code as an RDD.
val observationRDD: RDD[(Text, Observation)] = getObservationsFromAccumulo();
I want to take this RDD and create an RDD of the form of (Color, paintingName) where the key is the color observed and the value is the painting name which the color was observed in.
val colorToPaintingRDD: RDD[(String, String)] = observationRDD.somefunction({ case (_, observation) =>
for(String color : observations.getObservedColors()) {
// Some how output a entry into a new RDD
//output/map (color, observation.getPaintingName)
})
I know map can't work, because it's one-to-one. I thought maybe observationRDD.flatMap(some function) would, but I can't seem to find any examples of how to use it to create a new, larger or smaller, RDD.
Could someone help me out and tell me if flatMap is correct, and if so give me an example based on the one I provided, or tell me if I'm way off base?
Please understand this is just a simple example. It's not the content I'm asking about, it's how one would transform an RDD into an RDD with more or fewer entries.

You should use flatMap and return a List[(String, String)] for each element in the RDD. flatMap will flatten the result and you'll get an RDD[(String, String)].
I didn't try the code, but it would be something like this:
val colorToPaintingRDD: RDD[(String, String)] = observationRDD.flatMap { case (_, observation) =>
  observation.getObservedColors().map(color => (color, observation.getPaintingName))
}
If getObservedColors is a Java method, you will probably have to import JavaConversions and convert the result to a Scala list (JavaConversions is deprecated in newer Scala versions, where JavaConverters and an explicit .asScala are used instead):
import scala.collection.JavaConversions._
observation.getObservedColors().toList
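The same one-to-many behaviour can be tried without a cluster, because flatMap on an ordinary Scala Seq follows the same idea as on an RDD. In the sketch below the Observation case class is a hypothetical stand-in for the Java class in the question, and the sample data is made up.

```scala
// Hypothetical stand-in for the Java Observation class from the question.
final case class Observation(paintingName: String, observedColors: List[String])

object FlatMapSketch {
  val observations: Seq[(String, Observation)] = Seq(
    "row1" -> Observation("Sunrise", List("red", "orange")),
    "row2" -> Observation("Night", List("blue")),
    "row3" -> Observation("Blank", Nil) // contributes no output at all
  )

  // Each input entry yields zero or more (color, paintingName) pairs.
  val colorToPainting: Seq[(String, String)] =
    observations.flatMap { case (_, obs) =>
      obs.observedColors.map(color => color -> obs.paintingName)
    }
}
```

Three input entries produce three output pairs here only by coincidence: the first entry contributes two pairs, the second one, and the third none.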

Spark: Cannot add RDD elements into a mutable HashMap inside a closure

I have the following code where rddMap is of org.apache.spark.rdd.RDD[(String, (String, String))], and myHashMap is scala.collection.mutable.HashMap.
I did .saveAsTextFile("temp_out") to force the evaluation of rddMap.map.
However, even though println(" t " + t) prints things, myHashMap afterwards still has only the one element I manually put in at the beginning, ("test1", ("10", "20")).
Nothing from rddMap is put into myHashMap.
Snippet code:
val myHashMap = new HashMap[String, (String, String)]
myHashMap.put("test1", ("10", "20"))
rddMap.map { t =>
println(" t " + t)
myHashMap.put(t._1, t._2)
}.saveAsTextFile("temp_out")
println(rddMap.count)
println(myHashMap.toString)
Why can't I put the elements from rddMap into myHashMap?

Here is a working example of what you want to accomplish.
val rddMap = sc.parallelize(Map("A" -> ("v", "v"), "B" -> ("d","d")).toSeq)
// Collects all the data in the RDD and converts the data to a Map
val myMap = rddMap.collect().toMap
myMap.foreach(println)
Output:
(A,(v,v))
(B,(d,d))
Here is similar code to what you've posted
rddMap.map { t=>
println("t" + t)
newHashMap.put(t._1, t._2)
println(newHashMap.toString)
}.collect
Here is the output to the above code from the Spark shell
t(A,(v,v))
Map(A -> (v,v), test1 -> (10,20))
t(B,(d,d))
Map(test1 -> (10,20), B -> (d,d))
To me it looks like Spark copies your HashMap and adds the elements to the copied map.

What you are trying to do is not really supported in Spark today.
Note that every user-defined function (e.g., what you put inside a map()) is a closure that gets serialized and pushed to each executor.
Therefore everything you have inside this map() gets serialized and transferred around:
.map{ t =>
println(" t " + t)
myHashMap.put(t._1, t._2)
}
Essentially, your myHashMap will be copied to each executor, and each executor will update its own version of that HashMap. This is why, at the end of the execution, the myHashMap you have in your driver will never have changed. (The driver is the JVM that manages/orchestrates your Spark jobs. It's the place where you define your SparkContext.)
In order to push structures defined in the driver to all executors you need to broadcast them (see link here). Note that broadcast variables are read-only, so again, using broadcasts will not help you here.
Another way is to use Accumulators, but I feel that these are more tuned towards summarizing numeric values, like computing a sum, max, min, etc. Maybe you can take a look at creating a custom accumulator that extends AccumulatorParam. See link here.
Coming back to the original question: if you want to collect values to your driver, currently the best way is to transform your RDDs until they become a small and manageable collection of elements, and then collect() this final, small RDD.
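The copy semantics can be imitated without Spark by round-tripping the map through Java serialization, which is roughly what happens to variables captured by a closure before it runs on an executor. This is only a sketch of the effect, not of Spark's internals; serializedCopy and the other names are hypothetical.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable

object ClosureCopySketch {
  // Round-trips a value through Java serialization, producing an independent copy.
  def serializedCopy[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // The map as it exists in the driver.
  val driverMap = mutable.HashMap("test1" -> (("10", "20")))

  // Stand-in for the copy that an executor would work on.
  val executorMap = serializedCopy(driverMap)
  executorMap.put("A", ("v", "v"))
}
```

After this runs, executorMap contains both entries while driverMap still contains only "test1", which mirrors the behaviour seen in the question.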

Why can't I modify a map in foreach?

I am new to Scala and use Spark to process data. Why does the following code fail to change the categoryMap?
import scala.collection.mutable.LinkedHashMap
val catFile=sc.textFile(inputFile);
var categoryMap=LinkedHashMap[Int,Tuple2[String,Int]]()
catFile.foreach(line => {
val strs=line.split("\001");
categoryMap += (strs(0).toInt -> (strs(2),strs(3).toInt));
})

It's good practice to try to stay away from both mutable data structures and vars. Sometimes they are needed, but mostly this kind of processing is easy to do by chaining transformation operations on collections. Also, .toMap is handy for converting a Seq of Tuple2s to a Map.
Here's one way (that I didn't test properly):
val categoryMap = catFile
  .map(_.split("\001"))
  .map(array => (array(0).toInt, (array(2), array(3).toInt)))
  .collect() // bring the pairs to the driver first; an RDD has no toMap
  .toMap
Note that if there is more than one record corresponding to a key, only the last one will be present in the resulting map.
Edit: I didn't actually answer your original question. Based on a quick test, your code results in a map similar to the one my code above produces. As a blind guess, you should make sure that your catFile actually contains data to process.
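The "last record wins" behaviour of toMap can be checked on a small in-memory sample. The sketch below uses plain Scala collections instead of an RDD and made-up records; it writes the separator as "\u0001" because newer Scala versions reject the octal escape "\001". CategoryMapSketch is a hypothetical name.

```scala
object CategoryMapSketch {
  // Field separator: the control character U+0001.
  val sep = "\u0001"

  // Made-up records: id, unused field, name, count.
  val lines: Seq[String] = Seq(
    Seq("1", "x", "books", "10"),
    Seq("2", "x", "music", "20"),
    Seq("1", "x", "films", "30") // same key as the first record
  ).map(_.mkString(sep))

  val categoryMap: Map[Int, (String, Int)] =
    lines
      .map(_.split(sep))
      .map(a => a(0).toInt -> ((a(2), a(3).toInt)))
      .toMap
}
```

Here key 1 ends up mapped to ("films", 30), because the later pair replaces the earlier one.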

Converting a Scala Map to a List

I have a map that I need to map to a different type, and the result needs to be a List. I have two ways (seemingly) to accomplish what I want, since calling map on a map seems to always result in a map. Assuming I have some map that looks like:
val input = Map[String, List[Int]]("rk1" -> List(1,2,3), "rk2" -> List(4,5,6))
I can either do:
val output = input.map{ case(k,v) => (k.getBytes, v) } toList
Or:
val output = input.foldRight(List[Pair[Array[Byte], List[Int]]]()){ (el, res) =>
(el._1.getBytes, el._2) :: res
}
In the first example I convert the type, and then call toList. I assume the runtime is something like O(n*2) and the space required is n*2. In the second example, I convert the type and generate the list in one go. I assume the runtime is O(n) and the space required is n.
My question is, are these essentially identical or does the second conversion cut down on memory/time/etc? Additionally, where can I find information on storage and runtime costs of various scala conversions?
Thanks in advance.

My favorite way to do this kind of thing is like this:
input.map { case (k,v) => (k.getBytes, v) }(collection.breakOut): List[(Array[Byte], List[Int])]
With this syntax, you are passing to map the builder it needs to reconstruct the resulting collection. (Actually, not a builder, but a builder factory. Read more about Scala's CanBuildFroms if you are interested.) collection.breakOut can exactly be used when you want to change from one collection type to another while doing a map, flatMap, etc. — the only bad part is that you have to use the full type annotation for it to be effective (here, I used a type ascription after the expression). Then, there's no intermediary collection being built, and the list is constructed while mapping.

Mapping over a view in the first example could cut down on the space requirement for a large map:
val output = input.view.map { case (k, v) => (k.getBytes, v) }.toList
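A variant that does not depend on breakOut or on the details of views is to map over the map's iterator, which is lazy, and build the List at the end. This is a sketch using the input from the question; MapToListSketch is a hypothetical name and the explicit "UTF-8" charset is an assumption added here.

```scala
object MapToListSketch {
  val input: Map[String, List[Int]] =
    Map("rk1" -> List(1, 2, 3), "rk2" -> List(4, 5, 6))

  // The iterator is lazy, so no intermediate Map is built before toList.
  val output: List[(Array[Byte], List[Int])] =
    input.iterator.map { case (k, v) => (k.getBytes("UTF-8"), v) }.toList
}
```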

Map inside Map in Scala

I have this code:
val total = ListMap[String,HashMap[Int,_]]()
val hm1 = new HashMap[Int,String]
val hm2 = new HashMap[Int,Int]
...
//insert values in hm1 and in hm2
...
total += "key1" -> hm1
total += "key2" -> hm2
....
val get: HashMap[Int,String] = total.get("key1") match {
case Some(a: HashMap[Int,String]) => a
}
This works, but I would like to know if there is a better (more readable) way to do this.
Thanks to all!

It looks like you're trying to re-implement tuples as maps.
val total : ( Map[Int,String], Map[Int,Int]) = ...
def get : Map[Int,String] = total._1
(edit: oh, sorry, I get it now)
Here's the thing: the code above doesn't work. Type parameters are erased, so the match above will ALWAYS return true -- try it with key2, for example.
If you want to store multiple types in a Map and retrieve them later, you'll need to use Manifest and specialized get and put methods. But this has already been answered on Stack Overflow, so I won't repeat myself here.

Your total map, containing maps with non-uniform value types, would be best avoided. The question is, when you retrieve the map at "key1", and then cast it to a map of strings, why did you choose String?
The most trivial reason might be that key1 and so on are simply constants, and that you know all of them when you write your code. In that case, you should probably have a val for each of your maps, and dispense with the map of maps entirely.
It might be that the calls made by the client code have this knowledge. Say that the client calls stringMap("key1") or intMap("key2"), or that one way or another the call implies that some given type is expected, and that the client is responsible for not mixing types and names. Again, in that case there is no reason for total. You would have a map of string maps and a map of int maps (provided that you have prior knowledge of a limited number of value types).
What is your reason to have total?

First of all: this is a non-answer (as I would not recommend the approach I discuss), but it was too long for a comment.
If you haven't got too many different keys in your ListMap, I would suggest trying Malvolio's answer.
Otherwise, due to type erasure, the other approaches based on pattern matching are practically equivalent to this (which works, but is very unsafe):
val get = total("key1").asInstanceOf[HashMap[Int, String]]
The reasons why this is unsafe (unless you like living dangerously) are:
total("key1") does not return an Option (unlike total.get("key1")). If "key1" does not exist, it will throw a NoSuchElementException. I wasn't sure how you were planning to manage the "None" case anyway.
asInstanceOf will also happily cast total("key2"), which should be a HashMap[Int, Int] but is at this point a HashMap[Int, Any], to a HashMap[Int, String]. You will have a problem later on when you try to access the Int value (which Scala now believes is a String).
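Both points can be seen in a small sketch: the cast to the wrong element type succeeds, and the failure only shows up when a value is actually used as a String. The second half shows one typed alternative in the spirit of the tuple suggestion, with one field per map. All names here (ErasureSketch, Total, and so on) are hypothetical.

```scala
import scala.collection.mutable.HashMap

object ErasureSketch {
  val intMap: HashMap[Int, Int] = HashMap(1 -> 42)

  // The cast succeeds: at runtime only "is this a HashMap?" can be checked.
  val wronglyTyped: HashMap[Int, String] =
    intMap.asInstanceOf[HashMap[Int, String]]

  // The error surfaces only when a value is actually treated as a String.
  def failsOnAccess: Boolean =
    try { val s: String = wronglyTyped(1); s.length < 0 }
    catch { case _: ClassCastException => true }

  // Typed alternative: keep each map in its own, precisely typed field.
  final case class Total(strings: Map[Int, String], ints: Map[Int, Int])
  val total: Total = Total(Map(1 -> "one"), Map(1 -> 42))
  val strings: Map[Int, String] = total.strings
}
```

With the typed container, the compiler rejects a mix-up between the two maps instead of deferring the failure to runtime.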