I am new to Scala and use Spark to process data. Why does the following code fail to change the categoryMap?
import scala.collection.mutable.LinkedHashMap
val catFile=sc.textFile(inputFile);
var categoryMap=LinkedHashMap[Int,Tuple2[String,Int]]()
catFile.foreach(line => {
val strs=line.split("\001");
categoryMap += (strs(0).toInt -> (strs(2),strs(3).toInt));
})
It's a good practice to try to stay away from both mutable data structures and vars. Sometimes they are needed, but mostly this kind of processing is easy to do by chaining transformation operations on collections. Also, .toMap is handy to convert a Seq containing Tuple2's to a Map.
Here's one way (that I didn't test properly):
val categoryMap = catFile.map(_.split("\001")).map { array =>
  (array(0).toInt, (array(2), array(3).toInt))
}.collect().toMap // an RDD has no toMap, so collect() brings the pairs to the driver first
Note that if there is more than one record for a given key, only the last one will be present in the resulting map.
Edit: I didn't actually answer your original question - based on a quick test it results in a similar map to what my code above produces. Blind guess, you should make sure that your catFile actually contains data to process.
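To make the note about duplicate keys concrete, here is a minimal sketch that uses a plain Scala Seq in place of the RDD. The sample lines and the object name CategoryMapSketch are made up for illustration, and "\u0001" is used as the separator instead of the octal escape.

```scala
object CategoryMapSketch {
  // Made-up sample records: id, unused field, name, count, separated by \u0001.
  val lines: Seq[String] = Seq(
    "1\u0001x\u0001books\u000110",
    "2\u0001x\u0001music\u000120",
    "1\u0001x\u0001games\u000130" // same key as the first record
  )

  // Chain of transformations, no var and no mutable map.
  val categoryMap: Map[Int, (String, Int)] =
    lines
      .map(_.split("\u0001"))
      .map(arr => (arr(0).toInt, (arr(2), arr(3).toInt)))
      .toMap
}
```

Key 1 appears twice in the sample, so the resulting map has two entries and the later record for key 1 replaces the earlier one.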
I've just started using Scala/Spark. I come from a Java background, and I'm still trying to wrap my head around the concept of immutability and other Scala best practices.
This is a very small segment of code from a larger program:
intersections is RDD(Key, (String, String))
obs is (Key, (String, String))
Data is just a case class I've defined above.
val intersections = map1 join map2
var listOfDatas = List[Data]()
intersections take NumOutputs foreach (obs => {
listOfDatas ::= ParseInformation(obs._1.key, obs._2._1, obs._2._2)
})
listOfDatas foreach println
This code works and does what I need it to do, but I was wondering if there was a better way of making this happen. I'm using a variable list and rewriting it with a new list every single time I iterate, and I'm sure there has to be a better way to create an immutable list that's populated with the results of the ParseInformation method call. Also, I remember reading somewhere that instead of accessing the tuple values directly, the way I have done, you should use case classes within functions (as partial functions I think?) to improve readability.
Thanks in advance for any input!
This might work locally, but only because take brings the results back to the driver before foreach runs. It will not work once distributed, because listOfDatas is passed to each worker as a copy. A better way of doing this, in my opinion, is:
val processedData = intersections map { case (key, (item1, item2)) =>
  ParseInformation(key.key, item1, item2) // same call as in the question, without tuple accessors
}
processedData foreach println
A note for those new to functional development: if all you are trying to do is transform the data in an iterable (such as a List), forget foreach. Use map instead, which runs your transformation on each item and returns a new collection of the results.
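Here is a small sketch of that advice on a plain List, so it runs without Spark. Key, Data and parseInformation are hypothetical stand-ins for the types and the method in the question, not the asker's real definitions.

```scala
object MapInsteadOfForeach {
  // Hypothetical stand-ins for the question's types.
  final case class Key(key: String)
  final case class Data(id: String, left: String, right: String)

  // Hypothetical parser: just trims the two fields.
  def parseInformation(id: String, left: String, right: String): Data =
    Data(id, left.trim, right.trim)

  val intersections: List[(Key, (String, String))] = List(
    (Key("a"), (" 1 ", "x")),
    (Key("b"), ("2", " y "))
  )

  // map with a pattern-matching function literal: no var, no _1/_2 accessors.
  val listOfDatas: List[Data] =
    intersections.take(2).map { case (k, (left, right)) =>
      parseInformation(k.key, left, right)
    }
}
```

One side effect worth knowing: the var version in the question prepends with ::=, so it builds the list in reverse order, whereas map keeps the original order.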
What's the type of intersections? It looks like you can replace foreach with map:
val listOfDatas: List[Data] =
intersections take NumOutputs map (obs => {
ParseInformation(obs._1.key, obs._2._1, obs._2._2)
})
I have a server API that returns a list of things, and does so in chunks of, let's say, 25 items at a time. With every response, we get a list of items, and a "token" that we can use for the following server call to return the next 25, and so on.
Please note that we're using a client library that has been written in stodgy old mutable Java, and doesn't lend itself nicely to all of Scala's functional compositional patterns.
I'm looking for a way to return a lazily evaluated sequence of all server items, by doing a server call with the latest token whenever the local list of items has been exhausted. What I have so far is:
def fetchFromServer(uglyStateObject: StateObject): Seq[Thing] = {
val results = server.call(uglyStateObject)
uglyStateObject.update(results.token())
results.asScala.toList ++ (if results.moreAvailable() then
fetchFromServer(uglyStateObject)
else
List())
}
However, this function does eager evaluation. What I'm looking for is to have ++ concatenate a "strict sequence" and a "lazy sequence", where a thunk will be used to retrieve the next set of items from the server. In effect, I want something like this:
results.asScala.toList ++ Seq.lazy(() => fetchFromServer(uglyStateObject))
Except I don't know what to use in place of Seq.lazy.
Things I've seen so far:
SeqView, but I've seen comments that it shouldn't be used because it re-evaluates all the time?
Streams, but they seem like the abstraction is supposed to generate elements at a time, whereas I want to generate a bunch of elements at a time.
What should I use?
I also suggest you take a look at scalaz-stream. Here is a small example of how it might look:
import scalaz.stream._
import scalaz.concurrent.Task
// Returns updated state + fetched data
def fetchFromServer(uglyStateObject: StateObject): (StateObject, Seq[Thing]) = ???
// Initial state
val init: StateObject = new StateObject
val p: Process[Task, Thing] = Process.repeatEval[Task, Seq[Thing]] {
var state = init
Task(fetchFromServer(state)) map {
case (s, seq) =>
state = s
seq
}
} flatMap Process.emitAll
As a matter of fact, in the meantime I already found a slightly different answer that I find more readable (indeed using Streams):
def fetchFromServer(uglyStateObject: StateObject): Stream[Thing] = {
val results = server.call(uglyStateObject)
uglyStateObject.update(results.token())
results.asScala.toStream #::: (if results.moreAvailable() then
fetchFromServer(uglyStateObject)
else
Stream.empty)
}
Thanks everyone for
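For anyone who wants to see the laziness of the Stream version in action, here is a self-contained sketch with a fake in-memory server. Page, call, fetchFrom and the calls counter are all assumptions made up for the demo and have nothing to do with the real client library. Note that Stream is deprecated since Scala 2.13 in favour of LazyList; the sketch sticks with Stream to match the code above.

```scala
object LazyPagingSketch {
  // Hypothetical server response: one chunk of items plus an optional next token.
  final case class Page(items: List[Int], nextToken: Option[Int])

  val allItems: Vector[Int] = (1 to 7).toVector
  val pageSize = 3
  var calls = 0 // counts server round trips, only to make the laziness observable

  // Fake server call: returns at most pageSize items starting at the token.
  def call(token: Int): Page = {
    calls += 1
    val next = token + pageSize
    Page(
      allItems.slice(token, next).toList,
      if (next < allItems.size) Some(next) else None
    )
  }

  def fetchFrom(token: Int): Stream[Int] = {
    val page = call(token)
    // A def, not a val: the next call happens only when the stream needs it.
    def rest: Stream[Int] = page.nextToken match {
      case Some(t) => fetchFrom(t)
      case None    => Stream.empty
    }
    // #::: takes its right-hand side by name, so rest is not evaluated here.
    page.items.toStream #::: rest
  }
}
```

Calling LazyPagingSketch.fetchFrom(0) makes one server call for the first chunk. Taking elements that are already in that chunk does not trigger another call; walking the whole stream fetches the remaining chunks one at a time.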
First off, apologies for the lame question. I am reading 'Scala for the Impatient' religiously and trying to solve all the exercise questions (and doing some minimal exploration).
Background :
The exercise reads: Set up a map of prices for a number of gizmos that you covet. Then produce a second map with the same keys and the prices at a 10% discount.
Unfortunately, at this point, most parts of the scaladoc are still cryptic to me, but I understand that the map function of Map takes a function and returns another map after applying that function (I guess?) - def map[B](f: (A) ⇒ B): HashMap[B]. I tried googling but couldn't find many useful results for the map function of Map in Scala :-)
My Question:
As attempted in my variation 3, does using the map function for this purpose make any sense, or should I stick with variation 2, which actually solves my problem?
Code :
val gizmos:Map[String,Double]=Map("Samsung Galaxy S4 Zoom"-> 1000, "Mac Pro"-> 6000.10, "Google Glass"->2000)
//1. Normal for/yield
val discountedGizmos=(for ((k,v)<-gizmos) yield (k, v*0.9)) //Works fine
//2. Variation using mapValues
val discGizmos1=gizmos.mapValues(_*0.9) //Works fine
//3. Variation using only map function
val discGizmos2=gizmos.map((_,v) =>v*0.9) //ERROR : Wrong number of parameters: expected 1
In this case, mapValues does seem the more appropriate method to use. You would use the map method when you need to perform a transformation that requires knowledge of the keys (e.g. converting a product reference into a product name).
That said, the map method is more general, as it gives you access to both the keys and the values, and you could emulate the mapValues method by simply transforming the values and passing the keys through untouched. That is where your code above goes wrong, in two ways. First, map expects a function of a single argument (the key/value pair), whereas (_,v) => ... declares two parameters, which is what the compiler error is about. Second, the function should produce a (key, value) pair, not just a value:
val discGizmos2=gizmos.map{ case (k,v) => (k,v*0.9) } // pass the key through unchanged
It can also be written as:
val discGizmos2 = gizmos.map(kv => (kv._1, kv._2*0.9))
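To compare the two spellings side by side, and to see what happens when the function returns only the value, here is a small sketch. The object name and the shortened price map are made up for the example.

```scala
object DiscountSketch {
  val gizmos: Map[String, Double] =
    Map("Mac Pro" -> 6000.0, "Google Glass" -> 2000.0)

  // Pattern-matching form: destructure the pair, return a new pair.
  val viaCase: Map[String, Double] =
    gizmos.map { case (k, v) => (k, v * 0.9) }

  // Tuple-accessor form: same result, one parameter named kv.
  val viaTuple: Map[String, Double] =
    gizmos.map(kv => (kv._1, kv._2 * 0.9))

  // Returning only the value compiles, but the result is no longer a Map.
  val valuesOnly: Iterable[Double] =
    gizmos.map { case (_, v) => v * 0.9 }
}
```

Both pair-returning forms produce the same Map. The last one is the reason the pair matters: without a key in the result there is nothing to build a Map from, so you get a plain collection of discounted prices.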
I have a map that I need to map to a different type, and the result needs to be a List. I have two ways (seemingly) to accomplish what I want, since calling map on a map seems to always result in a map. Assuming I have some map that looks like:
val input = Map[String, List[Int]]("rk1" -> List(1,2,3), "rk2" -> List(4,5,6))
I can either do:
val output = input.map{ case(k,v) => (k.getBytes, v) } toList
Or:
val output = input.foldRight(List[Pair[Array[Byte], List[Int]]]()){ (el, res) =>
(el._1.getBytes, el._2) :: res
}
In the first example I convert the type, and then call toList. I assume the runtime is something like O(n*2) and the space required is n*2. In the second example, I convert the type and generate the list in one go. I assume the runtime is O(n) and the space required is n.
My question is, are these essentially identical or does the second conversion cut down on memory/time/etc? Additionally, where can I find information on storage and runtime costs of various scala conversions?
Thanks in advance.
My favorite way to do this kind of thing is like this:
input.map { case (k,v) => (k.getBytes, v) }(collection.breakOut): List[(Array[Byte], List[Int])]
With this syntax, you are passing to map the builder it needs to reconstruct the resulting collection. (Actually, not a builder, but a builder factory. Read more about Scala's CanBuildFroms if you are interested.) collection.breakOut is exactly what you use when you want to change from one collection type to another while doing a map, flatMap, etc. The only bad part is that you have to give the full type annotation for it to be effective (here, I used a type ascription after the expression). With that in place, no intermediate collection is built, and the list is constructed while mapping.
Mapping over a view in the first example could cut down on the space requirement for a large map:
val output = input.view.map{ case(k,v) => (k.getBytes, v) } toList
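One caveat for newer Scala versions: collection.breakOut was removed in Scala 2.13. A sketch of an alternative that compiles on both 2.12 and 2.13 is to map over the map's iterator, which also avoids building an intermediate Map. The object name below is made up, and the explicit "UTF-8" charset is my own choice rather than something from the question.

```scala
object MapToListSketch {
  val input: Map[String, List[Int]] =
    Map("rk1" -> List(1, 2, 3), "rk2" -> List(4, 5, 6))

  // The iterator is lazy, so the pairs go straight into the List builder.
  val output: List[(Array[Byte], List[Int])] =
    input.iterator
      .map { case (k, v) => (k.getBytes("UTF-8"), v) }
      .toList
}
```

The list has one entry per key of the input map, and decoding the byte arrays back to strings gives the original keys.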
I've this code :
val total = ListMap[String,HashMap[Int,_]]()
val hm1 = new HashMap[Int,String]
val hm2 = new HashMap[Int,Int]
...
//insert values in hm1 and in hm2
...
total += "key1" -> hm1
total += "key2" -> hm2
....
val get : HashMap[Int,String] = total.get("key1") match {
case a : HashMap[Int,String] => a
}
This works, but I would like to know whether there is a better (more readable) way to do this.
Thanks to all !
It looks like you're trying to re-implement tuples as maps.
val total : ( Map[Int,String], Map[Int,Int]) = ...
def get : Map[Int,String] = total._1
(edit: oh, sorry, I get it now)
Here's the thing: the code above doesn't work. Type parameters are erased, so the match above will ALWAYS succeed -- try it with key2, for example.
If you want to store multiple types in a Map and retrieve them later, you'll need to use Manifest and specialized get and put methods. But this has already been answered on Stack Overflow, so I won't repeat myself here.
Your total map, containing maps with non uniform value types, would be best avoided. The question is, when you retrieve the map at "key1", and then cast it to a map of strings, why did you choose String?
The most trivial reason might be that key1 and so on are simply constants, that you know all of them when you write your code. In that case, you probably should have a val for each of your maps, and dispense with map of maps entirely.
It might be that the calls made by the client code carry this knowledge. Say that the client calls stringMap("key1") or intMap("key2"), or that one way or another the call implies that some given type is expected, and that the client is responsible for not mixing up types and names. Again, in that case there is no reason for total. You would have a map of string maps and a map of int maps (provided that you have prior knowledge of a limited number of value types).
What is your reason to have total?
First of all: this is a non-answer (as I would not recommend the approach I discuss), but it was too long for a comment.
If you haven't got too many different keys in your ListMap, I would suggest trying Malvolio's answer.
Otherwise, due to type erasure, the other approaches based on pattern matching are practically equivalent to this (which works, but is very unsafe):
val get = total("key1").asInstanceOf[HashMap[Int, String]]
The reasons why this is unsafe (unless you like living dangerously) are:
total("key1") does not return an Option (unlike total.get("key1")). If "key1" does not exist, it will throw a NoSuchElementException. I wasn't sure how you were planning to handle the None case anyway.
asInstanceOf will also happily cast total("key2") - which should be a HashMap[Int, Int], but is at this point a HashMap[Int, Any] - to a HashMap[Int, String]. You will run into problems later on, when you try to access an Int value that Scala now believes is a String.
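The second point can be reproduced with a few lines. This is only a sketch: the object name, the sample value and the use of an immutable Map for total are made up for the demo.

```scala
import scala.collection.mutable.HashMap
import scala.util.Try

object ErasureSketch {
  val hm2: HashMap[Int, Int] = HashMap(1 -> 42)
  val total: Map[String, HashMap[Int, _]] = Map("key2" -> hm2)

  // The cast itself succeeds: at runtime only "HashMap" is checked,
  // because the type parameters are erased.
  val wrong: HashMap[Int, String] =
    total("key2").asInstanceOf[HashMap[Int, String]]

  // The failure only shows up when a value is actually used as a String.
  val useAsString: Try[Int] = Try {
    val s: String = wrong(1)
    s.length
  }
}
```

The cast to HashMap[Int, String] goes through without complaint, and the ClassCastException appears only later, at the point where the stored Int is read as a String. That distance between the mistake and the error is what makes this approach hard to debug.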