How to get a subset of a map? - scala

How do I get a subset of a map?
Assume we have
val m: Map[Int, String] = ...
val k: List[Int]
Where all keys in k exist in m.
Now I would like to get a subset of the Map m with only the pairs whose key is in the list k.
Something like m.intersect(k), but intersect is not defined on a map.
One way is to use filterKeys: m.filterKeys(k.contains). But this might be a bit slow, because for each key in the original map a search in the list has to be done.
Another way I could think of is k.map(l => (l, m(l))).toMap. Here we just iterate through the keys we are really interested in and do not search the list at all.
Is there a better (built-in) way?

m filterKeys k.toSet
because a Set is a Function.
On performance:
filterKeys itself is O(1), since it works by producing a new map with overridden foreach, iterator, contains and get methods. The overhead comes when elements are accessed. It means that the new map uses no extra memory, but also that memory for the old map cannot be freed.
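For example, a minimal sketch of that access-time behavior with made-up values (k.toSet works as the predicate because a Set[Int] is itself a function Int => Boolean):
val m = Map(1 -> "a", 2 -> "b", 3 -> "c")
val k = List(1, 3)
val sub = m filterKeys k.toSet // no entries are copied at this point
sub.get(2) // None: the wrapper's get filters the key on access
sub(1)     // "a": looked up in the underlying map m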
If you need to free up the memory and have fastest possible access, a fast way would be to fold the elements of k into a new Map without producing an intermediate List[(Int,String)]:
k.foldLeft(Map[Int,String]()){ (acc, x) => acc + (x -> m(x)) }

val s = Map(k.map(x => (x, m(x))): _*)

I think this is the most readable and a good performer:
k zip (k map m) toMap
Or, method invocation style would be:
k.zip(k.map(m)).toMap
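This works because a Map, like the Set above, is itself a function (here Int => String), so it can be passed directly to map. A quick sketch with made-up values:
val m = Map(1 -> "a", 2 -> "b", 3 -> "c")
val k = List(1, 3)
k.map(m)              // List(a, c): m applied as a function Int => String
k.zip(k.map(m)).toMap // Map(1 -> a, 3 -> c)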

Related

Scala set vs map in for comprehension

Playing around with Scala I'm facing these two similar pieces of code that puzzle me:
val m = Map("a"->2D, "b"->3D)
for((k, v) <- m) yield (v, k) // Yields Map(2.0 -> a, 3.0 -> b)
for(k <- m.keys) yield (m(k), k) // Yields Set((2.0,a), (3.0,b))
Why the different behavior?
Is it possible to change the second comprehension so that it yields a Map instead of a Set?
I sense there is something good to learn here; any additional pointers are appreciated.
Recall that a for comprehension is de-sugared into map() and flatMap() (and withFilter()) calls. In this case, because each of your examples has a single generator (<-) each one becomes a single map() call.
Also recall that map() will return the same monad (wrapper type) that it was called on.
In the 1st example you're mapping over a Map, so you get a Map back: from Map[String,Double] to Map[Double,String]. The tuples are transformed into key->value pairs.
In the 2nd example you're mapping over a Set of elements from the keys of a Map, so you get a Set back. No tuple transformation takes place. They are left as tuples.
To get a Map out of the 2nd example, i.e. to get the tuples transformed, wrap the entire for in parentheses and tag a .toMap at the end.
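For example, a sketch of that fix applied to the 2nd comprehension:
val m = Map("a" -> 2D, "b" -> 3D)
(for (k <- m.keys) yield (m(k), k)).toMap // Map(2.0 -> a, 3.0 -> b)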

Can we replace map with flatMap?

I was trying to find the line with the maximum number of words, and I wrote the following lines to run on spark-shell:
import java.lang.Math
val counts = textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
But since map is one-to-one, and flatMap is one-to-zero-or-more, I tried replacing map with flatMap in the above code. But it gives this error:
<console>:24: error: type mismatch;
found : Int
required: TraversableOnce[?]
val counts = F1.flatMap(s => s.split(" ").size).reduce((a,b)=> Math.max(a,b))
If anybody could help me understand the reason, it would really be helpful.
flatMap must return an Iterable, which is clearly not what you want. You do want map here, because you want a one-to-one function that takes a line and maps it to the number of words in it (though you could of course return a one-element collection containing the size...).
flatMap is meant to associate a collection with each input. For instance, if you wanted to map a line to all its words, you would do:
val words = textFile.flatMap(x => x.split(" "))
and that would return an RDD[String] containing all the words.
In the end, map transforms an RDD of size N into another RDD of size N (e.g. your lines to their lengths), whereas flatMap transforms an RDD of size N into an RDD of size P (strictly speaking, an RDD of size N into an RDD of size N made of collections; all these collections are then flattened to produce the RDD of size P).
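A small illustration of that N-versus-P point on a plain Scala collection (the shapes are the same for RDDs; the values here are made up):
val lines = List("to be or not", "that is the question")
lines.map(_.split(" ").length) // List(4, 4): N lines in, N numbers out
lines.flatMap(_.split(" "))    // List(to, be, or, not, that, is, the, question): N lines in, P words out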
P.S.: one last remark that has nothing to do with your problem: for a String s, it is more efficient to do
val nbWords = s.split(" ").length
than to call .size. Indeed, split returns an Array[String], and arrays do not have a size method, so when you call .size you trigger an implicit conversion from Array[String] to SeqLike[String], which creates a new wrapper object. But Array[T] does have a length field, so there is no conversion when calling length. (It's a detail, but I think it's a good habit.)
Any use of map can be replaced by flatMap, but the function argument has to be changed to return a single-element List: textFile.flatMap(line => List(line.split(" ").size)). This isn't a good idea: it just makes your code less understandable and less efficient.
After reading the part of "Tired of Null Pointer Exceptions? Consider Using Java SE 8's Optional!" about why to use flatMap() rather than map(), I realized the true reason flatMap() cannot replace map(): map() is not a special case of flatMap().
It's true that flatMap() means one-to-many, but that's not the only thing flatMap() does. Put simply, it also strips off the outer Stream.
See the definitions of map and flatMap:
Stream<R> map(Function<? super T, ? extends R> mapper)
Stream<R> flatMap(Function<? super T, ? extends Stream<? extends R>> mapper)
The only difference is the return type of the inner function. What map() returns is Stream<'what the inner function returned'>, while what flatMap() returns is just 'what the inner function returned'.
So you can say that flatMap() can strip the outer Stream away, but map() can't. This is the main difference in my opinion, and also why map() is not just a special case of flatMap().
P.S.:
If you really want one-to-one with flatMap, then you should change it into one-to-List(one). That means you add an outer Stream manually, which flatMap() will strip off later. After that you get the same effect as using map(). (Certainly, it's clumsy, so don't do it like that.)
Here are examples for Java8, but the same as Scala:
use map():
list.stream().map(line -> line.split(" ").length)
discouraged use of flatMap():
list.stream().flatMap(line -> Arrays.asList(line.split(" ").length).stream())

Scala indices of values in one list which are not in a second list

I am trying to find the indices of elements in one Scala list which are not present in a second list (assume the second list has distinct elements so that you do not need to invoke toSet on it). The best way I found is:
val ls = List("a", "b", "c") // the list
val excl = List("c", "d") // the list of items to exclude
val ixs = ls.zipWithIndex.
  filterNot{ p => excl.contains(p._1) }.
  map{ p => p._2 } // the list of indices
However, I feel there should be a more direct method. Any hints?
Seems OK to me. It's a bit more elegant as a for-comprehension, perhaps:
for ((e,i) <- ls.zipWithIndex if !excl.contains(e)) yield i
And for efficiency, you might want to make excl into a Set anyway
val exclSet = excl.toSet
for ((e,i) <- ls.zipWithIndex if !exclSet(e)) yield i
One idea would be this
(ls.zipWithIndex.toMap -- excl).values
This only works, however, if you are not interested in all positions when an element occurs multiple times in the list. That would need a MultiMap, which Scala does not have in the standard library.
Another version would be to use a partial function and convert the second list to a set first (unless it is really small, lookup in a set will be much faster):
val set = excl.toSet
ls.zipWithIndex.collect{case (x,y) if !set(x) => y}

Comparing Subsets of an RDD

I’m looking for a way to compare subsets of an RDD intelligently.
Let's say I have an RDD with key/value pairs of type (Int -> T). I eventually need to say "compare all values of key 1 with all values of key 2, and compare the values of key 3 to the values of key 5 and key 7". How would I go about doing this efficiently?
The way I'm currently thinking of doing it is by creating a List of filtered RDDs and then using RDD.cartesian():
def filterSubset[T] = (b:Int, r:RDD[(Int, T)]) => r.filter{case(name, _) => name == b}
val keyPairs: List[(Int, Int)] = ... // all key pairs
val rddPairs = keyPairs.map {
  case (a, b) =>
    filterSubset(a, r).cartesian(filterSubset(b, r))
}
rddPairs.map{whatever I want to compare…}
I would then iterate the list and perform a map on each of the RDDs of pairs to gather the relational data that I need.
What I can't tell about this idea is whether it would be extremely inefficient to set up possibly hundreds of map jobs and then iterate through them. In this case, would the lazy evaluation in Spark optimize the data shuffling between all of the maps? If not, can someone please recommend a possibly more efficient way to approach this issue?
Thank you for your help
One way you can approach this problem is to replicate and partition your data to reflect the key pairs you want to compare. Let's start with creating two maps from the actual keys to the temporary keys we'll use for replication and joins:
def genMap(keys: Seq[Int]) = keys
.zipWithIndex.groupBy(_._1)
.map{case (k, vs) => (k -> vs.map(_._2))}
val left = genMap(keyPairs.map(_._1))
val right = genMap(keyPairs.map(_._2))
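For instance, with the key pairs from the question, "compare 1 with 2, 3 with 5, 3 with 7" (so keyPairs would be Seq((1, 2), (3, 5), (3, 7)); an illustrative value, not given above), the two maps come out as:
left  // Map(1 -> Seq(0), 3 -> Seq(1, 2))
right // Map(2 -> Seq(0), 5 -> Seq(1), 7 -> Seq(2))
Each key is mapped to the indices of the comparisons it takes part in; those indices become the temporary keys under which the data is replicated.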
Next we can transform data by replicating with new keys:
def mapAndReplicate[T: ClassTag](rdd: RDD[(Int, T)], map: Map[Int, Seq[Int]]) = {
rdd.flatMap{case (k, v) => map.getOrElse(k, Seq()).map(x => (x, (k, v)))}
}
val leftRDD = mapAndReplicate(r, left)
val rightRDD = mapAndReplicate(r, right)
Finally we can cogroup:
val cogrouped = leftRDD.cogroup(rightRDD)
And compare / filter pairs:
cogrouped.values.flatMap { case (xs, ys) =>
  for {
    (kx, vx) <- xs
    (ky, vy) <- ys
    if cosineSimilarity(vx, vy) <= threshold
  } yield ((kx, vx), (ky, vy))
}
Obviously, in its current form this approach is limited. It assumes that the values for an arbitrary pair of keys can fit into memory, and it requires a significant amount of network traffic. Still, it should give you some idea how to proceed.
Another possible approach is to store the data in an external system (for example a database) and fetch the required key-value pairs on demand.
Since you're trying to find similarity between elements, I would also consider a completely different approach. Instead of naively comparing key-by-key, I would try to partition the data using a custom partitioner which reflects the expected similarity between documents. It is far from trivial in general, but should give much better results.
Using DataFrames you can easily do the cartesian operation using join:
dataframe1.join(dataframe2, dataframe1("key")===dataframe2("key"))
It will probably do exactly what you want, but efficiently.
If you don't know how to create a DataFrame, please refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#creating-dataframes
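A rough sketch of how that could look, assuming Spark 1.x with a SQLContext named sqlContext and the question's key/value RDD named r (say an RDD[(Int, String)]); these names are illustrative, not given above:
import sqlContext.implicits._
val df1 = r.toDF("key", "value")
val df2 = r.toDF("key", "value")
// inner join on the key column; Spark plans the shuffle itself instead of
// building one cartesian product per key pair
val joined = df1.join(df2, df1("key") === df2("key"))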

Scala, make my loop more functional

I'm trying to reduce the extent to which I write Scala (2.8) like Java. Here's a simplification of a problem I came across. Can you suggest improvements on my solutions that are "more functional"?
Transform the map
val inputMap = mutable.LinkedHashMap(1->'a',2->'a',3->'b',4->'z',5->'c')
by discarding any entries with value 'z' and indexing the characters as they are encountered
First try
var outputMap = new mutable.HashMap[Char,Int]()
var counter = 0
for (kvp <- inputMap) {
  val character = kvp._2
  if (character != 'z' && !outputMap.contains(character)) {
    outputMap += (character -> counter)
    counter += 1
  }
}
Second try (not much better, but uses an immutable map and a 'foreach')
var outputMap = new immutable.HashMap[Char,Int]()
var counter = 0
inputMap.foreach {
  case (number, character) =>
    if (character != 'z' && !outputMap.contains(character)) {
      outputMap += (character -> counter)
      counter += 1
    }
}
Nicer solution:
inputMap.toList.filter(_._2 != 'z').map(_._2).distinct.zipWithIndex.toMap
I find this solution slightly simpler than Arjan's:
inputMap.values.filter(_ != 'z').toSeq.distinct.zipWithIndex.toMap
The individual steps:
inputMap.values // Iterable[Char] = MapLike(a, a, b, z, c)
.filter(_ != 'z') // Iterable[Char] = List(a, a, b, c)
.toSeq.distinct // Seq[Char] = List(a, b, c)
.zipWithIndex // Seq[(Char, Int)] = List((a,0), (b,1), (c,2))
.toMap // Map[Char, Int] = Map((a,0), (b,1), (c,2))
Note that your problem doesn't inherently involve a map as input, since you're just discarding the keys. If I were coding this, I'd probably write a function like
def buildIndex[T](s: Seq[T]): Map[T, Int] = s.distinct.zipWithIndex.toMap
and invoke it as
buildIndex(inputMap.values.filter(_ != 'z').toSeq)
First, if you're doing this functionally, you should use an immutable map.
Then, to get rid of something, you use the filter method:
inputMap.filter(_._2 != 'z')
and finally, to do the remapping, you can just use the values (but as a set) with zipWithIndex, which will count up from zero, and then convert back to a map:
inputMap.filter(_._2 != 'z').values.toSet.zipWithIndex.toMap
Since the order of values isn't going to be preserved anyway*, presumably it doesn't matter that the order may have been shuffled yet again with the set transformation.
Edit: There's a better solution in a similar vein; see Arjan's. Assumption (*) is wrong, since it was a LinkedHashMap. So you do need to preserve order, which Arjan's solution does.
I would create a "pipeline" like this, but it has a lot of operations and could probably be shortened. The two map steps could be combined into one, but I think you get the general idea.
inputMap
.toList // List((5,c), (1,a), (2,a), (3,b), (4,z))
.sorted // List((1,a), (2,a), (3,b), (4,z), (5,c))
.filterNot((x) => {x._2 == 'z'}) // List((1,a), (2,a), (3,b), (5,c))
.map(_._2) // List(a, a, b, c)
.zipWithIndex // List((a,0), (a,1), (b,2), (c,3))
.map((x)=>{(x._2+1 -> x._1)}) // List((1,a), (2,a), (3,b), (4,c))
.toMap // Map((1,a), (2,a), (3,b), (4,c))
Performing these operations on lists keeps the ordering of the elements.
EDIT: I misread the OP question - thought you wanted run length encoding. Here's my take on your actual question:
val values = inputMap.values.filterNot(_ == 'z').toSet.zipWithIndex.toMap
EDIT 2: As noted in the comments, use toSeq.distinct or similar if preserving order is important.
val values = inputMap.values.filterNot(_ == 'z').toSeq.distinct.zipWithIndex.toMap
In my experience I have found that maps and functional languages do not play nicely. You'll note that all answers so far in one way or another involve turning the map into a list, filtering the list, and then turning the list back into a map.
I think this is due to maps being mutable data structures by nature. Consider that when building a list, the underlying structure of the list does not change when you prepend a new element; for a true (linked) list, a prepend is a constant O(1) operation. Whereas for a map, the internal structure can change substantially when a new element is added, e.g. when the load factor becomes too high and the add algorithm resizes the map. In this way a functional language cannot just create a series of values and pop them into a map as it goes along, due to the possible side effects of introducing a new key/value pair.
That said, I still think there should be better support for filtering, mapping and folding/reducing maps. Since we start with a map, we know the maximum size of the map and it should be easy to create a new one.
If you're wanting to get to grips with functional programming then I'd recommend steering clear of maps to start with. Stick with the things that functional languages were designed for -- list manipulation.