Retrieve many values associated with given keys at once using Scala

I have a map with hundreds of elements and I want to retrieve many values at once, each associated with a given key.
Example:
Map("t1" -> 1, "t2" -> 2, ..., "t340" -> 340)
I know I can use the apply method, but I'm trying to retrieve around 50 values at once, which would make the code look like:
val a = map.apply("t1", "t2", "t3", "t4", "t5", "t6", ..., "t50")
Are there any other ways that I can retrieve many values at once, using apply or other methods of the Scala Map collection?

Depending on what kind of output you expect, this can be done in many different ways.
If you are expecting a collection, you could do something like:
val keys: List[String]
keys.flatMap(map.get) // List[Int]
where you would get a list of values ordered according to the keys, with keys that have no value skipped.
If you needed the values in a collection and order were irrelevant, then
val keySet = keys.toSet
map.filter(p => keySet(p._1)).values
should be enough.
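To make both approaches concrete, here is a minimal sketch with hypothetical sample data:
val map = Map("t1" -> 1, "t2" -> 2, "t3" -> 3)
val keys = List("t1", "t3", "t99") // "t99" is not in the map

keys.flatMap(map.get) // List(1, 3): ordered by keys, missing keys skipped

val keySet = keys.toSet
map.filter(p => keySet(p._1)).values // Iterable(1, 3): order not guaranteed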
Yet another option would be:
keys.map { k =>
  k -> map.get(k)
}
to avoid losing information about which values were present and which were not. Though in practice this wouldn't be much different from just:
map.filter(p => keySet(p._1))
If you expected a fixed number of keys that all have to be present, then I can only see this as:
for {
  a1 <- map.get(k1)
  a2 <- map.get(k2)
  ...
  an <- map.get(kn)
} yield (a1, a2, ..., an)
which would return an Option of a tuple. That could be written in a prettier way using Cats, e.g.
(map.get(k1), map.get(k2), ..., map.get(kn)).tupled
though most extension methods (and tuples) support at most 22 arguments, so for 50 keys I can only see solving your problem with a collection (or a very long for comprehension yielding a custom 50-field case class).
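For illustration, here is a minimal sketch of the Cats version for three keys, assuming Cats is on the classpath (the map and keys are hypothetical):
import cats.implicits._

val map = Map("t1" -> 1, "t2" -> 2, "t3" -> 3)

// Some((1, 2, 3)): every lookup succeeded
val abc: Option[(Int, Int, Int)] =
  (map.get("t1"), map.get("t2"), map.get("t3")).tupled

// None: one lookup failed, so the whole tuple fails
(map.get("t1"), map.get("t99"), map.get("t3")).tupled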

Related

Scala combination function issue

I have an input file like this:
The Works of Shakespeare, by William Shakespeare
Language: English
and I want to use flatMap with the combinations method to get the K-V pairs per line.
This is what I do:
val pairs = input.flatMap { line =>
  line.split("[\\s*$&#/\"'\\,.:;?!\\[\\(){}<>~\\-_]+")
    .filter(_.matches("[A-Za-z]+"))
    .combinations(2)
    .toSeq
    .map { case array => array(0) -> array(1) }
}
I got 17 pairs after this, but 2 of them are missing: (by,shakespeare) and (william,shakespeare). I think there might be something wrong with the last word of the first sentence, but I don't know how to solve it. Can anyone tell me?
The combinations method will not give duplicates, even if the values are in the opposite order. So the pairs you are missing already appear in the result in the other order.
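You can see this by running combinations directly on the words in question:
Array("by", "william", "shakespeare").combinations(2).map(_.mkString(", ")).toList
// List("by, william", "by, shakespeare", "william, shakespeare")
// (shakespeare, by) never appears: it is the same combination as (by, shakespeare)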
This code will create all ordered pairs of words in the text.
for {
  line <- input
  t <- line.split("""\W+""").tails if t.length > 1
  a = t.head
  b <- t.tail
} yield a -> b
Here is the description of the tails method:
Iterates over the tails of this traversable collection. The first value will be this traversable collection and the final one will be an empty traversable collection, with the intervening values the results of successive applications of tail.
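For example, applied to a single hypothetical line, the snippet above yields every ordered pair:
val input = Seq("by William Shakespeare")

val pairs = for {
  line <- input
  t <- line.split("""\W+""").tails if t.length > 1
  a = t.head
  b <- t.tail
} yield a -> b
// Seq((by,William), (by,Shakespeare), (William,Shakespeare))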

Scala: efficiently comparing the contents of two lists, may include duplicates, ignoring order, not using sort

In Scala, how can I efficiently compare the contents of two lists/seqs regardless of their order, without sorting (I don't know what the type of the elements is)?
The lists/seqs may contain duplicates.
I have seen a somewhat similar discussion, but some answers there are incorrect, or they require sorting.
You can do
list1.groupBy(identity) == list2.groupBy(identity)
It's O(n).
If creating the temporary lists is an issue, you could create a helper method that keeps only the count for each item rather than all occurrences:
def counter[T](l: List[T]) =
  l.foldLeft(Map[T, Int]() withDefaultValue 0) { (m, x) =>
    m + (x -> (1 + m(x)))
  }
counter(list1) == counter(list2)
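A quick usage sketch with hypothetical data:
val list1 = List(1, 2, 2, 3)
val list2 = List(3, 2, 1, 2)

counter(list1) == counter(list2)          // true: same elements with same counts
counter(list1) == counter(List(1, 2, 3)) // false: the duplicate 2 is missing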

Comparing Subsets of an RDD

I’m looking for a way to compare subsets of an RDD intelligently.
Let's say I had an RDD with key/value pairs of type (Int -> T). I eventually need to say “compare all values of key 1 with all values of key 2, and compare the values of key 3 to the values of keys 5 and 7”; how would I go about doing this efficiently?
The way I'm currently thinking of doing it is by creating a List of filtered RDDs and then using RDD.cartesian():
def filterSubset[T] = (b: Int, r: RDD[(Int, T)]) => r.filter { case (name, _) => name == b }

val keyPairs: List[(Int, Int)] // all key pairs
val rddPairs = keyPairs.map {
  case (a, b) =>
    filterSubset(a, r).cartesian(filterSubset(b, r))
}
rddPairs.map { /* whatever I want to compare… */ }
I would then iterate the list and perform a map on each of the RDDs of pairs to gather the relational data that I need.
What I can't tell about this idea is whether it would be extremely inefficient to set up possibly hundreds of map jobs and then iterate through them. In this case, would the lazy evaluation in Spark optimize the data shuffling between all of the maps? If not, can someone recommend a possibly more efficient way to approach this issue?
Thank you for your help
One way you can approach this problem is to replicate and partition your data to reflect the key pairs you want to compare. Let's start by creating two maps from the actual keys to the temporary keys we'll use for replication and joins:
def genMap(keys: Seq[Int]) = keys
  .zipWithIndex.groupBy(_._1)
  .map { case (k, vs) => k -> vs.map(_._2) }
val left = genMap(keyPairs.map(_._1))
val right = genMap(keyPairs.map(_._2))
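For example, with the comparisons described in the question (key 1 vs key 2, and key 3 vs keys 5 and 7) this gives:
val keyPairs = List((1, 2), (3, 5), (3, 7))

genMap(keyPairs.map(_._1)) // Map(1 -> List(0), 3 -> List(1, 2))
genMap(keyPairs.map(_._2)) // Map(2 -> List(0), 5 -> List(1), 7 -> List(2))
The temporary key i ties together the two sides of the i-th pair, so after replication the matching groups can be cogrouped.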
Next we can transform data by replicating with new keys:
def mapAndReplicate[T: ClassTag](rdd: RDD[(Int, T)], map: Map[Int, Seq[Int]]) = {
  rdd.flatMap { case (k, v) => map.getOrElse(k, Seq()).map(x => (x, (k, v))) }
}
val leftRDD = mapAndReplicate(r, left)
val rightRDD = mapAndReplicate(r, right)
Finally we can cogroup:
val cogrouped = leftRDD.cogroup(rightRDD)
And compare / filter pairs:
cogrouped.values.flatMap { case (xs, ys) =>
  for {
    (kx, vx) <- xs
    (ky, vy) <- ys
    if cosineSimilarity(vx, vy) <= threshold
  } yield ((kx, vx), (ky, vy))
}
Obviously, in its current form this approach is limited. It assumes that the values for an arbitrary pair of keys can fit into memory, and it requires a significant amount of network traffic. Still, it should give you some idea of how to proceed.
Another possible approach is to store the data in an external system (for example a database) and fetch the required key-value pairs on demand.
Since you're trying to find similarity between elements, I would also consider a completely different approach. Instead of naively comparing key-by-key, I would try to partition the data using a custom partitioner that reflects the expected similarity between documents. It is far from trivial in general, but it should give much better results.
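As a rough illustration of that last idea, here is a minimal sketch of a custom Spark Partitioner; the group assignment map is hypothetical and would have to come from whatever similarity structure you expect in your data:
import org.apache.spark.Partitioner

// Route each key to the partition of its expected-similarity group,
// so keys that need to be compared end up co-located.
class GroupPartitioner(assignment: Map[Int, Int], parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = key match {
    case k: Int => math.abs(assignment.getOrElse(k, k)) % parts
    case _      => 0
  }
}

// e.g. rdd.partitionBy(new GroupPartitioner(Map(1 -> 0, 2 -> 0, 3 -> 1, 5 -> 1, 7 -> 1), 2))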
Using DataFrames, you can easily do the cartesian operation using join:
dataframe1.join(dataframe2, dataframe1("key") === dataframe2("key"))
It will probably do exactly what you want, and do it efficiently.
If you don't know how to create a DataFrame, please refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#creating-dataframes
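A minimal sketch of the DataFrame variant, assuming Spark 2.x and hypothetical sample data:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("subset-compare").getOrCreate()
import spark.implicits._

val dataframe1 = Seq((1, "a"), (1, "b")).toDF("key", "value")
val dataframe2 = Seq((1, "x"), (2, "y")).toDF("key", "value")

// one row per matching (left, right) combination that shares a key
dataframe1.join(dataframe2, dataframe1("key") === dataframe2("key")).show()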

Flattening a Set of pairs of sets to one pair of sets

I have a for comprehension with a generator from a Set[MyType].
This MyType has a lazy val called factsPair which returns a pair of sets:
(Set[MyFact], Set[MyFact]).
I wish to loop through all of them and unify the facts into one flattened pair (Set[MyFact], Set[MyFact]) as follows; however, I am getting No implicit view available ... and not enough arguments for flatten: implicit (asTraversable ... errors. (I am a bit new to Scala, so I'm still trying to get used to the errors.)
lazy val allFacts =
  (for {
    mytype <- mytypeList
  } yield mytype.factsPair).flatten
What do I need to specify to flatten for this to work?
Scala's flatten works on collections of collections. You have a Seq[(Set[MyFact], Set[MyFact])], and since a tuple is not a collection it can't be flattened.
I would recommend learning the foldLeft function, because it's very general and quite easy to use as soon as you get the hang of it:
lazy val allFacts = myTypeList.foldLeft((Set[MyFact](), Set[MyFact]())) {
  case (accumulator, next) =>
    val pairs1 = accumulator._1 ++ next.factsPair._1
    val pairs2 = accumulator._2 ++ next.factsPair._2
    (pairs1, pairs2)
}
The first parameter is the initial element to which the other elements will be appended. We start with an empty tuple of sets, initialized like this: (Set[MyFact](), Set[MyFact]()).
Next we specify the function that takes the accumulator and appends the next element to it, returning a new accumulator containing that element. Because of all the tuples it doesn't look nice, but it works.
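A quick check of the fold, using hypothetical stand-ins for MyType and MyFact:
type MyFact = Int
case class MyType(factsPair: (Set[MyFact], Set[MyFact]))

val myTypeList = List(MyType((Set(1), Set(2))), MyType((Set(3), Set(4))))

myTypeList.foldLeft((Set[MyFact](), Set[MyFact]())) { case (acc, next) =>
  (acc._1 ++ next.factsPair._1, acc._2 ++ next.factsPair._2)
} // (Set(1, 3), Set(2, 4))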
You won't be able to use flatten for this, because flatten on a collection returns a collection, and a tuple is not a collection.
You can, of course, just split, flatten, and join again:
val pairs = for {
  mytype <- mytypeList
} yield mytype.factsPair
val (first, second) = pairs.unzip
val allFacts = (first.flatten, second.flatten)
A tuple isn't traversable, so you can't flatten it. You need to return something that can be iterated over, like a List:
List((1, 2), (3, 4)).flatten // does not compile: a tuple is not a collection
List(List(1, 2), List(3, 4)).flatten // List(1, 2, 3, 4)
I'd like to offer a more algebraic view. What you have here can be nicely solved using monoids. Every monoid has a zero element and an operation that combines two elements into one.
In this case, sets form a monoid: the zero element is the empty set and the operation is set union. And if we have two monoids, their Cartesian product is also a monoid, where the operations are defined pairwise (see the examples on Wikipedia).
Scalaz defines monoids for sets as well as tuples, so we don't need to do anything there. We'll just need a helper function that combines multiple monoid elements into one, which is implemented easily using folding:
def msum[A](ps: Iterable[A])(implicit m: Monoid[A]): A =
  ps.foldLeft(m.zero)(m.append(_, _))
(Perhaps such a function already exists somewhere in Scala; I didn't find it.) Using msum we can easily define
def pairs(ps: Iterable[MyType]): (Set[MyFact], Set[MyFact]) =
  msum(ps.map(_.factsPair))
using Scalaz's implicit monoids for tuples and sets.
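A usage sketch of msum from above, assuming Scalaz is on the classpath and using hypothetical stand-in data for the factsPair values:
import scalaz._, Scalaz._

val factsPairs: List[(Set[Int], Set[Int])] =
  List((Set(1), Set(2)), (Set(3), Set(4)))

msum(factsPairs) // (Set(1, 3), Set(2, 4)): set union applied pairwise by the tuple monoid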

Scala for loop over two lists simultaneously

I have a List[Message] and a List[Author] which have the same number of items, and should be ordered so that at each index, the Message is from the Author.
I also have class that we'll call here SmartMessage, with a constructor taking 2 arguments: a Message and the corresponding Author.
What I want to do, is to create a List[SmartMessage], combining the data of the 2 simple lists.
Extra question: does List preserve insertion order in Scala? Just to make sure I create List[Message] and a List[Author] with same ordering.
You could use zip:
val ms: List[Message] = ???
val as: List[Author] = ???
val sms = for ((m, a) <- ms zip as) yield new SmartMessage(m, a)
If you don't like for comprehensions, you could use map:
val sms = (ms zip as).map { case (m, a) => new SmartMessage(m, a) }
The zip method creates a collection of pairs, in this case List[(Message, Author)].
You could also use the zipped method on Tuple2 (and Tuple3):
val sms = (ms, as).zipped.map { (m, a) => new SmartMessage(m, a) }
As you can see, you don't need pattern matching in map in this case.
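A self-contained sketch, with hypothetical stand-ins for Message, Author and SmartMessage:
case class Message(text: String)
case class Author(name: String)
case class SmartMessage(message: Message, author: Author)

val ms = List(Message("hello"), Message("bye"))
val as = List(Author("Ann"), Author("Bob"))

val sms = (ms zip as).map { case (m, a) => SmartMessage(m, a) }
// List(SmartMessage(Message(hello),Author(Ann)), SmartMessage(Message(bye),Author(Bob)))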
Extra
List is a Seq, and Seq preserves the order of elements. See the Scala collections overview.
There are 3 main branches of collections: Seq, Set and Map.
Seq preserves order of elements.
Set contains no duplicate elements.
Map contains mappings from keys to values.
List in Scala is a linked list, so you should prepend elements to it rather than append. See Performance Characteristics of Scala collections.
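For example:
val xs = List(2, 3)
val ys = 1 :: xs // O(1): prepend reuses xs
val zs = xs :+ 4 // O(n): append copies the whole list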