I need to create a test for various collections based on Map and HashMap.
I have two functions that create test data, e.g.:
def f1: String = { ... }
def f2: String = { ... }
These functions create random data every time they are called.
My map is:
val m:Map[String,String] = ...
What I am trying to accomplish is to construct an immutable map with 10000 random items generated by calling f1/f2. In pseudocode:
for 1 to 10000
add-key-value-to-map (key = f1(), value = f2() )
end for
How can I accomplish this in Scala without destroying and rebuilding the map 10000 times?
EDIT:
Since it wasn't clear in the original post above, I am trying to run this with various types of maps (Map, HashMap, TreeMap).
List.fill(10000)((f1, f2)).toMap
You can use List.fill to create a List of (String, String) pairs and then call .toMap on it:
scala> def f1 = util.Random.alphanumeric take 5 mkString
f1: String
scala> def f2 = util.Random.alphanumeric take 5 mkString
f2: String
scala> val m = List.fill(5)(f1 -> f2).toMap
m: scala.collection.immutable.Map[String,String] =
Map(T7hD8 -> BpAa1, uVpno -> 6sMjc, wdaRP -> XSC1V, ZGlC0 -> aTwBo, SjfOr -> hdzIN)
Alternatively you could use Map/HashMap/TreeMap's .apply function:
scala> val m = collection.immutable.TreeMap(List.fill(5)(f1 -> f2) : _*)
m: scala.collection.immutable.TreeMap[String,String] =
Map(3cieU -> iy0KV, 8oUb1 -> YY6NC, 95ol4 -> Sf9qp, GhXWX -> 8U8wt, ZD8Px -> STMOC)
val m = (1 to 10000).foldLeft(Map.empty[String,String]) { (m, _) => m + (f1 -> f2) }
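Since the question's edit asks about several map types, the same fold can start from a different empty map. The following is only a sketch: the bodies of f1 and f2 are not shown in the question, so the generators here are stand-ins, and the object name and n are illustrative.

```scala
import scala.collection.immutable.{HashMap, TreeMap}
import scala.util.Random

object FoldIntoMaps {
  // Stand-ins for the question's f1/f2, whose real bodies are elided.
  def f1: String = Random.alphanumeric.take(5).mkString
  def f2: String = Random.alphanumeric.take(5).mkString

  val n = 1000

  // The result type follows the empty map the fold starts from.
  val hashed: HashMap[String, String] =
    (1 to n).foldLeft(HashMap.empty[String, String])((m, _) => m.updated(f1, f2))

  // TreeMap keeps its keys sorted by the implicit Ordering[String].
  val sorted: TreeMap[String, String] =
    (1 to n).foldLeft(TreeMap.empty[String, String])((m, _) => m.updated(f1, f2))
}
```

Each step produces a new immutable map that shares structure with the previous one, so nothing is rebuilt from scratch.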
Using tabulate as follows,
Seq.tabulate(10000)(_ => f1 -> f2).toMap
It is unclear whether the random key generator may produce duplicate keys, in which case 10000 iterations would not suffice to produce a map of that size.
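If exactly n entries are required even when keys collide, one option is to keep generating until the map reaches the target size. This is a sketch under the assumption that the generator can produce at least n distinct keys (otherwise it never terminates); fillDistinct is a hypothetical helper, not a library function.

```scala
import scala.annotation.tailrec

object FillDistinct {
  // key and value are by-name, so they are re-evaluated on every iteration.
  def fillDistinct(n: Int)(key: => String, value: => String): Map[String, String] = {
    @tailrec
    def loop(acc: Map[String, String]): Map[String, String] =
      if (acc.size >= n) acc                 // enough distinct keys collected
      else loop(acc.updated(key, value))     // a duplicate key just overwrites
    loop(Map.empty)
  }
}
```

With the question's generators this would be called as FillDistinct.fillDistinct(10000)(f1, f2).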
An intuitive approach,
(1 to 10000).map(_ => f1 -> f2).toMap
Using a recursive function instead of generating a range to iterate over (although numerous intermediate Maps are created),
def g(i: Int): Map[String,String] = {
if (i<=0)
Map()
else
Map(f1 -> f2) ++ g(i-1)
}
I have a Scala map collection that looks something like this:
var collection = Map((A,B) -> 1)
The key is (A,B) and the value is 1.
My question: If I use collection.head._1, the result is (A,B) which is correct. But I want to extract A only, without B, as I need to compare A with some other variable. So the final result should be A stored in a different variable.
I tried to use collection.head._1(0), which results in the error:
Any does not take parameters
You can try:
val collection = Map(("A","B") -> 1)
collection.map{ case ((a, b),v) => a -> v}
You can use keySet to get all the keys as a Set[(String, String)] and then map it into the first element of each:
val coll: Map[(String, String), Int] =
Map(
("one", "elephant") -> 1,
("two", "elephants") -> 2,
("three", "elephants") -> 3
)
/*
val myKeys = coll.keySet.map { case (x, _) => x }
// equivalent to:
val myKeys = coll.keySet.map(tup => tup._1)
// equivalent to: */
val myKeys = coll.keySet.map(_._1) // Set(one, two, three)
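For the single-entry case in the question, the first component can be pulled out directly. Tuples are accessed with _1/_2 (or by pattern matching), not with an index like (0). A small sketch, assuming String components in place of the question's A and B:

```scala
object FirstOfKey {
  val collection: Map[(String, String), Int] = Map(("A", "B") -> 1)

  // Destructure the first entry, which has the shape ((a, b), value).
  val ((a, _), _) = collection.head

  // Equivalent accessor chain: the first ._1 is the key tuple, the second its first part.
  val sameA: String = collection.head._1._1
}
```

Note that head throws on an empty map; headOption is the safer choice when the map may be empty.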
What is the best way to apply a function to each element of a Map and at the end return the same Map, unchanged, so that it can be used in further operations?
I'd like to avoid:
myMap.map(el => {
effectfullFn(el)
el
})
to achieve syntax like this:
myMap
.mapEffectOnKV(effectfullFn)
.foreach(println)
map is not what I'm looking for, because I have to specify what comes out of the map (as in the first code snippet), and I don't want to do that.
I want a special operation that knows/assumes that the map elements should be returned without change after the side-effect function has been executed.
In fact, this would be so useful to me, I'd like to have it for Map, Array, List, Seq, Iterable... The general idea is to peek at the elements to do something, then automatically return these elements.
The real case I'm working on looks like this:
calculateStatistics(trainingData, indexMapLoaders)
.superMap { (featureShardId, shardStats) =>
val outputDir = summarizationOutputDir + "/" + featureShardId
val indexMap = indexMapLoaders(featureShardId).indexMapForDriver()
IOUtils.writeBasicStatistics(sc, shardStats, outputDir, indexMap)
}
Once I have calculated the statistics for each shard, I want to append the side effect of saving them to disk, and then just return those statistics, without having to create a val and having that val's name be the last statement in the function, e.g.:
val stats = calculateStatistics(trainingData, indexMapLoaders)
stats.foreach { case (featureShardId, shardStats) =>
val outputDir = summarizationOutputDir + "/" + featureShardId
val indexMap = indexMapLoaders(featureShardId).indexMapForDriver()
IOUtils.writeBasicStatistics(sc, shardStats, outputDir, indexMap)
}
stats
It's probably not very hard to implement, but I was wondering if there was something in Scala already for that.
A function cannot be effectful by definition, so I wouldn't expect anything convenient in the Scala standard library. However, you can write a wrapper:
def tap[T](effect: T => Unit)(x: T) = {
effect(x)
x
}
Example:
scala> Map(1 -> 1, 2 -> 2)
.map(tap(el => el._1 + 5 -> el._2))
.foreach(println)
(1,1)
(2,2)
You can also define an implicit:
implicit class TapMap[K,V](m: Map[K,V]){
def tap(effect: ((K,V)) => Unit): Map[K,V] = m.map{x =>
effect(x)
x
}
}
Examples:
scala> Map(1 -> 1, 2 -> 2).tap(el => el._1 + 5 -> el._2).foreach(println)
(1,1)
(2,2)
To abstract more, you can define this implicit on TraversableOnce, so it would be applicable to List, Set and so on if you need it:
implicit class TapTraversable[Coll[_], T](m: Coll[T])(implicit ev: Coll[T] <:< TraversableOnce[T]){
def tap(effect: T => Unit): Coll[T] = {
ev(m).foreach(effect)
m
}
}
scala> List(1,2,3).tap(println).map(_ + 1)
1
2
3
res24: List[Int] = List(2, 3, 4)
scala> Map(1 -> 1).tap(println).toMap // `toMap` is needed here for the same reason it is needed when you do `.map(f).toMap`
(1,1)
res5: scala.collection.immutable.Map[Int,Int] = Map(1 -> 1)
scala> Set(1).tap(println)
1
res6: scala.collection.immutable.Set[Int] = Set(1)
It's more useful, but requires some "mumbo-jumbo" with types, as Coll[_] <: TraversableOnce[_] doesn't work (Scala 2.12.1), so I had to use an evidence parameter for that.
You can also try the CanBuildFrom approach; see: How to enrich a TraversableOnce with my own generic map?
The overall recommendation for passthrough side effects on iterators is to use a streaming library (scalaz/fs2/monix) together with Task; these provide an observe function (or some analog of it) that does what you want, asynchronously if needed.
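For readers on a newer compiler than the 2.12.1 mentioned above: Scala 2.13 added tapEach to the standard collections, which runs a function on every element and returns the collection. A minimal sketch, assuming Scala 2.13 or later; the object and value names are illustrative.

```scala
import scala.collection.mutable.ListBuffer

object TapEachDemo {
  // Records what the side effect saw, so the effect is observable.
  val seen: ListBuffer[(Int, Int)] = ListBuffer.empty

  // tapEach applies the effect to each entry and returns the Map itself,
  // so the chain can continue with ordinary transformations.
  val result: Map[Int, Int] =
    Map(1 -> 1, 2 -> 2)
      .tapEach(kv => seen += kv)
      .map { case (k, v) => k -> (v + 1) }
}
```

On strict collections the effect runs immediately; on views and iterators it runs as elements are consumed.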
My answer from before you provided an example of what you want:
You can represent effectful computation without side-effects and have distinct values that represent state before and after:
scala> val withoutSideEffect = Map(1 -> 1, 2 -> 2)
withoutSideEffect: scala.collection.immutable.Map[Int,Int] = Map(1 -> 1, 2 -> 2)
scala> val withSideEffect = withoutSideEffect.map(el => el._1 + 5 -> (el._2 + 5))
withSideEffect: scala.collection.immutable.Map[Int,Int] = Map(6 -> 6, 7 -> 7)
scala> withoutSideEffect //unchanged
res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 1, 2 -> 2)
scala> withSideEffect //changed
res1: scala.collection.immutable.Map[Int,Int] = Map(6 -> 6, 7 -> 7)
Looks like the concept you're after is similar to the Unix tee
utility--take an input and direct it to two different outputs. (tee
gets its name from the shape of the letter 'T', which looks like a
pipeline from left to right with another line branching off downwards.)
Here's the Scala version:
package object mypackage {
implicit class Tee[A](a: A) extends AnyVal {
def tee(f: A => Unit): A = { f(a); a }
}
}
With that, we can do:
calculateStatistics(trainingData, indexMapLoaders) tee { stats =>
stats foreach { case (featureShardId, shardStats) =>
val outputDir = summarizationOutputDir + "/" + featureShardId
val indexMap = indexMapLoaders(featureShardId).indexMapForDriver()
IOUtils.writeBasicStatistics(sc, shardStats, outputDir, indexMap)
}
}
Note that as defined, Tee is very generic--it can do an effectful
operation on any value and then return the original passed-in value.
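An operation of the same shape ships with the standard library from Scala 2.13 onward as tap in scala.util.chaining. The sketch below assumes 2.13 or later and uses made-up shard names in place of the question's statistics.

```scala
import scala.util.chaining._

object ChainingTapDemo {
  // Stand-in for the disk write in the question.
  val log = new StringBuilder

  // tap passes the whole value to the effect and then returns it unchanged.
  val stats: Map[String, Int] =
    Map("shardA" -> 3, "shardB" -> 5)
      .tap(m => m.foreach { case (id, n) => log.append(s"$id=$n;") })
}
```

Unlike a per-element operation, tap sees the whole collection at once, which matches the Tee helper above.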
Call foreach on your Map with your effectful function. Your original Map will not be changed, as Maps in Scala are immutable by default.
val myMap = Map(1 -> 1)
myMap.foreach(effectfullFn)
If you are trying to chain this operation, you can use map
myMap.map(el => {
effectfullFn(el)
el
})
This post is essentially about how to build joint and marginal histograms from a (String, String) RDD. I posted the code that I eventually used below as the answer.
I have an RDD that contains a set of tuples of type (String,String) and since they aren't unique I want to get a look at how many times each String, String combination occurs so I use countByValue like so
val PairCount = Pairs.countByValue().toSeq
which gives me tuples of the form ((String,String),Long), where the Long is the number of times that the (String, String) tuple appeared.
These Strings can be repeated in different combinations and I essentially want to run word count on this PairCount variable so I tried something like this to start:
PairCount.map(x => (x._1._1, x._2))
But the output this produces is String1->1, String2->1, String3->1, etc.
How do I output a key-value pair from a map job in this case, where the key is one of the String values from the inner tuple and the value is the Long value from the outer tuple?
Update:
@vitalii gets me almost there. The answer gets me to a Seq[(String,Long)], but what I really need is to turn that into a map so that I can run reduceByKey on it afterwards. When I run
PairCount.flatMap { case ((x, y), n) => Seq(x -> n) }.toMap
for each unique x I get x->1
for example the above line of code generates mom->1 dad->1 even if the tuples out of the flatMap included (mom,30) (dad,59) (mom,2) (dad,14) in which case I would expect toMap to provide mom->30, dad->59 mom->2 dad->14. However, I'm new to scala so I might be misinterpreting the functionality.
How can I get the Tuple2 sequence converted to a map so that I can reduce on the map keys?
If I understand the question correctly, you need flatMap:
val pairCountRDD = pairs.countByValue() // RDD[((String, String), Int)]
val res : RDD[(String, Int)] = pairCountRDD.flatMap { case ((s1, s2), n) =>
Seq(s1 -> n, s2 -> n)
}
Update: I didn't quite understand what your final goal is, but here are a few more examples that may help you. By the way, the code above is incorrect: I missed the fact that countByValue returns a Map, not an RDD:
val pairs = sc.parallelize(
List(
"mom"-> "dad", "dad" -> "granny", "foo" -> "bar", "foo" -> "baz", "foo" -> "foo"
)
)
// don't use countByValue: if pairs is large you will run out of memory
val pairCountRDD = pairs.map(x => (x, 1)).reduceByKey(_ + _)
val wordCount = pairs.flatMap { case (a,b) => Seq(a -> 1, b ->1)}.reduceByKey(_ + _)
wordCount.take(10)
// count in how many pairs each word occurs, as key or as value:
val wordPairCount = pairs.flatMap { case (a,b) =>
if (a == b) {
Seq(a->1)
} else {
Seq(a -> 1, b ->1)
}
}.reduceByKey(_ + _)
wordPairCount.take(10)
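On the toMap confusion in the question's update: a Map holds one value per key, so toMap keeps only the last pair for each duplicated key rather than all of them. The sketch below uses plain Scala collections (no Spark) and the mom/dad numbers from the question; on an RDD, reduceByKey plays the role of the group-and-sum step.

```scala
object SumByKey {
  val counts: Seq[(String, Long)] =
    Seq("mom" -> 30L, "dad" -> 59L, "mom" -> 2L, "dad" -> 14L)

  // toMap keeps only the last value seen for each key.
  val lastWins: Map[String, Long] = counts.toMap

  // Group by key first, then sum the values, to combine the duplicates.
  val summed: Map[String, Long] =
    counts.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }
}
```

Here lastWins is Map(mom -> 2, dad -> 14), while summed is Map(mom -> 32, dad -> 73).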
To get the histograms for the (String,String) RDD I used this code:
val Hist_X = histogram.map(x => (x._1-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_Y = histogram.map(x => (x._2-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_XY = histogram.map(x => (x-> 1.0)).reduceByKey(_+_)
where histogram was the (String,String) RDD.
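The same three histograms can be checked on a small local collection without Spark. This is only an illustration of the counting logic: groupBy plus size stands in for map and reduceByKey, the sample pairs are borrowed from the earlier answer, and countBy is a hypothetical helper.

```scala
object LocalHistograms {
  val pairs: Seq[(String, String)] =
    Seq("mom" -> "dad", "dad" -> "granny", "foo" -> "bar", "foo" -> "baz", "foo" -> "foo")

  // Count how many elements share each key, as a Double like the RDD version.
  private def countBy[A, K](xs: Seq[A])(key: A => K): Map[K, Double] =
    xs.groupBy(key).map { case (k, group) => k -> group.size.toDouble }

  val histX: Map[String, Double] = countBy(pairs)(_._1)             // marginal over the first string
  val histY: Map[String, Double] = countBy(pairs)(_._2)             // marginal over the second string
  val histXY: Map[(String, String), Double] = countBy(pairs)(p => p) // joint histogram
}
```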
I know about a couple of similar questions. They don't help me: the code does not work if there is no existing key.
I just need a nice way to update a Map so that a value is added to the existing key's value (if the key exists) or put under a NEW key (if the map does not contain the key).
Following code works but I don't like it:
val a = collection.mutable.Map(("k1" -> 1), ("k2" -> 5))
val key = "k1"
val elem = a.get(key)
if (elem == None) {
a += ("k5" -> 200)
} else {
a.update(key, elem.get + 5)
}
Any pointers to a better one?
Current Scala version is 2.10.4 and I cannot currently switch to 2.11.
A mutable map is not a strict requirement, but it is preferred.
Here is, for example, a similar question, but I also need to handle the case of a non-existing key, which is not covered there. At the very least we should acknowledge that a.get(key) could be None, or find some better approach. |+| was a good idea, but I'd like to stay with basic Scala 2.10.x.
The shortest way to do that:
a += a.get(key).map(x => key -> (x + 5)).getOrElse("k5" -> 200)
In general:
a += a.get(k).map(f).map(k -> _).getOrElse(kv)
Same if your dictionary is immutable:
m + m.get(k).map(f).map(k -> _).getOrElse(kv)
so I don't see any reason to use mutable collection here.
If you don't like all these Option.map things:
m + (if (m.contains(k)) k -> f(m(k)) else kv)
Note, that there is a whole class of possible variations:
k1 -> f(m(k1)) else k2 -> v2 //original
k1 -> f(m(k1)) else k1 -> v2
k1 -> f(m(k2)) else k2 -> v2
k1 -> f(m(k2)) else k1 -> v2
k2 -> v2 else k1 -> f(m(k1))
k1 -> v2 else k1 -> f(m(k1))
k2 -> v2 else k1 -> f(m(k2))
k1 -> v2 else k1 -> f(m(k2))
... //v2 may also be a function from some key's value
So, why is it not a standard function? IMO, because all the variations can still be implemented as one-liners. If you want a library with all the functions that can be implemented as one-liners, you know, it's Scalaz :).
P.S. If you also wonder why there is no "update(d) if persist" function, see @Rex Kerr's answer here
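The general one-liner for the immutable case can be wrapped in a small helper. This is a sketch with hypothetical names (addOrUpdate, fallback); it uses only Option and Map operations available in Scala 2.10.

```scala
object AddOrUpdate {
  // If k is present, replace its value with f(value); otherwise add the fallback pair.
  def addOrUpdate[K, V](m: Map[K, V], k: K, fallback: (K, V))(f: V => V): Map[K, V] =
    m + (m.get(k).map(f).map(k -> _).getOrElse(fallback))
}
```

For the question's data, addOrUpdate(Map("k1" -> 1, "k2" -> 5), "k1", "k5" -> 200)(_ + 5) gives Map(k1 -> 6, k2 -> 5), and a missing key adds k5 -> 200 instead.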
You can create your own function for that purpose:
def addOrUpdate[K, V](m: collection.mutable.Map[K, V], k: K, kv: (K, V),
f: V => V): Unit = {
m.get(k) match {
case Some(e) => m.update(k, f(e))
case None => m += kv
}
}
addOrUpdate(a, "k1", "k5" -> 200, (v: Int) => v + 5)
Scala 2.13 introduced the updatedWith method, which seems to be the most idiomatic way to update a map conditionally on the existence of the key.
val a = Map(("k1" -> 1), ("k2" -> 5))
val a1 = a.updatedWith("k1") {
case Some(v) => Some(v + 5)
case None => Some(200)
}
println(a1) // Map(k1 -> 6, k2 -> 5)
One may also remove values using it:
val a2 = a.updatedWith("k2") {
case Some(5) => None
case v => v
}
println(a2) // Map(k1 -> 1)
An excerpt from the Scala Standard Library reference:
def updatedWith[V1 >: V](key: K)(remappingFunction: (Option[V]) => Option[V1]): Map[K, V1]
Update a mapping for the specified key and its current optionally-mapped value (Some if there is current mapping, None if not).
If the remapping function returns Some(v), the mapping is updated with the new value v. If the remapping function returns None, the mapping is removed (or remains absent if initially absent). If the function itself throws an exception, the exception is rethrown, and the current mapping is left unchanged.
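Since the question prefers a mutable map, it may be worth noting that mutable maps have an in-place counterpart, updateWith, which returns the new value as an Option. The sketch assumes Scala 2.13 or later, so it does not help on the asker's 2.10.4.

```scala
import scala.collection.mutable

object UpdateWithDemo {
  val a: mutable.Map[String, Int] = mutable.Map("k1" -> 1, "k2" -> 5)

  // Key exists: its value is replaced in place; the new value is returned.
  val afterHit: Option[Int] = a.updateWith("k1") {
    case Some(v) => Some(v + 5)
    case None    => Some(200)
  }

  // Key is absent: the None branch inserts a fresh value.
  val afterMiss: Option[Int] = a.updateWith("k5") {
    case Some(v) => Some(v + 5)
    case None    => Some(200)
  }
}
```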
One somewhat clear way to do this:
val a = collection.mutable.Map[String, Int]() withDefault insertNewValue
def insertNewValue(key: String): Int = {
a += key -> getValueForKey(key)
a(key)
}
def getValueForKey(key: String): Int = key.length
Still, I strongly discourage the use of mutable collections. It's preferable to keep internal mutable state as variables holding immutable values.
This follows from a simple rule: you shouldn't expose your internal state unless it's absolutely necessary, and if you do, you should minimize the side effects that exposure may produce.
If you expose a reference to mutable state, any other actor can change its values, and you lose referential transparency. It's not by chance that all references to mutable collections are quite long and awkward to use. That's the developers' message to you.
Not surprisingly, the code still remains the same, with some minimal changes at the map instantiation.
var a = Map[String, Int]() withDefault insertNewValue
def insertNewValue(key: String): Int = {
a += key -> getValueForKey(key)
a(key)
}
def getValueForKey(key: String): Int = key.length
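The "private var holding an immutable map" idea can be sketched as a small class. Tally and its methods are hypothetical names used only for illustration; the point is that callers receive immutable snapshots and cannot mutate the internal state.

```scala
final class Tally {
  // Mutable reference, immutable value: the state is never shared for writing.
  private var counts: Map[String, Int] = Map.empty

  // Add n to the key's current count, starting from 0 when the key is new.
  def add(key: String, n: Int): Unit =
    counts = counts.updated(key, counts.getOrElse(key, 0) + n)

  // Callers get an immutable snapshot; later adds do not affect it.
  def snapshot: Map[String, Int] = counts
}
```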
For your mutable map, you can just do:
val a = collection.mutable.Map(("k1" -> 1), ("k2" -> 5))
val key = "k1"
a.update(key, a.getOrElse(key, 0) + 5)
a // val res1: scala.collection.mutable.Map[String,Int] = HashMap(k1 -> 6, k2 -> 5)
Hey, I have a Map like this:
val valueParameters = Map("key1"->"value","anotherkey1"->"value","thirdkey1"->"value","key2"->"value","anotherkey2"->"value","thirdkey2"->"value")
and pattern:
val pattern = """(?<=[a-zA-Z])\d{1,2}""".r
val result = valueParameters.groupBy(x=>pattern.findAllIn(x._1).next().toInt).toSeq.sortBy(_._1).toMap
which gives a Map[Int,Map[String,String]]. I want to remove the number from the keys of the inner maps, since I no longer need it, so that I can write result(1)("key") instead of result(1)("key1").
This should work:
val result1 = result.map { case (k,v) =>
k -> v.map { case (a,b) =>
val a1 = a.takeWhile(! _.isDigit)
a1 -> b
}
}
Note that while using mapValues would result in shorter code, mapValues is a lazy operation that redoes the computation every time you access the map, whereas mapping the entries does the computation once, which is usually what you expect in Scala.
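As an alternative, the grouping and the digit stripping can be done in one pass by capturing the name and the number separately. Note that this swaps the question's lookbehind pattern for a two-group regex (Indexed is a name introduced here), and it silently drops keys that do not match that shape.

```scala
object StripKeyDigits {
  val valueParameters: Map[String, String] = Map(
    "key1" -> "value", "anotherkey1" -> "value", "thirdkey1" -> "value",
    "key2" -> "value", "anotherkey2" -> "value", "thirdkey2" -> "value")

  // Group 1 is the alphabetic name, group 2 the one- or two-digit index.
  private val Indexed = """([a-zA-Z]+)(\d{1,2})""".r

  val grouped: Map[Int, Map[String, String]] =
    valueParameters.toSeq
      .collect { case (Indexed(name, idx), v) => (idx.toInt, name -> v) } // split each key
      .groupBy(_._1)                                                      // group by index
      .map { case (idx, entries) => idx -> entries.map(_._2).toMap }      // strict inner maps
}
```

With this, grouped(1)("key") returns "value" directly.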