I have just started a project at work where we are migrating some C# tooling across to a new Scala project. This is my first exposure to the language (and to functional programming in general), so in the interest of not just writing Java-style code in Scala, I am wondering what the correct approach to handling the following scenario is.
We have two map objects which represent tabular data with the following structure:
map1: key | date | mapping val
map2: key | number
The mapping value in the first object is not always populated. Currently these are represented by Map[String, Array[String]] and Map[String, Double] types.
In the C# tool we have the following approach:
Loop through key set in first map
For every key, check to see if the mapping val is blank
If no mapping then fetch the number from map 2 and return
If a mapping exists, recursively call the method to get the full chain of mapping values and their numbers, summing as you go. E.g. key 1 might map to key 4, key 4 might map to key 5, etc., and we want to sum the map2 values for all of these keys.
Is there a clever way to do this in Scala which would avoid updating a list from within a for loop and recursively walking the map?
Is this what you are after?
@annotation.tailrec
def recurse(key: String, count: Double, map1: Map[String, String], map2: Map[String, Double]): Double = {
  map1.get(key) match {
    case Some(mappingVal) if mappingVal == "" =>
      // blank mapping: just add this key's own number from map2
      count + map2.getOrElse(key, 0.0)
    case Some(mappingVal) =>
      // follow the mapping, adding the mapped-to key's number as we go
      recurse(mappingVal, count + map2.getOrElse(mappingVal, 0.0), map1, map2)
    case None => count
  }
}
example use:
val m1: Map[String, String] = Map("1" -> "4", "4" -> "5", "5" -> "6", "8" -> "")
val m2: Map[String, Double] = Map("1" -> 1.0, "4" -> 4.0, "6" -> 10.0)
m1.map {
  case (k, _) => k -> recurse(k, 0.0, m1, m2)
}.foreach(println)
Output:
(1,14.0)
(4,10.0)
(5,10.0)
(8,0.0)
Note that there is no cycle detection - this will never terminate if map1 has a cycle.
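If cycles are possible, one way to guard against them is to thread a set of already-visited keys through the recursion. A minimal sketch (the seen parameter and the recurseSafe name are my additions, not part of the answer above):

@annotation.tailrec
def recurseSafe(key: String, count: Double, seen: Set[String],
                map1: Map[String, String], map2: Map[String, Double]): Double =
  if (seen(key)) count // key already visited: a cycle, so stop instead of looping forever
  else map1.get(key) match {
    case Some("") => count + map2.getOrElse(key, 0.0)
    case Some(mappingVal) =>
      recurseSafe(mappingVal, count + map2.getOrElse(mappingVal, 0.0), seen + key, map1, map2)
    case None => count
  }

Called as recurseSafe(k, 0.0, Set.empty, m1, m2), this gives the same results as above on cycle-free data.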
I'm trying to convert a list of maps (Seq[Map[String, Map[String, String]]]) into an RDD table/tuple where each key -> value pair in the inner map is flat-mapped into a tuple with the outer map's key. For example:
Map(
  1 -> Map("k" -> "v", "k1" -> "v1")
)
becomes
(1, "k", "v")
(1, "k1", "v1")
I've tried the following approach, but it seems to fail with concurrency issues. I have two worker nodes, and it duplicates each key -> value pair twice (which I assume is because I'm doing this wrong).
Let's assume I hold my map type in a case class 'Records':
val rdd = sc.parallelize(1 to records.length)
val recordsIt = records.iterator
val res: RDD[(String, String, String)] = rdd.flatMap(f => {
  val currItem = recordsIt.next()
  val x: immutable.Iterable[(String, String, String)] = currItem.mapData.map(v => {
    (currItem.identifier, v._1, v._2)
  })
  x
}).sortBy(r => r)
Is there a way to parallelize this work without running into serious concurrency issues (as I suspect is happening)?
example duplicated output
(201905_001ac172c2751c1d4f4b4cb0affb42ef_gFF0dSg4iw,CID,B13131608623827542)
(201905_001ac172c2751c1d4f4b4cb0affb42ef_gFF0dSg4iw,CID,B13131608623827542)
(201905_001ac172c2751c1d4f4b4cb0affb42ef_gFF0dSg4iw,ROD,19190321)
(201905_001ac172c2751c1d4f4b4cb0affb42ef_gFF0dSg4iw,ROD,19190321)
(201905_001b3ba44f6d1f7505a99e2288108418_mSfAfo31f8,CID,339B4C3C03DDF96AAD)
(201905_001b3ba44f6d1f7505a99e2288108418_mSfAfo31f8,CID,339B4C3C03DDF96AAD)
(201905_001b3ba44f6d1f7505a99e2288108418_mSfAfo31f8,ROD,19860115)
(201905_001b3ba44f6d1f7505a99e2288108418_mSfAfo31f8,ROD,19860115)
Using Spark parallelize here is not very efficient to begin with (since you already hold the data in memory, it is much less expensive to just iterate over it locally); the duplication most likely happens because each task deserializes its own copy of the captured recordsIt iterator, so different tasks re-read the same records. Nonetheless, a more idiomatic approach would be a simple flatMap:
sc.parallelize(records.toSeq)
.flatMapValues(identity)
.map { case (k1, (k2, v)) => (k1, k2, v) }
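For illustration, a sketch of the above wired to the example data, assuming records is the single Map from the question (the variable names here are mine):

val records: Map[String, Map[String, String]] =
  Map("1" -> Map("k" -> "v", "k1" -> "v1"))

val res = sc.parallelize(records.toSeq) // RDD[(String, Map[String, String])]
  .flatMapValues(identity)              // RDD[(String, (String, String))]
  .map { case (k1, (k2, v)) => (k1, k2, v) }

res.collect().foreach(println)
// (1,k,v)
// (1,k1,v1)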
I'm fairly new to Scala, so I hope you'll tolerate this question if you find it noobish :)
I wrote a function that returns a Seq of elements using yield syntax:
def calculateSomeMetrics(names: Seq[String]): Seq[Long] = {
  for (name <- names) yield {
    // some auxiliary actions
    val metrics = somehowCalculateMetrics()
    metrics
  }
}
Now I need to modify it to return a Map to preserve the original names against each of the calculated values:
def calculateSomeMetrics(names: Seq[String]): Map[String, Long] = { ... }
I've attempted to use the same yield-syntax but to yield a tuple instead of a single element:
def calculateSomeMetrics(names: Seq[String]): Map[String, Long] = {
  for (name <- names) yield {
    // Everything is the same as before
    (name, metrics)
  }
}
However, the compiler infers the type as Seq[(String, Long)], as per the compiler error message:
type mismatch;
found : Seq[(String, Long)]
required: Map[String, Long]
So I'm wondering, what is the "canonical Scala way" to implement such a thing?
An efficient way of creating different collection types is to use scala.collection.breakOut. It works with Maps and for-comprehensions too:
import scala.collection.breakOut
val x: Map[String, Int] = (for (i <- 1 to 10) yield i.toString -> i)(breakOut)
x: Map[String,Int] = Map(8 -> 8, 4 -> 4, 9 -> 9, 5 -> 5, 10 -> 10, 6 -> 6, 1 -> 1, 2 -> 2, 7 -> 7, 3 -> 3)
In your case it should work too:
import scala.collection.breakOut
def calculateSomeMetrics(names: Seq[String]): Map[String, Long] = {
  (for (name <- names) yield {
    // Everything is the same as before
    (name, metrics)
  })(breakOut)
}
Comparison with the toMap solutions: toMap first creates an intermediate Seq of Tuple2s (which incidentally might itself be a Map in certain cases) and builds the Map from that, while breakOut skips the intermediate Seq and builds the Map directly.
Usually this is not a huge difference in memory or CPU usage (+ GC pressure), but sometimes these things matter.
Either:
def calculateSomeMetrics(names: Seq[String]): Map[String, Long] = {
  (for (name <- names) yield {
    // Everything is the same as before
    (name, metrics)
  }).toMap
}
Or:
names.map { name =>
  // doStuff
  (name, metrics)
}.toMap
Several links that other people pointed me at, or that I managed to find out later on, assembled in a single answer for my future reference.
breakOut - suggested by Michał in his comment
toMap - in this thread
Great profound explanation on how breakOut works - in this answer
Note, though, that breakOut is going away, as noted by Karl
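For completeness: on Scala 2.13+, where breakOut no longer exists, a similar "no intermediate Seq" effect can be had with a lazy view. A sketch, reusing the hypothetical somehowCalculateMetrics from the question:

def calculateSomeMetrics(names: Seq[String]): Map[String, Long] =
  names.view.map { name =>
    // some auxiliary actions, as before
    name -> somehowCalculateMetrics()
  }.to(Map) // builds the Map directly from the lazy view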
I have an Array of Some values, each holding a Map[String, String], i.e.
Array[Option[Any]] = Array(Some(Map(String, String)))
I want to return it as
Array(Map(String, String))
I've tried a few different ways of extracting it.
Let's say
val x = Array(Some(Map(String, String)))
val x1 = for (i <- 0 until x.length) yield { x.apply(i) }
but this returns IndexedSeq(Some(Map)), which is not what I want.
I tried pattern matching,
x.foreach { i =>
  i match {
    case Some(value) => value
    case _ => println("nothing")
  }
}
Another thing I tried, which was somewhat successful, was
x.apply(0).get.asInstanceOf[Map[String, String]]
which does roughly what I want, but it only gets the 0th index of the array, and I want all the maps in the array.
How can I extract Map type out of Some?
If you want an Array[Any] from your Array[Option[Any]], you can use this for expression:
for {
  opt <- x
  value <- opt
} yield value
This will put the values of all the non-empty Options inside a new array.
It is equivalent to this:
x.flatMap(_.toArray[Any])
Here, all options will be converted to an array of either 0 or 1 element. All these arrays will then be flattened back to one single array containing all the values.
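For example, with a small made-up input:

scala> val x: Array[Option[Any]] = Array(Some(Map("a" -> "b")), None, Some(Map("c" -> "d")))
x: Array[Option[Any]] = Array(Some(Map(a -> b)), None, Some(Map(c -> d)))

scala> for { opt <- x; value <- opt } yield value
res0: Array[Any] = Array(Map(a -> b), Map(c -> d))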
Generally, the pattern is to use transformations on the Option[T], like map, flatMap, filter, etc. The problem here is that we'll need a type cast to retrieve the underlying Map[String, String] from Any. So we'll use flatten to remove the None entries and unwrap the Options, and asInstanceOf to retrieve the type:
scala> val y = Array(Some(Map("1" -> "1")), Some(Map("2" -> "2")), None)
y: Array[Option[scala.collection.immutable.Map[String,String]]] = Array(Some(Map(1 -> 1)), Some(Map(2 -> 2)), None)
scala> y.flatten.map(_.asInstanceOf[Map[String, String]])
res7: Array[Map[String,String]] = Array(Map(1 -> 1), Map(2 -> 2))
Also, when you're dealing with just a single value, you can try Some("test").head, and for null simply Some(null).flatten.
Given the following code:
val m: Map[String, Int] = .. // fetch from somewhere
val keys: List[String] = m.keys.toList
val keysSubset: List[String] = ... // choose random keys
We can define the following method:
def sumValues(m: Map[String, Int], ks: List[String]): Int =
  ks.map(m).sum
And call this as:
sumValues(m, keysSubset)
However, the problem with sumValues is that if ks happens to contain a key not present in the map, the code will still compile but throw an exception at runtime. Ex:
// assume m = Map("two" -> 2, "three" -> 3)
sumValues(m, "one" :: Nil) // compiles, but throws NoSuchElementException at runtime
What I want instead is a definition for sumValues such that the ks argument should, at compile time, be guaranteed to only contain keys that are present on the map. As such, my guess is that the existing sumValues type signature needs to accept some form of implicit evidence that the ks argument is somehow derived from the list of keys of the map.
I'm not limited to a scala Map however, as any record-like structure would do. The map structure however won't have a hardcoded value, but something derived/passed on as an argument.
Note: I'm not really after summing the values, but more after figuring out a type signature for sumValues such that calls to it only compile if the ks argument is provably from the list of keys of the map (or record-like structure).
Another solution could be to map only over the intersection between m's keys and ks.
For example:
scala> def sumValues(m: Map[String, Int], ks: List[String]): Int = {
| m.keys.filter(ks.contains).map(m).sum
| }
sumValues: (m: Map[String,Int], ks: List[String])Int
scala> val map = Map("hello" -> 5)
map: scala.collection.immutable.Map[String,Int] = Map(hello -> 5)
scala> sumValues(map, List("hello", "world"))
res1: Int = 5
I think this solution is better than providing a default value because it is more generic (i.e. you can use it with operations other than sums). However, I guess this solution is less efficient in terms of performance, because of the intersection.
EDIT: As @jwvh pointed out in his comment below, ks.intersect(m.keys.toSeq).map(m).sum is, in my opinion, more readable than m.keys.filter(ks.contains).map(m).sum.
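As an aside (my own sketch, not from this thread): if the goal is simply to skip missing keys, without the compile-time guarantee asked about above, looking each key up with get and letting flatMap drop the misses is a common idiom:

def sumValues(m: Map[String, Int], ks: List[String]): Int =
  ks.flatMap(m.get).sum // keys absent from m are silently dropped

sumValues(Map("hello" -> 5), List("hello", "world")) // Int = 5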
This post is essentially about how to build joint and marginal histograms from a (String, String) RDD. I posted the code that I eventually used below as the answer.
I have an RDD that contains a set of tuples of type (String, String), and since they aren't unique I want to see how many times each (String, String) combination occurs, so I use countByValue like so:
val PairCount = Pairs.countByValue().toSeq
which gives me tuples as output like this: ((String,String),Long), where the Long is the number of times that the (String, String) tuple appeared.
These Strings can be repeated in different combinations and I essentially want to run word count on this PairCount variable so I tried something like this to start:
PairCount.map(x => (x._1._1, x._2))
But the output this spits out is String1->1, String2->1, String3->1, etc.
How do I output a key-value pair from a map job in this case, where the key is one of the String values from the inner tuple and the value is the Long value from the outer tuple?
Update:
@vitalii gets me almost there. The answer gets me to a Seq[(String,Long)], but what I really need is to turn that into a map so that I can run reduceByKey on it afterwards. When I run
PairCount.flatMap{ case ((x, y), n) => Seq(x -> n) }.toMap
for each unique x I get x->1.
For example, the above line of code generates mom->1 dad->1 even if the tuples out of the flatMap included (mom,30) (dad,59) (mom,2) (dad,14), in which case I would expect toMap to produce mom->30, dad->59, mom->2, dad->14. However, I'm new to Scala, so I might be misinterpreting the functionality.
How can I get the Tuple2 sequence converted to a map so that I can reduce on the map keys?
If I understand the question correctly, you need flatMap:
val pairCountRDD = pairs.countByValue() // RDD[((String, String), Int)]
val res : RDD[(String, Int)] = pairCountRDD.flatMap { case ((s1, s2), n) =>
  Seq(s1 -> n, s2 -> n)
}
Update: I didn't quite understand what your final goal is, but here are a few more examples that may help you. By the way, the code above is incorrect: I missed the fact that countByValue returns a Map, not an RDD:
val pairs = sc.parallelize(
  List(
    "mom" -> "dad", "dad" -> "granny", "foo" -> "bar", "foo" -> "baz", "foo" -> "foo"
  )
)
// don't use countByValue; if pairs is large you will run out of memory
val pairCountRDD = pairs.map(x => (x, 1)).reduceByKey(_ + _)
val wordCount = pairs.flatMap { case (a, b) => Seq(a -> 1, b -> 1) }.reduceByKey(_ + _)
wordCount.take(10)
// count in how many pairs each word occur, keys and values:
val wordPairCount = pairs.flatMap { case (a, b) =>
  if (a == b) {
    Seq(a -> 1)
  } else {
    Seq(a -> 1, b -> 1)
  }
}.reduceByKey(_ + _)
wordPairCount.take(10)
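For reference, with the sample pairs above the two counts differ only for "foo" (my own tally; the ordering from take is arbitrary):

// wordCount:     (foo,4), (dad,2), (mom,1), (granny,1), (bar,1), (baz,1)
//                "foo" is counted twice for the ("foo","foo") pair
// wordPairCount: (foo,3), (dad,2), (mom,1), (granny,1), (bar,1), (baz,1)
//                the ("foo","foo") pair contributes only one "foo"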
To get the histograms for the (String, String) RDD I used this code:
val Hist_X  = histogram.map(x => x._1 -> 1.0).reduceByKey(_ + _).collect().toMap
val Hist_Y  = histogram.map(x => x._2 -> 1.0).reduceByKey(_ + _).collect().toMap
val Hist_XY = histogram.map(x => x -> 1.0).reduceByKey(_ + _)
where histogram was the (String,String) RDD
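To make the three histograms concrete, a toy run (the example data is mine):

val histogram = sc.parallelize(Seq(("mom", "dad"), ("mom", "dad"), ("foo", "bar")))
// Hist_X  (marginal over the first element):  Map(mom -> 2.0, foo -> 1.0)
// Hist_Y  (marginal over the second element): Map(dad -> 2.0, bar -> 1.0)
// Hist_XY (the joint counts, still an RDD):   ((mom,dad),2.0), ((foo,bar),1.0)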