Scala Nested Map to Spark RDD - scala

I'm trying to convert a list of maps (Seq[Map[String, Map[String, String]]]) into an RDD table/tuple where each key -> value pair in the map is flat-mapped into a tuple with the outer map's key. For example
Map(
1 -> Map('k' -> 'v', 'k1' -> 'v1')
)
becomes
(1, 'k', 'v')
(1, 'k1', 'v1')
I've tried the following approach, but it seems to fail on concurrency issues. I have two worker nodes, and it duplicates each key -> value pair twice (which I assume is because I'm doing this wrong).
Let's assume I hold my map type in a case class 'Records':
val rdd = sc.parallelize(1 to records.length)
val recordsIt = records.iterator
val res: RDD[(String, String, String)] = rdd.flatMap(f => {
  val currItem = recordsIt.next()
  val x: immutable.Iterable[(String, String, String)] = currItem.mapData.map(v => {
    (currItem.identifier, v._1, v._2)
  })
  x
}).sortBy(r => r)
Is there a way to parallelize this work without running into serious concurrency issues (as I suspect is happening)?
example duplicated output
(201905_001ac172c2751c1d4f4b4cb0affb42ef_gFF0dSg4iw,CID,B13131608623827542)
(201905_001ac172c2751c1d4f4b4cb0affb42ef_gFF0dSg4iw,CID,B13131608623827542)
(201905_001ac172c2751c1d4f4b4cb0affb42ef_gFF0dSg4iw,ROD,19190321)
(201905_001ac172c2751c1d4f4b4cb0affb42ef_gFF0dSg4iw,ROD,19190321)
(201905_001b3ba44f6d1f7505a99e2288108418_mSfAfo31f8,CID,339B4C3C03DDF96AAD)
(201905_001b3ba44f6d1f7505a99e2288108418_mSfAfo31f8,CID,339B4C3C03DDF96AAD)
(201905_001b3ba44f6d1f7505a99e2288108418_mSfAfo31f8,ROD,19860115)
(201905_001b3ba44f6d1f7505a99e2288108418_mSfAfo31f8,ROD,19860115)

Spark parallelize isn't very efficient to begin with (since you already hold the data in memory, it is much less expensive to just iterate over it locally); nonetheless, a more idiomatic approach would be a simple flatMap:
sc.parallelize(records.toSeq)
  .flatMapValues(identity)
  .map { case (k1, (k2, v)) => (k1, k2, v) }
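If the records are instead held as the Seq of the case class described in the question, the same idea applies; a minimal sketch, assuming case class Records(identifier: String, mapData: Map[String, String]):
// Sketch: parallelize the records themselves, then expand each record's map locally.
// No shared iterator is captured by the closure, so nothing is duplicated across workers.
val res: RDD[(String, String, String)] =
  sc.parallelize(records)
    .flatMap(r => r.mapData.map { case (k, v) => (r.identifier, k, v) })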

Related

Scala Map from Text File

Looking to create a Scala map from a text file. A sample of the text file (a few lines of it) can be seen below:
Alabama (9),Democratic:849624,Republican:1441170,Libertarian:25176,Others:7312
Alaska (3),Democratic:153778,Republican:189951,Libertarian:8897,Others:6904
Arizona (11),Democratic:1672143,Republican:1661686,Libertarian:51465,Green:1557,Others:475
I have been given the map buffer as follows:
var mapBuffer: Map[String, List[(String, Int)]] = Map()
Note the party values are separated by a colon.
I am trying to read the file contents and store the data in a map structure where each line of the file is used to construct a map entry with the state as the key and a list of tuples as the value. The type of the structure should be Map[String, List[(String, Int)]].
Essentially I am just trying to create a map entry from each line of the file, but I can't quite get it right. I tried the below with no luck - I think 'val lines' should be an array rather than an iterator.
val stream : InputStream = getClass.getResourceAsStream("")
val lines: Iterator[String] = scala.io.Source.fromInputStream(stream).getLines
var map: Map[String, List[(String, Int)]] = lines
  .map(_.split(","))
  .map(line => (line(0).toString, line(1).toList))
  .toMap
This appears to do the job. (Scala 2.13.x)
val stateVotes =
  util.Using(io.Source.fromFile("votes.txt")) {
    val PartyVotes = "([^:]+):(\\d+)".r

    _.getLines()
     .map(_.split(",").toList)
     .toList
     .groupMapReduce(_.head)(_.tail.collect {
       case PartyVotes(p, v) => (p, v.toInt)
     })(_ ++ _)
  } //file is auto-closed
//stateVotes: Try[Map[String,List[(String, Int)]]] = Success(
// Map(Alabama (9) -> List((Democratic,849624), (Republican,1441170), (Libertarian,25176), (Others,7312))
// , Arizona (11) -> List((Democratic,1672143), (Republican,1661686), (Libertarian,51465), (Green,1557), (Others,475))
// , Alaska (3) -> List((Democratic,153778), (Republican,189951), (Libertarian,8897), (Others,6904))))
In this case the number following the state name is preserved. That can be changed.
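If you'd rather key by the bare state name, one possible tweak (a sketch, not part of the original answer) is to strip the trailing "(N)" before grouping:
// Sketch: derive a plain state-name key by dropping the trailing seat count.
def stateName(raw: String): String =
  raw.replaceAll("""\s*\(\d+\)\s*$""", "").trim

stateName("Alabama (9)")  // "Alabama"
and then group with .groupMapReduce(l => stateName(l.head))(...) instead of _.head.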
No, iterator is fine (better than list actually),
you just need to split the values too to create those tuples.
lines
  .map(_.split(","))
  .map { l =>
    l.head -> l.tail.toList.map(_.split(":"))
      .collect { case Array(a, b) => a -> b.toInt }
  }
  .toMap
An alternative that looks a little more aesthetic to my eye is converting to a map early and then using mapValues (I personally much prefer short lambdas). The downside is that mapValues is lazy, so you end up having to call .toMap twice to force it in the end:
lines
  .map(_.split(","))
  .map { l => l.head -> l.tail.toList }
  .toMap
  .mapValues(_.map(_.split(":")))
  .mapValues(_.collect { case Array(a, b) => a -> b.toInt })
  .toMap
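On Scala 2.13, mapValues on a strict Map is deprecated in favour of going through a view; a roughly equivalent sketch would be:
lines
  .map(_.split(","))
  .map { l => l.head -> l.tail.toList }
  .toMap
  .view
  .mapValues(_.map(_.split(":")).collect { case Array(a, b) => a -> b.toInt })
  .toMap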

reduce a list in scala by value

How can I reduce a list like below concisely
Seq[Temp] = List(Temp(a,1), Temp(a,2), Temp(b,1))
to
List(Temp(a,2), Temp(b,1))
Only keep Temp objects with a unique first param and the max of the second param.
My solution uses a lot of groupBys and reduces, which gives a lengthy answer.
You have to:
groupBy
sortBy values in ASC order
get the last one, which is the largest
Example,
scala> final case class Temp (a: String, value: Int)
defined class Temp
scala> val data : Seq[Temp] = List(Temp("a",1), Temp("a",2), Temp("b",1))
data: Seq[Temp] = List(Temp(a,1), Temp(a,2), Temp(b,1))
scala> data.groupBy(_.a).map { case (k, group) => group.sortBy(_.value).last }
res0: scala.collection.immutable.Iterable[Temp] = List(Temp(b,1), Temp(a,2))
or instead of sortBy(fn).last you can use maxBy(fn)
scala> data.groupBy(_.a).map { case (k, group) => group.maxBy(_.value) }
res1: scala.collection.immutable.Iterable[Temp] = List(Temp(b,1), Temp(a,2))
You can generate a Map with groupBy, compute the max in mapValues and convert it back to the Temp classes as in the following example:
case class Temp(id: String, value: Int)
List(Temp("a", 1), Temp("a", 2), Temp("b", 1)).
groupBy(_.id).mapValues( _.map(_.value).max ).
map{ case (k, v) => Temp(k, v) }
// res1: scala.collection.immutable.Iterable[Temp] = List(Temp(b,1), Temp(a,2))
Worth noting that the solution using maxBy in the other answer is more efficient as it minimizes necessary transformations.
You can do this using foldLeft:
import scala.math.max

data.foldLeft(Map[String, Int]().withDefaultValue(0)) { (map, tmp) =>
  map.updated(tmp.id, max(map(tmp.id), tmp.value))
}.map { case (i, v) => Temp(i, v) }
This is essentially combining the logic of groupBy with the max operation in a single pass.
Note: this may be less efficient than groupBy, because groupBy uses a mutable.Map internally, which avoids constantly re-creating a new map. If you care about performance and are prepared to use mutable data, this is another option:
import scala.collection.mutable

val tmpMap = mutable.Map[String, Int]().withDefaultValue(0)
data.foreach(tmp => tmpMap(tmp.id) = max(tmp.value, tmpMap(tmp.id)))
tmpMap.map { case (i, v) => Temp(i, v) }.toList
Use a ListMap if you need to retain the data order, or sort at the end if you need a particular ordering.
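For instance, a minimal sketch of sorting the reduced result at the end (using the Temp(a, value) data from above):
// Sketch: reduce first, then impose an ordering on the result.
data.groupBy(_.a).map { case (_, group) => group.maxBy(_.value) }.toList.sortBy(_.a)
// res: List(Temp(a,2), Temp(b,1))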

Correct Approach to Recursively Summing Map in Scala

I have just started a project in work where we are migrating some C# tooling across to a new Scala project. This is my first exposure to the language (and functional programming in general) so in the interest of not just writing Java style code in Scala, I am wondering what the correct approach to handling the following scenario is.
We have two map objects which represent tabular data with the following structure:
map1 key|date|mapping val
map2 key|number
The mapping value in the first object is not always populated. Currently these are represented by Map[String, Array[String]] and Map[String, Double] types.
In the C# tool we have the following approach:
Loop through key set in first map
For every key, check to see if the mapping val is blank
If no mapping then fetch the number from map 2 and return
If mapping exists then recursively call method to get full range of mapping values and their numbers, summing as you go. E.g. key 1 might have a mapping to key 4, key 4 might have a mapping to key 5 etc and we want to sum all of the values for these keys in map2.
Is there a clever way to do this in Scala which would avoid updating a list from within a for loop and recursively walking the map?
Is this what you are after?
@annotation.tailrec
def recurse(key: String, count: Double, map1: Map[String, String], map2: Map[String, Double]): Double = {
  map1.get(key) match {
    case Some(mappingVal) if mappingVal == "" =>
      count + map2.getOrElse(mappingVal, 0.0)
    case Some(mappingVal) =>
      recurse(mappingVal, count + map2.getOrElse(mappingVal, 0.0), map1, map2)
    case None => count
  }
}
example use:
val m1: Map[String, String] = Map("1" -> "4", "4" -> "5", "5" -> "6", "8" -> "")
val m2: Map[String, Double] = Map("1" -> 1.0, "4" -> 4.0, "6" -> 10.0)
m1.map {
  case (k, _) => k -> recurse(k, 0.0, m1, m2)
}.foreach(println)
Output:
(1,14.0)
(4,10.0)
(5,10.0)
(8,0.0)
Note that there is no cycle detection - this will never terminate if map1 has a cycle.
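If cycles are possible, one way to guard against them (a sketch, not part of the original answer) is to carry the set of keys already visited and stop when a mapping repeats:
// Sketch: same recursion, but stop when a mapping key has been seen before.
@annotation.tailrec
def recurseSafe(key: String, count: Double, seen: Set[String],
                map1: Map[String, String], map2: Map[String, Double]): Double =
  map1.get(key) match {
    case Some(next) if next.nonEmpty && !seen(next) =>
      recurseSafe(next, count + map2.getOrElse(next, 0.0), seen + next, map1, map2)
    case _ => count
  }
// usage: m1.map { case (k, _) => k -> recurseSafe(k, 0.0, Set(k), m1, m2) }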

access tuple inside a tuple for anonymous map job in Spark

This post is essentially about how to build joint and marginal histograms from a (String, String) RDD. I posted the code that I eventually used below as the answer.
I have an RDD that contains a set of tuples of type (String, String), and since they aren't unique I want to see how many times each (String, String) combination occurs, so I use countByValue like so:
val PairCount = Pairs.countByValue().toSeq
which gives me tuples as output like this ((String,String),Long), where the Long is the number of times that the (String, String) tuple appeared.
These Strings can be repeated in different combinations, and I essentially want to run word count on this PairCount variable, so I tried something like this to start:
PairCount.map(x => (x._1._1, x._2))
But the output this spits out is String1->1, String2->1, String3->1, etc.
How do I output a key value pair from a map job in this case where the key is going to be one of the String values from the inner tuple, and the value is going to be the Long value from the outer tuple?
Update:
@vitalii gets me almost there. The answer gets me to a Seq[(String,Long)], but what I really need is to turn that into a map so that I can run reduceByKey on it afterwards. When I run
PairCount.flatMap{ case ((x, y), n) => Seq(x -> n) }.toMap
for each unique x I get x->1.
For example, the above line of code generates mom->1, dad->1 even if the tuples out of the flatMap included (mom,30) (dad,59) (mom,2) (dad,14), in which case I would expect toMap to provide mom->30, dad->59, mom->2, dad->14. However, I'm new to Scala, so I might be misinterpreting the functionality.
How can I get the Tuple2 sequence converted to a map so that I can reduce on the map keys?
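Side note: a Scala Map holds exactly one value per key, so toMap keeps only the last pair it sees for each key; to combine the counts they have to be reduced first. A minimal local sketch, assuming a Seq shaped like the one above (Scala 2.13):
val pairCount = Seq((("mom", "dad"), 30L), (("dad", "granny"), 59L), (("mom", "foo"), 2L))
val perWord: Map[String, Long] =
  pairCount
    .flatMap { case ((a, b), n) => Seq(a -> n, b -> n) }
    .groupMapReduce(_._1)(_._2)(_ + _)   // sum the counts per word before building the map
// perWord: Map(mom -> 32, dad -> 89, granny -> 59, foo -> 2)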
If I correctly understand the question, you need flatMap:
val pairCountRDD = pairs.countByValue() // RDD[((String, String), Int)]
val res: RDD[(String, Int)] = pairCountRDD.flatMap { case ((s1, s2), n) =>
  Seq(s1 -> n, s2 -> n)
}
Update: I didn't quite understand what your final goal is, but here are a few more examples that may help you. By the way, the code above is incorrect - I missed the fact that countByValue returns a Map, not an RDD:
val pairs = sc.parallelize(
  List(
    "mom" -> "dad", "dad" -> "granny", "foo" -> "bar", "foo" -> "baz", "foo" -> "foo"
  )
)
// don't use countByValue; if pairs is large you will run out of memory
val pairCountRDD = pairs.map(x => (x, 1)).reduceByKey(_ + _)
val wordCount = pairs.flatMap { case (a,b) => Seq(a -> 1, b ->1)}.reduceByKey(_ + _)
wordCount.take(10)
// count in how many pairs each word occurs, as key or value:
val wordPairCount = pairs.flatMap { case (a, b) =>
  if (a == b) {
    Seq(a -> 1)
  } else {
    Seq(a -> 1, b -> 1)
  }
}.reduceByKey(_ + _)
wordPairCount.take(10)
To get the histograms for the (String, String) RDD I used this code:
val Hist_X = histogram.map(x => (x._1-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_Y = histogram.map(x => (x._2-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_XY = histogram.map(x => (x-> 1.0)).reduceByKey(_+_)
where histogram was the (String,String) RDD

Spark Accumulators: Is the right accumulator sometimes many or always one?

I am trying to use a Spark accumulator to remove a group by query which has poor performance.
import org.apache.spark._
object CountPairsParam extends AccumulatorParam[Map[Int, Set[Int]]] {
  def zero(initialValue: Map[Int, Set[Int]]): Map[Int, Set[Int]] = {
    Map.empty[Int, Set[Int]]
  }
  def addInPlace(m1: Map[Int, Set[Int]], m2: Map[Int, Set[Int]]): Map[Int, Set[Int]] = {
    val keys = m1.keys ++ m2.keys
    keys.map((k: Int) => k -> (m1.getOrElse(k, Set.empty[Int]) ++ m2.getOrElse(k, Set.empty[Int]))).toMap
  }
}
val accum = sc.accumulator(Map.empty[Int, Set[Int]])(CountPairsParam)
srch_destination_id_distinct.foreach { r =>
  try { accum += Map(r(0).toString.toInt -> Set(r(1).toString.toInt)) }
  catch { case ioe: NumberFormatException => Map.empty[Int, Set[Int]] }
}
In my accumulator I am assuming that m2 isn't always going to be the single-item map created in my foreach loop, and that sometimes Spark will use this method to add two different maps that have more than one key. But because of this my performance is low. Does the right-hand map always come into the accumulator with one item set from my foreach loop, or do I need to make this performance trade-off?
You should generally avoid using Accumulators for anything but debugging because there's no guarantee, as far as I know, that each entry of the RDD will only be "added" into the Accumulator exactly once.
Maybe try something like this:
import scala.collection.mutable.HashSet
import scala.util.Try
val result = srch_destination_id_distinct.flatMap(r =>
  Try((r(0).toString.toInt, r(1).toString.toInt)).toOption
).aggregateByKey(HashSet.empty[Int])(
  (set, n) => set += n,
  (set1, set2) => set1 union set2
).mapValues(_.toSet).collectAsMap
The distinction between the seqOp and combOp arguments of the aggregateByKey method also allows us to avoid "wrapping" each element of the RDD in a Map[Int, Set[Int]] the way your approach did.
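For reference, a tiny illustration of how seqOp and combOp divide the work (a sketch, assuming a SparkContext named sc and the HashSet import above):
val toy = sc.parallelize(Seq(1 -> 2, 1 -> 3, 2 -> 2), numSlices = 2)
val sets = toy.aggregateByKey(HashSet.empty[Int])(
  (set, n)     => set += n,        // seqOp: folds each value into the partition-local set
  (set1, set2) => set1 union set2  // combOp: merges the per-partition sets for a key
).collectAsMap()
// sets: Map(1 -> HashSet(2, 3), 2 -> HashSet(2))  (key/element ordering is unspecified)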