Spark Accumulators: Is the right accumulator sometimes many or always one? - scala

I am trying to use a Spark accumulator to remove a group by query which has poor performance.
import org.apache.spark._
object CountPairsParam extends AccumulatorParam[Map[Int, Set[Int]]] {
def zero(initialValue: Map[Int, Set[Int]]): Map[Int, Set[Int]] = {
Map.empty[Int, Set[Int]]
}
def addInPlace(m1: Map[Int, Set[Int]], m2: Map[Int, Set[Int]]): Map[Int, Set[Int]] = {
val keys = m1.keys ++ m2.keys
keys.map((k: Int) => (k -> (m1.getOrElse(k, Set.empty[Int]) ++ m2.getOrElse(k, Set.empty[Int])))).toMap
}
}
val accum = sc.accumulator(Map.empty[Int, Set[Int]])(CountPairsParam)
srch_destination_id_distinct.foreach(r => try{accum += Map(r(0).toString.toInt -> Set(r(1).toString.toInt))} catch {case ioe: NumberFormatException => Map.empty[Int, Set[Int]]})
In my accumulator I am assuming that m2 isn't always going to be the single-item map created in my foreach loop, and that Spark will sometimes use this method to merge two different maps that each have more than one key. Handling that general case is what makes my performance low. Does the right-hand Map always come into the accumulator with a single item, set from my foreach loop, or do I need to make this performance trade-off?

You should generally avoid using Accumulators for anything but debugging: Spark only guarantees that each task's update is applied exactly once when the update happens inside an action (such as foreach); inside transformations, an update can be applied more than once if tasks or stages are re-executed.
Maybe try something like this:
import scala.collection.mutable.HashSet
import scala.util.Try
val result = srch_destination_id_distinct.flatMap(r =>
Try((r(0).toString.toInt, r(1).toString.toInt)).toOption
).aggregateByKey(HashSet.empty[Int])(
(set, n) => set += n,
(set1, set2) => set1 union set2
).mapValues(_.toSet).collectAsMap
The distinction between the seqOp and combOp arguments of aggregateByKey also allows us to avoid "wrapping" each element of the RDD in a Map[Int, Set[Int]] the way your approach does.
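To make the seqOp/combOp split concrete, here is a minimal local model of it using plain Scala collections. It is only a sketch and not Spark's implementation: the helper name aggregateByKeyLocal, the zero-value factory, and the hand-made "partitions" are all assumptions made for illustration. The point it shows is that seqOp always receives one raw value at a time, while combOp merges aggregates that may already contain many values.

```scala
// Hypothetical helper (not a Spark API): models aggregateByKey over data
// that has already been split into "partitions".
def aggregateByKeyLocal[K, V, U](partitions: Seq[Seq[(K, V)]])(zero: () => U)(
    seqOp: (U, V) => U,
    combOp: (U, U) => U): Map[K, U] = {
  // Step 1: inside each partition, fold single raw values into a per-key aggregate.
  val perPartition: Seq[Map[K, U]] = partitions.map { part =>
    part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
      acc.updated(k, seqOp(acc.getOrElse(k, zero()), v))
    }
  }
  // Step 2: merge the per-partition aggregates key by key.
  perPartition.foldLeft(Map.empty[K, U]) { (merged, m) =>
    m.foldLeft(merged) { case (acc, (k, u)) =>
      acc.updated(k, acc.get(k).map(combOp(_, u)).getOrElse(u))
    }
  }
}

// Two made-up partitions of (destination id, value) pairs.
val partitions = Seq(
  Seq(1 -> 10, 1 -> 11, 2 -> 20),
  Seq(1 -> 10, 2 -> 21)
)

val result = aggregateByKeyLocal[Int, Int, Set[Int]](partitions)(() => Set.empty[Int])(
  (set, n) => set + n,     // seqOp: adds one element to a set
  (s1, s2) => s1 union s2  // combOp: merges two sets
)
```

With an accumulator there is only one merge function, so it has to handle the general map-with-many-keys case; with two functions the cheap single-element path and the general merge path stay separate.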

Related

Scala Nested Map to Spark RDD

I'm trying to convert a list of maps (Seq[Map[String, Map[String, String]]]) into an RDD of tuples, where each key -> value pair in the inner map is flat-mapped into a tuple together with the outer map's key. For example
Map(
1 -> Map('k' -> 'v', 'k1' -> 'v1')
)
becomes
(1, 'k', 'v')
(1, 'k1', 'v1')
I've tried the following approach, but it seems to fail on concurrency issues. I have two worker nodes, and it emits each key -> value pair twice (which I assume is because I'm doing this wrong).
Let's assume I hold my map type in a case class 'Records'
val rdd = sc.parallelize(1 to records.length)
val recordsIt = records.iterator
val res: RDD[(String, String, String)] = rdd.flatMap(f => {
val currItem = recordsIt.next()
val x: immutable.Iterable[(String, String, String)] = currItem.mapData.map(v => {
(currItem.identifier, v._1, v._2)
})
x
}).sortBy(r => r)
Is there a way to parallelize this work without running into serious concurrency issues (as I suspect is happening)?
Example of the duplicated output:
(201905_001ac172c2751c1d4f4b4cb0affb42ef_gFF0dSg4iw,CID,B13131608623827542)
(201905_001ac172c2751c1d4f4b4cb0affb42ef_gFF0dSg4iw,CID,B13131608623827542)
(201905_001ac172c2751c1d4f4b4cb0affb42ef_gFF0dSg4iw,ROD,19190321)
(201905_001ac172c2751c1d4f4b4cb0affb42ef_gFF0dSg4iw,ROD,19190321)
(201905_001b3ba44f6d1f7505a99e2288108418_mSfAfo31f8,CID,339B4C3C03DDF96AAD)
(201905_001b3ba44f6d1f7505a99e2288108418_mSfAfo31f8,CID,339B4C3C03DDF96AAD)
(201905_001b3ba44f6d1f7505a99e2288108418_mSfAfo31f8,ROD,19860115)
(201905_001b3ba44f6d1f7505a99e2288108418_mSfAfo31f8,ROD,19860115)
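Using Spark parallelize this way is very inefficient from the beginning (since the data already sits in memory, it is much less expensive to just iterate over it locally). The duplication is most likely because recordsIt is captured in the closure and shipped to each task, so every task advances its own copy of the iterator from the start. Nonetheless, a more idiomatic approach would be a simple flatMap: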
sc.parallelize(records.toSeq)
.flatMapValues(identity)
.map { case (k1, (k2, v)) => (k1, k2, v) }
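For comparison, the same flattening can be sketched on local collections, with no shared iterator at all. The sample records below are made up for illustration and follow the Seq[Map[String, Map[String, String]]] shape described in the question; each element is handled independently, which is also what makes the flatMap form safe to distribute.

```scala
// Made-up sample data in the shape from the question:
// outer key -> (inner key -> value).
val records: Seq[Map[String, Map[String, String]]] = Seq(
  Map("1" -> Map("k" -> "v", "k1" -> "v1")),
  Map("2" -> Map("k" -> "w"))
)

// Each record is flattened on its own, so no state is shared between elements.
val triples: Seq[(String, String, String)] =
  for {
    record            <- records
    (outerKey, inner) <- record.toSeq
    (k, v)            <- inner.toSeq
  } yield (outerKey, k, v)
```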

groupBy on List as LinkedHashMap instead of Map

I am processing XML using scala, and I am converting the XML into my own data structures. Currently, I am using plain Map instances to hold (sub-)elements, however, the order of elements from the XML gets lost this way, and I cannot reproduce the original XML.
Therefore, I want to use LinkedHashMap instances instead of Map, however I am using groupBy on the list of nodes, which creates a Map:
For example:
def parse(n:Node): Unit =
{
val leaves:Map[String, Seq[XmlItem]] =
n.child
.filter(node => { ... })
.groupBy(_.label)
.map((tuple:Tuple2[String, Seq[Node]]) =>
{
val items = tuple._2.map(node =>
{
val attributes = ...
if (node.text.nonEmpty)
XmlItem(Some(node.text), attributes)
else
XmlItem(None, attributes)
})
(tuple._1, items)
})
...
}
In this example, I want leaves to be of type LinkedHashMap to retain the order of n.child. How can I achieve this?
Note: I am grouping by label/tagname because elements can occur multiple times, and for each label/tagname, I keep a list of elements in my data structures.
Solution
As answered by @jwvh, I am using foldLeft as a substitute for groupBy. Also, I decided to go with LinkedHashMap instead of ListMap.
def parse(n:Node): Unit =
{
val leaves:mutable.LinkedHashMap[String, Seq[XmlItem]] =
n.child
.filter(node => { ... })
.foldLeft(mutable.LinkedHashMap.empty[String, Seq[Node]])((m, sn) =>
{
m.update(sn.label, m.getOrElse(sn.label, Seq.empty[Node]) ++ Seq(sn))
m
})
.map((tuple:Tuple2[String, Seq[Node]]) =>
{
val items = tuple._2.map(node =>
{
val attributes = ...
if (node.text.nonEmpty)
XmlItem(Some(node.text), attributes)
else
XmlItem(None, attributes)
})
(tuple._1, items)
})
}
To get the rough equivalent to .groupBy() in a ListMap you could fold over your collection. The problem is that ListMap preserves the order of elements as they were appended, not as they were encountered.
import collection.immutable.ListMap
List('a','b','a','c').foldLeft(ListMap.empty[Char,Seq[Char]]){
case (lm,c) => lm.updated(c, c +: lm.getOrElse(c, Seq()))
}
//res0: ListMap[Char,Seq[Char]] = ListMap(b -> Seq(b), a -> Seq(a, a), c -> Seq(c))
To fix this you can foldRight instead of foldLeft. The result is the original order of elements as encountered (scanning left to right) but in reverse.
List('a','b','a','c').foldRight(ListMap.empty[Char,Seq[Char]]){
case (c,lm) => lm.updated(c, c +: lm.getOrElse(c, Seq()))
}
//res1: ListMap[Char,Seq[Char]] = ListMap(c -> Seq(c), b -> Seq(b), a -> Seq(a, a))
This isn't necessarily a bad thing since a ListMap is more efficient with last and init ops, O(1), than it is with head and tail ops, O(n).
To process the ListMap in the original left-to-right order you could .toList and .reverse it.
List('a','b','a','c').foldRight(ListMap.empty[Char,Seq[Char]]){
case (c,lm) => lm.updated(c, c +: lm.getOrElse(c, Seq()))
}.toList.reverse
//res2: List[(Char, Seq[Char])] = List((a,Seq(a, a)), (b,Seq(b)), (c,Seq(c)))
A purely immutable solution would be quite slow, so I'd go with:
import collection.mutable.{ArrayBuffer, LinkedHashMap}
implicit class ExtraTraversableOps[A](seq: collection.TraversableOnce[A]) {
def orderedGroupBy[B](f: A => B): collection.Map[B, collection.Seq[A]] = {
val map = LinkedHashMap.empty[B, ArrayBuffer[A]]
for (x <- seq) {
val key = f(x)
map.getOrElseUpdate(key, ArrayBuffer.empty) += x
}
map
}
}
To use, just change .groupBy in your code to .orderedGroupBy.
The returned Map can't be mutated using this type (though it can be cast to mutable.Map or to mutable.LinkedHashMap), so it's safe enough for most purposes (and you could create a ListMap from it at the end if really desired).
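A self-contained variant of the same idea, written as a plain function rather than an implicit class, might look like the sketch below. The function signature, the Vector value type, and the sample words are choices made for this illustration, not part of the answer above. It relies on the fact that a LinkedHashMap iterates in insertion order and that updating an existing key does not move it.

```scala
import scala.collection.mutable

// Groups elements by key while keeping keys in first-seen order.
def orderedGroupBy[A, K](xs: Iterable[A])(key: A => K): mutable.LinkedHashMap[K, Vector[A]] = {
  val acc = mutable.LinkedHashMap.empty[K, Vector[A]]
  for (x <- xs) {
    val k = key(x)
    // Appending keeps the elements of each group in encounter order too.
    acc.update(k, acc.getOrElse(k, Vector.empty[A]) :+ x)
  }
  acc
}

// Made-up sample: group words by their first letter.
val grouped = orderedGroupBy(List("banana", "apple", "blueberry", "cherry", "avocado"))(_.head)
```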

reduce a list in scala by value

How can I reduce a list like below concisely
Seq[Temp] = List(Temp(a,1), Temp(a,2), Temp(b,1))
to
List(Temp(a,2), Temp(b,1))
Only keep Temp objects with unique first param and max of second param.
My solution is with lot of groupBys and reduces which is giving a lengthy answer.
You have to groupBy the key, sortBy the values in ascending order, and take the last element, which is the largest.
Example:
scala> final case class Temp (a: String, value: Int)
defined class Temp
scala> val data : Seq[Temp] = List(Temp("a",1), Temp("a",2), Temp("b",1))
data: Seq[Temp] = List(Temp(a,1), Temp(a,2), Temp(b,1))
scala> data.groupBy(_.a).map { case (k, group) => group.sortBy(_.value).last }
res0: scala.collection.immutable.Iterable[Temp] = List(Temp(b,1), Temp(a,2))
Or, instead of sortBy(fn).last, you can use maxBy(fn):
scala> data.groupBy(_.a).map { case (k, group) => group.maxBy(_.value) }
res1: scala.collection.immutable.Iterable[Temp] = List(Temp(b,1), Temp(a,2))
You can generate a Map with groupBy, compute the max in mapValues and convert it back to the Temp classes as in the following example:
case class Temp(id: String, value: Int)
List(Temp("a", 1), Temp("a", 2), Temp("b", 1)).
groupBy(_.id).mapValues( _.map(_.value).max ).
map{ case (k, v) => Temp(k, v) }
// res1: scala.collection.immutable.Iterable[Temp] = List(Temp(b,1), Temp(a,2))
Worth noting that the solution using maxBy in the other answer is more efficient as it minimizes necessary transformations.
You can do this using foldLeft:
import scala.math.max
// Note: the default of 0 assumes that all values are non-negative.
data.foldLeft(Map[String, Int]().withDefaultValue(0))((map, tmp) => {
map.updated(tmp.id, max(map(tmp.id), tmp.value))
}).map{case (i,v) => Temp(i, v)}
This is essentially combining the logic of groupBy with the max operation in a single pass.
Note: this may be less efficient because groupBy uses a mutable.Map internally, which avoids constantly re-creating a new map. If you care about performance and are prepared to use mutable data, this is another option:
import scala.collection.mutable
import scala.math.max
val tmpMap = mutable.Map[String, Int]().withDefaultValue(0)
data.foreach(tmp => tmpMap(tmp.id) = max(tmp.value, tmpMap(tmp.id)))
tmpMap.map{case (i,v) => Temp(i, v)}.toList
Use a ListMap if you need to retain the data order, or sort at the end if you need a particular ordering.
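As one possible way to retain the data order, here is a sketch that uses a mutable LinkedHashMap in place of the plain Map (a substitution made here for illustration, as is the helper name maxPerKey). It also avoids the default value of 0, so it does not assume non-negative values.

```scala
import scala.collection.mutable

final case class Temp(id: String, value: Int)

// Keeps the maximum value per id, with ids in first-seen order.
def maxPerKey(data: Seq[Temp]): List[Temp] = {
  val best = mutable.LinkedHashMap.empty[String, Int]
  for (t <- data) {
    best.get(t.id) match {
      case Some(current) if current >= t.value => () // existing value is already the max
      case _                                   => best.update(t.id, t.value)
    }
  }
  best.iterator.map { case (id, v) => Temp(id, v) }.toList
}

val reduced   = maxPerKey(List(Temp("a", 1), Temp("a", 2), Temp("b", 1)))
val negatives = maxPerKey(List(Temp("x", -5), Temp("x", -3)))
```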

Scala: Apply same function to 2 lists in one call

Let's say I have
val list: List[(Int, String)] = List((1,"test"),(2,"test2"),(3,"sample"))
I need to partition this list in two, based on (Int, String) value. So far, so good.
For example it can be
def isValid(elem: (Int, String)) = elem._1 < 3 && elem._2.startsWith("test")
val (good, bad) = list.partition(isValid)
So now I have two lists of type List[(Int, String)], but I only need the Int part (some id). Of course I can write a function
def ids(list: List[(Int, String)]) = list.map(_._1)
and call it on both lists
val (ok, wrong) = (ids(good), ids(bad))
It works, but it looks a bit boilerplate-heavy. I would prefer something like
val (good, bad) = list.partition(isValid).map(ids)
But that is obviously not possible. So is there a nicer way to do what I need?
I understand that it's not so bad, but I feel there must be some functional pattern or general solution for such cases, and I'd like to know it :) Thanks!
P.S. Thanks to all! In the end it turned into
private def handleGames(games:List[String], lastId:Int) = {
val (ok, wrong) = games.foldLeft(
(List.empty[Int], List.empty[Int])){
(a, b) => b match {
case gameRegex(d,w,e) => {
if(filterGame((d, w, e), lastId)) (d.toInt :: a._1, a._2)
else (a._1, d.toInt :: a._2 )
}
case _ => log.debug(s"not handled game template is: $b"); a
}
}
log.debug(s"not handled game ids are: ${wrong.mkString(",")}")
ok
}
You're looking for a foldLeft on the List:
myList.foldLeft((List.empty[Int], List.empty[Int])){
case ((good, bad), (id, value)) if predicate(id, value) => (id :: good, bad)
case ((good, bad), (id, _)) => (good, id :: bad)
}
This way each step both transforms and accumulates. The result type is (List[Int], List[Int]), assuming predicate is the function that chooses between the "good" and "bad" states. The explicit element type on List.empty[Int] is needed because Scala would otherwise infer the most restrictive type, List[Nothing], for the initial value of the fold. Note that prepending with :: produces each list in reverse input order.
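Applied to the list and isValid from the question, a runnable sketch could look like this. It uses foldRight instead of foldLeft (a substitution made here) so that prepending with :: keeps the ids in their original order; for very long lists the foldLeft form followed by a reverse may be preferable.

```scala
val list: List[(Int, String)] = List((1, "test"), (2, "test2"), (3, "sample"))

def isValid(elem: (Int, String)): Boolean =
  elem._1 < 3 && elem._2.startsWith("test")

// One pass: decide good/bad and keep only the id at the same time.
val (good, bad) = list.foldRight((List.empty[Int], List.empty[Int])) {
  case (elem @ (id, _), (ok, wrong)) =>
    if (isValid(elem)) (id :: ok, wrong) else (ok, id :: wrong)
}
```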
An additional approach uses Cats, with Tuple2K and Foldable's foldMap. Note that this requires the kind-projector compiler plugin:
import cats.implicits._
import cats.Foldable
import cats.data.Tuple2K
val listTuple = Tuple2K(list, otherList)
val (good, bad) = Foldable[Tuple2K[List, List, ?]].foldMap(listTuple)(f =>
if (isValid(f)) (List(f), List.empty) else (List.empty, List(f)))

Access tuple inside a tuple for anonymous map job in Spark

This post is essentially about how to build joint and marginal histograms from a (String, String) RDD. I posted the code that I eventually used below as the answer.
I have an RDD that contains a set of tuples of type (String, String), and since they aren't unique I want to see how many times each (String, String) combination occurs, so I use countByValue like so:
val PairCount = Pairs.countByValue().toSeq
which gives me output of the form ((String, String), Long), where the Long is the number of times that the (String, String) tuple appeared.
These Strings can be repeated in different combinations and I essentially want to run word count on this PairCount variable so I tried something like this to start:
PairCount.map(x => (x._1._1, x._2))
But the output that this spits out is String1->1, String2->1, String3->1, etc.
How do I output a key-value pair from a map job in this case, where the key is one of the String values from the inner tuple and the value is the Long value from the outer tuple?
Update:
@vitalii gets me almost there. The answer gets me to a Seq[(String, Long)], but what I really need is to turn that into a map so that I can run reduceByKey on it afterwards. When I run
PairCount.flatMap { case ((x, y), n) => Seq(x -> n) }.toMap
for each unique x I get x->1.
For example, the above line of code generates mom->1 dad->1 even if the tuples out of the flatMap included (mom,30) (dad,59) (mom,2) (dad,14), in which case I would expect toMap to provide mom->30, dad->59, mom->2, dad->14. However, I'm new to Scala, so I might be misinterpreting the functionality.
How can I get the Tuple2 sequence converted to a map so that I can reduce on the map keys?
If I understand the question correctly, you need flatMap:
val pairCountRDD = pairs.countByValue() // RDD[((String, String), Int)]
val res : RDD[(String, Int)] = pairCountRDD.flatMap { case ((s1, s2), n) =>
Seq(s1 -> n, s2 -> n)
}
Update: I didn't quite understand what your final goal is, but here are a few more examples that may help. By the way, the code above is incorrect: I missed the fact that countByValue returns a Map, not an RDD:
val pairs = sc.parallelize(
List(
"mom"-> "dad", "dad" -> "granny", "foo" -> "bar", "foo" -> "baz", "foo" -> "foo"
)
)
// don't use countByValue: if pairs is large you will run out of memory
val pairCountRDD = pairs.map(x => (x, 1)).reduceByKey(_ + _)
val wordCount = pairs.flatMap { case (a,b) => Seq(a -> 1, b ->1)}.reduceByKey(_ + _)
wordCount.take(10)
// count in how many pairs each word occurs, as key or as value:
val wordPairCount = pairs.flatMap { case (a,b) =>
if (a == b) {
Seq(a->1)
} else {
Seq(a -> 1, b ->1)
}
}.reduceByKey(_ + _)
wordPairCount.take(10)
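On the toMap part of the question: a Map holds one value per key, so toMap on a sequence with repeated keys keeps only the last value seen for each key. The sketch below shows this on a local Seq using the numbers from the question, together with a local per-key sum, which is roughly what reduceByKey(_ + _) computes on an RDD. The variable names are made up for illustration.

```scala
// The (name, count) pairs from the question's example.
val counts: Seq[(String, Long)] =
  Seq("mom" -> 30L, "dad" -> 59L, "mom" -> 2L, "dad" -> 14L)

// toMap cannot hold duplicate keys: later entries overwrite earlier ones.
val lastWins: Map[String, Long] = counts.toMap

// Summing per key locally: group by the name, then add up the counts.
val summed: Map[String, Long] =
  counts.groupBy(_._1).map { case (name, pairs) => name -> pairs.map(_._2).sum }
```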
To get the histograms for the (String, String) RDD I used this code:
val Hist_X = histogram.map(x => (x._1-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_Y = histogram.map(x => (x._2-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_XY = histogram.map(x => (x-> 1.0)).reduceByKey(_+_)
where histogram was the (String,String) RDD