Spark map creation takes a very long time - scala

As shown below, my code builds the result in three steps.
Step 1: Group the calls using groupBy.
// Now group the calls by the s_msisdn for call type 1
// grouped: org.apache.spark.rdd.RDD[(String, Iterable[(String, (Array[String], String))])]
val groupedCallsToProcess = callsToProcess.groupBy(_._1)
Step 2: Map the grouped calls.
// create a List of the second element of each grouped entry, which is the callObject
// mapOfCalls: org.apache.spark.rdd.RDD[List[(String, (Array[String], String))]]
val mapOfCalls = groupedCallsToProcess.map(f => f._2.toList)
Step 3: Map to the Row object, where the map will hold a key-value pair of [CallsObject, msisdn].
val listOfMappedCalls = mapOfCalls.map(f => f.map(_._2).map(c =>
  Row(
    c._1(CallCols.call_date_hour),
    c._1(CallCols.sw_id),
    c._1(CallCols.s_imsi),
    f.map(_._1).take(1).mkString
  )
))
The third step shown above takes a very long time when the data size is large. I am wondering if there is any way to make step 3 more efficient. I would really appreciate any help with this.

There are several things in your code that are very costly and that you don't actually need.
You do not need the groupBy in the first step. groupBy is very costly in Spark.
The whole second step is useless; toList is very costly, with a lot of GC overhead.
Remove the extra map in the third step. Every map is a linear pass whose cost is proportional to the work done by the mapped function.
Never do something like f.map(_._1).take(1). You are transforming the whole list but using only 1 (or 5) elements. Instead do f.take(5).map(_._1), and if you need only one, f.head._1 (see the small illustration below).
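A tiny illustration of that last point, on a hypothetical in-memory list:
// hypothetical list, only to show how much work each variant does
val f = List(("a", 1), ("b", 2), ("c", 3))
f.map(_._1).take(1) // maps all three elements, then keeps one
f.take(1).map(_._1) // keeps one element, then maps only that one
f.head._1           // cheapest when you only need the first key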
Before discussing how you can write this code in a different way without groupBy, let's fix the current code.
// you had this at the start
val callsToProcess: RDD[(String, (Array[String], String))] = ....

// RDD[(String, Iterable[(String, (Array[String], String))])]
val groupedCallsToProcess = callsToProcess
  .groupBy(_._1)

// skip the second step entirely
val listOfMappedCalls = groupedCallsToProcess
  .map({ case (key, iter) =>
    // this is what you did:
    // val iterHeadString = iter.head._1
    // but the 1st element in each tuple of iter is actually the same as key, so
    val iterHeadString = key
    // (or we could remove iterHeadString entirely and use key directly)
    iter.map({ case (str1, (arr, str2)) => Row(
      arr(CallCols.call_date_hour),
      arr(CallCols.sw_id),
      arr(CallCols.s_imsi),
      iterHeadString
    ) })
  })
But, like I said, groupBy is very costly in Spark, and you already had an RDD[(key, value)] in your callsToProcess, so we can just use aggregateByKey directly. You may also notice that your groupBy is not useful for anything other than putting all those rows inside a list instead of directly inside an RDD.
// you had this at the start
val callsToProcess: RDD[(String, (Array[String], String))] = ....

// now let's map it to look like what you needed, because we can
// totally do this without any grouping.
// I somehow believe that you needed this RDD[Row] and not RDD[List[Row]]
// RDD[Row]
val mapped = callsToProcess
  .map({ case (key, (arr, str)) => Row(
    arr(CallCols.call_date_hour),
    arr(CallCols.sw_id),
    arr(CallCols.s_imsi),
    key
  ) })

// Though I can not think of any reason for wanting this...
// but if you really needed that RDD[List[Row]] thing,
// then keep the keys with your rows.
// RDD[(String, Row)]
val mappedWithKey = callsToProcess
  .map({ case (key, (arr, str)) => (key, Row(
    arr(CallCols.call_date_hour),
    arr(CallCols.sw_id),
    arr(CallCols.s_imsi),
    key
  )) })

// now aggregate by the key to create your lists
// RDD[(String, List[Row])] - call .values on it if you only want the lists
val yourStrangeRDD = mappedWithKey
  .aggregateByKey(List[Row]())(
    (list, row) => row +: list, // prepend, do not append
    (list1, list2) => list1 ++ list2
  )
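For comparison, the same grouping can also be written with mapValues plus reduceByKey on the mappedWithKey RDD defined above; this is only a sketch of the same idea, and the aggregateByKey version is usually preferable because it avoids wrapping every single record in a one-element list.
// sketch: roughly equivalent to the aggregateByKey version above
// RDD[List[Row]]
val viaReduce = mappedWithKey
  .mapValues(row => List(row)) // wrap each Row in a one-element list
  .reduceByKey(_ ++ _)         // concatenate the lists per key
  .values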

Related

Scala Map from Test File

Looking to create a Scala map from a text file. A sample of the file (a few lines of it) can be seen below:
Alabama (9),Democratic:849624,Republican:1441170,Libertarian:25176,Others:7312
Alaska (3),Democratic:153778,Republican:189951,Libertarian:8897,Others:6904
Arizona (11),Democratic:1672143,Republican:1661686,Libertarian:51465,Green:1557,Others:475
I have been given the map buffer as follows:
var mapBuffer: Map[String, List[(String, Int)]] = Map()
Note the party values are separated by a colon.
I am trying to read the file contents and store the data in a map structure, where each line of the file is used to construct a map entry with the state as the key and a list of tuples as the value. The type of the structure should be Map[String, List[(String, Int)]].
Essentially I am just trying to create a map from each line of the file, but I can't quite get it right. I tried the below with no luck; I think that val lines should be an array rather than an iterator.
val stream: InputStream = getClass.getResourceAsStream("")
val lines: Iterator[String] = scala.io.Source.fromInputStream(stream).getLines
var map: Map[String, List[(String, Int)]] = lines
  .map(_.split(","))
  .map(line => (line(0).toString, line(1).toList))
  .toMap
This appears to do the job. (Scala 2.13.x)
val stateVotes =
  util.Using(io.Source.fromFile("votes.txt")) {
    val PartyVotes = "([^:]+):(\\d+)".r
    _.getLines()
     .map(_.split(",").toList)
     .toList
     .groupMapReduce(_.head)(_.tail.collect {
       case PartyVotes(p, v) => (p, v.toInt)
     })(_ ++ _)
  } // file is auto-closed

// stateVotes: Try[Map[String, List[(String, Int)]]] = Success(
//   Map(Alabama (9) -> List((Democratic,849624), (Republican,1441170), (Libertarian,25176), (Others,7312))
//     , Arizona (11) -> List((Democratic,1672143), (Republican,1661686), (Libertarian,51465), (Green,1557), (Others,475))
//     , Alaska (3) -> List((Democratic,153778), (Republican,189951), (Libertarian,8897), (Others,6904))))
In this case the number following the state name is preserved. That can be changed.
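For example, a sketch of the same chain with the seat count stripped out of the key (only the key function passed to groupMapReduce changes):
// sketch: same pipeline, but with the "(n)" suffix trimmed off the state name
val stateVotesNoSeats =
  util.Using(io.Source.fromFile("votes.txt")) {
    val PartyVotes = "([^:]+):(\\d+)".r
    _.getLines()
     .map(_.split(",").toList)
     .toList
     .groupMapReduce(_.head.takeWhile(_ != '(').trim)(_.tail.collect {
       case PartyVotes(p, v) => (p, v.toInt)
     })(_ ++ _)
  }
// e.g. the key becomes "Alabama" instead of "Alabama (9)"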
No, an iterator is fine (better than a list, actually); you just need to split the values too in order to create those tuples.
lines
  .map(_.split(","))
  .map { l =>
    l.head -> l.tail.toList.map(_.split(":"))
      .collect { case Array(a, b) => a -> b.toInt }
  }
  .toMap
An alternative that looks a little more aesthetic to my eye is converting to a Map early and then using mapValues (I personally much prefer short lambdas). The downside is that mapValues is lazy, so you end up having to call .toMap again at the end to force it:
lines
  .map(_.split(","))
  .map { l => l.head -> l.tail.toList }
  .toMap
  .mapValues(_.map(_.split(":")))
  .mapValues(_.collect { case Array(a, b) => a -> b.toInt })
  .toMap
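Note that in Scala 2.13, mapValues on a strict Map is deprecated in favour of going through a view; a sketch of the same chain written that way:
// sketch: Scala 2.13 spelling, using an explicit view instead of the deprecated Map.mapValues
lines
  .map(_.split(","))
  .map { l => l.head -> l.tail.toList }
  .toMap
  .view
  .mapValues(_.map(_.split(":")).collect { case Array(a, b) => a -> b.toInt })
  .toMap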

read a file in scala and get key value pairs as Map[String, List[String]]

I am reading a file and getting the records as a Map[String, List[String]] in Spark/Scala. I want to achieve the same thing in pure Scala, without any Spark references (i.e. not reading into an RDD). What should I change to make it work in a pure Scala way?
rdd
  .filter(x => (x != null) && (x.length > 0))
  .zipWithIndex()
  .map {
    case (line, index) =>
      val array = line.split("~").map(_.trim)
      (array(0), array(1), index)
  }
  .groupBy(_._1)
  .mapValues(x => x.toList.sortBy(_._3).map(_._2))
  .collect
  .toMap
For the most part it will remain the same, except for the groupBy part done on the RDD. Scala's List also has map, filter, reduce, etc., so they can be used in much the same way.
val lines = Source.fromFile("filename.txt").getLines.toList
Once the file is read and stored in a List, those methods can be applied to it.
For the groupBy part, one simple approach is to sort the tuples on the key; that effectively clusters the tuples with the same key together.
// assuming arr: Seq[(String, String, Int)] holding the (key, value, index) tuples built above
val grouped = scala.util.Sorting.stableSort(arr,
  (e1: (String, String, Int), e2: (String, String, Int)) => e1._1 < e2._1)
There are definitely better solutions, but this would effectively do the same task.
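For example, here is a sketch that mirrors the original Spark pipeline one-for-one with plain collections, assuming the same ~-delimited input and the lines list read above (on Scala 2.12 the .view step can be dropped):
// sketch: pure-Scala equivalent of the Spark code in the question
val asMap: Map[String, List[String]] = lines
  .filter(l => (l != null) && l.nonEmpty)
  .zipWithIndex
  .map { case (line, index) =>
    val array = line.split("~").map(_.trim)
    (array(0), array(1), index)
  }
  .groupBy(_._1)
  .view
  .mapValues(_.sortBy(_._3).map(_._2))
  .toMap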
I came up with the approach below.
Source
  .fromInputStream(getClass.getResourceAsStream(filePath))
  .getLines
  .filter(line => (line != null) && (line.length > 0))
  .map(_.split("~"))
  .toList
  .groupBy(_(0))
  .map { case (key, values) => (key, values.map(_(1))) }

How to fill Scala Seq of Sets with unique values from Spark RDD?

I'm working with Spark and Scala. I have an RDD of Array[String] that I'm going to iterate through. The RDD contains values for attributes like (name, age, work, ...). I'm using a Sequence of mutable Sets of Strings (called attributes) to collect all unique values for each attribute.
Think of the RDD as something like this:
("name1","21","JobA")
("name2","21","JobB")
("name3","22","JobA")
In the end I want something like this:
attributes = (("name1","name2","name3"),("21","22"),("JobA","JobB"))
I have the following code:
val someLength = 10
val attributes = Seq.fill[mutable.Set[String]](someLength)(mutable.Set())
val splitLines = rdd.map(line => line.split("\t"))
splitLines.foreach(line => {
  for {(value, index) <- line.zipWithIndex} {
    attributes(index).add(value)
    // #1
  }
})
// #2
When I debug and stop at the line marked with #1, everything is fine; attributes is correctly filled with unique values.
But after the loop, at line #2, attributes is empty again. Looking into it shows that attributes is a sequence of sets, all of size 0.
Seq()
Seq()
...
What am I doing wrong? Is there some kind of scoping going on that I'm not aware of?
The answer lies in the fact that Spark is a distributed engine. I will give you a rough idea of the problem you are facing. The elements of an RDD are bucketed into Partitions, and each Partition potentially lives on a different node.
When you write rdd1.foreach(f), that f is wrapped inside a closure (which gets copies of the corresponding objects). This closure is serialized and then sent to each node, where it is applied to every element of that Partition.
Here, your f gets a copy of attributes in its wrapped closure, so when f is executed it interacts with that copy of attributes and not with the attributes you want. This results in your attributes being left without any changes.
I hope the problem is clear now.
val yourRdd = sc.parallelize(List(
  ("name1", "21", "JobA"),
  ("name2", "21", "JobB"),
  ("name3", "22", "JobA")
))

val yourNeededRdd = yourRdd
  .flatMap({ case (name, age, work) => List(("name", name), ("age", age), ("work", work)) })
  .groupBy({ case (attrName, attrVal) => attrName })
  .map({ case (attrName, group) => (attrName, group.toList.map(_._2).distinct) })
// RDD(
//   ("name", List("name1", "name2", "name3")),
//   ("age", List("21", "22")),
//   ("work", List("JobA", "JobB"))
// )

// Or
val distinctNamesRdd = yourRdd.map(_._1).distinct
// RDD("name1", "name2", "name3")
val distinctAgesRdd = yourRdd.map(_._2).distinct
// RDD("21", "22")
val distinctWorksRdd = yourRdd.map(_._3).distinct
// RDD("JobA", "JobB")

Access tuple inside a tuple for anonymous map job in Spark

This post is essentially about how to build joint and marginal histograms from a (String, String) RDD. I posted the code that I eventually used below as the answer.
I have an RDD that contains a set of tuples of type (String, String), and since they aren't unique I want to see how many times each (String, String) combination occurs, so I use countByValue like so
val PairCount = Pairs.countByValue().toSeq
which gives me output of the form ((String, String), Long), where the Long is the number of times that the (String, String) tuple appeared.
These Strings can be repeated in different combinations, and I essentially want to run word count on this PairCount variable, so I tried something like this to start:
PairCount.map(x => (x._1._1, x._2))
But the output this spits out is String1->1, String2->1, String3->1, etc.
How do I output a key-value pair from a map job in this case, where the key is one of the String values from the inner tuple and the value is the Long value from the outer tuple?
Update:
@vitalii gets me almost there. The answer gets me to a Seq[(String, Long)], but what I really need is to turn that into a map so that I can run reduceByKey on it afterwards. When I run
PairCount.flatMap { case ((x, y), n) => Seq(x -> n) }.toMap
for each unique x I get x->1.
For example, the above line of code generates mom->1, dad->1 even if the tuples out of the flatMap included (mom,30), (dad,59), (mom,2), (dad,14), in which case I would expect toMap to give mom->30, dad->59, mom->2, dad->14. However, I'm new to Scala, so I might be misinterpreting the functionality.
How can I get the Tuple2 sequence converted to a map so that I can reduce on the map keys?
If I understand the question correctly, you need flatMap:
val pairCountRDD = pairs.countByValue() // RDD[((String, String), Int)]
val res: RDD[(String, Int)] = pairCountRDD.flatMap { case ((s1, s2), n) =>
  Seq(s1 -> n, s2 -> n)
}
Update: I didn't quite understand what your final goal is, but here are a few more examples that may help you. By the way, the code above is incorrect; I missed the fact that countByValue returns a Map, not an RDD:
val pairs = sc.parallelize(
  List(
    "mom" -> "dad", "dad" -> "granny", "foo" -> "bar", "foo" -> "baz", "foo" -> "foo"
  )
)

// don't use countByValue; if pairs is large you will run out of memory
val pairCountRDD = pairs.map(x => (x, 1)).reduceByKey(_ + _)

val wordCount = pairs.flatMap { case (a, b) => Seq(a -> 1, b -> 1) }.reduceByKey(_ + _)
wordCount.take(10)

// count in how many pairs each word occurs, across keys and values:
val wordPairCount = pairs.flatMap { case (a, b) =>
  if (a == b) {
    Seq(a -> 1)
  } else {
    Seq(a -> 1, b -> 1)
  }
}.reduceByKey(_ + _)
wordPairCount.take(10)
To get the histograms for the (String, String) RDD I used this code.
val Hist_X = histogram.map(x => (x._1 -> 1.0)).reduceByKey(_ + _).collect().toMap
val Hist_Y = histogram.map(x => (x._2 -> 1.0)).reduceByKey(_ + _).collect().toMap
val Hist_XY = histogram.map(x => (x -> 1.0)).reduceByKey(_ + _)
where histogram was the (String,String) RDD

Distributed Map in Scala Spark

Does Spark support distributed Map collection types?
So if I have a HashMap[String, String] of key-value pairs, can it be converted to a distributed Map collection type? To access an element I could use "filter", but I doubt this performs as well as a Map.
Since I found some new info, I thought I'd turn my comments into an answer. @maasg already covered the standard lookup function; I would just like to point out that you should be careful, because if the RDD's partitioner is None, lookup falls back to a filter anyway. In reference to the (K,V) store on top of Spark, it looks like this is in progress, but a usable pull request has been made here. Here is an example usage.
import org.apache.spark.rdd.IndexedRDD
// Create an RDD of key-value pairs with Long keys.
val rdd = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
// Construct an IndexedRDD from the pairs, hash-partitioning and indexing
// the entries.
val indexed = IndexedRDD(rdd).cache()
// Perform a point update.
val indexed2 = indexed.put(1234L, 10873).cache()
// Perform a point lookup. Note that the original IndexedRDD remains
// unmodified.
indexed2.get(1234L) // => Some(10873)
indexed.get(1234L) // => Some(0)
// Efficiently join derived IndexedRDD with original.
val indexed3 = indexed.innerJoin(indexed2) { (id, a, b) => b }.filter(_._2 != 0)
indexed3.collect // => Array((1234L, 10873))
// Perform insertions and deletions.
val indexed4 = indexed2.put(-100L, 111).delete(Array(998L, 999L)).cache()
indexed2.get(-100L) // => None
indexed4.get(-100L) // => Some(111)
indexed2.get(999L) // => Some(0)
indexed4.get(999L) // => None
It seems like the pull request was well received and will probably be included in future versions of Spark, so it is probably safe to use it in your own code. Here is the JIRA ticket, in case you were curious.
The quick answer: partially.
You can transform a Map[A,B] into an RDD[(A,B)] by first forcing the map into a sequence of (k,v) pairs, but by doing so you lose the constraint that the keys of a map must form a set, i.e. you lose the semantics of the Map structure.
From a practical perspective, you can still resolve an element to its corresponding value using kvRdd.lookup(element), but the result will be a sequence, since you have no guarantee that there is a single value per key, as explained before.
A spark-shell example to make things clear:
val englishNumbers = Map(1 -> "one", 2 ->"two" , 3 -> "three")
val englishNumbersRdd = sc.parallelize(englishNumbers.toSeq)
englishNumbersRdd.lookup(1)
res: Seq[String] = WrappedArray(one)
val spanishNumbers = Map(1 -> "uno", 2 -> "dos", 3 -> "tres")
val spanishNumbersRdd = sc.parallelize(spanishNumbers.toList)
val bilingueNumbersRdd = englishNumbersRdd union spanishNumbersRdd
bilingueNumbersRdd.lookup(1)
res: Seq[String] = WrappedArray(one, uno)
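If you do need Map semantics back (one value per key), here are a couple of hedged options on top of the same example: reduce per key before the lookup, or collect the pair RDD into a driver-side Map (where later duplicates simply overwrite earlier ones).
// sketch: keep only one value per key before looking it up
val dedupedRdd = bilingueNumbersRdd.reduceByKey((_, latest) => latest)
dedupedRdd.lookup(1) // Seq with a single element
// sketch: or pull everything back to the driver as an ordinary Map
val driverSideMap: Map[Int, String] = englishNumbersRdd.collectAsMap().toMap
driverSideMap.get(1) // Some(one)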