Spark Scala: mapPartitions in this use case

I have been reading a lot about the differences between map and mapPartitions, but I still have some doubts.
After reading, I decided to replace the map calls in my code with mapPartitions, because apparently mapPartitions is faster than map.
My question is whether that decision is right in scenarios like the following (the comments show the previous code):
val reducedRdd = rdd.mapPartitions(partition => partition.map(r => (r.id, r)))
//val reducedRdd = rdd.map(r => (r.id, r))
.reduceByKey((r1, r2) => r1.combineElem(r2))
// .map(e => e._2)
.mapPartitions(partition => partition.map(e => e._2))
Am I thinking about this correctly? Thanks!

In your case, mapPartitions should not make any difference.
mapPartitions vs map
mapPartitions is useful when we have some common computation that we want to do once per partition. Example:
rdd.mapPartitions { partition =>
  val complicatedRowConverter = <SOME-COSTLY-COMPUTATION>
  partition.map { row =>
    (row.id, complicatedRowConverter(row))
  }
}
In the above example, we create a complicatedRowConverter function that is derived from some costly computation. This function is the same for the entire RDD partition, so we don't need to recreate it again and again. The other way to do the same thing would be:
rdd.map { row =>
  val complicatedRowConverter = <SOME-COSTLY-COMPUTATION>
  (row.id, complicatedRowConverter(row))
}
This will be slow because we unnecessarily run val complicatedRowConverter = <SOME-COSTLY-COMPUTATION> for every row.
In your case, you don't have any precomputation or anything else per partition. In your mapPartitions, you are simply iterating over each row and mapping it to (row.id, row).
So mapPartitions won't help here, and you can use a simple map.
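For completeness, the simpler map-based version of your pipeline (essentially the code you commented out) would look like this:
val reducedRdd = rdd
  .map(r => (r.id, r))
  .reduceByKey((r1, r2) => r1.combineElem(r2))
  .map(e => e._2)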

tl;dr mapPartitions will be faster in this case.
Why
Consider a function:
def someFunc(row: Row): Row = {
  // do some processing on row
  // return a new row
}
Say we are processing 1 million records.
map
We will end up calling someFunc 1 million times.
That is basically 1 million virtual function calls and other kernel data structures created for the processing.
mapPartitions
We would write this as:
rdd.mapPartitions { partIter =>
  partIter.map { row =>
    // do some processing on row
    // return a new row
  }
}
There are no per-record virtual function calls or context switches here.
Hence mapPartitions will be faster.
Also, as mentioned in @moriarity007's answer, we need to factor in the object creation overhead involved in the operation when deciding which operator to use.
Beyond that, I would recommend using DataFrame transforms and actions to do the processing, where the computation can be even faster, since Spark Catalyst optimizes your code and also takes advantage of code generation.
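As a rough sketch only - the question's combineElem has no direct DataFrame equivalent, so a max on a hypothetical value column stands in for it, and the conversion assumes the rows are case classes with a SparkSession named spark in scope:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = rdd.toDF()                                   // assumes an implicit encoder for the row type
val reducedDf = df.groupBy($"id").agg(max($"value"))  // Catalyst plans and code-generates this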

Related

Serialising temp collections created in Spark executors during task execution

I'm trying to find an effective way of writing collections created inside tasks to the output files of the job. For example, if we iterate over an RDD using foreach, we can create data structures that are local to the executor, e.g., the ListBuffer arr in the following code snippet. My problem is: how do I serialise arr and write it to a file?
(1) Should I use the FileWriter API, or will Spark's saveAsTextFile work?
(2) What are the advantages of using one over the other?
(3) Is there a better way of achieving the same thing?
PS: The reason I am using foreach instead of map is that I might not be able to transform all my RDD rows, and I want to avoid getting null values in the output.
val dataSorted: RDD[(Int, Int)] = <Some Operation>
val arr: ListBuffer[(String, String)] = ListBuffer[(String, String)]()
dataSorted.foreach {
  case (e, r) => {
    if (e.id > 1000) {
      arr += (("a", "b"))
    }
  }
}
Thanks,
Devj
You should not use driver-side variables, but Accumulators - there are articles about them with code examples here and here; this question may also be helpful - it contains a simplified code example of a custom AccumulatorParam.
Write your own accumulator that is able to add (String, String), or use the built-in CollectionAccumulator. This is an implementation of AccumulatorV2, the new version of accumulators introduced in Spark 2.
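A minimal sketch with the built-in CollectionAccumulator (Spark 2+), mirroring the question's code; the condition is written as e > 1000 to match the declared RDD[(Int, Int)]:
val acc = sc.collectionAccumulator[(String, String)]("collectedPairs")

dataSorted.foreach { case (e, r) =>
  if (e > 1000) acc.add(("a", "b"))   // adds happen on the executors
}

val arr = acc.value                   // a java.util.List[(String, String)], read on the driver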
Another way is to use Spark's built-in filter and map functions - thanks @ImDarrenG for suggesting flatMap, but I think filter and map will be easier:
val result: Array[(String, String)] = someRDD
  .filter(x => x._1 > 1000) // keep only the good rows
  .map(x => ("a", "b"))
  .collect()                // convert to an array
The Spark API saves you some file handling code but essentially achieves the same thing.
The exception is if you are not using, say, HDFS and do not want your output file to be partitioned (spread across the executors' file systems). In that case you will need to collect the data to the driver and use FileWriter to write to a single file, or files, and how you achieve that will depend on how much data you have. If you have more data than the driver has memory, you will need to handle it differently.
As mentioned in another answer, you're creating an array in your driver, while adding items from your executors, which will not work in a cluster environment. Something like this might be a better way to map your data and handle nulls:
val outputRDD = dataSorted.flatMap {
  case (e, r) => {
    if (e.id > 1000) {
      Some(("a", "b"))
    } else {
      None
    }
  }
}
// save outputRDD to file/s here using the appropriate method...
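For that last step, either of the two options from the question would work; a rough sketch (the paths are placeholders):
// distributed output, one part-file per partition:
outputRDD.saveAsTextFile("hdfs:///some/output/path")

// or, if a single local file is required and the data fits in driver memory:
import java.io.PrintWriter
val writer = new PrintWriter("/tmp/output.txt")
outputRDD.collect().foreach { case (a, b) => writer.println(s"$a,$b") }
writer.close()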

Scala - Update RDD with another Map

I'm trying to update an RDD with more information from another Map. I wrote this, but it is not working.
Where:
localCurrencies is a sequence of Currency objects
rdd: RDD[String, String]
...
val localCurrencies = Await.result(CurrencyDAO.currencies, 30 seconds)

// update ISO3
rdd.map(r => r.updated("currencyiso3",
  localCurrencies.find(c => c.CurrencyId == rdd.get("currencyid")).get.ISO3))

// update exponent
rdd.map(r => r.updated("exponent",
  localCurrencies.find(c => c.CurrencyId == rdd.get("currencyid")).get.Exponent))
Any suggestions?
Thanks
map doesn't modify an RDD, it creates a new one (the same applies to every Spark transformation). If you don't actually do anything with this new RDD, Spark won't even bother creating it. So you want to write
val rdd1 = rdd.map(...).map(...) // better to combine two `map`s into one
and work with rdd1 from then on (you can still use rdd as well, if needed). This isn't necessarily the only error, but it is one you'll need to fix.
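A hedged sketch only - assuming each element of rdd is actually a Map[String, String] (the declared RDD[String, String] does not compile as written) and that the lookup should use r rather than rdd - the two maps combined into one could look like:
val rdd1 = rdd.map { r =>
  localCurrencies.find(c => c.CurrencyId == r("currencyid")) match {
    case Some(currency) =>
      r.updated("currencyiso3", currency.ISO3)
       .updated("exponent", currency.Exponent.toString) // assumes Exponent must be converted to String
    case None => r                                      // leave the row unchanged if no currency matches
  }
}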

How to run two SparkSql queries in parallel in Apache Spark

First, let me show the part of the code I want to execute in a .scala file on Spark.
This is my source file. It has structured data with four fields:
val inputFile = sc.textFile("hdfs://Hadoop1:9000/user/hduser/test.csv")
I have declared a case class to store the data from the file in a table with four columns:
case class Table1(srcIp: String, destIp: String, srcPrt: Int, destPrt: Int)
val inputValue = inputFile.map(_.split(",")).map(p => Table1(p(0),p(1),p(2).trim.toInt,p(3).trim.toInt)).toDF()
inputValue.registerTempTable("inputValue")
Now, let's say I want to run the following two queries. How can I run these queries in parallel, given that they are mutually independent? I feel that running them in parallel could reduce the execution time. Right now, they are executed serially.
val primaryDestValues = sqlContext.sql("SELECT distinct destIp FROM inputValue")
primaryDestValues.registerTempTable("primaryDestValues")
val primarySrcValues = sqlContext.sql("SELECT distinct srcIp FROM inputValue")
primarySrcValues.registerTempTable("primarySrcValues")
primaryDestValues.join(primarySrcValues, $"destIp" === $"srcIp").select($"destIp", $"srcIp").show()
Maybe you can look in the direction of Futures/Promises. There is a method in SparkContext, submitJob, which returns a future with the results. So you could fire two jobs and then collect the results from the futures.
I have not tried this method yet - just an assumption.
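A sketch of that idea using plain Scala Futures (untested, as said above; the timeout is arbitrary). Each Future submits its own Spark job, so the two queries can run concurrently if the cluster has free resources:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val destF = Future { sqlContext.sql("SELECT distinct destIp FROM inputValue").collect() }
val srcF  = Future { sqlContext.sql("SELECT distinct srcIp FROM inputValue").collect() }

val destRows = Await.result(destF, 10.minutes)
val srcRows  = Await.result(srcF, 10.minutes)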
No idea why you want to use sqlContext in the first place rather than keeping things simple.
val inputValue = inputFile.map(_.split(",")).map(p => (p(0),p(1),p(2).trim.toInt,p(3).trim.toInt))
Assuming p(0) = destIp and p(1) = srcIp:
val joinedValue = inputValue.map { case (destIp, srcIp, x, y) => (destIp, (x, y)) }
  .join(inputValue.map { case (destIp, srcIp, x, y) => (srcIp, (x, y)) })
  .map { case (ip, ((x1, y1), (x2, y2))) => (ip, x1, y1, x2, y2) }
Now it will be parallelized, and you can even control the number of partitions you want using coalesce.
You can skip the two DISTINCT and do one at the end:
inputValue.select($"srcIp").join(
inputValue.select($"destIp"),
$"srcIp" === $"destIp"
).distinct().show
That's a nice question. This can be executed in parallel using par on an array. For this, you have to customize your code accordingly.
Declare an array with two items in it (you can name them as you wish). Inside each case statement, write the code which you need to execute in parallel.
Array("destIp","srcIp").par.foreach { i =>
{
i match {
case "destIp" => {
val primaryDestValues = sqlContext.sql("SELECT distinct destIp FROM inputValue")
primaryDestValues.registerTempTable("primaryDestValues")
}
case "srcIp" => {
val primarySrcValues = sqlContext.sql("SELECT distinct srcIp FROM inputValue")
primarySrcValues.registerTempTable("primarySrcValues")
}}}
}
Once both case statements have finished executing, the code below will run. Note that the vals above are local to the foreach closure, so read the registered temp tables back here:
sqlContext.table("primaryDestValues").join(sqlContext.table("primarySrcValues"), $"destIp" === $"srcIp").select($"destIp", $"srcIp").show()
Note: if you remove par from the code, it will run sequentially.
The other option is to create another SparkSession inside the code and execute the SQL using that SparkSession variable. But this is a little risky and has to be used very carefully.

How can we avoid MapPartition related issues?

val counts = parsed.mapPartitions(iter => {
  iter.flatMap(point => {
    println("points" + point)
    point.indices.map(i => (i, point(i)))
  })
}).countByValue()

val count = parsed.mapPartitions(iter => {
  iter.flatMap(point => {
    println("pointsssss" + point.deep)
    point.indices.map(i => (i, point(i)))
  })
}).countByValue()
When I execute count.foreach(println), I also get output from counts. How can I avoid this problem?
The reason you see both print statements is that countByValue is itself an action and not a transformation, and it triggers evaluation of the RDD (in this case, both of them). From the docs:
def countByValue(): Map[T, Long]
Return the count of each unique value in this RDD as a map of (value, count) pairs. The final combine step happens locally on the master, equivalent to running a single reduce task.
Your next piece of code, count.foreach(println), thus happens outside of Spark, on normal Scala collections, on the master node.
Check your logic if this is not the behavior you want; I suspect that you want countByKey() instead (also an action).
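For reference, a sketch of the countByKey variant on the same pairs, which counts occurrences per index rather than per (index, value) pair:
val countsPerIndex = parsed.mapPartitions { iter =>
  iter.flatMap(point => point.indices.map(i => (i, point(i))))
}.countByKey()   // Map[Int, Long]; this is also an action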

How do I iterate RDD's in apache spark (scala)

I use the following command to fill an RDD with a bunch of arrays containing 2 strings ["filename", "content"].
Now I want to iterate over each of those entries to do something with every filename and content.
val someRDD = sc.wholeTextFiles("hdfs://localhost:8020/user/cloudera/*")
I can't seem to find any documentation on how to do this however.
So what I want is this:
foreach occurrence-in-the-rdd {
  // do stuff with the array found at location n of the RDD
}
You call various methods on the RDD that accept functions as parameters.
// set up an example -- an RDD of arrays
val sparkConf = new SparkConf().setMaster("local").setAppName("Example")
val sc = new SparkContext(sparkConf)
val testData = Array(Array(1,2,3), Array(4,5,6,7,8))
val testRDD = sc.parallelize(testData, 2)
// Print the RDD of arrays.
testRDD.collect().foreach(a => println(a.size))
// Use map() to create an RDD with the array sizes.
val countRDD = testRDD.map(a => a.size)
// Print the elements of this new RDD.
countRDD.collect().foreach(a => println(a))
// Use filter() to create an RDD with just the longer arrays.
val bigRDD = testRDD.filter(a => a.size > 3)
// Print each remaining array.
bigRDD.collect().foreach(a => {
  a.foreach(e => print(e + " "))
  println()
})
Notice that the functions you write accept a single RDD element as input, and return data of some uniform type, so you create an RDD of the latter type. For example, countRDD is an RDD[Int], while bigRDD is still an RDD[Array[Int]].
It will probably be tempting at some point to write a foreach that modifies some other data, but you should resist for reasons described in this question and answer.
Edit: Don't try to print large RDDs
Several readers have asked about using collect() and println() to see their results, as in the example above. Of course, this only works if you're running in an interactive mode like the Spark REPL (read-eval-print-loop.) It's best to call collect() on the RDD to get a sequential array for orderly printing. But collect() may bring back too much data and in any case too much may be printed. Here are some alternative ways to get insight into your RDDs if they're large:
RDD.take(): This gives you fine control on the number of elements you get but not where they came from -- defined as the "first" ones which is a concept dealt with by various other questions and answers here.
// take() returns an Array so no need to collect()
myHugeRDD.take(20).foreach(a => println(a))
RDD.sample(): This lets you (roughly) control the fraction of results you get, whether sampling uses replacement, and even optionally the random number seed.
// sample() does return an RDD so you may still want to collect()
myHugeRDD.sample(true, 0.01).collect().foreach(a => println(a))
RDD.takeSample(): This is a hybrid: using random sampling that you can control, but both letting you specify the exact number of results and returning an Array.
// takeSample() returns an Array so no need to collect()
myHugeRDD.takeSample(true, 20).foreach(a => println(a))
RDD.count(): Sometimes the best insight comes from how many elements you ended up with -- I often do this first.
println(myHugeRDD.count())
The fundamental operations in Spark are map and filter.
val txtRDD = someRDD filter { case(id, content) => id.endsWith(".txt") }
the txtRDD will now only contain files that have the extension ".txt"
And if you want to word count those files you can say
// split the documents into words in one long list
val words = txtRDD flatMap { case (id, text) => text.split("\\s+") }
// give each word a count of 1
val wordsT = words map (x => (x, 1))
// sum up the counts for each word
val wordCount = wordsT reduceByKey ((a, b) => a + b)
You want to use mapPartitions when you have some expensive initialization you need to perform -- for example, if you want to do Named Entity Recognition with a library like the Stanford coreNLP tools.
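A minimal sketch of that pattern; loadNerModel and annotate are hypothetical placeholders for the expensive library setup and call, not real coreNLP APIs:
val annotated = someRDD.mapPartitions { iter =>
  val model = loadNerModel()                                   // costly initialization, done once per partition
  iter.map { case (id, text) => (id, annotate(model, text)) }  // reused for every record in the partition
}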
Master map, filter, flatMap, and reduce, and you are well on your way to mastering Spark.
I would try making use of a partition mapping function. The code below shows how an entire RDD dataset can be processed in a loop so that each input goes through the very same function. I'm afraid I have no knowledge of Scala, so everything I have to offer is Java code. However, it should not be very difficult to translate it into Scala.
JavaRDD<String> res = file.mapPartitions(new FlatMapFunction <Iterator<String> ,String>(){
#Override
public Iterable<String> call(Iterator <String> t) throws Exception {
ArrayList<String[]> tmpRes = new ArrayList <>();
String[] fillData = new String[2];
fillData[0] = "filename";
fillData[1] = "content";
while(t.hasNext()){
tmpRes.add(fillData);
}
return Arrays.asList(tmpRes);
}
}).cache();
What wholeTextFiles returns is a pair RDD:
def wholeTextFiles(path: String, minPartitions: Int): RDD[(String, String)]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Here is an example of reading the files at a local path then printing every filename and content.
val conf = new SparkConf().setAppName("scala-test").setMaster("local")
val sc = new SparkContext(conf)
sc.wholeTextFiles("file:///Users/leon/Documents/test/")
.collect
.foreach(t => println(t._1 + ":" + t._2));
the result:
file:/Users/leon/Documents/test/1.txt:{"name":"tom","age":12}
file:/Users/leon/Documents/test/2.txt:{"name":"john","age":22}
file:/Users/leon/Documents/test/3.txt:{"name":"leon","age":18}
Or convert the pair RDD to a plain RDD first:
sc.wholeTextFiles("file:///Users/leon/Documents/test/")
.map(t => t._2)
.collect
.foreach { x => println(x)}
the result:
{"name":"tom","age":12}
{"name":"john","age":22}
{"name":"leon","age":18}
And I think wholeTextFiles is better suited to small files.
for (element <- YourRDD) {
  // do what you want with element in each iteration; if you want the index of the element,
  // simply use a counter variable in this loop, starting from 0
  println(element._1) // this will print all filenames
}