I am using Spark with Scala. I wanted to know whether a single one-line command is better than separate commands. What are the benefits, if any? Does it gain more efficiency in terms of speed? Why?
For example:
var d = data.filter(_(1)==user).map(f => (f(2),f(5).toInt)).groupByKey().map(f=> (f._1,f._2.count(x=>true), f._2.sum))
against
var a = data.filter(_(1)==user)
var b = a.map(f => (f(2),f(5).toInt))
var c = b.groupByKey()
var d = c.map(f=> (f._1,f._2.count(x=>true), f._2.sum))
There is no performance difference between your two examples; the decision to chain RDD transformations or to explicitly represent the intermediate RDDs is just a matter of style. Spark's lazy evaluation means that no actual distributed computation will be performed until you invoke an RDD action like take() or count().
During execution, Spark will pipeline as many transformations as possible. For your example, Spark won't materialize the entire filtered dataset before it maps it: the filter() and map() transformations will be pipelined together and executed in a single stage. The groupByKey() transformation (usually) needs to shuffle data over the network, so it's executed in a separate stage. Spark would materialize the output of filter() only if it had been cache()d.
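One quick way to convince yourself of this is that the chained definition alone launches no job; only an action does. A minimal sketch, reusing the data and user values from your example:
val pairs = data.filter(_(1) == user).map(f => (f(2), f(5).toInt))   // nothing runs here: filter() and map() only build the lineage
val grouped = pairs.groupByKey()                                      // still nothing: groupByKey() just adds a shuffle boundary to the plan
println(grouped.count())                                              // the first job is launched only by an action such as count()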
You might need to use the second style if you want to cache an intermediate RDD and perform further processing on it. For example, if I wanted to perform multiple actions on the output of the groupByKey() transformation, I would write something like
val grouped = data.filter(_(1)==user)
.map(f => (f(2),f(5).toInt))
.groupByKey()
.cache()
val mapped = grouped.map(f=> (f._1,f._2.count(x=>true), f._2.sum))
val counted = grouped.count()
There is no difference in terms of execution, but you might want to consider the readability of your code. I would go with your first example but like this:
var d = data.filter(_(1)==user)
.map(f => (f(2),f(5).toInt))
.groupByKey()
.map(f=> (f._1,f._2.count(x=>true), f._2.sum))
Really this is more of a Scala question than Spark though. Still, as you can see from Spark's implementation of word count as shown in their documentation
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
you don't need to worry about those kinds of things. The Scala language (through laziness, etc.) and Spark's RDD implementation handle all that at a higher level of abstraction.
If you find really bad performance, then you should take the time to explore why. As Knuth said, "premature optimization is the root of all evil."
Related
I am using Spark 2.4.1 and Java 8.
I have a scenario like this:
A list of classifiers to process will be provided from a property file.
These classifiers determine what data to pull and process.
Something like the below:
val classifiers = Seq("classifierOne","classifierTwo","classifierThree");
for (classifier <- classifiers) {
// read from CassandraDB table
val actualData = spark.read(.....).where(<classifier condition>)
// the data varies depend on the classifier passed in
// this data has many fields along with fieldOne, fieldTwo and fieldThree
Depending on the classifier, I need to filter the data.
Currently I am doing it as below:
if (classifier == "classifierOne") {
  val classifierOneDs = actualData.filter(col("classifierOne").notEqual(lit("")).or(col("classifierOne").isNotNull()))
  writeToParquet(classifierOneDs)
} else if (classifier == "classifierTwo") {
  val classifierTwoDs = actualData.filter(col("classifierTwo").notEqual(lit("")).or(col("classifierTwo").isNotNull()))
  writeToParquet(classifierTwoDs)
} else if (classifier == "classifierThree") {
  val classifierThreeDs = actualData.filter(col("classifierThree").notEqual(lit("")).or(col("classifierThree").isNotNull()))
  writeToParquet(classifierThreeDs)
}
Is there any way to avoid the if-else block here?
Any other way to do/achieve the same in Spark in a distributed way?
Your question seems more about how to structure your application than Spark itself. There are two parts really.
Is there any way to avoid the if-else block here?
"Avoid"? In what sense? Spark can't magically "discover" your way of doing distributed processing. You should help Spark a bit.
For this case I'd propose a lookup table with all possible filter conditions and their names to look up by, e.g.
val classifiers = Map(
"classifierOne" -> col("classifierOne").notEqual(lit("")).or(col("classifierOne").isNotNull()),
"classifierTwo" -> ...,
"classifierThree" -> ...)
In order to use it you simply iterate over all the classifiers (or look up as many as needed), e.g.
val queries = classifiers.map { case (name, cond) =>
spark
.read(.....)
.where(cond)
.filter(col(name).notEqual(lit("")).or(col(name).isNotNull()))
}
queries is a collection of DataFrames to be written with writeToParquet, and it's up to you how to execute the queries in parallel (Spark will take care of doing it in a distributed way). Use Scala Futures or another parallel API.
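With Scala Futures, one option would be the following sketch (assuming the implicit global ExecutionContext and the writeToParquet from your code):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each write is submitted from its own thread, so Spark can run the resulting jobs concurrently.
val writes = queries.map(q => Future(writeToParquet(q)))
Await.result(Future.sequence(writes), Duration.Inf)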
I think the following could make it just fine:
queries.par.foreach(writeToParquet)
With queries.par.foreach you essentially execute all the writeToParquet calls in parallel. Since writeToParquet executes a DataFrame action to write in Parquet format, it follows all the rules of Spark for any other action: it will run a Spark job (perhaps even more than one), and the job is executed in a distributed fashion using Spark machinery.
Think of queries.par as a way to submit the queries without waiting for earlier queries to finish before starting a new one. You are strongly recommended to configure FAIR scheduling mode:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads.
Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources.
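You enable it through the spark.scheduler.mode property, e.g. when building the session (a sketch; the application name is just a placeholder):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("classifier-export")              // placeholder application name
  .config("spark.scheduler.mode", "FAIR")    // enables FAIR job scheduling within this application
  .getOrCreate()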
So, you need to select which column to check based on the classifier name, which will be passed in as a list?
val classifiers = Seq("classifierOne","classifierTwo","classifierThree");
for (classifier <- classifiers) {
  val actualData = spark.read(.....).where(<classifier condition>)
  val classifierDs = actualData.filter(col(classifier).notEqual(lit("")).or(col(classifier).isNotNull()))
  writeToParquet(classifierDs)
}
As you're iterating through the list, you would be going through all the classifiers anyway.
If the column name can be different from the actual classifier name, you can make it a List[Classifier], where Classifier is something like
case class Classifier(colName: String, classifierName: String)
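For illustration, the loop might then look like the sketch below (readForClassifier is a hypothetical stand-in for your spark.read(...).where(...) call, and the column and classifier names are made up):
import org.apache.spark.sql.functions.{col, lit}

val classifiers = List(
  Classifier("colOne", "classifierOne"),
  Classifier("colTwo", "classifierTwo"),
  Classifier("colThree", "classifierThree"))

for (c <- classifiers) {
  val actualData = readForClassifier(c)   // hypothetical: your read-and-where logic for this classifier
  val ds = actualData.filter(col(c.colName).notEqual(lit("")).or(col(c.colName).isNotNull()))
  writeToParquet(ds)
}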
I have Spark code with the following structure:
val a:RDD = readData.someOperations()
a.cache()
val b = a.someOperations1()
val c = a.someOperations2()
val d = a.someOperations3()
val e = a.someOperations4()
a.unpersist()
some other code involving many more RDDs (other RDDs are cached in this section and other vals are evaluated)
write the variables to disk (a, b, c, d, e and others)
I wanted to know if the variables are calculated in the place they are defined or only when writing to disk. I fear that if they are evaluated only while writing to disk, then I will be caching many more RDDs at the same time.
Yes, you are correct. All transformations on an RDD are lazily evaluated until an action such as collect() or save() is performed.
Transformation operations like map() and filter() only generate logical and physical execution plans, which are carried out by tracking the parent plans when an action is performed.
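You can see this for yourself with toDebugString, which prints the lineage without triggering any computation; a minimal sketch, assuming sc is an existing SparkContext:
val base   = sc.parallelize(1 to 100)   // no job yet
val mapped = base.map(_ * 2)            // still no job: only the lineage grows
println(mapped.toDebugString)           // prints the plan without running it
val total  = mapped.reduce(_ + _)       // reduce() is an action, so the job runs here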
You can check out JerryLead's and Jacek Laskowski's write-ups for more details.
I hope this is helpful.
I need something similar to the randomSplit function:
val Array(df1, df2) = myDataFrame.randomSplit(Array(0.6, 0.4))
However, I need to split myDataFrame based on a boolean condition. Does anything like the following exist?
val Array(df1, df2) = myDataFrame.booleanSplit(col("myColumn") > 100)
I'd like not to do two separate .filter calls.
Unfortunately the DataFrame API doesn't have such a method; to split by a condition you'll have to perform two separate filter transformations:
myDataFrame.cache() // recommended to prevent repeating the calculation
val condition = col("myColumn") > 100
val df1 = myDataFrame.filter(condition)
val df2 = myDataFrame.filter(not(condition))
I understand that caching and filtering twice looks a bit ugly, but please bear in mind that DataFrames are translated to RDDs, which are evaluated lazily, i.e. only when they are directly or indirectly used in an action.
If a method booleanSplit as suggested in the question existed, the result would be translated to two RDDs, each of which would be evaluated lazily. One of the two RDDs would be evaluated first and the other would be evaluated second, strictly after the first. At the point the first RDD is evaluated, the second RDD would not yet have "come into existence". (EDIT: I just noticed that there is a similar question for the RDD API with an answer that gives similar reasoning.)
To actually gain any performance benefit, the second RDD would have to be (partially) persisted during the iteration of the first RDD (or, actually, during the iteration of the parent RDD of both, which is triggered by the iteration of the first RDD). IMO this wouldn't align overly well with the design of the rest of the RDD API. Not sure if the performance gains would justify this.
I think the best you can achieve is to avoid writing the two filter calls directly in your business code, by writing an implicit class with a booleanSplit utility method that does that part in a similar way to Tzach Zohar's answer, maybe using something along the lines of myDataFrame.withColumn("__condition_value", condition).cache() so that the value of the condition is not calculated twice.
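For example, a minimal sketch of such a utility (the names are made up; it uses the same cache-and-filter idea as above rather than the __condition_value variant):
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.not

// Hypothetical helper: keeps the two filter calls out of the business code.
implicit class BooleanSplitOps(df: DataFrame) {
  def booleanSplit(condition: Column): (DataFrame, DataFrame) = {
    val cached = df.cache()   // avoid recomputing the parent DataFrame for each half
    (cached.filter(condition), cached.filter(not(condition)))
  }
}

// usage: val (df1, df2) = myDataFrame.booleanSplit(col("myColumn") > 100)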
I've just started learning Spark and Scala.
From what I understand, it's bad practice to use collect, because it gathers the whole dataset in memory, and it's also bad practice to use for, because the code inside the block is not executed concurrently by more than one node.
Now, I have a List of numbers from 1 to 10:
List(1,2,3,4,5,6,7,8,9,10)
and for each of these values I need to generate an RDD using that value.
In such cases, how can I generate the RDDs?
By doing
sc.parallelize(List(1,2,3,4,5,6,7,8,9,10)).map(number => generate_rdd(number))
I get an error because an RDD cannot be generated inside another RDD.
What is the best workaround to this problem?
Assuming generate_rdd is defined as def generate_rdd(n: Int): RDD[Something], what you need is flatMap instead of map.
sc.parallelize(List(1,2,3,4,5,6,7,8,9,10)).flatMap(number => generate_rdd(number))
This will give an RDD that is the concatenation of all the RDDs created for the numbers from 1 to 10.
Assuming that the number of RDDs you would like to create is low, and hence that the parallelization itself need not be accomplished by an RDD, we can use Scala's parallel collections instead. For example, I tried to count the number of lines in about 40 HDFS files simultaneously using the following piece of code [ignore the setting of the delimiter; for newline-delimited text this could just as well have been sc.textFile]:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "~^~")

// .par turns the list of paths into a parallel collection, so the counts are submitted concurrently
val parSeq = List("path of file1.xsv", "path of file2.xsv", ...).par
parSeq.map { x =>
  val rdd = sc.newAPIHadoopFile(x, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  println(rdd.count())
}
Here is part of the output in Spark UI. As seen, most of the RDD count operations started at the same time.
Given
val as: RDD[(T, U)]
val bs: RDD[T]
I would like to filter as to find the elements whose keys are present in bs.
One approach is
val intermediateAndOtherwiseUnnecessaryPair = bs.map(b => b -> b)
intermediateAndOtherwiseUnnecessaryPair.join(as).values
But the mapping on bs is unfortunate. Is there a more direct method?
You can make the mapping less unnecessary by doing:
val intermediateAndOtherwiseUnnecessaryPair = bs.map(b => (b, 1))
Also, co-partitioning before joining helps a lot:
import org.apache.spark.HashPartitioner
val intermediateAndOtherwiseUnnecessaryPair = bs.map(b => (b, 1)).partitionBy(new HashPartitioner(NUM_PARTITIONS))
intermediateAndOtherwiseUnnecessaryPair.join(as.partitionBy(new HashPartitioner(NUM_PARTITIONS))).values
Co-partitioned RDDs will not be shuffled at runtime, and thus you'll see a significant performance boost.
Broadcasting may not work if bs is too big (more precisely, has a large number of unique values); you may also want to increase spark.driver.maxResultSize.
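spark.driver.maxResultSize is an application-level setting, so it is normally set on the SparkConf or via --conf when submitting; a sketch with an arbitrary 4g value:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("filter-by-keys")               // placeholder application name
  .set("spark.driver.maxResultSize", "4g")    // arbitrary limit; the default is 1g
val sc = new SparkContext(conf)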
The only two (or at least the only ones I am aware of) popular and generic ways to filter one RDD using a second RDD are:
1) join, which you are already doing. In this case I wouldn't worry about the unnecessary intermediate RDD that much; map() is a narrow transformation and won't introduce much overhead. The join() itself will most probably be slow, though, as it's a wide transformation (it requires a shuffle).
2) collecting bs on the driver and making it a broadcast variable, which will then be used in as.filter()
val collected = sc.broadcast(bs.collect().toSet)
as.filter { case (key, _) => collected.value.contains(key) }
You need to do this as Spark doesn't support nesting RDDs inside methods called on RDD.