I am using spark 2.4.1 version and java 8.
I have scenario like:
Will be provided a list of classifiers from a property file to process.
These classifiers determines the data what to pull and process.
Something like the below:
val classifiers = Seq("classifierOne","classifierTwo","classifierThree");
for( classifier : classifiers ){
// read from CassandraDB table
val acutalData = spark.read(.....).where(<classifier conditition>)
// the data varies depend on the classifier passed in
// this data has many fields along with fieldOne, fieldTwo and fieldThree
Depend on the classifier I need to filter the data.
Currently I am doing it as below:
if(classifier.===("classifierOne")) {
val classifierOneDs = acutalData.filter(col("classifierOne").notEqual(lit("")).or(col("classifierOne").isNotNull()));
writeToParquet(classifierOneDs);
} else if(classifier.===("classifierTwo")) {
val classifierTwoDs = acutalData.filter(col("classifierTwo").notEqual(lit("")).or(col("classifierTwo").isNotNull()));
writeToParquet(classifierOneDs);
} else (classifier.===("classifierThree")) {
val classifierThreeDs = acutalData.filter(col("classifierThree").notEqual(lit("")).or(col("classifierThree").isNotNull()));
writeToParquet(classifierOneDs);
}
Is there any way to avoid the if-else block here?
Any other way to do/achieve the same in spark distrubated way?
Your question seems more about how to structure your application than Spark itself. There are two parts really.
Is there any way to avoid the if-else block here?
"Avoid"? In what sense? Spark can't magically "discover" your way of doing distributed processing. You should help Spark a bit.
For this case I'd propose a lookup table with all possible filter conditions and their names to look up by, e.g.
val classifiers = Map(
"classifierOne" -> col("classifierOne").notEqual(lit("")).or(col("classifierOne").isNotNull()),
"classifierTwo" -> ...,
"classifierThree" -> ...)
In order to use it you simply iterate over all the classifiers (or look up as many as needed), e.g.
val queries = classifiers.map { case (name, cond) =>
spark
.read(.....)
.where(cond)
.filter(col(name).notEqual(lit("")).or(col(name).isNotNull()))
}
queries is a collection of DataFrames to be writeToParquet and it's up to you how to make the queries executed in parallel (Spark will take care of doing it in distributed way). Use Scala Futures or another parallel API.
I think the following could make it just fine:
queries.par.foreach(writeToParquet)
With queries.par.foreach you essentially execute all writeToParquet in parallel. Since writeToParquet executes a DataFrame action to writing in parquet format that follows all the rules of Spark for any other action. It will run a Spark job (perhaps even more than one) and the job is executed in distributed fashion using Spark machinery.
Think of queries.par as a way to execute the queries one by one without waiting for earlier queries to finish to start a new one. You are strongly recommended to configure FAIR scheduling mode:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads.
Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources.
So, you need to select, what column to check, based on classifier name, that will be passed as a list?
val classifiers = Seq("classifierOne","classifierTwo","classifierThree");
for(classifier : classifiers) {
val acutalData = spark.read(.....).where(<classifier conditition>)
val classifierDs = acutalData.filter(col(classifier).notEqual(lit("")).or(col(classifier).isNotNull()));
writeToParquet(classifierDs);
}
As you're iterating through list, you would be going through all the classifiers anyway.
If column name can be different from actual classifier name, you can make it List[Classifier], where Classifier is something like
case class Classifier(colName: String, classifierName: String)
Related
I'm working on a code-base that groups together chunks of logic into classes. For example, there might be a class that calculates the mean of a column after filtering, and another that aggregates columns to calculate the unique_count of certain fields. Sample code below:
// note: this interface is mutable and we can add things to this interface as needed
trait Processor {
def dosomething(input: DataFrame): DataFrame
}
class NormScore extends Processor {
def dosomething(input: DataFrame) = {
input.withColumn("normalized_score", col("score") - lit(1))
}
}
class Distribution extends Processor {
def dosomething(input: DataFrame) = {
input.groupBy(col("score"))
.agg(count(lit(1)).as("count"))
}
}
At the end, we chain these together to form a pipeline of transformations:
val pipeline = Seq(Distribution(), NormScore())
And we sequentially enact this set of transformations on a given input.
Problem: We know the schema for the input, and I would like to support a simple def validate(pipeline: Seq[Processor]) method to validate if a pipeline can even work or not. For example, if you're aggregating on a field that doesn't exist, spark will complain and throw an error. However, since we know all the types involved, it feels like we should be able to figure this out before spark even starts. Today we catch this in a unit-test where we run through with some mock-data, but that can take > 1m depending on the complexity of the transform, and it requires generating data, etc. Instead I'd like to see if I can deduce the transforms myself without needing spark.
This boils down to keeping track of all schema changes. Each Processor is just: (a) consuming records based on the input schema - it's a valid use of the processor if the input schema has the necessary columns that the processor uses (b) generating records with the output schema, which is then fed into the next Processor as an input.
What I've tried: I've tried mapping columns back to the spark expressions, and traversing the catalyst expression-tree. However, this doesn't work well for .withColumn type of additions. I've also got no clue if I'm working with the catalyst expression tree correctly, and how I would track (a) what input fields a transformation uses (I figure I can leverage expr.dataType to figure out what's generated).
Any suggestions?
I have several different queries I need to perform on several different parquet files using Spark. Each of the queries is different, and has its own function which applies it. For example:
def query1(sqtx: sqlContext): DataFrame = {
sqtx.sql("select clients as people, reputation from table1") }
def query2(sqtx: sqlContext): DataFrame = {
sqtx.sql("select passengers as people, reputation from table2") }
and so on. As you can see, while all the queries are different, the schema of all the outputs is identical.
After querying, I want to use unionAll on all the successful outputs. And here comes my question - how? Using ParSeq.map is not possible here, since the mapping will be different for every query, and using Future doesn't really seems to fit in this case (I need to use onComplete on each one, see if it failed or not, etc.)
Any ideas how to do this simply?
I have a onequestion, during make spark app.
In Spark API, What is the difference between makeRDD functions and parallelize function?
There is no difference whatsoever. To quote makeRDD doctring:
This method is identical to parallelize.
and if you take a look at the implementation it simply calls parallelize:
def makeRDD[T: ClassTag](
seq: Seq[T],
numSlices: Int = defaultParallelism): RDD[T] = withScope {
parallelize(seq, numSlices)
}
At the end of the day it is a matter of taste. One thing to consider is that makeRDD seems to be specific to Scala API. PySpark and internal SparkR API provide only parallelize.
Note: There is a second implementation of makeRDD which allows you to set location preferences, but given a different signature it is not interchangeable with parallelize.
As noted by #zero323, makeRDD has 2 implementations. One is identical to parallelize. The other is a very useful way to inject data locality into your Spark application even if you are not using HDFS.
For example, it provides data locality when your data is already distributed on disk across your Spark cluster according to some business logic. Assume your goal is to create an RDD that will load data from disk and transform it with a function, and you would like to do so while running local to the data as much as possible.
To do this, you can use makeRDD to create an empty RDD with different location preferences assigned to each of your RDD partitions. Each partition can be responsible for loading your data. As long as you fill the partitions with the path to your partition-local data, then execution of subsequent transformations will be node-local.
Seq<Tuple2<Integer, Seq<String>>> rddElemSeq =
JavaConversions.asScalaBuffer(rddElemList).toSeq();
RDD<Integer> rdd = sparkContext.makeRDD(rddElemSeq, ct);
JavaRDD<Integer> javaRDD = JavaRDD.fromRDD(rdd, ct);
JavaRDD<List<String>> keyRdd = javaRDD.map(myFunction);
JavaRDD<myData> myDataRdd = keyRdd.map(loadMyData);
In this snippet, rddElemSeq contains the location preferences for each partition (an IP address). Each partition also has an Integer which acts like a key. My function myFunction consumes that key and can be used to generate a list of paths to my data local to that partition. Then that data can be loaded in the next line.
What if, when I traverse RDD, I need to calculate values in dataset by calling external (blocking) service? How do you think that could be achieved?
val values: Future[RDD[Double]] = Future sequence tasks
I've tried to create a list of Futures, but as RDD id not Traversable, Future.sequence is not suitable.
I just wonder, if anyone had such a problem, and how did you solve it?
What I'm trying to achieve is to get a parallelism on a single worker node, so I can call that external service 3000 times per second.
Probably, there is another solution, more suitable for spark, like having multiple working nodes on single host.
It's interesting to know, how do you cope with such a challenge? Thanks.
Here is answer to my own question:
val buckets = sc.textFile(logFile, 100)
val tasks: RDD[Future[Object]] = buckets map { item =>
future {
// call native code
}
}
val values = tasks.mapPartitions[Object] { f: Iterator[Future[Object]] =>
val searchFuture: Future[Iterator[Object]] = Future sequence f
Await result (searchFuture, JOB_TIMEOUT)
}
The idea here is, that we get the collection of partitions, where each partition is sent to the specific worker and is the smallest piece of work. Each that piece of work contains data, that could be processed by calling native code and sending that data.
'values' collection contains the data, that is returned from the native code and that work is done across the cluster.
Based on your answer, that the blocking call is to compare provided input with each individual item in the RDD, I would strongly consider rewriting the comparison in java/scala so that it can be run as part of your spark process. If the comparison is a "pure" function (no side effects, depends only on its inputs), it should be straightforward to re-implement, and the decrease in complexity and increase in stability in your spark process due to not having to make remote calls will probably make it worth it.
It seems unlikely that your remote service will be able to handle 3000 calls per second, so a local in-process version would be preferable.
If that is absolutely impossible for some reason, then you might be able to create a RDD transformation which turns your data into a RDD of futures, in pseudo-code:
val callRemote(data:Data):Future[Double] = ...
val inputData:RDD[Data] = ...
val transformed:RDD[Future[Double]] = inputData.map(callRemote)
And then carry on from there, computing on your Future[Double] objects.
If you know how much parallelism your remote process can handle, it might be best to abandon the Future mode and accept that it is a bottleneck resource.
val remoteParallelism:Int = 100 // some constant
val callRemoteBlocking(data:Data):Double = ...
val inputData:RDD[Data] = ...
val transformed:RDD[Double] = inputData.
coalesce(remoteParallelism).
map(callRemoteBlocking)
Your job will probably take quite some time, but it shouldn't flood your remote service and die horribly.
A final option is that if the inputs are reasonably predictable and the range of outcomes is consistent and limited to some reasonable number of outputs (millions or so), you could precompute them all as a data set using your remote service and find them at spark job time using a join.
I am using Spark with scala. I wanted to know if having single one line command better than separate commands? What are the benefits if any? Does it gain more efficiency in terms of speed? Why?
for e.g.
var d = data.filter(_(1)==user).map(f => (f(2),f(5).toInt)).groupByKey().map(f=> (f._1,f._2.count(x=>true), f._2.sum))
against
var a = data.filter(_(1)==user)
var b = a.map(f => (f(2),f(5).toInt))
var c = b.groupByKey()
var d = c.map(f=> (f._1,f._2.count(x=>true), f._2.sum))
There is no performance difference between your two examples; the decision to chain RDD transformations or to explicitly represent the intermediate RDDs is just a matter of style. Spark's lazy evaluation means that no actual distributed computation will be performed until you invoke an RDD action like take() or count().
During execution, Spark will pipeline as many transformations as possible. For your example, Spark won't materialize the entire filtered dataset before it maps it: the filter() and map() transformations will be pipelined together and executed in a single stage. The groupByKey() transformation (usually) needs to shuffle data over the network, so it's executed in a separate stage. Spark would materialize the output of filter() only if it had been cache()d.
You might need to use the second style if you want to cache an intermediate RDD and perform further processing on it. For example, if I wanted to perform multiple actions on the output of the groupByKey() transformation, I would write something like
val grouped = data.filter(_(1)==user)
.map(f => (f(2),f(5).toInt))
.groupByKey()
.cache()
val mapped = grouped.map(f=> (f._1,f._2.count(x=>true), f._2.sum))
val counted = grouped.count()
There is no difference in terms of execution, but you might want to consider the readability of your code. I would go with your first example but like this:
var d = data.filter(_(1)==user)
.map(f => (f(2),f(5).toInt))
.groupByKey()
.map(f=> (f._1,f._2.count(x=>true), f._2.sum))
Really this is more of a Scala question than Spark though. Still, as you can see from Spark's implementation of word count as shown in their documentation
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
you don't need to worry about those kinds of things. The Scala language (through laziness, etc.) and Spark's RDD implementation handles all that at a higher level of abstraction.
If you find really bad performance, then you should take the time to explore why. As Knuth said, "premature optimization is the root of all evil."