I'm working on a code-base that groups together chunks of logic into classes. For example, there might be a class that calculates the mean of a column after filtering, and another that aggregates columns to calculate the unique_count of certain fields. Sample code below:
// note: this interface is mutable and we can add things to this interface as needed
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, lit}

trait Processor {
  def dosomething(input: DataFrame): DataFrame
}

class NormScore extends Processor {
  def dosomething(input: DataFrame): DataFrame = {
    input.withColumn("normalized_score", col("score") - lit(1))
  }
}

class Distribution extends Processor {
  def dosomething(input: DataFrame): DataFrame = {
    input.groupBy(col("score"))
      .agg(count(lit(1)).as("count"))
  }
}
At the end, we chain these together to form a pipeline of transformations:
val pipeline = Seq(new Distribution, new NormScore)
We then apply these transformations sequentially to a given input.
Problem: We know the schema of the input, and I would like to support a simple def validate(pipeline: Seq[Processor]) method that checks whether a pipeline can work at all. For example, if you're aggregating on a field that doesn't exist, Spark will complain and throw an error. However, since we know all the types involved, it feels like we should be able to figure this out before Spark even starts. Today we catch this in a unit test where we run through with some mock data, but that can take > 1m depending on the complexity of the transform, and it requires generating data, etc. Instead I'd like to see if I can deduce the effect of the transforms myself without needing Spark.
This boils down to keeping track of all schema changes. Each Processor is just: (a) consuming records based on the input schema - it's a valid use of the processor if the input schema has the necessary columns that the processor uses; and (b) generating records with the output schema, which is then fed into the next Processor as its input.
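To make that concrete, here's a rough sketch of what I'm imagining (untested; the SchemaAware trait, requiredColumns/outputSchema members, and the NormScoreWithSchema class are hypothetical additions, not something that exists in the codebase today): each Processor declares the columns it reads and how it rewrites a schema, and validate is just a fold over the pipeline.

import org.apache.spark.sql.types._

// Hypothetical additions to the interface: each processor declares the
// columns it needs and how it transforms the schema it receives.
trait SchemaAware { self: Processor =>
  def requiredColumns: Seq[String]
  def outputSchema(input: StructType): StructType
}

// e.g. NormScore needs "score" and adds "normalized_score" of the same type.
class NormScoreWithSchema extends NormScore with SchemaAware {
  val requiredColumns = Seq("score")
  def outputSchema(input: StructType): StructType =
    input.add(StructField("normalized_score", input("score").dataType))
}

// Validation is then a fold over the pipeline: check (a), then apply (b).
def validate(inputSchema: StructType,
             pipeline: Seq[Processor with SchemaAware]): Either[String, StructType] =
  pipeline.foldLeft[Either[String, StructType]](Right(inputSchema)) {
    case (Right(schema), p) =>
      val missing = p.requiredColumns.filterNot(schema.fieldNames.contains(_))
      if (missing.isEmpty) Right(p.outputSchema(schema))
      else Left(s"${p.getClass.getSimpleName} is missing columns: ${missing.mkString(", ")}")
    case (err, _) => err
  }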
What I've tried: I've tried mapping columns back to Spark expressions and traversing the Catalyst expression tree. However, this doesn't work well for .withColumn-style additions. I also have no clue whether I'm working with the Catalyst expression tree correctly, or how I would track (a) which input fields a transformation uses (for (b), I figure I can leverage expr.dataType to figure out what's generated).
Any suggestions?
Related
I am using Spark 2.4.1 and Java 8.
I have a scenario like this:
A list of classifiers to process will be provided from a property file.
These classifiers determine what data to pull and process.
Something like the below:
val classifiers = Seq("classifierOne", "classifierTwo", "classifierThree")
for (classifier <- classifiers) {
  // read from the Cassandra DB table
  val actualData = spark.read(.....).where(<classifier condition>)
  // the data varies depending on the classifier passed in
  // this data has many fields along with fieldOne, fieldTwo and fieldThree
Depending on the classifier, I need to filter the data.
Currently I am doing it as below:
  if (classifier == "classifierOne") {
    val classifierOneDs = actualData.filter(col("classifierOne").notEqual(lit("")).or(col("classifierOne").isNotNull))
    writeToParquet(classifierOneDs)
  } else if (classifier == "classifierTwo") {
    val classifierTwoDs = actualData.filter(col("classifierTwo").notEqual(lit("")).or(col("classifierTwo").isNotNull))
    writeToParquet(classifierTwoDs)
  } else if (classifier == "classifierThree") {
    val classifierThreeDs = actualData.filter(col("classifierThree").notEqual(lit("")).or(col("classifierThree").isNotNull))
    writeToParquet(classifierThreeDs)
  }
}
Is there any way to avoid the if-else block here?
Is there any other way to do/achieve the same thing in a distributed way in Spark?
Your question seems more about how to structure your application than Spark itself. There are two parts really.
Is there any way to avoid the if-else block here?
"Avoid"? In what sense? Spark can't magically "discover" your way of doing distributed processing. You should help Spark a bit.
For this case I'd propose a lookup table with all possible filter conditions and their names to look up by, e.g.
val classifiers = Map(
  "classifierOne" -> col("classifierOne").notEqual(lit("")).or(col("classifierOne").isNotNull),
  "classifierTwo" -> ...,
  "classifierThree" -> ...)
In order to use it you simply iterate over all the classifiers (or look up as many as needed), e.g.
val queries = classifiers.map { case (name, cond) =>
  spark
    .read(.....)
    .where(cond)  // cond is already col(name).notEqual(lit("")).or(col(name).isNotNull)
}
queries is a collection of DataFrames to be written with writeToParquet, and it's up to you how to execute the queries in parallel (Spark will take care of running each of them in a distributed way). Use Scala Futures or another parallel API.
I think the following could make it just fine:
queries.par.foreach(writeToParquet)
With queries.par.foreach you essentially execute all writeToParquet calls in parallel. Since writeToParquet executes a DataFrame action that writes in Parquet format, it follows all the rules Spark applies to any other action. It will run a Spark job (perhaps even more than one), and the job is executed in a distributed fashion using the Spark machinery.
Think of queries.par as a way to submit the queries one after another without waiting for an earlier query to finish before starting the next one. You are strongly recommended to configure the FAIR scheduling mode:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads.
Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources.
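For reference, a minimal sketch of enabling FAIR scheduling (the scheduler mode has to be set before the SparkContext is created; the app name is just an example):

import org.apache.spark.sql.SparkSession

// spark.scheduler.mode defaults to FIFO; FAIR shares cluster resources
// between the concurrently submitted write jobs.
val spark = SparkSession.builder()
  .appName("classifier-writes")              // hypothetical name
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()

// queries is the collection built from the lookup table above;
// each writeToParquet becomes its own Spark job.
queries.par.foreach(writeToParquet)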
So you need to select which column to check based on the classifier name, which will be passed in as a list?
val classifiers = Seq("classifierOne", "classifierTwo", "classifierThree")
for (classifier <- classifiers) {
  val actualData = spark.read(.....).where(<classifier condition>)
  val classifierDs = actualData.filter(col(classifier).notEqual(lit("")).or(col(classifier).isNotNull))
  writeToParquet(classifierDs)
}
As you're iterating through the list, you would be going through all the classifiers anyway.
If the column name can be different from the actual classifier name, you can make it a List[Classifier], where Classifier is something like:
case class Classifier(colName: String, classifierName: String)
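Used with the same loop, that could look roughly like this (a sketch; readForClassifier stands in for the Cassandra read with its classifier-specific where clause, and the column names are made up):

import org.apache.spark.sql.functions.{col, lit}

// Hypothetical column / classifier name pairs.
val classifiers = List(
  Classifier("colOne", "classifierOne"),
  Classifier("colTwo", "classifierTwo"),
  Classifier("colThree", "classifierThree"))

for (c <- classifiers) {
  val actualData = readForClassifier(c.classifierName)  // placeholder for spark.read(...).where(...)
  val classifierDs = actualData.filter(col(c.colName).notEqual(lit("")).or(col(c.colName).isNotNull))
  writeToParquet(classifierDs)
}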
A good question for Spark experts.
I am processing data in a map operation (RDD). Within the mapper function, I need to look up objects of class A to be used in processing the elements of an RDD.
Since this will be performed on executors AND creation of elements of type A (that will be looked up) happens to be an expensive operation, I want to pre-load and cache these objects on each executor. What is the best way of doing it?
One idea is to broadcast a lookup table, but class A is not serializable (no control over its implementation).
Another idea is to load them up in a singleton object. However, I want to control what gets loaded into that lookup table (e.g. possibly different data on different Spark jobs).
Ideally, I want to specify what will be loaded on executors once (including the case of Streaming, so that the lookup table stays in memory between batches), through a parameter that will be available on the driver during its start-up, before any data gets processed.
Is there a clean and elegant way of doing it or is it impossible to achieve?
This is exactly the targeted use case for broadcast. Broadcast variables are transmitted once, use a torrent-like mechanism to move efficiently to all executors, and stay in memory / on local disk until you no longer need them.
Serialization often pops up as an issue when using others' interfaces. If you can enforce that the objects you consume are serializable, that's going to be the best solution. If this is impossible, your life gets a little more complicated. If you can't serialize the A objects, then you have to create them on the executors for each task. If they're stored in a file somewhere, this would look something like:
rdd.mapPartitions { it =>
  val lookupTable = loadLookupTable(path)
  it.map(elem => fn(lookupTable, elem))
}
Note that if you're using this model, then you have to load the lookup table once per task -- you can't benefit from the cross-task persistence of broadcast variables.
EDIT: Here's another model, which I believe lets you share the lookup table across tasks per JVM.
class BroadcastableLookupTable extends Serializable {
  @transient private var lookupTable: LookupTable[A] = _

  def get: LookupTable[A] = synchronized {
    if (lookupTable == null)
      lookupTable = < load lookup table from disk >
    lookupTable
  }
}
This class can be broadcast (nothing substantial is transmitted) and the first time it's called per JVM, you'll load the lookup table and return it.
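Usage would then look roughly like this (a sketch; sc, rdd and fn are the same placeholders as in the earlier snippet):

// Broadcasting the wrapper ships almost nothing; the table itself is loaded
// lazily, once per executor JVM, on first access.
val lookupBc = sc.broadcast(new BroadcastableLookupTable)

rdd.map { elem =>
  fn(lookupBc.value.get, elem)
}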
In case serialisation turns out to be impossible, how about storing the lookup objects in a database? It's not the easiest solution, granted, but it should work just fine. I could recommend checking e.g. spark-redis, but I am sure there are better solutions out there.
Since A is not serializable, the easiest solution is to create your own serializable type A1 with all the data from A required for the computation, then use the new lookup table in a broadcast.
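A rough sketch of that idea, assuming A exposes the two fields the computation actually needs (the field names and accessors here are made up, as are originalAs, rdd and fn):

// Serializable projection of A onto just the fields needed downstream.
case class A1(id: String, weight: Double)

// Built on the driver from the original (non-serializable) A objects...
val lookupTable: Map[String, A1] =
  originalAs.map(a => a.getId -> A1(a.getId, a.getWeight)).toMap  // hypothetical accessors

// ...and broadcast as usual.
val lookupBc = sc.broadcast(lookupTable)
val result = rdd.map(elem => fn(lookupBc.value, elem))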
I have several different queries I need to perform on several different parquet files using Spark. Each of the queries is different, and has its own function which applies it. For example:
def query1(sqtx: SQLContext): DataFrame = {
  sqtx.sql("select clients as people, reputation from table1")
}

def query2(sqtx: SQLContext): DataFrame = {
  sqtx.sql("select passengers as people, reputation from table2")
}
and so on. As you can see, while all the queries are different, the schema of all the outputs is identical.
After querying, I want to use unionAll on all the successful outputs. And here comes my question - how? Using ParSeq.map is not possible here, since the mapping will be different for every query, and using Future doesn't really seem to fit in this case (I need to use onComplete on each one, see if it failed or not, etc.)
Any ideas how to do this simply?
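For concreteness, the end state I have in mind is something like the sketch below (untested; Try and unionAll are just the pieces I think fit), but I'm not sure it's the simplest way:

import scala.util.{Try, Success}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Keep the queries as functions so they can live in one collection
// even though each one runs different SQL.
val queries: Seq[SQLContext => DataFrame] = Seq(query1 _, query2 _)

// Run each query, keep only the successful ones, and union the results.
val unioned: Option[DataFrame] =
  queries
    .map(q => Try(q(sqtx)))                 // sqtx is the SQLContext
    .collect { case Success(df) => df }
    .reduceOption(_ unionAll _)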
I have one question about making a Spark app.
In the Spark API, what is the difference between the makeRDD function and the parallelize function?
There is no difference whatsoever. To quote the makeRDD docstring:
This method is identical to parallelize.
and if you take a look at the implementation it simply calls parallelize:
def makeRDD[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  parallelize(seq, numSlices)
}
At the end of the day it is a matter of taste. One thing to consider is that makeRDD seems to be specific to Scala API. PySpark and internal SparkR API provide only parallelize.
Note: There is a second implementation of makeRDD which allows you to set location preferences, but given a different signature it is not interchangeable with parallelize.
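For completeness, a small sketch of that second overload in use (sc is a SparkContext; the hostnames are made up):

import org.apache.spark.rdd.RDD

// Each element is paired with the hosts preferred for the partition that will
// hold it; this variant creates one partition per element.
val withPrefs: Seq[(Int, Seq[String])] = Seq(
  (1, Seq("host-a")),
  (2, Seq("host-b")),
  (3, Seq("host-a", "host-b")))

val rdd: RDD[Int] = sc.makeRDD(withPrefs)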
As noted by #zero323, makeRDD has 2 implementations. One is identical to parallelize. The other is a very useful way to inject data locality into your Spark application even if you are not using HDFS.
For example, it provides data locality when your data is already distributed on disk across your Spark cluster according to some business logic. Assume your goal is to create an RDD that will load data from disk and transform it with a function, and you would like to do so while running local to the data as much as possible.
To do this, you can use makeRDD to create an empty RDD with different location preferences assigned to each of your RDD partitions. Each partition can be responsible for loading your data. As long as you fill the partitions with the path to your partition-local data, then execution of subsequent transformations will be node-local.
Seq<Tuple2<Integer, Seq<String>>> rddElemSeq =
JavaConversions.asScalaBuffer(rddElemList).toSeq();
RDD<Integer> rdd = sparkContext.makeRDD(rddElemSeq, ct);
JavaRDD<Integer> javaRDD = JavaRDD.fromRDD(rdd, ct);
JavaRDD<List<String>> keyRdd = javaRDD.map(myFunction);
JavaRDD<myData> myDataRdd = keyRdd.map(loadMyData);
In this snippet, rddElemSeq contains the location preferences for each partition (an IP address). Each partition also has an Integer which acts like a key. My function myFunction consumes that key and can be used to generate a list of paths to my data local to that partition. Then that data can be loaded in the next line.
In my application, when taking performance numbers, groupBy is eating up a lot of time.
My RDD has the below structure:
JavaPairRDD<CustomTuple, Map<String, Double>>
CustomTuple:
This object contains information about the current row in the RDD, like which week, month, city, etc.
public class CustomTuple implements Serializable {
  private Map hierarchyMap = null;
  private Map granularMap = null;
  private String timePeriod = null;
  private String sourceKey = null;
}
Map:
This map contains the statistical data about that row, like how much investment, how many GRPs, etc.
<"Inv", 20>
<"GRP", 30>
I was executing the below DAG on this RDD:
1. apply a filter on this RDD and scope out relevant rows: Filter
2. join the RDDs: Join
3. apply a map phase to compute investment: Map
4. apply a GroupBy phase to group the data according to the desired view: GroupBy
5. apply a map phase to aggregate the data as per the grouping achieved in the above step (say, view data across timeperiod) and also create new objects based on the result set desired to be collected: Map
6. collect the result: Collect
So if the user wants to view investment across time periods, then the below list is returned (this was achieved in step 4 above):
<timeperiod1, value>
When I checked time taken in operations, GroupBy was taking 90% of the time taken in executing the whole DAG.
IMO, we can replace the GroupBy and subsequent Map operations with a single reduce.
But reduce will work on objects of the pair RDD's element type, i.e. Tuple2<CustomTuple, Map<String, Double>>.
So my reduce will look like Function2<T, T, T>, where T will be Tuple2<CustomTuple, Map<String, Double>>.
Or maybe after step 3 in the above DAG I run another map function that returns me an RDD typed for the metric that needs to be aggregated, and then run a reduce.
Also, I am not sure how the aggregate function works and whether it will be able to help me in this case.
Secondly, my application will receive requests on varying keys. In my current RDD design, each request would require me to repartition or re-group my RDD on this key. This means that for each request, grouping/re-partitioning would take 95% of my time to compute the job.
<"market1", 20>
<"market2", 30>
This is very discouraging as the current performance of application without Spark is 10 times better than performance with Spark.
Any insight is appreciated.
[EDIT] We also noticed that JOIN was taking a lot of time. Maybe that's why groupBy was taking so long. [EDIT]
TIA!
Spark's documentation encourages you to avoid groupBy operations; instead it suggests combineByKey or one of its derived operations (reduceByKey or aggregateByKey). These operations aggregate both before and after the shuffle (in the map and in the reduce phase, to use Hadoop terminology), so your execution times will improve (I don't know if it will be 10 times better, but it has to be better).
If I understand your processing correctly, I think you can use a single combineByKey operation. The following explanation is written for Scala code, but you can translate it to Java without too much effort.
combineByKey has three arguments:
combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
createCombiner: In this operation you create a new class in order to combine your data, so you could aggregate your CustomTuple data into a new class CustomTupleCombiner (I don't know if you only want to make a sum, or maybe you want to apply some other process to this data, but either option can be done in this operation).
mergeValue: In this operation you describe how a CustomTuple is summed into a CustomTupleCombiner (again, I am presupposing a simple summarize operation). For example, if you want to sum the data by key, your CustomTupleCombiner class will hold a Map, so the operation should be something like CustomTupleCombiner.sum(CustomTuple), which sets CustomTupleCombiner.Map(CustomTuple.key) -> CustomTuple.Map(CustomTuple.key) + CustomTupleCombiner.value.
mergeCombiners: In this operation you define how to merge two combiner classes, CustomTupleCombiner in my example. So this will be something like CustomTupleCombiner1.merge(CustomTupleCombiner2), which will be something like CustomTupleCombiner1.Map.keys.foreach(k -> CustomTupleCombiner1.Map(k) + CustomTupleCombiner2.Map(k)), or something like that.
The operations sketched above are untested (they will not even compile since I wrote them in vim), but I think the approach might work for your scenario.
I hope this will be useful.
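For reference, a rough Scala sketch of that shape for this RDD (assuming the values are Map[String, Double] and the aggregation is a plain per-metric sum; pairRdd stands for the Scala view of the JavaPairRDD from the question):

import scala.collection.mutable

// pairRdd: RDD[(CustomTuple, Map[String, Double])]
val aggregated =
  pairRdd.combineByKey(
    // createCombiner: start a combiner from the first value seen for a key
    (v: Map[String, Double]) => mutable.Map(v.toSeq: _*),
    // mergeValue: fold another value into an existing combiner (per-metric sum)
    (comb: mutable.Map[String, Double], v: Map[String, Double]) => {
      v.foreach { case (metric, amount) => comb(metric) = comb.getOrElse(metric, 0.0) + amount }
      comb
    },
    // mergeCombiners: merge two partial combiners produced on different partitions
    (c1: mutable.Map[String, Double], c2: mutable.Map[String, Double]) => {
      c2.foreach { case (metric, amount) => c1(metric) = c1.getOrElse(metric, 0.0) + amount }
      c1
    })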
Shuffling is triggered by any change in the key of a [K,V] pair, or by a repartition() call. The partitioning is calculated based on the K (key) value. By default, partitioning is calculated using the hash value of your key, implemented by the hashCode() method. In your case your key contains two Map instance variables. The default implementation of the hashCode() method will have to calculate the hashCode() of those maps as well, causing an iteration over all their elements to in turn calculate the hashCode() of each of those elements.
The solutions are:
Do not include the Map instances in your Key. This seems highly unusual.
Implement and override your own hashCode() that avoids going through the Map instance variables (see the sketch after this list).
Possibly you can avoid using the Map objects completely. If it is something that is shared amongst multiple elements, you might need to consider using broadcast variables in spark. The overhead of serializing your Maps during shuffling might also be a big contributing factor.
Avoid any shuffling, by tuning your hashing between two consecutive group-by's.
Keep shuffling Node local, by choosing a Partitioner that will have an affinity of keeping partitions local during consecutive use.
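A Scala sketch of the second option (your class is Java, but the principle carries over: equals can still look at everything, while hashCode touches only the cheap scalar fields, so computing partition ids never walks the maps; the map value types here are made up):

case class CustomKey(timePeriod: String,
                     sourceKey: String,
                     hierarchyMap: Map[String, String],  // hypothetical value types
                     granularMap: Map[String, String]) {

  // Equal keys still produce equal hash codes; keys that differ only in their
  // maps merely collide into the same bucket, which is fine for shuffling.
  override def hashCode(): Int = 31 * timePeriod.## + sourceKey.##
}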
A good read on hashCode(), including a reference to quotes by Josh Bloch, can be found on Wikipedia.