To reduce shuffling during the joining of two RDDs, I decided to partition them using HashPartitioner first. Here is how I do it. Am I doing it correctly, or is there a better way to do this?
val rddA = ...
val rddB = ...
val numOfPartitions = rddA.getNumPartitions
val rddApartitioned = rddA.partitionBy(new HashPartitioner(numOfPartitions))
val rddBpartitioned = rddB.partitionBy(new HashPartitioner(numOfPartitions))
val rddAB = rddApartitioned.join(rddBpartitioned)
To reduce shuffling during the joining of two RDDs,
It is surprisingly common misconception that repartitoning reduces or even eliminates shuffles. It doesn't. Repartitioning is shuffle in its purest form. It doesn't save time, bandwidth or memory.
The rationale behind using proactive partitioner is different - it allows you to shuffle once, and reuse the state, to perform multiple by-key operations, without additional shuffles (though as far as I am aware, not necessarily without additional network traffic, as co-partitioning doesn't imply co-location, excluding cases where shuffles occurred in a single actions).
So your code is correct, but in a case where you join once it doesn't buy you anything.
Just one comment, better to append .persist() after .partitionBy if there are multiple actions for rddApartitioned and rddBpartitioned, otherwise, all the actions will evaluate the entire lineage of rddApartitioned and rddBpartitioned, which will cause the hash-partitioning takes place again and again.
val rddApartitioned = rddA.partitionBy(new HashPartitioner(numOfPartitions)).persist()
val rddBpartitioned = rddB.partitionBy(new HashPartitioner(numOfPartitions)).persist()
Related
I'm trying to perform a isin filter as optimized as possible. Is there a way to broadcast collList using Scala API?
Edit: I'm not looking for an alternative, I know them, but I need isin so my RelationProviders will pushdown the values.
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
//collList.size == 200.000
val retTable = df.filter(col("col1").isin(collList: _*))
The list i'm passing to the "isin" method has upto ~200.000 unique elements.
I know this doesn't look like the best option and a join sounds better, but I need those elements pushed down into the filters, makes a huge difference when reading (my storage is Kudu, but it also applies to HDFS+Parquet, base data is too big and queries work on around 1% of that data), I already measured everything, and it saved me around 30minutes execution time :). Plus my method already takes care if the isin is larger than 200.000.
My problem is, I'm getting some Spark "task are too big" (~8mb per task) warnings, everything works fine so not a big deal, but I'm looking to remove them and also optimize.
I've tried with, which does nothing as I still get the warning (since the broadcasted var gets resolved in Scala and passed to vargargs I guess):
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList).value: _*))
And this one which doesn't compile:
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList: _*).value))
And this one which doesn't work (task too big still appears)
val broadcastedList=df.sparkSession.sparkContext.broadcast(collList.map(lit(_).expr))
val filterBroadcasted=In(col("col1").expr, collList.value)
val retTable = df.filter(new Column(filterBroadcasted))
Any ideas on how to broadcast this variable? (hacks allowed). Any alternative to the isin which allows filter pushdown is also valid I've seen some people doing it on PySpark, but the API is not the same.
PS: Changes on the storage are not possible, I know partitioning (already partitioned, but not by that field) and such could help, but user inputs are totally random and the data is accessed and changed my many clients.
I'd opt for dataframe broad cast hash join in this case instead of broadcast variable.
Prepare a dataframe with your collectedDf("col1") collection list you want to filter with isin and then
use join between 2 dataframes to filter the rows matching.
I think it would be more efficient than isin since you have 200k entries to be filtered. spark.sql.autobroadcastjointhreshhold is the property you need to set with appropriate size(by default 10mb). AFAIK you can use till 200mb or 3oomb based on your requirements.
see this BHJ Explanation of how it works
Further reading Spark efficiently filtering entries from big dataframe that exist in a small dataframe
I'll just leave with big tasks since I only use it twice (but saves a lot of time) in my program and I can afford it, but if someone else needs it badly... well this seems to be the path.
Best alternatives I found to have big-arrays pushdown:
Change your relation provider so it broadcasts big-lists when pushing down In filters, this will probably leave some broadcasted trash, but well..., as long as your app is not streaming, it shouldn't be a problem, or you can save in a global list and clean those after a while
Add a filter in Spark (I wrote something at https://issues.apache.org/jira/browse/SPARK-31417 ) which allows broadcasted pushdown all the way to your relation provider. You would have to add your custom predicate, then implement your custom "Pushdown" (you can do this by adding a new rule) and then rewrite your RDD/Relation provider so it can exploit the fact the variable is broadcasted.
Use coalesce(X) after reading to decrease number of tasks, can work sometimes, depends on how the RelationProvider/RDD is implemented.
I have this program, that uses Apache Spark to calculate the frequency of words.
I create an RDD with the key/value pairs(word=key, frequency=value). The dataset is distributed over worker nodes. The function frequentWordCount is executed at regular intervals. It selects strings from the files.
which are then converted into key-value-pairs and connected to the wordDataset-RDD. The words with a frequency of >50, are counted.
I was told that this approach is not performant. Can somebody tell me why and how I could improve this?
val sc = new SparkContext(...)
var wordDataset:RDD[(String, Int)] = sc.sequenceFile[String, Int](“…”).persist()
def frequentWordCount(fileName:String):Long = {
val words = sc.sequenceFile[String](fileName)
val joined = wordDataset.join(words.map(x=>(x,1)))
joined.filter(x=>x._1._2>50).count
}
Approximately how many frequent words will you have? For a lot of reasonable tasks, I think it should be unexpectedly small - small enough to fit into each individual machine's memory. IIRC, words tend to obey a power law distribution, so there shouldn't be that many "common" words. In that case, broadcasting a set of frequent words could be much faster than joining:
val sc = new SparkContext(...)
var commonWords: BroadCast[Set[String]] = sc.broadcast(sc.sequenceFile[String, Int](“…”).filter(_._2 > 50).collect().toSet)
def frequentWordCount(fileName:String):Long = {
val words = sc.sequenceFile[String](fileName)
words.filter(commonWords.value.contains).count
}
If you are calling frequentWordCount multiple times, it probably is also better to do it in just one RDD operation where your words are associated with a filename and then grouped and counted or something... specifics depend on how it's used.
If the number of common words is small enough to fit into an in-memory Set, then what the other answer suggests (except, you need to map(_._1) there after filter.
Otherwise, the two things you could improve are (1) filter before join, you want to throw out extra data as soon as you can rather than unnecessarily scanning over it multiple times, and (2) as a general rule, you always want to join the larger dataset to the smaller one, not the other way around.
sc.sequenceFile[String](fileName)
.keyBy(identity)
.join(wordDataset.filter(_._2 > 50))
.count
I need something similar to the randomSplit function:
val Array(df1, df2) = myDataFrame.randomSplit(Array(0.6, 0.4))
However, I need to split myDataFrame based on a boolean condition. Does anything like the following exist?
val Array(df1, df2) = myDataFrame.booleanSplit(col("myColumn") > 100)
I'd like not to do two separate .filter calls.
Unfortunately the DataFrame API doesn't have such a method, to split by a condition you'll have to perform two separate filter transformations:
myDataFrame.cache() // recommended to prevent repeating the calculation
val condition = col("myColumn") > 100
val df1 = myDataFrame.filter(condition)
val df2 = myDataFrame.filter(not(condition))
I understand that caching and filtering twice looks a bit ugly, but please bear in mind that DataFrames are translated to RDDs, which are evaluated lazily, i.e. only when they are directly or indirectly used in an action.
If a method booleanSplit as suggested in the question existed, the result would be translated to two RDDs, each of which would be evaluated lazily. One of the two RDDs would be evaluated first and the other would be evaluated second, strictly after the first. At the point the first RDD is evaluated, the second RDD would not yet have "come into existence" (EDIT: Just noticed that there is a similar question for the RDD API with an answer that gives a similar reasoning)
To actually gain any performance benefit, the second RDD would have to be (partially) persisted during the iteration of the first RDD (or, actually, during the iteration of the parent RDD of both, which is triggered by the iteration of the first RDD). IMO this wouldn't align overly well with the design of the rest of the RDD API. Not sure if the performance gains would justify this.
I think the best you can achieve is to avoid writing two filter calls directly in your business code, by writing an implicit class with a method booleanSplit as a utility method does that part in a similar way as Tzach Zohar's answer, maybe using something along the lines of myDataFrame.withColumn("__condition_value", condition).cache() so the the value of the condition is not calculated twice.
In my spark application I would like to do operations on a dataframe in a loop and write the result to hdfs.
pseudocode:
var df = emptyDataframe
for n = 1 to 200000{
someDf=read(n)
df = df.mergeWith(somedf)
}
df.writetohdfs
In the above example I get good results when "mergeWith" does a unionAll.
However, when in "mergeWith" I do a (simple) join, the job gets really slow (>1h with 2 executors with 4 cores each) and never finishes (job aborts itself).
In my scenario I throw in ~50 iterations with files that just contain ~1mb of text data.
Because order of merges is important in my case, I'm suspecting this is due to the DAG generation, causing the whole thing to be run at the moment I store away the data.
Right now I'm attempting to use a .persist on the merged data frame but that also seems to go rather slowly.
EDIT:
As the job was running i noticed (even though I did a count and .persist) the dataframe in memory didn't look like a static dataframe.
It looked like a stringed together path to all the merges it had been doing, effectively slowing down the job linearly.
Am I right to assume the var df is the culprit of this?
breakdown of the issue as I see it:
dfA = empty
dfC = dfA.increment(dfB)
dfD = dfC.increment(dfN)....
When I would expect DF' A C and D are object, spark things differently and does not care if I persist or repartition or not.
to Spark it looks like this:
dfA = empty
dfC = dfA incremented with df B
dfD = ((dfA incremented with df B) incremented with dfN)....
Update2
To get rid of the persisting not working on DF's I could "break" the lineage when converting the DF to and RDD and back again.
This has a little bit of an overhead but an acceptable one (job finishes in minutes rather than hours/never)
I'll run some more tests on the persisting and formulate an answer in the form of a workaround.
Result:
This only seems to fix these issues on the surface. In reality I'm back at square one and get OOM exceptionsjava.lang.OutOfMemoryError: GC overhead limit exceeded
If you have code like this:
var df = sc.parallelize(Seq(1)).toDF()
for(i<- 1 to 200000) {
val df_add = sc.parallelize(Seq(i)).toDF()
df = df.unionAll(df_add)
}
Then df will have 400000 partitions afterwards, which makes the following actions inefficient (because you have 1 tasks for each partition).
Try to reduce the number of partitions to e.g. 200 before persisiting the dataframe (using e.g. df.coalesce(200).write.saveAsTable(....))
So the following is what I ended up using. It's performant enough for my usecase, it works and does not need persisting.
It is very much a workaround rather than a fix.
val mutableBufferArray = ArrayBuffer[DataFrame]()
mutableBufferArray.append(hiveContext.emptyDataframe())
for loop {
val interm = mergeDataFrame(df, mutableBufferArray.last)
val intermSchema = interm.schema
val intermRDD = interm.rdd.repartition(8)
mutableBufferArray.append(hiveContext.createDataFrame(intermRDD, intermSchema))
mutableBufferArray.remove(0)
}
This is how I wrestle tungsten into compliance.
By going from a DF to an RDD and back I end up with a real object rather than a whole tungsten generated process pipe from front to back.
In my code I iterate a few times before writing out to disk (50-150 iterations seem to work best). That's where I clear out the bufferArray again to start over fresh.
What if, when I traverse RDD, I need to calculate values in dataset by calling external (blocking) service? How do you think that could be achieved?
val values: Future[RDD[Double]] = Future sequence tasks
I've tried to create a list of Futures, but as RDD id not Traversable, Future.sequence is not suitable.
I just wonder, if anyone had such a problem, and how did you solve it?
What I'm trying to achieve is to get a parallelism on a single worker node, so I can call that external service 3000 times per second.
Probably, there is another solution, more suitable for spark, like having multiple working nodes on single host.
It's interesting to know, how do you cope with such a challenge? Thanks.
Here is answer to my own question:
val buckets = sc.textFile(logFile, 100)
val tasks: RDD[Future[Object]] = buckets map { item =>
future {
// call native code
}
}
val values = tasks.mapPartitions[Object] { f: Iterator[Future[Object]] =>
val searchFuture: Future[Iterator[Object]] = Future sequence f
Await result (searchFuture, JOB_TIMEOUT)
}
The idea here is, that we get the collection of partitions, where each partition is sent to the specific worker and is the smallest piece of work. Each that piece of work contains data, that could be processed by calling native code and sending that data.
'values' collection contains the data, that is returned from the native code and that work is done across the cluster.
Based on your answer, that the blocking call is to compare provided input with each individual item in the RDD, I would strongly consider rewriting the comparison in java/scala so that it can be run as part of your spark process. If the comparison is a "pure" function (no side effects, depends only on its inputs), it should be straightforward to re-implement, and the decrease in complexity and increase in stability in your spark process due to not having to make remote calls will probably make it worth it.
It seems unlikely that your remote service will be able to handle 3000 calls per second, so a local in-process version would be preferable.
If that is absolutely impossible for some reason, then you might be able to create a RDD transformation which turns your data into a RDD of futures, in pseudo-code:
val callRemote(data:Data):Future[Double] = ...
val inputData:RDD[Data] = ...
val transformed:RDD[Future[Double]] = inputData.map(callRemote)
And then carry on from there, computing on your Future[Double] objects.
If you know how much parallelism your remote process can handle, it might be best to abandon the Future mode and accept that it is a bottleneck resource.
val remoteParallelism:Int = 100 // some constant
val callRemoteBlocking(data:Data):Double = ...
val inputData:RDD[Data] = ...
val transformed:RDD[Double] = inputData.
coalesce(remoteParallelism).
map(callRemoteBlocking)
Your job will probably take quite some time, but it shouldn't flood your remote service and die horribly.
A final option is that if the inputs are reasonably predictable and the range of outcomes is consistent and limited to some reasonable number of outputs (millions or so), you could precompute them all as a data set using your remote service and find them at spark job time using a join.