Why is checkpoint() faster than persist() - scala

I have code that performs calculations on a DataFrame.
+------------------------------------+------------+----------+----+------+
| Name| Role|Experience|Born|Salary|
+------------------------------------+------------+----------+----+------+
| 瓮䇮滴ୗ┦附䬌┊ᇕ鈃디蠾综䛿ꩁ翨찘... | охранник| 16|1960|108111|
| 擲鱫뫉ܞ琱폤縭ᘵ௑훧귚۔᡺♧䋐滜컑... | повар| 14|1977| 40934|
| 㑶뇨⿟ꄳ壚ᗜ㙣޲샾ꎓ㌸翧쉟梒靻駌푤... | геодезист| 29|1997| 27335|
| ࣆ᠘䬆䨎⑁烸ᯠણ ᭯몇믊ຮ쭧닕㟣紕... | не охранн. | 4|1999 | 30000|
... ... ...
I tried to cache the table in different ways.
def processDataFrame(mode: String): Long = {
  val t0 = System.currentTimeMillis()
  val topDf = df.filter(col("Salary") > 50000)
  val cacheDf = mode match {
    case "CACHE" => topDf.cache()
    case "PERSIST" => topDf.persist()
    case "CHECKPOINT" => topDf.checkpoint()
    case "CHECKPOINT_NON_EAGER" => topDf.checkpoint(false)
    case _ => topDf
  }
  val roleList = cacheDf.groupBy("Role")
    .count()
    .orderBy("Role")
    .collect()
  val bornList = cacheDf.groupBy("Born")
    .count()
    .orderBy(col("Born").desc)
    .collect()
  val t1 = System.currentTimeMillis()
  t1 - t0 // elapsed time in milliseconds
}
I got results that made me think.
Why is checkpoint(false) more efficient than persist()?
After all, a checkpoint needs time to serialize objects and write them to disk.
P.S. My small project on GitHub: https://github.com/MinorityMeaning/CacheCheckpoint

I haven't checked your project but I think it's worth a minor discussion.
I would prefer that you clearly state that you did not run this code just once but averaged several runs before drawing conclusions about performance (not efficiency) on this specific dataset. Spark clusters can have a lot of noise that causes differences from job to job, so averaging several runs really is required to determine performance. Several factors affect performance (data locality across Spark executors, resource contention, etc.).
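As a rough illustration of averaging several runs, here is a minimal sketch in plain Scala. The names RunTimer, timeRuns, mean and stdDev are hypothetical helpers, not part of Spark, and the sketch assumes each run starts from the same state (for example, nothing left cached from a previous run).

```scala
object RunTimer {
  /** Runs `body` untimed `warmups` times, then times it `runs` times; returns each timed run in ms. */
  def timeRuns(warmups: Int, runs: Int)(body: => Unit): Seq[Double] = {
    (1 to warmups).foreach(_ => body)
    (1 to runs).map { _ =>
      val t0 = System.nanoTime()
      body
      (System.nanoTime() - t0) / 1e6
    }
  }

  /** Arithmetic mean of the measurements. */
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  /** Population standard deviation, as a rough indicator of run-to-run noise. */
  def stdDev(xs: Seq[Double]): Double = {
    val m = mean(xs)
    math.sqrt(xs.map(x => (x - m) * (x - m)).sum / xs.size)
  }
}
```

For example, one could wrap a call such as processDataFrame("PERSIST") in RunTimer.timeRuns(2, 10) { ... } for each mode and compare the means together with their spread, rather than comparing single measurements.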
I don't think you can say "efficient", as these functions actually do two different things. They will also perform differently under different circumstances because of what they do. There are times when you will want to checkpoint: to truncate data lineage, or after very computationally expensive operations. There are other times when keeping the lineage and recomputing is actually cheaper than writing to and reading from disk.
The easy rule is: if you are going to use this table/DataFrame/Dataset multiple times, cache it in memory (not on disk).
Once you hit an issue with a job that's not completing, think about what can be tuned from a code or query perspective.
After that...
If and only if this is related to a failure of a complex job and you see executors failing, consider disk to persist the data. This should always be a later step in troubleshooting and never a first step in troubleshooting.

Related

How to process millions of small JSON files quickly using Scala Spark?

I have to process millions of JSON files from Azure Blob Storage, each representing one row, and need to load them into Azure SQL DB with some minimal transformation in between. These files come in at random times but follow the same schema.
My first solution basically just created a DataFrame for each file and pushed it into SQL. This worked when we were receiving hundreds of files, but now that we are receiving millions of files it is not scaling and takes over a day to process.
We also tried processing the files in Scala without Spark (see code below) but this is also too slow; 500 files processed in 8 minutes.
var sql_statement = ""
allFiles.par.map(file_name => {
  // processing
  val json = scala.io.Source.fromFile(file_name).mkString
  val mapData1 = mapper.readValue(json, classOf[Map[String, Any]])
  val account = mapData1("Contact").asInstanceOf[Map[String, Any]]
  val common = account.keys.toList.intersect(srcDestMap.keys.toList)
  val trMap = common.map(rec => Map(srcDestMap(rec) -> account(rec))).flatten.toMap
  val vals = trMap.keys.toList.sorted.map(trMap(_).toString.replace("'", "''")).map("'" + _ + "'")
  // end processing
  val cols = "insert into dbo.Contact_VS(" + trMap.keys.toList.sorted.mkString(",") + ")" + " values (" + vals.mkString(",") + ")"
  sql_statement = sql_statement + cols
})
val updated = statement.executeUpdate(sql_statement)
connection.close()
If anyone knows how to optimize this code, or any out-of-the-box thinking we could use to preprocess our JSON it would be greatly appreciated! The JSON is nested so it's a little more involved to merge everything into one large JSON to be read into Spark but we may have to go that way if no one has any better ideas.
You are close. Spark contains helper functions to parallelize tasks across the cluster. Note that you will want to set "spark.default.parallelism" to a sane number so that you're not creating too many connections to your DB.
def loadFileAndUploadToRDS(filepath: String): Unit = ???
// Test
def parallelUpload(): Unit = {
  val files = List("s3://bucket/path" /** more files **/)
  spark.sparkContext.parallelize(files).foreach(filepath => loadFileAndUploadToRDS(filepath))
}
Since you already got an answer, let me point out some problems with the raw Scala implementation:
1) Creating SQL requests manually is error-prone and inefficient.
2) Updating sql_statement in a loop is very inefficient.
3) The level of parallelism of allFiles.par: .par shouldn't be used for blocking tasks, for two reasons:
a) It uses a global shared thread pool under the hood, so one batch of tasks will block other tasks.
b) Its parallelism level is optimized for CPU-bound tasks (the number of CPU threads). You want much higher parallelism.
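To make points 2 and 3 concrete, here is a minimal sketch that runs the per-file work on a dedicated fixed-size thread pool and collects the results instead of appending to a shared variable. BlockingUpload, buildRow and the pool size are hypothetical; buildRow is only a stand-in for the real JSON parsing.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BlockingUpload {
  // Hypothetical stand-in for "read one JSON file and build its row".
  def buildRow(fileName: String): String = s"row-for-$fileName"

  /** Processes all files on a dedicated pool and returns the rows in input order. */
  def processAll(files: List[String], poolSize: Int): List[String] = {
    // A separate pool keeps blocking work off the global shared pool.
    val pool = Executors.newFixedThreadPool(poolSize)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // Each file becomes one Future; results are gathered without shared mutable state.
      val all = Future.traverse(files)(f => Future(buildRow(f)))
      Await.result(all, 10.minutes)
    } finally pool.shutdown()
  }
}
```

For blocking I/O the pool size can be set well above the number of CPU cores. For point 1, the collected rows could then be written through a JDBC PreparedStatement with addBatch and executeBatch rather than one concatenated SQL string.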

Improve performance for function which counts common words

I have this program that uses Apache Spark to calculate the frequency of words.
I create an RDD of key/value pairs (word = key, frequency = value). The dataset is distributed over the worker nodes. The function frequentWordCount is executed at regular intervals. It selects strings from the files, which are then converted into key-value pairs and joined with the wordDataset RDD. The words with a frequency of more than 50 are counted.
I was told that this approach is not performant. Can somebody tell me why and how I could improve this?
val sc = new SparkContext(...)
var wordDataset: RDD[(String, Int)] = sc.sequenceFile[String, Int]("…").persist()
def frequentWordCount(fileName: String): Long = {
  val words = sc.sequenceFile[String](fileName)
  val joined = wordDataset.join(words.map(x => (x, 1)))
  // each joined element is (word, (frequency, 1))
  joined.filter(x => x._2._1 > 50).count
}
Approximately how many frequent words will you have? For a lot of reasonable tasks, I think it should be unexpectedly small - small enough to fit into each individual machine's memory. IIRC, words tend to obey a power law distribution, so there shouldn't be that many "common" words. In that case, broadcasting a set of frequent words could be much faster than joining:
val sc = new SparkContext(...)
var commonWords: Broadcast[Set[String]] = sc.broadcast(sc.sequenceFile[String, Int]("…").filter(_._2 > 50).collect().toSet)
def frequentWordCount(fileName: String): Long = {
  val words = sc.sequenceFile[String](fileName)
  words.filter(commonWords.value.contains).count
}
If you are calling frequentWordCount multiple times, it probably is also better to do it in just one RDD operation where your words are associated with a filename and then grouped and counted or something... specifics depend on how it's used.
If the number of common words is small enough to fit into an in-memory Set, then do what the other answer suggests (except that you need to add map(_._1) after the filter so that the set contains only the words).
Otherwise, the two things you could improve are (1) filter before join, you want to throw out extra data as soon as you can rather than unnecessarily scanning over it multiple times, and (2) as a general rule, you always want to join the larger dataset to the smaller one, not the other way around.
sc.sequenceFile[String](fileName)
  .keyBy(identity)
  .join(wordDataset.filter(_._2 > 50))
  .count

How to properly apply HashPartitioner before a join in Spark?

To reduce shuffling during the joining of two RDDs, I decided to partition them using HashPartitioner first. Here is how I do it. Am I doing it correctly, or is there a better way to do this?
val rddA = ...
val rddB = ...
val numOfPartitions = rddA.getNumPartitions
val rddApartitioned = rddA.partitionBy(new HashPartitioner(numOfPartitions))
val rddBpartitioned = rddB.partitionBy(new HashPartitioner(numOfPartitions))
val rddAB = rddApartitioned.join(rddBpartitioned)
"To reduce shuffling during the joining of two RDDs, ..."
It is a surprisingly common misconception that repartitioning reduces or even eliminates shuffles. It doesn't. Repartitioning is a shuffle in its purest form. It doesn't save time, bandwidth or memory.
The rationale behind applying a partitioner proactively is different: it allows you to shuffle once and reuse that state to perform multiple by-key operations without additional shuffles (though, as far as I am aware, not necessarily without additional network traffic, as co-partitioning doesn't imply co-location, except in cases where the shuffles occurred in a single action).
So your code is correct, but in a case where you join once it doesn't buy you anything.
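To see why sharing one partitioner helps repeated by-key operations, here is a small plain-Scala model of hash partitioning. It is only a sketch modeled on the idea behind HashPartitioner (key hash code modulo the number of partitions, kept non-negative); HashPartitionModel and its methods are hypothetical names and no Spark is involved.

```scala
object HashPartitionModel {
  // Partition index for a key: hashCode modulo numPartitions, shifted to be non-negative.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  /** Groups records by the partition index their key maps to. */
  def partitionBy[K, V](records: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    records.groupBy { case (k, _) => partitionOf(k, numPartitions) }
}
```

Because the index depends only on the key and the partition count, two datasets partitioned with the same settings put equal keys at the same index, so a by-key operation can match partition i of one side with partition i of the other. Producing that layout is itself the shuffle; the benefit only appears when the layout is reused.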
Just one comment: it is better to append .persist() after .partitionBy if there are multiple actions on rddApartitioned and rddBpartitioned. Otherwise, every action will re-evaluate the entire lineage of rddApartitioned and rddBpartitioned, which causes the hash partitioning to take place again and again.
val rddApartitioned = rddA.partitionBy(new HashPartitioner(numOfPartitions)).persist()
val rddBpartitioned = rddB.partitionBy(new HashPartitioner(numOfPartitions)).persist()

Recursive Dataframe operations

In my Spark application I would like to do operations on a DataFrame in a loop and write the result to HDFS.
pseudocode:
var df = emptyDataframe
for n = 1 to 200000 {
  someDf = read(n)
  df = df.mergeWith(someDf)
}
df.writetohdfs
In the above example I get good results when "mergeWith" does a unionAll.
However, when in "mergeWith" I do a (simple) join, the job gets really slow (>1h with 2 executors with 4 cores each) and never finishes (job aborts itself).
In my scenario I throw in ~50 iterations with files that just contain ~1mb of text data.
Because order of merges is important in my case, I'm suspecting this is due to the DAG generation, causing the whole thing to be run at the moment I store away the data.
Right now I'm attempting to use a .persist on the merged data frame but that also seems to go rather slowly.
EDIT:
As the job was running I noticed that (even though I did a count and a .persist) the DataFrame in memory didn't look like a static DataFrame.
It looked like a strung-together path of all the merges it had been doing, effectively slowing down the job linearly.
Am I right to assume the var df is the culprit of this?
breakdown of the issue as I see it:
dfA = empty
dfC = dfA.increment(dfB)
dfD = dfC.increment(dfN)....
Whereas I would expect dfA, dfC and dfD to be separate objects, Spark thinks differently and does not care whether I persist or repartition.
to Spark it looks like this:
dfA = empty
dfC = dfA incremented with df B
dfD = ((dfA incremented with df B) incremented with dfN)....
Update2
To work around persisting not working on DataFrames, I could "break" the lineage by converting the DataFrame to an RDD and back again.
This has a little bit of an overhead but an acceptable one (job finishes in minutes rather than hours/never)
I'll run some more tests on the persisting and formulate an answer in the form of a workaround.
Result:
This only seems to fix these issues on the surface. In reality I'm back at square one and get OOM exceptions: java.lang.OutOfMemoryError: GC overhead limit exceeded
If you have code like this:
var df = sc.parallelize(Seq(1)).toDF()
for (i <- 1 to 200000) {
  val df_add = sc.parallelize(Seq(i)).toDF()
  df = df.unionAll(df_add)
}
Then df will have 400000 partitions afterwards, which makes the following actions inefficient (because you get one task per partition).
Try to reduce the number of partitions to e.g. 200 before persisting the DataFrame (using e.g. df.coalesce(200).write.saveAsTable(....))
So the following is what I ended up using. It's performant enough for my use case, it works, and it does not need persisting.
It is very much a workaround rather than a fix.
val mutableBufferArray = ArrayBuffer[DataFrame]()
mutableBufferArray.append(hiveContext.emptyDataFrame)
for loop {
  val interm = mergeDataFrame(df, mutableBufferArray.last)
  val intermSchema = interm.schema
  // going through an RDD and back cuts the accumulated query plan
  val intermRDD = interm.rdd.repartition(8)
  mutableBufferArray.append(hiveContext.createDataFrame(intermRDD, intermSchema))
  mutableBufferArray.remove(0)
}
This is how I wrestle Tungsten into compliance.
By going from a DataFrame to an RDD and back I end up with a real object rather than a whole Tungsten-generated processing pipeline from front to back.
In my code I iterate a few times before writing out to disk (50-150 iterations seem to work best). That's where I clear out the bufferArray again to start over fresh.

How do you perform blocking IO in apache spark job?

What if, when I traverse RDD, I need to calculate values in dataset by calling external (blocking) service? How do you think that could be achieved?
val values: Future[RDD[Double]] = Future sequence tasks
I've tried to create a list of Futures, but as RDD is not Traversable, Future.sequence is not suitable.
I just wonder, if anyone had such a problem, and how did you solve it?
What I'm trying to achieve is to get a parallelism on a single worker node, so I can call that external service 3000 times per second.
Probably, there is another solution, more suitable for spark, like having multiple working nodes on single host.
It's interesting to know, how do you cope with such a challenge? Thanks.
Here is the answer to my own question:
val buckets = sc.textFile(logFile, 100)
val tasks: RDD[Future[Object]] = buckets.map { item =>
  Future {
    // call native code
  }
}
val values = tasks.mapPartitions[Object] { f: Iterator[Future[Object]] =>
  val searchFuture: Future[Iterator[Object]] = Future.sequence(f)
  Await.result(searchFuture, JOB_TIMEOUT)
}
The idea here is that we get a collection of partitions, where each partition is sent to a specific worker and is the smallest piece of work. Each such piece of work contains data that can be processed by calling the native code and sending it that data.
The 'values' collection contains the data returned from the native code, and that work is done across the cluster.
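The per-partition logic above can be modeled without Spark. The following is a minimal sketch, assuming a blocking function passed in as service; PartitionCalls and callInBatches are hypothetical names. It starts at most batchSize calls at a time and waits for each batch before starting the next.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object PartitionCalls {
  /** Calls `service` for each item with at most `batchSize` calls started at once, preserving order. */
  def callInBatches[A, B](items: Iterator[A], batchSize: Int, timeout: FiniteDuration)
                         (service: A => B)(implicit ec: ExecutionContext): Iterator[B] =
    items.grouped(batchSize).flatMap { batch =>
      // Start one Future per item in the batch, then block until the whole batch is done.
      val inFlight = batch.map(item => Future(service(item)))
      Await.result(Future.sequence(inFlight), timeout)
    }
}
```

Inside a Spark job, something like this could be called from mapPartitions on the input RDD, so that each task bounds how many requests it has outstanding; the batch size and timeout would need tuning against what the external service can handle.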
Based on your answer, that the blocking call is to compare provided input with each individual item in the RDD, I would strongly consider rewriting the comparison in java/scala so that it can be run as part of your spark process. If the comparison is a "pure" function (no side effects, depends only on its inputs), it should be straightforward to re-implement, and the decrease in complexity and increase in stability in your spark process due to not having to make remote calls will probably make it worth it.
It seems unlikely that your remote service will be able to handle 3000 calls per second, so a local in-process version would be preferable.
If that is absolutely impossible for some reason, then you might be able to create an RDD transformation which turns your data into an RDD of futures. In pseudo-code:
def callRemote(data: Data): Future[Double] = ???
val inputData: RDD[Data] = ???
val transformed: RDD[Future[Double]] = inputData.map(callRemote)
And then carry on from there, computing on your Future[Double] objects.
If you know how much parallelism your remote process can handle, it might be best to abandon the Future mode and accept that it is a bottleneck resource.
val remoteParallelism: Int = 100 // some constant
def callRemoteBlocking(data: Data): Double = ???
val inputData: RDD[Data] = ???
val transformed: RDD[Double] = inputData
  .coalesce(remoteParallelism)
  .map(callRemoteBlocking)
Your job will probably take quite some time, but it shouldn't flood your remote service and die horribly.
A final option is that if the inputs are reasonably predictable and the range of outcomes is consistent and limited to some reasonable number of outputs (millions or so), you could precompute them all as a data set using your remote service and find them at spark job time using a join.