I'm trying to reduce execution time of some small scala routine, say, concatenation of strings, since I'm too lazy to setup local environment, I'm using online scala compilers, but found the comparison result differs between scastie and scalafiddle w/ the following code:
// routine 1
var startT1 = System.nanoTime()
(1 until 100 * 1000).foreach{ x=>
val sb = new StringBuilder("a")
sb.append("b").append("c").append("d").append("e").append("f")
}
println(System.nanoTime() - startT1)
// routine 2
var startT2 = System.nanoTime()
(1 until 100 * 1000).foreach{ x=>
val arr = Array[Char]('a', 'b', 'c', 'd', 'e', 'f')
}
println(System.nanoTime() - startT2)
In scalafiddle routine 1 is faster but in scastie routine 2 is faster.
I'v read this article https://medium.com/#otto.chrons/what-makes-scalafiddle-so-fast-9a3edf33ed4d, so it seems that scalafiddle actually runs JavaScript instead of scala. But the remaining question is, can I really use scastie for execution time benchmarks?
Short answer - NO I don't you can not rely on ANY online running tools like scastie and scalafiddle to verify performance.
Because, there is 1000 and more reasons why some benchmark will show X millis execution for some operation, and 99% of that reasons is running environment: hardware, operating system, CPU architecture, load on machine, used Scala compiler, used JVM etc. And we don't know environment change between runs on Scatie for instance, so you can get totally different numbers and don't know why, hence benchmark results won't be reliable.
If you would like to get some results, you would like rely one at least a bit, take a look at https://openjdk.java.net/projects/code-tools/jmh/ and it's sbt helper plugin https://github.com/ktoso/sbt-jmh and run known environment.
And along with posting benchmark results - please post environment details, where it was run.
A tricky problem happened related to the sharp increase of executing time.
I run my scala code in local spark, part of which is to build a n*n matrix.
When running a small dataset, it just takes 5s to finish. The most time-consuming part is to build 2000*2000 matrix. And this part is executed within map, which just deals with array data structure.
However, just out of curiosity, I add "println" within the matrix-building code to see the number of iterations. Suddenly, the whole running time increases to 1min23s.
And the final results are same.
I am new to Spark and have no idea what really causes this situation.
The codes are simply:
val x = someRDD.map(buildMatrix)
def buildMatrix(stringVect:Array[String]): Array[Array[Double]] = {
//var count = 0
val num = stringVect.length
var simi_matrix = Array[Array[Double]]()
for (i<- 0 until num-1){
for (j<- (i+1) until num){
"build the matrix with some computation"
//println(count)
//count += 1
}
}
}
TL;DR
This does not have to do anything with Spark. I/O access to the console is synchronized and costly. It will slow down any program on the JVM (Scala/Java/Clojure/...).
println defaults to java.lang.System.out which is a PrintStream. println delegates to PrintStream#println, hence entering the synchronized block of the println implementation to output to the console: There are two expenses:
Getting a synchronized lock
I/O to the console OutputStream
The slowdown observed is to be expected. Just don't use println in hot parts of the code (like a tight loop in this case).
This question already has answers here:
How do I write a correct micro-benchmark in Java?
(11 answers)
Closed 6 years ago.
I have a program written in Scala. I wanted to measure the execution time of different independent code blocks. When I did it in the obvious way (i.e. inserting System.nanoTime() before and after each block), I observed that the the execution time depends on the ordering of the blocks. The first some blocks always took more time than the others.
I created a minimalistic example that reproduces this behaviour. All code blocks are the same and call hashCode() for an array of integers, for simplicity.
package experiments
import scala.util.Random
/**
* Measuring execution time of a code block
*
* Minimalistic example
*/
object CodeBlockMeasurement {
def main(args: Array[String]): Unit = {
val numRecords = args(0).toInt
// number of independent measurements
val iterations = args(1).toInt
// Changes results a little bit, but not too much
// val records2 = Array.fill[Int](1)(0)
// records2.foreach(x => {})
for (_ <- 1 to iterations) {
measure(numRecords)
}
}
def measure(numRecords: Int): Unit = {
// using a new array every time
val records = Array.fill[Int](numRecords)(new Random().nextInt())
// block of code to be measured
def doSomething(): Unit = {
records.foreach(k => k.hashCode())
}
// measure execution time of the code-block
elapsedTime(doSomething(), "HashCodeExperiment")
}
def elapsedTime(block: => Unit, name: String): Unit = {
val t0 = System.nanoTime()
val result = block
val t1 = System.nanoTime()
// print out elapsed time in milliseconds
println(s"$name took ${(t1 - t0).toDouble / 1000000} ms")
}
}
After running the program with numRecords = 100000 and iterations = 10, my console looks like this:
HashCodeExperiment took 14.630283 ms
HashCodeExperiment took 7.125693 ms
HashCodeExperiment took 0.368151 ms
HashCodeExperiment took 0.431628 ms
HashCodeExperiment took 0.086455 ms
HashCodeExperiment took 0.056458 ms
HashCodeExperiment took 0.055138 ms
HashCodeExperiment took 0.062997 ms
HashCodeExperiment took 0.063736 ms
HashCodeExperiment took 0.056682 ms
Can somebody explain why is that? Shouldn't all be the same? Which is the real execution time?
Thanks a lot,
Peter
Environment parameters:
OS: ubuntu 14.04 LTS (64 bit)
IDE: IntelliJ IDEA 2016.1.1 (IU-145.597)
Scala: 2.11.7
It's Java's JIT kicking in. Initially the plain bytecode is executed but after some time (1.5k/10k invocations by default for Oracle JVM, see -XX:CompileThreshold) the optimizations start processing the actual executed native code which usually results in quite drastic performance improvements.
As Ivan mentions, then there's caching of intermediate bytecode/native code and various other technologies involved, one of the most significant ones being the Garbage Collector itself which cause even more variance to individual results. Depending how heavily the code allocates new objects this might absolutely trash performance whenever GC occurs, but that's a separate issue.
To remove such outlier results when microbenchmarking it is recommendable that you benchmark multiple iterations of the action and discard the bottom and top 5..10% results and do the performance estimation based on remaining samples.
Short answer: caching.
These are independent code blocks, but runs can't be fully independent because they are run in same JVM instance, and in same process of same CPU.
JVM itself has a lot of optimization inside, including caching. Modern CPU also do so. So, as result, it's quite common behavior, that re-run usually takes less time than first run.
I prefer Python over Scala. But, as Spark is natively written in Scala, I was expecting my code to run faster in the Scala than the Python version for obvious reasons.
With that assumption, I thought to learn & write the Scala version of some very common preprocessing code for some 1 GB of data. Data is picked from the SpringLeaf competition on Kaggle. Just to give an overview of the data (it contains 1936 dimensions and 145232 rows). Data is composed of various types e.g. int, float, string, boolean. I am using 6 cores out of 8 for Spark processing; that's why I used minPartitions=6 so that every core has something to process.
Scala Code
val input = sc.textFile("train.csv", minPartitions=6)
val input2 = input.mapPartitionsWithIndex { (idx, iter) =>
if (idx == 0) iter.drop(1) else iter }
val delim1 = "\001"
def separateCols(line: String): Array[String] = {
val line2 = line.replaceAll("true", "1")
val line3 = line2.replaceAll("false", "0")
val vals: Array[String] = line3.split(",")
for((x,i) <- vals.view.zipWithIndex) {
vals(i) = "VAR_%04d".format(i) + delim1 + x
}
vals
}
val input3 = input2.flatMap(separateCols)
def toKeyVal(line: String): (String, String) = {
val vals = line.split(delim1)
(vals(0), vals(1))
}
val input4 = input3.map(toKeyVal)
def valsConcat(val1: String, val2: String): String = {
val1 + "," + val2
}
val input5 = input4.reduceByKey(valsConcat)
input5.saveAsTextFile("output")
Python Code
input = sc.textFile('train.csv', minPartitions=6)
DELIM_1 = '\001'
def drop_first_line(index, itr):
if index == 0:
return iter(list(itr)[1:])
else:
return itr
input2 = input.mapPartitionsWithIndex(drop_first_line)
def separate_cols(line):
line = line.replace('true', '1').replace('false', '0')
vals = line.split(',')
vals2 = ['VAR_%04d%s%s' %(e, DELIM_1, val.strip('\"'))
for e, val in enumerate(vals)]
return vals2
input3 = input2.flatMap(separate_cols)
def to_key_val(kv):
key, val = kv.split(DELIM_1)
return (key, val)
input4 = input3.map(to_key_val)
def vals_concat(v1, v2):
return v1 + ',' + v2
input5 = input4.reduceByKey(vals_concat)
input5.saveAsTextFile('output')
Scala Performance
Stage 0 (38 mins), Stage 1 (18 sec)
Python Performance
Stage 0 (11 mins), Stage 1 (7 sec)
Both produces different DAG visualization graphs (due to which both pictures show different stage 0 functions for Scala (map) and Python (reduceByKey))
But, essentially both code tries to transform data into (dimension_id, string of list of values) RDD and save to disk. The output will be used to compute various statistics for each dimension.
Performance wise, Scala code for this real data like this seems to run 4 times slower than the Python version.
Good news for me is that it gave me good motivation to stay with Python. Bad news is I didn't quite understand why?
The original answer discussing the code can be found below.
First of all, you have to distinguish between different types of API, each with its own performance considerations.
RDD API
(pure Python structures with JVM based orchestration)
This is the component which will be most affected by the performance of the Python code and the details of PySpark implementation. While Python performance is rather unlikely to be a problem, there at least few factors you have to consider:
Overhead of JVM communication. Practically all data that comes to and from Python executor has to be passed through a socket and a JVM worker. While this is a relatively efficient local communication it is still not free.
Process-based executors (Python) versus thread based (single JVM multiple threads) executors (Scala). Each Python executor runs in its own process. As a side effect, it provides stronger isolation than its JVM counterpart and some control over executor lifecycle but potentially significantly higher memory usage:
interpreter memory footprint
footprint of the loaded libraries
less efficient broadcasting (each process requires its own copy of a broadcast)
Performance of Python code itself. Generally speaking Scala is faster than Python but it will vary on task to task. Moreover you have multiple options including JITs like Numba, C extensions (Cython) or specialized libraries like Theano. Finally, if you don't use ML / MLlib (or simply NumPy stack), consider using PyPy as an alternative interpreter. See SPARK-3094.
PySpark configuration provides the spark.python.worker.reuse option which can be used to choose between forking Python process for each task and reusing existing process. The latter option seems to be useful to avoid expensive garbage collection (it is more an impression than a result of systematic tests), while the former one (default) is optimal for in case of expensive broadcasts and imports.
Reference counting, used as the first line garbage collection method in CPython, works pretty well with typical Spark workloads (stream-like processing, no reference cycles) and reduces the risk of long GC pauses.
MLlib
(mixed Python and JVM execution)
Basic considerations are pretty much the same as before with a few additional issues. While basic structures used with MLlib are plain Python RDD objects, all algorithms are executed directly using Scala.
It means an additional cost of converting Python objects to Scala objects and the other way around, increased memory usage and some additional limitations we'll cover later.
As of now (Spark 2.x), the RDD-based API is in a maintenance mode and is scheduled to be removed in Spark 3.0.
DataFrame API and Spark ML
(JVM execution with Python code limited to the driver)
These are probably the best choice for standard data processing tasks. Since Python code is mostly limited to high-level logical operations on the driver, there should be no performance difference between Python and Scala.
A single exception is usage of row-wise Python UDFs which are significantly less efficient than their Scala equivalents. While there is some chance for improvements (there has been substantial development in Spark 2.0.0), the biggest limitation is full roundtrip between internal representation (JVM) and Python interpreter. If possible, you should favor a composition of built-in expressions (example. Python UDF behavior has been improved in Spark 2.0.0, but it is still suboptimal compared to native execution.
This may improved in the future has improved significantly with introduction of the vectorized UDFs (SPARK-21190 and further extensions), which uses Arrow Streaming for efficient data exchange with zero-copy deserialization. For most applications their secondary overheads can be just ignored.
Also be sure to avoid unnecessary passing data between DataFrames and RDDs. This requires expensive serialization and deserialization, not to mention data transfer to and from Python interpreter.
It is worth noting that Py4J calls have pretty high latency. This includes simple calls like:
from pyspark.sql.functions import col
col("foo")
Usually, it shouldn't matter (overhead is constant and doesn't depend on the amount of data) but in the case of soft real-time applications, you may consider caching/reusing Java wrappers.
GraphX and Spark DataSets
As for now (Spark 1.6 2.1) neither one provides PySpark API so you can say that PySpark is infinitely worse than Scala.
GraphX
In practice, GraphX development stopped almost completely and the project is currently in the maintenance mode with related JIRA tickets closed as won't fix. GraphFrames library provides an alternative graph processing library with Python bindings.
Dataset
Subjectively speaking there is not much place for statically typed Datasets in Python and even if there was the current Scala implementation is too simplistic and doesn't provide the same performance benefits as DataFrame.
Streaming
From what I've seen so far, I would strongly recommend using Scala over Python. It may change in the future if PySpark gets support for structured streams but right now Scala API seems to be much more robust, comprehensive and efficient. My experience is quite limited.
Structured streaming in Spark 2.x seem to reduce the gap between languages but for now it is still in its early days. Nevertheless, RDD based API is already referenced as "legacy streaming" in the Databricks Documentation (date of access 2017-03-03)) so it reasonable to expect further unification efforts.
Non-performance considerations
Feature parity
Not all Spark features are exposed through PySpark API. Be sure to check if the parts you need are already implemented and try to understand possible limitations.
It is particularly important when you use MLlib and similar mixed contexts (see Calling Java/Scala function from a task). To be fair some parts of the PySpark API, like mllib.linalg, provides a more comprehensive set of methods than Scala.
API design
The PySpark API closely reflects its Scala counterpart and as such is not exactly Pythonic. It means that it is pretty easy to map between languages but at the same time, Python code can be significantly harder to understand.
Complex architecture
PySpark data flow is relatively complex compared to pure JVM execution. It is much harder to reason about PySpark programs or debug. Moreover at least basic understanding of Scala and JVM in general is pretty much a must have.
Spark 2.x and beyond
Ongoing shift towards Dataset API, with frozen RDD API brings both opportunities and challenges for Python users. While high level parts of the API are much easier to expose in Python, the more advanced features are pretty much impossible to be used directly.
Moreover native Python functions continue to be second class citizen in the SQL world. Hopefully this will improve in the future with Apache Arrow serialization (current efforts target data collection but UDF serde is a long term goal).
For projects strongly depending on the Python codebase, pure Python alternatives (like Dask or Ray) could be an interesting alternative.
It doesn't have to be one vs. the other
The Spark DataFrame (SQL, Dataset) API provides an elegant way to integrate Scala/Java code in PySpark application. You can use DataFrames to expose data to a native JVM code and read back the results. I've explained some options somewhere else and you can find a working example of Python-Scala roundtrip in How to use a Scala class inside Pyspark.
It can be further augmented by introducing User Defined Types (see How to define schema for custom type in Spark SQL?).
What is wrong with code provided in the question
(Disclaimer: Pythonista point of view. Most likely I've missed some Scala tricks)
First of all, there is one part in your code which doesn't make sense at all. If you already have (key, value) pairs created using zipWithIndex or enumerate what is the point in creating string just to split it right afterwards? flatMap doesn't work recursively so you can simply yield tuples and skip following map whatsoever.
Another part I find problematic is reduceByKey. Generally speaking, reduceByKey is useful if applying aggregate function can reduce the amount of data that has to be shuffled. Since you simply concatenate strings there is nothing to gain here. Ignoring low-level stuff, like the number of references, the amount of data you have to transfer is exactly the same as for groupByKey.
Normally I wouldn't dwell on that, but as far as I can tell it is a bottleneck in your Scala code. Joining strings on JVM is a rather expensive operation (see for example: Is string concatenation in scala as costly as it is in Java?). It means that something like this _.reduceByKey((v1: String, v2: String) => v1 + ',' + v2) which is equivalent to input4.reduceByKey(valsConcat) in your code is not a good idea.
If you want to avoid groupByKey you can try to use aggregateByKey with StringBuilder. Something similar to this should do the trick:
rdd.aggregateByKey(new StringBuilder)(
(acc, e) => {
if(!acc.isEmpty) acc.append(",").append(e)
else acc.append(e)
},
(acc1, acc2) => {
if(acc1.isEmpty | acc2.isEmpty) acc1.addString(acc2)
else acc1.append(",").addString(acc2)
}
)
but I doubt it is worth all the fuss.
Keeping the above in mind, I've rewritten your code as follows:
Scala:
val input = sc.textFile("train.csv", 6).mapPartitionsWithIndex{
(idx, iter) => if (idx == 0) iter.drop(1) else iter
}
val pairs = input.flatMap(line => line.split(",").zipWithIndex.map{
case ("true", i) => (i, "1")
case ("false", i) => (i, "0")
case p => p.swap
})
val result = pairs.groupByKey.map{
case (k, vals) => {
val valsString = vals.mkString(",")
s"$k,$valsString"
}
}
result.saveAsTextFile("scalaout")
Python:
def drop_first_line(index, itr):
if index == 0:
return iter(list(itr)[1:])
else:
return itr
def separate_cols(line):
line = line.replace('true', '1').replace('false', '0')
vals = line.split(',')
for (i, x) in enumerate(vals):
yield (i, x)
input = (sc
.textFile('train.csv', minPartitions=6)
.mapPartitionsWithIndex(drop_first_line))
pairs = input.flatMap(separate_cols)
result = (pairs
.groupByKey()
.map(lambda kv: "{0},{1}".format(kv[0], ",".join(kv[1]))))
result.saveAsTextFile("pythonout")
Results
In local[6] mode (Intel(R) Xeon(R) CPU E3-1245 V2 # 3.40GHz) with 4GB memory per executor it takes (n = 3):
Scala - mean: 250.00s, stdev: 12.49
Python - mean: 246.66s, stdev: 1.15
I am pretty sure that most of that time is spent on shuffling, serializing, deserializing and other secondary tasks. Just for fun, here's naive single-threaded code in Python that performs the same task on this machine in less than a minute:
def go():
with open("train.csv") as fr:
lines = [
line.replace('true', '1').replace('false', '0').split(",")
for line in fr]
return zip(*lines[1:])
I'm looking to improve the performance of a Monte Carlo simulation I am developing.
I first did an implementation which does the simulation of each paths sequentially as follows:
def simulate() = {
for (path <- 0 to 30000) {
(0 to 100).foreach(
x => // do some computation
)
}
}
This basically is simulating 30,000 paths and each path has 100 discretised random steps.
The above function runs very quickly on my machine (about 1s) for the calculation I am doing.
I then thought about speeding it up even further by making the code run in a multithreaded fashion.
I decided to use Task for this and I coded the following:
val simulation = (1 |-> 30000 ).map(n => Task {
(1 |-> 100).map(x => // do some computation)
})
I then use this as follows:
Task.gatherUnordered(simulation).run
When I kick this off, I know my machine is doing a lot of work as I can
see that in the activity monitor and also the machine fan is going ballistic.
After about two minutes of heavy activity on the machine, the work it seems
to be doing finishes but I don't get any value returned (I am expected a collection
of Doubles from each task that was processed).
My questions are:
Why does this take longer than the sequential example? I am more
than likely doing something wrong but I can't see it.
Why don't I get any returned collection of values from the tasks that are apparently being processed?
I'm not sure why Task.gatherUnordered is so slow, but if you change Task.gatherUnordered to Nondeterminism.gatherUnordered everything will be fine:
import scalaz.Nondeterminism
Nondeterminism[Task].gatherUnordered(simulation).run
I'm going to create an issue on Github about Task.gatherUnordered. This definitely should be fixed.