Do partitions increase performance when only using one computer/node? - scala

I know that partitions will boost the performance by doing parallel tasks on different nodes in a cluster. But will partitions help me get better performance when I am only using one single computer? I am using Spark and Scala.

Yes it will increase performance.
Make sure your CPU have more than one core.
when you making your local sparksession, make sure to use multiple core :
local to run locally with one thread, or local[N] to run locally with N thread, i suggest you to use local[*]
and make sure your RDD/Dataset have multiple partition, i good number of partition is 2 to 4 time the number of core.

Apache Spark scacles as well vertically (CPU, Ram, ...) and horizontally (Nodes). I assume, that your computer/node has a CPU with more than one core. The partitions are then processed in parallel.

Related

having Spark process partitions concurrently, using a single dev/test machine

I'm naively testing for concurrency in local mode, with the following spark context
SparkSession
.builder
.appName("local-mode-spark")
.master("local[*]")
.config("spark.executor.instances", 4)
.config("spark.executor.cores", 2)
.config("spark.network.timeout", "10000001") // to avoid shutdown during debug, avoid otherwise
.config("spark.executor.heartbeatInterval", "10000000") // to avoid shutdown during debug, avoid otherwise
.getOrCreate()
and a mapPartitions API call like follows:
import spark.implicits._
val inputDF : DataFrame = spark.read.parquet(inputFile)
val resultDF : DataFrame =
inputDF.as[T].mapPartitions(sparkIterator => new MyIterator)).toDF
On the surface of it, this did surface one concurrency bug in my code contained in MyIterator (not a bug in Spark's code). However, I'd like to see that my application will crunch all available machine resources both in production, and also during this testing so that the chances of spotting additional concurrency bugs will improve.
That is clearly not the case for me so far: my machine is only at very low CPU utilization throughout the heavy processing of the inputDF, while there's plenty of free RAM and the JVM Xmx poses no real limitation.
How would you recommend testing for concurrency using your local machine? the objective being to test that in production, Spark will not bump into thread-safety or other concurrency issues in my code applied by spark from within MyIterator?
Or can it even in spark local mode, process separate partitions of my input dataframe in parallel? Can I get spark to work concurrently on the same dataframe on a single machine, preferably in local mode?
Max parallelism
You are already running spark in local mode using .master("local[*]").
local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).
Max memory available to all executors/threads
I see that you are not setting the driver memory explicitly. By default the driver memory is 512M. If your local machine can spare more than this, set this explicitly. You can do that by either:
setting it in the properties file (default is spark-defaults.conf),
spark.driver.memory 5g
or by supplying configuration setting at runtime
$ ./bin/spark-shell --driver-memory 5g
Note that this cannot be achieved by setting it in the application, because it is already too late by then, the process has already started with some amount of memory.
Nature of Job
Check number of partitions in your dataframe. That will essentially determine how much max parallelism you can use.
inputDF.rdd.partitions.size
If the output of this is 1, that means your dataframe has only 1 partition and so you won't get concurrency when you do operations on this dataframe. In that case, you might have to tweak some config to create more number of partitions so that you can concurrently run tasks.
Running local mode cannot simulate a production environment for the following reasons.
There are lots of code which gets bypassed when code is run in local mode, which would normally run with any other cluster manager. Amongst various issues, few things that i could think
a. Inability to detect bugs from the way shuffle get handled.(Shuffle data is handled in a completely different way in local mode.)
b. We will not be able to detect serialization related issues, since all code is available to the driver and task runs in the driver itself, and hence we would not result in any serialization issues.
c. No speculative tasks(especially for write operations)
d. Networking related issues, all tasks are executed in same JVM. One would not be able detect issues like communication between driver/executor, codegen related issues.
Concurrency in local mode
a. Max concurrency than can be attained will be equal to the number of cores in your local machine.(Link to code)
b. The Job, Stage, Task metrics shown in Spark UI are not accurate since it will incur the overhead of running in the JVM where the driver is also running.
c: As for CPU/Memoryutilization, it depends on operation being performed. Is the operation CPU/memory intensive?
When to use local mode
a. Testing of code that will run only on driver
b. Basic sanity testing of the code that will get executed on the executors
c. Unit testing
tl; dr The concurrency bugs that occur in local mode might not even be present in other cluster resource managers, since there are lot of special handling in Spark code for local mode(There are lots of code which checks isLocal in code and control goes to a different code flow altogether)
Yes!
Achieving parallelism in local mode is quite possible.
Check the amount of memory and cpu available in your local machine and supply values to the driver-memory and driver-cores conf while submitting your spark job.
Increasing executor-memory and executor-cores will not make a difference in this mode.
Once the application is running, open up the SPARK UI for the job. You can now go to the EXECUTORS tab to actually check the amount of resources your spark job is utilizing.
You can monitor various tasks that get generated and the number of tasks that your job runs concurrently using the JOBS and STAGES tab.
In order to process data which is way larger than the resources available, ensure that you break your data into smaller partitions using repartition. This should allow your job to complete successfully.
Increase the default shuffle partitions in case your job has aggregations or joins. Also, ensure sufficient space on the local file system since spark creates intermediate shuffle files and writes them to disk.
Hope this helps!

Dataproc Cluster with Spark 1.6.X using scala 2.11.X

I'm looking for a way to use Spark on Dataproc built with Scala 2.11. I want to use 2.11 since my jobs pulls in ~10 BigQuery tables and I'm using the new reflection libraries to map the corresponding objects to case classes. (There's a bug with the new reflection classes and concurrency which is only fixed in Scala 2.11) I've tried working around this issues by setting executor-cores to 1 but the performance decrease is painful. Is there a better way?
In general, setting executor-cores to 1 is a reasonable way to work around concurrency issues, since it can often happen that third-party libraries you may incorporate into your Spark jobs also have thread-safety problems; the key here is that you should be able to resize the executors to each only have 1 core without really sacrificing performance (the larger scheduling overhead and yarn overhead might mean o the order of, say ~10% performance decrease, but certainly nothing unmanageable).
I'm assuming you're referring to some multiplicative factor performance decrease due to, say, only using 2 out of 8 cores on an 8-core VM (Dataproc packs 2 executors per VM by default). The way to fix this is simply to also adjust spark.executor.memory down proportionally to match up with the 1 core. For example, in your cluster config (gcloud dataproc clusters describe your-cluster-name) if you use 4-core VMs you might see something like:
spark:spark.executor.cores: '2'
spark:spark.executor.memory: 5586m
YARN packs entirely based on memory, not cores, so this means 5586m is designed to fit in half a YARN node, and thus correspond to 2 cores. If you turn up your cluster like:
gcloud dataproc clusters create \
--properties spark:spark.executor.cores=1,spark:spark.executor.memory=2000m
Then you should end up with a setup which still uses all the cores, but without concurrency issues (one worker thread in each executor process only).
I didn't just use 5586/2 in this case because you have to factor in spark:spark.yarn.executor.memoryOverhead as well, so basically you have to add in the memoryOverhead, then divide by two, then subtract the memoryOverhead again to determine the new executor size, and beyond that the allocations also round to the next multiple of a base chunk size, which I believe is 512m.
In general, you can use trial-and-error by starting a bit lower on the memory allocation per core, and then increasing it if you find you need more memory headroom.
You don't have to redeploy a cluster to check these either; you can specify these at job submission time instead for faster turnaround:
gcloud dataproc jobs submit spark \
--properties spark.executor.cores=1,spark.executor.memory=2000m

running a single job across multiple workers in apache spark

I am trying to get to know how Spark splits a single job (a scala file built using sbt package and the jar is run using spark-submit command) across multiple workers.
For example : I have two workers (512MB memory each). I submit a job and it gets allocated to one worker only (if driver memory is less than the worker memory). In case the driver memory is more than the worker memory, it doesn't get allocated to any worker (even though the combined memory of both workers is higher than the driver memory) and goes to submitted state. This job then goes to running state only when a worker with the required memory is available in the cluster.
I want to know whether one job can be split up across multiple workers and can be run in parallel. If so, can anyone help me with the specific steps involved in it.
Note : the scala program requires a lot of jvm memory since I would be using a large array buffer and hence trying to split the job across multiple workers
Thanks in advance!!
Please check if the array you would be using is parallelized. Then when you do some action on it, it should work in parallel across the nodes.
Check out this page for reference : http://spark.apache.org/docs/0.9.1/scala-programming-guide.html
Make sure your RDD has more than one partition (rdd.partitions.size). Make sure you have more than one executor connected to the driver (http://localhost:4040/executors/).
If both of these are fulfilled, your job should run on multiple executors in parallel. If not, please include code and logs in your question.

Understanding parallelism in Spark and Scala

I have some confusion about parallelism in Spark and Scala. I am running an experiment in which I have to read many (csv) files from the disk change/ process certain columns and then write it back to the disk.
In my experiments, if I use SparkContext's parallelize method only then it does not seem to have any impact on the performance. However simply using Scala's parallel collections (through par) reduces the time almost to half.
I am running my experiments in localhost mode with the arguments local[2] for the spark context.
My question is when should I use scala's parallel collections and when to use spark context's parallelize?
SparkContext will have additional processing in order to support generality of multiple nodes, this will be constant on the data size so may be negligible for huge data sets. On 1 node this overhead will make it slower than Scala's parallel collections.
Use Spark when
You have more than 1 node
You want your job to be ready to scale to multiple nodes
The Spark overhead on 1 node is negligible because the data is huge, so you might as well choose the richer framework
SparkContext's parallelize may makes your collection suitable for processing on multiple nodes, as well as on multiple local cores of your single worker instance ( local[2] ), but then again, you probably get too much overhead from running Spark's task scheduler an all that magic. Of course, Scala's parallel collections should be faster on single machine.
http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html#parallelized-collections - are your files big enough to be automatically split to multiple slices, did you try setting slices number manually?
Did you try running the same Spark job on single core and then on two cores?
Expect best result from Spark with one really big uniformly structured file, not with multiple smaller files.

Distributing work to multiple cores: Hadoop or Scala's parallel collections?

What is the better way of making full use of multiple cores for parallel processing in a Scala/Hadoop system?
Let's say I need to process 100 million documents. Documents are not very large, but processing them is computationally intensive. If I have a Hadoop cluster with 100 machines with 10 cores each, I could either:
A) send 1000 documents to each machine and let Hadoop start a map on each of the 10 cores (or as many as are available)
or
B) send 1000 documents to each machine (still using Hadoop) and use Scala's parallel collections to make full use of the multiple cores. (I would put all documents in a parallel collection, and then call map on the collection). In other words, use Hadoop for distribution at cluster level, and use parallel collections to manage the distribution to cores within each machine.
Hadoop is going to offer a lot more than just parallelization. It offers a platform to distribute work, a scheduler for handling concurrent jobs, a distributed filesystem, the ability to perform a distributed reduce, and fault tolerance. That said, it is a complicated system and can sometimes be difficult to work with.
If you plan to have multiple users submitting many different jobs, Hadoop is the way to go (out of the two options). However, if you are devoting a cluster to be always be processing documents through the same function, you could, without too much trouble, develop a system with Scala parallel collections and actors for inter-machine communication. The Scala solution would give you more control, the system could respond in real time, and you wouldn't have to deal with a lot of Hadoop configuration that doesn't pertain to your task.
If you need to run varied jobs over large amounts of data (larger than would fit on a single node), then use Hadoop. I can give you more information if you describe your requirements in more detail.
Update: one million is a fairly small number. You might want to do some calculations and see how long it would take on a single machine with parallel collections. The advantage here is that the development time is minimal!
The answer depends on the following question - does your Scala code capable to fully utilize all cores available. Probabbly if you have good intrinsic synchronization between parts of the document to be processed or some other way to parralelyze algorithm without lock contention - then the "B"" is the way. If so - configure one mapper per node and let your mapper to utilize cores in a best way.
If your gain from the parralelization is not that good, and adding more threads (cores) to the processing does not improve performance in a linear way - then the "A" can be better way. Efficiency of "A" also depends on the size of your RAM - you will need enough ram for 10 mappers per node.
I can suspect that ideal solution can be somewhere in between. So my suggestion is to develop mapper which takes number of threads used as a parameter and then do a few tests increasing number of threads per mapper and decreasing number of mappers per node.
Hadoop is not very good for processing a lot of small files, but for processing a small amount of very large files. Is there any way you can merge the files before processing them, or are they all totally different? Hadoop takes care of distribution and parallelism itself, so there is no need to explicitly send X docs to Y machines. And also i don't think you should use hadoop only as a distribution mechanism, that is not what it's made for. You should either use a real map/reduce, or build your own system for whatever you are trying to do, but not try to bend hadoop to your will.