Spark local cluster vs threading - pyspark

As mentioned in the question linked below:
Spark Local vs Cluster
It means Spark local mode runs on the local machine with a number of threads.
Can I assume it is similar to creating a thread with Python's threading module, and that we do not need to worry about anything else?
Can I explore this way:
convert a large list into a DataFrame,
use a UDF to apply manipulations to the DataFrame,
convert the DataFrame back to a list.
Will that be a better or more efficient approach?
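For concreteness, a minimal PySpark sketch of those three steps (the list contents, column name, and UDF logic are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# local[*] runs one task thread per logical core on this machine
spark = SparkSession.builder.master("local[*]").appName("list-to-df").getOrCreate()

large_list = list(range(100000))                          # stand-in for the "large list"
df = spark.createDataFrame([(x,) for x in large_list], ["value"])

double = udf(lambda x: x * 2, IntegerType())              # example manipulation as a UDF
result_df = df.withColumn("doubled", double("value"))

result = [row["doubled"] for row in result_df.collect()]  # back to a Python list on the driver

Note that collect() pulls every row back through the driver, so for very large lists the conversion steps can dominate the runtime.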

I do not fully understand your question.
It means Spark local mode runs on the local machine with a number of threads.
Yes, in both local mode and cluster mode you can set the configs so that multiple threads run on each node; each task is executed in its own thread.
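As a rough sketch of that idea in PySpark (the thread count 4 is arbitrary):

from pyspark.sql import SparkSession

# local[4] runs tasks on 4 threads inside a single JVM on this machine
spark = (SparkSession.builder
         .master("local[4]")
         .appName("local-threads")
         .getOrCreate())

print(spark.sparkContext.defaultParallelism)  # typically 4 with this master setting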
Can I assume it is similar to creating a thread with Python's threading module, and that we do not need to worry about anything else?
I think so. I believe they are just different threads, each performing a different executor's job as if they were on different machines.
Can I explore this way?
I do not understand which method you are comparing it to, sorry.

Related

PySpark - Local system performance

I am new to PySpark. I would like to learn it while solving a Kaggle challenge that uses a large dataset.
Does PySpark offer a performance advantage over Pandas when used on a local system, or does it not matter?
When running locally, PySpark runs with as many worker threads as there are logical cores on your machine; if you run spark.sparkContext.master, it should return local[*] (more information on local configurations can be found here). Since Pandas is single-threaded (unless you're using something like Dask), PySpark should be more performant for large datasets. However, due to the overhead of using multiple threads, serializing data and sending it to the JVM, etc., Pandas may be faster for smaller datasets.
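A quick way to check what you are actually running, plus a rough comparison point against Pandas (a sketch; the CSV path and column name are hypothetical):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.sparkContext.master)   # e.g. local[*] when running locally on all cores

path = "data/train.csv"            # hypothetical Kaggle-style file

# Spark: the aggregation is split into tasks across the local worker threads
spark_df = spark.read.csv(path, header=True, inferSchema=True)
spark_df.groupBy("label").count().show()

# Pandas: the same aggregation runs in a single thread
pdf = pd.read_csv(path)
print(pdf.groupby("label").size())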

having Spark process partitions concurrently, using a single dev/test machine

I'm naively testing for concurrency in local mode, with the following Spark session:
val spark = SparkSession
  .builder
  .appName("local-mode-spark")
  .master("local[*]")
  .config("spark.executor.instances", 4)
  .config("spark.executor.cores", 2)
  .config("spark.network.timeout", "10000001") // to avoid shutdown during debug, avoid otherwise
  .config("spark.executor.heartbeatInterval", "10000000") // to avoid shutdown during debug, avoid otherwise
  .getOrCreate()
and a mapPartitions API call like the following:
import spark.implicits._
val inputDF: DataFrame = spark.read.parquet(inputFile)
val resultDF: DataFrame =
  inputDF.as[T].mapPartitions(sparkIterator => new MyIterator).toDF
This did surface one concurrency bug in my code inside MyIterator (not a bug in Spark's code). However, I'd like to see my application crunch all available machine resources, both in production and during this testing, so that the chances of spotting additional concurrency bugs improve.
That is clearly not the case for me so far: my machine sits at very low CPU utilization throughout the heavy processing of the inputDF, while there's plenty of free RAM and the JVM Xmx poses no real limitation.
How would you recommend testing for concurrency on a local machine? The objective is to verify that in production, Spark will not run into thread-safety or other concurrency issues in the code it applies from within MyIterator.
Can Spark, even in local mode, process separate partitions of my input dataframe in parallel? Can I get Spark to work concurrently on the same dataframe on a single machine, preferably in local mode?
Max parallelism
You are already running spark in local mode using .master("local[*]").
local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).
Max memory available to all executors/threads
I see that you are not setting the driver memory explicitly. By default the driver memory is 512M. If your local machine can spare more than this, set this explicitly. You can do that by either:
setting it in the properties file (default is spark-defaults.conf),
spark.driver.memory 5g
or by supplying the configuration setting at runtime:
$ ./bin/spark-shell --driver-memory 5g
Note that this cannot be achieved by setting it in the application, because by then it is already too late: the process has already started with some amount of memory.
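For a PySpark application, the same submit-time setting would look like this (my_app.py and 5g are placeholders):

$ ./bin/spark-submit --driver-memory 5g my_app.py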
Nature of Job
Check the number of partitions in your dataframe. That essentially determines the maximum parallelism you can use.
inputDF.rdd.partitions.size
If the output of this is 1, your dataframe has only one partition, so you won't get concurrency when you operate on it. In that case, you might have to tweak some configuration to create more partitions so that tasks can run concurrently.
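In PySpark, the same check and fix look roughly like this (a sketch; the input path and the partition count 16 are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
inputDF = spark.read.parquet("data/input.parquet")  # hypothetical input, mirroring the question

print(inputDF.rdd.getNumPartitions())  # PySpark equivalent of inputDF.rdd.partitions.size

# If this prints 1, redistribute the data so that tasks can run concurrently
inputDF = inputDF.repartition(16)      # pick something >= your core count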
Running in local mode cannot simulate a production environment, for the following reasons.
There is a lot of code that gets bypassed when running in local mode which would normally run with any other cluster manager. Among various issues, a few things I could think of:
a. Inability to detect bugs in the way shuffles are handled (shuffle data is handled in a completely different way in local mode).
b. We will not be able to detect serialization-related issues, since all code is available to the driver and tasks run in the driver itself, so no serialization issues will surface.
c. No speculative tasks (especially for write operations).
d. Networking-related issues: all tasks are executed in the same JVM, so one cannot detect issues such as driver/executor communication problems or codegen-related issues.
Concurrency in local mode
a. The maximum concurrency that can be attained is equal to the number of cores on your local machine. (Link to code)
b. The job, stage, and task metrics shown in the Spark UI are not accurate, since they incur the overhead of running in the same JVM as the driver.
c. As for CPU/memory utilization, it depends on the operation being performed. Is the operation CPU- or memory-intensive?
When to use local mode
a. Testing code that will run only on the driver
b. Basic sanity testing of the code that will be executed on the executors
c. Unit testing
tl;dr The concurrency bugs that occur in local mode might not even be present with other cluster resource managers, since there is a lot of special handling for local mode in the Spark code (plenty of code checks isLocal and takes a completely different code path).
Yes!
Achieving parallelism in local mode is quite possible.
Check the amount of memory and CPU available on your local machine and supply values to the driver-memory and driver-cores conf while submitting your Spark job.
Increasing executor-memory and executor-cores will not make a difference in this mode.
Once the application is running, open the Spark UI for the job. You can then go to the EXECUTORS tab to check the amount of resources your Spark job is actually utilizing.
You can monitor the tasks that get generated, and how many of them your job runs concurrently, using the JOBS and STAGES tabs.
In order to process data that is much larger than the available resources, break your data into smaller partitions using repartition. This should allow your job to complete successfully.
Increase the default shuffle partitions if your job has aggregations or joins, and ensure there is sufficient space on the local file system, since Spark creates intermediate shuffle files and writes them to disk.
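Putting those suggestions together for a PySpark job might look roughly like this (all paths and values are placeholders):

$ ./bin/spark-submit --master "local[*]" --driver-memory 8g my_job.py

# inside my_job.py
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.shuffle.partitions", "400")  # default is 200; raise for heavy joins/aggregations
         .getOrCreate())

df = spark.read.parquet("data/input.parquet")  # hypothetical input
df = df.repartition(64)                        # more, smaller partitions so each task fits the available resources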
Hope this helps!

How to use Concurrency API on data frames?

I have a requirement to parallelize the Scala DataFrames to load various tables. I have a fact table with around 1.7 TB of data, which takes around 5 minutes to load. I want to load my dimension tables concurrently so that I can reduce the overall run time. I am not well versed with the Concurrency API in Scala.
You need to read up on Spark: the whole point of it is to parallelize the processing of data beyond the scope of a single machine. Essentially, Spark will parallelize the load across as many tasks as you have running in parallel. It all depends on how you set up your cluster; from the question I am guessing you only use one machine and run it in local mode, in which case you should at least run it with local[number of processors you have].
If I didn't make it clear: you also shouldn't use any other Scala concurrency APIs.
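As a sketch of that advice (shown in PySpark, though the same applies in Scala; the table paths and core count are hypothetical):

from pyspark.sql import SparkSession

# Give the single local JVM one task slot per processor; Spark then splits each
# table load into parallel tasks on its own, no extra concurrency code needed.
spark = SparkSession.builder.master("local[8]").getOrCreate()  # 8 = your processor count

fact = spark.read.parquet("warehouse/fact_sales")
dim_customer = spark.read.parquet("warehouse/dim_customer")
dim_product = spark.read.parquet("warehouse/dim_product")

print(fact.count(), dim_customer.count(), dim_product.count())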

Parallelize a RDD variable expected from an external file in Spark

Based on the materials I have read and some online posts, I think Spark will distribute an RDD created from an external file via sc.textFile, for example:
val rdd = sc.textFile(file_path)
However, when my colleague read my code he asked me to use sc.parallelize. I am confused about this, as I think sc.parallelize is redundant here. I asked my colleague again and he gave me this answer:
In my experience so far, Spark is not good at handling the division of an external file over multiple nodes and workers, so you need to set partitions, forcing the work to be spread over multiple workers.
So, based on my colleague's suggestion, what is the easiest way to set partitions when reading a large file, if sc.textFile cannot do that? One possible way is to collect first and then sc.parallelize, but I think that wastes too much time and is redundant.
You can call rdd.repartition(..). Collect and parallelize is not the right way to achieve what you describe.
The reason your colleague observed this behaviour is probably small files: partitioning is driven by the HDFS blocks when reading, so if your files are smaller than the block size, all your data will end up in the same executor.
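In PySpark, the usual ways to influence partitioning when reading look like this (a sketch; the path and partition counts are placeholders):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# minPartitions hints at the number of splits when reading; on HDFS the split
# count otherwise follows the file/block layout.
rdd = sc.textFile("hdfs:///data/big_input.txt", minPartitions=16)
print(rdd.getNumPartitions())

# If the data still lands in too few partitions (e.g. one small file),
# redistribute it with repartition instead of collect() + parallelize():
rdd = rdd.repartition(16)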

running a single job across multiple workers in apache spark

I am trying to understand how Spark splits a single job (a Scala file built using sbt package, with the jar run via the spark-submit command) across multiple workers.
For example: I have two workers (512 MB memory each). I submit a job and it gets allocated to one worker only (if the driver memory is less than the worker memory). If the driver memory is more than the worker memory, it doesn't get allocated to any worker (even though the combined memory of both workers is higher than the driver memory) and goes to the SUBMITTED state. The job then goes to the RUNNING state only when a worker with the required memory becomes available in the cluster.
I want to know whether one job can be split across multiple workers and run in parallel. If so, can anyone help me with the specific steps involved?
Note: the Scala program requires a lot of JVM memory, since I will be using a large array buffer, hence the attempt to split the job across multiple workers.
Thanks in advance!!
Please check whether the array you will be using is parallelized. Then, when you perform an action on it, it should work in parallel across the nodes.
Check out this page for reference : http://spark.apache.org/docs/0.9.1/scala-programming-guide.html
Make sure your RDD has more than one partition (rdd.partitions.size). Make sure you have more than one executor connected to the driver (http://localhost:4040/executors/).
If both of these are fulfilled, your job should run on multiple executors in parallel. If not, please include code and logs in your question.
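For reference, the same two checks in PySpark (a sketch):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Equivalent of rdd.partitions.size: ask for several partitions explicitly
rdd = sc.parallelize(range(1000000), numSlices=8)
print(rdd.getNumPartitions())   # should be > 1 for the job to run in parallel

# The executors attached to the driver are listed in the Spark UI at
# http://localhost:4040/executors/ while the application is running.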