I would like to run a Spark application using multiple executors in parallel while trying the application on my development environment (Eclipse). It seems the Spark engine serializes all the tasks and run them using one executor.
Is there an option to run two or more tasks in parallel in Eclipse with spark.master=local?
Use spark.master="local[n]" where n is the number of cores you want to assign to spark or "*" for all cores.
Based on my understanding, RDD's are stored in the executor memory. That is the reason Spark performing all the tasks where data is available. I think there is no option to control the way Spark executes the tasks. Basically it creates stages based on the tasks and then determines the tasks execution.
Related
I am using Scala Spark 2.4 and want to know the usage of the queue.
How to display how many nodes from a big cluster (100+ nodes) are being utilized for any particular job that is running?
Thanks.
We are working on Qubole with Spark version 2.0.2.
We have a multi-step process in which all the intermediate steps write their output to HDFS and later this output is used in the reporting layer.
As per our use case, we want to avoid writing to HDFS and keep all the intermediate output as temporary tables in spark and directly write the final reporting layer output.
For this implementation, we wanted to use Job server provided by Qubole but when we try to trigger multiple queries on the Job server, Job server is running my jobs sequentially.
I have observed the same behavior in Databricks cluster as well.
The cluster we are using is a 30 node, r4.2xlarge.
Does anyone has experience in running multiple jobs using job server ?
Community's help will be greatly appreciated !
I have an EC2 set up with r3.8xlarge (32 cores, 244G RAM).
In my Spark application, I am reading two csv files from S3 using Spark-CSV from DataBrick, each csv has about 5 millions rows. I am unionAll the two DataFrames and running a dropDuplicates on the combined DataFrame.
But when I have,
val conf = new SparkConf()
.setMaster("local[32]")
.setAppName("Raw Ingestion On Apache Spark")
.set("spark.sql.shuffle.partitions", "32")
Spark is slower than .setMaster("local")
Wouldn't it be faster with 32 cores?
Well spark is not a Windows operating system, that it would work at maximum possible capacity from the start, you need to tune it for your usage.
Right now you just bluntly said to spark start and process my stuff on one node with 32 cores. That is not what Spark is good for. It is a distributed system suppose to be run on multi-node cluster, that is where it works best.
Reason is simple, even if you are using 32 core, what about IO issue?
Because now you are using let's if it has run 30 executors, than that is 32 process reading from same disk.
You specified 32 core, what about executor memory?
Did both machine had same ram, where you were testing.
You have specified now specifically that you want 32 partitions, if data is very small that is alot of overhead. Ideally you shouldn't specify partition until you know specifically what you are doing, or you are doing repetitive task, and you know data is going to be exactly similar all time.
If you tune it correctly spark with 32 core will indeed work faster than "local" which is basically running on one core.
We have partitioned a large number of our jobs to improve the overall performance of our application. We are now investigating running several of these partitioned jobs in parallel (kicked off by an external scheduler). The jobs are all configured to use the same fixes reply queue. As a test, I generated a batch job that has a parallel flow where partitioned jobs are executed in parallel. On a single server (local testing) it works fine. When I try on a multiple server cluster I see the remote steps complete but the parent step does not ever finish. I see the messages in the reply queue but they are never read.
Is this an expected problem with out approach or can you suggest how we can try to resolve the problem?
I am running the spring batch job in three machines. For example the database has 30 records, the batch job in each machine has to pick up unique 10 records and process it.
I read partitioning and Parallel processing and bit confused, which one is suitable?
Appreciate your help.
What you are describing is partitioning. Partitioning is when the input is broken up into partitions and each partition is processed in parallel. Spring Batch offers two different ways to execute partitioning, one is local using threads (via the TaskExecutorPartitionHandler). The other one is distributing the partitions via messages so they can be executed either locally or remotely via the MessageChannelPartitionHandler found in Spring Batch Admin's spring-batch-integration project. You can learn more about remote partitioning via my talk on multi-jvm batch processing here: http://www.youtube.com/watch?v=CYTj5YT7CZU