Spark Jobserver max-jobs-per-context - spark-jobserver

How would you determine a safe maximum value for the max-jobs-per-context setting, which controls the number of Spark jobs that can run concurrently on a context? The default is 8 (see link below), and I'd like to set it higher, but I'm not sure what happens if you set it too high.
https://github.com/spark-jobserver/spark-jobserver/blob/master/job-server/src/main/resources/application.conf

An approach that we use in production is to put a queue in front of Spark Jobserver and control job submission ourselves. There is no built-in queuing mechanism in SJS.
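If you do decide to raise the limit, a minimal sketch of the override (this assumes your deployment layers a custom .conf over the linked application.conf, and that the key still lives under spark.jobserver as it does in that file):
spark {
  jobserver {
    # hypothetical value; a safe ceiling depends on how much driver memory and
    # scheduler capacity each context can actually give to concurrent jobs
    max-jobs-per-context = 16
  }
}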

Related

How do I make sure I can keep the ingestion client running if I have heavy read operations on a table?

I am using InfluxDB line protocol to insert records into a table in QuestDB at a constant, high rate. I have multiple Postgres clients attached performing read operations; some are Grafana dashboards which do heavy aggregations across the table. It looks like when I refresh the dashboards, I'm hitting some issues:
... t.LineTcpConnectionContext [31] queue full, consider increasing queue size or number of writer jobs
Is there a way to make sure I don't kick the insert client out, or to increase the queue as mentioned in the error?
If you have one client writing InfluxDB line protocol over TCP, it's possible to dedicate a worker thread to this purpose. The config key for this is line.tcp.worker.count, and it can be set in a configuration file or via an environment variable. Setting one dedicated thread in server.conf would look like the following:
line.tcp.worker.count=1
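If you prefer environment variables over server.conf, the same setting would look like the following (assuming QuestDB's usual convention of upper-casing the key and prefixing it with QDB_):
QDB_LINE_TCP_WORKER_COUNT=1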

having Spark process partitions concurrently, using a single dev/test machine

I'm naively testing for concurrency in local mode, with the following Spark session:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
.builder
.appName("local-mode-spark")
.master("local[*]")
.config("spark.executor.instances", 4)
.config("spark.executor.cores", 2)
.config("spark.network.timeout", "10000001") // to avoid shutdown during debug, avoid otherwise
.config("spark.executor.heartbeatInterval", "10000000") // to avoid shutdown during debug, avoid otherwise
.getOrCreate()
and a mapPartitions API call as follows:
import spark.implicits._
val inputDF : DataFrame = spark.read.parquet(inputFile)
val resultDF : DataFrame =
  inputDF.as[T].mapPartitions(sparkIterator => new MyIterator).toDF
On the surface of it, this did surface one concurrency bug in my code contained in MyIterator (not a bug in Spark's code). However, I'd like to see my application crunch all available machine resources, both in production and during this testing, so that the chances of spotting additional concurrency bugs improve.
That is clearly not the case for me so far: my machine is only at very low CPU utilization throughout the heavy processing of the inputDF, while there's plenty of free RAM and the JVM Xmx poses no real limitation.
How would you recommend testing for concurrency on a local machine? The objective is to verify that, in production, Spark will not hit thread-safety or other concurrency issues in the code it applies from within MyIterator.
Can Spark, in local mode, even process separate partitions of my input dataframe in parallel? Can I get Spark to work concurrently on the same dataframe on a single machine, preferably in local mode?
Max parallelism
You are already running Spark in local mode using .master("local[*]").
local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).
Max memory available to all executors/threads
I see that you are not setting the driver memory explicitly. By default the driver memory is only 512M. If your local machine can spare more than this, set it explicitly. You can do that either by:
setting it in the properties file (default is spark-defaults.conf),
spark.driver.memory 5g
or by supplying configuration setting at runtime
$ ./bin/spark-shell --driver-memory 5g
Note that this cannot be achieved by setting it in the application, because by then it is already too late; the JVM process has already started with some amount of memory.
Nature of Job
Check the number of partitions in your dataframe. That essentially determines the maximum parallelism you can use.
inputDF.rdd.partitions.size
If the output of this is 1, your dataframe has only 1 partition, so you won't get concurrency when you do operations on it. In that case, you might have to tweak some configuration to create more partitions so that tasks can run concurrently.
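As an illustration, a minimal way to force more partitions before the mapPartitions call (the partition count here is just a placeholder; match it to your data size and core count):
// hypothetical: spread the input over one partition per local[*] thread
val numThreads = Runtime.getRuntime.availableProcessors()
val repartitionedDF = inputDF.repartition(numThreads)
println(repartitionedDF.rdd.partitions.size) // should now report numThreads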
Running local mode cannot simulate a production environment for the following reasons.
There is a lot of code that would normally run under any other cluster manager but gets bypassed in local mode. Among various issues, a few things that come to mind:
a. Inability to detect bugs in the way shuffles get handled (shuffle data is handled in a completely different way in local mode).
b. We will not be able to detect serialization-related issues, since all code is available to the driver and tasks run in the driver itself, so serialization problems never surface.
c. No speculative tasks (especially relevant for write operations).
d. Networking-related issues: all tasks are executed in the same JVM, so one would not be able to detect issues such as driver/executor communication or codegen-related problems.
Concurrency in local mode
a. The maximum concurrency that can be attained is equal to the number of cores on your local machine.
b. The job, stage, and task metrics shown in the Spark UI are not accurate, since they incur the overhead of running in the same JVM as the driver.
c. As for CPU/memory utilization, it depends on the operation being performed. Is the operation CPU- or memory-intensive?
When to use local mode
a. Testing of code that will run only on driver
b. Basic sanity testing of the code that will get executed on the executors
c. Unit testing
tl;dr: The concurrency bugs that occur in local mode might not even be present under other cluster resource managers, since there is a lot of special handling for local mode in the Spark code base (many places check isLocal, and control goes down a different code path altogether).
Yes!
Achieving parallelism in local mode is quite possible.
Check the amount of memory and CPU available on your local machine and supply values for the driver-memory and driver-cores settings when submitting your Spark job.
Increasing executor-memory and executor-cores will not make a difference in this mode.
Once the application is running, open up the Spark UI for the job. You can go to the Executors tab to check the amount of resources your Spark job is actually utilizing.
You can monitor the tasks that get generated, and the number of tasks your job runs concurrently, using the Jobs and Stages tabs.
In order to process data which is way larger than the resources available, ensure that you break your data into smaller partitions using repartition. This should allow your job to complete successfully.
Increase the default shuffle partitions if your job has aggregations or joins. Also ensure sufficient space on the local file system, since Spark creates intermediate shuffle files and writes them to disk.
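For example, both knobs can be set from the application itself (the values below are placeholders; choose them based on your data volume and core count):
// hypothetical values for a small local run
spark.conf.set("spark.sql.shuffle.partitions", "16") // partitions used for joins/aggregations
val df = spark.read.parquet(inputFile).repartition(16) // more, smaller input partitions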
Hope this helps!

Handling Skew data in apache spark production scenario

Can anyone explain how skewed data is handled in production for Apache Spark?
Scenario:
We submitted the Spark job using "spark-submit", and in the Spark UI we observed that a few tasks are taking a long time, which indicates the presence of skew.
Questions:
(1) What steps should we take (repartitioning, coalesce, etc.)?
(2) Do we need to kill the job, include the skew fixes in the jar, and re-submit the job?
(3) Can we solve this issue by running commands like coalesce directly from the shell, without killing the job?
Data skew is primarily a problem when applying non-reducing by-key (shuffling) operations. The two most common examples are:
Non-reducing groupByKey (RDD.groupByKey, Dataset.groupBy(Key).mapGroups, Dataset.groupBy.agg(collect_list)).
RDD and Dataset joins.
More rarely, the problem is related to the properties of the partitioning key and the partitioning function, with no pre-existing issue in the data distribution.
// All keys are unique - no obvious data skew
val rdd = sc.parallelize(Seq(0, 3, 6, 9, 12)).map((_, None))
// Drastic data skew
rdd.partitionBy(new org.apache.spark.HashPartitioner(3)).glom.map(_.size).collect
// Array[Int] = Array(5, 0, 0)
What steps should we take (repartitioning, coalesce, etc.)?
Repartitioning (never coalesce) can help you with the latter case by:
Changing the partitioner.
Adjusting the number of partitions to minimize the possible impact of skew (here you can use the same rules as for associative arrays: prime numbers and powers of two should be preferred, although this might not resolve the problem fully, as with 3 in the example above); a quick sketch follows below.
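Continuing the example above, the second point can be illustrated by picking a partition count that does not divide the key spacing (a sketch using the same rdd as before):
// same rdd as above, now with 4 partitions instead of 3
rdd.partitionBy(new org.apache.spark.HashPartitioner(4)).glom.map(_.size).collect
// Array[Int] = Array(2, 1, 1, 1)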
The former cases typically won't benefit much from repartitioning, because the skew is induced by the operation itself. Values with the same key cannot be spread across multiple partitions, and, given the non-reducing character of the process, the result is minimally affected by the initial data distribution.
These cases have to be handled by adjusting the logic of your application. It could mean a number of things in practice, depending on the data or problem:
Removing operation completely.
Replacing exact result with an approximation.
Using different workarounds (typically with joins), for example frequent-infrequent split, iterative broadcast join or prefiltering with probabilistic filter (like Bloom filter).
Do we need to kill the job, include the skew fixes in the jar, and re-submit the job?
Normally you have to at least resubmit the job with adjusted parameters.
In some cases (mostly RDD batch jobs) you can design your application to monitor task execution and to kill and resubmit a particular job when skew is likely, but that might be hard to implement correctly in practice.
In general, if data skew is possible, you should design your application to be immune to data skews.
Can we solve this issue by running commands like coalesce directly from the shell, without killing the job?
I believe this is already answered by the points above, but just to say - there is no such option in Spark. You can of course include these in your application.
We can fine-tune the query to reduce its complexity.
We can try the salting mechanism:
Salt the skewed column with a random number to create a better distribution of data across partitions, as sketched below.
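A minimal sketch of the idea in Spark SQL (skewedDF, smallDF, and the "key" column are hypothetical placeholders, and the number of salt buckets has to be tuned to the degree of skew):
import org.apache.spark.sql.functions._

val saltBuckets = 10 // hypothetical; higher values split the hot key more ways
// add a random salt to the skewed (large) side
val saltedFact = skewedDF.withColumn("salt", (rand() * saltBuckets).cast("int"))
// replicate each row of the small side once per salt value
val saltedDim = smallDF.withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))
// join on (key, salt) so rows for the hot key land in saltBuckets different partitions
val joined = saltedFact.join(saltedDim, Seq("key", "salt"))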
Spark 3 provides the Adaptive Query Execution (AQE) mechanism to avoid such scenarios in production.
Below are a couple of Spark properties that we can fine-tune accordingly:
spark.sql.adaptive.enabled=true
spark.databricks.adaptive.autoBroadcastJoinThreshold # Databricks-specific; dynamically changes sort-merge join to broadcast join, default size = 30 MB
spark.sql.adaptive.coalescePartitions.enabled=true # dynamically coalesces shuffle partitions
spark.sql.adaptive.advisoryPartitionSizeInBytes=64MB # default 64 MB
spark.sql.adaptive.coalescePartitions.minPartitionSize=1MB # default 1 MB
spark.sql.adaptive.coalescePartitions.minPartitionNum # default is 2x the number of cores
spark.sql.adaptive.skewJoin.enabled=true
spark.sql.adaptive.skewJoin.skewedPartitionFactor=5 # default 5
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB # default 256 MB

What is Starvation scenario in Spark streaming?

In the famous word count example for Spark Streaming, the Spark configuration object is initialized as follows:
/* Create a local StreamingContext with two working threads and batch interval of 1 second.
The master requires 2 cores to prevent from a starvation scenario. */
val sparkConf = new SparkConf().
setMaster("local[2]").setAppName("WordCount")
Here, if I change the master from local[2] to local, or do not set the master at all, I do not get the expected output; in fact, word counting does not happen at all.
The comment says:
"The master requires 2 cores to prevent from a starvation scenario" that's why they have done setMaster("local[2]").
Can somebody explain to me why it requires 2 cores and what the starvation scenario is?
From the documentation:
[...] note that a Spark worker/executor is a long-running task, hence it occupies one of the cores allocated to the Spark Streaming application. Therefore, it is important to remember that a Spark Streaming application needs to be allocated enough cores (or threads, if running locally) to process the received data, as well as to run the receiver(s).
In other words, one thread will be used to run the receiver and at least one more is necessary for processing the received data. For a cluster, the number of allocated cores must be more than the number of receivers, otherwise the system can not process the data.
Hence, when running locally, you need at least 2 threads and when using a cluster at least 2 cores need to be allocated to your system.
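A minimal sketch of where those two threads go in the word count example (the socket source is just one kind of receiver; the host and port are placeholders):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
val ssc = new StreamingContext(conf, Seconds(1))
// the receiver behind socketTextStream runs as a long-lived task and occupies one thread...
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print() // ...so the second thread is what actually processes each 1-second batch
ssc.start()
ssc.awaitTermination()
With local[1] the receiver takes the only available thread and no batch is ever processed, which is the starvation scenario the comment warns about.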
Starvation scenario refers to this type of problem, where some threads are not able to execute at all while others make progress.
There are two classical problems where starvation is well known:
Dining philosophers
Readers-writers problem: depending on how the threads are synchronized, either the readers or the writers can starve. It's also possible to make sure that no starvation occurs.

Dataproc Cluster with Spark 1.6.X using scala 2.11.X

I'm looking for a way to use Spark on Dataproc built with Scala 2.11. I want to use 2.11 since my job pulls in ~10 BigQuery tables and I'm using the new reflection libraries to map the corresponding objects to case classes. (There's a bug with the new reflection classes and concurrency which is only fixed in Scala 2.11.) I've tried working around this issue by setting executor-cores to 1, but the performance decrease is painful. Is there a better way?
In general, setting executor-cores to 1 is a reasonable way to work around concurrency issues, since third-party libraries you incorporate into your Spark jobs can have thread-safety problems of their own. The key point is that you should be able to resize the executors so that each has only 1 core without really sacrificing performance (the larger scheduling and YARN overhead might mean on the order of, say, ~10% performance decrease, but certainly nothing unmanageable).
I'm assuming you're referring to some multiplicative factor performance decrease due to, say, only using 2 out of 8 cores on an 8-core VM (Dataproc packs 2 executors per VM by default). The way to fix this is simply to also adjust spark.executor.memory down proportionally to match up with the 1 core. For example, in your cluster config (gcloud dataproc clusters describe your-cluster-name) if you use 4-core VMs you might see something like:
spark:spark.executor.cores: '2'
spark:spark.executor.memory: 5586m
YARN packs entirely based on memory, not cores, so this means 5586m is designed to fit in half a YARN node and thus corresponds to 2 cores. If you instead spin up your cluster like:
gcloud dataproc clusters create your-cluster-name \
--properties spark:spark.executor.cores=1,spark:spark.executor.memory=2000m
Then you should end up with a setup which still uses all the cores, but without concurrency issues (one worker thread in each executor process only).
I didn't just use 5586/2 in this case because you have to factor in spark:spark.yarn.executor.memoryOverhead as well, so basically you have to add in the memoryOverhead, then divide by two, then subtract the memoryOverhead again to determine the new executor size, and beyond that the allocations also round to the next multiple of a base chunk size, which I believe is 512m.
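As a rough worked example (assuming the Spark 1.6 default of memoryOverhead = max(384m, 10% of executor memory); your cluster's setting may differ):
5586m heap + ~559m overhead ≈ 6145m of YARN memory per 2-core executor
6145m / 2 ≈ 3072m of YARN memory available per 1-core executor
3072m - 384m overhead ≈ 2688m of heap, so a round figure like 2000m-2500m leaves comfortable headroom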
In general, you can use trial-and-error by starting a bit lower on the memory allocation per core, and then increasing it if you find you need more memory headroom.
You don't have to redeploy a cluster to check these either; you can specify these at job submission time instead for faster turnaround:
gcloud dataproc jobs submit spark \
--cluster your-cluster-name \
--properties spark.executor.cores=1,spark.executor.memory=2000m