GC in a spark/scala job? - scala

I've inherited two spark scala jobs, which have previously been launched sequentially, but independently:
dse spark-submit <job1 params>
dse spark-submit <job2 params>
Both jobs more-or-less max out the system resources, but run successfully with the current --executor-memory/--driver-memory/--memory-fraction settings.
There is considerable overlap between the two jobs, and we would like to merge them into a single task. It seemed reasonable to define a new App which launches each job in turn:
def main(args: Array[String]) {
  processJob1(args)
  processJob2(args)
}
Watching the Spark console, job 1 runs and completes, then job 2 begins but runs into out-of-memory errors.
In Java, I believe an OOM error usually means that GC ran and there still wasn't enough free space afterwards. Being new to Spark, I have a few questions.
How does Spark 'scope' an RDD? For example, shouldn't any objects defined in job 1 be marked for GC before job 2 begins? Would the OOM failure in this case then indicate a memory leak in job 1?
Will Scala block the workers while a GC is going on? Would it even know, or would the jobs just run more slowly?
GC is run by the JVM. Does this mean that something like System.gc() would only trigger a GC on the master node? Is it possible to trigger a GC at the workers?
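Not from the original question, but since the scoping issue is the crux: a minimal sketch of how job 1's cached RDDs could be released explicitly before job 2 starts, rather than relying on driver-side GC to eventually clean them up. It assumes (hypothetically) that processJob1 is refactored to return the RDDs it cached; processJob2 and the placeholder bodies are likewise illustrative.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object MergedJobs {
  // Hypothetical refactor: each job takes the shared SparkContext, and
  // job 1 returns the RDDs it cached so the caller can release them.
  def processJob1(sc: SparkContext, args: Array[String]): Seq[RDD[_]] = Seq.empty // placeholder
  def processJob2(sc: SparkContext, args: Array[String]): Unit = ()               // placeholder

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("merged-jobs"))
    val cachedByJob1 = processJob1(sc, args)
    // A blocking unpersist asks the executors to drop job 1's cached blocks now,
    // rather than waiting for the ContextCleaner to notice the RDDs are unreachable.
    cachedByJob1.foreach(_.unpersist(blocking = true))
    processJob2(sc, args)
    sc.stop()
  }
}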

Related

Implementing exponential backoff in spark between consecutive attempts

On our AWS EMR Spark cluster we have 100-150 jobs running at a given time. Each cluster task node is allocated 7 of 8 cores. But at a certain hour of the day the load average touches 10, causing task failures, lost tasks, lost executors, and missed heartbeats. If we decrease the allocated cores per node, the cluster is underutilised (load average around 4).
To avoid lost executors and lost tasks we tried increasing maxAttempts, but all the attempts happen within that same hour window and the job still fails.
So, as an interim measure, we are thinking of
putting a try/catch in the main method, so that "executor lost" and "max attempts exceeded" exceptions can be caught and the code re-triggered.
The EMR core nodes have 6 of 8 cores allocated to Spark; maybe that is why the driver never crashed.
So, the questions:
Will re-triggering the main code in the same JVM several times have any side effects? It can happen that some DataFrames created in the previous run get recreated, re-cached, or have their SQL re-executed.
If this is not the right approach, are there any suggestions on how to get exponential backoff?
We are using Spark 3.3 on AWS EMR 6.x. The Spark job is triggered via a shell script.
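No answer was posted here, but as a sketch of the try/catch idea with exponential backoff (not an official recipe): runJob, maxAttempts and the base delay below are hypothetical, and re-running in the same JVM still carries the caveats about re-created or re-cached DataFrames raised above.

import scala.util.control.NonFatal

object BackoffRunner {
  // Hypothetical entry point wrapping the real Spark work.
  def runJob(args: Array[String]): Unit = ()

  def main(args: Array[String]): Unit = {
    val maxAttempts = 5
    val baseDelayMs = 60000L // 1 minute
    var attempt = 1
    var succeeded = false
    while (!succeeded && attempt <= maxAttempts) {
      try {
        runJob(args)
        succeeded = true
      } catch {
        case NonFatal(e) if attempt < maxAttempts =>
          // Exponential backoff: 1, 2, 4, 8, ... times the base delay.
          val delayMs = baseDelayMs * (1L << (attempt - 1))
          println(s"Attempt $attempt failed (${e.getMessage}); retrying in ${delayMs / 1000}s")
          Thread.sleep(delayMs)
          attempt += 1
      }
    }
  }
}

On the final attempt the guard no longer matches, so the last failure propagates and the job fails as before.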

Cluster Resource Usage in Databricks

I was just wondering if anyone could explain if all compute resources in a Databricks cluster are shared or if the resources are tied to each worker. For example, if two users were connected to a cluster made up of 2 workers with 4 cores per worker and one user's job required 2 cores and the other's required 6 cores, would they be able to share the 8 total cores or would the full 4 cores from one worker be unavailable during the job that only required 2 cores?
TL;DR: Yes, the default behavior is to allow sharing, but you're going to have to tightly control the default parallelism with such a small cluster.
Take a look at Job Scheduling for Apache Spark. I'm assuming you are using an "all-purpose" / "interactive" cluster where users are working on notebooks OR you are submitting jobs to an existing, all-purpose cluster and it is NOT a job cluster with multiple spark applications being deployed.
Databricks Runs in FAIR Scheduling Mode by Default
Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
By default, all queries started in a notebook run in the same fair scheduling pool.
The Apache Spark scheduler in Azure Databricks automatically preempts tasks to enforce fair sharing.
Apache Spark Defaults to FIFO
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
Keep in mind the word "job" is a specific Spark term that represents an action being taken, which launches one or more stages and tasks. See What is the concept of application, job, stage and task in spark?
So in your example you have...
2 Workers with 4 cores each == 8 cores == 8 tasks can be handled in parallel
One application (App A) that has a job that launches a stage with only 2 tasks.
One application (App B) that has a job that launches a stage with 6 tasks.
In this case, YES, you will be able to share the resources of the cluster. However, the devil is in the default behaviors. If you're reading from many files, performing a join, aggregating, etc, you're going to run into the fact that Spark is going to partition your data into chunks that can be acted on in parallel (see configuration like spark.default.parallelism).
So, in a more realistic example, you're going to have...
2 Workers with 4 cores each == 8 cores == 8 tasks can be handled in parallel
One application (App A) that has a job that launches a stage with 200 tasks.
One application (App B) that has a job that launches three stages with 8, 200, and 1 tasks, respectively.
In a scenario like this, FIFO scheduling (the default) will result in one of these applications blocking the other, since the number of executor cores is completely overwhelmed by the number of tasks in just one stage.
In FAIR scheduling mode there will still be some blocking, since the number of executor cores is small, but some work will be done on each job because FAIR scheduling round-robins at the task level.
In Apache Spark, you get tighter control by creating different resource pools and submitting apps only to those pools, where they have "isolated" resources (a minimal sketch follows). The "better" way of doing this is with Databricks job clusters that have isolated compute dedicated to the application being run.
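As an illustration of the pool idea (not part of the original answer), a minimal sketch for plain Apache Spark: switch the scheduler to FAIR and tag work from a given thread with a named pool. The pool name and the fairscheduler.xml path are hypothetical.

import org.apache.spark.sql.SparkSession

object FairPoolsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fair-pools-sketch")
      .config("spark.scheduler.mode", "FAIR")
      // Optional: pool weights/minShare are defined in an allocation file.
      .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // hypothetical path
      .getOrCreate()

    // Jobs submitted from this thread go to "poolA"; work submitted from another
    // thread could set "poolB" and be scheduled round-robin against it.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "poolA")
    spark.range(0, 1000000L).count()

    spark.sparkContext.setLocalProperty("spark.scheduler.pool", null) // back to the default pool
    spark.stop()
  }
}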

How to safely restart Airflow and kill a long-running task?

I have Airflow running in Kubernetes using the CeleryExecutor. Airflow submits and monitors Spark jobs using the DatabricksOperator.
My streaming Spark jobs have a very long runtime (they run forever unless they fail or are cancelled). When Airflow worker pods are killed while a streaming job is running, the following happens:
Associated task becomes a zombie (running state, but no process with heartbeat)
Task is marked as failed when Airflow reaps zombies
Spark streaming job continues to run
How can I force the worker to kill my Spark job before it shuts down?
I've tried killing the Celery worker with a TERM signal, but apparently that causes Celery to stop accepting new tasks and wait for current tasks to finish (docs).
You need to be clearer about the issue. If you are saying that the Spark cluster finishes the job as expected and on_kill is not being called, that is expected behavior. As per the docs, the on_kill function is for cleaning up after a task gets killed.
def on_kill(self) -> None:
    """
    Override this method to cleanup subprocesses when a task instance
    gets killed. Any use of the threading, subprocess or multiprocessing
    module within an operator needs to be cleaned up or it will leave
    ghost processes behind.
    """
In your case, when you manually kill the job, it is doing what it has to do.
Now if you want a clean-up even after successful completion of the job, override the post_execute function.
As per the docs, post_execute is:
def post_execute(self, context: Any, result: Any = None):
    """
    This hook is triggered right after self.execute() is called.
    It is passed the execution context and any results returned by the
    operator.
    """

Spark: long delay between jobs

So we are running a Spark job that extracts data, does some expensive data conversion, and writes to several different files. Everything is running fine, but I'm getting random, lengthy delays between a resource-intensive job finishing and the next job starting.
In the picture below, we can see that the job scheduled at 17:22:02 took 15 minutes to finish, which means I expected the next job to be scheduled around 17:37:02. However, the next job was scheduled at 22:05:59, which is +4 hours after the job succeeded.
When I dig into the next job's Spark UI, it shows a <1 sec scheduler delay. So I'm confused about where this 4-hour delay is coming from.
(Spark 1.6.1 with Hadoop 2)
Updated:
I can confirm that David's answer below is spot on: the way I/O ops are handled in Spark is a bit unexpected. (It makes sense that a file write essentially does a "collect" behind the curtain before it writes, considering ordering and/or other operations.) But I'm a bit discomforted by the fact that I/O time is not included in job execution time. I guess you can see it in the "SQL" tab of the Spark UI, since queries are still running even after all jobs succeed, but you cannot dive into it at all.
I'm sure there are more ways to improve, but the two methods below were sufficient for me (see the sketch after this list):
reduce file count
set parquet.enable.summary-metadata to false
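Not from the original post, but a minimal sketch of applying those two changes in code, assuming a Spark 1.6-era SQLContext as in the question; the paths and the coalesce target are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ReduceWriteOverhead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reduce-write-overhead"))
    val sqlContext = new SQLContext(sc)

    // Skip writing the _metadata/_common_metadata summary files after the parquet write.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    val df = sqlContext.read.parquet("/input/path") // hypothetical input
    // Fewer output files means fewer single-threaded moves/commits after the tasks finish.
    df.coalesce(64).write.parquet("/output/path")   // hypothetical file count and output
    sc.stop()
  }
}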
I/O operations often come with significant overhead that falls on the master node. Since this work isn't parallelized, it can take quite a bit of time. And since it is not a job, it does not show up in the resource manager UI. Some examples of I/O tasks that are done by the master node:
Spark will write to temporary s3 directories, then move the files using the master node
Reading of text files often occurs on the master node
When writing parquet files, the master node will scan all the files post-write to check the schema
These issues can be solved by tweaking yarn settings or redesigning your code. If you provide some source code, I might be able to pinpoint your issue.
Discussion of writing I/O Overhead with Parquet and s3
Discussion of reading I/O Overhead "s3 is not a filesystem"
Problem:
I faced a similar issue when writing parquet data to S3 with PySpark on EMR 5.5.1. All workers would finish writing data to the _temporary directory in the output folder, and the Spark UI would show that all tasks had completed. But the Hadoop Resource Manager UI would neither release resources for the application nor mark it as complete. On checking the S3 bucket, it appeared that the Spark driver was moving the files one by one from the _temporary directory to the output bucket, which was extremely slow, and the whole cluster was idle except for the driver node.
Solution:
The solution that worked for me was to use the committer class from AWS (EmrOptimizedSparkSqlParquetOutputCommitter) by setting the configuration property spark.sql.parquet.fs.optimized.committer.optimization-enabled to true.
e.g.:
spark-submit ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
or
pyspark ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
Note that this property is available in EMR 5.19 or higher.
Result:
After running the Spark job on EMR 5.20.0 using the above solution, it did not create any _temporary directory and all the files were written directly to the output bucket, so the job finished very quickly.
For more details:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html
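If you would rather set the property in code than on the spark-submit command line, here is a minimal sketch (assuming a SparkSession, i.e. Spark 2.x+ on EMR 5.19 or later):

import org.apache.spark.sql.SparkSession

object EmrOptimizedCommitterJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("emr-optimized-committer")
      // Enables the EMRFS S3-optimized committer; only effective on EMR 5.19.0+.
      .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
      .getOrCreate()
    // ... write parquet to S3 as usual ...
    spark.stop()
  }
}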

apache spark: local[K] master URL - job gets stuck

I am using apache spark 0.8.0 to process a large data file and perform some basic .map and .reduceByKey operations on the RDD.
Since I am using a single machine with multiple processors, I specify local[8] in the master URL field while creating the SparkContext:
val sc = new SparkContext("local[8]", "Tower-Aggs", SPARK_HOME )
But whenever I specify multiple processors, the job gets stuck (pauses/halts) randomly. There is no definite place where it gets stuck; it's just random. Sometimes it won't happen at all. I am not sure whether it continues after that, but it stays stuck for a long time, after which I abort the job.
But when I just use local in place of local[8], the job runs seamlessly without getting stuck ever.
val sc = new SparkContext("local", "Tower-Aggs", SPARK_HOME )
I am not able to understand where the problem is.
I am using Scala 2.9.3 and sbt to build and run the application
I'm using Spark 1.0.0 and met the same problem: if a function passed to a transformation or action waits or loops indefinitely, Spark won't wake it up or terminate/retry it by default, in which case you can kill the task manually.
However, a recent feature (speculative execution) allows Spark to launch duplicate copies of tasks that take much longer than the average running time of their peers. This can be enabled and configured with the following config properties (a sketch follows the list):
spark.speculation (default: false): If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.
spark.speculation.interval (default: 100): How often Spark will check for tasks to speculate, in milliseconds.
spark.speculation.quantile (default: 0.75): Percentage of tasks which must be complete before speculation is enabled for a particular stage.
spark.speculation.multiplier (default: 1.5): How many times slower a task is than the median to be considered for speculation.
(source: http://spark.apache.org/docs/latest/configuration.html)
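Not part of the original answer, but a minimal sketch of turning speculation on programmatically with the values quoted above, assuming the Spark 1.x SparkConf API (on very old versions such as 0.8 the same keys would be set via system properties instead).

import org.apache.spark.{SparkConf, SparkContext}

object SpeculativeLocalJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[8]")
      .setAppName("Tower-Aggs")
      // Re-launch straggler tasks that run much slower than their peers.
      .set("spark.speculation", "true")
      .set("spark.speculation.interval", "100")    // check for stragglers every 100 ms
      .set("spark.speculation.quantile", "0.75")   // wait until 75% of the stage's tasks finish
      .set("spark.speculation.multiplier", "1.5")  // 1.5x slower than the median triggers speculation

    val sc = new SparkContext(conf)
    // ... the .map / .reduceByKey work from the question would go here ...
    sc.stop()
  }
}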