databricks Job cluster output Limits - pyspark

I am running a production job in databricks using cluster. During environment Initialization I have created a notebook which will include lot of print statements which is causing job cluster to exceed the output size and the job was failing.
I have tried to configure this parameter
spark.databricks.driver.disableScalaOutput true
Seems to be the above parameter does not working. Is there any other way we can tackle this issue.

a notebook which will include lot of print statements which is causing job cluster to exceed the output size and the job was failing.
As per the given information I have tried to repro the issue by writing the 999 print per Notebook with total 3 notebooks. Means total 2997 print statements in a notebook and it is working fine for me. Please refer below image.
The possible reasons for the error could be:
If you are using multiple display(), displayHTML(), show() commands in your notebook, this increases the amount of output. Once the output exceeds 20 MB, the error occurs.
If you are using multiple print() commands in your notebook, this can increase the output to stdout. Once the output exceeds 20 MB, the error occurs.
If you are running a streaming job and enable awaitAnyTermination in the cluster’s Spark config, it tries to fetch the entire output in a single request. If this exceeds 20 MB, the error occurs.
So, make sure the total output shouldn’t exceed 20 MB.
Solution:
Remove any unnecessary display(), displayHTML(), print(), and show(), commands in your notebook. These can be useful for debugging, but they are not recommended for production jobs.
If your job output is exceeding the 20 MB limit, try redirecting your logs to log4j or disable stdout by setting spark.databricks.driver.disableScalaOutput true in the cluster’s Spark config.

Related

Azure Synapse - How to stop an Apache Spark application / notebook?

When I run (in debug mode) a Spark notebook in Azure Synapse Analytics, it doesn't seem to shutdown as expected.
In the last cell I call: mssparkutils.notebook.exit("exiting notebook")
But then when I fire off another notebook (again in debug mode, same pool), I get this error:
AVAILABLE_COMPUTE_CAPACITY_EXCEEDED: Livy session has failed. Session state: Error. Error code: AVAILABLE_COMPUTE_CAPACITY_EXCEEDED. Your job requested 12 vcores. However, the pool only has 0 vcores available out of quota of 12 vcores. Try ending the running job(s) in the pool, reducing the numbers of vcores requested, increasing the pool maximum size or using another pool. Source: User.
So I go to Monitor => Apache Spark applications and I see my the first notebook I ran still in a "Running" status and I can manually stop it.
How do I automatically stop the Notebook / Apache Spark application? I thought that was the notebook.exit() call but apparently not...
In debug mode, the cluster's vcores are supplied to the notebook for the entire duration of the debug (that is one hour of inactivity or until you manually terminate it)
Thus, you have two options:
Work on one notebook at a time, closing the debug before starting another
OR
Configure the session to reduce the number of executors so that the spark cluster can provision all three debug modes at the same time (might need to increase the size of the cluster)

Azure Databricks error- The output of the notebook is too large. Cause: rpc response

Error Message - job failed with error message The output of the notebook is too large. Cause: rpc response (of 20972488 bytes) exceeds limit of 20971520 bytes
Details:
We are using databricks notebooks to run the job. Job is running on job cluster. This is a streaming job.
Job started failing with above mentioned error.
We do not have any display(), show(), print(), explain method in the job.
We are not using awaitAnyTermination method in the job as well.
We also tried adding "spark.databricks.driver.disableScalaOutput true" to the job but it still did not work. Job is failing with same error.
We have followed all the steps mentioned in this document - https://learn.microsoft.com/en-us/azure/databricks/kb/jobs/job-cluster-limit-nb-output
Do we have any option to resolve this issue or to find out exactly which commands output is causing it to go above 20MB limit.
See the docs regarding structured streaming in prod.
I would recommend migrating to workflows based on jar jobs because:
Notebook workflows are not supported with long-running jobs. Therefore we don’t recommend using notebook workflows in your streaming jobs.

Session isn't active Pyspark in an AWS EMR cluster

I have opened an AWS EMR cluster and in pyspark3 jupyter notebook I run this code:
"..
textRdd = sparkDF.select(textColName).rdd.flatMap(lambda x: x)
textRdd.collect().show()
.."
I got this error:
An error was encountered:
Invalid status code '400' from http://..../sessions/4/statements/7 with error payload: {"msg":"requirement failed: Session isn't active."}
Running the line:
sparkDF.show()
works!
I also created a small subset of the file and all my code runs fine.
What is the problem?
I had the same issue and the reason for the timeout is the driver running out of memory. Since you run collect() all the data gets sent to the driver. By default the driver memory is 1000M when creating a spark application through JupyterHub even if you set a higher value through config.json. You can see that by executing the code from within a jupyter notebook
spark.sparkContext.getConf().get('spark.driver.memory')
1000M
To increase the driver memory just do
%%configure -f
{"driverMemory": "6000M"}
This will restart the application with increased driver memory. You might need to use higher values for your data. Hope it helps.
From This stack overflow question's answer which worked for me
Judging by the output, if your application is not finishing with a FAILED status, that sounds like a Livy timeout error: your application is likely taking longer than the defined timeout for a Livy session (which defaults to 1h), so even despite the Spark app succeeds your notebook will receive this error if the app takes longer than the Livy session's timeout.
If that's the case, here's how to address it:
1. edit the /etc/livy/conf/livy.conf file (in the cluster's master node)
2. set the livy.server.session.timeout to a higher value, like 8h (or larger, depending on your app)
3. restart Livy to update the setting: sudo restart livy-server in the cluster's master
4. test your code again
Alternative way to edit this setting - https://allinonescript.com/questions/54220381/how-to-set-livy-server-session-timeout-on-emr-cluster-boostrap
Just a restart helped solve this problem for me. On your Jupyter Notebook, go to -->Kernel-->>Restart
Once done, if you run the cell with "spark" command you will see that a new spark session gets established.
You might get some insights from this similar Stack Overflow thread: Timeout error: Error with 400 StatusCode: "requirement failed: Session isn't active."
Solution might be to increase spark.executor.heartbeatInterval. Default is 10 seconds.
See EMR's official documentation on how to change Spark defaults:
You change the defaults in spark-defaults.conf using the spark-defaults configuration classification or the maximizeResourceAllocation setting in the spark configuration classification.
Insufficient reputation to comment.
I tried increasing heartbeat Interval to a much higher (100 seconds), still the same result. FWIW, the error shows up in < 9s.
What worked for me is adding {"Classification": "spark-defaults", "Properties": {"spark.driver.memory": "20G"}} to the EMR configuration.

Spark Streaming does not clean files in /tmp

I have a Spark Streaming job (2.3.1 - standalone cluster):
~50K RPS
10 executors - (2 core, 7.5Gb RAM, 10Gb disk GCP nodes)
Data rate ~20Mb/sec and job (while running) run ~0.5s on 1s batches.
The problem is that the temporary files that spark writes to /tmp are never cleaned out as the JVM on the executor never terminates. Short of some clever batch job to clean the /tmp directory I am looking to find the proper way to keep my job from crashing (and restarting) on no space left of device errors.
I have the following SPARK_WORKER_OPTS set as follows:
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1200 -Dspark.worker.cleanup.appDataTtl=345600"
I have experimented with both CMS and G1GC collectors - neither seemed to have an impacted other than modulating GC time.
I have been through most of the documented settings, and searched about but have not been able to find any additional directions to try. I have to believe that ppl are running much bigger jobs with a long running, stable stream and that this is a solved problem.
Other config bits:
spark-defaults:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.broadcast.compress true
spark.streaming.stopGracefullyOnShutdown true
spark.ui.reverseProxy true
spark.cleaner.periodicGC.interval 20
...
spark-submit: (nothing special)
/opt/spark/bin/spark-submit \
--master spark://6.7.8.9:6066 \
--deploy-mode cluster \
--supervise \
/opt/application/switch.jar
As it stands the job runs for ~90 minutes before the drives fill up and we crash. I could spin them up with larger drives, but 90 minutes should allow for the config options I've tried to have a go at cleanup at 20 minute intervals. Larger drives would likely just prolong the issue.
Drives filling up turned out to be a red herring. A cached DataSet inside the job was not being released from memory after ds.cache. Eventually things spilled to disk and our no remaining drive space condition resulted. Or, bit by premature optimization. Removing the explicit call to cache resolved the issue and performance was not impacted.

Spark: long delay between jobs

So we are running spark job that extract data and do some expansive data conversion and writes to several different files. Everything is running fine but I'm getting random expansive delays between resource intensive job finish and next job start.
In below picture, we can see that job that was scheduled at 17:22:02 took 15 min to finish, which means I'm expecting next job to be scheduled around 17:37:02. However, next job was scheduled at 22:05:59, which is +4 hours after job success.
When I dig into next job's spark UI it show <1 sec scheduler delay. So I'm confused to where does this 4 hours long delay is coming from.
(Spark 1.6.1 with Hadoop 2)
Updated:
I can confirm that David's answer below is spot on about how IO ops are handled in Spark is bit unexpected. (It makes sense to that file write essentially does "collect" behind the curtain before it writes considering ordering and/or other operations.) But I'm bit discomforted by the fact that I/O time is not included in job execution time. I guess you can see it in "SQL" tab of spark UI as queries are still running even with all jobs being successful but you cannot dive into it at all.
I'm sure there are more ways to improve but below two methods were sufficient for me:
reduce file count
set parquet.enable.summary-metadata to false
I/O operations often come with significant overhead that will occur on the master node. Since this work isn't parallelized, it can take quite a bit of time. And since it is not a job, it does not show up in the resource manager UI. Some examples of I/O tasks that are done by the master node
Spark will write to temporary s3 directories, then move the files using the master node
Reading of text files often occur on the master node
When writing parquet files, the master node will scan all the files post-write to check the schema
These issues can be solved by tweaking yarn settings or redesigning your code. If you provide some source code, I might be able to pinpoint your issue.
Discussion of writing I/O Overhead with Parquet and s3
Discussion of reading I/O Overhead "s3 is not a filesystem"
Problem:
I faced similar issue when writing parquet data on s3 with pyspark on EMR 5.5.1. All workers would finish writing data in _temporary bucket in output folder & Spark UI would show that all tasks have completed. But Hadoop Resource Manager UI would not release resources for the application neither mark it as complete. On checking s3 bucket, it seemed like spark driver was moving the files 1 by 1 from _temporary directory to output bucket which was extremely slow & all the cluster was idle except Driver node.
Solution:
The solution that worked for me was to use committer class by AWS ( EmrOptimizedSparkSqlParquetOutputCommitter ) by setting the configuration property spark.sql.parquet.fs.optimized.committer.optimization-enabled to true.
e.g.:
spark-submit ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
or
pyspark ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
Note that this property is available in EMR 5.19 or higher.
Result:
After running the spark job on EMR 5.20.0 using above solution, it did not create any _temporary directory & all the files were directly written to the output bucket, hence job finished very quickly.
Fore more details:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html