Azure Databricks error: "The output of the notebook is too large. Cause: rpc response" (Scala)

Error message: the job failed with "The output of the notebook is too large. Cause: rpc response (of 20972488 bytes) exceeds limit of 20971520 bytes".
Details:
We are using Databricks notebooks to run the job. The job runs on a job cluster and is a streaming job.
The job started failing with the error mentioned above.
We do not have any display(), show(), print(), or explain() calls in the job.
We are not using the awaitAnyTermination method in the job either.
We also tried adding "spark.databricks.driver.disableScalaOutput true" to the job, but it still did not work; the job fails with the same error.
We have followed all the steps mentioned in this document - https://learn.microsoft.com/en-us/azure/databricks/kb/jobs/job-cluster-limit-nb-output
Is there any option to resolve this issue, or to find out exactly which command's output is pushing it above the 20 MB limit?

See the docs regarding Structured Streaming in production.
I would recommend migrating to workflows based on JAR jobs, because:
"Notebook workflows are not supported with long-running jobs. Therefore we don't recommend using notebook workflows in your streaming jobs."
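As a rough illustration of that suggestion, here is a minimal sketch of a streaming entry point that could be packaged and run as a standalone (non-notebook) task, so nothing is ever returned as notebook output. The paths and the source/sink format are placeholders, and PySpark is used only for illustration; the original job is Scala, and a JAR job would be written in Scala or Java.

# Minimal sketch of a standalone streaming entry point (placeholder paths and
# formats; shown in PySpark for illustration, the original job is Scala).
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("streaming-job").getOrCreate()

    events = (spark.readStream
              .format("delta")                    # assumed source format
              .load("/mnt/source/events"))        # placeholder path

    query = (events.writeStream
             .format("delta")                     # assumed sink format
             .option("checkpointLocation", "/mnt/checkpoints/events")  # placeholder
             .start("/mnt/target/events"))        # placeholder path

    # In a standalone job the driver blocks on the query itself; there is no
    # notebook cell output that can hit the 20 MB limit.
    query.awaitTermination()

if __name__ == "__main__":
    main()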

Related

ADF Dataflow stuck in progress and failing with the errors below

The ADF pipeline Data Flow task is stuck in progress. It was working seamlessly for the last couple of months, but suddenly the Data Flow gets stuck in progress and times out after a certain time. We are using a managed virtual network IR. I am using a ForEach loop to run the data flow for multiple entities in parallel, and it always randomly gets stuck on the last entity.
What can I try to resolve this?
Error in dev environment:
Error code: 4508 - Spark cluster not found
Error in prod environment:
Error code: 5000
Failure type: User configuration issue
Details: [plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
I tried the following steps:
Changing the IR configuration
Data Flow retry and retry interval
Running the ForEach loop one batch at a time instead of 4 batches in parallel
None of the above troubleshooting steps worked. These pipelines have been running for the last 3-4 months without a single failure; suddenly they started failing consistently 3 days ago. The data flow always gets stuck in progress, randomly on different entities, and eventually times out, throwing the errors above.
Error code 4508, "Spark cluster not found":
This error can occur for two reasons.
The first is that the debug session is closed before the data flow finishes its transformation; in that case the recommendation is to restart the debug session.
The second is a resource problem, or an outage in that particular region.
Error code 5000, failure type "User configuration issue", details: [plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
The message "Livy job state dead caused by unknown error" indicates a temporary error. The data flow uses a Spark cluster in the backend, and this error is generated by that cluster; to get more information about the error, check the StdOut of the Spark pool execution.
The backend cluster may be experiencing a network problem, a resource problem, or an outage.
If the error persists, my suggestion is to raise a Microsoft support ticket.

Dataflow job with ProfilingOptions profile_cpu=True not writing profile files

I am trying to profile the CPU usage of a Dataflow pipeline job, run on the Apache Beam Python 3.7 SDK, version 2.27.0. I triggered the job with the --profile_cpu and --profile_location arguments set, and can see that they are set in the Dataflow console:
Dataflow Pipeline Options showing that profile_cpu and profile_location are set.
However, after the job completed there were no files written to the profile_location GCS bucket.
When looking at the Dataflow logs with jsonPayload.logger:"apache_beam.utils.profiler:profiler.py" I can see the logs that say "Start profiling" and "Stop profiling":
Logs showing the "Start profiling" and "Stop profiling" messages from the Profiler.
but there are no logs corresponding to the "Copying profiler data to:" step, even though profile_location is set in the ProfilingOptions and should therefore be set on the Profiler. Any advice on what could be going wrong, or on whether this functionality is currently supported, would be very helpful.
This was resolved by using the --experiments=use_runner_v2 flag. It looks like this is only supported on Dataflow Runner v2, which has not yet been rolled out as the default runner.
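For reference, a minimal sketch of how those options can be combined when constructing the pipeline in code (the project, region, and bucket names below are placeholders, not values from the original job):

# Sketch of launching with profiling enabled plus the Runner v2 experiment
# (placeholder project, region, and bucket names).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    flags=[],                                    # don't read sys.argv
    runner="DataflowRunner",
    project="my-project",                        # placeholder
    region="us-central1",                        # placeholder
    temp_location="gs://my-bucket/temp",         # placeholder
    profile_cpu=True,
    profile_location="gs://my-bucket/profiles",  # placeholder
    experiments=["use_runner_v2"],
)

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)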

What caused Druid tasks to fail?

I set up a Druid cluster (10 nodes) and am ingesting Kafka data using the indexing service. However, I found that many tasks fail, as shown below, yet some data does exist in the segments, so I am not sure whether all of the data has been pushed into the segments.
(list of failed tasks)
Besides that, I looked at the logs of some failed tasks and found no fatal error messages. I have posted a log file; please help me figure out what caused the tasks to fail. Thanks so much.
(one log from a failed task)
There are two questions I want to ask: one is how to confirm that all consumed data has been pushed into the segments, and the other is what caused the task failures.
This looks to be a Hadoop issue where multiple threads try to write to the same file at the same time; you need to set overwrite=false.
Check whether you are running multiple ingestion tasks for the same segments.
You can refer to the link below for further debugging:
https://community.hortonworks.com/questions/139150/no-lease-on-file-inode-5425306-file-does-not-exist.html

Getting a warning in a PySpark job on Google Dataproc for each part file when writing files directly to Google Storage

I am getting this warning for each part file the Spark job creates when writing to Google Storage:
17/08/01 11:31:47 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://temp_bucket/output/part-09698
17/08/01 11:31:47 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://temp_bucket/output/part-09698 - removing from cache
The Spark job has 10 stages, and this warning appears after 9 stages. Since the Spark job creates ~11,500 part files, the warning is logged for each of them. Because of this warning my Spark job runs for an extra 15 minutes, and since I am running around 80 such jobs, I am losing a lot of time and incurring a lot of cost.
Is there a way to suppress this warning?
Recent changes have made it safe to disable the enforced list-consistency entirely; future releases are expected to phase it out gradually. Try the following in your job properties to disable the CacheSupplementedGoogleCloudStorage:
--properties spark.hadoop.fs.gs.metadata.cache.enable=false
Or if you're creating a new Dataproc cluster, in your cluster properties:
--properties core:fs.gs.metadata.cache.enable=false
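For reference, if you submit through the gcloud CLI, the property is typically passed like this (the script, cluster, and region names are placeholders):

# Per-job (placeholder job file, cluster, and region):
gcloud dataproc jobs submit pyspark my_job.py \
    --cluster=my-cluster --region=us-central1 \
    --properties=spark.hadoop.fs.gs.metadata.cache.enable=false

# Or cluster-wide at creation time:
gcloud dataproc clusters create my-cluster --region=us-central1 \
    --properties=core:fs.gs.metadata.cache.enable=false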

Spark: long delay between jobs

We are running a Spark job that extracts data, does some expensive data conversion, and writes to several different files. Everything runs fine, but I'm getting random, expensive delays between a resource-intensive job finishing and the next job starting.
In the picture below, we can see that the job scheduled at 17:22:02 took 15 minutes to finish, which means I expected the next job to be scheduled around 17:37:02. However, the next job was scheduled at 22:05:59, more than 4 hours after the previous job succeeded.
When I dig into the next job's Spark UI it shows <1 sec scheduler delay, so I'm confused about where this 4-hour delay is coming from.
(Spark 1.6.1 with Hadoop 2)
Update:
I can confirm that David's answer below is spot on: the way I/O operations are handled in Spark is a bit unexpected. (It makes sense that the file write essentially does a "collect" behind the curtain before writing, considering ordering and/or other operations.) But I'm a bit uncomfortable with the fact that I/O time is not included in the job execution time. I guess you can see it in the "SQL" tab of the Spark UI, as queries are still running even though all jobs show as successful, but you cannot drill into them at all.
I'm sure there are more ways to improve, but the two methods below were sufficient for me (a rough sketch follows the list):
reduce file count
set parquet.enable.summary-metadata to false
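A rough sketch of both mitigations in PySpark (it assumes `sc` is the active SparkContext and `df` is the DataFrame being written; the partition count and output path are placeholders):

# 1) Disable Parquet summary metadata so the driver does not have to touch
#    every part file after the write (can also be passed on spark-submit as
#    --conf spark.hadoop.parquet.enable.summary-metadata=false):
sc._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false")

# 2) Reduce the number of output files before writing:
df.coalesce(200).write.parquet("s3://my-bucket/output/")  # placeholder count and path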
I/O operations often come with significant overhead that occurs on the master node. Since this work isn't parallelized, it can take quite a bit of time, and since it is not a job, it does not show up in the resource manager UI. Some examples of I/O tasks that are done by the master node:
Spark writes to temporary S3 directories, then moves the files using the master node
Reading of text files often happens on the master node
When writing Parquet files, the master node scans all the files post-write to check the schema
These issues can be solved by tweaking YARN settings or redesigning your code. If you provide some source code, I might be able to pinpoint your issue.
Discussion of writing I/O Overhead with Parquet and s3
Discussion of reading I/O Overhead "s3 is not a filesystem"
Problem:
I faced a similar issue when writing Parquet data to S3 with PySpark on EMR 5.5.1. All workers would finish writing data to the _temporary directory in the output folder, and the Spark UI would show that all tasks had completed, but the Hadoop Resource Manager UI would neither release resources for the application nor mark it as complete. On checking the S3 bucket, it appeared that the Spark driver was moving the files one by one from the _temporary directory to the output bucket, which was extremely slow, and the whole cluster was idle except for the driver node.
Solution:
The solution that worked for me was to use the committer class provided by AWS (EmrOptimizedSparkSqlParquetOutputCommitter) by setting the configuration property spark.sql.parquet.fs.optimized.committer.optimization-enabled to true.
e.g.:
spark-submit ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
or
pyspark ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
Note that this property is available in EMR 5.19 or higher.
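If you would rather set it in code than on the submit command line, the equivalent session configuration would look something like this (a sketch; the app name, data, and output path are placeholders, and it assumes EMR 5.19+ where the optimized committer is available):

# In-code equivalent of the --conf flag above (placeholder app name and path).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parquet-to-s3")
         .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
         .getOrCreate())

df = spark.range(1000)                                        # stand-in for real data
df.write.mode("overwrite").parquet("s3://my-bucket/output/")  # placeholder path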
Result:
After running the Spark job on EMR 5.20.0 with the above solution, it did not create any _temporary directory and all the files were written directly to the output bucket, so the job finished very quickly.
For more details:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html