After my PySpark job had been running for a while, it failed with a "Task Lease Expired" error. When I tried to re-submit the job, it gave "Task not acquired" and the log field was empty.
What could be the reason, and how should I diagnose this issue?
1 Master node: n1-standard-4 (4 vCPUs, 15 GB memory)
4 Worker nodes: n1-standard-1 (1 vCPU, 3.75 GB memory)
Edit:
The cluster appears healthy in the GCP console, but it no longer "acquires" any jobs. I have to recreate a new cluster to run the same job, which seems OK so far.
This question is quite old, but my answer would be:
Check your cluster's health in the YARN UI rather than in the GCP console. Any problem should show up there, for example workers that are not available.
If the YARN UI looks fine and you submitted the job with gcloud, some internal GCP process may have gotten lost, so try restarting the cluster first. If that doesn't help, recreating it is the option, as you mention.
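To make those checks concrete, here is a rough sketch of what I would run from the master node over SSH. The cluster name and zone are placeholders, and restarting the Dataproc agent service is my own assumption about what "restarting first" can mean, not an official procedure:
gcloud compute ssh my-cluster-m --zone europe-west1-b   # hypothetical cluster name and zone
yarn node -list -all                                    # all workers should show up as RUNNING
sudo systemctl status google-dataproc-agent             # the agent that picks up submitted jobs
sudo systemctl restart google-dataproc-agent            # hypothetical remedy before recreating the cluster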
I have an AWS EMR cluster executing a Spark Streaming job. It takes streaming data from a Kinesis stream and processes it. It works fine for a few days, but after 12-15 days the cluster terminates automatically. In the events tab, it shows
cluster has terminated with errors with a reason of STEP_FAILURE.
Does anyone have any idea why a step failure can occur when the step ran successfully for a few days?
Go to the EMR console and check the step's option. If it is set as follows:
Action on failure: Terminate cluster
then the cluster will be terminated when the step fails.
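If you want the cluster to survive a failed step, one option is to submit the step with "Action on failure" set to continue instead. As a sketch via the AWS CLI (the cluster ID, step name, class and JAR path below are placeholders):
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name=KinesisStreamingJob,ActionOnFailure=CONTINUE,Args=[--class,com.example.StreamingJob,s3://my-bucket/streaming-job.jar]
Note that with CONTINUE the cluster keeps running after a failure, so you will want alerting on failed steps rather than relying on the cluster terminating.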
I am using JupyterHub on an AWS EMR cluster, EMR version 5.16.
I submitted a Spark application using a pyspark3 notebook.
My application is trying to write 1 TB of data to S3.
I am using the EMR autoscaling feature to scale up the task nodes.
Hardware configurations:
1. Master node: 32 GB RAM with 16 cores
2. Core node: 32 GB RAM with 16 cores
3. Task nodes: 16 GB RAM with 8 cores each (scales up to 15 task nodes)
I have observed that the Spark application gets killed after running for 50 to 60 minutes.
I tried debugging:
1. My cluster still had room to scale up, so it is not an issue of resource shortage.
2. The Livy session also gets killed.
3. In the job log, I saw the error message RECVD TERM SIGNAL "Shutdown hook received".
Please note:
1. I have set spark.dynamicAllocation.enabled=true
2. I am using the YARN fair scheduler with user impersonation in JupyterHub
Can you please help me understand the problem and its solution?
I think I faced the same problem, and I found the solution thanks to this answer.
The issue comes from the Livy configuration parameter livy.server.session.timeout, which by default sets the session timeout to 1 hour.
You can override it by adding the following entry to the EMR cluster's configurations:
[{"Classification": "livy-conf", "Properties": {"livy.server.session.timeout": "5h"}}]
This solved the issue for me.
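If you create the cluster with the AWS CLI, the same classification can be supplied from a JSON file at creation time. A sketch, where the file name and the rest of the create command are placeholders:
livy-conf.json:
[{"Classification": "livy-conf", "Properties": {"livy.server.session.timeout": "5h"}}]
aws emr create-cluster --release-label emr-5.16.0 --configurations file://livy-conf.json ... (your other create-cluster options)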
I deployed a Twitter Heron cluster with Aurora and Mesos. The components of the cluster are as follows:
Scheduler: Aurora scheduler
State Manager: zookeeper
Uploader: HDFS
The Aurora instances are always in PENDING status after I submit the example topology named WordCountTopology. The following are screenshots of the running cluster.
Mesos agents:
Aurora scheduler:
Where is the problem? Is it that the machines' resources in the cluster cannot meet the needs of the topology's tasks? Thanks for your help.
Well, the message in the lower screenshot is a clear indication that you did not assign enough memory to the agents.
481 MB and 485 MB don't seem to be enough.
It definitely looks like a resourcing issue.
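As a rough illustration only (the master address, work dir and resource sizes below are placeholders, not values from your setup), you can start each Mesos agent with an explicitly larger memory offer so Aurora has enough to place the topology's tasks:
mesos-agent --master=zk://zk-host:2181/mesos \
            --work_dir=/var/lib/mesos \
            --resources='cpus:4;mem:8192'
After restarting the agents with more memory, re-submit WordCountTopology and the pending Aurora instances should get scheduled.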
I'm running multiple Dataproc clusters for various Spark Streaming jobs. All clusters are configured as single node.
Recently (circa 10 days ago) I started to experience job failures on all clusters. Each job runs for approximately 3 days and then fails with the same message:
=========== Cloud Dataproc Agent Error ===========
com.google.cloud.hadoop.services.agent.AgentException: Node was restarted while executing a job. This could be user-initiated or caused by Compute Engine maintenance event. (TASK_FAILED)
at com.google.cloud.hadoop.services.agent.AgentException$Builder.build(AgentException.java:83)
at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler.lambda$kill$0(AbstractJobHandler.java:211)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:211)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:200)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:130)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:435)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:900)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:634)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.addListener(AbstractFuture.java:98)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractTransformFuture.create(AbstractTransformFuture.java:50)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.Futures.transformAsync(Futures.java:551)
at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler.kill(AbstractJobHandler.java:202)
at com.google.cloud.hadoop.services.agent.job.JobManagerImpl.recoverAndKill(JobManagerImpl.java:145)
at com.google.cloud.hadoop.services.agent.MasterRequestReceiver$NormalWorkReceiver.receivedJob(MasterRequestReceiver.java:142)
at com.google.cloud.hadoop.services.agent.MasterRequestReceiver.pollForJobsAndTasks(MasterRequestReceiver.java:106)
at com.google.cloud.hadoop.services.agent.MasterRequestReceiver.pollForWork(MasterRequestReceiver.java:78)
at com.google.cloud.hadoop.services.agent.MasterRequestReceiver.lambda$doStart$0(MasterRequestReceiver.java:68)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask.run(MoreExecutors.java:623)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
======== End of Cloud Dataproc Agent Error ========
This is also the very last thing that can be seen in the logs.
This started happening without any changes to the Spark code, for applications that had previously been running for 50+ days without problems.
All clusters are in the europe-west1-d zone, global region.
All applications are written in Scala.
Anyone experienced something similar? Any help would be welcome.
Since you're saying this is fairly persistent in the last few days, I wonder if something about your input data has changed and if you were running close to 100% utilization before the failures started.
Since Compute Engine VMs don't configure a swap partition, when you run out of RAM all daemons will crash and restart.
To check this, SSH into the VM and run:
sudo journalctl -u google-dataproc-agent
Somewhere in the output there should be a JVM crash header. You can also repeat this for other Hadoop daemons, such as hadoop-hdfs-namenode; they should crash at roughly the same time.
I recommend enabling Stackdriver monitoring [1] on the cluster to get RAM usage over time. If this theory holds, you can try switching to either a highmem variant of the machine type you're using or a custom VM [2] with the same number of CPUs but more RAM.
Additionally, if your jobs use Spark Streaming with checkpointing (or can be converted to it), consider Dataproc restartable jobs [3]. After such a crash, Dataproc will auto-restart the job for you [4]. A sketch of both remedies follows the links below.
[1] https://cloud.google.com/dataproc/docs/concepts/stackdriver-monitoring
[2] https://cloud.google.com/dataproc/docs/concepts/custom-machine-types
[3] https://cloud.google.com/dataproc/docs/concepts/restartable-jobs
[4] How to restart Spark Streaming job from checkpoint on Dataproc?
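Rough sketch of both remedies; the cluster name, machine shape, job class/JAR and failure budget are placeholders, not values from your setup:
# Recreate the single-node cluster with more RAM, e.g. a custom machine type
# with 4 vCPUs and 26 GB (6.5 GB per vCPU):
gcloud dataproc clusters create my-cluster \
    --single-node \
    --master-machine-type custom-4-26624
# Submit the streaming job as a restartable job so Dataproc relaunches it
# after a node restart (requires Spark checkpointing):
gcloud dataproc jobs submit spark \
    --cluster my-cluster \
    --class com.example.StreamingJob \
    --jars gs://my-bucket/streaming-job.jar \
    --max-failures-per-hour 5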
This is related to a bug in Dataproc image version 1.1.34. Downgrading to image 1.1.29 fixes the issue.
To create a cluster with image 1.1.29, use --image-version 1.1.29.
Refer to https://groups.google.com/forum/#!topic/cloud-dataproc-discuss/H5Dy2vmGZ8I for more information.
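For example, a minimal create command pinning that image version could look like this (the cluster name is a placeholder; the zone is taken from your description):
gcloud dataproc clusters create my-cluster \
    --image-version 1.1.29 \
    --zone europe-west1-d \
    --single-node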
I changed my initialization script after creating a cluster with 2 worker nodes for Spark, then tweaked the script a bit more and tried to update the cluster with 2 more worker nodes. The script failed because I simply forgot to run apt-get update before apt-get install, so Dataproc reported an error and the cluster's status changed to ERROR. When I try to reduce the size back to 2 nodes again, it no longer works and I get the following message:
ERROR: (gcloud.dataproc.clusters.update) Cluster 'cluster-1' must be running before it can be updated, current cluster state is 'ERROR'.
The two worker nodes were still added, but at first they didn't seem to be detected by the running Spark application, because no more executors were added. I manually reset the two instances on the Google Compute Engine page, and then 4 executors were added. So everything seems to be working fine again, except that the cluster's status is still ERROR, and I cannot increase or decrease the number of worker nodes anymore.
How can I update the cluster status back to normal (RUNNING)?
In your case, ERROR indicates that the workflow to reconfigure the cluster failed and Dataproc is not sure of its health. At this point Dataproc cannot guarantee that another reconfiguration attempt will succeed, so further updates are disallowed. You can, however, still submit jobs.
Your best bet is to delete the cluster and start over.
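A sketch of the delete-and-recreate path; the init-action path is a placeholder, and the worker count is the one you originally wanted:
gcloud dataproc clusters delete cluster-1
gcloud dataproc clusters create cluster-1 \
    --num-workers 2 \
    --initialization-actions gs://my-bucket/fixed-init-script.sh
Make sure the recreated cluster points at the corrected script (apt-get update before apt-get install) so it doesn't end up in ERROR again.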