Dataproc creation fails when updating fs.defaultFS using --properties - google-cloud-dataproc

I created a Dataproc cluster with fs.defaultFS set to HDFS, and that works fine. But whenever I try to update fs.defaultFS to 'gs', the Dataproc dashboard shows the error "Unable to Start Master, Insufficient number of datanodes reporting."

As mentioned in the comments, this question is a duplicate of Cannot create a Dataproc cluster when setting the fs.defaultFS property?
Re-pasting the answer here for easier readability/discoverability rather than leaving it only in the comment thread.

Related

How to force delete dataproc serverless batch

I am running a PySpark Dataproc serverless batch. It has been running for too long, so I decided to delete it. But neither the GCP console nor the CLI allows me to delete the batch.
The command I tried is
gcloud dataproc batches delete <Batch ID> --region=us-central1
I get the following error:
ERROR: (gcloud.dataproc.batches.delete) FAILED_PRECONDITION: Cannot delete non terminal batch 'Batch(<project-id/batch-id>)'; current state: 'RUNNING'
gcloud dataproc batches cancel is used to cancel a running batch, while gcloud dataproc batches delete is used to delete the batch resource. In this case, you want to use cancel.
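For example, to stop and then remove the batch (same placeholder batch ID and region as above; the delete should succeed once the batch has reached a terminal state such as CANCELLED):
gcloud dataproc batches cancel <Batch ID> --region=us-central1
gcloud dataproc batches delete <Batch ID> --region=us-central1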

How to run hudi on dataproc and write to gcs bucket

I want to write to a GCS bucket from Dataproc using Hudi.
To write to GCS with Hudi, the documentation says to set the property fs.defaultFS to a gs:// value (https://hudi.apache.org/docs/gcs_hoodie).
However, when I set fs.defaultFS on Dataproc to a GCS bucket, I get errors at startup because the job cannot find my jar. It is looking under a gs:/ prefix, presumably because I have overridden fs.defaultFS, which it was previously using to find the jar. How would I fix this?
org.apache.spark.SparkException: Application application_1617963833977_0009 failed 2 times due to AM Container for appattempt_1617963833977_0009_000002 exited with exitCode: -1000
Failing this attempt.Diagnostics: [2021-04-12 15:36:05.142]java.io.FileNotFoundException: File not found : gs:/user/root/.sparkStaging/application_1617963833977_0009/myjar.jar
If it is relevant, I am setting fs.defaultFS from within the code:
sparkConfig.set("spark.hadoop.fs.defaultFS", "gs://defaultFs")
You can try setting fs.defaultFS to GCS when creating the cluster. For example:
gcloud dataproc clusters create ...\
--properties 'core:fs.defaultFS=gs://my-bucket'
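With fs.defaultFS pointing at GCS from cluster creation, you can also pass the jar to the job as an explicit gs:// URI so nothing relies on relative paths. A rough sketch (the cluster name, bucket, jar path, and class are placeholders, not taken from the question):
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --jars=gs://my-bucket/jars/myjar.jar \
    --class=com.example.MyHudiJob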

Updating a Dataproc cluster (metadata or labels) directly while in initialization actions script

I'd like to save more specific errors when an initialization script of a Dataproc cluster fails. Is it possible to update the cluster metadata or add a label to the cluster (without using gcloud dataproc clusters update) from within the script? Or is there any other way to write a more useful error message? Thanks in advance!
If your goal is to report an error from within an initialization action, there is a feature in Dataproc that extracts messages from init action output, as long as you emit the message in this format: StructuredError{message}
For example:
message="something went wrong"
echo "StructuredError{${message}}"

Deployed jobs stopped working with an image error?

In the last few hours I have no longer been able to execute deployed Data Fusion pipeline jobs; they end in an error state almost instantly.
I can run the jobs in Preview mode, but when trying to run deployed jobs this error appears in the logs:
com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Selected software image version '1.2.65-deb9' can no longer be used to create new clusters. Please select a more recent image
I've tried with both an existing instance and a new instance, and all deployed jobs including the sample jobs give this error.
Any ideas? I cannot find any config options for which image is used for execution.
We are currently investigating an issue with the Cloud Dataproc image used by Cloud Data Fusion. We had pinned a Dataproc VM image version for the launch, and that pin is causing the issue.
We apologize for the inconvenience. We are working to resolve the issue as soon as possible.
We will provide updates on this thread.
Nitin

How to update cluster status in Dataproc

I changed my initialization script after creating a cluster with 2 worker nodes for Spark. Then I changed the script a bit and tried to update the cluster with 2 more worker nodes. The script failed because I simply forgot to run apt-get update before apt-get install, so Dataproc reported an error and the cluster's status changed to ERROR. When I try to reduce the size back to 2 nodes, it no longer works, and I get the following message
ERROR: (gcloud.dataproc.clusters.update) Cluster 'cluster-1' must be running before it can be updated, current cluster state is 'ERROR'.
The two worker nodes were still added, but they didn't seem to be detected by the running Spark application at first, since no new executors were added. I manually reset the two instances from the Google Compute Engine page, and then 4 executors were added. So everything seems to be working again, except that the cluster's status is still ERROR and I can no longer increase or decrease the number of worker nodes.
How can I update the cluster status back to normal (RUNNING)?
In your case, ERROR indicates that the workflow to reconfigure the cluster failed, so Dataproc is not sure of the cluster's health. At this point Dataproc cannot guarantee that another reconfigure attempt will succeed, so further updates are disallowed. You can, however, still submit jobs.
Your best bet is to delete it and start over.
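For example (the cluster name comes from the error message above; the region and remaining create flags are placeholders for whatever you originally used):
gcloud dataproc clusters delete cluster-1 --region=us-central1
gcloud dataproc clusters create cluster-1 --region=us-central1 --num-workers=2 ...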