How to run Hudi on Dataproc and write to a GCS bucket - google-cloud-dataproc

I want to write to a GCS bucket from Dataproc using Hudi.
To write to GCS using Hudi, the docs say to set the property fs.defaultFS to a gs:// value (https://hudi.apache.org/docs/gcs_hoodie).
However, when I set fs.defaultFS on Dataproc to a GCS bucket, I get errors at startup about the job not being able to find my jar. It looks under a gs:/ prefix, presumably because I have overridden defaultFS, which it was previously using to find the jar. How would I fix this?
org.apache.spark.SparkException: Application application_1617963833977_0009 failed 2 times due to AM Container for appattempt_1617963833977_0009_000002 exited with exitCode: -1000
Failing this attempt.Diagnostics: [2021-04-12 15:36:05.142]java.io.FileNotFoundException: File not found : gs:/user/root/.sparkStaging/application_1617963833977_0009/myjar.jar
If it is relevant, I am setting defaultFS from within the code:
sparkConfig.set("spark.hadoop.fs.defaultFS", "gs://defaultFs")

You can try setting fs.defaultFS to GCS when creating the cluster. For example:
gcloud dataproc clusters create ... \
--properties 'core:fs.defaultFS=gs://my-bucket'
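
If that works for your setup, a rough sketch of the flow might look like the following; the cluster name, region, bucket, class name and jar paths are all placeholders, not values from the question:

# Create the cluster with GCS as the default filesystem.
gcloud dataproc clusters create my-hudi-cluster \
    --region=us-central1 \
    --properties='core:fs.defaultFS=gs://my-bucket'

# Submit the Hudi job without overriding fs.defaultFS in code;
# the jars are still referenced with explicit gs:// URIs.
gcloud dataproc jobs submit spark \
    --cluster=my-hudi-cluster \
    --region=us-central1 \
    --class=com.example.HudiWriter \
    --jars=gs://my-bucket/jars/myjar.jar,gs://my-bucket/jars/hudi-spark-bundle.jar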

Related

gcloud CLI application logs to bucket

There are Scala Spark application jobs that run daily in GCP. I am trying to set up a notification to be sent when a run is completed. One way I thought of doing that was to get the logs and grep them for a specific completion message (not sure if there's a better way). But I found that the logs are only shown in the console, inside the job details page, and are not saved to a file.
Is there a way to route these logs to a file in a bucket so that I can search in it? Do I have to specify where to send these logs in the log4j properties file, e.g. give a bucket location to
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
I tried to submit a job with this but it's giving me this error: grep:**-2022-07-08.log: No such file or directory
...
gcloud dataproc jobs submit spark \
--project $PROJECT --cluster=$CLUSTER --region=$REGION --class=***.spark.offer.Main \
--jars=gs://**.jar\
--properties=driver-memory=10G,spark.ui.filters="",spark.memory.fraction=0.6,spark.sql.files.maxPartitionBytes=5368709120,spark.memory.storageFraction=0.1,spark.driver.extraJavaOptions="-Dcq.config.name=gcp.conf",spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j-executor.properties -Dcq.config.name=gcp.conf" \
--gcp.conf > gs://***-$date.log 2>&1
By default, Dataproc job driver logs are saved in GCS at the Dataproc-generated driverOutputResourceUri of the job. See this doc for more details.
But IMHO, a better way to determine whether a job has finished is through gcloud dataproc jobs describe <job-id>, or the jobs.get REST API.
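A rough sketch of that approach, with a placeholder job id and region, could look like this:

# Placeholders: replace the job id and region with your own values.
JOB_ID="my-job-id"
REGION="us-east1"

# Check whether the job has finished (status.state is e.g. RUNNING, DONE, ERROR or CANCELLED).
STATE=$(gcloud dataproc jobs describe "$JOB_ID" --region="$REGION" --format='value(status.state)')
echo "Job state: $STATE"

# The driver output location mentioned above can be read from the same resource.
gcloud dataproc jobs describe "$JOB_ID" --region="$REGION" --format='value(driverOutputResourceUri)'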

Updating a Dataproc cluster (metadata or labels) directly while in initialization actions script

I'd like to save more specific errors in the case of a failed initialization script of a Dataproc cluster. Is it possible to update the cluster metadata or add a label to the cluster (without using gcloud dataproc clusters update) from within the script? Or any other method to write a more useful error message? Thanks in advance!
If your goal is to report an error from within an initialization action, there is a feature within Dataproc that extracts messages from init action output, as long as you emit the message in this format: StructuredError{message}
For example:
message="something went wrong"
echo "StructuredError{${message}}"

Dataproc creation fails when updating fs.defaultFS using --properties

I created a Dataproc cluster with defaultFS as HDFS, and that is working fine. But whenever I try to update fs.defaultFS to 'gs', the Dataproc dashboard shows the error "Unable to Start Master, Insufficient number of datanodes reporting."
As mentioned in the comments, this question is a duplicate of Cannot create a Dataproc cluster when setting the fs.defaultFS property?
Re-pasting here for easier readability/discoverability vs just putting it in the comment thread.

How to cache jars for DataProc Spark job submission

I am submitting a Spark job to Dataproc using either gcloud or the Google Cloud Dataproc API. One of the arguments is '--jars' (or its Java API equivalent), where I supply a comma-separated list of jar files to be provided to the executor and driver classpaths:
gs://google-storage-bucket/lib/x1.jar,gs://google-storage-bucket/lib/x2.jar, etc...
The same JAR files are copied from the Google Storage bucket to the working directory for each SparkContext on the executor nodes every time I submit a job, and it takes about 2 minutes before the job really starts executing (I can see that on the Google Cloud console - https://console.cloud.google.com/dataproc/jobs/...).
Is it possible to somehow cache these jar files on the Spark nodes and use them in the classpath with every job submission? That would save about 50% of the run time.
Thanks,
Victor
Indeed, if you pass in arguments of the form file:///your/path/on/the/cluster/nodes/filesystem then it will be interpreted as referring to files on the cluster nodes themselves.
You can either copy files from GCS onto the nodes at cluster creation time using an initialization action, try to run some kind of Spark job to do it on an existing cluster, and/or manually SSH in to stage those jars.
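
For the init-action route, a minimal sketch could be the following; the local directory is an arbitrary choice and the bucket path is taken from the question:

#!/bin/bash
# Sketch of an initialization action that stages the job jars on every cluster node.
set -euo pipefail

JAR_DIR=/usr/local/lib/spark-extra-jars   # arbitrary local path on the nodes
mkdir -p "$JAR_DIR"
gsutil -m cp 'gs://google-storage-bucket/lib/*.jar' "$JAR_DIR/"

Jobs could then reference the staged copies with --jars=file:///usr/local/lib/spark-extra-jars/x1.jar,file:///usr/local/lib/spark-extra-jars/x2.jar instead of the gs:// URIs.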

Getting a warning for each part file in a PySpark job on Google Dataproc when writing files directly to Google Storage

I am getting this warning for each part file the Spark job creates when writing to Google Storage:
17/08/01 11:31:47 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://temp_bucket/output/part-09698
17/08/01 11:31:47 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://temp_bucket/output/part-09698 - removing from cache
The Spark job has 10 stages and this warning comes after 9 stages. Since the job creates ~11,500 part files, the warning appears for each of them. Because of this warning my Spark job runs for an extra 15 minutes, and since I am running around 80 such jobs, I am losing a lot of time and incurring a lot of cost.
Is there a way to suppress this warning?
Recent changes have made it safe to disable the enforced list-consistency entirely; future releases are expected to phase it out gradually. Try the following in your job properties to disable the CacheSupplementedGoogleCloudStorage:
--properties spark.hadoop.fs.gs.metadata.cache.enable=false
Or if you're creating a new Dataproc cluster, in your cluster properties:
--properties core:fs.gs.metadata.cache.enable=false
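
For context, a sketch of where each flag goes; the script, bucket, cluster name and region are placeholders:

# Per-job: disable the cache for a single PySpark job submission.
gcloud dataproc jobs submit pyspark gs://my-bucket/my_job.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --properties=spark.hadoop.fs.gs.metadata.cache.enable=false

# Cluster-wide: disable the cache for every job on a new cluster.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --properties='core:fs.gs.metadata.cache.enable=false'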