gcloud CLI application logs to bucket - scala

There are Scala Spark application jobs that run daily in GCP. I am trying to set up a notification to be sent when a run is completed. One way I thought of doing that was to get the logs and grep them for a specific completion message (not sure if there's a better way). But I found that the logs are only shown in the console, on the job details page, and are not saved to a file.
Is there a way to route these logs to a file in a bucket so that I can search in it? Do I have to specify where to show these logs in the log4j properties file, such as giving a bucket location to
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
I tried to submit a job with this but it's giving me this error: grep:**-2022-07-08.log: No such file or directory
...
gcloud dataproc jobs submit spark \
--project $PROJECT --cluster=$CLUSTER --region=$REGION --class=***.spark.offer.Main \
--jars=gs://**.jar \
--properties=driver-memory=10G,spark.ui.filters="",spark.memory.fraction=0.6,spark.sql.files.maxPartitionBytes=5368709120,spark.memory.storageFraction=0.1,spark.driver.extraJavaOptions="-Dcq.config.name=gcp.conf",spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j-executor.properties -Dcq.config.name=gcp.conf" \
--gcp.conf > gs://***-$date.log 2>&1

By default, Dataproc job driver logs are saved in GCS at the Dataproc-generated driverOutputResourceUri of the job. See the Dataproc documentation on job output for more details.
But IMHO, a better way to determine whether a job has finished is through gcloud dataproc jobs describe <job-id>, or the jobs.get REST API.
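A minimal sketch of the describe-based check, assuming the job is fetched with `gcloud dataproc jobs describe <job-id> --region=<region> --format=json` and that the state is read from `status.state`. The helper names (`describe_job`, `is_finished`, `succeeded`) are made up for illustration, and `describe_job` needs an authenticated gcloud on the PATH.

```python
import json
import subprocess

# Terminal job states in the Dataproc v1 API; anything else
# (PENDING, RUNNING, ...) means the job is still in flight.
TERMINAL_STATES = {"DONE", "ERROR", "CANCELLED"}


def describe_job(job_id: str, region: str) -> dict:
    """Return the job resource as a dict by shelling out to gcloud."""
    out = subprocess.run(
        ["gcloud", "dataproc", "jobs", "describe", job_id,
         f"--region={region}", "--format=json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def is_finished(job: dict) -> bool:
    """True once the job has reached a terminal state."""
    return job.get("status", {}).get("state") in TERMINAL_STATES


def succeeded(job: dict) -> bool:
    """True only for a job that completed without error."""
    return job.get("status", {}).get("state") == "DONE"
```

A notification script could call `describe_job` in a loop with a sleep until `is_finished` returns True, then branch on `succeeded`. The same dict also carries `driverOutputResourceUri` if the driver log location is still needed.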

Related

How to force delete dataproc serverless batch

I am running a pyspark dataproc serverless batch. It has been running for too long so I decided to delete it. But neither the GCP console nor the CLI allow me to delete the batch.
The command I tried is
gcloud dataproc batches delete <Batch ID> --region=us-central1
I get the following error:
ERROR: (gcloud.dataproc.batches.delete) FAILED_PRECONDITION: Cannot delete non terminal batch 'Batch(<project-id/batch-id>)'; current state: 'RUNNING'
gcloud dataproc batches cancel is used to cancel a running batch, while gcloud dataproc batches delete is used to delete the batch resource. In this case, you want to use cancel.
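The cancel-versus-delete rule could be sketched as a small helper that picks the gcloud verb from the batch state. The function name `cleanup_command` is hypothetical, and the set of terminal states is an assumption based on the Dataproc Serverless batch states (SUCCEEDED, FAILED, CANCELLED).

```python
# Batch states treated as terminal (assumption, see lead-in).
TERMINAL_BATCH_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def cleanup_command(batch_id: str, region: str, state: str) -> list:
    """Pick `cancel` for a live batch and `delete` for a finished one."""
    verb = "delete" if state in TERMINAL_BATCH_STATES else "cancel"
    return ["gcloud", "dataproc", "batches", verb, batch_id, f"--region={region}"]


# A RUNNING batch has to be cancelled before it can be deleted.
print(" ".join(cleanup_command("my-batch", "us-central1", "RUNNING")))
```

Once the cancel has gone through and the batch reports a terminal state, the same helper returns the delete command.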

How to run hudi on dataproc and write to gcs bucket

I want to write to a gcs bucket from dataproc using hudi.
To write to gcs using hudi it says to set prop fs.defaultFS to value gs:// (https://hudi.apache.org/docs/gcs_hoodie)
However, when I set fs.defaultFS on dataproc to be a gcs bucket I get errors at startup relating to the job not being able to find my jar. It is looking under a gs:/ prefix, presumably because I have overridden defaultFS, which it was previously using to find the jar. How would I fix this?
org.apache.spark.SparkException: Application application_1617963833977_0009 failed 2 times due to AM Container for appattempt_1617963833977_0009_000002 exited with exitCode: -1000
Failing this attempt.Diagnostics: [2021-04-12 15:36:05.142]java.io.FileNotFoundException: File not found : gs:/user/root/.sparkStaging/application_1617963833977_0009/myjar.jar
If it is relevant I am setting the defaultFs from within the code.
sparkConfig.set("spark.hadoop.fs.defaultFS", "gs://defaultFs")
You can try setting fs.defaultFS to GCS when creating the cluster. For example:
gcloud dataproc clusters create ...\
--properties 'core:fs.defaultFS=gs://my-bucket'

Launch a dataproc job with --files through the rest API

I am able to submit a job to dataproc via the command line
gcloud dataproc jobs submit pyspark --cluster=my_cluster --region=myregion --py-files file1.py script.py
I would like to transform this command line to a POST request to the rest api, https://cloud.google.com/dataproc/docs/guides/submit-job
However I am not able to understand how to set the
--py-files file1.py
parameter in the request. Could somebody help me?
I have found that this can be accomplished through the pythonFileUris field:
"pythonFileUris": [
"gs://file1.py"
]
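For context, here is a rough sketch of where `pythonFileUris` sits in a full jobs.submit request. The project id and bucket name (`my-project`, `my-bucket`) are placeholders, and `pyspark_submit_request` is a made-up helper; the sketch only builds the URL and body and leaves authentication and the actual POST out.

```python
import json


def pyspark_submit_request(project, region, cluster, main_uri, py_file_uris):
    """Build the URL and JSON body for a Dataproc v1 jobs.submit call."""
    url = (f"https://dataproc.googleapis.com/v1/projects/{project}"
           f"/regions/{region}/jobs:submit")
    body = {
        "job": {
            "placement": {"clusterName": cluster},
            "pysparkJob": {
                # Equivalent of the positional script.py argument.
                "mainPythonFileUri": main_uri,
                # Equivalent of --py-files.
                "pythonFileUris": list(py_file_uris),
            },
        }
    }
    return url, body


url, body = pyspark_submit_request(
    "my-project", "myregion", "my_cluster",
    "gs://my-bucket/script.py", ["gs://my-bucket/file1.py"],
)
print(url)
print(json.dumps(body, indent=2))
```

Unlike the gcloud command, which stages local files for you, the REST call expects URIs the cluster can already read, so both files would need to be uploaded to GCS first.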

How to retrieve Dataproc's jobId within a PySpark job

I run several batch jobs and I would like to attach the jobId from dataproc to the saved output files.
That would allow all logs for arguments and output to be associated with the results. One downside remains: once the executors in YARN have terminated, logs for individual executors can no longer be obtained.
Google Dataproc passes its context into Spark jobs through tags, so the relevant information is present in the SparkConf and can be accessed:
pyspark.SparkConf().get("spark.yarn.application.tags", "unknown")
pyspark.SparkConf().get("spark.yarn.tags", "unknown")
The output looks like the following:
dataproc_job_3f4025a0-bce1-a254-9ddc-518a4d8b2f3d
That information can then be assigned to our export folder and output is saved with Dataproc reference:
df.select("*").write \
.format('com.databricks.spark.csv').options(header='true') \
.save(export_folder)
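The step from the tag string to `export_folder` is not shown above, so here is one way it might look. The sketch assumes the tag value can be a comma-separated list in which one entry starts with `dataproc_job_`; the helper names and the base path in the usage note are hypothetical.

```python
def dataproc_job_id(tags: str, default: str = "unknown") -> str:
    """Extract the job id from a comma-separated YARN tag string."""
    prefix = "dataproc_job_"
    for tag in tags.split(","):
        tag = tag.strip()
        if tag.startswith(prefix):
            return tag[len(prefix):]
    # No Dataproc job tag present, e.g. when running outside Dataproc.
    return default


def export_path(base: str, tags: str) -> str:
    """Append the job id to a base output location."""
    return f"{base.rstrip('/')}/{dataproc_job_id(tags)}"
```

Inside the job this would be used as `export_folder = export_path("gs://my-bucket/exports", pyspark.SparkConf().get("spark.yarn.tags", "unknown"))`, where `gs://my-bucket/exports` is a placeholder.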

Where are the Spark logs on EMR?

I'm not able to locate error logs or messages from println calls in Scala while running jobs on Spark in EMR.
Where can I access these?
I'm submitting the Spark job, written in Scala to EMR using script-runner.jar with arguments --deploy-mode set to cluster and --master set to yarn. It runs the job fine.
However, I do not see my println statements in the Amazon EMR UI where it lists stderr, stdout, etc. Furthermore, if my job errors I don't see why it had an error. All I see is this in the stderr:
15/05/27 20:24:44 INFO yarn.Client: Application report from ResourceManager:
application identifier: application_1432754139536_0002
appId: 2
clientToAMToken: null
appDiagnostics:
appMasterHost: ip-10-185-87-217.ec2.internal
appQueue: default
appMasterRpcPort: 0
appStartTime: 1432758272973
yarnAppState: FINISHED
distributedFinalState: FAILED
appTrackingUrl: http://10.150.67.62:9046/proxy/application_1432754139536_0002/A
appUser: hadoop
With the deploy mode set to cluster on YARN, the Spark driver, and hence the user code it executes, runs within the Application Master container. It sounds like you had EMR debugging enabled on the cluster, so the logs should also have been pushed to S3. In the S3 location look at task-attempts/<applicationid>/<firstcontainer>/*.
If you SSH into the master node of your cluster then you should be able to find the stdout, stderr, syslog and controller logs under:
/mnt/var/log/hadoop/steps/<stepname>
I also spent a lot of time figuring this out. Found logs in the following location:
EMR UI Console -> Summary -> Log URI -> Containers -> application_xxx_xxx -> container_yyy_yy_yy -> stdout.gz.
The event logs, the ones required for the spark-history-server, can be found at:
hdfs:///var/log/spark/apps
If you submit your job with emr-bootstrap you can specify the log directory as an s3 bucket with --log-uri