How to retrieve Dataproc's jobId within a PySpark job - google-cloud-dataproc

I run several batch jobs and I would like to reference the jobId from Dataproc in the saved output files.
That would allow me to associate all the logs for arguments and output with the results. One downside remains: as executors in YARN pass away, no logs for an individual executor can be obtained anymore.

The Google Dataproc context is passed into Spark jobs by using tags, so all the relevant information is present in the SparkConf and can be accessed:
pyspark.SparkConf().get("spark.yarn.application.tags", "unknown")
pyspark.SparkConf().get("spark.yarn.tags", "unknown")
The output looks like the following:
dataproc_job_3f4025a0-bce1-a254-9ddc-518a4d8b2f3d
That information can then be included in our export folder, so the output is saved with a Dataproc reference:
df.select("*").write. \
format('com.databricks.spark.csv').options(header='true') \
.save(export_folder)
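As a rough sketch of tying the two together (the dataproc_job_ tag prefix is taken from the output above; the bucket name and the use of the active SparkContext's configuration are assumptions for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Dataproc attaches its job context to the YARN application as tags.
tags = spark.sparkContext.getConf().get("spark.yarn.tags", "unknown")

# Pick out the tag carrying the Dataproc job ID, e.g.
# "dataproc_job_3f4025a0-bce1-a254-9ddc-518a4d8b2f3d".
job_id = next(
    (t for t in tags.split(",") if t.startswith("dataproc_job_")),
    "unknown",
)

# Hypothetical bucket/prefix; the job ID makes each run's output traceable.
export_folder = "gs://my-output-bucket/results/{}".format(job_id)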

Related

gcloud CLI application logs to bucket

There are Scala Spark application jobs that run daily in GCP. I am trying to set up a notification to be sent when a run has completed. One way I thought of doing that was to get the logs and grep them for a specific completion message (not sure if there's a better way). But I found that the logs are only shown in the console, inside the job details page, and are not saved to a file.
Is there a way to route these logs to a file in a bucket so that I can search them? Do I have to specify where to send these logs in the log4j properties file, e.g. give a bucket location to
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
I tried to submit a job with this but it's giving me this error: grep:**-2022-07-08.log: No such file or directory
...
gcloud dataproc jobs submit spark \
--project $PROJECT --cluster=$CLUSTER --region=$REGION --class=***.spark.offer.Main \
--jars=gs://**.jar\
--properties=driver-memory=10G,spark.ui.filters="",spark.memory.fraction=0.6,spark.sql.files.maxPartitionBytes=5368709120,spark.memory.storageFraction=0.1,spark.driver.extraJavaOptions="-Dcq.config.name=gcp.conf",spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j-executor.properties -Dcq.config.name=gcp.conf" \
--gcp.conf > gs://***-$date.log 2>&1
By default, Dataproc job driver logs are saved in GCS at the Dataproc-generated driverOutputResourceUri of the job. See this doc for more details.
But IMHO, a better way to determine whether a job has finished is through gcloud dataproc jobs describe <job-id>, or the jobs.get REST API.
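A minimal sketch of the jobs.get route using the Python client library (the project, region, and job ID values are placeholders, and details may vary with the client version):
from google.cloud import dataproc_v1

# Placeholder identifiers for illustration.
project_id, region, job_id = "my-project", "us-central1", "my-job-id"

# The Dataproc API is regional, so point the client at the region's endpoint.
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = client.get_job(project_id=project_id, region=region, job_id=job_id)

# DONE, ERROR and CANCELLED are the terminal states.
if job.status.state == dataproc_v1.JobStatus.State.DONE:
    print("Job finished; driver output at", job.driver_output_resource_uri)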

How to run hudi on dataproc and write to gcs bucket

I want to write to a gcs bucket from dataproc using hudi.
To write to GCS using Hudi, the docs say to set the property fs.defaultFS to a gs:// value (https://hudi.apache.org/docs/gcs_hoodie).
However, when I set fs.defaultFS on Dataproc to be a GCS bucket, I get errors at startup about the job not being able to find my jar. It is looking under a gs:/ prefix, presumably because I have overridden defaultFS, which it was previously using to find the jar. How would I fix this?
org.apache.spark.SparkException: Application application_1617963833977_0009 failed 2 times due to AM Container for appattempt_1617963833977_0009_000002 exited with exitCode: -1000
Failing this attempt.Diagnostics: [2021-04-12 15:36:05.142]java.io.FileNotFoundException: File not found : gs:/user/root/.sparkStaging/application_1617963833977_0009/myjar.jar
If it is relevant, I am setting defaultFS from within the code:
sparkConfig.set("spark.hadoop.fs.defaultFS", "gs://defaultFs")
You can try setting fs.defaultFS to GCS when creating the cluster. For example:
gcloud dataproc clusters create ...\
--properties 'core:fs.defaultFS=gs://my-bucket'
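With the cluster configured that way, the Hudi write itself can target the bucket. Below is a minimal sketch only, where the sample data, table name, key fields, and bucket path are assumptions, and the Hudi Spark bundle is assumed to be on the classpath:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder data and Hudi options for illustration only.
df = spark.createDataFrame([(1, "2021-04-12")], ["id", "ts"])
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
}

# Writing to an explicit gs:// base path keeps the target unambiguous.
df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("gs://my-bucket/hudi/my_table")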

Finding the location of my spark job output file

I am testing pyspark jobs in an EMR cluster on AWS. The goal is to use a Lambda function to fire the spark job, but for now I am manually running the spark job. So, I SSH to the master node and then run the spark job as below:
spark-submit /home/hadoop/testspark.py mybucket
mybucket is a parameter passed to the spark job.
The line that saves the RDD is
rddFiltered.repartition(1).saveAsTextFile("/home/hadoop/output.txt")
The spark job seems to run, but it puts the output in some location reported as: Output directory hdfs://ip-xxx-xx-xx-xx.ec2.internal:8020/home/hadoop/output.txt.
Where is this exactly located and how can I view the contents? Forgive my ignorance on HDFS and Hadoop.
Eventually, I want to rename output.txt to something meaningful and then transfer to S3, just haven't gotten there yet.
If I re-run the spark job it says "Output directory hdfs://ip-xxx-xx-xx-xx.ec2.internal:8020/home/hadoop/output.txt already exists". How do I prevent this or at least overwrite the file?
Thanks
Based on the EMR documentation:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html
if you do not specify a scheme prefix (such as s3:// or file://), Spark will write data to HDFS by default. You can check EMR HDFS with this command:
hadoop fs -ls /home/hadoop/
You can also transfer from HDFS to S3 with S3DistCp:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
Unfortunately you cannot overwrite the existing file using saveAsTextFile:
https://spark-project.atlassian.net/browse/SPARK-1100
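One workaround, sketched below, is to remove the previous output directory before re-running (the path is the one from the question; this assumes the hadoop CLI is available where the driver runs):
import subprocess

# Delete the old output directory so saveAsTextFile can recreate it.
# -f keeps the command from failing when the path does not exist yet.
subprocess.run(
    ["hadoop", "fs", "-rm", "-r", "-f", "/home/hadoop/output.txt"],
    check=True,
)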
As I can see you repartitioned the RDD into one partition, so you can also write it to the local file system:
rddFiltered.repartition(1).saveAsTextFile("file:///home/hadoop/output.txt")
Note that saveAsTextFile creates a directory of part files at that path, and on a distributed cluster a file:// path refers to the local disk of whichever node runs the task, so you would have to collect() the data back to the driver first and write it out there yourself.
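For the eventual move to S3, a minimal sketch of writing straight to the bucket with an explicit s3:// prefix (the key prefix is a placeholder; mybucket is the bucket passed to the job in the question):
# An explicit s3:// URI sends the output to S3 via EMRFS instead of HDFS.
# The result is still a directory of part files, and the target must not
# already exist.
rddFiltered.repartition(1).saveAsTextFile("s3://mybucket/spark-output/run1")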

How to cache jars for DataProc Spark job submission

I am submitting a Spark job to Dataproc using either gcloud or the Google Cloud Dataproc API. One of the arguments is '--jars' (or its Java API equivalent), where I supply a comma-separated list of jar files to be provided to the executor and driver classpaths:
gs://google-storage-bucket/lib/x1.jar,gs://google-storage-bucket/lib/x2.jar, etc...
The same JAR files are copied from the Google Storage bucket to the working directory of each SparkContext on the executor nodes every time I submit a job, and it takes about 2 minutes before the job really starts executing (I can see that in the Google Cloud console - https://console.cloud.google.com/dataproc/jobs/...).
Is it possible to somehow cache these jar files on Spark nodes and use them in the classpath with every job submission? That would save about 50% of the run time.
Thanks,
Victor
Indeed, if you pass in arguments of the form file:///your/path/on/the/cluster/nodes/filesystem then it will be interpreted as referring to files on the cluster nodes themselves.
You can either copy the files from GCS onto the nodes at cluster creation time using an initialization action, run some kind of Spark job to do it on an existing cluster, and/or manually SSH in to stage those jars.

passing properties argument for gcloud dataproc jobs submit pyspark

I am trying to submit a pyspark job to google cloud dataproc via the command line
These are my arguments:
gcloud dataproc jobs submit pyspark --cluster mongo-load --properties org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 mongo_load.py
I am constantly getting an exception
--properties: Bad syntax for dict arg: [org.mongodb.spark:mongo-spark-connector_2.11:2.2.0]
I tried some of the escaping options from google shown here but nothing seems to work.
Figured out I just needed to pass:
spark.jars.packages=org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
In addition to the answer by @Marlon Gray, if you need to pass more than one package you need to escape the spark.jars.packages string, like
--properties=^#^spark.jars.packages=mavencoordinate1,mavencoordinate2
Please check https://cloud.google.com/sdk/gcloud/reference/topic/escaping for further details.