I am able to submit a job to dataproc via the command line
gcloud dataproc jobs submit pyspark --cluster=my_cluster --region=myregion --py-files file1.py script.py
I would like to translate this command into a POST request to the REST API, https://cloud.google.com/dataproc/docs/guides/submit-job
However, I am not able to understand how to set the
--py-files file1.py
parameter in the request. Could somebody help me?
I have found that this can be accomplished through the pythonFileUris field of the request:
"pythonFileUris": [
"gs://file1.py"
]
There are Scala Spark application jobs that run daily in GCP. I am trying to set up a notification to be sent when a run is completed. One way I thought of doing that was to get the logs and grep for a specific completion message (not sure if there's a better way). But I found that the logs are only shown in the console, on the job details page, and are not saved to a file.
Is there a way to route these logs to a file in a bucket so that I can search in it? Do I have to specify where to show these logs in the log4j properties file, like give a bucket location to
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
I tried to submit a job with this but it's giving me this error: grep:**-2022-07-08.log: No such file or directory
...
gcloud dataproc jobs submit spark \
--project $PROJECT --cluster=$CLUSTER --region=$REGION --class=***.spark.offer.Main \
--jars=gs://**.jar \
--properties=driver-memory=10G,spark.ui.filters="",spark.memory.fraction=0.6,spark.sql.files.maxPartitionBytes=5368709120,spark.memory.storageFraction=0.1,spark.driver.extraJavaOptions="-Dcq.config.name=gcp.conf",spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j-executor.properties -Dcq.config.name=gcp.conf" \
--gcp.conf > gs://***-$date.log 2>&1
By default, Dataproc job driver logs are saved in GCS at the Dataproc-generated driverOutputResourceUri of the job. See the Dataproc documentation on job driver output for more details.
But IMHO, a better way to determine whether a job has finished is through gcloud dataproc jobs describe <job-id>, or the jobs.get REST API.
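A rough sketch of that polling approach follows. It assumes gcloud is installed and authenticated on the machine running the script, and that DONE, ERROR and CANCELLED are the states to treat as final. The function names and the 30-second interval are illustrative choices, not part of any Dataproc tooling.

```python
import json
import subprocess
import time

# Job states treated as final in this sketch.
TERMINAL_STATES = {"DONE", "ERROR", "CANCELLED"}


def job_state(describe_json):
    """Extract status.state from the JSON printed by `jobs describe --format=json`."""
    return json.loads(describe_json)["status"]["state"]


def wait_for_job(job_id, region, poll_seconds=30):
    """Poll `gcloud dataproc jobs describe` until the job reaches a final state."""
    while True:
        result = subprocess.run(
            ["gcloud", "dataproc", "jobs", "describe", job_id,
             f"--region={region}", "--format=json"],
            check=True,
            capture_output=True,
            text=True,
        )
        state = job_state(result.stdout)
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
```

A notification step could then branch on the returned state, for example sending a success message only when wait_for_job(...) returns "DONE".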
I'm trying to minimize changes in my code so I'm wondering if there is a way to submit a spark-streaming job from my personal PC/VM as follows:
spark-submit --class path.to.your.Class --master yarn --deploy-mode client \
[options] <app jar> [app options]
without using GCP SDK.
I also have to specify a directory with configuration files HADOOP_CONF_DIR which I was able to download from Ambari.
Is there a way to do the same?
Thank you
Setting up an external machine as a YARN client node is generally difficult to do and not a workflow that will work easily with Dataproc.
In a comment you mention that what you really want to do is:
Submit a Spark job to the Dataproc cluster.
Run a local script on each "batchFinish" (StreamingListener.onBatchCompleted?).
The script has dependencies that mean it cannot run inside of the Dataproc master node.
Again, configuring a client node outside of the Dataproc cluster and getting it to work with spark-submit is not going to work directly. However, you can configure your network so that the Spark driver (running within Dataproc) has access to the service/script you need to run, and then invoke it when desired.
If you run your service on a VM that has access to the network of the Dataproc cluster, then your Spark driver should be able to access the service.
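To illustrate the idea, here is a minimal sketch of the driver-side call, assuming the external script is wrapped in a small HTTP service reachable from the cluster network. The URL, the payload fields, and the function names are all hypothetical; in a PySpark streaming job, something like notify could be called from a StreamingListener's onBatchCompleted callback.

```python
import json
import urllib.request


def build_batch_notification(service_url, batch_time_ms, num_records):
    """Build a POST request describing one finished batch (hypothetical payload)."""
    payload = json.dumps(
        {"batchTimeMs": batch_time_ms, "numRecords": num_records}
    ).encode("utf-8")
    return urllib.request.Request(
        service_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify(service_url, batch_time_ms, num_records, timeout=10):
    """Send the notification and return the HTTP status code."""
    request = build_batch_notification(service_url, batch_time_ms, num_records)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping the request construction separate from the network call makes it possible to check the payload without a running service.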
I am trying to submit a PySpark job to Google Cloud Dataproc via the command line.
These are my arguments:
gcloud dataproc jobs submit pyspark --cluster mongo-load --properties org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 mongo_load.py
I keep getting this exception:
--properties: Bad syntax for dict arg: [org.mongodb.spark:mongo-spark-connector_2.11:2.2.0]
I tried some of the escaping options from google shown here but nothing seems to work.
I figured out that I just needed to pass the package as the value of the spark.jars.packages property:
spark.jars.packages=org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
In addition to the answer by @Marlon Gray, if you need to pass more than one package you need to escape the spark.jars.packages string by choosing a different delimiter, like
--properties=^#^spark.jars.packages=mavencoordinate1,mavencoordinate2
Please check https://cloud.google.com/sdk/gcloud/reference/topic/escaping for further details.
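As a small sketch of that escaping rule, the helper below assembles a --properties value with a custom delimiter so that commas inside a property value are left alone. The helper name and the second Maven coordinate are made up for illustration; only the ^#^ syntax comes from the gcloud escaping topic.

```python
def build_properties_flag(properties, delimiter="#"):
    """Build a gcloud --properties flag that uses `delimiter` between pairs.

    Hypothetical helper: the ^DELIM^ prefix tells gcloud to split the dict
    on DELIM instead of on commas.
    """
    for key, value in properties.items():
        if delimiter in key or delimiter in str(value):
            raise ValueError(f"delimiter {delimiter!r} appears in {key}={value}")
    pairs = delimiter.join(f"{key}={value}" for key, value in properties.items())
    return f"--properties=^{delimiter}^{pairs}"


flag = build_properties_flag(
    {
        # The second coordinate is a placeholder, not a real requirement.
        "spark.jars.packages": (
            "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0,"
            "org.example:other-package:1.0.0"
        ),
        "spark.executor.memory": "4g",
    }
)
print(flag)
```

If the flag is passed as one element of an argument list to subprocess.run (without a shell), no additional shell quoting of the ^ or # characters should be needed.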
Is there any flag available to give a custom job_id to Dataproc jobs? I am using this command to run Pig jobs.
gcloud dataproc jobs submit pig --cluster my_cluster --file my_queries.pig
I use similar commands to submit pyspark/hive jobs.
This command creates a job_id on its own and tracking them later on is difficult.
Reading the gcloud code, you can see that the argument called id is used as the job name:
https://github.com/google-cloud-sdk/google-cloud-sdk/blob/master/lib/googlecloudsdk/command_lib/dataproc/jobs/submitter.py#L56
Therefore you only need to add --id to your gcloud command:
gcloud dataproc jobs submit spark --id this-is-my-job-name --cluster my-cluster --class com.myClass.Main --jars gs://my.jar
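If the IDs are generated by a script, a helper along these lines could produce predictable values for --id. It is only a sketch: it assumes job IDs may contain letters, digits, underscores and hyphens and are limited to 100 characters (worth checking against the current Dataproc documentation), and the naming scheme of prefix plus UTC timestamp is an arbitrary choice.

```python
import re
from datetime import datetime, timezone


def make_job_id(prefix, now=None):
    """Return an ID like 'prefix-YYYYMMDD-HHMMSS' using only [A-Za-z0-9_-].

    Hypothetical helper; the character set and 100-character limit are
    assumptions about Dataproc's job ID rules.
    """
    now = now or datetime.now(timezone.utc)
    stamp = f"{now:%Y%m%d-%H%M%S}"  # 15 characters
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "-", prefix)
    # Leave room for the hyphen and the timestamp within 100 characters.
    return f"{cleaned[:84]}-{stamp}"


print(make_job_id("my-logs-pipeline"))
```

The result can be passed directly as the value of --id in the submit command above.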
While it's possible to provide your own generated jobid when using the underlying REST API, there isn't currently any way to specify your own jobid when submitting with gcloud dataproc jobs submit; this feature might be added in the future. That said, typically when people want to specify job ids they also want to be able to list with more complex match expressions, or potentially to have multiple categories of jobs listed by different kinds of expressions at different points in time.
So, you might want to consider dataproc labels instead; labels are intended specifically for this kind of use case, and are optimized for efficient lookup. For example:
gcloud dataproc jobs submit pig --labels jobtype=mylogspipeline,date=20170508 ...
gcloud dataproc jobs submit pig --labels jobtype=mylogspipeline,date=20170509 ...
gcloud dataproc jobs submit pig --labels jobtype=mlpipeline,date=20170509 ...
gcloud dataproc jobs list --filter "labels.jobtype=mylogspipeline"
gcloud dataproc jobs list --filter "labels.date=20170509"
gcloud dataproc jobs list --filter "labels.date=20170509 AND labels.jobtype=mlpipeline"
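When the label sets come from a script, the filter expressions above can be generated rather than typed by hand. The following is a minimal sketch; the function names are invented for illustration and the command is only assembled, not executed.

```python
def build_label_filter(labels):
    """Join label constraints into a filter such as 'labels.a=1 AND labels.b=2'."""
    return " AND ".join(f"labels.{key}={value}" for key, value in labels.items())


def list_jobs_command(region, labels):
    """Assemble (but do not run) a `gcloud dataproc jobs list` command."""
    return [
        "gcloud", "dataproc", "jobs", "list",
        f"--region={region}",
        f"--filter={build_label_filter(labels)}",
        "--format=json",
    ]


print(build_label_filter({"date": "20170509", "jobtype": "mlpipeline"}))
```

The returned list can be handed to subprocess.run, and the JSON output parsed to find the job IDs that match.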
I am using a Google Dataproc cluster to run a Spark job; the script is in Python.
When there is only one script (test.py, for example), I can submit the job with the following command:
gcloud dataproc jobs submit pyspark --cluster analyse ./test.py
But now test.py imports modules from other scripts that I wrote. How can I specify these dependencies in the command?
You could use the --py-files option mentioned here.
If you have a structure as
- maindir
  - lib
    - lib.py
  - run
    - script.py
You could include additional files with the --files flag or the --py-files flag
gcloud dataproc jobs submit pyspark --cluster=clustername --region=regionname --files /lib/lib.py /run/script.py
and you can import in script.py as
from lib import something
However, I am not aware of a method to avoid the tedious process of adding the file list manually. Please check "Submit a python project to dataproc job" for a more detailed explanation.
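One possible way to reduce the manual work is to let a small script collect the files and assemble the command. This is only a sketch: build_submit_command is a made-up helper, it assumes the dependencies are plain .py files under one directory, and modules passed individually through --py-files are generally imported by file name, so nested packages may need to be zipped instead.

```python
from pathlib import Path


def build_submit_command(main_script, lib_dir, cluster, region):
    """Assemble a `gcloud dataproc jobs submit pyspark` command.

    Every .py file found under lib_dir is passed through --py-files.
    Hypothetical helper: the command is returned, not executed.
    """
    py_files = sorted(str(path) for path in Path(lib_dir).rglob("*.py"))
    command = [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        f"--cluster={cluster}",
        f"--region={region}",
    ]
    if py_files:
        # --py-files takes a comma-separated list.
        command.append("--py-files=" + ",".join(py_files))
    command.append(str(main_script))
    return command
```

For the layout above, build_submit_command("run/script.py", "lib", "clustername", "regionname") returns a list that can be passed to subprocess.run from inside maindir.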