How to submit a PySpark job with dependencies on a Google Dataproc cluster - pyspark

I am using a Google Dataproc cluster to run Spark jobs; the scripts are written in Python.
When there is only one script (test.py, for example), I can submit the job with the following command:
gcloud dataproc jobs submit pyspark --cluster analyse ./test.py
But now test.py imports modules from other scripts I have written myself. How can I specify those dependencies in the command?

You could use the --py-files option mentioned here.

If you have a structure such as
- maindir
  - lib
    - lib.py
  - run
    - script.py
you can include the additional files with the --files flag or the --py-files flag:
gcloud dataproc jobs submit pyspark --cluster=clustername --region=regionname --files lib/lib.py run/script.py
and you can import in script.py as
from lib import something
However, I am not aware of a way to avoid the tedious process of adding the file list manually. Please check Submit a python project to dataproc job for a more detailed explanation.
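If there are more than a couple of modules, one option (a sketch, not taken from the answers above) is to bundle them into a single zip archive and pass that one archive with --py-files; the cluster and region names below are the same placeholders used above:
cd maindir
zip -j libs.zip lib/*.py   # -j drops the directory prefix, so lib.py sits at the root of the archive
gcloud dataproc jobs submit pyspark --cluster=clustername --region=regionname --py-files=libs.zip run/script.py
Because the root of the archive is placed on the job's PYTHONPATH, from lib import something continues to work unchanged.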

Related

Launch a dataproc job with --files through the rest API

I am able to submit a job to dataproc via the command line
gcloud dataproc jobs submit pyspark --cluster=my_cluster --region=myregion --py-files file1.py script.py
I would like to transform this command line into a POST request to the REST API, https://cloud.google.com/dataproc/docs/guides/submit-job
However I am not able to understand how to set the
--py-files file1.py
parameter in the request. Could somebody help me?
I have found that this can be accomplished through
"pythonFileUris": [
  "gs://file1.py"
]

GCP Dataproc: Directly working with Spark over Yarn Cluster

I'm trying to minimize changes in my code so I'm wondering if there is a way to submit a spark-streaming job from my personal PC/VM as follows:
spark-submit --class path.to.your.Class --master yarn --deploy-mode client \
[options] <app jar> [app options]
without using GCP SDK.
I also have to specify a directory with configuration files, HADOOP_CONF_DIR, which I was able to download from Ambari.
Is there a way to do the same?
Thank you
Setting up an external machine as a YARN client node is generally difficult to do and not a workflow that will work easily with Dataproc.
In a comment you mention that what you really want to do is:
- Submit a Spark job to the Dataproc cluster.
- Run a local script on each "batchFinish" (StreamingListener.onBatchCompleted?).
- The script has dependencies that mean it cannot run inside the Dataproc master node.
Again, configuring a client node outside of the Dataproc cluster and getting it to work with spark-submit is not going to work directly. However, you can configure your network so that the Spark driver (running within Dataproc) has access to the service/script you need to run, and then invoke it when desired.
If you run your service on a VM that has access to the network of the Dataproc cluster, then your Spark driver should be able to access the service.
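In other words, submit the streaming job through Dataproc itself rather than from your PC. A hedged equivalent of the spark-submit line above, with the cluster, region, jar path and application arguments as placeholders:
gcloud dataproc jobs submit spark --cluster=my-cluster --region=myregion \
  --class=path.to.your.Class --jars=gs://my-bucket/app.jar \
  -- arg1 arg2   # application arguments go after the "--" separator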

passing properties argument for gcloud dataproc jobs submit pyspark

I am trying to submit a pyspark job to google cloud dataproc via the command line
these are my arguments;
gcloud dataproc jobs submit pyspark --cluster mongo-load --properties org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 mongo_load.py
I am constantly getting an exception
--properties: Bad syntax for dict arg: [org.mongodb.spark:mongo-spark-connector_2.11:2.2.0]
I tried some of the escaping options from google shown here but nothing seems to work.
Figured out I just needed to pass:
spark.jars.packages=org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
In addition to the answer by @Marlon Gray, if you need to pass more than one package you need to escape the spark.jars.packages string, like:
--properties=^#^spark.jars.packages=mavencoordinate1,mavencoordinate2
Please check https://cloud.google.com/sdk/gcloud/reference/topic/escaping for further details.
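Put together, a hedged example with two coordinates (the second Maven coordinate is purely illustrative); the leading ^#^ switches gcloud's delimiter from , to #, so the commas inside the property value are kept intact:
gcloud dataproc jobs submit pyspark --cluster=mongo-load \
  --properties=^#^spark.jars.packages=org.mongodb.spark:mongo-spark-connector_2.11:2.2.0,org.apache.spark:spark-avro_2.11:2.4.0 \
  mongo_load.py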

Give custom job_id to Google Dataproc cluster for running pig/hive/spark jobs

Is there any flag available to give a custom job_id to Dataproc jobs? I am using this command to run Pig jobs:
gcloud dataproc jobs submit pig --cluster my_cluster --file my_queries.pig
I use similar commands to submit pyspark/hive jobs.
This command creates a job_id on its own, which makes tracking the jobs later on difficult.
Reading the gcloud code you can see that the argument called id is used as the job name:
https://github.com/google-cloud-sdk/google-cloud-sdk/blob/master/lib/googlecloudsdk/command_lib/dataproc/jobs/submitter.py#L56
Therefore you only need to add --id to your gcloud command:
gcloud dataproc jobs submit spark --id this-is-my-job-name --cluster my-cluster --class com.myClass.Main --jars gs://my.jar
While it's possible to provide your own generated jobid when using the underlying REST API, there isn't currently any way to specify your own jobid when submitting with gcloud dataproc jobs submit; this feature might be added in the future. That said, typically when people want to specify job ids they also want to be able to list with more complex match expressions, or potentially to have multiple categories of jobs listed by different kinds of expressions at different points in time.
So, you might want to consider dataproc labels instead; labels are intended specifically for this kind of use case, and are optimized for efficient lookup. For example:
gcloud dataproc jobs submit pig --labels jobtype=mylogspipeline,date=20170508 ...
gcloud dataproc jobs submit pig --labels jobtype=mylogspipeline,date=20170509 ...
gcloud dataproc jobs submit pig --labels jobtype=mlpipeline,date=20170509 ...
gcloud dataproc jobs list --filter "labels.jobtype=mylogspipeline"
gcloud dataproc jobs list --filter "labels.date=20170509"
gcloud dataproc jobs list --filter "labels.date=20170509 AND labels.jobtype=mlpipeline"

Spark-Scala read application.conf file based on environment like Prod/UAT etc

I have a Spark application running on AWS EMR. We have different environments on AWS like prod, uat, dev etc. I created an application.conf file with the required variables like S3 bucket, IAM role etc., but obviously these variables are different for each environment.
How can I pass a different conf file to spark-submit so that I don't have to change application.conf for each environment during deployments?
Based on the answer given by @ozeebee in this post, we can use the same approach for spark-submit.
With spark-submit you need to specify the property in spark.driver.extraJavaOptions, something like this:
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dconfig.resource=devtest.conf" \
  --class thesyscat.query.ingest.eventenrichment.EventEnrichSparkApp \
  --master yarn \
  --deploy-mode client \
  <jar_location>
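If you later switch to --deploy-mode cluster, the driver no longer runs on the machine where the .conf file lives, so the file has to be shipped with the job. A hedged variant, assuming Typesafe Config picks the file up from the YARN container's working directory:
spark-submit \
  --files devtest.conf \
  --conf "spark.driver.extraJavaOptions=-Dconfig.file=devtest.conf" \
  --class thesyscat.query.ingest.eventenrichment.EventEnrichSparkApp \
  --master yarn \
  --deploy-mode cluster \
  <jar_location>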