How can I run uncompiled Spark Scala/spark-shell code as a Dataproc job? - scala

Normally, if I'm using Scala for Spark jobs I'll compile a jarfile and submit it with gcloud dataproc jobs submit spark, but sometimes for very lightweight jobs I might be using uncompiled Scala code in a notebook or using the spark-shell REPL, where I assume a SparkContext is already available.
For some of these lightweight use cases I can equivalently use PySpark and submit with gcloud dataproc jobs submit pyspark, but sometimes I need easier access to Scala/Java libraries, such as directly creating an org.apache.hadoop.fs.FileSystem object inside of map functions. Is there an easy way to submit such "spark-shell" equivalent jobs directly from the command line using the Dataproc Jobs API?

At the moment, there isn't a specialized top-level Dataproc Job type for uncompiled Spark Scala, but under the hood, spark-shell is just using the same mechanisms as spark-submit to run a specialized REPL driver: org.apache.spark.repl.Main. Thus, combining this with the --files flag available in gcloud dataproc jobs submit spark, you can just write snippets of Scala that you may have tested in a spark-shell or notebook session, and run that as your entire Dataproc job, assuming job.scala is a local file on your machine:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files job.scala \
-- -i job.scala
As with any other file, you can also specify any Hadoop-compatible path in the --files argument, such as gs:// or even hdfs://, assuming you've already placed your job.scala file there:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files gs://${BUCKET}/job.scala \
-- -i job.scala
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files hdfs:///tmp/job.scala \
-- -i job.scala
If you've staged your job file onto the Dataproc master node via an init action, you'd use file:/// to specify that the file is found on the cluster's local filesystem instead of your local filesystem where you're running gcloud:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files file:///tmp/job.scala \
-- -i job.scala
Note that in all cases the file becomes a local file in the working directory of the main driver job, so the argument to "-i" can just be a relative path consisting of the filename.
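As a rough illustration of that last point, here is a minimal Python sketch that assembles the same gcloud arguments for any script location and derives the "-i" argument from the bare filename. The helper name spark_shell_job_args and the bucket name are hypothetical; only the flags already shown above are used.

```python
import posixpath


def spark_shell_job_args(cluster, script_uri):
    """Build the gcloud argv for running a Scala script through the REPL driver.

    Hypothetical helper: `script_uri` may be a local path or a gs://, hdfs://
    or file:/// URI, exactly as it would be passed to --files.
    """
    # The file lands in the driver's working directory, so -i only needs
    # the last path component.
    script_name = posixpath.basename(script_uri)
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        "--cluster", cluster,
        "--class", "org.apache.spark.repl.Main",
        "--files", script_uri,
        "--", "-i", script_name,
    ]


# The bucket name below is made up for the example.
print(" ".join(spark_shell_job_args("my-cluster", "gs://my-bucket/job.scala")))
```

The resulting list can be handed to subprocess.run as-is, which avoids any shell quoting concerns.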

Related

Why is Dataproc not recognizing the argument spark.submit.deployMode=cluster?

I am submitting a Spark job to Dataproc this way:
gcloud dataproc jobs submit spark --cluster=$CLUSTER --region=$REGION --properties spark.jars.packages=com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.19.1, spark.submit.deployMode=cluster --class path.to.my.main.class --jars=path.to.jars -- "-p" "some_arg" "-z" "some_other_arg"
But I am getting this error:
ERROR: (gcloud.dataproc.jobs.submit.spark) unrecognized arguments:
spark.submit.deployMode=cluster
Any idea why? Thank you in advance for your help.
It works fine this way (without the cluster mode):
gcloud dataproc jobs submit spark --cluster=$CLUSTER --region=$REGION --properties spark.jars.packages=com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.19.1 --class path.to.my.main.class --jars=path.to.jars -- "-p" "some_arg" "-z" "some_other_arg"
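It seems you have a space between the first property and the second. The shell splits the command line on that space, so gcloud receives spark.submit.deployMode=cluster as a separate argument that it does not recognize. Either remove the space or surround both properties with quotes.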
Another option is to replace this with
--packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.19.1 --properties spark.submit.deployMode=cluster
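To see why the stray space triggers the error, here is a minimal Python sketch that uses the standard library's shlex.split to approximate how a POSIX shell tokenizes the command line. The shortened command strings (pkg:1.0, Main) are illustrative stand-ins for the ones in the question, not real gcloud output.

```python
import shlex

# Shortened stand-ins for the commands in the question (illustrative only).
with_space = (
    "gcloud dataproc jobs submit spark "
    "--properties spark.jars.packages=pkg:1.0, spark.submit.deployMode=cluster "
    "--class Main"
)
without_space = (
    "gcloud dataproc jobs submit spark "
    "--properties spark.jars.packages=pkg:1.0,spark.submit.deployMode=cluster "
    "--class Main"
)


def value_after(tokens, flag):
    """Return the single token that follows `flag` in a tokenized command line."""
    return tokens[tokens.index(flag) + 1]


broken = shlex.split(with_space)
fixed = shlex.split(without_space)

# With the space, --properties only receives the first property; the second
# becomes a separate token that is not attached to any flag.
print(value_after(broken, "--properties"))
print("spark.submit.deployMode=cluster" in broken)

# Without the space, both properties travel together as one value.
print(value_after(fixed, "--properties"))
```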

Scala Spark: how to read a file from an HDFS cluster

I am learning to develop Spark applications using Scala, and I am at the very first steps.
I have my Scala IDE on Windows. It is configured and runs smoothly when reading files from the local drive. However, I have access to a remote HDFS cluster and Hive database, and I want to develop, try, and test my applications against that Hadoop cluster, but I don't know how.
If I try
val rdd=sc.textFile("hdfs://masternode:9000/user/hive/warehouse/dwh_db_jrtf.db/discipline")
I will get an error that contains:
Exception in thread "main" java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: "MyLap/11.22.33.44"; destination host is: "masternode":9000;
Can anyone guide me, please?
You can use SBT to package your code in a .jar file. Copy the file to your node with scp, then submit it with spark-submit:
spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
You can't access your cluster from your Windows machine that way.

Dataproc arguments not being read on spark submit

I am using dataproc to submit jobs on spark. However on spark-submit, non-spark arguments are being read as spark arguments!
I am receiving the error/warning below when running a particular job.
Warning: Ignoring non-spark config property: dataproc:dataproc.conscrypt.provider.enable=false
gcloud dataproc jobs submit spark \
--cluster my-cluster \
--region us-east1 \
--properties dataproc:dataproc.conscrypt.provider.enable=false,spark.executor.extraJavaOptions=$SPARK_CONF,spark.executor.memory=${MEMORY}G,spark.executor.cores=$total_cores \
--class com.sample.run \
--jars gs://jars/jobs.jar \
-- 1000
I would like to know what's wrong with my current format. Thanks in advance.
spark-submit ignores any conf option that does not start with spark. and prints a warning about it.
That is why this property was reported as ignored:
--properties dataproc:dataproc.conscrypt.provider.enable=false
Any property you pass here should be of the form spark.propertyname.
This is just a warning.
Why this property is required:
The Conscrypt security provider has been temporarily changed from the
default to an optional security provider. This change was made due to
incompatibilities with some workloads. The Conscrypt provider will be
re-enabled as the default with the release of Cloud Dataproc 1.2 in
the future. In the meantime, you can re-enable the Conscrypt provider
when creating a cluster by specifying this Cloud Dataproc property:
--properties dataproc:dataproc.conscrypt.provider.enable=true
This has to be specified when creating the cluster, since it is a cluster property and not a Spark property (the Spark framework does not understand it and simply ignores it).
Example usage:
gcloud beta dataproc clusters create my-test \
--project my-project \
--subnet prod-sub-1 \
--zone southamerica-east1-a \
--region=southamerica-east1 \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 40 \
--num-workers 5 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 20 \
--image-version 1.2 \
--tags internal,ssh,http-server,https-server \
--properties dataproc:dataproc.conscrypt.provider.enable=false \
--format=json \
--max-idle=10m
Then start the job like this:
gcloud dataproc jobs submit pyspark gs://path-to-script/spark_full_job.py \
--cluster=my-test \
--project=my-project \
--region=southamerica-east1 \
--jars=gs://path-to-driver/mssql-jdbc-6.4.0.jre8.jar \
--format=json -- [JOB_ARGS]
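The split between cluster-level and job-level properties can be sketched in Python. The helper below is hypothetical and not part of gcloud; it assumes that cluster-level properties carry a prefix followed by a colon in the key (such as dataproc:), as in the commands above, and that no property value contains a comma. The memory value is made up for the example.

```python
def split_properties(properties):
    """Split a comma-separated --properties string into two dicts.

    Returns (cluster_level, job_level). A key containing ':' (for example
    'dataproc:...') is treated as cluster-level; everything else is assumed
    to be a plain Spark property that can be set on the job.
    """
    cluster_level, job_level = {}, {}
    for item in properties.split(","):
        key, _, value = item.partition("=")
        # Only inspect the key: values such as Maven coordinates contain ':' too.
        target = cluster_level if ":" in key else job_level
        target[key.strip()] = value.strip()
    return cluster_level, job_level


cluster_props, job_props = split_properties(
    "dataproc:dataproc.conscrypt.provider.enable=false,"
    "spark.executor.memory=4G,"
    "spark.jars.packages=mysql:mysql-connector-java:6.0.6"
)
print(cluster_props)  # candidates for `clusters create --properties`
print(job_props)      # candidates for `jobs submit ... --properties`
```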

How to pass an external configuration file to a PySpark (Spark 2.x) program?

When I run my PySpark program in the interactive shell, the script is able to read the configuration file (config.ini).
But when I run the same script using spark-submit with master yarn and deploy mode cluster, it fails with an error saying the config file does not exist. I checked the YARN log and see the same error there. Below is the command for running the PySpark job:
spark2-submit --master yarn --deploy-mode cluster test.py /home/sys_user/ask/conf/config.ini
The spark2-submit command provides a --properties-file parameter; you can use it to make this properties file available to the spark-submit command.
e.g. spark2-submit --master yarn --deploy-mode cluster --properties-file $CONF_FILE_NAME pyspark_script.py
Pass the ini file in the spark.files parameter:
.config('spark.files', 'config/local/config.ini') \
Read in pyspark:
from pyspark import SparkFiles
with open(SparkFiles.get('config.ini')) as config_file:
    print(config_file.read())
It works for me.
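Building on the spark.files approach, here is a minimal sketch of parsing the shipped file with the standard library's configparser instead of printing it. The function names load_ini and load_distributed_ini are hypothetical, and the sketch assumes the file was distributed via spark.files (or --files) and that a SparkContext already exists when SparkFiles.get is called.

```python
import configparser


def load_ini(path):
    """Parse an INI file from a local path and return a ConfigParser."""
    parser = configparser.ConfigParser()
    # read() silently skips missing files, so check what was actually loaded.
    loaded = parser.read(path)
    if not loaded:
        raise FileNotFoundError(f"config file not found: {path}")
    return parser


def load_distributed_ini(file_name):
    """Resolve a file shipped via spark.files or --files, then parse it.

    Assumes an active SparkContext; call this after the session is created.
    """
    # Imported lazily so load_ini() stays usable without a Spark installation.
    from pyspark import SparkFiles
    return load_ini(SparkFiles.get(file_name))
```

Inside the job, load_distributed_ini('config.ini') would then return a parser whose sections and keys are whatever your own config.ini defines. Passing the bare filename rather than the original absolute path matters in cluster mode, because the driver may run on a node where that original path does not exist.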

Pass packages to pyspark running on dataproc from airflow?

We have an Airflow DAG that involves running a pyspark job on Dataproc. We need a jdbc driver during the job, which I'd normally pass to the dataproc submit command:
gcloud dataproc jobs submit pyspark \
--cluster my-cluster \
--properties spark.jars.packages=mysql:mysql-connector-java:6.0.6 \
--py-files ...
But how can I do it with Airflow's DataProcPySparkOperator?
For now we're adding this library to the cluster itself:
gcloud dataproc clusters create my-cluster \
--region global \
--zone europe-west1-d \
...
--properties spark:spark.jars.packages=mysql:mysql-connector-java:6.0.6 \
...
This seems to be working fine, but it doesn't feel like the right way to do it. Is there another way?
I believe you want to pass dataproc_pyspark_properties to the DataProcPySparkOperator.
See:
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/dataproc_operator.py
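A minimal sketch of what that could look like, assuming an Airflow version that still ships the contrib module linked above. The task_id and the script location are hypothetical placeholders, and the function expects a DAG object you have defined elsewhere.

```python
# Spark properties to attach to the job, mirroring the --properties flag
# from the gcloud command in the question.
PYSPARK_PROPERTIES = {
    "spark.jars.packages": "mysql:mysql-connector-java:6.0.6",
}


def build_pyspark_task(dag):
    """Create the Dataproc PySpark task for the given DAG.

    Requires an Airflow install that provides the contrib Dataproc operators.
    """
    # Imported lazily so this module can be loaded without Airflow present.
    from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

    return DataProcPySparkOperator(
        task_id="run_pyspark_job",          # hypothetical task name
        main="gs://my-bucket/jobs/job.py",  # hypothetical script location
        cluster_name="my-cluster",
        dataproc_pyspark_properties=PYSPARK_PROPERTIES,
        dag=dag,
    )
```

Setting the property on the job keeps the dependency next to the code that needs it, so the cluster definition does not have to change when the driver version does.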