Pass packages to pyspark running on dataproc from airflow?

We have an Airflow DAG that involves running a PySpark job on Dataproc. We need a JDBC driver during the job, which I'd normally pass to the dataproc submit command:
gcloud dataproc jobs submit pyspark \
--cluster my-cluster \
--properties spark.jars.packages=mysql:mysql-connector-java:6.0.6 \
--py-files ...
But how can I do it with Airflow's DataProcPySparkOperator?
For now we're adding this library to the cluster itself:
gcloud dataproc clusters create my-cluster \
--region global \
--zone europe-west1-d \
...
--properties spark:spark.jars.packages=mysql:mysql-connector-java:6.0.6 \
...
This seems to be working fine, but it doesn't feel like the right way to do it. Is there another way?

I believe you want to pass dataproc_pyspark_properties to the DataProcPySparkOperator.
See:
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/dataproc_operator.py
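A minimal sketch of how that might look, assuming the contrib-era operator from the linked file, which accepts a dataproc_pyspark_properties dict. The task_id, script path and cluster name are placeholders rather than values from the question, and the operator call itself is left as a comment so the snippet runs without Airflow installed.

```python
# Spark properties to pass per job, mirroring the --properties flag
# from the gcloud command in the question.
PYSPARK_PROPERTIES = {
    "spark.jars.packages": "mysql:mysql-connector-java:6.0.6",
}

# Keyword arguments for DataProcPySparkOperator. The task_id, main script
# path and cluster name are placeholders, not values from the original post.
operator_kwargs = {
    "task_id": "run_pyspark_job",
    "main": "gs://my-bucket/jobs/my_job.py",
    "cluster_name": "my-cluster",
    "dataproc_pyspark_properties": PYSPARK_PROPERTIES,
}

# Inside a DAG definition (requires Airflow with the contrib operators):
# from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator
# run_job = DataProcPySparkOperator(dag=dag, **operator_kwargs)
```

With this in place the package is requested per job, so the cluster-level spark:spark.jars.packages property should no longer be needed.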

Related

How to get gcloud dataproc create flag in a spark job?

I want to get flags used when creating a dataproc cluster in a spark job.
for example, I created my cluster using this command line:
gcloud dataproc clusters create cluster-name \
--region=region \
--bucket=bucket-name \
--temp-bucket=bucket-name \
other args ...
In my Scala Spark job I want to get the bucket name and the other arguments. How can I do that? I know that to get the configuration of my job I can do this:
val sc = sparkSession.sparkContext
val conf_context=sc.getConf.getAll
conf_context.foreach(println)
Any help, please?
Thanks
Dataproc also publishes some attributes, including the bucket name, to GCE instance Metadata. You can also specify your own Metadata. See https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/metadata.
These will be available to you through the Metadata server. For example, if you want to read the bucket name, you can run
curl -s -H Metadata-Flavor:Google http://metadata.google.internal/computeMetadata/v1/instance/attributes/dataproc-bucket
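The same request can be made from code instead of curl. Below is a rough sketch in Python using only the standard library; the function names are made up for illustration, and the same GET with the Metadata-Flavor header could equally be issued from Scala. The fetch itself only works on a GCE/Dataproc VM.

```python
import urllib.request

# URL pattern taken from the curl command above.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/{key}"
)


def build_metadata_request(key):
    """Build a request for one instance attribute, with the required header."""
    return urllib.request.Request(
        METADATA_URL.format(key=key),
        headers={"Metadata-Flavor": "Google"},
    )


def read_metadata(key, timeout=5):
    """Fetch the attribute value; only works from inside a GCE/Dataproc VM."""
    with urllib.request.urlopen(build_metadata_request(key), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

On a cluster node, read_metadata("dataproc-bucket") should return the same value as the curl command.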
You can use the gcloud dataproc clusters describe shell command to get details about the cluster:
gcloud dataproc clusters describe $clusterName --region $clusterRegion
To get the bucket name from this command, you can use grep:
BUCKET_NAME=$(gcloud dataproc clusters describe $clusterName \
--region $clusterRegion \
| grep 'configBucket:' \
| sed 's/.* //')
You should be able to execute this from Scala; see this post for how to do so.
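If you would rather do the extraction in a script, here is a rough Python equivalent of the grep/sed pipeline. It assumes the describe output contains a line of the form "configBucket: <name>", as the grep above does; the function names are illustrative only.

```python
import subprocess


def extract_config_bucket(describe_output):
    """Return the value of the first 'configBucket:' line, or None."""
    for line in describe_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("configBucket:"):
            return stripped.split(":", 1)[1].strip()
    return None


def describe_cluster(cluster_name, region):
    """Run the describe command shown above; requires gcloud on PATH."""
    result = subprocess.run(
        ["gcloud", "dataproc", "clusters", "describe", cluster_name,
         "--region", region],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Usage would be something like extract_config_bucket(describe_cluster(name, region)), run from a machine where gcloud is authenticated.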

Expose Hue with Component Gateway for Dataproc

Is it possible to expose Hue with Component Gateway for Dataproc? I went through the docs and didn't find any option to add a service to it. I am creating the Dataproc cluster with the command below.
gcloud beta dataproc clusters create hive-cluster \
--scopes sql-admin,bigquery \
--image-version 1.5 \
--master-machine-type n1-standard-4 \
--num-masters 1 \
--worker-machine-type n1-standard-1 \
--num-workers 2 \
--region $REGION \
--zone $ZONE \
--optional-components=ANACONDA,JUPYTER \
--initialization-actions gs://bucket/init-scripts/cloud-sql-proxy.sh,gs://bucket/init-scripts/hue.sh \
--properties hive:hive.metastore.warehouse.dir=gs://$PROJECT-warehouse/datasets,dataproc:jupyter.notebook.gcs.dir=gs://bucket/notebooks/jupyter \
--metadata "hive-metastore-instance=$PROJECT:$REGION:hive-metastore" \
--enable-component-gateway
Hue is not an optional component of Dataproc, hence not accessible from component gateway. For now, you have to use Dataproc web interfaces:
Once the cluster has been created, Hue is configured to run on port 8888 on the master node in a Dataproc cluster. To connect to the Hue web interface, you will need to create an SSH tunnel and use a SOCKS 5 Proxy with your web browser as described in the dataproc web interfaces documentation. In the opened web browser go to 'localhost:8888' and you should see the Hue UI.

How can I run uncompiled Spark Scala/spark-shell code as a Dataproc job?

Normally, if I'm using Scala for Spark jobs I'll compile a jarfile and submit it with gcloud dataproc jobs submit spark, but sometimes for very lightweight jobs I might be using uncompiled Scala code in a notebook or using the spark-shell REPL, where I assume a SparkContext is already available.
For some of these lightweight use cases I can equivalently use PySpark and submit with gcloud dataproc jobs submit pyspark but sometimes I need easier access to Scala/Java libraries such as directly creating a org.apache.hadoop.fs.FileSystem object inside of map functions. Is there any easy way to submit such "spark-shell" equivalent jobs directly from a command-line using Dataproc Jobs APIs?
At the moment, there isn't a specialized top-level Dataproc Job type for uncompiled Spark Scala, but under the hood, spark-shell is just using the same mechanisms as spark-submit to run a specialized REPL driver: org.apache.spark.repl.Main. Thus, combining this with the --files flag available in gcloud dataproc jobs submit spark, you can just write snippets of Scala that you may have tested in a spark-shell or notebook session, and run that as your entire Dataproc job, assuming job.scala is a local file on your machine:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files job.scala \
-- -i job.scala
Just like any other file, you can also specify any Hadoop-compatible path in the --files argument as well, such as gs:// or even hdfs://, assuming you've already placed your job.scala file there:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files gs://${BUCKET}/job.scala \
-- -i job.scala
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files hdfs:///tmp/job.scala \
-- -i job.scala
If you've staged your job file onto the Dataproc master node via an init action, you'd use file:/// to specify that the file is found on the cluster's local filesystem instead of your local filesystem where you're running gcloud:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files file:///tmp/job.scala \
-- -i job.scala
Note that in all cases the file becomes a local file in the working directory of the main driver job, so the argument to "-i" can just be a relative path to the filename.

Dataproc arguments not being read on spark submit

I am using dataproc to submit jobs on spark. However on spark-submit, non-spark arguments are being read as spark arguments!
I am receiving the error/warning below when running a particular job.
Warning: Ignoring non-spark config property: dataproc:dataproc.conscrypt.provider.enable=false
gcloud dataproc jobs submit spark \
--cluster my-cluster \
--region us-east1 \
--properties dataproc:dataproc.conscrypt.provider.enable=false,spark.executor.extraJavaOptions=$SPARK_CONF,spark.executor.memory=${MEMORY}G,spark.executor.cores=$total_cores \
--class com.sample.run \
--jars gs://jars/jobs.jar \
-- 1000
I would like to know what's wrong with my current format. Thanks in advance.
spark-submit ignores conf options that do not start with spark.
That is why it reported this property as ignored:
--properties dataproc:dataproc.conscrypt.provider.enable=false
Any property you pass to a Spark job should have the form spark.propertyname.
This is just a warning.
Why this property exists:
The Conscrypt security provider has been temporarily changed from the
default to an optional security provider. This change was made due to
incompatibilities with some workloads. The Conscrypt provider will be
re-enabled as the default with the release of Cloud Dataproc 1.2 in
the future. In the meantime, you can re-enable the Conscrypt provider
when creating a cluster by specifying this Cloud Dataproc property:
--properties dataproc:dataproc.conscrypt.provider.enable=true
This has to be specified when creating the cluster, since it is a cluster property rather than a Spark property (the Spark framework does not understand it and simply ignores it).
Example usage:
gcloud beta dataproc clusters create my-test \
--project my-project \
--subnet prod-sub-1 \
--zone southamerica-east1-a \
--region=southamerica-east1 \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 40 \
--num-workers 5 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 20 \
--image-version 1.2 \
--tags internal,ssh,http-server,https-server \
--properties dataproc:dataproc.conscrypt.provider.enable=false \
--format=json \
--max-idle=10m
and then start the job like this:
gcloud dataproc jobs submit pyspark gs://path-to-script/spark_full_job.py \
--cluster=my-test \
--project=my-project \
--region=southamerica-east1 \
--jars=gs://path-to-driver/mssql-jdbc-6.4.0.jre8.jar \
--format=json -- [JOB_ARGS]
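To make the distinction concrete, here is a small sketch that sorts the entries of a --properties string into cluster-level and Spark-level ones. It assumes, based on the examples in this thread, that cluster properties carry a "prefix:" before the key (such as dataproc:) and that job-level Spark properties start with spark.; the function name is made up, and values that themselves contain commas are not handled.

```python
def split_properties(properties):
    """Split a --properties string into cluster, Spark and other entries.

    Entries with a 'prefix:' before the key (e.g. 'dataproc:...') are treated
    as cluster properties; entries whose key starts with 'spark.' are treated
    as Spark job properties. Anything else is returned separately.
    """
    cluster, spark, other = {}, {}, {}
    for entry in properties.split(","):
        # Only the part before '=' is inspected, so colons in values are fine.
        key, _, value = entry.partition("=")
        if ":" in key:
            cluster[key] = value
        elif key.startswith("spark."):
            spark[key] = value
        else:
            other[key] = value
    return cluster, spark, other
```

Anything that ends up in the first dict belongs on gcloud dataproc clusters create rather than on jobs submit.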

How to know when dataproc initialization actions are done

I need to run a Dataproc cluster with both BigQuery and Cloud Storage connectors installed.
I use a variant of this script (because I have no access to the bucket used in the general one). Everything works fine, but when I run a job once the cluster is up and running, it always fails with a "Task was not acquired" error.
I can fix this by simply restarting the Dataproc agent on every node, but I really need this to work properly so that I can run a job right after my cluster is created. It seems that this part of the script is not working properly:
# Restarts Dataproc Agent after successful initialization
# WARNING: this function relies on undocumented and not officially supported Dataproc Agent
# "sentinel" files to determine successful Agent initialization and not guaranteed
# to work in the future. Use at your own risk!
restart_dataproc_agent() {
  # Because Dataproc Agent should be restarted after initialization, we need to wait until
  # it will create a sentinel file that signals initialization completion (success or failure)
  while [[ ! -f /var/lib/google/dataproc/has_run_before ]]; do
    sleep 1
  done
  # If Dataproc Agent didn't create a sentinel file that signals initialization
  # failure then it means that initialization succeeded and it should be restarted
  if [[ ! -f /var/lib/google/dataproc/has_failed_before ]]; then
    service google-dataproc-agent restart
  fi
}
export -f restart_dataproc_agent
# Schedule asynchronous Dataproc Agent restart so it will use updated connectors.
# It could not be restarted synchronously because Dataproc Agent should be restarted
# after its initialization, including init actions execution, has been completed.
bash -c restart_dataproc_agent & disown
My questions are:
How can I know that the initialization actions are done?
Do I have to restart the Dataproc agent on my newly created cluster's nodes, and if so, how do I do it properly?
EDIT:
Here is the command I use to create a cluster (using the 1.3 image version):
gcloud dataproc --region europe-west1 \
clusters create my-cluster \
--bucket my-bucket \
--subnet default \
--zone europe-west1-b \
--master-machine-type n1-standard-1 \
--master-boot-disk-size 50 \
--num-workers 2 \
--worker-machine-type n1-standard-2 \
--worker-boot-disk-size 100 \
--image-version 1.3 \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--project my-project \
--initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh \
--metadata 'gcs-connector-version=1.9.6' \
--metadata 'bigquery-connector-version=0.13.6'
Also, please note that the connectors initialization script has since been fixed and now works fine, so I am using it directly, but I still have to restart the Dataproc agent manually to be able to run a job.
The Dataproc agent logs a "Custom initialization actions finished." message to the /var/log/google-dataproc-agent.0.log file after the initialization actions succeed.
No, you don't need to restart the Dataproc agent manually.
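A quick way to check for that message from a script running on a cluster node, sketched in Python. The marker string and log path are the ones mentioned above; the function names are illustrative, and the exact layout of the surrounding log lines is not assumed.

```python
# Message the agent is said to log once init actions have succeeded.
INIT_DONE_MARKER = "Custom initialization actions finished."


def init_actions_finished(log_text):
    """Return True if any line of the agent log contains the marker."""
    return any(INIT_DONE_MARKER in line for line in log_text.splitlines())


def check_agent_log(path="/var/log/google-dataproc-agent.0.log"):
    """Read the agent log on a cluster node and look for the marker."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return init_actions_finished(handle.read())
```

Reading the log may require elevated permissions on the node, depending on how the file is set up.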
This issue is caused by Dataproc agent service restart in the connectors initialization action and should be resolved by this PR.
As for knowing when the initialization actions are finished, you can check the cluster's status.state: if it is CREATING, the initialization actions are still being executed; if it is RUNNING, they are done.
Check here
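If you want to wait for that state change in a script, a polling loop along these lines might do. This is only a sketch: get_state is a hypothetical callable you supply (for example, one that wraps gcloud dataproc clusters describe and reads status.state), and the poll interval and limit are arbitrary defaults.

```python
import time


def wait_until_created(get_state, poll_seconds=10, max_polls=60, sleep=time.sleep):
    """Poll get_state() until the cluster leaves CREATING.

    Returns the first non-CREATING state (e.g. RUNNING or ERROR).
    Raises TimeoutError if the cluster is still CREATING after max_polls.
    """
    for _ in range(max_polls):
        state = get_state()
        if state != "CREATING":
            return state
        sleep(poll_seconds)
    raise TimeoutError("cluster still CREATING after %d polls" % max_polls)
```

The caller should still check that the returned state is RUNNING before submitting a job, since a failed initialization action leaves the cluster in a different state.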