I've set up a highly customized virtual environment on Cloud Dataproc. Some of the libraries in this virtual environment depend on certain shared libraries, which are packaged along with the virtual environment.
For the virtual environment, I made PYSPARK_PYTHON point to the python binary inside it.
However, these libraries do not work because LD_LIBRARY_PATH is not set when I run gcloud dataproc jobs submit....
I've tried:
1. Setting spark-env.sh on the workers and master to export LD_LIBRARY_PATH
2. Setting spark.executorEnv.LD_LIBRARY_PATH
3. Creating an initialization script where (1) is added during initialization
However, all of these failed.
This is what finally worked:
Running the gcloud command as:
gcloud dataproc jobs submit pyspark --cluster spark-tests spark_job.py --properties spark.executorEnv.LD_LIBRARY_PATH="path1:path2"
When I tried to set spark.executorEnv inside the pyspark script (using the Spark Config object), it didn't work. I'm not sure why that is.
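The working command above can also be assembled from a script, which helps when the library paths vary between jobs. A minimal Python sketch, assuming the cluster name, script name, and placeholder paths from the command above; the helper name build_submit_command is hypothetical, the command is only built and printed (not run), and property values containing commas are not handled:

```python
import shlex


def build_submit_command(cluster, script, properties):
    """Build a `gcloud dataproc jobs submit pyspark` argument list.

    `properties` maps Spark property names to values; they are joined into
    the comma-separated key=value form passed to the --properties flag.
    """
    joined = ",".join(f"{key}={value}" for key, value in properties.items())
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        "--cluster", cluster,
        script,
        "--properties", joined,
    ]


# Placeholder values copied from the command in the post.
cmd = build_submit_command(
    "spark-tests",
    "spark_job.py",
    {"spark.executorEnv.LD_LIBRARY_PATH": "path1:path2"},
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a machine where gcloud is installed and authenticated.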
I'm trying to avoid race conditions with gcloud / gsutil authentication between different CI/CD jobs that run on the same system, a GitLab Runner on a Mac Mini.
I have tried setting the auth manually with
RUN gcloud auth activate-service-account --key-file="gitlab-runner.json"
RUN gcloud config set project $GCP_PROJECT_ID
for the Dockerfile (in which I'm performing a download operation from a Google Cloud Storage bucket).
In the bash script that runs the docker command, I authenticate by activating a named configuration:
gcloud config configurations activate $TARGET
where I have previously run the two commands above to save their settings to that configuration.
The configurations work fine if I start each CI/CD job only after the previous one has finished. But I want to trigger them for all clients at the same time, which causes race conditions with gcloud authentication, and one of the jobs ends up trying to download from the wrong project's bucket.
How can I avoid this race condition? I'm already authenticating before each gsutil command, but it still happens. Do I need something like Cloud Build to separate the runtime environments?
You can use Cloud Build to get separate execution environments, but this might be overkill for your use case: a Cloud Build worker uses an entire VM, which might be too heavy. Linux containers / Docker can provide the necessary isolation as well.
You should make sure that each container you run has a unique config file placed in the path expected by gcloud. The issue may come from improper volume mounting (all the containers sharing the same location on the host OS). You could either mount a directory containing the configuration file (unique for each bucket) when running an image, or run gcloud config configurations activate in a Dockerfile step (thus creating image variants for different buckets, if that’s feasible).
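One way to keep gcloud state separate per job is the CLOUDSDK_CONFIG environment variable, which points gcloud at a different configuration directory. A minimal Python sketch, assuming one private directory per CI job; the helper name isolated_gcloud_env and the job identifiers are hypothetical:

```python
import os
import tempfile


def isolated_gcloud_env(job_id, base_env=None):
    """Return a copy of the environment with a private gcloud config dir.

    CLOUDSDK_CONFIG tells gcloud where to keep credentials and
    configurations, so jobs using different directories do not share state.
    """
    env = dict(os.environ if base_env is None else base_env)
    config_dir = tempfile.mkdtemp(prefix=f"gcloud-{job_id}-")
    env["CLOUDSDK_CONFIG"] = config_dir
    return env


# Hypothetical job names: each one gets a distinct directory.
env_a = isolated_gcloud_env("client-a")
env_b = isolated_gcloud_env("client-b")
print(env_a["CLOUDSDK_CONFIG"] != env_b["CLOUDSDK_CONFIG"])
```

Each job would then run its gcloud auth and download commands with its own env (for example via subprocess.run(..., env=env_a)), so that activating an account in one job should not affect the others.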
Alternatively, and I think this solution might be easier, you can switch from the Cloud SDK distribution to the standalone gsutil distribution. That way you can provide the path to a boto configuration file through an environment variable.
Such variables can be specified when running a Docker image.
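As a rough illustration of the boto approach, the sketch below builds a docker run argument list that mounts one boto configuration file read-only and points gsutil at it through BOTO_CONFIG. The image name, file paths, and helper name are hypothetical, and the command is only constructed here, not executed:

```python
def docker_run_args(image, boto_config_host_path,
                    boto_config_container_path="/config/boto.cfg"):
    """Build `docker run` arguments for a container with its own boto config.

    The host file is mounted read-only, and BOTO_CONFIG tells standalone
    gsutil inside the container which configuration file to read.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{boto_config_host_path}:{boto_config_container_path}:ro",
        "-e", f"BOTO_CONFIG={boto_config_container_path}",
        image,
    ]


# Hypothetical per-client config file and image name.
args = docker_run_args("downloader:latest", "/ci/configs/client-a.boto")
print(" ".join(args))
```

Because every job passes a different host file, concurrent containers never read each other's credentials.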
I have a Spark setup currently running in PROD, in which we call spark-submit from a shell script.
We export some variables inside the shell script before running spark-submit in yarn client mode.
Those exported variables are read inside the Scala program using "System.getenv(<variable_name_exported>)".
Now the problem is that we are switching to yarn cluster mode in spark-submit.
If we submit the job in cluster mode, those exported variables come through as null inside the program.
As per the post below, if I use "spark.yarn.appMasterEnv." I am able to access those exported variables. But we export nearly 40 variables in the shell script, so building a --conf for each of the 40 variables is tedious. (The variables change dynamically.)
How to pass environment variables to spark driver in cluster mode with spark-submit
Now my question is: is there a way to specify multiple environment variables in a file and pass that file to spark-submit?
That would keep the code changes to a minimum.
Please help. Thanks in advance.
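One possible approach, sketched here under assumptions rather than as a confirmed solution: keep the variables in a KEY=VALUE file and generate the spark.yarn.appMasterEnv. settings from it, so the 40 --conf arguments never have to be written by hand. The file format, variable names, and helper names below are all hypothetical:

```python
def parse_env_file(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    variables = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        variables[key.strip()] = value.strip()
    return variables


def app_master_env_conf(variables):
    """Turn each variable into a --conf spark.yarn.appMasterEnv.NAME=value pair."""
    args = []
    for name, value in variables.items():
        args += ["--conf", f"spark.yarn.appMasterEnv.{name}={value}"]
    return args


# Hypothetical file contents; in practice this would be read from disk.
sample = """
# variables exported by the old shell script
DB_HOST=db.example.internal
BATCH_SIZE=500
"""
conf_args = app_master_env_conf(parse_env_file(sample))
print(conf_args)
```

The generated list can be spliced into the spark-submit argument list between the options and the application jar. If the executors also need the variables, the same loop could emit spark.executorEnv.NAME entries.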
I'm trying to minimize changes in my code so I'm wondering if there is a way to submit a spark-streaming job from my personal PC/VM as follows:
spark-submit --class path.to.your.Class --master yarn --deploy-mode client \
[options] <app jar> [app options]
without using the GCP SDK.
I also have to specify a directory with configuration files (HADOOP_CONF_DIR), which I was able to download from Ambari.
Is there a way to do the same?
Thank you
Setting up an external machine as a YARN client node is generally difficult to do and not a workflow that will work easily with Dataproc.
In a comment you mention that what you really want to do is
1. Submit a Spark job to the Dataproc cluster.
2. Run a local script on each "batchFinish" (StreamingListener.onBatchCompleted?).
3. The script has dependencies that mean it cannot run inside of the Dataproc master node.
Again, configuring a client node outside of the Dataproc cluster and getting it to work with spark-submit is not going to work directly. However, you can configure your network such that the Spark driver (running within Dataproc) has access to the service/script you need to run, and then invoke it when desired.
If you run your service on a VM that has access to the network of the Dataproc cluster, then your Spark driver should be able to access the service.
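To make the idea concrete, here is a minimal sketch of how driver-side code might notify such a service over HTTP when a batch completes. It assumes the script is wrapped in a small HTTP service; the endpoint address, payload fields, and function names are hypothetical:

```python
import json
import urllib.request


def build_batch_notification(service_url, batch_info):
    """Build a POST request carrying batch details as JSON."""
    body = json.dumps(batch_info).encode("utf-8")
    return urllib.request.Request(
        service_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify(service_url, batch_info, timeout=5):
    """Send the notification and return the HTTP status code."""
    request = build_batch_notification(service_url, batch_info)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical endpoint on a VM reachable from the cluster network.
request = build_batch_notification(
    "http://10.0.0.5:8080/batch-finished", {"batch_time_ms": 1700000000000}
)
print(request.get_method(), request.full_url)
```

notify would be called from whatever hook fires at the end of a batch; the sketch only builds the request so that it can run without the service being available.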
I've written a Spark program that needs to be executed on an EMR cluster. But the Python program uses some dependent files and modules. Is there any way to set up these dependent components on a running cluster?
Can we mount an S3 bucket on the cluster nodes and put all the dependent components on S3? Is this a good idea, and how can we mount S3 buckets on EMR using Python?
(During cluster creation): You can use Amazon EMR custom bootstrap actions, which can execute a bash script when the cluster is created. You can install all the dependent components using this script. The bootstrap action is performed on all nodes of the cluster.
(On a running cluster): You can use the Amazon EMR step option to create an s3-dist-cp command-runner step that copies files from S3.
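For the running-cluster case, a step definition of roughly the following shape could be submitted. This is a sketch under assumptions: the bucket, destination path, step name, and helper name are hypothetical, and the dictionary is only built here, not sent to EMR:

```python
def s3_dist_cp_step(src, dest, name="Copy dependencies"):
    """Build an EMR step definition that runs s3-dist-cp via command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", f"--src={src}", f"--dest={dest}"],
        },
    }


# Hypothetical source bucket and HDFS destination.
step = s3_dist_cp_step("s3://my-bucket/deps/", "hdfs:///deps/")
print(step["HadoopJarStep"]["Args"])
```

With boto3, a dictionary like this can be passed in the Steps list of the EMR client's add_job_flow_steps call, together with the cluster's JobFlowId.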
I have a production environment that consists of several (persistent and ad-hoc) EMR Spark clusters.
I would like to use one instance of spark-jobserver to manage the job JARs for this environment in general, and be able to specify the intended master right when I POST /jobs, and not permanently in the config file (using master = "local[4]" configuration key).
Obviously I would prefer to have spark-jobserver running on a standalone machine, and not on any of the masters.
Is this somehow possible?
You can write a SparkMasterProvider
A complex example is here https://github.com/spark-jobserver/jobserver-cassandra/blob/master/src/main/scala/spark.jobserver/masterLocators/dse/DseSparkMasterProvider.scala
I think all you have to do is write one that returns the config input as the Spark master; that way you can pass it as part of the job config.