Where should I put jars on a Dataproc cluster so they can be used by gcloud dataproc jobs submit spark?

I have an initialisation script that downloads a .jar from our local artefact repository and places it into /usr/local/bin on each node of the cluster. I can then run it using
gcloud dataproc jobs submit spark --cluster=my_cluster \
--region=us-central1 --jar=file:///usr/local/bin/myjar.jar -- arg1 arg2
However I'd prefer it if my end users did not have to know the location of the jar.
Where can I put the .jar so that the location of it does not have to be specified?

For Spark jobs, you should be able to just place your jarfiles in /usr/lib/spark/jars on all nodes so that they are automatically available on the classpath.
For broader coverage, you could add your jars to /usr/lib/hadoop/lib instead; the Hadoop lib directory is also automatically included in Spark jobs on Dataproc, and is where libraries such as the GCS connector jarfile reside. You can see the Hadoop lib directory being included via the SPARK_DIST_CLASSPATH environment variable configured in /etc/spark/conf/spark-env.sh.
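For example (a hedged sketch, assuming the jar is available in a GCS bucket you control), the initialization action could copy it straight onto the Spark classpath instead of into /usr/local/bin:
gsutil cp gs://my-bucket/myjar.jar /usr/lib/spark/jars/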
If you still want to use the --jar flag to specify a "main jar" (as opposed to --jars, which specifies library jars that merely provide classes), unfortunately there's currently no notion of a "working directory" on the cluster that would let you give a relative (instead of absolute) path to the "main jar". However, there are two approaches with similar behavior:
Make the jarfiles local to the user's workspace from which jobs are being submitted; gcloud will then upload the jarfile into GCS at job-submission time and point the job at it in a job-specific directory when it runs. Note that this causes duplicate uploads of the jarfile into GCS each time the job runs, since it always stages into a unique job directory; you would have to run gcloud dataproc jobs delete later on to clean up the GCS space used by those jarfiles.
(Preferred approach): Use --class instead of the --jar argument to specify which job to run, after doing the steps above to get the jar onto the Spark classpath. While invoking a classname is a bit more verbose, it still achieves the goal of hiding the jarfile location from the user (see the sketch after the spark-shell example below).
For example, the classes used for the spark-shell implementation are already on the classpath, so if you wanted to run a Scala file as if you were running it through spark-shell, you could run:
gcloud dataproc jobs submit spark --cluster my-cluster \
--class org.apache.spark.repl.Main \
-- -i myjob.scala
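Similarly, once your own jar has been staged into /usr/lib/spark/jars (or /usr/lib/hadoop/lib) as described above, a user only needs to know the class name; with a hypothetical main class com.example.MyJob, that would look like:
gcloud dataproc jobs submit spark --cluster=my_cluster \
--region=us-central1 --class=com.example.MyJob -- arg1 arg2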

Related

Please give one scenario of when to use spark-submit --files

When should I use --files with spark-submit, and what happens internally?
Sorry to ask this basic question, but I did not find much documentation about --files.
Please provide an example or scenario.
Thanks in advance.
You can see details about an option by running spark-submit --help. Here is what it says, which is pretty clear. You can use it to resolve a file dependency, for example distributing a config file when submitting the application.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
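As a small Scala sketch of that (the file name app.conf is just a placeholder): if you submit with --files app.conf, the code running in an executor can resolve the localized copy like this:
import org.apache.spark.SparkFiles
import scala.io.Source

// A file shipped with `spark-submit --files app.conf` lands in each
// executor's working directory; SparkFiles.get resolves its local path.
val confPath: String = SparkFiles.get("app.conf")
val confContents: String = Source.fromFile(confPath).getLines().mkString("\n")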

Integrate New Relic in a Flink Scala project

I want to integrate New Relic into my Flink project. I have downloaded my newrelic.yml file from my account and changed only the app name. I have created a folder named newrelic in my project root folder and placed the newrelic.yml file in it.
I have also placed the following dependency in my build.sbt file:
"com.newrelic.agent.java" % "newrelic-api" % "3.0.0"
I am using the following command to run my jar:
flink run -m yarn-cluster -yn 2 -c Main /home/hadoop/test-assembly-0.2.jar
I guess my code is not able to read my newrelic.yml file, because I can't see my app name in New Relic. Do I need to initialize the New Relic agent somewhere (and if so, how)? Please help me with this integration.
You should only need the newrelic.jar and newrelic.yml files to be accessible and have -javaagent:path/to/newrelic.jar passed to the JVM as an argument. You could try putting both newrelic.jar and newrelic.yml into your lib/ directory so they get copied to the job & task managers, then adding this to your conf/flink-conf.yaml:
env.java.opts: -javaagent:lib/newrelic.jar
Both New Relic files should be in the same directory and you ought to be able to remove the New Relic line from your build.sbt file. Also double check that your license key is in the newrelic.yml file.
I haven't tested this, but the main goal is for the .yml and .jar to be accessible in the same directory (the .yml can go into a different directory, but additional JVM arguments will need to be passed to reference it) and for -javaagent:path/to/newrelic.jar to be passed as a JVM argument. If you run into issues, try checking for New Relic logs in the logs folder of the directory where the .jar is located.
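Putting those pieces together, an untested sketch (assuming a standalone Flink distribution under /opt/flink and that the downloaded agent jar is named newrelic.jar):
cp newrelic.jar newrelic.yml /opt/flink/lib/
Then add the env.java.opts line above to /opt/flink/conf/flink-conf.yaml and resubmit the job with the same flink run command as before.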

Upload zip file using --archives option of spark-submit on yarn

I have a directory with some model files, and my application has to access these model files on the local file system for some reason.
Of course I know that the --files option of spark-submit can upload files to the working directory of each executor, and it does work.
However, I want to keep the directory structure of my files, so I came up with the --archives option, which is documented as:
YARN-only:
......
--archives ARCHIVES Comma separated list of archives to be extracted into the working directory of each executor.
......
But when I actually used it to upload models.zip, I found that YARN just put it there without extracting it, just like it does with --files. Have I misunderstood "to be extracted", or have I misused this option?
Found the answer myself.
YARN does extract the archive, but adds an extra folder with the same name as the archive. To make it clear: if I put models/model1 and models/model2 in models.zip, then I have to access my models as models.zip/models/model1 and models.zip/models/model2.
Moreover, we can make this cleaner using the # syntax.
The --files and --archives options support specifying file names with the # similar to Hadoop. For example you can specify: --files localtest.txt#appSees.txt and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name appSees.txt, and your application should use the name as appSees.txt to reference it when running on YARN.
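For example (hypothetical paths, following the quoted docs), submitting with
spark-submit --master yarn --archives /path/to/models.zip#models --class Main my-app.jar
links the extracted archive as models instead of models.zip, so the files from the example above would be read as models/models/model1 and models/models/model2.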
Edit:
This answer was tested on Spark 2.0.0, and I'm not sure about the behavior in other versions.

How do I specify a config file with play 2.4 and activator

I am building a Scala Play 2.4 application which uses the Typesafe Activator.
I would like to run my tests twice, with a different configuration file for each run.
How can I specify alternative config files, or override the config settings?
I currently run tests with the command "./activator test"
You can create different configuration files for different environments/purposes. For example, I have three configuration files for local testing, alpha deployment, and production deployment as in this project https://github.com/luongbalinh/play-mongo
You can specify the configuration for running as follows:
activator run -Dconfig.resource=application.conf
where application.conf is the configuration you want to use.
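Under the hood, Play's configuration is loaded via Typesafe Config, which is what honors the config.resource and config.file system properties; a minimal standalone Scala sketch of that lookup (the key name is just a placeholder):
import com.typesafe.config.ConfigFactory

// ConfigFactory.load() checks the config.resource / config.file system
// properties first and otherwise falls back to application.conf on the
// classpath, which is why the -D flag selects which file gets loaded.
val config = ConfigFactory.load()
println(config.getString("some.placeholder.key")) // placeholder key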
You can create different configuration files for different environments. To specify which configuration to use with activator run, use the following command:
activator "run -Dconfig.resource=application.conf"
where application.conf is the desired configuration. Without the quotes it did not work for me. This uses the same configuration parameters as when going into production mode, as described here:
https://www.playframework.com/documentation/2.5.x/ProductionConfiguration#Specifying-an-alternate-configuration-file
It is also important to know that config.resource looks for the configuration within the conf/ folder, so there is no need to include that prefix. For full paths that are not among the resources, use config.file. Further reading is in the link above.
The quotes are needed because you do not want to pass the -D option to activator itself, but to the run command. With the quotes, activator's JVM gets no -D argument on its command line; instead it interprets "run -Dconfig.file=application.conf" and sets the config.file property accordingly, within the activator JVM.
This was already discussed here: Activator : Play Framework 2.3.x : run vs. start
Since all of the above are partially incorrect, here is my hard-won knowledge from the last weekend.
Use include "application.conf" not include "application" (which Akka does)
Configs must be named .conf or Play will discard them silently
You probably want -Dconfig.file=<file>.conf so you're not classpath dependent
Make sure you provide the full file path (e.g. /opt/configs/prod.conf)
Example
Here is an example of what we run:
#prod.conf
include "application"
akka.remote.hostname = "prod.blah.com"
# Example of passing in S3 keys
s3.awsAccessKeyId="YOUR_KEY"
s3.awsSecretAccessKey="YOUR_SECRET_KEY"
And just pass it in like so:
activator -Dconfig.file=/var/lib/jenkins/jenkins.conf test
or if you fancy SBT:
sbt -Dconfig.file=/var/lib/jenkins/jenkins.conf test
Dev Environment
Also note that it's easy to make a developer.conf file as well, to keep all your passwords and local ports, and then add it to .gitignore so devs don't accidentally check it in.
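For instance, a hedged sketch of such a file (the keys are only illustrative), mirroring the prod.conf example above:
#developer.conf
include "application"
# local-only overrides that should never be committed
s3.awsAccessKeyId="DEV_KEY"
s3.awsSecretAccessKey="DEV_SECRET_KEY"
Then add developer.conf to your .gitignore and run with -Dconfig.file pointing at it, as above.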
The command below works with Play 2.5:
$ activator -Dconfig.resource=jenkins.conf run
https://www.playframework.com/documentation/2.5.x/ProductionConfiguration

Docker Data Volume for SBT Dependencies

I am using Docker for continuous integration of a Scala project. Inside the container I am building the project and creating a distribution with "sbt dist".
This takes ages pulling down all the dependencies, and I would like to use a Docker data volume as mentioned here: http://docs.docker.io/en/latest/use/working_with_volumes/
However, I don't understand how I could get SBT to put the jar files in the volume, or how SBT would know how to read them from that volume.
SBT uses Ivy to resolve project dependencies. Ivy caches downloaded artifacts locally, and every time it is asked to pull something it first checks that cache, downloading from the remote repository only if nothing is found. By default the cache is located in ~/.ivy2, but its location is a configurable property. So just mount a volume, point Ivy to it (or mount it so that it ends up at the default location), and enjoy the cached artifacts.
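A minimal sketch of the first option (untested; my-build-image and the ivy-cache volume name are placeholders), pointing Ivy at the mounted volume via sbt's sbt.ivy.home system property:
docker run -v ivy-cache:/cache my-build-image \
  sbt -Dsbt.ivy.home=/cache/.ivy2 dist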
Not sure if this makes sense on an integration server, but when developing on localhost, I'm mapping my host's .ivy2/ and .sbt/ directories to volumes in the container, like so:
docker run ... -v ~/.ivy2:/root/.ivy2 -v ~/.sbt:/root/.sbt ...
(Apparently, inside the container, .ivy2/ and .sbt/ are placed in /root/, since we're logging in to the container as the root user.)