Please give one scenario when to use spark-submit --files - Scala

When should I use --files with spark-submit, and what will happen internally?
Sorry to ask such a basic question, but I did not find much documentation about --files.
Please provide an example or scenario.
Thanks in advance.

You can see details about an option by running spark-submit --help. Here is what it says, which is pretty clear. You can use --files to resolve a file dependency, for example to distribute a config file while submitting the application.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
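For instance, a common pattern is to ship a small properties file next to the application jar and resolve its local path at runtime. A minimal sketch, assuming a file named app.conf and a property key input.path that are purely illustrative:
// submitted with, e.g.:
//   spark-submit --files app.conf --class Main myapp.jar
import java.util.Properties
import scala.io.Source
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("files-demo").getOrCreate()

    // --files places app.conf in the working directory of each executor;
    // SparkFiles.get resolves the local path of the copy on this node
    val confPath = SparkFiles.get("app.conf")

    val props = new Properties()
    props.load(Source.fromFile(confPath).bufferedReader())
    println(props.getProperty("input.path")) // illustrative key

    spark.stop()
  }
}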

Related

In what order will multiple PySpark programs get executed on a Spark cluster?

If I submit multiple Python (PySpark) files to a spark-submit command, in which order will they get executed?
For Java, there is a main method which gets executed first, and the rest of the classes get executed in the order their objects/methods are created/invoked.
But Python (and also Scala) allows REPL-style syntax whereby one can type commands in an 'open code' fashion, i.e. outside method blocks.
So when a whole bunch of these REPL statements get submitted to the Spark cluster, in what order will they execute?
According to http://spark.apache.org/docs/3.0.1/configuration.html
spark.submit.pyFiles (which is --py-files): Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps. Globs are allowed.
So the Python files added via --py-files are meant to be libraries, modules, or packages, not runnable scripts. You will need to create a main.py or something similar, import the other five files from it, and trigger them in any order you want: spark-submit --py-files five-files.zip main.py

Where should I put jars on a dataproc cluster so they can be used by gcloud dataproc jobs submit spark?

I have an initialisation script that downloads a .jar from our local artefact repository and places it into /usr/local/bin on each node of the cluster. I can run it using
gcloud dataproc jobs submit spark --cluster=my_cluster \
--region=us-central1 --jar=file:///usr/local/bin/myjar.jar -- arg1 arg2
However I'd prefer it if my end users did not have to know the location of the jar.
Where can I put the .jar so that the location of it does not have to be specified?
For spark jobs, you should be able to just place your jarfiles in /usr/lib/spark/jars on all nodes to automatically be available on the classpath.
For a more general coverage, you could add your jars to /usr/lib/hadoop/lib instead; the hadoop lib directory is also automatically included into Spark jobs on Dataproc, and is where libraries such as the GCS connector jarfile reside. You can see the hadoop lib directory being included via the SPARK_DIST_CLASSPATH environment variable configured in /etc/spark/conf/spark-env.sh.
If the desired behavior is still to use the --jar flag to specify a "main jar", rather than --jars to specify library jars that just provide classes, unfortunately there's currently no notion of a "working directory" on the cluster that would allow specifying relative (instead of absolute) paths to the "main jar". However, there are two approaches with similar behavior:
Make the jarfiles local to the user's workspace from which jobs are being submitted; gcloud will then upload the jarfile into GCS at job-submission time and point the job at it when it runs in a job-specific directory. Note that this causes duplicate uploads of the jarfile into GCS each time the job runs, since it is always staged into a unique job directory; you'd have to run gcloud dataproc jobs delete later on to clean up the GCS space used by those jarfiles.
(Preferred approach): Use --class instead of the --jar argument to specify which job to run, after doing the steps above to make the jar available on the Spark classpath. While invoking a class name is a bit more verbose, it still achieves the goal of hiding the details of the jarfile's location from the user.
For example, the classes used for "spark-shell" implementation are already on the classpath, so if you wanted to run a scala file as if you were running it through spark-shell, you could run:
gcloud dataproc jobs submit spark --cluster my-cluster \
--class org.apache.spark.repl.Main \
-- -i myjob.scala

Google Spreadsheet Spark library

I am using the https://github.com/potix2/spark-google-spreadsheets library for reading a spreadsheet file in Spark. It works perfectly on my local machine.
val df = sqlContext.read.
format("com.github.potix2.spark.google.spreadsheets").
option("serviceAccountId", "xxxxxx#developer.gserviceaccount.com").
option("credentialPath", "/path/to/credentail.p12").
load("<spreadsheetId>/worksheet1")
I created a new assembly jar that includes all the credentials and use that jar for reading the file. But I am facing an issue with reading the credentialPath file. I tried using
getClass.getResourceAsStream("/resources/Aircraft/allAircraft.txt")
But the library only supports absolute paths. Please help me resolve this issue.
You can use the --files argument of spark-submit or SparkContext.addFile() to distribute a credential file. If you want to get the local path of the credential file on a worker node, call SparkFiles.get("credential filename").
import org.apache.spark.SparkFiles
// you can also use `spark-submit --files=credential.p12`
sqlContext.sparkContext.addFile("credential.p12")
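// SparkFiles.get returns the local path where credential.p12 was copied on this node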
val credentialPath = SparkFiles.get("credential.p12")
val df = sqlContext.read.
format("com.github.potix2.spark.google.spreadsheets").
option("serviceAccountId", "xxxxxx#developer.gserviceaccount.com").
option("credentialPath", credentialPath).
load("<spreadsheetId>/worksheet1")
Use SBT and try the Typesafe Config library.
Here is a simple but complete sample which reads some information from a config file placed in the resources folder.
Then you can assemble a jar file using the sbt-assembly plugin.
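As a rough illustration of that approach (the dependency version, config keys, and application.conf layout below are assumptions, not the original sample):
// build.sbt (version is just an example):
//   libraryDependencies += "com.typesafe" % "config" % "1.4.2"
//
// src/main/resources/application.conf (illustrative keys):
//   google {
//     serviceAccountId = "xxxxxx@developer.gserviceaccount.com"
//     credentialPath   = "/path/to/credential.p12"
//   }
import com.typesafe.config.ConfigFactory

object AppConfig {
  // ConfigFactory.load() reads application.conf from the classpath,
  // i.e. from the resources folder packaged into the assembly jar
  private val config = ConfigFactory.load()

  val serviceAccountId: String = config.getString("google.serviceAccountId")
  val credentialPath: String = config.getString("google.credentialPath")
}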
If you're working in the Databricks environment, you can upload the credentials file.
Setting the GOOGLE_APPLICATION_CREDENTIALS environment variable, as described here, does not get you around this requirement because it's a link to the file path, not the actual credentials. See here for more details about getting the right credentials and using the library.

Upload zip file using --archives option of spark-submit on yarn

I have a directory with some model files, and my application has to access these model files on the local file system for some reason.
Of course I know that the --files option of spark-submit can upload files to the working directory of each executor, and it does work.
However, I want to keep the directory structure of my files, so I came up with the --archives option, which is documented as:
YARN-only:
......
--archives ARCHIVES Comma separated list of archives to be extracted into the working directory of each executor.
......
But when I actually use it to upload models.zip, I found YARN just puts it there without extraction, just like it does with --files. Have I misunderstood "to be extracted", or misused this option?
Found the answer myself.
YARN does extract the archive, but adds an extra folder with the same name as the archive. To make it clear: if I put models/model1 and models/model2 in models.zip, then I have to access my models via models.zip/models/model1 and models.zip/models/model2.
Moreover, we can make this cleaner using the # syntax.
The --files and --archives options support specifying file names with the # similar to Hadoop. For example you can specify: --files localtest.txt#appSees.txt and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name appSees.txt, and your application should use the name as appSees.txt to reference it when running on YARN.
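A short Scala sketch of what access from a task could look like under those rules (the extracted alias and file names are only illustrative):
// submitted with, e.g.:
//   spark-submit --master yarn --archives models.zip#extracted --class Main myapp.jar
import scala.io.Source
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("archives-demo").getOrCreate()

    // YARN extracts models.zip into each container's working directory under
    // the "extracted" alias; without the alias the relative path would be
    // "models.zip/models/model1"
    val preview = spark.sparkContext.parallelize(Seq(1)).map { _ =>
      Source.fromFile("extracted/models/model1").mkString.take(100)
    }.collect().head

    println(preview)
    spark.stop()
  }
}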
Edit:
This answer was tested on Spark 2.0.0 and I'm not sure about the behavior in other versions.

"Not A Valid Jar" When trying to run Map Reduce Job

I am trying to run my MapReduce job by building a jar from Eclipse, but while trying to execute the job, I am getting a "Not a valid Jar" error.
I have tried to follow the link Not a valid Jar but that didn't help.
Can anyone please give me instructions on how to build the jar from Eclipse so that it runs on Hadoop?
I am aware of the process of building the jar file from Eclipse; however, I am not sure whether I have to take any special care when building the jar file so that it runs on Hadoop.
When you submit the command, make certain of the following:
When you indicate the jar, make certain you are pointing to it properly. It may be easiest to use the absolute path; navigate to the directory where the jar is and run the 'readlink -f ' command to get it. So for you, not just hist.jar, but maybe /home/akash_user/jars/hist.jar or wherever it is on your system. If you are using Eclipse, it may be saving the jar somewhere unexpected, so make sure that is not the problem. The jar cannot be run from HDFS storage; it must be run from local storage.
When you name your main class, in your example Histogram, you must use the fully qualified name of the class, with the package and the class. So, usually, if the program/project is named Histogram, and there is a HistogramDriver, HistogramMapper, and HistogramReducer, and your main() is in HistogramDriver, you need to type Histogram.HistogramDriver to get the program running. (Unless you made your jar runnable, which requires extra steps at the beginning, such as creating a manifest (.MF) file.)
Make sure that the jar you are submitting (hist.jar) is in the current directory from where you are submitting the 'hadoop jar' command.
If the issue still persists, please tell us the Java, Hadoop, and Linux versions you are using.
You should not keep the jar file in HDFS when executing the MapReduce job. Make sure the jar is available on a local path. The input path and output directory, however, should be paths in HDFS.