Upload zip file using --archives option of spark-submit on yarn - scala

I have a directory with some model files, and my application has to access these model files on the local file system for certain reasons.
Of course I know that the --files option of spark-submit can upload files to the working directory of each executor, and it does work.
However, I want to keep the directory structure of my files, so I came up with the --archives option, which is described as:
YARN-only:
......
--archives ARCHIVES Comma separated list of archives to be extracted into the working directory of each executor.
......
But when I actually use it to upload models.zip, I find that YARN just puts it there without extraction, just like it does with --files. Have I misunderstood "to be extracted", or have I misused this option?

Found the answer myself.
YARN does extract the archive, but it adds an extra folder with the same name as the archive. To make it clear: if I put models/model1 and models/model2 in models.zip, then I have to access my models as models.zip/models/model1 and models.zip/models/model2.
Moreover, we can make this cleaner using the # syntax.
The --files and --archives options support specifying file names with the # similar to Hadoop. For example you can specify: --files localtest.txt#appSees.txt and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name appSees.txt, and your application should use the name as appSees.txt to reference it when running on YARN.
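For illustration, a minimal sketch of how the two pieces fit together (the alias name mydir, the class name, and the jar name below are hypothetical, and the resulting paths depend on what is inside your archive):
spark-submit --master yarn \
  --archives /path/to/models.zip#mydir \
  --class com.example.Main \
  my-app.jar
// executor-side code can then read the extracted files through the alias,
// e.g. mydir/models/model1 instead of models.zip/models/model1
val model1 = scala.io.Source.fromFile("mydir/models/model1").mkString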
Edit:
This answer was tested on Spark 2.0.0, and I'm not sure about the behavior in other versions.

Related

Data Fusion: GCS Create creating folders, not objects

I am trying to create a GCS object (file) with the GCS Create plugin of Data Fusion,
but it is creating a folder instead.
How can I have a file created instead of a folder?
It seems that the description of the plugin leads to a misunderstanding. Cloud Storage doesn't work like a conventional filesystem, so you cannot strictly "create" empty files. The gsutil command doesn't have an equivalent of the touch command (on Linux), and all "basic" operations in this product are limited to the cp command (uploading and downloading files).
Therefore, since there is no file when you specify the storage URL, it's expected that a folder will be created instead of a file.
Based on this, I would like to suggest two workarounds:
If you are using this plugin to create a file as a ‘flag’, you can continue using the plugin, since the created folder also serves as a flag (to trigger a Cloud Function, for example).
If you need to create a file, you can use the GCS plugin located in the ‘Sink’ plugins group to write records to one or more files in a directory on Google Cloud Storage. Files can be written in various formats such as CSV, Avro, Parquet, and JSON.

Please give one scenario for when to use spark-submit --files

When should I use --files with spark-submit, and what happens internally?
Sorry to ask this basic question, but I did not find much documentation about --files.
Please provide an example or scenario.
Thanks in advance.
You can see details about an option by running spark-submit --help. Here is what it says, which is pretty clear: you can use it for resolving a file dependency, for example distributing a config file while submitting the application.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
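For instance, a minimal sketch along those lines (the file name app.conf, the class name, and the jar name are made up for illustration):
spark-submit --master yarn \
  --class com.example.Main \
  --files /local/path/app.conf \
  my-app.jar
import org.apache.spark.SparkFiles
import scala.io.Source

// each executor gets a local copy of app.conf in its working directory;
// SparkFiles.get resolves the local path of that copy by file name
val confPath = SparkFiles.get("app.conf")
val settings = Source.fromFile(confPath).getLines().toList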

Integrate New Relic in a Flink Scala project

I want to integrate New Relic into my Flink project. I have downloaded the newrelic.yml file from my account, changed only the app name, created a folder named newrelic in my project root folder, and placed the newrelic.yml file in it.
I have also placed the following dependency in my build.sbt file:
"com.newrelic.agent.java" % "newrelic-api" % "3.0.0"
I am using the following command to run my jar:
flink run -m yarn-cluster -yn 2 -c Main /home/hadoop/test-assembly-0.2.jar
I guess my code is not able to read my newrelic.yml file, because I can't see my app name in New Relic. Do I need to initialize the New Relic agent somewhere (if yes, how)? Please help me with this integration.
You should only need the newrelic.jar and newrelic.yml files to be accessible and have -javaagent:path/to/newrelic.jar passed to the JVM as an argument. You could try putting both newrelic.jar and newrelic.yml into your lib/ directory so they get copied to the job & task managers, then adding this to your conf/flink-conf.yaml:
env.java.opts: -javaagent:lib/newrelic.jar
Both New Relic files should be in the same directory, and you ought to be able to remove the New Relic line from your build.sbt file. Also double-check that your license key is in the newrelic.yml file.
I haven't tested this, but the main goal is for the .yml and .jar to be accessible in the same directory (the .yml can go into a different directory, but other JVM arguments will need to be passed to reference it) and to pass -javaagent:path/to/newrelic.jar as a JVM argument. If you run into issues, try checking for New Relic logs in the log folder of the directory where the .jar is located.
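If you would rather keep the two New Relic files outside lib/, a sketch of that alternative (the paths are placeholders, and newrelic.config.file is the agent system property for pointing at the yml, so verify it against the New Relic docs for your agent version) would be a conf/flink-conf.yaml entry like:
env.java.opts: -javaagent:/opt/newrelic/newrelic.jar -Dnewrelic.config.file=/opt/newrelic/newrelic.yml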

Google Spreadsheet Spark library

I am using the https://github.com/potix2/spark-google-spreadsheets library for reading a spreadsheet file in Spark. It is working perfectly on my local machine.
val df = sqlContext.read.
  format("com.github.potix2.spark.google.spreadsheets").
  option("serviceAccountId", "xxxxxx@developer.gserviceaccount.com").
  option("credentialPath", "/path/to/credential.p12").
  load("<spreadsheetId>/worksheet1")
I created a new assembly jar that includes all the credentials and used that jar for reading the file. But I am facing an issue with reading the credentialPath file. I tried using
getClass.getResourceAsStream("/resources/Aircraft/allAircraft.txt")
but the library only supports absolute paths. Please help me resolve this issue.
You can use the --files argument of spark-submit or SparkContext.addFile() to distribute a credential file. If you want to get the local path of the credential file on a worker node, you should call SparkFiles.get("credential filename").
import org.apache.spark.SparkFiles

// you can also use `spark-submit --files=credential.p12`
sqlContext.sparkContext.addFile("credential.p12")
val credentialPath = SparkFiles.get("credential.p12")

val df = sqlContext.read.
  format("com.github.potix2.spark.google.spreadsheets").
  option("serviceAccountId", "xxxxxx@developer.gserviceaccount.com").
  option("credentialPath", credentialPath).
  load("<spreadsheetId>/worksheet1")
Use SBT and try the Typesafe Config library.
A simple but complete sample would read the information it needs from a config file placed in the resources folder.
Then you can assemble a jar file using the sbt-assembly plugin.
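A minimal sketch of such a sample, assuming an application.conf under src/main/resources, an illustrative key name google.credentialPath, and the "com.typesafe" % "config" dependency in build.sbt:
src/main/resources/application.conf:
google.credentialPath = "/path/to/credential.p12"
And the Scala side:
import com.typesafe.config.ConfigFactory

object AppConfig {
  // ConfigFactory.load() picks up application.conf from the classpath (the resources folder)
  private val config = ConfigFactory.load()
  val credentialPath: String = config.getString("google.credentialPath")
}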
If you're working in the Databricks environment, you can upload the credentials file.
Setting the GOOGLE_APPLICATION_CREDENTIALS environment variable, as described here, does not get you around this requirement because it's a link to the file path, not the actual credentials. See here for more details about getting the right credentials and using the library.

"Not A Valid Jar" When trying to run Map Reduce Job

I am trying to run my MapReduce job by building a jar from Eclipse, but while trying to execute the job I am getting a "Not a valid Jar" error.
I have tried to follow the link Not a valid Jar but that didn't help.
Can anyone please give me instructions on how to build the jar from Eclipse so that it runs on Hadoop?
I am aware of the process of building a jar file from Eclipse; however, I am not sure whether I have to take any special care when building the jar file so that it runs on Hadoop.
When you submit the command, make certain of the following:
When you indicate the jar, make certain you are pointing to the jar properly. It may be easiest to be certain by using the absolute path. To get the absolute path, navigate to the place where the jar is and run the 'readlink -f' command on it. So for you, not just hist.jar, but maybe /home/akash_user/jars/hist.jar or wherever it is on your system. If you are using Eclipse, it may be saving it somewhere unexpected, so make sure that is not the problem. The jar cannot be run from HDFS storage; it must be run from local storage.
When you name your main class, in your example Histogram, you must use the fully qualified name of the class, with the package, the project, and the class. So, usually, if the program/project is named Histogram and there is a HistogramDriver, HistogramMapper, and HistogramReducer, and your main() is in HistogramDriver, you need to type Histogram.HistogramDriver to get the program running. (Unless you made your jar runnable, which requires extra steps at the beginning, such as creating a manifest.)
Make sure that the jar you are submitting (hist.jar) is in the current directory from which you are submitting the 'hadoop jar' command.
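Putting those points together, the command would look something like the following (the jar path and class name are just the example names from this question, and the HDFS input/output paths are placeholders):
hadoop jar /home/akash_user/jars/hist.jar Histogram.HistogramDriver /user/akash_user/input /user/akash_user/output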
If the issue is still persisting, please tell us the Java, Hadoop, and Linux versions you are using.
You should not keep the jar file in HDFS when executing the MapReduce job. Make sure the jar is available on the local path. The input path and output directory should be paths in HDFS.