Google Spreadsheets Spark library - Scala

I am using the https://github.com/potix2/spark-google-spreadsheets library for reading a spreadsheet file in Spark. It works perfectly on my local machine.
val df = sqlContext.read.
format("com.github.potix2.spark.google.spreadsheets").
option("serviceAccountId", "xxxxxx#developer.gserviceaccount.com").
option("credentialPath", "/path/to/credentail.p12").
load("<spreadsheetId>/worksheet1")
I created a new assembly jar that includes all the credentials and used that jar for reading the file, but I am facing an issue with reading the credentialPath file. I tried using
getClass.getResourceAsStream("/resources/Aircraft/allAircraft.txt")
but the library only supports absolute paths. Please help me resolve this issue.

You can use the --files argument of spark-submit or SparkContext.addFile() to distribute the credential file. If you want to get the local path of the credential file on a worker node, call SparkFiles.get("credential filename").
import org.apache.spark.SparkFiles
// you can also use `spark-submit --files=credential.p12`
sqlContext.sparkContext.addFile("credential.p12")
val credentialPath = SparkFiles.get("credential.p12")
val df = sqlContext.read.
format("com.github.potix2.spark.google.spreadsheets").
option("serviceAccountId", "xxxxxx#developer.gserviceaccount.com").
option("credentialPath", credentialPath).
load("<spreadsheetId>/worksheet1")

Use SBT and try the Typesafe Config library.
A simple but complete approach is to read the needed information from a config file placed in the resources folder.
Then you can assemble a jar file using the sbt-assembly plugin.
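A minimal sketch of what such a sample might look like (the config key name here is an assumption, not from the original question):
import com.typesafe.config.ConfigFactory

// application.conf in src/main/resources, bundled into the assembly jar, e.g.:
//   google.serviceAccountId = "xxxxxx#developer.gserviceaccount.com"
val conf = ConfigFactory.load()
val serviceAccountId = conf.getString("google.serviceAccountId")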

If you're working in the Databricks environment, you can upload the credentials file.
Setting the GOOGLE_APPLICATION_CREDENTIALS environment variable, as described here, does not get you around this requirement, because that variable only points to the file path rather than containing the actual credentials. See here for more details about getting the right credentials and using the library.
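A minimal sketch of the Databricks route, assuming the .p12 file has been uploaded to DBFS (the /dbfs/FileStore path here is an assumption):
val df = sqlContext.read.
format("com.github.potix2.spark.google.spreadsheets").
option("serviceAccountId", "xxxxxx#developer.gserviceaccount.com").
option("credentialPath", "/dbfs/FileStore/credentials/credential.p12").
load("<spreadsheetId>/worksheet1")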

How to add to classpath of running PySpark session

I have a PySpark notebook running in AWS EMR. In my specific case, I want to use pyspark2pmml to create a PMML file for a model I just trained. However, I get the following error (when running pyspark2pmml.PMMLBuilder, but I don't think that matters).
JPMML-SparkML not found on classpath
Traceback (most recent call last):
File "/tmp/1623111492721-0/lib/python3.7/site-packages/pyspark2pmml/__init__.py", line 14, in __init__
raise RuntimeError("JPMML-SparkML not found on classpath")
RuntimeError: JPMML-SparkML not found on classpath
I know that this is caused by my Spark session not having a reference to the needed class. What I don't know is how to start a Spark session with that class available. I found one other answer using %%configure -f, but that changed other settings, which in turn kept me from using sc.install_pypi_package, which I also needed.
Is there a way that I could have started the Spark session with that JPMML class available, but without changing any other settings?
So, here's an answer, but not the one I want.
To add that class to the classpath, I can start my work with this:
%%configure -f
{
"jars": [
"{some_path_to_s3}/jpmml-sparkml-executable-1.5.13.jar"
]
}
That creates the issue I referenced above, where I lose the ability to use sc.install_pypi_package. However, I can add that package in a more manual way. The first step was to create a zip file of just the needed modules using the zip from the project's GitHub (in this case, just the pyspark2pmml directory, instead of the whole zip). Then that module can be added using sc.addPyFile:
sc.addPyFile('{some_path_to_s3}/pyspark2pmml.zip')
After this, I can run the original commands exactly as I expected.

Scala: read the configs dynamically at run time

I have a config file in the below folder structure:
main
  scala
  resources
    application.conf
and that contains
path {
  http {
    url = "http://testingurl"
  }
}
and I am reading it using the code below:
import com.typesafe.config.ConfigFactory
val conf = ConfigFactory.load()
val url = conf.getString("path.http.url")
I am reading this static information, which is provided at build time.
Now I want to read these values at runtime; the user should be able to modify the configs even after the jar is built.
My requirement is to modify the URL even after the jar is built. I don't want to pass the values as arguments to the main function, because I have so many such values that need to be modified after the jar is built.
See:
Lightbend Config Readme
Use at runtime:
-Dconfig.file=<relative or absolute path>
See:
For applications using application.{conf,json,properties},
system properties can be used to force a different config source
(e.g. from command line -Dconfig.file=path/to/config-file):
config.resource specifies a resource name - not a basename, i.e. application.conf not application
config.file specifies a filesystem path, again it should include the extension, not be a basename
config.url specifies a URL
These system properties specify a replacement for application.{conf,json,properties}, not an addition.
They only affect apps using the default ConfigFactory.load() configuration.
In the replacement config file, you can use include "application" to include the original default config file;
after the include statement you could go on to override certain settings.
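Putting that together, a sketch of what this could look like (the override file name, its path, and the jar name are assumptions):
An override.conf placed on the filesystem, outside the jar:
include "application"
path.http.url = "http://modified-url"
Then launch the application against it:
java -Dconfig.file=/path/to/override.conf -jar myapp-assembly.jar
Because of the include, every key that is not overridden still comes from the application.conf bundled in the jar.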
I assume that a .jar file is built with a key provided as
val url = conf.getString("path.http.url")
and that you are going to run this .jar every time with a modified config file.
My requirement is to modify the URL even after the jar is built.
A possible solution is to provide an array of the config values, where the key remains the same in the .jar file:
import com.typesafe.config.ConfigFactory
import scala.collection.JavaConverters._
// functionToBeCalled is a placeholder for whatever processing each URL needs
ConfigFactory.load().getStringList("url.key").asScala.foreach(eachUrl => functionToBeCalled(eachUrl))
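For that to work, url.key would need to be defined as a list in the config file, for example (the values here are just placeholders):
url.key = ["http://url1", "http://url2"]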

Integrate New Relic in a Flink Scala project

I want to integrate New Relic into my Flink project. I have downloaded the newrelic.yml file from my account and changed only the app name. I have created a folder named newrelic in my project root folder and placed the newrelic.yml file in it.
I have also placed the following dependency in my build.sbt file:
"com.newrelic.agent.java" % "newrelic-api" % "3.0.0"
I am using the following command to run my jar:
flink run -m yarn-cluster -yn 2 -c Main /home/hadoop/test-assembly-0.2.jar
I guess my code is not able to read the newrelic.yml file, because I can't see my app name in New Relic. Do I need to initialize the New Relic agent somewhere (if yes, how)? Please help me with this integration.
You should only need the newrelic.jar and newrelic.yml files to be accessible, and to have -javaagent:path/to/newrelic.jar passed to the JVM as an argument. You could try putting both newrelic.jar and newrelic.yml into your lib/ directory so they get copied to the job and task managers, then adding this to your conf/flink-conf.yaml:
env.java.opts: -javaagent:lib/newrelic.jar
Both New Relic files should be in the same directory, and you ought to be able to remove the New Relic line from your build.sbt file. Also double-check that your license key is in the newrelic.yml file.
I haven't tested this, but the main goal is for the .yml and .jar to be accessible in the same directory (the .yml can go into a different directory, but other JVM arguments will need to be passed to reference it) and to pass -javaagent:path/to/newrelic.jar as a JVM argument. If you run into issues, try checking for New Relic logs in the log folder of the directory where the .jar is located.
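If the newrelic.yml does end up in a different directory than the agent jar, the agent's newrelic.config.file system property can point at it; a sketch with assumed paths:
env.java.opts: -javaagent:lib/newrelic.jar -Dnewrelic.config.file=/path/to/newrelic.yml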

How to read system variable into a conf file in Play framework

I can read an environment variable like this:
my.key = ${?MY_KEY_ENV}
But how do I read a system property that's passed in via
-Dmysystem.var=XXX
It is not being resolved in my conf file.
Assuming your project is managed via SBT, make sure you have the following set in the build file:
javaOptions in Global += "-Dmysystem.var=XXX"
and that your application.conf file has the following:
my_key=${mysystem.var}
Now you should be able to refer to my_key using the code below:
configuration.getString("my_key")
I tested this in my Play app and it works as expected.
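Note that the default config loading also resolves JVM system properties directly, so if you prefer to pass -Dmysystem.var=XXX only at launch time, an optional substitution with a default should work as well (a sketch using the same key names):
my_key = "some-default"
my_key = ${?mysystem.var}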

Upload zip file using --archives option of spark-submit on yarn

I have a directory with some model files, and my application has to access these model files in the local file system for some reason.
Of course I know that the --files option of spark-submit can upload files to the working directory of each executor, and it does work.
However, I want to keep the directory structure of my files, so I came up with the --archives option, which is described as:
YARN-only:
......
--archives ARCHIVES Comma separated list of archives to be extracted into the working directory of each executor.
......
But when I actually used it to upload models.zip, I found that YARN just puts it there without extraction, like what it does with --files. Have I misunderstood "to be extracted", or have I misused this option?
Found the answer myself.
YARN does extract the archive, but it adds an extra folder with the same name as the archive. To make it clear: if I put models/model1 and models/model2 in models.zip, then I have to access my models as models.zip/models/model1 and models.zip/models/model2.
Moreover, we can make this nicer using the # syntax.
The --files and --archives options support specifying file names with #, similar to Hadoop. For example, you can specify --files localtest.txt#appSees.txt; this will upload the file you have locally named localtest.txt into HDFS, but it will be linked to by the name appSees.txt, and your application should use the name appSees.txt to reference it when running on YARN.
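A sketch of what that could look like for the models.zip case above (the jar, class, and alias names are assumptions, and the nesting under the alias depends on how the zip was built):
spark-submit --master yarn --archives models.zip#models --class com.example.Main your-app.jar
With the alias, the extracted contents are then reachable under ./models/ in each executor's working directory instead of under models.zip/.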
Edit:
This answer was tested on Spark 2.0.0, and I'm not sure about the behavior in other versions.