How to add to the classpath of a running PySpark session

I have a PySpark notebook running in AWS EMR. In my specific case, I want to use pyspark2pmml to create PMML for a model I just trained. However, I get the following error (raised when running pyspark2pmml.PMMLBuilder, though I don't think that matters).
JPMML-SparkML not found on classpath
Traceback (most recent call last):
  File "/tmp/1623111492721-0/lib/python3.7/site-packages/pyspark2pmml/__init__.py", line 14, in __init__
    raise RuntimeError("JPMML-SparkML not found on classpath")
RuntimeError: JPMML-SparkML not found on classpath
I know that this is caused by my Spark session not having a reference to the needed class. What I don't know is how to start a Spark session with that class available. I found one other answer using %%configure -f, but that changed other settings, which in turn kept me from using sc.install_pypi_package, which I also needed.
Is there a way that I could have started the Spark session with that JPMML class available, but without changing any other settings?

So, here's an answer, but not the one I want.
To add that class to the classpath I can start my work with this:
%%configure -f
{
    "jars": [
        "{some_path_to_s3}/jpmml-sparkml-executable-1.5.13.jar"
    ]
}
That creates the issue I referenced above, where I lose the ability to use sc.install_pypi_package. However, I can add that package in a more manual way. The first step was to create a zip file of just the needed modules, using the zip from the project's GitHub (in this case, just the pyspark2pmml directory instead of the whole zip). That module can then be added using sc.addPyFile:
sc.addPyFile('{some_path_to_s3}/pyspark2pmml.zip')
After this, I can run the original commands exactly as I expected.
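For completeness, here is a minimal sketch of the rest of the flow, run after the %%configure cell above; df and pipeline_model are placeholder names for the training DataFrame and the fitted pyspark.ml PipelineModel.

sc.addPyFile('{some_path_to_s3}/pyspark2pmml.zip')

from pyspark2pmml import PMMLBuilder

# PMMLBuilder takes the SparkContext, the training DataFrame, and the
# fitted PipelineModel; buildFile writes the PMML document to a local path.
pmml_builder = PMMLBuilder(sc, df, pipeline_model)
pmml_builder.buildFile('/tmp/model.pmml')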

Related

Scala issue with reading a file from the resources directory

I wrote something like this to read a file from the resources directory:
val filePath = MyClass.getClass.getResource("/myFile.csv")
val file = filePath.getFile
println(file)
CSVReader.open(file)
and the result I got was something like this:
file:/path/to/project/my_module/src/main/resources/my_module-assembly-0.1.jar!/myFile.csv
Exception in thread "main" java.io.FileNotFoundException: file:/path/to/project/my_module/src/main/resources/my_module-assembly-0.1.jar!/myFile.csv (No such file or directory)
Whereas, if I run the same code in the IDE (IntelliJ), there are no issues, and the path printed to the console is:
/path/to/project/my_module/target/scala-2.11/classes/myFile.csv
FYI, it's a multi-build project with a couple of modules, and I build the jars using sbt assembly.
This is more related to Java or the JVM itself than to Scala or SBT.
There is a difference when running your application from the IDE vs the command line (or outside the IDE). The method getClass.getResource(...) attempts to find the resource URL in the current classpath, and that's the key difference.
If you look at the URL itself, you will find that in the first case you have a my_module-assembly-0.1.jar! bit in it, meaning that URL is actually pointing towards the contents of the JAR, not to a file accessible from the file system.
From inside your IDE, your classpath will include the actual files on disk from the source folders, because the IDE runs the compiled classes directly and no JAR file is involved. So the URL you obtain from getClass.getResource(...) does not have the my_module-assembly-0.1.jar! bit in it.
Since you want to read the contents of the file, you may want to use getClass.getResourceAsStream(...). That will give you an InputStream that you can use to read the contents, regardless of whether you are in the IDE or anywhere else.
Your CSVReader class may have a method that allows it to read the data from an InputStream or a Reader or something similar.
EDIT: As pointed out by Luis Miguel Mejia Suarez, a more idiomatic Scala way of reading files from your classpath is the Source.fromResource method. This returns a BufferedSource that can then be used to read the file contents.

Upload zip file using --archives option of spark-submit on yarn

I have a directory with some model files, and my application has to access these model files on the local file system for some reason.
Of course I know that the --files option of spark-submit can upload files to the working directory of each executor, and it does work.
However, I want to keep the directory structure of my files, so I came up with the --archives option, which is documented as:
YARN-only:
......
--archives ARCHIVES Comma separated list of archives to be extracted into the working directory of each executor.
......
But when I actually use it to upload models.zip, I found that YARN just puts it there without extraction, like what it does with --files. Have I misunderstood "to be extracted", or am I misusing this option?
Found the answer myself.
YARN does extract the archive, but adds an extra folder with the same name as the archive. To make it clear: if I put models/model1 and models/model2 in models.zip, then I have to access my models as models.zip/models/model1 and models.zip/models/model2.
Moreover, we can make this cleaner using the # syntax.
The --files and --archives options support specifying file names with the # similar to Hadoop. For example you can specify: --files localtest.txt#appSees.txt and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name appSees.txt, and your application should use the name as appSees.txt to reference it when running on YARN.
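Putting the two together, a hypothetical submission and the matching access paths would look something like this (the application file and the alias name are placeholders):

spark-submit --archives models.zip#mymodels my_app.py

# Inside the application: YARN extracts models.zip into the container's
# working directory and links it under the alias 'mymodels', so the zip's
# internal directory structure is preserved beneath that name.
with open("mymodels/models/model1") as f:
    model1 = f.read()

# Without the '#mymodels' alias, the same file would instead be at
# models.zip/models/model1.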
Edit:
This answer was tested on Spark 2.0.0, and I'm not sure about the behavior in other versions.

Eclipse Jython: importing modules/classes/.jar

I have been trying to use Eclipse as a Jython IDE. Currently I am working with a program that has a script editor inside it (it uses Jython), but the script must be run in its entirety each time; it is not 'interactive'. There is a large API associated with the program, and I want to use Eclipse to get an interactive console. However, I cannot import the modules/jars/classes.
I have tried appending the folder containing the jar files to the sys.path; I have exploded the jar files and added those files (which contain the class files) to the sys.path. I have also added the jar files and class files to the classpath and the user libraries under Eclipse > Window > Preferences > Java > Build Path.
Currently I can import without error, and I can seemingly construct a class without error, but nothing happens.
An example of what my console looks like is:
>>> from myFile import myClass
>>> myObj = myClass.open(fileName)
>>> myObj
>>> type(myObj)
>>> myObj.__class__
>>> type('string')
<type 'str'>
>>> 'string'.__class__
<type 'str'>
When I try to create an instance of my class, it doesn't throw an error, but it doesn't do anything either. Yet other objects appear to work.
Any insight is appreciated
At the moment I use the jython-standalone.jar in my project on a Windows 7 machine, together with a Java module.
I used 7-Zip (because unzipping and re-adding my jar via the normal zip tooling caused errors) to get access to the standalone.jar (it should also work with the normal jython.jar) and added my module via drag & drop into the "Lib" folder.
If you're using another OS, try something similar.
Since then I haven't had any problems importing my Java modules.
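An alternative that avoids repacking the jar is to append it to sys.path at runtime; Jython can then import Java classes straight from the jar. A minimal sketch, with a placeholder path and placeholder package/class names:

import sys

# placeholder path to the API jar; Jython scans jars on sys.path for Java classes
sys.path.append('C:/path/to/myApi.jar')

# placeholder package and class contained in that jar
from com.example.api import SomeApiClass

obj = SomeApiClass()
print(obj)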

"Not A Valid Jar" When trying to run Map Reduce Job

I am trying to run my MapReduce job by building a jar from Eclipse, but while trying to execute the job, I am getting a "Not a valid Jar" error.
I have tried to follow the linked question "Not a valid Jar", but that didn't help.
Can anyone please give me instructions on how to build the jar from Eclipse so that it runs on Hadoop?
I am aware of the process of building a jar file from Eclipse; however, I am not sure whether I have to take any special care when building the jar so that it runs on Hadoop.
When you submit the command, make certain of the following things on the command line:
When you indicate the jar, make certain you are pointing to it properly. It may be easiest to be certain by using the absolute path: navigate to the place where the jar is, then run the 'readlink -f' command on it to get the absolute path. So for you, not just hist.jar, but maybe /home/akash_user/jars/hist.jar or wherever it is on your system. If you are using Eclipse, it may be saving the jar somewhere unexpected, so make sure that is not the problem. The jar cannot be run from HDFS storage; it must run from local storage.
When you name your main class (in your example, Histogram), you must use the fully qualified name of the class, including its package. So, usually, if the program/project is named Histogram, and there is a HistogramDriver, HistogramMapper, and HistogramReducer, with your main() in HistogramDriver, you need to type Histogram.HistogramDriver to get the program running. (Unless you made your jar runnable, which requires extra setup at build time, such as a manifest entry.)
Make sure that the jar you are submitting (hist.jar) is in the current directory from which you are submitting the 'hadoop jar' command.
If the issue still persists, please tell us the Java, Hadoop, and Linux versions you are using.
You should not keep the jar file in HDFS when executing the MapReduce job. Make sure the jar is available on the local path. The input path and output directory should be paths in HDFS.
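Putting those points together, a submission would look something like this (the HDFS input and output paths are illustrative):

hadoop jar /home/akash_user/jars/hist.jar Histogram.HistogramDriver /user/akash_user/input /user/akash_user/output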

How do I access a UDF DLL in Firebird Embedded?

I tried building a UDF for Firebird. I was able to compile the DLL and register the UDF with the database, but I can't actually run it. Every time, I get an error:
invalid request BLR at offset 63.
function [FUNCTION_NAME] is not defined.
module name or entrypoint could not be found.
I've tried dropping the UDF DLL in the same folder as the application, and in the same folder as the database, but either way it never seems to load it.
When I tried Googling for help, all the results I got back seemed to either deal with making it work on an FB server by putting it in the server's UDF folder (which doesn't apply, as I'm using FB Embedded) or with fixing permission issues on an FB server by editing the conf file (which doesn't apply, as I'm using FB Embedded).
So, how do I determine/configure the correct place to put the UDF DLL if I'm using FB Embedded?
I think that by default Firebird expects the UDF DLLs to be in a subdirectory named udf, relative to the fbembed.dll file.
You can configure the UDF paths via the firebird.conf file, using the UdfAccess parameter. The conf file that comes with the installation has an explanation of how to use the parameter. The README_embedded.txt file also contains a good explanation of how the embedded server files should be placed.
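For reference, the relevant entry in firebird.conf looks something like this; Restrict limits UDF loading to the listed directories, and the UDF value shown is, as far as I recall, the shipped default (relative to the Firebird root):

UdfAccess = Restrict UDF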