Scala get file path of file in resources folder - scala

I am using the Stanford CRFClassifier, which requires a trained classifier model file in order to run. I have put this file in the resources directory. According to the Javadocs for the CRFClassifier (http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ie/crf/CRFClassifier.html#getClassifier(java.lang.String)), the path to the file must be passed to CRFClassifier.getClassifier() as a java.lang.String. So my question is: how do I tell .getClassifier() that the file is in the resources directory? That is, how do I get the file path of a file in the resources directory?
I have tried simply
val classifier = CRFClassifier.getClassifier("./src/main/resources/my_model.ser.gz")
But this throws a FileNotFoundException.
I have also tried
Source.fromURL(getClass.getResource("/my_model.ser.gz"))
which returns a BufferedSource object, but I do not know how to get a file path from this.
Any help would be greatly appreciated.

I managed to get the file path by doing the following:
val url=getClass.getResource("/my_model.ser.gz")
val classifier = CRFClassifier.getClassifier(url.getPath())
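One caveat with url.getPath(): it only yields a usable path while the resource is a plain file on disk, for example when running from sbt or an IDE. Once the project is packaged into a jar, the URL points inside the archive and there is no file path to hand over. A minimal sketch of one possible workaround follows, written in Java because it only uses JDK calls that can be invoked unchanged from Scala: copy the resource to a temporary file and pass that path on. The class name ResourceExtractor and its method names are made up for illustration and are not part of any library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of CoreNLP or the JDK.
class ResourceExtractor {

    // Copies the stream into a fresh temp file and returns its absolute path.
    // The temp file is scheduled for deletion when the JVM exits.
    static String copyToTempFile(InputStream in, String suffix) {
        try {
            Path tmp = Files.createTempFile("resource-", suffix);
            tmp.toFile().deleteOnExit();
            // createTempFile already created the file, so overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp.toAbsolutePath().toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Looks the resource up relative to the given class. A leading "/" in
    // resourceName means "from the classpath root". Works both for files
    // on disk and for entries inside a jar.
    static String extract(Class<?> anchor, String resourceName, String suffix) {
        try (InputStream in = anchor.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IllegalArgumentException("Resource not found: " + resourceName);
            }
            return copyToTempFile(in, suffix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

From Scala this could be used as ResourceExtractor.extract(getClass, "/my_model.ser.gz", ".ser.gz"), with the result passed to CRFClassifier.getClassifier. Keeping the .gz suffix on the temp file is probably wise, on the assumption that the loader decides whether to decompress based on the file name.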

Related

How to load a spark-nlp pre-trained model from disk

From the spark-nlp GitHub page I downloaded a .zip file containing a pre-trained NerCrfModel. The zip contains three folders: embeddings, fields, and metadata.
How do I load that into a Scala NerCrfModel so that I can use it? Do I have to drop it into HDFS or the host where I launch my Spark Shell? How do I reference it?
You just need to provide the path to the directory that contains the folders you mentioned:
import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfModel
val path = "path/to/unzipped/file/folder"
val model = NerCrfModel.read.load(path)
// use your model
model.setInputCols(someCol)
model.transform(yourData) // yourData must contain 'someCol'
As far as I remember, you can place the folder on the local FS or on a distributed FS. Hope this helps other users as well!
best,
Alberto.

Spark - Get from a directory with nested folders all filenames of a particular data type

I have a directory with some subfolders which contain different parquet files. Something like this:
2017-09-05
    10-00
        part00000.parquet
        part00001.parquet
    11-00
        part00000.parquet
        part00001.parquet
    12-00
        part00000.parquet
        part00001.parquet
What I want is to pass the path to the directory 2017-09-05 and get back a list of the names of all parquet files in it.
I was able to achieve it, but in a very inefficient way:
val allParquetFiles = sc.wholeTextFiles("C:/MyDocs/2017-09-05/*/*.parquet")
allParquetFiles.keys.foreach((k) => println("The path to the file is: "+k))
So each key is the name I am looking for, but this process requires me to load all files as well, which then I can't use, since I get them in binary (and I don't know how to convert them into a dataframe).
Once I have the keys (so the list of filePaths) I am planning to invoke:
val myParquetDF = sqlContext.read.parquet(filePath);
As you may have gathered, I am quite new to Spark. If there is a faster or easier approach to read a list of parquet files located in different folders, please let me know.
My Partial Solution: I wasn't able to get all paths for all filenames in a folder, but I was able to get the content of all files of that type into the same dataframe. Which was my ultimate goal. In case someone may need it in the future, I used the following line:
val df = sqlContext.read.parquet("C:/MyDocs/2017-09-05/*/*.parquet")
Thanks for your time
You can do it using the HDFS API like this. Note that listStatus does not expand wildcards, so globStatus is used for the pattern:
import org.apache.hadoop.fs._
import org.apache.hadoop.conf._
val fs = FileSystem.get(new Configuration())
val files = ( fs.globStatus(new Path("C:/MyDocs/2017-09-05/*/*.parquet")) ).map(_.getPath.toString)
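If the files live on the local filesystem, as the C:/ paths in the question suggest, the listing does not strictly need Spark or Hadoop. The following is a minimal sketch of that alternative using only java.nio (Java, callable from Scala). It is a different technique from the HDFS API above, it will not work for files stored in HDFS, and the class name ParquetLister is hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for local directories only.
class ParquetLister {

    // Walks the directory tree under root and returns the absolute paths
    // of all regular files whose name ends in ".parquet", sorted.
    static List<String> listParquetFiles(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(".parquet"))
                .map(p -> p.toAbsolutePath().toString())
                .sorted()
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Only file names are inspected here, so nothing is loaded into memory. The resulting paths could then be handed to the parquet reader, which, as far as I know, accepts several paths in one call.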
First, it is better to avoid using wholeTextFiles. This method reads each whole file at once. Try to use the textFile method instead.
Second, if you need to get all files recursively in one directory, you can achieve it with the textFile method after setting:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
This configuration will enable recursive search (works for spark jobs as for mapreduce jobs). And then just invoke sc.textFile(path).

Private assets in Play 2.1

In a Play 2.1 application, where is the proper place to store private assets?
By "private asset", I mean a data file that is used by the application but not accessible to the user.
For example, if I have a text file (Foo.json) that contains sample data that is parsed every time the application starts, what would be the proper directory in the project to store it?
Foo.json needs to be included in the deployment, and needs to be uniformly accessible from the code in both development and production.
Some options:
Usually such files go in the conf folder, e.g. conf/privatefiles/Foo.json
If they change often, you can consider adding to your application.conf the full path to an external folder somewhere in the filesystem. In that case you'll be able to edit the content easily without redeploying the app: /home/scrapdog/privatefiles/Foo.json
You can store them in database as well, benefits are the same as in previous option - easy editing.
In all cases consider using memory cache to avoid reading it from filesystem/database every time when required.
I simply use a folder called data at the application root. You can use the name you want or better, store the actual name in the configuration file.
To resolve its path, I use the following snippet:
lazy val rootPath = {
import play.api.Play.current
play.api.Play.application.path.getPath
}
lazy val dataPath = rootPath + "/data/"
You can do what I did; I got the answer from @Marius Soutier here. Please upvote his answer there if you like it:
You can put "internal" documents in the conf folder, it's the equivalent to resources in standard sbt projects.
Basically create a dir under conf called json and to access it, you'd use Play.resourceAsStream(). Note that this gives you a java.io.InputStream because your file will be part of the JAR created by activator dist.
My example is using it in a view but you can modify it as you want.
Play.resourceAsStream("json/Foo.json") map { inputStream =>
Ok(views.html.xxx(XXX.do_something_with_stream(inputStream)))
} getOrElse (InternalServerError)
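You can also use Play.resource(), which gives you a java.net.URL. Calling getFile() on it returns the path portion of the URL as a String (not a java.io.File), and that path only points at a real file when the resource is not packaged inside a JAR.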
Play.resource("json/Foo.json") map { fileURL =>
Ok(views.html.xxx(XXX.do_something_with_file(fileURL.getFile())))
} getOrElse (InternalServerError)
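Whether a resource URL can be turned into a file at all depends on its protocol: "file" while the resource sits on disk during development, "jar" once the application has been packaged. A small sketch of how that could be checked before relying on a path, using only the JDK (Java, callable from Scala); ResourceLocation is a hypothetical name, not a Play API.

```java
import java.io.File;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.Optional;

// Hypothetical helper, not part of Play or the JDK.
class ResourceLocation {

    // True only when the URL points at a plain file on disk.
    static boolean isPlainFile(URL url) {
        return url != null && "file".equals(url.getProtocol());
    }

    // Returns the resource as a File when that is possible, empty otherwise
    // (for example for a jar: URL). Going through toURI() decodes escapes
    // such as %20, which url.getPath() and url.getFile() leave in place.
    static Optional<File> asFile(URL url) {
        if (!isPlainFile(url)) {
            return Optional.empty();
        }
        try {
            return Optional.of(new File(url.toURI()));
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
    }
}
```

When asFile comes back empty, falling back to the stream-based variant is the safer route, since a stream can be opened in both cases.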

Exporting a JAR file in Eclipse and referencing a file

I have a project with an image stored as a logo that I wish to use.
URL logoPath = new MainApplication().getClass().getClassLoader().getResource("img/logo.jpg");
Using that method I get the URL for the file and convert it to a string. I then have to drop the first 5 characters to get rid of the "file:" prefix in this output: "file:/C:/Users/Stephen/git/ILLA/PoC/bin/img/logo.jpg"
However, when I export this as a JAR and run it, I run into trouble. The URL now reads /ILLA.jar!/ and my image is just blank. I have a gut feeling that this is what's tripping me up, so how do I fix it?
Cheers
You are almost there.
Images in a jar are treated as resources. You need to refer to them using the classpath.
Just use getClass().getResource, something like:
getClass().getResource("/images/logo.jpg");
where "images" is a package inside the jar file, with the path as above.
Note the leading / in the call: it makes the path absolute (resolved from the classpath root) rather than relative to the class's package. Just make sure the path is correct.
Also see:
How to includes all images in jar file using eclipse
See here: Create a file object from a resource path to an image in a jar file
String imgName = "/resources/images/image.jpg";
InputStream in = getClass().getResourceAsStream(imgName);
ImageIcon img = new ImageIcon(ImageIO.read(in));
Note it looks like you need to use a stream for a resource inside an archive.
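The stream-based approach can be sketched a little more defensively as below. This is only an illustration under assumptions: LogoLoader and its methods are hypothetical names, and the null checks are there because getResourceAsStream returns null for a missing resource and ImageIO.read returns null for data it cannot decode.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper for loading an image that may live inside a jar.
class LogoLoader {

    // Decodes an image from the stream, failing loudly if no decoder matches.
    static BufferedImage readImage(InputStream in) {
        try {
            BufferedImage image = ImageIO.read(in);
            if (image == null) {
                throw new IllegalArgumentException("Stream is not a readable image");
            }
            return image;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Opens the resource relative to the given class (a leading "/" means
    // classpath root), decodes it, and closes the stream afterwards.
    static BufferedImage load(Class<?> anchor, String resourceName) {
        try (InputStream in = anchor.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IllegalArgumentException("Resource not found: " + resourceName);
            }
            return readImage(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the layout from the question, something like new ImageIcon(LogoLoader.load(MainApplication.class, "/img/logo.jpg")) should behave the same way from the bin folder and from the exported jar, because no file path is involved at any point.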

ResourceException when creating IMarker on IFile (linked resource)

I have some problems updating an "old" Eclipse plugin. Here is what I would like to do and what the original plugin did:
(parse compiler output on console with file name and error information --> still works)
--> set link to the location within the file
--> set marker to location in the file
What I did in the past was to get the IFile from the path String of the file and generated link and marker from it:
IFile ifile;
IWorkspace workspace = ResourcesPlugin.getWorkspace();
IPath path = new Path(fileName);
IFile[] files = workspace.getRoot().findFilesForLocation(path);
...
ifile = files[0];
Map attributes = new HashMap();
attributes.put(IMarker.SEVERITY, new Integer (severity));
MarkerUtilities.setLineNumber(attributes, lineNumber);
MarkerUtilities.setMessage(attributes, message);
MarkerUtilities.createMarker(ifile, attributes,
IMarker
Since findFilesForLocation is deprecated, I tried to find another way, but without any success. Using the changed code to get the IFile always results in an exception: org.eclipse.core.internal.resources.ResourceException: Resource '/path/to/file.c' does not exist.
Is it possible that this relates to the fact that the source file is only linked into the project, and not physically within the project?
IWorkspace workspace = ResourcesPlugin.getWorkspace();
IPath location = new Path(fileName);
IFile ifile = workspace.getRoot().getFile(location);
Can anyone help?
Thank you!
I am guessing that fileName is the fully qualified path to the file you want to get. I'm also guessing that the file that you are looking for is already in the workspace, even if it is linked (if not, then this won't work. You will first need to add the file to a project before getting the IFile for it).
You need to do something like this:
IFile[] files = workspace.getRoot().findFilesForLocationURI(new java.io.File(fileName).toURI());
Then this will find all files in the workspace that correspond to the file in the file system.
The reason why your attempt above is giving you a ResourceException is that you are trying to pass in a file system path to get an IFile object from the workspace. The Eclipse workspace is an abstraction over the underlying filesystem and cannot directly work with absolute paths.
For the Resources APIs, Paths usually means a path in the workspace and Location usually refers to a place in the filesystem or outside the workspace. If you already have a workspace path to start with, just ask the IWorkspaceRoot for the IFile and get on with what you're doing.