How to load a spark-nlp pre-trained model from disk - scala

From the spark-nlp GitHub page I downloaded a .zip file containing a pre-trained NerCRFModel. The zip contains three folders: embeddings, fields, and metadata.
How do I load that into a Scala NerCrfModel so that I can use it? Do I have to drop it into HDFS or the host where I launch my Spark Shell? How do I reference it?

You just need to provide the path to the folder that contains the directories you mentioned:
import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfModel
val path = "path/to/unziped/file/folder"
val model = NerCrfModel.read.load(path)
// use your model
model.setInputCols(someCol)
model.transform(yourData) // yourData must contain 'someCol'
As far as I remember, you can place the folder in either the local FS or a distributed FS. Hope this helps other users as well!
best,
Alberto.
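For reference, here is a minimal sketch of how the loaded model could slot into a wider pipeline; the column names and the preprocessedData DataFrame are assumptions for illustration, not part of the original answer:
import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfModel
// assumption: upstream annotators have already produced these columns in preprocessedData
val ner = NerCrfModel.read.load("path/to/unzipped/file/folder")
  .setInputCols(Array("sentence", "token", "pos", "word_embeddings"))
  .setOutputCol("ner")
val annotated = ner.transform(preprocessedData) // preprocessedData is a placeholder DataFrame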

Related

Searching all file names recursively in hdfs using Spark

I’ve been looking for a while now for a way to get all filenames in a directory and its sub-directories in Hadoop file system (hdfs).
I found out I can use these commands to get it:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
sc.wholeTextFiles(path).map(_._1)
Here is "wholeTextFiles" documentation:
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Parameters:
path - Directory to the input data files, the path can be comma separated paths as the list of inputs.
minPartitions - A suggestion value of the minimal splitting number for input data.
Returns:
RDD representing tuples of file path and the corresponding file content
Note: Small files are preferred, large file is also allowable, but may cause bad performance. On some filesystems, .../path/* can be a more efficient way to read all files in a directory rather than .../path/ or .../path. Partitioning is determined by data locality. This may result in too few partitions by default.
As you can see "wholeTextFiles" returns a pair RDD with both the filenames and their content. So I tried mapping it and taking only the file names, but I suspect it still reads the files.
The reason I suspect so: if I try to count (for example), I get the Spark equivalent of "out of memory" (losing executors and not being able to complete the tasks).
I would rather use Spark to achieve this goal in the fastest way possible; however, if there are other ways with reasonable performance, I would be happy to give them a try.
EDIT:
To be clear: I want to do this using Spark. I know I can do it with HDFS commands and the like; I would like to know how to do it with the tools Spark provides, and perhaps get an explanation of how to keep "wholeTextFiles" from reading the text itself (similar to how transformations only happen after an action, and some of the "commands" never really execute).
Thank you very much!
This is a way to list all files down to the deepest subdirectory, without using wholeTextFiles. It is a recursive call through the full depth of subdirectories:
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}
import org.apache.spark.SparkContext
import scala.collection.mutable.ListBuffer

val lb = new ListBuffer[String] // variable to hold the final list of files

def getAllFiles(path: String, sc: SparkContext): ListBuffer[String] = {
  val conf = sc.hadoopConfiguration
  val fs = FileSystem.get(conf)
  val files: RemoteIterator[LocatedFileStatus] = fs.listLocatedStatus(new Path(path))
  while (files.hasNext) {
    val status = files.next
    val filepath = status.getPath.toString
    lb += filepath
    if (status.isDirectory) getAllFiles(filepath, sc) // recursive call into subdirectories
  }
  lb
}
That's it. It was tested successfully; you can use it as is.
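For example, to collect just the paths without touching the file contents (the base directory is a placeholder):
val allFiles = getAllFiles("hdfs:///some/base/dir", sc)
println(s"found ${allFiles.size} paths")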

Spark - Get from a directory with nested folders all filenames of a particular data type

I have a directory with some subfolders which contain different parquet files. Something like this:
2017-09-05
  10-00
    part00000.parquet
    part00001.parquet
  11-00
    part00000.parquet
    part00001.parquet
  12-00
    part00000.parquet
    part00001.parquet
What I want is, by passing the path to the 05-09 directory, to get a list of the names of all the parquet files.
I was able to achieve it, but in a very inefficient way:
val allParquetFiles = sc.wholeTextFiles("C:/MyDocs/2017-09-05/*/*.parquet")
allParquetFiles.keys.foreach((k) => println("The path to the file is: "+k))
So each key is the name I am looking for, but this approach also requires me to load all the files, which I then can't use, since I get them in binary (and I don't know how to convert them into a dataframe).
Once I have the keys (i.e. the list of file paths) I am planning to invoke:
val myParquetDF = sqlContext.read.parquet(filePath);
As you may have already understood, I am quite new to Spark. So if there is a faster or easier approach to read a list of parquet files located in different folders, please let me know.
My partial solution: I wasn't able to get all the paths of the filenames in a folder, but I was able to get the contents of all files of that type into the same dataframe, which was my ultimate goal. In case someone needs it in the future, I used the following line:
val df = sqlContext.read.parquet("C:/MyDocs/2017-05-09/*/*.parquet")
Thanks for your time
You can do it using the HDFS API like this:
import org.apache.hadoop.fs._
import org.apache.hadoop.conf._

val fs = FileSystem.get(new Configuration())
// globStatus expands the wildcard pattern (listStatus does not)
val files = fs.globStatus(new Path("C:/MyDocs/2017-09-05/*/*.parquet")).map(_.getPath.toString)
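If the goal is then to read those files, the collected paths can be passed straight to the parquet reader; a small sketch, assuming sqlContext is in scope:
val myParquetDF = sqlContext.read.parquet(files: _*)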
First, it is better to avoid using wholeTextFiles: that method reads each whole file at once. Try the textFile method instead.
Second, if you need to get all files recursively in one directory, you can achieve it with the textFile method:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
This configuration enables recursive search (it works for Spark jobs as well as for MapReduce jobs). Then just invoke sc.textFile(path).
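Putting the two pieces together (the path below is a placeholder):
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
val lines = sc.textFile("hdfs:///some/base/dir") // picks up files in all subdirectories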

Reading files from Apache Spark textFileStream

I'm trying to read/monitor txt files from a Hadoop file system directory. But I've noticed that all the txt files inside this directory are directories themselves, as shown in the example below:
/crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt/_SUCCESS
/crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt/part-00000
/crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt/part-00001
I want to read all the data inside the part files. I'm trying to use the following code, as shown in this snippet:
val testData = ssc.textFileStream("/crawlerOutput/*/*")
But unfortunately it says /crawlerOutput/*/* doesn't exist. Doesn't textFileStream accept wildcards? What should I do to solve this problem?
The textFileStream() is just a wrapper for fileStream() and does not support subdirectories (see https://spark.apache.org/docs/1.3.0/streaming-programming-guide.html).
You would need to list the specific directories to monitor. If you need to detect new directories, a StreamingListener could be used to check for them, then stop the streaming context and restart it with the new values.
Just thinking out loud: if you intend to process each subdirectory once and just want to detect these new directories, you could potentially key off another location that contains job info or a file token; once that is present, it could be consumed in the streaming context and the appropriate textFile() called to ingest the new path.
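A minimal sketch of monitoring a fixed set of known directories and combining them into one stream (the directory names are placeholders):
val dirs = Seq("/crawlerOutput/dirA.txt", "/crawlerOutput/dirB.txt")
val testData = dirs.map(ssc.textFileStream).reduce(_ union _)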

Scala get file path of file in resources folder

I am using the Stanford CRFClassifier and in order to run, it requires a file that is the trained classifier model. I have put this file in the resources directory. From the Javadocs for the CRFClassifier http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ie/crf/CRFClassifier.html#getClassifier(java.lang.String)
the path to the file must be an input to CRFClassifier.getClassifier() and it is a java.lang.String object. So my question is how do I tell .getClassifier() that the file is in the resources directory? i.e. how do I get the file path of a file in the resources directory?
I have tried simply
val classifier = CRFClassifier.getClassifier("./src/main/resources/my_model.ser.gz")
But this throws a FileNotFoundException.
I have also tried
Source.fromURL(getClass.getResource("/my_model.ser.gz"))
which returns a BufferedSource object, but I do not know how to get a file path from this.
Any help would be greatly appreciated.
I managed to get the file path by doing the following:
val url=getClass.getResource("/my_model.ser.gz")
val classifier = CRFClassifier.getClassifier(url.getPath())

Private assets in Play 2.1

In a Play 2.1 application, where is the proper place to store private assets?
By "private asset", I mean a data file that is used by the application but not accessible to the user.
For example, if I have a text file (Foo.json) that contains sample data that is parsed every time the application starts, what would be the proper directory in the project to store it?
Foo.json needs to be included in the deployment, and needs to be uniformly accessible from the code in both development and production.
Some options:
Usually such files go to the conf folder, e.g. conf/privatefiles/Foo.json
If they are subject to frequent change, you can consider adding to your application.conf a full path to an external folder somewhere in the filesystem; in that case you'll be able to edit the content easily without redeploying the app: /home/scrapdog/privatefiles/Foo.json (see the sketch after this list)
You can store them in a database as well; the benefit is the same as with the previous option: easy editing.
In all cases, consider using a memory cache to avoid reading from the filesystem/database every time the data is required.
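A minimal sketch of the application.conf approach; the key name privatefiles.dir is made up for illustration:
// in conf/application.conf:
//   privatefiles.dir = "/home/scrapdog/privatefiles"
import play.api.Play.current
val dir = play.api.Play.configuration.getString("privatefiles.dir").getOrElse("conf/privatefiles")
val json = scala.io.Source.fromFile(s"$dir/Foo.json").mkString // read the private asset at startup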
I simply use a folder called data at the application root. You can use whatever name you want or, better, store the actual name in the configuration file.
To resolve its path, I use the following snippet:
lazy val rootPath = {
  import play.api.Play.current
  play.api.Play.application.path.getPath
}
lazy val dataPath = rootPath + "/data/"
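For example, a file under that folder could then be opened like this (the file name is just an illustration):
val fooFile = new java.io.File(dataPath + "Foo.json")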
You can do what I did; I got the answer from @Marius Soutier here. Please upvote his answer there if you like it:
You can put "internal" documents in the conf folder, it's the equivalent to resources in standard sbt projects.
Basically create a dir under conf called json and to access it, you'd use Play.resourceAsStream(). Note that this gives you a java.io.InputStream because your file will be part of the JAR created by activator dist.
My example uses it in a view, but you can modify it as you want.
Play.resourceAsStream("json/Foo.json") map { inputStream =>
  Ok(views.html.xxx(XXX.do_something_with_stream(inputStream)))
} getOrElse (InternalServerError)
You can also use Play.resource(); this will give you a java.net.URL, and you can use getFile() to get the file path out of it.
Play.resource("json/Foo.json") map { fileURL =>
  Ok(views.html.xxx(XXX.do_something_with_file(fileURL.getFile())))
} getOrElse (InternalServerError)