Simplest method for text lemmatization in Scala and Spark

I want to use lemmatization on a text file:
surprise heard thump opened door small seedy man clasping package wrapped.
upgrading system found review spring 2008 issue moody audio backed.
omg left gotta wrap review order asap . understand hand delivered dali lama
speak hands wear earplugs lives . listen maintain link long .
cables cables finally able hear gem long rumored music .
...
and the expected output is:
surprise heard thump open door small seed man clasp package wrap.
upgrade system found review spring 2008 issue mood audio back.
omg left gotta wrap review order asap . understand hand deliver dali lama
speak hand wear earplug live . listen maintain link long .
cable cable final able hear gem long rumor music .
...
Can anybody help me? What is the simplest method for lemmatization that has been implemented in Scala and Spark?

There is a function from the book Advanced Analytics with Spark, in the chapter about lemmatization:
import java.util.Properties
import scala.collection.mutable.ArrayBuffer
import edu.stanford.nlp.pipeline._
import edu.stanford.nlp.ling.CoreAnnotations._
import scala.collection.JavaConversions._

val plainText = sc.parallelize(List("Sentence to be processed."))
val stopWords = Set("stopWord")

def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
  // build a CoreNLP pipeline with tokenizer, sentence splitter, POS tagger and lemmatizer
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    // keep lemmas longer than two characters that are not stop words
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
lemmatized.foreach(println)
Now just apply this to every line in a map:
val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
EDIT:
I added the line
import scala.collection.JavaConversions._
to the code. This is needed because otherwise sentences is a Java List rather than a Scala collection. The code should now compile without problems.
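As a side note, scala.collection.JavaConversions is deprecated in newer Scala versions in favour of scala.collection.JavaConverters; a rough equivalent of the loop above with explicit .asScala calls would be:
import scala.collection.JavaConverters._

for (sentence <- doc.get(classOf[SentencesAnnotation]).asScala;
     token <- sentence.get(classOf[TokensAnnotation]).asScala) {
  val lemma = token.get(classOf[LemmaAnnotation])
  if (lemma.length > 2 && !stopWords.contains(lemma)) {
    lemmas += lemma.toLowerCase
  }
}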
I used Scala 2.10.4 and the following stanford.nlp dependencies:
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
</dependency>
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
  <classifier>models</classifier>
</dependency>
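If you build with sbt instead of Maven, the equivalent dependencies would look roughly like this (assuming the same 3.5.2 version):
// sbt equivalent of the Maven dependencies above
libraryDependencies ++= Seq(
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2" classifier "models"
)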
You can also look at the stanford.nlp page; there are a lot of examples (in Java): http://nlp.stanford.edu/software/corenlp.shtml.
EDIT:
mapPartitions version (although I don't know whether it will speed up the job significantly):
def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.mapPartitions(p => {
  // create the (expensive) pipeline once per partition instead of once per record
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  p.map(q => plainTextToLemmas(q, stopWords, pipeline))
})
lemmatized.foreach(println)
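If you want output that looks like the expected output in the question (one lemmatized line of text per input line), you can join each record's lemmas back into a single string before writing it out; a small sketch, with the output path being a placeholder:
// join each line's lemmas with spaces and write plain text back out
lemmatized
  .map(_.mkString(" "))
  .saveAsTextFile("/tmp/lemmatized-output") // placeholder path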

I think #user52045 has the right idea. The only modification I would make would be to use mapPartitions instead of map -- this allows you to only do the potentially expensive pipeline creation once per partition. This may not be a huge hit on a lemmatization pipeline, but it will be extremely important if you want to do something that requires a model, like the NER portion of the pipeline.
def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.mapPartitions(strings => {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  strings.map(string => plainTextToLemmas(string, stopWords, pipeline))
})
lemmatized.foreach(println)
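If the stop-word set is large, you could additionally broadcast it so it is shipped to each executor only once instead of being captured in every task closure; a minimal sketch reusing the same names as above:
val stopWordsBC = sc.broadcast(stopWords)

val lemmatized = plainText.mapPartitions(strings => {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  strings.map(string => plainTextToLemmas(string, stopWordsBC.value, pipeline))
})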

I would suggest using the Stanford CoreNLP wrapper for Apache Spark, as it provides an official API for the basic CoreNLP functions such as lemmatization, tokenization, etc.
I have used it for lemmatization on a Spark DataFrame.
Link: https://github.com/databricks/spark-corenlp
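Based on that project's README, lemmatizing a DataFrame column would look roughly like this (the DataFrame df and the column name text are placeholders):
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._ // from databricks/spark-corenlp

// assuming df has a string column named "text"
val lemmatized = df.select(lemma(col("text")).as("lemmas"))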

Related

Stanford CoreNLP Options in Scala

Hello, I am trying to update the following options in Stanford CoreNLP:
ssplit.newlineIsSentenceBreak
https://stanfordnlp.github.io/CoreNLP/ssplit.html
ner.applyFineGrained
https://stanfordnlp.github.io/CoreNLP/ner.html
I am running Spark in Scala with the following versions:
Spark: 2.3.0
Scala: 2.11.8
Java: 8 (1.8.0_73)
spark-corenlp: 0.3.1
stanford-corenlp: 3.9.1
I have found what I believe is the place where the newlineIsSentenceBreak option is defined, but when I try to implement it I keep getting error messages.
https://nlp.stanford.edu/nlp/javadoc/javanlp-3.9.1/edu/stanford/nlp/process/WordToSentenceProcessor.html
Here is a working code snippet:
import edu.stanford.nlp.process.WordToSentenceProcessor
WordToSentenceProcessor.NewlineIsSentenceBreak.values
WordToSentenceProcessor.NewlineIsSentenceBreak.valueOf("ALWAYS")
But when I try to set the option I get an error. Specifically, I am trying to implement something similar to:
WordToSentenceProcessor.NewlineIsSentenceBreak.stringToNewlineIsSentenceBreak("ALWAYS")
but I get this error:
error: value stringToNewlineIsSentenceBreak is not a member of object edu.stanford.nlp.process.WordToSentenceProcessor.NewlineIsSentenceBreak
Any help is appreciated!
Thank you stackoverflow for being my rubber duck! https://en.wikipedia.org/wiki/Rubber_duck_debugging
To set the parameters in Scala (not using the Spark wrapper functions), you can assign them to the properties of the pipeline object like this:
val props: Properties = new Properties()
props.put("annotators", "tokenize,ssplit,pos,lemma,ner")
props.put("ssplit.newlineIsSentenceBreak", "always")
props.put("ner.applyFineGrained", "false")
before creating the Stanford CoreNLP pipeline:
val pipeline: StanfordCoreNLP = new StanfordCoreNLP(props)
Because the Spark wrapper functions use the Simple CoreNLP implementation, I don't think I can modify them; please post an answer if you know how to do that!
Here is a full example:
import java.util.Properties

import edu.stanford.nlp.ling.CoreAnnotations.{SentencesAnnotation, TextAnnotation, TokensAnnotation}
import edu.stanford.nlp.ling.CoreLabel
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.util.CoreMap

import scala.collection.JavaConverters._

val props: Properties = new Properties()
props.put("annotators", "tokenize,ssplit,pos,lemma,ner")
props.put("ssplit.newlineIsSentenceBreak", "always")
props.put("ner.applyFineGrained", "false")

val pipeline: StanfordCoreNLP = new StanfordCoreNLP(props)

val text = "Quick brown fox jumps over the lazy dog. This is Harshal, he lives in Chicago. I added \nthis sentence"

// create a blank annotation with the given text
val document: Annotation = new Annotation(text)

// run all annotators on this text
pipeline.annotate(document)

val sentences: List[CoreMap] = document.get(classOf[SentencesAnnotation]).asScala.toList

(for {
  sentence: CoreMap <- sentences
  token: CoreLabel <- sentence.get(classOf[TokensAnnotation]).asScala.toList
  lemma: String = token.lemma()
  ner = token.ner()
} yield (sentence, lemma, ner)).foreach(t =>
  println("sentence: " + t._1 + " | lemma: " + t._2 + " | ner: " + t._3))

Scala: Convert Seq[String] to String? (TF-IDF after lemmatization)

I am trying to learn Scala and specifically text mining (lemmatization, TF-IDF matrix and LSA).
I have some texts I want to lemmatize and then classify (LSA). I use Spark on Cloudera.
So I used the Stanford CoreNLP function:
def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}
After that, I try to build a TF-IDF matrix, but here is my problem: the Stanford function produces an RDD of Seq[String], and I get an error because I need an RDD of String (not of Seq[String]):
val (termDocMatrix, termIds, docIds, idfs) = termDocumentMatrix(lemmatizedText, stopWords, numTerms, sc)
Does anyone know how to convert a Seq[String] to a String? Or do I need to change one of my calls?
Thanks for the help. Sorry if it's a dumb question and for my English.
Bye
I am not sure what this lemmatization thingy is, but as far as making a string out of a sequence, you can just do seq.mkString("\n") (or replace "\n" with whatever other separator you want), or just seq.mkString if you want it merged without any separator.
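In the OP's case, that means mapping over the RDD so each document's Seq[String] of lemmas becomes one String; a sketch, assuming lemmatizedText is the RDD[Seq[String]] produced by the lemmatization step:
import org.apache.spark.rdd.RDD

// one space-separated string of lemmas per document
val lemmatizedDocs: RDD[String] = lemmatizedText.map(_.mkString(" "))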
Also, don't use mutable structures; it's bad taste in Scala:
// requires scala.collection.JavaConversions._ in scope, as in the question's code
val lemmas = sentences
  .flatMap(_.get(classOf[TokensAnnotation])) // flatten the tokens of all sentences
  .map(_.get(classOf[LemmaAnnotation]))
  .filter(_.length > 2)
  .filterNot(stopWords)
  .mkString(" ") // join the lemmas with spaces

Stemming and Lemmatisation using Stanford NLP library

I am using the Stanford NLP library for stemming and lemmatisation. I followed the example in their documentation:
def plainTextToLemmas(text: String, stopWords: Set[String]): List[String] = {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  // empty annotation with the given text
  val doc = new Annotation(text)
  // run annotators on the text
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas.toList
}
val x = sentence.map(plainTextToLemmas(_, stopWords))
However, it does not lemmatize sentences without a space after the full stop very well. Is there an option to fix this? Also, is there an option to filter out HTML tags? Adding them to the stop words is not working.

Stemming and Lemmatization in Spark and Scala

I have used the Stanford NLP library to perform stemming and lemmatization on a sentence.
For example: "Car is an easy way for commute. But there are too many cars on roads these days."
So the expected output is:
car be easy way commute car road day
But I am getting this:
ArrayBuffer(car, easy, way, for, commute, but, there, too, many, car, road, these, day)
Here is the code:
val stopWords = sc.broadcast(
  scala.io.Source.fromFile("src/main/common-english-words.txt").getLines().toSet).value

def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = stringRDD.map(plainTextToLemmas(_, stopWords))
lemmatized.foreach(println)
lemmatized.foreach(println)
I have taken it from the Advanced Analytics with Spark book. It seems like the stop words are not removed, and "is" is not converted to "be". Can we add or delete rules in these libraries?
http://www.textfixer.com/resources/common-english-words.txt

How to close enumerated file?

Say, in an action I have:
val linesEnu = {
  val is = new java.io.FileInputStream(path)
  val isr = new java.io.InputStreamReader(is, "UTF-8")
  val br = new java.io.BufferedReader(isr)
  import scala.collection.JavaConversions._
  val rows: scala.collection.Iterator[String] = br.lines.iterator
  Enumerator.enumerate(rows)
}
Ok.feed(linesEnu).as(HTML)
How to close readers/streams?
There is an onDoneEnumerating callback that functions like finally (it will always be called, whether or not the Enumerator fails). You can close the streams there.
val linesEnu = {
  val is = new java.io.FileInputStream(path)
  val isr = new java.io.InputStreamReader(is, "UTF-8")
  val br = new java.io.BufferedReader(isr)
  import scala.collection.JavaConversions._
  val rows: scala.collection.Iterator[String] = br.lines.iterator
  Enumerator.enumerate(rows).onDoneEnumerating {
    is.close()
    // ... anything else you want to execute when the Enumerator finishes
  }
}
The IO tools provided by Enumerator give you this kind of resource management out of the box—e.g. if you create an enumerator with fromStream, the stream is guaranteed to get closed after running (even if you only read a single line, etc.).
So for example you could write the following:
import play.api.libs.iteratee._

val splitByNl = Enumeratee.grouped(
  Traversable.splitOnceAt[Array[Byte], Byte](_ != '\n'.toByte) &>>
    Iteratee.consume()
) compose Enumeratee.map(new String(_, "UTF-8"))

def fileLines(path: String): Enumerator[String] =
  Enumerator.fromStream(new java.io.FileInputStream(path)).through(splitByNl)
It's a shame that the library doesn't provide a linesFromStream out of the box, but I personally would still prefer to use fromStream with hand-rolled splitting, etc. over using an iterator and providing my own resource management.
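For completeness, wired into the original action this would look something like:
Ok.feed(fileLines(path)).as(HTML)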