Stanford CoreNLP Options in Scala

Hello, I am trying to update the following options in Stanford CoreNLP:
ssplit.newlineIsSentenceBreak
https://stanfordnlp.github.io/CoreNLP/ssplit.html
ner.applyFineGrained
https://stanfordnlp.github.io/CoreNLP/ner.html
I am running Spark in Scala with the following versions:
Software           Version
Spark              2.3.0
Scala              2.11.8
Java               8 (1.8.0_73)
spark-corenlp      0.3.1
stanford-corenlp   3.9.1
I have found what I believe is the place where the newlineIsSentenceBreak option is set, but when I try to implement it I keep getting error messages.
https://nlp.stanford.edu/nlp/javadoc/javanlp-3.9.1/edu/stanford/nlp/process/WordToSentenceProcessor.html
Here is a working code snippet:
import edu.stanford.nlp.process.WordToSentenceProcessor
WordToSentenceProcessor.NewlineIsSentenceBreak.values
WordToSentenceProcessor.NewlineIsSentenceBreak.valueOf("ALWAYS")
But when I try to set the option I get an error. Specifically, I am trying to implement something similar to:
WordToSentenceProcessor.NewlineIsSentenceBreak.stringToNewlineIsSentenceBreak("ALWAYS")
but I get this error:
error: value stringToNewlineIsSentenceBreak is not a member of object edu.stanford.nlp.process.WordToSentenceProcessor.NewlineIsSentenceBreak
Any help is appreciated!

Thank you stackoverflow for being my rubber duck! https://en.wikipedia.org/wiki/Rubber_duck_debugging
To set the parameters in Scala (not using the Spark wrapper functions), you can assign them to the properties of the pipeline object like this:
val props: Properties = new Properties()
props.put("annotators", "tokenize,ssplit,pos,lemma,ner")
props.put("ssplit.newlineIsSentenceBreak", "always")
props.put("ner.applyFineGrained", "false")
Do this before creating the Stanford CoreNLP pipeline:
val pipeline: StanfordCoreNLP = new StanfordCoreNLP(props)
Because the Spark wrapper functions use the Simple CoreNLP implementation, I don't think I can modify them. Please post an answer if you know how to do that! (One way to apply the configured pipeline inside Spark directly is sketched after the full example below.)
Here is a full example:
import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations.{SentencesAnnotation, TextAnnotation, TokensAnnotation}
import edu.stanford.nlp.ling.CoreLabel
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.util.CoreMap
import scala.collection.JavaConverters._
val props: Properties = new Properties()
props.put("annotators", "tokenize,ssplit,pos,lemma,ner")
props.put("ssplit.newlineIsSentenceBreak", "always")
props.put("ner.applyFineGrained", "false")
val pipeline: StanfordCoreNLP = new StanfordCoreNLP(props)
val text = "Quick brown fox jumps over the lazy dog. This is Harshal, he lives in Chicago. I added \nthis sentence"
// create a blank annotation with the text
val document: Annotation = new Annotation(text)
// run all annotators on this text
pipeline.annotate(document)
val sentences: List[CoreMap] = document.get(classOf[SentencesAnnotation]).asScala.toList

(for {
  sentence: CoreMap <- sentences
  token: CoreLabel <- sentence.get(classOf[TokensAnnotation]).asScala.toList
  word: String = token.word()
  ner = token.ner()
} yield (sentence, word, ner)) foreach (t => println("sentence: " + t._1 + " | word: " + t._2 + " | ner: " + t._3))
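If you need the configured pipeline inside Spark itself rather than through the spark-corenlp wrapper, here is a rough sketch I have not tested against these exact versions; sentencesRDD is a made-up name for an existing RDD[String]. Building one pipeline per partition keeps the ssplit.newlineIsSentenceBreak and ner.applyFineGrained settings in effect and avoids re-initializing CoreNLP for every record:
import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations.{SentencesAnnotation, TokensAnnotation}
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import scala.collection.JavaConverters._

// sentencesRDD: RDD[String] is assumed to already exist
val tagged = sentencesRDD.mapPartitions { docs =>
  // build one pipeline per partition so the custom options are applied
  val props = new Properties()
  props.put("annotators", "tokenize,ssplit,pos,lemma,ner")
  props.put("ssplit.newlineIsSentenceBreak", "always")
  props.put("ner.applyFineGrained", "false")
  val pipeline = new StanfordCoreNLP(props)
  docs.map { text =>
    val document = new Annotation(text)
    pipeline.annotate(document)
    // collect (word, ner) pairs for every token of every sentence
    document.get(classOf[SentencesAnnotation]).asScala.flatMap { sentence =>
      sentence.get(classOf[TokensAnnotation]).asScala.map(token => (token.word(), token.ner()))
    }.toList
  }
}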

Related

main class not found in spark scala program

//package com.jsonReader
import play.api.libs.json._
import play.api.libs.json._
import play.api.libs.json.Reads._
import play.api.libs.json.Json.JsValueWrapper
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SQLContext
//import org.apache.spark.implicits._
//import sqlContext.implicits._

object json {

  def flatten(js: JsValue, prefix: String = ""): JsObject = js.as[JsObject].fields.foldLeft(Json.obj()) {
    case (acc, (k, v: JsObject)) => {
      val nk = if (prefix.isEmpty) k else s"$prefix.$k"
      acc.deepMerge(flatten(v, nk))
    }
    case (acc, (k, v: JsArray)) => {
      val nk = if (prefix.isEmpty) k else s"$prefix.$k"
      val arr = flattenArray(v, nk).foldLeft(Json.obj())(_ ++ _)
      acc.deepMerge(arr)
    }
    case (acc, (k, v)) => {
      val nk = if (prefix.isEmpty) k else s"$prefix.$k"
      acc + (nk -> v)
    }
  }

  def flattenArray(a: JsArray, k: String = ""): Seq[JsObject] = {
    flattenSeq(a.value.zipWithIndex.map {
      case (o: JsObject, i: Int) =>
        flatten(o, s"$k[$i]")
      case (o: JsArray, i: Int) =>
        flattenArray(o, s"$k[$i]")
      case a =>
        Json.obj(s"$k[${a._2}]" -> a._1)
    })
  }

  def flattenSeq(s: Seq[Any], b: Seq[JsObject] = Seq()): Seq[JsObject] = {
    s.foldLeft[Seq[JsObject]](b) {
      case (acc, v: JsObject) =>
        acc :+ v
      case (acc, v: Seq[Any]) =>
        flattenSeq(v, acc)
    }
  }

  def main(args: Array[String]) {
    val appName = "Stream example 1"
    val conf = new SparkConf().setAppName(appName).setMaster("local[*]")
    //val spark = new SparkContext(conf)
    val sc = new SparkContext(conf)
    //val sqlContext = new SQLContext(sc)
    val sqlContext = new SQLContext(sc);
    //val spark=sqlContext.sparkSession
    val spark = SparkSession.builder().appName("json Reader")
    val df = sqlContext.read.json("C://Users//ashda//Desktop//test.json")
    val set = df.select($"user", $"status", $"reason", explode($"dates")).show()
    val read = flatten(df)
    read.printSchema()
    df.show()
  }
}
I'm trying to use this code to flatten a highly nested JSON. For this I created a project and converted it to a Maven project. I edited the pom.xml and included the libraries I needed, but when I run the program it says "Error: Could not find or load main class".
I tried converting the code to an sbt project and running it, but I get the same error. I tried packaging the code and running it through spark-submit, which gives me the same error. Please let me know what I am missing here. I have tried everything I could for this.
Thanks
Hard to say, but maybe you have many classes that qualify as a main class, so the build tool does not know which one to choose. Maybe try to clean the project first with sbt clean.
Anyway, in Scala the preferred way to define a main class is to extend the App trait.
object SomeApp extends App
Then the whole object body will become your main method.
You can also define the main class in your build.sbt. This is necessary if you have many objects that extend the App trait.
mainClass in (Compile, run) := Some("io.example.SomeApp")
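For example, a minimal sketch (the package and object names are made up for illustration; they just have to match what you put in mainClass above):
// src/main/scala/io/example/SomeApp.scala
package io.example

object SomeApp extends App {
  // the whole object body becomes the main method; args is provided by the App trait
  println("Hello from SomeApp, args = " + args.mkString(", "))
}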
I am answering this question for sbt configurations. I ran into the same issue, which I resolved recently, and I made some basic mistakes which I would like you to note:
1. Configure your sbt file
Go to the build.sbt file and check that the Scala version you are using is compatible with Spark. As per Spark 2.4.0 (https://spark.apache.org/docs/latest/), the required Scala version is 2.11.x, not 2.12.x. So even though your IDE (Eclipse/IntelliJ) shows the latest Scala version, or the version you downloaded, change it to a compatible one. Also include this line of code:
libraryDependencies += "org.scala-lang" % "scala-library" % "2.11.6"
where 2.11.x is your Scala version. (A full build.sbt sketch is shown after these steps.)
2. File Hierarchy
Make sure your Scala files are under src/main/scala only.
3. Terminal
If your IDE allows you to launch a terminal within it, launch it (IntelliJ does; I am not sure about Eclipse or others), or go to a terminal and change directory to your project directory.
Then run:
sbt clean
This will clear any libraries loaded previously and artifacts created during compilation.
sbt package
This will package your files into a single jar file under target/scala-<version>/.
Then submit to Spark:
spark-submit --class <ClassName> --master local[*] target/scala-<version>/<jar file>
(In your case the class name is com.jsonReader.json.) Note that options such as --master, if already specified in the program, aren't required here.
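For reference, a minimal build.sbt sketch for a project like this one (the version numbers are assumptions; match them to your own Spark and Scala setup):
name := "jsonReader"

version := "0.1"

scalaVersion := "2.11.12"

// Spark is marked "provided" because spark-submit supplies it at runtime
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.0" % "provided",
  "com.typesafe.play" %% "play-json" % "2.6.10"
)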

Easy way to generate 1.sql from model?

As a Slick noob I do not understand why I have to specify my model twice, first in Scala and then in 1.sql to create the tables. That does not look DRY. Is there an easy way to generate 1.sql (and 2..n.sql) from the model during development?
I created my own sbt task to easily generate the 1.sql from the model using code generation.
In the build.sbt file:
val generate_schema = taskKey[Unit]("Schema Generator")

generate_schema <<= (fullClasspath in Runtime) map { classpath =>
  // load the generator from the runtime classpath and run it
  // (ClasspathUtilities is sbt's sbt.classpath.ClasspathUtilities in the 0.13.x series)
  val loader: ClassLoader = ClasspathUtilities.toLoader(classpath.map(_.data).map(_.getAbsoluteFile))
  val schemaGenerator = loader.loadClass("misc.SchemaGenerator").newInstance().asInstanceOf[Runnable]
  schemaGenerator.run()
}
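The <<= syntax above is for the older sbt 0.13 series. On sbt 1.x, a rough equivalent might look like the following untested sketch; it sidesteps ClasspathUtilities by building a plain URLClassLoader over the runtime classpath:
val generate_schema = taskKey[Unit]("Schema Generator")

generate_schema := {
  val classpath = (Runtime / fullClasspath).value
  // build a class loader over the runtime classpath and run the generator
  val urls = classpath.map(_.data.toURI.toURL).toArray
  val loader = new java.net.URLClassLoader(urls, getClass.getClassLoader)
  val schemaGenerator = loader.loadClass("misc.SchemaGenerator")
    .getDeclaredConstructor().newInstance().asInstanceOf[Runnable]
  schemaGenerator.run()
}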
The misc.SchemaGenerator class:
package misc
import models.Article
import models.Category
import play.api.Logger
import slick.driver.PostgresDriver.api._
import scala.concurrent._
import ExecutionContext.Implicits.global
import scala.reflect.io.File
class SchemaGenerator extends Runnable {

  def run = {
    println("---------------------GENERATING SCHEMA.....")

    val categories = TableQuery[Category]
    val articles = TableQuery[Article]

    val file = File("/home/pedro/NetBeansProjects/play-scala-angular-sample/my-blog-server/conf/evolutions/default/1.sql")

    val sb = new StringBuilder("# --- !Ups \n\n")
    categories.schema.create.statements.foreach(st => sb.append(st.toString + ";\n"))
    sb.append("\n\n")
    articles.schema.create.statements.foreach(st => sb.append(st.toString + ";\n"))
    sb.append("\n\n")
    sb.append("# --- !Downs")
    sb.append("\n\n")
    categories.schema.drop.statements.foreach(st => sb.append(st.toString + ";\n"))
    sb.append("\n\n")
    articles.schema.drop.statements.foreach(st => sb.append(st.toString + ";\n"))
    // Logger.info("value: [" + sb + "] sb")

    file.writeAll(sb.toString)
    Logger.info("----------------------FINISHED GENERATING SCHEMA--------------------------")
  }
}
You can execute the task from the activator console: generate_schema.
Hope it helps.
ddl.create is no longer supported since Slick 3.0; the motivation can be found at the bottom of the following page: https://www.playframework.com/documentation/2.4.x/PlaySlickMigrationGuide
So I have to hand edit my schemas or use code generation.
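For reference, in Slick 3 the replacement for the old ddl.create is the schema action. A rough sketch, assuming a configured Database (the "mydb" config key is made up) and the same table queries as in the generator above:
import slick.driver.PostgresDriver.api._
import scala.concurrent.Await
import scala.concurrent.duration._

val db = Database.forConfig("mydb") // "mydb" is a made-up config key
val categories = TableQuery[Category]
val articles = TableQuery[Article]

// run the DDL directly against the database ...
Await.result(db.run((categories.schema ++ articles.schema).create), 10.seconds)

// ... or just print the statements, as the generator above does
(categories.schema ++ articles.schema).create.statements.foreach(println)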

Tokenization by Stanford parser is slow?

Question summary: tokenization by the Stanford parser is slow on my local machine, but unreasonably faster on Spark. Why?
I'm using the Stanford CoreNLP tool to tokenize sentences.
My script in Scala is like this:
import java.util.Properties
import scala.collection.JavaConversions._
import scala.collection.immutable.ListMap
import scala.io.Source
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.ling.CoreLabel
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
val properties = new Properties()
val coreNLP = new StanfordCoreNLP(properties)
def tokenize(s: String) = {
  properties.setProperty("annotators", "tokenize")
  val annotation = new Annotation(s)
  coreNLP.annotate(annotation)
  annotation.get(classOf[TokensAnnotation]).map(_.value.toString)
}
tokenize("Here is my sentence.")
One call of the tokenize function takes roughly (at least) 0.1 sec.
This is very, very slow because I have 3 million sentences.
(3M × 0.1 sec = 300K sec ≈ 83 hours)
As an alternative approach, I have applied the tokenizer on Spark (with four worker machines).
import java.util.List
import java.util.Properties
import scala.collection.JavaConversions._
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.ling.CoreLabel
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
val file = sc.textFile("hdfs:///myfiles")
def tokenize(s: String) = {
  val properties = new Properties()
  properties.setProperty("annotators", "tokenize")
  val coreNLP = new StanfordCoreNLP(properties)
  val annotation = new Annotation(s)
  coreNLP.annotate(annotation)
  annotation.get(classOf[TokensAnnotation]).map(_.toString)
}

def normalizeToken(t: String) = {
  val ts = t.toLowerCase
  val num = "[0-9]+[,0-9]*".r
  ts match {
    case num() => "NUMBER"
    case _ => ts
  }
}
val tokens = file.map(tokenize(_))
val tokenList = tokens.flatMap(_.map(normalizeToken))
val wordCount = tokenList.map((_,1)).reduceByKey(_ + _).sortBy(_._2, false)
wordCount.saveAsTextFile("wordcount")
This script finishes tokenization and word counting of 3 million sentences in just 5 minutes!
And the results seem reasonable.
Why is this so fast? Or rather, why is the first Scala script so slow?
The problem with your first approach is that you set the annotators property after you initialize the StanfordCoreNLP object. Therefore CoreNLP is initialized with the list of default annotators, which includes the part-of-speech tagger and the parser; these are orders of magnitude slower than the tokenizer.
To fix this, simply move the line
properties.setProperty("annotators", "tokenize")
before the line
val coreNLP = new StanfordCoreNLP(properties)
This should be even slightly faster than your second approach as you don't have to reinitialize CoreNLP for each sentence.
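For clarity, here is a minimal sketch of the corrected first script: the same code as in the question, with only the annotators property set before the pipeline is constructed.
import java.util.Properties
import scala.collection.JavaConversions._
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

val properties = new Properties()
// set the annotators BEFORE constructing the pipeline, so only the tokenizer is loaded
properties.setProperty("annotators", "tokenize")
val coreNLP = new StanfordCoreNLP(properties)

def tokenize(s: String) = {
  val annotation = new Annotation(s)
  coreNLP.annotate(annotation)
  annotation.get(classOf[TokensAnnotation]).map(_.value.toString)
}

tokenize("Here is my sentence.")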

Simplest method for text lemmatization in Scala and Spark

I want to use lemmatization on a text file:
surprise heard thump opened door small seedy man clasping package wrapped.
upgrading system found review spring 2008 issue moody audio backed.
omg left gotta wrap review order asap . understand hand delivered dali lama
speak hands wear earplugs lives . listen maintain link long .
cables cables finally able hear gem long rumored music .
...
and the expected output is:
surprise heard thump open door small seed man clasp package wrap.
upgrade system found review spring 2008 issue mood audio back.
omg left gotta wrap review order asap . understand hand deliver dali lama
speak hand wear earplug live . listen maintain link long .
cable cable final able hear gem long rumor music .
...
Can anybody help me? Does anyone know the simplest method for lemmatization that has been implemented in Scala and Spark?
There is a function from the book Advanced Analytics with Spark, in the chapter about lemmatization:
val plainText = sc.parallelize(List("Sentence to be processed."))
val stopWords = Set("stopWord")

import java.util.Properties
import scala.collection.mutable.ArrayBuffer
import edu.stanford.nlp.pipeline._
import edu.stanford.nlp.ling.CoreAnnotations._
import scala.collection.JavaConversions._

def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
lemmatized.foreach(println)
Now just use this for every line in a mapper:
val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
EDIT:
I added the line
import scala.collection.JavaConversions._
to the code; this is needed because otherwise sentences are a Java List, not a Scala one. This should now compile without problems.
I used Scala 2.10.4 and the following stanford.nlp dependencies:
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
</dependency>
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
  <classifier>models</classifier>
</dependency>
You can also look at the stanford.nlp page; there are a lot of examples (in Java): http://nlp.stanford.edu/software/corenlp.shtml.
EDIT:
mapPartitions version, although I don't know if it's going to speed up the job significantly:
def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.mapPartitions(p => {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  p.map(q => plainTextToLemmas(q, stopWords, pipeline))
})
lemmatized.foreach(println)
I think #user52045 has the right idea. The only modification I would make would be to use mapPartitions instead of map -- this allows you to only do the potentially expensive pipeline creation once per partition. This may not be a huge hit on a lemmatization pipeline, but it will be extremely important if you want to do something that requires a model, like the NER portion of the pipeline.
def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.mapPartitions(strings => {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  strings.map(string => plainTextToLemmas(string, stopWords, pipeline))
})
lemmatized.foreach(println)
I would suggest using the Stanford CoreNLP wrapper for Apache Spark, as it provides an API for the basic CoreNLP functions such as lemmatization, tokenization, etc.
I have used it for lemmatization on a Spark DataFrame.
Link: https://github.com/databricks/spark-corenlp
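A minimal sketch of that approach, assuming the spark-corenlp package is on the classpath and exposes a lemma column function under com.databricks.spark.corenlp.functions (worth double-checking against the README in the repository above):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import com.databricks.spark.corenlp.functions._

val spark = SparkSession.builder().appName("lemmatize").getOrCreate()
import spark.implicits._

// one document per row; the sample line is taken from the question
val docs = Seq("surprise heard thump opened door small seedy man clasping package wrapped.").toDF("text")

// lemma(...) returns the lemma of every token in the column
val lemmas = docs.select(lemma(col("text")).as("lemmas"))
lemmas.show(false)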

Getting a Scala Map from a Java Properties

I was trying to pull environment variables into a Scala script using Java Iterators and/or Enumerations and realised that Dr Frankenstein might claim parentage, so I hacked the following from the ugly tree instead:
import java.util.Map.Entry
import System._
val propSet = getProperties().entrySet().toArray()
val props = (0 until propSet.size).foldLeft(Map[String, String]()) { (m, i) =>
  val e = propSet(i).asInstanceOf[Entry[String, String]]
  m + (e.getKey() -> e.getValue())
}
For example, to print the said same environment:
props.keySet.toList.sortWith(_ < _).foreach { k =>
  println(k + (" " * (30 - k.length)) + " = " + props(k))
}
Please, please don't set about polishing this t$#d, just show me the scala gem that I'm convinced exists for this situation (i.e java Properties --> scala.Map), thanks in advance ;#)
Scala 2.10.3
import java.io.{File, FileInputStream}
import java.util.Properties
import scala.collection.JavaConverters._

// Create a variable to store the properties in
val props = new Properties
// Open a file stream to read the file
val fileStream = new FileInputStream(new File(fileName))
props.load(fileStream)
fileStream.close()
// Print the contents of the properties file as a map
println(props.asScala.toMap)
Scala 2.7:
val props = Map() ++ scala.collection.jcl.Conversions.convertMap(System.getProperties).elements
Though that needs some typecasting. Let me work on it a bit more.
val props = Map() ++ scala.collection.jcl.Conversions.convertMap(System.getProperties).elements.asInstanceOf[Iterator[(String, String)]]
Ok, that was easy. Let me work on 2.8 now...
import scala.collection.JavaConversions.asMap
val props = System.getProperties() : scala.collection.mutable.Map[AnyRef, AnyRef] // or
val props = System.getProperties().asInstanceOf[java.util.Map[String, String]] : scala.collection.mutable.Map[String, String] // way too many repetitions of types
val props = asMap(System.getProperties().asInstanceOf[java.util.Map[String, String]])
The verbosity, of course, can be decreased with a couple of imports. First of all, note that Map will be a mutable map on 2.8. On the bright side, if you convert back the map, you'll get the original object.
Now, I have no clue why Properties implements Map<Object, Object>, given that the javadocs clearly state that key and value are String, but there you go. Having to typecast this makes the implicit option much less attractive. This being the case, the alternative is the most concise of them.
EDIT
Scala 2.8 just acquired an implicit conversion from Properties to mutable.Map[String,String], which makes most of that code moot.
In Scala 2.9.1 this is solved by the implicit conversions inside collection.JavaConversions._. The other answers use deprecated functions. The details are documented here; this is a relevant snippet out of that page:
scala> import collection.JavaConversions._
import collection.JavaConversions._
scala> import collection.mutable._
import collection.mutable._
scala> val jul: java.util.List[Int] = ArrayBuffer(1, 2, 3)
jul: java.util.List[Int] = [1, 2, 3]
scala> val buf: Seq[Int] = jul
buf: scala.collection.mutable.Seq[Int] = ArrayBuffer(1, 2, 3)
scala> val m: java.util.Map[String, Int] = HashMap("abc" -> 1, "hello" -> 2)
m: java.util.Map[String,Int] = {hello=2, abc=1}
Getting from a mutable map to an immutable map is a matter of calling toMap on it.
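For example, a short sketch of that, relying on the 2.8+ implicit conversion mentioned above:
import scala.collection.JavaConversions._

// the implicit conversion turns Properties into a mutable Map[String, String] ...
val scalaProps: scala.collection.mutable.Map[String, String] = System.getProperties
// ... and toMap then gives an immutable Map
val immutableProps: Map[String, String] = scalaProps.toMap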
In Scala 2.8.1 you can do it with asScalaMap(m : java.util.Map[A, B]) in a more concise way:
var props = asScalaMap(System.getProperties())
props.keySet.toList.sortWith(_ < _).foreach { k =>
  println(k + (" " * (30 - k.length)) + " = " + props(k))
}
In Scala 2.13.2:
import scala.jdk.javaapi.CollectionConverters._
val props = asScala(System.getProperties)
Looks like in the most recent version of Scala (2.10.2 as of the time of this answer), the preferred way to do this is using the explicit .asScala from scala.collection.JavaConverters:
import scala.collection.JavaConverters._
val props = System.getProperties().asScala
// props is a scala.collection.mutable.Map[String, String]; call .toMap if you need an immutable one