Scala: how to convert an RDD[Seq[String]] to an RDD[String]? (TF-IDF after lemmatization)

I am trying to learn Scala and, specifically, text mining (lemmatization, TF-IDF matrices and LSA). I have some texts I want to lemmatize and then classify (LSA). I use Spark on Cloudera.
So I used the Stanford CoreNLP function:
def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}
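I then map this function over my RDD of documents, roughly like this (textRDD is just a placeholder name for my RDD[String] of texts):

val lemmatizedText = textRDD.map(plainTextToLemmas(_, stopWords))  // this is an RDD[Seq[String]]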
After that, I try to build a TF-IDF matrix, but here is my problem: the Stanford function produces an RDD[Seq[String]], and I get an error because I need an RDD[String] (not an RDD[Seq[String]]) for this call:
val (termDocMatrix, termIds, docIds, idfs) = termDocumentMatrix(lemmatizedText, stopWords, numTerms, sc)
Does anyone know how to convert a Seq[String] to a String? Or do I need to change one of my calls?
Thanks for the help, and sorry if it's a dumb question (and for my English).
Bye

I am not sure what this lemmatization thingy is, but as far as making a string out of a sequence, you can just do seq.mkString("\n") (or replace "\n" with whatever other separator you want), or just seq.mkString if you want it merged without any separator.
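For example, with a space as the separator:

Seq("car", "be", "easy", "way").mkString(" ")  // "car be easy way"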
Also, don't use mutable structures; they're considered bad taste in Scala:
// requires scala.collection.JavaConversions._ in scope, as in the question code
val lemmas = sentences
  .flatMap(_.get(classOf[TokensAnnotation]))   // flatten the tokens of all sentences
  .map(_.get(classOf[LemmaAnnotation]))
  .filter(_.length > 2)
  .filterNot(stopWords)
  .map(_.toLowerCase)                          // lower-case, as in the original loop
  .mkString
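Applied at the RDD level, that looks something like this (a sketch; lemmatized stands for the RDD[Seq[String]] produced by your lemmatization step):

val lemmatizedText = lemmatized.map(_.mkString(" "))  // RDD[String], one string per document; drop the " " for no separator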

Related

How can I construct a String with the contents of a given DataFrame in Scala

Suppose I have a DataFrame. How can I retrieve its contents and represent them as a string?
Suppose I try to do that with the example code below.
val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
val sc = new SparkContext(conf)
val df = sc.parallelize(tvalues zip pvalues)
val sb = StringBuilder.newBuilder
df.foreach(x => {
  println("x = ", x)
  sb.append(x)
})
println("sb = ", sb)
The output of the code shows that the example data has these contents:
(x = ,(1.866393526974307,0.064020056478447))
(x = ,(7.876169953355888,7.489564524121306E-13))
(x = ,(2.864048126935307,0.004808399479386827))
(x = ,(4.032486069215076,8.914865448939047E-5))
(x = ,(4.875333799256043,2.8363794106756046E-6))
However, the final StringBuilder contains an empty string.
Any thoughts on how to retrieve a String for a given DataFrame in Scala?
Many thanks
UPD: as mentioned by @user8371915, the solution below will only work in a single JVM, i.e. in development (local) mode. In fact, we can't modify broadcast variables as if they were globals. You can use accumulators, but that will be quite inefficient. You can also read an answer about read/write global variables here. Hope it helps.
I think you should read the topic about shared variables in Spark. Link here.
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
Let's have a look at broadcast variables. I edited your code:
val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
val sc = new SparkContext(conf)
val df = sc.parallelize(tvalues zip pvalues)
val sb = StringBuilder.newBuilder
val broadcastVar = sc.broadcast(sb)
df.foreach(x => {
  println("x = ", x)
  broadcastVar.value.append(x)
})
println("sb = ", broadcastVar.value)
Here I used broadcastVar as a container for the StringBuilder variable sb.
Here is the output:
(x = ,(1.866393526974307,0.064020056478447))
(x = ,(2.864048126935307,0.004808399479386827))
(x = ,(4.032486069215076,8.914865448939047E-5))
(x = ,(7.876169953355888,7.489564524121306E-13))
(x = ,(4.875333799256043,2.8363794106756046E-6))
(x = ,(14.316322626848278,0.0))
(sb = ,(7.876169953355888,7.489564524121306E-13)(1.866393526974307,0.064020056478447)(4.875333799256043,2.8363794106756046E-6)(2.864048126935307,0.004808399479386827)(14.316322626848278,0.0)(4.032486069215076,8.914865448939047E-5))
Hope this helps.
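If all you need is the concatenated string on the driver, a simpler option (a sketch, assuming the data fits comfortably in driver memory) is to skip the shared variable entirely and collect first:

// collect the RDD to the driver, then build the string locally
val collected = df.collect()                     // Array[(Double, Double)]
val result = collected.map(_.toString).mkString
println("sb = " + result)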
Does the output of df.show(false) help? If yes, then this SO answer helps: Is there any way to get the output of Spark's Dataset.show() method as a string?
Thanks everybody for the feedback; I understand this slightly better now.
The combination of responses results in the code below. The requirements have changed slightly, in that I now represent my DataFrame as a list of JSONs. The code below does this without using a broadcast.
class HandleDf(df: DataFrame, limit: Int) extends java.io.Serializable {
  val jsons = df.limit(limit).collect.map(rowToJson(_))

  def rowToJson(r: org.apache.spark.sql.Row): JSONObject = {
    try { JSONObject(r.getValuesMap(r.schema.fieldNames)) }
    catch { case t: Throwable =>
      JSONObject.apply(Map("Row with error" -> t.toString))
    }
  }
}
I use the class like this:
val jsons = new HandleDf(df, 100).jsons
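If you still need a single String at the end (the original goal), one way is to join the collected JSON objects yourself, for example:

// jsons is the Array[JSONObject] built above; render it as one JSON-array string
val dfAsString = jsons.map(_.toString).mkString("[", ",", "]")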

Stemming and Lemmatisation using Stanford NLP library

I am using the Stanford NLP library for stemming and lemmatisation. I followed the example in their documentation:
def plainTextToLemmas(text: String, stopWords: Set[String]): List[String] = {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  // empty annotation with given text
  val doc = new Annotation(text)
  // run annotators on text
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas.toList
}
val x = sentence.map(plainTextToLemmas(_, stopWords))
However, it does not lemmatize sentences very well when there is no space after the full stop. Is there an option to fix this? Also, is there an option to filter out HTML tags? Adding them to the stop words is not working.
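I don't know of a single CoreNLP option that covers both cases, but a hedged workaround is to pre-process the raw text before annotating it, stripping HTML tags and forcing a space after sentence-final periods with simple regexes (the helper name below is illustrative):

// illustrative pre-cleaning step applied before plainTextToLemmas
def preClean(text: String): String =
  text
    .replaceAll("<[^>]+>", " ")        // drop HTML tags
    .replaceAll("""\.(?=\S)""", ". ")  // naive: also splits decimals and URLs

and then call plainTextToLemmas(preClean(text), stopWords) instead.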

Stemming and Lemmatization in Spark and Scala

I have used the Stanford NLP library to perform stemming and lemmatization on a sentence, for example:
Car is an easy way for commute. But there are too many cars on roads these days.
So the expected output is:
car be easy way commute car road day
But I am getting this:
ArrayBuffer(car, easy, way, for, commute, but, there, too, many, car, road, these, day)
Here is the code:
val stopWords = sc.broadcast(
  scala.io.Source.fromFile("src/main/common-english-words.txt").getLines().toSet).value

def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}
val lemmatized = stringRDD.map(plainTextToLemmas(_, stopWords))
lemmatized.foreach(println)
I have taken this from the Advanced Analytics with Spark book. It seems like the stop words are not removed, and "is" is not converted to "be". Can we add or delete rules in these libraries? The stop-word list I use is:
http://www.textfixer.com/resources/common-english-words.txt
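Two hedged observations about the code above, rather than a confirmed diagnosis: the filter lemma.length > 2 silently drops two-letter lemmas such as "be", and if the linked stop-word file is a single comma-separated line rather than one word per line, getLines().toSet yields one giant string instead of a set of words. A sketch of both adjustments:

// split on commas in case the stop-word file is one comma-separated line (assumption)
val stopWords = scala.io.Source.fromFile("src/main/common-english-words.txt")
  .getLines().flatMap(_.split(",")).map(_.trim).toSet

// inside plainTextToLemmas, relax the length filter so "be" survives
if (lemma.length > 1 && !stopWords.contains(lemma)) {
  lemmas += lemma.toLowerCase
}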

Converting a String to a Map

Given a String : {'Name':'Bond','Job':'Agent','LastEntry':'15/10/2015 13:00'}
I want to parse it into a Map[String,String]. I already tried this answer, but it doesn't work when the character : appears inside a parsed value. Same with the ' character; it seems to break every JSON mapper...
Thanks for any help.
Let
val s0 = "{'Name':'Bond','Job':'Agent','LastEntry':'15/10/2015 13:00'}"
val s = s0.stripPrefix("{").stripSuffix("}")
Then
(for (e <- s.split(",") ; xs = e.split(":",2)) yield xs(0) -> xs(1)).toMap
Here we split each key-value pair at the first occurrence of ":". This relies on a strong assumption, namely that the key itself does not contain any ":".
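If you also want the surrounding single quotes stripped from keys and values, a small extension of the same idea:

(for (e <- s.split(",") ; xs = e.split(":", 2))
  yield xs(0).trim.stripPrefix("'").stripSuffix("'") -> xs(1).trim.stripPrefix("'").stripSuffix("'")
).toMap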
You can use the familiar jackson-module-scala, which handles this kind of task at much better scale.
For example:
val src = "{'Name':'Bond','Job':'Agent','LastEntry':'15/10/2015 13:00'}"
val mapper = new ObjectMapper() with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
// the input uses single quotes, which are not standard JSON, so enable this parser feature
mapper.configure(JsonParser.Feature.ALLOW_SINGLE_QUOTES, true)
val myMap = mapper.readValue[Map[String, String]](src)

Simplest method for text lemmatization in Scala and Spark

I want to use lemmatization on a text file:
surprise heard thump opened door small seedy man clasping package wrapped.
upgrading system found review spring 2008 issue moody audio backed.
omg left gotta wrap review order asap . understand hand delivered dali lama
speak hands wear earplugs lives . listen maintain link long .
cables cables finally able hear gem long rumored music .
...
and the expected output is:
surprise heard thump open door small seed man clasp package wrap.
upgrade system found review spring 2008 issue mood audio back.
omg left gotta wrap review order asap . understand hand deliver dali lama
speak hand wear earplug live . listen maintain link long .
cable cable final able hear gem long rumor music .
...
Can anybody help me? Who knows the simplest method of lemmatization that has been implemented in Scala and Spark?
There is a function from the book Advanced Analytics with Spark, in the chapter about lemmatization:
val plainText = sc.parallelize(List("Sentence to be processed."))
val stopWords = Set("stopWord")

import java.util.Properties
import scala.collection.mutable.ArrayBuffer
import edu.stanford.nlp.pipeline._
import edu.stanford.nlp.ling.CoreAnnotations._
import scala.collection.JavaConversions._

def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}
val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
lemmatized.foreach(println)
Now just use this for every line in the mapper:
val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
EDIT:
I added the line
import scala.collection.JavaConversions._
to the code. This is needed because otherwise sentences is a Java List, not a Scala one. The code should now compile without problems.
I used Scala 2.10.4 and the following stanford.nlp dependencies:
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
</dependency>
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
  <classifier>models</classifier>
</dependency>
You can also look at the stanford.nlp page; there are a lot of examples (in Java): http://nlp.stanford.edu/software/corenlp.shtml
EDIT:
mapPartitions version, although I don't know whether it will speed up the job significantly:
def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.mapPartitions(p => {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  p.map(q => plainTextToLemmas(q, stopWords, pipeline))
})
lemmatized.foreach(println)
I think @user52045 has the right idea. The only modification I would make would be to use mapPartitions instead of map; this allows you to do the potentially expensive pipeline creation only once per partition. This may not be a huge hit on a lemmatization pipeline, but it will be extremely important if you want to do something that requires a model, like the NER portion of the pipeline.
def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.mapPartitions(strings => {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  strings.map(string => plainTextToLemmas(string, stopWords, pipeline))
})
lemmatized.foreach(println)
I would suggest using the Stanford CoreNLP wrapper for Apache Spark, as it gives an official API for the basic CoreNLP functions such as lemmatization, tokenization, etc.
I have used it for lemmatization on a Spark DataFrame.
Link: https://github.com/databricks/spark-corenlp
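A minimal sketch of how the wrapper is typically wired up, based on its README; the lemma column function and the import paths are assumptions you should verify against the project:

import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._  // column functions exposed by the wrapper (assumed, per its README)
import spark.implicits._                         // assumes a SparkSession named `spark`

val input = Seq((1, "Cars are an easy way to commute.")).toDF("id", "text")
// `lemma` is advertised as one of the wrapper's column functions; check the project docs
val output = input.select('id, lemma('text).as('lemmas))
output.show(false)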