Inference with Labeled LDA/pLDA [Topic Modelling Toolbox] - Scala

I have been trying to get the code for inference from a trained Labeled LDA model and pLDA working, using the TMT toolbox (Stanford NLP Group).
I have gone through the examples provided in the following links:
http://nlp.stanford.edu/software/tmt/tmt-0.3/
http://nlp.stanford.edu/software/tmt/tmt-0.4/
Here is the code I'm trying for labeled LDA inference
val modelPath = file("llda-cvb0-59ea15c7-31-61406081-75faccf7");
val model = LoadCVB0LabeledLDA(modelPath);
val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);

val text = {
  source ~>                           // read from the source file
  Column(4) ~>                        // select column containing text
  TokenizeWith(model.tokenizer.get)   // tokenize with the model's tokenizer
}

val labels = {
  source ~>                           // read from the source file
  Column(2) ~>                        // take column two, the year
  TokenizeWith(WhitespaceTokenizer())
}

val outputPath = file(modelPath, source.meta[java.io.File].getName.replaceAll(".csv", ""));

val dataset = LabeledLDADataset(text, labels, model.termIndex, model.topicIndex);

val perDocTopicDistributions = InferCVB0LabeledLDADocumentTopicDistributions(model, dataset);
val perDocTermTopicDistributions = EstimateLabeledLDAPerWordTopicDistributions(model, dataset, perDocTopicDistributions);

TSVFile(outputPath + "-word-topic-distributions.tsv").write({
  for ((terms, (dId, dists)) <- text.iterator zip perDocTermTopicDistributions.iterator) yield {
    require(terms.id == dId);
    (terms.id,
      for ((term, dist) <- (terms.value zip dists)) yield {
        term + " " + dist.activeIterator.map({
          case (topic, prob) => model.topicIndex.get.get(topic) + ":" + prob
        }).mkString(" ");
      });
  }
});
Error
found : scalanlp.collection.LazyIterable[(String, Array[Double])]
required: Iterable[(String, scalala.collection.sparse.SparseArray[Double])]
EstimateLabeledLDAPerWordTopicDistributions(model, dataset, perDocTopicDistributions);
I understand it's a type mismatch error, but I don't know how to resolve it in Scala.
Basically, I don't understand how I should extract
1. the per-document topic distribution
2. the per-document label distribution
after running the infer command.
Please help.
The same applies to pLDA: I reach the inference command and after that I'm clueless about what to do with it.

Scala's type system is much more complex than Java's, and understanding it will make you a better programmer. The problem lies here:
val perDocTermTopicDistributions =EstimateLabeledLDAPerWordTopicDistributions(model, dataset, perDocTopicDistributions);
because one of the arguments, in this case perDocTopicDistributions, is of type:
scalanlp.collection.LazyIterable[(String, Array[Double])]
while EstimateLabeledLDAPerWordTopicDistributions.apply expects an
Iterable[(String, scalala.collection.sparse.SparseArray[Double])]
The best way to investigate these type errors is to look at the ScalaDoc (for example, the one for TMT is here: http://nlp.stanford.edu/software/tmt/tmt-0.4/api/#package ), and if you cannot easily find where the problem lies, you should make the types of your variables explicit in your code, like the following:
val perDocTopicDistributions:LazyIterable[(String, Array[Double])] = InferCVB0LabeledLDADocumentTopicDistributions(model, dataset)
If we look together at the ScalaDoc of edu.stanford.nlp.tmt.stage:
def EstimateLabeledLDAPerWordTopicDistributions(model: edu.stanford.nlp.tmt.model.llda.LabeledLDA[_, _, _], dataset: Iterable[LabeledLDADocumentParams], perDocTopicDistributions: Iterable[(String, SparseArray[Double])]): LazyIterable[(String, Array[SparseArray[Double]])]

def InferCVB0LabeledLDADocumentTopicDistributions(model: CVB0LabeledLDA, dataset: Iterable[LabeledLDADocumentParams]): LazyIterable[(String, Array[Double])]
It now should be clear to you that the return of InferCVB0LabeledLDADocumentTopicDistributions cannot be used directly to feed EstimateLabeledLDAPerWordTopicDistributions.
I have never used Stanford NLP, but this is how the API works by design, so you only need to convert your scalanlp.collection.LazyIterable[(String, Array[Double])] into an Iterable[(String, scalala.collection.sparse.SparseArray[Double])] before calling the function.
If you look at the ScalaDoc for how to do this conversion, it's pretty simple. Inside the stage package, in package.scala, I can read import scalanlp.collection.LazyIterable; so I know where to look, and indeed at http://www.scalanlp.org/docs/core/data/#scalanlp.collection.LazyIterable there is a toIterable method which turns a LazyIterable into an Iterable. You still have to transform the inner Array into a SparseArray.
Again, I look into package.scala for the stage package inside TMT and I see import scalala.collection.sparse.SparseArray; so I look at the scalala documentation:
http://www.scalanlp.org/docs/scalala/0.4.1-SNAPSHOT/#scalala.collection.sparse.SparseArray
The constructors seem complicated to me, which suggests looking into the companion object for a factory method. It turns out the method I am looking for is there, and it's called apply, as usual in Scala.
def apply[T](values: T*)(implicit arg0: ClassManifest[T], arg1: DefaultArrayValue[T]): SparseArray[T]
By using this, you can write a function with the following signature:
def f: Array[Double] => SparseArray[Double]
Once this is done, you can turn the result of InferCVB0LabeledLDADocumentTopicDistributions into a non-lazy iterable of sparse arrays with one line of code:
result.toIterable.map { case (name, values) => (name, f(values)) }
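Putting the pieces together, here is a hedged sketch of what that conversion could look like, assuming TMT 0.4 and scalala's apply factory shown above (the names f and converted are illustrative, not part of the toolbox):

import scalala.collection.sparse.SparseArray

// turn a dense Array[Double] into a SparseArray[Double]
def f(values: Array[Double]): SparseArray[Double] = SparseArray(values: _*)

// materialize the lazy iterable and convert each per-document distribution
val converted: Iterable[(String, SparseArray[Double])] =
  perDocTopicDistributions.toIterable.map { case (name, values) => (name, f(values)) }

val perDocTermTopicDistributions =
  EstimateLabeledLDAPerWordTopicDistributions(model, dataset, converted)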

Related

Iterating over two Source and filter using a property in Scala

I am trying to combine two sources, keeping only the latest version of each common object, and return another Source. My object looks like:
case class Record(id: String, version: Long)
My method's input are two Source:
val sourceA: Source[Record, _] = <>
val sourceB: Source[Record, _] = <>
sourceA and sourceB have Records with common ids, but there is a possibility that the versions differ between the two. I want to create a method which returns a Source[Record, _] that has the latest version for each id. I tried:
val latestCombinedSource: Source[Record, _] = sourceA map {each => {
sourceB.map(eachB => eachB.version > each.version? eachB: each)
.....
}
}
You did not mention what type of Source / what streaming library you are asking about (please update your question to clarify that). From the signatures in the code, I assume that this is about akka-stream. If that is correct, then you probably want to use zipLatestWith:
val latestCombinedSource: Source[Record, _] =
  sourceA.zipLatestWith(sourceB) { (a, b) =>
    if (a.version > b.version) a else b
  }
Note that there is also zipWith, and I'm not 100% sure which one you'd want to use. The difference (quoted from the API docs) is: zipLatestWith "Emits when all of the inputs have at least an element available, and then each time an element becomes available on either of the inputs", while zipWith "Emits when all of the inputs have an element available".
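For completeness, here is a minimal, self-contained sketch of the zipLatestWith approach, assuming Akka Streams 2.6+; the ActorSystem name and the sample data are illustrative:

import akka.actor.ActorSystem
import akka.stream.scaladsl.Source

object ZipLatestDemo extends App {
  final case class Record(id: String, version: Long)

  implicit val system: ActorSystem = ActorSystem("zip-latest-demo")

  val sourceA: Source[Record, _] = Source(List(Record("a", 1L), Record("b", 3L)))
  val sourceB: Source[Record, _] = Source(List(Record("a", 2L), Record("b", 2L)))

  // keep whichever record has the higher version each time elements are paired
  val latestCombinedSource: Source[Record, _] =
    sourceA.zipLatestWith(sourceB)((a, b) => if (a.version > b.version) a else b)

  latestCombinedSource.runForeach(println).onComplete(_ => system.terminate())(system.dispatcher)
}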

How to return successfully parsed rows that converted into my case class

I have a file where each row is a JSON array.
I am reading each line of the file, converting each row into a JSON array, and then converting each element into a case class using spray-json.
I have this so far:
for (line <- source.getLines().take(10)) {
  val jsonArr = line.parseJson.convertTo[JsArray]
  for (ele <- jsonArr.elements) {
    val tryUser = Try(ele.convertTo[User])
  }
}
How could I convert this entire process into a single line statement?
val users: Seq[User] = source.getLines.take(10).map(line => line.parseJson.convertTo[JsonArray].elements.map(ele => Try(ele.convertTo[User])
The error is:
found : Iterator[Nothing]
Note: I used Scala 2.13.6 for all my examples.
There is a lot to unpack in these few lines of code. First of all, I'll share some code that we can use to generate some meaningful input to play around with.
object User {
  import scala.util.Random

  private def randomId: Int = Random.nextInt(900000) + 100000

  private def randomName: String = Iterator
    .continually(Random.nextInt(26) + 'a')
    .map(_.toChar)
    .take(6)
    .mkString

  def randomJson(): String = s"""{"id":$randomId,"name":"$randomName"}"""

  def randomJsonArray(size: Int): String =
    Iterator.continually(randomJson()).take(size).mkString("[", ",", "]")
}
final case class User(id: Int, name: String)
import scala.util.{Try, Success, Failure}
import spray.json._
import DefaultJsonProtocol._
implicit val UserFormat = jsonFormat2(User.apply)
This is just some scaffolding to define some User domain object and come up with a way to generate a JSON representation of an array of such objects so that we can then use a JSON library (spray-json in this case) to parse it back into what we want.
Now, going back to your question, this is a possible way to massage your data into its parsed representation. It may not fit 100% of what you are trying to do, but there's some nuance in the data types involved and how they work:
val parsedUsers: Iterator[Try[User]] =
  for {
    line <- Iterator.continually(User.randomJsonArray(4)).take(10)
    element <- line.parseJson.convertTo[JsArray].elements
  } yield Try(element.convertTo[User])
First difference: notice that I use the for comprehension in a form in which the "outcome" of an iteration is not a side effect (for (something) { do something }) but an actual value (for (something) yield { return a value }).
Second difference: I explicitly asked for an Iterator[Try[User]] rather than a Seq[User]. We can go very down into a rabbit hole on the topic of why the types are what they are here, but the simple explanation is that a for ... yield expression:
returns the same type as the first generator -- if you start with val ns: Iterator[Int] and write for (n <- ns) ..., you'll get an Iterator at the end
if you nest generators, they need to be of the same type as the "outermost" one
You can read more on for comprehensions on the Tour of Scala and the Scala Book.
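For example, here is a tiny illustration of those rules (ns, doubled and pairs are just illustrative names):

val ns: Iterator[Int] = Iterator(1, 2, 3)
val doubled: Iterator[Int] = for (n <- ns) yield n * 2 // still an Iterator, not a List

val pairs: Iterator[(Int, Int)] =
  for {
    n <- Iterator(1, 2)        // the outermost generator fixes the result type
    m <- List(10, 20).iterator // inner generators are adapted to it
  } yield (n, m)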
One possible way of consuming this is the following:
for (user <- parsedUsers) {
  user match {
    case Success(user) => println(s"parsed object $user")
    case Failure(error) => println(s"ERROR: '${error.getMessage}'")
  }
}
As for how to turn this into a "one-liner": for comprehensions are syntactic sugar applied by the compiler, which turns every nested generator into a flatMap and the final one into a map, as in the following example (which yields a result equivalent to the for comprehension above and is very close to what the compiler does automatically):
val parsedUsers: Iterator[Try[User]] = Iterator
  .continually(User.randomJsonArray(4))
  .take(10)
  .flatMap(line =>
    line.parseJson
      .convertTo[JsArray]
      .elements
      .map(element => Try(element.convertTo[User]))
  )
One note that I would like to add is that you should be mindful of readability. Some teams prefer for comprehensions, others prefer manually rolling out their own flatMap/map chains. Coder's discretion is advised.
You can play around with this code here on Scastie (and here is the version with the flatMap/map calls).

How can I dynamically (runtime) generate a sorted collection in Scala using the java.lang.reflect.Type?

Given an array of items I need to generate a sorted collection in Scala for a java.lang.reflect.Type but I'm unable to do so. The following snippet might explain better.
def buildList(paramType: Type): SortedSet[_] = {
  val collection = new Array[Any](5)
  for (i <- 0 until 5) {
    collection(i) = new EasyRandom().nextObject(paramType.asInstanceOf[Class[Any]])
  }
  SortedSet(collection: _*)
}
I'm unable to do so, as I get the error "No implicits found for parameter ord: Ordering[Any]". I'm able to work around this if I swap to an unsorted type such as Set:
def buildList(paramType: Type): Set[_] = {
  val collection = new Array[Any](5)
  for (i <- 0 until 5) {
    collection(i) = new EasyRandom().nextObject(paramType.asInstanceOf[Class[Any]])
  }
  Set(collection: _*)
}
How can I dynamically build a sorted set at runtime? I've been looking into how Jackson tries to achieve the same but I couldn't quite follow how to get T here: https://github.com/FasterXML/jackson-module-scala/blob/0e926622ea4e8cef16dd757fa85400a0b9dcd1d3/src/main/scala/com/fasterxml/jackson/module/scala/introspect/OrderingLocator.scala#L21
(Please excuse me if my question is unclear.)
This happens because SortedSet needs a contextual (implicit) Ordering type class instance for its element type A.
However, as Luis said in the comment section, I'd strongly advise you against this approach and suggest using a safer, strongly typed one instead.
Generating random case classes (which I suppose you're using since you're using Scala) should be easy with the help of some libraries like magnolia. That would turn your code into something like this:
import scala.collection.SortedSet
import org.scalacheck.Arbitrary

def randomList[A: Ordering : Arbitrary]: SortedSet[A] = {
  val arb: Arbitrary[A] = implicitly[Arbitrary[A]]
  // sample returns an Option, so drop any failed generations
  val sampleData = (1 to 5).flatMap(_ => arb.arbitrary.sample)
  SortedSet(sampleData: _*)
}
This approach involves some heavy concepts like implicits and type classes, but is way safer.
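As a hedged usage sketch, assuming ScalaCheck's Arbitrary instances are on the classpath, the caller then picks a concrete, ordered element type instead of passing a java.lang.reflect.Type around:

val ints: SortedSet[Int] = randomList[Int]           // uses Ordering[Int] and Arbitrary[Int]
val strings: SortedSet[String] = randomList[String]  // uses Ordering[String] and Arbitrary[String]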

Yield mutable.seq from mutable.traversable type in Scala

I have a variable underlying of type Option[mutable.Traversable[Field]].
All I wanted to do in my class was provide a method to return this as a Seq, in the following way:
def toSeq: scala.collection.mutable.Seq[Field] = {
  for {
    f <- underlying.get
  } yield f
}
This fails because the compiler complains that mutable.Traversable does not conform to mutable.Seq. All it is doing is yielding something of type Field; in my mind this should work?
A possible solution to this is:
def toSeq: Seq[Field] = {
  underlying match {
    case Some(x) => x.toSeq
    case None =>
  }
}
Although I have no idea what is actually happening when x.toSeq is called, and I imagine more memory is being used here than is actually required to accomplish this.
An explanation or suggestion would be much appreciated.
I am confused why you say "I imagine there is more memory being used here than actually required". Scala will not copy your Field values when doing x.toSeq; it will simply create a new Seq which has pointers to the same Field values that underlying points to. Since this new structure is exactly what you want, there is no avoiding the additional memory associated with the extra pointers (but the amount of additional memory should be small). For a more in-depth discussion, see the wiki on persistent data structures.
Regarding your possible solution, it could be slightly modified to get the result you're expecting:
def toSeq: Seq[Field] =
  underlying
    .map(_.toSeq)
    .getOrElse(Seq.empty[Field])
This solution will return an empty Seq if underlying is a None which is safer than your original attempt which uses get. I say it's "safer" because get throws a NoSuchElementException if the Option is a None whereas my toSeq can never fail to return a valid value.
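Here is a minimal, hedged usage sketch; Field is a stand-in for the question's domain type, and it assumes a Scala version where mutable.Traversable is still available (e.g. 2.12):

import scala.collection.mutable

case class Field(name: String)

val underlying: Option[mutable.Traversable[Field]] =
  Some(mutable.ArrayBuffer(Field("a"), Field("b")))

def toSeq: Seq[Field] =
  underlying.map(_.toSeq).getOrElse(Seq.empty[Field])

toSeq // Seq(Field(a), Field(b)); with underlying = None it returns an empty Seq instead of throwing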
Functional Approach
As a side note: when I first started programming in Scala I would write many functions of the form:
def formatSeq(seq : Seq[String]) : Seq[String] =
seq map (_.toUpperCase)
This is less functional because you are expecting a particular collection type, e.g. formatSeq won't work on a Future.
I have found that a better approach is to write:
def formatStr(str : String) = str.toUpperCase
Or my preferred coding style:
val formatStr = (_ : String).toUpperCase
Then the user of your function can apply formatStr in any fashion they want and you don't have to worry about all of the collection casting:
val fut : Future[String] = ???
val formatFut = fut map formatStr
val opt : Option[String] = ???
val formatOpt = opt map formatStr

Creating RDDs and outputting to text files with Scala and Spark

I apologise for what will probably be a simple question, but I'm struggling to get to grips with parsing RDDs in Scala/Spark. I have an RDD created from a CSV, read in with:
val partitions: RDD[(String, String, String, String, String)] = withoutHeader.mapPartitions(lines => {
  val parser = new CSVParser(',')
  lines.map(line => {
    val columns = parser.parseLine(line)
    (columns(0), columns(1), columns(2), columns(3), columns(4))
  })
})
When I output this to a file with
partitions.saveAsTextFile(file)
I get output with parentheses on each line, and I don't want those parentheses. I'm struggling in general to understand what is happening here. My background is in low-level languages, and I find it hard to see through the abstractions to what is actually going on. I understand the mapping, but it's the output that escapes me. Can someone either explain what is going on in the line (columns(0), columns(1), columns(2), columns(3), columns(4)) or point me to a guide that simply explains what is happening?
My ultimate goal is to be able to manipulate files on HDFS in Spark and put them into formats suitable for MLlib. I'm unimpressed with the Spark and Scala guides, as they look like they have been produced from poorly annotated javadocs and don't really explain anything.
Thanks in advance.
Dean
I would just convert your tuple to the string format you want. For example, to create |-delimited output:
partitions.map{ tup => s"${tup._1}|${tup._2}|${tup._3}|${tup._4}|${tup._5}" }
or using pattern matching (which incurs a little more runtime overhead):
partitions.map{ case (a,b,c,d,e) => s"$a|$b|$c|$d|$e" }
I'm using the string interpolation feature of Scala (note the s"..." format).
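To see this end to end, here is a hedged sketch (the output path is illustrative): mapping each tuple to a string first and then saving gives one pipe-delimited line per record, with no tuple parentheses.

import org.apache.spark.rdd.RDD

val asText: RDD[String] =
  partitions.map { case (a, b, c, d, e) => s"$a|$b|$c|$d|$e" }

asText.saveAsTextFile("hdfs:///tmp/output")  // illustrative output path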
Side note, you can simplify your example by just mapping over the RDD as a whole, rather than the individual partitions:
val parser = new CSVParser(',')
val partitions: RDD[(String, String, String, String, String)] =
  withoutHeader.map { line =>
    val columns = parser.parseLine(line)
    (columns(0), columns(1), columns(2), columns(3), columns(4))
  }