The accuracy of LDA predict for new documents with Spark - Scala

I'm working with Spark MLlib and am currently doing something with LDA.
But when I use the code provided by Spark (see below) to predict the topics of a document that was used to train the model, the predicted result (document-topics) is the polar opposite of the document-topics result from training.
I don't know what causes this result.
Any help is appreciated. Here is my code:
train: lda.run(corpus). The corpus is an RDD like this: RDD[(Long, Vector)]. The Vector contains the vocabulary, word indices, and word counts.
predict:
def predict(documents: RDD[(Long, Vector)], ldaModel: LDAModel): Array[(Long, Vector)] = {
  var docTopicsWeight = new Array[(Long, Vector)](documents.collect().length)
  ldaModel match {
    case localModel: LocalLDAModel =>
      docTopicsWeight = localModel.topicDistributions(documents).collect()
    case distModel: DistributedLDAModel =>
      docTopicsWeight = distModel.toLocal.topicDistributions(documents).collect()
  }
  docTopicsWeight
}

I'm not sure whether your question is actually about why you were getting errors in your code, but from what I understand, it seems, first, that you were using the default Scala Vector class instead of org.apache.spark.mllib.linalg.Vector. Secondly, you can't use case on the model directly. You'll need to use the isInstanceOf and asInstanceOf methods for that.
def predict(documents: RDD[(Long, org.apache.spark.mllib.linalg.Vector)], ldaModel: LDAModel): Array[(Long, org.apache.spark.mllib.linalg.Vector)] = {
  var docTopicsWeight = new Array[(Long, org.apache.spark.mllib.linalg.Vector)](documents.collect().length)
  if (ldaModel.isInstanceOf[LocalLDAModel]) {
    docTopicsWeight = ldaModel.asInstanceOf[LocalLDAModel].topicDistributions(documents).collect
  } else if (ldaModel.isInstanceOf[DistributedLDAModel]) {
    docTopicsWeight = ldaModel.asInstanceOf[DistributedLDAModel].toLocal.topicDistributions(documents).collect
  }
  docTopicsWeight
}

Related

Spark custom encoder for dataframe

I know about "How to store custom objects in Dataset?", but it is still not really clear to me how to build a custom encoder that properly serializes to multiple fields. Manually, I created some functions (https://github.com/geoHeil/geoSparkScalaSample/blob/master/src/main/scala/myOrg/GeoSpark.scala#L122-L154) which map a Polygon back and forth between Dataset - RDD - Dataset by mapping the objects to primitive types Spark can handle, i.e. a tuple (String, Int) (edit: full code below).
For example, to go from the Polygon object to a tuple of (String, Int), I use the following:
def writeSerializableWKT(iterator: Iterator[AnyRef]): Iterator[(String, Int)] = {
  val writer = new WKTWriter()
  iterator.flatMap(cur => {
    val cPoly = cur.asInstanceOf[Polygon]
    // TODO is it efficient to create this collection? Is this a proper iterator 2 iterator transformation?
    List((writer.write(cPoly), cPoly.getUserData.asInstanceOf[Int])).iterator
  })
}
def createSpatialRDDFromLinestringDataSet(geoDataset: Dataset[WKTGeometryWithPayload]): RDD[Polygon] = {
  geoDataset.rdd.mapPartitions(iterator => {
    val reader = new WKTReader()
    iterator.flatMap(cur => {
      try {
        reader.read(cur.lineString) match {
          case p: Polygon => {
            val polygon = p.asInstanceOf[Polygon]
            polygon.setUserData(cur.payload)
            List(polygon).iterator
          }
          case _ => throw new NotImplementedError("Multipolygon or others not supported")
        }
      } catch {
        case e: ParseException =>
          logger.error("Could not parse")
          logger.error(e.getCause)
          logger.error(e.getMessage)
          None
      }
    })
  })
}
I noticed that I am already starting to do a lot of work twice (see the link to both methods). Now I want to be able to handle the following (https://github.com/geoHeil/geoSparkScalaSample/blob/master/src/main/scala/myOrg/GeoSpark.scala#L82-L84, full code below):
val joinResult = JoinQuery.SpatialJoinQuery(objectRDD, minimalPolygonCustom, true)
// joinResult.map()
val joinResultCounted = JoinQuery.SpatialJoinQueryCountByKey(objectRDD, minimalPolygonCustom, true)
The results are a PairRDD[Polygon, HashSet[Polygon]] and a PairRDD[Polygon, Int], respectively. How would I need to specify my functions as an Encoder so that I do not have to solve the same problem two more times?

In Apache Spark, how to make an RDD/DataFrame operation lazy?

Assuming that I would like to write a function foo that transforms a DataFrame:
object Foo {
  def foo(source: DataFrame): DataFrame = {
    // ...complex iterative algorithm with a stopping condition...
    ???
  }
}
Since the implementation of foo has many "Actions" (collect, reduce, etc.), calling foo immediately triggers the expensive execution.
This is not a big problem. However, since foo only converts one DataFrame to another, by convention it would be better to allow lazy execution: the implementation of foo should be executed only if the resulting DataFrame or one of its derivatives is used on the Driver (through another "Action").
So far, the only way to reliably achieve this is to write the whole implementation into a SparkPlan and superimpose it onto the DataFrame's SparkExecution. This is very error-prone and involves a lot of boilerplate code. What is the recommended way to do this?

It is not exactly clear to me what you are trying to achieve, but Scala itself provides at least a few tools which you may find useful:
lazy vals:
val rdd = sc.range(0, 10000)
lazy val count = rdd.count // Nothing is executed here
// count: Long = <lazy>
count // count is evaluated only when it is actually used
// Long = 10000
call-by-name (denoted by => in the function definition):
def foo(first: => Long, second: => Long, takeFirst: Boolean): Long =
  if (takeFirst) first else second
val rdd1 = sc.range(0, 10000)
val rdd2 = sc.range(0, 10000)
foo(
  { println("first"); rdd1.count },
  { println("second"); rdd2.count },
  true // Only first will be evaluated
)
// first
// Long = 10000
Note: In practice you should create a local lazy binding to make sure that the arguments are not evaluated on every access.
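The note about local lazy bindings can be illustrated with a small sketch in plain Scala. The names ByNameDemo, twiceNaive, twiceCached and expensive are made up for this illustration; expensive() merely stands in for something costly such as rdd.count.

```scala
object ByNameDemo {
  // Counts how often the by-name argument is actually evaluated.
  var evaluations = 0

  // Hypothetical stand-in for an expensive action such as rdd.count.
  def expensive(): Long = { evaluations += 1; 10000L }

  // `value` is by-name: it is re-evaluated at every reference in the body.
  def twiceNaive(value: => Long): Long = value + value

  // Binding the argument to a local lazy val evaluates it at most once,
  // and only if the binding is actually used.
  def twiceCached(value: => Long): Long = {
    lazy val cached = value
    cached + cached
  }
}
```

Calling ByNameDemo.twiceNaive(ByNameDemo.expensive()) runs expensive() twice, while twiceCached runs it once.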
infinite lazy collections like Stream:
import org.apache.spark.mllib.random.RandomRDDs._
val initial = normalRDD(sc, 1000000L, 10)
// Infinite stream of RDDs and actions and nothing blows :)
val stream: Stream[RDD[Double]] = Stream(initial).append(
  stream.map {
    case rdd if !rdd.isEmpty =>
      val mu = rdd.mean
      rdd.filter(_ > mu)
    case _ => sc.emptyRDD[Double]
  }
)
Some subset of these should be more than enough to implement complex lazy computations.

Distributing a loop to different machines of a cluster in Spark

Here is a for loop that I'm running in my code:
for (x <- 0 to vertexArray.length - 1)
{
  for (y <- 0 to vertexArray.length - 1)
  {
    breakable {
      if (x.equals(y)) {
        break
      }
      else {
        var d1 = vertexArray(x)._2._2
        var d2 = vertexArray(y)._2._2
        val ps = new Period(d1, d2)
        if (ps.getMonths() == 0 && ps.getYears() == 0 && Math.abs(ps.toStandardHours().getHours()) <= 5) {
          edgeArray += Edge(vertexArray(x)._1, vertexArray(y)._1, Math.abs(ps.toStandardHours().getHours()))
        }
      }
    }
  }
}
I want to speed up this code by distributing it across multiple machines in a cluster. I'm using Scala in IntelliJ IDEA with Spark. How would I implement this type of code to work on multiple machines?

As already stated by Mariano Kamp, Spark is probably not a good choice here, and there are much better options out there. On top of that, any approach that has to work on relatively large data and requires O(N^2) time is simply unacceptable. So the first thing you should do is focus on choosing a suitable algorithm, not a platform.
Still, it is possible to translate it to Spark. A naive approach that directly reflects your code would be to use a Cartesian product:
def check(v1: T, v2: T): Option[U] = {
  if (v1 == v2) {
    None
  } else {
    // rest of your logic, Some[U] if all tests passed
    // None otherwise
    ???
  }
}
val vertexRDD = sc.parallelize(vertexArray)
// Pair every vertex with every vertex, then keep only the matches
vertexRDD.cartesian(vertexRDD)
  .map { case (v1, v2) => check(v1, v2) }
  .filter(_.isDefined)
  .map(_.get)
If vertexArray is small, you could use flatMap with a broadcast variable:
val vertexBd = sc.broadcast(vertexArray)
vertexRDD.flatMap(v1 =>
  // .value unwraps the broadcast array on the executor
  vertexBd.value.map(v2 => check(v1, v2)).filter(_.isDefined).map(_.get)
)
Another improvement is to perform a proper join. The obvious condition is year and month:
def toPair(v: T): ((Int, Int), T) = ??? // Return ((year, month), vertex)
val vertexPairs = vertexRDD.map(toPair)
vertexPairs.join(vertexPairs)
  .map { case ((_, _), (v1, v2)) => check(v1, v2) } // Check should be simplified
  .filter(_.isDefined)
  .map(_.get)
Of course, this can be achieved with a broadcast variable as well. You simply have to group vertexArray by (year, month) pair and broadcast Map[(Int, Int), T].
From here you can improve further by avoiding naive pairwise checks: partition the data and traverse each partition sorted by timestamp:
def sortPartitionByDatetime(iter: Iterator[U]): Iterator[U] = ???
def yieldMatching(iter: Iterator[U]): Iterator[V] = {
  // flatmap keeping track of values in open window
  ???
}
vertexPairs
  .partitionBy(new HashPartitioner(n))
  .mapPartitions(sortPartitionByDatetime)
  .mapPartitions(yieldMatching)
or by using a DataFrame with a window function and a range clause.
Note:
All types are simply placeholders. In the future, please try to provide type information. Right now all I can tell is that there are some tuples and dates involved.
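The "open window" traversal behind yieldMatching can be sketched in plain Scala, without Spark. Everything here is an assumption made for illustration: Vertex and Link are hypothetical simplified types, timestamps are epoch milliseconds rather than Joda dates, the 5 hour limit is taken from the question, and each matching pair is emitted once instead of in both directions.

```scala
import scala.collection.mutable

object WindowMatch {
  // Hypothetical simplified vertex: an id plus a timestamp in epoch milliseconds.
  final case class Vertex(id: Long, timestampMs: Long)
  // Hypothetical edge carrying the time gap in whole hours.
  final case class Link(src: Long, dst: Long, hours: Int)

  val MsPerHour: Long = 60L * 60L * 1000L

  // Sorts by timestamp, then keeps a queue of vertices that can still match.
  def yieldMatching(vertices: Seq[Vertex], maxHours: Int = 5): Seq[Link] = {
    val maxGapMs = maxHours * MsPerHour
    val window = mutable.Queue.empty[Vertex]
    val out = mutable.ArrayBuffer.empty[Link]
    for (v <- vertices.sortBy(_.timestampMs)) {
      // Drop vertices that are too old to match v or anything after it.
      while (window.nonEmpty && v.timestampMs - window.head.timestampMs > maxGapMs) {
        window.dequeue()
      }
      // Everything left in the window is within maxHours of v.
      for (w <- window) {
        out += Link(w.id, v.id, ((v.timestampMs - w.timestampMs) / MsPerHour).toInt)
      }
      window.enqueue(v)
    }
    out.toSeq
  }
}
```

After sorting, each vertex is compared only with the vertices still inside the window, so the cost depends on how many vertices fall within 5 hours of each other rather than on N^2. In Spark, a function of this shape would be applied per partition, which only gives complete results if every pair that can match ends up in the same partition.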

Welcome to Stack Overflow. Unfortunately, this is not the right approach ;(
Spark is not a tool to parallelize tasks, but to parallelize data.
So you need to think about how you can distribute/parallelize/partition your data, then compute the individual partitions, and then consolidate the results as a last step.
You also need to read up on Spark in general. A simple answer here cannot get you started. This is just the wrong format.
Start here: http://spark.apache.org/docs/latest/programming-guide.html

Extracting weights from FlinkML Multiple Linear Regression

I am running the example multiple linear regression for Flink (0.10-SNAPSHOT). I can't figure out how to extract the weights (e.g. slope and intercept, beta0-beta1, whatever you want to call them). I'm not very experienced in Scala, which is probably half my problem.
Thanks for any help anyone can give.
object Job {
  def main(args: Array[String]) {
    // set up the execution environment
    val env = ExecutionEnvironment.getExecutionEnvironment
    val survival = env.readCsvFile[(String, String, String, String)]("/home/danger/IdeaProjects/quickstart/docs/haberman.data")
    val survivalLV = survival
      .map{tuple =>
        val list = tuple.productIterator.toList
        val numList = list.map(_.asInstanceOf[String].toDouble)
        LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
      }
    val mlr = MultipleLinearRegression()
      .setStepsize(1.0)
      .setIterations(100)
      .setConvergenceThreshold(0.001)
    mlr.fit(survivalLV)
    println(mlr.toString()) // This doesn't do anything productive...
    println(mlr.weightsOption) // Neither does this.
  }
}

The problem is that you've only constructed the Flink job (DAG) that will calculate the weights, but it has not been executed yet. The easiest way to trigger the execution is to use the collect method, which retrieves the result of the DataSet back to your client.
mlr.fit(survivalLV)
val weights = mlr.weightsOption match {
  case Some(weights) => weights.collect()
  case None => throw new Exception("Could not calculate the weights.")
}
println(weights)

Predicting Probabilities in Logistic Regression Model in Apache Spark MLlib

I am using Apache Spark to build a logistic regression model with the LogisticRegressionWithLBFGS() class provided by MLlib. Once the model is built, we can use the provided predict function, which gives only the binary labels as output. I also want the probabilities to be calculated.
There is an implementation of this in
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
override protected def predictPoint(
    dataMatrix: Vector,
    weightMatrix: Vector,
    intercept: Double) = {
  require(dataMatrix.size == numFeatures)
  // If dataMatrix and weightMatrix have the same dimension, it's binary logistic regression.
  if (numClasses == 2) {
    val margin = dot(weightMatrix, dataMatrix) + intercept
    val score = 1.0 / (1.0 + math.exp(-margin))
    threshold match {
      case Some(t) => if (score > t) 1.0 else 0.0
      case None => score
    }
  }
This method is not exposed, and the probabilities are not available either. How can I use this function to get the probabilities?
The dot method used in the above function is also not exposed; it is present in the BLAS package, but it is not public.

Call myModel.clearThreshold to get the raw prediction instead of the 0/1 labels.
Note that this only works for binary logistic regression (numClasses == 2).
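If clearing the threshold is not an option, the binary score from the predictPoint snippet in the question can also be recomputed by hand. The sketch below is plain Scala and only an illustration: BinaryLogistic, dot and probability are made-up names, and dot is a simple stand-in for the non-public BLAS.dot.

```scala
object BinaryLogistic {
  // Plain dot product; stands in for the non-public BLAS.dot.
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "dimension mismatch")
    var sum = 0.0
    var i = 0
    while (i < a.length) { sum += a(i) * b(i); i += 1 }
    sum
  }

  // P(label = 1 | features) for a binary model: logistic function of the margin.
  def probability(weights: Array[Double], intercept: Double, features: Array[Double]): Double = {
    val margin = dot(weights, features) + intercept
    1.0 / (1.0 + math.exp(-margin))
  }
}
```

With a trained binary model, the inputs would presumably come from model.weights.toArray, model.intercept and features.toArray.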

I encountered a similar problem when trying to obtain the raw predictions for a multiclass problem. For me, the best solution was to create a method by borrowing and customizing code from the Spark MLlib logistic regression source. You can create one like so:
object ClassificationUtility {
  def predictPoint(dataMatrix: Vector, model: LogisticRegressionModel):
      (Double, Array[Double]) = {
    require(dataMatrix.size == model.numFeatures)
    val dataWithBiasSize: Int = model.weights.size / (model.numClasses - 1)
    val weightsArray: Array[Double] = model.weights match {
      case dv: DenseVector => dv.values
      case _ =>
        throw new IllegalArgumentException(
          s"weights only supports dense vector but got type ${model.weights.getClass}.")
    }
    var bestClass = 0
    var maxMargin = 0.0
    val withBias = dataMatrix.size + 1 == dataWithBiasSize
    val classProbabilities: Array[Double] = new Array[Double](model.numClasses)
    (0 until model.numClasses - 1).foreach { i =>
      var margin = 0.0
      dataMatrix.foreachActive { (index, value) =>
        if (value != 0.0) margin += value * weightsArray((i * dataWithBiasSize) + index)
      }
      // Intercept is required to be added into margin.
      if (withBias) {
        margin += weightsArray((i * dataWithBiasSize) + dataMatrix.size)
      }
      if (margin > maxMargin) {
        maxMargin = margin
        bestClass = i + 1
      }
      classProbabilities(i + 1) = 1.0 / (1.0 + Math.exp(-(margin - maxMargin)))
    }
    (bestClass.toDouble, classProbabilities)
  }
}
Note that it is only slightly different from the original method: it just calculates the logistic as a function of the input features. It also defines some vals and vars that are private in the original and declared outside of this method. Ultimately, it stores the scores in an Array and returns it along with the best answer. I call my method like so:
// Compute raw scores on the test set.
val predictionAndLabelsAndProbabilities = test
  .map { case LabeledPoint(label, features) =>
    val (prediction, probabilities) = ClassificationUtility
      .predictPoint(features, model)
    (prediction, label, probabilities)
  }
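As far as I can tell, the scores stored in classProbabilities above are not normalized: entry 0 is never written, and each other entry is a logistic of the margin relative to the running maximum, so they need not sum to 1. If normalized probabilities are wanted, one option is a softmax over the margins. The sketch below is an assumption and not part of the method above: it treats class 0 as a pivot class with margin 0, and PivotSoftmax and probabilities are made-up names.

```scala
object PivotSoftmax {
  // margins(k - 1) is the margin of class k for k = 1 .. K-1;
  // class 0 is assumed to be the pivot class with margin 0.
  def probabilities(margins: Array[Double]): Array[Double] = {
    val all = 0.0 +: margins
    // Subtracting the maximum keeps exp from overflowing.
    val maxMargin = all.max
    val exps = all.map(m => math.exp(m - maxMargin))
    val total = exps.sum
    exps.map(_ / total)
  }
}
```

The returned array has one entry per class, indexed by class label, and sums to 1 up to rounding.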
However:
It seems the Spark contributors are discouraging the use of MLlib in favor of ML. The ML logistic regression API currently does not support multiclass classification. I am now using OneVsRest, which acts as a wrapper for one-vs-all classification. I am working on a similar customization to get the raw scores.

I believe the call is myModel.clearThreshold(); i.e. myModel.clearThreshold without the parentheses fails. See the linear SVM example here.