Predicting Probabilities in a Logistic Regression Model in Apache Spark MLlib - Scala

I am working with Apache Spark to build a logistic regression model using the LogisticRegressionWithLBFGS() class provided by MLlib. Once the model is built, we can use the predict function it provides, which gives only the binary labels as output. I would also like the probabilities to be calculated for the same.
There is an implementation for this in
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
override protected def predictPoint(
    dataMatrix: Vector,
    weightMatrix: Vector,
    intercept: Double) = {
  require(dataMatrix.size == numFeatures)

  // If dataMatrix and weightMatrix have the same dimension, it's binary logistic regression.
  if (numClasses == 2) {
    val margin = dot(weightMatrix, dataMatrix) + intercept
    val score = 1.0 / (1.0 + math.exp(-margin))
    threshold match {
      case Some(t) => if (score > t) 1.0 else 0.0
      case None => score
    }
  }
This method is not exposed, and the probabilities are not available either. How can I use this function to get the probabilities?
The dot method used in the above function is also not exposed; it is present in the BLAS package, but it is not public.

Call myModel.clearThreshold to get the raw prediction instead of the 0/1 labels.
Mind that this only works for binary logistic regression (numClasses == 2).
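For context, here is a minimal sketch of how that looks end to end; the training data and its path are placeholders, not from the question:
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Placeholder training data; any RDD[LabeledPoint] works here.
val training = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)

// With the threshold cleared, predict returns the raw score
// (the probability of class 1) instead of a 0/1 label.
model.clearThreshold()

val scoresAndLabels = training.map { case LabeledPoint(label, features) =>
  (model.predict(features), label)
}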

I encountered a similar problem when trying to obtain the raw predictions for a multiclass problem. For me, the best solution was to create a method by borrowing and customizing from the Spark MLlib logistic regression source. You can create a helper object like so:
object ClassificationUtility {
  def predictPoint(dataMatrix: Vector, model: LogisticRegressionModel): (Double, Array[Double]) = {
    require(dataMatrix.size == model.numFeatures)
    val dataWithBiasSize: Int = model.weights.size / (model.numClasses - 1)
    val weightsArray: Array[Double] = model.weights match {
      case dv: DenseVector => dv.values
      case _ =>
        throw new IllegalArgumentException(
          s"weights only supports dense vector but got type ${model.weights.getClass}.")
    }
    var bestClass = 0
    var maxMargin = 0.0
    val withBias = dataMatrix.size + 1 == dataWithBiasSize
    val classProbabilities: Array[Double] = new Array[Double](model.numClasses)
    (0 until model.numClasses - 1).foreach { i =>
      var margin = 0.0
      dataMatrix.foreachActive { (index, value) =>
        if (value != 0.0) margin += value * weightsArray((i * dataWithBiasSize) + index)
      }
      // Intercept is required to be added into margin.
      if (withBias) {
        margin += weightsArray((i * dataWithBiasSize) + dataMatrix.size)
      }
      if (margin > maxMargin) {
        maxMargin = margin
        bestClass = i + 1
      }
      classProbabilities(i + 1) = 1.0 / (1.0 + Math.exp(-(margin - maxMargin)))
    }
    (bestClass.toDouble, classProbabilities)
  }
}
Note that it is only slightly different from the original method: it just calculates the logistic as a function of the input features. It also defines some vals and vars that are originally private or declared outside of this method. Ultimately, it collects the scores in an Array and returns it along with the best answer. I call my method like so:
// Compute raw scores on the test set.
val predictionAndLabelsAndProbabilities = test
  .map { case LabeledPoint(label, features) =>
    val (prediction, probabilities) = ClassificationUtility
      .predictPoint(features, model)
    (prediction, label, probabilities)
  }
However:
It seems the Spark contributors are discouraging the use of MLlib in favor of ML. The ML logistic regression API currently does not support multiclass classification. I am now using OneVsRest, which acts as a wrapper for one-vs-all classification. I am working on a similar customization to get the raw scores.
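For reference, a minimal sketch of the spark.ml OneVsRest wrapper; train and test are assumed to be DataFrames with "label" and "features" columns:
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

// Binary base classifier; OneVsRest trains one copy per class.
val classifier = new LogisticRegression().setMaxIter(100)
val ovr = new OneVsRest().setClassifier(classifier)

val ovrModel = ovr.fit(train)
val predictions = ovrModel.transform(test) // adds a "prediction" column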

I believe the call is myModel.clearThreshold(); i.e. myModel.clearThreshold without the parentheses fails. See the linear SVM example here.

TapeEquilibrium ScalaCheck

I have been trying to write a ScalaCheck property to verify the Codility TapeEquilibrium problem. For those who do not know the problem, see the following link: https://app.codility.com/programmers/lessons/3-time_complexity/tape_equilibrium/.
I wrote the following, still incomplete, code:
test("Lesson 3 property"){
val left = Gen.choose(-1000, 1000).sample.get
val right = Gen.choose(-1000, 1000).sample.get
val expectedSum = Math.abs(left - right)
val leftArray = Gen.listOfN(???, left) retryUntil (_.sum == left)
val rightArray = Gen.listOfN(???, right) retryUntil (_.sum == right)
val property = forAll(leftArray, rightArray){ (r: List[Int], l: List[Int]) =>
val array = (r ++ l).toArray
Lesson3.solution3(array) == expectedSum
}
property.check()
}
The idea is as follows. I choose two random numbers (values left and right) and calculate their absolute difference. Then, my idea is to generate two arrays. Each array will contain random numbers whose sum will be either "left" or "right". Then, by concatenating these arrays, I should be able to verify this property.
My issue is then generating the leftArray and rightArray. This is itself a complex problem and I would have to code a solution for it. Therefore, writing this property seems over-complicated.
Is there any way to code this? Is coding this property overkill?
Best.
My issue is then generating the leftArray and rightArray
One way to generate these arrays (or lists) is to provide a generator of non-empty lists whose elements sum to a given number, in other words, something defined by a method like this:
import org.scalacheck.{Gen, Properties}
import org.scalacheck.Prop.forAll
def listOfSumGen(expectedSum: Int): Gen[List[Int]] = ???
That verifies the property:
forAll(Gen.choose(-1000, 1000)) { sum: Int =>
  forAll(listOfSumGen(sum)) { listOfSum: List[Int] =>
    (listOfSum.sum == sum) && listOfSum.nonEmpty
  }
}
Building such a list only poses a constraint on one element of the list, so basically here is a way to generate one:
Generate a list.
The extra, constrained element will be given by expectedSum minus the sum of that list.
Insert the constrained element at a random index of the list (because obviously any permutation of the list would work).
So we get:
def listOfSumGen(expectedSum: Int): Gen[List[Int]] =
  for {
    list <- Gen.listOf(Gen.choose(-1000, 1000))
    constrainedElement = expectedSum - list.sum
    index <- Gen.oneOf(0 to list.length)
  } yield list.patch(index, List(constrainedElement), 0)
Now, with the above generator, leftArray and rightArray could be defined as follows:
val leftArray = listOfSumGen(left)
val rightArray = listOfSumGen(right)
However, I think that the overall approach of the described property is incorrect, as it builds an array where one specific partition of the array yields the expectedSum, but this doesn't ensure that no other partition of the array produces a smaller difference.
Here is a counter-example run-through:
val left = Gen.choose(-1000, 1000).sample.get // --> 4
val right = Gen.choose(-1000, 1000).sample.get // --> 9
val expectedSum = Math.abs(left - right) // --> |4 - 9| = 5
val leftArray = listOfSumGen(left) // Let's assume one of its sample would provide List(3,1) (whose sum equals 4)
val rightArray = listOfSumGen(right)// Let's assume one of its sample would provide List(2,4,3) (whose sum equals 9)
val property = forAll(leftArray, rightArray){ (l: List[Int], r: List[Int]) =>
// l = List(3,1)
// r = List(2,4,3)
val array = (l ++ r).toArray // --> Array(3,1,2,4,3) which is the array from the given example in the exercise
Lesson3.solution3(array) == expectedSum
// According to the example Lesson3.solution3(array) equals 1 which is different from 5
}
Here is an example of a correct property that essentially applies the definition:
def tapeDifference(index: Int, array: Array[Int]): Int = {
  val (left, right) = array.splitAt(index)
  Math.abs(left.sum - right.sum)
}

forAll(Gen.nonEmptyListOf(Gen.choose(-1000, 1000))) { list: List[Int] =>
  val array = list.toArray
  forAll(Gen.oneOf(array.indices)) { index =>
    Lesson3.solution3(array) <= tapeDifference(index, array)
  }
}
This property definition might collide with the way the actual solution has been implemented (which is one of the potential pitfalls of ScalaCheck); however, that would be a slow / inefficient solution, hence this would be more a way to check an optimized and fast implementation against a slow and correct one (see this presentation).
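For instance, a minimal sketch of such a check, reusing tapeDifference above as the slow but obviously correct reference and treating Lesson3.solution3 as the optimized implementation under test:
forAll(Gen.nonEmptyListOf(Gen.choose(-1000, 1000)) suchThat (_.size >= 2)) { list: List[Int] =>
  val array = list.toArray
  // Slow reference: try every split point and take the minimum difference.
  val slowMinimum = (1 until array.length).map(i => tapeDifference(i, array)).min
  Lesson3.solution3(array) == slowMinimum
}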
Try this with C#:
using System;
using System.Collections.Generic;
using System.Linq;

private static int TapeEquilibrium(int[] A)
{
    var sumA = A.Sum();
    var size = A.Length;
    var take = 0;
    var res = new List<int>();
    for (int i = 1; i < size; i++)
    {
        take = take + A[i - 1];
        var resp = Math.Abs((sumA - take) - take);
        res.Add(resp);
        if (resp == 0) return resp;
    }
    return res.Min();
}

Is there any way to replace a nested for loop with higher-order methods in Scala

I have a MutableList and want to take the sum of all of its rows and replace its rows with some other values based on some criteria. The code below is working fine for me, but I want to ask if there is any way to get rid of the nested for loops, as for loops slow down the performance. I want to use Scala higher-order methods instead of the nested for loop. I tried the foldLeft() higher-order method to replace a single for loop, but I could not make it work for the nested for loop.
def func(nVect: Int, nDim: Int): Unit = {
  var Vector = MutableList.fill(nVect, nDimn)(math.random)
  var V1Res = 0.0
  var V2Res = 0.0
  var V3Res = 0.0
  for (i <- 0 to nVect - 1) {
    for (j <- i + 1 to nVect - 1) {
      var resultant = Vector(i).zip(Vector(j)).map { case (x, y) => x + y }
      V1Res = choice(Vector(i))
      V2Res = choice(Vector(j))
      V3Res = choice(resultant)
      if (V3Res > V1Res) {
        Vector(i) = res
      }
      if (V3Res > V2Res) {
        Vector(j) = res
      }
    }
  }
}
There are no "for loops" in this code; the for statements are already converted to foreach calls by the compiler, so it is already using higher-order methods. These foreach calls could be written out explicitly, but it would make no difference to the performance.
Making the code compile and then cleaning it up gives this:
def func(nVect: Int, nDim: Int): Unit = {
  val vector = Array.fill(nVect, nDim)(math.random)

  for {
    i <- 0 until nVect
    j <- i + 1 until nVect
  } {
    val res = vector(i).zip(vector(j)).map { case (x, y) => x + y }
    val v1Res = choice(vector(i))
    val v2Res = choice(vector(j))
    val v3Res = choice(res)

    if (v3Res > v1Res) {
      vector(i) = res
    }
    if (v3Res > v2Res) {
      vector(j) = res
    }
  }
}
Note that using a single for does not make any difference to the result, it just looks better!
At this point it gets difficult to make further improvements. The only parallelism possible is with the inner map call, but vectorising this is almost certainly a better option. If choice is expensive then the results could be cached, but this cache needs to be updated when vector is updated.
If the choice could be done in a second pass after all the cross-sums have been calculated then it would be much more parallelisable, but clearly that would also change the results.
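If caching is worthwhile, here is a minimal sketch of that idea, assuming choice: Array[Double] => Double as in the original code:
def funcCached(nVect: Int, nDim: Int): Unit = {
  val vector = Array.fill(nVect, nDim)(math.random)
  // Cache choice(vector(k)) for every row; refresh an entry only when that row is replaced.
  val cached = vector.map(choice)

  for {
    i <- 0 until nVect
    j <- i + 1 until nVect
  } {
    val res = vector(i).zip(vector(j)).map { case (x, y) => x + y }
    val v3Res = choice(res) // still one choice call per pair

    if (v3Res > cached(i)) {
      vector(i) = res
      cached(i) = v3Res
    }
    if (v3Res > cached(j)) {
      vector(j) = res
      cached(j) = v3Res
    }
  }
}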

The accuracy of LDA predict for new documents with Spark

I'm working with Spark's MLlib, and I'm currently doing something with LDA.
But when I use the code provided by Spark (see below) to predict a document that was used in training the model, the predicted result (document-topics) is at opposite poles from the document-topics obtained during training.
I don't know what caused this result.
Asking for help, here is my code below:
train: lda.run(corpus), where corpus is an RDD like this: RDD[(Long, Vector)]; the Vector contains the vocabulary, the indices of the words, and the word counts.
predict:
def predict(documents: RDD[(Long, Vector)], ldaModel: LDAModel): Array[(Long, Vector)] = {
  var docTopicsWeight = new Array[(Long, Vector)](documents.collect().length)
  ldaModel match {
    case localModel: LocalLDAModel =>
      docTopicsWeight = localModel.topicDistributions(documents).collect()
    case distModel: DistributedLDAModel =>
      docTopicsWeight = distModel.toLocal.topicDistributions(documents).collect()
  }
  docTopicsWeight
}
I'm not sure if your question actually concerns why you were getting errors in your code, but from what I understand, it seems first that you were using the default Vector class. Secondly, you can't use a case-class match on the model directly. You'll need to use the isInstanceOf and asInstanceOf methods for that.
def predict(documents: RDD[(Long, org.apache.spark.mllib.linalg.Vector)],
            ldaModel: LDAModel): Array[(Long, org.apache.spark.mllib.linalg.Vector)] = {
  var docTopicsWeight = new Array[(Long, org.apache.spark.mllib.linalg.Vector)](documents.collect().length)
  if (ldaModel.isInstanceOf[LocalLDAModel]) {
    docTopicsWeight = ldaModel.asInstanceOf[LocalLDAModel].topicDistributions(documents).collect
  } else if (ldaModel.isInstanceOf[DistributedLDAModel]) {
    docTopicsWeight = ldaModel.asInstanceOf[DistributedLDAModel].toLocal.topicDistributions(documents).collect
  }
  docTopicsWeight
}
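Usage would then look something like this; corpus and ldaModel are assumed to be the training RDD and the trained model from the question:
val docTopics = predict(corpus, ldaModel)
docTopics.take(5).foreach { case (docId, topicDist) =>
  println(s"doc $docId -> $topicDist")
}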

MathContexts in BigDecimals - ScalaCheck generator creates BigDecimals which can't be serialized then deserialized. How to use MathContexts correctly?

I discovered an issue in ScalaCheck whereby arbitrary[BigDecimal] would generate BigDecimals that could not be converted to Strings and then back into BigDecimals, and I'm trying to work with the creator to find a fix for it, but I'm unsure of how the MathContexts come into play.
The original generator looks like this:
/** Arbitrary BigDecimal */
implicit lazy val arbBigDecimal: Arbitrary[BigDecimal] = {
  import java.math.MathContext._
  val mcGen = oneOf(UNLIMITED, DECIMAL32, DECIMAL64, DECIMAL128)
  val bdGen = for {
    x <- arbBigInt.arbitrary
    mc <- mcGen
    limit <- const(if (mc == UNLIMITED) 0 else math.max(x.abs.toString.length - mc.getPrecision, 0))
    scale <- Gen.chooseNum(Int.MinValue + limit, Int.MaxValue)
  } yield {
    try {
      BigDecimal(x, scale, mc)
    } catch {
      case ae: java.lang.ArithmeticException => BigDecimal(x, scale, UNLIMITED) // Handle the case where scale/precision conflict
    }
  }
  Arbitrary(bdGen)
}
The problem lies in the fact that the BigDecimal constructor used here inverts the sign of the scale argument, thereby turning Int.MinValue into an exponent bigger than Int.MaxValue (2^31 - 1).
scala> val orig = BigDecimal(BigInt("-28334198897217871282176"), -2147483640, UNLIMITED)
orig: scala.math.BigDecimal = -2.8334198897217871282176E+2147483662
scala> BigDecimal(orig.toString)
java.lang.NumberFormatException
at java.math.BigDecimal.<init>(BigDecimal.java:554)
at java.math.BigDecimal.<init>(BigDecimal.java:383)
at java.math.BigDecimal.<init>(BigDecimal.java:806)
at scala.math.BigDecimal$.exact(BigDecimal.scala:125)
at scala.math.BigDecimal$.apply(BigDecimal.scala:283)
... 33 elided
The core of the fix is to increase the lower bound by the number of digits in the unscaledVal, but I have only thought of a way to do it with MathContext.UNLIMITED. I fear we miss out on generator robustness if we do that:
lazy val genBigDecimal: Gen[BigDecimal] = for {
  unscaledVal <- arbitrary[BigInt]
  scale <- Gen.chooseNum(Int.MinValue + unscaledVal.abs.toString.length, Int.MaxValue)
} yield BigDecimal(unscaledVal, scale)
So, if we want to keep using the other MathContexts, what do we have to do to ensure we use them correctly?
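For reference, the round-trip requirement I'm trying to satisfy is essentially this property (a minimal sketch using the generator above):
import org.scalacheck.Prop.forAll

val roundTrip = forAll(genBigDecimal) { bd: BigDecimal =>
  BigDecimal(bd.toString) == bd
}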

variable parameters in Scala constructor

I would like to write a Matrix class in Scala from which I can instantiate objects like this:
val m1 = new Matrix( (1.,2.,3.),(4.,5.,6.),(7.,8.,9.) )
val m2 = new Matrix( (1.,2.,3.),(4.,5.,6.) )
val m3 = new Matrix( (1.,2.),(3.,4.) )
val m4 = new Matrix( (1.,2.),(3.,4.),(5.,6.) )
I have tried this:
class Matrix(a: (List[Double])* ) { }
but then I get a type mismatch because the matrix rows are not of type List[Double].
Further, it would be nice to just have to type integers (1,2,3) instead of (1.,2.,3.) but still get a Double matrix.
How to solve this?
Thanks!
Malte
(1.0, 2.0) is a Tuple2[Double, Double] not a List[Double]. Similarly (1.0, 2.0, 3.0) is a Tuple3[Double, Double, Double].
If you need to handle a fixed cardinality, the simplest solution in plain vanilla Scala would be to have
class Matrix2(rows: Tuple2[Double, Double]*)
class Matrix3(rows: Tuple3[Double, Double, Double]*)
and so on.
Since there exists an implicit conversion from Int to Double, you can pass a tuple of Ints and it will be automatically converted.
new Matrix2((1, 2), (3, 4))
If you instead need to abstract over the row cardinality, enforcing an NxM shape using types, you would have to resort to some more complex solutions, perhaps using the shapeless library.
Or you can use an actual list, but then you cannot restrict the cardinality, i.e. you cannot ensure that all rows have the same length (again, in vanilla Scala; shapeless can help):
class Matrix(rows: List[Double]*)
new Matrix(List(1, 2), List(3, 4))
Finally, the 1. literal syntax is deprecated since Scala 2.10 and removed in Scala 2.11. Use 1.0 instead.
If you need support for very large matrices, consider using an existing implementation like Breeze. Breeze has a DenseMatrix which probably meets your requirements. For performance reasons, Breeze offloads more complex operations into native code.
Getting matrix algebra right is a difficult exercise, and unless you are specifically implementing a matrix class as a learning exercise or assignment, it is better to go with proven libraries.
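For example, a minimal sketch with Breeze (assuming the breeze-linalg dependency is on the classpath):
import breeze.linalg.DenseMatrix

// Rows are given as tuples of Doubles; this builds a 2x3 matrix.
val m = DenseMatrix((1.0, 2.0, 3.0), (4.0, 5.0, 6.0))
println(m)
println(m(1, 2)) // element at row 1, column 2 -> 6.0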
Edited based on comment below:
You can consider the following design.
class Row(n : Int*)
class Matrix(rows: Row*) {...}
Usage:
val m = new Matrix(Row(1, 2, 3), Row(2,3,4))
You need to validate that all Rows are of the same length and reject the input if they are not.
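A minimal sketch of that validation, making Row a case class so that Row(1, 2, 3) works without new:
case class Row(values: Int*)

class Matrix(rows: Row*) {
  require(rows.nonEmpty, "matrix must have at least one row")
  require(rows.map(_.values.length).distinct.size == 1,
    "all rows must have the same length")
}

val ok = new Matrix(Row(1, 2, 3), Row(2, 3, 4))
// new Matrix(Row(1, 2), Row(1, 2, 3)) would throw an IllegalArgumentException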
I have hacked together a solution in - I think - a bit of an un-Scala-ish way:
class Matrix(values: Object*) { // values is of type WrappedArray[Object]
  var arr: Array[Double] = null
  val rows: Integer = values.size
  var cols: Integer = 0
  var _arrIndex = 0

  for (row <- values) {
    // if Tuple (Tuple extends Product)
    if (row.isInstanceOf[Product]) {
      var colcount = row.asInstanceOf[Product].productIterator.length
      assert(colcount > 0, "empty row")
      assert(cols == 0 || colcount == cols, "varying number of columns")
      cols = colcount
      if (arr == null) {
        arr = new Array[Double](rows * cols)
      }
      for (el <- row.asInstanceOf[Product].productIterator) {
        var x: Double = 0.0
        if (el.isInstanceOf[Integer]) {
          x = el.asInstanceOf[Integer].toDouble
        } else {
          assert(el.isInstanceOf[Double], "unknown element type")
          x = el.asInstanceOf[Double]
        }
        arr(_arrIndex) = x
        _arrIndex = _arrIndex + 1
      }
    }
  }
}
It works like this:
object ScalaMatrix extends App {
  val m1 = new Matrix((1.0, 2.0, 3.0), (5, 4, 5))
  val m2 = new Matrix((9, 8), (7, 6))
  println(m1.toString())
  println(m2.toString())
}
I don't really like it. What do you think about it?