Spark - LinearRegressionWithSGD on Coursera Machine Learning by Stanford University samples - scala

Software Version: Apache Spark v1.3
Context: I've been trying to "translate" Octave/MATLAB code to Scala on Apache Spark. More precisely, I'm working on ex1data1.txt and ex1data2.txt from the practical part of Coursera's ex1. I've already made this translation to Julia (it went smoothly), but I've been struggling with Spark... without success.
Problem: The performance of my implementation on Spark is very poor; I cannot even say it works correctly. For ex1data1.txt I added a polynomial feature, and I also experimented with theta0: once using setIntercept(true), and once with an extra non-normalized column of 1.0 values (in that case with the intercept set to false). Either way I get only nonsensical results.
So I decided to start working with ex1data2.txt instead. Below you can find the code and the expected result. Of course the Spark result is wrong.
Did you have similar experience? I will be grateful for your help.
The Scala code for ex1data2.txt:
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SquaredL2Updater
import org.apache.spark.mllib.regression.{LinearRegressionModel, LinearRegressionWithSGD, LabeledPoint}
import org.apache.spark.{SparkContext, SparkConf}
object MLibOnEx1data2 extends App {
  val conf = new SparkConf()
  conf.set("spark.app.name", "coursera ex1data2.txt test")

  val sc = new SparkContext(conf)

  val input = sc.textFile("hdfs:///ex1data2.txt")

  val trainData = input.map { line =>
    val parts = line.split(',')
    val y = parts(2).toDouble
    val features = Vectors.dense(parts(0).toDouble, parts(1).toDouble)
    println(s"x = $features y = $y")
    LabeledPoint(y, features)
  }.cache()

  // Building the model
  val numIterations = 1500
  val alpha = 0.01

  // Scale the features
  val scaler = new StandardScaler(withMean = true, withStd = true)
    .fit(trainData.map(x => x.features))
  val scaledTrainData = trainData.map { td =>
    val normFeatures = scaler.transform(td.features)
    println(s"normalized features = $normFeatures")
    LabeledPoint(td.label, normFeatures)
  }.cache()

  val tsize = scaledTrainData.count()
  println(s"Training set size is $tsize")

  val alg = new LinearRegressionWithSGD().setIntercept(true)
  alg.optimizer
    .setNumIterations(numIterations)
    .setStepSize(alpha)
    .setUpdater(new SquaredL2Updater)
    .setRegParam(0.0) // regularization - off

  val model = alg.run(scaledTrainData)

  // Braces are needed here: the original $model.weights interpolated only
  // $model and then appended the literal ".weights" (the artifact is visible
  // in the STDOUT below)
  println(s"Theta is ${model.weights}")

  val total1 = model.predict(scaler.transform(Vectors.dense(1650, 3)))
  println(s"Estimate the price of a 1650 sq-ft, 3 br house = $total1 dollars") // it should give ~ $289314.620338

  // Evaluate model on training examples and compute training error
  val valuesAndPreds = scaledTrainData.map { point =>
    val prediction = model.predict(point.features)
    (point.label, prediction)
  }
  // Coursera's cost function J divides by 2m, hence the extra /2 on the mean
  val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean() / 2
  println("Training Mean Squared Error = " + MSE)

  // Save and load model
  val trySaveAndLoad = util.Try(model.save(sc, "myModelPath"))
    .flatMap { _ => util.Try(LinearRegressionModel.load(sc, "myModelPath")) }
    .getOrElse(-1)

  println(s"trySaveAndLoad result is $trySaveAndLoad")
}
STDOUT result is:
Training set size is 47
Theta is (weights=[52090.291641275864,19342.034885388926],intercept=181295.93717032953).weights
Estimate the price of a 1650 sq-ft, 3 br house = 153983.5541846754 dollars
Training Mean Squared Error = 1.5876093757127676E10
trySaveAndLoad result is -1

Well, after some digging I believe there is nothing wrong here. First I saved the contents of valuesAndPreds to a text file:
valuesAndPreds
  .map { case (x, y) => s"$x,$y" }
  .repartition(1)
  .saveAsTextFile("results.txt")
The rest of the code is written in R.
First let's create a model using the closed-form solution:
# Load data
df <- read.csv('results.txt/ex1data2.txt', header=FALSE)
# Scale features
df[, 1:2] <- apply(df[, 1:2], 2, scale)
# Build linear model
model <- lm(V3 ~ ., df)
For reference:
> summary(model)
Call:
lm(formula = V3 ~ ., data = df)
Residuals:
    Min      1Q  Median      3Q     Max
-130582  -43636  -10829   43698  198147
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   340413       9637  35.323  < 2e-16 ***
V1            110631      11758   9.409 4.22e-12 ***
V2             -6650      11758  -0.566    0.575
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 66070 on 44 degrees of freedom
Multiple R-squared: 0.7329, Adjusted R-squared: 0.7208
F-statistic: 60.38 on 2 and 44 DF, p-value: 2.428e-13
Now prediction:
closedFormPrediction <- predict(model, df)
closedFormRMSE <- sqrt(mean((closedFormPrediction - df$V3)**2))
plot(
closedFormPrediction, df$V3,
ylab="Actual", xlab="Predicted",
main=paste("Closed form, RMSE: ", round(closedFormRMSE, 3)))
Now we can compare the above to the SGD results:
sgd <- read.csv('results.txt/part-00000', header=FALSE)
sgdRMSE <- sqrt(mean((sgd$V2 - sgd$V1)**2))
plot(
sgd$V2, sgd$V1, ylab="Actual",
xlab="Predicted", main=paste("SGD, RMSE: ", round(sgdRMSE, 3)))
Finally let's compare both:
plot(
sgd$V2, closedFormPrediction,
xlab="SGD", ylab="Closed form", main="SGD vs Closed form")
So, the results are clearly not perfect, but nothing seems to be completely off here.
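As a cross-check inside Spark itself (my addition, not part of the original answer): the dataset is only 47 rows, so one can collect it to the driver and solve the normal equation theta = (X^T X)^{-1} X^T y from the Coursera exercise with breeze, then compare against the SGD weights. Names follow the question's code:
import breeze.linalg.{DenseMatrix, DenseVector, inv}

// Collect the scaled training set (47 rows) to the driver
val xs = scaledTrainData.map(_.features.toArray).collect()
val ys = scaledTrainData.map(_.label).collect()

// Prepend a column of ones so theta(0) plays the role of the intercept
val X = DenseMatrix(xs.map(1.0 +: _): _*)
val y = DenseVector(ys)

// Normal equation: theta = (X^T X)^-1 X^T y
val theta = inv(X.t * X) * (X.t * y)
println(s"Closed-form theta = $theta") // compare with model.intercept and model.weights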

Related

infinite centroid for kmeans spark scala

(I think I am almost sure what the answer is.)
Here is my code:
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.ml.linalg.Vectors

val fileName = """file:///home/user/data/csv/sessions_sample.csv"""
val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(fileName)

// calculate input for kmeans
val input1 = df.select("id", "duration", "ip_dist", "txr1", "txr2", "txr3", "txr4").na.fill(3.0)
// (1 until r.size) covers all six feature columns; the original
// (1 until r.size - 1) silently dropped the last one, txr4
val input2 = input1.map(r => (r.getInt(0), Vectors.dense((1 until r.size).map(i => r.getDouble(i)).toArray)))
val input3 = input2.toDF("id", "features")

// initiate kmeans
val kmeans = new KMeans().setK(100).setSeed(1L).setFeaturesCol("features").setPredictionCol("prediction")
val model = kmeans.fit(input3)

// Make predictions
val predictions = model.transform(input3)

val evaluator = new ClusteringEvaluator()
// I get an error when I run this line
val silhouette = evaluator.evaluate(predictions)
java.lang.AssertionError: assertion failed: Number of clusters must be greater than one.
  at scala.Predef$.assert(Predef.scala:170)
  at org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette$.computeSilhouetteScore(ClusteringEvaluator.scala:416)
  at org.apache.spark.ml.evaluation.ClusteringEvaluator.evaluate(ClusteringEvaluator.scala:96)
  ... 49 elided
But my centroids look like this:
model.clusterCenters.foreach(println)
[3217567.1300936914,145.06533614203505,Infinity,Infinity,Infinity]
I think that because some centers are infinite, KMeans becomes unstable, and the silhouette measure goes wrong.
But that still doesn't answer why, for every k I have tried so far (all with k > 1), I get an error saying "Number of clusters must be greater than one".
Please advise.
I once saw the same message. The root cause was that every data point was identical (my data was generated by a program), so of course there was only one cluster. BTW, I did not check its centers, so I am not sure whether my case is the same as yours.
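A follow-up thought covering both the infinities and the single-cluster case: it may be worth sanity-checking the input before fitting. A minimal sketch (my addition, reusing the question's input1 DataFrame; the column layout is assumed from the code above):
// Drop rows whose feature columns contain NaN or Infinity; a single Infinity
// (ip_dist looks suspicious given the centroids above) propagates straight
// into the cluster centers.
val finiteInput = input1.filter { r =>
  (1 until r.size).forall(i => java.lang.Double.isFinite(r.getDouble(i)))
}

// If the remaining rows collapse to one distinct point, there is only one
// real cluster, which would explain the "greater than one" assertion.
println(s"rows = ${finiteInput.count()}, distinct rows = ${finiteInput.dropDuplicates().count()}")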

Apache Spark - Two-sample Kolmogorov-Smirnov Test

I have two sets of data (let's call them d1, d2) in Spark. I would like to perform a two-sample Kolmogorov-Smirnov test, to test whether their underlying population distribution functions are different. Can MLlib's Statistics.kolmogorovSmirnovTest do this?
The documentation provides this example:
import org.apache.spark.mllib.stat.Statistics
val data: RDD[Double] = ... // an RDD of sample data
// perform a KS test using a cumulative distribution function of our making
val myCDF: Double => Double = ...
val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
I tried computing the empirical cumulative distribution function of d2 (collecting it as a Map) and comparing it with d1.
Statistics.kolmogorovSmirnovTest(d1, ecdf_map)
The test runs, but the results are wrong.
Am I doing something wrong? Is it possible to do this? Any ideas?
Thank you for the help!
In Spark MLlib, KolmogorovSmirnovTest is one-sample and two-sided. So if you specifically want the two-sample variant, it's not possible within this library. However, you can still compare datasets by calculating the empirical cumulative distribution function (I found a library to do that, so I'll update this answer if the results are any good) or by using deviations from the normal distribution. In this example I'll go with the latter.
Comparing datasets by KS statistics against the normal distribution
For the purposes of this testing I generated 3 distributions: 2 triangular ones that look similar, and an exponential one to show a big difference in the stats.
Note:
I couldn't find any scientific papers describing this method as viable for distribution comparison, so the idea is mostly empirical.
For every distribution you could most definitely find a mirrored one with the same global maximum distance between its CDF and the normal distribution.
The next step was to get KS results against a normal distribution with the given mean and standard deviation. I visualized them to get a better picture:
As you can see, the results (KS statistic and p-value) for the triangular distributions are close to each other, while the exponential one is way off. As I stated in the note, you could easily fool this method by mirroring a dataset, but for real-world data it could be OK.
import org.apache.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.stat.Statistics
import org.apache.commons.math3.distribution.{ExponentialDistribution, TriangularDistribution}
import breeze.plot._
import breeze.linalg._
import breeze.numerics._

object Main {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SO Spark")
      .setMaster("local[*]")
      .set("spark.driver.host", "localhost")
    val sc = new SparkContext(conf)

    // Create similar distributions
    val triDist1 = new TriangularDistribution(-3, 5, 7)
    val triDist2 = new TriangularDistribution(-3, 7, 7)
    // Exponential distribution to show a big difference
    val expDist1 = new ExponentialDistribution(0.6)

    // Sample data from the distributions and parallelize it
    val n = 100000
    val sampledTriDist1 = sc.parallelize(triDist1.sample(n))
    val sampledTriDist2 = sc.parallelize(triDist2.sample(n))
    val sampledExpDist1 = sc.parallelize(expDist1.sample(n))

    // KS tests against a normal distribution with each sample's mean and stdev
    val resultTriDist1 = Statistics.kolmogorovSmirnovTest(
      sampledTriDist1, "norm", sampledTriDist1.mean, sampledTriDist1.stdev)
    val resultTriDist2 = Statistics.kolmogorovSmirnovTest(
      sampledTriDist2, "norm", sampledTriDist2.mean, sampledTriDist2.stdev)
    val resultExpDist1 = Statistics.kolmogorovSmirnovTest(
      sampledExpDist1, "norm", sampledExpDist1.mean, sampledExpDist1.stdev)

    // Results
    val statsTriDist1 = s"Tri1: ( ${resultTriDist1.statistic}, ${resultTriDist1.pValue} )"
    val statsTriDist2 = s"Tri2: ( ${resultTriDist2.statistic}, ${resultTriDist2.pValue} )"
    val statsExpDist1 = s"Exp1: ( ${resultExpDist1.statistic}, ${resultExpDist1.pValue} )"

    println(statsTriDist1)
    println(statsTriDist2)
    println(statsExpDist1)

    // Visualize
    val graphCanvas = Figure()
    val mainPlot = graphCanvas.subplot(0)
    mainPlot.legend = true
    val x = linspace(1, n, n)
    mainPlot += plot(x, sampledTriDist1.sortBy(x => x).take(n), name = statsTriDist1)
    mainPlot += plot(x, sampledTriDist2.sortBy(x => x).take(n), name = statsTriDist2)
    mainPlot += plot(x, sampledExpDist1.sortBy(x => x).take(n), name = statsExpDist1)
    mainPlot.xlabel = "x"
    mainPlot.ylabel = "sorted sample"
    mainPlot.title = "KS results for 2 Triangular and 1 Exponential Distributions"
    graphCanvas.saveas("ks-sample.png", 300)

    sc.stop()
  }
}
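For completeness, the two-sample statistic mentioned at the top can also be sketched directly. This is my addition, under the assumption that the samples fit on the driver: D is the maximum gap between the two empirical CDFs, evaluated over the pooled sample.
// Two-sample KS statistic D = sup_x |F1(x) - F2(x)| over the pooled points.
def twoSampleKS(a: Array[Double], b: Array[Double]): Double = {
  val sa = a.sorted
  val sb = b.sorted
  // ECDF of a sorted sample: fraction of values <= x, via binary search
  def ecdf(s: Array[Double])(x: Double): Double = {
    var lo = 0
    var hi = s.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (s(mid) <= x) lo = mid + 1 else hi = mid
    }
    lo.toDouble / s.length
  }
  (sa ++ sb).map(x => math.abs(ecdf(sa)(x) - ecdf(sb)(x))).max
}

// Usage with the samples above (collect() pulls them to the driver):
val d = twoSampleKS(sampledTriDist1.collect(), sampledTriDist2.collect())
println(s"Two-sample KS statistic D = $d")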

Mean Squared Error (MSE) returns a huge number

I'm new to Scala and Spark in general. I'm using this code for regression (based on this example from the official Spark site):
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("Year100")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Building the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("training Mean Squared Error = " + MSE)
The dataset that I'm using can be seen here: Pastebin link.
So my question is: why does the MSE come out as 889717.74 (which is a huge number)?
Edit: As the commenters suggested, I tried these:
1) I changed the step size to the default, and the MSE now comes back as NaN.
2) If I try this constructor: LinearRegressionWithSGD.train(parsedData, numIterations, stepSize, intercept=True), the spark-shell returns an error (error: not found: value True).
You've passed a tiny step size and capped the number of iterations at 100. The maximum value by which your parameters can change is 0.00000001 * 100 = 0.000001. Try using the default step size; I imagine that will fix it.
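On the edit's second point: Scala's boolean literal is lowercase (true), and the MLlib train() helpers don't take an intercept argument at all; the intercept is configured on an algorithm instance instead. A sketch of that variant, reusing parsedData from the question:
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Configure the intercept on the instance; train() has no intercept parameter.
val alg = new LinearRegressionWithSGD().setIntercept(true)
alg.optimizer
  .setNumIterations(100)
  .setStepSize(1.0) // GradientDescent's default step size
val model = alg.run(parsedData)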

Spark MLlib ALS input and output domains

I'm using Spark MLlib ALS and trying to use the trainImplicit() interface to feed it the quantity of an item purchased by a user as the implicit preference. I don't know how to validate my model, though. My input is in the domain [1, inf), but the output predictions seem to be floats in (0, 1).
The usual kind of code:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
from pyspark.sql import HiveContext
from pyspark import SparkContext

sc = SparkContext(appName="Quantity Prediction Model")
hive = HiveContext(sc)

d = hive.sql("select o.user_id as user, l.product_id as product, sum(l.quantity) as qty from order_line l join order_order o ON l.order_id = o.id group by o.user_id, l.product_id")
d.write.save('user_product_qty')

ratings = d.rdd.map(tuple)
testdata = ratings.map(lambda t: (t[0], t[1]))

for rank in (4, 8, 12):
    model = ALS.trainImplicit(ratings, rank, 10, alpha=0.01)
    predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
    # Error is pretty bad because output ratings aren't in the same domain as quantity
    ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
    MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
    print("Rank: {} MSE: {}".format(rank, MSE))
Extra credit: when using train(), what is the input/output domain? Are "ratings" somehow expected to be on a five-point scale? This isn't documented anywhere.
RMSE is not an ideal metric for implicit ALS (the original paper suggests a more elaborate evaluation technique). However, you can still apply RMSE to evaluate implicit ALS training results if you map your input ratings to (-1;1) as well.
See https://github.com/apache/spark/pull/597 for details.
Finally, some code to get you started (inspired by the MovieLens example for ALS from Spark):
// Assumed imports for this snippet (logger, alsModel and testingSet come
// from the surrounding application)
import scala.math.sqrt
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{Rating => MLlibRating}

// RMSE
logger.info(s"Calculating RMSE on ${testingSet.count()} ratings")

def groupRatings(rs: RDD[MLlibRating]): RDD[((Int, Int), Double)] =
  rs.map { r => ((r.user, r.product), r.rating) }

// When using implicit ALS we should treat actual and predicted
// ratings as confidence levels. See also apache/spark#597.
//
// Predicted ratings are clamped to [0;1]
def clampPredicted(r: Double): Double =
  math.max(math.min(r, 1.0), 0.0)

def clampActual(r: Double): Double = if (r > 0.0) 1.0 else 0.0

def sqr(x: Double): Double = x * x

val ratingSquaredErrors =
  groupRatings(alsModel.predict(testingSet.map { r => (r.user, r.product) }))
    .join(groupRatings(testingSet))
    .map { case (_, (predictedRating, actualRating)) =>
      sqr(clampPredicted(predictedRating) - clampActual(actualRating)) }
val rmse = sqrt(ratingSquaredErrors.mean())
logger.info(s"RMSE: ${rmse}")

Retrieving not only top one predictions from Multiclass Regression with Spark [duplicate]

I'm running a Bernoulli Naive Bayes using this code:
val splits = MyData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")
My question is: how can I get the probability of membership in class 0 (or 1), and compute the AUC? I want to get results similar to LogisticRegressionWithSGD or SVMWithSGD, where I was using this code:
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
model.clearThreshold()

// Compute raw scores on the test set.
val labelAndPreds = test.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(labelAndPreds)
val auROC = metrics.areaUnderROC()
Unfortunately this code isn't working for NaiveBayes.
Concerning the probabilities for Bernoulli Naive Bayes, here is an example:
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Building dummy data
val data = sc.parallelize(List("0,1 0 0", "1,0 1 0", "1,0 0 1", "0,1 0 1", "1,1 1 0"))

// Transforming dummy data into LabeledPoint
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}

// Prepare data for training
val splits = parsedData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")

// labels
val labels = model.labels

// Probabilities for all feature vectors
val features = parsedData.map(lp => lp.features)
model.predictProbabilities(features).take(10) foreach println

// For one specific vector, I'm taking the first vector in parsedData
val testVector = parsedData.first.features
println(s"For vector ${testVector} => probability : ${model.predictProbabilities(testVector)}")
As for the AUC:
// Compute raw scores on the test set.
val labelAndPreds = test.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(labelAndPreds)
val auROC = metrics.areaUnderROC()
Concerning the inquiry from the chat:
val results = parsedData.map { lp =>
  val probs: Vector = model.predictProbabilities(lp.features)
  for (i <- 0 until probs.size) yield (lp.label, labels(i), probs(i))
}.flatMap(identity)
results.take(10).foreach(println)
// (0.0,0.0,0.59728640251696)
// (0.0,1.0,0.40271359748304003)
// (1.0,0.0,0.2546873180388961)
// (1.0,1.0,0.745312681961104)
// (1.0,0.0,0.47086939671877026)
// (1.0,1.0,0.5291306032812298)
// (0.0,0.0,0.6496075621805428)
// (0.0,1.0,0.3503924378194571)
// (1.0,0.0,0.4158585282373076)
// (1.0,1.0,0.5841414717626924)
And if you are only interested in the argmax classes:
val results = training.map { lp =>
  val probs: Vector = model.predictProbabilities(lp.features)
  val bestClass = probs.argmax
  (labels(bestClass), probs(bestClass))
}
results.take(10) foreach println
// (0.0,0.59728640251696)
// (1.0,0.745312681961104)
// (1.0,0.5291306032812298)
// (0.0,0.6496075621805428)
// (1.0,0.5841414717626924)
Note: Works with Spark 1.5+
EDIT: (for PySpark users)
It seems like some people are having trouble getting probabilities using PySpark and mllib. Well, that's normal: spark-mllib doesn't expose that function for PySpark.
Thus you'll need to use the spark-ml DataFrame-based API:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import NaiveBayes
df = spark.createDataFrame([
    Row(label=0.0, features=Vectors.dense([0.0, 0.0])),
    Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
    Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])
nb = NaiveBayes(smoothing=1.0, modelType="bernoulli")
model = nb.fit(df)
model.transform(df).show(truncate=False)
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
# |features |label|rawPrediction |probability |prediction|
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
# |[0.0,0.0]|0.0 |[-1.4916548767777167,-2.420368128650429] |[0.7168141592920354,0.28318584070796465]|0.0 |
# |[0.0,1.0]|0.0 |[-1.4916548767777167,-3.1135153092103742]|[0.8350515463917526,0.16494845360824742]|0.0 |
# |[1.0,0.0]|1.0 |[-2.5902671654458262,-1.7272209480904837]|[0.29670329670329676,0.7032967032967034]|1.0 |
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
You'll just need to select your prediction column and compute your AUC.
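For that last step, a minimal sketch with the DataFrame-based evaluator (shown in Scala like the rest of this page; PySpark's BinaryClassificationEvaluator mirrors it, and df/model here are assumed analogues of the ones built above):
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// rawPrediction is the per-class score column produced by the model;
// areaUnderROC is the evaluator's default metric.
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")
val auc = evaluator.evaluate(model.transform(df))
println(s"AUC = $auc")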
For more information about Naive Bayes in spark-ml, please refer to the official documentation here.