I'm new to Scala and Spark in general. I'm using this code for regression (based on the example on the official Spark site):
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("Year100")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
// Building the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations,stepSize )
// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
println("training Mean Squared Error = " + MSE)
The dataset that I'm using can be seen here: Pastebin link.
So my question is: why is the MSE 889717.74 (which is a huge number)?
Edit: As the commenters suggested, I tried these:
1) I changed the step size to the default, and the MSE is now NaN.
2) If I try this call:
LinearRegressionWithSGD.train(parsedData, numIterations, stepSize, intercept=True)
the spark-shell returns an error (error: not found: value True).

You've passed a tiny step size and capped the number of iterations at 100. Per unit of gradient, the most your parameters can move over the whole run is about 0.00000001 * 100 = 0.000001. Try using the default step size; I imagine that will fix it.
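The effect of the step size can be sketched in plain Scala without Spark. This is only an illustration under stated assumptions: the toy data, the helper name `fit`, and the constant step size are all hypothetical and are not MLlib's implementation. It shows a tiny step leaving the weight almost at its starting value, a moderate step converging, and an oversized step diverging, which is one plausible way to end up with very large or NaN errors on unscaled features.

```scala
// Plain-Scala sketch (no Spark): gradient descent for y = w * x on toy data.
val xs = Array(1.0, 2.0, 3.0, 4.0)
val ys = xs.map(_ * 2.0) // the true weight is 2.0

// Hypothetical helper: runs a fixed number of constant-step updates from w = 0.
def fit(stepSize: Double, numIterations: Int): Double = {
  var w = 0.0
  for (_ <- 1 to numIterations) {
    // Gradient of the mean squared error with respect to w.
    val grad = xs.zip(ys).map { case (x, y) => 2.0 * (w * x - y) * x }.sum / xs.length
    w -= stepSize * grad
  }
  w
}

val tiny = fit(1e-8, 100)   // barely moves away from 0.0
val ok = fit(0.05, 100)     // converges close to 2.0
val tooBig = fit(1.0, 100)  // overshoots on every step and blows up
```

Printing `tiny`, `ok`, and `tooBig` side by side makes the three regimes easy to compare before tuning the real model.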


infinite centroid for kmeans spark scala

(I think I am almost sure what the answer is.)
Here is my code:
val fileName = """file:///home/user/data/csv/sessions_sample.csv"""
val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(fileName)
// calculate input for kmeans
val input1 = df.select("id", "duration", "ip_dist", "txr1", "txr2", "txr3", "txr4").na.fill(3.0)
val input2 = input1.map(r => (r.getInt(0), Vectors.dense((1 until r.size - 1).map{ i => r.getDouble(i)}.toArray[Double])))
val input3 = input2.toDF("id", "features")
// initiate kmeans
val kmeans = new KMeans().setK(100).setSeed(1L).setFeaturesCol("features").setPredictionCol("prediction")
val model = kmeans.fit(input3)
// val model = kmeans.fit(input3.select("features"))
// Make predictions
// val predictions = model.transform(input3.select("features"))
val predictions = model.transform(input3)
val evaluator = new ClusteringEvaluator()
// i get an error when i run this line
val silhouette = evaluator.evaluate(predictions)
java.lang.AssertionError: assertion failed: Number of clusters must be greater than one.
  at scala.Predef$.assert(Predef.scala:170)
  at org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette$.computeSilhouetteScore(ClusteringEvaluator.scala:416)
  ... 49 elided
But my centroids look like this:
I think that because some centers are infinite, k-means is unstable, and so the silhouette measure goes wrong.
But that still doesn't explain why, for every k > 1 I have tried so far, I get an error saying "Number of clusters must be greater than one".
Please advise.
I once saw the same message. The root cause was that all of my data points were identical (my data was generated by a program), so of course there was only one cluster. BTW, I did not check the centers, so I am not sure whether my case is the same as yours.
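The situation described above can be reproduced with a small plain-Scala sketch. It is an illustration only: the `assign` helper and the sample points and centers are hypothetical, and it uses simple nearest-centroid assignment rather than Spark's KMeans. When every point is identical, all of them land in the same cluster no matter how many centers are requested, so only one distinct prediction exists.

```scala
// Assign each point to its nearest center (squared Euclidean distance).
def assign(points: Seq[Array[Double]], centers: Seq[Array[Double]]): Seq[Int] =
  points.map { p =>
    centers.indices.minBy { c =>
      p.zip(centers(c)).map { case (a, b) => (a - b) * (a - b) }.sum
    }
  }

// Three centers, i.e. k = 3 (hypothetical values).
val centers = Seq(Array(1.0, 2.0), Array(5.0, 5.0), Array(9.0, 9.0))

// Every point is the same, so only one cluster is ever used.
val identical = Seq.fill(5)(Array(1.0, 2.0))
val distinctForIdentical = assign(identical, centers).distinct.size

// Points that actually differ use all three clusters.
val spread = Seq(Array(1.0, 2.0), Array(5.0, 5.1), Array(9.0, 8.9))
val distinctForSpread = assign(spread, centers).distinct.size
```

Counting the distinct values in the prediction column of the real output is the analogous check before calling the evaluator.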

Retrieving not only top one predictions from Multiclass Regression with Spark [duplicate]

I'm running a Bernoulli Naive Bayes using code:
val splits = MyData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")
My question is: how can I get the probability of membership in class 0 (or 1) and compute the AUC? I want a result similar to what I get from LogisticRegressionWithSGD or SVMWithSGD, where I was using this code:
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
// Compute raw scores on the test set.
val labelAndPreds = test.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(labelAndPreds)
val auROC = metrics.areaUnderROC()
Unfortunately this code isn't working for NaiveBayes.
Concerning the probabilities for Bernoulli Naive Bayes, here is an example:
// Building dummy data
val data = sc.parallelize(List("0,1 0 0", "1,0 1 0", "1,0 0 1", "0,1 0 1","1,1 1 0"))
// Transforming dummy data into LabeledPoint
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
// Prepare data for training
val splits = parsedData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")
// labels
val labels = model.labels
// Probabilities for all feature vectors
val features = parsedData.map(lp => lp.features)
model.predictProbabilities(features).take(10) foreach println
// For one specific vector, I'm taking the first vector in the parsedData
val testVector = parsedData.first.features
println(s"For vector ${testVector} => probability : ${model.predictProbabilities(testVector)}")
As for the AUC :
// Compute raw scores on the test set.
val labelAndPreds = test.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(labelAndPreds)
val auROC = metrics.areaUnderROC()
Concerning the inquiry from the chat :
val results = parsedData.flatMap { lp =>
  val probs: Vector = model.predictProbabilities(lp.features)
  (for (i <- 0 to (probs.size - 1)) yield ((lp.label, labels(i), probs(i))))
}
results.take(10) foreach println
// (0.0,0.0,0.59728640251696)
// (0.0,1.0,0.40271359748304003)
// (1.0,0.0,0.2546873180388961)
// (1.0,1.0,0.745312681961104)
// (1.0,0.0,0.47086939671877026)
// (1.0,1.0,0.5291306032812298)
// (0.0,0.0,0.6496075621805428)
// (0.0,1.0,0.3503924378194571)
// (1.0,0.0,0.4158585282373076)
// (1.0,1.0,0.5841414717626924)
And if you are only interested in the argmax classes:
val results = parsedData.map { lp =>
  val probs: Vector = model.predictProbabilities(lp.features)
  val bestClass = probs.argmax
  (labels(bestClass), probs(bestClass))
}
results.take(10) foreach println
// (0.0,0.59728640251696)
// (1.0,0.745312681961104)
// (1.0,0.5291306032812298)
// (0.0,0.6496075621805428)
// (1.0,0.5841414717626924)
Note: Works with Spark 1.5+
EDIT: (for Pyspark users)
It seems that some people are having trouble getting probabilities using pyspark and mllib. That's expected: spark-mllib doesn't expose that function for pyspark.
Thus you'll need to use the spark-ml DataFrame-based API:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import NaiveBayes
df = spark.createDataFrame([
Row(label=0.0, features=Vectors.dense([0.0, 0.0])),
Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])
nb = NaiveBayes(smoothing=1.0, modelType="bernoulli")
model = nb.fit(df)
model.transform(df).show(truncate=False)
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
# |features |label|rawPrediction |probability |prediction|
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
# |[0.0,0.0]|0.0 |[-1.4916548767777167,-2.420368128650429] |[0.7168141592920354,0.28318584070796465]|0.0 |
# |[0.0,1.0]|0.0 |[-1.4916548767777167,-3.1135153092103742]|[0.8350515463917526,0.16494845360824742]|0.0 |
# |[1.0,0.0]|1.0 |[-2.5902671654458262,-1.7272209480904837]|[0.29670329670329676,0.7032967032967034]|1.0 |
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
You'll just need to select your prediction column and compute your AUC.
For more information about Naive Bayes in spark-ml, please refer to the official documentation here.
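To make the AUC step concrete, here is a minimal plain-Scala sketch. It is not Spark's BinaryClassificationMetrics; the helper name `auc` is hypothetical and it uses the pairwise definition of AUC (the fraction of positive/negative pairs in which the positive example gets the higher score, with ties counting one half). The sample scores are the class-1 probabilities from the table above, rounded to three decimals.

```scala
// AUC as the fraction of (positive, negative) pairs ranked correctly.
// Hypothetical helper; quadratic in the input size, so only for small samples.
def auc(scoreAndLabels: Seq[(Double, Double)]): Double = {
  val pos = scoreAndLabels.collect { case (s, l) if l == 1.0 => s }
  val neg = scoreAndLabels.collect { case (s, l) if l == 0.0 => s }
  val wins = for (p <- pos; n <- neg) yield
    if (p > n) 1.0 else if (p == n) 0.5 else 0.0
  // Yields NaN when either class is missing, since no pairs exist.
  wins.sum / (pos.size * neg.size)
}

// (probability of class 1, true label) for the three rows shown above.
val sample = Seq((0.283, 0.0), (0.165, 0.0), (0.703, 1.0))
val sampleAuc = auc(sample)
```

On this tiny sample the positive row outranks both negative rows, so the score is perfect; with real data the probability of the positive class is the natural score to feed in.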

Convert Rdd[Vector] to Rdd[Double]

How do I convert csv to Rdd[Double]? I have the error: cannot be applied to (org.apache.spark.rdd.RDD[Unit]) at this line:
val kd = new KernelDensity().setSample(rows)
My full code is here:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
class KdeAnalysis {
val conf = new SparkConf().setAppName("sample").setMaster("local")
val sc = new SparkContext(conf)
val DATAFILE: String = "C:\\Users\\ajohn\\Desktop\\spark_R\\data\\mass_cytometry\\mass.csv"
val rows = sc.textFile(DATAFILE).map {
  line => val values = line.split(',').map(_.toDouble)
}
// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
val rdd : RDD[Double] = sc.parallelize(rows)
val kd = new KernelDensity().setSample(rdd)
// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
Since rows is an RDD[org.apache.spark.mllib.linalg.Vector], the following line cannot work:
val rdd : RDD[Double] = sc.parallelize(rows)
parallelize expects Seq[T] and RDD is not a Seq.
Even if this part worked as you expect, your input is simply wrong. A correct argument for KernelDensity.setSample is either RDD[Double] or JavaRDD[java.lang.Double]. It looks like it doesn't support multivariate data at the moment.
Regarding the question from the title, you can flatMap:
rows.flatMap(_.toArray)
or, even better, do it when you create rows:
val rows = sc.textFile(DATAFILE).flatMap(_.split(',').map(_.toDouble)).cache()
but I doubt it is really what you need.
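The flattening step and the kind of univariate estimate a kernel density estimator produces can be sketched in plain Scala. This is an illustration under stated assumptions: the sample rows are hypothetical, `kde` is a hand-written Gaussian kernel density estimate rather than MLlib's KernelDensity, and the bandwidth is taken to be the standard deviation of each Gaussian kernel.

```scala
// Rows of a parsed CSV, standing in for an RDD of vectors (hypothetical values).
val rows: Seq[Array[Double]] = Seq(Array(0.0, 1.0), Array(2.0, 3.0))

// Flatten every row into one univariate sample, as flatMap would on an RDD.
val sample: Seq[Double] = rows.flatMap(_.toSeq)

// Gaussian kernel density estimate at x; bandwidth is the kernel's standard deviation.
def kde(sample: Seq[Double], bandwidth: Double, x: Double): Double = {
  val norm = 1.0 / (sample.size * bandwidth * math.sqrt(2.0 * math.Pi))
  norm * sample.map { xi =>
    val z = (x - xi) / bandwidth
    math.exp(-0.5 * z * z)
  }.sum
}

// Density estimates at the same evaluation points used in the question.
val densities = Seq(-1.0, 2.0, 5.0).map(x => kde(sample, 3.0, x))
```

Flattening mixes all columns into a single sample, which is why it only makes sense when the columns measure the same quantity.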
I have prepared this code; please check whether it helps:
val doubleRDD = rows.map(_.toArray).flatMap(x => x)

Spark - LinearRegressionWithSGD on Coursera Machine Learning by Stanford University samples

Software Version: Apache Spark v1.3
Context: I've been trying to "translate" Octave/MATLAB code to Scala on Apache Spark. More precisely, I am working on ex1data1.txt and ex1data2.txt from the Coursera practical exercise ex1. I did the same translation into Julia (it went smoothly), and now I've been struggling with Spark, without success.
Problem: My implementation on Spark gives very poor results; I cannot even say it works correctly. For ex1data1.txt I added a polynomial feature, and I tried two ways of handling theta0: using setIntercept(true), and adding an extra non-normalized column of 1.0 values (in which case I set the intercept to false). I get only silly results.
So I decided to start working with ex1data2.txt instead. Below you can find the code and the expected result. Of course, the Spark result is wrong.
Have you had a similar experience? I will be grateful for your help.
The Scala code for ex1data2.txt:
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SquaredL2Updater
import org.apache.spark.mllib.regression.{LinearRegressionModel, LinearRegressionWithSGD, LabeledPoint}
import org.apache.spark.{SparkContext, SparkConf}
object MLibOnEx1data2 extends App {
val conf = new SparkConf()
conf.set("spark.app.name", "coursera ex1data2.txt test")
val sc = new SparkContext(conf)
val input = sc.textFile("hdfs:///ex1data2.txt")
val trainData = input.map { line =>
  val parts = line.split(',')
  val y = parts(2).toDouble
  val features = Vectors.dense(parts(0).toDouble, parts(1).toDouble)
  println(s"x = $features y = $y")
  LabeledPoint(y, features)
}
// Building the model
val numIterations = 1500
val alpha = 0.01
// Scale the features
val scaler = new StandardScaler(withMean = true, withStd = true)
  .fit(trainData.map(x => x.features))
val scaledTrainData = trainData.map { td =>
  val normFeatures = scaler.transform(td.features)
  println(s"normalized features = $normFeatures")
  LabeledPoint(td.label, normFeatures)
}
val tsize = scaledTrainData.count()
println(s"Training set size is $tsize")
val alg = new LinearRegressionWithSGD().setIntercept(true)
alg.optimizer
  .setNumIterations(numIterations)
  .setStepSize(alpha)
  .setUpdater(new SquaredL2Updater)
  .setRegParam(0.0) //regularization - off
val model = alg.run(scaledTrainData)
println(s"Theta is $model.weights")
val total1 = model.predict(scaler.transform(Vectors.dense(1650, 3)))
println(s"Estimate the price of a 1650 sq-ft, 3 br house = $total1 dollars") //it should give ~ $289314.620338
// Evaluate model on training examples and compute training error
val valuesAndPreds = scaledTrainData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = ((valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()) / 2)
println("Training Mean Squared Error = " + MSE)
// Save and load model
val trySaveAndLoad = util.Try(model.save(sc, "myModelPath"))
  .flatMap { _ => util.Try(LinearRegressionModel.load(sc, "myModelPath")) }
println(s"trySaveAndLoad result is $trySaveAndLoad")
}
STDOUT result is:
Training set size is 47
Theta is (weights=[52090.291641275864,19342.034885388926],
Estimate the price of a 1650 sq-ft, 3 br house = 153983.5541846754
Training Mean Squared Error = 1.5876093757127676E10
trySaveAndLoad result is -1
Well, after some digging I believe there is nothing wrong here. First I saved the content of valuesAndPreds to a text file:
valuesAndPreds.map {
  case (x, y) => s"$x,$y"
}.repartition(1).saveAsTextFile("results.txt")
The rest of the code is written in R.
First let's create a model using the closed-form solution:
# Load data
df <- read.csv('results.txt/ex1data2.txt', header=FALSE)
# Scale features
df[, 1:2] <- apply(df[, 1:2], 2, scale)
# Build linear model
model <- lm(V3 ~ ., df)
For reference:
> summary(model)
Call:
lm(formula = V3 ~ ., data = df)
Residuals:
    Min      1Q  Median      3Q     Max
-130582  -43636  -10829   43698  198147
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   340413       9637  35.323  < 2e-16 ***
V1            110631      11758   9.409 4.22e-12 ***
V2             -6650      11758  -0.566    0.575
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 66070 on 44 degrees of freedom
Multiple R-squared: 0.7329, Adjusted R-squared: 0.7208
F-statistic: 60.38 on 2 and 44 DF, p-value: 2.428e-13
Now prediction:
closedFormPrediction <- predict(model, df)
closedFormRMSE <- sqrt(mean((closedFormPrediction - df$V3)**2))
plot(closedFormPrediction, df$V3,
  ylab="Actual", xlab="Predicted",
  main=paste("Closed form, RMSE: ", round(closedFormRMSE, 3)))
Now we can compare the above to the SGD results:
sgd <- read.csv('results.txt/part-00000', header=FALSE)
sgdRMSE <- sqrt(mean((sgd$V2 - sgd$V1)**2))
plot(sgd$V2, sgd$V1, ylab="Actual",
  xlab="Predicted", main=paste("SGD, RMSE: ", round(sgdRMSE, 3)))
Finally, let's compare both:
plot(sgd$V2, closedFormPrediction,
  xlab="SGD", ylab="Closed form", main="SGD vs Closed form")
So the results are clearly not perfect, but nothing seems to be completely off here.
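For readers who would rather stay in Scala, the RMSE used in this comparison can be sketched as follows. The helper name `rmse` and the sample values are hypothetical. The point worth noting is that each residual must be squared before averaging; squaring the mean residual instead lets positive and negative errors cancel and can report an error near zero for a poor fit.

```scala
// Root mean squared error: square each residual, average, then take the root.
def rmse(actual: Seq[Double], predicted: Seq[Double]): Double = {
  require(actual.size == predicted.size && actual.nonEmpty)
  val squared = actual.zip(predicted).map { case (a, p) => (a - p) * (a - p) }
  math.sqrt(squared.sum / squared.size)
}

// Toy values (hypothetical): the residuals are -1 and +1.
val actual = Seq(1.0, 3.0)
val predicted = Seq(2.0, 2.0)

val error = rmse(actual, predicted)
// The mean residual cancels out, which is why it is not a usable error measure.
val meanResidual = actual.zip(predicted).map { case (a, p) => a - p }.sum / actual.size
```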

Writing output of the Principal Components Analysis to text file

I have performed a Principal Component Analysis on a matrix I previously loaded with sc.textFile. The output being a org.apache.spark.mllib.linalg.Matrix I then converted it to a RDD[Vector[Double]].
I did:
val pw = new PrintWriter("Matrix.csv")
rows3.collect().foreach(line => pw.println(line))
The output CSV is promising. The only problem is that each line is a DenseVector(some values). How do I split each line into the corresponding coefficients?
Thanks a lot
You can use results of the computePrincipalComponents and breeze.linalg.csvwrite:
import java.io.File
import breeze.linalg.{DenseMatrix => BDM, csvwrite}
val mat: RowMatrix = ...
val pca = mat.computePrincipalComponents(...)
csvwrite(
  new File("Matrix.csv"),
  new BDM[Double](pca.numRows, pca.numCols, pca.toArray))
Convert each vector to a string (you can do it either on the driver or on the executors):
val pw = new PrintWriter("Matrix.csv")
rows3.map(_.toArray.mkString(",")).collect().foreach(line => pw.println(line))
If your data is too big to fit in the driver's memory, you can try something like this:
val rdd = rows3.map(_.toArray.mkString(",")).zipWithIndex.cache
val total = rdd.count
val step = 10000L // rows in each chunk
val range = 0L until total by step
val limits = range.map(start => (start, start + step))
limits.foreach { case (start, end) =>
  rdd.filter(x => x._2 >= start && x._2 < end)
    .map(_._1).collect().foreach(line => pw.println(line))
}
pw.close()
I can't try this out, but that is the general idea.
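The chunking idea can be tried out locally with plain Scala collections standing in for the RDD. Everything here is an assumption made for illustration: the 25 generated lines, the chunk size of 10, and the in-memory StringWriter used instead of a file. The sketch indexes the lines, builds (start, end) limits that also cover a short final chunk, and writes one chunk at a time.

```scala
import java.io.{PrintWriter, StringWriter}

// Stand-in for the RDD of CSV lines (hypothetical data): 25 rows.
val lines: Seq[String] = (1 to 25).map(i => s"$i,${i * 2}")
val indexed = lines.zipWithIndex

val step = 10 // rows in each chunk
// Chunk boundaries: (0, 10), (10, 20), (20, 30); the last chunk may be short.
val limits = (0 until indexed.size by step).map(start => (start, start + step))

val out = new StringWriter() // a file-backed PrintWriter works the same way
val pw = new PrintWriter(out)
limits.foreach { case (start, end) =>
  indexed
    .filter { case (_, i) => i >= start && i < end }
    .foreach { case (line, _) => pw.println(line) }
}
pw.close()

val written = out.toString.split(System.lineSeparator)
```

Building the limits from each start index, rather than pairing consecutive boundaries, is what keeps the rows after the last full chunk from being dropped.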