spark ml LinearRegression prediction is a constant for all observations - scala

I'm trying to build a simple linear regression model in Spark using Scala. To test the method I'm performing a single-variable regression on a test data set.
My data set is as follows:
x - integers from 1 to 100
y - random values generated in Excel using the formula =RANDBETWEEN(-10,10)*RAND() + x_i
I've run a regression on this data set using the Python sklearn library and it gives me the best-fit line (with r2 = 0.98) as expected.
However, if I run a regression using Spark, the prediction is a constant value for all x values in the dataset, with an r2 value of 2e-16.
Why doesn't this code give me the best fit line as the prediction? What am I missing?
Here's the code I'm using
Python Code that works
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

x = np.array(df['x'])
y = np.array(df['y'])
x = x.reshape(-1, 1)
y = y.reshape(-1, 1)
clf = LinearRegression(normalize=True)
clf.fit(x, y)
y_predictions = clf.predict(x)
print(r2_score(y, y_predictions))
Here's a plot from the Python regression.
Scala code that gives a constant prediction
val labelCol = "y"
val assembler = new VectorAssembler()
.setInputCols(Array("x"))
.setOutputCol("features")
val df2 = assembler.transform(df)
val labelIndexer = new StringIndexer().setInputCol(labelCol).setOutputCol("label")
val df3 = labelIndexer.fit(df2).transform(df2)
val regressor = new LinearRegression()
.setMaxIter(10)
.setRegParam(1.0)
.setElasticNetParam(1.0)
val model = regressor.fit(df3)
val predictions = model.transform(df3)
val modelSummary = model.summary
println(s"r2 = ${modelSummary.r2}")

The issue was using the StringIndexer, which should not be used on numeric columns. In my case, instead of using the StringIndexer, I should have just renamed the y column to label. This fixes the problem.
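For reference, a minimal sketch of the corrected pipeline under that fix (keeping the question's hyperparameters and assuming df has numeric columns x and y; the VectorAssembler stays, the StringIndexer is dropped, and y is simply renamed to label):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// Assemble the single feature column into a "features" vector
val assembler = new VectorAssembler()
  .setInputCols(Array("x"))
  .setOutputCol("features")

// Rename the numeric target column instead of indexing it
val training = assembler.transform(df).withColumnRenamed("y", "label")

val regressor = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(1.0)
  .setElasticNetParam(1.0)

val model = regressor.fit(training)
println(s"r2 = ${model.summary.r2}")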

Related

Spark K-Means get original Cluster Center / Centroids with Normalization

I ran a k-means model
val kmeans = new KMeans().setK(k).setSeed(1L)
val model = kmeans.fit(train_dataset)
and then extracted the cluster centers (centroids):
var clusterCenters: Seq[(Double, Double, Double, Double, Double, Double, Double, Double, Double)] = Seq()
for (e <- model.clusterCenters) {
  clusterCenters = clusterCenters :+ (e(0), e(1), e(2), e(3), e(4), e(5), e(6), e(7), e(8))
}
import spark.implicits._
val centroidsDF = clusterCenters.toDF()
To write the results back, I create a DataFrame of the resulting cluster centers.
The problem is that I normalized the data beforehand to improve the clustering results:
val scaler = new StandardScaler()
.setInputCol("features")
.setOutputCol("scaledFeatures")
.setWithStd(true)
.setWithMean(false)
val scalerModel = scaler.fit(train_dataset)
val scaledData = scalerModel.transform(train_dataset)
How can I get the centroids back in their original, de-normalized form?
I am not sure if it makes much sense to do it, but since you don't center the data (withMean is false), you can just multiply the centers by the std vector:
import org.apache.spark.ml.clustering.KMeansModel
import org.apache.spark.ml.feature.{ElementwiseProduct, StandardScalerModel}
import spark.implicits._

val kmeans: KMeansModel = ???
val scaler: StandardScalerModel = ???

new ElementwiseProduct()
  .setScalingVec(scaler.std) // standard deviations used by the scaler
  .setInputCol("cluster")
  .setOutputCol("rescaled")
  .transform(
    // Get the centers and convert them to a DataFrame
    sc.parallelize(kmeans.clusterCenters.zipWithIndex).toDF("cluster", "id"))
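Note that this works because the scaler was fitted with setWithMean(false); if the features had also been centered, you would need to add the fitted means (scaler.mean) back after rescaling.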

Kmeans Spark ML

I would like to perform KMeans using the Spark ML. Input is a libsvm dataset:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
// Start time
//val intial_Data=spark.read.option("header",true).csv("C://sample_lda_data.txt")
val dataset = spark.read.format("libsvm").load("C:\\spark\\data\\mllib\\sample_kmeans_data.txt")
// Trains a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
// Evaluate clustering by computing Within Set Sum of Squared Errors.
val WSSSE = model.computeCost(dataset)
println(s"Within Set Sum of Squared Errors = $WSSSE")
// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
So I would like to use a CSV file instead and apply KMeans with Spark ML.
I did this:
val inputData = spark.read.option("header", true).csv("C://sample_lda_data.txt")
val arrayCol = array(inputData.columns.drop(1).map(col).map(_.cast(DoubleType)): _*)
import spark.implicits._
// select the array column and the first column, and map into LabeledPoints
val result = inputData.select(col("col1").cast(DoubleType), arrayCol).map(r => LabeledPoint(r.getAs[Double](0), Vectors.dense(r.getAs[WrappedArray[Double]](1).toArray)))
// Trains a k-means model
val kmeans = new KMeans().setK(2)
val model = kmeans.fit(result)
// Evaluate clustering by computing Within Set Sum of Squared Errors.
val WSSSE = model.computeCost(result)
println(s"Within Set Sum of Squared Errors = $WSSSE")
// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
I tried to turn the CSV file into a Dataset[LabeledPoint].
Is my transformation correct?
In Spark 2, instead of MLlib we use the ML package, which works on Datasets/DataFrames and supports pipelines. What you need to do is build a dataset with two columns: features and label. features is a vector of the features you want to feed into the algorithm, and label is the target column. To build the features column, you just use a VectorAssembler to assemble all the feature columns you want to use; if you have a target column, rename it to label. After fitting this dataset to the algorithm you will get your model.
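For example, a minimal sketch of that workflow (a hypothetical headered CSV at data.csv with numeric feature columns x1, x2, x3 is assumed; KMeans itself only needs the features column):

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Read the CSV and cast the feature columns to Double
val raw = spark.read.option("header", true).csv("data.csv")
val featureCols = Array("x1", "x2", "x3")
val numeric = raw.select(featureCols.map(c => col(c).cast(DoubleType)): _*)

// Assemble the feature columns into a single vector column
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
val dataset = assembler.transform(numeric)

// Fit KMeans on the assembled features
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
model.clusterCenters.foreach(println)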

Spark ML Linear Regression - What Hyper-parameters to Tune

I'm using the LinearRegression model in Spark ML for predicting price. It is a single-variable regression (x = time, y = price).
Assuming my data is clean, what are the usual steps to take to improve this model?
So far, I tried tuning the regularization parameter using cross-validation, and got RMSE = 15 given stdev = 30.
Are there any other significant hyper-parameters I should care about? It seems Spark ML is not well documented for hyper-parameter tuning...
Update
I was able to tune the parameters using ParamGrid and cross-validation. However, is there any way to see what the fitted line looks like after training a linear regression model? How can I know if the line is quadratic or cubic, etc.? It would be great if there were a way to visualize the fitted line together with all the training data points.
The link you provided points to the main hyperparameters:
.setRegParam(0.3) // lambda for regularization
.setElasticNetParam(0.8) // coefficient for L1 vs L2
You can perform a grid search to optimize them, say for:
lambda in 0 to 0.8
elasticNet in 0 to 1.0
This can be done by providing an Array[ParamMap] to the CrossValidator:
val estimatorParamMaps: Param[Array[ParamMap]]
param for estimator param maps
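A minimal sketch of that grid search (assuming a LinearRegression estimator and a training DataFrame named training with features and label columns):

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LinearRegression().setMaxIter(100)

// Grid over the regularization strength and the L1/L2 mixing parameter
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.0, 0.1, 0.3, 0.8))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator().setMetricName("rmse"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(training) // picks the best (regParam, elasticNetParam) by RMSE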
To answer your follow-up question: LinearRegression always produces a linear fit (it cannot be quadratic or cubic). You can plot it by predicting on a dataset of evenly spaced points across your x range and drawing the predictions as a line, then plotting your training data on top of it.
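For example, a minimal sketch (assuming a fitted LinearRegressionModel named lrModel trained on a single feature assembled from x); the resulting (x, prediction) pairs can then be exported and plotted with any charting tool:

import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._

// Evenly spaced x values across the training range (1 to 100 used for illustration)
val grid = (1 to 100).map(_.toDouble).toDF("x")
val gridFeatures = new VectorAssembler()
  .setInputCols(Array("x"))
  .setOutputCol("features")
  .transform(grid)

// Predict along the grid and keep (x, prediction) pairs for the line plot
lrModel.transform(gridFeatures)
  .select("x", "prediction")
  .show()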
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.sql.SparkSession

object PredictiveAnalysis {
  val session = SparkSession.builder().master("local").appName("PredictiveAnalysis").getOrCreate()

  def main(args: Array[String]): Unit = {
    val data = session.sparkContext.textFile("C:\\Users\\Test\\new_workspace\\PredictionAlgo\\src\\main\\resources\\data.txt")
    val parsedData = data.map { line =>
      // Parse each line into doubles; the first value is used as the label,
      // and the first and last values form the feature vector
      val x: Array[String] = line.replace(",", " ").split(" ")
      val y = x.map(_.toDouble)
      val d = y.size - 1
      val c = Vectors.dense(y(0), y(d))
      LabeledPoint(y(0), c)
    }.cache()
    val numIterations = 100
    val stepSize = 0.00000001
    val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    valuesAndPreds.foreach(result => println(s"actual label: ${result._1}, predicted label: ${result._2}"))
    val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
    println("training Mean Squared Error = " + MSE)
  }
}
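Note that the example above uses LinearRegressionWithSGD from the older RDD-based mllib API, which is deprecated since Spark 2.0 in favor of ml's LinearRegression.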

How to get probability from predictions using GeneralizedLinearRegression model using spark

I'm a newbie to machine learning and I was trying to use the binomial family of the GeneralizedLinearRegression model in Spark.
I tried this:
val trainingData = sparkSession.read.format("libsvm").load("trainingData.txt")
val testData = sparkSession.read.format("libsvm").load("testData.txt")
val glr = new GeneralizedLinearRegression().setFamily("binomial").setLink("logit").setRegParam(0.3).setMaxIter(10)
val glrModel = glr.fit(trainingData)
glrModel.transform(testData).show()
For my testData, I got a prediction value of 1.0E-16, whereas when I use LogisticRegression I get a probability (0.765394663) and a prediction (0.0).
I want to know:
How do I predict classes from the prediction value of GeneralizedLinearRegression? Should I derive the class from the prediction value by using a threshold?
How do I find the probability of the predicted value?

Spark random forest binary classifier metrics

How can we get model metrics (F-score, AUROC, AUPRC, etc.) when training a random forest binary classifier in Spark MLlib?
The issue is that BinaryClassificationMetrics takes probabilities while the predict method of a RandomForest classifier returns discrete values 0 or 1.
See: https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#binary-classification
RandomForest.trainClassifier does not have a clearThreshold method that would make it return probabilities instead of discrete 0 or 1 labels.
We need to use the new ml DataFrame-based API to get the probabilities instead of the RDD-based mllib API.
Update
The following is an updated example from the Spark documentation that uses a BinaryClassificationEvaluator and displays the metrics Area Under the Receiver Operating Characteristic curve (AUROC) and Area Under the Precision-Recall Curve (AUPRC).
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
// Load and parse the data file, converting it to a DataFrame.
val data = sqlContext.read.format("libsvm").load("D:/Sources/spark/data/mllib/sample_libsvm_data.txt")
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(data)
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a RandomForest model.
val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setNumTrees(10)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels)
// Chain indexers and forest in a Pipeline
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions
.select("indexedLabel", "rawPrediction", "prediction")
.show()
val binaryClassificationEvaluator = new BinaryClassificationEvaluator()
.setLabelCol("indexedLabel")
.setRawPredictionCol("rawPrediction")
def printlnMetric(metricName: String): Unit = {
println(metricName + " = " + binaryClassificationEvaluator.setMetricName(metricName).evaluate(predictions))
}
printlnMetric("areaUnderROC")
printlnMetric("areaUnderPR")