Spark ML Linear Regression - What Hyper-parameters to Tune

I'm using the LinearRegression model in Spark ML to predict price. It is a univariate regression (x = time, y = price).
Assuming my data is clean, what are the usual steps to take to improve this model?
So far, I have tried tuning the regularization parameter using cross-validation, and got RMSE = 15 given a standard deviation of 30.
Are there any other significant hyper-parameters I should care about? It seems Spark ML is not well documented for hyper-parameter tuning...
Update
I was able to tune the parameters using ParamGrid and cross-validation. However, is there any way to see what the fitted line looks like after training a linear regression model? How can I know whether the line is quadratic, cubic, etc.? It would be great if there were a way to visualize the fitted line together with all the training data points.

The link you provided points to the two main hyperparameters:
.setRegParam(0.3) // lambda, the overall regularization strength
.setElasticNetParam(0.8) // alpha, the mix between L1 and L2 penalties
You can perform a grid search to optimize them, say over
lambda in 0 to 0.8
elasticNet in 0 to 1.0
This is done by providing an Array[ParamMap] to CrossValidator via
val estimatorParamMaps: Param[Array[ParamMap]] // param for estimator param maps
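A minimal sketch of such a grid search (the grid values are illustrative, and a DataFrame named training with features and label columns is assumed):

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LinearRegression()

// grid over regParam (lambda) and elasticNetParam (alpha)
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.0, 0.2, 0.4, 0.8))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator().setMetricName("rmse"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(training) // cvModel.bestModel holds the best fit found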

To answer your follow-up question: LinearRegression always produces a straight-line fit, since the model is linear in its features; it will never be quadratic or cubic unless you add polynomial features yourself. To visualize it, predict on a set of points spanning your x-range, draw the predictions as a line plot, and overlay your training data points on top.
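For instance, a rough sketch of generating the line's points, assuming a fitted LinearRegressionModel named model and a SparkSession named spark (both names, and the 0-to-100 range, are illustrative):

import org.apache.spark.ml.feature.VectorAssembler

// sample x-values across the training range, predict, then collect for plotting
val xs = spark.range(0, 100).toDF("x")
val withFeatures = new VectorAssembler()
  .setInputCols(Array("x"))
  .setOutputCol("features")
  .transform(xs)
val linePoints = model.transform(withFeatures)
  .select("x", "prediction")
  .collect() // plot these (x, prediction) pairs as a line, training data as a scatter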

For reference, a cleaned-up, self-contained version of the posted training code:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.sql.SparkSession

object PredictiveAnalysis {
  val session = SparkSession.builder().master("local").appName("PredictiveAnalysis").getOrCreate()

  def main(args: Array[String]): Unit = {
    val data = session.sparkContext.textFile("C:\\Users\\Test\\new_workspace\\PredictionAlgo\\src\\main\\resources\\data.txt")
    val parsedData = data.map { line =>
      // each line holds comma/space-separated numbers; the first is the label
      val y = line.replace(",", " ").split(" ").map(_.toDouble)
      // use only the last column as the feature: the label itself must not leak into the features
      LabeledPoint(y(0), Vectors.dense(y(y.length - 1)))
    }.cache()
    val numIterations = 100
    val stepSize = 0.00000001
    // RDD-based MLlib API, kept from the original; ml.regression.LinearRegression is the modern equivalent
    val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)
    val valuesAndPreds = parsedData.map { point =>
      (point.label, model.predict(point.features))
    }
    // the original print statement had the two values swapped
    valuesAndPreds.foreach { case (label, prediction) =>
      println(s"actual label: $label, predicted label: $prediction")
    }
    val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
    println("training Mean Squared Error = " + MSE)
  }
}

Related

spark ml LinearRegression prediction is a constant for all observations

I'm trying to build a simple linear regression model in Spark using Scala. To test the method, I'm performing a single-variable regression on a test data set.
My data set is as follows:
x - integers from 1 to 100
y - random values generated from excel using the formula =RANDBETWEEN(-10,10)*RAND() + x_i
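For reference, equivalent test data can be generated directly in Scala (a sketch mimicking the Excel formula; the seed is arbitrary):

import scala.util.Random

val rng = new Random(11L)
// y = x + RANDBETWEEN(-10,10) * RAND()
val testData = (1 to 100).map { x => (x.toDouble, x + (rng.nextInt(21) - 10) * rng.nextDouble()) }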
I've run a regression on this data set using the Python sklearn library, and it gives me the best-fit line (with r2 = 0.98), as expected.
However, if I run a regression using Spark, my prediction is a constant value for all the x values in the dataset, with an r2 value of 2e-16.
Why doesn't this code give me the best-fit line as the prediction? What am I missing?
Here's the code I'm using.
Python code that works
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

x = np.array(df['x']).reshape(-1, 1)
y = np.array(df['y']).reshape(-1, 1)  # df['x'] in the original post was a typo
clf = LinearRegression()  # 'normilize' was a typo; the normalize argument was removed in recent scikit-learn
clf.fit(x, y)
y_predictions = clf.predict(x)
print(r2_score(y, y_predictions))
Here's a plot from the Python regression (image not reproduced here).
Scala code that gives a constant prediction
val labelCol = "y"
val assembler = new VectorAssembler()
  .setInputCols(Array("x"))
  .setOutputCol("features")
val df2 = assembler.transform(df)
// StringIndexer re-encodes the numeric target as categorical indices, which breaks the regression
val labelIndexer = new StringIndexer().setInputCol(labelCol).setOutputCol("label")
val df3 = labelIndexer.fit(df2).transform(df2)
val regressor = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(1.0)
  .setElasticNetParam(1.0)
val model = regressor.fit(df3)
val predictions = model.transform(df3)
val modelSummary = model.summary
println(s"r2 = ${modelSummary.r2}")
The issue was using StringIndexer, which should not be used on numeric columns: it maps each distinct value to a categorical index, so the regression target gets scrambled. In my case, instead of using StringIndexer, I should have just renamed the y column to label. This fixes the problem.
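A minimal sketch of the fix (column names as in the code above):

val assembler = new VectorAssembler()
  .setInputCols(Array("x"))
  .setOutputCol("features")
val df2 = assembler.transform(df)
// keep the numeric target untouched; just give it the name LinearRegression expects
val df3 = df2.withColumnRenamed("y", "label")
val model = new LinearRegression().setMaxIter(10).fit(df3)

Alternatively, calling .setLabelCol("y") on the regressor avoids the rename entirely.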

Create Linear Regression Model from an array of coefficients in Spark

I have an array of coefficients already computed and I want to create a Linear Regression Model out of it in Spark 2.0.1 so that I can use it for prediction.
What is the easiest way to create an instance of the LinearRegressionModel class from an array of coefficients?
Your linear model is just a linear equation, so for example if your coefficients are
val coefficients = Array[Double](c0, c1, c2, ..., cn)
where the first value is the intercept (assuming your model has one), then your linear equation is
y = c0 + c1*x1 + c2*x2 + ... + cn*xn
So you could define
class LinearModel(coefficients: Array[Double]) {
  def predict(newObservation: Array[Double]): Double = {
    val intercept = coefficients(0)
    val weights = coefficients.drop(1)
    // dot product of the observation with the weights
    val multiplication = newObservation.zip(weights).map { case (x, w) => x * w }.sum
    intercept + multiplication
  }
}
For example, if your coefficients are
val coefficients = Array(2.0, 2.1, 2.2)
then define a new linear model
val model = new LinearModel(coefficients)
So if you have a new observation
val newObservation = Array(1.0, 1.0)
the prediction is
model.predict(newObservation)
and the output is
scala> model.predict(newObservation)
res16: Double = 6.300000000000001
And you can adapt the previous code if you want to predict a bunch of observations instead of just one.
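For instance, a one-line sketch for batch prediction (newObservations is a hypothetical collection of feature arrays):

val newObservations = Seq(Array(1.0, 1.0), Array(0.5, 2.0))
val predictions = newObservations.map(model.predict) // one prediction per observation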

Spark K-Means get original Cluster Center / Centroids with Normalization

I ran a k-means model
val kmeans = new KMeans().setK(k).setSeed(1L)
val model = kmeans.fit(train_dataset)
and then extract the cluster centers (centroids):
var clusterCenters: Seq[(Double, Double, Double, Double, Double, Double, Double, Double, Double)] = Seq()
for (e <- model.clusterCenters) {
  clusterCenters = clusterCenters :+ ((e(0), e(1), e(2), e(3), e(4), e(5), e(6), e(7), e(8)))
}
import spark.implicits._ // toDF needs the SparkSession implicits, not the SparkContext
var centroidsDF = clusterCenters.toDF()
To write the results back, I create a DataFrame of the resulting cluster centers.
Now I have the problem that I have normalized the data beforehand to improve the clustering results.
val scaler = new StandardScaler()
.setInputCol("features")
.setOutputCol("scaledFeatures")
.setWithStd(true)
.setWithMean(false)
val scalerModel = scaler.fit(train_dataset)
val scaledData = scalerModel.transform(train_dataset)
How can I get the centroids back in their original, de-normalized form?
I am not sure it makes sense to do this, but since you don't center the data (withMean is false), you can just multiply each center by the std vector:
import org.apache.spark.ml.feature.ElementwiseProduct

val kmeans: KMeansModel = ???
val scaler: StandardScalerModel = ???

new ElementwiseProduct()
  .setScalingVec(scaler.std) // standard deviations used by the scaler
  .setOutputCol("rescaled")
  .setInputCol("cluster")
  .transform(sc.parallelize(
    // get centers and convert to a DataFrame
    kmeans.clusterCenters.zipWithIndex
  ).toDF("cluster", "id"))
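Equivalently, a plain sketch without ElementwiseProduct that rescales each center's array directly (assuming the same kmeans and scaler values as above):

import org.apache.spark.ml.linalg.Vectors

// multiply each component by its standard deviation to undo the scaling
val restoredCenters = kmeans.clusterCenters.map { center =>
  Vectors.dense(center.toArray.zip(scaler.std.toArray).map { case (v, s) => v * s })
}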

Apache Spark MLLib get maximum value

I have the following model:
case class Product(price:Int,distance:Int)
and I have data that tells me whether a customer is willing to buy the product for price x if the distance is y (true/false).
I used logistic regression in Spark on it and can now predict on (price, distance) pairs. What if I now want to know the maximum price I can charge for distance x?
code:
val products: List[(Product, Double)] = getProductVotes()
val points: List[LabeledPoint] = products.map { case (product, vote) =>
  LabeledPoint(vote, Vectors.dense(product.price, product.distance))
}
val data: RDD[LabeledPoint] = sc.parallelize(points)
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1).cache()
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(training)
To know the maximum price for a given distance x:
1. Take the subset of your training data for which vote = true.
2. Build labeled points with "Price" as the label and "Distance" as the feature.
3. Train a linear regression model on that set of labeled points to predict "Price" given "Distance", as in the sketch below.
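A minimal sketch of those three steps, reusing products and the RDD-based MLlib API from the question (the iteration count and the example distance are illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// 1. keep only the observations where the customer was willing to buy
val bought = products.filter { case (_, vote) => vote == 1.0 }
// 2. label = price, single feature = distance
val pricePoints = sc.parallelize(bought.map { case (product, _) =>
  LabeledPoint(product.price, Vectors.dense(product.distance))
}).cache()
// 3. fit a linear model; its prediction at a distance estimates the highest acceptable price
val priceModel = LinearRegressionWithSGD.train(pricePoints, 100)
val maxPriceAtDistance10 = priceModel.predict(Vectors.dense(10.0))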

How to define a function and pass training and test datasets in Scala?

I want to define a function in Scala to which I can pass my training and test datasets, and which then performs a simple machine learning algorithm and returns some statistics. How should I do that? What should the parameters' data types be?
Imagine you need to define a function that takes training and test datasets, performs a simple classification algorithm, and then returns the accuracy.
What I expect to have is as follows:
val data = MLUtils.loadLibSVMFile(sc, datadir + "/example.txt")
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
val results1 = SVMFunction(training, test)
val results2 = RegressionFunction(training, test)
val results3 = ClassificationFunction(training, test)
I need just the declaration of the functions, not the code that produces results1, results2, and results3.
def SVMFunction("I need help here") {
  // I know how to work with the training and test datasets to generate the results,
  // so there's no need to discuss what should go here
}
Thanks.
In case you're using supervised learning, you should opt for LabeledPoint. An excerpt from the MLlib documentation:
A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ....
An example:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
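Given that, a sketch of the declaration the question asks for (returning Double for accuracy is an assumption; adapt the return type to whatever statistics you need):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def SVMFunction(training: RDD[LabeledPoint], test: RDD[LabeledPoint]): Double = {
  // train on `training`, evaluate on `test`, and return the accuracy
  ???
}

The same signature works for RegressionFunction and ClassificationFunction, since MLUtils.loadLibSVMFile and randomSplit both yield RDD[LabeledPoint].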