Create Linear Regression Model from an array of coefficients in Spark - scala

I have an array of coefficients already computed and I want to create a Linear Regression Model out of it in Spark 2.0.1 so that I can use it for prediction.
What is the easiest way to create a LinearRegressionModel class with an array of coefficients?

Your linear model is just a linear equation, so for example if your coefficients are
val coefficients=Array[Double](c0,c1,c2,...,cn)
where the first value is the intercept (assuming your model has an intercept), then your linear equation is
y = c0 + c1*x1 + c2*x2 + ... + cn*xn
So you could define
class LinearModel(coefficients: Array[Double]) {
  def predict(newObservation: Array[Double]): Double = {
    // The first coefficient is the intercept; the rest are the feature weights.
    val intercept = coefficients(0)
    val weights = coefficients.drop(1)
    // Dot product of the observation with the weights.
    val multiplication = newObservation.zip(weights).map { case (x, y) => x * y }.sum
    intercept + multiplication
  }
}
For example, if your coefficients are
val coefficients=Array(2.0,2.1,2.2)
then define a new linear model
val model = new LinearModel(coefficients)
So if you have a new observation
val newObservation = Array(1.0, 1.0)
the prediction is
model.predict(newObservation)
and the output is
scala> model.predict(newObservation)
res16: Double = 6.300000000000001
And you can adapt the previous code if you want to predict a bunch of observations instead of just one.
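One way to adapt it for several observations is sketched below. This is only an illustration, not a Spark API: predictAll is a hypothetical helper added here, the length check is an extra safeguard, and the class is restated so the snippet runs on its own.

```scala
// Same idea as the class above, restated so this snippet is self-contained.
class LinearModel(coefficients: Array[Double]) {
  private val intercept = coefficients.head
  private val weights = coefficients.tail

  def predict(x: Array[Double]): Double = {
    // Guard against observations with the wrong number of features.
    require(x.length == weights.length,
      s"expected ${weights.length} features, got ${x.length}")
    intercept + x.zip(weights).map { case (xi, wi) => xi * wi }.sum
  }

  // Hypothetical helper: one prediction per observation, in order.
  def predictAll(xs: Seq[Array[Double]]): Seq[Double] = xs.map(predict)
}

val model = new LinearModel(Array(2.0, 2.1, 2.2))
val batch = Seq(Array(1.0, 1.0), Array(0.0, 0.0), Array(2.0, 0.0))
val predictions = model.predictAll(batch)
predictions.foreach(println)
```

For an RDD[Array[Double]] the same idea would be rdd.map(model.predict), provided the class extends Serializable so Spark can ship it to the executors.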

Related

Kmeans Spark ML

I would like to perform KMeans using the Spark ML. Input is a libsvm dataset:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
// Start time
//val intial_Data=spark.read.option("header",true).csv("C://sample_lda_data.txt")
val dataset = spark.read.format("libsvm").load("C:\\spark\\data\\mllib\\sample_kmeans_data.txt")
// Trains a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
// Evaluate clustering by computing Within Set Sum of Squared Errors.
val WSSSE = model.computeCost(dataset)
println(s"Within Set Sum of Squared Errors = $WSSSE")
// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
So I would like to use a CSV file and apply KMeans with Spark ML.
I did this:
val inputData = spark.read.option("header", true).csv("C://sample_lda_data.txt")
val arrayCol= array(inputData.columns.drop(1).map(col).map(_.cast(DoubleType)): _*)
import spark.implicits._
// select array column and first column, and map into LabeledPoints
val result = inputData.select(col("col1").cast(DoubleType), arrayCol).map(r => LabeledPoint(r.getAs[Double](0),Vectors.dense(r.getAs[WrappedArray[Double]](1).toArray)))
// Trains a k-means model
val kmeans = new KMeans().setK(2)
val model = kmeans.fit(result)
// Evaluate clustering by computing Within Set Sum of Squared Errors.
val WSSSE = model.computeCost(dataset)
println(s"Within Set Sum of Squared Errors = $WSSSE")
// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
I tried to turn the CSV file into a Dataset[LabeledPoint].
Is my transformation correct?
In Spark 2, the ML package is used instead of MLlib. It works on Datasets, and ML workflows are built as pipelines. What you need to do is build a Dataset with a features column and, for supervised algorithms, a label column. The features column is a vector of the features you want to feed into the algorithm; to create it, use VectorAssembler to assemble all the feature columns you want to use. The label column is the target column, so if you have a target column, rename it to label. After fitting the algorithm on this Dataset you will get your model.
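As a rough sketch of that approach for the CSV case: the snippet below writes a tiny made-up CSV to a temporary file (the column names x and y and the values are assumptions for illustration only), assembles every column into a features vector with VectorAssembler, and fits KMeans on the result. KMeans is unsupervised, so no label column is involved here.

```scala
import java.nio.file.Files
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("kmeans-csv-sketch").getOrCreate()

// Made-up data written to a temp file so the sketch is self-contained.
val csvPath = Files.createTempFile("points", ".csv")
Files.write(csvPath, "x,y\n0.0,0.0\n0.1,0.1\n9.0,9.0\n9.1,9.1\n".getBytes("UTF-8"))

// inferSchema makes Spark read the numeric columns as doubles, not strings.
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvPath.toString)

// Assumption: every column is a numeric feature.
val assembler = new VectorAssembler()
  .setInputCols(raw.columns)
  .setOutputCol("features")
val dataset = assembler.transform(raw)

// KMeans reads the "features" column by default.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
model.clusterCenters.foreach(println)
```

If the file has an id or target column that should not be clustered on, it would need to be left out of setInputCols.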

How to obtain coefficient values from Spark-MLlib Linear Regression model (Scala)?

I'd like to obtain coefficient values of Linear Regression(LR) model in Spark-MLlib. Here I use the 'LinearRegressionWithSGD' to build the model and you can find the sample from the following link:
https://spark.apache.org/docs/2.1.0/mllib-linear-methods.html#regression
I could get the coefficient values from Spark-ML Linear Regression. Please find the reference link from below.
https://spark.apache.org/docs/2.1.0/ml-classification-regression.html#linear-regression
Please help me with this. Thanks in advance !!
Took first lines of model creation from the first link you sent:
import org.apache.spark.mllib.regression.{LinearRegressionModel, LinearRegressionWithSGD}
val model: LinearRegressionModel = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)
// Here are the coefficients and intercept
val weights: org.apache.spark.mllib.linalg.Vector = model.weights
val intercept = model.intercept
val weightsData: Array[Double] = weights.toArray
The last three lines give the coefficients and the intercept.
The type of weights is org.apache.spark.mllib.linalg.Vector. Spark converts it to a Breeze vector internally when it does linear algebra on it.

Spark ML Linear Regression - What Hyper-parameters to Tune

I'm using the LinearRegression model in Spark ML to predict price. It is a univariate regression (x=time, y=price).
Assuming my data is clean, what are the usual steps to take to improve this model?
So far, I have tried tuning the regularization parameter using cross-validation, and got rmse=15 given stdev=30.
Are there any other significant hyper-parameters I should care about? It seems Spark ML is not well documented for hyper-parameter tuning...
Update
I was able to tune the parameters using ParamGrid and cross-validation. However, is there any way to see what the fitted line looks like after training a linear regression model? How can I know if the line is quadratic or cubic, etc.? It would be great if there were a way to visualize the fitted line with all the training data points.
The link you provided points to the main hyperparameters:
.setRegParam(0.3) // lambda for regularization
.setElasticNetParam(0.8) // coefficient for L1 vs L2
You can perform a grid search to tune them, say for
lambda in 0 to 0.8
elasticNet in 0 to 1.0
This can be done by providing an array of ParamMaps to CrossValidator through its estimatorParamMaps parameter:
val estimatorParamMaps: Param[Array[ParamMap]]
param for estimator param maps
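A minimal sketch of such a grid search, assuming made-up training data and arbitrary grid values chosen only for illustration (they are not recommendations):

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("lr-grid-sketch").getOrCreate()

// Made-up data: price = 3 * time + 1, with a small fixed perturbation.
val rows = (1 to 30).map { t =>
  val noise = if (t % 2 == 0) 0.5 else -0.5
  (3.0 * t + 1.0 + noise, Vectors.dense(t.toDouble))
}
val training = spark.createDataFrame(rows).toDF("label", "features")

val lr = new LinearRegression().setMaxIter(50)

// 4 values of regParam x 3 values of elasticNetParam = 12 candidate settings.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.0, 0.1, 0.3, 0.8))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator().setMetricName("rmse"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(training)

// Average cross-validated RMSE for each candidate, in grid order.
paramGrid.zip(cvModel.avgMetrics).foreach { case (params, rmse) =>
  println(s"rmse=$rmse for $params")
}
```

cvModel.bestModel holds the model refitted on the full training set with the best-scoring setting.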
To answer your follow-up question, LinearRegression will always be a linear fit. You can plot it by predicting on a dataset of points across the range of your x-values and drawing the predictions as a line plot. Then you can plot your training data on top of it.
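The points for such a line plot could be generated roughly as follows. This is a plain-Scala sketch with hypothetical slope and intercept values; with a fitted Spark ML model they would presumably come from its coefficients and intercept. fittedLine is a helper name invented here.

```scala
// Hypothetical fitted values, for illustration only.
val slope = 2.5
val intercept = 10.0

// Evenly spaced x-values across [xMin, xMax], each paired with its prediction.
def fittedLine(xMin: Double, xMax: Double, n: Int): Seq[(Double, Double)] = {
  require(n >= 2 && xMax > xMin, "need at least two points and a non-empty range")
  val step = (xMax - xMin) / (n - 1)
  (0 until n).map { i =>
    val x = xMin + i * step
    (x, intercept + slope * x)
  }
}

val line = fittedLine(0.0, 10.0, 5)
// Print as "x,y" rows that any plotting tool can read.
line.foreach { case (x, y) => println(s"$x,$y") }
```

Plotting these rows as a line and the training points as a scatter on the same axes shows how well the straight line follows the data.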
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.sql.SparkSession

object PredictiveAnalysis {
  val session = SparkSession.builder().master("local").appName("PredictiveAnalysis").getOrCreate()

  def main(args: Array[String]): Unit = {
    val data = session.sparkContext.textFile("C:\\Users\\Test\\new_workspace\\PredictionAlgo\\src\\main\\resources\\data.txt")
    // Parse each line into a LabeledPoint: the first value is the label,
    // and the first and last values are used as the features.
    val parsedData = data.map { line =>
      val x: Array[String] = line.replace(",", " ").split(" ")
      val y = x.map(_.toDouble)
      val d = y.size - 1
      val c = Vectors.dense(y(0), y(d))
      LabeledPoint(y(0), c)
    }.cache()
    val numIterations = 100
    val stepSize = 0.00000001
    val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)
    // Pair each actual label with the model's prediction for its features.
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    valuesAndPreds.foreach(result => println(s"actual label: ${result._1}, predicted label: ${result._2}"))
    val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
    println("training Mean Squared Error = " + MSE)
  }
}

How to get probability from predictions using GeneralizedLinearRegression model using spark

I'm new to machine learning and I was trying to use the binomial family of the GeneralizedLinearRegression model in Spark.
I tried this,
val trainingData = sparkSession.read.format("libsvm").load("trainingData.txt")
val testData = sparkSession.read.format("libsvm").load("testData.txt")
val glr = new GeneralizedLinearRegression().setFamily("binomial").setLink("logit").setRegParam(0.3).setMaxIter(10)
val glrModel = glr.fit(trainingData)
glrModel.transform(testData).show()
For my testData, I got my prediction value as 1.0E-16. And when I'm using LogisticRegression, it gives probability(0.765394663) and prediction(0.0) value.
I want to know,
How to predict classes using GeneralizedLinearRegression from prediction value. Should I find classes from prediction value by using a threshold value ?
How to find probability of the predicted value ?

How to define a function and pass training and test datasets in Scala?

I want to define a function in Scala to which I can pass my training and test datasets, and which then runs a simple machine learning algorithm and returns some statistics. How should I do that? What should the parameter data types be?
Imagine you need to define a function that takes training and test datasets, performs a simple classification algorithm, and then returns the accuracy.
What I expect to have is something like the following:
val data = MLUtils.loadLibSVMFile(sc, datadir + "/example.txt");
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L);
val training = splits(0).cache();
val test = splits(1);
val results1 = SVMFunction(training, test)
val results2 = RegressionFunction(training, test)
val results3 = ClassificationFunction(training, test)
I just need the declaration of the functions, not the code that produces results1, results2, and results3.
def SVMFunction ("I need help here"){
//I know how to work with the training and test datasets to generate the results.
//So no need to discuss what should be here
}
Thanks.
In case you're using supervised learning you should opt for LabeledPoint. Excerpt from mllib doc:
A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ....
An example is:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
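With LabeledPoint as the element type, the declaration asked for could look roughly like the sketch below, assuming the datasets are RDDs as returned by MLUtils.loadLibSVMFile. The body is only a placeholder to make the example complete (SVMWithSGD with 100 iterations is an arbitrary choice), and the four data points are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Both datasets are RDD[LabeledPoint]; the result is the accuracy on the test set.
def SVMFunction(training: RDD[LabeledPoint], test: RDD[LabeledPoint]): Double = {
  // Placeholder body: train a linear SVM and count correct test predictions.
  val model = SVMWithSGD.train(training, 100)
  val correct = test.filter(p => model.predict(p.features) == p.label).count()
  correct.toDouble / test.count()
}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("svm-function-sketch"))

// Made-up, linearly separable points, reused here as both training and test data.
val data = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)),
  LabeledPoint(1.0, Vectors.dense(2.0, 1.0, 2.5)),
  LabeledPoint(0.0, Vectors.dense(-1.0, 0.0, -3.0)),
  LabeledPoint(0.0, Vectors.dense(-2.0, -1.0, -2.5))
))

val accuracy = SVMFunction(data, data)
println(s"accuracy = $accuracy")
```

RegressionFunction and ClassificationFunction could share the same parameter types, with the return type changed to whatever statistics they should report.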