How to classify new training example after model training in apache spark? - scala

Reading the src of https://spark.apache.org/docs/1.5.2/ml-ann.html :
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row
// Load training data
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_multiclass_classification_data.txt").toDF()
// Split the data into train and test
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network:
// input layer of size 4 (features), two intermediate of size 5 and 4 and output of size 3 (classes)
val layers = Array[Int](4, 5, 4, 3)
// create the trainer and set its parameters
val trainer = new MultilayerPerceptronClassifier()
.setLayers(layers)
.setBlockSize(128)
.setSeed(1234L)
.setMaxIter(100)
// train the model
val model = trainer.fit(train)
// compute precision on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator()
.setMetricName("precision")
println("Precision:" + evaluator.evaluate(predictionAndLabels))
Once the model has been trained how can a new training example be classified ?
Can a new training example be added to the model where the label is not set and the classifier will try to classify this training example based on the training data ?
Why is it required that the dataframe labels are of type Double ?

Firstly, the only way to add another observation to the model is by incorporating that data point into the training data, in this case to your train variable. In order to achieve this, you can convert that point into a DataFrame (obviously of only one record) and then use the unionAll method. Nevertheless, you will have to retrain the model using this new dataset.
However, to classify observations using your model you will have to convert your unclassified data into a DataFrame with the same structure that had your training data. And then use the method transform of your model. By the way, notice that models have that method, because they are subclasses of Transformer.
Finally, you have to use Double because that is the way how the LabeledPoint class was defined. It receives a Double as label and a SparseVector or DenseVector as features. I don't know the exact motivation but in my own experience, which isn't wide, all classification and regression algorithms work with float point numbers.Furthermore, gradient descent algorithm, which is widely used to fit most models, uses numbers not letters nor classes to compute the error in each iteration.

Related

Kmeans Spark ML

I would like to perform KMeans using the Spark ML. Input is a libsvm dataset:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
// Start time
//val intial_Data=spark.read.option("header",true).csv("C://sample_lda_data.txt")
val dataset = spark.read.format("libsvm").load("C:\\spark\\data\\mllib\\sample_kmeans_data.txt")
// Trains a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
// Evaluate clustering by computing Within Set Sum of Squared Errors.
val WSSSE = model.computeCost(dataset)
println(s"Within Set Sum of Squared Errors = $WSSSE")
// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
So i would like to use a csv file and apply KMeans by the Spark ML.
I did this:
val intial_Data=spark.read.option("header",true).csv("C://sample_lda_data.txt")
val arrayCol= array(inputData.columns.drop(1).map(col).map(_.cast(DoubleType)): _*)
import spark.implicits._
// select array column and first column, and map into LabeledPoints
val result = inputData.select(col("col1").cast(DoubleType), arrayCol).map(r => LabeledPoint(r.getAs[Double](0),Vectors.dense(r.getAs[WrappedArray[Double]](1).toArray)))
// Trains a k-means model
val kmeans = new KMeans().setK(2)
val model = kmeans.fit(result)
// Evaluate clustering by computing Within Set Sum of Squared Errors.
val WSSSE = model.computeCost(dataset)
println(s"Within Set Sum of Squared Errors = $WSSSE")
// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
I tried to turn csv file into a Dataset[LabledPoint].
Is my transformation correct?
In spark 2 instead of MLlib , we are using ML package. Which workon dataset and ML flows work in pipeline model. What u need to do is U have to make a dataset and make two columns feature,label. feature is the a vector of features u need to feed into the algo. The other column label is the target column. To make feature column u just need to use vector assembler to assemble all the features u want to use. If you have a target colunm then rename it as label. after fitting this dataset into algo u will get your model.

In mllib LinearRegressionWithSGD, I can not control numIterations, why?

I have one ml task, 34461 samples and I use LinearRegressionWithSGD to train model. But the results of the two training result(MSE and Correlation) are the same:
train model (setNumIterations(5))
val linearRegressionWithSGD: LinearRegressionWithSGD = new LinearRegressionWithSGD()
linearRegressionWithSGD.setIntercept(true)
linearRegressionWithSGD.optimizer.setStepSize(0.1)
linearRegressionWithSGD.optimizer.setNumIterations(5)
train model (setNumIterations(10000))
val linearRegressionWithSGD: LinearRegressionWithSGD = new LinearRegressionWithSGD()
linearRegressionWithSGD.setIntercept(true)
linearRegressionWithSGD.optimizer.setStepSize(0.1)
linearRegressionWithSGD.optimizer.setNumIterations(10000)
I can not control the number of iterations, but why?

How to get probability from predictions using GeneralizedLinearRegression model using spark

I'm newbie to machine-learning and I was trying to implement binomial family of GeneralizedLinearRegression model using spark.
I tried this,
val trainingData = sparkSession.read.format("libsvm").load("trainingData.txt")
val testData = sparkSession.read.format("libsvm").load("testData.txt")
val glr = new GeneralizedLinearRegression().setFamily("binomial").setLink("logit").setRegParam(0.3).setMaxIter(10)
val glrModel = glr.fit(trainingData)
model.transform(testData).show()
For my testData, I got my prediction value as 1.0E-16. And when I'm using LogisticRegression, it gives probability(0.765394663) and prediction(0.0) value.
I want to know,
How to predict classes using GeneralizedLinearRegression from prediction value. Should I find classes from prediction value by using a threshold value ?
How to find probability of the predicted value ?

Spark random forest binary classifier metrics

How can we get model metrics when training a random forest binary classifier model in Spark Mllib (F score, AUROC, AUPRC etc.)?
The issue is that BinaryClassificationMetrics takes probabilities while the predict method of a RandomForest classifier returns discrete values 0 or 1.
See: https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#binary-classification
A RandomForest.trainClassifier does not have any clearThreshold method which would make it return probabilities instead of discrete 0 or 1 labels.
We need to use the new ml DataFrames based API to get the probabilities instead of the RDD based mllib API.
Update
Following is updated example from Spark documentation to use a BinaryClassificationEvaluator and display the metrics: Area Under Receiver Operating Characteristic (AUROC) and Area Under Precision Recall Curve (AUPRC).
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
// Load and parse the data file, converting it to a DataFrame.
val data = sqlContext.read.format("libsvm").load("D:/Sources/spark/data/mllib/sample_libsvm_data.txt")
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(data)
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a RandomForest model.
val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setNumTrees(10)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels)
// Chain indexers and forest in a Pipeline
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions
.select("indexedLabel", "rawPrediction", "prediction")
.show()
val binaryClassificationEvaluator = new BinaryClassificationEvaluator()
.setLabelCol("indexedLabel")
.setRawPredictionCol("rawPrediction")
def printlnMetric(metricName: String): Unit = {
println(metricName + " = " + binaryClassificationEvaluator.setMetricName(metricName).evaluate(predictions))
}
printlnMetric("areaUnderROC")
printlnMetric("areaUnderPR")

How to define a function and pass training and test datasets in Scala?

I want to define a function in Scala in which I can pass my training and test datasets and then it perform a simple machine learning algorithm and returns some statistics. How should do that? What will be the parameters data type?
Imagine, you need to define a function which by taking training and test datasets performs a simple classification algorithm and then return the accuracy.
What I expect to have is like as follow:
val data = MLUtils.loadLibSVMFile(sc, datadir + "/example.txt");
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L);
val training = splits(0).cache();
val test = splits(1);
val results1 = SVMFunction(training, test)
val results2 = RegressionFunction(training, test)
val results3 = ClassificationFunction(training, test)
I need just the declaration of the functions and not the code that produce the results1, results2, and results3.
def SVMFunction ("I need help here"){
//I know how to work with the training and test datasets to generate the results.
//So no need to discuss what should be here
}
Thanks.
In case you're using supervised learning you should opt for LabeledPoint. Excerpt from mllib doc:
A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ....
And example is:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))