Handling unseen categorical variables and MaxBins calculation in Spark Multiclass-classification - scala

Below is the code I have for a RandomForest multiclass-classification model. I am reading from a CSV file and doing various transformations as seen in the code.
I am calculating the maximum number of categories and then passing it as the maxBins parameter to the RF. This takes a lot of time! Is there a parameter to set, or an easier way, to make the model infer the maximum number of categories automatically? The count can exceed 1000 and I cannot omit those columns.
How do I handle unseen labels in new data at prediction time, since StringIndexer will not work in that case? The code below only splits the existing data, but I will be introducing new data in the future.
// Need to predict 2 classes
val cols_to_predict=Array("Label1","Label2")
// ID col
val omit_cols=Array("Key")
// reading the csv file
val data = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("abc.csv")
.cache()
// creating a features DF by dropping the labels so that I can run all
// the columns through StringIndexer
val features=data.drop("Label1").drop("Label2").drop("Key")
// Since I do not know my max categories possible, I find it out
// and use it for maxBins parameter in RF
val distinct_col_counts=features.columns.map(x => data.select(x).distinct().count ).max
val transformers: Array[org.apache.spark.ml.PipelineStage] = features.columns.map(
cname => new StringIndexer().setInputCol(cname).setOutputCol(s"${cname}_index").fit(features)
)
val assembler = new VectorAssembler()
.setInputCols(features.columns.map(cname => s"${cname}_index"))
.setOutputCol("features")
val labelIndexer2 = new StringIndexer()
.setInputCol("prog_label2")
.setOutputCol("Label2")
.fit(data)
val labelIndexer1 = new StringIndexer()
.setInputCol("orig_label1")
.setOutputCol("Label1")
.fit(data)
val rf = new RandomForestClassifier()
.setLabelCol("Label1")
.setFeaturesCol("features")
.setNumTrees(100)
.setMaxBins(distinct_col_counts.toInt)
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer1.labels)
// Split into train and test
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
trainingData.cache()
testData.cache()
// Running only for one label for now Label1
val stages: Array[org.apache.spark.ml.PipelineStage] = transformers :+ labelIndexer1 :+ assembler :+ rf :+ labelConverter //:+ labelIndexer2
val pipeline=new Pipeline().setStages(stages)
val model=pipeline.fit(trainingData)
val predictions = model.transform(testData)
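A sketch of two possible shortcuts, assuming Spark 2.2 or later for StringIndexer's "keep" option: the distinct counts can be computed in a single pass with countDistinct instead of one Spark job per column, and setHandleInvalid("keep") makes a fitted StringIndexer map labels it never saw during fit to an extra index instead of failing on new data.
import org.apache.spark.sql.functions.{col, countDistinct}
// One pass over the data instead of one job per column
val distinctCounts = features
  .select(features.columns.map(c => countDistinct(col(c)).alias(c)): _*)
  .first()
val maxCategories = features.columns.indices.map(distinctCounts.getLong).max
// "keep" (Spark 2.2+) sends unseen categories to an extra index at transform time
val transformers: Array[org.apache.spark.ml.PipelineStage] = features.columns.map(cname =>
  new StringIndexer()
    .setInputCol(cname)
    .setOutputCol(s"${cname}_index")
    .setHandleInvalid("keep")
    .fit(features)
)
// maxBins then needs to cover the extra "unseen" bucket as well:
// rf.setMaxBins(maxCategories.toInt + 1)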

Related

Predict and accuracy using neural network with Scala spark

I am a new Spark user working in Scala. Here is my code, but I cannot figure out how to calculate predictions and accuracy.
Do I have to transform the CSV file into libsvm format, or can I just load the CSV file?
object Test2 {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder
.appName("WineQualityDecisionTreeRegressorPMML")
.master("local")
.getOrCreate()
// Load and parse the data file.
val df = spark.read
.format("csv")
.option("header", "true")
.option("mode", "DROPMALFORMED")
.option("delimiter", ",")
.load("file:///c:/tmp/spark-warehouse/winequality_red_names.csv")
val inputFields = List("fixed acidity", "volatile acidity", "citric acid", "residual sugar", "chlorides",
"free sulfur dioxide", "total sulfur dioxide", "density", "pH", "sulphates", "alcohol")
val toDouble = udf[Double, String]( _.toDouble)
val dff = df.
withColumn("fixed acidity", toDouble(df("fixed acidity"))). // 0 +
withColumn("volatile acidity", toDouble(df("volatile acidity"))). // 1 +
withColumn("citric acid", toDouble(df("citric acid"))). // 2 -
withColumn("residual sugar", toDouble(df("residual sugar"))). // 3 +
withColumn("chlorides", toDouble(df("chlorides"))). // 4 -
withColumn("free sulfur dioxide", toDouble(df("free sulfur dioxide"))). // 5 +
withColumn("total sulfur dioxide", toDouble(df("total sulfur dioxide"))). // 6 +
withColumn("density", toDouble(df("density"))). // 7 -
withColumn("pH", toDouble(df("pH"))). // 8 +
withColumn("sulphates", toDouble(df("sulphates"))). // 9 +
withColumn("alcohol", toDouble(df("alcohol"))) // 10 +
val assembler = new VectorAssembler().
setInputCols(inputFields.toArray).
setOutputCol("features")
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
.setInputCol("quality")
.setOutputCol("indexedLabel")
.fit(dff)
// specify layers for the neural network:
// input layer of size 11 (features), two intermediate of size 10 and 20
// and output of size 6 (classes)
val layers = Array[Int](11, 10, 20, 6)
// Train a multilayer perceptron model.
val dt = new MultilayerPerceptronClassifier()
.setLayers(layers)
.setBlockSize(128)
.setSeed(1234L)
.setMaxIter(100)
.setLabelCol("indexedLabel")
.setFeaturesCol("features")
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels)
// create pipeline
val pipeline = new Pipeline()
.setStages(Array(assembler, labelIndexer, dt, labelConverter))
// Train model
val model = pipeline.fit(dff)
}
}
Any idea please?
I can't find any example of a neural network with a CSV file using a Pipeline.
Once you have your model trained (val model = pipeline.fit(dff)), you need to predict the label for every test sample using the model.transform method. For each prediction you then check whether it matches the actual label. Accuracy is the ratio of correctly classified samples to the total number of samples you predicted on.
If you want to use the same DataFrame that was used for training, then simply call val predictions = model.transform(dff), iterate over the predictions and check whether they match the corresponding labels. However, I do not recommend reusing the DataFrame; it's better to split it into training and testing subsets.
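As a concrete sketch of that, Spark ML's MulticlassClassificationEvaluator can compute the accuracy directly from the prediction column once the data is split into training and test sets (column names taken from the code above):
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// Hold out a test set so accuracy is measured on unseen data
val Array(trainDf, testDf) = dff.randomSplit(Array(0.7, 0.3), seed = 1234L)
val trainedModel = pipeline.fit(trainDf)
val predictions = trainedModel.transform(testDf)
// Compare the indexed label with the predicted index and report the fraction that match
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
println(s"Test accuracy = ${evaluator.evaluate(predictions)}")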

How to set data for logistic regression in scala?

I am new to Scala and I want to implement a logistic regression model. Initially I load a CSV file as below:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("D:/sample.txt")
The file is as below:
P,P,A,A,A,P,NB
N,N,A,A,A,N,NB
A,A,A,A,A,A,NB
P,P,P,P,P,P,NB
N,N,P,P,P,N,NB
A,A,P,P,P,A,NB
P,P,A,P,P,P,NB
P,P,P,A,A,P,NB
P,P,A,P,A,P,NB
P,P,A,A,P,P,NB
P,P,P,P,A,P,NB
P,P,P,A,P,P,NB
N,N,A,P,P,N,NB
N,N,P,A,A,N,NB
N,N,A,P,A,N,NB
N,N,A,P,A,N,NB
N,N,A,A,P,N,NB
N,N,P,P,A,N,NB
N,N,P,A,P,N,NB
A,A,A,P,P,A,NB
A,A,P,A,A,A,NB
A,A,A,P,A,A,NB
A,A,A,A,P,A,NB
A,A,P,P,A,A,NB
A,A,P,A,P,A,NB
P,N,A,A,A,P,NB
N,P,A,A,A,N,NB
P,N,A,A,A,N,NB
P,N,P,P,P,P,NB
N,P,P,P,P,N,NB
Then I want to train the model with the code below:
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setFeaturesCol("Feature")
.setLabelCol("Label")
Then I fit the model as below:
val lrModel = lr.fit(df)
println(lrModel.coefficients +"are the coefficients")
println(lrModel.interceptVector+"is the intercept vector")
println(lrModel.summary +"is summary")
But it is not printing the results.
Any help is appreciated.
from your code:
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setFeaturesCol("Feature") <- here
.setLabelCol("Label") <- here
you are setting the features column and the label column. As you didn't mention column names, I am assuming the column containing the NB values is your label and you want to include all the others as predictor columns.
All predictor variables that you want to include in your model need to be in the form of a single vector column, generally called the features column. You need to create it using VectorAssembler as follows:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
//creating features column
val assembler = new VectorAssembler()
.setInputCols(Array(" insert your column names here "))
.setOutputCol("Feature")
Refer: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler.
Now you can proceed to fit the logistic regression model. A Pipeline is used to combine multiple transformations before fitting the data.
val pipeline = new Pipeline().setStages(Array(assembler,lr))
//fitting the model
val lrModel = pipeline.fit(df)
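One caveat: VectorAssembler only accepts numeric, boolean, and vector columns, and the sample rows above contain string values (P/N/A) plus the NB label, so each column will typically need a StringIndexer first. Below is a rough sketch of the full pipeline; the column names c1 to c6 and class are hypothetical, since the question does not show a header row.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
// Hypothetical column names -- replace with the real header of the CSV
val featureCols = Array("c1", "c2", "c3", "c4", "c5", "c6")
// One StringIndexer per string-valued feature column, turning P/N/A into numeric indices
val indexers: Array[org.apache.spark.ml.PipelineStage] = featureCols.map(c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
)
// Index the label column (the one holding NB, ...) into the "Label" column expected by lr
val labelIndexer = new StringIndexer().setInputCol("class").setOutputCol("Label")
// Assemble the indexed feature columns into the single "Feature" vector column expected by lr
val assembler = new VectorAssembler()
  .setInputCols(featureCols.map(c => s"${c}_idx"))
  .setOutputCol("Feature")
val pipeline = new Pipeline().setStages(indexers :+ labelIndexer :+ assembler :+ lr)
val lrModel = pipeline.fit(df)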

Spark ML Convert Prediction label to string without training DataFrame

I am using NaiveBayes multinomial classifier in Apache Spark ML (version 2.1.0) to predict some text categories.
The problem is: how do I convert the prediction labels (0.0, 1.0, 2.0) back to strings without the training DataFrame?
I know IndexToString can be used, but it's only helpful if training and prediction happen in the same job. In my case they are independent jobs.
The code looks like this:
1) TrainingModel.scala: train the model and save it to a file.
2) CategoryPrediction.scala: load the trained model from the file and make predictions on test data.
Please suggest the solution:
TrainingModel.scala
val trainData: Dataset[LabeledRecord] = spark.read.option("inferSchema", "false")
.schema(schema).csv("trainingdata1.csv").as[LabeledRecord]
val labelIndexer = new StringIndexer().setInputCol("category").setOutputCol("label").fit(trainData).setHandleInvalid("skip")
val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
.setInputCol("words")
.setOutputCol("features")
.setNumFeatures(1000)
val rf = new NaiveBayes().setLabelCol("label").setFeaturesCol("features").setModelType("multinomial")
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, labelIndexer, rf))
val model = pipeline.fit(trainData)
model.write.overwrite().save("naivebayesmodel");
CategoryPrediction.scala
val testData: Dataset[PredictLabeledRecord] = spark.read.option("inferSchema", "false")
.schema(predictSchema).csv("testingdata.csv").as[PredictLabeledRecord]
val model = PipelineModel.load("naivebayesmodel")
val predictions = model.transform(testData)
// val labelConverter = new IndexToString()
// .setInputCol("prediction")
// .setOutputCol("predictedLabelString")
// .setLabels(trainDataFrameIndexer.labels)
predictions.select("prediction", "text").show(false)
trainingdata1.csv
category,text
Drama,"a b c d e spark"
Action,"b d"
Horror,"spark f g h"
Thriller,"hadoop mapreduce"
testingdata.csv
text
"a b c d e spark"
"spark f g h"
Add a converter that will translate the prediction categories back to your labels in your pipeline, something like this:
val categoryConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("category")
.setLabels(labelIndexer.labels)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, labelIndexer, rf, categoryConverter))
This will take the prediction and convert it back to a label using your labelIndexer.
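Because the converter is saved as part of the PipelineModel, the separate prediction job never needs the training DataFrame; a minimal sketch of the prediction side, reusing the names from CategoryPrediction.scala above:
val model = PipelineModel.load("naivebayesmodel")
val predictions = model.transform(testData)
// "category" is produced by the IndexToString stage stored inside the saved model
predictions.select("text", "category").show(false)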

Error adding VectorAssembler to Spark ML Pipeline

Trying to add a VectorAssembler to the GBT pipeline example, I get an error that the pipeline cannot find the features field. I'm bringing in a sample CSV file instead of a libsvm file, so I needed to assemble the feature set myself.
Error:
Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("data/training_example.csv")
val sampleDF = df.sample(false,0.05,987897L)
val assembler = new VectorAssembler()
.setInputCols(Array("val1","val2","val3",...,"valN"))
.setOutputCol("features")
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(sampleDF)
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(sampleDF)
val Array(trainingData, testData) = sampleDF.randomSplit(Array(0.7, 0.3))
val gbt = new GBTClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setMaxIter(3)
.setMaxDepth(5)
val pipeline = new Pipeline()
.setStages(Array(assembler,labelIndexer,featureIndexer,gbt))
val model = pipeline.fit(trainingData)
val predictions = model.transform(testData)
predictions.show(10)
Basic problem:
Why are you calling fit() on featureIndexer?
If you call fit(sampleDF), VectorIndexer will search for a features column in sampleDF, but this dataset has no such column.
The Pipeline's fit() runs all transformers and estimators in order: it applies the assembler, then fits labelIndexer on that result, and passes the previous step's output on to featureIndexer's fit().
The DataFrame used by featureIndexer.fit() inside the Pipeline will therefore have all the columns generated by the previous transformers.
In your code sampleDF doesn't have a features column; during the Pipeline's fit() that column will be added by the assembler.
The documentation sample has a features column from the beginning:
val data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
You must fit on a DataFrame that has a features column, so transform your original DataFrame with VectorAssembler and give that as the input.
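Concretely, one fix is to stop pre-fitting the VectorIndexer and let the Pipeline fit it after the assembler has produced the features column; a minimal sketch using the names from the question:
// Not fitted here: the Pipeline will fit it on data that already contains "features"
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
val pipeline = new Pipeline()
  .setStages(Array(assembler, labelIndexer, featureIndexer, gbt))
// assembler adds "features" first, so featureIndexer is fitted on a DataFrame that has it
val model = pipeline.fit(trainingData)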

How to get probabilities corresponding to the class from Spark ML random forest

I've been using org.apache.spark.ml.Pipeline for machine learning tasks. It is particularly important to know the actual probabilities instead of just a predicted label, and I am having difficulty getting them. Here I am doing a binary classification task with a random forest. The class labels are "Yes" and "No". I would like to output the probability for the label "Yes". The probabilities are stored in a DenseVector in the pipeline output, such as [0.69, 0.31], but I don't know which one corresponds to "Yes" (0.69 or 0.31?). I guess there should be some way to retrieve it from the labelIndexer?
Here is my code for training the model:
val sc = new SparkContext(new SparkConf().setAppName(" ML").setMaster("local"))
val data = .... // load data from file
val df = sqlContext.createDataFrame(data).toDF("label", "features")
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(df)
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(2)
.fit(df)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels)
val Array(trainingData, testData) = df.randomSplit(Array(0.7, 0.3))
// Train a RandomForest model.
val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setNumTrees(10)
.setFeatureSubsetStrategy("auto")
.setImpurity("gini")
.setMaxDepth(4)
.setMaxBins(32)
// Create pipeline
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, featureIndexer, rf,labelConverter))
// Train model
val model = pipeline.fit(trainingData)
// Save model
sc.parallelize(Seq(model), 1).saveAsObjectFile("/my/path/pipeline")
Then I load the pipeline and make predictions on new data; here is the code snippet:
// Ignoring loading data part
// Create DF
val testdf = sqlContext.createDataFrame(testData).toDF("features", "line")
// Load pipeline
val model = sc.objectFile[org.apache.spark.ml.PipelineModel]("/my/path/pipeline").first
// My Question comes here : How to extract the probability that corresponding to class label "1"
// This is my attempt. I would like to output the probability for label "Yes" and the predicted label. The probabilities are stored in a DenseVector, but I don't know which one corresponds to "Yes". Something like this:
val predictions = model.transform(testdf).select("probability").map(e=> e.asInstanceOf[DenseVector])
References regarding the probabilities and labels for RF:
http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forests
Do you mean that you want to extract the probability of the positive label from the DenseVector? If so, you can create a UDF to pull that probability out.
In the DenseVector for binary classification, the first element holds the probability of label "0" and the second element the probability of label "1".
val prediction = pipelineModel.transform(result)
val pre = prediction.select(getOne($"probability")).withColumnRenamed("UDF(probability)","probability")
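The getOne UDF is not defined in the answer; assuming it is meant to extract the second element (the probability of the label indexed as 1) from the probability vector, a minimal version might look like this (org.apache.spark.ml.linalg.Vector on Spark 2.x, org.apache.spark.mllib.linalg.Vector on 1.x):
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf
// Hypothetical definition of getOne: probability(1) is the probability of the label indexed as 1
val getOne = udf((probability: Vector) => probability(1))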
You're on the right track with retrieving it from label indexer.
See comments in the code for more information.
This example works with Scala 2.11.8 and Spark 2.2.1.
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.SparkConf
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Column, SparkSession}
object Example {
case class Record(features: org.apache.spark.ml.linalg.Vector)
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession
.builder
.appName("Example")
.config(new SparkConf().setMaster("local[2]"))
.getOrCreate
val sc = spark.sparkContext
import spark.implicits._
val data = sc.parallelize(
Array(
(Vectors.dense(0.9, 0.6), "n"),
(Vectors.dense(0.1, 0.1), "y"),
(Vectors.dense(0.2, 0.15), "y"),
(Vectors.dense(0.8, 0.9), "n"),
(Vectors.dense(0.3, 0.4), "y"),
(Vectors.dense(0.5, 0.5), "n"),
(Vectors.dense(0.6, 0.7), "n"),
(Vectors.dense(0.3, 0.3), "y"),
(Vectors.dense(0.3, 0.3), "y"),
(Vectors.dense(-0.5, -0.1), "dunno"),
(Vectors.dense(-0.9, -0.6), "dunno")
)).toDF("features", "label")
// NOTE: you're fitting StringIndexer to all your data.
// The StringIndexer orders the labels by label frequency.
// In this example there are 5 "y" labels, 4 "n" labels
// and 2 "dunno" labels, so the probability columns will be
// listed in the following order: "y", "n", "dunno".
// You can play with label frequencies to convince yourself
// that it sorts labels by frequency in provided data.
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("label_indexed")
.fit(data)
val indexToLabel = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predicted_label")
.setLabels(labelIndexer.labels)
// Here I use logistic regression, but the exact algorithm doesn't
// matter in this case.
val lr = new LogisticRegression()
.setFeaturesCol("features")
.setLabelCol("label_indexed")
.setPredictionCol("prediction")
val pipeline = new Pipeline().setStages(Array(
labelIndexer,
lr,
indexToLabel
))
val model = pipeline.fit(data)
// Prepare test set
val toPredictDf = sc.parallelize(Array(
Record(Vectors.dense(0.1, 0.5)),
Record(Vectors.dense(0.8, 0.8)),
Record(Vectors.dense(-0.2, -0.5))
)).toDF("features")
// Make predictions
val results = model.transform(toPredictDf)
// The column containing probabilities has to be converted from Vector to Array
val vecToArray = udf( (xs: org.apache.spark.ml.linalg.Vector) => xs.toArray )
val dfArr = results.withColumn("probabilityArr" , vecToArray($"probability") )
// labelIndexer.labels contains the list of your labels.
// It is zipped with index to match the label name with
// related probability found in probabilities array.
// In other words:
// label labelIndexer.labels.apply(idx)
// matches:
// col("probabilityArr").getItem(idx)
// See also: https://stackoverflow.com/a/49917851
val probColumns = labelIndexer.labels.zipWithIndex.map {
case (alias, idx) => (alias, col("probabilityArr").getItem(idx).as(alias))
}
// 'probColumns' is of type Array[(String, Column)] so now
// concatenate these Column objects to DataFrame containing predictions
// See also: https://stackoverflow.com/a/43494322
val columnsAdded = probColumns.foldLeft(dfArr) { case (d, (colName, colContents)) =>
if (d.columns.contains(colName)) {
d
} else {
d.withColumn(colName, colContents)
}
}
columnsAdded.show()
}
}
Once you run this code, it will produce the following data frame:
+-----------+---------------+--------------------+--------------------+--------------------+
| features|predicted_label| y| n| dunno|
+-----------+---------------+--------------------+--------------------+--------------------+
| [0.1,0.5]| y| 0.9999999999994298|5.702468131669394...|9.56953780171369E-19|
| [0.8,0.8]| n|5.850695258713685...| 1.0|4.13416875406573E-81|
|[-0.2,-0.5]| dunno|1.207908506571593...|8.157018363627128...| 0.9998792091493428|
+-----------+---------------+--------------------+--------------------+--------------------+
Columns y, n and dunno are the columns that we have just added to the ordinary output of Spark's ML pipeline.