Prepare data for MultilayerPerceptronClassifier in Scala

Please keep in mind I'm new to scala.
This is the example I am trying to follow:
https://spark.apache.org/docs/1.5.1/ml-ann.html
It uses this dataset:
https://github.com/apache/spark/blob/master/data/mllib/sample_multiclass_classification_data.txt
I have prepared my .csv using the code below to get a data frame for classification in Scala.
//imports for ML
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row
//imports for transformation
import sqlContext.implicits._
import com.databricks.spark.csv._
import org.apache.spark.mllib.linalg.{Vector, Vectors}
//load data
val data2 = sqlContext.csvFile("/Users/administrator/Downloads/ds_15k_10-2.csv")
//Rename any one column to features
//val df2 = data.withColumnRenamed("ip_crowding", "features")
val DF2 = data2.select("gst_id_matched","ip_crowding","lat_long_dist");
scala> DF2.take(2)
res6: Array[org.apache.spark.sql.Row] = Array([0,0,0], [0,0,1628859.542])
//define the string-to-double conversion udf
val toDouble = udf[Double, String]( _.toDouble)
//Convert all to double
val featureDf = DF2
.withColumn("gst_id_matched",toDouble(DF2("gst_id_matched")))
.withColumn("ip_crowding",toDouble(DF2("ip_crowding")))
.withColumn("lat_long_dist",toDouble(DF2("lat_long_dist")))
.select("gst_id_matched","ip_crowding","lat_long_dist")
//Define the format
val toVec4 = udf[Vector, Double,Double] { (v1,v2) => Vectors.dense(v1,v2) }
//Encode the label, which is gst_id_matched
val encodeLabel = udf[Double, String]( _ match
{ case "0.0" => 0.0 case "1.0" => 1.0} )
//Transformed dataset
val df = featureDf
.withColumn("features",toVec4(featureDf("ip_crowding"),featureDf("lat_long_dist")))
.withColumn("label",encodeLabel(featureDf("gst_id_matched")))
.select("label", "features")
val splits = df.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network:
// input layer of size 4 (features), two intermediate of size 5 and 4 and output of size 3 (classes)
val layers = Array[Int](0, 0, 0, 0)
// create the trainer and set its parameter
val trainer = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(12).setSeed(1234L).setMaxIter(10)
// train the model
val model = trainer.fit(train)
The last line generates this error
15/11/21 22:46:23 ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 15)
java.lang.ArrayIndexOutOfBoundsException: 0
My suspicions:
When I examine the dataset, it looks fine for classification:
scala> df.take(2)
res3: Array[org.apache.spark.sql.Row] = Array([0.0,[0.0,0.0]], [0.0,[0.0,1628859.542]])
But the Apache example dataset is different, and my transformation does not give me what I need. Can someone please help me with the dataset transformation, or help me understand the root cause of the problem?
This is what the apache dataset looks like:
scala> data.take(1)
res8: Array[org.apache.spark.sql.Row] = Array([1.0,(4,[0,1,2,3],[-0.222222,0.5,-0.762712,-0.833333])])

The source of your problem is an incorrect definition of layers. When you use
val layers = Array[Int](0, 0, 0, 0)
it means you want a network with zero nodes in each layer, which simply doesn't make sense. Generally speaking, the number of neurons in the input layer should equal the number of features, each hidden layer should contain at least one neuron, and the output layer should have one neuron per class.
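For example (just a sketch of the rule; the hidden-layer sizes are arbitrary), with two features and two classes a valid architecture would look something like:
// 2 inputs (features), two hidden layers, 2 outputs (classes)
val layers = Array[Int](2, 5, 4, 2)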
Let's recreate your data, simplifying your code along the way:
import org.apache.spark.sql.functions.col
val df = sc.parallelize(Seq(
("0", "0", "0"), ("0", "0", "1628859.542")
)).toDF("gst_id_matched", "ip_crowding", "lat_long_dist")
Convert all columns to doubles:
val numeric = df
.select(df.columns.map(c => col(c).cast("double").alias(c)): _*)
.withColumnRenamed("gst_id_matched", "label")
Assemble features:
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(Array("ip_crowding","lat_long_dist"))
.setOutputCol("features")
val data = assembler.transform(numeric)
data.show
// +-----+-----------+-------------+-----------------+
// |label|ip_crowding|lat_long_dist| features|
// +-----+-----------+-------------+-----------------+
// | 0.0| 0.0| 0.0| (2,[],[])|
// | 0.0| 0.0| 1628859.542|[0.0,1628859.542]|
// +-----+-----------+-------------+-----------------+
Train and test network:
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
val layers = Array[Int](2, 3, 5, 3) // Note 2 neurons in the input layer
val trainer = new MultilayerPerceptronClassifier()
.setLayers(layers)
.setBlockSize(128)
.setSeed(1234L)
.setMaxIter(100)
val model = trainer.fit(data)
model.transform(data).show
// +-----+-----------+-------------+-----------------+----------+
// |label|ip_crowding|lat_long_dist| features|prediction|
// +-----+-----------+-------------+-----------------+----------+
// | 0.0| 0.0| 0.0| (2,[],[])| 0.0|
// | 0.0| 0.0| 1628859.542|[0.0,1628859.542]| 0.0|
// +-----+-----------+-------------+-----------------+----------+
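If you also want a quick quality check (not part of the original answer; just a sketch using the evaluator the question already imports), you can score the transformed output with MulticlassClassificationEvaluator:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// Uses the default "label" and "prediction" columns; "f1" is available across Spark versions
val evaluator = new MulticlassClassificationEvaluator().setMetricName("f1")
println("F1 on the toy data: " + evaluator.evaluate(model.transform(data)))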

Related

Unable to "explode" spark vector from PCA

I am trying to make a scatter plot of the 2 features resulting from PCA in the Spark ML library.
To be more precise I am trying to convert result into something like this:
_________
id | X | Y
__________
1 |0.1|0.1
2 |0.2|0.2
3 |0.4|0.4
4 |0.3|0.3
...
from something like this
_________
id | pca
__________
1 |[0.1,0.1]
2 |[0.2,0.2]
3 |[0.4,0.4]
4 |[0.3,0.3]
...
But it seems that Spark vectors aren't iterable, or something like that. I don't understand what is going on. If someone knows the answer, that would be great.
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.VectorAssembler
val convertToVector = udf((array: Array[Double]) => {
Vectors.dense(array.toArray)
})
val convertToDouble = udf((array: Array[Float]) => {
array.map(_.toDouble).toArray
})
val ds = model.userFactors.withColumn("features", convertToDouble($"features"))
val userMatrixDs = ds.withColumn("features", convertToVector($"features"))
//val df3 = assembler.transform(df2)
val pca = new PCA()
.setInputCol("features")
.setOutputCol("pca")
.setK(2)
.fit(userMatrixDs)
// Project vectors to the linear space spanned by the top 2 principal
// components, keeping the label
val result = pca.transform(userMatrixDs).select("id","pca");
result.show()
result.select(
result.id,
result.col("pca")[0].as("eigenVector1"),
result.col("pca")[1].as("eigenVector2")
)
.show()
Welcome to StackOverflow. Take a look at this example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, DoubleType}
import org.apache.spark.ml.feature.VectorAssembler
val df = spark.createDataFrame(
spark.sparkContext.parallelize(Seq(Row(1, 1.0, 2.0))),
StructType(
List(
StructField("id", IntegerType),
StructField("one", DoubleType),
StructField("two", DoubleType)
)
))
import org.apache.spark.ml.linalg.Vector
import spark.implicits._
val assembler =
new VectorAssembler()
.setInputCols(Array("one", "two"))
.setOutputCol("vector")
val df0 = assembler.transform(df)
df0
.select("id", "vector")
.as[(Int, Vector)]
.map { case (id, vector) =>
val arr = vector.toArray
(id, arr(0), arr(1))
}
.select($"_1".as("id"), $"_2".as("pca_x"), $"_3".as("pca_y"))
First I create a Vector column with VectorAssembler, and then I extract the values by converting the result to a Dataset[(Int, Vector)]. With map you can easily manipulate each row.
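Another option (a sketch, not part of the original answer) is to stay in DataFrame land and pull the components out with a udf:
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.linalg.Vector
// Convert the vector column to an array so elements can be selected by index
val vecToArray = udf((v: Vector) => v.toArray)
df0
.withColumn("arr", vecToArray($"vector"))
.select($"id", $"arr".getItem(0).as("pca_x"), $"arr".getItem(1).as("pca_y"))
.show()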

Calculating Precision and Recall for specific threshold values

I want to set the threshold value of my logistic regression to 0.5, and I want to get the precision, recall, and F1 score for this value for the pipeline model. But
model.setThreshold(0.5)
give me
value setThreshold is not a member of
org.apache.spark.ml.PipelineModel
val Array(train, test) = dataset
.randomSplit(Array(0.8, 0.2), seed = 1234L)
.map(_.cache())
val assembler = new VectorAssembler()
.setInputCols(Array("label", "id", "features"))
.setOutputCol("feature")
val pca = new PCA()
.setInputCol("feature")
.setK(2)
.setOutputCol("pcaFeatures")
val classifier = new LogisticRegression()
.setFeaturesCol("pcaFeatures")
.setLabelCol("label")
.setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
val pipeline = new Pipeline().setStages(Array(assembler, pca, classifier))
val model = pipeline.fit(train)
val predicted = model.transform(test)
predicted.show()
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.Row
val predictions = predicted.filter(row => row.getAs[Int]("label") == 1).map(row => (row.getAs[Int]("label"), row.getAs[DenseVector]("probability")(0)))
predictions.show()
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
val predictionAndLabels = predicted.
select($"label",$"prediction").
as[(Double, Double)].
rdd
val metrics = new BinaryClassificationMetrics(predictionAndLabels)
val precision = metrics.precisionByThreshold()
precision.foreach { case (t, p) =>
println(s"Threshold is: $t, Precision is: $p")
}
val recall = metrics.recallByThreshold
recall.foreach { case (t, p) =>
println(s"Threshold is: $t,recall is: $p")
}
+---+-------------------+-----+--------------------+--------------------+--------------------+--------------------+----------+
| id| features|label| feature| pcaFeatures| rawPrediction| probability|prediction|
+---+-------------------+-----+--------------------+--------------------+--------------------+--------------------+----------+
| 3|0.03731932516607228| 1|[1.0,3.0,0.037319...|[-3.0000000581646...|[-0.8840273374633...|[0.29234391132806...| 1.0|
| 7| 0.9636476860201426| 1|[1.0,7.0,0.963647...|[-7.0000000960209...|[-0.8831455606697...|[0.29252636578097...| 1.0|
| 8| 0.4766320058073684| 0|[0.0,8.0,0.476632...|[-8.0000000194785...|[0.87801311177017...|[0.70641031990863...| 0.0|
| 45| 0.1474318959104205| 1|[1.0,45.0,0.14743...|[-45.000000062664...|[-0.8839183791391...|[0.29236645302163...| 1.0|
|103| 0.3443839885873453| 1|[1.0,103.0,0.3443...|[-103.00000007071...|[-0.8837251994055...|[0.29240642125330...| 1.0|
How do I set the threshold value of my logistic regression model within a pipeline?
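A common way to do this (a sketch, not from the original post): setThreshold is defined on the LogisticRegressionModel stage rather than on the PipelineModel, so you can pull that stage out of the fitted pipeline and change the threshold there:
import org.apache.spark.ml.classification.LogisticRegressionModel
// The classifier is the last stage of the fitted pipeline above
val lrModel = model.stages.last.asInstanceOf[LogisticRegressionModel]
lrModel.setThreshold(0.5)
// Later transforms through the pipeline now use the 0.5 threshold for "prediction"
val rethresholded = model.transform(test)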

How to generate a DataFrame with random content and N rows?

How can I create a Spark DataFrame in Scala with 100 rows and 3 columns that have random integer values in range (1, 100)?
I know how to create a DataFrame manually, but I cannot automate it:
val df = sc.parallelize(Seq((1,20, 40), (60, 10, 80), (30, 15, 30))).toDF("col1", "col2", "col3")
Generating the data locally and then parallelizing it is totally fine, especially if you don't have to generate a lot of data.
However, should you ever need to generate a huge dataset, you can always implement an RDD that does this for you in parallel, as in the following example.
import scala.reflect.ClassTag
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
// Each random partition will hold `numValues` items
final class RandomPartition[A: ClassTag](val index: Int, numValues: Int, random: => A) extends Partition {
def values: Iterator[A] = Iterator.fill(numValues)(random)
}
// The RDD will parallelize the workload across `numSlices`
final class RandomRDD[A: ClassTag](@transient private val sc: SparkContext, numSlices: Int, numValues: Int, random: => A) extends RDD[A](sc, deps = Seq.empty) {
// Based on the item and executor count, determine how many values are
// computed in each executor. Distribute the rest evenly (if any).
private val valuesPerSlice = numValues / numSlices
private val slicesWithExtraItem = numValues % numSlices
// Just ask the partition for the data
override def compute(split: Partition, context: TaskContext): Iterator[A] =
split.asInstanceOf[RandomPartition[A]].values
// Generate the partitions so that the load is as evenly spread as possible
// e.g. 10 partitions and 22 items -> 2 slices with 3 items and 8 slices with 2
override protected def getPartitions: Array[Partition] =
((0 until slicesWithExtraItem).view.map(new RandomPartition[A](_, valuesPerSlice + 1, random)) ++
(slicesWithExtraItem until numSlices).view.map(new RandomPartition[A](_, valuesPerSlice, random))).toArray
}
Once you have this, you can use it, passing your own random data generator, to get an RDD[Int]:
val rdd = new RandomRDD(spark.sparkContext, 10, 22, scala.util.Random.nextInt(100) + 1)
rdd.foreach(println)
/*
* outputs:
* 30
* 86
* 75
* 20
* ...
*/
or an RDD[(Int, Int, Int)]
def rand = scala.util.Random.nextInt(100) + 1
val rdd = new RandomRDD(spark.sparkContext, 10, 22, (rand, rand, rand))
rdd.foreach(println)
/*
* outputs:
* (33,22,15)
* (65,24,64)
* (41,81,44)
* (58,7,18)
* ...
*/
and of course you can wrap it in a DataFrame very easily as well:
spark.createDataFrame(rdd).show()
/*
* outputs:
* +---+---+---+
* | _1| _2| _3|
* +---+---+---+
* |100| 48| 92|
* | 34| 40| 30|
* | 98| 63| 61|
* | 95| 17| 63|
* | 68| 31| 34|
* .............
*/
Notice how in this case the generated data is different every time the RDD/DataFrame is acted upon. By changing the implementation of RandomPartition to actually store the values instead of generating them on the fly, you can have a stable set of random items, while still retaining the flexibility and scalability of this approach.
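A minimal sketch of that variation (hypothetical class name, not from the original answer), which materializes the values once instead of regenerating them:
// Stores the generated values so repeated actions see the same data
final class StableRandomPartition[A: ClassTag](val index: Int, numValues: Int, random: => A) extends Partition {
  private val data: Seq[A] = Seq.fill(numValues)(random)
  def values: Iterator[A] = data.iterator
}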
One nice property of the stateless approach is that you can generate a huge dataset even locally. The following ran in a few seconds on my laptop:
new RandomRDD(spark.sparkContext, 10, Int.MaxValue, 42).count
// returns: 2147483647
Here you go, Seq.fill is your friend:
def randomInt1to100 = scala.util.Random.nextInt(100)+1
val df = sc.parallelize(
Seq.fill(100){(randomInt1to100,randomInt1to100,randomInt1to100)}
).toDF("col1", "col2", "col3")
You can simply use scala.util.Random to generate random numbers within the range, loop for 100 rows, and finally use the createDataFrame API:
import scala.util.Random
val data = 1 to 100 map(x => (1+Random.nextInt(100), 1+Random.nextInt(100), 1+Random.nextInt(100)))
sqlContext.createDataFrame(data).toDF("col1", "col2", "col3").show(false)
You can use the generic code below:
import scala.util.Random
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
//no of rows required
val rows = 15
//no of columns required
val cols = 10
val spark = SparkSession.builder
.master("local[*]")
.appName("testApp")
.config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
.getOrCreate()
import spark.implicits._
val columns = 1 to cols map (i => "col" + i)
// create the DataFrame schema with these columns (in that order)
val schema = StructType(columns.map(StructField(_, IntegerType)))
val lstrows = Seq.fill(rows * cols)(Random.nextInt(100) + 1).grouped(cols).toList.map { x => Row(x: _*) }
val rdd = spark.sparkContext.makeRDD(lstrows)
val df = spark.createDataFrame(rdd, schema)
If you need to create a large amount of random data, Spark provides an object called RandomRDDs that can generate datasets filled with random numbers following a uniform, normal, or various other distributions.
https://spark.apache.org/docs/latest/mllib-statistics.html#random-data-generation
From their example:
import org.apache.spark.mllib.random.RandomRDDs._
// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
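To connect this back to the question (a sketch only, not from the linked docs), you can also build the random-integer DataFrame directly from RandomRDDs, assuming a SparkSession named spark:
import org.apache.spark.mllib.random.RandomRDDs._
import spark.implicits._
// 100 rows of 3 uniform values in [0, 1), scaled to integers in [1, 100]
val intDf = uniformVectorRDD(sc, 100L, 3)
.map(v => (1 + (v(0) * 100).toInt, 1 + (v(1) * 100).toInt, 1 + (v(2) * 100).toInt))
.toDF("col1", "col2", "col3")
intDf.show(5)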

Match Dataframe Categorical Variables in vector Spark Scala

I have been trying to follow the Stack Overflow example about creating data frames for the machine learning (ml) library in Spark Scala:
How to create correct data frame for classification in Spark ML
However, I cannot get the matching udf to work.
Syntax: "kinds of the type arguments (Vector,Int,Int,String,String) do
not conform to the expected kinds of the type parameters (type RT,type
A1,type A2,type A3,type A4). Vector's type parameters do not match
type RT's expected parameters: type Vector has one type parameter, but
type RT has none"
I need to create a data frame to input into the logistic regression library. The source sample data looks like this:
Source, Amount, Account, Fraud
CACC1, 9120.50, 999, 0
CACC2, 3897.25, 999, 0
AMXCC1, -523, 999, 0
MASCC2, -8723.15, 999, 0
I suppose my desired output is:
+-------------------+-----+
| features|label|
+-------------------+-----+
|[1.0,9120.50,999] | 0.0|
|[1.0,3897.25,999] | 0.0|
|[2.0,-523.00,999] | 0.0|
|[0.0,-8723.15,999] | 0.0|
+-------------------+-----+
So far I have:
val df = sqlContext.sql("select * from prediction_test")
val df_2 = df.select("source","amount","account")
val toVec3 = udf[Vector,String,Int,Int] { (a,b,c) =>
val e3 = c match {
case "MASCC2" => 0
case "CACC1" => 1
case "AMXCC1" => 2
}
Vectors.dense(e1, b, c)
}
val encodeLabel = udf[Double, Int](_match{case "0" => 0.0 case "1" => 1.0})
val df_3 = df_2.withColumn("features", toVec3(df_2("source"),df_2("amount"),df_2("account")).withColumn("label", encodeLabel(df("fraud"))).select("features","label")
Using Spark 2.3.1, I suggest the following code for a classification-ready Spark ML Pipeline. If you want to include a classification object in the Pipeline, you just need to add it where I point out. ClassificationPipeline returns a PipelineModel. Once you transform your data with this model, you get classification-ready columns named features and label.
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame
// Handles categorical features
def stringIndexerPipeline(inputCol: String): (Pipeline, String) = {
val indexer = new StringIndexer()
.setHandleInvalid("skip")
.setInputCol(inputCol)
.setOutputCol(inputCol + "_indexed")
val pipeline = new Pipeline().setStages(Array(indexer))
(pipeline, inputCol + "_indexed")
}
// Classification Pipeline Function
def ClassificationPipeline(df:DataFrame): PipelineModel = {
// Preprocessing categorical features
val (SourcePipeline, Source_indexed) = stringIndexerPipeline("Source")
// Use StringIndexer output as input for OneHotEncoderEstimator
val oneHotEncoder = new OneHotEncoderEstimator()
//.setDropLast(true)
//.setHandleInvalid("skip")
.setInputCols(Array("Source_indexed"))
.setOutputCols(Array("Source_indexedVec"))
// Gather the features that will be passed through the pipeline
val inputCols = oneHotEncoder.getOutputCols ++ Array("Amount","Account")
// Put all inputs in a column as a vector
val vectorAssembler = new VectorAssembler()
.setInputCols(inputCols)
.setOutputCol("featureVector")
// Scale vector column
val standartScaler = new StandardScaler()
.setInputCol("featureVector")
.setOutputCol("features")
.setWithStd(true)
.setWithMean(false)
// Create stringindexer for label col
val labelIndexer = new StringIndexer().
setHandleInvalid("skip").
setInputCol("Fraud").
setOutputCol("label")
// create classification object in here
// val classificationObject = new ....
// Create a pipeline
val pipeline = new Pipeline().setStages(
Array(SourcePipeline, oneHotEncoder, vectorAssembler, standartScaler, labelIndexer/*, classificationObject*/))
pipeline.fit(df)
}
val pipelineModel = ClassificationPipeline(df)
val transformedDF = pipelineModel.transform(df)
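As a quick usage check (just a sketch), the transformed DataFrame should now expose the two columns a classifier expects:
// Inspect the assembled, scaled features and the indexed label
transformedDF.select("features", "label").show(5, truncate = false)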

How to get probabilities corresponding to the class from Spark ML random forest

I've been using org.apache.spark.ml.Pipeline for machine learning tasks. It is particularly important to know the actual probabilities instead of just a predicted label, and I am having difficulties getting them. Here I am doing a binary classification task with random forest. The class labels are "Yes" and "No". I would like to output the probability for label "Yes". The probabilities are stored in a DenseVector as the pipeline output, such as [0.69, 0.31], but I don't know which one corresponds to "Yes" (0.69 or 0.31?). I guess there should be some way to retrieve it from labelIndexer?
Here is my code for training the model:
val sc = new SparkContext(new SparkConf().setAppName(" ML").setMaster("local"))
val data = .... // load data from file
val df = sqlContext.createDataFrame(data).toDF("label", "features")
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(df)
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(2)
.fit(df)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels)
val Array(trainingData, testData) = df.randomSplit(Array(0.7, 0.3))
// Train a RandomForest model.
val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setNumTrees(10)
.setFeatureSubsetStrategy("auto")
.setImpurity("gini")
.setMaxDepth(4)
.setMaxBins(32)
// Create pipeline
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, featureIndexer, rf,labelConverter))
// Train model
val model = pipeline.fit(trainingData)
// Save model
sc.parallelize(Seq(model), 1).saveAsObjectFile("/my/path/pipeline")
Then I will load the pipeline and make predictions on new data; here is the code piece:
// Ignoring loading data part
// Create DF
val testdf = sqlContext.createDataFrame(testData).toDF("features", "line")
// Load pipeline
val model = sc.objectFile[org.apache.spark.ml.PipelineModel]("/my/path/pipeline").first
// My Question comes here : How to extract the probability that corresponding to class label "1"
// This is my attempt. I would like to output the probability for label "Yes" and the predicted label. The probabilities are stored in a DenseVector, but I don't know which one corresponds to "Yes". Something like this:
val predictions = model.transform(testdf).select("probability").map(e=> e.asInstanceOf[DenseVector])
References regarding the probabilities and labels for RF:
http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forests
Do you mean that you want to extract the probability of the positive label from the DenseVector? If so, you can create a udf function to pull that probability out.
In the DenseVector produced by binary classification, the first element is the probability of "0" and the second is the probability of "1".
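The snippet below relies on a getOne udf that the original answer never defines; a minimal sketch of what it presumably does, assuming Spark 2.x ml vectors and that it returns the second element (the probability of label "1"):
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf
// Assumed definition: pick the probability of the positive class out of the probability vector
val getOne = udf((v: Vector) => v(1))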
val prediction = pipelineModel.transform(result)
val pre = prediction.select(getOne($"probability")).withColumnRenamed("UDF(probability)","probability")
You're on the right track with retrieving it from label indexer.
See comments in the code for more information.
This example works with Scala 2.11.8 and Spark 2.2.1.
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.SparkConf
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Column, SparkSession}
object Example {
case class Record(features: org.apache.spark.ml.linalg.Vector)
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession
.builder
.appName("Example")
.config(new SparkConf().setMaster("local[2]"))
.getOrCreate
val sc = spark.sparkContext
import spark.implicits._
val data = sc.parallelize(
Array(
(Vectors.dense(0.9, 0.6), "n"),
(Vectors.dense(0.1, 0.1), "y"),
(Vectors.dense(0.2, 0.15), "y"),
(Vectors.dense(0.8, 0.9), "n"),
(Vectors.dense(0.3, 0.4), "y"),
(Vectors.dense(0.5, 0.5), "n"),
(Vectors.dense(0.6, 0.7), "n"),
(Vectors.dense(0.3, 0.3), "y"),
(Vectors.dense(0.3, 0.3), "y"),
(Vectors.dense(-0.5, -0.1), "dunno"),
(Vectors.dense(-0.9, -0.6), "dunno")
)).toDF("features", "label")
// NOTE: you're fitting StringIndexer to all your data.
// The StringIndexer orders the labels by label frequency.
// In this example there are 5 "y" labels, 4 "n" labels
// and 2 "dunno" labels, so the probability columns will be
// listed in the following order: "y", "n", "dunno".
// You can play with label frequencies to convince yourself
// that it sorts labels by frequency in provided data.
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("label_indexed")
.fit(data)
val indexToLabel = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predicted_label")
.setLabels(labelIndexer.labels)
// Here I use logistic regression, but the exact algorithm doesn't
// matter in this case.
val lr = new LogisticRegression()
.setFeaturesCol("features")
.setLabelCol("label_indexed")
.setPredictionCol("prediction")
val pipeline = new Pipeline().setStages(Array(
labelIndexer,
lr,
indexToLabel
))
val model = pipeline.fit(data)
// Prepare test set
val toPredictDf = sc.parallelize(Array(
Record(Vectors.dense(0.1, 0.5)),
Record(Vectors.dense(0.8, 0.8)),
Record(Vectors.dense(-0.2, -0.5))
)).toDF("features")
// Make predictions
val results = model.transform(toPredictDf)
// The column containing probabilities has to be converted from Vector to Array
val vecToArray = udf( (xs: org.apache.spark.ml.linalg.Vector) => xs.toArray )
val dfArr = results.withColumn("probabilityArr" , vecToArray($"probability") )
// labelIndexer.labels contains the list of your labels.
// It is zipped with index to match the label name with
// related probability found in probabilities array.
// In other words:
// label labelIndexer.labels.apply(idx)
// matches:
// col("probabilityArr").getItem(idx)
// See also: https://stackoverflow.com/a/49917851
val probColumns = labelIndexer.labels.zipWithIndex.map {
case (alias, idx) => (alias, col("probabilityArr").getItem(idx).as(alias))
}
// 'probColumns' is of type Array[(String, Column)] so now
// concatenate these Column objects to DataFrame containing predictions
// See also: https://stackoverflow.com/a/43494322
val columnsAdded = probColumns.foldLeft(dfArr) { case (d, (colName, colContents)) =>
if (d.columns.contains(colName)) {
d
} else {
d.withColumn(colName, colContents)
}
}
columnsAdded.show()
}
}
Once you run this code, it will produce the following data frame:
+-----------+---------------+--------------------+--------------------+--------------------+
| features|predicted_label| y| n| dunno|
+-----------+---------------+--------------------+--------------------+--------------------+
| [0.1,0.5]| y| 0.9999999999994298|5.702468131669394...|9.56953780171369E-19|
| [0.8,0.8]| n|5.850695258713685...| 1.0|4.13416875406573E-81|
|[-0.2,-0.5]| dunno|1.207908506571593...|8.157018363627128...| 0.9998792091493428|
+-----------+---------------+--------------------+--------------------+--------------------+
Columns y, n and dunno are the columns that we have just added to the ordinary output of Spark's ML pipeline.