I'm working with Spark MLlib in Scala for the first time and I'm having trouble instantiating the BinaryClassificationMetrics class. It gives a "Cannot resolve constructor" error even though I'm formatting its input as an RDD of tuples, as required. Any ideas what might be going wrong?
def modelEvaluation(model: PipelineModel, test: DataFrame): Unit = {
  // Make a prediction on the test set
  val predictionAndLabels = model.transform(test)
    .select("prediction", "label")
    .rdd
    .map(r => (r(0), r(1)))
    /*.collect()
    .foreach(r => println(r))*/

  // Instantiate metrics object
  val metrics = new BinaryClassificationMetrics(predictionAndLabels)

  // Precision-Recall Curve
  //val PRC = metrics.pr
}
BinaryClassificationMetrics needs an RDD[(Double, Double)]; see the API docs for details: https://spark.apache.org/docs/2.4.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
So you may change it like this:
def modelEvaluation(model: PipelineModel, test: DataFrame): Unit = {
  // Make a prediction on the test set
  val predictionAndLabels = model.transform(test)
    .select("prediction", "label")
    .rdd
    .map(r => (r(0).toString.toDouble, r(1).toString.toDouble))

  // Instantiate metrics object
  val metrics = new BinaryClassificationMetrics(predictionAndLabels)

  // Precision-Recall Curve
  //val PRC = metrics.pr
}
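If both columns are already of DoubleType in the DataFrame (the usual case for ML pipeline output), a slightly more direct variant is the sketch below; it avoids the string round-trip, though the toString.toDouble version above is more forgiving if the label happens to be stored as an integer type:

// alternative extraction, assuming "prediction" and "label" are DoubleType columns
val predictionAndLabels = model.transform(test)
  .select("prediction", "label")
  .rdd
  .map(r => (r.getDouble(0), r.getDouble(1)))

val metrics = new BinaryClassificationMetrics(predictionAndLabels)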
I am trying to create a Dataset with only one column from a case class.
Below is the code:
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

case class vectorData(value: Array[String], vectors: Vector)

def main(args: Array[String]) {
  val spark = SparkSession.builder
    .appName("Hello world!")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  //blah blah and read data etc.

  val word2vec = new Word2Vec()
    .setInputCol("value").setOutputCol("vectors")
    .setVectorSize(5).setMinCount(0).setWindowSize(5)

  val dataset = spark.createDataset(data)
  val model = word2vec.fit(dataset)

  val encoder = org.apache.spark.sql.Encoders.product[vectorData]
  val result = model.transform(dataset).as(encoder)

  //val output: Dataset[Vector] = ???
}
As shown in the last line of the code, I want the output to be only the 2nd column, which has Vector type and holds the vectors data.
I tried:
val output = result.map(o => o.vectors)
But this line is highlighted with the error: No implicit arguments of type: Encoder[Vector]
How can I resolve this?
I think adding an implicit Encoder[Vector] in scope should make
val output = result.map(o => o.vectors)
compile. Note that Vector is a trait rather than a case class, so Encoders.product[Vector] will not work here; a Kryo-based encoder does:
implicit val vectorEncoder: Encoder[Vector] = org.apache.spark.sql.Encoders.kryo[Vector]
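Putting it together, a minimal sketch (assuming the result Dataset built in the question):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{Dataset, Encoder, Encoders}

// Kryo serializes each Vector into a single binary column of the Dataset
implicit val vectorEncoder: Encoder[Vector] = Encoders.kryo[Vector]

val output: Dataset[Vector] = result.map(o => o.vectors)
output.collect().foreach(println)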
I am using Spark version 2.2.0 and Scala version 2.11.8.
I created and saved a decision tree binary classification model using following code:
package...

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession

object DecisionTreeClassification {

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local[*]")
      .appName("Decision Tree")
      .getOrCreate()

    // Load and parse the data file.
    val data = MLUtils.loadLibSVMFile(sparkSession.sparkContext, "path/to/file/xyz.txt")

    // Split the data into training and test sets (20% held out for testing)
    val splits = data.randomSplit(Array(0.8, 0.2))
    val (trainingData, testData) = (splits(0), splits(1))

    // Train a DecisionTree model.
    // Empty categoricalFeaturesInfo indicates all features are continuous.
    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val impurity = "gini"
    val maxDepth = 5
    val maxBins = 32

    val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
      impurity, maxDepth, maxBins)

    // Evaluate model on test instances and compute test error
    val labelAndPreds = testData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()
    println(s"Test Error = $testErr")
    println(s"Learned classification tree model:\n ${model.toDebugString}")

    // Save and load model
    model.save(sparkSession.sparkContext, "target/tmp/myDecisionTreeClassificationModel")
    val sameModel = DecisionTreeModel.load(sparkSession.sparkContext, "target/tmp/myDecisionTreeClassificationModel")
    sparkSession.sparkContext.stop()
  }
}
Now, I want to predict a label (0 or 1) for new data using this saved model. I am new to Spark; can anybody please let me know how to do that?
I found the answer to this question, so I thought I should share it in case someone is looking for the answer to a similar question.
To make predictions for new data, simply add a few lines before stopping the Spark session:
val newData = MLUtils.loadLibSVMFile(sparkSession.sparkContext, "path/to/file/abc.txt")

val newDataPredictions = newData.map { point =>
  val newPrediction = model.predict(point.features)
  (point.label, newPrediction)
}
newDataPredictions.foreach(f => println("Predicted label", f._2))
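If the new data does not come with labels (or is not in LIBSVM format at all), you can also feed a single feature vector straight to the loaded model. A minimal sketch, assuming hypothetical feature values and a model trained on three features:

import org.apache.spark.mllib.linalg.Vectors

// hypothetical feature values; the vector length must match the training data
val newPoint = Vectors.dense(0.5, 1.2, 3.4)
val predictedLabel = sameModel.predict(newPoint) // returns 0.0 or 1.0
println(s"Predicted label: $predictedLabel")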
I use Spark with Scala and have a problem with the TimestampType type.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{abs, col}

object regressionLinear {

  case class X(
    time: String, nodeID: Int, posX: Double, posY: Double,
    speed: Double, period: Int)

  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    /**
     * Read the input data
     */
    var dataset = "C:\\spark\\A6-d07-h08.csv"
    if (args.length > 0) {
      dataset = args(0)
    }

    val spark = SparkSession
      .builder
      .appName("regressionsol")
      .master("local[4]")
      .getOrCreate()
    import spark.implicits._

    val data = spark.sparkContext.textFile(dataset)
      .map(line => line.split(","))
      .map(userRecord => (userRecord(0).trim.toString,
        userRecord(1).trim.toInt, userRecord(2).trim.toDouble, userRecord(3).trim.toDouble, userRecord(4).trim.toDouble, userRecord(5).trim.toInt))
      .toDF("time", "nodeID", "posX", "posY", "speed", "period")
      .withColumn("time", $"time".cast("timestamp"))

    val assembler = new VectorAssembler()
      .setInputCols(Array(
        "time", "nodeID", "posX", "posY", "speed", "period"))
      .setOutputCol("features")

    val lr = new LinearRegression()
      .setLabelCol("period")
      .setFeaturesCol("features")
      .setRegParam(0.1)
      .setMaxIter(100)
      .setSolver("l-bfgs")

    val steps = Array(assembler, lr)
    val pipeline = new Pipeline()
      .setStages(steps)

    val Array(training, test) = data.randomSplit(Array(0.75, 0.25), seed = 12345)

    val model = pipeline.fit(training)

    val holdout = model.transform(test)
    holdout.show(20)

    val prediction = holdout.select("prediction", "period", "nodeID")
      .orderBy(abs(col("prediction") - col("period")))
    prediction.show(20)

    val rm = new RegressionMetrics(prediction.rdd.map {
      x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double])
    })
    println(s"RMSE = ${rm.rootMeanSquaredError}")
    println(s"R-squared = ${rm.r2}")

    spark.stop()
  }
}
This is the error:
Exception in thread "main" java.lang.IllegalArgumentException: Data type TimestampType of column time is not supported.
at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:124)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:184)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:136)
at regressionLinear$.main(regressionLinear.scala:100)
at regressionLinear.main(regressionLinear.scala)
VectorAssembler accepts only numeric columns. Other types of columns have to be encoded first. And considering that you apply LinearRegression, the data has to be encoded anyway.
The exact steps will depend on domain-specific knowledge:
If you expect a linear trend based on time, cast the field to numeric first, as shown in the sketch below.
If you expect some type of seasonal effect, you might have to extract individual components (day of week, hour / time of day, month, and so on), and typically apply StringIndexer + OneHotEncoder.
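A minimal sketch of the first option, assuming the data DataFrame built above (time_numeric is a hypothetical column name):

import org.apache.spark.sql.functions._

// Cast the timestamp to epoch seconds so VectorAssembler accepts it.
// The label column ("period") is left out of the feature list here.
val numericData = data.withColumn("time_numeric", $"time".cast("long").cast("double"))

val assembler = new VectorAssembler()
  .setInputCols(Array("time_numeric", "nodeID", "posX", "posY", "speed"))
  .setOutputCol("features")

// For seasonal effects you could instead extract components, e.g.:
// data.withColumn("hour", hour($"time")).withColumn("dayofweek", dayofweek($"time"))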
I have a problem with the two different MLlib implementations (org.apache.spark.ml and org.apache.spark.mllib) and KMeans. I am using the new implementation from org.apache.spark.ml, which uses DataFrames, but I'm struggling with the documentation and with how to predict a cluster index.
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}

/**
 * An example showcasing the use of kMeans
 */
object ExploreKMeans {

  // Spark configuration.
  // Retrieve sparkContext with spark.sparkContext.
  private val spark = SparkSession.builder()
    .appName("com.example.ml.exploration.kMeans")
    .master("local[*]")
    .getOrCreate()

  // This import, after the definition of a valid SQLContext, defines implicits for converting RDDs to DataFrames over .toDF().
  import spark.implicits._

  def main(args: Array[String]): Unit = {
    val data = spark.sparkContext.parallelize(Array((5.0, 2.0, 1.5), (2.0, 2.5, 2.3), (1.0, 2.1, 4.2), (2.0, 5.5, 8.5)))

    val df = data.toDF().map { row =>
      val label = row(0).asInstanceOf[Double]
      val value1 = row(1).asInstanceOf[Double]
      val value2 = row(2).asInstanceOf[Double]
      LabeledPoint(label, Vectors.dense(value1, value2))
    }

    val kmeans = new KMeans().setK(3).setSeed(1L)
    val model: KMeansModel = kmeans.fit(df)

    // Evaluate clustering by computing Within Set Sum of Squared Errors.
    val WSSSE = model.computeCost(df)
    println(s"Within Set Sum of Squared Errors = $WSSSE")

    // Shows the result.
    println("Cluster Centers: ")
    model.clusterCenters.foreach(println)

    //TODO How to predict cluster index?
    //model.predict(???
  }
}
How do I use the model to predict the cluster index of new values? The model.predict function is not visible. This API is really confusing...
Well, an easier way to do this is:
model.summary.predictions.show
Ok, I got it. Predictions are now done with the transform method:
println("Transform ")
val transformed = model.transform(df)
transformed.collect().foreach(println)
Cluster Centers:
[2.25,1.9]
[5.5,8.5]
[2.1,4.2]
Transform:
[5.0,[2.0,1.5],0]
[2.0,[2.5,2.3],0]
[1.0,[2.1,4.2],2]
[2.0,[5.5,8.5],1]
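The last value in each transformed row is the predicted cluster index (the prediction column that transform adds). For new values, a minimal sketch (assuming hypothetical points with the same two features used for training, and the import spark.implicits._ from above):

import org.apache.spark.ml.linalg.Vectors

// hypothetical new points; the vector length must match the training features
val newPoints = Seq(
  Vectors.dense(2.0, 2.0),
  Vectors.dense(5.0, 8.0)
).map(Tuple1.apply).toDF("features")

model.transform(newPoints).select("features", "prediction").show()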
I need to train a linear regression model on streaming data. I read the streaming data using textFileStream. But the problem is that RegressionMetrics accepts an RDD[(Double, Double)], while the output is a DStream[(Double, Double)].
How do I convert the output into an RDD[(Double, Double)] so that I can use RegressionMetrics?
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.dense(0.0, 0.0))
  .setStepSize(0.2)
  .setNumIterations(25)

val trainingData = ssc.textFileStream("/training/data/dir").map(LabeledPoint.parse)
val testData = ssc.textFileStream("/training/data/dir").map(LabeledPoint.parse)

model.trainOn(trainingData)
val output = model.predictOnValues(testData.map(lp => (lp.label, lp.features)))

val metrics = new RegressionMetrics(output) // does not compile: output is a DStream, not an RDD
val rmse = metrics.rootMeanSquaredError
Every DStream consists of a sequence of underlying RDDs (a separate one for every data batch), which can be accessed using the foreachRDD method:
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).foreachRDD { rdd =>
  val metrics = new RegressionMetrics(rdd)
  val rmse = metrics.rootMeanSquaredError
  // do something with `rmse` here
}