Spark ML throws exception for Decision Tree classification: Column features must be of type numeric but was actually of type struct - scala

I am trying to create a Spark ML model with the Decision Tree Classifier to perform classification , but I am getting an error saying the features in my training set should be of type numeric instead of type struct.
Here is the minimal reproducible example that I tried:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.VectorUDT
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml._
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
val df8 = Seq(
("2022-08-22 10:00:00",417.7,419.97,419.97,417.31,"nothing"),
("2022-08-22 11:30:00",417.35,417.33,417.46,416.77,"buy"),
("2022-08-22 13:00:00",417.55,417.68,418.04,417.48,"sell"),
("2022-08-22 14:00:00",417.22,417.8,421.13,416.83,"sell")
)
val df77 = spark.createDataset(df8).toDF("30mins_date","30mins_close","30mins_open","30mins_high","30mins_low", "signal")
val assembler_features = new VectorAssembler()
.setInputCols(Array("30mins_close","30mins_open","30mins_high","30mins_low"))
.setOutputCol("features")
val output2 = assembler_features.transform(df77)
val indexer = new StringIndexer()
.setInputCol("signal")
.setOutputCol("signalIndex")
val indexed = indexer.fit(output2).transform(output2)
val assembler_label = new VectorAssembler()
.setInputCols(Array("signalIndex"))
.setOutputCol("signalIndexV")
val output = assembler_label.transform(indexed)
val dt = new DecisionTreeClassifier()
.setLabelCol("features")
.setFeaturesCol("signalIndexV")
val Array(trainingData, testData) = output.select("features", "signalIndexV").randomSplit(Array(0.7, 0.3))
val model = dt.fit(trainingData)
Output error:
java.lang.IllegalArgumentException: requirement failed: Column features must be of type numeric but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:78)
at org.apache.spark.ml.PredictorParams.validateAndTransformSchema(Predictor.scala:54)
at org.apache.spark.ml.PredictorParams.validateAndTransformSchema$(Predictor.scala:47)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:73)
at org.apache.spark.ml.classification.ClassifierParams.validateAndTransformSchema(Classifier.scala:43)
at org.apache.spark.ml.classification.ClassifierParams.validateAndTransformSchema$(Classifier.scala:39)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:51)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:34)
at org.apache.spark.ml.classification.DecisionTreeClassifier.org$apache$spark$ml$tree$DecisionTreeClassifierParams$$super$validateAndTransformSchema(DecisionTreeClassifier.scala:46)
at org.apache.spark.ml.tree.DecisionTreeClassifierParams.validateAndTransformSchema(treeParams.scala:245)
at org.apache.spark.ml.tree.DecisionTreeClassifierParams.validateAndTransformSchema$(treeParams.scala:241)
at org.apache.spark.ml.classification.DecisionTreeClassifier.validateAndTransformSchema(DecisionTreeClassifier.scala:46)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:177)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:133)
... 61 elided
I tried above code in spark-shell environment:
spark v 3.3.1
scala v 2.12.15
Here is what trainingData looks like
+-----------------------------+------------+
|features |signalIndexV|
+-----------------------------+------------+
|[417.7,419.97,419.97,417.31] |[2.0] |
|[417.35,417.33,417.46,416.77]|[1.0] |
|[417.55,417.68,418.04,417.48]|[0.0] |
|[417.22,417.8,421.13,416.83] |[0.0] |
+-----------------------------+------------+
So what did I do wrong ? How can I convert column features into numeric type ?

Related

String Matching Spark ML algorithm in Scala

I am new to Scala as well as Spark ML. Trying to create a string matching algorithm based on recommendation from PySpark String Matching. Based on it I was able to implement below so far
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql._
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, RegexTokenizer}
import spark.implicits._
val vendorData = spark.read.option("header", "true").option("inferSchema", "true").json(path = "Data/*.json").as[vendorData]
// Load IMDB file into an Dataset
val imdbData = spark.read.option("header", "True").option("inferSchema", "True").option("sep", "\t").csv(path = "Data/title.basics.tsv").as[imdbData]
// Remove Special chaacters
val newVendorData = vendorData.withColumn("newtitle", functions.regexp_replace(vendorData.col("title"), "[^A-Za-z0-9_]",""))
val newImdbData = imdbData.withColumn("newprimaryTitle", functions.regexp_replace(imdbData.col("primaryTitle"), "[^A-Za-z0-9_]", ""))
//Algo to find match percentage
val tokenizer = new RegexTokenizer().setPattern("").setInputCol("text").setMinTokenLength(1).setOutputCol("tokens")
val ngram = new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams")
val vectorizer = new HashingTF().setInputCol("ngrams").setOutputCol("vectors")
val lsh = new MinHashLSH().setInputCol("vectors").setOutputCol("lsh")
val pipeline = new Pipeline().setStages(Array(tokenizer, ngram, vectorizer, lsh))
val model = pipeline.fit(newVendorData.select("newtitle"))
val vendorHashed = model.transform(newVendorData.select("newtitle"))
val imdbHashed = model.transform(newImdbData.select("newprimaryTitle"))
model.stages.last.asInstanceOf[ml.feature.MinHashLSHModel].approxSimilarityJoin(vendorHashed, imdbHashed, .85).show()
When running I am getting below error. On Further investigation I could find that the issue is at line no:
val model = pipeline.fit(newVendorData.select("newtitle"))
But can't see what it is.
Exception in thread "main" java.lang.IllegalArgumentException: text does not exist. Available: newtitle
at org.apache.spark.sql.types.StructType.$anonfun$apply$1(StructType.scala:278)
at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:168)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:277)
at org.apache.spark.ml.UnaryTransformer.transformSchema(Transformer.scala:109)
at org.apache.spark.ml.Pipeline.$anonfun$transformSchema$4(Pipeline.scala:184)
at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:184)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:136)
at MatchingJob$.$anonfun$main$1(MatchingJob.scala:84)
at MatchingJob$.$anonfun$main$1$adapted(MatchingJob.scala:43)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at MatchingJob$.main(MatchingJob.scala:43)
at MatchingJob.main(MatchingJob.scala)
Not sure what is wrong I am doing.
My inputs are below:
+------------------+
| newtitle|
+------------------+
| BhaagMilkhaBhaag|
| Fukrey|
| DilTohBacchaHaiJi|
|IndiasJungleHeroes|
| HrudayaGeethe|
**newprimaryTitle**
BhaagMilkhaBhaag
Fukrey
Carmencita
Leclownetseschiens
PauvrePierrot
Unbonbock
BlacksmithScene
ChineseOpiumDen
DilTohBacchaHaiJi
IndiasJungleHeroes
CorbettandCourtne
EdisonKinetoscopi
MissJerry
LeavingtheFactory
AkrobatischesPotp
TheArrivalofaTrain
ThePhotographical
TheWatererWatered
Autourdunecabine
Barquesortantduport
ItalienischerBaue
DasboxendeKnguruh
TheClownBarber
TheDerby1895

type mismatch; found : org.apache.spark.sql.DataFrame required: org.apache.spark.rdd.RDD

I am new to scala and mllib and I have been getting the following error. Please let me know if anyone has been able to resolve something similar.
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
.
.
.
val conf = new SparkConf().setMaster("local").setAppName("SampleApp")
val sContext = new SparkContext(conf)
val sc = SparkSession.builder().master("local").appName("SampleApp").getOrCreate()
val sampleData = sc.read.json("input/sampleData.json")
val clusters = KMeans.train(sampleData, 10, 10)
val WSSSE = clusters.computeCost(sampleData)
clusters.save(sc, "target/org/apache/spark/KMeansExample/KMeansModel")
val sameModel = KMeansModel.load(sContext, "target/org/apache/spark/KMeansExample/KMeansModel")
this above line gives an error as:
type mismatch; found : org.apache.spark.sql.DataFrame (which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
So I tried:
import org.apache.spark.ml.clustering.KMeans
val kmeans = new KMeans().setK(20)
val model = kmeans.fit(sampleData)
val predictions = model.transform(sampleData)
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
This gives the error:
Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
Available fields: address, attributes, business_id
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:266)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnTypes(SchemaUtils.scala:58)
at org.apache.spark.ml.util.SchemaUtils$.validateVectorCompatibleColumn(SchemaUtils.scala:119)
at org.apache.spark.ml.clustering.KMeansParams$class.validateAndTransformSchema(KMeans.scala:96)
at org.apache.spark.ml.clustering.KMeans.validateAndTransformSchema(KMeans.scala:285)
at org.apache.spark.ml.clustering.KMeans.transformSchema(KMeans.scala:382)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:341)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:340)
at org.apache.spark.ml.util.Instrumentation$$anonfun$11.apply(Instrumentation.scala:183)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:183)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:340)
I have been referring to https://spark.apache.org/docs/latest/ml-clustering.html and https://spark.apache.org/docs/latest/mllib-clustering.html
Edit
Using setFeaturesCol()
import org.apache.spark.ml.clustering.KMeans
val assembler = new VectorAssembler()
.setInputCols(Array("is_open", "review_count", "stars"))
.setOutputCol("features")
val output = assembler.transform(sampleData).select("features")
val kmeans = new KMeans().setK(20).setFeaturesCol("features")
val model = kmeans.fit(output)
val predictions = model.transform(sampleData)
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")
This gives a different error still:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.util.Utils$.getSimpleName(Ljava/lang/Class;)Ljava/lang/String;
at org.apache.spark.ml.util.Instrumentation.logPipelineStage(Instrumentation.scala:52)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:350)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:340)
at org.apache.spark.ml.util.Instrumentation$$anonfun$11.apply(Instrumentation.scala:183)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:183)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:340)
Thanks.
Use the scala pipeline
val assembler = new VectorAssembler()
.setInputCols(Array("feature1",feature2","feature3"))
.setOutputCol("assembled_features")
val scaler = new StandardScaler()
.setInputCol("assembled_features")
.setOutputCol("features")
.setWithStd(true)
.setWithMean(false)
val kmeans = new KMeans().setK(2).setSeed(1L)
// create the pipeline
val pipeline = new Pipeline()
.setStages(Array(assembler, scaler, kmeans))
// Fit the model
val clussterModel = pipeline.fit(train)

Cannot pass arrays from MongoDB into Spark Machine Learning functions that require Vectors

My use case:
Read data from a MongoDB collection of the form:
{
"_id" : ObjectId("582cab1b21650fc72055246d"),
"label" : 167.517838916715,
"features" : [
10.0964787450654,
218.621137772497,
18.8833848806122,
11.8010251302327,
1.67037687829152,
22.0766170950477,
11.7122322171201,
12.8014773524475,
8.30441804118235,
29.4821268054137
]
}
And pass it to the org.apache.spark.ml.regression.LinearRegression class to create a model for predictions.
My problem:
The Spark connector reads in "features" as Array[Double].
LinearRegression.fit(...) expects a DataSet with a Label column and a Features column.
The Features column must be of type VectorUDT (so DenseVector or SparseVector will work).
I cannot .map features from Array[Double] to DenseVector because there is no relevant Encoder:
Error:(23, 11) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
.map{case Row(label: Double, features: Array[Double]) => Row(label, Vectors.dense(features))}
Custom Encoders cannot be defined.
My question:
Is there a way I can set the configuration of the Spark connector to
read in the "features" array as a Dense/SparseVector?
Is there any
other way I can achieve this (without, for example, using an
intermediary .csv file and loading that using libsvm)?
My code:
import com.mongodb.spark.MongoSpark
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.{Row, SparkSession}
case class DataPoint(label: Double, features: Array[Double])
object LinearRegressionWithMongo {
def main(args: Array[String]) {
val spark = SparkSession
.builder()
.appName("LinearRegressionWithMongo")
.master("local[4]")
.config("spark.mongodb.input.uri", "mongodb://127.0.0.1/LinearRegressionTest.DataPoints")
.getOrCreate()
import spark.implicits._
val dataPoints = MongoSpark.load(spark)
.map{case Row(label: Double, features: Array[Double]) => Row(label, Vectors.dense(features))}
val splitData = dataPoints.randomSplit(Array(0.7, 0.3), 42)
val training = splitData(0)
val test = splitData(1)
val linearRegression = new LinearRegression()
.setLabelCol("label")
.setFeaturesCol("features")
.setRegParam(0.0)
.setElasticNetParam(0.0)
.setMaxIter(100)
.setTol(1e-6)
// Train the model
val startTime = System.nanoTime()
val linearRegressionModel = linearRegression.fit(training)
val elapsedTime = (System.nanoTime() - startTime) / 1e9
println(s"Training time: $elapsedTime seconds")
// Print the weights and intercept for linear regression.
println(s"Weights: ${linearRegressionModel.coefficients} Intercept: ${linearRegressionModel.intercept}")
val modelEvaluator = new ModelEvaluator()
println("Training data results:")
modelEvaluator.evaluateRegressionModel(linearRegressionModel, training, "label")
println("Test data results:")
modelEvaluator.evaluateRegressionModel(linearRegressionModel, test, "label")
spark.stop()
}
}
Any help would be ridiculously appreciated!
There is quick fix for this. If data has been loaded into a DataFrame called df which has:
id - SQL double.
features - SQL array<double>.
like this one
val df = Seq((1.0, Array(2.3, 3.4, 4.5))).toDF("id", "features")
you select columns you need for downstream processing:
val idAndFeatures = df.select("id", "features")
convert to statically typed Dataset:
val tuples = idAndFeatures.as[(Double, Seq[Double])]
map and convert back to Dataset[Row]:
val spark: SparkSession = ???
import spark.implicits._
import org.apache.spark.ml.linalg.Vectors
tuples.map { case (id, features) =>
(id, Vectors.dense(features.toArray))
}.toDF("id", "features")
You can find a detailed explanation what is the difference compared to you current approach here.

OneHotEncoder in Spark Dataframe in Pipeline

I've been trying to get an example running in Spark and Scala with the adult dataset .
Using Scala 2.11.8 and Spark 1.6.1.
The problem (for now) lies in the amount of categorical features in that dataset that all need to be encoded to numbers before a Spark ML algorithm can do its job..
So far I have this:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
object Adult {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("Adult example").setMaster("local[*]")
val sparkContext = new SparkContext(conf)
val sqlContext = new SQLContext(sparkContext)
val data = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("src/main/resources/adult.data")
val categoricals = data.dtypes filter (_._2 == "StringType")
val encoders = categoricals map (cat => new OneHotEncoder().setInputCol(cat._1).setOutputCol(cat._1 + "_encoded"))
val features = data.dtypes filterNot (_._1 == "label") map (tuple => if(tuple._2 == "StringType") tuple._1 + "_encoded" else tuple._1)
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.01)
val pipeline = new Pipeline()
.setStages(encoders ++ Array(lr))
val model = pipeline.fit(training)
}
}
However, this doesn't work. Calling pipeline.fit still contains the original string features and thus throws an exception.
How can I remove these "StringType" columns in a pipeline?
Or maybe I'm doing it completely wrong, so if someone has a different suggestion I'm happy to all input :).
The reason why I choose to follow this flow is because I have an extensive background in Python and Pandas, but am trying to learn both Scala and Spark.
There is one thing that can be rather confusing here if you're used to higher level frameworks. You have to index the features before you can use encoder. As it is explained in the API docs:
one-hot encoder (...) maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}
val df = Seq((1L, "foo"), (2L, "bar")).toDF("id", "x")
val categoricals = df.dtypes.filter (_._2 == "StringType") map (_._1)
val indexers = categoricals.map (
c => new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
)
val encoders = categoricals.map (
c => new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_enc")
)
val pipeline = new Pipeline().setStages(indexers ++ encoders)
val transformed = pipeline.fit(df).transform(df)
transformed.show
// +---+---+-----+-------------+
// | id| x|x_idx| x_enc|
// +---+---+-----+-------------+
// | 1|foo| 1.0| (1,[],[])|
// | 2|bar| 0.0|(1,[0],[1.0])|
// +---+---+-----+-------------+
As you can see there is no need to drop string columns from the pipeline. In practice OneHotEncoder will accept numeric column with NominalAttribute, BinaryAttribute or missing type attribute.

How to extract variable weight from spark pipeline logistic model?

I am currently trying to learn Spark Pipeline (Spark 1.6.0). I imported datasets (train and test) as oas.sql.DataFrame objects. After executing the following codes, the produced model is a oas.ml.tuning.CrossValidatorModel.
You can use model.transform (test) to predict based on the test data in Spark. However, I would like to compare the weights that model used to predict with that from R. How to extract the weights of the predictors and intercept (if any) of model? The Scala codes are:
import sqlContext.implicits._
import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
val conTrain = sc.textFile("AbsolutePath2Train.txt")
val conTest = sc.textFile("AbsolutePath2Test.txt")
// parse text and convert to sql.DataFrame
val train = conTrain.map { line =>
val parts = line.split(",")
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" +").map(_.toDouble)))
}.toDF()
val test =conTest.map{ line =>
val parts = line.split(",")
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" +").map(_.toDouble)))
}.toDF()
// set parameter space and evaluation method
val lr = new LogisticRegression().setMaxIter(400)
val pipeline = new Pipeline().setStages(Array(lr))
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)
// fit logistic model
val model = cv.fit(train)
// If you want to predict with test
val pred = model.transform(test)
My spark environment is not accessible. Thus, these codes are retyped and rechecked. I hope they are correct. So far, I have tried searching on webs, asking others. About my coding, welcome suggestions, and criticisms.
// set parameter space and evaluation method
val lr = new LogisticRegression().setMaxIter(400)
val pipeline = new Pipeline().setStages(Array(lr))
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)
// you can print lr model coefficients as below
val model = cv.bestModel.asInstanceOf[PipelineModel]
val lrModel = model.stages(0).asInstanceOf[LogisticRegressionModel]
println(s"LR Model coefficients:\n${lrModel.coefficients.toArray.mkString("\n")}")
Two steps:
Get the best pipeline from cross validation result.
Get the LR Model from the best pipeline. It's the first stage in your code example.
I was looking for exactly the same thing. You might already have the answer, but anyway, here it is.
import org.apache.spark.ml.classification.LogisticRegressionModel
val lrmodel = model.bestModel.asInstanceOf[LogisticRegressionModel]
print(model.weight, model.intercept)
I am still not sure about how to extract weights from "model" above. But by restructuring the process towards the official tutorial, the following works on spark-1.6.0:
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
val lr = new LogisticRegression().setMaxIter(400)
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
val trainValidationSplit = new TrainValidationSplit().setEstimator(lr).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setTrainRatio(0.8)
val restructuredModel = trainValidationSplit.fit(train)
val lrmodel = restructuredModel.bestModel.asInstanceOf[LogisticRegressionModel]
lrmodel.weigths
lrmodel.intercept
I noticed the difference between "lrmodel" here and "model" generated above:
model.bestModel --> gives oas.ml.Model[_] = pipeline_****
restructuredModel.bestModel --> gives oas.ml.Model[_] = logreg_****
That's why we can cast resturcturedModel.bestModel as LogisticRegressionModel but not that of model.bestModel. I'll add more when I understand the reason of the differences.