Spark ML, parameter for "rawPredictionCol" for Binary Classification - scala

I want to use the BinaryClassificationEvaluator in Spark ML to evaluate my model after my Pipeline. I use this code:
val gbt = new GBTClassifier()
.setLabelCol("Label_Index")
.setFeaturesCol("features")
.setMaxIter(10)
.setMaxDepth(7)
.setSubsamplingRate(0.1)
.setMinInstancesPerNode(15)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(indexer_2.labels)
val evaluator_auc = new BinaryClassificationEvaluator()
.setLabelCol("Label_Index")
.setRawPredictionCol("")
.setMetricName("areaUnderROC")
I don't really know which column I need to pass to "setRawPredictionCol()". I think I need to give it the result of my prediction, the column "prediction".
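For reference, a minimal sketch of how this is usually wired up, assuming Spark 2.2+ (where GBTClassifier writes a "rawPrediction" column by default), so the evaluator can point at that column; trainDF and testDF are hypothetical placeholders for your own data, and the indexer stages are omitted:
// minimal sketch, assuming Spark 2.2+: GBTClassifier emits "rawPrediction",
// which is also the default value of the evaluator's rawPredictionCol param
val pipeline = new Pipeline().setStages(Array(gbt, labelConverter))
val model = pipeline.fit(trainDF)          // trainDF: hypothetical training data
val predictions = model.transform(testDF)  // testDF: hypothetical test data
val auc = evaluator_auc
  .setRawPredictionCol("rawPrediction")
  .evaluate(predictions)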

Related

Exception: features must be of type org.apache.spark.ml.linalg.VectorUDT

I want to run PCA with KNN in Spark. I have a file that contains id and features.
> KNN.printSchema
root
|-- id: int (nullable = true)
|-- features: double (nullable = true)
code:
val dataset = spark.read.parquet("/usr/local/spark/dataset/data/user")
val features = new VectorAssembler()
.setInputCols(Array("id", "features" ))
.setOutputCol("features")
val Array(train, test) = dataset
.randomSplit(Array(0.7, 0.3), seed = 1234L)
.map(_.cache())
//create PCA matrix to reduce feature dimensions
val pca = new PCA()
.setInputCol("features")
.setK(5)
.setOutputCol("pcaFeatures")
val knn = new KNNClassifier()
.setTopTreeSize(dataset.count().toInt / 5)
.setFeaturesCol("pcaFeatures")
.setPredictionCol("predicted")
.setK(1)
val pipeline = new Pipeline()
.setStages(Array(pca, knn))
.fit(train)
The above code block throws this exception:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually ArrayType(DoubleType,true).
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.feature.PCAParams$class.validateAndTransformSchema(PCA.scala:54)
at org.apache.spark.ml.feature.PCAModel.validateAndTransformSchema(PCA.scala:125)
at org.apache.spark.ml.feature.PCAModel.transformSchema(PCA.scala:162)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
at KNN$.main(KNN.scala:63)
at KNN.main(KNN.scala)
Basically, you are trying to split the dataset into training and test, assemble features, run a PCA and then a classifier to predict something. The overall logic is correct but there are several problems with your code.
A PCA in Spark needs assembled features. You created an assembler, but you do not use it in the code.
You gave the name features to the output of the assembler, and you already have a column with that name. Since you do not use the assembler, you don't see an error, but if you did you would get this exception:
java.lang.IllegalArgumentException: Output column features already exists.
When running a classification, you need to specify at the very least the input features with setFeaturesCol and the label you are trying to learn with setLabelCol. You did not specify the label, and by default the label column is "label". You don't have any column with that name, hence the exception Spark throws at you.
Here is a working example of what you are trying to do.
// a funky dataset with 3 features (`x1`, `x2`, `x3`) and a label `y`,
// the class we are trying to predict.
val dataset = spark.range(10)
.select('id as "x1", rand() as "x2", ('id * 'id) as "x3")
.withColumn("y", (('x2 * 3 + 'x1) cast "int").mod(2))
.cache()
// splitting the dataset, that part was ok ;-)
val Array(train, test) = dataset
.randomSplit(Array(0.7, 0.3), seed = 1234L)
.map(_.cache())
// An assembler, the output name cannot be one of the inputs.
val assembler = new VectorAssembler()
.setInputCols(Array("x1", "x2", "x3"))
.setOutputCol("features")
// A pca, that part was ok as well
val pca = new PCA()
.setInputCol("features")
.setK(2)
.setOutputCol("pcaFeatures")
// A LogisticRegression classifier. (KNN is not part of spark's standard API, but
// requires the same minimum information: features and label)
val classifier = new LogisticRegression()
.setFeaturesCol("pcaFeatures")
.setLabelCol("y")
// And the full pipeline
val pipeline = new Pipeline().setStages(Array(assembler, pca, classifier))
val model = pipeline.fit(train)
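As a side note: if your original features column really is an array of doubles, as the exception reports, a hedged alternative is to convert it directly into an ML Vector instead of re-assembling it; a minimal sketch using the column names from your schema:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}
// convert ArrayType(DoubleType) into the VectorUDT that PCA expects
val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))
val withVectors = dataset.withColumn("featuresVec", toVector(col("features")))
// then point the PCA at the converted column
val pca = new PCA()
  .setInputCol("featuresVec")
  .setK(5)
  .setOutputCol("pcaFeatures")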

Select 2000+ columns as Features for Classification ML

I'm looking for a way to select a lot of columns (2000+) as features from a DataFrame. I don't want to write the names one by one.
I'm doing classification and I have around 2000 features.
data is a DataFrame with around 2000 columns.
First, I get all of the column names of my DataFrame and drop 9 columns because I don't need them.
My idea was to use all the column names to feed the VectorAssembler. The result should be something like [Value of the 1st Feature, Value of the 2nd Feature, Value of the 3rd Feature, ...] for the first row, and so on for the whole DataFrame.
But I get this error:
java.lang.IllegalArgumentException: Field "features" does not exist.
EDIT: If something is unclear, please let me know so that I can fix it.
I removed some Transformers (StringIndexer, VectorIndexer, IndexToString) because they are not the point of my question.
val array = data.columns drop(9)
val assembler = new VectorAssembler()
.setInputCols(array)
.setOutputCol("features")
val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2))
val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("features")
.setNumTrees(50)
val pipeline = new Pipeline()
.setStages(Array(assembler, rf))
val model = pipeline.fit(trainingData)
EDIT 2: I fixed my problem. I took out the VectorIndexer and used array in the VectorAssembler, and it worked perfectly.
Well, at least I get a result.
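For completeness, a minimal sketch of the pipeline as described in EDIT 2 (VectorIndexer removed, the column-name array fed straight to the VectorAssembler); indexedLabel is assumed to come from the StringIndexer omitted in the question:
val array = data.columns.drop(9)
val assembler = new VectorAssembler()
  .setInputCols(array)
  .setOutputCol("features")
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
  .setNumTrees(50)
// no VectorIndexer stage: the assembled vector goes straight to the forest
val pipeline = new Pipeline().setStages(Array(assembler, rf))
val model = pipeline.fit(trainingData)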

Field "features" does not exist. SparkML

I am trying to build a model in Spark ML with Zeppelin.
I am new to this area and would like some help. I think I need to set the correct data types for the columns and set the first column as the label. Any help would be greatly appreciated, thank you.
val training = sc.textFile("hdfs:///ford/fordTrain.csv")
val header = training.first
val inferSchema = true
val df = training.toDF
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
val lrModel = lr.fit(df)
// Print the coefficients and intercept for multinomial logistic regression
println(s"Coefficients: \n${lrModel.coefficientMatrix}")
println(s"Intercepts: ${lrModel.interceptVector}")
A snippet of the csv file i am using is:
IsAlert,P1,P2,P3,P4,P5,P6,P7,P8,E1,E2
0,34.7406,9.84593,1400,42.8571,0.290601,572,104.895,0,0,0,
As you have mentioned, you are missing the features column: a vector containing all the predictor variables. You have to create it using VectorAssembler.
IsAlert is the label and all the other variables (P1, P2, ...) are predictor variables. You can create the features column (you can actually name it anything you want instead of features) like this:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
//creating features column
val assembler = new VectorAssembler()
.setInputCols(Array("P1","P2","P3","P4","P5","P6","P7","P8","E1","E2"))
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setFeaturesCol("features") // setting features column
.setLabelCol("IsAlert") // setting label column
//creating pipeline
val pipeline = new Pipeline().setStages(Array(assembler,lr))
//fitting the model
val lrModel = pipeline.fit(df)
Refer: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler.
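One additional hedged note: the snippet above assumes df contains the typed columns from the CSV. Since sc.textFile followed by toDF produces a single string column, it is probably easier to load the file with the DataFrame reader so the header and column types are picked up; a minimal sketch using the path from the question:
// read the CSV with header and schema inference so IsAlert, P1..P8, E1, E2
// arrive as typed columns rather than one line of text per row
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///ford/fordTrain.csv")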

Spark ML Convert Prediction label to string without training DataFrame

I am using the multinomial NaiveBayes classifier in Apache Spark ML (version 2.1.0) to predict some text categories.
The problem is: how do I convert the prediction labels (0.0, 1.0, 2.0) to strings without the training DataFrame?
I know IndexToString can be used, but it is only helpful if training and prediction happen in the same job. In my case they are independent jobs.
The code looks like this:
1) TrainingModel.scala: Train the model and save it to a file.
2) CategoryPrediction.scala: Load the trained model from the file and run predictions on test data.
Please suggest a solution.
TrainingModel.scala
val trainData: Dataset[LabeledRecord] = spark.read.option("inferSchema", "false")
.schema(schema).csv("trainingdata1.csv").as[LabeledRecord]
val labelIndexer = new StringIndexer().setInputCol("category").setOutputCol("label").fit(trainData).setHandleInvalid("skip")
val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
.setInputCol("words")
.setOutputCol("features")
.setNumFeatures(1000)
val rf = new NaiveBayes().setLabelCol("label").setFeaturesCol("features").setModelType("multinomial")
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, labelIndexer, rf))
val model = pipeline.fit(trainData)
model.write.overwrite().save("naivebayesmodel");
CategoryPrediction.scala
val testData: Dataset[PredictLabeledRecord] = spark.read.option("inferSchema", "false")
.schema(predictSchema).csv("testingdata.csv").as[PredictLabeledRecord]
val model = PipelineModel.load("naivebayesmodel")
val predictions = model.transform(testData)
// val labelConverter = new IndexToString()
// .setInputCol("prediction")
// .setOutputCol("predictedLabelString")
// .setLabels(trainDataFrameIndexer.labels)
predictions.select("prediction", "text").show(false)
trainingdata1.csv
category,text
Drama,"a b c d e spark"
Action,"b d"
Horror,"spark f g h"
Thriller,"hadoop mapreduce"
testingdata.csv
text
"a b c d e spark"
"spark f g h"
Add a converter that will translate the prediction categories back to your labels in your pipeline, something like this:
val categoryConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("category")
.setLabels(labelIndexer.labels)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, labelIndexer, rf, categoryConverter))
This will take the prediction and convert it back to a label using your labelIndexer.
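If the converter was not included when the pipeline was saved, another hedged option for the separate prediction job is to recover the labels from the loaded PipelineModel itself, since the fitted StringIndexerModel is stored as one of its stages; a minimal sketch:
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.feature.{IndexToString, StringIndexerModel}
val model = PipelineModel.load("naivebayesmodel")
// the fitted StringIndexer is among the saved stages; its labels array holds
// the original category strings in index order
val labels = model.stages
  .collectFirst { case s: StringIndexerModel => s.labels }
  .getOrElse(sys.error("no StringIndexerModel stage in the saved model"))
val categoryConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("category")
  .setLabels(labels)
val predictions = categoryConverter.transform(model.transform(testData))
predictions.select("category", "text").show(false)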

How to do binary classification in Spark ML without StringIndexer

I am trying to use Spark ML's DecisionTreeClassifier in a Pipeline without a StringIndexer, because my label is already indexed as (0.0; 1.0). DecisionTreeClassifier requires double values as the label, so this code should work:
def trainDecisionTreeModel(training: RDD[LabeledPoint], sqlc: SQLContext): Unit = {
import sqlc.implicits._
val trainingDF = training.toDF()
//format of this dataframe: [label: double, features: vector]
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(trainingDF)
val dt = new DecisionTreeClassifier()
.setLabelCol("label")
.setFeaturesCol("indexedFeatures")
val pipeline = new Pipeline()
.setStages(Array(featureIndexer, dt))
pipeline.fit(trainingDF)
}
But actually I get
java.lang.IllegalArgumentException:
DecisionTreeClassifier was given input with invalid label column label,
without the number of classes specified. See StringIndexer.
Of course I can just put in a StringIndexer and let it do its work on my double "label" field, but I want to work with the rawPrediction output column of DecisionTreeClassifier to get the probability of 0.0 and 1.0 for each row, like...
val predictions = model.transform(singletonDF)
val raw = predictions.select("rawPrediction").head.getAs[org.apache.spark.ml.linalg.Vector](0)
val zeroProbability = raw(0)
val oneProbability = raw(1)
If I put a StringIndexer in the Pipeline, I will not know the indexes of my input labels "0.0" and "1.0" in the rawPrediction vector, because StringIndexer indexes by value frequency, which could vary.
Please help me prepare data for DecisionTreeClassifier without using StringIndexer, or suggest another way to get the probability of my original labels (0.0; 1.0) for each row.
You can always set required metadata manually:
import sqlContext.implicits._
import org.apache.spark.ml.attribute.NominalAttribute
val meta = NominalAttribute
.defaultAttr
.withName("label")
.withValues("0.0", "1.0")
.toMetadata
val dfWithMeta = df.withColumn("label", $"label".as("label", meta))
pipeline.fit(dfWithMeta)
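A hedged follow-up sketch: because the metadata lists the values in the order ("0.0", "1.0"), position 0 of the prediction vectors corresponds to label 0.0 and position 1 to label 1.0, so the frequency-ordering concern from the question goes away:
import org.apache.spark.ml.linalg.Vector
val model = pipeline.fit(dfWithMeta)
val predictions = model.transform(dfWithMeta)
// with the metadata above, index 0 is label 0.0 and index 1 is label 1.0
val probs = predictions.select("probability").head.getAs[Vector](0)
val zeroProbability = probs(0)
val oneProbability = probs(1)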