I am new to Scala and I want to implement a logistic regression model. I first load a CSV file as below:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("D:/sample.txt")
The file is as below:
P,P,A,A,A,P,NB
N,N,A,A,A,N,NB
A,A,A,A,A,A,NB
P,P,P,P,P,P,NB
N,N,P,P,P,N,NB
A,A,P,P,P,A,NB
P,P,A,P,P,P,NB
P,P,P,A,A,P,NB
P,P,A,P,A,P,NB
P,P,A,A,P,P,NB
P,P,P,P,A,P,NB
P,P,P,A,P,P,NB
N,N,A,P,P,N,NB
N,N,P,A,A,N,NB
N,N,A,P,A,N,NB
N,N,A,P,A,N,NB
N,N,A,A,P,N,NB
N,N,P,P,A,N,NB
N,N,P,A,P,N,NB
A,A,A,P,P,A,NB
A,A,P,A,A,A,NB
A,A,A,P,A,A,NB
A,A,A,A,P,A,NB
A,A,P,P,A,A,NB
A,A,P,A,P,A,NB
P,N,A,A,A,P,NB
N,P,A,A,A,N,NB
P,N,A,A,A,N,NB
P,N,P,P,P,P,NB
N,P,P,P,P,N,NB
Then I want to train the model with the code below:
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setFeaturesCol("Feature")
.setLabelCol("Label")
Then I fit the model as below:
val lrModel = lr.fit(df)
println(lrModel.coefficients + " are the coefficients")
println(lrModel.interceptVector + " is the intercept vector")
println(lrModel.summary + " is the summary")
But it is not printing the results.
Any help is appreciated.
From your code:
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setFeaturesCol("Feature") // <- here
.setLabelCol("Label") // <- here
you are setting the features column and the label column, but your DataFrame has no columns with those names. As you didn't mention the column names, I am assuming that the column containing the NB values is your label and that you want to use all the other columns as predictors.
All predictor variables that you want to include in your model need to be combined into a single vector column, generally called the features column. You can create it using VectorAssembler as follows:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
//creating features column
val assembler = new VectorAssembler()
.setInputCols(Array(" insert your column names here "))
.setOutputCol("Feature")
Refer: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler.
Now you can proceed to fit the logistic regression model. A Pipeline is used to chain multiple transformations before fitting the model.
import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(assembler, lr))
// fitting the model
val lrModel = pipeline.fit(df)
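As a rough end-to-end sketch of the steps above: VectorAssembler accepts numeric, boolean and vector columns, so string columns such as the P/N/A values in the question would first need to be converted, for example with StringIndexer. The column names c1, c2, c3 and target, the tiny in-memory dataset, and the second class "B" are all assumptions made for illustration (the sample file shows neither a header nor a second class). Treating the indexed values as plain numbers is a simplification; one-hot encoding them may suit nominal categories better.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("lr-sketch").getOrCreate()

// Hypothetical data: three categorical predictors and a two-class target.
val df = spark.createDataFrame(Seq(
  ("P", "P", "A", "NB"),
  ("N", "N", "A", "NB"),
  ("A", "A", "P", "B"),
  ("P", "N", "P", "B"),
  ("N", "P", "A", "NB"),
  ("A", "P", "P", "B")
)).toDF("c1", "c2", "c3", "target")

val featureCols = Seq("c1", "c2", "c3")

// One StringIndexer per string column turns P/N/A into numeric indices.
val featureIndexers: Seq[PipelineStage] = featureCols.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(c + "_idx")
}
val labelIndexer = new StringIndexer().setInputCol("target").setOutputCol("Label")

// Combine the indexed predictors into the single vector column that lr expects.
val assembler = new VectorAssembler()
  .setInputCols(featureCols.map(_ + "_idx").toArray)
  .setOutputCol("Feature")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setFeaturesCol("Feature")
  .setLabelCol("Label")

val stages: Array[PipelineStage] =
  (featureIndexers ++ Seq[PipelineStage](labelIndexer, assembler, lr)).toArray
val model = new Pipeline().setStages(stages).fit(df)

// pipeline.fit returns a PipelineModel; the fitted regression is its last stage.
val lrModel = model.stages.last.asInstanceOf[LogisticRegressionModel]
println(s"Coefficients: ${lrModel.coefficients}")
println(s"Intercept: ${lrModel.intercept}")
```

Note that the result of pipeline.fit is a PipelineModel, so coefficients and the summary have to be read from the LogisticRegressionModel stage rather than from the pipeline result itself.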
I am using the multinomial NaiveBayes classifier in Apache Spark ML (version 2.1.0) to predict some text categories.
The problem is how to convert the prediction labels (0.0, 1.0, 2.0) back to strings without the training DataFrame.
I know IndexToString can be used, but it is only helpful if training and prediction happen in the same job. In my case they are independent jobs.
The code looks like this:
1) TrainingModel.scala: Train the model and save it to a file.
2) CategoryPrediction.scala: Load the trained model from the file and run predictions on test data.
Please suggest a solution:
TrainingModel.scala
val trainData: Dataset[LabeledRecord] = spark.read.option("inferSchema", "false")
.schema(schema).csv("trainingdata1.csv").as[LabeledRecord]
val labelIndexer = new StringIndexer().setInputCol("category").setOutputCol("label").fit(trainData).setHandleInvalid("skip")
val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
.setInputCol("words")
.setOutputCol("features")
.setNumFeatures(1000)
val rf = new NaiveBayes().setLabelCol("label").setFeaturesCol("features").setModelType("multinomial")
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, labelIndexer, rf))
val model = pipeline.fit(trainData)
model.write.overwrite().save("naivebayesmodel");
CategoryPrediction.scala
val testData: Dataset[PredictLabeledRecord] = spark.read.option("inferSchema", "false")
.schema(predictSchema).csv("testingdata.csv").as[PredictLabeledRecord]
val model = PipelineModel.load("naivebayesmodel")
val predictions = model.transform(testData)
// val labelConverter = new IndexToString()
// .setInputCol("prediction")
// .setOutputCol("predictedLabelString")
// .setLabels(trainDataFrameIndexer.labels)
predictions.select("prediction", "text").show(false)
trainingdata1.csv
category,text
Drama,"a b c d e spark"
Action,"b d"
Horror,"spark f g h"
Thriller,"hadoop mapreduce"
testingdata.csv
text
"a b c d e spark"
"spark f g h"
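Add a converter to your pipeline that translates the predicted indices back to your labels, something like this:
val categoryConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedCategory")
.setLabels(labelIndexer.labels)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, labelIndexer, rf, categoryConverter))
This converts each prediction back to a label string using the labels from your labelIndexer. Because the converter is saved as part of the pipeline, the prediction job does not need the training DataFrame. The output column is named predictedCategory here because IndexToString rejects an output column that already exists, and the training data already has a category column.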
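If the model has already been saved without a converter stage, one possible alternative is to read the labels back from the StringIndexerModel stored inside the loaded PipelineModel. The sketch below assumes the same four stages as the question; the temporary model path, the in-memory data and the predictedCategory column name are made up for illustration, and both jobs are shown in one script only to keep it self-contained.

```scala
import java.nio.file.Files
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, IndexToString, RegexTokenizer, StringIndexer, StringIndexerModel}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("nb-labels-sketch").getOrCreate()
import spark.implicits._

// --- Training job (stand-in for TrainingModel.scala) ---
val trainData = spark.createDataFrame(Seq(
  ("Drama", "a b c d e spark"),
  ("Action", "b d"),
  ("Horror", "spark f g h"),
  ("Thriller", "hadoop mapreduce")
)).toDF("category", "text")

val labelIndexer = new StringIndexer().setInputCol("category").setOutputCol("label")
val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
val nb = new NaiveBayes().setLabelCol("label").setFeaturesCol("features").setModelType("multinomial")

// Hypothetical location for the saved model.
val modelPath = Files.createTempDirectory("naivebayesmodel").toString
new Pipeline().setStages(Array(tokenizer, hashingTF, labelIndexer, nb))
  .fit(trainData).write.overwrite().save(modelPath)

// --- Prediction job (stand-in for CategoryPrediction.scala) ---
val loaded = PipelineModel.load(modelPath)

// The fitted StringIndexerModel travels with the pipeline and still knows its labels.
val labels: Array[String] =
  loaded.stages.collectFirst { case m: StringIndexerModel => m.labels }.get

val converter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedCategory")
  .setLabels(labels)

val testData = Seq("a b c d e spark", "spark f g h").toDF("text")
val predictions = converter.transform(loaded.transform(testData))
predictions.select("text", "prediction", "predictedCategory").show(false)
```

Either approach avoids touching the training DataFrame at prediction time; putting the converter in the pipeline simply keeps the prediction job shorter.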
I'm using Spark 2 + Scala to train a LogisticRegression-based binary classification model, and I'm using import org.apache.spark.ml.classification.LogisticRegression, which is the new ml API in Spark 2. However, when I evaluated the model by AUROC, I did not find a way to use the probability (a double between 0 and 1) instead of the binary classification (0/1). This was previously achieved by removeThreshold(), but in ml.LogisticRegression I did not find a similar method. Is there a way to do that?
The evaluator I'm using is
val evaluator = new BinaryClassificationEvaluator()
.setLabelCol("label")
.setRawPredictionCol("rawPrediction")
.setMetricName("areaUnderROC")
val auroc = evaluator.evaluate(predictions)
If you want the probability output rather than the 0/1 output, try this:
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression}
val lr = new LogisticRegression()
.setMaxIter(100)
.setRegParam(0.3)
val lrModel = lr.fit(trainData)
val summary = lrModel.summary
summary.predictions.select("probability").show()
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression}
val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.3)
val lrModel = lr.fit(trainData)
val trainingSummary = lrModel.summary
val predictions = lrModel.transform(test)
predictions.select("label", "probability").show()
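To tie this back to the AUROC question: as far as I know, BinaryClassificationEvaluator reads a score column (rawPrediction by default) rather than the thresholded prediction column, and it also accepts the probability vector. The sketch below uses a small made-up dataset and evaluates on the training data only to stay short.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("auroc-sketch").getOrCreate()

// Hypothetical toy data with a binary label and a two-dimensional feature vector.
val data = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.1, 1.2)),
  (0.0, Vectors.dense(0.4, 0.9)),
  (0.0, Vectors.dense(0.3, 1.5)),
  (1.0, Vectors.dense(1.8, 0.2)),
  (1.0, Vectors.dense(2.1, 0.4)),
  (1.0, Vectors.dense(1.6, 0.1))
)).toDF("label", "features")

val lrModel = new LogisticRegression().setMaxIter(100).setRegParam(0.3).fit(data)

// transform adds rawPrediction, probability and prediction columns.
val predictions = lrModel.transform(data)

val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setMetricName("areaUnderROC")

// AUROC from the raw margins (the evaluator's default score column).
val aurocRaw = evaluator.setRawPredictionCol("rawPrediction").evaluate(predictions)
// AUROC from the class probabilities; the ranking of rows is the same.
val aurocProb = evaluator.setRawPredictionCol("probability").evaluate(predictions)

println(s"AUROC (rawPrediction) = $aurocRaw, AUROC (probability) = $aurocProb")
```

Since the probability is a monotone function of the raw margin, both calls should rank the rows identically, so no equivalent of removeThreshold() should be needed here.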
I am trying to add a VectorAssembler to the GBT pipeline example and get an error saying the pipeline cannot find the features field. I'm bringing in a sample file instead of a libsvm file, so I needed to transform the feature set.
Error:
Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("data/training_example.csv")
val sampleDF = df.sample(false,0.05,987897L)
val assembler = new VectorAssembler()
.setInputCols(Array("val1","val2","val3",...,"valN"))
.setOutputCol("features")
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(sampleDF)
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(sampleDF)
val Array(trainingData, testData) = sampleDF.randomSplit(Array(0.7, 0.3))
val gbt = new GBTClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setMaxIter(3)
.setMaxDepth(5)
val pipeline = new Pipeline()
.setStages(Array(assembler,labelIndexer,featureIndexer,gbt))
val model = pipeline.fit(trainingData)
val predictions = model.transform(testData)
predictions.show(10)
Basic problem:
Why are you calling fit() on featureIndexer?
If you call fit(sampleDF), VectorIndexer will look for the features column in sampleDF, but this dataset has no such column.
Pipeline's fit() runs all transformers and estimators in order: it applies assembler, passes the result to the fit of labelIndexer, and passes the result of that step to the fit of featureIndexer.
The DataFrame used by featureIndexer.fit() inside the Pipeline will have all the columns generated by the previous stages.
In your code sampleDF doesn't have a features column; during Pipeline fit(), however, this column is added by assembler.
The documentation sample has the features column from the beginning:
val data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
You must fit on a DataFrame that has the features column. So transform your original DataFrame with VectorAssembler and pass the result as input.
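A minimal sketch of the first suggestion, leaving the VectorIndexer unfitted so that the Pipeline fits it after the assembler has produced the features column. The three val columns and the in-memory rows are invented for illustration (the original column list is elided), and the model is fitted and applied on the same small DataFrame only to keep the example short.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler, VectorIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("gbt-sketch").getOrCreate()

// Hypothetical stand-in for the CSV: a string label plus three numeric columns.
val sampleDF = spark.createDataFrame(Seq(
  ("yes", 0.5, 1.0, 0.0),
  ("no",  1.7, 0.0, 1.0),
  ("yes", 0.9, 1.0, 2.0),
  ("no",  2.4, 0.0, 0.0),
  ("yes", 0.2, 1.0, 1.0),
  ("no",  3.1, 0.0, 2.0),
  ("yes", 0.7, 1.0, 0.0),
  ("no",  2.8, 0.0, 1.0)
)).toDF("label", "val1", "val2", "val3")

val assembler = new VectorAssembler()
  .setInputCols(Array("val1", "val2", "val3"))
  .setOutputCol("features")

// Left unfitted: "label" exists up front, but the Pipeline can fit this stage too.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

// No .fit(sampleDF) here: "features" only exists after the assembler has run.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)

val gbt = new GBTClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(3)
  .setMaxDepth(5)

// The Pipeline runs the stages in order and fits each estimator on the
// output of the stages before it.
val pipeline = new Pipeline()
  .setStages(Array(assembler, labelIndexer, featureIndexer, gbt))

val model = pipeline.fit(sampleDF)
val predictions = model.transform(sampleDF)
predictions.select("label", "indexedFeatures", "prediction").show(10)
```

The second suggestion also works: calling assembler.transform(sampleDF) first and fitting featureIndexer on the result. In that case the assembler would presumably be dropped from the pipeline stages, or the pipeline fed the untransformed data, to avoid producing the features column twice.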