How to set data for logistic regression in Scala?
I am new to Scala and I want to implement a logistic regression model. I first load a CSV file as follows:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("D:/sample.txt")
The file is as below:
P,P,A,A,A,P,NB
N,N,A,A,A,N,NB
A,A,A,A,A,A,NB
P,P,P,P,P,P,NB
N,N,P,P,P,N,NB
A,A,P,P,P,A,NB
P,P,A,P,P,P,NB
P,P,P,A,A,P,NB
P,P,A,P,A,P,NB
P,P,A,A,P,P,NB
P,P,P,P,A,P,NB
P,P,P,A,P,P,NB
N,N,A,P,P,N,NB
N,N,P,A,A,N,NB
N,N,A,P,A,N,NB
N,N,A,P,A,N,NB
N,N,A,A,P,N,NB
N,N,P,P,A,N,NB
N,N,P,A,P,N,NB
A,A,A,P,P,A,NB
A,A,P,A,A,A,NB
A,A,A,P,A,A,NB
A,A,A,A,P,A,NB
A,A,P,P,A,A,NB
A,A,P,A,P,A,NB
P,N,A,A,A,P,NB
N,P,A,A,A,N,NB
P,N,A,A,A,N,NB
P,N,P,P,P,P,NB
N,P,P,P,P,N,NB
Then I want to set up the model with the code below:
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setFeaturesCol("Feature")
.setLabelCol("Label")
Then I fit the model as follows:
val lrModel = lr.fit(df)
println(lrModel.coefficients + " are the coefficients")
println(lrModel.interceptVector + " are the intercept vector")
println(lrModel.summary + " is summary")
But it is not printing the results.
Any help is appreciated.
From your code:
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setFeaturesCol("Feature") // <- here
.setLabelCol("Label") // <- here
You are setting the features column and the label column. Since you didn't mention the column names, I am assuming that the column containing the NB values is your label and that you want to use all the other columns as predictors.
All predictor variables that you want to include in your model need to be in a single vector column, generally called the features column. Note that VectorAssembler accepts only numeric, boolean, and vector input columns, so string values such as P, N, and A must first be converted to numbers, for example with StringIndexer. You then create the features column using VectorAssembler as follows:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
//creating features column
val assembler = new VectorAssembler()
.setInputCols(Array(" insert your column names here "))
.setOutputCol("Feature")
Refer: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler.
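Now you can proceed to fit the logistic regression model. A Pipeline is used to combine multiple transformations before fitting the data.
import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(assembler,lr))
//fitting the model (the result is a PipelineModel;
//the fitted LogisticRegressionModel is its last stage)
val lrModel = pipeline.fit(df)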
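Putting the steps above together, here is a minimal sketch of how the whole pipeline might look for string-valued data like the sample. It rests on several assumptions that are not in the question: the column names (c1, c2, c3, Label) are hypothetical, the second class "B" is made up so that there are two classes to separate, and a Spark 2.x or later SparkSession is used. Regularization is left out because on such a tiny dataset it would likely shrink every coefficient to zero.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("lr-sketch").getOrCreate()
import spark.implicits._

// Hypothetical column names; the "B" label is invented for illustration
val df = Seq(
  ("P", "P", "A", "NB"),
  ("N", "N", "A", "NB"),
  ("A", "A", "P", "B"),
  ("P", "N", "P", "B")
).toDF("c1", "c2", "c3", "Label")

val featureCols = Array("c1", "c2", "c3")

// One StringIndexer per string column turns each category into a numeric index
val featureIndexers: Array[PipelineStage] = featureCols.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
}
// The string label needs a numeric index as well
val labelIndexer = new StringIndexer().setInputCol("Label").setOutputCol("LabelIdx")

// Combine the indexed columns into a single vector column
val assembler = new VectorAssembler()
  .setInputCols(featureCols.map(c => s"${c}_idx"))
  .setOutputCol("Feature")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setFeaturesCol("Feature")
  .setLabelCol("LabelIdx")

val stages: Array[PipelineStage] =
  featureIndexers ++ Array[PipelineStage](labelIndexer, assembler, lr)
val model = new Pipeline().setStages(stages).fit(df)

// The fitted logistic regression is the last stage of the PipelineModel
val lrModel = model.stages.last.asInstanceOf[LogisticRegressionModel]
println(s"Coefficients: ${lrModel.coefficients}")
println(s"Intercept: ${lrModel.intercept}")
```

Feeding category indices straight into the model treats them as ordered numbers, which is a simplification; one-hot encoding the indexed columns before assembling them is a common refinement.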
Related
IllegalArgumentException when computing a PCA with Spark ML
I have a parquet file containing the id and features columns and I want to apply the PCA algorithm.
val dataset = spark.read.parquet("/usr/local/spark/dataset/data/user")
val features = new VectorAssembler()
.setInputCols(Array("id", "features" ))
.setOutputCol("features")
val pca = new PCA()
.setInputCol("features")
.setK(50)
.fit(dataset)
.setOutputCol("pcaFeatures")
val result = pca.transform(dataset).select("pcaFeatures")
pca.save("/usr/local/spark/dataset/out")
But I get this exception:
java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually ArrayType(DoubleType,true).
Spark's PCA transformer needs a vector column, such as one created by a VectorAssembler. Here you create an assembler but never use it. Also, the VectorAssembler only takes numeric, boolean, and vector columns as input. I don't know what the type of features is, but if it's an array, it won't work. Transform it into numeric columns first. Finally, it is a bad idea to give the assembled column the same name as an original column. Indeed, the VectorAssembler does not remove input columns and you would end up with two features columns.
Here is a working example of PCA computation in Spark:
import org.apache.spark.ml.feature._
val df = spark.range(10)
.select('id, ('id * 'id) as "id2", ('id * 'id * 'id) as "id3")
val assembler = new VectorAssembler()
.setInputCols(Array("id", "id2", "id3")).setOutputCol("features")
val assembled_df = assembler.transform(df)
val pca = new PCA()
.setInputCol("features").setOutputCol("pcaFeatures").setK(2)
.fit(assembled_df)
val result = pca.transform(assembled_df)
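If the features column really is an array of doubles, as the exception suggests, one possible alternative is to convert it to an ML vector with a small UDF. The sketch below assumes made-up data in place of the parquet file, and the names toVector and featuresVec are hypothetical.

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().master("local[*]").appName("pca-sketch").getOrCreate()
import spark.implicits._

// Made-up rows standing in for the parquet file: an id plus an array<double> column
val dataset = Seq(
  (1L, Seq(1.0, 2.0, 3.0)),
  (2L, Seq(2.0, 4.0, 7.0)),
  (3L, Seq(3.0, 5.0, 8.0))
).toDF("id", "features")

// Convert the array column into an ML dense vector
val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))
val withVectors = dataset.withColumn("featuresVec", toVector(col("features")))

// Configure the estimator fully before calling fit, then transform with the fitted model
val pcaModel = new PCA()
  .setInputCol("featuresVec")
  .setOutputCol("pcaFeatures")
  .setK(2)
  .fit(withVectors)

val result = pcaModel.transform(withVectors).select("pcaFeatures")
result.show(false)
```

The new column gets its own name (featuresVec) so that it does not clash with the original array column.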
Field "features" does not exist. SparkML
I am trying to build a model in Spark ML with Zeppelin. I am new to this area and would like some help. I think I need to set the correct datatypes for the columns and set the first column as the label. Any help would be greatly appreciated, thank you.
val training = sc.textFile("hdfs:///ford/fordTrain.csv")
val header = training.first
val inferSchema = true
val df = training.toDF
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
val lrModel = lr.fit(df)
// Print the coefficients and intercept for multinomial logistic regression
println(s"Coefficients: \n${lrModel.coefficientMatrix}")
println(s"Intercepts: ${lrModel.interceptVector}")
A snippet of the CSV file I am using is:
IsAlert,P1,P2,P3,P4,P5,P6,P7,P8,E1,E2
0,34.7406,9.84593,1400,42.8571,0.290601,572,104.895,0,0,0,
As you have mentioned, you are missing the features column. It is a vector containing all predictor variables. You have to create it using VectorAssembler. IsAlert is the label and all the other variables (P1, P2, ...) are predictor variables. You can create the features column (you can actually name it anything you want instead of features) by:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
//creating features column
val assembler = new VectorAssembler()
.setInputCols(Array("P1","P2","P3","P4","P5","P6","P7","P8","E1","E2"))
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setFeaturesCol("features") // setting features column
.setLabelCol("IsAlert") // setting label column
//creating pipeline
val pipeline = new Pipeline().setStages(Array(assembler,lr))
//fitting the model
val lrModel = pipeline.fit(df)
Refer: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler.
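The answer above needs a DataFrame with named, typed columns, which sc.textFile followed by toDF does not give you (it yields a single string column). A minimal sketch of one way to load such a file, assuming Spark 2.x or later: the temporary file and its two rows are made up here so that the example is self-contained, and only three of the columns are included.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("csv-sketch").getOrCreate()

// Temporary file with made-up rows standing in for fordTrain.csv
val path = Files.createTempFile("fordTrain", ".csv")
Files.write(path, "IsAlert,P1,P2\n0,34.7406,9.84593\n1,35.1,10.2\n".getBytes("UTF-8"))

// header: use the first line as column names; inferSchema: detect numeric types
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(path.toString)

df.printSchema()
```

With the real file, the path argument would be the HDFS location from the question.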
Spark 2 logisticregression remove threshold
I'm using Spark 2 + Scala to train a LogisticRegression-based binary classification model, and I'm using import org.apache.spark.ml.classification.LogisticRegression, which is the new ml API in Spark 2. However, when I evaluated the model by AUROC, I did not find a way to use the probability (a double in 0-1) instead of the binary classification (0/1). This was previously achieved by removeThreshold(), but in ml.LogisticRegression I did not find a similar method. Is there a way to do that?
The evaluator I'm using is:
val evaluator = new BinaryClassificationEvaluator()
.setLabelCol("label")
.setRawPredictionCol("rawPrediction")
.setMetricName("areaUnderROC")
val auroc = evaluator.evaluate(predictions)
If you want the probability output rather than the 0/1 output, try this:
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression}
val lr = new LogisticRegression()
.setMaxIter(100)
.setRegParam(0.3)
val lrModel = lr.fit(trainData)
val summary = lrModel.summary
summary.predictions.select("probability").show()
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression}
val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.3)
val lrModel = lr.fit(trainData)
val trainingSummary = lrModel.summary
val predictions = lrModel.transform(test)
predictions.select("label", "probability").show()
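To tie the answers back to the AUROC question: as far as I know, BinaryClassificationEvaluator reads a score column rather than the thresholded prediction column, and that column may hold raw predictions or probabilities. The sketch below points the evaluator at the probability column; the training rows are made up for illustration and the model is evaluated on its own training data only to keep the example short.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("auroc-sketch").getOrCreate()
import spark.implicits._

// Made-up training rows with the default column names "label" and "features"
val trainData = Seq(
  (0.0, Vectors.dense(0.0, 1.1)),
  (1.0, Vectors.dense(2.0, 1.0)),
  (0.0, Vectors.dense(0.1, 1.3)),
  (1.0, Vectors.dense(1.9, 0.8))
).toDF("label", "features")

val lrModel = new LogisticRegression().setMaxIter(100).fit(trainData)

// transform() adds rawPrediction, probability, and prediction columns
val predictions = lrModel.transform(trainData)

// Use the probability vector as the score instead of the 0/1 prediction
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("probability")
  .setMetricName("areaUnderROC")

val auroc = evaluator.evaluate(predictions)
println(s"AUROC = $auroc")
```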
Error adding VectorAssembler to Spark ML Pipeline
I am trying to add a VectorAssembler to the GBT pipeline example and get an error that the pipeline cannot find the features field. I'm bringing in a sample file instead of a libsvm file, so I needed to transform the feature set.
Error:
Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("data/training_example.csv")
val sampleDF = df.sample(false,0.05,987897L)
val assembler = new VectorAssembler()
.setInputCols(Array("val1","val2","val3",...,"valN"))
.setOutputCol("features")
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(sampleDF)
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(sampleDF)
val Array(trainingData, testData) = sampleDF.randomSplit(Array(0.7, 0.3))
val gbt = new GBTClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setMaxIter(3)
.setMaxDepth(5)
val pipeline = new Pipeline()
.setStages(Array(assembler,labelIndexer,featureIndexer,gbt))
val model = pipeline.fit(trainingData)
val predictions = model.transform(testData)
predictions.show(10)
Basic problem: why are you calling fit() on featureIndexer? If you call fit(sampleDF), VectorIndexer will search for the features column in sampleDF, but this dataset doesn't have such a column.
Pipeline's fit() runs every transformer and estimator in order: it applies the assembler, passes the result to the fit of labelIndexer, and then passes that result to the fit of featureIndexer. The DataFrame used in featureIndexer.fit() when it is called inside the Pipeline will have all the columns generated by the previous transformers. In your code sampleDF doesn't have a features column; however, during the Pipeline's fit() this column will be added by the assembler.
The documentation sample has a features column from the beginning.
val data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
You must fit a DataFrame that has a features column. So transform your original DataFrame with the VectorAssembler and give the result as input.
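The first suggestion above can be sketched as follows: leave the indexers unfitted and let the Pipeline fit them in order, so that VectorIndexer only runs after the assembler has added the features column. The four rows and the two feature columns are made up for illustration; the stage settings are the ones from the question.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler, VectorIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("gbt-sketch").getOrCreate()
import spark.implicits._

// Made-up rows standing in for data/training_example.csv
val sampleDF = Seq(
  (1.0, 0.0, "yes"),
  (2.0, 1.0, "no"),
  (3.0, 0.0, "yes"),
  (4.0, 1.0, "no")
).toDF("val1", "val2", "label")

val assembler = new VectorAssembler()
  .setInputCols(Array("val1", "val2"))
  .setOutputCol("features")

// No .fit(...) here: these stay estimators and are fitted inside the pipeline
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)

val gbt = new GBTClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(3)
  .setMaxDepth(5)

val pipeline = new Pipeline()
  .setStages(Array(assembler, labelIndexer, featureIndexer, gbt))

// The pipeline fits each stage on the output of the previous ones
val model = pipeline.fit(sampleDF)
model.transform(sampleDF).select("features", "indexedLabel", "prediction").show()
```

A side benefit of this arrangement is that the indexers are fitted only on whatever DataFrame is passed to pipeline.fit, so a train/test split stays clean.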
Handling unseen categorical variables and MaxBins calculation in Spark Multiclass-classification
Below is the code I have for a RandomForest multiclass-classification model. I am reading from a CSV file and doing various transformations as seen in the code.
I am calculating the max number of categories and then giving it as a parameter to RF. This takes a lot of time! Is there a parameter to set or an easier way to make the model automatically infer the max categories? It can go above 1000 and I cannot omit them.
How do I handle unseen labels on new data for prediction, since StringIndexer will not work in that case? The code below is just a split of the data, but I will be introducing new data as well in the future.
// Need to predict 2 classes
val cols_to_predict=Array("Label1","Label2")
// ID col
val omit_cols=Array("Key")
// reading the csv file
val data = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("abc.csv")
.cache()
// creating a features DF by droppping the labels so that I can run all
// the cols through String Indexer
val features=data.drop("Label1").drop("Label2").drop("Key")
// Since I do not know my max categories possible, I find it out
// and use it for maxBins parameter in RF
val distinct_col_counts=features.columns.map(x => data.select(x).distinct().count ).max
val transformers: Array[org.apache.spark.ml.PipelineStage] = features.columns.map(
cname => new StringIndexer().setInputCol(cname).setOutputCol(s"${cname}_index").fit(features)
)
val assembler = new VectorAssembler()
.setInputCols(features.columns.map(cname => s"${cname}_index"))
.setOutputCol("features")
val labelIndexer2 = new StringIndexer()
.setInputCol("prog_label2")
.setOutputCol("Label2")
.fit(data)
val labelIndexer1 = new StringIndexer()
.setInputCol("orig_label1")
.setOutputCol("Label1")
.fit(data)
val rf = new RandomForestClassifier()
.setLabelCol("Label1")
.setFeaturesCol("features")
.setNumTrees(100)
.setMaxBins(distinct_col_counts.toInt)
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer1.labels)
// Split into train and test
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
trainingData.cache()
testData.cache()
// Running only for one label for now Label1
val stages: Array[org.apache.spark.ml.PipelineStage] =transformers :+ labelIndexer1 :+ assembler :+ rf :+ labelConverter //:+ labelIndexer2
val pipeline=new Pipeline().setStages(stages)
val model=pipeline.fit(trainingData)
val predictions = model.transform(testData)
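On the unseen-labels part of the question, one option that may help is StringIndexer's handleInvalid parameter. The sketch below assumes Spark 2.2 or later, where the "keep" value is available, and uses a made-up column named cat; it is not a full answer to the maxBins part.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("unseen-sketch").getOrCreate()
import spark.implicits._

// Made-up data: "c" appears only in the new data, never during fitting
val train = Seq("a", "b", "a").toDF("cat")
val newData = Seq("a", "c").toDF("cat")

// "keep" places labels not seen during fit into one extra bucket
// instead of throwing an error at transform time
val indexer = new StringIndexer()
  .setInputCol("cat")
  .setOutputCol("cat_index")
  .setHandleInvalid("keep")

val indexerModel = indexer.fit(train)
val indexed = indexerModel.transform(newData)
indexed.show()
```

Setting handleInvalid to "skip" instead would drop the rows with unseen labels. Note that the extra bucket adds one category per indexed column, which matters when sizing maxBins.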