I have a parquet file containing the id and features columns and I want to apply the pca algorithm.
val dataset = spark.read.parquet("/usr/local/spark/dataset/data/user")
val features = new VectorAssembler()
.setInputCols(Array("id", "features" ))
.setOutputCol("features")
val pca = new PCA()
.setInputCol("features")
.setK(50)
.fit(dataset)
.setOutputCol("pcaFeatures")
val result = pca.transform(dataset).select("pcaFeatures")
pca.save("/usr/local/spark/dataset/out")
but I have this exception
java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually ArrayType(DoubleType,true).
Spark's PCA transformer needs a column created by a VectorAssembler. Here you create one but never use it. Also, the VectorAssembler only takes numbers as input. I don't know what the type of features is, but if it's an array, it won't work. Transform it into numeric columns first. Finally, it is a bad idea to name the assembled column the same way as an original column. Indeed, the VectorAssembler does not remove input columns and you will end up if two features columns.
Here is a working example of PCA computation in Spark:
import org.apache.spark.ml.feature._
val df = spark.range(10)
.select('id, ('id * 'id) as "id2", ('id * 'id * 'id) as "id3")
val assembler = new VectorAssembler()
.setInputCols(Array("id", "id2", "id3")).setOutputCol("features")
val assembled_df = assembler.transform(df)
val pca = new PCA()
.setInputCol("features").setOutputCol("pcaFeatures").setK(2)
.fit(assembled_df)
val result = pca.transform(assembled_df)
I want to run pca with KNN in spark. I have a file that contains id, features.
> KNN.printSchema
root
|-- id: int (nullable = true)
|-- features: double (nullable = true)
code:
val dataset = spark.read.parquet("/usr/local/spark/dataset/data/user")
val features = new VectorAssembler()
.setInputCols(Array("id", "features" ))
.setOutputCol("features")
val Array(train, test) = dataset
.randomSplit(Array(0.7, 0.3), seed = 1234L)
.map(_.cache())
//create PCA matrix to reduce feature dimensions
val pca = new PCA()
.setInputCol("features")
.setK(5)
.setOutputCol("pcaFeatures")
val knn = new KNNClassifier()
.setTopTreeSize(dataset.count().toInt / 5)
.setFeaturesCol("pcaFeatures")
.setPredictionCol("predicted")
.setK(1)
val pipeline = new Pipeline()
.setStages(Array(pca, knn))
.fit(train)
Above code block is throwing this exception
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually ArrayType(DoubleType,true).
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.feature.PCAParams$class.validateAndTransformSchema(PCA.scala:54)
at org.apache.spark.ml.feature.PCAModel.validateAndTransformSchema(PCA.scala:125)
at org.apache.spark.ml.feature.PCAModel.transformSchema(PCA.scala:162)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
at KNN$.main(KNN.scala:63)
at KNN.main(KNN.scala)
Basically, you are trying to split the dataset into training and test, assemble features, run a PCA and then a classifier to predict something. The overall logic is correct but there are several problems with your code.
A PCA in spark needs assembled features. You created one but you do not use it in the code.
You gave the name features to the output of the assembler, and you already have a column named that way. Since you do not use it, you don't see an error but if you were you would get this exception:
java.lang.IllegalArgumentException: Output column features already exists.
When running a classification, you need to specify at the very least the input features with setFeaturesCol and the label you are trying to learn with setLabelCol. You did not specified the label and by default, the label is "label". You don't have any column named that way, hence the exception spark throws at you.
Here is a working example of what you are trying to do.
// a funky dataset with 3 features (`x1`, `x2`, `x`3) and a label `y`,
// the class we are trying to predict.
val dataset = spark.range(10)
.select('id as "x1", rand() as "x2", ('id * 'id) as "x3")
.withColumn("y", (('x2 * 3 + 'x1) cast "int").mod(2))
.cache()
// splitting the dataset, that part was ok ;-)
val Array(train, test) = dataset
.randomSplit(Array(0.7, 0.3), seed = 1234L)
.map(_.cache())
// An assembler, the output name cannot be one of the inputs.
val assembler = new VectorAssembler()
.setInputCols(Array("x1", "x2", "x3"))
.setOutputCol("features")
// A pca, that part was ok as well
val pca = new PCA()
.setInputCol("features")
.setK(2)
.setOutputCol("pcaFeatures")
// A LogisticRegression classifier. (KNN is not part of spark's standard API, but
// requires the same minimum information: features and label)
val classifier = new LogisticRegression()
.setFeaturesCol("pcaFeatures")
.setLabelCol("y")
// And the full pipeline
val pipeline = new Pipeline().setStages(Array(assembler, pca, classifier))
val model = pipeline.fit(train)
I am trying to build a model in Spark ML with Zeppelin.
I am new to this area and would like some help. I think i need to set the correct datatypes to the column and set the first column as the label. Any help would be greatly appreciated, thank you
val training = sc.textFile("hdfs:///ford/fordTrain.csv")
val header = training.first
val inferSchema = true
val df = training.toDF
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
val lrModel = lr.fit(df)
// Print the coefficients and intercept for multinomial logistic regression
println(s"Coefficients: \n${lrModel.coefficientMatrix}")
println(s"Intercepts: ${lrModel.interceptVector}")
A snippet of the csv file i am using is:
IsAlert,P1,P2,P3,P4,P5,P6,P7,P8,E1,E2
0,34.7406,9.84593,1400,42.8571,0.290601,572,104.895,0,0,0,
As you have mentioned, you are missing the features column. It is a vector containing all predictor variables. You have to create it using VectorAssembler.
IsAlert is the label and all others variables (p1,p2,...) are predictor variables, you can create features column (actually you can name it anything you want instead of features) by:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
//creating features column
val assembler = new VectorAssembler()
.setInputCols(Array("P1","P2","P3","P4","P5","P6","P7","P8","E1","E2"))
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setFeaturesCol("features") // setting features column
.setLabelCol("IsAlert") // setting label column
//creating pipeline
val pipeline = new Pipeline().setStages(Array(assembler,lr))
//fitting the model
val lrModel = pipeline.fit(df)
Refer: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler.
I am new to scala and I want to implement a logistic regression model.So initially I load a csv file as below:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("D:/sample.txt")
The file is as below:
P,P,A,A,A,P,NB
N,N,A,A,A,N,NB
A,A,A,A,A,A,NB
P,P,P,P,P,P,NB
N,N,P,P,P,N,NB
A,A,P,P,P,A,NB
P,P,A,P,P,P,NB
P,P,P,A,A,P,NB
P,P,A,P,A,P,NB
P,P,A,A,P,P,NB
P,P,P,P,A,P,NB
P,P,P,A,P,P,NB
N,N,A,P,P,N,NB
N,N,P,A,A,N,NB
N,N,A,P,A,N,NB
N,N,A,P,A,N,NB
N,N,A,A,P,N,NB
N,N,P,P,A,N,NB
N,N,P,A,P,N,NB
A,A,A,P,P,A,NB
A,A,P,A,A,A,NB
A,A,A,P,A,A,NB
A,A,A,A,P,A,NB
A,A,P,P,A,A,NB
A,A,P,A,P,A,NB
P,N,A,A,A,P,NB
N,P,A,A,A,N,NB
P,N,A,A,A,N,NB
P,N,P,P,P,P,NB
N,P,P,P,P,N,NB
Then I want to train the model by below code:
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setFeaturesCol("Feature")
.setLabelCol("Label")
Then I fit the model by below:
val lrModel = lr.fit(df)
println(lrModel.coefficients +"are the coefficients")
println(lrModel.interceptVector+"are the intercerpt vactor")
println(lrModel.summary +"is summary")
But it is not printing the results.
Any help is appreciated.
from your code:
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setFeaturesCol("Feature") <- here
.setLabelCol("Label") <- here
you are setting features column and label column. As you didn't mention column names, i am assuming the column containing NB values is your label and you want to include all others are the columns for prediction.
All predictor variables that you want include in your model, needs to be in form of single vector column, generally called as features column. You need to create it using VectorAssembler as follows:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
//creating features column
val assembler = new VectorAssembler()
.setInputCols(Array(" insert your column names here "))
.setOutputCol("Feature")
Refer: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler.
Now you can proceed to fit the logistic regression model. pipeline is used to combine multiple transformations beforefitting the data.
val pipeline = new Pipeline().setStages(Array(assembler,lr))
//fitting the model
val lrModel = pipeline.fit(df)