Spark Scala Kmeans - how to label results and visualize?

Here's some code that uses Spark ML to find clusters:
val dfRaw = spark.read.option("header", "true")
.csv("src/main/resources/input.csv")
val K = 5
val assembler = new VectorAssembler().setInputCols(Array("id", "lat", "lon")).setOutputCol("features")
val df = assembler.transform(dfRaw).select("features")
df.show(false)
val kmeans = new KMeans().setK(K).setSeed(1L)
val model : KMeansModel = kmeans.fit(df)
println("cluster centers")
model.clusterCenters.foreach(println)
println("----- predictions")
val predictions = model.transform(df)
predictions.collect().foreach(println)
The input file is made up of the following 4 columns: id, name, lat, lon
I'm sure I'm doing some stupid things in this code, but it kind of works (I think). Any advice on improving it is appreciated. I'm wondering if the id column is affecting how it clusters the data.
I'm also struggling to make sense of the results. What's a good way to join predictions (a DataFrame of Vectors) with the name column of dfRaw?
Also, I'd like to visualize the results on some kind of free grid or geomap, any recommendations?
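A minimal sketch of one possible approach (not from the original post, and reusing the question's column names): leave id out of the assembled features so it cannot distort the distances, and keep id and name in the DataFrame you pass through the pipeline. Both VectorAssembler and KMeansModel.transform preserve the input columns, so no join is needed afterwards.
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col

val dfRaw = spark.read.option("header", "true")
.csv("src/main/resources/input.csv")
.select(col("id"), col("name"), col("lat").cast("double"), col("lon").cast("double")) // csv columns come in as strings

val assembler = new VectorAssembler()
.setInputCols(Array("lat", "lon")) // id deliberately left out of the features
.setOutputCol("features")

val df = assembler.transform(dfRaw) // id and name stay alongside features

val model = new KMeans().setK(5).setSeed(1L).fit(df) // K = 5 as in the question
val predictions = model.transform(df) // adds a "prediction" column with the cluster label
predictions.select("id", "name", "lat", "lon", "prediction").show(false)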

Related

IllegalArgumentException when computing a PCA with Spark ML

I have a parquet file containing the id and features columns, and I want to apply the PCA algorithm.
val dataset = spark.read.parquet("/usr/local/spark/dataset/data/user")
val features = new VectorAssembler()
.setInputCols(Array("id", "features" ))
.setOutputCol("features")
val pca = new PCA()
.setInputCol("features")
.setK(50)
.fit(dataset)
.setOutputCol("pcaFeatures")
val result = pca.transform(dataset).select("pcaFeatures")
pca.save("/usr/local/spark/dataset/out")
but I get this exception:
java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually ArrayType(DoubleType,true).
Spark's PCA transformer needs a column created by a VectorAssembler. Here you create one but never use it. Also, the VectorAssembler only takes numbers as input. I don't know what the type of features is, but if it's an array, it won't work. Transform it into numeric columns first. Finally, it is a bad idea to name the assembled column the same way as an original column. Indeed, the VectorAssembler does not remove input columns and you will end up with two features columns.
Here is a working example of PCA computation in Spark:
import org.apache.spark.ml.feature._
val df = spark.range(10)
.select('id, ('id * 'id) as "id2", ('id * 'id * 'id) as "id3")
val assembler = new VectorAssembler()
.setInputCols(Array("id", "id2", "id3")).setOutputCol("features")
val assembled_df = assembler.transform(df)
val pca = new PCA()
.setInputCol("features").setOutputCol("pcaFeatures").setK(2)
.fit(assembled_df)
val result = pca.transform(assembled_df)
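If the features column really is ArrayType(DoubleType), as the error message says, one way to turn it into an ML Vector is a small UDF. This is only a sketch (not from the original answer), reusing the dataset from the question; newer Spark versions also ship org.apache.spark.ml.functions.array_to_vector for the same purpose.
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}

val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))
val withVec = dataset.withColumn("featuresVec", toVector(col("features")))

val pcaModel = new PCA()
.setInputCol("featuresVec")
.setOutputCol("pcaFeatures")
.setK(50) // k = 50 taken from the question; it must not exceed the array length
.fit(withVec)
val pcaResult = pcaModel.transform(withVec).select("pcaFeatures")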

Exception: features must be of type org.apache.spark.ml.linalg.VectorUDT

I want to run PCA with KNN in Spark. I have a file that contains the id and features columns.
> KNN.printSchema
root
|-- id: int (nullable = true)
|-- features: double (nullable = true)
code:
val dataset = spark.read.parquet("/usr/local/spark/dataset/data/user")
val features = new VectorAssembler()
.setInputCols(Array("id", "features" ))
.setOutputCol("features")
val Array(train, test) = dataset
.randomSplit(Array(0.7, 0.3), seed = 1234L)
.map(_.cache())
//create PCA matrix to reduce feature dimensions
val pca = new PCA()
.setInputCol("features")
.setK(5)
.setOutputCol("pcaFeatures")
val knn = new KNNClassifier()
.setTopTreeSize(dataset.count().toInt / 5)
.setFeaturesCol("pcaFeatures")
.setPredictionCol("predicted")
.setK(1)
val pipeline = new Pipeline()
.setStages(Array(pca, knn))
.fit(train)
The above code block throws this exception:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually ArrayType(DoubleType,true).
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.feature.PCAParams$class.validateAndTransformSchema(PCA.scala:54)
at org.apache.spark.ml.feature.PCAModel.validateAndTransformSchema(PCA.scala:125)
at org.apache.spark.ml.feature.PCAModel.transformSchema(PCA.scala:162)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
at KNN$.main(KNN.scala:63)
at KNN.main(KNN.scala)
Basically, you are trying to split the dataset into training and test, assemble features, run a PCA and then a classifier to predict something. The overall logic is correct but there are several problems with your code.
A PCA in Spark needs assembled features. You created an assembler, but you do not use it in the code.
You gave the name features to the output of the assembler, but you already have a column named that way. Since you never use the assembler you don't see an error, but if you did you would get this exception:
java.lang.IllegalArgumentException: Output column features already exists.
When running a classification, you need to specify at the very least the input features with setFeaturesCol and the label you are trying to learn with setLabelCol. You did not specify the label and, by default, the label column is "label". You don't have any column named that way, hence the exception Spark throws at you.
Here is a working example of what you are trying to do.
// a funky dataset with 3 features (`x1`, `x2`, `x3`) and a label `y`,
// the class we are trying to predict.
val dataset = spark.range(10)
.select('id as "x1", rand() as "x2", ('id * 'id) as "x3")
.withColumn("y", (('x2 * 3 + 'x1) cast "int").mod(2))
.cache()
// splitting the dataset, that part was ok ;-)
val Array(train, test) = dataset
.randomSplit(Array(0.7, 0.3), seed = 1234L)
.map(_.cache())
// An assembler, the output name cannot be one of the inputs.
val assembler = new VectorAssembler()
.setInputCols(Array("x1", "x2", "x3"))
.setOutputCol("features")
// A pca, that part was ok as well
val pca = new PCA()
.setInputCol("features")
.setK(2)
.setOutputCol("pcaFeatures")
// A LogisticRegression classifier. (KNN is not part of spark's standard API, but
// requires the same minimum information: features and label)
val classifier = new LogisticRegression()
.setFeaturesCol("pcaFeatures")
.setLabelCol("y")
// And the full pipeline
val pipeline = new Pipeline().setStages(Array(assembler, pca, classifier))
val model = pipeline.fit(train)
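A possible follow-up, not part of the original answer: apply the fitted pipeline to the held-out test split and measure accuracy with Spark's built-in evaluator.
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val predictions = model.transform(test)
predictions.select("y", "prediction").show(false)

val accuracy = new MulticlassClassificationEvaluator()
.setLabelCol("y")
.setPredictionCol("prediction")
.setMetricName("accuracy")
.evaluate(predictions)
println(s"test accuracy: $accuracy")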

infinite centroid for kmeans spark scala

(I think I am almost sure what the answer is)
here is my code:
val fileName = """file:///home/user/data/csv/sessions_sample.csv"""
val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(fileName)
// calculate input for kmeans
val input1 = df.select("id", "duration", "ip_dist", "txr1", "txr2", "txr3", "txr4").na.fill(3.0)
val input2 = input1.map(r => (r.getInt(0), Vectors.dense((1 until r.size - 1).map{ i => r.getDouble(i)}.toArray[Double])))
val input3 = input2.toDF("id", "features")
// initiate kmeans
val kmeans = new KMeans().setK(100).setSeed(1L).setFeaturesCol("features").setPredictionCol("prediction")
val model = kmeans.fit(input3)
val model = kmeans.fit(input3.select("features"))
// Make predictions
val predictions = model.transform(input3.select("features"))
val predictions = model.transform(input3)
val evaluator = new ClusteringEvaluator()
// i get an error when i run this line
val silhouette = evaluator.evaluate(predictions)
java.lang.AssertionError: assertion failed: Number of clusters must be
greater than one. at scala.Predef$.assert(Predef.scala:170) at
org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette$.computeSilhouetteScore(ClusteringEvaluator.scala:416)
at
org.apache.spark.ml.evaluation.ClusteringEvaluator.evaluate(ClusteringEvaluator.scala:96)
... 49 elided
But my centroids look like this:
model.clusterCenters.foreach(println)
[3217567.1300936914,145.06533614203505,Infinity,Infinity,Infinity]
I think that because some centers are infinite => k-means is unstable => the silhouette measure goes wrong.
But it still doesn't answer why, for any k > 1 I have tried so far, I get an error saying "Number of clusters must be greater than one".
Please advise.
I once saw the same message. The root cause is that every data point is the same (my data is generated by a program), so of course there is only one cluster. BTW, I did not check its centers, so I am not sure whether my case is the same as yours.
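To check which case you are in before fitting, here is a small diagnostic sketch (not from the original answers), reusing input1 and the column names from the question: a distinct count of 1 means the features are effectively constant, and infinities or a zero standard deviation show up in describe.
import org.apache.spark.sql.functions.col

val featureCols = Seq("duration", "ip_dist", "txr1", "txr2", "txr3", "txr4")

// how many distinct feature rows are there? must be > 1 to get more than one cluster
val distinctRows = input1.select(featureCols.map(col): _*).distinct().count()
println(s"distinct feature rows: $distinctRows")

// min/max/stddev per column; Infinity in min/max or a stddev of 0 is a red flag
input1.describe(featureCols: _*).show()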

Select 2000+ columns as Features for Classification ML

I'm looking for how I can select a lot of columns (2000+) as features from a DataFrame. I don't want to write the names one by one.
I'm doing classification and I have around 2000 features.
data is a DataFrame with around 2000 columns.
First, I get all of the column names of my DF and drop 9 columns because I don't need them.
My idea was to use all the column names to feed the VectorAssembler. The result should be something like [value of the 1st feature, value of the 2nd feature, value of the 3rd feature, ...] for the first row, and the same for every row of my DataFrame.
But I have this error :
java.lang.IllegalArgumentException: Field "features" does not exist.
EDIT: If something is unclear, please let me know so that I can fix it.
I deleted some transformers (StringIndexer, VectorIndexer, IndexToString) because they're not the point of my question.
val array = data.columns drop(9)
val assembler = new VectorAssembler()
.setInputCols(array)
.setOutputCol("features")
val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2))
val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("features")
.setNumTrees(50)
val pipeline = new Pipeline()
.setStages(Array(assembler, rf))
val model = pipeline.fit(trainingData)
EDIT 2: I fixed my problem. I took out the VectorIndexer and used array in the VectorAssembler, and it worked perfectly.
Well, at least I get a result.
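For reference, a sketch of the pipeline described in EDIT 2. The label handling is an assumption: the raw label column is called "label" here and is indexed into "indexedLabel" by the StringIndexer mentioned above.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val array = data.columns.drop(9) // everything except the first 9 columns

val labelIndexer = new StringIndexer()
.setInputCol("label") // assumption: name of the raw label column
.setOutputCol("indexedLabel")

val assembler = new VectorAssembler()
.setInputCols(array)
.setOutputCol("features")

val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("features")
.setNumTrees(50)

val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2))
val model = new Pipeline()
.setStages(Array(labelIndexer, assembler, rf))
.fit(trainingData)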

Spark ML VectorAssembler() dealing with thousands of columns in dataframe

I was using the Spark ML pipeline to set up classification models on a really wide table. This means that I have to automatically generate all the code that deals with the columns instead of literally typing each of them. I am pretty much a beginner with Scala and Spark. I was stuck at the VectorAssembler() part when I was trying to do something like the following:
val featureHeaders = featureHeader.collect.mkString(" ")
//convert the header RDD into a string
val featureArray = featureHeaders.split(",").toArray
val quote = "\""
val featureSIArray = featureArray.map(x => (s"$quote$x$quote"))
//count the element in headers
val featureHeader_cnt = featureHeaders.split(",").toList.length
// Fit on whole dataset to include all labels in index.
import org.apache.spark.ml.feature.StringIndexer
val labelIndexer = new StringIndexer().
setInputCol("target").
setOutputCol("indexedLabel")
val featureAssembler = new VectorAssembler().
setInputCols(featureSIArray).
setOutputCol("features")
val convpipeline = new Pipeline().
setStages(Array(labelIndexer, featureAssembler))
val myFeatureTransfer = convpipeline.fit(df)
Apparently it didn't work. I am not sure what I should do to make the whole thing more automatic, or whether the ML pipeline cannot take that many columns at the moment (which I doubt).
I finally figured out one way, which is not very pretty. It is to create Vectors.dense for the features and then create a data frame out of that.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val myDataRDDLP = inputData.map { line =>
val indexed = line.split('\t').zipWithIndex
// columns after index 1770 become the feature values
val myValues = indexed.filter(x => x._2 > 1770).map(x => x._1).map(_.toDouble)
// the column at index 3 (shifted down by 1) becomes the label
val mykey = indexed.filter(x => x._2 == 3).map(x => x._1.toDouble - 1).mkString.toDouble
LabeledPoint(mykey, Vectors.dense(myValues))
}
val training = sqlContext.createDataFrame(myDataRDDLP).toDF("label", "features")
You shouldn't use quotes (s"$quote$x$quote") unless column names contain quotes. Try
val featureAssembler = new VectorAssembler().
setInputCols(featureArray).
setOutputCol("features")
For PySpark, you can first create a list of the column names:
df_colnames = df.columns
Then you can use that in the VectorAssembler:
assemble = VectorAssembler(inputCols = df_colnames, outputCol = 'features')
df_vectorized = assemble.transform(df)