Converting DataFrame to Vectors.dense for k-means - Scala

Following the answer to this question
How to convert type Row into Vector to feed to the KMeans
I have created the feature table for my data (assembler is a VectorAssembler):
val kmeanInput = assembler.transform(table1).select("features")
When I run k-means with kmeanInput
val clusters = KMeans.train(kmeanInput, numCluster, numIteration)
I get the error
:102: error: type mismatch;
 found   : org.apache.spark.sql.DataFrame
    (which expands to)  org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
       val clusters = KMeans.train(kmeanInput, numCluster, numIteration)
As #Jed mentioned in his answer, this happens because rows are not in Vectors.dense format.
To solve this I tried
val dat = kmeanInput.rdd.map(lambda row: Vectors.dense([x for x in
row["features"]]))
And I get this error
:3: error: ')' expected but '(' found.
       val dat = kmeanInput.rdd.map(lambda row: Vectors.dense([x for x in row["features"]]))
:3: error: ';' expected but ')' found.
       val dat = kmeanInput.rdd.map(lambda row: Vectors.dense([x for x in row["features"]]))

You imported the incorrect library; you should use KMeans from ml instead of mllib. The former works on a DataFrame, while the latter works on an RDD.
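A minimal sketch of that approach, assuming Spark 2.x, the ml package, and that kmeanInput is the DataFrame from the question with its features column produced by the VectorAssembler (numCluster and numIteration as above):

import org.apache.spark.ml.clustering.KMeans
// ml's KMeans is fitted directly on a DataFrame that has a "features" column
val kmeans = new KMeans()
  .setK(numCluster)
  .setMaxIter(numIteration)
  .setFeaturesCol("features")
val model = kmeans.fit(kmeanInput)
model.clusterCenters.foreach(println)

If you would rather stay with mllib's KMeans.train, the alternative is to map the DataFrame to an RDD[org.apache.spark.mllib.linalg.Vector] first, e.g. by calling org.apache.spark.mllib.linalg.Vectors.fromML on each row's feature vector.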

Related

K-means from DataFrame word2Vec on Scala

I am trying to apply the K-means method to the output of my word2Vec model.
The word2Vec method produced a DataFrame of vectors, and I cannot apply K-means to it.
I ran the following code to transform the DataFrame of vectors into an RDD:
val parsedData = vectors.rdd.map(s => Vectors.dense(s.getDouble(0),s.getDouble(1))).cache()
When I try to apply K-means to it, it doesn't work:
val clusters = KMeans.train(parsedData, 5, 20)
This is the error I get:
<console>:60: error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.ml.linalg.Vector]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
How can I transform the SQL DataFrame I have into one that will match org.apache.spark.mllib.linalg.Vector?
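The error suggests that the Vectors object in the snippet above was imported from the ml package rather than mllib. A minimal sketch, assuming the two columns really are Doubles as in the code above, is to build mllib vectors directly and feed them to mllib's KMeans (the MLlibVectors alias is only illustrative):

import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}
import org.apache.spark.mllib.clustering.KMeans
// Build mllib vectors (not ml vectors) so KMeans.train accepts the RDD
val parsedData = vectors.rdd
  .map(s => MLlibVectors.dense(s.getDouble(0), s.getDouble(1)))
  .cache()
val clusters = KMeans.train(parsedData, 5, 20)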

Cast all specific datatype columns into other datatypes programmatically in Scala Spark

I am programmatically trying to convert datatypes of columns and running into some coding issues.
I modified the code used here for this.
Data >> any numbers being read as strings.
Code >>
import org.apache.spark.sql
raw_data.schema.fields
.collect({case x if x.dataType.typeName == "string" => x.name})
.foldLeft(raw_data)({case(dframe,field) => dframe(field).cast(sql.types.IntegerType)})
Error >>
<console>:75: error: type mismatch;
found : org.apache.spark.sql.Column
required: org.apache.spark.sql.DataFrame
(which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
.foldLeft(raw_data)({case(dframe,field) => dframe(field).cast(sql.types.IntegerType)})
The problem is that the result of dframe(field).cast(sql.types.IntegerType) in the foldLeft is a Column; however, to continue the iteration a DataFrame is expected. In the link the code originally comes from, dframe.drop(field) is used, which does return a DataFrame and hence works.
To fix this, simply use withColumn, which will adjust the specific column and then return the whole DataFrame:
foldLeft(raw_data)({case(dframe, field) => dframe.withColumn(field, dframe(field).cast(sql.types.IntegerType))})
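Put together, a minimal sketch of the full expression (assuming raw_data is the DataFrame from the question and every string column should become an integer column):

import org.apache.spark.sql
// Collect the names of all string-typed columns, then cast each one in place;
// withColumn returns a DataFrame, so the foldLeft keeps iterating correctly.
val casted = raw_data.schema.fields
  .collect({ case x if x.dataType.typeName == "string" => x.name })
  .foldLeft(raw_data)({ case (dframe, field) =>
    dframe.withColumn(field, dframe(field).cast(sql.types.IntegerType))
  })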

Transpose RDD[Vector] to change records to attributes for a CSV of size 500,000 x 50

I would like to read a CSV file and transpose it to measure correlation between attributes. But when I transpose it I get the error below:
not enough arguments for method transpose: (implicit asTraversable:
org.apache.spark.mllib.linalg.Vector =>
scala.collection.GenTraversableOnce[B])Seq[Seq[B]]. Unspecified value
parameter asTraversable.
Error occurred in an application involving default arguments.
val file = "/data.csv"
val data = sc.textFile(file).map(line => Vectors.dense(line.split (",").map(_.toDouble).distinct))
val transposedData = sc.parallelize(data.collect.toSeq.transpose)
val correlMatrix: Matrix = Statistics.corr(transposedData, "pearson")
println(correlMatrix.toString)
The data RDD is a collection of org.apache.spark.mllib.linalg.Vector, i.e. a collection of objects, but transpose requires a collection of collections.
data.collect.toSeq simply gives you Seq[Vector], which cannot be transposed.
The following code should work for you:
val data = sc.textFile(file).map(line => line.split (",").map(_.toDouble))
val untransposedData = data.map(Vectors.dense(_))
val transposedData = sc.parallelize(data.collect.toSeq.transpose).map(x => Vectors.dense(x.toArray))
val correlMatrix: Matrix = Statistics.corr(transposedData, "pearson")
println(correlMatrix.toString)
Note: distinct is removed, as it would make the two-dimensional matrix uneven, which would lead to another issue.

How to convert RDD[Row] to RDD[Vector]

I'm trying to implement the k-means method using Scala.
I created an RDD like this:
val df = sc.parallelize(data).groupByKey().collect().map((chunk)=> {
sc.parallelize(chunk._2.toSeq).toDF()
})
val examples = df.map(dataframe =>{
dataframe.selectExpr(
"avg(time) as avg_time",
"variance(size) as var_size",
"variance(time) as var_time",
"count(size) as examples"
).rdd
})
val rdd_final=examples.reduce(_ union _)
val kmeans= new KMeans()
val model = kmeans.run(rdd_final)
With this code I obtain an error:
type mismatch;
[error] found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
[error] required:org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
So I tried to cast it, doing:
val rdd_final_Vector = rdd_final.map{x:Row => x.getAs[org.apache.spark.mllib.linalg.Vector](0)}
val model = kmeans.run(rdd_final_Vector)
But then I obtain an error:
java.lang.ClassCastException: java.lang.Double cannot be cast to org.apache.spark.mllib.linalg.Vector
So I'm looking for a way to do that cast, but I can't find any method.
Any idea?
Best regards
At least a couple of issues here:
No, you really cannot cast a Row to a Vector: a Row is a collection of potentially disparate types understood by Spark SQL. A Vector is not a native Spark SQL type.
There seems to be a mismatch between the content of your SQL statement and what you are attempting to achieve with KMeans: the SQL is performing aggregations, but KMeans expects a series of individual data points in the form of a Vector (which encapsulates an Array[Double]). So then, why are you supplying sums and averages to a KMeans operation?
Addressing just #1 here: you will need to do something along the lines of:
val doubVals = <rows rdd>.map{ row => row.getAs[Double]("colname") }
val vector = Vectors.dense(doubVals.collect)
Then you have a properly encapsulated Array[Double] (within a Vector) that can be supplied to KMeans.
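For #2, if each row of rdd_final really is meant to be its own data point, a minimal sketch of the shape mllib's KMeans expects might look like the following; the Number round-trip is only there because count(...) produces a Long while the averages and variances are Doubles, and k = 2 is just a placeholder:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans
// One mllib Vector per Row, using all four aggregated columns as features
val rddOfVectors = rdd_final.map { row =>
  Vectors.dense(Array.tabulate(row.length)(i => row.getAs[Number](i).doubleValue))
}
val model = new KMeans().setK(2).run(rddOfVectors)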

Apache Spark k-means clustering train data format in Scala

I want to apply k-means clustering to my data, which is in DataFrame format as the result of a sqlContext.sql() query, in Scala. I can convert it to an RDD by using .rdd.
As I understand from the docs and the single example on Spark's website, KMeans.train expects an RDD of Vectors.
My data consists of two fields, userid and avg. What I want is to cluster the userids according to their associated avg values, which are of type Double.
Currently I have:
val queryResult = sqlContext.sql(s"some-query") // | userid(String) | avg(Double) |
val trainData = queryResult.rdd
val clusters = KMeans.train(trainData, numClusters, numIterations)
leading to this error:
<console>:46: error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
That example suggests such a practice:
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
But I cannot figure out how to adapt it so that I get an RDD of Vectors containing the avg data while keeping the mapping to userids.
How can I format my input data to make k-means clustering run as I expect?
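One way to shape this, as a minimal sketch: keep the userid next to a one-element Vector built from avg, train on the vectors only, and then ask the model for each userid's cluster. It assumes mllib's KMeans, that column 0 is the String userid and column 1 the Double avg, and the names withIds and userClusters are only illustrative:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans
// Pair each userid with a single-feature vector built from its avg value
val withIds = queryResult.rdd.map { row =>
  (row.getString(0), Vectors.dense(row.getDouble(1)))
}.cache()
// Train only on the vectors
val model = KMeans.train(withIds.values, numClusters, numIterations)
// Assign every userid to a cluster
val userClusters = withIds.mapValues(model.predict)   // RDD[(String, Int)]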