I am trying to apply the K-means method to the output of my Word2Vec model.
The Word2Vec step produced a DataFrame of vectors, and I cannot apply K-means to it.
See below the DataFrame of vectors.
I then ran the following code to transform my DataFrame into an RDD:
val parsedData = vectors.rdd.map(s => Vectors.dense(s.getDouble(0),s.getDouble(1))).cache()
When I try to apply K-means to it, it doesn't work:
val clusters = KMeans.train(parsedData, 5, 20)
This is the error I get:
<console>:60: error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.ml.linalg.Vector]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
How can I transform the SQL DataFrame I have into one that works with spark.mllib.linalg.Vector?
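For what it's worth, a minimal sketch of one way to make the types line up, assuming the first two columns of vectors really are Doubles: build the vectors with the mllib Vectors factory so the RDD element type matches what the RDD-based KMeans.train expects.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}

// parsedData is now an RDD[org.apache.spark.mllib.linalg.Vector]
val parsedData = vectors.rdd
  .map(s => MLlibVectors.dense(s.getDouble(0), s.getDouble(1)))
  .cache()

val clusters = KMeans.train(parsedData, 5, 20)
Alternatively, org.apache.spark.mllib.linalg.Vectors.fromML can convert an existing ml vector into its mllib counterpart.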
Related: following the answer to the question 'How to convert type Row into Vector to feed to the KMeans', I have created the feature table for my data (assembler is a VectorAssembler):
val kmeanInput = assembler.transform(table1).select("features")
When I run k-means with kmeanInput:
val clusters = KMeans.train(kmeanInput, numCluster, numIteration)
I get the error:
:102: error: type mismatch;
 found   : org.apache.spark.sql.DataFrame
    (which expands to)  org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
       val clusters = KMeans.train(kmeanInput, numCluster, numIteration)
As @Jed mentioned in his answer, this happens because the rows are not in Vectors.dense format.
To solve this I tried:
val dat = kmeanInput.rdd.map(lambda row: Vectors.dense([x for x in
row["features"]]))
And I get this error:
:3: error: ')' expected but '(' found.
       val dat = kmeanInput.rdd.map(lambda row: Vectors.dense([x for x in row["features"]]))
:3: error: ';' expected but ')' found.
       val dat = kmeanInput.rdd.map(lambda row: Vectors.dense([x for x in row["features"]]))
You imported the wrong library: you should use KMeans from ml instead of mllib. The ml version works on a DataFrame, while the mllib version works on an RDD.
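A minimal sketch of that approach, assuming kmeanInput is the DataFrame with the "features" column produced by the VectorAssembler above, and that numCluster and numIteration are defined as before:
import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans()
  .setK(numCluster)
  .setMaxIter(numIteration)
  .setFeaturesCol("features")

// fit works directly on the DataFrame; no RDD conversion is needed
val model = kmeans.fit(kmeanInput)
model.clusterCenters.foreach(println)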
I'm relatively new to Scala and the Spark API, but I have a question about using the VectorAssembler
http://spark.apache.org/docs/latest/ml-features.html#vectorassembler
to then make use of matrix correlations
https://spark.apache.org/docs/2.1.0/mllib-statistics.html#correlations
The DataFrame column is of type linalg.Vector:
val assembler = new VectorAssembler()
val trainwlabels3 = assembler.transform(trainwlabels2)
trainwlabels3.dtypes(0)
res90: (String, String) = (features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7)
Yet passing this column to sc.parallelize to build an RDD for the statistics tool throws a mismatch error.
val data: RDD[Vector] = sc.parallelize(
trainwlabels3("features")
)
<console>:80: error: type mismatch;
found : org.apache.spark.sql.Column
required: Seq[org.apache.spark.mllib.linalg.Vector]
Thanks in advance for any help.
You should just select:
val features = trainwlabels3.select($"features")
Convert to RDD
val featuresRDD = features.rdd
and map:
featuresRDD.map(_.getAs[Vector]("features"))
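Note that getAs[Vector] gives you org.apache.spark.ml.linalg.Vector (the type the VectorAssembler stores), while Statistics.corr in spark.mllib expects RDD[org.apache.spark.mllib.linalg.Vector]. A sketch of bridging the two, assuming Spark 2.x where Vectors.fromML is available:
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.stat.Statistics

val mllibVectors = trainwlabels3
  .select("features")
  .rdd
  .map(row => OldVectors.fromML(row.getAs[org.apache.spark.ml.linalg.Vector]("features")))

val corrMatrix = Statistics.corr(mllibVectors, "pearson")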
This should work for you:
val rddForStatistics = new VectorAssembler()
.transform(trainwlabels2)
.select($"features")
.as[Vector] // turns Dataset[Row] (a.k.a. DataFrame) into Dataset[Vector]
.rdd
However, you should avoid RDDs and figure out how to do what you want with the DataFrame-based API (in the spark.ml package), because the RDD-based API is all but deprecated in MLlib.
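For example, a sketch of computing the correlation matrix without leaving DataFrames, assuming Spark 2.2+ where org.apache.spark.ml.stat.Correlation is available:
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

// trainwlabels3 already carries the assembled "features" column
val Row(corr: Matrix) = Correlation.corr(trainwlabels3.select("features"), "features").head
println(s"Pearson correlation matrix:\n$corr")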
I am familiar with Python and I am learning Spark-Scala.
I want to build a DataFrame whose structure is described by this syntax:
// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
(1.1, Vectors.dense(1.1, 0.1)),
(0.2, Vectors.dense(1.0, -1.0)),
(3.0, Vectors.dense(1.3, 1.0)),
(1.0, Vectors.dense(1.2, -0.5))
)).toDF("label", "features")
I got the above syntax from this URL:
http://spark.apache.org/docs/latest/ml-pipeline.html
Currently my data is in an array which I pulled out of a DataFrame:
val my_a = gspc17_df.collect().map{row => Seq(row(2),Vectors.dense(row(3).asInstanceOf[Double],row(4).asInstanceOf[Double]))}
The structure of my array is very similar to the above DF:
my_a: Array[Seq[Any]] =
Array(
List(-1.4830674013266898, [-0.004192832940431825,-0.003170667657263393]),
List(-0.05876766500768526, [-0.008462913654529357,-0.006880595828929472]),
List(1.0109273250546658, [-3.1816797620416693E-4,-0.006502619326182358]))
How can I copy data from my array into a DataFrame that has the above structure?
I tried this syntax:
val my_df = spark.createDataFrame(my_a).toDF("label","features")
Spark barked at me:
<console>:105: error: inferred type arguments [Seq[Any]] do not conform to method createDataFrame's type parameter bounds [A <: Product]
val my_df = spark.createDataFrame(my_a).toDF("label","features")
^
<console>:105: error: type mismatch;
found : scala.collection.mutable.WrappedArray[Seq[Any]]
required: Seq[A]
val my_df = spark.createDataFrame(my_a).toDF("label","features")
^
The first problem here is that you use a List to store row data. List is a homogeneous data structure, and since the only common type of Any (row(2)) and DenseVector is Any (Object), you end up with a Seq[Any].
The next issue is that you use row(2) at all. Since Row is effectively a collection of Any, this operation doesn't return any useful type, and the result couldn't be stored in a DataFrame without providing an explicit Encoder.
From a more Spark-ish perspective it is not a good approach either: collect-ing just to transform the data speaks for itself, and mapping over Rows just to create Vectors doesn't make much sense anyway.
Assuming that there is no type mismatch you can use VectorAssembler:
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(Array(df.columns(3), df.columns(4)))
.setOutputCol("features")
assembler.transform(df).select(df.columns(2), "features")
or, if you really want to handle this manually, a UDF:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}
val toVec = udf((x: Double, y: Double) => Vectors.dense(x, y))
df.select(col(df.columns(2)), toVec(col(df.columns(3)), col(df.columns(4))))
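In either case the result can then be renamed to match the (label, features) layout from the pipeline docs; a short sketch, assuming df here is your gspc17_df and its third column holds the label:
// VectorAssembler route, then rename the two selected columns
val training = assembler
  .transform(df)
  .select(df.columns(2), "features")
  .toDF("label", "features")

training.printSchema()  // should show exactly two columns: label and features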
In general I would strongly recommend getting familiar with Scala before you start using it with Spark.
I'm trying to implement the k-means method using Scala.
I created an RDD something like this:
val df = sc.parallelize(data).groupByKey().collect().map((chunk)=> {
sc.parallelize(chunk._2.toSeq).toDF()
})
val examples = df.map(dataframe =>{
dataframe.selectExpr(
"avg(time) as avg_time",
"variance(size) as var_size",
"variance(time) as var_time",
"count(size) as examples"
).rdd
})
val rdd_final=examples.reduce(_ union _)
val kmeans= new KMeans()
val model = kmeans.run(rdd_final)
With this code I obtain an error:
type mismatch;
[error] found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
[error] required:org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
So I tried to cast it by doing:
val rdd_final_Vector = rdd_final.map{x:Row => x.getAs[org.apache.spark.mllib.linalg.Vector](0)}
val model = kmeans.run(rdd_final_Vector)
But then I obtain an error:
java.lang.ClassCastException: java.lang.Double cannot be cast to org.apache.spark.mllib.linalg.Vector
So I'm looking for a way to do that cast, but I can't find any method.
Any idea?
Best regards
There are at least a couple of issues here:
1. No, you really cannot cast a Row to a Vector: a Row is a collection of potentially disparate types understood by Spark SQL, and a Vector is not a native Spark SQL type.
2. There seems to be a mismatch between the content of your SQL statement and what you are attempting to achieve with KMeans: the SQL is performing aggregations, but KMeans expects a series of individual data points in the form of a Vector (which encapsulates an Array[Double]). So why are you supplying averages and variances to a KMeans operation?
Addressing just #1 here: you will need to do something along the lines of:
val doubVals = <rows rdd>.map { row => row.getAs[Double]("colname") }
val vector = Vectors.dense(doubVals.collect)
Then you have a properly encapsulated Array[Double] (inside a Vector) that can be supplied to KMeans.
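If, despite point #2, you do want to feed each aggregated row to KMeans as its own data point, a sketch of the per-row conversion (an assumption on my part that the avg/variance columns come back as Double and count(...) as Long):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row

val points = rdd_final.map {
  case Row(avgTime: Double, varSize: Double, varTime: Double, examples: Long) =>
    Vectors.dense(avgTime, varSize, varTime, examples.toDouble)
}.cache()

// k = 3 is just a placeholder value
val model = new KMeans().setK(3).run(points)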
I want to apply k-means clustering to my data, which is in DataFrame format (the result of a sqlContext.sql() query), in Scala. I can convert it to an RDD by using ".rdd".
As I understand from the docs and the single example on Spark's website, KMeans.train expects an RDD of Vectors.
My data consists of two fields, userid and avg. What I want is to cluster the userids according to their associated avg values, which are of type Double.
Currently I have:
val queryResult = sqlContext.sql(s"some-query") // | userid(String) | avg(Double) |
val trainData = queryResult.rdd
val clusters = KMeans.train(trainData, numClusters, numIterations)
leading to this error:
<console>:46: error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
That example suggests a practice like this:
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
But I cannot figure out how to change it so that I get an RDD of Vectors containing the avg data while still keeping track of the userids.
How can I format my input data to make the k-means clustering run as I expect?
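A minimal sketch of one common way to handle this (an assumption on my part, since KMeans.train only ever sees the feature vectors): keep (userid, vector) pairs in one RDD, train on the vectors alone, and ask the model for each point's cluster afterwards. This assumes userid is a String in column 0 and avg a Double in column 1.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Keep the ids next to their feature vectors so they are not lost.
val withIds = queryResult.rdd
  .map(row => (row.getString(0), Vectors.dense(row.getDouble(1))))
  .cache()

val clusters = KMeans.train(withIds.values, numClusters, numIterations)

// Attach the assigned cluster back to each userid.
val assignments = withIds.mapValues(v => clusters.predict(v))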