Apache Spark k-means clustering train data format in Scala

I want to apply k-means clustering to my data, which is a DataFrame produced by a sqlContext.sql() query in Scala. I can convert it to an RDD by using ".rdd".
As I understand from the docs and the single example on Spark's website, KMeans.train expects an RDD of Vectors.
My data consists of two fields, userid and avg. I want to cluster the userids according to their associated avg values, which are of type Double.
Currently I have:
val queryResult = sqlContext.sql(s"some-query") // | userid(String) | avg(Double) |
val trainData = queryResult.rdd
val clusters = KMeans.train(trainData, numClusters, numIterations)
leading to this error:
<console>:46: error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
That example suggests the following approach:
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
But I cannot figure out how to adapt it so that I get an RDD of Vectors that contains the avg data and is still mapped to the userids.
How can I format my input data so that k-means clustering runs as I expect?
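One possible way to do this, as a minimal sketch rather than a definitive answer: keep each userid paired with a one-dimensional vector built from avg, train on the vectors only, and then predict a cluster for every pair. The sample rows, the local SparkSession and the values 2 and 20 for the number of clusters and iterations are assumptions made for illustration; in the question, queryResult comes from sqlContext.sql().

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Assumption: a local session and made-up rows standing in for the real query result.
val spark = SparkSession.builder().master("local[*]").appName("kmeans-sketch").getOrCreate()
import spark.implicits._

// | userid(String) | avg(Double) |
val queryResult = Seq(("u1", 1.0), ("u2", 1.2), ("u3", 9.8), ("u4", 10.1)).toDF("userid", "avg")

// Keep the userid next to its one-dimensional feature vector.
val keyed = queryResult.rdd
  .map(row => (row.getString(0), Vectors.dense(row.getDouble(1))))
  .cache()

// Train on the vectors only (k = 2 and 20 iterations are illustrative values).
val model = KMeans.train(keyed.values, 2, 20)

// Assign a cluster index to every userid.
val assignments: Map[String, Int] = keyed.mapValues(v => model.predict(v)).collect().toMap
```

Collecting the assignments to the driver is only reasonable for small results; for a large table the keyed.mapValues(...) RDD can be kept distributed instead.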

Related

K-means from a Word2Vec DataFrame in Scala

I am trying to apply the K-means method to the output of my Word2Vec model.
The Word2Vec model produced a DataFrame named vectors, and I cannot apply K-means to it.
I ran the following code to transform my DataFrame into an RDD:
val parsedData = vectors.rdd.map(s => Vectors.dense(s.getDouble(0),s.getDouble(1))).cache()
When I try to apply K-means to it, it doesn't work:
val clusters = KMeans.train(parsedData, 5, 20)
This is the error I get:
<console>:60: error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.ml.linalg.Vector]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
How can I transform my SQL DataFrame into one that matches spark.mllib.linalg.Vector?
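The error says that the RDD holds org.apache.spark.ml.linalg.Vector values while KMeans.train from the mllib package wants org.apache.spark.mllib.linalg.Vector. A minimal sketch of one way to bridge the two, using Vectors.fromML from the mllib package (available since Spark 2.0). The helper name toMLlib and the sample values are made up for illustration. Another option, not shown here, is to use org.apache.spark.ml.clustering.KMeans, which works on the DataFrame directly.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vector => MLlibVector, Vectors => MLlibVectors}

// Hypothetical helper: converts a new-style ml vector into the
// old-style mllib vector that mllib's KMeans.train expects.
def toMLlib(v: MLVector): MLlibVector = MLlibVectors.fromML(v)

// Illustrative values only.
val converted: MLlibVector = toMLlib(MLVectors.dense(0.1, 0.2))

// On the DataFrame from the question this might look like the line below,
// assuming the vector sits in column 0:
//   val parsedData = vectors.rdd.map(row => toMLlib(row.getAs[MLVector](0))).cache()
```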

Converting a Spark DataFrame for ML processing

I have written the following code to feed data to a machine learning algorithm in Spark 2.3. The code below runs fine. I need to enhance it so that it converts not just 3 feature columns but any number of columns uploaded via the csv file. For instance, if I had loaded 5 columns, how could I put them automatically into the Vectors.dense call below, or generate the same end result some other way? Does anyone know how this can be done?
val data2 = spark.read.format("csv").option("header", "true").load("/data/c7.csv")
val goodBadRecords = data2.map(row => {
  val n0 = row(0).toString.toLowerCase().toDouble
  val n1 = row(1).toString.toLowerCase().toDouble
  val n2 = row(2).toString.toLowerCase().toDouble
  val n3 = row(3).toString.toLowerCase().toDouble
  (n0, Vectors.dense(n1, n2, n3))
}).toDF("label", "features")
Thanks
Regards,
Adeel
A VectorAssembler can do the job:
VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features [...] into a single feature vector
Based on your code, the solution would look like:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col

val data2 = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true") //1
  .load("/data/c7.csv")
val fields = data2.schema.fieldNames
val assembler = new VectorAssembler()
  .setInputCols(fields.tail) //2
  .setOutputCol("features") //3
val goodBadRecords = assembler.transform(data2)
  .withColumn("label", col(fields(0))) //4
  .drop(fields: _*) //5
Remarks:
1. A schema is necessary for the input data, as the VectorAssembler only accepts the following input column types: all numeric types, boolean type, and vector type (per the same documentation). You seem to have a csv with doubles, so inferring the schema should work. Of course, any other method of transforming the string data to doubles is also fine.
2. Use all but the first column as input for the VectorAssembler.
3. Name the result column of the VectorAssembler features.
4. Create a new column called label as a copy of the first column.
5. Drop all original columns. This last step is optional, as the learning algorithm usually only looks at the label and features columns and ignores all other columns.
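As an illustration of the remark about the schema, here is a minimal sketch of another way to get numeric columns when the csv is read without inferSchema: cast every column to double before assembling. The helper name castAllToDouble, the local SparkSession and the sample rows are assumptions for illustration. Values that cannot be parsed become null, and column names containing dots would need extra quoting.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Hypothetical helper: casts every column to DoubleType, keeping the column names.
def castAllToDouble(df: DataFrame): DataFrame =
  df.select(df.columns.map(c => col(c).cast(DoubleType).as(c)): _*)

// Assumption: a local session and made-up string rows standing in for the csv content.
val spark = SparkSession.builder().master("local[*]").appName("cast-sketch").getOrCreate()
import spark.implicits._

val raw = Seq(("1", "2.5", "3"), ("0", "4.0", "5")).toDF("label", "f1", "f2")
val numeric = castAllToDouble(raw)
```

The resulting DataFrame can then be passed to the VectorAssembler in place of data2.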

BucketedRandomProjectionLSHModel approxNearestNeighbors function on entire dataframe

I'm trying to evaluate an entire DataFrame through the approxNearestNeighbors function of BucketedRandomProjectionLSHModel
What I expect:
A DataFrame containing the following information:
cookieId NN
id1 [id3, id5, id7]
id2 [id8, id9]
...
Input DataFrame (daily_content_transformed):
cookieID features(a sparse vector)
id1 sparse vector with features
id2 sparse vector with features
...
This works:
val key = Vectors.sparse(37599,
Array(1,4,6,7,16,57,81,104,166,225,290,692,763),
Array(1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0))
model.approxNearestNeighbors(daily_content_transformed, key, 20).show(20, false)
It returns a DataFrame with 21 rows. I could extract the cookieId column from this DataFrame and store it in the expected DataFrame.
Where I'm stuck:
instead of hard coding the key to retrieve NN from, run the method for every row in the input dataframe, and make a dataframe as expected above
Any help?
Edit in reply to first response:
After playing around with the suggestion to use approxSimilarityJoin instead of approxNearestNeighbors I came to the following conclusions:
the suggested solution works well for daily_content_transformed.limit(3000)
starting from daily_content_transformed.limit(5000), my Spark job terminates with a java.lang.OutOfMemoryError.
my input table contains roughly 800,000 unique cookieIDs (rows).
Although the suggested solution works for small inputs, scalability is an issue.
BucketedRandomProjectionLSHModel doesn't provide the required API. I think you can approximate it using approxSimilarityJoin:
import org.apache.spark.sql.functions.{struct, udf, collect_list, sort_array}
val threshold: Double = ??? // maximum distance for a pair to be kept
val n: Int = ??? // maximum number of neighbors per id
def take(n: Int) = udf((xs: Seq[String]) => xs.take(n))
model
  .approxSimilarityJoin(
    daily_content_transformed.alias("left"),
    daily_content_transformed.alias("right"),
    threshold)
  .groupBy($"datasetA.id" as "cookieId")
  // Collect pairs (dist, id)
  .agg(collect_list(struct($"distCol", $"datasetB.id" as "id")) as "NN")
  // Sort by dist (ascending, so the nearest come first), drop dist and take n
  .withColumn("NN", take(n)(sort_array($"NN", true).getItem("id")))
This is guaranteed to keep at most n neighbors.

How to convert RDD[Row] to RDD[Vector]

I'm trying to implement the k-means method using Scala.
I created an RDD like this:
val df = sc.parallelize(data).groupByKey().collect().map((chunk)=> {
sc.parallelize(chunk._2.toSeq).toDF()
})
val examples = df.map(dataframe =>{
dataframe.selectExpr(
"avg(time) as avg_time",
"variance(size) as var_size",
"variance(time) as var_time",
"count(size) as examples"
).rdd
})
val rdd_final=examples.reduce(_ union _)
val kmeans= new KMeans()
val model = kmeans.run(rdd_final)
With this code I obtain an error
type mismatch;
[error] found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
[error] required:org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
So I tried to cast doing:
val rdd_final_Vector = rdd_final.map{x:Row => x.getAs[org.apache.spark.mllib.linalg.Vector](0)}
val model = kmeans.run(rdd_final_Vector)
But then I obtain an error:
java.lang.ClassCastException: java.lang.Double cannot be cast to org.apache.spark.mllib.linalg.Vector
So I'm looking for a way to do that cast, but I can't find any method.
Any idea?
Best regards
At least a couple of issues here:
1. No, you really cannot cast a Row to a Vector: a Row is a collection of potentially disparate types understood by Spark SQL. A Vector is not a native Spark SQL type.
2. There seems to be a mismatch between the content of your SQL statement and what you are attempting to achieve with KMeans: the SQL is performing aggregations, but KMeans expects a series of individual data points, each in the form of a Vector (which encapsulates an Array[Double]). So why are you supplying sums and averages to a KMeans operation?
Addressing just #1 here: you will need to do something along the lines of:
val doubVals = <rows rdd>.map{ row => row.getAs[Double]("colname") }
val vector = Vectors.dense(doubVals.collect)
Then you have a properly encapsulated Array[Double] (within a Vector) that can be supplied to KMeans.
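If the goal is instead one Vector per Row, so that KMeans receives an RDD[Vector], a minimal sketch could look like the following. The helper name rowToVector is hypothetical, and it assumes every column of the Row is numeric and non-null (in the question the columns are doubles plus a count).

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row

// Hypothetical helper: turns a Row of numeric columns into an mllib Vector.
// Assumes no nulls; anything non-numeric is rejected with an exception.
def rowToVector(row: Row): Vector =
  Vectors.dense(row.toSeq.map {
    case n: java.lang.Number => n.doubleValue()
    case other => throw new IllegalArgumentException(s"Non-numeric value: $other")
  }.toArray)

// Illustrative row shaped like (avg_time, var_size, var_time, examples).
val example: Vector = rowToVector(Row(1.5, 2.0, 0.25, 3L))

// With the RDD[Row] from the question this might be used as:
//   val model = kmeans.run(rdd_final.map(rowToVector))
```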

How do I run the Spark decision tree with a categorical feature set using Scala?

I have a feature set with a corresponding categoricalFeaturesInfo: Map[Int,Int]. However, for the life of me I cannot figure out how I am supposed to get the DecisionTree class to work. It will not accept anything but a LabeledPoint as data. However, LabeledPoint requires (double, vector), where the vector requires doubles.
val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
// Run training algorithm to build the model
val maxDepth: Int = 3
val isMulticlassWithCategoricalFeatures: Boolean = true
val numClassesForClassification: Int = countPossibilities(labelCol)
val model = DecisionTree.train(LP, Classification, Gini, isMulticlassWithCategoricalFeatures, maxDepth, numClassesForClassification,categoricalFeaturesInfo)
The error I get:
scala> val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
<console>:32: error: overloaded method value dense with alternatives:
(values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
(firstValue: Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
cannot be applied to (Array[String])
val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
My resources thus far:
tree config, decision tree, labeledpoint
You can first transform categories to numbers, then load data as if all features are numerical.
When you build a decision tree model in Spark, you just need to tell spark which features are categorical and also the feature's arity (the number of distinct categories of that feature) by specifying a map Map[Int, Int]() from feature indices to its arity.
For example if you have data as:
1,a,add
2,b,more
1,c,thinking
3,a,to
1,c,me
You can first transform data into numerical format as:
1,0,0
2,1,1
1,2,2
3,0,3
1,2,4
In that format you can load data to Spark. Then if you want to tell Spark the second and the third columns are categorical, you should create a map:
val categoricalFeaturesInfo = Map[Int, Int]((1,3),(2,5))
The map tells us that the feature with index 1 has arity 3, and the feature with index 2 has arity 5. They will be treated as categorical when we build a decision tree model by passing that map as a parameter of the training function:
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
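The string-to-number step from the example above can be sketched in plain Scala as follows. This is only an illustration under the assumption that categories are numbered in order of first appearance, which reproduces the numeric table shown earlier; the helper name indexOf is made up.

```scala
// Raw rows from the example: first column, then the two categorical columns.
val raw = Seq(
  ("1", "a", "add"),
  ("2", "b", "more"),
  ("1", "c", "thinking"),
  ("3", "a", "to"),
  ("1", "c", "me"))

// Hypothetical helper: number each distinct category in order of first appearance.
def indexOf(values: Seq[String]): Map[String, Int] = values.distinct.zipWithIndex.toMap

val firstIndex = indexOf(raw.map(_._2))   // a -> 0, b -> 1, c -> 2
val secondIndex = indexOf(raw.map(_._3))  // add -> 0, more -> 1, thinking -> 2, to -> 3, me -> 4

// Replace every category by its index so that all columns are numeric.
val encoded: Seq[Array[Double]] = raw.map { case (first, c1, c2) =>
  Array(first.toDouble, firstIndex(c1).toDouble, secondIndex(c2).toDouble)
}

// Arity per column, using the same column indices as the map in the text.
val arities: Map[Int, Int] = Map(1 -> firstIndex.size, 2 -> secondIndex.size)
```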
Strings are not supported by LabeledPoint. One way to put them into a LabeledPoint is to split your data into multiple columns, treating your strings as categorical.
So for example, if you have the following dataset:
id,String,Intvalue
1,"a",123
2,"b",456
3,"c",789
4,"a",887
Then you could split your string data, making each value of the strings into a new column
a -> 1,0,0
b -> 0,1,0
c -> 0,0,1
As you have 3 distinct string values, you will convert your string column into 3 new columns, and each string will be represented by a value in these new columns.
Now your dataset will be:
id,a,b,c,Intvalue
1,1,0,0,123
2,0,1,0,456
3,0,0,1,789
4,1,0,0,887
You can now convert these into Double values and use them in your LabeledPoint.
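A minimal plain-Scala sketch of that splitting step, using the small dataset above. The helper name oneHot is made up, and the category order (a, b, c) is simply the sorted list of distinct strings.

```scala
// The example dataset: (id, String, Intvalue).
val rows = Seq((1, "a", 123), (2, "b", 456), (3, "c", 789), (4, "a", 887))

// Distinct categories in a fixed order; each one becomes its own 0/1 column.
val categories: Seq[String] = rows.map(_._2).distinct.sorted

// Hypothetical helper: 1.0 in the position of the matching category, 0.0 elsewhere.
def oneHot(value: String): Seq[Double] =
  categories.map(c => if (c == value) 1.0 else 0.0)

// Each row becomes (id, one-hot columns followed by Intvalue as a Double).
val encoded: Seq[(Int, Seq[Double])] =
  rows.map { case (id, s, n) => (id, oneHot(s) :+ n.toDouble) }
```

The Seq[Double] of each row can then be turned into an array and passed to Vectors.dense.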
Another way to convert your strings for a LabeledPoint is to create a list of distinct values for each column and replace each string with its index in that list. This is not recommended, because in the dataset above it would give
a = 0
b = 1
c = 2
But in this case the algorithms will treat a as closer to b than to c, an ordering that the data does not justify.
You need to check the element type of the array x.
According to the error log, the items in x are Strings, which Vectors.dense does not accept.
Spark Vectors can currently only be filled with Doubles.
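A minimal sketch of that conversion, assuming each record is an Array[String] whose first element is the class name and whose remaining elements are numeric strings. The classMap below is a made-up stand-in for the one in the question, and toLabeledPoint is a hypothetical helper name.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical label mapping standing in for the question's classMap.
val classMap: Map[String, Double] = Map("yes" -> 1.0, "no" -> 0.0)

// Convert the String features to Double before handing them to Vectors.dense.
// toDouble throws a NumberFormatException for non-numeric strings, so categorical
// values have to be replaced by numbers first.
def toLabeledPoint(x: Array[String]): LabeledPoint =
  LabeledPoint(classMap(x.head), Vectors.dense(x.tail.map(_.toDouble)))

// Illustrative record.
val lp = toLabeledPoint(Array("yes", "0", "2", "4.5"))

// With the RDD from the question this might be used as:
//   val LP = featureSet.map(toLabeledPoint)
```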