How to convert RDD[Row] to RDD[Vector] - scala

I'm trying to implement the k-means method using Scala.
I created an RDD something like this:
val df = sc.parallelize(data).groupByKey().collect().map((chunk) => {
  sc.parallelize(chunk._2.toSeq).toDF()
})
val examples = df.map(dataframe => {
  dataframe.selectExpr(
    "avg(time) as avg_time",
    "variance(size) as var_size",
    "variance(time) as var_time",
    "count(size) as examples"
  ).rdd
})
val rdd_final=examples.reduce(_ union _)
val kmeans= new KMeans()
val model = kmeans.run(rdd_final)
With this code I obtain an error
type mismatch;
[error] found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
[error] required:org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
So I tried to cast doing:
val rdd_final_Vector = rdd_final.map{x:Row => x.getAs[org.apache.spark.mllib.linalg.Vector](0)}
val model = kmeans.run(rdd_final_Vector)
But then I obtain an error:
java.lang.ClassCastException: java.lang.Double cannot be cast to org.apache.spark.mllib.linalg.Vector
So I'm looking for a way to do that cast, but I can't find any method.
Any idea?
Best regards

At least a couple of issues here:
No, you really cannot cast a Row to a Vector: a Row is a collection of potentially disparate types understood by Spark SQL, and a Vector is not a native Spark SQL type.
There also seems to be a mismatch between the content of your SQL statement and what you are attempting to achieve with KMeans: the SQL is performing aggregations, but KMeans expects a series of individual data points in the form of a Vector (which encapsulates an Array[Double]). So then - why are you supplying sums and averages to a KMeans operation?
Addressing just #1 here: you will need to do something along the lines of:
val doubVals = <rows rdd>.map{ row => row.getAs[Double]("colname") }
val vector = Vectors.dense(doubVals.collect())
Then you have a properly encapsulated Array[Double] (within a Vector) that can be supplied to KMeans.
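Note that KMeans.run takes an RDD[Vector], so in practice you would map each Row to its own Vector rather than collecting everything into one. A minimal sketch along those lines, assuming the aggregate column names from the question's selectExpr (avg_time, var_size, var_time):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// One Vector per Row; the column names are assumed from the question's selectExpr.
val rddVectors = rdd_final.map { row =>
  Vectors.dense(
    row.getAs[Double]("avg_time"),
    row.getAs[Double]("var_size"),
    row.getAs[Double]("var_time")
  )
}
rddVectors.cache() // KMeans makes several passes over the data

val model = new KMeans().setK(2).run(rddVectors) // k = 2 is just a placeholder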

Related

In Spark-Scala, how to copy Array of Lists into DataFrame?

I am familiar with Python and I am learning Spark-Scala.
I want to build a DataFrame which has the structure described by this syntax:
// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
(1.1, Vectors.dense(1.1, 0.1)),
(0.2, Vectors.dense(1.0, -1.0)),
(3.0, Vectors.dense(1.3, 1.0)),
(1.0, Vectors.dense(1.2, -0.5))
)).toDF("label", "features")
I got the above syntax from this URL:
http://spark.apache.org/docs/latest/ml-pipeline.html
Currently my data is in an array which I pulled out of a DataFrame:
val my_a = gspc17_df.collect().map{row => Seq(row(2),Vectors.dense(row(3).asInstanceOf[Double],row(4).asInstanceOf[Double]))}
The structure of my array is very similar to the above DF:
my_a: Array[Seq[Any]] =
Array(
List(-1.4830674013266898, [-0.004192832940431825,-0.003170667657263393]),
List(-0.05876766500768526, [-0.008462913654529357,-0.006880595828929472]),
List(1.0109273250546658, [-3.1816797620416693E-4,-0.006502619326182358]))
How to copy data from my array into a DataFrame which has the above structure?
I tried this syntax:
val my_df = spark.createDataFrame(my_a).toDF("label","features")
Spark barked at me:
<console>:105: error: inferred type arguments [Seq[Any]] do not conform to method createDataFrame's type parameter bounds [A <: Product]
val my_df = spark.createDataFrame(my_a).toDF("label","features")
^
<console>:105: error: type mismatch;
found : scala.collection.mutable.WrappedArray[Seq[Any]]
required: Seq[A]
val my_df = spark.createDataFrame(my_a).toDF("label","features")
^
scala>
The first problem here is that you use List to store the row data. List is a homogeneous data structure, and since the only common type of row(2) (an Any) and DenseVector is Any (Object), you end up with a Seq[Any].
The next issue is that you use row(2) at all. Since Row is effectively a collection of Any, this operation doesn't return any useful type, and the result couldn't be stored in a DataFrame without providing an explicit Encoder.
From a more Spark-ish perspective it is not a good approach either. collect-ing just to transform data shouldn't require any comment, and mapping over Rows just to create Vectors doesn't make much sense either.
Assuming that there is no type mismatch you can use VectorAssembler:
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(Array(df.columns(3), df.columns(4)))
.setOutputCol("features")
assembler.transform(df).select(df.columns(2), "features")
or, if you really want to handle this manually, a UDF:
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.ml.linalg.Vectors
val toVec = udf((x: Double, y: Double) => Vectors.dense(x, y))
df.select(col(df.columns(2)), toVec(col(df.columns(3)), col(df.columns(4))))
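If you prefer to stay closer to the collect-and-rebuild approach from the question, mapping each Row to a typed (Double, Vector) tuple instead of a Seq[Any] also satisfies createDataFrame, since tuples are Products. A sketch under that assumption, reusing the column positions from the question and assuming those columns are Doubles:

import org.apache.spark.ml.linalg.Vectors

// Typed tuples instead of Seq[Any]; positions 2, 3 and 4 are taken from the question.
val typed = gspc17_df.collect().map { row =>
  (row.getAs[Double](2), Vectors.dense(row.getAs[Double](3), row.getAs[Double](4)))
}
val my_df = spark.createDataFrame(typed.toSeq).toDF("label", "features")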
In general I would strongly recommend getting familiar with Scala before you start using it with Spark.

How to create DataFrame from Scala's List of Iterables?

I have the following Scala value:
val values: List[Iterable[Any]] = Traces().evaluate(features).toList
and I want to convert it to a DataFrame.
When I try the following:
sqlContext.createDataFrame(values)
I got this error:
error: overloaded method value createDataFrame with alternatives:
[A <: Product](data: Seq[A])(implicit evidence$2: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
[A <: Product](rdd: org.apache.spark.rdd.RDD[A])(implicit evidence$1: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
cannot be applied to (List[Iterable[Any]])
sqlContext.createDataFrame(values)
Why?
That's what the spark.implicits object is for. It allows you to convert your common Scala collection types into a DataFrame / Dataset / RDD.
Here is an example with Spark 2.0, but it exists in older versions too:
import org.apache.spark.sql.SparkSession
val values = List(1,2,3,4,5)
val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = values.toDF()
Edit: Just realised you were after a 2d list. Here is something I tried in spark-shell. I converted a 2d List to a List of Tuples and used the implicit conversion to DataFrame:
val values = List(List("1", "One") ,List("2", "Two") ,List("3", "Three"),List("4","4")).map(x =>(x(0), x(1)))
import spark.implicits._
val df = values.toDF
Edit2: The original question by MTT was "How to create a Spark DataFrame from a Scala list" for a 2d list, for which this is a correct answer. The original question is https://stackoverflow.com/revisions/38063195/1
The question was later changed to match an accepted answer. I am adding this edit so that anyone looking for something similar to the original question can find it.
As zero323 mentioned, we need to first convert List[Iterable[Any]] to List[Row], then put the rows in an RDD, and prepare a schema for the Spark data frame.
To convert List[Iterable[Any]] to List[Row], we can say
val rows = values.map{x => Row(x.toSeq: _*)}
and then, given a schema (here simply called schema), we can make an RDD
val rdd = sparkContext.makeRDD(rows)
and finally create a spark data frame
val df = sqlContext.createDataFrame(rdd, schema)
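The schema referenced above is never shown; here is a minimal sketch of building one explicitly with StructType, using hypothetical column names and types that you would adjust to your data:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical schema: two string columns named c1 and c2.
val schema = StructType(Seq(
  StructField("c1", StringType, nullable = true),
  StructField("c2", StringType, nullable = true)
))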
Simplest approach:
val newList = yourList.map(Tuple1(_))
val df = spark.createDataFrame(newList).toDF("stuff")
In Spark 2 we can use a Dataset by just converting the list to a DS with the toDS API
val ds = list.flatMap(_.split(",")).toDS() // Records split by comma
or
val ds = list.toDS()
This is more convenient than an RDD or DataFrame.
The most concise way I've found:
val df = spark.createDataFrame(List("A", "B", "C").map(Tuple1(_)))

The argument types of an anonymous function must be fully known. (SLS 8.5) when word2vec applied for dataframe

I apply Spark's word2vec by using a dataframe. Here is my code:
val df2 = df.groupBy("LABEL").agg(collect_list("TERM").alias("TERM"))
val word2Vec = new Word2Vec()
  .setInputCol("TERM")
  .setOutputCol("result")
  .setMinCount(0)
val model = word2Vec.fit(df2)
val result = model.transform(df2)
val synonyms = model.findSynonyms("4", 10)
//synonyms.foreach(println)
for((synonym, cosineSimilarity) <- synonyms) {
println(s"$synonym $cosineSimilarity")
}
When I use synonyms.foreach(println) the code works; however, the returned results are not ordered by their similarity scores. Instead I tried the for loop seen at the bottom of the code. When applying it, the following error is thrown:
Error:(52, 40) missing parameter type for expanded function
The argument types of an anonymous function must be fully known. (SLS 8.5)
Expected type was: ?
for((synonym, cosineSimilarity) <- synonyms) {
^
From other similar Stack Overflow questions and the error, it seems the exact types of the arguments are needed. In the for loop, synonyms is a DataFrame and the returned values have types String and Double, respectively. So all my trials have failed. How can I remedy this?
The result of findSynonyms is a non-materialized, Spark-internal DataFrame. You cannot simply iterate over the result.
def findSynonyms(word: Vector, num: Int): DataFrame = {
  ..
  sc.parallelize(wordVectors.findSynonyms(word, num)).toDF("word", "similarity")
}
Note the reason foreach worked is that it is a materialization method (an action) clearly defined on DataFrames.
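If the goal is to get typed values back in order of similarity, one option (not shown in the original answer) is to collect the small synonyms DataFrame to the driver and read the fields of each Row; a sketch assuming the word and similarity columns shown above:

import org.apache.spark.sql.functions.desc

// synonyms holds only the requested 10 rows, so collecting it to the driver is cheap.
for (row <- synonyms.orderBy(desc("similarity")).collect()) {
  val synonym = row.getString(0)
  val cosineSimilarity = row.getDouble(1)
  println(s"$synonym $cosineSimilarity")
}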

Spark DataFrame zipWithIndex

I am using a DataFrame to read in .parquet files, but then turning them into an RDD to do the normal processing I want to do on them.
So I have my file:
val dataSplit = sqlContext.parquetFile("input.parquet")
val convRDD = dataSplit.rdd
val columnIndex = convRDD.flatMap(r => r.zipWithIndex)
I get the following error even when I convert from a dataframe to RDD:
:26: error: value zipWithIndex is not a member of
org.apache.spark.sql.Row
Does anyone know how to do what I am trying to do - essentially trying to get the value and the column index?
I was thinking something like:
val dataSplit = sqlContext.parquetFile(inputVal.toString)
val schema = dataSplit.schema
val columnIndex = dataSplit.flatMap(r => 0 until schema.length)
but I am getting stuck on the last part, as I am not sure how to do the equivalent of zipWithIndex.
You can simply convert Row to Seq:
convRDD.flatMap(r => r.toSeq.zipWithIndex)
Important thing to note here is that extracting type information becomes tricky. Row.toSeq returns Seq[Any] and resulting RDD is RDD[(Any, Int)].
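If you also want the column name rather than just the position, the DataFrame's schema can supply it; a small sketch along those lines, assuming the dataSplit DataFrame and convRDD from the question:

// Pair every value with its column index and column name.
val names = dataSplit.schema.fieldNames
val valuesWithIndexAndName = convRDD.flatMap { r =>
  r.toSeq.zipWithIndex.map { case (value, i) => (value, i, names(i)) }
}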

How to transform Array[(Double, Double)] into Array[Double] in Scala?

I'm using MLlib of Spark (v1.1.0) and Scala to do k-means clustering applied to a file with points (longitude and latitude).
My file contains 4 fields separated by comma (the last two are the longitude and latitude).
Here, it's an example of k-means clustering using Spark:
https://spark.apache.org/docs/1.1.0/mllib-clustering.html
What I want to do is to read the last two fields of my files that are in a specific directory in HDFS, transform them into an RDD<Vector> to use this method in the KMeans class:
train(RDD<Vector> data, int k, int maxIterations)
This is my code:
val data = sc.textFile("/user/test/location/*")
val parsedData = data.map(s => Vectors.dense(s.split(',').map(fields => (fields(2).toDouble,fields(3).toDouble))))
But when I run it in spark-shell I get the following error:
error: overloaded method value dense with alternatives:
(values: Array[Double])org.apache.spark.mllib.linalg.Vector
(firstValue: Double, otherValues: Double*)org.apache.spark.mllib.linalg.Vector
cannot be applied to (Array[(Double, Double)])
So, I don't know how to transform my Array[(Double, Double)] into Array[Double]. Maybe there is another way to read the two fields and convert them into RDD<Vector>, any suggestion?
The previous suggestion using flatMap was based on the assumption that you wanted to map over the elements of the array given by the .split(",") - and offered to satisfy the types by using Array instead of Tuple2.
The argument received by the .map/.flatMap functions is an element of the original collection, so it should be named 'field' (singular) for clarity. Calling fields(2) selects the 3rd character of each of the elements of the split - hence the source of confusion.
If what you're after is the 3rd and 4th elements of the .split(",") array, converted to Double:
s.split(",").drop(2).take(2).map(_.toDouble)
or if you want all BUT the first two fields converted to Double (if there may be more than 2):
s.split(",").drop(2).map(_.toDouble)
There are two 'factory' methods for dense Vectors:
def dense(values: Array[Double]): Vector
def dense(firstValue: Double, otherValues: Double*): Vector
While the provided type above is Array[Tuple2[Double,Double]] and hence does not type-match:
(Extracting the logic above:)
val parseLineToTuple: String => Array[(Double, Double)] = s => s.split(',').map(fields => (fields(2).toDouble, fields(3).toDouble))
What is needed here is to create a new Array[Double] out of the input String, like this (again focusing only on the specific parsing logic):
val parseLineToArray: String => Array[Double] = s => { val fields = s.split(","); Array(fields(2).toDouble, fields(3).toDouble) }
Integrating that in the original code should solve the issue:
val data = sc.textFile("/user/test/location/*")
val vectors = data.map(s => Vectors.dense(parseLineToArray(s)))
(You can of course inline that code, I separated it here to focus on the issue at hand)
val parsedData = data.map { s =>
  val fields = s.split(",")
  Vectors.dense(fields(2).toDouble, fields(3).toDouble)
}