Spark Scala Vector Map ClassCastException - scala

Trying to make use of the Statistics tool in Spark for Scala and having difficulty preparing a vector that it will accept.
val featuresrdd = features.rdd.map{_.getAs[Vector]("features")}
featuresrdd: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[952] at map at <console>:82
This produces an RDD typed as 'mllib.linalg.Vector'; however, upon using it in the tool, the vectors turn out to actually be of type 'DenseVector' from the new ml package.
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
val correlMatrix: Matrix = Statistics.corr(featuresrdd, "pearson")
java.lang.ClassCastException: org.apache.spark.ml.linalg.DenseVector cannot be cast to org.apache.spark.mllib.linalg.Vector
Any help would be greatly appreciated.
Thanks

The features column actually holds the new org.apache.spark.ml.linalg.Vector type, while Statistics.corr from mllib expects the old org.apache.spark.mllib.linalg.Vector, hence the ClassCastException. Extract the value as an ml vector and convert it with Vectors.fromML:
val featuresrdd = features.rdd.map(row => Vectors.fromML(row.getAs[org.apache.spark.ml.linalg.Vector]("features")))
val correlMatrix: Matrix = Statistics.corr(featuresrdd, "pearson")
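Alternatively, on Spark 2.2+ the DataFrame-based correlation API avoids the conversion entirely. A minimal sketch, assuming features is the original DataFrame and its vector column is named "features":
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
// Correlation.corr works directly on the new ml.linalg vector type,
// so no mllib conversion is needed.
val Row(coeff: Matrix) = Correlation.corr(features, "features", "pearson").head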

Related

How to calculate euclidean distance of each row in a dataframe to a constant reference array

I have a dataframe which is created from parquet files and has 512 columns (all float values).
I am trying to calculate the Euclidean distance of each row in my dataframe to a constant reference array.
My development environment is Zeppelin 0.7.3 with Spark 2.1 and Scala. Here are the Zeppelin paragraphs I run:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
//Create dataframe from parquet file
val filePath = "/tmp/vector.parquet/*.parquet"
val df = spark.read.parquet(filePath)
//Create assembler and vectorize df
val assembler = new VectorAssembler()
.setInputCols(df.columns)
.setOutputCol("features")
val training = assembler.transform(df)
//Create udf
val eucDisUdf = udf((features: Vector, myvec: Vector) => Vectors.sqdist(features, myvec))
//Create ref vector
val myScalaVec = Vectors.dense(Array.fill(512)(25.44859))
val distDF = training2.withColumn("euc", eucDisUdf($"features", myScalaVec))
This code gives the following error for the eucDisUdf call:
error: type mismatch; found : org.apache.spark.ml.linalg.Vector
required: org.apache.spark.sql.Column
I would appreciate any idea of how to eliminate this error and compute the distances properly in Scala.
I think you can use currying to achieve that:
def eucDisUdf(myvec:Vector) = udf((features: Vector) => Vectors.sqdist(features, myvec))
val myScalaVec = Vectors.dense(Array.fill(512)(25.44859))
val distDF = training2.withColumn( "euc", eucDisUdf(myScalaVec)($"features") )
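For completeness, a minimal sketch of the whole call with the imports it relies on (assuming the assembled DataFrame is the training value from above, that training2 in the question refers to the same thing, and that spark.implicits._ is in scope for the $ column syntax, as it is in Zeppelin):
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
// The reference vector is captured in the UDF's closure, so only the
// column is passed when the UDF is applied.
def eucDisUdf(myvec: Vector) = udf((features: Vector) => Vectors.sqdist(features, myvec))
val myScalaVec = Vectors.dense(Array.fill(512)(25.44859))
val distDF = training.withColumn("euc", eucDisUdf(myScalaVec)($"features"))
Note that Vectors.sqdist returns the squared distance; wrap it in math.sqrt if you need the actual Euclidean distance.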

print CoordinateMatrix after using RowMatrix.columnSimilarities in Apache Spark

I am using Spark MLlib for one of my projects, in which I need to calculate document similarities.
I first converted the documents to vectors using the tf-idf transform of MLlib, then converted them into a RowMatrix and used the columnSimilarities() method.
I referred to the tf-idf documentation and used the DIMSUM implementation for cosine similarities.
This is the Scala code executed in spark-shell:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val documents = sc.textFile("test1").map(_.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)
// now use the RowMatrix to compute cosineSimilarities
// which implements DIMSUM algorithm
val mat = new RowMatrix(tfidf)
val sim = mat.columnSimilarities() // returns a CoordinateMatrix
Now let's say my input file, test1 in this code block, is a simple file with 5 short documents (fewer than 10 terms each), one on each row.
Since I am just testing this code, I would like to see the output of mat.columnSimilarities() which is in object sim.
I would like to see the similarity of 1st document vector with 2nd, 3rd and so on.
I referred to the Spark documentation for CoordinateMatrix, which is the type of object returned by the columnSimilarities method of the RowMatrix class and referenced by sim.
By going through more documentation, I figured I could convert the CoordinateMatrix to a RowMatrix, then convert the rows of the RowMatrix to arrays, and then print them like this: println(sim.toRowMatrix().rows.toArray().mkString("\n")).
But that gives some output which I couldn't understand.
Can anyone help? Any kind of resource links etc would help a lot!
Thanks!
You can try the following; there is no need to convert to the row matrix format:
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
val transformedRDD = sim.entries.map { case MatrixEntry(row: Long, col: Long, sim: Double) => Array(row, col, sim).mkString(",") }
To retrieve the elements, you can invoke the following action:
transformedRDD.collect()
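For example, to print every entry (each string is "i,j,similarity", the cosine similarity between columns i and j of the RowMatrix; columnSimilarities() only stores the upper triangle):
transformedRDD.collect().foreach(println)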

Spark: Input a vector

I'm getting into Spark and I have problems with Vectors.
import org.apache.spark.mllib.linalg.{Vectors, Vector}
The input of my program is a text file that contains the output of an RDD[Vector]:
dataset.txt:
[-0.5069793074881704,-2.368342680619545,-3.401324690974588]
[-0.7346396928543871,-2.3407983487917448,-2.793949129209909]
[-0.9174226561793709,-0.8027635530022152,-1.701699021443242]
[0.510736518683609,-2.7304268743276174,-2.418865539558031]
So, what I try to do is:
val rdd = sc.textFile("/workingdirectory/dataset")
val data = rdd.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
I get an error because it reads [0.510736518683609 as a number.
Is there any way to load the vector stored in the text file directly, without doing the second line? How can I delete the "[" in the map stage?
I'm really new to Spark, sorry if it's a very obvious question.
Given the input, the simplest thing you can do is to use Vectors.parse:
scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors
scala> Vectors.parse("[-0.50,-2.36,-3.40]")
res14: org.apache.spark.mllib.linalg.Vector = [-0.5,-2.36,-3.4]
It also works with sparse representation:
scala> Vectors.parse("(10,[1,5],[0.5,-1.0])")
res15: org.apache.spark.mllib.linalg.Vector = (10,[1,5],[0.5,-1.0])
Combining it with your data, all you need is:
rdd.map(Vectors.parse)
If you expect malformed / empty lines you can wrap it using Try:
import scala.util.Try
rdd.map(line => Try(Vectors.parse(line))).filter(_.isSuccess).map(_.get)
Here is one way to do it:
val rdd = sc.textFile("/workingdirectory/dataset")
val data = rdd.map { s =>
  val vect = s.replaceAll("\\[", "").replaceAll("\\]", "").split(',').map(_.toDouble)
  Vectors.dense(vect)
}
I've just broken the map into lines for readability purposes.
Note: remember, it's simply string processing on each line.
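An equivalent variant, as a sketch under the same assumption that every line is wrapped in exactly one pair of brackets, that avoids regular expressions:
val data = rdd.map { s =>
  // stripPrefix/stripSuffix only remove the leading "[" and trailing "]"
  val vect = s.trim.stripPrefix("[").stripSuffix("]").split(',').map(_.toDouble)
  Vectors.dense(vect)
}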

Convert Rdd[Vector] to Rdd[Double]

How do I convert a CSV to RDD[Double]? I get the error cannot be applied to (org.apache.spark.rdd.RDD[Unit]) at this line:
val kd = new KernelDensity().setSample(rows)
My full code is here:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
class KdeAnalysis {
val conf = new SparkConf().setAppName("sample").setMaster("local")
val sc = new SparkContext(conf)
val DATAFILE: String = "C:\\Users\\ajohn\\Desktop\\spark_R\\data\\mass_cytometry\\mass.csv"
val rows = sc.textFile(DATAFILE).map { line =>
  val values = line.split(',').map(_.toDouble)
  Vectors.dense(values)
}.cache()
// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
val rdd : RDD[Double] = sc.parallelize(rows)
val kd = new KernelDensity().setSample(rdd)
.setBandwidth(3.0)
// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
}
Since rows is an RDD[org.apache.spark.mllib.linalg.Vector], the following line cannot work:
val rdd : RDD[Double] = sc.parallelize(rows)
parallelize expects a Seq[T], and RDD is not a Seq.
Even if this part worked as you expect, your input is simply wrong. A correct argument for KernelDensity.setSample is either RDD[Double] or JavaRDD[java.lang.Double]. It doesn't appear to support multivariate data at this moment.
Regarding the question from the title, you can flatMap:
rows.flatMap(_.toArray)
or, even better, do it when you create rows:
val rows = sc.textFile(DATAFILE).flatMap(_.split(',').map(_.toDouble)).cache()
but I doubt it is really what you need.
I have prepared this code; please evaluate whether it can help you out:
val doubleRDD = rows.map(_.toArray).flatMap(x => x)
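Putting it together, a minimal sketch of the corrected pipeline, assuming the same DATAFILE and that every comma-separated field is a numeric sample:
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
// Flatten every CSV field into a single RDD[Double] of samples.
val samples: RDD[Double] = sc.textFile(DATAFILE).flatMap(_.split(',').map(_.toDouble)).cache()
val kd = new KernelDensity()
  .setSample(samples)
  .setBandwidth(3.0)
// Density estimates at the requested evaluation points.
val densities: Array[Double] = kd.estimate(Array(-1.0, 2.0, 5.0))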

Standardize an RDD

Suppose I have an RDD of doubles and I want to “standardize” it as follows:
Compute the mean and sd for each col
For each col, subtract the column mean from each entry and divide the result by the column sd
Can this be done efficiently and easily (without converting the RDD into a double array at any stage)?
Thanks and regards,
You can use StandardScaler from Spark itself:
/**
* Standardizes features by removing the mean and scaling to unit variance
* using column summary
*/
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
val data: RDD[Vector] = ???
val scaler = new StandardScaler(true, true).fit(data)
// transform each vector; the result is a new RDD of standardized vectors
val scaled: RDD[Vector] = data.map(v => scaler.transform(v))
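A minimal end-to-end sketch, assuming the raw data is an RDD of Array[Double] rows (rawRows is a hypothetical name):
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
val rawRows: RDD[Array[Double]] = ??? // hypothetical input: one Array[Double] per row
val data: RDD[Vector] = rawRows.map(arr => Vectors.dense(arr))
// withMean = true subtracts each column's mean; withStd = true divides by its sd.
val scaler = new StandardScaler(withMean = true, withStd = true).fit(data)
val standardized: RDD[Vector] = scaler.transform(data)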