I have a DataFrame created from parquet files that has 512 columns (all float values).
I am trying to calculate the Euclidean distance of each row in my DataFrame to a constant reference array.
My development environment is Zeppelin 0.7.3 with Spark 2.1 and Scala. Here are the Zeppelin paragraphs I run:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
//Create dataframe from parquet file
val filePath = "/tmp/vector.parquet/*.parquet"
val df = spark.read.parquet(filePath)
//Create assembler and vectorize df
val assembler = new VectorAssembler()
.setInputCols(df.columns)
.setOutputCol("features")
val training = assembler.transform(df)
//Create udf
val eucDisUdf = udf((features: Vector, myvec: Vector) => Vectors.sqdist(features, myvec))
//Create ref vector
val myScalaVec = Vectors.dense(Array.fill(512)(25.44859))
val distDF = training.withColumn("euc", eucDisUdf($"features", myScalaVec))
This code gives the following error for the eucDisUdf call:
error: type mismatch;
 found   : org.apache.spark.ml.linalg.Vector
 required: org.apache.spark.sql.Column
I would appreciate any idea on how to eliminate this error and compute the distances properly in Scala.
I think you can use currying to achieve that:
def eucDisUdf(myvec: Vector) = udf((features: Vector) => Vectors.sqdist(features, myvec))
val myScalaVec = Vectors.dense(Array.fill(512)(25.44859))
val distDF = training.withColumn("euc", eucDisUdf(myScalaVec)($"features"))
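Note that Vectors.sqdist returns the squared distance. If you need the actual Euclidean distance, a minimal variant of the same idea (assuming the training DataFrame from the question; the name eucDistUdf is only illustrative) is to wrap the result in math.sqrt:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
// Illustrative variant: the square root of sqdist gives the true Euclidean distance.
def eucDistUdf(myvec: Vector) = udf((features: Vector) => math.sqrt(Vectors.sqdist(features, myvec)))
val distDF2 = training.withColumn("euc", eucDistUdf(myScalaVec)($"features"))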
I am trying to create a block matrix from an input data file. I have managed to read the data from the file and store it correctly in both IndexedRowMatrix and CoordinateMatrix format.
When I use .toBlockMatrix on the CoordinateMatrix, the result is a block matrix containing only 0.0, with the same dimensions as the CoordinateMatrix.
I am using version 1.5.0-cdh5.5.0.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix
import org.apache.spark.mllib.linalg.distributed.IndexedRow
import org.apache.spark.mllib.linalg.distributed.BlockMatrix
val conf = new SparkConf().setMaster("local").setAppName("Transpose")
val sc = new SparkContext(conf)
val dataRDD = sc.textFile("/user/cloudera/data/data.txt")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .zipWithIndex
  .map(_.swap)
//Format of dataRDD is RDD[(Long, Vector)]
val rows = dataRDD.map{case(k,v) => IndexedRow(k,v)}
//Format of rows is RDD[IndexedRow]
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
val coordMat: CoordinateMatrix = mat.toCoordinateMatrix()
val blockMat: BlockMatrix = coordMat.toBlockMatrix().cache()
The data file is simply two columns by sixty rows of integers:
140 123
141 310
310 381
480 321
... ...
Update:
I've done some investigating and discovered that the groupByKey function is not working correctly, which is what is preventing the BlockMatrix from being formed properly. However, I still do not know why groupByKey, join, and groupBy are not working and always return an empty result.
I have solved the problem by removing these lines of code:
val conf = new SparkConf().setMaster("local").setAppName("Transpose")
val sc = new SparkContext(conf)
I found the answer in a comment by Farzad Nozarian on the page linked below:
Unable to count words using reduceByKey((v1,v2) => v1 + v2) scala function in spark
As a side note, this might help people who are getting empty results for .groupByKey, .reduceByKey, .join, etc.
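For reference, here is a minimal sketch of the same pipeline as it can be run in spark-shell (or any environment that already provides sc), without constructing a second SparkContext; BlockMatrix.validate() is just a quick way to confirm the blocks were built correctly:
// Run in spark-shell: `sc` already exists, so no new SparkConf/SparkContext is created.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, IndexedRow, IndexedRowMatrix}
val dataRDD = sc.textFile("/user/cloudera/data/data.txt")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .zipWithIndex
  .map(_.swap)
val mat = new IndexedRowMatrix(dataRDD.map { case (k, v) => IndexedRow(k, v) })
val coordMat: CoordinateMatrix = mat.toCoordinateMatrix()
val blockMat: BlockMatrix = coordMat.toBlockMatrix().cache()
blockMat.validate() // throws an exception if the block matrix was not built correctly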
I have a very simple piece of code to try out cosine similarity:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}
val rows = Array(((1,2,3,4,5),(1,2,3,4,5),(1,2,4,5,8),(3,4,1,2,7),(7,7,7,7,7)))
val mat = new RowMatrix(rows)
val simsPerfect = mat.columnSimilarities()
val simsEstimate = mat.columnSimilarities(0.8)
I run this code on Amazon AWS, which has Spark 1.5, but I get the following message for the last two lines:
"Error: value columnSimilarities is not a member of org.apache.spark.rdd.RDD[(int,int)]"
Could you please help me resolve this issue?
I found the answer. I needed to convert the matrix into an RDD. Here is the correct code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}
import org.apache.spark.rdd._
import org.apache.spark.mllib.linalg._
def matrixToRDD(m: Matrix): RDD[Vector] = {
  val columns = m.toArray.grouped(m.numRows)
  val rows = columns.toSeq.transpose // Skip this if you want a column-major RDD.
  val vectors = rows.map(row => new DenseVector(row.toArray))
  sc.parallelize(vectors)
}
val dm: Matrix = Matrices.dense(5, 5, Array(1,2,3,4,5,1,2,3,4,5,1,2,4,5,8,3,4,1,2,7,7,7,7,7,7))
val rows = matrixToRDD(dm)
val mat = new RowMatrix(rows)
val simsPerfect = mat.columnSimilarities()
val simsEstimate = mat.columnSimilarities(0.8)
println("Pairwise similarities are: " + simsPerfect.entries.collect.mkString(", "))
println("Estimated pairwise similarities are: " + simsEstimate.entries.collect.mkString(", "))
Cheers
How do I convert a CSV to an RDD[Double]? I get the error cannot be applied to (org.apache.spark.rdd.RDD[Unit]) at this line:
val kd = new KernelDensity().setSample(rows)
My full code is here:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
class KdeAnalysis {
  val conf = new SparkConf().setAppName("sample").setMaster("local")
  val sc = new SparkContext(conf)

  val DATAFILE: String = "C:\\Users\\ajohn\\Desktop\\spark_R\\data\\mass_cytometry\\mass.csv"

  val rows = sc.textFile(DATAFILE).map { line =>
    val values = line.split(',').map(_.toDouble)
    Vectors.dense(values)
  }.cache()

  // Construct the density estimator with the sample data and a standard deviation
  // for the Gaussian kernels
  val rdd : RDD[Double] = sc.parallelize(rows)
  val kd = new KernelDensity().setSample(rdd)
    .setBandwidth(3.0)

  // Find density estimates for the given values
  val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
}
Since rows is an RDD[org.apache.spark.mllib.linalg.Vector], the following line cannot work:
val rdd : RDD[Double] = sc.parallelize(rows)
parallelize expects a Seq[T], and an RDD is not a Seq.
Even if this part worked as you expect, your input is simply wrong. A correct argument for KernelDensity.setSample is either RDD[Double] or JavaRDD[java.lang.Double]. It looks like it doesn't support multivariate data at the moment.
Regarding the question from the title, you can flatMap:
rows.flatMap(_.toArray)
or, even better, do it when you create rows:
val rows = sc.textFile(DATAFILE).flatMap(_.split(',').map(_.toDouble)).cache()
but I doubt it is really what you need.
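Putting it together, a minimal sketch of the corrected pipeline (assuming the same DATAFILE path and that a univariate density over all values is what you want):
// Build a flat RDD[Double] straight from the CSV and feed it to KernelDensity.
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
val samples: RDD[Double] = sc.textFile(DATAFILE)
  .flatMap(_.split(',').map(_.toDouble))
  .cache()
val kde = new KernelDensity()
  .setSample(samples)
  .setBandwidth(3.0)
val densityEstimates: Array[Double] = kde.estimate(Array(-1.0, 2.0, 5.0))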
I have prepared this code; please evaluate whether it can help you out:
val doubleRDD = rows.map(_.toArray).flatMap(x => x)
Suppose I have an RDD of doubles and I want to “standardize” it as follows:
Compute the mean and sd for each col
For each col, subtract the column mean from each entry and divide the result by the column sd
Can this be done efficiently and easily (without converting the RDD into a double array at any stage)?
Thanks and regards,
You can use StandardScaler from Spark itself:
/**
 * Standardizes features by removing the mean and scaling to unit variance
 * using column summary statistics.
 */
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
val data: RDD[Vector] = ???
val scaler = new StandardScaler(withMean = true, withStd = true).fit(data)
val scaled: RDD[Vector] = data.map(vector => scaler.transform(vector))
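For completeness, here is a small self-contained sketch (the sample vectors are made up, since the question's data is not shown) of what the full flow looks like; withMean = true centers each column and withStd = true scales it to unit variance:
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
// Hypothetical sample data: three rows with two columns each.
val sampleData: RDD[Vector] = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)
))
val scalerModel = new StandardScaler(withMean = true, withStd = true).fit(sampleData)
val standardized: RDD[Vector] = sampleData.map(vector => scalerModel.transform(vector))
standardized.collect().foreach(println)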