How to calculate euclidean distance of each row in a dataframe to a constant reference array - scala

I have a dataframe which is created from parquet files that has 512 columns(all float values).
I am trying to calculate euclidean distance of each row in my dataframe to a constant reference array.
My development environment is Zeppelin 0.7.3 with spark 2.1 and Scala. Here is the zeppelin paragraphs I run:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
//Create dataframe from parquet file
val filePath = "/tmp/vector.parquet/*.parquet"
val df = spark.read.parquet(filePath)
//Create assembler and vectorize df
val assembler = new VectorAssembler()
.setInputCols(df.columns)
.setOutputCol("features")
val training = assembler.transform(df)
//Create udf
val eucDisUdf = udf((features: Vector,
myvec:Vector)=>Vectors.sqdist(features, myvec))
//Cretae ref vector
val myScalaVec = Vectors.dense( Array.fill(512)(25.44859))
val distDF =
training2.withColumn("euc",eucDisUdf($"features",myScalaVec))
This code gives the following error for eucDisUdf call:
error: type mismatch; found : org.apache.spark.ml.linalg.Vector
required: org.apache.spark.sql.Column
I appreciate any idea how to eliminate this error and compute distances properly in scala.

I think you can use currying to achieve that:
def eucDisUdf(myvec:Vector) = udf((features: Vector) => Vectors.sqdist(features, myvec))
val myScalaVec = Vectors.dense(Array.fill(512)(25.44859))
val distDF = training2.withColumn( "euc", eucDisUdf(myScalaVec)($"features") )

Related

How to apply kmeans for parquet file?

I want to apply k-means for my parquet file.but error appear .
edited
java.lang.ArrayIndexOutOfBoundsException: 2
code
val Data = sqlContext.read.parquet("/usr/local/spark/dataset/norm")
val parsedData = Data.rdd.map(s => Vectors.dense(s.getDouble(1),s.getDouble(2))).cache()
import org.apache.spark.mllib.clustering.KMeans
val numClusters = 30
val numIteration = 1
val userClusterModel = KMeans.train(parsedData, numClusters, numIteration)
val userfeature1 = parsedData.first
val userCost = userClusterModel.computeCost(parsedData)
println("WSSSE for users: " + userCost)
How to solve this error?
I believe you are using https://spark.apache.org/docs/latest/mllib-clustering.html#k-means as a reference to build your K-Means model.
In the example
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
data is of type org.apache.spark.rdd.RDD In your case sqlContext.read.parquet is of type DataFrame. Therefore you would have to convert the dataframe to RDD to perform the split operation
To convert from Dataframe to RDD you can use the below sample as reference
val rows: RDD[Row] = df.rdd
val parsedData = Data.rdd.map(s => Vectors.dense(s.getInt(0),s.getDouble(1))).cache()

Input Spark Scala Dataframe Column as Vector

Relatively new to scala and the Spark API kit but I have a question trying to make use of the vector assembler
http://spark.apache.org/docs/latest/ml-features.html#vectorassembler
to then make use of matrix correlations
https://spark.apache.org/docs/2.1.0/mllib-statistics.html#correlations
The dataframe column is of dtype linalg.Vector
val assembler = new VectorAssembler()
val trainwlabels3 = assembler.transform(trainwlabels2)
trainwlabels3.dtypes(0)
res90: (String, String) = (features,org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7)
and yet calling this to an RDD for the statistics tool throws a mismatch error.
val data: RDD[Vector] = sc.parallelize(
trainwlabels3("features")
)
<console>:80: error: type mismatch;
found : org.apache.spark.sql.Column
required: Seq[org.apache.spark.mllib.linalg.Vector]
Thanks in advance for any help.
You should just select:
val features = trainwlabels3.select($"features")
Convert to RDD
val featuresRDD = features.rdd
and map:
featuresRDD.map(_.getAs[Vector]("features"))
This should work for you:
val rddForStatistics = new VectorAssembler()
.transform(trainwlabels2)
.select($"features")
.as[Vector] //turns Dataset[Row] (a.k.a DataFrame) to DataSet[Vector]
.rdd
However, you should avoid RDDs and figure out how to do what you want with the DataFrame-based API (in the spark.ml package) because working with RDDs is all but deprecated in MLlib.

Spark Scala Vector Map ClassCastException

Trying to make use of the Statistics tool in Spark for Scala and having difficulty preparing a vector that will take.
val featuresrdd = features.rdd.map{_.getAs[Vector]("features")}
featuresrdd: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[952] at map at <console>:82
This produces a vector of type 'mllib.linalg.vector', however upon using this in the tool, the vector has changed to type 'DenseVector'.
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
val correlMatrix: Matrix = Statistics.corr(featuresrdd, "pearson")
java.lang.ClassCastException: org.apache.spark.ml.linalg.DenseVector cannot be cast to org.apache.spark.mllib.linalg.Vector
Any help would be greatly appreciated.
Thanks
Use asML function to convert old Vector to new Vector in ML:
val newMLFeaturesRDD = featuresrdd.map(_.asML)
val correlMatrix: Matrix = Statistics.corr(newMLFeaturesRDD , "pearson")

How to convert spark DataFrame to RDD mllib LabeledPoints?

I tried to apply PCA to my data and then apply RandomForest to the transformed data. However, PCA.transform(data) gave me a DataFrame but I need a mllib LabeledPoints to feed my RandomForest. How can I do that?
My code:
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val dataset = MLUtils.loadLibSVMFile(sc, "data/mnist/mnist.bz2")
val splits = dataset.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
val trainingDf = trainingData.toDF()
val pca = new PCA()
.setInputCol("features")
.setOutputCol("pcaFeatures")
.setK(100)
.fit(trainingDf)
val pcaTrainingData = pca.transform(trainingDf)
val numClasses = 10
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 20
val maxBins = 32
val model = RandomForest.trainClassifier(pcaTrainingData, numClasses, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
error: type mismatch;
found : org.apache.spark.sql.DataFrame
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]
I tried the following two possible solutions but they didn't work:
scala> val pcaTrainingData = trainingData.map(p => p.copy(features = pca.transform(p.features)))
<console>:39: error: overloaded method value transform with alternatives:
(dataset: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame <and>
(dataset: org.apache.spark.sql.DataFrame,paramMap: org.apache.spark.ml.param.ParamMap)org.apache.spark.sql.DataFrame <and>
(dataset: org.apache.spark.sql.DataFrame,firstParamPair: org.apache.spark.ml.param.ParamPair[_],otherParamPairs: org.apache.spark.ml.param.ParamPair[_]*)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.mllib.linalg.Vector)
And:
val labeled = pca
.transform(trainingDf)
.map(row => LabeledPoint(row.getDouble(0), row(4).asInstanceOf[Vector[Int]]))
error: type mismatch;
found : scala.collection.immutable.Vector[Int]
required: org.apache.spark.mllib.linalg.Vector
(I have imported org.apache.spark.mllib.linalg.Vectors in the above case)
Any help?
The correct approach here is the second one you tried - mapping each Row into a LabeledPoint to get an RDD[LabeledPoint]. However, it has two mistakes:
The correct Vector class (org.apache.spark.mllib.linalg.Vector) does NOT take type arguments (e.g. Vector[Int]) - so even though you had the right import, the compiler concluded that you meant scala.collection.immutable.Vector which DOES.
The DataFrame returned from PCA.fit() has 3 columns, and you tried to extract column number 4. For example, showing first 4 lines:
+-----+--------------------+--------------------+
|label| features| pcaFeatures|
+-----+--------------------+--------------------+
| 5.0|(780,[152,153,154...|[880.071111851977...|
| 1.0|(780,[158,159,160...|[-41.473039034112...|
| 2.0|(780,[155,156,157...|[931.444898405036...|
| 1.0|(780,[124,125,126...|[25.5114585648411...|
+-----+--------------------+--------------------+
To make this easier - I prefer using the column names instead of their indices.
So here's the transformation you need:
val labeled = pca.transform(trainingDf).rdd.map(row => LabeledPoint(
row.getAs[Double]("label"),
row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")
))

Convert Rdd[Vector] to Rdd[Double]

How do I convert csv to Rdd[Double]? I have the error: cannot be applied to (org.apache.spark.rdd.RDD[Unit]) at this line:
val kd = new KernelDensity().setSample(rows)
My full code is here:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
class KdeAnalysis {
val conf = new SparkConf().setAppName("sample").setMaster("local")
val sc = new SparkContext(conf)
val DATAFILE: String = "C:\\Users\\ajohn\\Desktop\\spark_R\\data\\mass_cytometry\\mass.csv"
val rows = sc.textFile(DATAFILE).map {
line => val values = line.split(',').map(_.toDouble)
Vectors.dense(values)
}.cache()
// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
val rdd : RDD[Double] = sc.parallelize(rows)
val kd = new KernelDensity().setSample(rdd)
.setBandwidth(3.0)
// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
}
Since rows is a RDD[org.apache.spark.mllib.linalg.Vector] following line cannot work:
val rdd : RDD[Double] = sc.parallelize(rows)
parallelize expects Seq[T] and RDD is not a Seq.
Even if this part worked as you expect your input is simply wrong. A correct argument for KernelDensity.setSample is either RDD[Double] or JavaRDD[java.lang.Double]. It looks like it doesn't support a multivariate data at this moment.
Regarding a question from the tile you can flatMap
rows.flatMap(_.toArray)
or even better when you create rows
val rows = sc.textFile(DATAFILE).flatMap(_.split(',').map(_.toDouble)).cache()
but I doubt it is really what you need.
Have prepared this code, please evaluate if it can help you out ->
val doubleRDD = rows.map(_.toArray).flatMap(x => x)