Maximum value of an mllib Vector? - scala

I've created an ML pipeline with Apache Spark using mllib.
The prediction result is a DataFrame with a column "probability", which is an mllib Vector of probabilities (similar to predict_proba in scikit-learn).
val rfPredictions = rfModels.bestModel.transform(testing)
val precision = evaluator.evaluate(rfPredictions)
I tried something like this with no success:
rfPredictions.select("probability").map{c => c.getAs[Vector](1).max}
<console>:166: error: value max is not a member of
org.apache.spark.mllib.linalg.Vector
I want a new column with the max of these probabilities. Any ideas?

Vector doesn't have a max method. Try toArray.max:
rfPredictions.select("probability").map{ c => c.getAs[Vector](1).toArray.max }
or argmax:
rfPredictions.select("probability").map{ c => {
val v = c.getAs[Vector](1)
v(v.argmax)
}}
To add the max as a new column, define a udf and use it with the withColumn function:
val max_proba_udf = udf((v: Vector) => v.toArray.max)
rfPredictions.withColumn("max_prob", max_proba_udf($"probability"))

Spark 2.0+
With ml (rather than mllib), this works as follows:
import org.apache.spark.ml.linalg.DenseVector
just_another_df.select("probability").map{ c => c.getAs[DenseVector](0).toArray.max }
Using a udf:
import org.apache.spark.ml.linalg.DenseVector
val max_proba_udf = udf((v: DenseVector) => v.toArray.max)
val rfPredictions = just_another_df.withColumn("MAX_PROB", max_proba_udf($"probability"))
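If the probability column may also contain sparse vectors, a slight variation is to type the udf parameter as the ml Vector trait instead of DenseVector. This is a minimal sketch, assuming Spark 2.x and the same just_another_df as above:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf
// Vector covers both DenseVector and SparseVector values stored in the column
val max_proba_udf = udf((v: Vector) => v.toArray.max)
val withMax = just_another_df.withColumn("MAX_PROB", max_proba_udf($"probability"))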

Related

subtract vector column from scalar in spark using scala

I used the MovieLens 20 million dataset, which contains a file called rating.csv (UserId, MovieId, Rating). I applied Alternating Least Squares (ALS), which outputs (userId, featureVector) across 10 parquet files (dimensionality reduction).
I want to normalize the feature vectors using the z-score method.
I want to subtract the constant scalar 2.484 from the vector (featureVector), divide the result by 1.8305, and save the values back to the parquet files' features column.
val df = sqlContext.read.parquet("file:///usr/local/spark/dataset/model/data/user/part-r-00000-7d55ba81-5761-4e36-b488-7e6214df2a68.snappy.parquet")
sqlContext.sql("select features from df")
df.withColumn("output", "features" -2.484).show(20)
How can I subtract the scalar from each value of the vector?
If you want to subtract a constant from each vector value, you can try this (the example uses 2.3333):
import spark.implicits._
import org.apache.spark.ml.linalg.{DenseVector, Vectors}
import org.apache.spark.rdd.RDD
val df = Seq(Vector(1.0, 2.0, 2.5, 3.0, 3.5)).toDF("features")
val rdd: RDD[DenseVector] = df.select('features)
  .rdd
  .map(s => s.getAs[Seq[Double]]("features").toArray)
  .map(s => Vectors.dense(s.map(x => x - 2.3333)).toDense)
rdd.take(10).foreach(println(_))
output:
[-1.3333,-0.33329999999999993,0.16670000000000007,0.6667000000000001,1.1667]
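For the full z-score step asked in the question (subtract 2.484, then divide by 1.8305), a udf keeps the data in the DataFrame so it can be written straight back to parquet. This is a minimal sketch, assuming the features column holds ml vectors (if it is stored as array<double>, change the udf parameter to Seq[Double]); the output path is a placeholder:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}
// z-score with the constants from the question; adjust if the real mean/stddev differ
val zscore = udf((v: Vector) => Vectors.dense(v.toArray.map(x => (x - 2.484) / 1.8305)))
val normalized = df.withColumn("features", zscore(col("features")))
normalized.write.mode("overwrite").parquet("/path/to/output.parquet")  // placeholder path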

How to calculate euclidean distance of each row in a dataframe to a constant reference array

I have a dataframe created from parquet files that has 512 columns (all float values).
I am trying to calculate the Euclidean distance of each row in my dataframe to a constant reference array.
My development environment is Zeppelin 0.7.3 with Spark 2.1 and Scala. Here are the Zeppelin paragraphs I run:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
//Create dataframe from parquet file
val filePath = "/tmp/vector.parquet/*.parquet"
val df = spark.read.parquet(filePath)
//Create assembler and vectorize df
val assembler = new VectorAssembler()
.setInputCols(df.columns)
.setOutputCol("features")
val training = assembler.transform(df)
//Create udf
val eucDisUdf = udf((features: Vector, myvec: Vector) => Vectors.sqdist(features, myvec))
//Create ref vector
val myScalaVec = Vectors.dense(Array.fill(512)(25.44859))
val distDF = training2.withColumn("euc", eucDisUdf($"features", myScalaVec))
This code gives the following error for eucDisUdf call:
error: type mismatch; found : org.apache.spark.ml.linalg.Vector
required: org.apache.spark.sql.Column
I appreciate any idea how to eliminate this error and compute distances properly in scala.
I think you can use currying to achieve that:
def eucDisUdf(myvec:Vector) = udf((features: Vector) => Vectors.sqdist(features, myvec))
val myScalaVec = Vectors.dense(Array.fill(512)(25.44859))
val distDF = training2.withColumn("euc", eucDisUdf(myScalaVec)($"features"))
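One caveat: Vectors.sqdist returns the squared distance. If the actual Euclidean distance is needed, take the square root inside the udf; a minimal sketch of the same currying approach:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
// Curried udf returning the true Euclidean distance (square root of the squared distance)
def eucDisUdf(myvec: Vector) = udf((features: Vector) => math.sqrt(Vectors.sqdist(features, myvec)))
val distDF = training2.withColumn("euc", eucDisUdf(myScalaVec)($"features"))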

Spark Scala: Vector Dataframe to RDD of values

I have a spark dataframe that has a vector in it:
org.apache.spark.sql.DataFrame = [sF: vector]
and I'm trying to convert it to an RDD of values:
org.apache.spark.rdd.RDD[(Double, Double)]
However, I haven't been able to convert it properly. I've tried:
val m2 = m1.select($"sF").rdd.map{case Row(v1, v2) => (v1.toString.toDouble, v2.toString.toDouble)}
and it compiles, but I get a runtime error:
scala.MatchError: [[-0.1111111111111111,-0.2222222222222222]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
when I do:
m2.take(10).foreach(println)
Is there something I'm doing wrong?
Assuming you want the first two values of the vectors present in the sF column, maybe this will work:
import org.apache.spark.mllib.linalg.Vector
val m2 = m1
  .select($"sF")
  .map { case Row(v: Vector) => (v(0), v(1)) }
You are getting an error because case Row(v1, v2) does not match the contents of the rows in your DataFrame: you are expecting two values on each row (v1 and v2), but each row contains only one value, a Vector.
Note: you don't need to call .rdd if you are going to do a .map operation.
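If the result must be typed as RDD[(Double, Double)], as stated in the question, a minimal sketch is to go through .rdd before mapping (this also works on newer Spark versions, where DataFrame.map returns a Dataset rather than an RDD):
import org.apache.spark.mllib.linalg.Vector  // use org.apache.spark.ml.linalg.Vector with Spark 2.x ml pipelines
import org.apache.spark.sql.Row
// .rdd first, then map: the result is the requested RDD of pairs
val m2: org.apache.spark.rdd.RDD[(Double, Double)] =
  m1.select($"sF").rdd.map { case Row(v: Vector) => (v(0), v(1)) }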

Convert Rdd[Vector] to Rdd[Double]

How do I convert a CSV to RDD[Double]? I get the error cannot be applied to (org.apache.spark.rdd.RDD[Unit]) at this line:
val kd = new KernelDensity().setSample(rows)
My full code is here:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
class KdeAnalysis {
  val conf = new SparkConf().setAppName("sample").setMaster("local")
  val sc = new SparkContext(conf)
  val DATAFILE: String = "C:\\Users\\ajohn\\Desktop\\spark_R\\data\\mass_cytometry\\mass.csv"
  val rows = sc.textFile(DATAFILE).map { line =>
    val values = line.split(',').map(_.toDouble)
    Vectors.dense(values)
  }.cache()
  // Construct the density estimator with the sample data and a standard deviation for the Gaussian kernels
  val rdd: RDD[Double] = sc.parallelize(rows)
  val kd = new KernelDensity().setSample(rdd)
    .setBandwidth(3.0)
  // Find density estimates for the given values
  val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
}
Since rows is an RDD[org.apache.spark.mllib.linalg.Vector], the following line cannot work:
val rdd : RDD[Double] = sc.parallelize(rows)
parallelize expects Seq[T] and RDD is not a Seq.
Even if this part worked as you expect, your input is simply wrong. A correct argument for KernelDensity.setSample is either RDD[Double] or JavaRDD[java.lang.Double]. It doesn't seem to support multivariate data at the moment.
Regarding the question from the title, you can flatMap:
rows.flatMap(_.toArray)
or even better when you create rows
val rows = sc.textFile(DATAFILE).flatMap(_.split(',').map(_.toDouble)).cache()
but I doubt it is really what you need.
I have prepared this code; please evaluate whether it can help you out:
val doubleRDD = rows.map(_.toArray).flatMap(x => x)
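Putting the suggestions together, a minimal end-to-end sketch that flattens every CSV value into a single RDD[Double] and feeds it to KernelDensity (DATAFILE is the path from the question):
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
// One flat sample of doubles, since setSample requires RDD[Double]
val samples: RDD[Double] = sc.textFile(DATAFILE).flatMap(_.split(',').map(_.toDouble)).cache()
val kd = new KernelDensity().setSample(samples).setBandwidth(3.0)
// Density estimates at the requested evaluation points
val densities: Array[Double] = kd.estimate(Array(-1.0, 2.0, 5.0))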

Convert RDD of Vector in LabeledPoint using Scala - MLLib in Apache Spark

I'm using MLlib of Apache Spark with Scala. I need to convert a group of Vectors
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
into LabeledPoints in order to apply the MLlib algorithms.
Each vector is composed of Double values of 0.0 (false) or 1.0 (true).
All the vectors are saved in a RDD, so the final RDD is of the type
val data_tmp: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
So, in the RDD there are vectors created with:
def createArray(values: List[String]): Vector = {
  val arr: Array[Double] = new Array[Double](tags_table.size)
  tags_table.foreach(x => arr(x._2) = if (values.contains(x._1)) 1.0 else 0.0)
  Vectors.dense(arr)
}
/*each element of result is a List[String]*/
val data_tmp=result.map(x=> createArray(x._2))
val data: RowMatrix = new RowMatrix(data_tmp)
How can I create a set of LabeledPoints from this RDD (data_tmp) or from the RowMatrix (data) in order to use the MLlib algorithms?
For example, I need to apply the linear SVM algorithm shown here.
I found the solution:
def createArray(values: List[String]): Vector = {
  val arr: Array[Double] = new Array[Double](tags_table.size)
  tags_table.foreach(x => arr(x._2) = if (values.contains(x._1)) 1.0 else 0.0)
  Vectors.dense(arr)
}
val data_tmp = result.map(x => createArray(x._2))
val parsedData = data_tmp.map { line => LabeledPoint(1.0, line) }
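Note that this hardcodes the label 1.0 for every point. If the real label is available alongside each List[String], it can be plugged into LabeledPoint directly and the data fed to the linear SVM mentioned in the question. A minimal sketch, where the (label, values) structure of result is a hypothetical assumption:
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
// Hypothetical: assumes each element of `result` is (label: Double, values: List[String])
val parsedData = result.map { case (label, values) => LabeledPoint(label, createArray(values)) }
// Train a linear SVM, as in the MLlib example referenced in the question
val numIterations = 100
val model = SVMWithSGD.train(parsedData, numIterations)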