Sum vector columns in Spark - Scala

I have a dataframe with multiple columns that contain vectors (the number of vector columns is dynamic). I need to create a new column that takes the sum of all the vector columns. I'm having a hard time getting this done. Here is code to generate a sample dataset that I'm testing on.
import org.apache.spark.ml.feature.VectorAssembler
val temp1 = spark.createDataFrame(Seq(
  (1, 1.0, 0.0, 4.7, 6, 0.0),
  (2, 1.0, 0.0, 6.8, 6, 0.0),
  (3, 1.0, 1.0, 7.8, 5, 0.0),
  (4, 0.0, 1.0, 4.1, 7, 0.0),
  (5, 1.0, 0.0, 2.8, 6, 1.0),
  (6, 1.0, 1.0, 6.1, 5, 0.0),
  (7, 0.0, 1.0, 4.9, 7, 1.0),
  (8, 1.0, 0.0, 7.3, 6, 0.0)
)).toDF("id", "f1", "f2", "f3", "f4", "label")

val assembler1 = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))
  .setOutputCol("vec1")
val temp2 = assembler1.setHandleInvalid("skip").transform(temp1)

val assembler2 = new VectorAssembler()
  .setInputCols(Array("f2", "f3", "f4"))
  .setOutputCol("vec2")
val df = assembler2.setHandleInvalid("skip").transform(temp2)
This gives me the following dataset
+---+---+---+---+---+-----+-------------+-------------+
| id| f1| f2| f3| f4|label| vec1| vec2|
+---+---+---+---+---+-----+-------------+-------------+
| 1|1.0|0.0|4.7| 6| 0.0|[1.0,0.0,4.7]|[0.0,4.7,6.0]|
| 2|1.0|0.0|6.8| 6| 0.0|[1.0,0.0,6.8]|[0.0,6.8,6.0]|
| 3|1.0|1.0|7.8| 5| 0.0|[1.0,1.0,7.8]|[1.0,7.8,5.0]|
| 4|0.0|1.0|4.1| 7| 0.0|[0.0,1.0,4.1]|[1.0,4.1,7.0]|
| 5|1.0|0.0|2.8| 6| 1.0|[1.0,0.0,2.8]|[0.0,2.8,6.0]|
| 6|1.0|1.0|6.1| 5| 0.0|[1.0,1.0,6.1]|[1.0,6.1,5.0]|
| 7|0.0|1.0|4.9| 7| 1.0|[0.0,1.0,4.9]|[1.0,4.9,7.0]|
| 8|1.0|0.0|7.3| 6| 0.0|[1.0,0.0,7.3]|[0.0,7.3,6.0]|
+---+---+---+---+---+-----+-------------+-------------+
If I needed to take the sum of regular columns, I could do it using something like:
import org.apache.spark.sql.functions.col
df.withColumn("sum", namesOfColumnsToSum.map(col).reduce((c1, c2)=>c1+c2))
I know I can use Breeze to sum DenseVectors just by using the "+" operator:
import breeze.linalg._
val v1 = DenseVector(1,2,3)
val v2 = DenseVector(5,6,7)
v1+v2
The above code gives me the expected vector, but I'm not sure how to apply the same idea to the vector columns of the dataframe and sum vec1 and vec2 into a new column.
I did try the suggestions mentioned here, but had no luck.

Here's my take, but coded in PySpark. Someone can probably help in translating this to Scala:
from pyspark.ml.linalg import Vectors, VectorUDT
import numpy as np
from pyspark.sql.functions import udf, array
def vector_sum(arr):
    return Vectors.dense(np.sum(arr, axis=0))
vector_sum_udf = udf(vector_sum, VectorUDT())
df = df.withColumn('sum',vector_sum_udf(array(['vec1','vec2'])))
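For completeness, here is a rough Scala equivalent of the same idea, sketched under the assumption that all vector columns in a row have the same length (the names vectorSum, vectorCols, and withSum are mine, not from the original answer):
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{array, col, udf}

// Element-wise sum of an arbitrary number of ml vectors.
// Assumes every vector in a row has the same dimension.
val vectorSum = udf { (vs: Seq[Vector]) =>
  Vectors.dense(vs.map(_.toArray).transpose.map(_.sum).toArray)
}

// The list of vector columns can be built dynamically at runtime.
val vectorCols = Seq("vec1", "vec2")
val withSum = df.withColumn("sum", vectorSum(array(vectorCols.map(col): _*)))
withSum.select("id", "vec1", "vec2", "sum").show(false)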

Related

How Does Everyone Deal with Probabilities from XGBoost Scored Data? Scala [duplicate]

I just used StandardScaler to normalize my features for an ML application. After selecting the scaled features, I want to convert this back to a dataframe of Doubles, though the length of my vectors is arbitrary. I know how to do it for a specific 3 features by using
myDF.map{case Row(v: Vector) => (v(0), v(1), v(2))}.toDF("f1", "f2", "f3")
but not for an arbitrary amount of features. Is there an easy way to do this?
Example:
val testDF = sc.parallelize(List(Vectors.dense(5D, 6D, 7D), Vectors.dense(8D, 9D, 10D), Vectors.dense(11D, 12D, 13D))).map(Tuple1(_)).toDF("scaledFeatures")
val myColumnNames = List("f1", "f2", "f3")
// val finalDF = DataFrame[f1: Double, f2: Double, f3: Double]
EDIT
I found out how to unpack into column names when creating the dataframe, but I am still having trouble converting a vector to the sequence needed to create the dataframe:
finalDF = testDF.map{case Row(v: Vector) => v.toArray.toSeq /* <= this errors */}.toDF(List("f1", "f2", "f3"): _*)
Spark >= 3.0.0
Since Spark 3.0 you can use vector_to_array (together with the same exprs column list built in the section below):
import org.apache.spark.ml.functions.vector_to_array
testDF.select(vector_to_array($"scaledFeatures").alias("_tmp")).select(exprs:_*)
Spark < 3.0.0
One possible approach is something similar to this
import org.apache.spark.sql.functions.udf
// In Spark 1.x you will have to replace the ML Vector with the MLlib one
// import org.apache.spark.mllib.linalg.Vector
// In 2.x the below is usually the right choice
import org.apache.spark.ml.linalg.Vector
// Get size of the vector
val n = testDF.first.getAs[Vector](0).size
// Simple helper to convert vector to array<double>
// asNondeterministic is available in Spark 2.3 or later
// It can be removed, but at the cost of decreased performance
val vecToSeq = udf((v: Vector) => v.toArray).asNondeterministic
// Prepare a list of columns to create
val exprs = (0 until n).map(i => $"_tmp".getItem(i).alias(s"f$i"))
testDF.select(vecToSeq($"scaledFeatures").alias("_tmp")).select(exprs:_*)
If you know a list of columns upfront you can simplify this a little:
val cols: Seq[String] = ???
val exprs = cols.zipWithIndex.map{ case (c, i) => $"_tmp".getItem(i).alias(c) }
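Putting the pieces above together for the example in the question, a minimal end-to-end sketch (assuming the testDF and myColumnNames values defined there, a Spark 2.x ml Vector, and spark.implicits._ in scope for the $ syntax):
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// Convert the vector to array<double>, then unpack it into named columns.
val vecToSeq = udf((v: Vector) => v.toArray)

val exprs = myColumnNames.zipWithIndex.map {
  case (c, i) => $"_tmp".getItem(i).alias(c)
}

val finalDF = testDF
  .select(vecToSeq($"scaledFeatures").alias("_tmp"))
  .select(exprs: _*)
// finalDF: org.apache.spark.sql.DataFrame = [f1: double, f2: double, f3: double]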
For Python equivalent see How to split Vector into columns - using PySpark.
Please try VectorSlicer:
import org.apache.spark.ml.feature.{VectorAssembler, VectorSlicer}
import org.apache.spark.ml.linalg.Vectors
val dataset = spark.createDataFrame(
Seq((1, 0.2, 0.8), (2, 0.1, 0.9), (3, 0.3, 0.7))
).toDF("id", "negative_logit", "positive_logit")
val assembler = new VectorAssembler()
.setInputCols(Array("negative_logit", "positive_logit"))
.setOutputCol("prediction")
val output = assembler.transform(dataset)
output.show()
/*
+---+--------------+--------------+----------+
| id|negative_logit|positive_logit|prediction|
+---+--------------+--------------+----------+
| 1| 0.2| 0.8| [0.2,0.8]|
| 2| 0.1| 0.9| [0.1,0.9]|
| 3| 0.3| 0.7| [0.3,0.7]|
+---+--------------+--------------+----------+
*/
val slicer = new VectorSlicer()
.setInputCol("prediction")
.setIndices(Array(1))
.setOutputCol("positive_prediction")
val posi_output = slicer.transform(output)
posi_output.show()
/*
+---+--------------+--------------+----------+-------------------+
| id|negative_logit|positive_logit|prediction|positive_prediction|
+---+--------------+--------------+----------+-------------------+
| 1| 0.2| 0.8| [0.2,0.8]| [0.8]|
| 2| 0.1| 0.9| [0.1,0.9]| [0.9]|
| 3| 0.3| 0.7| [0.3,0.7]| [0.7]|
+---+--------------+--------------+----------+-------------------+
*/
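Note that VectorSlicer still produces a vector column (of length 1 here). If you ultimately need a plain Double, one way to finish the job (a sketch, assuming Spark 3.0+ so that vector_to_array is available) is:
import org.apache.spark.ml.functions.vector_to_array

posi_output
  .withColumn("positive_prediction_double",
    vector_to_array($"positive_prediction").getItem(0))
  .show()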
An alternate solution that evolved a couple of days ago: import the VectorDisassembler into your project (as long as it's not merged into Spark), then:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val dataset = spark.createDataFrame(
Seq((0, 1.2, 1.3), (1, 2.2, 2.3), (2, 3.2, 3.3))
).toDF("id", "val1", "val2")
val assembler = new VectorAssembler()
.setInputCols(Array("val1", "val2"))
.setOutputCol("vectorCol")
val output = assembler.transform(dataset)
output.show()
/*
+---+----+----+---------+
| id|val1|val2|vectorCol|
+---+----+----+---------+
| 0| 1.2| 1.3|[1.2,1.3]|
| 1| 2.2| 2.3|[2.2,2.3]|
| 2| 3.2| 3.3|[3.2,3.3]|
+---+----+----+---------+*/
val disassembler = new org.apache.spark.ml.feature.VectorDisassembler()
.setInputCol("vectorCol")
disassembler.transform(output).show()
/*
+---+----+----+---------+----+----+
| id|val1|val2|vectorCol|val1|val2|
+---+----+----+---------+----+----+
| 0| 1.2| 1.3|[1.2,1.3]| 1.2| 1.3|
| 1| 2.2| 2.3|[2.2,2.3]| 2.2| 2.3|
| 2| 3.2| 3.3|[3.2,3.3]| 3.2| 3.3|
+---+----+----+---------+----+----+*/
I use Spark 2.3.2 and built an xgboost4j binary-classification model; the result looks like this:
results_train.select("classIndex","probability","prediction").show(3,0)
+----------+----------------------------------------+----------+
|classIndex|probability |prediction|
+----------+----------------------------------------+----------+
|1 |[0.5998525619506836,0.400147408246994] |0.0 |
|1 |[0.5487841367721558,0.45121586322784424]|0.0 |
|0 |[0.5555324554443359,0.44446757435798645]|0.0 |
I define the following UDF to get the elements out of the vector column probability:
import org.apache.spark.sql.functions._
def getProb = udf((probV: org.apache.spark.ml.linalg.Vector, clsInx: Int) => probV.apply(clsInx) )
results_train.select("classIndex","probability","prediction").
withColumn("p_0",getProb($"probability",lit(0))).
withColumn("p_1",getProb($"probability", lit(1))).show(3,0)
+----------+----------------------------------------+----------+------------------+-------------------+
|classIndex|probability |prediction|p_0 |p_1 |
+----------+----------------------------------------+----------+------------------+-------------------+
|1 |[0.5998525619506836,0.400147408246994] |0.0 |0.5998525619506836|0.400147408246994 |
|1 |[0.5487841367721558,0.45121586322784424]|0.0 |0.5487841367721558|0.45121586322784424|
|0 |[0.5555324554443359,0.44446757435798645]|0.0 |0.5555324554443359|0.44446757435798645|
Hope this helps those who are handling Vector-type input.
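On Spark 3.0 or later the same extraction can also be done without a custom UDF, by reusing vector_to_array mentioned earlier in this thread (a sketch, assuming the same results_train dataframe):
import org.apache.spark.ml.functions.vector_to_array

results_train
  .withColumn("prob_arr", vector_to_array($"probability"))
  .withColumn("p_0", $"prob_arr".getItem(0))
  .withColumn("p_1", $"prob_arr".getItem(1))
  .select("classIndex", "probability", "prediction", "p_0", "p_1")
  .show(3, false)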
Since the above answers either need additional libraries or are not yet supported, I used a pandas dataframe to easily extract the vector values and then convert it back to a Spark dataframe.
# convert to pandas dataframe
pandasDf = dataframe.toPandas()
# add a new column
pandasDf['newColumnName'] = 0 # filled the new column with 0s
# now iterate through the rows and update the column
for index, row in pandasDf.iterrows():
    value = row['vectorCol'][0]  # get the 0th value of the vector
    pandasDf.loc[index, 'newColumnName'] = value  # put the value in the new column
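# finally, convert the pandas dataframe back to a Spark dataframe
# (assumes an active SparkSession named `spark`; this variable name is not from the original answer)
sparkDf = spark.createDataFrame(pandasDf)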

Convert vector UDT to double in ML [duplicate]


Column manipulations in Spark Scala

I am learning to work with Apache Spark (Scala) and am still figuring out how things work here.
I am trying to achieve a simple task:
Find the max of a column
Subtract each value of the column from this max and create a new column
The code I am using is:
import org.apache.spark.sql.functions._
val training = sqlContext.createDataFrame(Seq(
(10),
(13),
(14),
(21)
)).toDF("Values")
val training_max = training.withColumn("Val_Max",training.groupBy().agg(max("Values"))
val training_max_sub = training_max.withColumn("Subs",training_max.groupBy().agg(col("Val_Max")-col("Values) ))
However I am getting a lot of errors. I am more or less fluent in R and had I been doing the same task my code would have been:
library(dplyr)
new_data <- training %>%
mutate(Subs= max(Values) - Values)
Here is a solution using window functions. You'll need a HiveContext to use them:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val training = sc.parallelize(Seq(10,13,14,21)).toDF("values")
training.withColumn("subs",
max($"values").over(Window.partitionBy()) - $"values").show
This produces the expected output:
+------+----+
|values|subs|
+------+----+
| 10| 11|
| 13| 8|
| 14| 7|
| 21| 0|
+------+----+
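For what it's worth, on Spark 2.x and later window functions no longer require a HiveContext; a minimal sketch of the same computation against a SparkSession (the session setup and variable names here are illustrative, not from the original answer):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, max}

val spark = SparkSession.builder().appName("subs-example").getOrCreate()
import spark.implicits._

val training = Seq(10, 13, 14, 21).toDF("values")

// Option 1: global max via an empty window, as in the answer above
training
  .withColumn("subs", max($"values").over(Window.partitionBy()) - $"values")
  .show()

// Option 2: compute the max once as a scalar and subtract it with lit()
val maxVal = training.agg(max($"values")).first().getInt(0)
training.withColumn("subs", lit(maxVal) - $"values").show()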

Spark Scala: How to convert Dataframe[vector] to DataFrame[f1:Double, ..., fn: Double)]


Size of the sparse vector in the column of a data-frame in Apache scala spark

I am using a vector assembler to transform a dataframe.
var stringAssembler = new VectorAssembler().setInputCols(encodedstringColumns).setOutputCol("stringFeatures")
df = stringAssembler.transform(df)
var stringVectorSize = df.select("stringFeatures").head.size
var stringPca = new PCA().setInputCol("stringFeatures").setOutputCol("pcaStringFeatures").setK(stringVectorSize).fit(output)
Now stringVectorSize will tell PCA how many columns to keep while performing PCA.
I am trying to get the size of the output sparse vector from the VectorAssembler, but my code gives size = 1, which is wrong. What is the right way to get the size of a sparse vector that is part of a dataframe column?
To put it plainly:
+-------------+------------+-------------+------------+---+-----------+---------------+-----------------+--------------------+
|PetalLengthCm|PetalWidthCm|SepalLengthCm|SepalWidthCm| Id| Species|Species_Encoded| Id_Encoded| stringFeatures|
+-------------+------------+-------------+------------+---+-----------+---------------+-----------------+--------------------+
| 1.4| 0.2| 5.1| 3.5| 1|Iris-setosa| (2,[0],[1.0])| (149,[91],[1.0])|(151,[91,149],[1....|
| 1.4| 0.2| 4.9| 3.0| 2|Iris-setosa| (2,[0],[1.0])|(149,[119],[1.0])|(151,[119,149],[1...|
| 1.3| 0.2| 4.7| 3.2| 3|Iris-setosa| (2,[0],[1.0])|(149,[140],[1.0])|(151,[140,149],[1...|
For the above dataframe, I want to extract the size of the stringFeatures sparse vector (which is 151).
If you read the DataFrame documentation you will notice that the head method returns a Row. Therefore, rather than obtaining your SparseVector's size, you are obtaining the Row's size. To solve this you have to extract the element stored in the Row:
val row = df.select("stringFeatures").head
val vector = row(0).asInstanceOf[SparseVector]
val size = vector.size
For instance:
import sqlContext.implicits._
import org.apache.spark.mllib.linalg.SparseVector
val df = sc.parallelize(Array(10,2,3,4)).toDF("n")
val pepe = udf((i: Int) => new SparseVector(i, Array(i-1), Array(i)))
val x = df.select(pepe(df("n")).as("n"))
x.show()
+---------------+
| n|
+---------------+
|(10,[9],[10.0])|
| (2,[1],[2.0])|
| (3,[2],[3.0])|
| (4,[3],[4.0])|
+---------------+
val y = x.select("n").head
y(0).asInstanceOf[SparseVector].size
res12: Int = 10
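Applied back to the question's dataframe, a compact version of the same idea (a sketch, assuming the df and stringFeatures column from the question; use org.apache.spark.ml.linalg.Vector instead on Spark 2.x+):
import org.apache.spark.mllib.linalg.Vector

// head returns a Row; getAs[Vector](0) pulls the vector out of it,
// and .size is then the vector's dimension (151 in the example above)
val stringVectorSize = df.select("stringFeatures").head.getAs[Vector](0).size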