Constructing Spark ML features column with nested arrays

My dataframe, df, has columns containing 2-dimensional (x, y) data. Combining these columns with VectorAssembler into the 'features' column flattens all the pairs. Is there a way to keep these columns in their original form, i.e. as [[x1,y1],[x2,y2],[x3,y3]], instead of what I am getting: [x1,y1,x2,y2,x3,y3]?
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.{col, udf}
val df = Seq((Seq(1.0,2.0), Seq(3.0,4.0), Seq(5.0,6.0)),
  (Seq(7.0,8.0), Seq(9.0,10.0), Seq(11.0,12.0))).toDF("f1", "f2", "f3")
//Ref https://stackoverflow.com/a/41091839/4106464
val seqAsVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))
val df_final = df.select(seqAsVector(col("f1")).as("f1"), seqAsVector(col("f2")).as("f2"), seqAsVector(col("f3")).as("f3"))
val assembler = new VectorAssembler().setInputCols(Array("f1","f2","f3")).setOutputCol("features")
val df_out = assembler.transform(df_final)
df.show
df_out.show(false)
// df
//+----------+-----------+------------+
//| f1| f2| f3|
//+----------+-----------+------------+
//|[1.0, 2.0]| [3.0, 4.0]| [5.0, 6.0]|
//|[7.0, 8.0]|[9.0, 10.0]|[11.0, 12.0]|
//+----------+-----------+------------+
// df_out with VectorAssembler
//+---------+----------+-----------+----------------------------+
//|f1 |f2 |f3 |features |
//+---------+----------+-----------+----------------------------+
//|[1.0,2.0]|[3.0,4.0] |[5.0,6.0] |[1.0,2.0,3.0,4.0,5.0,6.0] |
//|[7.0,8.0]|[9.0,10.0]|[11.0,12.0]|[7.0,8.0,9.0,10.0,11.0,12.0]|
//+---------+----------+-----------+----------------------------+
//Desired features column:
//+---------+----------+-----------+----------------------------------+
//|f1 |f2 |f3 |features |
//+---------+----------+-----------+----------------------------------+
//|[1.0,2.0]|[3.0,4.0] |[5.0,6.0] |[[1.0,2.0],[3.0,4.0],[5.0,6.0]] |
//|[7.0,8.0]|[9.0,10.0]|[11.0,12.0]|[[7.0,8.0],[9.0,10.0],[11.0,12.0]]|
//+---------+----------+-----------+----------------------------------+
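An ML Vector is one-dimensional, so VectorAssembler always produces a flat result. A minimal sketch of one possible workaround, assuming a plain array<array<double>> column is acceptable, is to wrap the original array columns with the built-in array function. The local SparkSession below is only for illustration, and the resulting nested column is not a Vector, so it cannot be passed as the features column of spark.ml estimators that expect one.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder.master("local[*]").appName("nested-features").getOrCreate()
import spark.implicits._

val df = Seq(
  (Seq(1.0, 2.0), Seq(3.0, 4.0), Seq(5.0, 6.0)),
  (Seq(7.0, 8.0), Seq(9.0, 10.0), Seq(11.0, 12.0))
).toDF("f1", "f2", "f3")

// array(...) wraps the three array<double> columns into one array<array<double>> column,
// so the (x, y) pairs stay grouped instead of being flattened.
val nested = df.withColumn("features", array(col("f1"), col("f2"), col("f3")))

nested.show(false)
```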

Related

Scala: Find the maximum value across each row of a dataframe

For each row of a DataFrame, I would like to extract the maximum value and put it in a new column.
The example code below gives me a DataFrame ('dfmax') of each maximum value:
val donuts = Seq((2.0, 1.50, 3.5), (4.2, 22.3, 10.8), (33.6, 2.50, 7.3))
val df = sparkSession
  .createDataFrame(donuts)
  .toDF("col1", "col2", "col3")
df.show()
import sparkSession.implicits._
val dfmax = df.map(r => r.getValuesMap[Double](df.schema.fieldNames).map(r => r._2).max)
dfmax.show
This gives me df:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 2.0| 1.5| 3.5|
| 4.2|22.3|10.8|
|33.6| 2.5| 7.3|
+----+----+----+
and dfmax:
+-----+
|value|
+-----+
| 3.5|
| 22.3|
| 33.6|
+-----+
I would like to have these two frames combined in one table preferably using .withColumn or similar in a style like this (which I cannot get to work):
def maxValue(data: DataFrame): DataFrame = {
  val dfmax = df.map(r => r.getValuesMap[Double](df.schema.fieldNames).map(r => r._2).max)
  dfmax
}
val udfMaxValue = udf(maxValue _)
df.withColumn("max", udfMaxValue(df))
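The udf attempt above cannot work as written, because a UDF receives the column values of one row at a time, not a whole DataFrame. A minimal sketch of one alternative, assuming every column is numeric, uses the built-in greatest function inside withColumn. The local SparkSession and the name withMax are only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, greatest}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder.master("local[*]").appName("row-max").getOrCreate()
import spark.implicits._

val df = Seq((2.0, 1.50, 3.5), (4.2, 22.3, 10.8), (33.6, 2.50, 7.3))
  .toDF("col1", "col2", "col3")

// greatest(...) is evaluated per row; it needs at least two columns and skips nulls.
val withMax = df.withColumn("max", greatest(df.columns.map(col): _*))

withMax.show()
```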

How Does Everyone Deal with Probabilities from XGBoost Scored Data? Scala [duplicate]

I just used StandardScaler to normalize my features for an ML application. After selecting the scaled features, I want to convert them back to a dataframe of Doubles, though the length of my vectors is arbitrary. I know how to do it for a specific set of 3 features by using
myDF.map{case Row(v: Vector) => (v(0), v(1), v(2))}.toDF("f1", "f2", "f3")
but not for an arbitrary number of features. Is there an easy way to do this?
Example:
val testDF = sc.parallelize(List(Vectors.dense(5D, 6D, 7D), Vectors.dense(8D, 9D, 10D), Vectors.dense(11D, 12D, 13D))).map(Tuple1(_)).toDF("scaledFeatures")
val myColumnNames = List("f1", "f2", "f3")
// val finalDF = DataFrame[f1: Double, f2: Double, f3: Double]
EDIT
I found out how to unpack to column names when creating the dataframe, but still am having trouble converting a vector to a sequence needed to create the dataframe:
finalDF = testDF.map{case Row(v: Vector) => v.toArray.toSeq /* <= this errors */}.toDF(List("f1", "f2", "f3"): _*)

Spark >= 3.0.0
Since Spark 3.0 you can use vector_to_array (exprs is the list of column expressions built in the Spark < 3.0.0 section below):
import org.apache.spark.ml.functions.vector_to_array
testDF.select(vector_to_array($"scaledFeatures").alias("_tmp")).select(exprs:_*)
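For reference, a self-contained sketch of the Spark 3.0+ route, assuming the testDF and myColumnNames from the question and a local SparkSession created only for illustration. It builds the column expressions first and then applies them to the array produced by vector_to_array.

```scala
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder.master("local[*]").appName("vector-to-columns").getOrCreate()
import spark.implicits._

val testDF = Seq(
  Vectors.dense(5.0, 6.0, 7.0),
  Vectors.dense(8.0, 9.0, 10.0),
  Vectors.dense(11.0, 12.0, 13.0)
).map(Tuple1(_)).toDF("scaledFeatures")

val myColumnNames = List("f1", "f2", "f3")

// One expression per target column: pick element i of the temporary array column.
val exprs = myColumnNames.zipWithIndex.map { case (name, i) =>
  col("_tmp").getItem(i).alias(name)
}

val finalDF = testDF
  .select(vector_to_array(col("scaledFeatures")).alias("_tmp"))
  .select(exprs: _*)

finalDF.show()
```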
Spark < 3.0.0
One possible approach is something similar to this:
import org.apache.spark.sql.functions.udf
// In Spark 1.x you will have to replace the ML Vector with the MLLib one
// import org.apache.spark.mllib.linalg.Vector
// In 2.x the below is usually the right choice
import org.apache.spark.ml.linalg.Vector
// Get size of the vector
val n = testDF.first.getAs[Vector](0).size
// Simple helper to convert vector to array<double>
// asNondeterministic is available only in Spark 2.3 or later
// It can be removed, but at the cost of decreased performance
val vecToSeq = udf((v: Vector) => v.toArray).asNondeterministic
// Prepare a list of columns to create
val exprs = (0 until n).map(i => $"_tmp".getItem(i).alias(s"f$i"))
testDF.select(vecToSeq($"scaledFeatures").alias("_tmp")).select(exprs:_*)
If you know a list of columns upfront you can simplify this a little:
val cols: Seq[String] = ???
val exprs = cols.zipWithIndex.map{ case (c, i) => $"_tmp".getItem(i).alias(c) }
For a Python equivalent, see "How to split Vector into columns - using PySpark".

Please try VectorSlicer:
import org.apache.spark.ml.feature.{VectorAssembler, VectorSlicer}
import org.apache.spark.ml.linalg.Vectors
val dataset = spark.createDataFrame(
  Seq((1, 0.2, 0.8), (2, 0.1, 0.9), (3, 0.3, 0.7))
).toDF("id", "negative_logit", "positive_logit")
val assembler = new VectorAssembler()
  .setInputCols(Array("negative_logit", "positive_logit"))
  .setOutputCol("prediction")
val output = assembler.transform(dataset)
output.show()
/*
+---+--------------+--------------+----------+
| id|negative_logit|positive_logit|prediction|
+---+--------------+--------------+----------+
| 1| 0.2| 0.8| [0.2,0.8]|
| 2| 0.1| 0.9| [0.1,0.9]|
| 3| 0.3| 0.7| [0.3,0.7]|
+---+--------------+--------------+----------+
*/
val slicer = new VectorSlicer()
  .setInputCol("prediction")
  .setIndices(Array(1))
  .setOutputCol("positive_prediction")
val posi_output = slicer.transform(output)
posi_output.show()
/*
+---+--------------+--------------+----------+-------------------+
| id|negative_logit|positive_logit|prediction|positive_prediction|
+---+--------------+--------------+----------+-------------------+
| 1| 0.2| 0.8| [0.2,0.8]| [0.8]|
| 2| 0.1| 0.9| [0.1,0.9]| [0.9]|
| 3| 0.3| 0.7| [0.3,0.7]| [0.7]|
+---+--------------+--------------+----------+-------------------+
*/

An alternative solution that evolved a couple of days ago: import the VectorDisassembler into your project (as long as it has not been merged into Spark). Then:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val dataset = spark.createDataFrame(
  Seq((0, 1.2, 1.3), (1, 2.2, 2.3), (2, 3.2, 3.3))
).toDF("id", "val1", "val2")
val assembler = new VectorAssembler()
  .setInputCols(Array("val1", "val2"))
  .setOutputCol("vectorCol")
val output = assembler.transform(dataset)
output.show()
/*
+---+----+----+---------+
| id|val1|val2|vectorCol|
+---+----+----+---------+
| 0| 1.2| 1.3|[1.2,1.3]|
| 1| 2.2| 2.3|[2.2,2.3]|
| 2| 3.2| 3.3|[3.2,3.3]|
+---+----+----+---------+*/
val disassembler = new org.apache.spark.ml.feature.VectorDisassembler()
  .setInputCol("vectorCol")
disassembler.transform(output).show()
/*
+---+----+----+---------+----+----+
| id|val1|val2|vectorCol|val1|val2|
+---+----+----+---------+----+----+
| 0| 1.2| 1.3|[1.2,1.3]| 1.2| 1.3|
| 1| 2.2| 2.3|[2.2,2.3]| 2.2| 2.3|
| 2| 3.2| 3.3|[3.2,3.3]| 3.2| 3.3|
+---+----+----+---------+----+----+*/

I use Spark 2.3.2 and built an xgboost4j binary-classification model. The result looks like this:
results_train.select("classIndex","probability","prediction").show(3,0)
+----------+----------------------------------------+----------+
|classIndex|probability |prediction|
+----------+----------------------------------------+----------+
|1 |[0.5998525619506836,0.400147408246994] |0.0 |
|1 |[0.5487841367721558,0.45121586322784424]|0.0 |
|0 |[0.5555324554443359,0.44446757435798645]|0.0 |
I define the following udf to get the elements out of the vector column probability:
import org.apache.spark.sql.functions._
def getProb = udf((probV: org.apache.spark.ml.linalg.Vector, clsInx: Int) => probV.apply(clsInx) )
results_train.select("classIndex","probability","prediction").
  withColumn("p_0",getProb($"probability",lit(0))).
  withColumn("p_1",getProb($"probability", lit(1))).show(3,0)
+----------+----------------------------------------+----------+------------------+-------------------+
|classIndex|probability |prediction|p_0 |p_1 |
+----------+----------------------------------------+----------+------------------+-------------------+
|1 |[0.5998525619506836,0.400147408246994] |0.0 |0.5998525619506836|0.400147408246994 |
|1 |[0.5487841367721558,0.45121586322784424]|0.0 |0.5487841367721558|0.45121586322784424|
|0 |[0.5555324554443359,0.44446757435798645]|0.0 |0.5555324554443359|0.44446757435798645|
I hope this helps those who work with Vector-type input.

Since the above answers need additional libraries or are not yet supported, I used a pandas dataframe to easily extract the vector values and then converted it back to a Spark dataframe.
# convert to pandas dataframe
pandasDf = dataframe.toPandas()
# add a new column
pandasDf['newColumnName'] = 0.0 # fill the new column with 0s
# now iterate through the rows and update the column
for index, row in pandasDf.iterrows():
    value = row['vectorCol'][0] # get the 0th value of the vector
    pandasDf.loc[index, 'newColumnName'] = value # put the value in the new column

Convert vector UDT to double in ML [duplicate]

add dataframe to another one

I'd like to make a summary of a dataframe. I have three outputs, and I would like to combine the three dataframes into a single dataframe laid out exactly like the first one.
Here is what I did.
// Compute column summary statistics.
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val dataframe = spark.read.option("header", true).option("inferSchema", true).format("com.databricks.spark.csv").load("C:/Users/mhattabi/Desktop/donnee/cassandraTest_1.csv")
val colNames=dataframe.columns
val data=dataframe.describe().show()
+-------+-------------------+-------------------+-------------------+-------------------+-------------------+
|summary| Col0| Col1| Col2| Col3| Col4|
+-------+-------------------+-------------------+-------------------+-------------------+-------------------+
| count| 9999| 9999| 9999| 9999| 9999|
| mean| 0.4976937166129511| 0.5032998128645433| 0.5002933978916888| 0.5008783202471074|0.49977372871783293|
| stddev| 0.2893201326892155|0.28767789122296994|0.29041197844235034|0.28989958496291496| 0.2881033430504947|
| min|4.92436811557243E-6|3.20277176946531E-5|1.41602940923349E-5|6.53252937203857E-5| 5.4864212896146E-5|
| max| 0.999442967120299| 0.9999608020298| 0.999968873336897| 0.999836584087385| 0.999822016805327|
+-------+-------------------+-------------------+-------------------+-------------------+-------------------+
println("Skewness")
val Skewness = dataframe.columns.map(c => skewness(c).as(c))
val Skewness_ = dataframe.agg(Skewness.head, Skewness.tail: _*).show()
Skewness
+--------------------+--------------------+--------------------+--------------------+--------------------+
| Col0| Col1| Col2| Col3| Col4|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|0.015599787007160271|-0.00740111491496...|0.006096695102089171|0.003614431405637598|0.007869663345343194|
+--------------------+--------------------+--------------------+--------------------+--------------------+
println("Kurtosis")
val Kurtosis = dataframe.columns.map(c => kurtosis(c).as(c))
val Kurtosis_ = dataframe.agg(Kurtosis.head, Kurtosis.tail: _*).show//kurtosis
Kurtosis
+-------------------+-------------------+-------------------+-------------------+------------------+
| Col0| Col1| Col2| Col3| Col4|
+-------------------+-------------------+-------------------+-------------------+------------------+
|-1.2187774053075133|-1.1861812968784207|-1.2107252263053805|-1.2108988817869097|-1.199054929668751|
+-------------------+-------------------+-------------------+-------------------+------------------+
I would like to add the skewness and kurtosis dataframes to the first one and put their names in the first column.
Thanks in advance

You need to add a summary column to both the skewness and kurtosis tables using withColumn:
val Skewness_ = dataframe.agg(Skewness.head, Skewness.tail: _*).withColumn("summary", lit("Skewness"))
Do the same for kurtosis:
val Kurtosis_ = dataframe.agg(Kurtosis.head, Kurtosis.tail: _*).withColumn("summary", lit("Kurtosis"))
Use select on both dataframes so that their columns are in the same order as in the describe() output:
val orderColumn = Vector("summary", "col0", "col1", "col2", "col3", "col4")
val Skewness_ordered = Skewness_.select(orderColumn.map(col):_*)
val Kurtosis_ordered = Kurtosis_.select(orderColumn.map(col):_*)
Then union them with the describe() result:
val combined = dataframe.describe().union(Skewness_ordered).union(Kurtosis_ordered)
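Put together, the steps can be sketched end to end as follows. This is only an illustration under stated assumptions: the two-column in-memory dataframe stands in for the CSV from the question, statRow is a hypothetical helper, and the statistics are cast to string explicitly so that they match the string columns returned by describe() rather than relying on implicit type widening in union.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, kurtosis, lit, skewness}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder.master("local[*]").appName("summary-union").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the CSV data in the question.
val dataframe = Seq((0.1, 0.9), (0.4, 0.5), (0.8, 0.2), (0.3, 0.7)).toDF("Col0", "Col1")

// describe() returns the columns: summary, Col0, Col1 (all of string type).
val summary = dataframe.describe()

// Hypothetical helper: one row holding the statistic `f` for every column,
// labelled in a leading "summary" column and cast to string to match describe().
def statRow(name: String, f: String => Column): DataFrame = {
  val stats = dataframe.columns.map(c => f(c).cast("string").as(c))
  val ordered = lit(name).as("summary") +: dataframe.columns.map(col)
  dataframe.agg(stats.head, stats.tail: _*).select(ordered: _*)
}

val combined = summary
  .union(statRow("skewness", c => skewness(c)))
  .union(statRow("kurtosis", c => kurtosis(c)))

combined.show(false)
```

union matches columns by position, not by name, which is why the helper selects the columns in the same order as describe().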

More concisely, if Skewness and Kurtosis are the aggregated dataframes themselves (the results of dataframe.agg(...) without the trailing .show()), you can combine them with the describe() output into a new dataframe:
import org.apache.spark.sql.functions._
val result = dataframe.describe().union(Skewness.select(lit("Skewness"),Skewness.col("*")))
  .union(Kurtosis.select(lit("Kurtosis"),Kurtosis.col("*")))
result.show()

Spark Scala: How to convert Dataframe[vector] to DataFrame[f1:Double, ..., fn: Double)]
