Spark: Joining with array - scala

I need to join a dataframe with a string column to one with array of string so that if one of the values in the array is matched, the rows will join.
I tried this but I guess it's not support.
Any other way to do this?
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("test")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
import spark.implicits._
val left = spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF("col1")
val right = spark.sparkContext.parallelize(Seq((Array(1, 2), "Yes"),(Array(3),"No"))).toDF("col1", "col2")
left.join(right,"col1")
Throws:
org.apache.spark.sql.AnalysisException: cannot resolve '(col1
=col1)' due to data
type mismatch: differing types in '(col1 =
col1)' (int and array).;;

One option is to create an UDF for building your join condition:
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
val left = spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF("col1")
val right = spark.sparkContext.parallelize(Seq((Array(1, 2), "Yes"),(Array(3),"No"))).toDF("col1", "col2")
val checkValue = udf {
(array: WrappedArray[Int], value: Int) => array.contains(value)
}
val result = left.join(right, checkValue(right("col1"), left("col1")), "inner")
result.show
+----+------+----+
|col1| col1|col2|
+----+------+----+
| 1|[1, 2]| Yes|
| 2|[1, 2]| Yes|
| 3| [3]| No|
+----+------+----+

The most succinct way to do this is to use the array_contains spark sql expression as shown below, that said I've compared the performance of this with the performance of doing an explode and join as shown in a previous answer and the explode seems more performant.
import org.apache.spark.sql.functions.expr
import spark.implicits._
val left = Seq(1, 2, 3).toDF("col1")
val right = Seq((Array(1, 2), "Yes"),(Array(3),"No")).toDF("col1", "col2").withColumnRenamed("col1", "col1_array")
val joined = left.join(right, expr("array_contains(col1_array, col1)")).show
+----+----------+----+
|col1|col1_array|col2|
+----+----------+----+
| 1| [1, 2]| Yes|
| 2| [1, 2]| Yes|
| 3| [3]| No|
+----+----------+----+
Note you can't use the org.apache.spark.sql.functions.array_contains function directly as it requires the second argument to be a literal as opposed to a column expression.

You could use explode on you Array column before the join. Explode creates a new line for each element in the array :
right = right.withColumn("exploded_col",explode(right("col1")))
right.show()
+------+----+--------------+
| col1|col2|exploded_col_1|
+------+----+--------------+
|[1, 2]| Yes| 1|
|[1, 2]| Yes| 2|
| [3]| No| 3|
+------+----+--------------+
Then you can easily join with your first dataset.

Related

Comparing two dataframes in Spark

I'm comparing two dataframes in spark using except().
For exmaple: df.except(df2)
I will get all the records that are not available in df2 from df. However, I would like to list field details also which are not matching.
For example:
df:
------------------
id,name,age,city
101,kp,28,CHN
------------------
df2:
-----------------
id,name,age,city
101,kp,28,HYD
----------------
Expected output:
df3
--------------------------
id,name,age,city,diff
101,kp,28,CHN,City is not matching
--------------------------------
How can I acheive this?
Use intersect to get the values common to both DataFrames,then build your not matching logic
intersect -returns a new Dataset containing rows only in both this Dataset and another Dataset.
df.intersect(df2)
return a new RDD that contains the intersection of elements in the source dataset and the argument.
intersection(anotherrdd) returns the elements which are present in both the DF.
intersection(anotherrdd) remove all the duplicate including duplicated in single DF
Newer again attempt on the above but not possible elegantly, but with JOIN as opposed to except. Best I can do.
I believe it does what you need and takes into the fact there are things in one data set or not.
Run under Databricks.
case class Person(personid: Int, personname: String, cityid: Int)
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions._
val df1 = Seq(
Person(0, "AgataZ", 0),
Person(1, "Iweta", 0),
Person(2, "Patryk", 2),
Person(9999, "Maria", 2),
Person(5, "John", 2),
Person(6, "Patsy", 2),
Person(7, "Gloria", 222),
Person(3333, "Maksym", 0)).toDF
val df2 = Seq(
Person(0, "Agata", 0),
Person(1, "Iweta", 0),
Person(2, "Patryk", 2),
Person(5, "John", 2),
Person(6, "Patsy", 333),
Person(7, "Gloria", 2),
Person(4444, "Hans", 3)).toDF
val joined = df1.join(df2, df1("personid") === df2("personid"), "outer")
val newNames = Seq("personId1", "personName1", "personCity1", "personId2", "personName2", "personCity2")
val df_Renamed = joined.toDF(newNames: _*)
// Some deliberate variation shown in approach for learning
val df_temp = df_Renamed.filter($"personCity1" =!= $"personCity2" || $"personName1" =!= $"personName2" || $"personName1".isNull || $"personName2".isNull || $"personCity1".isNull || $"personCity2".isNull).select($"personId1", $"personName1".alias("Name"), $"personCity1", $"personId2", $"personName2".alias("Name2"), $"personCity2"). withColumn("PersonID", when($"personId1".isNotNull, $"personId1").otherwise($"personId2"))
val df_final = df_temp.withColumn("nameChange ?", when($"Name".isNull or $"Name2".isNull or $"Name" =!= $"Name2", "Yes").otherwise("No")).withColumn("cityChange ?", when($"personCity1".isNull or $"personCity2".isNull or $"personCity1" =!= $"personCity2", "Yes").otherwise("No")).drop("PersonId1").drop("PersonId2")
df_final.show()
gives:
+------+-----------+------+-----------+--------+------------+------------+
| Name|personCity1| Name2|personCity2|PersonID|nameChange ?|cityChange ?|
+------+-----------+------+-----------+--------+------------+------------+
| Patsy| 2| Patsy| 333| 6| No| Yes|
|Maksym| 0| null| null| 3333| Yes| Yes|
| null| null| Hans| 3| 4444| Yes| Yes|
|Gloria| 222|Gloria| 2| 7| No| Yes|
| Maria| 2| null| null| 9999| Yes| Yes|
|AgataZ| 0| Agata| 0| 0| Yes| No|
+------+-----------+------+-----------+--------+------------+------------+

How Does Everyone Deal with Probabilities from XGBoost Scored Data? Scala [duplicate]

I just used Standard Scaler to normalize my features for a ML application. After selecting the scaled features, I want to convert this back to a dataframe of Doubles, though the length of my vectors are arbitrary. I know how to do it for a specific 3 features by using
myDF.map{case Row(v: Vector) => (v(0), v(1), v(2))}.toDF("f1", "f2", "f3")
but not for an arbitrary amount of features. Is there an easy way to do this?
Example:
val testDF = sc.parallelize(List(Vectors.dense(5D, 6D, 7D), Vectors.dense(8D, 9D, 10D), Vectors.dense(11D, 12D, 13D))).map(Tuple1(_)).toDF("scaledFeatures")
val myColumnNames = List("f1", "f2", "f3")
// val finalDF = DataFrame[f1: Double, f2: Double, f3: Double]
EDIT
I found out how to unpack to column names when creating the dataframe, but still am having trouble converting a vector to a sequence needed to create the dataframe:
finalDF = testDF.map{case Row(v: Vector) => v.toArray.toSeq /* <= this errors */}.toDF(List("f1", "f2", "f3"): _*)
Spark >= 3.0.0
Since Spark 3.0 you can use vector_to_array
import org.apache.spark.ml.functions.vector_to_array
testDF.select(vector_to_array($"scaledFeatures").alias("_tmp")).select(exprs:_*)
Spark < 3.0.0
One possible approach is something similar to this
import org.apache.spark.sql.functions.udf
// In Spark 1.x you'll will have to replace ML Vector with MLLib one
// import org.apache.spark.mllib.linalg.Vector
// In 2.x the below is usually the right choice
import org.apache.spark.ml.linalg.Vector
// Get size of the vector
val n = testDF.first.getAs[Vector](0).size
// Simple helper to convert vector to array<double>
// asNondeterministic is available in Spark 2.3 or befor
// It can be removed, but at the cost of decreased performance
val vecToSeq = udf((v: Vector) => v.toArray).asNondeterministic
// Prepare a list of columns to create
val exprs = (0 until n).map(i => $"_tmp".getItem(i).alias(s"f$i"))
testDF.select(vecToSeq($"scaledFeatures").alias("_tmp")).select(exprs:_*)
If you know a list of columns upfront you can simplify this a little:
val cols: Seq[String] = ???
val exprs = cols.zipWithIndex.map{ case (c, i) => $"_tmp".getItem(i).alias(c) }
For Python equivalent see How to split Vector into columns - using PySpark.
Please try VectorSlicer :
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val dataset = spark.createDataFrame(
Seq((1, 0.2, 0.8), (2, 0.1, 0.9), (3, 0.3, 0.7))
).toDF("id", "negative_logit", "positive_logit")
val assembler = new VectorAssembler()
.setInputCols(Array("negative_logit", "positive_logit"))
.setOutputCol("prediction")
val output = assembler.transform(dataset)
output.show()
/*
+---+--------------+--------------+----------+
| id|negative_logit|positive_logit|prediction|
+---+--------------+--------------+----------+
| 1| 0.2| 0.8| [0.2,0.8]|
| 2| 0.1| 0.9| [0.1,0.9]|
| 3| 0.3| 0.7| [0.3,0.7]|
+---+--------------+--------------+----------+
*/
val slicer = new VectorSlicer()
.setInputCol("prediction")
.setIndices(Array(1))
.setOutputCol("positive_prediction")
val posi_output = slicer.transform(output)
posi_output.show()
/*
+---+--------------+--------------+----------+-------------------+
| id|negative_logit|positive_logit|prediction|positive_prediction|
+---+--------------+--------------+----------+-------------------+
| 1| 0.2| 0.8| [0.2,0.8]| [0.8]|
| 2| 0.1| 0.9| [0.1,0.9]| [0.9]|
| 3| 0.3| 0.7| [0.3,0.7]| [0.7]|
+---+--------------+--------------+----------+-------------------+
*/
Alternate solution that evovled couple of days ago: Import the VectorDisassembler into your project (as long as it's not merged into Spark), now:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val dataset = spark.createDataFrame(
Seq((0, 1.2, 1.3), (1, 2.2, 2.3), (2, 3.2, 3.3))
).toDF("id", "val1", "val2")
val assembler = new VectorAssembler()
.setInputCols(Array("val1", "val2"))
.setOutputCol("vectorCol")
val output = assembler.transform(dataset)
output.show()
/*
+---+----+----+---------+
| id|val1|val2|vectorCol|
+---+----+----+---------+
| 0| 1.2| 1.3|[1.2,1.3]|
| 1| 2.2| 2.3|[2.2,2.3]|
| 2| 3.2| 3.3|[3.2,3.3]|
+---+----+----+---------+*/
val disassembler = new org.apache.spark.ml.feature.VectorDisassembler()
.setInputCol("vectorCol")
disassembler.transform(output).show()
/*
+---+----+----+---------+----+----+
| id|val1|val2|vectorCol|val1|val2|
+---+----+----+---------+----+----+
| 0| 1.2| 1.3|[1.2,1.3]| 1.2| 1.3|
| 1| 2.2| 2.3|[2.2,2.3]| 2.2| 2.3|
| 2| 3.2| 3.3|[3.2,3.3]| 3.2| 3.3|
+---+----+----+---------+----+----+*/
I use Spark 2.3.2, and built a xgboost4j binary-classification model, the result looks like this:
results_train.select("classIndex","probability","prediction").show(3,0)
+----------+----------------------------------------+----------+
|classIndex|probability |prediction|
+----------+----------------------------------------+----------+
|1 |[0.5998525619506836,0.400147408246994] |0.0 |
|1 |[0.5487841367721558,0.45121586322784424]|0.0 |
|0 |[0.5555324554443359,0.44446757435798645]|0.0 |
I define the following udf to get the elements out of vector column probability
import org.apache.spark.sql.functions._
def getProb = udf((probV: org.apache.spark.ml.linalg.Vector, clsInx: Int) => probV.apply(clsInx) )
results_train.select("classIndex","probability","prediction").
withColumn("p_0",getProb($"probability",lit(0))).
withColumn("p_1",getProb($"probability", lit(1))).show(3,0)
+----------+----------------------------------------+----------+------------------+-------------------+
|classIndex|probability |prediction|p_0 |p_1 |
+----------+----------------------------------------+----------+------------------+-------------------+
|1 |[0.5998525619506836,0.400147408246994] |0.0 |0.5998525619506836|0.400147408246994 |
|1 |[0.5487841367721558,0.45121586322784424]|0.0 |0.5487841367721558|0.45121586322784424|
|0 |[0.5555324554443359,0.44446757435798645]|0.0 |0.5555324554443359|0.44446757435798645|
Hope this would help for those who handle with Vector type input.
Since the above answers need additional libraries or still not supported, I have used pandas dataframe to easity extract the vector values and then convert it back to spark dataframe.
# convert to pandas dataframe
pandasDf = dataframe.toPandas()
# add a new column
pandasDf['newColumnName'] = 0 # filled the new column with 0s
# now iterate through the rows and update the column
for index, row in pandasDf.iterrows():
value = row['vectorCol'][0] # get the 0th value of the vector
pandasDf.loc[index, 'newColumnName'] = value # put the value in the new column

Scala add new column to dataframe by expression

I am going to add new column to a dataframe with expression.
for example, I have a dataframe of
+-----+----------+----------+-----+
| C1 | C2 | C3 |C4 |
+-----+----------+----------+-----+
|steak|1 |1 | 150|
|steak|2 |2 | 180|
| fish|3 |3 | 100|
+-----+----------+----------+-----+
and I want to create a new column C5 with expression "C2/C3+C4", assuming there are several new columns need to add, and the expressions may be different and come from database.
Is there a good way to do this?
I know that if I have an expression like "2+3*4" I can use scala.tools.reflect.ToolBox to eval it.
And normally I am using df.withColumn to add new column.
Seems I need to create an UDF, but how can I pass the columns value as parameters to UDF? especially there maybe multiple expression need different columns calculate.
This can be done using expr to create a Column from an expression:
val df = Seq((1,2)).toDF("x","y")
val myExpression = "x+y"
import org.apache.spark.sql.functions.expr
df.withColumn("z",expr(myExpression)).show()
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 3|
+---+---+---+
Two approaches:
import spark.implicits._ //so that you could use .toDF
val df = Seq(
("steak", 1, 1, 150),
("steak", 2, 2, 180),
("fish", 3, 3, 100)
).toDF("C1", "C2", "C3", "C4")
import org.apache.spark.sql.functions._
// 1st approach using expr
df.withColumn("C5", expr("C2/(C3 + C4)")).show()
// 2nd approach using selectExpr
df.selectExpr("*", "(C2/(C3 + C4)) as C5").show()
+-----+---+---+---+--------------------+
| C1| C2| C3| C4| C5|
+-----+---+---+---+--------------------+
|steak| 1| 1|150|0.006622516556291391|
|steak| 2| 2|180| 0.01098901098901099|
| fish| 3| 3|100| 0.02912621359223301|
+-----+---+---+---+--------------------+
In Spark 2.x, you can create a new column C5 with expression "C2/C3+C4" using withColumn() and org.apache.spark.sql.functions._,
val currentDf = Seq(
("steak", 1, 1, 150),
("steak", 2, 2, 180),
("fish", 3, 3, 100)
).toDF("C1", "C2", "C3", "C4")
val requiredDf = currentDf
.withColumn("C5", (col("C2")/col("C3")+col("C4")))
Also, you can do the same using org.apache.spark.sql.Column as well.
(But the space complexity is bit higher in this approach than using org.apache.spark.sql.functions._ due to the Column object creation)
val requiredDf = currentDf
.withColumn("C5", (new Column("C2")/new Column("C3")+new Column("C4")))
This worked perfectly for me. I am using Spark 2.0.2.

Split 1 column into 3 columns in spark scala

I have a dataframe in Spark using scala that has a column that I need split.
scala> test.show
+-------------+
|columnToSplit|
+-------------+
| a.b.c|
| d.e.f|
+-------------+
I need this column split out to look like this:
+--------------+
|col1|col2|col3|
| a| b| c|
| d| e| f|
+--------------+
I'm using Spark 2.0.0
Thanks
Try:
import sparkObject.spark.implicits._
import org.apache.spark.sql.functions.split
df.withColumn("_tmp", split($"columnToSplit", "\\.")).select(
$"_tmp".getItem(0).as("col1"),
$"_tmp".getItem(1).as("col2"),
$"_tmp".getItem(2).as("col3")
)
The important point to note here is that the sparkObject is the SparkSession object you might have already initialized. So, the (1) import statement has to be compulsorily put inline within the code, not before the class definition.
To do this programmatically, you can create a sequence of expressions with (0 until 3).map(i => col("temp").getItem(i).as(s"col$i")) (assume you need 3 columns as result) and then apply it to select with : _* syntax:
df.withColumn("temp", split(col("columnToSplit"), "\\.")).select(
(0 until 3).map(i => col("temp").getItem(i).as(s"col$i")): _*
).show
+----+----+----+
|col0|col1|col2|
+----+----+----+
| a| b| c|
| d| e| f|
+----+----+----+
To keep all columns:
df.withColumn("temp", split(col("columnToSplit"), "\\.")).select(
col("*") +: (0 until 3).map(i => col("temp").getItem(i).as(s"col$i")): _*
).show
+-------------+---------+----+----+----+
|columnToSplit| temp|col0|col1|col2|
+-------------+---------+----+----+----+
| a.b.c|[a, b, c]| a| b| c|
| d.e.f|[d, e, f]| d| e| f|
+-------------+---------+----+----+----+
If you are using pyspark, use a list comprehension to replace the map in scala:
df = spark.createDataFrame([['a.b.c'], ['d.e.f']], ['columnToSplit'])
from pyspark.sql.functions import col, split
(df.withColumn('temp', split('columnToSplit', '\\.'))
.select(*(col('temp').getItem(i).alias(f'col{i}') for i in range(3))
).show()
+----+----+----+
|col0|col1|col2|
+----+----+----+
| a| b| c|
| d| e| f|
+----+----+----+
A solution which avoids the select part. This is helpful when you just want to append the new columns:
case class Message(others: String, text: String)
val r1 = Message("foo1", "a.b.c")
val r2 = Message("foo2", "d.e.f")
val records = Seq(r1, r2)
val df = spark.createDataFrame(records)
df.withColumn("col1", split(col("text"), "\\.").getItem(0))
.withColumn("col2", split(col("text"), "\\.").getItem(1))
.withColumn("col3", split(col("text"), "\\.").getItem(2))
.show(false)
+------+-----+----+----+----+
|others|text |col1|col2|col3|
+------+-----+----+----+----+
|foo1 |a.b.c|a |b |c |
|foo2 |d.e.f|d |e |f |
+------+-----+----+----+----+
Update: I highly recommend to use Psidom's implementation to avoid splitting three times.
This appends columns to the original DataFrame and doesn't use select, and only splits once using a temporary column:
import spark.implicits._
df.withColumn("_tmp", split($"columnToSplit", "\\."))
.withColumn("col1", $"_tmp".getItem(0))
.withColumn("col2", $"_tmp".getItem(1))
.withColumn("col3", $"_tmp".getItem(2))
.drop("_tmp")
This expands on Psidom's answer and shows how to do the split dynamically, without hardcoding the number of columns. This answer runs a query to calculate the number of columns.
val df = Seq(
"a.b.c",
"d.e.f"
).toDF("my_str")
.withColumn("letters", split(col("my_str"), "\\."))
val numCols = df
.withColumn("letters_size", size($"letters"))
.agg(max($"letters_size"))
.head()
.getInt(0)
df
.select(
(0 until numCols).map(i => $"letters".getItem(i).as(s"col$i")): _*
)
.show()
We can write using for with yield in Scala :-
If your number of columns exceeds just add it to desired column and play with it. :)
val aDF = Seq("Deepak.Singh.Delhi").toDF("name")
val desiredColumn = Seq("name","Lname","City")
val colsize = desiredColumn.size
val columList = for (i <- 0 until colsize) yield split(col("name"),".").getItem(i).alias(desiredColumn(i))
aDF.select(columList: _ *).show(false)
Output:-
+------+------+-----+--+
|name |Lname |city |
+-----+------+-----+---+
|Deepak|Singh |Delhi|
+---+------+-----+-----+
If you don't need name column then, drop the column and just use withColumn.
Example:
Without using the select statement.
Lets assume we have a dataframe having a set of columns and we want to split a column having column name as name
import spark.implicits._
val columns = Seq("name","age","address")
val data = Seq(("Amit.Mehta", 25, "1 Main st, Newark, NJ, 92537"),
("Rituraj.Mehta", 28,"3456 Walnut st, Newark, NJ, 94732"))
var dfFromData = spark.createDataFrame(data).toDF(columns:_*)
dfFromData.printSchema()
val newDF = dfFromData.map(f=>{
val nameSplit = f.getAs[String](0).split("\\.").map(_.trim)
(nameSplit(0),nameSplit(1),f.getAs[Int](1),f.getAs[String](2))
})
val finalDF = newDF.toDF("First Name","Last Name", "Age","Address")
finalDF.printSchema()
finalDF.show(false)
output:

Spark Scala: How to convert Dataframe[vector] to DataFrame[f1:Double, ..., fn: Double)]

I just used Standard Scaler to normalize my features for a ML application. After selecting the scaled features, I want to convert this back to a dataframe of Doubles, though the length of my vectors are arbitrary. I know how to do it for a specific 3 features by using
myDF.map{case Row(v: Vector) => (v(0), v(1), v(2))}.toDF("f1", "f2", "f3")
but not for an arbitrary amount of features. Is there an easy way to do this?
Example:
val testDF = sc.parallelize(List(Vectors.dense(5D, 6D, 7D), Vectors.dense(8D, 9D, 10D), Vectors.dense(11D, 12D, 13D))).map(Tuple1(_)).toDF("scaledFeatures")
val myColumnNames = List("f1", "f2", "f3")
// val finalDF = DataFrame[f1: Double, f2: Double, f3: Double]
EDIT
I found out how to unpack to column names when creating the dataframe, but still am having trouble converting a vector to a sequence needed to create the dataframe:
finalDF = testDF.map{case Row(v: Vector) => v.toArray.toSeq /* <= this errors */}.toDF(List("f1", "f2", "f3"): _*)
Spark >= 3.0.0
Since Spark 3.0 you can use vector_to_array
import org.apache.spark.ml.functions.vector_to_array
testDF.select(vector_to_array($"scaledFeatures").alias("_tmp")).select(exprs:_*)
Spark < 3.0.0
One possible approach is something similar to this
import org.apache.spark.sql.functions.udf
// In Spark 1.x you'll will have to replace ML Vector with MLLib one
// import org.apache.spark.mllib.linalg.Vector
// In 2.x the below is usually the right choice
import org.apache.spark.ml.linalg.Vector
// Get size of the vector
val n = testDF.first.getAs[Vector](0).size
// Simple helper to convert vector to array<double>
// asNondeterministic is available in Spark 2.3 or befor
// It can be removed, but at the cost of decreased performance
val vecToSeq = udf((v: Vector) => v.toArray).asNondeterministic
// Prepare a list of columns to create
val exprs = (0 until n).map(i => $"_tmp".getItem(i).alias(s"f$i"))
testDF.select(vecToSeq($"scaledFeatures").alias("_tmp")).select(exprs:_*)
If you know a list of columns upfront you can simplify this a little:
val cols: Seq[String] = ???
val exprs = cols.zipWithIndex.map{ case (c, i) => $"_tmp".getItem(i).alias(c) }
For Python equivalent see How to split Vector into columns - using PySpark.
Please try VectorSlicer :
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val dataset = spark.createDataFrame(
Seq((1, 0.2, 0.8), (2, 0.1, 0.9), (3, 0.3, 0.7))
).toDF("id", "negative_logit", "positive_logit")
val assembler = new VectorAssembler()
.setInputCols(Array("negative_logit", "positive_logit"))
.setOutputCol("prediction")
val output = assembler.transform(dataset)
output.show()
/*
+---+--------------+--------------+----------+
| id|negative_logit|positive_logit|prediction|
+---+--------------+--------------+----------+
| 1| 0.2| 0.8| [0.2,0.8]|
| 2| 0.1| 0.9| [0.1,0.9]|
| 3| 0.3| 0.7| [0.3,0.7]|
+---+--------------+--------------+----------+
*/
val slicer = new VectorSlicer()
.setInputCol("prediction")
.setIndices(Array(1))
.setOutputCol("positive_prediction")
val posi_output = slicer.transform(output)
posi_output.show()
/*
+---+--------------+--------------+----------+-------------------+
| id|negative_logit|positive_logit|prediction|positive_prediction|
+---+--------------+--------------+----------+-------------------+
| 1| 0.2| 0.8| [0.2,0.8]| [0.8]|
| 2| 0.1| 0.9| [0.1,0.9]| [0.9]|
| 3| 0.3| 0.7| [0.3,0.7]| [0.7]|
+---+--------------+--------------+----------+-------------------+
*/
Alternate solution that evovled couple of days ago: Import the VectorDisassembler into your project (as long as it's not merged into Spark), now:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val dataset = spark.createDataFrame(
Seq((0, 1.2, 1.3), (1, 2.2, 2.3), (2, 3.2, 3.3))
).toDF("id", "val1", "val2")
val assembler = new VectorAssembler()
.setInputCols(Array("val1", "val2"))
.setOutputCol("vectorCol")
val output = assembler.transform(dataset)
output.show()
/*
+---+----+----+---------+
| id|val1|val2|vectorCol|
+---+----+----+---------+
| 0| 1.2| 1.3|[1.2,1.3]|
| 1| 2.2| 2.3|[2.2,2.3]|
| 2| 3.2| 3.3|[3.2,3.3]|
+---+----+----+---------+*/
val disassembler = new org.apache.spark.ml.feature.VectorDisassembler()
.setInputCol("vectorCol")
disassembler.transform(output).show()
/*
+---+----+----+---------+----+----+
| id|val1|val2|vectorCol|val1|val2|
+---+----+----+---------+----+----+
| 0| 1.2| 1.3|[1.2,1.3]| 1.2| 1.3|
| 1| 2.2| 2.3|[2.2,2.3]| 2.2| 2.3|
| 2| 3.2| 3.3|[3.2,3.3]| 3.2| 3.3|
+---+----+----+---------+----+----+*/
I use Spark 2.3.2, and built a xgboost4j binary-classification model, the result looks like this:
results_train.select("classIndex","probability","prediction").show(3,0)
+----------+----------------------------------------+----------+
|classIndex|probability |prediction|
+----------+----------------------------------------+----------+
|1 |[0.5998525619506836,0.400147408246994] |0.0 |
|1 |[0.5487841367721558,0.45121586322784424]|0.0 |
|0 |[0.5555324554443359,0.44446757435798645]|0.0 |
I define the following udf to get the elements out of vector column probability
import org.apache.spark.sql.functions._
def getProb = udf((probV: org.apache.spark.ml.linalg.Vector, clsInx: Int) => probV.apply(clsInx) )
results_train.select("classIndex","probability","prediction").
withColumn("p_0",getProb($"probability",lit(0))).
withColumn("p_1",getProb($"probability", lit(1))).show(3,0)
+----------+----------------------------------------+----------+------------------+-------------------+
|classIndex|probability |prediction|p_0 |p_1 |
+----------+----------------------------------------+----------+------------------+-------------------+
|1 |[0.5998525619506836,0.400147408246994] |0.0 |0.5998525619506836|0.400147408246994 |
|1 |[0.5487841367721558,0.45121586322784424]|0.0 |0.5487841367721558|0.45121586322784424|
|0 |[0.5555324554443359,0.44446757435798645]|0.0 |0.5555324554443359|0.44446757435798645|
Hope this would help for those who handle with Vector type input.
Since the above answers need additional libraries or still not supported, I have used pandas dataframe to easity extract the vector values and then convert it back to spark dataframe.
# convert to pandas dataframe
pandasDf = dataframe.toPandas()
# add a new column
pandasDf['newColumnName'] = 0 # filled the new column with 0s
# now iterate through the rows and update the column
for index, row in pandasDf.iterrows():
value = row['vectorCol'][0] # get the 0th value of the vector
pandasDf.loc[index, 'newColumnName'] = value # put the value in the new column