Spark select item in array by max score - scala

Given the following DataFrame containing an id and a Seq of Stuff (with an id and score), how do I select the "best" Stuff in the array by score?
I'd like to avoid UDFs and, if possible, work with Spark DataFrame functions only.
case class Stuff(id: Int, score: Double)
val df = spark.createDataFrame(Seq(
(1, Seq(Stuff(11, 0.4), Stuff(12, 0.5))),
(2, Seq(Stuff(22, 0.9), Stuff(23, 0.8)))
)).toDF("id", "data")
df.show(false)
+---+----------------------+
|id |data |
+---+----------------------+
|1 |[[11, 0.4], [12, 0.5]]|
|2 |[[22, 0.9], [23, 0.8]]|
+---+----------------------+
df.printSchema
root
|-- id: integer (nullable = false)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: integer (nullable = false)
| | |-- score: double (nullable = false)
I tried going down the route of window functions but the code gets a bit too convoluted. Expected output:
+---+---------+
|id |topStuff |
+---+---------+
|1 |[12, 0.5]|
|2 |[22, 0.9]|
+---+---------+

You can use Spark 2.4 higher-order functions:
df
.selectExpr("id","(filter(data, x -> x.score == array_max(data.score)))[0] as topstuff")
.show()
gives
+---+---------+
| id| topstuff|
+---+---------+
| 1|[12, 0.5]|
| 2|[22, 0.9]|
+---+---------+
As an alternative, use window-functions (requires shuffling!):
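import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{explode, max}
df
  .select($"id", explode($"data").as("topstuff"))
  .withColumn("selector", max($"topstuff.score").over(Window.partitionBy($"id")))
  .where($"topstuff.score" === $"selector")
  .drop($"selector")
  .show()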
also gives:
+---+---------+
| id| topstuff|
+---+---------+
| 1|[12, 0.5]|
| 2|[22, 0.9]|
+---+---------+
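One difference between the two approaches is worth spelling out: the filter(...)[0] version keeps only the first element that reaches the maximum score, while the window version keeps every row whose score equals the maximum, so a tie produces more than one row per id. The following is a plain-Python model of that per-row logic, not Spark code; the row with id 3 is a hypothetical tie added for illustration.

```python
# Plain-Python model of the per-row logic; no Spark involved.
rows = {
    1: [(11, 0.4), (12, 0.5)],
    2: [(22, 0.9), (23, 0.8)],
    3: [(31, 0.7), (32, 0.7)],  # hypothetical extra row with a tie
}

def top_first(data):
    """Mirror filter(data, x -> x.score == array_max(data.score))[0]."""
    best = max(score for _, score in data)
    return [item for item in data if item[1] == best][0]

def top_all(data):
    """Mirror explode + window max + where: every tied element survives."""
    best = max(score for _, score in data)
    return [item for item in data if item[1] == best]

first_per_id = {key: top_first(value) for key, value in rows.items()}
all_per_id = {key: top_all(value) for key, value in rows.items()}
print(first_per_id)
print(all_per_id)
```

If ties matter for your data, decide up front whether you want one element per id or all of them, and pick the approach accordingly.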

Related

Pyspark RDD column value selection

I have an RDD like this:
|item_id| recommendations|
+-------+------------------+
| 1|[{810, 5.2324243},{134, 4.58323},{810, 4.89248}]
| 23|[{1643, 5.1180077}, {1463, 4.8429747}, {1368, 4.4758873}]
I want to extract only the first value in each {} from the "recommendations" column.
Expected result looks like this:
|item_id| recommendations|
+-------+------------------+
| 1|[{810, 134, 810}]
| 23|[{1643, 1463, 1368}]
What should I do? Thanks!
It isn't clear whether your data is an RDD or a DataFrame, so I cover both here. From your sample data, I assume recommendations is an array of structs. You can check the exact columns by running df.printSchema() (for a DataFrame) or rdd.first() (for an RDD). I created a dummy schema with two columns, a and b.
This is my "dummy" class
class X():
    def __init__(self, a, b):
        self.a = a
        self.b = b
This is my "dummy" data
# F and T are the usual aliases for the PySpark functions and types modules
from pyspark.sql import functions as F, types as T

schema = T.StructType([
    T.StructField('id', T.IntegerType()),
    T.StructField('rec', T.ArrayType(T.StructType([
        T.StructField('a', T.IntegerType()),
        T.StructField('b', T.FloatType()),
    ])))
])
df = spark.createDataFrame([
    (1, [X(810, 5.2324243), X(134, 4.58323), X(810, 4.89248)]),
    (23, [X(1643, 5.1180077), X(1463, 4.8429747), X(1368, 4.4758873)])
], schema)
If your data is a dataframe
df.show(10, False)
df.printSchema()
+---+---------------------------------------------------------+
|id |rec |
+---+---------------------------------------------------------+
|1 |[{810, 5.2324243}, {134, 4.58323}, {810, 4.89248}] |
|23 |[{1643, 5.1180077}, {1463, 4.8429747}, {1368, 4.4758873}]|
+---+---------------------------------------------------------+
root
|-- id: integer (nullable = true)
|-- rec: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = true)
| | |-- b: float (nullable = true)
(df
.select('id', F.explode('rec').alias('rec'))
.groupBy('id')
.agg(F.collect_list('rec.a').alias('rec'))
.show()
)
+---+------------------+
| id| rec|
+---+------------------+
| 1| [810, 134, 810]|
| 23|[1643, 1463, 1368]|
+---+------------------+
If your data is an rdd
dfrdd = df.rdd
dfrdd.first()
# Row(id=1, rec=[Row(a=810, b=5.232424259185791), Row(a=134, b=4.583230018615723), Row(a=810, b=4.89247989654541)])
(dfrdd
.map(lambda x: (x.id, [r.a for r in x.rec]))
.toDF()
.show()
)
+---+------------------+
| _1| _2|
+---+------------------+
| 1| [810, 134, 810]|
| 23|[1643, 1463, 1368]|
+---+------------------+

Add new column of Map Datatype to Spark Dataframe in scala

I'm able to create a new Dataframe with one column having Map datatype.
val inputDF2 = Seq(
(1, "Visa", 1, Map[String, Int]()),
(2, "MC", 2, Map[String, Int]())).toDF("id", "card_type", "number_of_cards", "card_type_details")
scala> inputDF2.show(false)
+---+---------+---------------+-----------------+
|id |card_type|number_of_cards|card_type_details|
+---+---------+---------------+-----------------+
|1 |Visa |1 |[] |
|2 |MC |2 |[] |
+---+---------+---------------+-----------------+
Now I want to create a new column of the same type as card_type_details. I'm trying to use the spark withColumn method to add this new column.
inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>").show(false)
+---------+---------+---------------+---------------------+-----+
|person_id|card_type|number_of_cards|card_type_details |tmp |
+---------+---------+---------------+---------------------+-----+
|1 |Visa |1 |[] |null |
|2 |MC |2 |[] |null |
+---------+---------+---------------+---------------------+-----+
When I check the schema, both columns have the same type, but their values differ.
scala> inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>").printSchema
root
|-- id: integer (nullable = false)
|-- card_type: string (nullable = true)
|-- number_of_cards: integer (nullable = false)
|-- card_type_details: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = false)
|-- tmp: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
I'm not sure whether I'm adding the new column correctly. The problem appears when I call the .isEmpty method on the tmp column: I get a NullPointerException.
scala> def checkValue = udf((card_type_details: Map[String, Int]) => {
| var output_map = Map[String, Int]()
| if (card_type_details.isEmpty) { output_map += 0.toString -> 1 }
| else {output_map = card_type_details }
| output_map
| })
checkValue: org.apache.spark.sql.expressions.UserDefinedFunction
scala> inputDF2.withColumn("value", checkValue(col("card_type_details"))).show(false)
+---+---------+---------------+-----------------+--------+
|id |card_type|number_of_cards|card_type_details|value |
+---+---------+---------------+-----------------+--------+
|1 |Visa |1 |[] |[0 -> 1]|
|2 |MC |2 |[] |[0 -> 1]|
+---+---------+---------------+-----------------+--------+
scala> inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>")
.withColumn("value", checkValue(col("tmp"))).show(false)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$checkValue$1: (map<string,int>) => map<string,int>)
Caused by: java.lang.NullPointerException
at $anonfun$checkValue$1.apply(<console>:28)
at $anonfun$checkValue$1.apply(<console>:26)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:108)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:107)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1063)
How can I add a new column that has the same values as the card_type_details column?
To add the tmp column with the same value as card_type_details, you just do:
inputDF2.withColumn("tmp", col("card_type_details"))
If you aim to add a column with an empty map and avoid the NullPointerException, the solution is:
inputDF2.withColumn("tmp", typedLit(Map.empty[String, Int]))
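The NullPointerException in the question comes from the difference between a null map and an empty map: lit(null) cast to a map type yields null in every row, and calling .isEmpty on null fails inside the UDF. If the column can legitimately contain nulls, the check has to treat null and empty alike. Here is a plain-Python model of that guard, with None standing in for a SQL NULL; check_value is a hypothetical stand-in for the question's checkValue, not a Spark UDF.

```python
def check_value(card_type_details):
    """Return a default map when the input is missing (None) or empty."""
    # None models a SQL NULL; an empty dict models an empty map.
    if not card_type_details:
        return {"0": 1}
    return card_type_details

results = [check_value(None), check_value({}), check_value({"Visa": 2})]
print(results)
```

In the Scala UDF the same idea would amount to testing for null before calling .isEmpty.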

How to return ListBuffer as a column from UDF using Spark Scala?

I am trying to return a ListBuffer as a column from a UDF, but I am getting an error.
I created the DataFrame by executing the code below:
val df = Seq((1,"dept3##rama##kumar","dept3##rama##kumar"), (2,"dept31##rama1##kumar1","dept33##rama3##kumar3")).toDF("id","str1","str2")
df.show()
It displays as follows:
+---+--------------------+--------------------+
| id| str1| str2|
+---+--------------------+--------------------+
| 1| dept3##rama##kumar| dept3##rama##kumar|
| 2|dept31##rama1##ku...|dept33##rama3##ku...|
+---+--------------------+--------------------+
My requirement is to split the above columns based on some inputs, so I tried a UDF like the one below:
def appendDelimiterError=udf((id: Int, str1: String, str2: String)=> {
var lit = new ListBuffer[Any]()
if(str1.contains("##"){val a=str1.split("##")}
else if(str1.contains("##"){val a=str1.split("##")}
else if(str1.contains("#&"){val a=str1.split("#&")}
if(str2.contains("##"){ val b=str2.split("##")}
else if(str2.contains("##"){ val b=str2.split("##") }
else if(str1.contains("##"){val b=str2.split("##")}
var tmp_row = List(a,"test1",b)
lit +=tmp_row
return lit
})
I try to call it by executing the code below:
val df1=df.appendDelimiterError("newcol",appendDelimiterError(df("id"),df("str1"),df("str2"))
I am getting the error "this was a bad call". I want to use a ListBuffer/List to store the values and return them to the caller.
My expected output is:
+---+--------------------+------------------------+----------------------------------------------------------------------+
| id| str1| str2 | newcol |
+---+--------------------+------------------------+----------------------------------------------------------------------+
| 1| dept3##rama##kumar| dept3##rama##kumar |ListBuffer(List("dept","rama","kumar"),List("dept3","rama","kumar")) |
| 2|dept31##rama1##kumar1|dept33##rama3##kumar3 | ListBuffer(List("dept31","rama1","kumar1"),List("dept33","rama3","kumar3")) |
+---+--------------------+------------------------+----------------------------------------------------------------------+
How to achieve this?
Here is an alternative without a UDF, using my own fictional data, which you can tailor to your case:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val df = Seq(
(1, "111##cat##666", "222##fritz##777"),
(2, "AAA##cat##555", "BBB##felix##888"),
(3, "HHH##mouse##yyy", "123##mickey##ZZZ")
).toDF("c0", "c1", "c2")
val df2 = df.withColumn( "c_split", split(col("c1"), ("(##)|(##)|(##)|(##)") ))
.union(df.withColumn("c_split", split(col("c2"), ("(##)|(##)|(##)|(##)") )) )
df2.show(false)
df2.printSchema()
val df3 = df2.groupBy(col("c0")).agg(collect_list(col("c_split")).as("List_of_Data") )
df3.show(false)
df3.printSchema()
This gives the answer as a plain array column rather than a ListBuffer (is a ListBuffer really necessary?), as follows:
+---+---------------+----------------+------------------+
|c0 |c1 |c2 |c_split |
+---+---------------+----------------+------------------+
|1 |111##cat##666 |222##fritz##777 |[111, cat, 666] |
|2 |AAA##cat##555 |BBB##felix##888 |[AAA, cat, 555] |
|3 |HHH##mouse##yyy|123##mickey##ZZZ|[HHH, mouse, yyy] |
|1 |111##cat##666 |222##fritz##777 |[222, fritz, 777] |
|2 |AAA##cat##555 |BBB##felix##888 |[BBB, felix, 888] |
|3 |HHH##mouse##yyy|123##mickey##ZZZ|[123, mickey, ZZZ]|
+---+---------------+----------------+------------------+
root
|-- c0: integer (nullable = false)
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
|-- c_split: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---------------------------------------+
|c0 |List_of_Data |
+---+---------------------------------------+
|1 |[[111, cat, 666], [222, fritz, 777]] |
|3 |[[HHH, mouse, yyy], [123, mickey, ZZZ]]|
|2 |[[AAA, cat, 555], [BBB, felix, 888]] |
+---+---------------------------------------+
root
|-- c0: integer (nullable = false)
|-- List_of_Data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
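The split pattern above is an alternation of delimiters, so one regular expression can cover every separator the data may contain. As a minimal sketch of that idea outside Spark, the following plain-Python snippet splits on either "##" or "#&" (both appear in the question's UDF; treat the exact delimiter set as an assumption) and then groups the two split columns per key, which mirrors the union plus collect_list step. The sample row is hypothetical.

```python
import re

# Assumed delimiter set: "##" and "#&", as seen in the question's UDF.
# No capturing groups, so the separators are not kept in the result.
DELIMS = re.compile(r"##|#&")

def split_fields(value):
    """Split a delimited string into its fields."""
    return DELIMS.split(value)

# Hypothetical sample row: (c0, c1, c2)
rows = [(1, "111##cat##666", "222#&fritz#&777")]

# One list of fields per source column, grouped under the row key.
grouped = {c0: [split_fields(c1), split_fields(c2)] for c0, c1, c2 in rows}
print(grouped)
```

The same alternation string can be passed to Spark's split function, since it also takes a regular expression as its pattern.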

Sample values from a list column in a Spark DataFrame

I have a Spark Scala DataFrame, shown as df1 below. I would like to sample with replacement from the scores column (a List), based on the counts in another column of df1.
val df1 = sc.parallelize(Seq(("a1",2,List(20,10)),("a2",1,List(30,10)),
("a3",3,List(10)),("a4",2,List(10,20,40)))).toDF("colA","counts","scores")
df1.show()
+----+------+------------+
|colA|counts| scores|
+----+------+------------+
| a1| 2| [20, 10]|
| a2| 1| [30, 10]|
| a3| 3| [10]|
| a4| 2|[10, 20, 40]|
+----+------+------------+
The expected output is shown in df2: from row 1, sample 2 values from the list [20, 10]; from row 2, sample 1 value from the list [30, 10]; from row 3, sample 3 values from the list [10] with repetition; and so on.
df2.show() //expected output
+----+------+------------+-------------+
|colA|counts| scores|sampledScores|
+----+------+------------+-------------+
| a1| 2| [20, 10]| [20, 10]|
| a2| 1| [30, 10]| [30]|
| a3| 3| [10]| [10, 10, 10]|
| a4| 2|[10, 20, 40]| [10, 40]|
+----+------+------------+-------------+
I wrote a UDF 'takeSample' and applied it to df1, but it did not work as intended.
val takeSample = udf((a:Array[Int], count1:Int) => {Array.fill(count1)(
a(new Random(System.currentTimeMillis).nextInt(a.size)))}
)
val df2 = df1.withColumn("SampledScores", takeSample(df1("Scores"),df1("counts")))
The schema prints fine, but I get the following run-time error when executing df2.show():
df2.printSchema()
root
|-- colA: string (nullable = true)
|-- counts: integer (nullable = true)
|-- scores: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- SampledScores: array (nullable = true)
| |-- element: integer (containsNull = false)
df2.show()
org.apache.spark.SparkException: Failed to execute user defined
function($anonfun$1: (array<int>, int) => array<int>)
Caused by: java.lang.ClassCastException:
scala.collection.mutable.WrappedArray$ofRef cannot be cast to [I
at $anonfun$1.apply(<console>:47)
Any solution is greatly appreciated.
Changing the data type from Array[Int] to Seq[Int] in the UDF will resolve the issue:
val takeSample = udf((a:Seq[Int], count1:Int) => {Array.fill(count1)(
a(new Random(System.currentTimeMillis).nextInt(a.size)))}
)
val df2 = df1.withColumn("SampledScores", takeSample(df1("Scores"),df1("counts")))
It will give us the expected output:
df2.printSchema()
root
|-- colA: string (nullable = true)
|-- counts: integer (nullable = true)
|-- scores: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- SampledScores: array (nullable = true)
| |-- element: integer (containsNull = false)
df2.show
+----+------+------------+-------------+
|colA|counts| scores|SampledScores|
+----+------+------------+-------------+
| a1| 2| [20, 10]| [20, 20]|
| a2| 1| [30, 10]| [30]|
| a3| 3| [10]| [10, 10, 10]|
| a4| 2|[10, 20, 40]| [20, 20]|
+----+------+------------+-------------+
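Note that the sampled arrays above contain the same value repeated ([20, 20] twice). That is consistent with how the UDF builds its generator: Array.fill evaluates its argument once per element, so a new Random seeded with System.currentTimeMillis is created for every draw, and draws made within the same millisecond start from the same seed. The plain-Python sketch below models this with a fixed seed standing in for the clock value (an assumption for illustration) and contrasts it with a single generator shared across draws.

```python
import random

scores = [10, 20, 40]
count = 5
seed = 12345  # stands in for System.currentTimeMillis within one millisecond

# A fresh generator with the same seed for every draw repeats one value.
reseeded = [scores[random.Random(seed).randrange(len(scores))] for _ in range(count)]

# One generator reused across draws advances its state between draws.
rng = random.Random(seed)
shared = [scores[rng.randrange(len(scores))] for _ in range(count)]

print(reseeded)
print(shared)
```

In the Scala UDF the equivalent change would be to create the Random once, outside Array.fill, and reuse it for every element.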

Sort by date an Array of a Spark DataFrame Column

I have a DataFrame formatted as below:
+---+------------------------------------------------------+
|Id |DateInfos |
+---+------------------------------------------------------+
|B |[[3, 19/06/2012-02.42.01], [4, 17/06/2012-18.22.21]] |
|A |[[1, 15/06/2012-18.22.16], [2, 15/06/2012-09.22.35]] |
|C |[[5, 14/06/2012-05.20.01]] |
+---+------------------------------------------------------+
I would like to sort the elements of each DateInfos array by date, using the timestamp in the second field of each element:
+---+------------------------------------------------------+
|Id |DateInfos |
+---+------------------------------------------------------+
|B |[[4, 17/06/2012-18.22.21], [3, 19/06/2012-02.42.01]] |
|A |[[2, 15/06/2012-09.22.35], [1, 15/06/2012-18.22.16]] |
|C |[[5, 14/06/2012-05.20.01]] |
+---+------------------------------------------------------+
The schema of my DataFrame is printed below:
root
|-- C1: string (nullable = true)
|-- C2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: string (nullable = false)
I assume I have to create a UDF that uses a function with the following signature:
def sort_by_date(mouvements : Array[Any]) : Array[Any]
Do you have any idea?
That's indeed a bit tricky: although the UDF's input and output types seem identical, we can't really define it that way. The input is actually a mutable.WrappedArray[Row], and the output can't use Row, or else Spark will fail to decode it into a Row.
So we define a UDF that takes a mutable.WrappedArray[Row] and returns an Array[(Int, String)]:
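import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
// Convert each Row to a tuple, then sort by the date string in the second field
val sortDates = udf { arr: mutable.WrappedArray[Row] =>
  arr.map { case Row(i: Int, s: String) => (i, s) }.sortBy(_._2)
}
val result = input.select($"Id", sortDates($"DateInfos") as "DateInfos")
result.show(truncate = false)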
// +---+--------------------------------------------------+
// |Id |DateInfos |
// +---+--------------------------------------------------+
// |B |[[4,17/06/2012-18.22.21], [3,19/06/2012-02.42.01]]|
// |A |[[2,15/06/2012-09.22.35], [1,15/06/2012-18.22.16]]|
// |C |[[5,14/06/2012-05.20.01]] |
// +---+--------------------------------------------------+
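One caveat: sortBy(_._2) compares the date strings lexicographically. With a day-first layout such as 19/06/2012-02.42.01 this matches chronological order only while all dates fall in the same month and year, as they do in the sample. The plain-Python sketch below shows the difference on hypothetical values that span two months; the format string dd/MM/yyyy-HH.mm.ss is inferred from the sample data.

```python
from datetime import datetime

# Format inferred from the sample values, e.g. "19/06/2012-02.42.01"
FMT = "%d/%m/%Y-%H.%M.%S"

# Hypothetical values spanning two months
infos = [
    (1, "02/07/2012-08.00.00"),
    (2, "15/06/2012-09.22.35"),
    (3, "19/06/2012-02.42.01"),
]

# Plain string comparison puts 2 July before 15 June.
by_string = sorted(infos, key=lambda pair: pair[1])

# Parsing first gives true chronological order.
by_time = sorted(infos, key=lambda pair: datetime.strptime(pair[1], FMT))

print([i for i, _ in by_string])
print([i for i, _ in by_time])
```

In the Scala UDF the same effect could be had by parsing the string inside sortBy before comparing.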