Get minimum value from an Array in a Spark DataFrame column - scala

I have a DataFrame with Arrays.
val DF = Seq(
("123", "|1|2","3|3|4" ),
("124", "|3|2","|3|4" )
).toDF("id", "complete1", "complete2")
.select($"id", split($"complete1", "\\|").as("complete1"), split($"complete2", "\\|").as("complete2"))
+---+---------+---------+
| id|complete1|complete2|
+---+---------+---------+
|123| [, 1, 2]|[3, 3, 4]|
|124| [, 3, 2]| [, 3, 4]|
+---+---------+---------+
How do I extract the minimum of each array?
+---+---------+---------+
| id|complete1|complete2|
+---+---------+---------+
|123|        1|        3|
|124|        2|        3|
+---+---------+---------+
I have tried defining a UDF to do this but I am getting an error.
def minArray(a:Array[String]) :String = a.filter(_.nonEmpty).min.mkString
val minArrayUDF = udf(minArray _)
def getMinArray(df: DataFrame, i: Int): DataFrame = df.withColumn("complete" + i, minArrayUDF(df("complete" + i)))
val minDf = (1 to 2).foldLeft(DF){ case (df, i) => getMinArray(df, i)}
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;

Since Spark 2.4, you can use array_min to find the minimum value in an array. To use this function you will first have to cast your arrays of strings to arrays of integers. Casting will also take care of the empty strings by converting them into null values.
DF.select($"id",
  array_min(expr("cast(complete1 as array<int>)")).as("complete1"),
  array_min(expr("cast(complete2 as array<int>)")).as("complete2"))

You can define your udf function as below
def minUdf = udf((arr: Seq[String])=> arr.filterNot(_ == "").map(_.toInt).min)
and call it as
DF.select(col("id"), minUdf(col("complete1")).as("complete1"), minUdf(col("complete2")).as("complete2")).show(false)
which should give you
+---+---------+---------+
|id |complete1|complete2|
+---+---------+---------+
|123|1 |3 |
|124|2 |3 |
+---+---------+---------+
Updated
If the array passed to the udf function is empty, or is an array of empty strings, then you will encounter
java.lang.UnsupportedOperationException: empty.min
You should handle that with an if-else condition in the udf function, as follows:
def minUdf = udf((arr: Seq[String]) => {
  val filtered = arr.filterNot(_ == "")
  if (filtered.isEmpty) 0
  else filtered.map(_.toInt).min
})
I hope the answer is helpful

Here is how you can do it without using udf
First explode the arrays you got with split(), then group by the same id and find the min.
import org.apache.spark.sql.types.IntegerType

val DF = Seq(
  ("123", "|1|2", "3|3|4"),
  ("124", "|3|2", "|3|4")
).toDF("id", "complete1", "complete2")
  .select($"id", split($"complete1", "\\|").as("complete1"), split($"complete2", "\\|").as("complete2"))
  .withColumn("complete1", explode($"complete1"))
  .withColumn("complete2", explode($"complete2"))
  .groupBy($"id").agg(min($"complete1".cast(IntegerType)).as("complete1"), min($"complete2".cast(IntegerType)).as("complete2"))
Output:
+---+---------+---------+
|id |complete1|complete2|
+---+---------+---------+
|124|2 |3 |
|123|1 |3 |
+---+---------+---------+

You don't need a UDF for this; you can use sort_array:
val DF = Seq(
("123", "|1|2","3|3|4" ),
("124", "|3|2","|3|4" )
).toDF("id", "complete1", "complete2")
.select(
$"id",
split(regexp_replace($"complete1","^\\|",""), "\\|").as("complete1"),
split(regexp_replace($"complete2","^\\|",""), "\\|").as("complete2")
)
// now select minimum
DF.select(
  $"id",
  sort_array($"complete1")(0).as("complete1"),
  sort_array($"complete2")(0).as("complete2")
).show()
+---+---------+---------+
| id|complete1|complete2|
+---+---------+---------+
|123| 1| 3|
|124| 2| 3|
+---+---------+---------+
Note that I removed the leading | before splitting, to avoid empty strings in the array.
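One caveat, not covered above: sort_array on an array of strings sorts lexicographically, so a value like "10" would sort before "2". If multi-digit values are possible, casting to an integer array before sorting keeps the ordering numeric (a small sketch along the same lines, not part of the original answer):
DF.select(
  $"id",
  sort_array($"complete1".cast("array<int>"))(0).as("complete1"),
  sort_array($"complete2".cast("array<int>"))(0).as("complete2")
).show()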

Related

Maximum of some specific columns in a spark scala dataframe

I have a dataframe like this.
+---+---+---+---+
| M| c2| c3| d1|
+---+---+---+---+
| 1|2_1|4_3|1_2|
| 2|3_4|4_5|1_2|
+---+---+---+---+
I have to transform this df so it looks like below. Here, c_Max = max(c2, c3) after splitting on _, i.e. all the c columns (c2 and c3) have to be split on _ and then the max taken.
In the actual scenario I have 50 columns, i.e. c2, c3, ..., c50, and need to take the max across them.
+---+---+---+---+------+
| M| c2| c3| d1|c_Max |
+---+---+---+---+------+
| 1|2_1|4_3|1_2| 4 |
| 2|3_4|4_5|1_2| 5 |
+---+---+---+---+------+
Here is one way using expr and built-in array functions for Spark >= 2.4.0:
import org.apache.spark.sql.functions.{expr, array_max, array}

val df = Seq(
  (1, "2_1", "3_4", "1_2"),
  (2, "3_4", "4_5", "1_2")
).toDF("M", "c2", "c3", "d1")

// get max c for each c column
val c_cols = df.columns.filter(_.startsWith("c")).map{ c =>
  expr(s"array_max(cast(split(${c}, '_') as array<int>))")
}

df.withColumn("max_c", array_max(array(c_cols:_*))).show
Output:
+---+---+---+---+-----+
| M| c2| c3| d1|max_c|
+---+---+---+---+-----+
| 1|2_1|3_4|1_2| 4|
| 2|3_4|4_5|1_2| 5|
+---+---+---+---+-----+
For older versions, use the following code:
val c_cols = df.columns.filter(_.startsWith("c")).map{ c =>
  val c_ar = split(col(c), "_").cast("array<int>")
  when(c_ar.getItem(0) > c_ar.getItem(1), c_ar.getItem(0)).otherwise(c_ar.getItem(1))
}
df.withColumn("max_c", greatest(c_cols:_*)).show
Use the greatest function:
val df = Seq(
  (1, "2_1", "3_4", "1_2"),
  (2, "3_4", "4_5", "1_2")
).toDF("M", "c2", "c3", "d1")
// get all `c` columns, split each by `_` and take both resulting values as int columns
val c_cols = df.columns.filter(_.startsWith("c"))
  .flatMap{ c =>
    Seq(split(col(c), "_").getItem(0).cast("int"),
        split(col(c), "_").getItem(1).cast("int"))
  }
// apply greatest func
val c_max = greatest(c_cols: _*)
// add new column
df.withColumn("c_Max", c_max).show()
Gives:
+---+---+---+---+-----+
| M| c2| c3| d1|c_Max|
+---+---+---+---+-----+
| 1|2_1|3_4|1_2| 4|
| 2|3_4|4_5|1_2| 5|
+---+---+---+---+-----+
In Spark >= 2.4.0, you can use the array_max function and get code that works even with columns containing more than 2 values. The idea is to start by concatenating all the columns into one concat column. For that, I use concat_ws on an array of all the columns I want to concatenate, which I obtain with array(cols.map(col) :_*). Then I split the resulting string to get one big array of strings containing all the values of all the columns, cast it to an array of ints, and call array_max on it.
import org.apache.spark.sql.types.{ArrayType, IntegerType}

val cols = (2 to 50).map("c"+_)
val result = df
  .withColumn("concat", concat_ws("_", array(cols.map(col) :_*)))
  .withColumn("array_of_ints", split('concat, "_").cast(ArrayType(IntegerType)))
  .withColumn("c_max", array_max('array_of_ints))
  .drop("concat", "array_of_ints")
In Spark < 2.4, you can define array_max yourself like this:
val array_max = udf((s : Seq[Int]) => s.max)
The previous code does not need to be modified. Note, however, that UDFs can be slower than the built-in Spark SQL functions.

Get the number of null per row in PySpark dataframe

This is probably a duplicate, but somehow I have been searching for a long time already:
I want to get the number of nulls per Row in a Spark dataframe. I.e.
col1 col2 col3
null 1 a
1 2 b
2 3 null
Should in the end be:
col1 col2 col3 number_of_null
null 1 a 1
1 2 b 0
2 3 null 1
In a general fashion, I want to get the number of times a certain string or number appears in a spark dataframe row.
I.e.
col1 col2 col3 number_of_ABC
ABC 1 a 1
1 2 b 0
2 ABC ABC 2
I am using Pyspark 2.3.0 and prefer a solution that does not involve SQL syntax. For some reason, I seem not to be able to google this. :/
EDIT: Assume that I have so many columns that I can't list them all.
EDIT2: I explicitly don't want to have a pandas solution.
EDIT3: The solution explained with sums or means does not work, as it throws errors:
(data type mismatch: differing types in '((`log_time` IS NULL) + 0)' (boolean and int))
...
isnull(log_time#10) + 0) + isnull(log#11))
In Scala:
val df = List(
("ABC", "1", "a"),
("1", "2", "b"),
("2", "ABC", "ABC")
).toDF("col1", "col2", "col3")
val expected = "ABC"
val complexColumn: Column = df.schema.fieldNames.map(c => when(col(c) === lit(expected), 1).otherwise(0)).reduce((a, b) => a + b)
df.withColumn("countABC", complexColumn).show(false)
Output:
+----+----+----+--------+
|col1|col2|col3|countABC|
+----+----+----+--------+
|ABC |1 |a |1 |
|1 |2 |b |0 |
|2 |ABC |ABC |2 |
+----+----+----+--------+
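The null count asked for in the question can be built with the same pattern; the boolean isNull check just needs an explicit cast to an integer before summing, which is also what the type-mismatch error in EDIT3 is about. A sketch, not part of the original answer; apply it to a dataframe that actually contains nulls:
// cast each per-column null check to 0/1, then add the columns up
val nullCountColumn = df.schema.fieldNames
  .map(c => col(c).isNull.cast("int"))
  .reduce(_ + _)
df.withColumn("number_of_null", nullCountColumn).show(false)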
As stated in pasha701's answer, I resort to map and reduce. Note that I am working on Spark 1.6.x and Python 2.7
Taking your DataFrame as df (and as is)
from pyspark.sql import functions as func

dfvals = [
    (None, "1", "a"),
    ("1", "2", "b"),
    ("2", None, None)
]
df = sqlc.createDataFrame(dfvals, ['col1', 'col2', 'col3'])
new_df = df.withColumn('null_cnt', reduce(lambda x, y: x + y,
                       map(lambda x: func.when(func.isnull(func.col(x)) == 'true', 1).otherwise(0),
                           df.schema.names)))
Check if the value is Null and assign 1 or 0. Add the result to get the count.
new_df.show()
+----+----+----+--------+
|col1|col2|col3|null_cnt|
+----+----+----+--------+
|null| 1| a| 1|
| 1| 2| b| 0|
| 2|null|null| 2|
+----+----+----+--------+

use length function in substring in spark

I am trying to use the length function inside a substring function in a DataFrame
but it gives an error.
val substrDF = testDF.withColumn("newcol", substring($"col", 1, length($"col")-1))
Below is the error:
error: type mismatch;
found : org.apache.spark.sql.Column
required: Int
I am using 2.1.
Function "expr" can be used:
val data = List("first", "second", "third")
val df = sparkContext.parallelize(data).toDF("value")
val result = df.withColumn("cutted", expr("substring(value, 1, length(value)-1)"))
result.show(false)
Output:
+------+------+
|value |cutted|
+------+------+
|first |firs |
|second|secon |
|third |thir |
+------+------+
You could also use $"COLUMN".substr
val substrDF = testDF.withColumn("newcol", $"col".substr(lit(1), length($"col")-1))
Full example and output:
val testDF = sc.parallelize(List("first", "second", "third")).toDF("col")
val result = testDF.withColumn("newcol", $"col".substr(org.apache.spark.sql.functions.lit(1), length($"col")-1))
result.show(false)
+------+------+
|col |newcol|
+------+------+
|first |firs |
|second|secon |
|third |thir |
+------+------+
You get that error because the signature of substring is
def substring(str: Column, pos: Int, len: Int): Column
The len argument that you are passing is a Column, and should be an Int.
You probably want to implement a simple UDF to solve that problem, for example one that drops the last character:
val dropLast = udf((str: String) => str.substring(0, str.length - 1)) // drop the last character
testDF.withColumn("newCol", dropLast($"col"))
If all you want is to remove the last character of the string, you can do that without a UDF as well, by using regexp_replace:
testDF.show
+---+----+
| id|name|
+---+----+
| 1|abcd|
| 2|qazx|
+---+----+
testDF.withColumn("newcol", regexp_replace($"name", ".$" , "") ).show
+---+----+------+
| id|name|newcol|
+---+----+------+
| 1|abcd| abc|
| 2|qazx| qaz|
+---+----+------+
You can use the substr function to achieve this.
val substrDF = testDF.withColumn("newcol", 'col.substr(lit(1), length('col)-1))
The first parameter is the position at which the substring starts, and the second parameter is the length of the substring.
(startPos: Int, len: Int)
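For reference, the reason passing Column arguments works with substr (but not with the substring function) is that Column also defines a Column-based overload. Signatures as in the Spark API; double-check against the docs for your version:
def substr(startPos: Column, len: Column): Column
def substr(startPos: Int, len: Int): Column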

Pass Array[seq[String]] to UDF in spark scala

I am new to UDF in spark. I have also read the answer here
Problem statement: I'm trying to do pattern matching on a dataframe column.
Ex: Dataframe
val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")),
(3,Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")
df.show()
+---+--------------------+
| id| text|
+---+--------------------+
| 1| z|
| 2| abs,abc,dfg|
| 3|a,b,c,d,e,f,abs,a...|
+---+--------------------+
df.filter($"text".contains("abs,abc,dfg")).count()
//returns 2, as "abs,abc,dfg" exists in the 2nd and 3rd rows
Now I want to do this pattern matching for every row in the text column and add a new column called count.
Result:
+---+--------------------+-----+
| id| text|count|
+---+--------------------+-----+
| 1| z| 1|
| 2| abs,abc,dfg| 2|
| 3|a,b,c,d,e,f,abs,a...| 1|
+---+--------------------+-----+
I tried to define a udf passing the text column as Array[Seq[String]], but I am not able to get what I intended.
What I tried so far:
val txt = df.select("text").collect.map(_.toSeq.map(_.toString)) // convert column to Array[Seq[String]]
val valsum = udf((txt: Array[Seq[String]], pattern: String) => txt.count(_ == pattern))
df.withColumn("newCol", valsum(lit(txt), df("text"))).show()
Any help would be appreciated
You will have to gather all the elements of the text column, which can be done using collect_list over a window that groups all the rows of your dataframe as one. Then just check whether each element in the collected array contains the text of the current row and count them, as in the following code.
import sqlContext.implicits._
import scala.collection.mutable
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")), (3, Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")
val valsum = udf((txt: String, array: mutable.WrappedArray[String]) => array.filter(element => element.contains(txt)).size)
df.withColumn("grouping", lit("g"))
  .withColumn("array", collect_list("text").over(Window.partitionBy("grouping")))
  .withColumn("count", valsum($"text", $"array"))
  .drop("grouping", "array")
  .show(false)
You should have the following output:
+---+-----------------------+-----+
|id |text |count|
+---+-----------------------+-----+
|1 |z |1 |
|2 |abs,abc,dfg |2 |
|3 |a,b,c,d,e,f,abs,abc,dfg|1 |
+---+-----------------------+-----+
I hope this is helpful.
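A variant of the same idea, which avoids the constant "grouping" column, is to build the list once with a global aggregation and cross join it back onto every row (a sketch reusing the valsum udf above, not part of the original answer; like the window version, it still places all text values on every row):
// single-row dataframe holding all text values
val allTexts = df.agg(collect_list($"text").as("all_texts"))
df.crossJoin(allTexts)
  .withColumn("count", valsum($"text", $"all_texts"))
  .drop("all_texts")
  .show(false)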

Comparing two array columns in Scala Spark

I have a dataframe in the format given below.
movieId1 | genreList1 | genreList2
--------------------------------------------------
1 |[Adventure,Comedy] |[Adventure]
2 |[Animation,Drama,War] |[War,Drama]
3 |[Adventure,Drama] |[Drama,War]
and I am trying to create another flag column which shows whether genreList2 is a subset of genreList1.
movieId1 | genreList1 | genreList2 | Flag
---------------------------------------------------------------
1 |[Adventure,Comedy] | [Adventure] |1
2 |[Animation,Drama,War] | [War,Drama] |1
3 |[Adventure,Drama] | [Drama,War] |0
I have tried this:
def intersect_check(a: Array[String], b: Array[String]): Int = {
if (b.sameElements(a.intersect(b))) { return 1 }
else { return 2 }
}
def intersect_check_udf =
udf((colvalue1: Array[String], colvalue2: Array[String]) => intersect_check(colvalue1, colvalue2))
data = data.withColumn("Flag", intersect_check_udf(col("genreList1"), col("genreList2")))
But this throws an error:
org.apache.spark.SparkException: Failed to execute user defined function.
P.S.: The above function (intersect_check) works for Arrays.
We can define a udf that calculates the length of the intersection between the two Array columns and checks whether it is equal to the length of the second column. If so, the second array is a subset of the first one.
Also, the inputs of your udf need to be of type WrappedArray[String], not Array[String]:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.col
val same_elements = udf { (a: WrappedArray[String],
b: WrappedArray[String]) =>
if (a.intersect(b).length == b.length){ 1 }else{ 0 }
}
df.withColumn("test",same_elements(col("genreList1"),col("genreList2")))
.show(truncate = false)
+--------+-----------------------+------------+----+
|movieId1|genreList1 |genreList2 |test|
+--------+-----------------------+------------+----+
|1 |[Adventure, Comedy] |[Adventure] |1 |
|2 |[Animation, Drama, War]|[War, Drama]|1 |
|3 |[Adventure, Drama] |[Drama, War]|0 |
+--------+-----------------------+------------+----+
Data
val df = List((1,Array("Adventure","Comedy"), Array("Adventure")),
(2,Array("Animation","Drama","War"), Array("War","Drama")),
(3,Array("Adventure","Drama"),Array("Drama","War"))).toDF("movieId1","genreList1","genreList2")
Here is a solution using subsetOf:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._

val data = spark.sparkContext.parallelize(
  Seq(
    (1, Array("Adventure","Comedy"), Array("Adventure")),
    (2, Array("Animation","Drama","War"), Array("War","Drama")),
    (3, Array("Adventure","Drama"), Array("Drama","War"))
  )).toDF("movieId1", "genreList1", "genreList2")

val subsetOf = udf((col1: Seq[String], col2: Seq[String]) => {
  if (col2.toSet.subsetOf(col1.toSet)) 1 else 0
})

data.withColumn("flag", subsetOf(data("genreList1"), data("genreList2"))).show()
Hope this helps!
One solution may be to exploit Spark's built-in array functions: genreList2 is a subset of genreList1 if the intersection between the two is equal to genreList2 (array_intersect requires Spark 2.4+). In the code below, a sort_array operation has been added to avoid a mismatch between two arrays with different ordering but the same elements.
val spark = {
SparkSession
.builder()
.master("local")
.appName("test")
.getOrCreate()
}
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val df = Seq(
(1, Array("Adventure","Comedy"), Array("Adventure")),
(2, Array("Animation","Drama","War"), Array("War","Drama")),
(3, Array("Adventure","Drama"), Array("Drama","War"))
).toDF("movieId1", "genreList1", "genreList2")
df
.withColumn("flag",
sort_array(array_intersect($"genreList1",$"genreList2"))
.equalTo(
sort_array($"genreList2")
)
.cast("integer")
)
.show()
The output is
+--------+--------------------+------------+----+
|movieId1| genreList1| genreList2|flag|
+--------+--------------------+------------+----+
| 1| [Adventure, Comedy]| [Adventure]| 1|
| 2|[Animation, Drama...|[War, Drama]| 1|
| 3| [Adventure, Drama]|[Drama, War]| 0|
+--------+--------------------+------------+----+
This can also work here, and it does not use a udf (array_except requires Spark 2.4+):
import spark.implicits._
val data = Seq(
(1,Array("Adventure","Comedy"),Array("Adventure")),
(2,Array("Animation","Drama","War"),Array("War","Drama")),
(3,Array("Adventure","Drama"),Array("Drama","War"))
).toDF("movieId1", "genreList1", "genreList2")
data
.withColumn("size",size(array_except($"genreList2",$"genreList1")))
.withColumn("flag",when($"size" === lit(0), 1) otherwise(0))
.show(false)
Spark 3.0+ (forall)
forall($"genreList2", x => array_contains($"genreList1", x)).cast("int")
Full example:
val df = Seq(
(1, Seq("Adventure", "Comedy"), Seq("Adventure")),
(2, Seq("Animation", "Drama","War"), Seq("War", "Drama")),
(3, Seq("Adventure", "Drama"), Seq("Drama", "War"))
).toDF("movieId1", "genreList1", "genreList2")
val df2 = df.withColumn("Flag", forall($"genreList2", x => array_contains($"genreList1", x)).cast("int"))
df2.show()
// +--------+--------------------+------------+----+
// |movieId1| genreList1| genreList2|Flag|
// +--------+--------------------+------------+----+
// | 1| [Adventure, Comedy]| [Adventure]| 1|
// | 2|[Animation, Drama...|[War, Drama]| 1|
// | 3| [Adventure, Drama]|[Drama, War]| 0|
// +--------+--------------------+------------+----+