How to write UDF with values as references to other columns? - scala

I'd like to create a UDF that does the following:
A DataFrame has 5 columns, and I want to create a 6th column whose value is the sum of the two columns whose names are stored in the first and second columns.
Let me print the DataFrame and explain with it:
case class salary(c1: String, c2: String, c3: Int, c4: Int, c5: Int)
val df = Seq(
  salary("c3", "c4", 7, 5, 6),
  salary("c5", "c4", 8, 10, 20),
  salary("c5", "c3", 1, 4, 9)
).toDF()
DataFrame result
+---+---+---+---+---+
| c1| c2| c3| c4| c5|
+---+---+---+---+---+
| c3| c4| 7| 5| 6|
| c5| c4| 8| 10| 20|
| c5| c3| 1| 4| 9|
+---+---+---+---+---+
df.withColumn("c6",UDFName(c1,c2))
And the result for this column should be:
1st row (c3, c4): 7 + 5 = 12
2nd row (c5, c4): 20 + 10 = 30
3rd row (c5, c3): 9 + 1 = 10

There is really no need for a UDF here. Just use a virtual MapType column:
import org.apache.spark.sql.functions.{col, lit, map}
// We use an interleaved list of column name and column value
val values = map(Seq("c3", "c4", "c5").flatMap(c => Seq(lit(c), col(c))): _*)
// Check the first row
df.select(values).limit(1).show(false)
+------------------------------+
|map(c3, c3, c4, c4, c5, c5) |
+------------------------------+
|Map(c3 -> 7, c4 -> 5, c5 -> 6)|
+------------------------------+
and use it in an expression:
df.withColumn("c6", values($"c1") + values($"c2"))
+---+---+---+---+---+---+
| c1| c2| c3| c4| c5| c6|
+---+---+---+---+---+---+
| c3| c4| 7| 5| 6| 12|
| c5| c4| 8| 10| 20| 30|
| c5| c3| 1| 4| 9| 10|
+---+---+---+---+---+---+
It is much cleaner, faster, and safer than dealing with UDFs and Rows:
import org.apache.spark.sql.functions.{struct, udf}
import org.apache.spark.sql.Row
val f = udf((row: Row) => for {
  // Use Options to avoid problems with null columns
  // Explicit null checks should be faster, but are much more verbose
  c1 <- Option(row.getAs[String]("c1"))
  c2 <- Option(row.getAs[String]("c2"))
  // In this case we could (probably) skip the Options below,
  // but Int columns in Spark SQL can still be null
  x <- Option(row.getAs[Int](c1))
  y <- Option(row.getAs[Int](c2))
} yield x + y)
df.withColumn("c6", f(struct(df.columns map col: _*)))
+---+---+---+---+---+---+
| c1| c2| c3| c4| c5| c6|
+---+---+---+---+---+---+
| c3| c4| 7| 5| 6| 12|
| c5| c4| 8| 10| 20| 30|
| c5| c3| 1| 4| 9| 10|
+---+---+---+---+---+---+

A user-defined function (UDF) only has access to the values that are passed to it directly as input parameters.
If you want to access the other columns, you have to pass them to the UDF as input parameters as well. With that, you should easily achieve what you're after.
I highly recommend using the struct function to combine all the other columns.
struct(cols: Column*): Column Creates a new struct column.
You could also use the Dataset.columns method to get the columns to pass to struct.
columns: Array[String] Returns all column names as an array.
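As a minimal sketch of that approach with the example DataFrame above (the UDF name sumByName is hypothetical, not part of any API):
import org.apache.spark.sql.functions.{col, struct, udf}
import org.apache.spark.sql.Row

// Pass every column into the UDF as a single struct, then look up the two
// referenced columns by name inside the Row (sumByName is a hypothetical name)
val sumByName = udf((row: Row) =>
  row.getAs[Int](row.getAs[String]("c1")) + row.getAs[Int](row.getAs[String]("c2")))

df.withColumn("c6", sumByName(struct(df.columns.map(col): _*)))
The struct(df.columns.map(col): _*) call hands the whole row to the UDF, so the column names stored in c1 and c2 can be resolved inside it.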

Related

A sum of typedLit columns evaluates to NULL

I am trying to create a sum column by taking the sum of the row values of a set of columns in a dataframe. So I used the following method to do it.
import org.apache.spark.sql.functions.{col, typedLit}

val temp_data = spark.createDataFrame(Seq(
  (1, 5),
  (2, 4),
  (3, 7),
  (4, 6)
)).toDF("A", "B")
val cols = List(col("A"), col("B"))
temp_data.withColumn("sum", cols.reduce(_ + _)).show
+---+---+---+
| A| B|sum|
+---+---+---+
| 1| 5| 6|
| 2| 4| 6|
| 3| 7| 10|
| 4| 6| 10|
+---+---+---+
So this method works fine and produces the expected output. However, I want to create the cols variable without specifying the column names explicitly. Therefore I've used typedLit as follows.
val cols2 = temp_data.columns.map(x=>typedLit(x)).toList
When I look at cols and cols2, they look identical.
cols: List[org.apache.spark.sql.Column] = List(A, B)
cols2: List[org.apache.spark.sql.Column] = List(A, B)
However, when I use cols2 to create my sum column, it doesn't work the way I expect it to work.
temp_data.withColumn("sum", cols2.reduce(_ + _)).show
+---+---+----+
| A| B| sum|
+---+---+----+
| 1| 5|null|
| 2| 4|null|
| 3| 7|null|
| 4| 6|null|
+---+---+----+
Does anyone have any idea what I'm doing wrong here? Why doesn't the second method work like the first method?
lit or typedLit is not a replacement for Column. What your code does is create a list of string literals, "A" and "B":
temp_data.select(cols2: _*).show
+---+---+
| A| B|
+---+---+
| A| B|
| A| B|
| A| B|
| A| B|
+---+---+
and asks for their sum, hence the result is undefined (in practice null, since the strings cannot be cast to a numeric type).
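A quick way to see this (a sketch; with Spark's default non-ANSI cast behavior the string literals cast to null, and adding nulls yields null):
import org.apache.spark.sql.functions.lit

// "A" and "B" cannot be cast to a numeric type, so the addition yields null
temp_data.select((lit("A") + lit("B")).alias("sum")).show()
// expected: a single "sum" column containing null in every row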
You might use TypedColumn here:
import org.apache.spark.sql.TypedColumn
val typedSum: TypedColumn[Any, Int] = cols.map(_.as[Int]).reduce {
  (x, y) => (x + y).as[Int]
}
temp_data.withColumn("sum", typedSum).show
but it doesn't provide any practical advantage over standard Column here.
You are trying typedLit, which is not right, and as the other answer mentioned you don't have to use TypedColumn. You can simply use a map transformation over the DataFrame's columns to convert them to a List[Column].
Change your cols2 statement to the below and try:
val cols = temp_data.columns.map(f=> col(f))
temp_data.withColumn("sum", cols.reduce(_ + _)).show
You will get the output below:
+---+---+---+
| A| B|sum|
+---+---+---+
| 1| 5| 6|
| 2| 4| 6|
| 3| 7| 10|
| 4| 6| 10|
+---+---+---+

How to show only relevant columns from Spark's DataFrame?

I have a large JSON file with 432 key-value pairs and many rows of such data. The data loads fine, but when I use df.show() to display 20 items I see a bunch of nulls. The file is quite sparse, so it's very hard to make anything out of it. It would be nice to drop columns that contain only nulls in those 20 rows, but given that I have a lot of key-value pairs it's hard to do manually. Is there a way to detect which columns of a Spark DataFrame contain only nulls and drop them?
You can try it like below; for more info, see referred_question.
scala> val df = Seq((1,2,null),(3,4,null),(5,6,null),(7,8,"9")).toDF("a","b","c")
scala> df.show
+---+---+----+
| a| b| c|
+---+---+----+
| 1| 2|null|
| 3| 4|null|
| 5| 6|null|
| 7| 8| 9|
+---+---+----+
scala> val dfl = df.limit(3) //limiting the number of rows you need, in your case it is 20
scala> val col_names = dfl.select(dfl.columns.map(x => count(col(x)).alias(x)):_*).first.toSeq.zipWithIndex.filter(x => x._1.toString.toInt > 0).map(_._2).map(x => dfl.columns(x)).map(x => col(x)) // this gives you the column names that have non-null values
col_names: Seq[org.apache.spark.sql.Column] = ArrayBuffer(a, b)
scala> dfl.select(col_names : _*).show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
| 5| 6|
+---+---+
Let me know if it works for you.
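For readability, the same idea can also be written as a small helper; this is just a sketch, and nonNullColumns is a hypothetical name:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, count}

// Keep only the columns that contain at least one non-null value
// within the first `n` rows.
def nonNullColumns(df: DataFrame, n: Int): Array[Column] = {
  val sample = df.limit(n)
  // count(col) ignores nulls, so an all-null column gets a count of 0
  val counts = sample.select(sample.columns.map(c => count(col(c)).alias(c)): _*).first
  sample.columns.filter(c => counts.getAs[Long](c) > 0).map(col)
}

df.select(nonNullColumns(df, 20): _*).show()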
Similar to Sathiyan's idea, but using the column name in the count() itself.
scala> val df = Seq((1,2,null),(3,4,null),(5,6,null)).toDF("a","b","c")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
scala> df.show
+---+---+----+
| a| b| c|
+---+---+----+
| 1| 2|null|
| 3| 4|null|
| 5| 6|null|
+---+---+----+
scala> val notnull_cols = df.select(df.columns.map(x=>concat_ws("=",first(lit(x)),count(col(x)))):_*).first.toSeq.map(_.toString).filter(!_.contains("=0")).map( x=>col(x.split("=")(0)) )
notnull_cols: Seq[org.apache.spark.sql.Column] = ArrayBuffer(a, b)
scala> df.select(notnull_cols:_*).show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
| 5| 6|
+---+---+
The intermediate result shows the count along with the column names:
scala> df.select(df.columns.map(x=>concat_ws("=",first(lit(x)),count(col(x))).as(x+"_nullcount")):_*).show
+-----------+-----------+-----------+
|a_nullcount|b_nullcount|c_nullcount|
+-----------+-----------+-----------+
| a=3| b=3| c=0|
+-----------+-----------+-----------+

Scala/Spark: How to select columns to read ONLY when list of columns > 0

I'm passing in a parameter fieldsToLoad: List[String] and I want to load ALL columns if this list is empty, and only the columns specified in the list if it has one or more entries. I have this now, which reads the columns passed in the list:
val parquetDf = sparkSession.read.parquet(inputPath:_*).select(fieldsToLoad.head, fieldsToLoad.tail:_*)
But how do I add a condition to load * (all columns) when the list is empty?
@Andy Hayden's answer is correct, but I want to show how the selectExpr function can simplify the selection.
scala> val df = Range(1, 4).toList.map(x => (x, x + 1, x + 2)).toDF("c1", "c2", "c3")
df: org.apache.spark.sql.DataFrame = [c1: int, c2: int ... 1 more field]
scala> df.show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
+---+---+---+
scala> val fieldsToLoad = List("c2", "c3")
fieldsToLoad: List[String] = List(c2, c3)
scala> df.selectExpr((if (fieldsToLoad.nonEmpty) fieldsToLoad else List("*")):_*).show()
+---+---+
| c2| c3|
+---+---+
| 2| 3|
| 3| 4|
| 4| 5|
+---+---+
scala> val fieldsToLoad = List()
fieldsToLoad: List[Nothing] = List()
scala> df.selectExpr((if (fieldsToLoad.nonEmpty) fieldsToLoad else List("*")):_*).show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
+---+---+---+
You could use an if expression first to replace the empty list with just "*":
val cols = if (fieldsToLoad.nonEmpty) fieldsToLoad else List("*")
sparkSession.read.parquet(inputPath:_*).select(cols.head, cols.tail:_*)

Scala/Spark dataframes: find the column name corresponding to the max

In Scala/Spark, having a dataframe:
val dfIn = sqlContext.createDataFrame(Seq(
("r0", 0, 2, 3),
("r1", 1, 0, 0),
("r2", 0, 2, 2))).toDF("id", "c0", "c1", "c2")
I would like to compute a new column maxCol holding the name of the column corresponding to the max value (for each row). With this example, the output should be:
+---+---+---+---+------+
| id| c0| c1| c2|maxCol|
+---+---+---+---+------+
| r0| 0| 2| 3| c2|
| r1| 1| 0| 0| c0|
| r2| 0| 2| 2| c1|
+---+---+---+---+------+
Actually the dataframe has more than 60 columns, so a generic solution is required.
The equivalent in Python Pandas (yes, I know, I should compare with pyspark...) could be:
dfOut = pd.concat([dfIn, dfIn.idxmax(axis=1).rename('maxCol')], axis=1)
With a small trick you can use the greatest function. Required imports:
import org.apache.spark.sql.functions.{col, greatest, lit, struct}
First let's create a list of structs, where the first element is the value and the second the column name:
val structs = dfIn.columns.tail.map(
  c => struct(col(c).as("v"), lit(c).as("k"))
)
A structure like this can be passed to greatest as follows:
dfIn.withColumn("maxCol", greatest(structs: _*).getItem("k"))
+---+---+---+---+------+
| id| c0| c1| c2|maxCol|
+---+---+---+---+------+
| r0| 0| 2| 3| c2|
| r1| 1| 0| 0| c0|
| r2| 0| 2| 2| c2|
+---+---+---+---+------+
Please note that in case of ties it will take the element that occurs later in the sequence (lexicographically, (x, "c2") > (x, "c1")). If for some reason this is not acceptable, you can explicitly reduce with when:
import org.apache.spark.sql.functions.when
val max_col = structs.reduce(
  (c1, c2) => when(c1.getItem("v") >= c2.getItem("v"), c1).otherwise(c2)
).getItem("k")
dfIn.withColumn("maxCol", max_col)
+---+---+---+---+------+
| id| c0| c1| c2|maxCol|
+---+---+---+---+------+
| r0| 0| 2| 3| c2|
| r1| 1| 0| 0| c0|
| r2| 0| 2| 2| c1|
+---+---+---+---+------+
In case of nullable columns you have to adjust this, for example by coalescing the values to -Inf.
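One possible adjustment, as a sketch (it assumes the value columns can be cast to double, so that -Infinity works as a sentinel for null):
import org.apache.spark.sql.functions.{coalesce, col, greatest, lit, struct}

// Replace nulls with -Infinity so a null value can never win the comparison
val structsNullSafe = dfIn.columns.tail.map(c =>
  struct(coalesce(col(c).cast("double"), lit(Double.NegativeInfinity)).as("v"), lit(c).as("k"))
)

dfIn.withColumn("maxCol", greatest(structsNullSafe: _*).getItem("k"))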

Split String (or List of Strings) to individual columns in spark dataframe

Given a dataframe "df" and a list of columns "colStr", is there a way in Spark DataFrame to extract or reference those columns from the data frame?
Here's an example -
val in = sc.parallelize(List(0, 1, 2, 3, 4, 5))
val df = in.map(x => (x, x+1, x+2)).toDF("c1", "c2", "c3")
val keyColumn = "c2" // this is either a single column name or a string of column names delimited by ','
val keyGroup = keyColumn.split(",").toSeq.map(x => col(x))
import org.apache.spark.sql.expressions.Window
import sqlContext.implicits._
val ranker = Window.partitionBy(keyGroup).orderBy($"c2")
val new_df= df.withColumn("rank", rank.over(ranker))
new_df.show()
The above errors out with
error: overloaded method value partitionBy with alternatives
(cols:org.apache.spark.sql.Column*)org.apache.spark.sql.expressions.WindowSpec <and>
(colName: String,colNames: String*)org.apache.spark.sql.expressions.WindowSpec
cannot be applied to (Seq[org.apache.spark.sql.Column])
Appreciate the help. Thanks!
If you are trying to partition the data frame by the columns in the keyGroup list, you can pass keyGroup: _* as the parameter to the partitionBy function:
val ranker = Window.partitionBy(keyGroup: _*).orderBy($"c2")
val new_df= df.withColumn("rank", rank.over(ranker))
new_df.show
+---+---+---+----+
| c1| c2| c3|rank|
+---+---+---+----+
| 0| 1| 2| 1|
| 5| 6| 7| 1|
| 2| 3| 4| 1|
| 4| 5| 6| 1|
| 3| 4| 5| 1|
| 1| 2| 3| 1|
+---+---+---+----+