Determine value of column by iterating over all other columns of dataframe - scala

I have a dataframe with columns A-Z and I want to assign the value of Z based on whether any other column's value is null.
I can do this by:
val df2 = df1.withColumn("Z",
  when(col("A").isNull, lit("Y"))
  .when(col("B").isNull, lit("Y"))
  .when(col("C").isNull, lit("Y"))
  ...
  ...
  .when(col("Y").isNull, lit("Y"))
  .otherwise(lit("N")));
Is there a more succinct way to iterate over all other columns inside the withColumn method?

Yes, you can iterate over the columns inside withColumn and use foldLeft to build the logical expression:
val df2 = df1.withColumn("Z",
  when(
    df1.columns
      .filter(name => name.matches("[A-Z]"))  // only take these column names
      .map(name => col(name))                 // map String to Column
      .foldLeft(lit(false))((acc, current) => when(acc or current.isNull, lit(true)).otherwise(lit(false)))
    , lit("Y"))
  .otherwise(lit("N"))
)
Test:
input:
+---+----+----+
| A| B| C|
+---+----+----+
| 1| 2| 3|
| 1|null| 3|
| 1|null|null|
+---+----+----+
output:
+---+----+----+---+
| A| B| C| Z|
+---+----+----+---+
| 1| 2| 3| N|
| 1|null| 3| Y|
| 1|null|null| Y|
+---+----+----+---+
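For what it's worth, the when/otherwise inside the fold is not strictly needed: the fold only has to OR the per-column null checks together, so a plain reduce gives the same result. A minimal sketch, assuming the same df1 and single-letter column names:
import org.apache.spark.sql.functions.{col, lit, when}

// OR together an isNull check for every source column (everything except Z).
val anyNull = df1.columns
  .filter(_.matches("[A-Y]"))
  .map(c => col(c).isNull)
  .reduce(_ || _)

val df2 = df1.withColumn("Z", when(anyNull, lit("Y")).otherwise(lit("N")))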

I achieved this using functions from the spark.sql.functions package:
val df2 = df1.withColumn("Z",
  when(array_contains(array(df1.columns.map(c => col(c).isNull): _*), true), lit("Y"))
    .otherwise(lit("N")))

Related

A sum of typedLit columns evaluates to NULL

I am trying to create a sum column by taking the sum of the row values of a set of columns in a dataframe, so I used the following approach.
val temp_data = spark.createDataFrame(Seq(
  (1, 5),
  (2, 4),
  (3, 7),
  (4, 6)
)).toDF("A", "B")

val cols = List(col("A"), col("B"))
temp_data.withColumn("sum", cols.reduce(_ + _)).show
+---+---+---+
| A| B|sum|
+---+---+---+
| 1| 5| 6|
| 2| 4| 6|
| 3| 7| 10|
| 4| 6| 10|
+---+---+---+
So this method works fine and produces the expected output. However, I want to create the cols variable without specifying the column names explicitly. Therefore I've used typedLit as follows.
val cols2 = temp_data.columns.map(x=>typedLit(x)).toList
When I look at cols and cols2, they look identical.
cols: List[org.apache.spark.sql.Column] = List(A, B)
cols2: List[org.apache.spark.sql.Column] = List(A, B)
However, when I use cols2 to create my sum column, it doesn't work the way I expect it to work.
temp_data.withColumn("sum", cols2.reduce(_ + _)).show
+---+---+----+
| A| B| sum|
+---+---+----+
| 1| 5|null|
| 2| 4|null|
| 3| 7|null|
| 4| 6|null|
+---+---+----+
Does anyone have any idea what I'm doing wrong here? Why doesn't the second method work like the first method?
lit or typedLit is not a replacement for Column. What your code does is create a list of string literals - "A" and "B"
temp_data.select(cols2: _*).show
+---+---+
| A| B|
+---+---+
| A| B|
| A| B|
| A| B|
| A| B|
+---+---+
and asks for their sum - hence the result is null.
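A quick way to see this directly (a sketch; the two string literals are coerced to numbers for the addition, and the failed cast produces null):
import org.apache.spark.sql.functions.lit

// "A" + "B" is resolved as a numeric addition; casting "A" to a number yields null.
temp_data.select((lit("A") + lit("B")).alias("sum")).show()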
You might use TypedColumn here:
import org.apache.spark.sql.TypedColumn
val typedSum: TypedColumn[Any, Int] = cols.map(_.as[Int]).reduce {
  (x, y) => (x + y).as[Int]
}

temp_data.withColumn("sum", typedSum).show
but it doesn't provide any practical advantage over standard Column here.
You are trying typedLit, which is not what you need here, and as the other answer mentioned you don't need TypedColumn either. You can simply map over the dataframe's column names to convert them to a List[Column].
Build your column list as below and try:
val cols = temp_data.columns.map(f=> col(f))
temp_data.withColumn("sum", cols.reduce(_ + _)).show
You will get the output below.
+---+---+---+
| A| B|sum|
+---+---+---+
| 1| 5| 6|
| 2| 4| 6|
| 3| 7| 10|
| 4| 6| 10|
+---+---+---+
Thanks

How to show only relevant columns from Spark's DataFrame?

I have a large JSON file with 432 key-value pairs and many rows of such data. The data loads fine, but when I use df.show() to display 20 rows I see a bunch of nulls. The file is quite sparse, so it's very hard to make anything out of it. It would be nice to drop the columns that contain only nulls in those 20 rows, but given how many key-value pairs I have, that's hard to do manually. Is there a way to detect which columns of a Spark DataFrame contain only nulls and drop them?
You can try something like the below; for more info, see the referenced question.
scala> val df = Seq((1,2,null),(3,4,null),(5,6,null),(7,8,"9")).toDF("a","b","c")
scala> df.show
+---+---+----+
| a| b| c|
+---+---+----+
| 1| 2|null|
| 3| 4|null|
| 5| 6|null|
| 7| 8| 9|
+---+---+----+
scala> val dfl = df.limit(3) //limiting the number of rows you need, in your case it is 20
scala> val col_names = dfl.select(dfl.columns.map(x => count(col(x)).alias(x)):_*).first.toSeq.zipWithIndex.filter(x => x._1.toString.toInt > 0).map(_._2).map(x => dfl.columns(x)).map(x => col(x)) // this gives you the names of the columns that have non-null values
col_names: Seq[org.apache.spark.sql.Column] = ArrayBuffer(a, b)
scala> dfl.select(col_names : _*).show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
| 5| 6|
+---+---+
Let me know if it works for you.
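For readability, here is the same logic as the one-liner above, split into steps (a sketch, assuming the same dfl as above):
import org.apache.spark.sql.functions.{col, count}

// Count the non-null values per column on the limited dataframe.
val counts = dfl.select(dfl.columns.map(c => count(col(c)).alias(c)): _*).first

// Keep only the columns whose non-null count is greater than zero.
val keepCols = dfl.columns.filter(c => counts.getAs[Long](c) > 0).map(col(_))

dfl.select(keepCols: _*).show()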
Similar to Sathiyan's idea, but using the column name in the count() itself.
scala> val df = Seq((1,2,null),(3,4,null),(5,6,null)).toDF("a","b","c")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
scala> df.show
+---+---+----+
| a| b| c|
+---+---+----+
| 1| 2|null|
| 3| 4|null|
| 5| 6|null|
+---+---+----+
scala> val notnull_cols = df.select(df.columns.map(x=>concat_ws("=",first(lit(x)),count(col(x)))):_*).first.toSeq.map(_.toString).filter(!_.contains("=0")).map( x=>col(x.split("=")(0)) )
notnull_cols: Seq[org.apache.spark.sql.Column] = ArrayBuffer(a, b)
scala> df.select(notnull_cols:_*).show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
| 5| 6|
+---+---+
The intermediate result shows the non-null count alongside each column name:
scala> df.select(df.columns.map(x=>concat_ws("=",first(lit(x)),count(col(x))).as(x+"_nullcount")):_*).show
+-----------+-----------+-----------+
|a_nullcount|b_nullcount|c_nullcount|
+-----------+-----------+-----------+
| a=3| b=3| c=0|
+-----------+-----------+-----------+

Scala/Spark: How to select columns to read ONLY when list of columns > 0

I'm passing in a parameter fieldsToLoad: List[String] and I want to load ALL columns if this list is empty and load only the columns specified in the list if it has one or more columns. I have this now, which reads the columns passed in the list:
val parquetDf = sparkSession.read.parquet(inputPath:_*).select(fieldsToLoad.head, fieldsToLoad.tail:_*)
But how do I add a condition to load * (all columns) when the list is empty?
@Andy Hayden's answer is correct, but I want to show how the selectExpr function can simplify the selection.
scala> val df = Range(1, 4).toList.map(x => (x, x + 1, x + 2)).toDF("c1", "c2", "c3")
df: org.apache.spark.sql.DataFrame = [c1: int, c2: int ... 1 more field]
scala> df.show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
+---+---+---+
scala> val fieldsToLoad = List("c2", "c3")
fieldsToLoad: List[String] = List(c2, c3)
scala> df.selectExpr((if (fieldsToLoad.nonEmpty) fieldsToLoad else List("*")):_*).show()
+---+---+
| c2| c3|
+---+---+
| 2| 3|
| 3| 4|
| 4| 5|
+---+---+
scala> val fieldsToLoad = List()
fieldsToLoad: List[Nothing] = List()
scala> df.selectExpr((if (fieldsToLoad.nonEmpty) fieldsToLoad else List("*")):_*).show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
+---+---+---+
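Applied back to the original question, the same pattern works directly on the parquet read (a sketch; sparkSession, inputPath and fieldsToLoad are the names from the question):
val parquetDf = sparkSession.read
  .parquet(inputPath: _*)
  .selectExpr((if (fieldsToLoad.nonEmpty) fieldsToLoad else List("*")): _*)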
You could use an if expression first to replace the empty list with just "*":
val cols = if (fieldsToLoad.nonEmpty) fieldsToLoad else List("*")
sparkSession.read.parquet(inputPath: _*).select(cols.head, cols.tail: _*)

Spark dataframe select rows with at least one null or blank in any column of that row

From one dataframe I want to create a new dataframe containing the rows where at least one value in any of the columns is null or blank, in Spark 1.5 / Scala.
I am trying to write a generalized function to create this new dataframe, where I pass in the dataframe and the list of columns and it creates the records.
Thanks
Sample Data:
val df = Seq((null, Some(2)), (Some("a"), Some(4)), (Some(""), Some(5)), (Some("b"), null)).toDF("A", "B")
df.show
+----+----+
| A| B|
+----+----+
|null| 2|
| a| 4|
| | 5|
| b|null|
+----+----+
You can construct the condition as follows, assuming blank means an empty string here:
import org.apache.spark.sql.functions.col
val cond = df.columns.map(x => col(x).isNull || col(x) === "").reduce(_ || _)
df.filter(cond).show
+----+----+
| A| B|
+----+----+
|null| 2|
| | 5|
| b|null|
+----+----+
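Since the question asks for a generalized function that takes the dataframe and a list of columns, here is a minimal sketch (the helper name is illustrative, not from the question):
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Returns the rows where at least one of the given columns is null or an empty string.
def rowsWithNullOrBlank(df: DataFrame, columns: Seq[String]): DataFrame = {
  val cond: Column = columns
    .map(c => col(c).isNull || col(c) === "")
    .reduce(_ || _)
  df.filter(cond)
}

// Usage: rowsWithNullOrBlank(df, df.columns)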

Split String (or List of Strings) to individual columns in spark dataframe

Given a dataframe "df" and a list of columns "colStr", is there a way in Spark to extract or reference those columns from the data frame?
Here's an example -
val in = sc.parallelize(List(0, 1, 2, 3, 4, 5))
val df = in.map(x => (x, x+1, x+2)).toDF("c1", "c2", "c3")
val keyColumn = "c2" // this is either a single column name or a string of column names delimited by ','
val keyGroup = keyColumn.split(",").toSeq.map(x => col(x))
import org.apache.spark.sql.expressions.Window
import sqlContext.implicits._
val ranker = Window.partitionBy(keyGroup).orderBy($"c2")
val new_df= df.withColumn("rank", rank.over(ranker))
new_df.show()
The above errors out with
error: overloaded method value partitionBy with alternatives
(cols:org.apache.spark.sql.Column*)org.apache.spark.sql.expressions.WindowSpec <and>
(colName: String,colNames: String*)org.apache.spark.sql.expressions.WindowSpec
cannot be applied to (Seq[org.apache.spark.sql.Column])
Appreciate the help. Thanks!
If you are trying to partition the data frame by the columns in the keyGroup list, you can pass keyGroup: _* to the partitionBy function; the : _* type ascription expands the Seq[Column] so that it matches the Column* varargs overload:
val ranker = Window.partitionBy(keyGroup: _*).orderBy($"c2")
val new_df= df.withColumn("rank", rank.over(ranker))
new_df.show
+---+---+---+----+
| c1| c2| c3|rank|
+---+---+---+----+
| 0| 1| 2| 1|
| 5| 6| 7| 1|
| 2| 3| 4| 1|
| 4| 5| 6| 1|
| 3| 4| 5| 1|
| 1| 2| 3| 1|
+---+---+---+----+