How to get column names with all values null? - scala

I can't figure out how to get the names of the columns whose values are all null.
For example,
case class A(name: String, id: String, email: String, company: String)
val e1 = A("n1", null, "n1#c1.com", null)
val e2 = A("n2", null, "n2#c1.com", null)
val e3 = A("n3", null, "n3#c1.com", null)
val e4 = A("n4", null, "n4#c2.com", null)
val e5 = A("n5", null, "n5#c2.com", null)
val e6 = A("n6", null, "n6#c2.com", null)
val e7 = A("n7", null, "n7#c3.com", null)
val e8 = A("n8", null, "n8#c3.com", null)
val As = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
val df = sc.parallelize(As).toDF
This code produces a DataFrame like this:
+----+----+---------+-------+
|name| id| email|company|
+----+----+---------+-------+
| n1|null|n1#c1.com| null|
| n2|null|n2#c1.com| null|
| n3|null|n3#c1.com| null|
| n4|null|n4#c2.com| null|
| n5|null|n5#c2.com| null|
| n6|null|n6#c2.com| null|
| n7|null|n7#c3.com| null|
| n8|null|n8#c3.com| null|
+----+----+---------+-------+
I want to get the names of the columns whose rows are all null: id, company.
I don't care about the type of the output: Array, String, RDD, whatever.

You can do a simple count on all your columns, then use the indices of the columns that return a count of 0 to subset df.columns:
import org.apache.spark.sql.functions.{count, col}

// Get column indices
val col_inds = df.select(df.columns.map(c => count(col(c)).alias(c)): _*)
  .collect()(0)
  .toSeq.zipWithIndex
  .filter(_._1 == 0).map(_._2)

// Subset column names using the indices
col_inds.map(i => df.columns.apply(i))
// Seq[String] = ArrayBuffer(id, company)
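For reference, a slight variant of the same idea (just a sketch against the same df) that zips each column name with its count, so no index bookkeeping is needed:

import org.apache.spark.sql.functions.{count, col}

val allNullCols = df.columns
  .zip(df.select(df.columns.map(c => count(col(c)).alias(c)): _*).first().toSeq)
  .collect { case (name, cnt: Long) if cnt == 0L => name }
// Array[String] = Array(id, company)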

An alternative solution could be as follows (but I am afraid the performance might not be satisfactory).
val ids = Seq(
  ("1", null: String),
  ("1", null: String),
  ("10", null: String)
).toDF("id", "all_nulls")
scala> ids.show
+---+---------+
| id|all_nulls|
+---+---------+
| 1| null|
| 1| null|
| 10| null|
+---+---------+
val s = ids.columns.
  map { c =>
    (c, ids.select(c).dropDuplicates(c).na.drop.count) }. // <-- performance here!
  collect { case (c, cnt) if cnt == 0 => c }
scala> s.foreach(println)
all_nulls
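Yet another quick check on the same ids DataFrame, sketched under the assumption that you are on Spark 2.3+ where summary is available: summary("count") reports the non-null count of every column as a string in a single row, so the all-null columns are the ones whose count is "0".

val counts = ids.summary("count").drop("summary").first()
val allNull = ids.columns.zip(counts.toSeq).collect { case (c, "0") => c }
// Array[String] = Array(all_nulls)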

Related

rename multiple columns of a dataframe in scala

I want to rename some columns in a dataframe; the columns to rename are provided in a Seq.
I'm using the method below:
def prefixColumns(dataframe: DataFrame, columnPrefix: String, cols: Seq[String]): DataFrame = {
  for (column <- dataframe.columns) {
    if (cols.contains(column)) {
      dataframe.withColumnRenamed(column, columnPrefix + "_" + column)
    }
  }
  dataframe
}
and calling it as
prefixColumns(products, "products", Seq(col1, col2, col3, col4))
It only renames col4 as products_col4 and the other columns are left as is.
Can someone suggest a way to do this in Scala?
I want to rename only the columns provided in the Seq and keep the other columns of the dataframe as is.
Your function does not rename anything because withColumnRenamed does not transform the object it is called on. It returns a new object with the column renamed. Let's check that:
Seq("id", "id2")
val cols = Seq("id", "id2")
val df = spark.range(1).select('id, 'id as "x", 'id as "id2", 'id as "id3")
df.show
+---+---+---+---+
| id| x|id2|id3|
+---+---+---+---+
| 0| 0| 0| 0|
+---+---+---+---+
prefixColumns(df, "X", col).show()
+---+---+---+---+
| id| x|id2|id3|
+---+---+---+---+
| 0| 0| 0| 0|
+---+---+---+---+
But you can adjust your function a little bit to make it work:
def prefixColumns(dataframe: DataFrame, columnPrefix: String, cols: Seq[String]): DataFrame = {
  var result = dataframe
  for (column <- dataframe.columns) {
    if (cols.contains(column)) {
      // we assign the renamed df to the result variable
      result = result.withColumnRenamed(column, columnPrefix + "_" + column)
    }
  }
  result
}
prefixColumns(df, "X", col).show()
+----+---+-----+---+
|X_id| x|X_id2|id3|
+----+---+-----+---+
| 0| 0| 0| 0|
+----+---+-----+---+
NB: another way to do it is to use select like this, with no for loops:
dataframe.select(dataframe.columns.map(c =>
  if (cols contains c) col(c).alias(columnPrefix + "_" + c) else col(c)
): _*)
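If you prefer to stay with withColumnRenamed, a foldLeft also avoids the var; this is just a sketch with the same signature:

import org.apache.spark.sql.DataFrame

def prefixColumns(dataframe: DataFrame, columnPrefix: String, cols: Seq[String]): DataFrame =
  cols.foldLeft(dataframe) { (df, c) =>
    // withColumnRenamed returns a new DataFrame, so thread it through the fold
    if (df.columns.contains(c)) df.withColumnRenamed(c, columnPrefix + "_" + c) else df
  }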

Dynamic evaluation of Boolean expressions in a Spark DataFrame

Suppose I have a Spark DataFrame (in Scala) like
+---+---+---------------+
| a| b| expr|
+---+---+---------------+
| 0| 0|a = 1 AND b = 0|
| 0| 1| a = 0|
| 1| 0|a = 1 AND b = 1|
| 1| 1|a = 1 AND b = 1|
| 1| 1| null|
| 1| 1| a = 0 OR b = 1|
+---+---+---------------+
in which the string column expr contains nullable Boolean expressions that refer to the other numeric columns in the same DataFrame (a and b).
I would like to derive a column eval(expr) that evaluates the Boolean expression expr row-wise, i.e.,
+---+---+---------------+----------+
| a| b| expr|eval(expr)|
+---+---+---------------+----------+
| 0| 0|a = 1 AND b = 0| false|
| 0| 1| a = 0| true|
| 1| 0|a = 1 AND b = 1| false|
| 1| 1|a = 1 AND b = 1| true|
| 1| 1| null| true|
| 1| 1| a = 0 OR b = 1| true|
+---+---+---------------+----------+
(in particular, although this is an optional specification, null evaluates to true).
Question
What's the best way to create eval(expr)?
That is, how can I create a column in a Spark DataFrame that evaluates a column of Boolean expressions that refer to other columns in the DataFrame?
I have some not-quite-satisfactory solutions below. They assume the following DataFrame in scope:
val df: DataFrame = Seq(
  (0, 0, "a = 1 AND b = 0"),
  (0, 1, "a = 0"),
  (1, 0, "a = 1 AND b = 1"),
  (1, 1, "a = 1 AND b = 1"),
  (1, 1, null),
  (1, 1, "a = 0 OR b = 1")
).toDF("a", "b", "expr")
Solution 1
Create a large global expression out of the individual expressions:
val exprs: Column = concat(
  df.columns
    .filter(_ != "expr")
    .zipWithIndex
    .flatMap {
      case (name, i) =>
        if (i == 0)
          Seq(lit(s"($name = "), col(name))
        else
          Seq(lit(s" AND $name = "), col(name))
    } :+ lit(" AND (") :+ col("expr") :+ lit("))"): _*
)
// exprs: org.apache.spark.sql.Column = concat((a = , a, AND b = , b, AND (, expr, )))

val bigExprString = df.select(exprs).na.drop.as[String].collect.mkString(" OR ")
// bigExprString: String = (a = 0 AND b = 0 AND (a = 1 AND b = 0)) OR (a = 0 AND b = 1 AND (a = 0)) OR (a = 1 AND b = 0 AND (a = 1 AND b = 1)) OR (a = 1 AND b = 1 AND (a = 1 AND b = 1)) OR (a = 1 AND b = 1 AND (a = 0 OR b = 1))

val result: DataFrame = df.withColumn("eval(expr)", expr(bigExprString))
The downside here is that the resulting string is very large. In my actual use case, it would be many tens of thousands of characters long, if not longer. I'm not too sure whether this would cause problems.
Solution 2
Separate the DataFrame into multiple based on the value of the expression column, operate on each individually, and recombine into one DataFrame.
val exprs: Seq[String] = df.select("expr").distinct.as[String].collect
// exprs: Seq[String] = WrappedArray(a = 1 AND b = 1, a = 1 AND b = 0, null, a = 0, a = 0 OR b = 1)

val result: DataFrame = exprs.map(e =>
  // === null never matches, so the null expression needs an isNull filter
  df.filter(if (e == null) col("expr").isNull else col("expr") === e)
    .withColumn("eval(expr)", if (e == null) lit(true) else when(expr(e), true).otherwise(false))
).reduce(_.union(_))

result.show()
I think the downside of this approach is that it creates many intermediate tables (one for each distinct expression). In my actual use case, this count is potentially hundreds or thousands.
Using this answer, the scala.tools.reflect.ToolBox can be used to evaluate the expression after transforming it into a valid Scala expression:
case class Result(a: Integer, b: Integer, expr: String, result: Boolean)

df.mapPartitions(it => {
  import scala.reflect.runtime.universe
  import scala.tools.reflect.ToolBox
  val tb = universe.runtimeMirror(this.getClass.getClassLoader).mkToolBox()
  val res = it.map(r => {
    val a = r.getInt(0)
    val b = r.getInt(1)
    val expr = r.getString(2)
    val exprResult =
      if (expr == null) {
        true
      }
      else {
        val scalaExpr = expr.replace("=", "==").replace("AND", "&").replace("OR", "|")
        val scalaExpr2 = s"var a=${a}; var b=${b}; ${scalaExpr}"
        tb.eval(tb.parse(scalaExpr2)).asInstanceOf[Boolean]
      }
    Result(a, b, expr, exprResult)
  })
  res
}).show()
Output:
+---+---+---------------+------+
| a| b| expr|result|
+---+---+---------------+------+
| 0| 0|a = 1 AND b = 0| false|
| 0| 1| a = 0| true|
| 1| 0|a = 1 AND b = 1| false|
| 1| 1|a = 1 AND b = 1| true|
| 1| 1| null| true|
| 1| 1| a = 0 OR b = 1| true|
+---+---+---------------+------+
I am using mapPartitions here instead of a simple udf as the initialization of the toolbox takes some time. Instead of initializing it once per row, it is now initialized only once per partition.
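For comparison, here is a UDF-free sketch that folds the distinct expressions into a single CASE WHEN column instead of unioning many DataFrames. It still assumes the set of distinct expr values is small enough to collect, and a very large number of branches will inflate the plan much like Solution 1.

import org.apache.spark.sql.functions.{col, expr, lit, when}

val distinctExprs = df.select("expr").distinct.as[String].collect.filter(_ != null)

// a null expr evaluates to true; every other row picks the branch matching its expr text
val evalCol = distinctExprs.foldLeft(when(col("expr").isNull, lit(true))) {
  (acc, e) => acc.when(col("expr") === e, expr(e))
}.otherwise(lit(false))

df.withColumn("eval(expr)", evalCol).show()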

Avoiding duplicate columns after nullSafeJoin scala spark

I have a use case wherein I need to join on nullable columns. I am doing it like this:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

def nullSafeJoin(leftDF: DataFrame, rightDF: DataFrame, joinOnColumns: Seq[String]) = {
  val dataset1 = leftDF.alias("dataset1")
  val dataset2 = rightDF.alias("dataset2")
  val firstColumn = joinOnColumns.head
  val colExpression: Column = col(s"dataset1.$firstColumn").eqNullSafe(col(s"dataset2.$firstColumn"))
  val fullExpr = joinOnColumns.tail.foldLeft(colExpression) {
    (colExpression, p) => colExpression && col(s"dataset1.$p").eqNullSafe(col(s"dataset2.$p"))
  }
  dataset1.join(dataset2, fullExpr)
}
The final joined dataset has duplicate columns. I have tried dropping the columns using the alias, like this:
dataset1.join(dataset2, fullExpr).drop(s"dataset2.$firstColumn")
but it doesn't work.
I understand that instead of dropping we could select the required columns.
I am trying to keep the code base generic, so I don't want to pass the list of columns to be selected to the function (in the case of drop, I would only have to drop the joinOnColumns list that is already passed to the function).
Any pointers on how to solve this would be really helpful.
Thanks!
Edit: (Sample data)
leftDF :
+------------------+-----------+---------+---------+-------+
| A| B| C| D| status|
+------------------+-----------+---------+---------+-------+
| 14567| 37| 1| game|Enabled|
| 14567| BASE| 1| toy| Paused|
| 13478| null| 5| game|Enabled|
| 2001| BASE| 1| null| Paused|
| null| 37| 1| home|Enabled|
+------------------+-----------+---------+---------+-------+
rightDF :
+------------------+-----------+---------+
| A| B| C|
+------------------+-----------+---------+
| 140| 37| 1|
| 569| BASE| 1|
| 13478| null| 5|
| 2001| BASE| 1|
| null| 37| 1|
+------------------+-----------+---------+
Final Join (Required):
+------------------+-----------+---------+---------+-------+
| A| B| C| D| status|
+------------------+-----------+---------+---------+-------+
| 13478| null| 5| game|Enabled|
| 2001| BASE| 1| null| Paused|
| null| 37| 1| home|Enabled|
+------------------+-----------+---------+---------+-------+
Your final DataFrame has duplicate columns from both leftDF & rightDF, and there is no identifier to check whether a column came from leftDF or rightDF.
So I have renamed the leftDF & rightDF columns: leftDF columns start with left_[column_name] and rightDF columns start with right_[column_name].
Hope the code below will help you.
val left = Seq(("14567", "37", "1", "game", "Enabled"), ("14567", "BASE", "1", "toy", "Paused"), ("13478", "null", "5", "game", "Enabled"), ("2001", "BASE", "1", "null", "Paused"), ("null", "37", "1", "home", "Enabled")).toDF("a", "b", "c", "d", "status")
val right = Seq(("140", "37", 1), ("569", "BASE", 1), ("13478", "null", 5), ("2001", "BASE", 1), ("null", "37", 1)).toDF("a", "b", "c")
import org.apache.spark.sql.DataFrame
def nullSafeJoin(leftDF: DataFrame, rightDF: DataFrame, joinOnColumns: Seq[String]):DataFrame = {
val leftRenamedDF = leftDF
.columns
.map(c => (c, s"left_${c}"))
.foldLeft(leftDF){ (df, c) =>
df.withColumnRenamed(c._1, c._2)
}
val rightRenamedDF = rightDF
.columns
.map(c => (c, s"right_${c}"))
.foldLeft(rightDF){(df, c) =>
df.withColumnRenamed(c._1, c._2)
}
val fullExpr = joinOnColumns
.tail
.foldLeft($"left_${joinOnColumns.head}".eqNullSafe($"right_${joinOnColumns.head}")){(cee, p) =>
cee && ($"left_${p}".eqNullSafe($"right_${p}"))
}
val finalColumns = joinOnColumns
.map(c => col(s"left_${c}").as(c)) ++ // Taking All columns from Join columns
leftDF.columns.diff(joinOnColumns).map(c => col(s"left_${c}").as(c)) ++ // Taking missing columns from leftDF
rightDF.columns.diff(joinOnColumns).map(c => col(s"right_${c}").as(c)) // Taking missing columns from rightDF
leftRenamedDF.join(rightRenamedDF, fullExpr).select(finalColumns: _*)
}
The final DataFrame result is:
scala> nullSafeJoin(left, right, Seq("a", "b", "c")).show(false)
+-----+----+---+----+-------+
|a |b |c |d |status |
+-----+----+---+----+-------+
|13478|null|5 |game|Enabled|
|2001 |BASE|1 |null|Paused |
|null |37 |1 |home|Enabled|
+-----+----+---+----+-------+
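A lighter-weight alternative that keeps the shape of the original nullSafeJoin: since Spark 2.0, drop also accepts a Column, so the duplicated join columns can be dropped by reference to the right-hand DataFrame instead of by name. This is just a sketch using <=>, the operator form of eqNullSafe:

import org.apache.spark.sql.DataFrame

def nullSafeJoin(leftDF: DataFrame, rightDF: DataFrame, joinOnColumns: Seq[String]): DataFrame = {
  // null-safe equality on every join column
  val fullExpr = joinOnColumns.map(c => leftDF(c) <=> rightDF(c)).reduce(_ && _)
  val joined = leftDF.join(rightDF, fullExpr)
  // drop the right-hand copies of the join columns by Column reference
  joinOnColumns.foldLeft(joined)((df, c) => df.drop(rightDF(c)))
}

// nullSafeJoin(left, right, Seq("a", "b", "c")).show(false)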

Scala spark, input dataframe, return columns where all values equal to 1

Given a dataframe, say one that contains 4 columns and 3 rows, I want to write a function that returns the columns where all the values in that column are equal to 1.
This is Scala code. I want to use some Spark transformations to transform or filter the dataframe input. This filter should be implemented in a function.
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer)

val example = Seq(
  Grade(1, 3, 1, 1),
  Grade(1, 1, null, 1),
  Grade(1, 10, 2, 1)
)
val dfInput = spark.createDataFrame(example)
After I call the function filterColumns()
val dfOutput = dfInput.filterColumns()
it should return a 3-row, 2-column dataframe in which all the values are 1.
A bit more readable approach using Dataset[Grade]
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.Column
import scala.collection.mutable

// dfInput.as[Grade] gives a typed Dataset[Grade] (requires import spark.implicits._)
val tmp = dfInput.as[Grade].map(grade => grade.dropWhenNotEqualsTo(1))
val rowsCount = dfInput.count()
val colsToRetain = mutable.Set[Column]()

for (column <- tmp.columns) {
  val withoutNullsCount = tmp.select(column).na.drop().count()
  if (rowsCount == withoutNullsCount) colsToRetain += col(column)
}

dfInput.select(colsToRetain.toArray: _*).show()
+---+---+
| c4| c1|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
And the case class:
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer) {
  def dropWhenNotEqualsTo(n: Integer): Grade = {
    Grade(nullOrValue(c1, n), nullOrValue(c2, n), nullOrValue(c3, n), nullOrValue(c4, n))
  }
  def nullOrValue(c: Integer, n: Integer) = if (c == n) c else null
}
grade.dropWhenNotEqualsTo(1) -> returns a new Grade with the values that do not satisfy the condition replaced by nulls
+---+----+----+---+
| c1| c2| c3| c4|
+---+----+----+---+
| 1|null| 1| 1|
| 1| 1|null| 1|
| 1|null|null| 1|
+---+----+----+---+
(column <- tmp.columns) -> iterate over the columns
tmp.select(column).na.drop() -> drop rows with nulls
e.g. for c2 this will return
+---+
| c2|
+---+
| 1|
+---+
if (rowsCount == withoutNullsCount) colsToRetain += col(column) -> keep the column only if it contains no nulls (i.e. nothing was replaced); otherwise drop it
One of the options is a reduce on the RDD:
import spark.implicits._
import org.apache.spark.sql.Row

val df = Seq(("1", "A", "3", "4"), ("1", "2", "?", "4"), ("1", "2", "3", "4")).toDF()
df.show()

val first = df.first()
val size = first.length
val diffStr = "#"
val targetStr = "1"

def rowToArray(row: Row): Array[String] = {
  val arr = new Array[String](row.length)
  for (i <- 0 to row.length - 1) {
    arr(i) = row.getString(i)
  }
  arr
}

def compareArrays(a1: Array[String], a2: Array[String]): Array[String] = {
  val arr = new Array[String](a1.length)
  for (i <- 0 to a1.length - 1) {
    arr(i) = if (a1(i).equals(a2(i)) && a1(i).equals(targetStr)) a1(i) else diffStr
  }
  arr
}

val diff = df.rdd
  .map(rowToArray)
  .reduce(compareArrays)

val cols = (df.columns zip diff).filter(!_._2.equals(diffStr)).map(s => df(s._1))
df.select(cols: _*).show()
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| A| 3| 4|
| 1| 2| ?| 4|
| 1| 2| 3| 4|
+---+---+---+---+
+---+
| _1|
+---+
| 1|
| 1|
| 1|
+---+
I would try to prepare the dataset for processing without nulls. In the case of a few columns this simple iterative approach might work fine (don't forget to import the Spark implicits first: import spark.implicits._):
import org.apache.spark.sql.Dataset

val example = spark.sparkContext.parallelize(Seq(
  Grade(1, 3, 1, 1),
  Grade(1, 1, 0, 1),
  Grade(1, 10, 2, 1)
)).toDS().cache()

def allOnes(colName: String, ds: Dataset[Grade]): Boolean = {
  val row = ds.select(colName).distinct().collect()
  if (row.length == 1 && row.head.getInt(0) == 1) true
  else false
}

val resultColumns = example.columns.filter(col => allOnes(col, example))
example.selectExpr(resultColumns: _*).show()
result is:
+---+---+
| c1| c4|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
If nulls are inevitable, use untyped dataset (aka dataframe):
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(Seq(
  StructField("c1", IntegerType, nullable = true),
  StructField("c2", IntegerType, nullable = true),
  StructField("c3", IntegerType, nullable = true),
  StructField("c4", IntegerType, nullable = true)
))

val example = spark.sparkContext.parallelize(Seq(
  Row(1, 3, 1, 1),
  Row(1, 1, null, 1),
  Row(1, 10, 2, 1)
))
val dfInput = spark.createDataFrame(example, schema).cache()

def allOnes(colName: String, df: DataFrame): Boolean = {
  val row = df.select(colName).distinct().collect()
  if (row.length == 1 && row.head.getInt(0) == 1) true
  else false
}

val resultColumns = dfInput.columns.filter(col => allOnes(col, dfInput))
dfInput.selectExpr(resultColumns: _*).show()
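For completeness, here is a single-pass sketch over the same dfInput: one aggregation computes, per column, whether every row equals 1; nulls fall into the otherwise branch, so columns containing nulls are excluded automatically.

import org.apache.spark.sql.functions.{col, min, when}

// per column: 1 if every value equals 1, otherwise 0 (nulls count as "not 1")
val flags = dfInput.select(dfInput.columns.map(c =>
  min(when(col(c) === 1, 1).otherwise(0)).alias(c)): _*).first()

val allOneCols = dfInput.columns.zip(flags.toSeq).collect { case (name, 1) => name }
dfInput.select(allOneCols.map(col): _*).show()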

how to concat multiple columns in spark while getting the column names to be concatenated from another table (different for each row)

I am trying to concat multiple columns in Spark using the concat function.
For example, below is the table to which I have to add a new concatenated column
table - t
+---+----+
| id|name|
+---+----+
| 1| a|
| 2| b|
+---+----+
and below is the table which has the information about which columns are to be concatenated for a given id (for id 1, the columns id and name need to be concatenated; for id 2, only id)
table - r
+---+-------+
| id| att |
+---+-------+
| 1|id,name|
| 2| id |
+---+-------+
If I join the two tables and do something like below, I am able to concat, but not based on table r (the new column has 1,a for the first row, but for the second row it should be 2 only):
t.withColumn("new",concat_ws(",",t.select("att").first.mkString.split(",").map(c => col(c)): _*)).show
+---+----+-------+---+
| id|name| att |new|
+---+----+-------+---+
| 1| a|id,name|1,a|
| 2| b| id |2,b|
+---+----+-------+---+
I have to apply a filter before the select in the above query, but I am not sure how to do that inside withColumn for each row.
Something like below, if that is possible:
t.withColumn("new", concat_ws(",", t.filter("id=" + this.id).select("att").first.mkString.split(",").map(c => col(c)): _*)).show
since it would require filtering each row based on its id.
scala> t.filter("id=1").select("att").first.mkString.split(",").map(c => col(c))
res90: Array[org.apache.spark.sql.Column] = Array(id, name)
scala> t.filter("id=2").select("att").first.mkString.split(",").map(c => col(c))
res89: Array[org.apache.spark.sql.Column] = Array(id)
Below is the final required result.
+---+----+-------+---+
| id|name| att |new|
+---+----+-------+---+
| 1| a|id,name|1,a|
| 2| b| id |2 |
+---+----+-------+---+
We can use a UDF.
Requirement for this logic to work:
the column names of your table t should be in the same order as they appear in the column att of table r.
scala> input_df_1.show
+---+----+
| id|name|
+---+----+
| 1| a|
| 2| b|
+---+----+
scala> input_df_2.show
+---+-------+
| id| att|
+---+-------+
| 1|id,name|
| 2| id|
+---+-------+
scala> val join_df = input_df_1.join(input_df_2,Seq("id"),"inner")
join_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> val req_cols = input_df_1.columns
req_cols: Array[String] = Array(id, name)
scala> def new_col_udf = udf((cols: Seq[String], row: String, attr: String) => {
|   val row_values = row.split(",")
|   val attrs = attr.split(",")
|   val req_val = attrs.map { at =>
|     val index = cols.indexOf(at)
|     row_values(index)
|   }
|   req_val.mkString(",")
| })
new_col_udf: org.apache.spark.sql.expressions.UserDefinedFunction
scala> val intermediate_df = join_df.withColumn("concat_column",concat_ws(",",'id,'name)).withColumn("new_col",new_col_udf(lit(req_cols),'concat_column,'att))
intermediate_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 3 more fields]
scala> val result_df = intermediate_df.select('id,'name,'att,'new_col)
result_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 2 more fields]
scala> result_df.show
+---+----+-------+-------+
| id|name| att|new_col|
+---+----+-------+-------+
| 1| a|id,name| 1,a|
| 2| b| id| 2|
+---+----+-------+-------+
Hope it answers your question.
This may be done in a UDF:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{array, col, udf}

val cols: Seq[Column] = dataFrame.columns.map(x => col(x)).toSeq
val names: Seq[String] = dataFrame.columns.toSeq

val generateNew = udf((values: Seq[String]) => {
  // split this row's att value into the column names to concatenate
  val att = values(names.indexOf("att")).split(",")
  // keep only the values whose column name appears in att, in column order
  names.zip(values)
    .collect { case (name, value) if att.contains(name) => value }
    .mkString(",")
})

// cast everything to string so the array() call has a single element type
val dfColumns = array(cols.map(_.cast("string")): _*)
val dNew = dataFrame.withColumn("new", generateNew(dfColumns))
This is just a sketch, but the idea is that you can pass a sequence of items to the user defined function, and select the ones that are needed dynamically.
Note that there are additional types of collection/maps that you can pass - for example How to pass array to UDF
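If you would rather avoid a UDF altogether, here is a sketch that assumes Spark 2.4+ (for transform and array_join) and that the att values are plain comma-separated column names of t: build a column-name-to-value map per row, split att, and look each attribute up in the map.

import org.apache.spark.sql.functions.{array_join, col, expr, lit, map}

val joined = t.join(r, Seq("id"))

val result = joined
  // name -> value map over t's columns, values cast to string so the map is homogeneous
  .withColumn("m", map(t.columns.flatMap(c => Seq(lit(c), col(c).cast("string"))): _*))
  // split att on commas, look each name up in the map, and re-join with commas
  .withColumn("new", array_join(expr("transform(split(att, ','), a -> m[a])"), ","))
  .drop("m")

result.show()

For the sample tables this should give 1,a for the first row and 2 for the second, matching the required result.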