Spark DataFrame - drop null values from column - scala

Given a dataframe:
import scala.collection.mutable.ArrayBuffer

val df = sc.parallelize(Seq(("foo", ArrayBuffer(null, "bar", null)), ("bar", ArrayBuffer("one", "two", null)))).toDF("key", "value")
df.show
+---+--------------------------+
|key| value|
+---+--------------------------+
|foo|ArrayBuffer(null,bar,null)|
|bar|ArrayBuffer(one, two,null)|
+---+--------------------------+
I'd like to drop the nulls from column value. After removal, the dataframe should look like this:
+---+--------------------------+
|key| value|
+---+--------------------------+
|foo|ArrayBuffer(bar) |
|bar|ArrayBuffer(one, two) |
+---+--------------------------+
Any suggestions welcome. Thanks.

You'll need a UDF here, for example one using a flatMap:

import org.apache.spark.sql.functions.udf

val filterOutNull = udf((xs: Seq[String]) =>
  Option(xs).map(_.flatMap(Option(_))))

df.withColumn("value", filterOutNull($"value"))
where the outer Option with map handles NULL columns:

Option(null: Seq[String]).map(identity)
// Option[Seq[String]] = None

Option(Seq("foo", null, "bar")).map(identity)
// Option[Seq[String]] = Some(List(foo, null, bar))

and ensures we don't fail with an NPE when the input is NULL / null, by mapping

NULL -> null -> None -> None -> NULL

where null is a Scala null and NULL is a SQL NULL.
The inner flatMap flattens a sequence of Options, effectively filtering out the nulls:

Seq("foo", null, "bar").flatMap(Option(_))
// Seq[String] = List(foo, bar)
A more imperative equivalent could be something like this:

val imperativeFilterOutNull = udf((xs: Seq[String]) =>
  if (xs == null) xs
  else for {
    x <- xs
    if x != null
  } yield x)
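Either UDF is applied the same way; for the sample frame above, a quick check would be:

df.withColumn("value", imperativeFilterOutNull($"value")).show(false)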

Option 1: using a UDF:

val filterNull = udf((arr: Seq[String]) => arr.filter((x: String) => x != null))
df.withColumn("value", filterNull($"value")).show()
Option 2: no UDF
df.withColumn("value", explode($"value")).filter($"value".isNotNull).groupBy("key").agg(collect_list($"value")).show()
Note that this is less efficient: it explodes and re-aggregates the data (which requires a shuffle), and it also drops any key whose array contains only nulls.

You can also use spark-daria, which has com.github.mrpowers.spark.daria.sql.functions.arrayExNull.
From the documentation:
Like array but doesn't include null element
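A minimal usage sketch, assuming arrayExNull takes Column arguments like the built-in array function (the dataframe someDf and columns a and b are hypothetical; check the spark-daria version you use):

import com.github.mrpowers.spark.daria.sql.functions.arrayExNull
import org.apache.spark.sql.functions.col

// builds an array column from col("a") and col("b"), skipping null elements;
// note it assembles a new array from several columns rather than filtering an existing array column
someDf.withColumn("combined", arrayExNull(col("a"), col("b")))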

Related

Combining When clause with tail: _* on Spark Scala Dataframe

Given this statement 1:
val aggDF3 = aggDF2.select(cols.map { col =>
  (when(size(aggDF2(col)) === 0, lit(null)).otherwise(aggDF2(col))).as(s"$col")
}: _*)
Given this statement 2:
aggDF.select(colsToSelect.head, colsToSelect.tail: _*).show()
Can I combine the when logic from statement 1 with the colsToSelect.tail: _* from statement 2 in a single statement, so that the first field is just selected and the logic only applies to the tail of the dataframe columns? I've tried various approaches, but I'm on thin ice here.
This should work:
val aggDF: DataFrame = ???
val colsToSelect: Seq[String] = ???

aggDF
  .select((col(colsToSelect.head) +: colsToSelect.tail.map(col =>
    when(size(aggDF(col)) === 0, lit(null)).otherwise(aggDF(col)).as(s"$col"))): _*)
  .show()
Remember that select is overloaded and works differently with String and Column: with cols: Seq[String] you need select(cols.head, cols.tail: _*), while with cols: Seq[Column] you need select(cols: _*). The solution above uses the second variant.
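As a quick illustration of the two overloads (using the colsToSelect: Seq[String] from above):

import org.apache.spark.sql.functions.col

// String overload: select(col: String, cols: String*)
aggDF.select(colsToSelect.head, colsToSelect.tail: _*)

// Column overload: select(cols: Column*)
aggDF.select(colsToSelect.map(col): _*)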

Reduce and sum tuples by key

In my Spark Scala application I have an RDD with the following format:
(05/05/2020, (name, 1))
(05/05/2020, (name, 1))
(05/05/2020, (name2, 1))
...
(06/05/2020, (name, 1))
What I want to do is group these elements by date and sum the tuples that have the same "name" as key.
Expected Output:
(05/05/2020, List[(name, 2), (name2, 1)]),
(06/05/2020, List[(name, 1)])
...
In order to do that, I am currently using a groupByKey operation and some extra transformations in order to group the tuples by key and calculate the sum for those that share the same one.
For performance reasons, I would like to replace this groupByKey operation with a reduceByKey or an aggregateByKey in order to reduce the amount of data transferred over the network.
However, I can't get my head around how to do this. Both of these transformations take as a parameter a function between the values (tuples in my case), so I can't see how I can group the tuples by key in order to calculate their sum.
Is it doable?
Here's how you can merge your Tuples using reduceByKey:
/**
File /path/to/file1:
15/04/2010 name
15/04/2010 name
15/04/2010 name2
15/04/2010 name2
15/04/2010 name3
16/04/2010 name
16/04/2010 name
File /path/to/file2:
15/04/2010 name
15/04/2010 name3
**/
import org.apache.spark.rdd.RDD
val filePaths = Array("/path/to/file1", "/path/to/file2").mkString(",")
val rdd: RDD[(String, (String, Int))] = sc.textFile(filePaths).
  map { line =>
    val pair = line.split("\\t", -1)
    (pair(0), (pair(1), 1))
  }

rdd.
  map { case (k, (n, v)) => (k, Map(n -> v)) }.
  reduceByKey { (acc, m) =>
    acc ++ m.map { case (n, v) => n -> (acc.getOrElse(n, 0) + v) }
  }.
  map(x => (x._1, x._2.toList)).
  collect
// res1: Array[(String, List[(String, Int)])] = Array(
// (15/04/2010, List((name,3), (name2,2), (name3,2))), (16/04/2010, List((name,2)))
// )
Note that the initial mapping is needed because we want to merge the Tuples as elements in a Map, and reduceByKey for RDD[(K, V)] requires the same data type V before and after the transformation:
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
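For contrast, aggregateByKey (used in the next answer) does allow the result type U to differ from the value type V, at the cost of supplying a zero value and two functions:

def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]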
Yes, aggregateByKey() can be used as follows:
import scala.collection.mutable.HashMap

def merge(map: HashMap[String, Int], element: (String, Int)) = {
  if (map.contains(element._1)) map(element._1) += element._2 else map(element._1) = element._2
  map
}

val input = sc.parallelize(List(
  ("05/05/2020", ("name", 1)),
  ("05/05/2020", ("name", 1)),
  ("05/05/2020", ("name2", 1)),
  ("06/05/2020", ("name", 1))))

val output = input.aggregateByKey(HashMap[String, Int]())({
  // combining a map with a tuple
  case (map, element) => merge(map, element)
}, {
  // combining two maps
  case (map1, map2) =>
    val combined = (map1.keySet ++ map2.keySet).map { i => (i, map1.getOrElse(i, 0) + map2.getOrElse(i, 0)) }.toMap
    collection.mutable.HashMap(combined.toSeq: _*)
}).mapValues(_.toList)
credits: Best way to merge two maps and sum the values of same key?
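Collecting output for the sample input above gives something like the following (HashMap iteration order, and hence the order inside each list, is not guaranteed):

output.collect().foreach(println)
// (06/05/2020,List((name,1)))
// (05/05/2020,List((name,2), (name2,1)))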
You could convert the RDD to a DataFrame and just use a groupBy with sum; here is one way to do it:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val schema = StructType(
  StructField("date", StringType, false) ::
  StructField("name", StringType, false) ::
  StructField("value", IntegerType, false) :: Nil)

val rd = sc.parallelize(Seq(("05/05/2020", ("name", 1)),
  ("05/05/2020", ("name", 1)),
  ("05/05/2020", ("name2", 1)),
  ("06/05/2020", ("name", 1))))

val df = spark.createDataFrame(rd.map { case (a, (b, c)) => Row(a, b, c) }, schema)
df.show
+----------+-----+-----+
| date| name|value|
+----------+-----+-----+
|05/05/2020| name| 1|
|05/05/2020| name| 1|
|05/05/2020|name2| 1|
|06/05/2020| name| 1|
+----------+-----+-----+
val sumdf = df.groupBy("date","name").sum("value")
sumdf.show
+----------+-----+----------+
| date| name|sum(value)|
+----------+-----+----------+
|06/05/2020| name| 1|
|05/05/2020| name| 2|
|05/05/2020|name2| 1|
+----------+-----+----------+
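If you need the result back in the (date, List[(name, count)]) shape from the question, you could aggregate sumdf once more, for example:

import org.apache.spark.sql.functions.{col, collect_list, struct}

val listed = sumdf
  .withColumnRenamed("sum(value)", "count")
  .groupBy("date")
  .agg(collect_list(struct(col("name"), col("count"))).as("counts"))

listed.show(false)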

Dataframe to RDD[Row] replacing space with nulls

I am converting a Spark dataframe to RDD[Row] so I can map it to the final schema to write into a Hive ORC table. I want to convert any space in the input to an actual null so the Hive table can store an actual null instead of an empty string.
Input DataFrame (a single column with pipe delimited values):
col1
1|2|3||5|6|7|||...|
My code:
inputDF.rdd.
  map { x: Row => x.get(0).asInstanceOf[String].split("\\|", -1) }.
  map { x => Row(nullConverter(x(0)), nullConverter(x(1)), nullConverter(x(2)), ..., nullConverter(x(200))) }

def nullConverter(input: String): String = {
  if (input.trim.length > 0) input.trim
  else null
}
Is there any clean way of doing this rather than calling the nullConverter function 200 times?
Update based on single column:
Going with your approach, I would do something like:

inputDf.rdd.map((row: Row) => {
  val values = row.get(0).asInstanceOf[String].split("\\|", -1).map(nullConverter)
  Row(values: _*)
})
Make your nullConverter or any other logic a udf:
import org.apache.spark.sql.functions._
val nullConverter = udf((input: String) => {
  if (input.trim.length > 0) input.trim
  else null
})
Now, use the udf on your df and apply to all columns:
val convertedDf = inputDf.select(inputDf.columns.map(c => nullConverter(col(c)).alias(c)):_*)
Now, you can do your RDD logic.
This would be easier to do using the DataFrame API before converting to an RDD. First, split the data:
val df = Seq(("1|2|3||5|6|7|8||")).toDF("col0") // Example dataframe
val df2 = df.withColumn("col0", split($"col0", "\\|")) // Split on "|"
Then find out the length of the array:
val numCols = df2.first.getAs[Seq[String]](0).length
Now, for each element in the array, use the nullConverter UDF and then assign it to its own column.
val nullConverter = udf((input: String) => {
  if (input.trim.length > 0) input.trim
  else null
})
val df3 = df2.select((0 until numCols).map(i => nullConverter($"col0".getItem(i)).as("col" + i)): _*)
The result using the example dataframe:
+----+----+----+----+----+----+----+----+----+----+
|col0|col1|col2|col3|col4|col5|col6|col7|col8|col9|
+----+----+----+----+----+----+----+----+----+----+
| 1| 2| 3|null| 5| 6| 7| 8|null|null|
+----+----+----+----+----+----+----+----+----+----+
Now convert it to an RDD or continue using the data as a DataFrame depending on your needs.
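For the RDD route, that last step is simply:

import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

val rowRdd: RDD[Row] = df3.rdd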
There is no point in converting the dataframe to an RDD; you can stay in the DataFrame API:

import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
  (1, "foo bar"), (2, "foobar "), (3, " ")
)).toDF("k", "v")

df.select(df.columns.map(c => regexp_replace(col(c), " ", "NULL").alias(c)): _*)

Replace all occurrences of a String in all columns in a dataframe in scala

I have a dataframe with 20 columns, and in these columns there is a value XX which I want to replace with an empty string. How do I achieve that in Scala? The withColumn function is for a single column, but I want to pass all 20 columns and replace values that contain XX in the entire frame with an empty string. Can someone suggest a way?
Thanks
You can gather all the StringType columns in a list and use foldLeft to apply your removeXX UDF to each of those columns as follows:
val df = Seq(
  (1, "aaXX", "bb"),
  (2, "ccXX", "XXdd"),
  (3, "ee", "fXXf")
).toDF("id", "desc1", "desc2")

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf

val stringColumns = df.schema.fields.collect {
  case StructField(name, StringType, _, _) => name
}

val removeXX = udf((s: String) =>
  if (s == null) null else s.replaceAll("XX", "")
)

val dfResult = stringColumns.foldLeft(df)((acc, c) =>
  acc.withColumn(c, removeXX(df(c)))
)

dfResult.show
dfResult.show
+---+-----+-----+
| id|desc1|desc2|
+---+-----+-----+
| 1| aa| bb|
| 2| cc| dd|
| 3| ee| ff|
+---+-----+-----+
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

def clearValueContains(dataFrame: DataFrame, token: String, columnsToBeUpdated: List[String]) = {
  columnsToBeUpdated.foldLeft(dataFrame) { (dataset, columnName) =>
    dataset.withColumn(columnName, when(col(columnName).contains(token), "").otherwise(col(columnName)))
  }
}
You can use this function, passing "XX" as the token. columnsToBeUpdated is the list of columns in which you need to search for that token.
dataset.withColumn(columnName, when(col(columnName) === token, "").otherwise(col(columnName)))
Use the line above instead if you want to replace on an exact match only.
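Using the df (with columns desc1 and desc2) from the first answer above, a call would look like:

val cleaned = clearValueContains(df, "XX", List("desc1", "desc2"))
cleaned.show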
We can do it like this as well in Scala:
//Getting all columns
val columns: Seq[String] = df.columns
//Using DataFrameNaFunctions to achieve this.
val changedDF = df.na.replace(columns, Map("XX"-> ""))
Hope this helps.

How to impute NULL values to zero in Spark/Scala

I have a Dataframe in which some columns are of type String and contain NULL as a String value (not as an actual NULL). I want to impute them with zero. Apparently df.na.fill(0) doesn't work. How can I impute them with zero?
You can use replace() from DataFrameNaFunctions, which can be accessed via the .na prefix:
val df1 = df.na.replace("*", Map("NULL" -> "0"))
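For example, on a small hypothetical frame (column names a and b are made up for illustration):

val sample = Seq(("1", "NULL"), ("NULL", "3")).toDF("a", "b")
sample.na.replace("*", Map("NULL" -> "0")).show()
// +---+---+
// |  a|  b|
// +---+---+
// |  1|  0|
// |  0|  3|
// +---+---+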
You could also create your own udf that replicates this behaviour:
import org.apache.spark.sql.functions.{col, udf}

val nullReplacer = udf((x: String) => {
  if (x == "NULL") "0"
  else x
})
val df1 = df.select(df.columns.map(c => nullReplacer(col(c)).alias(c)): _*)
However, this would be superfluous given it does the same as the above, at the cost of more lines of code than necessary.