Compare timestamp columns - Spark Scala

I am trying to compare two timestamp columns and return the minimum of the two, and I am wondering whether there is a better way than what I have. Note that both columns might have values, or one of them might have a value while the other is null. I know about when().otherwise().
import org.apache.spark.sql.functions._
import java.sql.Timestamp

val compareTime = udf((t1: Timestamp, t2: Timestamp) => {
  if (t1 != null && t2 != null && t1.before(t2)) {
    Some(t1)
  } else if (t1 != null && t2 != null && t2.before(t1)) {
    Some(t2)
  } else if (t1 != null) {
    Some(t1)
  } else if (t2 != null) {
    Some(t2)
  } else {
    None
  }
})

var df = Seq((1L, "2021-01-04 16:10:00", "2021-01-04 15:20:00")).toDF("id", "t1", "t2")
df = (df.withColumn("t1", to_timestamp($"t1", "yyyy-MM-dd HH:mm:ss"))
        .withColumn("t2", to_timestamp($"t2", "yyyy-MM-dd HH:mm:ss")))
df = df.withColumn("t3", compareTime($"t1", $"t2"))
df.show()

UDF is probably unnecessary here - you can use the Spark SQL function least:
var df = Seq((1L, "2021-01-04 16:10:00", "2021-01-04 15:20:00")).toDF("id", "t1", "t2")
df = (df.withColumn("t1", to_timestamp($"t1", "yyyy-MM-dd HH:mm:ss"))
        .withColumn("t2", to_timestamp($"t2", "yyyy-MM-dd HH:mm:ss")))
df = df.withColumn("t3", least($"t1", $"t2"))
df.show()
+---+-------------------+-------------------+-------------------+
| id| t1| t2| t3|
+---+-------------------+-------------------+-------------------+
| 1|2021-01-04 16:10:00|2021-01-04 15:20:00|2021-01-04 15:20:00|
+---+-------------------+-------------------+-------------------+
The opposite of least is greatest, if you want to get the larger of the two columns.
Note that both least and greatest will ignore null values, but they will return null if all input columns are null.
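For example, a quick sketch of that null behaviour (dfNulls is my own illustration, reusing the column names from above):
import org.apache.spark.sql.functions._

val dfNulls = Seq(
  (1L, Option("2021-01-04 16:10:00"), None: Option[String]),
  (2L, None: Option[String], None: Option[String])
).toDF("id", "t1", "t2")
  .withColumn("t1", to_timestamp($"t1", "yyyy-MM-dd HH:mm:ss"))
  .withColumn("t2", to_timestamp($"t2", "yyyy-MM-dd HH:mm:ss"))

// id 1: t3 = t1 (the null t2 is ignored); id 2: t3 = null (all inputs are null)
dfNulls.withColumn("t3", least($"t1", $"t2")).show()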

Try this:
(Option(t1) ++ Option(t2)).minOption
It should do the same job as your if..else if..else stack.
Oops, my bad: Spark doesn't support Scala 2.13.x (minOption was added in 2.13). Try this instead:
util.Try((Option(t1) ++ Option(t2)).minBy(_.getTime())).toOption
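For reference, a sketch of how that expression might slot into the original UDF (same imports as in the question; the compareTime name is reused from there):
val compareTime = udf((t1: Timestamp, t2: Timestamp) =>
  util.Try((Option(t1) ++ Option(t2)).minBy(_.getTime())).toOption
)
df.withColumn("t3", compareTime($"t1", $"t2"))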

Related

Spark DataFrame Get Null Count For All Columns

I have a DataFrame for which I would like to get the total null-value count, and I have the following that does this generically on all the columns:
First my DataFrame that just contains one column (for simplicity):
val recVacDate = dfRaw.select("STATE")
When I print using a simple filter, I get to see the following:
val filtered = recVacDate.filter("STATE is null")
println(filtered.count()) // Prints 94051
But when I use the code below, I get just 1 as a result, and I do not understand why:
val nullCount = recVacDate.select(recVacDate.columns.map(c => count(col(c).isNull || col(c) === "" || col(c).isNaN).alias(c)): _*)
println(nullCount.count()) // Prints 1
Any ideas as to what is wrong with the nullCount? The DataType of the column is a String.
This kind of fixed it:
df.select(df.columns.map(c => count(when(col(c).isNull || col(c) === "" || col(c).isNaN, c)).alias(c)): _*)
Notice the use of the when clause inside the count.
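A note on why the original printed 1 (my explanation, not part of the original answer): count(expr) counts the rows where expr is non-null, and the bare boolean expression is non-null for essentially every row, so each column simply came back as the total row count. Separately, nullCount.count() counts the rows of the aggregated DataFrame, which is always exactly one row holding the per-column counts; to see the numbers you need to show or collect that row. A sketch of how the fixed version might be inspected:
import org.apache.spark.sql.functions.{col, count, when}

val nullCounts = df.select(df.columns.map(c =>
  count(when(col(c).isNull || col(c) === "" || col(c).isNaN, c)).alias(c)): _*)

nullCounts.show() // one row, one column per original column

// Pull the per-column counts out of that single row, e.g. into a Map
val countsByColumn = nullCounts.first.getValuesMap[Long](nullCounts.columns)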

Using Spark 2.3.1 with Scala, Reduce Arbitrary List of Date Ranges into distinct non-overlapping ranges of dates

Given a list of date ranges, some of which overlap:
val df = Seq(
("Mike","2018-09-01","2018-09-10"), // range 1
("Mike","2018-09-05","2018-09-05"), // range 1
("Mike","2018-09-12","2018-09-12"), // range 1
("Mike","2018-09-11","2018-09-11"), // range 1
("Mike","2018-09-25","2018-09-29"), // range 4
("Mike","2018-09-21","2018-09-23"), // range 4
("Mike","2018-09-24","2018-09-24"), // range 4
("Mike","2018-09-14","2018-09-16"), // range 2
("Mike","2018-09-15","2018-09-17"), // range 2
("Mike","2018-09-05","2018-09-05"), // range 1
("Mike","2018-09-19","2018-09-19"), // range 3
("Mike","2018-09-19","2018-09-19"), // range 3
("Mike","2018-08-19","2018-08-20"), // range 5
("Mike","2018-10-01","2018-10-20"), // range 6
("Mike","2018-10-10","2018-10-30") // range 6
).toDF("name", "start", "end")
I'd like to reduce the data down to the minimum set of date ranges that completely encapsulate the above dates with no extra dates added:
+----+----------+----------+
|name|start |end |
+----+----------+----------+
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-09-21|2018-09-29|
|Mike|2018-08-19|2018-08-20|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+
EDIT: Added three new entries to the test data to account for new edge cases.
I cannot rely on the dates being in any particular order.
My best attempt at this so far:
Explode each date range into its set of individual days
Union the sets together into one big set of all the days
Sort the set into a list so the days are in order
Aggregate the individual days back into a list of lists of days.
Take the first and last day of each list as the new ranges.
The code, such as it is:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import scala.collection.immutable.NumericRange
import java.time.LocalDate

case class MyRange(start: String, end: String)

val combineRanges = udf((ranges: Seq[Row]) => {
  ranges.map(r => LocalDate.parse(r(0).toString).toEpochDay to LocalDate.parse(r(1).toString).toEpochDay)
    .map(_.toIndexedSeq).reduce(_ ++ _).distinct.toList.sorted
    .aggregate(List.empty[Vector[Long]])((ranges: List[Vector[Long]], d: Long) => {
      ranges.lastOption.find(_.last + 1 == d) match {
        case Some(r: Vector[Long]) => ranges.dropRight(1) :+ (r :+ d)
        case None => ranges :+ Vector(d)
      }
    }, _ ++ _).map(v => MyRange(LocalDate.ofEpochDay(v.head).toString, LocalDate.ofEpochDay(v.last).toString))
})
df.groupBy("name")
  .agg(combineRanges(collect_list(struct($"start", $"end"))) as "ranges")
  .withColumn("ranges", explode($"ranges"))
  .select($"name", $"ranges.start", $"ranges.end")
  .show(false)
It seems to work, but it is very ugly and probably wasteful of time and memory.
I was kind of hoping to use the scala Range class to only notionally explode the date ranges into their individual days, but I've got a feeling the sort operation forces scala's hand and makes it actually create a list of all the dates in memory.
Does anyone have a better way of doing this?
I think the easiest (and most readable) way is to explode the ranges into individual days and then aggregate back into intervals. As the number of days cannot grow too large, I think exploding is not a bottleneck here. I show a "pure Scala" solution which is then used inside a UDF that gets all the intervals from a collect_list aggregation:
import java.time.LocalDate
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

def enumerateDays(start: LocalDate, end: LocalDate) = {
  Iterator.iterate(start)(d => d.plusDays(1L))
    .takeWhile(d => !d.isAfter(end))
    .toList
}

implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)

val combineRanges = udf((data: Seq[Row]) => {
  val dateEnumerated =
    data
      .toSet[Row] // use Set to save memory if many spans overlap
      // "explode" date spans into individual days
      .flatMap { case Row(start: String, end: String) => enumerateDays(LocalDate.parse(start), LocalDate.parse(end)) }
      .toVector
      .sorted

  dateEnumerated.tail
    // combine subsequent dates into Vectors
    .foldLeft(Vector(Vector(dateEnumerated.head)))((agg, curr) => {
      if (ChronoUnit.DAYS.between(agg.last.last, curr) == 1L) {
        agg.init :+ (agg.last :+ curr)
      } else {
        agg :+ Vector(curr)
      }
    })
    // now get min/max of dates per span
    .map(r => (r.min.toString, r.max.toString))
})
df.groupBy("name")
  .agg(combineRanges(collect_list(struct($"start", $"end"))) as "ranges")
  .withColumn("ranges", explode($"ranges"))
  .select($"name", $"ranges._1".as("start"), $"ranges._2".as("end"))
  .show(false)
gives
+----+----------+----------+
|name|start |end |
+----+----------+----------+
|Mike|2018-08-19|2018-08-20|
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-09-21|2018-09-29|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+
I think it's also doable with more DataFrame API logic. I would still explode using a UDF, but then use window functions and groupBy to build the new blocks based on the number of days between two dates. But I think the above solution is also fine.
Here is an alternative with DataFrames and Spark SQL, both non-procedural and procedural by definition. You need to read it carefully and persist.
// Aspects such as caching and re-partitioning for performance are not considered. On the other hand, it all happens under the bonnet with DFs - so they say.
// Functional only.
import org.apache.spark.sql.functions._
import spark.implicits._
import java.time._
import org.apache.spark.sql.functions.{lead, lag}
import org.apache.spark.sql.expressions.Window
def toEpochDay(s: String) = LocalDate.parse(s).toEpochDay
val toEpochDayUdf = udf(toEpochDay(_: String))
val df = Seq(
("Betty","2018-09-05","2018-09-05"), ("Betty","2018-09-05","2018-09-05"),
("Betty","2018-09-05","2018-09-08"), ("Betty","2018-09-07","2018-09-10"),
("Betty","2018-09-07","2018-09-08"), ("Betty","2018-09-06","2018-09-07"),
("Betty","2018-09-10","2018-09-15"), ("Betty","2017-09-10","2017-09-15"),
("XXX","2017-09-04","2017-09-10"), ("XXX","2017-09-10","2017-09-15"),
("YYY","2017-09-04","2017-09-10"), ("YYY","2017-09-11","2017-09-15"),
("Bob","2018-09-01","2018-09-02"), ("Bob","2018-09-04","2018-09-05"),
("Bob","2018-09-06","2018-09-07"), ("Bob","2019-09-04","2019-09-05"),
("Bob","2019-09-06","2019-09-07"), ("Bob","2018-09-08","2018-09-22")
).toDF("name", "start", "end")
// Remove any duplicates - pointless to process these more than once!
val df2 = df.withColumn("s", toEpochDayUdf($"start")).withColumn("e", toEpochDayUdf($"end")).distinct
df2.show(false) // The original input
df2.createOrReplaceTempView("ranges")
// Find those records encompassed by a broader time frame and hence not required for processing.
val q = spark.sql(""" SELECT *
                      FROM ranges r1
                      WHERE EXISTS (SELECT r2.name
                                    FROM ranges r2
                                    WHERE r2.name = r1.name
                                      AND ((r1.s >= r2.s AND r1.e < r2.e) OR
                                           (r1.e <= r2.e AND r1.s > r2.s))
                                   )
                  """)
//q.show(false)
val df3 = df2.except(q) // Overlapping or on their own / single range records left.
//df3.show(false)
df3.createOrReplaceTempView("ranges2")
// Find those ranges that have a gap between them and the next adjacent records, before or after, i.e. records that exist on their own and are de facto the first part of the answer.
val q2 = spark.sql(""" SELECT *
                       FROM ranges2 r1
                       WHERE NOT EXISTS (SELECT r2.name
                                         FROM ranges2 r2
                                         WHERE r2.name = r1.name
                                           AND ((r2.e >= r1.s - 1 AND r2.s <= r1.s - 1) OR
                                                (r2.s <= r1.e + 1 AND r2.e >= r1.e + 1))
                                        )
                   """)
// Store the first set of records that exist on their own with some form of gap - the first part of the overall result set.
val result1 = q2.select("name", "start", "end")
result1.show(false)
// Get the records / ranges that have overlaps to process - the second remaining set of such records to process.
val df4 = df3.except(q2)
//df4.show(false)
// Avoid serialization errors with lag!
@transient val w = org.apache.spark.sql.expressions.Window.partitionBy("name").orderBy("e")
@transient val lag_y = lag("e", 1, -99999999).over(w)
//df.select(lag_y).map(f _).first
val df5 = df4.withColumn("new_col", lag_y)
//df5.show(false)
// Massage data to get results via easier queries, e.g. avoid issues with correlated sub-queries.
val myExpression = "s - new_col"
val df6 = df5.withColumn("result", when($"new_col" === 0, 0).otherwise(expr(myExpression)))
//df6.show(false)
df6.createOrReplaceTempView("ranges3")
val q3 = spark.sql(""" SELECT *, dense_rank() over (PARTITION BY name ORDER BY start ASC) as RANK
FROM ranges3
WHERE new_col = -99999999 OR result > 1
""")
q3.createOrReplaceTempView("rangesSTARTS")
val q4 = spark.sql(""" SELECT *
FROM ranges3
WHERE result <= 1 AND new_col <> -99999999
""")
q4.createOrReplaceTempView("rangesFOLLOWERS")
val q5 = spark.sql(""" SELECT r1.*, r2.start as next_start
FROM rangesSTARTS r1 LEFT JOIN rangesSTARTS r2
ON r2.name = r1.name
AND r2.RANK = r1.RANK + 1
""")
//q5.show(false)
val q6 = q5.withColumn("end_period", when($"next_start".isNull, "2525-01-01").otherwise($"next_start"))
//q6.show(false)
q6.createOrReplaceTempView("rangesSTARTS2")
// Second and final set of results - the head and tail of such set of range records.
val result2 = spark.sql(""" SELECT r1.name, r1.start, MAX(r2.end) as end
FROM rangesFOLLOWERS r2, rangesSTARTS2 r1
WHERE r2.name = r1.name
AND r2.end >= r1.start
AND r2.end < r1.end_period
GROUP BY r1.name, r1.start """)
result2.show(false)
val finalresult = result1.union(result2)
finalresult.show
returns:
+-----+----------+----------+
| name| start| end|
+-----+----------+----------+
| Bob|2018-09-01|2018-09-02|
|Betty|2017-09-10|2017-09-15|
| YYY|2017-09-04|2017-09-15|
| Bob|2018-09-04|2018-09-22|
| Bob|2019-09-04|2019-09-07|
| XXX|2017-09-04|2017-09-15|
|Betty|2018-09-05|2018-09-15|
+-----+----------+----------+
An interesting contrast - what is better for performance and style? My last such effort for a while. Interested in your comments. You know the programming aspects better than I do, so this question provides some good comparison and some good education. The other solutions do explode, which is not how I saw it.
I think my second approach is better, but still far from perfect. It at least avoids iterating through every day in every date range, although it now processes every range multiple times. I think I'm mostly going to be processing a few large ranges instead of a bunch of little ranges, so maybe that's ok.
Given:
val ranges = Seq(
("Mike","2018-09-01","2018-09-10"),
("Mike","2018-09-05","2018-09-05"),
("Mike","2018-09-12","2018-09-12"),
("Mike","2018-09-11","2018-09-11"),
("Mike","2018-09-25","2018-09-30"),
("Mike","2018-09-21","2018-09-23"),
("Mike","2018-09-24","2018-09-24"),
("Mike","2018-09-14","2018-09-16"),
("Mike","2018-09-15","2018-09-17"),
("Mike","2018-09-05","2018-09-05"),
("Mike","2018-09-19","2018-09-19"),
("Mike","2018-09-19","2018-09-19"),
("Mike","2018-08-19","2018-08-20"),
("Mike","2018-10-01","2018-10-20"),
("Mike","2018-10-10","2018-10-30")
)
val df = ranges.toDF("name", "start", "end")
I want:
+----+----------+----------+
|name|start |end |
+----+----------+----------+
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-21|2018-09-30|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-08-19|2018-08-20|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+
(They are not in order this time. I'm ok with that since it was never a requirement. It just happened to be an artifact of my previous approach)
// very specific helper functions to convert a date string to and from a range
implicit class MyString(s: String) {
  def toFilteredInt: Int = s.filter(_.isDigit).toInt
  def to(s2: String): Range = s.toFilteredInt to s2.toFilteredInt
  // this only works for YYYYMMDD strings. very dangerous.
  def toDateStr = s"${s.slice(0,4)}-${s.slice(4,6)}-${s.slice(6,8)}"
}

// helper functions to combine two ranges
implicit class MyRange(r: Range) {
  def expand(i: Int): Range = r.head - i * r.step to r.last + i * r.step
  def containsPart(r2: Range): Boolean = r.contains(r2.head) || r.contains(r2.last)
  def intersects(r2: Range): Boolean = r.containsPart(r2) || r2.containsPart(r)
  def combine(r2: Range): Option[Range] = {
    if (r.step == r2.step && r.intersects(r2 expand 1)) {
      if (r.step > 0) Some(Math.min(r.head, r2.head) to Math.max(r.last, r2.last))
      else Some(Math.max(r.head, r2.head) to Math.min(r.last, r2.last))
    }
    else None
  }
  def toDateStrTuple: (String, String) = (r.start.toString.toDateStr, r.end.toString.toDateStr)
}

// combines a range to one of the ranges in a sequence if possible;
// adds it to the sequence if it can't be combined.
def addToRanges(rs: Seq[Range], r: Range): Seq[Range] = {
  if (rs.isEmpty) Seq(r)
  else r.combine(rs.last) match {
    case Some(r: Range) => rs.init :+ r
    case None => addToRanges(rs.init, r) :+ rs.last
  }
}
// tries to combine every range in the sequence with every other range
// does not handle the case where combining two ranges together allows
// them to be combined with a previous range in the sequence.
// however, if we call this and nothing has been combined, we know
// we are done
def collapseOnce(rs: Seq[Range]): Seq[Range] = {
  if (rs.size <= 1) rs
  else addToRanges(collapseOnce(rs.init), rs.last)
}

// keep collapsing the sequence of ranges until they can't collapse
// any further
def collapseAll(rs: Seq[Range]): Seq[Range] = {
  val collapsed = collapseOnce(rs)
  if (rs.size == collapsed.size) rs
  else collapseAll(collapsed)
}

// now our udf is much simpler
val combineRanges = udf((rows: Seq[Row]) => {
  val ranges = rows.map(r => r(0).toString to r(1).toString)
  collapseAll(ranges).map(_.toDateStrTuple)
})

df.groupBy("name")
  .agg(combineRanges(collect_list(struct($"start", $"end"))) as "ranges")
  .withColumn("ranges", explode($"ranges"))
  .select($"name", $"ranges._1" as "start", $"ranges._2" as "end")
  .show(false)
Room for improvement:
I'm pretty sure I'd get better performance most of the time if I bailed out of collapseOnce as soon as I found a range to combine. The typical use case will be adding a single day range to the last range in the sequence.
collapseOnce and addToRanges are not yet tail recursive (a possible tail-recursive variant is sketched after this list).
Some of the date to string and string to date methods in my implicit classes should probably be stand alone methods. They are super specific to my problem and don't deserve to be included in general String and Range classes.
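For the tail-recursion point, a possible sketch (my own variant, relying on the implicit MyRange.combine above and not tested against the full dataset) that keeps the same behaviour as addToRanges but carries the peeled-off tail ranges in an accumulator so the recursive call is in tail position:
import scala.annotation.tailrec

// Same contract as addToRanges above; `suffix` accumulates the ranges dropped
// from the end of `rs` while searching for one that combines with `r`.
@tailrec
def addToRangesTR(rs: Seq[Range], r: Range, suffix: List[Range] = Nil): Seq[Range] = {
  if (rs.isEmpty) r +: suffix
  else r.combine(rs.last) match {
    case Some(combined) => (rs.init :+ combined) ++ suffix
    case None           => addToRangesTR(rs.init, r, rs.last :: suffix)
  }
}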
What about:
exploding the date ranges into days with a UDF,
then using analytic functions to do the magic?
Code:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

def enumerateDays(start: LocalDate, end: LocalDate) = {
  Iterator.iterate(start)(d => d.plusDays(1L))
    .takeWhile(d => !d.isAfter(end))
    .toSeq
}

val udf_enumerateDays = udf((start: String, end: String) => enumerateDays(LocalDate.parse(start), LocalDate.parse(end)).map(_.toString))

df.select($"name", explode(udf_enumerateDays($"start", $"end")).as("day"))
  .distinct
  .withColumn("day_prev", lag($"day", 1).over(Window.partitionBy($"name").orderBy($"day")))
  .withColumn("is_consecutive", coalesce(datediff($"day", $"day_prev"), lit(0)) <= 1)
  .withColumn("group_nb", sum(when($"is_consecutive", lit(0)).otherwise(lit(1))).over(Window.partitionBy($"name").orderBy($"day")))
  .groupBy($"name", $"group_nb").agg(min($"day").as("start"), max($"day").as("end"))
  .drop($"group_nb")
  .orderBy($"name", $"start")
  .show
Result:
+----+----------+----------+
|name| start| end|
+----+----------+----------+
|Mike|2018-08-19|2018-08-20|
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-09-21|2018-09-29|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+
Here is one more solution using DataFrames, without UDFs:
val df = Seq(
("Mike","2018-09-01","2018-09-10"), // range 1
("Mike","2018-09-05","2018-09-05"), // range 1
("Mike","2018-09-12","2018-09-12"), // range 1
("Mike","2018-09-11","2018-09-11"), // range 1
("Mike","2018-09-25","2018-09-30"), // range 4
("Mike","2018-09-21","2018-09-23"), // range 4
("Mike","2018-09-24","2018-09-24"), // range 4
("Mike","2018-09-14","2018-09-16"), // range 2
("Mike","2018-09-15","2018-09-17"), // range 2
("Mike","2018-09-05","2018-09-05"), // range 1
("Mike","2018-09-19","2018-09-19"), // range 3
("Mike","2018-09-19","2018-09-19") // range 3
).toDF("name", "start", "end").withColumn("start",'start.cast("date")).withColumn("end",'end.cast("date"))
df.printSchema()
val df2 = df.as("t1").join(df.as("t2"), $"t1.start" =!= $"t2.start" and $"t1.end" =!= $"t2.end")
.withColumn("date_diff_start",datediff($"t1.start",$"t2.start"))
.withColumn("date_diff_end",datediff($"t1.end",$"t2.end"))
.withColumn("n1_start",when('date_diff_start===1,$"t2.start"))
.withColumn("n1_end",when('date_diff_end === -1,$"t2.end"))
.filter( 'n1_start.isNotNull or 'n1_end.isNotNull)
.withColumn( "new_start", when('n1_start.isNotNull, $"n1_start").otherwise($"t1.start"))
.withColumn( "new_end", when('n1_end.isNotNull, $"n1_end").otherwise($"t1.end"))
.select("t1.name","new_start","new_end")
.distinct
val df3= df2.alias("t3").join(df2.alias("t4"),$"t3.name" === $"t4.name")
.withColumn("x1",when($"t3.new_end"=== $"t4.new_start",1)
.when($"t3.new_start" === $"t4.new_end",1)
.otherwise(0))
.groupBy("t3.name","t3.new_start","t3.new_end")
.agg( min( when('x1===1,$"t4.new_start" ).otherwise($"t3.new_start") ).as("ns"), max(when('x1===1,$"t4.new_end").otherwise($"t3.new_end")).as("ne"))
.select("t3.name","ns","ne")
.distinct
df3.show(false)
val num_combinations = df3.count
val df4 = df.filter('start==='end).distinct.select("name","start").alias("dup")
.join(df3.alias("d4"), $"d4.name"===$"dup.name" , "leftOuter")
.withColumn("cond", ! $"dup.start".between($"ns" , $"ne"))
.filter('cond)
.groupBy("d4.name","start" ).agg(count($"start").as("count"),collect_set('start).as("dup_s1"))
.filter('count===num_combinations)
.withColumn("start",explode('dup_s1))
.withColumn("end",'start)
.select("name","start","end")
df3.union(df4).show(false)
Results:
+----+----------+----------+
|name|ns |ne |
+----+----------+----------+
|Mike|2018-09-21|2018-09-30|
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
+----+----------+----------+

Dataframe to RDD[Row] replacing space with nulls

I am converting a Spark DataFrame to RDD[Row] so I can map it to the final schema to write into a Hive ORC table. I want to convert any space in the input to an actual null so the Hive table can store an actual null instead of an empty string.
Input DataFrame (a single column with pipe-delimited values):
col1
1|2|3||5|6|7|||...|
My code:
inputDF.rdd.
  map { x: Row => x.get(0).asInstanceOf[String].split("\\|", -1) }.
  map { x => Row(nullConverter(x(0)), nullConverter(x(1)), nullConverter(x(2)) .... nullConverter(x(200))) }

def nullConverter(input: String): String = {
  if (input.trim.length > 0) input.trim
  else null
}
Is there any clean way of doing this rather than calling the nullConverter function 200 times?
Update based on single column:
Going with your approach, I would do something like:
inputDf.rdd.map((row: Row) => {
  val values = row.get(0).asInstanceOf[String].split("\\|", -1).map(nullConverter) // -1 keeps trailing empty tokens
  Row(values: _*) // expand the converted values into individual Row fields
})
Make your nullConverter or any other logic a udf:
import org.apache.spark.sql.functions._
val nullConverter = udf((input: String) => {
if (input.trim.length > 0) input.trim
else null
})
Now, use the udf on your df and apply to all columns:
val convertedDf = inputDf.select(inputDf.columns.map(c => nullConverter(col(c)).alias(c)):_*)
Now, you can do your RDD logic.
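If the end goal is still an RDD[Row] for the Hive write, the converted DataFrame can simply be viewed as one (a minimal sketch, assuming convertedDf from above):
val cleanedRdd = convertedDf.rdd // RDD[Row] with real nulls instead of blank strings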
This would be easier to do using the DataFrame API before converting to an RDD. First, split the data:
val df = Seq(("1|2|3||5|6|7|8||")).toDF("col0") // Example dataframe
val df2 = df.withColumn("col0", split($"col0", "\\|")) // Split on "|"
Then find out the length of the array:
val numCols = df2.first.getAs[Seq[String]](0).length
Now, for each element in the array, use the nullConverter UDF and then assign it to its own column.
val nullConverter = udf((input: String) => {
if (input.trim.length > 0) input.trim
else null
})
val df3 = df2.select((0 until numCols).map(i => nullConverter($"col0".getItem(i)).as("col" + i)): _*)
The result using the example dataframe:
+----+----+----+----+----+----+----+----+----+----+
|col0|col1|col2|col3|col4|col5|col6|col7|col8|col9|
+----+----+----+----+----+----+----+----+----+----+
| 1| 2| 3|null| 5| 6| 7| 8|null|null|
+----+----+----+----+----+----+----+----+----+----+
Now convert it to an RDD or continue using the data as a DataFrame depending on your needs.
There is no point in converting the DataFrame to an RDD for this. Note that regexp_replace has to be applied per column (there is no wildcard column for it), e.g.:
import org.apache.spark.sql.functions._

val df = Seq((1, "foo bar"), (2, "foobar "), (3, " ")).toDF("k", "v")
df.select($"k", regexp_replace($"v", " ", "NULL").as("v")).show()

Count empty values in dataframe column in Spark (Scala)

I'm trying to count empty values in a column of a DataFrame like this:
df.filter((df(colname) === null) || (df(colname) === "")).count()
In colname there is the name of the column. This works fine if the column type is string, but if the column type is integer and there are some nulls, this code always returns 0. Why is this so? How do I change it to make it work?
As mentioned in the question, df.filter((df(colname) === null) || (df(colname) === "")).count() works for String data types, but testing shows that nulls are not handled.
@Psidom's answer handles both null and empty, but does not handle NaN.
Checking for .isNaN as well should handle all three cases:
df.filter(df(colName).isNull || df(colName) === "" || df(colName).isNaN).count()
You can use isNull to test the null condition:
val df = Seq((Some("a"), Some(1)), (null, null), (Some(""), Some(2))).toDF("A", "B")
// df: org.apache.spark.sql.DataFrame = [A: string, B: int]
df.filter(df("A").isNull || df("A") === "").count
// res7: Long = 2
df.filter(df("B").isNull || df("B") === "").count
// res8: Long = 1
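To see why the original filter returns 0 on the integer column (my illustration, using the same df as above): comparing with === null follows SQL three-valued logic, so the predicate evaluates to null rather than true and no rows pass.
df.filter(df("B") === null).count // 0 -- the comparison yields null, never true
df.filter(df("B").isNull).count   // 1 -- isNull is what actually matches the null row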

How to impute NULL values to zero in Spark/Scala

I have a DataFrame in which some columns are of type String and contain "NULL" as a String value (not as an actual NULL). I want to impute them with zero. Apparently df.na.fill(0) doesn't work. How can I impute them with zero?
You can use replace() from DataFrameNaFunctions, which can be accessed via the .na prefix:
val df1 = df.na.replace("*", Map("NULL" -> "0"))
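A quick sketch of what this does (a hypothetical two-column DataFrame of my own, not from the original post):
val demo = Seq(("NULL", "1"), ("2", "NULL")).toDF("a", "b")
// Every string cell equal to "NULL" becomes "0"; other values are untouched.
demo.na.replace("*", Map("NULL" -> "0")).show()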
You could also create your own udf that replicates this behaviour:
import org.apache.spark.sql.functions.col
val nullReplacer = udf((x: String) => {
if (x == "NULL") "0"
else x
})
val df1 = df.select(df.columns.map(c => nullReplacer(col(c)).alias(c)): _*)
However, this would be superfluous, given that it does the same as the above at the cost of more lines of code than necessary.