Spark/Scala transform fails in FindBugs - scala

FindBugs is reporting a "Null pointer dereference of ?" in the following Spark/Scala transform:
val df = Seq((10,20)).toDF
df
.transform(addCol1) // <-- Null pointer dereference of ?
def addCol1(df: DataFrame): DataFrame =
df.withColumn("col1", lit("a"))
Report:
Null pointer dereference of ? in new MyClass$...
Bug kind and pattern: NP - NP_ALWAYS_NULL
I don't understand what might be null in this case. Also, the ? may be a lead.
Does anyone have a hint?

Hope the code snippet below helps.
scala> val df = List(1,2,3).toDF("id")
df: org.apache.spark.sql.DataFrame = [id: int]
scala> df.show
+---+
| id|
+---+
| 1|
| 2|
| 3|
+---+
scala> :paste
// Entering paste mode (ctrl-D to finish)
def addCol1(df: DataFrame): DataFrame =
df.withColumn("col1", lit("a"))
// Exiting paste mode, now interpreting.
addCol1: (df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
scala> df.transform(addCol1).show
+---+----+
| id|col1|
+---+----+
| 1| a|
| 2| a|
| 3| a|
+---+----+

Related

Convert Date Column to Age with Scala and Spark

I am trying to convert a Column of a Dataset to true Age.
I am using Scala with Spark and my project is on IntelliJ.
This is the sample dataset
TotalCost|BirthDate|Gender|TotalChildren|ProductCategoryName
1000||Male|2|Technology
2000|1957-03-06||3|Beauty
3000|1959-03-06|Male||Car
4000|1953-03-06|Male|2|
5000|1957-03-06|Female|3|Beauty
6000|1959-03-06|Male|4|Car
7000|1957-03-06|Female|3|Beauty
8000|1959-03-06|Male|4|Car
And this is the code of Scala
import org.apache.spark.sql.SparkSession
object DataFrameFromCSVFile2 {
def main(args:Array[String]):Unit= {
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExample")
.getOrCreate()
val filePath="src/main/resources/demodata.txt"
val df = spark.read.options(Map("inferSchema"->"true","delimiter"->"|","header"->"true")).csv(filePath).select("Gender", "BirthDate", "TotalCost", "TotalChildren", "ProductCategoryName")
val df2 = df
.filter("Gender is not null")
.filter("BirthDate is not null")
.filter("TotalChildren is not null")
.filter("ProductCategoryName is not null")
df2.show()
}
}
I am trying to convert a value such as 1957-03-06 in the BirthDate column to an age such as 61.
Any idea will help a lot
Thank you very much
You can use the built-in functions months_between() or datediff(). For example:
scala> val df = Seq("1957-03-06","1959-03-06").toDF("date")
df: org.apache.spark.sql.DataFrame = [date: string]
scala> df.show(false)
+----------+
|date |
+----------+
|1957-03-06|
|1959-03-06|
+----------+
scala> df.withColumn("age",months_between(current_date,'date)/12).show
+----------+------------------+
| date| age|
+----------+------------------+
|1957-03-06|61.806451612500005|
|1959-03-06|59.806451612500005|
+----------+------------------+
scala> df.withColumn("age",datediff(current_date,'date)/365).show
+----------+-----------------+
| date| age|
+----------+-----------------+
|1957-03-06|61.85205479452055|
|1959-03-06|59.85205479452055|
+----------+-----------------+
scala>
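Both expressions above return a fractional age. If a whole number of completed years is wanted, one option is to floor the months_between result. The following is only a sketch: the object name WholeYearAge, the helper withAge, and the fixed reference date 2018-12-31 are assumptions made for reproducibility; in real code current_date() would replace the fixed date.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, floor, lit, months_between, to_date}

object WholeYearAge {
  // Adds an integer "age" column holding the completed years between
  // `dateCol` and `asOf`. `asOf` is a fixed yyyy-MM-dd string (an assumption,
  // so the result does not depend on the day the code runs).
  // Rows with a null date get a null age, so no extra null check is needed.
  def withAge(df: DataFrame, dateCol: String, asOf: String): DataFrame =
    df.withColumn(
      "age",
      floor(months_between(to_date(lit(asOf)), to_date(col(dateCol))) / 12).cast("int"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("WholeYearAge")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("1957-03-06", "1959-03-06").toDF("date")
    // As of 2018-12-31 the ages are 61 and 59.
    withAge(df, "date", "2018-12-31").show()

    spark.stop()
  }
}
```

Flooring months/12 avoids the small drift that dividing a day count by 365 introduces around birthdays and leap years.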
Here's one way that uses the java.time API in a UDF along with Spark's built-in when/otherwise for the null check:
val currentAge = udf{ (dob: java.sql.Date) =>
import java.time.{LocalDate, Period}
Period.between(dob.toLocalDate, LocalDate.now).getYears
}
df.withColumn("CurrentAge", when($"BirthDate".isNotNull, currentAge($"BirthDate"))).
show(5)
// +------+-------------------+---------+-------------+-------------------+----------+
// |Gender| BirthDate|TotalCost|TotalChildren|ProductCategoryName|CurrentAge|
// +------+-------------------+---------+-------------+-------------------+----------+
// | Male| null| 1000| 2| Technology| null|
// | null|1957-03-06 00:00:00| 2000| 3| Beauty| 61|
// | Male|1959-03-06 00:00:00| 3000| null| Car| 59|
// | Male|1953-03-06 00:00:00| 4000| 2| null| 65|
// |Female|1957-03-06 00:00:00| 5000| 3| Beauty| 61|
// +------+-------------------+---------+-------------+-------------------+----------+
You can use the Java Calendar library to get the current date in your timezone and calculate the age inside a UDF.
For example:
import java.time.ZoneId
import java.util.Calendar
val data = Seq("1957-03-06","1959-03-06").toDF("date")
val ageUdf = udf((inputDate: String) => {
val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
val birthDate = format.parse(inputDate).toInstant.atZone(ZoneId.systemDefault()).toLocalDate
val currentDate = Calendar.getInstance().getTime.toInstant.atZone(ZoneId.systemDefault()).toLocalDate
import java.time.Period
if((birthDate != null) && (currentDate != null)) Period.between(birthDate,currentDate).getYears
else 0
})
data.withColumn("age",ageUdf($"date")).show()
The output will be:
date|age
1957-03-06|61
1959-03-06|59

How to transform a string column of a dataframe into a column of Array[String] with Apache Spark and Scala

I have a DataFrame with a column 'title_from'.
This column contains a sentence, and I want to transform it into an Array[String]. I have tried something like this, but it does not work:
val newDF = df.select("title_from").map(x => x.split("\\\s+")
How can I achieve this? How can I transform a DataFrame of strings into a DataFrame of Array[String]? I want every row of newDF to be an array of the words from df.
Thanks for any help!
You can use the withColumn function.
import org.apache.spark.sql.functions._
val newDF = df.withColumn("split_title_from", split(col("title_from"), "\\s+"))
.select("split_title_from")
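If a typed Dataset[Array[String]] is preferred over a DataFrame column, the map approach from the question can also work once the Row is converted to a String first; a Row has no split method, which is likely why the original attempt failed. This is a sketch under assumptions: the object SplitTitle and the helper toWords are hypothetical names, and null titles are not handled.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object SplitTitle {
  // Converts a string column into a Dataset of word arrays.
  // Assumes the column holds no nulls; a null title would throw inside map.
  def toWords(spark: SparkSession, df: DataFrame, column: String): Dataset[Array[String]] = {
    import spark.implicits._ // encoders for String and Array[String]
    df.select(column).as[String].map(_.trim.split("\\s+"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SplitTitle")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("a short title", "another   one").toDF("title_from")
    // Each row becomes an array of words, e.g. [a, short, title].
    toWords(spark, df, "title_from").show(false)

    spark.stop()
  }
}
```

The withColumn/split version stays inside Spark SQL expressions and tolerates nulls, so it is usually the simpler choice; the typed version is mainly useful when the rest of the pipeline works with Datasets.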
You can try the following to get the list of all authors:
scala> val df = Seq((1,"a1,a2,a3"), (2,"a1,a4,a10")).toDF("id","author")
df: org.apache.spark.sql.DataFrame = [id: int, author: string]
scala> df.show()
+---+---------+
| id| author|
+---+---------+
| 1| a1,a2,a3|
| 2|a1,a4,a10|
+---+---------+
scala> df.select("author").show
+---------+
| author|
+---------+
| a1,a2,a3|
|a1,a4,a10|
+---------+
scala> df.select("author").flatMap( row => { row.get(0).toString().split(",")}).show()
+-----+
|value|
+-----+
| a1|
| a2|
| a3|
| a1|
| a4|
| a10|
+-----+

Sequential Dynamic filters on the same Spark Dataframe Column in Scala Spark

I have a column named root and need to filter the DataFrame based on different values of that column.
Suppose the values in root are parent, child, or sub-child, and I want to apply these filters dynamically through a variable.
val x = ("parent,child,sub-child").split(",")
x.map(eachvalue <- {
var df1 = df.filter(col("root").contains(eachvalue))
}
But when I do this, df1 is overwritten on every iteration; instead, I want to apply all 3 filters and get the combined result.
In the future I may extend the list to any number of filter values, and the code should still work.
Thanks,
Bab
You should apply each subsequent filter to the result of the previous filter, not to df:
val x = ("parent,child,sub-child").split(",")
var df1 = df
x.foreach(eachvalue => {
df1 = df1.filter(col("root").contains(eachvalue))
})
After the loop, df1 will have all the filters applied to it.
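The same chaining can be written without a var by folding over the list of values. The sketch below is not from the original answer: the object SequentialFilters and the helpers filterAll and filterAny are hypothetical names. filterAll keeps rows that contain every value (the chained filters combine as AND), while filterAny keeps rows that contain at least one value, in case that is what is actually needed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object SequentialFilters {
  // Applies one contains-filter per value, each on the result of the
  // previous one, so a row must contain ALL values to survive.
  def filterAll(df: DataFrame, column: String, values: Seq[String]): DataFrame =
    values.foldLeft(df)((acc, v) => acc.filter(col(column).contains(v)))

  // Keeps rows containing AT LEAST ONE value. Assumes `values` is non-empty.
  def filterAny(df: DataFrame, column: String, values: Seq[String]): DataFrame =
    df.filter(values.map(v => col(column).contains(v)).reduce(_ || _))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SequentialFilters")
      .getOrCreate()
    import spark.implicits._

    val values = "parent,child,sub-child".split(",").toSeq
    val df = Seq("parent", "child", "sub-child", "parent/child/sub-child", "cousin").toDF("root")

    filterAll(df, "root", values).show(false) // only "parent/child/sub-child"
    filterAny(df, "root", values).show(false) // every row except "cousin"

    spark.stop()
  }
}
```

Because the list is just a Seq, extending it to any number of values needs no code change.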
Let's see an example with spark shell. Hope it helps you.
scala> import spark.implicits._
import spark.implicits._
scala> val df0 =
spark.sparkContext.parallelize(List(1,2,1,3,3,2,1)).toDF("number")
df0: org.apache.spark.sql.DataFrame = [number: int]
scala> val list = List(1,2,3)
list: List[Int] = List(1, 2, 3)
scala> val dfFiltered = for (number <- list) yield { df0.filter($"number" === number)}
dfFiltered: List[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = List([number: int], [number: int], [number: int])
scala> dfFiltered(0).show
+------+
|number|
+------+
| 1|
| 1|
| 1|
+------+
scala> dfFiltered(1).show
+------+
|number|
+------+
| 2|
| 2|
+------+
scala> dfFiltered(2).show
+------+
|number|
+------+
| 3|
| 3|
+------+
AFAIK isin can be used in this case. Below is an example.
import spark.implicits._
val colorStringArr = "red,yellow,blue".split(",")
val colorDF =
List(
"red",
"yellow",
"purple"
).toDF("color")
// to derive a column using a list
colorDF.withColumn(
"is_primary_color",
col("color").isin(colorStringArr: _*)
).show()
println( "if you don't want derived column and directly want to filter using a list with isin then .. ")
colorDF.filter(col("color").isin(colorStringArr: _*)).show
Result :
+------+----------------+
| color|is_primary_color|
+------+----------------+
| red| true|
|yellow| true|
|purple| false|
+------+----------------+
if you don't want derived column and directly want to filter using a list with isin then ....
+------+
| color|
+------+
| red|
|yellow|
+------+
One more way using array_contains and swapping the arguments.
scala> val x = ("parent,child,sub-child").split(",")
x: Array[String] = Array(parent, child, sub-child)
scala> val df = Seq(("parent"),("grand-parent"),("child"),("sub-child"),("cousin")).toDF("root")
df: org.apache.spark.sql.DataFrame = [root: string]
scala> df.show
+------------+
| root|
+------------+
| parent|
|grand-parent|
| child|
| sub-child|
| cousin|
+------------+
scala> df.withColumn("check", array_contains(lit(x),'root)).show
+------------+-----+
| root|check|
+------------+-----+
| parent| true|
|grand-parent|false|
| child| true|
| sub-child| true|
| cousin|false|
+------------+-----+
scala>
Here are my two cents
val filters = List(1,2,3)
val data = List(5,1,2,1,3,3,2,1,4)
val colName = "number"
val df = spark.
sparkContext.
parallelize(data).
toDF(colName).
filter(
r => filters.contains(r.getAs[Int](colName))
)
df.show()
which results in
+------+
|number|
+------+
| 1|
| 2|
| 1|
| 3|
| 3|
| 2|
| 1|
+------+

How to use DataFrame Window expressions and withColumn and not to change partition?

For some reason I have to convert an RDD to a DataFrame and then do something with the DataFrame.
My interface is RDD, so I have to convert the DataFrame back to an RDD. When I use df.withColumn, the number of partitions changes to 1, so I have to repartition and sortBy the RDD.
Is there a cleaner solution?
This is my code :
val rdd = sc.parallelize(List(1,3,2,4,5,6,7,8),4)
val partition = rdd.getNumPartitions
println(partition + "rdd")
val df=rdd.toDF()
val rdd2=df.rdd
val result = rdd.toDF("col1")
.withColumn("csum", sum($"col1").over(Window.orderBy($"col1")))
.withColumn("rownum", row_number().over(Window.orderBy($"col1")))
.withColumn("avg", $"csum"/$"rownum").rdd
println(result.getNumPartitions + "rdd2")
Let's make this as simple as possible. We will generate the same data in 4 partitions:
scala> val df = spark.range(1,9,1,4).toDF
df: org.apache.spark.sql.DataFrame = [id: bigint]
scala> df.show
+---+
| id|
+---+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
+---+
scala> df.rdd.getNumPartitions
res13: Int = 4
We don't need 3 window functions to prove this, so let's do it with one :
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val df2 = df.withColumn("csum", sum($"id").over(Window.orderBy($"id")))
df2: org.apache.spark.sql.DataFrame = [id: bigint, csum: bigint]
What's happening here is that we didn't just add a column: we computed a cumulative sum over a window, and since you haven't provided a partition column, the window function moves all the data to a single partition. You even get a warning from Spark:
scala> df2.rdd.getNumPartitions
17/06/06 10:05:53 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
res14: Int = 1
scala> df2.show
17/06/06 10:05:56 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---+----+
| id|csum|
+---+----+
| 1| 1|
| 2| 3|
| 3| 6|
| 4| 10|
| 5| 15|
| 6| 21|
| 7| 28|
| 8| 36|
+---+----+
So let's add now a column to partition on. We will create a new DataFrame just for the sake of demonstration :
scala> val df3 = df.withColumn("x", when($"id"<5,lit("a")).otherwise("b"))
df3: org.apache.spark.sql.DataFrame = [id: bigint, x: string]
It has indeed the same number of partitions that we defined explicitly on df :
scala> df3.rdd.getNumPartitions
res18: Int = 4
Let's perform our window operation using the column x to partition :
scala> val df4 = df3.withColumn("csum", sum($"id").over(Window.orderBy($"id").partitionBy($"x")))
df4: org.apache.spark.sql.DataFrame = [id: bigint, x: string ... 1 more field]
scala> df4.show
+---+---+----+
| id| x|csum|
+---+---+----+
| 5| b| 5|
| 6| b| 11|
| 7| b| 18|
| 8| b| 26|
| 1| a| 1|
| 2| a| 3|
| 3| a| 6|
| 4| a| 10|
+---+---+----+
The window function will repartition our data using the default number of shuffle partitions set in the Spark configuration (spark.sql.shuffle.partitions, 200 by default).
scala> df4.rdd.getNumPartitions
res20: Int = 200
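If 200 partitions is too many for a small dataset, one option is to lower spark.sql.shuffle.partitions before running the window. The sketch below makes several assumptions: the object name WindowPartitions, the helper cumulativeSum, and the value 8 are illustrative, and adaptive query execution is switched off because, on recent Spark versions, it may otherwise coalesce shuffle partitions and change the count.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, sum, when}

object WindowPartitions {
  // Cumulative sum of "id" within each value of "x", ordered by "id".
  def cumulativeSum(df: DataFrame): DataFrame =
    df.withColumn("csum", sum(col("id")).over(Window.partitionBy(col("x")).orderBy(col("id"))))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("WindowPartitions")
      // Illustrative value; the default is 200.
      .config("spark.sql.shuffle.partitions", "8")
      // Assumption: disable AQE so partitions are not coalesced afterwards.
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()

    val df = spark.range(1, 9, 1, 4).toDF("id")
      .withColumn("x", when(col("id") < 5, lit("a")).otherwise(lit("b")))

    val result = cumulativeSum(df)
    result.show()
    // With the settings above, the count should follow
    // spark.sql.shuffle.partitions rather than the original 4.
    println(result.rdd.getNumPartitions)

    spark.stop()
  }
}
```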
I was just reading about controlling the number of partitions when using groupBy aggregation at https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-performance-tuning-groupBy-aggregation.html. It seems the same trick works with Window. In my code (PySpark) I'm defining a window like
windowSpec = Window \
.partitionBy('colA', 'colB') \
.orderBy('timeCol') \
.rowsBetween(1, 1)
and then doing
next_event = F.lead('timeCol', 1).over(windowSpec)
and creating a dataframe via
df2 = df.withColumn('next_event', next_event)
and indeed, it has 200 partitions. But, if I do
df2 = df.repartition(10, 'colA', 'colB').withColumn('next_event', next_event)
it has 10!

Select column by name with multiple aggregate columns after pivot with Spark Scala

I am trying to aggregate multiple columns after a pivot in Scala Spark 2.0.1:
scala> val df = List((1, 2, 3, None), (1, 3, 4, Some(1))).toDF("a", "b", "c", "d")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 2 more fields]
scala> df.show
+---+---+---+----+
| a| b| c| d|
+---+---+---+----+
| 1| 2| 3|null|
| 1| 3| 4| 1|
+---+---+---+----+
scala> val pivoted = df.groupBy("a").pivot("b").agg(max("c"), max("d"))
pivoted: org.apache.spark.sql.DataFrame = [a: int, 2_max(`c`): int ... 3 more fields]
scala> pivoted.show
+---+----------+----------+----------+----------+
| a|2_max(`c`)|2_max(`d`)|3_max(`c`)|3_max(`d`)|
+---+----------+----------+----------+----------+
| 1| 3| null| 4| 1|
+---+----------+----------+----------+----------+
I am unable to select or rename those columns so far:
scala> pivoted.select("3_max(`d`)")
org.apache.spark.sql.AnalysisException: syntax error in attribute name: 3_max(`d`);
scala> pivoted.select("`3_max(`d`)`")
org.apache.spark.sql.AnalysisException: syntax error in attribute name: `3_max(`d`)`;
scala> pivoted.select("`3_max(d)`")
org.apache.spark.sql.AnalysisException: cannot resolve '`3_max(d)`' given input columns: [2_max(`c`), 3_max(`d`), a, 2_max(`d`), 3_max(`c`)];
There must be a simple trick here, any ideas? Thanks.
This seems like a bug: the backticks in the generated names cause the problem. One fix is to remove the backticks from the column names:
val pivotedNewName = pivoted.columns.foldLeft(pivoted)((df, col) =>
df.withColumnRenamed(col, col.replace("`", "")))
Now you can select by column names as normal:
pivotedNewName.select("2_max(c)").show
+--------+
|2_max(c)|
+--------+
| 3|
+--------+
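Another way to sidestep the problem, not part of the answer above, is to alias each aggregate so the generated column names never contain backticks or parentheses in the first place. In the sketch below the object PivotAliases, the helper pivotWithAliases, and the aliases max_c and max_d are hypothetical names; Spark prefixes each alias with the pivot value.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.max

object PivotAliases {
  // Aliasing the aggregates yields columns such as 2_max_c and 3_max_d,
  // which can be selected by name without any quoting.
  def pivotWithAliases(df: DataFrame): DataFrame =
    df.groupBy("a").pivot("b").agg(max("c").as("max_c"), max("d").as("max_d"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("PivotAliases")
      .getOrCreate()
    import spark.implicits._

    val data: Seq[(Int, Int, Int, Option[Int])] = Seq((1, 2, 3, None), (1, 3, 4, Some(1)))
    val df = data.toDF("a", "b", "c", "d")

    val pivoted = pivotWithAliases(df)
    pivoted.printSchema()
    pivoted.select("3_max_d").show()

    spark.stop()
  }
}
```

This avoids the extra renaming pass over all columns, at the cost of choosing aliases up front.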