How to add days (as values of a column) to date? - scala

I have a problem with adding days (numbers) to date format columns in Spark. I know that there is a function date_add that takes two arguments - date column and integer:
date_add(date startdate, tinyint/smallint/int days)
I'd like to pass a column whose values are integers instead of a literal integer.
Say I have the following dataframe:
val data = Seq(
  (0, "2016-01-1"),
  (1, "2016-02-2"),
  (2, "2016-03-22"),
  (3, "2016-04-25"),
  (4, "2016-05-21"),
  (5, "2016-06-1"),
  (6, "2016-03-21")
).toDF("id", "date")
I can simply add integers to dates:
val date_add_fun =
  data.select(
    $"id",
    $"date",
    date_add($"date", 1)
  )
But I cannot use a column expression that contains the values:
val date_add_fun =
  data.select(
    $"id",
    $"date",
    date_add($"date", $"id")
  )
It gives this error:
<console>:60: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: Int
date_add($"date", $"id")
Does anyone know if it is possible to use a column in the date_add function? Or is there a workaround?

You can use expr:
import org.apache.spark.sql.functions.expr
data.withColumn("future", expr("date_add(date, id)")).show
// +---+----------+----------+
// | id| date| future|
// +---+----------+----------+
// | 0| 2016-01-1|2016-01-01|
// | 1| 2016-02-2|2016-02-03|
// | 2|2016-03-22|2016-03-24|
// | 3|2016-04-25|2016-04-28|
// | 4|2016-05-21|2016-05-25|
// | 5| 2016-06-1|2016-06-06|
// | 6|2016-03-21|2016-03-27|
// +---+----------+----------+
selectExpr can be used in a similar way:
data.selectExpr("*", "date_add(date, id) as future").show

The other answers work but aren't a drop-in replacement for the existing date_add function.
I had a case where expr wouldn't work for me, so here is a drop-in replacement:
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.DateAdd

// same name and behaviour as functions.date_add, but the day count is a Column
def date_add(date: Column, days: Column) = {
  new Column(DateAdd(date.expr, days.expr))
}
Basically, all the machinery is already there in Spark to do this; the public date_add signature just forces the day count to be a literal.
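With this overload in scope, the call from the question should work unchanged (a quick usage sketch):
data.select(
  $"id",
  $"date",
  date_add($"date", $"id")
).show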

You can use a SQL expression as follows:
data.createOrReplaceTempView("table")
sqlContext.sql("select id, date, date_add(`date`, `id`) as added_date from table").show(false)
which would give you
+---+----------+----------+
|id |date |added_date|
+---+----------+----------+
|0 |2016-01-1 |2016-01-01|
|1 |2016-02-2 |2016-02-03|
|2 |2016-03-22|2016-03-24|
|3 |2016-04-25|2016-04-28|
|4 |2016-05-21|2016-05-25|
|5 |2016-06-1 |2016-06-06|
|6 |2016-03-21|2016-03-27|
+---+----------+----------+

For the Python developers who are here, you can simply add the integer column to the date column using +:
import pyspark.sql.functions as F
new_df = df.withColumn("new_date", F.col("date") + F.col("offset"))
Just make sure that the offset column is int/smallint/tinyint.

Related

Saving double datatype in spark dataframe

In the Spark Scala code below, a double value is stored differently from how it was written, even though all columns in the table are of string type. The same result appears in Impala as well.
Does someone know how to make sure the exact value gets saved and retrieved?
Thanks
val df = Seq(("one", 1324235345435.4546)).toDF("a", "b")
df.write.mode("append").insertInto("test")
spark.sql("select * from test").show(false)
+---+---------------------+
|a |b |
+---+---------------------+
|one|1.3242353454354546E12|
+---+---------------------+
Try casting to Decimal type and then inserting into the Hive table.
val df = Seq(("one", 1324235345435.4546))
.toDF("a", "b")
.select('a,'b.cast("Decimal(36,4)"))
df.show(false)
+---+------------------+
|a |b |
+---+------------------+
|one|1324235345435.4546|
+---+------------------+
scala> df.select(format_number(col("b"),4)).show(false)
+----------------------+
|format_number(b, 4) |
+----------------------+
|1,324,235,345,435.4546|
+----------------------+
You could use the format_number function on the column so that it is automatically converted to a string with the precision you require.
Hope this helps as a more general approach.
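To complete the round trip from the question, the decimal-cast dataframe can then be inserted back the same way (a sketch; "test" is the table from the question, and the value is written in its plain string form if the table column is string-typed):
// as in the original code, append into the existing Hive table
df.write.mode("append").insertInto("test")
spark.sql("select * from test").show(false)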

How to remove the fractional part from a dataframe column?

Input dataframe:
val ds = Seq((1,34.44),
(2,76.788),
(3,54.822)).toDF("id","mark")
Expected output:
val ds = Seq((1,34),
(2,76),
(3,54)).toDF("id","mark")
I want to remove the fractional part from the column mark as shown above. I have searched for a builtin function, but did not find anything. What should a UDF look like to achieve the above result?
You can just cast to integer:
import org.apache.spark.sql.functions._
ds.withColumn("mark", $"mark".cast("integer")).show(false)
which should give you
+---+----+
|id |mark|
+---+----+
|1 |34 |
|2 |76 |
|3 |54 |
+---+----+
I hope the answer is helpful
Update
You commented:
But if any string values are there in the column , it is getting null since we are casting into integer . i don't want that kind of behaviour
So I guess your mark column must be of StringType, in which case you can use regexp_replace:
import org.apache.spark.sql.functions._
ds.withColumn("mark", regexp_replace($"mark", "(\\.\\d+)", "")).show(false)

Get minimum value from an Array in a Spark DataFrame column

I have a DataFrame with Arrays.
val DF = Seq(
  ("123", "|1|2", "3|3|4"),
  ("124", "|3|2", "|3|4")
).toDF("id", "complete1", "complete2")
  .select($"id", split($"complete1", "\\|").as("complete1"), split($"complete2", "\\|").as("complete2"))
+---+---------+---------+
| id|complete1|complete2|
+---+---------+---------+
|123| [, 1, 2]|[3, 3, 4]|
|124| [, 3, 2]| [, 3, 4]|
+---+---------+---------+
How do I extract the minimum of each array?
+---+---------+---------+
| id|complete1|complete2|
+---+---------+---------+
|123|        1|        3|
|124|        2|        3|
+---+---------+---------+
I have tried defining a UDF to do this but I am getting an error.
def minArray(a:Array[String]) :String = a.filter(_.nonEmpty).min.mkString
val minArrayUDF = udf(minArray _)
def getMinArray(df: DataFrame, i: Int): DataFrame = df.withColumn("complete" + i, minArrayUDF(df("complete" + i)))
val minDf = (1 to 2).foldLeft(DF){ case (df, i) => getMinArray(df, i)}
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
Since Spark 2.4, you can use array_min to find the minimum value in an array. To use this function you will first have to cast your arrays of strings to arrays of integers. Casting will also take care of the empty strings by converting them into null values.
DF.select($"id",
array_min(expr("cast(complete1 as array<int>)")).as("complete1"),
array_min(expr("cast(complete2 as array<int>)")).as("complete2"))
You can define your udf function as below
def minUdf = udf((arr: Seq[String]) => arr.filterNot(_ == "").map(_.toInt).min)
and call it as
DF.select(col("id"), minUdf(col("complete1")).as("complete1"), minUdf(col("complete2")).as("complete2")).show(false)
which should give you
+---+---------+---------+
|id |complete1|complete2|
+---+---------+---------+
|123|1 |3 |
|124|2 |3 |
+---+---------+---------+
Updated
In case the array passed to the udf function is empty or contains only empty strings, you will encounter
java.lang.UnsupportedOperationException: empty.min
You should handle that with an if-else condition in the udf function:
def minUdf = udf((arr: Seq[String]) => {
  val filtered = arr.filterNot(_ == "")
  if (filtered.isEmpty) 0
  else filtered.map(_.toInt).min
})
I hope the answer is helpful
Here is how you can do it without using a udf.
First explode the arrays you got with split(), then group by the same id and find the min:
import org.apache.spark.sql.functions.{explode, min, split}
import org.apache.spark.sql.types.IntegerType

val DF = Seq(
  ("123", "|1|2", "3|3|4"),
  ("124", "|3|2", "|3|4")
).toDF("id", "complete1", "complete2")
  .select($"id", split($"complete1", "\\|").as("complete1"), split($"complete2", "\\|").as("complete2"))
  .withColumn("complete1", explode($"complete1"))
  .withColumn("complete2", explode($"complete2"))
  .groupBy($"id").agg(min($"complete1".cast(IntegerType)).as("complete1"), min($"complete2".cast(IntegerType)).as("complete2"))
Output:
+---+---------+---------+
|id |complete1|complete2|
+---+---------+---------+
|124|2 |3 |
|123|1 |3 |
+---+---------+---------+
You don't need a UDF for this; you can use sort_array:
val DF = Seq(
  ("123", "|1|2", "3|3|4"),
  ("124", "|3|2", "|3|4")
).toDF("id", "complete1", "complete2")
  .select(
    $"id",
    split(regexp_replace($"complete1", "^\\|", ""), "\\|").as("complete1"),
    split(regexp_replace($"complete2", "^\\|", ""), "\\|").as("complete2")
  )
// now select the minimum
DF.select(
  $"id",
  sort_array($"complete1")(0).as("complete1"),
  sort_array($"complete2")(0).as("complete2")
).show()
+---+---------+---------+
| id|complete1|complete2|
+---+---------+---------+
|123| 1| 3|
|124| 2| 3|
+---+---------+---------+
Note that I removed the leading | before splitting to avoid empty strings in the array

Pass Array[Seq[String]] to UDF in spark scala

I am new to UDFs in Spark. I have also read the answer here.
Problem statement: I'm trying to do pattern matching on a dataframe column.
Ex: Dataframe
val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")),
(3,Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")
df.show()
+---+--------------------+
| id| text|
+---+--------------------+
| 1| z|
| 2| abs,abc,dfg|
| 3|a,b,c,d,e,f,abs,a...|
+---+--------------------+
df.filter($"text".contains("abs,abc,dfg")).count()
//returns 2 as abs exits in 2nd row and 3rd row
Now I want to do this pattern matching for every row in the text column and add a new column called count.
Result:
+---+--------------------+-----+
| id| text|count|
+---+--------------------+-----+
| 1| z| 1|
| 2| abs,abc,dfg| 2|
| 3|a,b,c,d,e,f,abs,a...| 1|
+---+--------------------+-----+
I tried to define a udf passing the text column as Array[Seq[String]], but I am not getting what I intended.
What I tried so far:
val txt = df.select("text").collect.map(_.toSeq.map(_.toString)) // convert column to Array[Seq[String]]
val valsum = udf((txt: Array[Seq[String]], pattern: String) => { txt.count(_ == pattern) })
df.withColumn("newCol", valsum(lit(txt), df("text"))).show()
Any help would be appreciated
You will have to know all the elements of the text column, which can be done using collect_list by grouping all the rows of your dataframe into one. Then just count the elements of the collected array that contain the value of the text column, as in the following code.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import scala.collection.mutable

val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")), (3, Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")
val valsum = udf((txt: String, array: mutable.WrappedArray[String]) => array.filter(element => element.contains(txt)).size)
df.withColumn("grouping", lit("g"))
.withColumn("array", collect_list("text").over(Window.partitionBy("grouping")))
.withColumn("count", valsum($"text", $"array"))
.drop("grouping", "array")
.show(false)
You should get the following output:
+---+-----------------------+-----+
|id |text |count|
+---+-----------------------+-----+
|1 |z |1 |
|2 |abs,abc,dfg |2 |
|3 |a,b,c,d,e,f,abs,abc,dfg|1 |
+---+-----------------------+-----+
I hope this is helpful.

How to calculate difference between date column and current date?

I am trying to calculate the date difference between a column field and the current system date.
Here is my sample code where I have hard-coded my column field as 20170126.
val currentDate = java.time.LocalDate.now
var datediff = spark.sqlContext.sql("""Select datediff(to_date('$currentDate'),to_date(DATE_FORMAT(CAST(unix_timestamp( cast('20170126' as String), 'yyyyMMdd') AS TIMESTAMP), 'yyyy-MM-dd'))) AS GAP
""")
datediff.show()
Output is like:
+----+
| GAP|
+----+
|null|
+----+
I need to calculate the actual gap between the two dates but I am getting NULL.
You have not defined the type and format of "column field", so I assume it's a string in the (not-very-pleasant) format yyyyMMdd.
val records = Seq((0, "20170126")).toDF("id", "date")
scala> records.show
+---+--------+
| id| date|
+---+--------+
| 0|20170126|
+---+--------+
scala> records
  .withColumn("year", substring($"date", 0, 4))
  .withColumn("month", substring($"date", 5, 2))
  .withColumn("day", substring($"date", 7, 2))
  .withColumn("d", concat_ws("-", $"year", $"month", $"day"))
  .select($"id", $"d" cast "date")
  .withColumn("datediff", datediff(current_date(), $"d"))
  .show
+---+----------+--------+
| id| d|datediff|
+---+----------+--------+
| 0|2017-01-26| 83|
+---+----------+--------+
PROTIP: Read up on the functions object.
Caveats
cast
Please note that I could not convince Spark SQL to cast the column "date" to DateType given the rules in DateTimeUtils.stringToDate:
yyyy
yyyy-[m]m
yyyy-[m]m-[d]d
yyyy-[m]m-[d]d (followed by a space)
yyyy-[m]m-[d]d *
yyyy-[m]m-[d]dT*
date_format
I could not convince date_format to work either, so I parsed the "date" column myself using the substring and concat_ws functions.
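For completeness: on Spark 2.2 and later, to_date accepts a format string, so the manual substring/concat_ws parsing above can usually be avoided (a sketch, not something the original answer used):
import org.apache.spark.sql.functions.{current_date, datediff, to_date}

// parse the yyyyMMdd string directly and diff against today's date
records
  .withColumn("d", to_date($"date", "yyyyMMdd"))
  .withColumn("datediff", datediff(current_date(), $"d"))
  .show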