This question already has answers here:
Spark Dataframe Random UUID changes after every transformation/action
(4 answers)
Closed 5 years ago.
In a dataframe, I'm generating column A in DateType format "yyyy-MM-dd" from a UDF (the UDF generates a random date from the last 24 months).
From that generated date I try to calculate column B: column B is column A minus 6 months, e.g. 2017-06-01 in A becomes 2016-12-01 in B.
To achieve this I use the function add_months(columnName, -6).
When I do this on another column (not generated by the UDF) I get the right result, but on the generated column I get random, totally wrong values.
I checked the schema; the column is of DateType.
This is my code:
val test = df.withColumn("A", to_date(callUDF("randomUDF")))
val test2 = test.select(col("*"), add_months(col("A"), -6).as("B"))
The code of my UDF:
sqlContext.udf.register("randomUDF", () => {
  // prepare the date format
  val formatter = new SimpleDateFormat("yyyy-MM-dd")
  // get today's date as the reference
  val today = Calendar.getInstance()
  val now = today.getTime()
  // set "from" to 24 months before now
  val from = Calendar.getInstance()
  from.setTime(now)
  from.add(Calendar.MONTH, -24)
  // convert both dates to Long (epoch millis)
  val valuefrom = from.getTimeInMillis()
  val valueto = today.getTimeInMillis()
  // generate a random Long between from and to
  val value3 = valuefrom + Math.random() * (valueto - valuefrom)
  // set the generated value on a Calendar and format the date
  val calendar3 = Calendar.getInstance()
  calendar3.setTimeInMillis(value3.toLong)
  formatter.format(calendar3.getTime())
})
The UDF itself works as expected, but something goes wrong here. I tried the add_months function on another (non-generated) column and it worked fine.
Example of the results I get with this code:
A | B
2017-10-20 | 2016-02-27
2016-05-06 | 2015-05-25
2016-01-09 | 2016-03-14
2016-01-04 | 2017-04-26
Using Spark version 1.5.1 and Scala 2.10.4.
The creation of the test2 dataframe in your code
val test2 = test.select(col("*"), add_months(col("A"), -6).as("B"))
is treated by Spark as
val test2 = df.withColumn("A", to_date(callUDF("randomUDF"))).select(col("*"), add_months(to_date(callUDF("randomUDF")), -6).as("B"))
So you can see that the UDF is called twice: df.withColumn("A", to_date(callUDF("randomUDF"))) generates the date that ends up in column A, while add_months(to_date(callUDF("randomUDF")), -6).as("B") calls the UDF again, generates a new date, subtracts 6 months from it, and puts that date in column B.
That's why you are getting random dates.
The solution is to persist or cache the test dataframe:
val test = df.withColumn("A", callUDF("randomUDF")).cache()
val test2 = test.as("table").withColumn("B", add_months($"table.A", -6))
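As a quick sanity check (a minimal sketch; it assumes the cached test and the test2 dataframe defined just above), you can confirm that B now always equals A minus six months:

// After caching, the UDF is evaluated only once per row, so A and B stay consistent
// and this should print 0; without cache() it would typically be non-zero.
val mismatches = test2.filter("B != add_months(A, -6)").count()
println(s"Rows where B != add_months(A, -6): $mismatches")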
Related
Let's say I have the following dataset called customers:
lastVisit               | id
2018-08-08 12:23:43.234 | 11
2021-12-08 14:13:45.4   | 12
And the lastVisit field is of type Option[Timestamp]
I want to be able to perform the following...
val filteredCustomers = customers.filter($"lastVisit" > current date - x days)
so that I return all the customers that have a lastVisit date within the last x days.
This is what I have tried so far.
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.time.{LocalDate, ZoneId}
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.functions.{col, udf}

val timeFilter: Timestamp => Long = input => {
  val sdf = new SimpleDateFormat("yyyy-mm-dd")
  val visitDate = sdf.parse(input.toString).toInstant.atZone(ZoneId.systemDefault()).toLocalDate
  val dateNow = LocalDate.now()
  ChronoUnit.DAYS.between(visitDate, dateNow)
}
val timeFilterUDF = udf(timeFilter)

val filteredCustomers = customers.withColumn("days", timeFilterUDF(col("lastVisit")))
val filteredCustomers2 = filteredCustomers.filter($"days" < n)
This runs locally, but when I submit it as a Spark job to run on the full table I get a null pointer exception at the following lines:
val visitDate = sdf.parse(input.toString).toInstant.atZone(ZoneId.systemDefault()).toLocalDate
val dateNow = LocalDate.now()
The data looks good, so I am unsure what the problem could be. I also imagine there is a much better way to implement the logic I am trying to do; any advice would be greatly appreciated!
Thank you
@Xaleate, based on your query, it seems like you want to achieve the logic
current_date - lastVisit < x days
Did you try the datediff function already available in Spark? Here is a two-line solution using datediff:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object LastDateIssue {
  val spark: SparkSession = SparkSession.builder().appName("Last Date issue").master("local[*]").getOrCreate()

  def main(args: Array[String]): Unit = {
    import spark.implicits._
    // prepare customer data for test
    var customers = Map(
      "2018-08-08 12:23:43.234" -> 11,
      "2021-12-08 14:13:45.4" -> 12,
      "2022-02-01 14:13:45.4" -> 13)
      .toSeq
      .toDF("lastVisit", "id")
    // number of days
    val x: Int = 10
    customers = customers.filter(datediff(lit(current_date()), col("lastVisit")) < x)
    customers.show(20, truncate = false)
  }
}
This returns id = 13, as that is within the last 10 days (you could choose x accordingly):
+---------------------+---+
|lastVisit |id |
+---------------------+---+
|2022-02-01 14:13:45.4|13 |
+---------------------+---+
Use the date_sub function:
df.filter($"lastVisit" > date_sub(current_date(),n)).show(false)
Hi how's it going? I'm a Python developer trying to learn Spark Scala. My task is to create date range bins, and count the frequency of occurrences in each bin (histogram).
My input dataframe looks something like this
My bin edges are this (in Python):
bins = ["01-01-1990 - 12-31-1999","01-01-2000 - 12-31-2009"]
and the output dataframe I'm looking for is (counts of how many values from the original dataframe fall into each bin):
Is there anyone who can guide me on how to do this in Spark Scala? I'm a bit lost. Thank you.
Are you looking for a result like the following?
+------------------------+------------------------+
|01-01-1990 -- 12-31-1999|01-01-2000 -- 12-31-2009|
+------------------------+------------------------+
| 3| null|
| null| 2|
+------------------------+------------------------+
It can be achieved with a little bit of Spark SQL and the pivot function, as shown below.
Check out the left join condition.
val binRangeData = Seq(("01-01-1990","12-31-1999"),
("01-01-2000","12-31-2009"))
val binRangeDf = binRangeData.toDF("start_date","end_date")
// binRangeDf.show
val inputDf = Seq((0,"10-12-1992"),
(1,"11-11-1994"),
(2,"07-15-1999"),
(3,"01-20-2001"),
(4,"02-01-2005")).toDF("id","input_date")
// inputDf.show
binRangeDf.createOrReplaceTempView("bin_range")
inputDf.createOrReplaceTempView("input_table")
val countSql = """
SELECT concat(date_format(c.st_dt,'MM-dd-yyyy'),' -- ',date_format(c.end_dt,'MM-dd-yyyy')) as header, c.bin_count
FROM (
(SELECT
b.st_dt, b.end_dt, count(1) as bin_count
FROM
(select to_date(input_date,'MM-dd-yyyy') as date_input , * from input_table) a
left join
(select to_date(start_date,'MM-dd-yyyy') as st_dt, to_date(end_date,'MM-dd-yyyy') as end_dt from bin_range ) b
on
a.date_input >= b.st_dt and a.date_input < b.end_dt
group by 1,2) ) c"""
val countDf = spark.sql(countSql)
countDf.groupBy("bin_count").pivot("header").sum("bin_count").drop("bin_count").show
Note that since you have 2 bin ranges, 2 rows will be generated.
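If you would rather have both counts on a single row instead of two rows with nulls, one possible tweak (a sketch, not tested against your exact data) is to pivot over an empty grouping:

// Pivoting with no grouping columns collapses everything into one row:
// each bin header becomes a column holding its own count.
countDf.groupBy().pivot("header").sum("bin_count").show(false)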
We can achieve this by looking at the date column and determining within which range each record falls.
// First we set up the problem
// Create a format that looks like yours
val dateFormat = java.time.format.DateTimeFormatter.ofPattern("MM-dd-yyyy")
// Get the current local date
val now = java.time.LocalDate.now
// Create a range of 1-10000 and map each to minusDays
// so we can have range of dates going 10000 days back
val dates = (1 to 10000).map(now.minusDays(_).format(dateFormat))
// Create a DataFrame we can work with.
val df = dates.toDF("date")
So far so good. We have date entries to work with, and they are like your format (MM-dd-yyyy).
Next up, we need a function which returns 1 if the date falls within range, and 0 if not. We create a UserDefinedFunction (UDF) from this function so we can apply it to all rows simultaneously across Spark executors.
// We will process each range one at a time, so we'll take it as a string
// and split it accordingly. Then we perform our tests. Using Dates is
// necessary to cater to your format.
import java.text.SimpleDateFormat
def isWithinRange(date: String, binRange: String): Int = {
  val format = new SimpleDateFormat("MM-dd-yyyy")
  val startDate = format.parse(binRange.split(" - ").head)
  val endDate = format.parse(binRange.split(" - ").last)
  val testDate = format.parse(date)

  if (!(testDate.before(startDate) || testDate.after(endDate))) 1
  else 0
}
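As a quick check, the helper behaves as expected on a couple of hand-picked dates (the values here are just for illustration):

isWithinRange("10-12-1992", "01-01-1990 - 12-31-1999")  // 1: falls inside this bin
isWithinRange("10-12-1992", "01-01-2000 - 12-31-2009")  // 0: falls outside this bin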
// We create a udf which uses an anonymous function taking two args and
// simply pass the values to our prepared function
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
def isWithinRangeUdf: UserDefinedFunction =
udf((date: String, binRange: String) => isWithinRange(date, binRange))
Now that we have our UDF set up, we create a new column per bin in our DataFrame, then aggregate and sum the values (hence why we made our function evaluate to an Int).
// We define our bins List
val bins = List("01-01-1990 - 12-31-1999",
"01-01-2000 - 12-31-2009",
"01-01-2010 - 12-31-2020")
// We fold through the bins list, creating a column from each bin at a time,
// enriching the DataFrame with more columns as we go
import org.apache.spark.sql.functions.{col, lit}
val withBinsDf = bins.foldLeft(df){(changingDf, bin) =>
changingDf.withColumn(bin, isWithinRangeUdf(col("date"), lit(bin)))
}
withBinsDf.show(1)
//+----------+-----------------------+-----------------------+-----------------------+
//| date|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//+----------+-----------------------+-----------------------+-----------------------+
//|09-01-2020| 0| 0| 1|
//+----------+-----------------------+-----------------------+-----------------------+
//only showing top 1 row
Finally we select our bin columns and aggregate each of them with sum (note the extra import for the sum function).

import org.apache.spark.sql.functions.sum

val binsDf = withBinsDf.select(bins.head, bins.tail:_*)
val sums = bins.map(b => sum(b).as(b)) // keep col name as is
val summedBinsDf = binsDf.groupBy().agg(sums.head, sums.tail:_*)
summedBinsDf.show
//+-----------------------+-----------------------+-----------------------+
//|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//+-----------------------+-----------------------+-----------------------+
//| 2450| 3653| 3897|
//+-----------------------+-----------------------+-----------------------+
2450 + 3653 + 3897 = 10000, so it seems our work was correct.
Perhaps I overdid it and there is a simpler solution, please let me know if you know a better way (especially to handle MM-dd-yyyy dates).
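For what it's worth, here is a UDF-free variant of the same idea, assuming Spark 2.2+ (where to_date accepts a format string); it is only a sketch, reusing the df and bins defined above:

import org.apache.spark.sql.functions.{col, lit, sum, to_date, when}

// Parse the date once, then count how many rows fall between each bin's parsed edges.
val parsed = df.withColumn("d", to_date(col("date"), "MM-dd-yyyy"))
val binSums = bins.map { bin =>
  val Array(start, end) = bin.split(" - ")
  sum(when(col("d").between(to_date(lit(start), "MM-dd-yyyy"),
                            to_date(lit(end), "MM-dd-yyyy")), 1).otherwise(0)).as(bin)
}
parsed.agg(binSums.head, binSums.tail: _*).show(false)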
I want to group the records by date, but the date is an epoch timestamp in milliseconds.
Here is the sample data:
date, Col1
1506838074000, a
1506868446000, b
1506868534000, c
1506869064000, a
1506869211000, c
1506871846000, f
1506874462000, g
1506879651000, a
Here is what I'm trying to achieve:
date         Count of records
02-10-2017   4
04-10-2017   3
03-10-2017   5
Here is the code with which I tried to group by date:
import java.text.SimpleDateFormat
val dateformat:SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")
val df = sqlContext.read.csv("<path>")
val result = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
But while executing the code I get the exception below:
<console>:30: error: value toLong is not a member of org.apache.spark.sql.ColumnName
val t = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
Please help me to resolve the issue.
You would need to change the date column, which seems to be a long, to a date data type. This can be done with the from_unixtime built-in function; then it's just groupBy and agg calls using the count function.
import org.apache.spark.sql.functions._
def stringDate = udf((date: Long) => new java.text.SimpleDateFormat("dd-MM-yyyy").format(date))
df.withColumn("date", stringDate($"date"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
The answer above uses a udf, which should be avoided as much as possible, since a udf is a black box and requires serialization and deserialization of the columns.
Updated
Thanks to @philantrovert for his suggestion to divide by 1000.
import org.apache.spark.sql.functions._
df.withColumn("date", from_unixtime($"date"/1000, "yyyy-MM-dd"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
Both ways work.
I have 2 dataframes in Spark, as mentioned below.
val test = hivecontext.sql("select max(test_dt) as test_dt from abc");
test: org.apache.spark.sql.DataFrame = [test_dt: string]
val test1 = hivecontext.table("testing");
where test1 has columns like id, name, age, audit_dt.
I want to compare these 2 dataframes and filter rows from test1 where audit_dt > test_dt. Somehow I am not able to do that: I am able to compare audit_dt with a literal date using the lit function, but I am not able to compare it with a column of another dataframe.
This is how I compare against a literal date using the lit function:
val output = test1.filter(to_date(test1("audit_date")).gt(lit("2017-03-23")))
The max date in the test dataframe is 2017-04-26.
Data in the test1 dataframe:
Id,Name,Age,Audit_Dt
1,Rahul,23,2017-04-26
2,Ankit,25,2017-04-26
3,Pradeep,28,2017-04-27
I just need the data for Id = 3, since only that row satisfies the greater-than criterion against the max date.
I have already tried the option mentioned below, but it is not working:
val test = hivecontext.sql("select max(test_dt) as test_dt from abc")
val MAX_AUDIT_DT = test.first().toString()
val output = test.filter(to_date(test("audit_date")).gt((lit(MAX_AUDIT_DT))))
Can anyone suggest a way to compare it with the column of dataframe test?
Thanks
You can use a non-equi join, if both columns "test_dt" and "Audit_Dt" are of date type.
/// cast to correct type
import org.apache.spark.sql.functions.to_date
val new_test = test.withColumn("test_dt",to_date($"test_dt"))
val new_test1 = test1.withColumn("Audit_Dt", to_date($"Audit_Dt"))
/// join
new_test1.join(new_test, $"Audit_Dt" > $"test_dt")
.drop("test_dt").show()
+---+-------+---+----------+
| Id| Name|Age| Audit_Dt|
+---+-------+---+----------+
| 3|Pradeep| 28|2017-04-27|
+---+-------+---+----------+
Data
val test1 = sc.parallelize(Seq((1,"Rahul",23,"2017-04-26"),(2,"Ankit",25,"2017-04-26"),
(3,"Pradeep",28,"2017-04-27"))).toDF("Id","Name", "Age", "Audit_Dt")
val test = sc.parallelize(Seq(("2017-04-26"))).toDF("test_dt")
Try with this:
test1.filter(to_date(test1("audit_date")).gt(to_date(test("test_dt"))))
Store the value in a variable and use it in the filter.
val dtValue = test.select("test_dt")
OR
val dtValue = test.first().getString(0)
Now apply the filter:
val output = test1.filter(to_date(test1("audit_date")).gt(lit(dtValue)))
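Putting the second variant together end-to-end (a minimal sketch; test and test1 as defined in the question, with the date column name taken from the snippets above; adjust it if your actual column is Audit_Dt):

import org.apache.spark.sql.functions.{lit, to_date}

// Pull the max date out of the single-row `test` dataframe as a plain String,
// then compare the date column of test1 against it as a literal.
val maxDt: String = test.first().getString(0)  // e.g. "2017-04-26"
val output = test1.filter(to_date(test1("audit_date")).gt(lit(maxDt)))
output.show(false)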
I have a function "toDate(v:String):Timestamp" that takes a string an converts it into a timestamp with the format "MM-DD-YYYY HH24:MI:SS.NS".
I make a udf of the function:
val u_to_date = sqlContext.udf.register("u_to_date", toDate _)
The issue happens when I apply the UDF to dataframes: the resulting dataframe loses the last 3 nanosecond digits.
For example, when using the argument "0001-01-01 00:00:00.123456789",
the resulting dataframe will be in the format
[0001-01-01 00:00:00.123456]
I have even tried a dummy function that returns Timestamp.valueOf("1234-01-01 00:00:00.123456789"). When applying the udf of the dummy function, it still truncates the last 3 nanosecond digits.
I have looked into the sqlContext conf, and spark.sql.parquet.int96AsTimestamp is set to true (I also tried with it set to false).
I am at a loss here. What is causing the truncation of the last 3 digits?
Example
The function could be:
def date123(v: String): Timestamp = {
  Timestamp.valueOf("0001-01-01 00:00:00.123456789")
}
It's just a dummy function that should return a timestamp with full nanosecond precision.
Then I would make a udf:
val u_date123 = sqlContext.udf.register("u_date123", date123 _)
Example df:
val theRow = Row("blah")
val theRdd = sc.makeRDD(Array(theRow))
case class X(x: String)
val df = theRdd.map { case Row(s0) => X(s0.asInstanceOf[String]) }.toDF()
If I apply the udf to the dataframe df with a string column, it will return a dataframe that looks like '[0001-01-01 00:00:00.123456]'
df.select(u_date123($"x")).collect.foreach(println)
I think I found the issue.
In Spark 1.5 the size of the internal timestamp data type was changed from 12 bytes to 8 bytes, so timestamps are now stored with microsecond rather than nanosecond precision:
https://fossies.org/diffs/spark/1.4.1_vs_1.5.0/sql/catalyst/src/main/scala/org/apache/spark/sql/types/TimestampType.scala-diff.html
I tested on Spark 1.4.1, and it produces the full nanosecond precision.
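A quick way to convince yourself that the loss happens in Spark's internal representation rather than in java.sql.Timestamp itself (a minimal sketch):

import java.sql.Timestamp

val ts = Timestamp.valueOf("0001-01-01 00:00:00.123456789")
println(ts.getNanos)  // prints 123456789: the JVM object keeps full nanosecond precision
// Once the value is converted to Spark's 8-byte internal form (microseconds),
// everything below a microsecond is dropped, which is why it comes back as ...123456.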