How to generate unique date ranges - scala

I would like to generate unique date ranges between the current date and, say, 2050.
val start_date = "2017-03-21"
val end_date = "2050-03-21"
I am not sure how to create a function for this; any input is appreciated. The difference between the start and end dates can be anything.
"Unique date range" means the function never returns a date range that it has already returned.
I have something like this in mind (pseudocode):
val start_date = "2017-03-21"
val end_date = "2050-03-21"
while (start_date <= "2050-03-21") {
  end_date = start_date + 1
  return (start_date, end_date)
  start_date = start_date + 1
}

You can use java.time.LocalDate and java.time.temporal.ChronoUnit to achieve this:
scala> import java.time.LocalDate
import java.time.LocalDate
scala> import java.time.temporal.ChronoUnit
import java.time.temporal.ChronoUnit
scala> val startDate = LocalDate.parse("2017-03-21")
startDate: java.time.LocalDate = 2017-03-21
scala> val endDate = LocalDate.parse("2050-03-21")
endDate: java.time.LocalDate = 2050-03-21
scala> val dateAmount = 5
dateAmount: Int = 5
scala> val randomDates = List.fill(dateAmount) {
val randomAmt = ChronoUnit.DAYS.between(startDate, endDate) * math.random() // used to generate a random amount of days within given limits
startDate.plusDays(randomAmt.toInt) // returns a date from that random amount, will not go beyond endDate
}
randomDates: List[java.time.LocalDate] = List(2049-03-16, 2025-12-30, 2042-04-20, 2027-03-14, 2031-03-15)
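Note that List.fill with math.random() can return the same date twice, so strict uniqueness is not guaranteed. Below is a minimal sketch of one way to enforce it, assuming a "range" here is a consecutive (start, start + 1 day) pair, which is an assumption beyond the answer above: draw candidates lazily and keep only those not seen before.
// Sketch only: draws random one-day ranges and skips any range already returned
import java.time.LocalDate
import java.time.temporal.ChronoUnit
import scala.collection.mutable

val startDate = LocalDate.parse("2017-03-21")
val endDate = LocalDate.parse("2050-03-21")
val totalDays = ChronoUnit.DAYS.between(startDate, endDate)

val seen = mutable.Set.empty[(LocalDate, LocalDate)]
val uniqueRanges = Iterator
  .continually {
    val offset = (math.random() * totalDays).toLong
    val s = startDate.plusDays(offset)
    (s, s.plusDays(1)) // assumed one-day range
  }
  .filter(seen.add) // Set.add returns true only the first time a range is seen
  .take(5)
  .toList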

Related

Date conversion to timestamp in EPOCH

I am looking to convert a date to the previous day at 12:00 AM, as an epoch time, in Spark.
val dateInformat=DateTimeFormatter.ofPattern("MM-dd-yyyy")
val batchpartitiondate= LocalDate.parse("10-14-2022",dateInformat)
batchpartitiondate: java.time.LocalDate = 2022-10-14
batchpartitiondate should be converted to epoch time (1665619200).
For example:
If the InputDate in the spark-submit argument is 10-15-2022,
I need the output as epoch time (1665705600), i.e. GMT: Friday, October 14, 12:00:00 AM.
If I give the input as 10-14-2022, it should give the output as epoch time (1665619200), i.e. GMT: Thursday, October 13, 12:00:00 AM.
Does this achieve what you are looking to do?
import java.time.{LocalDate, ZoneId, ZoneOffset}
import java.time.format.DateTimeFormatter

val dateInFormat = DateTimeFormatter.ofPattern("MM-dd-yyyy")
val batchPartitionDate = LocalDate.parse("10-14-2022", dateInFormat)
val alteredDateTime = batchPartitionDate.minusDays(1).atStartOfDay()

// current zone
{
  val zone = ZoneId.systemDefault()
  val instant = alteredDateTime.atZone(zone).toInstant
  val epochMillis = instant.toEpochMilli
}

// UTC
// Or you can specify the appropriate time zone instead of UTC
{
  val zone = ZoneOffset.UTC
  val instant = alteredDateTime.toInstant(zone)
  val epochMillis = instant.toEpochMilli
}

Filter a scala dataset on a Option[timestamp] column to return dates that are within n days of current date

Let's say I have the following dataset called customers:
lastVisit                  id
2018-08-08 12:23:43.234    11
2021-12-08 14:13:45.4      12
And the lastVisit field is of type Option[Timestamp]
I want to be able to perform the following...
val filteredCustomers = customers.filter($"lastVisit" > current date - x days)
so that I return all the customers that have a lastVisit date within the last x days.
This is what I have tried so far:
val timeFilter: Timestamp => Long = input => {
  val sdf = new SimpleDateFormat("yyyy-mm-dd")
  val visitDate = sdf.parse(input.toString).toInstant.atZone(ZoneId.systemDefault()).toLocalDate
  val dateNow = LocalDate.now()
  ChronoUnit.DAYS.between(visitDate, dateNow)
}
val timeFilterUDF = udf(timeFilter)
val filteredCustomers = customers.withColumn("days", timeFilterUDF(col("lastVisit")))
val filteredCustomers2 = filteredCustomers.filter($"days" < n)
This runs locally, but when I submit it as a Spark job to run on the full table I get a NullPointerException on the following line:
val visitDate = sdf.parse(input.toString).toInstant.atZone(ZoneId.systemDefault()).toLocalDate
val dateNow = LocalDate.now()
The data looks good, so I am unsure what the problem could be. I also imagine there is a much better way to implement the logic I am trying to do; any advice would be greatly appreciated!
Thank you
@Xaleate, based on your query, it seems like you want to achieve the logic
current_date - lastVisit < x days
Did you try the datediff function already available in Spark? Here is a two-line solution using datediff:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object LastDateIssue {
  val spark: SparkSession = SparkSession.builder().appName("Last Date issue").master("local[*]").getOrCreate()

  def main(args: Array[String]): Unit = {
    import spark.implicits._

    // prepare customer data for test
    var customers = Map(
      "2018-08-08 12:23:43.234" -> 11,
      "2021-12-08 14:13:45.4" -> 12,
      "2022-02-01 14:13:45.4" -> 13)
      .toSeq
      .toDF("lastVisit", "id")

    // number of days
    val x: Int = 10
    customers = customers.filter(datediff(lit(current_date()), col("lastVisit")) < x)
    customers.show(20, truncate = false)
  }
}
This returns id = 13, as that is within the last 10 days (you could choose x accordingly):
+---------------------+---+
|lastVisit |id |
+---------------------+---+
|2022-02-01 14:13:45.4|13 |
+---------------------+---+
Use the date_sub function:
df.filter($"lastVisit" > date_sub(current_date(),n)).show(false)

How to subtract days in scala?

I parsed a string to a date:
val deathTime = "2019-03-14 05:22:45"
val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val deathDate = new java.sql.Date(dateFormat.parse(deathTime).getTime)
Now, I want to subtract 30 days from deathDate. How do I do that? I have tried
deathDate.minusDays(30)
but it does not work.
If your requirement is to do this with a DataFrame in Spark/Scala:
df.select(date_add(lit(current_date),-30)).show
+-----------------------------+
|date_add(current_date(), -30)|
+-----------------------------+
| 2019-03-02|
+-----------------------------+
The date_add function with a negative value, or date_sub with a positive value, achieves the desired result.
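If the value lives in a DataFrame column rather than a local variable, the same functions apply per row. A small sketch, where deathTime is a hypothetical string column holding values like the one in the question:
import org.apache.spark.sql.functions.{col, date_sub, to_timestamp}

// Parse the assumed string column and subtract 30 days; date_sub returns a DateType column
val withShiftedDate = df.withColumn(
  "deathDateMinus30",
  date_sub(to_timestamp(col("deathTime"), "yyyy-MM-dd HH:mm:ss"), 30)
)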
If you are on Java 8, you can parse the date as a LocalDateTime. LocalDateTime allows operations on dates - https://docs.oracle.com/javase/8/docs/api/java/time/LocalDateTime.html#minusDays-long-
scala> import java.time.LocalDateTime
import java.time.LocalDateTime
scala> import java.time.format.DateTimeFormatter
import java.time.format.DateTimeFormatter
scala> val deathTime = "2019-03-14 05:22:45"
deathTime: String = 2019-03-14 05:22:45
scala> val deathDate = LocalDateTime.parse(deathTime, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"))
deathDate: java.time.LocalDateTime = 2019-03-14T05:22:45
scala> deathDate.minusDays(30)
res1: java.time.LocalDateTime = 2019-02-12T05:22:45
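If you then need the java.sql types used in the question, the shifted value converts back directly (a small sketch building on deathDate above):
// Convert the shifted LocalDateTime back to the java.sql types from the question, if needed
val thirtyDaysEarlier = deathDate.minusDays(30)
val asSqlTimestamp = java.sql.Timestamp.valueOf(thirtyDaysEarlier)   // 2019-02-12 05:22:45.0
val asSqlDate = java.sql.Date.valueOf(thirtyDaysEarlier.toLocalDate) // 2019-02-12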
Also see Java: Easiest Way to Subtract Dates

convert integer into date to count number of days

I need to convert an Integer to date (yyyy-MM-dd) format in order to calculate the number of days in between.
registryDate
20130826
20130829
20130816
20130925
20130930
20130926
Desired output:
registryDate TodaysDate DaysInBetween
20130826 2018-11-24 1916
20130829 2018-11-24 1913
20130816 2018-11-24 1926
You can cast registryDate to String type, then apply to_date and datediff to compute the difference in days, as shown below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import java.sql.Date
import spark.implicits._ // needed for toDF; pre-imported in spark-shell

val df = Seq(
  20130826, 20130829, 20130816, 20130825
).toDF("registryDate")

df.
  withColumn("registryDate2", to_date($"registryDate".cast(StringType), "yyyyMMdd")).
  withColumn("todaysDate", lit(Date.valueOf("2018-11-24"))).
  withColumn("DaysInBetween", datediff($"todaysDate", $"registryDate2")).
  show
// +------------+-------------+----------+-------------+
// |registryDate|registryDate2|todaysDate|DaysInBetween|
// +------------+-------------+----------+-------------+
// | 20130826| 2013-08-26|2018-11-24| 1916|
// | 20130829| 2013-08-29|2018-11-24| 1913|
// | 20130816| 2013-08-16|2018-11-24| 1926|
// | 20130825| 2013-08-25|2018-11-24| 1917|
// +------------+-------------+----------+-------------+
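If the comparison should be against the actual current date rather than the hard-coded 2018-11-24, a small variation on the same pipeline (same assumptions as above) works:
// Same transformation, but todaysDate comes from current_date()
df.
  withColumn("registryDate2", to_date($"registryDate".cast(StringType), "yyyyMMdd")).
  withColumn("todaysDate", current_date()).
  withColumn("DaysInBetween", datediff($"todaysDate", $"registryDate2")).
  show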

How to order string of exact format (dd-MM-yyyy HH:mm) using sparkSQL or Dataframe API

I want a dataframe to be reordered in ascending order based on a datetime column which is in the format "23-07-2018 16:01".
My program sorts to the date level but not down to HH:mm. I want the output to be sorted by the HH:mm details as well.
package com.spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{to_date, to_timestamp}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object conversion {
  def main(args: Array[String]) = {
    val spark = SparkSession.builder().master("local").appName("conversion").enableHiveSupport().getOrCreate()
    import spark.implicits._

    val sourceDF = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("D:\\2018_Sheet1.csv")

    val modifiedDF = sourceDF.withColumn("CredetialEndDate", to_date($"CredetialEndDate", "dd-MM-yyyy HH:mm"))
    // This converts into "dd-MM-yyyy" but "dd-MM-yyyy HH:mm" is expected.
    // What is the equivalent DataFrame API to convert the string to HH:mm?

    modifiedDF.createOrReplaceGlobalTempView("conversion")

    val sortedDF = spark.sql("select * from global_temp.conversion order by CredetialEndDate ASC").show(50)
    // dd-MM-yyyy 23-07-2018 16:01
  }
}
So my result should have the column in the format "23-07-2018 16:01" instead of just "23-07-2018", and it should be sorted in ascending order.
The method to_date converts the column into a DateType which has date only, no time. Try to use to_timestamp instead.
Edit: If you want to do the sorting but keep the original string representation you can do something like:
val modifiedDF = sourceDF.withColumn("SortingColumn",to_timestamp($"CredetialEndDate","dd-MM-yyyy HH:mm"))
and then modify the result to:
val sortedDF = spark.sql("select * from global_temp.conversion order by SortingColumn ASC").drop("SortingColumn").show(50)
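Alternatively, skipping the temp view, the same sort can be expressed with the DataFrame API alone (a sketch under the same assumptions about sourceDF and the column name):
// Sort on a temporary timestamp column, then drop it to keep the original string format
val sortedDF = sourceDF
  .withColumn("SortingColumn", to_timestamp($"CredetialEndDate", "dd-MM-yyyy HH:mm"))
  .orderBy($"SortingColumn".asc)
  .drop("SortingColumn")

sortedDF.show(50, truncate = false)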