I am trying to extract dates between two dates.
If my input is:
start date: 2020_04_02
end date: 2020_06_02
Output should be:
List("2020_04_02","2020_04_03", "2020_04_04", "2020_04_05", "2020_04_06")
So far I have tried:
val beginDate = LocalDate.parse(startDate, formatter)
val lastDate = LocalDate.parse(endDate, formatter)
beginDate.datesUntil(lastDate.plusDays(1))
  .iterator()
  .asScala
  .map(date => formatter.format(date))
  .toList
import java.time.format.DateTimeFormatter
private def formatter = DateTimeFormatter.ofPattern("yyyy_MM_dd")
But I think it could be done in a more refined way.
I'm not super proud of this, but I've done it this way before:
import java.time.LocalDate
val start = LocalDate.of(2020,1,1).toEpochDay
val end = LocalDate.of(2020,12,31).toEpochDay
val dates = (start to end).map(LocalDate.ofEpochDay(_).toString).toArray
You end up with:
dates: Array[String] = Array(2020-01-01, 2020-01-02, ..., 2020-12-31)
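If you need the underscore-separated format from the question instead of LocalDate's default ISO string, a small variation of the same idea (just a sketch) is to run each epoch day through a DateTimeFormatter:
import java.time.LocalDate
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("yyyy_MM_dd")
val start = LocalDate.of(2020, 4, 2).toEpochDay
val end = LocalDate.of(2020, 6, 2).toEpochDay

// Same epoch-day range as above, but formatted with the question's pattern
val dates = (start to end).map(d => fmt.format(LocalDate.ofEpochDay(d))).toList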
What you are doing is correct, but your end date does not match the output you are expecting:
import java.time.format.DateTimeFormatter
import java.time.LocalDate
import scala.jdk.CollectionConverters._
private def formatter = DateTimeFormatter.ofPattern("yyyy_MM_dd")
val startDate = "2020_04_02"
val endDate1 = "2020_04_06" // "2020_06_02"
val endDate2 = "2020_06_02"
val beginDate = LocalDate.parse(startDate, formatter)
val lastDate1 = LocalDate.parse(endDate1, formatter)
val lastDate2 = LocalDate.parse(endDate2, formatter)
val res1 = beginDate.datesUntil(lastDate1.plusDays(1)).iterator().asScala.map(date => formatter.format(date)).toList
val res2 = beginDate.datesUntil(lastDate2.plusDays(1)).iterator().asScala.map(date => formatter.format(date)).toList
println(res1) // List(2020_04_02, 2020_04_03, 2020_04_04, 2020_04_05, 2020_04_06)
println(res2) // List(2020_04_02, 2020_04_03, 2020_04_04, 2020_04_05, 2020_04_06 ... 2020_05_31, 2020_06_01, 2020_06_02)
This function will return List[String].
import java.time.format.DateTimeFormatter
import java.time.LocalDate
val dt1 = "2020_04_02"
val dt2 = "2020_04_06"
def DatesBetween(startDate: String, endDate: String): List[String] = {
  def ConvertToFormat = DateTimeFormatter.ofPattern("yyyy_MM_dd")
  val sdate = LocalDate.parse(startDate, ConvertToFormat)
  val edate = LocalDate.parse(endDate, ConvertToFormat)
  val DateRange = sdate.toEpochDay.until(edate.plusDays(1).toEpochDay).map(LocalDate.ofEpochDay).toList
  val ListofDateRange = DateRange.map(date => ConvertToFormat.format(date))
  ListofDateRange
}
println(DatesBetween(dt1,dt2))
I have defined a function to convert Epoch time to CET and am using that function, wrapped as a UDF, on a Spark DataFrame. It throws an error and does not let me use it. Please find my code below.
Function used to convert Epoch time to CET:
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, TimeZone}
import java.util.concurrent.TimeUnit
def convertNanoEpochToDateTime(
    d: Long,
    f: String = "dd/MM/yyyy HH:mm:ss.SSS",
    z: String = "CET",
    msPrecision: Int = 9
): String = {
  val sdf = new SimpleDateFormat(f)
  sdf.setTimeZone(TimeZone.getTimeZone(z))
  val date = new Date((d / Math.pow(10, 9).toLong) * 1000L)
  val stringTime = sdf.format(date)
  if (f.contains(".S")) {
    val lng = d.toString.length
    val milliSecondsStr = d.toString.substring(lng - 9, lng)
    stringTime.substring(0, stringTime.lastIndexOf(".") + 1) + milliSecondsStr.substring(0, msPrecision)
  }
  else stringTime
}
val epochToDateTime = udf(convertNanoEpochToDateTime _)
Below given Spark DataFrame uses the above defined UDF for converting Epoch time to CET
val df2 = df1.select($"messageID",$"messageIndex",epochToDateTime($"messageTimestamp").as("messageTimestamp"))
I am getting the below shown error, when I run the code
Any idea how am I supposed to proceed in this scenario ?
The Spark optimizer is telling you that your function is not a Function1, i.e. not a function that accepts a single parameter: yours has four input parameters. Although in Scala you are allowed to call it with only one argument because the other three have default values, Catalyst does not seem to work that way, so you will need to change the definition of your function to something like:
def convertNanoEpochToDateTime(
f: String = "dd/MM/yyyy HH:mm:ss.SSS"
)(z: String = "CET")(msPrecision: Int = 9)(d: Long): String
or
def convertNanoEpochToDateTime(f: String)(z: String)(msPrecision: Int)(d: Long): String
and put the default values in the udf creation:
val epochToDateTime = udf(
convertNanoEpochToDateTime("dd/MM/yyyy HH:mm:ss.SSS")("CET")(9) _
)
and try to define the SimpleDateFormat as a static transient value out of the function.
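Putting the two suggestions together, a sketch of the curried version (illustrative only; it keeps the question's logic, builds the SimpleDateFormat per call since it is not thread-safe, and fixes the former default values once, when the UDF is created):
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}
import org.apache.spark.sql.functions.udf

// Curried definition: after the first three parameter lists are applied,
// what remains is a plain Long => String, which udf() accepts
def convertNanoEpochToDateTime(f: String)(z: String)(msPrecision: Int)(d: Long): String = {
  val sdf = new SimpleDateFormat(f)
  sdf.setTimeZone(TimeZone.getTimeZone(z))
  val stringTime = sdf.format(new Date((d / 1000000000L) * 1000L))
  if (f.contains(".S")) {
    val nanosStr = d.toString.takeRight(9)
    stringTime.substring(0, stringTime.lastIndexOf(".") + 1) + nanosStr.take(msPrecision)
  } else stringTime
}

// Supply the former default values at UDF-creation time
val epochToDateTime = udf(convertNanoEpochToDateTime("dd/MM/yyyy HH:mm:ss.SSS")("CET")(9) _)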
I found what the error was due to and resolved it. The problem is that when I wrap the Scala function as a UDF, it expects 4 parameters, but I was passing only one. I removed 3 parameters from the function and moved those values inside it, since they are constants. Now, in the Spark DataFrame, I call the function with only 1 parameter and it works perfectly fine.
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, TimeZone}
import java.util.concurrent.TimeUnit
def convertNanoEpochToDateTime(d: Long): String = {
  val f: String = "dd/MM/yyyy HH:mm:ss.SSS"
  val z: String = "CET"
  val msPrecision: Int = 9
  val sdf = new SimpleDateFormat(f)
  sdf.setTimeZone(TimeZone.getTimeZone(z))
  val date = new Date((d / Math.pow(10, 9).toLong) * 1000L)
  val stringTime = sdf.format(date)
  if (f.contains(".S")) {
    val lng = d.toString.length
    val milliSecondsStr = d.toString.substring(lng - 9, lng)
    stringTime.substring(0, stringTime.lastIndexOf(".") + 1) + milliSecondsStr.substring(0, msPrecision)
  }
  else stringTime
}
val epochToDateTime = udf(convertNanoEpochToDateTime _)
import spark.implicits._
val df1 = List(1659962673251388155L,1659962673251388155L,1659962673251388155L,1659962673251388155L).toDF("epochTime")
val df2 = df1.select(epochToDateTime($"epochTime"))
I am trying to add dates (as strings) from an array into a Seq, while determining whether each one falls on a weekend day.
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, GregorianCalendar}
import org.apache.spark.sql.SparkSession
val arrDateEsti=dtfBaseNonLong.select("AAA").distinct().collect.map(_(0).toString);
var dtfDateCate = Seq(
  ("0000", "0")
);
for (a <- 0 to arrDateEsti.length - 1) {
  val dayDate: Date = dateFormat.parse(arrDateEsti(a));
  val cal = new GregorianCalendar
  cal.setTime(dayDate);
  if (cal.get(Calendar.DAY_OF_WEEK) == 1 || cal.get(Calendar.DAY_OF_WEEK) == 7) {
    dtfDateCate :+ (arrDateEsti(a), "1")
  } else {
    dtfDateCate :+ (arrDateEsti(a), "0")
  }
};
scala> dtfDateCate
res20: Seq[(String, String)] = List((0000,0))
It returns the same initial sequence. But if I run it on a single element, it works. What went wrong?
scala> val dayDate:Date = dateFormat.parse(arrDateEsti(0));
dayDate: java.util.Date = Thu Oct 15 00:00:00 CST 2020
scala> cal.setTime(dayDate);
scala> if (cal.get(Calendar.DAY_OF_WEEK)==1 || cal.get(Calendar.DAY_OF_WEEK)==7){
| dtfDateCate:+(arrDateEsti(0),"1")
| }else{
| dtfDateCate:+(arrDateEsti(0),"0")
| };
res14: Seq[(String, String)] = List((0000,0), (20201015,0))
I think this gets at what you're trying to do.
import java.time.LocalDate
import java.time.DayOfWeek.{SATURDAY, SUNDAY}
import java.time.format.DateTimeFormatter
//replace with dtfBaseNonLong.select(... code
val arrDateEsti = Seq("20201015", "20201017") //place holder
val dtFormat = DateTimeFormatter.ofPattern("yyyyMMdd")
val dtfDateCate = ("0000", "0") +:
  arrDateEsti.map { dt =>
    val day = LocalDate.parse(dt, dtFormat).getDayOfWeek()
    if (day == SATURDAY || day == SUNDAY) (dt, "1")
    else (dt, "0")
  }
//dtfDateCate = Seq((0000,0), (20201015,0), (20201017,1))
Yeah, it should be dtfDateCate = dtfDateCate :+ (arrDateEsti(a), "1"), so that the new Seq returned by :+ is assigned back.
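The underlying issue is that :+ on an immutable Seq returns a new collection instead of mutating the receiver, so the result must be reassigned. A minimal sketch:
var acc = Seq(("0000", "0"))
val entry = ("20201015", "0")
acc :+ entry        // builds a new Seq and discards it; acc is unchanged
acc = acc :+ entry  // reassigns, so the appended element is kept
println(acc)        // List((0000,0), (20201015,0))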
I have a DateType input in the function. If the input date falls on the weekend, I would like to skip Saturday and Sunday and get the next weekday; otherwise it should give the next day's date.
Example:
Input: Monday 1/1/2017 output: 1/2/2017 (which is Tuesday)
Input: Saturday 3/4/2017 output: 3/5/2017 (which is Monday)
I have gone through https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/functions.html but I don't see a ready-made function, so I think it will need to be created.
So far I have something that is:
val nextWeekDate = udf {(startDate: DateType) =>
val day= date_format(startDate,'E'
if(day=='Sat' or day=='Sun'){
nextWeekDate = next_day(startDate,'Mon')
}
else{
nextWeekDate = date_add(startDate, 1)
}
}
Need help to get it valid and working.
Using dates as strings:
import java.time.{DayOfWeek, LocalDate}
import java.time.format.DateTimeFormatter
// If that is your format date
object MyFormat {
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
}
object MainSample {
  import MyFormat._

  def main(args: Array[String]): Unit = {
    import java.sql.Date
    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{DateType, IntegerType}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import org.apache.spark.sql.functions._

    implicit val spark: SparkSession =
      SparkSession
        .builder()
        .appName("YourApp")
        .config("spark.master", "local")
        .getOrCreate()

    import spark.implicits._

    val someData = Seq(
      Row(1, "2013-01-30"),
      Row(2, "2012-01-01")
    )

    val schema = List(StructField("id", IntegerType), StructField("date", StringType))

    val sourceDF = spark.createDataFrame(spark.sparkContext.parallelize(someData), StructType(schema))
    sourceDF.show()

    val _udf = udf { (dt: String) =>
      // Parse your date, dt is a string
      val localDate = LocalDate.parse(dt, formatter)

      // Check the week day and add days in each case
      val newDate = if (localDate.getDayOfWeek == DayOfWeek.SATURDAY) {
        localDate.plusDays(2)
      } else if (localDate.getDayOfWeek == DayOfWeek.SUNDAY) {
        localDate.plusDays(1)
      } else {
        localDate.plusDays(1)
      }
      newDate.toString
    }

    sourceDF.withColumn("NewDate", _udf('date)).show()
  }
}
Here's a much simpler answer that's defined in spark-daria:
def nextWeekday(col: Column): Column = {
  val d = dayofweek(col)
  val friday = lit(6)
  val saturday = lit(7)
  when(col.isNull, null)
    .when(d === friday || d === saturday, next_day(col, "Mon"))
    .otherwise(date_add(col, 1))
}
You always want to stick with the native Spark functions whenever possible. This post explains the derivation of this function in greater detail.
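For reference, a minimal usage sketch (the DataFrame df and the column name some_date are hypothetical):
import org.apache.spark.sql.functions.col

// Append a column holding the next weekday computed from the existing date column
val withNext = df.withColumn("next_weekday", nextWeekday(col("some_date")))
withNext.show()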
I want to be able to filter on a date just like you would in normal SQL. Is that possible? I'm running into an issue with how to convert the string from the text file into a date.
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.log4j._
import java.text._
//import java.util.Date
import java.sql.Date
object BayAreaBikeAnalysis {
  case class Station(ID: Int, name: String, lat: Double, longitude: Double, dockCount: Int, city: String, installationDate: Date)
  case class Status(station_id: Int, bikesAvailable: Int, docksAvailable: Int, time: String)

  val dateFormat = new SimpleDateFormat("yyyy-MM-dd")

  def extractStations(line: String): Station = {
    val fields = line.split(",", -1)
    val station: Station = Station(fields(0).toInt, fields(1), fields(2).toDouble, fields(3).toDouble, fields(4).toInt, fields(5), dateFormat.parse(fields(6)))
    return station
  }

  def extractStatus(line: String): Status = {
    val fields = line.split(",", -1)
    val status: Status = Status(fields(0).toInt, fields(1).toInt, fields(2).toInt, fields(3))
    return status
  }

  def main(args: Array[String]) {
    // Set the log level to only print errors
    //Logger.getLogger("org").setLevel(Level.ERROR)

    // Use new SparkSession interface in Spark 2.0
    val spark = SparkSession
      .builder
      .appName("BayAreaBikeAnalysis")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "file:///C:/temp")
      .getOrCreate()

    //Load files into data sets
    import spark.implicits._
    val stationLines = spark.sparkContext.textFile("Data/station.csv")
    val stations = stationLines.map(extractStations).toDS().cache()

    val statusLines = spark.sparkContext.textFile("Data/status.csv")
    val statuses = statusLines.map(extractStatus).toDS().cache()

    //people.select("name").show()
    stations.select("installationDate").show()

    spark.stop()
  }
}
Obviously fields(6).toDate() doesn't compile but I'm not sure what to use.
I think this post is what you are looking for.
Also here you'll find a good tutorial for string parse to date.
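In short, the usual approach (a sketch using the question's "yyyy-MM-dd" pattern) is to parse the string with SimpleDateFormat and wrap the result in a java.sql.Date, which is the type Spark's encoders expect for a date field:
import java.text.SimpleDateFormat
import java.sql.Date

val dateFormat = new SimpleDateFormat("yyyy-MM-dd")

// java.util.Date -> java.sql.Date via the epoch-millisecond value
def toSqlDate(s: String): Date = new Date(dateFormat.parse(s).getTime)

toSqlDate("2013-08-23")  // java.sql.Date = 2013-08-23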
Hope this helps!
Following are the ways you can convert a string to a date in Scala.
(1) In case of java.util.Date:
val format = new SimpleDateFormat("yyyy-MM-dd")
format.parse("2017-09-28")
(2) In case of Joda's DateTime (a non-ISO string needs an explicit pattern):
DateTime.parse("09-28-2017", DateTimeFormat.forPattern("MM-dd-yyyy"))
Here is a helping function that takes on a string representing a date and transforms it into a Timestamp
import java.sql.Timestamp
import java.util.TimeZone
import java.text.{DateFormat, SimpleDateFormat}
def getTimeStamp(timeStr: String): Timestamp = {
  val dateFormat: DateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
  dateFormat.setTimeZone(TimeZone.getTimeZone("UTC"))
  val date: Option[Timestamp] = {
    try {
      Some(new Timestamp(dateFormat.parse(timeStr).getTime))
    } catch {
      // Timestamp.valueOf expects "yyyy-MM-dd HH:mm:ss", so fall back to the epoch on parse failure
      case _: Exception => Some(Timestamp.valueOf("1970-01-01 00:00:00"))
    }
  }
  date.getOrElse(Timestamp.valueOf(timeStr))
}
Obviously, you will need to change the input date format from "yyyy-MM-dd'T'HH:mm:ss" to whatever format your date string is in.
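If you want to apply it to a DataFrame column rather than to individual strings, one option (a sketch; df and the column name time_str are hypothetical) is to wrap it in a UDF:
import org.apache.spark.sql.functions.{col, udf}

// Wrap the helper so it can run on a string column
val toTimestamp = udf(getTimeStamp _)
val withTs = df.withColumn("ts", toTimestamp(col("time_str")))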
Hope this helps.
I have generated an Array of dates with the following code using jodatime
import org.joda.time.{DateTime, Period}
import org.joda.time.format.DateTimeFormat
import java.text.SimpleDateFormat
def dateRange(from: DateTime, to: DateTime, step: Period): Iterator[DateTime] =
  Iterator.iterate(from)(_.plus(step)).takeWhile(!_.isAfter(to))
val from = new DateTime(2000, 6, 30, 0, 0, 0, 0)
val to = new DateTime(2001, 6, 30, 0, 0, 0, 0)
val by = new Period(0, 2, 0, 0, 0, 0, 0, 0)
val range = dateRange(from, to, by)
val dateRaw = range.toArray
How can I pass DateTimeFormat.forPattern("YYYYMMdd") to each value in order to get an Array of integers of format yyyyMMdd
Array[Int] = Array(20000630,20000830,20001030...
val f = DateTimeFormat.forPattern("YYYYMMdd")
dateRaw.map(d => f.print(d).toInt)
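Mapping each DateTime through the formatter and converting the result to Int yields the desired Array(20000630, 20000830, 20001030, ...).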