How to automate the creation of String elements with datetime - scala

This thread arises from my previous question. I need to create a Seq[String] that contains paths as String elements, but now I also need to append the numbers 7, 8, ..., 22 after each date. I also cannot use LocalDate, as was suggested in the answer to the above-cited question:
path/file_2017-May-1-7
path/file_2017-May-1-8
...
path/file_2017-May-1-22
path/file_2017-April-30-7
path/file_2017-April-30-8
...
path/file_2017-April-30-22
...
I am searching for a flexible solution. My current solution involves defining the dates yyyy-MMM-dd manually, which does not scale if I need to include more than 2 dates, e.g. 10 or 100. Moreover, filePathsList is currently a Set[Seq[String]] and I don't know how to convert it into a Seq[String].
import java.text.SimpleDateFormat
import java.util.Calendar

val formatter = new SimpleDateFormat("yyyy-MMM-dd")
val currDay = Calendar.getInstance
currDay.add(Calendar.DATE, -1)
val day_1_ago = currDay.getTime
currDay.add(Calendar.DATE, -1)
val day_2_ago = currDay.getTime
val dates = Set(formatter.format(day_1_ago), formatter.format(day_2_ago))
val filePathsList = dates.map { date =>
  var list = Seq.empty[String]
  for (num <- 7 to 22) {
    list :+= s"path/file_$date-$num"
  }
  list
}

Here is how I was able to achieve what you outlined; adjust the days val to configure the number of days you care about:
import java.text.SimpleDateFormat
import java.util.Calendar

val currDay = Calendar.getInstance
val days = 5
val dates = currDay.getTime +: List.fill(days) {
  currDay.add(Calendar.DATE, -1)
  currDay.getTime
}
val formatter = new SimpleDateFormat("yyyy-MMM-dd")
val filePathsList = for {
  date <- dates
  num <- 7 to 22
} yield s"path/file_${formatter.format(date)}-$num"
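Two notes on this answer. The for-comprehension already yields a flat List[String], so there is no Set[Seq[String]] left to convert; if you do keep your original Set[Seq[String]] shape, calling .toSeq.flatten on it collapses it in one step. As a quick sanity check on the result's shape (reusing the vals from the snippet above):
// (days + 1) dates, newest first, times 16 numbers (7 to 22) per date
assert(filePathsList.size == (days + 1) * 16)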

Related

Spark scala UDF in DataFrames is not working

I have defined a function to convert epoch time to CET and am using it wrapped as a UDF on a Spark DataFrame. It throws an error and does not let me use it. Please find my code below.
The function used to convert epoch time to CET:
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, TimeZone}
import java.util.concurrent.TimeUnit

def convertNanoEpochToDateTime(
    d: Long,
    f: String = "dd/MM/yyyy HH:mm:ss.SSS",
    z: String = "CET",
    msPrecision: Int = 9
): String = {
  val sdf = new SimpleDateFormat(f)
  sdf.setTimeZone(TimeZone.getTimeZone(z))
  // truncate the nanosecond epoch to seconds, then convert to milliseconds
  val date = new Date((d / Math.pow(10, 9).toLong) * 1000L)
  val stringTime = sdf.format(date)
  if (f.contains(".S")) {
    // splice the full nanosecond fraction back in, trimmed to msPrecision digits
    val lng = d.toString.length
    val milliSecondsStr = d.toString.substring(lng - 9, lng)
    stringTime.substring(0, stringTime.lastIndexOf(".") + 1) + milliSecondsStr.substring(0, msPrecision)
  }
  else stringTime
}
import org.apache.spark.sql.functions.udf
val epochToDateTime = udf(convertNanoEpochToDateTime _)
The Spark DataFrame below uses the UDF defined above to convert epoch time to CET:
val df2 = df1.select($"messageID",$"messageIndex",epochToDateTime($"messageTimestamp").as("messageTimestamp"))
I am getting the error shown below when I run the code. Any idea how I am supposed to proceed in this scenario?
The Spark optimizer tells you that your function is not a Function1, meaning it is not a function that accepts one parameter: yours has four input parameters. And although you may think that Scala lets you call that function with only one parameter because the other three have default values, Catalyst does not work that way, so you will need to change the definition of your function to something like:
def convertNanoEpochToDateTime(
    f: String = "dd/MM/yyyy HH:mm:ss.SSS"
)(z: String = "CET")(msPrecision: Int = 9)(d: Long): String
or
def convertNanoEpochToDateTime(f: String)(z: String)(msPrecision: Int)(d: Long): String
and put the default values in the udf creation:
val epochToDateTime = udf(
  convertNanoEpochToDateTime("dd/MM/yyyy HH:mm:ss.SSS")("CET")(9) _
)
and try to define the SimpleDateFormat as a static, transient value outside the function.
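A minimal sketch of that last suggestion, assuming a helper object of our own naming (SimpleDateFormat is also not thread-safe, which is one more reason not to share a single instance across concurrent code paths):
import java.text.SimpleDateFormat
import java.util.TimeZone

// Hypothetical holder object: @transient lazy val means the formatter is not
// shipped inside serialized closures; each JVM (driver or executor) creates
// its own instance on first use.
object DateFormats extends Serializable {
  @transient lazy val cetFormatter: SimpleDateFormat = {
    val sdf = new SimpleDateFormat("dd/MM/yyyy HH:mm:ss.SSS")
    sdf.setTimeZone(TimeZone.getTimeZone("CET"))
    sdf
  }
}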
I found what the error was due to and resolved it. The problem is that when I wrap the Scala function as a UDF, it expects 4 parameters, but I was passing only one. I have now removed 3 parameters from the function and moved those values inside the function itself, since they are constants. In the Spark DataFrame I call the function with only 1 parameter and it works perfectly fine.
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, TimeZone}
import java.util.concurrent.TimeUnit
import org.apache.spark.sql.functions.udf

def convertNanoEpochToDateTime(d: Long): String = {
  val f: String = "dd/MM/yyyy HH:mm:ss.SSS"
  val z: String = "CET"
  val msPrecision: Int = 9
  val sdf = new SimpleDateFormat(f)
  sdf.setTimeZone(TimeZone.getTimeZone(z))
  val date = new Date((d / Math.pow(10, 9).toLong) * 1000L)
  val stringTime = sdf.format(date)
  if (f.contains(".S")) {
    val lng = d.toString.length
    val milliSecondsStr = d.toString.substring(lng - 9, lng)
    stringTime.substring(0, stringTime.lastIndexOf(".") + 1) + milliSecondsStr.substring(0, msPrecision)
  }
  else stringTime
}

val epochToDateTime = udf(convertNanoEpochToDateTime _)

import spark.implicits._
val df1 = List(1659962673251388155L, 1659962673251388155L, 1659962673251388155L, 1659962673251388155L).toDF("epochTime")
val df2 = df1.select(epochToDateTime($"epochTime"))

Populating an array of date tuples

I am trying to pass a list of date ranges that needs to be in the format below.
val predicates =
  Array(
    "2021-05-16" -> "2021-05-17",
    "2021-05-18" -> "2021-05-19",
    "2021-05-20" -> "2021-05-21")
I am then using map to create a range of conditions that will be passed to the jdbc method.
val predicates =
  Array(
    "2021-05-16" -> "2021-05-17",
    "2021-05-18" -> "2021-05-19",
    "2021-05-20" -> "2021-05-21"
  ).map { case (start, end) =>
    s"cast(NEW_DT as date) >= date '$start' AND cast(NEW_DT as date) <= date '$end'"
  }
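For illustration, the first element this map produces is the predicate string:
cast(NEW_DT as date) >= date '2021-05-16' AND cast(NEW_DT as date) <= date '2021-05-17'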
The process will need to run daily, and I need to populate these values dynamically since I cannot hard-code them. I need help returning these values from a method as tuples with incrementing start_date and end_date, generated like the above. I had a rough idea like the one below, but as I am new to Scala I have not been able to figure it out. Please help.
def predicateRange(start_date: String, end_date: String): Array[(String, String)] = {
  // iterate over the date values and add +1 to both start and end and return the tuple
}
This assumes that every range is the same duration, and that each date range starts the next day after the end of the previous range.
import java.time.LocalDate
import java.time.format.DateTimeFormatter

def dateRanges(start: String, rangeLen: Int, ranges: Int): Array[(String, String)] = {
  val startDate =
    LocalDate.parse(start, DateTimeFormatter.ofPattern("yyyy-MM-dd"))
  Array.iterate(startDate -> startDate.plusDays(rangeLen), ranges) {
    case (_, end) => end.plusDays(1) -> end.plusDays(rangeLen + 1)
  }.map { case (s, e) => (s.toString, e.toString) }
}
Usage:
dateRanges("2021-05-16", 1, 3)
//res0: Array[(String, String)] = Array((2021-05-16,2021-05-17), (2021-05-18,2021-05-19), (2021-05-20,2021-05-21))
You can use the following method to generate your tuple array:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

def generateArray(startDateString: String, endDateString: String): Array[(String, String)] = {
  val dateFormatter = DateTimeFormatter.ISO_LOCAL_DATE
  val startDate = LocalDate.parse(startDateString)
  val endDate = LocalDate.parse(endDateString)
  // ChronoUnit.DAYS gives the total day count; Period#getDays would return only
  // the days component and break for ranges spanning a month or more
  val daysCount = ChronoUnit.DAYS.between(startDate, endDate).toInt
  val dateStringTuples = Array.tabulate(daysCount) { i =>
    val firstDate = startDate.plusDays(i)
    val secondDate = startDate.plusDays(i + 1)
    (dateFormatter.format(firstDate), dateFormatter.format(secondDate))
  }
  dateStringTuples
}
Usage:
println("--------------------------")
generateArray("2021-02-27", "2021-03-02").foreach(println)
println("--------------------------")
generateArray("2021-05-27", "2021-06-02").foreach(println)
println("--------------------------")
generateArray("2021-12-27", "2022-01-02").foreach(println)
println("--------------------------")
Output:
--------------------------
(2021-02-27,2021-02-28)
(2021-02-28,2021-03-01)
(2021-03-01,2021-03-02)
--------------------------
(2021-05-27,2021-05-28)
(2021-05-28,2021-05-29)
(2021-05-29,2021-05-30)
(2021-05-30,2021-05-31)
(2021-05-31,2021-06-01)
(2021-06-01,2021-06-02)
--------------------------
(2021-12-27,2021-12-28)
(2021-12-28,2021-12-29)
(2021-12-29,2021-12-30)
(2021-12-30,2021-12-31)
(2021-12-31,2022-01-01)
(2022-01-01,2022-01-02)
--------------------------

How to take data from several parquet files at once?

I need your help because I am new to the Spark framework.
I have a folder with a lot of parquet files. The names of these files all have the same format: DD-MM-YYYY, e.g. '01-10-2018', '02-10-2018', '03-10-2018', etc.
My application has two input parameters: dateFrom and dateTo.
When I try to use the following code, the application hangs. It seems like the application scans all the files in the folder.
val mf = spark.read.parquet("/PATH_TO_THE_FOLDER/*")
  .filter($"DATE".between(dateFrom + " 00:00:00", dateTo + " 23:59:59"))
mf.show()
I need to fetch the data for a period as fast as possible.
I think it would be great to divide the period into days and then read the files separately and union them, like this:
val mf1 = spark.read.parquet("/PATH_TO_THE_FOLDER/01-10-2018")
val mf2 = spark.read.parquet("/PATH_TO_THE_FOLDER/02-10-2018")
// `final` is a reserved word in Scala, so the result needs a different name
val finalDf = mf1.union(mf2).distinct()
dateFrom and dateTo are dynamic, so I don't know how to organize the code correctly right now. Please help!
@y2k-shubham I tried to test the following code, but it raises an error:
import org.joda.time.{DateTime, Days}
import org.apache.spark.sql.{DataFrame, SparkSession}

val dateFrom = DateTime.parse("2018-10-01")
val dateTo = DateTime.parse("2018-10-05")

def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays

def getDatesInBetween(from: DateTime, to: DateTime): Seq[DateTime] = {
  val days = getDaysInBetween(from, to)
  (0 to days).map(day => from.plusDays(day).withTimeAtStartOfDay())
}

val datesInBetween: Seq[DateTime] = getDatesInBetween(dateFrom, dateTo)
val unionDf: DataFrame = datesInBetween.foldLeft(spark.emptyDataFrame) { (intermediateDf: DataFrame, date: DateTime) =>
  intermediateDf.union(spark.read.parquet("PATH" + date.toString("yyyy-MM-dd") + "/*.parquet"))
}
unionDf.show()
ERROR:
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 0 columns and the second table has 20 columns;
It seems the intermediateDf DataFrame is empty at the start. How can I fix the problem?
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.{DataFrame, SparkSession}

val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")

def dateRangeInclusive(start: String, end: String): Iterator[LocalDate] = {
  val startDate = LocalDate.parse(start, formatter)
  val endDate = LocalDate.parse(end, formatter)
  Iterator.iterate(startDate)(_.plusDays(1))
    .takeWhile(d => d.isBefore(endDate) || d.isEqual(endDate))
}

val spark = SparkSession.builder().getOrCreate()
val data: DataFrame = dateRangeInclusive("2018-10-01", "2018-10-05")
  .map(d => spark.read.parquet(s"/path/to/directory/${formatter.format(d)}"))
  .reduce(_ union _)
I also suggest using the native JSR 310 API (part of Java SE since Java 8) rather than joda-time, since it is more modern and does not require external dependencies. Note that first creating a sequence of paths and doing map+reduce is probably simpler for this use case than a more general foldLeft-based solution.
Additionally, you can use reduceOption; then you'll get None instead of an exception if the input date range is empty. Also, if it is possible for some input directories/files to be missing, you'd want to do a check before invoking spark.read.parquet. If your data is on HDFS, you should probably use the Hadoop FS API:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val spark = SparkSession.builder().getOrCreate()
val fs = FileSystem.get(new Configuration(spark.sparkContext.hadoopConfiguration))
val data: Option[DataFrame] = dateRangeInclusive("2018-10-01", "2018-10-05")
  .map(d => s"/path/to/directory/${formatter.format(d)}")
  .filter(p => fs.exists(new Path(p)))
  .map(spark.read.parquet(_))
  .reduceOption(_ union _)
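Alternatively, since DataFrameReader.parquet accepts several paths as varargs, the union can be skipped entirely by handing all existing paths to a single read; a minimal sketch reusing the helpers above:
// collect the per-day paths that actually exist, then read them in one call
val paths: Seq[String] = dateRangeInclusive("2018-10-01", "2018-10-05")
  .map(d => s"/path/to/directory/${formatter.format(d)}")
  .filter(p => fs.exists(new Path(p)))
  .toSeq
// passing several paths to one parquet() call yields a single DataFrame
val dataAllAtOnce: Option[DataFrame] =
  if (paths.isEmpty) None else Some(spark.read.parquet(paths: _*))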
While I haven't tested this piece of code, it should work (perhaps with a slight modification):
import org.joda.time.{DateTime, Days}
import org.apache.spark.sql.{DataFrame, SparkSession}

// return the number of days between two dates
def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays

// return the sequence of dates between two dates
def getDatesInBetween(from: DateTime, to: DateTime): Seq[DateTime] = {
  val days = getDaysInBetween(from, to)
  (0 to days).map(day => from.plusDays(day).withTimeAtStartOfDay())
}

// read parquet data of the given date-range from the given path
// (you might want to pass SparkSession in a different manner)
def readDataForDateRange(path: String, from: DateTime, to: DateTime)(implicit spark: SparkSession): DataFrame = {
  // get the date-range sequence
  val datesInBetween: Seq[DateTime] = getDatesInBetween(from, to)
  // read data of from-date (needed because the schema of all DataFrames should be the same for union)
  val fromDateDf: DataFrame = spark.read.parquet(path + "/" + datesInBetween.head.toString("yyyy-MM-dd"))
  // read and union the remaining DataFrames (functionally)
  val unionDf: DataFrame = datesInBetween.tail.foldLeft(fromDateDf) { (intermediateDf: DataFrame, date: DateTime) =>
    intermediateDf.union(spark.read.parquet(path + "/" + date.toString("yyyy-MM-dd")))
  }
  // return the union DataFrame
  unionDf
}
Reference: How to calculate 'n' days interval date in functional style?

Create a list of String elements titled according to dates

I want to create a list of String elements, each one having a date in its title:
data_2017_May_4
data_2017_May_3
data_2017_May_2
The important thing is how these dates are created: starting from the current date and going back 2 days. If the current date is May 1 2017, then the result would be:
data_2017_May_1
data_2017_April_30
data_2017_April_29
The same logic applies to the switch between years (December/January).
This is my code, but it does not handle the changes of months and years, and it also skips dates:
import java.text.SimpleDateFormat
import java.util.Calendar

val formatter = new SimpleDateFormat("yyyy-MMM-dd")
val currDay = Calendar.getInstance
var list: List[String] = Nil
var date: String = ""
for (i <- 0 to 2) {
  currDay.add(Calendar.DATE, -i)
  date = "data_" + formatter.format(currDay.getTime)
  list ::= date
}
println(list.mkString(","))
How can I achieve this?
Can you use java.time.LocalDate? If so you can easily accomplish this:
import java.time.LocalDate
import java.time.format.DateTimeFormatter

val desiredFormat = DateTimeFormatter.ofPattern("yyyy-MMM-dd")
val now = LocalDate.now()
val dates = Set(now, now.minusDays(1), now.minusDays(2))
dates.map(_.format(desiredFormat))
  .foreach(date => println(s"data_$date"))
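If you need more than three days back, a range-based Seq generalizes this (and, unlike a Set, it preserves order); a small sketch along the same lines, where n is a hypothetical number of days to go back:
val n = 9
(0 to n)
  .map(i => now.minusDays(i).format(desiredFormat))
  .foreach(date => println(s"data_$date"))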

Iterate over dates range (the scala way)

Given a start and an end date I would like to iterate on it by day using a foreach, map or similar function. Something like
(DateTime.now to DateTime.now + 5.day by 1.day).foreach(println)
I am using https://github.com/nscala-time/nscala-time, but the syntax above returns a joda Interval object, which I suspect is not a range of dates but rather a sort of range of milliseconds.
EDIT: The question is obsolete. As advised on the joda homepage, if you are using java 8 you should start with or migrate to java.time.
You may use plusDays:
val now = DateTime.now
(0 until 5).map(now.plusDays(_)).foreach(println)
Given start and end dates:
import org.joda.time.Days
val start = DateTime.now.minusDays(5)
val end = DateTime.now.plusDays(5)
val daysCount = Days.daysBetween(start, end).getDays()
(0 until daysCount).map(start.plusDays(_)).foreach(println)
For just iterating by day, I do:
Iterator.iterate(start) { _ + 1.day }.takeWhile(_.isBefore(end))
This has proven to be useful enough that I have a small helper object to provide an implicit and allow for a type transformation:
object IntervalIterators {
  implicit class ImplicitIterator(val interval: Interval) extends AnyVal {
    def iterateBy(step: Period): Iterator[DateTime] = Iterator.iterate(interval.start) { _ + step }
      .takeWhile(_.isBefore(interval.end))

    def iterateBy[A](step: Period, transform: DateTime => A): Iterator[A] = iterateBy(step).map(transform)

    def iterateByDay: Iterator[LocalDate] = iterateBy(1.day, { _.toLocalDate })

    def iterateByHour: Iterator[DateTime] = iterateBy(1.hour)
  }
}
Sample usage:
import IntervalIterators._
(DateTime.now to 5.day.from(DateTime.now)).iterateByDay // Iterator[LocalDate]
(30.minutes.ago to 1.hour.from(DateTime.now)).iterateBy(1.second) // Iterator[DateTime], broken down by second
Solution with java.time API using Scala
Necessary import and initialization
import java.time.temporal.ChronoUnit
import java.time.temporal.ChronoField.EPOCH_DAY
import java.time.{LocalDate, Period}
val now = LocalDate.now
val daysTill = 5
Create List of LocalDate for sample duration
(0 to daysTill)
  .map(days => now.plusDays(days))
  .foreach(println)
Iterate over specific dates between start and end using toEpochDay or getLong(ChronoField.EPOCH_DAY)
// Extract the duration
val endDay = now.plusDays(daysTill)
val startDay = now
val duration = endDay.getLong(EPOCH_DAY) - startDay.getLong(EPOCH_DAY)
/* This code does not give the desired results, as trudolf pointed out:
val duration = Period
  .between(now, now.plusDays(daysTill))
  .get(ChronoUnit.DAYS)
*/
// Create the list for the duration
(0L to duration)
  .map(days => now.plusDays(days))
  .foreach(println)
This answer fixes the issue in mrsrinivas' answer: .get(ChronoUnit.DAYS) returns only the days part of the period, not the total number of days.
Necessary import and initialization
import java.time.temporal.ChronoUnit
import java.time.{LocalDate, Period}
Note how the above approach would lead to a wrong result (the total number of days here is 117):
scala> Period.between(start, end)
res6: java.time.Period = P3M26D
scala> Period.between(start, end).get(ChronoUnit.DAYS)
res7: Long = 26
Iterate over specific dates between start and end
val start = LocalDate.of(2018, 1, 5)
val end = LocalDate.of(2018, 5, 1)

// Create a List of `LocalDate` for the period between the start and end dates
val dates: IndexedSeq[LocalDate] = (0L to (end.toEpochDay - start.toEpochDay))
  .map(days => start.plusDays(days))

dates.foreach(println)
You can use something like this:
object Test extends App {
  private val startDate: DateTime = DateTime.now()
  private val endDate: DateTime = DateTime.now().plusDays(5)
  private val interval: Interval = new Interval(startDate, endDate)

  Stream.from(0, 1)
    .takeWhile(index => interval.contains(startDate.plusDays(index)))
    .foreach(index => println(startDate.plusDays(index)))
}
In this case, the Scala way is the Java way:
When running Scala on Java 9+, we can use java.time.LocalDate::datesUntil:
import java.time.LocalDate
import collection.JavaConverters._
// val start = LocalDate.of(2019, 1, 29)
// val end = LocalDate.of(2019, 2, 2)
start.datesUntil(end).iterator.asScala
// Iterator[java.time.LocalDate] = <iterator> (2019-01-29, 2019-01-30, 2019-01-31, 2019-02-01)
And if the last date is to be included:
start.datesUntil(end.plusDays(1)).iterator.asScala
// 2019-01-29, 2019-01-30, 2019-01-31, 2019-02-01, 2019-02-02
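On Java 8, where datesUntil is not available, a plain Scala iterator over java.time gives the same sequence; a small sketch with the same start and end:
// inclusive iteration from start to end, Java 8 compatible
val datesJava8: Iterator[LocalDate] =
  Iterator.iterate(start)(_.plusDays(1)).takeWhile(!_.isAfter(end))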
import java.util.{Calendar, Date}
import scala.annotation.tailrec

/** Gets the list of dates between two dates
  *
  * @param startDate Start date
  * @param endDate   End date
  * @return List of dates from startDate to endDate
  */
def getDateRange(startDate: Date, endDate: Date): List[Date] = {
  @tailrec
  def addDate(acc: List[Date], startDate: Date, endDate: Date): List[Date] = {
    if (startDate.after(endDate)) acc
    else addDate(endDate :: acc, startDate, addDays(endDate, -1))
  }
  addDate(List(), startDate, endDate)
}

/** Adds a day offset to the given date
  *
  * @param date   ==> Date
  * @param amount ==> Offset (can be negative)
  * @return ==> New date
  */
def addDays(date: Date, amount: Int): Date = {
  val cal = Calendar.getInstance()
  cal.setTime(date)
  cal.add(Calendar.DATE, amount)
  cal.getTime
}
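Usage (a quick hypothetical example; the three-day offset is arbitrary):
val today = Calendar.getInstance().getTime
val threeDaysLater = addDays(today, 3)
// prints 4 dates, oldest first
getDateRange(today, threeDaysLater).foreach(println)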