Populating an array of date tuples - scala

I am trying to pass a list of date ranges, which needs to be in the format below.
val predicates =
  Array(
    "2021-05-16" -> "2021-05-17",
    "2021-05-18" -> "2021-05-19",
    "2021-05-20" -> "2021-05-21")
I am then using map to create a range of conditions that will be passed to the jdbc method.
val predicates =
  Array(
    "2021-05-16" -> "2021-05-17",
    "2021-05-18" -> "2021-05-19",
    "2021-05-20" -> "2021-05-21"
  ).map { case (start, end) =>
    s"cast(NEW_DT as date) >= date '$start' AND cast(NEW_DT as date) <= date '$end'"
  }
The process needs to run daily, so I need to populate these values dynamically rather than hard-coding them.
I need help returning these values from a method that generates tuples like the ones above by incrementing start_date and end_date. I had a rough idea like the one below, but as I am new to Scala I am not able to figure it out. Please help.
def predicateRange(start_date: String, end_date: String): Array[(String,String)] = {
// iterate over the date values and add + 1 to both start and end and return the tuple
}

This assumes that every range is the same duration, and that each date range starts the next day after the end of the previous range.
import java.time.LocalDate
import java.time.format.DateTimeFormatter
def dateRanges(start: String
,rangeLen: Int
,ranges: Int): Array[(String,String)] = {
val startDate =
LocalDate.parse(start, DateTimeFormatter.ofPattern("yyyy-MM-dd"))
Array.iterate(startDate -> startDate.plusDays(rangeLen), ranges){
case (_, end) => end.plusDays(1) -> end.plusDays(rangeLen+1)
}.map{case (s,e) => (s.toString, e.toString)}
}
usage:
dateRanges("2021-05-16", 1, 3)
//res0: Array[(String, String)] = Array((2021-05-16,2021-05-17), (2021-05-18,2021-05-19), (2021-05-20,2021-05-21))
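To connect this back to the question, here is a minimal sketch of how generated ranges could be turned into the predicate strings. It assumes the same fixed-length, back-to-back ranges as above. The names rangesFrom and toPredicates are made up for illustration, and NEW_DT is the column name from the question.

```scala
import java.time.LocalDate

// Hypothetical helper: `ranges` consecutive ranges, each spanning rangeLen + 1 calendar days,
// with every range starting the day after the previous one ends.
def rangesFrom(start: LocalDate, rangeLen: Int, ranges: Int): Vector[(LocalDate, LocalDate)] =
  Vector.tabulate(ranges) { i =>
    val s = start.plusDays(i.toLong * (rangeLen + 1))
    (s, s.plusDays(rangeLen.toLong))
  }

// Hypothetical helper: LocalDate.toString is ISO yyyy-MM-dd, so it can be interpolated directly.
def toPredicates(rs: Seq[(LocalDate, LocalDate)]): Array[String] =
  rs.map { case (s, e) =>
    s"cast(NEW_DT as date) >= date '$s' AND cast(NEW_DT as date) <= date '$e'"
  }.toArray

val predicates = toPredicates(rangesFrom(LocalDate.parse("2021-05-16"), 1, 3))
```

Keeping the values as LocalDate until the last step avoids formatting and re-parsing strings between the two stages.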

You can use the following method to generate your tuple array:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
def generateArray(startDateString: String, endDateString: String): Array[(String, String)] = {
val dateFormatter = DateTimeFormatter.ISO_LOCAL_DATE
val startDate = LocalDate.parse(startDateString)
val endDate = LocalDate.parse(endDateString)
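// Total number of days between the dates; Period.getDays would return only the days part of the period
val daysCount = startDate.until(endDate, java.time.temporal.ChronoUnit.DAYS).toInt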
val dateStringTuples = Array.tabulate(daysCount)(i => {
val firstDate = startDate.plusDays(i)
val secondDate = startDate.plusDays(i + 1)
(dateFormatter.format(firstDate), dateFormatter.format(secondDate))
})
dateStringTuples
}
Usage:
println("--------------------------")
generateArray("2021-02-27", "2021-03-02").foreach(println)
println("--------------------------")
generateArray("2021-05-27", "2021-06-02").foreach(println)
println("--------------------------")
generateArray("2021-12-27", "2022-01-02").foreach(println)
println("--------------------------")
output :
--------------------------
(2021-02-27,2021-02-28)
(2021-02-28,2021-03-01)
(2021-03-01,2021-03-02)
--------------------------
(2021-05-27,2021-05-28)
(2021-05-28,2021-05-29)
(2021-05-29,2021-05-30)
(2021-05-30,2021-05-31)
(2021-05-31,2021-06-01)
(2021-06-01,2021-06-02)
--------------------------
(2021-12-27,2021-12-28)
(2021-12-28,2021-12-29)
(2021-12-29,2021-12-30)
(2021-12-30,2021-12-31)
(2021-12-31,2022-01-01)
(2022-01-01,2022-01-02)
--------------------------
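If non-overlapping ranges between a given start and end date are needed, as in the question's predicateRange stub, one possible sketch is below. The windowDays parameter is an assumption added for illustration (2 reproduces the two-day ranges from the question), and the last range is clamped to the end date.

```scala
import java.time.LocalDate

// Sketch only: splits [startDate, endDate] into consecutive windows of `windowDays` days.
// `windowDays` is a hypothetical parameter, not part of the original stub.
def predicateRange(startDate: String, endDate: String, windowDays: Int = 2): Array[(String, String)] = {
  val start = LocalDate.parse(startDate) // ISO yyyy-MM-dd is the default format
  val end = LocalDate.parse(endDate)
  Iterator.iterate(start)(_.plusDays(windowDays.toLong))
    .takeWhile(s => !s.isAfter(end))
    .map { s =>
      val e = s.plusDays(windowDays.toLong - 1)
      // Clamp the final window so it never runs past the requested end date
      (s.toString, (if (e.isAfter(end)) end else e).toString)
    }
    .toArray
}

val ranges = predicateRange("2021-05-16", "2021-05-21")
```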

Related

Spark scala UDF in DataFrames is not working

I have defined a function to convert epoch time to CET and wrapped it as a UDF to use in a Spark DataFrame. It throws an error and I cannot use it. Please find my code below.
Function used to convert Epoch time to CET:
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, TimeZone}
import java.util.concurrent.TimeUnit
def convertNanoEpochToDateTime(
d: Long,
f: String = "dd/MM/yyyy HH:mm:ss.SSS",
z: String = "CET",
msPrecision: Int = 9
): String = {
val sdf = new SimpleDateFormat(f)
sdf.setTimeZone(TimeZone.getTimeZone(z))
val date = new Date((d / Math.pow(10, 9).toLong) * 1000L)
val stringTime = sdf.format(date)
if (f.contains(".S")) {
val lng = d.toString.length
val milliSecondsStr = d.toString.substring(lng-9,lng)
stringTime.substring(0, stringTime.lastIndexOf(".") + 1) + milliSecondsStr.substring(0,msPrecision)
}
else stringTime
}
val epochToDateTime = udf(convertNanoEpochToDateTime _)
The Spark DataFrame below uses the UDF defined above to convert epoch time to CET:
val df2 = df1.select($"messageID",$"messageIndex",epochToDateTime($"messageTimestamp").as("messageTimestamp"))
I get an error when I run the code.
Any idea how I should proceed in this scenario?
The error from Spark tells you that your function is not a Function1, that is, not a function that accepts one parameter. Your function has four input parameters. Although Scala lets you call the method with a single argument because the other three have default values, those defaults are lost when the method is converted to a function value with convertNanoEpochToDateTime _, so udf receives a four-argument function. You will need to change the definition of your function to something like:
def convertNanoEpochToDateTime(
f: String = "dd/MM/yyyy HH:mm:ss.SSS"
)(z: String = "CET")(msPrecision: Int = 9)(d: Long): String
or
def convertNanoEpochToDateTime(f: String)(z: String)(msPrecision: Int)(d: Long): String
and put the default values in the udf creation:
val epochToDateTime = udf(
convertNanoEpochToDateTime("dd/MM/yyyy HH:mm:ss.SSS")("CET")(9) _
)
Also try to define the SimpleDateFormat once, as a static transient value outside the function.
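The difference between the method and the function value can be seen without Spark. The sketch below uses a made-up three-parameter method, render, as a stand-in for the one in the question. It only illustrates the Scala side of the problem and is not Spark code.

```scala
// Hypothetical stand-in for the method in the question: one real input plus defaulted settings.
def render(d: Long, zone: String = "CET", precision: Int = 9): String =
  s"$d|$zone|$precision"

// Converting the method to a function value keeps all three parameters;
// default arguments are not part of a function type.
val asFunction3: (Long, String, Int) => String = render

// Option 1: curry the constant parameters and apply them up front.
def renderCurried(zone: String)(precision: Int)(d: Long): String =
  s"$d|$zone|$precision"
val curriedFunction1: Long => String = renderCurried("CET")(9)

// Option 2: wrap the call in a lambda, where the default arguments still apply.
val lambdaFunction1: Long => String = d => render(d)
```

Either of the two one-argument functions has the shape that a single-column UDF call expects.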
I found the cause of the error and resolved it. The problem was that when I wrapped the Scala function as a UDF, it expected 4 parameters, but I was passing only one. I have now removed 3 parameters from the function signature and defined those values inside the function itself, since they are constants. In the Spark DataFrame I now call the function with only 1 parameter and it works perfectly fine.
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}
import org.apache.spark.sql.functions.udf
def convertNanoEpochToDateTime(
d: Long
): String = {
val f: String = "dd/MM/yyyy HH:mm:ss.SSS"
val z: String = "CET"
val msPrecision: Int = 9
val sdf = new SimpleDateFormat(f)
sdf.setTimeZone(TimeZone.getTimeZone(z))
val date = new Date((d / Math.pow(10, 9).toLong) * 1000L)
val stringTime = sdf.format(date)
if (f.contains(".S")) {
val lng = d.toString.length
val milliSecondsStr = d.toString.substring(lng-9,lng)
stringTime.substring(0, stringTime.lastIndexOf(".") + 1) + milliSecondsStr.substring(0,msPrecision)
}
else stringTime
}
val epochToDateTime = udf(convertNanoEpochToDateTime _)
import spark.implicits._
val df1 = List(1659962673251388155L,1659962673251388155L,1659962673251388155L,1659962673251388155L).toDF("epochTime")
val df2 = df1.select(epochToDateTime($"epochTime"))
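As an alternative to SimpleDateFormat and the manual substring handling, the conversion could also be sketched with java.time, which supports nanosecond precision directly. This swaps in a different technique from the one above. The function name nanoEpochToDateTime is made up, and the input is assumed to be nanoseconds since the epoch, as in the question.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

// Nine 'S' letters print the full nanosecond fraction. DateTimeFormatter is immutable
// and thread-safe, so a single shared instance is fine.
val nanoPattern: DateTimeFormatter = DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm:ss.SSSSSSSSS")

// Hypothetical replacement for convertNanoEpochToDateTime; assumes d is nanoseconds since epoch.
def nanoEpochToDateTime(d: Long, zone: String = "CET"): String = {
  val instant = Instant.ofEpochSecond(d / 1000000000L, d % 1000000000L)
  nanoPattern.withZone(ZoneId.of(zone)).format(instant)
}

val formatted = nanoEpochToDateTime(1659962673251388155L)
```

Because the function still has a defaulted parameter, it would need to be wrapped for udf in the same way as before, for example as a lambda that takes only the Long.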

How can I convert from a CSV file to scala toList?

How can I convert a CSV file to a list in Scala, when each row of data has a date and a numeric value and the rows are on separate lines? This is about stock market data.
The function below takes a stock symbol and a year as arguments.
It should read the corresponding CSV-file and then extract the January
data from the given year. The data should be collected in a list of
strings (one entry for each line in the CSV-file).
def get_january_data(symbol: String, year: Int) : List[String] = {
}
The csv file is like this:
Data AJD
12/01/1998 0.32232
14/01/1998 0.32232
12/01/1998,0.32232,14/01/1998,0.32232
If I understand your question, you can start with the following piece of code:
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter
val data =
"""Data AJD
|12/01/1998 0.32232
|14/01/1998 0.32232
|14/02/1998 0.12333
|01/01/1999 0.12333""".stripMargin
def getJanuaryData(symbol: String, year: Int): List[String] = {
val dtf = DateTimeFormatter.ofPattern("dd/MM/yyyy")
val xs: Vector[(LocalDateTime, Double)] = fromString(data, dtf)
xs.filter(x => (x._1.getMonthValue == 1) && x._1.getYear == year)
.map(x => s"${x._1.format(dtf)} ${x._2}").toList
}
def fromString(str: String, dateFormatter: DateTimeFormatter = DateTimeFormatter.ofPattern("dd/MM/yyyy")): Vector[(LocalDateTime, Double)] = {
str.split("\n").tail.map { line =>
val tokens = line.split(" ") // consider different separator
(LocalDate.parse(tokens(0), dateFormatter).atStartOfDay()
, tokens.tail.head.toDouble)
}.toVector
}
getJanuaryData("", 1998)
getJanuaryData() returns the rows for January of the requested year. Please add more information about the stock symbol to get a more precise answer.
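For reading from an actual file, a minimal sketch could look like the following. It assumes, purely for illustration, that each symbol lives in a file named <symbol>.csv inside a directory dir, that the first line is a header, and that every data line starts with a dd/MM/yyyy date. None of this is stated in the question.

```scala
import java.nio.file.{Files, Path}
import scala.io.Source

// Sketch under the assumptions above: file naming and the `dir` parameter are hypothetical.
def getJanuaryData(dir: Path, symbol: String, year: Int): List[String] = {
  val src = Source.fromFile(dir.resolve(s"$symbol.csv").toFile)
  try {
    src.getLines()
      .drop(1)                                  // skip the header line
      .filter(_.startsWith(s"/01/$year", 2))    // "dd/01/yyyy": month and year begin at index 2
      .toList                                   // force reading before the file is closed
  } finally src.close()
}

// Usage with a temporary file holding sample rows
val dir = Files.createTempDirectory("stocks")
Files.write(dir.resolve("AJD.csv"),
  "Data AJD\n12/01/1998 0.32232\n14/01/1998 0.32232\n14/02/1998 0.12333\n01/01/1999 0.12333".getBytes("UTF-8"))
val january1998 = getJanuaryData(dir, "AJD", 1998)
```

Filtering on the raw text keeps each entry as the original line, which matches the requested List[String] result.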

How to get earliest date From RDD[String, List[java.sql.date]], Scala

I have the below RDD, t1RDD2, only the first five rows present:
(000471242-01,CompactBuffer(2012-05-07, 2006-11-15, 2014-10-08, 2010-05-20))
(996006688-01,CompactBuffer(2011-01-18, 2005-08-19, 2008-08-27, 2014-09-05, 2006-06-26, 2012-05-10, 2013-11-22, 2005-10-14, 2007-03-26, 2007-05-17, 2010-05-19, 2008-07-11, 2009-03-09))
(788000995-01,CompactBuffer(2006-01-06, 2013-05-01))
(525570000-01,CompactBuffer(2009-07-06, 2010-06-10, 2013-01-22, 2005-03-09, 2008-06-09, 2008-11-07))
(418500000-01,CompactBuffer(2007-07-09, 2011-02-16, 2012-10-16, 2005-10-18, 2009-05-11, 2008-01-22, 2014-07-08, 2010-01-04, 2009-03-23, 2013-08-16))
I am trying to get the earliest date from the buffer, but I am getting an error from my code.
Code:
val t1RDD = t1RDD2.reduceByKey((date1, date2) => if (date1.before(date2)) date1 else date2)
Error:
value before is not a member of Iterable[java.sql.Date]
Any suggestions?
Apparently, your t1RDD2 is equivalent to the result of groupByKey on a PairRDD, as follows (with stripped-down sample data):
import java.sql.Date
val rdd = sc.parallelize(Seq(
("000471242-01", Date.valueOf("2012-05-07")),
("000471242-01", Date.valueOf("2006-11-15")),
("996006688-01", Date.valueOf("2011-01-18")),
("996006688-01", Date.valueOf("2005-08-19")),
("996006688-01", Date.valueOf("2008-08-27"))
))
val t1RDD2 = rdd.groupByKey
// t1RDD2: org.apache.spark.rdd.RDD[(String, Iterable[java.sql.Date])] = ...
t1RDD2.collect
// res1: Array[(String, Iterable[java.sql.Date])] = Array(
// (996006688-01,CompactBuffer(2011-01-18, 2005-08-19, 2008-08-27)),
// (000471242-01,CompactBuffer(2012-05-07, 2006-11-15))
// )
If you want to get the earliest date per key from t1RDD2, use map to reduce the value column for the minimal value:
t1RDD2.map{ case (k, v) => ( k, v.reduce((min, d) => if (min.before(d)) min else d) ) }.
collect
// res2: Array[(String, java.sql.Date)] = Array((996006688-01,2005-08-19), (000471242-01,2006-11-15))
But it would be better to perform reduceByKey directly on the RDD before it is grouped (rdd above), if applicable:
rdd.reduceByKey( (min, d) => if (min.before(d)) min else d ).
collect
// res3: Array[(String, java.sql.Date)] = Array((996006688-01,2005-08-19), (000471242-01,2006-11-15))
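The per-key minimum can also be tried out on plain Scala collections, without a Spark context. The sketch below uses a Map as a stand-in for the grouped RDD and minBy on the epoch milliseconds instead of an explicit reduce. On an actual pair RDD the same function could be passed to mapValues.

```scala
import java.sql.Date

// Plain-collection stand-in for the grouped data (not an RDD)
val grouped: Map[String, Seq[Date]] = Map(
  "000471242-01" -> Seq(Date.valueOf("2012-05-07"), Date.valueOf("2006-11-15")),
  "996006688-01" -> Seq(Date.valueOf("2011-01-18"), Date.valueOf("2005-08-19"), Date.valueOf("2008-08-27"))
)

// Earliest date per key; comparing getTime avoids needing an Ordering for java.sql.Date
val earliest: Map[String, Date] =
  grouped.map { case (k, dates) => k -> dates.minBy(_.getTime) }
```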

How to automate the creation of String elements with datetime

This thread arises from my previous question. I need to create a Seq[String] that contains paths as String elements, but now I also need to append the numbers 7, 8, ..., 22 after the date. Also, I cannot use LocalDate as was suggested in the answer to the above-cited question:
path/file_2017-May-1-7
path/file_2017-May-1-8
...
path/file_2017-May-1-22
path/file_2017-April-30-7
path/file_2017-April-30-8
...
path/file_2017-April-30-22
..
I am searching for a flexible solution. My current solution requires defining the dates manually in yyyy-MMM-dd format. However, this is not practical if I need to include more than 2 dates, e.g. 10 or 100. Moreover, filePathsList is currently a Set[Seq[String]] and I don't know how to convert it into a Seq[String].
val formatter = new SimpleDateFormat("yyyy-MMM-dd")
val currDay = Calendar.getInstance
currDay.add(Calendar.DATE, -1)
val day_1_ago = currDay.getTime
currDay.add(Calendar.DATE, -1)
val day_2_ago = currDay.getTime
val dates = Set(formatter.format(day_1_ago),formatter.format(day_2_ago))
val filePathsList = dates.map(date => {
var list = Seq.empty[String]
for (num <- 7 to 22) {
list = list :+ s"path/file_$date-$num"
}
list
})
Here is how I was able to achieve what you outlined. Adjust the days val to configure how many days you care about:
import java.text.SimpleDateFormat
import java.util.Calendar
val currDay = Calendar.getInstance
val days = 5
val dates = currDay.getTime +: List.fill(days){
currDay.add(Calendar.DATE, -1)
currDay.getTime
}
val formatter = new SimpleDateFormat("yyyy-MMM-dd")
val filePathsList = for {
date <- dates
num <- 7 to 22
} yield s"path/file_${formatter.format(date)}-$num"
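On the second part of the question, turning a Set[Seq[String]] into a Seq[String], a small sketch with made-up date strings is below. It shows flattening after the fact and using flatMap so the nesting never appears.

```scala
// Hypothetical, already formatted dates
val dates: Set[String] = Set("2017-May-01", "2017-Apr-30")

// map produces one Seq per date, so the result is nested
val nested: Set[Seq[String]] = dates.map(date => (7 to 22).map(num => s"path/file_$date-$num"))

// Convert to a Seq first, then flatten the inner sequences into one Seq[String]
val flattened: Seq[String] = nested.toSeq.flatten

// flatMap does the mapping and flattening in a single step
val direct: Seq[String] = dates.toSeq.flatMap(date => (7 to 22).map(num => s"path/file_$date-$num"))
```

Note that a Set has no guaranteed iteration order, so if the order of dates matters it is safer to start from a Seq or List.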

Iterate over dates range (the scala way)

Given a start and an end date I would like to iterate on it by day using a foreach, map or similar function. Something like
(DateTime.now to DateTime.now + 5.day by 1.day).foreach(println)
I am using https://github.com/nscala-time/nscala-time, but I get returned a joda Interval object if I use the syntax above, which I suspect is also not a range of dates, but a sort of range of milliseconds.
EDIT: The question is obsolete. As advised on the joda homepage, if you are using java 8 you should start with or migrate to java.time.
You may use plusDays:
val now = DateTime.now
(0 until 5).map(now.plusDays(_)).foreach(println)
Given start and end dates:
import org.joda.time.Days
val start = DateTime.now.minusDays(5)
val end = DateTime.now.plusDays(5)
val daysCount = Days.daysBetween(start, end).getDays()
(0 until daysCount).map(start.plusDays(_)).foreach(println)
For just iterating by day, I do:
Iterator.iterate(start) { _ + 1.day }.takeWhile(_.isBefore(end))
This has proven to be useful enough that I have a small helper object to provide an implicit and allow for a type transformation:
object IntervalIterators {
implicit class ImplicitIterator(val interval: Interval) extends AnyVal {
def iterateBy(step: Period): Iterator[DateTime] = Iterator.iterate(interval.start) { _ + step }
.takeWhile(_.isBefore(interval.end))
def iterateBy[A](step: Period, transform: DateTime => A): Iterator[A] = iterateBy(step).map(transform)
def iterateByDay: Iterator[LocalDate] = iterateBy(1.day, { _.toLocalDate })
def iterateByHour: Iterator[DateTime] = iterateBy(1.hour)
}
}
Sample usage:
import IntervalIterators._
(DateTime.now to 5.day.from(DateTime.now)).iterateByDay // Iterator[LocalDate]
(30.minutes.ago to 1.hour.from(DateTime.now)).iterateBy(1.second) // Iterator[DateTime], broken down by second
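The same Iterator.iterate plus takeWhile idea also works with java.time.LocalDate, without Joda or nscala-time. A minimal sketch follows. The helper name daysBetween is made up, and the end date is treated as exclusive, as in the isBefore check above.

```scala
import java.time.LocalDate

// Lazily yields each date from start (inclusive) up to end (exclusive)
def daysBetween(start: LocalDate, end: LocalDate): Iterator[LocalDate] =
  Iterator.iterate(start)(_.plusDays(1)).takeWhile(_.isBefore(end))

val days = daysBetween(LocalDate.of(2019, 1, 29), LocalDate.of(2019, 2, 2)).toList
```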
Solution with java.time API using Scala
Necessary import and initialization
import java.time.temporal.ChronoUnit
import java.time.temporal.ChronoField.EPOCH_DAY
import java.time.{LocalDate, Period}
val now = LocalDate.now
val daysTill = 5
Create List of LocalDate for sample duration
(0 to daysTill)
.map(days => now.plusDays(days))
.foreach(println)
Iterate over specific dates between start and end using toEpochDay or getLong(ChronoField.EPOCH_DAY)
//Extract the duration
val endDay = now.plusDays(daysTill)
val startDay = now
val duration = endDay.getLong(EPOCH_DAY) - startDay.getLong(EPOCH_DAY)
/* This code does not give the desired results, as trudolf pointed out
val duration = Period
.between(now, now.plusDays(daysTill))
.get(ChronoUnit.DAYS)
*/
//Create list for the duration
(0 to duration)
.map(days => now.plusDays(days))
.foreach(println)
This answer fixes the issue in mrsrinivas's answer: .get(ChronoUnit.DAYS) returns only the days part of the period, not the total number of days.
Necessary import and initialization
import java.time.temporal.ChronoUnit
import java.time.{LocalDate, Period}
Note how the above answer would lead to a wrong result (the range from start to end spans 117 dates, inclusive):
scala> Period.between(start, end)
res6: java.time.Period = P3M26D
scala> Period.between(start, end).get(ChronoUnit.DAYS)
res7: Long = 26
Iterate over specific dates between start and end
val start = LocalDate.of(2018, 1, 5)
val end = LocalDate.of(2018, 5, 1)
// Create List of `LocalDate` for the period between start and end date
val dates: IndexedSeq[LocalDate] = (0L to (end.toEpochDay - start.toEpochDay))
.map(days => start.plusDays(days))
dates.foreach(println)
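Instead of subtracting epoch days, the total can also be computed with ChronoUnit.DAYS.between. The sketch below repeats the same start and end dates so it stands on its own, and contrasts the days part of the Period with the total day count.

```scala
import java.time.{LocalDate, Period}
import java.time.temporal.ChronoUnit

val start = LocalDate.of(2018, 1, 5)
val end = LocalDate.of(2018, 5, 1)

// Only the days component of P3M26D
val daysPart: Int = Period.between(start, end).getDays

// Total number of days from start to end
val totalDays: Long = ChronoUnit.DAYS.between(start, end)

// Every date from start to end, both inclusive
val allDates: IndexedSeq[LocalDate] = (0L to totalDays).map(d => start.plusDays(d))
```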
You can use something like this:
object Test extends App {
private val startDate: DateTime = DateTime.now()
private val endDate: DateTime = DateTime.now().plusDays(5)
private val interval: Interval = new Interval(startDate, endDate)
Stream.from(0,1)
.takeWhile(index => interval.contains(startDate.plusDays(index)))
.foreach(index => println(startDate.plusDays(index)))
}
In this case, the Scala way is the Java way:
When running Scala on Java 9+, we can use java.time.LocalDate::datesUntil:
import java.time.LocalDate
import collection.JavaConverters._
// val start = LocalDate.of(2019, 1, 29)
// val end = LocalDate.of(2019, 2, 2)
start.datesUntil(end).iterator.asScala
// Iterator[java.time.LocalDate] = <iterator> (2019-01-29, 2019-01-30, 2019-01-31, 2019-02-01)
And if the last date is to be included:
start.datesUntil(end.plusDays(1)).iterator.asScala
// 2019-01-29, 2019-01-30, 2019-01-31, 2019-02-01, 2019-02-02
import java.util.{Calendar, Date}
import scala.annotation.tailrec
/** Gets date list between two dates
*
* @param startDate Start date
* @param endDate End date
* @return List of dates from startDate to endDate
*/
def getDateRange(startDate: Date, endDate: Date): List[Date] = {
@tailrec
def addDate(acc: List[Date], startDate: Date, endDate: Date): List[Date] = {
if (startDate.after(endDate)) acc
else addDate(endDate :: acc, startDate, addDays(endDate, -1))
}
addDate(List(), startDate, endDate)
}
/** Adds a date offset to the given date
*
* @param date ==> Date
* @param amount ==> Offset (can be negative)
* @return ==> New date
*/
def addDays(date: Date, amount: Int): Date = {
val cal = Calendar.getInstance()
cal.setTime(date)
cal.add(Calendar.DATE, amount)
cal.getTime
}