I have defined a function to convert epoch time to CET and am using it, wrapped as a UDF, on a Spark DataFrame. It throws an error and I cannot use it. My code is below.
Function used to convert epoch time to CET:
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, TimeZone}
import java.util.concurrent.TimeUnit
def convertNanoEpochToDateTime(
d: Long,
f: String = "dd/MM/yyyy HH:mm:ss.SSS",
z: String = "CET",
msPrecision: Int = 9
): String = {
val sdf = new SimpleDateFormat(f)
sdf.setTimeZone(TimeZone.getTimeZone(z))
val date = new Date((d / Math.pow(10, 9).toLong) * 1000L)
val stringTime = sdf.format(date)
if (f.contains(".S")) {
val lng = d.toString.length
val milliSecondsStr = d.toString.substring(lng-9,lng)
stringTime.substring(0, stringTime.lastIndexOf(".") + 1) + milliSecondsStr.substring(0,msPrecision)
}
else stringTime
}
val epochToDateTime = udf(convertNanoEpochToDateTime _)
The Spark DataFrame below uses the UDF defined above to convert epoch time to CET:
val df2 = df1.select($"messageID",$"messageIndex",epochToDateTime($"messageTimestamp").as("messageTimestamp"))
Running this code produces an error.
Any idea how I should proceed in this scenario?
The error from the Spark optimizer tells you that your function is not a Function1, that is, not a function that accepts a single parameter. Your function has four input parameters. In Scala you can call the method with only one argument because the other three have default values, but those defaults are lost when the method is turned into a function value with `_`, and Catalyst sees a four-argument function. So you will need to change the definition of your function to something like:
def convertNanoEpochToDateTime(
f: String = "dd/MM/yyyy HH:mm:ss.SSS"
)(z: String = "CET")(msPrecision: Int = 9)(d: Long): String
or
def convertNanoEpochToDateTime(f: String)(z: String)(msPrecision: Int)(d: Long): String
and put the default values in the udf creation:
val epochToDateTime = udf(
convertNanoEpochToDateTime("dd/MM/yyyy HH:mm:ss.SSS")("CET")(9) _
)
Also try to define the SimpleDateFormat once, as a static transient value outside the function, instead of creating it on every call.
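The point about default values can be checked without Spark. The following is a minimal sketch built around a hypothetical method `describe` (it is not from the question): eta-expanding it with `_` yields a four-argument function, so the defaults are gone, whereas wrapping the call in a one-argument lambda keeps them. The lambda is an alternative to the currying shown above, not something this answer proposes.

```scala
// Hypothetical stand-in for the question's method: one required argument, three defaults.
def describe(d: Long, unit: String = "ns", zone: String = "CET", precision: Int = 9): String =
  s"$d $unit $zone $precision"

// Eta-expansion ignores the defaults: the result is a Function4.
val allArgs: (Long, String, String, Int) => String = describe _

// A one-argument lambda calls the method itself, so the defaults still apply: a Function1.
val oneArg: Long => String = d => describe(d)

println(oneArg(5L))                   // 5 ns CET 9
println(allArgs(5L, "ms", "UTC", 3))  // 5 ms UTC 3
```

Passing a one-argument function such as `oneArg` to `udf` should then give a UDF that accepts a single column.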
I found the cause of the error and resolved it. When I wrap the Scala function as a UDF, it expects 4 parameters, but I was passing only one. I have now removed the other 3 parameters from the signature and defined those values inside the function itself, since they are constants. In the Spark DataFrame I now call the function with a single parameter, and it works fine.
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, TimeZone}
import java.util.concurrent.TimeUnit
import org.apache.spark.sql.functions.udf
def convertNanoEpochToDateTime(
d: Long
): String = {
val f: String = "dd/MM/yyyy HH:mm:ss.SSS"
val z: String = "CET"
val msPrecision: Int = 9
val sdf = new SimpleDateFormat(f)
sdf.setTimeZone(TimeZone.getTimeZone(z))
val date = new Date((d / Math.pow(10, 9).toLong) * 1000L)
val stringTime = sdf.format(date)
if (f.contains(".S")) {
val lng = d.toString.length
val milliSecondsStr = d.toString.substring(lng-9,lng)
stringTime.substring(0, stringTime.lastIndexOf(".") + 1) + milliSecondsStr.substring(0,msPrecision)
}
else stringTime
}
val epochToDateTime = udf(convertNanoEpochToDateTime _)
import spark.implicits._
val df1 = List(1659962673251388155L,1659962673251388155L,1659962673251388155L,1659962673251388155L).toDF("epochTime")
val df2 = df1.select(epochToDateTime($"epochTime"))
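For comparison, here is a hedged sketch of the same conversion using java.time instead of SimpleDateFormat and string slicing. The object name `EpochFormat` and the helper `nanoEpochToCet` are hypothetical; the pattern and the "CET" zone mirror the constants above. DateTimeFormatter is immutable and thread-safe, so a single shared instance can be reused across calls.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

// Hypothetical holder object: the formatter lives here rather than inside a closure.
object EpochFormat {
  // Nine "S" letters print the full nanosecond fraction.
  val cet: DateTimeFormatter =
    DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm:ss.SSSSSSSSS").withZone(ZoneId.of("CET"))

  // Splits nanoseconds since the epoch into whole seconds and a nanosecond remainder.
  def nanoEpochToCet(d: Long): String = {
    val seconds = Math.floorDiv(d, 1000000000L)
    val nanos   = Math.floorMod(d, 1000000000L)
    cet.format(Instant.ofEpochSecond(seconds, nanos))
  }
}

println(EpochFormat.nanoEpochToCet(1659962673251388155L)) // 08/08/2022 14:44:33.251388155
```

Under these assumptions, `udf(EpochFormat.nanoEpochToCet _)` would take a single column, just like the one-parameter function above.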
I need to find all the year weeks between the given weeks.
201824 is an example of a year week. It means the 24th week of the year 2018.
Assuming that there are 52 weeks in a year, the year weeks of 2018 start with 201801 and end with 201852. After that, it continues with 201901.
I was able to find the range of all year weeks between 2 weeks if the start week and the end week are in the same year, like below:
val range = udf((i: Int, j: Int) => (i to j).toArray)
The above code only works when the start week and end week are in the same year, for example 201912 - 201917.
How do I make it work if the start week and the end week belong to different years?
Example: 201849 - 201903
The above weeks should give the output as:
201849,201850,201851,201852,201901,201902,201903
There are still a lot of optimizations to make, but as a general direction you could use the following.
I am using org.joda.time.format here, but java.time should also work.
import org.joda.time.Weeks
import org.joda.time.format.DateTimeFormat
def rangeOfYearWeeks(weeksRange: String): Array[String] = {
  try {
    val left = weeksRange.split("-")(0).trim
    val right = weeksRange.split("-")(1).trim
    val leftPattern = s"${left.substring(0, 4)}-${left.substring(4)}"
    val rightPattern = s"${right.substring(0, 4)}-${right.substring(4)}"
    // "xxxx" is the ISO week-based year, "ww" the zero-padded week of that year
    val fmt = DateTimeFormat.forPattern("xxxx-ww")
    val leftDate = fmt.parseDateTime(leftPattern)
    val rightDate = fmt.parseDateTime(rightPattern)
    //if (leftDate.isAfter(rightDate))
    val weeksBetween = Weeks.weeksBetween(leftDate, rightDate).getWeeks
    val dates = for (one <- 0 to weeksBetween) yield {
      leftDate.plusWeeks(one)
    }
    val result: Array[String] = dates.map(date => fmt.print(date)).map(_.replaceAll("-", "")).toArray
    result
  } catch {
    case e: Exception => Array.empty[String]
  }
}
Example:
val dates = Seq("201849 - 201903", "201912 - 201917").toDF("col")
val weeks = udf((d: String) => rangeOfYearWeeks(d))
dates.select(weeks($"col")).show(false)
+--------------------------------------------------------+
|UDF(col)                                                |
+--------------------------------------------------------+
|[201849, 201850, 201851, 201852, 201901, 201902, 201903]|
|[201912, 201913, 201914, 201915, 201916, 201917]        |
+--------------------------------------------------------+
Here's a solution with a UDF that uses the java.time API:
def weeksBetween = udf{ (startWk: Int, endWk: Int) =>
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.{Try, Success, Failure}
def formatYW(yw: Int): String = {
val pattern = "(\\d{4})(\\d+)".r
s"$yw" match { case pattern(y, w) => s"$y-$w-1"}
}
val formatter = DateTimeFormatter.ofPattern("YYYY-w-e") // week-based year
Try(
Iterator.iterate(LocalDate.parse(formatYW(startWk), formatter))(_.plusWeeks(1)).
takeWhile(_.isBefore(LocalDate.parse(formatYW(endWk), formatter))).
map{ s =>
val a = s.format(formatter).split("-")
(a(0) + f"${a(1).toInt}%02d").toInt
}.
toList.tail
) match {
case Success(ls) => ls
case Failure(_) => List.empty[Int] // return an empty list
}
}
Testing the UDF:
val df = Seq(
(1, 201849, 201903), (2, 201908, 201916), (3, 201950, 201955)
).toDF("id", "start_wk", "end_wk")
df.withColumn("weeks_between", weeksBetween($"start_wk", $"end_wk")).show(false)
// +---+--------+------+--------------------------------------------------------+
// |id |start_wk|end_wk|weeks_between |
// +---+--------+------+--------------------------------------------------------+
// |1 |201849 |201903|[201850, 201851, 201852, 201901, 201902] |
// |2 |201908 |201916|[201909, 201910, 201911, 201912, 201913, 201914, 201915]|
// |3 |201950 |201955|[] |
// +---+--------+------+--------------------------------------------------------+
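If the question's own simplification is acceptable (every year has exactly 52 weeks), the range can also be produced with plain integer arithmetic and no date library. This is a minimal sketch under that assumption only; the function name `yearWeeksBetween` and the `weeksPerYear` parameter are hypothetical, and ISO years with 53 weeks are not handled.

```scala
// Assumes every year has exactly weeksPerYear weeks, as stated in the question.
def yearWeeksBetween(start: Int, end: Int, weeksPerYear: Int = 52): Seq[Int] = {
  // Map a yyyyww value to a running week index and back again.
  def toIndex(yw: Int): Int = (yw / 100) * weeksPerYear + (yw % 100 - 1)
  def fromIndex(i: Int): Int = (i / weeksPerYear) * 100 + (i % weeksPerYear + 1)
  (toIndex(start) to toIndex(end)).map(fromIndex)
}

println(yearWeeksBetween(201849, 201903).mkString(","))
// 201849,201850,201851,201852,201901,201902,201903
```

Unlike the UDF above, this range includes both the start week and the end week, which matches the output requested in the question.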
I am trying to split the range from a start date to an end date into intervals of 'n' days. The function signature would take start_date, end_date and interval as arguments and return a map with the list of start and end days of the resulting intervals.
Example: start_date: 2018-01-01, end_date: 2018-02-20, interval: 20
Expected Output:
2018-01-01 , 2018-01-20 (20 days)
2018-01-21 , 2018-02-09 (20 days)
2018-02-09 , 2018-02-20 (remaining)
I tried to write this in Scala, but I don't feel it is proper functional style.
case class DateContainer(period: String, range: (LocalDate, LocalDate))
def generateDates(startDate: String, endDate: String,interval:Int): Unit = {
import java.time._
var lstDDateContainer = List[DateContainer]()
var start = LocalDate.parse(startDate)
val end = LocalDate.parse(endDate)
import java.time.temporal._
var futureMonth = ChronoUnit.DAYS.addTo(start, interval)
var i = 1
while (end.isAfter(futureMonth)) {
lstDDateContainer = DateContainer("P" + i, (start, futureMonth)):: lstDDateContainer
start=futureMonth
futureMonth = ChronoUnit.DAYS.addTo(futureMonth, interval)
i += 1
}
lstDDateContainer= DateContainer("P" + i, (start, end))::lstDDateContainer
lstDDateContainer.foreach(println)
}
generateDates("2018-01-01", "2018-02-20",20)
Could anyone help me write this in a functional style?
I offer a solution that produces a slightly different result than given in the question but could be easily modified to get the desired answer:
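//Preliminaries
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit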
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val startDate ="2018-01-01"
val endDate = "2018-02-21"
val interval = 20L
val d1 = LocalDate.parse(startDate, fmt)
val d2 = LocalDate.parse(endDate, fmt)
//The main code
Stream.continually(interval)
.scanLeft((d1, d1.minusDays(1), interval)) ((x,y) => {
val finDate = x._2.plusDays(y)
if(finDate.isAfter(d2))
(x._2.plusDays(1), d2, ChronoUnit.DAYS.between(x._2, d2))
else
(x._2.plusDays(1), x._2.plusDays(y), y)
}).takeWhile(d => d._3 > 0).drop(1).toList
Result:
(2018-01-01,2018-01-20,20)
(2018-01-21,2018-02-09,20)
(2018-02-10,2018-02-21,12)
The idea is to scan a 3-tuple (start, end, length in days) through a stream of intervals and stop when no days remain.
Use the java.time library to generate the dates and Stream.iterate() to generate the sequence of intervals.
import java.time.LocalDate
def generateDates( startDate :LocalDate
, endDate :LocalDate
, dayInterval :Int ) :Unit = {
val intervals =
Stream.iterate((startDate, startDate plusDays dayInterval-1)){
case (_,lastDate) =>
val nextDate = lastDate plusDays dayInterval
(lastDate plusDays 1, if (nextDate isAfter endDate) endDate
else nextDate)
}.takeWhile(_._1 isBefore endDate)
println(intervals.mkString("\n"))
}
usage:
generateDates(LocalDate.parse("2018-01-01"), LocalDate.parse("2018-02-20"), 20)
// (2018-01-01,2018-01-20)
// (2018-01-21,2018-02-09)
// (2018-02-10,2018-02-20)
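A variant of the same idea that returns the intervals instead of printing them may be easier to reuse and test. This is only a sketch: the name `dateIntervals` is hypothetical, it iterates over the interval start dates, and it clamps every interval end (including the first) to the end date.

```scala
import java.time.LocalDate

// Returns non-overlapping (start, end) pairs covering startDate..endDate inclusive.
def dateIntervals(startDate: LocalDate, endDate: LocalDate, dayInterval: Int): List[(LocalDate, LocalDate)] =
  Iterator.iterate(startDate)(_.plusDays(dayInterval))  // successive interval starts
    .takeWhile(start => !start.isAfter(endDate))        // stop once a start passes the end date
    .map { start =>
      val end = start.plusDays(dayInterval - 1)
      (start, if (end.isAfter(endDate)) endDate else end)
    }
    .toList

dateIntervals(LocalDate.parse("2018-01-01"), LocalDate.parse("2018-02-20"), 20).foreach(println)
// (2018-01-01,2018-01-20)
// (2018-01-21,2018-02-09)
// (2018-02-10,2018-02-20)
```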
Something like (Untested):
def dates(startDate: LocalDate, endDate: LocalDate, dayInterval: Int): List[(LocalDate, LocalDate, Int)] = {
  if (startDate.isAfter(endDate)) {
    Nil
  }
  else {
    val nextStart = startDate.plusDays(dayInterval)
    // compare against endDate: the next start falling past the end means this is the last interval
    if (nextStart.isAfter(endDate)) {
      // DAYS.between returns a Long, so convert it to match the declared Int
      List((startDate, endDate, ChronoUnit.DAYS.between(startDate, endDate).toInt))
    }
    else {
      (startDate, nextStart, dayInterval) :: dates(nextStart, endDate, dayInterval)
    }
  }
}
If you're open to using Joda for date-time manipulations, here's what I use:
import org.joda.time.{DateTime, Days}
// given from & to dates, find no of days elapsed in between (integer)
def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays
def getDateSegments(from: DateTime, to: DateTime, interval: Int): Seq[(DateTime, DateTime)] = {
// no of days between from & to dates
val days: Int = getDaysInBetween(from, to) + 1
// no of segments (date ranges) between to & from dates
val segments: Int = days / interval
// (remaining) no of days in last range
val remainder: Int = days % interval
// last date-range
val remainderRanges: Seq[(DateTime, DateTime)] =
if (remainder != 0) from -> from.plusDays(remainder - 1) :: Nil
else Nil
// all (remaining) date-ranges + last date-range
(0 until segments).map { segment: Int =>
to.minusDays(segment * interval + interval - 1) -> to.minusDays(segment * interval)
} ++ remainderRanges
}
This thread arises from my previous question. I need to create a Seq[String] whose elements are path strings, but now I also need to append the numbers 7, 8, ..., 22 after the date. Also, I cannot use LocalDate as was suggested in the answer to the above-cited question:
path/file_2017-May-1-7
path/file_2017-May-1-8
...
path/file_2017-May-1-22
path/file_2017-April-30-7
path/file_2017-April-30-8
...
path/file_2017-April-30-22
..
I am searching for a flexible solution. My current solution requires defining each date (yyyy-MMM-dd) manually, which does not scale if I need more than 2 dates, e.g. 10 or 100. Moreover, filePathsList is currently a Set[Seq[String]] and I don't know how to convert it into a Seq[String].
val formatter = new SimpleDateFormat("yyyy-MMM-dd")
val currDay = Calendar.getInstance
currDay.add(Calendar.DATE, -1)
val day_1_ago = currDay.getTime
currDay.add(Calendar.DATE, -1)
val day_2_ago = currDay.getTime
val dates = Set(formatter.format(day_1_ago),formatter.format(day_2_ago))
val filePathsList = dates.map(date => {
  var list = Seq.empty[String]
  for (num <- 7 to 22) {
    list = list :+ s"path/file_$date-$num"
  }
  list
})
Here is how I was able to achieve what you outlined. Adjust the days val to configure the number of days you care about:
import java.text.SimpleDateFormat
import java.util.Calendar
val currDay = Calendar.getInstance
val days = 5
val dates = currDay.getTime +: List.fill(days){
currDay.add(Calendar.DATE, -1)
currDay.getTime
}
val formatter = new SimpleDateFormat("yyyy-MMM-dd")
val filePathsList = for {
date <- dates
num <- 7 to 22
} yield s"path/file_${formatter.format(date)}-$num"
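The for comprehension above already yields a flat sequence. For the other part of the question, turning an existing Set[Seq[String]] into a Seq[String], a minimal sketch is to call `toSeq` and then `flatten`. The sample values below are hypothetical and only shaped like the question's paths.

```scala
// Hypothetical sample data shaped like the question's filePathsList.
val nested: Set[Seq[String]] = Set(
  Seq("path/file_2017-May-1-7", "path/file_2017-May-1-8"),
  Seq("path/file_2017-Apr-30-7", "path/file_2017-Apr-30-8")
)

// Convert the outer Set to a Seq, then concatenate the inner sequences.
val flat: Seq[String] = nested.toSeq.flatten

flat.foreach(println)
```

A Set has no guaranteed ordering, so the order of the date groups in `flat` is not defined; sort the result if the order matters.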
Given a start and an end date I would like to iterate on it by day using a foreach, map or similar function. Something like
(DateTime.now to DateTime.now + 5.day by 1.day).foreach(println)
I am using https://github.com/nscala-time/nscala-time, but I get returned a joda Interval object if I use the syntax above, which I suspect is also not a range of dates, but a sort of range of milliseconds.
EDIT: The question is obsolete. As advised on the joda homepage, if you are using java 8 you should start with or migrate to java.time.
You may use plusDays:
val now = DateTime.now
(0 until 5).map(now.plusDays(_)).foreach(println)
Given start and end dates:
import org.joda.time.{DateTime, Days}
val start = DateTime.now.minusDays(5)
val end = DateTime.now.plusDays(5)
val daysCount = Days.daysBetween(start, end).getDays()
(0 until daysCount).map(start.plusDays(_)).foreach(println)
For just iterating by day, I do:
Iterator.iterate(start) { _ + 1.day }.takeWhile(_.isBefore(end))
This has proven to be useful enough that I have a small helper object to provide an implicit and allow for a type transformation:
object IntervalIterators {
implicit class ImplicitIterator(val interval: Interval) extends AnyVal {
def iterateBy(step: Period): Iterator[DateTime] = Iterator.iterate(interval.start) { _ + step }
.takeWhile(_.isBefore(interval.end))
def iterateBy[A](step: Period, transform: DateTime => A): Iterator[A] = iterateBy(step).map(transform)
def iterateByDay: Iterator[LocalDate] = iterateBy(1.day, { _.toLocalDate })
def iterateByHour: Iterator[DateTime] = iterateBy(1.hour)
}
}
Sample usage:
import IntervalIterators._
(DateTime.now to 5.day.from(DateTime.now)).iterateByDay // Iterator[LocalDate]
(30.minutes.ago to 1.hour.from(DateTime.now)).iterateBy(1.second) // Iterator[DateTime], broken down by second
Solution with java.time API using Scala
Necessary import and initialization
import java.time.temporal.ChronoUnit
import java.time.temporal.ChronoField.EPOCH_DAY
import java.time.{LocalDate, Period}
val now = LocalDate.now
val daysTill = 5
Create List of LocalDate for sample duration
(0 to daysTill)
.map(days => now.plusDays(days))
.foreach(println)
Iterate over specific dates between start and end using toEpochDay or getLong(ChronoField.EPOCH_DAY)
//Extract the duration
val endDay = now.plusDays(daysTill)
val startDay = now
val duration = endDay.getLong(EPOCH_DAY) - startDay.getLong(EPOCH_DAY)
/* This code does not give the desired results, as trudolf pointed out
val duration = Period
.between(now, now.plusDays(daysTill))
.get(ChronoUnit.DAYS)
*/
//Create list for the duration
(0 to duration)
.map(days => now.plusDays(days))
.foreach(println)
This answer fixes an issue in mrsrinivas's answer: .get(ChronoUnit.DAYS) returns only the days part of the period, not the total number of days.
Necessary import and initialization
import java.time.temporal.ChronoUnit
import java.time.{LocalDate, Period}
Note how the above answer would lead to a wrong result (the two dates are 116 days apart, that is, 117 dates when both ends are included):
scala> Period.between(start, end)
res6: java.time.Period = P3M26D
scala> Period.between(start, end).get(ChronoUnit.DAYS)
res7: Long = 26
Iterate over specific dates between start and end
val start = LocalDate.of(2018, 1, 5)
val end = LocalDate.of(2018, 5, 1)
// Create List of `LocalDate` for the period between start and end date
val dates: IndexedSeq[LocalDate] = (0L to (end.toEpochDay - start.toEpochDay))
.map(days => start.plusDays(days))
dates.foreach(println)
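The difference between the days component of a Period and the total day count can be checked directly. This is a small sketch using the same two dates; the value names `from`, `to`, `dayPart` and `totalDays` are chosen here for illustration.

```scala
import java.time.{LocalDate, Period}
import java.time.temporal.ChronoUnit

val from = LocalDate.of(2018, 1, 5)
val to   = LocalDate.of(2018, 5, 1)

// Only the days component of the period P3M26D.
val dayPart: Int = Period.between(from, to).getDays

// Total number of whole days between the two dates.
val totalDays: Long = ChronoUnit.DAYS.between(from, to)

println(s"$dayPart $totalDays") // 26 116
```

`ChronoUnit.DAYS.between` gives the same value as the `toEpochDay` difference used above.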
You can use something like this:
object Test extends App {
private val startDate: DateTime = DateTime.now()
private val endDate: DateTime = DateTime.now().plusDays(5)
private val interval: Interval = new Interval(startDate, endDate)
Stream.from(0,1)
.takeWhile(index => interval.contains(startDate.plusDays(index)))
.foreach(index => println(startDate.plusDays(index)))
}
In this case, the Scala way is the Java way:
When running Scala on Java 9+, we can use java.time.LocalDate::datesUntil:
import java.time.LocalDate
import collection.JavaConverters._
// val start = LocalDate.of(2019, 1, 29)
// val end = LocalDate.of(2019, 2, 2)
start.datesUntil(end).iterator.asScala
// Iterator[java.time.LocalDate] = <iterator> (2019-01-29, 2019-01-30, 2019-01-31, 2019-02-01)
And if the last date is to be included:
start.datesUntil(end.plusDays(1)).iterator.asScala
// 2019-01-29, 2019-01-30, 2019-01-31, 2019-02-01, 2019-02-02
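`datesUntil` was added in Java 9. On Java 8, a similar end-exclusive iterator can be sketched with `Iterator.iterate`; the helper name `datesBetween` is hypothetical.

```scala
import java.time.LocalDate

// End-exclusive, like datesUntil: yields start, start + 1 day, ... up to the day before end.
def datesBetween(start: LocalDate, end: LocalDate): Iterator[LocalDate] =
  Iterator.iterate(start)(_.plusDays(1)).takeWhile(_.isBefore(end))

datesBetween(LocalDate.of(2019, 1, 29), LocalDate.of(2019, 2, 2)).foreach(println)
// 2019-01-29
// 2019-01-30
// 2019-01-31
// 2019-02-01
```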
import java.util.{Calendar, Date}
import scala.annotation.tailrec
/** Gets date list between two dates
*
* @param startDate Start date
* @param endDate End date
* @return List of dates from startDate to endDate
*/
def getDateRange(startDate: Date, endDate: Date): List[Date] = {
@tailrec
def addDate(acc: List[Date], startDate: Date, endDate: Date): List[Date] = {
if (startDate.after(endDate)) acc
else addDate(endDate :: acc, startDate, addDays(endDate, -1))
}
addDate(List(), startDate, endDate)
}
/** Adds a date offset to the given date
*
* @param date ==> Date
* @param amount ==> Offset (can be negative)
* @return ==> New date
*/
def addDays(date: Date, amount: Int): Date = {
val cal = Calendar.getInstance()
cal.setTime(date)
cal.add(Calendar.DATE, amount)
cal.getTime
}