Subtract Months from YYYYMM date in Scala

I am trying to subtract months from a date in YYYYMM format.
import java.text.SimpleDateFormat
val date = 202012
val dt_format = new SimpleDateFormat("YYYYMM")
val formattedDate = dt_format.format(date)
new DateTime(formattedDate).minusMonths(3).toDate();
Expected output:
202012 - 3 months = 202009,
202012 - 14 months = 201910
But it did not work as expected. Please help!

Among the standard date/time types, YearMonth seems to be the most appropriate for the given use case.
import java.time.format.DateTimeFormatter
import java.time.YearMonth
val format = DateTimeFormatter.ofPattern("yyyyMM")
YearMonth.parse("197001", format).minusMonths(13) // 1968-12
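Applied to the question's input, a minimal sketch (note that the Int value 202012 has to be handled as a String before parsing):
import java.time.YearMonth
import java.time.format.DateTimeFormatter
val fmt = DateTimeFormatter.ofPattern("yyyyMM")
// parse the YYYYMM string, shift by the given number of months, format back
def minusMonths(yyyymm: String, months: Int): String =
  YearMonth.parse(yyyymm, fmt).minusMonths(months).format(fmt)
minusMonths("202012", 3)  // "202009"
minusMonths("202012", 14) // "201910"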

This solution uses the functionality in java.time, available since Java 8. I would have preferred a solution that did not require adjusting the input so that it could be (forcefully) parsed into a LocalDate (so that plusMonths could be used), but at least it works.
Probably a simple regex could get the job done. ;-)
import java.time.format.DateTimeFormatter
import java.time.LocalDate
val inFmt = DateTimeFormatter.ofPattern("yyyyMMdd")
val outFmt = DateTimeFormatter.ofPattern("yyyyMM")
def plusMonths(string: String, months: Int): String =
  LocalDate.parse(s"${string}01", inFmt).plusMonths(months).format(outFmt)
assert(plusMonths("202012", -3) == "202009")
assert(plusMonths("202012", -14) == "201910")
You can play around with this code here on Scastie.
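As for the regex remark, a minimal input check could look like this (a sketch, assuming Scala 2.13+, where Regex#matches is available):
// exactly six digits, e.g. "202012"
val yyyymmPattern = raw"\d{6}".r
def isValidYearMonth(s: String): Boolean = yyyymmPattern.matches(s)
isValidYearMonth("202012")  // true
isValidYearMonth("2020-12") // false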

Related

How to take data from several parquet files at once?

I need your help because I am new to the Spark framework.
I have a folder with a lot of parquet files. The names of these files share the same format: DD-MM-YYYY, for example '01-10-2018', '02-10-2018', '03-10-2018', etc.
My application has two input parameters: dateFrom and dateTo.
When I try the following code, the application hangs. It seems the application scans all the files in the folder.
val mf = spark.read.parquet("/PATH_TO_THE_FOLDER/*")
.filter($"DATE".between(dateFrom + " 00:00:00", dateTo + " 23:59:59"))
mf.show()
I need to fetch the data for a period as fast as possible.
I think it would be great to divide the period into days, read the files separately, and union them like this:
val mf1 = spark.read.parquet("/PATH_TO_THE_FOLDER/01-10-2018");
val mf2 = spark.read.parquet("/PATH_TO_THE_FOLDER/02-10-2018");
val finalDf = mf1.union(mf2).distinct();
dateFrom and dateTo are dynamic, so I don't know how to organize the code correctly right now. Please help!
@y2k-shubham I tried to test the following code, but it raises an error:
import org.joda.time.{DateTime, Days}
import org.apache.spark.sql.{DataFrame, SparkSession}
val dateFrom = DateTime.parse("2018-10-01")
val dateTo = DateTime.parse("2018-10-05")
def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays
def getDatesInBetween(from: DateTime, to: DateTime): Seq[DateTime] = {
  val days = getDaysInBetween(from, to)
  (0 to days).map(day => from.plusDays(day).withTimeAtStartOfDay())
}
val datesInBetween: Seq[DateTime] = getDatesInBetween(dateFrom, dateTo)
val unionDf: DataFrame = datesInBetween.foldLeft(spark.emptyDataFrame) { (intermediateDf: DataFrame, date: DateTime) =>
  intermediateDf.union(spark.read.parquet("PATH" + date.toString("yyyy-MM-dd") + "/*.parquet"))
}
unionDf.show()
ERROR:
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 0 columns and the second table has 20 columns;
It seems the intermediateDf DataFrame is empty at the start. How can I fix the problem?
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.{DataFrame, SparkSession}
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
def dateRangeInclusive(start: String, end: String): Iterator[LocalDate] = {
  val startDate = LocalDate.parse(start, formatter)
  val endDate = LocalDate.parse(end, formatter)
  Iterator.iterate(startDate)(_.plusDays(1))
    .takeWhile(d => d.isBefore(endDate) || d.isEqual(endDate))
}
val spark = SparkSession.builder().getOrCreate()
val data: DataFrame = dateRangeInclusive("2018-10-01", "2018-10-05")
  .map(d => spark.read.parquet(s"/path/to/directory/${formatter.format(d)}"))
  .reduce(_ union _)
I also suggest using the native JSR 310 API (part of Java SE since Java 8) rather than joda-time, since it is more modern and does not require external dependencies. Note that first creating a sequence of paths and doing map+reduce is probably simpler for this use case than a more general foldLeft-based solution.
Additionally, you can use reduceOption; then you'll get an Option[DataFrame], which will be None if the input date range is empty. Also, if it is possible for some input directories/files to be missing, you should check for their existence before invoking spark.read.parquet. If your data is on HDFS, you should probably use the Hadoop FS API:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val spark = SparkSession.builder().getOrCreate()
val fs = FileSystem.get(new Configuration(spark.sparkContext.hadoopConfiguration))
val data: Option[DataFrame] = dateRangeInclusive("2018-10-01", "2018-10-05")
  .map(d => s"/path/to/directory/${formatter.format(d)}")
  .filter(p => fs.exists(new Path(p)))
  .map(spark.read.parquet(_))
  .reduceOption(_ union _)
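The resulting Option[DataFrame] can then be consumed with the empty case handled explicitly, for example:
data match {
  case Some(df) => df.show()
  case None     => println("no parquet files found for the given date range")
}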
While I haven't tested this piece of code, it should work (perhaps with slight modifications):
import org.joda.time.{DateTime, Days}
import org.apache.spark.sql.{DataFrame, SparkSession}
// return no of days between two dates
def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays
// return sequence of dates between two dates
def getDatesInBetween(from: DateTime, to: DateTime): Seq[DateTime] = {
val days = getDaysInBetween(from, to)
(0 to days).map(day => from.plusDays(day).withTimeAtStartOfDay())
}
// read parquet data of given date-range from given path
// (you might want to pass SparkSession in a different manner)
def readDataForDateRange(path: String, from: DateTime, to: DateTime)(implicit spark: SparkSession): DataFrame = {
  // get date-range sequence
  val datesInBetween: Seq[DateTime] = getDatesInBetween(from, to)
  // read data of from-date (needed because schema of all DataFrames should be same for union)
  val fromDateDf: DataFrame = spark.read.parquet(path + "/" + datesInBetween.head.toString("yyyy-MM-dd"))
  // read and union remaining dataframes (functionally)
  val unionDf: DataFrame = datesInBetween.tail.foldLeft(fromDateDf) { (intermediateDf: DataFrame, date: DateTime) =>
    intermediateDf.union(spark.read.parquet(path + "/" + date.toString("yyyy-MM-dd")))
  }
  // return union-df
  unionDf
}
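A hypothetical invocation, assuming an implicit SparkSession is in scope and the directory layout from the question:
implicit val spark: SparkSession = SparkSession.builder().getOrCreate()
val df = readDataForDateRange(
  "/PATH_TO_THE_FOLDER",
  DateTime.parse("2018-10-01"),
  DateTime.parse("2018-10-05"))
df.show()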
Reference: How to calculate 'n' days interval date in functional style?

Scala - Get list of days and months for a selected year

I asked a question regarding this here. It works fine and is a nice solution; however, I just realized that when Java 1.8 is not installed, import java.time is not available. I have Java 1.7 and cannot update due to many other issues.
Java 1.8 and later:
import java.time.{LocalDate, Year}
def allDaysForYear(year: String): List[(String, String, String)] = {
  val daysInYear = if (Year.of(year.toInt).isLeap) 366 else 365
  for {
    day <- (1 to daysInYear).toList
    localDate = LocalDate.ofYearDay(year.toInt, day)
    month = localDate.getMonthValue
    dayOfMonth = localDate.getDayOfMonth
  } yield (year, month.toString, dayOfMonth.toString)
}
Older Java versions:
So, for instance using import java.util.Calendar, how can the same issue be solved? -> Get all months and days for a given year
If you can use Joda-Time, the following worked for me:
import java.util.Calendar
import org.joda.time.{DateTime, LocalDate, DurationFieldType}
def allDaysForYear(year: String): List[(String, String, String)] = {
  val dateTime = new DateTime()
  val daysInYear = if (dateTime.withYear(year.toInt).year.isLeap) 366 else 365
  val calendar = Calendar.getInstance
  calendar.set(year.toInt, 0, 0)
  val ld = LocalDate.fromCalendarFields(calendar)
  for {
    day <- (1 to daysInYear).toList
    updatedLd = ld.withFieldAdded(DurationFieldType.days, day)
  } yield (year, updatedLd.getMonthOfYear.toString, updatedLd.getDayOfMonth.toString)
}
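If Joda-Time is not available either, here is a sketch that uses only java.util.Calendar (note that Calendar months are zero-based, hence the + 1):
import java.util.Calendar
def allDaysForYear(year: String): List[(String, String, String)] = {
  val calendar = Calendar.getInstance
  calendar.clear()
  calendar.set(Calendar.YEAR, year.toInt)
  // number of days (365 or 366) in the configured year
  val daysInYear = calendar.getActualMaximum(Calendar.DAY_OF_YEAR)
  (for (day <- 1 to daysInYear) yield {
    calendar.set(Calendar.DAY_OF_YEAR, day)
    (year, (calendar.get(Calendar.MONTH) + 1).toString, calendar.get(Calendar.DAY_OF_MONTH).toString)
  }).toList
}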

Create a list of String elements entitled according to dates

I want to create a list of String elements, each one having a date in its title:
data_2017_May_4
data_2017_May_3
data_2017_May_2
The important thing is how these dates are created. They should be created starting from the current date and going back 2 days. If the current date is May 1, 2017, then the result would be:
data_2017_May_1
data_2017_April_30
data_2017_April_29
The same logic is applied to the switch between years (December/January).
This is my code, but it does not take the change of months and years into account. Also, it skips dates:
import java.text.SimpleDateFormat
import java.util.Calendar
val formatter = new SimpleDateFormat("yyyy-MMM-dd")
val currDay = Calendar.getInstance
var list: List[String] = Nil
var date: String = ""
for (i <- 0 to 2) {
  currDay.add(Calendar.DATE, -i)
  date = "data_" + formatter.format(currDay.getTime)
  list ::= date
}
println(list.mkString(","))
How can I achieve this?
Can you use java.time.LocalDate? If so you can easily accomplish this:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
val desiredFormat = DateTimeFormatter.ofPattern("yyyy-MMM-dd")
val now = LocalDate.now()
val dates = Set(now, now.minusDays(1), now.minusDays(2))
dates.map(_.format(desiredFormat))
  .foreach(date => println(s"data_$date"))
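If the exact underscore-separated, non-padded titles from the question are required, a slightly different pattern and an ordered sequence help (Set does not guarantee iteration order). A sketch, assuming an English default locale for the month names:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
// '_' is a literal in the pattern; single 'd' gives the non-padded day of month
val underscoreFormat = DateTimeFormatter.ofPattern("yyyy_MMM_d")
val today = LocalDate.now()
// a Seq keeps the dates in descending order
val titles = (0 to 2).map(i => s"data_${today.minusDays(i).format(underscoreFormat)}")
titles.foreach(println) // e.g. data_2017_May_4, data_2017_May_3, data_2017_May_2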

Spark UDF thread safety

I'm using Spark to take a dataframe containing a column of dates, and create 3 new columns containing the time in days, weeks, and months between the date in the column and today.
My concern is around the use of SimpleDateFormat, which isn't thread safe. Ordinarily without Spark this would be fine since it's a local variable, but with Spark's lazy evaluation, is sharing a single SimpleDateFormat instance over multiple UDFs likely to cause an issue?
def calcTimeDifference(...) {
  val sdf = new SimpleDateFormat(dateFormat)
  val dayDifference = udf { (x: String) => math.abs(Days.daysBetween(new DateTime(sdf.parse(x)), presentDate).getDays) }
  output = output.withColumn("days", dayDifference(myCol))
  val weekDifference = udf { (x: String) => math.abs(Weeks.weeksBetween(new DateTime(sdf.parse(x)), presentDate).getWeeks) }
  output = output.withColumn("weeks", weekDifference(myCol))
  val monthDifference = udf { (x: String) => math.abs(Months.monthsBetween(new DateTime(sdf.parse(x)), presentDate).getMonths) }
  output = output.withColumn("months", monthDifference(myCol))
}
I don't think it's safe; as we know, SimpleDateFormat is not thread-safe.
So I prefer this method for using SimpleDateFormat in Spark if you need it:
import java.text.SimpleDateFormat
import java.util.SimpleTimeZone
/**
 * Thread-safe SimpleDateFormat holder for Spark.
 */
object ThreadSafeFormat extends ThreadLocal[SimpleDateFormat] {
  override def initialValue(): SimpleDateFormat = {
    val dateFormat = new SimpleDateFormat("yyyy-MM-dd:H")
    // if you need UTC time, you can set the UTC timezone (raw offset 0)
    val utcTimeZone = new SimpleTimeZone(0, "UTC")
    dateFormat.setTimeZone(utcTimeZone)
    dateFormat
  }
}
Then use ThreadSafeFormat.get() to obtain a thread-safe SimpleDateFormat wherever you need one.
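For example, the UDFs from the question could fetch the formatter inside the lambda, so each executor thread gets its own instance. A sketch, with output, presentDate, and myCol assumed from the question:
import org.apache.spark.sql.functions.udf
import org.joda.time.{DateTime, Days}
// each call to get() returns the SimpleDateFormat owned by the current thread
val dayDifference = udf { (x: String) =>
  math.abs(Days.daysBetween(new DateTime(ThreadSafeFormat.get().parse(x)), presentDate).getDays)
}
output = output.withColumn("days", dayDifference(myCol))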

How to match dates through fromJson(toJson(date)) with specs2

I am stuck on the following problem: I want to write a specs2 specification to assert that my to-JSON and from-JSON transformations are symmetrical. However, I get an error on Joda DateTime dates.
'2012-04-17T00:04:00.000+02:00' is not equal to '2012-04-17T00:04:00.000+02:00'. Values have the same string representation but possibly different types like List[Int] and List[String] (TimeSpecs.scala:18)
Here is a minimal spec demonstrating the problem:
import org.joda.time.DateTime
import org.specs2.mutable.Specification
class TimeSpecs extends Specification {
  "joda and specs2" should {
    "play nice" in {
      val date = DateTime.parse("2012-04-17T00:04:00+0200")
      val date2 = DateTime.parse("2012-04-17T00:04:00+0200")
      date === date2
    }
    "play nice through play json transform" in {
      import play.api.libs.json._
      import play.api.libs.json.Json._
      val date = DateTime.parse("2012-04-17T00:04:00+0200")
      val jsDate = toJson(date)
      val date2 = jsDate.as[DateTime]
      date === date2
    }
  }
}
How should I compare date and date2 in the second test? They are the same, but specs2 doesn't seem to see that :(
--- edit
"Manually" inspecting the type at runtime with date.getClass.getCanonicalName returns org.joda.time.DateTime as expected:
import org.joda.time.DateTime
import org.specs2.mutable.Specification
class TimeSpecs extends Specification {
  "joda and specs2" should {
    "play nice" in {
      val date = DateTime.parse("2012-04-17T00:04:00+0200")
      val date2 = DateTime.parse("2012-04-17T00:04:00+0200")
      date === date2
    }
    "play nice through play json transform" in {
      import play.api.libs.json._
      import play.api.libs.json.Json._
      val date: DateTime = DateTime.parse("2012-04-17T00:04:00+0200")
      val jsDate = toJson(date)
      val date2: DateTime = jsDate.as[DateTime]
      println(date.getClass.getCanonicalName)  // prints org.joda.time.DateTime
      println(date2.getClass.getCanonicalName) // prints org.joda.time.DateTime
      date === date2
    }
  }
}
Using DateTime#isEqual does kind of work, but I lose the benefit of fluent matchers and the useful error messages they bring. Additionally, what I am actually trying to compare are case class instances which happen to contain dates, not the dates themselves.
Using
date should beEqualTo(date2)
yields the same error as ===
The problem is that Joda-Time defines a very strict equals which takes the date's Chronology into account (DateTime#getChronology). The isEqual method proposed by Kim Stebel ignores the Chronology.
From there, there are two possibilities. One is defining custom Reads and Writes for Play, then creating the dates with the same pattern, as in the following example:
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat
import org.specs2.mutable.Specification
class TimeSpecs extends Specification {
  val pattern = "yyyy-MM-dd'T'HH:mm:ssZZ"
  "joda and specs2" should {
    "play nice" in {
      val date = DateTime.parse("2012-04-17T00:04:00+0200", DateTimeFormat.forPattern(pattern))
      val date2 = DateTime.parse("2012-04-17T00:04:00+0200", DateTimeFormat.forPattern(pattern))
      date === date2
    }
    "play nice through play json transform" in {
      import play.api.libs.json.Json._
      // play2 custom write
      implicit def customJodaWrite = play.api.libs.json.Writes.jodaDateWrites(pattern)
      // play2 custom read
      implicit def customJodaRead = play.api.libs.json.Reads.jodaDateReads(pattern)
      // make sure you parse the initial date with the same pattern
      val date: DateTime = DateTime.parse("2012-04-17T00:04:00+0200", DateTimeFormat.forPattern(pattern))
      val jsDate = toJson(date)
      val date2: DateTime = jsDate.as[DateTime]
      println(date.getClass.getCanonicalName)
      println(date2.getClass.getCanonicalName)
      println(jsDate)
      date should beEqualTo(date2)
    }
  }
}
Play 2.1 defaults to parsing (and writing to JSON) based on the Unix timestamp in milliseconds, without timezone information. When parsing back from the Unix timestamp, it will interpret it in the local computer's timezone (in my case Europe/Paris). Hence the need for a custom parser/writer.
Joda-Time uses a specific formatter when parse is called without a formatter argument; it doesn't seem possible to create the same formatter from only a pattern string (I haven't found a way to activate DateTimeFormatter#withOffsetParsed through a pattern string).
Another possibility may be to define a custom specs2 matcher for Joda-Time which would use isEqual instead of equals.
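Such a matcher might look like this (an untested sketch against the classic specs2 matcher API; the name beSameInstantAs is made up):
import org.joda.time.DateTime
import org.specs2.matcher.{Expectable, Matcher}
// compares instants with isEqual, ignoring the Chronology
case class beSameInstantAs(expected: DateTime) extends Matcher[DateTime] {
  def apply[S <: DateTime](actual: Expectable[S]) =
    result(actual.value.isEqual(expected),
           actual.description + " is the same instant as " + expected,
           actual.description + " is not the same instant as " + expected,
           actual)
}
// usage: date must beSameInstantAs(date2)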
Since I don't want the Unix epoch in my JSON anyway, I'll stick with the custom Play transformers.