reformat DateTime Array - scala

I have generated an Array of dates with the following code using jodatime
import org.joda.time.{DateTime, Period}
import org.joda.time.format.DateTimeFormat
import java.text.SimpleDateFormat
def dateRange(from: DateTime, to: DateTime, step: Period): Iterator[DateTime]
=Iterator.iterate(from)(_.plus(step)).takeWhile(!_.isAfter(to))
val from = new DateTime(2000, 06, 30,0,0,0,0)
val to = new DateTime(2001, 06, 30,0,0,0,0)
val by = new Period(0,2,0,0,0,0,0,0)
val range = { dateRange(from ,to, by)}
val dateRaw = (range).toArray
How can I pass DateTimeFormat.forPattern("YYYYMMdd") to each value in order to get an Array of integers of format yyyyMMdd
Array[Int] = Array(20000630,20000830,20001030...

val f = DateTimeFormat.forPattern("YYYYMMdd")
dateRaw.map(d => f.print(d).toInt)

Related

Spark scala UDF in DataFrames is not working

I have defined a function to convert Epoch time to CET and using that function after wrapping as UDF in Spark dataFrame. It is throwing error and not allowing me to use it. Please find below my code.
Function used to convert Epoch time to CET:
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, TimeZone}
import java.util.concurrent.TimeUnit
def convertNanoEpochToDateTime(
d: Long,
f: String = "dd/MM/yyyy HH:mm:ss.SSS",
z: String = "CET",
msPrecision: Int = 9
): String = {
val sdf = new SimpleDateFormat(f)
sdf.setTimeZone(TimeZone.getTimeZone(z))
val date = new Date((d / Math.pow(10, 9).toLong) * 1000L)
val stringTime = sdf.format(date)
if (f.contains(".S")) {
val lng = d.toString.length
val milliSecondsStr = d.toString.substring(lng-9,lng)
stringTime.substring(0, stringTime.lastIndexOf(".") + 1) + milliSecondsStr.substring(0,msPrecision)
}
else stringTime
}
val epochToDateTime = udf(convertNanoEpochToDateTime _)
Below given Spark DataFrame uses the above defined UDF for converting Epoch time to CET
val df2 = df1.select($"messageID",$"messageIndex",epochToDateTime($"messageTimestamp").as("messageTimestamp"))
I am getting the below shown error, when I run the code
Any idea how am I supposed to proceed in this scenario ?
The spark optimizer execution tells you that your function is not a Function1, that means that it is not a function that accepts one parameter. You have a function with four input parameters. And, although you may think that in Scala you are allowed to call that function with only one parameter because you have default values for the other three, it seems that Catalyst does not work in this way, so you will need to change the definition of your function to something like:
def convertNanoEpochToDateTime(
f: String = "dd/MM/yyyy HH:mm:ss.SSS"
)(z: String = "CET")(msPrecision: Int = 9)(d: Long): String
or
def convertNanoEpochToDateTime(f: String)(z: String)(msPrecision: Int)(d: Long): String
and put the default values in the udf creation:
val epochToDateTime = udf(
convertNanoEpochToDateTime("dd/MM/yyyy HH:mm:ss.SSS")("CET")(9) _
)
and try to define the SimpleDateFormat as a static transient value out of the function.
I found why the error is due to and resolved it. The problem is when I wrap the scala function as UDF, its expecting 4 parameters, but I was passing only one parameter. Now, I removed 3 parameters from the function and took those values inside the function itself, since they are constant values. Now in Spark Dataframe, I am calling the function with only 1 parameter and it works perfectly fine.
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, TimeZone}
import java.util.concurrent.TimeUnit
def convertNanoEpochToDateTime(
d: Long
): String = {
val f: String = "dd/MM/yyyy HH:mm:ss.SSS"
val z: String = "CET"
val msPrecision: Int = 9
val sdf = new SimpleDateFormat(f)
sdf.setTimeZone(TimeZone.getTimeZone(z))
val date = new Date((d / Math.pow(10, 9).toLong) * 1000L)
val stringTime = sdf.format(date)
if (f.contains(".S")) {
val lng = d.toString.length
val milliSecondsStr = d.toString.substring(lng-9,lng)
stringTime.substring(0, stringTime.lastIndexOf(".") + 1) + milliSecondsStr.substring(0,msPrecision)
}
else stringTime
}
val epochToDateTime = udf(convertNanoEpochToDateTime _)
import spark.implicits._
val df1 = List(1659962673251388155L,1659962673251388155L,1659962673251388155L,1659962673251388155L).toDF("epochTime")
val df2 = df1.select(epochToDateTime($"epochTime"))

scala loop to add date string into a seq

I am trying to add dates in string from an array into a seq while determining whether it is a weekend day.
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, GregorianCalendar}
import org.apache.spark.sql.SparkSession
val arrDateEsti=dtfBaseNonLong.select("AAA").distinct().collect.map(_(0).toString);
var dtfDateCate = Seq(
("0000", "0")
);
for (a<-0 to arrDateEsti.length-1){
val dayDate:Date = dateFormat.parse(arrDateEsti(a));
val cal=new GregorianCalendar
cal.setTime(dayDate);
if (cal.get(Calendar.DAY_OF_WEEK)==1 || cal.get(Calendar.DAY_OF_WEEK)==7){
dtfDateCate:+(arrDateEsti(a),"1")
}else{
dtfDateCate:+(arrDateEsti(a),"0")
}
};
scala> dtfDateCate
res20: Seq[(String, String)] = List((0000,0))
It returns the same initial sequence. But if I run one single element it works. What went wrong?
scala> val dayDate:Date = dateFormat.parse(arrDateEsti(0));
dayDate: java.util.Date = Thu Oct 15 00:00:00 CST 2020
scala> cal.setTime(dayDate);
scala> if (cal.get(Calendar.DAY_OF_WEEK)==1 || cal.get(Calendar.DAY_OF_WEEK)==7){
| dtfDateCate:+(arrDateEsti(0),"1")
| }else{
| dtfDateCate:+(arrDateEsti(0),"0")
| };
res14: Seq[(String, String)] = List((0000,0), (20201015,0))
I think this gets at what you're trying to do.
import java.time.LocalDate
import java.time.DayOfWeek.{SATURDAY, SUNDAY}
import java.time.format.DateTimeFormatter
//replace with dtfBaseNonLong.select(... code
val arrDateEsti = Seq("20201015", "20201017") //place holder
val dtFormat = DateTimeFormatter.ofPattern("yyyyMMdd")
val dtfDateCate = ("0000", "0") +:
arrDateEsti.map { dt =>
val day = LocalDate.parse(dt,dtFormat).getDayOfWeek()
if (day == SATURDAY || day == SUNDAY) (dt, "1")
else (dt, "0")
}
//dtfDateCate = Seq((0000,0), (20201015,0), (20201017,1))
yeah, it should be seqDateCate=seqDateCate:+(arrDateEsti(a),"1")

Find dates between two dates scala

I am trying to extract dates between two dates.
If my input is:
start date: 2020_04_02
end date: 2020_06_02
Output should be:
List("2020_04_02","2020_04_03", "2020_04_04", "2020_04_05", "2020_04_06")
So far i have tried:
val beginDate = LocalDate.parse(startDate, formatted)
val lastDate = LocalDate.parse(endDate, formatted)
beginDate.datesUntil(lastDate.plusDays(1))
.iterator()
.asScala
.map(date => formatter.format(date))
.toList
import java.time.format.DateTimeFormatter
private def formatter = DateTimeFormatter.ofPattern("yyyy_MM_dd")
But i think it could be done even in a more refined way
I'm not super proud of this, but I've done it this way before:
import java.time.LocalDate
val start = LocalDate.of(2020,1,1).toEpochDay
val end = LocalDate.of(2020,12,31).toEpochDay
val dates = (start to end).map(LocalDate.ofEpochDay(_).toString).toArray
You end up with:
dates: Array[String] = Array(2020-01-01, 2020-01-02, ..., 2020-12-31)
What you are doing is correct but your end date is not matching what you are expecting:
import java.time.format.DateTimeFormatter
import java.time.LocalDate
import scala.jdk.CollectionConverters._
private def formatter = DateTimeFormatter.ofPattern("yyyy_MM_dd")
val startDate = "2020_04_02"
val endDate1 = "2020_04_06" // "2020_06_02"
val endDate2 = "2020_06_02"
val beginDate = LocalDate.parse(startDate, formatter)
val lastDate1 = LocalDate.parse(endDate1, formatter)
val lastDate2 = LocalDate.parse(endDate2, formatter)
val res1 = beginDate.datesUntil(lastDate1.plusDays(1)).iterator().asScala.map(date => formatter.format(date)).toList
val res2 = beginDate.datesUntil(lastDate2.plusDays(1)).iterator().asScala.map(date => formatter.format(date)).toList
println(res1) // List(2020_04_02, 2020_04_03, 2020_04_04, 2020_04_05, 2020_04_06)
println(res2) // List(2020_04_02, 2020_04_03, 2020_04_04, 2020_04_05, 2020_04_06 ... 2020_05_31, 2020_06_01, 2020_06_02)
This function will return List[String].
import java.time.format.DateTimeFormatter
import java.time.LocalDate
val dt1 = "2020_04_02"
val dt2 = "2020_04_06"
def DatesBetween(startDate: String,endDate: String) : List[String] = {
def ConvertToFormat = DateTimeFormatter.ofPattern("yyyy_MM_dd")
val sdate = LocalDate.parse(startDate,ConvertToFormat)
val edate = LocalDate.parse(endDate,ConvertToFormat)
val DateRange = sdate.toEpochDay.until(edate.plusDays(1).toEpochDay).map(LocalDate.ofEpochDay).toList
val ListofDateRange = DateRange.map(date => ConvertToFormat.format(date)).toList
ListofDateRange
}
println(DatesBetween(dt1,dt2))

How take data from several parquet files at once?

I need your help cause I am new in Spark Framework.
I have folder with a lot of parquet files. The name of these files has the same format: DD-MM-YYYY. For example: '01-10-2018', '02-10-2018', '03-10-2018', etc.
My application has two input parameters: dateFrom and dateTo.
When I try to use next code application hangs. It seems like application scan all files in folder.
val mf = spark.read.parquet("/PATH_TO_THE_FOLDER/*")
.filter($"DATE".between(dateFrom + " 00:00:00", dateTo + " 23:59:59"))
mf.show()
I need to take data pool for period as fast as it possible.
I think it would be great to divide period into days and then read files separately, join them like that:
val mf1 = spark.read.parquet("/PATH_TO_THE_FOLDER/01-10-2018");
val mf2 = spark.read.parquet("/PATH_TO_THE_FOLDER/02-10-2018");
val final = mf1.union(mf2).distinct();
dateFrom and dateTo are dynamic, so I don't know how correctly organize code right now. Please help!
#y2k-shubham I tried to test next code, but it raise error:
import org.joda.time.{DateTime, Days}
import org.apache.spark.sql.{DataFrame, SparkSession}
val dateFrom = DateTime.parse("2018-10-01")
val dateTo = DateTime.parse("2018-10-05")
def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays
def getDatesInBetween(from: DateTime, to: DateTime): Seq[DateTime] = {
val days = getDaysInBetween(from, to)
(0 to days).map(day => from.plusDays(day).withTimeAtStartOfDay())
}
val datesInBetween: Seq[DateTime] = getDatesInBetween(dateFrom, dateTo)
val unionDf: DataFrame = datesInBetween.foldLeft(spark.emptyDataFrame) { (intermediateDf: DataFrame, date: DateTime) =>
intermediateDf.union(spark.read.parquet("PATH" + date.toString("yyyy-MM-dd") + "/*.parquet"))
}
unionDf.show()
ERROR:
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 0 columns and the second table has 20 columns;
It seems like intermediateDf DateFrame at start is empty. How to fix the problem?
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.{DataFrame, SparkSession}
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
def dateRangeInclusive(start: String, end: String): Iterator[LocalDate] = {
val startDate = LocalDate.parse(start, formatter)
val endDate = LocalDate.parse(end, formatter)
Iterator.iterate(startDate)(_.plusDays(1))
.takeWhile(d => d.isBefore(endDate) || d.isEqual(endDate))
}
val spark = SparkSession.builder().getOrCreate()
val data: DataFrame = dateRangeInclusive("2018-10-01", "2018-10-05")
.map(d => spark.read.parquet(s"/path/to/directory/${formatter.format(d)}"))
.reduce(_ union _)
I also suggest using the native JSR 310 API (part of Java SE since Java 8) rather than joda-time, since it is more modern and does not require external dependencies. Note that first creating a sequence of paths and doing map+reduce is probably simpler for this use case than a more general foldLeft-based solution.
Additionally, you can use reduceOption, then you'll get an Option[DataFrame] if the input date range is empty. Also, if it is possible for some input directories/files to be missing, you'd want to do a check before invoking spark.read.parquet. If your data is on HDFS, you should probably use the Hadoop FS API:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val spark = SparkSession.builder().getOrCreate()
val fs = FileSystem.get(new Configuration(spark.sparkContext.hadoopConfiguration))
val data: Option[DataFrame] = dateRangeInclusive("2018-10-01", "2018-10-05")
.map(d => s"/path/to/directory/${formatter.format(d)}")
.filter(p => fs.exists(new Path(p)))
.map(spark.read.parquet(_))
.reduceOption(_ union _)
While I haven't tested this piece of code, it must work (probably slight modification?)
import org.joda.time.{DateTime, Days}
import org.apache.spark.sql.{DataFrame, SparkSession}
// return no of days between two dates
def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays
// return sequence of dates between two dates
def getDatesInBetween(from: DateTime, to: DateTime): Seq[DateTime] = {
val days = getDaysInBetween(from, to)
(0 to days).map(day => from.plusDays(day).withTimeAtStartOfDay())
}
// read parquet data of given date-range from given path
// (you might want to pass SparkSession in a different manner)
def readDataForDateRange(path: String, from: DateTime, to: DateTime)(implicit spark: SparkSession): DataFrame = {
// get date-range sequence
val datesInBetween: Seq[DateTime] = getDatesInBetween(from, to)
// read data of from-date (needed because schema of all DataFrames should be same for union)
val fromDateDf: DataFrame = spark.read.parquet(path + "/" + datesInBetween.head.toString("yyyy-MM-dd"))
// read and union remaining dataframes (functionally)
val unionDf: DataFrame = datesInBetween.tail.foldLeft(fromDateDf) { (intermediateDf: DataFrame, date: DateTime) =>
intermediateDf.union(spark.read.parquet(path + "/" + date.toString("yyyy-MM-dd")))
}
// return union-df
unionDf
}
Reference: How to calculate 'n' days interval date in functional style?

How can I cast a string to a date in Scala so I can filter in in SparkSQL?

I want to be able to filter on a date just like you would in normal SQL. Is that possible? I'm running into an issue on how to convert the string from the text file into a date.
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.log4j._
import java.text._
//import java.util.Date
import java.sql.Date
object BayAreaBikeAnalysis {
case class Station(ID:Int, name:String, lat:Double, longitude:Double, dockCount:Int, city:String, installationDate:Date)
case class Status(station_id:Int, bikesAvailable:Int, docksAvailable:Int, time:String)
val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
def extractStations(line: String): Station = {
val fields = line.split(",",-1)
val station:Station = Station(fields(0).toInt, fields(1), fields(2).toDouble, fields(3).toDouble, fields(4).toInt, fields(5), dateFormat.parse(fields(6)))
return station
}
def extractStatus(line: String): Status = {
val fields = line.split(",",-1)
val status:Status = Status(fields(0).toInt, fields(1).toInt, fields(2).toInt, fields(3))
return status
}
def main(args: Array[String]) {
// Set the log level to only print errors
//Logger.getLogger("org").setLevel(Level.ERROR)
// Use new SparkSession interface in Spark 2.0
val spark = SparkSession
.builder
.appName("BayAreaBikeAnalysis")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp")
.getOrCreate()
//Load files into data sets
import spark.implicits._
val stationLines = spark.sparkContext.textFile("Data/station.csv")
val stations = stationLines.map(extractStations).toDS().cache()
val statusLines = spark.sparkContext.textFile("Data/status.csv")
val statuses = statusLines.map(extractStatus).toDS().cache()
//people.select("name").show()
stations.select("installationDate").show()
spark.stop()
}
}
Obviously fields(6).toDate() doesn't compile but I'm not sure what to use.
I think this post is what you are looking for.
Also here you'll find a good tutorial for string parse to date.
Hope this helps!
Following are the ways u can convert string to date in scala.
(1) In case of java.util.date :-
val date= new SimpleDateFormat("yyyy-MM-dd")
date.parse("2017-09-28")
(2) In case of joda's dateTime:-
DateTime.parse("09-28-2017")
Here is a helping function that takes on a string representing a date and transforms it into a Timestamp
import java.sql.Timestamp
import java.util.TimeZone
import java.text.{DateFormat, SimpleDateFormat}
def getTimeStamp(timeStr: String): Timestamp = {
val dateFormat: DateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
dateFormat.setTimeZone(TimeZone.getTimeZone("UTC"))
val date: Option[Timestamp] = {
try {
Some(new Timestamp(dateFormat.parse(timeStr).getTime))
} catch {
case _: Exception => Some(Timestamp.valueOf("19700101'T'000000"))
}
}
date.getOrElse(Timestamp.valueOf(timeStr))
}
Obviously, you will need to change your input date format from "yyyy-MM-dd'T'HH:mm:ss" into whatever format you have the date string.
Hope this helps.