Scala loop to add date strings into a Seq

I am trying to add date strings from an array into a Seq while determining whether each date falls on a weekend.
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, GregorianCalendar}
import org.apache.spark.sql.SparkSession
val arrDateEsti = dtfBaseNonLong.select("AAA").distinct().collect.map(_(0).toString);
var dtfDateCate = Seq(
  ("0000", "0")
);
for (a <- 0 to arrDateEsti.length - 1) {
  val dayDate: Date = dateFormat.parse(arrDateEsti(a));
  val cal = new GregorianCalendar
  cal.setTime(dayDate);
  if (cal.get(Calendar.DAY_OF_WEEK) == 1 || cal.get(Calendar.DAY_OF_WEEK) == 7) {
    dtfDateCate :+ (arrDateEsti(a), "1")
  } else {
    dtfDateCate :+ (arrDateEsti(a), "0")
  }
};
scala> dtfDateCate
res20: Seq[(String, String)] = List((0000,0))
It returns the same initial sequence. But if I run it on a single element, it works. What went wrong?
scala> val dayDate:Date = dateFormat.parse(arrDateEsti(0));
dayDate: java.util.Date = Thu Oct 15 00:00:00 CST 2020
scala> cal.setTime(dayDate);
scala> if (cal.get(Calendar.DAY_OF_WEEK)==1 || cal.get(Calendar.DAY_OF_WEEK)==7){
| dtfDateCate:+(arrDateEsti(0),"1")
| }else{
| dtfDateCate:+(arrDateEsti(0),"0")
| };
res14: Seq[(String, String)] = List((0000,0), (20201015,0))

I think this gets at what you're trying to do.
import java.time.LocalDate
import java.time.DayOfWeek.{SATURDAY, SUNDAY}
import java.time.format.DateTimeFormatter

// replace with the dtfBaseNonLong.select(... code
val arrDateEsti = Seq("20201015", "20201017")  // placeholder
val dtFormat = DateTimeFormatter.ofPattern("yyyyMMdd")

val dtfDateCate = ("0000", "0") +:
  arrDateEsti.map { dt =>
    val day = LocalDate.parse(dt, dtFormat).getDayOfWeek()
    if (day == SATURDAY || day == SUNDAY) (dt, "1")
    else (dt, "0")
  }
// dtfDateCate = Seq((0000,0), (20201015,0), (20201017,1))

Yeah, it should be dtfDateCate = dtfDateCate :+ (arrDateEsti(a), "1"); the :+ operator returns a new Seq rather than modifying the existing one, so the result has to be assigned back to the var.
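For completeness, here is a sketch of the corrected loop. Note that dateFormat is never shown in the question, so a yyyyMMdd SimpleDateFormat is assumed here:

import java.text.SimpleDateFormat
import java.util.{Calendar, Date, GregorianCalendar}

// Assumption: the question never defines dateFormat, so a yyyyMMdd parser is used
val dateFormat = new SimpleDateFormat("yyyyMMdd")

var dtfDateCate = Seq(("0000", "0"))
for (a <- arrDateEsti.indices) {
  val dayDate: Date = dateFormat.parse(arrDateEsti(a))
  val cal = new GregorianCalendar
  cal.setTime(dayDate)
  val isWeekend =
    cal.get(Calendar.DAY_OF_WEEK) == Calendar.SUNDAY ||
    cal.get(Calendar.DAY_OF_WEEK) == Calendar.SATURDAY
  // Reassign: :+ builds and returns a new Seq, it never modifies dtfDateCate in place
  dtfDateCate = dtfDateCate :+ (arrDateEsti(a), if (isWeekend) "1" else "0")
}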

Related

Create two rows based on a date time column in spark using scala

I have a DF with columns session_start and session_stop. I need to create another row if the start and stop fall on different dates.
For example, we have a df as:

session_start        | session_stop
01-05-2021 23:11:40  | 02-05-2021 02:13:25

So the new output df should break this into two rows like:

session_start        | session_stop
01-05-2021 23:11:40  | 01-05-2021 23:59:59
02-05-2021 00:00:00  | 02-05-2021 02:13:25

All other columns should remain the same in both rows.
You can use a flatMap operation on your DF.
The function you use in the flatMap will produce either one or two rows.
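For reference, a minimal sketch of that flatMap approach (assuming a session spans at most two calendar days, as in the example, and a df with just these two timestamp columns; extend the case class to carry the other columns through):

import java.sql.Timestamp
import spark.implicits._

case class Session(session_start: Timestamp, session_stop: Timestamp)

val split = df.as[Session].flatMap { s =>
  val startDay = s.session_start.toLocalDateTime.toLocalDate
  val stopDay = s.session_stop.toLocalDateTime.toLocalDate
  if (startDay == stopDay) {
    Seq(s)  // same calendar day: keep the row as-is
  } else {
    // split into "start .. 23:59:59" and "00:00:00 .. stop"
    Seq(
      Session(s.session_start, Timestamp.valueOf(startDay.atTime(23, 59, 59))),
      Session(Timestamp.valueOf(stopDay.atStartOfDay), s.session_stop)
    )
  }
}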
I did it without a flatMap. I created a UDF, generateOverlappedSessionsFromTimestampRanges, which does the conversion, and used it as below:
// UDF
import java.sql.Timestamp
import java.time.temporal.ChronoUnit
import java.time.LocalDateTime
import org.apache.spark.sql.functions.udf

val generateOverlappedSessionsFromTimestampRanges = udf { (localStartTimestamp: Timestamp, localEndTimestamp: Timestamp) =>
  val localStartLdt = localStartTimestamp.toLocalDateTime
  val localEndLdt = localEndTimestamp.toLocalDateTime
  var output: List[(Timestamp, Timestamp)] = List()
  if (localStartLdt.toLocalDate().until(localEndLdt.toLocalDate(), ChronoUnit.DAYS) > 0) {
    val newLocalEndLdt = LocalDateTime.of(localStartLdt.getYear(), localStartLdt.getMonth(), localStartLdt.getDayOfMonth(), 23, 59, 59)
    val newLocalStartLdt = LocalDateTime.of(localEndLdt.getYear(), localEndLdt.getMonth(), localEndLdt.getDayOfMonth(), 0, 0, 0)
    output = output :+ (Timestamp.valueOf(localStartLdt), Timestamp.valueOf(newLocalEndLdt))
    output = output :+ (Timestamp.valueOf(newLocalStartLdt), Timestamp.valueOf(localEndLdt))
  } else {
    output = output :+ (Timestamp.valueOf(localStartLdt), Timestamp.valueOf(localEndLdt))
  }
  output
}
// Unit test case for the above UDF
import org.apache.spark.sql.functions._
import java.sql.Timestamp
import org.apache.spark.sql.types.TimestampType

val timestamps: Seq[(Timestamp, Timestamp)] = Seq(
  (Timestamp.valueOf("2020-02-10 22:07:25.000"),
   Timestamp.valueOf("2020-02-11 02:07:25.000"))
)
val timestampsDf = timestamps.toDF("local_session_start_timestamp", "local_session_stop_timestamp")

var output = timestampsDf.withColumn("to_be_explode",
  generateOverlappedSessionsFromTimestampRanges(
    timestampsDf("local_session_start_timestamp"),
    timestampsDf("local_session_stop_timestamp")
  ))
output = output.withColumn("exploded_session_time", explode(col("to_be_explode")))
  .withColumn("new_local_session_start", col("exploded_session_time._1"))
  .withColumn("new_local_session_stop", col("exploded_session_time._2"))
  .drop("to_be_explode", "exploded_session_time")
display(output)

And applied to the original df:

df.withColumn("to_be_explode", generateOverlappedSessionsFromTimestampRanges(df("session_start"), df("session_stop")))
  .withColumn("exploded_session_time", explode(col("to_be_explode")))
  .withColumn("session_start", col("exploded_session_time._1"))
  .withColumn("session_stop", col("exploded_session_time._2"))
  .drop("to_be_explode", "exploded_session_time")

Find dates between two dates scala

I am trying to extract dates between two dates.
If my input is:
start date: 2020_04_02
end date: 2020_06_02
Output should be:
List("2020_04_02","2020_04_03", "2020_04_04", "2020_04_05", "2020_04_06")
So far I have tried:

val beginDate = LocalDate.parse(startDate, formatter)
val lastDate = LocalDate.parse(endDate, formatter)
beginDate.datesUntil(lastDate.plusDays(1))
  .iterator()
  .asScala
  .map(date => formatter.format(date))
  .toList

import java.time.format.DateTimeFormatter
private def formatter = DateTimeFormatter.ofPattern("yyyy_MM_dd")

But I think it could be done in an even more refined way.
I'm not super proud of this, but I've done it this way before:
import java.time.LocalDate
val start = LocalDate.of(2020,1,1).toEpochDay
val end = LocalDate.of(2020,12,31).toEpochDay
val dates = (start to end).map(LocalDate.ofEpochDay(_).toString).toArray
You end up with:
dates: Array[String] = Array(2020-01-01, 2020-01-02, ..., 2020-12-31)
What you are doing is correct, but your end date does not match what you are expecting:
import java.time.format.DateTimeFormatter
import java.time.LocalDate
import scala.jdk.CollectionConverters._
private def formatter = DateTimeFormatter.ofPattern("yyyy_MM_dd")
val startDate = "2020_04_02"
val endDate1 = "2020_04_06" // "2020_06_02"
val endDate2 = "2020_06_02"
val beginDate = LocalDate.parse(startDate, formatter)
val lastDate1 = LocalDate.parse(endDate1, formatter)
val lastDate2 = LocalDate.parse(endDate2, formatter)
val res1 = beginDate.datesUntil(lastDate1.plusDays(1)).iterator().asScala.map(date => formatter.format(date)).toList
val res2 = beginDate.datesUntil(lastDate2.plusDays(1)).iterator().asScala.map(date => formatter.format(date)).toList
println(res1) // List(2020_04_02, 2020_04_03, 2020_04_04, 2020_04_05, 2020_04_06)
println(res2) // List(2020_04_02, 2020_04_03, 2020_04_04, 2020_04_05, 2020_04_06 ... 2020_05_31, 2020_06_01, 2020_06_02)
This function will return List[String].
import java.time.format.DateTimeFormatter
import java.time.LocalDate
val dt1 = "2020_04_02"
val dt2 = "2020_04_06"
def DatesBetween(startDate: String, endDate: String): List[String] = {
  def ConvertToFormat = DateTimeFormatter.ofPattern("yyyy_MM_dd")
  val sdate = LocalDate.parse(startDate, ConvertToFormat)
  val edate = LocalDate.parse(endDate, ConvertToFormat)
  val DateRange = sdate.toEpochDay.until(edate.plusDays(1).toEpochDay).map(LocalDate.ofEpochDay).toList
  val ListofDateRange = DateRange.map(date => ConvertToFormat.format(date)).toList
  ListofDateRange
}
println(DatesBetween(dt1,dt2))

Spark - Find the range of all year weeks between 2 weeks

I need to find all the year weeks between two given weeks.
201824 is an example of a year week; it means the 24th week of the year 2018.
Assuming that there are 52 weeks in a year, the year weeks of 2018 start with 201801 and end with 201852. After that, it continues with 201901.
I was able to find the range of all year weeks between two weeks when the start week and the end week are in the same year, like below:
val range = udf((i: Int, j: Int) => (i to j).toArray)
The above code only works when the start week and the end week are in the same year, for example 201912 - 201917.
How do I make it work if the start week and the end week belong to different years?
Example: 201849 - 201903
The above weeks should give the output as:
201849,201850,201851,201852,201901,201902,201903
Well, there are still a lot of optimizations to do, but for the general direction you could use the following.
I am using org.joda.time here, but java.time should also fit.
import org.joda.time.Weeks
import org.joda.time.format.DateTimeFormat

def rangeOfYearWeeks(weeksRange: String): Array[String] = {
  try {
    val left = weeksRange.split("-")(0).trim
    val right = weeksRange.split("-")(1).trim
    val leftPattern = s"${left.substring(0, 4)}-${left.substring(4)}"
    val rightPattern = s"${right.substring(0, 4)}-${right.substring(4)}"
    val fmt = DateTimeFormat.forPattern("yyyy-w")
    val leftDate = fmt.parseDateTime(leftPattern)
    val rightDate = fmt.parseDateTime(rightPattern)
    //if (leftDate.isAfter(rightDate))
    val weeksBetween = Weeks.weeksBetween(leftDate, rightDate).getWeeks
    val dates = for (one <- 0 to weeksBetween) yield {
      leftDate.plusWeeks(one)
    }
    val result: Array[String] = dates.map(date => fmt.print(date)).map(_.replaceAll("-", "")).toArray
    result
  } catch {
    case e: Exception => Array.empty
  }
}
Example:
val dates = Seq("201849 - 201903", "201912 - 201917").toDF("col")
val weeks = udf((d: String) => rangeOfYearWeeks(d))
dates.select(weeks($"col")).show(false)
+-----------------------------------------------------+
|UDF(col) |
+-----------------------------------------------------+
|[201849, 201850, 201851, 201852, 20181, 20192, 20193]|
|[201912, 201913, 201914, 201915, 201916, 201917] |
+-----------------------------------------------------+
Here's a solution with a UDF that uses the java.time API:
def weeksBetween = udf { (startWk: Int, endWk: Int) =>
  import java.time.LocalDate
  import java.time.format.DateTimeFormatter
  import scala.util.{Try, Success, Failure}

  def formatYW(yw: Int): String = {
    val pattern = "(\\d{4})(\\d+)".r
    s"$yw" match { case pattern(y, w) => s"$y-$w-1" }
  }

  val formatter = DateTimeFormatter.ofPattern("YYYY-w-e")  // week-based year

  Try(
    Iterator.iterate(LocalDate.parse(formatYW(startWk), formatter))(_.plusWeeks(1)).
      takeWhile(_.isBefore(LocalDate.parse(formatYW(endWk), formatter))).
      map { s =>
        val a = s.format(formatter).split("-")
        (a(0) + f"${a(1).toInt}%02d").toInt
      }.
      toList.tail
  ) match {
    case Success(ls) => ls
    case Failure(_) => List.empty[Int]  // return an empty list
  }
}
Testing the UDF:
val df = Seq(
(1, 201849, 201903), (2, 201908, 201916), (3, 201950, 201955)
).toDF("id", "start_wk", "end_wk")
df.withColumn("weeks_between", weeksBetween($"start_wk", $"end_wk")).show(false)
// +---+--------+------+--------------------------------------------------------+
// |id |start_wk|end_wk|weeks_between |
// +---+--------+------+--------------------------------------------------------+
// |1 |201849 |201903|[201850, 201851, 201852, 201901, 201902] |
// |2 |201908 |201916|[201909, 201910, 201911, 201912, 201913, 201914, 201915]|
// |3 |201950 |201955|[] |
// +---+--------+------+--------------------------------------------------------+

Get next week date in Spark Dataframe using scala

I have a DateType input to the function. I would like to exclude Saturday and Sunday and get the next weekday if the input date falls on a weekend; otherwise it should give the next day's date.
Example:
Input: Monday 1/1/2017 output: 1/2/2017 (which is Tuesday)
Input: Saturday 3/4/2017 output: 3/5/2017 (which is Monday)
I have gone through https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/functions.html but I don't see a ready-made function, so I think it will need to be created.
So far I have something that is:
val nextWeekDate = udf {(startDate: DateType) =>
val day= date_format(startDate,'E'
if(day=='Sat' or day=='Sun'){
nextWeekDate = next_day(startDate,'Mon')
}
else{
nextWeekDate = date_add(startDate, 1)
}
}
Need help to get it valid and working.
Using dates as strings:
import java.time.{DayOfWeek, LocalDate}
import java.time.format.DateTimeFormatter

// If that is your date format
object MyFormat {
  val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
}

object MainSample {
  import MyFormat._

  def main(args: Array[String]): Unit = {
    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
    import org.apache.spark.sql.functions._

    implicit val spark: SparkSession =
      SparkSession
        .builder()
        .appName("YourApp")
        .config("spark.master", "local")
        .getOrCreate()

    import spark.implicits._

    val someData = Seq(
      Row(1, "2013-01-30"),
      Row(2, "2012-01-01")
    )
    val schema = List(StructField("id", IntegerType), StructField("date", StringType))

    val sourceDF = spark.createDataFrame(spark.sparkContext.parallelize(someData), StructType(schema))
    sourceDF.show()

    val _udf = udf { (dt: String) =>
      // Parse your date; dt is a string
      val localDate = LocalDate.parse(dt, formatter)

      // Check the week day and add days in each case
      val newDate = if (localDate.getDayOfWeek == DayOfWeek.SATURDAY) {
        localDate.plusDays(2)
      } else if (localDate.getDayOfWeek == DayOfWeek.SUNDAY) {
        localDate.plusDays(1)
      } else {
        localDate.plusDays(1)
      }
      newDate.toString
    }

    sourceDF.withColumn("NewDate", _udf('date)).show()
  }
}
Here's a much simpler answer that's defined in spark-daria:
def nextWeekday(col: Column): Column = {
  val d = dayofweek(col)
  val friday = lit(6)
  val saturday = lit(7)
  when(col.isNull, null)
    .when(d === friday || d === saturday, next_day(col, "Mon"))
    .otherwise(date_add(col, 1))
}
You always want to stick with the native Spark functions whenever possible. This post explains the derivation of this function in greater detail.
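Usage would look something like this (a sketch; the DataFrame and column names are placeholders):

import org.apache.spark.sql.functions._

// Apply the helper to a DateType column; df and "some_date" are illustrative names
val result = df.withColumn("next_weekday", nextWeekday(col("some_date")))
result.show()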

Better way to convert a string field into timestamp in Spark

I have a CSV in which a field is a datetime in a specific format. I cannot import it directly into my Dataframe because it needs to be a timestamp. So I import it as a string and convert it into a Timestamp like this:
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.sql.Row
def getTimestamp(x: Any): Timestamp = {
  val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
  if (x.toString() == "")
    return null
  else {
    val d = format.parse(x.toString())
    val t = new Timestamp(d.getTime())
    return t
  }
}

def convert(row: Row): Row = {
  val d1 = getTimestamp(row(3))
  return Row(row(0), row(1), row(2), d1)
}
Is there a better, more concise way to do this with the Dataframe API or spark-sql? The above method requires creating an RDD and specifying the schema for the Dataframe again.
Spark >= 2.2
Since 2.2 you can provide the format string directly:
import org.apache.spark.sql.functions.to_timestamp
val ts = to_timestamp($"dts", "MM/dd/yyyy HH:mm:ss")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+-------------------+
// |id |dts |ts |
// +---+-------------------+-------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01|
// |2 |#$#### |null |
// +---+-------------------+-------------------+
Spark >= 1.6, < 2.2
You can use the date processing functions introduced in Spark 1.5. Assuming you have the following data:
val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$####")).toDF("id", "dts")
You can use unix_timestamp to parse strings and cast the result to timestamp:
import org.apache.spark.sql.functions.unix_timestamp
val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("timestamp")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+---------------------+
// |id |dts |ts |
// +---+-------------------+---------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01.0|
// |2 |#$#### |null |
// +---+-------------------+---------------------+
As you can see it covers both parsing and error handling. The format string should be compatible with Java SimpleDateFormat.
Spark >= 1.5, < 1.6
You'll have to use something like this:
unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("double").cast("timestamp")
or
(unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss") * 1000).cast("timestamp")
due to SPARK-11724.
Spark < 1.5
You should be able to use these with expr and HiveContext.
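For example, something along these lines (a sketch, untested on those old versions; it assumes the DataFrame was created through a HiveContext so Hive's unix_timestamp UDF resolves, and it mirrors the double-cast workaround above):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$####")).toDF("id", "dts")

df.selectExpr(
  "id",
  "CAST(CAST(unix_timestamp(dts, 'MM/dd/yyyy HH:mm:ss') AS double) AS timestamp) AS ts"
).show()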
I haven't played with Spark SQL yet, but I think this would be more idiomatic Scala (null usage is not considered good practice):
import java.sql.Timestamp
import java.text.SimpleDateFormat
import scala.util.{Try, Success, Failure}

def getTimestamp(s: String): Option[Timestamp] = s match {
  case "" => None
  case _ => {
    val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
    Try(new Timestamp(format.parse(s).getTime)) match {
      case Success(t) => Some(t)
      case Failure(_) => None
    }
  }
}
Please notice I assume you know the Row element types beforehand (if you read it from a CSV file, all of them are String); that's why I use a proper type like String and not Any (everything is a subtype of Any).
It also depends on how you want to handle parsing exceptions. In this case, if a parsing exception occurs, a None is simply returned.
You could use it further on with:
rows.map(row => Row(row(0), row(1), row(2), getTimestamp(row.getString(3))))
I have ISO 8601 timestamps in my dataset and I needed to convert them to "yyyy-MM-dd" format. This is what I did:
import org.joda.time.{DateTime, DateTimeZone}

object DateUtils extends Serializable {
  def dtFromUtcSeconds(seconds: Int): DateTime = new DateTime(seconds * 1000L, DateTimeZone.UTC)
  def dtFromIso8601(isoString: String): DateTime = new DateTime(isoString, DateTimeZone.UTC)
}

sqlContext.udf.register("formatTimeStamp", (isoTimestamp: String) => DateUtils.dtFromIso8601(isoTimestamp).toString("yyyy-MM-dd"))
And you can just use the UDF in your Spark SQL query.
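For example (a sketch; the table and column names are placeholders):

// Register the data as a temp table, then call the registered UDF from SQL
df.registerTempTable("events")
sqlContext.sql("SELECT formatTimeStamp(iso_ts) AS event_date FROM events").show()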
Spark Version: 2.4.4
scala> import org.apache.spark.sql.types.TimestampType
import org.apache.spark.sql.types.TimestampType
scala> val df = Seq("2019-04-01 08:28:00").toDF("ts")
df: org.apache.spark.sql.DataFrame = [ts: string]
scala> val df_mod = df.select($"ts".cast(TimestampType))
df_mod: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> df_mod.printSchema()
root
|-- ts: timestamp (nullable = true)
I would like to move the getTimestamp method you wrote into the RDD's mapPartitions and reuse a GenericMutableRow among the rows in an iterator:
val strRdd = sc.textFile("hdfs://path/to/cvs-file")
val rowRdd: RDD[Row] = strRdd.map(_.split('\t')).mapPartitions { iter =>
  new Iterator[Row] {
    val row = new GenericMutableRow(4)
    var current: Array[String] = _
    def hasNext = iter.hasNext
    def next() = {
      current = iter.next()
      row(0) = current(0)
      row(1) = current(1)
      row(2) = current(2)
      val ts = getTimestamp(current(3))
      if (ts != null) {
        row.update(3, ts)
      } else {
        row.setNullAt(3)
      }
      row
    }
  }
}
And you should still use the schema to generate a DataFrame:
val df = sqlContext.createDataFrame(rowRdd, tableSchema)
The usage of GenericMutableRow inside an iterator implementation can be found in the Aggregate operator, InMemoryColumnarTableScan, ParquetTableOperations, etc.
I would use https://github.com/databricks/spark-csv
This will infer timestamps for you.
import com.databricks.spark.csv._

val rdd: RDD[String] = sc.textFile("csvfile.csv")
val df: DataFrame = new CsvParser()
  .withDelimiter('|')
  .withInferSchema(true)
  .withParseMode("DROPMALFORMED")
  .csvRdd(sqlContext, rdd)
I had some issues with to_timestamp where it was returning an empty string. After a lot of trial and error, I was able to get around it by casting as a timestamp and then casting back as a string. I hope this helps anyone else with the same issue:
df.columns.intersect(cols).foldLeft(df) { (newDf, col) =>
  val conversionFunc = to_timestamp(newDf(col).cast("timestamp"), "MM/dd/yyyy HH:mm:ss").cast("string")
  newDf.withColumn(col, conversionFunc)
}