I have the input in timestamp , based on some condition i need to minus 1 sec or minus 3 months using scala programming
Input:
val date :String = "2017-10-31T23:59:59.000"
Output:
For Minus 1 sec
val lessOneSec = "2017-10-31T23:59:58.000"
For Minus 3 Months
val less3Mon = "2017-07-31T23:59:58.000"
How to convert a string value to Timestamp and do the operations like minus in scala programming ?
I assume you are working with Dataframes, since you have the spark-dataframe tag.
You can use the SQL INTERVAL to reduce the time, but your column should be in timestamp format for that:
df.show(false)
+-----------------------+
|ts |
+-----------------------+
|2017-10-31T23:59:59.000|
+-----------------------+
import org.apache.spark.sql.functions._
df.withColumn("minus1Sec" , date_format($"ts".cast("timestamp") - expr("interval 1 second") , "yyyy-MM-dd'T'HH:mm:ss.SSS") )
.withColumn("minus3Mon" , date_format($"ts".cast("timestamp") - expr("interval 3 month ") , "yyyy-MM-dd'T'HH:mm:ss.SSS") )
.show(false)
+-----------------------+-----------------------+-----------------------+
|ts |minus1Sec |minus3Mon |
+-----------------------+-----------------------+-----------------------+
|2017-10-31T23:59:59.000|2017-10-31T23:59:58.000|2017-07-31T23:59:59.000|
+-----------------------+-----------------------+-----------------------+
Try this below code
val yourDate = "2017-10-31T23:59:59.000"
val formater = DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss.SSS")
val date = LocalDateTime.parse(yourDate, formater)
println(date.minusSeconds(1).toString(formater))
println(date.minusMonths(3).toString(formater))
Output
2017-10-31T23:59:58.000
2017-07-31T23:59:59.000
Look at the jodatime library. it has all the APIs you need to minus seconds or months from a timestamp
http://www.joda.org/joda-time/
sbt dependency
"joda-time" % "joda-time" % "2.9.9"
Related
I have a UDF that creates a timestamp out of 2 field values with date and time. However, the field with time is of seconds format.
So how сan I merge 2 fields of type date and seconds into an hour of a type Unix timestamp?
My current implementation looks like this:
private val unix_epoch = udf[Long, String, String]{ (date, time) =>
deltaDateFormatter.parseDateTime(s"$date $formatted").getSeconds
}
def transform(inputDf: DataFrame): Unit = {
inputDf
.withColumn("event_hour", unix_epoch($"event_date", $"event_time"))
.withColumn("event_ts", from_unixtime($"event_hour").cast(TimestampType))
}
Input data:
event_date,event_time
20170501,87721
20170501,87728
20170501,87721
20170501,87726
Desired output:
event_tmstp, event_hour
2017-05-01 00:22:01,1493598121
2017-05-01 00:22:08,1493598128
2017-05-01 00:22:01,1493598121
2017-05-01 00:22:06,1493598126
Update. data schema:
event_date: string (nullable = true)
event_time: integer (nullable = true)
Cast event_date to a unix timestamp, add the event_time column to get event_hour, and convert back to normal timestamp event_tmstp.
PS I'm not sure why event_time has 86400 seconds (1 day) more. I needed to subtract that to get your expected output.
val df = Seq(
("20170501", 87721),
("20170501", 87728),
("20170501", 87721),
("20170501", 87726)
).toDF("event_date","event_time")
val df2 = df.select(
unix_timestamp(to_date($"event_date", "yyyyMMdd")) + $"event_time" - 86400
).toDF("event_hour").select(
$"event_hour".cast("timestamp").as("event_tmstp"),
$"event_hour"
)
df2.show
+-------------------+----------+
| event_tmstp|event_hour|
+-------------------+----------+
|2017-05-01 00:22:01|1493598121|
|2017-05-01 00:22:08|1493598128|
|2017-05-01 00:22:01|1493598121|
|2017-05-01 00:22:06|1493598126|
+-------------------+----------+
Check below code if this helps without UDF
val df = Seq(
(20170501,87721),
(20170501,87728),
(20170501,87721),
(20170501,87726)
).toDF("date","time")
df
.withColumn("date",
to_date(
unix_timestamp($"date".cast("string"),
"yyyyMMdd"
).cast("timestamp")
)
)
.withColumn(
"event_hour",
unix_timestamp(
concat_ws(
" ",
$"date",
from_unixtime($"time","HH:mm:ss.S")
).cast("timestamp")
)
)
.withColumn(
"event_ts",
from_unixtime($"event_hour")
)
.show(false)
+----------+-----+----------+-------------------+
|date |time |event_hour|event_ts |
+----------+-----+----------+-------------------+
|2017-05-01|87721|1493598121|2017-05-01 00:22:01|
|2017-05-01|87728|1493598128|2017-05-01 00:22:08|
|2017-05-01|87721|1493598121|2017-05-01 00:22:01|
|2017-05-01|87726|1493598126|2017-05-01 00:22:06|
+----------+-----+----------+-------------------+
Hi how's it going? I'm a Python developer trying to learn Spark Scala. My task is to create date range bins, and count the frequency of occurrences in each bin (histogram).
My input dataframe looks something like this
My bin edges are this (in Python):
bins = ["01-01-1990 - 12-31-1999","01-01-2000 - 12-31-2009"]
and the output dataframe I'm looking for is (counts of how many values in original dataframe per bin):
Is there anyone who can guide me on how to do this is spark scala? I'm a bit lost. Thank you.
Are You Looking for A result Like Following:
+------------------------+------------------------+
|01-01-1990 -- 12-31-1999|01-01-2000 -- 12-31-2009|
+------------------------+------------------------+
| 3| null|
| null| 2|
+------------------------+------------------------+
It can be achieved with little bit of spark Sql and pivot function as shown below
check out the left join condition
val binRangeData = Seq(("01-01-1990","12-31-1999"),
("01-01-2000","12-31-2009"))
val binRangeDf = binRangeData.toDF("start_date","end_date")
// binRangeDf.show
val inputDf = Seq((0,"10-12-1992"),
(1,"11-11-1994"),
(2,"07-15-1999"),
(3,"01-20-2001"),
(4,"02-01-2005")).toDF("id","input_date")
// inputDf.show
binRangeDf.createOrReplaceTempView("bin_range")
inputDf.createOrReplaceTempView("input_table")
val countSql = """
SELECT concat(date_format(c.st_dt,'MM-dd-yyyy'),' -- ',date_format(c.end_dt,'MM-dd-yyyy')) as header, c.bin_count
FROM (
(SELECT
b.st_dt, b.end_dt, count(1) as bin_count
FROM
(select to_date(input_date,'MM-dd-yyyy') as date_input , * from input_table) a
left join
(select to_date(start_date,'MM-dd-yyyy') as st_dt, to_date(end_date,'MM-dd-yyyy') as end_dt from bin_range ) b
on
a.date_input >= b.st_dt and a.date_input < b.end_dt
group by 1,2) ) c"""
val countDf = spark.sql(countSql)
countDf.groupBy("bin_count").pivot("header").sum("bin_count").drop("bin_count").show
Although, since you have 2 bin ranges there will be 2 rows generated.
We can achieve this by looking at the date column and determining within which range each record falls.
// First we set up the problem
// Create a format that looks like yours
val dateFormat = java.time.format.DateTimeFormatter.ofPattern("MM-dd-yyyy")
// Get the current local date
val now = java.time.LocalDate.now
// Create a range of 1-10000 and map each to minusDays
// so we can have range of dates going 10000 days back
val dates = (1 to 10000).map(now.minusDays(_).format(dateFormat))
// Create a DataFrame we can work with.
val df = dates.toDF("date")
So far so good. We have date entries to work with, and they are like your format (MM-dd-yyyy).
Next up, we need a function which returns 1 if the date falls within range, and 0 if not. We create a UserDefinedFunction (UDF) from this function so we can apply it to all rows simultaneously across Spark executors.
// We will process each range one at a time, so we'll take it as a string
// and split it accordingly. Then we perform our tests. Using Dates is
// necessary to cater to your format.
import java.text.SimpleDateFormat
def isWithinRange(date: String, binRange: String): Int = {
val format = new SimpleDateFormat("MM-dd-yyyy")
val startDate = format.parse(binRange.split(" - ").head)
val endDate = format.parse(binRange.split(" - ").last)
val testDate = format.parse(date)
if (!(testDate.before(startDate) || testDate.after(endDate))) 1
else 0
}
// We create a udf which uses an anonymous function taking two args and
// simply pass the values to our prepared function
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
def isWithinRangeUdf: UserDefinedFunction =
udf((date: String, binRange: String) => isWithinRange(date, binRange))
Now that we have our UDF setup, we create new columns in our DataFrame and group by the given bins and sum the values over (hence why we made our functions evaluate to an Int)
// We define our bins List
val bins = List("01-01-1990 - 12-31-1999",
"01-01-2000 - 12-31-2009",
"01-01-2010 - 12-31-2020")
// We fold through the bins list, creating a column from each bin at a time,
// enriching the DataFrame with more columns as we go
import org.apache.spark.sql.functions.{col, lit}
val withBinsDf = bins.foldLeft(df){(changingDf, bin) =>
changingDf.withColumn(bin, isWithinRangeUdf(col("date"), lit(bin)))
}
withBinsDf.show(1)
//+----------+-----------------------+-----------------------+-----------------------+
//| date|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//+----------+-----------------------+-----------------------+-----------------------+
//|09-01-2020| 0| 0| 1|
//+----------+-----------------------+-----------------------+-----------------------+
//only showing top 1 row
Finally we select our bin columns and groupBy them and sum.
val binsDf = withBinsDf.select(bins.head, bins.tail:_*)
val sums = bins.map(b => sum(b).as(b)) // keep col name as is
val summedBinsDf = binsDf.groupBy().agg(sums.head, sums.tail:_*)
summedBinsDf.show
//+-----------------------+-----------------------+-----------------------+
//|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//+-----------------------+-----------------------+-----------------------+
//| 2450| 3653| 3897|
//+-----------------------+-----------------------+-----------------------+
2450 + 3653 + 3897 = 10000, so it seems our work was correct.
Perhaps I overdid it and there is a simpler solution, please let me know if you know a better way (especially to handle MM-dd-yyyy dates).
I have a date and want to add and subtract 10 days to it. Start_date and end_date are dynamic variables from one table and will be used to filter another table.
eg.
val start_date = "2018-09-08"
val end_date = "2018-09-15"
I want to use the two dates above in a filter shown below;
myDF.filter($"timestamp".between(date_sub(start_date, 10),date_add(end_date, 10)))
The functions date_add and date_sub only take in columns as an input. How can I add/subtract 10 (this is an arbitrary number) from my dates?
Thanks
Thank you Luis! Your solution worked, for anyone interested the solution looks like;
val start_date = lit("2018-09-08")
val end_date = lit("2018-09-15")
myDF.filter($"timestamp".between(date_sub(start_date, 10),date_add(end_date, 10)))
Another way...If you can create a temp view, then you can access the vals using $ interpolation.
You should make sure the format is of default ones for date/timestamp.
Check this out:
scala> val start_date = "2018-09-08"
start_date: String = 2018-09-08
scala> val end_date = "2018-09-15"
end_date: String = 2018-09-15
scala> val myDF=Seq(("2018-09-08"),("2018-09-15")).toDF("timestamp").withColumn("timestamp",to_timestamp('timestamp))
myDF: org.apache.spark.sql.DataFrame = [timestamp: timestamp]
scala> myDF.show(false)
+-------------------+
|timestamp |
+-------------------+
|2018-09-08 00:00:00|
|2018-09-15 00:00:00|
+-------------------+
scala> myDF.createOrReplaceTempView("ts_table")
scala> spark.sql(s""" select timestamp, date_sub('$start_date',10) as d_sub, date_add('$end_date',10) d_add from ts_table """).show(false)
+-------------------+----------+----------+
|timestamp |d_sub |d_add |
+-------------------+----------+----------+
|2018-09-08 00:00:00|2018-08-29|2018-09-25|
|2018-09-15 00:00:00|2018-08-29|2018-09-25|
+-------------------+----------+----------+
scala>
I need to convert Integer to date(yyyy-mm-dd) format, to calculate number of days.
registryDate
20130826
20130829
20130816
20130925
20130930
20130926
Desired output:
registryDate TodaysDate DaysInBetween
20130826 2018-11-24 1916
20130829 2018-11-24 1913
20130816 2018-11-24 1926
You can cast registryDate to String type, then apply to_date and datediff to compute the difference in days, as shown below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import java.sql.Date
val df = Seq(
20130826, 20130829, 20130816, 20130825
).toDF("registryDate")
df.
withColumn("registryDate2", to_date($"registryDate".cast(StringType), "yyyyMMdd")).
withColumn("todaysDate", lit(Date.valueOf("2018-11-24"))).
withColumn("DaysInBetween", datediff($"todaysDate", $"registryDate2")).
show
// +------------+-------------+----------+-------------+
// |registryDate|registryDate2|todaysDate|DaysInBetween|
// +------------+-------------+----------+-------------+
// | 20130826| 2013-08-26|2018-11-24| 1916|
// | 20130829| 2013-08-29|2018-11-24| 1913|
// | 20130816| 2013-08-16|2018-11-24| 1926|
// | 20130825| 2013-08-25|2018-11-24| 1917|
// +------------+-------------+----------+-------------+
I have two dataframes in Scala:
df1 =
ID start_date_time
1 2016-10-12 11:55:23
2 2016-10-12 12:25:00
3 2016-10-12 16:20:00
and
df2 =
PK start_date
1 2016-10-12
2 2016-10-14
I need to add a new column to df1 that will have value 0 if the following condition fails, otherwise -> 1:
If ID == PK and start_date_time refers to the same year, month and day as start_date.
The result should be this one:
df1 =
ID start_date_time check
1 2016-10-12-11-55-23 1
2 2016-10-12-12-25-00 0
3 2016-10-12-16-20-00 0
How can I do it?
I assume that the logic should be something like this:
df1 = df.withColumn("check", define(df("ID"),df("start_date")))
val define = udf {(id: String,dateString:String) =>
val formatter = new SimpleDateFormat("yyyy-MM-dd")
val date = formatter.format(dateString)
val checks = df2.filter(df2("PK")===ID).filter(df2("start_date_time")===date)
if(checks.collect().length>0) "1" else "0"
}
However, I have doubts regarding how to compare dates, because df1 and df2 have differently formatted dates. How to better implement it?
You can use spark datetime functions to create date columns on both df1 and df2 and then do a left join on df1, df2, here you create an extra constant column check on df2 to indicate if there is a match in the result:
import org.apache.spark.sql.functions.lit
val df1_date = df1.withColumn("date", to_date(df1("start_date_time")))
val df2_date = (df2.withColumn("date", to_date(df2("start_date"))).
withColumn("check", lit(1)).
select($"PK".as("ID"), $"date", $"check"))
df1_date.join(df2_date, Seq("ID", "date"), "left").drop($"date").na.fill(0).show
+---+--------------------+-----+
| ID| start_date_time|check|
+---+--------------------+-----+
| 1|2016-10-12 11:55:...| 1|
| 2|2016-10-12 12:25:...| 0|
| 3|2016-10-12 16:20:...| 0|
+---+--------------------+-----+
I don't have the exact logic I would do something like that:
val df3 = df2.
join(df1,df1("ID") === df2("ID")).
filter( ($"start_date_time").isBefore($"start_date") )
You will need to convert the 2 timestamp to joda time using this: Converting a date string to a DateTime object using Joda Time library
Good luck !