Convert integer into date to count number of days - Scala

I need to convert an Integer column to date (yyyy-mm-dd) format in order to calculate the number of days.
registryDate
20130826
20130829
20130816
20130925
20130930
20130926
Desired output:
registryDate TodaysDate DaysInBetween
20130826 2018-11-24 1916
20130829 2018-11-24 1913
20130816 2018-11-24 1926

You can cast registryDate to String type, then apply to_date and datediff to compute the difference in days, as shown below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import java.sql.Date
import spark.implicits._  // needed for toDF and the $"..." column syntax

val df = Seq(
  20130826, 20130829, 20130816, 20130825
).toDF("registryDate")

df.
  withColumn("registryDate2", to_date($"registryDate".cast(StringType), "yyyyMMdd")).
  withColumn("todaysDate", lit(Date.valueOf("2018-11-24"))).
  withColumn("DaysInBetween", datediff($"todaysDate", $"registryDate2")).
  show
// +------------+-------------+----------+-------------+
// |registryDate|registryDate2|todaysDate|DaysInBetween|
// +------------+-------------+----------+-------------+
// |    20130826|   2013-08-26|2018-11-24|         1916|
// |    20130829|   2013-08-29|2018-11-24|         1913|
// |    20130816|   2013-08-16|2018-11-24|         1926|
// |    20130825|   2013-08-25|2018-11-24|         1917|
// +------------+-------------+----------+-------------+
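If you want the day count relative to whatever date the job actually runs on, rather than a hard-coded literal, current_date() can replace the lit(Date.valueOf(...)) column. A minimal sketch of that variation (not part of the original answer):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
import spark.implicits._

// Same pipeline as above, but DaysInBetween is measured against the session's current date.
df.
  withColumn("registryDate2", to_date($"registryDate".cast(StringType), "yyyyMMdd")).
  withColumn("todaysDate", current_date()).
  withColumn("DaysInBetween", datediff($"todaysDate", $"registryDate2")).
  show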

Related

Convert string (with timestamp) to timestamp in pyspark

I have a dataframe with a string datetime column.
I am converting it to timestamp, but the values are changing.
Following is my code; can anyone help me convert it without the values changing?
from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame(
    data=[("1", "2020-04-06 15:06:16 +00:00")],
    schema=["id", "input_timestamp"])
df.printSchema()

# Convert the timestamp string to TimestampType
df = df.withColumn("timestamp", to_timestamp("input_timestamp"))

# Cast the TimestampType column to string to inspect the stored value
df.withColumn('timestamp_string', to_timestamp('timestamp').cast('string')) \
    .show(truncate=False)
This is the output:
+---+--------------------------+-------------------+-------------------+
|id |input_timestamp           |timestamp          |timestamp_string   |
+---+--------------------------+-------------------+-------------------+
|1  |2020-04-06 15:06:16 +00:00|2020-04-06 08:06:16|2020-04-06 08:06:16|
+---+--------------------------+-------------------+-------------------+
I want to know why the hour is changing from 15 to 8, and how I can prevent it.
I believe to_timestamp is converting the timestamp value to your local time zone, since you have +00:00 in your data.
Try passing an explicit format to the to_timestamp() function.
Example:
from pyspark.sql.functions import to_timestamp, col

df.withColumn("timestamp", to_timestamp(col("input_timestamp"), "yyyy-MM-dd HH:mm:ss +00:00")).show(10, False)
#+---+--------------------------+-------------------+
#|id |input_timestamp           |timestamp          |
#+---+--------------------------+-------------------+
#|1  |2020-04-06 15:06:16 +00:00|2020-04-06 15:06:16|
#+---+--------------------------+-------------------+
Alternatively, you can normalize the value with to_utc_timestamp:

from pyspark.sql.functions import to_utc_timestamp

df = spark.createDataFrame(
    data=[('1', '2020-04-06 15:06:16 +00:00')],
    schema=['id', 'input_timestamp'])
df.printSchema()

df = df.withColumn('timestamp', to_utc_timestamp('input_timestamp', your_local_timezone))
df.withColumn('timestamp_string', df.timestamp.cast('string')).show(truncate=False)
Replace your_local_timezone with the actual value.

How to change date format in Spark?

I have the following DataFrame:
+----------+-------------------+
| timestamp|            created|
+----------+-------------------+
|1519858893|2018-03-01 00:01:33|
|1519858950|2018-03-01 00:02:30|
|1519859900|2018-03-01 00:18:20|
|1519859900|2018-03-01 00:18:20|
+----------+-------------------+
How do I create the timestamp correctly?
I was able to create a timestamp column, which is an epoch timestamp, but the dates do not coincide:
df.withColumn("timestamp",unix_timestamp($"created"))
For example, 1519858893 points to 2018-02-28.
Just use the date_format and to_utc_timestamp built-in functions:
import org.apache.spark.sql.functions._
df.withColumn("timestamp", to_utc_timestamp(date_format(col("created"), "yyyy-MM-dd"), "Asia/Kathmandu"))
Try the code below:
import org.apache.spark.sql.types.DateType

df.withColumn("dateColumn", df("timestamp").cast(DateType))
You can check one solution here: https://stackoverflow.com/a/46595413
To elaborate on that for a dataframe whose string column holds several different timestamp/date formats, you can do this:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DateType
import spark.implicits._

val df = spark.sparkContext.parallelize(Seq("2020-04-21 10:43:12.000Z", "20-04-2019 10:34:12", "11-30-2019 10:34:12", "2020-05-21 21:32:43", "20-04-2019", "2020-04-21")).toDF("ts")

// Try each candidate format in order; coalesce keeps the first one that parses.
def strToDate(col: Column): Column = {
  val formats: Seq[String] = Seq("dd-MM-yyyy HH:mm:ss", "yyyy-MM-dd HH:mm:ss", "dd-MM-yyyy", "yyyy-MM-dd")
  coalesce(formats.map(f => to_timestamp(col, f).cast(DateType)): _*)
}

val formattedDF = df.withColumn("dt", strToDate(df.col("ts")))
formattedDF.show()
+--------------------+----------+
|                  ts|        dt|
+--------------------+----------+
|2020-04-21 10:43:...|2020-04-21|
| 20-04-2019 10:34:12|2019-04-20|
| 2020-05-21 21:32:43|2020-05-21|
|          20-04-2019|2019-04-20|
|          2020-04-21|2020-04-21|
+--------------------+----------+
Note: this code assumes the data does not contain values in the MM-dd-yyyy or MM-dd-yyyy HH:mm:ss formats, since those are ambiguous with the dd-MM-yyyy patterns tried first and coalesce keeps whichever format parses first.

Scala Operation on TimeStamp values

I have the input as a timestamp; based on some condition I need to subtract 1 second or 3 months using Scala.
Input:
val date :String = "2017-10-31T23:59:59.000"
Output:
For Minus 1 sec
val lessOneSec = "2017-10-31T23:59:58.000"
For Minus 3 Months
val less3Mon = "2017-07-31T23:59:59.000"
How do I convert a string value to a Timestamp and do operations like this subtraction in Scala?
I assume you are working with DataFrames, since you have the spark-dataframe tag.
You can use a SQL INTERVAL to subtract the time, but your column should be cast to timestamp for that:
df.show(false)
+-----------------------+
|ts                     |
+-----------------------+
|2017-10-31T23:59:59.000|
+-----------------------+
import org.apache.spark.sql.functions._
df.withColumn("minus1Sec" , date_format($"ts".cast("timestamp") - expr("interval 1 second") , "yyyy-MM-dd'T'HH:mm:ss.SSS") )
.withColumn("minus3Mon" , date_format($"ts".cast("timestamp") - expr("interval 3 month ") , "yyyy-MM-dd'T'HH:mm:ss.SSS") )
.show(false)
+-----------------------+-----------------------+-----------------------+
|ts                     |minus1Sec              |minus3Mon              |
+-----------------------+-----------------------+-----------------------+
|2017-10-31T23:59:59.000|2017-10-31T23:59:58.000|2017-07-31T23:59:59.000|
+-----------------------+-----------------------+-----------------------+
Try the code below:

import org.joda.time.LocalDateTime
import org.joda.time.format.DateTimeFormat

val yourDate = "2017-10-31T23:59:59.000"
val formater = DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss.SSS")
val date = LocalDateTime.parse(yourDate, formater)

println(date.minusSeconds(1).toString(formater))
println(date.minusMonths(3).toString(formater))
Output
2017-10-31T23:59:58.000
2017-07-31T23:59:59.000
Have a look at the Joda-Time library; it has all the APIs you need to subtract seconds or months from a timestamp:
http://www.joda.org/joda-time/
sbt dependency:
"joda-time" % "joda-time" % "2.9.9"

Aggregating JSON object in Dataframe and converting string timestamp to date

I got JSON rows that looks like the following
[{"time":"2017-03-23T12:23:05","user":"randomUser","action":"sleeping"}]
[{"time":"2017-03-23T12:24:05","user":"randomUser","action":"sleeping"}]
[{"time":"2017-03-23T12:33:05","user":"randomUser","action":"sleeping"}]
[{"time":"2017-03-23T15:33:05","user":"randomUser2","action":"eating"}]
[{"time":"2017-03-23T15:33:06","user":"randomUser2","action":"eating"}]
So I have 2 problems. First of all, the time is stored as a String inside my df; I believe it has to be a date/timestamp type for me to aggregate on it?
Second of all, I need to aggregate the data by 5-minute intervals;
for example, everything that happens from 2017-03-23T12:20:00 to 2017-03-23T12:24:59 needs to be aggregated and treated as the 2017-03-23T12:20:00 timestamp.
The expected output is
[{"time":"2017-03-23T12:20:00","user":"randomUser","action":"sleeping","count":2}]
[{"time":"2017-03-23T12:30:00","user":"randomUser","action":"sleeping","count":1}]
[{"time":"2017-03-23T15:30:00","user":"randomUser2","action":"eating","count":2}]
thanks
You can convert the StringType column into a TimestampType column using casting. Then you can cast the timestamp into IntegerType to make "rounding" down to the last 5-minute interval easier, and group by that (and all the other columns):
// importing SparkSession's implicits and the column types used below
import spark.implicits._
import org.apache.spark.sql.types.{IntegerType, TimestampType}

// Use casting to convert String into Timestamp:
val withTime = df.withColumn("time", $"time" cast TimestampType)

// calculate the "most recent 5-minute-rounded time" and group by all columns
val result = withTime.withColumn("time", $"time" cast IntegerType)
  .withColumn("time", ($"time" - ($"time" mod (60 * 5))) cast TimestampType)
  .groupBy("time", "user", "action").count()

result.show(truncate = false)
// +---------------------+-----------+--------+-----+
// |time                 |user       |action  |count|
// +---------------------+-----------+--------+-----+
// |2017-03-23 12:20:00.0|randomUser |sleeping|2    |
// |2017-03-23 15:30:00.0|randomUser2|eating  |2    |
// |2017-03-23 12:30:00.0|randomUser |sleeping|1    |
// +---------------------+-----------+--------+-----+
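As a side note, Spark (2.0+) also ships a built-in window function for time-based bucketing, which can express the same 5-minute grouping without the integer arithmetic. A minimal sketch of that alternative, not part of the original answer:

import org.apache.spark.sql.functions.window
import org.apache.spark.sql.types.TimestampType
import spark.implicits._

// window() assigns each timestamp to a [start, end) bucket of the given length;
// taking the bucket's start reproduces the "rounded down to 5 minutes" column.
val viaWindow = df
  .withColumn("time", $"time" cast TimestampType)
  .groupBy(window($"time", "5 minutes"), $"user", $"action")
  .count()
  .withColumn("time", $"window.start")
  .drop("window")

viaWindow.show(truncate = false)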

How to add a new column with day of week based on another in dataframe?

I have a field in a data frame currently formatted as a string (mm/dd/yyyy), and I want to create a new column in that data frame with the day-of-week name (e.g. Thursday) for that field. I've imported
import com.github.nscala_time.time.Imports._
but am not sure where to go from here.
Create formatter:
val fmt = DateTimeFormat.forPattern("MM/dd/yyyy")
Parse date:
val dt = fmt.parseDateTime("09/11/2015")
Get a day of the week:
dt.toString("EEEEE")
Wrap it using org.apache.spark.sql.functions.udf and you have a complete solution (see the sketch after the example below). Still, there is no need for that, since HiveContext already provides all the required UDFs:
import sqlContext.implicits._  // for toDF

val df = sc.parallelize(Seq(
  Tuple1("08/11/2015"), Tuple1("09/11/2015"), Tuple1("09/12/2015")
)).toDF("date_string")

df.registerTempTable("df")

sqlContext.sql(
  """SELECT date_string,
            from_unixtime(unix_timestamp(date_string, 'MM/dd/yyyy'), 'EEEEE') AS dow
     FROM df"""
).show
// +-----------+--------+
// |date_string|     dow|
// +-----------+--------+
// | 08/11/2015| Tuesday|
// | 09/11/2015|  Friday|
// | 09/12/2015|Saturday|
// +-----------+--------+
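For completeness, here is a minimal sketch of the UDF wrapping mentioned above, reusing the joda-time DateTimeFormat from the earlier snippet; the dayOfWeek name is just an illustration, not part of the original answer:

import org.apache.spark.sql.functions.udf
import org.joda.time.format.DateTimeFormat

// Illustrative UDF: parse an MM/dd/yyyy string and return the full day-of-week name.
val dayOfWeek = udf { (s: String) =>
  DateTimeFormat.forPattern("MM/dd/yyyy").parseDateTime(s).toString("EEEEE")
}

df.withColumn("dow", dayOfWeek($"date_string")).show()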
EDIT:
Since Spark 1.5 you can use the from_unixtime and unix_timestamp functions directly:
import org.apache.spark.sql.functions.{from_unixtime, unix_timestamp}
df.select(from_unixtime(
  unix_timestamp($"date_string", "MM/dd/yyyy"), "EEEEE").alias("dow"))