Aggregating JSON objects in a DataFrame and converting string timestamps to dates - Scala

I have JSON rows that look like the following:
[{"time":"2017-03-23T12:23:05","user":"randomUser","action":"sleeping"}]
[{"time":"2017-03-23T12:24:05","user":"randomUser","action":"sleeping"}]
[{"time":"2017-03-23T12:33:05","user":"randomUser","action":"sleeping"}]
[{"time":"2017-03-23T15:33:05","user":"randomUser2","action":"eating"}]
[{"time":"2017-03-23T15:33:06","user":"randomUser2","action":"eating"}]
So I have two problems. First, the time is stored as a String inside my df; I believe it has to be a date/timestamp type for me to aggregate on it?
Second, I need to aggregate this data into 5-minute intervals.
For example, everything that happens from 2017-03-23T12:20:00 to 2017-03-23T12:24:59 needs to be aggregated and counted under the 2017-03-23T12:20:00 timestamp.
The expected output is:
[{"time":"2017-03-23T12:20:00","user":"randomUser","action":"sleeping","count":2}]
[{"time":"2017-03-23T12:30:00","user":"randomUser","action":"sleeping","count":1}]
[{"time":"2017-03-23T15:30:00","user":"randomUser2","action":"eating","count":2}]
thanks
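The answer below assumes these rows already live in a DataFrame df. A minimal sketch of getting there, assuming the rows sit in a JSON-lines file at a hypothetical path (the path is a placeholder, not from the question):
// Hypothetical path; each line holds a one-element JSON array as shown above,
// and spark.read.json infers the schema, leaving "time" as a string.
val df = spark.read.json("/path/to/events.json")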

You can convert the StringType column into a TimestampType column using a cast. Then you can cast the timestamp to IntegerType to make "rounding" down to the last 5-minute interval easier, and group by that (along with all the other columns):
// importing SparkSession's implicits and the SQL types used below
import spark.implicits._
import org.apache.spark.sql.types.{IntegerType, TimestampType}
// Use casting to convert the String column into a Timestamp:
val withTime = df.withColumn("time", $"time" cast TimestampType)
// calculate the most recent 5-minute boundary (300 seconds) and group by all columns
val result = withTime.withColumn("time", $"time" cast IntegerType)
  .withColumn("time", ($"time" - ($"time" mod (60 * 5))) cast TimestampType)
  .groupBy("time", "user", "action").count()
result.show(truncate = false)
// +---------------------+-----------+--------+-----+
// |time |user |action |count|
// +---------------------+-----------+--------+-----+
// |2017-03-23 12:20:00.0|randomUser |sleeping|2 |
// |2017-03-23 15:30:00.0|randomUser2|eating |2 |
// |2017-03-23 12:30:00.0|randomUser |sleeping|1 |
// +---------------------+-----------+--------+-----+
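As a side note (not part of the original answer): on Spark 2.0+ the built-in window function can do the 5-minute bucketing directly. A sketch, assuming the same df and spark session as above:
import spark.implicits._
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.types.TimestampType
// Bucket events into 5-minute windows and count per user/action;
// window() yields a struct whose "start" field is the rounded-down timestamp.
val byWindow = df
  .withColumn("time", $"time" cast TimestampType)
  .groupBy(window($"time", "5 minutes"), $"user", $"action")
  .count()
  .withColumn("time", $"window.start")
  .drop("window")
byWindow.show(truncate = false)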

Related

How to persist a Spark Date type column into a DB Date column

I have a Dataset containing a Date type column. When I try to store this Dataset to the DB I get:
ERROR: column "processDate" is of type date but expression is of type character varying
which is obviously telling me that I'm trying to store a varchar column into a date column. However, I am using to_date (from sql.functions) to convert processDate from string to Date (which works, I tried it).
Can anyone help?
Make sure that your data is in the expected date format; you can also pass the format explicitly to the to_date() function.
Sample data:
import spark.implicits._
val somedata = Seq(
  (1, "11-11-2019"),
  (2, "11-11-2019")
).toDF("id", "processDate")
somedata.printSchema()
Schema for this sample data is:
somedata: org.apache.spark.sql.DataFrame = [id: int, processDate: string]
You can do something like this:
import org.apache.spark.sql.functions.to_date
val newDF = somedata.withColumn("processDate", to_date('processDate, "MM-dd-yyyy"))
newDF.show()
newDF.printSchema()
Above code will output:
+---+-----------+
| id|processDate|
+---+-----------+
| 1| 2019-11-11|
| 2| 2019-11-11|
+---+-----------+
With the following schema:
newDF: org.apache.spark.sql.DataFrame = [id: int, processDate: date]
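With processDate now a real DateType column, writing it into a SQL date column should succeed. A minimal JDBC write sketch; the URL, table name, and credentials below are placeholders, not from the original question (the error text suggests PostgreSQL):
import java.util.Properties
// Hypothetical connection details -- replace with your own.
val jdbcUrl = "jdbc:postgresql://localhost:5432/mydb"
val props = new Properties()
props.setProperty("user", "myuser")
props.setProperty("password", "mypassword")
// Append the rows; the DateType column maps to a SQL DATE column.
newDF.write.mode("append").jdbc(jdbcUrl, "my_table", props)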

Filtering a DataFrame on date columns comparison

I am trying to filter a DataFrame comparing two date columns using Scala and Spark. Based on the filtered DataFrame there are calculations running on top to calculate new columns.
Simplified, my data frame has the following schema:
|-- received_day: date (nullable = true)
|-- finished: int (nullable = true)
On top of that, I create two new columns, t_start and t_end, which will be used for filtering the DataFrame; they are 20 and 10 days earlier than the original column received_day, respectively:
val dfWithDates = df
  .withColumn("t_end", date_sub(col("received_day"), 10))
  .withColumn("t_start", date_sub(col("received_day"), 20))
I now want to have a new calculated column that indicates for each row of data how many rows of the dataframe are in the t_start to t_end period. I thought I can achieve this the following way:
val dfWithCount = dfWithDates
  .withColumn("cnt", lit(
    dfWithDates.filter(
      $"received_day".lt(col("t_end"))
        && $"received_day".gt(col("t_start"))).count()))
However, this count only returns 0, and I believe the problem is in the argument that I am passing to lt and gt.
From the related question Filtering a spark dataframe based on date, I realized that I need to pass a string value. If I try hard-coded values like lt(lit("2018-12-15")), the filtering works. So I tried casting my columns to StringType:
val dfWithDates = df
  .withColumn("t_end", date_sub(col("received_day"), 10).cast(DataTypes.StringType))
  .withColumn("t_start", date_sub(col("received_day"), 20).cast(DataTypes.StringType))
But the filter still returns an empty DataFrame.
I assume that I am not handling the data types correctly.
I am running on Scala 2.11.0 with Spark 2.0.2.
Yes, you are right. In $"received_day".lt(col("t_end")), each received_day value is compared with the current row's own t_end value, not with the whole dataframe, so the condition never matches and you always get zero as the count.
You can solve this by writing a simple UDF. Here is one way to solve the issue:
Creating sample input dataset:
import org.apache.spark.sql.{Row, SparkSession}
import java.sql.Date
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
  (Date.valueOf("2018-10-12"), 1),
  (Date.valueOf("2018-10-13"), 1),
  (Date.valueOf("2018-09-25"), 1),
  (Date.valueOf("2018-10-14"), 1)).toDF("received_day", "finished")
val dfWithDates = df
  .withColumn("t_start", date_sub(col("received_day"), 20))
  .withColumn("t_end", date_sub(col("received_day"), 10))
dfWithDates.show()
+------------+--------+----------+----------+
|received_day|finished| t_start| t_end|
+------------+--------+----------+----------+
| 2018-10-12| 1|2018-09-22|2018-10-02|
| 2018-10-13| 1|2018-09-23|2018-10-03|
| 2018-09-25| 1|2018-09-05|2018-09-15|
| 2018-10-14| 1|2018-09-24|2018-10-04|
+------------+--------+----------+----------+
Here, for 2018-09-25, the desired count is 3, since that date falls between t_start and t_end for three of the rows.
Generate the output:
val count_udf = udf((received_day: Date) => {
  dfWithDates.filter(col("t_end").gt(s"$received_day") && col("t_start").lt(s"$received_day")).count()
})
val dfWithCount = dfWithDates.withColumn("count",count_udf(col("received_day")))
dfWithCount.show()
+------------+--------+----------+----------+-----+
|received_day|finished| t_start| t_end|count|
+------------+--------+----------+----------+-----+
| 2018-10-12| 1|2018-09-22|2018-10-02| 0|
| 2018-10-13| 1|2018-09-23|2018-10-03| 0|
| 2018-09-25| 1|2018-09-05|2018-09-15| 3|
| 2018-10-14| 1|2018-09-24|2018-10-04| 0|
+------------+--------+----------+----------+-----+
To make the computation faster, I would suggest caching dfWithDates, since the same filter-and-count work is repeated for every row.
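A minimal sketch of that suggestion (only the cache call is new; everything else stays as above):
// Materialize dfWithDates once so the filter/count inside the UDF reads the
// cached data instead of recomputing the whole plan for every input row.
dfWithDates.cache()
dfWithDates.count()  // force the cache to be populated up front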
You can also format a date value as a string with any pattern using java.time's DateTimeFormatter:
import java.time.format.DateTimeFormatter
// `date` here is a java.time value such as a LocalDate
date.format(DateTimeFormatter.ofPattern("yyyy-MM-dd"))

Filter a dataframe based on the string date input in spark scala

I have a table with a column 'date', and the date format is yyyyMMdd. I need to filter this dataframe and return a dataframe with only the rows whose date is greater than an input. For example: return all the rows where date is greater than "20180715". I did the following.
scala> df.groupBy("date").count.show(50,false)
+--------+----------+
|date |count |
+--------+----------+
|20180707|200 |
|20180715|1429586969|
|20180628|1425490080|
|20180716|1429819708|
+--------+----------+
scala> var con = df.filter(to_date(df("date"),"yyyyMMdd").gt(lit("20180715")))
scala> con.count
res4: Long = 0
scala> var con = df.filter(to_date(df("date"),"yyyyMMdd").gt(lit("20170715")))
scala> con.count
res1: Long = 4284896957
When I input the date as "20170715", it counts all the records, whereas if the date is "20180715", the filter condition does not work. What is the correct way to compare with a string date?
Changing the format of the input string passed to the lit function solved this issue.
scala> var con = df.filter(to_date(df("date"),"yyyyMMdd").gt(lit("2018-07-15")))
scala> con.count
res6: Long = 1429819708
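The observed counts suggest that the comparison between the DateType column and the string literal happens on the date's yyyy-MM-dd string form, which is why "2018-07-15" works while "20180715" does not. An alternative sketch (not from the original answer) keeps the input in yyyyMMdd form by parsing both sides as dates:
import org.apache.spark.sql.functions.{lit, to_date}
// Parse both the column and the input string as dates, so the comparison
// is date-to-date rather than string-based.
val con2 = df.filter(to_date(df("date"), "yyyyMMdd").gt(to_date(lit("20180715"), "yyyyMMdd")))
con2.count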

Get date difference from the columns in a dataframe in seconds - Spark Scala

I have a dataframe with two date columns. Now I need to get the difference between them, and the result should be in seconds.
UNIX_TIMESTAMP(SUBSTR(date1, 1, 19)) - UNIX_TIMESTAMP(SUBSTR(date2, 1, 19)) AS delta
I am trying to convert that Hive query into a dataframe query using Scala:
df.select(col("date").substr(1,19)-col("poll_date").substr(1,19))
From here I am not able to convert it into seconds. Can anybody help with this? Thanks in advance.
Using the DataFrame API, you can calculate the date difference in seconds simply by subtracting the unix_timestamp of one column from that of the other:
import org.apache.spark.sql.functions.unix_timestamp
import spark.implicits._
val df = Seq(
  ("2018-03-05 09:00:00", "2018-03-05 09:01:30"),
  ("2018-03-06 08:30:00", "2018-03-08 15:00:15")
).toDF("date1", "date2")
df.withColumn("tsdiff", unix_timestamp($"date2") - unix_timestamp($"date1")).
  show
// +-------------------+-------------------+------+
// | date1| date2|tsdiff|
// +-------------------+-------------------+------+
// |2018-03-05 09:00:00|2018-03-05 09:01:30| 90|
// |2018-03-06 08:30:00|2018-03-08 15:00:15|196215|
// +-------------------+-------------------+------+
You could perform the calculation in Spark SQL as well, if necessary:
df.createOrReplaceTempView("dfview")
spark.sql("""
select date1, date2, (unix_timestamp(date2) - unix_timestamp(date1)) as tsdiff
from dfview
""")

How to add a new column with day of week based on another in dataframe?

I have a field in a data frame currently formatted as a string (mm/dd/yyyy) and I want to create a new column in that data frame with the day-of-week name (e.g. Thursday) for that field. I've imported
import com.github.nscala_time.time.Imports._
but am not sure where to go from here.
Create formatter:
val fmt = DateTimeFormat.forPattern("MM/dd/yyyy")
Parse date:
val dt = fmt.parseDateTime("09/11/2015")
Get a day of the week:
dt.toString("EEEEE")
Wrap it using org.apache.spark.sql.functions.udf and you have a complete solution (a sketch of that wrapping is included at the end of this answer). Still, there is no need for that, since HiveContext already provides all the required UDFs:
val df = sc.parallelize(Seq(
Tuple1("08/11/2015"), Tuple1("09/11/2015"), Tuple1("09/12/2015")
)).toDF("date_string")
df.registerTempTable("df")
sqlContext.sql(
"""SELECT date_string,
from_unixtime(unix_timestamp(date_string,'MM/dd/yyyy'), 'EEEEE') AS dow
FROM df"""
).show
// +-----------+--------+
// |date_string| dow|
// +-----------+--------+
// | 08/11/2015| Tuesday|
// | 09/11/2015| Friday|
// | 09/12/2015|Saturday|
// +-----------+--------+
EDIT:
Since Spark 1.5 you can use from_unixtime, unix_timestamp functions directly:
import org.apache.spark.sql.functions.{from_unixtime, unix_timestamp}
df.select(from_unixtime(
unix_timestamp($"date_string", "MM/dd/yyyy"), "EEEEE").alias("dow"))
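For completeness, a minimal sketch of the "wrap it in a udf" route mentioned earlier, using joda-time directly (which nscala_time wraps); the column name date_string matches the example DataFrame above, and dayOfWeek is a name chosen here, not from the answer:
import org.apache.spark.sql.functions.{col, udf}
import org.joda.time.format.DateTimeFormat
// Build the formatter inside the lambda to avoid closure-serialization concerns.
val dayOfWeek = udf((s: String) =>
  DateTimeFormat.forPattern("MM/dd/yyyy").parseDateTime(s).toString("EEEEE"))
df.withColumn("dow", dayOfWeek(col("date_string"))).show()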