Window operations on date column in Spark Structured Streaming - scala

I am attempting to group data that fits into a specified window period using Spark Structured Streaming.
val profiles = rawProfiles.select("*")
  .groupBy(window($"date", "10 minutes", "5 minutes").alias("date"), $"id", $"name")
  .agg(sum("value").alias("value"))
  .join(url.value, Seq("url"), "left")
  .where("value > 20")
  .as[profileRecord]
The format of the date from the rawProfiles is a string like this:
2017-07-20 18:27:45
What is returned for the date column after the window aggregation is something like this:
[0,554c749fb8a00,554c76dbed000]
I'm not really sure what to do with that. Does anyone have any ideas?

You can convert your date field to a timestamp first (a plain cast keeps the time of day, which the 10-minute window needs), for example:
rawProfiles
  .select(<your other fields>, unix_timestamp($"date").cast(DataTypes.TimestampType).as("date"))
  .groupBy(window($"date", "10 minutes", "5 minutes").alias("date"), $"id", $"name")
  .agg(sum("value").alias("value"))
  .join(url.value, Seq("url"), "left")
  .where("value > 20")
  .as[profileRecord]
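For reference, the odd-looking [0,554c749fb8a00,554c76dbed000] value appears to be the window struct itself (a pair of start/end timestamps rendered in their internal form). If you want plain timestamps back, you can expand the struct fields; a minimal sketch, assuming aggregated (a name I made up) is the frame right after the .agg(...) step and before the .as[profileRecord] cast:
import org.apache.spark.sql.functions.col

// "date" here is the struct produced by window(); pull out its bounds as ordinary columns
val withBounds = aggregated.select(
  col("date.start").alias("window_start"),  // window start timestamp
  col("date.end").alias("window_end"),      // window end timestamp
  col("id"), col("name"), col("value"))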

Related

Count the duration between 2 events in Scala (streaming)

I have a list of events and I need to compute the time duration between two consecutive events. The previous event can be in another window. How can I do this? I am using Scala and streaming (mini-batches). I wrote the following but got this error: "Non-time-based windows are not supported on streaming DataFrames/Datasets;"
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val ttt = tester
  .withWatermark("time", "1 minutes")
  .groupBy(
    window($"time", "1 minutes", "10 seconds"),
    $"nodeId"
  )
  .count()
  .withColumn("WindowSize",
    (col("window.end").cast("Long") - col("window.start").cast("Long")) / 60)
  .withColumn("prev_time",
    lead(col("window.start"), 1)
      .over(Window.partitionBy("nodeId").orderBy("window.start")))  // this analytic window is what triggers the error
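The error comes from the Window.partitionBy(...) analytic function, which is not supported on streaming Datasets. One possible workaround (not from the original thread) is to keep the last event time per key yourself via arbitrary stateful processing with flatMapGroupsWithState; a minimal sketch, assuming the stream can be typed with the hypothetical Event case class below:
import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import spark.implicits._

// hypothetical shapes for the input stream and the output
case class Event(nodeId: String, time: Timestamp)
case class Gap(nodeId: String, time: Timestamp, secondsSincePrev: Option[Long])

// keep the epoch millis of the last seen event per nodeId in the group state
def gaps(nodeId: String, events: Iterator[Event], state: GroupState[Long]): Iterator[Gap] = {
  var prev: Option[Long] = state.getOption
  val out = events.toSeq.sortBy(_.time.getTime).map { e =>
    val delta = prev.map(p => (e.time.getTime - p) / 1000)  // seconds since the previous event
    prev = Some(e.time.getTime)
    Gap(nodeId, e.time, delta)
  }
  prev.foreach(state.update)
  out.iterator
}

val withGaps = tester.as[Event]
  .groupByKey(_.nodeId)
  .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.NoTimeout)(gaps)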

[Spark][DataFrame] Most elegant way to calculate 1st dayofWeek

I have data in a DataFrame with a DateTime column. Now I wish to find the first day of the week, where first day means MONDAY.
So I have thought of the following ways to achieve this:
With an arithmetic calculation:
import org.apache.spark.sql.functions._
val df1 = Seq((1, "2020-05-12 10:23:45", 5000), (2, "2020-11-11 12:12:12", 2000)).toDF("id", "DateTime", "miliseconds")
val new_df1 = df1.withColumn("week", date_sub(next_day(col("DateTime"), "monday"), 7))
Result: the Monday of the week each DateTime falls in.
Another option is date_trunc, which truncates a timestamp to the start of its week (a Monday):
df1.withColumn("week", date_trunc("week", $"DateTime"))
Creating a UDF that does the same thing is also possible, but that is my least preferred option.
If there are any more methods to achieve this, I would like to see more implementations.
Thanks in advance.
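For reference, a small self-contained sketch comparing both approaches side by side (the week_arith and week_trunc column names are mine); both should return the Monday of each row's week:
import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = Seq(
  (1, "2020-05-12 10:23:45", 5000),  // Tuesday   -> Monday is 2020-05-11
  (2, "2020-11-11 12:12:12", 2000)   // Wednesday -> Monday is 2020-11-09
).toDF("id", "DateTime", "miliseconds")

df1
  .withColumn("week_arith", date_sub(next_day(col("DateTime"), "monday"), 7))  // arithmetic approach
  .withColumn("week_trunc", to_date(date_trunc("week", col("DateTime"))))      // date_trunc approach (Spark 2.3+)
  .show(false)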

How to include the Kafka timestamp value as a column in Spark Structured Streaming?

I am looking for a way to add the Kafka timestamp to my Spark Structured Streaming schema. I have extracted the value field from Kafka and am building a dataframe from it. My issue is that I also need the timestamp field (from Kafka) along with the other columns.
Here is my current code:
val kafkaDatademostr = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "zzzz.xxx.xxx.xxx.com:9002")
  .option("subscribe", "csvstream")
  .load

val interval = kafkaDatademostr.select(col("value").cast("string")).alias("csv")
  .select("csv.*")

val xmlData = interval.selectExpr(
  "split(value,',')[0] as ddd",
  "split(value,',')[1] as DFW",
  "split(value,',')[2] as DTG",
  "split(value,',')[3] as CDF",
  "split(value,',')[4] as DFO",
  "split(value,',')[5] as SAD",
  "split(value,',')[6] as DER",
  "split(value,',')[7] as time_for",
  "split(value,',')[8] as fort")
How can I get the timestamp from kafka and add as columns along with other columns?
Timestamp is included in the source schema. Just add timestamp to your select to carry it along with the value, like below:
val interval = kafkaDatademostr
  .select(col("value").cast("string").alias("value"), col("timestamp"))
On the official Apache Spark web page you can find the guide: Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher).
There you can find information about the schema of the DataFrame that is loaded from Kafka.
Each row from the Kafka source has the following columns:
key - message key
value - message value
topic - name of the topic the message came from
partition - partition the message came from
offset - offset of the message
timestamp - timestamp of the message
timestampType - timestamp type
All of the above columns are available to query.
In your example you use only value, so to get the timestamp you just need to add it to your select statement:
val allFields = kafkaDatademostr.selectExpr(
  "CAST(value AS STRING) AS csv",
  "CAST(key AS STRING) AS key",
  "topic",
  "partition",
  "offset",
  "timestamp",
  "timestampType"
)
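As a hypothetical follow-up (not part of the original answer), once the Kafka timestamp is in the frame it can drive event-time processing, for example a watermark plus a window:
import org.apache.spark.sql.functions.{col, window}

// count messages per topic in 5-minute event-time windows, tolerating 10 minutes of lateness
val counted = allFields
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"), col("topic"))
  .count()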
In my case, Kafka delivered the values in JSON format, which contained the actual data along with the original event time (not the Kafka timestamp). Below is the schema.
val mySchema = StructType(Array(
  StructField("time", LongType),
  StructField("close", DoubleType)
))
In order to use the watermarking feature of Spark Structured Streaming, I had to cast the time field to the timestamp type.
val df1 = df.selectExpr("CAST(value AS STRING)").as[String]
  .select(from_json($"value", mySchema).as("data"))
  .select(col("data.time").cast("timestamp").alias("time"), col("data.close"))
Now you can use the time field for window operations as well as for watermarking.
import spark.implicits._

val windowedData = df1
  .withWatermark("time", "1 minute")
  .groupBy(
    window(col("time"), "1 minute", "30 seconds"),
    $"close"
  )
  .count()
I hope this answer clarifies things.
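As an aside, here is a sketch of how the windowed counts could be emitted; the console sink and update output mode are my choices, not from the original answer:
// stream the aggregation to the console; update mode re-emits a window's count as late data arrives
val query = windowedData.writeStream
  .outputMode("update")
  .format("console")
  .option("truncate", "false")
  .start()

query.awaitTermination()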

Get the date difference between columns in a dataframe in seconds - Spark Scala

I have a dataframe with two date columns. Now I need to get the difference, and the result should be in seconds.
UNIX_TIMESTAMP(SUBSTR(date1, 1, 19)) - UNIX_TIMESTAMP(SUBSTR(date2, 1, 19)) AS delta
I am trying to convert that Hive query into a DataFrame query using Scala:
df.select(col("date").substr(1,19)-col("poll_date").substr(1,19))
From here I am not able to convert it into seconds. Can anybody help with this? Thanks in advance.
Using the DataFrame API, you can calculate the date difference in seconds simply by subtracting the unix_timestamp of one column from the other:
val df = Seq(
  ("2018-03-05 09:00:00", "2018-03-05 09:01:30"),
  ("2018-03-06 08:30:00", "2018-03-08 15:00:15")
).toDF("date1", "date2")

df.withColumn("tsdiff", unix_timestamp($"date2") - unix_timestamp($"date1")).show
// +-------------------+-------------------+------+
// | date1| date2|tsdiff|
// +-------------------+-------------------+------+
// |2018-03-05 09:00:00|2018-03-05 09:01:30| 90|
// |2018-03-06 08:30:00|2018-03-08 15:00:15|196215|
// +-------------------+-------------------+------+
You could perform the calculation in Spark SQL as well, if necessary:
df.createOrReplaceTempView("dfview")
spark.sql("""
select date1, date2, (unix_timestamp(date2) - unix_timestamp(date1)) as tsdiff
from dfview
""")

Timestamp changes format when writing to a csv file - Spark

I am trying to save a dataframe that contains a timestamp column to a csv file.
The problem is that this column changes format once written to the csv file. Here is the code I used:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header", true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest2.csv")
//val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:\\dataSet.csv\\datasetTest.csv")
//convert all columns to numeric values in order to apply aggregation functions
df.columns.map { c => df.withColumn(c, col(c).cast("int")) }
//add a new column including the rounded timestamp
val result2 = df.withColumn("new_time", ((unix_timestamp(col("time")) / 300).cast("long") * 300).cast("timestamp")).drop("time")
val finalresult = result2.groupBy("new_time").agg(result2.drop("new_time").columns.map((_ -> "mean")).toMap).sort("new_time") //agg(avg(all columns..))
finalresult.coalesce(1).write.option("header", true).option("inferSchema", "true").csv("C:/mydata.csv")
When displayed via df.show it shows the correct format, but in the csv file the timestamp is written in a different format.
Use the timestampFormat option to format the timestamp into the one you need:
finalresult.coalesce(1).write.option("header", true).option("timestampFormat", "yyyy-MM-dd HH:mm:ss").csv("C:/mydata.csv")
or
finalresult.coalesce(1).write.format("csv").option("delimiter", "\t").option("header", true).option("timestampFormat", "yyyy-MM-dd HH:mm:ss").option("escape", "\\").save("C:/mydata.csv")
Here is the code snippet that worked for me to modify the CSV output format for timestamps.
I needed a 'T' character in there, and no seconds or microseconds. The timestampFormat option did work for this.
DF.write
  .mode(SaveMode.Overwrite)
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm")
Such as 2017-02-20T06:53
If you substitute a space for 'T' then you get this:
DF.write
  .mode(SaveMode.Overwrite)
  .option("timestampFormat", "yyyy-MM-dd HH:mm")
Such as 2017-02-20 06:53
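Putting it together, a minimal self-contained sketch of writing a timestamp column with an explicit format (the sample data and output path are assumptions, not from the original answers):
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col
import spark.implicits._

val sample = Seq("2017-02-20 06:53:45", "2017-02-20 07:10:02").toDF("raw")
  .withColumn("ts", col("raw").cast("timestamp"))

sample.write
  .mode(SaveMode.Overwrite)
  .option("header", "true")
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm")  // ts values come out like 2017-02-20T06:53
  .csv("C:/output/mydata_formatted")                // hypothetical output directory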