Hope you can help.
The problem
Stack: Spark 3.2.1, Kafka.
After the first start of a structured streaming query that uses the (flat)mapGroupsWithState function with GroupStateTimeout.ProcessingTimeTimeout(), empty batches are generated every trigger interval (e.g. 10 seconds) when there is no input data. This is fine, and actually great, because processing-time timeouts get processed.
However, after a restart of the same query from the checkpoint, empty batches are not generated every trigger interval when there is no input data. Once data arrives, empty batches are generated again. This is not ok. Here is a business case.
Imagine a processing-time timeout was set to fire at 12:00, and the app went down at 11:59 and was restarted at 12:01 (e.g. for maintenance). If no new data arrived during this period, the timeout would not be processed, meaning the app would hold expired state until new data arrives.
Reproducing the problem
Start the query (bootstrapServer, topic and checkpointLocation are parameters):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, Trigger}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServer)
  .option("subscribe", topic)
  .load()
  .select("value")
  .as[String]
  .groupByKey(v => v)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout()) {
    (k: String, values: Iterator[String], state: GroupState[String]) =>
      // set/refresh a 1-minute processing-time timeout for every group
      state.setTimeoutDuration("1 minute")
      (k, values.size)
  }
  .writeStream
  .outputMode("update")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .option("checkpointLocation", checkpointLocation)
  .format("console")
  .option("truncate", "false")
  .start()

query.awaitTermination()
Ensure that empty batches are generated every trigger interval
-------------------------------------------
Batch: 0
-------------------------------------------
+---+---+
|_1 |_2 |
+---+---+
+---+---+
-------------------------------------------
Batch: 1
-------------------------------------------
+---+---+
|_1 |_2 |
+---+---+
+---+---+
Write a record to the topic
Ensure it is processed
-------------------------------------------
Batch: 2
-------------------------------------------
+----+---+
|_1  |_2 |
+----+---+
|qwer|1  |
+----+---+
Stop the app
Wait for more than 1 minute, i.e. the timeout duration
Start the app again
Wait for more than 10 seconds, i.e. the trigger interval
Ensure that empty batches are not generated
Write another record to the topic
Ensure that the timeout is processed
-------------------------------------------
Batch: 3
-------------------------------------------
+----+---+
|_1  |_2 |
+----+---+
|qwer|0  |
|asdf|1  |
+----+---+
Ensure that empty batches are generated again
-------------------------------------------
Batch: 4
-------------------------------------------
+---+---+
|_1 |_2 |
+---+---+
+---+---+
-------------------------------------------
Batch: 5
-------------------------------------------
+---+---+
|_1 |_2 |
+---+---+
+---+---+
Research
Documentation
Documentation of GroupState - https://spark.apache.org/docs/3.2.1/api/scala/org/apache/spark/sql/streaming/GroupState.html:
With ProcessingTimeTimeout, the timeout duration can be set by calling GroupState.setTimeoutDuration. The timeout will occur when the clock has advanced by the set duration. Guarantees provided by this timeout with a duration of D ms are as follows:
Timeout will never occur before the clock time has advanced by D ms
Timeout will occur eventually when there is a trigger in the query (i.e. after D ms). So there is no strict upper bound on when the timeout would occur. For example, the trigger interval of the query will affect when the timeout actually occurs. If there is no data in the stream (for any group) for a while, then there will not be any trigger and timeout function call will not occur until there is data.
Since the processing time timeout is based on the clock time, it is affected by the variations in the system clock (i.e. time zone changes, clock skew, etc.).
This part seems incorrect:
If there is no data in the stream (for any group) for a while, then there will not be any trigger and timeout function call will not occur until there is data.
But as we've seen earlier, empty batches actually are generated (though not always) when data is absent. Judging by the source code, this part of the documentation was written before the related code was refactored.
Source code
I've also tried to find answers in the Spark source code, but all I found is that the code behaves as we've observed above. If you are curious, the interesting part starts at this line.
Solution (crutch)
For now the workaround is to write some garbage record to the topic on app start, before starting the structured streaming query.
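For reference, a minimal sketch of that crutch, assuming the same bootstrapServer and topic values as in the query above and that spark.implicits._ is in scope: a one-row batch write to the Kafka topic before the streaming query starts, so the first trigger has data and pending timeouts fire again.
// Crutch: write a single dummy record so the restarted query gets a trigger with data.
// Note that the record also creates a "warmup" key in the state, which is the price of this workaround.
Seq("warmup")
  .toDF("value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServer)
  .option("topic", topic)
  .save()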
Desired behavior
The query works the same way on the first start and after a restart, ideally always generating empty batches when there is no input data.
Finally, the question
Do you know any other (maybe less crutchy) ways to trigger such a query after restart?
If not, could you explain the reason behind the described behavior?
And also, would you agree that this behavior seems incorrect and that the desired behavior looks more consistent?
Thank you!
Related
While we are writing into a parquet file using Spark/Scala, datetimes that fall within DST (daylight saving time) are getting shifted back by one hour, for example 2011-09-20 00:00:00.000 becomes 2011-09-19 23:00:00.000 (see the screenshot comparing source and destination).
Source (reading data from): SQL Server
Destination (writing into): AWS S3
code:
val ssdf = spark.read.format("jdbc").
  option("driver", "${ssDriver}").
  option("url", "${ssConnectionString}").
  option("dbtable", "${SCHEMANAME}.${RESULTTABLENAME}").
  option("user", "${ssUsername}").
  option("password", "${ssPassword}").
  load()

ssdf.write.format("parquet").mode("overwrite").option("header", "true").save("s3://targetS3path/")
###########################################################
The code runs fine, but datetimes that fall in DST are delayed by 1 hour; check the screenshot.
I expect the datetime value to match the source: 2011-09-20 00:00:00.000.
Set the JVM timezone. You will need to add extra JVM options for the driver and executors:
val spark = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.config("spark.driver.extraJavaOptions", "-Duser.timezone=GMT")
.config("spark.executor.extraJavaOptions", "-Duser.timezone=GMT")
.config("spark.sql.session.timeZone", "UTC")
.getOrCreate();
I have a streaming data frame in Spark reading from a Kafka topic, and I want to drop duplicates from the past 5 minutes every time a new record is parsed.
I am aware of the dropDuplicates(["uid"]) function; I am just not sure how to check for duplicates over a specific historic time interval.
My understanding is that the following:
df = df.dropDuplicates(["uid"])
either works on the data read over the current (micro)batch or else over "anything" that is currently in memory.
Is there a way to set the time for this de-duplication, using a "timestamp" column within the data?
Thanks in advance.
df\
.withWatermark("event_time", "5 seconds")\
.dropDuplicates(["User", "uid"])\
.groupBy("User")\
.count()\
.writeStream\
.queryName("pydeduplicated")\
.format("memory")\
.outputMode("complete")\
.start()
For more info you can refer to:
https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html
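Adapted to the question's setup, a sketch in Scala might look like the following. It follows the watermark-plus-dropDuplicates pattern from the Structured Streaming guide; the event-time column name timestamp is an assumption, and the watermark bounds how long deduplication state is kept rather than defining an exact 5-minute lookback.
// Sketch: keep roughly 5 minutes of deduplication state, keyed by uid and the event-time column.
// "timestamp" is assumed to be the event-time column in the Kafka data.
val deduped = df
  .withWatermark("timestamp", "5 minutes")
  .dropDuplicates("uid", "timestamp")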
I am moving data from the source into my bucket and need to write a script for data validation. But for the Timestamp data type I face a weird issue: I have two rows containing the same timestamp, [2017-06-08 17:50:02.422437] and [2017-06-08 17:50:02.422]. Because the second one has a different format due to a different file system configuration, Spark considers them different. Is there any way to resolve this problem? Ideally I would ignore this column when doing the data frame comparison.
You can use unix_timestamp and compare that number. For actual date requirements you can use from_unixtime to convert back to your required format. Not sure it is an efficient method on large volumes of data...
sqlContext.sql("Select unix_timestamp('2017-06-08 17:50:02.422'), unix_timestamp('2017-06-08 17:50:02.422437') ").show
+----------+----------+
| _c0| _c1|
+----------+----------+
|1496958602|1496958602|
+----------+----------+
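Applied to a data frame comparison, a minimal sketch could normalize both sides to whole seconds before comparing; the data frame and column names here (sourceDf, targetDf, event_ts) are assumptions, not from the question.
import org.apache.spark.sql.functions.{col, unix_timestamp}

// Normalize the timestamp column to whole seconds on both sides before comparing.
val srcNorm = sourceDf.withColumn("ts_sec", unix_timestamp(col("event_ts"))).drop("event_ts")
val tgtNorm = targetDf.withColumn("ts_sec", unix_timestamp(col("event_ts"))).drop("event_ts")

// Rows present on one side but missing from the other.
val missingInTarget = srcNorm.except(tgtNorm)
val missingInSource = tgtNorm.except(srcNorm)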
A left outer join on two streams is not emitting the null outputs; it is just waiting for the record to be added to the other stream. I am using a socket stream to test this. In our case, we want to emit the records with null values which don't match on id and/or don't fall within the time range condition.
Details of the watermarks and intervals are:
val ds1Map = ds1
.selectExpr("Id AS ds1_Id", "ds1_timestamp")
.withWatermark("ds1_timestamp","10 seconds")
val ds2Map = ds2
.selectExpr("Id AS ds2_Id", "ds2_timestamp")
.withWatermark("ds2_timestamp", "20 seconds")
val output = ds1Map.join( ds2Map,
expr(
""" ds1_Id = ds2_Id AND ds2_timestamp >= ds1_timestamp AND ds2_timestamp <= ds1_timestamp + interval 1 minutes """),
"leftOuter")
val query = output.select("*")
.writeStream
.outputMode(OutputMode.Append)
.format("console")
.option("checkpointLocation", "./spark-checkpoints/")
.start()
query.awaitTermination()
Thank you.
This may be due to one of the caveats of the micro-batch architecture implementation as noted in the developers guide here: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#semantic-guarantees-of-stream-stream-inner-joins-with-watermarking
In the current implementation in the micro-batch engine, watermarks are advanced at the end of a micro-batch, and the next micro-batch uses the updated watermark to clean up state and output outer results. Since we trigger a micro-batch only when there is new data to be processed, the generation of the outer result may get delayed if there is no new data being received in the stream. In short, if any of the two input streams being joined does not receive data for a while, the outer (both cases, left or right) output may get delayed.
This was the case for me: the null data was not getting flushed out until a further batch was triggered some time later.
Hi Jack, and thanks for the response. The question/issue was a year and a half ago, and it took some time to recover what I did last year :).
I ran a stream-to-stream join on two topics, one with more than 10K msg/sec, on a Spark cluster with 4.67 TB of total memory and 1614 vcores in total.
The implementation was a simple Structured Streaming stream-to-stream join, as in the official Spark documentation:
// Join with event-time constraints
impressionsWithWatermark.join(
clicksWithWatermark,
expr("""
clickAdId = impressionAdId AND
clickTime >= impressionTime AND
clickTime <= impressionTime + interval 1 hour
""")
)
It was running for a few hours until OOM.
After investigation, I found out that the issue is about Spark cleaning up state in HDFSBackedStateStoreProvider, and the open Jira in Spark:
https://issues.apache.org/jira/browse/SPARK-23682
Memory issue with spark structured streaming
And this is why I moved back and implemented the stream-to-stream join in Spark Streaming 2.1.1 with mapWithState.
Thx
I'm using Spark Streaming to read data from Kinesis with the Structured Streaming framework; my connection is as follows:
val kinesis = spark
.readStream
.format("kinesis")
.option("streams", streamName)
.option("endpointUrl", endpointUrl)
.option("initialPositionInStream", "earliest")
.option("format", "json")
.schema(<my-schema>)
.load
The data comes from several IoT devices, each with a unique id. I need to aggregate the data by this id and by a tumbling window over the timestamp field, as follows:
val aggregateData = kinesis
.groupBy($"uid", window($"timestamp", "15 minute", "15 minute"))
.agg(...)
The problem I'm encountering is that I need to guarantee that every window starts at a round time (such as 00:00:00, 00:15:00 and so on). I also need a guarantee that only rows covering full 15-minute windows are output to my sink. What I'm currently doing is:
val query = aggregateData
.writeStream
.foreach(postgreSQLWriter)
.outputMode("update")
.start()
.awaitTermination()
where the postgreSQLWriter is a StreamWriter I created for inserting each row into a PostgreSQL database. How can I force my windows to be exactly 15 minutes long and to start at round 15-minute timestamps for each unique device id?
Question 1:
To start windows at specific times, there is one more parameter the Spark window function takes, which is the offset (the startTime argument).
By specifying it, windows will start shifted from the hour by the given offset.
Example:
dataframe.groupBy($"Column1",window($"TimeStamp","22 minute","1 minute","15 minute"))
So the above syntax will group by Column1 and create windows of 22 minutes' duration, with a sliding interval of 1 minute and an offset of 15 minutes.
For example, the windows start from:
window1: 8:15 (8:00 plus the 15-minute offset) to 8:37 (8:15 plus 22 minutes)
window2: 8:16 (previous window start + 1 minute) to 8:38 (22-minute size again)
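For the original 15-minute tumbling-window requirement, note that Spark aligns windows to the Unix epoch (plus the offset, if given), so a plain 15-minute tumbling window already starts at round quarter-hours. A minimal sketch of that, with a placeholder count as the aggregation:
import org.apache.spark.sql.functions.{col, count, lit, window}

// Tumbling 15-minute windows aligned to 00:00, 00:15, 00:30, 00:45, grouped per device id.
val aggregated = kinesis
  .groupBy(col("uid"), window(col("timestamp"), "15 minutes"))
  .agg(count(lit(1)).as("events"))   // placeholder aggregation; replace with the real ones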
Question 2:
To push only those windows that have the full 15-minute size, create a count column which counts the number of events in that window. Once it reaches 15, push it wherever you want using a filter.
Calculating the count:
dataframe.groupBy($"Column1", window($"TimeStamp", "22 minute", "1 minute", "15 minute")).agg(count($"Column2").as("count"))
writeStream, filtering on count 15 only:
aggregateddata.filter($"count" === 15).writeStream.format(....).outputMode("complete").start()