Drop duplicates over time window in pyspark

I have a streaming data frame in spark reading from a kafka topic and I want to drop duplicates for the past 5 minutes every time a new record is parsed.
I am aware of the dropDuplicates(["uid"]) function, I am just not sure how to check for duplicates over a specific historic time interval.
My understanding is that the following:
df = df.dropDuplicates(["uid"])
either works on the data read in the current (micro)batch or else on whatever is currently held in memory.
Is there a way to set the time for this de-duplication, using a "timestamp" column within the data?
Thanks in advance.

You can do this with a watermark: withWatermark on the event-time column tells Spark how late a duplicate record may arrive, so set its duration to the interval you want to de-duplicate over (e.g. "5 minutes" in your case):

df \
    .withWatermark("event_time", "5 seconds") \
    .dropDuplicates(["User", "uid"]) \
    .groupBy("User") \
    .count() \
    .writeStream \
    .queryName("pydeduplicated") \
    .format("memory") \
    .outputMode("complete") \
    .start()

For more info you can refer to
https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html
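For the asker's setup (Kafka source, de-duplicating on uid over the last 5 minutes), a rough PySpark sketch might look like the following; the broker, topic, and the way uid/event_time are extracted are assumptions, not taken from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "events")                      # placeholder topic
    .load())

# Assumption: the record key holds the uid and the Kafka ingestion timestamp is
# usable as event time; normally you would parse both out of `value` instead.
events = raw.select(col("key").cast("string").alias("uid"),
                    col("timestamp").alias("event_time"))

deduped = (events
    .withWatermark("event_time", "5 minutes")   # duplicates may arrive at most 5 minutes late
    .dropDuplicates(["uid", "event_time"]))     # per the streaming guide, include the event-time column
                                                # so old state can be purged; Spark 3.5+ also offers
                                                # dropDuplicatesWithinWatermark(["uid"])

query = (deduped.writeStream
    .format("console")
    .outputMode("append")
    .start())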

Related

Apache Flink - SQL Kafka connector Watermark on event time doesn't pull records

I have a question similar to Apache Flink Tumbling Window delayed result. The difference is that I'm using SQL with the Kafka connector to read records from the topic. I get the records at regular intervals, but somehow I don't get the last few records in the output. For example, the last record in the Kafka topic has timestamp 2020-11-26T13:11:36.605Z, while the last timestamp for an aggregated value is 2020-11-26T12:59:59.999. I don't understand why I'm not getting the aggregation for the last record in the topic. Please help. Here is my code.
sourceSQL = "CREATE TABLE flink_read_kafka (clientId INT, orderId INT, contactTimeStamp TIMESTAMP(3), WATERMARK FOR contactTimeStamp AS contactTimeStamp - INTERVAL '5' SECOND) with (kafka config)";
sinkSQL = "CREATE TABLE flink_aggr_kafka (contactTimeStamp STRING, clientId INT, orderCount BIGINT) with (kafka config)";
aggrSQL = "insert into flink_aggr_kafka SELECT TUMBLE_ROWTIME(contactTimeStamp, INTERVAL '5' MINUTE) as contactTimeStamp, clientId, COUNT(*) orderCount from flink_read_kafka GROUP BY clientId, TUMBLE(contactTimeStamp, INTERVAL '5' MINUTE)";
blinkStreamTableEnv.executeSql(sourceSQL);
blinkStreamTableEnv.executeSql(sinkSQL);
blinkStreamTableEnv.executeSql(aggrSQL);
First, some background: A tumbling window only emits results once the watermark has passed the maximum timestamp of the window. The watermark indicates to the framework that all records with a lower timestamp have arrived, and hence the window is complete and the results can be emitted.
The watermark can only advance based on the timestamps of incoming records, so if no more records arrive, the watermark will not advance and currently open windows will not be closed. It is therefore expected that the last windows remain open once the influx of data stops.
In your example, one would normally assume that the windows with a rowtime of 2020-11-26T13:04:59.999 and 2020-11-26T13:09:59.999 are also emitted, because the latest records should have pushed the watermark beyond these timestamps.
I can think of two reasons right now why this might not be the case:
Not all parallel source instances have seen a timestamp higher than 2020-11-26T13:05:04.999, and hence the output watermark has actually not passed that value. You can test this either by running the job with a parallelism of 1, which would mitigate the problem, or by checking the watermark of the window operator in the Flink Web UI to verify whether this is the case.
If you are using the Kafka producer in exactly-once mode and only consume records that have been committed, the records will only become visible once a checkpoint has been completed after the window has fired.
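To quickly test the first point, one option is to force the whole job down to a parallelism of 1. A minimal sketch using the PyFlink Table API (assumes Flink 1.15+; parallelism.default is a standard Flink configuration key, the rest just reuses the statements above):

from pyflink.table import EnvironmentSettings, TableEnvironment

# Run every operator, including the Kafka source, with a single parallel
# instance so that one idle partition cannot hold the watermark back.
env_settings = EnvironmentSettings.in_streaming_mode()
table_env = TableEnvironment.create(env_settings)
table_env.get_config().set("parallelism.default", "1")

# Then execute the same statements as above:
# table_env.execute_sql(sourceSQL)
# table_env.execute_sql(sinkSQL)
# table_env.execute_sql(aggrSQL)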

Spark avro predicate pushdown

We are using Avro data format and the data is partitioned by year, month, day, hour, min
I see the data stored in HDFS as
/data/year=2018/month=01/day=01/hour=01/min=00/events.avro
And we load the data using
val schema = new Schema.Parser().parse(this.getClass.getResourceAsStream("/schema.txt"))
val df = spark.read.format("com.databricks.spark.avro").option("avroSchema",schema.toString).load("/data")
And then we use predicate pushdown to filter the data:
var x = isInRange(startDate, endDate)($"year", $"month", $"day", $"hour", $"min")
df = tableDf.filter(x)
Can someone explain what is happening behind the scenes?
I specifically want to understand when the filtering of the input files happens, and where.
Interestingly, when I print the schema, the fields year, month, day and hour are automatically added, i.e. the actual data does not contain these columns. Does Avro add these fields?
I want to understand clearly how the files are filtered and how the partitions are created.
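One way to see where the filtering happens yourself (a rough sketch, not tied to the specific Avro reader version above): filter directly on the partition columns and inspect the physical plan. The year/month/day/hour/min columns are added by Spark's partition discovery from the directory names, not by Avro, and filters on them appear as PartitionFilters on the FileScan node, meaning whole directories are pruned before any Avro file is read.

# Rough sketch: partition columns come from the directory names (partition
# discovery), so filtering on them prunes directories before reading files.
df = spark.read.format("avro").load("/data")   # "com.databricks.spark.avro" on older Spark, as in the question
pruned = df.filter("year = 2018 AND month = 1 AND day = 1")
pruned.explain(True)                           # look for PartitionFilters: [...] in the FileScan node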

Left outer join not emitting null values when joining two streams in spark structured streaming 2.3.0

A left outer join on two streams is not emitting the null outputs; it just waits for the record to be added to the other stream. I am using a socket stream to test this. In our case, we want to emit the records with null values which don't match on id and/or don't fall within the time-range condition.
Details of the watermarks and intervals are:
val ds1Map = ds1
  .selectExpr("Id AS ds1_Id", "ds1_timestamp")
  .withWatermark("ds1_timestamp", "10 seconds")

val ds2Map = ds2
  .selectExpr("Id AS ds2_Id", "ds2_timestamp")
  .withWatermark("ds2_timestamp", "20 seconds")

val output = ds1Map.join(ds2Map,
  expr("""ds1_Id = ds2_Id AND ds2_timestamp >= ds1_timestamp AND ds2_timestamp <= ds1_timestamp + interval 1 minutes"""),
  "leftOuter")

val query = output.select("*")
  .writeStream
  .outputMode(OutputMode.Append)
  .format("console")
  .option("checkpointLocation", "./spark-checkpoints/")
  .start()

query.awaitTermination()
Thank you.
This may be due to one of the caveats of the micro-batch architecture implementation as noted in the developers guide here: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#semantic-guarantees-of-stream-stream-inner-joins-with-watermarking
In the current implementation in the micro-batch engine, watermarks are advanced at the end of a micro-batch, and the next micro-batch uses the updated watermark to clean up state and output outer results. Since we trigger a micro-batch only when there is new data to be processed, the generation of the outer result may get delayed if there is no new data being received in the stream. In short, if any of the two input streams being joined does not receive data for a while, the outer (both cases, left or right) output may get delayed.
This was the case for me: the null data was not getting flushed out until a further batch was triggered some time later.
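In other words, for the join in the question, a left-side row with event time T can only be emitted with nulls once the watermark has passed roughly T + 1 minute (the join's upper time bound), and the watermark only advances when a later micro-batch processes new input. A rough PySpark equivalent of the same join, assuming ds1 and ds2 are the two streaming DataFrames from the question:

from pyspark.sql.functions import expr

ds1_map = ds1.selectExpr("Id AS ds1_Id", "ds1_timestamp").withWatermark("ds1_timestamp", "10 seconds")
ds2_map = ds2.selectExpr("Id AS ds2_Id", "ds2_timestamp").withWatermark("ds2_timestamp", "20 seconds")

# Unmatched left rows stay in state until the watermark passes
# ds1_timestamp + 1 minute; only then is the null-padded result produced.
joined = ds1_map.join(
    ds2_map,
    expr("""
        ds1_Id = ds2_Id AND
        ds2_timestamp >= ds1_timestamp AND
        ds2_timestamp <= ds1_timestamp + interval 1 minutes
    """),
    "leftOuter",
)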
Hi Jack, and thanks for the response. The question/issue was a year and a half ago, and it took some time to recover what I did last year :).
I ran a stream-to-stream join on two topics, one with more than 10K msg/sec, and it was running on a Spark cluster with 4.67 TB total memory and 1614 VCores in total.
The implementation was a simple Structured Streaming stream-to-stream join, as in the official Spark documentation:
// Join with event-time constraints
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """)
)
It was running for a few hours until it hit an OOM.
After investigation, I found the issue about Spark cleaning state in HDFSBackedStateStoreProvider and the open JIRA in Spark:
https://issues.apache.org/jira/browse/SPARK-23682
Memory issue with spark structured streaming
And this is why I moved back to Spark Streaming 2.1.1 and implemented the stream-to-stream join with mapWithState.
Thx

Spark Streaming Guarantee Specific Start Window Time

I'm using Spark Streaming to read data from Kinesis using the Structured Streaming framework; my connection is as follows:
val kinesis = spark
.readStream
.format("kinesis")
.option("streams", streamName)
.option("endpointUrl", endpointUrl)
.option("initialPositionInStream", "earliest")
.option("format", "json")
.schema(<my-schema>)
.load
The data comes from several IoT devices, each with a unique id. I need to aggregate the data by this id and by a tumbling window over the timestamp field, as follows:
val aggregateData = kinesis
.groupBy($"uid", window($"timestamp", "15 minute", "15 minute"))
.agg(...)
The problem I'm encountering is that I need to guarantee that every window starts at a round time (such as 00:00:00, 00:15:00 and so on). I also need a guarantee that only rows covering full 15-minute windows are output to my sink. What I'm currently doing is:
val query = aggregateData
.writeStream
.foreach(postgreSQLWriter)
.outputMode("update")
.start()
.awaitTermination()
where this postgreSQLWriter is a StreamWriter I created for inserting each row into a PostgreSQL database. How can I force my windows to be exactly 15 minutes long and to start at round 15-minute timestamps for each unique device id?
Question 1:
To make windows start at specific times, Spark's window grouping function takes one more parameter, the start offset.
Specifying it shifts each window start by that offset past the aligned start time.
Example:
dataframe.groupBy($"Column1", window($"TimeStamp", "22 minute", "1 minute", "15 minute"))
The above groups by Column1 and creates windows of 22-minute duration, with a sliding interval of 1 minute and an offset of 15 minutes.
For example, it starts from:
window1: 8:15 (8:00 plus the 15-minute offset) to 8:37 (8:15 plus 22 minutes)
window2: 8:16 (previous window start + 1 minute) to 8:38 (22-minute size again)
Question 2:
To push only the windows covering a full 15 minutes, create a count column that counts the number of events in each window; once it reaches 15, push it to wherever you want using a filter.
Calculating the count:
dataframe.groupBy($"Column1", window($"TimeStamp", "22 minute", "1 minute", "15 minute")).agg(count($"Column2").as("count"))
Write stream, filtering to windows whose count is 15:
aggregateddata.filter($"count" === 15).writeStream.format(....).outputMode("complete").start()
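Applied to the 15-minute case from the question, a rough PySpark sketch (the kinesis frame and the uid/timestamp columns come from the question; everything else is illustrative): with a 15-minute duration, a 15-minute slide and no offset, the windows are aligned to the Unix epoch and therefore already start at :00, :15, :30 and :45.

from pyspark.sql.functions import col, window

# Tumbling, epoch-aligned 15-minute windows; add a fourth `startTime` argument,
# e.g. window(col("timestamp"), "15 minutes", "15 minutes", "5 minutes"),
# only if you need the starts shifted away from the round times.
aggregated = (kinesis
    .groupBy(col("uid"), window(col("timestamp"), "15 minutes", "15 minutes"))
    .count())   # adds a `count` column, as in the answer above

query = (aggregated
    .filter(col("count") >= 1)   # replace with whatever "complete window" condition you settle on, e.g. the count == 15 filter above
    .writeStream
    .outputMode("update")
    .format("console")           # swap in the PostgreSQL foreach writer from the question
    .start())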

spark df.write.partitionBy run very slow

I have a data frame that when saved as Parquet format takes ~11GB.
When reading to a dataframe and writing to json, it takes 5 minutes.
When I add partitionBy("day") it takes hours to finish.
I understand that distributing the data into partitions is the costly operation.
Is there a way to make it faster? Will sorting the files make it better?
Example:
Runs in 5 minutes:
df = spark.read.parquet(source_path)
df.write.json(output_path)
Runs for hours:
spark.read.parquet(source_path).createOrReplaceTempView("source_table")
sql="""
select cast(trunc(date,'yyyymmdd') as int) as day, a.*
from source_table a"""
spark.sql(sql).write.partitionBy("day").json(output_path)
Try adding a repartition("day") before the write, like this:
spark
.sql(sql)
.repartition("day")
.write
.partitionBy("day")
.json(output_path)
It should speed up your query.
Try adding repartition(any number) to start with, then increase or decrease the number depending on the time it takes to write:
spark
.sql(sql)
.repartition(any number)
.write
.partitionBy("day")
.json(output_path)
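The two suggestions can also be combined, if you want to control the shuffle partition count explicitly while still grouping each day's rows together before the write (the 200 below is only an arbitrary starting point, not a recommendation):

(spark
    .sql(sql)
    .repartition(200, "day")   # partitions by the day column into 200 shuffle partitions; tune the number as above
    .write
    .partitionBy("day")
    .json(output_path))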