How to print the current time inside foreachRDD in apache spark? - scala

I have a batch interval of 5 seconds. I want to look at the number of RDDs formed in one batch, so I added a print inside foreachRDD to show the time in seconds and count the RDDs every 5 seconds.
textStream.foreachRDD(rdd => {
  println("=======" + TimeUnit.MILLISECONDS.toMinutes(Instant.now.toEpochMilli))
  rdd.foreach(println(_))
})
This prints the same time for every batch (currently with empty input):
=======26461220
=======26461220
=======26461220
=======26461220
The time should change, right?
Q1. How do I print the current time?
Q2. How many RDDs are formed in a DStream?

Q1. How do I print the current time?
You could simply use System.nanoTime() (or System.currentTimeMillis()). The value in your output never changes because TimeUnit.MILLISECONDS.toMinutes truncates the epoch time to whole minutes, so it only ticks over once per minute.
textStream.foreachRDD { rdd =>
  // printed on the driver once per batch
  println(System.nanoTime())
  rdd.foreach(println(_))
}
Q2. How many RDDs are formed in a DStream?
You get one RDD per batch interval. For the DStream API the batch interval is set when you create the StreamingContext (not in the SparkSession configuration). The stream is called a DStream, which is a sequence of individual RDDs, one per batch.
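A minimal sketch of this, assuming a local run and a socket text source (host, port and app name are placeholders): the 5-second batch interval is fixed when the StreamingContext is created, and each batch produces exactly one RDD.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("batch-interval-example").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))   // one RDD every 5 seconds

val textStream = ssc.socketTextStream("localhost", 9999)  // placeholder source

textStream.foreachRDD { rdd =>
  // one call per batch, on the driver
  println(s"batch at ${System.currentTimeMillis()} ms with ${rdd.count()} records")
}

ssc.start()
ssc.awaitTermination()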

Related

Drop duplicates over time window in pyspark

I have a streaming DataFrame in Spark reading from a Kafka topic, and I want to drop duplicates over the past 5 minutes every time a new record is parsed.
I am aware of the dropDuplicates(["uid"]) function; I am just not sure how to check for duplicates over a specific historical time interval.
My understanding is that the following:
df = df.dropDuplicates(["uid"])
either works on the data read in the current (micro-)batch, or else on whatever is currently held in memory.
Is there a way to set the time for this de-duplication, using a "timestamp" column within the data?
Thanks in advance.
You can combine a watermark with dropDuplicates, so de-duplication state is only kept for the watermark period (the question's case would use "5 minutes" instead of "5 seconds"):
df\
    .withWatermark("event_time", "5 seconds")\
    .dropDuplicates(["User", "uid"])\
    .groupBy("User")\
    .count()\
    .writeStream\
    .queryName("pydeduplicated")\
    .format("memory")\
    .outputMode("complete")\
    .start()
For more info you can refer to:
https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html
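For comparison, a minimal Scala sketch of the same idea for the question's 5-minute window (the PySpark API is analogous). The Kafka options and the payload parsing below are placeholders, not the asker's actual setup:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")   // placeholder
  .option("subscribe", "my_topic")                  // placeholder
  .load()
  // placeholder parsing: in practice extract uid from the message payload
  .selectExpr("CAST(value AS STRING) AS uid", "timestamp AS event_time")

val deduped = events
  .withWatermark("event_time", "5 minutes")   // keep roughly 5 minutes of de-duplication state
  .dropDuplicates("uid", "event_time")        // the programming guide includes the event-time column so old state can be dropped

deduped.writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()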

Left outer join not emitting null values when joining two streams in spark structured streaming 2.3.0

A left outer join on two streams is not emitting the null outputs; it just waits for a matching record to arrive on the other stream. I am using a socket stream to test this. In our case, we want to emit records with null values when they don't match on id and/or don't fall within the time-range condition.
Details of the watermarks and intervals are:
val ds1Map = ds1
  .selectExpr("Id AS ds1_Id", "ds1_timestamp")
  .withWatermark("ds1_timestamp", "10 seconds")

val ds2Map = ds2
  .selectExpr("Id AS ds2_Id", "ds2_timestamp")
  .withWatermark("ds2_timestamp", "20 seconds")

val output = ds1Map.join(ds2Map,
  expr("""ds1_Id = ds2_Id AND
          ds2_timestamp >= ds1_timestamp AND
          ds2_timestamp <= ds1_timestamp + interval 1 minutes"""),
  "leftOuter")

val query = output.select("*")
  .writeStream
  .outputMode(OutputMode.Append)
  .format("console")
  .option("checkpointLocation", "./spark-checkpoints/")
  .start()

query.awaitTermination()
Thank you.
This may be due to one of the caveats of the micro-batch implementation, as noted in the Structured Streaming programming guide here: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#semantic-guarantees-of-stream-stream-inner-joins-with-watermarking
In the current implementation in the micro-batch engine, watermarks are advanced at the end of a micro-batch, and the next micro-batch uses the updated watermark to clean up state and output outer results. Since we trigger a micro-batch only when there is new data to be processed, the generation of the outer result may get delayed if there is no new data being received in the stream. In short, if either of the two input streams being joined does not receive data for a while, the outer (both cases, left or right) output may get delayed.
This was the case for me: the null rows were not flushed out until a further batch was triggered some time later.
Hi Jack, and thanks for the response. The question/issue was a year and a half ago, and it took some time to recover what I did last year :).
I ran a stream-to-stream join on two topics, one with more than 10K messages per second, on a Spark cluster with 4.67 TB of total memory and 1614 VCores in total.
The implementation was a simple Structured Streaming stream-to-stream join, as in the official Spark documentation:
// Join with event-time constraints
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """)
)
It ran for a few hours until it hit an OOM.
After investigating, I found the issue was with Spark's state cleanup in HDFSBackedStateStoreProvider, and there is an open Spark JIRA for it:
https://issues.apache.org/jira/browse/SPARK-23682
Memory issue with spark structured streaming
And this is why I went back and implemented the stream-to-stream join with mapWithState in Spark Streaming 2.1.1.
Thx
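For reference, here is a rough sketch (not the poster's actual code) of how a stream-to-stream join can be expressed with mapWithState in the DStream API: union the two sides keyed by id, buffer whichever side arrives first in state, and emit a joined pair when the other side shows up, letting unmatched entries time out. The record types SideA/SideB, the input DStreams dsA/dsB, and the 1-minute timeout are all illustrative assumptions.
import org.apache.spark.streaming.{Minutes, State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// Assumed inputs: dsA: DStream[SideA], dsB: DStream[SideB], e.g. parsed from the two topics.
// mapWithState also requires a checkpoint directory: ssc.checkpoint("...").
case class SideA(id: String, ts: Long)
case class SideB(id: String, ts: Long)

// Tag each record with the side it came from and key both streams by id.
val tagged: DStream[(String, Either[SideA, SideB])] =
  dsA.map(a => (a.id, Left(a): Either[SideA, SideB]))
     .union(dsB.map(b => (b.id, Right(b): Either[SideA, SideB])))

// Keep the first side seen in state; emit a pair when the other side arrives,
// or let the state entry time out if no match shows up within the window.
def joinFunc(id: String,
             incoming: Option[Either[SideA, SideB]],
             state: State[(Option[SideA], Option[SideB])]): Option[(SideA, SideB)] = {
  if (state.isTimingOut()) {
    None                                   // unmatched key expired; emit nothing (or a null-padded row here)
  } else {
    val (a0, b0) = state.getOption().getOrElse((None, None))
    val (a, b) = incoming match {
      case Some(Left(x))  => (Some(x), b0)
      case Some(Right(y)) => (a0, Some(y))
      case None           => (a0, b0)
    }
    (a, b) match {
      case (Some(x), Some(y)) => state.remove(); Some((x, y))   // both sides present: emit and drop state
      case _                  => state.update((a, b)); None     // still waiting for the other side
    }
  }
}

val joined: DStream[(SideA, SideB)] =
  tagged.mapWithState(StateSpec.function(joinFunc _).timeout(Minutes(1)))
        .flatMap(_.toList)
This keeps per-key state bounded by the timeout, which is what the HDFSBackedStateStoreProvider issue above was preventing in the Structured Streaming version.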

Pyspark windows on last 30 days on subset of data

I have a working PySpark windowing function (Spark 2.0) that takes the last 30 days (86400*30 seconds) and counts the number of times each action in column 'a' happens per id. The dataset I am applying this function to has multiple records for every day between '2018-01-01' and '2018-04-01'. Because this is a 30-day look-back, I don't want to apply this function to data that doesn't have a full 30 days to look back on. For convenience, I want to start my counts on Feb 1st. I can't filter out January, because it is needed for February's counts. I know I can just throw a filter on the new dataframe and filter out the data before February, but is there a way to do it without that extra step? It'd be nice not to have to perform those calculations, which could save time.
Here's the code:
from pyspark.sql import Window
from pyspark.sql import functions as F

windowsess = Window.partitionBy("id", "a").orderBy("ts").rangeBetween(-86400 * 30, Window.currentRow)
df4 = df3.withColumn("2h4_ct", F.count(df3.a).over(windowsess))
Mockup of the current dataset (I didn't want to convert the ts column by hand, so I wrote in a substitute for it):
id,a,timestamp,ts
1,soccer,2018-01-01 10:41:00, <unix_timestamp>
1,soccer,2018-01-13 10:40:00, <unix_timestamp>
1,soccer,2018-01-23 10:39:00, <unix_timestamp>
1,soccer,2018-02-01 10:38:00, <unix_timestamp>
1,soccer,2018-02-03 10:37:00, <unix_timestamp>
1,leagueoflegends,2018-02-04 10:36:00, <unix_timestamp>
With my made-up sample data, I want to return the following rows:
1,soccer,2018-02-01 10:38:00, <unix_timestamp>,4
1,soccer,2018-02-03 10:37:00, <unix_timestamp>,5
1,leagueoflegends,2018-02-04 10:36:00, <unix_timestamp>,1
Instead I get this:
1,soccer,2018-01-01 10:41:00, <unix_timestamp>,1
1,soccer,2018-01-13 10:40:00, <unix_timestamp>,2
1,soccer,2018-01-23 10:39:00, <unix_timestamp>,3
1,soccer,2018-02-01 10:38:00, <unix_timestamp>,4
1,soccer,2018-02-03 10:37:00, <unix_timestamp>,5
1,leagueoflegends,2018-02-04 10:36:00, <unix_timestamp>,1
What if you use:
df4 = df3.groupby(['id', 'a', 'timestamp']).count()
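If the extra step turns out to be acceptable, here is a sketch of the filter approach the question already mentions (shown in Scala; the PySpark version is analogous): compute the 30-day rolling count over all the data, then drop the rows that don't have a full 30 days of history. Column names and the df3 input frame are taken from the question; the cutoff date is the question's Feb 1st.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("id", "a").orderBy("ts")
  .rangeBetween(-86400L * 30, Window.currentRow)

val df4 = df3
  .withColumn("2h4_ct", count(col("a")).over(w))
  // The window itself cannot skip the warm-up period, so remove rows that
  // lack a full 30-day look-back after the counts are computed.
  .filter(col("timestamp") >= lit("2018-02-01"))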

Spark Streaming Guarantee Specific Start Window Time

I'm reading data from Kinesis using the Structured Streaming framework; my connection is as follows:
val kinesis = spark
  .readStream
  .format("kinesis")
  .option("streams", streamName)
  .option("endpointUrl", endpointUrl)
  .option("initialPositionInStream", "earliest")
  .option("format", "json")
  .schema(<my-schema>)
  .load
The data comes from several IoT devices, each with a unique id. I need to aggregate the data by this id and by a tumbling window over the timestamp field, as follows:
val aggregateData = kinesis
  .groupBy($"uid", window($"timestamp", "15 minute", "15 minute"))
  .agg(...)
The problem I'm encountering is that I need to guarantee that every window starts at a round time (such as 00:00:00, 00:15:00 and so on). I also need a guarantee that only rows covering full 15-minute windows are output to my sink. What I'm currently doing is:
val query = aggregateData
  .writeStream
  .foreach(postgreSQLWriter)
  .outputMode("update")
  .start()
  .awaitTermination()
where postgreSQLWriter is a StreamWriter I created for inserting each row into a PostgreSQL DBMS. How can I force my windows to be exactly 15 minutes long, with start times at round 15-minute timestamps, for each device's unique id?
Question 1:
To start windows at specific times, Spark's window grouping function takes one more parameter: the offset (the startTime argument).
Specifying it shifts the window start times by that offset.
Example:
dataframe.groupBy($"Column1", window($"TimeStamp", "22 minute", "1 minute", "15 minute"))
The above groups by Column1 and creates windows of 22 minutes' duration, with a sliding interval of 1 minute and an offset of 15 minutes.
For example, the windows start from:
window1: 8:15 (8:00 plus the 15-minute offset) to 8:37 (8:15 plus 22 minutes)
window2: 8:16 (previous window start + 1 minute) to 8:38 (again a 22-minute size)
Question 2:
To push only those windows that cover the full 15 minutes, create a count column that counts the number of events in each window; once it reaches 15, push it wherever you want using a filter.
Calculating the count:
dataframe.groupBy($"Column1", window($"TimeStamp", "22 minute", "1 minute", "15 minute")).agg(count($"Column2").as("count"))
writeStream, filtering to windows whose count is 15 only:
aggregateddata.filter($"count" === 15).writeStream.format(...).outputMode("complete").start()
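Applied to the question's 15-minute case, a hedged sketch (it reuses the kinesis stream and postgreSQLWriter from the question; the count column and aggregations are placeholders): with no offset argument, window() aligns boundaries to round clock times, and a watermark with append output mode is one alternative to the count filter above for making sure a window is only written once it has closed.
import org.apache.spark.sql.functions._

val aggregateData = kinesis
  .withWatermark("timestamp", "15 minutes")                     // lets windows be finalized and state cleaned up
  .groupBy(col("uid"), window(col("timestamp"), "15 minutes"))  // tumbling windows starting at :00, :15, :30, :45
  .agg(count(lit(1)).as("events") /* plus the real aggregations */)

// In append mode a window is emitted only after the watermark passes its end,
// so partial windows never reach the sink.
aggregateData.writeStream
  .foreach(postgreSQLWriter)
  .outputMode("append")
  .start()
  .awaitTermination()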

spark df.write.partitionBy run very slow

I have a DataFrame that takes ~11 GB when saved in Parquet format.
Reading it into a DataFrame and writing it to JSON takes 5 minutes.
When I add partitionBy("day") it takes hours to finish.
I understand that distributing the data into partitions is the costly step.
Is there a way to make it faster? Would sorting the files help?
Example:
Runs in 5 minutes:
df = spark.read.parquet(source_path)
df.write.json(output_path)
Runs for hours:
spark.read.parquet(source_path).createOrReplaceTempView("source_table")
sql="""
select cast(trunc(date,'yyyymmdd') as int) as day, a.*
from source_table a"""
spark.sql(sql).write.partitionBy("day").json(output_path)
Try adding a repartition("day") before the write, like this:
(spark
    .sql(sql)
    .repartition("day")
    .write
    .partitionBy("day")
    .json(output_path))
It should speed up the write: after repartition("day"), each day's rows end up in a single task, so each task writes to only one (or a few) day directories instead of every task writing a small file into every day directory.
Try adding repartition(n) with some number to start with, then increase or decrease it depending on how long the write takes:
(spark
    .sql(sql)
    .repartition(n)   # placeholder: choose a starting partition count and tune it
    .write
    .partitionBy("day")
    .json(output_path))