I am writing a Spark Streaming application in Scala whose goal is to read the Twitter feed every second and calculate the most retweeted statuses within a window of 60 seconds.
What I conceptually want is to take the retweet count of a status at the end of the sliding window and subtract from it the count at the start of the window, in order to find the number of retweets inside the window. The relevant line of code is:
val counts = tweets.filter(_.isRetweet).map { status =>
  (status.getText(), status.getRetweetedStatus().getRetweetCount())
}.reduceByKeyAndWindow(*function*, Seconds(60), Seconds(1))
So, my question is: what function should I use here to achieve the desired result, i.e. to take the maximum value that getRetweetCount() returns inside the window and subtract the minimum value from it?
Correct me if I'm wrong or making incorrect assumptions here, but you are essentially counting the number of retweets of each status within your Seconds(60) window. To do this, you already have your filter which removes all non-retweet statuses (filter(_.isRetweet)). Now all you need to do is aggregate the retweeted statuses to determine their frequencies.
This can be achieved with the following:
val counts = tweets.filter(_.isRetweet).map { status =>
  (status.getText(), null)
}.countByValueAndWindow(Seconds(60), Seconds(1))
Perhaps after this, you can order by value and gather the most retweeted tweets within that window.
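For example, one way to do that ordering step (a rough, untested sketch, assuming counts is the DStream produced above):
// counts is a DStream of ((text, null), windowCount) pairs here, so swap each
// pair to sort every windowed RDD by count, descending, and print the top 10.
counts
  .map { case ((text, _), count) => (count, text) }
  .transform(_.sortByKey(ascending = false))
  .foreachRDD { rdd =>
    rdd.take(10).foreach { case (count, text) =>
      println(s"$count retweets: $text")
    }
  }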
I'm using a watermark to join two streams, as you can see below:
val order_wm = order_details.withWatermark("tstamp_trans", "20 seconds")
val invoice_wm = invoice_details.withWatermark("tstamp_trans", "20 seconds")
val join_df = order_wm
.join(invoice_wm, order_wm.col("s_order_id") === invoice_wm.col("order_id"))
My understanding of the above code is that it will keep each stream for 20 seconds after it arrives. But when I feed one stream now and the other one 20 seconds later, they still get joined. It seems that even after the watermark has expired, Spark is still holding the data in memory. I even tried with a 45-second gap and the rows still got joined.
This is creating confusion in my mind regarding watermarks.
But when I feed one stream now and the other one 20 seconds later, they still get joined.
That's possible because the time measured is not the wall-clock time at which events arrive, but the event time carried in the watermarked field, i.e. tstamp_trans. You have to make sure that the latest tstamp_trans that has been seen is more than 20 seconds ahead of a row's tstamp_trans before that row can be dropped from the join state.
Quoting the doc from: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#inner-joins-with-optional-watermarking
In other words, you will have to do the following additional steps in the join.
Define watermark delays on both inputs such that the engine knows how delayed the input can be (similar to streaming aggregations)
Define a constraint on event-time across the two inputs such that the engine can figure out when old rows of one input is not going to be required (i.e. will not satisfy the time constraint) for matches with the other input. This constraint can be defined in one of the two ways.
Time range join conditions (e.g. ...JOIN ON leftTime BETWEEN rightTime AND rightTime + INTERVAL 1 HOUR),
Join on event-time windows (e.g. ...JOIN ON leftTimeWindow = rightTimeWindow).
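Applied to the code in the question, a minimal sketch of such a time-range condition could look like the following (the 10-second bound is an arbitrary illustrative choice, not something taken from the question):
import org.apache.spark.sql.functions.expr

// Watermark both sides and add an event-time range constraint, so that Spark
// knows when buffered rows can be dropped from the join state.
val order_wm   = order_details.withWatermark("tstamp_trans", "20 seconds")
val invoice_wm = invoice_details.withWatermark("tstamp_trans", "20 seconds")

val join_df = order_wm.join(
  invoice_wm,
  order_wm.col("s_order_id") === invoice_wm.col("order_id") &&
    invoice_wm.col("tstamp_trans") >= order_wm.col("tstamp_trans") &&
    invoice_wm.col("tstamp_trans") <= order_wm.col("tstamp_trans") + expr("INTERVAL 10 SECONDS")
)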
I was checking out the Spark window functions to count page hits per 30 seconds, but it keeps adding the value of the previous window too.
Suppose at **12:00:30 the count is 10** and at **12:01:00 the count is 10**.
But Spark gives the **output as 20**,
adding the previous window's value. I'm using Kafka with Spark Streaming.
val rs = words.reduceByKeyAndWindow((x, y) => (x._1 + y._1, x._2 + y._2), Durations.seconds(30))
Please help: how can I reset the value so that the window tumbles, the way tumbling windows work in Kafka's KSQL?
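For what it's worth, in Spark Streaming a tumbling (non-overlapping) window is obtained by setting the slide duration equal to the window duration. A rough sketch based on the code above:
// Same reduce function as above; passing the slide duration explicitly and
// making it equal to the window duration gives non-overlapping 30-second
// windows, so each count starts from zero instead of including the previous window.
val rs = words.reduceByKeyAndWindow(
  (x, y) => (x._1 + y._1, x._2 + y._2),
  Durations.seconds(30), // window duration
  Durations.seconds(30)  // slide duration == window duration -> tumbling window
)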
I am tracking the rolling sum of a particular field by using a query which looks something like this:
SELECT id, SUM(quantity) AS quantity from stream \
WINDOW HOPPING (SIZE 1 MINUTE, ADVANCE BY 10 SECONDS) \
GROUP BY id;
Now, for every input tick, it seems to return 6 different aggregated values, which I guess are for the following time periods:
[start, start+60] seconds
[start+10, start+60] seconds
[start+20, start+60] seconds
[start+30, start+60] seconds
[start+40, start+60] seconds
[start+50, start+60] seconds
What if I am interested in only getting the [start, start+60] seconds result for every tick that comes in? Is there any way to get ONLY that?
Because you specify a hopping window, each record falls into multiple windows and all windows need to be updated when processing a record. Updating only one window would be incorrect and the result would be wrong.
Compare the Kafka Streams docs about hopping windows (Kafka Streams is KSQL's internal runtime engine): https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#hopping-time-windows
Update
Kafka Streams is adding proper sliding window support via KIP-450 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-450%3A+Sliding+Window+Aggregations+in+the+DSL). This should allow sliding windows to be added to ksqlDB later, too.
I was in a similar situation, and creating a user-defined function that keeps only the window where collect_list(column).size() equals the window duration appears to be a promising track.
In the UDF, use a List type to receive the list of values of one of your aggregation base columns. Then check whether the size of that list equals the number of periods in the hopping window, and return null otherwise.
From this, create a table that selects the data and transforms it with the UDF.
Then create a table from this latest table and filter out the null values on the transformed column.
My use case is the following. Consider that I have a PySpark dataframe with the following format:
df.columns:
1. hh: Contains the hour of the day (type int)
2. userId: some unique identifier.
What I want to do is figure out the list of userIds which have anomalous hits onto the page. So I first do a groupby like so:
df = df.groupby("hh", "userId").count().alias("LoginCounts")
Now the format of the dataframe would be:
1. hh
2. userId
3. LoginCounts: Number of times a specific user logs in at a particular hour.
I want to use the PySpark KernelDensity (KDE) function as follows:
from pyspark.mllib.stat import KernelDensity
kd = KernelDensity()
kd.setSample(df.select("LoginCounts").rdd)
kd.estimate([13.0, 14.0])
I get the error:
Py4JJavaError: An error occurred while calling o647.estimateKernelDensity.
: org.apache.spark.SparkException: Job aborted due to stage failure
Now my end goal is to fit a KDE on, say, one day's hour-based data and then use the next day's data to get the probability estimates for each login count.
Eg: I would like to achieve something of this nature:
df.withColumn("kdeProbs", kde.estimate(col("LoginCounts")))
So the column kdeProbs will contain P(LoginCount=x | estimated kde).
I have tried searching for an example of the same but am always redirected to the standard kde example on the spark.apache.org page, which does not solve my case.
It's not enough to just select one column and convert it to an RDD: that gives you an RDD of Row objects, whereas KernelDensity.setSample expects an RDD of plain numeric values, so you also need to extract the actual data from each row. Try this:
from pyspark.mllib.stat import KernelDensity
dat_rdd = df.select("LoginCounts").rdd
# extract the numeric value from each Row to get an RDD of floats
dat_rdd_data = dat_rdd.map(lambda x: x[0])
kd = KernelDensity()
kd.setSample(dat_rdd_data)
kd.estimate([13.0,14.0])
I'm developing a Spark Streaming app in Scala, and now that I've finished my first version I want to improve its performance.
My Spark app currently prints some important alerts of mine on every batch, which means I've got new text files being generated every few seconds, whereas I'd prefer the computations to keep running but the writing to file to happen only when my sliding window expires, which is in the range of 10 minutes.
An example of the DStream of interest follows:
val s = events.flatMap(_.split("\n"))  // split block into lines of single json events
  .map(toMyObject)                     // convert raw json to MyObject
  .filter(checkCondition)              // filter events based on condition
  .map(x => (x._1, 1L))                // count alerts based on area
  .reduceByKeyAndWindow(_ + _, _ - _, Minutes(window_length), Seconds(sliding_interval), 2) // count alerts per area
  .repartition(1)
  .saveAsTextFiles("alerts")
As we discussed in the comments: to implement a non-overlapping window, the slide duration should be the same as the window duration.
I.e. in the example above with a window duration of 10 minutes, if the slide duration is set to 10 minutes as well, it will produce a file once every 10 minutes, with the calculations covering all the data within those 10 minutes.
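Applied to the code in the question, that change would look roughly like this (a sketch: the slide duration is set to the same 10-minute window length, and the inverse reduce function is dropped because non-overlapping windows don't need it):
val s = events.flatMap(_.split("\n"))  // split block into lines of single json events
  .map(toMyObject)                     // convert raw json to MyObject
  .filter(checkCondition)              // filter events based on condition
  .map(x => (x._1, 1L))                // count alerts based on area
  .reduceByKeyAndWindow(_ + _, Minutes(window_length), Minutes(window_length), 2) // slide == window
  .repartition(1)
  .saveAsTextFiles("alerts")           // now written once per 10-minute window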