I am using a windowed join between two streams, say with a 7-day window.
On the initial load, all records in the DB (via a Kafka Connect source connector) are loaded into the streams. It seems that ALL records then end up in the window state store for those first 7 days, because the producer/ingestion timestamps are all set to the current time rather than to a field (like create_time) that might be in the message value.
Is there a recommended way to balance the initial load against the windows of the join?
Well, the question is which records you want to join to each other, and what timestamp the source connector sets as the record timestamp (this may also depend on the topic configuration [log.]message.timestamp.type).
The join is executed based on whatever the TimestampExtractor returns. By default, that is the record timestamp. If you want to base the join on some other timestamp, a custom timestamp extractor is the way to go.
If you want processing-time semantics, though, you may want to use the WallclockTimestampExtractor.
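For example, a minimal sketch of such an extractor, assuming the deserialized value exposes a create_time field in epoch milliseconds (the HasCreateTime and CreateTimeExtractor names are placeholders):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Placeholder for whatever your deserialized value type looks like;
// only the create_time accessor matters for this sketch.
interface HasCreateTime {
    long getCreateTime(); // assumed to be epoch milliseconds
}

public class CreateTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        Object value = record.value();
        if (value instanceof HasCreateTime) {
            // use the business timestamp carried inside the message value
            return ((HasCreateTime) value).getCreateTime();
        }
        // fall back to the record timestamp set by the producer/broker
        return record.timestamp();
    }
}

The extractor is registered via the default.timestamp.extractor config of the Streams application.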
I'm reviewing the StructuredNetworkWordCountWindowed example in Apache Spark Structured Streaming and am having trouble finding information about how I can update the example to control the output intervals. When I run the example I receive output every time a micro-batch is processed. I understand that this is intended, because the main use case is to process data and emit results in real time, but what about the case where I want to process data in real time yet only output the state at some specific interval? Does Spark Structured Streaming support this scenario? I reviewed the programming guide, and the only similar concept mentioned is the Trigger.ProcessingTime option. Unfortunately, this option is not quite what is needed, since it applies to the batch processing time, while the scenario described above still requires processing data in real time.
Is this feature supported? More specifically, how do I output the state only at the time the window ends, assuming there are no late arrivals and a tumbling window is used?
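For reference, this is roughly what the Trigger.ProcessingTime option mentioned above looks like when applied to a windowed count over the socket source used by the example (a hedged sketch in Java; host, port, and intervals are placeholders). It only controls how often a micro-batch is started, not when a window's result is finally emitted:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

SparkSession spark = SparkSession.builder().appName("windowed-count").getOrCreate();

// the socket source with includeTimestamp=true yields the columns "value" and "timestamp"
Dataset<Row> lines = spark.readStream()
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .option("includeTimestamp", true)
    .load();

// 10-minute tumbling windows over the ingestion timestamp
Dataset<Row> counts = lines
    .groupBy(window(col("timestamp"), "10 minutes"), col("value"))
    .count();

counts.writeStream()
    .outputMode("update")
    .format("console")
    .trigger(Trigger.ProcessingTime("1 minute")) // start a micro-batch at most once a minute
    .start();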
I'd like to understand the following:
In Spark Structured Streaming there is the notion of a trigger that says at which interval Spark will try to read data to start processing. What I would like to know is how long that read operation may last. In particular, in the context of Kafka, what exactly happens? Say we have configured Spark to always retrieve the latest offsets. Does Spark try to read an arbitrary amount of data on each trigger (i.e., from where it last left off up to the latest offset available)? What if the read operation takes longer than the interval? What is supposed to happen at that point?
I wonder if there is a read duration that can be set, as in: on every trigger, keep reading for this amount of time. Or is the rate actually controlled in the following two ways:
Manually with maxOffsetsPerTrigger, in which case the trigger interval does not really matter,
Choosing a trigger that makes sense with respect to how much data you may have available and can process between triggers.
The second option sounds quite difficult to calibrate.
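For context, the first option looks roughly like this (a hedged sketch; the broker address, topic name, and numbers are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

SparkSession spark = SparkSession.builder().appName("kafka-rate-limited").getOrCreate();

Dataset<Row> records = spark.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .option("maxOffsetsPerTrigger", 100000) // upper bound on offsets consumed per micro-batch
    .load();

records.writeStream()
    .format("console")
    // attempt a new micro-batch every 30s; if a batch runs longer than this,
    // the next one starts as soon as the previous batch finishes
    .trigger(Trigger.ProcessingTime("30 seconds"))
    .start();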
I have time series data in Kafka. The schema is quite simple - the key is the channel name, and the values are Long/Double tuples of the timestamp and the value (in reality it's a custom Avro object but it boils down to this). They always come in correct chronological order.
The desired end result is data packaged in 10-minute batches, aligned at 10 minutes (i.e., 00:00 < t <= 00:10, 00:10 < t <= 00:20, ..., 23:50 < t <= 00:00). Each package is to contain only data of one channel.
My idea is to have two Spark Streaming jobs. The first one takes the data from the Kafka topics and dumps it into a table in a Cassandra database, where the key is the timestamp and the channel name. Every time such an RDD hits a 10-minute boundary, this boundary is posted to another topic, alongside the channel whose boundary was hit.
The second job listens to this "boundary topic", and for every received 10-minute boundary the data is pulled from Cassandra, some calculations like min, max, mean, and stddev are done, and the data and these results are packaged into a defined output directory. That way, each directory contains the data from one channel and one 10-minute window.
However, this looks a bit clunky and like a lot of extra work to me. Is this a feasible solution, or are there any other, more efficient tricks, like some custom windowing of the Kafka data?
I agree with your intuition that this solution is clunky. How about simply using the time windowing functionality built into the Streams DSL?
http://kafka.apache.org/11/documentation/streams/developer-guide/dsl-api.html#windowing
The most natural output would be a new topic containing the windowed aggregations, but if you really need it written to a directory, that should be possible with Kafka Connect.
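As a rough sketch of what that could look like, assuming an input topic named channel-values keyed by channel name, default serdes configured elsewhere, and a simple count as the aggregation (the exact TimeWindows factory method varies by Kafka Streams version):

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Double> values = builder.stream("channel-values");

KTable<Windowed<String>, Long> perChannelWindowCounts = values
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(10))) // tumbling, epoch-aligned
    .count();                                                          // or aggregate() for min/max/mean/stddev

// write the windowed results to a new topic; Kafka Connect could then export them to files
perChannelWindowCounts
    .toStream((windowedKey, count) -> windowedKey.key() + "@" + windowedKey.window().start())
    .to("channel-values-10min-counts");

Note that the DSL's windows are start-inclusive and end-exclusive, whereas the boundaries in the question are end-inclusive, so the alignment differs by one instant.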
I work with Flink stream processing, not Spark Streaming, but I guess the programming concepts of the two are alike. So, supposing the data is ordered chronologically and you want to aggregate it every 10 minutes and do some processing on the aggregated data, I think the best approach is to use the streaming window functions. I suggest defining a function that maps every incoming record's timestamp to the last 10-minute boundary:
12:10:24 ----> 12:10:00
12:10:30 ----> 12:10:00
12:25:24 ----> 12:20:00
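Expressed as code, that mapping is just a floor operation on an epoch-millis timestamp (a small sketch; the method name is made up):

import java.util.concurrent.TimeUnit;

// e.g. 12:25:24 -> 12:20:00
static long toWindowStart(long timestampMs) {
    long bucketSizeMs = TimeUnit.MINUTES.toMillis(10);
    return timestampMs - (timestampMs % bucketSizeMs);
}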
So you can create a keyed stream object like:
StreamObject<Long, Tuple<data>>
Here the Long field is the mapped timestamp of each message. Then you can apply a window; you should look into which kind of window is most appropriate for your case.
Point: setting a key for the data stream will cause the window function to maintain a logical window for every key.
In the simplest case, you would define a time window of 10 minutes and aggregate all the data that arrives in that period of time.
The other approach, if you know the rate at which data is generated and how many messages will be generated in a period of 10 minutes, is to use a count window. For example, a window with a count of 20 will listen to the stream, aggregate all the messages with the same key in a logical window, and apply the window function only when the number of messages in the window reaches 20.
After the messages are aggregated in a window as desired, you can apply your processing logic using a reduce function or a similar operation.
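A minimal Flink sketch of that keyed tumbling-window approach (the in-memory source and the reduce logic are placeholders; for the event-time semantics described above you would use TumblingEventTimeWindows plus a timestamp/watermark assigner instead of processing time):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<Tuple2<String, Double>> readings = env
    .fromElements(Tuple2.of("channel-a", 1.0), Tuple2.of("channel-a", 2.0)) // placeholder source
    .returns(Types.TUPLE(Types.STRING, Types.DOUBLE));

readings
    .keyBy(r -> r.f0)                                           // one logical window per channel
    .window(TumblingProcessingTimeWindows.of(Time.minutes(10))) // aligned 10-minute windows
    .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))             // e.g. a sum; swap in min/max/mean/stddev
    .print();

env.execute("ten-minute-windows");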
We have an IoT device that sends in sensor readings when a user interacts with the device. The user will typically interact with the device multiple times, with each interaction generating a sensor reading. The device then transmits each sensor reading to a backend API, but order and timeliness are not guaranteed.
On the backend API we must window all sensor readings that happened within the same 2 minutes into a single event and timestamp the event with the timestamp of the first reading in the window. Because we have no order or timeliness guarantees from the IoT device, we currently have a schema where each user has one document per day containing an array of events, each of which in turn has an array of sensor readings. On every incoming sensor reading we query this whole document and, in application code, determine the correct event based on the existing sensor readings and the one just received.
We've seen some performance and locking issues from having to query the entire document, process this event windowing in application code, and save the entire document back. Is there a way to do this event windowing completely in MongoDB? What is the pattern that would give us the highest write throughput overall?