My use case is the following:
Orders are flowing into an activation system via a topic. I have to identify changes for records with the same key. I compare the existing value with the new value using the aggregate function and output an event that indicates the type of change identified, e.g. a DueDate change.
The key is a randomly generated number and the number of unique keys is pretty much unbounded. The same key is reused only when the ordering system pushes a revision to an existing order.
The code has been running for a couple of months in production, but the state store and changelog topic keep growing and there is a concern about space usage. I would like records to expire from the state store after 90 days. I read about ways to apply time-based retention to a state store, and it looks like windowing the aggregation is one way of achieving that.
I understand that windowed aggregations are only available for tumbling and hopping windows; sliding windows are available for join operations only.
A tumbling window wouldn't work in this case because I would have windows for days 0-90, 90-180, etc., and I wouldn't be able to identify an update on day 92 for a record that came in on day 89 (they wouldn't share the same window).
Now the only other option is a hopping window:
TimeWindows timeWindow = TimeWindows
    .of(Duration.ofDays(90))
    .advanceBy(Duration.ofDays(1))
    .until(Duration.ofDays(1).toMillis());
The problem is that I'll have to persist and update 90 windows. When the stream starts, 90 windows will be created: 0-90, 1-91, 2-92, 3-93, etc. If I have a retention of 1 day on the windows, the window 0-90 will be cleaned up on day 91.
Now let's say on day 90 I get an update. Correct me if I'm wrong, but my understanding is that I will have to update 90 windows, and my state store will be quite large by that time because of all the duplicates. Maybe this is where I'm missing something: if a record is present in 90 windows, is it physically written to disk 90 times?
In the end all I need is to prevent my state store and changelog topic from growing indefinitely. 90 days of historical data is sufficient to support my use case.
Would there be a better way to approach this?
It might be simpler to not use the DSL but the Processor API with a windowed state store. A windowed state store is just a key-value store with expiration. Hence, you can use it similar to a key-value store -- you only provide an additional timestamp that will be used to expire data eventually.
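For illustration, a rough sketch of that idea could look like the following (the "orders-store" name, the orders stream, the Order/ChangeEvent types, their serdes, and the detectChange() helper are placeholders, not from the question):

StoreBuilder<WindowStore<String, Order>> ordersStore = Stores.windowStoreBuilder(
    Stores.persistentWindowStore(
        "orders-store",
        Duration.ofDays(90),  // retention: entries older than 90 days get purged
        Duration.ofDays(90),  // window size (not really used here)
        false),               // do not retain duplicates
    Serdes.String(),
    orderSerde);

builder.addStateStore(ordersStore);

KStream<String, ChangeEvent> changes = orders.transform(() ->
    new Transformer<String, Order, KeyValue<String, ChangeEvent>>() {
        private ProcessorContext context;
        private WindowStore<String, Order> store;

        @Override
        @SuppressWarnings("unchecked")
        public void init(final ProcessorContext context) {
            this.context = context;
            this.store = (WindowStore<String, Order>) context.getStateStore("orders-store");
        }

        @Override
        public KeyValue<String, ChangeEvent> transform(final String key, final Order order) {
            final long now = context.timestamp();
            Order previous = null;
            // fetch the latest revision stored for this key within the last 90 days
            try (WindowStoreIterator<Order> iter =
                     store.fetch(key, now - Duration.ofDays(90).toMillis(), now)) {
                while (iter.hasNext()) {
                    previous = iter.next().value;
                }
            }
            // store the new revision under the record timestamp; it expires with the retention
            store.put(key, order, now);
            return previous == null ? null : KeyValue.pair(key, detectChange(previous, order));
        }

        @Override
        public void close() {}
    },
    "orders-store");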
Related
I've recently upgraded my kafka streams from 2.0.1 to 2.5.0. As a result I'm seeing a lot of warnings like the following:
org.apache.kafka.streams.kstream.internals.KStreamWindowAggregate$KStreamWindowAggregateProcessor Skipping record for expired window. key=[325233] topic=[MY_TOPIC] partition=[20] offset=[661798621] timestamp=[1600041596350] window=[1600041570000,1600041600000) expiration=[1600059629913] streamTime=[1600145999913]
There seems to be new logic in the KStreamWindowAggregate class that checks whether a window has closed. If it has closed, the messages are skipped. In 2.0.1 these messages were still processed.
Question
Is there a way to get the same behavior as before? I'm seeing lots of gaps in my data with this upgrade and am not sure how to solve this, as previously these gaps were not seen.
The aggregate function that I'm using already deals with windowing and, as a result, with expired windows. How does this new logic relate to these expiring windows?
Update
While exploring further, I can see that this is indeed related to the grace period (in ms). In my custom TimestampExtractor (which uses the timestamp from the payload instead of the record timestamp), I can see that for the expired-window warnings the event time from the payload is indeed more than 24 hours behind the stream time.
I assume this is caused by consumer lag of over 24 hours.
The timestamp extractor's extract method has a partitionTime parameter, which according to the docs is:
partitionTime - the highest extracted valid timestamp of the current record's partition (could be -1 if unknown)
So is this the create time of the record on the topic? And is there a way to influence this so that my records are no longer skipped?
In 2.0.1 these messages were still processed.
That is a little bit surprising (even if I would need to double-check the code), at least for the default config. By default, the store retention time is set to 24h, and thus in 2.0.1 messages older than 24h should also not be processed, as the corresponding state had already been purged. If you changed the store retention time (via Materialized#withRetention) to a larger value, you would also need to increase the window grace period via the TimeWindows#grace() method accordingly.
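For example, a hedged sketch (the stream, store name, types, and serdes are placeholders) of a 30-minute window whose state is kept, and whose late records are accepted, for 7 days:

stream
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(30)).grace(Duration.ofDays(7)))
    .aggregate(
        MyAggregate::new,
        (key, value, aggregate) -> aggregate.add(value),
        Materialized.<String, MyAggregate, WindowStore<Bytes, byte[]>>as("my-agg-store")
            // retention must be at least window size + grace period
            .withRetention(Duration.ofMinutes(30).plus(Duration.ofDays(7))));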
The aggregate function that I'm using already deals with windowing and, as a result, with expired windows. How does this new logic relate to these expiring windows?
Not sure what you mean by this or how you actually do this. The old and new logic are similar with regard to how long a window is stored (the retention time config). The new part is the grace period, which you can increase to the same value as the retention time if you wish.
About "partition time": it is computed based on whatever the TimestampExtractor returns. In your case, it's the max of the timestamps you have extracted from the message payloads so far.
I am building the following Kafka Streams topology (pseudo code):
gK = builder.stream().groupByKey();
g1 = gK.windowedBy(TimeWindows.of("PT1H")).reduce().mapValues().toStream().mapValues().selectKey();
g2 = gK.reduce().mapValues();
g1.leftJoin(g2).to();
As you can see, this is a diamond-like topology that starts at a single input topic and ends at a single output topic, with messages flowing through two parallel branches that eventually get joined together at the end. One branch applies (tumbling?) windowing, the other does not. Both branches work on the same key (apart from the Windowed key introduced in between by the windowing).
The timestamps for my messages are event time; that is, they are picked from the message body by my custom-configured TimestampExtractor implementation. The actual timestamps in my messages are several years in the past.
That all works well at first sight in my unit tests with a couple of input/output messages and in the runtime environment (with real Kafka).
The problem seems to come when the number of messages starts being significant (e.g. 40K).
My failing scenario is the following:
1. ~40K records with the same key get uploaded into the input topic first.
2. ~40K updates come out of the output topic, as expected.
3. Another ~40K records, all with the same key but a key different from the one in step 1, get uploaded into the input topic.
4. Only ~100 updates come out of the output topic, instead of the expected new ~40K updates. There is nothing special about those ~100 updates; their contents seem to be right, but only for certain time windows. For other time windows there are no updates, even though the flow logic and input data should definitely generate ~40K records. In fact, when I swap the datasets in steps 1 and 3, I get exactly the same situation, with ~40K updates coming from the second dataset and the same ~100 from the first.
I can easily reproduce this issue in the unit tests using TopologyTestDriver locally (but only on bigger numbers of input records).
In my tests, I've tried disabling caching with StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG. Unfortunately, that didn't make any difference.
UPDATE
I tried both, reduce() calls and aggregate() calls instead. The issue persists in both cases.
Something else I'm noticing: both with StreamsConfig.TOPOLOGY_OPTIMIZATION set to StreamsConfig.OPTIMIZE and without it, the mapValues() handler gets called in the debugger before the preceding reduce() (or aggregate()) handlers, at least the first time. I didn't expect that.
I tried both join() and leftJoin(); unfortunately, the result is the same.
In the debugger, the second portion of the data doesn't trigger the reduce() handler in the "left" flow at all, but does trigger the reduce() handler in the "right" flow.
With my configuration, if the number of records in each dataset is 100, the problem doesn't manifest itself and I get the 200 output messages I expect. When I raise the number to 200 in each dataset, I get fewer than the expected 400 messages out.
So it seems at the moment that "old" windows get dropped and the new records for those old windows get ignored by the stream.
There is a window retention setting, but with its default value (which I am using) I was expecting windows to retain their state and stay active for at least 12 hours (which significantly exceeds the run time of my unit test).
I tried to amend the left reducer with the following window storage config:
Materialized.as(
    Stores.inMemoryWindowStore(
        "rollup-left-reduce",
        Duration.ofDays(5 * 365),
        Duration.ofHours(1),
        false))
still no difference in results.
The same issue persists even with only the single "left" flow, without the "right" flow and without join(). It seems that the problem is in the window retention settings of my setup. The timestamps (event time) of my input records span 2 years, and the second dataset starts again from the beginning of those 2 years. This place in Kafka Streams makes sure that the records of the second dataset get ignored:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/InMemoryWindowStore.java#L125
Kafka Streams Version is 2.4.0. Also using Confluent dependencies version 5.4.0.
My questions are:
What could be the reason for such behaviour?
Did I miss anything in my stream topology?
Is such topology expected to work at all?
After some debugging time I found the reason for my problem.
My input datasets contain records with timestamps that span 2 years. When I load the first dataset, the "observed" time of my stream gets set to the maximum timestamp from that input dataset.
The upload of the second dataset, which starts with records whose timestamps are 2 years before the new observed time, causes the stream internals to drop those messages. This can be seen if you set the Kafka logging to TRACE level.
So, to fix my problem I had to configure the retention and grace period for my windows:
instead of
.windowedBy(TimeWindows.of(windowSize))
I have to specify
.windowedBy(TimeWindows.of(windowSize).grace(Duration.ofDays(5 * 365)))
Also, I had to explicitly configure reducer storage settings as:
Materialized.as(
    Stores.inMemoryWindowStore(
        "rollup-left-reduce",
        Duration.ofDays(5 * 365),
        windowSize,
        false))
That's it, the output is as expected.
I am using a Windowed Join between two streams, let's say a 7 day window.
On initial load, all records in the DB are loaded into the streams (via a Kafka Connect source connector). It seems that ALL records then end up in the window state store for those first 7 days, because the producer/ingestion timestamps are all in current time rather than a field (like create_time) that might be in the message value.
Is there a recommended way to balance the initial load against the Windows of the join?
Well, the question is: which records do you want to join to each other? And what timestamp does the source connector set as the record timestamp? (This might also depend on the topic configuration, [log.]message.timestamp.type.)
The join is executed based on whatever the TimestampExtractor returns. By default, that is the record timestamp. If you want to base the join on some other timestamp, a custom timestamp extractor is the way to go.
If you want processing-time semantics, though, you may want to use the WallclockTimestampExtractor.
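For example, a rough sketch of such an extractor (the OrderValue type and its create_time accessor are hypothetical), falling back to the partition time or the record timestamp when the payload carries no usable time:

public class CreateTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(final ConsumerRecord<Object, Object> record, final long partitionTime) {
        final Object value = record.value();
        if (value instanceof OrderValue) {
            // use the business timestamp embedded in the message value
            return ((OrderValue) value).getCreateTime();
        }
        // fall back to the highest valid timestamp seen so far on this partition,
        // or to the record timestamp if the partition time is still unknown (-1)
        return partitionTime >= 0 ? partitionTime : record.timestamp();
    }
}

It would be registered via StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG.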
I have time series data in Kafka. The schema is quite simple - the key is the channel name, and the values are Long/Double tuples of the timestamp and the value (in reality it's a custom Avro object but it boils down to this). They always come in correct chronological order.
The desired end result is data packaged in 10-minute batches, aligned at 10 minutes (i.e., 00:00 < t <= 00:10, 00:10 < t <= 00:20, ..., 23:50 < t <= 00:00). Each package is to contain only data of one channel.
My idea is to have two Spark Streaming jobs. The first one takes the data from the Kafka topics and dumps it into a table in a Cassandra database, where the key is the timestamp and the channel name; every time such an RDD hits a 10-minute boundary, this boundary is posted to another topic alongside the channel whose boundary was hit.
The second job listens to this "boundary topic", and for every received 10-minute boundary the data is pulled from Cassandra, some calculations like min, max, mean, and stddev are done, and the data and these results are packaged into a defined output directory. That way, each directory contains the data from one channel and one 10-minute window.
However, this looks a bit clunky and like a lot of extra work to me. Is this a feasible solution or are there any other more efficient tricks to it, like some custom windowing of the Kafka data?
I agree with your intuition that this solution is clunky. How about simply using the time windowing functionality built into the Streams DSL?
http://kafka.apache.org/11/documentation/streams/developer-guide/dsl-api.html#windowing
The most natural output would be a new topic containing the windowed aggregations, but if you really need it written to a directory that should be possible with Kafka Connect.
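A rough sketch of what that could look like (topic names, the Stats type, and the serdes are placeholders, not from the question):

builder.stream("channel-values", Consumed.with(Serdes.String(), Serdes.Double()))
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(10)))  // epoch-aligned 10-minute tumbling windows
    .aggregate(
        Stats::new,
        (channel, value, stats) -> stats.update(value),  // track count, sum, min, max, ...
        Materialized.with(Serdes.String(), statsSerde))
    .toStream()
    .to("channel-stats-10min",
        Produced.with(WindowedSerdes.timeWindowedSerdeFrom(String.class), statsSerde));

Note that DSL time windows include the window start and exclude the end, which differs slightly from the (start, end] convention in the question.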
I work with Flink stream processing, not Spark Streaming, but I guess the programming concepts of the two are alike. So, supposing the data is ordered chronologically and you want to aggregate it every 10 minutes and do some processing on the aggregated data, I think the best approach is to use the streaming window functions. I suggest defining a function that maps each incoming record's timestamp down to the last 10-minute boundary:
12:10:24 ----> 12:10:00
12:10:30 ----> 12:10:00
12:25:24 ----> 12:20:00
So you can create a keyed stream object like:
StreamObject<Long, Tuple<data>>
Here the Long field is the mapped timestamp of each message. Then you can apply a window; you should look into which kind of window is most appropriate for your case.
Point: Setting a key for the data stream will cause the window function to consider a logical window for every key.
In the simplest case, you would define a time window of 10 minutes and aggregate all the data arriving in that period of time.
The other approach, if you know the data generation rate and how many messages will be generated in a period of 10 minutes, is to use a count window. For example, a window with a count of 20 will listen to the stream, aggregate all the messages with the same key in a logical window, and apply the window function only when the number of messages in the window reaches 20.
After the messages are aggregated in a window as desired, you can apply your processing logic using a reduce function or a similar operation.
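A rough Flink-flavored sketch of the time-window variant (the tuple layout channel/timestamp/value and the sum-style reduce are assumptions; a real job would read from Kafka instead of fromElements):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.fromElements(
        Tuple3.of("channelA", 1_600_000_000_000L, 1.0),
        Tuple3.of("channelA", 1_600_000_300_000L, 2.0))
    .assignTimestampsAndWatermarks(
        WatermarkStrategy.<Tuple3<String, Long, Double>>forMonotonousTimestamps()
            .withTimestampAssigner((tuple, ts) -> tuple.f1))  // event time from the payload
    .keyBy(tuple -> tuple.f0)                                 // one logical window per channel
    .window(TumblingEventTimeWindows.of(Time.minutes(10)))    // aligned 10-minute windows
    .reduce((a, b) -> Tuple3.of(a.f0, b.f1, a.f2 + b.f2))     // e.g. a running sum per window
    .print();

env.execute("10-minute channel windows");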
We have a Kafka topic configured on which we publish accumulated reports for each stock we traded throughout the day.
For example: Stock A - Buy-50, Sell-60; Stock B - Buy-44, Sell-34; etc. The key used while publishing is the RIC code of the stock.
The next day I want all consumers to get the last published positions for each stock individually. I want to understand how to configure Kafka producer/consumer to achieve this behavior.
One thing that comes to mind is creating a partition for each stock; this will result in individual offsets for each stock, and all consumers can seek to the HIGHEST offset and get the latest position.
Is this the correct approach or am I missing something obvious?
Your approach will work, but only if you don't care too much about the time boundaries - for example, if you do not need to get the counts for each day separately, with a strict requirement that only events that happened between, say, [01/25/2017 00:00 - 01/26/2017 00:00] are counted.
If you do need to get counts per day in a strict manner, you could try using Kafka Streams, with the RIC as the key and the window set to 24 hours based on the event timestamp.
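A hedged sketch of that idea (topic names, the Trade/Position types, and the serdes are placeholders): daily tumbling windows keyed by RIC, accumulating the buy/sell totals per stock per day.

builder.stream("trades", Consumed.with(Serdes.String(), tradeSerde))
    .groupByKey()                                       // the key is the RIC code
    .windowedBy(TimeWindows.of(Duration.ofHours(24)))   // one window per day, based on event time
    .aggregate(
        Position::new,
        (ric, trade, position) -> position.add(trade),  // accumulate buys and sells
        Materialized.with(Serdes.String(), positionSerde))
    .toStream()
    .to("daily-positions",
        Produced.with(WindowedSerdes.timeWindowedSerdeFrom(String.class), positionSerde));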
This is just one other way to do that - I'm sure there are more approaches available!