How or when does a Kafka Streams time window expire? - apache-kafka

I use Kafka Streams in my application, and I have a question about the time window in the aggregate function.
KTable<Windowed<String>, PredictReq> windowedKtable = views
    .map(new ValueMapper())
    .groupByKey()
    .windowedBy(TimeWindows.of(TimeUnit.MINUTES.toMillis(1)))
    .aggregate(new ADInitializer(), new ADAggregator(),
               Materialized.with(Serdes.String(), ReqJsonSerde));
KStream<Windowed<String>, Req> filtered = windowedKtable.toStream().transform(new ADTransformerFilter());
KStream<String, String> result = filtered.transform(new ADTransformerTrans());
I aggregate data in a 1-minute window, then transform the stream to get the final aggregate result, and then apply a second transform.
Here is some sample data:
msg1 arrives at 10:00:00, msg2 arrives at 10:00:20, msg3 arrives at 10:01:10
The window runs from 10:00:00 to 10:01:00, for example.
I found that the window does not expire until msg3 arrives! (The downstream transform is not executed until msg3 arrives.)
This is not what I want.
Is there something wrong with my testing? If this really is the behavior, how can I change it?

I see...
Kafka Streams doesn't have a "window expired" concept, so I use the window attached to each message to check whether the window has changed, which means I have to wait for a message from the next window.
If no message from the next window arrives, I can't tell that the window has finished.
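For reference, a minimal sketch of how suppress() (used in one of the related questions below) can emit only the final result per window; the grace period, the Kafka Streams version (2.1+), and the variable names here are assumptions, and suppress() still needs stream time to advance (i.e. a later record must arrive) before a closed window is forwarded:

KTable<Windowed<String>, PredictReq> windowedKtable = views
    .map(new ValueMapper())
    .groupByKey()
    // grace() bounds how long out-of-order records are accepted; 10 s is only an example
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)).grace(Duration.ofSeconds(10)))
    .aggregate(new ADInitializer(), new ADAggregator(),
               Materialized.with(Serdes.String(), ReqJsonSerde));

// Forward each window's result once, after the window has closed (end + grace).
KStream<Windowed<String>, PredictReq> finalPerWindow = windowedKtable
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream();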

Related

rxdart: Get buffered elements on stream subscription cancel

I'm using a rxdart ZipStream within my app to combine two streams of incoming bluetooth data. Those streams are used along with "bufferCount" to collect 500 elements each before emitting. Everything works fine so far, but if the stream subscription gets cancelled at some point, there might be a number of elements in those buffers that are omitted after that. I could wait for a "buffer cycle" to complete before cancelling the stream subscription, but as this might take some time depending on the configured sample rate, I wonder if there is a solution to get those buffers as they are even if the number of elements might be less than 500.
Here is some simplified code for explanation:
subscription = ZipStream.zip2(
  streamA.bufferCount(500),
  streamB.bufferCount(500),
  (streamABuffer, streamBBuffer) {
    return ...;
  },
).listen((data) {
  ...
});
Thanks in advance!
So for anyone wondering: since bufferCount is implemented with BufferCountStreamTransformer, which extends BackpressureStreamTransformer, there is a dispatchOnClose property that defaults to true. That means that if the underlying stream whose elements are being buffered is closed, the remaining elements in the buffer are emitted at the end. This also applies to the example above. My mistake was closing the stream and cancelling the stream subscription immediately. By awaiting the stream's close and only then cancelling the subscription, everything works as expected.

Kafka Streams - time window close delay?

I'm new to Kafka Streams.
I use the suppress method of KTable in order to handle only the final result of a window like this:
myStream
    .windowedBy(TimeWindows.of(Duration.ofSeconds(10)).grace(Duration.ofMillis(500)))
    .aggregate(new Aggregation(),
               (k, v, a) -> a, // disabled the actual aggregation in order to eliminate possibilities of latency
               materialized.withLoggingDisabled())
    .suppress(untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream()
    .peek((k, v) -> log.info("delay " + (System.currentTimeMillis() - k.window().endTime().toEpochMilli())));
This way I get a log line every 10 seconds with the delay: the difference between the window end and the actual time peek was called.
I would expect a very small number here, since this code practically does nothing...
Nevertheless, I get a delay of 4-20 seconds for each key/window.
I use a thread per task (5 threads for this topic).
Can someone please point out if I'm doing anything wrong?
Thanks!
Edit:
Using VisualVM shows that ~99% of the time is spent in sun.nio.ch.SelectorImpl.select(). As far as I understand, this means the process is "idle" most of the time.
Edit:
It seems that lowering "commit.interval.ms" (which is 30000 by default) reduced the delay drastically.
Still, the delay has peaks of even 15 seconds, so the problem isn't solved yet...
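For reference, a minimal sketch of lowering the commit interval in the streams configuration; the application id, bootstrap servers, and the 1000 ms value are placeholders, and the point is only that with the default record cache, results may not be flushed downstream until a commit, so a 30 s interval can show up as extra latency:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "windowed-delay-test");   // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // placeholder
// Default is 30000 ms; committing more often flushes buffered results downstream sooner.
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);
// Optionally disable the record cache entirely while investigating latency.
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);

KafkaStreams streams = new KafkaStreams(builder.build(), props); // builder: the StreamsBuilder used to define myStream
streams.start();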

apache storm missing event detection based on time

I want to detect a missing event in a data stream (e.g. detect a customer request that has not been responded to within 1 hour of its reception).
Here, I want to detect that the "Response" event is missing and raise an alert.
I tried using tick tuples by setting TOPOLOGY_TICK_TUPLE_FREQ_SECS, but they are configured at the bolt level, and a tick might only come at, say, the 15th minute after getting a customer request.
@Override
public Map<String, Object> getComponentConfiguration() {
    Config conf = new Config();
    conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 1800);
    return conf;
}
^ this doesn't work.
Let me know in comments if any other information is required. Thanks in advance for the help.
This might help http://storm.apache.org/releases/1.0.3/Windowing.html
You can define 5-minute windows, check the status of the last window's events, and alert based on what was received,
or create an intermediate bolt which maintains these windows and emits normal alert tuples (instead of tick tuples) in case of timeouts.
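For illustration, a hedged sketch of that intermediate windowed bolt using Storm's windowing support (Storm 1.x API); the bolt name, field names, spout id, and the 5-minute window are made up for this example, and the in-memory pending map is not fault-tolerant:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseWindowedBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.windowing.TupleWindow;

public class MissingResponseAlertBolt extends BaseWindowedBolt {

    private OutputCollector collector;
    // Requests seen so far that have not yet been matched by a "Response" event.
    private final Map<String, Long> pendingRequests = new HashMap<>();

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(TupleWindow inputWindow) {
        long now = System.currentTimeMillis();
        for (Tuple t : inputWindow.get()) {
            String requestId = t.getStringByField("requestId"); // assumed field name
            String eventType = t.getStringByField("eventType"); // assumed field name
            if ("Request".equals(eventType)) {
                pendingRequests.put(requestId, t.getLongByField("timestamp"));
            } else if ("Response".equals(eventType)) {
                pendingRequests.remove(requestId);
            }
        }
        // Alert for every request still unanswered after one hour
        // (a real topology would also remove or mark already-alerted requests).
        pendingRequests.forEach((id, ts) -> {
            if (now - ts > TimeUnit.HOURS.toMillis(1)) {
                collector.emit(new Values(id, "NO_RESPONSE_WITHIN_1_HOUR"));
            }
        });
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("requestId", "alert"));
    }
}

// Wiring it up with a 5-minute tumbling window:
// builder.setBolt("missing-response-alert",
//         new MissingResponseAlertBolt().withTumblingWindow(new BaseWindowedBolt.Duration(5, TimeUnit.MINUTES)))
//        .shuffleGrouping("events"); // "events" spout id is assumed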

Read from Kinesis is giving empty records when run using previous sequence number or timestamp

I am trying to read the messages pushed to a Kinesis stream with the help of the get_records() and get_shard_iterator() APIs.
My producer keeps pushing records as it processes them, and the consumer also runs as a cron job every 30 minutes. So I tried storing the sequence number of the last message read in my database and using the AFTER_SEQUENCE_NUMBER shard iterator together with that sequence number. However, this doesn't work the second time (the first run successfully reads all messages in the stream) after new messages are pushed.
I also tried using AT_TIMESTAMP along with a message timestamp that the producer pushes to the stream as part of each message, storing that timestamp for later use. Again, the first run processes all messages, and from the second run onwards I get empty records.
I am really not sure where I am going wrong. I would appreciate it if someone could help me with this.
The code below uses the timestamp approach, but the same thing is done for the sequence-number method too.
def listen_to_kinesis_stream():
    kinesis_client = boto3.client('kinesis', region_name=SETTINGS['region_name'])
    stream_response = kinesis_client.describe_stream(StreamName=SETTINGS['kinesis_stream'])
    for shard_info in stream_response['StreamDescription']['Shards']:
        kinesis_stream_status = mongo_coll.find_one({'_id': "DOC_ID"})
        last_read_ts = kinesis_stream_status.get('state', {}).get(
            shard_info['ShardId'],
            datetime.datetime(1970, 1, 1).strftime("%Y-%m-%dT%H:%M:%S.%f"))
        shard_iterator = kinesis_client.get_shard_iterator(
            StreamName=SETTINGS['kinesis_stream'],
            ShardId=shard_info['ShardId'],
            ShardIteratorType='AT_TIMESTAMP',
            Timestamp=last_read_ts)
        get_response = kinesis_client.get_records(ShardIterator=shard_iterator['ShardIterator'], Limit=1)
        if len(get_response['Records']) == 0:
            continue
        message = json.loads(get_response['Records'][0]['Data'])
        process_resp = process_message(message)
        if process_resp['success'] is False:
            print process_resp
        generic_config_coll.update({'_id': "DOC_ID"}, {'$set': {'state.{0}'.format(shard_info['ShardId']): message['ts']}})
        print "Processed {0}".format(message)
        while 'NextShardIterator' in get_response:
            get_response = kinesis_client.get_records(ShardIterator=get_response['NextShardIterator'], Limit=1)
            if len(get_response['Records']) == 0:
                break
            message = json.loads(get_response['Records'][0]['Data'])
            process_resp = process_message(message)
            if process_resp['success'] is False:
                print process_resp
            mongo_coll.update({'_id': "DOC_ID"}, {'$set': {'state.{0}'.format(shard_info['ShardId']): message['ts']}})
            print "Processed {0}".format(message)
    logger.debug("Processed all messages from Kinesis stream")
    print "Processed all messages from Kinesis stream"
As per my discussion with an AWS technical support person, there can be a few responses with empty records, and hence it is not a good idea to break when len(get_response['Records']) == 0.
The better approach suggested was to keep a counter for the maximum number of messages to read in a run and exit the loop after reading that many messages.
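For illustration, a hedged sketch of that counter-based loop, written in Java with the AWS SDK to match the other Java examples on this page; the stream name, the per-run cap, and the checkpointing are placeholders, and the loop only stops early once a batch is empty and MillisBehindLatest reports the iterator has caught up with the tip of the stream:

import java.util.Date;

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.model.GetRecordsRequest;
import com.amazonaws.services.kinesis.model.GetRecordsResult;
import com.amazonaws.services.kinesis.model.GetShardIteratorRequest;
import com.amazonaws.services.kinesis.model.Record;

public class ShardReader {

    private static final int MAX_MESSAGES_PER_RUN = 1000; // made-up cap per cron run

    public static void readShard(AmazonKinesis kinesis, String streamName,
                                 String shardId, Date lastReadTimestamp) {
        String iterator = kinesis.getShardIterator(new GetShardIteratorRequest()
                        .withStreamName(streamName)
                        .withShardId(shardId)
                        .withShardIteratorType("AT_TIMESTAMP")
                        .withTimestamp(lastReadTimestamp))
                .getShardIterator();

        int messagesRead = 0;
        while (iterator != null && messagesRead < MAX_MESSAGES_PER_RUN) {
            GetRecordsResult result = kinesis.getRecords(
                    new GetRecordsRequest().withShardIterator(iterator).withLimit(100));
            for (Record record : result.getRecords()) {
                // process the record and persist the checkpoint, as in the Python code above
                messagesRead++;
            }
            // Do NOT stop just because one batch is empty; only stop once the
            // iterator has caught up with the tip of the stream.
            if (result.getRecords().isEmpty()
                    && result.getMillisBehindLatest() != null
                    && result.getMillisBehindLatest() == 0) {
                break;
            }
            iterator = result.getNextShardIterator();
        }
    }
}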

flink streaming window trigger

I have a Flink stream and I am calculating a few things over a time window, say 30 seconds.
What happens here is that it gives me a result that aggregates the previous windows as well.
Say for the first 30 seconds I get the result 10.
For the next thirty seconds I want a fresh result; instead I get the last window's result + the new one,
and so on.
So my question is: how do I get a fresh result for each window?
You need to use a purging trigger. What you want is FIRE_AND_PURGE (emit and remove window content); what the default Flink trigger does is FIRE (emit and keep window content).
input
    .keyBy(...)
    .timeWindow(Time.seconds(30))
    // The important part: replace the default non-purging ProcessingTimeTrigger
    .trigger(PurgingTrigger.of(ProcessingTimeTrigger.create()))
    .reduce(...)
For a more in-depth explanation, have a look at Triggers and FIRE vs. FIRE_AND_PURGE.
A Trigger determines when a window (as formed by the window assigner) is ready to be processed by the window function. Each WindowAssigner comes with a default Trigger. If the default trigger does not fit your needs, you can specify a custom trigger using trigger(...).
When a trigger fires, it can either FIRE or FIRE_AND_PURGE. While FIRE keeps the contents of the window, FIRE_AND_PURGE removes its content. By default, the pre-implemented triggers simply FIRE without purging the window state.
The functionality you describe can be found in Tumbling Windows: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/windows.html#tumbling-windows
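For reference, a minimal Java sketch of the same idea with an explicit tumbling window assigner, as described on the linked page; "input", the key selector, and the reduce function are placeholders, and the Flink 1.2-era DataStream API is assumed:

input
    .keyBy(event -> event.getKey())                              // placeholder key selector
    .window(TumblingProcessingTimeWindows.of(Time.seconds(30)))  // a new window every 30 seconds
    // PurgingTrigger wraps the default trigger so the window contents are dropped after firing.
    .trigger(PurgingTrigger.of(ProcessingTimeTrigger.create()))
    .reduce((a, b) -> a.merge(b));                               // placeholder reduce function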
A bit more detail and/or code would help :)
I'm a little late to this question, but I encountered the same issue as the OP. What I found out later was a bug in my own code. FYI, my mistake could be a good reference for your problem.
// Old code (modified to be an example):
val tenSecondGrouping: DataStream[MyCustomGrouping] = userIdsStream
  .keyBy(_.somePartitionedKey)
  .window(TumblingProcessingTimeWindows.of(Time.of(10, TimeUnit.SECONDS)))
  .trigger(ProcessingTimeTrigger.create())
  .aggregate(new MyCustomAggregateFunc(new MyCustomGrouping()))
The bug happened at new MyCustomGrouping: I unintentionally created a single MyCustomGrouping object and reused it in MyCustomAggregateFunc. As more tumbling windows were created, the later aggregation results grew out of control! The fix was to create a new MyCustomGrouping each time MyCustomAggregateFunc is triggered. So:
// New code, problem solved
...
.aggregate(new MyCustomAggregateFunc(() => new MyCustomGrouping()))
// passing in a func to create new object per trigger