I'm using GroupByUntil to group messages from MSMQ that have specific property values, which is working wonderfully. This is the code I'm using:
observable.GroupByUntil(
message => message.Source,
message => message.Body,
message => Observable.Timer(new TimeSpan(0,0,5)) //I thought this was sliding expiration
).Subscribe(HandleGroup);
I mistakenly thought that each time a new message arrived for a given group, that group's durationSelector would restart, essentially waiting for the duration to pass with no new messages before ending the group. I realize that is not the case, and that the durationSelector is going to continue to count down no matter what. What is the best way to achieve a sliding durationSelector for each group as it's being grouped?
Switch is your friend.
observable.GroupByUntil(
message => message.Source,
message => message.Body,
group => group
.Select(message => Observable.Timer(new TimeSpan(0, 0, 5)))
.Switch()
).Subscribe(HandleGroup);
Explanation:
For each message, create a timer that fires once after 5 seconds.
If another message comes along within the same group, drop the old timer and switch to the new one.
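The sliding behaviour described above can be modelled outside Rx. The following is a minimal Python sketch, not the Rx implementation: the function name `group_with_sliding_timeout` and the `(timestamp, key, value)` event shape are assumptions made for illustration. Each new event for a key restarts that key's deadline, and a group closes only once the timeout passes with no events for it.

```python
def group_with_sliding_timeout(events, timeout):
    """Group (timestamp, key, value) events. A group closes once `timeout`
    seconds have passed with no new event for its key (sliding expiration)."""
    open_groups = {}   # key -> (time of last event, values collected so far)
    closed = []        # (key, values) in the order the groups were closed
    for ts, key, value in sorted(events, key=lambda e: e[0]):
        # Close every open group whose sliding deadline passed before this event.
        expired = [k for k, (last, _) in open_groups.items() if ts - last >= timeout]
        for k in expired:
            closed.append((k, open_groups.pop(k)[1]))
        _, values = open_groups.get(key, (ts, []))
        values.append(value)
        open_groups[key] = (ts, values)   # restart the timer for this key
    # End of input: flush the groups that are still open.
    closed.extend((k, values) for k, (_, values) in open_groups.items())
    return closed
```

With a 5-second timeout, events for the same key at t=0, t=3 and t=7 stay in one group, whereas a fixed timer started at t=0 would have ended the group at t=5.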
I use Kafka Streams in my application, and I have a question about the time window in the aggregate function.
KTable<Windowed<String>, PredictReq> windowedKtable = views.map(new ValueMapper()).groupByKey().windowedBy(TimeWindows.of(TimeUnit.MINUTES.toMillis(1)))
.aggregate(new ADInitializer(), new ADAggregator(),Materialized.with(Serdes.String(), ReqJsonSerde));
KStream<Windowed<String>, Req> filtered = windowedKtable.toStream().transform(new ADTransformerFilter());
KStream<String, String> result = filtered.transform(new ADTransformerTrans());
I aggregate data in a 1-minute window, then transform the stream to get the final aggregate result, and then apply a second transform.
Here is some sample data:
msg1 arrives at 10:00:00, msg2 arrives at 10:00:20, msg3 arrives at 10:01:10.
The window runs from 10:00:00 to 10:01:00, for example.
I found that the window does not expire until msg3 arrives (the following transform is not executed until msg3 arrives).
This is not what I want.
Is there something wrong with my testing? If this is the expected behaviour, how can I change it?
I see...
Kafka Streams has no concept of a window expiring on its own. I use the window carried in each message to check whether the window has changed, so I have to wait for a message from the next window.
If the next message does not arrive, I cannot tell that the window has finished.
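The behaviour described above can be reproduced with a toy model. This is a plain Python sketch, not Kafka Streams code: the function name `closed_windows` and the use of seconds since 10:00:00 as timestamps are assumptions for illustration. A window is reported as finished only when a record belonging to a later window shows up.

```python
def closed_windows(records, window_size):
    """Yield (window_start, values) once a record from a later window arrives.

    records: iterable of (event_time_seconds, value), in timestamp order.
    """
    current_start, values = None, []
    for ts, value in records:
        start = ts - ts % window_size   # start of the window this record falls in
        if current_start is not None and start != current_start:
            # A later record moved time forward, so the previous window is done.
            yield current_start, values
            values = []
        current_start = start
        values.append(value)
    # The last window is never yielded: nothing has arrived to move time past it.
```

Feeding it only msg1 (t=0) and msg2 (t=20) with a 60-second window yields nothing; the first window is emitted only after msg3 (t=70) is added.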
I want to detect a missing event in a data stream (e.g. detect a customer request that has not been responded to within 1 hour of its reception).
Here, I want to detect the missing "Response" event and raise an alert.
I tried using a tick tuple by setting TOPOLOGY_TICK_TUPLE_FREQ_SECS, but it is configured at the bolt level, so a tick might arrive only 15 minutes after a customer request was received.
@Override
public Map<String, Object> getComponentConfiguration() {
    Config conf = new Config();
    // Ask Storm to send this bolt a tick tuple every 1800 seconds.
    conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 1800);
    return conf;
}
^ this doesn't work.
Let me know in comments if any other information is required. Thanks in advance for the help.
This might help: http://storm.apache.org/releases/1.0.3/Windowing.html
You can define 5-minute windows, check the status of the events in the last window, and alert based on what was received.
Alternatively, create an intermediate bolt which maintains these windows and sends normal alert tuples (instead of tick tuples) in case of timeouts.
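The bookkeeping such an intermediate bolt would need can be sketched independently of Storm. The following Python sketch is only an illustration under stated assumptions: the class `MissingResponseDetector` and its method names are hypothetical, and timestamps are passed in explicitly rather than read from a clock. It remembers when each request arrived and, on every tick, reports the requests that are still unanswered after the timeout.

```python
class MissingResponseDetector:
    """Track pending requests and report those unanswered after a timeout."""

    def __init__(self, timeout_secs):
        self.timeout_secs = timeout_secs
        self.pending = {}   # request_id -> time the request was received

    def on_request(self, request_id, now):
        self.pending[request_id] = now

    def on_response(self, request_id):
        # A response clears the pending request; unknown ids are ignored.
        self.pending.pop(request_id, None)

    def on_tick(self, now):
        """Return ids that have waited at least timeout_secs; each is reported once."""
        overdue = sorted(r for r, t in self.pending.items()
                         if now - t >= self.timeout_secs)
        for r in overdue:
            del self.pending[r]
        return overdue
```

In this model the tick frequency only determines how late an alert can be: with a tick every 60 seconds, a request is flagged at most about 60 seconds after its one-hour deadline.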
I want to offer a string sent in a load request to a queue after some initial delay, say 10 seconds.
If subsequent requests are made with a short interval between them (1 second), everything works fine, but if they are made continuously, for example from a script, there is no delay.
Here is the sample code.
def load(randomStr :String) = Action { implicit request =>
Source.single(randomStr)
.delay(10 seconds, DelayOverflowStrategy.backpressure)
.map(x =>{
println(x)
queue.offer(x)
})
.runWith(Sink.ignore)
Ok("")
}
I am not entirely sure that this is the correct way of doing what you want. There are some things you need to reconsider:
A delayed source has an internal buffer with a default capacity of 16 elements. You can increase this with addAttributes(inputBuffer).
In your case the buffer cannot actually become full because every time you provide just one element.
Who is the caller of the Action? You are defining a DelayOverflowStrategy.backpressure strategy but is the caller able to handle this?
On every call of the action you are creating a Stream consisting of one element, how is the backpressure here helping? It is applied on the stream processing and not on the offering to the queue
I am trying to read the messages pushed to a Kinesis stream with the help of the get_records() and get_shard_iterator() APIs.
My producer keeps pushing records as they are processed at its end, and the consumer runs as a cron job every 30 minutes. So, I tried storing the sequence number of the last message read in my database and using the AFTER_SEQUENCE_NUMBER shard iterator with that sequence number. However, this does not work on the second run, after new messages have been pushed (the first run successfully reads all messages in the stream).
I also tried using AT_TIMESTAMP with the message timestamp that the producer pushes to the stream as part of the message, storing that timestamp for later use. Again, the first run processes all messages, and from the second run onwards I get empty records.
I am really not sure where I am going wrong. I would appreciate it if someone could help me with this.
The code below uses the timestamp, but the same thing is done for the sequence number method too.
import datetime
import json

import boto3

# SETTINGS, mongo_coll, logger and process_message are defined elsewhere in the application.
def listen_to_kinesis_stream():
    kinesis_client = boto3.client('kinesis', region_name=SETTINGS['region_name'])
    stream_response = kinesis_client.describe_stream(StreamName=SETTINGS['kinesis_stream'])
    for shard_info in stream_response['StreamDescription']['Shards']:
        # Resume from the timestamp stored for this shard, or from the epoch on the first run.
        kinesis_stream_status = mongo_coll.find_one({'_id': "DOC_ID"})
        last_read_ts = kinesis_stream_status.get('state', {}).get(
            shard_info['ShardId'], datetime.datetime(1970, 1, 1).strftime("%Y-%m-%dT%H:%M:%S.%f"))
        shard_iterator = kinesis_client.get_shard_iterator(
            StreamName=SETTINGS['kinesis_stream'],
            ShardId=shard_info['ShardId'],
            ShardIteratorType='AT_TIMESTAMP',
            Timestamp=last_read_ts)
        get_response = kinesis_client.get_records(ShardIterator=shard_iterator['ShardIterator'], Limit=1)
        if len(get_response['Records']) == 0:
            continue
        message = json.loads(get_response['Records'][0]['Data'])
        process_resp = process_message(message)
        if process_resp['success'] is False:
            print(process_resp)
        # Remember the timestamp of the last message read from this shard.
        mongo_coll.update({'_id': "DOC_ID"}, {'$set': {'state.{0}'.format(shard_info['ShardId']): message['ts']}})
        print("Processed {0}".format(message))
        while 'NextShardIterator' in get_response:
            get_response = kinesis_client.get_records(ShardIterator=get_response['NextShardIterator'], Limit=1)
            if len(get_response['Records']) == 0:
                break
            message = json.loads(get_response['Records'][0]['Data'])
            process_resp = process_message(message)
            if process_resp['success'] is False:
                print(process_resp)
            mongo_coll.update({'_id': "DOC_ID"}, {'$set': {'state.{0}'.format(shard_info['ShardId']): message['ts']}})
            print("Processed {0}".format(message))
    logger.debug("Processed all messages from Kinesis stream")
    print("Processed all messages from Kinesis stream")
As per my discussion with an AWS technical support person, a few get_records() responses can come back with empty records even though more data follows, so it is not a good idea to break when len(get_response['Records']) == 0.
The better approach suggested was to keep a counter for the maximum number of messages to read in a run, and to exit the loop after reading that many messages.
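A loop along those lines might look like the sketch below. It is an illustration, not the code AWS support provided: the function name `read_shard` and the parameters `max_records` and `max_polls` are assumptions, and `max_polls` is an extra bound added here so the loop always terminates on a quiet shard. Only the `get_records` call with its `ShardIterator` and `Limit` arguments, and the `Records` and `NextShardIterator` response keys, come from the Kinesis API.

```python
def read_shard(client, shard_iterator, max_records=100, max_polls=50):
    """Collect up to max_records records, tolerating empty get_records responses.

    client: a boto3 Kinesis client (or any object with a compatible get_records).
    max_polls: assumed safety bound on the number of calls made in one run.
    """
    records, polls = [], 0
    while shard_iterator and len(records) < max_records and polls < max_polls:
        response = client.get_records(
            ShardIterator=shard_iterator,
            Limit=max_records - len(records))
        # An empty list here does not mean the shard is exhausted, so keep following it.
        records.extend(response['Records'])
        shard_iterator = response.get('NextShardIterator')
        polls += 1
    return records
```

A real consumer would likely also want a short pause between calls, and would store the sequence number of the last record returned so that the next run can resume with AFTER_SEQUENCE_NUMBER.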
I have a Flink stream and I am calculating a few things over a time window of, say, 30 seconds.
What happens is that the result it gives me also aggregates the previous windows.
Say for the first 30 seconds I get the result 10.
For the next thirty seconds I want a fresh result; instead I get the last window's result plus the new one,
and so on.
So my question is: how do I get a fresh result for each window?
You need to use a purging trigger. What you want is FIRE_AND_PURGE (emit and remove the window content); what the default Flink trigger does is FIRE (emit and keep the window content).
input
.keyBy(...)
.timeWindow(Time.seconds(30))
// The important part: Replace the default non-purging ProcessingTimeTrigger
.trigger(PurgingTrigger.of(ProcessingTimeTrigger.create()))
.reduce(...)
For a more in-depth explanation, have a look at the documentation on Triggers and on FIRE vs FIRE_AND_PURGE:
A Trigger determines when a window (as formed by the window assigner) is ready to be processed by the window function. Each WindowAssigner comes with a default Trigger. If the default trigger does not fit your needs, you can specify a custom trigger using trigger(...).
When a trigger fires, it can either FIRE or FIRE_AND_PURGE. While FIRE keeps the contents of the window, FIRE_AND_PURGE removes its content. By default, the pre-implemented triggers simply FIRE without purging the window state.
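The difference between the two results can be shown with a toy model. This is a plain Python sketch of the semantics only, not Flink code: the function name `run_trigger` and the idea of passing one batch of values per firing are assumptions for illustration.

```python
def run_trigger(batches, purge):
    """Emit the sum of the window contents each time the trigger fires.

    batches: one list of values per firing (the elements that arrived since the last one).
    purge:   True models FIRE_AND_PURGE, False models FIRE.
    """
    contents, emitted = [], []
    for batch in batches:
        contents.extend(batch)
        emitted.append(sum(contents))   # the trigger fires: evaluate the window
        if purge:
            contents = []               # FIRE_AND_PURGE: drop the window contents
    return emitted
```

With batches `[[4, 6], [1, 2]]`, the non-purging variant emits 10 and then 13 (the previous result plus the new elements), while the purging variant emits 10 and then 3.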
The functionality you describe can be found in Tumbling Windows: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/windows.html#tumbling-windows
A bit more detail and/or code would help :)
I'm a little late to this question, but I ran into the same issue as the OP. What I found out later was that it was a bug in my own code. My mistake might be a useful reference for your problem.
// Old code (modified to be an example):
val tenSecondGrouping: DataStream[MyCustomGrouping] = userIdsStream
.keyBy(_.somePartitionedKey)
.window(TumblingProcessingTimeWindows.of(Time.of(10, TimeUnit.SECONDS)))
.trigger(ProcessingTimeTrigger.create())
.aggregate(new MyCustomAggregateFunc(new MyCustomGrouping()))
The bug was at new MyCustomGrouping(): I unintentionally created a single MyCustomGrouping object and reused it inside MyCustomAggregateFunc. As more tumbling windows were created, the later aggregation results kept growing. The fix was to create a new MyCustomGrouping each time MyCustomAggregateFunc is triggered. So:
// New code, problem solved
...
.aggregate(new MyCustomAggregateFunc(() => new MyCustomGrouping()))
// passing in a func to create new object per trigger
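The same mistake is easy to reproduce in a few lines. The Python sketch below is an analogy, not the Flink code above: the classes `Grouping` and `Aggregate` are hypothetical stand-ins for MyCustomGrouping and MyCustomAggregateFunc. One aggregate is handed a factory that always returns the same object, the other a factory that builds a fresh accumulator per window.

```python
class Grouping:
    """Mutable accumulator, standing in for MyCustomGrouping."""
    def __init__(self):
        self.ids = []


class Aggregate:
    """Aggregates one window at a time using an accumulator from a factory."""
    def __init__(self, make_accumulator):
        self.make_accumulator = make_accumulator

    def run(self, window):
        acc = self.make_accumulator()   # should be a fresh object for every window
        for item in window:
            acc.ids.append(item)
        return list(acc.ids)


shared = Grouping()
buggy = Aggregate(lambda: shared)   # every window mutates the same object
fixed = Aggregate(Grouping)         # a new Grouping is built per window
```

Running two windows `[1, 2]` and `[3]` through `buggy` returns `[1, 2]` and then `[1, 2, 3]`, whereas `fixed` returns `[1, 2]` and then `[3]`.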