Apache Storm aggregation rules for missing expected events in rolling time-period - real-time

My use-case is to identify entities from which expected events have not been received after X amount of time, in real-time rather than using batch jobs. For example:
If we have received a PaymentInitiated event at time T but didn't receive any of PaymentFailed / PaymentAborted / PaymentSucceeded by T+X, then raise a trigger saying PaymentStuck along with the details of the PaymentInitiated event.
How can I model such use-cases in Apache Storm, given that the time period X rolls with each event rather than being a fixed interval?
Thanks,
Harish

For Storm, you would need to put all your logic into your UDF code using the low-level Java API (I doubt that Trident is helpful). I never worked with Samza and cannot provide any help for it (or judge which system would be the better fit for your problem).
In Storm, for example, you could assign a timestamp to each tuple in Spout.nextTuple(), and buffer all tuples of an incomplete payment within a Bolt in ascending order of the timestamp. Each time Bolt.execute() is called, you can compare the timestamp of the new tuple with the head (ie, oldest tuple) of your queue. If the input tuple has a larger timestamp than the head tuple's timestamp plus X, you know that your head tuple has timed out and you can raise your trigger for it.
Of course, you need to do fieldsGrouping() to ensure that all tuples belonging to the same payment are processed by the same Bolt instance. You might also need to somehow order the incoming bolt tuples by timestamp, or use more advanced time-out logic, to deal with out-of-order tuples (with regard to increasing timestamps).
Depending on your latency requirements and input stream rate, you might also use "tick tuples" to trigger the comparison of the head tuple with this dummy tick tuple. Or, as an even stricter implementation, do all this logic directly in Spout.nextTuple() (if you know that all tuples of a payment go through the same Spout instance).
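To make this concrete, here is a minimal sketch of such a Bolt, assuming Storm 2.x package names and tuples carrying "paymentId", "eventType" and "timestamp" fields; the class name, the timeout constant and the output fields are illustrative, not taken from the question:

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class PaymentTimeoutBolt extends BaseRichBolt {
    private static final long TIMEOUT_MS = 60_000L; // the "X" from the question
    // insertion order == arrival order, so head entries are the oldest payments
    private final Map<String, Long> pendingPayments = new LinkedHashMap<>();
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        long ts = tuple.getLongByField("timestamp");
        String paymentId = tuple.getStringByField("paymentId");
        String eventType = tuple.getStringByField("eventType");

        if ("PaymentInitiated".equals(eventType)) {
            pendingPayments.put(paymentId, ts);   // start waiting for a terminal event
        } else {
            pendingPayments.remove(paymentId);    // Failed/Aborted/Succeeded arrived in time
        }

        // The new tuple's timestamp drives the rolling window: every pending
        // payment initiated before ts - X has timed out.
        Iterator<Map.Entry<String, Long>> it = pendingPayments.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> head = it.next();
            if (ts > head.getValue() + TIMEOUT_MS) {
                collector.emit(new Values("PaymentStuck", head.getKey(), head.getValue()));
                it.remove();
            } else {
                break; // later entries arrived later, so they cannot have timed out yet
            }
        }
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("trigger", "paymentId", "initiatedAt"));
    }
}

Note the sketch assumes per-payment tuples arrive roughly in timestamp order; for out-of-order input you would need the more advanced time-out logic mentioned above.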

Related

How to replay in a deterministic way in CQRS / event-sourcing?

In CQRS / ES based systems, you store events in an event-store. These events refer to an aggregate, and they have an order with respect to the aggregate they belong to. Furthermore, aggregates are consistency / transactional boundaries, which means that any transactional guarantees are only given on a per-aggregate level.
Now, suppose I have a read model which consumes events from multiple aggregates (which is perfectly fine, AFAIK). To be able to replay the read model in a deterministic way, the events need some kind of global ordering, across aggregates – otherwise you wouldn't know whether to replay events for aggregate A before or after the ones for B, or how to intermix them.
The simplest solution to achieve this is by using a timestamp on the events, but typically timestamps are not fine-grained enough (or, to put it another way, not all databases are created equal). Another option is to use a global sequence, but this is bad performance-wise and hinders scaling.
How do you solve this issue? Or is my basic assumption, that replays of read models should be deterministic, wrong?
I see these options:
Global sequence:
If your database allows it, you can use timestamp + aggregateId + aggregateVersion as an index. This usually doesn't work well in the distributed-database case.
In a distributed database you can use vector clocks to get a global sequence without a lock.
Event sequence inside each read model. You can literally store all events in the read model and sort them as you want before applying the projection function.
Allow non-determinism and deal with it. For instance, in your example, if there is no group when the add_user event arrives - just create an empty group record in the read model and add the user. And when the create_group event arrives - update that group record. A sketch of this follows below.
After all, you have checked in the UI and/or command handler that there is a group with this aggregateId, right?
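The third option boils down to writing every projection handler as an upsert. A hedged Java sketch of that idea, where the in-memory map and the GroupRecord type are stand-ins for your real read-model store:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class GroupProjection {
    // stand-in for the real read-model store
    private final Map<String, GroupRecord> store = new HashMap<>();

    void onAddUser(String groupId, String userId) {
        // if add_user arrives before create_group, create an empty placeholder
        store.computeIfAbsent(groupId, GroupRecord::new).userIds.add(userId);
    }

    void onCreateGroup(String groupId, String name) {
        // fill in the placeholder if it already exists, otherwise create it
        store.computeIfAbsent(groupId, GroupRecord::new).name = name;
    }
}

class GroupRecord {
    final String id;
    String name; // stays null until create_group arrives
    final Set<String> userIds = new HashSet<>();
    GroupRecord(String id) { this.id = id; }
}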
How do you solve this issue?
It's a known issue, and of course neither simple timestamps, nor a global sequence, nor other naïve methods will help.
Use a vector clock with a weak timestamp to enumerate your events, and a vector cursor to read them. That guarantees a stable, deterministic order for intermixing events between aggregates. This works even if each thread has a clock-synchronization gap, which is the regular case for database clusters, because perfect timestamp synchronization is impossible.
This also automatically makes it possible to seamlessly mix reading events from the event store and from the event bus later, and it excludes any database locks across events of different aggregates.
Algorithm draft:
1) Determine the real number of simultaneous transactions in your database, e.g. the maximum number of workers in the cluster.
Since every event is written in only one transaction on one thread, you can determine its unique id as the tuple (thread number, thread counter), where the thread counter is the number of transactions processed on the current thread.
Calculate the event's weak timestamp as MAX(thread timestamp, aggregate timestamp), where the aggregate timestamp is the timestamp of the last event for the current aggregate.
2) Prepare a vector cursor for reading events along the thread-number boundary. Read events from each thread sequentially until the timestamp gap exceeds the allowed value. The allowed weak-timestamp gap is a trade-off between event-reading performance and preserving the native event order.
The minimal value is the cluster threads' synchronization time delta, so events arrive in their native aggregate intermix order. The maximum value is infinity, so events will be split by aggregate. When using an RDBMS like PostgreSQL, that value can be determined automatically via a smart SQL query.
You can see a reference implementation for the PostgreSQL database for saving events and loading events. Saving performance is about 10000 events per second on a 4 GB RAM RDS PostgreSQL cluster.
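To make step 1 concrete, here is a hedged Java sketch of assigning event ids and weak timestamps. All names are made up, and the +1 increment is an added assumption so that the weak timestamp strictly increases along both the thread's and the aggregate's history:

import java.util.HashMap;
import java.util.Map;

class EventIdAssigner {
    private final int threadNumber;     // fixed per writer thread
    private long threadCounter = 0;     // transactions processed on this thread
    private long threadTimestamp = 0;   // last weak timestamp issued by this thread
    private final Map<String, Long> aggregateTimestamps = new HashMap<>();

    EventIdAssigner(int threadNumber) { this.threadNumber = threadNumber; }

    StampedEvent assign(String aggregateId, Object payload) {
        threadCounter++; // unique id = (threadNumber, threadCounter)
        long aggregateTs = aggregateTimestamps.getOrDefault(aggregateId, 0L);
        // weak timestamp = MAX(thread timestamp, aggregate timestamp), bumped
        // by one (assumption) so it grows along both histories
        long weakTs = Math.max(threadTimestamp, aggregateTs) + 1;
        threadTimestamp = weakTs;
        aggregateTimestamps.put(aggregateId, weakTs);
        return new StampedEvent(threadNumber, threadCounter, weakTs, aggregateId, payload);
    }
}

record StampedEvent(int threadNumber, long threadCounter, long weakTimestamp,
                    String aggregateId, Object payload) {}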

Kafka Stream groupBy behavior: many intermediate outputs/updates for an aggregation

I'm trying to play with Kafka Streams to aggregate some attributes of People.
I have a Kafka Streams test like this:
val factory = new ConsumerRecordFactory[Array[Byte], Character](
  "input", new ByteArraySerializer(), new CharacterSerializer())
var i = 0
while (i != 5) {
  testDriver.pipeInput(
    factory.create("input", Character(123, 12), 15 * 10000L))
  i += 1
}
val output = testDriver.readOutput....
I'm trying to group the values by key like this:
streamBuilder.stream[Array[Byte], Character](inputKafkaTopic)
  .filter((key, _) => key == null)
  .mapValues(character => PersonInfos(character.id, character.id2, character.age)) // case class
  .groupBy((_, value) => CharacterInfos(value.id, value.id2)) // case class
  .count().toStream.print(Printed.toSysOut[CharacterInfos, Long])
When I run the code, I get this:
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 1
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 2
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 3
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 4
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 5
Why am I getting 5 rows instead of just one line with CharacterInfos and the count?
Doesn't groupBy just change the key?
If you use the TopologyTestDriver, caching is effectively disabled and thus every input record will always produce an output record. This is by design, because caching implies non-deterministic behavior, which makes it very hard to write an actual unit test.
If you deploy the code in a real application, the behavior will be different and caching will reduce the output load -- which intermediate results you will get is not defined (ie, non-deterministic); compare Michael Noll's answer.
For your unit test, it should actually not matter much, and you can either test for all output records (ie, all intermediate results), or put all output records into a key-value Map and only test for the last emitted record per key (if you don't care about the intermediate results).
Furthermore, you could use the suppress() operator to get fine-grained control over what output messages you get. suppress() -- in contrast to caching -- is fully deterministic and thus writing a unit test works well. However, note that suppress() is event-time driven, and thus, if you stop sending new records, time does not advance and suppress() does not emit data. For unit testing, this is important to consider, because you might need to send some additional "dummy" data to trigger the output you actually want to test for. For more details on suppress() check out this blog post: https://www.confluent.io/blog/kafka-streams-take-on-watermarks-and-triggers
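For illustration, a hedged Java sketch of suppress() on a non-windowed count; the topic names, serdes and the 10-second limit are made up, and Kafka Streams 2.1+ is assumed:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Suppressed;

public class SuppressedCountExample {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input", Consumed.with(Serdes.String(), Serdes.String()))
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            .count()
            // emit at most one update per key per 10 seconds of event time;
            // deterministic, unlike cache-based coalescing
            .suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(10),
                                                Suppressed.BufferConfig.unbounded()))
            .toStream()
            .to("output", Produced.with(Serdes.String(), Serdes.Long()));
        return builder.build();
    }
}

In a TopologyTestDriver test you would then pipe one extra "dummy" record with a later timestamp, so that event time advances past the limit and suppress() actually emits.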
Update: I didn't spot the line in the example code that refers to the TopologyTestDriver in Kafka Streams. My answer below is for the 'normal' KStreams application behavior, whereas the TopologyTestDriver behaves differently. See the answer by Matthias J. Sax for the latter.
This is expected behavior. Somewhat simplified, Kafka Streams by default emits a new output record as soon as a new input record is received.
When you are aggregating (here: counting) the input data, then the aggregation result will be updated (and thus a new output record produced) as soon as new input was received for the aggregation.
input record 1 ---> new output record with count=1
input record 2 ---> new output record with count=2
...
input record 5 ---> new output record with count=5
What to do about it: You can reduce the number of 'intermediate' outputs by configuring the size of the so-called record caches as well as the commit.interval.ms parameter. See Memory Management. However, how much reduction you will see depends not only on these settings but also on the characteristics of your input data, and because of that the extent of the reduction may also vary over time (think: could be 90% in the first hour of data, 76% in the second hour of data, etc.). That is, the reduction process is deterministic, but the resulting reduction amount is difficult to predict from the outside.
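For example, a hedged configuration fragment for these two knobs (the values are illustrative; the properties go into the KafkaStreams constructor):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// let the record caches coalesce consecutive updates per key (10 MB total)
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
// flush caches and commit at most every 30 seconds
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30_000L);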
Note: When doing windowed aggregations (like windowed counts) you can also use the suppress() API so that the number of intermediate updates is not only reduced, but there will only ever be a single output per window. However, in your use case/code the aggregation is not windowed, so you cannot use the suppress API.
To help you understand why the setup is this way: you must keep in mind that a streaming system generally operates on unbounded streams of data, which means the system doesn't know 'when it has received all the input data'. So even the term 'intermediate outputs' is actually misleading: at the time the second input record is received, for example, the system believes that the result of the (non-windowed) aggregation is '2' -- that is the correct result to the best of its knowledge at this point in time. It cannot predict whether (or when) another input record might arrive.
For windowed aggregations (where suppress is supported) this is a bit easier, because the window size defines a boundary for the input data of a given window. Here, the suppress() API allows you to make a trade-off between better latency with multiple outputs per window (default behavior, suppress disabled) and longer latency with only a single output per window (suppress enabled). In the latter case, if you have 1h windows, you will not see any output for a given window until 1h later, so to speak. For some use cases this is acceptable, for others it is not.
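A hedged sketch of the windowed variant with a single final result per window; the topic name, the 1-hour window and the 5-minute grace period are made up (imports as in the earlier suppress() sketch, plus org.apache.kafka.streams.kstream.TimeWindows):

StreamsBuilder builder = new StreamsBuilder();
builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
    .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
    .windowedBy(TimeWindows.of(Duration.ofHours(1)).grace(Duration.ofMinutes(5)))
    .count()
    // nothing is emitted for a window until the window plus its grace period closes
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream()
    .foreach((windowedKey, count) -> System.out.println(windowedKey + " -> " + count));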

How do you ensure that events are applied in order to read model?

This is easy for projections that subscribe to all events from the stream: you just keep the version of the last event applied to your read model. But what do you do when the projection is a composite of multiple streams? Do you keep the version of each stream that is partaking in the projection? But then what about gaps, if you are not subscribing to all events? At most you can assert that the version is greater than the last one. How do others deal with this? Do you respond to every event and bump up the version(s)?
For EventStore, I would suggest using the $all stream as the default stream for any read-model subscription.
I have used the category stream that essentially produces a snapshot of a given entity type, but I stopped doing so since read-models serve a different purpose.
It might not be desirable to use the $all stream, as it might also contain events which aren't domain events. Integration events could be an example. In this case, adding some attributes either to the event contracts or to the metadata might help to create an internal (JS) projection that builds a special all-stream for domain events, or for any event category in that regard, which you can subscribe to. You can also use a negative condition, for example, filtering out all system events and those whose original stream name starts with Integration.
As well as processing messages in the correct order, you also have the problem of resuming a projection after it is restarted - how do you ensure you start from the right place when you restart?
The simplest option is to use an event store or message broker that both guarantees order and provides some kind of global stream position field (such as a global event number or an ordered timestamp with a disambiguating component such as MongoDB's Timestamp type). Event stores where you pull the events directly from the store (such as eventstore.org or homegrown ones built on a database) tend to guarantee this. Also, some message brokers like Apache Kafka guarantee ordering (again, this is pull-based). You want at-least-once ordered delivery, ideally.
This approach limits write scalability (reads scale fine, using read replicas). You can shard your streams across multiple event store instances in various ways, but then you have to track the position on a per-shard basis, which adds some complexity.
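As an illustration of the position-tracking idea, a hedged Java sketch where EventStoreClient, ReadModelDb, Event and their methods are hypothetical interfaces, not any particular product's API:

// resume a projection from a checkpointed global position
void runProjection(EventStoreClient store, ReadModelDb db) {
    long position = db.loadCheckpoint("my-projection"); // 0 on the first run
    for (Event event : store.readAllFrom(position)) {
        db.inTransaction(tx -> {
            tx.apply(event); // update the read model
            // advance the checkpoint in the SAME transaction, so a crash can
            // never leave the read model and the position out of sync
            tx.saveCheckpoint("my-projection", event.globalPosition());
        });
    }
}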
If you don't have these ordering, delivery and position guarantees, your life is much harder, and it may be hard to make the system completely reliable. You can:
Hold onto messages for a while after receiving them, before processing them, to allow other ones to arrive
Have code to detect missing or out-of-order messages. As you mention, this only works if you receive all events with a global sequence number or if you track all stream version numbers, and even then it isn't reliable in all cases.
For each individual stream, you keep things in order by fetching them from a data store that knows the correct order. A way of thinking about this is that you query the data store, and you get a Document Message back.
It may help to review Greg Young's Polyglot Data talk.
As for synchronization of events in multiple streams: a thing that you need to recognize is that events in different streams are inherently concurrent.
You can get some loose coordination between different streams if you have happens-before data encoded into your messages. "Event B happened in response to Event A, therefore A happened-before B". That gets you a partial ordering.
If you really do need a total ordering of everything everywhere, then you'll need to be looking into patterns like Lamport Clocks.
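For reference, a minimal Java sketch of the textbook Lamport clock (not tied to any particular event store):

public final class LamportClock {
    private long time = 0;

    // local event or send: bump the clock and use the result as the stamp
    public synchronized long tick() {
        return ++time;
    }

    // receive: merge with the sender's stamp so happens-before is preserved
    public synchronized long receive(long senderTime) {
        time = Math.max(time, senderTime) + 1;
        return time;
    }
}

Sorting events by (lamportTime, nodeId) then yields a total order that is consistent with the happens-before partial order.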

Kafka streams reduce after groupby to stream sends partial reduce output on commit [duplicate]

This question already has answers here:
How to send final kafka-streams aggregation result of a time windowed KTable?
We're having an issue where, upon doing a groupBy --> reduce --> toStream, partial reduce values are being sent downstream when a commit happens during the reduce. So if there are 65 keys to be reduced, and say a commit happens halfway through, the output will be two messages: one partially reduced, the other with all the values reduced.
So here is our case in more detail:
msg --> leftJoin
leftJoin --> flatMap //break msg into parts so we can join again downstream
flatMap --> leftJoin
leftJoin --> groupByKey
groupByKey --> reduce
reduce --> toStream
toStream --> to
Currently, we've come up with a very ugly fix for this, which has to do with adding an index and an 'out of' count to each message created during the flatMap phase... we filter out any message emitted by the reduce where index != 'out of'. My feeling is we're not doing something right here or are looking at it the wrong way. Please advise on the correct way of doing this.
Thanks.
So if there are 65 keys to be reduced, and say a commit happens halfway through, the output will be two messages: one partially reduced, the other with all the values reduced.
If I understand your description correctly, this is actually intended behavior. For one, it's a tradeoff between processing latency (where you want to see update records as soon as you have a new piece of input data) vs. coalescing multiple update records into fewer or even just a single update record.
The default behavior of Kafka Streams is to favor lower processing latency. That is, it will not wait for "all input data to have arrived" before sending downstream updates. Rather, it will send updates once new data has arrived. Some background information is described at https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/.
Today, you have two main knobs to change/tune this default behavior, which is controlled by (1) Kafka Streams record caches (for the DSL) and (2) the configured commit interval (you already mentioned this).
Moving forward, the Kafka community has also been working on a new feature that will allow you to specify that you just want a single, final update record to be sent (rather than what you described as "partial" updates). This new feature, in case you are interested, is described in the Kafka Improvement Proposal KIP-328: Ability to suppress updates for KTables. This is actively being worked on, but it is unlikely to be finished in time for the upcoming Kafka v2.1 release in October.
Currently, we've come up with a very ugly fix for this, which has to do with adding an index and an 'out of' count to each message created during the flatMap phase... we filter out any message emitted by the reduce where index != 'out of'. My feeling is we're not doing something right here or are looking at it the wrong way. Please advise on the correct way of doing this.
In short, in stream processing you should embrace the nature of how streaming works. In general, you will only have partial/incomplete knowledge of the world, so to speak, or rather: you only know what you have observed thus far. So, at any given point in time, you must deal with the situation that additional data may arrive that you still have to process.
A typical situation is having to deal with late-arriving data, where your application logic must decide whether you still want to integrate and process this data (quite likely) or discard it (sometimes the way it needs to be).
Going back to your example:
So if there are 65 keys to be reduced [...]
How would one know it's 65, and not 100 or 28, and so on? One can only tell that: "Thus far, at this point in time, I have received 65. So, what do I do? Do I reduce those 65 because I believe that's all the input? Or do I wait some seconds/minutes/hours longer because there might be 35 more to arrive, but this will mean that I will not send an update/answer downstream until this waiting time has elapsed (which results in higher processing latency)?"
In your situation, I would ask: Why do you consider the streaming behavior of how/when updates are being sent a problem? Perhaps it's because you have a downstream system or application that doesn't know how to handle such streaming updates?
Does that make any sense? Again, the above is based on my understanding of what you described as being the issue.

How to resequence after filtering for aggregation /Spring Integration/

I'm doing a project in Spring Integration and I have a big problem.
There are some filtering components in the flow and later in the flow I have an aggregation element.
The problem is that the filtering component does not support the "apply-sequence" property. It filters out some records without modifying the original sequence numbers; however, the number of messages is reduced.
Later in the flow I need an aggregation, which fails to release elements since some messages are filtered out.
I don't want to use any special routing elements which have apply-sequence property.
Can you suggest me any common solution for this type of filtering problem?
Thanks,
I'd say you misunderstand the behaviour of the filter and aggregator.
I guess you have some apply-sequence-aware component upstream. So, all messages in that group get several headers: correlationId - to group messages in the default aggregator; sequenceNumber - the index of the message; sequenceSize - the number of messages in the group.
The filter just checks messages against some condition and sends them to the output-channel or applies the discard logic. It doesn't modify messages. However, even if we could do that, it wouldn't sound like a good idea anyway.
Assume we have only two messages in the group. The first one passes the filter - we just send it to the aggregator. But the second is discarded, and, yes, it won't be sent to the aggregator. So the group is never released, because the sequenceSize isn't reached.
To overcome your requirement you need some custom ReleaseStrategy on the aggregator (by default it is SequenceSizeReleaseStrategy). For example, check some state in your system that all messages in the group have been handled, independently of the true or false result after the filter. Or send some fake message for the same reason and check its availability in the group (see the sketch below).
In this case you will just need to take care of the correlationId to group messages in the aggregator.
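For example, a hedged sketch of a custom ReleaseStrategy that releases the group once such a fake marker message has arrived; the endOfGroup header name is made up:

import org.springframework.integration.aggregator.ReleaseStrategy;
import org.springframework.integration.store.MessageGroup;

public class MarkerAwareReleaseStrategy implements ReleaseStrategy {

    @Override
    public boolean canRelease(MessageGroup group) {
        // release once the marker is in the group, regardless of how many
        // real messages survived the upstream filter
        return group.getMessages().stream()
                .anyMatch(m -> Boolean.TRUE.equals(m.getHeaders().get("endOfGroup")));
    }
}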
UPDATE
What is the suggested release strategy for such a scenario? Would it be a good strategy to use a timeout as the release strategy?
What I can say is that sometimes it is really difficult to find a good solution for some integration scenarios. Messaging is stateless by nature, so correlating and grouping an undetermined number of messages may be a problem.
You need to look at the requirements and the environment.
For example, when all your messages are processed in a single thread, you can safely send some fake marker message at the end directly to the aggregator and check for it from the ReleaseStrategy. And that will work even when all the messages from the group have been discarded.
If you process those messages in parallel, or they are received from different threads, you really won't be able to determine the order of messages or the time each one takes to process.
In this case the TimeoutCountSequenceSizeReleaseStrategy really can help. Of course, you will need to find a good timeframe compromise according to the requirements of your system.
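A hedged wiring sketch (Java config; the threshold and timeout values are illustrative):

import org.springframework.context.annotation.Bean;
import org.springframework.integration.aggregator.ReleaseStrategy;
import org.springframework.integration.aggregator.TimeoutCountSequenceSizeReleaseStrategy;

@Bean
public ReleaseStrategy releaseStrategy() {
    // release when the sequence completes, the group reaches 100 messages,
    // or the group has been waiting longer than 30 seconds
    return new TimeoutCountSequenceSizeReleaseStrategy(100, 30_000L);
}

Keep in mind that a ReleaseStrategy is only consulted when a new message arrives for the group; for groups that may never see another message, you would also configure a group-timeout on the aggregator or a MessageGroupStoreReaper, so that stale groups are eventually released or discarded.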