Clarify "the order of execution for the subtractor and adder is not defined" - apache-kafka

The Streams DSL documentation includes a caveat about using the aggregate method to transform a KGroupedTable → KTable, as follows (emphasis mine):
When subsequent non-null values are received for a key (e.g., UPDATE), then (1) the subtractor is called with the old value as stored in the table and (2) the adder is called with the new value of the input record that was just received. The order of execution for the subtractor and adder is not defined.
My interpretation of that last line implies that one of three things can happen:
subtractor can be called before adder
adder can be called before subtractor
adder and subtractor could be called at the same time
Here is the question I'm looking to get answered:
Are all 3 scenarios above actually possible when using the aggregate method on a KGroupedTable?
Or am I misinterpreting the documentation? For my use-case (detailed below), it would be ideal if the subtractor was always be called before the adder.
Why is this question important?
If the adder and subtractor are non-commutative operations and the order in which they are executed can vary, you can end up with different results depending on the order of execution of adder and subtractor. An example of a useful non-commutative operation would be something like if we’re aggregating records into a Set:
.aggregate[Set[Animal]](Set.empty)(
adder = (zooKey, animalValue, setOfAnimals) => setOfAnimals + animalValue,
subtractor = (zooKey, animalValue, setOfAnimals) => setOfAnimals - animalValue
)
In this example, for duplicated events, if the adder is called before the subtractor you would end up removing the value entirely from the set (which would be problematic for most use-cases I imagine).
Why am I doubting the documentation (assuming my interpretation of it is correct)?
Seems like an unusual design choice
When I've run unit tests (using TopologyTestDriver and
EmbeddedKafka), I always see the subtractor is called before the
adder. Unfortunately, if there is some kind of race condition
involved, it's entirely possible that I would never hit the other
scenarios.
I did try looking into the kafka-streams codebase as well. The KTableProcessorSupplier that calls the user-supplied adder/subtracter functions appears to be this one: https://github.com/apache/kafka/blob/18547633697a29b690a8fb0c24e2f0289ecf8eeb/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableAggregate.java#L81 and on line 92, you can even see a comment saying "first try to remove the old value". Seems like this would answer my question definitively right? Unfortunately, in my own testing, what I saw was that the process function itself is called twice; first with a Change<V> value that includes only the old value and then the process function is called again with a Change<V> value that includes only the new value. Unfortunately, I haven't been able to dig deep enough to find the internal code that is generating the old value record and the new value record (upon receiving an update) to determine if it actually produces those records in that order.

The order is hard-coded (ie, no race condition), but there is no guarantee that the order won't change in future releases without notice (ie, it's not a public contract and no KIP is needed to change it). I guess there would be a Jira about it... But as a matter of fact, it does not really matter (detail below).
For the three scenarios you mentioned, the 3rd one cannot happen though: Aggregators are execute in a single thread (per shard) and thus either the adder or subtractor is called first.
first with a Change value that includes only
the old value and then the process function is called again with a Change
value that includes only the new value.
In general, both records might be processed by different threads and thus it's not possible to send only one record. It's just that the TTD simulates a single threaded execution thus both records always end up in the same processor.
Cf TopologyTestDriver sending incorrect message on KTable aggregations
However, the order actually only matters if both records really end up in the same processor (if the grouping key did not change during the upstream update).
Furthermore, the order actually depends not on the downstream aggregate implementation, but on the order of writes into the repartitions topic of the groupBy() and with multiple parallel upstream processor, those writes are interleaved anyway. Thus, in general, you should think of the "add" and "subtract" part as independent entities and not make any assumption about their order (also, even if the key did not change, both records might be interleaved by other records...)
The only guarantee provided is (given that you configured the producer correctly to avoid re-ordering during send()), that if the grouping key does not change, the send of the old and new value will not be re-ordered relative to each other. The order of the send is hard-coded in the upstream processor though:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableRepartitionMap.java#L93-L99
Thus, the order of the downstream aggregate processor is actually meaningless.

Related

Firestore Increment - Cloud Function Invoked Twice

With Firestore Increment, what happens if you're using it in a Cloud Function and the Cloud Function is accidentally invoked twice?
To make sure that your function behaves correctly on retried execution attempts, you should make it idempotent by implementing it so that an event results in the desired results (and side effects) even if it is delivered multiple times.
E.g. the function is trying to increment a document field by 1
document("post/Post_ID_1").
updateData(["likes" : FieldValue.increment(1)])
So while Increment may be atomic it's not idempotent? If we want to make our counters idempotent we still need to use a transaction and keep track of who was the last person to like the post?
It will increment once for each invocation of the function. If that's not acceptable, you will need to write some code to figure out if any subsequent invocations are valid for your case.
There are many strategies to implement this, and it's up to you to choose one that suits your needs. The usual strategy is to use the event ID in the context object passed to your function to determine if that event has been successfully processed in the past. Maybe this involves storing that record in another document, in Redis, or somewhere that persists long enough for duplicates to be prevented (an hour should be OK).

Parallel design of program working with Flink and scala

This is the context:
There is an input event stream,
There are some methods to apply on
the stream, which applies different logic to evaluates each event,
saying it is a "good" or "bad" event.
An event can be a real "good" one only if it passes all the methods, otherwise it is a "bad" event.
There is an output event stream who has result of event and its eventID.
To solve this problem, I have two ideas:
We can apply each method sequentially to each event. But this is a kind of batch processing, and doesn't apply the advantages of stream processing, in the same time, it takes Time(M(ethod)1) + Time(M2) + Time(M3) + ....., which maybe not suitable to real-time processing.
We can pass the input stream to each method, and then we can run each method in parallel, each method saves the bad event into a permanent storage, then the Main method could query the permanent storage to get the result of each event. But this has some problems to solve:
how to execute methods in parallel in the programming language(e.g. Scala), how about the performance(network, CPUs, memory)
how to solve the synchronization problem? It's sure that those methods need sometime to calculate and save flag into the permanent storage, but the Main just need less time to query the flag, which a delay issue occurs.
etc.
This is not a kind of tech and design question, I would like to ask your guys' ideas, if you have some new ideas or ideas to solve the problem ? Looking forward to your opinions.
Parallel streams, each doing the full set of evaluations sequentially, is the more straightforward solution. But if that introduces too much latency, then you can fan out the evaluations to be done in parallel, and then bring the results back together again to make a decision.
To do the fan-out, look at the split operation on DataStream, or use side outputs. But before doing this n-way fan-out, make sure that each event has a unique ID. If necessary, add a field containing a random number to each event to use as the unique ID. Later we will use this unique ID as a key to gather back together all of the partial results for each event.
Once the event stream is split, each copy of the stream can use a MapFunction to compute one of evaluation methods.
Gathering all of these separate evaluations of a given event back together is a bit more complex. One reasonable approach here is to union all of the result streams together, and then key the unioned stream by the unique ID described above. This will bring together all of the individual results for each event. Then you can use a RichFlatMapFunction (using Flink's keyed, managed state) to gather the results for the separate evaluations in one place. Once the full set of evaluations for a given event has arrived at this stateful flatmap operator, it can compute and emit the final result.

Converting Rx-Observables to Twitter Futures in Scala

I want to implement the following functions in the most re-active way. I need these for implementing the bijections for automatic conversion between the said types.
def convertScalaRXObservableToTwitterFuture[A](a: Observable[A]): TwitterFuture[A] = ???
def convertScalaRXObservableToTwitterFutureList[A](a: Observable[A]): TwitterFuture[List[A]] = ???
I came across this article on a related subject but I can't get it working.
Unfortunately the claim in that article is not correct and there can't be a true bijection between Observable and anything like Future. The thing is that Observable is more powerful abstraction that can represent things that can't be represented by Future. For example, Observable might actually represent an infinite sequence. For example see Observable.interval. Obviously there is no way to represent something like this with a Future. The Observable.toList call used in that article explicitly mentions that:
Returns a Single that emits a single item, a list composed of all the items emitted by the finite source ObservableSource.
and later it says:
Sources that are infinite and never complete will never emit anything through this operator and an infinite source may lead to a fatal OutOfMemoryError.
Even if you limit yourself to only finite Observables, still Future can't fully express semantics of Observable. Consider Observable.intervalRange that generates a limited range one by one over some time period. With Observable the first event comes after initialDelay and then you get event each period. With Future you can get only one event and it must be only when the sequence is fully generated so Observable is completed. It means that by transforming Observable[A] into Future[List[A]] you immediately break the main benefit of Observable - reactivity: you can't process events one by one, you have to process them all in a single bunch.
To sum up the claim at the first paragraph of the article:
convert between the two, without loosing asynchronous and event-driven nature of them.
is false because conversion Observable[A] -> Future[List[A]] exactly looses the "event-driven nature" of Observable and there is no way to work this around.
P.S. Actually the fact that Future is less powerful than Observable should not be a big surprise. If it was not, why anybody would create Observable in the first place?

When can you call operation idempotent?

The definition of idempotent in wikipedia is:Idempotence is the property of certain operations in mathematics and computer science, that can be applied multiple times without changing the result beyond the initial application.
The problem is: I have REST API PUT call, which updates properties of domain aggregate. Also, it fires event for each property, which was updated. Now if we have two exact the same PUT calls one after the other to our backend:
First PUT call updates the properties of aggregate and fires lets say 5 events.
Second PUT call updates the properties of aggregate, but do not fire any event, because the properties of aggregate did not change (first PUT call updated the values of aggregate properties).
The question is: is this operation idempotent?
This question and its answers explain what an idempotent operation is. In short: repeated calls do not change the outcome.
So from your description of this operation, it seems it qualifies as idempotent.
Yes and No. It is idempotent from the data point of view. The data in the database won't change no matter how many times you execute the call. But it is not idempotent in the sense that some logging or other events might occur that would change the "Entropy" of the system :)
Yes, it is: no matter how many times you send the same PUT request, it leaves your system (your aggregate) in the same state.
It depends on your definition. Since you have side-effects (fiering off some events if there is a difference), multiple calls might cause more side-effects than desired. However, the state of the application, ignoring the side-effects, with no concurrency, will be idempotent. Remember that REST calls are made over an asynchronous network, so this is a distributed system.
If you have two concurrent processes, they may fire a different number of side-effects. For example:
a = 2
a = 3
a = 2
a = 3
will fire twice as many events as
a = 2
a = 2
a = 3
a = 3
which might cause some trouble.

Find First and First Difference in Progress 4GL

I'm not clear about below queries and curious to know what is the different between them even though both retrieves same results. (Database used sports2000).
FOR EACH Customer WHERE State = "NH",
FIRST Order OF Customer:
DISPLAY Customer.Cust-Num NAME Order-Num Order-Date.
END.
FOR EACH Customer WHERE State = "NH":
FIND FIRST Order OF Customer NO-ERROR.
IF AVAILABLE Order THEN
DISPLAY Customer.Cust-Num NAME Order-Num Order-Date.
END.
Please explain me
Regards
Suga
As AquaAlex says your first snippet is a join (the "," part of the syntax makes it a join) and has all of the pros and cons he mentions. There is, however, a significant additional "con" -- the join is being made with FIRST and FOR ... FIRST should never be used.
FOR LAST - Query, giving wrong result
It will eventually bite you in the butt.
FIND FIRST is not much better.
The fundamental problem with both statements is that they imply that there is an order which your desired record is the FIRST instance of. But no part of the statement specifies that order. So in the event that there is more than one record that satisfies the query you have no idea which record you will actually get. That might be ok if the only reason that you are doing this is to probe to see if there is one or more records and you have no intention of actually using the record buffer. But if that is the case then CAN-FIND() would be a better statement to be using.
There is a myth that FIND FIRST is supposedly faster. If you believe this, or know someone who does, I urge you to test it. It is not true. It is true that in the case where FIND returns a large set of records adding FIRST is faster -- but that is not apples to apples. That is throwing away the bushel after randomly grabbing an apple. And if you code like that your apple now has magical properties which will lead to impossible to cure bugs.
OF is also problematic. OF implies a WHERE clause based on the compiler guessing that fields with the same name in both tables and which are part of a unique index can be used to join the tables. That may seem reasonable, and perhaps it is, but it obscures the code and makes the maintenance programmer's job much more difficult. It makes a good demo but should never be used in real life.
Your first statement is a join statement, which means less network traffic. And you will only receive records where both the customer and order record exist so do not need to do any further checks. (MORE EFFICIENT)
The second statement will retrieve each customer and then for each customer found it will do a find on order. Because there may not be an order you need to do an additional statement (If Available) as well. This is a less efficient way to retrieve the records and will result in much more unwanted network traffic and more statements being executed.