With Firestore Increment, what happens if you're using it in a Cloud Function and the Cloud Function is accidentally invoked twice?
To make sure that your function behaves correctly on retried execution attempts, you should make it idempotent by implementing it so that an event results in the desired results (and side effects) even if it is delivered multiple times.
E.g. the function is trying to increment a document field by 1
document("post/Post_ID_1").
updateData(["likes" : FieldValue.increment(1)])
So while Increment may be atomic it's not idempotent? If we want to make our counters idempotent we still need to use a transaction and keep track of who was the last person to like the post?
It will increment once for each invocation of the function. If that's not acceptable, you will need to write some code to figure out if any subsequent invocations are valid for your case.
There are many strategies to implement this, and it's up to you to choose one that suits your needs. The usual strategy is to use the event ID in the context object passed to your function to determine if that event has been successfully processed in the past. Maybe this involves storing that record in another document, in Redis, or somewhere that persists long enough for duplicates to be prevented (an hour should be OK).
Related
The Streams DSL documentation includes a caveat about using the aggregate method to transform a KGroupedTable → KTable, as follows (emphasis mine):
When subsequent non-null values are received for a key (e.g., UPDATE), then (1) the subtractor is called with the old value as stored in the table and (2) the adder is called with the new value of the input record that was just received. The order of execution for the subtractor and adder is not defined.
My interpretation of that last line implies that one of three things can happen:
subtractor can be called before adder
adder can be called before subtractor
adder and subtractor could be called at the same time
Here is the question I'm looking to get answered:
Are all 3 scenarios above actually possible when using the aggregate method on a KGroupedTable?
Or am I misinterpreting the documentation? For my use-case (detailed below), it would be ideal if the subtractor was always be called before the adder.
Why is this question important?
If the adder and subtractor are non-commutative operations and the order in which they are executed can vary, you can end up with different results depending on the order of execution of adder and subtractor. An example of a useful non-commutative operation would be something like if we’re aggregating records into a Set:
.aggregate[Set[Animal]](Set.empty)(
adder = (zooKey, animalValue, setOfAnimals) => setOfAnimals + animalValue,
subtractor = (zooKey, animalValue, setOfAnimals) => setOfAnimals - animalValue
)
In this example, for duplicated events, if the adder is called before the subtractor you would end up removing the value entirely from the set (which would be problematic for most use-cases I imagine).
Why am I doubting the documentation (assuming my interpretation of it is correct)?
Seems like an unusual design choice
When I've run unit tests (using TopologyTestDriver and
EmbeddedKafka), I always see the subtractor is called before the
adder. Unfortunately, if there is some kind of race condition
involved, it's entirely possible that I would never hit the other
scenarios.
I did try looking into the kafka-streams codebase as well. The KTableProcessorSupplier that calls the user-supplied adder/subtracter functions appears to be this one: https://github.com/apache/kafka/blob/18547633697a29b690a8fb0c24e2f0289ecf8eeb/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableAggregate.java#L81 and on line 92, you can even see a comment saying "first try to remove the old value". Seems like this would answer my question definitively right? Unfortunately, in my own testing, what I saw was that the process function itself is called twice; first with a Change<V> value that includes only the old value and then the process function is called again with a Change<V> value that includes only the new value. Unfortunately, I haven't been able to dig deep enough to find the internal code that is generating the old value record and the new value record (upon receiving an update) to determine if it actually produces those records in that order.
The order is hard-coded (ie, no race condition), but there is no guarantee that the order won't change in future releases without notice (ie, it's not a public contract and no KIP is needed to change it). I guess there would be a Jira about it... But as a matter of fact, it does not really matter (detail below).
For the three scenarios you mentioned, the 3rd one cannot happen though: Aggregators are execute in a single thread (per shard) and thus either the adder or subtractor is called first.
first with a Change value that includes only
the old value and then the process function is called again with a Change
value that includes only the new value.
In general, both records might be processed by different threads and thus it's not possible to send only one record. It's just that the TTD simulates a single threaded execution thus both records always end up in the same processor.
Cf TopologyTestDriver sending incorrect message on KTable aggregations
However, the order actually only matters if both records really end up in the same processor (if the grouping key did not change during the upstream update).
Furthermore, the order actually depends not on the downstream aggregate implementation, but on the order of writes into the repartitions topic of the groupBy() and with multiple parallel upstream processor, those writes are interleaved anyway. Thus, in general, you should think of the "add" and "subtract" part as independent entities and not make any assumption about their order (also, even if the key did not change, both records might be interleaved by other records...)
The only guarantee provided is (given that you configured the producer correctly to avoid re-ordering during send()), that if the grouping key does not change, the send of the old and new value will not be re-ordered relative to each other. The order of the send is hard-coded in the upstream processor though:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableRepartitionMap.java#L93-L99
Thus, the order of the downstream aggregate processor is actually meaningless.
This is the context:
There is an input event stream,
There are some methods to apply on
the stream, which applies different logic to evaluates each event,
saying it is a "good" or "bad" event.
An event can be a real "good" one only if it passes all the methods, otherwise it is a "bad" event.
There is an output event stream who has result of event and its eventID.
To solve this problem, I have two ideas:
We can apply each method sequentially to each event. But this is a kind of batch processing, and doesn't apply the advantages of stream processing, in the same time, it takes Time(M(ethod)1) + Time(M2) + Time(M3) + ....., which maybe not suitable to real-time processing.
We can pass the input stream to each method, and then we can run each method in parallel, each method saves the bad event into a permanent storage, then the Main method could query the permanent storage to get the result of each event. But this has some problems to solve:
how to execute methods in parallel in the programming language(e.g. Scala), how about the performance(network, CPUs, memory)
how to solve the synchronization problem? It's sure that those methods need sometime to calculate and save flag into the permanent storage, but the Main just need less time to query the flag, which a delay issue occurs.
etc.
This is not a kind of tech and design question, I would like to ask your guys' ideas, if you have some new ideas or ideas to solve the problem ? Looking forward to your opinions.
Parallel streams, each doing the full set of evaluations sequentially, is the more straightforward solution. But if that introduces too much latency, then you can fan out the evaluations to be done in parallel, and then bring the results back together again to make a decision.
To do the fan-out, look at the split operation on DataStream, or use side outputs. But before doing this n-way fan-out, make sure that each event has a unique ID. If necessary, add a field containing a random number to each event to use as the unique ID. Later we will use this unique ID as a key to gather back together all of the partial results for each event.
Once the event stream is split, each copy of the stream can use a MapFunction to compute one of evaluation methods.
Gathering all of these separate evaluations of a given event back together is a bit more complex. One reasonable approach here is to union all of the result streams together, and then key the unioned stream by the unique ID described above. This will bring together all of the individual results for each event. Then you can use a RichFlatMapFunction (using Flink's keyed, managed state) to gather the results for the separate evaluations in one place. Once the full set of evaluations for a given event has arrived at this stateful flatmap operator, it can compute and emit the final result.
The definition of idempotent in wikipedia is:Idempotence is the property of certain operations in mathematics and computer science, that can be applied multiple times without changing the result beyond the initial application.
The problem is: I have REST API PUT call, which updates properties of domain aggregate. Also, it fires event for each property, which was updated. Now if we have two exact the same PUT calls one after the other to our backend:
First PUT call updates the properties of aggregate and fires lets say 5 events.
Second PUT call updates the properties of aggregate, but do not fire any event, because the properties of aggregate did not change (first PUT call updated the values of aggregate properties).
The question is: is this operation idempotent?
This question and its answers explain what an idempotent operation is. In short: repeated calls do not change the outcome.
So from your description of this operation, it seems it qualifies as idempotent.
Yes and No. It is idempotent from the data point of view. The data in the database won't change no matter how many times you execute the call. But it is not idempotent in the sense that some logging or other events might occur that would change the "Entropy" of the system :)
Yes, it is: no matter how many times you send the same PUT request, it leaves your system (your aggregate) in the same state.
It depends on your definition. Since you have side-effects (fiering off some events if there is a difference), multiple calls might cause more side-effects than desired. However, the state of the application, ignoring the side-effects, with no concurrency, will be idempotent. Remember that REST calls are made over an asynchronous network, so this is a distributed system.
If you have two concurrent processes, they may fire a different number of side-effects. For example:
a = 2
a = 3
a = 2
a = 3
will fire twice as many events as
a = 2
a = 2
a = 3
a = 3
which might cause some trouble.
Consider a function: IsWalletValid(walletID). It returns true if the walletID exists in the database, and updates a 'last_accessed_time' field.
A task runs periodically to remove any wallets that have not been accessed for a set period of time.
Seems like an easy solution for what we want to do, but IsWalletValid() has a side effect because it writes to the database.
Should we add an additional 'UpdateLastAccessedTime(walletID)' function? Everytime we call IsWalletValid() we will also need to remember to call UpdateLastAccessedTime(walletID).
Do verifying that a wallet is valid and updating it's last_accessed_time field need to be transactionally consistent (ACID)? You could use eventual consistency here:
The method IsWalletValid publishes an WalletAccessed event, then an event handler updates last_accessed_time asynchronously.
if last_accessed_time is not accessed by domain logic to make decisions on any write handling this could just be a facet of the read only projection. Seems like this is the same concern as other more verbose read audit concerns. Just because data is being written and maintained doesn't mean that it necessarily needs to be part of the write model of the system. If you did however want to implement this as part of the domain and perhaps stored within the same event store it could be considered a separate auditing context outside of the boundary of the original aggregate being audited.
Observe
I was trying to figure it out how cursor.observe runs inside meteor, but found nothing about it.
Docs says
Establishes a live query that notifies callbacks on any change to the query result.
I would like to understand better what live query means.
Where will be my observer function executed? By Meteor or by mongo?
Multiple runs
When we have more than just a user subscribing an observer, one instance runs for each client, leading us to a performance and race condition issue.
How can I implement my observe to it be like a singleton? Just one instance running for all.
Edit: There was a third question here, but now it is a separated question: How to avoid race conditions on cursor.observe?
Server side, as of right now, observe works as follows:
Construct the set of documents that match the query.
Regularly poll the database with query and take a diff of the changes, emitting the relevant events to the callbacks.
When matching data is changed/inserted into mongo by meteor itself, emit the relevant events, short circuiting step #2 above.
There are plans (possibly in the next release) to automatically ensure that calls to subscribe that have the same arguments are shared. So basically taking care of the singleton part for you automatically.
Certainly you could achieve something like this yourself, but I believe it's a high priority for the meteor team, so it's probably not worth the effort at this point.