How to suppress warning in Spark? - scala

Spark has detected a possibly incorrect operation and emitted the following warning:
UnsupportedOperationChecker: Detected pattern of possible 'correctness' issue due to global watermark. The query contains stateful operation which can emit rows older than the current watermark plus allowed late record delay, which are "late rows" in downstream stateful operations and these rows can be discarded. Please refer the programming guide doc for more details.;
Since I am pretty sure the join operation is correct, how can I suppress the warning?
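One approach, offered as a hedged sketch rather than an official recommendation: since the message is written through Spark's logging, the logger of the class that emits it can be raised above WARN without silencing anything else. The logger name below is inferred from the "UnsupportedOperationChecker:" prefix in the output, and the programmatic call assumes a Spark build exposing the log4j 1.x API:

import org.apache.log4j.{Level, Logger}

// Raise only this checker's threshold so its WARN messages are no longer printed.
// Equivalent log4j.properties line:
//   log4j.logger.org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker=ERROR
Logger
  .getLogger("org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker")
  .setLevel(Level.ERROR)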


Clarify "the order of execution for the subtractor and adder is not defined"

The Streams DSL documentation includes a caveat about using the aggregate method to transform a KGroupedTable → KTable, as follows (emphasis mine):
When subsequent non-null values are received for a key (e.g., UPDATE), then (1) the subtractor is called with the old value as stored in the table and (2) the adder is called with the new value of the input record that was just received. The order of execution for the subtractor and adder is not defined.
My interpretation of that last line is that one of three things can happen:
subtractor can be called before adder
adder can be called before subtractor
adder and subtractor could be called at the same time
Here is the question I'm looking to get answered:
Are all 3 scenarios above actually possible when using the aggregate method on a KGroupedTable?
Or am I misinterpreting the documentation? For my use-case (detailed below), it would be ideal if the subtractor were always called before the adder.
Why is this question important?
If the adder and subtractor are non-commutative operations and the order in which they are executed can vary, you can end up with different results depending on the order of execution. An example of a useful non-commutative operation is aggregating records into a Set:
.aggregate[Set[Animal]](Set.empty)(
  adder = (zooKey, animalValue, setOfAnimals) => setOfAnimals + animalValue,
  subtractor = (zooKey, animalValue, setOfAnimals) => setOfAnimals - animalValue
)
In this example, for duplicated events, if the adder is called before the subtractor you would end up removing the value entirely from the set (which would be problematic for most use-cases I imagine).
Why am I doubting the documentation (assuming my interpretation of it is correct)?
Seems like an unusual design choice
When I've run unit tests (using TopologyTestDriver and EmbeddedKafka), I always see the subtractor called before the adder. Unfortunately, if there is some kind of race condition involved, it's entirely possible that I would never hit the other scenarios.
I did try looking into the kafka-streams codebase as well. The KTableProcessorSupplier that calls the user-supplied adder/subtractor functions appears to be this one: https://github.com/apache/kafka/blob/18547633697a29b690a8fb0c24e2f0289ecf8eeb/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableAggregate.java#L81 and on line 92 you can even see a comment saying "first try to remove the old value". Seems like this would answer my question definitively, right? Unfortunately, in my own testing, what I saw was that the process function itself is called twice: first with a Change<V> value that includes only the old value, and then again with a Change<V> value that includes only the new value. I haven't been able to dig deep enough to find the internal code that generates the old-value record and the new-value record (upon receiving an update) to determine whether it actually produces those records in that order.
The order is hard-coded (i.e., no race condition), but there is no guarantee that the order won't change in future releases without notice (i.e., it's not a public contract, and no KIP is needed to change it). I guess there would be a Jira about it... But as a matter of fact, it does not really matter (details below).
For the three scenarios you mentioned, the third one cannot happen though: aggregators are executed in a single thread (per shard), and thus either the adder or the subtractor is called first.
first with a Change value that includes only the old value and then the process function is called again with a Change value that includes only the new value.
In general, both records might be processed by different threads, and thus it's not possible to send only one record. It's just that the TopologyTestDriver simulates a single-threaded execution, so both records always end up in the same processor.
Cf. TopologyTestDriver sending incorrect message on KTable aggregations
However, the order actually only matters if both records really end up in the same processor (if the grouping key did not change during the upstream update).
Furthermore, the order actually depends not on the downstream aggregate implementation, but on the order of writes into the repartition topic of the groupBy(), and with multiple parallel upstream processors those writes are interleaved anyway. Thus, in general, you should think of the "add" and "subtract" parts as independent entities and not make any assumption about their order (also, even if the key did not change, both records might be interleaved with other records...).
The only guarantee provided (given that you configured the producer correctly to avoid re-ordering during send()) is that if the grouping key does not change, the sends of the old and new value will not be re-ordered relative to each other. The order of the sends is hard-coded in the upstream processor though:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableRepartitionMap.java#L93-L99
Thus, the order of the downstream aggregate processor is actually meaningless.
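Given that guidance, one way to sidestep the ordering question entirely is to aggregate into a structure where adding and subtracting the same value commute, for example a count map instead of a plain Set. A sketch in the same style as the example above (serde/Materialized details omitted; Animal and the zoo naming are the question's placeholders):

.aggregate[Map[Animal, Int]](Map.empty)(
  // Increment the count for the new value.
  adder = (zooKey, animal, counts) =>
    counts.updated(animal, counts.getOrElse(animal, 0) + 1),
  // Decrement the count for the old value, dropping entries that reach zero.
  subtractor = (zooKey, animal, counts) =>
    if (counts.getOrElse(animal, 0) <= 1) counts - animal
    else counts.updated(animal, counts(animal) - 1)
)
// The "current set" is counts.keySet; for a duplicated event, add-then-subtract
// and subtract-then-add both leave the count at 1, so the order no longer matters.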

Is there a maximum number of conditions allowed with Firestore rules?

I have a set of conditions to check if an update is allowed for documents in a given Firestore collection.
Rules are evaluated properly. My issue is that my set of conditions is becoming quite large (more than 100 conditions). Today, I added a new condition and when I tried to perform an update, I got the classic "Missing or insufficient permissions.".
At first, I thought my condition was wrong or that the data used for the update did not match the condition. However, the data does satisfy the condition.
What I noticed is that whatever condition I add, the update always fails with a permission issue. This led me to think that I have too many conditions. As a quick test, I kept my new condition but removed another one, and then the update passed. This seems to confirm what I thought.
The Firestore documentation mentions quota limits. However, my conditions are not using exists(), get(), or getAfter().
Is there a limit on the number of conditions for a given operation rule?
How to check for the "Maximum number of expressions in a ruleset" or "Maximum size of a ruleset"?
While it's not documented, I'm told the limit on the number of conditions is 1000, so that's probably not a limit you're running into here.

How to avoid arithmetic errors in PostgreSQL?

I have a PostgreSQL-powered web app that does some non-essential, simple calculations (getting values from outside sources, multiplication, and division) for reporting purposes. Today, a multiplication that exceeded the value domain of a numeric(10, 4) field raised an error and led to an application crash. It would be much better if the relevant field had just been set to null and a notice generated. The way the bug worked was that a wrong value in one field caused several views to become unavailable, and while a missing value in that place would have been sad but no big problem, the blocked view is still essential for the app to work.
Now I'm aware that, in this particular case, setting that field to numeric(11, 4) would have prevented the bailout, but that is, of course, only postponing the issue at hand. Since the error happened in a function call, I could also have written an exception handler; lastly, one could check either the multiplicands or the result for sane values (but that is in itself a little strange, as I would either have to guess based on magnitudes or else do the multiplication in another numeric type that can probably handle a value whose magnitude is in principle not known to me with certainty, because of the external sources).
Exception handling is probably what this will boil down to, which, however, entails that all numeric calculations will have to be done via PL/pgSQL function calls, and will have to be implemented in many different places. None of the options seems particularly maintainable or elegant. So the question is: Can I somehow configure PostgreSQL to ignore some or all arithmetic errors and use default values in such cases? If so, can that be done per database or will I have to configure the server? If this is impossible or a Bad Idea, what are the best practices to avoid arithmetic errors?
Clarification: This is not a question about how to rewrite numeric(10, 4) so that the field can hold values of 1e6 and above, and also not so much about error handling in the application that uses the DB. It's more about whether there is an operator, a function call, a general configuration, or a general pattern that is most commonly recommended to deal with situations where a (non-essential) computation normally results in a number (or, in fact, another value type) except with some inputs that cause exceptions, in which case the result could perfectly well and safely be discarded. Think of Excel printing #### when a cell is too narrow for the digits to be displayed, or JavaScript giving you NaN in place of arithmetic errors. Returning null instead of raising an exception may be a bad idea in general programming but legitimate in a specific case like this.
Observe that the PostgreSQL error codes do include, e.g., invalid_argument_for_logarithm, invalid_argument_for_ntile_function, and division_by_zero, all grouped together under Class 22 — Data Exception, and that PostgreSQL does allow exception handling in function bodies, so I can also ask more specifically: how can I catch all class 22 exceptions short of listing all the error codes? But even then, I still hope for a more principled approach.
Arguably, the type numeric (without type modifiers) would be the right thing for you if you want to avoid overflows (which is what you seem to mean by "arithmetic error") as much as possible.
However, there is still the possibility that a value overflows the numeric format.
There is no way to configure PostgreSQL so that it ignores a numeric overflow.
If the result of an operation cannot be represented in a data type, there should be an error. If the data supplied by the application can lead to an error, the application should be ready to handle such an error rather than “crash”. Failure to do so is an application bug.
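If the handling ends up in the application, one application-side variant (a sketch in Scala over JDBC, purely for illustration; the helper name, connection, and query are placeholders) is to treat any SQLSTATE in class 22 as a missing value rather than enumerating individual error codes:

import java.sql.{Connection, SQLException}

// Runs a single-value numeric query and returns None on any class 22
// "data exception" (e.g. numeric_value_out_of_range) instead of failing.
def safeNumeric(conn: Connection, sql: String): Option[BigDecimal] =
  try {
    val stmt = conn.createStatement()
    try {
      val rs = stmt.executeQuery(sql)
      if (rs.next()) Option(rs.getBigDecimal(1)).map(BigDecimal(_)) else None
    } finally stmt.close()
  } catch {
    // SQLSTATE is five characters; the first two are the error class,
    // so checking for "22" covers all data exceptions without listing each code.
    case e: SQLException if Option(e.getSQLState).exists(_.startsWith("22")) => None
  }

// Example (placeholder query): safeNumeric(conn, "SELECT big_a * big_b FROM report_inputs WHERE id = 1")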

Drools Engine Inferred Expiration

I'm working on a project which needs to process a large number of events per second. The project uses Drools 6.5 running in stream mode. The data is fed to the engine as "event" objects.
Due to the large number of events that need to be processed, the automatic memory management provided by Drools simplifies the development process significantly. However, Drools has somewhat vague documentation in this area. I need to count the number of events matching certain conditions in the past T seconds and fire a rule if the number surpasses a threshold. I currently use sliding windows to achieve this. The problem is that Drools either discards events before T seconds have passed since their insertion (when an #expires value is used) or does not discard them at all (when the #expires tag is removed), thus either making inference impossible or causing a heap memory overflow in the long run.
Is there a better approach to the problem? Can anyone clarify how inferred expiration works? Am I doing something wrong?
Any help would be greatly appreciated.
After a few hours of exploring the documentation for Drools 6.5, I finally found out what was happening. I will leave the information here to help anyone else who might have the same problem.
Important
An explicit expiration policy for a given event type overrides any inferred expiration offset for that same type.
As the documentation says (section 9.8.1), an explicit #expires tag overrides any inferred expiration offset, so in order to let the engine handle the events' life cycle, do not use this tag.
7.5.1. Passive Mode
With Passive mode not only is the user responsible for working memory operations, such as insert(), but also for when the rules are to evaluate the data and fire the resulting rule instantiations - using fireAllRules()
Apparently, in order to use the inferred expiration feature, one cannot use the passive execution mode. Running kSession.fireUntilHalt() runs the engine in active mode and enables the use of inferred expiration offsets.
TL;DR:
1. Remove any #expires tag
2. Run the engine using fireUntilHalt() in a dedicated thread.
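For illustration, a minimal sketch of that setup using the KIE API from Scala (the session name and the thread handling here are assumptions; the session is expected to be configured for stream mode in kmodule.xml):

import org.kie.api.KieServices

val kieServices = KieServices.Factory.get()
val kContainer  = kieServices.getKieClasspathContainer
val kSession    = kContainer.newKieSession("my-stream-session") // placeholder session name

// Run the engine in active mode on a dedicated thread so inferred expiration applies.
val engineThread = new Thread(() => kSession.fireUntilHalt(), "drools-engine")
engineThread.start()

// Insert events from other threads; the engine evaluates rules and expires
// events automatically based on the inferred expiration offsets.
// kSession.insert(someEvent)

// On shutdown:
// kSession.halt(); engineThread.join(); kSession.dispose()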

NEventStore CommonDomain: Why does EventStoreRepository.GetById open a new stream when an aggregate doesn't exist?

Please explain the reasoning behind making EventStoreRepository.GetById create a new stream when an aggregate doesn't exist. This behavior appears to give GetById a dual responsibility, which in my case could produce undesirable results. For example, my aggregates always implement a factory method for their creation so that their first event results in the setting of the aggregate's Id. If a command is handled prior to the existence of the aggregate, it may succeed even though the aggregate doesn't have its Id set or other state initialized (a crash-and-burn with a null reference exception is equally likely).
I would much rather have GetById throw an exception if an aggregate doesn't exist to prevent premature creation and command handling.
That was a dumb design decision on my part. I went back and forth on this one. To be honest, I'm still back and forth on it.
The issue is that exceptions should really be used to indicate exceptional or unexpected scenarios. The problem is that a stream not being found can be a common operation and even an expected operation in some regards. I tinkered with the idea of throwing an exception, returning null, or returning an empty stream.
The way to determine if the aggregate is empty is to check the Revision property to see if it's zero.
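For illustration only, a sketch of that check written in Scala (the Repository and Aggregate types below are hypothetical stand-ins for the CommonDomain abstractions; only the Revision-equals-zero test comes from the answer above):

import java.util.UUID

// Hypothetical stand-ins, not the actual CommonDomain interfaces.
trait Aggregate { def id: UUID; def revision: Int }
trait Repository { def getById[A <: Aggregate](id: UUID): A }

final case class AggregateNotFoundException(id: UUID)
  extends RuntimeException(s"Aggregate $id does not exist")

// Wrapper that turns a freshly opened, empty stream (revision == 0) into an
// explicit failure, so a command is never handled against an uninitialized aggregate.
def getExistingById[A <: Aggregate](repo: Repository, id: UUID): A = {
  val aggregate = repo.getById[A](id)
  if (aggregate.revision == 0) throw AggregateNotFoundException(id)
  aggregate
}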