Kafka Streams WindowStore Fetch Record Ordering - apache-kafka

The Kafka Streams 2.2.0 documentation for the WindowStore and ReadOnlyWindowStore method fetch(K key, Instant from, Instant to) states:
For each key, the iterator guarantees ordering of windows, starting from the oldest/earliest available window to the newest/latest window.
None of the other fetch methods state this (except the deprecated fetch(K key, long from, long to)), but do they offer the same guarantee?
Additionally, is there any guarantee on the ordering of records within a given window? Or is that up to the underlying (hashing, I assume) collection implementation and its handling of possible hash collisions?
I should also note that we built the WindowStore with retainDuplicates() set to true. So a single key would have multiple entries within a window. Unless we're using it wrong; which I guess would be a different question...

The other methods don't have ordering guarantees, because the order depends on the byte order of the serialized keys. It's hard for Kafka Streams to reason about this ordering, because the serializers are provided by the user.
I should also note that we built the WindowStore with retainDuplicates() set to true. So a single key would have multiple entries within a window. Unless we're using it wrong; which I guess would be a different question...
You are using it wrong :) -- you can store different keys for the same window by default. If you enable retainDuplicates() you can store the same key multiple times for the same window.
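For reference, a minimal sketch (in Scala against the Java state-store API; the store name, serdes, retention, and window size are placeholders, not taken from the question) of building such a store with retainDuplicates enabled and doing a ranged fetch for one key:
import java.time.{Duration, Instant}
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.state.Stores

val storeBuilder = Stores.windowStoreBuilder(
  Stores.persistentWindowStore(
    "events-by-key",        // placeholder store name
    Duration.ofHours(1),    // retention period
    Duration.ofMinutes(5),  // window size
    true                    // retainDuplicates: the same key may be stored several times per window
  ),
  Serdes.String(),
  Serdes.Long()
)

// inside an attached Processor/Transformer, with store: WindowStore[String, java.lang.Long]:
// store.put("sensor-1", 42L)  // with retainDuplicates, repeated puts for one key all survive
// store.put("sensor-1", 43L)
// val iter = store.fetch("sensor-1", Instant.now().minus(Duration.ofMinutes(10)), Instant.now())
// // windows come back oldest to newest for this key; the order of duplicates within a
// // single window is not part of the documented contract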

Related

How to "join" a frequently updating stream with an irregularly updating stream in Apache Beam?

I have a stream of measurements keyed by an ID PCollection<KV<ID,Measurement>> and something like a changelog stream of additional information for that ID PCollection<KV<ID,SomeIDInfo>>. New data is added to the measurement stream quite regularly, say once per second for every ID. The stream with additional information, on the other hand, is only updated when a user performs manual re-configuration. We can't tell how often this happens and, in particular, the update frequency may vary among IDs.
My goal is now to enrich each entry in the measurements stream by the additional information for its ID. That is, the output should be something like PCollection<KV<ID,Pair<Measurement,SomeIDInfo>>>. Or, in other words, I would like to do a left join of the measurements stream with the additional information stream.
I would expect this to be a quite common use case. Coming from Kafka Streams, this can be implemented quite easily with a KStream-KTable join. With Beam, however, none of my approaches so far seem to work. I have already considered the following ideas.
Idea 1: CoGroupByKey with fixed time windows
Applying a window to the measurements stream would not be an issue. However, as the additional information stream is updating irregularly and also significantly less frequently than the measurements stream, there is no reasonable common window size such that there is at least one updated information for each ID.
Idea 2: CoGroupByKey with a global window and a non-default trigger
Refining the previous idea, I thought about using a processing-time trigger, which fires e.g. every 5 seconds. The issue with this idea is that I need to use accumulatingFiredPanes() for the additional information, as there might be no new data for a key between two firings, but I have to use discardingFiredPanes() for the measurements stream, as otherwise my panes would quickly become too large. This simply does not work: when I configure my pipeline that way, the additional information stream also discards changes. Setting both triggers to accumulating works, but, as I said, that is not scalable.
Idea 3: Side inputs
Another idea would be to use side inputs, but this solution is also not really scalable - at least unless I'm missing something. With side inputs, I would create a PCollectionView from the additional information stream, which is a map of IDs to the (latest) additional information. The "join" can then be done in a DoFn with a side input of that view. However, the view seems to be shared by all instances that consume the side input. (It's a bit hard to find any information regarding this.) We would like not to make any assumptions about the number of IDs or the size of the additional info. Thus, using a side input does not seem to work here either.
The side input option you discuss is currently the best option, although you are correct about the scalability concern due to the side input being broadcast to all workers.
Alternatively, you can store the infrequently-updated side in an external key-value store and just do lookups from a DoFn. If you go this route, it's generally useful to do a GroupByKey first on the main input with ID as a key, which lets you cache the lookups with a good cache-hit ratio.
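Not Beam's one official pattern, just a rough sketch of that lookup idea in Scala against the Java SDK: InfoStoreClient is a made-up stand-in for whatever external key-value store holds the per-ID info, and the String/Double types are placeholders for ID, Measurement, and SomeIDInfo. Grouping by ID first means a single lookup can serve a whole batch of measurements.
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.beam.sdk.transforms.{DoFn, GroupByKey, ParDo}
import org.apache.beam.sdk.values.KV

// hypothetical client for whatever external key-value store holds the per-ID info
object InfoStoreClient {
  def lookup(id: String): String = ??? // e.g. a Redis/Bigtable/JDBC read
}

class EnrichByLookupFn
    extends DoFn[KV[String, java.lang.Iterable[java.lang.Double]], KV[String, String]] {

  @ProcessElement
  def processElement(c: ProcessContext): Unit = {
    val id = c.element().getKey
    val info = InfoStoreClient.lookup(id) // one lookup per key group instead of per record
    val measurements = c.element().getValue.iterator()
    while (measurements.hasNext) {
      c.output(KV.of(id, s"${measurements.next()}|$info")) // enriched record (illustrative encoding)
    }
  }
}

// usage, assuming measurements: PCollection[KV[String, java.lang.Double]]:
// measurements
//   .apply(GroupByKey.create[String, java.lang.Double]())
//   .apply(ParDo.of(new EnrichByLookupFn))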

Clarify "the order of execution for the subtractor and adder is not defined"

The Streams DSL documentation includes a caveat about using the aggregate method to transform a KGroupedTable → KTable, as follows (emphasis mine):
When subsequent non-null values are received for a key (e.g., UPDATE), then (1) the subtractor is called with the old value as stored in the table and (2) the adder is called with the new value of the input record that was just received. The order of execution for the subtractor and adder is not defined.
My interpretation of that last line implies that one of three things can happen:
subtractor can be called before adder
adder can be called before subtractor
adder and subtractor could be called at the same time
Here is the question I'm looking to get answered:
Are all 3 scenarios above actually possible when using the aggregate method on a KGroupedTable?
Or am I misinterpreting the documentation? For my use-case (detailed below), it would be ideal if the subtractor were always called before the adder.
Why is this question important?
If the adder and subtractor are non-commutative operations and the order in which they are executed can vary, you can end up with different results depending on which runs first. A useful non-commutative example would be aggregating records into a Set:
.aggregate[Set[Animal]](Set.empty)(
  adder = (zooKey, animalValue, setOfAnimals) => setOfAnimals + animalValue,
  subtractor = (zooKey, animalValue, setOfAnimals) => setOfAnimals - animalValue
)
In this example, for duplicated events, if the adder is called before the subtractor you would end up removing the value entirely from the set (which would be problematic for most use-cases I imagine).
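A quick trace of that duplicate-update case, reusing the hypothetical types from the snippet above:
// duplicate update for key zooKey: old value == new value == lion, current aggregate = Set(lion)
// subtractor first: Set(lion) - lion == Set()      then adder: Set() + lion == Set(lion)     // lion kept
// adder first:      Set(lion) + lion == Set(lion)  then subtractor: Set(lion) - lion == Set() // lion lost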
Why am I doubting the documentation (assuming my interpretation of it is correct)?
Seems like an unusual design choice
When I've run unit tests (using TopologyTestDriver and EmbeddedKafka), I always see the subtractor called before the adder. Unfortunately, if there is some kind of race condition involved, it's entirely possible that I would never hit the other scenarios.
I did try looking into the kafka-streams codebase as well. The KTableProcessorSupplier that calls the user-supplied adder/subtractor functions appears to be this one: https://github.com/apache/kafka/blob/18547633697a29b690a8fb0c24e2f0289ecf8eeb/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableAggregate.java#L81 and on line 92, you can even see a comment saying "first try to remove the old value". Seems like this would answer my question definitively, right? Unfortunately, in my own testing, what I saw was that the process function itself is called twice; first with a Change<V> value that includes only the old value, and then the process function is called again with a Change<V> value that includes only the new value. Unfortunately, I haven't been able to dig deep enough to find the internal code that generates the old-value record and the new-value record (upon receiving an update) to determine whether it actually produces those records in that order.
The order is hard-coded (i.e., no race condition), but there is no guarantee that the order won't change in future releases without notice (i.e., it's not a public contract and no KIP is needed to change it). I guess there would be a Jira about it... But as a matter of fact, it does not really matter (details below).
For the three scenarios you mentioned, the 3rd one cannot happen though: aggregators are executed in a single thread (per shard), and thus either the adder or the subtractor is called first.
first with a Change value that includes only the old value and then the process function is called again with a Change value that includes only the new value.
In general, both records might be processed by different threads and thus it's not possible to send only one record. It's just that the TopologyTestDriver simulates single-threaded execution, so both records always end up in the same processor.
Cf TopologyTestDriver sending incorrect message on KTable aggregations
However, the order actually only matters if both records really end up in the same processor (if the grouping key did not change during the upstream update).
Furthermore, the order actually depends not on the downstream aggregate implementation, but on the order of writes into the repartition topic of the groupBy(), and with multiple parallel upstream processors, those writes are interleaved anyway. Thus, in general, you should think of the "add" and "subtract" parts as independent entities and not make any assumption about their order (also, even if the key did not change, both records might be interleaved with other records...).
The only guarantee provided is (given that you configured the producer correctly to avoid re-ordering during send()) that if the grouping key does not change, the sends of the old and new value will not be re-ordered relative to each other. The order of the sends is hard-coded in the upstream processor though:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableRepartitionMap.java#L93-L99
Thus, the order of the downstream aggregate processor is actually meaningless.
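If you want the aggregation itself to be insensitive to that order, one option - sketched here against the Scala snippet from the question, reusing its hypothetical Animal/zooKey names - is to track per-animal counts (a multiset) instead of set membership, so the adder and subtractor commute:
.aggregate[Map[Animal, Int]](Map.empty)(
  adder = (zooKey, animal, counts) =>
    counts.updated(animal, counts.getOrElse(animal, 0) + 1),
  subtractor = (zooKey, animal, counts) =>
    counts.updated(animal, counts.getOrElse(animal, 0) - 1)
)
// when reading the result, counts.filter(_._2 > 0).keySet recovers the "set of animals";
// the order in which increments and decrements were applied no longer matters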

Kafka Message Keys with Composite Values

I am working on a system that will produce kafka messages. These messages will be organized into topics that more or less represent database tables. Many of these tables have composite keys and this aspect of the design is out of my control. The goal is to prepare these messages in a way that they can be easily consumed by common sink connectors, without a lot of manipulation.
I will be using the schema registry and avro format for all of the obvious advantages. Having the entire "row" expressed as a record in the message value is fine for upsert operations, but I also need to support deletes. From what I can tell, this means my message needs a key so I can have "tombstone" messages. Also keep in mind that I want to avoid any sort of transforms unless absolutely necessary.
In a perfect world, the message key would be a "record" that included strongly-typed key-column values and the message value would have the other column values (both controlled by the schema registry). However, it seems like a lot of the tooling around kafka expects message keys to be a single, primitive value. This makes me wonder if I need to compute a key value where I concatenate my multiple key columns into a single string value and keep the individual columns in my message value. Is this right or am I missing something? What other options do I have?
I'm assuming that you know the relationship between the message key and partition assignment.
As per my understanding, there is nothing that stops you from using a complex type like STRUCT as a key, with or without a key schema. Please refer to the API here. If you are using an out-of-the-box connector that does not support complex types as keys, then you may have to write your own Single Message Transformation (SMT) to move the key attributes into the value.
The approach you mentioned - concatenating columns to create the key while keeping the same columns in the message value - would work in many cases if you don't want to write code. The only downside I can see is that your messages would be larger than required. If you don't need a partition assignment strategy or an ordering requirement, then the message can have no key or a random key.
I wanted to follow-up with an answer that solved my issue:
The strategy I mentioned of using a concatenated string technically worked. However, it certainly wasn't very elegant.
My original issue in using a structured key was that I wasn't using the correct converter for deserializing the key, which led to other errors. Once I used the avro converter, I was able to get my multi-part key and use it effectively.
Both approaches, when implemented appropriately, allowed me to produce valid tombstone messages that could represent deletes.
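For what it's worth, a rough sketch of that structured-key-plus-tombstone shape with a plain producer (the topic, field names, and URLs are made up; it assumes Confluent's KafkaAvroSerializer and Schema Registry, which the question already relies on):
import java.util.Properties
import org.apache.avro.SchemaBuilder
import org.apache.avro.generic.GenericRecordBuilder
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// a multi-field Avro record as the message key (hypothetical composite key: order_id + line_number)
val keySchema = SchemaBuilder.record("OrderItemKey").fields()
  .requiredLong("order_id")
  .requiredInt("line_number")
  .endRecord()

val key = new GenericRecordBuilder(keySchema)
  .set("order_id", 42L)
  .set("line_number", 7)
  .build()

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
props.put("schema.registry.url", "http://localhost:8081")

val producer = new KafkaProducer[AnyRef, AnyRef](props)
// a null value together with the structured key acts as the tombstone (delete) for that "row"
producer.send(new ProducerRecord[AnyRef, AnyRef]("order-items", key, null))
producer.flush()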

Can't change partitioning scheme for actors

While playing with Azure Service Fabric actors, here is the weird thing I've recently found out: I can't change the default settings for partitioning. If I try to, say, set Named partitioning or change the low/high key for UniformInt64, it gets overwritten each time I build my project in Visual Studio. There is no problem doing this for a stateful service; it only happens with actors. No errors, no records in the Event Log, no nothing... I've found just one single reference about the same problem on the Internet -
https://social.msdn.microsoft.com/Forums/vstudio/en-US/4edbf0a3-307b-489f-b936-43af9a365a0a/applicationmanifestxml-overwritten-on-each-build?forum=AzureServiceFabric
But I haven't seen any explanation for that - neither on MSDN nor in the official documentation. Any ideas? Would it really be 'by design'?
P.S.
Executing just the PowerShell script to deploy the app does allow me to set the scheme the way I want. Still, it's frustrating not to be able to do this in VS. Probably there is a good reason for that... there should be, right? :)
Reliable Services can be created with different partition schemes and partition key ranges. The Actor Service uses the Int64 partitioning scheme with the full Int64 key range to map actors to partitions.
Every ActorId is hashed to an Int64, which is why the actor service must use an Int64 partitioning scheme with the full Int64 key range. However, custom ID values can be used for an ActorId, including GUIDs, strings, and Int64s.
When using GUIDs and strings, the values are hashed to an Int64. However, when explicitly providing an Int64 to an ActorId, the Int64 will map directly to a partition without further hashing. This can be used to control which partition actors are placed in.
(Source)
This ActorId => PartitionKey translation strategy doesn't work if your partitions are named.

Kafka: update a key when there is no update in x amount of time

Is there a way when using Kafka, to have a key being updated after it has not been seen for x amount of time?
Something like
records
  .groupByKey
  .windowedBy(
    TimeWindows
      .of(Duration.ofMinutes(5))
      .grace(Duration.ofMinutes(1))
      .advanceBy(Duration.ofMinutes(1))
  )
  .count()
  .suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded()))
  .updateNotSeen(Duration.ofMinutes(30), (k) => (k, 0))
So here, Kafka would emit a new record whenever it had not seen a record after 30 minutes. (Done by the hypothetical updateNotSeen.)
In my search I found this open issue which, if implemented, would allow me to do this in some way, but I don't know how I would do it today.
As far as I know this is not possible in the DSL (Java, Scala).
Until such functionality is provided out of the box, you can implement such custom functionality yourself using the Processor API of Kafka Streams. (The Processor API can similarly be used to implement custom join operations, for example.) In that case you would not work with tables, which are a DSL-only abstraction, but with state stores (tables are backed by state stores, fwiw), which support direct read-write access from attached Processors or Transformers. Processors and Transformers support punctuations to schedule periodic actions, similar to cron. During such a scheduled action you could check whether any record, identified by its record key, hasn't seen an update in the past 30 minutes and then act accordingly.
Also, it's very helpful to know that you can combine the Processor API and the DSL (which you have been using thus far). That is, you can keep using the DSL for most of your code and only 'plug in' the aforementioned Processors/Transformers (from the Processor API) when and where needed.
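To make that a bit more concrete, here is a rough sketch of such a Transformer (all names, types, and durations are made up): it remembers the last-seen timestamp per key in a key-value store and uses a wall-clock punctuation to forward (key, 0) for keys that have been silent longer than the timeout.
import java.time.Duration
import org.apache.kafka.streams.KeyValue
import org.apache.kafka.streams.kstream.Transformer
import org.apache.kafka.streams.processor.{ProcessorContext, PunctuationType}
import org.apache.kafka.streams.state.KeyValueStore

class NotSeenTransformer(storeName: String, timeout: Duration)
    extends Transformer[String, Long, KeyValue[String, Long]] {

  private var context: ProcessorContext = _
  private var lastSeen: KeyValueStore[String, java.lang.Long] = _

  override def init(ctx: ProcessorContext): Unit = {
    context = ctx
    lastSeen = ctx.getStateStore(storeName).asInstanceOf[KeyValueStore[String, java.lang.Long]]
    // wall-clock punctuation: once a minute, look for keys that have gone silent
    ctx.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, (now: Long) => {
      val stale = scala.collection.mutable.ArrayBuffer.empty[String]
      val iter = lastSeen.all()
      while (iter.hasNext) {
        val entry = iter.next()
        if (now - entry.value.longValue() >= timeout.toMillis) stale += entry.key
      }
      iter.close()
      stale.foreach { key =>
        context.forward(key, 0L) // emit the "not seen within timeout" record downstream
        lastSeen.put(key, now)   // reset so the same key is not re-emitted every minute
      }
    })
  }

  override def transform(key: String, value: Long): KeyValue[String, Long] = {
    lastSeen.put(key, context.timestamp()) // remember when this key was last seen
    KeyValue.pair(key, value)              // pass the original record through unchanged
  }

  override def close(): Unit = ()
}

// wiring into the DSL (also a sketch): add a key-value store named "last-seen" to the builder, then
// records.transform(() => new NotSeenTransformer("last-seen", Duration.ofMinutes(30)), "last-seen")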
Hope this helps!