Consume all messages of a topic in all instances of a Streams app - apache-kafka

In a Kafka Streams app, an instance only gets messages of an input topic for the partitions that have been assigned to that instance. And as the group.id, which is based on the (for all instances identical) application.id, that means that every instance sees only parts of a topic.
This all makes perfect sense of course, and we make use of that with the high-throughput data topic, but we would also like to control the streams application by adding topic-wide "control messages" to the input topic. But as all instances need to get those messages, we would either have to send
one control message per partition (making it necessary for the sender to know about the partitioning scheme, something we would like to avoid)
one control message per key (so every active partition would be getting at least one control message)
Because this is cumbersome for the sender, we are thinking about creating a new topic for control messages that the streams application consumes, in addition to the data topic. But how can we make it so that every partition receives all messages from the control message topic?
According to https://stackoverflow.com/a/55236780/709537, the group id cannot be set for Kafka Streams.
One way to do this would be to create and use a KafkaConsumer in addition to using Kafka Streams, which would allow us to set the group id as we like. However this sounds complex and dirty enough to wonder if there isn't a more straightforward way that we are missing.
Any ideas?

You can use a global store which sources data from all the partitions.
From the documentation,
Adds a global StateStore to the topology. The StateStore sources its
data from all partitions of the provided input topic. There will be
exactly one instance of this StateStore per Kafka Streams instance.
The syntax is as follows:
public StreamsBuilder addGlobalStore(StoreBuilder storeBuilder,
String topic,
Consumed consumed,
ProcessorSupplier stateUpdateSupplier)
The last argument is the ProcessorSupplier which has a get() that returns a Processor that will be executed for every new message. The Processor contains the process() method that will be executed every time there is a new message to the topic.
The global store is per stream instance, so you get all the topic data in every stream instance.
In the process(K key, V value), you can write your processing logic.
A global store can be in-memory or persistent and can be backed by a changelog topic, so that even if the streams instance local data (state) is deleted, the store can be built using the changelog topic.

Related

Kafka client and aggregated events

In event-driven design we strive to find out events that we interested of. Using Kafka we can easily subscribe (a new group.id) to a topic and start consuming events. If retention policy is default one we could consume also one week old messages if specify auto.offset.reset=earliest. Right? But what if we want to start from the very beginning? I guess that KTable should be used but I'm not sure what will happened when a new client has subscribed to a stateful stream. Could you tell me is it true that the new subscriber will receive all aggregated messages?
You can't consume data that has been deleted.
That's why KTables are built on top of compacted topics, which will store the latest keys for each record, and have infinite retention.
If you want to read the "current state" of the table, to get all aggregated messages, then you can use Interactive Queries.
not sure what will happened when a new client has subscribed to a stateful stream
It needs to read the entire compacted topic, starting from the beginning (earliest available offset, not necessarily the first ever produced message) since it cannot easily find where in the topic that each unique key may start.

Publisher which subscribes to its own topic

I'm currently designing an application which is will have hundreds of log-compacted topics. Each topic is related to a failover group and should have a dynamic (e.g., to be changed on demand) set of producers and consumers.
For example, let's say I have 3 failover instances related to topic T1. Each of those failover instances should have the same data / state (eventually consistent). And each of the instances may consume and produce messages on that topic.
As I understand, I need to assign different group IDs for each consumer/producer in order to have every instance read the topic entirely.
Though given that the number of readers and writers for a topic are not fixed, how is it possible to avoid reading ones own messages for that topic?
Sure, I could add a source ID to the message and just dismiss the message when the consumer figures out that he is about to read a message he previously produced himself. But I'd rather avoid the data transfer entirely.
Producers and consumers are independent processes. If you subscribe to the same topic that's being produced to without some extra processing logic, you'll end up with an infinite loop.
You also cannot have more consumers than partitions, so the dynamic consumer amount will be limited by that.
need to assign different group IDs for each consumer/producer in order to have every instance read the topic entirely
Not necessarily. You've mentioned you have compacted topics, so I assume you are using Kafka Streams. In the Streams API, you can set num.standby.replicas for copying statestore data across instances of the same application.id

Get latest values from a topic on consumer start, then continue normally

We have a Kafka producer that produces keyed messages in a very high frequency to topics whose retention time = 10 hours. These messages are real-time updates and the used key is the ID of the element whose value has changed. So the topic is acting as a changelog and will have many duplicate keys.
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal, keeping the minimum load on Kafka server and letting the consumer do most of the job. We tried many ways and none of them seems the best.
What we tried:
1 changelog topic + 1 compact topic:
The producer sends the same message to both topics wrapped in a transaction to assure successful send.
Consumer launches and requests the latest offset of the changelog topic.
Consumes the compacted topic from beginning to construct the table.
Continues consuming the changelog since the requested offset.
Cons:
Having duplicates in compacted topic is a very high possibility even with setting the log compaction frequency the highest possible.
x2 number of topics on Kakfa server.
KSQL:
With KSQL we either have to rewrite a KTable as a topic so that consumer can see it (Extra topics), or we will need consumers to execute KSQL SELECT using to KSQL Rest Server and query the table (Not as fast and performant as Kafka APIs).
Kafka Consumer API:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Kafka Streams:
By using KTables as following:
KTable<Integer, MarketData> tableFromTopic = streamsBuilder.table("topic_name", Consumed.with(Serdes.Integer(), customSerde));
KTable<Integer, MarketData> filteredTable = tableFromTopic.filter((key, value) -> keys.contains(value.getRiskFactorId()));
Kafka Streams will create 1 topic on Kafka server per KTable (named {consumer_app_id}-{topic_name}-STATE-STORE-0000000000-changelog), which will result in a huge number of topics since we a big number of consumers.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Thanks in advance.
By using KTables, Kafka Streams will create 1 topic on Kafka server per KTable, which will result in a huge number of topics since we a big number of consumers.
If you are just reading an existing topic into a KTable (via StreamsBuilder#table()), then no extra topics are being created by Kafka Streams. Same for KSQL.
It would help if you could clarify what exactly you want to do with the KTable(s). Apparently you are doing something that does result in additional topics being created?
1 changelog topic + 1 compact topic:
Why were you thinking about having two separate topics? Normally, changelog topics should always be compacted. And given your use case description, I don't see a reason why it should not be:
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal [...]
Hence compaction would be very useful for your use case. It would also prevent this problem you described:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Note that, to reconstruct the latest table values, all three of Kafka Streams, KSQL, and the Kafka Consumer must read the table's underlying topic completely (from beginning to end). If that topic is NOT compacted, this might indeed take a long time depending on the data volume, topic retention settings, etc.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Without knowing more about your use case, particularly what you want to do with the KTable(s) once they are populated, my answer would be:
Make sure the "changelog topic" is also compacted.
Try KSQL first. If this doesn't satisfy your needs, try Kafka Streams. If this doesn't satisfy your needs, try the Kafka Consumer.
For example, I wouldn't use the Kafka Consumer if it is supposed to do any stateful processing with the "table" data, because the Kafka Consumer lacks built-in functionality for fault-tolerant stateful processing.
Consumer starts and consumes the topic from beginning. This worked
perfectly, but the consumer has to consume the 10 hours change log to
construct the last values table.
During the first time your application starts up, what you said is correct.
To avoid this during every restart, store the key-value data in a file.
For example, you might want to use a persistent map (like MapDB).
Since you give the consumer group.id and you commit the offset either periodically or after each record is stored in the map, the next time your application restarts it will read it from the last comitted offset for that group.id.
So the problem of taking a lot of time occurs only initially (during first time). So long as you have the file, you don't need to consume from beginning.
In case, if the file is not there or is deleted, just seekToBeginning in the KafkaConsumer and build it again.
Somewhere, you need to store this key-values for retrieval and why cannot it be a persistent store?
In case if you want to use Kafka streams for whatever reason, then an alternative (not as simple as the above) is to use a persistent backed store.
For example, a persistent global store.
streamsBuilder.addGlobalStore(Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(topic), keySerde, valueSerde), topic, Consumed.with(keySerde, valueSerde), this::updateValue);
P.S: There will be a file called .checkpoint in the directory which stores the offsets. In case if the topic is deleted in the middle you get OffsetOutOfRangeException. You may want to avoid this, perhaps by using UncaughtExceptionHandler
Refer to https://stackoverflow.com/a/57301986/2534090 for more.
Finally,
It is better to use Consumer with persistent file rather than Streams for this, because of simplicity it offers.

Kafka Consumer API vs Streams API for event filtering

Should I use the Kafka Consumer API or the Kafka Streams API for this use case? I have a topic with a number of consumer groups consuming off it. This topic contains one type of event which is a JSON message with a type field buried internally. Some messages will be consumed by some consumer groups and not by others, one consumer group will probably not be consuming many messages at all.
My question is:
Should I use the consumer API, then on each event read the type field and drop or process the event based on the type field.
OR, should I filter using the Streams API, filter method and predicate?
After I consume an event, the plan is to process that event (DB delete, update, or other depending on the service) then if there is a failure I will produce to a separate queue which I will re-process later.
Thanks you.
This seems more a matter of opinion. I personally would go with Streams/KSQL, likely smaller code that you would have to maintain. You can have another intermediary topic that contains the cleaned up data that you can then attach a Connect sink, other consumers, or other Stream and KSQL processes. Using streams you can scale a single application on different machines, you can store state, have standby replicas and more, which would be a PITA to do it all yourself.

Is Kafka Stream StateStore global over all instances or just local?

In Kafka Stream WordCount example, it uses StateStore to store word counts. If there are multiple instances in the same consumer group, the StateStore is global to the group, or just local to an consumer instance?
Thnaks
This depends on your view on a state store.
In Kafka Streams a state is shared and thus each instance holds part of the overall application state. For example, using DSL stateful operator use a local RocksDB instance to hold their shard of the state. Thus, with this regard the state is local.
On the other hand, all changes to the state are written into a Kafka topic. This topic does not "live" on the application host but in the Kafka cluster and consists of multiple partition and can be replicated. In case of an error, this changelog topic is used to recreate the state of the failed instance in another still running instance. Thus, as the changelog is accessible by all application instances, it can be considered to be global, too.
Keep in mind, that the changelog is the truth of the application state and the local stores are basically caches of shards of the state.
Moreover, in the WordCount example, a record stream (the data stream) gets partitioned by words, such that the count of one word will be maintained by a single instance (and different instances maintain the counts for different words).
For an architectural overview, I recommend http://docs.confluent.io/current/streams/architecture.html
Also this blog post should be interesting http://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
If worth mentioning that there is a GlobalKTable improvement proposal
GlobalKTable will be fully replicated once per KafkaStreams instance.
That is, each KafkaStreams instance will consume all partitions of the
corresponding topic.
From the Confluent Platform's mailing list, I've got this information
You could start
prototyping using Kafka 0.10.2 (or trunk) branch...
0.10.2-rc0 already has GlobalKTable!
Here's the actual PR.
And the person that told me that was Matthias J. Sax ;)
Use a Processor instead of Transformer, for all the transformations you want to perform on the input topic, whenever there is a usecase of lookingup data from GlobalStateStore . Use context.forward(key,value,childName) to send the data to the downstream nodes. context.forward(key,value,childName) may be called multiple times in a process() and punctuate() , so as to send multiple records to downstream node. If there is a requirement to update GlobalStateStore, do this only in Processor passed to addGlobalStore(..) because, there is a GlobalStreamThread associated with GlobalStateStore, which keeps the state of the store consistent across all the running kstream instances.