Eliminate redundancy between ktable internal topic and user topic - apache-kafka

I'm writing an app for a book library and I have 2 microservices: memberService, and bookService. The memberService creates a ktable of members (built by aggregating change messages on another topic) for its own internal use. The bookService also needs read access to that ktable. Currently I share the data by having the memberService call memberTable.toStream().to("memberTableTopic")
and I have the bookService subscribe to the memberTableTopic.
Based on my understanding of how ktables work, the data in memberTableTopic will be identical to the backing internal topic used by the ktable. Is there a good way to eliminate this redundancy? Should my bookService subscribe to the internal topic?

Yes, the data will look the same in both topics, the internal topic and the user topic.
Conceptually, internal topics are used internally by the Kafka Streams application. That implies that when the application is reset (for example with the application reset tool), these internal topics are deleted and you lose the data. User topics, by contrast, exist independently of any application and can be used by any application at any time.
It depends on how you want to use the data. If you want to reduce the redundancy, you can set a short retention on your internal topics. Keep in mind, though, that Kafka Streams restores the KTable's state from its changelog topic, so anything removed from that topic cannot be restored after a restart.

Kafka Streams - disable internal topic creation

I work in an organization where we must use the shared Kafka cluster.
Due to internal company policy, the account we use for authentication has only the read/write permissions assigned.
We are not able to request the topic-create permission.
To create the topic we need to follow the onboarding procedure and know the topic name upfront.
As we know, Kafka Streams creates internal topics to persist the stream's state.
Is there a way to disable the fault tolerance and keep the stream state in memory or persist in the file system?
Thank you in advance.
This depends entirely on how you write the topology. For example, stateless DSL operators such as map, filter, and foreach don't create any internal topics.
If you actually need to do aggregations and build state stores, then you really shouldn't disable the topics. Yes, state stores are kept either in memory or in RocksDB on disk, but they are also backed by topics so that the state can be redistributed, or rebuilt in case of failure.
If you want to prevent topic creation, I think you'll need an authorizer class defined on the broker that restricts it based on, at least, regex patterns for the client-side application.id and client.id; there's nothing you can do in the client config.

Can compacted Kafka topic be used as key-value database?

In many articles, I've read that compacted Kafka topics can be used as a database. However, when looking at the Kafka API, I cannot find methods that allow me to query a topic for a value based on a key.
So, can a compacted Kafka topic be used as a (high performance, read-only) key-value database?
In my architecture I want to feed a component with a compacted topic. And I'm wondering whether that component needs to have a replica of that topic in its local database, or whether it can use that compacted topic as a key value database instead.
Compacted Kafka topics themselves, together with the basic Consumer/Producer APIs, are not suitable as a key-value database. They are, however, widely used as a backing store to persist KV database or cache data, for instance in a write-through approach. If you need to re-warm your cache for some reason, just replay the entire topic to repopulate it.
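As a rough illustration of that replay idea, here is a minimal sketch in plain Java. It does not use the Kafka client at all: a list stands in for the compacted topic, and the LogRecord class, the replay method, and the convention that a null value acts as a delete marker are assumptions made for this example only.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: a plain list stands in for a compacted topic.
class CompactedTopicReplay {

    /** One key/value record; a null value plays the role of a delete marker. */
    static final class LogRecord {
        final String key;
        final String value;

        LogRecord(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Replays records in offset order; the last value written for a key wins. */
    static Map<String, String> replay(List<LogRecord> log) {
        Map<String, String> cache = new LinkedHashMap<>();
        for (LogRecord r : log) {
            if (r.value == null) {
                cache.remove(r.key);      // delete marker: drop the key
            } else {
                cache.put(r.key, r.value); // newer value overwrites older one
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        List<LogRecord> log = Arrays.asList(
            new LogRecord("order-1", "CREATED"),
            new LogRecord("order-2", "CREATED"),
            new LogRecord("order-1", "SHIPPED"),
            new LogRecord("order-2", null));
        System.out.println(replay(log));
    }
}
```

The point of the sketch is that key lookups happen against the rebuilt map, not against the topic; the topic only serves as the durable source from which the map is repopulated.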
In the Kafka world you have the Kafka Streams API, which lets you expose the state of your application (for your KV use case, that could be the latest state of an order) by means of queryable state stores. A state store is an abstraction of a KV database and, by default, is implemented using a fast KV database called RocksDB. In case of disaster it is fully recoverable because its full data is persisted in a Kafka topic, so it is resilient enough to be a source of data for your use case.
Imagine that this is your Kafka Streams Application architecture:
To be able to query these Kafka Streams state stores you need to bundle an HTTP Server and REST API in your Kafka Streams applications to query its local or remote state store (Kafka distributes/shards data across multiple partitions in a topic to enable parallel processing and high availability, and so does Kafka Streams). Because Kafka Streams API provides the metadata for you to know in which instance the key resides, you can surely query any instance and, if the key exists, a response can be returned regardless of the instance where the key lives.
With this approach, you can kill two birds with one stone:
Do stateful stream processing at scale with Kafka Streams
Expose its state to external clients in a KV Database query pattern style
All in a real-time, highly performant, distributed and resilient architecture.
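To make the "query any instance" idea a bit more concrete, here is a simplified sketch in plain Java. It is not the Kafka Streams API: the KeyRouting class, the hash-based partitionFor method, and the instance names are hypothetical stand-ins. In a real application the owning instance would come from the metadata that Kafka Streams provides (for example via KafkaStreams#queryMetadataForKey), and Kafka's default partitioner hashes the serialized key rather than using String.hashCode.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of routing a key lookup to the instance that owns it.
class KeyRouting {

    /** Simplified stand-in for a partitioner: maps a key to a partition number. */
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Finds the instance that hosts the partition holding this key. */
    static String ownerOf(String key, int numPartitions, Map<Integer, String> assignment) {
        return assignment.get(partitionFor(key, numPartitions));
    }

    /** Answers locally if this instance owns the key, otherwise names the instance to forward to. */
    static String lookup(String key, String localInstance, int numPartitions,
                         Map<Integer, String> assignment, Map<String, String> localStore) {
        String owner = ownerOf(key, numPartitions, assignment);
        if (owner.equals(localInstance)) {
            return localStore.get(key);
        }
        return "FORWARD:" + owner; // a real service would issue an HTTP call here
    }

    public static void main(String[] args) {
        Map<Integer, String> assignment = new HashMap<>();
        assignment.put(0, "instance-A");
        assignment.put(1, "instance-B");

        Map<String, String> storeOnA = new HashMap<>();
        storeOnA.put("order-7", "SHIPPED");

        System.out.println(lookup("order-7", "instance-A", 2, assignment, storeOnA));
    }
}
```

Whichever instance receives the request, the caller gets an answer: either from the local store or by forwarding to the owner.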
The images were sourced from a wider article by Robert Schmid where you can find additional details and a prototype to implement queryable state stores with Kafka Streams.
Notable mention:
If you are not in the mood to implement all of this using the Kafka Streams API, take a look at ksqlDB from Confluent which provides an even higher level abstraction on top of Kafka Streams just using a cool and simple SQL dialect to achieve the same sort of use case using pull queries. If you want to prototype something really quickly, take a look at this answer by Robin Moffatt or even this blog post to get a grip on its simplicity.
While ksqlDB is not part of the Apache Kafka project, it's open-source, free and is built on top of the Kafka Streams API.

Consume all messages of a topic in all instances of a Streams app

In a Kafka Streams app, an instance only gets messages of an input topic for the partitions that have been assigned to that instance. And as the group.id is based on the application.id (which is identical for all instances), every instance sees only part of a topic.
This all makes perfect sense of course, and we make use of that with the high-throughput data topic, but we would also like to control the streams application by adding topic-wide "control messages" to the input topic. But as all instances need to get those messages, we would either have to send
one control message per partition (making it necessary for the sender to know about the partitioning scheme, something we would like to avoid)
one control message per key (so every active partition would be getting at least one control message)
Because this is cumbersome for the sender, we are thinking about creating a new topic for control messages that the streams application consumes, in addition to the data topic. But how can we make it so that every instance receives all messages from the control message topic?
According to https://stackoverflow.com/a/55236780/709537, the group id cannot be set for Kafka Streams.
One way to do this would be to create and use a KafkaConsumer in addition to using Kafka Streams, which would allow us to set the group id as we like. However this sounds complex and dirty enough to wonder if there isn't a more straightforward way that we are missing.
Any ideas?
You can use a global store which sources data from all the partitions.
From the documentation,
Adds a global StateStore to the topology. The StateStore sources its
data from all partitions of the provided input topic. There will be
exactly one instance of this StateStore per Kafka Streams instance.
The syntax is as follows:
public StreamsBuilder addGlobalStore(StoreBuilder storeBuilder,
                                     String topic,
                                     Consumed consumed,
                                     ProcessorSupplier stateUpdateSupplier)
The last argument is the ProcessorSupplier which has a get() that returns a Processor that will be executed for every new message. The Processor contains the process() method that will be executed every time there is a new message to the topic.
The global store is per stream instance, so you get all the topic data in every stream instance.
In the process(K key, V value), you can write your processing logic.
A global store can be in-memory or persistent. It does not get a separate changelog topic; the input topic itself plays that role, so even if the local data (state) of the streams instance is deleted, the store can be rebuilt from the input topic.

kafka produce to topic and write to state store in a single transaction

Is it possible to produce to a Kafka topic and write to a state store in a single transaction, without starting the transaction as part of consuming from a topic?
EDIT: The reason I want to do this is to be able to filter out duplicate requests. E.g. a service exposes a REST interface and just writes a message to a topic. If it is possible to produce to a topic and write to a state store in a single transaction, then I can easily query the state store first to filter out a duplicate. This also assumes that the transaction timeout will be less than the REST timeout, but that is not closely related to the question.
I am also aware of the solution provided here by Confluent. But this will work as long as the synchronisation time "from the topic to the store" is less than the blocking time.
https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/processor/StateStore.html
State stores are part of the Streams API, so they are tied to Kafka Streams. I would recommend using headers within a message to maintain state information.
Or
Create another topic to store intermediate information.
If I understand your use case properly, you can do it like this:
Write the REST call result to some topic, raw-data (using the producer).
Use Kafka Streams to process the data from the raw-data topic. With Kafka Streams you can implement the whole logic of checking/filtering duplicates, etc., and write the result into a golden topic.
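The duplicate check in the second step can be sketched in plain Java as follows. This is only an illustration of the logic, not Kafka Streams code: the DuplicateFilter class is hypothetical, a HashSet stands in for the state store, the input list stands in for the raw-data topic, and the returned list stands in for the golden topic. It also assumes that every request carries an ID that identifies duplicates.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: de-duplicating requests by ID with a "seen" set.
class DuplicateFilter {

    // Stands in for a state store keyed by request ID.
    private final Set<String> seenRequestIds = new HashSet<>();

    /** Returns true only the first time a request ID is observed. */
    boolean firstTime(String requestId) {
        return seenRequestIds.add(requestId);
    }

    /** Reads raw request IDs in order and keeps only the first occurrence of each. */
    List<String> process(List<String> rawRequestIds) {
        List<String> golden = new ArrayList<>();
        for (String id : rawRequestIds) {
            if (firstTime(id)) {
                golden.add(id);
            }
        }
        return golden;
    }

    public static void main(String[] args) {
        DuplicateFilter filter = new DuplicateFilter();
        List<String> raw = new ArrayList<>();
        raw.add("req-1");
        raw.add("req-2");
        raw.add("req-1");
        System.out.println(filter.process(raw));
    }
}
```

Because the check and the write to the output both happen inside the stream processor, the REST service only has to produce to the raw-data topic and never needs to touch the store itself.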

Kafka Consumer API vs Streams API for event filtering

Should I use the Kafka Consumer API or the Kafka Streams API for this use case? I have a topic with a number of consumer groups consuming off it. This topic contains one type of event which is a JSON message with a type field buried internally. Some messages will be consumed by some consumer groups and not by others, one consumer group will probably not be consuming many messages at all.
My question is:
Should I use the consumer API, then on each event read the type field and drop or process the event based on the type field.
OR, should I filter using the Streams API, filter method and predicate?
After I consume an event, the plan is to process that event (DB delete, update, or other depending on the service) then if there is a failure I will produce to a separate queue which I will re-process later.
Thank you.
This seems more a matter of opinion. I personally would go with Streams/KSQL, since it is likely less code for you to maintain. You can have another intermediary topic that contains the cleaned-up data, to which you can then attach a Connect sink, other consumers, or other Streams and KSQL processes. Using Streams you can scale a single application across different machines, store state, have standby replicas and more, which would be a PITA to do all by yourself.
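For comparison, the drop-or-process logic from the question can be sketched in plain Java like this. The EventTypeFilter class, the Event type, and the handle method are hypothetical, and the sketch assumes the type field has already been parsed out of the JSON message. The same check could be expressed as the predicate passed to the Streams filter method; the list of failed events stands in for the separate queue used for later re-processing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical illustration: filter events by type, process them, collect failures.
class EventTypeFilter {

    /** Minimal event: the type field plus an opaque payload. */
    static final class Event {
        final String type;
        final String payload;

        Event(String type, String payload) {
            this.type = type;
            this.payload = payload;
        }
    }

    /** Processes the wanted events and returns those whose handler threw. */
    static List<Event> handle(List<Event> events, Predicate<Event> wanted, Consumer<Event> handler) {
        List<Event> failed = new ArrayList<>();
        for (Event e : events) {
            if (!wanted.test(e)) {
                continue; // not for this consumer group: drop it
            }
            try {
                handler.accept(e);
            } catch (RuntimeException ex) {
                failed.add(e); // would be produced to the retry queue
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(new Event("delete", "id=1"));
        events.add(new Event("update", "id=2"));

        List<Event> failed = handle(events,
            e -> e.type.equals("delete"),
            e -> System.out.println("processing " + e.payload));
        System.out.println("failed: " + failed.size());
    }
}
```

Either way the filtering logic itself is small; the difference is mostly in what surrounds it, such as scaling, state, and standby replicas.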