Dynamically create and change Kafka topics with Flink - apache-kafka

I'm using Flink to read and write data from different Kafka topics.
Specifically, I'm using the FlinkKafkaConsumer and FlinkKafkaProducer.
I'd like to know if it's possible to change the Kafka topics I'm reading from and writing to 'on the fly', based on either logic within my program or the contents of the records themselves.
For example, if a record with a new field is read, I'd like to create a new topic and start diverting records with that field to the new topic.
Thanks.

If you have your topics following a generic naming pattern, for example "topic-n*", your Flink Kafka consumer can automatically read from "topic-n1", "topic-n2", and so on as they are added to Kafka.
Flink 1.5 (FlinkKafkaConsumer09) added support for dynamic partition discovery and topic discovery based on a regex pattern. This means the Flink Kafka consumer can pick up new Kafka partitions and topics without needing to restart the job, while maintaining exactly-once guarantees.
Consumer constructor that accepts subscriptionPattern: link.
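For illustration, a minimal sketch of a pattern-based subscription, shown here with the universal FlinkKafkaConsumer (the versioned consumers expose similar constructors); the broker address, group id, and discovery interval are assumptions:

import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "my-flink-job");
// check for new matching topics/partitions every 30 seconds
props.setProperty(FlinkKafkaConsumerBase.KEY_PARTITION_DISCOVERY_INTERVAL_MILLIS, "30000");

// subscribe to every topic matching the pattern, now and in the future
FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<>(
        Pattern.compile("topic-n.*"), new SimpleStringSchema(), props);

DataStream<String> stream = env.addSource(consumer);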

Thinking more about the requirement:
1st step - You start with one topic (for simplicity) and spawn more topics at runtime based on the data, directing the respective messages to these topics. This is entirely possible and does not require complicated code. Use the ZkClient API (or a newer admin client) to check whether the topic name exists; if it does not, create the topic and start pushing messages into it through a new producer tied to that topic. You don't need to restart the job to produce messages to a specific topic.
Your initial consumer becomes a producer (for the new topics) plus a consumer (for the old topic).
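As a rough sketch of the "check and create" part, here is one way to do it with the Kafka AdminClient (swapped in for the ZkClient mentioned above; the broker address, partition count, and replication factor are placeholder values):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicCreator {

    // Creates the topic if it does not exist yet; safe to call before producing to it.
    public static void ensureTopicExists(String topicName) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            if (!admin.listTopics().names().get().contains(topicName)) {
                // 1 partition, replication factor 1 -- illustrative values only
                admin.createTopics(Collections.singleton(new NewTopic(topicName, 1, (short) 1)))
                     .all().get();
            }
        }
    }
}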
2nd step - You want to consume messages from the new topics. One way could be to spawn a new job entirely. You can do this by creating a thread pool initially and supplying arguments to it.
Again, be careful with this: too much automation can overload the cluster in case of a looping bug. Also think about the possibility of too many topics being created over time if the input data is not controlled or is simply dirty. There may be better architectural approaches, as mentioned in the comments above.

Related

Consume all messages of a topic in all instances of a Streams app

In a Kafka Streams app, an instance only gets messages of an input topic for the partitions that have been assigned to that instance. And since the group.id is based on the application.id, which is identical for all instances, every instance sees only part of a topic.
This all makes perfect sense of course, and we make use of it for the high-throughput data topic, but we would also like to control the streams application by adding topic-wide "control messages" to the input topic. But as all instances need to get those messages, we would either have to send
one control message per partition (making it necessary for the sender to know about the partitioning scheme, something we would like to avoid)
one control message per key (so every active partition would be getting at least one control message)
Because this is cumbersome for the sender, we are thinking about creating a new topic for control messages that the streams application consumes, in addition to the data topic. But how can we make it so that every partition receives all messages from the control message topic?
According to https://stackoverflow.com/a/55236780/709537, the group id cannot be set for Kafka Streams.
One way to do this would be to create and use a KafkaConsumer in addition to using Kafka Streams, which would allow us to set the group id as we like. However, this sounds complex and dirty enough to make us wonder whether there isn't a more straightforward way that we are missing.
Any ideas?
You can use a global store which sources data from all the partitions.
From the documentation,
Adds a global StateStore to the topology. The StateStore sources its data from all partitions of the provided input topic. There will be exactly one instance of this StateStore per Kafka Streams instance.
The syntax is as follows:
public StreamsBuilder addGlobalStore(StoreBuilder storeBuilder,
                                     String topic,
                                     Consumed consumed,
                                     ProcessorSupplier stateUpdateSupplier)
The last argument is the ProcessorSupplier, whose get() returns a Processor that is executed for every new message. The Processor's process() method runs every time a new message arrives on the topic.
The global store is per stream instance, so you get all the topic data in every stream instance.
In the process(K key, V value), you can write your processing logic.
A global store can be in-memory or persistent. Because it is sourced from the input topic itself, even if the local state of a streams instance is deleted, the store can be rebuilt by re-reading that topic.
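For illustration, a minimal sketch of wiring this up for a control-message topic, using the older Processor API that matches the process(K key, V value) signature mentioned above; the topic name "control-topic", store name "control-store", and String serdes are assumptions:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorSupplier;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

StreamsBuilder builder = new StreamsBuilder();

// Supplier for the per-instance processor that is invoked for every control message.
ProcessorSupplier<String, String> controlProcessor = () -> new AbstractProcessor<String, String>() {
    @Override
    public void process(final String key, final String value) {
        // runs on every Kafka Streams instance for every message on the control topic
        final KeyValueStore<String, String> store =
                (KeyValueStore<String, String>) context().getStateStore("control-store");
        store.put(key, value); // keep the store current, then react to the control message
    }
};

builder.addGlobalStore(
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("control-store"),
                Serdes.String(), Serdes.String())
              .withLoggingDisabled(), // global stores are restored from the source topic, not a changelog
        "control-topic",
        Consumed.with(Serdes.String(), Serdes.String()),
        controlProcessor);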

Kafka Message Filtering Based on ENVs

I have a consumer application deployed in several ENVs (dev, test, stage & preprod). They all consume the same Kafka topic (i.e., they act as multiple consumers of the same topic).
I have separate producer applications for all ENVs (dev, test, stage & preprod). When producing a message, the payload contains a field that identifies the producer's ENV.
Our requirement is that the dev ENV's consumer should only consume the dev ENV producer application's messages. The same goes for the other ENVs.
My question is: should I go with consumer-side filtering? Will this meet our requirement, and how?
Thanks in advance.
You have multiple options on how to deal with this requirement. However, I don't think it is in general a good idea to have one topic for different environments. Looking at data protection and access permissions, this doesn't sound like a good design.
Anyway, I see the following options.
Option 1:
Use the environment (dev, test, ...) as the message key in the topic and tell the consumer to filter by key (see the sketch after these options).
Option 2:
Write producers that send data from each environment to individual partitions and tell the consumers for each environment to only read from a particular partition.
But before implementing Option 2, I would rather do
Option 3:
Have a topic for each environment and let the producers/consumers write to/read from the different topics.
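For illustration, a rough sketch of Option 1's consumer-side filtering, assuming each message is keyed by its environment name (topic name, broker address, and group id are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "dev-consumer");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("shared-topic"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            if (!"DEV".equals(record.key())) {
                continue; // not our environment; the record is still fetched, just ignored
            }
            // handle the DEV record here
        }
    }
}

Note that with this option every consumer still fetches all environments' data over the network and only drops it client-side, which is one reason the per-environment topics of Option 3 are usually preferable.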
I agree with mike that using a single topic across environments is not a good idea.
However if you are going to do this, then I would suggest you use a stream processor to create separate topics for your consumers. You can do this in Kafka Streams, ksqlDB, etc.
ksqlDB would look like this:
-- Declare stream over existing topic
CREATE STREAM FOO_ALL_ENVS WITH (KAFKA_TOPIC='my_source_topic', VALUE_FORMAT='AVRO');
-- Create derived stream & new topic populated with messages just for DEV
-- You can explicitly provide the target Kafka topic name.
CREATE STREAM FOO_DEV WITH (KAFKA_TOPIC='foo_dev') AS SELECT * FROM FOO_ALL_ENVS WHERE ENV='DEV';
-- Create derived stream & new topic populated with messages just for PROD
-- If you don't specify a Kafka topic name it will inherit from the
-- stream name (i.e. `FOO_PROD`)
CREATE STREAM FOO_PROD AS SELECT * FROM FOO_ALL_ENVS WHERE ENV='PROD';
-- etc
Now you have your producer writing to a single topic (if you must), but your consumers can consume from a topic that is specific to their environment. The ksqlDB statements are continuous queries, so they will process all existing messages in the source topic as well as every new message that arrives.

Is there an option to configure num stream thread for specific topic(s) instead of all?

In our application we have multiple topics, where some topics are created with 16 partitions and some with 1 partition. Is there any spring.cloud.stream.kafka.bindings property/option available to achieve this?
Maybe this helps: num.stream.threads creating idle threads
If there is one KafkaStreams instance, it is not possible, because Kafka Streams only has a global config. Hence, you would need multiple applications, i.e., multiple KafkaStreams instances that process different input topics, in order to configure each with a different number of threads. Following the answer linked above, it seems that spring-cloud-stream can create multiple KafkaStreams clients to support what you want.
However, I am not sure why you would want/need this (but I am also not exactly sure how spring-cloud-stream translates your program). In the end, parallelization is done based on tasks, and thus for a single input topic partition only one of your threads will get the corresponding task assigned. Hence, there is no overhead you need to worry about.
For more details check out: https://docs.confluent.io/current/streams/architecture.html#parallelism-model
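For illustration, here is a rough sketch of the "multiple KafkaStreams instances" idea in plain Kafka Streams; the application ids, broker address, topic names, and thread counts are made up, and how this maps onto spring-cloud-stream bindings is not shown:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;

// One instance with 16 threads for the 16-partition topic...
Properties manyThreads = new Properties();
manyThreads.put(StreamsConfig.APPLICATION_ID_CONFIG, "app-large-topic");
manyThreads.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
manyThreads.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 16);

StreamsBuilder largeBuilder = new StreamsBuilder();
largeBuilder.stream("large-topic", Consumed.with(Serdes.String(), Serdes.String()))
            .to("large-topic-out");

// ...and a second instance with 1 thread for the single-partition topic.
Properties oneThread = new Properties();
oneThread.put(StreamsConfig.APPLICATION_ID_CONFIG, "app-small-topic");
oneThread.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
oneThread.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);

StreamsBuilder smallBuilder = new StreamsBuilder();
smallBuilder.stream("small-topic", Consumed.with(Serdes.String(), Serdes.String()))
            .to("small-topic-out");

new KafkaStreams(largeBuilder.build(), manyThreads).start();
new KafkaStreams(smallBuilder.build(), oneThread).start();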
There are several partition properties available. For example,
spring.cloud.stream.bindings.func-out-0.producer.partitionKeyExpression=payload.id
spring.cloud.stream.bindings.func-out-0.producer.partition
You can get more information on both producer and consumer configuration properties here

Kafka Producer design - multiple topics

I'm trying to implement a Kafka producer/consumer model and am deliberating whether creating a separate publisher thread per topic would be preferable to having a single publisher handle multiple topics. Any help would be appreciated.
PS: I'm new to Kafka
By separate publisher thread, I think you mean separate producer objects. If so..
Since messages are stored as key-value pairs in Kafka, different topics can have different key-value types.
So if your Kafka topics have different key-value types like for example..
Topic1 - key:String, value:Student
Topic2 - key:Long, value:Teacher
and so on, then you should be using multiple producers. This is because the KafkaProducer class at the time of constructing the object asks you for the key and value serializers.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // required to construct the producer
props.put("key.serializer", StringSerializer.class);
props.put("value.serializer", LongSerializer.class);
KafkaProducer<String, Long> producer = new KafkaProducer<>(props);
That said, you could also write a generic serializer for all the types, but it is better to know beforehand what the producer will be used for.
I prefer the Keep It Simple, Stupid (KISS) approach for the sake of obvious reasons: one producer (or multiple producers) per topic.
From Wikipedia,
The KISS principle states that most systems work best if they are kept simple rather than made complicated; therefore, simplicity should be a key goal in design, and unnecessary complexity should be avoided.
As for the possibility of one producer supporting multiple topics: the idea that it cannot is also far from the truth, since the destination topic is specified per record.
Starting with Spring for Apache Kafka version 2.5, you can use a RoutingKafkaTemplate to select the producer at runtime, based on the destination topic name.
https://docs.spring.io/spring-kafka/reference/html/#routing-template
A single publisher can handle multiple topics, and you can customize the producer config to suit each topic's needs.
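For illustration, a minimal sketch of a single producer (with one key/value type) writing to two different topics; the broker address, topic names, and payloads are made up:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", StringSerializer.class);
props.put("value.serializer", StringSerializer.class);

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // The topic is chosen per record, so one producer can serve many topics
    // as long as they share the same key/value types.
    producer.send(new ProducerRecord<>("orders", "key-1", "order payload"));
    producer.send(new ProducerRecord<>("audit", "key-1", "audit payload"));
}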
I think a separate producer for each topic would be preferable because, if a particular producer goes down, only the corresponding topic is impacted and all the remaining topics keep working smoothly without any problem.
If we create one publisher for all topics and that publisher goes down for some reason, then all the topics are impacted.

Get latest values from a topic on consumer start, then continue normally

We have a Kafka producer that produces keyed messages at a very high frequency to topics whose retention time = 10 hours. These messages are real-time updates, and the key used is the ID of the element whose value has changed. So the topic is acting as a changelog and will have many duplicate keys.
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of its last known state (new consumer, crashed, restarted, etc.), it will somehow construct a table with the latest values of all the keys in the topic, and then keep listening for new updates as normal, keeping the load on the Kafka server to a minimum and letting the consumer do most of the work. We tried many ways, and none of them seems ideal.
What we tried:
1 changelog topic + 1 compacted topic:
The producer sends the same message to both topics, wrapped in a transaction to ensure the send succeeds in both (see the sketch after the cons below).
Consumer launches and requests the latest offset of the changelog topic.
Consumes the compacted topic from the beginning to construct the table.
Continues consuming the changelog from the requested offset.
Cons:
Duplicates in the compacted topic are very likely, even with the log compaction frequency set as high as possible.
Twice the number of topics on the Kafka server.
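For reference, the transactional dual write described above could be sketched roughly like this (topic names, key/value types, and the transactional.id are made up):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.IntegerSerializer;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", IntegerSerializer.class);
props.put("value.serializer", StringSerializer.class);
props.put("transactional.id", "changelog-dual-writer"); // enables transactions for this producer

KafkaProducer<Integer, String> producer = new KafkaProducer<>(props);
producer.initTransactions();

producer.beginTransaction();
try {
    // the same update goes to both topics, atomically
    producer.send(new ProducerRecord<>("changelog-topic", 42, "latest value"));
    producer.send(new ProducerRecord<>("compacted-topic", 42, "latest value"));
    producer.commitTransaction();
} catch (Exception e) {
    producer.abortTransaction();
}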
KSQL:
With KSQL we either have to rewrite a KTable as a topic so that the consumer can see it (extra topics), or we need consumers to execute a KSQL SELECT against the KSQL REST server and query the table (not as fast and performant as the Kafka APIs).
Kafka Consumer API:
Consumer starts and consumes the topic from the beginning. This worked perfectly, but the consumer has to consume the 10-hour changelog to construct the latest-values table.
Kafka Streams:
By using KTables as following:
KTable<Integer, MarketData> tableFromTopic = streamsBuilder.table("topic_name", Consumed.with(Serdes.Integer(), customSerde));
KTable<Integer, MarketData> filteredTable = tableFromTopic.filter((key, value) -> keys.contains(value.getRiskFactorId()));
Kafka Streams will create one topic on the Kafka server per KTable (named {consumer_app_id}-{topic_name}-STATE-STORE-0000000000-changelog), which will result in a huge number of topics since we have a big number of consumers.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Thanks in advance.
By using KTables, Kafka Streams will create one topic on the Kafka server per KTable, which will result in a huge number of topics since we have a big number of consumers.
If you are just reading an existing topic into a KTable (via StreamsBuilder#table()), then no extra topics are being created by Kafka Streams. Same for KSQL.
It would help if you could clarify what exactly you want to do with the KTable(s). Apparently you are doing something that does result in additional topics being created?
1 changelog topic + 1 compacted topic:
Why were you thinking about having two separate topics? Normally, changelog topics should always be compacted. And given your use case description, I don't see a reason why it should not be:
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal [...]
Hence compaction would be very useful for your use case. It would also prevent this problem you described:
Consumer starts and consumes the topic from the beginning. This worked perfectly, but the consumer has to consume the 10-hour changelog to construct the latest-values table.
Note that, to reconstruct the latest table values, all three of Kafka Streams, KSQL, and the Kafka Consumer must read the table's underlying topic completely (from beginning to end). If that topic is NOT compacted, this might indeed take a long time depending on the data volume, topic retention settings, etc.
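If the changelog topic does not exist yet, it could be created as a compacted topic roughly like this (a sketch using the Kafka AdminClient; broker address, topic name, partition count, and replication factor are placeholder assumptions):

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CompactedTopicCreator {

    // Creates the changelog topic with log compaction enabled.
    public static void createCompactedChangelog() throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic changelog = new NewTopic("changelog-topic", 6, (short) 3) // placeholder sizing
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Collections.singleton(changelog)).all().get();
        }
    }
}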
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Without knowing more about your use case, particularly what you want to do with the KTable(s) once they are populated, my answer would be:
Make sure the "changelog topic" is also compacted.
Try KSQL first. If this doesn't satisfy your needs, try Kafka Streams. If this doesn't satisfy your needs, try the Kafka Consumer.
For example, I wouldn't use the Kafka Consumer if it is supposed to do any stateful processing with the "table" data, because the Kafka Consumer lacks built-in functionality for fault-tolerant stateful processing.
Consumer starts and consumes the topic from the beginning. This worked perfectly, but the consumer has to consume the 10-hour changelog to construct the latest-values table.
During the first start of your application, what you said is correct.
To avoid this on every restart, store the key-value data in a file.
For example, you might want to use a persistent map (like MapDB).
Since you give the consumer a group.id and commit the offsets either periodically or after each record is stored in the map, the next time your application restarts it will read from the last committed offset for that group.id.
So the problem of the long load only occurs initially (on the first run). As long as you have the file, you don't need to consume from the beginning.
If the file is not there or has been deleted, just seekToBeginning in the KafkaConsumer and build it again.
You need to store these key-values somewhere for retrieval anyway, so why can't it be a persistent store? A rough sketch of this approach follows.
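A rough sketch of that consumer loop (broker address, topic name, group id, and deserializers are assumptions; the map is kept in-memory here for brevity, whereas the answer above suggests a persistent map such as MapDB):

import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "table-builder");    // committed offsets are stored per group.id
props.put("enable.auto.commit", "false");
props.put("key.deserializer", "org.apache.kafka.common.serialization.IntegerDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

Map<Integer, String> latestValues = new ConcurrentHashMap<>(); // replace with a persistent map in practice

try (KafkaConsumer<Integer, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("changelog-topic")); // placeholder topic name
    // If the local file/map is missing, rebuild from scratch, e.g. with
    // consumer.seekToBeginning(consumer.assignment()) once partitions are assigned.
    while (true) {
        ConsumerRecords<Integer, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<Integer, String> record : records) {
            latestValues.put(record.key(), record.value()); // keep only the latest value per key
        }
        consumer.commitSync(); // commit after the batch has been stored
    }
}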
If you want to use Kafka Streams for whatever reason, then an alternative (not as simple as the above) is to use a persistent backing store.
For example, a persistent global store.
streamsBuilder.addGlobalStore(
        Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(topic), keySerde, valueSerde),
        topic,
        Consumed.with(keySerde, valueSerde),
        this::updateValue);
P.S.: There will be a file called .checkpoint in the state directory which stores the offsets. If the topic is deleted in the middle, you get an OffsetOutOfRangeException. You may want to handle this, perhaps by using an UncaughtExceptionHandler.
Refer to https://stackoverflow.com/a/57301986/2534090 for more.
Finally,
it is better to use the Consumer with a persistent file rather than Streams for this, because of the simplicity it offers.