Context:
We are using Kafka Streams to write an application. Generally, we apply transformations to messages from one topic and write the results to another topic (not using joins).
To peek at the output results a little before they flow to the destination topics, we want to add a debug mode which consumes only a certain number of messages (~1000) from the source topic. Messages coming out of Kafka Streams should then be diverted to stdout or files instead of the destination topic. That way, no message is produced to Kafka and we can get a sense of what the result data looks like.
Questions:
Is it possible to have Kafka Streams consume only n messages from the source topic and then close the stream?
I think a ConsumerInterceptor is an option for counting messages. But I don't know whether there is a way to close the Kafka Streams instance once we reach a certain number.
Another potential idea is to make some changes to the topology Kafka Streams creates, e.g. a new source node in the topology which only reads messages in topic X from offset A to offset B, so that we can set the range manually.
I was wondering which approach is feasible, or whether there are other, better solutions. Thanks!
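A minimal sketch of what this debug mode could look like, using a plain consumer loop rather than Kafka Streams; the Iterator stands in for polled records, and the cap value is just an example:

```java
import java.util.Iterator;

// Sketch of the debug mode: take at most maxMessages from the source, write
// them to stdout instead of the destination topic, then stop. The Iterator is
// a stand-in for records returned by a real consumer's poll() loop.
public class DebugModeSketch {
    public static int drainToStdout(Iterator<String> source, int maxMessages) {
        int consumed = 0;
        while (consumed < maxMessages && source.hasNext()) {
            System.out.println(source.next()); // divert to stdout, not Kafka
            consumed++;
        }
        // a real implementation would call consumer.close() or streams.close()
        // here once the cap is reached
        return consumed;
    }
}
```

In a real application the same counter would live in the processing callback, and reaching the cap would trigger `KafkaStreams#close()` from a separate thread.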
Related
Here is my case,
We have a Postgres connector on the source side and a Lambda connector on the sink side. We have implemented this pipeline and it is working.
So we get the database changes (updates, deletes, inserts) in the Lambda function. But we don't want that Lambda function to be called every time there is a change in the database. Only if the topic has at least N changes do we want to send that data to the Lambda function.
The Lambda connector has batch options in its advanced settings, but those are used to cap the maximum batch size. I want to set a minimum batch size, and only after that is reached should the messages in the topic be consumed by the sink connector.
Is it possible to consume messages from a topic only if the topic has at least n messages?
Out of the box, no, and not with Connect framework.
You can use GetOffsetShell or another programmatic way to check the "end offset" of the topic, but this is not an exact "number of messages". Then, if that meets some condition, you can start a consumer.
Plus, producers will keep sending data, and previously processed data is not removed from the topic, so the idea of "the topic having N messages" doesn't really make sense. You would therefore also need to query the committed offsets and calculate the difference (i.e. consumer lag).
Your other option is to always read the topic, fetch the data into an in-memory data structure (a batch), then only process it (and commit the ending offset) once it reaches a certain size. Kafka Streams window functions could help here: window/aggregate the data and send events to a new topic, which then invokes the Lambda.
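The in-memory batching idea can be sketched in plain Java; the String records and the minimum size are stand-ins for real ConsumerRecords and your threshold:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of "buffer records, process only once the batch is big enough".
// A real consumer would poll() records into this buffer and commit offsets
// only after a returned batch has been processed.
public class MinBatchBuffer {
    private final int minBatch;
    private final List<String> buffer = new ArrayList<>();

    public MinBatchBuffer(int minBatch) {
        this.minBatch = minBatch;
    }

    // Add one record; returns the full batch once the minimum size is reached,
    // otherwise an empty list (meaning: keep buffering, do not commit yet).
    public List<String> add(String record) {
        buffer.add(record);
        if (buffer.size() >= minBatch) {
            List<String> batch = new ArrayList<>(buffer);
            buffer.clear();
            return batch; // caller processes the batch, then commits offsets
        }
        return Collections.emptyList();
    }
}
```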
I'm new to Kafka and I have one question: can I delete only ONE message from a topic if I know the topic, the offset, and the partition? And if not, is there any alternative?
It is not possible to remove a single message from a Kafka topic, even if you know its partition and offset.
Keep in mind that Kafka is not a key/value store; a topic is rather an append-only(!) log that represents a stream of data.
If you are looking for alternatives to remove a single message you may
Have your consumer clients ignore that message
Enable log compaction and send a tombstone message
Write a simple job (Kafka Streams) to consume the data, filter out that one message, and produce all other messages to a new topic.
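The filter in that last option boils down to a single predicate. Here it is in plain Java (the partition/offset values a real Streams filter or consume/produce loop would check against are placeholders):

```java
// Sketch of the "filter out that one message" job: keep every record except
// the single (partition, offset) pair we want to drop.
public class SkipOneMessage {
    public static boolean shouldKeep(int partition, long offset,
                                     int badPartition, long badOffset) {
        return partition != badPartition || offset != badOffset;
    }
}
```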
We have a use case where, based on a work item arriving on a worker queue, we need to use the message metadata to decide which Kafka topic to stream our data from. We would have maybe fewer than 100 worker nodes deployed, and each worker node can have a configurable number of threads to receive messages from the queue. So if a worker has "n" threads, we would end up opening Kafka streams to "n" different topics (n is usually less than 10).
Once the worker is done processing the message, we need to close the stream as well.
The worker can receive the next message once it has acked the first one, at which point I need to open a Kafka stream for another topic.
Also, every Kafka stream needs to scan all the partitions (around 5-10) of the topic to filter by a certain attribute.
Can a flow like this work for Kafka streams or is this not an optimal approach?
I am not sure if I fully understand the use case, but it seems to be a "simple" copy-data-from-topic-A-to-topic-B use case, i.e., no data processing/modification. The logic for copying data from the input to the output topic seems complex, though, and thus using Kafka Streams (i.e., Kafka's stream processing library) might not be the best fit, as you need more flexibility.
However, using plain KafkaConsumers and KafkaProducers should allow you to implement what you want.
Avro-encoded messages on a single Kafka topic with a single partition. Each of these messages is to be consumed by one specific consumer only. For example, with messages a1, a2, b1, and c1 on this topic and three consumers named A, B, and C, each consumer would receive all the messages, but ultimately A would consume a1 and a2, B b1, and C c1.
I want to know how typically this is solved when using avro on Kafka:
leave it to the consumers to deserialize the message, then use some application logic to decide whether to consume or drop the message
use partitioning logic to make each message go to a particular partition, then set up each consumer to listen to only a single partition
set up another 3 topics and a tiny Kafka Streams application that does the filtering + routing from the main topic to these 3 specific topics
make use of Kafka headers to inject an identifier for downstream consumers to filter on
Looks like each of the options has its pros and cons. I want to know if there is a convention that people follow, or whether there are other ways of solving this.
It depends...
If you only have a single topic with one partition, the only option is to let each consumer read all the data and filter client-side for the data it is interested in. In this case, each consumer would need to use a different group.id to isolate the consumers from each other.
Option 2 is certainly possible, if you can control the input topic you are reading from. You might still use different group.ids for each consumer, as the consumers seem to represent different applications that should be isolated from each other. The question remains whether this is a good model: the idea of partitions is to provide horizontal scale-out and data-parallel processing, and if each application reads from only one partition, that doesn't align with this model. You would also need to know which data goes into which partition, on both the producer and the consumer side, to get the mapping right. Hence, it implies a "coordination" between producer and consumer, which seems undesirable.
Option 3 seems to indicate that you cannot control the input topic and thus want to branch the data into multiple topics. This is a good approach in general, as topics are a logical categorization of data; however, it would be even better to have 3 topics for the different data to begin with! If you cannot have 3 input topics from the start, Option 3 still provides a good conceptual setup; it won't provide much performance benefit, though, because the Kafka Streams application is required to read and write each record once. The saving you gain is that each application only consumes from one topic, so redundant reads are avoided. If you had, let's say, 100 applications (each interested in only 1/100 of the data), you could cut the load significantly, from a 99x read overhead down to a 1x read plus 1x write overhead. In your case you don't really cut down much, as you go from a 2x read overhead to a 1x read plus 1x write overhead. Additionally, you need to manage the Kafka Streams application itself.
Option 4 seems to be orthogonal, because it answers the question of how the filtering works; headers can be used with Option 1 and Option 3 to do the actual filtering/branching.
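Combining Option 4 with Option 1 can be sketched in plain Java; the "target" header name and the record shape (headers map plus payload) are illustrative, not a fixed convention:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of header-based client-side filtering: each consumer reads everything
// and keeps only records whose "target" header matches its own id.
public class HeaderFilter {
    public static List<String> forConsumer(
            String consumerId,
            List<Map.Entry<Map<String, byte[]>, String>> records) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<Map<String, byte[]>, String> record : records) {
            byte[] target = record.getKey().get("target");
            if (target != null
                    && consumerId.equals(new String(target, StandardCharsets.UTF_8))) {
                kept.add(record.getValue()); // this consumer processes the payload
            }
        }
        return kept;
    }
}
```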
The data in the topic is just bytes, Avro shouldn't matter.
Since you only have one partition, only one consumer of a group can be actively reading the data.
If you only want to process certain offsets, you must either seek to them manually or skip over messages in your poll loop and commit those offsets
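The skip-in-the-poll-loop variant amounts to an offset-range check. A plain-Java sketch (the offsets list stands in for records returned by poll(), and the range bounds are yours to choose):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of "skip over messages in your poll loop": process only records whose
// offsets fall inside [fromOffset, toOffset]. In a real consumer you could
// instead seek(partition, fromOffset) to avoid fetching the earlier records.
public class OffsetRangeFilter {
    public static List<Long> inRange(List<Long> offsets, long fromOffset, long toOffset) {
        List<Long> kept = new ArrayList<>();
        for (long offset : offsets) {
            if (offset >= fromOffset && offset <= toOffset) {
                kept.add(offset); // a real loop would process the record here
            }
        }
        return kept;
    }
}
```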
We have a Kafka producer that produces keyed messages at a very high frequency to topics whose retention time = 10 hours. These messages are real-time updates, and the key used is the ID of the element whose value has changed. So the topic acts as a changelog and will contain many duplicate keys.
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc.), it will somehow construct a table with the latest values of all the keys in the topic, and then keep listening for new updates as normal, keeping the minimum load on the Kafka server and letting the consumer do most of the work. We tried many ways and none of them seems the best.
What we tried:
1 changelog topic + 1 compact topic:
The producer sends the same message to both topics, wrapped in a transaction to ensure a successful send.
Consumer launches and requests the latest offset of the changelog topic.
Consumes the compacted topic from beginning to construct the table.
Continues consuming the changelog since the requested offset.
Cons:
Having duplicates in the compacted topic is very likely, even with the log compaction frequency set as high as possible.
x2 the number of topics on the Kafka server.
KSQL:
With KSQL we either have to write the KTable out as a topic so that consumers can see it (extra topics), or we would need consumers to execute KSQL SELECT queries against the KSQL REST server and query the table (not as fast and performant as the Kafka APIs).
Kafka Consumer API:
The consumer starts and consumes the topic from the beginning. This worked perfectly, but the consumer has to consume the 10-hour changelog to construct the latest-values table.
Kafka Streams:
By using KTables as follows:
KTable<Integer, MarketData> tableFromTopic = streamsBuilder.table("topic_name", Consumed.with(Serdes.Integer(), customSerde));
KTable<Integer, MarketData> filteredTable = tableFromTopic.filter((key, value) -> keys.contains(value.getRiskFactorId()));
Kafka Streams will create one topic on the Kafka server per KTable (named {consumer_app_id}-{topic_name}-STATE-STORE-0000000000-changelog), which will result in a huge number of topics since we have a big number of consumers.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Thanks in advance.
By using KTables, Kafka Streams will create one topic on the Kafka server per KTable, which will result in a huge number of topics since we have a big number of consumers.
If you are just reading an existing topic into a KTable (via StreamsBuilder#table()), then no extra topics are being created by Kafka Streams. Same for KSQL.
It would help if you could clarify what exactly you want to do with the KTable(s). Apparently you are doing something that does result in additional topics being created?
1 changelog topic + 1 compact topic:
Why were you thinking about having two separate topics? Normally, changelog topics should always be compacted. And given your use case description, I don't see a reason why it should not be:
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal [...]
Hence compaction would be very useful for your use case. It would also prevent this problem you described:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Note that, to reconstruct the latest table values, all three of Kafka Streams, KSQL, and the Kafka Consumer must read the table's underlying topic completely (from beginning to end). If that topic is NOT compacted, this might indeed take a long time depending on the data volume, topic retention settings, etc.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Without knowing more about your use case, particularly what you want to do with the KTable(s) once they are populated, my answer would be:
Make sure the "changelog topic" is also compacted.
Try KSQL first. If this doesn't satisfy your needs, try Kafka Streams. If this doesn't satisfy your needs, try the Kafka Consumer.
For example, I wouldn't use the Kafka Consumer if it is supposed to do any stateful processing with the "table" data, because the Kafka Consumer lacks built-in functionality for fault-tolerant stateful processing.
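Enabling compaction on an existing changelog topic can be sketched like this (the topic name and broker address are placeholders):

```shell
# Switch an existing topic's cleanup policy to log compaction.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-changelog \
  --alter --add-config cleanup.policy=compact
```

Note that compaction only deduplicates closed log segments, so some duplicate keys will always remain in the active segment; consumers must still tolerate them.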
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
During the first time your application starts up, what you said is correct.
To avoid this during every restart, store the key-value data in a file.
For example, you might want to use a persistent map (like MapDB).
Since you give the consumer a group.id and commit the offset either periodically or after each record is stored in the map, the next time your application restarts it will read from the last committed offset for that group.id.
So the problem of taking a lot of time occurs only initially (the first time). As long as you have the file, you don't need to consume from the beginning.
In case the file is not there or is deleted, just seekToBeginning in the KafkaConsumer and build it again.
You need to store these key-values somewhere for retrieval anyway, so why can't it be a persistent store?
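A plain-JDK stand-in for the suggested persistent map (MapDB would play the same role, with better durability guarantees); the file name and String key/value types are illustrative:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Keep the latest value per key, persist it to a file, and reload it on
// restart instead of re-reading the whole changelog from the beginning.
public class LatestValuesStore {
    private final Properties table = new Properties();

    public void put(String key, String latestValue) {
        table.setProperty(key, latestValue); // newer value overwrites older one
    }

    public String get(String key) {
        return table.getProperty(key);
    }

    public void save(Path file) {
        try (Writer out = Files.newBufferedWriter(file)) {
            table.store(out, "latest values per key");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public void load(Path file) {
        try (Reader in = Files.newBufferedReader(file)) {
            table.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The consumer would call put() for every record, save() alongside each offset commit, and load() on startup, falling back to seekToBeginning only if the file is missing.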
In case if you want to use Kafka streams for whatever reason, then an alternative (not as simple as the above) is to use a persistent backed store.
For example, a persistent global store.
streamsBuilder.addGlobalStore(
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore(topic), keySerde, valueSerde),
        topic,
        Consumed.with(keySerde, valueSerde),
        this::updateValue);
P.S.: There will be a file called .checkpoint in the state directory which stores the offsets. If the topic is deleted in the middle, you get an OffsetOutOfRangeException. You may want to handle this, perhaps by using an UncaughtExceptionHandler.
Refer to https://stackoverflow.com/a/57301986/2534090 for more.
Finally, it is better to use a Consumer with a persistent file rather than Streams for this, because of the simplicity it offers.