Can I have/build a "null topology" with Kafka Streams? - apache-kafka

We have a Spring Boot Kafka Streams processor. For various reasons, we may have a situation where we need the process to start and run, but there are no topics we wish to subscribe to. In such cases, we just want the process to "sleep", because other liveness/environment checkers depend on it running. Also, it's part of a RedHat OCP cluster, and we don't want the pod to be constantly stuck in a crash-backoff loop. I fully understand that it'll never really do anything until it's restarted with one or more valid topics, but that's OK.
If we start it with no topics, we get this message:
Failed to start bean 'kStreamBuilder'; nested exception is org.springframework.kafka.KafkaException: Could not start stream: ; nested exception is org.apache.kafka.streams.errors.TopologyException: Invalid topology: Topology has no stream threads and no global threads, must subscribe to at least one source topic or global table.
In a test environment, we could just create a topic that's never written to, but in production, we don't have that flexibility, so a programmatic solution would be best. Ideally, I think, if there's a "null topic" abstraction of some sort (a Kafka "/dev/null"), that would look the cleanest in the code.
Best practices, please?

You can set the autoStartup property on the StreamsBuilderFactoryBean to false and only start() it if you have at least one stream.
If using Spring Boot, it's available as a property:
https://docs.spring.io/spring-boot/docs/current/reference/html/application-properties.html#application-properties.integration.spring.kafka.streams.auto-startup
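For example, here is a minimal sketch assuming the property above is set to false and Spring Kafka's StreamsBuilderFactoryBean is on the classpath; the component name, method name, and the topics input are illustrative, not part of the original question:

    import java.util.List;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.kafka.config.StreamsBuilderFactoryBean;
    import org.springframework.stereotype.Component;

    // Assumes application.properties contains:
    //   spring.kafka.streams.auto-startup=false
    @Component
    public class ConditionalStreamsStarter {

        @Autowired
        private StreamsBuilderFactoryBean factoryBean;

        public void startIfNeeded(List<String> topics) {
            if (!topics.isEmpty()) {
                factoryBean.start();   // build and start the KafkaStreams instance only when there is something to consume
            }
            // otherwise leave the factory bean stopped; the application keeps running and stays "asleep"
        }
    }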

Related

How do you correctly write a Kafka Streams healthcheck?

Kafka will internally kill any nodes that are not correctly polling, but we don't want bad nodes to stay up and we want visibility of which nodes are working, so naturally we want to build out a classical healthcheck.
The obvious way to build a healthcheck is to use KafkaStreams#state#isRunning, but this is completely deceptive. If the underlying network connection to Kafka is down, or any of a dozen other things go wrong that don't cause an internal thread to die, KafkaStreams still reports itself as alive even though the logs are throwing up errors all over the place. Is there any correct way to get visibility into what is happening inside Kafka Streams? Is expecting a sane healthcheck the wrong approach to take when dealing with Kafka?
To be clear, I am not talking about healthchecks of the Kafka cluster, specifically the Streams processor node.
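For reference, a sketch of the state-based check the question describes, using the public KafkaStreams#state() API; as noted above it can still report RUNNING while broker connections are failing, so it is at best a liveness signal (class and method names are illustrative):

    import org.apache.kafka.streams.KafkaStreams;

    public class StreamsStateCheckSketch {
        // Treats RUNNING and REBALANCING as "alive"; this does not detect broker connectivity problems.
        public static boolean isLive(KafkaStreams streams) {
            KafkaStreams.State state = streams.state();
            return state == KafkaStreams.State.RUNNING || state == KafkaStreams.State.REBALANCING;
        }
    }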

Kafka Streams stateStores fault tolerance exactly once?

We're trying to achieve a deduplication service using Kafka Streams.
The big picture is that it will use its RocksDB state store in order to check existing keys during processing.
Please correct me if I'm wrong, but to make those stateStores fault tolerant too, the Kafka Streams API will transparently copy the values in the stateStore into a Kafka topic (called the changelog).
That way, if our service goes down, another service will be able to rebuild its stateStore from the changelog found in Kafka.
But it raises a question in my mind: is this "stateStore --> changelog" step itself exactly once?
I mean, when the service updates its stateStore, does it update the changelog in an exactly-once fashion too?
If the service crashes, another one will take over the load, but can we be sure it won't miss a stateStore update from the crashed service?
The short answer is yes.
Using transactions (atomic multi-partition writes), Kafka Streams ensures that when the offset commit is performed, the state store has also been flushed to the changelog topic on the brokers. These operations are atomic, so if one of them fails, the application will reprocess messages from the previous offset position.
You can read more about exactly-once semantics in the following blog post: https://www.confluent.io/blog/enabling-exactly-kafka-streams/. There is a section "How Kafka Streams Guarantees Exactly-Once Processing".
But it raises a question in my mind: is this "stateStore --> changelog" step itself exactly once?
Yes -- as others have already said here. You must of course configure your application to use exactly-once semantics via the configuration parameter processing.guarantee, see https://kafka.apache.org/21/documentation/streams/developer-guide/config-streams.html#processing-guarantee (this link is for Apache Kafka 2.1).
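A minimal configuration sketch for that parameter, assuming the plain Streams API; the application id and bootstrap address are illustrative placeholders, and StreamsConfig.EXACTLY_ONCE corresponds to the "exactly_once" value documented for Kafka 2.1:

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class ExactlyOnceConfigSketch {
        public static Properties streamsConfig() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dedup-service");      // illustrative id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");     // illustrative address
            // Default is "at_least_once"; this switches offset commits and changelog
            // writes to the transactional, exactly-once behaviour described above.
            props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
            return props;
        }
    }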
We're trying to achieve a deduplication service using Kafka Streams. The big picture is that it will use its RocksDB state store in order to check existing keys during processing.
There's also an event de-duplication example application available at https://github.com/confluentinc/kafka-streams-examples/blob/5.1.0-post/src/test/java/io/confluent/examples/streams/EventDeduplicationLambdaIntegrationTest.java. This link points to the repo branch for Confluent Platform 5.1.0, which uses Apache Kafka 2.1.0, the latest version of Kafka available at the time of writing.
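For a rough idea of the shape of such a service, here is a minimal sketch loosely modeled on the linked example (not taken from it); the topic names, store name, and class name are illustrative:

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;
    import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
    import org.apache.kafka.streams.kstream.ValueTransformerWithKeySupplier;
    import org.apache.kafka.streams.processor.ProcessorContext;
    import org.apache.kafka.streams.state.KeyValueStore;
    import org.apache.kafka.streams.state.Stores;

    public class DedupTopologySketch {

        public static Topology build() {
            StreamsBuilder builder = new StreamsBuilder();

            // The store is changelog-backed by default, i.e. Kafka Streams mirrors it
            // into an internal "<application.id>-dedup-store-changelog" topic.
            builder.addStateStore(Stores.keyValueStoreBuilder(
                    Stores.persistentKeyValueStore("dedup-store"),
                    Serdes.String(), Serdes.Long()));

            ValueTransformerWithKeySupplier<String, String, String> dedup =
                    () -> new ValueTransformerWithKey<String, String, String>() {
                        private KeyValueStore<String, Long> seen;

                        @Override
                        @SuppressWarnings("unchecked")
                        public void init(ProcessorContext context) {
                            seen = (KeyValueStore<String, Long>) context.getStateStore("dedup-store");
                        }

                        @Override
                        public String transform(String key, String value) {
                            if (seen.get(key) != null) {
                                return null;                         // duplicate key -> drop
                            }
                            seen.put(key, System.currentTimeMillis()); // remember first sighting
                            return value;
                        }

                        @Override
                        public void close() { }
                    };

            builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
                   .transformValues(dedup, "dedup-store")
                   .filter((key, value) -> value != null)            // discard the dropped duplicates
                   .to("deduped-topic", Produced.with(Serdes.String(), Serdes.String()));

            return builder.build();
        }
    }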

How do I stop attempting to consume messages off of Kafka when at the end of the log?

I have a Kafka consumer that I create on a schedule. It attempts to consume all of the new messages that have been added since the last commit was made.
I would like to shut the consumer down once it consumes all of the new messages in the log instead of waiting indefinitely for new messages to come in.
I'm having trouble finding a solution via Kafka's documentation.
I see a number of timeout related properties available in the Confluent.Kafka.ConsumerConfig and ClientConfig classes, including FetchWaitMaxMs, but am unable to decipher which to use. I'm using the .NET client.
Any advice would be appreciated.
I have found a solution. Version 1.0.0-beta2 of Confluent's .NET Kafka library provides a method called .Consume(TimeSpan timeSpan). This will return null if there are no new messages to consume or if we're at the partition EOF. I was previously using the .Consume(CancellationToken cancellationToken) overload which was blocking and preventing me from shutting down the consumer. More here: https://github.com/confluentinc/confluent-kafka-dotnet/issues/614#issuecomment-433848857
Another option was to upgrade to version 1.0.0-beta3 which provides a boolean flag on the ConsumeResult object called IsPartitionEOF. This is what I was initially looking for - a way to know when I've reached the end of the partition.
I have never used the .NET client, but assuming it cannot be all that different from the Java client, the poll() method should accept a timeout value in milliseconds, so setting that to 5000 should work in most cases. No need to fiddle with config classes.
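For example, with the Java client a sketch along those lines might look like this; the topic name is illustrative and the consumer is assumed to be fully configured elsewhere:

    import java.time.Duration;
    import java.util.Collections;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class DrainingConsumerSketch {
        public static void drain(KafkaConsumer<String, String> consumer) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(5000));
                if (records.isEmpty()) {
                    break;                                           // nothing within the timeout: assume caught up
                }
                records.forEach(r -> System.out.println(r.value())); // stand-in for real processing
                consumer.commitSync();
            }
            consumer.close();
        }
    }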
Another approach is to find the maximum offset at the time that your consumer is created, and only read up until that offset. This would theoretically prevent your consumer from running indefinitely if, by any chance, it is not consuming as fast as producers produce. But I have never tried that approach.
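A sketch of that second idea with the Java client's endOffsets()/position() APIs might look like the following; as the answer says it is untried, and the consumer is assumed to already be subscribed, so records produced after the snapshot are left for the next run:

    import java.time.Duration;
    import java.util.Map;
    import java.util.Set;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class BoundedConsumerSketch {
        public static void consumeUpToSnapshot(KafkaConsumer<String, String> consumer) {
            consumer.poll(Duration.ZERO);                            // trigger partition assignment
            Set<TopicPartition> assignment = consumer.assignment();
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(assignment);

            boolean caughtUp = assignment.isEmpty();
            while (!caughtUp) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r -> System.out.println(r.value())); // stand-in for real processing
                consumer.commitSync();
                // Done once every partition's position has reached the snapshotted end offset.
                caughtUp = assignment.stream()
                        .allMatch(tp -> consumer.position(tp) >= endOffsets.get(tp));
            }
        }
    }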

Kafka Streams application not consuming data after restart

After I restarted our Kafka cluster, my Kafka Streams application didn't receive messages from its input topic and I got a "can't create internal topic" exception. After some research, I reset the input topic and the application with the Kafka tool kafka-streams-application-reset.sh.
Unfortunately, that didn't resolve the problem and I got the exception again.
From the error message, you can infer that the topic already exists and thus cannot be created. The reason for the failure is that the existing topic does not have the expected number of partitions (it has 1 instead of 150) -- if the number of partitions matched, Kafka Streams would just use the existing topic.
This can happen if you have topic auto-create enabled at the brokers (and the topic was created with the wrong number of partitions), or if the number of partitions of your input topic changed. Kafka Streams does not automatically change the number of partitions for the repartition topic, because this might result in data corruption and thus lead to incorrect results.
One way to fix this is to manually delete the topic: note that this might result in data loss, so you should only do this if you know that it is what you want.
Another (better) way would be to reset the application cleanly using bin/kafka-streams-application-reset.sh in combination with KafkaStreams#cleanUp().
Because you need to clean up the application, and users should be aware of the implications, Kafka Streams fails to make the user aware of the issue instead of "auto-magically" taking actions that might be undesired from a user's point of view.
Check out the docs for more details. There is also a blog post that explains application reset in detail:
https://kafka.apache.org/11/documentation/streams/developer-guide/app-reset-tool.html
https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
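For the local-state half of such a reset, a minimal sketch (class and method names are illustrative) is to call KafkaStreams#cleanUp() before start(), while the reset script handles the broker-side offsets and internal topics:

    import java.util.Properties;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.Topology;

    public class CleanRestartSketch {
        public static KafkaStreams startAfterReset(Topology topology, Properties props) {
            KafkaStreams streams = new KafkaStreams(topology, props);
            streams.cleanUp();   // wipe the local state directory so it is rebuilt from the changelogs;
                                 // must be called while the instance is not running
            streams.start();
            return streams;
        }
    }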

Best way to perform Kafka consumer healthcheck.

Let's say I have an application which consumes logs from kafka cluster. I want the application to periodically check for the availability of the cluster and based on that perform certain actions. I thought of a few approaches but was not sure which one is better or what is the best way to do this:
Create a MessageProducer and MessageConsumer. The producer publishes to a heartbeatTopic on the cluster and the consumer looks for it. The issue I see with this is that, while the application is only concerned with consuming, the healthcheck has both a producing and a consuming part.
Create a MessageConsumer with a new groupId which continuously polls for new messages. This way the monitoring/healthcheck is doing the same thing that the application is supposed to do, which I think is good.
Create a MessageConsumer which does something different from actually consuming the messages, something like listTopics (https://stackoverflow.com/a/47477448/2094963).
Which of these methods is more preferable and why?
Going down a slightly different route here, you could potentially poll zookeeper (znode path - /brokers/ids) for this information by using the Apache Curator library.
Here's an idea that I tried and that worked: I used Curator's Leader Latch recipe for a similar requirement.
You could create an instance of LeaderLatch and invoke the getLeader() method. If at every invocation you get a leader, then it is safe to assume that the cluster is up and running; otherwise there is something wrong with it.
I hope this helps.
EDIT: Adding the zookeeper node path where the leader information is stored.
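One way to read the /brokers/ids suggestion above as code, sketched with Curator; the connection string and class name are illustrative, and this checks broker registration in ZooKeeper rather than the latch itself:

    import java.util.List;
    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class BrokerRegistrationCheckSketch {
        public static boolean brokersRegistered() throws Exception {
            try (CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zookeeper:2181", new ExponentialBackoffRetry(1000, 3))) {
                client.start();
                client.blockUntilConnected();
                // Each live broker registers an ephemeral child znode under /brokers/ids.
                List<String> brokerIds = client.getChildren().forPath("/brokers/ids");
                return !brokerIds.isEmpty();
            }
        }
    }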