Let's say I have an application which consumes logs from a Kafka cluster. I want the application to periodically check the availability of the cluster and perform certain actions based on that. I thought of a few approaches but am not sure which one is better or what the best way to do this is:
Create a MessageProducer and a MessageConsumer. The producer publishes a heartbeat message to a heartbeatTopic on the cluster and the consumer looks for it. The issue I see with this is that, while the application is only concerned with consuming, the healthcheck would have both a producing and a consuming part.
Create a MessageConsumer with a new groupId which continuously polls for new messages. This way the monitoring/healthcheck is doing the same thing which the application is supposed to do, which I think is good.
Create a MessageConsumer which does something different from actually consuming the messages. Something like listTopics (https://stackoverflow.com/a/47477448/2094963) .
Which of these methods is more preferable and why?
Going down a slightly different route here, you could potentially poll zookeeper (znode path - /brokers/ids) for this information by using the Apache Curator library.
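For example, a minimal sketch of that check with Curator (the ZooKeeper connection string is illustrative):

    import java.util.List;

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class BrokerAvailabilityCheck {

        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zookeeper-host:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            // Each live broker registers an ephemeral znode under /brokers/ids,
            // so an empty list suggests no broker is currently available.
            List<String> brokerIds = client.getChildren().forPath("/brokers/ids");
            System.out.println(brokerIds.isEmpty()
                    ? "No brokers registered - cluster looks unavailable"
                    : "Live broker ids: " + brokerIds);

            client.close();
        }
    }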
Here's an idea that I tried and that worked - I used Curator's Leader Latch recipe for a similar requirement.
You could create an instance of LeaderLatch and invoke the getLeader() method. If, at every invocation, you get a leader, then it is safe to assume that the cluster is up and running; otherwise something is wrong with it.
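A rough sketch of that variant, reusing the started CuratorFramework client from the sketch above (the latch path and participant id are illustrative, and I'm treating any exception as "unhealthy"):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.framework.recipes.leader.Participant;

    public class LeaderLatchHealthCheck {

        // client: a started CuratorFramework, as in the previous sketch.
        static boolean clusterLooksHealthy(CuratorFramework client) {
            try {
                LeaderLatch latch = new LeaderLatch(client, "/healthcheck/leader-latch", "healthcheck-1");
                latch.start();
                try {
                    // If a leader can be resolved, assume the ZooKeeper ensemble backing Kafka is reachable.
                    Participant leader = latch.getLeader();
                    return leader.isLeader();
                } finally {
                    latch.close();
                }
            } catch (Exception e) {
                return false; // treat any error as "something is wrong with the cluster"
            }
        }
    }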
I hope this helps.
EDIT: Adding the zookeeper node path where the leader information is stored.
We have a Spring Boot Kafka Streams processor. For various reasons, we may have a situation where we need the process to start and run, but there are no topics we wish to subscribe to. In such cases, we just want the process to "sleep", because other liveness/environment checkers depend on it running. Also, it's part of a RedHat OCP cluster, and we don't want the pod to be constantly doing a crash backoff loop. I fully understand that it'll never really do anything until it's restarted with a valid topic(s), but that's OK.
If we start it with no topics, we get this message: Failed to start bean 'kStreamBuilder'; nested exception is org.springframework.kafka.KafkaException: Could not start stream: ; nested exception is org.apache.kafka.streams.errors.TopologyException: Invalid topology: Topology has no stream threads and no global threads, must subscribe to at least one source topic or global table.
In a test environment, we could just create a topic that's never written to, but in production, we don't have that flexibility, so a programmatic solution would be best. Ideally, I think, if there's a "null topic" abstraction of some sort (a Kafka "/dev/null"), that would look the cleanest in the code.
Best practices, please?
You can set the autoStartup property on the StreamsBuilderFactoryBean to false and only start() it if you have at least one stream.
If using Spring Boot, it's available as a property:
https://docs.spring.io/spring-boot/docs/current/reference/html/application-properties.html#application-properties.integration.spring.kafka.streams.auto-startup
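A minimal sketch of how that could look: set spring.kafka.streams.auto-startup=false in application.properties, then start the factory bean only when there is something to subscribe to. The app.source-topics property and the bean below are my own illustration, not a prescribed API:

    import java.util.List;

    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.boot.context.event.ApplicationReadyEvent;
    import org.springframework.context.event.EventListener;
    import org.springframework.kafka.config.StreamsBuilderFactoryBean;
    import org.springframework.stereotype.Component;

    @Component
    public class ConditionalStreamsStarter {

        private final StreamsBuilderFactoryBean streamsFactory;
        private final List<String> sourceTopics;

        public ConditionalStreamsStarter(StreamsBuilderFactoryBean streamsFactory,
                                         @Value("${app.source-topics:}") List<String> sourceTopics) {
            this.streamsFactory = streamsFactory;
            this.sourceTopics = sourceTopics;
        }

        @EventListener(ApplicationReadyEvent.class)
        public void startStreamsIfTopicsPresent() {
            // Only start the Streams runtime when there is at least one source topic to subscribe to;
            // otherwise the process just keeps running ("sleeping") without building a topology.
            if (!sourceTopics.isEmpty()) {
                streamsFactory.start();
            }
        }
    }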
I want to have a topic deleted after some predefined time of inactivity.
To give you some context, there's a microservice that has many replicas, and each replica has its own topic to communicate, identified by its replica Id (e.g. topic_microservice-name_<random_id>).
If for any reason, a replica crashes, K8s will start another Pod, with a completely different replica Id, therefore the previous topic will not be used anymore. For this reason, after some time there could be many useless topics.
Does Kafka have a built-in time-to-live (TTL) for a whole topic?
Another idea I have is a Quartz job that iterates over all topics, somehow gets the last modified/written date, and checks whether the TTL has expired.
There currently isn't a way to give a topic a TTL, where once the TTL expires Kafka automatically deletes the topic.
One can configure retention at the topic level (retention.ms - how long messages should be retained for this topic, or retention.bytes - the maximum size, in bytes, of the log to retain). With this, you could have a separate service leveraging the AdminClient to execute scheduled operations on your topics. The logic could simply be iterating over the topics, filtering out the active ones, and deleting each topic that has been inactive long enough for the retention policy to take effect.
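A rough sketch of such a cleanup service with the Java AdminClient (Kafka 2.5+ for listOffsets). The bootstrap address and topic prefix are illustrative, and the heuristic "all partitions empty means retention already expired everything written within retention.ms" is my own assumption:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import java.util.Set;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.ListOffsetsResult;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartition;

    public class StaleTopicCleaner {

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                Set<String> topics = admin.listTopics().names().get();
                for (String topic : topics) {
                    // Only consider the per-replica topics (prefix is illustrative).
                    if (!topic.startsWith("topic_microservice-name_")) {
                        continue;
                    }
                    TopicDescription description =
                            admin.describeTopics(Collections.singleton(topic)).all().get().get(topic);

                    Map<TopicPartition, OffsetSpec> earliest = new HashMap<>();
                    Map<TopicPartition, OffsetSpec> latest = new HashMap<>();
                    description.partitions().forEach(p -> {
                        TopicPartition tp = new TopicPartition(topic, p.partition());
                        earliest.put(tp, OffsetSpec.earliest());
                        latest.put(tp, OffsetSpec.latest());
                    });

                    Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> low =
                            admin.listOffsets(earliest).all().get();
                    Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> high =
                            admin.listOffsets(latest).all().get();

                    // If every partition is empty, retention has already removed everything,
                    // i.e. nothing was written for at least retention.ms -> treat as inactive.
                    boolean inactive = low.keySet().stream()
                            .allMatch(tp -> low.get(tp).offset() == high.get(tp).offset());
                    if (inactive) {
                        admin.deleteTopics(Collections.singleton(topic)).all().get();
                        System.out.println("Deleted inactive topic " + topic);
                    }
                }
            }
        }
    }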
The original question as to whether a Kafka topic actually has a TTL has already been answered (the answer is NO, as of writing this).
This answer deals with several ways to handle deletion of topics with respect to your scenario.
Write a container preStop hook
Here you can execute the topic deletion code upon pod termination. This could be a simple approach.
Hook implementations include an exec command or an HTTP call.
You could, for example, include a small wrapper script on top of kafka-topics.sh, or a simple Python script that connects to the broker and deletes the topic.
You might also want to take note of terminationGracePeriodSeconds and increase it accordingly if your topic deletion script takes longer than this value.
Get notified using Kubernetes Watch APIs
You would need to write a client that listens for pod termination events and uses the AdminClient to delete the topics corresponding to the terminated pod. This typically needs to run separately from the terminated pod itself.
Find out which topics need to be deleted by getting the list of active pods.
Retrieve the pod replicas available in the Kubernetes cluster using the Kubernetes API.
Iterate over all the topics and delete those that do not correspond to any pod in the retrieved list (a rough sketch follows).
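A hedged sketch of that reconciliation, assuming the fabric8 Kubernetes client (6.x) and the Kafka AdminClient. The namespace, label selectors, topic prefix, and the way a replica id maps to a topic name are all assumptions:

    import java.util.List;
    import java.util.Properties;
    import java.util.Set;
    import java.util.stream.Collectors;

    import io.fabric8.kubernetes.client.KubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClientBuilder;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;

    public class TopicPodReconciler {

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

            try (KubernetesClient k8s = new KubernetesClientBuilder().build();
                 AdminClient admin = AdminClient.create(props)) {

                // Topics expected to exist: one per live pod (naming convention is illustrative).
                Set<String> expectedTopics = k8s.pods()
                        .inNamespace("my-namespace")
                        .withLabel("app", "microservice-name")
                        .list().getItems().stream()
                        .map(pod -> "topic_microservice-name_" + pod.getMetadata().getLabels().get("replica-id"))
                        .collect(Collectors.toSet());

                // Existing replica topics that no longer correspond to a live pod.
                List<String> staleTopics = admin.listTopics().names().get().stream()
                        .filter(t -> t.startsWith("topic_microservice-name_"))
                        .filter(t -> !expectedTopics.contains(t))
                        .collect(Collectors.toList());

                if (!staleTopics.isEmpty()) {
                    admin.deleteTopics(staleTopics).all().get();
                    System.out.println("Deleted stale topics: " + staleTopics);
                }
            }
        }
    }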
P.S:
Note that the deletion of topics is an administrative task and it is typically done manually after some verification checks.
Creating a lot of topics isn't recommended, as maintenance becomes difficult. If your applications are creating a lot of topics, e.g. as many as the number of workload instances running, then it might be time to rethink your application design.
We are working on an application in which we are using Kafka.
The components of the application are as follows:
We have a microservice which gets the requests and pushes the messages to a Kafka topic (let's say ServiceA).
Another microservice which consumes the messages from the topic and pushes the data to a datastore (let's say ServiceB).
I am clear about the ServiceA part of the application but have some design confusion about the ServiceB part.
For ServiceB we are planning a REST API:
Is it good to bundle the consumer and the controllers in a single application?
For the consumer I am planning to go with a consumer group with multiple consumers to achieve more throughput. Is there any better and more efficient approach?
Should I take the consumer part out of ServiceB and make it a separate, independent service?
If we are bundling it inside ServiceB, should I configure the consumer as a listener? (We are going with Spring Boot for the microservices.)
Thanks in advance!
Is it good to bundle the consumer and the controllers in a single application?
It's good to bundle things together by context; having a listener which forwards to another service for control makes no sense, in my opinion. But consider splitting up controllers by context if necessary. As Martin Fowler says: start with a monolith first and then split up (https://martinfowler.com/bliki/MonolithFirst.html).
For the consumer I am planning to go with a consumer group with multiple consumers to achieve more throughput. Is there any better and more efficient approach?
A consumer group makes sense if you think about scaling your ServiceB out. If you want to have this possibility in the future, start with one instance of ServiceB inside the consumer group. If you use something like Kubernetes, it's simple to deploy more instances of your service later on if required. But do not invest too much in an eventual future: start simple, do some monitoring, and if you find bottlenecks, then act. One more thing to keep in mind is that Kafka by default keeps messages for a long time (I believe 7 days by default), so if you think in a classical message-broker style you could get a lot of duplicates of your messages. Think about an update message, raised whenever something changes, that is still sitting in the topic when your service starts. Maybe reducing retention.ms would be an option, but take care not to lose messages.
Should I take the consumer part out of ServiceB and make it a separate, independent service?
No, I think not.
If we are bundling it inside ServiceB, should I configure the consumer as a listener? (We are going with Spring Boot for the microservices.)
Yes :-)
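A minimal sketch of such a listener with Spring for Apache Kafka (topic name, group id, and concurrency are illustrative placeholders):

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.stereotype.Component;

    @Component
    public class ServiceBConsumer {

        // concurrency = "3" runs three consumer threads in this instance; together with the
        // other ServiceB instances they form one consumer group sharing the topic's partitions.
        @KafkaListener(topics = "service-a-events", groupId = "service-b", concurrency = "3")
        public void onMessage(ConsumerRecord<String, String> record) {
            // persist record.value() to the datastore here
        }
    }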
I have a Kafka consumer that I create on a schedule. It attempts to consume all of the new messages that have been added since the last commit was made.
I would like to shut the consumer down once it consumes all of the new messages in the log instead of waiting indefinitely for new messages to come in.
I'm having trouble finding a solution via Kafka's documentation.
I see a number of timeout related properties available in the Confluent.Kafka.ConsumerConfig and ClientConfig classes, including FetchWaitMaxMs, but am unable to decipher which to use. I'm using the .NET client.
Any advice would be appreciated.
I have found a solution. Version 1.0.0-beta2 of Confluent's .NET Kafka library provides a method called .Consume(TimeSpan timeSpan). This will return null if there are no new messages to consume or if we're at the partition EOF. I was previously using the .Consume(CancellationToken cancellationToken) overload which was blocking and preventing me from shutting down the consumer. More here: https://github.com/confluentinc/confluent-kafka-dotnet/issues/614#issuecomment-433848857
Another option was to upgrade to version 1.0.0-beta3 which provides a boolean flag on the ConsumeResult object called IsPartitionEOF. This is what I was initially looking for - a way to know when I've reached the end of the partition.
I have never used the .NET client, but assuming it cannot be all that different from the Java client, the poll() method should accept a timeout value in milliseconds, so setting that to 5000 should work in most cases. No need to fiddle with config classes.
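For reference, a sketch of that pattern with the Java client (the bootstrap address, group id, topic, and processing are placeholders; the idea should translate to the .NET client's Consume(TimeSpan)):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class DrainAndStopConsumer {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "scheduled-drainer");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                    if (records.isEmpty()) {
                        break; // nothing arrived within the timeout, assume we are caught up
                    }
                    records.forEach(r -> System.out.println(r.value())); // real processing goes here
                    consumer.commitSync();
                }
            }
        }
    }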
Another approach is to find the maximum offset at the time that your consumer is created, and only read up until that offset. This would theoretically prevent your consumer from running indefinitely if, by any chance, it is not consuming as fast as producers produce. But I have never tried that approach.
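And a sketch of that "snapshot the end offsets first" approach, again with the Java client; it assumes manual assignment so positions can be compared against the snapshot, and the consumer is configured as in the previous sketch:

    import java.time.Duration;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class OffsetBoundedDrainer {

        static void drainToCurrentEnd(KafkaConsumer<String, String> consumer, String topic) {
            List<TopicPartition> partitions = consumer.partitionsFor(topic).stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);

            // Snapshot of where each partition ends right now.
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);

            boolean caughtUp = false;
            while (!caughtUp) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r -> System.out.println(r.value())); // real processing goes here
                consumer.commitSync();
                // Stop once every partition's position has reached the snapshot taken above.
                caughtUp = partitions.stream()
                        .allMatch(tp -> consumer.position(tp) >= endOffsets.get(tp));
            }
        }
    }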
I am building a correlated system using Kafka. Suppose there's a service A that performs data processing and there are thousands of clients B that submit jobs to it. Bs are short-lived: they appear on the network, push their data to A, and then two important things happen:
B will immediately receive a status from A;
B then will either drop out completely, stay online to receive further updates on status, or will sporadically pop back on to check the status.
(this is not dissimilar to grid computing or MPI).
Both points should be achieved using the well-known concept of a correlationId: B possesses a unique id (a UUID in my case), which it sends to A in the message headers; A, in turn, uses it as a Reply-To topic to send status updates to. Which means A has to create topics on the fly; they can't be predetermined.
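For illustration, this is roughly what B's side of that convention looks like with the plain Java producer (the topic name and header key are just my convention, not anything Kafka-defined):

    import java.nio.charset.StandardCharsets;
    import java.util.Properties;
    import java.util.UUID;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ClientB {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

            String correlationId = UUID.randomUUID().toString();
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("jobs-for-A", correlationId, "job payload");
            // A reads this header and replies on the topic named "reply-" + correlationId.
            record.headers().add("correlationId", correlationId.getBytes(StandardCharsets.UTF_8));

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(record);
            }
        }
    }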
I have auto.create.topics.enable switched on, and it indeed creates topics dynamically, but existing consumers are not aware of them and need to be restarted [to fetch topic metadata, I suppose, if I understood the docs right]. I also checked the consumer's metadata.max.age.ms setting, but it doesn't seem to help, even if I set it to a very low value.
As far as I've read, this is as yet unanswered (e.g. kafka filtering/Dynamic topic creation, kafka consumer to dynamically detect topics added, Can a Kafka producer create topics and partitions?) or answered unsatisfactorily.
As there are hundreds of As and thousands of Bs, I can't possibly use shared topics or anything like that, lest I overload my network. I can use Kafka's AdminTools, or whatever it's called, to pre-create topics, but I find that somewhat silly (even though I've seen real-life examples of people using it to talk to the ZooKeeper and Kafka infrastructure itself).
So the question is, is there a way to dynamically create Kafka topics such that both consumer and producer become aware of them without being restarted or anything? And, in the worst case, will the AdminTools really help, and on which side must I use them - A or B?
Kafka 0.11, Java 8
UPDATE
Creating topics with the AdminClient doesn't help for whatever reason; consumers still throw LEADER_NOT_AVAILABLE when I try to subscribe.
OK, so I'll answer my own question.
Creating topics with the AdminClient works only if it is done before the corresponding consumers are created (see the sketch below).
I changed my topology, taking 1) into account and introducing an exchange of correlation ids in message headers (same as in JMS). I also had to implement certain topology-management mechanisms, grouping Bs into containers.
It should be noted that, as many people have said, this only works when Bs are in single-consumer groups and listen to topics with 1 partition.
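For completeness, a sketch of the ordering from point 1) with the AdminClient. The bootstrap address, reply-topic naming, and partition/replication settings are illustrative, and consumerProps stands for the usual single-member-group consumer configuration (omitted here):

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ReplySubscriber {

        static KafkaConsumer<String, String> subscribeToReplyTopic(String correlationId,
                                                                   Properties consumerProps) throws Exception {
            Properties adminProps = new Properties();
            adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

            // 1. Create the reply topic first and block until the creation has completed...
            try (AdminClient admin = AdminClient.create(adminProps)) {
                admin.createTopics(Collections.singleton(new NewTopic("reply-" + correlationId, 1, (short) 1)))
                     .all()
                     .get();
            }

            // 2. ...and only then create and subscribe the consumer, so it does not race the
            //    topic creation and hit LEADER_NOT_AVAILABLE.
            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
            consumer.subscribe(Collections.singletonList("reply-" + correlationId));
            return consumer;
        }
    }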
To get some idea of the work I'm doing, you might have a look at the middleware framework I've been working on: https://github.com/ikonkere/magic.
Creating an unbounded number of topics is not recommended. I'd advise redesigning your topology/system.
I've thought of making dynamic topics myself but then realized that eventually ZooKeeper will fail, as it will run out of memory due to stale topics (imagine how many topics could have been created a year from now). Maybe this could work if you make sure there is some upper bound on the topics ever created. Overall, an administrative headache.
If you look up using Kafka for request-response, you will find others also say it is awkward to do so (Does Kafka support request response messaging).