Kafka will internally evict any consumers that are not polling correctly, but we don't want bad nodes to stay up, and we want visibility into which nodes are working, so naturally we want to build out a classical healthcheck.
The obvious way to build a healthcheck is to use KafkaStreams#state and its State#isRunning, but this is completely deceptive. If the underlying network connection to Kafka is down, or any of a dozen other things go wrong that don't cause an internal thread to die, KafkaStreams still reports itself as alive even though the logs are throwing up errors all over the place. Is there any correct way to get visibility into what is happening inside Kafka Streams? Is expecting a sane healthcheck the wrong approach to take when dealing with Kafka?
To be clear, I am not talking about healthchecks of the Kafka cluster, but specifically of the Streams processor node.
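For reference, a minimal sketch of the naive check (plus a state listener, which at least makes transitions visible), assuming a plain KafkaStreams instance:

```java
import org.apache.kafka.streams.KafkaStreams;

// Sketch of the naive check described above, plus a state listener for at
// least some visibility into transitions. The listener must be registered
// before streams.start().
public class StreamsHealth {

    private final KafkaStreams streams;

    public StreamsHealth(KafkaStreams streams) {
        this.streams = streams;
        // Log every state transition; REBALANCING and ERROR become visible
        // here even while isRunning() keeps reporting healthy.
        streams.setStateListener((newState, oldState) ->
                System.out.println("streams state: " + oldState + " -> " + newState));
    }

    // The deceptive check from the question: reflects only the thread-level
    // state machine, regardless of broker connectivity.
    public boolean isHealthy() {
        return streams.state().isRunning();
    }
}
```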
We are thinking about using the Strimzi Kafka Bridge (https://strimzi.io/docs/bridge/latest/#proc-creating-kafka-bridge-consumer-bridge) as an HTTP(S) gateway to an existing Kafka cluster.
The documentation mentions creating consumers with arbitrary names for taking part in a consumer group. These names can subsequently be used to consume messages, seek, or sync offsets, and so on.
The question is: Am I right in assuming the following?
The bridge consumers seem to be created and maintained in just one Kafka Bridge instance.
If I want to use more than one bridge because of fault-tolerance requirements, the name information about a specific consumer will not be available on the other nodes, since there is no synchronization or common storage between the bridge nodes.
So if the clients of the Kafka Bridge are not sticky, as soon as a client communicates with another node (e.g. because of round-robin handling by a load balancer), the consumer information will not be available there, and the HTTP(S) clients must be prepared to reconfigure their consumers on the new node.
The offsets will be lost. Worst case, fetching messages and syncing their offsets will always happen on different nodes.
Or did I overlook anything?
You are right. The state and the Kafka connections are currently not shared in any way between the bridge instances. The general recommendation is that when using consumers, you should run the bridge with only a single replica (and, if needed, deploy different bridge instances for different consumer groups).
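For illustration, creating a named consumer via the bridge's HTTP API might look like the following sketch (Java 11 HttpClient; the bridge address, group, and consumer name are assumptions). It also shows why stickiness matters: the returned base_uri only makes sense against the instance that created the consumer.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hedged sketch: create a named consumer in a group via the bridge's
// POST /consumers/{groupid} endpoint, per the Strimzi docs linked above.
// "http://bridge:8080", "my-group", and "my-consumer" are illustrative.
public class BridgeConsumerSetup {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String body = "{\"name\":\"my-consumer\",\"format\":\"json\","
                + "\"auto.offset.reset\":\"earliest\"}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://bridge:8080/consumers/my-group"))
                .header("Content-Type", "application/vnd.kafka.v2+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // The response carries a base_uri for follow-up calls; since state is
        // per-instance, those calls must hit the same bridge replica.
        System.out.println(response.body());
    }
}
```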
We have a Spring Boot Kafka Streams processor. For various reasons, we may have a situation where we need the process to start and run, but there are no topics we wish to subscribe to. In such cases, we just want the process to "sleep", because other liveness/environment checkers depend on it running. Also, it's part of a RedHat OCP cluster, and we don't want the pod to be constantly doing a crash backoff loop. I fully understand that it'll never really do anything until it's restarted with a valid topic(s), but that's OK.
If we start it with no topics, we get this message:
Failed to start bean 'kStreamBuilder'; nested exception is org.springframework.kafka.KafkaException: Could not start stream: ; nested exception is org.apache.kafka.streams.errors.TopologyException: Invalid topology: Topology has no stream threads and no global threads, must subscribe to at least one source topic or global table.
In a test environment, we could just create a topic that's never written to, but in production, we don't have that flexibility, so a programmatic solution would be best. Ideally, I think, if there's a "null topic" abstraction of some sort (a Kafka "/dev/null"), that would look the cleanest in the code.
Best practices, please?
You can set the autoStartup property on the StreamsBuilderFactoryBean to false and only start() it if you have at least one stream.
If using Spring Boot, it's available as a property:
https://docs.spring.io/spring-boot/docs/current/reference/html/application-properties.html#application-properties.integration.spring.kafka.streams.auto-startup
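A minimal sketch of that approach, assuming spring.kafka.streams.auto-startup=false is set and the topic list arrives via a hypothetical app.stream-topics property:

```java
import java.util.List;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.ApplicationArguments;
import org.springframework.boot.ApplicationRunner;
import org.springframework.kafka.config.StreamsBuilderFactoryBean;
import org.springframework.stereotype.Component;

// Hedged sketch: with auto-startup disabled, start the streams
// infrastructure only when there is at least one topic to subscribe to;
// otherwise the process just stays up and "sleeps".
@Component
public class ConditionalStreamsStarter implements ApplicationRunner {

    private final StreamsBuilderFactoryBean factoryBean;
    private final List<String> topics; // hypothetical property, for illustration

    public ConditionalStreamsStarter(StreamsBuilderFactoryBean factoryBean,
                                     @Value("${app.stream-topics:}") List<String> topics) {
        this.factoryBean = factoryBean;
        this.topics = topics;
    }

    @Override
    public void run(ApplicationArguments args) {
        if (!topics.isEmpty()) {
            factoryBean.start(); // skipped entirely when there is nothing to consume
        }
    }
}
```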
I am messing around with Kafka Streams running on K8s.
It has gone more or less fine so far, yet weird behaviour is observed in the test environment:
[Consumer clientId=dbe-livestream-kafka-streams-77185a88-71a7-40cd-8774-aeecc04054e1-StreamThread-1-consumer, groupId=dbe-livestream-kafka-streams] We received an assignment [_livestream.dbe.tradingcore.sporteventmappings-table-0, _livestream.dbe.tradingcore.sporteventmappings-table-2, _livestream.dbe.tradingcore.sporteventmappings-table-4, _livestream.dbe.tradingcore.sporteventmappings-table-6, livestream.dbe.tennis.results-table-0, livestream.dbe.tennis.results-table-2, livestream.dbe.tennis.results-table-4, livestream.dbe.tennis.results-table-6, _livestream.dbe.betmanager.sporteventmappings-table-0, _livestream.dbe.betmanager.sporteventmappings-table-2, _livestream.dbe.betmanager.sporteventmappings-table-4, _livestream.dbe.betmanager.sporteventmappings-table-6] that doesn't match our current subscription Subscribe(_livestream.dbe.betmanager.sporteventmappings-table|_livestream.dbe.trading_states|_livestream.dbe.tradingcore.sporteventmappings-table|livestream.dbe.tennis.markets|livestream.dbe.tennis.markets-table); it is likely that the subscription has changed since we joined the group. Will try re-join the group with current subscription
As far as I understand, the internal state somehow got broken, and the Stream's source of truth conflicts with the broker/ZooKeeper's one. This behaviour never terminates: I let it hang for a few days while busy with other stuff, and it is still there, reported at the WARN level. More than that: no ERRORs were reported in this time.
I did not change anything; did not deploy new instances; did not manipulate the Kafka brokers in any way that might affect the abovementioned Kafka Streams app. Any ideas what's wrong?
The error message itself indicates that something is wrong with your subscription. This may happen if you have two Kafka Streams instances using the same application.id, but the two don't subscribe to the exact same topics.
In your case, the subscription does not contain livestream.dbe.tennis.results-table, but the corresponding partitions are assigned.
Note that Kafka Streams requires all instances with the same application.id to execute the exact same Topology and thus subscribe to the exact same topics.
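One way to verify this is to print the topology description on each instance and diff the output; a small sketch, assuming builder is your StreamsBuilder:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;

public class TopologyDiff {
    // Print the topology description so it can be diffed across instances;
    // all instances sharing an application.id must produce the same text.
    public static void printTopology(StreamsBuilder builder) {
        Topology topology = builder.build();
        System.out.println(topology.describe());
    }
}
```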
The performance tuning documentation provided by Storm states that, for the absolute best performance, scaling out to multiple parallel topologies can yield better performance than simply scaling workers.
I am trying to benchmark this theory against scaling workers.
However, using version 1.2.1, the Storm Kafka spout is not behaving as I would have expected across multiple different topologies.
When I set a common client.id and group.id for the Kafka spout consumer across all topologies for a single topic, each topology still subscribes to all available partitions and duplicates tuples, with errors being thrown as already-committed tuples are recommitted.
I am surprised by this behaviour as I assumed that the consumer API would support this fairly simple use case.
I would be really grateful if somebody could explain:
what's the implementation logic of this behaviour with the Kafka spout?
is there any way around this problem?
The default behavior for the spout is to assign all partitions for a topic to workers in the topology, using the KafkaConsumer.assign API. This is the behavior you are seeing. With this behavior, you shouldn't be sharing group ids between topologies.
If you want finer control over which partitions are assigned to which workers or topologies, you can implement the TopicFilter interface, and pass it to your KafkaSpoutConfig. This should let you do what you want.
Regarding running multiple topologies being faster, I'm assuming you're referring to this section from the docs: "In multiworker mode, messages often cross worker process boundaries. For performance sensitive cases, if it is possible to configure a topology to run as many single-worker instances [...] it may yield significantly better throughput and latency." The objective here is to avoid sending messages between workers, and instead keep each partition's processing internal to one worker.
If you want to avoid running many topologies, you could look at customizing the Storm scheduler to make it allocate e.g. one full copy of your pipeline in each worker. That way, if you use localOrShuffleGrouping, there will always be a local bolt to send to, so you don't have to go over the network to another worker.
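For illustration, a hedged sketch of that wiring; the broker address, topic name, and ProcessingBolt are invented for this example, and the spout construction follows the storm-kafka-client 1.2.x builder API:

```java
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.TopologyBuilder;

public class LocalPipelineTopology {
    public static TopologyBuilder build() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout",
                new KafkaSpout<>(KafkaSpoutConfig.builder("broker:9092", "my-topic").build()),
                4);
        // localOrShuffleGrouping prefers a bolt task in the same worker, so
        // tuples avoid crossing worker process boundaries when possible.
        // ProcessingBolt is a hypothetical bolt standing in for your pipeline.
        builder.setBolt("processing-bolt", new ProcessingBolt(), 4)
                .localOrShuffleGrouping("kafka-spout");
        return builder;
    }
}
```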
Let's say I have an application which consumes logs from a Kafka cluster. I want the application to periodically check the availability of the cluster and, based on that, perform certain actions. I thought of a few approaches, but I am not sure which one is better, or what the best way to do this is:
Create a MessageProducer and a MessageConsumer. The producer publishes to a heartbeatTopic on the cluster and the consumer looks for it. The issue I see with this is that while the application is only concerned with consuming, the healthcheck has both a producing and a consuming part.
Create a MessageConsumer with a new groupId which continuously polls for new messages. This way the monitoring/healthcheck is doing the same thing the application is supposed to do, which I think is good.
Create a MessageConsumer which does something different from actually consuming the messages, something like listTopics (https://stackoverflow.com/a/47477448/2094963); a hedged sketch of this option follows below.
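For illustration, option 3 might look like this sketch using the Kafka AdminClient; the bootstrap address and timeouts are assumptions, not values from the question:

```java
import java.util.Properties;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

// Hedged sketch: probe the cluster by asking the brokers to list topics.
public class KafkaHealthProbe {
    public static boolean clusterReachable() {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "3000");
        try (AdminClient admin = AdminClient.create(props)) {
            Set<String> names = admin.listTopics().names().get(5, TimeUnit.SECONDS);
            return names != null; // any response means the brokers answered
        } catch (Exception e) {
            return false; // timeout or connection failure: treat as unhealthy
        }
    }
}
```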
Which of these methods is more preferable and why?
Going down a slightly different route here, you could potentially poll ZooKeeper (znode path /brokers/ids) for this information by using the Apache Curator library.
Here's an idea that I tried and that worked: I used Curator's Leader Latch recipe for a similar requirement.
You could create an instance of LeaderLatch and invoke the getLeader() method. If, at every invocation, you get a leader, then it is safe to assume that the cluster is up and running; otherwise there is something wrong with it.
I hope this helps.
EDIT: Adding the zookeeper node path where the leader information is stored.
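For illustration, a hedged sketch of that idea; the connection string and latch path are assumptions:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.Participant;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Hedged sketch of the LeaderLatch approach described above.
public class ZkClusterProbe {
    public static boolean hasLeader() throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zookeeper:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        try (LeaderLatch latch = new LeaderLatch(client, "/healthcheck/latch")) {
            latch.start();
            Participant leader = latch.getLeader();
            // An empty participant id suggests no leader could be determined.
            return leader != null && !leader.getId().isEmpty();
        } finally {
            client.close();
        }
    }
}
```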