kafka-manager process getting automatically killed? - apache-kafka

I am using kafka-manager version 1.1 to monitor Kafka; the Kafka version is 2.11-1.1.0 and the ZooKeeper version is 3.4.10.
ZooKeeper is working fine, but sometimes the Kafka brokers get killed automatically: the terminal shows only a single message, "Killed", and the process stops. The same thing happens to kafka-manager.
I have 64GB of RAM, and when I checked usage I found that all processes together use only about 1GB, leaving roughly 63GB free, so I don't think memory is the issue. But I could not find out why the processes are being killed.
Please suggest what I need to change.
Thanks in advance.
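If it helps to narrow this down, I assume the kernel log would show whether the OOM killer (or something else sending SIGKILL) is responsible, since a bare "Killed" message usually means the process received SIGKILL. A rough sketch of the checks I plan to run (commands and log locations may vary by distribution):
# look for OOM-killer activity around the time the process died
dmesg -T | grep -i -E 'out of memory|oom|killed process'
# on systemd machines the kernel log is also available via journalctl
journalctl -k --since "1 hour ago" | grep -i oom
# check whether a per-process limit applies to the shell that starts Kafka
ulimit -a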

Related

Kafka is not restarting - cleared logs, disk space. Restarted but it turns off again and again

We run a talent system and have installed ZooKeeper and Kafka on our AWS instance to send our requests to the core engine to get matches.
On our UI we are getting the error:
NoBrokersAvailable
and when we check Kafka is down. We restart it and it's still down. We checked for the logs and cleared it and also we checked cleared disk space.
Still the same problem of Kafka not starting. What should we do?
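For reference, this is roughly how we checked (the paths are placeholders for our actual install locations on the AWS instance):
# the tail of the broker log usually contains the fatal error that stopped it
tail -n 200 /opt/kafka/logs/server.log
# confirm free space on the volume holding the Kafka data directory
df -h /var/lib/kafka
# confirm which data directory the broker is configured to use
grep '^log.dirs' /opt/kafka/config/server.properties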

Kafka broker occasionally takes much longer than usual to load logs on startup

We are observing that Kafka brokers occasionally take much longer than usual to load logs on startup: 40 minutes instead of at most 1 minute. This happens during a rolling restart following the procedure described by Confluent, after the broker has reported that controlled shutdown was successful.
Kafka Setup
Confluent Platform 5.5.0
Kafka Version 2.5.0
3 Replicas (minimum 2 in sync)
Controlled broker shutdown enabled
1TB of AWS EBS for Kafka log storage
Other potentially useful information
We make extensive use of Kafka Streams
We use exactly-once processing and transactional producers/consumers
Observations
It is not always the same broker that takes a long time.
It does not only occur when the broker is the active controller.
A log partition that loads quickly (15ms) can take a long time (9549 ms) for the same broker a day later.
We experienced this issue before on Kafka 2.4.0 but after upgrading to 2.5.0 it did not occur for a few weeks.
Does anyone have an idea what could be causing this? Or what additional information would be useful to track down the issue?
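One thing we have not ruled out (an assumption on our side, not a confirmed cause): if a broker decides it has to run log recovery on startup, that work is parallelised by num.recovery.threads.per.data.dir, which defaults to 1. A sketch of what we are checking and may raise (the server.properties path is a guess for our setup):
# is log recovery parallelism still at the default of 1?
grep -n 'num.recovery.threads.per.data.dir' /etc/kafka/server.properties
# if unset it defaults to 1; raising it (e.g. to 8) lets the broker load/recover logs in parallel per data dir
# num.recovery.threads.per.data.dir=8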

When one Broker has a problem, what is the best way to resolve the situation?

Suppose I have three brokers running in a Kafka cluster and one of them fails due to an error, so I only have two running brokers left.
1) Usually, when this happens, will restarting the failed broker solve the problem?
2) If restarting the broker doesn't solve the problem, can I erase all the data the failed broker had and restart it (since all the data would be restored from the two other brokers)? Is this method okay in production? If not, why?
When I was testing Kafka on my Windows 10 desktop a long time ago, if a broker had an error and restarting the server didn't work, I erased all the data and then it began to run okay. (I am aware of the Kafka-on-Windows issues.) So I am curious whether this would also work in a clustered Kafka (Linux) environment.
Ultimately, it depends on what the error is. If it is a networking error, then there is nothing necessarily wrong with the logs, so you should leave them alone (unless they are not being replicated properly).
The main downside of deleting all the data from a broker is that some topics may have only one replica, and it may be on that node. Or, if you lose other brokers while replication is catching up, then all the data is potentially gone. Also, if you have many TB of data replicating back to one node, you have to be aware of any disk/network contention that may occur, and consider throttling the replication (which could mean it takes hours for the node to become healthy again).
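For the throttling part, a rough sketch using the broker-level dynamic configs (broker id 1 and the 10 MB/s rate are placeholders; newer Kafka versions accept --bootstrap-server for this, older ones use --zookeeper, and the rate only applies to replicas listed in the topic-level *.replication.throttled.replicas configs):
# cap leader- and follower-side replication traffic for broker 1 at ~10 MB/s
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type brokers --entity-name 1 \
  --alter --add-config leader.replication.throttled.rate=10485760,follower.replication.throttled.rate=10485760
# remove the throttle once the rebuilt broker is back in the ISR
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type brokers --entity-name 1 \
  --alter --delete-config leader.replication.throttled.rate,follower.replication.throttled.rate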
But yes, Windows and Linux ultimately work the same in this regard, and it is one way to address the situation in a clustered environment.

Kafka partitions and offsets disappeared

My Kafka clients are running in the GCP App Engine Flex environment with autoscaling enabled (GCP keeps the instance count at a minimum of two, and it has mostly stayed at 2 due to low CPU usage). The consumer groups running in those 2 VMs have been consuming messages from various topics with 20 partitions for several months. Recently I noticed that the partitions in older topics shrank to just 1 (!) and the offsets for that consumer group were reset to 0. The topic-[partition] directories were also gone from the kafka-logs directory. Strangely, recently created topic partitions are intact. I have 3 different environments (all in GCP) and this happened in all three. We didn't see any lost messages or data problems, but we want to understand what happened so we can avoid it happening again.
The Kafka broker and ZooKeeper are running on the same, single GCP Compute Engine instance (I know it's not best practice and I plan to improve this), and I suspect it has something to do with a machine restart wiping out some information. However, I verified that the data files are written under the /opt/bitnami/(kafka|bitnami) directory and not /tmp, which can be cleared by machine restarts.
Spring Kafka 1.1.3
kafka client 0.10.1.1
single node kafka broker 0.10.1.0
single node zookeeper 3.4.9
Any insights on this will be appreciated!
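For reference, these are the kinds of commands I am using to confirm the current state with the 0.10.x tooling (topic, group, and host names are placeholders):
# show the current partition count, leaders and ISR for one of the older topics
kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-old-topic
# show the consumer group's current offsets and lag
kafka-consumer-groups.sh --new-consumer --bootstrap-server localhost:9092 --describe --group my-consumer-group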
Bitnami developer here. I was able to reproduce the issue and track it down to an init script that was clearing the contents of the tmp/kafka-logs/ folder.
We released a new revision of the kafka installers, virtual machines and cloud images fixing the issue. The revision that includes the fix is 1.0.0-2.
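If upgrading right away is not an option, you can also verify whether you are affected by checking where the broker actually writes its data and making sure that path is not one the init scripts clean at boot (the paths below assume the Bitnami layout and may differ in your image):
# which directory does this broker write its data to?
grep '^log.dirs' /opt/bitnami/kafka/config/server.properties
# list the topic-partition directories there and check that they survive a reboot
ls "$(grep '^log.dirs' /opt/bitnami/kafka/config/server.properties | cut -d= -f2)"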

Kafka sink connector: No tasks assigned, even after restart

I am using Confluent 3.2 in a set of Docker containers, one of which is running a kafka-connect worker.
For reasons yet unclear to me, two of my four connectors (to be specific, instances of hpgraphsl's MongoDB sink connector) stopped working. I was able to identify the main problem: the connectors did not have any tasks assigned, as could be seen by calling GET /connectors/{my_connector}/status. The other two connectors (of the same type) were not affected and were happily producing output.
I tried three different methods to get my connectors running again via the REST API:
Pausing and resuming the connectors
Restarting the connectors
Deleting and then creating the connector under the same name, using the same config
None of the methods worked. I finally got my connectors working again by:
Deleting and creating the connector under a different name, say my_connector_v2 instead of my_connector
What is going on here? Why am I not able to restart my existing connector and get it to start an actual task? Is there any stale data on the kafka-connect worker or in some kafka-connect-related topic on the Kafka brokers that needs to be cleaned?
I have filed an issue on the specific connector's GitHub repo, but I feel like this might actually be a general bug related to the internals of kafka-connect. Any ideas?
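For reference, the REST calls above were roughly of this form (the worker listens on the default port 8083; my_connector is the placeholder name):
# status: shows the connector state and, crucially, whether any tasks exist at all
curl -s http://localhost:8083/connectors/my_connector/status
# pause and resume
curl -s -X PUT http://localhost:8083/connectors/my_connector/pause
curl -s -X PUT http://localhost:8083/connectors/my_connector/resume
# restart the connector instance (note: this endpoint does not restart its tasks)
curl -s -X POST http://localhost:8083/connectors/my_connector/restart
# delete, then re-create by POSTing the same config to /connectors
curl -s -X DELETE http://localhost:8083/connectors/my_connector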
I have faced this issue. It can happen if there are not enough resources for a SinkTask or SourceTask to start.
The memory allocated to the worker may sometimes be too low; by default, workers are allocated 250MB. Please increase this. Below is an example that allocates 2GB of memory to a worker running in distributed mode.
KAFKA_HEAP_OPTS="-Xmx2G" sh $KAFKA_SERVICE_HOME/connect-distributed $KAFKA_CONFIG_HOME/connect-avro-distributed.properties