Kafka partitions and offsets disappeared

My Kafka clients are running in the GCP App Engine Flex environment with autoscaling enabled (GCP keeps the instance count at a minimum of two, and it has mostly stayed at 2 due to low CPU usage). The consumer groups running on those 2 VMs had been consuming messages from various topics with 20 partitions for several months, and recently I noticed that the partition count of older topics shrank to just 1 (!) and the offsets for that consumer group were reset to 0. The topic-[partition] directories were also gone from the kafka-logs directory. Strangely, recently created topic partitions are intact. I have 3 different environments (all in GCP) and this happened to all three. We didn't see any lost messages or other data problems, but we want to understand what happened so we can avoid it happening again.
The Kafka broker and ZooKeeper are running on the same, single GCP Compute Engine instance (I know it's not best practice and I have plans to improve this), and I suspect it has something to do with a machine restart wiping out some information. However, I verified that the data files are written under the /opt/bitnami/(kafka|bitnami) directory and not under /tmp, which can be cleared by machine restarts.
Spring Kafka 1.1.3
kafka client 0.10.1.1
single node kafka broker 0.10.1.0
single node zookeeper 3.4.9
Any insights on this will be appreciated!

Bitnami developer here. I could reproduce the issue and track it down to an init script that was clearing the content of the tmp/kafka-logs/ folder.
We released a new revision of the kafka installers, virtual machines and cloud images fixing the issue. The revision that includes the fix is 1.0.0-2.
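If you want to double-check an existing installation, the broker's data location is controlled by log.dirs in server.properties; a minimal sketch follows (the path below is only an example of a persistent location, not necessarily the Bitnami default):
# server.properties
# keep broker data outside /tmp and outside anything an init script may clean up
log.dirs=/opt/bitnami/kafka/data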

Related

Messages are stuck in ActiveMQ Artemis cluster queues

We have a problem with Apache ActiveMQ Artemis cluster queues. Sometimes messages begin to pile up in particular cluster queues. It usually happens 1-4 times per day, mostly in production (it has happened only once on a test environment in the last 90 days).
These messages are not delivered to consumers on the other cluster brokers until we restart the cluster connector (or the entire broker).
The problem looks related to ARTEMIS-3809.
Our setup is 6 servers in one environment (3 pairs of master/backup servers), running Linux (Red Hat).
We have tried to:
upgrade from 2.22.0 to 2.23.1
increase minLargeMessageSize on the cluster connectors to 1024000
The messages still get stuck in the cluster queues.
Another problem: I tried to configure min-large-message-size as described in the documentation (on the cluster-connection), but it caused errors at startup (broker.xml did not pass XSD validation), so the only option was to specify minLargeMessageSize in the URL parameters of the connector for each cluster broker. I don't know whether this setting has any effect.
So we had to write a script that checks whether messages are stuck in the cluster queues and restarts the cluster connector.
How can we debug this situation?
When the messages are stuck, nothing suspicious is written to the log (no errors, no stack traces, etc.).
Which loggers (for which classes) should we set to debug or trace level to find out what is happening with the cluster connectors?
I believe you can remedy the situation by setting this on your cluster-connection:
<producer-window-size>-1</producer-window-size>
See ARTEMIS-3805 for more details.
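For context, a minimal sketch of where that element lives in broker.xml; the connector and discovery-group names below are placeholders for whatever your existing cluster-connection already uses:
<cluster-connections>
   <cluster-connection name="my-cluster">
      <connector-ref>netty-connector</connector-ref>
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <max-hops>1</max-hops>
      <!-- -1 disables producer flow control on the cluster connection -->
      <producer-window-size>-1</producer-window-size>
      <discovery-group-ref discovery-group-name="my-discovery-group"/>
   </cluster-connection>
</cluster-connections>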
Generally speaking, moving messages around the cluster via the cluster-connection, while convenient, isn't terribly efficient (much less so for "large" messages). Ideally you would have a sufficient number of clients on each node to consume the messages that were originally produced there. If you don't have that many clients, then you may want to re-evaluate the size of your cluster, as it may actually decrease overall message throughput rather than increase it.
If you're just using 3 HA pairs in order to establish a quorum for replication, then you should investigate the recently added pluggable quorum voting, which allows integration with a third-party component (e.g. ZooKeeper) for leader election, eliminating the need for a quorum of brokers.

Kafka broker occasionally takes much longer than usual to load logs on startup

We are observing that Kafka brokers occasionally take much more time to load logs on startup than usual. Much longer in this case means 40 minutes instead of at most 1 minute. This happens during a rolling restart following the procedure described by Confluent, after the broker has reported that controlled shutdown was successful.
Kafka Setup
Confluent Platform 5.5.0
Kafka Version 2.5.0
3 Replicas (minimum 2 in sync)
Controlled broker shutdown enabled
1TB of AWS EBS for Kafka log storage
Other potentially useful information
We make extensive use of Kafka Streams
We use exactly-once processing and transactional producers/consumers
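For context, exactly-once on our side boils down to configuration along these lines (the ids are placeholders, shown only to illustrate the setup):
# Kafka Streams applications
processing.guarantee=exactly_once
# plain transactional producers
enable.idempotence=true
transactional.id=my-app-tx-1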
Observations
It is not always the same broker that takes a long time.
It does not only occur when the broker is the active controller.
A log partition that loads quickly (15ms) can take a long time (9549 ms) for the same broker a day later.
We experienced this issue before on Kafka 2.4.0 but after upgrading to 2.5.0 it did not occur for a few weeks.
Does anyone have an idea what could be causing this? Or what additional information would be useful to track down the issue?

How to add two more Kafka brokers on the local machine if my currently running Kafka broker already has data

I have one broker running on my local Windows machine, with 2-3 topics and stored messages. I want to scale up by adding two more broker instances. I have followed all the steps to configure 3 brokers on the same machine by creating a separate properties file for each.
My broker 0 shuts down when I start the broker 1 server, with the error below.
[2019-07-11 13:56:33,580] INFO Stopping serving logs in dir C:\kafka_2.12-2.2.1\data\kafka (kafka.log.LogManager)
[2019-07-11 13:56:33,585] ERROR Shutdown broker because all log dirs in C:\kafka_2.12-2.2.1\data\kafka have failed (kafka.log.LogManager)
Is it possible to add more brokers if my existing broker instance already has data?
Or do I need to delete the data directory and start broker 0 fresh? Is there any way to preserve the data without deleting it from the Kafka server?
Yes you can add brokers to your cluster and migrate/spread data across all your brokers.
The Expanding your cluster section in the documentation details the steps to achieve this.
After starting the new brokers, you basically need to use the bin/kafka-reassign-partitions.sh tool (other third-party tools also exist) to move data onto them.
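For example, the rough flow with that tool looks like this on Kafka 2.2.x, where it still talks to ZooKeeper (the topic name and broker list are placeholders):
# topics-to-move.json - list the existing topics you want spread across brokers 0,1,2
{"version": 1, "topics": [{"topic": "my-topic"}]}

# ask Kafka to generate a proposed assignment over brokers 0,1,2
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --topics-to-move-json-file topics-to-move.json --broker-list "0,1,2" --generate

# save the proposed assignment as reassignment.json, then execute and verify it
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file reassignment.json --execute
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file reassignment.json --verify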
Please note, however, that adding brokers on the same machine does not provide much resiliency: if the machine were to go down, all brokers would be affected. But if you just want to play around and learn about Kafka, that may be fine.
To run multiple brokers on the same physical machine, each broker's config must specify a unique broker.id, a different log.dirs, and a different port in listeners.
For example, create config/server{1,2,3}.properties and in each config set different values:
broker.id=<id>
log.dirs=/data/kafka<id>
listeners=PLAINTEXT://localhost:909<id>
When all three brokers are running, new topics will be spread evenly across the cluster, but existing topics need to be rebalanced.

How to start Kafka and ZooKeeper when the HDD is full

I have set up a Kafka cluster with 3 nodes. Now their hard disks are full, and Kafka and ZooKeeper are down.
What is the best way to restart Kafka and ZooKeeper?
Can I delete the directory for Kafka logs (log.dirs directory) and start Kafka again?
I am using Confluent version 4.0.0.
If you delete log.dirs you will lose all your data, which I guess should be your last resort. If you are using Amazon EC2, you are better off adding more disk, and after starting the brokers, playing with the retention policy or the replication factor, or removing topics.
Take a look at: https://kafka.apache.org/documentation/#basic_ops_add_topic
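For example, to play with retention you could temporarily lower retention.ms on a heavy topic so old segments get deleted sooner (the topic name and value are placeholders; Confluent 4.0.x ships Kafka 1.0, where kafka-configs still uses --zookeeper):
bin/kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name my-big-topic --add-config retention.ms=3600000
# once the disk pressure is gone, remove the override to fall back to the broker default
bin/kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name my-big-topic --delete-config retention.ms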
I solved this problem with the following method, and I hope it is useful for you:
Step 1:
I moved some of the log files from my servers to another server to free up a little disk space. You can temporarily change the Kafka port so nobody can produce to it.
Step 2:
Once some disk space was freed, I was able to run ZooKeeper and Kafka on the server.
Step 3:
I then deleted messages (with kafka-delete-records) from some topics and freed up half of the disk; see the sketch after these steps.
Step 4:
I stopped Kafka and ZooKeeper on all servers (3 nodes).
Step 5:
I moved the log files back to their original directory (the ones I had moved in step 1) and changed the Kafka port back to 9092.
Step 6:
Finally, I started Kafka and ZooKeeper on all servers successfully.
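As a sketch of step 3: kafka-delete-records takes a JSON file that says up to which offset to delete for each partition. The topic name, partition and offset below are placeholders only.
# delete-records.json
{"version": 1, "partitions": [{"topic": "my-big-topic", "partition": 0, "offset": 100000}]}

bin/kafka-delete-records.sh --bootstrap-server localhost:9092 --offset-json-file delete-records.json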

Kafka Mirror Maker (Multiple Sources --> One Target) possible?

A short question for the Kafka pros out there. I have multiple datacenters, DC_REMOTE_1, DC_REMOTE_2 and DC_LOCAL, where the remote datacenters each send messages to one topic.
In the local datacenter (DC_LOCAL) we are running a MirrorMaker which currently transfers the remote topic (events#dc_remote_1) to the local topic (events#dc_local). Is it possible to configure MirrorMaker so that (events#dc_remote_2) is also copied to events#dc_local?
This is effectively merging different remote topics into one local topic; would we run into problems due to the offset management?
Thanks for your help.
I have worked on a similar requirement: you can have one independent MirrorMaker running in each DC (DC_REMOTE_1 and DC_REMOTE_2), acting as a consumer of that DC's local Kafka cluster and sending the data to DC_LOCAL (this way there are no issues with offset management).
Of course, the capacity of the DC_LOCAL Kafka cluster should be approximately twice that of the DC_REMOTE_1/2 Kafka clusters.
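As an illustration, running the classic MirrorMaker once per remote DC could look roughly like this (hostnames, file names, group id and topic name are placeholders):
# consumer-dc-remote-1.properties - points at the remote cluster
bootstrap.servers=kafka.dc-remote-1.example.com:9092
group.id=mm-dc-remote-1-to-local

# producer-local.properties - points at the local cluster
bootstrap.servers=kafka.dc-local.example.com:9092

# one MirrorMaker process per remote DC, both writing into the same local cluster
bin/kafka-mirror-maker.sh --consumer.config consumer-dc-remote-1.properties --producer.config producer-local.properties --whitelist "events"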
Let me know if this helps your requirement.