Why does my Kafka cluster report the error "Number of alive brokers '0' does not meet the required replication factor"? - apache-kafka

I have 2 Kafka brokers and 1 ZooKeeper. The brokers' server.properties files:
Broker 1:
auto.create.topics.enable=true
broker.id=1
delete.topic.enable=true
group.initial.rebalance.delay.ms=0
listeners=PLAINTEXT://5.1.2.3:9092
log.dirs=/opt/kafka_2.12-2.1.0/logs
log.retention.check.interval.ms=300000
log.retention.hours=168
log.segment.bytes=1073741824
max.message.bytes=105906176
message.max.bytes=105906176
num.io.threads=8
num.network.threads=3
num.partitions=10
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
replica.fetch.max.bytes=105906176
socket.receive.buffer.bytes=102400
socket.request.max.bytes=105906176
socket.send.buffer.bytes=102400
transaction.state.log.min.isr=1
transaction.state.log.replication.factor=1
zookeeper.connect=5.1.3.6:2181
zookeeper.connection.timeout.ms=6000
Broker 2:
auto.create.topics.enable=true
broker.id=2
delete.topic.enable=true
group.initial.rebalance.delay.ms=0
listeners=PLAINTEXT://18.4.6.6:9092
log.dirs=/opt/kafka_2.12-2.1.0/logs
log.retention.check.interval.ms=300000
log.retention.hours=168
log.segment.bytes=1073741824
max.message.bytes=105906176
message.max.bytes=105906176
num.io.threads=8
num.network.threads=3
num.partitions=10
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
replica.fetch.max.bytes=105906176
socket.receive.buffer.bytes=102400
socket.request.max.bytes=105906176
socket.send.buffer.bytes=102400
transaction.state.log.min.isr=1
transaction.state.log.replication.factor=1
zookeeper.connect=5.1.3.6:2181
zookeeper.connection.timeout.ms=6000
If I query ZooKeeper like this:
echo dump | nc zook_IP 2181
I get:
SessionTracker dump:
Session Sets (3):
0 expire at Sun Jan 04 03:40:27 MSK 1970:
1 expire at Sun Jan 04 03:40:30 MSK 1970:
0x1000bef9152000b
1 expire at Sun Jan 04 03:40:33 MSK 1970:
0x1000147d4b40003
ephemeral nodes dump:
Sessions with Ephemerals (2):
0x1000147d4b40003:
/controller
/brokers/ids/2
0x1000bef9152000b:
/brokers/ids/1
Looks fine, but it doesn't work :(. ZooKeeper sees 2 brokers, but the first Kafka broker logs this error:
ERROR [KafkaApi-1] Number of alive brokers '0' does not meet the required replication factor '1' for the offsets topic (configured via 'offsets.topic.replication.factor'). This error can be ignored if the cluster is starting up and not all brokers are up yet. (kafka.server.KafkaApis)
We also use kafka_exporter for Prometheus, and it logs this error:
Cannot get oldest offset of topic Some.TOPIC partition 9: kafka server: Request was for a topic or partition that does not exist on this broker." source="kafka_exporter.go:296
Please help! Where is the mistake in my config?

Are your clocks working? Zookeeper thinks it's 1970
Sun Jan 04 03:40:27 MSK 1970
You may want to look at the rest of the logs or see if Kafka and Zookeeper are actively running and ports are open.
In your first message, you see this right after starting a fresh cluster, so it's not a true error:
This error can be ignored if the cluster is starting up and not all brokers are up yet. (kafka.server.KafkaApis)
The properties you show, though, have listeners on entirely different subnets, and you're not using advertised.listeners.
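A minimal sketch of what that could look like for broker 1 (the 0.0.0.0 bind address is an assumption; the advertised address is the IP from your config, which clients and the other broker must be able to reach):
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://5.1.2.3:9092
Broker 2 would advertise 18.4.6.6:9092 accordingly.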

A changed broker.id may cause this problem. Clean up the Kafka metadata under ZooKeeper. Note: Kafka data will be lost.
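If you go this route, a sketch using the zookeeper-shell script that ships with Kafka (the znode paths are the standard ones; the delete command is rmr on the ZooKeeper 3.4 CLI bundled with Kafka 2.1, deleteall on newer ZooKeepers):
bin/zookeeper-shell.sh zook_IP:2181
ls /brokers/ids
rmr /brokers
Run ls first to see which broker ids are registered; rmr /brokers is destructive and removes topic metadata, so only do this if you accept losing the data.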

I got this error message in this situation:
Cluster talking in SSL
Every broker is a container
Updated the certificate with new password inside ALL brokers
Rolling update
After the first broker reboot, it spammed this error message and the broker controller talked about "a new broker connected but password verification failed".
Solutions:
Set the new certificate password with the old password
Down then Up your entire cluster at once
(not tested yet) Change the certificate on one broker, reboot it then move to the next broker until you reboot all of them (ending with the controller)
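For context, these are the broker-side SSL settings that have to agree with the keystore during such a rolling update (property names are Kafka's standard ones; the paths are placeholders, not from the original post):
ssl.keystore.location=/path/to/broker.keystore.jks
ssl.keystore.password=<must match the keystore password>
ssl.key.password=<must match the key password>
ssl.truststore.location=/path/to/broker.truststore.jks
ssl.truststore.password=<must match the truststore password>
If a broker restarts with a keystore whose password no longer matches these values, inter-broker SSL handshakes fail exactly as described above.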

Related

Kafka multi-datacenter solution

I have 2 services (1 producer that writes 15,000 messages to a Kafka topic, and 1 consumer that reads those messages from that topic), and a stretched 3-DC Kafka cluster (the 3 DCs are located within the same city, so latency is low).
To imitate a 2-DC failure, I simultaneously shut down 2 of the Kafka nodes (systemctl kill through Ansible), so only 1 Kafka is up and running. I have acks=all, ISR=3, and min ISR=3, so in theory, if even 1 Kafka goes down, all writes to Kafka should stop.
But in my case my service keeps writing to Kafka with only 1 node alive!
Why does this happen?
Here's my /etc/kafka/server.properties:
zookeeper.connect=192.168.1.11:2181,192.168.1.12:2181,192.168.1.13:2181
log.dirs=/var/lib/kafka/data
broker.id=0
group.initial.rebalance.delay.ms=0
log.retention.check.interval.ms=30000
log.retention.hours=3
log.roll.hours=1
log.segment.bytes=1073741824
num.io.threads=16
num.network.threads=8
num.partitions=1
num.recovery.threads.per.data.dir=2
offsets.topic.replication.factor=3
socket.receive.buffer.bytes=1024000
socket.request.max.bytes=104857600
socket.send.buffer.bytes=1024000
transaction.state.log.min.isr=3
transaction.state.log.replication.factor=3
zookeeper.connection.timeout.ms=10000
delete.topic.enable=True
replica.fetch.max.bytes=5242880
max.message.bytes=5242880
message.max.bytes=5242880
default.replication.factor=3
min.insync.replicas=3
replica.fetch.wait.max.ms=200
replica.lag.time.max.ms=1000
advertised.listeners=PLAINTEXT://192.168.1.11:9092
unclean.leader.election=false
acks=all
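One detail worth noting when reasoning about this: acks is a producer configuration, not a broker one, so the acks=all line in server.properties is silently ignored, and unless the producer itself sets acks=all, writes can be acknowledged by the leader alone (acks=1 was the producer default before Kafka 3.0). A sketch of forcing it on the console producer (the --producer-property flag is from the stock distribution; the topic name is made up):
kafka-console-producer --broker-list 192.168.1.11:9092 --topic test-topic --producer-property acks=all
With min.insync.replicas=3 and only one broker alive, such a producer should fail with NOT_ENOUGH_REPLICAS instead of continuing to write, since min.insync.replicas is only enforced for acks=all.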

The Cluster ID XXXXXXXXXXXX doesn't match stored clusterId Some

On a two-node Kafka cluster, after restarting one node, I get this error:
ERROR Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
kafka.common.InconsistentClusterIdException: The Cluster ID hJGLXQwERc2rTXI_dWLwcg doesn't match stored clusterId Some(9VcFGDwSfke5JWuzA6f9mw) in meta.properties. The broker is trying to join the wrong cluster. Configured zookeeper.connect may be wrong.
at kafka.server.KafkaServer.startup(KafkaServer.scala:223)
at kafka.Kafka$.main(Kafka.scala:109)
at kafka.Kafka.main(Kafka.scala)
where hJGLXQwERc2rTXI_dWLwcg is the ID of the working cluster (from meta.properties) and
9VcFGDwSfke5JWuzA6f9mw is the newly generated ID.
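For reference, meta.properties is a small file that Kafka writes into each log.dirs directory; a sketch of its usual shape in a ZooKeeper-based cluster (values illustrative):
version=0
broker.id=1
cluster.id=hJGLXQwERc2rTXI_dWLwcg
On startup, the broker compares this stored cluster.id with the one registered in ZooKeeper and refuses to start when they differ.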
Kafka version 3.0.0
On Node1:
server.properties:
broker.id=1
listeners=SASL_PLAINTEXT://:9092
advertised.listeners=SASL_PLAINTEXT://loc-kafka1:9092
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/opt/kafka/logs
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=loc-kafka1:2181, loc-kafka2:2181
zookeeper.connection.timeout.ms=18000
group.initial.rebalance.delay.ms=0
compression.type=gzip
security.inter.broker.protocol=SASL_PLAINTEXT
sasl.mechanism.inter.broker.protocol=PLAIN
sasl.enabled.mechanisms=PLAIN
authorizer.class.name=kafka.security.authorizer.AclAuthorizer
ssl.endpoint.identification.algorithm=
super.users=User:admin
sasl.jaas.config=/opt/kafka/config/kafka_server_jaas.conf
delete.topic.enable=true
zookeeper.properties
dataDir=/opt/kafka/zookeeper/data
clientPort=2181
maxClientCnxns=10
admin.enableServer=false
server1=loc-kafka1:2888:3888
server2=loc-kafka2:2888:3888
On Node2 the config is the same, except broker.id=2 and the myid file in dataDir.
All the information I can find says: don't place log.dirs and dataDir in the /tmp folder and everything will work. But that's not the case here.
One solution is to clear the log.dirs and dataDir folders, which lets this node join the cluster. But that doesn't seem like a good solution.
If I just delete meta.properties, a new cluster.id is placed in it and the node still doesn't join the cluster. At the same time, the following lines in server.log show that the node is connecting to the other cluster node:
INFO Opening socket connection to server loc-kafka1/10.1.1.3:2181. (org.apache.zookeeper.ClientCnxn)
INFO SASL config status: Will attempt to SASL-authenticate using Login Context section 'Client' (org.apache.zookeeper.ClientCnxn)
INFO Socket connection established, initiating session, client: loc-kafka2/10.1.1.4:38672, server: loc-kafka1/10.1.1.3:2181 (org.apache.zookeeper.ClientCnxn)
INFO Session establishment complete on server loc-kafka1/10.0.0.3:3001, session id = 0x100000043bb0006, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn)
What can I do to fix it?
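A less destructive first step might be to compare the cluster id held in ZooKeeper with the one in meta.properties before clearing anything (zookeeper-shell ships with Kafka; the meta.properties path follows from your log.dirs setting):
bin/zookeeper-shell.sh loc-kafka1:2181 get /cluster/id
cat /opt/kafka/logs/meta.properties
If the two ids differ, this broker is talking to a ZooKeeper ensemble other than the one that originally created the cluster, which points back at the zookeeper.connect line (or at the two ZooKeepers not actually forming a single ensemble).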

Can't Start Kafka Server on Windows 10 - Kafka's log directories (and children) should only contain Kafka topic data

I'm following the instructions in lesson 28 of the Learn Apache Kafka for Beginners Udemy course to start ZooKeeper and then a Kafka broker on Windows 10. ZooKeeper runs fine on port 2181:
C:\kafka_2.12-2.3.1> zookeeper-server-start.bat config/zookeeper.properties
...
INFO binding to port 0.0.0.0/0.0.0.0:2181
But after adding the bat files to the path, running the Kafka server does not work:
C:\kafka_2.12-2.3.1> kafka-server-start.bat config/server.properties
...
ERROR There was an error in one of the threads during logs loading: org.apache.kafka.common.KafkaException: Found directory C:\kafka_2.12-2.3.1\data\kafka, 'kafka' is not in the form of topic-partition or topic-partition.uniqueId-delete (if marked for deletion).
Kafka's log directories (and children) should only contain Kafka topic data. (kafka.log.LogManager)
Some of the stdout logging from ZooKeeper looks informative:
Accepted socket connection from /127.0.0.1:49439 (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2019-11-03 17:22:42,278] INFO Client attempting to establish new session at /127.0.0.1:49439 (org.apache.zookeeper.server.ZooKeeperServer)
[2019-11-03 17:22:42,286] INFO Creating new log file: log.1 (org.apache.zookeeper.server.persistence.FileTxnLog)
...
INFO Processed session termination for sessionid: 0x1007b0044a40000 (org.apache.zookeeper.server.PrepRequestProcessor)
[2019-11-03 17:22:42,987] INFO Closed socket connection for client /127.0.0.1:49439 which had sessionid 0x1007b0044a40000 (org.apache.zookeeper.server.NIOServerCnxn)
Under the data folder there are two folders I created, the second of which was filled after I tried running the Kafka broker:
kafka/
|-- (empty)
zookeeper/
|-- version-2/
|   |-- log.1
Why does this error happen and how can I start a Kafka server on Windows 10?
EDIT:
Contents of config/server.properties:
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=C:/kafka_2.12-2.3.1/data/
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=6000
group.initial.rebalance.delay.ms=0
The data/ folder contains empty zookeeper/ and kafka/ directories
Did you create those folders? Kafka's log.dirs can only contain folders in the format topic-partition or topic-partition.uniqueId-delete (if marked for deletion), as the error says. That directory should be empty (no subdirectories) when starting a fresh installation.
Plus, the Kafka data directory should not hold ZooKeeper data, as ZooKeeper should be independent of Kafka.
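Concretely, a sketch of the layout the error is asking for, using the paths from the question (log.dirs pointed at a directory reserved for Kafka alone, ZooKeeper's data kept elsewhere):
# config/server.properties
log.dirs=C:/kafka_2.12-2.3.1/data/kafka
# config/zookeeper.properties
dataDir=C:/kafka_2.12-2.3.1/data/zookeeper
With the question's original log.dirs=C:/kafka_2.12-2.3.1/data/, LogManager finds the kafka and zookeeper subfolders and rejects them because their names don't parse as topic-partition.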

Kafka long coordinator load time and small ISRs

I'm using Kafka 0.8.2.1, running a topic with 200 partitions and RF=3, with log retention set to about 1GB.
An unknown event caused the cluster to enter the "coordinator load" or "group load" state. A few signals made this apparent: the pykafka-based consumers began to fail during FetchOffsetRequests with error code 14 COORDINATOR_LOAD_IN_PROGRESS for some subset of partitions. These errors were triggered when consuming with a consumer group that had existed since before the coordinator load. In broker logs, messages like this appeared:
[2018-05...] ERROR Controller 17 epoch 20 initiated state change for partition [my.cool.topic,144] from OnlinePartition to OnlinePartition failed (state.change.logger)
kafka.common.StateChangeFailedException: encountered error while electing leader for partition [my.cool.topic,144] due to: Preferred replica 11 for partition [my.cool.topic,144] is either not alive or not in the isr. Current leader and ISR: [{"leader":12,"leader_epoch":7,"isr":[12,13]}].
For some reason, Kafka decided that replica 11 was the "preferred" replica despite the fact that it was not in the ISR. To my knowledge, consumption could continue uninterrupted from either replica 12 or 13 while 11 resynchronized - it's not clear why Kafka chose a non-synced replica as the preferred leader.
The above-described behavior lasted for about 6 hours, during which the pykafka fetch_offsets error made message consumption impossible. While the coordinator load was still in progress, other consumer groups were able to consume the topic without error. In fact, the eventual fix was to restart the broken consumers with a new consumer_group name.
Questions
Is it normal or expected for the coordinator load state to last for 6 hours? Is this load time affected by log retention settings, message production rate, or other parameters?
Do non-pykafka clients handle COORDINATOR_LOAD_IN_PROGRESS by consuming only from the non-erroring partitions? Pykafka's insistence that all partitions return successful OffsetFetchResponses can be a source of consumption downtime.
Why does Kafka sometimes select a non-synced replica as the preferred replica during coordinator loads? How can I reassign partition leaders to replicas in the ISR? (A sketch follows these questions.)
Are all of these questions moot because I should just be using a newer version of Kafka?
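On the reassignment part of the third question: the preferred replica is simply the first broker in a partition's assigned replica list, so one approach is to reorder that list so an in-sync broker comes first. A sketch with the stock tools (broker ids taken from the log excerpt above; the elided ZooKeeper string is kept as in the config below):
cat > reassign.json <<'EOF'
{"version":1,"partitions":[{"topic":"my.cool.topic","partition":144,"replicas":[12,13,11]}]}
EOF
bin/kafka-reassign-partitions.sh --zookeeper ****/kafka5 --reassignment-json-file reassign.json --execute
Once the reassignment completes, bin/kafka-preferred-replica-election.sh can move leadership to the new first replica.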
Broker config options:
broker.id=10
port=9092
zookeeper.connect=****/kafka5
log.dirs=*****
delete.topic.enable=true
replica.fetch.max.bytes=1048576
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.socket.timeout.ms=30000
replica.socket.receive.buffer.bytes=65536
replica.lag.time.max.ms=10000
replica.lag.max.messages=4000
controller.socket.timeout.ms=30000
message.max.bytes=1000000
auto.create.topics.enable=false
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
log.retention.hours=96
log.roll.hours=168
log.retention.check.interval.ms=300000
log.segment.bytes=1073741824
zookeeper.connection.timeout.ms=6000
zookeeper.sync.time.ms=2000
num.io.threads=8
socket.request.max.bytes=104857600
num.replica.fetchers=4
controller.message.queue.size=10
num.partitions=8
log.flush.interval.ms=60000
log.flush.interval.messages=60000
log.flush.scheduler.interval.ms=2000
num.network.threads=8
socket.receive.buffer.bytes=1048576
socket.send.buffer.bytes=1048576
queued.max.requests=500
fetch.purgatory.purge.interval.requests=100
producer.purgatory.purge.interval.requests=100
controlled.shutdown.enable=true

Multi-node Kafka cluster: Producer and Consumer not working

I have a Kafka cluster consisting of two machines. This is my server.properties:
broker.id=2
listeners=PLAINTEXT://:9092
advertised.listeners=PLAINTEXT://a.b.c.d:9092
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/tmp/kafka-logs
num.partitions=2
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=a.b.c.d:2181,a.b.c.e:2181
zookeeper.connection.timeout.ms=6000
group.initial.rebalance.delay.ms=0
And this is my zookeeper.properties:
dataDir=/tmp/zookeeper
clientPort=2181
maxClientCnxns=0
tickTime=2000
server.1=a.b.c.d:2888:3888
server.2=a.b.c.e:2888:3888
initLimit=20
syncLimit=10
(a.b.c.d and a.b.c.e stand for the actual IPs of these machines, e.g. 192.168.....)
I start the zookeeper server on both machines using:
bin/zookeeper-server-start config/zookeeper.properties
I then start the Kafka servers on both nodes. After this, I am able to create a new topic and get its details using --describe. However, I am unable to consume or produce. I run the console clients like this:
bin/kafka-console-consumer --bootstrap-server a.b.c.d:9092,a.b.c.e:9092 --topic randomTopic --from-beginning
bin/kafka-console-producer --broker-list a.b.c.d:9092,a.b.c.e:9092 --topic randomTopic
When I run the producer, the prompt (>) appears and I can type into it. However, the consumer never reads anything and the screen remains blank.
How do I make the consumer read the data in the topics, and the producer able to write data into them?
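One way to narrow this down (a sketch, assuming the same script naming as the commands above): check where the __consumer_offsets partitions live and whether each broker answers on its advertised address:
bin/kafka-topics --zookeeper a.b.c.d:2181 --describe --topic __consumer_offsets
bin/kafka-broker-api-versions --bootstrap-server a.b.c.d:9092
bin/kafka-broker-api-versions --bootstrap-server a.b.c.e:9092
With offsets.topic.replication.factor=1, every __consumer_offsets partition lives on exactly one broker, so a console consumer hangs silently if its group's coordinator partition sits on a broker it cannot reach.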