Kafka multi-datacenter solution - apache-kafka

I have 2 services (1 producer that writes 15,000 messages to a Kafka topic, and 1 consumer that reads those messages from that topic), and I have a stretched 3-DC Kafka cluster (the 3 DCs are located within the same city, so latency is low).
To imitate a 2-DC failure, I simultaneously shut down 2 Kafka brokers (systemctl kill via Ansible), so only 1 Kafka broker is left up and running. I have acks=all, a replication factor of 3, and min.insync.replicas=3, so in theory, if even 1 broker goes down, all writes to Kafka should stop.
But in my case my service keeps writing to Kafka with only 1 node alive!
Why does this happen?
Here's my /etc/kafka/server.properties:
zookeeper.connect=192.168.1.11:2181,192.168.1.12:2181,192.168.1.13:2181
log.dirs=/var/lib/kafka/data
broker.id=0
group.initial.rebalance.delay.ms=0
log.retention.check.interval.ms=30000
log.retention.hours=3
log.roll.hours=1
log.segment.bytes=1073741824
num.io.threads=16
num.network.threads=8
num.partitions=1
num.recovery.threads.per.data.dir=2
offsets.topic.replication.factor=3
socket.receive.buffer.bytes=1024000
socket.request.max.bytes=104857600
socket.send.buffer.bytes=1024000
transaction.state.log.min.isr=3
transaction.state.log.replication.factor=3
zookeeper.connection.timeout.ms=10000
delete.topic.enable=True
replica.fetch.max.bytes=5242880
max.message.bytes=5242880
message.max.bytes=5242880
default.replication.factor=3
min.insync.replicas=3
replica.fetch.wait.max.ms=200
replica.lag.time.max.ms=1000
advertised.listeners=PLAINTEXT://192.168.1.11:9092
unclean.leader.election=false
acks=all
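A quick way to check what is actually enforced (a sketch; <topic> is a placeholder for your topic name, and depending on the Kafka version these tools may need --zookeeper 192.168.1.11:2181 instead of --bootstrap-server):
kafka-topics --bootstrap-server 192.168.1.11:9092 --describe --topic <topic>
kafka-configs --bootstrap-server 192.168.1.11:9092 --entity-type topics --entity-name <topic> --describe
kafka-console-producer --broker-list 192.168.1.11:9092 --topic <topic> --producer-property acks=all
The first command shows the replication factor and the current ISR, the second shows any per-topic override of min.insync.replicas, and the third sets acks=all on the producer itself, since acks is a producer configuration rather than a broker property.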

Related

Kafka Broker leader change without effect

I have 3 Kafka brokers, with 3 partitions:
broker.id 1001: 10.18.0.73:9092 LEADER
broker.id 1002: 10.18.0.73:9093
broker.id 1005: 10.18.0.73:9094
ZooKeeper is set to 127.0.0.1:2181.
I launch them with:
1001 -> .\kafka-server-start.bat ..\..\config\server.properties
1002 -> .\kafka-server-start.bat ..\..\config\server1.properties
1005 -> .\kafka-server-start.bat ..\..\config\server2.properties
This is server.properties
broker.id=-1
listeners=PLAINTEXT://10.18.0.73:9092
advertised.listeners=PLAINTEXT://10.18.0.73:9092
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/tmp/kafka-logs
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=10.18.0.73:2181
zookeeper.connection.timeout.ms=6000
group.initial.rebalance.delay.ms=0
advertised.port=9092
advertised.host.name=10.18.0.73
port=9092
This is server1.properties
broker.id=-1
listeners=PLAINTEXT://10.18.0.73:9093
advertised.listeners=PLAINTEXT://10.18.0.73:9093
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/tmp/kafka-logs4
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=6000
group.initial.rebalance.delay.ms=0
advertised.port=9093
advertised.host.name=10.18.0.73
port=9093
This is server2.properties
broker.id=-1
listeners=PLAINTEXT://10.18.0.73:9094
advertised.listeners=PLAINTEXT://10.18.0.73:9094
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/tmp/kafka-logs2
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=6000
group.initial.rebalance.delay.ms=0
advertised.port=9094
advertised.host.name=10.18.0.73
port=9094
The files are in the folder C:\kafka_2.12-2.4.0\config.
Run All
Run Producer
.\kafka-console-producer.bat --broker-list 10.18.0.73:9092,10.18.0.73:9093,10.18.0.73:9094 --topic clinicaleventmanager
Run Consumer
.\kafka-console-consumer.bat --bootstrap-server 10.18.0.73:9092,10.18.0.73:9093,10.18.0.73:9094 --topic clinicaleventmanager
I send a test message.
It is received OK!
Now I shut down broker 1001 (the leader).
The new leader is 1002.
In the consumer, this message appeared for about a second, which I imagine is the time needed to elect the new leader:
[2020-01-16 15:33:35,802] WARN [Consumer clientId=consumer-console-consumer-56669-1, groupId=console-consumer-56669] Connection to node 2147482646 (/10.18.0.73:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
If I try to send another message, it is not read by the consumer.
The new leader, 1002, does not appear to be sending messages.
Why?
If I start broker 1001 again, everything works.
Thanks
First, Kafka never "sends" (pushes) messages; the consumer asks for them.
Second, it would seem you've changed nothing but the listeners, port, and log dir.
You don't explicitly create any topic, so you end up with the defaults of one partition and one replica, both for your topic and for the internal consumer offsets topic.
If a partition's only replica is on the broker you stopped, then no other process can read from (or write to) that partition, regardless of which broker is the controller.
So change the offsets (and transaction) topic replication factors to 3 and try again.
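A rough sketch of that change (the min-ISR value of 2 is just an example; and if __consumer_offsets was already created with a single replica, I believe changing the setting alone won't retroactively change the existing topic - on a test setup the simplest route is to wipe the log dirs and ZooKeeper data and start fresh):
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=2
Then create your topic explicitly with replicas, e.g.:
.\kafka-topics.bat --bootstrap-server 10.18.0.73:9092 --create --topic clinicaleventmanager --partitions 3 --replication-factor 3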

Why did the Kafka cluster report the error "Number of alive brokers '0' does not meet the required replication factor"?

I have 2 Kafka brokers and 1 ZooKeeper. The brokers' server.properties files:
Broker 1:
auto.create.topics.enable=true
broker.id=1
delete.topic.enable=true
group.initial.rebalance.delay.ms=0
listeners=PLAINTEXT://5.1.2.3:9092
log.dirs=/opt/kafka_2.12-2.1.0/logs
log.retention.check.interval.ms=300000
log.retention.hours=168
log.segment.bytes=1073741824
max.message.bytes=105906176
message.max.bytes=105906176
num.io.threads=8
num.network.threads=3
num.partitions=10
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
replica.fetch.max.bytes=105906176
socket.receive.buffer.bytes=102400
socket.request.max.bytes=105906176
socket.send.buffer.bytes=102400
transaction.state.log.min.isr=1
transaction.state.log.replication.factor=1
zookeeper.connect=5.1.3.6:2181
zookeeper.connection.timeout.ms=6000
Broker 2:
auto.create.topics.enable=true
broker.id=2
delete.topic.enable=true
group.initial.rebalance.delay.ms=0
listeners=PLAINTEXT://18.4.6.6:9092
log.dirs=/opt/kafka_2.12-2.1.0/logs
log.retention.check.interval.ms=300000
log.retention.hours=168
log.segment.bytes=1073741824
max.message.bytes=105906176
message.max.bytes=105906176
num.io.threads=8
num.network.threads=3
num.partitions=10
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
replica.fetch.max.bytes=105906176
socket.receive.buffer.bytes=102400
socket.request.max.bytes=105906176
socket.send.buffer.bytes=102400
transaction.state.log.min.isr=1
transaction.state.log.replication.factor=1
zookeeper.connect=5.1.3.6:2181
zookeeper.connection.timeout.ms=6000
If I query ZooKeeper like this:
echo dump | nc zook_IP 2181
I get:
SessionTracker dump:
Session Sets (3):
0 expire at Sun Jan 04 03:40:27 MSK 1970:
1 expire at Sun Jan 04 03:40:30 MSK 1970:
0x1000bef9152000b
1 expire at Sun Jan 04 03:40:33 MSK 1970:
0x1000147d4b40003
ephemeral nodes dump:
Sessions with Ephemerals (2):
0x1000147d4b40003:
/controller
/brokers/ids/2
0x1000bef9152000b:
/brokers/ids/1
This looks fine, but it doesn't work :(. ZooKeeper sees 2 brokers, but on the first Kafka broker we get this error:
ERROR [KafkaApi-1] Number of alive brokers '0' does not meet the required replication factor '1' for the offsets topic (configured via 'offsets.topic.replication.factor'). This error can be ignored if the cluster is starting up and not all brokers are up yet. (kafka.server.KafkaApis)
We also use kafka_exporter for Prometheus, and it logs this error:
Cannot get oldest offset of topic Some.TOPIC partition 9: kafka server: Request was for a topic or partition that does not exist on this broker." source="kafka_exporter.go:296
Please help! Where is the mistake in my config?
Are your clocks working? ZooKeeper thinks it's 1970:
Sun Jan 04 03:40:27 MSK 1970
You may want to look at the rest of the logs or see if Kafka and Zookeeper are actively running and ports are open.
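For example, reusing the same nc approach you used for the dump (ZooKeeper should answer imok if the four-letter commands are enabled, and both broker ports should accept a connection):
echo ruok | nc 5.1.3.6 2181
nc -vz 5.1.2.3 9092
nc -vz 18.4.6.6 9092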
Regarding your first error message: you see this after starting a fresh cluster, so it's not a true error:
This error can be ignored if the cluster is starting up and not all brokers are up yet. (kafka.server.KafkaApis)
The properties you show, though, have listeners on entirely different subnets, and you're not using advertised.listeners.
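Something along these lines (a sketch; the 0.0.0.0 bind address is an assumption - use whatever interface each broker should listen on):
Broker 1:
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://5.1.2.3:9092
Broker 2:
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://18.4.6.6:9092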
A change of the Kafka broker.id may cause this problem. Clean up the Kafka metadata under ZooKeeper; note that the Kafka data will be lost.
I got this error message in this situation:
Cluster talking in SSL
Every broker is a container
Updated the certificate with new password inside ALL brokers
Rolling update
After the first broker rebooted, it spammed this error message, and the controller broker reported "a new broker connected but password verification failed".
Solutions:
Set the new certificate password with the old password
Down then Up your entire cluster at once
(Not tested yet) Change the certificate on one broker, reboot it, then move on to the next broker until you have rebooted all of them (ending with the controller)
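For reference, the broker-side properties that carry those passwords look roughly like this (locations and values are placeholders); the first solution amounts to generating the new certificate with the old password so that these values don't change during the roll:
ssl.keystore.location=/path/to/kafka.keystore.jks
ssl.keystore.password=<old password>
ssl.key.password=<old password>
ssl.truststore.location=/path/to/kafka.truststore.jks
ssl.truststore.password=<old password>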

Offset for consumer group reset for one partition

During the last Kafka maintenance, which required a rolling restart of the brokers, we witnessed a reset of consumer group offsets for certain partitions.
At 11:14 am, everything is fine for the consumer group and we don't see a consumer lag:
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0 105130857 105130857 0 st-...
...
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 6 78591770 78591770 0 st-...
However, 5 minutes later, during the rolling restart of the brokers, we see a reset for one partition and a consumer lag of millions of events:
$ bin/kafka-consumer-groups --bootstrap-server XXX:9093,XXX... --command-config secrets.config --group st-xx --describe
Note: This will not show information about old Zookeeper-based consumers.
[2019-08-26 12:44:13,539] WARN Connection to node -5 could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2019-08-26 12:44:13,707] WARN [Consumer clientId=consumer-1, groupId=st-xx] Connection to node -5 could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
Consumer group 'st-xx' has no active members.
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0 105132096 105132275 179
...
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 6 15239401 78593165 63353764 ...
In the last two hours, the offset for the partition hasn't recovered, and we now need to patch it manually. We had similar issues during the last rolling restart of the brokers.
Has anyone seen something like this before? The only clue we could find is this ticket; however, we run Kafka version 1.0.1-kafka3.1.0.
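The patch we have in mind is along these lines (a sketch; the topic name is redacted above, the target offset is the pre-restart value from the first describe output, and the group has to be inactive; leaving off --execute performs a dry run first):
bin/kafka-consumer-groups --bootstrap-server XXX:9093 --command-config secrets.config --group st-xx --topic <topic>:6 --reset-offsets --to-offset 78591770 --execute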

Kafka long coordinator load time and small ISRs

I'm using Kafka 0.8.2.1, running a topic with 200 partitions and RF=3, with log retention set to about 1GB.
An unknown event caused the cluster to enter the "coordinator load" or "group load" state. A few signals made this apparent: the pykafka-based consumers began to fail during FetchOffsetRequests with error code 14 COORDINATOR_LOAD_IN_PROGRESS for some subset of partitions. These errors were triggered when consuming with a consumer group that had existed since before the coordinator load. In broker logs, messages like this appeared:
[2018-05...] ERROR Controller 17 epoch 20 initiated state change for partition [my.cool.topic,144] from OnlinePartition to OnlinePartition failed (state.change.logger)
kafka.common.StateChangeFailedException: encountered error while electing leader for partition [my.cool.topic,144] due to: Preferred replica 11 for partition [my.cool.topic,144] is either not alive or not in the isr. Current leader and ISR: [{"leader":12,"leader_epoch":7,"isr":[12,13]}].
For some reason, Kafka decided that replica 11 was the "preferred" replica despite the fact that it was not in the ISR. To my knowledge, consumption could continue uninterrupted from either replica 12 or 13 while 11 resynchronized - it's not clear why Kafka chose a non-synced replica as the preferred leader.
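As far as I can tell, the "preferred" replica is simply the first replica in the partition's assignment list, independent of the ISR, so a reassignment that lists an in-sync replica first - roughly the sketch below, which I have not run - would change which replica is preferred:
reassign.json:
{"version":1,"partitions":[{"topic":"my.cool.topic","partition":144,"replicas":[12,13,11]}]}
bin/kafka-reassign-partitions.sh --zookeeper ****/kafka5 --reassignment-json-file reassign.json --execute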
The above-described behavior lasted for about 6 hours, during which the pykafka fetch_offsets error made message consumption impossible. While the coordinator load was still in progress, other consumer groups were able to consume the topic without error. In fact, the eventual fix was to restart the broken consumers with a new consumer_group name.
Questions
Is it normal or expected for the coordinator load state to last for 6 hours? Is this load time affected by log retention settings, message production rate, or other parameters?
Do non-pykafka clients handle COORDINATOR_LOAD_IN_PROGRESS by consuming only from the non-erroring partitions? Pykafka's insistence that all partitions return successful OffsetFetchResponses can be a source of consumption downtime.
Why does Kafka sometimes select a non-synced replica as the preferred replica during coordinator loads? How can I reassign partition leaders to replicas in the ISR?
Are all of these questions moot because I should just be using a newer version of Kafka?
Broker config options:
broker.id=10
port=9092
zookeeper.connect=****/kafka5
log.dirs=*****
delete.topic.enable=true
replica.fetch.max.bytes=1048576
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.socket.timeout.ms=30000
replica.socket.receive.buffer.bytes=65536
replica.lag.time.max.ms=10000
replica.lag.max.messages=4000
controller.socket.timeout.ms=30000
message.max.bytes=1000000
auto.create.topics.enable=false
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
log.retention.hours=96
log.roll.hours=168
log.retention.check.interval.ms=300000
log.segment.bytes=1073741824
zookeeper.connection.timeout.ms=6000
zookeeper.sync.time.ms=2000
num.io.threads=8
socket.request.max.bytes=104857600
num.replica.fetchers=4
controller.message.queue.size=10
num.partitions=8
log.flush.interval.ms=60000
log.flush.interval.messages=60000
log.flush.scheduler.interval.ms=2000
num.network.threads=8
socket.receive.buffer.bytes=1048576
socket.send.buffer.bytes=1048576
queued.max.requests=500
fetch.purgatory.purge.interval.requests=100
producer.purgatory.purge.interval.requests=100
controlled.shutdown.enable=true

Multi-node Kafka cluster: Producer and Consumer not working

I have a kafka cluster consisting of two machines. This is my server.properties:
broker.id=2
listeners=PLAINTEXT://:9092
advertised.listeners=PLAINTEXT://a.b.c.d:9092
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/tmp/kafka-logs
num.partitions=2
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=a.b.c.d:2181,a.b.c.e:2181
zookeeper.connection.timeout.ms=6000
group.initial.rebalance.delay.ms=0
And this is my zookeeper.properties:
dataDir=/tmp/zookeeper
clientPort=2181
maxClientCnxns=0
tickTime=2000
server.1=a.b.c.d:2888:3888
server.2=a.b.c.e:2888:3888
initLimit=20
syncLimit=10
a.b.c.d and a.b.c.e are the IPs of these machines, e.g. 192.168.....
I start the zookeeper server on both machines using:
bin/zookeeper-server-start config/zookeeper.properties
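Each machine also needs a myid file under dataDir matching its server.N entry; assuming the dataDir above, that is something like:
echo 1 > /tmp/zookeeper/myid   # on a.b.c.d
echo 2 > /tmp/zookeeper/myid   # on a.b.c.e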
I then start the Kafka servers on both nodes. After this, I am able to create a new topic and get its details using --describe. However, I am unable to consume or produce. I run these with:
bin/kafka-console-consumer --bootstrap-server a.b.c.d:9092,a.b.c.e:9092 --topic randomTopic --from-beginning
bin/kafka-console-producer --broker-list a.b.c.d:9092,a.b.c.e:9092 --topic randomTopic
When I run the producer, the prompt (>) appears and I can write into it. However, the consumer does not read anything and the screen remains blank.
How do I make the consumer read the data in the topic, or the producer able to write data to the topic?
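So far the only check I have done is --describe; the next things I plan to verify (a sketch, using the same addresses and topic as above) are whether every partition has a leader and an in-sync replica, and whether each broker's advertised address is reachable from the other machine:
bin/kafka-topics --describe --zookeeper a.b.c.d:2181 --topic randomTopic
nc -vz a.b.c.d 9092
nc -vz a.b.c.e 9092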