I have an application which consumes messages from 4 Kafka topics. For simplicity's sake, let's call the topics a, b, c and d. Each new version of the application uses a new consumer group id (basically a Docker image ID).
Today, I had a problem where a new version of the application launched with a new consumer group which connected to topics a, b and d, but not to topic c. Looking in Kafka Manager, the new consumer group had no entry for topic c.
I can see an error in the client error logs:
Consumer clientId=indexer, groupId=650-c6ac848] Node 331 sent an invalid full fetch response with extra=(a-28, response=(c-28","logger_name":"org.apache.kafka.clients.FetchSessionHandler","thread_name":"kafka-coordinator-heartbeat-thread
I suspect it may be an infrastructure / configuration issue, but I can't be certain. I'm a developer and not very familiar with Kafka, so I don't know where to look. The application code changes were minimal and shouldn't have impacted consumer group setup.
To me, the log message suggests something related to the heartbeat, and that topics a and c have somehow had their wires crossed.
server.properties:
advertised.listeners=PLAINTEXT://kafka1.dub1.cloud:9092
auto.create.topics.enable=false
broker.id=16
broker.rack=dub1-zone4
default.replication.factor=3
delete.topic.enable=true
group.initial.rebalance.delay.ms=3
log.dirs=/var/lib/kafka
log.retention.check.interval.ms=300000
log.retention.hours=168
log.segment.bytes=1073741824
min.insync.replicas=2
num.io.threads=8
num.network.threads=3
num.partitions=30
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=3
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
transaction.state.log.min.isr=2
transaction.state.log.replication.factor=3
unclean.leader.election.enable=false
zookeeper.connect=zookeeper1:2181,zookeeper2:2181,zookeeper3:2181
zookeeper.connection.timeout.ms=6000
Looking at the source code isn't helpful to the uninitiated:
https://github.com/apache/kafka/blob/b8a99be7847c61d7792689b71fda5b283f8340a8/clients/src/main/java/org/apache/kafka/clients/FetchSessionHandler.java#L394
Any suggestions on how to further diagnose this problem would be greatly appreciated.
It turns out that topic c had no messages, and that seems to be the reason for the errors I saw.
We have an issue: when Kafka brokers must be taken offline, the consumer services are unaware of it and keep running.
We tried listing consumers in the new Kafka instance and saw none of the existing consumers there. All the consumers listed were newly created ones.
We had to manually terminate all existing consumer services, which is inconvenient every time we hit this issue.
Question - How does a consumer know it is no longer listed in the Kafka cluster so it should terminate itself?
P.S. We use Spring Kafka.
1 -- Checking cluster and replica status
List all brokers registered in the Kafka cluster:
$ zookeeper-shell.sh localhost:9001 ls /brokers/ids
Show the status of a specific broker:
$ zookeeper-shell.sh localhost:9001 get /brokers/ids/<id>
Check whether any partition replicas are unavailable:
$ kafka-check --cluster-type=sample_type replica_unavailability
Run the same check for the first broker only:
$ kafka-check --cluster-type=sample_type --broker-id 3 replica_unavailability --first-broker-only
Check for offline partitions:
$ kafka-check --cluster-type=sample_type offline
2 -- Code sample to send a kill message and shut down automatically
There are two custom options for handling the shutdown with a kill message. Either way, do it gracefully by sending the kill message before taking down brokers or topics.
Option 1: Use an in-band message/signal, i.e. send a "kill" message on the topics the consumer is listening to. The consumer sees it in offset order on the topic-partition.
Option 2: Make the consumer listen to two topics, e.g. "topic" and "topic_kill".
The difference between the two options is that in the first, the kill message arrives in the order it was sent, so depending on your implementation there may be earlier messages waiting to be consumed before the kill message.
The second option lets the kill signal arrive out of band, independently of the data messages and without being blocked by them. This is a nicer and more reusable architectural pattern, with a clear separation between the data topic and signaling.
Code sample: a) a producer sending the kill message and b) a consumer that receives it and handles the shutdown
# Producer -- modify and adapt as needed
import json
from typing import List
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=['0.0.0.0:<my port number>'],
                         key_serializer=lambda m: m.encode('utf8'),
                         value_serializer=lambda m: json.dumps(m).encode('utf8'))

def send_kill(topic: str, partitions: List[int]):
    # Send one record keyed "kill" to every partition the consumers read from.
    for p in partitions:
        producer.send(topic, key='kill', partition=p)
    # Block until all buffered records have actually been sent.
    producer.flush()
# Consumer that accepts a kill message -- modify and adapt as needed
import json
import sys
from kafka import KafkaConsumer
from kafka.structs import OffsetAndMetadata, TopicPartition

consumer = KafkaConsumer(bootstrap_servers=['0.0.0.0:<my port number>'],
                         # Keys are optional, so tolerate records without one.
                         key_deserializer=lambda m: m.decode('utf8') if m is not None else None,
                         value_deserializer=lambda m: json.loads(m.decode('utf8')),
                         auto_offset_reset="earliest",
                         group_id='1')
consumer.subscribe(['topic'])

for msg in consumer:
    tp = TopicPartition(msg.topic, msg.partition)
    # The committed offset is the next offset to read, hence msg.offset + 1.
    offsets = {tp: OffsetAndMetadata(msg.offset + 1, None)}
    if msg.key == "kill":
        consumer.commit(offsets=offsets)
        consumer.unsubscribe()
        sys.exit(0)
    # do your work...
    consumer.commit(offsets=offsets)
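The sample above covers Option 1 (in-band). For Option 2, a rough sketch could look like the following. The topic names come from the text above; the helper is_kill_signal, the run function and its parameters are names I made up for illustration, so adapt them to your setup.

```python
DATA_TOPIC = "topic"
KILL_TOPIC = "topic_kill"


def is_kill_signal(topic: str) -> bool:
    """Any record arriving on the dedicated kill topic counts as a shutdown signal."""
    return topic == KILL_TOPIC


def run(bootstrap_servers, group_id, handle):
    """Consume both topics and stop as soon as a kill signal is seen.

    `handle` is a hypothetical callback that processes one data record.
    """
    # Imported here so the helper above can be used without kafka-python installed.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers, group_id=group_id)
    consumer.subscribe([DATA_TOPIC, KILL_TOPIC])
    try:
        for msg in consumer:
            if is_kill_signal(msg.topic):
                break
            handle(msg)
    finally:
        # Leave the consumer group cleanly before the process exits.
        consumer.close()
```

One thing to watch out for: if a consumer group starts reading topic_kill from the earliest offset, an old kill message that is still retained would shut it down immediately, so you probably want short retention on that topic or a timestamp check on the signal.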
We have an 8-node Kafka cluster with Kafka Manager installed.
We are monitoring it via New Relic.
Both New Relic and Kafka Manager report that Kafka is rejecting bytes. I am not able to find the cause.
There are no error lines in the broker logs.
JMX bean - JMX/kafka.server/BrokerTopicMetrics/BytesRejectedPerSec/OneMinuteRate
Kafka Config -
auto.create.topics.enable=false
auto.leader.rebalance.enable=true
broker.id=180
controlled.shutdown.enable=true
controlled.shutdown.max.retries=3
default.replication.factor=1
delete.topic.enable=true
kafka.http.metrics.host=0.0.0.0
kafka.http.metrics.port=24042
kafka.log4j.dir=/logs/kafka
kerberos.auth.enable=false
leader.imbalance.check.interval.seconds=300
leader.imbalance.per.broker.percentage=10
log.cleaner.dedupe.buffer.size=134217728
log.cleaner.delete.retention.ms=604800000
log.cleaner.enable=true
log.cleaner.min.cleanable.ratio=0.5
log.cleaner.threads=1
log.dirs=/kafka/data
log.retention.bytes=5368709120
log.retention.check.interval.ms=300000
log.retention.hours=72
log.retention.ms=259200000
log.roll.hours=168
log.segment.bytes=1073741824
message.max.bytes=3145728
min.insync.replicas=1
num.io.threads=8
num.partitions=1
num.replica.fetchers=6
offsets.topic.num.partitions=50
offsets.topic.replication.factor=3
port=9092
quota.consumer.default=52428800
quota.consumer.default=52428800
quota.producer.default=26214400
quota.producer.default=26214400
replica.fetch.max.bytes=4194304
replica.lag.max.messages=6000
replica.lag.time.max.ms=60000
unclean.leader.election.enable=false
zookeeper.session.timeout.ms=6000
zookeeper.connect=zookeeper01.prod.***.com:2181,zookeeper02.prod.***.com:2181,zookeeper03.prod.***.com:2181
security.inter.broker.protocol=PLAINTEXT
listeners=PLAINTEXT://kafka01.prod.***.com:9092,
broker.id.generation.enable=false
sasl.kerberos.service.name=kafka
listeners=PLAINTEXT://:9092
num.network.threads=8
From examining the Kafka sources (ref1, ref2), it seems that the only thing counted in BytesRejectedPerSec (bytesRejectedRate) is a message whose size exceeds config.maxMessageSize, i.e. the topic-level max.message.bytes, which defaults to the broker's message.max.bytes.
Note: recompression and message format conversion on the broker may also change the message size, so it can end up larger than what the producer sent.
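To narrow down which producer is responsible, it may help to check payload sizes on the client side before sending. Below is a minimal sketch in plain Python, assuming the limit is the message.max.bytes=3145728 from the config in the question. The overhead allowance and the function name oversized are my own guesses for illustration, not anything Kafka defines.

```python
# Assumption: mirrors message.max.bytes=3145728 from the broker config above.
MESSAGE_MAX_BYTES = 3145728
# Hypothetical allowance for key, headers and record framing; tune for your data.
RECORD_OVERHEAD_BYTES = 100


def oversized(payloads, limit=MESSAGE_MAX_BYTES, overhead=RECORD_OVERHEAD_BYTES):
    """Return (index, estimated_size) for each payload likely to be rejected.

    `payloads` is an iterable of already-serialized values (bytes).
    """
    suspects = []
    for index, payload in enumerate(payloads):
        estimated = len(payload) + overhead
        if estimated > limit:
            suspects.append((index, estimated))
    return suspects
```

This only estimates the size before compression and format conversion, so treat a near-limit result as suspicious rather than as proof either way.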
I have a Kafka cluster of 3 brokers on 3 different servers.
Let's assume the three servers are:
99.99.99.1
99.99.99.2
99.99.99.3
All 3 servers have a shared path on which Kafka resides.
I have created 3 server.properties files named:
server1.properties
server2.properties
server3.properties
server1.properties looks like this:
broker.id=1
port=9094
listeners=SSL://99.99.99.1:9094
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=3
zookeeper.connect=99.99.99.1:2181,99.99.99.2:2182,99.99.99.3:2183
ssl.keystore.location=xyz.jks
ssl.keystore.password=password
ssl.key.password=password
ssl.truststore.location=xyz.jks
ssl.truststore.password=password
ssl.client.auth=required
security.inter.broker.protocol=SSL
The other two server properties files look similar.
Issues/queries:
I need the consumers and producers to connect using SSL, and the brokers to connect to each other using SSL as well. Is my configuration right for this?
I keep getting the error below. Is this normal?
WARN Failed to send SSL Close message
(org.apache.kafka.common.network.SslTransportLayer)
java.io.IOException: Broken pipe
We are doing high-availability (HA) testing on our current Kafka cluster. The objective is that, while a producer job is pushing data to a particular partition of a topic, all the brokers in the Kafka cluster are restarted sequentially (stop the first broker, restart it, and once it is back up, do the same for the second broker, and so on). The producer job pushes around 7 million records over about 30 minutes while this test is going on. At the end of the job, we noticed that around 1000 records were missing.
Below are specifics of our Kafka cluster: (kafka_2.10-0.8.2.0)
-3 Kafka brokers each with 2 100GB mounts
Topic was created with:
-Replication factor of 3
-min.insync.replica=2
server.properties:
broker.id=1
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/drive1,/drive2
num.partitions=1
num.recovery.threads.per.data.dir=1
log.flush.interval.messages=10000
log.retention.hours=1
log.segment.bytes=1073741824
log.retention.check.interval.ms=1800000
log.cleaner.enable=false
zookeeper.connect=ZK1:2181,ZK2:2181,ZK3:2181
zookeeper.connection.timeout.ms=10000
advertised.host.name=XXXX
auto.leader.rebalance.enable=true
auto.create.topics.enable=false
queued.max.requests=500
delete.topic.enable=true
controlled.shutdown.enable=true
unclean.leader.election=false
num.replica.fetchers=4
controller.message.queue.size=10
Producer.properties (async producer with the new producer API)
bootstrap.servers=broker1:9092,broker2:9092,broker3:9092
acks=all
buffer.memory=33554432
compression.type=snappy
batch.size=32768
linger.ms=5
max.request.size=1048576
block.on.buffer.full=true
reconnect.backoff.ms=10
retry.backoff.ms=100
key.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
Can someone share any info about Kafka clusters and HA that would ensure no data is lost during a rolling restart of the Kafka brokers?
Also, here is my producer code. This is a fire-and-forget kind of producer. We are not handling failures explicitly as of now. It works fine for millions of records. I only see the problem when the Kafka brokers are restarted as explained above.
public void sendMessage(List<byte[]> messages, String destination, Integer partition, String kafkaDBKey) {
    for (byte[] message : messages) {
        producer.send(new ProducerRecord<byte[], byte[]>(destination, partition, kafkaDBKey.getBytes(), message));
    }
}
By increasing the retries value from its default of 0 to 4000 on the producer side, we are able to send the data successfully without losing any records.
retries=4000
With this setting, the same message may be sent twice, and messages may be out of sequence by the time the consumer receives them (the second message might arrive before the first). For our current problem that is not an issue, and it is handled on the consumer side to ensure everything is in order.
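For anyone wondering what "handled on the consumer side" could look like, here is one possible sketch in plain Python. It assumes every message carries a producer-assigned sequence number starting at 0, which is my assumption for illustration and not something described in the setup above. The class name ReorderBuffer is made up as well.

```python
class ReorderBuffer:
    """Drops duplicate messages and releases payloads in sequence order.

    Assumes each message has a producer-assigned, gap-free sequence number.
    """

    def __init__(self, first_seq=0):
        self._next = first_seq   # next sequence number we are allowed to deliver
        self._pending = {}       # out-of-order payloads waiting for their turn

    def accept(self, seq, payload):
        """Return the payloads that became deliverable, in order."""
        if seq < self._next or seq in self._pending:
            return []            # duplicate caused by a producer retry
        self._pending[seq] = payload
        ready = []
        while self._next in self._pending:
            ready.append(self._pending.pop(self._next))
            self._next += 1
        return ready


buffer = ReorderBuffer()
delivered = []
# Message 1 arrives before message 0, and message 1 is then delivered twice.
for seq, payload in [(1, "second"), (0, "first"), (1, "second"), (2, "third")]:
    delivered.extend(buffer.accept(seq, payload))
```

The buffer grows while it waits for a missing sequence number, so a real consumer would also need a limit or timeout for messages that never arrive.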
We have a Kafka cluster with three brokers (node ids 0, 1, 2) and a ZooKeeper setup with three nodes.
We created a topic "test" on this cluster with 20 partitions and a replication factor of 2. We are using the Java producer API to send messages to this topic. One of the Kafka brokers intermittently goes down, after which it is unrecoverable. To simulate the case, we killed one of the brokers manually. According to the Kafka architecture, the cluster is supposed to recover by itself, but that is not happening. When I describe the topic on the console, I see the number of ISRs reduced to one for a few of the partitions, since one of the brokers was killed. Now, whenever we try to push messages via the producer API (either the Java client or the console producer), we encounter a SocketTimeoutException. A quick look into the logs says "Unable to fetch the metadata".
WARN [2015-07-01 22:55:07,590] [ReplicaFetcherThread-0-3][] kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-0-3],
Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 23711; ClientId: ReplicaFetcherThread-0-3;
ReplicaId: 0; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [zuluDelta,2] -> PartitionFetchInfo(11409,1048576),[zuluDelta,14] -> PartitionFetchInfo(11483,1048576).
Possible cause: java.nio.channels.ClosedChannelException
[2015-07-01 23:37:40,426] WARN Fetching topic metadata with correlation id 0 for topics [Set(test)] from broker [id:1,host:abc-0042.yy.xxx.com,port:9092] failed (kafka.client.ClientUtils$)
java.net.SocketTimeoutException
at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
at kafka.utils.Utils$.read(Utils.scala:380)
at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
at kafka.network.Receive$class.readCompletely(Transmission.scala:56)
at kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
at kafka.network.BlockingChannel.receive(BlockingChannel.scala:111)
at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:75)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
Any leads will be appreciated...
The error "Unable to fetch the metadata" most likely means that bootstrap.servers in the producer is set to the broker that has died.
Ideally, you should list more than one broker in bootstrap.servers, because if one of the brokers fails (or is unreachable), another one can still give you the metadata.
FYI: Metadata is the information about a particular topic that tells the client how many partitions it has, which brokers lead them, which brokers follow them, etc.
So when a record is produced to a partition, it is sent to that partition's leader broker.
From your question, your ISR set has only one broker. You could try setting bootstrap.servers to this broker.
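If you want to verify this quickly, a small script along these lines can tell you whether your bootstrap list has a fallback and which entries accept a TCP connection at all. It is only a sketch using the Python standard library; the function names are my own, and a successful TCP connect does not prove the broker is healthy, only that something is listening on that port.

```python
import socket


def parse_bootstrap_servers(value):
    """Split a bootstrap.servers string into a list of (host, port) tuples."""
    servers = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError("expected host:port, got %r" % entry)
        servers.append((host, int(port)))
    return servers


def has_fallback(value):
    """True when more than one distinct broker is listed."""
    return len(set(parse_bootstrap_servers(value))) > 1


def reachable(host, port, timeout=3.0):
    """True when a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, has_fallback("broker1:9092") returns False, which is the situation described above: if that single broker dies, the client has nowhere else to ask for metadata.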