Apache ZooKeeper client times out - apache-kafka

We continuously get EndOfStreamException in the ZooKeeper logs:
[2017-04-06 19:15:24,350] WARN EndOfStreamException: Unable to read additional data from client sessionid 0x15b43c712fc03a5, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
And in the client's (consumer) logs, we get a session timeout:
main-SendThread(localhost:2181) INFO 2017-04-06 21:30:27,823: org.apache.zookeeper.ClientCnxn Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x15b43c712fc03a5, negotiated timeout = 6000
Is this normal behavior?
We are in the middle of investigating this issue: the consumers are unable to read messages from the queue, and the producers are unable to put messages in, so the whole process is jammed.
What do you suggest?

In our case, we were running into ZooKeeper disconnects just above the 6000 ms default timeout, due to a flaky distributed network. Since a node that misses the timeout takes itself out of the cluster, this was causing fairly high impact on the production cluster. We simply increased the timeout to 15 seconds and did not see the problem again.
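For reference, a sketch of where that timeout lives, assuming a broker-side server.properties and the old ZooKeeper-based consumer (the 15-second value is simply what worked for us; tune it for your network):

# server.properties (broker); default session timeout is 6000 ms in this Kafka era
zookeeper.session.timeout.ms=15000
zookeeper.connection.timeout.ms=15000

# consumer.properties (old ZooKeeper-based consumer)
zookeeper.session.timeout.ms=15000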

Related

When do Kafka consumer retries happen?

There is a retry feature on Kafka clients. I am struggling to find out when a retry happens. Would a retry happen if the connection to the broker is interrupted briefly? What about if the brokers were not reachable for 5 minutes? Will the messages get delivered once the brokers are back up? Or does retry only happen in scenarios known to the Kafka clients?
The Kafka producer consists of a pool of buffer space that holds records that haven't yet been transmitted to the server, together with a background I/O thread that turns these batched records into requests and transmits them to the cluster.
For example, if records are sent faster than they can be delivered to the server, the producer will block for max.block.ms, after which it throws an exception. The client then assumes the batch has failed and retries sending it according to the retries config:
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for my-test-topic-4 due to 30024 ms has passed since batch creation plus linger time
If the retries config is set to 3 and all retries fail, the batch is lost:
error: Failed to send message after 3 tries
How about if the brokers were not reachable for 5 mins?
If the broker is down and the retries are exhausted in the meantime, you lose the data.
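As a rough sketch of where those knobs live (the broker address, topic, and values here are placeholders, not anything from the question):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.RETRIES_CONFIG, 3);              // retry a failed batch up to 3 times
props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 1000);  // wait 1 s between retries
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 30000);     // send() blocks at most 30 s when buffers are full

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.send(new ProducerRecord<>("my-test-topic", "key", "value"), (metadata, exception) -> {
    if (exception != null) {
        // retries exhausted: the batch is lost unless you re-send it yourself here
        exception.printStackTrace();
    }
});
producer.close();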

ActiveMQ Artemis Error - AMQ224088: Timeout (10 seconds) while handshaking has occurred

In ActiveMQ Artemis, I occasionally receive the connection error below. I can't see any obvious impact on the brokers or message queues. Can anyone advise exactly what it means or what impact it could be having?
Our current response is to either restart the brokers or check that they're still connected to the cluster. Is either of these actions necessary?
Current ActiveMQ Artemis version deployed is v2.7.0.
//error log line received at least once a month
2019-05-02 07:28:14,238 ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10 seconds) while handshaking has occurred.
This error indicates that something on the network is connecting to the ActiveMQ Artemis broker, but it's not completing any protocol handshake. This is commonly seen with, for example, load balancers that do a health check by creating a socket connection without sending any real data just to see if the port is open on the target machine.
The timeout is configurable, and disabling it stops the ERROR messages from being logged, but that also disables the clean-up, which may or may not be a problem in your use-case. You should just be able to set handshake-timeout=0 on the relevant acceptor URL in broker.xml, as sketched below.
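For illustration, a sketch of the acceptor entry in broker.xml (the acceptor name, host, and port are placeholders for whatever your existing acceptor uses):

<acceptors>
   <!-- handshake-timeout=0 disables the handshake timeout check and its ERROR log -->
   <acceptor name="artemis">tcp://0.0.0.0:61616?handshake-timeout=0</acceptor>
</acceptors>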
When you see this message there should be no need to restart the broker.
In the next ActiveMQ Artemis release the IP address of the remote client where the connection originated will be included as part of the message.

Kafka 0.10.2: disk_read_io too high, ISR constantly shrinking and expanding, client timeouts

Cluster: Kafka 0.10.2, 20 brokers.
Phenomenon: we added a new consumer, and the issue was triggered by peak traffic.
The cluster constantly adjusts the ISR.
disk_read_io rises sharply.
Clients start timing out.
Broker fetch-replica connections are closed, and the broker logs show:
Attempting to send response via channel for which there is no open connection, connection id
Questions:
    
1. After checking the source code, I found this comment: // if the selector closed the connection because it was idle for too long. What causes this idleness? Is it disk I/O wait time being too long, or are there other cases where the connection is actively closed?
2. Frequent ISR switching consumes far more network traffic than normal consumer traffic. Why is that?
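For context, the idle close mentioned in that source comment is governed by connections.max.idle.ms, the broker/client setting the selector uses to drop idle connections:

# default is 10 minutes; the selector closes connections idle longer than this
connections.max.idle.ms=600000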

org.apache.kafka.clients.NetworkClient Bootstrap broker bootstrap-servers-ip:9092 disconnected

I am running Apache Kafka on my local system and it is running absolutely fine. But during smoke testing, my application is not able to connect to the Kafka cluster. It keeps throwing the following error endlessly:
[2016-11-22T23:04:35,017][WARN ][org.apache.kafka.clients.NetworkClient] Bootstrap broker <host1>:9092 disconnected
[2016-11-22T23:04:35,474][WARN ][org.apache.kafka.clients.NetworkClient] Bootstrap broker <host2>:9092 disconnected
[2016-11-22T23:04:35,951][WARN ][org.apache.kafka.clients.NetworkClient] Bootstrap broker <host1>:9092 disconnected
[2016-11-22T23:04:36,430][WARN ][org.apache.kafka.clients.NetworkClient] Bootstrap broker <host2>:9092 disconnected
I am using the below consumer config to connect:
propsMap.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "<host1>:9092,<host2>:9092");
propsMap.put("zookeeper.connect", "<host1>:2181,<host2>:2181");
propsMap.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
propsMap.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "100");
propsMap.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "15000");
propsMap.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
propsMap.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
propsMap.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
propsMap.put(ConsumerConfig.GROUP_ID_CONFIG, "test");
Could it be a network issue on the smoke-test servers, due to which my deployment server is not able to connect to the Kafka servers? It works fine on my local machine and in 2 other testing environments.
Could it have something to do with the Kafka version?
Or do I need to add some other config, such as SSL, to connect in this case?
I am new to Kafka; it would really help if someone could point me in the right direction!
If you are using the Kafka 0.9.x.x client or later (which you are, if you are using spring-kafka), you don't need the zookeeper.connect property (but that shouldn't cause your problem).
If the broker is down, you should get something like
WARN o.apache.kafka.clients.NetworkClient - Connection to node -1 could not be established. Broker may not be available.
I suggest you look at the server logs to see if there's anything useful there. You need to talk to your admins to figure out if you need SSL/SASL/Kerberos, etc. to connect.
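If the cluster does turn out to require TLS, the client-side change is roughly this (a sketch; the truststore path and password are placeholders, and your admins may require SASL settings on top):

propsMap.put("security.protocol", "SSL");
propsMap.put("ssl.truststore.location", "/path/to/client.truststore.jks"); // placeholder path
propsMap.put("ssl.truststore.password", "changeit");                       // placeholder password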
This may be because the server has moved to a different address, or it is simply not available at the moment.
If you still want to go ahead, assuming the server will come up later, but do not want the logs to keep printing "server disconnected" in an infinite loop, use this property:
reconnect.backoff.ms
The base amount of time to wait before attempting to reconnect to a given host. This avoids repeatedly connecting to a host in a tight loop. This backoff applies to all connection attempts by the client to a broker.
Type: long
Default: 50
Valid Values: [0,...]
By default, the client retries a failed host every 50 milliseconds. This can be increased to, say, 5 minutes (300,000 ms); that way your logs won't keep printing the disconnection message endlessly.
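For example, a one-line sketch using the same propsMap style as the question (the 5-minute value is just an illustration):

propsMap.put(ConsumerConfig.RECONNECT_BACKOFF_MS_CONFIG, "300000"); // retry a failed broker every 5 minutes instead of every 50 ms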
[OPTIONAL] Also, if you are using Apache Camel for routing, use the similar-sounding property in the camel-kafka component bean definition:
reconnectBackoffMs (producer)

Connection not cleaned up if client failure occurs abruptly without closing the resources

I am using HornetQ 2.2.14.Final, and connection-ttl in hornetq-jms.xml is configured as 60000 ms. I have a publisher program that sends messages to a topic and a consumer program that consumes messages from that topic. My consumer program exited abruptly without closing its resources. I waited 1 minute, since the TTL is 60000 ms, but the server did not clean up the resources even after one minute. Can anyone help me resolve this, if it is a configuration issue?
Sometimes it can take as long as 2X the TTL, depending on how the failure happened. We recently had a fix on master to ensure it is always close to the configured TTL.
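For what it's worth, a sketch of the server-side alternative, assuming hornetq-configuration.xml (connection-ttl-override forces a TTL regardless of what clients negotiate):

<!-- hornetq-configuration.xml: force a 60-second TTL for all connections -->
<connection-ttl-override>60000</connection-ttl-override>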