Why does Kafka crash?

I am running Kafka 2.5.0.
My Kafka service crashes sometimes.
kafka/logs/server.log
[09:25:23,316] WARN Unable to read additional data from client sessionid 0x1000001a8fd0012, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
/var/log/messages
09:25:23 kafka1 systemd: kafka.service: main process exited, code=exited, status=1/FAILURE
09:25:23 kafka1 systemd: Unit kafka.service entered failed state.
09:25:23 kafka1 systemd: kafka.service failed.
How can I find out why this happens?

First check whether ZooKeeper is running.
If it is, try changing these settings in zoo.cfg:
autopurge.snapRetainCount=15 (at least)
autopurge.purgeInterval=1 (or 2; the value is in hours)
Some hints might be here:
zookeeper + Unable to read additional data from client session id
ZooKeeper keeps getting EndOfStreamException, causing a crash
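To check what a given zoo.cfg currently sets, here is a minimal Python sketch. The function names are hypothetical, the threshold of 15 only mirrors the suggestion above, and the fallback values (3 snapshots, purge interval 0 meaning disabled) are what I understand the ZooKeeper defaults to be, so verify them for your version.

```python
def parse_zoo_cfg(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def autopurge_warnings(settings, min_retain=15):
    """Return warnings about autopurge; min_retain follows the answer's hint."""
    warnings = []
    # Assumed defaults when a key is missing: interval 0 (disabled), retain 3.
    interval = int(settings.get("autopurge.purgeInterval", "0"))
    retain = int(settings.get("autopurge.snapRetainCount", "3"))
    if interval <= 0:
        warnings.append("autopurge is disabled (purgeInterval is 0 or unset)")
    if retain < min_retain:
        warnings.append(f"snapRetainCount={retain} is below {min_retain}")
    return warnings


if __name__ == "__main__":
    # Illustrative input only; read your real zoo.cfg instead.
    sample = "autopurge.snapRetainCount=3\nautopurge.purgeInterval=24\n"
    for warning in autopurge_warnings(parse_zoo_cfg(sample)):
        print(warning)
```

This only inspects the file. It does not tell you whether autopurge is actually the cause of the crash.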

Related

What does the message "The Critical Analyzer detected slow paths on the broker" mean in the Artemis broker?

Setup: I have an artemis broker HA cluster with 3 brokers. The replication policy is replication. Each broker is running in its own VM.
Problem: When I leave my brokers running for a long time, usually after 5-6 hours, I get the error below.
2022-11-21 21:32:37,902 WARN [org.apache.activemq.artemis.utils.critical.CriticalMeasure] Component org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager is expired on path 0
2022-11-21 21:32:37,902 INFO [org.apache.activemq.artemis.core.server] AMQ224107: The Critical Analyzer detected slow paths on the broker. It is recommended that you enable trace logs on org.apache.activemq.artemis.utils.critical while you troubleshoot this issue. You should disable the trace logs when you have finished troubleshooting.
2022-11-21 21:32:37,902 ERROR [org.apache.activemq.artemis.core.server] AMQ224079: The process for the virtual machine will be killed, as component org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager@46d59067 is not responsive
2022-11-21 21:32:37,969 WARN [org.apache.activemq.artemis.core.server] AMQ222199: Thread dump:
*******************************************************************************
Complete Thread dump
"Thread-517 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$7@437da279)" Id=602 TIMED_WAITING on java.util.concurrent.SynchronousQueue$TransferStack@75f49105
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.SynchronousQueue$TransferStack@75f49105
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
What does this really mean? I understand that the critical analyzer sees an error and it halts the broker but what is causing this error?
You may want to take a look at the documentation. Basically, you are experiencing some issue that the broker detects, and it shuts down before it becomes too unresponsive. If you set the critical analyzer policy to LOG, you might get more clues about the issue.

Trying to run Apache/NIFI on Zookeeper from Confluent

I am trying to run Apache NiFi on Confluent's ZooKeeper. NiFi version 1.11.3 is installed in /opt/nifi by unpacking the tar archive; Confluent is the community edition, version 5.3, installed from the Confluent repo https://packages.confluent.io/rpm/5.3.
NiFi works with its integrated ZooKeeper, and it also works if I download ZooKeeper separately from the Apache ZooKeeper site. Confluent Kafka also works with both the separate ZooKeeper and the NiFi-integrated one. But I cannot make NiFi work with the ZooKeeper from Confluent.
In logs I see only one warning which is:
WARN Received packet at server of unknown type 15 (org.apache.zookeeper.server.ZooKeeperServer)
The config file is the same on all three ZooKeeper nodes:
tickTime=2000
dataDir=/var/lib/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=myhost1:2888:3888
server.2=myhost2:2888:3888
server.3=myhost3:2888:3888
autopurge.snapRetainCount=3
autopurge.purgeInterval=24
I do not think that Confluent really changed anything in their ZooKeeper. What could be the reason for this error?
As @BryanBende said:
NiFi 1.11.x (In our particular case, 1.11.4) requires ZK 3.5, please
confirm the version of ZK used in Confluent platform, IF ITS 3.4 THEN
ITS NOT GOING TO WORK – Bryan Bende Mar 27 at 14:35
The typical error you will see in zookeeper log:
Oct 08 17:22:23 some-pro-zk03 zookeeper-server-start[14136]: [2020-10-08 17:22:23,275] INFO Accepted socket connection from /10.10.10.1:53794 (org.apache.zookeeper.server.NIOServerCnxnFactory)
Oct 08 17:22:23 some-pro-zk03 zookeeper-server-start[14136]: [2020-10-08 17:22:23,275] INFO Refusing session request for client /10.10.10.1:53794 as it has seen zxid 0x400000000 our last zxid is 0x300000004 client must try another server (org.apache.zookeeper.server.ZooKeeperServer)
Oct 08 17:22:23 some-pro-zk03 zookeeper-server-start[14136]: [2020-10-08 17:22:23,275] INFO Closed socket connection for client /10.159.164.93:53794 (no session established for client) (org.apache.zookeeper.server.NIOServerCnxn)
The typical error you will see in client log:
2020-10-07 16:00:09,112 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:990)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66)
at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
It took me a full day of work (reading logs!) to find this error.
So, check your ZooKeeper version:
For ZooKeeper 3.5 and later:
echo srvr | nc localhost 2181
For ZooKeeper versions before 3.5:
echo stat | nc localhost 2181
You can also use telnet.
For ZooKeeper 3.5 and later:
telnet localhost 2181
srvr
For ZooKeeper versions before 3.5:
telnet localhost 2181
stat
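If you would rather script the version check than type it into nc or telnet, something like the following Python sketch might do. It assumes the server answers the srvr four-letter command with a first line of the form "Zookeeper version: X.Y.Z-..."; the function names are made up for this example.

```python
import re
import socket


def four_letter_word(host, port, command="srvr", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_version(reply):
    """Extract (major, minor, patch) from a 'Zookeeper version: ...' line."""
    match = re.search(r"Zookeeper version:\s*(\d+)\.(\d+)\.(\d+)", reply)
    if match is None:
        return None  # unexpected reply, or the command is not allowed
    return tuple(int(part) for part in match.groups())
```

Calling parse_version(four_letter_word("localhost", 2181)) against a running server should give a tuple you can compare with (3, 5, 0). A result below that would match the incompatibility described above.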

Cannot start Zookeeper due to: Exception causing close of session 0x0 due to java.io.EOFException

I am trying to start up Zookeeper via the CLI with the command:
bin/zookeeper-server-start.sh ../config/zookeeper.properties
It hums along for a second, with everything looking correct, until it says this:
INFO binding to port 0.0.0.0/0.0.0.0:2181 (org.apache.zookeeper.server.NIOServerCnxnFactory)
and then the below loops indefinitely until I exit:
[2018-08-10 15:07:48,223] INFO Accepted socket connection from /172.31.39.32:46374 (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2018-08-10 15:07:48,228] WARN Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running (org.apache.zookeeper.server.NIOServerCnxn)
[2018-08-10 15:07:48,228] INFO Closed socket connection for client /172.31.39.32:46374 (no session established for client) (org.apache.zookeeper.server.NIOServerCnxn)
This is a single server and I believe a single node test server, so there isn't a quorum or other pieces running. My zookeeper config is basic, it only contains this:
dataDir=/tmp/zookeeper
clientPort=2181
maxClientCnxns=0
The weird thing is, my zookeeper had been running fine, and I had made NO changes to the config. Pulled it down to try to fix something else to do a quick restart on the zookeeper, and it won't budge. I've checked, and nothing else is running on port 2181.
I see this question has been asked several times with no answers, any ideas?
This might be happening because of some corruption in the ZooKeeper data. You should not set dataDir to /tmp/*. If your system purges some of the data under /tmp, it will be difficult for ZooKeeper to restore its state on restart. If you check the ZooKeeper logs, you should see some kind of exception there.
Since you mentioned this ZooKeeper instance is for test purposes only, you should set dataDir to anything but /tmp and try restarting.
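A quick way to catch this in a config file is a check like the Python sketch below. The helper names are hypothetical, and it treats only /tmp as a temporary location, which is an assumption you may want to widen for your system.

```python
import posixpath


def read_data_dir(properties_text):
    """Return the dataDir value from zookeeper.properties-style text, or None."""
    for raw in properties_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "dataDir":
            return value.strip()
    return None


def is_under_temp(path, temp_roots=("/tmp",)):
    """True if path equals one of temp_roots or is nested below it."""
    normalized = posixpath.normpath(path)
    for root in temp_roots:
        root = posixpath.normpath(root)
        if normalized == root or normalized.startswith(root + "/"):
            return True
    return False


if __name__ == "__main__":
    # The config from the question above.
    config = "dataDir=/tmp/zookeeper\nclientPort=2181\nmaxClientCnxns=0\n"
    data_dir = read_data_dir(config)
    if data_dir is not None and is_under_temp(data_dir):
        print(f"dataDir {data_dir} is under /tmp and may be purged")
```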

Can't start zookeeper

I'm using Confluent Platform. The status lookup shows ZooKeeper as active, but when I try to start Kafka with confluent it reports that ZooKeeper is down.
$ sudo service zookeeper status
Redirecting to /bin/systemctl status zookeeper.service
● zookeeper.service - Zookeeper
Loaded: loaded (/etc/systemd/system/zookeeper.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2017-08-08 17:25:34 PDT; 16h ago
Docs: http://kafka.apache.org/documentation.html
Process: 3774 ExecStop=/var/www/confluent/bin/zookeeper-server-stop (code=exited, status=1/FAILURE)
Main PID: 3785 (java)
CGroup: /system.slice/zookeeper.service
└─3785 java -Xmx512M -Xms512M -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -Djava.awt.headless=true -Xloggc:/var/log...
zookeeper[3785]: [2017-08-08 17:26:09,005] INFO Processed session termination for sessionid: 0x15dc460fd0c0000 (org.apache.zooke...Processor)
zookeeper[3785]: [2017-08-08 17:26:39,000] INFO Expiring session 0x15dc4364baf0004, timeout of 60000ms exceeded (org.apache.zook...perServer)
zookeeper[3785]: [2017-08-08 17:26:39,000] INFO Expiring session 0x15dc4364baf0002, timeout of 60000ms exceeded (org.apache.zook...perServer)
zookeeper[3785]: [2017-08-08 17:26:39,000] INFO Expiring session 0x15dc4364baf0003, timeout of 60000ms exceeded (org.apache.zook...perServer)
zookeeper[3785]: [2017-08-08 17:26:39,001] INFO Processed session termination for sessionid: 0x15dc4364baf0004 (org.apache.zooke...Processor)
zookeeper[3785]: [2017-08-08 17:26:39,002] INFO Processed session termination for sessionid: 0x15dc4364baf0002 (org.apache.zooke...Processor)
zookeeper[3785]: [2017-08-08 17:26:39,002] INFO Processed session termination for sessionid: 0x15dc4364baf0003 (org.apache.zooke...Processor)
zookeeper[3785]: [2017-08-09 09:56:26,711] INFO Accepted socket connection from /127.0.0.1:46446 (org.apache.zookeeper.server.NI...xnFactory)
zookeeper[3785]: [2017-08-09 09:59:14,796] WARN Exception causing close of session 0x0 due to java.io.IOException: Len error -72...erverCnxn)
zookeeper[3785]: [2017-08-09 09:59:14,796] INFO Closed socket connection for client /127.0.0.1:46446 (no session established for...erverCnxn)
Hint: Some lines were ellipsized, use -l to show in full.
$ confluent start kafka
Starting zookeeper
|Zookeeper failed to start
zookeeper is [DOWN]
Cannot start Kafka, Zookeeper is not running. Check your deployment
This is because ZooKeeper is already running. You can find the process with
ps aux|grep zookeeper
then kill it manually, and it should work.
The most common reason that
confluent start kafka
reports that ZooKeeper is down is that another ZooKeeper instance is already running, so the new instance cannot bind to its required port (2181 by default).
A few options for finding the other ZooKeeper instance that is running when you issue confluent start kafka:
Run jps to see the running Java processes. ZooKeeper is the process named QuorumPeerMain, shown next to its process ID (this is roughly equivalent to running ps xuaww | grep -i zookeeper).
Run lsof -i :2181 to find the process that is holding the default ZooKeeper port (2181 in this example, but it might be different on your system).
Try running confluent start kafka again after stopping that process.
I received the same message. In my case I didn't set $JAVA_HOME variable properly.
You are mixing two installations.
confluent start kafka would depend on you running confluent start zookeeper.
Rather, it seems you already have systemctl running Zookeeper, so you should ideally just configure your server.properties and use the regular kafka-server-start script. And/or create a systemctl file for Kafka.
Run $ confluent log zookeeper to check the log for errors.
There is a high chance ZooKeeper is already running and using port 2181.
Use $ sudo lsof -i :2181 to see which process is using that port, or
run $ sudo netstat -plten | grep java to see the Java processes and the ports they are on.
Run kill -9 <pid> to kill the process, then try again.
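If lsof or netstat are not available, a rough check for whether anything is already listening on the ZooKeeper port can be sketched in Python. This is only an illustration: port_in_use is a hypothetical helper, and it cannot tell you which process owns the port, only that something accepts connections there.

```python
import socket


def port_in_use(host="127.0.0.1", port=2181, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use():
        print("Port 2181 is already taken; another ZooKeeper may be running")
    else:
        print("Nothing is listening on port 2181")
```

Run it before confluent start kafka. If the port is taken, stop the other instance first or point Kafka at it instead.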

Mesos-Marathon cluster issue: Could not determine the current leader

I am new to Mesos and Marathon. I have set up 3 master and 3 slave servers following the guide at www.digitalocean.com, and configured the masters and slaves as described there. I finished setting up Mesos, Marathon, Zookeeper and Chronos. Mesos is listening on port 5050, Marathon on 8080 and Chronos on 4400. After a few hours my Marathon instances start showing Error 503:
HTTP ERROR: 503
Problem accessing /. Reason:
Could not determine the current leader
Powered by Jetty:// 9.3.z-SNAPSHOT.
But Mesos is working fine. I run into this problem every time, and if I restart the Marathon and ZooKeeper services it works again.
Marathon
Jun 15 06:19:20 master3 marathon[1054]: INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(192.168.4.78:8080 (mesosphere.marathon.api.LeaderProxyFilter$:qtp522188921-35)
Jun 15 06:19:20 master3 marathon[1054]: INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(192.168.4.78:8080 (mesosphere.marathon.api.LeaderProxyFilter$:qtp522188921-35)
Zookeeper
2016-06-15 03:41:13,797 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /192.168.4.78:38339
2016-06-15 03:41:13,798 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running