How to troubleshoot Hypnotoad worker failures (Mojolicious)

I've got an app that is served up by Hypnotoad, with no reverse proxy. It has 15 workers, with 2 clients allowed apiece. The app is launched via hypnotoad in foreground mode.
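For reference, the server is started along these lines; the application script path is a placeholder:
hypnotoad -f ./script/my_app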
I am seeing the following in the log/production.log:
[Wed Apr 1 16:28:12 2015] [error] Worker 119914 has no heartbeat, restarting.
[Wed Apr 1 16:28:21 2015] [error] Worker 119910 has no heartbeat, restarting.
[Wed Apr 1 16:28:21 2015] [error] Worker 119913 has no heartbeat, restarting.
[Wed Apr 1 16:28:22 2015] [error] Worker 119917 has no heartbeat, restarting.
[Wed Apr 1 16:28:22 2015] [error] Worker 119909 has no heartbeat, restarting.
[Wed Apr 1 16:28:27 2015] [error] Worker 119907 has no heartbeat, restarting.
[Wed Apr 1 16:28:34 2015] [error] Worker 119905 has no heartbeat, restarting.
[Wed Apr 1 16:28:42 2015] [error] Worker 119904 has no heartbeat, restarting.
[Wed Apr 1 16:30:12 2015] [error] Worker 119912 has no heartbeat, restarting.
[Wed Apr 1 16:31:23 2015] [error] Worker 119918 has no heartbeat, restarting.
[Wed Apr 1 16:32:18 2015] [error] Worker 119911 has no heartbeat, restarting.
[Wed Apr 1 16:32:22 2015] [error] Worker 119916 has no heartbeat, restarting.
However, the workers are never restarted.
When I run strace on the manager process, it appears to be valiantly trying to kill the (now-expired) workers:
Process 119878 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
kill(119906, SIGKILL) = 0
kill(119917, SIGKILL) = 0
kill(119905, SIGKILL) = 0
kill(119910, SIGKILL) = 0
kill(119904, SIGKILL) = 0
kill(119914, SIGKILL) = 0
kill(119916, SIGKILL) = 0
kill(119908, SIGKILL) = 0
kill(119913, SIGKILL) = 0
kill(119915, SIGKILL) = 0
kill(119918, SIGKILL) = 0
kill(119912, SIGKILL) = 0
kill(119909, SIGKILL) = 0
kill(119911, SIGKILL) = 0
kill(119907, SIGKILL) = 0
stat("/xxx/xxx/xxx/hypnotoad.pid", {st_mode=S_IFREG|0644, st_size=6, ...}) = 0
poll([{fd=4, events=POLLIN|POLLPRI}], 1, 1000) = 0 (Timeout)
kill(119906, SIGKILL) = 0
kill(119917, SIGKILL) = 0
kill(119905, SIGKILL) = 0
kill(119910, SIGKILL) = 0
kill(119904, SIGKILL) = 0
kill(119914, SIGKILL) = 0
kill(119916, SIGKILL) = 0
kill(119908, SIGKILL) = 0
kill(119913, SIGKILL) = 0
kill(119915, SIGKILL) = 0
kill(119918, SIGKILL) = 0
kill(119912, SIGKILL) = 0
kill(119909, SIGKILL) = 0
kill(119911, SIGKILL) = 0
kill(119907, SIGKILL) = 0
stat("/xxx/xxx/xxx/hypnotoad.pid", {st_mode=S_IFREG|0644, st_size=6, ...}) = 0
poll([{fd=4, events=POLLIN|POLLPRI}], 1, 1000^C <unfinished ...>
Process 119878 detached
How can I troubleshoot this further to determine:
Why does Hypnotoad think it still needs to kill non-existent processes?
Why isn't it starting new ones?
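One way to check whether those worker PIDs are really gone or merely defunct, using a few of the PIDs from the strace output above (standard procps ps options):
ps -o pid,ppid,stat,cmd -p 119904,119905,119907
A STAT of Z would mean the workers have exited but have not yet been reaped by the manager.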

What does "Worker 31842 has no heartbeat, restarting" mean?
As long as they are accepting new connections, worker processes of all built-in preforking web servers send heartbeat messages to the manager process at regular intervals, to signal that they are still responsive. A blocking operation such as an infinite loop in your application can prevent this, and will force the affected worker to be restarted after a timeout. This timeout defaults to 20 seconds and can be extended with the attribute "heartbeat_timeout" in Mojo::Server::Prefork if your application requires it.
http://mojolicio.us/perldoc/Mojolicious/Guides/FAQ#What-does-Worker-31842-has-no-heartbeat-restarting-mean
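For reference, a minimal sketch of where these settings live, assuming the Hypnotoad options are kept in a Mojolicious config file; the file name, listen address and raised timeout are illustrative, while the worker and client counts match the question:
# my_app.conf -- hypothetical Mojolicious::Plugin::Config file; Hypnotoad reads
# its settings from the "hypnotoad" key of the application config
{
  hypnotoad => {
    listen            => ['http://*:8080'],   # placeholder listen address
    workers           => 15,
    clients           => 2,
    heartbeat_timeout => 50,                   # seconds, default is 20
  }
};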

Related

Apache (apache2) HTTP server stops accepting connections after a while and requires restart

The Apache server uses up all of its server processes (up to ServerLimit) and then does not accept any more connections.
Slot  PID    Stopping  Connections       Threads     Async connections
                       total  accepting  busy  idle  writing  keep-alive  closing
0     23257  yes       1      no         0     0     0        0           0
1     27271  no        0      yes        1     24    0        0           0
2     24876  yes       2      no         0     0     0        0           0
3     23117  yes       2      no         0     0     0        0           0
4     22671  yes       1      no         0     0     0        0           0
5     23994  yes       1      no         0     0     0        0           0
6     25159  yes       1      no         0     0     0        0           0
7     24604  yes       1      no         0     0     0        0           0
Sum   8      7         9                 1     24    0        0           0
The one PID that was still accepting was killed and restarted to get the status report above. Over time this PID would also end up like the rest. How do I find out why Apache stops accepting connections after a while? The connection timeout is set to 90 seconds.
Additional information:
Server Version: Apache/2.4.33 (Unix) OpenSSL/1.0.2o
Server Built: Apr 18 2018 10:56:21
Server loaded APR Version: 1.6.3
Compiled with APR Version: 1.6.3
Server loaded APU Version: 1.6.1
Compiled with APU Version: 1.6.1
Module Magic Number: 20120211:76
Hostname/port: localhost:8006
Timeouts: connection: 90 keep-alive: 5
MPM Name: event
MPM Information: Max Daemons: 8 Threaded: yes Forked: yes
Server Architecture: 64-bit
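For reference, a sketch of the event MPM settings that would produce the limits reported above; the directive names are standard Apache 2.4, but the values are assumptions inferred from the status output rather than the actual configuration:
# Sketch only: "Max Daemons: 8" and 25 threads per slot (1 busy + 24 idle) suggest
<IfModule mpm_event_module>
    ServerLimit          8
    ThreadsPerChild      25
    MaxRequestWorkers    200
</IfModule>
# Timeouts reported by the server ("connection: 90 keep-alive: 5")
Timeout          90
KeepAliveTimeout 5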

Why does the root node not exist while nodes under the root exist in a ZooKeeper instance?

I am running a ZooKeeper instance (call it IA) in standalone mode and am trying to upgrade to quorum mode. I prepared another two ZooKeeper instances (IB and IC) with empty snapshot directories, modified zoo.cfg appropriately on all three instances, created the myid files, restarted the standalone instance IA first, and then started the other two.
What happened on IB and IC is that they have the data, but the root node is not there:
Both IB and IC:
[zk: localhost:2181(CONNECTED) 14] ls /
Node does not exist: /
[zk: localhost:2181(CONNECTED) 15] ls /zookeeper
[quota]
[zk: localhost:2181(CONNECTED) 16]
Besides that, there is data loss on IB:
[zk: localhost:2181(CONNECTED) 16] get /demo/version
cZxid = 0x30000006c
ctime = Thu Dec 22 17:49:13 CST 2016
mZxid = 0x30000006c
mtime = Thu Dec 22 17:49:13 CST 2016
pZxid = 0x6003792a0
cversion = 12764622
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 135794
[zk: localhost:2181(CONNECTED) 17]
IA looks like:
[zk: localhost:2181(CONNECTED) 10] get /demo/version
cZxid = 0x30000006c
ctime = Thu Dec 22 17:49:13 CST 2016
mZxid = 0x30000006c
mtime = Thu Dec 22 17:49:13 CST 2016
pZxid = 0x6003792a0
cversion = 12312921
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 587495
[zk: localhost:2181(CONNECTED) 11]
IC looks like:
[zk: localhost:2181(CONNECTED) 10] get /demo/version
cZxid = 0x30000006c
ctime = Thu Dec 22 17:49:13 CST 2016
mZxid = 0x30000006c
mtime = Thu Dec 22 17:49:13 CST 2016
pZxid = 0x6003792a0
cversion = 12312921
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 587495
[zk: localhost:2181(CONNECTED) 11]
By the way, the server status on each instance looks fine:
IA:
[shell#kernel /data/zookeeper/zookeeper-3.4.8/bin]# ./zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /data/zookeeper/zookeeper-3.4.8/bin/../conf/zoo.cfg
Mode: follower
IB:
[shell#kernel /data/zookeeper/zookeeper-3.4.8/bin]# ./zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /data/zookeeper/zookeeper-3.4.8/bin/../conf/zoo.cfg
Mode: follower
IC:
[shell#kernel /data/zookeeper/zookeeper-3.4.8/bin]# ./zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /data/zookeeper/zookeeper-3.4.8/bin/../conf/zoo.cfg
Mode: leader
As shown above, the ZooKeeper version is 3.4.8.
Thank you in advance.
I managed to fix this issue by changing initLimit and syncLimit to 100 and 50 respectively (tickTime stays at 2000), then migrating from standalone mode to quorum mode again; after a short wait, everything was fine.
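A sketch of the corresponding zoo.cfg under that fix; only tickTime, initLimit and syncLimit come from the answer above, while the host names, ports and dataDir are placeholders:
# zoo.cfg -- same on IA, IB and IC (each instance additionally has its own myid file)
tickTime=2000
initLimit=100
syncLimit=50
dataDir=/data/zookeeper/data
clientPort=2181
server.1=ia.example.com:2888:3888
server.2=ib.example.com:2888:3888
server.3=ic.example.com:2888:3888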

Zookeeper watches :: No notification in case of child node modification

We are seeing an issue with ZooKeeper watches. We are creating a node "/newtest" and the intent is to add/modify nodes inside it. We are putting a watch on "/newtest". Our observation is that if a child is added or deleted we get the notification, but if a child is modified we do not get the notification.
Below is the output from zkCli.sh commands
========
[zk: localhost:2181(CONNECTED) 21] ls /newtest watch <=== to get the child nodes plus the watch
[1, 5, 4] <=== 1,5, 4 are the child nodes
[zk: localhost:2181(CONNECTED) 24] set /newtest/5 hello6 <=== updating the data for node “5”, no watch notification
cZxid = 0xc16
ctime = Fri Mar 11 01:03:29 UTC 2016
mZxid = 0xc78
mtime = Fri Mar 11 01:19:48 UTC 2016
pZxid = 0xc16
cversion = 0
dataVersion = 2
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 6
numChildren = 0
[zk: localhost:2181(CONNECTED) 25] create /newtest/6 hello6 <=== creating a new node
WATCHER::
Created /newtest/6
WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/newtest <== watcher notification
[zk: localhost:2181(CONNECTED) 26] ls /newtest watch <=== Again watch
[1, 6, 5, 4]
[zk: localhost:2181(CONNECTED) 27] set /newtest/6 hello6 <== updating node “6”, no notification
cZxid = 0xc79
ctime = Fri Mar 11 01:19:59 UTC 2016
mZxid = 0xc86
mtime = Fri Mar 11 01:23:18 UTC 2016
pZxid = 0xc79
cversion = 0
dataVersion = 1
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 6
numChildren = 0
========
Please suggest a solution. Zookeeper version is zookeeper.version=3.4.6--1
I would suggest not using zkCli.sh for anything other than testing and small, quick operations. A watch set with ls is a child watch: it fires when a child is created or deleted under the node, but not when an existing child's data changes, which is why the set commands above produce no notification. If you want to get notifications when a child node is modified, I would suggest writing your own watcher in Java using Apache Curator, and more specifically its TreeCache recipe.
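A minimal sketch of such a watcher, assuming Curator 2.x (which works with ZooKeeper 3.4.6) and the /newtest path from the question; the class name, connection string and retry policy are illustrative:
// Watch /newtest and everything below it with Curator's TreeCache recipe.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.TreeCache;
import org.apache.curator.framework.recipes.cache.TreeCacheEvent;
import org.apache.curator.framework.recipes.cache.TreeCacheListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class NewtestWatcher {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // TreeCache keeps a locally cached, watched view of /newtest and its children
        TreeCache cache = new TreeCache(client, "/newtest");
        cache.getListenable().addListener(new TreeCacheListener() {
            @Override
            public void childEvent(CuratorFramework c, TreeCacheEvent event) throws Exception {
                switch (event.getType()) {
                    case NODE_ADDED:    // e.g. create /newtest/6
                    case NODE_REMOVED:  // e.g. delete /newtest/6
                    case NODE_UPDATED:  // e.g. set /newtest/5 hello6 -- the case zkCli misses
                        System.out.println(event.getType() + " " + event.getData().getPath());
                        break;
                    default:
                        break;
                }
            }
        });
        cache.start();

        Thread.sleep(Long.MAX_VALUE);  // keep the process alive to receive events
    }
}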

Kafka Consumer startup error: Failed to add leader for partitions [calls,0] - NotLeaderForPartitionException

Note: this is for an older version: we are running kafka_2.9.2-0.8.1.
When attempting to run kafka-console-consumer.bat against an existing topic on Windows 7, we receive "Failed to add leader" warnings and NotLeaderForPartitionException errors.
Here is the command line
set GROUP=group1234
kafka-console-consumer.bat --group %GROUP% --zookeeper localhost:2181 --topic calls --from-beginning
Here are the errors:
[2014-05-26 15:02:12,997] WARN [group1234_S80035683-SC01-1401141732400-98745e28-leader-finder-thread], Failed to add leader for partitions [calls,0];
will retry (kafka.consumer.ConsumerFetcherManager$LeaderFinderThread:89)
kafka.common.NotLeaderForPartitionException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at java.lang.Class.newInstance(Class.java:374)
at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:73)
at kafka.consumer.SimpleConsumer.earliestOrLatestOffset(SimpleConsumer.scala:160)
at kafka.consumer.ConsumerFetcherThread.handleOffsetOutOfRange(ConsumerFetcherThread.scala:60)
at kafka.server.AbstractFetcherThread$$anonfun$addPartitions$2.apply(AbstractFetcherThread.scala:179)
at kafka.server.AbstractFetcherThread$$anonfun$addPartitions$2.apply(AbstractFetcherThread.scala:174)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:119)
at kafka.server.AbstractFetcherThread.addPartitions(AbstractFetcherThread.scala:174)
at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:86)
at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:76)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:119)
at kafka.server.AbstractFetcherManager.addFetcherForPartitions(AbstractFetcherManager.scala:76)
In the end, we are unable to consume any messages.
Note: ZooKeeper sees the consumer attempt to connect; in zkCli I can see the new group1234:
[zk: localhost:2181(CONNECTED) 2] ls2 /consumers/group1234
[owners, ids]
cZxid = 0x123
ctime = Mon May 26 15:02:12 PDT 2014
mZxid = 0x123
mtime = Mon May 26 15:02:12 PDT 2014
pZxid = 0x128
cversion = 2
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 2
And here is info on the requested calls topic in ZK:
[zk: localhost:2181(CONNECTED) 7] ls2 /brokers/topics/calls
[partitions]
cZxid = 0x18
ctime = Sat May 24 23:15:16 PDT 2014
mZxid = 0x18
mtime = Sat May 24 23:15:16 PDT 2014
pZxid = 0x1c
cversion = 1
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 36
numChildren = 1
In case there was corruption on the topic, I dropped it in ZK and then recreated it via kafka-topics.bat. Here is the new ZK output:
[zk: localhost:2181(CONNECTED) 15] ls2 /brokers/topics/calls
[]
cZxid = 0x136
ctime = Mon May 26 16:02:51 PDT 2014
mZxid = 0x136
mtime = Mon May 26 16:02:51 PDT 2014
pZxid = 0x136
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 36
numChildren = 0
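Since NotLeaderForPartitionException means the broker the consumer fetched from is not (or is no longer) the leader for [calls,0], one way to check what ZooKeeper currently records as the partition leader, using the same 0.8.1 tooling shown above (output will vary):
kafka-topics.bat --describe --zookeeper localhost:2181 --topic calls
and, from zkCli:
get /brokers/topics/calls/partitions/0/state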
A search shows that now, seven years later, this is no longer a known problem for current versions.
There have also been multiple patches that resolve errors which may or may not be the same, and it is almost certain that one of these fixed the issue.
As such, the only practical solution seems to be upgrading to newer versions (of Kafka and ZooKeeper, as well as Windows).
If the problem persists in currently relevant versions, please ask a new question, as the root cause is unlikely to be the same.

My app does not launch on my device

I am having this problem for the first time. I am running my app on a device with a distribution + Ad Hoc provisioning profile, but I am not able to launch the app the first time on the device, as I am getting this error continuously:
Mar 1 18:07:58 My-iPhon kernel[0] : launchd[276] Builtin profile: container (sandbox)
Mar 1 18:07:58 My-iPhon kernel[0] : launchd[276] Container: /private/var/mobile/Applications/E142C3CE-F6E0-4C77-ABE8-1B764DA216FE (sandbox)
Mar 1 18:07:58 My-iPhon com.apple.debugserver-189[261] : 1 +0.000000 sec [0105/0303]: error: ::task_for_pid ( target_tport = 0x0103, pid = 276, &task ) => err = 0x00000005 ((os/kern) failure) err = ::task_for_pid ( target_tport = 0x0103, pid = 276, &task ) => err = 0x00000005 ((os/kern) failure) (0x00000005)
Mar 1 18:07:58 My-iPhon mobile_house_arrest[280] : Max open files: 125
Mar 1 18:07:59 My-iPhon com.apple.debugserver-189[261] : 2 +0.417620 sec [0105/0303]: error: ::task_for_pid ( target_tport = 0x0103, pid = 276, &task ) => err = 0x00000005 ((os/kern) failure) err = ::task_for_pid ( target_tport = 0x0103, pid = 276, &task ) => err = 0x00000005 ((os/kern) failure) (0x00000005)
Mar 1 18:07:59 My-iPhon mobile_house_arrest[281] : Max open files: 125
Mar 1 18:07:59 My-iPhon mobile_house_arrest[282] : Max open files: 125
After I launch, the app crashed and in Device Console I got this error:
Mar 1 18:11:44 My-iPhon backboardd[52] : BKSendGSEvent ERROR sending event type 50: (ipc/send) invalid destination port (0x10000003)
Mar 1 18:11:44 My-iPhon com.apple.launchd[1] (UIKitApplication:com.xxx.myApp[0x3077][276]) : (UIKitApplication:com.xxxx.myapp[0x3077]) Exited: Killed: 9
Mar 1 18:11:44 My-iPhon com.apple.debugserver-189[261] : 21 +216.166834 sec [0105/0303]: RNBRunLoopLaunchInferior DNBProcessLaunch() returned error: 'failed to get the task for process 276'
Mar 1 18:11:44 My-iPhon com.apple.debugserver-189[261] : error: failed to launch process (null): failed to get the task for process 276
Mar 1 18:11:44 My-iPhon backboardd[52] : Application 'UIKitApplication:com.xxxxx.myApp[0x3077]' quit with signal 9: Killed: 9
However, the third time it runs normally!
I have tried many things:
Recreated my provisioning profile and also added an entitlements.plist for Ad Hoc distribution
Set the scheme's build configuration to Debug
Restarted my device
No matter what I try, I get this error. How can I solve it when running my app for the first time on my device? Can any of you explain this?
You can try using a development certificate; Xcode's debugger cannot attach to an app signed with a distribution or Ad Hoc profile (the get-task-allow entitlement is false), which is what the "failed to get the task for process" errors indicate.
The Ad Hoc build will work fine if you install the IPA file on your device.
Use Ad Hoc and distribution provisioning profiles when you are going to upload your app to the App Store.