Why does Puma mess up incoming requests? (timed out worker)

Problem
I have a Rails 7 app deployed on render.com, and it doesn't get a lot of traffic (maybe a burst of requests once per day). When those requests do come in, everything seems to be running fine for a moment until Puma falls over. The incoming requests are from Twilio for a voice call, and the call eventually errors with "We're sorry, an application error has occurred. Goodbye." First a worker is terminated for timing out, then a replacement worker boots, and suddenly a flood of "Completed 2XX OK" and "Kredis Connected to shared" lines comes through as if the requests had been pent up the entire time. Then, nearly a day later and without any outside requests coming in, several "Out-of-sync worker list, no 78 worker" lines appear. My Puma config file is unchanged from what ships with Rails.
Questions
Where should I go looking for the offending code? What tools could help me figure out why a Puma worker is timing out? Could it have something to do with how I'm using Redis via Kredis in my app?
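For context on that last question: the "Kredis (…ms) Connected to shared" lines in the log come from Kredis lazily opening its connection to the shared Redis config (config/redis/shared.yml) the first time a Kredis type is touched in a given process. Here is a made-up example of the kind of usage that produces them (the model and attribute names below are purely illustrative, not my actual code):

# Hypothetical illustration only -- names invented for the example.
# Each Kredis attribute connects to the "shared" Redis config on first use,
# which is what logs "Kredis (x.xms) Connected to shared" in each worker.
class PhoneCall < ApplicationRecord
  kredis_flag :in_progress          # e.g. marked while the Twilio call is live
  kredis_list :transcript_fragments # e.g. appended to as the call proceeds
end

# Typical use inside a Twilio webhook controller action:
#   call = PhoneCall.find_by!(call_sid: params[:CallSid])
#   call.in_progress.mark
#   call.transcript_fragments.append(params[:SpeechResult])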
Workaround
To get around this issue, I've started to occasionally redeploy my latest commit and that seems to help. I'm not certain, but it seems like inactivity causes Puma to become discombobulated.
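One direction I'm considering, shown below as a minimal sketch only: start from the stock Rails 7 config/puma.rb, raise worker_timeout above the 60-second default, and re-establish per-worker connections after fork. The timeout value and the on_worker_boot hook are my own guesses at a mitigation, not anything Rails ships or anything I've verified actually fixes this.

# config/puma.rb -- sketch only; my real file is the unmodified Rails default.
# Assumption: the workers are missing their check-in on a mostly idle instance,
# so this raises the timeout and reconnects per-worker resources after fork.

max_threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
min_threads_count = ENV.fetch("RAILS_MIN_THREADS") { max_threads_count }
threads min_threads_count, max_threads_count

port        ENV.fetch("PORT") { 3000 }
environment ENV.fetch("RAILS_ENV") { "development" }

# Cluster mode is clearly active (the log below shows "Worker 3 (PID: 134) booted"),
# so WEB_CONCURRENCY or an equivalent setting must be in effect on Render.
workers ENV.fetch("WEB_CONCURRENCY") { 2 }
preload_app!

# Workers must check in with the master process; the 60-second default matches
# the "failed to check in within 60 seconds" lines in the log output below.
worker_timeout 120

on_worker_boot do
  # Re-establish per-worker connections after fork so workers don't reuse
  # sockets inherited from the master process. Kredis opens its Redis
  # connections lazily per process, so it should normally reconnect on first
  # use anyway; ActiveRecord is reconnected here as the usual precaution.
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
end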
Log output
Here's what the offending lines in my log file look like:
... a few requests that complete 200 OK ...
Sep 13 05:53:15 PM [70] ! Terminating timed out worker (worker failed to check in within 60 seconds): 90
... a couple more normal log lines and then ...
Sep 13 05:53:16 PM [70] - Worker 3 (PID: 134) booted in 0.04s, phase: 0
... some more normal log lines and then ...
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.593713 #74] INFO -- : [595ad8e5-fa3a-45a3-8c5b-a506e6c94b69] Completed 204 No Content in 110ms (Allocations: 13681)
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.425579 #86] INFO -- : [f1a64c71-8048-4032-8bf6-2e68aa1fa7ba] Completed 204 No Content in 2ms (Allocations: 541)
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.595408 #86] INFO -- : [68d19bd9-2286-4f75-a982-5fa3e864d6ac] Completed 200 OK in 105ms (Views: 0.2ms | Allocations: 1592)
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.614951 #76] INFO -- : [e883350f-9a26-4d3d-8f1c-4853285aa71a] Kredis (10.6ms) Connected to shared
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.615787 #76] INFO -- : [fbcd8730-1514-4af5-9332-0bdf0c89fc2d] Kredis (17.2ms) Connected to shared
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.705926 #86] INFO -- : [1f67a177-38f2-4bf5-bd03-1c59a3edb3a4] Kredis (224.1ms) Connected to shared
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.958386 #76] INFO -- : [e883350f-9a26-4d3d-8f1c-4853285aa71a] Completed 200 OK in 472ms (ActiveRecord: 213.1ms | Allocations: 32402)
Sep 13 05:53:17 PM I, [2022-09-13T22:53:17.034211 #86] INFO -- : [1f67a177-38f2-4bf5-bd03-1c59a3edb3a4] Completed 200 OK in 606ms (ActiveRecord: 256.6ms | Allocations: 17832)
Sep 13 05:53:17 PM I, [2022-09-13T22:53:17.136231 #76] INFO -- : [fbcd8730-1514-4af5-9332-0bdf0c89fc2d] Completed 200 OK in 654ms (ActiveRecord: 88.0ms | Allocations: 37385)
... literally a day later without any other activity ...
Sep 14 05:02:29 AM [69] ! Terminating timed out worker (worker failed to check in within 60 seconds): 78
Sep 14 05:02:31 AM [69] ! Out-of-sync worker list, no 78 worker
Sep 14 05:02:31 AM [69] ! Out-of-sync worker list, no 78 worker
Sep 14 05:02:31 AM [69] ! Out-of-sync worker list, no 78 worker
Sep 14 05:02:31 AM [69] ! Out-of-sync worker list, no 78 worker
Sep 14 05:02:31 AM [69] ! Out-of-sync worker list, no 78 worker
Sep 14 05:02:31 AM [69] ! Out-of-sync worker list, no 78 worker
Sep 14 05:02:31 AM [69] - Worker 1 (PID: 132) booted in 0.03s, phase: 0

Related

Ceph Monitor out of quorum

We're experiencing a problem with one of our Ceph monitors. The cluster uses 3 monitors and they are all up and running. They can communicate with each other and give a sensible ceph -s output. However, quorum shows the second monitor as down. The ceph -s output from the supposedly down monitor is below:
  cluster:
    id:     bb1ab46a-d282-4530-bf5c-021e9c940958
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            noout flag(s) set
            9 large omap objects
            47 pgs not deep-scrubbed in time
            application not enabled on 2 pool(s)
            1/3 mons down, quorum mon1,mon3

  services:
    mon:        3 daemons, quorum mon1,mon3 (age 3d), out of quorum: mon2
    mgr:        mon1(active, since 3d)
    mds:        filesystem:1 {0=mon1=up:active}
    osd:        77 osds: 77 up (since 3d), 77 in (since 2w)
                flags noout
    rbd-mirror: 1 daemon active (12512649)
    rgw:        1 daemon active (mon1)

  data:
    pools:   13 pools, 1500 pgs
    objects: 65.36M objects, 23 TiB
    usage:   85 TiB used, 701 TiB / 785 TiB avail
    pgs:     1500 active+clean

  io:
    client: 806 KiB/s wr, 0 op/s rd, 52 op/s wr
systemctl status ceph-mon@2.service shows:
ceph-mon@2.service - Ceph cluster monitor daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Tue 2020-12-08 12:12:58 +03; 28s ago
  Process: 2681 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
 Main PID: 2681 (code=exited, status=1/FAILURE)
Dec 08 12:12:48 mon2 systemd[1]: Unit ceph-mon@2.service entered failed state.
Dec 08 12:12:48 mon2 systemd[1]: ceph-mon@2.service failed.
Dec 08 12:12:58 mon2 systemd[1]: ceph-mon@2.service holdoff time over, scheduling restart.
Dec 08 12:12:58 mon2 systemd[1]: Stopped Ceph cluster monitor daemon.
Dec 08 12:12:58 mon2 systemd[1]: start request repeated too quickly for ceph-mon@2.service
Dec 08 12:12:58 mon2 systemd[1]: Failed to start Ceph cluster monitor daemon.
Dec 08 12:12:58 mon2 systemd[1]: Unit ceph-mon@2.service entered failed state.
Dec 08 12:12:58 mon2 systemd[1]: ceph-mon@2.service failed.
Restarting, stopping/starting, and enabling/disabling the monitor daemon did not work. The docs mention the monitor asok file in /var/run/ceph, and I don't have it in that directory, yet the other monitors have their asok files right in place. And now I'm in a state where I can't even stop the monitor daemon on the second monitor; it just stays in the failed state. There are no monitor logs in /var/log/ceph. What am I supposed to do? I don't have much experience with Ceph, so I don't want to change things without being absolutely sure, in order to avoid messing up the cluster.
Try to start the service manually on MON2 with:
/usr/bin/ceph-mon -f --cluster ceph --id 2 --setuser ceph --setgroup ceph

LSF job states for a given job

Let's say my job was running for some time, went into a suspended state due to machine overloading, returned to running after some time, and then completed.
So the states this job went through were RUNNING -> SUSPEND -> RUNNING.
How can I get all the states a given job went through?
Use bjobs -l if the job hasn't been cleaned from the system yet.
Use bhist -l otherwise. You might need -n, depending on how old the job is.
Here's an example of bhist -l output when a job was suspended and later resumed because the system load temporarily exceeded the configured threshold.
$ bhist -l 1168
Job <1168>, User <mclosson>, Project <default>, Command <sleep 10000>
Fri Jan 20 15:08:40: Submitted from host <hostA>, to Queue <normal>, CWD <$HOME>, Specified Hosts <hostA>;
Fri Jan 20 15:08:41: Dispatched 1 Task(s) on Host(s) <hostA>, Allocated 1 Slot(s) on Host(s) <hostA>, Effective RES_REQ <select[type == any] order[r15s:pg] >;
Fri Jan 20 15:08:41: Starting (Pid 30234);
Fri Jan 20 15:08:41: Running with execution home </home/mclosson>, Execution CWD </home/mclosson>, Execution Pid <30234>;
Fri Jan 20 16:19:22: Suspended: Host load exceeded threshold: 1-minute CPU run queue length (r1m)
Fri Jan 20 16:21:43: Running;

Summary of time in seconds spent in various states by Fri Jan 20 16:22:09
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  1        0        4267     0        141      0        4409
At 16:19:22 the job was suspended because r1m exceeded the threshold. Later, at 16:21:43, the job resumed.

Kubernetes scheduler: watch of *api.Pod ended with error: unexpected end of JSON input

Yesterday the service worked fine, but today when I checked the service's state I saw:
Mar 11 14:03:16 coreos-1 systemd[1]: scheduler.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 11 14:03:16 coreos-1 systemd[1]: Unit scheduler.service entered failed state.
Mar 11 14:03:16 coreos-1 systemd[1]: scheduler.service failed.
Mar 11 14:03:16 coreos-1 systemd[1]: Starting Kubernetes Scheduler...
Mar 11 14:03:16 coreos-1 systemd[1]: Started Kubernetes Scheduler.
Mar 11 14:08:16 coreos-1 kube-scheduler[4659]: E0311 14:08:16.808349 4659 reflector.go:118] watch of *api.Service ended with error: very short watch
Mar 11 14:08:16 coreos-1 kube-scheduler[4659]: E0311 14:08:16.811434 4659 reflector.go:118] watch of *api.Pod ended with error: unexpected end of JSON input
Mar 11 14:08:16 coreos-1 kube-scheduler[4659]: E0311 14:08:16.847595 4659 reflector.go:118] watch of *api.Pod ended with error: unexpected end of JSON input
It's really confusing, because etcd, flannel, and the apiserver all work fine.
The only strange logs are from etcd:
Mar 11 20:22:21 coreos-1 etcd[472]: [etcd] Mar 11 20:22:21.572 INFO | aba44aa0670b4b2e8437c03a0286d779: warning: heartbeat time out peer="6f4934635b6b4291bf29763add9bf4c7" missed=1 backoff="2s"
Mar 11 20:22:48 coreos-1 etcd[472]: [etcd] Mar 11 20:22:48.269 INFO | aba44aa0670b4b2e8437c03a0286d779: warning: heartbeat time out peer="6f4934635b6b4291bf29763add9bf4c7" missed=1 backoff="2s"
Mar 11 20:48:12 coreos-1 etcd[472]: [etcd] Mar 11 20:48:12.070 INFO | aba44aa0670b4b2e8437c03a0286d779: warning: heartbeat time out peer="6f4934635b6b4291bf29763add9bf4c7" missed=1 backoff="2s"
So I'm really stuck and don't know what's wrong. How can I resolve this problem? Or how can I check detailed logs for the scheduler?
journalctl gives me the same logs as the systemd status output.
Please see: https://github.com/GoogleCloudPlatform/kubernetes/issues/5311
It means apiserver accepted the watch request but then immediately terminated the connection.
If you see it occasionally, it implies a transient error and is not alarming. If you see it repeatedly, it implies that apiserver (or etcd) is sick.
Is something actually not working for you?

Couchbase: 20k items stuck in Tap Queue

We are currently evaluating Couchbase, primarily as a memcached replacement. Our setup looks like this:
php -> localhost moxi -> couchbase bucket (Total bucket size = 10240 MB (2048 MB x 5 nodes with replica count 1))
The Servers have 16GB RAM and are SSD backed.
We were inserting at about 400 ops/s and had no problems for a few days, until we reached about 13 million items. Then we found out that we had forgotten to implement the delete function in our test setup, and a lot of keys had no expiration set.
To start over, we flushed the bucket through the web interface. This is where our problems began.
We started to see temp OOMs and back-offs, and the TAP queue was filled with 20k items. The drain and fill rates were nearly the same (see the attached screenshot).
What also caught our eye was that node 4 had only 220k items, where every other node had around 1.39M.
Somehow it looks like replication messed something up, but I'm relatively new to Couchbase. Any hints or suggestions?
The problem was solved for a short time after removing the failing node from the cluster.
So now, with four nodes left in the cluster, after some hours the same thing happened again with another node. We tried setting the now-failing node into failover state. That fixed the problem again, but after re-adding the node, the same phenomenon happened on that node once more.
Other things we realized are:
* Three out of four nodes have thousands of items in their TAP replication queue, but one ("the failing one") has 0.
* Also, three out of four nodes have a back-off rate of around 400, but one ("the failing one") has 0.
* Only the failing one has a massive amount of "Temp OOMs per second", but the other three have 0.
The phenomenon seems to disappear if we lower the load on the servers by disabling the Couchbase writes for one of the two software projects writing to Couchbase.
But if we enable the writes again, after around 10 minutes we see this in memcached.log on the failing node:
Tue Dec 17 12:29:05.010547 CET 3: (CENSORED) Received error[86] from mccouch for unknown
Tue Dec 17 12:29:05.010576 CET 3: (CENSORED) Retry notify CouchDB of update, vbucket=277 rev=522
Tue Dec 17 12:29:08.748103 CET 3: (CENSORED) Received error[86] from mccouch for unknown
Tue Dec 17 12:29:08.748257 CET 3: (CENSORED) Retry notify CouchDB of update, vbucket=321 rev=948
Tue Dec 17 12:40:17.354448 CET 3: (CENSORED) Received error[86] from mccouch for unknown
Tue Dec 17 12:40:17.354476 CET 3: (CENSORED) Retry notify CouchDB of update, vbucket=303 rev=491
This error then happens around 5 times within four hours:
Tue Dec 17 14:19:32.145071 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
And after those four hours it starts spamming this constantly (maybe because the load increased heavily; in the evening our page generates much more load than in the morning or at noon), together with the "error from mccouch" messages:
Tue Dec 17 16:42:30.875343 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:42:36.493317 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:43:25.239876 CET 3: (CENSORED) Received error[86] from mccouch for unknown
Tue Dec 17 16:43:25.240052 CET 3: (CENSORED) Retry notify CouchDB of update, vbucket=296 rev=483
Tue Dec 17 16:43:25.903997 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:43:31.906178 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:43:36.913045 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:43:42.919114 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:43:48.920354 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:43:54.924017 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:44:00.928572 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
We have no clue what is happening here, or why this failing node seems to reject every replication and keeps throwing this error.
Do you have any idea?
Thanks for all your help and greetings from Cologne,
Andy!
Seeing as you just want to delete all items in the bucket, have you tried simply deleting and re-creating the bucket?
This will be much faster than flush, as flush actually needs to send a delete request for every document in the bucket.
I can't find it in the docs at the moment, but I think Flush is not really recommended with the latest versions.
You don't say what your operating system is. If it's Linux, try checking the maximum number of open file descriptors (sockets count against this limit) for the user running Couchbase. Check the file /etc/security/limits.conf.
The command to check this on Linux is: ulimit -Hn.
Hope that helps.
Daniel
I think you should try these settings:
http://docs.couchbase.com/couchbase-manual-2.1/#specifying-backoff-for-replication

How to run puppetmaster using Apache/Passenger

I'm running Puppet v2.7.14 on CentOS 6, using Apache/Passenger instead of WEBrick. I was told that the puppetmaster service does not need to be running (hence: chkconfig puppetmaster off) when using httpd and Passenger, but in my case, if I don't start puppetmasterd manually, none of the agents can connect to the master. I can start httpd just fine and Passenger seems to start okay as well. This is my Apache configuration file:
# /etc/httpd/conf.d/passenger.conf
LoadModule passenger_module modules/mod_passenger.so
<IfModule mod_passenger.c>
PassengerRoot /usr/lib/ruby/gems/1.8/gems/passenger-3.0.12
PassengerRuby /usr/bin/ruby
#PassengerTempDir /var/run/rubygem-passenger
PassengerHighPerformance on
PassengerUseGlobalQueue on
PassengerMaxPoolSize 15
PassengerPoolIdleTime 150
PassengerMaxRequests 10000
PassengerStatThrottleRate 120
RackAutoDetect Off
RailsAutoDetect Off
</IfModule>
Upon restart, I see these lines in the httpd error_log:
[Sat Jun 09 04:06:47 2012] [notice] caught SIGTERM, shutting down
[Sat Jun 09 09:06:51 2012] [notice] suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Sat Jun 09 09:06:51 2012] [notice] Digest: generating secret for digest authentication ...
[Sat Jun 09 09:06:51 2012] [notice] Digest: done
[Sat Jun 09 09:06:51 2012] [notice] Apache/2.2.15 (Unix) DAV/2 Phusion_Passenger/3.0.12 mod_ssl/2.2.15 OpenSSL/1.0.0-fips configured -- resuming normal operations
And passenger-status prints this info on the screen:
----------- General information -----------
max = 15
count = 0
active = 0
inactive = 0
Waiting on global queue: 0
----------- Application groups -----------
But still, as I said, none of my agents can actually talk to the master until I start puppetmasterd manually. Does anyone know what I'm still missing? Or is this the way it's supposed to be? Cheers!!
It sounds like you may be missing an /etc/httpd/conf.d/puppetmaster.conf file that's based on https://github.com/puppetlabs/puppet/blob/master/ext/rack/files/apache2.conf
Without something like this, you're missing the glue that maps port 8140 to the Rack-based puppetmasterd.
See http://docs.puppetlabs.com/guides/passenger.html
https://github.com/puppetlabs/puppet/tree/master/ext/rack
http://www.modrails.com/documentation/Users%20guide%20Apache.html#_deploying_a_rack_based_ruby_application_including_rails_gt_3
After a few days of banging my head against it, it's now running. The main problem was with the port number: the puppetmaster was running on a different port than the one the puppet agent was trying to communicate on.
Another thing: RackAutoDetect On must be processed before the dashboard vhost file, so I renamed the Passenger vhost file to 00_passenger.conf to make sure it loads first. After that, I got the puppetmaster running under Apache/Passenger. Cheers!!