I had a resource that kept triggering every few minutes and it created a queue of hundreds of pending jobs. I want to clear off all the old resource versions so that it stops trying to start new jobs. How can I do that without deleting and recreating the pipeline or impacting any other active pipelines?
If I understand correctly you have something like:
> fly -t vm builds
id pipeline/job build status
25 queue-up/queue-up 25 started
24 queue-up/queue-up 24 started
23 queue-up/queue-up 23 started
22 queue-up/queue-up 22 started
21 queue-up/queue-up 21 started
20 queue-up/queue-up 20 started
19 queue-up/queue-up 19 started
18 queue-up/queue-up 18 started
17 queue-up/queue-up 17 succeeded
where some of the builds may be pending instead of started.
There is no way to clear off the old resource versions without deleting the pipeline. On the other hand, you can always abort all or some of the builds:
> for i in (seq 24 -1 18); fly -t vm abort-build --build $i; end
build successfully aborted
build successfully aborted
build successfully aborted
build successfully aborted
build successfully aborted
build successfully aborted
build successfully aborted
> fly -t vm builds
id pipeline/job build status
25 queue-up/queue-up 25 started
24 queue-up/queue-up 24 aborted
23 queue-up/queue-up 23 aborted
22 queue-up/queue-up 22 aborted
21 queue-up/queue-up 21 aborted
20 queue-up/queue-up 20 aborted
19 queue-up/queue-up 19 aborted
18 queue-up/queue-up 18 aborted
17 queue-up/queue-up 17 succeeded
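If the goal is also to stop new builds from being scheduled while you clean up, pausing just that job (or the whole pipeline) is worth a try, and it leaves every other pipeline untouched. A small sketch, assuming the job name from the output above:
> fly -t vm pause-job --job queue-up/queue-up
> fly -t vm unpause-job --job queue-up/queue-up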
Errors in terminal
When I try to deploy my project to Linux, I get errors in the terminal after running the following commands:
Restart the journal service by running the following command: systemctl restart systemd-journald
Start the kestrel service by running the following command: sudo systemctl start skinet-web.service
Check it is started by running: netstat -ntpl
Check the journal by running: journalctl -u skinet-web.service --since "5 min ago"
How can I fix these errors?
I am using a MacBook Pro 14 with macOS Ventura.
Here is my project on GitHub:
https://github.com/SashaMaksyutenko/skinet
Here are the errors from the terminal:
root@skinet-demo:~# journalctl -u skinet-web.service --since "5 min ago"
Feb 19 22:47:03 skinet-demo systemd[1]: skinet-web.service: Scheduled restart job, restart counter is at 30581.
Feb 19 22:47:03 skinet-demo systemd[1]: Stopped Kestrel service running on Ubuntu 20.04.
Feb 19 22:47:03 skinet-demo systemd[1]: Started Kestrel service running on Ubuntu 20.04.
Feb 19 22:47:04 skinet-demo skinet[475272]: Unhandled exception. System.IO.DirectoryNotFoundException: /var/skinet/Content/
Feb 19 22:47:04 skinet-demo skinet[475272]: at Microsoft.Extensions.FileProviders.PhysicalFileProvider..ctor(String root, ExclusionFilters filters)
Feb 19 22:47:04 skinet-demo skinet[475272]: at Program.<Main>$(String[] args) in /Users/sashamaksyutenkoadmin/Documents/CSharp_work/Angular_eCommerce/skinet/API/Program.cs:line 21
Feb 19 22:47:04 skinet-demo skinet[475272]: at Program.<Main>(String[] args)
Feb 19 22:47:04 skinet-demo systemd[1]: skinet-web.service: Main process exited, code=dumped, status=6/ABRT
Feb 19 22:47:04 skinet-demo systemd[1]: skinet-web.service: Failed with result 'core-dump'.
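Based on that DirectoryNotFoundException, a likely first step (just a sketch, assuming the app expects to serve static files from /var/skinet/Content and that the service runs as www-data, which may differ in your unit file) is to create the missing directory on the server and restart:
sudo mkdir -p /var/skinet/Content
sudo chown -R www-data:www-data /var/skinet/Content    # use the User= from your .service file
sudo systemctl restart skinet-web.service
journalctl -u skinet-web.service --since "5 min ago"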
Problem
I have a Rails 7 app deployed on render.com, and it doesn't get a lot of traffic (maybe once per day). However, when a few requests do come in, everything seems to be running fine for a moment until Puma seems to barf. The incoming requests are from Twilio for a voice call, and the call eventually errors with "We're sorry, an application error has occurred. Goodbye". It seems like something about a "timed out" worker happens, then the worker boots, and whammo! a flood of "Completed 2XX OK" and "Kredis Connected to shared" lines come crashing through like they've been pent up the entire time. THEN, nearly a day later without any outside requests coming in, several log lines about "Out-of-sync worker list, no 78 worker" come through. My Puma config file is unchanged from what ships with Rails.
Questions
Where might I go look for the offending code? What tools could help me decipher why a Puma worker is timing out? Could it have something to do with how I'm using Redis via Kredis in my app?
Workaround
To get around this issue, I've started to occasionally redeploy my latest commit and that seems to help. I'm not certain, but it seems like inactivity causes Puma to become discombobulated.
Log output
Here's what the offending lines in my log file look like:
... a few requests that complete 200 OK ...
Sep 13 05:53:15 PM [70] ! Terminating timed out worker (worker failed to check in within 60 seconds): 90
... a couple more normal log lines and then ...
Sep 13 05:53:16 PM [70] - Worker 3 (PID: 134) booted in 0.04s, phase: 0
... some more normal log lines and then ...
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.593713 #74] INFO -- : [595ad8e5-fa3a-45a3-8c5b-a506e6c94b69] Completed 204 No Content in 110ms (Allocations: 13681)
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.425579 #86] INFO -- : [f1a64c71-8048-4032-8bf6-2e68aa1fa7ba] Completed 204 No Content in 2ms (Allocations: 541)
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.595408 #86] INFO -- : [68d19bd9-2286-4f75-a982-5fa3e864d6ac] Completed 200 OK in 105ms (Views: 0.2ms | Allocations: 1592)
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.614951 #76] INFO -- : [e883350f-9a26-4d3d-8f1c-4853285aa71a] Kredis (10.6ms) Connected to shared
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.615787 #76] INFO -- : [fbcd8730-1514-4af5-9332-0bdf0c89fc2d] Kredis (17.2ms) Connected to shared
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.705926 #86] INFO -- : [1f67a177-38f2-4bf5-bd03-1c59a3edb3a4] Kredis (224.1ms) Connected to shared
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.958386 #76] INFO -- : [e883350f-9a26-4d3d-8f1c-4853285aa71a] Completed 200 OK in 472ms (ActiveRecord: 213.1ms | Allocations: 32402)
Sep 13 05:53:17 PM I, [2022-09-13T22:53:17.034211 #86] INFO -- : [1f67a177-38f2-4bf5-bd03-1c59a3edb3a4] Completed 200 OK in 606ms (ActiveRecord: 256.6ms | Allocations: 17832)
Sep 13 05:53:17 PM I, [2022-09-13T22:53:17.136231 #76] INFO -- : [fbcd8730-1514-4af5-9332-0bdf0c89fc2d] Completed 200 OK in 654ms (ActiveRecord: 88.0ms | Allocations: 37385)
... literally a day later without any other activity ...
Sep 14 05:02:29 AM [69] ! Terminating timed out worker (worker failed to check in within 60 seconds): 78
Sep 14 05:02:31 AM [69] ! Out-of-sync worker list, no 78 worker
Sep 14 05:02:31 AM [69] ! Out-of-sync worker list, no 78 worker
Sep 14 05:02:31 AM [69] ! Out-of-sync worker list, no 78 worker
Sep 14 05:02:31 AM [69] ! Out-of-sync worker list, no 78 worker
Sep 14 05:02:31 AM [69] ! Out-of-sync worker list, no 78 worker
Sep 14 05:02:31 AM [69] ! Out-of-sync worker list, no 78 worker
Sep 14 05:02:31 AM [69] - Worker 1 (PID: 132) booted in 0.03s, phase: 0
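One thing that might narrow it down, since the stalls line up with Kredis activity: from a shell on the running instance, check whether the Redis that Kredis points at answers promptly. A rough sketch, assuming redis-cli is available there and the connection URL lives in REDIS_URL (both assumptions; adjust to your setup):
redis-cli -u "$REDIS_URL" ping        # should print PONG immediately
redis-cli -u "$REDIS_URL" --latency   # sustained high numbers point at the Redis hop
If Redis looks healthy, note that the 60-second figure in the log line is Puma's worker check-in timeout (worker_timeout in config/puma.rb), so the next suspect is whatever keeps a worker busy or blocked for that long during one of those requests.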
I'm a bit new to this topic, and it would be great if someone could provide some pointers here:
I have been trying to set up NFS on an in-house cluster. Everything from the guide I followed worked except the common file-sharing part.
Installed nfs-kernel-server on the master node
Installed nfs-common on the slave/worker nodes
Configured /etc/exports on the master node as required
Configured /etc/fstab on the worker nodes with the mount location
But I am unable to start nfs-server on the master node, and the errors point to failing dependencies.
If needed I can provide the output from journalctl -xe, but the main error says "Failed to mount NFSD configuration filesystem."
Any pointer or solution would be greatly appreciated.
Output from journalctl -xe:
The unit proc-fs-nfsd.mount has entered the 'failed' state with result 'exit-code'.
Dec 13 10:06:39 ccc-001 systemd[1]: Failed to mount NFSD configuration filesystem.
-- Subject: A start job for unit proc-fs-nfsd.mount has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- A start job for unit proc-fs-nfsd.mount has finished with a failure.
-- The job identifier is 1032 and the job result is failed.
Dec 13 10:06:39 ccc-001 systemd[1]: Dependency failed for NFS Mount Daemon.
-- Subject: A start job for unit nfs-mountd.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- A start job for unit nfs-mountd.service has finished with a failure.
-- The job identifier is 1034 and the job result is dependency.
Dec 13 10:06:39 ccc-001 systemd[1]: Dependency failed for NFS server and services.
-- Subject: A start job for unit nfs-server.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- A start job for unit nfs-server.service has finished with a failure.
-- The job identifier is 1027 and the job result is dependency.
Dec 13 10:06:39 ccc-001 systemd[1]: Dependency failed for NFSv4 ID-name mapping service.
-- Subject: A start job for unit nfs-idmapd.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- A start job for unit nfs-idmapd.service has finished with a failure.
-- The job identifier is 1037 and the job result is dependency.
Dec 13 10:06:39 ccc-001 systemd[1]: nfs-idmapd.service: Job nfs-idmapd.service/start failed with result 'dependency'.
Dec 13 10:06:39 ccc-001 systemd[1]: nfs-server.service: Job nfs-server.service/start failed with result 'dependency'.
Dec 13 10:06:39 ccc-001 systemd[1]: nfs-mountd.service: Job nfs-mountd.service/start failed with result 'dependency'.
Dec 13 10:06:39 ccc-001 systemd[1]: Condition check resulted in RPC security service for NFS server being skipped.
-- Subject: A start job for unit rpc-svcgssd.service has finished successfully
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- A start job for unit rpc-svcgssd.service has finished successfully.
-- The job identifier is 1042.
Dec 13 10:06:39 ccc-001 systemd[1]: Condition check resulted in RPC security service for NFS client and server being skipped.
-- Subject: A start job for unit rpc-gssd.service has finished successfully
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- A start job for unit rpc-gssd.service has finished successfully.
-- The job identifier is 1041.
Dec 13 10:06:39 ccc-001 sudo[4469]: pam_unix(sudo:session): session closed for user root
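For what it's worth, "Failed to mount NFSD configuration filesystem" means the nfsd virtual filesystem could not be mounted on /proc/fs/nfsd, and a common cause is that the running kernel has no nfsd module available or loaded (stripped-down or cloud kernels sometimes lack it). A quick, hedged set of checks on the master node:
lsmod | grep nfsd                      # is the module already loaded?
sudo modprobe nfsd                     # errors out if the kernel has no nfsd module
sudo mount -t nfsd nfsd /proc/fs/nfsd  # try the failing mount by hand
sudo systemctl restart nfs-server
journalctl -xeu nfs-server.service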
Let's say my job was running for some time, went into a suspended state because the machine was overloaded, became running again after some time, and then completed.
So the states this job went through were RUNNING -> SUSPEND -> RUNNING.
How can I get all the states a given job went through?
bjobs -l if the job hasn't been cleaned from the system yet; otherwise bhist -l. You might need -n, depending on how old the job is.
Here's an example of bhist -l output when a job was suspended and later resumed because the system load temporarily exceeded the configured threshold.
$ bhist -l 1168
Job <1168>, User <mclosson>, Project <default>, Command <sleep 10000>
Fri Jan 20 15:08:40: Submitted from host <hostA>, to
Queue <normal>, CWD <$HOME>, Specified Hosts <hostA>;
Fri Jan 20 15:08:41: Dispatched 1 Task(s) on Host(s) <hostA>, Allocated 1 Slot(
s) on Host(s) <hostA>, Effective RES_REQ <select[type == any] or
der[r15s:pg] >;
Fri Jan 20 15:08:41: Starting (Pid 30234);
Fri Jan 20 15:08:41: Running with execution home </home/mclosson>, Execution CW
D </home/mclosson>, Execution Pid <30234>;
Fri Jan 20 16:19:22: Suspended: Host load exceeded threshold: 1-minute CPU ru
n queue length (r1m)
Fri Jan 20 16:21:43: Running;
Summary of time in seconds spent in various states by Fri Jan 20 16:22:09
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
1 0 4267 0 141 0 4409
At 16:19:22 the job was suspended because r1m exceeded the threshold. Later, at 16:21:43, the job resumed.
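If the job is old enough to have rolled out of the current event log, telling bhist to search every available log file should still find it; for example, with the job ID from above:
$ bhist -n 0 -l 1168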
Yesterday the service worked fine, but today when I checked the service's state I saw:
Mar 11 14:03:16 coreos-1 systemd[1]: scheduler.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 11 14:03:16 coreos-1 systemd[1]: Unit scheduler.service entered failed state.
Mar 11 14:03:16 coreos-1 systemd[1]: scheduler.service failed.
Mar 11 14:03:16 coreos-1 systemd[1]: Starting Kubernetes Scheduler...
Mar 11 14:03:16 coreos-1 systemd[1]: Started Kubernetes Scheduler.
Mar 11 14:08:16 coreos-1 kube-scheduler[4659]: E0311 14:08:16.808349 4659 reflector.go:118] watch of *api.Service ended with error: very short watch
Mar 11 14:08:16 coreos-1 kube-scheduler[4659]: E0311 14:08:16.811434 4659 reflector.go:118] watch of *api.Pod ended with error: unexpected end of JSON input
Mar 11 14:08:16 coreos-1 kube-scheduler[4659]: E0311 14:08:16.847595 4659 reflector.go:118] watch of *api.Pod ended with error: unexpected end of JSON input
It's really confusing because etcd, flannel and the apiserver work fine.
The only strange logs are from etcd:
Mar 11 20:22:21 coreos-1 etcd[472]: [etcd] Mar 11 20:22:21.572 INFO | aba44aa0670b4b2e8437c03a0286d779: warning: heartbeat time out peer="6f4934635b6b4291bf29763add9bf4c7" missed=1 backoff="2s"
Mar 11 20:22:48 coreos-1 etcd[472]: [etcd] Mar 11 20:22:48.269 INFO | aba44aa0670b4b2e8437c03a0286d779: warning: heartbeat time out peer="6f4934635b6b4291bf29763add9bf4c7" missed=1 backoff="2s"
Mar 11 20:48:12 coreos-1 etcd[472]: [etcd] Mar 11 20:48:12.070 INFO | aba44aa0670b4b2e8437c03a0286d779: warning: heartbeat time out peer="6f4934635b6b4291bf29763add9bf4c7" missed=1 backoff="2s"
So I'm really stuck and don't know what's wrong. How can I resolve this problem? Or how can I check detailed logs for the scheduler?
journalctl gives me the same logs as systemctl status.
Please see: https://github.com/GoogleCloudPlatform/kubernetes/issues/5311
It means apiserver accepted the watch request but then immediately terminated the connection.
If you see it occasionally, it implies a transient error and is not alarming. If you see it repeatedly, it implies that apiserver (or etcd) is sick.
Is something actually not working for you?
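If you want more detail from the scheduler side while you watch for the error to recur, following its journal and raising kube-scheduler's log verbosity usually helps; a rough sketch (the unit name is taken from your logs, the verbosity flag is standard glog-style --v):
journalctl -u scheduler.service -f
# or add --v=4 to the kube-scheduler flags in scheduler.service, then:
sudo systemctl daemon-reload && sudo systemctl restart scheduler.service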