CentOS 8 freeze on gcloud. How to debug?

I run a Google Compute Engine (GCE) instance with CentOS 8 on Google Cloud Platform. The problem is that this instance freezes at random times, and the only way to make it work again is to stop the instance and start it back up. I tried recreating the instance, but that doesn't help.
Here are the screenshots from the GCE monitoring page (freeze from 21:04 Feb 9 to 09:29 Feb 10): CPU utilization + network bytes, RAM + disk space utilization, and network packets.
It's a 2 vCPU instance with 2 GB of RAM. It runs two Docker containers (one for Apache, one for FastAPI) and four cron jobs every minute. When the machine freezes I can't SSH into it, I can't access any web page, and none of the four cron jobs sends any data to the logging server.
Any ideas how I can debug this issue?
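One possible first step (a sketch, assuming gcloud is configured locally; the zone below is a placeholder): pull the VM's serial console output right after a freeze, since kernel panic or OOM-killer messages often only show up there when the guest can no longer write to disk.
# dump whatever the guest kernel last wrote to the serial console
gcloud compute instances get-serial-port-output instance-prod-2 \
    --zone=us-central1-a --port=1
# optionally keep serial output in Cloud Logging across restarts
gcloud compute instances add-metadata instance-prod-2 \
    --zone=us-central1-a --metadata=serial-port-logging-enable=true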
Here is the /var/log/messages dump in case it's useful:
Feb 9 21:03:07 instance-prod-2 systemd[285938]: Stopped target Basic System.
Feb 9 21:03:07 instance-prod-2 systemd[285938]: Stopped target Paths.
Feb 9 21:03:07 instance-prod-2 systemd[285938]: Stopped target Timers.
Feb 9 21:03:07 instance-prod-2 systemd[285938]: grub-boot-success.timer: Succeeded.
Feb 9 21:03:07 instance-prod-2 systemd[285938]: Stopped Mark boot as successful after the user session has run 2 minutes.
Feb 9 21:03:07 instance-prod-2 systemd[285938]: Stopped target Sockets.
Feb 9 21:03:07 instance-prod-2 systemd[285938]: dbus.socket: Succeeded.
Feb 9 21:03:07 instance-prod-2 systemd[285938]: Closed D-Bus User Message Bus Socket.
Feb 9 21:03:07 instance-prod-2 systemd[285938]: Reached target Shutdown.
Feb 9 21:03:07 instance-prod-2 systemd[285938]: Starting Exit the Session...
Feb 9 21:03:07 instance-prod-2 systemd[1]: user@1000.service: Succeeded.
Feb 9 21:03:07 instance-prod-2 systemd[1]: Stopped User Manager for UID 1000.
Feb 9 21:03:07 instance-prod-2 systemd[1]: Stopping /run/user/1000 mount wrapper...
Feb 9 21:03:07 instance-prod-2 systemd[1]: Removed slice User Slice of UID 1000.
Feb 9 21:03:07 instance-prod-2 systemd[1]: run-user-1000.mount: Succeeded.
Feb 9 21:03:07 instance-prod-2 systemd[1]: user-runtime-dir@1000.service: Succeeded.
Feb 9 21:03:07 instance-prod-2 systemd[1]: Stopped /run/user/1000 mount wrapper.
Feb 9 21:03:39 instance-prod-2 collectd[1307]: write_gcm: can not take infinite value
Feb 9 21:03:39 instance-prod-2 collectd[1307]: write_gcm: wg_typed_value_create_from_value_t_inline failed for swap/percent/value! Continuing.
Feb 9 21:03:39 instance-prod-2 collectd[1307]: write_gcm: can not take infinite value
Feb 9 21:03:39 instance-prod-2 collectd[1307]: write_gcm: wg_typed_value_create_from_value_t_inline failed for swap/percent/value! Continuing.
Feb 9 21:03:39 instance-prod-2 collectd[1307]: write_gcm: can not take infinite value
Feb 9 21:03:39 instance-prod-2 collectd[1307]: write_gcm: wg_typed_value_create_from_value_t_inline failed for swap/percent/value! Continuing.
Feb 9 21:04:01 instance-prod-2 systemd[1]: Started /run/user/1000 mount wrapper.
Feb 9 21:04:01 instance-prod-2 systemd[1]: Created slice User Slice of UID 1000.
Feb 9 21:04:01 instance-prod-2 systemd[1]: Starting User Manager for UID 1000...
Feb 9 21:04:01 instance-prod-2 systemd[1]: Started Session 19719 of user user_12345.
Feb 9 21:04:01 instance-prod-2 systemd[1]: Started Session 19720 of user user_12345.
Feb 9 21:04:01 instance-prod-2 systemd[1]: Started Session 19721 of user user_12345.
Feb 9 21:04:01 instance-prod-2 systemd[1]: Started Session 19722 of user user_12345.
Feb 9 21:04:01 instance-prod-2 systemd[285996]: Started Mark boot as successful after the user session has run 2 minutes.
Feb 9 21:04:01 instance-prod-2 systemd[285996]: Reached target Paths.
Feb 9 21:04:01 instance-prod-2 systemd[285996]: Starting D-Bus User Message Bus Socket.
Feb 9 21:04:01 instance-prod-2 systemd[285996]: Reached target Timers.
Feb 9 21:04:01 instance-prod-2 systemd[285996]: Listening on D-Bus User Message Bus Socket.
Feb 9 21:04:01 instance-prod-2 systemd[285996]: Reached target Sockets.
Feb 9 21:04:01 instance-prod-2 systemd[285996]: Reached target Basic System.
Feb 9 21:04:01 instance-prod-2 systemd[285996]: Reached target Default.
Feb 9 21:04:01 instance-prod-2 systemd[285996]: Startup finished in 40ms.
Feb 9 21:04:01 instance-prod-2 systemd[1]: Started User Manager for UID 1000.
Feb 9 21:04:06 instance-prod-2 systemd[1]: session-19722.scope: Succeeded.
Feb 9 21:04:07 instance-prod-2 systemd[1]: session-19720.scope: Succeeded.
Feb 9 21:04:09 instance-prod-2 systemd[1]: session-19719.scope: Succeeded.
Feb 9 21:04:10 instance-prod-2 systemd[1]: session-19721.scope: Succeeded.
Feb 9 21:04:10 instance-prod-2 systemd[1]: Stopping User Manager for UID 1000...
Feb 9 21:04:10 instance-prod-2 systemd[285996]: Stopped target Default.
Feb 9 21:04:10 instance-prod-2 systemd[285996]: Stopped target Basic System.
Feb 9 21:04:10 instance-prod-2 systemd[285996]: Stopped target Sockets.
Feb 9 21:04:10 instance-prod-2 systemd[285996]: Stopped target Paths.
Feb 9 21:04:10 instance-prod-2 systemd[285996]: dbus.socket: Succeeded.
Feb 9 21:04:10 instance-prod-2 systemd[285996]: Closed D-Bus User Message Bus Socket.
Feb 9 21:04:10 instance-prod-2 systemd[285996]: Stopped target Timers.
Feb 9 21:04:10 instance-prod-2 systemd[285996]: grub-boot-success.timer: Succeeded.
Feb 9 21:04:10 instance-prod-2 systemd[285996]: Stopped Mark boot as successful after the user session has run 2 minutes.
Feb 9 21:04:10 instance-prod-2 systemd[285996]: Reached target Shutdown.
Feb 9 21:04:10 instance-prod-2 systemd[285996]: Starting Exit the Session...
Feb 9 21:04:11 instance-prod-2 systemd[1]: user@1000.service: Succeeded.
Feb 9 21:04:11 instance-prod-2 systemd[1]: Stopped User Manager for UID 1000.
Feb 9 21:04:11 instance-prod-2 systemd[1]: Stopping /run/user/1000 mount wrapper...
Feb 9 21:04:11 instance-prod-2 systemd[1]: Removed slice User Slice of UID 1000.
Feb 9 21:04:11 instance-prod-2 systemd[1]: run-user-1000.mount: Succeeded.
Feb 9 21:04:11 instance-prod-2 systemd[1]: user-runtime-dir@1000.service: Succeeded.
Feb 9 21:04:11 instance-prod-2 systemd[1]: Stopped /run/user/1000 mount wrapper.
####################### FROZEN #####################
Feb 10 09:29:18 instance-prod-2 kernel: Command line: BOOT_IMAGE=(hd0,gpt2)/boot/vmlinuz-4.18.0-240.10.1.el8_3.x86_64 root=UUID=0d7450f2-b70b-4208-bfe4-8>
Feb 10 09:29:18 instance-prod-2 kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Feb 10 09:29:18 instance-prod-2 kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Feb 10 09:29:18 instance-prod-2 kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Feb 10 09:29:18 instance-prod-2 kernel: x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Feb 10 09:29:18 instance-prod-2 kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
Feb 10 09:29:18 instance-prod-2 kernel: BIOS-provided physical RAM map:
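The dump ends with the fresh boot at 09:29, so nothing from the frozen period made it to disk. A sketch of one way to get more out of the next freeze, assuming (but not knowing) that memory pressure or the OOM killer is involved; note that CentOS 8 does not keep the journal across reboots by default:
# make the systemd journal persistent so the next freeze leaves a trace
sudo mkdir -p /var/log/journal
sudo systemctl restart systemd-journald
# after the next forced restart, inspect the previous boot
journalctl --list-boots
sudo journalctl -k -b -1 | grep -iE 'oom|out of memory|panic'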

Related

Streamsets Service does not start any more

I just upgraded my MapR cluster and am trying to start StreamSets. However, I get the following error:
Exception in thread "main" java.lang.ExceptionInInitializerError: Expected exactly 1 stage lib jar but found 0 with name streamsets-datacollector-mapr_6_0-mep4-lib
Jun 12 09:48:56 BGDTEST5 streamsets[24543]: at com.streamsets.pipeline.SDCClassLoader.bringStageAndProtoLibsToFront(SDCClassLoader.java:186)
Jun 12 09:48:56 BGDTEST5 streamsets[24543]: at com.streamsets.pipeline.SDCClassLoader.getOrderedURLsForClassLoader(SDCClassLoader.java:404)
Jun 12 09:48:56 BGDTEST5 streamsets[24543]: at com.streamsets.pipeline.SDCClassLoader.<init>(SDCClassLoader.java:119)
Jun 12 09:48:56 BGDTEST5 streamsets[24543]: at com.streamsets.pipeline.SDCClassLoader.getStageClassLoader(SDCClassLoader.java:389)
Jun 12 09:48:56 BGDTEST5 streamsets[24543]: at com.streamsets.pipeline.SDCClassLoader.getStageClassLoader(SDCClassLoader.java:383)
Jun 12 09:48:56 BGDTEST5 streamsets[24543]: at com.streamsets.pipeline.BootstrapMain.main(BootstrapMain.java:291)
Jun 12 09:48:56 BGDTEST5 systemd[1]: sdc.service: main process exited, code=exited, status=1/FAILURE
Jun 12 09:48:56 BGDTEST5 systemd[1]: Unit sdc.service entered failed state.
Jun 12 09:48:56 BGDTEST5 systemd[1]: sdc.service failed.
I can see the stage-lib in question in the streamsets-libs directory. This used to work seamlessly before. What am I doing wrong?
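For what it's worth, a quick way to double-check what the class loader would see (a sketch; $SDC_DIST and the streamsets-libs/<lib>/lib layout are assumptions based on a default install, so adjust to yours):
# the error expects exactly one jar for the mapr_6_0-mep4 stage lib
ls -d "$SDC_DIST"/streamsets-libs/streamsets-datacollector-mapr_6_0-mep4-lib
ls "$SDC_DIST"/streamsets-libs/streamsets-datacollector-mapr_6_0-mep4-lib/lib/*.jar | wc -l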

Zookeeper / Exhibitor recurring JMX Error

I don't know why this is occurring, but occasionally I will get this series of repeating errors and the zookeeper instances will go into a bad state.
Tue Feb 16 07:05:04 EST 2016 ERROR ZooKeeper Server: Using config: /opt/zookeeper/zookeeper-3.4.6/bin/../conf/zoo.cfg
Tue Feb 16 07:05:04 EST 2016 ERROR ZooKeeper Server: JMX enabled by default
Tue Feb 16 07:05:04 EST 2016 INFO Process started via: /opt/zookeeper/zookeeper-3.4.6/bin/zkServer.sh
Tue Feb 16 07:05:03 EST 2016 INFO Kill attempted result: 0
Tue Feb 16 07:05:03 EST 2016 INFO Attempting to start/restart ZooKeeper
Tue Feb 16 07:05:03 EST 2016 INFO Attempting to stop instance
Tue Feb 16 07:05:03 EST 2016 INFO Restarting down/not-serving ZooKeeper after 60037 ms pause
Tue Feb 16 07:04:33 EST 2016 INFO ZooKeeper down/not-serving waiting 30026 of 40000 ms before restarting
Tue Feb 16 07:04:05 EST 2016 INFO ZooKeeper Server: Starting zookeeper ... STARTED
Tue Feb 16 07:04:04 EST 2016 ERROR ZooKeeper Server: Using config: /opt/zookeeper/zookeeper-3.4.6/bin/../conf/zoo.cfg
Tue Feb 16 07:04:04 EST 2016 ERROR ZooKeeper Server: JMX enabled by default
The Exhibitor setup uses shared storage on a NAS. The servers are CentOS 6.6. It is a three-node ensemble, and the one noticeable problem I have seen is that the "ensemble" connection string inside the Exhibitor GUI suddenly becomes different between the three nodes (one node may "forget" about some of the other nodes in the ensemble).
I don't even know where to look to dig into the causes. Any help or direction will be greatly appreciated. It's truly odd...
Update: versions
zk: 3.4.6
Exhibitor: 1.5.5
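In case it helps narrow this down, a rough sketch of comparing what each node actually reports (the hostnames zk1/zk2/zk3 are placeholders; ruok and stat are standard four-letter-word commands in 3.4.6):
# ask each ensemble member whether it is serving and in what mode
for host in zk1 zk2 zk3; do
  echo "--- $host ---"
  echo ruok | nc "$host" 2181            # should answer "imok"
  echo stat | nc "$host" 2181 | head -5  # shows Mode: leader/follower
done
# on each node, compare the server.N lines Exhibitor wrote into zoo.cfg
grep '^server\.' /opt/zookeeper/zookeeper-3.4.6/conf/zoo.cfg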

Kubernetes scheduler: watch of *api.Pod ended with error: unexpected end of JSON input

Yesterday the service worked fine, but today when I checked the service's state I saw:
Mar 11 14:03:16 coreos-1 systemd[1]: scheduler.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 11 14:03:16 coreos-1 systemd[1]: Unit scheduler.service entered failed state.
Mar 11 14:03:16 coreos-1 systemd[1]: scheduler.service failed.
Mar 11 14:03:16 coreos-1 systemd[1]: Starting Kubernetes Scheduler...
Mar 11 14:03:16 coreos-1 systemd[1]: Started Kubernetes Scheduler.
Mar 11 14:08:16 coreos-1 kube-scheduler[4659]: E0311 14:08:16.808349 4659 reflector.go:118] watch of *api.Service ended with error: very short watch
Mar 11 14:08:16 coreos-1 kube-scheduler[4659]: E0311 14:08:16.811434 4659 reflector.go:118] watch of *api.Pod ended with error: unexpected end of JSON input
Mar 11 14:08:16 coreos-1 kube-scheduler[4659]: E0311 14:08:16.847595 4659 reflector.go:118] watch of *api.Pod ended with error: unexpected end of JSON input
It's really confusing because etcd, flannel and the apiserver all work fine.
The only strange logs are from etcd:
Mar 11 20:22:21 coreos-1 etcd[472]: [etcd] Mar 11 20:22:21.572 INFO | aba44aa0670b4b2e8437c03a0286d779: warning: heartbeat time out peer="6f4934635b6b4291bf29763add9bf4c7" missed=1 backoff="2s"
Mar 11 20:22:48 coreos-1 etcd[472]: [etcd] Mar 11 20:22:48.269 INFO | aba44aa0670b4b2e8437c03a0286d779: warning: heartbeat time out peer="6f4934635b6b4291bf29763add9bf4c7" missed=1 backoff="2s"
Mar 11 20:48:12 coreos-1 etcd[472]: [etcd] Mar 11 20:48:12.070 INFO | aba44aa0670b4b2e8437c03a0286d779: warning: heartbeat time out peer="6f4934635b6b4291bf29763add9bf4c7" missed=1 backoff="2s"
So I'm really stuck and don't know what's wrong. How can I resolve this problem? Or how can I check detailed logs for the scheduler?
journalctl gives me the same logs as systemctl status.
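In case it's useful, a sketch of how more detail could be pulled out of the scheduler on a setup like this (the unit name comes from the logs above; the insecure apiserver address and the --v level are assumptions):
# full scheduler log for the current boot, not just the status snippet
journalctl -u scheduler.service -b --no-pager
# run the scheduler with higher log verbosity (e.g. in the unit's ExecStart)
kube-scheduler --master=http://127.0.0.1:8080 --v=4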
Please see: https://github.com/GoogleCloudPlatform/kubernetes/issues/5311
It means apiserver accepted the watch request but then immediately terminated the connection.
If you see it occasionally, it implies a transient error and is not alarming. If you see it repeatedly, it implies that apiserver (or etcd) is sick.
Is something actually not working for you?
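Given the heartbeat timeouts in the etcd log above, it may also be worth checking etcd itself; a rough sketch against the old v2 stats endpoints (4001 was the old default client port, adjust if yours differs):
# who the leader is and how its followers are keeping up
curl -s http://127.0.0.1:4001/v2/stats/leader
# this node's own state (leader/follower) and uptime
curl -s http://127.0.0.1:4001/v2/stats/self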

Dovecot isn't delivering e-mails from localhost

I'm facing an issue with delivering e-mails.
I've successfully set up Dovecot + Postfix + MySQL. The issue is that e-mails don't end up in the INBOX when I send them from localhost (for example from a PHP script). It works perfectly when I send e-mails from any other server. I have no idea what could cause this. The mail log seems to be OK. Where should I look?
The log for an undelivered e-mail looks like this:
Nov 9 22:31:31 user postfix/pickup[15929]: 474A5300E47: uid=5005 from=<webmaster@domain.com>
Nov 9 22:31:31 user postfix/cleanup[18511]: 474A5300E47: message-id=<20141109223131.474A5300E47@domain.com>
Nov 9 22:31:31 user postfix/qmgr[2582]: 474A5300E47: from=<webmaster@domain.com>, size=1198, nrcpt=1 (queue active)
Nov 9 22:31:35 user postfix/smtpd[18515]: connect from localhost[127.0.0.1]
Nov 9 22:31:35 user postfix/smtpd[18515]: 9A538300E48: client=localhost[127.0.0.1]
Nov 9 22:31:35 user postfix/cleanup[18511]: 9A538300E48: message-id=<20141109223131.474A5300E47@domain.com>
Nov 9 22:31:35 user postfix/smtpd[18515]: disconnect from localhost[127.0.0.1]
Nov 9 22:31:35 user postfix/qmgr[2582]: 9A538300E48: from=<webmaster@domain.com>, size=1595, nrcpt=1 (queue active)
Nov 9 22:31:35 user amavis[3458]: (03458-10) Passed CLEAN {RelayedInbound}, <webmaster@domain.com> -> <info@domain.com>, Message-ID: <20141109223131.474A5300E47@domain.com>, mail_id: 1S0boeHaaI2L, Hits: 1.115, size: 1196, queued_as: 9A538300E48, 4313 ms
Nov 9 22:31:35 user postfix/smtp[18512]: 474A5300E47: to=<info@domain.com>, relay=127.0.0.1[127.0.0.1]:10024, delay=4.4, delays=0.05/0.02/0.02/4.3, dsn=2.0.0, status=sent (250 2.0.0 from MTA(smtp:[127.0.0.1]:10025): 250 2.0.0 Ok: queued as 9A538300E48)
Nov 9 22:31:35 user postfix/qmgr[2582]: 474A5300E47: removed
Nov 9 22:31:35 user dovecot: lda(info@domain.com): sieve: msgid=<20141109223131.474A5300E47@domain.com>: stored mail into mailbox 'INBOX'
Nov 9 22:31:35 user postfix/pipe[18516]: 9A538300E48: to=<info@domain.com>, relay=dovecot, delay=0.1, delays=0.02/0.03/0/0.05, dsn=2.0.0, status=sent (delivered via dovecot service)
Nov 9 22:31:35 user postfix/qmgr[2582]: 9A538300E48: removed
The log for a delivered e-mail:
Nov 9 22:32:13 user postfix/smtpd[18542]: connect from mail-wi0-x236.google.com[2a00:1450:400c:c05::236]
Nov 9 22:32:13 user postfix/smtpd[18542]: 985EB300E47: client=mail-wi0-x236.google.com[2a00:1450:400c:c05::236]
Nov 9 22:32:13 user postfix/cleanup[18511]: 985EB300E47: message-id=<B840B0EE-45E6-4609-BD14-59EBF39449D0@gmail.com>
Nov 9 22:32:13 user postfix/qmgr[2582]: 985EB300E47: from=<example@gmail.com>, size=7916, nrcpt=1 (queue active)
Nov 9 22:32:13 user postfix/smtpd[18542]: disconnect from mail-wi0-x236.google.com[2a00:1450:400c:c05::236]
Nov 9 22:32:18 user postfix/smtpd[18515]: connect from localhost[127.0.0.1]
Nov 9 22:32:18 user postfix/smtpd[18515]: 3F751300E48: client=localhost[127.0.0.1]
Nov 9 22:32:18 user postfix/cleanup[18511]: 3F751300E48: message-id=<B840B0EE-45E6-4609-BD14-59EBF39449D0@gmail.com>
Nov 9 22:32:18 user postfix/qmgr[2582]: 3F751300E48: from=<example@gmail.com>, size=8407, nrcpt=1 (queue active)
Nov 9 22:32:18 user postfix/smtpd[18515]: disconnect from localhost[127.0.0.1]
Nov 9 22:32:18 user amavis[2072]: (02072-11) Passed CLEAN {RelayedInbound}, [2a00:1450:400c:c05::236]:65181 [86.149.90.56] <example@gmail.com> -> <info@domain.com>, Queue-ID: 985EB300E47, Message-ID: <B840B0EE-45E6-4609-BD14-59EBF39449D0@gmail.com>, mail_id: IXFd3XMT3ftY, Hits: -0.799, size: 7915, queued_as: 3F751300E48, dkim_sd=20120113:gmail.com, 4569 ms
Nov 9 22:32:18 user postfix/smtp[18512]: 985EB300E47: to=<info@domain.com>, relay=127.0.0.1[127.0.0.1]:10024, delay=4.7, delays=0.1/0/0.02/4.6, dsn=2.0.0, status=sent (250 2.0.0 from MTA(smtp:[127.0.0.1]:10025): 250 2.0.0 Ok: queued as 3F751300E48)
Nov 9 22:32:18 user postfix/qmgr[2582]: 985EB300E47: removed
Nov 9 22:32:18 user dovecot: lda(info@domain.com): sieve: msgid=<B840B0EE-45E6-4609-BD14-59EBF39449D0@gmail.com>: stored mail into mailbox 'INBOX'
Nov 9 22:32:18 user postfix/pipe[18516]: 3F751300E48: to=<info@domain.com>, relay=dovecot, delay=0.05, delays=0.01/0/0/0.04, dsn=2.0.0, status=sent (delivered via dovecot service)
Nov 9 22:32:18 user postfix/qmgr[2582]: 3F751300E48: removed
FIXED IT
Just in case somebody else faces this issue: check your /etc/hosts file. If you have
39.29.192.294 domain.com
(with your public IP address and your domain, of course), it won't deliver any e-mail sent from localhost to @domain.com. Remove this line and it should work ;)
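A quick way to check for this (the domain below is a placeholder, adjust to yours):
# look for a line mapping your domain in /etc/hosts, as described above
grep -n 'domain\.com' /etc/hosts
# and confirm the domain is not unintentionally listed as a local destination
postconf mydestination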

Meteor: Couldn't deploy

I deployed a site successfully a few days ago and was able to configure logins and do other things, but today it is down. The strange thing is that I deployed to a different location without any changes, and now I can't even access the page to configure things. I don't really know how to debug this. Here are the logs:
[Wed Sep 11 2013 15:19:16 GMT+0000 (UTC)] INFO STATUS waiting -> starting
[Wed Sep 11 2013 15:19:16 GMT+0000 (UTC)] INFO STATUS starting -> running
[Wed Sep 11 2013 15:19:17 GMT+0000 (UTC)] WARNING timers.js:103
[Wed Sep 11 2013 15:19:17 GMT+0000 (UTC)] WARNING ^
[Wed Sep 11 2013 15:19:17 GMT+0000 (UTC)] WARNING if (!process.listeners('uncaughtException').length) throw e;
[Wed Sep 11 2013 15:19:17 GMT+0000 (UTC)] WARNING Error: Could not locate any valid servers in initial seed list
at MongoClient.connect.connectFunction (/meteor/containers/6c32717b-367c-bd8f-b229-ad69088fe830/bundle/programs/server/npm/mongo-livedata/main/node_modules/mongodb/lib/mongodb/mongo_client.js:281:52)
at Db.open (/meteor/containers/6c32717b-367c-bd8f-b229-ad69088fe830/bundle/programs/server/npm/mongo-livedata/main/node_modules/mongodb/lib/mongodb/db.js:267:16)
at Server.connect.connectionPool.on.server._serverState (/meteor/containers/6c32717b-367c-bd8f-b229-ad69088fe830/bundle/programs/server/npm/mongo-livedata/main/node_modules/mongodb/lib/mongodb/connection/server.js:499:7)
at EventEmitter.emit (events.js:126:20)
at connection.on._self._poolState (/meteor/containers/6c32717b-367c-bd8f-b229-ad69088fe830/bundle/programs/server/npm/mongo-livedata/main/node_modules/mongodb/lib/mongodb/connection/connection_pool.js:168:15)
at EventEmitter.emit (events.js:99:17)
at Socket.timeoutHandler (/meteor/containers/6c32717b-367c-bd8f-b229-ad69088fe830/bundle/programs/server/npm/mongo-livedata/main/node_modules/mongodb/lib/mongodb/connection/connection.js:463:10)
at Socket.EventEmitter.emit (events.js:93:17)
at Socket._onTimeout (net.js:188:8)
at Timer.list.ontimeout (timers.js:101:19)
[Wed Sep 11 2013 15:19:17 GMT+0000 (UTC)] ERROR Application crashed with code: 1
[Wed Sep 11 2013 15:19:17 GMT+0000 (UTC)] INFO STATUS running -> waiting
Given the timing, looks like the very brief outage from this morning. Could you check it again and let me know if everything is back to normal?