Problems with memcached and ntpd on CentOS

We are having a problem with a virtual machine that's running our frontend website. Once it's running everything is fine, but after a reboot memcached goes bonkers. What happens is that we put items in it set to expire in 15 to 30 seconds, but they don't expire for about an hour! So after a while all the data we're serving is highly outdated.
We've been investigating the issue for a bit and found that during startup ntpd changes the clock a lot, putting it almost an hour forward.
We also found that memcached doesn't use the system clock but keeps its own internal clock. So once the system clock jumps forward and expiry times are set against the new time, memcached's clock is an hour behind and it keeps each item for an extra hour.
We've already swapped the boot order of ntpd (now S58) and memcached (now S59), but that hasn't resolved the issue.
Restarting memcached manually after a reboot is not really an option, because our host reboots the server regularly after patches and we're not always around when that happens.
Does anyone have any idea how to resolve this? We've googled high and low, but can't find anyone with the same problem. Surely we're not the first to run into this?
virt-what reports that the VPS is running on VMware.
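
For reference, a sketch of attacking the boot-time offset directly, assuming CentOS's stock SysV tooling; the pool server below is the CentOS default and a placeholder for your own NTP source:

    # CentOS ships a one-shot ntpdate init script that steps the clock before
    # ntpd starts; enabling it makes the big jump happen early in the boot
    chkconfig --list ntpdate
    chkconfig ntpdate on
    # or step the clock manually: -b forces a step, -u uses an unprivileged
    # port so it works even if ntpd is already running
    ntpdate -b -u 0.centos.pool.ntp.org
    # write the corrected time to the hardware clock so the hour-long offset
    # doesn't reappear on the next boot
    hwclock --systohc

Since the clock is almost an hour off at every boot, it is also worth checking whether VMware's guest time synchronisation is fighting ntpd; the usual recommendation is to let only one of the two manage the clock.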

Related

Losing SSH Connectivity repeatedly on Google Compute Engine

I've been coding in VS Code, remotely connected to an instance on Google Compute Engine. I have an internet connection speed of around 30-40 Mbps. What I have observed is that I keep losing connection to the remote machine very frequently, and that this often happens when certain memory-intensive operations are run. So:
Question 1: Is there a relationship between RAM and SSH connectivity?
Question 2: Is my internet connection speed a problem? If yes, what is the minimum speed necessary for a seamless coding experience?
The only relationship between RAM and the SSH service is that sshd, like any other process, needs RAM to operate. In your case you already have a clue: the SSH connection drops mostly while memory-intensive operations are running. Your machine is running short on resources, so to keep the OS up the kernel's out-of-memory (OOM) killer terminates processes, and sshd can be one of them. Once you reset the machine, everything comes back to normal.
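One way to confirm this (a sketch; log locations and exact messages vary by distro) is to look for OOM-killer entries in the kernel log right after a disconnect:

    # search the kernel ring buffer for OOM kills (-T prints readable timestamps)
    dmesg -T | grep -i 'out of memory'
    # on systemd machines, the kernel journal keeps the same messages
    sudo journalctl -k --since "1 hour ago" | grep -i 'oom'

If sshd or your language server shows up as a victim there, RAM is confirmed as the culprit.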
With your current speed, the connection is not an issue.
One of the best ways to tackle this is:
increase the resources of your VM (more RAM); see the sketch below
then go back to your code and check the requirements and limitations of your app
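A minimal sketch of the first step with the gcloud CLI; the instance name, zone, and machine type are placeholders, and the instance must be stopped before its machine type can change:

    # stop the VM, move it to a machine type with more RAM, then start it again
    gcloud compute instances stop my-dev-vm --zone=us-central1-a
    gcloud compute instances set-machine-type my-dev-vm \
        --zone=us-central1-a --machine-type=e2-standard-4
    gcloud compute instances start my-dev-vm --zone=us-central1-a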
You can also check the official SSH troubleshooting guide from Google: Troubleshooting SSH

Postgres utilizing 100% CPU on EC2 Instance? Why? How to fix?

I am facing the same issue regularly; it happens 1-3 times a month, mostly on weekends.
To explain: CPU utilization has been above 100% for the last 32 hours.
EC2 Instance is t3.medium
Postgres version is 10.6
OS: Amazon Linux 2
I have tried to gather all the information I could, using the commands provided in this reference: https://severalnines.com/blog/why-postgresql-running-slow-tips-tricks-get-source
But I didn't find any inconsistency or leak in my database. However, while checking which processes were consuming all the CPU, I found that the following command is the culprit, and it has been running for more than 32 hours:
/var/lib/postgresql/10/main/postgresql -u pg_linux_copy -B
This command is running as 3 separate processes at the moment, which have been running for 32 hours, 16 hours, and 16 hours respectively.
Searching for this didn't return a single result on Google, which is heartbreaking.
If I kill the process, everything turns back to normal.
What is the issue, and what can I do to prevent it from happening again in the future?
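For anyone diagnosing the same symptoms, a quick sketch for inspecting a suspicious process (12345 is a placeholder PID); a binary that does not live where its command line claims to is a strong hint of malware:

    # list the top CPU consumers with their running time in seconds
    ps -eo pid,etimes,pcpu,cmd --sort=-pcpu | head
    # check where the process binary and working directory actually are
    sudo ls -l /proc/12345/exe /proc/12345/cwd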
I was recently contacted by the AWS EC2 Abuse team regarding my instance being involved in an intrusion attack on another server.
To my surprise, I found that I had used the very weak password root for the default postgres account and had also left the postgres port open to the public, so the attacker silently gained access to the instance and used it to try to gain access to another instance.
I am still not sure how he was able to run SSH commands after gaining access to the master database account.
To summarise: one reason for unusual CPU spikes on a database server could be someone attacking your system.
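As for how database access turns into shell access: a PostgreSQL superuser can execute commands on the database host, for example via COPY ... FROM PROGRAM (available since 9.3), so an exposed superuser account is effectively an exposed shell. A hardening sketch, assuming the AWS CLI and a placeholder security-group id:

    # remove the rule that exposes the postgres port to the whole internet
    aws ec2 revoke-security-group-ingress \
        --group-id sg-0123456789abcdef0 \
        --protocol tcp --port 5432 --cidr 0.0.0.0/0
    # then set a strong password for the default superuser
    sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'a-long-random-password';"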

What to do when a Google Cloud SQL postgres server upgrade (to a bigger machine) takes considerably longer than "a few minutes"?

We upgraded our Google Cloud SQL postgres server to a bigger machine and the upgrade is not terminating. In our experience this usually takes less than 5 minutes, but we've been waiting for about 1.5 hours now and nothing is happening. There are no logs after the server shut down (except for failed connection attempts). We cannot switch to the failover, because there is already an operation in progress (namely the upgrade that's causing the problem in the first place). Restarting is disabled because the operation is in progress. It seems like there's nothing we can do right now, except maybe apply the last backup, though we're not sure that's even possible while an operation is in progress.
Is there anything we can do to restart the DB or fix the problem?
When you upgrade a Cloud SQL instance, it is rebooted. Occasionally the reboot takes longer than expected, which seems to be what happened to your server, but this is not unexpected behaviour.
That being said, be sure to check the status of the Cloud SQL service, and if upgrades get stuck too often or never finish, contact support.
To reduce the chances of having this issue again:
Configure High Availability for your instance, so it has failover capability.
Make sure that the maintenance window of the failover replica is different from that of the master instance. To change the maintenance schedule, go to SQL in the GCP console, click on the instance, then "Edit maintenance schedule" -> "Set maintenance schedule", and choose a window.
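Both checks can also be done from the CLI; a sketch assuming the gcloud tool and a placeholder instance name:

    # see whether the stuck operation is still listed as RUNNING
    gcloud sql operations list --instance=my-instance --limit=5
    # pin the maintenance window to a predictable day and hour
    gcloud sql instances patch my-instance \
        --maintenance-window-day=SUN --maintenance-window-hour=3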

Slow replication recovery due to communication problems

We have lately had the same problem several times in our Google Compute Engine environment with PostgreSQL streaming replication, and I would like to understand the reasons and whether I can fix it in some smoother way.
From time to time we see communication problems in Google's internal network in the GCE datacenter, and they always trigger replication lag between our PG master and its replicas. All machines are Debian 8 with PostgreSQL 9.5.
When it happens everything seems to be OK - no errors in the PG logs on the master or the replicas - but communication between master and replicas is incredibly slow or repeatedly failing, so new WAL segments reach the replicas with big delays and the replication lag keeps growing.
Restarting replication from within PostgreSQL, or restarting PostgreSQL on the replica, does not really help: after several WAL segments are copied via scp in the recovery command, communication falls back into the same incredibly slow state. Only restarting the whole instance helps - once the VM is rebooted, communication returns to normal, and recovery even from a lag many hours long completes in a few minutes. So the main cause of this behaviour seems to be at the OS level. I checked network traffic without finding anything significant, and I don't see anything relevant in any OS log either.
Could restarting some OS service help, so that I don't need to restart the whole VM? Thank you very much for any ideas.
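Two narrower experiments worth sketching, assuming Debian 8 with systemd as described above: measure the lag from the replica itself, and try bouncing only the network stack instead of the whole VM to see whether that is the layer at fault:

    # on the replica: how far behind WAL replay is right now
    sudo -u postgres psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
    # restart just the network stack, as an experiment, instead of rebooting the VM
    sudo systemctl restart networking

If a networking restart restores throughput the way a full reboot does, that narrows the problem to the OS network stack rather than PostgreSQL.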

Why does ZooKeeper get high CPU when a leap second comes?

I have a ZooKeeper cluster, and the machines get a huge spike in CPU after a leap second. The solution is to restart the machine. Does anyone know why? It seems Mozilla hit this as well: http://blog.mozilla.org/it/2012/06/30/mysql-and-the-leap-second-high-cpu-and-the-fix/
They're looking into it, but it appears to be a Linux bug, and not a ZooKeeper bug specifically. For now, this thread from the ZooKeeper User mailing list should provide you with the most up-to-date information.
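For the record, the workaround that circulated for that kernel bug (the same one the Mozilla post above describes) was to re-set the system clock in place, which clears the stuck kernel timer state without a reboot; a sketch, where the service name may be ntpd on RHEL/CentOS systems:

    # stop ntpd so it doesn't interfere, set the clock to its current value
    # (the settimeofday call is what clears the kernel's leap-second state),
    # then bring ntpd back up
    sudo service ntp stop
    sudo date -s "$(date)"
    sudo service ntp start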