I am using Ubuntu 18.04 on an AWS EC2 free tier instance, running websites on an Apache server with NodeJS and a PostgreSQL database. Everything is deployed correctly and the web apps work fine without any exceptions or errors.
However, I am facing an annoying issue: the instance stops frequently without leaving any exception or error logs. After rebooting the instance everything works fine again, but after some time it stops automatically, either within a few hours on the same day or within 1-2 days.
I created another free tier instance under a separate account and it has the same issue. I cannot find any logs or troubleshooting options to get rid of this problem.
I would like to know how this can be troubleshooted, or where I can find logs of any errors or exceptions for this instance.
The suggestions given by AWS under "Instance Status Check" (attached below) are not a practical solution to apply every time.
Something with your VM itself is causing its health checks to fail.
Have a look at syslogs and your application logs. Also take a look at CloudWatch metrics to see if any of them change dramatically around the time the instance stops.
You can also add a CloudWatch alarm with a recovery action to automatically reboot if there’s an issue with your VM.
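For example, a minimal sketch of such an alarm using Python and boto3 (the instance ID, region, and alarm name below are placeholders; the same alarm can be created from the CloudWatch console):

    import boto3

    INSTANCE_ID = "i-0123456789abcdef0"   # placeholder: your instance ID
    REGION = "us-east-1"                  # placeholder: your region

    cloudwatch = boto3.client("cloudwatch", region_name=REGION)

    # Alarm on the system status check and trigger the EC2 "recover" action.
    # Use ...:ec2:reboot instead if you only want a reboot.
    cloudwatch.put_metric_alarm(
        AlarmName=f"auto-recover-{INSTANCE_ID}",
        Namespace="AWS/EC2",
        MetricName="StatusCheckFailed_System",
        Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
        Statistic="Maximum",
        Period=60,
        EvaluationPeriods=2,
        Threshold=1.0,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=[f"arn:aws:automate:{REGION}:ec2:recover"],
    )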
Related
We upgraded our Google Cloud SQL Postgres server to a bigger machine and the upgrade is not completing. In our experience this usually takes less than 5 minutes, but we've been waiting for about 1.5 hours now and nothing is happening. There are no logs after the server shut down (except for failed connection attempts). We cannot switch to the failover, because there is already an operation in progress (namely the upgrade that's causing the problem in the first place). Restarting is disabled because the operation is in progress. It seems like there's nothing we can do right now, except maybe apply the last backup, though we're not sure if that's even possible while an operation is in progress.
Is there anything we can do to restart the DB or fix the problem?
When you upgrade a Cloud SQL server, the instance is rebooted. Occasionally rebooting takes longer than expected, which seems to be what happened to your server, but this is not unexpected behaviour.
This being said, be sure to check the status of the CloudSQL service. And if upgrades get stuck too often or never finish, contact support.
To reduce the chances of having this issue again:
Configure High Availability for your instance, so it has failover capability.
Make sure that the maintenance window of the failover replicas is different from that of the master instance. To change the maintenance schedule in the GCP console, go to SQL, click on an instance, then "Edit maintenance schedule" -> "Set maintenance schedule" and choose a window. (A rough programmatic sketch follows below.)
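Both settings can also be applied programmatically. A hedged sketch using the Cloud SQL Admin API via google-api-python-client (project and instance names are placeholders, and the field values should be double-checked against the sqladmin reference):

    from googleapiclient import discovery

    # Uses Application Default Credentials.
    service = discovery.build("sqladmin", "v1beta4")

    body = {
        "settings": {
            # REGIONAL availability type enables high availability / failover.
            "availabilityType": "REGIONAL",
            # Maintenance window: day is 1-7 starting on Monday, hour is UTC.
            "maintenanceWindow": {"day": 7, "hour": 4},
        }
    }

    service.instances().patch(
        project="my-project", instance="my-instance", body=body
    ).execute()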
This morning my application could not connect to my MySQL master instance in Google Cloud SQL. The master instance has no further logs, but the replica instance's log shows that replication could not connect to the master either.
I tried to restart MySQL, but an hour later it still had not started.
What should I do?
There are several possible reasons for this issue. For instance, your master instance may have failed due to an error while a dump was being created, or the instance may have been under maintenance and now cannot restart correctly, etc. If that is the case, you would need to get in touch with Google Cloud Platform Support to have your Cloud SQL instance manually restarted.
Alternatively, you can check the documentation on instance connection issues and how to diagnose connection problems.
If none of this applies to your case, consider adding more information to your question, since there could also be a problem with an expired SSL server certificate, with the proxy, etc.
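As a quick first diagnostic (a generic sketch, not from the documentation above), connecting directly with a client and looking at the exact error usually tells you whether the problem is networking, credentials, or SSL. For example, with the third-party pymysql package (host and credentials are placeholders):

    import pymysql

    try:
        conn = pymysql.connect(
            host="INSTANCE_IP",        # placeholder: the instance's IP address
            user="root",
            password="YOUR_PASSWORD",
            connect_timeout=10,
        )
        print("Connected to:", conn.get_server_info())
        conn.close()
    except pymysql.err.OperationalError as exc:
        # A timeout suggests networking/authorized networks, "Access denied"
        # suggests credentials, and SSL errors suggest certificate problems.
        print("Connection failed:", exc)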
I'm trying to run a container in a custom VM on Google Compute Engine. This is to perform a heavy ETL process, so I need a large machine but only for a couple of hours a month. I have two versions of my container with small startup changes. Both versions were built and pushed to the same Google Container Registry by the same computer using the same Google login. The older one works fine, but the newer one gets stuck in an endless stream of the following error:
E0927 09:10:13 7f5be3fff700 api_server.cc:184 Metadata request unsuccessful: Server responded with 'Forbidden' (403): Transport endpoint is not connected
Can anyone tell me exactly what's going on here? Can anyone explain why one of my images doesn't have this problem (well, it gives a few of these messages but gets past them) and the other does (thousands of these messages, running for over 24 hours before I killed it)?
If I SSH into a GCE instance, both versions of the container pull and run just fine. From the logs I suspect the INTEGRITY_RULE checking, but I know nothing about how that works.
MORE INFO: this comes down to "restart policy: never". Even a simple centos:7 container that prints "hello world", deployed from the console, triggers this if the restart policy is "never". At least in the short term I can work around it in the entrypoint script, since the instance will be destroyed once the monitor realises the process has finished.
I suggest you try creating a 3rd container that's focused on the metadata service functionality to isolate the issue. It may be that there's a timing difference between the 2 containers that's not being overcome.
Make sure you can ‘curl’ the metadata service from the VM and that the request to the metadata service is using the VM's service account.
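As an illustration of that check (a sketch, not part of the original answer), you can probe the metadata server directly from inside the VM or container and see whether the default service account token is reachable:

    import requests   # third-party package

    TOKEN_URL = (
        "http://metadata.google.internal/computeMetadata/v1/"
        "instance/service-accounts/default/token"
    )

    resp = requests.get(TOKEN_URL, headers={"Metadata-Flavor": "Google"}, timeout=5)
    print(resp.status_code)   # 200 means the VM's service account is usable;
                              # a 403 here would match the "Forbidden" error above.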
I have a 5-node MariaDB/Galera cluster running in a production environment.
I also have a monitor which checks for cluster size changes every 20 seconds. One of our other engineers has been running queries using MySQL Workbench, and when that application is running I start seeing alerts from my monitor reporting a cluster size of 1. It recovers back to the correct size of 5 within a few seconds, but it's disconcerting that this client app is causing issues on the cluster. I've asked everyone on our team not to use this app... however, I wonder if anyone else has seen this, or knows what it is doing to the cluster.
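For reference, the monitor essentially polls wsrep_cluster_size; a simplified sketch of that check (Python with the third-party pymysql package; host and credentials are placeholders) looks like this:

    import pymysql

    conn = pymysql.connect(host="galera-node-1", user="monitor", password="SECRET")
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'")
            name, value = cur.fetchone()
            print(f"{name} = {value}")   # expected 5; an alert fires when it drops
    finally:
        conn.close()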
I ran into a problem with an Ops Manager that is supposed to run a MongoDB cluster as an automated cluster.
Suddenly the servers started going down unexpectedly, and there are no errors in any of the log files indicating what the problem is.
The Ops Manager gets stuck on the blue label
We are deploying your changes. This might take a few minutes
And it just never goes away.
Because this environment is based on the automation feature, the MMS manages the user on the servers and runs all of the processes as "mongod", which I can't access even as root (administrator).
As far as the Ops Manager goes, it shows that a shard in a replica set is down although it's actually live, and it thinks that a mongos that is dead is alive.
Has anyone run into this situation before and might be able to help?
Thanks,
Eliran.
Problem found: there was an NTP mismatch between the servers in the cluster, so the servers' clocks were not synced, and every time the Ops Manager did something it got responses with wrong timestamps and could not apply its time limits.
After re-configuring all the servers to use the same NTP source, everything went back to how it should be :)
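For anyone hitting the same thing, a quick way to spot the drift (a sketch, not what we actually ran; uses the third-party ntplib package) is to measure each cluster member's offset against the same NTP source and compare:

    import ntplib

    NTP_SERVER = "pool.ntp.org"   # placeholder: the NTP source you standardise on

    client = ntplib.NTPClient()
    response = client.request(NTP_SERVER, version=3)

    # response.offset is the local clock's offset from the server, in seconds.
    print(f"offset from {NTP_SERVER}: {response.offset:+.3f} s")
    if abs(response.offset) > 1.0:
        print("clock is noticeably out of sync; fix NTP on this host")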