MongoDb Ops Manager can't start mongos and shard - mongodb

I Came by a problem where i have an Ops Manager that suppose to run a MongoDB cluster as an automated cluster.
Suddenly the servers started going down, unexpectedly - while there are no errors in any of the log files indicating on when is the problem.
The Ops Manager gets stuck on the blue label
We are deploying your changes. This might take a few minutes
And it just never goes away.
Because this environment is based on the automation feature, the mms is managing the user on the servers and runs all of the processes from "mongod" which i can't access even as a Root (administrator).
As far as the Ops Manager goes it shows that a shard in a replica set is down although it's live, and thinks that a mongos that is dead is alive.
Has someone got into this situation before and may be able to help ?
Thanks,
Eliran.

Problem found: there was an ntp mismatch between the servers in the cluster somehow, so what happened was that the servers were not synced and everytime the ops manager did something it got responses with wrong times and could not use it's time limits.
After re-configuring all the ntp's back to the same one - everything got back to how it should have been :)

Related

MongoDB Atlas - Replica Set Has No Primary

I'm fairly new to MongoDB (Atlas - free tier), where I have created a project using it for storing my data. I had it set up and working fine for a couple of weeks, when suddenly I received an email with: An alert is open for your Atlas project: Replica set has no primary. I have no idea what this means and I don't believe I have done anything in the last couple of days/weeks that could warrant this alert. However, after checking my project, it seems that I can no longer connect to my cluster and access my data.
After checking on MongoDB Cloud, it seems that my cluster has stopped working and only the secondary shard (don't know if this is the right terminology) is running, while the other two seem to be down. Can anyone explain what this means, why it is happening or how to fix it? Thanks.
To troubleshoot issues like this, read the server logs and act based on the information therein.
For free and shared tiers in Atlas the logs are apparently not available. Therefore:
For a free tier cluster (M0), delete this cluster and create a new one. If you don't have a backup you should be able to dump via a direct connection to any of the operational secondary nodes or using the secondary read preference.
For a shared tier cluster (M2/M5), use the official MongoDB support channels for assistance.

AWS EC2 free tier instance is automatically stopping frequently

I am using ubuntu 18.04 on AWS EC2 instace free tier, running websites on apache server, NodeJS with PostgreSQL database. All deployments are done perfectly and webapps works fine without any exception or error details.
However I am facing an annoying issue: this instance is stopping frequently without any exception or error logs. After rebooting instance everything starts working fine but after some time it automatically stops either in few hrs. on same day when rebooted instance or in 1-2 days after that.
I created another free tier instance with seperate account and that is also giving same issue. I am not finding any logs or troubleshoot option to get rid of this problem.
I would like to know how it can be troubleshooted or where can i find logs of any errors or exception for this isntance?
Suggestion given by AWS in "Instance Status Checl" as attached below are not practicle solution to apply evertime.
Something with your VM itself is causing its health checks to fail.
Have a look at syslogs, and your application logs. Also take a look at CloudWatch metrics to see if any metrics have dramatic change close to time.
You can also add a CloudWatch alarm with a recovery action to automatically reboot if there’s an issue with your VM.

What to do when a Google Cloud SQL postgres server upgrade (to a bigger machine) takes considerably longer than "a few minutes"?

We upgraded our Google Cloud SQL postgres server to a bigger machine and the upgrade is not terminating. In our experience, this usually takes less than 5 minutes, but we'ven been waiting for about 1.5 hours now and nothing is happening. There are no logs after the server shut down(except for failed connection attempts). We cannot switch to the failover, because there is already an operation in progress (namely the upgrade that's causing the problem in the first place). Restarting is disabled because the operation is in progress. It seems like there's nothing we can do right now, except maybe apply the last backup, though we're not sure if that's even possible while an operation is in progress.
Is there anything we can do to restart the DB or fix the problem?
When you upgrade a CloudSQL server, the instance is rebooted. It can happen occasionally that rebooting takes more than expected, which seems to be what happened to your server, but this is not unexpected behaviour.
This being said, be sure to check the status of the CloudSQL service. And if upgrades get stuck too often or never finish, contact support.
To reduce the chances of having this issue again:
Configure High Availability for your instance, so it has failover capability.
Make sure that the maintenance window of failover replicas is different from that of the master instance. To change the maintenance schedule, on the GCP console, go to SQL, click on an instance, and "Edit maintenance schedule"->"Set maintenance schedule". Then choose a window.

MongoDb preparing for Sharded Clusters

We are currently setting up our mongodb environment for production. At the moment we only have one dedicated mongodb database server. We will expand this in the near future with a 2nd server and I already indicated to the management that for the ideal situation we should get a 3rd server as well.
Since I already know we're going to use sharding and replication in the near future I want to be prepared for it.
The idea I have now is to start now with the Development Configuration (as mongo's documentation names it).
Whenever our second server comes available I would like to expand this setup to a configuration with 2 configuration servers en 2 shards (replica sets).
And of course when our third server comes available have the fully functional sharded cluster configuration.
While reading mongo's documentation I was getting triggered by the note that de Development setup should not be used in production.
MongoDb Development Configuration
Keeping in mind that we will add more servers soon, would it be a bad idea to already configure the Development Configuration already so we can easily add the 2nd server to the cluster when it comes available?
After setting up the 'development sharded setup' I've found my anwser. Of course i'm happy to share in case anybody runs into the same questions as I do when starting with this.
In my case, it was ok to start with the development setup untill my new servers arrived. It was a temporary situation and when my new servers arived I was able to easily expand my replicasets. There are a number of reasons why this isn't adviced for production:
To state the obvious, there is no replication yet. Since I was running shards on one machine there is a single point of failure. If the machine, or one node goes down, the cluster won't work anymore.
Now this part is interesting. After I added a second server, I did have primary and secondary nodes. Primary nodes were used for writing and secondary for reading. I've eliminated the issue that there was no replication AND my data had a higher availability. However, I noticed with the 2-member replica sets, if one member of the replicaset went down (even is this was a secondary), the primary stepped down to a secondary node as well. This had to do with the voting mechanism that MongoDb uses. See Markus' more detailed answer on this.. Since there are no more primaries in the replicaset, my cluster won't function anymore. Now, if i were to use an arbiter I could eliminate this problem as well.
When you have a 3-member replicataset, automatic failover kicks in. Whenever a node goes down, another primary is assigned automatically and the cluster will continue performing as before.
During my tests I also got to a point where one of my MongoD.exe instances stopped working due to a "Out of memory exception". I was running a cluster with 3 replicasets, meaning every machine had at least 4 mongod.exe processes running (3 for the replicaset shards and one for the configuration server replicaset). Besides having a query which wasn't optimized yet I also noticed that the WiredTiger storage engine by default can use up to 50% of ram minus one gigabyte. Perhaps it wasn't the best choise to have multiple replicaset-shards on one machine but I was able to eliminate the problem by capping the wiredtiger memory usage.
I hope this answer helps anybody who's starting to set up replication and sharding for MongoDb.

Replica set never finishes cloning primary node

We're working with an average sized (50GB) data set in MongoDB and are attempting to add a third node to our replica set (making it primary-secondary-secondary). Unfortunately, when we bring the nodes up (with the appropriate command line arguments associating them with our replica set), the nodes never exit the RECOVERING stage.
Looking at the logs, it seems as though the nodes ditch all of their data as soon as the recovery completes and start syncing again.
We're using version 2.0.3 on all of the nodes and have tried adding the third node from both a "clean" (empty db) state as well as a bootstrapped state (using mongodump to take a snapshot of the primary database and mongorestore'ing that snapshot into the new node), each failing.
We've observed this recurring phenomenon over the past 24 hours and any input/guidance would be appreciated!
It's hard to be certain without looking at the logs, but it sounds like you're hitting a known issue in MongoDB 2.0.3. Check out http://jira.mongodb.org/browse/SERVER-5177 . The problem is fixed in 2.0.4, which has an available release candidate.
I don't know if it helps, but when I got that problem, I erased the replica DB and initiated it. It started from scratch and replicated OK. worth a try, I guess.