Problem
I am trying to complete the MongoDB on AWS quickstart to create a simple MongoDB cluster. Unfortunately it never completes the rollout, cancelling after one last installation part (PrimaryReplicaNodeXYWaitForNodeInstallGP2) has not been completed within an hour.
Background
My Settings were the following:
AvailabilityZone0 eu-central-1a
AvailabilityZone1 eu-central-1b
AvailabilityZone2 eu-central-1b
BuildBucket quickstart-reference/mongodb/latest
ClusterReplicaSetCount 0
ClusterShardCount 1
ConfigServerInstanceType t2.micro
Iops 100
KeyName my_definitely_working_keypair
MongoDBVersion 3.2
NATInstanceType t2.small
NodeInstanceType m3.medium
PrimaryReplicaSubnet 10.0.2.0/24
PublicSubnet 10.0.1.0/24
RemoteAccessCIDR XXX.XXX.0.0/16
SecondaryReplicaSubnet0 10.0.3.0/24
SecondaryReplicaSubnet1 10.0.4.0/24
ShardsPerNode 0
VolumeSize 40
VolumeType gp2
VPCCIDR 10.0.0.0/16
This caused a rollback with the same behaviour as described in the AWS forum:
In "Ressources", all but one subtask never gets completed and stays on
forever as "PrimaryReplicaNode0WaitForNodeInstallGP2 -
PrimaryReplicaNode0WaitForNodeInstallWaitHandle - Created in Progress
- Ressource creation initiated"
So I researched the issue further. The post referred to another forum thread, where users with this problem were advised to delete their DynamoDB entries and set ClusterReplicaSetCount to 3.
The problem here: DynamoDB contains no entries, and changing ClusterReplicaSetCount to 3 also causes a rollback with a similar error:
ConfigServer2WaitForNodeInstall WaitCondition timed out. Received 0
conditions when expecting 1
and later
MONGODBSTACK1 The following resource(s) failed to create:
[ConfigServer1WaitForNodeInstall,
PrimaryReplicaNode00WaitForNodeInstallGP2,
ConfigServer0WaitForNodeInstall,
SecondaryReplicaNode00WaitForNodeInstallGP2,
SecondaryReplicaNode01WaitForNodeInstallGP2,
ConfigServer2WaitForNodeInstall].
Summary
In both cases the failure is on PrimaryReplicaNodeXYWaitForNodeInstallGP2 (where XY is the number of the node), while all other parts of the installation complete successfully. I am totally in the dark.
Has anyone got around this? The quick start is from 2016, so there must be people who have successfully created this Mongo stack!?
After days and days of hard struggle and no solution, the manual and the template were updated (for the first time in over a year; it feels as if my prayers were heard):
https://docs.aws.amazon.com/quickstart/latest/mongodb/welcome.html
This also comes with a completely revised infrastructure and a more sophisticated setup form. The changes are described as:
Upgraded MongoDB to version 3.4; removed sharding configuration;
updated security groups and added database security; updated
parameters
Following the tutorial is quite similar to the former versions, so no struggle here.
Everything went fine and my stack completed. It now consists of:
mongoDB
mongoDB Replica
bastion stack
vpc stack
So this part is basically done. If something else comes up, I will open a new question for that.
I noticed this after tearing down a dev cluster and attempting to stand up a new one with the same name.
The torn-down cluster had orphaned a DynamoDB table with the same name as the one the new stack was trying to publish the worker nodes' status to. I deleted this DynamoDB table manually, attempted to spin up the stack with the same name a third time, and it succeeded.
Related
Firstly, yes I have read this https://www.liquibase.com/blog/using-liquibase-in-kubernetes
and I also read many SO threads where people are answering "I solved the issue by using init-container"
I understand that for most people this might have fixed the issue, because their pods were going down when the migration took too long and the k8s probes killed the pods.
But what about when a new deployment is applied while the previous deployment is stuck in a failed state (k8s trying again and again to launch the pods without success)?
When this new deployment is applied, it will simply wipe / replace all the failing pods. If this happens while Liquibase holds the lock, the pods (and their init containers) are killed and the DB is left in a locked state requiring manual intervention.
Unless I missed something about k8s init containers, using them doesn't really solve the issue described above, right?
Is that the only solution currently available? What other solution could be used to avoid manual intervention?
My first thought was to add some kind of custom code (either directly in the app before the Liquibase migration happens, or in an init container that runs before the Liquibase init container) to automatically unlock the DB if the lock is, let's say, 5 minutes old.
Would that be acceptable, or will it cause other issues I'm not thinking about?
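To make the idea concrete, the stale-lock check could look roughly like the sketch below. It is only an illustration, not something Liquibase provides: it assumes the default lock table DATABASECHANGELOGLOCK with the columns ID, LOCKED, LOCKGRANTED and LOCKEDBY, uses an in-memory SQLite database with text timestamps purely as a stand-in for the real database, and the names release_stale_lock, make_demo_db and STALE_AFTER are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical threshold, taken from the "5 minutes old" example above.
STALE_AFTER = timedelta(minutes=5)


def release_stale_lock(conn, now, stale_after=STALE_AFTER):
    """Clear the lock row only if it is held and was granted before the cutoff.

    Returns True if a stale lock was released, False if nothing was changed.
    """
    cutoff = (now - stale_after).isoformat(sep=" ")
    cur = conn.execute(
        "UPDATE DATABASECHANGELOGLOCK "
        "SET LOCKED = 0, LOCKGRANTED = NULL, LOCKEDBY = NULL "
        "WHERE ID = 1 AND LOCKED = 1 AND LOCKGRANTED < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount == 1


def make_demo_db(granted_at):
    """Build an in-memory stand-in for the lock table, with the lock held."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE DATABASECHANGELOGLOCK ("
        "ID INTEGER PRIMARY KEY, LOCKED INTEGER NOT NULL, "
        "LOCKGRANTED TEXT, LOCKEDBY TEXT)"
    )
    conn.execute(
        "INSERT INTO DATABASECHANGELOGLOCK VALUES (1, 1, ?, 'killed-pod')",
        (granted_at.isoformat(sep=" "),),
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0)
    stale = make_demo_db(now - timedelta(minutes=10))
    fresh = make_demo_db(now - timedelta(minutes=2))
    print(release_stale_lock(stale, now))  # lock is 10 minutes old
    print(release_stale_lock(fresh, now))  # lock is 2 minutes old
```

One caveat with any age-based rule like this: a migration that legitimately runs longer than the threshold would have its lock removed while it is still working, so the threshold would need to be well above the longest expected migration.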
I'm fairly new to MongoDB (Atlas, free tier), which I use to store the data for a project. It had been set up and working fine for a couple of weeks when I suddenly received an email saying: "An alert is open for your Atlas project: Replica set has no primary." I have no idea what this means, and I don't believe I have done anything in the last couple of days/weeks that could warrant this alert. However, after checking my project, it seems that I can no longer connect to my cluster and access my data.
After checking on MongoDB Cloud, it seems that my cluster has stopped working and only the secondary shard (don't know if this is the right terminology) is running, while the other two seem to be down. Can anyone explain what this means, why it is happening or how to fix it? Thanks.
To troubleshoot issues like this, read the server logs and act based on the information therein.
For free and shared tiers in Atlas the logs are apparently not available. Therefore:
For a free tier cluster (M0), delete this cluster and create a new one. If you don't have a backup you should be able to dump via a direct connection to any of the operational secondary nodes or using the secondary read preference.
For a shared tier cluster (M2/M5), use the official MongoDB support channels for assistance.
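For the dump from a secondary mentioned above, a minimal sketch of assembling the mongodump call might look like this. The URI, the output directory and the helper name build_mongodump_command are placeholders made up for illustration. The sketch only builds and prints the command, it does not run it.

```python
import shlex


def build_mongodump_command(uri, out_dir):
    """Build a mongodump argument list that reads from a secondary member."""
    return [
        "mongodump",
        f"--uri={uri}",
        "--readPreference=secondary",  # read from a secondary, since there is no primary
        f"--out={out_dir}",
    ]


if __name__ == "__main__":
    # Placeholder connection string; substitute your own cluster and credentials.
    uri = "mongodb+srv://USER:PASSWORD@cluster0.example.mongodb.net/mydb"
    cmd = build_mongodump_command(uri, "./dump")
    print(shlex.join(cmd))  # shell-safe command line to review before running
```

Printing the command first makes it easy to check the connection string before passing the list to something like subprocess.run.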
I came across a problem: I have an Ops Manager that is supposed to run a MongoDB cluster as an automated cluster.
Suddenly the servers started going down unexpectedly, while there are no errors in any of the log files indicating what the problem is.
The Ops Manager gets stuck on the blue label
We are deploying your changes. This might take a few minutes
And it just never goes away.
Because this environment is based on the automation feature, MMS manages the user on the servers and runs all of the processes under "mongod", which I can't access even as root (administrator).
As far as the Ops Manager goes, it shows that a shard in a replica set is down although it's live, and thinks that a mongos that is dead is alive.
Has anyone been in this situation before who may be able to help?
Thanks,
Eliran.
Problem found: there was somehow an NTP mismatch between the servers in the cluster, so the servers' clocks were not in sync. Every time the Ops Manager did something, it got responses with wrong times and could not apply its time limits.
After reconfiguring NTP on all servers to use the same source, everything went back to how it should have been :)
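A quick way to spot this kind of drift is to compare clock readings taken from each server at roughly the same moment. The sketch below is only an illustration: the host names, the one-second tolerance and the function find_skewed_hosts are made up, and collecting the readings (for example over SSH) is left out.

```python
from statistics import median


def find_skewed_hosts(clock_readings, tolerance_seconds=1.0):
    """Return {host: offset_seconds} for hosts whose clock deviates from the
    median reading by more than tolerance_seconds."""
    reference = median(clock_readings.values())
    return {
        host: reading - reference
        for host, reading in clock_readings.items()
        if abs(reading - reference) > tolerance_seconds
    }


if __name__ == "__main__":
    # Hypothetical Unix timestamps sampled from three servers at about the same time.
    readings = {
        "mongo-a": 1700000000.0,
        "mongo-b": 1700000000.2,
        "mongo-c": 1700000045.0,  # roughly 45 seconds ahead of the others
    }
    for host, offset in find_skewed_hosts(readings).items():
        print(f"{host} is off by {offset:+.1f}s")
```

Using the median as the reference means a single badly drifting server does not pull the baseline towards itself.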
FURTHER UPDATE: this error has not occurred since the November update.
EDIT: you may want to read this if your stateful service stops working for no apparent reason. A typical sign, using a WordCount-like app for example: the service deployment reports that one partition is remaining and gives up after 5 tries. The stateless service starts OK. The diagnostics report multiple "Constructed instance of type WordCountService" messages. If you have this, then you may have the same problem I have. No amount of uninstalling VS/SF/Azure SDKs helps. I now use a VM template with VS/Azure/SF installed and just delete and recreate it each time this error occurs (it is rare but has happened several times). I assume MSFT is aware and fixing it for beta.
ORIGINAL:
Summary question: Is there a way to reset Service Fabric completely?
Background: I have a stateful/stateless app service based on Wordcount example. All of a sudden, after deployment the app no longer replicates the stateful service (1 instance, 2 replicas). The stateless service is deployed ok (one instance, no replicas).
The partition status of the primary partition is reporting "Partition is below target replica or instance count". The replica status is "InBuild" for replicas, Primary is OK.
On the primary node, there is a warning "Replica had multiple failures during open. Error = -2147024894."
I have tried cleaning the cluster, uninstalling/reinstalling service fabric, deleting the SfDevCluster directory entirely etc.
If I copy the exact code to another computer with service fabric installed, it works (and I mean copy/paste the whole solution directory).
I had a similar problem last week, but it caused the host service not to start. I tried uninstall/reinstall/clean/remove SDKs, removing Visual Studio, etc. The only thing that fixed it was a reinstall of Windows.
We're working with an average sized (50GB) data set in MongoDB and are attempting to add a third node to our replica set (making it primary-secondary-secondary). Unfortunately, when we bring the nodes up (with the appropriate command line arguments associating them with our replica set), the nodes never exit the RECOVERING stage.
Looking at the logs, it seems as though the nodes ditch all of their data as soon as the recovery completes and start syncing again.
We're using version 2.0.3 on all of the nodes and have tried adding the third node from both a "clean" (empty db) state as well as a bootstrapped state (using mongodump to take a snapshot of the primary database and mongorestore'ing that snapshot into the new node), each failing.
We've observed this recurring phenomenon over the past 24 hours and any input/guidance would be appreciated!
It's hard to be certain without looking at the logs, but it sounds like you're hitting a known issue in MongoDB 2.0.3. Check out http://jira.mongodb.org/browse/SERVER-5177 . The problem is fixed in 2.0.4, which has an available release candidate.
I don't know if it helps, but when I got that problem, I erased the replica DB and initiated it. It started from scratch and replicated OK. Worth a try, I guess.