MongoDB: Keep information about sharded collections when restoring

I am using mongodump and mongorestore in a replicated sharded cluster in MongoDB 2.2 to take a backup and restore it.
First, I use mongodump to create a dump of the whole system. Then I drop a particular collection and restore it with mongorestore, using the output of mongodump. After that, the collection itself is correct (its data and its indexes are intact), but the information about whether the collection is sharded is lost: before the drop the collection was sharded, yet after the restore it is no longer sharded.
I was wondering whether there is a way to keep this information in backups. I thought the sharding information for a collection might be kept in the admin database, but in the dump the admin folder is empty, and show collections on that database returns nothing. Then I thought it could be kept in the metadata, but that would be strange, because I know the metadata stores the index definitions, and the indexes are restored correctly.
I would also like to know whether this information could be preserved by using filesystem snapshots instead of mongodump + mongorestore, or by still using mongodump and mongorestore but stopping the system or locking writes. I don't think the lack of locking is the cause, because I am not performing any write operations while restoring even without a lock, but I mention it just to give ideas.
Finally, I would like to know whether anyone can say for sure that this feature is simply not available in the current version.
Any ideas?

If you are using mongodump to back up your sharded collection, are you sure it really needs to be sharded? Sharded collections are usually very large, and mongodump would take too long to back them up.
What you can do to back up a large sharded collection is described here.
The key piece is to back up your config server as well as each shard - and to do it as close to "simultaneously" as possible after having stopped the balancer. The config DB is small, so you should probably back it up very frequently anyway. The best way to back up large shards is via file snapshots.
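As an illustration only, a minimal sketch of the balancer part, run from a mongo shell connected to a mongos (the actual backups of the config database and the shards, whether file snapshots or mongodump, are not shown):
sh.stopBalancer()      // stop chunk migrations so shard data and config metadata stay consistent
sh.getBalancerState()  // should now return false
// ... back up the config database and each shard here, as close together in time as possible ...
sh.startBalancer()     // re-enable balancing once the backups have finished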

Related

MongoDB Atlas backup/export data auto

I'm using MongoDB Atlas.
When I try to set up a backup it says "Turn on Backup (M10 and up)", which means that I cannot back up (Logical Size 928.8 KB).
Is there a different method to back up MongoDB Atlas?
I know that I can use Compass to export each collection, but it's tedious; I would like a simple method to back up my data daily, preferably automatically.
Is there a similar service I can use that offers backups for smaller DBs as well?

MongoDB Collections Unexpected Deletion/Drop

Our collections in MongoDB were deleted/dropped automatically and we are not sure why or how. Our MongoDB has been working fine for almost 10 months now, so we are really not sure what happened here.
Is there a collection expiration for MongoDB where it automatically delete the collections and its data? Also, would it be possible to retrieve the data?
Thank you in advance!
Collections do not 'drop' themselves.
Someone has run db.collection.drop() somewhere, intentionally, or accidentally.
You can set a TTL on the data inside a collection (see here), however I don't think that's what has happened here.
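For reference, a TTL is just an index option that expires individual documents (not whole collections); a minimal sketch, with hypothetical collection and field names:
db.events.ensureIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 })  // documents are removed roughly one hour after their createdAt time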
The only way of retrieving the data would be from a backup.
Restoring the backup to a secondary/temporary database, taking a copy of the collection in question, and then importing that copy back into your main database may be the best approach here.
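As a sketch only, the copy-back step could look like this in the mongo shell, assuming the backup was restored into a database named "restored_db" on the same server, the live database is "app_db", and the collection is "orders" (all names hypothetical):
var src = db.getSiblingDB("restored_db").orders;  // collection restored from the backup
var dst = db.getSiblingDB("app_db").orders;       // live database to import into
src.find().forEach(function (doc) { dst.insert(doc); });  // copy the documents across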
... You do have backups, right?

Intercept or filter out oplog transactions from MongoDB

There is a MongoDB which has interesting data I want to examine. Unfortunately, due to size concerns, once every 48 hours, the database is purged of "old" records.
I created a replica set with a secondary that has priority 0 and votes 0, so as not to interfere with the main database's performance. This works great, as I can query the secondary and get my data. However, on many occasions my system cannot process all the records in time, and I lose old records if I don't get to them within 48 hours.
Is there a way where I can cache the oplog on another system which I can then process at my leisure, possibly filtering out the deletes until I am ready?
I considered the slaveDelay parameter, but that would affect all transactions. I also looked into Tungsten Replicator as a solution, so I could essentially cache the oplogs; however, it does not support MongoDB as a source of the data.
Is the oplog stored in plain text on the secondary, such that I can read it and extract what I want from it?
Any pointers would be helpful; unfortunately I could not find much documentation on the oplog on the MongoDB website.
The MongoDB oplog is stored as a capped collection called 'oplog.rs' in your local database:
use local
db.oplog.rs.find()
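Since each oplog entry carries an op field ('i' for insert, 'u' for update, 'd' for delete) and an ns field with the namespace, you could also filter out the deletes while reading it back; a sketch, with a hypothetical namespace:
db.oplog.rs.find({ ns: "mydb.interesting", op: { $ne: "d" } })  // still in the local database; skip delete entries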
If you want to keep older data in the oplog for later use, you can try to increase the size of that collection. See http://docs.mongodb.org/manual/tutorial/change-oplog-size/
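Before resizing, it may help to check how much history the current oplog holds; db.printReplicationInfo() is a standard shell helper for that:
db.printReplicationInfo()  // prints the configured oplog size and the time window it currently covers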
Alternatively, you can recreate oplog.rs as an uncapped collection (though this is not recommended, since you will have to manually clean up the oplog). Follow the same steps as for increasing the size above, but when recreating the oplog, use this command:
db.runCommand( { create: "oplog.rs", capped: false})
Another solution is to create a cron job that runs the following command to dump the oplog into a folder named YYYYMMDD:
mongodump --db local --collection oplog.rs -o $(date +%Y%m%d)
Hope that helps.
I wonder why you would do that manually. The "canonical" way is to identify either the lifetime or the expiration date of a record. If it is a lifetime, you'd do something like
db.collection.insert({'foo':'bar' [...], created: ISODate("2014-10-06T09:00:05Z")})
and
db.collection.ensureIndex({'created':1},{expireAfterSeconds:172800})
By doing so, a thread called TTLMonitor will wake up every minute and remove all documents whose created field is older than two days.
If you have a fixed expiration date for each document, you'd basically do the same:
db.collection.insert({'foo':'bar' [...], expirationDate: ISODate("2100-01-01T00:00:00Z")})
and
db.collection.ensureIndex({expirationDate:1},{expireAfterSeconds:0})
This will purge the documents in the first run of TTLMonitor after the expirationDate.
You could adjust expireAfterSeconds to a value which safely allows you to process the records before they are purged, keeping the overall size within acceptable limits and making sure that the records are removed even if your application goes down during the purging work. (Not to mention that you don't need to maintain the purging logic yourself.)
That being said and in the hope it might be useful to you, I think your problem is a conceptual one.
You have a scaling problem. Your system is unable to deal with peaks, hence it is occasionally unable to process all the data in time. Instead of fiddling with the internals of MongoDB (which might be quite dangerous, as @chianh correctly pointed out), you should rather scale accordingly, by identifying your bottleneck and scaling it to match your peaks.

session table in mongo db

I'm working on the schema design of a scalable session table (for a customized authentication scheme) in MongoDB. I know that MongoDB's scalability follows from the design and has its own requirements. My use case is simple:
when a user logs in, a random token is generated and granted to the user, and a record is inserted into the session table using the token as the primary key, which is shard-able; the old token record is deleted if it exists.
the user then accesses the service using the token
My question is: if the system keeps deleting expired session keys, the session collection (considering the sharded situation, where I need to partition on the token field) could grow very large and contain a lot of 'gaps' left by expired sessions. How can this be handled gracefully (or is there a better design)?
Thanks in advance.
Edit: My question is about the storage level. How does MongoDB manage disk space when records are frequently removed and inserted? There should be some kind of (auto-)shrink mechanism there, hopefully one that doesn't block reads on the collection.
TTL is good and all, however repair is not. --repair is not designed to be run regularly on a database; in fact maybe once every 3 months or so. It does a lot of internal work that, if run often, will seriously damage your server's performance.
Now, about reuse of disk space in such an environment: when you delete a record it will free that "block". If another document fits into that "block" it will reuse that space, otherwise a new extent is created, meaning a new "block", a.k.a. more space.
So if you want to save disk space here, you will need to make sure that documents do not outgrow one another; fortunately you have a relatively static schema here, of maybe:
{
_id: {},
token: {},
user_id: {},
device: {},
user_agent: ""
}
which should mean that documents, hopefully, will reuse their space.
Now you come to a tricky part if they do not. MongoDB will not automatically give back free space per collection (it does per database, since dropping a database deletes its files), so you have to run --repair on the database or the compact command on the collection to actually get your space back.
That being said, I believe your documents will be of similar size to each other, so I am unsure whether you will see a problem here, but you could also try usePowerOf2Sizes: http://www.mongodb.org/display/DOCS/Padding+Factor#PaddingFactor-usePowerOf2Sizes For a collection that will frequently have inserts and deletes, it should help performance on that front.
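For illustration, sketches of both options, assuming the collection is named "sessions":
db.runCommand({ compact: "sessions" })                           // rewrites and defragments the collection; blocks operations on the database while it runs
db.runCommand({ collMod: "sessions", usePowerOf2Sizes: true })   // allocate record space in powers of two so freed slots are reused more easily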
I agree with @Steven Farley. While creating an index you can set a TTL; in Python, with the PyMongo driver, it can be done as described here:
http://api.mongodb.org/python/1.3/api/pymongo/collection.html#pymongo.collection.Collection.create_index
I would have to suggest you use TTL. You can read more about it at http://docs.mongodb.org/manual/tutorial/expire-data/ - it would be a perfect fit for what you're doing. This is only available since version 2.2.
How mongo stores data: http://www.mongodb.org/display/DOCS/Excessive+Disk+Space
Way to clean up removed records:
Command Line: mongod --repair
See: http://docs.mongodb.org/manual/reference/mongod/#cmdoption-mongod--repair
Mongo Shell: db.repairDatabase()
See: http://docs.mongodb.org/manual/reference/method/db.repairDatabase/
So you could have an automated clean-up script that executes the repair; keep in mind this will block MongoDB for a while.
There are a few ways to achieve sessions:
Capped collections, as shown in this use case.
Expiring data with a TTL index, by adding expireAfterSeconds to ensureIndex.
Cleaning sessions on the application side, using a TTL field and remove().
Faced with the same problem, I used option 3 for the flexibility it provides.
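As a minimal sketch of option 3, assuming each session document stores an expiresAt date (collection and field names are hypothetical), something like the following would be run periodically from a cron job or an application scheduler:
db.sessions.remove({ expiresAt: { $lt: new Date() } })  // delete every session whose expiry time has passed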
You can find a good overview of remove and disk optimization in this answer.

creating a different database for each collection in MongoDB 2.2

MongoDB 2.2 has a write lock per database, as opposed to the global per-server write lock of previous versions. So would it be OK if I store each collection in a separate database to effectively get a write lock per collection? (This would make it look like MyISAM's table-level locking.) Is this approach faulty?
There's a key limitation to the locking, and that is the local database. That database includes the oplog collection, which is used for replication.
If you're running in production, you should be running with Replica Sets. If you're running with Replica Sets, you need to be aware of the write lock effect on that database.
Breaking out your 10 collections into 10 DBs is useless if they all block waiting for the oplog.
Before taking a large step to re-write, please ensure that the oplog will not cause issues.
Also, be aware that MongoDB implements DB-level security. If you're using any security features, you are now creating more DBs to secure.
Yes, that will work; 10gen actually offers this as an option in their talks on locking.
I probably wouldn't isolate every collection, though. Most databases seem to have 2-5 high-activity collections. For the sake of simplicity it's probably better to keep the low-activity collections grouped in one DB and put the high-activity collections in their own databases.
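As an illustration of what that looks like from the shell, a sketch with hypothetical database and collection names:
var eventsDB = db.getSiblingDB("events_db");  // high-activity collection isolated in its own database
var mainDB = db.getSiblingDB("app_main");     // low-activity collections grouped together
eventsDB.clicks.insert({ page: "/home", ts: new Date() });  // writes here only take events_db's lock
mainDB.settings.find();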