When/Where is the best-practices time/place to configure a MongoDB "schema"? - mongodb

In an app that uses MongoDB, when/where is the best place to make database changes that would be migrations in a relational database?
For example, how should creating indexes or setting shard keys be managed? Where should this code go?

It's probably best to do this in the shell, consciously, because you could cause havoc if you accidentally run such a command at the wrong moment on the wrong instance.
Most importantly: if you add an index on an existing DB, do it offline on a spare secondary (slave) instance! For large data sets, building an index can take hours, even days!
see also:
http://www.mongodb.org/display/DOCS/Indexes
http://www.javabeat.net/articles/print.php?article_id=353
http://www.mongodb.org/display/DOCS/Indexing+as+a+Background+Operation
http://nosql.mypopescu.com/post/1312926692/mongodb-indexes-and-indexing
If you have a large data set, make sure to read up on the Foursquare outage from last year:
http://www.infoq.com/news/2010/10/4square_mongodb_outage
http://blog.foursquare.com/2010/10/05/so-that-was-a-bummer/
http://highscalability.com/blog/2010/10/15/troubles-with-sharding-what-can-we-learn-from-the-foursquare.html
One of the main reasons for not wanting to put indexing in a script or config file is that in MongoDB the index operation is blocking(!) -- MongoDB will stop other operations on the database from proceeding until the indexing is completed. Just imagine an innocent change in the code requiring a new index to improve performance -- the change is carelessly checked in and deployed to production... and suddenly your production MongoDB is freezing up for your app server, because MongoDB is internally building the new index before doing anything else. Ouch! Apparently that has happened to a couple of folks; that's why they keep reminding people at the MongoDB conferences to be careful not to 'programmatically' require indexes.
Newer versions of MongoDB allow background indexing -- you should always use it, e.g. db.yourcollection.ensureIndex(..., {background: true})
otherwise, not-so-fun stuff happens:
https://jira.mongodb.org/browse/SERVER-1341
https://jira.mongodb.org/browse/SERVER-3067
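To make the foreground/background difference concrete, here's a mongo shell sketch; the collection and field names are illustrative, not from the question:

```javascript
// Foreground (default): blocks other operations on the database
// until the build finishes -- avoid this on a busy production instance.
db.logs.ensureIndex({ timestamp: 1 });

// Background: slower to build, but reads and writes keep flowing.
db.logs.ensureIndex({ timestamp: 1 }, { background: true });

// From a second shell, you can watch a running index build:
db.currentOp({ msg: /index/ });
```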

Related

mongotools collection restore Oplog warnings

I regularly reinstate prod data into testing environments on MongoDB Atlas. I delete the collection and perform a mongorestore --collection myCollection. I have good reasons not to replace the whole DB.
Is there a way to avoid hammering the oplog during such a copy, and hence avoid generating warnings about the oplog window size?
At first I thought of just disabling the warnings, since it doesn't really matter from a backup point of view. However, this may cause replica sync issues (which also matters little, due to the nightly nature of the job, but it still doesn't feel like I'm doing the right thing).
Thank you.
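For reference, a hedged sketch of the nightly refresh as described; the URI, database, collection names, and dump path are all placeholders:

```shell
# Nightly refresh sketch: drop the target collection, then restore just
# that namespace from a dump.
mongosh "mongodb+srv://user:pass@cluster.example.net/mydb" \
  --eval 'db.myCollection.drop()'
mongorestore --uri "mongodb+srv://user:pass@cluster.example.net" \
  --nsInclude "mydb.myCollection" dump/
```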

Drop a 5TB collection in Mongo without bringing down the db

In our Mongo configuration we have a replica set with a primary and 2 secondaries. We currently have a collection that is about 5TB in size that we want to drop completely. From reading the docs it sounds like just dropping the collection would lock the database. It seems like it might take a while to delete 5TB, and anything more than a few minutes of downtime really isn't an option.
I tried deleting records a little bit at a time via query and remove commands, but this still slowed the db down to a crawl.
I've thought about taking the primary out of the set, dropping the collection, and then putting it back in the set as primary, but what would the impact be of having those changes replicate to the secondaries? Would it still use a ton of CPU and lock things up?
The end goal is to move all of our mongo instances to smaller disks, so it would be nice if there was an option that allowed us to tackle both the migration and the deletion of the data at the same time.
Any advice is appreciated.
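A hedged sketch of the incremental-delete approach mentioned above, with an explicit pause so replication and other clients can keep up; the collection name, batch size, and sleep interval are illustrative tuning knobs:

```javascript
// Delete in small batches instead of one giant remove, sleeping
// between batches so secondaries and other readers can keep up.
var batchSize = 1000;
var ids;
do {
  ids = db.huge.find({}, { _id: 1 }).limit(batchSize)
          .toArray().map(function (d) { return d._id; });
  if (ids.length > 0) {
    db.huge.remove({ _id: { $in: ids } });
    sleep(500); // milliseconds; tune to your write load
  }
} while (ids.length > 0);
```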

Processing large mongo collection offsite

I have a system writing logs into MongoDB (about 1 million logs per day). On a weekly basis I need to calculate some statistics on those logs. Since the calculations are very processor- and memory-intensive, I want to copy the collection I'm working with to a powerful offsite machine. How do I keep the offsite collection up to date without copying everything? I modify the offsite collection by storing statistics within its elements, i.e. adding fields like {"algorithm_1": "passed"} or {"stat1": 3.1415}. Is replication right for my use case, or should I investigate other alternatives?
As to your question, yes, replication does partially resolve your issue, with limitations.
So there are several ways I know to resolve your issue:
The half-database, half-application way.
Replication keeps your data up to date. However, it doesn't allow you to modify the secondary nodes (which you call the "offsite collection"), so you have to do the calculation on the secondary and write the results to the primary. You need an application that runs the aggregation on the secondary and writes the result back to its primary.
This requires that you run an application, PHP, .NET, Python, whatever.
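A minimal mongo shell sketch of this split, assuming a replica-set connection; the collection names and the aggregation itself are illustrative:

```javascript
// Route reads to a secondary; writes still go to the primary.
db.getMongo().setReadPref("secondary");

// The heavy aggregation runs against the secondary.
var stats = db.logs.aggregate([
  { $group: { _id: "$level", count: { $sum: 1 } } }
]).toArray();

// Write the results back; writes are always applied on the primary
// and then replicate out to the secondaries.
stats.forEach(function (s) {
  db.log_stats.update({ _id: s._id },
                      { $set: { count: s.count } },
                      { upsert: true });
});
```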
The full-server way
Since you are going to have multiple servers anyway, you can consider using sharding for faster storage and do the calculation online directly. This way you don't even need to run a separate application: Map/Reduce does the calculation and writes the output into a new collection. I DON'T recommend this solution, though, because of the Map/Reduce performance issues in current versions.
The full-application way
Basically you still use replication for reading, but the server doesn't do any calculations except querying data. You can use a capped collection or a TTL index to remove expired data, and you enumerate the documents one by one in your application and do the calculations yourself.
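For the expiry part, a TTL index sketch, assuming each log document carries a createdAt date field (the field name and time window are illustrative):

```javascript
// Documents are removed automatically once createdAt is older than
// expireAfterSeconds; here, one week.
db.logs.ensureIndex({ createdAt: 1 }, { expireAfterSeconds: 7 * 24 * 3600 });
```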

MongoDB: Switch database/collection referenced by a given name on the fly

My application needs only read access to all of its databases. One of those databases (db_1) hosts a collection coll_1 whose entire contents* need to be replaced periodically**.
My goal is to have no or very little effect on read performance for servers currently connected to the database.
Approaches I could think of with so far:
1. renameCollection
Build a temporary collection coll_tmp, then use renameCollection with dropTarget: true to move its contents over to coll_1. The downside of this approach is that as far as I can tell, renameCollection does not copy indexes, so once the collection is renamed, coll_1 would need reindexing. While I don't have a good estimate of how long this would take, I would think that query-performance will be significantly affected until reindexing is complete.
2. TTL Index
Instead of straight up replacing, use a time-to-live index to expire documents after the chosen replacement period. Insert new data every time period. This seems like a decent solution to me, except that for our specific application, old data is better than no data. In this scenario, if the cron job to repopulate the database fails for whatever reason, we could potentially be left with an empty coll_1 which is undesirable. I think this might have a negligible effect, but this solution also requires on-the-fly indexing as every document is inserted.
3. Communicate current database to read-clients
Simply use two different databases (or collections?) and inform connected clients which one is more recent. This solution would allow for finishing indexing the new coll_1_alt (and then coll_1 again) before making it available. I personally dislike the solution since it couples the read clients very closely to the database itself, and of course communication channels are always imperfect.
4. copyDatabase
Use copyDatabase to rename (designate) an alternate database db_tmp to db_1. db_tmp would also have a collection coll_1. Once reindexing is complete on db_tmp.coll_1, copyDatabase could be used to simply rename db_tmp to db_1. It seems that this would require dropping db_1 before renaming, leaving a window in which the data won't be accessible.
Ideally (and naively), I'd just set db_1 to be something akin to a symlink, switching to the most current database as needed.
Does anyone have good suggestions on how to achieve the desired effect?
*There are about 10 million documents in coll_1.
** The current plan is to replace the collection once every 24 hours. The replacement interval might get as low as once every 30 minutes, but not lower.
The problem you point out in option 4 also applies to option 1: dropTarget likewise means that the collection is temporarily unavailable.
Another alternative would be to keep both the old and the new data in the same collection and use a "version ID" that you still have to communicate to your clients so they can query on it. That at least saves you from the reindexing you pointed out for option 1.
I think your best bet is actually option 3, and it's the most equivalent to changing a symlink, except it is on the client side.
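A sketch of that version-ID alternative; the meta document and field names are made up for illustration:

```javascript
// A small meta document tells clients which generation is current.
var current = db.meta.findOne({ _id: "coll_1_version" }).version;

// Clients scope every query to the current generation.
db.coll_1.find({ version: current /* , ...your criteria */ });

// Loader: insert the new generation, then flip the pointer,
// then clean up the old documents at leisure.
db.meta.update({ _id: "coll_1_version" },
               { $set: { version: current + 1 } });
db.coll_1.remove({ version: { $lte: current } });
```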

What is the overhead of ensureIndex({field:1}) when an index already exists?

I'd like to always ensure that my collections are indexed, and I'm adding and dropping them on a semi-regular basis.
Assuming that I make a new connection to the DB with every web request, would it be okay to execute a few db.collection.ensureIndex({field: 1}) statements every time I connect?
As I understand it, MongoDB will simply query the system.indexes collection to see whether the index exists before creating it ...
http://www.mongodb.org/display/DOCS/Indexes#Indexes-AdditionalNotesonIndexes
> db.system.indexes.find();
You can run getIndexes() to see a Collection's indexes
> db.things.getIndexes();
So really, you'd just be adding one query; it would not rebuild it or do anything else non-obvious.
That said, I don't think this would be a particularly good idea. It would add unneeded overhead, and worse, it might lock your database while the index is created ... since by default creating an index blocks your database (unless you run it in the background), like so:
> db.things.ensureIndex({x:1}, {background:true});
However note: "... background mode building uses an incremental approach to building the index which is slower than the default foreground mode: time to build the index will be greater."
I think it would be much better to do this in code when you add the collections, instead of every time you connect to the database. Why are you adding and dropping them anyhow?
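A sketch of that alternative: declare the indexes once in a deploy/seed script, where ensureIndex is a cheap no-op if the index already exists (the collection and field names are illustrative):

```javascript
// Run once at deploy time, not on every web request; a second call
// with the same spec just finds the existing index and returns.
db.things.ensureIndex({ x: 1 }, { background: true });

// Verify what's there -- should list the default _id index plus {x: 1}.
db.things.getIndexes();
```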