MongoDB initial deployment and indexes

MongoDB initial deployment and indexes - mongodb

I'm doing my first project with MongoDB and from what I've seen the implicit creation of collections is the recommended practice (ie. db.myCollection.insert() will create the collection the first time an insert is made)
I'm using PHP and using different collections this way, but the problem is that I don't know where I should create the indexes I'll need for that collection. As I wouldn't know when a collection is created, the naive approach would be calling ensureIndex() just before every operation on that collection (which doesn't sound very good). Or whenever a connection to the database is made, make sure the indexes exist (what happens if I create an index on a collection that wasn't created? Is that defined?)
Any best practice advice for this?

Not sure if it's the best practice, but I tend to not put the ensureIndex in the app. I usually put the ones I am sure I will need using the db shell. Then I keep an eye during load testing(or when things start to slow down in production) and add any I missed again in the shell. You can build indexes in the background by doing ensureIndex({a : 1}, {background : true}), so building them later isn't as terrible as some other dbs.
MongoDB has a good profiler to find what is going slow: http://www.mongodb.org/display/DOCS/Database+Profiler.
10gen(MongoDB's commercial counterpart) has a free monitoring service that is talked about a lot although I haven't used it yet: http://www.10gen.com/mongodb-monitoring-service.
But as far as what happens when you call db.collection.ensureIndex() before collection is created, it will create the collection and put the index on it.
If you definitely want it in the app, I would go with the second option you put forth(ensure indexes right after db connect) instead of before each operation. I would probably save something in the db when I did so they don't run each time if there is more than a couple. Don't know php but here is pseudo code:
var test = db.systemChecks.findOne({indexes : true})
if (test == null) //item doesn't exist
{
//do all the ensureIndex() commands
db.systemChecks.insert({indexes : true})
}
Just remember to delete the systemCheck item if you find you need more indexes later to run through the indexes

Actually, ensureIndex() is exactly what you will need to do. I would do it in each Model's constructor that uses a specific connection - if you have one of those models). ensureIndex() will make sure that an index is only created when it doesn't already exists. You can alternatively do it when you create your database connection. If you run ensureIndex() on a non-existing collection, it would just create that collection and makes empty indexes (as there are no documents yet).
In the future, the PHP driver will also cache whether ensureIndex() was already run in the same request, basically making it a no-op: https://jira.mongodb.org/browse/PHP-581

Related

When and why to autoIndex with MongoDB? [duplicate]

Per the Mongoose documentation for MongooseJS and MongoDB/Node.js :
When your application starts up, Mongoose automatically calls ensureIndex for each defined index in your schema. While nice for development, it is recommended this behavior be disabled in production since index creation can cause a significant performance impact. Disable the behavior by setting the autoIndex option of your schema to false.
This appears to instruct removal of auto-indexing from mongoose prior to deploying to optimize Mongoose from instructing Mongo to go and churn through all indexes on application startup, which seems to make sense.
What is the proper way to handle indexing in production code? Maybe an external script should generate indexes? Or maybe ensureIndex is unnecessary if a single application is the sole reader/writer to a collection because it will continue an index every time a DB write occurs?
Edit: To supplement, MongoDB provides good documentation for the how to do indexing, but not why or when explicit indexing directives should be done. It seems to me that indexes should be kept up to date by writer applications automatically on collections with existing indexes and that ensureIndex is really more of a one-time thing (done when a new index is being applied), in which case Mongoose's autoIndex should be a no-op under a normal server restart.

I've never understood why the Mongoose documentation so broadly recommends disabling autoIndex in production. Once the index has been added, subsequent ensureIndex calls will simply see that the index already exists and then return. So it only has an effect on performance when you're first creating the index, and at that time the collections are often empty so creating an index would be quick anyway.
My suggestion is to leave autoIndex enabled unless you have a specific situation where it's giving you trouble; like if you want to add a new index to an existing collection that has millions of docs and you want more control over when it's created.

Although I agree with the accepted answer, its worth noting that according to the MongoDB manual, this isn't the recommended way of adding indexes on a production server:
If your application includes ensureIndex() operations, and an index doesn’t exist for other operational concerns, building the index can have a severe impact on the performance of the database.
To avoid performance issues, make sure that your application checks for the indexes at start up using the getIndexes() method or the equivalent method for your driver and terminates if the proper indexes do not exist. Always build indexes in production instances using separate application code, during designated maintenance windows.
Of course, it really depends on how your application is structured and deployed. If you are deploying to Heroku, for example, and you aren't using Heroku's preboot feature, then it is likely your application is not serving requests at all during startup, and so it's probably safe to create an index at that time.
In addition to this, from the accepted answer:
So it only has an effect on performance when you're first creating the index, and at that time the collections are often empty so creating an index would be quick anyway.
If you've managed to get your data model and queries nailed on first time around, this is fine, and often the case. However, if you are adding new functionality to your app, with a new DB query on a property without an index, you'll often find yourself adding an index to a collection containing many existing documents.
This is the time when you need to be careful about adding indexes, and carefully consider the performance implications of doing so. For example, you could create the index in the background:
db.ensureIndex({ name: 1 }, { background: true });

use this block code to handle production mode:
const autoIndex = process.env.NODE_ENV !== 'production';
mongoose.connect('mongodb://localhost/collection', { autoIndex });

Copy data field from one mongo collection to another, on db server

I have two mongo collections. One we can call a template and second is instance. Every time new instance is created, rather large data field is copied from template to instance. Currently the field is retrieved from mongo db template collection in application and then sent back to db as a part of instance collection insert.
Would it be possible to somehow perform this copy on insert directly in mongo db, to avoid sending several megabytes over the network back and forth?
Kadira is reporting 3 seconds lag due to this. And documents are only going to get bigger.
I am using Meteor, but I gather that that should not influence the answer much.

I have done some searching and I can't really find an elegant solution for you. The two ways I can think of doing it are:
1.) Fork a process to run a mongo command to copy your template as your new instance via db.collection.copyTo().
http://eureka.ykyuen.info/2015/02/26/meteor-run-shell-command-at-server-side/
https://docs.mongodb.org/manual/reference/method/db.collection.copyTo/
Or
2.) Attempt to access the raw mongo collection rather than the minimongo collection meteor provides you with so you can use the db.collection.copyTo() functionality supplied by Mongo.
var rawCollection = Collection.rawCollection();
rawCollection.copyTo(newCollection);
Can meteor mongo driver handle $each and $position operators?
I haven't tried accessing the rawCollection to see if copyTo is available, and I also don't know if it will bring it into meteor before writing out the new collection. I'm just throwing this out here as an idea for you; hopefully someone else has a better one.

It's not possible to lock a mongodb document. What if I need to?

I know that I can't lock a single mongodb document, in fact there is no way to lock a collection either.
However, I've got this scenario, where I think I need some way to prevent more than one thread (or process, it's not important) from modifying a document. Here's my scenario.
I have a collection that contains object of type A. I have some code that retrieve a document of type A, add an element in an array that is a property of the document (a.arr.add(new Thing()) and then save back the document to mongodb. This code is parallel, multiple threads in my applications can do theses operations and for now there is no way to prevent to threads from doing theses operations in parallel on the same document. This is bad because one of the threads could overwrite the works of the other.
I do use the repository pattern to abstract the access to the mongodb collection, so I only have CRUDs operations at my disposition.
Now that I think about it, maybe it's a limitation of the repository pattern and not a limitation of mongodb that is causing me troubles. Anyway, how can I make this code "thread safe"? I guess there's a well known solution to this problem, but being new to mongodb and the repository pattern, I don't immediately sees it.
Thanks

Hey the only way of which I think now is to add an status parameter and use the operation findAndModify(), which enables you to atomically modify a document. It's a bit slower, but should do the trick.
So let's say you add an status attribut and when you retrieve the document change the status from "IDLE" to "PROCESSING". Then you update the document and save it back to the collection updating the status to "IDLE" again.
Code example:
var doc = db.runCommand({
"findAndModify" : "COLLECTION_NAME",
"query" : {"_id": "ID_DOCUMENT", "status" : "IDLE"},
"update" : {"$set" : {"status" : "RUNNING"} }
}).value
Change the COLLECTION_NAME and ID_DOCUMENT to a proper value. By default findAndModify() returns the old value, which means the status value will be still IDLE on the client side. So when you are done with updating just save/update everything again.
The only think you need be be aware is that you can only modify one document at a time.
Hope it helps.

Stumbled into this question while working on mongodb upgrades. Unlike at the time this question was asked, now mongodb supports document level locking out of the box.
From: http://docs.mongodb.org/manual/faq/concurrency/
"How granular are locks in MongoDB?
Changed in version 3.0.
Beginning with version 3.0, MongoDB ships with the WiredTiger storage engine, which uses optimistic concurrency control for most read and write operations. WiredTiger uses only intent locks at the global, database and collection levels. When the storage engine detects conflicts between two operations, one will incur a write conflict causing MongoDB to transparently retry that operation."

Classic solution when you want to make something thread-safe is to use locks (mutexes).
This is also called pessimistic locking as opposed to optimistic locking described here.
There are scenarios when pessimistic locking is more efficient (more details here). It is also far easier to implement (major difficulty of optimistic locking is recovery from collision).
MongoDB does not provide mechanism for a lock. But this can be easily implemented at application level (i.e. in your code):
Acquire lock
Read document
Modify document
Write document
Release lock
The granularity of the lock can be different: global, collection-specific, record/document-specific. The more specific the lock the less its performance penalty.

"Doctor, it hurts when I do this"
"Then don't do that!"
Basically, what you're describing sounds like you've got a serial dependency there -- MongoDB or whatever, your algorithm has a point at which the operation has to be serialized. That will be an inherent bottleneck, and if you absolutely must do it, you'll have to arrange some kind of semaphore to protect it.
So, the place to look is at your algorithm. Can you eliminate that? Could you, for example, handle it with some kind of conflict resolution, like "get record into local' update; store record" so that after the store the new record would be the one gotten on that key?

Answering my own question because I found a solution while doing research on the Internet.
I think what I need to do is use an Optimistic Concurency Control.
It consist in adding a timestamp, a hash or another unique identifier (I'll used UUIDs) to every documents. The unique identifier must be modified each time the document is modified. before updating the document I'll do something like this (in pseudo-code) :
var oldUUID = doc.uuid;
doc.uuid = new UUID();
BeginTransaction();
if (GetDocUUIDFromDatabase(doc.id) == oldUUID)
{
SaveToDatabase(doc);
Commit();
}
else
{
// Document was modified in the DB since we read it. We can't save our changes.
RollBack();
throw new ConcurencyException();
}

Update:
With MongoDB 3.2.2 using WiredTiger Storage implementation as default engine, MongoDB use default locking at document level.It was introduced in version 3.0 but made default in version 3.2.2. Therefore MongoDB now has document level locking.

As of 4.0, MongoDB supports Transactions for replica sets. Support for sharded clusters will come in MongoDB 4.2. Using transactions, DB updates will be aborted if a conflicting write occurs, solving your issue.
Transactions are much more costly in terms of performance so don't use Transactions as an excuse for poor NoSQL schema design!

An alternative is to do in place update
for ex:
http://www.mongodb.org/display/DOCS/Updating#comment-41821928
db.users.update( { level: "Sourcerer" }, { '$push' : { 'inventory' : 'magic wand'} }, false, true );
which will push 'magic wand' into all "Sourcerer" user's inventory array. Update to each document/user is atomic.

If you have a system with > 1 servers then you'll need a distributive lock.
I prefer to use Hazelcast.
While saving you can get Hazelcast lock by entity id, fetch and update data, then release a lock.
As an example:
https://github.com/azee/template-api/blob/master/template-rest/src/main/java/com/mycompany/template/scheduler/SchedulerJob.java
Just use lock.lock() instead of lock.tryLock()
Here you can see how to configure Hazelcast in your spring context:
https://github.com/azee/template-api/blob/master/template-rest/src/main/resources/webContext.xml

Instead of writing the question in another question, I try to answer this one: I wonder if this WiredTiger Storage will handle the problem I pointed out here:
Limit inserts in mongodb

If the order of the elements in the array is not important for you then the $push operator should be safe enough to prevent threads from overwriting each others changes.

I had a similar problem where I had multiple instances of the same application which would pull data from the database (the order did not matter; all documents had to be updated - efficiently), work on it and write back the results. However, without any locking in place, all instances obviously pulled the same document(s) instead of intelligently distributing their workforce.
I tried to solve it by implementing a lock on application level, which would add an locked-field in the corresponding document when it was currently being edited, so that no other instance of my application would pick the same document and waste time on it by performing the same operation as the other instance(s).
However, when running dozens or more instances of my application, the timespan between reading the document (using find()) and setting the locked-field to true (using update()) where to long and the instances still pulled the same documents from the database, making my idea of speeding up the work using multiple instances pointless.
Here are 3 suggestions that might solve your problem depending on your situation:
Use findAndModify() since the read and write operations are atomic using that function. Theoretically, a document requested by one instance of your application should then appear as locked for the other instances. And when the document is unlocked and visible for other instances again, it is also modified.
If however, you need to do other stuff in between the read find() and write update() operations, you could you use transactions.
Alternatively, if that does not solve your problem, a bit of a cheese solution (which might suffice) is making the application pull documents in large batches and making each instance pick a random document from that batch and work on it. Obvisously this shady solution is based on the fact that coincidence will not punish your application's efficieny.

Sounds like you want to use MongoDB's atomic operators: http://www.mongodb.org/display/DOCS/Atomic+Operations

Recommended way/place to create index on MongoDB collection for a web application

I'm using MongoDB for our web application. Assume there will be a 'find()' on MongoDB for incoming requests. What is the recommended way/place to add index on a MongoDB collection ?
Couple of options I can think of:-
1) 'ensureIndex' on the collection while initializing the application. [But how will I 'ensureindex' at the very first time application initialize ? since there won't be any data in place]
2 'ensureIndex' before every 'find' operation (on web request) ? but isn't this an overhead even if 'ensureIndex' wouldn't create index if it is already created ?
Any other options ?
Thanks in advance.

I would put it when you initialize the application. If the collection does not exist when you call ensureIndex, the index (and collection) will be created at that time.
I am assuming that you know a priori what kinds of queries you will be running on the data, and what kind of data you will be putting into the index, of course.

Clearly run the ensureIndex when you start the application. It doesn't matter if you call the ensureIndex on an empty collection (as it will add the data afterwards to the index).
Furthermore it depends on which property your queries are based. If you for example query based on the logged in user, then you should add the index on the user id.

You could create a initialisation script for the collection to be run from the command line of your server (the Mongo docs discuss how to write Mongo scripts and run them from the CLI). Then have it run whenever mongod starts? The script could just call ensureIndex() on the collection object.

Is it OK to call ensureIndex on non-existent collections?

I read somewhere that calling ensureIndex() actually creates a collection if it does not exist. But the index is always on some fields, not all of them, so if I ensure an index on say { name:1 } and then add documents to that collection that have many more fields, the index will work? I know we don't have a schema, coming from RDBMS world I just want to make sure. :) I'd like to create indexes when my website starts, but initially the database is empty. I do not need to have any data prior to ensuring indexes, is that correct?

ensureIndex will create the collection if it does not yet exist. It does not matter if you add documents that don't have the property that the index covers, you just can't use that index to find those documents. The way I understand it is that in versions before 1.7.4 a document that is missing a property for which there is an index will be indexed as though it had that property, but will a null value. In versions after 1.7.4 you can create sparse indexes that don't include these objects at all. The difference is slight but may be significant in some situations.
Depending on the circumstances it may not be a good idea to create indexes when the app starts. Consider the situation where you deploy a new version which adds new indexes when it starts up, in development you will not notice this as you only have a small database, but in production you may have a huge database and adding the index will take a lot of time. During the index creation your app will hang and you can't serve requests. You can create indexes with the background flag set to true (the syntax depends on which driver you're using), but in most cases it's better to add indexes manually, or as part of a setup script. That way you will have to think before you update indexes.

Deprecated since version 3.0: db.collection.ensureIndex() has been
replaced by db.collection.createIndex().
Ref: https://docs.mongodb.com/manual/reference/method/db.collection.ensureIndex/

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse