I read somewhere that calling ensureIndex() actually creates a collection if it does not exist. But an index is always on some fields, not all of them, so if I ensure an index on, say, { name: 1 } and then add documents to that collection that have many more fields, will the index still work? I know we don't have a schema; coming from the RDBMS world I just want to make sure. :) I'd like to create indexes when my website starts, but initially the database is empty. I do not need to have any data prior to ensuring indexes, is that correct?
ensureIndex will create the collection if it does not yet exist. It does not matter if you add documents that don't have the property that the index covers; you just can't use that index to find those documents. The way I understand it is that in versions before 1.7.4, a document that is missing a property for which there is an index will be indexed as though it had that property, but with a null value. In versions from 1.7.4 onwards you can create sparse indexes that don't include these documents at all. The difference is slight but may be significant in some situations.
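For illustration, a minimal shell sketch of the sparse option described above (the collection and field names are hypothetical):

// A sparse index (available from 1.7.4): documents that lack the
// "name" field are excluded from the index entirely.
db.people.ensureIndex({ name: 1 }, { sparse: true });

// This document is not indexed under name_1; a regular (non-sparse)
// index would have indexed it with a null key for "name".
db.people.insert({ age: 30 });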
Depending on the circumstances it may not be a good idea to create indexes when the app starts. Consider the situation where you deploy a new version which adds new indexes when it starts up. In development you will not notice this, as you only have a small database, but in production you may have a huge database, and adding the index will take a lot of time. During the index creation your app will hang and can't serve requests. You can create indexes with the background flag set to true (the syntax depends on which driver you're using), but in most cases it's better to add indexes manually, or as part of a setup script. That way you will have to think before you update indexes.
Deprecated since version 3.0: db.collection.ensureIndex() has been replaced by db.collection.createIndex().
Ref: https://docs.mongodb.com/manual/reference/method/db.collection.ensureIndex/
Related
I have a database for development from which I only need to dump a subset of the fields, but all documents. So I created a view on the collection I need and mongodumped the view. Unfortunately, the underlying collection had indexes defined which were not rebuilt when I mongorestored the collections from the dump, because the index definitions were not dumped along with the data, apparently because they are defined for the collection, not for the view.
Is there a way to have the index definitions of the underlying collection dumped along with the data from the view?
Of course I can manually tell MongoDB to rebuild the indexes on the restored target collections, but that seems error-prone.
The fact that some indexes are on fields that are not part of the view may be a problem or even a blocker.
I believe the direct answer to your question is: No, mongodump will not pull index definitions from the source collection(s) associated with the view. Some degree of manual intervention or a change of approach is going to be needed here.
The approach you take depends on your specific constraints and goals. A few general things come to mind for consideration:
If the data isn't actually moving clusters, then perhaps $merge by itself would be sufficient in moving the subset of fields to a different collection. The rest of this answer assumes that this is not the case and that you do intend to actually move the data to a different cluster.
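For reference, a sketch of that $merge approach (requires MongoDB 4.2+; the collection and field names here are hypothetical):

// Copies only the listed fields from "events" into "events_slim"
// on the same cluster; _id is kept by $project's default behavior.
db.events.aggregate([
    { $project: { name: 1, createdAt: 1 } },
    { $merge: { into: "events_slim" } }
]);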
$merge may still be of interest even if you are moving the data, since you could use it on the source cluster (combined with a script to copy indexes) and then run mongodump on that new collection instead. It's an extra data copy, but it allows a script to programmatically recreate the indexes directly, which should help prevent human error.
If you did continue with the current approach mentioned in the question, you could use a similar script to grab the index definitions (and then have them recreated).
Another thing you could do is run a second mongodump against the source collection with a --query that doesn't match any documents (e.g. { _id: 'missing' }). The outcome would be a dump that doesn't contain any data, only index definitions. Those index definitions are just JSON text, so you could update the namespace and then combine it with the data dumped from the view to be restored together.
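A sketch of that second dump, with hypothetical database and collection names:

# The query matches no documents, so the dump contains only the
# collection's metadata, including its index definitions.
mongodump --db=mydb --collection=events --query='{ "_id": "missing" }'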
The specifics of the script to copy indexes mentioned in a couple of the alternatives depend a little bit on your situation. But it would basically leverage the db.collection.getIndexes() helper to gather a list of existing indexes and then iterate over them to generate the appropriate command(s) to create the new ones.
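A sketch of such a script in the shell (the collection names are hypothetical):

// Recreate a source collection's indexes on the restored target.
// The automatic _id index is skipped; all other options are copied.
db.events.getIndexes().forEach(function (idx) {
    if (idx.name === "_id_") return;
    var options = {};
    Object.keys(idx).forEach(function (key) {
        // "v", "key" and "ns" describe the old index itself,
        // not options to pass to createIndex
        if (key !== "v" && key !== "key" && key !== "ns") {
            options[key] = idx[key];
        }
    });
    db.events_slim.createIndex(idx.key, options);
});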
I also want to address these statements:
The fact that some indexes are on fields that are not part of the view may be a problem or even a blocker.
it might be a problem that some index definitions are for fields that are not included in the view.
From MongoDB's perspective, there is no issue with creating indexes on fields that do not exist. Since it has a flexible schema, new fields could be added at any point. The fact that indexes aren't dumped for views is really more related to the fact that the views are not materialized. Now if some of those indexes are not appropriate for the transformed data (which doesn't have all of the fields from the original data), then of course you should consider dropping (or not creating) those indexes.
Per the Mongoose documentation for MongooseJS (MongoDB/Node.js):
When your application starts up, Mongoose automatically calls ensureIndex for each defined index in your schema. While nice for development, it is recommended this behavior be disabled in production since index creation can cause a significant performance impact. Disable the behavior by setting the autoIndex option of your schema to false.
This appears to advise removing auto-indexing from Mongoose before deploying, so that Mongoose doesn't instruct Mongo to churn through all indexes on application startup, which seems to make sense.
What is the proper way to handle indexing in production code? Maybe an external script should generate indexes? Or maybe ensureIndex is unnecessary if a single application is the sole reader/writer to a collection, because the index will be maintained with every DB write anyway?
Edit: To supplement, MongoDB provides good documentation for how to do indexing, but not why or when explicit indexing directives should be used. It seems to me that indexes should be kept up to date automatically by writer applications on collections with existing indexes, and that ensureIndex is really more of a one-time thing (done when a new index is being applied), in which case Mongoose's autoIndex should be a no-op under a normal server restart.
I've never understood why the Mongoose documentation so broadly recommends disabling autoIndex in production. Once the index has been added, subsequent ensureIndex calls will simply see that the index already exists and then return. So it only has an effect on performance when you're first creating the index, and at that time the collections are often empty so creating an index would be quick anyway.
My suggestion is to leave autoIndex enabled unless you have a specific situation where it's giving you trouble; like if you want to add a new index to an existing collection that has millions of docs and you want more control over when it's created.
Although I agree with the accepted answer, it's worth noting that according to the MongoDB manual, this isn't the recommended way of adding indexes on a production server:
If your application includes ensureIndex() operations, and an index doesn’t exist for other operational concerns, building the index can have a severe impact on the performance of the database.
To avoid performance issues, make sure that your application checks for the indexes at start up using the getIndexes() method or the equivalent method for your driver and terminates if the proper indexes do not exist. Always build indexes in production instances using separate application code, during designated maintenance windows.
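A sketch of that check-and-terminate pattern in the shell (the collection and index names are hypothetical):

// Fail fast if a required index is missing instead of building it
// implicitly at startup.
var names = db.users.getIndexes().map(function (ix) { return ix.name; });
if (names.indexOf("email_1") === -1) {
    throw new Error("Missing required index email_1; run the index setup script");
}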
Of course, it really depends on how your application is structured and deployed. If you are deploying to Heroku, for example, and you aren't using Heroku's preboot feature, then it is likely your application is not serving requests at all during startup, and so it's probably safe to create an index at that time.
In addition to this, from the accepted answer:
So it only has an effect on performance when you're first creating the index, and at that time the collections are often empty so creating an index would be quick anyway.
If you've managed to get your data model and queries nailed down first time around, this is fine, and it is often the case. However, if you are adding new functionality to your app, with a new DB query on a property without an index, you'll often find yourself adding an index to a collection containing many existing documents.
This is the time when you need to be careful about adding indexes, and carefully consider the performance implications of doing so. For example, you could create the index in the background:
db.collection.ensureIndex({ name: 1 }, { background: true });
Use this block of code to handle auto-indexing in production mode:
const autoIndex = process.env.NODE_ENV !== 'production';
mongoose.connect('mongodb://localhost/collection', { autoIndex });
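The same option can also be set per schema, which is what the quoted Mongoose documentation describes (a sketch with a hypothetical schema):

const userSchema = new mongoose.Schema(
    { email: { type: String, index: true } },
    // build indexes automatically everywhere except production
    { autoIndex: process.env.NODE_ENV !== 'production' }
);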
Speaking in general, I want to know the best practices for querying (and therefore indexing) schemaless data structures (i.e. documents).
Let's say I use MongoDB to store and query deterministic data structures in a collection. At this point all documents have the same structure, therefore I can easily create indexes for any queries in my app, since I know each document has the required field(s) for the index.
What happens after I change the structure and try to save new documents to the db? Let's say I merged the two fields FirstName and LastName into FullName. As a result the collection contains nondeterministic data. I see two problems here:
Old indexes cannot cover the new data, so new indexes are needed that handle both the old and the new fields
The app has to take care of dealing with two representations of the documents
This may result in a big problem when there are many changes in the db resulting in many versions of document structures.
I see two main approaches:
Lazy migration. This means that each document is migrated on demand (i.e. only after being loaded from the collection) to the final structure and then stored back to the collection. This approach does not actually solve the problem, because it concedes nondeterminism at any point in time.
Forced migration. This is the same approach as for RDBMS migrations. The migration is performed for all documents at a single point in time while the app is not running. The main con is the downtime of the app.
So the question: Is there any good way of solving the problem, especially without app downtime?
If you can't have downtime then the only choice is to do the migrations "on the fly":
1. Change the application so that when new documents are saved the new field is created, but read from the old ones.
2. Update your collection with a script/queries to add the new field to existing documents (see the sketch after this list).
3. Create new indexes on that field.
4. Change the application so that it reads from the new fields.
5. Drop the unnecessary indexes and remove the old fields from the documents.
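A minimal sketch of step 2 for the FirstName/LastName → FullName example from the question (the collection name is hypothetical; old-style shell syntax is used for compatibility):

// Backfill FullName on documents that don't have it yet; new writes
// already set it (step 1), so this eventually converges.
db.people.find({ FullName: { $exists: false } }).forEach(function (doc) {
    db.people.update(
        { _id: doc._id },
        { $set: { FullName: doc.FirstName + " " + doc.LastName } }
    );
});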
Changing the schema on a live database is never an easy process, no matter what database you use. It always requires some forward thinking and careful planning.
Is indexing a pain?
Indexing is not a pain, but premature optimization is. You should always test and check that you actually need indexes before adding them and when you have them, check that they are being properly used.
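For example, explain() shows whether a query actually uses an index (collection and field names are hypothetical):

// An "IXSCAN" stage means the index is used; a "COLLSCAN" means a
// full collection scan, i.e. the index isn't helping this query.
db.people.find({ fullName: "Ada Lovelace" }).explain("executionStats");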
If you're worried about performance issues on a live system when creating indexes, then you should consider having replica sets and doing rolling maintenance (in short: taking secondaries down from replication, creating indexes on them, bringing them back into replication and then repeating the process for all the subsequent replica set members).
Edit
What I was describing is basically a process of migrating your schema to a new one while temporary supporting both versions of the documents.
In step 1, you're basically adding support for multiple versions of documents. You're updating existing documents, i.e. creating new fields, while reading data from the previous version's fields. Step 2 is optional, because you can gradually update your documents as they are being saved.
In step 4 you're removing the support for the previous versions from your application code and migrating to a new version. Finally, in step 5 you're removing the previous version fields from your actual MongoDB documents.
Consider a collection student that contains the following documents.
{name:"Nithin",age:23}
{name:"Nithin",age:25}
{name:"Nithin",age:28}
{name:"Nithin",age:12}
I want to update all the documents whose name is "Nithin" to have age=60.
If we execute the following query it will only update the first document.
db.student.update({name:"Nithin"},{age:60})
To update all the documents I have to use the query
db.student.update({name:"Nithin"},{age:60},false,true)
or
db.student.update({name:"Nithin"},{age:60},{multi:true})
Why does MongoDB by default not update all the documents when executing db.student.update({name:"Nithin"},{age:60})? What is the motivation for requiring a separate option to update all the documents? Does it improve performance?
Originally, in the early days of MongoDB (pre 1.1), it was not possible to update multiple documents. This was a feature added around 1.1.3.
You can see it in the release notes, New Feature 268.
I'm guessing this was not enabled by default for backwards compatibility with previous versions.
This may not really be the reason, but I find the additional multi parameter to be a safeguard to prevent accidental updates of multiple records when one intends to update a single document only, something like accidentally performing an UPDATE ... SET in SQL without specifying additional constraints.
Again this is just an assumption but may not really be the case.
I suppose part of the reason might be to avoid people coming from the SQL world to think about multi-document updates as isolated transactions.
In fact, during a long update MongoDB will periodically yield control to other queries which can potentially modify the same dataset.
So, by explicitly setting multi=true you are somewhat acknowledging this fact (well, not really, but I guess that's the spirit...)
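To tie the answers back to the question, a corrected sketch of the multi-document update (note the $set operator; as written in the question, {age:60} would replace the entire first matched document rather than just setting age):

// Updates every matching document; without { multi: true } the
// legacy update() only modifies the first match.
db.student.update(
    { name: "Nithin" },
    { $set: { age: 60 } },
    { multi: true }
);

// Newer shells and drivers also offer updateMany, which is always
// multi-document.
db.student.updateMany({ name: "Nithin" }, { $set: { age: 60 } });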
I'm doing my first project with MongoDB, and from what I've seen the implicit creation of collections is the recommended practice (i.e. db.myCollection.insert() will create the collection the first time an insert is made).
I'm using PHP and using different collections this way, but the problem is that I don't know where I should create the indexes I'll need for that collection. As I wouldn't know when a collection is created, the naive approach would be calling ensureIndex() just before every operation on that collection (which doesn't sound very good). Or whenever a connection to the database is made, make sure the indexes exist (what happens if I create an index on a collection that wasn't created? Is that defined?)
Any best practice advice for this?
Not sure if it's the best practice, but I tend not to put the ensureIndex calls in the app. I usually create the ones I am sure I will need using the db shell. Then I keep an eye on things during load testing (or when things start to slow down in production) and add any I missed, again in the shell. You can build indexes in the background by doing ensureIndex({a : 1}, {background : true}), so building them later isn't as terrible as it is with some other DBs.
MongoDB has a good profiler to find what is going slow: http://www.mongodb.org/display/DOCS/Database+Profiler.
10gen (MongoDB's commercial counterpart) has a free monitoring service that is talked about a lot, although I haven't used it yet: http://www.10gen.com/mongodb-monitoring-service.
But as far as what happens when you call db.collection.ensureIndex() before the collection is created: it will create the collection and put the index on it.
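A quick shell sketch of that behavior, reusing the myCollection name from the question:

// On an empty database this creates the collection and the index in
// one step; no insert is needed first.
db.myCollection.ensureIndex({ name: 1 });
db.getCollectionNames();       // now includes "myCollection"
db.myCollection.getIndexes();  // shows _id_ and name_1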
If you definitely want it in the app, I would go with the second option you put forth (ensure indexes right after db connect) instead of before each operation. I would probably save something in the db when I did, so the commands don't run each time if there are more than a couple. I don't know PHP, but here is pseudocode:
var test = db.systemChecks.findOne({indexes: true});
if (test == null) { // item doesn't exist
    // do all the ensureIndex() commands
    db.systemChecks.insert({indexes: true});
}
Just remember to delete the systemChecks item if you find you need more indexes later, so the ensureIndex() commands run again.
Actually, ensureIndex() is exactly what you will need to do. I would do it in each Model's constructor that uses a specific connection (if you have models like that). ensureIndex() will make sure that an index is only created when it doesn't already exist. You can alternatively do it when you create your database connection. If you run ensureIndex() on a non-existing collection, it will just create that collection and make empty indexes (as there are no documents yet).
In the future, the PHP driver will also cache whether ensureIndex() was already run in the same request, basically making it a no-op: https://jira.mongodb.org/browse/PHP-581