Compound indexing best practices in MongoDB

How costly is it to index some fields in MongoDB?
I have a collection where I want uniqueness across a combination of two fields. Everywhere I searched, the suggestion was a compound index with unique set to true. What I was doing instead was appending both fields into a single key, field1_field2, so that field2 is always unique for field1 (plus some application logic), because I thought indexing was costly.
Also, since the MongoDB documentation advises against custom object IDs such as auto-incrementing numbers, I ended up giving big numbers to models like Classes, Students, etc. (where I could easily have used 1, 2, 3 in SQLite). I didn't think to add a new field for numbering and index that field for querying.
What are the best practices for production?

The advantage of using compound indexes over your own indexed field system is that compound indexes allow quicker sorting than regular indexed fields. They also keep every document smaller, since no extra concatenated field has to be stored.
In your case, if you want to get the documents sorted with values in field1 ascending and in field2 descending, it is better to use a compound index. If you only want to get the documents that have some specific value contained in field1_field2, it does not really matter if you use compound indexes or a regular indexed field.
However, if you already have field1 and field2 in separate fields in the documents, and you also have a field containing field1_field2, it could be better to use a compound index on field1 and field2, and simply delete the field containing field1_field2. This could lower the size of every document and ultimately reduce the size of your database.
Regarding the cost of indexing, you almost have to index field1_field2 anyway if you want to go down that route. Queries on unindexed fields in MongoDB have to scan the whole collection, which is really slow on large collections. And adding a document to a database does not take much more time when the document has an indexed field (we're talking 1 millisecond or so). Note that adding an index on many existing documents can take a few minutes. This is why you usually plan the indexing strategy before adding any documents.
TL;DR:
If you have limited disk space or need to sort the results, go with a compound index and delete field1_field2. Otherwise, use field1_field2, but it has to be indexed!
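One more point in favour of the compound index is that a concatenated key can be ambiguous when the separator also appears inside the values. The following is a minimal sketch in plain JavaScript, not MongoDB code: `concatKey` and `compoundKey` are hypothetical helpers made up for illustration, and the sample values are assumptions. In the mongo shell, the compound unique index itself would be created with `db.collection.createIndex({ field1: 1, field2: 1 }, { unique: true })`.

```javascript
// Hypothetical helper: the concatenated key described in the question.
function concatKey(field1, field2) {
  return `${field1}_${field2}`;
}

// Hypothetical stand-in for a compound key: the two values stay separate
// components, as they do in a compound index on { field1: 1, field2: 1 }.
function compoundKey(field1, field2) {
  return JSON.stringify([field1, field2]);
}

// Two different (field1, field2) pairs, chosen so the separator "_" also
// appears inside the values.
const docA = { field1: "a_b", field2: "c" };
const docB = { field1: "a", field2: "b_c" };

const concatCollides =
  concatKey(docA.field1, docA.field2) === concatKey(docB.field1, docB.field2);
const compoundCollides =
  compoundKey(docA.field1, docA.field2) === compoundKey(docB.field1, docB.field2);

console.log(concatCollides);   // true: both pairs produce "a_b_c"
console.log(compoundCollides); // false: the pairs remain distinct
```

If the values can never contain the separator, the concatenated key does not have this problem, but the application then has to guarantee that on every write.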

Related

Is it possible to pre-sort a text index in MongoDB?

My understanding is that, in MongoDB, regular (not text) indexes are pre-sorted based on the parameters passed to createIndex(). For example, db.collection.createIndex({ name: 1 }) will create an index with documents sorted by name, in ascending order.
Is it possible to do this with text indexes? I have a large MongoDB collection (millions of documents) with a text index. When I perform a text search on the collection, I'd like to sort the results by created date... but the sort operation always runs out of memory. Can I set up the text index so that it's pre-sorted by created date (ie. no need to perform a sort operation after the results are retrieved)?
According to the text index docs it's not possible:
Sort operations cannot obtain sort order from a text index, even from
a compound text index; i.e. sort operations cannot use the ordering in
the text index.
Unfortunately it looks like sort on a text index is a real problem in MongoDB. There are multiple related issues on their tracker: SERVER-36087, SERVER-24375, SERVER-36794.

MongoDB Indexing a field which may not exist

I have a collection which has an optional field xy_id. About 10% of the documents (out of 500k) do not have this xy_id field.
I have quite a lot of queries to this collection like find({xy_id: <id>}).
I tried indexing it normally (.createIndex({xy_id: 1}, {"background": true})) and it does improve the query speed.
Is this the correct way to index the field in this case? or should I be using a sparse index or another way?
Yes, this is the correct way. MongoDB's default behaviour serves well in this case. You can see in the docs that index creation supports a unique flag, which is false by default. All your documents missing the index key will still be indexed, with null stored for the missing key. Queries can use this index in all cases because all the documents are indexed.
On the other hand, if you use a sparse index, the documents missing the index key will not be indexed at all. Some operations, such as count, sort and other queries, will not use the sparse index unless explicitly hinted to do so. If you do hint it, be prepared for incomplete results: documents that have no entry in the index will be omitted from the result. The MongoDB documentation on sparse indexes describes this behaviour.
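The difference can be sketched with a toy in-memory model. This is plain JavaScript, not how MongoDB stores indexes: `buildIndex` and `countEntries` are hypothetical helpers and the four sample documents are assumptions. In the mongo shell, the sparse variant would be created with `db.collection.createIndex({ xy_id: 1 }, { sparse: true })`.

```javascript
// Toy model: an "index" is a Map from key value to the _ids that have it.
function buildIndex(docs, field, { sparse = false } = {}) {
  const index = new Map();
  for (const doc of docs) {
    const hasField = Object.prototype.hasOwnProperty.call(doc, field);
    // Sparse index: documents without the field get no entry at all.
    if (!hasField && sparse) continue;
    // Regular index: a missing field is indexed as null.
    const key = hasField ? doc[field] : null;
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc._id);
  }
  return index;
}

// Total number of documents reachable through the index.
function countEntries(index) {
  let n = 0;
  for (const ids of index.values()) n += ids.length;
  return n;
}

const docs = [
  { _id: 1, xy_id: "A" },
  { _id: 2 },
  { _id: 3, xy_id: "B" },
  { _id: 4 },
];

const regularIndex = buildIndex(docs, "xy_id");
const sparseIndex = buildIndex(docs, "xy_id", { sparse: true });

console.log(countEntries(regularIndex)); // 4: every document is reachable
console.log(countEntries(sparseIndex));  // 2: documents 2 and 4 are missing
console.log(regularIndex.get(null));     // [ 2, 4 ]
```

A count or sort answered only from the sparse model would silently miss documents 2 and 4, which is why such an index is not used for those operations unless hinted.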

MongoDB Indexing: Multiple single-field vs single compound?

I have a collection of geospatial+temporal data with a few additional properties, which I'll be displaying on a map. The collection has a few million documents at this point, and will grow over time.
Each document has the following fields:
Location: [geojson object]
Date: [Date object]
ZoomLevel: [int32]
EntryType: [ObjectID]
I need to be able to rapidly query this collection by any combination of Location (generally a $geoWithin query), Date (generally $gte/$lt), ZoomLevel and EntryType.
What I'm wondering is: Should I make a compound index containing all four fields, or a single index for each field, or some combination thereof? I read in the MongoDB docs the following:
For a compound index that includes a 2dsphere index key along with
keys of other types, only the 2dsphere index field determines whether
the index references a document.
...Which sounds like it means having the 2dsphere index for Location be part of a compound index might be pointless?
Any clarity on this would be much appreciated.
For your use case you will need to use multiple indexes.
If you create one index covering all fields of your documents, your queries will only be able to use it when they include the first field in the index.
Since you need to query by any combination of these four fields, I suggest you analyze your data access patterns, see exactly which filters you are actually using, and create a specific index for each one or each group of them.
EDIT: For your question about 2dsphere, it does make sense to make them compound.
This note refers to the 'sparse' option. A sparse index references only documents that contain the indexed fields; for a 2dsphere index, the only documents that will be left out are the ones that do not contain the GeoJSON object or point array.
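The point about the first field can be made concrete with a simplified model of the prefix rule. This is a rough sketch in plain JavaScript: `usablePrefixLength` is a hypothetical helper, the field order of the index is an assumption, and the real query planner is more nuanced than this.

```javascript
// Returns how many leading fields of a compound index are covered by the
// fields a query filters on. Zero means the query does not include the
// first index field, so in this simplified model the index is of no use.
function usablePrefixLength(indexFields, queryFields) {
  const wanted = new Set(queryFields);
  let n = 0;
  for (const field of indexFields) {
    if (!wanted.has(field)) break; // the prefix ends at the first gap
    n += 1;
  }
  return n;
}

// Hypothetical compound index, in this field order.
const indexFields = ["EntryType", "ZoomLevel", "Date"];

const allThree = usablePrefixLength(indexFields, ["Date", "ZoomLevel", "EntryType"]);
const firstOnly = usablePrefixLength(indexFields, ["EntryType", "Date"]);
const noPrefix = usablePrefixLength(indexFields, ["ZoomLevel", "Date"]);

console.log(allThree);  // 3
console.log(firstOnly); // 1
console.log(noPrefix);  // 0
```

Running the actual filter combinations through a check like this is one way to decide which few indexes cover most of the access patterns.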

Mongodb id on bulk insert performance

I have a class/object that has a GUID, and I want to use that field as the _id when the object is saved to MongoDB. Is it possible to use another value instead of the ObjectId?
Are there any performance considerations when doing a bulk insert with an _id field? Is _id an index? If I set the _id to a different field, would it slow down the bulk insert? I'm inserting about 10 million records.
1) Yes you can use that field as the id. There is no mention of what API (if any) you are using for inserting the documents. So if you would do the insertion at the command line, the command would be:
db.collection.insert({_id : <BSONString_version_of_your_guid_value>, field1 : value1, ...});
It doesn't have to be BsonString. Change it to whatever Bson type most closely matches your guid's original type (except the array type: arrays aren't allowed as the value of the _id field).
2) As far as I know, there IS an effect on the performance of db.collection.insert when you provide your own ids, especially in bulk, BUT if the ids are sorted etc., there shouldn't be a performance loss. The reason, and I am quoting:
The structure of index is a B-tree. ObjectIds have an excellent
insertion order as far as the index tree is concerned: they are always
increasing, meaning they are always inserted at the right edge of
B-tree. This, in turn, means that MongoDB only has to keep the right
edge of the B-Tree in memory.
Conversely, a random value in the _id field means that _ids will be
inserted all over the tree. Then the machine must move a page of the
index into memory, update a tiny piece of it, then probably ignore it
until it slides out of memory again. This is less efficient.
:from the book `50 Tips and Tricks for MongoDB Developers`
The tip's title says - "Override _id when you have your own simple, unique id." Clearly it is better to use your id if you have one and you don't need the properties of an ObjectId. And it is best if your ids are increasing for the reason stated above.
3) There is a default index on _id field by MongoDB.
So...
1) Yes. It is possible to use other types than ObjectId, including GUID that will be saved as BinData.
2) Yes, there are considerations. It's better if your _id is always increasing (like a growing number, or ObjectId); otherwise inserts land all over the index and it has to rebalance more often. If you plan on using sharding, the _id should also hash evenly.
3) _id indeed has an index automatically.
4) It depends on the type you choose. See section 2.
Conclusion: It's better to keep using ObjectId unless you have a good reason not to.
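The insertion-order argument can be illustrated with a toy model. This is a rough sketch in plain JavaScript, with a sorted array standing in for the key order of the B-tree; it is not MongoDB's actual index implementation. `countNonTailInserts` is a hypothetical helper, and the multiplier 373 is an arbitrary assumption used only to get a deterministic scattered ordering.

```javascript
// Insert keys one by one into a sorted array and count how many inserts
// land somewhere other than the right edge.
function countNonTailInserts(keys) {
  const sorted = [];
  let nonTail = 0;
  for (const key of keys) {
    // Binary search for the insertion position.
    let lo = 0;
    let hi = sorted.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (sorted[mid] < key) lo = mid + 1;
      else hi = mid;
    }
    if (lo !== sorted.length) nonTail += 1; // not appended at the right edge
    sorted.splice(lo, 0, key);
  }
  return nonTail;
}

// Always-increasing keys, like ObjectIds or a growing counter.
const increasing = Array.from({ length: 1000 }, (_, i) => i);

// The same 1000 keys in a scattered order, like random GUIDs.
// 373 is coprime with 1000, so this is a permutation of 0..999.
const scattered = increasing.map((i) => (i * 373) % 1000);

const increasingNonTail = countNonTailInserts(increasing);
const scatteredNonTail = countNonTailInserts(scattered);

console.log(increasingNonTail); // 0: every insert goes to the right edge
console.log(scatteredNonTail);  // greater than 0: inserts hit the middle of the structure
```

In a real index, each of those middle inserts may touch a page that is not currently in memory, which is the cost the quoted passage describes.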

Using Mongo: should we create an index tailored to each type of high-volume query?

We have two types of high-volume queries. One looks for docs involving 5 attributes: a date (lte), a value stored in an array, a value stored in a second array, one integer (gte), and one float (gte).
The second includes these five attributes plus two more.
Should we create two compound indices, one for each query? Assume each attribute has a high cardinality.
If we do, because each query involves multiple arrays, it doesn't seem like we can create an index because of Mongo's restriction. How do people structure their Mongo databases in this case?
We're using MongoMapper.
Thanks!
For indexes, after the first range in the query, the value of the additional index fields drops significantly.
Conceptually, I find it best to think of the additional fields in the index as pruning ever smaller sub-trees from the query. The first range chops off a large branch, the second a smaller one, the third a smaller one still, etc. My general rule of thumb is that only the first range from the query is of value in the index.
The caveat to that rule is that additional fields in the index can be useful to aid sorting returned results.
For the first query I would create an index on the two array values and then whichever of the ranges will exclude the most documents. The date field is unlikely to provide high exclusion unless you can close the range (lte and gte). For the integer and float it is hard to tell without knowing the domain.
If the second query's two additional attributes also use ranges in the query and do not have a significantly higher exclusion value then I would just work with the one index.
Rob.