To be true, After typing the Question title only, i had a look about DB indexing in Wiki.
Now i know something about Indexing in general. But, still i have some questions on MongoDB indexing.
What is indexing in MongoDB? What it will exactly do, If i index a collection?
What i can do with indexing in MongoDB?
Will i able to use it for searching specific data?
Can anyone explain it with the below set of documents in a Collection in some MongoDB?
{ "_id":"das23j..", "x": "1", "y":[ {"RAM":"2 GB"}, {"Processor":"Intel i7"}, {"Graphics Card": "NVIDIA.."}]}
Thanks!!!
An index speeds up searching, at the expense of storage space. Think of the index as an additional copy of an attribute's (or column's) data, but in order. If you have an ordered collection you can perform something like a binary search, which is much faster than a sequential search (which you'd need if the data wasn't ordered). Once you find the data you need using the index, you can refer to the corresponding record.
The tradeoff is that you need the additional space to store the "ordered" copy of that column's data, and there's a slight speed tradeoff because new records have to be inserted in the correct order, a requisite for the quick search algorithms to work.
For details on mongodb indexing see http://www.mongodb.org/display/DOCS/Indexes.
Related
A non-distributed database has many posts, posts have zero or more user-defined tags, most posts have the most_posts_have_this tag, few posts have the few_posts_have_this tag.
When querying {'tags': {'$all': ['most_posts_have_this', 'few_posts_have_this']}} the query is slow, it seems to be iterating through posts with the most_posts_have_this tag.
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Short answer is no, this is due to how Mongo builds an index on an array:
To index a field that holds an array value, MongoDB creates an index key for each element in the array
So when you when you query the tags field imagine mongo queries each tag separately then it does an intersection.
If you run "explain" you will be able to see that after the index scan phase Mongo executes a fetch document phase, this phase in theory should be redundant for an pure index scan which shows this is not the case. So basically Mongo fetches ALL documents that have either of the tags, only then it performs the "$all" logic in the filtering phase.
So what can you do?
if you have prior knowledge on which tag is sparser you could first query that and only then filter based on the larger tag, I'm assuming this is not really the case but worth considering if possible. If your tags are somewhat static maybe you can precalculate this even.
Otherwise you will have to reconsider a restructuring that will allow better index usage for this usecase, I will say for most access patterns your structure is better.
The new structure can be an object like so:
tags2: {
tagname1: 1,
tagname2: 2,
...
}
Now if you built an index on tags2 each key of the object will be indexed separately, this will make mongo skip the "fetch" phase as the index contains all the information needed to execute the following query:
{"tags2.most_posts_have_this" :{$exists: true}, "tags2.few_posts_have_this": {$exists: true}}
I understand both solutions are underwhelming to say the least, but sadly Mongo does not excel in this specific use case.. I can think of more "hacky" approaches but I would say these 2 are the more reasonable ones to actually consider implementing depending on performance requirments.
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Not really. When Mongo runs an $all it is going to get all records with both tags first. You could try using two $in queries in an aggregation instead, selecting the less frequent tag first. I'm not sure if this would actually be faster (depends on how Mongo optimizes things) but could be worth a try.
The best you can do:
Make sure you have an an index on the tags field. I see in the comments you have done this.
Mongo may be using the wrong index for this query. You can see which it is using with cursor.explain(). You can force it to use your tags index with hint(). First use db.collection.getIndexes() to make sure your tags index shows up as expected in the list of indexes.
Using projections to return only the fields you need might speed things up. For example, depending on your use case, you might return just post IDs and then query full text for a smaller subset of the returned posts. This could speed things up because Mongo doesn't have to manage as much intermediate data.
You could also consider periodically sorting the tags array field by frequency. If the least frequent tags are first, Mongo may be able to skip further scanning for that document. It will still fetch all the matching documents, but if your tag lists are very large it could save time by skipping the later tags. See The ESR (Equality, Sort, Range) Rule for more details on optimizing your indexed fields.
If all that's still not fast enough and the performance of these queries is critical, you'll need to do something more drastic:
Upgrade your machine (ensure it has enough RAM to store your whole dataset, or at least your indexes, in memory)
Try sharding
Revisit your data model. The fastest possible result will be if you can turn this query into a covered query. This may or may not be possible on an array field.
See Mongo's optimizing query performance for more detail, but again, it is unlikely to help with this use case.
In the MongoDB docs it is stated that
Indexes are special data structures [1] that store a small portion of
the collection’s data set in an easy to traverse form.
How can I see these data structures? Is it possible?
I was going through this question and I saw that in this answer they gave an example of a schema for an index. Is there such a thing in MongoDB that is what I am trying to see. I am trying to understand indexes in MongoDB better.
When you create an index in Mongo (using createIndex) you specify which fields the index will use, or what you call the index "schema".
As mentioned in the docs these indexes are built as b-trees (don't read too much into this as indexes are a "black box" for us users), viewing the exact tree structure is not possible, but you can use indexStats to get some more information on an index you created.
From the tutorials out there I know that I can sort a MongoDB collection in meteor on request like this:
// Sorted by createdAt descending
Users.find({}, {sort: {createdAt: -1}})
But I feel like this solution is not optimal in the view of performance.
Because if I understand it right, every time there is a request for Users, the raw collection is requested and then sorted over and over again.
So wouldn't it be better to sort the whole collection once and for all and then access the already sorted collection with Users.find()?
The question is: How do I sort the whole collection permanently not just the found results?
This is a known limitation of MiniMongo, Meteor's client-side implementation of (a subset of) the MongoDB functionality.
"Sorting" a MongoDB collection does not really have a coherent meaning. It does not translate into a concrete set of operations. What would you sort it by? Is there a "natural" way to sort a set of documents which structure may vary?
The mechanism that is used for making data retrieval more efficient is an index. On the server, indices are used to assist sorting, if possible:
In MongoDB, sort operations can obtain the sort order by retrieving documents based on the ordering in an index. If the query planner cannot obtain the sort order from an index, it will sort the results in memory. Sort operations that use an index often have better performance than those that do not use an index. In addition, sort operations that do not use an index will abort when they use 32 megabytes of memory.
(Source: MongoDB documentation)
As a collection does not have an inherent order to it, the entity that holds information about the order requirements in MongoDB is a Cursor. A cursor can be fetched multiple times, and in theory could be made into an efficient ordered data fetcher.
Unfortunately, this is not the case at the moment. The way it is currently implemented, MiniMongo does not have indices and does not cache the documents by order. They are re-sorted every time the cursor is fetched.
The sorting is reasonably efficient (as much as sorting can be efficient, O(n*logn) sort function invocations), but for a large data set, it could be fairly lengthy and degrade the user experience.
At the moment, if you have a special use case that requires repeated access to a large data-set that is ordered the same way, you could try to keep a cache of ordered documents if you need to, by observing the cursor and updating the cache when there are changes.
The question is a very simple one, can you have more than one index in a collection. I suppose you can, but every time I search for multiple indexes I get explanations on compound indexes and that is not what I'm looking for.
All I want to do is make sure that I can have two simple separate indexes.
(I'm using PHP, I'll use php code formatting, but I understand
db.posts.ensureIndex({ my_id1: 1 }, {unique: true, background: true});
db.posts.ensureIndex({ my_id2: 1 }, {background: true});
I'll only search for one index at a time.
Compound indexes are not what I'm looking for because:
one index is unique and the other is not.
I think it's not going to be the fastest option. (open the link to understand the reason I think its going to be slower. link)
I just want to make sure that the indexes will work.
You sure can have indexes defined the way you have it. From MongoDB documentation:
How many indexes? Indexes make retrieval by a key, including ordered sequential retrieval, very fast. Updates by key are faster too as MongoDB can find the document to update very quickly. However, keep in mind that each index created adds a certain amount of overhead for inserts and deletes. In addition to writing data to the base collection, keys must then be added to the B-Tree indexes. Thus, indexes are best for collections where the number of reads is much greater than the number of writes. For collections which are write-intensive, indexes, in some cases, may be counterproductive. Most collections are read-intensive, so indexes are a good thing in most situations.
I also recommend you look at how Mongo will decide what index to use when it comes to running a query that goes by both fields.
Also take a look at their Indexing Advice and FAQ page. It will explain things like only one index per query, selectivity, etc.
p.s. This slideshare deck from 10gen suggests there's a limit of 40 indexes per collection.
For example, I have documents with only three fields: user, date, status. Since I select by user and sort by date, I have those two fields as an index. That is the proper thing to do. However, since each date only has one status, I am essentially indexing everything. Is it okay to not index all fields in a query? Where do you draw the line?
What makes this question more difficult is the complete opposite approach to indexes between read-heavy and write-heavy collections. If yours is somewhere in between, how do you determine the proper approach when it comes to indexes?
Is it okay to not index all fields in a query?
Yes, but you'll want to avoid this for frequently used queries. Anything not indexed will imply a "table scan". This means accessing each possible document individually, which will be slow.
Where do you draw the line?
Also note, that if you sort by an un-indexed field, MongoDB will "yell at you" if you're trying to sort too much data. So you have to have some awareness of how much data is "outside of" the index.
If yours is somewhere in between, how do you determine the proper approach when it comes to indexes?
Monitoring, instrumenting, experimenting and experience.
There is no hard and fast rule here, it's all going to be about trade-offs. CPU vs. RAM vs. Disk IO vs. Responsiveness, etc.
The perfect situation is to store everything in a single index. By everything I mean all fields you query on, you sort by and you retrieve. This will ensure that you'll get maximum performance (if index fits in ram)
This situation is not always possible, so you'll have to make choices.
Here are 3 tips to reduce at maximum the index size:
Does each of your query have a lot of results or only a few ? => A few : you do not have to index all the fields you retrieve (only the query and sort fields because few results mean few disk access).
Does your query results are often the same (i.e your working set is small) ? => don't index the field you retrieve because results are cached by mongodb.
Do you have a query field more selective than another ? => index the more selective field only.