I want to sort a collection of documents on the backend with Meteor, in a publication. I want to make sure I index the collection properly, and I couldn't find any concrete guidance on how to do this for the fields I want.
Basically I want to find all the documents in collection "S" that have field "a" equal to a specific string value. Then after I have all those documents, I want to sort them by three fields: b, c (descending), and d.
Would this be the right way to index it:
S._ensureIndex({ a: 1 });
S._ensureIndex({ b: 1, c: -1, d: 1 });
S.find({ a: "specificString"}, { sort: { b: 1, c: -1, d: 1 } });
As a bonus question: field "d" is the time added in milliseconds, and it is highly unlikely that there is a duplicate document with all three fields being the same. Should I also include the unique option in the index?
S._ensureIndex({ b: 1, c: -1, d: 1 }, { unique: true });
Will adding the unique option help any with sorting and indexing performance?
I was also wondering what the general performance of an indexed sort is in Mongo. On their page it says indexed sorts don't need to be done in memory because the server is just doing a read. Does this mean the sort is instant?
Thanks for all your help :)
I was able to find my own answer here:
http://docs.mongodb.org/manual/tutorial/sort-results-with-indexes/#sort-and-non-prefix-subset-of-an-index
Basically the correct way to index for my example here is:
S._ensureIndex({ a: 1, b: 1, c: -1, d: 1 });
S.find({ a: "specificString"}, { sort: { b: 1, c: -1, d: 1 } });
-- As for my other sub-questions:
Because the query uses the index here, the sort comes essentially for free: MongoDB reads the documents back in index order instead of sorting them in memory. However, do keep in mind that a four-field index can be expensive in other ways (index size and write overhead).
Without looking this up for sure... my hunch is that including "unique: true" would not make a difference in sort performance, since the sort already just walks the index.
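One way to sanity-check this in the mongo shell (a sketch only; it assumes the underlying Mongo collection is also named "S"):
// Confirm the query walks the compound index and that the
// plan contains no in-memory sort stage.
db.S.find({ a: "specificString" })
    .sort({ b: 1, c: -1, d: 1 })
    .explain()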
If I have documents that look like:
{
    _id: 1,
    val1: x,
    val2: aa
},
{
    _id: 2,
    val1: y,
    val2: bb
},
{
    _id: 3,
    val1: x,
    val2: cc
},
{
    _id: 4,
    val1: z,
    val2: bb
}
Is it possible to group them in MongoDB so that docs 1 and 3 are paired and docs 2 and 4 are paired?
In essence, I'm looking to group docs if their val1 OR val2 are the same. Also, there will NOT be the possibility of a doc belonging to two different groups; in other words, I should be able to partition the set of docs. Is this possible in Mongo?
Ultimately, I want to partition my set of documents based on the aforementioned criteria and then count the size of each subset.
I've tried attacking this problem by grouping on the val1 field and using $addToSet to create an array of val2's, as sketched below. But then I'm stuck, because I don't know of a way in Mongo to merge arrays that contain at least one common element. If I did, I could use that list of arrays and aggregate again using $in.
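For reference, this is roughly that first step (a sketch only; the collection name "docs" is made up):
// Group on val1 and collect the distinct val2 values per group.
db.docs.aggregate([
    { $group: {
        _id: "$val1",
        val2s: { $addToSet: "$val2" },
        count: { $sum: 1 }
    } }
])
// Unsolved part: merging the resulting arrays that share an element.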
Please let me know if I can clarify my question in any way!
Let's say I have a million entries in the db with 10 fields ("columns") each. It seems to me that the more columns I search by, the faster the query goes - for example:
db.items.find({
    $and: [
        { field1: x },
        { field2: y },
        { field3: z }
    ]
})
is faster than:
db.items.find({
    $and: [
        { field1: x },
        { field2: y }
    ]
})
While I would love to say "Great, this makes total sense to me", it doesn't. I just know it's happening in my particular case, and I'm wondering whether it is actually always true. If so, ideally, I would like to know why.
Furthermore, when creating multi-field indices, does it help to have them in any sort of order. For example, let's say I add a compound index:
db.collection.ensureIndex( { field1: 1, field2: 1, field3: 1 } )
Do these have any sort of order? If yes, does the order matter? Let's say 90% of items will match the field1 criteria, but only 1% will match the field3 criteria. Would ordering them make some sort of difference?
It may simply be the case that the more restrictive query returns fewer documents, since 90% of items match the field1 criteria and only 1% match the field3 criteria. Check what explain says for both queries.
Mongo has quite a good profiler. Give it a try. Play with different indexes and different queries. Not on a production db, of course.
The order of fields in the index matters. If you have an index { field1: 1, field2: 1, field3: 1 } and a query db.items.find({ field2: x, field3: y }), the index won't be used at all; for the query db.items.find({ field1: x, field3: y }) it can only be used partially, for field1.
On the other hand, the order of conditions in the query does not matter:
db.items.find( { field1: x, field2: y }) is as good as
db.items.find( { field2: y, field1: x }) and will use the index in both cases.
When choosing an indexing strategy, you should examine both the data and your typical queries. It may be the case that index intersection works better for you, and that instead of a single compound index you get better total performance with simple indexes like { field1: 1 }, { field2: 1 }, { field3: 1 }, rather than multiple compound indexes for different kinds of queries.
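For example (a sketch; whether the planner actually intersects these depends on your data and server version):
// Three single-field indexes that the planner may combine
// via index intersection instead of one compound index.
db.items.ensureIndex({ field1: 1 })
db.items.ensureIndex({ field2: 1 })
db.items.ensureIndex({ field3: 1 })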
It is also important to check index size to fit it in memory. In most cases anyway.
It's complicated... MongoDB keeps recently accessed documents in RAM, and the query plan is calculated the first time a query is executed, so the second time you run a query may be much faster than the first time.
But, putting that aside, the order of a compound index does matter. In a compound index, you can use the index in the order it was created, a bit like opening a door, walking through and finding that you have more doors to open.
So, having two overlapping indexes set up, e.g.:
{ city: 1, building: 1, room: 1 }
AND
{ city: 1, building: 1 }
Would be a waste, because you can still search for all the rooms in a particular building using the first two levels (fields) of the "{ city: 1, building: 1, room: 1 }" index.
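To illustrate (a sketch; the collection name "places" and the values are made up):
// The three-field index alone serves any prefix query,
// so a separate { city: 1, building: 1 } index is redundant.
db.places.ensureIndex({ city: 1, building: 1, room: 1 })
db.places.find({ city: "Paris", building: "B7" })  // uses the index prefix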
Your intuition does make sense. If you had to find a particular room in a building, going straight to the right city, straight to the right building and then knowing the approximate place in the building will make it faster to find the room than if you didn't know the approximate place (assuming that there are lots of rooms). Take a look at levels in a B-Tree, e.g. the search visualisation here: http://visualgo.net/bst.html
It's not universally the case, though: not all data is neatly distributed in sort order. For example, English names and words tend to clump together under common letters; there aren't many words that start with the letter X.
The (free, online) MongoDB University developer courses cover indexes quite well, but the best way to find out about the performance of a query is to look at the results of the explain() method against your query to see if an index was used, or whether a collection was scanned (COLLSCAN).
db.items.find({
    $and: [
        { field1: x },
        { field2: y }
    ]
}).explain()
db.numbers.find().sort( { a : 1, b : 1, c : 1 })
If I execute this command, MongoDB will sort the numbers collection by property 'a'; if 'a' is the same on two docs, it will sort them by property 'b', and if that is the same too, it will go on to 'c'. I hope I got that right, correct me if not.
But how does it pick property 'a' as the first one when the argument is just a JS object? Does it iterate over the sort object's properties using for(var prop in ...), so that whichever property comes first is also the first to be sorted by?
Internally, MongoDB doesn't use JSON, it uses BSON. While JSON is technically unordered, BSON (per the specification) is ordered. This is how MongoDB knows that in { a: 1, b: 1, c: 1 } the keys are ordered "a, b, c": the underlying representation is ordered as well.
As @Sammaye posted above in the comments, the JavaScript dictionary must be created with key priority in mind.
Hence, if you do something like this:
db.numbers.find().sort({
a: 1,
b: 1,
c: 1
});
your results will be sorted first by a, then by b, then by c.
If you do this, however:
db.numbers.find().sort({
c: 1,
a: 1,
b: 1
});
your results will be sorted first by c, then by a, then by b.
By using those keys you mentioned:
db.numbers.find().sort({
a: 1,
b: 1,
c: 1
});
MongoDB sorts with property a, b, and then c.
Basically, MongoDB scans from the beginning of the datafile (correct me if I'm wrong). Hence, if MongoDB finds two or more documents with the same values for your sort keys (a, b, and c), it returns first whichever document appears first in the datafile.
If you have a two-field compound index, { a: 1, b: 1 }, it makes sense to me that the index won't be used if you query on b alone (i.e. you cannot "skip" a in your query). The index will, however, be used if you query on a alone.
However, given a three-field compound index { a: 1, b: 1, c: 1 }, my explain command shows that the index is used when you query on a and c (i.e. you can "skip" b in your query).
How can Mongo use an abc index on a query for ac, and how effective is the index in this case?
Background:
My use case is that sometimes I want to query on a,b,c and sometimes I want to query on a,c. Now should I create only 1 index on a,b,c or should I create one on a,c and one on a,b,c?
(It doesn't make sense to create an index on a,c,b because c is a multi-key field with good selectivity.)
bottom line / tl;dr: the index field b can be 'skipped' if a and c are queried for equality or inequality, but not, for instance, for sorts on c.
This is a very good question. Unfortunately, I couldn't find anything that authoritatively answers this in greater detail. I believe the performance of such queries has improved over the last years, so I wouldn't trust old material on the topic.
The whole thing is quite complicated because it depends on the selectivity on your indexes and whether you query for equality, inequality and/or sort, so explain() is your only friend, but here are some things I found:
Caveat: What comes now is a mixture of experimental results, reasoning and guessing. I might be stretching Kyle's analogy too far, and I might even be completely wrong (and unlucky, because my test results loosely match my reasoning).
It is clear that the index on A can be used, which, depending on the selectivity of A, is certainly very helpful. 'Skipping' B can be tricky, or not. Let's keep this similar to Kyle's cookbook example:
French
    Beef
        ...
    Chicken
        Coq au Vin
        Roasted Chicken
    Lamb
        ...
...
If you now ask me to find some French dish called "Chateaubriand", I can use index A and, because I don't know the ingredient, will have to scan all dishes in A. On the other hand, I do know that the list of dishes in each category is sorted through the index C, so I will only have to look for the strings starting with, say, "Cha" in each ingredient-list. If there are 50 ingredients, I will need 50 lookups instead of just one, but that is a lot better than having to scan every French dish!
In my experiments, the number was a lot smaller than the number of distinct values in b: it never seemed to exceed 2. However, I tested this only with a single collection, and it probably has to do with the selectivity of the b-index.
If you asked me to give you an alphabetically sorted list of all French dishes, though, I'd be in trouble. Now the index on C is worthless: I'd have to merge-sort all those per-ingredient lists, scanning every element to do so.
This reflects in my tests. Here are some simplified results. The original collection has datetimes, ints and strings, but I wanted to keep things simple, so it's now all ints.
Essentially, there are only two classes of queries: those where nscanned <= 2 * limit, and those that have to scan the entire collection (120k documents). The index is {a, b, c}:
// fast (range query on c while skipping b)
> db.Test.find({"a" : 43, "c" : { $lte : 45454 }});
// slow (sorting)
> db.Test.find({"a" : 43, "c" : { $lte : 45454 }}).sort({ "c" : -1});
> db.Test.find({"a" : 43, "c" : { $lte : 45454 }}).sort({ "b" : -1});
// fast (can sort on c if b included in the query)
> db.Test.find({"a" : 43, "b" : 7887, "c" : { $lte : 45454 }}).sort({ "c" : -1});
// fast (older tutorials claim this is slow)
> db.Test.find({"a" : {$gte : 43}, "c" : { $lte : 45454 }});
Your mileage will vary.
You can view querying on A and C as a special case of querying on A (in which case the index would be used). Using the index is more efficient than having to load the whole document.
Suppose you wanted to get all documents with A between 7 and 13, and C between 5 and 8.
If you had an index on A only: the database could use the index to select documents with A between 7 and 13 but, to make sure that C was between 5 and 8, it would have to retrieve the corresponding documents too.
If you had an index on A, B, and C: the database could use the index to select documents with A between 7 and 13. Since the values of C are already stored in the index entries, it could determine whether the corresponding documents also match the C criterion without having to retrieve those documents. Therefore, you would avoid disk reads, with better performance.
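A sketch of that scenario (collection and field names are illustrative):
// While scanning the A range, the C bounds can be checked against
// the index keys themselves - no document fetch is needed.
db.items.ensureIndex({ A: 1, B: 1, C: 1 })
db.items.find({ A: { $gte: 7, $lte: 13 }, C: { $gte: 5, $lte: 8 } }).explain()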
So I have a very large set of metrics (15GB and growing) that has some of the data in nested hashes. Like so:
{
    _id: 'abc0000',
    type: 'foo',
    data: { a: 20, b: 30, c: 3 }
},
... more data following this schema ...
{
    _id: 'abc5000',
    type: 'bar',
    data: { a: 1, b: 2, c: 4, d: 10 }
}
What are the performance implications when I run a query on the nested hashes? The data inside the hash can't be indexed... or rather, it would be pointless to index it.
I can always reform the data into a flat style data_a, data_b, etc...
You can create indexes on attributes in nested hashes; take a look at Indexing with dot notation for more details. You can also create compound indexes if you need them, but be careful about the caveats with parallel arrays: in a compound index, only one of the indexed fields can be an array. That shouldn't affect you, though, judging from the posted schema.
So you can create indexes on data.a, data.b or data.a, data.c as per your needs.
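For instance (a sketch; the collection name "metrics" is assumed from the description above):
// Dot-notation indexes on the nested fields from the sample docs.
db.metrics.ensureIndex({ "data.a": 1, "data.b": 1 })
db.metrics.ensureIndex({ "data.a": 1, "data.c": 1 })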