Index vs Aggregation Pipeline for Sorting - mongodb

I'm developing an application using MongoDB as its database, and for sorting data, I encountered an interesting argument from a colleague that index can be used instead of aggregation pipeline for getting sorted data.
I tried this and it actually works; using an index with specified field and direction does return sorted data when queried. When using aggregation pipeline, I also obtained the same result.
I have created an index with the following specification:
index name: batch_deleted_a_desc
num: asc
marked: asc
value: desc
Using aggregation pipeline:
> db.test.aggregate([{$match: {num:"3",marked:false}}, {$sort:{"value":-1}}])
{ "_id" : ObjectId("5d70b40ba7bebd3d7c135615"), "value" : 4, "marked" : false, "num" : "3" }
{ "_id" : ObjectId("5d70b414a7bebd3d7c135616"), "value" : 2, "marked" : false, "num" : "3" }
{ "_id" : ObjectId("5d70b3fea7bebd3d7c135614"), "value" : 1, "marked" : false, "num" : "3" }
Using index:
> db.test.find({num:"3",marked:false})
{ "_id" : ObjectId("5d70b40ba7bebd3d7c135615"), "value" : 4, "marked" : false, "num" : "3" }
{ "_id" : ObjectId("5d70b414a7bebd3d7c135616"), "value" : 2, "marked" : false, "num" : "3" }
{ "_id" : ObjectId("5d70b3fea7bebd3d7c135614"), "value" : 1, "marked" : false, "num" : "3" }
As you can see, the results are the same. But I am unsure that using index for getting sorted data is a good practice, and yet using aggregation pipeline is (sometimes) taking more effort than just creating index.
So, which would be the best option?

In the context of the question, the better option would be the aggregation because it explicitly specifies the sort.
In the query example, results are being returned in order specified by the index because the query is using the index { num: 1, marked: 1, value: 1}. However, nothing specified in the query will guarantee that ordering, meaning results may change at some point in the future. For example, consider the case where the index { num: 1, marked: 1, updated_at: 1 } were to be created. The query planner may decide to use this index instead, which may result in results in a different order.
In the absence of a sort, a query would return results in the order of the index being used, but you should not rely upon that ordering without explicitly specifying it. Quoting the docs:
Unless you specify the sort() method or use the $near operator,
MongoDB does not guarantee the order of query results.

Related

MongoDB $or + sort + index. How to avoid sorting in memory?

I have an issue to generate proper index for my mongo query, which would avoid SORT stage. I am not even sure if that is possible in my case. So here is my query with execution stats:
db.getCollection('test').find(
{
"$or" : [
{
"a" : { "$elemMatch" : { "_id" : { "$in" : [4577] } } },
"b" : { "$in" : [290] },
"c" : { "$in" : [35, 49, 57, 101, 161, 440] },
"d" : { "$lte" : 399 }
},
{
"e" : { "$elemMatch" : { "numbers" : { "$in" : ["1K0407151AC", "0K20N51150A"] } } },
"d" : { "$lte" : 399 }
}]
})
.sort({ "X" : 1, "d" : 1, "Y" : 1, "Z" : 1 }).explain("executionStats")
The fields 'm', 'a' and 'e' are arrays, that is why 'm' is not included in any index.
If you check the execution stats screenshot, you will see that memory usage is pretty close to maximum and unfortunately I had cases where the query failed to execute because of the 32MB limit.
Index for the first part of the $or query:
{
"a._id" : 1,
"X" : 1,
"d" : 1,
"Y" : 1,
"Z" : 1,
"b" : 1,
"c" : 1
}
Index for the second part of the $or query:
{
"e.numbers" : 1,
"X" : 1,
"d" : 1,
"Y" : 1,
"Z" : 1
}
The indexes are used by the query, but not for sorting. Instead of SORT stage I would like too see SORT_MERGE stage, but no success for now. If I run the part queries inside $or separately, they are able to use the index to avoid sorting in a memory. As a workaround it is ok, but I would need to merge and resort the results by the application.
MongoDB version is 3.4.2. I checked that and that question. My query is the result. Probably I missed something?
Edit: mongo documents look like that:
{
"_id" : "290_440_K760A03",
"Z" : "K760A03",
"c" : 440,
"Y" : "NPS",
"b" : 290,
"X" : "Schlussleuchte",
"e" : [
{
"..." : 184,
"numbers" : [
"0K20N51150A"
]
}
],
"a" : [
{
"_id" : 4577,
"..." : [
{
"..." : [
{
"..." : "R",
}
]
}
]
},
{
"_id" : 4578
}
],
"d" : 101,
"m" : [
"AT",
"BR",
"CH"
],
"moreFields":"..."
}
Edit 2: removed the filed "m" from query to decrease complexity and attached test collection dump for someone, who wants to help :)
Here is the solution-
I just added one document in my test collection as shown in your question (edit part). Then I created below four indices-
1. {"m":1,"b":1,"c":1,"X":1,"d":1,"Y":1,"Z":1}
2. {"a._id":1,"b":1,"c":1,"X":1,"d":1,"Y":1,"Z":1}
3. {"m":1,"X":1,"d":1,"Y":1,"Z":1}
4. {"e.numbers":1,"X":1,"d":1,"Y":1,"Z":1}
And when I executed given query for execution stats then it shows me the SORT_MERGE state as expected.
Here is the explanation-
MongoDB has a thing called equality-sort-range which tells a lot how we should create our indices. I just followed this rule and kept the index in that order. So Here the index should be {Equality fields, "X":1,"d":1,"Y":1,"Z":1, Range fields}. You can see that the query has range on field "d" only ("d" : { "$lte" : 101 }) but "d" is already covered in SORT fields of index ("X":1,"d":1,"Y":1,"Z":1) so we can skip range part (i.e. field "d") from the end of index.
If "d" had NOT been in sort/equality predicate then I would have taken it in index for range index field and my index would have looked like {Equality fields, "X":1,"Y":1,"Z":1,"d":1}.
Now my index is {Equality fields, "X":1,"d":1,"Y":1,"Z":1} and I am just concerned about equality fields. So to figure out equality fields I just checked the query find predicates and I found there are two conditions combined by OR operator.
The first condition has equality on "a._id", "b", "c", "m" ("d" has range, not equality). So I need to create an index like "a._id":1,"m":1,"b":1,"c":1,"X":1,"d":1,"Y":1,"Z":1 but this will give error because it has two array fields "a_id" and "m". And as we know Mongo doesn't allow compound index on parallel arrays so it will fail. So I created two separate index just to allow Mongo to use whatever is chosen by query planner. And hence I created first and second index.
The second condition of OR operator has "e.numbers" and "m". Both are arrays fields so I had to create two indices as done for first condition and that's how I got my third and fourth index.
Now we know that at a time a single query can use only and only one index so I need to create these indices because I don't know which branch of OR operator will be executed.
Note: If you are concerned about size of index then you can keep only one index from first two and one from last two. Or you can also keep all four and hint mongo to use proper index if you know it well before query planner.

Which index would be used if there are multiple indexes containing the same fields?

Take, for example, a find() that involves a field a and b, in that order. For example,
db.collection.find({'a':{'$lt':10},'b':{'$lt':5}})
I have two keys in my array of indexes for the collection:
[
{
"v" : 1,
"key" : {
"a" : 1,
"b" : 1
},
"ns" : "x.test",
"name" : "a_1_b_1"
},
{
"v" : 1,
"key" : {
"a" : 1,
"b" : 1,
"c" : 1
},
"ns" : "x.test",
"name" : "a_1_b_1_c_1"
}
]
Is it guaranteed that mongo will use the first key since it more accurately matches the query, or does it randomly choose any of the two keys because they will both work?
MongoDB has a query optimizer which selects the indexes that are most efficient. From the docs:
The MongoDB query optimizer processes queries and chooses the most
efficient query plan for a query given the available indexes.
So it's not strictly guaranteed (but I expect that the smaller index will yield results faster than the bigger compound index).
You can also use hint operator to force the query optimizer to use the specified index.
db.collection.find({'a':{'$lt':10},'b':{'$lt':5}}).hint({a:1, b:1});
However, those two indexes in your example are redundant. That's because the compound index supports queries on any prefix of index fields.
The following index:
db.collection.ensureIndex({a: 1, b: 1, c: 1});
Can support queries that include a, a and b and a and b and c, but not only b or c, or only b and c.
You and use $exist,, When is true, $exists matches the documents that contain the field, including documents where the field value is null. If is false, the query returns only the documents that do not contain the field.
$exist
the query will be
db.inventory.find( { "key.a": { $exists: true, 1 },"key.b": { $exists: true, 1 } } )

Subdocument index in mongo

What exactly happens when I call ensureIndex(data) when typical data looks like data:{name: "A",age:"B", job : "C"} ? Will it create a compound index over these three fields or will it create only one index applicable when anything from data is requested or something altogether different ?
You can do either :
> db.collection.ensureIndex({"data.name": 1,"data.age":1, "data.job" : 1})
> db.collection.ensureIndex({"data": 1})
This is discussed in the documentation under indexes-on-embedded-fields and indexes on sub documents
The important section of the sub document section is 'When performing equality matches on subdocuments, field order matters and the subdocuments must match exactly.'
This means that the 2 indexes are the same for simple queries .
However, as the sub-document example shows, you can get some interesting results (that you might not expect) if you just index the whole sub-document as opposed to a specific field and then do a comparison operator (like $gte) - if you index a specific sub field you get a less flexible, but potentially more useful index.
It really all depends on your use case.
Anyway, once you have created the index you can check what's created with :
> db.collection.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "test.collection",
"name" : "_id_"
},
{
"v" : 1,
"key" : {
"data.name" : 1,
"data.age" : 1,
"data.job" : 1
},
"ns" : "test.collection",
"name" : "data.name_1_data.age_1_data.job_1"
}
]
As you can see from the output it created a new key called data.name_1_data.age_1_data.job_1 (the _id_ index is always created).
If you want to test your new index then you can do :
> db.collection.insert({data:{name: "A",age:"B", job : "C"}})
> db.collection.insert({data:{name: "A1",age:"B", job : "C"}})
> db.collection.find({"data.name" : "A"}).explain()
{
"cursor" : "BtreeCursor data.name_1_data.age_1_data.job_1",
.... more stuff
The main thing is that you can see that your new index was used (BtreeCursor data.name_1_data.age_1_data.job_1 in the cursor field is what indicates this is the case). If you see "cursor" : "BasicCursor", then your index was not used.
For more detailed information look here.
you can try this :
db.collection.ensureIndex({"data.name": 1,"data.age":1, "data.job" : 1})

Handling optional/empty data in MongoDB

I remember reading somewhere that the mongo engine was more confortable when the entire structure of a document was already in place in case of an update, so here is the question.
When dealing with "empty" data, for example when inserting an empty string, should I default it to null, "" or not insert it at all ?
{
_id: ObjectId("5192b6072fda974610000005"),
description: ""
}
or
{
_id: ObjectId("5192b6072fda974610000005"),
description: null
}
or
{
_id: ObjectId("5192b6072fda974610000005")
}
You have to remember that the description field may or may not be filled in every document (based on user input).
Introduction
If a document doesn't have a value, the DB considers its value to be null. Suppose a database with the following documents:
{ "_id" : ObjectId("5192d23b1698aa96f0690d96"), "a" : 1, "desc" : "" }
{ "_id" : ObjectId("5192d23f1698aa96f0690d97"), "a" : 1, "desc" : null }
{ "_id" : ObjectId("5192d2441698aa96f0690d98"), "a" : 1 }
If you create a query to find documents with the field desc different than null, you will get just one document:
db.test.find({desc: {$ne: null}})
// Output:
{ "_id" : ObjectId("5192d23b1698aa96f0690d96"), "a" : 1, "desc" : "" }
The database doesn't differ documents without a desc field and documents with a desc field with the value null. One more test:
db.test.find({desc: null})
// Output:
{ "_id" : ObjectId("5192d2441698aa96f0690d98"), "a" : 1 }
{ "_id" : ObjectId("5192d23f1698aa96f0690d97"), "a" : 1, "desc" : null }
But the differences are only ignored in the queries, because, as shown in the last example above, the fields are still saved on disk and you'll receive documents with the same structure of the documents that were sent to the MongoDB.
Question
When dealing with "empty" data, for example when inserting an empty string, should I default it to null, "" or not insert it at all ?
There isn't much difference from {desc: null} to {}, because most of the operators will have the same result. You should only pay special attention to these two operators:
$exists
$type
I'd save documents without the desc field, because the operators will continue to work as expected and I'd save some space.
Padding factor
If you know the documents in your database grow frequently, then MongoDB might need to move the documents during the update, because there isn't enough space in the previous document place. To prevent moving documents around, MongoDB allocates extra space for each document.
The ammount of extra space allocated by MongoDB per document is controlled by the padding factor. You cannot (and don't need to) choose the padding factor, because MongoDB will adaptively learn it, but you can help MongoDB preallocating internal space for each document by filling the possible future fields with null values. The difference is very small (depending on your application) and might be even smaller after MongoDB learn the best padding factor.
Sparse indexes
This section isn't too important to your specific problem right now, but may help you when you face similar problems.
If you create an unique index on field desc, then you wouldn't be able to save more than one document with the same value and in the previous database, we had more than one document with same value on field desc. Let's try to create an unique index in the previous presented database and see what error we get:
db.test.ensureIndex({desc: 1}, {unique: true})
// Output:
{
"err" : "E11000 duplicate key error index: test.test.$desc_1 dup key: { : null }",
"code" : 11000,
"n" : 0,
"connectionId" : 3,
"ok" : 1
}
If we want to be able to create an unique index on some field and let some documents have this field empty, we should create a sparse index. Let's try to create the unique index again:
// No errors this time:
db.test.ensureIndex({desc: 1}, {unique: true, sparse: true})
So far, so good, but why am I explaining all this? Because there is a obscure behaviour about sparse indexes. In the following query, we expect to have ALL documents sorted by desc.
db.test.find().sort({desc: 1})
// Output:
{ "_id" : ObjectId("5192d23f1698aa96f0690d97"), "a" : 1, "desc" : null }
{ "_id" : ObjectId("5192d23b1698aa96f0690d96"), "a" : 1, "desc" : "" }
The result seems weird. What happened to the missing document? Let's try the query without sorting it:
{ "_id" : ObjectId("5192d23b1698aa96f0690d96"), "a" : 1, "desc" : "" }
{ "_id" : ObjectId("5192d23f1698aa96f0690d97"), "a" : 1, "desc" : null }
{ "_id" : ObjectId("5192d2441698aa96f0690d98"), "a" : 1 }
All documents were returned this time. What's happening? It's simple, but not so obvious. When we sort the result by desc, we use the sparse index created previously and there is no entries for the documents that haven't the desc field. The following query show us the use of the index to sort the result:
db.test.find().sort({desc: 1}).explain().cursor
// Output:
"BtreeCursor desc_1"
We can skip the index using a hint:
db.test.find().sort({desc: 1}).hint({$natural: 1})
// Output:
{ "_id" : ObjectId("5192d23f1698aa96f0690d97"), "a" : 1, "desc" : null }
{ "_id" : ObjectId("5192d2441698aa96f0690d98"), "a" : 1 }
{ "_id" : ObjectId("5192d23b1698aa96f0690d96"), "a" : 1, "desc" : "" }
Summary
Sparse unique indexes don't work if you include {desc: null}
Sparse unique indexes don't work if you include {desc: ""}
Sparse indexes might change the result of a query
There is little difference between the null value field and a document without the field. The main difference is that the former consumes a little disk space, while the latter does not consume at all. They can be distinguished by using $exists operator.
The field with an empty string is quite different from them. Though it depends on purpose I don't recommend to use it as a replacement for null. To be precise, they should be used to mean different things. For instance, think about voting. A person who cast a blank ballot is different from a person who wasn't permitted to vote. The former vote is an empty String, while the latter vote is null.
There is already a similar question here.

Using Spring Data Mongodb, is it possible to get the max value of a field without pulling and iterating over an entire collection?

Using mongoTemplate.find(), I specify a Query with which I can call .limit() or .sort():
.limit() returns a Query object
.sort() returns a Sort object
Given this, I can say Query().limit(int).sort(), but this does not perform the desired operation, it merely sorts a limited result set.
I cannot call Query().sort().limit(int) either since .sort() returns a Sort()
So using Spring Data, how do I perform the following as shown in the mongoDB shell? Maybe there's a way to pass a raw query that I haven't found yet?
I would be ok with extending the Paging interface if need be...just doesn't seem to help any. Thanks!
> j = { order: 1 }
{ "order" : 1 }
> k = { order: 2 }
{ "order" : 2 }
> l = { order: 3 }
{ "order" : 3 }
> db.test.save(j)
> db.test.save(k)
> db.test.save(l)
> db.test.find()
{ "_id" : ObjectId("4f74d35b6f54e1f1c5850f19"), "order" : 1 }
{ "_id" : ObjectId("4f74d3606f54e1f1c5850f1a"), "order" : 2 }
{ "_id" : ObjectId("4f74d3666f54e1f1c5850f1b"), "order" : 3 }
> db.test.find().sort({ order : -1 }).limit(1)
{ "_id" : ObjectId("4f74d3666f54e1f1c5850f1b"), "order" : 3 }
You can do this in sping-data-mongodb. Mongo will optimize sort/limit combinations IF the sort field is indexed (or the #Id field). This produces very fast O(logN) or better results. Otherwise it is still O(N) as opposed to O(N*logN) because it will use a top-k algorithm and avoid the global sort (mongodb sort doc). This is from Mkyong's example but I do the sort first and set the limit to one second.
Query query = new Query();
query.with(new Sort(Sort.Direction.DESC, "idField"));
query.limit(1);
MyObject maxObject = mongoTemplate.findOne(query, MyObject.class);
Normally, things that are done with aggregate SQL queries, can be approached in (at least) three ways in NoSQL stores:
with Map/Reduce. This is effectively going through all the records, but more optimized (works with multiple threads, and in clusters). Here's the map/reduce tutorial for MongoDB.
pre-calculate the max value on each insert, and store it separately. So, whenever you insert a record, you compare it to the previous max value, and if it's greater - update the max value in the db.
fetch everything in memory and do the calculation in the code. That's the most trivial solution. It would probably work well for small data sets.
Choosing one over the other depends on your usage of this max value. If it is performed rarely, for example for some corner reporting, you can go with the map/reduce. If it is used often, then store the current max.
As far as I am aware Mongo totally supports sort then limit: see http://www.mongodb.org/display/DOCS/Sorting+and+Natural+Order
Get the max/min via map reduce is going to be very slow and should be avoided at all costs.
I don't know anything about Spring Data, but I can recommend Morphia to help with queries. Otherwise a basic way with the Java driver would be:
DBCollection coll = db.getCollection("...");
DBCursor curr = coll.find(new BasicDBObject()).sort(new BasicDBObject("order", -1))
.limit(1);
if (cur.hasNext())
System.out.println(cur.next());
Use aggregation $max .
As $max is an accumulator operator available only in the $group stage, you need to do a trick.
In the group operator use any constant as _id .
Lets take the example given in Mongodb site only --
Consider a sales collection with the following documents:
{ "_id" : 1, "item" : "abc", "price" : 10, "quantity" : 2, "date" : ISODate("2014-01-01T08:00:00Z") }
{ "_id" : 2, "item" : "jkl", "price" : 20, "quantity" : 1, "date" : ISODate("2014-02-03T09:00:00Z") }
{ "_id" : 3, "item" : "xyz", "price" : 5, "quantity" : 5, "date" : ISODate("2014-02-03T09:05:00Z") }
{ "_id" : 4, "item" : "abc", "price" : 10, "quantity" : 10, "date" : ISODate("2014-02-15T08:00:00Z") }
{ "_id" : 5, "item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-02-15T09:05:00Z") }
If you want to find out the max price among all the items.
db.sales.aggregate(
[
{
$group:
{
_id: "1", //** This is the trick
maxPrice: { $max: "$price" }
}
}
]
)
Please note that the value of "_id" - it is "1". You can put any constant...
Since the first answer is correct but the code is obsolete, I'm replying with a similar solution that worked for me:
Query query = new Query();
query.with(Sort.by(Sort.Direction.DESC, "field"));
query.limit(1);
Entity maxEntity = mongoTemplate.findOne(query, Entity.class);