Chaining time-based sort and limit issue - mongodb

Lately I've run into some behaviour that I find counter-intuitive while playing with sort/limit in mongo.
Suppose I have the following collection:
> db.fred.find()
{ "_id" : ObjectId("..."), "record" : 1, "time" : ISODate("2011-12-01T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 2, "time" : ISODate("2011-12-02T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 3, "time" : ISODate("2011-12-03T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 4, "time" : ISODate("2011-12-04T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 5, "time" : ISODate("2011-12-05T00:00:00Z") }
What I would like is to retrieve, in time order, the 2 records preceding "record": 4, plus record 4 itself (i.e. record 2, record 3 and record 4).
Naively, I was about to run something along these lines:
db.fred.find({time: {$lte: ISODate("2011-12-04T00:00:00Z")}}).sort({time: -1}).limit(2).sort({time: 1})
but it does not work the way I expected:
{ "_id" : ObjectId("..."), "record" : 1, "time" : ISODate("2011-12-01T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 2, "time" : ISODate("2011-12-02T00:00:00Z") }
I expected the result to be record 2, record 3 and record 4.
From what I can gather, it seems that both sorts are applied before the limit:
sort({time: -1}) => record 4, record 3, record 2, record 1
sort({time: -1}).limit(2) => record 4, record 3
sort({time: -1}).limit(2).sort({time: 1}) => record 1, record 2
i.e. it's as if the second sort were applied to the cursor returned by find (i.e. the whole set), and only then the limit is applied.
What is my mistake here and how can I achieve the expected behavior?
BTW: running mongo 2.0.1 on Ubuntu 11.01

The MongoDB shell lazily evaluates cursors, which is to say, the series of chained operations you've done results in one query being sent to the server, using the final state based on the chained operations. So when you say "sort({time: -1}).limit(2).sort({time: 1})" the second call to sort overrides the sort set by the first call.
To achieve your desired result, you're probably better off reversing the cursor output in your application code, especially if you're limiting to a small result set (here you're using 2). The exact code to do so depends on the language you're using, which you haven't specified.
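As a minimal sketch of that approach in the shell's own language (JavaScript): the helper name `lastNUpTo` is made up for illustration, and it assumes a cursor whose `toArray()` returns a plain array, as the shell's cursor does. It asks the server for the n newest matching documents and reverses them on the client. Note that the question wants three records back, so n is 3 rather than 2.

```javascript
// Hypothetical helper: fetch the n most recent documents at or before `cutoff`,
// then flip them into ascending time order on the client.
function lastNUpTo(coll, cutoff, n) {
  return coll
    .find({ time: { $lte: cutoff } })
    .sort({ time: -1 }) // newest first, so limit() keeps the n newest
    .limit(n)
    .toArray()
    .reverse(); // oldest first for display
}

// In the shell, assuming the collection from the question:
// lastNUpTo(db.fred, ISODate("2011-12-04T00:00:00Z"), 3)
```

With the sample collection this should yield records 2, 3 and 4, in that order.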

Applying sort() to the same query multiple times makes no sense here. The effective sorting will be taken from the last sort() call. So
sort({time: -1}).limit(2).sort({time: 1})
is the same as
sort({time: 1}).limit(2)
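To make the "last sort() wins" behaviour concrete, here is a toy model of a lazily evaluated cursor. It is not the shell's actual implementation: `ToyCursor` and its fields are invented for illustration, and it only handles a single numeric sort field. Each chained call merely records an option, and nothing is evaluated until `toArray()`.

```javascript
// Toy model only: the real shell cursor is more involved.
class ToyCursor {
  constructor(docs) {
    this.docs = docs;
    this.sortSpec = null;
    this.limitN = Infinity;
  }
  sort(spec) { this.sortSpec = spec; return this; } // overwrites any earlier sort
  limit(n) { this.limitN = n; return this; }
  toArray() {
    let rows = this.docs.slice();
    if (this.sortSpec) {
      const [field, dir] = Object.entries(this.sortSpec)[0];
      rows.sort((a, b) => dir * (a[field] - b[field]));
    }
    return rows.slice(0, this.limitN); // sort is applied first, then limit
  }
}

const docs = [1, 2, 3, 4].map((r) => ({ record: r, time: r }));
const result = new ToyCursor(docs)
  .sort({ time: -1 }).limit(2).sort({ time: 1 })
  .toArray()
  .map((d) => d.record);
console.log(result); // [ 1, 2 ]
```

The first sort leaves no trace: by the time the "query" runs, only the last sort spec and the limit are left.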

Related

Index vs Aggregation Pipeline for Sorting

I'm developing an application using MongoDB as its database. On the topic of sorting data, a colleague made an interesting argument: that an index can be used instead of an aggregation pipeline to get sorted data.
I tried this and it actually works; using an index with specified field and direction does return sorted data when queried. When using aggregation pipeline, I also obtained the same result.
I have created an index with the following specification:
index name: batch_deleted_a_desc
num: asc
marked: asc
value: desc
Using aggregation pipeline:
> db.test.aggregate([{$match: {num:"3",marked:false}}, {$sort:{"value":-1}}])
{ "_id" : ObjectId("5d70b40ba7bebd3d7c135615"), "value" : 4, "marked" : false, "num" : "3" }
{ "_id" : ObjectId("5d70b414a7bebd3d7c135616"), "value" : 2, "marked" : false, "num" : "3" }
{ "_id" : ObjectId("5d70b3fea7bebd3d7c135614"), "value" : 1, "marked" : false, "num" : "3" }
Using index:
> db.test.find({num:"3",marked:false})
{ "_id" : ObjectId("5d70b40ba7bebd3d7c135615"), "value" : 4, "marked" : false, "num" : "3" }
{ "_id" : ObjectId("5d70b414a7bebd3d7c135616"), "value" : 2, "marked" : false, "num" : "3" }
{ "_id" : ObjectId("5d70b3fea7bebd3d7c135614"), "value" : 1, "marked" : false, "num" : "3" }
As you can see, the results are the same. But I am unsure whether relying on an index to get sorted data is good practice, and yet using an aggregation pipeline (sometimes) takes more effort than just creating an index.
So, which would be the best option?
In the context of the question, the better option would be the aggregation because it explicitly specifies the sort.
In the query example, results are returned in the order specified by the index because the query is using the index { num: 1, marked: 1, value: -1 }. However, nothing in the query guarantees that ordering, meaning results may change at some point in the future. For example, suppose the index { num: 1, marked: 1, updated_at: 1 } were created. The query planner may decide to use this index instead, which may return results in a different order.
In the absence of a sort, a query would return results in the order of the index being used, but you should not rely upon that ordering without explicitly specifying it. Quoting the docs:
Unless you specify the sort() method or use the $near operator,
MongoDB does not guarantee the order of query results.
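If a full aggregation pipeline feels like too much effort, a plain find() with an explicit sort() gives the same guarantee. A minimal sketch (the helper name `findByNumSorted` is hypothetical): with the index from the question in place, the planner should still be able to use it to satisfy the sort, so stating the order explicitly need not cost an in-memory sort.

```javascript
// Hypothetical helper: same filter as in the question, but with the order
// stated explicitly instead of relying on whichever index the planner picks.
function findByNumSorted(coll) {
  return coll.find({ num: "3", marked: false }).sort({ value: -1 });
}

// In the shell: findByNumSorted(db.test)
```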

MongoDB $or + sort + index. How to avoid sorting in memory?

I am having trouble building a proper index for my mongo query that would avoid the SORT stage. I am not even sure that is possible in my case. Here is my query with execution stats:
db.getCollection('test').find(
{
"$or" : [
{
"a" : { "$elemMatch" : { "_id" : { "$in" : [4577] } } },
"b" : { "$in" : [290] },
"c" : { "$in" : [35, 49, 57, 101, 161, 440] },
"d" : { "$lte" : 399 }
},
{
"e" : { "$elemMatch" : { "numbers" : { "$in" : ["1K0407151AC", "0K20N51150A"] } } },
"d" : { "$lte" : 399 }
}]
})
.sort({ "X" : 1, "d" : 1, "Y" : 1, "Z" : 1 }).explain("executionStats")
The fields 'm', 'a' and 'e' are arrays, that is why 'm' is not included in any index.
If you check the execution stats screenshot, you will see that memory usage is pretty close to maximum and unfortunately I had cases where the query failed to execute because of the 32MB limit.
Index for the first part of the $or query:
{
"a._id" : 1,
"X" : 1,
"d" : 1,
"Y" : 1,
"Z" : 1,
"b" : 1,
"c" : 1
}
Index for the second part of the $or query:
{
"e.numbers" : 1,
"X" : 1,
"d" : 1,
"Y" : 1,
"Z" : 1
}
The indexes are used by the query, but not for sorting. Instead of a SORT stage I would like to see a SORT_MERGE stage, but no success so far. If I run the sub-queries inside $or separately, they are able to use the index and avoid sorting in memory. That is acceptable as a workaround, but I would then need to merge and re-sort the results in the application.
MongoDB version is 3.4.2. I checked that and that question. My query is the result. Probably I missed something?
Edit: mongo documents look like that:
{
"_id" : "290_440_K760A03",
"Z" : "K760A03",
"c" : 440,
"Y" : "NPS",
"b" : 290,
"X" : "Schlussleuchte",
"e" : [
{
"..." : 184,
"numbers" : [
"0K20N51150A"
]
}
],
"a" : [
{
"_id" : 4577,
"..." : [
{
"..." : [
{
"..." : "R",
}
]
}
]
},
{
"_id" : 4578
}
],
"d" : 101,
"m" : [
"AT",
"BR",
"CH"
],
"moreFields":"..."
}
Edit 2: removed the field "m" from the query to decrease complexity and attached a test collection dump for anyone who wants to help :)
Here is the solution-
I just added one document in my test collection as shown in your question (edit part). Then I created below four indices-
1. {"m":1,"b":1,"c":1,"X":1,"d":1,"Y":1,"Z":1}
2. {"a._id":1,"b":1,"c":1,"X":1,"d":1,"Y":1,"Z":1}
3. {"m":1,"X":1,"d":1,"Y":1,"Z":1}
4. {"e.numbers":1,"X":1,"d":1,"Y":1,"Z":1}
And when I ran the given query with execution stats, it showed the SORT_MERGE stage as expected.
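To check this without reading the whole explain output by eye, a small helper can walk the winning plan and list its stages. This is only a sketch: `collectStages` is a made-up name, and it assumes child plans are nested under `inputStage` or `inputStages`, which is how explain output is usually laid out.

```javascript
// Hypothetical helper: recursively collect stage names from an explain() plan.
// Assumes child plans appear under `inputStage` (one child) or
// `inputStages` (several children, e.g. below SORT_MERGE or OR).
function collectStages(plan, out = []) {
  if (!plan) return out;
  if (plan.stage) out.push(plan.stage);
  if (plan.inputStage) collectStages(plan.inputStage, out);
  (plan.inputStages || []).forEach((child) => collectStages(child, out));
  return out;
}

// In the shell, with `cursor` being the find(...).sort(...) from the question:
// var stages = collectStages(cursor.explain().queryPlanner.winningPlan);
// stages.indexOf("SORT_MERGE") !== -1  // true if the merge sort is used
// stages.indexOf("SORT") !== -1        // true if an in-memory sort remains
```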
Here is the explanation-
MongoDB has a guideline called equality-sort-range, which says a lot about how indexes should be laid out. I followed this rule and kept the index fields in that order, so the index should be {Equality fields, "X":1,"d":1,"Y":1,"Z":1, Range fields}. The query has a range only on field "d" ("d" : { "$lte" : 399 }), but "d" is already covered by the sort fields of the index ("X":1,"d":1,"Y":1,"Z":1), so we can drop the range part (i.e. field "d") from the end of the index.
If "d" had NOT been in the sort/equality predicate, I would have added it as a range field and my index would have looked like {Equality fields, "X":1,"Y":1,"Z":1,"d":1}.
Now my index is {Equality fields, "X":1,"d":1,"Y":1,"Z":1} and only the equality fields remain to be determined. To figure them out, I checked the find predicates and found two conditions combined by the OR operator.
The first condition has equality on "a._id", "b", "c", "m" ("d" has a range, not equality). So I would need an index like "a._id":1,"m":1,"b":1,"c":1,"X":1,"d":1,"Y":1,"Z":1, but this gives an error because it contains two array fields, "a._id" and "m". Mongo doesn't allow a compound index on parallel arrays, so it fails. I therefore created two separate indexes and let the query planner choose between them. That is how I got the first and second index.
The second condition of the OR operator has "e.numbers" and "m". Both are array fields, so I had to create two indexes as I did for the first condition, and that is how I got my third and fourth index.
Each branch of the OR operator is planned separately and can use only one index, so every branch needs a suitable index of its own. That is why I created these indexes.
Note: If you are concerned about index size, you can keep only one index from the first two and one from the last two. Alternatively, keep all four and use hint() to tell mongo which index to use, if you know better than the query planner.

How do I access a specific element after aggregation in mongodb?

After running an aggregation pipeline, I get a list of objects, but there seems to be no way to retrieve the Nth object.
See:
http://docs.mongodb.org/manual/reference/operator/aggregation/group/#retrieve-distinct-values
The doc has an output like so:
{ "_id" : 1, "item" : "abc", "price" : 10, "quantity" : 2, "date" : ISODate("2014-03-01T08:00:00Z") }
{ "_id" : 2, "item" : "jkl", "price" : 20, "quantity" : 1, "date" : ISODate("2014-03-01T09:00:00Z") }
{ "_id" : 3, "item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-03-15T09:00:00Z") }
{ "_id" : 4, "item" : "xyz", "price" : 5, "quantity" : 20, "date" : ISODate("2014-04-04T11:21:39.736Z") }
{ "_id" : 5, "item" : "abc", "price" : 10, "quantity" : 10, "date" : ISODate("2014-04-04T21:23:13.331Z") }
This is a group of objects, but it is not in a list so you can't do stuff like:
results[1] to get the second object. How are you supposed to interact with this group?
First of all, if you use the db.collection.distinct(<fieldname>) function, you can get distinct values of a field as an array:
> db.animals.distinct("call")
["moo", "baa", "woof", "meow", "quack"]
and then you can dereference the result since it's an array.
> calls = db.animals.distinct("call")
> calls[3]
"meow"
Aggregating for distinct values avoids the big limitation of db.collection.distinct(): it returns a cursor over the distinct values instead of one big array, so there's no 16MB BSON limit. On the other hand, the distinct function can use indexes to cover the operation. So use the aggregation approach when you have a gazillion distinct values, and otherwise use the distinct function. While you could call .toArray() on the cursor and get all the results in an array, that is a bad idea if you have so many results that you couldn't use db.collection.distinct(). Instead, iterate through the cursor, pick out the values that you want and do stuff with them:
> var k = 0
> var calls = db.animals.aggregate(<pipeline for distinct animal calls>)
> while (calls.hasNext()) {
var call = calls.next()
k++
if (k == 96) doStuff(call)
}
You can insert a $skip stage in the pipeline to have the server skip right to the first result that you want, if you also include a $sort to fix an order that the results will be returned in. If you know you only want up to a certain amount, you can also use $limit and then the .toArray() approach may be viable again.
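As a rough sketch of that idea, the stages can be appended to an existing pipeline like this. The helper name `nthResultPipeline` is hypothetical, and the `$group` stage below merely stands in for the elided "pipeline for distinct animal calls".

```javascript
// Hypothetical helper: extend a pipeline so the server returns only the
// n-th result (0-based) under a fixed sort order.
function nthResultPipeline(basePipeline, sortSpec, n) {
  return basePipeline.concat([
    { $sort: sortSpec }, // fix the order so "n-th" is well defined
    { $skip: n },        // skip the first n results on the server
    { $limit: 1 },       // keep only the one we want
  ]);
}

// Example: distinct values of "call", then the 96th one in ascending order.
const pipeline = nthResultPipeline(
  [{ $group: { _id: "$call" } }],
  { _id: 1 },
  95
);
// In the shell: db.animals.aggregate(pipeline).toArray()[0]
```

Since at most one document comes back, calling .toArray() on the result is harmless here.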
Before MongoDB 2.6, the results from the aggregate method were returned as a single document with the results placed within an array. This changed to accommodate larger result sets that would exceed the 16MB BSON limit imposed by that form. The cursor that is returned now is actually an option, but it is turned on by default in the shell.
Provided your results are not too large, there is actually a "helper" method, .toArray(), which does just what you want by turning the cursor results into an array. It does cursor iteration under the hood, but basically just hides that from you:
var results = db.collection.aggregate(pipeline).toArray()
Then just access the element. It's an array, so the Nth element is at index n-1 (here, the 9th):
var result = results[8];
Similar methods are available in most drivers; alternatively, do not include the "cursor" option in the driver call.

MongoDB Why this error : can't append to array using string field name: comments

I have a DB structure like below:
{
"_id" : 1,
"comments" : [
{
"_id" : 2,
"content" : "xxx"
}
]
}
I add a new subdocument to the comments field of an existing comment, and it works fine:
db.test.update(
{"_id" : 1, "comments._id" : 2},
{$push : {"comments.$.comments" : {_id : 3, content:"xxx"}}}
)
After that, the DB structure is:
{
"_id" : 1,
"comments" : [
{
"_id" : 2,
"comments" : [
{
"id" : 3,
"content" : "xxx"
}
],
"content" : "xxx"
}
]
}
But when I try to add a new subdocument to the comment whose id is 3, there is an error:
db.test.update(
{"_id" : 1, "comments.comments.id" : 3},
{$push : {"comments.comments.$.comments" : {id : 4, content:"xxx"}}}
)
error message:
can't append to array using string field name: comments
Well, it makes total sense if you think about it. MongoDB has the advantage, and the disadvantage, of resolving certain things magically.
When you query the database for a specific regular field like this:
{ field : "value" }
The query {field:"value"} makes total sense. It wouldn't if the value were part of an array, but Mongo resolves that for you, so if the structure is:
{ field : ["value", "anothervalue"] }
Mongo iterates through all of the elements, matches "value" against the field, and you don't have to think about it. It works perfectly... but only at one level, because it's impossible to guess what you want to do if you have multiple levels.
In your case the first query works because it is the single-level case:
db.test.update(
{"_id" : 1, "comments._id" : 2},
{$push : {"comments.$.comments" : {_id : 3, content:"xxx"}}}
)
It matches _id at the first level and comments._id at the second level. It gets an array as a result, but Mongo is able to resolve it.
But in the second case, think about what you are asking for. Let's isolate the where clause:
{"_id" : 1, "comments.comments.id" : 3},
"Give me from the main collection records with _id:1" (one doc)
"And comments whose inner comments have an id=3" (array * array)
The first level is resolved easily. The second is not: comments returns an array, so one level deeper Mongo gets an array of arrays as a result, and it's not possible to push a document into all the records of that array.
The solution is to narrow your where clause so that it matches a unique document in comments (it could be the first one), but that's not a good solution because you never know the position of the document you're looking for. Using the shell, I think the only accurate option is to do it in two steps. Check this query, which works (it is not a real solution) but "solves" the multiple-array part by fixing it to the first record:
db.test.update(
{"_id" : 1, "comments.0.comments._id" : 3},
{$push : {"comments.0.comments.$.comments" : {id : 4, content:"xxx"}}}
)
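The two-step approach mentioned above could look roughly like this. It is a sketch under assumptions, not a tested recipe: `buildNestedPush` is a made-up helper, it assumes the nested comments carry an "id" field as in the stored document shown in the question, and it needs a JavaScript engine with computed property names. Instead of the positional `$` operator it spells out both array indexes explicitly, which is a deliberate variation on the query above.

```javascript
// Hypothetical helper: given the already-fetched parent document, locate the
// nested comment by id and build a filter/update pair with explicit indexes.
function buildNestedPush(doc, targetId, newComment) {
  for (let i = 0; i < doc.comments.length; i++) {
    const inner = doc.comments[i].comments || [];
    const j = inner.findIndex((c) => c.id === targetId);
    if (j !== -1) {
      const base = "comments." + i + ".comments." + j;
      return {
        // Re-check the id at that position so the update does nothing
        // if the arrays changed between the read and the write.
        filter: { _id: doc._id, [base + ".id"]: targetId },
        update: { $push: { [base + ".comments"]: newComment } },
      };
    }
  }
  return null; // no nested comment with that id
}

// In the shell (step 1: read, step 2: write):
// var doc = db.test.findOne({ _id: 1 });
// var op = buildNestedPush(doc, 3, { id: 4, content: "xxx" });
// if (op) db.test.update(op.filter, op.update);
```

For the document in the question this builds the path "comments.0.comments.0.comments", without hard-coding the position in the query text.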

Using Spring Data Mongodb, is it possible to get the max value of a field without pulling and iterating over an entire collection?

Using mongoTemplate.find(), I specify a Query with which I can call .limit() or .sort():
.limit() returns a Query object
.sort() returns a Sort object
Given this, I can say Query().limit(int).sort(), but this does not perform the desired operation, it merely sorts a limited result set.
I cannot call Query().sort().limit(int) either since .sort() returns a Sort()
So using Spring Data, how do I perform the following as shown in the mongoDB shell? Maybe there's a way to pass a raw query that I haven't found yet?
I would be ok with extending the Paging interface if need be...just doesn't seem to help any. Thanks!
> j = { order: 1 }
{ "order" : 1 }
> k = { order: 2 }
{ "order" : 2 }
> l = { order: 3 }
{ "order" : 3 }
> db.test.save(j)
> db.test.save(k)
> db.test.save(l)
> db.test.find()
{ "_id" : ObjectId("4f74d35b6f54e1f1c5850f19"), "order" : 1 }
{ "_id" : ObjectId("4f74d3606f54e1f1c5850f1a"), "order" : 2 }
{ "_id" : ObjectId("4f74d3666f54e1f1c5850f1b"), "order" : 3 }
> db.test.find().sort({ order : -1 }).limit(1)
{ "_id" : ObjectId("4f74d3666f54e1f1c5850f1b"), "order" : 3 }
You can do this in spring-data-mongodb. Mongo will optimize sort/limit combinations IF the sort field is indexed (or is the @Id field). This produces very fast O(logN) or better results. Otherwise it is still O(N) as opposed to O(N*logN), because it will use a top-k algorithm and avoid the global sort (mongodb sort doc). This is from Mkyong's example, but I apply the sort first and then set the limit to one.
Query query = new Query();
query.with(new Sort(Sort.Direction.DESC, "idField"));
query.limit(1);
MyObject maxObject = mongoTemplate.findOne(query, MyObject.class);
Normally, things that are done with aggregate SQL queries, can be approached in (at least) three ways in NoSQL stores:
with Map/Reduce. This is effectively going through all the records, but more optimized (works with multiple threads, and in clusters). Here's the map/reduce tutorial for MongoDB.
pre-calculate the max value on each insert, and store it separately. So, whenever you insert a record, you compare it to the previous max value, and if it's greater - update the max value in the db.
fetch everything in memory and do the calculation in the code. That's the most trivial solution. It would probably work well for small data sets.
Choosing one over the other depends on your usage of this max value. If it is performed rarely, for example for some corner reporting, you can go with the map/reduce. If it is used often, then store the current max.
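For the second option (storing the current max), the compare-and-update step can be pushed to the server with the `$max` update operator, on server versions that have it. A minimal sketch: `recordMax`, the `db.stats` collection and the key "test.order.max" are all invented for illustration.

```javascript
// Hypothetical helper: keep a running maximum in a separate one-document record.
// Assumes `statsColl` is a collection reserved for such values and that the
// server supports the $max update operator and updateOne().
function recordMax(statsColl, key, value) {
  return statsColl.updateOne(
    { _id: key },
    { $max: { value: value } }, // only writes if `value` exceeds the stored one
    { upsert: true }            // creates the record on first use
  );
}

// In the shell, after each insert of a document `doc` into db.test:
// recordMax(db.stats, "test.order.max", doc.order)
// Reading the max is then a single lookup:
// db.stats.findOne({ _id: "test.order.max" }).value
```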
As far as I am aware Mongo totally supports sort then limit: see http://www.mongodb.org/display/DOCS/Sorting+and+Natural+Order
Getting the max/min via map/reduce is going to be very slow and should be avoided at all costs.
I don't know anything about Spring Data, but I can recommend Morphia to help with queries. Otherwise a basic way with the Java driver would be:
DBCollection coll = db.getCollection("...");
DBCursor cur = coll.find(new BasicDBObject()).sort(new BasicDBObject("order", -1))
.limit(1);
if (cur.hasNext())
System.out.println(cur.next());
Use the aggregation $max operator.
As $max is an accumulator operator available only in the $group stage, you need a trick.
In the $group stage, use any constant as the _id.
Let's take the example given on the MongoDB site:
Consider a sales collection with the following documents:
{ "_id" : 1, "item" : "abc", "price" : 10, "quantity" : 2, "date" : ISODate("2014-01-01T08:00:00Z") }
{ "_id" : 2, "item" : "jkl", "price" : 20, "quantity" : 1, "date" : ISODate("2014-02-03T09:00:00Z") }
{ "_id" : 3, "item" : "xyz", "price" : 5, "quantity" : 5, "date" : ISODate("2014-02-03T09:05:00Z") }
{ "_id" : 4, "item" : "abc", "price" : 10, "quantity" : 10, "date" : ISODate("2014-02-15T08:00:00Z") }
{ "_id" : 5, "item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-02-15T09:05:00Z") }
To find the max price among all the items:
db.sales.aggregate(
[
{
$group:
{
_id: "1", //** This is the trick
maxPrice: { $max: "$price" }
}
}
]
)
Note that the value of "_id" is "1". You can use any constant.
Since the first answer is correct but the code is obsolete, I'm replying with a similar solution that worked for me:
Query query = new Query();
query.with(Sort.by(Sort.Direction.DESC, "field"));
query.limit(1);
Entity maxEntity = mongoTemplate.findOne(query, Entity.class);