Using Spring Data MongoDB, is it possible to get the max value of a field without pulling and iterating over an entire collection?

Using mongoTemplate.find(), I specify a Query with which I can call .limit() or .sort():
.limit() returns a Query object
.sort() returns a Sort object
Given this, I can say Query().limit(int).sort(), but this does not perform the desired operation; it merely sorts a limited result set.
I cannot call Query().sort().limit(int) either, since .sort() returns a Sort.
So using Spring Data, how do I perform the following as shown in the mongoDB shell? Maybe there's a way to pass a raw query that I haven't found yet?
I would be ok with extending the Paging interface if need be...just doesn't seem to help any. Thanks!
> j = { order: 1 }
{ "order" : 1 }
> k = { order: 2 }
{ "order" : 2 }
> l = { order: 3 }
{ "order" : 3 }
> db.test.save(j)
> db.test.save(k)
> db.test.save(l)
> db.test.find()
{ "_id" : ObjectId("4f74d35b6f54e1f1c5850f19"), "order" : 1 }
{ "_id" : ObjectId("4f74d3606f54e1f1c5850f1a"), "order" : 2 }
{ "_id" : ObjectId("4f74d3666f54e1f1c5850f1b"), "order" : 3 }
> db.test.find().sort({ order : -1 }).limit(1)
{ "_id" : ObjectId("4f74d3666f54e1f1c5850f1b"), "order" : 3 }

You can do this in spring-data-mongodb. Mongo will optimize sort/limit combinations if the sort field is indexed (or is the @Id field). This produces very fast O(log N) or better results. Otherwise it is still O(N), as opposed to O(N log N), because it uses a top-k algorithm and avoids the global sort (see the MongoDB sort documentation). This is based on Mkyong's example, but with the sort applied first and the limit set to one.
Query query = new Query();
query.with(new Sort(Sort.Direction.DESC, "idField"));
query.limit(1);
MyObject maxObject = mongoTemplate.findOne(query, MyObject.class);
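The top-k behaviour described above can be sketched in plain Python (an in-memory stand-in for the server-side optimization, not actual MongoDB code):

```python
import heapq

# Hypothetical in-memory documents standing in for the collection.
docs = [{"order": 1}, {"order": 2}, {"order": 3}]

# A sort(desc) + limit(1) on an unindexed field behaves like top-k selection:
# the server tracks only the best candidate instead of sorting everything,
# which is O(N) rather than O(N log N).
max_doc = heapq.nlargest(1, docs, key=lambda d: d["order"])[0]
print(max_doc)  # {'order': 3}
```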

Normally, things that are done with aggregate SQL queries can be approached in (at least) three ways in NoSQL stores:
with Map/Reduce. This is effectively going through all the records, but more optimized (works with multiple threads, and in clusters). Here's the map/reduce tutorial for MongoDB.
pre-calculate the max value on each insert, and store it separately. So, whenever you insert a record, you compare it to the previous max value, and if it's greater - update the max value in the db.
fetch everything in memory and do the calculation in the code. That's the most trivial solution. It would probably work well for small data sets.
Choosing one over the other depends on how you use this max value. If it is computed rarely, for example for some corner-case reporting, you can go with map/reduce. If it is used often, then store the current max.
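The second option (pre-calculating the max on each insert) can be sketched in Python; the `store` list and `meta` dict are hypothetical stand-ins for two collections:

```python
# In-memory sketch: maintain the running max on every insert,
# so reading the max later is O(1) instead of a scan.
store = []
meta = {"max_order": None}

def insert(doc):
    store.append(doc)
    current = meta["max_order"]
    # Compare against the stored max and update it if the new value is greater.
    if current is None or doc["order"] > current:
        meta["max_order"] = doc["order"]

for value in (1, 3, 2):
    insert({"order": value})

print(meta["max_order"])  # 3
```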

As far as I am aware, Mongo fully supports sort then limit: see http://www.mongodb.org/display/DOCS/Sorting+and+Natural+Order
Getting the max/min via map/reduce is going to be very slow and should be avoided at all costs.
I don't know anything about Spring Data, but I can recommend Morphia to help with queries. Otherwise a basic way with the Java driver would be:
DBCollection coll = db.getCollection("...");
DBCursor cur = coll.find(new BasicDBObject())
        .sort(new BasicDBObject("order", -1))
        .limit(1);
if (cur.hasNext())
    System.out.println(cur.next());

Use the aggregation $max.
As $max is an accumulator operator available only in the $group stage, you need a trick: in the $group operator, use any constant as the _id.
Let's take the example from the MongoDB documentation.
Consider a sales collection with the following documents:
{ "_id" : 1, "item" : "abc", "price" : 10, "quantity" : 2, "date" : ISODate("2014-01-01T08:00:00Z") }
{ "_id" : 2, "item" : "jkl", "price" : 20, "quantity" : 1, "date" : ISODate("2014-02-03T09:00:00Z") }
{ "_id" : 3, "item" : "xyz", "price" : 5, "quantity" : 5, "date" : ISODate("2014-02-03T09:05:00Z") }
{ "_id" : 4, "item" : "abc", "price" : 10, "quantity" : 10, "date" : ISODate("2014-02-15T08:00:00Z") }
{ "_id" : 5, "item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-02-15T09:05:00Z") }
Suppose you want to find the max price among all the items:
db.sales.aggregate(
[
{
$group:
{
_id: "1", //** This is the trick
maxPrice: { $max: "$price" }
}
}
]
)
Note the value of "_id": it is "1", but you could put any constant there.
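For illustration, here is what that $group stage computes, sketched in Python over the same sales documents (dates omitted for brevity):

```python
# With a constant _id, every document falls into one group,
# and $max accumulates the maximum of "price" across all of them.
sales = [
    {"_id": 1, "item": "abc", "price": 10, "quantity": 2},
    {"_id": 2, "item": "jkl", "price": 20, "quantity": 1},
    {"_id": 3, "item": "xyz", "price": 5, "quantity": 5},
    {"_id": 4, "item": "abc", "price": 10, "quantity": 10},
    {"_id": 5, "item": "xyz", "price": 5, "quantity": 10},
]
result = {"_id": "1", "maxPrice": max(doc["price"] for doc in sales)}
print(result)  # {'_id': '1', 'maxPrice': 20}
```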

Since the first answer is correct but the code is obsolete, I'm replying with a similar solution that worked for me:
Query query = new Query();
query.with(Sort.by(Sort.Direction.DESC, "field"));
query.limit(1);
Entity maxEntity = mongoTemplate.findOne(query, Entity.class);

Related

Is searching by _id in mongoDB more efficient?

In my use case, I want to search a document by a given unique string in MongoDB. However, I want my queries to be fast and searching by _id will add some overhead. I want to know if there are any benefits in MongoDB to search a document by _id over any other unique value?
To my knowledge, object IDs are similar to any other unique value in a document [point made for the case of searching only].
As for the overhead, you can assume I am caching the string to objectID and the cache is very small and in memory [Almost negligible], though the DB is large.
Analyzing your query performance
I advise you to use .explain(), provided by MongoDB, to analyze your query performance.
Let's say we are trying to execute this query
db.inventory.find( { quantity: { $gte: 100, $lte: 200 } } )
This would be the result of the query execution
{ "_id" : 2, "item" : "f2", "type" : "food", "quantity" : 100 }
{ "_id" : 3, "item" : "p1", "type" : "paper", "quantity" : 200 }
{ "_id" : 4, "item" : "p2", "type" : "paper", "quantity" : 150 }
If we call .explain() this way
db.inventory.find(
{ quantity: { $gte: 100, $lte: 200 } }
).explain("executionStats")
It will return the following result:
{
"queryPlanner" : {
"plannerVersion" : 1,
...
"winningPlan" : {
"stage" : "COLLSCAN",
...
}
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 3,
"executionTimeMillis" : 0,
"totalKeysExamined" : 0,
"totalDocsExamined" : 10,
"executionStages" : {
"stage" : "COLLSCAN",
...
},
...
},
...
}
More details about this can be found here
How efficient is search by _id and indexes
To answer your question, using indexes is always more efficient. Indexes are special data structures that store a small portion of the collection's data set in an easy to traverse form. With _id being the default index provided by MongoDB, that makes it more efficient.
Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement.
So, YES, using indexes like _id is better!
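The difference can be sketched in Python, with a plain list standing in for the collection and a dict standing in for the _id index (all data hypothetical):

```python
# In-memory contrast between a collection scan and an index lookup.
docs = [{"_id": i, "item": f"i{i}"} for i in range(10)]

def collection_scan(target):
    # Without an index: examine every document until a match is found (O(N)).
    return next(d for d in docs if d["_id"] == target)

# The "index": a structure that maps the key straight to the document.
index = {d["_id"]: d for d in docs}

# Same result either way, but the index examines far fewer documents.
print(collection_scan(7))  # {'_id': 7, 'item': 'i7'}
print(index[7])            # {'_id': 7, 'item': 'i7'}
```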
You can also create your own indexes by using createIndex()
db.collection.createIndex( <key and index type specification>, <options> )
Optimize your MongoDB query
In case you want to optimize your query, there are multiple ways to do that.
Creating custom indexes to support your queries
Limit the Number of Query Results to Reduce Network Demand
db.posts.find().sort( { timestamp : -1 } ).limit(10)
Use Projections to Return Only Necessary Data
db.posts.find( {}, { timestamp : 1 , title : 1 , author : 1 , abstract : 1} ).sort( { timestamp : -1 } )
Use $hint to Select a Particular Index
db.users.find().hint( { age: 1 } )
Short answer, yes _id is the primary key and it's indexed. Of course it's fast.
But you can use an index on the other fields too and get more efficient queries.

Can sorting before grouping improve query performance in Mongo using the aggregate framework?

I'm trying to aggregate data for 100 accounts for a 14-15 month period, grouping by year and month.
However, the query performance is horrible as it takes 22-27 seconds. There are currently over 15 million records in the collection and I've got an index on the match criteria and can see using explain() that the optimizer uses it.
I tried adding another index on the sort criteria in the query below and after adding the index, the query now takes over 50 seconds! This happens even after I remove the sort from the query.
I'm extremely confused. I thought that because grouping can't utilize an index, if the collection were sorted beforehand, the grouping could be much faster. Is this assumption correct? If not, what other options do I have? I can bear query times of up to 5 seconds, but nothing more than that.
//Document Structure
{
    Acc: 1,
    UIC: true,
    date: ISODate("2015-12-01T05:00:00Z"),
    y: 2015,
    mm: 12,
    value: 22.3
}
//Query
db.MyCollection.aggregate([
{ "$match" : { "UIC" : true, "Acc" : { "$in" : [1, 2, 3, ..., 99, 100] }, "date" : { "$gte" : ISODate("2015-12-01T05:00:00Z"), "$lt" : ISODate("2017-02-01T05:00:00Z") } } },
//{ "$sort" : { "UIC" : 1, "Acc" : 1, "y" : -1, "mm" : 1 } },
{ "$group" : { "_id" : { "Num" : "$Num", "Year" : "$y", "Month" : "$mm" }, "Sum" : { "$sum" : "$value" } } }
])
What I would suggest is to write a script (it can be in Node.js) that aggregates the data into a different collection. When you have long-running queries like this, it's advisable to maintain a separate collection containing the pre-aggregated data and query that instead.
My second piece of advice would be to use a composite key in this aggregated collection and search it by regular expression. In your case, I would make the _id contain accountId:period. For example, for account 1 and February 2016, the _id would be something like 1:201602.
Then you would be able to perform queries by account and timestamp using regular expressions. For instance, if you wanted the registers for 2016 of account 1, you could do something like:
db.aggregatedCollection.find({ _id : /^1:2016/ })
Hope my answer was helpful

MongoDB / Morphia - Projection not working on recursive objects?

I have a test object which works as nodes on a tree, containing 0 or more children instances of the same type. I'm persisting it on MongoDB and querying it with Morphia.
I perform the following query:
db.TestObject.find( {}, { _id: 1, childrenTestObjects: 1 } ).limit(6).sort( {_id: 1 } ).pretty();
Which results in:
{ "_id" : NumberLong(1) }
{ "_id" : NumberLong(2) }
{ "_id" : NumberLong(3) }
{ "_id" : NumberLong(4) }
{
"_id" : NumberLong(5),
"childrenTestObjects" : [
{
"stringValue" : "6eb887126d24e8f1cd8ad5033482c781",
"creationDate" : ISODate("1997-05-24T00:00:00Z"),
"childrenTestObjects" : [
{
"stringValue" : "2ab8f86410b4f3bdcc747699295eb5a4",
"creationDate" : ISODate("2024-10-10T00:00:00Z"),
"_id" : NumberLong(7)
}
],
"_id" : NumberLong(6)
}
]
}
That's awesome, but also a little surprising. I'm having two issues with the results:
1) When I do a projection, it only applies to the top-level elements. The children elements still return other properties not in the projection (stringValue and creationDate). I'd like the field selection to apply to all documents and subdocuments of the same type. This tree has an undetermined number of sub-items, so I can't specify them in the query explicitly. How do I accomplish that?
2) To my surprise, limit applied to sub documents! You see that there was one embedded document with id 6. I was expecting to see 6 top level documents with N sub documents, but instead got just 5. How to tell MongoDB to return 6 top level elements, regardless of what is embedded in them? Without that having a consistent pagination system is impossible.
All your help has made learning MongoDB way faster and I really appreciate it! Thanks!
As for 1), projections retain fields in the results. In this case, that field is childrenTestObjects, which happens to be a document, so mongo returns that entire field, which is, of course, the entire subdocument. Projections are not recursive, so you'd have to specify each field explicitly.
As for 2), that doesn't sound right. It would help to see the query results without the projection added (full documents in each returned document) and we can take it from there.
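Since projections are not recursive, one client-side option is to re-apply the projection at every nesting level yourself. A minimal Python sketch (the document shape mirrors the example above; field names are assumptions):

```python
# Recursively keep only the listed fields at every nesting level,
# descending into embedded documents and arrays of documents.
def project(doc, fields):
    out = {}
    for key, value in doc.items():
        if key not in fields:
            continue
        if isinstance(value, list):
            out[key] = [project(v, fields) for v in value]
        elif isinstance(value, dict):
            out[key] = project(value, fields)
        else:
            out[key] = value
    return out

doc = {"_id": 5, "childrenTestObjects": [
    {"_id": 6, "stringValue": "x",
     "childrenTestObjects": [{"_id": 7, "stringValue": "y"}]}]}

print(project(doc, {"_id", "childrenTestObjects"}))
# {'_id': 5, 'childrenTestObjects': [{'_id': 6, 'childrenTestObjects': [{'_id': 7}]}]}
```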

How do I access a specific element after aggregation in mongodb?

After aggregation pipeline, I get a list of objects, but there is no way to retrieve the Nth object.
See:
http://docs.mongodb.org/manual/reference/operator/aggregation/group/#retrieve-distinct-values
The doc has an output like so:
{ "_id" : 1, "item" : "abc", "price" : 10, "quantity" : 2, "date" : ISODate("2014-03-01T08:00:00Z") }
{ "_id" : 2, "item" : "jkl", "price" : 20, "quantity" : 1, "date" : ISODate("2014-03-01T09:00:00Z") }
{ "_id" : 3, "item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-03-15T09:00:00Z") }
{ "_id" : 4, "item" : "xyz", "price" : 5, "quantity" : 20, "date" : ISODate("2014-04-04T11:21:39.736Z") }
{ "_id" : 5, "item" : "abc", "price" : 10, "quantity" : 10, "date" : ISODate("2014-04-04T21:23:13.331Z") }
This is a group of objects, but it is not in a list so you can't do stuff like:
results[1] to get the second object. How are you supposed to interact with this group?
First of all, if you use the db.collection.distinct(<fieldname>) function, you can get distinct values of a field as an array:
> db.animals.distinct("call")
["moo", "baa", "woof", "meow", "quack"]
and then you can dereference the result since it's an array.
> calls = db.animals.distinct("call")
> calls[3]
"meow"
Aggregating for distinct values is free of the big limitation of db.collection.distinct(): it returns a cursor over the distinct values instead of one big array, which means there's no 16MB BSON limit, and it can use indexes to cover the operation. So use the aggregation approach when you have a gazillion distinct values; otherwise use the distinct function. While you could call .toArray() on the cursor and get all the results in an array, if you have so many results that you couldn't use db.collection.distinct(), then that is a bad idea. Instead, iterate through the cursor and pick out the values you want and do stuff with them:
> var k = 0
> var calls = db.animals.aggregate(<pipeline for distinct animal calls>)
> while (calls.hasNext()) {
var call = calls.next()
k++
if (k == 96) doStuff(call)
}
You can insert a $skip stage in the pipeline to have the server skip right to the first result that you want, if you also include a $sort to fix an order that the results will be returned in. If you know you only want up to a certain amount, you can also use $limit and then the .toArray() approach may be viable again.
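The $skip/$limit idea can be sketched in Python, with a sorted, de-duplicated list standing in for the pipeline's output (the animal-call data is from the example above):

```python
calls = ["moo", "baa", "woof", "meow", "quack"]

def nth_distinct(values, n):
    # Roughly what the pipeline does: distinct ($group), then $sort...
    pipeline = sorted(set(values))
    # ...then $skip: n and $limit: 1 to land on the nth result.
    return pipeline[n:n + 1]

print(nth_distinct(calls, 2))  # ['moo']
```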
Before MongoDB 2.6 the results from the aggregate method were returned as a single document with the results placed within an array. This changed to accommodate larger result sets that would exceed the 16MB BSON limit that was restricted by this form. The cursor that is returned now is actually an option, but turned on by default in the shell.
Provided your results are not too large, there is actually a "helper" method, .toArray(), which does what you want by turning the cursor results into an array. It does the cursor iteration under the hood, but basically just hides that from you:
var results = db.collection.aggregate(pipeline).toArray()
Then just access the element; it's an array, so the nth element is at index n-1:
var result = results[8];
Similar methods are available in most drivers; alternatively, you can tell the driver not to include the "cursor" option.

Chaining time-based sort and limit issue

Lately I've encountered some strange (i.e., IMHO, counter-intuitive) behaviour while playing with mongo and sort/limit.
Let's suppose I do have the following collection:
> db.fred.find()
{ "_id" : ObjectId("..."), "record" : 1, "time" : ISODate("2011-12-01T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 2, "time" : ISODate("2011-12-02T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 3, "time" : ISODate("2011-12-03T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 4, "time" : ISODate("2011-12-04T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 5, "time" : ISODate("2011-12-05T00:00:00Z") }
What I would like is retrieving, in time order, the 2 records previous to "record": 4 plus record 4 (i.e. record 2, record 3 and record 4)
Naively I was about running something along:
db.fred.find({time: {$lte: ISODate("2011-12-04T00:00:00Z")}}).sort({time: -1}).limit(2).sort({time: 1})
but it does not work the way I expected:
{ "_id" : ObjectId("..."), "record" : 1, "time" : ISODate("2011-12-01T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 2, "time" : ISODate("2011-12-02T00:00:00Z") }
I was thinking that the result would have been record 2, record 3 and 4.
From what I recollect, it seems that the 2nd sort applies before the limit:
sort({time: -1}) => record 4, record 3, record 2, record 1
sort({time: -1}).limit(2) => record 4, record 3
sort({time: -1}).limit(2).sort({time: 1}) => record 1, record 2
i.e. it's like the second sort was applied to the cursor returned by find (i.e. the whole set), and only then is the limit applied.
What is my mistake here and how can I achieve the expected behavior?
BTW: running mongo 2.0.1 on Ubuntu 11.01
The MongoDB shell lazily evaluates cursors, which is to say, the series of chained operations you've done results in one query being sent to the server, using the final state based on the chained operations. So when you say "sort({time: -1}).limit(2).sort({time: 1})" the second call to sort overrides the sort set by the first call.
To achieve your desired result, you're probably better off reversing the cursor output in your application code, especially if you're limiting to a small result set (here you're using 2). The exact code to do so depends on the language you're using, which you haven't specified.
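The suggested approach (let the server sort descending with a limit, then reverse the small result in application code) can be sketched in Python with hypothetical in-memory records:

```python
records = [
    {"record": 1, "time": "2011-12-01"},
    {"record": 2, "time": "2011-12-02"},
    {"record": 3, "time": "2011-12-03"},
    {"record": 4, "time": "2011-12-04"},
    {"record": 5, "time": "2011-12-05"},
]

# "Server side": filter time <= 2011-12-04, sort descending, limit to 3.
matched = [r for r in records if r["time"] <= "2011-12-04"]
latest = sorted(matched, key=lambda r: r["time"], reverse=True)[:3]

# "Client side": reverse the small limited set to restore ascending order.
window = list(reversed(latest))
print([r["record"] for r in window])  # [2, 3, 4]
```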
Applying sort() to the same query multiple times makes no sense here. The effective sort order is taken from the last sort() call. So
sort({time: -1}).limit(2).sort({time: 1})
is the same as
sort({time: 1}).limit(2)