Sort before querying - mongodb

Is it possible to run a sort on a Mongo collection before running the filtering query? I have older code that picked a random result from the database by giving each document a field holding a random float between 0 and 1, then querying with findOne for the first document whose value was greater than a random float generated at query time. The sample set was small, so I didn't notice a problem at first, but I recently noticed that one query almost always returned the same document. The "first" document had a random value > .9, so nearly every query matched it first.
I realized that for this solution to work, I need to sort by random, then find the first value greater than my random float. As I understand it, this approach is less necessary than it used to be, since $sample exists as of 3.2, but I figure learning how to do this would be good. Plus, my understanding is that $sample can return the same document multiple times (when N > 1, obviously, so not directly applicable to my question).
So for example, the following data:
> db.links.find()
{ "_id" : ObjectId("553c072bc87652a80e00002a"), "random" : 0.9162904409691691 }
{ "_id" : ObjectId("553c3332c87652c80700002a"), "random" : 0.00427396921440959 }
{ "_id" : ObjectId("553c3c5cc87652a80e00002b"), "random" : 0.2409569111187011 }
{ "_id" : ObjectId("553c3c66c876521c10000029"), "random" : 0.35101076657883823 }
{ "_id" : ObjectId("553c3c6ec87652200700002e"), "random" : 0.3234482416883111 }
{ "_id" : ObjectId("553c68d5c87652a80e00002c"), "random" : 0.5221220930106938 }
Any attempt to run db.links.findOne({'random': {'$gte': x}}) where x is any value up to .91 always returns the first object (_id 553c072). Anything greater returns nothing. If I could sort by the random value in ascending order and then filter, it would keep searching until it found the correct value.

I would strongly recommend dropping your custom solution and switching to the built-in MongoDB $sample stage, which returns a random result from your collection.
EDIT based on your comment:
Here's how you can do what you originally asked for:
db.links.find({ "random": { $gte: /* put your value here */ } })
.sort({ "random": 1 /* sort by "random" field in ascending order */ })
.limit(1)
You can also use the aggregation framework, though you don't need to. The stages are passed as a single array:
db.links.aggregate([{
    $match: {
        "random": {
            $gte: /* put your value here */ // filter the collection
        }
    }
}, {
    $sort: {
        "random": 1 // sort by "random" field in ascending order
    }
}, {
    $limit: 1 // return only the first element
}])
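To see why the sort matters, here is a minimal sketch in plain JavaScript that mimics both queries on the sample data in memory. It does not talk to MongoDB: findOneUnsorted and findOneSorted are hypothetical helpers, the _id values are shortened for readability, and the sketch assumes that an unsorted query returns documents in the order shown by db.links.find().

```javascript
// In-memory stand-in for the sample "links" collection, in the order find() printed it.
// The _id values are shortened placeholders, not real ObjectIds.
const links = [
  { _id: "553c072b", random: 0.9162904409691691 },
  { _id: "553c3332", random: 0.00427396921440959 },
  { _id: "553c3c5c", random: 0.2409569111187011 },
  { _id: "553c3c66", random: 0.35101076657883823 },
  { _id: "553c3c6e", random: 0.3234482416883111 },
  { _id: "553c68d5", random: 0.5221220930106938 },
];

// Mimics findOne({ random: { $gte: x } }) without a sort:
// the first matching document in stored order wins.
function findOneUnsorted(docs, x) {
  return docs.find((d) => d.random >= x) || null;
}

// Mimics find({ random: { $gte: x } }).sort({ random: 1 }).limit(1):
// filter, order the matches ascending, keep the smallest.
function findOneSorted(docs, x) {
  const matches = docs.filter((d) => d.random >= x);
  matches.sort((a, b) => a.random - b.random);
  return matches.length > 0 ? matches[0] : null;
}

console.log(findOneUnsorted(links, 0.3)._id); // 553c072b (random 0.916...)
console.log(findOneSorted(links, 0.3)._id);   // 553c3c6e (random 0.323...)
```

With x = 0.3 the unsorted lookup stops at the very first document because 0.916 already satisfies the filter, while the sorted lookup returns the closest value above x.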

Related

Show Recent chat message in Mongodb [duplicate]

I can't find this documented anywhere. By default, the find() operation returns records from the beginning. How can I get the last N records in mongodb?
Edit: also I want the returned result ordered from less recent to most recent, not the reverse.
If I understand your question, you need to sort in ascending order.
Assuming you have some id or date field called "x" you would do ...
.sort()
db.foo.find().sort({x:1});
The 1 will sort ascending (oldest to newest) and -1 will sort descending (newest to oldest.)
If you use the auto created _id field it has a date embedded in it ... so you can use that to order by ...
db.foo.find().sort({_id:1});
That will return back all your documents sorted from oldest to newest.
Natural Order
You can also use a Natural Order mentioned above ...
db.foo.find().sort({$natural:1});
Again, using 1 or -1 depending on the order you want.
Use .limit()
Lastly, it's good practice to add a limit when doing this sort of wide open query so you could do either ...
db.foo.find().sort({_id:1}).limit(50);
or
db.foo.find().sort({$natural:1}).limit(50);
The last N added records, from less recent to most recent, can be seen with this query:
db.collection.find().skip(db.collection.count() - N)
If you want them in the reverse order:
db.collection.find().sort({ $natural: -1 }).limit(N)
If you install Mongo-Hacker you can also use:
db.collection.find().reverse().limit(N)
If you get tired of writing these commands all the time you can create custom functions in your ~/.mongorc.js. E.g.
function last(N) {
return db.collection.find().skip(db.collection.count() - N);
}
then from a mongo shell just type last(N)
Sorting, skipping and so on can be pretty slow depending on the size of your collection.
You would get better performance if your collection is indexed by some criteria; you could then use the min() cursor method:
First, index your collection with db.collectionName.createIndex( yourIndex )
You can use ascending or descending order, which is cool, because you want always the "N last items"... so if you index by descending order it is the same as getting the "first N items".
Then you find the first item of your collection and use its index field values as the min criteria in a search like:
db.collectionName.find().min(minCriteria).hint(yourIndex).limit(N)
Here's the reference for min() cursor: https://docs.mongodb.com/manual/reference/method/cursor.min/
In order to get last N records you can execute below query:
db.yourcollectionname.find({$query: {}, $orderby: {$natural : -1}}).limit(yournumber)
if you want only one last record:
db.yourcollectionname.findOne({$query: {}, $orderby: {$natural : -1}})
Note: In place of $natural you can use one of the fields from your collection.
db.collection.find().sort({$natural: -1 }).limit(5)
@bin-chen,
You can use an aggregation for the latest n entries of a subset of documents in a collection. Here's a simplified example without grouping (which you would be doing between stages 4 and 5 in this case).
This returns the latest 20 entries (based on a field called "timestamp"), sorted ascending. It then projects each document's _id, timestamp and whatever_field_you_want_to_show into the results.
var pipeline = [
{
"$match": { //stage 1: filter out a subset
"first_field": "needs to have this value",
"second_field": "needs to be this"
}
},
{
"$sort": { //stage 2: sort the remainder last-first
"timestamp": -1
}
},
{
"$limit": 20 //stage 3: keep only 20 of the descending order subset
},
{
"$sort": {
"timestamp": 1 //stage 4: sort back to ascending order
}
},
{
"$project": { //stage 5: add any fields you want to show in your results
"_id": 1,
"timestamp" : 1,
"whatever_field_you_want_to_show": 1
}
}
]
yourcollection.aggregate(pipeline, function resultCallBack(err, result) {
// account for (err)
// do something with (result)
});
so, result would look something like:
{
"_id" : ObjectId("5ac5b878a1deg18asdafb060"),
"timestamp" : "2018-04-05T05:47:37.045Z",
"whatever_field_you_want_to_show" : -3.46000003814697
}
{
"_id" : ObjectId("5ac5b878a1de1adsweafb05f"),
"timestamp" : "2018-04-05T05:47:38.187Z",
"whatever_field_you_want_to_show" : -4.13000011444092
}
Hope this helps.
You can try this method:
Get the total number of records in the collection with
db.dbcollection.count()
Then use skip:
db.dbcollection.find().skip(db.dbcollection.count() - 1).pretty()
You can't "skip" based on the size of the collection, because it will not take the query conditions into account.
The correct solution is to sort from the desired end-point, limit the size of the result set, then adjust the order of the results if necessary.
Here is an example, based on real-world code.
var query = collection.find( { conditions } ).sort({$natural : -1}).limit(N);
query.exec(function(err, results) {
if (err) {
}
else if (results.length == 0) {
}
else {
results.reverse(); // put the results into the desired order
results.forEach(function(result) {
// do something with each result
});
}
});
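The sort, limit, then reverse idea can be tried without a database. The following is a minimal in-memory sketch, not driver code: lastN is a hypothetical helper, and the room and ts fields are made up for illustration.

```javascript
// Returns the last n documents that satisfy `predicate`, ordered oldest to newest.
// Steps mirror the query: filter, sort from the newest end, limit, then reverse.
function lastN(docs, predicate, key, n) {
  return docs
    .filter(predicate)                // apply the query conditions first
    .sort((a, b) => b[key] - a[key])  // newest first (filter() returned a copy)
    .slice(0, n)                      // keep only n documents
    .reverse();                       // put them back into oldest-to-newest order
}

// Hypothetical chat messages; `ts` stands in for a timestamp.
const messages = [
  { room: "a", ts: 1 },
  { room: "b", ts: 2 },
  { room: "a", ts: 3 },
  { room: "a", ts: 4 },
  { room: "b", ts: 5 },
  { room: "a", ts: 6 },
];

const recent = lastN(messages, (m) => m.room === "a", "ts", 2);
console.log(recent.map((m) => m.ts)); // [ 4, 6 ]
```

Because the filter runs before the limit, the result is not affected by how many non-matching documents the collection holds, which is the weakness of skipping by the total collection count.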
You can use sort(), limit(), and skip() to get the last N records, starting from any skipped offset:
db.collections.find().sort({ key: value }).limit(intValue).skip(someIntValue);
Look under Querying: Sorting and Natural Order, http://www.mongodb.org/display/DOCS/Sorting+and+Natural+Order
as well as sort() under Cursor Methods
http://www.mongodb.org/display/DOCS/Advanced+Queries
You may want to be using the find options :
http://docs.meteor.com/api/collections.html#Mongo-Collection-find
db.collection.find({}, {sort: {createdAt: -1}, skip:2, limit: 18}).fetch();
Use .sort() and .limit() for that
Use Sort in ascending or descending order and then use limit
db.collection.find({}).sort({ any_field: -1 }).limit(last_n_records);
If you use MongoDB Compass, you can use the sort field to order the results.
Use the $slice operator to limit array elements:
GeoLocation.find({},{name: 1, geolocation:{$slice: -5}})
.then((result) => {
res.json(result);
})
.catch((err) => {
res.status(500).json({ success: false, msg: `Something went wrong. ${err}` });
});
Here geolocation is an array of data, and we get its last 5 elements.
db.collection.find().hint( { $natural : -1 } ).sort({ field: 1 }).limit(n) // use -1 to sort descending
According to the MongoDB documentation:
You can specify { $natural : 1 } to force the query to perform a forwards collection scan.
You can also specify { $natural : -1 } to force the query to perform a reverse collection scan.
Last function should be sort, not limit.
Example:
db.testcollection.find().limit(3).sort({timestamp:-1});

MongoDB: phantom records in $unwind results under heavy load

I have a simple collection of elements like this
{_id: n, xs: [...]}
I'm trying to count total number of elements in all arrays
db.testRace.aggregate([{ $unwind : "$xs" }, { $group : { _id : null, count : { $sum : 1 } } }])
And it works great unless I start to do massive updates of this collection. Under a heavy load of update operations I get the wrong total, slightly bigger than it should be.
It can be easily reproduced.
First generate some test data
for(var i = 1; i <= 1000000; i++) {
db.testRace.insert({_id: i, xs: [i]});
}
Then simulate a lot of updates
while(true) {
var id = Math.floor((Math.random() * 1000000) + 1);
var obj = db.testRace.find({_id: id}).next();
obj.some="change";
db.testRace.update({_id: id}, obj);
}
And while it is running do aggregate unwind query.
Without load I get right result - 1000000. But when there are a lot of updates I get bigger numbers, like 1001456.
And if I run query like this
db.testRace.aggregate([{ $unwind : "$xs" }, {$group: {_id:"$xs", count:{$sum: 1}}}, { $sort : { count : -1 } }, { $limit : 2 }]);
I get
"result" : [
{
"_id" : 996972,
"count" : 2
},
{
"_id" : 997789,
"count" : 2
}
],
So it seems aggregate count some records twice.
Is it expected behaviour or maybe I'm doing aggregation wrong?
I tested on local mongodb instance, version - 2.4.9
It's expected behavior due to the way MongoDB handles read isolation. When you have a long-running query (and an aggregation that reads every single document is a long-running query) and the data is updated while the query runs, the updates may affect whether or not the updated data is returned by the query. Depending on what happens when, you could miss a document, receive it once, or receive it twice.
From the source code:
Any data inserted, deleted, or modified during a yield that should be
returned by a query may or may not be returned by that query. The
query could return: nothing; the data before; the data after; or both
the data before and the data after.
In short, there is no isolation between a query and an
insert/delete/update. AKA, READ_UNCOMMITTED.
https://github.com/mongodb/mongo/blob/master/src/mongo/db/exec/plan_stage.h
Your aggregation query is yielding mid query, during which some of the data is updated. This impacts the results of the query.
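One way a document can be counted twice is sketched below as a toy model in plain JavaScript. It is an illustration under stated assumptions, not a description of the actual storage engine: it assumes the scan walks storage slots in order and that an update which grows a document rewrites it at the end of storage. The function scanWithConcurrentMove and its parameters are hypothetical.

```javascript
// Toy model of a collection scan that is interrupted by one concurrent update.
// ids:        document ids in storage order
// moveAtStep: scan position at which the concurrent update happens
// idToMove:   the document that the update relocates to the end of storage
function scanWithConcurrentMove(ids, moveAtStep, idToMove) {
  const slots = ids.slice(); // copy so the caller's array is untouched
  const seen = [];
  for (let pos = 0; pos < slots.length; pos++) {
    if (pos === moveAtStep) {
      const i = slots.indexOf(idToMove);
      if (i !== -1) {
        slots[i] = null;      // the old location becomes a hole
        slots.push(idToMove); // the updated document is rewritten at the end
      }
    }
    if (slots[pos] !== null) {
      seen.push(slots[pos]);  // the scan returns whatever lives in this slot
    }
  }
  return seen;
}

const seen = scanWithConcurrentMove([1, 2, 3, 4, 5], 2, 1);
console.log(seen);        // [ 1, 2, 3, 4, 5, 1 ]
console.log(seen.length); // 6, although only 5 documents exist
```

Document 1 is read once before the update and once more after it has been moved behind the scan position, which matches the inflated totals described in the question.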

In MongoDB mapreduce, how can I flatten the values object?

I'm trying to use MongoDB to analyse Apache log files. I've created a receipts collection from the Apache access logs. Here's an abridged summary of what my models look like:
db.receipts.findOne()
{
"_id" : ObjectId("4e57908c7a044a30dc03a888"),
"path" : "/videos/1/show_invisibles.m4v",
"issued_at" : ISODate("2011-04-08T00:00:00Z"),
"status" : "200"
}
I've written a MapReduce function that groups all data by the issued_at date field. It summarizes the total number of requests, and provides a breakdown of the number of requests for each unique path. Here's an example of what the output looks like:
db.daily_hits_by_path.findOne()
{
"_id" : ISODate("2011-04-08T00:00:00Z"),
"value" : {
"count" : 6,
"paths" : {
"/videos/1/show_invisibles.m4v" : {
"count" : 2
},
"/videos/1/show_invisibles.ogv" : {
"count" : 3
},
"/videos/6/buffers_listed_and_hidden.ogv" : {
"count" : 1
}
}
}
}
How can I make the output look like this instead:
{
"_id" : ISODate("2011-04-08T00:00:00Z"),
"count" : 6,
"paths" : {
"/videos/1/show_invisibles.m4v" : {
"count" : 2
},
"/videos/1/show_invisibles.ogv" : {
"count" : 3
},
"/videos/6/buffers_listed_and_hidden.ogv" : {
"count" : 1
}
}
}
It's not currently possible, but I would suggest voting for this case: https://jira.mongodb.org/browse/SERVER-2517.
Taking the best from previous answers and comments:
db.items.find().hint({_id: 1}).forEach(function(item) {
db.items.update({_id: item._id}, item.value);
});
From http://docs.mongodb.org/manual/core/update/#replace-existing-document-with-new-document
"If the update argument contains only field and value pairs, the update() method replaces the existing document with the document in the update argument, except for the _id field."
So you need neither to $unset value, nor to list each field.
From https://docs.mongodb.com/manual/core/read-isolation-consistency-recency/#cursor-snapshot
"MongoDB cursors can return the same document more than once in some situations. ... use a unique index on this field or these fields so that the query will return each document no more than once. Query with hint() to explicitly force the query to use that index."
AFAIK, by design Mongo's map reduce will spit results out in "value tuples" and I haven't seen anything that will configure that "output format". Maybe the finalize() method can be used.
You could try running a post-process that will reshape the data using
results.find({}).forEach( function(result) {
results.update({_id: result._id}, {count: result.value.count, paths: result.value.paths})
});
Yep, that looks ugly. I know.
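The reshaping itself is a small transformation that can be checked in isolation. Here is a minimal sketch in plain JavaScript, assuming each map-reduce result has the shape { _id, value: { ... } } shown in the question; flattenResult is a hypothetical helper and the sample document uses a string in place of the ISODate.

```javascript
// Lifts every field of `value` to the top level and keeps the original _id.
// The _id is applied last so that a stray `value._id` cannot overwrite it.
function flattenResult(doc) {
  return Object.assign({}, doc.value, { _id: doc._id });
}

// Sample shaped like the daily_hits_by_path output (date shown as a string).
const mrResult = {
  _id: "2011-04-08T00:00:00Z",
  value: {
    count: 6,
    paths: {
      "/videos/1/show_invisibles.m4v": { count: 2 },
      "/videos/1/show_invisibles.ogv": { count: 3 },
    },
  },
};

const flat = flattenResult(mrResult);
console.log(Object.keys(flat).sort()); // [ '_id', 'count', 'paths' ]
```

The returned object is what you would pass as the replacement document in the update loop above, without listing each field by hand.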
You can do Dan's code with a collection reference:
function clean(collection) {
    collection.find().forEach(function(result) {
        var value = result.value;
        delete value._id;
        collection.update({_id: result._id}, value);
        // match on _id (not id) so the $unset targets the same document
        collection.update({_id: result._id}, {$unset: {value: 1}});
    });
}
A similar approach to that of @ljonas, but with no need to hardcode document fields:
db.results.find().forEach( function(result) {
    var value = result.value;
    delete value._id;
    db.results.update({_id: result._id}, value);
    // match on _id (not id) so the $unset targets the same document
    db.results.update({_id: result._id}, {$unset: {value: 1}});
} );
All the proposed solutions are far from optimal. The fastest you can do so far is something like:
var flattenMRCollection=function(dbName,collectionName) {
var collection=db.getSiblingDB(dbName)[collectionName];
var i=0;
var bulk=collection.initializeUnorderedBulkOp();
collection.find({ value: { $exists: true } }).addOption(16).forEach(function(result) {
print((++i));
//collection.update({_id: result._id},result.value);
bulk.find({_id: result._id}).replaceOne(result.value);
if(i%1000==0)
{
print("Executing bulk...");
bulk.execute();
bulk=collection.initializeUnorderedBulkOp();
}
});
bulk.execute();
};
Then call it:
flattenMRCollection("MyDB","MyMRCollection")
This is WAY faster than doing sequential updates.
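The batching pattern behind this speed-up can be sketched independently of MongoDB. In the following plain JavaScript example, flushInBatches and the execute callback are hypothetical names; execute stands in for whatever sends one batch to the server. Unlike the shell function above, this sketch skips the final flush when nothing is pending.

```javascript
// Collects items into batches of `batchSize` and hands each full batch to `execute`.
// Returns the number of batches that were flushed.
function flushInBatches(items, batchSize, execute) {
  let batch = [];
  let flushed = 0;
  for (const item of items) {
    batch.push(item);
    if (batch.length === batchSize) {
      execute(batch); // one round trip for the whole batch
      flushed++;
      batch = [];     // start a fresh batch
    }
  }
  if (batch.length > 0) {
    execute(batch);   // flush the remainder, but never an empty batch
    flushed++;
  }
  return flushed;
}

// Usage with a stand-in executor that only records batch sizes.
const sizes = [];
const items = Array.from({ length: 2500 }, (_, i) => i);
const batches = flushInBatches(items, 1000, (b) => sizes.push(b.length));
console.log(batches, sizes); // 3 [ 1000, 1000, 500 ]
```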
While experimenting with Vincent's answer, I found a couple of problems. Basically, if you perform updates within a foreach loop, this will move the document to the end of the collection and the cursor will reach that document again (example). This can be circumvented if $snapshot is used. Hence, I am providing a Java example below.
final List<WriteModel<Document>> bulkUpdate = new ArrayList<>();
// You should enable $snapshot if performing updates within foreach
collection.find(new Document().append("$query", new Document()).append("$snapshot", true)).forEach(new Block<Document>() {
@Override
public void apply(final Document document) {
// Note that I used incrementing long values for '_id'. Change to String if
// you used string '_id's
long docId = document.getLong("_id");
Document subDoc = (Document)document.get("value");
WriteModel<Document> m = new ReplaceOneModel<>(new Document().append("_id", docId), subDoc);
bulkUpdate.add(m);
// If you used non-incrementing '_id's, then you need to use a final object with a counter.
if(docId % 1000 == 0 && !bulkUpdate.isEmpty()) {
collection.bulkWrite(bulkUpdate);
bulkUpdate.removeAll(bulkUpdate);
}
}
});
// Fixing bug related to Vincent's answer.
if(!bulkUpdate.isEmpty()) {
collection.bulkWrite(bulkUpdate);
bulkUpdate.removeAll(bulkUpdate);
}
Note : This snippet takes an average of 7.4 seconds to execute on my machine with 100k records and 14 attributes (IMDB dataset). Without batching, it takes an average of 25.2 seconds.

Fast way to find duplicates on indexed column in mongodb

I have a collection of md5 hashes in mongodb and I'd like to find all duplicates. The md5 column is indexed. Do you know any fast way to do that using map reduce?
Or should I just iterate over all records and check for duplicates manually?
My current approach using map reduce iterates over the collection almost twice (assuming that there is very small amount of duplicates):
res = db.files.mapReduce(
function () {
emit(this.md5, 1);
},
function (key, vals) {
return Array.sum(vals);
}
)
db[res.result].find({value: {$gt: 1}}).forEach( // a count above 1 means the md5 is duplicated
function (obj) {
out.duplicates.insert(obj)
});
I personally found that on big databases (1TB and more) the accepted answer is terribly slow. Aggregation is much faster. An example is below:
db.places.aggregate([
    { $group : {_id : "$extra_info.id", total : { $sum : 1 } } },
    { $match : { total : { $gte : 2 } } },
    { $sort : {total : -1} },
    { $limit : 5 }
]);
It searches for extra_info.id values that are used two or more times, sorts the groups by their count in descending order, and returns the first 5 of them.
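For readers who want to follow the four stages step by step, here is a minimal in-memory equivalent in plain JavaScript. It is only a sketch of the logic, not a replacement for the aggregation: topDuplicates and the sample data are hypothetical.

```javascript
// Mirrors $group (count per key), $match (total >= 2), $sort (descending), $limit.
function topDuplicates(docs, keyOf, limit) {
  const totals = new Map();
  for (const doc of docs) {
    const key = keyOf(doc);
    totals.set(key, (totals.get(key) || 0) + 1); // $group with $sum: 1
  }
  return [...totals]
    .map(([_id, total]) => ({ _id, total }))
    .filter((group) => group.total >= 2)         // $match
    .sort((a, b) => b.total - a.total)           // $sort: { total: -1 }
    .slice(0, limit);                            // $limit
}

// Hypothetical documents with a nested extra_info.id field.
const places = ["x", "y", "x", "z", "y", "x"].map((id) => ({ extra_info: { id } }));
const top = topDuplicates(places, (p) => p.extra_info.id, 5);
console.log(top); // [ { _id: 'x', total: 3 }, { _id: 'y', total: 2 } ]
```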
The easiest way to do it in one pass is to sort by md5 and then process appropriately.
Something like:
var previous_md5;
db.files.find( {"md5" : {$exists:true} }, {"md5" : 1} ).sort( { "md5" : 1} ).forEach( function(current) {
if(current.md5 == previous_md5){
db.duplicates.update( {"_id" : current.md5}, { "$inc" : {count:1} }, true);
}
previous_md5 = current.md5;
});
That little script sorts the md5 entries and loops through them in order. If an md5 is repeated, the repeats will be "back-to-back" after sorting. So we just keep the previous value in previous_md5 and compare it to current.md5. If we find a duplicate, I'm dropping it into the duplicates collection (and using $inc to count the number of duplicates).
This script means that you only have to loop through the primary data set once. Then you can loop through the duplicates collection and perform clean-up.
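The same single-pass idea can be tried on a plain array. This is a minimal sketch that assumes the md5 values are available as strings in memory; countDuplicates is a hypothetical helper. As in the script above, the count is the number of extra copies, not the total number of occurrences.

```javascript
// Sorts the values, then walks them once comparing each value to the previous one.
// Returns a Map from duplicated value to the number of extra copies found.
function countDuplicates(md5s) {
  const sorted = md5s.slice().sort(); // copy, so the caller's array keeps its order
  const duplicates = new Map();
  let previous;
  for (const current of sorted) {
    if (current === previous) {
      // Equal neighbours after sorting mean a repeated value.
      duplicates.set(current, (duplicates.get(current) || 0) + 1);
    }
    previous = current;
  }
  return duplicates;
}

const dupes = countDuplicates(["b", "a", "b", "c", "a", "b"]);
console.log([...dupes]); // [ [ 'a', 1 ], [ 'b', 2 ] ]
```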
You can do a group by that field and then query to get the duplicated (having a count > 1). http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Group
Although, the fastest thing might be to just do a query which only returns that field and then to do the aggregation in the client. Group/Map-Reduce need to provide access to the whole document which is much more costly than just providing the data from the index (which is now covered in 1.7.3+).
If this is a general problem you need to run periodically, you might want to keep a collection which is just {md5:value, count:value} so you can skip the aggregation, and it will be extremely fast when you need to cull duplicates.