MongoDB aggregation slow count using $facet

I want to use $facet to build a single query that returns paged data, but I have noticed that it performs much worse than running two separate queries.
As a quick test I created a collection with 50000 random documents and ran the following test.
var x = new Date();
var a = {
count : db.getCollection("test").find({}).count(),
data: db.getCollection("test").find({}).skip(0).limit(10)
};
var y = new Date();
print('result ' + a);
print(y - x);
var x = new Date();
var a = db.getCollection("test").aggregate(
[
{
"$match" : {
}
},
{
"$facet" : {
"data": [
{
"$skip": 0
},
{
"$limit": 10
}
],
"pageInfo": [
{
"$group": {
"_id": null,
"count": {
"$sum": 1
}
}
}
]
}
}
]
)
var y = new Date();
print('result ' + a);
print(y - x);
The result is that two separate queries, one for find and one for count, take around 2 milliseconds, while the single aggregation query takes upwards of 500 milliseconds.
Why is it that the aggregation is so slow?
Update
Even just a count without a facet within an aggregation is slow
var x = new Date();
var a = db.getCollection("test").find({}).count();
var y = new Date();
print('result ' + a);
print(y - x);
var x = new Date();
var a = db.getCollection("test").aggregate(
[
{ "$count" : "count" }
]
)
var y = new Date();
print('result ' + a);
print(y - x);
In the above, with my test data set, the aggregation count takes 200 ms versus 2 ms for the count() method.
This issue extends into the Node.js MongoDB driver, where the count() method has been deprecated and replaced with countDocuments(). Under the hood, countDocuments() uses an aggregation rather than the count method on a find, so just like my example above it performs significantly worse, to the point that I will continue using the deprecated method over countDocuments().

Of course it is slow. The count() method just returns the cursor size after a query is applied (which does not necessarily require all documents to be read, depending on your query and indices). Furthermore, with an empty query, the query optimizer knows that all documents ought to be returned and basically only has to return length(_id_1).
Aggregations, by definition, do not work that way. Unless there is a match stage actually ruling out a document, each and every document is read from “disk” (MongoDB’s own cache and FS caches aside for the moment) for further processing.
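To make the difference concrete, here is a minimal sketch for the Node.js driver that keeps the page query and the count as two separate calls. The names fetchPage and isEmptyFilter are hypothetical helpers, not driver API. The sketch assumes that an approximate, metadata-based total from estimatedDocumentCount() is acceptable when no filter is applied; with a filter it falls back to countDocuments().

```javascript
// Hypothetical helper: true when the filter has no conditions at all.
function isEmptyFilter(filter) {
  return !filter || Object.keys(filter).length === 0;
}

// Sketch: run the page query and the count as two independent operations
// instead of one $facet pipeline. `collection` is assumed to be a Node.js
// driver Collection (or any object exposing the same methods).
async function fetchPage(collection, filter, skip, limit) {
  const dataPromise = collection.find(filter).skip(skip).limit(limit).toArray();
  // With no filter, the metadata-based estimate avoids reading documents.
  const countPromise = isEmptyFilter(filter)
    ? collection.estimatedDocumentCount()
    : collection.countDocuments(filter);
  const [data, count] = await Promise.all([dataPromise, countPromise]);
  return { data, count };
}
```

Both calls are started before either is awaited, so the total latency should be roughly that of the slower one rather than the sum.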

I am running into the same issue, and I hope that someone has a better answer than what was previously posted.
I have a "user" collection with 12 million users in it, using MongoDB 5.0.
My query looks like this:
db.users.aggregate([
{ '$sort': { updated_at: -1 } },
{ '$facet': {
results: [
{ $skip: 0 },
{ $limit: 20 }
],
total: [
{ $count: 'count' }
]
}
}
])
The query takes around 1 minute, so that is not acceptable.
I have an index on "updated_at", that is not the issue.
Also, I have this issue even if I run it directly on MongoShell in Compass. So it is not related to any NodeJs Mongo Driver as was previously suspected.
Can I somehow tell Mongo to use the estimated count here?
Or is there any other way to improve the query?

Related

Compare Time to Document Interval and Update

Use Case:
I've got a MongoDB collection with a couple million documents. Documents in this collection must be updated from time to time. Therefore I've set up a monitorFrequency field which defines that a specific document must be updated every 6, 12, 24 or 720 hours. Additionally I set up a field called lastRefreshAt, which is a timestamp of the last actual update.
The problem:
How can I select all documents from my collection profiles which need to be refreshed again (because more than monitorFrequency hours have passed since lastRefreshAt)?
Should I run a single query which only returns the documents that need to be refreshed, or should I rather iterate over all documents with a cursor and check in my Node application whether each document needs to be refreshed?
I know how to do approach #2, but I am not sure which approach to choose and what the query for #1 would look like.
There are a couple of approaches depending on available architecture and choices. Some are good choices and some are bad, but we might as well explain them all.
Use $where with multi-update
As a first option to examine, you could use $where to calculate the difference for selection and feed directly to .update() or .updateMany() for that matter:
db.profiles.update(
{
"$where": function() {
return (Date.now() - this.lastRefreshAt.valueOf())
> ( this.monitorFrequency * 1000 * 60 * 60 );
}
},
{ "$currentDate": { "lastRefreshAt": true } },
{ "multi": true }
)
Which pretty simply works out the milliseconds difference between the current "lastRefreshAt" value and the current Date value and compares that to the stored "monitorFrequency" converted into milliseconds itself.
The $currentDate is applied because it is a "multi" update and applied to all matched documents, so this ensures the "server timestamp" at the actual time of document update is applied to the document.
It's not fantastic as it does require a full collection scan in order to select the documents via calculation and thus cannot use an index. Plus it's JavaScript evaluation, which not being native code does add some overhead.
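To see why no index can help here, it may be useful to look at the same comparison as a plain function. This is only an illustration with made-up sample documents; needsRefresh is a hypothetical name. The result depends on two fields of each document plus the current time, so every document has to be evaluated.

```javascript
const ONE_HOUR_MS = 1000 * 60 * 60;

// Same comparison the $where function performs, as a plain function.
// `doc.lastRefreshAt` is a Date; `doc.monitorFrequency` is in hours.
function needsRefresh(doc, now) {
  const elapsedMs = now.getTime() - doc.lastRefreshAt.getTime();
  return elapsedMs > doc.monitorFrequency * ONE_HOUR_MS;
}

// Made-up sample data: both documents were refreshed 12 hours before `now`.
const now = new Date('2020-01-02T00:00:00Z');
const docs = [
  { _id: 1, monitorFrequency: 6, lastRefreshAt: new Date('2020-01-01T12:00:00Z') },
  { _id: 2, monitorFrequency: 24, lastRefreshAt: new Date('2020-01-01T12:00:00Z') },
];

// Only the document with the 6 hour frequency is overdue.
const stale = docs.filter(d => needsRefresh(d, now)).map(d => d._id);
console.log(stale);
```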
Loop the matched selection
So JavaScript is not that great a selection option in general when other options apply. Instead try using the aggregation framework for the calculation and loop the cursor result:
var ops = [];
db.profiles.aggregate([
{ "$redact": {
"$cond": {
"if": {
"$gt": [
{ "$subtract": [new Date(), "$lastRefreshAt"] },
{ "$multiply": ["$monitorFrequency", 1000 * 60 * 60] }
]
},
"then": "$$KEEP",
"else": "$$PRUNE"
}
}}
]).forEach(doc => {
ops.push({
"updateOne": {
"filter": { "_id": doc._id },
"update": { "$currentDate": { "lastRefreshAt": true } }
}
});
if ( ops.length > 1000 ) {
db.profiles.bulkWrite(ops);
ops = [];
}
})
if ( ops.length > 0 ) {
db.profiles.bulkWrite(ops);
ops = [];
}
So again that's a collection scan due to the calculation but it is done with native operators, so that part at least should be a bit faster. Also from a technical standpoint it's a little different because the new Date() is actually established at the time of request and not per document iterated as it would be using $where. Lacking an operator to produce the "current date" internally, there is no way for the aggregation framework to do this per iteration.
And of course, instead of just applying our "update" expression as it matches documents, we are looping the result cursor and applying a function. So whilst there are "some" gains, there is also additional overhead. Mileage may vary as to performance and practicality.
Parallel Updates
Personally I would do neither of the above and simply run a query selecting each marked "monitorFrequency" and looking for the dates between the boundaries that exceed the allowed difference.
As a simple example using NodeJS to implement Promise.all() for parallel calls:
const MongoClient = require('mongodb').MongoClient;
const oneHour = 1000 * 60 * 60;
(async function() {
let db;
try {
db = await MongoClient.connect('mongodb://localhost/test');
let collection = db.collection('profiles');
let intervals = [6, 12, 24, 720];
let snapDate = new Date();
await Promise.all(
intervals.map( (monitorFrequency,i) =>
collection.updateMany(
{
monitorFrequency,
"lastRefreshAt": Object.assign(
{ "$lt": new Date(snapDate.valueOf() - intervals[i] * oneHour) },
(i < intervals.length - 1) ?
{ "$gt": new Date(snapDate.valueOf() - intervals[i+1] * oneHour) }
: {}
)
},
{ "$currentDate": { "lastRefreshAt": true } },
)
)
);
} catch(e) {
console.error(e);
} finally {
db.close();
}
})();
This would allow you to index on the two fields and allow optimal selection, and since the "date ranges" are paired to their calculated difference from "monitorFrequency" then those documents that "require refresh" are the only ones that get selected for update.
Given the finite number of possible intervals this is what I would suspect to be the most optimal solution. But the construction along with the fact that the actual "update" portion remains consistent for each selection leads to one other option.
Use $or for each selection.
Much the same logic as above, but instead applied to build an $or condition for the "query" portion of a "single" update. It is an "array of criteria" after all, which is essentially the same as an "array of queries", which is what we are doing above. So just turn it around a little:
const oneHour = 1000 * 60 * 60;
let intervals = [6, 12, 24, 720];
let snapDate = new Date();
db.profiles.updateMany(
{
"$or": intervals.map( (monitorFrequency,i) =>
({
monitorFrequency,
"lastRefreshAt": Object.assign(
{ "$lt": new Date(snapDate.valueOf() - intervals[i] * oneHour) },
(i < intervals.length - 1) ?
{ "$gt": new Date(snapDate.valueOf() - intervals[i+1] * oneHour) }
: {}
)
})
)
},
{ "$currentDate": { "lastRefreshAt": true } }
)
This then becomes one simple statement and of course can actually use indexes where available. Generally this is what you should be doing, though as I have suggested my intuition tells me that 4 threads of execution constrained only by the slowest one gets the job done slightly faster. Again, mileage may vary on that but logic dictates that this is so.
So the basic lesson here is "whilst you may think" that the logical approach is to calculate the values and compare within the database itself, it's actually the worst possible thing you can do for query performance.
The simple approach is to work out the criteria that should select the documents you want "before" you issue the query statement to the server. This means you are looking at "concrete values" rather than "calculation results" in comparison. And "concrete values" can actually be indexed, which is generally what you want for database queries.
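As an illustration of "concrete values", the criteria can be built by a small pure function before anything is sent to the server. This is a sketch only: buildRefreshCriteria is a hypothetical name, and it follows the date-window logic of the $or example, adding the second bound only for intervals that have a successor.

```javascript
const ONE_HOUR_MS = 1000 * 60 * 60;

// Build one { monitorFrequency, lastRefreshAt } criterion per interval.
// Every bound is a concrete Date computed before the query is sent.
function buildRefreshCriteria(intervals, snapDate) {
  return intervals.map((monitorFrequency, i) => {
    const lastRefreshAt = {
      $lt: new Date(snapDate.getTime() - monitorFrequency * ONE_HOUR_MS),
    };
    // Only intervals that have a successor get the second date bound.
    if (i < intervals.length - 1) {
      lastRefreshAt.$gt = new Date(snapDate.getTime() - intervals[i + 1] * ONE_HOUR_MS);
    }
    return { monitorFrequency, lastRefreshAt };
  });
}

const snapDate = new Date('2020-01-31T00:00:00Z');
const criteria = buildRefreshCriteria([6, 12, 24, 720], snapDate);
// Usable as: db.profiles.updateMany({ $or: criteria }, { $currentDate: { lastRefreshAt: true } })
console.log(JSON.stringify(criteria, null, 2));
```

Because the function has no database dependency, the generated date bounds can be inspected or unit tested before the update is run.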

Combine $lte and $gte in mongodb for random sample from unknown source doc size [duplicate]

I am looking to get a random record from a huge collection (100 million records).
What is the fastest and most efficient way to do so?
The data is already there and there is no field in which I can generate a random number and obtain a random row.
Starting with the 3.2 release of MongoDB, you can get N random docs from a collection using the $sample aggregation pipeline operator:
// Get one random document from the mycoll collection.
db.mycoll.aggregate([{ $sample: { size: 1 } }])
If you want to select the random document(s) from a filtered subset of the collection, prepend a $match stage to the pipeline:
// Get one random document matching {a: 10} from the mycoll collection.
db.mycoll.aggregate([
{ $match: { a: 10 } },
{ $sample: { size: 1 } }
])
As noted in the comments, when size is greater than 1, there may be duplicates in the returned document sample.
Do a count of all records, generate a random number between 0 and the count, and then do:
db.yourCollection.find().limit(-1).skip(yourRandomNumber).next()
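A minimal sketch of the same idea for the Node.js driver follows. The names randomSkip and findRandomBySkip are hypothetical; the random source is injectable only so the offset calculation can be checked deterministically. Note that the offset must stay strictly below the count, otherwise the skip runs past the last document.

```javascript
// Pick a skip offset in [0, count). `rng` defaults to Math.random and is
// injectable so the function can be exercised deterministically.
function randomSkip(count, rng = Math.random) {
  if (count <= 0) throw new RangeError('collection is empty');
  return Math.floor(rng() * count);
}

// Sketch for the Node.js driver: count, then skip to a random position.
// `collection` is assumed to be a driver Collection.
async function findRandomBySkip(collection, filter = {}) {
  const count = await collection.countDocuments(filter);
  if (count === 0) return null;
  const docs = await collection
    .find(filter)
    .skip(randomSkip(count))
    .limit(1)
    .toArray();
  return docs[0] || null;
}
```

The skip still has to walk past the skipped entries on the server, so this is expected to get slower as the offset grows.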
Update for MongoDB 3.2
3.2 introduced $sample to the aggregation pipeline.
There's also a good blog post on putting it into practice.
For older versions (previous answer)
This was actually a feature request: http://jira.mongodb.org/browse/SERVER-533 but it was filed under "Won't fix."
The cookbook has a very good recipe to select a random document out of a collection: http://cookbook.mongodb.org/patterns/random-attribute/
To paraphrase the recipe, you assign random numbers to your documents:
db.docs.save( { key : 1, ..., random : Math.random() } )
Then select a random document:
rand = Math.random()
result = db.docs.findOne( { key : 2, random : { $gte : rand } } )
if ( result == null ) {
result = db.docs.findOne( { key : 2, random : { $lte : rand } } )
}
Querying with both $gte and $lte is necessary to find the document with a random number nearest rand.
And of course you'll want to index on the random field:
db.docs.ensureIndex( { key : 1, random :1 } )
If you're already querying against an index, simply drop it, append random: 1 to it, and add it again.
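The two-step lookup can be illustrated without a database. The sketch below is an in-memory stand-in, not the cookbook's code: pickByRandomField is a hypothetical name, the array plays the role of the collection, and the explicit sorts stand in for the order an index scan on the random field would give.

```javascript
// In-memory stand-in for the two findOne calls. Sorting makes the "nearest"
// choice explicit.
function pickByRandomField(docs, rand) {
  const above = docs
    .filter(d => d.random >= rand)
    .sort((a, b) => a.random - b.random)[0];
  if (above) return above;
  // Nothing at or above rand: fall back to the closest value below it.
  const below = docs
    .filter(d => d.random <= rand)
    .sort((a, b) => b.random - a.random)[0];
  return below || null;
}

// Made-up sample documents.
const docs = [
  { key: 2, random: 0.10 },
  { key: 2, random: 0.45 },
  { key: 2, random: 0.80 },
];
console.log(pickByRandomField(docs, 0.30)); // the document with random 0.45
console.log(pickByRandomField(docs, 0.95)); // falls back to the document with random 0.80
```

The example also hints at the skew mentioned elsewhere in this thread: a document whose random value sits just above a large gap is picked more often than its neighbours.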
You can also use MongoDB's geospatial indexing feature to select the documents 'nearest' to a random number.
First, enable geospatial indexing on a collection:
db.docs.ensureIndex( { random_point: '2d' } )
To create a bunch of documents with random points on the X-axis:
for ( i = 0; i < 10; ++i ) {
db.docs.insert( { key: i, random_point: [Math.random(), 0] } );
}
Then you can get a random document from the collection like this:
db.docs.findOne( { random_point : { $near : [Math.random(), 0] } } )
Or you can retrieve several documents nearest to a random point:
db.docs.find( { random_point : { $near : [Math.random(), 0] } } ).limit( 4 )
This requires only one query and no null checks, plus the code is clean, simple and flexible. You could even use the Y-axis of the geopoint to add a second randomness dimension to your query.
The following recipe is a little slower than the mongo cookbook solution (add a random key on every document), but returns more evenly distributed random documents. It's a little less-evenly distributed than the skip( random ) solution, but much faster and more fail-safe in case documents are removed.
function draw(collection, query) {
// query: mongodb query object (optional)
var query = query || { };
query['random'] = { $lte: Math.random() };
var cur = collection.find(query).sort({ random: -1 });
if (! cur.hasNext()) {
delete query.random;
cur = collection.find(query).sort({ random: -1 });
}
var doc = cur.next();
doc.random = Math.random();
collection.update({ _id: doc._id }, doc);
return doc;
}
It also requires a "random" field on your documents, so don't forget to add it when you create them. You may need to initialize your collection as shown by Geoffrey:
function addRandom(collection) {
collection.find().forEach(function (obj) {
obj.random = Math.random();
collection.save(obj);
});
}
db.eval(addRandom, db.things);
Benchmark results
This method is much faster than the skip() method (of ceejayoz) and generates more uniformly random documents than the "cookbook" method reported by Michael:
For a collection with 1,000,000 elements:
This method takes less than a millisecond on my machine
the skip() method takes 180 ms on average
The cookbook method will cause large numbers of documents to never get picked because their random number does not favor them.
This method will pick all elements evenly over time.
In my benchmark it was only 30% slower than the cookbook method.
the randomness is not 100% perfect but it is very good (and it can be improved if necessary)
This recipe is not perfect - the perfect solution would be a built-in feature as others have noted.
However it should be a good compromise for many purposes.
Here is a way using the default ObjectId values for _id and a little math and logic.
// Get the "min" and "max" timestamp values from the _id in the collection and the
// diff between.
// 4-bytes from a hex string is 8 characters
var min = parseInt(db.collection.find()
.sort({ "_id": 1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
max = parseInt(db.collection.find()
.sort({ "_id": -1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
diff = max - min;
// Get a random value from diff and divide/multiply by 1000 for the "_id" precision:
var random = Math.floor(Math.floor(Math.random()*diff)/1000)*1000;
// Use "random" in the range and pad the hex string to a valid ObjectId
var _id = new ObjectId(((min + random)/1000).toString(16) + "0000000000000000")
// Then query for the single document:
var randomDoc = db.collection.find({ "_id": { "$gte": _id } })
.sort({ "_id": 1 }).limit(1).toArray()[0];
That's the general logic in shell representation and easily adaptable.
So in points:
Find the min and max primary key values in the collection
Generate a random number that falls between the timestamps of those documents.
Add the random number to the minimum value and find the first document that is greater than or equal to that value.
This uses "padding" from the timestamp value in "hex" to form a valid ObjectId value since that is what we are looking for. Using integers as the _id value is essentially simpler but the same basic idea in the points.
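The padding step can be isolated into a small helper, sketched here in plain JavaScript so it runs outside the shell. The function names and the two sample dates are assumptions for illustration; the only fact relied on is that an ObjectId begins with a 4-byte timestamp in seconds, which is 8 hex characters.

```javascript
// An ObjectId starts with a 4-byte timestamp in seconds, i.e. the first
// 8 hex characters. The remaining 8 bytes are padded with zeros.
function objectIdHexFromMillis(ms) {
  const seconds = Math.floor(ms / 1000);
  return seconds.toString(16).padStart(8, '0') + '0'.repeat(16);
}

// Random instant in [minMs, maxMs), in milliseconds.
function randomMillisBetween(minMs, maxMs, rng = Math.random) {
  return minMs + Math.floor(rng() * (maxMs - minMs));
}

// Made-up range standing in for the min and max _id timestamps.
const minMs = Date.parse('2020-01-01T00:00:00Z');
const maxMs = Date.parse('2021-01-01T00:00:00Z');
const hex = objectIdHexFromMillis(randomMillisBetween(minMs, maxMs));
// In the shell: db.collection.find({ _id: { $gte: ObjectId(hex) } }).sort({ _id: 1 }).limit(1)
console.log(hex);
```

The padStart call matters for small timestamps: without it the string could be shorter than 24 characters and would not form a valid ObjectId.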
Now you can use the aggregate.
Example:
db.users.aggregate(
[ { $sample: { size: 3 } } ]
)
See the doc.
In Python using pymongo:
import random
def get_random_doc():
    # count_documents() replaces the count() method of older PyMongo versions
    count = collection.count_documents({})
    return collection.find()[random.randrange(count)]
Using Python (pymongo), the aggregate function also works.
collection.aggregate([{'$sample': {'size': sample_size }}])
This approach is a lot faster than running a query for a random number (e.g. collection.find([random_int])), especially for large collections.
It is tough if there is no data there to key off of. What is the _id field? Are they MongoDB ObjectIds? If so, you could get the highest and lowest values:
lowest = db.coll.find().sort({_id:1}).limit(1).next()._id;
highest = db.coll.find().sort({_id:-1}).limit(1).next()._id;
then if you assume the id's are uniformly distributed (but they aren't, but at least it's a start):
unsigned long long L = first_8_bytes_of(lowest)
unsigned long long H = first_8_bytes_of(highest)
V = (H - L) * random_from_0_to_1();
N = L + V;
oid = N concat random_4_bytes();
randomobj = db.coll.find({_id:{$gte:oid}}).limit(1);
You can pick a random timestamp and search for the first object that was created afterwards.
It will only scan a single document, though it doesn't necessarily give you a uniform distribution.
var randRec = function() {
// replace with your collection
var coll = db.collection
// get unixtime of first and last record
var min = coll.find().sort({_id: 1}).limit(1)[0]._id.getTimestamp() - 0;
var max = coll.find().sort({_id: -1}).limit(1)[0]._id.getTimestamp() - 0;
// allow to pass additional query params
return function(query) {
if (typeof query === 'undefined') query = {}
var randTime = Math.round(Math.random() * (max - min)) + min;
var hexSeconds = Math.floor(randTime / 1000).toString(16);
var id = ObjectId(hexSeconds + "0000000000000000");
query._id = {$gte: id}
return coll.find(query).limit(1)
};
}();
My solution on php:
/**
* Get random docs from Mongo
* @param $collection
* @param $where
* @param $fields
* @param $limit
* @author happy-code
* @url happy-code.com
*/
private function _mongodb_get_random (MongoCollection $collection, $where = array(), $fields = array(), $limit = false) {
// Total docs
$count = $collection->find($where, $fields)->count();
if (!$limit) {
// Get all docs
$limit = $count;
}
$data = array();
for( $i = 0; $i < $limit; $i++ ) {
// Skip documents
$skip = rand(0, ($count-1) );
if ($skip !== 0) {
$doc = $collection->find($where, $fields)->skip($skip)->limit(1)->getNext();
} else {
$doc = $collection->find($where, $fields)->limit(1)->getNext();
}
if (is_array($doc)) {
// Catch document
$data[ $doc['_id']->{'$id'} ] = $doc;
// Ignore current document when making the next iteration
$where['_id']['$nin'][] = $doc['_id'];
}
// Every iteration catch document and decrease in the total number of document
$count--;
}
return $data;
}
To get a fixed number of random docs without duplicates:
first get all ids
get the number of documents
loop, picking a random index and skipping duplicates
number_of_docs=7
db.collection('preguntas').find({},{_id:1}).toArray(function(err, arr) {
count=arr.length
idsram=[]
rans=[]
while(number_of_docs!=0){
var R = Math.floor(Math.random() * count);
if (rans.indexOf(R) > -1) {
continue
} else {
rans.push(R)
idsram.push(arr[R]._id)
number_of_docs--
}
}
db.collection('preguntas').find({ _id: { $in: idsram } }).toArray(function(err1, doc1) {
if (err1) { console.log(err1); return; }
res.send(doc1)
});
});
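The retry loop above can spin for a long time when the requested number is close to the collection size. As an alternative to the technique in the answer (not the author's method), a partial Fisher-Yates shuffle picks k distinct ids in exactly k steps. The function name and the sample ids are made up for illustration.

```javascript
// Return k distinct elements of `items` chosen uniformly at random, using a
// partial Fisher-Yates shuffle on a copy (the input array is not modified).
function sampleWithoutReplacement(items, k, rng = Math.random) {
  if (k > items.length) throw new RangeError('k exceeds number of items');
  const pool = items.slice();
  for (let i = 0; i < k; i++) {
    // Choose j uniformly from the not-yet-fixed tail [i, pool.length).
    const j = i + Math.floor(rng() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, k);
}

// Stand-in for the array of _id values fetched from the collection.
const ids = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'];
const picked = sampleWithoutReplacement(ids, 7);
// `picked` can then be used in a filter such as { _id: { $in: picked } }
console.log(picked);
```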
The best way in Mongoose is to make an aggregation call with $sample.
However, Mongoose does not apply Mongoose documents to Aggregation - especially not if populate() is to be applied as well.
For getting a "lean" array from the database:
/*
Sample model should be init first
const Sample = mongoose …
*/
const samples = await Sample.aggregate([
{ $match: {} },
{ $sample: { size: 33 } },
]).exec();
console.log(samples); //a lean Array
For getting an array of mongoose documents:
const samples = (
await Sample.aggregate([
{ $match: {} },
{ $sample: { size: 27 } },
{ $project: { _id: 1 } },
]).exec()
).map(v => v._id);
const mongooseSamples = await Sample.find({ _id: { $in: samples } });
console.log(mongooseSamples); //an Array of mongoose documents
I would suggest using map/reduce, where you use the map function to only emit when a random value is above a given probability.
function mapf() {
if(Math.random() <= probability) {
emit(1, this);
}
}
function reducef(key,values) {
return {"documents": values};
}
res = db.questions.mapReduce(mapf, reducef, {"out": {"inline": 1}, "scope": { "probability": 0.5}});
printjson(res.results);
The reducef function above works because only one key ('1') is emitted from the map function.
The value of the "probability" is defined in the "scope", when invoking mapReduce(...)
Using mapReduce like this should also be usable on a sharded db.
If you want to select exactly n of m documents from the db, you could do it like this:
function mapf() {
if(countSubset == 0) return;
var prob = countSubset / countTotal;
if(Math.random() <= prob) {
emit(1, {"documents": [this]});
countSubset--;
}
countTotal--;
}
function reducef(key,values) {
var newArray = new Array();
for(var i=0; i < values.length; i++) {
newArray = newArray.concat(values[i].documents);
}
return {"documents": newArray};
}
res = db.questions.mapReduce(mapf, reducef, {"out": {"inline": 1}, "scope": {"countTotal": 4, "countSubset": 2}})
printjson(res.results);
Where "countTotal" (m) is the number of documents in the db, and "countSubset" (n) is the number of documents to retrieve.
This approach might give some problems on sharded databases.
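The arithmetic behind the "exactly n of m" map function is the classic selection sampling scheme, which can be sketched in plain JavaScript to show why it always yields n items. This is an illustration, not the mapReduce code itself; selectExactly is a hypothetical name, and the shared counters it relies on are the likely reason the map/reduce version is fragile when work is split across shards.

```javascript
// Selection sampling: walk the items once and keep each one with probability
// (still needed) / (still remaining). Returns exactly n items when n <= m.
function selectExactly(items, n, rng = Math.random) {
  let needed = Math.min(n, items.length);
  let remaining = items.length;
  const chosen = [];
  for (const item of items) {
    if (needed === 0) break;
    // When needed === remaining the probability is 1, so the quota is always met.
    if (rng() < needed / remaining) {
      chosen.push(item);
      needed--;
    }
    remaining--;
  }
  return chosen;
}

console.log(selectExactly(['q1', 'q2', 'q3', 'q4'], 2));
```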
You can pick random _id and return corresponding object:
db.collection.count( function(err, count){
db.collection.distinct( "_id" , function( err, result) {
if (err)
res.send(err)
var randomId = result[Math.floor(Math.random() * count)]
db.collection.findOne( { _id: randomId } , function( err, result) {
if (err)
res.send(err)
console.log(result)
})
})
})
Here you don't need to spend space on storing random numbers in the collection.
The following aggregation operation randomly selects 3 documents from the collection:
db.users.aggregate(
[ { $sample: { size: 3 } } ]
)
https://docs.mongodb.com/manual/reference/operator/aggregation/sample/
MongoDB now has $rand
To pick n non-repeating items, aggregate with { $addFields: { _f: { $rand: {} } } }, then $sort by _f and $limit n.
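Spelled out as a pipeline, this could look like the sketch below. The builder name, the optional $match stage and the final $project are additions for illustration; as far as I know, $rand requires MongoDB 4.4.2 or later. Sorting on a computed field cannot use an index, so this is likely to be costly on large collections.

```javascript
// Build a pipeline that tags each document with a random value, sorts by it
// and keeps the first n. `_f` is the temporary field name used in the text.
function randomOrderPipeline(n, filter = {}) {
  return [
    { $match: filter },
    { $addFields: { _f: { $rand: {} } } },
    { $sort: { _f: 1 } },
    { $limit: n },
    { $project: { _f: 0 } }, // drop the temporary sort key from the output
  ];
}

// The filter here is a made-up example.
const pipeline = randomOrderPipeline(5, { status: 'A' });
// e.g. db.coll.aggregate(pipeline)
console.log(JSON.stringify(pipeline));
```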
I'd suggest adding a random int field to each object. Then you can just do a
findOne({random_field: {$gte: rand()}})
to pick a random document. Just make sure you ensureIndex({random_field:1})
When I was faced with a similar problem, I backtracked and found that the business request was actually for creating some form of rotation of the inventory being presented. In that case, there are much better options, which have answers from search engines like Solr, not data stores like MongoDB.
In short, with the requirement to "intelligently rotate" content, what we should do instead of a random number across all of the documents is to include a personal q score modifier. To implement this yourself, assuming a small population of users, you can store a document per user that has the productId, impression count, click-through count, last seen date, and whatever other factors the business finds as being meaningful to compute a q score modifier. When retrieving the set to display, typically you request more documents from the data store than requested by the end user, then apply the q score modifier, take the number of records requested by the end user, then randomize the page of results, a tiny set, so simply sort the documents in the application layer (in memory).
If the universe of users is too large, you can categorize users into behavior groups and index by behavior group rather than user.
If the universe of products is small enough, you can create an index per user.
I have found this technique to be much more efficient, but more importantly more effective in creating a relevant, worthwhile experience of using the software solution.
None of the solutions worked well for me, especially when there are many gaps and the set is small.
This worked very well for me (in PHP):
$count = $collection->count($search);
$skip = mt_rand(0, $count - 1);
$result = $collection->find($search)->skip($skip)->limit(1)->getNext();
My PHP/MongoDB sort/order by RANDOM solution. Hope this helps anyone.
Note: I have numeric ID's within my MongoDB collection that refer to a MySQL database record.
First I create an array with 10 randomly generated numbers
$randomNumbers = [];
for($i = 0; $i < 10; $i++){
$randomNumbers[] = rand(0,1000);
}
In my aggregation I use the $addFields pipeline stage combined with $arrayElemAt and $mod (modulus). The modulus operator gives me a number from 0 to 9, which I then use to pick a number from the array of randomly generated numbers.
$aggregate[] = [
'$addFields' => [
'random_sort' => [ '$arrayElemAt' => [ $randomNumbers, [ '$mod' => [ '$my_numeric_mysql_id', 10 ] ] ] ],
],
];
After that you can use the sort Pipeline.
$aggregate[] = [
'$sort' => [
'random_sort' => 1
]
];
My simplest solution to this ...
db.coll.find()
.limit(1)
.skip(Math.floor(Math.random() * 500))
.next()
This assumes the collection has at least 500 items.
If you have a simple id key, you could store all the id's in an array, and then pick a random id. (Ruby answer):
ids = @coll.find({},fields:{_id:1}).to_a
@coll.find(ids.sample).first
Using Map/Reduce, you can certainly get a random record, just not necessarily very efficiently depending on the size of the resulting filtered collection you end up working with.
I've tested this method with 50,000 documents (the filter reduces it to about 30,000), and it executes in approximately 400ms on an Intel i3 with 16GB ram and a SATA3 HDD...
db.toc_content.mapReduce(
/* map function */
function() { emit( 1, this._id ); },
/* reduce function */
function(k,v) {
var r = Math.floor((Math.random()*v.length));
return v[r];
},
/* options */
{
out: { inline: 1 },
/* Filter the collection to "A"ctive documents */
query: { status: "A" }
}
);
The Map function simply creates an array of the id's of all documents that match the query. In my case I tested this with approximately 30,000 out of the 50,000 possible documents.
The Reduce function simply picks a random integer between 0 and the number of items (-1) in the array, and then returns that _id from the array.
400ms sounds like a long time, and it really is. If you had fifty million records instead of fifty thousand, the overhead may increase to the point where it becomes unusable in multi-user situations.
There is an open issue for MongoDB to include this feature in the core... https://jira.mongodb.org/browse/SERVER-533
If this "random" selection was built into an index-lookup instead of collecting ids into an array and then selecting one, this would help incredibly. (go vote it up!)
This works nice, it's fast, works with multiple documents and doesn't require populating rand field, which will eventually populate itself:
add index to .rand field on your collection
use find and refresh, something like:
// Install packages:
// npm install mongodb async
// Add index in mongo:
// db.ensureIndex('mycollection', { rand: 1 })
var mongodb = require('mongodb')
var async = require('async')
// Find n random documents by using "rand" field.
function findAndRefreshRand (collection, n, fields, done) {
var result = []
var rand = Math.random()
// Append documents to the result based on criteria and options, if options.limit is 0 skip the call.
var appender = function (criteria, options, done) {
return function (done) {
if (options.limit > 0) {
collection.find(criteria, fields, options).toArray(
function (err, docs) {
if (!err && Array.isArray(docs)) {
Array.prototype.push.apply(result, docs)
}
done(err)
}
)
} else {
async.nextTick(done)
}
}
}
async.series([
// Fetch docs with uninitialized .rand.
// NOTE: You can comment out this step if all docs have initialized .rand = Math.random()
appender({ rand: { $exists: false } }, { limit: n - result.length }),
// Fetch on one side of random number.
appender({ rand: { $gte: rand } }, { sort: { rand: 1 }, limit: n - result.length }),
// Continue fetch on the other side.
appender({ rand: { $lt: rand } }, { sort: { rand: -1 }, limit: n - result.length }),
// Refresh fetched docs, if any.
function (done) {
if (result.length > 0) {
var batch = collection.initializeUnorderedBulkOp({ w: 0 })
for (var i = 0; i < result.length; ++i) {
batch.find({ _id: result[i]._id }).updateOne({ $set: { rand: Math.random() } })
}
batch.execute(done)
} else {
async.nextTick(done)
}
}
], function (err) {
done(err, result)
})
}
// Example usage
mongodb.MongoClient.connect('mongodb://localhost:27017/core-development', function (err, db) {
if (!err) {
findAndRefreshRand(db.collection('profiles'), 1024, { _id: true, rand: true }, function (err, result) {
if (!err) {
console.log(result)
} else {
console.error(err)
}
db.close()
})
} else {
console.error(err)
}
})
P.S. The question "How to find random records in mongodb" is marked as a duplicate of this question. The difference is that this question asks explicitly about a single record, while the other one asks explicitly about getting random documents.
For me, I wanted to get the same records, in a random order, so I created an empty array used to sort, then generated random numbers between one and 7( I have seven fields). So each time I get a different value, I assign a different random sort.
It is 'layman' but it worked for me.
//generate random number
const randomval = some random value;
//declare sort array and initialize to empty
const sort = [];
//write a conditional if else to get to decide which sort to use
if(randomval == 1)
{
sort.push(...['createdAt',1]);
}
else if(randomval == 2)
{
sort.push(...['_id',1]);
}
....
else if(randomval == n)
{
sort.push(...['n',1]);
}
If you're using Mongoid, the document-to-object wrapper, you can do the following in Ruby (assuming your model is User):
User.all.to_a[rand(User.count)]
In my .irbrc, I have
def rando klass
klass.all.to_a[rand(klass.count)]
end
so in rails console, I can do, for example,
rando User
rando Article
to get documents randomly from any collection.
you can also use shuffle-array after executing your query
var shuffle = require('shuffle-array');
Accounts.find(qry,function(err,results_array){
newIndexArr=shuffle(results_array);
What works efficiently and reliably is this:
Add a field called "random" to each document and assign a random value to it, add an index for the random field and proceed as follows:
Let's assume we have a collection of web links called "links" and we want a random link from it:
link = db.links.find().sort({random: 1}).limit(1)[0]
To ensure the same link won't pop up a second time, update its random field with a new random number:
db.links.update({ _id: link._id }, { $set: { random: Math.random() } })

Mongodb 3.0 equivalent to $sample [duplicate]

I am looking to get a random record from a huge collection (100 million records).
What is the fastest and most efficient way to do so?
The data is already there and there are no field in which I can generate a random number and obtain a random row.
Starting with the 3.2 release of MongoDB, you can get N random docs from a collection using the $sample aggregation pipeline operator:
// Get one random document from the mycoll collection.
db.mycoll.aggregate([{ $sample: { size: 1 } }])
If you want to select the random document(s) from a filtered subset of the collection, prepend a $match stage to the pipeline:
// Get one random document matching {a: 10} from the mycoll collection.
db.mycoll.aggregate([
{ $match: { a: 10 } },
{ $sample: { size: 1 } }
])
As noted in the comments, when size is greater than 1, there may be duplicates in the returned document sample.
Do a count of all records, generate a random number between 0 and the count, and then do:
db.yourCollection.find().limit(-1).skip(yourRandomNumber).next()
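A minimal sketch of the offset arithmetic behind the count-and-skip approach, in plain JavaScript with no database connection. The helper name randomSkipOffset and its injectable rng parameter are assumptions made for illustration; the returned value is what would be passed to skip() as yourRandomNumber.

```javascript
// Hypothetical helper: choose a skip offset in the range [0, count).
// rng defaults to Math.random and is injectable so the result can be checked.
function randomSkipOffset(count, rng = Math.random) {
  if (!Number.isInteger(count) || count <= 0) {
    throw new RangeError("count must be a positive integer");
  }
  // Math.floor keeps the offset at most count - 1, so skip() never
  // runs past the last document.
  return Math.floor(rng() * count);
}

// Example: with 100 documents the offset is always between 0 and 99.
const offset = randomSkipOffset(100);
console.log(offset);
```

Using Math.floor rather than Math.round matters here: rounding could produce an offset equal to the count, which skips every document.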
Update for MongoDB 3.2
3.2 introduced $sample to the aggregation pipeline.
There's also a good blog post on putting it into practice.
For older versions (previous answer)
This was actually a feature request: http://jira.mongodb.org/browse/SERVER-533 but it was filed under "Won't fix."
The cookbook has a very good recipe to select a random document out of a collection: http://cookbook.mongodb.org/patterns/random-attribute/
To paraphrase the recipe, you assign random numbers to your documents:
db.docs.save( { key : 1, ..., random : Math.random() } )
Then select a random document:
rand = Math.random()
result = db.docs.findOne( { key : 2, random : { $gte : rand } } )
if ( result == null ) {
result = db.docs.findOne( { key : 2, random : { $lte : rand } } )
}
Querying with both $gte and $lte is necessary to find the document with a random number nearest rand.
And of course you'll want to index on the random field:
db.docs.ensureIndex( { key : 1, random :1 } )
If you're already querying against an index, simply drop it, append random: 1 to it, and add it again.
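The lookup logic of this recipe can be sketched over a plain array, without a database. The function name pickByRandomAttribute is hypothetical, and the sketch assumes the $gte query walks the index in ascending order; the fallback here takes the value nearest to rand from below, which in the shell would need an explicit sort({ random: -1 }).

```javascript
// In-memory stand-in for the indexed lookup; docs is a plain array here.
function pickByRandomAttribute(docs, rand) {
  // Counterpart of findOne({ random: { $gte: rand } }) on an ascending index:
  // the smallest stored value that is >= rand.
  const above = docs
    .filter((d) => d.random >= rand)
    .sort((a, b) => a.random - b.random)[0];
  if (above !== undefined) return above;
  // Fallback, counterpart of the $lte query: reached only when rand is larger
  // than every stored value. The largest value at or below rand is taken.
  const below = docs
    .filter((d) => d.random <= rand)
    .sort((a, b) => b.random - a.random)[0];
  return below === undefined ? null : below;
}

const docs = [
  { key: "a", random: 0.1 },
  { key: "b", random: 0.5 },
  { key: "c", random: 0.9 },
];
console.log(pickByRandomAttribute(docs, 0.3).key); // "b"
console.log(pickByRandomAttribute(docs, 0.95).key); // "c"
```

The example also shows the bias of the pattern: "b" is returned for every rand between 0.1 and 0.5, so a document that follows a large gap is picked more often than its neighbours.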
You can also use MongoDB's geospatial indexing feature to select the documents 'nearest' to a random number.
First, enable geospatial indexing on a collection:
db.docs.ensureIndex( { random_point: '2d' } )
To create a bunch of documents with random points on the X-axis:
for ( i = 0; i < 10; ++i ) {
db.docs.insert( { key: i, random_point: [Math.random(), 0] } );
}
Then you can get a random document from the collection like this:
db.docs.findOne( { random_point : { $near : [Math.random(), 0] } } )
Or you can retrieve several documents nearest to a random point:
db.docs.find( { random_point : { $near : [Math.random(), 0] } } ).limit( 4 )
This requires only one query and no null checks, plus the code is clean, simple and flexible. You could even use the Y-axis of the geopoint to add a second randomness dimension to your query.
The following recipe is a little slower than the mongo cookbook solution (add a random key on every document), but returns more evenly distributed random documents. It's a little less-evenly distributed than the skip( random ) solution, but much faster and more fail-safe in case documents are removed.
function draw(collection, query) {
// query: mongodb query object (optional)
query = query || { };
query['random'] = { $lte: Math.random() };
// sort on the same "random" field that the query filters on
var cur = collection.find(query).sort({ random: -1 });
if (! cur.hasNext()) {
delete query.random;
cur = collection.find(query).sort({ random: -1 });
}
}
var doc = cur.next();
doc.random = Math.random();
collection.update({ _id: doc._id }, doc);
return doc;
}
It also requires a "random" field holding a random value on every document, so don't forget to set it when you create them. You may need to initialize an existing collection first, as shown by Geoffrey:
function addRandom(collection) {
collection.find().forEach(function (obj) {
obj.random = Math.random();
collection.save(obj);
});
}
db.eval(addRandom, db.things);
Benchmark results
This method is much faster than the skip() method (of ceejayoz) and generates more uniformly random documents than the "cookbook" method reported by Michael:
For a collection with 1,000,000 elements:
This method takes less than a millisecond on my machine
the skip() method takes 180 ms on average
The cookbook method will cause large numbers of documents to never get picked because their random number does not favor them.
This method will pick all elements evenly over time.
In my benchmark it was only 30% slower than the cookbook method.
the randomness is not 100% perfect but it is very good (and it can be improved if necessary)
This recipe is not perfect - the perfect solution would be a built-in feature as others have noted.
However it should be a good compromise for many purposes.
Here is a way using the default ObjectId values for _id and a little math and logic.
// Get the "min" and "max" timestamp values from the _id in the collection and the
// diff between.
// 4-bytes from a hex string is 8 characters
var min = parseInt(db.collection.find()
.sort({ "_id": 1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
max = parseInt(db.collection.find()
.sort({ "_id": -1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
diff = max - min;
// Get a random value from diff and divide/multiply by 1000 for the "_id" precision:
var random = Math.floor(Math.floor(Math.random()*diff)/1000)*1000;
// Use "random" in the range and pad the hex string to a valid ObjectId
var _id = new ObjectId(((min + random)/1000).toString(16) + "0000000000000000")
// Then query for the single document:
var randomDoc = db.collection.find({ "_id": { "$gte": _id } })
.sort({ "_id": 1 }).limit(1).toArray()[0];
That's the general logic in shell representation and easily adaptable.
So in points:
Find the min and max primary key values in the collection
Generate a random number that falls between the timestamps of those documents.
Add the random number to the minimum value and find the first document that is greater than or equal to that value.
This uses "padding" from the timestamp value in "hex" to form a valid ObjectId value, since that is what we are looking for. Using integers as the _id value is essentially simpler, but the same basic idea applies.
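The padding step can be checked on its own in plain JavaScript. The helper names objectIdHexFromSeconds and randomSecondBetween are made up for this sketch; the resulting 24-character string is what would be handed to ObjectId(...) in the shell.

```javascript
// Build the 24-character hex form of an ObjectId whose timestamp part is
// `seconds` (Unix time) and whose remaining 8 bytes are zero.
function objectIdHexFromSeconds(seconds) {
  if (!Number.isInteger(seconds) || seconds < 0 || seconds > 0xffffffff) {
    throw new RangeError("seconds must fit in an unsigned 32-bit integer");
  }
  // 4 timestamp bytes -> 8 hex characters, left-padded with zeros
  const timestampHex = seconds.toString(16).padStart(8, "0");
  return timestampHex + "0".repeat(16);
}

// Pick a random whole second between two timestamps, both ends included.
function randomSecondBetween(minSeconds, maxSeconds, rng = Math.random) {
  return minSeconds + Math.floor(rng() * (maxSeconds - minSeconds + 1));
}

const hex = objectIdHexFromSeconds(randomSecondBetween(1500000000, 1600000000));
console.log(hex, hex.length); // the length is always 24
```

Padding the timestamp to 8 characters explicitly guards against small values whose hex form is shorter, which would otherwise produce an invalid ObjectId string.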
Now you can use the aggregate.
Example:
db.users.aggregate(
[ { $sample: { size: 3 } } ]
)
See the doc.
In Python using pymongo:
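import random
def get_random_doc():
    # "collection" is assumed to be an existing pymongo Collection;
    # count_documents() replaces count(), which was removed in PyMongo 4
    count = collection.count_documents({})
    return collection.find()[random.randrange(count)]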
Using Python (pymongo), the aggregate function also works.
collection.aggregate([{'$sample': {'size': sample_size }}])
This approach is a lot faster than running a query for a random number (e.g. collection.find([random_int])), especially for large collections.
It is tough if there is no data there to key off of. What are the _id fields? Are they MongoDB ObjectIds? If so, you could get the highest and lowest values:
lowest = db.coll.find().sort({_id:1}).limit(1).next()._id;
highest = db.coll.find().sort({_id:-1}).limit(1).next()._id;
Then, if you assume the ids are uniformly distributed (they aren't, but at least it's a start):
unsigned long long L = first_8_bytes_of(lowest)
unsigned long long H = first_8_bytes_of(highest)
V = (H - L) * random_from_0_to_1();
N = L + V;
oid = N concat random_4_bytes();
randomobj = db.coll.find({_id:{$gte:oid}}).limit(1);
You can pick a random timestamp and search for the first object that was created afterwards.
It will only scan a single document, though it doesn't necessarily give you a uniform distribution.
var randRec = function() {
// replace with your collection
var coll = db.collection
// get unixtime of first and last record
var min = coll.find().sort({_id: 1}).limit(1)[0]._id.getTimestamp() - 0;
var max = coll.find().sort({_id: -1}).limit(1)[0]._id.getTimestamp() - 0;
// allow to pass additional query params
return function(query) {
if (typeof query === 'undefined') query = {}
var randTime = Math.round(Math.random() * (max - min)) + min;
var hexSeconds = Math.floor(randTime / 1000).toString(16);
var id = ObjectId(hexSeconds + "0000000000000000");
query._id = {$gte: id}
return coll.find(query).limit(1)
};
}();
My solution on php:
/**
* Get random docs from Mongo
* @param $collection
* @param $where
* @param $fields
* @param $limit
* @author happy-code
* @url happy-code.com
*/
private function _mongodb_get_random (MongoCollection $collection, $where = array(), $fields = array(), $limit = false) {
// Total docs
$count = $collection->find($where, $fields)->count();
if (!$limit) {
// Get all docs
$limit = $count;
}
$data = array();
for( $i = 0; $i < $limit; $i++ ) {
// Skip documents
$skip = rand(0, ($count-1) );
if ($skip !== 0) {
$doc = $collection->find($where, $fields)->skip($skip)->limit(1)->getNext();
} else {
$doc = $collection->find($where, $fields)->limit(1)->getNext();
}
if (is_array($doc)) {
// Catch document
$data[ $doc['_id']->{'$id'} ] = $doc;
// Ignore current document when making the next iteration
$where['_id']['$nin'][] = $doc['_id'];
}
// Every iteration catch document and decrease in the total number of document
$count--;
}
return $data;
}
To get a fixed number of random docs without duplicates:
first get all ids
get the number of documents
loop, picking a random index and skipping duplicates
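var number_of_docs = 7
db.collection('preguntas').find({}, {_id: 1}).toArray(function (err, arr) {
  if (err) { console.log(err); return; }
  var count = arr.length
  var idsram = []
  var rans = []
  // never ask for more documents than exist, or the loop would not end
  var remaining = Math.min(number_of_docs, count)
  while (remaining != 0) {
    var R = Math.floor(Math.random() * count);
    if (rans.indexOf(R) > -1) {
      continue
    } else {
      rans.push(R)
      idsram.push(arr[R]._id)
      remaining--
    }
  }
  // fetch only the documents whose ids were picked
  db.collection('preguntas').find({_id: {$in: idsram}}).toArray(function (err1, doc1) {
    if (err1) { console.log(err1); return; }
    res.send(doc1)
  });
});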
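The retry loop above slows down as the number of requested documents approaches the collection size, because more and more draws hit an index that was already taken. A possible alternative, sketched here as a swapped-in technique rather than the answer's own method, is a partial Fisher-Yates shuffle; pickDistinctIndices is a hypothetical helper name.

```javascript
// Return n distinct indices from 0..count-1 using a partial Fisher-Yates shuffle.
function pickDistinctIndices(count, n, rng = Math.random) {
  if (n > count) {
    throw new RangeError("cannot pick more indices than there are documents");
  }
  const indices = Array.from({ length: count }, (_, i) => i);
  for (let i = 0; i < n; i++) {
    // Swap position i with a random position in the not-yet-chosen tail
    const j = i + Math.floor(rng() * (count - i));
    [indices[i], indices[j]] = [indices[j], indices[i]];
  }
  return indices.slice(0, n);
}

const picked = pickDistinctIndices(20, 7);
console.log(picked.length, new Set(picked).size); // 7 7
```

Each index is drawn exactly once, so the loop always finishes in n steps. The indices would then be mapped to _id values the same way as idsram above.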
The best way in Mongoose is to make an aggregation call with $sample.
However, aggregation results are plain objects rather than Mongoose documents, which matters especially if populate() is to be applied as well.
For getting a "lean" array from the database:
/*
Sample model should be init first
const Sample = mongoose …
*/
const samples = await Sample.aggregate([
{ $match: {} },
{ $sample: { size: 33 } },
]).exec();
console.log(samples); //a lean Array
For getting an array of mongoose documents:
const samples = (
await Sample.aggregate([
{ $match: {} },
{ $sample: { size: 27 } },
{ $project: { _id: 1 } },
]).exec()
).map(v => v._id);
const mongooseSamples = await Sample.find({ _id: { $in: samples } });
console.log(mongooseSamples); //an Array of mongoose documents
I would suggest using map/reduce, where you use the map function to only emit when a random value is above a given probability.
function mapf() {
if(Math.random() <= probability) {
emit(1, this);
}
}
function reducef(key,values) {
return {"documents": values};
}
res = db.questions.mapReduce(mapf, reducef, {"out": {"inline": 1}, "scope": { "probability": 0.5}});
printjson(res.results);
The reducef function above works because only one key ('1') is emitted from the map function.
The value of "probability" is defined in the "scope" when invoking mapReduce(...)
Using mapReduce like this should also be usable on a sharded db.
If you want to select exactly n of m documents from the db, you could do it like this:
function mapf() {
if(countSubset == 0) return;
var prob = countSubset / countTotal;
if(Math.random() <= prob) {
emit(1, {"documents": [this]});
countSubset--;
}
countTotal--;
}
function reducef(key,values) {
var newArray = new Array();
for(var i=0; i < values.length; i++) {
newArray = newArray.concat(values[i].documents);
}
return {"documents": newArray};
}
res = db.questions.mapReduce(mapf, reducef, {"out": {"inline": 1}, "scope": {"countTotal": 4, "countSubset": 2}})
printjson(res.results);
Where "countTotal" (m) is the number of documents in the db, and "countSubset" (n) is the number of documents to retrieve.
This approach might give some problems on sharded databases.
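The "exactly n of m" logic in the second map function is selection sampling: each document is kept with probability (still needed) / (still remaining). A minimal sketch over a plain array, using the same counter names as the scope variables above; selectExactly is a hypothetical function name and no database is involved.

```javascript
// Selection sampling: walk the items once and keep each with probability
// (still needed) / (still remaining), which yields exactly n of m items.
function selectExactly(items, n, rng = Math.random) {
  let countSubset = Math.min(n, items.length); // n, still to pick
  let countTotal = items.length; // m, still to visit
  const chosen = [];
  for (const item of items) {
    if (countSubset === 0) break;
    if (rng() <= countSubset / countTotal) {
      chosen.push(item);
      countSubset--;
    }
    countTotal--;
  }
  return chosen;
}

const sample = selectExactly(["a", "b", "c", "d"], 2);
console.log(sample.length); // 2
```

Once the number of remaining items equals the number still needed, the probability becomes 1 and every remaining item is taken, which is why the result never falls short. The technique relies on the two counters being shared across all visited items.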
You can pick a random _id and return the corresponding object:
db.collection.count( function(err, count){
  if (err)
    return res.send(err)
  db.collection.distinct( "_id" , function( err, result) {
    if (err)
      return res.send(err)
    // multiply by count (not count-1) so that the last id can also be picked
    var randomId = result[Math.floor(Math.random() * count)]
    db.collection.findOne( { _id: randomId } , function( err, result) {
      if (err)
        return res.send(err)
      console.log(result)
    })
  })
})
Here you don't need to spend space on storing random numbers in the collection.
The following aggregation operation randomly selects 3 documents from the collection:
db.users.aggregate(
[ { $sample: { size: 3 } } ]
)
https://docs.mongodb.com/manual/reference/operator/aggregation/sample/
MongoDB now has $rand
To pick n non repeat items, aggregate with { $addFields: { _f: { $rand: {} } } } then $sort by _f and $limit n.
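The stages named above can be assembled as follows. This is a sketch that only builds the pipeline array; randomSamplePipeline and the optional match argument are assumptions added for illustration, and running the pipeline assumes a server version that supports $rand.

```javascript
// Build a pipeline that returns n distinct documents in random order.
// match is an optional filter; _f is the temporary sort key named above.
function randomSamplePipeline(n, match = {}) {
  return [
    { $match: match },
    { $addFields: { _f: { $rand: {} } } },
    { $sort: { _f: 1 } },
    { $limit: n },
  ];
}

console.log(JSON.stringify(randomSamplePipeline(3, { status: "A" })));
```

The array would be passed to db.collection.aggregate(...). Note that the $sort stage has to order every matched document by a computed field, so this costs more than $sample on large collections.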
I'd suggest adding a random int field to each object. Then you can just do a
findOne({random_field: {$gte: rand()}})
to pick a random document. Just make sure you ensureIndex({random_field:1})
When I was faced with a similar problem, I backtracked and found that the business request was actually for some form of rotation of the inventory being presented. In that case there are much better options, which come from search engines like Solr rather than data stores like MongoDB.
In short, with the requirement to "intelligently rotate" content, what we should do instead of assigning a random number across all of the documents is to include a personal q score modifier. To implement this yourself, assuming a small population of users, you can store a document per user that has the productId, impression count, click-through count, last seen date, and whatever other factors the business finds meaningful for computing a q score modifier. When retrieving the set to display, you typically request more documents from the data store than the end user asked for, apply the q score modifier, and take the number of records the end user requested. Then randomize that page of results; it is a tiny set, so simply sort the documents in the application layer (in memory).
If the universe of users is too large, you can categorize users into behavior groups and index by behavior group rather than user.
If the universe of products is small enough, you can create an index per user.
I have found this technique to be much more efficient, but more importantly more effective in creating a relevant, worthwhile experience of using the software solution.
None of the solutions worked well for me, especially when there are many gaps and the set is small.
This worked very well for me (in PHP):
$count = $collection->count($search);
$skip = mt_rand(0, $count - 1);
$result = $collection->find($search)->skip($skip)->limit(1)->getNext();
My PHP/MongoDB sort/order by RANDOM solution. Hope this helps anyone.
Note: I have numeric ID's within my MongoDB collection that refer to a MySQL database record.
First I create an array with 10 randomly generated numbers
$randomNumbers = [];
for($i = 0; $i < 10; $i++){
$randomNumbers[] = rand(0,1000);
}
In my aggregation I use the $addFields pipeline stage combined with $arrayElemAt and $mod (modulus). The modulus operator gives me a number from 0 to 9, which I then use to pick a number from the array of randomly generated numbers.
$aggregate[] = [
'$addFields' => [
'random_sort' => [ '$arrayElemAt' => [ $randomNumbers, [ '$mod' => [ '$my_numeric_mysql_id', 10 ] ] ] ],
],
];
After that you can use the sort Pipeline.
$aggregate[] = [
'$sort' => [
'random_sort' => 1
]
];
My simplest solution to this ...
db.coll.find()
.limit(1)
.skip(Math.floor(Math.random() * 500))
.next()
This assumes the collection holds at least 500 items.
If you have a simple id key, you could store all the id's in an array, and then pick a random id. (Ruby answer):
ids = @coll.find({},fields:{_id:1}).to_a
@coll.find(ids.sample).first
Using Map/Reduce, you can certainly get a random record, just not necessarily very efficiently depending on the size of the resulting filtered collection you end up working with.
I've tested this method with 50,000 documents (the filter reduces it to about 30,000), and it executes in approximately 400ms on an Intel i3 with 16GB ram and a SATA3 HDD...
db.toc_content.mapReduce(
/* map function */
function() { emit( 1, this._id ); },
/* reduce function */
function(k,v) {
var r = Math.floor((Math.random()*v.length));
return v[r];
},
/* options */
{
out: { inline: 1 },
/* Filter the collection to "A"ctive documents */
query: { status: "A" }
}
);
The Map function simply creates an array of the id's of all documents that match the query. In my case I tested this with approximately 30,000 out of the 50,000 possible documents.
The Reduce function simply picks a random integer between 0 and the number of items (-1) in the array, and then returns that _id from the array.
400ms sounds like a long time, and it really is. If you had fifty million records instead of fifty thousand, the overhead may grow to the point where this becomes unusable in multi-user situations.
There is an open issue for MongoDB to include this feature in the core... https://jira.mongodb.org/browse/SERVER-533
If this "random" selection was built into an index-lookup instead of collecting ids into an array and then selecting one, this would help incredibly. (go vote it up!)
This works nice, it's fast, works with multiple documents and doesn't require populating rand field, which will eventually populate itself:
add index to .rand field on your collection
use find and refresh, something like:
// Install packages:
// npm install mongodb async
// Add index in mongo:
// db.ensureIndex('mycollection', { rand: 1 })
var mongodb = require('mongodb')
var async = require('async')
// Find n random documents by using "rand" field.
function findAndRefreshRand (collection, n, fields, done) {
var result = []
var rand = Math.random()
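// Append documents to the result based on criteria and options; the call is skipped once n documents have been collected.
var appender = function (criteria, options) {
return function (done) {
// Compute the limit when this step runs; n - result.length at the call sites below is evaluated too early, while result is still empty.
options.limit = n - result.length
if (options.limit > 0) {
collection.find(criteria, fields, options).toArray(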
function (err, docs) {
if (!err && Array.isArray(docs)) {
Array.prototype.push.apply(result, docs)
}
done(err)
}
)
} else {
async.nextTick(done)
}
}
}
async.series([
// Fetch docs with uninitialized .rand.
// NOTE: You can comment out this step if all docs have initialized .rand = Math.random()
appender({ rand: { $exists: false } }, { limit: n - result.length }),
// Fetch on one side of random number.
appender({ rand: { $gte: rand } }, { sort: { rand: 1 }, limit: n - result.length }),
// Continue fetch on the other side.
appender({ rand: { $lt: rand } }, { sort: { rand: -1 }, limit: n - result.length }),
// Refresh fetched docs, if any.
function (done) {
if (result.length > 0) {
var batch = collection.initializeUnorderedBulkOp({ w: 0 })
for (var i = 0; i < result.length; ++i) {
batch.find({ _id: result[i]._id }).updateOne({ $set: { rand: Math.random() } })
}
batch.execute(done)
} else {
async.nextTick(done)
}
}
], function (err) {
done(err, result)
})
}
// Example usage
mongodb.MongoClient.connect('mongodb://localhost:27017/core-development', function (err, db) {
if (!err) {
findAndRefreshRand(db.collection('profiles'), 1024, { _id: true, rand: true }, function (err, result) {
if (!err) {
console.log(result)
} else {
console.error(err)
}
db.close()
})
} else {
console.error(err)
}
})
P.S. The question "How to find random records in mongodb" is marked as a duplicate of this question. The difference is that this question asks explicitly about a single record, while the other one asks explicitly about getting random documents.
For me, I wanted to get the same records in a random order, so I created an empty array used for sorting, then generated a random number between 1 and 7 (I have seven fields). Each time I get a different value, I assign a different sort.
It is 'layman' but it worked for me.
//generate a random number between 1 and 7 (one per sortable field)
const randomval = Math.floor(Math.random() * 7) + 1;
//declare sort array and initialize to empty
const sort = [];
//choose the sort field based on the random value
if (randomval === 1) {
  sort.push('createdAt', 1);
} else if (randomval === 2) {
  sort.push('_id', 1);
}
// ... one branch per remaining field, up to randomval === n, which pushes ('n', 1)

Making a collection from collection subset in mongodb

I have a huge collection of documents (more than two million), and I find myself querying a very small subset, using something like
scs = db.balance_sheets.find({"9087n":{$gte:40}, "20/58n":{ $lte:40000000}})
which gives less than 5k results. The question is, can I create a new collection with the results of this query?
I tried insert:
db.scs.insert(db.balance_sheets.find({"9087n":{$gte:40}, "20/58n":{ $lte:40000000}}).toArray())
But it gives me an error: Socket say send() errno:32 Broken pipe 127.0.0.1:27017
I tried aggregate:
db.balance_sheets.aggregate([{ "9087n":{$gte:40}, "20/58n":{ $lte:40000000}} ,{$out:"pme"}])
And I get "exception: A pipeline stage specification object must contain exactly one field."
Any hints?
Thanks
The first option would be:
var cursor = db.balance_sheets.find({"9087n":{"$gte": 40}, "20/58n":{ $lte:40000000}});
while (cursor.hasNext()) {
var doc = cursor.next();
db.pme.save(doc);
};
As for the aggregation, try
db.balance_sheets.aggregate([
{
"$match": { "9087n": { "$gte": 40 }, "20/58n": { "$lte": 40000000 } }
},
{ "$out": "pme" }
]);
For improved performance, especially when dealing with large collections, use the Bulk API: operations are sent to the server in batches of, say, 500, so you make one round trip per 500 operations instead of one per document.
The following demonstrates this approach. The first example uses the Bulk API available in MongoDB versions >= 2.6 and < 3.2 to insert all the documents matching the query from the balance_sheets collection into the pme collection:
var bulk = db.pme.initializeUnorderedBulkOp(),
counter = 0;
db.balance_sheets.find({
"9087n": {"$gte": 40},
"20/58n":{ "$lte":40000000}
}).forEach(function (doc) {
bulk.insert(doc);
counter++;
if (counter % 500 == 0) {
bulk.execute(); // Execute per 500 operations
// and re-initialize the bulk object for the next 500 inserts
bulk = db.pme.initializeUnorderedBulkOp();
}
})
// Clean up remaining operations in queue
if (counter % 500 != 0) { bulk.execute(); }
The next example applies to MongoDB version 3.2, which deprecated the Bulk API and provided a newer set of APIs using bulkWrite():
var bulkOps = db.balance_sheets.find({
"9087n": { "$gte": 40 },
"20/58n": { "$lte": 40000000 }
}).map(function (doc) {
return { "insertOne" : { "document": doc } };
});
db.pme.bulkWrite(bulkOps);
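For a result set much larger than 5k documents, the batching idea from the first example can be combined with bulkWrite() by splitting the operations array into chunks. A minimal sketch in plain JavaScript; toBatches is a hypothetical helper, and the documents are stand-ins generated in memory rather than read from balance_sheets.

```javascript
// Split an array of operations into batches of at most `size` elements.
function toBatches(ops, size = 500) {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < ops.length; i += size) {
    batches.push(ops.slice(i, i + size));
  }
  return batches;
}

// Stand-in documents; in the shell these would come from the find() above.
const docs = Array.from({ length: 1200 }, (_, i) => ({ _id: i }));
const ops = docs.map((doc) => ({ insertOne: { document: doc } }));
console.log(toBatches(ops).map((b) => b.length)); // [ 500, 500, 200 ]
```

Each batch would then be passed to db.pme.bulkWrite(batch) in turn, which keeps every request to the server bounded in size.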

Mongoose limit/offset and count query

Bit of an odd one on query performance... I need to run a query which does a total count of documents, and can also return a result set that can be limited and offset.
So, I have 57 documents in total, and the user wants 10 documents offset by 20.
I can think of two ways of doing this. The first is to query for all 57 documents (returned as an array), then use array.slice to return the documents they want. The second is to run two queries: the first using mongo's native 'count' method, and the second using mongo's native $limit and $skip aggregators.
Which do you think would scale better? Doing it all in one query, or running two separate ones?
Edit:
// 1 query
var limit = 10;
var offset = 20;
Animals.find({}, function (err, animals) {
if (err) {
return next(err);
}
res.send({count: animals.length, animals: animals.slice(offset, limit + offset)});
});
// 2 queries
Animals.find({}, null, {limit:10, skip:20}, function (err, animals) {
if (err) {
return next(err);
}
Animals.count({}, function (err, count) {
if (err) {
return next(err);
}
res.send({count: count, animals: animals});
});
});
I suggest using two queries:
db.collection.count() returns the total number of items. When no query filter is given, this value is read from the collection's metadata rather than calculated by scanning documents.
db.collection.find().skip(20).limit(10) returns the page. I assume you will sort by some field, so do not forget to add an index on that field. This query will be fast too.
I think you shouldn't query all items and then perform the skip and take in the application, because once you have a lot of data you will run into problems with data transfer and processing.
Instead of using 2 separate queries, you can use aggregate() in a single query:
With the "$facet" aggregation stage, the total count and the data (with skip and limit) can be fetched together:
db.collection.aggregate([
//{$sort: {...}}
//{$match:{...}}
{$facet:{
"stage1" : [ {"$group": {_id:null, count:{$sum:1}}} ],
"stage2" : [ { "$skip": 0}, {"$limit": 2} ]
}},
{$unwind: "$stage1"},
//output projection
{$project:{
count: "$stage1.count",
data: "$stage2"
}}
]);
output as follows:-
[{
count: 50,
data: [
{...},
{...}
]
}]
Also, have a look at https://docs.mongodb.com/manual/reference/operator/aggregation/facet/
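One detail of the pipeline above is worth handling in application code: when nothing matches, the count facet is an empty array, and the $unwind stage then drops the whole result document. A hedged sketch that builds a similar pipeline without $unwind and reshapes the result in JavaScript; pagedFacetPipeline, unwrapFacetResult and the facet names pageInfo and data are assumptions for illustration, and the sample result is hand-written rather than returned by a server.

```javascript
// Build the $facet pagination pipeline for a given filter, offset and page size.
function pagedFacetPipeline(match, skip, limit) {
  return [
    { $match: match },
    {
      $facet: {
        pageInfo: [{ $count: "total" }],
        data: [{ $skip: skip }, { $limit: limit }],
      },
    },
  ];
}

// Reshape the single document the pipeline returns into { count, data }.
// pageInfo is an empty array when nothing matched, so default to 0.
function unwrapFacetResult(results) {
  const first = results[0] || { pageInfo: [], data: [] };
  const count = first.pageInfo.length > 0 ? first.pageInfo[0].total : 0;
  return { count, data: first.data };
}

// Hand-written stand-in for what aggregate(...).toArray() would return.
const fake = [{ pageInfo: [{ total: 57 }], data: [{ name: "cat" }] }];
console.log(unwrapFacetResult(fake).count); // 57
```

With this shape an empty match yields { count: 0, data: [] } instead of an empty result array, so the caller does not need a special case.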
db.collection_name.aggregate([
{ '$match' : { } },
{ '$sort' : { '_id' : -1 } },
{ '$facet' : {
metadata: [ { $count: "total" } ],
data: [ { $skip: 1 }, { $limit: 10 }, { '$project' : {"_id":0} } ] // add a projection here if you wish to re-shape the docs
} }
] )
Instead of using two queries to find the total count and to skip to the matched records, $facet is the best and most optimized way:
Match the records
Find the total count
Skip the records
You can also reshape the data according to your needs in the query.
There is a library that will do all of this for you, check out mongoose-paginate-v2
After having to tackle this issue myself, I would like to build upon user854301's answer.
With Mongoose ^4.13.8 I was able to use a function called toConstructor(), which allowed me to avoid building the query multiple times when filters are applied. I know this function is available in older versions too, but you'll have to check the Mongoose docs to confirm this.
The following uses Bluebird promises:
let schema = Query.find({ name: 'bloggs', age: { $gt: 30 } });
// save the query as a 'template'
let query = schema.toConstructor();
return Promise.join(
schema.count().exec(),
query().limit(limit).skip(skip).exec(),
function (total, data) {
return { data: data, total: total }
}
);
Now the count query will return the total records it matched and the data returned will be a subset of the total records.
Please note the () around query() which constructs the query.
You don't have to use two queries or one complicated query with aggregate and such.
You can use one query
example:
const getNames = async (queryParams) => {
const cursor = db.collection.find(queryParams).skip(20).limit(10);
return {
count: await cursor.count(),
data: await cursor.toArray()
}
}
Mongo returns a cursor that has predefined functions such as count, which returns the full count of the queried results regardless of skip and limit.
So in the count property you get the number of all documents matching the query, and in data you get just the chunk of 10 documents at offset 20.
Thanks, Igor Igeto Mitkovski; the best solution is to use the native connection.
The documentation is here: https://docs.mongodb.com/manual/reference/method/cursor.count/#mongodb-method-cursor.count
Mongoose doesn't support it ( https://github.com/Automattic/mongoose/issues/3283 ),
so we have to use the native connection.
const query = StudentModel.collection.find(
{
age: 13
},
{
projection:{ _id:0 }
}
).sort({ time: -1 })
const count = await query.count()
const records = await query.skip(20)
.limit(10).toArray()