Getting previous page's results when using skip and limit while querying DocumentDB - mongodb

I have a collection with ~800,000 documents, I am trying to fetch all of them, 5,000 at a time.
When running the following code:
const CHUNK_SIZE = 5000;
let skip = 0;
let matches;
do {
  matches = await dbClient
    .collection(collectionName)
    .find({})
    .skip(skip)
    .limit(CHUNK_SIZE)
    .toArray();
  // ... some processing
  skip += CHUNK_SIZE;
} while (matches.length);
After about 30 iterations, I start getting documents I already received in a previous iteration.
What am I missing here?

As posted in the comments, you'll have to apply a .sort() on the query.
To do so without adding too much performance overhead, the easiest option is to sort on _id, e.g.
.sort({ "_id": 1 })
Neither MongoDB nor the Amazon DocumentDB flavor guarantees an implicit result ordering without it.
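Putting it together, the loop from the question with a stable sort added would look something like this (a sketch reusing the names from the question, run inside the same async context):
const CHUNK_SIZE = 5000;
let skip = 0;
let matches;
do {
  matches = await dbClient
    .collection(collectionName)
    .find({})
    .sort({ "_id": 1 }) // stable, index-backed ordering across iterations
    .skip(skip)
    .limit(CHUNK_SIZE)
    .toArray();
  // ... some processing
  skip += CHUNK_SIZE;
} while (matches.length);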
Amazon DocumentDB
Mongo Result Ordering

Related

Iterating over MongoDB collection to duplicate all documents is painfully slow

I have a collection of 7,000,000 documents (each of perhaps 1-2 KB BSON) in a MongoDB collection that I would like to duplicate, modifying one field. The field is a string with a numeric value, and I would like to increment the field by 1.
From the Mongo shell, I took the following approach:
> var all = db.my_collection.find()
> all.forEach(function(it) {
... it._id = 0; // to force mongo to create a new objectId
... it.field = (parseInt(it.field) + 1).toString();
... db.my_collection.insert(it);
... })
Executing the above code is taking an extremely long time; at first I thought the code was broken somehow, but when I checked the status of the collection from a separate terminal something like an hour later, the process was still running and there were now 7,000,001 documents! Sure enough, there was exactly 1 new document that matched the incremented field.
For context, I'm running a 2015 MBP with 4 cores and 16 GB RAM. I see mongo near the top of my CPU usage, averaging about 85%.
1) Am I missing a bulk modify/update capability in Mongodb?
2) Any reason why the above operation works, yet so slowly that it is adding documents at a rate of 1/hr?
Try the db.collection.mapReduce() way:
NB: A single emit can only hold half of MongoDB’s maximum BSON document size.
var mapFunction1 = function() {
  emit(ObjectId(), (parseInt(this.field) + 1).toString());
};
MongoDB will not call the reduce function for a key that has only a single value.
var reduceFunction1 = function(id, field) {
  return field;
};
Finally,
db.my_collection.mapReduce(
  mapFunction1,
  reduceFunction1,
  { "out": "my_collection" } // Replaces the entire content; consider merge
)
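Note that passing a plain collection name to "out" replaces the target collection's existing contents; if the intent is to keep the original documents and add the new ones alongside them, the merge output form, {"out": {"merge": "my_collection"}}, is probably closer to what's wanted.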
I'm embarrassed to say that I was mistaken in thinking that this line:
... it._id = 0; // to force mongo to create a new objectId
actually forces mongo to create a new ObjectId. It does not; instead I needed to be explicit:
... it._id = ObjectId();
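For reference, the corrected shell loop would look roughly like this (same collection and field names as above):
var all = db.my_collection.find();
all.forEach(function(it) {
  it._id = ObjectId();                             // explicitly assign a fresh _id
  it.field = (parseInt(it.field) + 1).toString();  // increment the numeric string
  db.my_collection.insert(it);
});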

Mongodb - Data model change, optimal way of converting array to string

Our previous data model assumed that a certain field, let's call it field for lack of imagination, could contain more than 1 value, so we modeled it as an array.
Initial model:
{
  field: ['val1']
}
We then realized (10 million docs later) that that wasn't the case and changed to:
{
  field: 'val1'
}
I thought it would be simple to migrate to the new model but apparently it isn't.
I tried:
db.collection.update({},{$rename: {"field.0": 'newField'}})
but it complains that an array element cannot be used in the first place of $rename operator.
As I understood that in an update operation you cannot assign a field value to another one, I investigated the aggregation framework but I couldn't figure out a way.
Can I redact a document with the aggregation framework and the $out operator to accomplish what I want?
I also tried forEach, but it is dead slow:
db.coll.find({ "field": { $exists: true } }).snapshot().forEach(function(doc) {
  doc.newField = doc.field[0];
  delete doc.field;
  db.coll.save(doc);
});
I parallelized it using a bash script and was able to get about 200 updates/s, which means 10,000,000 / (200*60*60) ≈ 14h, quite some time to wait, not counting the timeout errors that I handle with the bash script but that waste even more time.
So now I ask, is there any chance that bulk operations or aggregation framework would speed up the process?
I would go for the bulk operations, which are abstractions on top of the server that make it easy to build and send write operations in batches, thus streamlining your updates. You get performance gains over large collections because the bulk API groups the writes, and even better, it gives you real feedback about what succeeded and what failed. In the bulk update below, the operations are sent to the server in batches of 1000, which gives better performance because you are not sending every request to the server individually, just once every 1000 requests:
var bulk = db.collection.initializeUnorderedBulkOp(),
    counter = 0;
db.collection.find({ "field": { "$exists": true, "$type": 4 } }).forEach(function(doc) {
  var updatedVal = doc.field[0];
  bulk.find({ "_id": doc._id }).updateOne({
    "$set": { "field": updatedVal }
  });
  counter++;
  if (counter % 1000 == 0) {
    bulk.execute(); // Execute per 1000 operations and re-initialize
    bulk = db.collection.initializeUnorderedBulkOp();
  }
});
// Clean up the remaining queued operations
if (counter % 1000 != 0) { bulk.execute(); }
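On newer server/shell versions (3.2+), db.collection.bulkWrite() is a rough equivalent; a minimal sketch, assuming the same field names and an arbitrary batch size of 1000:
var ops = [];
db.collection.find({ "field": { "$exists": true, "$type": 4 } }).forEach(function(doc) {
  ops.push({
    updateOne: {
      filter: { _id: doc._id },
      update: { "$set": { "field": doc.field[0] } } // convert the single-element array to a string
    }
  });
  if (ops.length === 1000) {
    db.collection.bulkWrite(ops, { ordered: false }); // flush a batch of 1000
    ops = [];
  }
});
if (ops.length > 0) { db.collection.bulkWrite(ops, { ordered: false }); } // flush the remainder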

Get count of documents without loading the whole collection in DerbyJS 0.6

How can I count the results of a query without loading the whole resultset into memory?
The easy way of counting documents returned by a query would be:
var q = model.query('mycollection', { date: today });
q.fetch(function() {
  var length = q.get().length;
});
But this would load the whole resultset into memory and "count" an array in javascript. When you have lots of data you don't want to do this. I think.
Counting the underlying mongodb collection is rather complicated since LiveDB (I think it is LiveDB) creates many mongodb documents for one derbyjs document.
The internets point to this google groups thread from 2013, but the solution described there (putting $count: true into the query options) doesn't seem to work in DerbyJS 0.6 and current mongodb.
query.extraRef is undefined.
It is done as described in the google groups thread, but query.extraRef is now query.refExtra.
Example:
var q = model.query('mycollection', { $count: true, date: today });
q.refExtra('_page.docsOfToday');
q.fetch();
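Presumably the count can then be read from the ref'd path once the fetch completes, along these lines (a sketch; the path name is just the one from the example above):
var q = model.query('mycollection', { $count: true, date: today });
q.refExtra('_page.docsOfToday');
q.fetch(function (err) {
  if (err) return console.error(err);
  var count = model.get('_page.docsOfToday'); // the count, without loading the documents
  console.log(count);
});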

MongoDB FindAndModify extremely slow

I'm using mongodb and have some trouble with the speed. My collection got bigger and now contains about 7,000,000 items. As a result, the findAndModify query takes about 3 seconds. I have an index on the queried field (in my case "links", which is an array). Does anybody see a big mistake or inefficient code (see below)?
public Cluster findAndLockWithUpsert(String url, String lockid) {
    Query query = Query.query(Criteria.where("links").in(Arrays.asList(url)));
    Update update = new Update().push("lock", lockid).push("links", url);
    FindAndModifyOptions options = new FindAndModifyOptions();
    options.remove(false);
    options.returnNew(true);
    options.upsert(true);
    Cluster result = mongo.findAndModify(query, update, options, Cluster.class, COLLECTION);
    return result;
}
thank you in advance!

Ordering a result set randomly in mongo

I've recently discovered that Mongo has no equivalent to SQL's "ORDER BY RAND()" in its command syntax (https://jira.mongodb.org/browse/SERVER-533).
I've seen the recommendation at http://cookbook.mongodb.org/patterns/random-attribute/ and frankly, adding a random attribute to a document feels like a hack. It also won't work for me because it places an implicit limit on any given query I want to randomize.
The other widely given suggestion is to choose a random index to offset from. Because of the order that my documents were inserted in, that will result in one of the string fields being alphabetized, which won't feel very random to a user of my site.
I have a couple ideas on how I could solve this via code, but I feel like I'm missing a more obvious and native solution. Does anyone have a thought or idea on how to solve this more elegantly?
I have to agree: the easiest thing to do is to install a random value into your documents. There need not be a tremendously large range of values, either -- the number you choose depends on the expected result size for your queries (1,000 - 1,000,000 distinct integers ought to be enough for most cases).
When you run your query, don't worry about the random field -- instead, index it and use it to sort. Since there is no correspondence between the random number and the document, you should get fairly random results. Note that collisions will likely result in documents being returned in natural order.
While this is certainly a hack, you have a very easy escape route: given MongoDB's schema-free nature, you can simply stop including the random field once there is support for random sort in the server. If size is an issue, you could run a batch job to remove the field from existing documents. There shouldn't be a significant change in your client code if you design it carefully.
An alternative option would be to think long and hard about the number of results that will be randomized and returned for a given query. It may not be overly expensive to simply do shuffling in client code (i.e., if you only consider the most recent 10,000 posts).
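A minimal sketch of that approach in the shell (the collection name and the query are placeholders):
// One-time backfill: give each document a random value, then index it.
db.posts.find({ rand: { $exists: false } }).forEach(function (doc) {
  db.posts.update({ _id: doc._id }, { $set: { rand: Math.random() } });
});
db.posts.ensureIndex({ rand: 1 });
// Query time: sort on the random field to get a pseudo-random ordering.
db.posts.find().sort({ rand: 1 }).limit(10).toArray();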
What you want cannot be done without picking either of the two solutions you mention. Picking a random offset is a horrible idea if your collection becomes larger than a few thousand documents. The reason for this is that the skip(n) operation takes O(n) time. In other words, the higher your random offset, the longer the query will take.
Adding a randomized field to the document is, in my opinion, the least hacky solution there is given the current feature set of MongoDB. It provides stable query times and gives you some say over how the collection is randomized (and allows you to generate a new random value after each query through a findAndModify for example). I also do not understand how this would impose an implicit limit on your queries that make use of randomization.
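The per-query refresh mentioned above could look something like this in the shell (a sketch; "posts" is a placeholder collection name):
// Grab one document at or above a random point and give it a fresh rand value
// in the same operation, so repeated queries don't keep returning the same order.
var r = Math.random();
db.posts.findAndModify({
  query: { rand: { $gte: r } },
  sort: { rand: 1 },
  update: { $set: { rand: Math.random() } }
});
// If this returns null (r landed above every stored value), retry with { $lt: r }.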
You can give this a try; it's fast, works with multiple documents, and doesn't require populating the rand field at the beginning, as it will eventually populate itself:
add an index on the rand field of your collection
use find and refresh, something like:
// Install packages:
// npm install mongodb async
// Add index in mongo:
// db.ensureIndex('mycollection', { rand: 1 })
var mongodb = require('mongodb')
var async = require('async')
// Find n random documents by using "rand" field.
function findAndRefreshRand (collection, n, fields, done) {
  var result = []
  var rand = Math.random()
  // Append documents to the result based on criteria and options, if options.limit is 0 skip the call.
  var appender = function (criteria, options, done) {
    return function (done) {
      if (options.limit > 0) {
        collection.find(criteria, fields, options).toArray(
          function (err, docs) {
            if (!err && Array.isArray(docs)) {
              Array.prototype.push.apply(result, docs)
            }
            done(err)
          }
        )
      } else {
        async.nextTick(done)
      }
    }
  }
  async.series([
    // Fetch docs with uninitialized .rand.
    // NOTE: You can comment out this step if all docs have initialized .rand = Math.random()
    appender({ rand: { $exists: false } }, { limit: n - result.length }),
    // Fetch on one side of random number.
    appender({ rand: { $gte: rand } }, { sort: { rand: 1 }, limit: n - result.length }),
    // Continue fetch on the other side.
    appender({ rand: { $lt: rand } }, { sort: { rand: -1 }, limit: n - result.length }),
    // Refresh fetched docs, if any.
    function (done) {
      if (result.length > 0) {
        var batch = collection.initializeUnorderedBulkOp({ w: 0 })
        for (var i = 0; i < result.length; ++i) {
          // Use $set so only the rand value is refreshed, not the whole document replaced.
          batch.find({ _id: result[i]._id }).updateOne({ $set: { rand: Math.random() } })
        }
        batch.execute(done)
      } else {
        async.nextTick(done)
      }
    }
  ], function (err) {
    done(err, result)
  })
}
// Example usage
mongodb.MongoClient.connect('mongodb://localhost:27017/core-development', function (err, db) {
  if (!err) {
    findAndRefreshRand(db.collection('profiles'), 1024, { _id: true, rand: true }, function (err, result) {
      if (!err) {
        console.log(result)
      } else {
        console.error(err)
      }
      db.close()
    })
  } else {
    console.error(err)
  }
})
The other widely given suggestion is to choose a random index to offset from. Because of the order that my documents were inserted in, that will result in one of the string fields being alphabetized, which won't feel very random to a user of my site.
Why? If you have 7,000 documents and you choose three random offsets from 0 to 6999, the chosen documents will be random, even if the collection itself is sorted alphabetically.
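For example, picking a random offset in the shell might look like this (a sketch; "posts" is a placeholder, and the skip() cost mentioned above still applies for large offsets):
var count = db.posts.count();
var offset = Math.floor(Math.random() * count);
var doc = db.posts.find().skip(offset).limit(1).next(); // one random document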
One could insert a numeric id field (the _id field won't work because it's not an actual number) and use modulus math to get a random skip. If you have 10,000 records and you want 10 results, you could randomly pick a modulus between 1 and 1000, such as 253, and then request documents where mod(id, 253) = 0; this is reasonably fast if id is indexed. Then randomly sort those 10 results client side. Sure, they are evenly spaced out instead of truly random, but it is close to what is desired.
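A sketch of that modulus idea using the $mod query operator, assuming documents carry the numeric, indexed id field the suggestion describes ("posts" is a placeholder):
var modulus = 1 + Math.floor(Math.random() * 1000); // e.g. 253
db.posts.find({ id: { $mod: [modulus, 0] } }).limit(10).toArray();
// Shuffle the handful of results client side afterwards.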
Both of the options seem like imperfect hacks to me: the random field will always have the same value, and skip will return the same records for the same number.
Why don't you sort on some random field and then also skip randomly? I admit it is also a hack, but in my experience it gives a better sense of randomness.
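That combination might look roughly like this (a sketch; the skip bound is arbitrary and kept small to limit the skip() cost):
var skip = Math.floor(Math.random() * 1000); // random offset
db.mycollection.find().sort({ rand: 1 }).skip(skip).limit(10).toArray();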