How to prevent MongoDB from returning duplicate documents when iterating a cursor over a large, constantly updated collection? - mongodb

Context
I have a large collection with millions of documents that is constantly updated by production workload. When running a query, I noticed that the same document can be returned more than once. My job migrates the documents to a SQL system that enforces unique row ids, so it crashes on the duplicates.
Problem
Because the collection is so big and many users update it after the query has started, iterating over the cursor may return several documents with the same id (the old version and the updated version).
What I've tried
const cursor = db.collection.find(query, {snapshot: true});
while (cursor.hasNext()) {
  const doc = cursor.next();
  // do some stuff
}
Based on old documentation for the MongoDB driver (I'm using Node.js, but this applies to any official MongoDB driver), there is an option called snapshot that is supposed to prevent exactly this. Sadly, the driver returns an error saying that this option does not exist (it was deprecated).
Question
Is there a way to iterate through the documents of a collection safely, so that I never get the same document twice?
The only viable option I see is an aggregation pipeline, but I would like to explore other options with standard queries.

Finally I got the answer from a mongo changelog page:
MongoDB 3.6.1 deprecates the snapshot query option.
For MMAPv1, use hint() on the { _id: 1} index instead to prevent a cursor from returning a document more than once if an intervening write operation results in a move of the document.
For other storage engines, use hint() with { $natural : 1 } instead.
So, from my code example:
const cursor = db.collection.find(query).hint({$natural: 1});
while (cursor.hasNext()) {
  const doc = cursor.next();
  // do some stuff
}
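If duplicates still need to be ruled out on the client, a guard keyed on _id is another option. The sketch below is not from the changelog. It assumes that all seen ids fit in memory and that ids are all of one type, and makeFirstSeenFilter is a hypothetical helper name. It keeps whichever version of a document the cursor returns first, which may be the older one.

```javascript
// Hypothetical helper: returns a predicate that is true only the first time
// a given _id is seen. Assumes all seen ids fit in memory.
function makeFirstSeenFilter() {
  const seen = new Set();
  return function isFirstSeen(doc) {
    // String() turns an ObjectId into its hex form, so equal ids compare equal.
    const key = String(doc._id);
    if (seen.has(key)) {
      return false;
    }
    seen.add(key);
    return true;
  };
}

// Example with plain objects standing in for documents from a cursor.
const isFirstSeen = makeFirstSeenFilter();
const batch = [
  { _id: "a", v: 1 },
  { _id: "b", v: 1 },
  { _id: "a", v: 2 }, // same document, returned again after an update
];
const unique = batch.filter(isFirstSeen);
console.log(unique.map((d) => d._id)); // [ 'a', 'b' ]
```

With the Node.js driver, the same predicate could be applied inside a for await (const doc of cursor) loop, skipping documents for which it returns false.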

Related

MongoDB takes an hour to delete_many() 1GB of data

We have a 3GB collection in MongoDB 4.2 and this Python 3.7, pymongo 3.12 function that deletes rows from the collection:
from pymongo import MongoClient
def delete_from_mongo_collection(table_name):
    # connect to the mongo cluster (MONGO_URI is defined elsewhere)
    cluster = MongoClient(MONGO_URI)
    db = cluster["cbbap"]
    # remove the matching rows and return
    query = { 'competitionId': { '$in': [30629, 30630] } }
    db[table_name].delete_many(query)
    return
Here is the relevant info on this collection. Note that it has 360MB of indexes, which exist to speed up reads from this collection by our Node API, although they may be the problem here.
The delete_many() is part of a daily pattern where we (a) remove stale data and (b) upload fresh data. However, given that it takes over an hour to remove the rows that match the query { 'competitionId': { '$in': [30629, 30630] } }, we'd be better off just dropping and re-inserting the entire table. What's frustrating is that competitionId is indexed, and since it is the first key in our compound indexes, I expected deleting rows by it to be very fast. I wonder if having 360MB of indexes is responsible for the slow deletes?
We cannot use the hint parameter because we are on MongoDB 4.2, not 4.4, and we do not want to upgrade to 4.4 yet, as we are worried about breaking changes in our pipelines and our Node API.
What else can be done here to improve the performance of delete_many()?

MongoDB - how to get fields fill-rates as quickly as possible?

We have a very big MongoDB collection of documents with some pre-defined fields that can either have a value or not.
We need to gather the fill-rates of those fields. We wrote a script that goes over all documents and counts the fill-rate of each field, but it takes a long time to process all documents.
Is there a way to use db.collection.aggregate or db.collection.mapReduce to run such a script server-side?
Would that give a significant performance improvement?
Will it slow down other usages of that collection (e.g. by holding a major lock)?
Answering my own question: I migrated my script from a cursor that scans the whole collection to a map-reduce query. Running on a sample of the collection, map-reduce seems to be at least twice as fast.
Here's how the old script worked (in node.js):
var cursor = collection.find(query, projection).sort({_id: 1}).limit(limit);
var next = function() {
  cursor.nextObject(function(err, doc) {
    processDoc(doc, next);
  });
};
next();
and this is the new script:
collection.mapReduce(
  function () {
    var processDoc = function(doc) {
      ...
    };
    processDoc(this);
  },
  function (key, values) {
    return Array.sum(values);
  },
  {
    query : query,
    out: {inline: 1}
  },
  function (error, results) {
    // print results
  }
);
processDoc stayed basically the same, but instead of incrementing a counter on a global stats object, I do:
emit(field_name, 1);
Running old and new on a sample of 100k, the old script took 20 seconds and the new one took 8.
Some notes:
map-reduce's limit option doesn't work on sharded collections, so I had to query for _id : { $gte, $lte} to create the sample size needed.
map-reduce's performance boost option, jsMode : true, doesn't work on sharded collections either (it might have improved performance even more). Running the job manually on each shard might make that option usable.
As I understand it, you want to compute something over your documents and end up with a new "document" that can be queried. You don't need to store the computed values.
If you don't need to write the computed values back into the documents, you can use the Aggregation Framework.
Aggregation operations process data records and return computed results. Aggregation operations group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result.
https://docs.mongodb.com/manual/aggregation/
Since the Aggregation Framework has a lot of features, I can't give you more information about how to resolve your issue.
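As one possible shape for such an aggregation, the sketch below builds a single $group stage that counts, per field, the documents in which the field is present and non-null. It is an illustration under assumptions, not the method from either answer: fillRatePipeline is a hypothetical helper, the field names are made up, and the fields are assumed to be top-level and not named _id or total.

```javascript
// Builds a one-stage aggregation pipeline that counts, for each field,
// how many documents have it present and non-null.
// fillRatePipeline is a hypothetical helper, not a MongoDB API.
function fillRatePipeline(fields) {
  const group = { _id: null, total: { $sum: 1 } };
  for (const field of fields) {
    group[field] = {
      $sum: {
        // $type yields "missing" for absent fields and "null" for nulls.
        $cond: [{ $in: [{ $type: "$" + field }, ["missing", "null"]] }, 0, 1],
      },
    };
  }
  return [{ $group: group }];
}

// Field names here are made up for illustration.
const pipeline = fillRatePipeline(["email", "phone"]);
console.log(JSON.stringify(pipeline));
// Intended use (not run here): collection.aggregate(pipeline).toArray()
```

The single result document would carry total plus one counter per field, so each fill-rate is the counter divided by total.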

Query documents, update them and write them back to MongoDB

In my MongoDB 3.2 based application I want to process documents. To avoid processing the same document twice, I want to set a flag on it and save the updated document to the database.
A possible approach is:
1. Query the data: FindIterable<Document> documents = db.collection.find(query);
2. Perform some business logic on these documents.
3. Iterate over the documents, update each document and store it in a new collection.
4. Push the new collection to the database with db.collection.updateMany();
Theoretically, this approach should work, but I'm not sure that it is the optimal one.
Is there any way in the MongoDB Java API to perform the following two operations:
to query documents (to get them from the DB and pass them to a separate method);
to update them and then store the updated version in the DB;
in a more elegant way than the approach proposed above?
You can update documents in place using update:
db.collection.update(
  {query},
  {update},
  {multi:true}
);
It will iterate over all documents in the collection that match the query and update the fields specified in the update.
EDIT:
To apply some business logic to individual documents, you can iterate over the matching documents as follows:
db.collection.find({query}).forEach(
  function (doc) {
    // your business logic
    if (doc.question == "Great Question of Life") {
      doc.answer = 42;
    }
    db.collection.save(doc);
  }
)
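Calling save once per document sends one write per document. A possible alternative, sketched here under assumptions, is to collect the changes and send them in one batch with bulkWrite, which is available from MongoDB 3.2. The helper name toUpdateOps and the processed flag are hypothetical and not part of the answer above.

```javascript
// Hypothetical helper: turns processed documents into updateOne operations
// in the shape accepted by db.collection.bulkWrite().
function toUpdateOps(docs, applyLogic) {
  return docs.map(function (doc) {
    // applyLogic returns only the fields that should change.
    const changes = applyLogic(doc);
    return {
      updateOne: {
        filter: { _id: doc._id },
        // "processed" is an assumed flag name marking finished documents.
        update: { $set: Object.assign({}, changes, { processed: true }) },
      },
    };
  });
}

// Plain objects stand in for documents returned by find().
const docs = [
  { _id: 1, question: "Great Question of Life" },
  { _id: 2, question: "Something else" },
];
const ops = toUpdateOps(docs, function (doc) {
  return doc.question === "Great Question of Life" ? { answer: 42 } : {};
});
console.log(JSON.stringify(ops[0]));
// Intended use (not run here): db.collection.bulkWrite(ops);
```

Adding processed: { $ne: true } to the original find query would then skip documents that were already handled.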

Find the collection name from document._id in meteor (mongodb)

From the looks of the syntax for handling MongoDB-related things in Meteor, it seems that you always need to know the collection's name to update, insert, remove or do anything else to a document.
What I am wondering is whether it's possible to get the collection's name from the _id field of a document in Meteor.
Say you have a document whose _id is TNTco3bHzoSFMXKJT. Knowing only the _id, you want to find which collection the document is located in. Is this possible through Meteor's implementation of MongoDB or vanilla MongoDB?
As taken from the official docs:
idGeneration String
The method of generating the _id fields of new documents in this collection. Possible values:
'STRING': random strings
'MONGO': random Meteor.Collection.ObjectID values
The default id generation technique is 'STRING'.
Your best option would be to insert records within a pseudo-transaction in which the second step takes the id and collection name and feeds them into a reference collection. Then you can do your lookups from that.
It would make your finds pretty costly to construct, though, but it might be a pattern worth exploring if you are building an app where your users will be creating arbitrary data patterns.
You could accomplish this by doing a findOne on all of the collections:
var collectionById = function(id) {
  return _.find(_.keys(this), function(name) {
    if (this[name] instanceof Meteor.Collection) {
      if (this[name].findOne(id)) {
        return true;
      }
    }
  });
};
I tested this on both the client and the server and it seemed to work when run in the global context.

MongoDB: range queries on insertion time with _id and ObjectID

I am trying to use mongodb's ObjectID to do a range query on the insertion time of a given collection. I can't really find any documentation that this is possible, except for this blog entry: http://mongotips.com/b/a-few-objectid-tricks/ .
I want to fetch all documents created after a given timestamp. Using the nodejs driver, this is what I have:
var timeId = ObjectId.createFromTime(timestamp);
var query = {
  localUser: userId,
  _id: {$gte: timeId}
};
var cursor = collection.find(query).sort({_id: 1});
I always get the same number of records (19 in a collection of 27), independent of the timestamp. I noticed that createFromTime only fills the bytes of the ObjectID related to time; the other ones are left at 0 (like this: 4f6198be0000000000000000).
The reason I am trying to use an ObjectID for this is that I need the timestamp from when the document is inserted on the MongoDB server, not from when it is passed to the MongoDB driver in Node.
Does anyone know how to make this work, or have another idea for how to generate and query insertion times that were generated on the MongoDB server?
I'm not sure about the Node.js driver, but in Ruby you can simply apply range queries like this:
jan_id = BSON::ObjectId.from_time(Time.utc(2012, 1, 1))
feb_id = BSON::ObjectId.from_time(Time.utc(2012, 2, 1))
#users.find({'_id' => {'$gte' => jan_id, '$lt' => feb_id}})
Make sure that
var timeId = ObjectId.createFromTime(timestamp) is actually creating an ObjectId.
Also try the query without localUser.
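One more thing worth checking, offered as a guess rather than a confirmed diagnosis: the timestamp part of an ObjectId is four bytes holding seconds since the Unix epoch, and the Node.js driver's createFromTime expects seconds. A JavaScript millisecond value such as Date.now() is about a thousand times too large for that field. The sketch below builds the boundary id by hand so the conversion is explicit. objectIdHexFromDate is a hypothetical helper name.

```javascript
// Hypothetical helper: returns the 24-character hex string of the smallest
// ObjectId whose embedded timestamp equals the given date (to the second).
function objectIdHexFromDate(date) {
  // ObjectId timestamps are whole seconds, while Date works in milliseconds.
  const seconds = Math.floor(date.getTime() / 1000);
  if (seconds < 0 || seconds > 0xffffffff) {
    throw new RangeError("timestamp does not fit in 4 bytes");
  }
  // 8 hex digits of timestamp, then 16 zeros for the remaining 8 bytes.
  return seconds.toString(16).padStart(8, "0") + "0".repeat(16);
}

// 1331796158 seconds corresponds to the id prefix quoted in the question.
const hex = objectIdHexFromDate(new Date(1331796158 * 1000));
console.log(hex); // 4f6198be0000000000000000
```

The resulting string can be passed to the driver's ObjectId constructor, for example new ObjectId(hex), and used in the $gte filter.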