How do I split/sample from a cursor in MongoDB? - mongodb

I have a database with millions of documents. I want to perform a relatively time-consuming operation on each document and then update it. I have two (related) questions:
If I want a random sample of, say, 1000 documents from a given cursor, how do I do that?
I want to compute on and update a million documents. I am on a cluster and I want to dispatch a separate job for each batch of, say 1000 documents. What's the easiest way to do something like this?
Thanks!
Uri

In order to do this, you'd have to push things to a worker manager. I would suggest gearman for this. In that case, a script would:
1. Query all documents that you want to update, and return their _id.
2. Push all object IDs into a gearman server.
3. Have a gearman worker process running on each of the machines in your cluster.
4. Have each gearman worker pick up a new object ID from the queue, process the document, and save it back into MongoDB.
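A minimal sketch of steps 1 and 2 in the mongo shell, assuming an items collection, a processed flag to select documents, and a hypothetical enqueue() helper standing in for your gearman client's submit call:

// Step 1: fetch only the _id of every document that needs processing
var ids = db.items.find({ processed: false }, { _id: 1 }).map(function (doc) {
    return doc._id;
});

// Step 2: split the ids into batches of 1000 and push each batch to the queue
var batchSize = 1000;
for (var i = 0; i < ids.length; i += batchSize) {
    enqueue("process_documents", ids.slice(i, i + batchSize)); // enqueue() is a stand-in for the gearman submit call
}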

Related

SELECT FOR UPDATE SKIP LOCKED in MongoDB

I have 100+ worker threads which poll the database, looking for new jobs.
To take a job, a thread needs to change the status of a batch of documents from NEW to IN_PROGRESS, so that no other thread can pick up the same job.
This can be solved perfectly well in PostgreSQL with a SELECT ... WHERE status = 'NEW' FOR UPDATE SKIP LOCKED statement.
Is there a way to do such atomic update in MongoDB for a single document? For a batch?
There's a findAndModify method, which works exactly as you've described for a single document.
For a batch, it's not possible right now, as
In MongoDB, write operations, e.g. db.collection.update(), db.collection.findAndModify(), db.collection.remove(), are atomic on the level of a single document.
It will be possible in MongoDB 4.0 though, with transactions.
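For the single-document case, a minimal sketch in the mongo shell, assuming a jobs collection with a status field as described in the question:

// Atomically claim one NEW job and flip it to IN_PROGRESS
var job = db.jobs.findAndModify({
    query: { status: "NEW" },
    sort: { _id: 1 },
    update: { $set: { status: "IN_PROGRESS" } },
    new: true
});
// job is null if no NEW document was available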

How to select all documents in MongoDB collection by parallel processes?

I have multiple worker processes which select data from a huge MongoDB collection and perform some complex calculations.
Each document from the MongoDB collection should be processed only once.
Now I'm using the following technique: each worker marks and selects documents to process with the .FindOneAndUpdate method. It finds an unmarked document, marks it, and returns it to the worker. FindOneAndUpdate (findAndModify) is an atomic operation, so each document is selected only once.
Selecting documents one by one doesn't look very efficient. Is there a way to select documents in batches of 100 and still be sure each document is processed only once?
Is there some other, perhaps MongoDB-specific, way to process a huge number of documents in parallel?
Interesting...
One way to solve that is by implementing segments for your data. Let's say you have 1M documents in your collection and 100 workers: find a field in your documents that can be divided into roughly equal ranges and pre-assign 10K documents to each worker.
But that process may be overkill, and its efficiency may not really be better than querying and processing the documents individually. If you set an index on your marker field, the operation should be quite efficient, as Mongo will know where to look for unmarked documents.
I think the safest way to do what you need is actually processing them one by one. Mongo's atomicity is at the document level, so you may not find a way to lock several specific documents at the same time. The $isolated operator may help if you can find a good way to segment the data for your workers.
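A minimal sketch of the one-document-at-a-time approach in the mongo shell, assuming a tasks collection whose documents start with a processing: false marker field (both names are placeholders):

// Index the marker field so unmarked documents are found quickly
db.tasks.createIndex({ processing: 1 });

// Each worker claims one unmarked document atomically
var doc = db.tasks.findAndModify({
    query: { processing: false },
    update: { $set: { processing: true } },
    new: true
});
// doc is null when nothing is left to claim; otherwise process it and save it back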
This other answer has useful links regarding atomicity and the $isolated operator.

Will MongoDB provide cursors to multiple clients (named cursors)?

I am running several NodeJS instances on separate compute engines all accessing the same MongoDB. In each instance I am running a background housekeeping process which scans the entire customer collection in the database. I am using cursors to access documents, fetching the next customer one by one.
This yields a number of competing housekeeping processes, all wanting to access the same documents (customers) in the same order.
What I am looking for instead, is for my housekeeping processes to cooperate rather than compete.
So if I had two instances I could construct two opposite-direction cursors. But if I have 3 or more instances, or if I want to be tolerant to any number of instances going up and down without duplicate or phantom customers, I need to find a different approach.
I was thinking, does MongoDB provide cursors addressable by name from multiple NodeJS instances such that all instances fetch the next document from the same cursor, never obtaining the same document?
If not, can anyone suggest a good pattern to apply to this problem?

Keep only n documents in collection - Meteor

I want to keep only the latest n documents in my activityFeed collection, in order to speed up the application. I know that I could subscribe only to n activityFeed elements in my iron-router configs, but it is not necessary to keep all the entries.
How can I do this?
Do I need to check on every insert, or is there a better way to do this?
Any help would be greatly appreciated.
As you point out, your subscription could handle this on the client side, but if you also want to purge your DB there are two obvious solutions:
Use Collection hooks to delete the oldest item on every insert.
Use a cron job to find the nth oldest element every so often (15 minutes or whatever), and then delete everything older than that. I'd use synced-cron for this.
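A minimal sketch of the cron approach with synced-cron on the server, assuming ActivityFeed is the collection handle, documents carry a createdAt field, and n = 1000 (all of these are placeholders):

SyncedCron.add({
    name: 'Trim activityFeed to the latest 1000 documents',
    schedule: function (parser) {
        return parser.text('every 15 minutes');
    },
    job: function () {
        // Find the 1000th newest document and remove everything older than it
        var cutoff = ActivityFeed.findOne({}, { sort: { createdAt: -1 }, skip: 999 });
        if (cutoff) {
            ActivityFeed.remove({ createdAt: { $lt: cutoff.createdAt } });
        }
    }
});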
Better would be to create a "capped collection": https://docs.mongodb.org/manual/core/capped-collections/
In meteor you can use it like this:
var coll = new Meteor.Collection("myCollection");
coll._createCappedCollection(numBytes, maxDocuments);
Where maxDocuments is the maximum number of documents to store in the collection and numBytes is the maximum size of the collection in bytes. If one of these limits is reached, Mongo will automatically drop the oldest inserted documents from the collection.
No need for additional scripts and cron jobs.
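Outside of Meteor you can create the same thing directly in the mongo shell; the numbers below are just example limits:

// Cap the collection at roughly 10MB or 1000 documents, whichever limit is hit first
db.createCollection("activityFeed", { capped: true, size: 10485760, max: 1000 });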

Multi room chat database with mongoDB (mongoose)

I need to design a schema for a multi-room chat which uses MongoDB for storage. I'm currently using Mongoose v2 and I've thought of the following methods:
Method 1
Each chat log (each room) has its own mongo collection
Each chat log collection is populated by documents (schema) message with from, to, message and time.
There is the user collection
The user collection is populated by documents (schema) user (with info regarding the user)
Doubts:
1. How exactly can I retrieve the documents from a specific collection (the chat room)?
Method 2
There is a collection (chat_logs)
The chat_logs collection is populated by documents (schema) message with from, to (chatroom), user, etc.
There is the user collection, as above.
Doubts:
1. Is there a max size for collections?
Any advice is welcome... thank you for your help.
There is no good reason to have a separate collection per chatroom. The size of collections is unlimited, and MongoDB offers no way to query data from more than one collection in a single query. So if you distribute your data over multiple collections, you won't be able to analyze data across more than one chatroom.
There isn't a limit on normal collections. However, do you really want to save every word ever written in any chat forever? Most likely not. You would rather save, say, the last 1,000 or 10,000 messages written. Or even 100,000. Let's make it 1,000,000. Given the average size of a chat message, this shouldn't be more than 20MB. So let's make it really safe and multiply that by 10.
What I would do is use a capped collection per chat room and tailable cursors. You don't need to worry about too many connections; the average mongo server can handle a couple of hundred of them. The queries can be made tailable quite easily, as shown in the Mongoose docs.
This approach has some advantages:
Capped collections are fast - they are basically FIFO buffers for BSON data.
The data is returned exactly in insertion order for free - no sorting, no complex queries, no extra indices
There is no need to maintain the individual rooms. Simply set a cap on creation and MongoDB will take care of the rest.
As for how to do it: simply create a connection per chat room and save them at the application level in an associative array with the chat room name as the key. Use those connections to create a new tailable cursor per request. Use XHR to request the chat data, respond with a stream, and process accordingly.
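A minimal sketch with Mongoose 4+ (the schema fields, collection name, and cap sizes are just placeholders; you would create one such model per chat room):

var mongoose = require('mongoose');

// Capped collection: ~200MB or 1,000,000 documents, whichever limit is hit first
var messageSchema = new mongoose.Schema(
    { from: String, to: String, message: String, time: Date },
    { capped: { size: 200 * 1024 * 1024, max: 1000000 } }
);
var Message = mongoose.model('Message', messageSchema, 'chat_room_lobby');

// A tailable cursor behaves like `tail -f` on the capped collection
var stream = Message.find().tailable().cursor();
stream.on('data', function (doc) {
    // push doc to the clients connected to this room
});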