How to select all documents in MongoDB collection by parallel processes? - mongodb

I have multiple worker processes which select data from huge mongodb collection and performs some complex calculations.
Each document from MongoDB collection should be processed only once.
Now I'm using the following technique: each worker marks and selects documents to process with the .FindOneAndUpdate method. It finds an unmarked document, marks it, and returns it to the worker. FindOneAndUpdate (findAndModify) is an atomic operation, so each document is selected only once.
Selecting documents one by one doesn't look very efficient. Is there some way to select, say, 100 documents at a time and still be sure each document will be processed only once?
Is there some other, maybe MongoDB-specific, way to process a huge number of documents in parallel?

Interesting...
One way to solve that is by implementing segments for your data. Let's say you have 1M documents in your collection and 100 workers: find a field in your schema that can be divided more or less evenly and pre-assign 10K documents to each worker.
But that process may be overkill, and its efficiency may not really be better than querying and processing the documents individually. If you set an index on the marker field, the operation should be quite efficient, as Mongo will know where to look for unmarked documents.
I think the safest way to do what you need is actually processing them one by one. Mongo's atomicity is at the document level, so you may not find a way to lock several specific documents at the same time. The $isolated operator may help you if you can find a good way to segment the data for your workers.
This other answer has useful links regarding atomicity and the $isolated operator.
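For illustration, a minimal pymongo sketch of the claim-one-document-at-a-time pattern from the question might look like the following (the "status" and "worker" field names, the database/collection names, and the connection string are placeholder assumptions, not taken from the original):

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
coll = client["mydb"]["tasks"]                     # assumed collection

# Index the marker field so finding unmarked documents stays cheap.
coll.create_index([("status", ASCENDING)])

def claim_next(worker_id):
    # Atomically claim a single unprocessed document, or return None.
    return coll.find_one_and_update(
        {"status": {"$exists": False}},                           # not yet marked
        {"$set": {"status": "processing", "worker": worker_id}},
    )

def worker_loop(worker_id, process):
    while True:
        doc = claim_next(worker_id)
        if doc is None:
            break  # nothing left to claim
        process(doc)  # the worker's own complex calculation
        coll.update_one({"_id": doc["_id"]}, {"$set": {"status": "done"}})

Because the claim happens in a single findOneAndUpdate, two workers can never claim the same document, which is exactly the guarantee the question relies on.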

Related

Is there a way to count reads/writes per collection in mongodb?

Is there a way to view the reads/writes in MongoDB on a per-collection basis? I would like to see how many documents have been read and written on a specific collection.
We are currently researching the costs of some specific queries and trying to find out whether they are heavy read and/or write tasks.
Thank you :-)
Yes, you can. You can run db.collection.stats(),
which returns the size and count of documents, index information, and a lot of other useful information. But you want to count the number of reads and writes performed on a specific collection. For that you can use mongostat. It captures and returns the counts of database operations by type (insert, query, update, delete, etc.). These counts report on the load distribution on the server. Read more about mongostat in the documentation: https://docs.mongodb.com/manual/reference/program/mongostat/#bin.mongostat
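As a hedged illustration, the same statistics can be pulled from Python via the collStats command (the database and collection names here are placeholders):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]

# Equivalent of db.mycollection.stats() in the shell: document count,
# data size, index details and more.
stats = db.command("collStats", "mycollection")
print(stats["count"], "documents,", stats["size"], "bytes of data")

Note that mongostat itself is a separate command-line tool and reports operation counts per server rather than per collection.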

Mongo DB update query performance

I would like to understand which of the queries below would be faster when doing updates in MongoDB. I want to update a few thousand records in one stretch.
1. Accumulating the object ids of those records and firing the update using $in (or a bulk update)?
2. Using one or two fields in the collection which are common to those few thousand records (akin to a "where" clause in SQL) and firing an update using those fields? These fields might or might not be indexed.
I know the query will be much smaller in the second case, as every single "_id" (oid) is not accumulated. Does accumulating _ids and using those to update documents offer any practical performance advantages?
Does accumulating _ids and using those to update documents offer any practical performance advantages?
Yes, because MongoDB will certainly use the _id index (the "idhack" fast path).
In the second method, as you observed, you can't tell whether or not an index will be used for a certain field.
So the answer is: it depends.
If your collection has millions of documents or more, and/or the number of search fields is quite large, you should prefer the first method, especially if the id list is not small and/or the id values are adjacent.
If your collection is pretty small and you can tolerate a full scan, you may prefer the second approach.
In any case, you should verify both methods using explain().
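To make the comparison concrete, here is a rough pymongo sketch of both methods; the "status" field, the collection names, and the values are illustrative assumptions:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["records"]

# Method 1: accumulate the _ids first, then update with $in.
# The update filter is guaranteed to hit the _id index.
ids = [doc["_id"] for doc in coll.find({"status": "pending"}, {"_id": 1})]
coll.update_many({"_id": {"$in": ids}}, {"$set": {"status": "done"}})

# Method 2: update directly by the shared field(s); whether an index is
# used depends on whether one exists on "status".
coll.update_many({"status": "pending"}, {"$set": {"status": "done"}})

# Compare the plans by running explain() on an equivalent find():
plan = coll.find({"status": "pending"}).explain()
print(plan["queryPlanner"]["winningPlan"])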

What is the preferred way to add many fields to all documents in a MongoDB collection?

I have a Python application that iteratively goes through every document in a MongoDB (3.0.2) collection (typically between 10K and 1M documents) and adds new fields (probably doubling or tripling the number of fields in the document).
My initial thought was that I would upsert the entire revised documents (using pyMongo); now I'm questioning that:
Given that the revised documents are significantly bigger, should I be inserting only the new fields, or just replacing the document?
Also, is it better to perform writes to the collection document by document or in bulk?
This is actually a great question that can be solved a few different ways, depending on how you are managing your data.
If you are upserting additional fields, does that mean your data only ever gains additional fields at a later point in time, with the only change being the addition of those fields? If so, you could set a TTL on your documents so that the old ones drop off over time. Keep in mind that if you do this, you will want to sort your results by descending _id so that the most recent additions are selected before the older ones.
The benefit of doing it this way is that you are continually writing data, as opposed to seeking and updating data, so it is faster.
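As a sketch of that TTL idea (the "created_at" and "record_key" fields, the 30-day window, and the collection names are assumptions for illustration only):

from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["documents"]

# Documents expire roughly 30 days after their "created_at" timestamp.
coll.create_index("created_at", expireAfterSeconds=30 * 24 * 3600)

# Query newest-first so the latest version of a record is returned before
# any older copies; the default _id index can be walked in descending order.
latest = coll.find_one({"record_key": "some-key"}, sort=[("_id", DESCENDING)])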
In regards to upserts vs. bulk inserts: bulk inserts are always faster than upserts, since upserting requires finding the original document first.
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
You really need to understand your data fully to determine what is best, but if the only change to the data is additional fields, or changes that only need to be considered from that point forward, then bulk inserting and setting a TTL on your older data is the better method from the standpoint of write operations, as opposed to seek, find and update. When using this method you will want to use db.document.find_one() as opposed to db.document.find(), so that only your current record is returned.
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
Bulk inserts will be faster than inserting each document sequentially.
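For illustration, a minimal pymongo sketch of writing the new fields in bulk rather than issuing one update per round-trip (compute_new_fields, the batch size, and all names are placeholders for the application's own logic):

from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["documents"]

def compute_new_fields(doc):
    # placeholder for whatever fields the application actually derives
    return {"derived_a": len(doc), "derived_b": str(doc["_id"])}

ops, BATCH = [], 1000
for doc in coll.find():
    # $set only adds/overwrites the new fields; it does not replace the document
    ops.append(UpdateOne({"_id": doc["_id"]}, {"$set": compute_new_fields(doc)}))
    if len(ops) >= BATCH:
        coll.bulk_write(ops, ordered=False)
        ops = []
if ops:
    coll.bulk_write(ops, ordered=False)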

Limit the number of documents in a MongoDB collection, without FIFO policy

I'm building an application to handle ticket sales and expect to have really high demand. I want to try using MongoDB with multiple concurrent client nodes serving a node.js website (and gracefully handle failure of clients).
I've read "Limit the number of documents in a collection in mongodb" (which is completely unrelated) and "Is there a way to limit the number of records in certain collection" (but that talks about capped collections, where the new documents overwrite the oldest documents).
Is it possible to limit the number of documents in a collection to some maximum size, and have documents beyond that limit simply be rejected? The simple example is adding ticket sales to the database, then failing if all the tickets are already sold out.
I considered having a NumberRemaining document, which I could atomically decrement until it reaches 0, but that leaves me with a problem if a node crashes between decrementing that number and saving the purchase of the ticket.
Store the tickets in a single MongoDB document. Since you can only atomically update one document at a time, keeping everything in one document avoids the cross-document dependencies that would otherwise call for a traditional transactional database system.
A document can be up to 16MB, so by storing only a ticket_id in a master document you should be able to store plenty of tickets without any extra complex document management. While it could introduce a hot spot, the document likely won't be very large. If it does get large, you could split the data into multiple documents: as one document "fills", activate another.
If that doesn't work, 10gen has a pattern that might fit.
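A rough pymongo sketch of the single-document approach, assuming one master document per event holding a remaining counter and the sold ticket ids (all names and the capacity of 500 are illustrative):

from pymongo import MongoClient, ReturnDocument

client = MongoClient("mongodb://localhost:27017")
events = client["mydb"]["events"]

# One-time setup: a master document with the remaining count and tickets.
events.insert_one({"_id": "concert-42", "remaining": 500, "ticket_ids": []})

def sell(event_id, ticket_id):
    # A single atomic update: it only matches while tickets remain, so a
    # sold-out event simply returns None.
    return events.find_one_and_update(
        {"_id": event_id, "remaining": {"$gt": 0}},
        {"$inc": {"remaining": -1}, "$push": {"ticket_ids": ticket_id}},
        return_document=ReturnDocument.AFTER,
    )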
My only solution so far (I'm hoping someone can improve on this):
Insert documents into an un-capped collection as they arrive. Keep the implicit ObjectId _id value, which sorts chronologically and will therefore order the documents by when they were added.
Run all queries ordered by _id and limited to the max number of documents.
To determine whether an insert was "successful", run an additional query that checks that the newly inserted document is within the maximum number of documents.
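A hedged sketch of that insert-then-verify idea in pymongo (the limit, the collection name, and the choice to delete the over-limit document as the rejection step are assumptions added here):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
sales = client["mydb"]["ticket_sales"]
MAX_TICKETS = 1000  # assumed limit

def try_sell(purchase):
    inserted_id = sales.insert_one(purchase).inserted_id
    # Position of this document in _id (i.e. insertion) order.
    position = sales.count_documents({"_id": {"$lte": inserted_id}})
    if position > MAX_TICKETS:
        sales.delete_one({"_id": inserted_id})  # over the limit: roll back
        return False
    return True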
My solution was: I use an extra count variable in another collection. This collection has a validation rule that prevents the count variable from becoming negative; the count should always be a non-negative integer:
"count": { "$gte": 0 }
The algorithm is simple: decrement the count by one. If it succeeds, insert the document. If it fails, it means there is no space left.
Vice versa for deletion.
You can also use transactions to prevent partial failures (for example, the count is decremented but the service fails just before the insert).
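A rough pymongo sketch of this counter-plus-transaction approach (it assumes a replica set, which transactions require, and a counters collection created with the validator shown above; all names are placeholders):

from pymongo import MongoClient
from pymongo.errors import OperationFailure

# One-time setup (shown here as comments):
#   db.create_collection("counters", validator={"count": {"$gte": 0}})
#   db.counters.insert_one({"_id": "tickets", "count": 1000})
client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]

def sell_ticket(purchase):
    try:
        with client.start_session() as session:
            with session.start_transaction():
                # The validator rejects this update once the count would drop
                # below zero, which aborts the whole transaction.
                db.counters.update_one(
                    {"_id": "tickets"}, {"$inc": {"count": -1}}, session=session)
                db.sales.insert_one(purchase, session=session)
        return True
    except OperationFailure:
        return False  # no space left (or the transaction could not commit)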

List of updated documents

Is there a way to update (or delete) many documents matching certain criteria and get the list of IDs of the actually updated/deleted documents (or some other fields of those documents)? I cannot simply query the documents matching my criteria beforehand, because I need some kind of atomicity for this operation. And I can't use findAndModify, because it can only process one document at a time, which is too slow due to round-trips. Suggestions?
MongoDB only supports atomic operations on a single document.
http://www.mongodb.org/display/DOCS/Atomic+Operations
The only way to do this is to do what you said you didn't want to:
First, query the collection to find the ids matching your criteria:
db.things.find({"name":"john"}, {_id:1});
Then, use the same query to remove:
db.things.remove({"name":"john"});
Not ideal, and not atomic, but it's as good as you're going to get in this scenario.
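For completeness, here is the same two-step idea in pymongo; removing by the collected _ids (rather than re-running the name filter) at least guarantees that the deleted documents are exactly the ones whose ids were returned, though the pair of operations is still not atomic as a whole:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
things = client["mydb"]["things"]

ids = [d["_id"] for d in things.find({"name": "john"}, {"_id": 1})]
result = things.delete_many({"_id": {"$in": ids}})
print("deleted", result.deleted_count, "documents:", ids)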