Is the _id field in MongoDB always increasing for the next inserted document in the collection, even if we have multiple shards? So if I have collection.watch, do I always get a higher _id for the next document than for the previous one? I need this to implement a catch-up subscription and not lose any document. For every document processed from collection.watch I store its _id, and if I crash I can select all documents with _id > last_seen_id in addition to collection.watch.
Or do I have to use some sort of auto-incremented value? I don't want to, because it would hurt performance a lot and defeat the purpose of sharding.
ObjectIds are monotonically increasing most of the time, but this is not guaranteed. See What does MongoDB's documentation mean when it says ObjectIDs are "likely unique"? and Can a 4 byte timestamp value in MongoDb ObjectId overflow?. If you need a guaranteed monotonically increasing counter, you need to implement it yourself.
As you pointed out, this isn't a trivial thing to implement in a distributed environment, which is why MongoDB doesn't provide it.
One possible solution:
Have a dedicated counter collection
Seed the collection with a document like {i: 1}
Issue a find-and-modify operation that uses the $inc operator (https://docs.mongodb.com/manual/reference/operator/update/inc/) and no condition (thus affecting all documents in the collection, i.e. the one and only document, which is the counter)
Request the new document as the update result (e.g. in the Ruby driver, return_document: :after; see https://docs.mongodb.com/ruby-driver/master/tutorials/ruby-driver-crud-operations/#update-options)
Use the returned value as the counter
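For illustration, a minimal sketch of those steps in the mongo shell (the counters collection and the "global" _id are just example names):

// Seed the counter once
db.counters.insertOne({ _id: "global", i: 1 })

// Atomically increment and read the new value back
var doc = db.counters.findOneAndUpdate(
  {},                              // no condition: matches the one and only counter document
  { $inc: { i: 1 } },
  { returnNewDocument: true }      // shell equivalent of the Ruby driver's return_document: :after
)
var counter = doc.i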
This doesn't get you a queue. If you want a queue, there are various libraries and systems that provide queues.
Related
I have multiple worker processes which select data from a huge MongoDB collection and perform some complex calculations.
Each document from MongoDB collection should be processed only once.
Now I'm using the following technique: each worker marks and selects documents to process with the .FindOneAndUpdate method. It finds an unmarked document, marks it, and returns it to the worker. FindOneAndUpdate (findAndModify) is an atomic operation, so each document is selected only once.
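Roughly, that marking step looks like this in the mongo shell (the tasks collection, the state field, and workerId are assumed names, not part of the question):

// Atomically claim one unmarked document for this worker
var workerId = "worker-1"
var doc = db.tasks.findOneAndUpdate(
  { state: "new" },                                      // only unmarked documents
  { $set: { state: "processing", worker: workerId } },   // mark it
  { returnNewDocument: true }
)
// doc is null when there is nothing left to claim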
Selecting documents one by one doesn't look very efficient. Is there some way to select documents in batches of, say, 100 and still be sure each document will be processed only once?
Is there some other, maybe MongoDB-specific, way to process a huge number of documents in parallel?
Interesting...
One way to solve that is by implementing segments for your data. Let's say you have 1M documents in your collection and 100 workers: find a field in your documents that can be divided more or less evenly and pre-assign 10K documents to each worker.
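As a rough sketch of that idea (assuming a numeric field named seq that is spread evenly across the documents), each worker would query only its own slice:

// Worker k of N (0-based) processes only its own range of seq values
var k = 3, N = 100, total = 1000000
var lo = k * (total / N), hi = (k + 1) * (total / N)
db.items.find({ seq: { $gte: lo, $lt: hi } }).forEach(function (doc) {
  // process doc here
})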
But that process may be overkill, and its efficiency may not really be better than querying and processing the documents individually. If you set an index on your marked field, the operation should be quite efficient, as Mongo will know where to look for unmarked documents.
I think the safest way to do what you need is actually processing them one by one. Mongo's atomicity is at the document level, so you may not find a way to lock several specific documents at the same time. The $isolated operator may help if you can find a good way to segment the data for your workers.
This other answer has useful links regarding atomicity and the $isolated operator.
As far as I understand, the nscannedObjects entry in the explain() output means the number of documents that MongoDB needed to fetch from disk.
My question is: when this value is 0, what does that actually mean beyond the explanation above? Does MongoDB keep a cache with some documents stored in it?
nscannedObjects=0 means that there was no fetching or filtering to satisfy your query, the query was resolved solely based on indexes. So for example if you were to query for {_id:10} and there were no matching documents you would get nscannedObjects=0.
It has nothing to do with the data being in memory, there is no such distinction with the query plan.
Note that in MongoDB 3.0 and later nscanned and nscannedObjects are now called totalKeysExamined and totalDocsExamined, which is a little more self-explanatory.
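For example, in the shell you can look at those counters with (db.users and the query are just examples):

// MongoDB 3.0+: check totalKeysExamined and totalDocsExamined
db.users.find({ _id: 10 }).explain("executionStats").executionStats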
Mongo is a document database, which means that it can interpret the structure of the stored documents (unlike for example key-value stores).
One particular advantage of that approach is that you can build indices on the documents in the database.
An index is a data structure (usually a variant of a B-tree) which allows fast searching of documents based on some of their attributes (for example id (!= _id) or some other distinctive feature). Indexes are usually stored in memory, allowing very fast access to them.
When you search for documents based on indexed attributes (let's say id > 50), Mongo doesn't need to fetch the document from memory/disk/whatever - it can see which documents match the criteria based solely on the index (note that fetching something from disk is several orders of magnitude slower than a memory lookup, even with no cache). The only time it actually goes to the disk is when you need to fetch the document for further processing (which is not covered by the statistic you cited).
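A small example of such an index-only ("covered") query, assuming a users collection with an indexed id field:

db.users.createIndex({ id: 1 })
// Resolved entirely from the index: both the filter and the projection
// use only the indexed field, and _id is excluded
db.users.find({ id: { $gt: 50 } }, { _id: 0, id: 1 })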
Indices are crucial for achieving high performance, but they also have drawbacks (for example, a rarely used index can slow down inserts and not be worth it - after each insertion the index has to be updated).
Can we save new records in descending order in MongoDB, so that the first saved document will be returned last in a find query? I do not want to use $sort, so the data should be pre-saved in descending order.
Is it possible?
Based on the description above, if you do not want to use $sort, an alternative solution is to create a capped collection, which maintains the insertion order of documents in the MongoDB collection.
For a more detailed description of capped collections in MongoDB, please refer to the documentation at the following URL:
https://docs.mongodb.org/manual/core/capped-collections/
But please note that capped collections are fixed-size collections, hence they will automatically remove the oldest documents when the collection exceeds its maximum size.
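A minimal sketch (the collection name and size are just examples):

// Fixed-size capped collection; insertion order is preserved on disk
db.createCollection("events", { capped: true, size: 1048576 })
// Natural order = insertion order; reverse natural order returns the newest documents first
db.events.find().sort({ $natural: -1 })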
The order of the records is not guaranteed by MongoDB unless you add a $sort operator. Even if the records happen to be ordered on disk, there is no guarantee that MongoDB will always return the records in the same order. MongoDB does quite a bit of work under the hood and as your data grows in size, the query optimiser may pick a different execution plan and return the data in a different order.
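If you rely on any particular order, it is safer to state it explicitly, for example:

// Newest documents first, based on the ObjectId's embedded timestamp
db.collection.find().sort({ _id: -1 })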
I'm building an application to handle ticket sales and expect to have really high demand. I want to try using MongoDB with multiple concurrent client nodes serving a node.js website (and gracefully handle failure of clients).
I've read "Limit the number of documents in a collection in mongodb" (which is completely unrelated) and "Is there a way to limit the number of records in certain collection" (but that talks about capped collections, where the new documents overwrite the oldest documents).
Is it possible to limit the number of documents in a collection to some maximum size, and have documents after that limit just be rejected? The simple example is adding ticket sales to the database, then failing if all the tickets are already sold out.
I considered having a NumberRemaining document, which I could atomically decrement until it reaches 0, but that leaves me with a problem if a node crashes between decrementing that number and saving the purchase of the ticket.
Store the tickets in a single MongoDB document. As updates to a single document are atomic, you avoid the cross-document dependencies that would otherwise call for a traditional transactional database system.
As a document can be up to 16 MB, by storing only the ticket_ids in a master document you should be able to store plenty of tickets without needing any extra complex document management. While it could introduce a hot spot, the document likely won't be very large. If it does get large, you could use more than one document (splitting into multiple documents: as one document "fills" up, activate another).
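As a sketch of that idea (the collection and field names are assumptions, not a prescribed schema):

// One document per event holds the remaining count and the sold ticket ids
db.events.insertOne({ _id: "concert-42", remaining: 500, tickets: [] })

// Selling a ticket is a single atomic update that matches nothing once remaining hits 0
var res = db.events.updateOne(
  { _id: "concert-42", remaining: { $gt: 0 } },
  { $inc: { remaining: -1 }, $push: { tickets: "ticket-123" } }
)
// res.modifiedCount === 0 means the event is sold out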
If that doesn't work, 10gen has a pattern that might fit.
My only solution so far (I'm hoping someone can improve on this):
Insert documents into an un-capped collection as they arrive. Keep the implicit _id value of ObjectID, which can be sorted and will therefore order the documents by when they were added.
Run all queries ordered by _id and limited to the max number of documents.
To determine whether an insert was "successful", run an additional query that checks that the newly inserted document is within the maximum number of documents.
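A sketch of that check (MAX and the sales collection are assumed names; undoing the insert is one possible way to handle the over-limit case):

var MAX = 500
var res = db.sales.insertOne({ ticket: "ticket-123", createdAt: new Date() })
// Count how many documents were inserted up to and including this one;
// if we are past the limit, treat the insert as failed and undo it
var position = db.sales.count({ _id: { $lte: res.insertedId } })
if (position > MAX) {
  db.sales.deleteOne({ _id: res.insertedId })
  // report "sold out" to the caller
}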
My solution was to use an extra count variable in another collection. That collection has a validation rule that prevents the count from becoming negative; the count should always be a non-negative integer:
"count": { "$gte": 0 }
The algorithm is simple: decrement the count by one. If that succeeds, insert the document. If it fails, it means there is no space left.
Vice versa for deletion.
You can also use transactions to prevent partial failures (e.g. the count is decremented but the service fails just before the insert operation).
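A sketch of that setup in the mongo shell (collection and field names are examples):

// Counter collection whose validator rejects a negative count
db.createCollection("limits", { validator: { count: { $gte: 0 } } })
db.limits.insertOne({ _id: "tickets", count: 500 })

// Decrement first; a document validation failure here means no space is left
db.limits.updateOne({ _id: "tickets" }, { $inc: { count: -1 } })
// ...then insert the actual document
db.tickets.insertOne({ ticket: "ticket-123" })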
I know that we can bulk update documents in mongodb with
db.collection.update( criteria, objNew, upsert, multi )
in one db call, but it's homogeneous, i.e. all the impacted documents match one set of criteria. What I'd like to do is something like
db.collection.update([{criteria1, objNew1}, {criteria2, objNew2}, ...])
to send multiple update requests which would update possibly completely different documents, or classes of documents, in a single db call.
What I want to do in my app is to insert/update a bunch of objects with a compound primary key; if the key already exists, update the document, otherwise insert it.
Can I combine all of this into one call in MongoDB?
Those are two separate questions. On the first one: there is no MongoDB-native mechanism to bulk-send criteria/update pairs, although technically doing that in a loop yourself is bound to be about as efficient as any native bulk support would be.
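In other words, roughly something like this (pairs is an assumed array of criteria/update objects):

// pairs: [{ criteria: {...}, update: {...} }, ...]
var pairs = [{ criteria: { a: 1 }, update: { $set: { b: 2 } } }]
pairs.forEach(function (p) {
  db.collection.update(p.criteria, p.update)
})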
Checking for the existence of a document based on an embedded document (what you refer to as a compound key, but in the interest of correct terminology it's better to use the Mongo name here, to avoid confusion) and inserting/updating depending on that existence check can be done with an upsert:
document A :
{
_id: ObjectId(...),
key: {
name: "Will",
age: 20
}
}
db.users.update({"key.name": "Will", "key.age": 20}, {$set: {"key.age": 21}}, true, false)
This upsert (update with insert if no document matches the criteria) will do one of two things depending on the existence of document A:
Exists: performs the update $set:{"key.age": 21} on the existing document
Doesn't exist: creates a new document with an embedded "key" document whose "name" and "age" fields hold "Will" and 20 respectively (basically the criteria are copied into the new doc), and then the update is applied ($set:{"key.age": 21}). The end result is a document with key.name = "Will" and key.age = 21.
Hope that helps
We are seeing some benefits of the $in clause.
Our use case was to update the 'status' field in the documents for a large number of records.
In our first cut, we were doing a for loop and doing the updates one by one. But then we switched to using the $in clause, and that made a huge improvement.
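For example, instead of one update per record, a single call like this (the collection and field names are assumed):

// One round trip that marks the whole batch
var batchIds = [1, 2, 3]
db.orders.update(
  { _id: { $in: batchIds } },          // batchIds: array of ids to update
  { $set: { status: "processed" } },
  { multi: true }
)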
There is no real benefit from doing updates the way you suggest.
The reason that there is a bulk insert API and that it is faster is that Mongo can write all the new documents sequentially to memory, and update indexes and other bookkeeping in one operation.
A similar thing happens with updates that affect more than one document: the update will traverse the index only once and update objects as they are found.
Sending multiple criteria with multiple updates cannot benefit from any of these optimizations. Each criteria means a separate query, just as if you issued each update separately. The only possible benefit would be sending slightly fewer bytes over the connection. The database would still have to do each query separately and update each document separately.
All that would happen is that Mongo would queue the updates internally and execute them sequentially (because only one update can happen at any one time); this is exactly the same as if all the updates were sent separately.
It's unlikely that the overhead of sending the queries separately would be significant; Mongo's global write lock will be the limiting factor anyway.