Limit the number of documents in a MongoDB collection, without FIFO policy - mongodb

I'm building an application to handle ticket sales and expect to have really high demand. I want to try using MongoDB with multiple concurrent client nodes serving a node.js website (and gracefully handle failure of clients).
I've read "Limit the number of documents in a collection in mongodb" (which is completely unrelated) and "Is there a way to limit the number of records in certain collection" (but that talks about capped collections, where the new documents overwrite the oldest documents).
Is it possible to limit the number of documents in a collection to some maximum size, and have documents beyond that limit simply be rejected? The simple example is adding ticket sales to the database, then failing if all the tickets are already sold out.
I considered having a NumberRemaining document, which I could atomically decrement until it reaches 0, but that leaves me with a problem if a node crashes between decrementing that number and saving the purchase of the ticket.

Store the tickets in a single MongoDB document. Since you can only update one document at a time atomically, keeping everything in that one document avoids the cross-document dependencies that would otherwise call for a traditional transactional database system.
As a document can be up to 16MB, by storing only a ticket_id in a master document you should be able to hold plenty of tickets without any extra complex document management. While it could introduce a hot spot, the document likely won't be very large. If it does get large, you could use more than one document: as one document "fills", activate another.
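A minimal sketch of that idea in PyMongo (the database/collection names, the sold counter field, and MAX_TICKETS are assumptions for illustration, not part of the original answer): the update only matches while the counter is below the cap, so a sold-out event simply returns no document.

from pymongo import MongoClient, ReturnDocument

client = MongoClient()
events = client.ticketing.events  # hypothetical database/collection names
MAX_TICKETS = 500

# Seed one master document per event.
events.update_one({"_id": "concert-42"},
                  {"$setOnInsert": {"sold": 0, "ticket_ids": []}},
                  upsert=True)

def sell_ticket(event_id, ticket_id):
    """Atomically append a ticket; returns None when the event is sold out."""
    return events.find_one_and_update(
        {"_id": event_id, "sold": {"$lt": MAX_TICKETS}},
        {"$push": {"ticket_ids": ticket_id}, "$inc": {"sold": 1}},
        return_document=ReturnDocument.AFTER,
    )

if sell_ticket("concert-42", "ticket-0001") is None:
    print("sold out")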
If that doesn't work, 10gen has a pattern that might fit.

My only solution so far (I'm hoping someone can improve on this):
Insert documents into an un-capped collection as they arrive. Keep the implicit _id value of type ObjectId, which sorts by creation time and therefore orders the documents by when they were added.
Run all queries ordered by _id and limited to the max number of documents.
To determine whether an insert was "successful", run an additional query that checks that the newly inserted document is within the maximum number of documents.
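A hedged sketch of that check in PyMongo (the collection name and MAX_DOCS are assumptions; whether to delete the overflowing document is up to you):

from pymongo import MongoClient

tickets = MongoClient().ticketing.tickets    # hypothetical collection
MAX_DOCS = 500

def try_insert(doc):
    """Insert, then verify the document landed within the first MAX_DOCS by _id order."""
    inserted_id = tickets.insert_one(doc).inserted_id
    position = tickets.count_documents({"_id": {"$lte": inserted_id}})
    if position > MAX_DOCS:
        tickets.delete_one({"_id": inserted_id})   # outside the cap: treat as rejected
        return False
    return True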

My solution: I use an extra count variable in another collection. That collection has a validation rule that prevents the count from becoming negative; the count must always be a non-negative integer:
"count": { "$gte": 0 }
The algorithm is simple: decrement the count by one. If the decrement succeeds, insert the document. If it fails, there is no space left.
Vice versa for deletion.
You can also use transactions to guard against partial failures (the count is decremented but the service fails just before the insert operation).
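A rough sketch of this approach with PyMongo (database, collection, and field names are illustrative; transactions need MongoDB 4.0+ on a replica set, and the purchases collection should already exist on versions before 4.4):

from pymongo import MongoClient
from pymongo.errors import WriteError

client = MongoClient()       # transactions require a replica set or sharded cluster
db = client.ticketing        # hypothetical database name

# One-time setup: a counters collection whose validator rejects negative counts.
db.create_collection("counters", validator={"count": {"$gte": 0}})
db.counters.insert_one({"_id": "tickets", "count": 500})

def buy_ticket(purchase_doc):
    """Decrement the counter and insert the purchase in a single transaction."""
    with client.start_session() as session:
        try:
            with session.start_transaction():
                db.counters.update_one({"_id": "tickets"},
                                       {"$inc": {"count": -1}},
                                       session=session)
                db.purchases.insert_one(purchase_doc, session=session)
            return True
        except WriteError:
            # The validator rejected a negative count: no space left (sold out).
            return False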

Related

Is MongoDB's "_id" field always increasing?

Is the _id field in MongoDB always increasing for the next inserted document in the collection, even with multiple shards? In other words, with collection.watch, do I always get a higher _id for the next document than for the previous one? I need this to implement a catch-up subscription and not lose any documents: for every document processed from collection.watch I store its _id, and after a crash I can select all documents with _id > last_seen_id in addition to collection.watch.
Or do I have to use some sort of auto-incremented value? I'd rather not, since that would hurt performance a lot and defeat the purpose of sharding.
ObjectIds are monotonically increasing most of the time, but this is not guaranteed. See What does MongoDB's documentation mean when it says ObjectIDs are "likely unique"? and Can a 4 byte timestamp value in MongoDb ObjectId overflow?. If you need a guaranteed monotonically increasing counter, you need to implement it yourself.
As you pointed out this isn't a trivial thing to implement in a distributed environment, which is why MongoDB doesn't provide this.
One possible solution (a code sketch follows the steps below):
Have a dedicated counter collection
Seed the collection with a document like {i: 1}
Issue a find-and-modify operation that uses $inc (https://docs.mongodb.com/manual/reference/operator/update/inc/) and no condition (thus affecting all documents in the collection, i.e. the one and only document which is the counter)
Request the new document as the update result (e.g. with the Ruby driver's return_document: :after option, see https://docs.mongodb.com/ruby-driver/master/tutorials/ruby-driver-crud-operations/#update-options)
Use the returned value as the counter
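A minimal PyMongo sketch of those steps (database and collection names are placeholders):

from pymongo import MongoClient, ReturnDocument

counters = MongoClient().mydb.counters       # hypothetical names
counters.insert_one({"i": 1})                # seed the one and only counter document

def next_value():
    """Atomically increment the single counter document and return the new value."""
    doc = counters.find_one_and_update(
        {},                                  # no condition: matches the lone counter document
        {"$inc": {"i": 1}},
        return_document=ReturnDocument.AFTER,
    )
    return doc["i"]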
This doesn't get you a queue. If you want a queue, there are various libraries and systems that provide queues.

How to select all documents in MongoDB collection by parallel processes?

I have multiple worker processes which select data from a huge MongoDB collection and perform some complex calculations.
Each document from the MongoDB collection should be processed only once.
Now I'm using the following technique: each worker marks and selects documents to process with the .FindOneAndUpdate method. It finds an unmarked document, marks it, and returns it to the worker. FindOneAndUpdate (findAndModify) is an atomic operation, so each document is selected only once.
Selecting documents one by one doesn't seem very efficient. Is there some way to select, say, 100 documents at a time and still be sure each document will be processed only once?
Is there some other, maybe MongoDB-specific, way to process a huge number of documents in parallel?
Interesting...
One way to solve this is by segmenting your data. Let's say you have 1M documents in your collection and 100 workers: find a field in your structure that can be divided roughly equally and pre-assign 10K documents to each worker.
But that approach may be overkill, and it may not actually be more efficient than querying and processing the documents individually. If you set an index on your marker field, the operation should be quite efficient, as Mongo will know where to look for unmarked documents.
I think the safest way to do what you need is actually to process them one by one. Mongo's atomicity is at the document level, so you may not find a way to lock several specific documents at the same time. The $isolated operator may help if you find a good way to segment the data for your workers.
This other answer has useful links regarding atomicity and the $isolated operator.
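For reference, a minimal PyMongo sketch of the one-document-at-a-time claim pattern described in the question (the state and worker field names are invented for illustration); an index on the marker field keeps the claim query cheap:

from pymongo import MongoClient, ReturnDocument

jobs = MongoClient().mydb.jobs               # hypothetical collection
jobs.create_index("state")                   # so unclaimed documents are found quickly

def claim_next(worker_id):
    """Atomically mark one unprocessed document for this worker and return it."""
    return jobs.find_one_and_update(
        {"state": "new"},
        {"$set": {"state": "processing", "worker": worker_id}},
        return_document=ReturnDocument.AFTER,
    )

def worker_loop(worker_id):
    while True:
        doc = claim_next(worker_id)
        if doc is None:
            break                            # nothing left to claim
        # ... perform the complex calculation, then mark the document done ...
        jobs.update_one({"_id": doc["_id"]}, {"$set": {"state": "done"}})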

Slow Upserts with PyMongoDB

I'm trying to insert ~800 million records into MongoDB using PyMongo on a MacBook Air (1.7 GHz i7) with no multi-threading. The documents are structured as below.
The records I'm reading are tuples of the form:
(user_id,imp_date,imp_creative,imp_pid,geo_id)
I'm creating my own _id field based on the user_id in the file I'm reading from.
{_id:user_id,
'imp_date':[array of dates],
'imp_creative':[array of numeric ids],
'imp_pid':[array of numeric ids],
'geo_id':numeric id}
I'm using an upsert with $push to append the date, creative id, and pid to the corresponding arrays:
self.collection.update({'_id':uid},
{"$push":{'imp_date':<datevalue>,
'imp_creative':<creative_id>,
'imp_pid':<pid>}},safe=True,upsert=True)
I'm using an upsert with $set to overwrite the geographic location (I only care about the most recent):
self.collection.update({'_id':uid},
{"$set":{'geo_id':<geo id>}},safe=True,upsert=True)
I'm only writing about 1,500 records per second (8,000 if I set safe=False). My question is: what can I do to speed this up further (ideally 20k/second or faster)?
Ideas I can't find a definitive recommendation on:
- Using multiple threads to insert data
- Sharding
- Padding arrays (my arrays grow very slowly; each document's array will have an average length of ~4 at the end of the file)
- Turning journaling off
Apologies if I've left out any required information, this is my first post.
1- You could add an index: an index would help you find documents faster, although inserts would be slower (the index has to be updated as well). Whether the improvement in the retrieval phase compensates for the extra time spent updating the index depends on how many records you have in the collection, how many indexes you have, and how complicated those indexes are.
However, in your case you are only querying by _id, so there's not much more you can do with indexes.
2- Are you using two consecutive updates? I mean, one for the $set and one for the $push?
If that's true, then you should definitely use just one:
self.collection.update({'_id':uid},
{"$push":{'imp_date':<datevalue>,
'imp_creative':<creative_id>,
'imp_pid':<pid>},
"$set":{'geo_id':<geo id>}},
safe=True,upsert=True)
3- The update operation is an atomic operation which might block other queries. If the document you are about to update is not already in RAM but on disk, Mongo will have to fetch it from disk first and then update it. If you do a find operation first (which doesn't block, as it's a read-only operation), the document will already be in RAM, so the update operation (the locking one) will be faster:
self.collection.find_one({'_id':uid})
self.collection.update({'_id':uid},
{"$push":{'imp_date':<datevalue>,
'imp_creative':<creative_id>,
'imp_pid':<pid>},
"$set":{'geo_id':<geo id>}},
safe=True,upsert=True)
4- If your documents don't grow too much, as you have said, it won't be necessary to worry about the padding factor and reallocation issues. Furthermore, in some recent versions (I can't remember if it was 2.2 or 2.4) collections are created with the usePowerOf2Sizes option enabled by default.
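If you are on an older server where that allocation strategy isn't the default, it can be enabled per collection with the collMod command; a quick sketch via PyMongo (the collection name is assumed, and this only applies to the MMAPv1 storage engine on old versions):

from pymongo import MongoClient

db = MongoClient().mydb
db.command("collMod", "impressions", usePowerOf2Sizes=True)   # hypothetical collection name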

Aggregate collection that has an aggregate collection

I am having some trouble deciding which schema design to pick. I have a document which holds user info; each user has a very big set of items that can reach up to 20k items.
An item has a date, an id, and 19 other fields, plus an internal array which can have 20-30 items. An item can be modified, deleted, and of course newly inserted, and it can be queried by any property that it holds.
So I came up with 2 possible schemas.
1. Putting everything into a single document
{_id: ObjectId(""), type: 'user', name: 'xxx', items: [{......., internalitems: []}, {......., internalitems: []}, ...]}
{_id: ObjectId(""), type: 'user', name: 'yyy', items: [{......., internalitems: []}, {......., internalitems: []}, ...]}
2. Separating the items from the user and letting each item have its own document
{_id:ObjectId(""), type:'user', username:'xxx'}
{_id:ObjectId(""), type:'user', username:'yyy'}
{_id: ObjectId(""), type: 'useritem', username: 'xxx', item: {......., internalitems: []}}
{_id: ObjectId(""), type: 'useritem', username: 'xxx', item: {......., internalitems: []}}
{_id: ObjectId(""), type: 'useritem', username: 'yyy', item: {......., internalitems: []}}
{_id: ObjectId(""), type: 'useritem', username: 'yyy', item: {......., internalitems: []}}
As I explained before, a single user can have thousands of items and I have tens of users. internalitems can have 20-30 items, each with 9 fields.
Consider also that a single item can be queried by different users but can be modified only by the owner and one other process.
If performance is really important, which design would you pick?
If you would pick neither of them, what schema can you suggest?
On a side note, I will be sharding, and I have a single collection for everything.
I wouldn't recommend the first approach; there is a limit on the maximum document size:
"The maximum BSON document size is 16 megabytes.
The maximum document size helps ensure that a single document cannot use excessive amount of RAM or, during transmission, excessive amount of bandwidth. To store documents larger than the maximum size, MongoDB provides the GridFS API. See mongofiles and the documentation for your driver for more information about GridFS."
Source: http://docs.mongodb.org/manual/reference/limits/
There is also a performance implication if you exceed the current allocated document space when updating (http://docs.mongodb.org/manual/core/write-performance/ "Document Growth").
Your first solution is susceptible to both of these issues.
The second one (disclaimer: in the case of 20-30 internal items) is less susceptible to reaching the limit but might still require reallocation when doing updates. I haven't had this issue in a similar scenario, so this might be the way to go. You might also want to look into record padding (http://docs.mongodb.org/manual/core/record-padding/) for more details.
And, if all else fails, you can always split the internal items out as well.
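As a rough illustration of the second schema (a single collection as in the question; the compound index and field values are my assumptions, not something the answer prescribes):

from pymongo import MongoClient, ASCENDING

coll = MongoClient().mydb.entities           # hypothetical single collection

# Index so a user's items can be located quickly among millions of documents.
coll.create_index([("type", ASCENDING), ("username", ASCENDING)])

coll.insert_one({"type": "user", "username": "xxx"})
coll.insert_one({
    "type": "useritem",
    "username": "xxx",
    "item": {"date": "2015-01-01", "internalitems": []},   # plus the other fields
})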
Hope this helps!

Updating large number of records in a collection

I have a collection called TimeSheet with a few thousand records now. This will eventually grow to 300 million records within a year. In this collection I embed a few fields from another collection called Department, which mostly won't get any updates; only rarely will some records be updated. By rarely I mean only once or twice a year, and not all records, only less than 1% of the records in the collection.
Mostly, once a department is created there won't be any updates; even if there is an update, it will happen early on (when there are not many related records in TimeSheet).
Now, if someone updates a department after a year, in a worst-case scenario the TimeSheet collection will have about 300 million records in total and about 5 million matching records for the department being updated. The update query condition will be on an indexed field.
Since this update is time consuming and creates locks, I'm wondering whether there is a better way to do it. One option I'm considering is running the update query in batches by adding an extra condition like UpdatedDateTime > somedate && UpdatedDateTime < somedate, for example as in the sketch below.
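For example, something like this sketch (the collection, field names, and window size are only assumptions):

from datetime import datetime, timedelta
from pymongo import MongoClient

timesheet = MongoClient().mydb.TimeSheet     # hypothetical connection details

def update_department_in_batches(dept_id, new_fields, start, end, window_days=7):
    """Run the update one date window at a time instead of as one huge query."""
    cursor = start
    while cursor < end:
        upper = min(cursor + timedelta(days=window_days), end)
        timesheet.update_many(
            {"DepartmentId": dept_id,
             "UpdatedDateTime": {"$gte": cursor, "$lt": upper}},
            {"$set": new_fields},
        )
        cursor = upper

update_department_in_batches("dept-42", {"DepartmentName": "R&D"},
                             datetime(2015, 1, 1), datetime(2016, 1, 1))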
Other details:
A single document size could be about 3 or 4 KB
We have a replica set containing three replicas.
Is there any better way to do this? What do you think about this kind of design? And what if the numbers I gave were smaller, as below?
1) 100 million total records and 100,000 matching records for the update query
2) 10 million total records and 10,000 matching records for the update query
3) 1 million total records and 1000 matching records for the update query
Note: The collection names department and timesheet, and their purpose are fictional, not the real collections but the statistics that I have given are true.
Let me give you a couple of hints based on my general knowledge and experience:
Use shorter field names
MongoDB stores the same key in every document. This repetition increases disk usage, which can cause performance issues on a very large database like yours (a small example follows the pros and cons below).
Pros:
Smaller documents, so less disk space
More documents fit in RAM (better caching)
Index sizes will be smaller in some scenarios
Cons:
Less readable names
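For illustration only (the field names here are invented), the same record with long and short keys:

# Long keys are repeated in every one of the ~300 million documents:
long_keys = {"DepartmentName": "R&D", "UpdatedDateTime": "2015-06-01", "EmployeeId": 42}

# Shorter keys carry the same data with less per-document key overhead:
short_keys = {"dn": "R&D", "udt": "2015-06-01", "eid": 42}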
Optimize index size
The smaller the index, the more of it fits in RAM and the fewer index misses occur. Consider the SHA-1 hash of a git commit, for example: a commit is often referred to by just its first 5-6 characters, so you could store those 5-6 characters instead of the whole hash.
Understand padding factor
Updates that grow a document cause a costly document move: the old document is deleted, the document is rewritten to a new empty location, and the indexes are updated, which is expensive.
We need to make sure the document doesn't move when an update happens. Each collection has a padding factor which tells MongoDB, during a document insert, how much extra space to allocate beyond the actual document size.
You can see the collection padding factor using:
db.collection.stats().paddingFactor
Add padding manually
In your case you are pretty sure to start with a small document that will grow. Updating the document after a while will cause multiple document moves, so it's better to add padding to the document. Unfortunately, there is no easy way to add padding. We can do it by adding some random bytes under a throwaway key during the insert and then deleting that key in the next update query (see the sketch below).
Finally, if you are sure that some keys will be added to the documents in the future, preallocate those keys with default values, so that later updates don't grow the document and cause document moves.
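A hedged sketch of that trick with PyMongo (the collection, field names, and amount of padding are all guesses for illustration):

import os
from pymongo import MongoClient

coll = MongoClient().mydb.timesheet          # hypothetical names

doc_id = 12345
# Insert with a throwaway key so extra space is allocated for the record up front.
coll.insert_one({"_id": doc_id, "dept": "R&D",
                 "_padding": os.urandom(512)})            # ~512 bytes of headroom

# On the first real update, drop the padding while adding the new fields,
# so the grown document can still fit in its original allocation.
coll.update_one({"_id": doc_id},
                {"$set": {"hours": 8}, "$unset": {"_padding": ""}})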
You can get details about queries that caused document moves:
db.system.profile.find({ moved: { $exists : true } })
Large number of collections vs. large number of documents in a few collections
Schema design depends on the application's requirements. If there is a huge collection where we query only the latest N days of data, we can optionally choose a separate collection per period, and old data can be safely archived. This helps ensure that caching in RAM works properly.
Every collection you create incurs a cost beyond the collection's contents: each collection has a minimum size of a few KB plus one index (8 KB), and every collection has an associated namespace; by default there are around 24K namespaces. For example, having a collection per user is a bad choice since it is not scalable: after some point Mongo won't allow us to create new collections or indexes.
Generally, having many collections carries no significant performance penalty. For example, we can choose to have one collection per month if we know we always query by month.
Denormalization of data
It's always recommended to keep all the related data for a query, or sequence of queries, in the same disk location. You sometimes need to duplicate information across different documents. For example, for a blog post you'll want to store the post's comments within the post document.
Pros:
index size will be much smaller, as the number of index entries will be lower
queries will be very fast and fetch all necessary details at once
document size will be comparable to the page size, which means that when we bring this data into RAM we are mostly not bringing unrelated data along with the page
a document move frees a whole page, not a tiny chunk within the page that may never be reused by later inserts
Capped Collections
Capped collections behave like circular buffers. They are a special type of fixed-size collection that can sustain very high-speed writes and sequential reads. Being fixed in size, once the allocated space is filled, new documents are written by deleting the older ones. However, document updates are only allowed if the updated document fits within the original document size (play with padding for more flexibility).
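Creating one is a one-liner; a small sketch with PyMongo (names and sizes are arbitrary):

from pymongo import MongoClient

db = MongoClient().mydb
# 1 MB of preallocated space, at most 1000 documents; the oldest entries get overwritten.
log = db.create_collection("event_log", capped=True, size=1024 * 1024, max=1000)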