Capped collections - BsonId, uniqueness and index - mongodb

I want to use a capped collection as a cache store, I plan on selecting using a compound index - key and expiry-date. Since it's impossible to update/delete from a capped collection, I will add new entries with new expiry dates and just select the one with future expiry.
1) Is this the optimal way of creating the index if I'll be using Query.GTE("expiry", DateTime.Now) in the query?
cacheColl.EnsureIndex(new IndexKeysBuilder().Ascending("key").Descending("expiry"));
2) Do I need a [BsonId] attribute on the class? I know that "key" won't be unique. Does a record need to have a unique id entry?
3) My only motivation for using a capped collection is to limit the final size of the cache (both disk and memory) and not having to delete expired cache items myself. Is there a reason to prefer a regular collection and update items / delete expired ones? Even if I delete the documents, I read that space is not freed (would I need to compact?)

1) The index looks about right. You can also add a descending sort and a limit of 1 to the query if you only care about the latest entry.
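For example, a minimal sketch of that query in the mongo shell (collection and field names taken from the question; the C# driver exposes equivalent sort and limit options on its cursor):
// find the newest not-yet-expired entry for a key, at most one document
db.cache.find({ key: "some-key", expiry: { $gte: new Date() } }).sort({ expiry: -1 }).limit(1)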
2) No. In a capped collection, _id is not automatically created and is not required. The reason why it needs to be unique on normal collections is that a unique index on _id is created for those collections by default.
3) There are pros and cons to both approaches, and which is better depends entirely on your needs. One thing you might want to consider about capped collections is that they are not easy to resize once created. This will be problematic if you realize later on that the size you initially set was too small for the time window you need.
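For reference, the size is fixed at creation time, so choose it carefully up front. A minimal sketch in the mongo shell (the name and size are placeholders):
// create a capped collection preallocated at 100 MB
db.createCollection("cache", { capped: true, size: 100 * 1024 * 1024 })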
P.S. You are right that the space used by the extents of a deleted document is not freed back. However, Mongo keeps track of these extents and reuses them whenever possible.

Related

MongoDb constraint allowing only one entry in collection

Can I make a constraint to prevent more than one document appearing in a collection? The collection will store the version of the database structures.
In SQL I would have put a check constraint on the table that checks that the row count of the table is less than 2.
I believe the best way would be using User-Defined Roles to provide Collection-Level Access Control. You can create a document with an initial database_version and then assign your user a role that restricts access to the collection to only finding and updating documents.
P.S.: While searching for a solution you may have come across a possible alternative called Capped Collections. It won't work in your case, as Capped Collections reject updates that increase a document's size.
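A minimal sketch of such a role in the mongo shell (the database, collection, role, and user names are all placeholders):
// a role limited to finding and updating documents in the version collection
db.createRole({
  role: "versionManager",
  privileges: [
    { resource: { db: "myapp", collection: "dbVersion" }, actions: ["find", "update"] }
  ],
  roles: []
})
// grant it to the application's user
db.grantRolesToUser("appUser", [{ role: "versionManager", db: "myapp" }])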

MongoDB Internal implementation of indexing?

I've learned a lot of things about indexing and found the following here:
Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the
query statement. If an appropriate index exists for a query, MongoDB can use the index to limit the number of documents it must inspect.
But I still have some questions:
While creating an index using createIndex(), is the record always stored in RAM?
Do I need to create the index again every time my application restarts?
What happens in the case of the default id (_id)? Is it always stored in RAM?
_id is a default index; does that mean all the records are always stored in RAM for a particular collection?
Please correct me if I am wrong.
Thanks.
I think you are under the impression that indexes are stored in RAM. What if I say they are not?
First of all, we need to understand what indexes are: an index is basically a pointer that tells where on disk a document is, just like the index of a book tells us, for faster access, which page a topic is on.
So when indexes are created, they are also stored on disk. But while an application is running, frequently used indexes get loaded into RAM for even faster access; there is a difference, though, between being loaded and being created.
Also, loading an index is not the same as loading a collection or its records into RAM. If an index is loaded, we know exactly which documents to pick up from disk, unlike loading all the documents and verifying each one of them. This is how indexes avoid a collection scan.
Creating an index is a one-time process, but each write to a document can potentially alter the index, so parts of it might need to be recalculated because entries might get shuffled based on the change in data. That's why indexing makes writes slower and reads faster.
Again, think of a book: if you add a new topic of, say, 2 pages in the middle of the book, all the index entries after that point need to be recalculated accordingly.
While creating an index using createIndex(), is the record always stored in RAM?
No, records are not stored in RAM. While creating the index, MongoDB processes all the documents in the collection and builds the index structure; understandably, this is time-consuming if there are too many documents, which is why there is an option to build the index in the background.
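For example (collection and field names are placeholders; in recent MongoDB versions all index builds use an optimized process and the background option is ignored):
// build the index without blocking reads and writes on the collection
db.orders.createIndex({ customerId: 1 }, { background: true })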
Do I need to create the index again every time my application restarts?
Indexes are created one time; you can delete one and create it again, but it won't be recreated on an application or DB restart. That would be insane for a huge collection in a sharded environment.
What happens in the case of the default id (_id)? Is it always stored in RAM?
Again, that's not true. _id comes as an indexed field, so the index is already created even for an empty collection, and each write recalculates the index. Since it's a unique index, the processing is faster.
_id is a default index; does that mean all the records are always stored in RAM for a particular collection?
All records would only be stored in RAM if you were using MongoDB's in-memory storage engine, which, I believe, comes with the enterprise edition. Indexing by itself will not automatically load the records into RAM.
To answer the question from the title:
MongoDB indexes use a B-tree data structure.
source: https://docs.mongodb.com/manual/indexes/index.html#b-tree

MongoDB 3.X : Does it make sense to have only one collection per database

Since MongoDB 3.x introduces per-document locking rather than locks on the collection or database, does it make sense to write all of your data to a single collection, with one extra identifier field, "documentType"?
It would help simulate a "join" through a map-reduce operation.
Couchbase does the same thing with "buckets" instead of collection.
Does anybody see any disadvantages with this approach?
There's one big general-case disadvantage: indexes.
With Mongo, you generally want to set up indexes so that most, if not all, queries you make, use them. So in addition to the one on _id, you'll set up indexes on the primary fields you search by (often compounded with those you sort by).
If you're storing everything in one single collection, that means you need to have all those indexes on that collection. Which means two things:
The indexes will be bigger, since there are more documents to index. Granted, this can be somewhat mitigated by using sparse indexes.
Inserting or modifying documents in the collection requires Mongo to update all these indexes (where it'd just update the relevant indexes in the standard use-many-collections approach). This kills your write performance.
Furthermore, if you have in your application a query that somehow doesn't use one of those many indexes, it needs to scan through the entire collection, which is O(n) where n is the number of documents in the collection -- in your case, that means the number of documents in the entire database.
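You can check which case you are in with explain() (a sketch; the collection name and query are hypothetical):
// "COLLSCAN" in the winning plan means a full collection scan; "IXSCAN" means an index was used
db.everything.find({ documentType: "user", email: "a@example.com" }).explain("executionStats")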
Collections are cheap. Use them ;)

Best way to organize subdocuments in Mongo?

I have a project that I'm working on that will require me to store a large number of objects in an array linked to a parent object, akin to the storing of social media comments to their original post. What is the best way for me to organize the data for the array of child documents/comments?
Is it considered best practice to have the child objects under a different collection and reference to their parent or would it be more ideal just to put them all within the parent object directly?
I discuss this a little here, read this first:
https://stackoverflow.com/a/27285313/68567
For your case, Option 3 (keeping some of the data in your primary model) is probably the best. The key is to avoid unbounded array growth.
This has to do with how Mongodb allocates documents. http://docs.mongodb.org/manual/core/storage/
"Every document in MongoDB is stored in a record which contains the document itself and extra space, or padding, which allows the document to grow as the result of updates."
When Mongo allocates new documents, it allocates space based on the size of the inserted document and the sizes of documents already in your collection. (Read more in the link above.) If you have some documents that are orders of magnitude larger than others, this will likely lead to fragmentation.
The way to avoid having too many documents in your 'comments' sub-document array is with the $push and $slice update operators.
http://docs.mongodb.org/manual/reference/operator/update/slice/
So store the 'most recent 5' and display those when the item first loads. (Or oldest, or whatever other sorting criteria you want to use.) Then provide a way for the user to load more which will do a separate round-trip to the collection that has all of them.
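A minimal sketch of that pattern (collection and field names are placeholders):
// push the new comment and keep only the 5 most recent ones
db.posts.update(
  { _id: postId },
  { $push: { comments: { $each: [newComment], $slice: -5 } } }
)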

Updating large number of records in a collection

I have a collection called TimeSheet with a few thousand records now. This will eventually grow to 300 million records in a year. In this collection I embed a few fields from another collection called Department, which mostly won't get any updates; only rarely will some records be updated. By rarely I mean only once or twice a year, and not all records, only less than 1% of the records in the collection.
Mostly, once a department is created there won't be any updates; even if there is an update, it will be done early on (when there are not many related records in TimeSheet).
Now if someone updates a department after a year, in the worst-case scenario the TimeSheet collection will have about 300 million records in total and about 5 million matching records for the department that gets updated. The update query's condition will be on an indexed field.
Since this update is time-consuming and creates locks, I'm wondering if there is any better way to do it. One option I'm thinking of is to run the update query in batches by adding an extra condition like UpdatedDateTime > somedate && UpdatedDateTime < somedate.
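Something like this sketch, in the mongo shell (the embedded field names and date boundaries are placeholders):
// update one time window at a time instead of all matching records at once
db.TimeSheet.update(
  { deptId: dept, UpdatedDateTime: { $gte: ISODate("2015-01-01"), $lt: ISODate("2015-02-01") } },
  { $set: { "department.name": newName } },
  { multi: true }
)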
Other details:
A single document size could be about 3 or 4 KB
We have a replica set containing three replicas.
Is there any other better way to do this? What do you think about this kind of design? And what do you think if the numbers I've given are smaller, like below?
1) 100 million total records and 100,000 matching records for the update query
2) 10 million total records and 10,000 matching records for the update query
3) 1 million total records and 1000 matching records for the update query
Note: The collection names Department and TimeSheet and their purpose are fictional; they are not the real collections, but the statistics I have given are true.
Let me give you a couple of hints based on my general knowledge and experience:
Use shorter field names
MongoDB stores the same keys in every document. This repetition increases disk usage, which can cause performance issues on a very large database like yours.
Pros:
Smaller documents, so less disk space
More documents fit in RAM (more caching)
Index size will be smaller in some scenarios
Cons:
Less readable names
Optimize on index size
The smaller the index, the more of it fits in RAM and the fewer index misses happen. Consider the SHA-1 hash of a git commit, for example: a commit is often represented by its first 5-6 characters. You can simply store those 5-6 characters instead of the whole hash.
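A sketch of that idea (the collection and field names are hypothetical):
// store and index a short prefix instead of the full 40-character hash
db.commits.insert({ shortHash: "9fceb02", message: "add pagination" })
db.commits.createIndex({ shortHash: 1 })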
Understand padding factor
Updates that grow a document can cause a costly document move. A document move means deleting the old document, rewriting it at a new empty location, and updating the indexes, which is expensive.
We need to make sure the document doesn't move when an update happens. For each collection there is a padding factor involved, which tells Mongo, during a document insert, how much extra space to allocate beyond the actual document size.
You can see the collection padding factor using:
db.collection.stats().paddingFactor
Add padding manually
In your case you are pretty sure to start with a small document that will grow. Updating your document after a while will cause multiple document moves, so it's better to add padding to the document. Unfortunately, there is no easy way to add padding: we can do it by adding some random bytes to some key while inserting and then deleting that key in the next update query.
Finally, if you are sure that some keys will be added to the documents in the future, preallocate those keys with some default values, so that later updates don't grow the document size and cause document moves.
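A minimal sketch of the manual-padding trick described above (field names are placeholders):
// insert with a throwaway key holding some random bytes as padding
db.TimeSheet.insert({ dept: "sales", hours: 8, _padding: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" })
// on a later update, reclaim the padding instead of growing the document
db.TimeSheet.update({ _id: id }, { $set: { notes: "reviewed" }, $unset: { _padding: "" } })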
You can get details about the queries causing document moves:
db.system.profile.find({ moved: { $exists : true } })
Large number of collections vs. large number of documents in a few collections
Schema is something that depends on the application's requirements. If there is a huge collection in which we query only the latest N days of data, we can choose to keep that in a separate collection, and the old data can be safely archived. This will make sure that caching in RAM is done properly.
Every collection you create incurs a cost beyond the act of creating it. Each collection has a minimum size of a few KB plus one index (8 KB), and every collection has a namespace associated with it; by default there are some 24,000 namespaces. For example, having a collection per user is a bad choice, since it is not scalable: after some point Mongo won't allow us to create new collections or indexes.
Generally, having many collections has no significant performance penalty. For example, we can choose to have one collection per month if we know that we always query based on months.
Denormalization of data
It's always recommended to keep all the related data for a query, or a sequence of queries, in the same disk location. To achieve this you sometimes need to duplicate information across different documents. For example, in a blog you'll want to store a post's comments within the post document.
Pros:
Index size will be much smaller, as the number of index entries will be smaller
Queries will be very fast, since a single fetch includes all the necessary details
Document size will be comparable to the page size, which means that when we bring this data into RAM we are mostly not bringing other data along with the page
A document move will free up a whole page, rather than a small chunk of a page that may not be usable for further inserts
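A sketch of what such a denormalized document might look like (the field layout is hypothetical):
// the post and its comments live in one document, at one disk location
db.posts.insert({
  title: "My first post",
  body: "Some content",
  comments: [
    { author: "alice", text: "Nice post" },
    { author: "bob", text: "Thanks for sharing" }
  ]
})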
Capped Collections
Capped collections behave like circular buffers. They are a special type of fixed-size collection that can sustain very high-speed writes and sequential reads. Being fixed-size, once the allocated space is filled, new documents are written by deleting the oldest ones. However, document updates are only allowed if the updated document fits within the original document's size (play with padding for more flexibility).
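For completeness, a sketch showing that an existing collection can also be converted in place (the name and size are placeholders):
// convert an existing collection into a 10 MB capped collection
db.runCommand({ convertToCapped: "log", size: 10 * 1024 * 1024 })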