MongoDB and update-in-place

I have a very large MongoDB object, about 2MB.
I have to update the readCount field frequently, and I need to be sure that the operation is very fast.
I know about "update-in-place" and I'm able to send this simple operation:
db.pages.update( { name:"SamplePage" }, { $inc: { readCount : 1 } } );
But how does MongoDB process that operation internally?
Does it load the whole document from disk, modify the value, and store the entire document again? Or, if the document size does not change, can it update on disk only the part of the file that holds the readCount value?

MongoDB uses memory-mapped files for its data file management. What this actually means is that mongo doesn't load documents from disk. Instead it tries to access a memory page where that document is located. If that page is not yet in RAM, then the OS goes ahead and fetches it from the disk.
Writing is exactly the same. Mongo tries to write to a memory page. If it's in RAM, then it's ultra-fast (just swapping some bits in the memory). The page is marked dirty and the OS will take care of flushing it back to disk (persisting your changes).
If you have journaling enabled, then your inserts/updates are somewhat more expensive, as MongoDB has to make another write to the append-only journal file.
In my app MongoDB handles 10-50k updates per second per instance on modest hardware.
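The behaviour described in this answer can be illustrated with a toy model. This is only a sketch, not MongoDB code: the `MappedFile` class, the 4096-byte page size and the offset of `readCount` are all assumptions made for the example. It shows that many in-place increments of one 4-byte field dirty a single page, so only that page needs to be flushed, regardless of the 2MB document size.

```javascript
// Toy model of a memory-mapped data file (hypothetical, not MongoDB internals).
class MappedFile {
  constructor(sizeBytes, pageSize = 4096) {
    this.pageSize = pageSize;                 // assumed OS page size
    this.disk = new Uint8Array(sizeBytes);    // what is persisted
    this.memory = new Uint8Array(sizeBytes);  // what the process sees in RAM
    this.view = new DataView(this.memory.buffer);
    this.dirtyPages = new Set();
    this.pagesFlushed = 0;
  }

  // Overwrite a 4-byte integer in place and mark the touched page(s) dirty.
  writeInt32(offset, value) {
    this.view.setInt32(offset, value, true);
    this.dirtyPages.add(Math.floor(offset / this.pageSize));
    this.dirtyPages.add(Math.floor((offset + 3) / this.pageSize));
  }

  readInt32(offset) {
    return this.view.getInt32(offset, true);
  }

  // Stand-in for the OS writing dirty pages back to disk.
  flush() {
    for (const page of this.dirtyPages) {
      const start = page * this.pageSize;
      const end = Math.min(start + this.pageSize, this.memory.length);
      this.disk.set(this.memory.subarray(start, end), start);
      this.pagesFlushed++;
    }
    this.dirtyPages.clear();
  }
}

const file = new MappedFile(2 * 1024 * 1024); // a 2MB "document"
const readCountOffset = 1000;                 // assumed position of readCount

// 1000 increments happen purely in memory...
for (let i = 0; i < 1000; i++) {
  file.writeInt32(readCountOffset, file.readInt32(readCountOffset) + 1);
}
// ...and a single later flush persists them all.
file.flush();

console.log("pages written back:", file.pagesFlushed);
```

In this model the number of pages written back depends on how many distinct pages were touched, not on how many updates were made or how large the document is.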

MongoDB computes a padding factor for each collection based on how often documents grow and have to move. The more often they grow, the larger the padding factor. Internally it uses an adaptive algorithm to try to minimize moves on an update. Basically it operates in RAM.

Related

MongoDB: reduce read size and RAM needed with project?

I am designing a MongoDB database that looks something like this:
registry:{
id:1,
duration:123,
score:3,
text:"aaaaaaaaaaaaaaaaaaaaaaaaaaaa"
}
The text field is very big compared to the rest. I sometimes need to run analytics queries that average the duration or the score, but these never use the text.
I also have more specific queries that retrieve all the information about a single document. For those queries I could afford to spend more time and make two queries to retrieve all the data.
My question is, if I make a query like this:
db.registries.aggregate( [
  {
    $group: {
      _id: null,
      averageDuration: { $avg: "$duration" },
    }
  }
] )
Would it need to read the data from the text field? That would make the query much slower and it would take a lot of RAM. If that is the case, would it be better to split the records in two and have something like this?
registry:{
id:1,
duration:123,
score:3,
}
registry_text:{
id:1,
text:"aaaaaaaaaaaaaaaaaaaaaaaaaaaa"
}
Thanks a lot!
I don't know how the server works in this case, but I expect that, for caching reasons, the server will load complete documents into memory when it reads them from disk. Disk reads are very slow (= expensive in time taken) and I expect the server will aggressively use memory if it can to avoid reads.
An important note here is that the documents are stored on disk as lists of key-value pairs comprising their contents. To not load a field from disk the server would have to rebuild the document in question as part of reading it since there are length fields involved. I don't see this happening in practice.
So, once the documents are in memory I assume they are there with all of their fields and I don't expect you can tune this.
When you are querying, the server may or may not drop individual fields but this would only change the memory requirements for the particular query. Generally these memory requirements are dwarfed by the overall database cache size and aggregation pipelines. So I don't think it really matters at what point a large field is dropped from a document during query processing (assuming you project it out in the query).
I think this isn't a worthwhile matter to try to ponder/optimize. If you have a real system with real workloads, you'll be much more pressed to optimize something else.
If you are concerned with memory usage when the amount of available memory is consumer-sized (say, under 16 GB), just get more memory. It's insanely cheap given how much time you'd spend working around the lack of it (whether we are talking about provisioning bigger AWS instances or buying more sticks of RAM).
You should be able to use $project to limit the fields read.
As general advice, don't try to normalize the data with MongoDB as you would with SQL. Also, it's often more performant to read plain documents from the DB and do the processing on your server.
I have found this answer, which seems to indicate that with a projection the database server still has to fetch the whole document; it only reduces bandwidth:
When using projection to remove unused fields, the MongoDB server will
have to fetch each full document into memory (if it isn't already
there) and filter the results to return. This use of projection
doesn't reduce the memory usage or working set on the MongoDB server,
but can save significant network bandwidth for query results depending
on your data model and the fields projected.
https://dba.stackexchange.com/questions/198444/how-mongodb-projection-affects-performance
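The point of the quoted answer can be illustrated with a small model. This is a hedged sketch, not how the server is implemented: `projectFields` is a hypothetical helper, and JSON string length stands in for document size. Every full document is still examined, and only the data sent back shrinks.

```javascript
// Toy model of a projection (hypothetical helper, not a MongoDB API).
function projectFields(collection, fields) {
  let bytesExamined = 0; // the whole document has to be read
  let bytesReturned = 0; // only the projected fields are sent back
  const results = collection.map((doc) => {
    bytesExamined += JSON.stringify(doc).length;
    const projected = {};
    for (const field of fields) {
      if (field in doc) projected[field] = doc[field];
    }
    bytesReturned += JSON.stringify(projected).length;
    return projected;
  });
  return { results, bytesExamined, bytesReturned };
}

// Two documents shaped like the ones in the question, with a large text field.
const registries = [
  { id: 1, duration: 123, score: 3, text: "a".repeat(10000) },
  { id: 2, duration: 77, score: 5, text: "b".repeat(10000) },
];

const outcome = projectFields(registries, ["duration", "score"]);
const averageDuration =
  outcome.results.reduce((sum, doc) => sum + doc.duration, 0) /
  outcome.results.length;

console.log(averageDuration, outcome.bytesExamined, outcome.bytesReturned);
```

In this model, splitting the text into a second collection is the only change that would reduce `bytesExamined`; the projection alone reduces only `bytesReturned`.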

Using nested document structure in mongodb

I am planning to use a nested document structure for my MongoDB schema design rather than a flat one, because in my case I need to fetch my result in a single query.
However, MongoDB has a size limit for a document:
MongoDB Limits and Thresholds
A MongoDB document has a size limit of 16MB (an amount of data). If your subcollection can grow without limits, go flat.
I never need to fetch the nested data; I only need it for filtering and querying.
Will I still be bound by MongoDB's size limit if I use the embedded data only for querying and filtering and never fetch it? My understanding is that in this case MongoDB won't load the complete document into memory, only the selected fields.
Nested schema design example
{
  clinicName: "XYZ Hospital",
  clinicAddress: "ABC place.",
  "doctorsWorking": {
    "doctorId1": {
      "doctorJoined": ISODate("2017-03-15T10:47:47.647Z")
    },
    "doctorId2": {
      "doctorJoined": ISODate("2017-04-15T10:47:47.647Z")
    },
    "doctorId3": {
      "doctorJoined": ISODate("2017-05-15T10:47:47.647Z")
    },
    ...
    ...
    // suppose up to 30000-40000 more records
  }
}
I don't think your understanding is correct when you say that "in this case, MongoDB won't load the complete document in memory but only the selected fields".
The MongoDB documentation says:
The maximum BSON document size is 16 megabytes. The maximum document size helps ensure that a single document cannot use excessive amount of RAM or, during transmission, excessive amount of bandwidth. To store documents larger than the maximum size, MongoDB provides the GridFS API.
So there is a clear 16 MB limit on document size, and MongoDB should stop you from saving a document larger than that.
Suppose for a moment that your understanding were right: MongoDB would accept a document of any size, and only loading more than 16 MB into RAM would be disallowed. At insert time the server cannot know which queries will later run against the data, because we never declare a query pattern when inserting, and someone could always try to fetch the full document in one shot. You would end up storing large documents that cannot be used later.
If the limit applied only to transmission (hypothetically), developers could easily work around it in code by reading the data in chunks that never exceed 16 MB, the way I/O is done on large files. The limit would be useless, and I assume MongoDB's creators did not want that.
Also, if the limit applied only to transmission, there would be no need for separate collections. We could put everything in a single collection, write smart queries, and fetch anything over 16 MB in parts. But that is not how it works.
So the limit must be on document size; otherwise it would create too many problems.
In my opinion, if you need the "doctorsWorking" data only for filtering or querying (and you think it could push the document past the 16 MB limit), it is better to keep it in a separate collection.
Ultimately it all depends on your query and data patterns. If a doctor can work shifts at multiple hospitals, keeping doctors in a separate collection is a good fit.
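To judge whether a structure like the one in the question could approach the 16 MB limit, it helps to estimate its BSON size. The sketch below is a simplified estimator written for this example, not a MongoDB API: it follows the BSON layout (int32 length prefix, one type byte plus a NUL-terminated key per element, trailing NUL) and supports only strings, numbers (all counted as 8-byte doubles), booleans, null, dates and nested objects. The doctor keys and the 40000 count are taken from the question's hypothetical. In the legacy mongo shell, `Object.bsonsize(doc)` reports the real size.

```javascript
const utf8 = new TextEncoder();

// A BSON cstring is the UTF-8 bytes plus a trailing NUL.
function cstringSize(text) {
  return utf8.encode(text).length + 1;
}

// Size of a value's payload, excluding its type byte and key.
function bsonValueSize(value) {
  if (typeof value === "string") return 4 + utf8.encode(value).length + 1;
  if (typeof value === "number") return 8; // assumption: stored as a double
  if (typeof value === "boolean") return 1;
  if (value === null) return 0;
  if (value instanceof Date) return 8;     // UTC datetime, int64
  if (typeof value === "object") return bsonDocumentSize(value);
  throw new TypeError("type not supported by this sketch");
}

function bsonDocumentSize(doc) {
  let size = 4 + 1; // int32 length prefix + terminating NUL
  for (const [key, value] of Object.entries(doc)) {
    size += 1 + cstringSize(key) + bsonValueSize(value);
  }
  return size;
}

// Build a document shaped like the question's example.
const clinic = {
  clinicName: "XYZ Hospital",
  clinicAddress: "ABC place.",
  doctorsWorking: {},
};
for (let i = 1; i <= 40000; i++) {
  clinic.doctorsWorking["doctorId" + i] = {
    doctorJoined: new Date("2017-03-15T10:47:47.647Z"),
  };
}

const LIMIT = 16 * 1024 * 1024;
const clinicBytes = bsonDocumentSize(clinic);
console.log(clinicBytes, clinicBytes < LIMIT);
```

With keys this short and a single date per doctor, the estimate stays well under the limit. Longer keys or more fields per doctor scale the result roughly linearly, so the same function can be rerun with a more realistic entry.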

Does MongoDB's WiredTiger have the same reallocation performance issue as MMAPv1?

As the MMAPv1 documentation says:
All records are contiguously located on disk, and when a document becomes larger than the allocated record, MongoDB must allocate a new record. New allocations require MongoDB to move a document and update all indexes that refer to the document, which takes more time than in-place updates and leads to storage fragmentation.
Changed in version 3.0.0.
By default, MongoDB uses Power of 2 Sized Allocations so that every document in MongoDB is stored in a record which contains the document itself and extra space, or padding. Padding allows the document to grow as the result of updates while minimizing the likelihood of reallocations.
But the WiredTiger documentation says nothing about this. So I want to know whether a change in record size is fine with WiredTiger, or whether there is a performance issue that the documentation simply doesn't mention.
You do not have to worry about document movement, padding, etc. with WiredTiger. New writes are initially written to unused regions of the files and are incorporated with the rest of the data in the background later.
During an update, WiredTiger writes a new version of the document rather than overwriting existing data the way MMAPv1 does in many cases. (See the video from MongoDB's free online courses.)
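For contrast, the MMAPv1 behaviour quoted in the question can be sketched as follows. This is a simplified model, not MongoDB source: the 32-byte minimum and the rule that records above 2 MB round up to a multiple of 2 MB are my reading of the MMAPv1 documentation, record header overhead is ignored, and both function names are made up for the example.

```javascript
const MIN_RECORD = 32;         // assumed smallest power-of-2 record size
const TWO_MB = 2 * 1024 * 1024;

// Record size MMAPv1 would allocate for a document of docBytes (simplified).
function allocatedRecordSize(docBytes) {
  if (docBytes > TWO_MB) {
    // Large documents: round up to the next multiple of 2 MB.
    return Math.ceil(docBytes / TWO_MB) * TWO_MB;
  }
  let size = MIN_RECORD;
  while (size < docBytes) size *= 2;
  return size;
}

// An update can stay in place only if the grown document still fits
// in the record that was allocated for the original document.
function fitsInPlace(originalBytes, updatedBytes) {
  return updatedBytes <= allocatedRecordSize(originalBytes);
}

console.log(allocatedRecordSize(100)); // record for a 100-byte document
console.log(fitsInPlace(100, 120));    // growth absorbed by the padding?
console.log(fitsInPlace(100, 200));    // growth that forces a move?
```

An update such as `$inc` on an existing numeric field does not change the document size at all, so in this model it always stays in place.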

Will updating a document's key values use more space in MongoDB?

In MongoDB, if I continuously update the key values of a document in a collection, will it consume more space? If I update a value 100 thousand times, will space be wasted on the hard disk?
Basically it won't use more space, because the writes happen in place: if the new value doesn't require more space, nothing more has to be allocated.
About rapid updates: MongoDB writes are lazy, so it can group multiple writes into one physical write to disk.
You can find more info here
Please note that if you have logging enabled, it will use more disk space, but how much depends on your configuration.
MongoDB's dbStats command reports the database storage usage; try using it.

Why and when is it necessary to rebuild indexes in MongoDB?

I've been working with MongoDB for a while, and today a question came up while talking with a colleague.
When you create an index in MongoDB, the collection is processed and the index is built.
The index is then updated on every insertion and deletion of documents, so I don't really see the need to run a rebuild operation (which drops the index and then rebuilds it).
According to MongoDB documentation:
Normally, MongoDB compacts indexes during routine updates. For most
users, the reIndex command is unnecessary. However, it may be worth
running if the collection size has changed significantly or if the
indexes are consuming a disproportionate amount of disk space.
Has anyone actually needed to run an index rebuild where it was worth it?
As per the MongoDB documentation, there is generally no need to routinely rebuild indexes.
NOTE: Any advice on storage becomes more interesting with MongoDB 3.0+, which introduced a pluggable storage engine API. My comments below are specifically in reference to the default MMAP storage engine in MongoDB 3.0 and earlier. WiredTiger and other storage engines have different storage implementations for data & indexes.
There may be some benefit in rebuilding an index with the MMAP storage engine if:
An index is consuming a larger than expected amount of space compared to the data. Note: you need to monitor historical data & index size to have a baseline for comparison.
You want to migrate from an older index format to a newer one. If a reindex is advisable, this will be mentioned in the upgrade notes. For example, MongoDB 2.0 introduced significant index performance improvements, so the release notes include a suggested reindex to the v2.0 format after upgrading. Similarly, MongoDB 2.6 introduced 2dsphere (v2.0) indexes, which have a different default behaviour (sparse by default). Existing indexes are not rebuilt after index version upgrades; the choice of if/when to upgrade is left to the database administrator.
You have changed the _id format for a collection from a monotonically increasing key (e.g. ObjectID) to a random value, or vice versa. This is a bit esoteric, but there's an index optimisation that splits b-tree buckets 90/10 (instead of 50/50) if you are inserting _ids that are always increasing (ref: SERVER-983). If the nature of your _ids changes significantly, it may be possible to build a more efficient b-tree with a re-index.
For more information on general B-tree behaviour, see: Wikipedia: B-tree
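The effect of the 90/10 split mentioned above can be seen in a small simulation. This is a sketch under simplifying assumptions, not MongoDB's b-tree code: it models only the leaf level, keys arrive in strictly increasing order so every insert lands in the last bucket, and the bucket capacity of 100 is arbitrary.

```javascript
// Average bucket fill after inserting increasing keys, for a given split ratio.
function averageBucketFill(totalKeys, capacity, leftFraction) {
  const buckets = [[]];
  for (let key = 0; key < totalKeys; key++) {
    // Increasing keys always belong in the right-most bucket.
    const last = buckets[buckets.length - 1];
    last.push(key);
    if (last.length > capacity) {
      // Overflow: keep leftFraction of the keys, move the rest to a new bucket.
      const keep = Math.floor(last.length * leftFraction);
      buckets.push(last.splice(keep));
    }
  }
  const used = buckets.reduce((sum, bucket) => sum + bucket.length, 0);
  return used / (buckets.length * capacity);
}

const fillEvenSplit = averageBucketFill(100000, 100, 0.5);
const fillSkewedSplit = averageBucketFill(100000, 100, 0.9);

console.log("50/50 split fill:", fillEvenSplit.toFixed(2));
console.log("90/10 split fill:", fillSkewedSplit.toFixed(2));
```

With increasing keys, the left half of a 50/50 split never receives another insert, so buckets stay about half empty. The 90/10 split leaves them nearly full, which is why a change in the nature of the _ids can leave an index less compact than a rebuilt one.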
Visualising index usage
If you're really curious to dig into the index internals a bit more, there are some experimental commands/tools you can try. I expect these are limited to MongoDB 2.4 & 2.6 only:
indexStats command
storage-viz tool
While I don't know the exact technical reasons why, in MongoDB, I can make some assumptions about this, based on what I know about indexing from other systems and based on the documentation that you quoted.
The General Idea Of An Index
When moving from one document to the next in the full document collection, a lot of time and effort is wasted skipping past all the data that doesn't need to be dealt with. If you're looking for the document with id "1234", having to move through 100K+ of each document makes it slow.
Rather than having to search through all of the content of each document in the collection (physically moving the disk read heads, etc), an index makes this fast. It's basically a key/value pair that gives you the id and the location of that document. MongoDB can quickly scan through all of the id's in the index, find the locations of the documents that it needs, and go load them directly.
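The key/location idea can be sketched in a few lines. This is an illustration only: it uses a JavaScript `Map` as the index, whereas MongoDB's indexes are B-trees, and the helper names and sample data are made up for the example.

```javascript
// Build an index that maps a field value to the document's position.
function buildIndex(collection, field) {
  const index = new Map();
  collection.forEach((doc, position) => index.set(doc[field], position));
  return index;
}

// Without an index: walk every document until a match is found.
function findByScan(collection, field, value) {
  let examined = 0;
  for (const doc of collection) {
    examined++;
    if (doc[field] === value) return { doc, examined };
  }
  return { doc: null, examined };
}

// With an index: look up the location, then load that one document.
function findByIndex(collection, index, value) {
  const position = index.get(value);
  if (position === undefined) return { doc: null, examined: 0 };
  return { doc: collection[position], examined: 1 };
}

const documents = [];
for (let i = 0; i < 1000; i++) {
  documents.push({ id: String(i), payload: "x".repeat(100) });
}
const idIndex = buildIndex(documents, "id");

const scanned = findByScan(documents, "id", "999");
const indexed = findByIndex(documents, idIndex, "999");
console.log(scanned.examined, indexed.examined);
```

Both lookups return the same document; the difference is how many documents had to be examined to find it.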
Allocating File Size For An Index
Indexes take up disk space because they are basically a key/value pair stored in a much smaller location. If you have a very large collection (large number of items in the collection) then your index grows in size.
Most operating systems allocate chunks of disk space in certain block sizes. Most databases also allocate disk space in large chunks, as needed.
Instead of growing the file by 100K when 100K of documents are added, MongoDB will probably grow it by 1MB or maybe 10MB or something; I don't know what the actual growth size is. In SQL Server, you can tell it how fast to grow, and MongoDB probably has something like that.
Growing in chunks makes it faster to 'grow' the documents into the space, because the database doesn't need to constantly expand. If the database has 10MB of space already allocated, it can just use that space up. It doesn't have to keep expanding the file for each document. It just has to write the data to the file.
This is probably true of collections and indexes for collections - anything that is stored on disk.
File Size And Index Re-Building
When a large collection has a lot of documents added and removed, the index becomes fragmented. Index keys may not be in order, because when the index needed to grow there was room in the middle of the index file and not at the end. Index keys may have a lot of space in between them, as well.
If there are 10,000 items in the index, and # 10,001 needs to be inserted, it may be inserted in the middle of the index file. Now the index needs to re-build itself to put everything back in order. This involves moving a lot of data around, to make room at the end of the file and put item # 10,001 at the end.
If the index is constantly being thrashed (lots of stuff removed and added), it's probably faster to just grow the index file and always put stuff at the end. This makes creating index entries fast, but leaves empty holes in the file where old things were deleted.
If the index file has empty space where deleted things used to be, this is wasted effort when reading the index. The index file has more movement than needed, to get to the next item in the index. So, the index repairs itself... which can be time consuming for very large collections or very large changes to a collection.
Rebuild For A Large Index File
It can take a lot of disk access and I/O operations to correctly compact the index file back down to a reasonable size, with everything in order. Move out of place items to temp location, free up space in right spot, move them back. Oh by the way, to free up space, you had to move other items to temp location. It's recursive and heavy-handed.
Therefore, if you have a very large number of items in a collection and that collection has items added and removed on a regular basis, the index may need to be rebuilt from scratch. Doing this would wipe the current index file and rebuild from the ground up - which is probably going to be faster than trying to do thousands of moves inside of the existing file. Rather than moving things around, it just writes them sequentially, from scratch.
Large Change In Collection Size
Given everything I'm assuming above, a large change in the collection size would cause this kind of thrashing. If you have 10,000 documents in the collection and you delete 8,000 of them... well, now you have empty space in your index file where the 8,000 items used to be. MongoDB needs to move the remaining 2,000 items around in the physical file, to rebuild it in a compact form.
Instead of waiting around for 8,000 empty spaces to be cleaned up, it might be faster to rebuild from the ground up with the remaining 2,000 items.
Conclusion? Maybe?
So, the documentation that you quoted is probably going to deal with "big data" needs or high thrashing collections and indexes.
Also keep in mind that I'm making an educated guess based on what I know about indexing, disk allocation, file fragmentation, etc.
My guess is that "most users" in the documentation means that 99.9% or more of MongoDB collections don't need to worry about this.
MongoDB specific case
According to MongoDB documentation:
The remove() method does not remove the indexes
Deleting documents leaves the indexes themselves in place, so space freed inside an index may stay allocated on disk until you rebuild the index for that collection.