In MongoDB, if I continuously update the key values of a document in a collection, will it consume more space? If I update a value 100 thousand times, will space be wasted on the hard disk?
Basically it won't use more space, as the writes happen in place: if the new value doesn't require more space, nothing more has to be allocated.
As for rapid updates: MongoDB writes are lazy, so it can group multiple updates into one physical write to disk.
You can find more info here.
Please note that if you have logging enabled it will use more disk space, but that depends on your configuration.
MongoDB's dbStats command gives you the database storage usage; try using it.
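For example, from the mongo shell (a minimal sketch; the scale argument is optional and only used here to get the numbers in KB):
db.runCommand({ dbStats: 1, scale: 1024 }) // storage stats for the current database
db.stats(1024) // shell helper for the same command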
After storing some binary data in MongoDB 4.2.5 (a 3-node replica set), the oplog.rs collection grew to about 700 MB. The binary data was removed and the data model restructured, but the oplog.rs collection stays the same size (as expected). I understand that it's a capped collection with a maximum size and that eventually it'll reuse the space. In my case, though, I'd like to reclaim the space and start over. The database is used mostly for internal testing purposes. I don't mind losing some data from the oplog, but I do mind having a big oplog file, since the whole database is just a few MB.
Is it safe to use the emptycapped command on the oplog.rs collection in a replica set scenario? Do I need to run this command on each node? Do I need to compact the collection after the deletion (the last part of https://docs.mongodb.com/manual/tutorial/change-oplog-size/)?
Is there any other way to gracefully "reset" the oplog and free up the space?
The OpLog is limited to the size you have defined in your config, or to a default size if you have left it unset.
The OpLog (operations log) is a special capped collection that keeps a rolling record of all operations that modify the data stored in your databases.
It fills up to the defined size as changes come through (or as no-op heartbeats do).
If you want to reduce the size, reset the OpLog size in your config. But don't forget: a larger OpLog means a longer OpLog window.
The OpLog window tells you how long a secondary member can be offline and still catch up to the primary without doing a full resync.
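For reference, a hedged sketch of the procedure from the linked change-oplog-size tutorial (the 990 MB target is only an illustrative value; replSetResizeOplog requires MongoDB 3.6+, and the compact step should be run per member while it is a secondary):
rs.printReplicationInfo() // check the current OpLog size and window
db.adminCommand({ replSetResizeOplog: 1, size: 990 }) // new size in megabytes
use local
db.runCommand({ compact: "oplog.rs" }) // reclaim the freed disk space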
MongoDB is easy to get started with, but it is not easy to ensure availability (buy EC2 instances to build a master/slave setup? or a bigger replica set?). And there are many public key-value services (Dynamo, Azure Table) with high availability and good performance. So if I could replace MongoDB's storage engine with one of those, such as Dynamo, I would get the friendly MongoDB API plus highly available storage. Is that possible?
Actually, it is pretty easy to use MongoDB as an in-memory key/value store:
Approach 1: Prevent disk usage
Disable journaling
Set syncPeriodSecs to 0
Store the key in the _id field and the value in a value field.
Now you have an all-in-memory key/value store.
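A minimal startup sketch for this (the dbpath is an assumption, and note that newer MongoDB versions may refuse to disable journaling for replica set members):
mongod --dbpath /data/kvstore --nojournal --syncdelay 0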
The problem with this approach is that you are limited by your machine's RAM size, and your data does not persist.
A far more elegant approach is to use
Approach 2: Using covered queries
A covered query is a query that is answered using only an index, which is kept in memory as long as enough RAM is available. If the index exceeds the RAM, the least recently used parts are "swapped" to disk. Furthermore, your key/value pairs are persisted. And all of that happens transparently.
Create an index over both the keys and the values:
db.keyValueCollection.createIndex({_id:1,value:1})
Use a projection to limit the fields returned in your query to those two fields, so MongoDB knows that only _id and value are needed and it doesn't have to bother reading anything else from disk (flexible schema!):
db.keyValueCollection.find({_id:"foo"},{_id:1,value:1})
Done! Your queries will now be answered from RAM, overflow will be automatically handled and your values will persist.
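To verify the query is actually covered, you can inspect its explain output; a covered plan has no FETCH stage and examines no documents:
db.keyValueCollection.find({_id:"foo"},{_id:1,value:1}).explain("executionStats")
// look for totalDocsExamined: 0 in executionStats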
Side note
Ensuring availability with MongoDB is close to foolproof: choose a replica set name, add it to all config files, fire up one of the members, initiate the replica set, and add the members. Problem solved.
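As a minimal sketch of those steps (the set name rs0 and the host names are made up for illustration):
// in every member's config file: replication.replSetName: rs0
// then, connected to the first member:
rs.initiate()
rs.add("host2.example.net:27017")
rs.add("host3.example.net:27017")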
Edit: removed an artificial key field, and made sure the query in approach 2 is covered by limiting the fields returned.
I have looked through the whole documentation but can't figure out whether this actually happens or not. If I remove an index from a collection in MongoDB, does it delete the index files right away? Is the space reclaimed?
No, MongoDB won't automatically release disk space after collection data or indexes are deleted. Allocating new files is relatively slow compared to other operations in a high-performance database, so MongoDB keeps all previously allocated files open and available by design.
If you need to reclaim disk space, use the repairDatabase command, which achieves compaction as a side effect of its checking/fixing functionality.
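For example (a sketch; be aware that repairDatabase blocks the database while it runs and needs free disk space roughly equal to the current data size plus some headroom):
use yourDatabase // the database name is a placeholder
db.repairDatabase()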
An alternative, available when using replica sets, is to add a new member and let it sync; the data will be inserted fairly compactly into the new member's database extent files. To compact all members you would do this in a rolling fashion, probably forcing the primary to step down at the end so it can be re-synced too.
We had heard MongoDB had one client with 42 TB per node, and I am wondering more about this. I know Cassandra has Bloom filters that skip hitting disk to find out which file a row might be in.
Does MongoDB have something similar to Bloom filters?
Is MongoDB using something similar to SSTables?
I did read that MongoDB does compaction just like Cassandra; I would think this would be an awfully long process with a 42 TB node?
I guess I don't know what terms to search for as I research MongoDB here (in Cassandra they are called SSTables).
thanks,
Dean
MongoDB does not support online compaction; in fact, data fragmentation is a real problem in systems with many document updates. To limit fragmentation, MongoDB calculates an automated padding factor per collection, minimizing the number of document moves.
The compact command blocks the entire database until it finishes. Besides, MongoDB does not support dictionary compression, so field names take up space in every object stored. I guess the layout used by MongoDB is not any fancy data structure; it's simply composed of a header (offset, length, ...), BSON data, and padding.
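For reference, compaction has to be triggered manually and per collection (the collection name here is a placeholder):
db.runCommand({ compact: "yourCollection" })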
Since MongoDB is not a key/value or columnar database, it doesn't use SSTables (an efficient data structure for a columnar layout). Every file created for the database is called an "extent".
AFAIK, MongoDB doesn't use bloom filters.
I have a very large MongoDB object, about 2 MB.
I have to update the readCount field frequently, and I need to be sure that the operation is very fast.
I know about "update-in-place" and I'm able to send this simple operation
db.pages.update( { name:"SamplePage" }, { $inc: { readCount : 1 } } );
But how does MongoDB process that operation internally?
Does it load the whole document from disk, modify the value, and store the entire document back? Or, if the document size does not change, can it update on disk only the part of the file that holds the readCount value?
MongoDB uses memory-mapped files for its data file management. What this actually means is that Mongo doesn't load documents from disk; instead, it tries to access the memory page where that document is located. If that page is not yet in RAM, the OS goes ahead and fetches it from disk.
Writing works exactly the same way. Mongo tries to write to a memory page; if it's in RAM, that's ultra-fast (just flipping some bits in memory). The page is marked dirty and the OS takes care of flushing it back to disk (persisting your changes).
If you have the journal enabled, your inserts/updates are somewhat more expensive, as MongoDB has to make an additional write to the append-only journal file.
In my app, MongoDB handles 10-50k updates per second per instance on modest hardware.
MongoDB computes a padding factor for each collection based on how often documents grow or move: the more often they grow, the larger the padding factor. Internally it uses an adaptive algorithm to try to minimize moves on an update. Basically, it operates in RAM.
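On MMAPv1-era servers you can inspect the computed value in the collection stats; using the pages collection from the question:
db.pages.stats().paddingFactor // 1 means no padding, 2 means 100% extra space per document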