Here is my view, and I am not sure if it is right or wrong:
The journaling log is the "redo" log. It records the modifications to the data files.
For example, if I want to change a field value of one record from 'a' to 'b', MongoDB will work out how to modify the dbfile (including the namespace, data, index and so on), and then write those modifications to the journal.
After that, MongoDB applies the real modifications to the dbfile. If something goes wrong at that point, when MongoDB restarts it will read the journal (if it exists) and re-apply the recorded changes to the dbfile to make the data set consistent.
So the journal does not record the data being changed, but rather how to change the dbfile.
Am I right? Where can I get more information about the journal's format?
EDIT: my original link to a 2011 presentation at MongoSF by Dwight is now dead, but there is a 2012 presentation by Ben Becker with similar content.
Just in case that stops working at some point too, I will give a quick summary of how the journal in the original MMAP storage engine worked. Note that with the advent of the pluggable storage engine model (MongoDB 3.0 and later), this now depends completely on the storage engine (and potentially the options) you are using - so please check.
Back to the original (MMAP) storage engine journal. At a very rudimentary level, the journal contains a series of queued operations, and all operations are written into it as they happen - basically an append-only sequential write to disk.
Once these operations have been applied and flushed to disk, they are no longer needed in the journal and can be aged out. In this sense the journal acts like a circular buffer for write operations.
Internally, the operations in the journal are stored in "commit groups" - logical groups of write operations. Once an operation is part of a complete commit group, it can be considered synced to disk as part of the journal (and will satisfy the j:true write concern, for example). After an unclean shutdown, mongod will attempt to apply all complete commit groups that have not previously been flushed to disk; incomplete commit groups will be discarded.
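For example, with the Python driver a write that waits for that journal commit looks roughly like this (connection details, collection and field names are placeholders, not anything from the original answer):

    from pymongo import MongoClient, WriteConcern

    # Sketch only: with j=True the write is acknowledged once it is part of
    # a commit group flushed to the journal.
    coll = MongoClient()["test"].get_collection(
        "orders", write_concern=WriteConcern(j=True))
    coll.insert_one({"status": "paid"})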
The operations in the journal are not what you will see in the oplog; rather, they are a simpler set of files, offsets (disk locations, essentially), and the data to be written at each location. This allows for efficient replay and a compact journal format, but makes the contents look like gibberish to most people (as opposed to the aforementioned oplog, which is basically readable as JSON documents). This answers one of the questions posed: the journal has no awareness of the database file's contents or the higher-level changes being made to it. It is even simpler than that - it basically only knows to go to disk location X and write data Y, and that's it.
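To make that "location X, data Y" point concrete, a replay step boils down to something like this hedged sketch (this is not MongoDB's actual on-disk format, and the data file naming is just an illustration):

    # Each entry names a data file, an offset and some raw bytes, and replay
    # is a blind seek-and-write into the existing data files.
    def apply_entry(datafile_dir, file_no, offset, payload):
        with open(f"{datafile_dir}/mydb.{file_no}", "r+b") as f:   # e.g. mydb.0, mydb.1, ...
            f.seek(offset)
            f.write(payload)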
The write-ahead, sequential nature of the journal means that it fits nicely on a spinning disk and the sequential access pattern will usually be at odds with the MMAP data access patterns (though not necessarily the access patterns of other engines). Hence it is sometimes a good idea to put the journal on its own disk or partition to reduce IO contention.
Related
Looking for best practices to get DELTAs of data over time.
No timestamps are available, and I cannot program timestamps!
GOAL: To detect differences in all files, across all fields, over time. I only need the primary key as output, and I need this for 15-minute intervals of data changes.
Example:
The customer file has 50 columns/fields; if any field changes, I want another file to record the primary key - or anything that records the occurrence of a change in the customer file.
Issue:
I am not sure if triggers are the way to go since there is a lot of overhead associated with triggers.
Can anyone suggest best practices for DB2 deltas over time with consideration to overhead and performance?
I'm not sure why you think there is a lot of overhead associated with triggers - they are very fast in my experience. But as David suggested, you can journal the files you want to track and then analyze the journal receivers.
To turn on Journaling you need to perform three steps:
Create a receiver using CRTJRNRCV
Create a journal for the receiver using CRTJRN
Start journaling on the files using STRJRNPF. You will need to keep *BEFORE and *AFTER images to detect a change on update, but you can omit *OPNCLS records to save some space.
Once you do this, you can also use commitment control to manage transactions! But, you will now have to manage those receivers as they use a lot of space. You can do that by using MNGRCV(*SYSTEM) on the CRTJRN command. I suspect that you will want to prevent the system from deleting the old receivers automatically as that could cause you to miss some changes when the system changes receivers. But that means you will have to delete old receivers on your own when you are done with them. I suggest waiting a day or two to delete old receivers. That can be an overnight process.
To read the journal receiver, you will need to use RTVJRNE (Retrieve Journal Entries), which lets you retrieve journal entries into variables, or DSPJRN (Display Journal), which lets you send journal entries to the display, a printer file, or an *OUTFILE. The *OUTFILE can then be read using ODBC, SQL, or however else you want to process it. You can filter the journal entries you receive by file and by entry type.
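For example, once DSPJRN has dumped entries to an *OUTFILE, a hedged sketch of reading it over ODBC from Python might look like this (the DSN, library and file names are placeholders, and the column/entry-type names follow the standard QADSPJRN model outfile - verify them against your release):

    import pyodbc

    conn = pyodbc.connect("DSN=MYIBMI")
    rows = conn.execute(
        "SELECT JOSEQN, JOENTT, JOESD "                        # sequence, entry type, entry data
        "FROM MYLIB.JRNOUT "
        "WHERE JOCODE = 'R' AND JOENTT IN ('UP', 'PT', 'DL')"  # record-level update/add/delete
    ).fetchall()
    for seqn, entry_type, entry_data in rows:
        print(seqn, entry_type)   # pull the primary key out of entry_data here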
Have you looked at journalling the files and evaluating the journal receivers?
I am trying to create a NoSQL-style database for storing blobs. Each blob will be of fixed size, for example 4KB to 64KB, and each blob will be rewritten entirely. Let's say I have a 1GB to 1TB file; with FileStream I could Seek and write, etc. However, I am a little skeptical about locking.
Will a MemoryMappedFile over such a large file, with views of only 4KB to 64KB, perform better? Or should I use FileStream with locking?
FileStream provides a Lock API, but MemoryMappedFile does not provide locking, so I would have to use some inter-process locking.
My requirements are,
I want the file to be opened by only a single process
A single process will have multiple threads accessing the same MMF, but probably through different views
I want to easily read/write chunks of 4KB to 64KB, with full locking; I will use a mutex
I will be writing entire fixed-size pages of 4KB to 64KB and want to perform a Flush only at the end of my processing, but I want that write operation to be extremely durable.
From the documentation on MSDN, MemoryMappedFile looks like the best candidate for building a database system. However, I have looked at some open source NoSQL C# databases and none of them seem to use it, which makes me wonder where the bottlenecks are. Another explanation could be that MMF was introduced relatively late in .NET, so those databases never had the option of using it.
No, you are pretty unlikely to get much mileage out of an MMF for a database engine. For starters, the operating system already provides a memory-mapped view of the file data: you get it for free from the file system cache.
Using an MMF for reading cannot improve on that. You want to optimize for the common access patterns, the ones that should be fast because the query accesses data sequentially. That is very well supported by the file system cache, which reads ahead to slurp data off the same cylinder on the disk, practically for free. An MMF, by contrast, is liable to die a death by a thousand pinpricks from the page faults triggered when you access the view. That is not an issue when you access the same data repeatedly, but that is one thing a database engine rarely does.
An MMF is very nice for writing: you simply write to memory and the operating system lazily updates the file. But that is the one feature you would never want to use in a database engine, because you want to be sure the file data is updated on disk when the transaction is committed. You can force a flush, but that is crude - the entire view is flushed, not just the data that should be committed.
I'm trying to find the best solution for creating scalable storage for big files. File sizes can vary from 1-2 megabytes up to 500-600 gigabytes.
I have found some information about Hadoop and its HDFS, but it looks a little complicated, because I don't need any Map/Reduce jobs or many of the other features. Now I'm thinking of using MongoDB and its GridFS as the file storage solution.
And now the questions:
What will happen with GridFS when I try to write a few files concurrently? Will there be any lock for read/write operations? (I will use it only as file storage.)
Will files from GridFS be cached in RAM, and how will that affect read/write performance?
Maybe there are some other solutions that can solve my problem more efficiently?
Thanks.
I can only answer for MongoDB here; I will not pretend I know much about HDFS and other such technologies.
The GridFS implementation is totally client-side, within the driver itself. This means there is no special loading or understanding of the context of file serving within MongoDB itself; effectively, MongoDB does not even understand that they are files ( http://docs.mongodb.org/manual/applications/gridfs/ ).
This means that querying for any part of the files or chunks collection will result in the same process as it would for any other query, whereby it loads the data it needs into your working set ( http://en.wikipedia.org/wiki/Working_set ) which represents a set of data (or all loaded data at that time) required by MongoDB within a given time frame to maintain optimal performance. It does this by paging it into RAM (well technically the OS does).
Another point to take into consideration is that GridFS is driver-implemented. This means the implementation could vary between drivers; however, I don't think it does. All drivers will allow you to query for a set of documents from the files collection, which only houses the file metadata, allowing you to later serve the file itself from the chunks collection with a single query.
However, that is not the important thing: you want to serve the file itself, including its data, which means you will be loading the files collection and its corresponding chunks collection into your working set.
With that in mind we have already hit the first snag:
Will files from GridFS be cached in RAM, and how will that affect read/write performance?
The read performance of small files could be awesome, directly from RAM; the writes would be just as good.
For larger files, not so much. Most computers will not have 600 GB of RAM, and it is likely - quite normal, in fact - to house a 600 GB partition of a single file on a single mongod instance. This creates a problem: that file, in order to be served, needs to fit into your working set, yet it is far bigger than your RAM. At this point you could have page thrashing ( http://en.wikipedia.org/wiki/Thrashing_%28computer_science%29 ), whereby the server is just page faulting 24/7 trying to load the file. The writes here are no better.
The only way around this is to start spreading a single file across many shards :\.
Note: one more thing to consider is that the default size of a chunks-collection "chunk" is 256KB, so that's a lot of documents for a 600GB file. This setting can be changed in most drivers.
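As an illustration with the Python driver (database, file name and chunk size are just placeholders), storing and reading a file through GridFS with a custom chunk size looks roughly like this:

    import gridfs
    from pymongo import MongoClient

    db = MongoClient()["storage"]
    fs = gridfs.GridFS(db)

    # Store a blob; chunk_size (bytes) overrides the driver default mentioned above.
    with open("big.bin", "rb") as f:
        file_id = fs.put(f, filename="big.bin", chunk_size=1024 * 1024)

    # One query against the files collection, then the chunks are streamed back.
    data = fs.get(file_id).read()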
What will happen with GridFS when I try to write a few files concurrently? Will there be any lock for read/write operations? (I will use it only as file storage.)
GridFS, being only a specification, uses the same locks as any other collection: both read and write locks at the database level (2.2+) or at the global level (pre-2.2). The two do interfere with each other as well - after all, how can you ensure a consistent read of a document that is being written to?
That being said, the possibility of contention depends on your scenario specifics: traffic, the number of concurrent writes/reads, and many other things we have no idea about.
Maybe there are some other solutions that can solve my problem more efficiently?
Personally, I have found that S3 (as @mluggy said) with reduced redundancy storage works best: store only a small amount of metadata about the file within MongoDB, much like using GridFS but without the chunks collection, and let S3 handle all the distribution, backup and other heavy lifting for you.
Hopefully I have been clear, hope it helps.
Edit: Contrary to what I accidentally said earlier, MongoDB does not have a collection-level lock; it is a database-level lock.
Have you considered saving the metadata in MongoDB and writing the actual files to Amazon S3? Both have excellent drivers, and the latter is highly redundant, cloud/CDN-ready file storage. I would give it a shot.
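A hedged sketch of that split in Python (bucket, database and collection names are made up; the bytes go to S3 with reduced redundancy, only metadata goes to MongoDB):

    import os
    import boto3
    from pymongo import MongoClient

    s3 = boto3.client("s3")
    db = MongoClient()["storage"]

    def save_file(path, key):
        # Upload the file itself to S3 using the cheaper reduced-redundancy class.
        s3.upload_file(path, "my-file-bucket", key,
                       ExtraArgs={"StorageClass": "REDUCED_REDUNDANCY"})
        # Keep only a small metadata document in MongoDB.
        db.files.insert_one({"_id": key,
                             "bucket": "my-file-bucket",
                             "size": os.path.getsize(path)})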
I'll start by answering the first two:
There is a write lock when writing into GridFS, yes. No lock for reads.
The files won't be cached in memory when you query them, but their metadata will be.
GridFS may not be the best solution for your problem. Write locks can become something of a pain when you're dealing with this type of situation, particularly for huge files. There are other databases out there that may solve this problem for you. HDFS is a good choice, but as you say, it is very complicated. I would recommend considering a storage mechanism like Riak or Amazon's S3. They're more oriented around being storage for files, and don't end up with major drawbacks. S3 and Riak both have excellent admin facilities and can handle huge files, though with Riak, last I knew, you had to do some file chunking to store files over 100MB. Despite that, it is generally a best practice to do some level of chunking for huge file sizes. There are a lot of bad things that can happen when transferring files into DBs - from network timeouts to buffer overflows, etc. Either way, your solution is going to require a fair amount of tuning for massive file sizes.
First off, I know about this question:
Auto compact the deleted space in mongodb?
My question is not about shrinking DB file sizes, though, but more about the reuse of deleted space. Say I have 100K documents in a collection and I then delete 50K of those. Will Mongo reuse the space within its data files that the deleted documents have freed, or are they simply "marked" as deleted?
I don't care so much about the actual size of the file on disk; it's more about "does it just grow and grow".
Update (Mar 2015): As of the 3.0 release, there are multiple storage engines available in MongoDB. This answer applies to the MMAP storage engine (still the default in MongoDB 3.0), the answer for other engines (WiredTiger for example) is quite different and may well be tunable and adjustable. Hence if you are using another engine, please read the relevant docs for that storage engine to determine what your space re-use defaults and options are.
With the MMAP storage engine, when documents are deleted the space left behind is put onto a free list. However, for the space to be used, similarly sized documents must be inserted later, and MongoDB must find an appropriate spot on that list within a certain time frame (once it times out looking at the list, it will just append); otherwise the space re-use will not happen very often. All of this happens internally within the existing data files, so there is no disk space reclamation here.
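You can watch this behaviour from the driver: with MMAP, collection stats report both the live data size and the space allocated on disk, and the latter does not drop after deletes. A small pymongo sketch (database, collection and filter are assumptions):

    from pymongo import MongoClient

    db = MongoClient()["test"]

    def sizes(name):
        # "size" = live data, "storageSize" = space allocated in the data files.
        s = db.command("collStats", name)
        return s["size"], s["storageSize"]

    print(sizes("mycoll"))
    db.mycoll.delete_many({"archived": True})
    print(sizes("mycoll"))   # size shrinks, storageSize stays the same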
If you subsequently do a repair, or resync a secondary from scratch, the data files are rewritten and the space on disk will be reclaimed (any padding on docs is also removed). This is where you will see actual space reclamation on-disk. For any other actions (compact included) the on disk usage will not change and may even increase.
With 2.2+ you can now use the collMod command and the usePowersOf2Sizes option to make the re-use of deleted space more likely (note that this is the default in 2.6+). This means that the initial space allocation for a document is a bit less efficient (512 bytes for a 400 byte doc for example) but means that when a new doc is inserted it is more likely to be able to re-use that space. If you are deleting (or growing and hence moving) documents a lot, then this will be more efficient in the long term.
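For example, from the Python driver the collMod command described above can be issued like this (the collection name is a placeholder; 2.2+ only, and the default allocation strategy from 2.6 onward):

    from pymongo import MongoClient

    db = MongoClient()["test"]
    # Enable powers-of-two record allocation on an existing collection.
    db.command("collMod", "mycoll", usePowersOf2Sizes=True)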
For anyone that is interested, one of the people that wrote a lot of the storage code (Mathias Stearn) has a great presentation about the storage internals, which can be found here
If I read and write a single file using normal IO APIs, writes are guaranteed to be atomic on a per-block basis. That is, if my write only modifies a single block, the operating system guarantees that either the whole block is written, or nothing at all.
How do I achieve the same effect on a memory mapped file?
Memory mapped files are simply byte arrays, so if I modify the byte array, the operating system has no way of knowing when I consider a write "done", so it might (even if that is unlikely) swap out the memory just in the middle of my block-writing operation, and in effect I write half a block.
I'd need some sort of "enter/leave critical section" mechanism, or some way of "pinning" the page of a file into memory while I'm writing to it. Does something like that exist? If so, is it portable across common POSIX systems and Windows?
The technique of keeping a journal seems to be the only way. I don't know how this works with multiple apps writing to the same file. The Cassandra project has a good article on how to get performance with a journal. The key thing to make sure of is that the journal only records positive actions (my first approach was to write the pre-image of each write to the journal, allowing you to roll back, but it got overly complicated).
So basically your memory-mapped file has a transactionId in the header. If your header fits into one block, you know it won't get corrupted, though many people seem to write it twice with a checksum: [header[cksum]] [header[cksum]]. If the first checksum fails, use the second.
The journal looks something like this:
[beginTxn[txnid]] [offset, length, data...] [commitTxn[txnid]]
You just keep appending journal records until the journal gets too big, then roll it over at some point. When you start up your program, you check whether the transaction id in the file header matches the last transaction id in the journal - if not, you play back all the newer transactions in the journal to sync up.
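Here is a minimal Python sketch of that scheme. The exact framing (little-endian fields, a record count in the begin record) is my own assumption rather than a fixed format; it only shows the append-then-replay idea:

    import os
    import struct

    def append_txn(journal, txn_id, writes):
        """writes: list of (offset, data) pairs destined for the data file."""
        with open(journal, "ab") as j:
            j.write(struct.pack("<QI", txn_id, len(writes)))           # beginTxn: id + count
            for offset, data in writes:
                j.write(struct.pack("<QI", offset, len(data)) + data)  # offset, length, data
            j.write(struct.pack("<Q", txn_id))                         # commitTxn: id repeated
            j.flush()
            os.fsync(j.fileno())                                       # the journal itself must be durable

    def replay(journal, datafile, last_applied):
        """Re-apply every complete transaction newer than the header's txn id."""
        with open(journal, "rb") as j, open(datafile, "r+b") as d:
            while True:
                head = j.read(12)
                if len(head) < 12:
                    break                                              # clean end of journal
                txn_id, count = struct.unpack("<QI", head)
                try:
                    writes = []
                    for _ in range(count):
                        offset, length = struct.unpack("<QI", j.read(12))
                        writes.append((offset, j.read(length)))
                    (commit_id,) = struct.unpack("<Q", j.read(8))
                except struct.error:
                    break                                              # torn tail: discard
                if commit_id != txn_id:
                    break                                              # no commit record: discard
                if txn_id > last_applied:
                    for offset, data in writes:
                        d.seek(offset)
                        d.write(data)
            d.flush()
            os.fsync(d.fileno())                                       # then update the header txn id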
If I read and write a single file using normal IO APIs, writes are guaranteed to be atomic on a per-block basis. That is, if my write only modifies a single block, the operating system guarantees that either the whole block is written, or nothing at all.
In the general case, the OS does not guarantee "writes of a block" done with "normal IO APIs" are atomic:
Blocks are more of a filesystem concept - a filesystem's block size may actually map to multiple disk sectors...
Assuming you meant sector: how do you know your write only mapped to a single sector? There's nothing saying the I/O was well aligned to a sector boundary once it has gone through the indirection of a filesystem.
There's nothing saying your disk HAS to implement sector atomicity. A "real disk" usually does, but it's not a mandatory or guaranteed property. Sadly, your program can't "check" for this property unless it's an NVMe disk and you have access to the raw device, or you're sending raw commands that have atomicity guarantees to a raw device.
Further, you're usually concerned with durability over multiple sectors (e.g. if power is lost, was the data I sent before this sector definitely on stable storage?). If there's any buffering going on, your write may still have been only in RAM/the disk cache, unless you used another command to check first, or opened the file/device with flags requesting cache bypass and those flags were actually honoured.
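On a POSIX system, the two usual ways of asking for that durability look roughly like the sketch below (file name and sizes are placeholders; whether the device actually honours the request is a separate question, as noted above):

    import os

    # 1. Open with O_DSYNC so each write returns only once the data is meant
    #    to be on stable storage.
    fd = os.open("data.bin", os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
    os.write(fd, b"\x00" * 4096)
    os.close(fd)

    # 2. Or write normally and flush explicitly before relying on the data.
    with open("data.bin", "r+b") as f:
        f.write(b"\x00" * 4096)
        f.flush()                  # push Python's buffer to the OS
        os.fsync(f.fileno())       # ask the OS to push it to the device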