Is MemoryMappedFile better option for creating database system? - .net-4.5

I am trying to create NoSQL kind of database for storing blobs. Each blob will be of fixed size for example 4KB to 64KB. Each blob will be rewritten entirely, so Let's say I have 1GB to 1TB of file, in FileStream I could do Seek and write etc. However I am little skeptical about Locking.
Is MemoryMappedFile for such large file with View of only 4KB to 64KB will perform better? Or should I use FileStream with locking.
FilStream provides Lock API, however MemoryMappedFile does not provide locking, so I will have to use some inter process locking.
My requirements are,
I want file to be open only by single process
Single process will have multiple threads accessing same MMF but probably different Views
I want to easily read/write chunks of 4B to 64KB but with full locking, will use mutex
I will be writing entire page of 4KB to 64KB of fixed size and I want to perform Flush only at the end of my processing. But I want this write operation to be extremely durable.
From documentation on MSDN, I see MemoryMappedFile as best candidate to create database system, however I have seen some open source nosql c# databases and I haven't seen any of them use this so it creates doubt that what are bottlenecks. Another reason could be MMF was introduced very late in .NET and those databases did not have option of MMF.

No, pretty unlikely to get mileage out of an MMF for a dbase engine. For starters, the operating system already provides a memory-mapped view of the file data. You get it for free from the file system cache.
Using an MMF for reading cannot improve on that. You want to optimize for the common access patterns, the ones that should be fast because the query accesses data sequentially. Very well supported by the file system cache, it reads ahead to slurp data off the same cylinder on the disk since you get that practically for free. An MMF is liable to die the Chinese torture death by a thousand pin pricks from the page faults that are triggered when you access the view. Not an issue when you access the data repeatedly, but that's one thing a dbase engine doesn't do.
An MMF is very nice for writing, you simply write to memory and the operating system lazily updates the file. But that's the one feature you'd never want to use in a dbase engine, you want to be sure that the file data is updated on disk when the transaction is committed. You can force a flush but that's crude, the entire view is flushed, not just the data that should be committed.

Related

MongoDB as file storage

i'm trying to find the best solution to create scalable storage for big files. File size can vary from 1-2 megabytes and up to 500-600 gigabytes.
I have found some information about Hadoop and it's HDFS, but it looks a little bit complicated, because i don't need any Map/Reduce jobs and many other features. Now i'm thinking to use MongoDB and it's GridFS as file storage solution.
And now the questions:
What will happen with gridfs when i try to write few files
concurrently. Will there be any lock for read/write operations? (I will use it only as file storage)
Will files from gridfs be cached in ram and how it will affect read-write perfomance?
Maybe there are some other solutions that can solve my problem more efficiently?
Thanks.
I can only answer for MongoDB here, I will not pretend I know much about HDFS and other such technologies.
The GridFs implementation is totally client side within the driver itself. This means there is no special loading or understanding of the context of file serving within MongoDB itself, effectively MongoDB itself does not even understand they are files ( http://docs.mongodb.org/manual/applications/gridfs/ ).
This means that querying for any part of the files or chunks collection will result in the same process as it would for any other query, whereby it loads the data it needs into your working set ( http://en.wikipedia.org/wiki/Working_set ) which represents a set of data (or all loaded data at that time) required by MongoDB within a given time frame to maintain optimal performance. It does this by paging it into RAM (well technically the OS does).
Another point to take into consideration is that this is driver implemented. This means that the specification can vary, however, I don't think it does. All drivers will allow you to query for a set of documents from the files collection which only houses the files meta data allowing you to later serve the file itself from the chunks collection with a single query.
However that is not the important thing, you want to serve the file itself, including its data; this means that you will be loading the files collection and its subsequent chunks collection into your working set.
With that in mind we have already hit the first snag:
Will files from gridfs be cached in ram and how it will affect read-write perfomance?
The read performance of small files could be awesome, directly from RAM; the writes would be just as good.
For larger files, not so. Most computers will not have 600 GB of RAM and it is likely, quite normal in fact, to house a 600 GB partition of a single file on a single mongod instance. This creates a problem since that file, in order to be served, needs to fit into your working set however it is impossibly bigger than your RAM; at this point you could have page thrashing ( http://en.wikipedia.org/wiki/Thrashing_%28computer_science%29 ) whereby the server is just page faulting 24/7 trying to load the file. The writes here are no better as well.
The only way around this is to starting putting a single file across many shards :\.
Note: one more thing to consider is that the default average size of a chunks "chunk" is 256KB, so that's a lot of documents for a 600GB file. This setting is manipulatable in most drivers.
What will happen with gridfs when i try to write few files concurrently. Will there be any lock for read/write operations? (I will use it only as file storage)
GridFS, being only a specification uses the same locks as on any other collection, both read and write locks on a database level (2.2+) or on a global level (pre-2.2). The two do interfere with each other as well, i.e. how can you ensure a consistent read of a document that is being written to?
That being said the possibility for contention exists based on your scenario specifics, traffic, number of concurrent writes/reads and many other things we have no idea about.
Maybe there are some other solutions that can solve my problem more efficiently?
I personally have found that S3 (as #mluggy said) in reduced redundancy format works best storing a mere portion of meta data about the file within MongoDB, much like using GridFS but without the chunks collection, let S3 handle all that distribution, backup and other stuff for you.
Hopefully I have been clear, hope it helps.
Edit: Unlike what I accidently said, MongoDB does not have a collection level lock, it is a database level lock.
Have you considered saving meta data onto MongoDB and writing actual files to Amazon S3? Both have excellent drivers and the latter is highly redundant, cloud/cdn-ready file storage. I would give it a shot.
I'll start by answering the first two:
There is a write lock when writing in to GridFS, yes. No lock for reads.
The files wont be cached in memory when you query them, but their metadata will.
GridFS may not be the best solution for your problem. Write locks can become something of a pain when you're dealing with this type of situation, particularly for huge files. There are other databases out there that may solve this problem for you. HDFS is a good choice, but as you say, it is very complicated. I would recommend considering a storage mechanism like Riak or Amazon's S3. They're more oriented around being storage for files, and don't end up with major drawbacks. S3 and Riak both have excellent admin facilities, and can handle huge files. Though with Riak, last I knew, you had to do some file chunking to store files over 100mb. Despite that, it generally is a best practice to do some level of chunking for huge file sizes. There are a lot of bad things that can happen when transferring files in to DBs- From network time outs, to buffer overflows, etc. Either way, your solution is going to require a fair amount of tuning for massive file sizes.

How does MongoDB journaling work

Here is my view, and I am not sure if it is right or wrong:
The journaling log is the "redo" log. It records the modification of the data files.
For example, I want to change the field value of one record from 'a' to 'b', then the mongodb will find how to modify the dbfile (include all the namespace, data, index and so on), then mongodb write the modifications to the journal.
After that, mongodb does all the real modifications to the dbfile. If something goes wrong here, when mongoDB restarts it will read the journal (if it exists). It will then change the alter the dbfile to make the data set consistent.
So, in the journal, the data to change is not recorded, but instead how to change the dbfile.
Am I right? where can I get more information about the journal's format?
EDIT: my original link to a 2011 presentation at MongoSF by Dwight is now dead, but there is a 2012 presentation by Ben Becker with similar content.
Just in case that stops working at some point too, I will give a quick summary of how the journal in the original MMAP storage engine worked, but it should be noted that with the advent of the pluggable storage engine model (MongoDB 3.0 and later), this now completely depends on the storage engine (and potentially the options) you are using - so please check.
Back to the original (MMAP) storage engine journal. At a very rudimentary level, the journal contains a series of queued operations and all operations are written into it as they happen - basically an append only sequential write to disk.
Once these operations have been applied and flushed to disk, then they are no longer needed in the journal and can be aged out. In this sense the journal basically acts like a circular buffer for write operations.
Internally, the operations in the journal are stored in "commit groups" - a logical group of write operations. Once an operation is in a complete commit group it can be considered to be synced to disk as part of the journal (and will satisfy the j:true write concern for example). After an unclean shutdown, mongod will attempt to apply all complete commit groups that have not previously been flushed to disk, incomplete commit groups will be discarded.
The operations in the journal are not what you will see in the oplog, rather they are a more simple set of files, offsets (disk locations essentially), and data to be written at the location. This allows for efficient replay of the data, and for a compact format for the journal, but will make the contents look like gibberish to most (as opposed to the aforementioned oplog which is basically readable as JSON documents). This basically answers one of the questions posed - it does not have any awareness of the database file's contents and the changes to be made to it, it is even more simple - it basically only knows to go to disk location X and write data Y, that's it.
The write-ahead, sequential nature of the journal means that it fits nicely on a spinning disk and the sequential access pattern will usually be at odds with the MMAP data access patterns (though not necessarily the access patterns of other engines). Hence it is sometimes a good idea to put the journal on its own disk or partition to reduce IO contention.

Memory Mapped files and atomic writes of single blocks

If I read and write a single file using normal IO APIs, writes are guaranteed to be atomic on a per-block basis. That is, if my write only modifies a single block, the operating system guarantees that either the whole block is written, or nothing at all.
How do I achieve the same effect on a memory mapped file?
Memory mapped files are simply byte arrays, so if I modify the byte array, the operating system has no way of knowing when I consider a write "done", so it might (even if that is unlikely) swap out the memory just in the middle of my block-writing operation, and in effect I write half a block.
I'd need some sort of a "enter/leave critical section", or some method of "pinning" the page of a file into memory while I'm writing to it. Does something like that exist? If so, is that portable across common POSIX systems & Windows?
The technique of keeping a journal seems to be the only way. I don't know how this works with multiple apps writing to the same file. The Cassandra project has a good article on how to get performance with a journal. The key thing is to make sure of, is that the journal only records positive actions (my first approach was to write the pre-image of each write to the journal allowing you to rollback, but it got overly complicated).
So basically your memory-mapped file has a transactionId in the header, if your header fits into one block you know it won't get corrupted, though many people seem to write it twice with a checksum: [header[cksum]] [header[cksum]]. If the first checksum fails, use the second.
The journal looks something like this:
[beginTxn[txnid]] [offset, length, data...] [commitTxn[txnid]]
You just keep appending journal records until it gets too big, then roll it over at some point. When you startup your program you check to see if the transaction id for the file is at the last transaction id of the journal -- if not you play back all the transactions in the journal to sync up.
If I read and write a single file using normal IO APIs, writes are guaranteed to be atomic on a per-block basis. That is, if my write only modifies a single block, the operating system guarantees that either the whole block is written, or nothing at all.
In the general case, the OS does not guarantee "writes of a block" done with "normal IO APIs" are atomic:
Blocks are more of a filesystem concept - a filesystem's block size may actually map to multiple disk sectors...
Assuming you meant sector, how do you know your write only mapped to a sector? There's nothing saying the I/O was well aligned to that of a sector when it's gone through the indirection of a filesystem
There's nothing saying your disk HAS to implement sector atomicity. A "real disk" usually does but it's not mandatory or a guaranteed property. Sadly your program can't "check" for this property unless its an NVMe disk and you have access to the raw device or you're sending raw commands that have atomicity guarantees to a raw device.
Further, you're usually concerned with durability over multiple sectors (e.g. if power loss happens was the data I sent before this sector definitely on stable storage?). If there's any buffering going on, your write may have still only been in RAM/disk cache unless you used another command to check first / opened the file/device with flags requesting cache bypass and said flags were actually honoured.

Why can't DMBSes rely on the OS buffer pool?

Stonebraker's paper (Operating System Support for Database Management) explains that, "the overhead to fetch a block from the buffer pool manager usually includes that of a system call and a core-to-core move." Forget about the buffer-replacement strategy, etc. The only point I question is the quoted.
My understanding is that when a DBMS wants to read a block x it issues a common read instruction. There should be no difference from that of any other application requesting a read.
I'm not looking for generic answers (I got them, and read papers). I seek a detailed answer of the described problem.
See Does a file read from a Java application invoke a system call?
Reading from your other question, and working forward:
When the DBMS must bring a page from disk it will involve at least one system call. At his point most DBMSs place the page into their own buffer. (They also end up in the OS' buffer, but that's unimportant).
So, we have one system call. However, we can avoid any further system calls. This is possible because the DBMS is caching pages in its own memory space. The first thing the DBMS will do when it decides it needs a page is check and see if it has it in its cache. If it does, it retrieves it from there without ever invoking a system call.
The DBMS is free to expire pages in its cache in whatever way is most beneficial for its IO needs. The OS's cache is expired in a more general way since the OS has other things to worry about. One example of this is that a DBMS will typically use a great deal of memory to cache pages as it knows that disk IO is one of the most expensive things it can do. The OS won't do this as it has to balance the cost of disk IO against having memory for other applications to use.
The operating system disk i/o must be generalised to work for a variety of situations. The DBMS can sometimes gain significant performance using less general code that is optimised to its own needs.
The DBMS does its own caching, so doesn't want to work through the O/S caching. It "owns" the patch of disk, so it doesn't need to worry about sharing with other processes.
Update
The link to the paper is a help.
Firstly, the paper is almost thirty years old and is referring to long-obsolete hardware. Notwithstanding that, it makes quite interesting reading.
Firstly, understand that disk i/o is a layered process. It was in 1981 and is even more so now. At the lowest point, a device driver will issue physical read/write instructions to the hardware. Above that may be the o/s kernel code then the o/s user space code then the application. Between a C program's fread() and the disk heads moving, there are at least three or four levels and might be considerably more. The DBMS may seek to improve performance might seek to bypass some layers and talk directly with the kernel, or even lower.
I recall some years ago installing Oracle on a Sun box. It had an option to dedicate a disk as a "raw" partition, where Oracle would format the disk in its own manner and then talk straight to the device driver. The O/S had no access to the disk at all.
It's mainly a performance issue. A dbms has highly specific and unusual I/O demands.
The OS may have any number of processes doing I/O and filling its buffers with the assorted cached data that this produces.
And of course there is the issue of size and what gets cached (a dbms may be able to peform better cache for its needs than the more generic device buffer caching).
And then there is the issue that a generic “block” may in fact amount to a considerably larger I/O burden (this depends on partitioning and such like) than what a dbms ideally would like to bear; its own cache may be tuned to work better with the layout of the data on the disk and thereby able to minimise I/O.
A further thing is the issue of indexes and similar means to speed up queries, which of course works rather better if the cache actually knows what these mean in the first place.
The real issue is that the file buffer cache is not in the filesystem used by the DBMS; it's in the kernel and shared by all of the filesystems resident in the system. Any memory read out of the kernel must be copied into user space: this is the core-to-core move you read about.
Beyond this, some other reasons you can't rely on the system buffer pool:
Often, DBMS's have a really good idea about its upcoming access patterns, and it can't communicate these patterns to the kernel. This can lead to lower performance.
The buffer cache is traditional stored in a fixed-size kernel memory range, so it cannot grow or shrink. That also means the cache is much smaller than main memory, so by using the buffer cache a DBMS would be unable to take advantage of system resources.
I know this is old, but it came up as unanswered.
Essentially:
The OS uses a separate address spaces for every process.
Retrieving information from any other address space requires a system call or page fault. **(see below)
The DBMS is a process with its own address space.
The OS buffer pool Stonebraker describes is in the kernel address space.
So ... to get data from the kernel address space to the DBMS's address space, a system call or page fault is unavoidable.
You're correct that accessing data from the OS buffer pool manager is no more expensive than a normal read() call. (In fact, it's done with a normal read call.) However, Stonebraker is not talking about that. He's specifically discussing the caching needs of DBMSes, after the data has been read from the disk and is present in RAM.
In essence, he's saying that the OS's buffer pool cache is too slow for the DBMS to use because it's stored in a different address space. He's suggesting using a local cache in the same process (and therefore same address space), which can give you a significant speedup for applications like DBMSes which hit the cache heavily, because it will eliminate that syscall overhead.
Here's the exact paragraph where he discusses using a local cache in the same process:
However, many DBMSs including INGRES
[20] and System R [4] choose to put a
DBMS managed buffer pool in user space
to reduce overhead. Hence, each of
these systems has gone to the
trouble of constructing its own
buffer pool manager to enhance
performance.
He also mentions multi-core issues in the excerpt you quote above. Similar effects apply here, because if you can have just one cache per core, you may be able to avoid the slowdowns from CPU cache flushes when multiple CPUs are reading and writing the same data.
** BTW, I believe Stonebraker's 1981 paper is actually pre-mmap. He mentions it as future work. "The trend toward providing the file system as a part of shared virtual memory (e.g., Pilot [16]) may provide a solution to this problem."

Main Memory DB vs Object DB

I'm currently trying to pick a database vendor.
I'm just seeking some personal opinions from fellow database developers out there.
My question is especially targeted towards people who:
1) have used Main Memory DB (MMDB) that supports replicating to disk (hybrid) before (i.e. ExtremeDB)
or
2) have used Versant Object Database and/or Objectivity Database and/or Progress ObjectStore
and the question is really: if you could recommend a database vendor, based on your experience, that would suit my application.
My application is a commercial real-time (read: high-performance) object-oriented C++ GIS kind of app, where we need to do a lot of lat/lon search (i.e. given an area, find all matching targets within the area...R-Tree index).
The types of data that I would like to store into the database are all modeled as objects and they make use of std::list and std::vector, so naturally, Object Database seems to make sense. I have read through enough articles to convince myself that a traditional RDBMS probably isnt what I'm really looking for in terms of
performance (joins or multiple
tables for dynamic-length data like
list/vector)
ease of programming
(impedance mismatch)
However, in terms of performance,
Input data is being fed into the system at about 40 MB/s.
Hence, the system will also be doing insert into the database at the rate of roughly 350 inserts per second (where each object varies from 64KB to 128KB),
Database will consistently be searched and updated via multiple threads.
From my understanding, all of the Object DBs I have listed here use cache for storing database objects. ExtremeDB claims that since it's designed especially for memory, it can avoid overhead of caching logic, etc. See more by googling: Main Memory vs. RAM-Disk Databases: A Linux-based Benchmark
So..I'm just a bit confused. Can Object DBs be used in real-time system? Is it as "fast" as MMDB?
Fundamentally, I difference between a MMDB and a OODB is that the MMDB has the expectation that all of its data is based in RAM, but persisted to disk at some point. Whereas an OODB is more conventional in that there's no expectation of the entire DB fitting in to RAM.
The MMDB can leverage this by giving up on the concept that the persisted data doesn't necessarily have to "match" the in RAM data.
The way anything with persistence is going to work, is that it has to write the data to disk on update in some fashion.
Almost all DBs use some kind of log for this. These logs are basically "raw" pages of data, or perhaps individual transactions, appended to a file. When the file gets "too big", a new file is started.
Once the logs are properly consolidated in to the main store, the logs are discarded (or reused).
Now, a crude, in RAM DB can exist simply by appending transactions to a log file, and when it's restarted, it just loads the log in to RAM. So, in essence, the log file IS the database.
The downside of this technique is the longer and more transactions you have, the bigger your log/DB is, and thus the longer the DB startup time. But, ideally, you can also "snapshot" the current state, which eliminates all of the logs up to date, and effectively compresses them.
In this manner, all the routine operations of the DB have to manage is appending pages to logs, rather than updating other disk pages, index pages, etc. Since, ideally, most systems don't need to "Start up" that often, perhaps start up time is less of an issue.
So, in this way, a MMDB can be faster than an OODB who has a different contract with the disk, maintaining logs and disk pages. In this way, an OODB can be slower even if the entire DB fits in to RAM and is properly cached, simply because you incur disk operations outside of the log operations during normal operations, vs a MMDB where these operations happen as a "maintenance" task, which can be scheduled during down time and/or quiet time.
As to whether either of these systems can meet you actual performance needs, I can't say.
The back ends of databases (reader and writer processes, caching, lock managing, txn log files, ACID semantics) are the same, so RDBs and OODB are actually very similar here. The difference is the interface to the application programmer. Is your data model complicated, consists of lots of classes with real inheritance relationships? Then OO is good. Is it relatively flat and simple? Then go RDB. What is the nature of the relationships? Is it pointer-like and set like? Then go RDB. Is is more complicated, like (ordered) list, array, map? Then you should go OO. Also, do you have a stand-alone application with no need to integrate with other apps? Then OO is ok. Do you have to share data with other apps (i.e. several apps access the same database)? Then that's a deal-breaker for OO, and you should stick with RDB. Is the schema of your database stable or do you expect it to evolve frequently? OODBs are bad ad schema evolution, so if you expect frequent changes, stick with RDBs.