Append-only file write with fsync, on emmc/SSD/sdcard, ext4 or f2fs? - operating-system

I am building two operating systems for IoT: Libertas OS and Hornet OS.
The data APIs are designed around append-only time series. fsync() is required after each append of a block of bytes to ensure data safety.
The storage could be eMMC, SSD, or SD card. The question is: which filesystem is the better fit for each storage type?
I understand f2fs is log-structured, so it is designed around append-style writes. But what about ext4? I couldn't easily find information about it.
Theoretically, at least for file content, an append should continue writing in the current underlying block to minimize wear. Since the file size changes after an append, the file metadata must be updated too, ideally through an append-only log.
I also don't know the details of the internal controllers of SD cards and eMMC. Will the controller honor such block-level appends?
Any insight will be greatly appreciated!
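For reference, the append-then-fsync pattern described above can be sketched roughly as follows. This is only a minimal illustration in Python, not the actual Libertas OS or Hornet OS API; the function name append_record and the file name in the usage note are made up for the example.

```python
import os

def append_record(path: str, payload: bytes) -> int:
    """Append payload to path, fsync it, and return the resulting file size.

    Hypothetical helper for illustration only.
    """
    # O_APPEND makes the kernel place every write at the current end of file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than requested, so loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the file's data and metadata to the device.
        os.fsync(fd)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)
```

Calling append_record("series.log", b"sample") twice leaves a 12-byte file. Whether the device then writes into the same flash block is decided by the filesystem and the controller, not by this code.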

Related

How does the OS decide data that goes in each page?

I have a comma-separated data file; let's assume each record is of fixed length.
How does the OS (Linux) determine which parts of the data are kept in one page on the hard disk?
Does it simply look at the file and organize the records one after the other (sequentially) in one page? Is it possible to set this programmatically, or does the OS take care of it automatically?
Your question is quite general - you didn't specify which OS or filesystem - so the answer will be too.
Generally speaking the OS does not examine the data being written to a file. It simply writes the data to enough disk sectors to contain the data. If the sector size is 4K, then bytes 0-4095 are written to the first sector, bytes 4096-8191 are written to the second sector, etc. The OS does this automatically.
Very few programs wish to manage their own disk sector allocation. One exception is high-performance database management systems, which often implement their own filesystem in order to have low-level control of the mapping from file data to sectors.
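To make the byte-to-sector arithmetic concrete, here is a small sketch. It assumes the 4K sector size used in the example above and a file whose byte 0 sits at the start of a sector; the function name sector_span is made up for illustration.

```python
SECTOR_SIZE = 4096  # assumed, matching the 4K example above

def sector_span(offset: int, length: int, sector_size: int = SECTOR_SIZE) -> range:
    """Return the indices of the sectors touched by `length` bytes at `offset`."""
    if length <= 0:
        return range(0)
    first = offset // sector_size
    last = (offset + length - 1) // sector_size
    return range(first, last + 1)
```

With fixed 100-byte records, record number 40 occupies bytes 4000-4099, so sector_span(4000, 100) covers sectors 0 and 1. Records simply straddle sector boundaries, because the OS lays out bytes, not records.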

Inode vs Vnode Difference

I had some doubts regarding an inode vs. a vnode. As far as my understanding goes, an inode is the representation of a file that is used by the Virtual File System, whereas vnodes are filesystem-specific. Is this correct?
Also, I am confused about whether an inode is a kernel data structure, i.e., whether it is an in-memory data structure or a data structure that exists in blocks on an actual disk.
To add something to this: The best explanation I could find for a vnode is here in the FreeBSD docs. For the more academically minded there is also the original paper that introduced the concept, [Vnodes: An Architecture for Multiple File System Types in Sun UNIX], which is a much more in-depth resource.
That said, vnodes were originally created at Sun for SunOS (and later adopted by the BSDs, including FreeBSD) because of the proliferation of different types of filesystems that needed to be supported, like UFS, NFS, etc. Vnodes are meant to provide an abstraction across all possible filesystems so that the OS can interface with them and so that kernel functions don't have to specifically support every filesystem under the sun; they only have to know how to interact with the file's vnode.
Back to your original question: vnodes, as @Allen Luce mentioned, are in-memory abstractions and are not filesystem-specific. They can be used interchangeably with UFS, ext4, and whatever else. In contrast, inodes are stored on disk and are specific to the exact filesystem being used. Inodes contain metadata about the file such as size, owner, and pointers to the block addresses, among other things. Vnodes contain some data about the file, but only attributes that do not change over the file's lifetime, so the inode is the place to look if you want the most information possible about a file. If you're more interested in inodes, I would suggest you check out Wikipedia, which has a good article on them.
Typically (like for Linux and BSD mainstream filesystems), an inode is first an on-disk structure that describes the storage of a file in terms of that disk (usually in blocks). A vnode is an in-memory structure that abstracts much of what an inode is (an inode can be one of its data fields) but also captures things like operations on files, locks, etc. This lets it support non-inode based filesystems, in particular networked filesystems.
It depends on your OS. On a Linux system, for instance, there is no v-node, only a generic i-node (struct inode) which, although conceptually similar to a v-node, is implemented differently.
For BSD-derived and UNIX kernels, the v-node points to an i-node structure specific to the filesystem, along with some additional information including pointers to functions that operate on the file and metadata not included in the inode. A major difference is that the inode is filesystem-specific while the vnode is not. (In Linux, as mentioned above, there is both a system-independent inode and a filesystem-dependent inode.)
An on-disk inode is not a kernel data structure; the vnode/generic inode, however, is, being the in-kernel representation of the inode.
The concept of vnode differs to a degree depending on what system you're under, everyone just kind of took the name and ran with it.
Under Linux and Unix, you can think of the abstraction as follows.
Suppose there exists a file:
f.tmp
You want f.tmp to stick around while your program is running, because you're accessing it, but if your program ends or crashes, you want to be sure it goes away.
You can do this by opening f.tmp and then calling unlink() on it. You still retain a reference to the file even though its inode now has 0 directory entries, which would otherwise let it be freed. The operating system keeps track of where the file starts and its allocation state until your program closes it or ends. This "virtualization" of an inode that no longer has a name is a vnode.
Another common situation for this is when you're reading a resource that disappears out from under you. Suppose you're watching a movie while it is streaming to a temporary location. When the movie is completely downloaded, it will be relocated to another volume for storage. Somehow you can continue watching and scrubbing through the movie so long as it remains open. In this case, even though there are again no links, the inode can't be cleaned up yet because there is still a vnode.
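The open-then-unlink trick described above can be tried out with a short sketch like the one below. It assumes POSIX semantics (it will not behave this way on Windows), and the function name open_then_unlink is made up for illustration.

```python
import os
import tempfile

def open_then_unlink(data: bytes) -> tuple:
    """Write data to a temp file, unlink it, then read it back via the open fd.

    Returns (still_listed, link_count, bytes_read). POSIX semantics assumed.
    """
    fd, path = tempfile.mkstemp(suffix=".tmp")
    try:
        os.write(fd, data)
        os.unlink(path)                     # remove the only directory entry
        still_listed = os.path.exists(path)
        link_count = os.fstat(fd).st_nlink  # typically 0 on local filesystems
        os.lseek(fd, 0, os.SEEK_SET)
        return still_listed, link_count, os.read(fd, len(data))
    finally:
        # Closing the last descriptor lets the kernel reclaim the storage.
        os.close(fd)
```

The name is gone from the directory right after unlink(), yet the data stays readable through the descriptor until it is closed.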
This depends on the operating system and the filesystem you are using or working on. In VxFS and AdvFS, for instance, inodes are nothing but on-disk data structures called vnodes. In general, both terms refer to file metadata.
Simply put, the in-memory vnode is just an inode cache that stores information about the file (the inode itself is typically stored on disk) so that it can be accessed more quickly.

Memory Mapped files and atomic writes of single blocks

If I read and write a single file using normal IO APIs, writes are guaranteed to be atomic on a per-block basis. That is, if my write only modifies a single block, the operating system guarantees that either the whole block is written, or nothing at all.
How do I achieve the same effect on a memory mapped file?
Memory mapped files are simply byte arrays, so if I modify the byte array, the operating system has no way of knowing when I consider a write "done", so it might (even if that is unlikely) swap out the memory just in the middle of my block-writing operation, and in effect I write half a block.
I'd need some sort of an "enter/leave critical section", or some method of "pinning" the page of a file into memory while I'm writing to it. Does something like that exist? If so, is it portable across common POSIX systems and Windows?
The technique of keeping a journal seems to be the only way. I don't know how this works with multiple apps writing to the same file. The Cassandra project has a good article on how to get performance with a journal. The key thing to make sure of is that the journal only records positive actions (my first approach was to write the pre-image of each write to the journal, allowing you to roll back, but it got overly complicated).
So basically your memory-mapped file has a transactionId in the header. If your header fits into one block you know it won't get corrupted, though many people seem to write it twice with a checksum: [header[cksum]] [header[cksum]]. If the first checksum fails, use the second.
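A rough sketch of the doubled header with checksums might look like this. The byte layout (an 8-byte transaction id followed by a CRC32) and the function names are assumptions made for the example, not something prescribed by any particular project.

```python
import struct
import zlib

HEADER_FMT = "<QI"  # assumed layout: 8-byte transaction id + 4-byte CRC32
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 12 bytes, no padding with "<"

def pack_header(txn_id: int) -> bytes:
    """Encode one [header[cksum]] copy."""
    body = struct.pack("<Q", txn_id)
    return body + struct.pack("<I", zlib.crc32(body))

def pack_double_header(txn_id: int) -> bytes:
    """Encode [header[cksum]] [header[cksum]]."""
    return pack_header(txn_id) * 2

def read_txn_id(buf: bytes):
    """Return the transaction id from the first copy whose checksum matches."""
    for i in range(2):
        chunk = bytes(buf[i * HEADER_SIZE:(i + 1) * HEADER_SIZE])
        if len(chunk) < HEADER_SIZE:
            continue
        txn_id, crc = struct.unpack(HEADER_FMT, chunk)
        if zlib.crc32(chunk[:8]) == crc:
            return txn_id
    return None  # both copies are damaged
```

If the first copy is torn by a crash, read_txn_id falls back to the second; None means neither copy can be trusted.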
The journal looks something like this:
[beginTxn[txnid]] [offset, length, data...] [commitTxn[txnid]]
You just keep appending journal records until the journal gets too big, then roll it over at some point. When you start up your program, you check whether the transaction id in the file matches the last transaction id in the journal; if not, you play back all the transactions in the journal to sync up.
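The record layout and the replay-on-startup step can be sketched as below. This is a simplified illustration under several assumptions: one write per transaction, made-up record tags, an in-memory bytearray standing in for the mapped file, and hypothetical function names encode_txn and replay.

```python
import struct

BEGIN, WRITE, COMMIT = 1, 2, 3  # made-up record tags for this sketch

def encode_txn(txn_id: int, offset: int, data: bytes) -> bytes:
    """Encode [beginTxn[txnid]] [offset, length, data...] [commitTxn[txnid]]."""
    return (struct.pack("<BQ", BEGIN, txn_id)
            + struct.pack("<BQI", WRITE, offset, len(data)) + data
            + struct.pack("<BQ", COMMIT, txn_id))

def replay(journal: bytes, target: bytearray, last_applied: int) -> int:
    """Apply committed transactions newer than last_applied; return the new id."""
    pos = 0
    while pos < len(journal):
        try:
            tag, txn_id = struct.unpack_from("<BQ", journal, pos)
            pos += 9
            wtag, offset, length = struct.unpack_from("<BQI", journal, pos)
            pos += 13
            data = journal[pos:pos + length]
            pos += length
            ctag, commit_id = struct.unpack_from("<BQ", journal, pos)
            pos += 9
        except struct.error:
            break  # torn tail: the last transaction never committed
        if (tag, wtag, ctag) != (BEGIN, WRITE, COMMIT):
            break
        if commit_id != txn_id or len(data) != length:
            break
        if txn_id > last_applied:
            end = offset + length
            if len(target) < end:
                target.extend(bytes(end - len(target)))  # grow with zeros
            target[offset:end] = data
            last_applied = txn_id
    return last_applied
```

A transaction whose commit record is missing or truncated is ignored, which is what makes recording only "positive actions" sufficient here.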
"If I read and write a single file using normal IO APIs, writes are guaranteed to be atomic on a per-block basis. That is, if my write only modifies a single block, the operating system guarantees that either the whole block is written, or nothing at all."
In the general case, the OS does not guarantee "writes of a block" done with "normal IO APIs" are atomic:
Blocks are more of a filesystem concept - a filesystem's block size may actually map to multiple disk sectors...
Assuming you meant sector, how do you know your write only mapped to a single sector? Nothing says the I/O is still aligned to a sector boundary once it has gone through the indirection of a filesystem.
There's nothing saying your disk HAS to implement sector atomicity. A "real disk" usually does, but it's not mandatory or a guaranteed property. Sadly, your program can't "check" for this property unless it's an NVMe disk and you have access to the raw device, or you're sending raw commands that have atomicity guarantees to a raw device.
Further, you're usually concerned with durability over multiple sectors (e.g., if power loss happens, was the data I sent before this sector definitely on stable storage?). If there's any buffering going on, your write may still only have been in RAM or the disk cache unless you used another command to check first, or opened the file/device with flags requesting cache bypass and said flags were actually honoured.

mmap writes to file on disk(synchronous/asynchronous)

I have a question regarding mmap functionality. When mmap is used in asynchronous mode, where the kernel takes care of persisting the data to the mapped file on disk, is it possible for earlier updates to overwrite later updates?
Let's say at time T we modify a location in memory that is memory-mapped to a file on disk, and at time T+1 we modify the same location in memory again. As the writes to the file are not synchronous, is it possible that the kernel first picks up the modification made at time T+1 and then the modification made at time T, resulting in an inconsistent memory-mapped file?
It's not exactly possible. The file is allowed to be inconsistent until msync(2) or munmap(2); when that happens, dirty (modified) pages are written to disk page by page (sometimes more at a time, depending on the filesystem in newer kernels). msync() allows you to request synchronous operation and invalidation of caches after the write has finished, which lets you ensure that the data in cache is the same as the data in the file. Without that, it's possible that your program will see newer data while the file contains older data; the exact specifics of this rather hairy situation depend on the CPU architecture and the specific OS implementation of those routines.
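As a small illustration, the two-writes scenario from the question can be reproduced with Python's mmap module, whose flush() method maps to msync() on POSIX systems. The function name write_twice_and_sync is made up, and the sketch assumes the file already exists and is at least 4 bytes long.

```python
import mmap

def write_twice_and_sync(path: str) -> bytes:
    """Store to the same mapped location twice, sync, and re-read the file."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            mm[0:4] = b"AAAA"  # modification at time T
            mm[0:4] = b"BBBB"  # modification at time T+1, same location
            mm.flush()         # synchronous write-back of the dirty page
    with open(path, "rb") as f:
        return f.read(4)
```

Both stores land in the same in-memory page, and write-back copies the page's current contents rather than replaying individual stores, so the file should end up holding the later value.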

Do any common OS file systems use hashes to avoid storing the same content data more than once?

Many file storage systems use hashes to avoid duplication of the same file content data (among other reasons), e.g., Git (SHA-1 by default) and Dropbox (SHA-256). The file names and dates can be different, but as long as the content produces the same hash, it never gets stored more than once.
It seems this would be a sensible thing to do in an OS file system in order to save space. Are there any file systems for Windows or *nix that do this, or is there a good reason why none of them do?
This would, for the most part, eliminate the need for duplicate file finder utilities, because at that point the only space you would be saving would be for the file entry in the file system, which for most users is not enough to matter.
Edit: Arguably this could go on serverfault, but I feel developers are more likely to understand the issues and trade-offs involved.
ZFS has supported deduplication since last month: http://blogs.oracle.com/bonwick/en_US/entry/zfs_dedup
Though I wouldn't call this a "common" filesystem (afaik, it is currently only supported by *BSD), it is definitely one worth looking at.
It would save space, but the time cost is prohibitive. The products you mention are already I/O-bound, so the computational cost of hashing is not a bottleneck for them. If you hashed at the filesystem level, all I/O operations, which are already slow, would get worse.
NTFS has single instance storage.
NetApp has supported deduplication (that's what it's called in the storage industry) in the WAFL filesystem (yeah, not your common filesystem) for a few years now. This is one of the most important features found in enterprise filesystems today (and NetApp stands out because they also support this on their primary storage, whereas other similar products support it only on their backup or secondary storage because they are too slow for primary storage).
The amount of data which is duplicate in a large enterprise with thousands of users is staggering. A lot of those users store the same documents, source code, etc. across their home directories. Reports of 50-70% data deduplicated have been seen often, saving lots of space and tons of money for large enterprises.
All of this means that if you create any common filesystem on a LUN exported by a NetApp filer, then you get deduplication for free, no matter what the filesystem created in that LUN. Cheers. Find out how it works here and here.
btrfs supports online de-duplication of data at the block level. An external tool is needed; I'd recommend duperemove.
It would require a fair amount of work to make this work in a file system. First of all, a user might be creating a copy of a file, planning to edit one copy, while the other remains intact -- so when you eliminate the duplication, the hard link you created that way would have to give COW semantics.
Second, the permissions on a file are often based on the directory into which that file's name is placed. You'd have to ensure that when you create your hidden hard link, the permissions are correctly applied based on the link, not just the location of the actual content.
Third, users are likely to be upset if they make (say) three copies of a file on physically separate media to ensure against data loss from hardware failure, then find out that there was really only one copy of the file, so when that hardware failed, all three copies disappeared.
This strikes me as a bit like a second-system effect: a solution to a problem long after the problem ceased to exist (or at least to matter). With hard drives currently running less than $100US/terabyte, I find it hard to believe that this would save most people a whole dollar's worth of hard drive space. At that point, it's hard to imagine most people caring much.
There are file systems that do deduplication, which is sort of like this, but still noticeably different. In particular, deduplication is typically done on a basis of relatively small blocks of a file, not on complete files. Under such a system, a "file" basically just becomes a collection of pointers to de-duplicated blocks. Along with the data, each block will typically have some metadata for the block itself, that's separate from the metadata for the file(s) that refer to that block (e.g., it'll typically include at least a reference count). Any block that has a reference count greater than 1 will be treated as copy on write. That is, any attempt at writing to that block will typically create a copy, write to the copy, then store the copy of the block to the pool (so if the result comes out the same as some other block, deduplication will coalesce it with the existing block with the same content).
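The scheme described above (files as lists of pointers to hashed blocks, per-block reference counts, copy on write) can be sketched as a toy in-memory store. Everything here is an assumption made for illustration: the class name DedupStore, the use of SHA-256 as the block key, and the tiny block size in the usage note. Real filesystems do this on disk with far more care.

```python
import hashlib

class DedupStore:
    """Toy block-level deduplicating store; illustration only."""

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.blocks = {}  # digest -> block bytes (the shared pool)
        self.refs = {}    # digest -> reference count
        self.files = {}   # file name -> list of digests

    def _put(self, block: bytes) -> str:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in self.blocks:
            self.blocks[digest] = block
            self.refs[digest] = 0
        self.refs[digest] += 1  # identical content just gains a reference
        return digest

    def _release(self, digest: str) -> None:
        self.refs[digest] -= 1
        if self.refs[digest] == 0:
            del self.refs[digest]
            del self.blocks[digest]

    def write_file(self, name: str, data: bytes) -> None:
        size = self.block_size
        new = [self._put(data[i:i + size]) for i in range(0, len(data), size)]
        for digest in self.files.get(name, []):
            self._release(digest)
        self.files[name] = new

    def write_block(self, name: str, index: int, block: bytes) -> None:
        # Copy on write: the shared block is never modified in place. The new
        # content goes to the pool (possibly coalescing with an existing
        # block) and only this file's pointer changes.
        old = self.files[name][index]
        self.files[name][index] = self._put(block)
        self._release(old)

    def read_file(self, name: str) -> bytes:
        return b"".join(self.blocks[d] for d in self.files[name])
```

With a block size of 4, writing b"AAAABBBB" as one file and b"AAAACCCC" as another stores three blocks rather than four, and overwriting the first block of the second file leaves the first file untouched.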
Many of the same considerations still apply though--most people don't have enough duplication to start with for deduplication to help a lot.
At the same time, especially on servers, deduplication at a block level can serve a real purpose. One really common case is dealing with multiple VM images, each running one of only a few choices of operating systems. If we look at the VM image as a whole, each is usually unique, so file-level deduplication would do no good. But they still frequently have a large chunk of data devoted to storing the operating system for that VM, and it's pretty common to have many VMs running only a few operating systems. With block-level deduplication, we can eliminate most of that redundancy. For a cloud server system like AWS or Azure, this can produce really serious savings.