Why does the sequential write to a journal file speed up if it is in a different file system? - mongodb

As per the MongoDB documentation at http://docs.mongodb.org/manual/core/journaling:
To speed the frequent sequential writes that occur to the current
journal file, you can ensure that the journal directory is on a
different filesystem.
So storing the journal file on a different file system speeds things up. Is it because two different hard disk spindles are at work? I just want to understand the mechanics of this optimization tip.

Yes. If you are using physical rotating hard drives, there is a significant performance benefit from separating the journal activity onto a separate (preferably dedicated) physical drive.
The benefits are not the same if you're using SAN hardware, and they are lessened to an extent by the larger drive caches available in modern hard drives. It's a different story again with SSDs.
The main factor with spinning disks is seek time - the time that it takes for the read/write head to get to the right part of the disk. Hard disks are arranged with circular tracks. To get to a specific block on the disk, the head moves to the right track, and the disk spins around to the right place (the disks keep spinning of course, so it's simply a matter of waiting for the right place to come around).
This doesn't take much time, but when it's happening a lot it adds up.
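To get a feel for how it adds up, here is a back-of-the-envelope sketch (the 8 ms seek time and per-write transfer time below are assumed typical figures for a consumer drive, not measurements):
using System;

// Rough cost model: journal writes on a dedicated drive pay no seek,
// journal writes sharing the data drive pay a seek there and back.
const double seekMs = 8.0;      // assumed average seek + rotational latency
const double writeMs = 0.05;    // assumed transfer time for one small write

double dedicated = 1000 * writeMs;                // 1000 writes: ~50 ms
double shared = 1000 * (writeMs + 2 * seekMs);    // 1000 writes: ~16,050 ms

Console.WriteLine($"dedicated: {dedicated} ms, shared: {shared} ms");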
When you have the primary activity and the journal activity on the same drive, the head has to rapidly move between the two (many, really) locations that the system needs to look at.
If you have your journalling on another physical drive, then the head on that drive can remain almost (or, more accurately, relatively) static, able to reach the correct track and location more rapidly. Meanwhile the other drive (with the primary activity on it) will also be more efficient, because its head will not be constantly seeking back to where the journal entries are being written in between the other activities required to keep the database running.
This benefit applies to most database systems and many other applications where there is a constant sequential writing to disk going on at the same time as other mixed disk activity.
You don't get the same profile if you're using SAN, because even if they appear to be separate file systems, the storage is likely striped across many drives that are both cached and shared.
SSD has a different profile also, because there is no physical seek time.

Related

Does a user process have any control over paging?

A program might have some data that, when needed, it wants to access very fast. Let's call this VIP data. It would like to reduce the likelihood that the page the VIP data resides on gets swapped to disk when memory utilization is high on the system. What types of control/influence does it have over this?
For example, I think it can consider the page replacement policy and try to influence the OS to not swap this VIP data to disk. If the policy is LRU, the program can periodically read the VIP data to ensure that the page has always been accessed fairly recently. A program can also use a very small amount of memory in total, making it likely that all its pages are recently accessed when it runs and that the VIP data is therefore not likely to be swapped to disk.
Can it exert any more explicit control over paging?
In order to do this, you might consider:
Prioritising the process using the renice command, or
Locking the process in main memory using mlock(2)
This is entirely operating system dependent. On some systems, if you have appropriate privileges you can lock pages in physical memory.
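For example, on Linux a sufficiently privileged process (or one within its RLIMIT_MEMLOCK limit) can pin pages with mlock(2). A minimal sketch from C# via P/Invoke, assuming a Linux host and an illustrative one-page buffer (on Windows the counterpart is VirtualLock):
using System;
using System.Runtime.InteropServices;

class VipData
{
    // mlock(2)/munlock(2) from the C library; Linux-only.
    [DllImport("libc", SetLastError = true)]
    static extern int mlock(IntPtr addr, UIntPtr len);

    [DllImport("libc", SetLastError = true)]
    static extern int munlock(IntPtr addr, UIntPtr len);

    static void Main()
    {
        int size = 4096;                          // one page of "VIP" data
        IntPtr vip = Marshal.AllocHGlobal(size);  // unmanaged, so the GC won't move it

        // Pin the page in physical RAM; fails without privilege
        // or RLIMIT_MEMLOCK headroom.
        if (mlock(vip, (UIntPtr)(uint)size) != 0)
            Console.Error.WriteLine("mlock failed, errno " + Marshal.GetLastWin32Error());

        // ... fast, never-paged access to the VIP data here ...

        munlock(vip, (UIntPtr)(uint)size);
        Marshal.FreeHGlobal(vip);
    }
}
Note that renice only changes CPU scheduling priority; mlock is the direct lever over paging.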

Strategy for local logging on an embedded Raspberry Pi?

My company uses the Raspberry Pi 3 as an embedded controller in a product. Users don't power it off gracefully, they just flip a switch. To avoid corruption, the /boot and /root file systems are read-only. This appears to be bulletproof - we've used a test rig to "pull the plug" over and over (2000+ cycles) with no problems.
We are working on a new feature that requires local logging. To do so, we created an additional ext4 read/write partition on the SD card (we are currently using about 2GB on an 8GB card) for the log file. To minimize wear, the application buffers the log data and writes to the card only once every minute. The log file is closed between writes. Nothing else uses that partition. The log file is not written to when the application is in a state that likely indicates the user is about to shut down.
In testing this, we've found that in spite of the rather conservative approach we're using, the read/write partition is always marked as "dirty" after a reboot, frequently contains filesystem errors, and often has a damaged log file. We've also had a number of cards suffer unrecoverable errors which prevent the device from booting up.
Loss of the last set of log entries is not a problem.
Loss of the log file is undesirable but acceptable.
Damage to the /root and /boot filesystems is unacceptable, as is physical damage (other than standard NAND flash wear) to the card.
Short of adding a UPS to gracefully shut down the Pi, is there any approach that will safely allow for read/write operations?
Is there a configuration of the SD card partition "geometry" that would ensure that no two partitions overlap one flash erase block?
Just some points:
Dirty flag: I guess that you are not unmounting the filesystem, right? That is a possible reason to see the dirty flag after each unclean reboot. Another (probably better) way is to switch the filesystem to read-only mode after a write completes and remount it read-write just before the next write.
BTW, ext4 defers writes to the disk. close() on a file doesn't mean that the file is written to the disk; you need an extra fsync() or sync call (see Does Linux guarantee the contents of a file is flushed to disc after close()?). So it is better to ask the system to really write the file.
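If the application happens to be on .NET, the write-then-fsync pattern looks like this (the path and payload are illustrative; FileStream.Flush(true) issues the fsync, and in C it would be write() followed by fsync()):
using System.IO;
using System.Text;

// Append one log record and ask the kernel to flush it to the card.
using (var fs = new FileStream("/data/log/app.log", FileMode.Append))
{
    byte[] record = Encoding.UTF8.GetBytes("2024-01-01T00:00:00 sensor=42\n");
    fs.Write(record, 0, record.Length);
    fs.Flush(flushToDisk: true);   // issues fsync(2) on Linux
}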
I suggest using UBIFS, JFFS2, or YAFFS2; that is the best-practice approach. I have also heard about LogFS.
Keeping the filesystem mounted and writing without delay is possible because these filesystems are designed to survive hard shutdowns.
Copy-pasted overview from https://superuser.com/questions/248078/choice-of-filesystem-for-gnu-linux-on-an-sd-card:
JFFS2
Includes compression and elegant wear leveling protection.
YAFFS2
The single thing that makes the difference: short mount times, after a successful umount.
Implements write once property: once data is written to one block, there is no need to rewrite it. This is important, as it reduces wear.
LogFS
Not very mature, but already included in Linux kernel tree.
Supports larger filesystems than JFFS2/YAFFS2 without problems.
UBIFS
More mature than LogFS
Write caching support
On scalability: on large disks, better performance than with JFFS2.
ext4
If neither the driver nor the card handles wear leveling (SSD drives, for example, usually do have internal wear leveling), then ext4 is not the best idea, as it is not intended for raw flash usage.

Performance benchmarks for attaching read-only disks to google compute engine

Has anyone benchmarked the performance of attaching a singular, read-only disk to multiple Google Compute Engine instances (i.e., the same disk in read-only mode)?
The Google documentation ( https://cloud.google.com/compute/docs/disks/persistent-disks#use_multi_instances ) indicates that it is OK to attach multiple instances to the same disk, and personal experience has shown it to work at a small scale (5 to 10 instances), but soon we will be running a job across 500+ machines (GCE instances). We would like to know how performance scales as the number of parallel attachments grows, and as the bandwidth of those attachments grows. We currently pull down large blocks of data (read-only) from Google Cloud Storage Buckets, and are wondering about the merits of switching to a Standard Persistent Disk configuration. This involves terabytes of data, so we don't want to change course willy-nilly.
One important consideration: It is likely that code on each of the 500+ machines will try to access the same file (400MB) at the same time. How do buckets and attached drives compare in that case? Maybe the answer is obvious - and it would save having to set up a rigorous benchmarking system (across 500 machines) ourselves. Thanks.
Persistent disks on GCE should have consistent performance. Currently that is 12 MB/s and 30 IOPS per 100 GB of volume size for a standard persistent disk:
https://cloud.google.com/compute/docs/disks/persistent-disks#pdperformance
Using it from multiple instances should not change the disk's overall performance. It will, however, make it easier to reach those limits, since you don't need to worry about a single instance's maximum read speed. However, accessing the same data many times at once might; I do not know how either persistent disks or GCS handle contention.
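As a rough worked example of what those documented rates would mean for the 500-instance fan-out (the 1 TB volume size is an assumption; real throughput also depends on per-instance caps and contention):
using System;

// Documented standard PD rate: 12 MB/s per 100 GB of volume size.
const double mbpsPer100GB = 12.0;
const double volumeGB = 1000;                     // assumed 1 TB disk
double diskMBps = mbpsPer100GB * volumeGB / 100;  // 120 MB/s for the volume

// 500 instances each reading the same 400 MB file share that budget.
double totalMB = 500 * 400.0;                     // 200,000 MB to deliver
double seconds = totalMB / diskMBps;              // ~1,667 s

Console.WriteLine($"~{diskMBps} MB/s shared; full fan-out ~{seconds / 60:F0} minutes");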
If it is only a 400MB file that is in contention, it may make sense to benchmark the fastest method of delivering it separately. One possible solution is to make duplicates of your critical file and pick which one you access at random. This should cause fewer nodes to contend for each file.
Duplicating the critical file requires a bigger disk, and since performance scales with volume size, that in itself also contributes to your IO performance. If you already intended to increase your volume size for better performance, the copies are free.
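A sketch of the random-replica idea (the path pattern and replica count are hypothetical):
using System;
using System.IO;

// With 8 identical copies, each node contends with roughly 1/8 of its peers.
const int replicaCount = 8;                        // hypothetical
var rng = new Random();
string path = $"/mnt/pd0/data-{rng.Next(replicaCount)}.bin";
byte[] payload = File.ReadAllBytes(path);          // read the chosen copy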

Does it make sense to cache data obtained from a memory mapped file?

Or would it be faster to re-read that data from mapped memory once again, since the OS might implement its own cache?
The nature of data is not known in advance, it is assumed that file reads are random.
I wanted to mention a few things I've read on the subject. The answer is no, you don't want to second-guess the operating system's memory manager.
The first comes from the idea that you want your program (e.g. MongoDB, SQL Server) to try to limit your memory based on a percentage of free RAM:
Don't try to allocate memory until there is only x% free
Occasionally, a customer will ask for a way to design their program so it continues consuming RAM until there is only x% free. The idea is that their program should use RAM aggressively, while still leaving enough RAM available (x%) for other use. Unless you are designing a system where you are the only program running on the computer, this is a bad idea.
(read the article for the explanation of why it's bad, including pictures)
Next come some notes from the author of Varnish, the reverse proxy:
Varnish Cache - Notes from the architect
So what happens with squid's elaborate memory management is that it gets into fights with the kernel's elaborate memory management, and like any civil war, that never gets anything done.
What happens is this: Squid creates a HTTP object in "RAM" and it gets used some times rapidly after creation. Then after some time it gets no more hits and the kernel notices this. Then somebody tries to get memory from the kernel for something and the kernel decides to push those unused pages of memory out to swap space and use the (cache-RAM) more sensibly for some data which is actually used by a program. This, however, is done without squid knowing about it. Squid still thinks that these http objects are in RAM, and they will be, the very second it tries to access them, but until then, the RAM is used for something productive.
Imagine you do cache something from a memory-mapped file. At some point in the future that memory holding that "cache" will be swapped out to disk.
the OS has written to the hard-drive something which already exists on the hard drive
Next comes a time when you want to perform a lookup from your "cache" memory, rather than the "real" memory. You attempt to access the "cache", and since it has been swapped out of RAM the hardware raises a PAGE FAULT, and cache is swapped back into RAM.
your cache memory is just as slow as the "real" memory, since both are no longer in RAM
Finally, you want to free your cache (perhaps your program is shutting down). If the "cache" has been swapped out, the OS must first swap it back in so that it can be freed. If instead you just unmapped your memory-mapped file, everything is gone (nothing needs to be swapped in).
in this case your cache makes things slower
Again from Raymond Chen: If your application is closing - close already:
When DLL_PROCESS_DETACH tells you that the process is exiting, your best bet is just to return without doing anything
I regularly use a program that doesn't follow this rule. The program allocates a lot of memory during the course of its life, and when I exit the program, it just sits there for several minutes, sometimes spinning at 100% CPU, sometimes churning the hard drive (sometimes both). When I break in with the debugger to see what's going on, I discover that the program isn't doing anything productive. It's just methodically freeing every last byte of memory it had allocated during its lifetime.
If my computer wasn't under a lot of memory pressure, then most of the memory the program had allocated during its lifetime hasn't yet been paged out, so freeing every last drop of memory is a CPU-bound operation. On the other hand, if I had kicked off a build or done something else memory-intensive, then most of the memory the program had allocated during its lifetime has been paged out, which means that the program pages all that memory back in from the hard drive, just so it could call free on it. Sounds kind of spiteful, actually. "Come here so I can tell you to go away."
All this anal-retentive memory management is pointless. The process is exiting. All that memory will be freed when the address space is destroyed. Stop wasting time and just exit already.
The reality is that programs no longer run in "RAM", they run in memory - virtual memory.
You can make use of a cache, but you have to work with the operating system's virtual memory manager:
you want to keep your cache within as few pages as possible
you want to ensure they stay in RAM, by the virtue of them being accessed a lot (i.e. actually being a useful cache)
Accessing:
a thousand 1-byte locations around a 400GB file
is much more expensive than accessing
a single 1000-byte location in a 400GB file
In other words: you don't really need to cache data, you need a more localized data structure.
If you keep your important data confined to a single 4k page, you will play much nicer with the VMM; Windows is your cache.
When you take 64-byte cache lines into account, there's even more incentive to adjust your data structure layout. But then you don't want it too compact, or you'll start suffering the performance penalties of cache flushes from false sharing.
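To illustrate that last point, here is a small sketch (the iteration count and padding distance are arbitrary): two threads incrementing adjacent longs fight over one cache line, while counters padded 128 bytes apart do not.
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class FalseSharingDemo
{
    // Adjacent counters share a 64-byte cache line.
    static long[] shared = new long[2];
    // Counters 16 longs (128 bytes) apart sit on separate lines.
    static long[] padded = new long[32];

    static void Main()
    {
        const long Iterations = 50_000_000;
        var sw = Stopwatch.StartNew();
        Parallel.Invoke(
            () => { for (long i = 0; i < Iterations; i++) shared[0]++; },
            () => { for (long i = 0; i < Iterations; i++) shared[1]++; });
        Console.WriteLine($"adjacent: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        Parallel.Invoke(
            () => { for (long i = 0; i < Iterations; i++) padded[0]++; },
            () => { for (long i = 0; i < Iterations; i++) padded[16]++; });
        Console.WriteLine($"padded:   {sw.ElapsedMilliseconds} ms");
    }
}
On typical hardware the padded version runs noticeably faster, because each core keeps its own cache line instead of bouncing one line back and forth.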
The answer is highly OS-specific. Generally speaking, there is no sense in caching this data. Both the "cached" data and the memory-mapped data can be paged out at any time.
Any difference will be specific to an OS; unless you need that granularity, there is no sense in caching the data.

Reading Multiple Files in Multiple Threads using C#, Slow!

I have an Intel Core 2 Duo CPU, and I was reading 3 files from my C: drive and showing some matching values from the files in an EditBox on screen. The whole process took 2 minutes. Then I thought of processing each file in a separate thread, and the whole process then took 2 minutes 30 seconds - i.e., 30 seconds more than the single-threaded processing!
I was expecting the other way around! I can see both graphs in the CPU usage history. Can someone please explain to me what is going on?
Here is my code snippet:
var threads = new List<Thread>();
foreach (FileInfo file in FileList)
{
    // Start one worker thread per file.
    Thread t = new Thread(new ParameterizedThreadStart(ProcessFileData));
    t.Start(file.FullName);
    threads.Add(t);
}
threads.ForEach(t => t.Join());   // wait for all files to be processed
where ProcessFileData is the method that processes the files.
Thanks!
The root of the problem is that the files are on the same drive and, unlike your dual core processor, your hard drive can only do one thing at a time.
If you read two files simultaneously, the disk heads will jump from one file to the other and back again. Given that your hard drive can read each file in roughly 40 seconds, it now has the additional overhead of moving its disk head between the three separate files many times during the read.
The fastest way to read multiple files from a single hard drive is to do it all in one thread and read them one after another. This way, the head only moves once per file read (at the very beginning) and not multiple times per read.
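In other words, something as plain as the following (continuing the asker's FileList snippet) will usually beat the threaded version on a single spindle:
// One thread, one file at a time: the head seeks once per file,
// then reads sequentially.
foreach (FileInfo file in FileList)
{
    string[] lines = File.ReadAllLines(file.FullName);
    // ... scan lines for the matching values ...
}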
To optimize this process you'll either need to change your logic (do you really need to read the whole contents of all three files?), or purchase a faster hard drive, put the 3 files on three different hard drives and use threading, or use a RAID.
If you read from disk using multiple threads, then the disk heads will bounce around from one part of the disk to another as each thread reads from a different part of the drive. That can reduce throughput significantly, as you've seen.
For that reason, it's actually often a better idea to have all disk accesses go through a single thread, to help minimize disk seeks.
If your task is I/O bound and if it needs to run often, you might look at a tool like "contig" to make sure the layout of your files on disk is optimized / contiguous.
If your processing is mostly IO bound rather than CPU bound, it makes sense that it takes the same time or even longer.
How do you compare those files? You should think about what the bottleneck of your application is: IO (input/output), CPU, memory...
Multithreading is only interesting for CPU-bound processing, i.e. complex calculations, comparison of data in memory, sorting, etc.
Since your process is IO bound, you should let the OS do your threading for you. Look at FileStream.BeginRead() for an example of how to queue up your reads. Your EndRead() callback can kick off the next request to read the next block of data, pointing to itself to handle each subsequent completed block.
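A minimal sketch of that pattern (the buffer size is arbitrary and error handling is omitted): the EndRead() callback consumes one block and immediately queues the next BeginRead(), pointing back to itself.
using System;
using System.IO;

class AsyncReader
{
    FileStream fs;
    byte[] buffer = new byte[64 * 1024];

    public void Start(string path)
    {
        // useAsync: true requests overlapped/asynchronous IO from the OS.
        fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                            FileShare.Read, buffer.Length, useAsync: true);
        fs.BeginRead(buffer, 0, buffer.Length, OnBlockRead, null);
    }

    void OnBlockRead(IAsyncResult ar)
    {
        int bytes = fs.EndRead(ar);
        if (bytes == 0) { fs.Dispose(); return; }   // end of file

        // ... scan buffer[0..bytes] for matching values ...

        // Queue the next block; the callback points back to itself.
        fs.BeginRead(buffer, 0, buffer.Length, OnBlockRead, null);
    }
}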
Also, when you create additional threads, the OS has to manage more threads. And if a different CPU happens to be picked to handle the completed read, you've lost all of the CPU caching built up where your thread originated.
As you've found, you can't "speed up" an application just by adding threads.