Strategy for local logging on an embedded Raspberry Pi? - raspberry-pi

My company uses the Raspberry Pi 3 as an embedded controller in a product. Users don't power it off gracefully; they just flip a switch. To avoid corruption, the /boot and /root file systems are read-only. This appears to be bulletproof - we've used a test rig to "pull the plug" over and over (2000+ cycles) with no problems.
We are working on a new feature that requires local logging. To do so, we created an additional ext4 read/write partition on the SD card (we are currently using about 2GB on an 8GB card) for the log file. To minimize wear, the application buffers the log data and writes to the card only once every minute. The log file is closed between writes. Nothing else uses that partition. The log file is not written to when the application is in a state that likely indicates the user is about to shut down.
In testing this, we've found that in spite of the rather conservative approach we're using, the read/write partition is always marked as "dirty" after a reboot, frequently contains filesystem errors, and often has a damaged log file. We've also had a number of cards suffer unrecoverable errors which prevent the device from booting up.
Loss of the last set of log entries is not a problem.
Loss of the log file is undesirable but acceptable.
Damage to the /root and /boot filesystems is unacceptable, as is physical damage (other than standard NAND flash wear) to the card.
Short of adding a UPS to gracefully shut down the Pi, is there any approach that will safely allow for read/write operations?
Is there a configuration of the SD card partition "geometry" that would ensure that no two partitions overlap one flash erase block?

Just some points:
Dirty flag: I guess you are not unmounting the filesystem, right? That alone would explain the dirty flag after each unclean reboot. Another (probably better) approach is to remount the filesystem read-only after each write and switch it back to read-write just before the next write.
BTW, ext4 defers writes to the disk. close() on a file doesn't mean the data has reached the disk; you need an extra fsync() or sync call (see Does Linux guarantee the contents of a file is flushed to disc after close()?). So it is better to explicitly ask the system to really write the file.
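To make that concrete, here is a minimal C sketch of one write cycle combining both suggestions (the mount point /mnt/log and the file name are assumptions, and error handling is simplified): remount read-write, append the buffered batch, fsync() the file and its directory, then remount read-only so an unclean power-off lands on a read-only filesystem.
#include <fcntl.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

/* Append one buffered batch of log data, make it durable, and leave the
   log partition read-only between writes. Sketch only: paths are assumed
   and errors are simply propagated. */
static int write_log_batch(const char *buf, size_t len)
{
    if (mount("none", "/mnt/log", NULL, MS_REMOUNT, NULL) < 0)      /* rw */
        return -1;

    int fd = open("/mnt/log/app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {     /* flush data */
        close(fd);
        return -1;
    }
    close(fd);

    int dirfd = open("/mnt/log", O_RDONLY | O_DIRECTORY);           /* flush dir entry */
    if (dirfd >= 0) {
        fsync(dirfd);
        close(dirfd);
    }

    /* Back to read-only, so a power cut between writes finds a clean fs. */
    return mount("none", "/mnt/log", NULL, MS_REMOUNT | MS_RDONLY, NULL);
}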

I suggest using UBIFS, JFFS2, or YAFFS2; that is the best-practice route. I have also heard about LogFS.
Keeping the filesystem mounted at all times and writing without delay is possible, because these filesystems are designed to cope with hard shutdowns.
Copy-pasted overview from https://superuser.com/questions/248078/choice-of-filesystem-for-gnu-linux-on-an-sd-card
JFFS2
Includes compression and elegant wear leveling protection.
YAFFS2
Single thing that makes the difference: short mount times, after successful umount.
Implements write once property: once data is written to one block, there is no need to rewrite it. This is important, as it reduces wear.
LogFS
Not very mature, but already included in Linux kernel tree.
Supports larger filesystems than JFFS2/YAFFS2 without problems.
UBIFS
More mature than LogFS
Write caching support
On scalability: on large disks, better performance than with JFFS2.
ext4
If neither the driver nor the card handles wear leveling (SSDs, for example, usually do have internal wear leveling), then ext4 is not the best idea, as it is not intended for raw flash usage.

Related

What is physical storage allocation?

I am studying for an M.Sc. IT degree, in a subject called 'Advanced Operating Systems'. There is a chapter on file systems which includes a topic named "PHYSICAL STORAGE ALLOCATION". I searched in books and on the internet too but didn't find anything related to it. Can anyone give me a definition of PHYSICAL STORAGE ALLOCATION and explain how it works in 4-5 lines?
The term is IMHO ambiguous.
It could mean, in the context of file systems, the allocation and management of blocks on a physical device (e.g. a disk, a disk partition, or some USB storage key). The ext2 wikipage has a nice figure explaining that (for the old ext2 file system). In general, data is read from and written to disks in blocks of fixed size (512 bytes on old disks, 4 KB on newer ones), and the file system code organizes files and provides the "file" abstraction above such blocks.
(your edit, and the image you are showing, suggest that first meaning)
It could also mean, in the context of virtual memory, the allocation of pages in physical RAM (related to the configuration of the MMU), or maybe in the swap device.
Read also Operating Systems: Three Easy Pieces (freely downloadable).
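To illustrate the first meaning, here is a small hedged C sketch (the device path and block size are assumptions): it reads one fixed-size block straight off a block device, the raw operation above which the file system builds its allocation structures.
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLOCK_SIZE 4096   /* typical block size on modern disks */

int main(void)
{
    /* /dev/sda is an assumption; any block device works (needs root) */
    int fd = open("/dev/sda", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    unsigned char block[BLOCK_SIZE];
    /* read block number 42: the file system addresses storage
       in fixed-size units like this, not byte by byte */
    ssize_t n = pread(fd, block, sizeof block, (off_t)42 * BLOCK_SIZE);
    if (n < 0) perror("pread");
    else printf("read %zd bytes of block 42\n", n);

    close(fd);
    return 0;
}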

What is the lifespan of an SD card running Raspbian Linux for ARM (on a Raspberry Pi board)?

Folks, this question is for anyone who knows Debian Linux, more precisely Raspbian, which is a version that runs on the Raspberry Pi board:
As all users of the Raspberry Pi should know, the operating system is installed on an SD card. And the problem is that an SD card is flash memory, and this type of memory supports only a limited number of write operations.
I would like to know whether Raspbian writes to the SD card when it is idle. If it does, how can I disable that?
I found this:
Tips for running Linux on a flash device by David Härdeman
If you are running your NSLU2 on a USB flash key, there are a number
of things you might want to do in order to reduce the wear and tear on
the underlying flash device (as it only supports a limited number of
writes).
Note: this document currently describes Debian etch (4.0) and needs to
be updated to Debian squeeze (6.0) and Debian wheezy (7.0). Some of
the hints may still apply, but some may not.
The ext3 filesystem per default writes metadata changes every five
seconds to disk. This can be increased by mounting the root filesystem
with the commit=N parameter which tells the kernel to delay writes to
every N seconds.
The kernel writes a new atime for each file that has been read which
generates one write for each read. This can be disabled by mounting
the filesystem with the noatime option.
Both of the above can be done by adding e.g. noatime,commit=120,... to /etc/fstab. This can also be done on an
already mounted filesystem by running the command:
mount -o remount,noatime,commit=120 /
The system will run updatedb every day which creates a database of all
files on the system for use with the locate command. This will also
put some stress on the filesystem, so you might want to disable it by
adding
exit 0
early in the /etc/cron.daily/find script.
syslogd will in the default installation sync a lot of log files to
disk directly after logging some new information. You might want to
change /etc/syslog.conf so that every filename starts with a - (minus)
which means that writes are not synced immediately (which increases
the risk that some log messages are lost if your system crashes). For
example, a line such as:
kern.* /var/log/kern.log
would be changed to:
kern.* -/var/log/kern.log
You also might want to disable some classes of messages altogether by
logging them to /dev/null instead, see syslog.conf(5) for details.
In addition, syslogd likes to write -- MARK -- lines to log files
every 20 minutes to show that syslog is still running. This can be
disabled by changing SYSLOGD in /etc/default/syslogd so that it reads
SYSLOGD="-m 0"
After you've made any changes, you need to restart syslogd by running
/etc/init.d/syslogd restart
If you have a swap partition or swap file on the flash device, you
might want to move it to a different part of the disk every now and
then to make sure that different parts of the disk gets hit by the
frequent writes that it can generate. For a swap file this can be done
by creating a new swap file before you remove the old one.
If you have a swap partition or swap file stored on the flash device,
you can make sure that it is used as little as possible by setting
/proc/sys/vm/swappiness to zero.
The kernel also has a setting known as laptop_mode, which makes it
delay writes to disk (initially intended to allow laptop disks to spin
down while not in use, hence the name). A number of files under
/proc/sys/vm/ controls how this works:
/proc/sys/vm/laptop_mode: How many seconds after a read should a
writeout of changed files start (this is based on the assumption that
a read will cause an otherwise spun down disk to spin up again).
/proc/sys/vm/dirty_writeback_centisecs: How often the kernel should
check if there is "dirty" (changed) data to write out to disk (in
centiseconds).
/proc/sys/vm/dirty_expire_centisecs: How old "dirty" data should be
before the kernel considers it old enough to be written to disk. It is
in general a good idea to set this to the same value as
dirty_writeback_centisecs above.
/proc/sys/vm/dirty_ratio: The maximum amount of memory (in percent) to
be used to store dirty data before the process that generates the data
will be forced to write it out. Setting this to a high value should
not be a problem as writeouts will also occur if the system is low on
memory.
/proc/sys/vm/dirty_background_ratio: The lower amount of memory (in
percent) where a writeout of dirty data to disk is allowed to stop.
This should be quite a bit lower than the above dirty_ratio to allow
the kernel to write out chunks of dirty data in one go.
All of the above kernel parameters can be tuned by using a custom init
script, such as this example script. Store it to e.g.
/etc/init.d/kernel-params, make it executable with
chmod a+x /etc/init.d/kernel-params
and make sure it is executed by running
update-rc.d kernel-params defaults
Note: Most of these settings reduce the number of writes to disk by
increasing memory usage. This increases the risk for out of memory
situations (which can trigger the dreaded OOM killer in the kernel).
This can even happen when there is free memory available (for example
when the kernel needs to allocate more than one contiguous page and
there are only fragmented free pages available).
As with any tweaks, you are advised to keep a close eye on the amount
of free memory and adapt the tweaks (e.g. by using less aggressive
caching and increasing the swappiness) depending on your workload.
This article has been contributed by David Härdeman
Go back to the Debian on NSLU2 page.
http://www.cyrius.com/debian/nslu2/linux-on-flash/
Does anyone have more tips?
I have been using various Raspberry Pi setups and haven't had SD card troubles to date (fingers crossed). That being said, there is a bit of evidence out there of SD card lifespan issues.
A quick google search does show a few more tips though:
Bigger is better - a larger card reduces the load on any specific section
Write temporary files to RAM
Only store the boot partition on the SD card and keep the OS on a USB drive
(http://www.makeuseof.com/tag/extend-life-raspberry-pis-sd-card/)
Anyway, it'll be interesting to hear from someone who has a Raspberry Pi cluster or some such about their SD card lifespans!
(https://resin.io/blog/what-would-you-do-with-a-120-raspberry-pi-cluster/)
You can put files in tmpfs after boot and write them back before shutdown using the script from http://www.observium.org/wiki/Persistent_RAM_disk_RRD_storage
But it can be detrimental:
tmpfs loses all changes on a power outage, so you would need a UPS;
Raspberry Pi RAM is far from big, so don't waste it.
If your Pi often writes small files, this approach can work well for you.
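For completeness, a hedged C sketch of the tmpfs idea (the mount point and size are assumptions); the same thing is usually done with a line in /etc/fstab, but mount(2) shows the mechanics:
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Mount a 16 MB RAM-backed tmpfs over /var/log (size/path assumed).
       Everything written there lives in RAM only and is lost on power-off. */
    if (mount("tmpfs", "/var/log", "tmpfs", MS_NOATIME, "size=16m") < 0) {
        perror("mount tmpfs");
        return 1;
    }
    return 0;
}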

Why does the sequential write to a journal file speed up if it is in a different file system?

As per MongoDB documentation at http://docs.mongodb.org/manual/core/journaling,
To speed the frequent sequential writes that occur to the current
journal file, you can ensure that the journal directory is on a
different filesystem
storing the journal file on a different file system speeds things up. Is it because two different hard disk spindles are at work? Just wanted to understand the mechanics of this optimization tip.
Yes,
If you are using physical rotating hard drives, there is a significant performance benefit from separating the journal activity onto a separate (preferably dedicated) physical drive.
The benefits are not the same if you're using SAN hardware, and to an extent they are lessened by the larger drive caches available in modern hard drives. It's a different story again with SSDs.
The main factor with spinning disks is seek time - the time that it takes for the read/write head to get to the right part of the disk. Hard disks are arranged with circular tracks. To get to a specific block on the disk, the head moves to the right track, and the disk spins around to the right place (the disks keep spinning of course, so it's simply a matter of waiting for the right place to come around).
This doesn't take much time, but when it's happening a lot it adds up.
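As a rough worked example (typical textbook figures, not measurements): with about 8 ms average seek time plus about 4 ms rotational latency on a 7200 rpm drive, each random access costs roughly 12 ms, so the drive completes on the order of 80-100 random I/Os per second, while the very same drive can stream a sequential write at 100+ MB/s. If every journal write forces a seek away and back, throughput collapses toward that random-I/O rate.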
When you have the primary activity and the journal activity on the same drive, the head has to rapidly move between the two (many, really) locations that the system needs to look at.
If you have your journalling on another physical drive, then the head on that drive can stay almost (or perhaps more accurately, relatively) static, able to reach the correct track/location quickly. Meanwhile the other drive (with the primary activity on it) will also be more efficient, because its head will not be constantly seeking back to where the journal entries are being written in between the other activities required to keep the database running.
This benefit applies to most database systems and many other applications where there is a constant sequential writing to disk going on at the same time as other mixed disk activity.
You don't get the same profile if you're using SAN, because even if it appears to be separate file systems, it's actually likely to be striped across many drives which are both cached and shared.
SSDs have a different profile too, because there is no physical seek time.

Does it make sense to cache data obtained from a memory mapped file?

Or it would be faster to re-read that data from mapped memory once again, since the OS might implement its own cache?
The nature of data is not known in advance, it is assumed that file reads are random.
I wanted to mention a few things I've read on the subject. The answer is no: you don't want to second-guess the operating system's memory manager.
The first comes from the idea that you want your program (e.g. MongoDB, SQL Server) to try to limit your memory based on a percentage of free RAM:
Don't try to allocate memory until there is only x% free
Occasionally, a customer will ask for a way to design their program so it continues consuming RAM until there is only x% free. The idea is that their program should use RAM aggressively, while still leaving enough RAM available (x%) for other use. Unless you are designing a system where you are the only program running on the computer, this is a bad idea.
(read the article for the explanation of why it's bad, including pictures)
Next come some notes from the author of Varnish, the reverse proxy:
Varnish Cache - Notes from the architect
So what happens with squid's elaborate memory management is that it gets into fights with the kernel's elaborate memory management, and like any civil war, that never gets anything done.
What happens is this: Squid creates an HTTP object in "RAM" and it gets used some times rapidly after creation. Then after some time it gets no more hits and the kernel notices this. Then somebody tries to get memory from the kernel for something and the kernel decides to push those unused pages of memory out to swap space and use the (cache-RAM) more sensibly for some data which is actually used by a program. This, however, is done without squid knowing about it. Squid still thinks that these http objects are in RAM, and they will be, the very second it tries to access them, but until then, the RAM is used for something productive.
Imagine you do cache something from a memory-mapped file. At some point in the future that memory holding that "cache" will be swapped out to disk.
the OS has written to the hard-drive something which already exists on the hard drive
Next comes a time when you want to perform a lookup from your "cache" memory, rather than the "real" memory. You attempt to access the "cache", and since it has been swapped out of RAM the hardware raises a PAGE FAULT, and cache is swapped back into RAM.
your cache memory is just as slow as the "real" memory, since both are no longer in RAM
Finally, you want to free your cache (perhaps your program is shutting down). If the "cache" has been swapped out, the OS must first swap it back in so that it can be freed. If instead you just unmapped your memory-mapped file, everything is gone (nothing needs to be swapped in).
in this case your cache makes things slower
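A small hedged C sketch of the contrast being described (the file name and offsets are assumptions): the "cache" version copies data out of the mapping into malloc'd memory, which the VMM can page out just the same; the direct version reads through the mapping and lets the kernel's page cache do the caching.
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);   /* hypothetical file, assumed large */
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);

    const unsigned char *map =
        mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* Anti-pattern: a second, redundant copy the VMM may page out anyway. */
    unsigned char *cache = malloc(4096);
    memcpy(cache, map + 65536, 4096);

    /* Preferred: read through the mapping itself; pages that are touched
       often stay in RAM because the OS sees the accesses. */
    unsigned long sum = 0;
    for (size_t i = 65536; i < 65536 + 4096; i++)
        sum += map[i];
    printf("checksum %lu\n", sum);

    free(cache);
    munmap((void *)map, st.st_size);
    close(fd);
    return 0;
}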
Again from Raymond Chen: If your application is closing - close already:
When DLL_PROCESS_DETACH tells you that the process is exiting, your best bet is just to return without doing anything
I regularly use a program that doesn't follow this rule. The program
allocates a lot of memory during the course of its life, and when I
exit the program, it just sits there for several minutes, sometimes
spinning at 100% CPU, sometimes churning the hard drive (sometimes
both). When I break in with the debugger to see what's going on, I
discover that the program isn't doing anything productive. It's just
methodically freeing every last byte of memory it had allocated during
its lifetime.
If my computer wasn't under a lot of memory pressure, then most of the
memory the program had allocated during its lifetime hasn't yet been
paged out, so freeing every last drop of memory is a CPU-bound
operation. On the other hand, if I had kicked off a build or done
something else memory-intensive, then most of the memory the program
had allocated during its lifetime has been paged out, which means that
the program pages all that memory back in from the hard drive, just so
it could call free on it. Sounds kind of spiteful, actually. "Come
here so I can tell you to go away."
All this anal-retentive memory management is pointless. The process
is exiting. All that memory will be freed when the address space is
destroyed. Stop wasting time and just exit already.
The reality is that programs no longer run in "RAM", they run in memory - virtual memory.
You can make use of a cache, but you have to work with the operating system's virtual memory manager:
you want to keep your cache within as few pages as possible
you want to ensure they stay in RAM, by the virtue of them being accessed a lot (i.e. actually being a useful cache)
Accessing:
a thousand 1-byte locations around a 400GB file
is much more expensive than accessing
a single 1000-byte location in a 400GB file
In other words: you don't really need to cache data, you need a more localized data structure.
If you keep your important data confined to a single 4k page, you will play much nicer with the VMM; Windows is your cache.
When you take 64-byte cache lines into account, there's even more incentive to adjust your data-structure layout. But then you don't want it too compact, or you'll start suffering the performance penalties of cache flushes from false sharing.
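A brief hedged C illustration of that last point (the 64-byte line size is an assumption; check your CPU): two counters updated by different threads should not share a cache line, and padding each onto its own line avoids false sharing.
#include <stdalign.h>

#define CACHE_LINE 64   /* assumed line size; verify for your CPU */

/* Bad: both counters share one line, so a write by one thread
   invalidates the line in the other thread's cache. */
struct counters_shared {
    long a;   /* updated by thread 1 */
    long b;   /* updated by thread 2 */
};

/* Better: each counter is aligned to its own cache line. */
struct counters_padded {
    alignas(CACHE_LINE) long a;
    alignas(CACHE_LINE) long b;
};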
The answer is highly OS-specific. Generally speaking, there is no sense in caching this data: both the "cached" data and the memory-mapped data can be paged out at any time.
If there is any difference, it will be OS-specific; unless you need that granularity, there is no sense in caching the data.

What is the fastest way to read 10 GB file from the disk?

We need to read and count different types of messages/run
some statistics on a 10 GB text file, e.g. a FIX engine
log. We use Linux, 32-bit, 4 CPUs, Intel, coding in Perl but
the language doesn't really matter.
I have found some interesting tips in Tim Bray's
WideFinder project. However, we've found that using memory mapping
is inherently limited by the 32 bit architecture.
We tried using multiple processes, which seems to work
faster if we process the file in parallel using 4 processes
on 4 CPUs. Adding multi-threading slows it down, maybe
because of the cost of context switching. We tried changing
the size of thread pool, but that is still slower than
simple multi-process version.
The memory mapping part is not very stable, sometimes it
takes 80 sec and sometimes 7 sec on a 2 GB file, maybe from
page faults or something related to virtual memory usage.
Anyway, mmap cannot scale beyond 4 GB on a 32-bit architecture.
We tried Perl's IPC::Mmap and Sys::Mmap. Looked
into Map-Reduce as well, but the problem is really I/O
bound, the processing itself is sufficiently fast.
So we decided to try optimize the basic I/O by tuning
buffering size, type, etc.
Can anyone who is aware of an existing project where this
problem was efficiently solved in any language/platform
point to a useful link or suggest a direction?
Most of the time you will be I/O bound, not CPU bound, so just read the file through normal Perl I/O and process it in a single thread. Unless you prove that you can do more I/O than your single CPU can handle, don't waste your time on anything fancier. Anyway, you should ask: why on Earth is this all in one huge file? Why on Earth don't they split it up sensibly when they generate it? That would make the work an order of magnitude more tractable. Then you could put the pieces on separate I/O channels and use more CPUs (if you don't use some sort of RAID 0 or NAS or ...).
Measure, don't assume. Don't forget to flush the caches before each test. Remember that sequential I/O is an order of magnitude faster than random I/O.
This all depends on what kind of preprocessing you can do, and when.
On some of the systems we have, we gzip such large text files, reducing them to 1/5 to 1/7 of their original size. Part of what makes this possible is that we don't need to process these files until hours after they're created, and at creation time we don't really have any other load on the machines.
Processing them is done more or less in the fashion of zcat thosefiles | ourprocessing (well, it's done over unix sockets, with a custom-made zcat). It trades CPU time for disk I/O time, and for our system that has been well worth it. There are of course a lot of variables that can make this a very poor design for a particular system.
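A minimal C sketch of that decompress-while-reading pattern (the file name and the line filter are assumptions): the consumer never touches a large uncompressed file on disk; it reads zcat's output through a pipe, trading CPU for disk I/O as described above.
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* zcat decompresses on the fly: we pay CPU instead of disk I/O. */
    FILE *in = popen("zcat big.log.gz", "r");   /* hypothetical file */
    if (!in) { fprintf(stderr, "popen failed\n"); return 1; }

    char line[4096];
    long matches = 0;
    while (fgets(line, sizeof line, in))
        if (strstr(line, "35=8"))   /* e.g. count FIX execution reports */
            matches++;

    pclose(in);
    printf("%ld matching lines\n", matches);
    return 0;
}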
Perhaps you've already read this forum thread, but if not:
http://www.perlmonks.org/?node_id=512221
It describes using Perl to do it line-by-line, and the users seem to think Perl is quite capable of it.
Oh, is it possible to process the file from a RAID array? If you have several mirrored disks, then the read speed can be improved. Competition for disk resources may be what makes your multiple-threads attempt not work.
Best of luck.
I wish I knew more about the content of your file, but not knowing other than that it is text, this sounds like an excellent MapReduce kind of problem.
PS: the fastest read of any file is a linear read; cat file > /dev/null should show the top speed at which the file can be read.
Have you thought of streaming the file and filtering out any interesting results to a secondary file? (Repeat until you have a manageable-size file.)
Basically you need to "divide and conquer": if you have a network of computers, copy the 10 GB file to as many client PCs as possible and get each client PC to read a different offset of the file. For an added bonus, get EACH PC to implement multithreading in addition to the distributed reading.
Parse the file once, reading line by line. Put the results in a table in a decent database. Run as many queries as you wish. Feed the beast regularly with new incoming data.
Realize that manipulating a 10 Gb file, transferring it across the (even if local) network, exploring complicated solutions etc all take time.
I have a co-worker who sped up his FIX reading by going to 64-bit Linux. If it's something worthwhile, drop a little cash to get some fancier hardware.
Hmm, but what's wrong with the read() command in C? It usually has a 2 GB limit per call, so just call it 5 times in sequence. That should be fairly fast.
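A hedged C sketch of that approach (file name and buffer size are assumptions): a plain sequential read() loop with a large buffer. Looping until read() returns 0 also sidesteps the per-call size limit, so it handles a file of any size.
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BUF_SIZE (1 << 20)   /* 1 MB buffer; tune for your disk */

int main(void)
{
    int fd = open("big.log", O_RDONLY);   /* hypothetical 10 GB file */
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(BUF_SIZE);
    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, BUF_SIZE)) > 0) {
        /* ... scan buf[0..n) for messages, update counters ... */
        total += n;
    }
    if (n < 0) perror("read");
    printf("read %lld bytes\n", total);

    free(buf);
    close(fd);
    return 0;
}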
If you are I/O bound and your file is on a single disk, then there isn't much to do. A straightforward single-threaded linear scan across the whole file is the fastest way to get the data off of the disk. Using large buffer sizes might help a bit.
If you can convince the writer of the file to stripe it across multiple disks / machines, then you could think about multithreading the reader (one thread per read head, each thread reading the data from a single stripe).
Since you said platform and language don't matter...
If you want stable performance that is as fast as the source medium allows, the only way I am aware of to do this on Windows is overlapped, non-OS-buffered, aligned sequential reads. You can probably get to some GB/s with two or three buffers; beyond that, at some point you need a ring buffer (one writer, 1+ readers) to avoid any copying.
The exact implementation depends on the driver/APIs. If any memory copying is going on in the thread (both in kernel and user mode) dealing with the I/O, then obviously the larger the buffer to copy, the more time is wasted on the copy rather than on the I/O, so the optimal buffer size depends on the firmware and driver. On Windows, good values to try are multiples of 32 KB for disk I/O.
Windows file buffering, memory mapping and all that stuff add overhead. They are only good when doing multiple reads of the same data (or both reading and writing it) in a random-access manner. So for reading a large file sequentially a single time, you don't want the OS to buffer anything or do any memcpy's. If using C#, there are also penalties for calling into the OS due to marshaling, so the interop code may need a bit of optimization unless you use C++/CLI.
Some people prefer throwing hardware at problems, but if you have more time than money, in some scenarios it's possible to optimize things to perform 100-1000x better on a single consumer-level computer than on 1000 enterprise-priced computers. The reason is that if the processing is also latency-sensitive, going beyond two cores is probably adding latency. This is why drivers can push gigabytes/s while enterprise software ends up stuck at megabytes/s by the time it's all done. Whatever reporting, business logic and such the enterprise software does can probably also be done at gigabytes/s on a two-core consumer CPU, if written as if you were back in the 80's writing a game. The most famous example I've heard of approaching their entire business logic in this manner is the LMAX forex exchange, which published some of their ring-buffer-based code, said to be inspired by network card drivers.
Forgetting all the theory, if you are happy with < 1 GB/s, one possible starting point on Windows I've found is looking at the readfile source from winimage, unless you want to dig into the SDK/driver samples. It may need some source code fixes to calculate performance correctly at SSD speeds. Experiment with buffer sizes as well.
In my experience, the switches /h (multi-threaded) and /o (overlapped, completion-port I/O), with an optimal buffer size (try 32, 64, 128 KB, etc.) and no Windows file buffering, give the best performance when reading from an SSD (cold data) while simultaneously processing (use /a for Adler processing, as otherwise it's too CPU-bound).
I seem to recall a project in which we were reading big files. Our implementation used multithreading - basically n worker threads were started at incrementing offsets of the file (0, chunk_size, 2 x chunk_size, ... (n-1) x chunk_size) and each was reading smaller chunks of information. I can't exactly recall our reasoning for this, as someone else designed the whole thing - the workers weren't the only thing to it, but that's roughly how we did it.
Hope it helps
It's not stated in the problem whether sequence really matters or not. So:
divide the file into equal parts, say 1 GB each; since you are using multiple CPUs, multiple threads won't be a problem, so read each part using a separate thread, and use RAM of capacity > 10 GB so that all the contents are held in RAM, read by multiple threads.
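A hedged C sketch of that chunked scheme (file name, thread count, and the per-chunk work are assumptions): each thread pread()s its own slice at a distinct offset, so there is no shared file position to contend on. A real version would also have to handle records that straddle chunk boundaries.
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define NTHREADS 4   /* one per CPU in the question's setup */

struct chunk { int fd; off_t off; size_t len; };

static void *scan_chunk(void *arg)
{
    struct chunk *c = arg;
    char *buf = malloc(c->len);   /* whole slice in RAM, per the answer's >10 GB assumption */
    if (!buf) return NULL;
    /* pread() takes an explicit offset, so threads share no file position */
    ssize_t n = pread(c->fd, buf, c->len, c->off);
    /* ... count messages / run statistics over buf[0..n) ... */
    (void)n;
    free(buf);
    return NULL;
}

int main(void)
{
    int fd = open("big.log", O_RDONLY);   /* hypothetical 10 GB file */
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);

    pthread_t tid[NTHREADS];
    struct chunk c[NTHREADS];
    off_t part = st.st_size / NTHREADS;
    for (int i = 0; i < NTHREADS; i++) {
        c[i].fd  = fd;
        c[i].off = i * part;
        c[i].len = (i == NTHREADS - 1) ? (size_t)(st.st_size - c[i].off)
                                       : (size_t)part;
        pthread_create(&tid[i], NULL, scan_chunk, &c[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    close(fd);
    return 0;
}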