I am developing a low-level SATA driver for an FPGA-based embedded system. The driver is running fine.
When I read sectors from the disk using the dd command, I can see that the SCSI READ(10) command (opcode 0x28) is received by my low-level driver, which I think is correct. But when I write sectors to the disk using dd, the SCSI driver first sends READ(10) commands (opcode 0x28) several times and only then a few WRITE(10) commands (opcode 0x2A).
Can someone explain why the SCSI driver sends read commands during a write operation?
Edit: During a file write operation I can see that the driver first reads (DMA mode) from LBA 0 up to some 8 sectors. Then it writes (DMA) the scatter-gather blocks, and then it reads (PIO) disk-specific information. After that it takes some seemingly random LBAs and performs several reads (DMA); finally it stops by reading device-specific data with a PIO read. This is the sequence for dd'ing a 1 KB file. The disk has no partition table and no file system (verified with the fdisk utility).
Is this normal driver behaviour? If so, doesn't it decrease the speed of the operation? Overall, file reading is faster than writing because of these extra reads during the write operation.
Thank you
It's hard to say concretely without knowing more about your system. Two possibilities come to mind:
Linux is looking for partition tables. This is likely the case if the reads are to LBA 0 and the first few logical blocks, or if the reads are to the very end of the device, where there is a secondary GPT header.
You're dd'ing a file on a filesystem, and the filesystem is reading in uncached metadata.
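As an illustration of the first possibility, the partition-table probe boils down to something like the following sketch. This is pure Python operating on a fabricated in-memory sector (the real scan lives in the kernel's block layer); it checks the 0x55AA boot signature and decodes the four 16-byte MBR entries at offset 446:

```python
import struct

# Toy MBR probe: return the list of (type, start LBA, sector count)
# for non-empty entries, or None if there is no MBR signature at all.
def parse_mbr(sector):
    if sector[510:512] != b"\x55\xaa":
        return None                       # no boot signature -> no partition table
    parts = []
    for i in range(4):
        entry = sector[446 + 16 * i : 446 + 16 * (i + 1)]
        ptype = entry[4]                  # partition type byte
        start_lba, nsectors = struct.unpack_from("<II", entry, 8)
        if ptype:
            parts.append((ptype, start_lba, nsectors))
    return parts

blank = bytes(512)                        # an unpartitioned disk, as in the question
print(parse_mbr(blank))                   # None: the kernel finds no partition table
```

On an unpartitioned disk the probe reads LBA 0, finds nothing, and gives up, which matches the reads you observed at the start of the dd write.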
Related
My company uses the Raspberry Pi 3 as an embedded controller in a product. Users don't power it off gracefully; they just flip a switch. To avoid corruption, the /boot and /root file systems are read-only. This appears to be bulletproof - we've used a test rig to "pull the plug" over and over (2000+ cycles) with no problems.
We are working on a new feature that requires local logging. To do so, we created an additional ext4 read/write partition on the SD card (we are currently using about 2GB on an 8GB card) for the log file. To minimize wear, the application buffers the log data and writes to the card only once every minute. The log file is closed between writes. Nothing else uses that partition. The log file is not written to when the application is in a state that likely indicates the user is about to shut down.
In testing this, we've found that in spite of the rather conservative approach we're using, the read/write partition is always marked as "dirty" after a reboot, frequently contains filesystem errors, and often has a damaged log file. We've also had a number of cards suffer unrecoverable errors which prevent the device from booting up.
Loss of the last set of log entries is not a problem.
Loss of the log file is undesirable but acceptable.
Damage to the /root and /boot filesystems is unacceptable, as is physical damage (other than standard NAND flash wear) to the card.
Short of adding a UPS to gracefully shut down the Pi, is there any approach that will safely allow for read/write operations?
Is there a configuration of the SD card partition "geometry" that would ensure that no two partitions overlap one flash erase block?
Just some points:
Dirty flag: I guess you are not unmounting the filesystem, right? That is one possible reason to see the dirty flag after each unclean reboot. Another (probably better) approach is to remount the filesystem read-only after writing and switch it back to read-write just before the next write.
BTW, ext4 defers writes to the disk. close() on a file doesn't mean the data has reached the disk; you need an extra fsync() or sync (see Does Linux guarantee the contents of a file is flushed to disc after close()?). So it is better to explicitly ask the system to really write the file.
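A minimal sketch of that idea in Python: fsync() the file descriptor after writing, then fsync() the containing directory so the directory entry itself is durable too. The demo writes to a throwaway temp file, not your real log partition:

```python
import os, tempfile

# close() alone does not guarantee the data is on the medium;
# fsync the file, then fsync the directory that names it.
def durable_append(path, data):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)              # force file data + metadata to the disk
    finally:
        os.close(fd)
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)             # force the directory entry to the disk
    finally:
        os.close(dfd)

# Demo on a temporary file.
tmpdir = tempfile.mkdtemp()
log = os.path.join(tmpdir, "app.log")
durable_append(log, b"sensor=42\n")
print(open(log, "rb").read())     # b'sensor=42\n'
```

Directory fsync matters for a freshly created log file: without it, the file's data may be safe while the entry naming it is lost on power cut.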
I suggest using UBIFS, JFFS2, or YAFFS2. That is the best-practice approach. I have also heard about LogFS.
Keeping the filesystem mounted and writing without delay is possible because these filesystems are designed to cope with hard shutdowns.
An overview, copied from https://superuser.com/questions/248078/choice-of-filesystem-for-gnu-linux-on-an-sd-card:
JFFS2
Includes compression and elegant wear leveling protection.
YAFFS2
The single thing that makes the difference: short mount times after a successful umount.
Implements write once property: once data is written to one block, there is no need to rewrite it. This is important, as it reduces wear.
LogFS
Not very mature, but already included in Linux kernel tree.
Supports larger filesystems than JFFS2/YAFFS2 without problems.
UBIFS
More mature than LogFS
Write caching support
On scalability: better performance than JFFS2 on large disks (see the article linked from the original answer).
ext4
If neither the driver nor the card handles wear leveling (SSDs, for example, usually do have internal wear leveling), then ext4 is not the best idea, as it is not intended for raw flash usage.
Folks, this question is for anyone who believes in Debian Linux, more precisely in Raspbian, the version that runs on the Raspberry Pi board:
As all Raspberry Pi users should know, the operating system is installed on an SD card. The problem is that the SD card is flash memory, and this type of memory supports only a limited number of write operations.
I would like to know whether Raspbian writes to the SD card when it is idle. If it does, how can I disable that?
I found this:
Tips for running Linux on a flash device by David Härdeman
If you are running your NSLU2 on a USB flash key, there are a number
of things you might want to do in order to reduce the wear and tear on
the underlying flash device (as it only supports a limited number of
writes).
Note: this document currently describes Debian etch (4.0) and needs to
be updated to Debian squeeze (6.0) and Debian wheezy (7.0). Some of
the hints may still apply, but some may not.
The ext3 filesystem per default writes metadata changes every five
seconds to disk. This can be increased by mounting the root filesystem
with the commit=N parameter which tells the kernel to delay writes to
every N seconds.
The kernel writes a new atime for each file that has been read which
generates one write for each read. This can be disabled by mounting
the filesystem with the noatime option.
Both of the above can be done by adding e.g. noatime,commit=120,... to /etc/fstab. This can also be done on an
already mounted filesystem by running the command:
mount -o remount,noatime,commit=120 /
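For reference, a root entry in /etc/fstab with those options might look like the following (the device name and filesystem type here are placeholders; adapt them to your own layout):

```
# /etc/fstab - example root entry with reduced-write options
/dev/mmcblk0p2  /  ext4  defaults,noatime,commit=120  0  1
```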
The system will run updatedb every day which creates a database of all
files on the system for use with the locate command. This will also
put some stress on the filesystem, so you might want to disable it by
adding
exit 0
early in the /etc/cron.daily/find script.
syslogd will in the default installation sync a lot of log files to
disk directly after logging some new information. You might want to
change /etc/syslog.conf so that every filename starts with a - (minus)
which means that writes are not synced immediately (which increases
the risk that some log messages are lost if your system crashes). For
example, a line such as:
kern.* /var/log/kern.log
would be changed to:
kern.* -/var/log/kern.log
You also might want to disable some classes of messages altogether by
logging them to /dev/null instead, see syslog.conf(5) for details.
In addition, syslogd likes to write -- MARK -- lines to log files
every 20 minutes to show that syslog is still running. This can be
disabled by changing SYSLOGD in /etc/default/syslogd so that it reads
SYSLOGD="-m 0"
After you've made any changes, you need to restart syslogd by running
/etc/init.d/syslogd restart
If you have a swap partition or swap file on the flash device, you
might want to move it to a different part of the disk every now and
then to make sure that different parts of the disk get hit by the
frequent writes that it can generate. For a swap file this can be done
by creating a new swap file before you remove the old one.
If you have a swap partition or swap file stored on the flash device,
you can make sure that it is used as little as possible by setting
/proc/sys/vm/swappiness to zero.
The kernel also has a setting known as laptop_mode, which makes it
delay writes to disk (initially intended to allow laptop disks to spin
down while not in use, hence the name). A number of files under
/proc/sys/vm/ controls how this works:
/proc/sys/vm/laptop_mode: How many seconds after a read should a
writeout of changed files start (this is based on the assumption that
a read will cause an otherwise spun down disk to spin up again).
/proc/sys/vm/dirty_writeback_centisecs: How often the kernel should
check if there is "dirty" (changed) data to write out to disk (in
centiseconds).
/proc/sys/vm/dirty_expire_centisecs: How old "dirty" data should be
before the kernel considers it old enough to be written to disk. It is
in general a good idea to set this to the same value as
dirty_writeback_centisecs above.
/proc/sys/vm/dirty_ratio: The maximum amount of memory (in percent) to
be used to store dirty data before the process that generates the data
will be forced to write it out. Setting this to a high value should
not be a problem as writeouts will also occur if the system is low on
memory.
/proc/sys/vm/dirty_background_ratio: The lower amount of memory (in
percent) where a writeout of dirty data to disk is allowed to stop.
This should be quite a bit lower than the above dirty_ratio to allow
the kernel to write out chunks of dirty data in one go.
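Before tuning, it can help to see the current values of the tunables described above. A small sketch (reading /proc/sys/vm requires Linux; missing files simply yield None):

```python
import os

# Read a single /proc/sys/vm tunable as an int, or None if unavailable.
def read_vm_tunable(name, procdir="/proc/sys/vm"):
    try:
        with open(os.path.join(procdir, name)) as f:
            return int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None

for name in ("laptop_mode", "dirty_writeback_centisecs",
             "dirty_expire_centisecs", "dirty_ratio",
             "dirty_background_ratio", "swappiness"):
    print(name, "=", read_vm_tunable(name))
```

Writing new values works the same way in reverse (open the file for writing as root), which is all the init script mentioned below really does.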
All of the above kernel parameters can be tuned by using a custom init
script, such as this example script. Store it to e.g.
/etc/init.d/kernel-params, make it executable with
chmod a+x /etc/init.d/kernel-params
and make sure it is executed by running
update-rc.d kernel-params defaults
Note: Most of these settings reduce the number of writes to disk by
increasing memory usage. This increases the risk for out of memory
situations (which can trigger the dreaded OOM killer in the kernel).
This can even happen when there is free memory available (for example
when the kernel needs to allocate more than one contiguous page and
there are only fragmented free pages available).
As with any tweaks, you are advised to keep a close eye on the amount
of free memory and adapt the tweaks (e.g. by using less aggressive
caching and increasing the swappiness) depending on your workload.
This article has been contributed by David Härdeman
http://www.cyrius.com/debian/nslu2/linux-on-flash/
Does anyone have some more tips?
I have been using various Raspberry Pi setups and haven't had SD card troubles to date (fingers crossed). That being said, there is a bit of evidence for SD card lifespan-related issues.
A quick google search does show a few more tips though:
Bigger is better - a larger card reduces the load on any specific section
Write temporary files to RAM
Only store the boot partition on the SD card and keep the OS on a USB drive
(http://www.makeuseof.com/tag/extend-life-raspberry-pis-sd-card/)
Anyway, it'll be interesting to hear from someone who has a raspberry cluster or some such on their SD card lifespans!
(https://resin.io/blog/what-would-you-do-with-a-120-raspberry-pi-cluster/)
You can put files in tmpfs after boot and write them back before shutdown using the script from http://www.observium.org/wiki/Persistent_RAM_disk_RRD_storage
But it can be detrimental:
tmpfs will lose all changes on a power outage, so you must use a UPS;
the Raspberry Pi's RAM is far from big, so don't waste it.
If your Pi often writes small files, this can work for you.
I need to transfer a huge amount of data between Java and C++ programs under Linux (CentOS). Performance is the first concern.
What will be the best choice? RAMDisk (/dev/shm/) or local socket?
A socket is fastest because the other end can start processing the data (on a separate cpu core) before you have finished sending data.
Say you're sending 100 KB of data; the other end can begin processing as soon as it receives a couple of kilobytes, and by the time all 100 KB has been sent, it has probably finished processing 90 KB or thereabouts, so it only has about 10 KB left.
With a RAM disk, by contrast, you have to write the entire 100 KB before the reader can even start processing. Overlapping the transfer with the processing therefore makes a socket up to about twice as fast as a RAM disk, assuming both ends do about the same amount of work.
Say it takes 1 millisecond to write 100 KB to a RAM disk and then 1 millisecond to process it (2 ms total). With a socket it still takes 1 millisecond to send the data, but only about 0.1 millisecond of processing remains after the last byte arrives (about 1.1 ms total).
The larger the amount of data being sent, the bigger the absolute gain for sockets: 10 seconds to send all the data, and only another 0.1 millisecond to finish processing after the last byte has been sent.
However, a RAM disk is easier to work with. Sockets use streams of data, which is more time consuming in terms of writing the code and debugging/testing it.
Also, don't assume you need a ram disk. Depending on how the operating system has been configured writing 100MB to a spinning platter hard drive might simply write it to a RAM cache and then put it on the hard drive later on. You can read it from the temporary RAM cache immediately without waiting for the data to be written to the HDD. Always test before making performance assumptions. Do not assume a HDD is slower than RAM, because it might be optimised out for you silently.
The Mac I'm typing this on, which is Unix just like CentOS, currently has about 8 GB of RAM dedicated to holding copies of files it guesses I'm going to read at some point in the near future. I didn't have to create a RAM disk manually; it just put them in RAM heuristically. CentOS does the same sort of thing, so you have to test to see how fast it actually is.
But sockets are definitely the fastest option, since you do not need to write all the data to start processing it.
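The streaming overlap described above can be sketched with a Unix socket pair (Python is used here purely for illustration; the same pattern applies between your Java and C++ processes):

```python
import os, socket, threading

# Stream 100 KB through a socket pair; the consumer thread starts
# "processing" (here, just counting bytes) as soon as data arrives,
# instead of waiting for the whole payload to be written first.
PAYLOAD = os.urandom(100 * 1024)

def consume(sock, out):
    total = 0
    while True:
        chunk = sock.recv(4096)      # processing can begin immediately
        if not chunk:                # empty read -> peer closed, EOF
            break
        total += len(chunk)          # stand-in for real processing work
    out.append(total)

a, b = socket.socketpair()
out = []
t = threading.Thread(target=consume, args=(b, out))
t.start()
a.sendall(PAYLOAD)
a.close()                            # signals EOF to the consumer
t.join()
b.close()
print(out[0])                        # 102400
```

With a RAM disk the consumer could not have started until the final write completed; here it was already running while bytes were still in flight.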
I am trying to understand the OS bootstrapping process. Some things are not clear to me.
One of them is:
How does the bootstrap code in the Volume Boot Record know the absolute LBA of sector 0 of the partition where the Volume Boot Record resides?
Within the VBR is a structure called a BIOS Parameter Block, named after the BIOS, the bottom-half of the traditional MS-DOS kernel structure. Within the BIOS Parameter Block is a field denoting the number of hidden sectors between the partition and the (MBR-style) partition table entry that encloses it. The VBR code simply reads that field out of itself and adds it to the Volume-relative Block Address to produce the LBA to read from the disc.
This is why it is impossible to boot operating systems such as Windows NT, MS/PC/DR-DOS, and OS/2 from secondary partitions directly, without assistance. In primary partitions, the BPB field is simply the start LBA of the start of the volume, because the partition table that it is relative to is the primary MBR in block #0 of the disc, and everything works. In secondary partitions, because of a quirk of MS-DOS version 3 that everyone has had to remain compatible with ever since, the BPB field is only the offset of the "logical drive" within the "extended partition" containing it, and the boot code doesn't work because it looks for the rest of the boot volume in completely the wrong place on the disc.
Boot managers provide assistance by fixing up the BPB on the fly. The VBR code of course reads the in-memory copy of itself, not the on-disc copy. So boot managers simply adjust the field of the BPB for secondary partitions to the correct absolute value, as they are loading the VBR into memory. Then everything works.
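The "reads that field out of itself" step can be illustrated with a toy snippet. The 32-bit hidden-sectors count sits at offset 0x1C in the FAT-style BPB layout; the sector below is fabricated in memory, not read from a real disc:

```python
import struct

# Extract the BPB "hidden sectors" field (offset 0x1C, little-endian
# 32-bit in the FAT BPB layout) from a 512-byte VBR image.
def hidden_sectors(vbr):
    return struct.unpack_from("<I", vbr, 0x1C)[0]

# Fake VBR for a partition starting at LBA 2048.
vbr = bytearray(512)
struct.pack_into("<I", vbr, 0x1C, 2048)

# Volume-relative block 5 -> absolute LBA, exactly as the VBR code does.
volume_relative_lba = 5
absolute_lba = hidden_sectors(vbr) + volume_relative_lba
print(absolute_lba)                  # 2053
```

A boot manager "fixing up the BPB on the fly" is simply rewriting that one field in the in-memory copy before jumping to the VBR code.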
I need to figure out the hard drive name for a Solaris box, and it is not clear to me what the device name is. On Linux it would be something like /dev/hda or /dev/sda, but on Solaris I am getting a bit lost in the partitions and what the device is called. I think entries like /dev/rdsk/c0t0d0s0 are the partitions; how is the whole hard drive referenced?
/dev/rdsk/c0t0d0s0 means Controller 0, SCSI target (ID) 0, disk 0; the s0 means slice (partition) 0.
Typically, by convention, s2 is the entire disk. This partition overlaps with the other partitions.
prtvtoc /dev/rdsk/c0t0d0s0 will show you the partition table for the disk, to make sure.
If you run Solaris on non-SPARC hardware and don't use EFI, the whole hard drive is not c0t0d0s2 but c0t0d0p0; s2 is in that case just the Solaris primary partition.
What do you want to do to the whole disk? Look at the EXAMPLES section of the man page for the command in question to see how much of a disk name the command requires.
zpool doesn't require a partition, as in: c0t0d0
newfs does: c0t0d0s0
dd would use the whole disk partition: c0t0d0s2
Note: s2 as the entire disk is just a convention. A root user can use the Solaris format command and change the extent of any of the partitions.
The comments about slice 2 are only correct for drives with an SMI label.
If the drive is greater than 1TB, or if the drive has been used for ZFS, the drive will have an EFI label and slice 2 will NOT be the entire disk. With an EFI label, slice 2 is "just another slice". You would then refer to the whole disk by using the device name without a slice, e.g. c0t0d0.
There are two types of disk label: one is SMI (VTOC), the other is GPT (EFI).
On the x86 platform, when the disk is SMI-labeled (the default behavior):
cXtXdXp0 is the whole physical disk.
cXtXdXp1-cXtXdXp4 are the primary partitions, including the Solaris partition.
cXtXdXs0-cXtXdXs8 are the partitions (slices) of the active Solaris partition.
cXtXdXs2 is the whole active Solaris partition, which may not be the whole disk.
Hope I am clear.
/Meng
C0 - Controller
T0 - Target
D0 - Disk
S0 - Slice
c0t0d0s2 is, by convention, the entire drive. The breakdown is:
/dev/[r]dsk/cCtAdDsS
...where C is the controller number, A is the SCSI address, D is the disk number, and S is the "slice". Slice 2 conventionally covers the whole disk; the other slices are the partitions.
See this for more info.
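The naming scheme above is regular enough to parse mechanically. A hypothetical helper (the function name and dict keys are my own, purely for illustration):

```python
import re

# Toy parser for Solaris-style device names: cCtTdD optionally followed
# by sS (a slice) or pP (an fdisk partition); the tT part may be absent.
PATTERN = re.compile(r"c(\d+)(?:t(\d+))?d(\d+)(?:([sp])(\d+))?$")

def parse_dev(name):
    m = PATTERN.match(name.rsplit("/", 1)[-1])   # drop /dev/rdsk/ prefix
    if not m:
        raise ValueError("not a cCtTdD[sS|pP] name: " + name)
    ctrl, target, disk, kind, num = m.groups()
    return {
        "controller": int(ctrl),
        "target": int(target) if target is not None else None,
        "disk": int(disk),
        "slice": int(num) if kind == "s" else None,
        "fdisk_partition": int(num) if kind == "p" else None,
    }

print(parse_dev("/dev/rdsk/c0t0d0s2"))
```

For example, c0t0d0s2 yields controller 0, target 0, disk 0, slice 2; a name like c1d0 (no target, no slice) is also valid on some hardware.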
cXtYdZs2 is the whole drive. period.