About RAM & Secondary Storage - operating-system

Why is RAM size always smaller than secondary storage (HDD/SSD)? If you observe any device, you will end up asking the same question.

The primary reason is price. For example (depending a lot on type, etc.), RAM is currently around $4 per GiB and "rotating disk" HDD is around $0.04 per GiB, so RAM costs about 100 times as much per GiB.
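As a back-of-the-envelope illustration of that gap (a sketch using the time-varying prices quoted above, which will of course be different when you read this):

```python
# Back-of-the-envelope comparison using the (time-varying) prices quoted above.
RAM_PRICE_PER_GIB = 4.00   # USD per GiB, assumed figure from the text
HDD_PRICE_PER_GIB = 0.04   # USD per GiB, assumed figure from the text

ratio = RAM_PRICE_PER_GIB / HDD_PRICE_PER_GIB
print(f"RAM costs about {ratio:.0f}x as much per GiB as rotating-disk HDD")

# A typical desktop split: 16 GiB of RAM versus a 1 TiB (1024 GiB) HDD
print(f"16 GiB RAM ~ ${16 * RAM_PRICE_PER_GIB:.2f}")
print(f"1 TiB HDD  ~ ${1024 * HDD_PRICE_PER_GIB:.2f}")
```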
Another reason is that HDD/SSD is persistent (the data remains when you turn the power off), and the amount of data you want to keep when power is turned off is typically much larger than the amount of data you don't want to keep. A special case of this is when you put a computer into a "hibernate" state, where the OS stores everything that was in RAM onto persistent storage and turns the power off, then loads it all back into RAM when power is turned on again so that everything looks the same; for this to work, the amount of persistent storage needs to be larger than the amount of RAM.
Another (much smaller) reason is speed. It's not enough to be able to store data, you have to be able to access it too, and the speed of accessing data gets worse as the amount of storage increases. This holds true for all kinds of storage for different reasons (and is why you also have L1, L2, L3 caches ranging from "very small and very fast" to "larger and slower"). For RAM it's caused by the number of address lines and the size of "row select" circuitry. For HDD it's caused by seek times. For humans getting the milk out of a refrigerator it's "search time + movement speed" (faster to get the milk out of a tiny bar fridge than to walk around inside a large industrial walk-in refrigerator).
However, there are special cases (there are always special cases). For example, you might have a computer that boots from the network and then uses the network for persistent storage, with literally no secondary storage in the computer at all. Another special case is small embedded systems, where RAM is often larger than persistent storage.

Related

Virtual memory location on hard-disk

I was reading about paging and swap space and I'm a little confused about how much space (and where) on the hard disk is used to page out / swap out frames. Let's think of the following scenario:
We have a single process which progressively uses newer pages in virtual memory. Each time for a new page, we allocate a frame in physical memory.
But after a while, frames in the physical memory get exhausted and we choose a victim frame to be removed from RAM.
I have the following doubts:
Does the victim frame get swapped out to the swap space or paged out to some different location (apart from swap-space) on the hard-disk?
From what I've seen, swap space is usually around 1-2x size of RAM, so does this mean a process can use only RAM + swap-space amount of memory in total? Or would it be more than that and limited by the size of virtual memory?
Does the victim frame get swapped out to the swap space or paged out to some different location (apart from swap-space) on the hard-disk?
It gets swapped out to the swap space; that is exactly what swap space is for. A system without swap space cannot use this feature of virtual memory, though it still gets the other benefits, such as avoiding external fragmentation and providing memory protection.
From what I've seen, swap space is usually around 1-2x size of RAM, so does this mean a process can use only RAM + swap-space amount of memory in total? Or would it be more than that and limited by the size of virtual memory?
The total memory available to a process will be RAM + swap space. Imagine a computer with 1GB of RAM + 1GB of swap space and a process which requires 3GB. The process needs more virtual memory than is available. This will not work, because eventually the process will touch all of that code/data; since the process image is bigger than RAM + swap space, the computer will simply not have enough space to hold the process, and the kernel will crash the process.
There are really two options here: you either keep a part of the process in RAM directly, or you store it in the swap space. If there is no room in either of these for your process, the kernel has nowhere else to go. It thus crashes the process.
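As a rough way to see that ceiling on a real machine, here is a minimal sketch (assuming Linux, where /proc/meminfo is available) that reads total RAM and swap from the kernel's own accounting:

```python
# Minimal sketch (assumes Linux): read total RAM and swap from /proc/meminfo.
# Their sum is roughly the ceiling on how much memory processes can actually
# keep resident, as described above.

def read_meminfo():
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            values[key] = int(rest.split()[0])  # values are reported in KiB
    return values

info = read_meminfo()
ram_gib = info["MemTotal"] / (1024 * 1024)
swap_gib = info["SwapTotal"] / (1024 * 1024)
print(f"RAM:  {ram_gib:.2f} GiB")
print(f"Swap: {swap_gib:.2f} GiB")
print(f"Approximate ceiling for resident memory: {ram_gib + swap_gib:.2f} GiB")
```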

NVMe SSD's bandwidth decreases when increasing the number of I/O queues

As far as I have learned from all the relevant articles about NVMe SSDs, one of NVMe SSDs' benefits is multiple queues. Leveraging multiple NVMe I/O queues, NVMe bandwidth can be greatly utilized.
However, what I have found from my own experiment does not agree with that.
I want to do parallel 4k-granularity sequential reads from an NVMe SSD. I'm using Samsung 970 EVO Plus 250GB. I used FIO to benchmark the SSD. The command I used is:
fio --size=1000m --directory=/home/xxx/fio_test/ --ioengine=libaio --direct=1 --name=4kseqread --bs=4k --iodepth=64 --rw=read --numjobs 1/2/4 --group_reporting
And below is what I got testing 1/2/4 parallel sequential reads:
numjobs=1: 1008.7MB/s
numjobs=2: 927 MB/s
numjobs=4: 580 MB/s
Even if it does not increase bandwidth, I expected that increasing the number of I/O queues would at least keep the same bandwidth as the single-queue performance. The bandwidth decrease is a little counter-intuitive. What are the possible reasons for the decrease?
Thank you.
I would like to highlight 3 reasons why you may see the issue:
Effective Queue Depth is too high,
Capacity under the test is limited to 1GB only,
Drive is not preconditioned.
First, the parameter --iodepth=X is specified per job. This means that in your last experiment (--iodepth=64 and --numjobs=4) the effective queue depth is 4x64=256. This may be too high for your drive. Based on the vendor specification of your 250GB drive, 4KB random read should show 250 KIOPS (1GB/s) at a queue depth of 32. In other words, the vendor is stating that QD32 is about optimal for this drive to reach its best performance. If we increase the QD further, commands simply start aggregating and waiting in the submission queue. That does not improve performance; on the contrary, it starts to consume system resources (CPU, memory) and degrades throughput.
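To make the effective queue depth point concrete, here is a tiny worked calculation (a sketch; QD32 is the figure quoted from the vendor specification above):

```python
# Effective queue depth in fio is the per-job --iodepth multiplied by --numjobs.
IODEPTH = 64        # --iodepth used in the question's fio command
OPTIMAL_QD = 32     # queue depth the vendor spec quotes for peak 4KB random read

for numjobs in (1, 2, 4):
    effective_qd = IODEPTH * numjobs
    print(f"numjobs={numjobs}: effective QD = {IODEPTH} x {numjobs} = {effective_qd} "
          f"({effective_qd // OPTIMAL_QD}x the vendor's optimal QD)")
```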
Second, limiting the capacity under test to such a small range (1GB) can cause a lot of collisions inside the SSD. This is the situation where reads hit the same physical media read unit (aka die, aka LUN), so new reads have to wait for previous ones to complete. Increasing the tested capacity to the entire drive, or at least to 50-100GB, should minimize the collisions.
Third, in order to get performance numbers matching the specification, the drive needs to be preconditioned accordingly. For measuring sequential and random reads, it is best to use a full-drive sequential precondition. The command below performs a 128KB sequential write at QD32 (iodepth 4 x 8 jobs) across the entire drive capacity.
fio --size=100% --ioengine=libaio --direct=1 --name=128KB_SEQ_WRITE_QD32 --bs=128k --iodepth=4 --rw=write --numjobs=8

Redshift free storage doesn't increase after adding 2 nodes

My 4-node (dc2.large, 160 GB storage per node) Redshift cluster had around 75% of its storage full, so I added 2 more nodes, for a total of 6 nodes, expecting the disk usage to drop to around 50%. But after making the change, the disk usage still remains at 75% (even after a few days and after VACUUM).
75% of 4*160 = 480 GB of data
6*160 = 960 GB of available storage in the new configuration, which means usage should have dropped to 480/960, i.e. somewhere close to 50%.
The image shows the disk space percentage before and after adding two nodes.
I also checked whether there are any large tables using DISTSTYLE ALL, which causes data replication across the nodes, but the tables that do are very small compared to the total storage capacity, so I don't think they have any significant impact on the storage.
What can I do here to reduce the storage usage as I don't want to add more nodes and then again land up in the same situation?
It sounds like your tables are affected by the minimum table size. It may be counter-intuitive but you can often reduce the size of small tables by converting them to DISTSTYLE ALL.
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
Can you clarify what distribution style you are using for some of the bigger tables?
If you are not specifying a distribution style then Redshift will automatically pick one (see here), and it's possible that it will choose ALL distribution at first and only switch to EVEN or KEY distribution once you reach a certain disk usage percentage.
Also, have you run the ANALYZE command to make sure the table stats are up to date?
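If it helps to check this, here is a minimal sketch for listing the largest tables with their distribution style; it assumes access to the standard SVV_TABLE_INFO system view and uses psycopg2 purely as an example driver (the connection details are placeholders):

```python
# Sketch only: list the largest tables and their distribution style.
# Connection parameters below are placeholders; adjust them to your cluster.
import psycopg2

conn = psycopg2.connect(
    host="your-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="dev", user="awsuser", password="...",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT "schema", "table", diststyle, size, tbl_rows
        FROM svv_table_info
        ORDER BY size DESC
        LIMIT 20;
    """)
    for schema, table, diststyle, size_mb, rows in cur.fetchall():
        # "size" is reported in 1 MB blocks
        print(f"{schema}.{table}: {diststyle}, {size_mb} MB, {rows} rows")
```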

Why is swap not good when using a SSD?

On Digitalocean I came up with this message when I want to add swap:
Although swap is generally recommended for systems utilizing traditional spinning hard drives, using swap with SSDs can cause issues with hardware degradation over time. Due to this consideration, we do not recommend enabling swap on DigitalOcean or any other provider that utilizes SSD storage. Doing so can impact the reliability of the underlying hardware for you and your neighbors. This guide is provided as reference for users who may have spinning disk systems elsewhere.
If you need to improve the performance of your server on DigitalOcean, we recommend upgrading your Droplet. This will lead to better results in general and will decrease the likelihood of contributing to hardware issues that can affect your service.
Why is that? I thought swap was necessary for running a stable server (not running into memory issues).
I believe the following is your answer:
Early SSDs had a reputation for failing after fewer writes than HDDs. If the swap was used often, then the SSD may fail sooner. This might be why you heard it could be bad to use an SSD for swap.
Modern SSDs don't have this issue, and they should not fail any faster than a comparable HDD. Placing swap on an SSD will result in better performance than placing it on an HDD due to its faster speeds.
I believe this is referring to the fact that SSDs have a relatively limited lifetime, measured in the number of times data can be written to each memory location. Although that number has grown large enough that using an SSD as a storage drive is no longer a concern, swap, acting as a backup for RAM, can potentially be written to quite frequently, thus reducing the overall life of the SSD.
SSD endurance is measured in so-called DWPD units: Drive full Writes Per Day. DWPD requirements are very different for the mobile, client, and enterprise storage market segments. SSD vendors usually state the warranty as, for example, 0.8 DWPD / 3 years or 3.0 DWPD / 5 years. The first example means that writing 80% of the drive capacity every single day will result in a 3-year lifetime. Technically, you could wear out a 480GB drive (say, one with a 1 DWPD / 3 years warranty) within about 12 days by performing non-stop writes at 500 MB/s.
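The 12-day figure follows directly from those warranty numbers; here is a quick sketch of the arithmetic, using the 480GB / 1 DWPD / 3 years / 500 MB/s figures quoted above:

```python
# Rough endurance arithmetic for the example above (figures are the ones quoted
# in the text; real drives and warranties differ).
CAPACITY_GB = 480
DWPD = 1.0               # drive writes per day
WARRANTY_YEARS = 3
WRITE_SPEED_MB_S = 500   # assumed sustained write speed

total_endurance_gb = CAPACITY_GB * DWPD * 365 * WARRANTY_YEARS   # ~525,600 GB
gb_written_per_day = WRITE_SPEED_MB_S * 86_400 / 1000            # ~43,200 GB/day

days_to_exhaust = total_endurance_gb / gb_written_per_day
print(f"Rated endurance: ~{total_endurance_gb / 1000:.0f} TB written")
print(f"Non-stop writes at {WRITE_SPEED_MB_S} MB/s exhaust it in ~{days_to_exhaust:.0f} days")
```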
SSDs show much higher throughput than HDDs, but at the same time a rather low endurance level, partly because of the physical structure of the media and the mapping involved. For example, when 1GB of user data is written to an HDD, the physical media internally receives around 10% more data (metadata, error-protection data, etc.). The ratio between the amount of data written internally and the amount of data received from the host is called the Write Amplification Factor (WAF). By comparison, an SSD may need to write 4 times more data than it receives from the host. Pure random access is the worst scenario: writing 1GB of host data can result in 4GB being written to the internal flash media. With purely sequential writes, the WAF of an SSD is close to 1.0, as for HDDs.
Enabling system swap and using it intensively (probably due to a DRAM shortage) will generate more random access to the SSD, so endurance will degrade faster than with swap disabled. Unless you are running an enterprise system with non-stop I/O traffic to the SSD, I would not expect enabling swap to affect SSD endurance much. You can always monitor the SMART health parameter called "SSD Life Left"; watching how it changes over time with and without swap enabled will help you make a decision.

SD card write limit - data logging

I want to track/record when my system (a Raspberry Pi) was shut down, usually due to abrupt power loss.
I want to do it by recording a heartbeat every 10 minutes to an SD card, so every 10 minutes it would write the current time/date to a file on the SD card. Would that damage the SD card in the long run?
If there are only 100k write cycles, it would develop a bad block in a couple of years. But I've read there's circuitry to prevent that; would it prevent the bad block? Would it be safer to distribute the log across several blocks?
Thanks
The general answer to this question is a strong "it depends". (The practical answer is that what you already have is fine; if your file system parameters are not wrong, you have a large margin in this case.) It depends on the following:
SD card type (SLC/MLC)
SD card controller (wear levelling)
SD card size
file system
luck
If we take a look at a flash chip, it is organised into sectors. A sector is an area which can be completely erased (actually reset to a state with only 1's), typically 128 KiB for SD cards. Zeros can be written bit-by-bit, but the only way to write ones is to erase the sector.
The number of sector erases is limited. The erase operation will take longer each time it is performed on the same sector, and there is more uncertainty in the values written to each cell. The write limit given to a card is really the number of erases for a single sector.
In order to avoid reaching this limit too fast, the SD card has a controller which takes care of wear levelling. The basic idea is that, transparently to the user, the card changes which physical sectors are used: if you write to the same memory position, it may be mapped to different sectors at different times. Typically the card keeps a list of empty sectors and, whenever one is needed, takes the one which has been used the least.
There are other algorithms, as well. The controller may track sector erase times or errors occurring on a sector. Unfortunately, the card manufacturers do not usually tell too much about the exact algorithms, but for an overview, see:
http://en.wikipedia.org/wiki/Wear_leveling
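As a toy illustration of the "least-used free sector" idea described above (a sketch only; real controllers use far more elaborate, and usually undocumented, algorithms):

```python
# Toy wear-levelling sketch: logical sectors are remapped so that each new
# write lands on the free physical sector with the fewest erases so far.
class ToyWearLeveller:
    def __init__(self, num_physical_sectors):
        self.erase_counts = [0] * num_physical_sectors
        self.mapping = {}          # logical sector -> physical sector
        self.free = set(range(num_physical_sectors))

    def write(self, logical_sector):
        # Release the old physical sector (its data is now stale).
        old = self.mapping.get(logical_sector)
        if old is not None:
            self.free.add(old)
        # Pick the least-erased free sector, erase it, and map the logical sector to it.
        target = min(self.free, key=lambda s: self.erase_counts[s])
        self.free.remove(target)
        self.erase_counts[target] += 1
        self.mapping[logical_sector] = target

card = ToyWearLeveller(num_physical_sectors=8)
for _ in range(1000):
    card.write(logical_sector=0)   # keep rewriting the "same" logical sector
print(card.erase_counts)           # erases end up spread across all physical sectors
```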
There are different types of flash chips available. SLC chips store only one bit per memory cell (it is either 0 or 1), while MLC cells store two or three bits. Naturally, MLC chips are more sensitive to ageing; three-bit (eight-level) cells may not endure more than 1000 writes. So, if you need reliability, take an SLC card despite its higher price.
As the wear levelling distributes the wear across the card, bigger cards endure more sector erases than small cards, as they have more sectors. In principle, a 4 GiB card with 100 000 write cycles will be able to carry 400 TB of data during its lifetime.
But to make things more complicated, the file system has a lot to do with this. When a small piece of data is written onto a disk, a lot of different things happen. At least the data is appended to the file, and the associated directory information (file size) is changed. With a typical file system this means at least two 4 KiB block writes, of which one may be just an append (no requirement for an erase). But a lot of other things may happen: write to a journal, a block becoming full, etc.
There are file systems which have been tuned for use with flash devices (JFFS2 being the most common). They are all, as far as I know, optimised for raw flash: they take care of wear levelling and use bit- or octet-level atomic operations. I am not aware of any file systems optimised for SD cards. (Maybe someone with academic interests could create one that takes the wear-levelling systems of the cards into account. That would result in a nice paper or even a few.) Fortunately, the usual file systems can be tuned to be friendlier to the SD card (faster, less wear and tear) by tweaking file system parameters.
Now that there are these two layers on top of the physical flash, it is almost impossible to track how many erases have actually been performed. One of the layers is very complicated (the file system), the other (wear levelling) completely opaque.
So, we can just make some rough estimates. Let's guess that a small write invalidates two 4 KiB blocks on average. Then logging every 10 minutes consumes one 128 KiB erase sector every 160 minutes. If the card is an 8 GiB card, it has around 64k sectors, so the whole card is cycled through once every 20 years. If the card endures 1000 write cycles, it will be good for 20,000 years...
The calculation above assumes perfect wear levelling and a very efficient file system. However, a safety factor of 1 000 should be enough.
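The estimate above can be reproduced in a few lines (a sketch that uses exactly the stated assumptions: two 4 KiB blocks invalidated per log write, 128 KiB erase sectors, an 8 GiB card, 1000 erase cycles, perfect wear levelling):

```python
# Reproduces the back-of-the-envelope estimate above; all inputs are the
# stated assumptions, not measured values.
KIB = 1024
LOG_INTERVAL_MIN = 10
BLOCKS_PER_WRITE = 2              # assume each log write invalidates two 4 KiB blocks
BLOCK_SIZE = 4 * KIB
SECTOR_SIZE = 128 * KIB           # typical SD erase sector
CARD_SIZE = 8 * KIB**3            # 8 GiB
ERASE_CYCLES = 1000               # pessimistic figure for three-bit cells

writes_per_sector = SECTOR_SIZE // (BLOCKS_PER_WRITE * BLOCK_SIZE)   # 16 log writes
minutes_per_sector = writes_per_sector * LOG_INTERVAL_MIN            # 160 minutes
sectors = CARD_SIZE // SECTOR_SIZE                                   # 65536 sectors
years_per_full_pass = sectors * minutes_per_sector / (60 * 24 * 365)

print(f"One pass over the whole card: ~{years_per_full_pass:.0f} years")
print(f"At {ERASE_CYCLES} erase cycles: ~{years_per_full_pass * ERASE_CYCLES:.0f} years")
```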
Of course, this can be spoiled quite easily. One of the easiest ways is to forget to mount the disk with the noatime attribute. Then the file system will update file access times, which may result in a write every time a file is accessed (even just read). Another is the OS swapping virtual memory onto the card.
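A quick way to check for that atime pitfall (a minimal sketch, assuming Linux and a hypothetical mount point such as /media/sdcard):

```python
# Sketch: warn if the card's mount point lacks noatime/relatime (reads /proc/mounts).
MOUNT_POINT = "/media/sdcard"   # hypothetical mount point; adjust to your setup

with open("/proc/mounts") as f:
    for line in f:
        device, mountpoint, fstype, options = line.split()[:4]
        if mountpoint == MOUNT_POINT:
            opts = options.split(",")
            if "noatime" in opts or "relatime" in opts:
                print(f"{mountpoint} is mounted with: {options}")
            else:
                print(f"{mountpoint}: consider remounting with noatime to avoid "
                      f"a metadata write on every file access")
```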
Last but not least of the factors is luck. Modern SD cards have the unfortunate tendency to die from other causes. The number of lemons with even quite well-known manufacturers is not very small. If you kill a card, it is not necessarily because of the wear limit. If the card is worn out, it is still readable. If it is completely dead, it has died of something else (static electricity, small fracture somewhere).