Why is swap not good when using a SSD?

Why is swap not good when using a SSD? - server

On Digitalocean I came up with this message when I want to add swap:
Although swap is generally recommended for systems utilizing traditional spinning hard drives, using swap with SSDs can cause issues with hardware degradation over time. Due to this consideration, we do not recommend enabling swap on DigitalOcean or any other provider that utilizes SSD storage. Doing so can impact the reliability of the underlying hardware for you and your neighbors. This guide is provided as reference for users who may have spinning disk systems elsewhere.
If you need to improve the performance of your server on DigitalOcean, we recommend upgrading your Droplet. This will lead to better results in general and will decrease the likelihood of contributing to hardware issues that can affect your service.
Why is that? I thought it was necessary for creating a stable server (not running into memory issues)

I believe that here's your answer.
Early SSDs had a reputation for failing after fewer writes than HDDs. If the swap was used often, then the SSD may fail sooner. This might be why you heard it could be bad to use an SSD for swap.
Modern SSDs don't have this issue, and they should not fail any faster than a comparable HDD. Placing swap on an SSD will result in better performance than placing it on an HDD due to its faster speeds.

I believe this is referring to the fact that SSDs have a relatively limited lifetime measured in number of times data is written in each memory location. Although such number has gotten big enough that using SSD as storage drives should not be a concern anymore, Swap memory, as a backup for ram memory, can potentially be written on pretty frequently, thus reducing the overall life of the SSD.

SSD Endurance is measured in so called DWPD units. DWPD stands for Drive full Writes Per Day. For Mobile, Client and Enterprise Storage Market segments DWPD requirements are very different. SSD Vendors usually state warranty as, for example, 0.8 DWPD / 3 years or 3.0 DWPD / 5 years. First example means that writing 80% of Drive Capacity every single day will result into 3 years life-time. Technically you can kill your 480GB Drive (let's say with 1 DWPD / 3 years warranty) within 12 days if to perform non-stop write access at the speed of 500 MB/s.
SSDs show much higher throughput on the one side if to compare with HDDs, but at the same time quite low endurance level. Partially it is due to the media physical structure and mapping. For example, when writing 1GB of user data to the HDD drive - internally physical media will receive around 10% more data (meta data, error protection data, etc.). Ratio between Host Data Amount and Internal Data Amount is called Write Amplification Factor (WAF). In comparison SSD may need to write 4 times more data than received from Host. Pure Random access is the worst scenario, when writing 1GB of Host Data will result into writing 4GB of data to the Internal Flash Media. If to perform only sequential write access WAF for SSDs will be close to 1.0, like for HDDs.
Enabling System swap and its intensive usage (probably due to DRAM shortage) will generate more Random access to the SSD. Endurance will degrade quicker if to compare with disable swap. Unless you are running Enterprise System with non-stop IO traffic to the SSD, I would not expect Swap enablement to affect SSD endurance much. You can always monitor SSD SMART Health parameter called - SSD Life Left. How it is changing in dynamic with/without swap enabled will help to make a decision.

Related

About Ram & Secondary Storage

Why do Ram size is always smaller than Secondary Storage(HDD/SSD)? If you observe any device you will get the same question

Why do Ram size is always smaller than Secondary Storage(HDD/SSD)? If you observe any device you will get the same question
The primary reason is price. For example (depending a lot on type, etc) currently RAM is around $4 per GiB and "rotating disk" HDD is $0.04 per GiB, so RAM costs about 100 times as much per GIB.
Another reason is that HDD/SSD is persistent (the data remains when you turn the power off); and the amount of data you want to keep when power is turned off is typically much larger than amount of data you don't want to keep when power is turned off. A special case for this is when you put a computer into a "hibernate" state (where the OS stores everything in RAM and turns power off, and then when power is turned on again it loads everything back into RAM so that it looks the same; where the amount of persistent storage needs to be larger than the amount of RAM).
Another (much smaller) reason is speed. It's not enough to be able to store data, you have to be able to access it too, and the speed of accessing data gets worse as the amount of storage increases. This holds true for all kinds of storage for different reasons (and is why you also have L1, L2, L3 caches ranging from "very small and very fast" to "larger and slower"). For RAM it's caused by the number of address lines and the size of "row select" circuitry. For HDD it's caused by seek times. For humans getting the milk out of a refrigerator it's "search time + movement speed" (faster to get the milk out of a tiny bar fridge than to walk around inside a large industrial walk-in refrigerator).
However; there are special cases (there's always special cases). For example, you might have a computer that boots from network and then uses the network for persistent storage; where there's literally no secondary storage in the computer at all. Another special case is small embedded systems where RAM is often larger than persistent storage.

System Design - how to Pick CPU, Memory for an application

I am practicing System Design concepts and I am not clear what configuration (cpu, memory, disk storage) to pick for an application instance? Also, how many instances are needed (assuming you are running your application on Kubernetes cluster)
For Back of the envelope calculation ,I saw examples of calculating tps for read and write calls, calculate bandwidth needs, database storage needs etc. but I have not seen how to determine cpu, memory needs and how many instances are enough. Is there a procedure that guides to solve this problem?
My hunch says that we pick small to medium sized server instance (if we use cloud provider like AWS) and run stress tests for calculated TPS and see CPU and memory usage and see if we need to increase or decrease server configuration based on results?
I would greatly appreciate any inputs you may have.

I am not clear what configuration (cpu, memory, disk storage) to pick for an application instance? Also, how many instances are needed (assuming you are running your application on Kubernetes cluster)
This is mostly a question about economics. If resources was very cheap, you could use a lot of them - but unfortunately, they have an economic cost.
Scale out horizontal or scale up vertical
The first fundamental question to ask, should you scale up your app vertically (e.g. to bigger instances) or should you scale out your app horizontally.
The most important thing here is that scaling out horizontally is much easier. But wether you can scale out horizontally of if you have to scale up vertically depends on your app. If your app is a stateless webserver, it typically is very easy to scale out, but if you have a stateful cache or database, scale up vertically might be your only short term option. Try to design so that you can scale out horizontally since that is much easier.
Accurate size - use observability
To find your accurate size, use observability and investigate your bottlenecks and adjust relatively to that.
E.g. if you use too little memory, your app will be terminated, or if you use too little CPU, your response time will be slow. Just start somewhere and adjust.

In addition to Jonas's answer:
You have two approaches (which are not mutually exclusive):
Estimate your needs based on expected load, etc.
Adjust you needs based on what you observe in production.
Regarding the first approach:
Have you done any analysis into what your expected load is? E.g. how many users (unique sessions), how many requests on average per hour (page views, API calls, etc), potential peaks in activity leading to increased load, etc.
Have you done any benchmarking?
Have you looked at your system and what it does, and worked out if it has any specific resource (CPU, memory, disk, etc) needs?
Estimating resources ahead of time requires some knowledge (or informed guesses) regarding what the load will be, as per the 3 points above. Having an idea of what the daily or hourly request average is isn't a bad place to start.
Also make sure you aware if any potential spikes that might catch you out (end of month for financial systems/services). Whether or not these are significant enough that is worth worrying about is another thing. A friend of mine was working on a ticketing system once, and they had massive traffic spikes for major events that did warrant serious scaling-out and back... but your average system probably won't need to be that extreme.
CPU is probably only worth "worrying" about if you have anything that does any above average processing - this should be obvious through benchmarking or if you/your team has good knowledge of your code.
Disk usage can be calculated - e.g.
If on average a user generates 1Mb of data in a session (not including system logs), and you get 100 sessions a day then that's 100Mb a day, 500Mb a working week, 200Mb a month, etc.
If a user profile has on average 200Kb of data and 300Kb of storage space (images) then you can calculate that.
You can also do this for records, especially for records that you know are "large" (e.g. >25mb) or where there will be lots of them (e.g. millions).
You can also start to forecast growth over time if you allow a growth rate (e.g. number of users and their sessions, and the amount of data generated). A simple way to do that is to have a spreadsheet with some simple formulas that take various inputs like number of users, average requests per user, disk space per user, etc. You can then do what-if modelling by playing with the inputs.
In terms of the second approach - as Jonas says, observe and adjust. Make sure you know how to do that, and that your solution provides the data you need. This might be using metrics provided by your cloud-provider (if applicable) or instrumentation / reporting you have custom built into you solution.
Scaling-Up is probably more relevant in scenarios where you have a central point/resource that cannot be scaled-out, like a central database.

Does memory copying on APUs (e.g. apple m1 mac) use GPU-specific wide vector instructions?

I was reading this article Why mmap is faster than system calls, where the main difference appeared to be mmap's ability to use vector instructions like AVX-2, something system calls can't.
I understand that the SIMD instructions used by GPUs tend to be much wider. A Nvidia warp of size 32 operating on float32 = 1024 bits (?) vs 256 bits of AVX-2. So potentially a 4x speedup. I guess this is not used in traditional discrete gpu settings as host-to-device (and back) copy would outweigh any benefit from wide registers.
However in APUs, GPU shares memory with CPU, eliminating the need for these expensive copies. I was wondering if those GPU instructions can therefore be used to accelerate mmap like vector operations further (numpy is another example). Has it already been done (in M1 mac or any CPUs with integrated graphics)? or can you please detail the architectural issues that prevent this?

You're kind of asking 2 separate questions: whether an OS (or user-space standard libraries?) can use GPGPU to speed up reading from the pagecache (into user-space memory with a read system call, or from an mmaped region). And separately whether GPGPU on normally-allocated process memory (and/or the pagecache) can avoid a copy to memory dedicated to the GPU.
For the 2nd part Apple has said the answer is yes for MacOS on M1 thanks to making the integrated GPU's memory accesses cache-coherent with the CPU. I think AMD made similar suggestions that copying could be avoided in graphics or GPGPU drivers on their APUs (Fusion IIRC?), but IDK if software ever took full advantage.
For the first part; doubtful. Large memory copies are bottlenecked by DRAM bandwidth, not CPU-core <-> L1d cache bandwidth (which scales with SIMD register width). On x86, an AVX2 loop on a single core can come pretty close to maxing out the DRAM bandwidth of an Intel "client" chip (quad-core or similar, not a big xeon with a higher-latency interconnect). Single-core bandwidth (to L3 or DRAM) tends to be limited by the number of outstanding cache misses that a core can track, not by doing the copy with fewer instructions. That mostly helps in terms of seeing farther with the same size out-of-order execution window, to start page walks sooner across page boundaries and stuff like that. See Why is std::fill(0) slower than std::fill(1)? for SSE (16-byte) vs. AVX (32-byte) vectors.
GPU offload would thus not help for large copies. It could only possibly help for small copies, and then it would not leave the copy result hot in L1d cache of the CPU. And/or not be able to take advantage of the source or destination already being hot in L1d cache of a CPU working with the data.
Also, setup overhead (to communicate with the GPU, going outside the current core) would dominate any faster copying for small copies.

NVMe SSD's bandwidth decreases when increasing the number of I/O queues

As far as I have learned from all the relevant articles about NVMe SSDs, one of NVMe SSDs' benefits is multiple queues. Leveraging multiple NVMe I/O queues, NVMe bandwidth can be greatly utilized.
However, what I have found from my own experiment does not agree with that.
I want to do parallel 4k-granularity sequential reads from an NVMe SSD. I'm using Samsung 970 EVO Plus 250GB. I used FIO to benchmark the SSD. The command I used is:
fio --size=1000m --directory=/home/xxx/fio_test/ --ioengine=libaio --direct=1 --name=4kseqread --bs=4k --iodepth=64 --rw=read --numjobs 1/2/4 --group_reporting
And below is what I got testing 1/2/4 parallel sequential reads:
numjobs=1: 1008.7MB/s
numjobs=2: 927 MB/s
numjobs=4: 580 MB/s
Even if will not increasing bandwidth, I expect increasing I/O queues would at least keep the same bandwidth as the single-queue performance. The bandwidth decrease is a little bit counter-intuitive. What are the possible reasons for the decrease?
Thank you.

I would like to highlight 3 reasons why you may see the issue:
Effective Queue Depth is too high,
Capacity under the test is limited to 1GB only,
Drive Precondition
First, parameter --iodepth=X is specified per Job. It means in your last experiment (--iodepth=64 and --numjobs=4) effective Queue Depth is 4x64=256. This may be too high for your Drive. Based on the vendor specification of your 250GB Drive, 4KB Random Read should show 250 KIOPS (1GB/s) for the Queue Depth of 32. By this Vendor is stating that QD32 is quite optimal for your Drive operation in order to reach best performance. If we start to increase QD, then commands will start aggregating and waiting in the Submission Queue. It does not improve performance. Vice Versa it will start to eat system resources (CPU, memory) and will degrade the throughput.
Second, limiting capacity under test to such a small range (1GB) can cause lot of collisions inside SSD. It is the situation when Reads will hit the same Media Physical Read Unit (aka Die aka LUN). In such situation new Reads will have to wait for previous one to complete. Increase of the testing capacity to entire Drive or at least to 50-100GB should minimize the collisions.
Third, in order to get performance numbers as per specification, Drive needs to be preconditioned accordingly. For the case of measuring Sequential and Random Reads it is better to use Full Drive Sequential Precondition. Command bellow will perform 128KB Sequential Write at QD32 to the Entire Drive Capacity.
fio --size=100% --ioengine=libaio --direct=1 --name=128KB_SEQ_WRITE_QD32 --bs=128k --iodepth=4 --rw=write --numjobs=8

What are the minimum specifications for a Scala development machine?

I need to spec out a new computer. I'm only going to use this computer for developing Scala software. I'm going to be running Intellij, doing builds with Maven and SBT, and perhaps firing up a couple of Virtual Machines. I'm going to building a mixture of fairly large Play Framework and micro-services. What is a reasonable machine for doing this work?

The Scala compiler still has poor parrelisation. I doubt that's going to change before you'll be due an upgrade. For this reason I would suggest as a minimum using a Haswell 4670. Going up to an i7 will probably be of doubtful benefit. if you want to spend extra money over-clock a 4770K or a 4670k. If you've really got money to burn use an Ivybridge 4960x, but you won't see much benefit for that extra money. Intel beats AMD on core for core performance. Make sure you've got a 4 memory slot motherboard. 2 Eight Gig DDR3 1600 sticks are probably more than sufficient but allow for an upgrade to 32 Gig in a year or so's time when hopefully memory's come down in price.
As already stated a decent SSD. Run your operating System, your IDE and your projects off the SSD. You'll want a SATA drive for mass storage.
Anyway above $1500 or so for the Base unit diminishing returns set in rapidly. Unless you've really got money to burn.
You'll probably want a graphic cards to run multiple monitors. An AMD 7790 should do the job. I'm assuming that a budget of a 1000 to 1500 dollars for a base unit is not as issue. Personally I find 3 24" 1920 * 1200 monitors just right for civilised development.

You should turn your focus to these key components:
CPU
Since Scala (and the compiler) parallelize well: go for more cores, the more the better. Depending on your budget you can think of multi-CPU systems.
RAM
A lot of RAM helps a lot. I am pretty happy with 16GB, but depending on the size you are planning you might need more, but 16 is a decent amount. You can also think about using RAM to make a RAM-disk for faster compilation etc.
HDD
You definitely want to have a fast SSD. You should look for one with high IOPS, the transfer rate is for developing not that important. If you have a large budget, you can go for 2 SSDs in RAID-0. But be aware that some RAID controllers are not fast enough and will not give you the full possible performance of your SSD RAID.