Horizontal Scalability and disk storage - virtualization

I am trying to understand how horizontal scalability (virtualization) works in terms of disk storage.
Virtualization is a layer on top of the hardware compute nodes that manages the resources needed for requests.
So my question is: what happens when I deploy my WAR to the web server, for example? Do I get replicated storage across different nodes?
After doing some research I found NAS vs. SAN, so I expect SAN replication is used for data stability... is that true?
Where exactly is my storage disk when I have a horizontally scaled platform like Google Engine or AWS?
Thanks,

Hopefully a couple of these examples will help. Let's take a general, crude example; I'll try to keep the information simple to understand. Say I have a business running on a LAMP stack: Apache+PHP runs on the WEB1 server, MySQL on the DB1 server, and customer data sits on WEB1.
SAN replication
First, your question about replication: that's mostly for disaster recovery. For data stability/reliability, SANs have appropriate RAID levels, service level agreements and spare disks. For example, RAID5 tolerates the failure of 1 disk in a RAID set, RAID6 tolerates the failure of 2 disks, and so on. Having hot-spare disks helps quickly repopulate a failed RAID disk. Organizations also snapshot their disk volumes and replay them in a different data center so as to have a second copy of their data. This is done over and above regular backups and VM snapshots.
AWS disks
AWS has 2 types of disks:
Ephemeral: disks connected to the EC2 instance
Elastic Block Store (EBS)
Ephemeral storage
Don't use this for anything critical. AWS offers EC2 instances with ephemeral storage (meaning the VM has disks attached to the physical server) and recommends that users purchase a slice of disk in the form of EBS (Elastic Block Store). I'd choose not to run anything on ephemeral storage, because if the EC2 instance stops, the information on ephemeral storage is gone! However, if my partitions are on an EBS volume, an EC2 restart is seamless: all data stays alive on my EBS volume.
EBS
When I want a VM, I choose an EC2 instance (CPU/memory). Then I buy disk in the form of an EBS volume of 100GB (or more if I want to do RAID/LVM etc.) and attach it to my EC2 instance. Now I can install the OS on my EC2 instance, with all partitions created on my EBS volume. When EC2 reboots, my data stays as-is.
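To make that concrete, here is a minimal boto3 sketch of buying and attaching an EBS volume (the region, availability zone, and instance ID are made-up placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")   # assumed region

# Buy a 100 GB gp2 volume in the same availability zone as the instance.
vol = ec2.create_volume(
    AvailabilityZone="us-west-2a",
    Size=100,
    VolumeType="gp2",
)
ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

# Attach it to the EC2 instance; it shows up as a block device inside the VM.
ec2.attach_volume(
    VolumeId=vol["VolumeId"],
    InstanceId="i-0123456789abcdef0",   # hypothetical instance ID
    Device="/dev/sdf",
)
```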
Disk scaling
Let's say I began my business with an EC2 instance + 100GB of EBS. All is well until my customers begin to upload really large files: my disk is getting full and I need to expand a partition. With AWS, I can buy another 100GB slice of EBS and expand my partition to use the additional 100GB.
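Whether you add a second volume (and join the two with LVM) or simply grow the existing one, the AWS side is small. Here is a hedged boto3 sketch of the grow-in-place route (the volume ID is a placeholder); the partition and filesystem still need to be grown inside the instance afterwards:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")   # assumed region

# Grow the existing EBS volume from 100 GB to 200 GB in place.
ec2.modify_volume(VolumeId="vol-0123456789abcdef0", Size=200)  # hypothetical ID

# Afterwards, inside the instance, the partition and filesystem still have to
# be grown (e.g. with growpart and resize2fs/xfs_growfs) to use the new space.
```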
Server scaling
Let's say my business is doing really well and my EC2 instance isn't keeping up with traffic. I need more horsepower, so I choose to add another server, WEB2, running Apache+PHP with its own EBS volume. But what about customer data? Will I store some data on WEB1 and some on WEB2? That would be hard to reconcile.
Keeping code same on WEB1 and WEB2
Code from Git (or the version control of your choice) is deployed to both WEB1 and WEB2 simultaneously, which keeps both servers' code up to date. Configuration management of the servers can happen through Ansible/Puppet/Chef.
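As an illustration only (the host names and application path are made-up, and a real setup would use Ansible, Capistrano or similar), the idea is simply to roll the same Git revision out to every web server:

```python
import subprocess

# Hypothetical host names; in practice these come from your inventory.
web_servers = ["web1.example.com", "web2.example.com"]

for host in web_servers:
    # Pull the same Git revision on every web server so the code stays in sync.
    subprocess.run(
        ["ssh", host, "cd /var/www/app && git pull --ff-only"],
        check=True,
    )
```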
Streamlining data storage
I have some options. Let's discuss two that allow WEB1 and WEB2 to share data/disk space. Important note: an EBS volume cannot be shared by multiple EC2 instances; an EBS volume can be attached to only one EC2 instance.
First option - stand up another server, DATA1, attach a large EBS volume to it, and move customer files there. WEB1 and WEB2 send customer data to DATA1 (rsync/ftp/scp) and also read/write against the DB1 database. I could even safeguard my data by taking snapshots of the EBS volume and replaying a snapshot on another server, DATA2, in a different AWS region or availability zone in case DATA1 is unavailable.
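A rough boto3 sketch of that snapshot-and-replay idea (the volume ID and availability zones are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")   # assumed region

# Snapshot DATA1's EBS volume (placeholder ID).
snap = ec2.create_snapshot(VolumeId="vol-0aaa1111bbbb22222",
                           Description="DATA1 customer files")
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# Recreate the volume in a different availability zone, ready to attach to DATA2.
dr_vol = ec2.create_volume(
    AvailabilityZone="us-west-2b",          # different AZ from DATA1
    SnapshotId=snap["SnapshotId"],
    VolumeType="gp2",
)
print("Volume for DATA2:", dr_vol["VolumeId"])
```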
Second option - AWS has S3 storage. It's reliable and cheaper than EBS. Instead of standing up DATA1 and DATA2, it is much easier and cheaper to create an S3 bucket and store customer data there. WEB1 and WEB2 can read/write to S3 seamlessly.
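A minimal boto3 sketch of that (the bucket name and object keys are made-up):

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-customer-uploads"      # hypothetical bucket name

# WEB1 or WEB2 uploads a customer file straight to S3...
s3.upload_file("/tmp/invoice-1234.pdf", bucket, "customers/1234/invoice.pdf")

# ...and any web server can read it back later.
s3.download_file(bucket, "customers/1234/invoice.pdf", "/tmp/invoice-1234.pdf")
```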
Where're my disks on AWS?
I don't know, and I don't need to know. AWS must have racks and racks of disks, and I am getting a slice of disk space from somewhere in there. Their disks likely have redundancy, but EBS failures are possible. For our own sanity, it is good to RAID and snapshot EBS volumes over and above taking regular backups.
Similar to disks, AWS must have racks and racks of servers. I am getting a virtual machine, in the form of the EC2 instance of my choice, from somewhere in those racks. When I shut down and restart an EC2 server, I may get a same-specification VM from a different rack. However, my EBS volume will remain the same unless I terminate it and buy a new EBS volume.
One thing to recognize is that if I bought my EC2 instance in Oregon, my EBS volume will be in the same Oregon region and also the same availability zone.
Note - this is a very generic answer.

Related

Should I use EBS or EFS for database?

For the database directories of MongoDB, Cassandra or Elasticsearch clusters with high availability, should I use EBS or EFS? MongoDB, Cassandra and Elasticsearch clusters take care of replicating data across nodes if they are configured with a replication factor > 1, so the EFS replication feature may not be needed, I guess.
EBS - for databases
EFS - for file sharing across applications, VMs etc
Here is a good article that differentiates between the storage types
https://dzone.com/articles/confused-by-aws-storage-options-s3-ebs-amp-efs-explained
EFS is for multiple servers having access to the same set of files. Cassandra has replication built in, so it has no use for that feature. You would not want multiple Cassandra nodes accessing the same files anyway as each node manages its own sstables.
Not to mention Cassandra is disk intensive and gets angry if there is latency. Cassandra connections time out really easily. So, using an NFS mount (EFS) instead of a “local” disk is just a bad idea.
Read this if you haven’t already: https://aws.amazon.com/blogs/big-data/best-practices-for-running-apache-cassandra-on-amazon-ec2/
(Can’t speak for other databases like MongoDB.)

Migrate to kubernetes

We're planning to migrate our software to run in Kubernetes with autoscaling. This is our current infrastructure:
PHP and Apache are running in Google Compute Engine n1-standard-4 (4 vCPUs, 15 GB memory)
MySQL is running in Google Cloud SQL
Data files (csv, pdf) and the code are stored on a single SSD Persistent Disk
I found many posts that recommend storing the data files in Google Cloud Storage and using the API to fetch files and upload them to the bucket. We have very limited time, so I decided to use NFS to share the data files across the pods. The problem is that NFS is slow: it's around 100 MB/s when I copy a file with pv, while the result from iperf is 1.96 Gbits/sec. Do you know how to achieve the same result without implementing cloud storage, or how to increase the NFS speed?
Data files (csv, pdf) and the code are stored on a single SSD Persistent Disk
There's nothing stopping you from volume mounting an SSD into the Pod, so you can continue to use an SSD. I can only speak to AWS terminology, but some EC2 instances come with "local" SSD hardware, and thus you would only need to use a nodeSelector to ensure your Pods are scheduled onto machines that have said local storage available.
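A rough sketch of that idea using the official Kubernetes Python client (the node label, image, and host path are assumptions, not anything from the question): the Pod is pinned to nodes labelled as having local SSD and mounts the disk with a hostPath volume.

```python
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="php-apache"),
    spec=client.V1PodSpec(
        # Only schedule onto nodes we have labelled as carrying a local SSD.
        node_selector={"disktype": "local-ssd"},          # hypothetical label
        containers=[client.V1Container(
            name="web",
            image="php:7.4-apache",                       # placeholder image
            volume_mounts=[client.V1VolumeMount(
                name="data", mount_path="/var/www/data")],
        )],
        volumes=[client.V1Volume(
            name="data",
            # Expose the node's local SSD mount point to the container.
            host_path=client.V1HostPathVolumeSource(path="/mnt/ssd"),
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```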
Where you're going to run into problems is if you are currently just using one php+apache and thus just one SSD, but now you want to scale the application up and it requires that all php+apache have access to the same SSD. That's a classic distributed application architecture problem, and something kubernetes itself can't fix for you.
If you're willing to expend the effort, you can also try any one of the other distributed filesystems (Ceph, GlusterFS, etc) and see if they perform better for your situation. Then again, "We have very limited time" I guess pretty much means that's off the table.

Backing up EC2 instance from Ephemeral to Persistent Storage

I'm pretty new to EC2 and backing up data, but currently the app I've built has no backup strategy, and I want to know how to build a proper one. Currently, I have my RoR app and my MongoDB database on one instance. I've just now read about EBS volumes and snapshots, but I just can't wrap my head around them.
Supposedly EBS can be used as a datastore. If that is so, how do I set up a MongoDB database in EBS and migrate the data I have in my EC2 instance to it? I'm not familiar with configuring EBS and I've read the documentation and have more questions than answers.
In short, my instance is ephemeral storage right now and I want to turn it into persistent storage.
Thank you,
Don
It is pretty simple.
EBS is a network disk volume; it is used to store data.
A snapshot is a compressed image backup, so it can apply to EC2 instances, RDS instances, and even to EBS volumes themselves. After a snapshot is created, it must be stored somewhere; AWS stores these backups in S3.
Configuring EBS is not difficult; it is little different from putting in a new hard drive. You just need to "attach" an EBS volume to your instance, then, inside the EC2 instance, do the usual OS disk initialisation work.
Because EBS is dynamic storage, as long as your EC2 instance's OS supports it, you can extend the disk space any time you need to (although it is recommended to take a backup before doing so).
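For the OS-side initialisation mentioned above, here is a hedged sketch (the device name and mount point are assumptions; these steps run as root on the instance after the volume is attached):

```python
import subprocess

def run(cmd):
    # Thin wrapper so each step fails loudly if something goes wrong.
    subprocess.run(cmd, check=True)

device = "/dev/xvdf"        # assumed device name the attached EBS volume gets
mount_point = "/data"

# One-time initialisation of a freshly attached, empty EBS volume.
run(["mkfs", "-t", "xfs", device])           # create a filesystem
run(["mkdir", "-p", mount_point])
run(["mount", device, mount_point])          # mount it, e.g. for MongoDB's data dir

# Later, after the EBS volume has been enlarged via the AWS console/API,
# grow the mounted XFS filesystem to use the new space.
run(["xfs_growfs", mount_point])
```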
But from an operations perspective, you may want to consider putting your data into RDS if it runs 24x7x365, so you don't need to deal with DB installation, complicated replication setup, etc. If you run the DB only occasionally, then you might want to stick with MongoDB on the EC2 instance.

EBS or Instance Storage for MongoDB in EC2?

Cassandra recommends using instance local storage for EC2 deployments instead of EBS
I am deploying MongoDB in EC2... should I also be using instance local storage instead of EBS PIOPS?
Here is a slide deck about using a hybrid setup (instance store and PIOPS EBS) for MongoDB on EC2.
http://www.slideshare.net/mongodb/world-high-performance-mongo-db-on-ec2-20140620
Related topic:
Instance store is super fast - https://gist.github.com/ktheory/3c3616fca42a3716346b
Conclusions:
Instance-store is over 5x faster than EBS-SSD for uncached reads.
Instance-store and EBS-SSD are equivalent for cached reads.
Instance-store is over 10x faster than EBS-SSD for writes.
Special notes:
Ephemeral storage or instance-store DOES persist across reboots of an instance! It does not persist across a stop/start, nor a termination, nor some instance hardware failures.
The MongoDB manual has a section with EC2 storage considerations including the general recommendation to use EBS-optimized EC2 instances with provisioned IOPS (PIOPS) EBS volumes.
There are several good reasons to use EBS over local storage:
Local storage (or "Instance Store" in EC2 terms) is ephemeral and introduces potential data loss scenarios on instance stop/start/terminate as well as hardware failure (see AWS docs on Instance Store Lifetime).
While an Instance Store is dedicated to a particular instance, the disk subsystem is shared among instances on the host server hardware. As with regular EBS volumes, contention for a shared resource can lead to unpredictable I/O behaviour. Provisioned IOPS EBS volumes will provide more predictable I/O performance for an active database workload -- no spikes of higher than expected performance, but also no troughs of decreased performance.
The sizes of Instance Stores are determined by the instance type. EBS volumes can be provisioned independently to meet your storage and performance requirements.
If you want to change your instance types, EBS volumes can be re-attached to a new instance in the same availability zone.
EBS volumes can be combined using RAID for additional capacity or redundancy.
EBS volumes support asynchronous snapshots, which are a common backup strategy.
EBS volumes can support encryption for data at rest for most instance types.
EBS is recommended because it is backed by more than one physical drive, with ~2ms transaction commit between the mirrored drives. EBS itself is fast enough and can reach 500+ MB/sec for reads and writes.
The Linux kernel is what affects IOPS dramatically; see what Pinterest engineers found:
Final choice: kernel 3.18.7 + XFS + 64K RAID block size.
• Best overall performance for async random read.
• Very competitive performance everywhere else.
• Networking-related kernel bugs (Xen-specific) in 3.13 that aren't fixed until 3.16.
https://www.percona.com/live/mysql-conference-2015/sites/default/files/slides/all_your_iops_are_belong_to_usPLMCE2015.pdf

Do you need to run RAID 10 on Mongo when using Provisioned IOPS on Amazon EBS?

I'm trying to set up a production Mongo system on Amazon to use as a datastore for a realtime metrics system.
I initially used the MongoDB AMIs [1] in the Marketplace, but I'm confused in that there is only one data EBS volume. I've read that Mongo recommends RAID 10 on EBS storage (8 EBS volumes on each server). Additionally, I've read that the bare minimum for production is a primary/secondary with an arbiter. Is RAID 10 still the recommended setup, or is one provisioned IOPS EBS volume sufficient?
Please advise. We are a small shop, so what is the bare minimum we can get away with and still be reasonably safe?
[1] MongoDB 2.4 with 1000 IOPS - data: 200 GB @ 1000 IOPS, journal: 25 GB @ 250 IOPS, log: 10 GB @ 100 IOPS
So, I just got off of a call with an Amazon System Engineer, and he had some interesting insights related to this question.
First off, if you are going to use RAID, he said to simply do striping, as the EBS blocks are mirrored behind the scenes anyway, so RAID 10 seemed like overkill to him.
Standard EBS volumes tend to handle spiky traffic well (they may be able to handle 1K-2K IOPS for a few seconds), but eventually they tail off to an average of 100 IOPS. One suggestion was to use many small EBS volumes and stripe them to get better IOPS throughput, as in the sketch below.
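As an illustration of that striping suggestion (the device names are assumptions; mdadm must be installed and this runs as root on the instance), several small attached EBS volumes can be combined into a single RAID 0 array:

```python
import subprocess

# Assumed device names of four small EBS volumes already attached to the instance.
ebs_devices = ["/dev/xvdf", "/dev/xvdg", "/dev/xvdh", "/dev/xvdi"]

# Stripe them into one RAID 0 array to aggregate their IOPS.
subprocess.run(
    ["mdadm", "--create", "/dev/md0", "--level=0",
     "--raid-devices=" + str(len(ebs_devices))] + ebs_devices,
    check=True,
)

# The array is then formatted and mounted like any single disk.
subprocess.run(["mkfs", "-t", "xfs", "/dev/md0"], check=True)
```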
Some of his customers use just the ephemeral storage on the EC2 instances, but then have multiple (3-5) nodes in the availability set. Ephemeral storage is the storage on the physical machine. Apparently, if you use an EC2 instance with SSD storage, you can get up to 20K IOPS.
Some customers will run a huge EC2 instance with SSD for the master, then a smaller EC2 instance with EBS for the secondary. The primary machine is performant, and failover is available but with degraded performance.
Make sure you check 'EBS Optimized' when you spin up an instance. That means you have a dedicated channel to the EBS storage (of any kind) instead of sharing the NIC.
Important! Provisioned IOPS EBS is expensive, and the bill does not stop when you shut down the EC2 instances the volumes are attached to (this hurts while you are testing). His advice was to take a snapshot of the EBS volumes, then delete them. When you need them again, just create new provisioned IOPS EBS volumes from the snapshot and reconfigure your EC2 instances to attach the new storage. (It's more work than it should be, but it's worth it not to get sucker-punched with the IOPS bill.)
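A hedged boto3 sketch of that snapshot-then-delete dance (all IDs, the region and AZ are placeholders, and io1 is assumed as the provisioned-IOPS volume type):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")    # assumed region
vol_id = "vol-0123456789abcdef0"                       # hypothetical PIOPS volume

# 1. Snapshot the expensive provisioned-IOPS volume, then delete it.
snap = ec2.create_snapshot(VolumeId=vol_id, Description="parked PIOPS volume")
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])
ec2.delete_volume(VolumeId=vol_id)                     # billing for the volume stops

# 2. Later, rebuild a fresh PIOPS volume from the snapshot and re-attach it.
new_vol = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    SnapshotId=snap["SnapshotId"],
    VolumeType="io1",
    Iops=1000,
)
ec2.get_waiter("volume_available").wait(VolumeIds=[new_vol["VolumeId"]])
ec2.attach_volume(VolumeId=new_vol["VolumeId"],
                  InstanceId="i-0123456789abcdef0",    # hypothetical instance
                  Device="/dev/sdf")
```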
I've got the same question. Both Amazon and MongoDB market provisioned IOPS heavily, chewing over its advantages over a standard EBS volume. We run prod instances on m2.4xlarge AWS instances with a 1-primary, 2-secondary setup per service. In the most heavily utilized service cluster, apart from a few slow queries, the monitoring charts do not reveal any drop in performance at all. Page faults are rare occurrences, and even then between 0.0001 and 0.0004 faults once or twice a day. Background flushes are in milliseconds, and locks and queues are so far at manageable levels. I/O waits on the primary node at any time range between 0 and 2%, mostly less than 1%, and %idle steadily stays above the 90% mark. Do I still need to consider provisioned IOPS, given we still have budget to address any potential performance drag? Any guidance will be appreciated.