Should I use EBS or EFS for database? - mongodb

For the database directories of highly available MongoDB, Cassandra, or Elasticsearch clusters, should I use EBS or EFS? MongoDB, Cassandra, and Elasticsearch clusters replicate data across nodes themselves when they are configured with a replication factor > 1, so I guess the EFS replication feature may not be needed.

EBS - for databases
EFS - for file sharing across applications, VMs etc
Here is a good article that differentiates between the storage types
https://dzone.com/articles/confused-by-aws-storage-options-s3-ebs-amp-efs-explained

EFS is for multiple servers having access to the same set of files. Cassandra has replication built in, so it has no use for that feature. You would not want multiple Cassandra nodes accessing the same files anyway as each node manages its own sstables.
Not to mention Cassandra is disk intensive and gets angry if there is latency. Cassandra connections time out really easily. So, using an NFS mount (EFS) instead of a “local” disk is just a bad idea.
Read this if you haven’t already: https://aws.amazon.com/blogs/big-data/best-practices-for-running-apache-cassandra-on-amazon-ec2/
(Can’t speak for other databases like MongoDB.)

Related

If I declare 2 replicas of PostgreSQL StatefulSet pods in k8s, are they the same database or do they just share the volume?

After creating 2 replicas of a PostgreSQL StatefulSet in k8s, are the pods the same database?
If they are, why can't I find the DB and user that I created in one pod in the other pod?
If they are not, is there no point in creating replicas?
There isn't one simple answer here; it depends on how you configured things. Postgres doesn't support multiple instances sharing the same underlying volume without massive corruption, so if you did set things up that way, it's definitely a mistake. More common would be to use volumeClaimTemplates so that each pod gets its own distinct storage. Then you set up Postgres streaming replication yourself.
Or look at using an operator which handles that setup (and probably more) for you.
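As a rough sketch of the volumeClaimTemplates approach: the manifest below is built as a Python dict and printed as JSON, which kubectl accepts just like YAML. The names `postgres` and `pgdata`, the image tag, and the `10Gi` size are placeholders I picked, not values from the question, and required settings such as the `POSTGRES_PASSWORD` environment variable are left out. It only gives each pod its own volume; it does not set up streaming replication.

```python
import json


def postgres_statefulset(name: str, replicas: int, storage: str) -> dict:
    """Build a StatefulSet manifest in which every pod gets its own PVC."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,  # headless Service assumed to exist
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "postgres",
                        "image": "postgres:16",  # placeholder tag
                        "volumeMounts": [{
                            "name": "pgdata",
                            "mountPath": "/var/lib/postgresql/data",
                        }],
                    }],
                },
            },
            # One PersistentVolumeClaim is created per pod from this template.
            "volumeClaimTemplates": [{
                "metadata": {"name": "pgdata"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": storage}},
                },
            }],
        },
    }


def pvc_names(manifest: dict) -> list:
    """PVCs are named <claim template>-<statefulset>-<ordinal>."""
    sts = manifest["metadata"]["name"]
    claim = manifest["spec"]["volumeClaimTemplates"][0]["metadata"]["name"]
    return [f"{claim}-{sts}-{i}" for i in range(manifest["spec"]["replicas"])]


manifest = postgres_statefulset("postgres", replicas=2, storage="10Gi")
print(json.dumps(manifest, indent=2))
print(pvc_names(manifest))  # ['pgdata-postgres-0', 'pgdata-postgres-1']
```

Two separate claims mean two separate data directories, which is why a DB created in one pod does not show up in the other until replication is configured.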
To add to coderanger's answer: as he said, it's hard to say how Postgres will behave with multiple replicas, or how data gets replicated across the cluster, without checking in more depth. Setting multiple replicas directly without reading the documentation on data replication might lead to big issues.
Here is one nice example from Google for reference: https://cloud.google.com/architecture/deploying-highly-available-postgresql-with-gke
For example Postgres replication and clustering config files: https://github.com/CrunchyData/crunchy-containers/tree/master/examples/kube

How databases synchronize data between persistent volumes in Kubernetes

I've just read the "Deploying Cassandra with Stateful Sets" topic in the Kubernetes documentation.
The deployment process:
1. Creation of a StorageClass
2. Creation of PersistentVolumes (4 in my case), each with storageClassName set to the StorageClass from step 1
3. Creation of the Cassandra headless Service
4. Using a StatefulSet to create a Cassandra ring, with storageClassName in the StatefulSet YAML definition set to the StorageClass from step 1
As a result, there are 4 pods (cassandra-0, cassandra-1, cassandra-2, cassandra-3), each of which mounts one of the volumes created in step 2 (pv-0, pv-1, pv-2, pv-3).
I wonder how, or whether, these persistent volumes synchronize data with each other.
E.g. if I add a record that pod cassandra-0 writes to persistent volume pv-0, will someone who reads from the database a moment later through pod cassandra-1 (and its volume) see the data that was added to pv-0? Can anyone tell me how exactly this works?
This is not related to Kubernetes.
The replication is done by the database and is configurable.
See the CAP theorem and eventual consistency for Cassandra.
You can control the consistency level in Cassandra: whether a record is visible on the other replicas immediately or only later depends on how you configure Cassandra.
See also: synchronous replication, asynchronous replication
Cassandra Consistency:
how to set cassandra read and write consistency
How is the consistency level configured?
The mechanism that spreads data across the cluster is the same whether Cassandra is deployed in Kubernetes or on bare-metal instances. Cassandra distributes the data across the nodes according to a hash value of the partition key (known as a token), and uses the same algorithm to locate the data on reads.
There are other factors to take into consideration: the replication factor (number of copies) and the consistency level used.
You may want to take a look at DS201: DataStax Enterprise Foundations of Apache Cassandra™ on DataStax Academy, which covers the basics of Cassandra.
Just to slightly extend Carlos' answer: Kubernetes is not involved, and the volumes are completely isolated. Replication and distribution are entirely up to the database software to handle. As far as K8s can see, they are just separate processes and separate volumes.
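To make the token / replication factor / consistency level idea from the answers above concrete, here is a toy model. It is only a sketch under simplifying assumptions: one token per node (no vnodes), MD5 as a stand-in for Cassandra's real partitioner, and replicas placed on the next nodes clockwise on the ring. The `Ring`, `quorum`, and `overlaps` names are mine, not Cassandra APIs.

```python
import hashlib
from bisect import bisect_right


def token(key: str) -> int:
    """Map a key to a ring position (MD5 is a stand-in for the real partitioner)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed node count")
        self.rf = replication_factor
        # Each node owns one position on the ring, derived from its name.
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, key):
        """Nodes that store `key`: the first owner clockwise, then the next rf-1 nodes."""
        start = bisect_right(self.tokens, token(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]


def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def overlaps(read_cl: int, write_cl: int, replication_factor: int) -> bool:
    """True when every read touches at least one replica that acknowledged the write."""
    return read_cl + write_cl > replication_factor


nodes = ["cassandra-0", "cassandra-1", "cassandra-2", "cassandra-3"]
ring = Ring(nodes, replication_factor=3)
print(ring.replicas("user:42"))  # the 3 nodes holding this row
print(quorum(3), overlaps(quorum(3), quorum(3), 3))  # 2 True
print(overlaps(1, 1, 3))  # False
```

Whichever pod receives the request acts as coordinator and forwards it to the replicas computed from the key, so it does not matter that each pod has its own isolated volume. With reads and writes at quorum the read always overlaps the write; with both at one replica a read shortly after a write may miss it.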
Thanks for comments guys!
So, when I have my db with 3 PVs:
cassandra-pod0   cassandra-pod1   cassandra-pod2
      |                |                |
cassandra-pv0    cassandra-pv1    cassandra-pv2
the data is divided across the 3 PVs. When I kill cassandra-pod1, it is possible that I will (temporarily) lose part of the data. Am I right?

Horizontal Scalability and disk storage

I am trying to understand how horizontal scalability (virtualization) works in terms of disk storage.
Virtualization is a layer on top of the hardware nodes that manages the resources needed to serve requests.
So my question is: what happens when I deploy my WAR to the web server, for example? Do I get replicated storage on different nodes?
After some research I found NAS vs. SAN, so I expect that SAN replication is used for data stability. Is that true?
Where exactly is my storage disk when I use a horizontally scaled platform like Google Engine or AWS?
Thanks,
Hopefully a couple of these examples will help. Let's take a general, crude example. I'll try to keep information simple to understand. Let's say I have a business running on LAMP stack. Apache+PHP is running on WEB1 server, MySQL on DB1 server. Customer data sits on WEB1.
SAN replication
First, your question about replication. That's mostly for disaster recovery. For data stability/reliability, SANs rely on appropriate RAID levels, service level agreements, and spare disks. For example, RAID5 tolerates the failure of 1 disk in a RAID set, and RAID6 tolerates the failure of 2 disks. Hot-spare disks help to rebuild a failed RAID disk quickly. Organizations also snapshot their disk volumes and replay them in a different data center so as to have a second copy of their data. This is done over and above regular backups and VM snapshots.
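The RAID trade-off above is simple arithmetic: parity costs one disk's worth of capacity in RAID5 and two in RAID6, and that is also how many disk failures the set survives. A minimal sketch (the `raid_summary` helper and the disk counts and sizes are illustrative only):

```python
def raid_summary(level: str, disks: int, disk_size_gb: int) -> tuple:
    """Return (usable capacity in GB, number of disk failures tolerated)."""
    parity_disks = {"raid5": 1, "raid6": 2}[level]
    minimum_disks = {"raid5": 3, "raid6": 4}[level]
    if disks < minimum_disks:
        raise ValueError(f"{level} needs at least {minimum_disks} disks")
    usable_gb = (disks - parity_disks) * disk_size_gb
    return usable_gb, parity_disks


print(raid_summary("raid5", disks=6, disk_size_gb=1000))  # (5000, 1)
print(raid_summary("raid6", disks=6, disk_size_gb=1000))  # (4000, 2)
```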
AWS disks
AWS has 2 types of disks:
Ephemeral (instance store): disks attached to the server hosting the EC2 instance
Elastic Block Store (EBS)
Ephemeral storage
Don't use this for anything critical. AWS offers EC2 instances with ephemeral storage (that is, disks attached to the server hosting the VM) and recommends that users purchase a slice of its disk in the form of EBS (Elastic Block Store). I'd choose not to run anything on ephemeral storage, because if the EC2 instance stops, the information on ephemeral storage is gone! However, if my partitions are on an EBS volume, an EC2 restart is seamless. All data stays alive on my EBS volume.
EBS
When I want a VM, I'd choose an EC2 instance (CPU/memory). Then I buy disk in the form of an EBS volume of 100GB (or more if I want to do RAID/LVM etc.) and attach it to my EC2 instance. Now I can install the OS on my EBS volume. Partitions are all created on my EBS volume. When EC2 reboots, my data stays as-is.
Disk scaling
Let's say I began my business with an EC2 instance + 100GB of EBS volume. All was well until my customers began to upload really large files. My disk is getting full and I need to expand a partition. With AWS, I can buy another 100GB slice of EBS and expand my partition to use the additional 100GB.
Server scaling
Let's say my business is doing really well and my EC2 instance isn't keeping up with traffic. I need more horse-power and I choose to add another server WEB2 running Apache+PHP server with its own EBS volume. But what about customer data? Will I store some data on WEB1 and some on WEB2? That'd be hard to reconcile.
Keeping code same on WEB1 and WEB2
Code from Git (or the version control system of choice) will be deployed to both WEB1 and WEB2 simultaneously. That will keep the code on both servers up to date. Configuration management of my servers can happen through Ansible/Puppet/Chef.
Streamlining data storage
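I have some options. Let's discuss two options that will allow WEB1 and WEB2 to share data/disk space. Important note: an EBS volume is normally attached to only one EC2 instance at a time, so it cannot serve as a shared disk for WEB1 and WEB2. (Multi-Attach exists for some Provisioned IOPS volume types, but it needs a cluster-aware file system and is not a general-purpose shared disk.)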
First option - stand up another server DATA1 and attach a large EBS volume to it and move customer files there. WEB1 and WEB2 will send customer data to DATA1 (rsync/ftp/scp). WEB1 and WEB2 will read/write from DB1 database also. I could even safeguard my data by taking snapshots of EBS volume and replaying the snapshot on another server called DATA2 in a different AWS region or availability zone in case DATA1 is unavailable.
Second option - AWS has S3 storage. It's reliable and cheaper than EBS. Instead of standing up DATA1 and DATA2, it is much easier and cheaper to create a bucket on S3 and store customer data there. WEB1 and WEB2 can read/write to S3 seamlessly.
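A small sketch of the second option, with both web servers reading and writing the same bucket. To keep it runnable without AWS credentials, `InMemoryObjectStore` is a stand-in I wrote that mirrors the keyword arguments of boto3's `put_object` and `get_object`; in a real setup you would pass `boto3.client("s3")` instead. The `WebServer` class, the `customer-data` bucket name, and the key layout are hypothetical.

```python
import io


class InMemoryObjectStore:
    """Stand-in for an S3 client (same keyword names as boto3's S3 client)."""

    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body):
        self._objects[(Bucket, Key)] = bytes(Body)

    def get_object(self, Bucket, Key):
        # boto3 also returns a dict whose "Body" is a readable stream.
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


class WebServer:
    """A web node that keeps customer files in the shared store, not on local disk."""

    def __init__(self, name, store, bucket):
        self.name = name
        self.store = store
        self.bucket = bucket

    def save_upload(self, customer_id, filename, data):
        key = f"customers/{customer_id}/{filename}"
        self.store.put_object(Bucket=self.bucket, Key=key, Body=data)
        return key

    def load_upload(self, key):
        return self.store.get_object(Bucket=self.bucket, Key=key)["Body"].read()


store = InMemoryObjectStore()
web1 = WebServer("WEB1", store, "customer-data")
web2 = WebServer("WEB2", store, "customer-data")

key = web1.save_upload(42, "invoice.txt", b"uploaded via WEB1")
print(web2.load_upload(key))  # b'uploaded via WEB1'
```

Because neither server keeps the file on its own EBS volume, it no longer matters which of them handles a given request.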
Where're my disks on AWS?
I don't know, and I don't need to know. AWS must have racks and racks of disks. I am getting a slice of disk space from somewhere there. Their disks are likely to have redundancy but EBS failures are possible. For our own sanity, it is good to RAID and snapshot EBS volumes over and above taking regular backups.
Similar to disks, AWS must have racks and racks of servers. I am getting a virtual machine in the form of an EC2 instance of my choice from somewhere in those racks. When I shut down and restart an EC2 server, I may get a VM of the same specification from a different rack. However, my EBS volume will remain the same unless I delete it and buy a new one.
One thing to recognize is that if I bought EC2 instance in Oregon, my EBS volume will be in the same Oregon region and also the same availability zone.
Note - this is a very generic answer.

setting development project with mongo database on EC2 cluster

I would like to create a development project on an EC2 cluster. The current design suggests accessing MongoDB database files stored on an EBS volume. Is it possible to run distributed computing and access the same files in /data/db/ simultaneously from different nodes?
No, that will not work. You cannot access the same MongoDB database files from different processes on different nodes.
The way you use mongoDB in a distributed environment is with replica sets and sharding. In both cases you have mongodb instances running on each node. Replica sets duplicate the same data across all the nodes in the set, for data redundancy and fault tolerance. Sharding allows you to distribute different sets of data on different nodes to provide horizontal scaling. Large production environments use both replica sets and sharding.
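For the replica set part, a minimal sketch of what "an instance on each node" looks like in configuration terms. The dict below has the shape of the document passed to `rs.initiate()` in the mongo shell, and the URI uses the standard `replicaSet` option; the set name `rs0`, the host names, and the `dev` database are made up for illustration.

```python
import json


def replica_set_config(set_name, hosts):
    """Config document for a replica set: one member per node, each with its own data files."""
    return {
        "_id": set_name,
        "members": [{"_id": i, "host": host} for i, host in enumerate(hosts)],
    }


def connection_uri(set_name, hosts, database):
    """Clients list the members and let the driver find the current primary."""
    return f"mongodb://{','.join(hosts)}/{database}?replicaSet={set_name}"


hosts = [
    "node1.example.internal:27017",
    "node2.example.internal:27017",
    "node3.example.internal:27017",
]
config = replica_set_config("rs0", hosts)
print(json.dumps(config, indent=2))
print(connection_uri("rs0", hosts, "dev"))
```

Each host here runs its own mongod with its own /data/db on its own volume; the nodes share data through replication, not through shared files.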
Best place to read up on all of this is:
http://docs.mongodb.org/manual/administration/replica-sets/
http://docs.mongodb.org/manual/sharding/
http://docs.mongodb.org/ecosystem/platforms/amazon-ec2/

MongoDB on Amazon SSD-backed EC2

We have a MongoDB sharded cluster currently deployed on EC2 instances in Amazon. The shards are also replica sets. The instances use EBS with provisioned IOPS.
We have about 30 million documents in a collection. Our queries count all documents in the collection that match the filters. We have indexes on almost all of the queryable fields. As a result, RAM usage reaches 100%: our working set exceeds the size of the RAM. We think the slow response of our queries is caused by EBS being slow, so we are thinking of migrating to the new SSD-backed instances.
C3 is available
http://aws.typepad.com/aws/2013/11/a-generation-of-ec2-instances-for-compute-intensive-workloads.html
I2 is coming soon
http://aws.typepad.com/aws/2013/11/coming-soon-the-i2-instance-type-high-io-performance-via-ssd.html
Our only concern is that the SSD is ephemeral, meaning the data will be gone once the instance stops, terminates, or fails. How can we address this? How do we automate backups? Is it a good idea to migrate to SSD to improve the performance of our queries? Do we still need to set up a sharded cluster?
Working with ephemeral disks is a risk, but if you have your replication set up correctly it shouldn't be a huge concern. I'm assuming you've set up a three-node replica set, correct? And you also have three nodes for your config servers?
I can speak to this from experience, as the company I'm at is set up this way. To help mitigate risk, I'm moving towards a backup strategy that involves a hidden replica. With this setup I can shut down the hidden replica and one of the config servers (having first stopped balancing), take a complete copy of the data files (replica and config server), and have a valid backup. If AWS went down in my availability zone, I'd still have a daily backup available on S3 to restore from.
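For anyone wondering what the hidden replica looks like in the replica set configuration, here is a small sketch. A hidden member must have `priority` 0, so it can never become primary and clients never read from it, which is what makes it safe to stop for a file copy. The helper name and host names are hypothetical; the resulting document is what you would apply with `rs.reconfig()`.

```python
def add_hidden_backup_member(config: dict, host: str) -> dict:
    """Return a new replica set config with an extra hidden, priority-0 member."""
    next_id = max(member["_id"] for member in config["members"]) + 1
    hidden_member = {
        "_id": next_id,
        "host": host,
        "priority": 0,   # can never be elected primary
        "hidden": True,  # invisible to client applications
    }
    return {**config, "members": config["members"] + [hidden_member]}


current = {
    "_id": "shard0",
    "members": [
        {"_id": 0, "host": "shard0-a.example.internal:27017"},
        {"_id": 1, "host": "shard0-b.example.internal:27017"},
        {"_id": 2, "host": "shard0-c.example.internal:27017"},
    ],
}
with_backup = add_hidden_backup_member(current, "shard0-backup.example.internal:27017")
print(with_backup["members"][-1])
```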
Hope this helps.