MongoDB architecture for a scalable read-heavy app (constant writes)

My app runs a daily job that collects data and feeds it into MongoDB. The data is processed and then exposed via a REST API.
I need to set up a MongoDB cluster in AWS. The requirements:
Data will grow by roughly the same amount each day (about 50M records), so write throughput doesn't need to scale. Writes are triggered by a cron job at a fixed hour. Documents are immutable (they won't grow).
Read throughput will depend on the number of users / traffic, so it should be scalable. Traffic won't be heavy in the beginning.
Data is mostly simple JSON; I need a couple of indexes on some of the fields for fast querying / filtering.
What kind of architecture should I use in terms of replica sets, shards, etc.?
What kind of storage volumes should I use for this architecture (EBS, NVMe)?
Is it preferable to use more instances or a RAID setup?
I'm looking to spend about $500 a month.
Thanks in advance

To set up a MongoDB cluster in AWS, I would recommend starting from the latest AWS Quick Start for MongoDB, which covers the architectural aspects and also provides CloudFormation templates.
For storage volumes, use EC2 instance types that support EBS rather than local NVMe instance storage: NVMe instance store is ephemeral, so if you stop and start the EC2 instance, the data on it is lost.
For storage throughput, start with General Purpose volumes at a reasonable size, and only consider Provisioned IOPS if you run into limitations.
For high availability and fault tolerance, the CloudFormation template will create multiple instances (nodes) in the MongoDB cluster.
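To get a feel for whether a $500/month budget fits, it helps to turn the question's 50M records/day into storage growth. A minimal sizing sketch, assuming an average document size of 500 bytes including index overhead (the question doesn't state document size, so adjust `AVG_DOC_BYTES` to your real average):

```python
# Back-of-envelope storage growth for the workload described above.
# AVG_DOC_BYTES is an assumption, not a figure from the question.
AVG_DOC_BYTES = 500            # assumed average document + index overhead
RECORDS_PER_DAY = 50_000_000   # from the question

daily_gb = RECORDS_PER_DAY * AVG_DOC_BYTES / 1024**3
monthly_gb = daily_gb * 30

print(f"~{daily_gb:.1f} GB/day, ~{monthly_gb:.0f} GB/month")
```

At those assumptions the data set grows by roughly 700 GB a month, which dominates the budget quickly and suggests planning a retention/archival policy alongside the cluster topology.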

Related

Is there a way to test an M10 Atlas cluster on MongoDB Atlas?

We have an M10 cluster and the official page states that we get a max of 100 IOPS.
I can't run mongoperf on the cluster because we only have mongo shell and Compass access, and mongoperf needs to be run on the instance where MongoDB is installed.
Is there any way to test the maximum requests per second that this cluster can handle and if not, is there any rough estimate available as to how many read/write operations it can handle concurrently?
PS: Assume the queries being run aren't particularly complex and only insert small documents with fields such as Name, Email Address, Age, etc.
Thanks in advance!
Is there any way to test the maximum requests per second that this cluster can handle and if not, is there any rough estimate available as to how many read/write operations it can handle concurrently?
The answer to this really depends on a lot of variables: document sizes, index utilisation, network latency between the application and the servers, etc.
To provide a rough estimation however, assuming your MongoDB cluster is hosted on AWS (GCP and Azure would be different), the specs would be:
M10, 2GB RAM and 10GB included storage.
In addition to this, you can select different EBS storage sizes as well as provisioned IOPS to match your required performance.
See also MongoDB Atlas: FAQ
We have an M10 cluster and the official page states that we get a max of 100 IOPS.
The advertised IOPS figure is the cap set by the cloud provider (i.e. AWS). It does not account for network latency or for your database usage patterns, which also consume server resources (CPU/RAM).
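One way to reason about what the 100 IOPS cap actually means: it only bounds reads that miss the cache. A rough sketch, where `CACHE_HIT_RATIO` is an assumption that depends on whether your working set fits in the M10's 2 GB of RAM:

```python
# Very rough read-capacity estimate; not an Atlas guarantee.
# CACHE_HIT_RATIO is an assumed value, not a measured one.
DISK_IOPS = 100          # advertised M10 cap (from the question)
CACHE_HIT_RATIO = 0.95   # assumed: 95% of reads served from RAM

# Only cache misses consume disk IOPS, so the sustainable read rate
# before the disk becomes the bottleneck is:
disk_bound_reads = DISK_IOPS / (1 - CACHE_HIT_RATIO)
print(f"~{disk_bound_reads:.0f} reads/sec before disk is the bottleneck")
```

The practical takeaway is that for small documents with good indexes, RAM is usually the limiting resource long before the IOPS cap is, which is why benchmarking your own workload beats any generic number.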

Is there any drawback to MongoDb data being on on Amazon EFS?

I have a relatively low-traffic system, but I want to keep the data safe. The data is stored in a single MongoDB instance. I don't want to run multiple replicas and manage them, so I'm planning to point the data directory at an EFS path to take advantage of its replication and other benefits. Periodic snapshots can cause data loss, and recovery is manual.
Is there any drawback of storing the data and the journal files on EFS caused by the additional latency?
As you alluded to, EFS objects are replicated across availability zones, whereas EBS volumes are only replicated within a single availability zone. The difference in pricing is significant, with EFS currently starting at $0.30/GB and EBS at $0.10/GB. Typical EFS use cases are data that needs to be shared across instances, like user home directories and application data. EBS also provides the lower latency of the two.
With those points in mind, I do not recommend EFS for MongoDB data. If EFS's multi-AZ replication is your major desire, you could achieve it with EBS by taking periodic snapshots (which are stored in S3) of the EBS volume. I think EBS will give you better performance and lower cost.
Using EFS is not really an alternative to running multiple MongoDB instances. Replication and sharding are not things that EFS can help achieve.
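The cost argument above can be made concrete with the per-GB prices quoted in the answer. A sketch for an example 100 GB data set; the snapshot price is an assumption (check current AWS pricing), and all figures vary by region:

```python
# Monthly storage cost comparison using the per-GB prices quoted above.
EFS_PER_GB = 0.30        # from the answer
EBS_PER_GB = 0.10        # from the answer
SNAPSHOT_PER_GB = 0.05   # assumed EBS snapshot (S3) price; verify current rates

data_gb = 100  # example data set size

efs_cost = data_gb * EFS_PER_GB
# EBS volume plus a full-size snapshot copy for multi-AZ durability:
ebs_cost = data_gb * (EBS_PER_GB + SNAPSHOT_PER_GB)
print(f"EFS: ${efs_cost:.2f}/mo  EBS + snapshot: ${ebs_cost:.2f}/mo")
```

Even after paying for snapshots to get cross-AZ durability, EBS comes out at roughly half the EFS cost under these assumptions, which supports the recommendation above.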

MongoDB replication and EBS vs ephemeral

I've read all of the MongoDB related documentation talking about the recommended practices for deploying Mongo on AWS, but I don't understand the recommendation to install on EBS with RAID-10 (pdf) to avoid data loss.
This seems like admitting that replication doesn't work. Why shouldn't one run Mongo using ephemeral drives and a cluster of 5 servers doing replication?
Performance is much greater and latency is predictable on local disks.
If a server goes down, the EBS backed store would have to be resynced with the replica anyway. Sure you have the data, but it is already out of date.
Using EBS makes for a much more complicated setup. You need to use LVM or some other layer if you want to take snapshots, since EBS snapshots won't work across RAID. You need to monitor and manage your RAID array and rebuild in the case of failure or if one of the EBS volumes has performance issues.
What exactly does using EBS protect against if one has backups and a large replica set? It's almost admitting that replica sets won't protect you against data loss (ignoring for the moment the race condition when writes have been sent to secondaries and a failure on the primary happens before acknowledgements have been sent).
Why shouldn't one run Mongo using ephemeral drives and a cluster of 5 servers doing replication?
AWS is not perfect; it can have a network failure that takes the entire set down, and with ephemeral storage you would lose all your data. Also, EBS block devices survive instance restarts.
Those are a few reasons; I'm sure there are more.
If a server goes down, the EBS backed store would have to be resynced with the replica anyway.
Only for the period after it went down; if that is a considerable amount of time, then yes, it might be easier to copy the data directory from one replica to the other.
Using EBS makes for a much more complicated setup. You need to use LVM or some other layer if you want to take snapshots, since EBS snapshots won't work across RAID.
You don't really need RAID within AWS itself; AWS already provides redundancy behind each block device, and replica set members are effectively disposable. You can get by with one block device per node.
What exactly does using EBS protect against if one has backups and a large replica set?
It safeguards your sanity: restoring a backup of sizeable data across ten-odd members and re-doing all the firewall rules, user permissions, OS setup, etc. could be... well... nasty.
Imagine having to re-set up your OS every single time you restart it.
It's almost admitting that replica sets won't protect you against dataloss.
Hmm, you must have misread somewhere, because THAT is not what they guarantee. It is true that it is harder to lose data with replica sets (if they are set up right), but they are actually designed to provide high availability (HA).
Backups, journalling, and other consistency mechanisms are what's designed to prevent data loss.
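The acknowledgement race mentioned in the question is exactly what write concerns address: with `w=majority`, the primary acknowledges a write only after a majority of the replica set has it. A minimal sketch of the relevant connection string options (host names and the `rs0` set name are placeholders):

```python
# Connection string requesting majority write acknowledgement plus
# on-disk journaling before acknowledgement. Hosts are hypothetical.
uri = (
    "mongodb://node1.example.com,node2.example.com,node3.example.com"
    "/?replicaSet=rs0&w=majority&journal=true"
)
print(uri)
```

With these options a write that was acknowledged cannot be lost by a single node failure, which is the guarantee the question was reaching for; it comes at the cost of higher write latency.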
So where do you see the recommendation to run RAID-10 on EBS for MongoDB? Their docs list it as an option but specifically recommend plain EBS with Provisioned IOPS:
For almost all deployments EBS will be the better choice. For production systems we recommend using
EBS-optimized EC2 instances
Provisioned IOPS (PIOPS) EBS volumes
http://docs.mongodb.org/ecosystem/platforms/amazon-ec2/
We run all of our mongodb instances at EC2 and all of them use EBS storage volumes with production instances using provisioned IO. Here's why:
Bringing back a failed member is faster. If an instance fails and needs to be stopped and restarted (not that frequent but it does happen) we can just detach the storage and re-attach it to another instance. Mongod comes up fine, recovers via the journal and then re-syncs with the primary for only the delta in data since the failure. This is a big deal when you have large data sets that may take many hours to restore from scratch. Storing the data on an ephemeral drive does not provide this capability.
Backups are easier (at least for replica sets under 1 TB). With a single EBS storage volume (up to 1 TB) we can take snapshots of a live secondary. As long as the journal is on the same storage volume the backup will be consistent. No need for a dedicated secondary for backups that has to be brought offline to backup.
No need for RAID except for multiple TB replica sets or for performance. EBS is already RAID behind the scenes for redundancy. We do use RAID when a replica set grows beyond 1 TB in storage but that's it and have not yet hit a point where a high IOPS EBS volume provides sufficient performance.
Provisioned IOPS give decent control of performance vs. cost. Being able to select EBS storage rated up to 4000 IOPS has allowed us to scale up performance for production systems (at higher cost) while still gaining the benefits of EBS storage. We use regular EBS volumes at lower cost for test systems.
Copying production data off for use in a test environment is much easier for large data sets. Snapshot the volumes, create a new storage volume from the snapshot and you're up and running.
I certainly can imagine future deployments using ephemeral storage (particularly as SSD costs drop) for certain high performance situations but EBS has been fairly reliable and dependable for us. Of course your experience and needs can and will differ but for us following the recommendation from MongoDB has served us well. In fact it's been reliable enough that for some environments we've moved to 1 Primary, 1 Secondary and an Arbiter, which helps with cost savings. Probably would not have done that without the ease of recovery and overall reliability of using EBS volumes on the Primary and Secondary.

Hosting and scaling mongodb

I'm looking for a hosting service for my MongoDB database, such as MongoLab, MongoHQ, Heroku, AWS EBS, etc.
What I need is a service that auto-scales my storage size.
Is there a way (or a service) to auto-scale MongoDB? How?
There are many hosting providers for MongoDB that provide scaling solutions.
Storage size is only one dimension to scaling; there are other resources to monitor like memory, CPU, and I/O throughput.
There are several different approaches you can use to scale your MongoDB storage size:
1) Deploy a MongoDB replica set and scale vertically
The minimum footprint you would need is two data bearing nodes (a primary and a secondary) plus a third node which is either another data node (secondary) or a voting-only node without data (arbiter).
As your storage needs change, you can adjust the storage on your secondaries by temporarily stopping the mongod service and doing the O/S changes to adjust storage provisioning on your dbpath. After you adjust each secondary, you would restart mongod and allow that secondary to catch up on replication sync before changing the next secondary. Lastly, you would step down and upgrade the primary.
2) Deploy a MongoDB sharded cluster and scale horizontally.
A sharded cluster allows you to partition databases across a number of shards, which are typically backed by replica sets (although technically a shard can be a standalone mongod). In this case you can add (or remove) shards based on your requirements. Changing the number of shards can have a notable performance impact due to chunk migration across shards, so this isn't something you'd do too reactively (i.e. far more likely on a daily or weekly basis rather than hourly).
3) Let a hosted database-as-a-service provider take care of the details for you :)
Replica sets and sharding are the basic building blocks, but hosted providers can often take care of the details for you. Armed with this terminology you should be able to approach hosting providers and ask them about available plan options and costing.

How does mongodb replica compare with amazon ebs?

I am new to mongodb and amazon ec2.
It seems to me that Mongo replicas are there to: 1) avoid data loss and 2) make reads and serving faster.
In Amazon they have this EBS thing. From what I understand it is global persistent storage, like Dropbox for instance.
So is there a need for replicas if Amazon abstracts that away with EBS?
Thanks in advance
Thomas
Let me clarify a couple of things.
EBS is essentially a SAN volume, if you are used to working with existing technologies. It can be attached to one instance at a time, but it still has limited I/O throughput. Using RAID can help maximize the I/O; provisioned IOPS can help you maximize the throughput.
Ideally however, with MongoDB, you want to have enough memory where indexes can be completely accessed within memory, performance drops if the disk needs to be hit.
Mongo can use replica sets, which are primarily used for failover and replication (you can send reads to a secondary, but all writes must hit the primary), and sharding, which is used to split a dataset to increase performance. You will still need these things anyway even if you are using EBS for storage.
Replicas are there not just for storage redundancy but also for server redundancy. What happens if your MongoDB server (which uses an EBS volume) suddenly disappears because, for example, the host on which it sits fails? You would need to do a whole bunch of stuff: clone a new instance to replace it, attach the volume to that instance, reroute traffic to it, etc. Mongo's replica sets mean you don't have to do that. They keep working even if one member fails, so you have essentially zero downtime.
Additionally, it's one more layer of redundancy. You can only trust EBS so far - what if AWS has a bug that erases your volume or that makes it unavailable for an unacceptably long time? With replica sets you can even replicate your data across availability zones or even to a completely different cloud provider.
Replica sets also let you read from multiple nodes, so you can increase your read throughput, theoretically, after you've maxed out what the EBS connection gives you from one instance.