Hosting and scaling MongoDB

I'm looking for a hosting service for my MongoDB database, such as MongoLab, MongoHQ, Heroku, AWS EBS, etc.
What I need is one of these services (or another) that can auto-scale my storage size.
Is there a way (or a service) to auto-scale MongoDB? How?

There are many hosting providers for MongoDB that provide scaling solutions.
Storage size is only one dimension to scaling; there are other resources to monitor like memory, CPU, and I/O throughput.
There are several different approaches you can use to scale your MongoDB storage size:
1) Deploy a MongoDB replica set and scale vertically
The minimum footprint you would need is two data bearing nodes (a primary and a secondary) plus a third node which is either another data node (secondary) or a voting-only node without data (arbiter).
As your storage needs change, you can adjust the storage on your secondaries by temporarily stopping the mongod service and doing the O/S changes to adjust storage provisioning on your dbpath. After you adjust each secondary, you would restart mongod and allow that secondary to catch up on replication sync before changing the next secondary. Lastly, you would step down and upgrade the primary.
2) Deploy a MongoDB sharded cluster and scale horizontally.
A sharded cluster allows you to partition databases across a number of shards, which are typically backed by replica sets (although technically a shard can be a standalone mongod). In this case you can add (or remove) shards based on your requirements. Changing the number of shards can have a notable performance impact due to chunk migration across shards, so this isn't something you'd do too reactively (i.e. far more likely on a daily or weekly basis rather than hourly).
3) Let a hosted database-as-a-service provider take care of the details for you :)
Replica sets and sharding are the basic building blocks, but hosted providers can often take care of the details for you. Armed with this terminology you should be able to approach hosting providers and ask them about available plan options and costing.
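As a rough illustration of the rolling approach in option 1, here is a minimal Python sketch that only works out the order in which members would be touched: secondaries one at a time, then the primary after a step-down. The member list, host names and the `rolling_maintenance_order` helper are made up for the example; it does not talk to a real deployment.

```python
from typing import Dict, List


def rolling_maintenance_order(members: List[Dict[str, str]]) -> List[str]:
    """Return hosts in the order a rolling storage change would touch them.

    Secondaries come first (one at a time), the primary comes last,
    after it has been stepped down. Arbiters hold no data and are skipped.
    """
    secondaries = [m["host"] for m in members if m["state"] == "SECONDARY"]
    primaries = [m["host"] for m in members if m["state"] == "PRIMARY"]
    return secondaries + primaries


# Hypothetical three-member replica set (host names are placeholders).
members = [
    {"host": "db1.example.net:27017", "state": "PRIMARY"},
    {"host": "db2.example.net:27017", "state": "SECONDARY"},
    {"host": "db3.example.net:27017", "state": "ARBITER"},
]

order = rolling_maintenance_order(members)
for step, host in enumerate(order, start=1):
    print(step, host)
```

In practice you would wait for each secondary to catch up on replication before moving on to the next member.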

Related

Do mongodb databases within a cluster share the same node set under the hood?

The reason I am asking is, I have a resource-intensive collection that degrades performance of its entire database. I need to decide whether to migrate other collections away to a different database within the same cluster or to a different cluster altogether.
The answer I think depends on under-the-hood implementation. Does a poorly performing collection take resources only from its own database, or from the cluster as a whole?
Hosted on Atlas.
I would suggest first looking at your logical and schema design and trying to optimize it. If that does not help, then:
"In MongoDB Atlas, all databases within a cluster share the same set of nodes (servers) and are subject to the same resource limitations. Each database has its own logical namespace and operates independently from the other databases, but they share the same underlying hardware resources, such as CPU, memory, and I/O bandwidth.
So, if you have a resource-intensive collection that is degrading performance for its entire database, migrating other collections to a different database within the same cluster may not significantly improve performance if the resource bottleneck is at the cluster level. In this case, you may need to consider scaling up the cluster or upgrading to a higher-tier plan to increase the available resources and improve overall cluster performance."
Reference: https://www.mongodb.com/community/forums/t/creating-a-new-database-vs-a-new-collection-vs-a-new-cluster/99187/2
The term "cluster" is overloaded. It can refer to a replica set or to a sharded cluster.
A sharded cluster is effectively a group of replica sets with a query router.
If you are using a sharded cluster, you can design a sharding strategy that will put the busy collection on its own shard, the rest of the data on the other shard(s), and still have a common point to query them both.

How to handle different server types in MongoDB sharded cluster

Is there a way to deal with different server types in a sharded cluster? According to MongoDB documentation the balancer attempts to achieve an even distribution of chunks across all shards in the cluster. So, it purely seems to be based on the amount of data.
However, when you add new servers to an existing sharded cluster, the new servers typically have more disk space, faster disks, and more powerful CPUs. When an application runs for several years, this is likely to happen.
Does the balancer take such topics into account or do you have to ensure that all servers in a sharded cluster have similar performance and resources?
You are correct that the balancer assumes that all parts of the cluster run on similar hardware. However, you can use zone sharding to tailor the behaviour of the balancer.
To quote from the zone sharding docs page:
In sharded clusters, you can create zones of sharded data based on the shard key. You can associate each zone with one or more shards in the cluster. A shard can associate with any number of zones. In a balanced cluster, MongoDB migrates chunks covered by a zone only to those shards associated with the zone.
Using zones, you can specify data distribution to be by location, by hardware spec, by application/customer, and others.
To directly answer your question, the use case you'll be most interested in would be Tiered Hardware for Varying SLA or SLO. Please see the link for a tutorial on how to achieve this.
Note that defining the zones is a design decision on your part, and there is currently no automated way for the server to do this for you.
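To make the zone idea a bit more concrete, here is a small Python sketch that only builds the admin command documents for `addShardToZone` and `updateZoneKeyRange` without sending them anywhere. The namespace, shard names, zone names, date boundaries and the `zone_commands` helper are all assumptions for illustration; open-ended ranges would normally use the BSON MinKey/MaxKey values instead of explicit dates.

```python
from datetime import datetime
from typing import Any, Dict, List


def zone_commands(namespace: str, shard_key: str,
                  zones: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Build (but do not run) the admin commands that define the zones."""
    commands: List[Dict[str, Any]] = []
    for zone in zones:
        # Associate each shard with the zone.
        for shard in zone["shards"]:
            commands.append({"addShardToZone": shard, "zone": zone["name"]})
        # Pin a shard key range to the zone (min inclusive, max exclusive).
        commands.append({
            "updateZoneKeyRange": namespace,
            "min": {shard_key: zone["min"]},
            "max": {shard_key: zone["max"]},
            "zone": zone["name"],
        })
    return commands


# Hypothetical tiered-hardware layout: recent data on the newer, faster shard.
zones = [
    {"name": "recent", "shards": ["shard-fast-0"],
     "min": datetime(2024, 1, 1), "max": datetime(2030, 1, 1)},
    {"name": "archive", "shards": ["shard-slow-0"],
     "min": datetime(2000, 1, 1), "max": datetime(2024, 1, 1)},
]

commands = zone_commands("app.events", "creation_date", zones)
for command in commands:
    print(command)
```

Each printed document is the kind of command you would run against the admin database through mongos; the shell helpers `sh.addShardToZone()` and `sh.updateZoneKeyRange()` wrap the same commands.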
Small note: the balancer works on chunks, which are ranges of shard key values, rather than on the raw amount of data stored (releases before MongoDB 6.0 even out the number of chunks per shard). With a poorly designed shard key, it is therefore possible for some shards to overflow with data while others stay almost empty. In a pathological case, some chunks cannot be split, which leaves the cluster unbalanced until the shard key is redesigned.

MongoDB architecture for scalable read-heavy app (constant writes)

My app runs a daily job that collects data and feeds it into MongoDB. This data is processed and then exposed via a REST API.
I need to set up a MongoDB cluster in AWS. The requirements:
Data will grow by about the same amount each day (about 50M records), so write throughput doesn't need to scale. Writes are triggered by a cron job at a certain hour. Objects are immutable (they won't grow).
Read throughput will depend on the number of users / traffic, so it should be scalable. Traffic won't be heavy in the beginning.
Data is mostly simple JSON; I need a couple of indexes on some of the fields for fast querying / filtering.
What kind of architecture should I use in terms of replica sets, shards, etc.?
What kind of storage volumes should I use for this architecture (EBS, NVMe)?
Is it preferable to use more instances or to use RAID setups?
I'm looking to spend about ~500 a month.
Thanks in advance
To set up the MongoDB cluster in AWS, I would recommend referring to the latest AWS Quick Start for MongoDB, which covers the architectural aspects and also provides CloudFormation templates.
For the storage volumes, use EC2 instance types backed by EBS rather than NVMe instance store volumes, because instance store is ephemeral: if you stop and start the EC2 instance, the data on it is lost.
For storage throughput, you can start with General Purpose SSD volumes of a reasonable size and consider Provisioned IOPS only if you hit limitations.
For high availability and fault tolerance, the CloudFormation template creates multiple instances (nodes) in the MongoDB cluster.
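To pick a "reasonable" volume size, a back-of-envelope estimate may help. The sketch below is plain arithmetic in Python; only the 50M records per day comes from the question. The average record size, index overhead and volume size are assumptions you would replace with your own measurements, and storage compression is ignored.

```python
def daily_growth_gb(records_per_day: int, avg_record_bytes: int,
                    index_overhead: float) -> float:
    """Approximate storage added per day, in GB (10**9 bytes)."""
    return records_per_day * avg_record_bytes * (1 + index_overhead) / 1e9


def days_until_full(volume_gb: float, used_gb: float, growth_gb: float) -> float:
    """How many days of growth fit into the remaining space on the volume."""
    return (volume_gb - used_gb) / growth_gb


# 50M records/day is from the question; everything else is an assumption.
growth = daily_growth_gb(records_per_day=50_000_000,
                         avg_record_bytes=500,   # assumed average document size
                         index_overhead=0.2)     # assumed 20% extra for indexes
days = days_until_full(volume_gb=1000, used_gb=0, growth_gb=growth)

print(f"about {growth:.0f} GB per day, a 1000 GB volume lasts about {days:.0f} days")
```

If the numbers come out anywhere near this, storage growth rather than read traffic is likely to drive the first resize, so it is worth measuring real document sizes early.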

In Mongo what is the difference between sharding and replication?

Replication seems to be a lot simpler than sharding, unless I am missing the benefits of what sharding is actually trying to achieve. Don't they both provide horizontal scaling?
In the context of scaling MongoDB:
replication creates additional copies of the data and allows for automatic failover to another node. Replication may help with horizontal scaling of reads if you are OK to read data that potentially isn't the latest.
sharding allows for horizontal scaling of data writes by partitioning data across multiple servers using a shard key. It's important to choose a good shard key. For example, a poor choice of shard key could lead to "hot spots" of data only being written on a single shard.
A sharded environment does add more complexity because MongoDB now has to manage distributing data and requests between shards -- additional configuration and routing processes are added to manage those aspects.
Replication and sharding are typically combined to create a sharded cluster where each shard is supported by a replica set.
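The "hot spot" problem is easy to see in a toy simulation. The Python sketch below routes 1,000 new, monotonically increasing keys to four made-up shards, once by key range and once by a hash of the key. The shard names, range boundaries and the use of MD5 with a modulo are simplifications for illustration only, not how MongoDB implements hashed sharding internally.

```python
import bisect
import hashlib
from collections import Counter

SHARDS = ["shard0", "shard1", "shard2", "shard3"]
# Exclusive upper bounds of the key ranges owned by the first three shards;
# the last shard owns everything from 750_000 upwards.
RANGE_BOUNDS = [250_000, 500_000, 750_000]


def route_by_range(key: int) -> str:
    """Pick the shard whose key range contains the key."""
    return SHARDS[bisect.bisect_right(RANGE_BOUNDS, key)]


def route_by_hash(key: int) -> str:
    """Pick a shard from a hash of the key (simplified stand-in)."""
    digest = hashlib.md5(str(key).encode()).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


# New documents get ever-increasing keys, e.g. a counter or a timestamp.
new_keys = range(900_000, 901_000)
by_range = Counter(route_by_range(k) for k in new_keys)
by_hash = Counter(route_by_hash(k) for k in new_keys)

print("range-based:", dict(by_range))  # every write lands on the last shard
print("hash-based: ", dict(by_hash))   # writes are spread across the shards
```

With the range-based key, all new writes hit a single shard, so adding shards does not add write capacity. Hashing spreads the writes, at the cost of making range queries on that key touch more shards.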
From a client application point of view you also have some control in relation to the replication/sharding interaction, in particular:
Read preferences
Write concerns
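Both settings can be expressed as connection string options. As a minimal sketch, the Python snippet below assembles a `mongodb://` URI with the `replicaSet`, `readPreference` and `w` options; the host names, the replica set name `rs0` and the `build_uri` helper are placeholders, and nothing here connects to a server.

```python
from typing import List
from urllib.parse import urlencode


def build_uri(hosts: List[str], replica_set: str,
              read_preference: str = "primary", w: str = "majority") -> str:
    """Assemble a MongoDB connection string with read/write options."""
    options = urlencode({
        "replicaSet": replica_set,
        "readPreference": read_preference,  # where reads may be served from
        "w": w,                             # how many members must acknowledge a write
    })
    return "mongodb://" + ",".join(hosts) + "/?" + options


# Hypothetical hosts; reads may go to secondaries and can therefore be slightly stale.
uri = build_uri(
    ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"],
    replica_set="rs0",
    read_preference="secondaryPreferred",
)
print(uri)
```

Reading with `secondaryPreferred` trades freshness for read capacity, while `w=majority` makes a write wait until most data-bearing members have acknowledged it.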
Consider you have a great music collection on your hard disk, you store the music in logical order based on year of release in different folders.
You are concerned that your collection will be lost if drive fails.
So you get a new disk and occasionally copy the entire collection keeping the same folder structure.
Sharding >> Keeping your music files in different folders
Replication >> Syncing your collection to other drives
Replication is a mostly traditional master/slave setup, data is synced to backup members and if the primary fails one of them can take its place. It is a reasonably simple tool. It's primarily meant for redundancy, although you can scale reads by adding replica set members. That's a little complicated, but works very well for some apps.
Sharding sits on top of replication, usually. "Shards" in MongoDB are just replica sets with something called a "router" in front of them. Your application will connect to the router, issue queries, and it will decide which replica set (shard) to forward things on to. It's significantly more complex than a single replica set because you have the router and config servers to deal with (these keep track of what data is stored where).
If you want to scale Mongo horizontally, you'd shard. 10gen likes to call the router/config server setup auto-sharding. It's also possible to do a cruder, manual form of sharding where the app itself decides which DB to write to.
Sharding
Sharding is a technique for splitting up a large collection among multiple servers. When we shard, we deploy multiple mongod servers, with mongos, a router, in front of them. The application talks to this router, which in turn talks to the various mongod servers. The application and mongos are usually co-located on the same server, and multiple mongos services can run on the same machine. It's also recommended to run a set of multiple mongods (together called a replica set) for each shard, instead of one single mongod. A replica set keeps the data in sync across several instances so that if one of them goes down, we won't lose any data. Logically, each replica set can be seen as a shard. Sharding is transparent to the application; MongoDB distributes the data according to a shard key that we choose.
Assume that for a student collection we have stdt_id as the shard key (it could also be a compound key). Routing in mongos is range based: given the stdt_id that we send as the shard key, it sends the request to the right mongod instance.
So, what do we need to really know as a developer?
an insert must include the shard key; if it's a multi-part shard key, we must include the entire shard key
we have to understand what the shard key is on the collection itself
for an update, remove, or find, if mongos is not given a shard key, it has to broadcast the request to all the shards that cover the collection
for an update, if we don't specify the entire shard key, we have to make it a multi update so that mongos knows it needs to broadcast it
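The difference between a targeted request and a broadcast can be sketched in a few lines of Python. This is a toy model of what the router decides, not mongos itself: the chunk boundaries, shard names and the `target_shards` helper are invented, and only simple equality on an integer stdt_id is handled.

```python
import bisect
from typing import Dict, List

SHARD_KEY = "stdt_id"
# Toy chunk map: ids below 1000 on shardA, 1000-1999 on shardB, the rest on shardC.
CHUNK_UPPER_BOUNDS = [1000, 2000]
CHUNK_OWNERS = ["shardA", "shardB", "shardC"]


def target_shards(query: Dict[str, object]) -> List[str]:
    """Return the shards a query has to be sent to."""
    value = query.get(SHARD_KEY)
    if isinstance(value, int):
        # Shard key present: the request goes to exactly one shard.
        return [CHUNK_OWNERS[bisect.bisect_right(CHUNK_UPPER_BOUNDS, value)]]
    # No shard key: the request is broadcast to every shard.
    return sorted(set(CHUNK_OWNERS))


targeted = target_shards({"stdt_id": 1500})
broadcast = target_shards({"name": "Ann"})
print("with shard key:   ", targeted)
print("without shard key:", broadcast)
```

The more of your common queries include the shard key, the fewer shards each request has to touch.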
Whenever you're thinking about sharding or replication, you need to think in terms of write/update operations. If you don't need to scale writes, then replication, being much simpler, is a good choice for you.
On the other hand, if your workload is mostly updates/writes, then at some point you'll hit a write bottleneck. In a replica set every write goes to the single primary, so concurrent writes compete for that one server and may have to wait for each other. If you want to scale these writes and parallelize them, you need to implement sharding.
Just to put this somewhere...
The most basic way to run mongo is as standalone server.
You write a config (file or cli options)
initiate the server using mongod
For this picture, I didn't include the "client". Check the next one.
A replica set is a set of servers initialized exactly as above with a different config file.
To link them, we connect to one of them, and initialize the replica set mode.
They will mirror each other (in the most common configuration). This system guarantees high availability of data.
The initialization of the replica set is represented in the red border box.
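For reference, the document used to initialize a replica set has a simple shape: a set name plus a list of members, each with a numeric `_id` and a `host`. The Python sketch below only builds that document; the set name `rs0`, the host names and the `replica_set_config` helper are placeholders. In the mongo shell the same structure would be passed to `rs.initiate()`.

```python
from typing import Any, Dict, List


def replica_set_config(name: str, hosts: List[str]) -> Dict[str, Any]:
    """Build a replica set configuration document for the given hosts."""
    return {
        "_id": name,  # must match the replica set name each mongod was started with
        "members": [{"_id": i, "host": host} for i, host in enumerate(hosts)],
    }


# Hypothetical three-member set; each mongod is assumed to be running already.
config = replica_set_config("rs0", [
    "db1.example.net:27017",
    "db2.example.net:27017",
    "db3.example.net:27017",
])
print(config)
```

You connect to one of the members, initiate with this document, and the members then elect a primary among themselves.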
Sharding is not about replicating data, but about fragmenting data.
Each fragment of data is called a chunk and goes to a different shard. Each shard is a replica set.
The "main" server runs mongos instead of mongod. This is a router for queries from the client.
Obvious: The trade-off is a more complex architecture.
Novelty: configuration server (again, a different config file).
There is much more to add, but apart from the words the pictures hold much the same.
Even MongoDB recommends studying your case carefully before going for sharding. Vertical scaling (vs) is probably a good idea at least once before horizontal scaling (hs).
vs is done by upgrading hardware (CPU, RAM, etc.). hs needs more computers (but they could be cheap computers).
Both replication and sharding can be used (individually or together) for horizontal scaling of a MongoDB installation.
Sharding is MongoDB's solution for meeting the demands of data growth. Sharding stores data records across multiple servers to provide faster throughput on read and write queries, particularly for very large data sets.
Each shard in the cluster responds to read and write operations for its own portion of the data, which spreads the load and can greatly speed up query responses.
Replication is MongoDB's solution for providing stability, backup, and disaster recovery to a MongoDB installation. This process copies and synchronizes the replica data set across multiple servers. This prevents downtime if one server goes offline.
Any of the secondary servers can respond to read queries, but only the primary server will perform write operations. The results of the write operation will then be propagated out to the secondary servers.
Scenario 1: Fault-Tolerance
In this scenario, the user is storing billing data in a MongoDB installation. This data is mission-critical to the user's business, and needs to be available 24/7, even if a server crashes or is taken offline.
MongoDB replication is the best solution for this user. With replication, the entire data set is mirrored on multiple servers. If a server fails or is taken offline, the other servers in the cluster take over.
Scenario 2: High Performance
In this scenario, the user is running a social networking site which is run from a MongoDB database. As the social network grows, the MongoDB data set has grown along with it. The user is seeing query times and page loads increase beyond an acceptable point. It is critical that the user's MongoDB installation receives a major performance boost.
Setting up a sharded MongoDB cluster is the best solution for this user. The sharded cluster will break up the user's data set and store parts of it on separate shards. Each shard can respond to read or write queries on its portion of the data, which greatly improves the installation's response time.
MongoDB Atlas is a database as a service in the cloud. It supports three major cloud providers: Azure, AWS, and GCP. In a cloud environment, we usually talk about high availability and scalability. In Atlas, a "cluster" can be either a replica set or a sharded cluster.
These two address the high availability and scalability needs of a cloud environment.
In general, a cluster is a group of servers used to achieve a specific task. Sharded clusters store data across multiple machines to meet the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data or to provide acceptable read and write throughput. Sharded clusters support horizontal scaling on the underlying cloud environment.
A replica set in MongoDB is a group of mongod processes that maintain the same data set. Replica sets provide redundancy and high availability, and are the basis for all production deployments. In a replica set, one node is the primary, which receives all write operations. All other instances, the secondaries, apply operations from the primary so that they have the same data set. Replica sets mainly focus on the availability of data.
Please check the documentation
Thank You.

Does MongoDB require at least 2 server instances to prevent the loss of data?

I have decided to start developing a little web application in my spare time so I can learn about MongoDB. I was planning to get an Amazon AWS micro instance and start the development and the alpha stage there. However, I stumbled across a question here on Stack Overflow that concerned me:
But for durability, you need to use at least 2 mongodb server instances as master/slave. Otherwise you can lose the last minute of your data.
Is that true? Can't I just have my box with everything installed on it (Apache, PHP, MongoDB) and rely on the data being correctly stored? At least, there must be a config option in MongoDB to make it behave reliably even if installed on a single box - isn't there?
The information you have on master/slave setups is outdated. Running single-server MongoDB with journaling is a durable data store, so for use cases where you don't need replica sets or if you're in development stage, then journaling will work well.
However if you're in production, we recommend using replica sets. For the bare minimum set up, you would ideally run three (or more) instances of mongod, a 'primary' which receives reads and writes, a 'secondary' to which the writes from the primary are replicated, and an arbiter, a single instance of mongod that allows a vote to take place should the primary become unavailable. This 'automatic failover' means that, should your primary be unable to receive writes from your application at a given time, the secondary will become the primary and take over receiving data from your app.
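The reason for three members (or two plus an arbiter) is the election majority: a primary can only be elected when more than half of the voting members can see each other. A few lines of Python show the arithmetic; the function names are just for this example.

```python
def majority(voting_members: int) -> int:
    """Votes needed to elect a primary: strictly more than half."""
    return voting_members // 2 + 1


def fault_tolerance(voting_members: int) -> int:
    """How many voting members can be lost while a primary can still be elected."""
    return voting_members - majority(voting_members)


for n in (1, 2, 3, 4, 5):
    print(f"{n} voting members: majority {majority(n)}, "
          f"can lose {fault_tolerance(n)}")
```

With two members, losing either one leaves the survivor without a majority, so there is no automatic failover. A third voting member, even an arbiter with no data, is what makes failover possible.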
You can read more about journaling here and replication here, and you should definitely familiarize yourself with the documentation in general in order to get a better sense of what MongoDB is all about.
Replication provides redundancy and increases data availability. With multiple copies of data on different database servers, replication protects a database from the loss of a single server. Replication also allows you to recover from hardware failure and service interruptions. With additional copies of the data, you can dedicate one to disaster recovery, reporting, or backup.
In some cases, you can use replication to increase read capacity. Clients have the ability to send read and write operations to different servers. You can also maintain copies in different data centers to increase the locality and availability of data for distributed applications.
Replication in MongoDB
A replica set is a group of mongod instances that host the same data set. One mongod, the primary, receives all write operations. All other instances, secondaries, apply operations from the primary so that they have the same data set.
The primary accepts all write operations from clients. A replica set can have only one primary. Because only one member can accept write operations, replica sets provide strict consistency. To support replication, the primary logs all changes to its data sets in its oplog. See primary for more information.
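The oplog mechanism can be pictured with a toy model. In the Python sketch below, a primary appends every write to an in-memory log and a secondary replays the entries it has not applied yet. The classes and the `replicate` function are invented for illustration and are far simpler than the real oplog, which stores operation documents in a capped collection.

```python
from typing import Any, Dict, List, Tuple


class Member:
    """A data-bearing replica set member (toy model)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, Any] = {}
        self.applied = 0  # number of oplog entries applied so far


class Primary(Member):
    """The only member that accepts writes; it logs each one."""

    def __init__(self, name: str) -> None:
        super().__init__(name)
        self.oplog: List[Tuple[str, Any]] = []

    def write(self, key: str, value: Any) -> None:
        self.data[key] = value
        self.oplog.append((key, value))
        self.applied = len(self.oplog)


def replicate(primary: Primary, secondary: Member) -> None:
    """Apply the oplog entries the secondary has not seen yet, in order."""
    for key, value in primary.oplog[secondary.applied:]:
        secondary.data[key] = value
        secondary.applied += 1


primary = Primary("db1")
secondary = Member("db2")
primary.write("a", 1)
primary.write("b", 2)

stale = dict(secondary.data)  # what a read from the secondary sees before it catches up
replicate(primary, secondary)
print("before sync:", stale)
print("after sync: ", secondary.data)
```

The gap between the two printouts is the replication lag that makes secondary reads potentially stale.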