How to handle different server types in MongoDB sharded cluster - mongodb

Is there a way to deal with different server types in a sharded cluster? According to MongoDB documentation the balancer attempts to achieve an even distribution of chunks across all shards in the cluster. So, it purely seems to be based on the amount of data.
However, when you add new servers to an existing sharded cluster then typically the new server has more disc space, disc is faster and CPU has more power. Especially when you run an application for several years then this condition might come a fact.
Does the balancer take such topics into account or do you have to ensure that all servers in a sharded cluster have similar performance and resources?

You are correct that the balancer would assume that all parts of the cluster is of similar hardware. However you can use zone sharding to custom tailor the behaviour of the balancer.
To quote from the zone sharding docs page:
In sharded clusters, you can create zones of sharded data based on the shard key. You can associate each zone with one or more shards in the cluster. A shard can associate with any number of zones. In a balanced cluster, MongoDB migrates chunks covered by a zone only to those shards associated with the zone.
Using zones, you can specify data distribution to be by location, by hardware spec, by application/customer, and others.
To directly answer your question, the use case you'll be most interested in would be Tiered Hardware for Varying SLA or SLO. Please see the link for a tutorial on how to achieve this.
Note that defining the zones is a design decision on your part, and there is currently no automated way for the server to do this for you.
Small note: the balancer balances the cluster purely using the shard key. It doesn't take into account the amount of data at all. Thus in an improperly designed shard key, it is possible to have some shard overflowing with data while others are completely empty. In a pathological mis-design case, some chunks are not divisible, leading to a situation where the cluster is forever unbalanced until an extensive redesign is done.

Related

Do mongodb databases within a cluster share the same node set under the hood?

The reason I am asking is, I have a resource-intensive collection that degrades performance of its entire database. I need to decide whether to migrate other collections away to a different database within the same cluster or to a different cluster altogether.
The answer I think depends on under-the-hood implementation. Does a poorly performing collection take resources only from its own database, or from the cluster as a whole?
Hosted on Atlas.
I would suggest first look at your logical and schema designs and try to optimize it but if that is not working then
"In MongoDB Atlas, all databases within a cluster share the same set of nodes (servers) and are subject to the same resource limitations. Each database has its own logical namespace and operates independently from the other databases, but they share the same underlying hardware resources, such as CPU, memory, and I/O bandwidth.
So, if you have a resource-intensive collection that is degrading performance for its entire database, migrating other collections to a different database within the same cluster may not significantly improve performance if the resource bottleneck is at the cluster level. In this case, you may need to consider scaling up the cluster or upgrading to a higher-tier plan to increase the available resources and improve overall cluster performance."
Reference: https://www.mongodb.com/community/forums/t/creating-a-new-database-vs-a-new-collection-vs-a-new-cluster/99187/2
The term "cluster" is overloaded. It can refer to a replica set or to a sharded cluster.
A sharded cluster is effectively a group of replica set with a query router.
If you are using a sharded cluster, you can design a sharding strategy that will put the busy collection on its own shard, the rest of the data on the other shard(s), and still have a common point to query them both.

Sharding with replication

Sharding with replication]1
I have a multi tenant database with 3 tables(store,products,purchases) in 5 server nodes .Suppose I've 3 stores in my store table and I am going to shard it with storeId .
I need all data for all shards(1,2,3) available in nodes 1 and 2. But node 3 would contain only shard for store #1 , node 4 would contain only shard for store #2 and node 5 for shard #3. It is like a sharding with 3 replicas.
Is this possible at all? What database engines can be used for this purpose(preferably sql dbs)? Did you have any experience?
Regards
I have a feeling you have not adequately explained why you are trying this strange topology.
Anyway, I will point out several things relating to MySQL/MariaDB.
A Galera cluster already embodies multiple nodes (minimum of 3), but does not directly support "sharding". You can have multiple Galera clusters, one per "shard".
As with my comment about Galera, other forms of MySQL/MariaDB can have replication between nodes of each shard.
If you are thinking of having a server with all data, but replicate only parts to readonly Replicas, there are settings for replicate_do/ignore_database. I emphasize "readonly" because changes to these pseudo-shards cannot easily be sent back to the Primary server. (However see "multi-source replication")
Sharding is used primarily when there is simply too much traffic to handle on a single server. Are you saying that the 3 tenants cannot coexist because of excessive writes? (Excessive reads can be handled by replication.)
A tentative solution:
Have all data on all servers. Use the same Galera cluster for all nodes.
Advantage: When "most" or all of the network is working all data is quickly replicated bidirectionally.
Potential disadvantage: If half or more of the nodes go down, you have to manually step in to get the cluster going again.
Likely solution for the 'disadvantage': "Weight" the nodes differently. Give a height weight to the 3 in HQ; give a much smaller (but non-zero) weight to each branch node. That way, most of the branches could go offline without losing the system as a whole.
But... I fear that an offline branch node will automatically become readonly.
Another plan:
Switch to NDB. The network is allowed to be fragile. Consistency is maintained by "eventual consistency" instead of the "[virtually] synchronous replication" of Galera+InnoDB.
NDB allows you to immediately write on any node. Then the write is sent to the other nodes. If there is a conflict one of the values is declared the "winner". You choose which algorithm for determining the winner. An easy-to-understand one is "whichever write was 'first'".

Running MongoDB on heterogeneous cluster

We're currently trying to find a solution, where we can host tens of millions of images of various sizes on our website. We're evaluating Riak and MongoDB. Currently Riak looks very nice, but all the servers in the cluster should be homogeneous, because Riak treats each node equally.
It is a bit hard to find information about MongoDB regarding to this, except the fact, that you can set a priority on the nodes. My questions are:
What is the consequence of creating a MongoDB cluster composed of machines with wildly varying specifications (cpu, diskspace, disk speed, amount and speed of memory, network speed)?
Can the MongoDB cluster detect and compensate for such differences automatically?
I'm also open for ideas for other solutions, where heterogeneous cluster would work nicely.
What is the consequence of creating a MongoDB cluster composed of machines with wildly varying specifications (cpu, diskspace, disk speed, amount and speed of memory, network speed)?
Each shard in MongoDB is its own mongod, an isolated MongoDB in itself.
It isn't tied by the performance of others, so say you have a shard that requires more power, you can just upgrade its hardware and job is done. MongoDB likes environments like this and actively promotes them.
Can the MongoDB cluster detect and compensate for such differences automatically?
It doesn't need to detect but it will automatically compensate, naturally without any work within the mongos or configsrv
In fact you will find you can even run different MongoDB versions.

Hosting and scaling mongodb

I'm looking for a hosting service to host my mongodb database, such as MongoLab-MongoHQ-Heroku-AWS EBS, etc.
What I need is to find one of this services (or another) that provides auto-scaling my storage size.
Is there a way (service) to auto-scale mongodb? How?
There are many hosting providers for MongoDB that provide scaling solutions.
Storage size is only one dimension to scaling; there are other resources to monitor like memory, CPU, and I/O throughput.
There are several different approaches you can use to scale your MongoDB storage size:
1) Deploy a MongoDB replica set and scale vertically
The minimum footprint you would need is two data bearing nodes (a primary and a secondary) plus a third node which is either another data node (secondary) or a voting-only node without data (arbiter).
As your storage needs change, you can adjust the storage on your secondaries by temporarily stopping the mongod service and doing the O/S changes to adjust storage provisioning on your dbpath. After you adjust each secondary, you would restart mongod and allow that secondary to catch up on replication sync before changing the next secondary. Lastly, you would step down and upgrade the primary.
2) Deploy a MongoDB sharded cluster and scale horizontally.
A sharded cluster allows you to partition databases across a number of shards, which are typically backed by replica sets (although technically a shard can be a standalone mongod). In this case you can add (or remove) shards based on your requirements. Changing the number of shards can have a notable performance impact due to chunk migration across shards, so this isn't something you'd do too reactively (i.e. far more likely on a daily or weekly basis rather than hourly).
3) Let a hosted database-as-a-service provider take care of the details for you :)
Replica sets and sharding are the basic building blocks, but hosted providers can often take care of the details for you. Armed with this terminology you should be able to approach hosting providers and ask them about available plan options and costing.

In Mongo what is the difference between sharding and replication?

Replication seems to be a lot simpler than sharding, unless I am missing the benefits of what sharding is actually trying to achieve. Don't they both provide horizontal scaling?
In the context of scaling MongoDB:
replication creates additional copies of the data and allows for automatic failover to another node. Replication may help with horizontal scaling of reads if you are OK to read data that potentially isn't the latest.
sharding allows for horizontal scaling of data writes by partitioning data across multiple servers using a shard key. It's important to choose a good shard key. For example, a poor choice of shard key could lead to "hot spots" of data only being written on a single shard.
A sharded environment does add more complexity because MongoDB now has to manage distributing data and requests between shards -- additional configuration and routing processes are added to manage those aspects.
Replication and sharding are typically combined to created a sharded cluster where each shard is supported by a replica set.
From a client application point of view you also have some control in relation to the replication/sharding interaction, in particular:
Read preferences
Write concerns
Consider you have a great music collection on your hard disk, you store the music in logical order based on year of release in different folders.
You are concerned that your collection will be lost if drive fails.
So you get a new disk and occasionally copy the entire collection keeping the same folder structure.
Sharding >> Keeping your music files in different folders
Replication >> Syncing your collection to other drives
Replication is a mostly traditional master/slave setup, data is synced to backup members and if the primary fails one of them can take its place. It is a reasonably simple tool. It's primarily meant for redundancy, although you can scale reads by adding replica set members. That's a little complicated, but works very well for some apps.
Sharding sits on top of replication, usually. "Shards" in MongoDB are just replica sets with something called a "router" in front of them. Your application will connect to the router, issue queries, and it will decide which replica set (shard) to forward things on to. It's significantly more complex than a single replica set because you have the router and config servers to deal with (these keep track of what data is stored where).
If you want to scale Mongo horizontally, you'd shard. 10gen likes to call the router/config server setup auto-sharding. It's possible to do a more ghetto form of sharding where you have the app decide which DB to write to as well.
Sharding
Sharding is a technique of splitting up a large collection amongst multiple servers. When we shard, we deploy multiple mongod servers. And in the front, mongos which is a router. The application talks to this router. This router then talks to various servers, the mongods. The application and the mongos are usually co-located on the same server. We can have multiple mongos services running on the same machine. It's also recommended to keep set of multiple mongods (together called replica set), instead of one single mongod on each server. A replica set keeps the data in sync across several different instances so that if one of them goes down, we won't lose any data. Logically, each replica set can be seen as a shard. It's transparent to the application, the way MongoDB chooses to shard is we choose a shard key.
Assume, for student collection we have stdt_id as the shard key or it could be a compound key. And the mongos server, it's a range based system. So based on the stdt_id that we send as the shard key, it'll send the request to the right mongod instance.
So, what do we need to really know as a developer?
insert must include a shard key, so if it's a multi-parted shard key, we must include the entire shard key
we've to understand what the shard key is on collection itself
for an update, remove, find - if mongos is not given a shard key - then it's going to have to broadcast the request to all the different shards that cover the collection.
for an update - if we don't specify the entire shard key, we have to make it a multi update so that it knows that it needs to broadcast it
Whenever you're thinking about sharding or replication, you need to think in the context of writers/update operations. If you don't need to scale writes then replications, as it fairly simpler, is a good choice for you.
On the other hand, if you workload mostly updates/writes then at some point you'll hit a write bottleneck. If write request comes Mongo blocks other writes request. Those write request blocks until the first request will be done. If you want to scale this writes and want parallelize it then you need to implement sharding.
Just to put this somewhere...
The most basic way to run mongo is as standalone server.
You write a config (file or cli options)
initiate the server using mongod
For this picture, I didn't include the "client". Check the next one.
A replica set is a set of servers initialized exactly as above with a different config file.
To link them, we connect to one of them, and initialize the replica set mode.
They will mirror each other (in the most common configuration). This system guarantees high availability of data.
The initialization of the replica set is represented in the red border box.
Sharding is not about replicating data, but about fragmenting data.
Each fragment of data is called chunk and goes to a different shard. shard = each replica set.
"main" server, running mongos instead of mongod. This is a router for queries from the client.
Obvious: The trade-off is a more complex architecture.
Novelty: configuration server (again, a different config file).
There is much more to add, but apart from the words the pictures hold much the same.
Even mongoDB recommends to study your case carefully before going sharding. Vertical scaling (vs) is probably a good idea at least once before horizontal scaling (hs).
vs is done upgrading hardware (cpu, ram, etc). hs is needs more computers (but could be cheap computers).
Both replication and sharding can be used (individually or together) for horizontal scaling of a MongoDB installation.
Sharding is MongoDB's solution for meeting the demands of data growth. Sharding stores data records across multiple servers to provide faster throughput on read and write queries, particularly for very large data sets.
Any of the servers in the sharded cluster can respond to a read or write operation, which greatly speeds up query responses.
Replication is MongoDB's solution for providing stability, backup, and disaster recovery to a MongoDB installation. This process copies and synchronizes the replica data set across multiple servers. This prevents downtime if one server goes offline.
Any of the secondary servers can respond to read queries, but only the primary server will perform write operations. The results of the write operation will then be propagated out to the secondary servers.
Scenario 1: Fault-Tolerance
In this scenario, the user is storing billing data in a MongoDB installation. This data is mission-critical to the user's business, and needs to be available 24/7, even if a server crashes or is taken offline.
MongoDB replication is the best solution for this user. With replication, the entire data set is mirrored on multiple servers. If a server fails or is taken offline, the other servers in the cluster take over.
Scenario 2: High Performance
In this scenario, the user is running a social networking site which is run from a MongoDB database. As the social network grows, the MongoDB data set has grown along with it. The user is seeing query times and page loads increase beyond an acceptable point. It is critical that the user's MongoDB installation receives a major performance boost.
Setting up a sharded MongoDB cluster is the best solution for this user. The sharded cluster will break up the user's data set and store parts of it on separate secondary servers. Each secondary server can respond to read or write queries on its portion of the data, which greatly increases the installation's response time
MongoDB Atlas is a Database as a service in could. It support three major cloud providers such as Azure , AWS and GCP. In cloud environment , we usually talk about high availability and scalability. In Atlas “clusters”, can be either a replica set or a sharded cluster.
These two address high availability and scalability features of our cloud environment.
In general Cluster is a group of servers used to achieve a specific task. So sharded clusters are used to store data in across multiple machines to meet the demand of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide an acceptable read and write throughput. Sharded clusters supports the horizontal scalability of the underling cloud environment.
A replica set in MongoDB is a group of mongod processes that maintain the same data set. Replica sets provide redundancy and high availability, and are the basis for all production deployments.In a replica, one node is a primary node that receives all write operations. All other instances, such as secondaries, apply operations from the primary so that they have the same data set. Replica set mainly focus on the availability of data.
Please check the documentation
Thank You.