Do MongoDB databases within a cluster share the same node set under the hood?

The reason I am asking is, I have a resource-intensive collection that degrades performance of its entire database. I need to decide whether to migrate other collections away to a different database within the same cluster or to a different cluster altogether.
The answer I think depends on under-the-hood implementation. Does a poorly performing collection take resources only from its own database, or from the cluster as a whole?
Hosted on Atlas.

I would suggest first looking at your logical and schema design and trying to optimize it. If that does not work, then:
"In MongoDB Atlas, all databases within a cluster share the same set of nodes (servers) and are subject to the same resource limitations. Each database has its own logical namespace and operates independently from the other databases, but they share the same underlying hardware resources, such as CPU, memory, and I/O bandwidth.
So, if you have a resource-intensive collection that is degrading performance for its entire database, migrating other collections to a different database within the same cluster may not significantly improve performance if the resource bottleneck is at the cluster level. In this case, you may need to consider scaling up the cluster or upgrading to a higher-tier plan to increase the available resources and improve overall cluster performance."
Reference: https://www.mongodb.com/community/forums/t/creating-a-new-database-vs-a-new-collection-vs-a-new-cluster/99187/2

The term "cluster" is overloaded. It can refer to a replica set or to a sharded cluster.
A sharded cluster is effectively a group of replica sets with a query router in front of them.
If you are using a sharded cluster, you can design a sharding strategy that will put the busy collection on its own shard, the rest of the data on the other shard(s), and still have a common point to query them both.

Related

How to decide when to use replica sets for MongoDB in production

We currently host MongoDB in production using its official Docker image on EC2, on a server with 32 GB of memory dedicated to just this service.
How can replica sets help improve the performance of our MongoDB deployment? Query responses are getting slower day by day.
Are there any measures we can use to determine whether investing in a replica set will provide worthwhile benefits and will not be premature optimization?
MongoDB replication is a high availability solution (see note at the end of the post for more details on Replication). Replication is not a performance improvement solution.
MongoDB query performance depends upon various factors: size of collection, size of document, database design, query definition and indexes. Inadequate hardware (memory, hard drive, cpu and network) can affect the query performance. The number of operations at a given time can also affect the performance.
For faster query performance the main consideration is using indexes. Indexes directly affect the query filter and sort operations. To find out whether your query is performing optimally and using the proper indexes, generate a query plan using explain with the "executionStats" mode and study the plan. Explain can be run on MongoDB find, update, delete and aggregation queries. All these queries can benefit from indexes. See Query Optimization.
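As a rough illustration of what "study the plan" can mean, here is a minimal Python sketch that pulls a few index-usage signals out of an explain document. It assumes the classic explain layout (queryPlanner.winningPlan with nested inputStage entries, plus executionStats.nReturned and executionStats.totalDocsExamined); the sample document and the helper names are hypothetical, and newer server versions may nest the plan differently.

```python
def find_stages(plan):
    """Collect stage names from a winning plan, following nested input stages."""
    stages = [plan.get("stage")]
    if "inputStage" in plan:
        stages += find_stages(plan["inputStage"])
    for child in plan.get("inputStages", []):
        stages += find_stages(child)
    return stages


def summarize_explain(explain):
    """Return a few index-usage signals from an explain("executionStats") document."""
    stats = explain["executionStats"]
    stages = find_stages(explain["queryPlanner"]["winningPlan"])
    returned = stats["nReturned"]
    examined = stats["totalDocsExamined"]
    return {
        # A COLLSCAN stage means every document in the collection was read.
        "collection_scan": "COLLSCAN" in stages,
        "uses_index": "IXSCAN" in stages,
        # Ideally close to 1: documents read per document returned.
        "docs_examined_per_returned": examined / returned if returned else float(examined),
    }


# Hypothetical explain output, trimmed to the fields used above.
sample_explain = {
    "queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}},
    "executionStats": {"nReturned": 12, "totalKeysExamined": 0, "totalDocsExamined": 48000},
}

summary = summarize_explain(sample_explain)
print(summary)
```

A collection scan combined with a high examined-to-returned ratio is usually the signal that an index on the filter or sort fields is worth trying.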
Adding capabilities to the existing hardware is known as vertical scaling; and replication is not vertical scaling.
Replication:
This is configured as a replica-set - a primary node and multiple secondary nodes. The primary is the main point of contact for the application - all writes happen on the primary (and reads, by default). The data written to the primary is replicated to the secondaries. This way data redundancy is accomplished. When the primary goes down, one of the secondaries takes over as primary and keeps the system running via a failover process. Data durability, high availability, redundancy and failover are the main concepts with replication. In MongoDB a replica-set cluster can have up to fifty nodes.
It is recommended to use a replica-set in production because of its HA functionality.
Given resource limits on one hand and the need for HA in production on the other, I would suggest creating a minimal replica-set consisting of a Primary, a Secondary and an Arbiter (an arbiter does not contain any data and consumes very little memory).
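The reason three members is the usual minimum comes down to majority arithmetic: a primary can only be elected while a majority of voting members is reachable, and an arbiter counts as a voting member. The following is a small Python sketch of that arithmetic; the function names are illustrative, not part of any MongoDB API.

```python
def majority(voting_members):
    """Votes needed to elect a primary."""
    return voting_members // 2 + 1


def fault_tolerance(voting_members):
    """How many voting members can be lost while a primary can still be elected."""
    return voting_members - majority(voting_members)


for size in (2, 3, 4, 5):
    print(size, "voting members: majority", majority(size),
          "tolerates", fault_tolerance(size), "failure(s)")
```

With two voting members the loss of either one prevents an election, which is why the third member (a data-bearing secondary or an arbiter) is needed. Note that an arbiter adds a vote but no extra copy of the data.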
Also, writes typically affect your memory performance much more than reads. To achieve better write performance I would advise you to create more shards (the more primaries you have, the more writes you can handle at the same time).
However, I'm not sure what is causing your MongoDB performance to degrade so quickly. I think you should:
Check what affects your production performance the most (complicated queries or heavy writes).
Change your read preference to "nearest".
Consider disabling Read Concern "majority" (remember that by default there is a write "majority" concern. Members should be up to date).
Check for a better index.
And of course create a replica-set!
Good Luck! :P

How to handle different server types in MongoDB sharded cluster

Is there a way to deal with different server types in a sharded cluster? According to MongoDB documentation the balancer attempts to achieve an even distribution of chunks across all shards in the cluster. So, it purely seems to be based on the amount of data.
However, when you add new servers to an existing sharded cluster, the new server typically has more disk space, faster disks and a more powerful CPU. This is especially likely when you have been running an application for several years.
Does the balancer take such topics into account or do you have to ensure that all servers in a sharded cluster have similar performance and resources?
You are correct that the balancer assumes that all parts of the cluster run on similar hardware. However, you can use zone sharding to custom tailor the behaviour of the balancer.
To quote from the zone sharding docs page:
In sharded clusters, you can create zones of sharded data based on the shard key. You can associate each zone with one or more shards in the cluster. A shard can associate with any number of zones. In a balanced cluster, MongoDB migrates chunks covered by a zone only to those shards associated with the zone.
Using zones, you can specify data distribution to be by location, by hardware spec, by application/customer, and others.
To directly answer your question, the use case you'll be most interested in would be Tiered Hardware for Varying SLA or SLO. Please see the link for a tutorial on how to achieve this.
Note that defining the zones is a design decision on your part, and there is currently no automated way for the server to do this for you.
Small note: the balancer balances the cluster purely using the shard key. It doesn't take into account the amount of data at all. Thus, with an improperly designed shard key, it is possible to have some shards overflowing with data while others are completely empty. In a pathological mis-design case, some chunks are not divisible, leading to a situation where the cluster stays unbalanced until an extensive redesign is done.
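To make the zone idea concrete, here is a toy Python model of zone-constrained chunk placement for the tiered-hardware case. It is only a sketch and not the real balancer: the shard names, the zone names and the year-based shard key are all hypothetical, and chunks are simply handed out round-robin among the shards associated with their zone.

```python
from itertools import cycle

# Hypothetical zones on a "year" shard key; the upper bound is exclusive.
zone_ranges = [
    {"zone": "archive", "min": 0, "max": 2020},
    {"zone": "recent", "min": 2020, "max": 9999},
]
# Hypothetical shards: two older machines and one newer, faster machine.
shard_zones = {
    "shard-old-1": {"archive"},
    "shard-old-2": {"archive"},
    "shard-new-1": {"recent"},
}


def zone_for(chunk_min, chunk_max):
    """Return the zone whose range fully covers the chunk, if any."""
    for z in zone_ranges:
        if z["min"] <= chunk_min and chunk_max <= z["max"]:
            return z["zone"]
    return None


def place_chunks(chunks):
    """Assign each chunk round-robin among the shards allowed for its zone."""
    pickers = {}
    placement = {}
    for chunk in chunks:
        zone = zone_for(*chunk)
        if zone is None:
            allowed = sorted(shard_zones)  # chunks outside every zone may go anywhere
        else:
            allowed = sorted(s for s, zones in shard_zones.items() if zone in zones)
        if zone not in pickers:
            pickers[zone] = cycle(allowed)
        placement[chunk] = next(pickers[zone])
    return placement


chunks = [(2015, 2017), (2017, 2020), (2020, 2022), (2022, 2024)]
placement = place_chunks(chunks)
for chunk, shard in placement.items():
    print(chunk, "->", shard)
```

The point of the model is only that zone membership, not hardware capacity, decides where a chunk is allowed to live; choosing the ranges remains a manual design decision.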

What's a Cluster / Bucket in couchbase Server

I'm new to Couchbase and NoSQL technologies in general, but I'm working on a web chat application running on Node.js using Express and some other modules.
I've chosen to work with NoSQL to store sessions and all needed data on the server side. But I don't really understand some important features of Couchbase: what is a Cluster, and what is a Bucket? Where can I find some clear definitions of how the server works?
Couchbase uses the term cluster in the same way as many other products: a Couchbase cluster is simply a collection of machines running as a co-ordinated, distributed system of Couchbase nodes.
A Bucket is a Couchbase specific term that is roughly analogous to a 'database' in traditional RDBMS terms. A Bucket provides a container for grouping your data, both in terms of organisation and grouping of similar data and resource allocation. You can configure your buckets separately, providing different quotas, different IO priorities and different security settings on a per bucket basis. Buckets are also the primary method for namespacing documents in Couchbase.
For further information, the Architecture and Concepts overview in the Couchbase documentation, specifically data storage, is a good starting point. A somewhat outdated, but still useful video on Introduction to Couchbase might also be useful to you.
Even though it's answered, hope the following would be more helpful for someone.
A Couchbase cluster contains nodes. Nodes contain buckets. Buckets contain documents. Documents can be retrieved multiple ways: by their keys, queried with N1QL, and also by using Views.
As specified in the Couchbase Documentation,
Node
A single Couchbase Server instance running on a physical server,
virtual machine, or a container. All nodes are identical: they consist
of the same components and services and provide the same interfaces.
Cluster
A cluster is a collection of nodes that are accessed and managed as a
single group. Each node is an equal partner in orchestrating the
cluster to provide facilities such as operational information
(monitoring) or managing cluster membership of nodes and health of
nodes.
Clusters are scalable. You can expand a cluster by adding new nodes
and shrink a cluster by removing nodes.
The Cluster Manager is the main component that orchestrates the
cluster level operations. For more information, see Cluster Manager.
Bucket
A bucket is a logical container for a related set of items such as
key-value pairs or documents. Buckets are similar to databases in
relational databases. They provide a resource management facility for
the group of data that they contain. Applications can use one or more
buckets to store their data. Through configuration, buckets provide
segregation along the following boundaries:
Cache and IO management
Authentication
Replication and Cross Datacenter Replication (XDCR)
Indexing and Views
For further info : Couchbase Terminology

Hosting and scaling mongodb

I'm looking for a hosting service to host my MongoDB database, such as MongoLab, MongoHQ, Heroku, AWS EBS, etc.
What I need is to find one of these services (or another) that auto-scales my storage size.
Is there a way (service) to auto-scale mongodb? How?
There are many hosting providers for MongoDB that provide scaling solutions.
Storage size is only one dimension to scaling; there are other resources to monitor like memory, CPU, and I/O throughput.
There are several different approaches you can use to scale your MongoDB storage size:
1) Deploy a MongoDB replica set and scale vertically
The minimum footprint you would need is two data bearing nodes (a primary and a secondary) plus a third node which is either another data node (secondary) or a voting-only node without data (arbiter).
As your storage needs change, you can adjust the storage on your secondaries by temporarily stopping the mongod service and doing the O/S changes to adjust storage provisioning on your dbpath. After you adjust each secondary, you would restart mongod and allow that secondary to catch up on replication sync before changing the next secondary. Lastly, you would step down and upgrade the primary.
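The ordering described above (secondaries one at a time, primary last) can be sketched as a small Python planner. This only prints the intended sequence and does not touch a real deployment; the host names are hypothetical, and rs.stepDown() is mentioned as the shell helper one would typically use for the step-down.

```python
def rolling_maintenance_plan(members):
    """Order storage maintenance so the primary is handled last.

    members maps a host name to its replica set state, "PRIMARY" or "SECONDARY".
    """
    secondaries = sorted(h for h, s in members.items() if s == "SECONDARY")
    primary = next(h for h, s in members.items() if s == "PRIMARY")
    steps = []
    for host in secondaries:
        steps.append(f"{host}: stop mongod, resize storage under dbpath, restart mongod")
        steps.append(f"{host}: wait until replication has caught up")
    # Only now hand the primary role to an already-resized secondary.
    steps.append(f"{primary}: step down (for example rs.stepDown() in the shell)")
    steps.append(f"{primary}: stop mongod, resize storage under dbpath, restart mongod")
    return steps


# Hypothetical three-member replica set.
plan = rolling_maintenance_plan(
    {"db-a": "PRIMARY", "db-b": "SECONDARY", "db-c": "SECONDARY"}
)
for number, step in enumerate(plan, start=1):
    print(number, step)
```

Handling one member at a time keeps a majority of the set available throughout, so the application sees at most a brief election when the primary steps down.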
2) Deploy a MongoDB sharded cluster and scale horizontally.
A sharded cluster allows you to partition databases across a number of shards, which are typically backed by replica sets (although technically a shard can be a standalone mongod). In this case you can add (or remove) shards based on your requirements. Changing the number of shards can have a notable performance impact due to chunk migration across shards, so this isn't something you'd do too reactively (i.e. far more likely on a daily or weekly basis rather than hourly).
3) Let a hosted database-as-a-service provider take care of the details for you :)
Replica sets and sharding are the basic building blocks, but hosted providers can often take care of the details for you. Armed with this terminology you should be able to approach hosting providers and ask them about available plan options and costing.

In Mongo what is the difference between sharding and replication?

Replication seems to be a lot simpler than sharding, unless I am missing the benefits of what sharding is actually trying to achieve. Don't they both provide horizontal scaling?
In the context of scaling MongoDB:
replication creates additional copies of the data and allows for automatic failover to another node. Replication may help with horizontal scaling of reads if you are OK to read data that potentially isn't the latest.
sharding allows for horizontal scaling of data writes by partitioning data across multiple servers using a shard key. It's important to choose a good shard key. For example, a poor choice of shard key could lead to "hot spots" of data only being written on a single shard.
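The "hot spot" effect of a poor shard key can be shown with a small Python simulation. It is a sketch under stated assumptions: four hypothetical shards, range boundaries that were chosen before the new inserts arrived, and md5 used purely as a stand-in for a hashed shard key (it is not the hash function MongoDB uses).

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical cluster size


def range_shard(key, upper_bounds):
    """Ranged sharding: pick the first range whose upper bound exceeds the key."""
    for shard, upper in enumerate(upper_bounds):
        if key < upper:
            return shard
    return len(upper_bounds)  # the last range is open-ended


def hashed_shard(key):
    """Hashed sharding stand-in: md5 is an illustration, not MongoDB's hash."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Ranges were split on keys that existed earlier: <1000, <2000, <3000, the rest.
upper_bounds = [1000, 2000, 3000]
new_keys = range(5000, 6000)  # monotonically increasing ids for new inserts

ranged = Counter(range_shard(k, upper_bounds) for k in new_keys)
hashed = Counter(hashed_shard(k) for k in new_keys)
print("ranged:", dict(ranged))
print("hashed:", dict(hashed))
```

With the ever-increasing key, every new insert falls into the last range and therefore onto a single shard, while the hashed variant spreads the same inserts across all shards.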
A sharded environment does add more complexity because MongoDB now has to manage distributing data and requests between shards -- additional configuration and routing processes are added to manage those aspects.
Replication and sharding are typically combined to create a sharded cluster where each shard is supported by a replica set.
From a client application point of view you also have some control in relation to the replication/sharding interaction, in particular:
Read preferences
Write concerns
Consider you have a great music collection on your hard disk, you store the music in logical order based on year of release in different folders.
You are concerned that your collection will be lost if drive fails.
So you get a new disk and occasionally copy the entire collection keeping the same folder structure.
Sharding >> Keeping your music files in different folders
Replication >> Syncing your collection to other drives
Replication is a mostly traditional master/slave setup: data is synced to backup members, and if the primary fails one of them can take its place. It is a reasonably simple tool. It's primarily meant for redundancy, although you can scale reads by adding replica set members. That's a little complicated, but works very well for some apps.
Sharding sits on top of replication, usually. "Shards" in MongoDB are just replica sets with something called a "router" in front of them. Your application will connect to the router, issue queries, and it will decide which replica set (shard) to forward things on to. It's significantly more complex than a single replica set because you have the router and config servers to deal with (these keep track of what data is stored where).
If you want to scale Mongo horizontally, you'd shard. 10gen likes to call the router/config server setup auto-sharding. It's also possible to do a cruder, manual form of sharding where the app decides which DB to write to.
Sharding
Sharding is a technique of splitting up a large collection amongst multiple servers. When we shard, we deploy multiple mongod servers, and in front of them mongos, which is a router. The application talks to this router, and the router then talks to the various servers, the mongods. The application and the mongos are usually co-located on the same server. We can have multiple mongos services running on the same machine. It's also recommended to keep a set of multiple mongods (together called a replica set), instead of one single mongod, for each shard. A replica set keeps the data in sync across several different instances so that if one of them goes down, we won't lose any data. Logically, each replica set can be seen as a shard. This is transparent to the application. The way MongoDB decides how to distribute the data is through a shard key that we choose.
Assume that for a student collection we have stdt_id as the shard key (it could also be a compound key). The mongos server works as a range-based system, so based on the stdt_id that we send as the shard key, it'll send the request to the right mongod instance.
So, what do we need to really know as a developer?
an insert must include the shard key, so if it's a multi-part shard key, we must include the entire shard key
we have to understand what the shard key is on the collection itself
for an update, remove, find - if mongos is not given a shard key - then it's going to have to broadcast the request to all the different shards that cover the collection.
for an update - if we don't specify the entire shard key, we have to make it a multi update so that it knows that it needs to broadcast it
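The difference between a targeted and a broadcast operation can be sketched with a toy range-based router in Python. The split points and shard names are hypothetical, and the sketch only handles equality matches on stdt_id; it illustrates the routing idea rather than how mongos is implemented.

```python
import bisect

# Hypothetical chunk boundaries on the shard key stdt_id:
# shard0 holds ids below 1000, shard1 holds 1000 to 1999, shard2 holds the rest.
SPLIT_POINTS = [1000, 2000]
SHARDS = ["shard0", "shard1", "shard2"]


def route(query):
    """Return the shards a query must visit, the way a range-based router would."""
    if "stdt_id" in query:
        # Targeted operation: the shard key value falls into exactly one range.
        index = bisect.bisect_right(SPLIT_POINTS, query["stdt_id"])
        return [SHARDS[index]]
    # No shard key in the filter: the request is broadcast to every shard.
    return list(SHARDS)


print(route({"stdt_id": 1500}))
print(route({"name": "Alice"}))
```

A query that includes stdt_id touches one shard, while a query on any other field has to be sent to all of them and the results merged.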
Whenever you're thinking about sharding or replication, you need to think in the context of write/update operations. If you don't need to scale writes, then replication, being considerably simpler, is a good choice for you.
On the other hand, if your workload is mostly updates/writes, then at some point you'll hit a write bottleneck: all writes go to the single primary, and writes that contend with each other have to wait until the earlier ones are done. If you want to scale these writes and parallelize them, you need to implement sharding.
Just to put this somewhere...
The most basic way to run mongo is as standalone server.
You write a config (file or cli options)
initiate the server using mongod
For this picture, I didn't include the "client". Check the next one.
A replica set is a set of servers initialized exactly as above with a different config file.
To link them, we connect to one of them, and initialize the replica set mode.
They will mirror each other (in the most common configuration). This system guarantees high availability of data.
The initialization of the replica set is represented in the red border box.
Sharding is not about replicating data, but about fragmenting data.
Each fragment of data is called a chunk and goes to a different shard. shard = each replica set.
A "main" server runs mongos instead of mongod. This is a router for queries from the client.
Obvious: The trade-off is a more complex architecture.
Novelty: configuration server (again, a different config file).
There is much more to add, but apart from the words the pictures hold much the same.
Even MongoDB recommends studying your case carefully before going for sharding. Vertical scaling (vs) is probably a good idea at least once before horizontal scaling (hs).
vs is done by upgrading hardware (cpu, ram, etc). hs needs more computers (but they could be cheap computers).
Both replication and sharding can be used (individually or together) for horizontal scaling of a MongoDB installation.
Sharding is MongoDB's solution for meeting the demands of data growth. Sharding stores data records across multiple servers to provide faster throughput on read and write queries, particularly for very large data sets.
Any of the servers in the sharded cluster can respond to a read or write operation, which greatly speeds up query responses.
Replication is MongoDB's solution for providing stability, backup, and disaster recovery to a MongoDB installation. This process copies and synchronizes the replica data set across multiple servers. This prevents downtime if one server goes offline.
Any of the secondary servers can respond to read queries, but only the primary server will perform write operations. The results of the write operation will then be propagated out to the secondary servers.
Scenario 1: Fault-Tolerance
In this scenario, the user is storing billing data in a MongoDB installation. This data is mission-critical to the user's business, and needs to be available 24/7, even if a server crashes or is taken offline.
MongoDB replication is the best solution for this user. With replication, the entire data set is mirrored on multiple servers. If a server fails or is taken offline, the other servers in the cluster take over.
Scenario 2: High Performance
In this scenario, the user is running a social networking site which is run from a MongoDB database. As the social network grows, the MongoDB data set has grown along with it. The user is seeing query times and page loads increase beyond an acceptable point. It is critical that the user's MongoDB installation receives a major performance boost.
Setting up a sharded MongoDB cluster is the best solution for this user. The sharded cluster will break up the user's data set and store parts of it on separate shards. Each shard can respond to read or write queries on its portion of the data, which greatly improves the installation's response time.
MongoDB Atlas is a database as a service in the cloud. It supports three major cloud providers: Azure, AWS and GCP. In a cloud environment, we usually talk about high availability and scalability. In Atlas, "clusters" can be either a replica set or a sharded cluster.
These two address high availability and scalability features of our cloud environment.
In general, a cluster is a group of servers used to achieve a specific task. Sharded clusters are used to store data across multiple machines to meet the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data or to provide acceptable read and write throughput. Sharded clusters support the horizontal scalability of the underlying cloud environment.
A replica set in MongoDB is a group of mongod processes that maintain the same data set. Replica sets provide redundancy and high availability, and are the basis for all production deployments. In a replica set, one node is the primary node that receives all write operations. All other instances, the secondaries, apply operations from the primary so that they have the same data set. Replica sets mainly focus on the availability of data.
Please check the documentation
Thank You.