Mongodb : Distribution of collections using sharding

The application that we developed is a data crunching application, and in the process it creates many collections. So there is a high chance that the maximum limit of 3 million collections (provided we use all of the 2GB available for the NS file) will be exceeded.
Can I have different collections distributed across different mongodb instances and manage them using sharding?

I guess your question is about how to use sharding to distribute collections across nodes. Sharding is actually meant as a method of meeting the demands of data growth in terms of data volume, not the number of collections. That is, you partition a single collection into multiple shards.
To deal with the issue you are having, you could run multiple standalone mongod instances on the same server (not as a replica set) and distribute your collections across them. But you will need to keep track of which collection is hosted on which mongod process, which means storing an identifier for each mongod process, such as its pid.
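
As a rough illustration of the bookkeeping described above, here is a minimal Python sketch of a collection-to-instance registry. The instance addresses, the collection names, and the helper names `instance_for` and `build_registry` are all hypothetical, and the hash-based placement is just one possible policy, not something MongoDB does for you.

```python
import zlib

# Hypothetical addresses of independent mongod instances on one server.
INSTANCES = ["localhost:27017", "localhost:27018", "localhost:27019"]


def instance_for(collection_name, instances=INSTANCES):
    """Pick the instance for a collection using a stable hash of its name."""
    # zlib.crc32 is deterministic across runs, unlike the built-in hash() for str.
    index = zlib.crc32(collection_name.encode("utf-8")) % len(instances)
    return instances[index]


def build_registry(collection_names, instances=INSTANCES):
    """Return an explicit collection -> instance map that the application can persist."""
    return {name: instance_for(name, instances) for name in collection_names}


# Hypothetical collection names produced by the data crunching job.
registry = build_registry(["crunch_0001", "crunch_0002", "crunch_0003"])
```

Persisting the explicit map matters because a purely hash-based placement would move collections around if the list of instances ever changed.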

Related

mongodb - can i write to the database while sharding?

I know this may sound like a naive question, but I don't seem to see much on Google. Is it OK to insert / upsert documents into a mongodb (3.6) database while sharding is going on (i.e. the chunk balancer is running)?
Thanks

The documentation is pretty clear on this:
Sharding improves concurrency by distributing collections over
multiple mongod instances, allowing shard servers (i.e. mongos
processes) to perform any number of operations concurrently to the
various downstream mongod instances.
In a sharded cluster, locks apply to each individual shard, not to the
whole cluster; i.e. each mongod instance is independent of the others
in the sharded cluster and uses its own locks. The operations on one
mongod instance do not block the operations on any others.
So, yes, you can run any query against a cluster at any time. This should be completely transparent to your client and MongoDB will internally manage potentially required locks.

Why do individual shards in MongoDB report more delete operations compared to corresponding mongos in a sharded cluster?

So I have a production sharded MongoDB cluster that has 8 shards (replica sets) managed by mongos. Let's say I have 20 servers which are running my application and each of the servers runs a mongos process that manages the 8 shards.
Given this setup, when I check the number of ops on each of my mongos on the 20 servers, I can see that my number of inserts and deletes are in proportion - which is in accordance with my application logic. However, when I run mongostat --discover on the individual shards, I see that deletes are nearly 4x the number of inserts which violates both my application logic as well as the 1:1 ratio indicated by mongos. Straightforward intuition supports that mongos would write to only one shard and so the average ratio of inserts and deletes across individual shards should be the same as that on mongos (which the application directly writes to) unless mongos does something different internally with the shards.
Could anyone point me to any relevant info on why this would happen, or let me know if something could possibly be wrong with my infra?
Thanks

The reason for this is that I was running the remove() queries to mongos without specifying my shard key. In that case, mongos does not know which shard to direct the query to and thus broadcasts the query to all the shards effectively performing more deletes than a targeted query.
Check documentation for more information.
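
The difference between a targeted and a broadcast operation can be sketched with a toy router in Python. This is only a simulation under stated assumptions: the shard key name `user_id`, the `session` field, and the modulo placement are hypothetical stand-ins (a real mongos routes by chunk ranges), and the sketch does not try to reproduce the exact ratio seen in the question.

```python
from collections import Counter

SHARDS = [f"shard{i}" for i in range(8)]
SHARD_KEY = "user_id"  # hypothetical shard key


def target_shards(query_filter):
    """Return the shards a router would have to contact for this filter."""
    if SHARD_KEY in query_filter:
        # Targeted: the key value identifies a single shard (toy modulo placement).
        return [SHARDS[query_filter[SHARD_KEY] % len(SHARDS)]]
    # No shard key: the router cannot narrow it down, so it contacts every shard.
    return list(SHARDS)


def count_shard_ops(filters):
    """Count how many operations each shard sees for a list of query filters."""
    ops = Counter()
    for query_filter in filters:
        for shard in target_shards(query_filter):
            ops[shard] += 1
    return ops


# 16 deletes that include the shard key versus 16 deletes that do not.
targeted = count_shard_ops([{"user_id": n} for n in range(16)])
broadcast = count_shard_ops([{"session": n} for n in range(16)])
```

In this toy model the router issues the same number of deletes in both cases, but the shards collectively see one operation per delete when the shard key is present and one per shard per delete when it is not.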

mongodb indices and scaling

Reading the MongoDB documentation for indexes, I was left a little mystified and unsettled by this assertion found at: http://docs.mongodb.org/manual/applications/indexes/#ensure-indexes-fit-ram
If you have and use multiple collections, you must consider the size
of all indexes on all collections. The indexes and the working set
must be able to fit in RAM at the same time.
So, how is this supposed to scale when new nodes are added to the cluster? Suppose all my 576 nodes are bounded at 8GB, and I have 12 collections of 4GB each (including their associated indexes) and 3 collections of 16GB (including indexes). How does sharding spread the data between nodes so that the 12 collections can be queried efficiently?

When sharding you spread the data across different shards. The mongos process routes queries to shards it needs to get data from. As such you only need to look at the data a shard is holding. To quote from When to Use Sharding:
You should consider deploying a sharded cluster, if:
your data set approaches or exceeds the storage capacity of a single node in your system.
the size of your system’s active working set will soon exceed the capacity of the maximum amount of RAM for your system.
Also note that the working set != whole collection. The working set is defined as:
The collection of data that MongoDB uses regularly. This data is typically (or preferably) held in RAM.
E.g. you have 1TB of data but typically only 50GB is used/queried. That subset is preferably held in RAM.
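
To make the numbers from the question concrete, here is a back-of-the-envelope calculation in Python. It rests on assumptions that are not in the original: the 8GB bound is RAM per node, every node is its own shard, every collection is sharded and evenly balanced, and the whole data set counts as the working set (the worst case).

```python
NODES = 576
RAM_PER_NODE_GB = 8  # assumption: the 8GB bound in the question is RAM per node

# 12 collections of 4GB and 3 collections of 16GB, indexes included.
collection_sizes_gb = [4] * 12 + [16] * 3

total_gb = sum(collection_sizes_gb)

# Assumption: every collection is sharded and evenly balanced over all nodes.
per_node_gb = total_gb / NODES
fits_in_ram = per_node_gb <= RAM_PER_NODE_GB

# Worst case for an unsharded collection: it lives entirely on one node.
largest_unsharded_fits = max(collection_sizes_gb) <= RAM_PER_NODE_GB

print(f"total={total_gb}GB per_node={per_node_gb:.3f}GB "
      f"fits={fits_in_ram} largest_unsharded_fits={largest_unsharded_fits}")
```

Under these assumptions each node only has to hold a small slice of the data, whereas an unsharded 16GB collection would not fit on a single 8GB node if all of it were in regular use.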

In Mongo what is the difference between sharding and replication?

Replication seems to be a lot simpler than sharding, unless I am missing the benefits of what sharding is actually trying to achieve. Don't they both provide horizontal scaling?

In the context of scaling MongoDB:
replication creates additional copies of the data and allows for automatic failover to another node. Replication may help with horizontal scaling of reads if you are OK to read data that potentially isn't the latest.
sharding allows for horizontal scaling of data writes by partitioning data across multiple servers using a shard key. It's important to choose a good shard key. For example, a poor choice of shard key could lead to "hot spots" of data only being written on a single shard.
A sharded environment does add more complexity because MongoDB now has to manage distributing data and requests between shards -- additional configuration and routing processes are added to manage those aspects.
Replication and sharding are typically combined to create a sharded cluster where each shard is supported by a replica set.
From a client application point of view you also have some control in relation to the replication/sharding interaction, in particular:
Read preferences
Write concerns

Suppose you have a great music collection on your hard disk, and you store the music in logical order, based on year of release, in different folders.
You are concerned that your collection will be lost if the drive fails.
So you get a new disk and occasionally copy the entire collection keeping the same folder structure.
Sharding >> Keeping your music files in different folders
Replication >> Syncing your collection to other drives

Replication is a mostly traditional master/slave setup: data is synced to backup members, and if the primary fails, one of them can take its place. It is a reasonably simple tool. It's primarily meant for redundancy, although you can scale reads by adding replica set members. That's a little complicated, but works very well for some apps.
Sharding sits on top of replication, usually. "Shards" in MongoDB are just replica sets with something called a "router" in front of them. Your application will connect to the router, issue queries, and it will decide which replica set (shard) to forward things on to. It's significantly more complex than a single replica set because you have the router and config servers to deal with (these keep track of what data is stored where).
If you want to scale Mongo horizontally, you'd shard. 10gen likes to call the router/config server setup auto-sharding. It's possible to do a more ghetto form of sharding where you have the app decide which DB to write to as well.

Sharding
Sharding is a technique for splitting up a large collection among multiple servers. When we shard, we deploy multiple mongod servers, with mongos, a router, in front of them. The application talks to this router, and the router then talks to the various mongod servers. The application and mongos are usually co-located on the same server, and multiple mongos services can run on the same machine. It's also recommended to keep a set of multiple mongods (together called a replica set), instead of one single mongod on each server. A replica set keeps the data in sync across several different instances so that if one of them goes down, we won't lose any data. Logically, each replica set can be seen as a shard. All of this is transparent to the application. MongoDB decides how to distribute the data based on a shard key that we choose.
Assume that for the student collection we have stdt_id as the shard key (it could also be a compound key). The mongos server uses a range-based system: based on the stdt_id that we send as the shard key, it sends the request to the right mongod instance.
So, what do we need to really know as a developer?
an insert must include the shard key, so if it's a multi-part shard key, we must include the entire shard key
we have to understand what the shard key is on the collection itself
for an update, remove, or find, if mongos is not given a shard key, it has to broadcast the request to all the shards that cover the collection
for an update, if we don't specify the entire shard key, we have to make it a multi update so that mongos knows it needs to broadcast it
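
The range-based lookup described above can be sketched in a few lines of Python. The chunk boundaries, the shard names, and the `shard_for` helper are hypothetical; only the stdt_id shard key comes from the answer, and a real cluster keeps this chunk metadata on its config servers rather than in application code.

```python
from bisect import bisect_right

# Hypothetical chunk boundaries on stdt_id: chunk i covers
# [CHUNK_LOWER_BOUNDS[i], CHUNK_LOWER_BOUNDS[i + 1]); the last chunk is open-ended.
CHUNK_LOWER_BOUNDS = [0, 1000, 2000, 3000]
CHUNK_OWNERS = ["shardA", "shardB", "shardC", "shardA"]


def shard_for(stdt_id):
    """Return the shard owning the chunk whose range contains stdt_id."""
    if stdt_id < CHUNK_LOWER_BOUNDS[0]:
        raise ValueError("stdt_id is below the lowest chunk boundary")
    # bisect_right finds the first boundary greater than stdt_id;
    # the chunk just before it is the one that contains the key.
    index = bisect_right(CHUNK_LOWER_BOUNDS, stdt_id) - 1
    return CHUNK_OWNERS[index]


for student in (42, 1000, 2500):
    print(student, "->", shard_for(student))
```

A request that carries stdt_id can be sent to exactly one shard this way; without the key there is nothing to look up, which is why the request has to go to every shard.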

Whenever you're thinking about sharding or replication, you need to think in the context of write/update operations. If you don't need to scale writes, then replication, as it is fairly simple, is a good choice for you.
On the other hand, if your workload is mostly updates/writes, then at some point you'll hit a write bottleneck. When a write request arrives, Mongo blocks other write requests, and those requests wait until the first one is done. If you want to scale these writes and parallelize them, you need to implement sharding.

Just to put this somewhere...
The most basic way to run mongo is as standalone server.
You write a config (file or cli options)
initiate the server using mongod
For this picture, I didn't include the "client". Check the next one.
A replica set is a set of servers initialized exactly as above with a different config file.
To link them, we connect to one of them, and initialize the replica set mode.
They will mirror each other (in the most common configuration). This system guarantees high availability of data.
The initialization of the replica set is represented in the red border box.
Sharding is not about replicating data, but about fragmenting data.
Each fragment of data is called chunk and goes to a different shard. shard = each replica set.
"main" server, running mongos instead of mongod. This is a router for queries from the client.
Obvious: The trade-off is a more complex architecture.
Novelty: configuration server (again, a different config file).
There is much more to add, but apart from the words the pictures hold much the same.
Even MongoDB recommends studying your case carefully before sharding. Vertical scaling (vs) is probably a good idea at least once before horizontal scaling (hs).
vs is done by upgrading hardware (CPU, RAM, etc.). hs needs more computers (but they could be cheap computers).

Both replication and sharding can be used (individually or together) for horizontal scaling of a MongoDB installation.
Sharding is MongoDB's solution for meeting the demands of data growth. Sharding stores data records across multiple servers to provide faster throughput on read and write queries, particularly for very large data sets.
Any of the servers in the sharded cluster can respond to a read or write operation, which greatly speeds up query responses.
Replication is MongoDB's solution for providing stability, backup, and disaster recovery to a MongoDB installation. This process copies and synchronizes the replica data set across multiple servers. This prevents downtime if one server goes offline.
Any of the secondary servers can respond to read queries, but only the primary server will perform write operations. The results of the write operation will then be propagated out to the secondary servers.
Scenario 1: Fault-Tolerance
In this scenario, the user is storing billing data in a MongoDB installation. This data is mission-critical to the user's business, and needs to be available 24/7, even if a server crashes or is taken offline.
MongoDB replication is the best solution for this user. With replication, the entire data set is mirrored on multiple servers. If a server fails or is taken offline, the other servers in the cluster take over.
Scenario 2: High Performance
In this scenario, the user is running a social networking site which is run from a MongoDB database. As the social network grows, the MongoDB data set has grown along with it. The user is seeing query times and page loads increase beyond an acceptable point. It is critical that the user's MongoDB installation receives a major performance boost.
Setting up a sharded MongoDB cluster is the best solution for this user. The sharded cluster will break up the user's data set and store parts of it on separate shards. Each shard can respond to read or write queries on its portion of the data, which greatly improves the installation's response time.

MongoDB Atlas is a database as a service in the cloud. It supports three major cloud providers: Azure, AWS, and GCP. In a cloud environment, we usually talk about high availability and scalability. In Atlas, a “cluster” can be either a replica set or a sharded cluster.
These two address the high availability and scalability features of our cloud environment.
In general, a cluster is a group of servers used to achieve a specific task. Sharded clusters are used to store data across multiple machines to meet the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data or provide acceptable read and write throughput. Sharded clusters support the horizontal scalability of the underlying cloud environment.
A replica set in MongoDB is a group of mongod processes that maintain the same data set. Replica sets provide redundancy and high availability, and are the basis for all production deployments. In a replica set, one node is the primary node that receives all write operations. All other instances, the secondaries, apply operations from the primary so that they have the same data set. Replica sets mainly focus on the availability of data.
Please check the documentation
Thank You.

MongoDb - Utilizing multi CPU server for a write heavy application

I am currently evaluating MongoDb for our write heavy application...
Currently MongoDb uses a single thread for write operations and also takes a global lock whenever it is doing a write... Is it possible to exploit multiple CPUs on a multi-CPU server to get better write performance? What are your workarounds for the global write lock?

No, it is still recommended to use sharding to utilize multiple CPU cores.
As stated in the FAQ
Sharding improves concurrency by distributing collections over multiple mongod instances, allowing shard servers (i.e. mongos processes) to perform any number of operations concurrently to the various downstream mongod instances.
Each mongod instance is independent of the others in the shard cluster and uses the MongoDB readers-writer lock. The operations on one mongod instance do not block the operations on any others.
Sharding on a single box has its issues, as one user stated in the mongodb-user mailing list
After some significant experimentation, I've found a single MongoDB shard daemon CANNOT use more than one CPU. On a 24 CPU box, performance scales up until we hit about 8 shards, then another limit kicks in.

So right now, the easy solution is to shard.
Yes, normally sharding is done across servers. However, it is completely possible to shard on a single box. You simply fire up the shards on different ports and provide them with different folders. Here's a sample configuration of 2 shards on one box.
The MongoDB team recognizes that this is kind of sub-par, and I know from talking to them that they're looking at better ways to do this.
Obviously once you get multiple shards on one box and increase your write threads, you will have to be wary of disk IO. In my experience, I've been able to saturate disks with a single write thread. If your inserts/updates are relatively simple, you may find that extra write threads don't do anything. (Map-Reduces are the exception here, sharding definitely helps there)
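
As a rough sketch of the "different ports, different folders" idea, the following Python snippet builds one mongod command line per shard. It is not the sample configuration the answer refers to: the base port, the `/data` root, and the `shard_commands` helper are hypothetical, and the snippet only prints the commands. A working cluster also needs config servers and a mongos router, and recent MongoDB releases expect each shard to be a replica set, which this sketch leaves out.

```python
def shard_commands(count, base_port=27018, data_root="/data"):
    """Build one mongod command line per shard, each with its own port and folder."""
    commands = []
    for i in range(count):
        port = base_port + i                 # distinct port per shard
        dbpath = f"{data_root}/shard{i}"     # distinct data folder per shard
        commands.append(
            ["mongod", "--shardsvr", "--port", str(port), "--dbpath", dbpath]
        )
    return commands


# Two shards on one box, as in the answer; the commands are printed, not executed.
for command in shard_commands(2):
    print(" ".join(command))
```

Keeping the commands as argument lists makes it straightforward to hand them to a process launcher later without any shell quoting.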
