MongoDB sharding storage usage - mongodb

I am reading about sharding in MongoDB. After understanding how it works, I have a very basic question regarding the storage space used by it.
Suppose I have a server with 1 GB of storage. If my data grows beyond 1 GB, that won't be sufficient for my purpose. So I add one more server and shard Mongo.
So now, let's say I have 2 servers, with 1 GB of storage each, which are to be included in the cluster. If I shard, both of these servers will be used to distribute Mongo data, so in total I should have 2 GB of storage available for Mongo. However, the official sharding documentation mentions that shards are replica sets. If that is so, wouldn't adding a 1 GB server just mean that I still have only 1 GB of storage (like before) for actual MongoDB data, and the remaining 1 GB holds only replicated data?
If my understanding is correct, is there any way to not create a replica set? Can we use the 2 GB of storage across both servers like a logical volume?
Otherwise, if my understanding is wrong, what is the correct thing?

The MongoDB sharding documentation says: "Sharding distributes data across the shards in the cluster, allowing each shard to contain a subset of the total cluster data. As the data set grows, additional shards increase the storage capacity of the cluster". See https://docs.mongodb.com/manual/sharding/ (storage capacity).
Since each shard holds a subset, the two servers contain different sets of data. So a replica set can serve different purposes depending on how it is used: as a shard that stores a subset of the data, or as a set of replicas that keeps redundant copies as a backup.
Sharding happens one level above replication.
When you use both sharding and replication, your cluster consists of many replica-sets and one replica-set consists of many mongod instances.
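However, it is also possible to create a cluster of stand-alone mongod instances that are not replicated, or to have some shards implemented as replica sets and others as stand-alone mongod instances. (This applies to older MongoDB versions; since MongoDB 3.6, each shard must be deployed as a replica set, although that replica set may have a single member.)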
In other words, each shard is its own replica set; the shards together do not form one replica set.
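To make the storage arithmetic concrete, here is a minimal sketch in Python. It assumes a simplified model in which every server has the same disk size, data is spread evenly across shards, and overhead (indexes, oplog, config servers) is ignored. The function name and parameters are made up for illustration and are not part of any MongoDB API.

```python
def usable_capacity_gb(num_shards, members_per_shard, disk_gb_per_server):
    """Return (servers_needed, usable_gb) for a simplified sharded cluster.

    Each shard stores a different subset of the data, so shards add capacity.
    Members inside one shard store the same subset, so they add copies only.
    """
    servers_needed = num_shards * members_per_shard
    usable_gb = num_shards * disk_gb_per_server
    return servers_needed, usable_gb


# Two 1 GB servers used as two shards with one member each:
print(usable_capacity_gb(num_shards=2, members_per_shard=1, disk_gb_per_server=1))

# The same two servers used as one shard with two replicated members:
print(usable_capacity_gb(num_shards=1, members_per_shard=2, disk_gb_per_server=1))
```

In this model, the two servers give 2 GB of usable space when they are two separate shards, and 1 GB when they are two members of the same replica set.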
I hope this helps.

Related

MongoDB bulk data load on sharded cluster: any specific settings I can take advantage of?

I have to insert 150 million records into a MongoDB sharded cluster. The cluster comprises five 3-member replica sets (primary, secondary, arbiter). Each record is uniform, about 1 KB per record.
Given that I know the number and size of records, is there anything I should do with respect to shard configuration to optimize this data load? I'm thinking number/size of chunks, etc?
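As a rough way to size the load described in this question, the sketch below estimates the total data volume and the number of chunks it would occupy. The record count and shard count come from the question. The record size of exactly 1024 bytes and the 64 MiB chunk size are assumptions for illustration (the configured chunk size differs between MongoDB versions and deployments), and index and storage overhead are ignored.

```python
import math

RECORDS = 150_000_000                 # from the question
BYTES_PER_RECORD = 1024               # "about 1kb", taken as 1024 bytes (assumption)
SHARDS = 5                            # from the question
CHUNK_SIZE_BYTES = 64 * 1024 * 1024   # assumed chunk size; check the cluster's setting

# Total raw data volume, ignoring indexes and storage overhead.
total_bytes = RECORDS * BYTES_PER_RECORD

# Number of full-size chunks needed to hold that volume.
chunks_total = math.ceil(total_bytes / CHUNK_SIZE_BYTES)

# Chunks per shard if the balancer (or pre-splitting) spreads them evenly.
chunks_per_shard = math.ceil(chunks_total / SHARDS)

print(f"total data: {total_bytes / 1024**3:.1f} GiB")
print(f"chunks: {chunks_total} total, about {chunks_per_shard} per shard")
```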

How to understand "The shards are replica sets."

When I put shards and replica sets together, I get confused.
Why does the reference say that the shards are replica sets?
Do replica sets contain shards?
Can someone give me a conceptual explanation?
A replica set is a cluster of MongoDB servers that implements master-slave replication. The same data is held by every member: the master, also called the primary node, and the slaves, called secondary nodes.
It replicates your data across multiple mongod instances so that the system survives failures. MongoDB also automatically elects a new primary from among the secondaries whenever the primary goes down.
Sharding is used to store a large data set across multiple machines. So, to compare the two: sharded nodes do not (or may not) contain the same data, whereas replicated nodes do.
Sharding has a different purpose: a large data set is spread across multiple machines.
Now, each subset of this large data set can also be replicated to multiple nodes, as a primary and secondaries, to survive failures. So basically a shard can be a replica set with multiple members. The replica set of a shard contains a subset of the large data set.
So, together the shards make up the whole data set, which is divided into chunks. These chunks can be replicated within a shard by using a replica set.
You can find more details on this in the MongoDB manual.
Sharding happens one level above replication.
When you use both sharding and replication, your cluster consists of many replica-sets and one replica-set consists of many mongod instances.
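However, it is also possible to create a cluster of stand-alone mongod instances that are not replicated, or to have some shards implemented as replica sets and others as stand-alone mongod instances. (This applies to older MongoDB versions; since MongoDB 3.6, each shard must be deployed as a replica set, although that replica set may have a single member.)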
Each shard is a replica set; it is not that the shards together are one replica set.
This is a language barrier: in English, saying "the shards are replica sets" really means the same as "each shard is a replica set" in this context.
So to explain, say you have a collection of names a-z. Shard 1 holds a-b. This shard is also a replica set which means it has automated failover and replication of that range as well. So sharding in this sense is a top level term that comes above replica sets.
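The names example can be sketched as a small routing table in Python. Only "shard 1 holds a-b" comes from the explanation above; the other ranges, the shard names, and the member names are hypothetical. This is a simplified model of range-based routing, not how mongos is implemented.

```python
# Hypothetical inclusive first-letter ranges; only a-b on shard 1 is from the text.
SHARD_RANGES = [
    ("a", "b", "shard1"),
    ("c", "m", "shard2"),
    ("n", "z", "shard3"),
]

# Each shard is its own replica set (member names are made up for illustration).
REPLICA_SETS = {
    "shard1": ["shard1-primary", "shard1-secondary-1", "shard1-secondary-2"],
    "shard2": ["shard2-primary", "shard2-secondary-1", "shard2-secondary-2"],
    "shard3": ["shard3-primary", "shard3-secondary-1", "shard3-secondary-2"],
}


def route(name):
    """Return the shard that owns a name, based on its first letter."""
    first = name[0].lower()
    for low, high, shard in SHARD_RANGES:
        if low <= first <= high:
            return shard
    raise ValueError(f"no shard covers {name!r}")


def copies_of(name):
    """Return the replica set members that all hold a copy of this name."""
    return REPLICA_SETS[route(name)]


print(route("Alice"), copies_of("Alice"))
print(route("Zoe"), copies_of("Zoe"))
```

Sharding decides which shard a document belongs to; replication then decides how many servers inside that shard hold a copy of it.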
Shards are used to break up a collection and store parts of it in different places. A shard does not have to be a replica set; it can be a single server. But to achieve reliability and avoid data loss, a replica set can be used as a shard instead of a single server. Then, if one of the servers in the replica set goes down, the others still hold the data.

MongoDB sharding concept

I'm trying to figure out the mechanism of MongoDB's sharding.
Can someone tell me if I got this right?
Here is an example of shards based on what I figured out:
Sharded Cluster 1 :
Shard1 - contains chunk1, chunk2 and chunk5 (Replica Set of primary and two secondaries, so we have backup for those chunks)
Shard2 - contains chunk3, chunk4 and chunk6 (MongoDB single instance, so we do not have any back up for those chunks)
Sharded Cluster 2:
Shard1 - contains chunk2, chunk3 and chunk6 (MongoDB single instance, so we do not have any back up for those chunks)
Shard2 - contains chunk1, chunk4 and chunk5 (Replica Set of primary and two secondaries, so we have backup for those chunks)
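The first layout above can be written down as a small data structure to check which chunks have redundant copies. This is only a sketch of the mental model in Python: the dictionary shape and the function name are made up for illustration, and the member counts follow the description (3 for a primary with two secondaries, 1 for a single instance).

```python
# Sharded Cluster 1 from the description above.
cluster1 = {
    "Shard1": {"members": 3, "chunks": ["chunk1", "chunk2", "chunk5"]},
    "Shard2": {"members": 1, "chunks": ["chunk3", "chunk4", "chunk6"]},
}


def chunks_without_redundancy(cluster):
    """Return chunks that live on a shard with only one mongod instance."""
    return sorted(
        chunk
        for shard in cluster.values()
        if shard["members"] < 2
        for chunk in shard["chunks"]
    )


# Every chunk lives on exactly one shard; replication only copies it within that shard.
all_chunks = [c for shard in cluster1.values() for c in shard["chunks"]]

print(chunks_without_redundancy(cluster1))
```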
Thanks for providing such wonderful information. I would also like to add that sharding is basically done to increase I/O bandwidth and to partition in-memory data for more efficient use of distributed caching.

mongodb indices and scaling

Reading the MongoDB documentation for indexes, I was left a little mystified and unsettled by this assertion found at: http://docs.mongodb.org/manual/applications/indexes/#ensure-indexes-fit-ram
If you have and use multiple collections, you must consider the size
of all indexes on all collections. The indexes and the working set
must be able to fit in RAM at the same time.
So, how is this supposed to scale when new nodes are added to the shard? Suppose all my 576 nodes are limited to 8 GB, and I have 12 collections of 4 GB each (including their associated indexes) and 3 collections of 16 GB each (including indexes). How does sharding spread the data between nodes so that the 12 collections can be queried efficiently?
When sharding you spread the data across different shards. The mongos process routes queries to shards it needs to get data from. As such you only need to look at the data a shard is holding. To quote from When to Use Sharding:
You should consider deploying a sharded cluster, if:
your data set approaches or exceeds the storage capacity of a single node in your system.
the size of your system’s active working set will soon exceed the capacity of the maximum amount of RAM for your system.
Also note that the working set != whole collection. The working set is defined as:
The collection of data that MongoDB uses regularly. This data is typically (or preferably) held in RAM.
E.g. you have 1TB of data but typically only 50GB is used/queried. That subset is preferably held in RAM.
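Using the numbers from the question, the per-shard share can be worked out with a short Python sketch. It assumes the worst case where the whole data set is the working set, that the 8 GB limit refers to RAM, and that data is spread evenly across shards. None of that is guaranteed in practice, since the actual distribution depends on the shard key.

```python
import math

# Sizes from the question, in GB, including indexes.
collections_gb = [4] * 12 + [16] * 3
RAM_PER_NODE_GB = 8  # assumed to be the RAM limit per node

total_gb = sum(collections_gb)


def per_shard_gb(total, num_shards):
    """Data held by each shard if it is spread evenly (an assumption)."""
    return total / num_shards


def min_shards(total, ram_gb):
    """Fewest shards for which an even share fits in one node's RAM."""
    return math.ceil(total / ram_gb)


print(f"total: {total_gb} GB")
print(f"minimum shards to fit in RAM: {min_shards(total_gb, RAM_PER_NODE_GB)}")
print(f"share per shard with 576 shards: {per_shard_gb(total_gb, 576):.3f} GB")
```

If the real working set is smaller than the whole data set, as the answer above points out, fewer shards would be enough.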

Does mongo replication split data or duplicate it

I am creating a mongoDB/nodejs based CMS and I am using GridFS to store all the uploaded docs. The question I have is this:
Do MongoDB replica sets allow an increased amount of DB storage, or
do they simply duplicate the database? For instance, if I have 5 servers
with 1 TB of storage each and I replicate Mongo across all of them, would
my GridFS system theoretically have 5 TB of storage (minus caching and
padding), or 1 TB of storage duplicated several times for better read
performance?
Thanks!
Informal description:
Replication = the same copy of the data on multiple nodes, i.e., 5 nodes with 1 TB each provide 1 TB overall.
Sharding / partitioning = a fraction of the data goes to each node, i.e., 5 nodes with 1 TB each provide 5 TB overall.
Each approach has certain advantages and disadvantages. For example, replication can help with read throughput and is good as a backup, but slows down inserts (depending on the commit level), whereas partitioning can help with insert throughput and distributed lookups.
Again, the details are left to the storage system implementor.
Sharding means splitting your data across multiple nodes. This is useful when you have a huge amount of data.
Replication means copying the data from one node to another. It is useful when your application is read heavy or when you want to back up your data, for example.
Resources:
http://www.mongodb.org/display/DOCS/Sharding
http://www.mongodb.org/display/DOCS/Replication
http://nosql-exp.blogspot.com/2010/09/mongodb-sharding-and-replication-with.html
Do MongoDB replica sets allow an increased amount of DB storage, or
do they simply duplicate the database?
Mongo can do both.
The first case is called sharding.
The second case is called replication.