Does mongo replication split data or duplicate it?

I am creating a mongoDB/nodejs based CMS and I am using GridFS to store all the uploaded docs. The question I have is this:
Do MongoDB replica sets allow an increased amount of DB storage, or do
they simply duplicate the database? For instance, if I have 5 servers
with 1TB of storage each and I replicate mongo across all of them, would
my GridFS system theoretically have 5TB of storage (minus caching and
padding), or 1TB of storage duplicated several times for better read
performance?
Thanks!

Informal description:
Replication = The same copy of the data on multiple nodes, i.e., 5 nodes with 1TB each provide 1TB overall.
Sharding / Partitioning = Fraction of the data goes to the nodes, i.e., 5 nodes with 1TB each provide 5TB overall.
Each approach has certain advantages and disadvantages, e.g., replication can help with read throughput and is good as a backup, but slows down inserts (depending on the commit level), whereas partitioning can help with insert throughput and distributed lookups.
As always, the details are left to the storage system implementor.
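To make that concrete, here is a minimal mongosh sketch (host names and the database name are hypothetical; the GridFS shard key follows the commonly documented pattern):

    // Replication: one replica set; every member ends up holding the full data set.
    // Run against one of the future members.
    rs.initiate({
      _id: "rs0",
      members: [
        { _id: 0, host: "node1:27017" },
        { _id: 1, host: "node2:27017" },
        { _id: 2, host: "node3:27017" }
      ]
    })

    // Sharding: data is partitioned across shards by a shard key.
    // Run against a mongos router; for GridFS, the chunks collection is
    // typically sharded on { files_id: 1, n: 1 }.
    sh.enableSharding("cms")
    sh.shardCollection("cms.fs.chunks", { files_id: 1, n: 1 })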

Sharding means splitting your data across multiple nodes; this is useful when you have a huge amount of data.
Replication means copying the data from one node to another, which is useful when your application is read-heavy or when you want to back up your data, for example.
Resources:
http://www.mongodb.org/display/DOCS/Sharding
http://www.mongodb.org/display/DOCS/Replication
http://nosql-exp.blogspot.com/2010/09/mongodb-sharding-and-replication-with.html

Do MongoDB replica sets allow an increased amount of DB storage, or
do they simply duplicate the database?
Mongo can do both.
The first case is called sharding.
The second case is called replication.

Related

MongoDB general install question about replica set

I have a server with a mongodb database of 150GB.
There are 8 databases in it with intensive write activity (I'm storing tweets).
I notice some latency when reading data and am wondering whether it would be worthwhile to switch to a replica set, considering I only have one machine.
The idea would be to have mongo running on 3 different ports, each pointing to a different folder.
Would there be a benefit? I imagine that having 3 mongo instances with one dedicated to writing would be better but I'm not sure.
If yes, how should I configure the replica set (priority? arbiter?...)
Thanks for your help
To improve your read performance:
Add indexes if you don't have them yet, or optimize the current ones.
Use tighter projection stages in your queries, to fetch only the fields you need (see the sketch after this list).
Add more RAM.
Add more WiredTiger cache.
Improve filesystem performance (reduce swappiness, mount with noatime, etc.).
If your queries involve a lot of aggregation + sorting, add more CPU.
Disable diagnostic data collection (FTDC).
Move the journal to a separate partition.
Distribute the 8 databases across 8 different disks (SSD/XFS preferred).
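For the first two items, a quick mongosh sketch (the collection and field names are hypothetical):

    // Index the fields your queries filter and sort on.
    db.tweets.createIndex({ user_id: 1, created_at: -1 })

    // Project only the fields you need; if the projection is covered by the
    // index (note _id is excluded), Mongo can answer from the index alone.
    db.tweets.find(
      { user_id: 42 },
      { _id: 0, user_id: 1, created_at: 1 }
    ).sort({ created_at: -1 })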
You can switch to a replicaSet when:
You need redundancy (for production you always do).
Your use case allows reads with SECONDARY read preference (then adding more members benefits read performance, since you can distribute reads across SECONDARY members; see the sketch below).
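A minimal sketch of routing reads to secondaries from mongosh (this assumes your reads tolerate slightly stale data):

    // Allow this connection's reads to be served by secondary members.
    db.getMongo().setReadPref("secondaryPreferred")
    db.tweets.find({ user_id: 42 })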
Advice:
Considering you have write-intensive applications, it sounds like a better option to switch to a sharded cluster; adding more shards will distribute the writes better. Of course, for 150GB (such a small size), you need to decide whether it is worth the effort if this is only for testing :)
Something like this sounds nice to me:
2x mongos + 3x CSRS + 3x shards (3x members in each replicaSet/shard)
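Sketched as processes, that topology could look roughly like this (ports, paths, and names are arbitrary; one line shown per role, repeated per member):

    # 3x config servers forming the CSRS
    mongod --configsvr --replSet cfg --port 27019 --dbpath /data/cfg-a
    # 3x members per shard, for each of the 3 shard replica sets
    mongod --shardsvr --replSet shard1 --port 27018 --dbpath /data/sh1-a
    # 2x mongos routers in front
    mongos --configdb cfg/cfg-a:27019,cfg-b:27019,cfg-c:27019 --port 27017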

MongoDB sharding storage usage

I am reading about sharding in MongoDB. After understanding how it works, I have a very basic question regarding the storage space used by it.
Suppose I have a server containing 1 GB of storage. Assuming my data will grow beyond 1 GB, that won't be sufficient for my purpose, so I add one more server and shard Mongo.
So now, let's say I have 2 servers, with say 1 GB of storage each, to be included in the cluster. If I shard, both of these servers will be used to distribute Mongo's data, so in total I should have 2 GB of storage available for Mongo. But I find that the official sharding documentation mentions that shards are replica sets. If that is so, then wouldn't the addition of a 1 GB server just mean that I still have only 1 GB of storage (like before) for actual MongoDB data, with the remaining 1 GB being just replicated data?
If my understanding is correct, is there any way to not create a replica set? Can we use the combined 2 GB of storage from both servers, like a logical volume?
Otherwise, if my understanding is wrong, what is the correct thing?
The MongoDB sharding documentation says: "Sharding distributes data across the shards in the cluster, allowing each shard to contain a subset of the total cluster data. As the data set grows, additional shards increase the storage capacity of the cluster." See: https://docs.mongodb.com/manual/sharding/ (storage capacity)
Since each shard holds a subset, the two servers contain different sets of data. So a replica set can serve multiple purposes (shards storing a subset of the data, or keeping copies of the data as a backup), depending on the usage.
Sharding happens one level above replication.
When you use both sharding and replication, your cluster consists of many replica-sets and one replica-set consists of many mongod instances.
However, it is also possible to create a cluster of stand-alone mongod instances which are not replicated, or to have only some shards implemented as replica sets and some shards implemented as stand-alone mongod instances.
"Each shard is a replica set" means that every shard is its own replica set; it does not mean that the shards together form one replica set.
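In mongosh terms (hypothetical hosts; note that since MongoDB 3.6 shards are required to be replica sets, so the stand-alone form only applies to older versions):

    // A shard that is a replica set: "replSetName/member,member,..."
    sh.addShard("rs1/host1:27018,host2:27018,host3:27018")

    // An older-style stand-alone shard (pre-3.6 only).
    sh.addShard("host4:27018")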
I hope this helps.

Migrating chunks from primary shard to the other single shard takes too long

Each chunk move takes about 30-40 mins.
The shard key is a random-looking but monotonically increasing integer string, a long sequence of digits. A "hashed" index is created for that field.
There are 150M documents each about 1.5Kb in size. The sharded collection has 10 indexes (some of them compound).
I have a total of ~11k chunks reported in sh.status(). So far I could only transfer 42 of them to the other shard.
The system consists of one mongos, one config server, one primary (mongod) shard, and one other (mongod) shard, all on the same server, which has 8 cores and 32 GB of RAM.
I know the ideal is to use separate machines, but none of the CPUs were being utilized, so I thought it was good for a start.
What is your comment?
What do I need to investigate?
Is it normal?
As said on the mongodb documentation : " Sharding is the process of storing data records across multiple machines and is MongoDB’s approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide an acceptable read and write throughput. Sharding solves the problem with horizontal scaling. With sharding, you add more machines to support data growth and the demands of read and write operations."
You should definitely not have your shards on the same machine. It is useless. The point of sharding is horizontal scaling, so if you shard on the same machine you are just killing your throughput.
Your database will be faster without sharding if you have one machine.
To avoid data loss, before reaching for sharding you should use: RAID (not RAID 0), then replica sets, and only then sharding.
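As for what to investigate: the migration activity itself can be inspected from mongos (a mongosh sketch; nothing here is specific to your data):

    sh.status()                  // chunks per shard and balancer state
    sh.isBalancerRunning()       // is a migration round in progress right now?

    // Recent migration history, including how long each step took.
    db.getSiblingDB("config").changelog
      .find({ what: /moveChunk/ })
      .sort({ time: -1 })
      .limit(5)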

Difference between Sharding And Replication on MongoDB

I am just confused about how Sharding and Replication work. According to the definitions:
Replication: A replica set in MongoDB is a group of mongod processes that maintain the same data set.
Sharding: Sharding is a method for storing data across multiple machines.
As per my understanding, if there is 75 GB of data, then with replication (3 servers) it will store 75GB of data on each server, meaning 75GB on server-1, 75GB on server-2, and 75GB on server-3 (correct me if I am wrong), whereas with sharding it will be stored as 25GB of data on server-1, 25GB on server-2, and 25GB on server-3 (right?). But then I encountered this line in the tutorial:
Shards store the data. To provide high availability and data
consistency, in a production sharded cluster, each shard is a replica
set
If a replica set holds 75GB but a shard holds 25GB, how can they be equivalent? This confuses me a lot. I think I am missing something important here. Please help me with this.
Let's try an analogy: you are running a library.
Like anyone who runs a library, you have books, and you store all the books you have on a shelf. This is good, but your library has become so good that your rival wants to burn it down. So you decide to set up many additional shelves in other places. There is one most important shelf, and whenever you add new books there, you quickly add the same books to the other shelves. Now if the rival destroys a shelf, this is not a problem: you just open another one and copy the books over.
This is replication (just substitute library with application, shelf with a server, book with a document in the collection, and your rival with a failed HDD on the server). It just makes additional copies of the data, and if something goes wrong it automatically selects another primary.
This concept may help if you:
want to scale reads (though they might lag behind the primary),
do some offline reads which do not touch the main server,
serve some part of the data for a specific region from a server in that specific region.
But the main reason behind replication is data availability. So here you are right: if you have 75GB of data and replicate it with 2 secondaries, you will get 75 * 3 GB of data.
Look at another scenario. There is no rival, so you do not want to make copies of your shelves. But now you have another problem: you became so good that one shelf is not enough. You decide to distribute your books between many shelves, based on the author's name (this may not be a good idea; read about how to select a sharding key here). So everything whose author name is less than K goes to one shelf, and everything that is K or more goes to another. This is sharding.
This concept may help you:
distribute a workload,
store much more data than can fit on a single server,
do map-reduce things,
keep more data in RAM for faster queries.
Here you are partially correct. If you have 75GB, then in total across all the servers there will still be 75GB, but it does not necessarily divide equally.
But there is a problem with sharding alone. Now your rival appears, goes to one of your shelves, and burns it. All the data on that shelf is lost. So you want to replicate every shard as well. Basically, the notion that
each shard is a replica set
is not strictly true; a shard does not have to be a replica set. But if you are doing sharding, you should create a replica set for every shard, because the more shards you have, the higher the probability that at least one will die.
Answering Saad's follow-up (below):
You can have shards and replicas together on the same server, but it is not the recommended way of doing it; each server should have a single role in the system. If, for example, you decide to have 2 shards and to replicate each of them 3 times, you will end up with 6 machines.
I know this might sound too costly, but you have to remember that this is commodity hardware, and if the service you are providing is already so good that you are thinking about high availability and the data does not fit on one machine, then this is a rather cheap price to pay (in comparison to one dedicated big machine).
I am writing this as an answer, but it is actually a follow-up question to @Salvador's answer.
As you said, with sharding the 75 GB of data "may be" stored as 25GB on server-1, 25GB on server-2, and 25GB on server-3 (the distribution depends on the shard key). Then, to protect it from loss, we also need to replicate each shard. So every server now contains its own shard and also replicas of the shards on the other servers, meaning server-1 will have:
1) Its own shard.
2) A replica of the shard on server-2.
3) A replica of the shard on server-3.
The same goes for server-2 and server-3. Am I right? If so, then each server again holds 75GB of data. Right or wrong?
Since we want to make 3 shards and also replicate the data, the following addresses the above problem.
If the same server has a shard and also its replica set, then the failure of that server will lead to the loss of both the replica set and the shard.
You can, however, put shard 1 and a replica set (replicas of shard 2 and shard 3) on the same server, but this is not advisable.
Sharding is like partitioning data.
Let's say you have around 3GB of data and you define 3 shards; each shard MIGHT then take 1GB of data (it truly depends on the shard key).
Why is sharding needed? Searching for a specific piece of data in 3GB is three times as complex as searching in 1GB, so it is very similar to partitioning, and sharding helps with fast access to the data.
Now coming to replicas: say you have the same 3GB of data without any replication (meaning only a single copy of the data exists); if anything happens to that machine or the drive, your data is gone. So replication comes into the picture to solve this problem. Say that when you set up the DB you set your replication factor to 3; then the same 3GB of data is available 3 times (so the total size could be 9GB, divided into three 3GB copies). Replication helps with failover.
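To see how evenly (or unevenly) a sharded collection actually splits, mongosh has a helper (the collection name here is hypothetical):

    // Prints per-shard data size, document count, and estimated distribution.
    db.books.getShardDistribution()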

mongodb indices and scaling

Reading the MongoDB documentation for indexes, I was left a little mystified and unsettled by this assertion, found at http://docs.mongodb.org/manual/applications/indexes/#ensure-indexes-fit-ram :
If you have and use multiple collections, you must consider the size
of all indexes on all collections. The indexes and the working set
must be able to fit in RAM at the same time.
So, how is this supposed to scale when new nodes are added to the shard? Suppose all of my 576 nodes are bounded at 8GB, and I have 12 collections of 4GB each (including their associated indexes) and 3 collections of 16GB (including indexes). How does the sharding spread work between nodes so that the 12 collections can be queried efficiently?
When sharding, you spread the data across different shards. The mongos process routes queries to the shards it needs to get data from, so each shard only has to handle the data it is holding. To quote from When to Use Sharding:
You should consider deploying a sharded cluster, if:
your data set approaches or exceeds the storage capacity of a single node in your system.
the size of your system’s active working set will soon exceed the capacity of the maximum amount of RAM for your system.
Also note that the working set != the whole collection. The working set is defined as:
The collection of data that MongoDB uses regularly. This data is typically (or preferably) held in RAM.
E.g., you have 1TB of data, but typically only 50GB is used/queried. That subset is preferably held in RAM.
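To check how close your indexes are to the RAM limit, mongosh can report their sizes (the collection name is hypothetical):

    db.stats().indexSize             // total index size for the current database, in bytes
    db.articles.totalIndexSize()     // all indexes on one collection
    db.articles.stats().indexSizes   // per-index breakdown for that collection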