Difference between Sharding and Replication in MongoDB

I am just confused about sharding and replication and how they work. According to the definitions:
Replication: A replica set in MongoDB is a group of mongod processes that maintain the same data set.
Sharding: Sharding is a method for storing data across multiple machines.
As per my understanding, if there is 75 GB of data, then with replication (3 servers) it will store 75 GB of data on each server, i.e. 75 GB on server-1, 75 GB on server-2 and 75 GB on server-3 (correct me if I am wrong), whereas with sharding it will be stored as 25 GB of data on server-1, 25 GB on server-2 and 25 GB on server-3 (right?). But then I encountered this line in the tutorial:
Shards store the data. To provide high availability and data consistency, in a production sharded cluster, each shard is a replica set.
A replica set holds 75 GB but a shard holds 25 GB, so how can they be equivalent? This confuses me a lot; I think I am missing something important here. Please help me with this.

Let's try an analogy. You are running a library.
Like anyone who runs a library, you have books, and you store all the books you have on a shelf. This is good, but your library became so good that your rival wants to burn it. So you decide to make many additional shelves in other places. There is one most important shelf, and whenever you add some new books you quickly add the same books to the other shelves. Now if the rival destroys a shelf, this is not a problem: you just open another one and copy the books into it.
This is replication (just substitute library with application, shelf with a server, book with a document in the collection, and your rival is just a failed HDD on the server). It just makes additional copies of the data, and if something goes wrong it automatically selects another primary.
This concept may help if you:
want to scale reads (though they might lag behind the primary)
do some offline reads which do not touch the main server
serve part of the data for a specific region from a server in that specific region
But the main reason behind replication is data availability. So here you are right: if you have 75 GB of data and replicate it with 2 secondaries, you will get 75 * 3 GB of data.
Look at another scenario. There is no rival, so you do not want to make copies of your shelves. But now you have another problem: you became so good that one shelf is not enough. You decide to distribute your books between many shelves, based on the author's name (this is not necessarily a good idea; read how to select a sharding key here). So everything whose author's name starts with a letter less than K goes to one shelf, and everything that is K or more goes to another. This is sharding.
This concept may help you:
distribute a workload
store much more data than can fit on a single server
do map-reduce things
keep more data in RAM for faster queries
Here you are partially correct. If you have 75 GB, then in total across all the servers there will still be 75 GB, but it is not necessarily divided equally.
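To make the shelf analogy a bit more concrete, here is a tiny, purely illustrative Python sketch of a range split on the author's name (this is not how MongoDB routes internally; mongos and the config servers do this for you based on the shard key):

    # Illustrative only: a manual range split by author name, mirroring the shelves.
    books = [
        {"author": "Austen", "title": "Emma"},
        {"author": "King", "title": "It"},
        {"author": "Tolkien", "title": "The Hobbit"},
    ]

    shelf_one = [b for b in books if b["author"] < "K"]   # authors before K
    shelf_two = [b for b in books if b["author"] >= "K"]  # K and onwards

    print(len(shelf_one), len(shelf_two))  # 1 2 -- the split need not be even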
But there is a problem with sharding alone. Now your rival appears, comes to one of your shelves and burns it. All the data on that shelf is lost. So you want to replicate every shard as well. Strictly speaking, the notion that
each shard is a replica set
is not enforced, but if you are doing sharding you have to create replication for every shard, because the more shards you have, the higher the probability that at least one of them will die.

Answering Saad's follow-up answer:
Although you can have shards and replicas together on the same server, it is not the recommended way of doing it. Each server should have a single role in the system. If, for example, you decide to have 2 shards and to replicate each of them 3 times, you will end up with 6 machines.
I know that this might sound too costly, but you have to remember that this is commodity hardware, and if the service you are providing is already so good that you think about high availability and it does not fit on one machine, then this is a rather cheap price to pay (in comparison to one dedicated big machine).

I am writing this as an answer, but actually it's a question about #Salvador Sir's answer.
Like you said, in sharding, 75 GB of data "may be" stored as 25 GB on server-1, 25 GB on server-2 and 25 GB on server-3 (this distribution depends on the sharding key). Then, to protect it from loss, we also need to replicate each shard. So this means that now every server contains its own shard and also replicas of the shards present on the other servers, meaning Server-1 will have:
1) Its own shard.
2) A replica of the shard present on server-2.
3) A replica of the shard present on server-3.
The same goes for Server-2 and Server-3. Am I right? If this is the case, then each server again has 75 GB of data. Right or wrong?

Since we want to make 3 shards and also replicate the data, the following is the solution to the above problem.
Note that if a server hosts a shard and also the replicas of that same shard, then the failure of that server will lead to the loss of both the replica set and the shard.
However, you can have shard 1 and a replica set (replicas of shard 2 and shard 3) on the same server, but this is not advisable.

Sharding is like partitioning of data.
Let's say you have around 3 GB of data and you define 3 shards, so each shard MIGHT take 1 GB of data (it really depends on the shard key).
Why is sharding needed? Searching for specific data in 3 GB is three times as complex as searching in 1 GB of data, so it is very similar to partitioning, and sharding helps with fast access to data.
Now coming to replicas: let's say you have the same 3 GB of data without any replication (that is, only a single copy of the data exists), so if anything happens to that machine or the drive, your data is gone. Replication comes into the picture to solve this problem: if, when you set up the DB, you set your replication factor to 3, the same 3 GB of data is available 3 times (so the total size is 9 GB, as three 3 GB copies). Replication helps with failover.
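A rough back-of-the-envelope calculation for the numbers above (3 GB of data, 3 shards, a replication factor of 3; purely illustrative):

    data_gb = 3
    num_shards = 3
    replication_factor = 3

    per_shard_gb = data_gb / num_shards              # ~1 GB per shard (really depends on the shard key)
    total_storage_gb = data_gb * replication_factor  # 9 GB of raw storage across all copies

    print(per_shard_gb, total_storage_gb)            # 1.0 9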

Related

MongoDB load balancing in multiple AWS instances

We're using Amazon Web Services for a business application which uses a node.js server and MongoDB as the database. Currently the node.js server is running on an EC2 medium instance, and we're keeping our MongoDB database in a separate micro instance. Now we want to deploy a replica set in our MongoDB database, so that if MongoDB gets locked or becomes unavailable, we can still run our database and get data from it.
So we're trying to keep each member of the replica set in a separate instance, so that we can get data from the database even if the instance of the primary member shuts down.
Now I want to add a load balancer to the database, so that the database works fine even under a huge traffic load. In that case I can balance reads across the database by adding the slaveOK config to the replica set, but that will not load-balance the database if there is a huge traffic load of write operations.
To solve this problem I have got two options so far.
Option 1: I have to shard the database and keep each shard in a separate instance, and under each shard there will be a replica set in the same instance. But there is a problem: as the shard divides the database into multiple parts, each shard will not keep the same data within it. So if one instance shuts down, we'll not be able to access the data from the shard within that instance.
To solve this problem I'm trying to divide the database into shards, and each shard will have a replica set in separate instances. So even if one instance shuts down, we won't face any problem. But if we have 2 shards and each shard has 3 members in the replica set, then I need 6 AWS instances. So I think it's not the optimal solution.
Option 2: We can create a master-master configuration in MongoDB, meaning all the databases will be primary and all will have read/write access, but I would also like them to auto-sync with each other every so often, so they all end up being clones of each other. And all these primary databases will be in separate instances. But I don't know whether MongoDB supports this structure or not.
I haven't found any MongoDB docs or blogs for this situation. So please suggest what the best solution for this problem would be.
This won't be a complete answer by far; there are too many details, and I could write an entire essay about this question, as could many others. However, since I don't have that kind of time to spare, I will add some commentary about what I see.
Now, I want to add load balancer in the database, so that the database works fine even in huge traffic load at a time.
Replica sets are not designed to work like that. If you wish to load balance, you might in fact be looking for sharding, which will allow you to do this.
Replication is for automatic failover.
In that case I can read balance the database by adding slaveOK config in the replicaSet.
Since, to stay up to date, your members will be getting just as many ops as the primary it seems like this might not help too much.
In reality, instead of having one server with many connections queued, you have many connections on many servers queueing for stale data, since member consistency is eventual, not immediate, unlike ACID technologies. That being said, members are typically only 32-odd ms behind, which means they are not lagging enough to give decent throughput if the primary is loaded.
Since reads ARE concurrent you will get the same speed whether you are reading from the primary or secondary. I suppose you could delay a slave to create a pause of OPs but that would bring back massively stale data in return.
Not to mention that MongoDB is not multi-master; you can only write to one node at a time, which makes slaveOK not the most useful setting in the world any more, and I have seen numerous times where 10gen themselves recommend you use sharding over this setting.
Option 2: We can create a master-master configuration in the mongodb,
This would require your own coding. At that point you may want to consider actually using a database that supports http://en.wikipedia.org/wiki/Multi-master_replication
This is because the speed you are looking for is most likely, in fact, in writes rather than reads, as I discussed above.
Option 1: I've to shard the database and keep each shard in separate instance.
This is the recommended way, but you have found the caveat with it. This is unfortunately something that remains unsolved and that multi-master replication is supposed to solve; however, multi-master replication does add its own ship of plague rats to Europe, and I would strongly recommend you do some serious research before concluding that MongoDB cannot currently service your needs.
You might be worrying about nothing, really, since the fsync queue is designed to deal with the IO bottleneck slowing down your writes (as it would in SQL), and reads are concurrent, so if you plan your schema and working set right you should be able to get a massive number of ops.
There is in fact a linked question around here from a 10gen employee that is very good to read: https://stackoverflow.com/a/17459488/383478 and it shows just how much throughput MongoDB can achieve under load.
It will grow soon with the new document level locking that is already in dev branch.
Option 1 is the recommended way, as pointed out by #Sammaye, but you would not need 6 instances and can manage it with 4 instances.
Assuming you need the configuration below:
2 shards (S1, S2)
1 copy for each shard (Replica set secondary) (RS1, RS2)
1 Arbiter for each shard (RA1, RA2)
You could then divide your server configuration like below.
Instance 1 : Runs : S1 (Primary Node)
Instance 2 : Runs : S2 (Primary Node)
Instance 3 : Runs : RS1 (Secondary Node S1) and RA2 (Arbiter Node S2)
Instance 4 : Runs : RS2 (Secondary Node S2) and RA1 (Arbiter Node S1)
You could run arbiter nodes along with your secondary nodes, which would help you with elections during failovers; one possible replica set configuration for shard S1 is sketched below.
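This is only an illustrative sketch, assuming a recent pymongo driver and placeholder hostnames (instance-1, instance-3, instance-4); the replica set name shard1rs is made up for this example. Initiating shard S1's replica set with its secondary and arbiter could look like this:

    from pymongo import MongoClient

    # Connect directly to the mongod on Instance 1 before the set is initiated.
    client = MongoClient("instance-1:27017", directConnection=True)

    config = {
        "_id": "shard1rs",  # hypothetical replica set name for shard S1
        "members": [
            {"_id": 0, "host": "instance-1:27017", "priority": 2},       # S1 primary
            {"_id": 1, "host": "instance-3:27017"},                      # RS1 secondary
            {"_id": 2, "host": "instance-4:27017", "arbiterOnly": True}, # RA1 arbiter
        ],
    }
    client.admin.command("replSetInitiate", config)

Shard S2 would be initiated the same way, with its primary on Instance 2, its secondary (RS2) on Instance 4 and its arbiter (RA2) on Instance 3.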

Mirror Production Mongo Data for Analytics

I have a Mongo cluster that backs an application that I use in production. It's very important to my business and clustered across a number of boxes to optimize for speed and redundancy. I'd like to make the data in said cluster available for running analytical queries and enqueued tasks, but I definitely don't want these to harm production performance. Is it possible to just mirror all of my data against a single box I throw into the cluster with some special tag that I can then use for analytics? It's fine if it's slow. I just want it to be cheap and not to affect production read/write speeds.
Since you're talking about redundancy, I assume you have a replica set.
In that case you can use a hidden replica set member to perform the calculations you need.
Just keep in mind that the member count must be odd. If you add a node, you might also need to add an arbiter. Or maybe you can just hide one of the already existing members.
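A minimal sketch of the analytics side, assuming a recent pymongo driver; the hostname (analytics-node), database and collection names are made up. Hidden members are invisible to normal secondary reads, so you connect to them directly:

    from pymongo import MongoClient, ReadPreference

    # Connect straight to the hidden member rather than through the replica set URI.
    analytics = MongoClient("analytics-node:27017", directConnection=True)

    # A direct connection to a secondary needs a non-primary read preference.
    events = analytics.get_database(
        "mydb", read_preference=ReadPreference.SECONDARY
    ).events

    # Heavy aggregations run only on this box and never add load to the primary.
    pipeline = [{"$group": {"_id": "$type", "count": {"$sum": 1}}}]
    for row in events.aggregate(pipeline):
        print(row)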
If you are looking for a way to increase querying speed when you have a lot of data, you might look into sharding with MongoDB. Basically, what it does is divide your big amount of data into small shards and store them on different machines.
If you are looking to increase redundancy (in order to make backups or to be able to do offline processing without touching the primary servers), you have to look into replication with MongoDB. If you are doing replication, keep in mind that the data on the replicas will always lag behind the primary (nothing to worry about, but you need to know this to decide whether you can allow reads from the replicas). As was pointed out by Rafa, hidden replica set members are well suited for backups and offline data processing. They will still get all the data from the primary (with a small lag), but they are invisible to secondary reads and cannot become primary.
There is a nice MongoDB course which talks in depth about replication and sharding, so it may be worth taking.

MongoDB - how to best achieve active/active configuration?

I have an application which is very low on writes. I'm therefore interested in deploying a mongo installation which maximizes the read throughput for the hardware I have (3 database servers in one location). I don't really care for redundancy (backups), but would like automatic failover. Additionally, I'm fine with "eventual consistency", and don't mind if data which isn't the latest data is returned.
I've looked into both sharding and replica sets, and as far as I can tell, I don't really need to use sharding, as its benefits are more suited to applications with many writes.
I therefore went ahead and installed a replica set on the three servers I have, and I then set the reading preference to "Nearest", as that would allow reads to take place on any server.
The problem is, I later read that the client is "sticky": basically, once it has chosen a "nearest" mongo server, it's not likely to change it. Besides, even if it were to "check for nearest" again, it would probably choose the same one over again. This pretty much results in an active/passive configuration, without any load balancing. I do have two application servers, so if they choose different mongo servers it might work OK, but say I wanted to have more than 3 mongo servers in the replica set; then any servers besides a specific two would be passive.
Basically my question is, what's the best way to have an active/active configuration for my deployment? All I want is for requests to go to free mongo servers rather than busy ones.
One way to force this which I thought of is to create three sharded clusters (each server participating in all three), where each server is the primary in one of these clusters. But this is still not optimal because, besides the relative complexity involved in this configuration, it also doesn't guarantee complete load balancing (for example, if all requests at a given moment happen to go to one specific shard).
What's the right way to achieve what I want? If it's not possible to achieve this kind of load balancing with mongo, would you recommend that I go with the sharded-clusters solution?
As you already suspected, scaling reads is not a "one size fits all" problem. Everything will depend on your data, your access patterns, your requirements and probably a few other things only you can determine.
In a nutshell, the main thing to consider is why a single server can't handle your read load. If it's because of the size of your data set and the size of your indexes then sharding your data across three shards will reduce the RAM requirements of each of them (or to put it another way will give you the combined RAM of all three systems). As long as you pick a good shard key (one that will distribute the load approximately evenly across all the systems) you will get almost three times the throughput on targeted queries.
If the main requirement for your reads is to reduce as much as possible the latency of reading the data, then a replica set can serve your purposes well as reading from the "nearest" node will reduce the network round-trip time without changing the duration of the operation on the MongoDB server. This assumes that your writes are infrequent enough or that your application has tolerance of possibly stale data.
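For the latency-oriented case, a minimal pymongo sketch (hostnames, the replica set name rs0, and the database/collection names are placeholders) that lets the driver read from whichever member has the lowest measured round-trip time:

    from pymongo import MongoClient

    # readPreference=nearest picks the member with the lowest ping time;
    # localThresholdMS widens the window of members considered "near".
    client = MongoClient(
        "mongodb://mongo1:27017,mongo2:27017,mongo3:27017/"
        "?replicaSet=rs0&readPreference=nearest&localThresholdMS=30"
    )

    doc = client.mydb.items.find_one({"_id": 1})  # may be answered by any member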

In Mongo what is the difference between sharding and replication?

Replication seems to be a lot simpler than sharding, unless I am missing the benefits of what sharding is actually trying to achieve. Don't they both provide horizontal scaling?
In the context of scaling MongoDB:
replication creates additional copies of the data and allows for automatic failover to another node. Replication may help with horizontal scaling of reads if you are OK to read data that potentially isn't the latest.
sharding allows for horizontal scaling of data writes by partitioning data across multiple servers using a shard key. It's important to choose a good shard key. For example, a poor choice of shard key could lead to "hot spots" of data only being written on a single shard.
A sharded environment does add more complexity because MongoDB now has to manage distributing data and requests between shards -- additional configuration and routing processes are added to manage those aspects.
Replication and sharding are typically combined to create a sharded cluster where each shard is supported by a replica set.
From a client application point of view you also have some control over the replication/sharding interaction, in particular (see the sketch after this list):
Read preferences
Write concerns
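A minimal driver-side sketch of both knobs (pymongo; the database and collection names are made up, and w="majority" is just one reasonable choice):

    from pymongo import MongoClient, ReadPreference, WriteConcern

    client = MongoClient("mongodb://mongo-host:27017")

    # Read preference: allow reads from secondaries, accepting possibly stale data.
    orders_read = client.get_database(
        "shop", read_preference=ReadPreference.SECONDARY_PREFERRED
    ).orders

    # Write concern: wait until a majority of replica set members acknowledge the write.
    orders_write = client.shop.get_collection(
        "orders", write_concern=WriteConcern(w="majority")
    )

    orders_write.insert_one({"sku": "abc", "qty": 2})
    print(orders_read.count_documents({}))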
Consider that you have a great music collection on your hard disk. You store the music in logical order, based on year of release, in different folders.
You are concerned that your collection will be lost if the drive fails.
So you get a new disk and occasionally copy the entire collection, keeping the same folder structure.
Sharding >> Keeping your music files in different folders
Replication >> Syncing your collection to other drives
Replication is mostly a traditional master/slave setup: data is synced to backup members, and if the primary fails one of them can take its place. It is a reasonably simple tool. It's primarily meant for redundancy, although you can scale reads by adding replica set members. That's a little complicated, but works very well for some apps.
Sharding sits on top of replication, usually. "Shards" in MongoDB are just replica sets with something called a "router" in front of them. Your application connects to the router and issues queries, and the router decides which replica set (shard) to forward things on to. It's significantly more complex than a single replica set because you have the router and config servers to deal with (these keep track of what data is stored where).
If you want to scale Mongo horizontally, you'd shard. 10gen likes to call the router/config server setup auto-sharding. It's possible to do a more ghetto form of sharding where you have the app decide which DB to write to as well.
Sharding
Sharding is a technique of splitting up a large collection amongst multiple servers. When we shard, we deploy multiple mongod servers, and in front of them, mongos, which is a router. The application talks to this router, and the router then talks to the various mongod servers. The application and the mongos are usually co-located on the same server, and we can have multiple mongos services running on the same machine. It's also recommended to keep a set of multiple mongods (together called a replica set), instead of one single mongod on each server. A replica set keeps the data in sync across several different instances, so that if one of them goes down we won't lose any data. Logically, each replica set can be seen as a shard. This is transparent to the application; the way MongoDB chooses to shard is that we choose a shard key.
Assume that for a student collection we have stdt_id as the shard key (it could also be a compound key). The mongos server uses a range-based system, so based on the stdt_id that we send as the shard key, it sends the request to the right mongod instance.
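As a hedged illustration (assuming pymongo connected to a mongos router; the database name school and the hostname are made up), enabling range-based sharding on the student collection with stdt_id as the shard key could look like this:

    from pymongo import MongoClient

    # Connect to the mongos router, not to an individual mongod.
    client = MongoClient("mongodb://mongos-host:27017")

    # Enable sharding for the database, then shard the collection on stdt_id.
    client.admin.command("enableSharding", "school")
    client.admin.command(
        "shardCollection",
        "school.students",
        key={"stdt_id": 1},  # range-based shard key, as described above
    )

    # A query that includes the shard key can be routed to a single shard.
    doc = client.school.students.find_one({"stdt_id": 12345})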
So, what do we need to really know as a developer?
an insert must include the shard key; if it's a multi-part shard key, we must include the entire shard key
we have to understand what the shard key is on the collection itself
for an update, remove or find, if mongos is not given a shard key, it is going to have to broadcast the request to all the shards that cover the collection
for an update, if we don't specify the entire shard key, we have to make it a multi-update so that it knows it needs to be broadcast
Whenever you're thinking about sharding or replication, you need to think in the context of writes/update operations. If you don't need to scale writes, then replication, being fairly simpler, is a good choice for you.
On the other hand, if your workload is mostly updates/writes, then at some point you'll hit a write bottleneck. When a write request comes in, Mongo blocks other write requests; those write requests block until the first one is done. If you want to scale these writes and parallelize them, you need to implement sharding.
Just to put this somewhere...
The most basic way to run mongo is as a standalone server:
you write a config (file or CLI options)
start the server using mongod
For this picture, I didn't include the "client". Check the next one.
A replica set is a set of servers initialized exactly as above with a different config file.
To link them, we connect to one of them, and initialize the replica set mode.
They will mirror each other (in the most common configuration). This system guarantees high availability of data.
The initialization of the replica set is represented in the red border box.
Sharding is not about replicating data, but about fragmenting data.
Each fragment of data is called a chunk and goes to a different shard (here, shard = a replica set).
A "main" server runs mongos instead of mongod; this is a router for queries from the client.
Obvious: the trade-off is a more complex architecture.
Novelty: a configuration server (again, a different config file). A small driver-side sketch follows.
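As a small hedged sketch (the hostname is a placeholder), the application connects to mongos exactly as it would to a single server, and can ask the router which shards it knows about:

    from pymongo import MongoClient

    # The application talks to mongos; mongos talks to the shards.
    client = MongoClient("mongodb://mongos-host:27017")

    # listShards is an admin command answered by the router via the config servers.
    result = client.admin.command("listShards")
    for shard in result["shards"]:
        print(shard["_id"], shard["host"])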
There is much more to add, but apart from the words the pictures hold much the same.
Even MongoDB recommends studying your case carefully before going with sharding. Vertical scaling (vs) is probably a good idea at least once before horizontal scaling (hs).
Vertical scaling is done by upgrading hardware (CPU, RAM, etc.); horizontal scaling needs more computers (but they can be cheap computers).
Both replication and sharding can be used (individually or together) for horizontal scaling of a MongoDB installation.
Sharding is MongoDB's solution for meeting the demands of data growth. Sharding stores data records across multiple servers to provide faster throughput on read and write queries, particularly for very large data sets.
Each shard in the cluster can respond to read and write operations for its portion of the data, which greatly speeds up query responses.
Replication is MongoDB's solution for providing stability, backup, and disaster recovery to a MongoDB installation. This process copies and synchronizes the replica data set across multiple servers. This prevents downtime if one server goes offline.
Any of the secondary servers can respond to read queries, but only the primary server will perform write operations. The results of the write operation will then be propagated out to the secondary servers.
Scenario 1: Fault-Tolerance
In this scenario, the user is storing billing data in a MongoDB installation. This data is mission-critical to the user's business, and needs to be available 24/7, even if a server crashes or is taken offline.
MongoDB replication is the best solution for this user. With replication, the entire data set is mirrored on multiple servers. If a server fails or is taken offline, the other servers in the cluster take over.
Scenario 2: High Performance
In this scenario, the user is running a social networking site which is run from a MongoDB database. As the social network grows, the MongoDB data set has grown along with it. The user is seeing query times and page loads increase beyond an acceptable point. It is critical that the user's MongoDB installation receives a major performance boost.
Setting up a sharded MongoDB cluster is the best solution for this user. The sharded cluster will break up the user's data set and store parts of it on separate servers. Each shard server can respond to read or write queries on its portion of the data, which greatly improves the installation's response time.
MongoDB Atlas is a database-as-a-service in the cloud. It supports the three major cloud providers: Azure, AWS and GCP. In a cloud environment, we usually talk about high availability and scalability. In Atlas, "clusters" can be either a replica set or a sharded cluster.
These two address the high-availability and scalability features of the cloud environment.
In general, a cluster is a group of servers used to achieve a specific task. Sharded clusters are used to store data across multiple machines to meet the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data, nor provide acceptable read and write throughput. Sharded clusters support the horizontal scalability of the underlying cloud environment.
A replica set in MongoDB is a group of mongod processes that maintain the same data set. Replica sets provide redundancy and high availability, and are the basis for all production deployments. In a replica set, one node is the primary node that receives all write operations. All other instances, the secondaries, apply operations from the primary so that they have the same data set. Replica sets mainly focus on the availability of data.
Please check the documentation
Thank You.

Mongodb and Cassandra data storing mechanism

I have been reading about MongoDB and Cassandra. MongoDB is master/slave, whereas Cassandra is masterless (all nodes are equal). My doubt is about how the data is stored in each of them.
Let's say a user is writing a request to MongoDB (a cluster with a master and different slaves, each on a separate machine). This means the master will decide (or through some application implementation) which slave this update should be written to. That is, the same data will not be available on all the nodes in MongoDB, and each node's size may vary. Am I right? Also, when queried, will the master know which node this request should be sent to?
In the case of Cassandra, will the same data be written to all the nodes, i.e. effectively if one node's size is 10 GB, then every other node's size is also 10 GB? Because only if this is the case will the user not lose any data by querying another node when one node fails. Am I right here? If I am right and the same data is available on all the nodes, then what is the advantage of using map/reduce functions in Cassandra? If I am wrong, then how is availability maintained in Cassandra, since the same data will not be available on the other nodes?
I was searching Stack Overflow for MongoDB vs Cassandra and have read about ten posts, but my questions were not cleared up by the answers in those posts. Please clear up my doubts, and if I have assumed anything wrongly, correct me.
Regarding MongoDB, yep you're right, there is only one primary.
Any secondary can become primary as long as it is in sync, since this means the secondary has all the data. Each node doesn't have to be the same on-disk size, and this can vary depending on when the replication was done; however, they do have the same data (as long as they're in sync).
I don't know much about Cassandra, sorry!
I've written a thesis about NoSQL stores, and therefore I hope that I remember most parts correctly for Cassandra:
Cassandra is a mixture of Amazon's Dynamo, from which it inherits replication and sharding, and Google's BigTable, from which it got the data model. So Cassandra basically shards your data while keeping copies of it on other nodes. Let's have a five-node cluster, with nodes called A to E. Your keys are hashed onto the keyring through consistent hashing, where contiguous areas of the keyring are stored on a given node. So if we have a value range from 1 to 100, by default each node will get 1/5 of the ring: A will cover [1,20), B [20,40), and so on.
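A toy Python sketch of that keyring idea (deliberately simplified; real Cassandra uses per-node token ranges and virtual nodes):

    import hashlib

    NODES = ["A", "B", "C", "D", "E"]

    def node_for_key(key: str) -> str:
        # Hash the key onto a 0-99 ring and map equal 1/5 slices to nodes A..E,
        # mirroring the [1,20), [20,40), ... ranges described above.
        position = int(hashlib.md5(key.encode()).hexdigest(), 16) % 100
        return NODES[position // 20]

    for k in ["user:1", "user:2", "user:3"]:
        print(k, "->", node_for_key(k))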
An important concept from Dynamo is the triple (R, W, N), which tells how many nodes have to acknowledge a read, acknowledge a write, and keep copies of a given value.
By default you have 3 (N) copies of your data, stored on the primary node for that key range and the two following nodes, which hold backups. If I remember the Dynamo paper correctly, your writes go by default to the first W nodes of your N copies; the other nodes are eventually updated through a gossip protocol.
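The usual quorum rule for the (R, W, N) triple is that read and write sets overlap, and therefore a read sees the latest write, whenever R + W > N. A tiny check:

    def quorum_overlaps(r: int, w: int, n: int) -> bool:
        # Read and write quorums must share at least one replica.
        return r + w > n

    print(quorum_overlaps(2, 2, 3))  # True  (classic quorum configuration)
    print(quorum_overlaps(1, 1, 3))  # False (reads may be eventually consistent)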
As long as everything is going fine, you'll get consistent results. If your primary node is down for some time, another node takes your data through a hinted handoff. Once the primary comes back, your data will be merged, or an attempt will be made to merge it (this part I can't really remember, but check the vector clocks which are used to track the update history).
So as long as not-too-big parts of your cluster go down, you'll have a consistent view of your data. If bigger parts of your cluster are down, or you read from only a small part of your copies, you may see inconsistencies, which will (or may) eventually become consistent.
Hope that helped. I can highly recommend reading the original papers about Amazon Dynamo and Google BigTable, but I think you're mostly interested in Amazon Dynamo. Additionally, this post from Werner Vogels may come in handy as well.
As for the shard sizes, I think they can vary depending on your machines and on how hot given areas of your keyring are.
Cassandra does not, typically, keep all data on all nodes. As you suggest, this would defeat some of the advantages offered by its distributed data model (in particular, fast writes would be hampered). The amount of replication desired (how many nodes should keep copies of your data) is customizable by the client at write time. As such, you can set it up to replicate across all nodes, or just keep your data on a single node with no replication; it's up to you. The specific node(s) to which the data gets written is determined by the hash value of the key. Each node is assigned a range of hash values it will store, so when you go to look up a value, the key is hashed again and that indicates which node to find the data on.