MongoDB - Cluster of small/many or few/big nodes - mongodb

Can it be said what is best in a general case (where the database size is really big): To have a MongoDB cluster consisting of a larger number of smaller blade servers, or a few, really fat, servers?
Given is that the shard key has a quite fine granularity, so splitting should not be a problem.
If there are no "golden bullet", what is the pros and cons with either setup?

Best in what aspect? From a financial point of view, I'd go for lots of cheap hardware :)
MongoDB has been built to easily scale across nodes, so why not take advantage of this? The reason you'd want just one or a few beefy servers for a SQL server is to minimize the spread of relational data across physical nodes. But since MongoDB uses documents, most of your related data is stored in a single document. This means that it's all stored at the same physical location and you don't have to do costly lookups on other nodes to reconstruct the 'complete picture' of your data.
Another thing to keep in mind is that map-reduce jobs can only run in parallel in a sharded environment. So if you plan to do a lot of map-reducing, more shards/servers will result in better performance.
What if your database outgrows your beefy servers? Are you going to invest in another beefy server that handles that small amount of extra growth? Or what if one of them crashes? With smaller and cheaper servers, you can scale up (or down) more gradually if the need arises. Also, the impact of a server crash is much smaller, as it will affect only a small portion of your data.
To summarize: a large cluster of smaller servers isn't a silver bullet, as managing such a cluster has its own challenges, but it is significantly cheaper and possibly faster as well if you're doing map-reduce.

Related

Mirror Production Mongo Data for Analytics

I have a Mongo cluster that backs an application that I use in production. It's very important to my business and clustered across a number of boxes to optimize for speed and redundancy. I'd like to make the data in said cluster available for running analytical queries and enqueued tasks, but I definitely don't want these to harm production performance. Is it possible to just mirror all of my data against a single box I throw into the cluster with some special tag that I can then use for analytics? It's fine if it's slow. I just want it to be cheap and not to affect production read/write speeds.
Since you're talking about redundancy, I assume you have a replica set.
In that case you can use a hidden replica set member to perform the calculations you need.
Just keep in mind that the member count must be odd. If you add a node you might need to also add an arbiter. Or maybe you can just hide one of the already existing members.
If you are looking for a way to increase querying speed having a lot of data, you have to look might look into sharding with mongodb. Basically what it does is dividing your big amount of data into small shards and stores them on different machines.
If you are looking to increase redundancy (in order to make backup or to be able to do offline processing without touching primary servers) you have to look into replication with mongodb. If you are doing replication, keep in mind that the data on the replicas will be always lagging behind a primary (nothing to worry about, but just need to know this fact to decide can you allow read from the replicas). As it was pointed by Rafa, hidden replica sets are well suited for backup and offline data processing. They will still be able to get all the data from primary (with small lag), but are invisible to secondary reads and can not become primary.
There is a nice mongodb course which is talking in depth about replication and sharding, so may be it is worth listening and trying it.

MongoDB - how to best achieve active/active configuration?

I have an application which is very low on writes. I'm therefore interested in deploying a mongo installation which maximizes the read throughput for the hardware I have (3 database servers in one location). I don't really care for redundancy (backups), but would like automatic failover. Additionally, I'm fine with "eventual consistency", and don't mind if data which isn't the latest data is returned.
I've looked into both sharding and replica sets, and as far as I can tell, I don't really need to use sharding as its benefits suit more for applications with many writes.
I therefore went ahead and installed a replica set on the three servers I have, and I then set the reading preference to "Nearest", as that would allow reads to take place on any server.
The problem is, I later read that the client is "sticky" and basically once it has chosen a "nearest" mongo server, it's not likely to change it. Besides, even if it were to "check for nearest" again, it'll probably choose the same one over. This pretty much results in an active/passive configuration, without any load-balancing. I do have two application servers, so if they choose different mongo servers, it might work ok, but say I wanted to have more than 3 mongo servers in the replica set, then any servers besides specific two would be passive.
Basically my question is, what's the best way to have an active/active configuration for my deployment? All I want is for requests to go to free mongo servers rather than busy ones.
One way to force this which I thought of is to create three sharded-clusters (each server participating in all three), where each server is the primary in one of these clusters - but this is still not optimal, because besides the relative complexity involved in this configuration, this also doesn't guarantee complete load balancing (for example, in case all requests at a given moment happen to go to one specific shard).
What's the right way to achieve what I want? If it's not possible to achieve this kind of load balancing with mongo, would you recommend that I go with the sharded-clusters solution?
As you already suspected, scaling reads is not a "one size fits all" problem. Everything will depend on your data, your access patterns, your requirements and probably a few other things only you can determine.
In a nutshell, the main thing to consider is why a single server can't handle your read load. If it's because of the size of your data set and the size of your indexes then sharding your data across three shards will reduce the RAM requirements of each of them (or to put it another way will give you the combined RAM of all three systems). As long as you pick a good shard key (one that will distribute the load approximately evenly across all the systems) you will get almost three times the throughput on targeted queries.
If the main requirement for your reads is to reduce as much as possible the latency of reading the data, then a replica set can serve your purposes well as reading from the "nearest" node will reduce the network round-trip time without changing the duration of the operation on the MongoDB server. This assumes that your writes are infrequent enough or that your application has tolerance of possibly stale data.

How many requests can mongodb handle before sharding is necessary?

Does anybody know (from personal experience or official documentation) how many concurrent requests can a single MongoDb server handle before sharding is advised?
If your working set exceeds the RAM you can afford for a single server, or your disk I/O requirements exceed what you can provide on a single server, or (less likely) your CPU requirements exceed what you can get on one server, then you'll need to shard. All these depend tremendously on your specific workload. See http://docs.mongodb.org/manual/faq/storage/#what-is-the-working-set
One factor is hardware. Although for this you have replica sets. They reduce the load from the master server by answering read-only queries with replicated data. Another option would be memcaching for very frequent and repetitive queries, which would be even faster.
A factor for whether sharding is necessary is the data size & variation. When you have a wide range of varying data you need to access, which would render a server's cache uneffective by distributing the access to the data to the wide range, then you would consider using sharding. Off-loading work is merely a side-effect of this.

Does auto-sharding in MongoDB work on shards with many small collections/small databases

In the MongoDB documentation for auto-sharding it says: "Sharding is performed on a per-collection basis. Small collections need not be sharded."
Our business has many databases (~100), with many small collections (~30), each with a document count of 1 - 3000. Our DB system is looking at approximately 100,000,000 page views per month.
In that scenario will sharding ever activate since the collections are never big enough even though the DB usage and site traffic is certainly high enough to require load balancing. From the docs I can't seem to find a clear answer.
Whether it makes sense to shard depends a little bit on whether you have mostly writes or reads to the database. Sharding is primarily used for write-scaling, but if you are not doing a lot of writes, then simply using replicasets with "slaveOkay" for the reads might work just as well.
From the numbers that you provided you seem to get about 9 million documents, but are they large documents? If they easily fit in memory, then there is most likely not even going to be a need for replicasets besides for failover capabilities.
This is hard to answer without knowing more about your use case, but I'll give it a shot.
Are you sure sharding is what you need? What does your insert rate look like?
If you are going to have a static set of data, or even a relatively static set, then you probably don't need to shard, you could simply use more secondaries and enable slaveOK reads. The reads will be distributed to the various secondaries and scale up your read capacity.
If that is not the case, and you do need to shard, then there are options. But first, to explain briefly and at a high level how automatic sharding works:
The mongos process is responsible for splitting and migrating chunks in general. These are two separate operations - splitting and balancing.
Splits occur when the mongos sees that a certain portion of the
maximum chunk size has been written, it initiates a split if there is
in fact enough data to warrant it. Over time, with enough data
written, the number of chunks grows.
Balancing occurs when there is an imbalance of chunks (currently 8 in
2.0, though moving to a more dynamic heuristic in 2.2). The balancer migrates the chunks around the shards until a balance is achieved.
So, you need to be writing enough data relative to the max chunk size (default is 64MB in 2.0) to generate the chunks needed for the balancer to move them around appropriately. If that is not going to happen with your data, then you can look at:
Decreasing the chunk size (has drawbacks too - http://www.mongodb.org/display/DOCS/Sharding+Administration#ShardingAdministration-ChunkSizeConsiderations)
Manually split/move the chunks
For the manual instructions see:
http://www.mongodb.org/display/DOCS/Splitting+Shard+Chunks
http://www.mongodb.org/display/DOCS/Moving+Chunks

MongoDB Sharding On One Machine

Does it make sense to implement mongodb sharding with say 100 shards on one beefier machine just to achieve higher concurrenct write into the database as I am told, there is a global lock for each monogod.exe process? Assuming that is possible, will that aproach give me higher write concurrency?
Running multiple mongods on a machine is not a good idea. Every one of the mongod processes will try to use all the available memory, forcing other mongod's memory mapped pages out of memory. This will create an enormous amount of swapping in most cases.
The global database lock is generally not a problem as is demonstrated in: http://blog.pythonisito.com/2011/12/mongodbs-write-lock.html
Only use one mongod per machine (but it's fine to add a mongos or config server as well), unless it's for some simple testing.
cheers,
Derick
I totally disagree. We run 8 shards per box in our setup. It consists of two head nodes each with two other machines for replication. 6 boxes total. These are beefy boxes with about 120GB of RAM, 32 Cores and 2TB each. By having 8 shards per box (we could go higher by the way this is set at 8 for historic purposes) we make sure we utilize the CPU efficiently. The RAM sorts itself out. You do have to watch the metrics and make sure you aren't paging too much but with SSD drives (which we have) if you do spill onto the disk drives it isn't too bad.
The only use case where I found running several mongod on the same server was to increase replication speed on high latency connection.
As highlighted by Derick, the write lock is not really your issue when running mongodb.
To answer your question : yes you can demonstrate mongo scaling with several instance per machine (4 instances per server sems to be enough) if your test does not involve too much data (otherwise page out will dramatically decrase your performance, I have already tested it)
However, instances will still compete for resources. All you will manage to do is to shift the database lock issue to a resource lock issue.
Yes, you can and in fact that's what we do for 50+ mil write-heavy database. Just make sure all your indexes per mongod fit into the RAM and there's room for growth and maintenance.
However, there's a small trade-off: Depending on what your target QPS is, this kind of sharing requires machines with more horsepower, whereas sharding on a single machine will not and in most cases you can do away with commodity, cheaper hardware.
Whatever the case is, do the series of performance tests (ageinst IO, Network, PQS etc) and establish your baseline carefully and consider SSD drives for storage and this may sound biased, but Linux XFS storage is also something to consider.