Does auto-sharding in MongoDB work on shards with many small collections/small databases - mongodb

In the MongoDB documentation for auto-sharding it says: "Sharding is performed on a per-collection basis. Small collections need not be sharded."
Our business has many databases (~100), with many small collections (~30), each with a document count of 1 - 3000. Our DB system is looking at approximately 100,000,000 page views per month.
In that scenario will sharding ever activate since the collections are never big enough even though the DB usage and site traffic is certainly high enough to require load balancing. From the docs I can't seem to find a clear answer.

Whether it makes sense to shard depends a little bit on whether you have mostly writes or reads to the database. Sharding is primarily used for write-scaling, but if you are not doing a lot of writes, then simply using replicasets with "slaveOkay" for the reads might work just as well.
From the numbers that you provided you seem to get about 9 million documents, but are they large documents? If they easily fit in memory, then there is most likely not even going to be a need for replicasets besides for failover capabilities.

This is hard to answer without knowing more about your use case, but I'll give it a shot.
Are you sure sharding is what you need? What does your insert rate look like?
If you are going to have a static set of data, or even a relatively static set, then you probably don't need to shard, you could simply use more secondaries and enable slaveOK reads. The reads will be distributed to the various secondaries and scale up your read capacity.
If that is not the case, and you do need to shard, then there are options. But first, to explain briefly and at a high level how automatic sharding works:
The mongos process is responsible for splitting and migrating chunks in general. These are two separate operations - splitting and balancing.
Splits occur when the mongos sees that a certain portion of the
maximum chunk size has been written, it initiates a split if there is
in fact enough data to warrant it. Over time, with enough data
written, the number of chunks grows.
Balancing occurs when there is an imbalance of chunks (currently 8 in
2.0, though moving to a more dynamic heuristic in 2.2). The balancer migrates the chunks around the shards until a balance is achieved.
So, you need to be writing enough data relative to the max chunk size (default is 64MB in 2.0) to generate the chunks needed for the balancer to move them around appropriately. If that is not going to happen with your data, then you can look at:
Decreasing the chunk size (has drawbacks too - http://www.mongodb.org/display/DOCS/Sharding+Administration#ShardingAdministration-ChunkSizeConsiderations)
Manually split/move the chunks
For the manual instructions see:
http://www.mongodb.org/display/DOCS/Splitting+Shard+Chunks
http://www.mongodb.org/display/DOCS/Moving+Chunks

Related

When to start MongoDB sharding

At the moment we run a MongoDB Replicaset containing 2 Servers + 1 Arbiter.
And we store about 150 GB of data in the databases on the replicaset.
Right now we are thinking about when to start with sharding. Because we are wondering if there is a point where you can't start sharding anymore.
It is obvious that we would have to start sharding before we run out of hard disk space, our cpu is overloaded or the overall performance goes down because of too little RAM.
Somebody also told me that there is a limit of 256 GB data size after which you can't start sharding anymore. Also I read the official documentation http://docs.mongodb.org/manual/sharding/ and "MongoDB the definitive guide", I could not proove that.
From your experience is there a limit where you should have started with sharding ?
I would start sharding when you hit about 60-70% resource utilisation. This could be both hard disk space and RAM. The 256 GB limit is indeed there, it's documented at http://docs.mongodb.org/manual/reference/limits/#Sharding%20Existing%20Collection%20Data%20Size
I have found the limit to be based on reads/writes; afterall sharding is about increasing capacity, mainly writes, while replica sets being more concerned with reads. However, using separate servers (nodes) for ranges of data (shard key) can help reads too so it does have a knock on effect for both.
For example you could be only using 40% of your current servers memory with your current working set but due to the amount of writes being sent to that single server you could actually be seeing speed problems due to IO. At this time you would take sharding into account.
So really I would personally say, and this question is heavily opinion based, that you should shard when you feel as though you need more capacity for operations than is cost effective for a single replica set.
I have known of single replica setups that can take what, normally, an entire cluster would but it depends on how big your budget is. As a computer gets bigger it gets more expensive.

MongoDB - how to best achieve active/active configuration?

I have an application which is very low on writes. I'm therefore interested in deploying a mongo installation which maximizes the read throughput for the hardware I have (3 database servers in one location). I don't really care for redundancy (backups), but would like automatic failover. Additionally, I'm fine with "eventual consistency", and don't mind if data which isn't the latest data is returned.
I've looked into both sharding and replica sets, and as far as I can tell, I don't really need to use sharding as its benefits suit more for applications with many writes.
I therefore went ahead and installed a replica set on the three servers I have, and I then set the reading preference to "Nearest", as that would allow reads to take place on any server.
The problem is, I later read that the client is "sticky" and basically once it has chosen a "nearest" mongo server, it's not likely to change it. Besides, even if it were to "check for nearest" again, it'll probably choose the same one over. This pretty much results in an active/passive configuration, without any load-balancing. I do have two application servers, so if they choose different mongo servers, it might work ok, but say I wanted to have more than 3 mongo servers in the replica set, then any servers besides specific two would be passive.
Basically my question is, what's the best way to have an active/active configuration for my deployment? All I want is for requests to go to free mongo servers rather than busy ones.
One way to force this which I thought of is to create three sharded-clusters (each server participating in all three), where each server is the primary in one of these clusters - but this is still not optimal, because besides the relative complexity involved in this configuration, this also doesn't guarantee complete load balancing (for example, in case all requests at a given moment happen to go to one specific shard).
What's the right way to achieve what I want? If it's not possible to achieve this kind of load balancing with mongo, would you recommend that I go with the sharded-clusters solution?
As you already suspected, scaling reads is not a "one size fits all" problem. Everything will depend on your data, your access patterns, your requirements and probably a few other things only you can determine.
In a nutshell, the main thing to consider is why a single server can't handle your read load. If it's because of the size of your data set and the size of your indexes then sharding your data across three shards will reduce the RAM requirements of each of them (or to put it another way will give you the combined RAM of all three systems). As long as you pick a good shard key (one that will distribute the load approximately evenly across all the systems) you will get almost three times the throughput on targeted queries.
If the main requirement for your reads is to reduce as much as possible the latency of reading the data, then a replica set can serve your purposes well as reading from the "nearest" node will reduce the network round-trip time without changing the duration of the operation on the MongoDB server. This assumes that your writes are infrequent enough or that your application has tolerance of possibly stale data.

How many requests can mongodb handle before sharding is necessary?

Does anybody know (from personal experience or official documentation) how many concurrent requests can a single MongoDb server handle before sharding is advised?
If your working set exceeds the RAM you can afford for a single server, or your disk I/O requirements exceed what you can provide on a single server, or (less likely) your CPU requirements exceed what you can get on one server, then you'll need to shard. All these depend tremendously on your specific workload. See http://docs.mongodb.org/manual/faq/storage/#what-is-the-working-set
One factor is hardware. Although for this you have replica sets. They reduce the load from the master server by answering read-only queries with replicated data. Another option would be memcaching for very frequent and repetitive queries, which would be even faster.
A factor for whether sharding is necessary is the data size & variation. When you have a wide range of varying data you need to access, which would render a server's cache uneffective by distributing the access to the data to the wide range, then you would consider using sharding. Off-loading work is merely a side-effect of this.

Adding a new secondary in MongoDB to Distribute Load

I have two shards on three machines (using mongodb 1.8.2):
nodeI including: shard1(primary) and shard2(primary)
nodeII including: shard1(secondary) and shard2(secondary)
nodeIII including: shard1(arbiter) and shard2 (arbiter)
NodeII load is getting very high(CPU and IO), and NodeI is high as well, but a little better than nodeII.
In my java client I designated code to only query NodeII, while NodeI is just used for writing.
I am planning to convert nodeIII from arbiter to secondary to share the read load on NodeII.
Do you think this is a good idea and if I do this, what should I consider, or do you have other suggestions to lower the load?
As long as the arbiter hardware has similar specifications to your secondary, the approach you are suggesting seems reasonable as it will distribute the secondary reads. Usually arbiters have very low hardware specs or are on shared hardware, but I am assuming that this is not the case in your configuration.
If you have an odd number of servers in the replica set you will no longer need an arbiter.
You may want to look into Read Preference here, in particular you might be interested in specifying tag sets to select a secondary.
Reading from a secondary does not necessarily "distribute" the load as you might expect. Without getting to the root of your performance problems, you may just be setting up for more challenges.
In particular, adding a secondary to your existing servers will:
increase the I/O load on the server where you add the secondary (you are now replicating & writing a full extra copy of the data)
provide more contention for reading from the server the secondary is syncing from
potentially cause that secondary to lag behind the primary during heavy read activity (which may be of concern if you are expecting strong consistency).
You should also consider what happens in the case of failure. If your servers are struggling under the current load, things will probably dramatically melt down if any one of your physical servers has problems and all the traffic ends up hitting a single server.
Ideally you should run mongostat or similar monitoring tools to get a better understanding of the performance characteristics of your servers and what might be contributing to the load (memory pressure, lock %, I/O, network, ..). It would be helpful if you could post a sampling of mongostat output to PasteBin or similar.
You should also review your common queries with explain() to understand index usage, and check if they require access to all shards or are being directed to a specific one.
If all 3 servers are the same hardware spec, as a short term improvement I would consider:
Removing the arbiters and replace them with secondary nodes. This will provide extra data redundancy in the event one of your servers fails and help prevent all of the load from landing on one server.
Stepping down the primary on NodeI, so that NodeI and NodeII each have a primary and secondary (rather than the two primaries on NodeI and two secondaries on NodeII). The primary and secondary servers have different write characteristics so this may balance the load better.
Checking your shard key(s) and common queries to confirm they will reasonably balance reads and writes. Potential problems including a "hot spot" where all writes to a collection hit a single shard .. or queries which hit all shards to get a result.
Testing the change in performance if you don't read from the secondaries. It may seem counter-intuitive, but reading from secondaries may actually be causing you other issues depending on the nature of your queries.
Lastly, you mention using 1.8.2. There are significant performance and locking/yielding improvements in MongoDB 2.0 and 2.2, as well as other bug fixes. It would be worth testing an upgrade in your development environment as this may address some of your issues.

MongoDB - Cluster of small/many or few/big nodes

Can it be said what is best in a general case (where the database size is really big): To have a MongoDB cluster consisting of a larger number of smaller blade servers, or a few, really fat, servers?
Given is that the shard key has a quite fine granularity, so splitting should not be a problem.
If there are no "golden bullet", what is the pros and cons with either setup?
Best in what aspect? From a financial point of view, I'd go for lots of cheap hardware :)
MongoDB has been built to easily scale across nodes, so why not take advantage of this? The reason you'd want just one or a few beefy servers for a SQL server is to minimize the spread of relational data across physical nodes. But since MongoDB uses documents, most of your related data is stored in a single document. This means that it's all stored at the same physical location and you don't have to do costly lookups on other nodes to reconstruct the 'complete picture' of your data.
Another thing to keep in mind is that map-reduce jobs can only run in parallel in a sharded environment. So if you plan to do a lot of map-reducing, more shards/servers will result in better performance.
What if your database outgrows your beefy servers? Are you going to invest in another beefy server that handles that small amount of extra growth? Or what if one of them crashes? With smaller and cheaper servers, you can scale up (or down) more gradually if the need arises. Also, the impact of a server crash is much smaller, as it will affect only a small portion of your data.
To summarize: a large cluster of smaller servers isn't a silver bullet, as managing such a cluster has its own challenges, but it is significantly cheaper and possibly faster as well if you're doing map-reduce.