Mixing Linux and Windows in a MongoDB replica set, and is equal hardware important for sharding? - mongodb

My first question is really simple: I just want to know if I can mix Linux and Windows MongoDB servers in my sharding/replica cluster, of course with the same version, like 2.2?
For my second question, can someone explain what happens if I have sharding servers with this hardware:
Server 1: high CPU, SSD disk
Server 2: normal CPU, SATA disk
Server 3: normal CPU, SSD disk
How will sharding work when the servers have different hardware?
Is it important to build clusters on machines with almost the same hardware?

There shouldn't be a problem mixing Linux and Windows servers, assuming (as you stated) that the MongoDB versions are equivalent.
Regarding differences in hardware, MongoDB tries to distribute data evenly across shards, and each server will use as much of its available resources as it can to give the best possible performance. When another process on a server requests some resources, MongoDB relinquishes them, which usually means swapping some data out to disk until more RAM becomes available again.
Because your servers will perform this, and other operations, at different speeds, the important question becomes "What happens when one server in a sharded cluster is running slowly?". According to the FAQ:
If a shard is responding slowly, mongos will merely wait for the shard to return results.
So in practice, parts of your sharded collections will be slower than others, but everything should work just fine. You would probably have a better experience if the hardware matched, but it doesn't have to.
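As an illustration, here is a minimal pymongo sketch (assuming a mongos on localhost:27017 and a hypothetical sharded collection mydb.mycoll) of how you could check how the balancer has spread chunks across your mixed-hardware shards, by reading the config database:

    from pymongo import MongoClient

    # Connect to the mongos router (hypothetical host/port).
    client = MongoClient("mongodb://localhost:27017")
    config = client["config"]

    # Count chunks per shard for one sharded namespace. Older
    # releases (such as 2.2) identify the collection via the
    # "ns" field in config.chunks; newer ones use a UUID.
    pipeline = [
        {"$match": {"ns": "mydb.mycoll"}},
        {"$group": {"_id": "$shard", "chunks": {"$sum": 1}}},
    ]
    for doc in config["chunks"].aggregate(pipeline):
        print(doc["_id"], doc["chunks"])

Roughly equal chunk counts per shard mean the data is balanced, even though the slower hardware will still answer its share of the queries more slowly.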

Related

Running MongoDB on heterogeneous cluster

We're currently trying to find a solution, where we can host tens of millions of images of various sizes on our website. We're evaluating Riak and MongoDB. Currently Riak looks very nice, but all the servers in the cluster should be homogeneous, because Riak treats each node equally.
It is a bit hard to find information about how MongoDB handles this, except for the fact that you can set a priority on the nodes. My questions are:
What is the consequence of creating a MongoDB cluster composed of machines with wildly varying specifications (cpu, diskspace, disk speed, amount and speed of memory, network speed)?
Can the MongoDB cluster detect and compensate for such differences automatically?
I'm also open for ideas for other solutions, where heterogeneous cluster would work nicely.
What is the consequence of creating a MongoDB cluster composed of machines with wildly varying specifications (cpu, diskspace, disk speed, amount and speed of memory, network speed)?
Each shard in MongoDB is its own mongod, an isolated MongoDB in itself.
It isn't tied to the performance of the others, so if you have a shard that requires more power, you can just upgrade its hardware and the job is done. MongoDB works well in environments like this.
Can the MongoDB cluster detect and compensate for such differences automatically?
It doesn't need to detect the differences, and it compensates automatically, without any extra work in the mongos or config servers.
In fact, you will find you can even run different MongoDB versions on different shards.
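For example, adding a shard backed by beefier hardware is no different from adding any other shard; a minimal pymongo sketch (all host and replica set names hypothetical):

    from pymongo import MongoClient

    # Connect to the mongos router (hypothetical address).
    client = MongoClient("mongodb://mongos.example.com:27017")

    # Each shard is its own replica set, and the hardware behind
    # each one can differ freely.
    client.admin.command("addShard", "rsFast/ssd-box.example.com:27018")
    client.admin.command("addShard", "rsSlow/sata-box.example.com:27018")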

When to start MongoDB sharding

At the moment we run a MongoDB replica set containing 2 servers + 1 arbiter.
And we store about 150 GB of data in the databases on the replica set.
Right now we are thinking about when to start sharding, because we are wondering if there is a point after which you can't start sharding anymore.
It is obvious that we would have to start sharding before we run out of hard disk space, our CPU is overloaded, or the overall performance goes down because of too little RAM.
Somebody also told me that there is a limit of 256 GB data size after which you can't start sharding anymore. I also read the official documentation http://docs.mongodb.org/manual/sharding/ and "MongoDB: The Definitive Guide", but I could not verify that.
From your experience, is there a limit by which you should have started sharding?
I would start sharding when you hit about 60-70% resource utilisation. This could be either hard disk space or RAM. The 256 GB limit is indeed there; it's documented at http://docs.mongodb.org/manual/reference/limits/#Sharding%20Existing%20Collection%20Data%20Size
I have found the limit to be based on reads/writes; after all, sharding is about increasing capacity, mainly for writes, while replica sets are more concerned with reads. However, using separate servers (nodes) for ranges of data (the shard key) can help reads too, so it does have a knock-on effect for both.
For example, you could be using only 40% of your current server's memory with your current working set, but due to the volume of writes being sent to that single server you could actually be seeing speed problems due to IO. At that point you would take sharding into account.
So really, I would personally say (and this question is heavily opinion-based) that you should shard when you feel you need more capacity for operations than is cost-effective for a single replica set.
I have known single replica set setups that can take what would normally require an entire cluster, but it depends on how big your budget is. As a computer gets bigger, it gets more expensive.
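When you do decide to shard, converting an existing collection (subject to the size limit above) is just a couple of admin commands; a minimal pymongo sketch, with the database, collection, and shard key all hypothetical:

    from pymongo import MongoClient

    # Connect to the mongos router (hypothetical address).
    client = MongoClient("mongodb://mongos.example.com:27017")

    # Enable sharding for the database, then shard the existing
    # collection on a chosen shard key (placeholder names).
    client.admin.command("enableSharding", "mydb")
    client.admin.command("shardCollection", "mydb.mycoll",
                         key={"userId": 1})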

MongoDB scaling

How much can MongoDB scale? I heard a talk saying that on 32-bit systems it has only 2-4 GB of space available, or something like that? Can it store 32 GB of data in a single Mongo database on one computer and support querying that 32 GB of data from that database using a regular query?
How powerful is MongoDB in terms of size anyway, and when/if does sharding come into play? I'm looking for a database as gigantic as the disk permits, using MongoDB. It would be funny if MongoDB only supported 4 GB per database. I'm looking towards 200 GB of storage in 5 collections in 1 Mongo database on 1 computer running Mongo.
It's true that a single instance of MongoDB on a 32-bit system supports up to 2 GB of data. This is because the storage engine is built directly on top of memory-mapped files, which have a maximum addressable space of 2 GB on 32-bit systems.
That said, I'd say very few, if any, companies will actually run a production database on 32-bit hardware, so it's hardly ever an issue. On 64-bit builds the theoretical maximum storage is 2^63 bytes, but that's obviously well beyond the size of any real-world dataset.
So, on a single 64-bit system you can very easily run 200 GB of data. Whether or not you want to in a production environment is another question. If you only run a single instance, there's no real fail-over available. With journaling enabled and safe writes (w >= 1) you should be relatively fine, though.
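As an illustration of "journaling enabled and safe writes", a minimal pymongo sketch (host and collection names hypothetical): w=1 waits for the primary to acknowledge the write, and j=True additionally waits for it to reach the journal:

    from pymongo import MongoClient, WriteConcern

    client = MongoClient("mongodb://localhost:27017")

    # Acknowledged, journaled writes on a hypothetical collection.
    coll = client["mydb"].get_collection(
        "mycoll", write_concern=WriteConcern(w=1, j=True))
    coll.insert_one({"status": "safely written"})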
You can have a look at this document about sharding and scaling limits:
http://www.mongodb.org/display/DOCS/Sharding+Limits
Scale Limits
Goal is support of systems of up to 1,000 shards. Testing so far has been limited to clusters with a modest number of shards (e.g., 100).

MongoDB Sharding On One Machine

Does it make sense to implement MongoDB sharding with, say, 100 shards on one beefier machine, just to achieve higher concurrent writes into the database? I am told there is a global lock for each mongod.exe process. Assuming that is possible, will that approach give me higher write concurrency?
Running multiple mongods on a machine is not a good idea. Every one of the mongod processes will try to use all the available memory, forcing the other mongods' memory-mapped pages out of memory. This will create an enormous amount of swapping in most cases.
The global database lock is generally not a problem as is demonstrated in: http://blog.pythonisito.com/2011/12/mongodbs-write-lock.html
Only use one mongod per machine (but it's fine to add a mongos or config server as well), unless it's for some simple testing.
cheers,
Derick
I totally disagree. We run 8 shards per box in our setup. It consists of two head nodes, each with two other machines for replication. 6 boxes total. These are beefy boxes with about 120 GB of RAM, 32 cores and 2 TB each. By having 8 shards per box (we could go higher, by the way; this is set at 8 for historic reasons) we make sure we utilise the CPU efficiently. The RAM sorts itself out. You do have to watch the metrics and make sure you aren't paging too much, but with SSD drives (which we have), even if you do spill onto the disk it isn't too bad.
The only use case where I have found running several mongods on the same server useful was to increase replication speed over high-latency connections.
As highlighted by Derick, the write lock is not really your issue when running MongoDB.
To answer your question: yes, you can demonstrate Mongo scaling with several instances per machine (4 instances per server seems to be enough), as long as your test does not involve too much data (otherwise page-outs will dramatically decrease your performance; I have already tested it).
However, the instances will still compete for resources. All you will manage to do is shift the database lock issue to a resource lock issue.
Yes, you can, and in fact that's what we do for a 50+ mil write-heavy database. Just make sure all your indexes per mongod fit into the RAM and there's room for growth and maintenance.
However, there's a small trade-off: depending on what your target QPS is, this kind of sharding requires a machine with more horsepower, whereas sharding across multiple machines does not, and in most cases you can get away with cheaper commodity hardware.
Whatever the case is, run a series of performance tests (against IO, network, QPS, etc.) and establish your baseline carefully. Consider SSD drives for storage, and, this may sound biased, but Linux with XFS storage is also something to consider.
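As a rough illustration of the "indexes fit into RAM" check mentioned above, a minimal pymongo sketch (database name hypothetical) that sums the index sizes reported by collStats:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")

    # Sum index sizes across all collections in one database and
    # compare the total against the RAM dedicated to this mongod.
    db = client["mydb"]
    total_index_bytes = sum(
        db.command("collStats", name).get("totalIndexSize", 0)
        for name in db.list_collection_names())
    print("total index size: %.1f MB" % (total_index_bytes / 1024.0 ** 2))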

Benefits of multiple memcached instances

Is there any difference between having four 0.5 GB memcached servers running or one 2 GB instance?
Does running multiple instances offer any benefits?
If one instance fails, you still get the advantages of using the cache. This is especially true if you are using consistent hashing, which directs the same data to the same instance, rather than spreading new reads/writes among the machines that are still up.
You may also elect to run servers on 32-bit operating systems, which cannot address more than around 3 GB of memory per process.
Check the FAQ: http://www.socialtext.net/memcached/ and http://www.danga.com/memcached/
High availability is nice, and memcached will automatically distribute your cache across the 4 servers. If one of those servers dies for some reason, you can handle that error by continuing as if the cache were blank, redirecting to a different server, or any sort of custom error handling you want. If your single 2 GB server dies, then your options are pretty limited.
The important thing to remember is that you do not have 4 copies of your cache, it is 1 cache, split amongst the 4 servers.
The only downside is that it's easier to run out of memory with 4x 0.5 GB than with 1x 2 GB.
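To make the "1 cache, split amongst the 4 servers" point concrete, here is a minimal Python sketch of how a client maps each key to exactly one server (a plain hash-mod scheme for simplicity; consistent hashing, as mentioned above, is a more sophisticated variant that minimises remapping when a server drops out):

    import hashlib

    # Hypothetical pool of four 0.5 GB memcached instances.
    SERVERS = ["cache1:11211", "cache2:11211",
               "cache3:11211", "cache4:11211"]

    def server_for(key):
        """Each key lives on exactly one server: the cache is
        partitioned, not replicated, across the pool."""
        digest = hashlib.md5(key.encode()).hexdigest()
        return SERVERS[int(digest, 16) % len(SERVERS)]

    print(server_for("user:42"))   # always the same server for this key
    print(server_for("user:43"))   # possibly a different server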
I would also add that, theoretically, in the case of several machines it might gain you some performance: if you have a lot of frontends doing a lot of heavy reads, it's much better to split them across different machines, since the network capacity and processing power of one machine can become an upper bound for you.
This advantage is highly dependent on memcached utilization, however (sometimes it might be much faster to fetch everything from one machine).