MongoDB scaling - mongodb

How far can MongoDB scale? I heard in a talk that 32-bit systems have only 2-4 GB of space available, or something like that. Can a single MongoDB database on one computer store 32 GB of data and support regular queries against that data?
How powerful is MongoDB in terms of size, and when (if at all) does sharding come into play? I'm looking for a database that can grow as large as the disk permits. It would be funny if MongoDB supported only 4 GB per database. I'm aiming for 200 GB of storage in 5 collections in 1 Mongo database on 1 computer running Mongo.

It's true that a single instance of MongoDB on a 32-bit system supports only about 2 GB of data. This is because the storage engine is built directly on top of memory-mapped files, which on a 32-bit system have a maximum addressable space of about 2 GB.
That said, very few companies, if any, actually run a production database on 32-bit hardware, so it's hardly ever an issue. On 64-bit builds the theoretical maximum storage is 2^63 bytes, which is obviously well beyond the size of any real-world dataset.
So, on a single 64-bit system you can very easily run 200 GB of data. Whether you want to in a production environment is another question. If you only run a single instance there's no real failover available. With journaling enabled and safe writes (w >= 1) you should be relatively fine, though.
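To put the numbers above side by side, here is a minimal Python sketch that compares a dataset size with the ceilings quoted in this answer. The helper name fits_in_build and the constants are made up for illustration; the real 32-bit limit varies and is somewhat below 2 GB, so treat this as back-of-the-envelope arithmetic only.

```python
# Hypothetical helper: compares a dataset size with the address-space
# ceilings quoted in the answer (about 2 GB on 32-bit, 2^63 bytes on 64-bit).
GIB = 1024 ** 3

LIMIT_BYTES = {
    32: 2 * GIB,  # figure quoted for 32-bit builds (the real limit is a bit lower)
    64: 2 ** 63,  # theoretical figure quoted for 64-bit builds
}


def fits_in_build(data_bytes: int, bits: int) -> bool:
    """Return True if data_bytes is below the quoted ceiling for that build."""
    return data_bytes < LIMIT_BYTES[bits]


if __name__ == "__main__":
    # Sizes mentioned in the question: 32 GB and 200 GB, plus a small one.
    for size_gib in (1, 32, 200):
        size = size_gib * GIB
        print(size_gib, "GiB:",
              "32-bit ok" if fits_in_build(size, 32) else "32-bit too large,",
              "64-bit ok" if fits_in_build(size, 64) else "64-bit too large")
```

Under these assumptions, both 32 GB and 200 GB are out of reach for a 32-bit build and trivially small for a 64-bit one.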

You can have a look at this document about sharding and scaling limits:
http://www.mongodb.org/display/DOCS/Sharding+Limits
Scale Limits
Goal is support of systems of up to 1,000 shards. Testing so far has
been limited to clusters with a modest number of shards (e.g., 100).

Related

High level of fragmentation with MongoDB 2.2.1

On a legacy system that is running MongoDB 2.2.1 we are running out of disk space due to excessively large database files. Our actual data size is just under 3 GB, with about 1.7 GB of indexes, but the storage size is over 70 GB. So the ratio of storage to data plus indexes is close to a factor of 15. There are about 40 data files, most of which are at the 2 GB maximum file size.
We are considering running compact() or repair() to regain some of the unused space, but we are worried about the problem recurring soon afterwards. It seems that the current configuration (pretty close to the default configuration) is not suitable for the database usage pattern of our application.
What other tools, diagnostics, remedies or configuration changes are available that could help MongoDB make better use of the disk space?
WiredTiger, used in MongoDB 3.0 and later, is much more efficient in terms of disk usage.
However, migrating from MongoDB 2.2 to 3.0 is going to be a huge leap.
Another option, assuming this is configured as a replica set, is to re-sync the secondary nodes individually and then perform a failover. This will have the same effect as performing a repair, without the downtime that would result from using the repairDatabase command.
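For reference, the "factor 15" in the question is simply storage size divided by data plus index size. A small Python sketch of that arithmetic, using the rounded figures from the question (the function names are made up for illustration):

```python
def storage_ratio(storage_gb: float, data_gb: float, index_gb: float) -> float:
    """Ratio of on-disk storage to the space the data and indexes actually need."""
    return storage_gb / (data_gb + index_gb)


def unused_space_gb(storage_gb: float, data_gb: float, index_gb: float) -> float:
    """Rough upper bound on the space a repair or re-sync could give back."""
    return storage_gb - (data_gb + index_gb)


if __name__ == "__main__":
    # Rounded figures from the question: 70 GB storage, 3 GB data, 1.7 GB indexes.
    print(round(storage_ratio(70, 3, 1.7), 1))
    print(round(unused_space_gb(70, 3, 1.7), 1))
```

Tracking this ratio after a repair or re-sync would show how quickly the fragmentation comes back.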

Memory usage remains 99 % even after create index is done on MongoDB Collection

We are building indexes on a MongoDB collection. We currently have nearly 350 GB of data in the database, and it is deployed as a Windows service on AWS EC2.
We are creating the indexes for some experimentation. But every time I run the indexing command the memory usage goes to 99%, and even after the indexing is done the memory usage stays there until I restart the service.
The instance has 30 GB of RAM and an SSD drive. Right now the DB is set up as a standalone (not sharded so far), and we are using the latest version of MongoDB.
Any feedback related to this will be helpful.
Thanks,
Arpan
That's normal behavior for MongoDB.
MongoDB grabs all the RAM it can get to cache each accessed document for as long as possible. When you add an index to a collection, each document needs to be read once to build the index, which causes MongoDB to load each document into RAM. It then keeps them in RAM in case you want to access them later. But MongoDB will not hog the RAM. When another process needs memory, MongoDB will willingly release it.
This is explained in the FAQ:
Does MongoDB require a lot of RAM?
Not necessarily. It’s certainly
possible to run MongoDB on a machine with a small amount of free RAM.
MongoDB automatically uses all free memory on the machine as its
cache. System resource monitors show that MongoDB uses a lot of
memory, but its usage is dynamic. If another process suddenly needs
half the server’s RAM, MongoDB will yield cached memory to the other
process.
Technically, the operating system’s virtual memory subsystem manages
MongoDB’s memory. This means that MongoDB will use as much free memory
as it can, swapping to disk as needed. Deployments with enough memory
to fit the application’s working data set in RAM will achieve the best
performance.
See also: FAQ: MongoDB Diagnostics for answers to additional questions
about MongoDB and Memory use.

When to start MongoDB sharding

At the moment we run a MongoDB replica set containing 2 servers + 1 arbiter.
We store about 150 GB of data in the databases on the replica set.
Right now we are thinking about when to start sharding, because we are wondering whether there is a point after which you can't start sharding anymore.
It is obvious that we would have to start sharding before we run out of hard disk space, our CPU is overloaded, or the overall performance goes down because of too little RAM.
Somebody also told me that there is a limit of 256 GB of data, after which you can't start sharding anymore. I read the official documentation http://docs.mongodb.org/manual/sharding/ and "MongoDB: The Definitive Guide", but I could not confirm that.
From your experience, is there a limit at which you should have started sharding?
I would start sharding when you hit about 60-70% resource utilisation. This could be both hard disk space and RAM. The 256 GB limit is indeed there, it's documented at http://docs.mongodb.org/manual/reference/limits/#Sharding%20Existing%20Collection%20Data%20Size
I have found the limit to be based on reads/writes; after all, sharding is about increasing capacity, mainly for writes, while replica sets are more concerned with reads. However, using separate servers (nodes) for ranges of data (by shard key) can help reads too, so it does have a knock-on effect for both.
For example, you could be using only 40% of your current server's memory with your current working set, but due to the number of writes being sent to that single server you could actually be seeing speed problems due to IO. At that point you would consider sharding.
So really I would personally say, and this question is heavily opinion based, that you should shard when you feel you need more capacity for operations than is cost-effective for a single replica set.
I have known single replica sets that can handle a load that would normally need an entire cluster, but it depends on how big your budget is. As a computer gets bigger it gets more expensive.
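The rules of thumb in this answer can be sketched as a simple check in Python. Everything here is illustrative: the class, field names and function are hypothetical, the threshold is the lower end of the 60-70% range suggested above, and the 256 GB figure is the one cited from the limits page. Write-induced IO pressure, which the answer treats as the real trigger, is not modelled.

```python
from dataclasses import dataclass
from typing import List

SHARD_EXISTING_LIMIT_GB = 256   # limit cited in the answer for sharding existing data
UTILISATION_THRESHOLD = 0.60    # lower end of the suggested 60-70% range


@dataclass
class ReplicaSetStats:
    """Hypothetical snapshot of a replica set's primary (all sizes in GB)."""
    data_gb: float
    disk_used_gb: float
    disk_total_gb: float
    ram_used_gb: float
    ram_total_gb: float


def sharding_reasons(stats: ReplicaSetStats) -> List[str]:
    """Return the rules of thumb that currently argue for sharding."""
    reasons = []
    if stats.disk_used_gb / stats.disk_total_gb >= UTILISATION_THRESHOLD:
        reasons.append("disk utilisation at or above threshold")
    if stats.ram_used_gb / stats.ram_total_gb >= UTILISATION_THRESHOLD:
        reasons.append("RAM utilisation at or above threshold")
    if stats.data_gb >= SHARD_EXISTING_LIMIT_GB:
        reasons.append("data size at or above the cited 256 GB limit")
    return reasons


if __name__ == "__main__":
    # 150 GB of data as in the question; disk and RAM sizes are made up.
    current = ReplicaSetStats(data_gb=150, disk_used_gb=180, disk_total_gb=500,
                              ram_used_gb=20, ram_total_gb=64)
    print(sharding_reasons(current) or "no rule of thumb triggered yet")
```

An empty result only means none of these coarse indicators has fired; it says nothing about whether writes are already saturating the disks.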

Can't map file memory-mongo requires 64 bit build for larger datasets

I have a sharded cluster in 3 systems.
While inserting I get the error message:
cant map file memory-mongo requires 64 bit build for larger datasets
I know that 32-bit machines have a size limit of 2 GB.
I have a few questions to ask.
The 2 GB limit is per system, and my sharding is spread over 3 systems. So is the total limit 2 GB or 6 GB?
Although sharding is set up properly, why is all the data stored on a single system instead of being distributed across all three shards?
Does sharding play any role in increasing the data size limit?
Does chunk size play any vital role in performance?
I would not recommend doing anything with 32-bit MongoDB beyond running it on a development machine where you perhaps cannot run 64-bit. Once you hit the limit the file becomes unusable.
The documentation states "Use 64 bit for production. This is important as if you hit the mmap size limit (exact limit varies but less than 2GB) you will be unable to write to the database (analogous to a disk full condition)."
Sharding is all about scaling out your data set across multiple nodes so in answer to your question, yes you have increased the possible size of your data set. Remember though that namespaces and indexes also take up space.
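As a rough illustration of the "2 GB or 6 GB" question: the cluster stops accepting writes as soon as the fullest shard hits its own limit, so the usable total depends on how evenly the data is spread. A minimal Python sketch under that assumption (the function name is hypothetical, and namespace and index overhead are ignored):

```python
import math
from typing import Sequence


def usable_capacity_gb(per_shard_limit_gb: float,
                       shard_fractions: Sequence[float]) -> float:
    """Total data the cluster can hold before its fullest shard hits the limit.

    shard_fractions gives the share of all data that lands on each shard
    and must sum to 1.
    """
    if not math.isclose(sum(shard_fractions), 1.0):
        raise ValueError("shard fractions must sum to 1")
    return per_shard_limit_gb / max(shard_fractions)


if __name__ == "__main__":
    even = usable_capacity_gb(2, [1 / 3, 1 / 3, 1 / 3])   # perfectly balanced
    skewed = usable_capacity_gb(2, [1.0, 0.0, 0.0])       # everything on one shard
    print(round(even, 2), round(skewed, 2))
```

With three balanced 32-bit shards the ceiling approaches 6 GB; if all chunks stay on one shard it is still 2 GB, which matches the error described in the question.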
You haven't specified where your mongos resides. Where are you seeing the error: from a mongod or from the mongos? I suspect that it's the mongod, which would seem to indicate that all your data is going to the one mongod. I believe that you need to look at pre-splitting the chunks - http://docs.mongodb.org/manual/administration/sharding/#splitting-chunks.
If you have a mongos, what does sh.status() return? Are chunks spread across all mongod's?
For testing, I'd recommend a chunk size of 1 MB. In production, it's best to stick with the default of 64 MB unless you have a really important reason not to and you really know what you are doing. If your chunk size is too small, you will be performing splits far too often.
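To get a feel for why a small chunk size means many more splits, here is a back-of-the-envelope estimate in Python. It assumes every chunk is completely full, which real clusters do not guarantee, and the function name is made up for illustration.

```python
import math


def estimated_chunks(data_mb: float, chunk_size_mb: float) -> int:
    """Lower-bound estimate of chunk count, assuming every chunk is full."""
    return math.ceil(data_mb / chunk_size_mb)


if __name__ == "__main__":
    data_mb = 6 * 1024  # 6 GB, the figure discussed in the question
    for chunk_size_mb in (1, 64):
        print(chunk_size_mb, "MB chunks:", estimated_chunks(data_mb, chunk_size_mb))
```

Each chunk beyond the first comes from a split, so under this assumption the 1 MB setting implies 64 times as many splits as the 64 MB default for the same data.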

MongoDB Sharding On One Machine

Does it make sense to implement MongoDB sharding with, say, 100 shards on one beefier machine just to achieve higher concurrent writes into the database? I am told there is a global lock for each mongod.exe process. Assuming that is possible, will that approach give me higher write concurrency?
Running multiple mongods on a machine is not a good idea. Every one of the mongod processes will try to use all the available memory, forcing other mongod's memory mapped pages out of memory. This will create an enormous amount of swapping in most cases.
The global database lock is generally not a problem as is demonstrated in: http://blog.pythonisito.com/2011/12/mongodbs-write-lock.html
Only use one mongod per machine (but it's fine to add a mongos or config server as well), unless it's for some simple testing.
cheers,
Derick
I totally disagree. We run 8 shards per box in our setup. It consists of two head nodes, each with two other machines for replication: 6 boxes in total. These are beefy boxes with about 120 GB of RAM, 32 cores and 2 TB each. By having 8 shards per box (we could go higher, by the way; this is set at 8 for historic reasons) we make sure we utilize the CPU efficiently. The RAM sorts itself out. You do have to watch the metrics and make sure you aren't paging too much, but with SSD drives (which we have) it isn't too bad if you do spill onto the disk drives.
The only use case I have found for running several mongod instances on the same server is to increase replication speed over a high-latency connection.
As highlighted by Derick, the write lock is not really your issue when running mongodb.
To answer your question: yes, you can demonstrate mongo scaling with several instances per machine (4 instances per server seems to be enough) if your test does not involve too much data (otherwise paging out will dramatically decrease your performance; I have already tested it).
However, instances will still compete for resources. All you will manage to do is to shift the database lock issue to a resource lock issue.
Yes, you can and in fact that's what we do for 50+ mil write-heavy database. Just make sure all your indexes per mongod fit into the RAM and there's room for growth and maintenance.
However, there's a small trade-off: depending on what your target QPS is, this kind of sharding requires machines with more horsepower, whereas sharding with one mongod per machine does not, and in most cases you can get away with cheaper commodity hardware.
Whatever the case, run a series of performance tests (against IO, network, QPS, etc.), establish your baseline carefully, and consider SSD drives for storage. This may sound biased, but XFS on Linux is also something to consider.
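The advice that all indexes per mongod should fit into RAM, with room for growth, can be turned into a quick sanity check. The Python sketch below is only an illustration: the function names and the 20% headroom are assumptions, not figures from this thread, and it splits RAM evenly between instances, which real mongod processes do not do.

```python
def ram_budget_per_mongod_gb(total_ram_gb: float, instances: int,
                             headroom: float = 0.2) -> float:
    """Even share of RAM per mongod, minus a reserve for growth and maintenance.

    headroom is an assumed fraction (20% by default), not a MongoDB setting.
    """
    return (total_ram_gb / instances) * (1 - headroom)


def indexes_fit(total_ram_gb: float, instances: int,
                index_gb_per_mongod: float, headroom: float = 0.2) -> bool:
    """True if each mongod's indexes fit inside its RAM budget."""
    budget = ram_budget_per_mongod_gb(total_ram_gb, instances, headroom)
    return index_gb_per_mongod <= budget


if __name__ == "__main__":
    # Box size and shard count from the answer above (120 GB RAM, 8 shards);
    # the index sizes are made-up examples.
    for index_gb in (10, 14):
        print(index_gb, "GB of indexes per mongod fits:",
              indexes_fit(120, 8, index_gb))
```

A check like this is no substitute for the performance tests recommended above, but it flags configurations that are certain to page.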