mongodb architect design with hardware specification - mongodb

Currently we have one replica set of 3 members with 25 GB of data. Normal CPU usage is 1.5 on both secondaries and 0.5 on the primary (reads happen on the secondary instances only), and normally about 1200 users hit our website. We now plan to increase traffic and expect about 5000 concurrent users. Can you please suggest how many instances we need to add to the replica set?
Current infra in our replica set:
1. Primary instance
CPUs: 16
RAM: 32 GB
HDD: 100 GB
2. Secondary instance
CPUs: 8
RAM: 16 GB
HDD: 100 GB
3. Secondary instance
CPUs: 8
RAM: 16 GB
HDD: 100 GB

Assuming your application scales linearly with the number of users (does it? Only you can tell; we don't know what your application does), the CPU capacity should not be a problem.
The question is: how much do you expect your data to grow? When you currently have 25 GB of data and 16 GB of ram, 64% of your data fits into RAM. That likely means that many queries can be served directly from the RAM cache without hitting the hard drives. These queries are usually very fast. But when your working set increases further beyond the size of your RAM, you might experience some increased latency when accessing the data which now needs to be read from the hard drives (it depends, though: when your application interacts primarily with recent data and rarely with older data, you might not even notice much of a difference).
The solution to this is obvious: get more RAM. Should this not be an option (for example because the server reached the maximum RAM capacity the hardware allows), your next option is building a sharded cluster where each shard is responsible for serving an interval of your data.
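To make the reasoning above concrete, here is a rough back-of-the-envelope sketch in Python. It assumes load grows linearly with the user count (which, as said, only you can verify), treats the "1.5" figure as a load value comparable to the CPU count, and ignores memory used by the OS and by mongod itself. The function names are made up for illustration.

```python
def projected_load(current_load: float, current_users: int, target_users: int) -> float:
    """Scale a load figure linearly with the user count (an assumption)."""
    return current_load * target_users / current_users


def ram_fit_ratio(ram_gb: float, data_gb: float) -> float:
    """Fraction of the data set that could fit in RAM, capped at 1.0."""
    return min(1.0, ram_gb / data_gb)


# Figures from the question: secondaries at 1.5 with 1200 users, 16 GB RAM, 25 GB data.
secondary_load = projected_load(1.5, 1200, 5000)
fit = ram_fit_ratio(16, 25)

print(f"projected secondary load: {secondary_load:.2f} (8 CPUs available)")
print(f"share of data that fits in RAM: {fit:.0%}")
```

If the projected load stays below the CPU count and the ratio stays close to 1.0 as the data grows, the existing members are probably fine; a shrinking ratio is the signal to add RAM or consider sharding.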

Related

How much disk space for MongoDB

I am about to set up MongoDB on AWS EC2 (Amazon Linux HVM 64-bit) and implement RAID 10.
I am expecting a couple of million records for a video-on-demand system.
I could not find any good advice on how much disk space I should use for that instance.
The dilemma is that I can't spend too much on EBS volumes right now, but if I have to add a new, bigger volume in less than a year and take the db offline to move the data to that new volume, that is a problem.
For the initial stage, I was thinking of 16 GB (available after RAID 10 implementation) on a t2.medium, with a plan to upgrade to m4.medium and add replica sets later.
Any thoughts on this?

The math is pretty simple:
Space required = bytes per record x number of records
If you have an average of 145 bytes per record with an expectation of 5 million records, you can work with 1 GB of storage.
EBS storage is pretty cheap. 1 GB of SSD is $0.10 per month in us-east-1. You could allocate 5 GB for only $0.50 per month.
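The storage arithmetic above can be sketched in a few lines of Python. This is only an estimate: it uses the example figures from this answer (145 bytes per record, 5 million records, $0.10 per GB-month) and does not account for indexes, padding, or journal files, which you would need to add on top. The function names are hypothetical.

```python
def required_storage_gb(bytes_per_record: int, record_count: int) -> float:
    """Raw data size in GB (10**9 bytes), excluding indexes and overhead."""
    return bytes_per_record * record_count / 10**9


def monthly_cost_usd(allocated_gb: float, price_per_gb_month: float = 0.10) -> float:
    """Monthly volume cost at a flat per-GB price (assumed, check current pricing)."""
    return allocated_gb * price_per_gb_month


needed_gb = required_storage_gb(145, 5_000_000)
cost_5gb = monthly_cost_usd(5)

print(f"raw data: {needed_gb:.3f} GB")
print(f"5 GB volume: ${cost_5gb:.2f} per month")
```

Measure your real average document size once you have sample data and rerun the numbers with some headroom.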
Also, RAID 10 is RAID 0 and RAID 1 combined. Read over this Server Fault question regarding RAID 0 and RAID 1 configurations with EBS:
https://serverfault.com/questions/253908/is-raid-1-overkill-on-amazon-ebs-drives-in-terms-of-reliability

MongoDB on Azure operational cost

We are running MongoDB on a virtual machine (A3) on Azure and are trying to estimate the running cost of MongoDB for the following scenario:
100,000 customers each insert/update around 2k of time series data every 5 minutes. We are using MongoDB on an A3 instance (4 cores) of Windows Server on Azure (which restricts us to 4 TB per shard).
Our estimated running cost comes out to approx. $34,000 per month, which includes MongoDB licensing, our MongoDB virtual machines, storage, backup storage and a worker role.
This is far too costly. We have some ideas to bring the cost down but need advice on them, as some of you may have already done this.
Two questions:
1- As of today, we estimate that we need 28 MongoDB instances (with the 4 TB limit). I have read that we can increase the disk size from 4 TB to 64 TB on a Linux VM or Windows Server 2012. This may reduce the number of shards needed. Is running MongoDB on a 64 TB disk per shard possible in Azure?
You may ask why 28 instances.
2- We calculate the number of shards required from the "number of inserts per core", which itself depends on the number of values inserted into MongoDB per message. Each value is 82 bytes. We did some load testing, and it comes out that we can only run 8000 inserts per second and each core can handle approx. 193 inserts per second, resulting in a need for 41 cores (which is way too high). Dividing 41 cores by 4 gives 11 A3 instances, which is another cost.
We are looking for help to see whether our calculation or our setup is wrong.
Any help will be appreciated.

Regarding question 1 (is running MongoDB on a 64 TB disk per shard possible in Azure?):
According to the Azure documentation, the maximum you can achieve is 16 TB: 16 data disks attached, at most 1 TB each. So technically the largest disk you can attach is 1 TB, but you can build a RAID 0 stripe across the 16 attached disks to get 16 TB of storage. That is the maximum amount of storage you can officially get.
According to the Azure documentation, the A3 size can have a maximum of 8 data disks, so a maximum of 8 TB. A4 can handle 16 disks. I would assume your bottleneck here is disk and not the number of cores, so I'm not convinced you need such a big cluster.
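For reference, the sizing arithmetic from the question and the disk limits quoted above can be checked with a short Python sketch. The numbers are the ones given in this thread (8000 inserts/s, 193 inserts/s per core, 4 cores per A3, 1 TB per data disk); the function names are made up for illustration. Note that rounding 8000 / 193 up gives 42 cores rather than 41, but the instance count comes out at 11 either way.

```python
import math


def cores_needed(target_inserts_per_sec: float, inserts_per_core_per_sec: float) -> int:
    """Cores required to sustain the target rate, rounded up."""
    return math.ceil(target_inserts_per_sec / inserts_per_core_per_sec)


def instances_needed(cores: int, cores_per_instance: int) -> int:
    """Whole instances required to provide the given number of cores."""
    return math.ceil(cores / cores_per_instance)


def max_striped_storage_tb(data_disks: int, tb_per_disk: float = 1.0) -> float:
    """Capacity of a RAID 0 stripe across all attachable data disks."""
    return data_disks * tb_per_disk


cores = cores_needed(8000, 193)
a3_instances = instances_needed(cores, 4)
a3_storage_tb = max_striped_storage_tb(8)
a4_storage_tb = max_striped_storage_tb(16)

print(f"cores: {cores}, A3 instances: {a3_instances}")
print(f"max storage per VM: A3 = {a3_storage_tb} TB, A4 = {a4_storage_tb} TB")
```

Whether the per-core insert rate is really CPU-bound or disk-bound is the thing to verify in the load test before trusting the core count.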

MongoDB - Forcefully keeping index + working set in RAM

I am currently using MongoDB to store a single collection of data. This data is 50 GB in size and has about 95 GB of indexes. I am running a machine with 256 GB of RAM. I basically want to have all the indexes and the working set in RAM, since the machine is exclusively allocated to mongo.
Currently I see that, although mongo is running with a collection size of 50 GB plus an index size of 95 GB, the total RAM being used on the machine is less than 20 GB.
Is there a way to force mongo to use the available RAM so that it keeps all its indexes and working set in memory?

When your mongod process starts it has none of your data in resident memory. Data is then paged in as it is accessed. Given your collection fits in memory (which is the case here) you can run the touch command on it. On Linux this will call the readahead system call to pull your data and indexes into the filesystem cache, making them available in memory to mongod. On Windows mongod will read the first byte of each page, pulling it in to memory.
If your collection+indexes do not fit into memory, only the tail end of data accessed during the touch will be available.
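As a rough sketch of the above in Python: first check that data plus indexes fit in RAM, then build the touch command document. The collection name "records" and database name "mydb" are placeholders, and the pymongo call is shown only as a comment because it needs a running server. As far as I know, touch is only supported by the MMAPv1 storage engine, so treat this as applying to that setup.

```python
def fits_in_ram(data_gb: float, index_gb: float, ram_gb: float) -> bool:
    """True if the collection data plus its indexes fit into physical RAM."""
    return data_gb + index_gb <= ram_gb


def build_touch_command(collection: str, data: bool = True, index: bool = True) -> dict:
    """Build the touch command document; the command name must be the first key."""
    return {"touch": collection, "data": data, "index": index}


# Figures from the question: 50 GB data, 95 GB indexes, 256 GB RAM.
cmd = build_touch_command("records")  # "records" is a placeholder collection name
if fits_in_ram(50, 95, 256):
    print("command to send to mongod:", cmd)
    # With pymongo installed and a reachable server, one way to send it:
    #   from pymongo import MongoClient
    #   MongoClient()["mydb"].command(cmd)
else:
    print("data + indexes exceed RAM; touch would only leave the tail end cached")
```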

Can't map file memory-mongo requires 64 bit build for larger datasets

I have a sharded cluster across 3 systems.
While inserting I get the error message:
cant map file memory-mongo requires 64 bit build for larger datasets
I know that a 32-bit machine has a size limit of 2 GB.
I have a few questions to ask.
The 2 GB limit is per system, and my sharding spans 3 systems. So would the total data limit be only 2 GB, or 6 GB?
Even though sharding is set up properly, why is all the data stored on a single system instead of being distributed across all three sharded systems?
Does sharding play any role in increasing the data size limit?
Does chunk size play any vital role in performance?

I would not recommend you do anything with 32-bit MongoDB beyond running it on a development machine where you perhaps cannot run 64-bit. Once you hit the limit the file becomes unusable.
The documentation states "Use 64 bit for production. This is important as if you hit the mmap size limit (exact limit varies but less than 2GB) you will be unable to write to the database (analogous to a disk full condition)."
Sharding is all about scaling out your data set across multiple nodes so in answer to your question, yes you have increased the possible size of your data set. Remember though that namespaces and indexes also take up space.
You haven't specified where your mongos resides. Where are you seeing the error from: a mongod or the mongos? I suspect that it's the mongod, which would seem to indicate that all your data is going to the one mongod. I believe that you need to look at pre-splitting the chunks - http://docs.mongodb.org/manual/administration/sharding/#splitting-chunks.
If you have a mongos, what does sh.status() return? Are chunks spread across all mongods?
For testing, I'd recommend a chunk size of 1 MB. In production, it's best to stick with the default of 64 MB unless you have a really important reason why you don't want the default and you really know what you are doing. If your chunk size is too small, you will be performing splits far too often.
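To get a feel for what the chunk size means in practice, here is a small Python estimate of how many chunks a data set breaks into. It assumes every chunk is completely full, which real clusters never achieve, so treat the result as a lower bound. The 6 GB figure is just three 32-bit shards at roughly 2 GB each, taken from the question; the function name is hypothetical.

```python
import math


def approx_chunk_count(data_mb: float, chunk_size_mb: float) -> int:
    """Lower-bound chunk count, assuming every chunk is filled to the limit."""
    return math.ceil(data_mb / chunk_size_mb)


data_mb = 3 * 2048  # three shards at ~2 GB each (upper bound from the question)
chunks_at_1mb = approx_chunk_count(data_mb, 1)
chunks_at_64mb = approx_chunk_count(data_mb, 64)

print(f"1 MB chunks:  at least {chunks_at_1mb}")
print(f"64 MB chunks: at least {chunks_at_64mb}")
```

With 1 MB chunks the same data needs dozens of times more splits and migrations, which is handy for seeing the balancer work in a test but costly in production.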

Cassandra storage vs memory sizing

I am considering developing an application with a Cassandra backend. I am hoping that I will be able to run each cassandra node on commodity hardware with the following specs:
Quad Core 2GHz i7 CPU
2x 750GB disk drives
16 GB installed RAM
Now, I have been reading online that the available disk-space for Cassandra should be double the amount that is stored on the disks, which would mean that each node (set up in a RAID-1 configuration) would be able to store 375 GB of data, which is acceptable.
My question is whether 16 GB of RAM is enough to efficiently serve 375 GB of data per node. The data in the application will also be fairly time-dependent, such that recent data will be the data most often read from the database. In fact, most of the data will be deleted after about 6 months.
Also, would I assign Cassandra a heap (-Xmx) close to 16 GB, or does Cassandra utilize off-heap memory?

You should not set the Cassandra heap to more than 8GB; bigger than that, and garbage collection will kill you with large pauses. Cassandra will use the buffer cache (like other applications) so the remaining memory isn't wasted.
16 GB of RAM will be enough to serve the data if your hot set all fits in RAM, or if your read rate can be served off disk. Disks can do about 100 random IO/s, so with your setup (two disks), if you need more than 200 reads / second you will need to make sure the data is in cache. Cassandra exports good cache statistics (cassandra-cli show keyspaces), so you should easily be able to tell how effective your cache is.
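The disk-versus-cache reasoning above can be turned into a quick Python estimate. It assumes roughly 100 random reads per second per disk (the figure used in this answer), that both RAID-1 mirrors can serve reads, and that every cache miss costs exactly one random disk read, which is a simplification. The 1000 reads/s target is a made-up example, as are the function names.

```python
def disk_read_capacity(disks: int, iops_per_disk: int = 100) -> int:
    """Random reads per second the disks can serve together."""
    return disks * iops_per_disk


def required_cache_hit_ratio(target_reads_per_sec: float, disk_iops: float) -> float:
    """Minimum share of reads that must come from cache to stay within disk capacity."""
    if target_reads_per_sec <= disk_iops:
        return 0.0
    return 1 - disk_iops / target_reads_per_sec


capacity = disk_read_capacity(2)                     # two mirrored disks
needed = required_cache_hit_ratio(1000, capacity)    # hypothetical 1000 reads/s target

print(f"disk read capacity: {capacity} reads/s")
print(f"cache hit ratio needed for 1000 reads/s: {needed:.0%}")
```

Compare the result against the hit rates Cassandra reports to see whether 16 GB is enough for your read pattern.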
Do bear in mind, with only two disks in RAID-1, you will not have a dedicated commit log. This could hamper write performance quite badly. You may want to consider turning off the commit log if it does affect performance, and forgo durable writes.
Although it is probably wise not to use a really huge heap with Cassandra, at my company we have used 10GB to 12GB heaps without any issues so far. Our servers typically have at least 48 GB of memory (RAM is cheap -- so why not :-)) and so we may try expanding the heap a bit more and see what happens.