MongoDB indices and scaling

Reading the MongoDB documentation for indexes, I was left a little mystified and unsettled by this assertion, found at http://docs.mongodb.org/manual/applications/indexes/#ensure-indexes-fit-ram:
If you have and use multiple collections, you must consider the size
of all indexes on all collections. The indexes and the working set
must be able to fit in RAM at the same time.
So, how is this supposed to scale when new nodes are added to the cluster? Suppose all my 576 nodes are bounded at 8 GB, and I have 12 collections of 4 GB each (including their associated indices) and 3 collections of 16 GB (including indices). How does sharding spread the work between nodes so that the 12 collections can be queried efficiently?

When you shard, you spread the data across different shards, and the mongos process routes each query to the shards it needs to get data from. As such, you only need to consider the data a single shard is holding. To quote from When to Use Sharding:
You should consider deploying a sharded cluster if:
- your data set approaches or exceeds the storage capacity of a single node in your system, or
- the size of your system's active working set will soon exceed the capacity of the maximum amount of RAM for your system.
Also note that the working set != whole collection. The working set is defined as:
The collection of data that MongoDB uses regularly. This data is typically (or preferably) held in RAM.
E.g., you have 1 TB of data but typically only 50 GB of it is used/queried. That subset is preferably held in RAM.
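To make the per-shard arithmetic concrete, here is a minimal mongosh sketch, with made-up database, collection, and field names: sharding on a hashed key spreads the documents across shards, and each shard then only has to keep its own slice of the indexes in RAM.

```js
// Run against mongos. "app", "events", and "userId" are hypothetical names.
sh.enableSharding("app");
sh.shardCollection("app.events", { userId: "hashed" });

// Connected directly to one shard, check how much index data that
// shard actually has to keep hot:
const app = db.getSiblingDB("app");
print("index bytes on this shard: " + app.events.totalIndexSize());
```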

Related

MongoDB sharding storage usage

I am reading about sharding in MongoDB. After understanding how it works, I have a very basic question regarding the storage space used by it.
Suppose I have a server with 1 GB of storage. Assuming my data will grow beyond 1 GB, that won't be sufficient for my purpose, so I add one more server and shard Mongo.
So now, let's say I have 2 servers in the cluster with 1 GB of storage each. If I shard, both of these servers will be used to distribute Mongo data, so in total I should have 2 GB of storage available for Mongo. But the official sharding documentation mentions that shards are replica sets. If that is so, wouldn't adding the 1 GB server just mean that I still have only 1 GB of storage (as before) for actual MongoDB data, with the remaining 1 GB being just replicated data?
If my understanding is correct, is there any way to avoid creating a replica set? Can we use the full 2 GB of storage from both servers, like a logical volume?
Otherwise, if my understanding is wrong, what is actually the case?
The MongoDB sharding documentation says: "Sharding distributes data across the shards in the cluster, allowing each shard to contain a subset of the total cluster data. As the data set grows, additional shards increase the storage capacity of the cluster." See https://docs.mongodb.com/manual/sharding/ (storage capacity).
Since each shard holds a subset, the two servers contain different sets of data. So a replica set can serve multiple uses (shards storing subsets of the data, or replicas kept as a backup) depending on how you deploy it.
Sharding happens one level above replication.
When you use both sharding and replication, your cluster consists of many replica-sets and one replica-set consists of many mongod instances.
However, it is also possible to create a cluster of stand-alone mongod instances that are not replicated, or to have only some shards implemented as replica sets and the rest as stand-alone mongod instances.
When the documentation says "shards are replica sets", it means each shard is its own replica set, not that the shards together form one replica set.
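As a hedged mongosh sketch of that layering (set names and hosts are made up), adding two shards that are each their own three-member replica set looks like this:

```js
// Run against mongos. Each shard is addressed as
// "<replicaSetName>/<seed hosts>", i.e. a replica set of its own.
sh.addShard("shardA/a1.example.net:27018,a2.example.net:27018,a3.example.net:27018");
sh.addShard("shardB/b1.example.net:27018,b2.example.net:27018,b3.example.net:27018");
sh.status(); // each shard now holds a different subset of the data
```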
I hope this helps.

Choosing between MongoDB and ElasticSearch - Scaling/Sharding

I'm currently deciding between MongoDB and Elasticsearch as the backend for a logging and analytics platform. I plan to use a cluster of 5 Intel Xeon quad-core servers, each with 64 GB RAM and a 500 GB NVMe drive. With one replica, it should support 1 TB+ of data, I'm guessing.
From what I've read about Elasticsearch, the recommended setup for the above servers would be 5-10 shards, but the shard count cannot be increased later without a huge migration. So maybe I can add 5 more servers/nodes to the cluster for the same index, but not 10 or 20, because I can't create more shards to spread across the new nodes/servers. Correct?
MongoDB appears to automatically manage sharding based on a key value and redistribute those shards as more nodes get added. So does that mean that I can add 50 more servers to the cluster in the future and MongoDB will happily spread the data from this one index across all the servers?
I basically only need 1 TB of storage right now, but I don't want to paint myself into a corner should this one dataset end up growing to 100 TB.
Without starting Elasticsearch with 100 shards at the beginning, which seems inefficient and bad practice, how can it scale past 5/10 servers for this single dataset?
As Val said, you would normally have time-based indices, so you can easily (and performantly) remove data after a certain retention period. So as your requirements change over time, you change your shard count (normally through an index template).
Current versions of Elasticsearch also support a _split API, which does exactly what you are asking for: use 5 shards initially, but keep the option to split to the larger shard counts allowed by the index's number_of_routing_shards (with 30 routing shards, for example, 5 -> 10 -> 30 would be options).
If you have 5 primary shards and a replication factor of 1, you can still spread the load over 10 nodes: writes go to the 5 primary shards and are replicated to the 5 replica shards, while reads can be served by either copy. Elasticsearch's write/read model is generally different from MongoDB's.
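For illustration, a sketch of that split in JavaScript against the REST API (host, index names, and shard counts are hypothetical; the source index has to be made read-only before splitting):

```js
// Node.js 18+ (global fetch), run as an ES module (.mjs). Splits a
// 5-shard index into 10 shards; the target count must be compatible
// with the index's number_of_routing_shards.
const es = "http://localhost:9200";

// 1. Block writes on the source index, as _split requires.
await fetch(`${es}/logs-v1/_settings`, {
  method: "PUT",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ "index.blocks.write": true }),
});

// 2. Split into a new index with more primary shards.
await fetch(`${es}/logs-v1/_split/logs-v2`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ settings: { "index.number_of_shards": 10 } }),
});
```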
PS disclaimer: I work for Elastic now, but I have used MongoDB in production for 5 years as well.

Migrating chunks from primary shard to the other single shard takes too long

Each chunk move takes about 30-40 mins.
The shard key is a random-looking but monotonically increasing integer string, a long sequence of digits. A "hashed" index is created for that field.
There are 150M documents, each about 1.5 KB in size. The sharded collection has 10 indexes (some of them compound).
I have a total of ~11k chunks reported in sh.status(). So far only 42 of them have been transferred to the other shard.
The system consists of one mongos, one config server, one primary (mongod) shard, and one other (mongod) shard, all on the same server, which has 8 cores and 32 GB of RAM.
I know the ideal is to use separate machines, but none of the CPUs were being fully utilized, so I thought this was good for a start.
What are your thoughts?
What do I need to investigate?
Is this normal?
As the MongoDB documentation says: "Sharding is the process of storing data records across multiple machines and is MongoDB's approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide an acceptable read and write throughput. Sharding solves the problem with horizontal scaling. With sharding, you add more machines to support data growth and the demands of read and write operations."
You should definitely not have your shards on the same machine; it is pointless. The point of sharding is horizontal scaling, so if you shard on the same machine you are just killing your throughput.
If you only have one machine, your database will be faster without sharding.
To avoid data loss, layer your redundancy before sharding: RAID (not RAID 0), then a replica set, and only then sharding.
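If you still want to investigate the slow migrations before changing the topology, a couple of hedged mongosh diagnostics, run against mongos, will show what the balancer is doing and how long recent moves actually took:

```js
// Is the balancer enabled, and is a migration currently in flight?
db.adminCommand({ balancerStatus: 1 });

// The config database logs every step of a chunk migration; the
// timestamps of the last few entries show where the time is going.
db.getSiblingDB("config").changelog
  .find({ what: /moveChunk/ })
  .sort({ time: -1 })
  .limit(10);
```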

MongoDB: Distribution of collections using sharding

The application we developed is a data-crunching application, and in the process multiple collections get created. So there is a high chance that the maximum of roughly 3 million collections (assuming we use the full 2 GB available for the namespace file) will be exceeded.
Can I distribute different collections across different MongoDB instances and manage them using sharding?
I guess your question is about how to use sharding to distribute collections across the nodes? Sharding is actually meant as a method for meeting the demands of data growth in terms of data volume, not the number of collections: you partition a single collection into multiple shards.
To deal with the issue you are having, you could run multiple stand-alone mongod instances on the same server (not as a replica set) and distribute your collections across these instances. But you will need to keep track of which collection is hosted on which mongod process, and you would have to store the PID of each mongod process.
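A minimal mongosh sketch of that bookkeeping, with hypothetical ports and collection names, is a hand-rolled routing map from collection name to instance:

```js
// Each entry points at a stand-alone mongod; nothing here is replicated.
const routing = {
  crunch_2023: "mongodb://localhost:27018",
  crunch_2024: "mongodb://localhost:27019",
};

function collectionFor(name) {
  // connect() is a mongosh built-in that returns a database handle.
  const db = connect(routing[name] + "/crunch");
  return db.getCollection(name);
}

collectionFor("crunch_2024").countDocuments({});
```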

Does mongo replication split data or duplicate it

I am creating a MongoDB/Node.js-based CMS and I am using GridFS to store all the uploaded docs. The question I have is this:
Do MongoDB replica sets allow an increased amount of DB storage, or do they simply duplicate the database? For instance, if I have 5 servers with 1 TB of storage each and I replicate Mongo across all of them, would my GridFS system theoretically have 5 TB of storage (minus caching and padding), or 1 TB of storage duplicated several times for better read performance?
Thanks!
Informal description:
Replication = The same copy of the data on multiple nodes, i.e., 5 nodes with 1TB each provide 1TB overall.
Sharding / Partitioning = Fraction of the data goes to the nodes, i.e., 5 nodes with 1TB each provide 5TB overall.
Each approach has certain advantages and disadvantages: replication can help with read throughput and serves as a backup, but slows down inserts (depending on the commit level), whereas partitioning can help with insert throughput and distributed lookups. The details are left to the storage system implementor.
Sharding means splitting your data across multiple nodes; this is useful when you have a huge amount of data.
Replication means copying the data from one node to another, and it's useful when your application is read-heavy or you want to back up your data, for example.
Resources:
http://www.mongodb.org/display/DOCS/Sharding
http://www.mongodb.org/display/DOCS/Replication
http://nosql-exp.blogspot.com/2010/09/mongodb-sharding-and-replication-with.html
Do MongoDB replica sets allow an increased amount of DB storage, or simply duplicate the database?
Mongo can do both.
The first case is called sharding.
The second case is called replication.
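To make the two cases concrete, here is a hedged mongosh sketch with made-up host names; replication copies the same data to every member, while each added shard contributes its own capacity:

```js
// Replication: run once on one member. All three nodes end up with
// the same data, so 3 x 1 TB still stores only 1 TB.
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "node1.example.net:27017" },
    { _id: 1, host: "node2.example.net:27017" },
    { _id: 2, host: "node3.example.net:27017" },
  ],
});

// Sharding: run against mongos. Each shard stores a different subset,
// so the shards' capacities add up.
sh.addShard("rs0/node1.example.net:27017");
sh.addShard("rs1/node4.example.net:27017");
```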