MongoDB Index in Memory with Sharding

The word on the street is that MongoDB gets slow if you can't keep the indexes you're using in memory. How does this work with sharding? Does a shard only keep its own BTree in memory, or does every shard need to keep the index for the entire collection in memory?

Does a shard only keep its own BTree in memory...?
Yes, each shard manages its own indexes.
The word on the street is that MongoDB gets slow if you can't keep the indexes you're using in memory.
You can actually expect things to get worse when you combine sharding with secondary indexes. The key problem is that the router process (mongos) knows nothing about the data in secondary indexes.
If you query by the shard key, the query is routed directly to the correct server(s). In most cases this levels out the workload, so 100 queries can be spread across 100 servers with each server answering only 1 query.
However, if you query by a secondary key, the query has to go to every server. So 100 queries to the router will result in 10,000 queries across 100 servers, or 100 queries per server. As you add more servers, these "non-shard-key" queries become less and less efficient; the workload does not become more balanced.
Some details are available in the MongoDB docs here.
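To see how much index each shard actually holds, you can look at the per-shard collection stats from a mongos. A minimal shell sketch, assuming a hypothetical database "mydb" with a sharded collection "events":

    use mydb
    var stats = db.events.stats()
    // Total index size (bytes) across the whole sharded collection:
    print("total: " + stats.totalIndexSize)
    // Per-shard breakdown -- each shard only indexes its own chunks:
    for (var shard in stats.shards) {
        print(shard + ": " + stats.shards[shard].totalIndexSize)
    }

Each shard's totalIndexSize is roughly what needs to fit in that shard's RAM for its share of the workload.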

Just its own portion of the index (it doesn't know about the other shards' data). Scaling wouldn't work very well otherwise. See this documentation for some more information about sharding:
http://www.mongodb.org/display/DOCS/Sharding+Introduction
http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key

Related

Is it good to add sharding on a single server?

Is it good to do sharding on a single machine/server? If the size of the MongoDB documents is above 10GB, will it perform well?
The key rule of sharding is: don't shard until you absolutely have to. If you're not having performance problems now, you don't need to shard. Choosing shard keys can be a difficult process, and a poor choice means your data won't be balanced correctly between the shards. Sharding can also add severe operational overhead to your deployment, since you will need additional mongod processes, config servers, and replica sets for it to be stable in production.
I'm assuming you mean your collections are 10GB. Depending on the size of your machine, a 10GB collection is not a lot for MongoDB to handle. If you're having performance issues with queries, my first step would be to go through your MongoDB log and see if there are any queries you could add indexes for.
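If the log alone isn't enough, the built-in profiler can surface the slow queries directly. A small sketch in the mongo shell, run against a single mongod; the "posts" collection and "author" field are hypothetical examples:

    // Log operations slower than 100 ms to the system.profile collection:
    db.setProfilingLevel(1, 100)
    // ...run the normal workload for a while, then list the worst offenders:
    db.system.profile.find().sort({ millis: -1 }).limit(10).pretty()
    // If a frequent slow query filters on an unindexed field, index it:
    db.posts.createIndex({ author: 1 })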

MongoDB Sharded cluster: Inserts only hitting one shard

We are using a cluster with 6 shards.
The collection uses a non-hashed key.
The documents are rather big and our chunk-size is set to 512MB.
Two huge bulk inserts hit our cluster but everything is inserted on a single shard.
This leads to 120% effective lock, while the other shards are chilling at 5% lock.
I think the bulk inserts only append to the last chunk, since the inserts are ordered. Due to the heavy load, no chunks are redistributed until the insert ends.
After the bulk insert redistribution works nicely.
MongoDB version is 2.6.5.
How can I configure the config servers to automatically distribute bulk inserts?
I will edit the post if more information is required.
Thank You all!!!
As answered below:
Pre-splitting is the best solution for us. This allows us to evenly distribute the whole key range across the shards before insertion, since we know the key space. Thank you!
Sounds like your shard key is monotonic? The documentation has a large section about bulk inserts in sharded environments.
Essentially,
either pre-split the collection (see the sketch after this list)
or insert through different mongos instances (though that does not help with the initial insert)
and/or make sure that your shard key doesn't increase monotonically (for non-hashed collections, that's usually a good idea).
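A rough shell sketch of the pre-splitting approach. The database "mydb", the collection "docs", the shard key field "name", and the split points are all hypothetical; pick split points that cover your actual key space:

    sh.enableSharding("mydb")
    sh.shardCollection("mydb.docs", { name: 1 })
    // Create split points over the known key space while the collection
    // is still empty, so the balancer can spread the chunks before the load:
    sh.splitAt("mydb.docs", { name: "e" })
    sh.splitAt("mydb.docs", { name: "j" })
    sh.splitAt("mydb.docs", { name: "o" })
    sh.splitAt("mydb.docs", { name: "t" })
    // Alternatively, a hashed shard key sidesteps the monotonic hot shard
    // entirely (at the cost of range queries on the key):
    // sh.shardCollection("mydb.docs", { name: "hashed" })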

What is the performance of a query that doesn't contain the shard key in a sharded MongoDB environment?

The title says it all. Assume you have a sharded MongoDB environment and the user provides a query which doesn't contain the shard key. What is the actual performance of the query? What happens in the background?
The performance depends on any number of factors; however, the default action of MongoDB in this case is a global scatter-and-gather operation, whereby it sends the query to all shards and then merges the results (dropping duplicates) to give you the end result.
Returning to the performance: it normally depends upon the indexes on each shard, on how well each shard's own data set is optimised, and on how big a range of the data set each shard holds.
However, processing is parallel in sharding, which means every shard gets the query at the same time and the mongos merges the results as they come in. So the performance shouldn't be: go to shard 1, get it, then shard 2; instead it is: go to all shards, each shard returns its results, and the mongos merges and returns.
Here is a good presentation (with nice pictures) on exactly how queries with sharding work in certain situations: http://www.slideshare.net/mongodb/how-queries-work-with-sharding
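You can see the difference between a targeted and a scatter-gather query from the shell with explain(). A small sketch, assuming a hypothetical collection "mydb.users" sharded on { userId: 1 }:

    // Targeted: the filter contains the shard key, so mongos routes
    // the query to a single shard:
    db.users.find({ userId: 12345 }).explain()
    // Scatter-gather: no shard key in the filter, mongos asks every shard:
    db.users.find({ email: "a@example.com" }).explain()
    // In the explain output, the entries under "shards" show how many
    // shards were actually consulted.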
If the query targets a sharded collection, it is executed on all shards; if it targets a non-sharded collection, MongoDB reads all the data from the single shard that holds that collection.
Here is the link to the sharding FAQ in the MongoDB docs:
http://docs.mongodb.org/manual/faq/sharding/

MongoDB Sharding and Indexing

I have been struggling to deploy a large database.
I have deployed 3 shard clusters and started indexing my data.
However, it's been 16 days and I'm only halfway through.
Question is: should I import all the data into a non-sharded cluster, activate sharding once the raw data is in the database, and then attach more clusters and start indexing? Will this auto-balance my data?
Or should I wait another 16 days for the current method to finish...
Edit:
Here is more explanation of the setup and data that is being imported...
So we have 160 million documents that are like this:
{
    "_id" : ObjectId("5146ae7de4b0d58a864bcfda"),
    "subject" : "<concept/resource/propert/122322xyz>",
    "predicate" : "<concept/property/os/123ABCDXZYZ>",
    "object" : "<http://host/uri_to_object_abcdy>"
}
Indexes: subject, predicate, object, subject > predicate, object > predicate
Shard keys: subject, predicate, object
Setup:
3 clusters on AWS (each with 3 Replica sets) with each node having 8 GiB RAM
(Config servers are within each cluster and Mongos is in a separate server)
The data gets imported by a Java program through the mongos.
What would be the ideal way to import this data, index it, and shard it (without waiting a month for the process to complete)?
If you are doing a massive bulk insert, it is often faster to perform the insert without an index and then index the collection. This has to do with the way Mongo manages index updates on the fly.
Also, MongoDB is particularly sensitive to memory when it indexes. Check the size of your indexes with db.stats() and hook your DBs up to the Mongo Monitoring Service (MMS).
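A minimal sketch of the insert-first, index-afterwards approach in the shell. The collection name "triples" is hypothetical; the fields are the ones from the question:

    // 1. Import with only the mandatory _id index in place.
    // 2. Build the secondary indexes once the load has finished:
    db.triples.createIndex({ subject: 1 })
    db.triples.createIndex({ predicate: 1 })
    db.triples.createIndex({ object: 1 })
    db.triples.createIndex({ subject: 1, predicate: 1 })
    db.triples.createIndex({ object: 1, predicate: 1 })
    // "background: true" keeps the DB responsive during the build,
    // at the price of a slower build:
    // db.triples.createIndex({ subject: 1 }, { background: true })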
In my experience, whenever MongoDB takes a lot more time than expected, it is due to one of two things:
1. Running out of physical memory or getting itself into a poor I/O pattern. MMS can help diagnose both; check out the page faults graph in particular.
2. Operating on unindexed collections, which does not apply in your case.

MongoDB Cluster and 100,000 Capped Collections

How does a MongoDB cluster distribute capped collections across nodes for load balancing? I am planning to use a capped collection for the comments of each post in a MongoDB-based CMS. Let's assume we have 100,000 posts, and hence 100,000 capped collections storing the comments for each post. Will these capped collections be distributed evenly across the cluster for read and write scalability?
I don't want to shard a capped collection. I want to distribute all the capped collections evenly across the cluster for read and write scalability.
Let's assume we have 5 machines. When we create new collections, I need them to be created on different machines/nodes, and also redistributed when new machines are added.
1) When a collection (capped or not) is created, it is placed on the primary shard of its database. A workaround would be to use one collection per database, so that MongoDB balances the databases across the cluster. The rule for this balancing is not clearly specified, but it depends mainly on the current load on each shard.
2) Believe me, you should use one big collection for all your posts and shard it in a clever way. That will ensure really efficient and automatic balancing of your data across the cluster.
Moreover, capped collections are not really space-efficient, because all the space is pre-allocated for each collection (meaning you'd have a lot of wasted space for nothing).
Unless you have a very good reason to go for capping, you are better off sharding.
One piece of advice: use the postId field in your shard key; it will probably give the best performance. (See the sketch below.)
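A sketch of that one-big-collection approach in the shell. The database name "cms" is hypothetical; the compound key keeps all comments of a post on one shard while still letting chunks split:

    sh.enableSharding("cms")
    sh.shardCollection("cms.comments", { postId: 1, _id: 1 })
    // Fetching the comments of one post is then a targeted,
    // single-shard query:
    db.comments.find({ postId: 12345 }).sort({ _id: 1 })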
Apparently it is not implemented in MongoDB yet: Issue
Quote from a similar question:
But you can create multiple capped collections on different shards to increase write throughput; however, you must then run multiple queries to access all your data.
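A sketch of that workaround. Unsharded collections (capped ones included) live on their database's primary shard, so spreading capped collections means spreading databases. The database and shard names below are hypothetical; check sh.status() for your real shard names:

    use comments_bucket_0
    db.createCollection("comments", { capped: true, size: 1048576 })
    use comments_bucket_1
    db.createCollection("comments", { capped: true, size: 1048576 })
    // Pin each bucket database to a different shard:
    db.adminCommand({ movePrimary: "comments_bucket_0", to: "shard0000" })
    db.adminCommand({ movePrimary: "comments_bucket_1", to: "shard0001" })
    // Reads then have to query every bucket and merge in the application.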