What is the performance of a query that doesn't contain the shard key in a sharded MongoDB environment? - mongodb

The title says it all. Assume that you have a sharded MongoDB environment and the user provides a query that doesn't contain the shard key. What is the actual performance of the query? What happens in the background?

The performance depends on any number of factors. However, the default action of MongoDB in this case is a global scatter-gather operation: it sends the query to all shards and then merges their results to give you an end result.
Returning to the performance, it normally depends on the indexes on each shard, on how well each shard can optimise the query against its own data set, and on how large a range of the data set each shard holds.
However, processing is parallel in a sharded cluster: every shard gets the query at the same time, and mongos (the query router) merges the results as they come in. So the flow is not "go to shard 1, get the results, then go to shard 2"; it is "go to all shards, each shard returns its results, and mongos merges and returns them".
Here is a good presentation (with nice pictures) on exactly how queries with sharding work in certain situations: http://www.slideshare.net/mongodb/how-queries-work-with-sharding
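To make the difference between a targeted query and a scatter-gather query concrete, here is a minimal sketch in plain JavaScript. It is a toy model, not MongoDB's implementation: the shard names, the shard key `userId`, the chunk ranges and the helper functions `targetShards` and `runQuery` are all made up for illustration, and the shards are visited in a loop here, whereas a real router contacts them in parallel.

```javascript
// Toy model of a query router (NOT MongoDB's real implementation).
// Each hypothetical shard owns a half-open range [min, max) of the shard key "userId".
const shards = [
  { name: "shard1", min: 0,   max: 100, docs: [{ userId: 5,   city: "Oslo" }, { userId: 42, city: "Rome" }] },
  { name: "shard2", min: 100, max: 200, docs: [{ userId: 150, city: "Oslo" }] },
  { name: "shard3", min: 200, max: 300, docs: [{ userId: 250, city: "Lima" }] },
];

// Decide which shards must receive the query.
function targetShards(filter, shardKey) {
  if (shardKey in filter) {
    // Shard key present: only the shard owning that value is contacted.
    const v = filter[shardKey];
    return shards.filter((s) => v >= s.min && v < s.max);
  }
  // Shard key absent: broadcast (scatter) to every shard.
  return shards;
}

// Run the filter on each targeted shard and concatenate (gather) the results.
function runQuery(filter, shardKey = "userId") {
  const targets = targetShards(filter, shardKey);
  const matches = (doc) => Object.entries(filter).every(([k, v]) => doc[k] === v);
  const results = targets.flatMap((s) => s.docs.filter(matches));
  return { contacted: targets.map((s) => s.name), results };
}

const targeted = runQuery({ userId: 150 });   // filter includes the shard key
const scattered = runQuery({ city: "Oslo" }); // filter does not include it
console.log(targeted.contacted);
console.log(scattered.contacted);
```

In this model the query on `userId` contacts a single shard, while the query on `city` contacts all three and the router has to combine their partial results.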

If the query is made against a sharded collection, it is run on all shards. If the query is made against a non-sharded collection, MongoDB keeps all of that collection's data on a single shard, so the query goes only to that shard.
Here is the link to the sharding FAQ in the MongoDB docs:
http://docs.mongodb.org/manual/faq/sharding/

Related

MongoDB sharding with repeated documents

I am new to MongoDB and wish to create a distributed database environment using docker-compose with MongoDB. I've created multiple Docker containers with shards to simulate multiple sites. However, I am having trouble replicating the same set of documents into multiple shards.
For example I have a collection with a key that has value "A" and "B". I want to distribute this collection into 2 shards where
Shard 1 = A & B
Shard 2 = B only
However, when I run the balancer it distributes all A's into shard 1 and B's into shard 2. Is there any way I can do the sharding with repeated data or am I using the wrong approach for my problem?
You might be approaching sharding (horizontal scaling) incorrectly. What makes sharding in MongoDB work is that the shard key is chosen so that the resulting shards have a roughly even distribution of data, or a similar number of documents. A requirement that makes sharding work well is that queries are typically directed to only a single shard. If you have queries which need to return documents with both values, A and B, of some field, then that implies this field should not be the shard key. Queries can go across shards, but certain cross-shard operations, such as joins, can be very costly. In your particular case, perhaps some other field could be used as the shard key.
Redundancy in MongoDB is provided by replica sets, not sharded clusters.
Each shard can be backed by a replica set with your desired number of nodes to provide the required redundancy level.
It is not possible to have the same document be (authoritatively) located in multiple shards.

Writing on multiple shards in mongodb

Generally, if a query spreads across multiple shards, it is considered less optimized. It takes more time than reading from a single shard.
Does this hold true for writing as well? If I am writing some data and it is distributed among multiple shards, will it be considered less optimized?
If yes, what is the best way to write a batch that should go to different shards?
It depends on the operations, see https://docs.mongodb.com/manual/core/sharded-cluster-query-router/#sharding-mongos-targeted.
All insertOne() operations target a single shard. Each document in the insertMany() array targets a single shard, but there is no guarantee that all documents in the array are inserted into the same shard.
All updateOne(), replaceOne() and deleteOne() operations must include the shard key or _id in the query document. MongoDB returns an error if these methods are used without the shard key or _id.
Depending on the distribution of data in the cluster and the selectivity of the query, mongos may still perform a broadcast operation to fulfill these queries.
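As an illustration of why one insertMany() batch can end up on several shards, here is a small sketch in plain JavaScript. The chunk ranges, the shard key `userId` and the helpers `ownerOf` and `splitBatch` are hypothetical and only model the idea that every document is owned by exactly one shard; they are not MongoDB APIs.

```javascript
// Hypothetical chunk ranges on the shard key "userId": [min, max) per shard.
const chunkMap = [
  { shard: "shard1", min: 0, max: 100 },
  { shard: "shard2", min: 100, max: 200 },
];

// Find the single shard that owns a given shard key value.
function ownerOf(value) {
  const chunk = chunkMap.find((c) => value >= c.min && value < c.max);
  if (!chunk) throw new Error(`no chunk covers ${value}`);
  return chunk.shard;
}

// Split one batch into per-shard sub-batches, the way a router would
// have to before forwarding the documents.
function splitBatch(docs) {
  const perShard = {};
  for (const doc of docs) {
    const shard = ownerOf(doc.userId);
    if (!perShard[shard]) perShard[shard] = [];
    perShard[shard].push(doc);
  }
  return perShard;
}

const batch = [{ userId: 7 }, { userId: 130 }, { userId: 55 }];
const split = splitBatch(batch);
console.log(split);
```

Each document still goes to exactly one shard; the batch as a whole touches as many shards as its shard key values span.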

MongoDB db.collection.count() vs db.collection.find().length()

I would like to understand why these commands, when run from a mongos instance against the same MongoDB collection, return different numbers?
db.users.count()
db.users.find().length()
What can be the reason and can it be a sign of underlying issues?
I believe your collection is sharded.
Most sharded database solutions show this kind of discrepancy, because some commands consider the entire collection, meaning all the documents on all the shards, while other commands only consider the documents of the shard they are connected to.
This is something to always keep in mind. It mostly applies to commands which:
count
return the document having the lowest value for a given field
return the document having the biggest value for a given field
...
Found in the Mongo docs:
count() is equivalent to the db.collection.find(query).count() construct. ...
Sharded Clusters: On a sharded cluster, db.collection.count() can result in an inaccurate count if orphaned documents exist or if a chunk migration is in progress. ...
So in the case of Mongo, it is simply because the balancer runs as a background process, migrating chunks of documents between shards in order to keep the distribution of data compliant with the sharding policy defined on the collection.
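Keep in mind that, to offer the best performance, some sharded solutions write a document to the node the client is connected to and only later move it to where it is really meant to be. MongoDB routes writes through mongos to the shard that owns the key range, so there the discrepancy comes from chunk migrations and orphaned documents rather than from the initial write.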
This is why nosql DBs are often flagged as eventually consistent.
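The effect of orphaned documents on the two numbers can be sketched with a toy model in plain JavaScript. It assumes, in line with the documentation quote above, that the plain count simply adds up what each shard stores, while a find through the router only returns documents from the range each shard currently owns. The shard key `userId`, the ranges and the function names are made up for illustration.

```javascript
// Toy model: each shard stores some documents and "owns" a range of the
// hypothetical shard key "userId". A document outside the owned range is an
// orphan, e.g. left behind by a migration that has not been cleaned up yet.
const shards = [
  { min: 0,   max: 100, docs: [{ userId: 1 }, { userId: 50 }, { userId: 120 }] }, // 120 is an orphan here
  { min: 100, max: 200, docs: [{ userId: 120 }, { userId: 180 }] },
];

// Plain count: every shard reports how many documents it stores, no ownership check.
function metadataCount() {
  return shards.reduce((sum, s) => sum + s.docs.length, 0);
}

// Filtered read: each shard returns only documents in the range it owns.
function filteredFind() {
  return shards.flatMap((s) => s.docs.filter((d) => d.userId >= s.min && d.userId < s.max));
}

console.log(metadataCount());       // includes the orphaned copy
console.log(filteredFind().length); // orphan filtered out
```

In this model the two numbers differ by exactly the number of orphaned copies, which is why a mismatch usually points at in-flight or unfinished migrations rather than at lost data.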

When the sort field is not the part of shard key, will mongos sort the data returned by all mongod?

When the sort field is not part of the shard key, mongos will send the query to all mongod instances. After all mongod instances return data, mongos will merge them.
Does this merge operation include a sort?
We know the sort field is not part of the shard key, so the combined data will be unordered and mongos must do the sort. If so, when the returned data is very large, mongos will take up a lot of memory.
Is my understanding correct?
It's not the sort field that needs to be in the shard key, but rather the criteria you are using to select the data. That is, if the mongos cannot determine from the fields you are using as part of your query where the data lives specifically then it will send to all shards. This is the same as any other non-sort query. Sorting on a non-shardkey field does not affect the ability of the mongos to route the queries appropriately.
This is mentioned in the docs here:
https://docs.mongodb.org/v2.4/core/sharded-cluster-query-router/#how-mongos-handles-query-modifiers
The shards will receive the queries from mongos, they will sort their subset of results, and send them back to the mongos. The mongos then has to do a merge sort on the returned results before presenting them back. This is not as intensive as the full sort would be, since the results are ordered initially by the shards, but will still require resources. The amount of memory consumed will be related to the size of the result sets returned by the various shards.
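The merge step described above can be sketched as a k-way merge over result lists that are already sorted. This is plain JavaScript with a hypothetical `mergeSorted` helper and a made-up sort field `age`; it only illustrates why merging pre-sorted streams is cheaper than a full sort, and is not MongoDB's actual code.

```javascript
// Each inner array is one shard's result, already sorted ascending by the sort key.
function mergeSorted(shardResults, key) {
  // One cursor position per shard.
  const pos = shardResults.map(() => 0);
  const merged = [];
  for (;;) {
    let best = -1;
    // Pick the shard whose next unread document has the smallest key.
    for (let i = 0; i < shardResults.length; i++) {
      if (pos[i] >= shardResults[i].length) continue; // shard exhausted
      if (best === -1 || shardResults[i][pos[i]][key] < shardResults[best][pos[best]][key]) {
        best = i;
      }
    }
    if (best === -1) break; // all shards exhausted
    merged.push(shardResults[best][pos[best]]);
    pos[best] += 1;
  }
  return merged;
}

const fromShards = [
  [{ age: 21 }, { age: 40 }],
  [{ age: 18 }, { age: 33 }, { age: 65 }],
  [], // a shard with no matching documents
];
const sorted = mergeSorted(fromShards, "age");
console.log(sorted.map((d) => d.age));
```

The merger only ever compares the next unread document from each shard, so its working set is one document per shard plus whatever has been buffered from the shard responses.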
Edit (May 2016): the above was true when originally answered in 2012, but (as pointed out in the comments below) the behavior changed with version 2.6 in 2014. The results are now sent to the primary shard for the sharded database to be merge sorted before being returned to the mongos (and then to the user). This makes a lot of sense since mongos instances are far less likely to have the resources to perform a large sort, but it does mean that you should pay close attention to where any databases which will be sorted frequently have their primary as it will see higher load as a result.
In version 3.2, if the primary shard is not involved in the fetch (in other words, the primary shard does NOT contain any of the documents matched by the find command), then another, non-primary shard may be used for the merge instead.

MongoDB Index in Memory with Sharding

The word on the street is that MongoDB gets slow if you can't keep the indexes you're using in memory. How does this work with sharding? Does a shard only keep its own BTree in memory, or does every shard need to keep the index for the entire collection in memory?
Does a shard only keep its own BTree in memory...?
Yes, each shard manages its own indexes.
The word on the street is that MongoDB gets slow if you can't keep the indexes you're using in memory.
You can actually expect worse when using sharding and secondary indexes. The key problem is that the router process (mongos) knows nothing about data in secondary indexes.
If you do a query using the shard key, it will be routed directly to the correct server(s). In most cases, this levels out the workload. So 100 queries can be spread across 100 servers and each server only answers 1 query.
However, if you do a query using a secondary key, that query has to go to every server. So 100 queries to the router will result in 10,000 queries across 100 servers, or 100 queries per server. As you add more servers, these "non-shardkey" queries become less and less efficient. The workload does not become more balanced.
Some details are available in the MongoDB docs here.
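The arithmetic in this answer can be written down as a tiny helper. This is a sketch in plain JavaScript with a hypothetical function `perServerLoad`, and it assumes that shard-key queries are spread perfectly evenly across the servers.

```javascript
// Shard-level requests generated by a given number of client queries.
function perServerLoad(clientQueries, servers, usesShardKey) {
  // Targeted: each query hits one server. Broadcast: each query hits all of them.
  const totalRequests = usesShardKey ? clientQueries : clientQueries * servers;
  return { totalRequests, perServer: totalRequests / servers };
}

console.log(perServerLoad(100, 100, true));  // queries on the shard key
console.log(perServerLoad(100, 100, false)); // queries on a secondary key
```

With the shard key, adding servers lowers the load per server; without it, every server still sees every query, so the per-server load stays at the number of client queries no matter how many servers are added.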
Just its own portion of the index (it doesn't know about the other shards' data). Scaling wouldn't work very well, otherwise. See this documentation for some more information about sharding:
http://www.mongodb.org/display/DOCS/Sharding+Introduction
http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key