When I run an aggregation on a sharded MongoDB environment and $match matches documents on multiple shards, does the next stage in the pipeline, $group, then run on multiple shards?
Or does all the matching data go to a single shard?
If it goes to a single shard, which shard will it go to?
It works the same way whether or not the collection is sharded: if the matching data lives on different shards, MongoDB automatically runs the aggregation on each of those shards.
Generally, a query that spreads across multiple shards is considered less optimized, since it takes more time than reading from a single shard.
Does this hold true for writes as well? If I write some data and it gets distributed among multiple shards, is that considered less optimized?
If so, what is the best way to write a batch that should go to different shards?
It depends on the operations, see https://docs.mongodb.com/manual/core/sharded-cluster-query-router/#sharding-mongos-targeted.
Every insertOne() operation targets a single shard. Each document in an insertMany() array also targets a single shard, but there is no guarantee that all documents in the array are inserted into the same shard.
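One way to picture this targeting is to group the documents of a single insertMany() batch by the shard that owns each shard-key value. The sketch below is a plain JavaScript simulation, not MongoDB's actual router: the chunk ranges, the shard names, and the `userId` shard key are all hypothetical.

```javascript
// Hypothetical chunk table: each chunk covers [min, max) of the shard key
// and lives on exactly one shard. A real cluster keeps this on the config servers.
const chunks = [
  { min: -Infinity, max: 100, shard: "shardA" },
  { min: 100, max: 200, shard: "shardB" },
  { min: 200, max: Infinity, shard: "shardC" },
];

// Find the shard that owns a given shard-key value.
function shardFor(keyValue) {
  const chunk = chunks.find((c) => keyValue >= c.min && keyValue < c.max);
  return chunk.shard;
}

// Split one batch into per-shard sub-batches, roughly what a router does
// before sending each sub-batch to its own shard.
function groupBatchByShard(docs, shardKey) {
  const batches = {};
  for (const doc of docs) {
    const shard = shardFor(doc[shardKey]);
    (batches[shard] = batches[shard] || []).push(doc);
  }
  return batches;
}

const batch = [{ userId: 5 }, { userId: 150 }, { userId: 42 }, { userId: 999 }];
const perShard = groupBatchByShard(batch, "userId");
console.log(perShard);
```

In this toy batch, two documents land on one shard and the other two on separate shards, so a single insertMany() call ends up touching three shards.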
All updateOne(), replaceOne() and deleteOne() operations must include the shard key or _id in the query document. MongoDB returns an error if these methods are used without the shard key or _id.
Depending on the distribution of data in the cluster and the selectivity of the query, mongos may still perform a broadcast operation to fulfill these queries.
I was just reading the MongoDB aggregation documentation at https://docs.mongodb.com/v3.2/aggregation/#aggregation-pipeline, which mentions that
"The aggregation pipeline can operate on a sharded collection."
Please let me know: if a database is sharded, are all the collections in the database sharded? Also, please confirm that on a sharded collection the aggregate query runs on many servers and delivers the results faster. If so, how does the aggregation query work?
Regards
Kris
The aggregation pipeline supports operations on sharded collections.
If the pipeline starts with an exact $match on a shard key, the entire pipeline runs on the matching shard only. Previously (prior to version 3.2), the pipeline would have been split into two parts, and the work of merging it would have been done on the primary shard.
In the case of aggregation operations that must run on multiple shards, if the operations do not require running on the database’s primary shard, these operations will route the results to a random shard to merge the results to avoid overloading the primary shard for that database. The $out stage and the $lookup stage require running on the database’s primary shard.
When splitting the aggregation pipeline into two parts, the pipeline is split to ensure that the shards perform as many stages as possible with consideration for optimization.
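The split can be sketched with a small simulation. Assume a hypothetical pipeline that matches `status: "paid"` and then groups by `customer`, summing `amount`: each shard runs the match and a partial group on its own documents, and a merging step combines the partial sums. The data, field names, and shard names below are made up for illustration, and the code is plain JavaScript rather than MongoDB's implementation.

```javascript
// Hypothetical data: each array holds the documents stored on one shard.
const shards = {
  shardA: [
    { status: "paid", customer: "ann", amount: 10 },
    { status: "open", customer: "bob", amount: 99 },
  ],
  shardB: [
    { status: "paid", customer: "ann", amount: 5 },
    { status: "paid", customer: "bob", amount: 7 },
  ],
};

// Shard part: filter (like $match) and compute partial sums (like $group).
function shardPart(docs) {
  const partial = {};
  for (const doc of docs) {
    if (doc.status !== "paid") continue;
    partial[doc.customer] = (partial[doc.customer] || 0) + doc.amount;
  }
  return partial;
}

// Merge part: combine the partial sums returned by every shard.
function mergePart(partials) {
  const totals = {};
  for (const partial of partials) {
    for (const [customer, sum] of Object.entries(partial)) {
      totals[customer] = (totals[customer] || 0) + sum;
    }
  }
  return totals;
}

const totals = mergePart(Object.values(shards).map(shardPart));
console.log(totals);
```

The point of the split is that only the small partial results travel to the merging shard, not the matched documents themselves.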
Reference:
https://docs.mongodb.com/manual/core/aggregation-pipeline-sharded-collections/
In a sharded cluster, each database has a primary shard, which holds all of that database's unsharded collections.
"Please let me know: if a database is sharded, are all the collections in the database sharded?"
No. By default, none of your collections are sharded; unsharded collections stay wholly on the database's primary shard. To shard a collection, use the shardCollection command.
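As a minimal sketch, sharding a single collection from the shell could look like the following. The `shop` database, the `users` collection, and the `userId` shard key are hypothetical names; `sh.enableSharding()` and `sh.shardCollection()` are the shell helpers. The `sh` helper is passed in as a parameter only so that the function can also be tried outside the shell with a stand-in object.

```javascript
// Hypothetical names: the "shop" database, "users" collection, and "userId" key.
const namespace = "shop.users";
const shardKey = { userId: 1 };

// Takes the shell's `sh` helper as a parameter so the function can also be
// exercised outside the shell with a stand-in object.
function shardUsersCollection(sh) {
  sh.enableSharding("shop"); // allow sharding for the database
  return sh.shardCollection(namespace, shardKey); // shard this one collection
}

// In a shell connected to a mongos: shardUsersCollection(sh)
```

Other collections in the same database remain unsharded until they are sharded explicitly in the same way.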
"Also, please confirm that on a sharded collection the aggregate query runs on many servers and delivers the results faster."
An important part of defining a collection in a sharded environment is the shard key, which determines how data is distributed across the shards. If you choose a good shard key, you can expect better performance than in a non-sharded environment.
"If so, how does the aggregation query work?"
The shards that hold matching documents each run the $match (and as many of the following stages as possible), and the partial results are finally merged together on a single shard. A good read is https://docs.mongodb.com/v3.2/core/aggregation-pipeline-sharded-collections/
I would like to understand why these commands, when run from a mongos instance against the same MongoDB collection, return different numbers?
db.users.count()
db.users.find().length()
What can be the reason and can it be a sign of underlying issues?
I believe your collection is sharded.
Most sharded database solutions show such discrepancies, because some commands consider the entire collection, meaning all the documents on all the shards, while other commands consider only the documents on the shard they are connected to.
This is something to always keep in mind. It mostly applies to commands which:
count documents
return the document having the lowest value for a given field
return the document having the highest value for a given field
...
Found in the MongoDB docs:
count() is equivalent to the db.collection.find(query).count() construct. ...
Sharded Clusters
On a sharded cluster, db.collection.count() can result in an inaccurate count if orphaned documents exist or if a chunk migration is in progress. ...
So in the case of MongoDB, it is simply because a background process (the balancer) migrates chunks of documents between shards, in order to keep the data distribution compliant with the sharding policy defined on the collection.
Keep in mind that, to offer the best performance, some sharded solutions accept a write on whichever node the client is connected to and move it to its proper location later. In MongoDB, mongos routes each write to the shard that owns the matching chunk, and the balancer may later migrate that chunk to another shard.
This is why NoSQL databases are often described as eventually consistent.
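The discrepancy can be illustrated with a toy model. It assumes that a count without a predicate adds up what each shard physically stores, while a query returns only the documents each shard actually owns; the `owned` flag, the shard names, and the data are hypothetical, and the code is a plain JavaScript simulation rather than MongoDB internals.

```javascript
// Hypothetical state: shardB still holds a copy of a document whose chunk
// has already been migrated to shardA (an "orphaned" document).
const shardData = {
  shardA: [{ _id: 1, owned: true }, { _id: 2, owned: true }],
  shardB: [{ _id: 3, owned: true }, { _id: 2, owned: false }], // _id 2 is an orphan here
};

// Metadata-style count: adds up what each shard physically stores.
function fastCount(shards) {
  return Object.values(shards).reduce((n, docs) => n + docs.length, 0);
}

// Query-style count: each shard returns only the documents it owns.
function filteredCount(shards) {
  return Object.values(shards)
    .flatMap((docs) => docs.filter((d) => d.owned))
    .length;
}

console.log(fastCount(shardData), filteredCount(shardData)); // 4 3
```

In this model the two numbers converge again once the orphaned copy is cleaned up, which matches the idea that the difference is a transient side effect of chunk migration rather than data loss.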
I have a MongoDB sharded cluster with the shard key "my_key".
I need to find a pack of documents (about 10-500 items) with different my_key values in a collection.
For example:
db.test.find({my_key: {$in:[1,3,5,67,45,56...]}})
Mongos knows where the chunks for each 'my_key' value are stored.
Can mongos split my query into smaller queries sent only to the shards where the documents are stored, or will mongos send this query to ALL shards?
And the same question about $or:
db.test.find({$or:[{my_key: 1},{my_key: 3},{my_key: 5}...]})
I have run tests.
If $in contains values from only one shard, mongos sends a SINGLE_SHARD query.
If $in contains values from several shards, mongos sends a SHARD_MERGE query only to the shards that contain the needed data (not to the whole cluster).
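A test like this can be read off the output of explain(). The helper below is a sketch that pulls the stage name and the list of targeted shards out of an explain result; the field layout it assumes (`queryPlanner.winningPlan.stage` and `queryPlanner.winningPlan.shards[].shardName`) and the sample object are assumptions and may differ between server versions.

```javascript
// Summarize which shards a query was sent to, given the output of
// db.test.find(...).explain(). The field layout used here is an assumption.
function summarizeTargeting(explainOutput) {
  const plan = explainOutput.queryPlanner.winningPlan;
  return {
    stage: plan.stage,
    shards: (plan.shards || []).map((s) => s.shardName),
  };
}

// Hand-written stand-in for a real explain() result.
const sampleExplain = {
  queryPlanner: {
    winningPlan: {
      stage: "SHARD_MERGE",
      shards: [{ shardName: "shard0000" }, { shardName: "shard0002" }],
    },
  },
};

console.log(summarizeTargeting(sampleExplain));
```

Comparing the returned shard list with the full list of shards in the cluster shows whether the query was targeted or broadcast.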
The title says it all. Assume that you have a sharded MongoDB environment and the user provides a query that doesn't contain the shard key. What is the actual performance of the query? What happens in the background?
The performance depends on a number of factors. However, the default action of MongoDB in this case is a global scatter-gather operation: it sends the query to all shards and then merges their results to give you the end result.
Returning to performance, it normally depends on the indexes on each shard, how well each shard's data set is optimized in isolation, and how large a range of the data set each shard holds.
However, processing is parallel in sharding: all shards get the query, and the mongos merges the results as they come in. So the flow shouldn't be "go to shard 1, get the results, then go to shard 2"; instead it should be "go to all shards, each shard returns its results, and the mongos merges and returns them".
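As a back-of-the-envelope illustration of why the parallel fan-out matters, compare the two flows with some made-up per-shard response times. The numbers are hypothetical and the merge cost on the mongos is ignored.

```javascript
// Hypothetical per-shard response times in milliseconds.
const shardLatenciesMs = { shardA: 40, shardB: 25, shardC: 60 };
const times = Object.values(shardLatenciesMs);

// One shard after another: the waits add up.
const sequentialMs = times.reduce((sum, t) => sum + t, 0);

// All shards at once: the slowest shard sets the pace.
const parallelMs = Math.max(...times);

console.log({ sequentialMs, parallelMs });
```

With these numbers, the scatter-gather query takes about as long as its slowest shard, which is why one poorly indexed shard can dominate the response time.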
Here is a good presentation (with nice pictures) on exactly how queries with sharding work in certain situations: http://www.slideshare.net/mongodb/how-queries-work-with-sharding
If the query is made against a sharded collection, it is run on all shards; if it is made against a non-sharded collection, MongoDB gets all the data from a single shard.
Here is the link to the sharding FAQ for MongoDB:
http://docs.mongodb.org/manual/faq/sharding/