I was reading the MongoDB aggregation documentation at https://docs.mongodb.com/v3.2/aggregation/#aggregation-pipeline, which says:
"The aggregation pipeline can operate on a sharded collection."
If a database is sharded, are all the collections in it sharded as well? Also, please confirm whether, on a sharded collection, the aggregation query runs on multiple servers and therefore returns results faster. If so, how does the aggregation query work?
Regards
Kris
The aggregation pipeline supports operations on sharded collections.
If the pipeline starts with an exact $match on a shard key, the entire pipeline runs on the matching shard only. Before version 3.2, the pipeline would have been split into two parts, and the work of merging it would have been done on the primary shard.
For aggregation operations that must run on multiple shards but do not need to run on the database’s primary shard, the results are routed to a random shard for merging, which avoids overloading the primary shard for that database. The $out stage and the $lookup stage require running on the database’s primary shard.
When the aggregation pipeline is split into two parts, the split is chosen so that the shards perform as many stages as possible, with consideration for optimization.
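The two-part execution described above can be sketched in plain Python. This is only an illustration under stated assumptions, not MongoDB's implementation: the shard names, the sample documents, and the functions `shard_part` and `merge_part` are all hypothetical, and the pipeline being imitated is a `$match` on `status` followed by a `$group` that sums `amount` per customer.

```python
from collections import defaultdict

# Hypothetical data: each shard holds a slice of an "orders" collection.
SHARDS = {
    "shardA": [{"status": "A", "cust": "x", "amount": 10},
               {"status": "B", "cust": "x", "amount": 99}],
    "shardB": [{"status": "A", "cust": "x", "amount": 5},
               {"status": "A", "cust": "y", "amount": 7}],
}

def shard_part(docs):
    """First part, run on every shard: filter, then build partial sums."""
    partial = defaultdict(int)
    for doc in docs:
        if doc["status"] == "A":                  # imitates the $match stage
            partial[doc["cust"]] += doc["amount"]  # imitates a partial $group
    return dict(partial)

def merge_part(partials):
    """Second part, run once on the merging node: combine partial sums."""
    totals = defaultdict(int)
    for partial in partials:
        for cust, subtotal in partial.items():
            totals[cust] += subtotal
    return dict(totals)

result = merge_part(shard_part(docs) for docs in SHARDS.values())
print(result)
```

The point of the split is that each shard filters and pre-aggregates its own documents, so only small partial results travel to the node that merges them.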
Reference:
https://docs.mongodb.com/manual/core/aggregation-pipeline-sharded-collections/
In a sharded cluster, every database has a primary shard, which holds all of that database's unsharded collections.
If a database is sharded, are all the collections in it sharded as
well?
No, by default none of your collections will be sharded. All unsharded collections stay wholly on the primary shard. To shard a collection, use the shardCollection command.
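As a rough sketch, the command documents involved could be built like this in Python. The database name `shop`, the collection `shop.orders`, the key field `customerId`, and both helper functions are assumptions made for illustration. With PyMongo, documents like these would typically be sent to a mongos with `client.admin.command(...)`; that call is left out here so the snippet runs without a cluster.

```python
def enable_sharding_command(database):
    """Command document that enables sharding for a database (hypothetical helper)."""
    return {"enableSharding": database}

def shard_collection_command(namespace, key):
    """Command document that shards one collection on the given key (hypothetical helper)."""
    return {"shardCollection": namespace, "key": key}

# Assumed names: database "shop", collection "orders", shard key "customerId".
commands = [
    enable_sharding_command("shop"),
    shard_collection_command("shop.orders", {"customerId": 1}),
]

for command in commands:
    print(command)
```

Only `shop.orders` becomes sharded here; every other collection in `shop` stays on the primary shard until it is sharded explicitly.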
Also, please confirm whether, on a sharded collection, the aggregation
query runs on multiple servers and therefore returns results faster.
One important thing when defining a collection in a sharded environment is the shard key, which determines how data is distributed across the shards. If you choose a good shard key, you can expect better performance than in a non-sharded environment.
If so, how does the aggregation query work?
The $match stage determines which shards the aggregation query is sent to, depending on where the matching documents are stored, and the partial results are finally merged together on one shard. A good read is https://docs.mongodb.com/v3.2/core/aggregation-pipeline-sharded-collections/
Related
I have been trying to use $lookup with sharded collections through mongos, which isn't allowed.
If I create an unsharded collection, I know that by default it is created only on the primary shard. However, a $lookup that always goes from the other shards to the primary shard is not efficient.
Therefore, my idea was to create the same collection on each shard and insert each document directly on the appropriate shard, using the same sharding rule as in the config.
Then a $lookup from the sharded collection into the local collection would achieve my goal.
While searching, I found that the comments on the Jira issue SERVER-29159 describe the same problem, as below.
Is there a way to achieve what I have just explained?
From a logical point of view it should be achievable, but shards are accessed through routers, so I believe it is not possible unless MongoDB offers such a feature on the routers. If you know MongoDB well, please at least tell me whether it is impossible.
P.S. I am using spring-data-mongodb as a client.
For my application I need to shard a fairly big collection; the entire collection will contain approximately 500 billion documents.
I have two potential fields which can be used as Sharding Key:
For inserts, either sharding key will distribute documents evenly throughout the cluster, so it does not matter which field I use as the sharding key.
For queries it is different.
Field(1) is usually part of the query filter condition, so a query would usually be processed on a single shard only.
Field(2) is typically not part of the query filter condition, so a query would be processed on all shards, and typically several shards would contribute to the final query result.
Which one is the better field to be used as Sharding Key? I did not find anything in MongoDB documentation about that topic.
Both fields have the same range and very similar cardinality, so there won't be any difference in that respect. The number of documents returned by a query is usually very low (typically fewer than 20-30 documents).
In a sharded cluster, the mongos router determines which shard to target for a read or write operation, based on the shard key metadata stored on the config servers.
For inserts, either sharding key will distribute documents evenly
throughout the cluster, so it does not matter which field I use as
the sharding key.
Every inserted document has a shard key value, and that value determines the shard on which the document is stored.
Field(1) is usually part of the query filter condition, so a query
would usually be processed on a single shard only.
The shard key's main purposes are (a) to distribute data evenly across shards in a cluster, and (b) to be able to query the data in such a way that the query targets a single shard.
For a query to target a single shard, the shard key must be part of the query's filter criteria. The mongos router will target the single shard using the shard key.
If the shard key is not part of the filter criteria, the query becomes a scatter-gather operation, which typically runs longer. The application's most important queries on the sharded collection should therefore be able to use the shard key.
Field(2) is typically not part of the query filter condition, so a
query would be processed on all shards, and typically several shards
would contribute to the final query result.
When the shard key is not part of the query filter, the operation spans multiple shards (a scatter-gather operation) and tends to run slowly. The mongos router cannot determine which shards hold the target data, so every shard in the cluster is queried to produce the final result.
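The difference between a targeted query and a scatter-gather query can be sketched with a toy router. Everything here is an assumption made for illustration: the chunk boundaries, the shard names, the field name `field1`, and the function `target_shards`. A real mongos uses the chunk metadata from the config servers and handles far more than simple equality filters.

```python
import bisect

# Hypothetical chunk map for a range-sharded collection keyed on "field1":
# shard0 holds values below 1000, shard1 holds 1000..1999, shard2 the rest.
UPPER_BOUNDS = [1000, 2000]
SHARD_NAMES = ["shard0", "shard1", "shard2"]

def target_shards(query_filter, shard_key="field1"):
    """Return the shards a query must visit (equality filters only)."""
    if shard_key in query_filter:
        # Targeted query: the shard key value identifies exactly one shard.
        index = bisect.bisect_right(UPPER_BOUNDS, query_filter[shard_key])
        return [SHARD_NAMES[index]]
    # Scatter-gather: without the shard key, every shard has to be asked.
    return list(SHARD_NAMES)

print(target_shards({"field1": 1500}))  # filter contains the shard key
print(target_shards({"field2": 7}))     # filter does not contain it
```

With Field(1) as the shard key, the common queries fall into the first branch and touch a single shard; with Field(2) as the shard key, the same queries would fall into the second branch.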
Which one is the better field to be used as Sharding Key?
Since Field(1) is usually part of the query filter, it is the better choice for the shard key.
See the MongoDB documentation on shard keys and on choosing a shard key.
When I run an aggregation on a sharded MongoDB environment, if $match matches documents on multiple shards, does the next stage in the pipeline, $group, then run on multiple shards?
Or does all the matching data go to a single shard?
If it goes to a single shard, which shard will it go to?
Whether or not the collection is sharded makes no difference to how you write the query: if the data is spread over different shards, the aggregation is performed on all of those shards automatically.
Generally, if a query spreads across multiple shards, it is considered less optimized. It takes more time than reading from a single shard.
Does this hold true for writes as well? If I write some data and it is distributed among multiple shards, is that considered less optimized?
If yes, what is the best way to write a batch whose documents should go to different shards?
It depends on the operations, see https://docs.mongodb.com/manual/core/sharded-cluster-query-router/#sharding-mongos-targeted.
All insertOne() operations target one shard. Each document in the insertMany() array targets a single shard, but there is no guarantee that all documents in the array are inserted into the same shard.
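To make the insertMany() behaviour concrete, here is a small sketch of how a batch ends up divided by target shard. The field `userId`, the shard count, and the modulo rule in `shard_for` are hypothetical stand-ins for the real chunk lookup; the only point is that each document maps to exactly one shard while the batch as a whole may map to several.

```python
from collections import defaultdict

NUM_SHARDS = 3  # hypothetical cluster size

def shard_for(doc):
    """Stand-in for the router's chunk lookup: one shard per document."""
    return "shard%d" % (doc["userId"] % NUM_SHARDS)

def split_batch(docs):
    """Group a batch of documents by the shard each one targets."""
    batches = defaultdict(list)
    for doc in docs:
        batches[shard_for(doc)].append(doc)
    return dict(batches)

batch = [{"userId": n} for n in range(1, 6)]
per_shard = split_batch(batch)

for shard_name in sorted(per_shard):
    print(shard_name, [doc["userId"] for doc in per_shard[shard_name]])
```

A batch whose documents are spread over several shards is not a broadcast: every document still goes to one shard only, so the write work is divided rather than repeated.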
All updateOne(), replaceOne() and deleteOne() operations must include the shard key or _id in the query document. MongoDB returns an error if these methods are used without the shard key or _id.
Depending on the distribution of data in the cluster and the selectivity of the query, mongos may still perform a broadcast operation to fulfill these queries.
The title says everything. Assume that you have a sharded MongoDB environment and the user provides a query that doesn't contain the shard key. What is the actual performance of the query? What happens in the background?
The performance depends on a number of factors; however, the default action of MongoDB in this case is a global scatter-and-gather operation: it sends the query to all shards and then merges the results, removing duplicates, to give you the end result.
Returning to performance: it normally depends on the indexes on each shard, how well each shard's data set is optimised in isolation, and how much of the data range each shard holds.
However, processing is parallel in a sharded cluster: all shards receive the query, and the mongos router (the "master") just merges the results as they come in. So the flow shouldn't be: go to shard 1, get the results, then go to shard 2. Instead it should be: go to all shards, each shard returns its results, and the master merges and returns them.
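The parallel flow can be imitated with a thread pool. This is a sketch only: the shard contents, the `age` filter, and the functions `query_shard` and `scatter_gather` are hypothetical, and in-memory lists stand in for the shards.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard contents.
SHARD_DATA = {
    "shard0": [{"name": "ann", "age": 31}, {"name": "bob", "age": 17}],
    "shard1": [{"name": "cid", "age": 45}],
    "shard2": [{"name": "dee", "age": 12}, {"name": "eve", "age": 28}],
}

def query_shard(shard_name, min_age):
    """Stand-in for one shard running the query against its own data."""
    return [doc for doc in SHARD_DATA[shard_name] if doc["age"] >= min_age]

def scatter_gather(min_age):
    """Send the query to all shards at once, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(SHARD_DATA)) as pool:
        futures = [pool.submit(query_shard, name, min_age)
                   for name in SHARD_DATA]
        merged = []
        for future in futures:
            merged.extend(future.result())
    # Sort so the merged output does not depend on which shard answered first.
    return sorted(merged, key=lambda doc: doc["name"])

results = scatter_gather(18)
print([doc["name"] for doc in results])
```

Because the shards work concurrently, the total time is closer to that of the slowest shard than to the sum over all shards, which is why each shard's own indexes matter so much.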
Here is a good presentation (with nice pictures) on exactly how queries with sharding work in certain situations: http://www.slideshare.net/mongodb/how-queries-work-with-sharding
If the query is made on a sharded collection, it is sent to all shards; if it is made on a non-sharded collection, MongoDB reads all the data from a single shard.
Here is the link to the sharding FAQ in the MongoDB documentation:
http://docs.mongodb.org/manual/faq/sharding/