MongoDB sharded cluster: $in vs $or

I have a MongoDB sharded cluster with the shard key "my_key".
I need to find a batch of documents (about 10-500 items) in a collection, each with a different my_key value.
For example:
db.test.find({my_key: {$in:[1,3,5,67,45,56...]}})
Mongos knows where the chunks for 'my_key' are stored.
Can mongos split my query into smaller queries that go only to the exact shards where the documents are stored? Or will mongos send this query to ALL shards?
The same question applies to $or:
db.test.find({$or:[{my_key: 1},{my_key: 3},{my_key: 5}...]})

I have run tests.
If $in contains values from only one shard, mongos sends a SINGLE_SHARD query.
If $in contains values from several shards, mongos sends a SHARD_MERGE query only to the shards that contain the needed data (not to the whole cluster).
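The behaviour observed in these tests can be pictured with a small model. This is only a sketch of range-based routing, not the real mongos implementation. The chunk boundaries and the shard names (shardA to shardD) are made up for illustration. The $or form is handled the same way here, but whether a real cluster treats it identically is an assumption that the tests above did not cover.

```javascript
// Simplified model of range-based routing; NOT the real mongos code.
// Hypothetical chunk table: each chunk covers [min, max) of my_key on one shard.
const chunks = [
  { min: -Infinity, max: 20, shard: "shardA" },
  { min: 20, max: 50, shard: "shardB" },
  { min: 50, max: 100, shard: "shardC" },
  { min: 100, max: Infinity, shard: "shardD" },
];

// Find the shard whose chunk range contains the given shard key value.
function shardFor(value) {
  const chunk = chunks.find((c) => value >= c.min && value < c.max);
  return chunk.shard;
}

// Collect the shard key values from either {my_key: {$in: [...]}}
// or {$or: [{my_key: v1}, {my_key: v2}, ...]}.
function keyValues(filter) {
  if (filter.$or) return filter.$or.map((clause) => clause.my_key);
  return filter.my_key.$in;
}

// Return the sorted, de-duplicated list of shards the query has to visit.
function targetShards(filter) {
  const shards = new Set(keyValues(filter).map(shardFor));
  return [...shards].sort();
}

// All values fall into one chunk: a single shard is targeted.
console.log(targetShards({ my_key: { $in: [1, 3, 5] } }));
// Values span several chunks: only the owning shards are targeted (shardD is skipped).
console.log(targetShards({ my_key: { $in: [1, 3, 5, 67, 45, 56] } }));
// The same lookup for the $or form of the query.
console.log(targetShards({ $or: [{ my_key: 1 }, { my_key: 3 }, { my_key: 5 }] }));
```

On a real cluster, running explain() on the query is the way to see which shards were actually contacted.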

Related

Need help to select sharding key in MongoDB

For my application I need to shard a fairly big collection; the entire collection will contain approx. 500 billion documents.
I have two potential fields which can be used as Sharding Key:
For inserting, either sharding key will distribute documents evenly throughout the cluster, so it does not matter which field I use as the sharding key.
For query it is different.
Field(1) is usually part of the query filter condition, thus query would be processed usually on a single shard only.
Field(2) is typically not part of the query filter condition, thus query would be processed over all shards and typically several shards will contribute to final query result.
Which one is the better field to be used as Sharding Key? I did not find anything in MongoDB documentation about that topic.
Both fields have the same range and very similar cardinality figures, so there won't be any difference there. Usually the number of documents returned by a query is very low (typically fewer than 20-30 documents).
In a sharded cluster the mongos router determines which shard is to be targeted for a read or write operation - based on the available shard key meta-data stored on the config servers.
"For inserting, either sharding key will distribute documents evenly throughout the cluster, so it does not matter which field I use as the sharding key."
When you insert a document it will have a shard key and the document will be stored on a designated shard.
"Field(1) is usually part of the query filter condition, thus query would be processed usually on a single shard only."
The shard key's main purposes are (a) to distribute data evenly across shards in a cluster, and (b) to be able to query the data in such a way that the query targets a single shard.
For a query to target a single shard, the shard key must be part of the query's filter criteria. The mongos router will target the single shard using the shard key.
If the shard key is not part of the filter criteria, it will be a scatter-gather operation (a long-running query). It is important that the most important query operations of the application using the sharded collection are able to use the shard key.
"Field(2) is typically not part of the query filter condition, thus query would be processed over all shards and typically several shards will contribute to final query result."
When the shard key is not part of the query filter, the operation will span across multiple shards (a scatter-gather operation) and it will be a slow running operation. The mongos router will not be able to determine which shards have the target data, and all the shards in the cluster will be queried to return the final result.
"Which one is the better field to be used as Sharding Key?"
It can be concluded that Field(1) should be used as the shard key.
See the MongoDB documentation on shard keys and on choosing a shard key.

Does the MongoDB aggregation $group command run on all shards

When I run an aggregation in a sharded MongoDB environment and $match matches documents on multiple shards, does the next command in the pipeline, $group, then run on multiple shards?
Or does all the matching data go to a single shard?
If it goes to a single shard, which shard will it go to?
It doesn't matter whether the collection is sharded or not: if the data is spread across different shards, the aggregation is automatically performed on all of those shards.

How does MongoDB distribute data across a cluster

I've read about sharding a collection in MongoDB. MongoDB lets me shard a collection explicitly by calling the shardCollection method. There I can choose whether I want it to be range-sharded or hash-sharded.
My question is, what would happen if I didn't call the shardCollection method, and I had, say, 100 nodes?
Would MongoDB keep the collections intact and distribute them across the cluster?
Would MongoDB keep all the collections in a single node?
Do I completely not understand how this works?
A database can have a mixture of sharded and unsharded collections. Sharded collections are partitioned and distributed across shards in the cluster. As of MongoDB 3.4, each database has a primary shard where the unsharded collections are stored. If your deployment has a number of databases this may result in some distribution of unsharded collections, but there is no balancing activity for unsharded data. For more information on expected behaviours, see the Sharding section in the MongoDB manual.
If you are interested in distribution of unsharded collections within a sharded database, there is a relevant feature request you can watch/upvote in the MongoDB issue tracker: SERVER-939: Ability to distribute collections in a single DB.

Aggregation Pipeline in mongodb in sharded collection

I am referring to the MongoDB aggregation documentation at https://docs.mongodb.com/v3.2/aggregation/#aggregation-pipeline, which mentions that
"The aggregation pipeline can operate on a sharded collection."
Please let me know whether, if a database is sharded, all the collections in the database will be sharded. Also, please confirm that if a collection is sharded, the aggregate query will run on many servers and deliver the results fast. If so, how does the aggregation query work?
Regards
Kris
The aggregation pipeline supports operations on sharded collections.
If the pipeline starts with an exact $match on a shard key, the entire pipeline runs on the matching shard only. Previously (prior to version 3.2), the pipeline would have been split into two parts, and the work of merging it would have to be done on the primary shard.
In the case of aggregation operations that must run on multiple shards, if the operations do not require running on the database’s primary shard, these operations will route the results to a random shard to merge the results to avoid overloading the primary shard for that database. The $out stage and the $lookup stage require running on the database’s primary shard.
When splitting the aggregation pipeline into two parts, the pipeline is split to ensure that the shards perform as many stages as possible with consideration for optimization.
Reference:
https://docs.mongodb.com/manual/core/aggregation-pipeline-sharded-collections/
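The split described above, where the shards run as many stages as possible and one shard merges the results, can be pictured with a small model of a $match followed by a $group that sums a field. This is a simplified sketch, not MongoDB's actual code. The shard names, the sample documents, and the fields customer, status, and amount are made up for illustration.

```javascript
// Simplified model of a split $match + $group pipeline; not MongoDB's code.
// Hypothetical data: documents as they might be spread over two shards.
const shardData = {
  shardA: [
    { customer: "ann", status: "paid", amount: 10 },
    { customer: "bob", status: "paid", amount: 5 },
    { customer: "ann", status: "open", amount: 99 },
  ],
  shardB: [
    { customer: "ann", status: "paid", amount: 7 },
    { customer: "bob", status: "paid", amount: 3 },
  ],
};

// Shard part: filter the local documents ($match on status) and build
// partial sums per group key ($group by customer).
function shardPart(docs) {
  const partial = {};
  for (const doc of docs) {
    if (doc.status !== "paid") continue;
    partial[doc.customer] = (partial[doc.customer] || 0) + doc.amount;
  }
  return partial;
}

// Merge part: runs in one place and combines the partial sums of all shards.
function mergePart(partials) {
  const totals = {};
  for (const partial of partials) {
    for (const [key, sum] of Object.entries(partial)) {
      totals[key] = (totals[key] || 0) + sum;
    }
  }
  return totals;
}

const partials = Object.values(shardData).map(shardPart);
const totals = mergePart(partials);
console.log(totals);
```

In this model only the small per-shard partial results travel to the merging step, which is the point of letting the shards perform as many stages as possible.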
In a sharded cluster, every database has a primary shard, alongside one or more other shards.
"Please let me know whether, if a database is sharded, all the collections in the database will be sharded."
No, by default none of your collections will be sharded. All such collections stay wholly on the primary shard. To shard a collection, use the shardCollection command.
"Also, please confirm that if a collection is sharded, the aggregate query will run on many servers and deliver the results fast."
One important thing when defining a collection in a sharded environment is the shard key. Make sure you choose a good shard key, since it determines how data is distributed across the shards. With a good shard key, you can expect better performance than in a non-sharded environment.
"If so, how does the aggregation query work?"
The $match is used to route the aggregation query to the different shards, depending on where the documents reside, and the results are finally merged together on one shard. A good read is https://docs.mongodb.com/v3.2/core/aggregation-pipeline-sharded-collections/

MongoDB sharding is possible on collections?

Is it possible to shard only collections? If yes, then how?
What is the difference between sharding a database and sharding collections?
MongoDB shards collections. You enable sharding on a database, but just enabling sharding on the database will not distribute data across shards. To distribute data across shards you need to tell MongoDB which collection to distribute. So you have to shard your collection, and then only that collection will be spread across the shards.
Remember, MongoDB distributes data on the basis of the collections that are sharded. If you have 2 collections in your database and you shard one of them, the data of the sharded collection will be spread out across the shards, but the other collection will have all its data on one shard.
In plain language, MongoDB doesn't shard a whole database automatically. MongoDB sharding works at the collection level.
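The two steps described above can be sketched as the two admin command documents involved. The database name "mydb", the collection name "test", and the shard key field "my_key" are placeholders chosen for illustration. The snippet only builds and prints the documents, so it runs without a cluster.

```javascript
// Hypothetical names: database "mydb", collection "test", shard key "my_key".
const namespace = { db: "mydb", coll: "test" };

// Step 1: enable sharding for the database. On its own this does not
// distribute any data across the shards.
const enableShardingCmd = { enableSharding: namespace.db };

// Step 2: shard one specific collection on a chosen key. Only this
// collection is then spread across the shards.
const shardCollectionCmd = {
  shardCollection: `${namespace.db}.${namespace.coll}`,
  key: { my_key: 1 },
};

// In a mongo shell connected to a mongos, each document would be passed to
// db.adminCommand(...); here they are only printed.
console.log(JSON.stringify([enableShardingCmd, shardCollectionCmd]));
```

Any other collection in "mydb" that is not named in a shardCollection command stays unsharded on a single shard.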