sharded collection's indexes need to start with the shard key? - mongodb

I read through the sharding docs on the mongo official site.
However, I can't find an answer to these:
Do all of a sharded collection's indexes need to start with the shard key?
If I require a TTL index on a field of a sharded collection, and compound indexes are not supported for TTL, what can I do in this case? (field != shard key)

No. You can have any index on a sharded collection, with one caveat: a unique index must contain the full shard key as a prefix (the default _id index is the exception). However, queries which do not include the shard key will be sent to all shards. Each shard will then make use of any existing index and send its result back to the mongos query router, which in turn will merge and, if required, sort the results and send the result set back to the client. Please read Routing Process in the MongoDB docs for further details.
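To make the routing step above concrete, here is a minimal sketch in plain Python (no MongoDB involved). The shard names, the documents and the helpers query_shard and scatter_gather are all made up for illustration; it only mimics the idea that each shard answers locally and the router merges the sorted partial results.

```python
import heapq

# Hypothetical in-memory "shards"; each one holds its own slice of the collection.
SHARDS = {
    "shardA": [{"user_id": 1, "score": 30}, {"user_id": 4, "score": 10}],
    "shardB": [{"user_id": 2, "score": 20}, {"user_id": 7, "score": 40}],
}

def query_shard(docs, predicate, sort_field):
    """What one shard does: filter locally and return its own sorted result."""
    return sorted((d for d in docs if predicate(d)), key=lambda d: d[sort_field])

def scatter_gather(predicate, sort_field):
    """What the router does without a shard key: ask every shard, merge the streams."""
    partials = [query_shard(docs, predicate, sort_field) for docs in SHARDS.values()]
    return list(heapq.merge(*partials, key=lambda d: d[sort_field]))

result = scatter_gather(lambda d: d["score"] >= 20, "score")
print([d["score"] for d in result])  # [20, 30, 40]
```

The filter here never mentions user_id, so no shard can be skipped; an index on score would only speed up the local step inside each shard.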
The TTL removal is a background process which runs on a date field. Each of your shards will spawn said background process. So you can simply create the TTL index on the date field of your choice. Each individual shard will take care of the documents which are to be deleted.
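As a rough illustration of the TTL behaviour just described, the following plain-Python sketch runs one expiry sweep per shard. The field name createdAt, the 3600 second lifetime and the helper ttl_pass are assumptions for the example, not anything MongoDB exposes.

```python
from datetime import datetime, timedelta, timezone

def ttl_pass(shard_docs, field, expire_after_seconds, now):
    """One TTL sweep on a single shard: keep only documents that have not expired."""
    cutoff = now - timedelta(seconds=expire_after_seconds)
    return [d for d in shard_docs if d[field] > cutoff]

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
shards = {
    "shardA": [{"_id": 1, "createdAt": now - timedelta(hours=2)},
               {"_id": 2, "createdAt": now - timedelta(minutes=5)}],
    "shardB": [{"_id": 3, "createdAt": now - timedelta(hours=3)}],
}

# Each shard runs its own sweep; neither the router nor the shard key is involved.
after = {name: ttl_pass(docs, "createdAt", 3600, now) for name, docs in shards.items()}
print({name: [d["_id"] for d in docs] for name, docs in after.items()})
# {'shardA': [2], 'shardB': []}
```

On a real cluster the index itself would be created once through mongos; with PyMongo I believe that looks like collection.create_index("createdAt", expireAfterSeconds=3600), but treat the field name and lifetime as placeholders.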

Related

Need help to select sharding key in MongoDB

For my application I need to shard a fairly big collection; the entire collection will contain approx. 500 billion documents.
I have two potential fields which can be used as Sharding Key:
For inserting, either Sharding Key will distribute documents evenly throughout the cluster, so it does not matter which field I use as Sharding Key.
For query it is different.
Field(1) is usually part of the query filter condition, thus query would be processed usually on a single shard only.
Field(2) is typically not part of the query filter condition, thus query would be processed over all shards and typically several shards will contribute to final query result.
Which one is the better field to be used as Sharding Key? I did not find anything in MongoDB documentation about that topic.
Both fields have the same range and very similar cardinality figures, so there won't be any difference there. Usually the number of documents returned by a query is very low (typically less than 20-30 documents).
In a sharded cluster the mongos router determines which shard is to be targeted for a read or write operation - based on the available shard key meta-data stored on the config servers.
For inserting, either Sharding Key will distribute documents evenly
throughout the cluster, so it does not matter which field I use as
Sharding Key.
When you insert a document it will have a shard key and the document will be stored on a designated shard.
Field(1) is usually part of the query filter condition, thus query
would be processed usually on a single shard only.
The shard key's main purposes are (a) to distribute data evenly across shards in a cluster, and (b) to be able to query the data in such a way that the query targets a single shard.
For a query to target a single shard, the shard key must be part of the query's filter criteria. The mongos router will target the single shard using the shard key.
If the shard key is not part of the filter criteria, it will be a scatter-gather operation, which is typically slower because every shard has to be queried. It is important that the most important query operations of the application using the sharded collection are able to use the shard key.
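A small sketch of that targeting decision, in plain Python. The chunk boundaries, the shard names and the route helper are invented for the example (real chunk metadata lives on the config servers), and shard key values are assumed to be non-negative integers.

```python
import bisect

# Hypothetical chunk map: chunk i starts at BOUNDS[i] and ends where chunk i+1 starts.
BOUNDS = [0, 1000, 2000, 3000]                      # assumed lower bounds
OWNERS = ["shardA", "shardB", "shardC", "shardA"]   # owner of each chunk
ALL_SHARDS = sorted(set(OWNERS))

def route(query_filter, shard_key="field1"):
    """Return the shards a router would have to contact for this filter."""
    if shard_key not in query_filter:
        return ALL_SHARDS            # scatter-gather: nothing narrows it down
    value = query_filter[shard_key]
    chunk = bisect.bisect_right(BOUNDS, value) - 1   # chunk whose range contains value
    return [OWNERS[chunk]]

print(route({"field1": 1500}))   # ['shardB']
print(route({"field2": "x"}))    # ['shardA', 'shardB', 'shardC']
```

With Field(1) as the shard key, the common query looks like the first call; with Field(2) as the shard key, the same common query would look like the second.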
Field(2) is typically not part of the query filter condition, thus
query would be processed over all shards and typically several shards
will contribute to final query result.
When the shard key is not part of the query filter, the operation will span multiple shards (a scatter-gather operation) and will usually be slower. The mongos router will not be able to determine which shards have the target data, so all the shards in the cluster will be queried to return the final result.
Which one is the better field to be used as Sharding Key?
It can be concluded that Field(1) should be used as the shard key.
See the MongoDB documentation on shard keys and choosing a shard key.

MongoDB Shard key Selection

I have a scenario in which I don't know what the structure and fields of the collections in MongoDB will be. Also, there will be multiple DBs, one per user (like a multi-tenant DB).
I'll be deploying a replicated sharded cluster in production. For scaling and better machine optimization, I'm applying sharding on a per-DB basis during the creation of each DB, and each collection under the same DB will be sharded to different shards. In this scenario I'm not sure which key would be the best choice, since the structure and fields of the collections created under each DB will be unknown. Since the structure of the DBs and collections is unknown, I can't forecast which type of query will be used most of the time. So I want to select a shard key which would fulfill all the criteria for shard key selection, like: Cardinality, Query Isolation, Monotonically increasing, Write scaling, Easily divisible.
What would be the solution in this scenario?
Also, what if I select all the fields under that collection for the shard key, along with a hashed _id field, as a compound key?
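Once you create a shard key you cannot simply edit it (newer releases add refineCollectionShardKey in 4.4 and reshardCollection in 5.0, but neither is a casual edit).
So keep pumping the data into the collection; once you get clarity on the fields, you can shard the collection at any time.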
Rebalancing happens automatically after sharding.

Query against local MongoDB shard data only

I have a sharded collection, with a shard key "user id". I would like to perform a query where, instead of passing the shard key, I simply restrict the query to only the data on the local mongos shard.
Is this possible / advisable?
Furthermore, can it be used with findAndModify? This would allow me to perform atomic updates on local documents, without specifying a shard key in the query.
Edit
As stated in some answers and comments below, my understanding of mongos vs. mongod was a little skewed. I now appreciate that mongos doesn't hold the local data.
Does mongos have any "local" data?
No. Each mongos daemon routes queries to your shards and does not store any data itself, so there is no such concept as "local" documents stored by a mongos. The mongos interface provides a logical view of the entire sharded cluster and does not have affinity to a specific shard.
Based on the type of query/command you send to mongos, the query will be:
Directed: sent to a specific shard if the query uses the shard key
Targeted: sent to applicable shards if the query includes multiple shard key values (or uses a prefix subset of a compound shard key)
Scatter/gather: sent to all shards, if the query is not using the shard key
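The three cases above can be sketched as a toy classifier in plain Python. The function classify and the compound shard key ["user_id", "region"] are made up for illustration, and the rules are a simplification of what mongos really does.

```python
def classify(query_filter, shard_key_fields):
    """Toy classification of a filter against a (possibly compound) shard key."""
    prefix = []
    for field in shard_key_fields:       # walk the shard key left to right
        if field not in query_filter:
            break
        prefix.append(field)
    if not prefix:
        return "scatter/gather"          # leading shard key field is missing
    # A dict value such as {"$in": [...]} stands for several shard key values.
    multi = any(isinstance(query_filter[f], dict) for f in prefix)
    if len(prefix) == len(shard_key_fields) and not multi:
        return "directed"
    return "targeted"

key = ["user_id", "region"]
print(classify({"user_id": 7, "region": "eu"}, key))                # directed
print(classify({"user_id": {"$in": [7, 8]}, "region": "eu"}, key))  # targeted
print(classify({"user_id": 7}, key))                                # targeted
print(classify({"status": "open"}, key))                            # scatter/gather
```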
Should I read from shards directly?
No. It's technically possible to read data from the shards directly, but it is definitely not recommended, as you can get an inconsistent view of the data. For example, if there is a migration in progress, the data will temporarily exist on both the donor shard and the target shard. Similarly, copies of documents may be orphaned as the result of failed migrations.
A query through mongos correctly directs queries to the appropriate shard(s) and filters results based on the sharded cluster metadata.
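Here is a toy illustration of why direct reads can look inconsistent, in plain Python. The shard contents and the OWNER table are invented; the only point is that the same document can physically sit on two shards while the metadata says only one of them owns it.

```python
# Hypothetical state in the middle of a chunk migration: the document with
# user_id 42 has been copied to shardB but not yet deleted from shardA.
SHARD_DATA = {
    "shardA": [{"user_id": 7}, {"user_id": 42}],
    "shardB": [{"user_id": 42}, {"user_id": 99}],
}
# Assumed cluster metadata: which shard currently owns each shard key value.
OWNER = {7: "shardA", 42: "shardA", 99: "shardB"}

def read_shards_directly():
    """Bypass the router: concatenate whatever each shard physically holds."""
    return [d for docs in SHARD_DATA.values() for d in docs]

def read_with_ownership_filter():
    """Keep a document only if it comes from the shard that owns its key."""
    return [d for shard, docs in SHARD_DATA.items()
            for d in docs if OWNER[d["user_id"]] == shard]

direct_ids = sorted(d["user_id"] for d in read_shards_directly())
routed_ids = sorted(d["user_id"] for d in read_with_ownership_filter())
print(direct_ids)  # [7, 42, 42, 99]
print(routed_ids)  # [7, 42, 99]
```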
Can I use findAndModify() on a sharded collection without a query based on a shard key?
No. For a sharded collection, findAndModify() requires a query based on the shard key. The shard key provides a guarantee that the requested document only exists on one shard.
Can I update sharded collections without going through mongos?
No. All updates to a sharded collection must go through mongos.
Please keep in mind that doing so is not advised, as traffic to a sharded cluster should go through a mongos service.
That being said, it's possible to query the shard itself if you're performing the query locally on the shard instance.
I've never tried to do that programmatically, but it may be worth a shot.
You can log in directly to the machine running the shard and open a mongo shell there. If you've never created a local user/password on it, I believe you can connect without credentials; otherwise, the mongod process on that specific shard must have its own user/pass, as those created via the mongos are not recognised by the mongod shards.
As each shard knows only its own data files, if you run, for example, a count() operation on one of your collections, you'll see that the result is only a portion of the actual collection size.
Your question is a little vague since your terminology is mixed:
I simply restrict the query to only the data on the local mongos shard.
The shard will in fact be a mongod process, not a mongos process. However, your wording can make sense if you have a mongos per shard, in which case you would want to direct the query to the mongos on that shard so that it queries its local mongod data.
If you are considering circumventing the mongos, then @Stennie's comment answers your question. However, if you mean something else, then I do not believe the mongos currently has a command switch that lets you direct queries without a shard key.

MongoDB query on all sharded collections without shardkey

I have several sharded collections.
The collection holds user requests, and the shard key is User Id.
I have a field named "Execution Time"
and I want to query all the requests in a period of time (lte and gte).
The index includes the shard key, but my query does not.
I would rather not put all the shard keys in the query with an "in" operator, because I have 1000 shard keys (users).
Furthermore, to do that I would need to get all user IDs on every query, which means 2 queries each time instead of 1.
But I still want to use an index.
Would one option be to add userId > 0 and userId < maxUserId to the query?
What is the right approach?
Thanks in advance
For ideal performance, shard keys should be chosen in such a way that the router (mongos) can tell which shard will have the data for the most common queries. This is only possible when the find query has a criterion on the shard key.
But in this case it is impossible for the router to tell which shard has the data for the query. It is not unlikely that there are relevant results on every shard. In that case the query needs to be forwarded to all shards, which will process it simultaneously. An appropriate index on each shard will still help them do so.
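To show what "an appropriate index will help" means here, this is a hedged plain-Python sketch: every shard keeps a sorted list standing in for an index on the Execution Time field and answers the range query from it. The Shard class, the field names executionTime and userId, and the sample values are all assumptions for the example.

```python
import bisect

class Shard:
    """Toy shard with a secondary 'index': (executionTime, userId) kept sorted."""
    def __init__(self, docs):
        self.index = sorted((d["executionTime"], d["userId"]) for d in docs)

    def range_query(self, gte, lte):
        # Index range scan: jump to the first entry >= gte, stop after the last <= lte.
        lo = bisect.bisect_left(self.index, (gte,))
        hi = bisect.bisect_right(self.index, (lte, float("inf")))
        return self.index[lo:hi]

shards = [
    Shard([{"userId": 1, "executionTime": 10}, {"userId": 1, "executionTime": 50}]),
    Shard([{"userId": 2, "executionTime": 20}, {"userId": 2, "executionTime": 90}]),
]

def find_by_time(gte, lte):
    """No userId in the filter, so every shard is asked; each answers from its index."""
    hits = []
    for shard in shards:
        hits.extend(shard.range_query(gte, lte))
    return sorted(hits)

print(find_by_time(15, 60))  # [(20, 2), (50, 1)]
```

Every shard is still contacted, but none of them has to scan its whole data set, which is why adding a dummy userId range to the filter should not be necessary.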

MongoDB and dynamic shard keys

I have been thinking about sharding with MongoDB and came across a use case which I haven't been able to figure out ... so here it is:
If I have documents that look like this one...
_id [Integer]
username [String]
password [String] <-- SHA1 hash
firstname [String]
lastname [String]
...and I now choose the password field as my shard key, it would be a good fit for sharding since it has a very high cardinality and would scale nicely. But the question remains, what happens if a user changes his password? Will the corresponding document be automatically migrated to a different chunk?
Does someone know how MongoDB handles cases like this one?
Thanks
No, shard keys are immutable.
Consider the mongo documentation, Can I change the shard key after sharding a collection?:
Can I change the shard key after sharding a collection?
No.
There is no automatic support in MongoDB for changing a shard key
after sharding a collection. This reality underscores the importance
of choosing a good shard key. If you must change a shard key
after sharding a collection, the best option is to:
dump all data from MongoDB into an external format.
drop the original sharded collection.
configure sharding using a more ideal shard key.
pre-split the shard key range to ensure initial even distribution.
restore the dumped data into MongoDB.
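For the pre-split step in the list above, a minimal sketch of how evenly spaced split points could be computed, assuming an integer shard key with a known range. The helper split_points and the range 0 to 1000 are made up; on a real cluster each boundary would then be passed to a split command such as the shell's sh.splitAt(), which this sketch does not call.

```python
def split_points(lo, hi, chunks):
    """Evenly spaced boundaries that cut [lo, hi) into `chunks` ranges."""
    step = (hi - lo) / chunks
    return [lo + round(step * i) for i in range(1, chunks)]

print(split_points(0, 1000, 4))  # [250, 500, 750]
```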
My understanding of your question is that you asked:
what happens if a user changes his password?
Not:
what happens if I change the shard key?
Completely different questions. For the second case the accepted answer is correct.
For your original question:
In sharded clusters MongoDB has a component called the balancer. The balancer will balance your shards and migrate your chunks so that they are balanced in size, if possible.
Please read: Sharded Cluster Balancer.
So, yes, if a user changes their password the corresponding document will be automatically migrated to a different chunk, but only if the balancer thinks it is needed. The balancer takes care of this.
An important note: starting with MongoDB 4.2, the following statement no longer applies:
"Once inserted, a document's shard key value cannot be modified".
So, the answer to the question "Can the shard key be changed?":
Although you cannot select a different shard key for a sharded collection, starting in MongoDB 4.2 you can update a document's shard key value, unless the shard key field is the immutable _id field.
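As a rough mental model of what such an update implies, here is a plain-Python sketch in which changing the shard key value removes the document from its old owner and places it according to the new value. The md5-based placement, the shard names and the helpers shard_for, insert and update_shard_key are assumptions for illustration and do not reflect MongoDB's real hashing or internals.

```python
import hashlib

SHARD_NAMES = ["shardA", "shardB", "shardC"]

def shard_for(key_value):
    """Toy hashed placement: a stable hash of the shard key value picks the shard."""
    digest = hashlib.md5(str(key_value).encode()).hexdigest()
    return SHARD_NAMES[int(digest, 16) % len(SHARD_NAMES)]

cluster = {name: {} for name in SHARD_NAMES}   # shard name -> {_id: document}

def insert(doc):
    cluster[shard_for(doc["password"])][doc["_id"]] = doc

def update_shard_key(doc_id, old_value, new_value):
    """Remove the document from its old owner and re-insert it under the new value."""
    doc = cluster[shard_for(old_value)].pop(doc_id)
    doc["password"] = new_value
    insert(doc)

insert({"_id": 1, "username": "alice", "password": "hash-old"})
update_shard_key(1, "hash-old", "hash-new")

# The document lives on exactly one shard: the one chosen by the new key value.
holders = [name for name, docs in cluster.items() if 1 in docs]
print(holders)
```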