I have been thinking about sharding with MongoDB and came across a use case which I haven't been able to figure out ... so here it is:
If I have documents that look like this one...
_id [Integer]
username [String]
password [String] <-- SHA1 hash
firstname [String]
lastname [String]
...and I now choose the password field as my shard key, it would be a good fit for sharding since it has a very high cardinality and would scale nicely. But the question remains, what happens if a user changes his password? Will the corresponding document be automatically migrated to a different chunk?
Does someone know how MongoDB handles cases like this one?
Thanks
No, shard keys are immutable.
The MongoDB documentation addresses this under "Can I change the shard key after sharding a collection?":
No.
There is no automatic support in MongoDB for changing a shard key
after sharding a collection. This reality underscores the importance
of choosing a good shard key. If you must change a shard key
after sharding a collection, the best option is to:
dump all data from MongoDB into an external format.
drop the original sharded collection.
configure sharding using a more ideal shard key.
pre-split the shard key range to ensure initial even distribution.
restore the dumped data into MongoDB.
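The pre-split step in the list above can be sketched for the shard key from the question. This is only a sketch, assuming the password field holds a lowercase 40-character SHA-1 hex string; the function name sha1_split_points is made up for this example and is not part of MongoDB.

```python
def sha1_split_points(num_chunks):
    """Return num_chunks - 1 evenly spaced 40-character hex boundaries.

    Assumption: the shard key is a lowercase SHA-1 hex string, so fixed-width
    hex strings sort in the same order as the numbers they encode.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    key_space = 16 ** 40  # number of distinct 40-digit hex values
    return [format(i * key_space // num_chunks, "040x") for i in range(1, num_chunks)]


# Boundaries for four initial chunks.
for point in sha1_split_points(4):
    print(point)
```

Each boundary could then be passed to a split command (for example the shell helper sh.splitAt) before the dumped data is restored, so that the initial load is spread over all chunks instead of landing in one.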
My understanding of your question is that you asked:
what happens if a user changes his password?
Not:
what happens if I change the shard key?
Completely different questions. For the second case the accepted answer is correct.
For your original question:
In sharded clusters, MongoDB has a component called the balancer. The balancer migrates chunks between shards so that they stay balanced in size where possible.
Please read: Sharded Cluster Balancer.
So yes: if a user changes their password, the corresponding document will be automatically migrated to a different chunk, but only if the balancer decides it is needed. The balancer takes care of this.
An important note: starting with MongoDB 4.2, the following statement no longer applies:
"Once inserted, a document's shard key value cannot be modified."
So, to answer the question "Can the shard key be changed?":
Although you cannot select a different shard key for a sharded collection, starting in MongoDB 4.2, you can update a document's shard key value unless the shard key field is the immutable _id field.
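To make the effect of such an update concrete, here is a toy model (not MongoDB's routing code) of range-based chunks on the password field. The boundaries and the helper names chunk_for and sha1_hex are assumptions for illustration. A chunk covers the range from its lower bound (inclusive) to its upper bound (exclusive), so a new hash value may fall into a different chunk than the old one.

```python
import hashlib
from bisect import bisect_right

# Hypothetical chunk boundaries on the "password" field (40-char hex strings).
BOUNDARIES = ["4" + "0" * 39, "8" + "0" * 39, "c" + "0" * 39]


def chunk_for(shard_key_value):
    """Index of the chunk whose [lower, upper) range contains the value."""
    return bisect_right(BOUNDARIES, shard_key_value)


def sha1_hex(text):
    """SHA-1 hex digest of a string, as stored in the question's schema."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


old_hash = sha1_hex("old-password")
new_hash = sha1_hex("new-password")
# The two indexes may or may not differ; it depends only on where the hashes fall.
print("chunk before:", chunk_for(old_hash))
print("chunk after: ", chunk_for(new_hash))
```

In this model the target chunk depends only on the new key value, not on how balanced the shards currently are.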
I have a collection with 170 million+ documents and it is only going
to grow. The collection is not that large, currently
around 70 GB.
The collection has a compound index on two fields: {AgentId:1, PropertyId:1}.
Typically one imports a huge file (millions of documents) belonging to
a particular AgentId, while the PropertyId (non-numeric, nullable) is
mostly a random unique value.
Currently I have two shards with a shard key of {_id: hashed}, but
I am planning to change the shard key to the compound index {AgentId:1,
PropertyId:1} because I think it will improve query performance (most
of the queries filter on AgentId). I am not sure whether one can
have a nullable field in the shard key. If it cannot, the app
will make sure that PropertyId is always set to a random number.
So I am looking to understand:
How will the data be distributed to shards during insertion,
and how are chunk ranges calculated during insertion?
Since the PropertyId is a random value, does the compound key fit the
definition of a monotonically increasing value?
I am a newbie to MongoDB and want to know if I am on the right path.
Thanks
There is no automatic support in MongoDB for changing a shard key after sharding a collection.
This reality underscores the importance of choosing a good shard key. If you must change a shard key after sharding a collection, the best option is to:
dump all data from MongoDB into an external format.
drop the original sharded collection.
configure sharding using a more ideal shard key.
pre-split the shard key range to ensure initial even distribution.
restore the dumped data into MongoDB.
I have a scenario in which I don't know in advance what the structure and fields of the collections in MongoDB will be. There will also be a separate DB per user (like a multi-tenant DB).
I'll be deploying a replicated sharded cluster in production. For scaling and better machine utilization, I'm enabling sharding on a per-DB basis when each DB is created, and each collection under the same DB will be sharded across different shards. In this scenario I'm not sure which key would be the best choice, since the structure and fields of the collections created under each DB are unknown. Because of that, I can't forecast which type of query will be used most of the time. So I want to select a shard key that fulfills all the criteria for shard key selection, such as: cardinality, query isolation, monotonically increasing, write scaling, easily divisible.
What would be the solution in this scenario?
Also, what if I use all the fields in the collection, together with a hashed _id field, as a compound shard key?
Once you create a shard key, you cannot change it.
So keep loading data into the collection; once you have clarity on the fields, you can shard the collection at any time.
Rebalancing happens automatically after sharding.
I read through the sharding docs on the mongo official site.
However, I can't find an answer to these:
Do all of a sharded collection's indexes need to start with the shard key?
If I require a TTL index on a field of a sharded collection, and since compound indexes are not supported for TTL, what can I do in this case? (field != shard key)
No. You can have any index on a sharded collection. However, queries which do not include the shard key will be sent to all shards. Each individual shard will then make use of any existing index and send its result back to the mongos query router, which in turn will sort the results, if required, and send the result set back to the client. Please read Routing Process in the MongoDB docs for further details.
The TTL removal is a background process which runs on a date field. Each of your shards will spawn said background process. So you can simply create the TTL index on the date field of your choice. Each individual shard will take care of the documents which are to be deleted.
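A minimal sketch of creating such a single-field TTL index from Python, assuming the PyMongo driver. The helper name ensure_ttl_index, the field name and the connection details in the comments are hypothetical; create_index with the expireAfterSeconds option is the driver call being illustrated.

```python
def ensure_ttl_index(collection, field, ttl_seconds):
    """Create a single-field TTL index on `field`.

    `collection` is expected to be a pymongo Collection. Documents become
    eligible for removal once the date in `field` is older than `ttl_seconds`.
    The field does not have to be part of the shard key.
    """
    return collection.create_index([(field, 1)], expireAfterSeconds=ttl_seconds)


# Usage (hypothetical host, database and collection names):
# from pymongo import MongoClient
# events = MongoClient("mongodb://localhost:27017")["app"]["events"]
# ensure_ttl_index(events, "createdAt", 3600)
```

When run through mongos against a sharded collection, the index is created on the shards, and each shard's background process removes its own expired documents, as described above.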
I'm new to Mongo as well as sharding.
Our app will be served by MongoDB and we expect billions of records to be stored. However, the db will grow slowly, i.e. it'll probably take years to reach that huge size.
Furthermore, we will mostly use a special encrypted value to look up records. This holds all the info needed to find the given record, i.e. the primary key, the shard key etc.
My question is: Should we enable sharding from the first day and encrypt shard key + PK? Or could we enable sharding later (when needed) and tell mongo to look up certain records (the ones whose encrypted ID holds no shard key) in the default, "unsharded" collection?
What's the best way to do this?
Thanks in advance!
I'm reading up on GridFS and the possibility of sharding it across different machines.
Reading the documentation here, the suggested shard key is chunks.files_id. This key is linked to the _id of the files collection, so it is incremental: every new file I save in GridFS will get a new, larger _id.
In the O'Reilly "Scaling MongoDB" book, the use of an incremental shard key is discouraged in order to avoid hotspots (the last shard will receive all the writes and reads).
What is your suggestion for sharding the GridFS collection?
Has anybody experienced the hotspot problem?
Thank you.
You should shard on files_id to keep file chunks together, but you are correct that that will create a hotspot. If you can, use something other than ObjectId for _ids in the fs.files collection (probably MD5s would be better than ObjectIds).
We'll be adding hashing for sharding, which will solve this, but not until at least 2.0.
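The hotspot can be illustrated with a small simulation. This is a toy model, not GridFS code: a zero-padded counter stands in for an incremental id, MD5 of a made-up file name stands in for an MD5-based _id, and the chunk boundaries are assumptions.

```python
import hashlib
from bisect import bisect_right
from collections import Counter


def chunk_for(boundaries, key):
    """Index of the chunk whose [lower, upper) range contains the key."""
    return bisect_right(boundaries, key)


def incremental_id(n):
    """Stand-in for an ever-increasing id, as a fixed-width hex string."""
    return format(n, "032x")


def md5_id(n):
    """Stand-in for an MD5-based _id, derived from a hypothetical file name."""
    return hashlib.md5(f"file-{n}".encode("utf-8")).hexdigest()


# Assume ids 0..999 are already stored and the chunks were split among them.
inc_boundaries = [incremental_id(n) for n in (250, 500, 750)]
# For MD5 ids, assume the hex key space is split into four equal ranges.
md5_boundaries = ["4" + "0" * 31, "8" + "0" * 31, "c" + "0" * 31]

new_files = range(1000, 2000)
incremental = Counter(chunk_for(inc_boundaries, incremental_id(n)) for n in new_files)
hashed = Counter(chunk_for(md5_boundaries, md5_id(n)) for n in new_files)

print("incremental ids per chunk:", dict(incremental))  # every insert hits the last chunk
print("md5 ids per chunk:        ", dict(hashed))       # spread over all chunks
```

With the incremental ids, every new file is larger than all existing boundaries and lands in the last chunk, while the MD5-style ids scatter over the whole key range.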
You can shard GridFS data because GridFS is just two collections: chunks and files. Sharding GridFS is very useful. As for the shard key, a random or incremental shard key is always a bad choice because the data will not be evenly distributed across shards. With an incremental shard key, all writes go to the last shard, which keeps growing; once the difference between shards reaches 10 or more chunks, the balancer moves data to other shards. Moving data between shards is always a costly task that should be avoided as far as possible.
So when you choose a shard key, you should aim for an even distribution of data.
Also, if you get lucky, maybe the author of 'Scaling MongoDB', Kristina (a great specialist in shard keys), will answer your question.
The documentation says that in the common case you should use the default index files_id: 1, n: 1 as the shard key:
There are different ways that GridFS
can be sharded, depending on the need.
One common way to shard, based on
pre-existing indexes, is:
"files" collection is not sharded. All
file records will live in 1 shard. It
is highly recommended to make that
shard very resilient (at least 3 node
replica set).
"chunks" collection gets
sharded using the existing index
"files_id: 1, n: 1". Some files at the
end of ranges may have their chunks
split across shards, but most files
will be fully contained within the
same shard.
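The statement that only files at the end of a range get split can be seen in a small model. This is a sketch under assumptions: the file ids, the number of chunks per file and the single split point are invented, and key ranges are compared as (files_id, n) tuples.

```python
from bisect import bisect_right

# Hypothetical split point that falls in the middle of file_b's chunks.
SPLIT_POINTS = [("file_b", 2)]


def range_for(key):
    """Index of the key range that owns a (files_id, n) pair."""
    return bisect_right(SPLIT_POINTS, key)


placement = {}
for files_id in ("file_a", "file_b", "file_c"):
    for n in range(4):  # four GridFS chunks per file in this toy example
        placement.setdefault(files_id, set()).add(range_for((files_id, n)))

for files_id, ranges in placement.items():
    print(files_id, sorted(ranges))
# file_a [0]
# file_b [0, 1]
# file_c [1]
```

Because all chunks of one file are adjacent in (files_id, n) order, only the file that straddles a split point ends up in two ranges.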
As of version 1.8.1, MongoDB only supports sharding on the "files_id" field, because md5 is used to verify the upload and that
does not work across shards yet. So you cannot split a single file across shards.
Answer on Google group