I have a collection with more than 170 million documents, and it is only going
to grow. The collection itself is not that large, currently
around 70 GB.
The collection has a compound index on two fields: {AgentId:1, PropertyId:1}.
Typically a user imports a huge file (millions of documents) belonging to
a particular AgentId, while the PropertyId (non-numeric, nullable) is
mostly a random unique value.
Currently I have two shards with a shard key of {_id: hashed}. But
I am planning to change the shard key to the compound index {AgentId:1,
PropertyId:1} because I think it will improve query performance (most
of the queries filter on AgentId). I am not sure whether a shard key
can contain a nullable field. If it cannot, the app
will make sure that PropertyId is always a random number.
So I am looking to get a picture of:
How will the data be distributed to shards during insertion,
and how are the chunk ranges calculated during insertion?
Since the PropertyId is a random value, does the compound key fit the
definition of a monotonically increasing value?
I am a newbie to MongoDB and wanted to know if I am on the right path.
Thanks
There is no automatic support in MongoDB for changing a shard key after sharding a collection.
This reality underscores the importance of choosing a good shard key. If you must change a shard key after sharding a collection, the best option is to:
dump all data from MongoDB into an external format.
drop the original sharded collection.
configure sharding using a more ideal shard key.
pre-split the shard key range to ensure initial even distribution.
restore the dumped data into MongoDB.
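To make the pre-split step and the original distribution question more concrete, here is a minimal Python sketch of range-based routing on a compound key. It is a simulation only, not MongoDB's implementation: the split points, the `chunk_for` helper and the random PropertyId generator are all hypothetical, Python tuple ordering stands in for the ordering of a {AgentId:1, PropertyId:1} key, and null PropertyId values are not modelled.

```python
import bisect
import random
import string
from collections import Counter

# Hypothetical split points on (AgentId, PropertyId). Chunk i holds keys from
# SPLIT_POINTS[i - 1] (inclusive) up to SPLIT_POINTS[i] (exclusive).
SPLIT_POINTS = [(1, "m"), (2, ""), (2, "h"), (2, "p"), (3, "")]


def chunk_for(key, split_points=SPLIT_POINTS):
    """Return the index of the chunk whose range contains the compound key."""
    return bisect.bisect_right(split_points, key)


def random_property_id(rng, length=8):
    """Stand-in for the random, unique PropertyId described in the question."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


rng = random.Random(0)  # fixed seed so the run is repeatable

# One bulk import: every document has AgentId 2 and a random PropertyId.
imported = [(2, random_property_id(rng)) for _ in range(10_000)]
per_chunk = Counter(chunk_for(key) for key in imported)

for chunk, count in sorted(per_chunk.items()):
    print(f"chunk {chunk}: {count} documents")
```

In this toy model the import only touches the chunks whose ranges cover AgentId 2, and the random PropertyId spreads the writes across those chunks rather than always appending to the last one. If AgentId 2 had no split points inside its range, the whole import would land in a single chunk until that chunk is split.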
I have set up my first mongodb sharded cluster and am finally at the stage where I create a db/collection and choose the shard key. I’ve read about how to choose an appropriate shard key and am likely going with a hashed index but I might be having some conceptual misunderstandings.
My documents are super simple and contain a document id (some natural number), a document version id (a natural number), and a string of the raw text itself. If I understand correctly from the documentation, I can choose to shard on the document id but this can lead to jumbo shards since the document id will be incremented and new documents will be added to the same shard. And so I could set the shard key as a hashed value of the document id.
My question is whether or not I can still continue to query by the document id? My brain is making me doubt this and making me think that the indexing of the documents is over the hashed shard key and not over the document id. I am hoping that the hashed shard key is used strictly for sharding and that I can set any key (i.e., document id) to be indexed. Is this correct?
Yes, you can still query by the value of the shard key.
If you are referring to _id, that will be automatically indexed with its natural value; otherwise you could explicitly create an index on the document id that is not hashed, in addition to the shard key index.
As long as you test for equality to a single or explicit list of values, the query should be handled by the minimum number of shards.
However, if you use a ranged test such as $gte, the query will have to be forwarded to every shard to be processed.
Using the hashed document id as the shard key will result in the creation of an index for the hashed value in addition to any other indexes.
There is a pretty good description of hashed sharding in the documentation
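The difference between equality and range queries on a hashed key can be illustrated with a small Python simulation. This is only a sketch under stated assumptions: MD5 truncated to 64 bits is a stand-in hash (not necessarily what MongoDB computes), the hash space is assumed to be split evenly across four shards, and `shard_for`, `shards_for_equality` and `shards_for_range` are hypothetical helper names.

```python
import hashlib

NUM_SHARDS = 4          # assumed cluster size for the illustration
HASH_SPACE = 2 ** 64    # the stand-in hash is truncated to 64 bits


def hashed(value):
    """Stand-in hash: first 8 bytes of MD5 over the value's string form."""
    digest = hashlib.md5(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def shard_for(doc_id):
    """Map a document id to a shard by splitting the hash space evenly."""
    return hashed(doc_id) * NUM_SHARDS // HASH_SPACE


def shards_for_equality(doc_ids):
    """An equality or explicit-list query only needs the shards owning those hashes."""
    return {shard_for(doc_id) for doc_id in doc_ids}


def shards_for_range(low, high):
    """Neighbouring ids have unrelated hashes, so a range cannot be narrowed."""
    return set(range(NUM_SHARDS))


print(shards_for_equality([42]))          # a single shard
print(shards_for_equality([1, 2, 3]))     # at most three shards
print(shards_for_range(100, 200))         # every shard
```

The document id itself is still stored unchanged in each document, which is why querying by it keeps working; only the routing decision uses the hash.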
I have a scenario in which I don't know in advance what the structure and fields of the collections in MongoDB will be. There will also be a separate DB per user (a multi-tenant setup).
I'll be deploying a replicated sharded cluster in production. For scaling and better machine utilization, I'm enabling sharding per DB when each DB is created, and each collection under the same DB will be sharded across different shards. In this scenario I'm not sure which key would be the best choice, since the structure and fields of the collections created under each DB are unknown. Because of that, I can't forecast which type of query will be used most of the time. So I want to select a shard key that fulfills all the criteria for shard key selection: Cardinality, Query Isolation, Monotonically increasing, Write scaling, Easily divisible.
What would be the solution in this scenario?
Also, what if I select all the fields in the collection as a compound shard key, along with a hashed _id field?
Once you create a shard key you cannot edit it.
So keep pumping data into the collection; once you get clarity on the fields, you can shard the collection at any time.
Rebalancing happens automatically after sharding.
I would like to shard my collection by range across MongoDB shards. My question is: if the shard key is a string field, how is a string-based shard key divided into chunks for range-based sharding?
You can divide a string across shards using tag aware sharding. You create the "tags" denoting the ranges of the key to assign to a specific shard. Mongo's balancer will handle the distribution of the data and when you write a query for the key in question Mongo will know to target only that shard.
For more information see the following URL from the vendor. sharding-introduction/
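As a rough picture of how string ranges can map to shards, here is a minimal Python sketch. The boundaries, the shard names and the `shard_for` helper are hypothetical, and Python's code point ordering is only an approximation of how MongoDB compares strings.

```python
import bisect

# Hypothetical range boundaries: each lower bound is inclusive and each
# upper bound exclusive, so "g" itself belongs to the second range.
BOUNDARIES = ["g", "n", "t"]
SHARDS = ["shard0", "shard1", "shard2", "shard3"]


def shard_for(key):
    """Pick the shard whose string range contains the key."""
    return SHARDS[bisect.bisect_right(BOUNDARIES, key)]


for name in ["alice", "g", "mongo", "nina", "zed"]:
    print(name, "->", shard_for(name))
```

Strings are compared character by character, so a key is placed by its first differing character relative to each boundary, much like words in a dictionary.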
I have a mongodb collection which I want to shard. This collection holds messages from users and a document from the collection has the following properties
{
_id : ObjectId,
conversationId: ObjectId,
created: DateTime
}
All queries will be done using the conversationId property and sorted by created.
Sharding by _id obviously won't work because I need to query by conversationId (plus _id is of type ObjectId, which won't scale very well with many inserts).
Sharding by conversationId would be a logical choice in terms of query isolation, but I'm afraid that it won't scale very well with many inserts (even if I use a hashed shard key on conversationId, or if I change the type of the property from ObjectId to some other type which isn't incremental, like a GUID), because some conversations might be much more active than others (i.e., have many more messages added to them).
From what I see in the MongoDB documentation: "The shard key is either an indexed field or an indexed compound field that exists in every document in the collection."
Does this mean that I can create a shard key on a compound index ?
Bottom line is that:
creating a hashed shard key from the _id property would offer good distribution of the data
creating a shard key on conversationId would offer good query isolation
So a combination of these two things would be great, if it could be done.
Any ideas?
Thanks
For your case, neither field looks like a good choice for sharding as-is. For instance, if you shard on conversationId, it will result in hot spotting, i.e. most of your inserts will go to the last shard, as conversationId would monotonically increase over time. The same problem applies to the other two fields as well.
Also, conversationId will not offer a high degree of isolation, as conversationId would monotonically increase over time (newer conversations will get updated much more frequently than very old ones).
In your case, a "hashed shard key" (version 2.4 onwards) over conversationId would be the smart choice, as one would imagine that there can be tons of conversations going on in parallel.
Refer to the following link for details on creating a hashed shard key: [ http://docs.mongodb.org/manual/tutorial/shard-collection-with-a-hashed-shard-key/ ]
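The hot-spotting argument can be sketched with a short Python simulation. Everything here is an assumption made for illustration: integer ids stand in for ever-increasing ObjectId values, MD5 truncated to 64 bits stands in for the hash, and the split points and helper names are hypothetical.

```python
import hashlib
from collections import Counter

NUM_CHUNKS = 4
SPLIT_POINTS = [250, 500, 750]  # hypothetical ranges built from ids 0..999


def range_chunk(key):
    """Ranged routing: count how many split points the key has passed."""
    return sum(key >= split for split in SPLIT_POINTS)


def hashed_chunk(key):
    """Hashed routing: split a 64-bit stand-in hash space evenly."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") * NUM_CHUNKS // 2 ** 64


new_ids = range(1000, 2000)  # new, ever-increasing conversation ids

ranged_counts = Counter(range_chunk(i) for i in new_ids)
hashed_counts = Counter(hashed_chunk(i) for i in new_ids)

print("ranged:", dict(ranged_counts))   # every insert hits the last chunk
print("hashed:", dict(sorted(hashed_counts.items())))
```

Note that hashing spreads different conversations across chunks, but all messages of one conversation still hash to the same value, so a single very active conversation would still write to one chunk.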
I looked through the docs, and couldn't find a clear answer
Say I have a sparse index on [a,b,c]
Will documents with "a" "b" fields but not "c" be inserted to the index?
Is having the shard key indexed obligatory in the latest MongoDB version?
If so, is it possible to shard on [a] using the above compound sparse index?
(say a,b will always exist)
If c is not present and the query plan uses an index on c, then the document will not be found, because it is not present in that index.
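As a rough model of which documents a sparse index contains, the Python sketch below assumes the documented MongoDB behaviour that a sparse index on plain ascending/descending keys includes a document when at least one of the indexed fields exists in it. The `in_sparse_index` helper is hypothetical and dictionaries stand in for documents.

```python
def in_sparse_index(doc, index_fields):
    """Model of sparse index membership: at least one indexed field exists."""
    return any(field in doc for field in index_fields)


doc = {"a": 1, "b": 2}  # has "a" and "b" but no "c"

print(in_sparse_index(doc, ["a", "b", "c"]))     # compound sparse index
print(in_sparse_index(doc, ["c"]))               # single-field sparse index on c
print(in_sparse_index({"c": None}, ["c"]))       # field exists with a null value
```

Under that assumption, a document with "a" and "b" but no "c" would be present in a compound sparse index on [a,b,c], while it would be absent from a sparse index on "c" alone, which is the case the answer above describes.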
The shard key must be indexed and be unique. Also have a look at the notes on shard keys in the sharding reference doc, which says:
The ideal shard key:
is easily divisible which makes it easy for MongoDB to distribute
content among the shards. Shard keys that have a limited number of
possible values are not ideal as they can result in some chunks that
are “unsplitable.” See the Cardinality section for more information.
will distribute write operations among the cluster, to prevent any
single shard from becoming a bottleneck. Shard keys that have a high
correlation with insert time are poor choices for this reason;
however, shard keys that have higher “randomness” satisfy this
requirement better. See the Write Scaling section for additional
background.
will make it possible for the mongos to return most query
operations directly from a single specific mongod instance. Your shard
key should be the primary field used by your queries, and fields with
a high degree of “randomness” are poor choices for this reason. See
the Query Isolation section for specific examples.
So if, hypothetically, MongoDB accepted a sparse index as the shard key, it would not know where to place docs that don't fit in the index. One could argue: put them all in another shard for this purpose. The counterargument would be: what happens if it outgrows ... Hence I don't think it would make sense to do it, even if it is allowed.
3- I doubt a sparse index will work, because shards require a unique index and a sparse index does not fulfill that criterion. I haven't found the unique index requirement in the docs, but the mongo admin shell help tells you about it.