does multiple shard key helps performance in mongodb? - mongodb

Since sharding database use shard key to split chunk AND route queries, so I think maybe more shard key can helps to make more queries targeted
I tried to specify multiple keys like this
db.runCommand( { shardcollection : "test.users" , key : {_id:1, email : 1 ,address:1}
but I have no idea if it works and what the downsides of doing this

To be clear here, you can only have one shard key. So you cannot have multiple shard keys.
However, you are suggesting a compound index as the shard key. This can be done, but there are some limitations.
For example the combination of _id, email and address must be unique.
Documents for choosing a shard key. There are several more considerations that I cannot list here. Please see that document.

Selection of shard key based on :
{coarseLocality : 1, search : 1}
coarseLocality is whatever locality you want for your data,
search is a common search on your data.
You must have an index on the key you shard by, so if you choose a randomly-valued
key that you don’t query by, you’re basically wasting an index. Every additional index
makes writes slower, so it’s important to keep the number of indexes as low as possible.
So,increasing shard key combination doesn't help much.
Extract taken from Kristina Chodrow's book "Scaling MongoDB".

Related

Mongodb: Determining shard key strategy on compound index

I have a collection with 170 millions+ documents and it is only going
to increase. The size of the collection is not that huge, currently
around 70 GB.
The collection has two fields indexed on: {AgentId:1, PropertyId:1}.
Generally one imports a huge file(millions of documents) belonging to
a particular AgentId but the PropertyId(non numeric nullable) is
mostly random unique value.
Currently I have two shards with shard key based on {_id: hashed}. But
I am planning to change the shard key to compound Index {AgentId:1,
PropertyId:1} because I think it will improve query performance( most
of the queries are based on AgentId filter). Not sure whether one can
have a nullable field in the shard key. If this is the case then app
will make sure that the PropertyId is random no.
So looking to get a picture as to
How the data will be distributed to shards during insertion
and how the range of a chunks are calculated during insertion?
Since the PropertyId is random value. Does the compound key fits the
definition of monotonically increasing value?
I am a newbie to mongodb. And wanted to know if I am on the right path?
Thanks
There is no automatic support in MongoDB for changing a shard key after sharding a collection.
This reality underscores the importance of choosing a good shard key. If you must change a shard key after sharding a collection, the best option is to:
dump all data from MongoDB into an external format.
drop the original sharded collection.
configure sharding using a more ideal shard key.
pre-split the shard key range to ensure initial even distribution.
restore the dumped data into MongoDB.

Uniqueness of _id within a shard

I'm looking into sharding using mongodb, and most if it is rather straight forward. I have some experience with sharding in other databases, so I'm not asking about the concept itself. There's one thing I'm confused by, and there doesn't seem to be anything in the documentation about this, so here goes.
Is _id required to be unique within the shard, regardless of shard key?
A small scale (single shard) test seems to confirm that this is the case. It does however seem like a less than stellar approach to sharding, which has me confused. To me it would make more sense to require shard-key + _id to be unique (i.e. use a compound key), or you'll have inconsistent behavior depending on where your shard-keys end up being routed to. My data model uses deterministic keys, and the shard key is an intrinsic part of it. So I guess it comes down to, did I do something wrong in my small scale test? Do I need to store the shard-key twice, once as a shard-key field and once as part of _id? Or is there some special case where I can somehow declare a compound key using shard-key and _id?
Update
For completeness, this is the trivial case I'm testing, inserting the following two documents:
{"_id": 1, "shardkey": 1}
{"_id": 1, "shardkey": 2}
First one obviously goes through, second one fails. If I would've had two shards, and the shard keys would've been routed to different shards, I assume both would've succeeded.
I can obviously just combine the shard-key and the id to create the _id field for mongodb, since this is really the key I'm using, but it seems like a weird way to approach the problem from a database architectural standpoint.
_id needs to be unique, always, whether the collection is sharded or not. The shard key does not need to be unique. It is used to split the collection into chunks which can be split onto the shards making up the database. The shard key needs to provide enough granularity to split the documents in the collection into chunks. Its obviously a good idea to link the shard key to how you query the data, and use a shard key which relates to the fields that you query on. This way the queries you run will be easily directed to the relevant shards to satisfy the query. If the shard key isnt selective enough then the query will need to go to multiple shards to find the correct documents. You can create a compound index on _id + shard-key and make it unique if you want.
I realise this doesnt fully answer the question. tbh I am struggling to understand what you're asking. Perhaps if you could post an example of the documents you're storing and the queries you're running it would help.

Mongodb choose shard key

I have a mongodb collection which I want to shard. This collection holds messages from users and a document from the collection has the following properties
{
_id : ObjectId,
conversationId: ObjectId,
created: DateTime
}
All queries will be done using the converstionId property and sorter by created.
Sharding by _id obviously won't work because I need to query by conversationId (plus _id is of type ObjectId which won't scale very well to many inserts)
Sharding by conversationId would be a logical choice in terms of query isolation but I'm afraid that it won't scale very well many inserts (even if I use a hashed shard key on conversationId or if I change the type of the property from ObjectId to some other type which isn't incremental like GUID) because some conversation might be much more active than others (i.e.: have many more message added to them)
From what I see in the mongo documentation The shard key is either an indexed field or an indexed compound field that exists in every document in the collection.
Does this mean that I can create a shard key on a compound index ?
Bottom line is that:
creating a hashed shard key from the _id property would offer good distribution of the data
creating a shard key on conversationId would offer good query isolation
So a combination of these two things would be great, if it could be done.
Any ideas?
Thanks
For your case, neither of fields look good choice for sharding. For instance, if you shard on conversationId, it will result in hot spotting, i.e. most of your inserts will happen to the last shard as conversationId would monotonically increase over time. Same problem with other two fields as well.
Also, conversationId will not offer high degree of isolation as conversationId would monotonically increase over time. (Since newer conversations will get updated much more frequently than very old ones)
In your case, a "hashed shard key"(version 2.4 onwards) over conversationId would be the smart choice as one would imagine that there can be tons of conversations going on in parallel.
Refer following link for details on creating hashed shard key: [ http://docs.mongodb.org/manual/tutorial/shard-collection-with-a-hashed-shard-key/ ]

mongodb sparse indexes and shard key

I looked through the docs, and couldn't find a clear answer
Say I have a sparse index on [a,b,c]
Will documents with "a" "b" fields but not "c" be inserted to the index?
Is having the shard key indexed obligatory in the latest mongodb version ?
If so, is it possible to shard on [a] using the above compound sparse index?
(say a,b will always exist)
If c is not present, and query uses c index in the query plan, then document will not be found because it is not present in the index.
Shard key must be indexed and be unique. Also have a look at the notes on shard key on the sharding reference doc, it says
The ideal shard key:
is easily divisible which makes it easy for MongoDB to distribute
content among the shards. Shard keys that have a limited number of
possible values are not ideal as they can result in some chunks that
are “unsplitable.” See the Cardinality section for more information.
will distribute write operations among the cluster, to prevent any
single shard from becoming a bottleneck. Shard keys that have a high
correlation with insert time are poor choices for this reason;
however, shard keys that have higher “randomness” satisfy this
requirement better. See the Write Scaling section for additional
background. will make it possible for the mongos to return most query
operations directly from a single specific mongod instance. Your shard
key should be the primary field used by your queries, and fields with
a high degree of “randomness” are poor choices for this reason. See
the Query Isolation section for specific examples.
so if hypothetically, if mongo accepts a sparse index as shard key, mongo will not know where to place docs which don't fit in the index. One can argue, put them all in another shard for this purpose. Counter argument would be, what happens if it outgrows ... hence I don't think it would make sense to do it, even if it is allowed.
3- I doubt sparse index will work because shards require a unique index and a sparse index does not fulfill the criteria. The unique index requirement, I haven't found in docs, but if you use the mongo admin shell help, it tells you about it.

Good Shard Keys in MongoDB

From the book Scaling MongoDB:
The general case
We can generalize this to a formula for shard keys:
{coarseLocality : 1, search : 1}
So my question is, is that correct? shouldn't be the oposite for better writing?
Also from the book:
This pattern continues: everything will always be added to the “last”
chunk, meaning everything will be added to one shard. This shard key
gives you a single, undistributable hot spot.
So saying that my app always search by user_id, and last entries in the collection.
What is the best shard key i should have, this:
{_id:1, user_id:1}
or:
{user_id:1,_id:1}
Kristina (author of Scaling MongoDB) wrote a blog post which has some example strategies explained in the guise of a game: How to Choose a Shard Key: The Card Game.
There are many considerations to choosing a good shard key based on your application requirements and use cases.
The general advice of {coarseLocality : 1, search : 1} order is to ensure there is some locality of your data for reading.
So in your case, you would most likely want: {user_id:1,_id:1}.
That will provide some locality of data for the same user_id when querying, and ideally your common queries will be able to get their data from a single shard.
The opposite order may provide for better write distribution (assuming _id is not a monotonically increasing key like a default ObjectId) but a potential downside is reliability: if your data for a read query is scattered across all shards, you will have retrieval problems if any one shard is down.
So saying that my app always search by user_id, and last entries in the collection.
If you commonly search by user_id (and without _id) this will also affect your choice of shard key and index optimization. To find the last entries MongoDB will have to do a sort; you will want to be doing that sort on a single shard rather than having to gather the data from all shards and sorting. If your _id happens to be date-based that would be beneficial as part of the shard key in order to find the last entries.