MongoDB sparse indexes and shard key

I looked through the docs and couldn't find a clear answer.
Say I have a sparse index on [a,b,c].
Will documents that have the "a" and "b" fields but not "c" be added to the index?
Is having the shard key indexed obligatory in the latest MongoDB version?
If so, is it possible to shard on [a] using the above compound sparse index?
(Say a and b will always exist.)

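If the sparse index were on c alone, a document without c would not be in the index, so a query whose plan uses that index would not find it. A compound sparse index made up of ascending/descending keys is more lenient: a document is indexed as long as it contains at least one of the indexed fields, so documents with a and b but no c are still in the index.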
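To make the rule for the first question concrete, here is a small sketch in plain JavaScript (no database involved). It only models the documented rule for compound sparse indexes made of ascending/descending keys; the function name isInSparseCompoundIndex and the sample documents are made up for illustration, and this is not how MongoDB implements it internally.

```javascript
// Simplified model of the rule: a document gets an index entry if it contains
// at least one of the indexed fields. A field stored as null still counts as
// present. Hypothetical helper, not a MongoDB API.
function isInSparseCompoundIndex(doc, indexedFields) {
  return indexedFields.some((field) =>
    Object.prototype.hasOwnProperty.call(doc, field)
  );
}

const indexedFields = ["a", "b", "c"]; // the sparse index on [a, b, c]

const docs = [
  { _id: 1, a: 1, b: 2, c: 3 }, // all indexed fields present
  { _id: 2, a: 1, b: 2 },       // "c" missing
  { _id: 3, d: 9 },             // none of the indexed fields present
];

for (const doc of docs) {
  console.log(doc._id, isInSparseCompoundIndex(doc, indexedFields));
}
// prints: 1 true, 2 true, 3 false
```

Only the third document would be left out of the index, so only that document could be missed by a query forced onto it.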
The shard key must be indexed. It does not have to be unique unless you explicitly ask for a unique shard key index. Also have a look at the notes on shard keys in the sharding reference doc, which says:
The ideal shard key:
is easily divisible, which makes it easy for MongoDB to distribute content among the shards. Shard keys that have a limited number of possible values are not ideal, as they can result in some chunks that are “unsplittable.” See the Cardinality section for more information.
will distribute write operations among the cluster, to prevent any single shard from becoming a bottleneck. Shard keys that have a high correlation with insert time are poor choices for this reason; however, shard keys that have higher “randomness” satisfy this requirement better. See the Write Scaling section for additional background.
will make it possible for the mongos to return most query operations directly from a single specific mongod instance. Your shard key should be the primary field used by your queries, and fields with a high degree of “randomness” are poor choices for this reason. See the Query Isolation section for specific examples.
So if, hypothetically, mongo accepted a sparse index for the shard key, it would not know where to place documents that are missing from the index. One can argue: put them all on another shard reserved for this purpose. The counterargument would be: what happens if it outgrows ... Hence I don't think it would make sense to do it, even if it were allowed.
3- I doubt a sparse index will work for the shard key, because a sparse index can leave documents out and every document needs a place in some chunk. As for uniqueness, I haven't found a unique index requirement in the docs; the mongo shell help for sharding a collection mentions a unique option, which appears to be optional.

Related

Do compound shard keys in MongoDB work similar to compound indexes?

Suppose my collection uses a compound shard key consisting of BlockHash and BlockHeight fields.
If I run a query to look up documents for a given BlockHeight, will Mongo have to hit every shard since we did not filter by BlockHash? Does having BlockHeight in the shard key help the query at all?
Ideally every query should include the shard key. Choose it based on cardinality and the logical categorisation of your data.
If you are sharding on BlockHash and BlockHeight (in that order) and you run a query on BlockHeight alone, you will end up hitting all the shards.
As a best practice, make a habit of running .explain("executionStats") on your queries. It shows how the query was planned and which shards it touched.
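The prefix rule can be sketched as a toy model in plain JavaScript. It assumes that a query can be sent to specific shards only when it filters on a leading prefix of the compound shard key, and is broadcast otherwise. The helper names (usableShardKeyPrefix, routing) and the hash value "abc" are invented for this example; this is not the mongos implementation.

```javascript
// Return the leading shard key fields that the query filters on.
function usableShardKeyPrefix(shardKeyFields, queryFields) {
  const prefix = [];
  for (const field of shardKeyFields) {
    if (!queryFields.includes(field)) break; // the prefix stops at the first gap
    prefix.push(field);
  }
  return prefix;
}

// "targeted" if at least the first shard key field is in the query,
// "broadcast" (all shards) otherwise.
function routing(shardKeyFields, query) {
  const prefix = usableShardKeyPrefix(shardKeyFields, Object.keys(query));
  return prefix.length > 0 ? "targeted" : "broadcast";
}

const shardKey = ["BlockHash", "BlockHeight"];
console.log(routing(shardKey, { BlockHash: "abc", BlockHeight: 10 })); // targeted
console.log(routing(shardKey, { BlockHash: "abc" }));                  // targeted
console.log(routing(shardKey, { BlockHeight: 10 }));                   // broadcast
```

In this model BlockHeight only helps once BlockHash is also part of the filter, which matches the behaviour described in the answer.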

Mongodb: Determining shard key strategy on compound index

I have a collection with 170 million+ documents, and it is only going
to grow. The collection is not that large, currently around 70 GB.
The collection has a compound index on two fields: {AgentId:1, PropertyId:1}.
Generally one imports a huge file (millions of documents) belonging to
a particular AgentId, while the PropertyId (non-numeric, nullable) is
mostly a random unique value.
Currently I have two shards with a shard key based on {_id: hashed}, but
I am planning to change the shard key to the compound index {AgentId:1,
PropertyId:1} because I think it will improve query performance (most
of the queries filter on AgentId). I am not sure whether one can
have a nullable field in the shard key. If that is a problem, the app
will make sure that PropertyId is always a random number.
So I am looking to get a picture of:
How will the data be distributed to shards during insertion,
and how are the chunk ranges calculated during insertion?
Since the PropertyId is a random value, does the compound key fit the
definition of a monotonically increasing value?
I am a newbie to MongoDB and wanted to know if I am on the right path.
Thanks
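At the time of this answer, there is no automatic support in MongoDB for changing a shard key after sharding a collection (later releases added a reshardCollection command, so check the documentation for your version).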
This reality underscores the importance of choosing a good shard key. If you must change a shard key after sharding a collection, the best option is to:
dump all data from MongoDB into an external format.
drop the original sharded collection.
configure sharding using a more ideal shard key.
pre-split the shard key range to ensure initial even distribution.
restore the dumped data into MongoDB.
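For the pre-split step, one possible way to choose split points is to take a sorted sample of shard key values and cut it into equal parts. The sketch below is plain JavaScript and only computes the boundaries; the function name splitPoints and the sample of AgentId values 1 to 8 are assumptions made up for the example.

```javascript
// Pick split points that divide a sorted sample of shard key values into
// roughly equal chunks. Assumes numChunks >= 1 and a non-empty sample.
// Consecutive duplicate points are dropped, so a sample with few distinct
// values yields fewer chunks than requested.
function splitPoints(sortedKeys, numChunks) {
  const points = [];
  for (let i = 1; i < numChunks; i++) {
    const idx = Math.floor((i * sortedKeys.length) / numChunks);
    const candidate = sortedKeys[idx];
    if (points.length === 0 || points[points.length - 1] !== candidate) {
      points.push(candidate);
    }
  }
  return points;
}

// Hypothetical sample: AgentId values 1..8, already sorted.
const sample = [];
for (let agentId = 1; agentId <= 8; agentId++) sample.push(agentId);

console.log(splitPoints(sample, 4)); // [ 3, 5, 7 ]
```

Each boundary would then be applied to the empty collection before the restore, for example with the shell's sh.splitAt helper; the exact pre-split procedure depends on the MongoDB version, so treat this as an outline only.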

Uniqueness of _id within a shard

I'm looking into sharding using MongoDB, and most of it is rather straightforward. I have some experience with sharding in other databases, so I'm not asking about the concept itself. There's one thing I'm confused by, and there doesn't seem to be anything in the documentation about this, so here goes.
Is _id required to be unique within the shard, regardless of shard key?
A small scale (single shard) test seems to confirm that this is the case. It does however seem like a less than stellar approach to sharding, which has me confused. To me it would make more sense to require shard-key + _id to be unique (i.e. use a compound key), or you'll have inconsistent behavior depending on where your shard-keys end up being routed to. My data model uses deterministic keys, and the shard key is an intrinsic part of it. So I guess it comes down to, did I do something wrong in my small scale test? Do I need to store the shard-key twice, once as a shard-key field and once as part of _id? Or is there some special case where I can somehow declare a compound key using shard-key and _id?
Update
For completeness, this is the trivial case I'm testing, inserting the following two documents:
{"_id": 1, "shardkey": 1}
{"_id": 1, "shardkey": 2}
First one obviously goes through, second one fails. If I would've had two shards, and the shard keys would've been routed to different shards, I assume both would've succeeded.
I can obviously just combine the shard-key and the id to create the _id field for mongodb, since this is really the key I'm using, but it seems like a weird way to approach the problem from a database architectural standpoint.
_id needs to be unique, always, whether the collection is sharded or not. Note, though, that on a sharded collection MongoDB only enforces _id uniqueness within each shard unless _id is the shard key (or a prefix of it), so with a different shard key it is up to the application to keep _id values unique across shards. The shard key itself does not need to be unique. It is used to split the collection into chunks, which can be distributed across the shards making up the cluster. The shard key needs to provide enough granularity to split the documents in the collection into chunks. It's obviously a good idea to link the shard key to how you query the data, and to use a shard key which relates to the fields that you query on. This way the queries you run will be easily directed to the relevant shards. If the shard key isn't selective enough, then the query will need to go to multiple shards to find the correct documents. You can create a unique compound index on shard-key + _id if you want (on a sharded collection a unique index has to start with the shard key).
I realise this doesn't fully answer the question. To be honest, I am struggling to understand what you're asking. Perhaps if you could post an example of the documents you're storing and the queries you're running, it would help.
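The two-document test from the question can be replayed against a toy model in plain JavaScript. The model assumes that _id uniqueness is checked per shard when _id is not the shard key; the ToyCluster class, the tryInsert helper, and the modulo routing on shardkey are all invented for the illustration and do not reflect how MongoDB actually assigns chunks.

```javascript
// Toy cluster: each shard keeps its own map of _id values, standing in for a
// per-shard unique _id index. Routing by "shardkey % numShards" is arbitrary.
class ToyCluster {
  constructor(numShards) {
    this.shards = Array.from({ length: numShards }, () => new Map());
  }
  shardFor(doc) {
    return doc.shardkey % this.shards.length;
  }
  insert(doc) {
    const shardIndex = this.shardFor(doc);
    const shard = this.shards[shardIndex];
    if (shard.has(doc._id)) {
      throw new Error(`duplicate _id ${doc._id} on shard ${shardIndex}`);
    }
    shard.set(doc._id, doc);
  }
}

// Returns "ok" or "rejected" instead of throwing, to keep the demo short.
function tryInsert(cluster, doc) {
  try {
    cluster.insert(doc);
    return "ok";
  } catch (err) {
    return "rejected";
  }
}

const single = new ToyCluster(1);
console.log(tryInsert(single, { _id: 1, shardkey: 1 })); // ok
console.log(tryInsert(single, { _id: 1, shardkey: 2 })); // rejected

const two = new ToyCluster(2);
console.log(tryInsert(two, { _id: 1, shardkey: 1 })); // ok
console.log(tryInsert(two, { _id: 1, shardkey: 2 })); // ok, lands on the other shard
```

Under that assumption the single-shard test rejects the second document, while the same pair can slip through once the two shard key values route to different shards. Building the shard key value into _id, as the question suggests, avoids relying on where documents land.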

Good Shard Keys in MongoDB

From the book Scaling MongoDB:
The general case
We can generalize this to a formula for shard keys:
{coarseLocality : 1, search : 1}
So my question is: is that correct? Shouldn't it be the opposite for better write performance?
Also from the book:
This pattern continues: everything will always be added to the “last”
chunk, meaning everything will be added to one shard. This shard key
gives you a single, undistributable hot spot.
So, say my app always searches by user_id and for the last entries in the collection.
What is the best shard key I should have? This:
{_id:1, user_id:1}
or:
{user_id:1,_id:1}
Kristina (author of Scaling MongoDB) wrote a blog post which has some example strategies explained in the guise of a game: How to Choose a Shard Key: The Card Game.
There are many considerations to choosing a good shard key based on your application requirements and use cases.
The general advice of {coarseLocality : 1, search : 1} order is to ensure there is some locality of your data for reading.
So in your case, you would most likely want: {user_id:1,_id:1}.
That will provide some locality of data for the same user_id when querying, and ideally your common queries will be able to get their data from a single shard.
The opposite order may provide for better write distribution (assuming _id is not a monotonically increasing key like a default ObjectId) but a potential downside is reliability: if your data for a read query is scattered across all shards, you will have retrieval problems if any one shard is down.
So, say my app always searches by user_id and for the last entries in the collection.
If you commonly search by user_id (and without _id), this will also affect your choice of shard key and index optimization. To find the last entries MongoDB will have to do a sort; you will want to do that sort on a single shard rather than having to gather the data from all shards and sort it. If your _id happens to be date-based, that would be beneficial as part of the shard key in order to find the last entries.
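The locality argument can be illustrated with a small simulation in plain JavaScript. It sorts documents by each candidate key order, cuts the sorted list into equal contiguous chunks, and counts how many chunks contain data for one user. The data set (4 users with 4 documents each), the chunk size, and the helper names compareBy and chunksTouched are assumptions chosen for the example; real chunk boundaries are decided by the cluster, not by equal slices.

```javascript
// Compare documents by the listed fields, in order (like a compound key).
function compareBy(fields) {
  return (x, y) => {
    for (const f of fields) {
      if (x[f] < y[f]) return -1;
      if (x[f] > y[f]) return 1;
    }
    return 0;
  };
}

// Sort by the key, cut into equal contiguous chunks, and count how many
// chunks hold at least one document for the given user.
function chunksTouched(docs, keyFields, chunkSize, userId) {
  const sorted = [...docs].sort(compareBy(keyFields));
  const touched = new Set();
  sorted.forEach((doc, i) => {
    if (doc.user_id === userId) touched.add(Math.floor(i / chunkSize));
  });
  return touched.size;
}

// 4 users x 4 documents; _id values are interleaved across users.
const docs = [];
for (let id = 0; id < 16; id++) docs.push({ _id: id, user_id: id % 4 });

console.log(chunksTouched(docs, ["user_id", "_id"], 4, 2)); // 1
console.log(chunksTouched(docs, ["_id", "user_id"], 4, 2)); // 4
```

With user_id first, all of a user's documents sit in one chunk; with _id first, they are spread over every chunk, so a query by user_id would have to visit all of them.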

Duplicate documents on _id (in mongo)

I have a sharded mongo collection with over 1.5 million documents. I use the _id field as the shard key, and the values in this field are integers (rather than ObjectIds).
I do a lot of write operations on this collection, using the Perl driver (insert, update, remove, save) and mongoimport.
My problem is that somehow I have duplicate documents with the same _id. From what I've read, this shouldn't be possible.
I've removed the duplicates, but others still appear.
Do you have any ideas where they could come from, or what I should start looking at?
(Also, I've tried to replicate this on a smaller, test collection, but no duplicates are inserted, no matter what write operation I perform).
This actually isn't a problem with the Perl driver; it is related to the characteristics of sharding. MongoDB is only able to enforce uniqueness among the documents located on a single shard at the time of creation, so the default index does not require uniqueness.
In the MongoDB: Configuring Sharding documentation there is specific mention that:
When you shard a collection, you must specify the shard key. If there is data in the collection, mongo will require an index to be created upfront (it speeds up the chunking process); otherwise, an index will be automatically created for you.
You can use the {unique: true} option to ensure that the underlying index enforces uniqueness so long as the unique index is a prefix of the shard key.
If the "unique: true" option is not used, the shard key does not have to be unique.
How have you implemented generating the integer Ids?
If you use a system like the one suggested on the MongoDB website, you should be fine. For reference:
function counter(name) {
    // Atomically increment the named counter and return its new value.
    var ret = db.counters.findAndModify({
        query: {_id: name},
        update: {$inc: {next: 1}},
        "new": true,
        upsert: true
    });
    return ret.next;
}
db.users.insert({_id: counter("users"), name: "Sarah C."}) // _id : 1
db.users.insert({_id: counter("users"), name: "Bob D."}) // _id : 2
If you are generating your ids by reading the most recent record in the document store, incrementing the number in the Perl code, and then inserting with the incremented number, you could be running into timing issues.
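The timing issue can be shown with a small plain JavaScript sketch in which an in-memory array stands in for the collection. The helper readNextId and the name "Carol E." are made up for the example, and the array has no unique index, so the sketch only shows how two clients arrive at the same id, not what the server would do with the second insert.

```javascript
// Client-side "read max, then use max + 1". The read and the insert are
// separate steps, so another client can read in between them.
function readNextId(collection) {
  const maxId = collection.reduce((max, doc) => Math.max(max, doc._id), 0);
  return maxId + 1;
}

// In-memory stand-in for the users collection.
const users = [{ _id: 1, name: "Sarah C." }];

// Both clients read before either one writes.
const idForClientA = readNextId(users); // 2
const idForClientB = readNextId(users); // 2

users.push({ _id: idForClientA, name: "Bob D." });
users.push({ _id: idForClientB, name: "Carol E." });

const ids = users.map((doc) => doc._id);
const hasDuplicate = new Set(ids).size !== ids.length;
console.log(ids, hasDuplicate); // [ 1, 2, 2 ] true
```

A counter that is incremented and read in one atomic operation, like the findAndModify-based function above, closes that gap because no two callers can receive the same value.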