Upsert performance with sharding on MongoDB - mongodb

I am analyzing some performance issues that we have with our MongoDB cluster, and it led me to a question I'm not able to find an answer for at the moment.
Let's consider the collection MyCollection, which is sharded on the key {myField:1} and has the two indexes {_id:1} and {myField:1}.
If I execute the following request:
db.MyCollection.update({_id: X}, {$set: {otherField: Y}, $setOnInsert: {myField: Z}}, {upsert:true});
Will it lead to performance issues as the query is made of only the id that is not part of the sharding key?
Would it be significantly better with the following query:
db.MyCollection.update({_id: X, myField: Z}, {$set: {otherField: Y}}, {upsert:true});
Or would it be the same?
My reasoning is that for the first query, as it doesn't have the sharding key in the query, it will ask all shards to find _id:X whereas with the second, it'll go directly to the appropriate shard.
However, I still have some doubts about the second one. Because even if a sharding key is immutable, won't it check on all other shards too to ensure that the id I provided is not present with a different sharding key?
Note: we're on version 4.0

If the query does not include the shard key, the mongos must send the query to all shards as a scatter/gather operation. Otherwise, the mongos knows which shard(s) to target.
UPDATE:
To determine whether the query is a targeted operation or a broadcast operation, use the .explain() method and look at the number of shards involved in the operation: "queryPlanner" > "winningPlan" > "shards".
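As a rough sketch of reading that part of the explain output, the helper below summarizes which shards a plan touched. The function name shardsTouched and the two sample objects are hypothetical; the samples are hand-written stand-ins for real explain() output, and the code assumes the "queryPlanner" > "winningPlan" > "shards" layout mentioned above, with each entry carrying a shardName.

```javascript
// Hypothetical helper: summarizes which shards an explain() result touched.
// Assumes output produced through mongos, where
// queryPlanner.winningPlan.shards is an array of per-shard plans.
function shardsTouched(explainOutput) {
  const plan = (explainOutput.queryPlanner || {}).winningPlan || {};
  const shards = Array.isArray(plan.shards) ? plan.shards : [];
  return {
    stage: plan.stage,                        // e.g. "SINGLE_SHARD" or "SHARD_MERGE"
    shardNames: shards.map(s => s.shardName), // names of the shards in the plan
    singleShard: shards.length === 1          // true when only one shard is involved
  };
}

// Hand-written samples standing in for real explain() output (assumed shape).
const targetedExplain = {
  queryPlanner: {
    winningPlan: { stage: "SINGLE_SHARD", shards: [{ shardName: "shardA" }] }
  }
};
const broadcastExplain = {
  queryPlanner: {
    winningPlan: {
      stage: "SHARD_MERGE",
      shards: [{ shardName: "shardA" }, { shardName: "shardB" }, { shardName: "shardC" }]
    }
  }
};

console.log(shardsTouched(targetedExplain).shardNames);  // [ 'shardA' ]
console.log(shardsTouched(broadcastExplain).singleShard); // false
```

In the shell, the same helper could be fed the result of something like db.MyCollection.find({_id: X}).explain().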

Related

Do compound shard keys in MongoDB work similar to compound indexes?

Suppose my collection uses a compound shard key consisting of BlockHash and BlockHeight fields.
If I run a query to look up documents for a given BlockHeight, will Mongo have to hit every shard since we did not filter by BlockHash? Does having BlockHeight in the shard key help the query at all?
Ideally, every query should include the shard key. Choose the shard key based on the cardinality and logical categorisation of your data.
If you are sharding on BlockHash and BlockHeight (in that order) and you run a query on BlockHeight only, you will end up hitting all the shards.
As a best practice, make it a habit to run .explain("executionStats") with your queries. This will tell you how your query is executed and which shards it touched.
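The prefix rule behind this can be sketched with a deliberately simplified model: a filter is treated as targetable only if it contains a leading prefix of the shard key. The function names targetablePrefix and routing are hypothetical, and the model ignores query operators, ranges and hashed keys, so it is an illustration rather than the actual mongos logic.

```javascript
// Simplified model of routing: returns the leading shard key fields
// that are present in the filter (stops at the first missing field).
function targetablePrefix(shardKeyFields, filter) {
  const prefix = [];
  for (const field of shardKeyFields) {
    if (!(field in filter)) break;
    prefix.push(field);
  }
  return prefix;
}

// A filter with no usable prefix has to be broadcast to every shard.
function routing(shardKeyFields, filter) {
  return targetablePrefix(shardKeyFields, filter).length > 0 ? "targeted" : "broadcast";
}

const shardKey = ["BlockHash", "BlockHeight"];
console.log(routing(shardKey, { BlockHeight: 1000 }));                    // broadcast
console.log(routing(shardKey, { BlockHash: "abc" }));                     // targeted
console.log(routing(shardKey, { BlockHash: "abc", BlockHeight: 1000 }));  // targeted
```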

Aggregation Pipeline in mongodb in sharded collection

I am referring to the MongoDB aggregation page https://docs.mongodb.com/v3.2/aggregation/#aggregation-pipeline, which mentions that
"The aggregation pipeline can operate on a sharded collection."
Please let me know: if a database is sharded, are all the collections in the database sharded? Also, please confirm whether, if sharded, the aggregate query will run on many servers and deliver the results faster. If so, how does the aggregation query function?
Regards
Kris
The aggregation pipeline supports operations on sharded collections.
If the pipeline starts with an exact $match on a shard key, the entire pipeline runs on the matching shard only. Previously (prior to version 3.2), the pipeline would have been split into two parts, and the work of merging it would have to be done on the primary shard.
In the case of aggregation operations that must run on multiple shards, if the operations do not require running on the database’s primary shard, these operations will route the results to a random shard to merge the results to avoid overloading the primary shard for that database. The $out stage and the $lookup stage require running on the database’s primary shard.
When splitting the aggregation pipeline into two parts, the pipeline is split to ensure that the shards perform as many stages as possible with consideration for optimization.
Reference:
https://docs.mongodb.com/manual/core/aggregation-pipeline-sharded-collections/
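The two-part split described above can be pictured with a toy model in plain JavaScript. Everything here is an assumption made for illustration: the shards object, its made-up documents, and the functions shardPart and mergePart are hypothetical, and the code only mimics a $match plus a partial sum on each shard followed by a merge on one node.

```javascript
// Toy model: each shard is just an array of documents (made-up data).
const shards = {
  shardA: [{ user: "ann", amount: 5 }, { user: "bob", amount: 7 }],
  shardB: [{ user: "ann", amount: 3 }, { user: "cat", amount: 1 }]
};

// Shard part: filter and pre-aggregate locally
// (comparable to a $match followed by a partial $group).
function shardPart(docs, minAmount) {
  const partial = {};
  for (const d of docs) {
    if (d.amount < minAmount) continue;
    partial[d.user] = (partial[d.user] || 0) + d.amount;
  }
  return partial;
}

// Merge part: combine the partial sums from every shard on one node.
function mergePart(partials) {
  const total = {};
  for (const p of partials) {
    for (const [user, sum] of Object.entries(p)) {
      total[user] = (total[user] || 0) + sum;
    }
  }
  return total;
}

const partials = Object.values(shards).map(docs => shardPart(docs, 2));
console.log(mergePart(partials)); // { ann: 8, bob: 7 }
```

The point of the split is that each shard ships only its small partial result, not its raw documents, to the node doing the merge.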
In a sharded cluster, every database has a primary shard, which holds all of that database's unsharded collections.
Please let me know: if a database is sharded, are all the collections in the database sharded?
No, by default none of your collections will be sharded. All unsharded collections stay wholly on the primary shard. To shard a collection, use the shardCollection command.
Also, please confirm whether, if sharded, the aggregate query will run on many servers and deliver the results faster.
One important thing when defining a collection in a sharded environment is the shard key. You should choose a good shard key, since it determines how data is distributed across the shards. With a good shard key, you can expect better performance than in a non-sharded environment.
If so, how does the aggregation query function?
The aggregation query is routed by its $match stage to the shards that hold the matching documents, and the partial results are finally merged together on one shard. A good read is https://docs.mongodb.com/v3.2/core/aggregation-pipeline-sharded-collections/

What is the performance of a query that doesn't contain the shard key in a sharded MongoDB environment?

The title says everything. Assume that you have a sharded MongoDB environment and the user provides a query that doesn't contain the shard key. What is the actual performance of the query? What happens in the background?
The performance depends on a number of factors. However, the default action of MongoDB in this case is a global scatter-and-gather operation: it sends the query to all shards and then merges the results to give you the end result.
Returning to the performance, it normally depends upon the indexes on each shard and the isolated optimisation of their data sets and how much range of a dataset they hold.
However, processing is parallel in sharding: all shards receive the query at the same time, and the mongos simply merges the results as they come in. So the flow is not "go to shard 1, get the result, then go to shard 2"; instead it is "go to all shards, each shard returns its results, and the mongos merges and returns them".
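The parallel flow can be sketched with promises in plain JavaScript. This is only a toy model: the cluster array, its made-up delays and documents, and the functions queryShard and scatterGather are hypothetical and stand in for real shards and for the merging done by the router.

```javascript
// Toy shards: each holds some documents and has a simulated latency (made-up values).
const cluster = [
  { name: "shardA", delayMs: 30, docs: [{ _id: 1, tag: "x" }, { _id: 4, tag: "y" }] },
  { name: "shardB", delayMs: 10, docs: [{ _id: 2, tag: "x" }] },
  { name: "shardC", delayMs: 20, docs: [{ _id: 3, tag: "x" }] }
];

// Each shard answers independently after its own delay.
function queryShard(shard, predicate) {
  return new Promise(resolve =>
    setTimeout(() => resolve(shard.docs.filter(predicate)), shard.delayMs));
}

// Scatter: send the query to every shard at once.
// Gather: wait for all answers, then merge and order them.
async function scatterGather(shards, predicate) {
  const perShard = await Promise.all(shards.map(s => queryShard(s, predicate)));
  return perShard.flat().sort((a, b) => a._id - b._id);
}

scatterGather(cluster, d => d.tag === "x")
  .then(docs => console.log(docs.map(d => d._id))); // [ 1, 2, 3 ]
```

Because the three requests run concurrently, the total wait in this model is roughly that of the slowest shard (30 ms here), not the sum of all delays.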
Here is a good presentation (with nice pictures) on exactly how queries with sharding work in certain situations: http://www.slideshare.net/mongodb/how-queries-work-with-sharding
If the query is made against a sharded collection, it is sent to all shards; if the query is made against a non-sharded collection, MongoDB reads all the data from a single shard.
Here is the link to the sharding FAQ in the MongoDB documentation:
http://docs.mongodb.org/manual/faq/sharding/

Good Shard Keys in MongoDB

From the book Scaling MongoDB:
The general case
We can generalize this to a formula for shard keys:
{coarseLocality : 1, search : 1}
So my question is: is that correct? Shouldn't it be the opposite for better write performance?
Also from the book:
This pattern continues: everything will always be added to the “last”
chunk, meaning everything will be added to one shard. This shard key
gives you a single, undistributable hot spot.
So, say that my app always searches by user_id and for the last entries in the collection.
What is the best shard key I should have, this:
{_id:1, user_id:1}
or:
{user_id:1,_id:1}
Kristina (author of Scaling MongoDB) wrote a blog post which has some example strategies explained in the guise of a game: How to Choose a Shard Key: The Card Game.
There are many considerations to choosing a good shard key based on your application requirements and use cases.
The general advice of {coarseLocality : 1, search : 1} order is to ensure there is some locality of your data for reading.
So in your case, you would most likely want: {user_id:1,_id:1}.
That will provide some locality of data for the same user_id when querying, and ideally your common queries will be able to get their data from a single shard.
The opposite order may provide for better write distribution (assuming _id is not a monotonically increasing key like a default ObjectId) but a potential downside is reliability: if your data for a read query is scattered across all shards, you will have retrieval problems if any one shard is down.
So, say that my app always searches by user_id and for the last entries in the collection.
If you commonly search by user_id (and without _id) this will also affect your choice of shard key and index optimization. To find the last entries MongoDB will have to do a sort; you will want to be doing that sort on a single shard rather than having to gather the data from all shards and sort it. If your _id happens to be date-based, that would be beneficial as part of the shard key in order to find the last entries.
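The locality argument can be illustrated with a small simulation in plain JavaScript. All of it is assumed for illustration: the made-up documents, the hypothetical functions chunkBy and chunksForUser, and the rule of cutting the sorted key range into equal-sized chunks, which is only a rough stand-in for range-based chunking.

```javascript
// Made-up data: 4 users, 6 documents each, _id increasing with insertion order.
const docs = [];
let nextId = 1;
for (let round = 0; round < 6; round++) {
  for (const user of ["u1", "u2", "u3", "u4"]) {
    docs.push({ _id: nextId++, user_id: user });
  }
}

// Sort by the compound key and cut the sorted range into equal-sized chunks.
function chunkBy(fields, chunkSize) {
  const sorted = [...docs].sort((a, b) => {
    for (const f of fields) {
      if (a[f] < b[f]) return -1;
      if (a[f] > b[f]) return 1;
    }
    return 0;
  });
  const chunks = [];
  for (let i = 0; i < sorted.length; i += chunkSize) {
    chunks.push(sorted.slice(i, i + chunkSize));
  }
  return chunks;
}

// How many chunks hold at least one document of the given user?
function chunksForUser(chunks, user) {
  return chunks.filter(c => c.some(d => d.user_id === user)).length;
}

console.log(chunksForUser(chunkBy(["user_id", "_id"], 6), "u2")); // 1
console.log(chunksForUser(chunkBy(["_id", "user_id"], 6), "u2")); // 4
```

With user_id first, one user's documents sit next to each other in key order; with _id first, they are spread over the whole key range, so a query by user_id has to look everywhere.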

Duplicate documents on _id (in mongo)

I have a sharded mongo collection, with over 1.5 mil documents. I use the _id column as a shard key, and the values in this column are integers (rather than ObjectIds).
I do a lot of write operations on this collection, using the Perl driver (insert, update, remove, save) and mongoimport.
My problem is that somehow, I have duplicate documents on the same _id. From what I've read, this shouldn't be possible.
I've removed the duplicates, but others still appear.
Do you have any ideas where they could come from, or what I should start looking at?
(Also, I've tried to replicate this on a smaller, test collection, but no duplicates are inserted, no matter what write operation I perform).
This actually isn't a problem with the Perl driver; it is related to the characteristics of sharding. MongoDB is only able to enforce uniqueness among the documents located on a single shard at the time of creation, so the default index does not require uniqueness.
In the MongoDB: Configuring Sharding documentation there is specific mention that:
When you shard a collection, you must specify the shard key. If there is data in the collection, mongo will require an index to be created upfront (it speeds up the chunking process); otherwise, an index will be automatically created for you.
You can use the {unique: true} option to ensure that the underlying index enforces uniqueness so long as the unique index is a prefix of the shard key.
If the "unique: true" option is not used, the shard key does not have to be unique.
How have you implemented generating the integer Ids?
If you use a system like the one suggested on the MongoDB website, you should be fine. For reference:
function counter(name) {
    var ret = db.counters.findAndModify({
        query: {_id: name},
        update: {$inc: {next: 1}},
        "new": true,
        upsert: true
    });
    return ret.next;
}
db.users.insert({_id: counter("users"), name: "Sarah C."}) // _id : 1
db.users.insert({_id: counter("users"), name: "Bob D."}) // _id : 2
If you are generating your ids by reading the most recent record in the document store, incrementing the number in the Perl code, and then inserting with the incremented number, you could be running into timing issues.
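The timing issue can be reproduced with a small in-memory simulation in plain JavaScript. The arrays and the functions readMaxId and nextCounter are hypothetical stand-ins for the collection, the two clients and the counters document; the interleaving is forced by hand to show what can happen when two writers read before either one inserts.

```javascript
// In-memory stand-ins for the collection and the counters document.
const users = [];
const counters = { users: 0 };

// Unsafe pattern: read the current maximum, then insert max + 1 in a separate step.
function readMaxId() {
  return users.reduce((max, d) => Math.max(max, d._id), 0);
}

// Two clients both read before either one inserts.
const seenByClientA = readMaxId();
const seenByClientB = readMaxId();
users.push({ _id: seenByClientA + 1, name: "Sarah C." });
users.push({ _id: seenByClientB + 1, name: "Bob D." });
console.log(users.map(d => d._id)); // [ 1, 1 ]

// Safer pattern: increment and read in one step,
// which is what findAndModify with $inc provides.
function nextCounter(name) {
  counters[name] += 1;
  return counters[name];
}
const safeUsers = [
  { _id: nextCounter("users"), name: "Sarah C." },
  { _id: nextCounter("users"), name: "Bob D." }
];
console.log(safeUsers.map(d => d._id)); // [ 1, 2 ]
```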