Bug for collections that are sharded over a hashed key - MongoDB

When querying for large amounts of data in sharded collections we benefited a lot from querying the shards in parallel.
The following problem only occurs in collections that are sharded over a hashed key.
In Mongo 2.4 it was possible to query with hash borders in order to get all data of one chunk.
We used the query from this post.
It is a range query with hash values as borders:
db.collection.find(
    { "_id" : { "$gte" : -9219144072535768301,
                "$lt"  : -9214747938866076750 } }
).hint({ "_id" : "hashed" })
The same query also works in 2.6, but it takes a long time.
The explain() shows that it is using the index, but the number of scanned objects is way too high.
"cursor" : "BtreeCursor _id_hashed",
Furthermore, the index bounds are wrong:
"indexBounds" : {
"_id" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
Was there some big change from 2.4 to 2.6 which breaks this query?
Even if the borders are interpreted as non-hash values, why does it take so long?
Is there some other way to get all documents of one chunk or hash index range?
Also the mongo internal hadoop connector has this problem with sharded collections.
Thanks!

The query above working in 2.4 was not supported behavior. See SERVER-14557 for a similar complaint and an explanation of how to perform this query properly. Rewritten in the supported form, your query becomes:
db.collection.find().min({ _id : -9219144072535768301}).max({ _id : -9214747938866076750}).hint({_id : "hashed"})
As reported in the SERVER ticket, there is an additional bug (SERVER-14400) that prevents this query from being targeted at a single shard, and at this point in time there are no plans to address it in 2.6. This form should, however, prevent the table scan you are seeing under 2.6 and allow for more efficient retrieval.
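If the goal is to retrieve all documents of each chunk (for example to query chunks in parallel), one approach is to read each chunk's bounds from the config database and issue one min()/max() query per chunk. This is only a sketch following the form above; "mydb.collection" is a placeholder namespace and the shard key is assumed to be { "_id" : "hashed" }:
var ns = "mydb.collection";   // placeholder: substitute your own database.collection
db.getSiblingDB("config").chunks.find({ "ns" : ns }).forEach(function (chunk) {
    // chunk.min / chunk.max hold the hashed shard-key bounds of this chunk,
    // e.g. { "_id" : NumberLong("-9219144072535768301") }
    var cursor = db.getSiblingDB("mydb").getCollection("collection")
        .find()
        .min(chunk.min)
        .max(chunk.max)
        .hint({ "_id" : "hashed" });
    print(chunk.shard + ": " + cursor.itcount() + " documents");
});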

Related

How to eliminate Query Targeting: Scanned Objects / Returned has gone above 1000 in MongoDB?

There are some questions (1, 2) that talk about the MongoDB warning Query Targeting: Scanned Objects / Returned has gone above 1000; however, my question is a different case.
The schema of our document is
{
"_id" : ObjectId("abc"),
"key" : "key_1",
"val" : "1",
"created_at" : ISODate("2021-09-25T07:38:04.985Z"),
"a_has_sent" : false,
"b_has_sent" : false,
"updated_at" : ISODate("2021-09-25T07:38:04.985Z")
}
The indexes of this collections are
{
"key" : {
"updated_at" : 1
},
"name" : "updated_at_1",
"expireAfterSeconds" : 5184000,
"background" : true
},
{
"key" : {
"updated_at" : 1,
"a_has_sent" : 1,
"b_has_sent" : 1
},
"name" : "updated_at_1_a_has_sent_1_b_has_sent_1",
"background" : true
}
The total number of documents after 2021-09-24 is over 600000, and there are 5 distinct values of key.
The above warning is caused by the query
db.collectionname.find({ "updated_at": { "$gte": ISODate("2021-09-24")}, "$or": [{ "a_has_sent": false }, {"b_has_sent": false}], "key": "key_1"})
Our server sends each document to a and b simultaneously with a batch size of 2000. After sending to a successfully, we set a_has_sent to true; the same logic applies to b. As the sending process goes on, the number of documents with a_has_sent: false decreases, and the above warning comes up.
After checking the explain result of this query, the index named updated_at_1 is used rather than updated_at_1_a_has_sent_1_b_has_sent_1.
What we have tried:
We added another index {"updated_at": 1, "key": 1}, expecting the query to use it and reduce the number of scanned documents. Unfortunately, it did not help: the index named updated_at_1 is still used.
We tried replacing find with aggregate:
aggregate([{"$match": { "updated_at": { "$gte": ISODate("2021-09-24") }, "$or": [{ "a_has_sent": false }, { "b_has_sent": false}], "key": "key_1"}}])
Unfortunately, the index named updated_at_1 is still used.
We want to know how to eliminate the warning Scanned Objects / Returned has gone above 1000.
Mongo 4.0 is used in our case.
Follow the ESR rule
For compound indexes, this rule of thumb is helpful in deciding the order of fields in the index:
First, add those fields against which Equality queries are run.
The next fields to be indexed should reflect the Sort order of the query.
The last fields represent the Range of data to be accessed.
We created the index {"action_key" : 1, "adjust_sent" : 1, "facebook_sent" : 1, "updated_at" : 1} (presumably the key, a_has_sent, b_has_sent and updated_at fields of the schema above, under their original names), and the query now uses this index.
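A minimal sketch of that index using the anonymized field names from the schema above (the collection name is taken from the query earlier; the exact field mapping is an assumption):
// ESR order: the Equality field first, the two flags next, the Range field (updated_at) last.
db.collectionname.createIndex({ "key": 1, "a_has_sent": 1, "b_has_sent": 1, "updated_at": 1 })
Running the query with .explain("executionStats") afterwards should show this index in the winning plan and a much smaller gap between totalDocsExamined and nReturned.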
Update 08/15/2022
Query Targeting alerts indicate inefficient queries.
Query Targeting: Scanned Objects / Returned occurs if the number of documents examined to fulfill a query relative to the actual number of returned documents meets or exceeds a user-defined threshold. The default is 1000, which means that a query must scan more than 1000 documents for each document returned to trigger the alert.
Here are some steps to solve this issue
First, the Performance Advisor provides the easiest and quickest way to create an index. If there is a Create Index suggestion, you can create the recommended index.
Then, if there is no recommended index in the Performance Advisor, you can check the Query Profiler. It contains several metrics you can use to pinpoint specific inefficient queries, including the Examined : Returned ratio (index keys examined to documents returned) of logged queries, which might help you identify the queries that triggered a Query Targeting: Scanned / Returned alert. The chart shows the number of index keys examined to fulfill a query relative to the actual number of returned documents.
You can use the following resources to determine which query generated the alert:
The Real-Time Performance Panel monitors and displays current network traffic and database operations on machines hosting MongoDB in your Atlas clusters.
The MongoDB logs maintain an account of activity, including queries, for each mongod instance in your Atlas clusters.
The following mongod log entry shows statistics generated from an inefficient query:
<Timestamp> COMMAND <query>
planSummary: COLLSCAN keysExamined:0
docsExamined: 10000 cursorExhausted:1 numYields:234
nreturned:4 protocol:op_query 358ms
This query scanned 10,000 documents and returned only 4, for a ratio of 2500, which is highly inefficient. No index keys were examined, so MongoDB scanned all documents in the collection, which is known as a collection scan (COLLSCAN).
The cursor.explain() command for mongosh provides performance details for all queries.
The Database Profiler records operations that Atlas considers slow when compared to the average execution time for all operations on your cluster.
Note - Enabling the Database Profiler incurs a performance overhead.
MongoDB cannot use a single index to process an $or that looks at different field values.
The index on
{
"updated_at" : 1,
"a_has_sent" : 1,
"b_has_sent" : 1
}
cannot be used with the $or expression to match either a_has_sent or b_has_sent.
To minimize the number of documents examined, create 2 indexes, one for each branch of the $or, combined with the enclosing $and (the find filter implicitly combines the top-level query predicates with $and), such as:
{
"updated_at" : 1,
"a_has_sent" : 1
}
and
{
"updated_at" : 1,
"b_has_sent" : 1
}
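A sketch of that suggestion against the collection from the question (whether the planner actually splits the $or across these indexes depends on your data and server version):
db.collectionname.createIndex({ "updated_at": 1, "a_has_sent": 1 })
db.collectionname.createIndex({ "updated_at": 1, "b_has_sent": 1 })
// With both indexes in place, explain may show each $or branch planned
// against its own index instead of one wide index scan plus filtering.
db.collectionname.find({
    "updated_at": { "$gte": ISODate("2021-09-24") },
    "$or": [ { "a_has_sent": false }, { "b_has_sent": false } ],
    "key": "key_1"
}).explain("executionStats")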
Also note that the alert for Query Targeting: Scanned Objects / Returned has gone above 1000 does not refer to a single query.
The MongoDB server keeps a counter (64-bit?) that tracks the number of documents examined since the server was started, and another counter for the number of documents returned.
The scanned-per-returned ratio is derived by simply dividing the examined counter by the returned counter.
This means that if you have something like a count query that requires examining documents, you may have hundreds or thousands of documents examined but only 1 returned. It won't take many of these kinds of queries to push the ratio over the 1000 alert limit.

MongoDB 4.0: Use sharding and query with different collations

We are working with a sharded MongoDB 4.0 setup and would like to use the collation option for some of our current queries, so that certain searches are case-insensitive instead of case-sensitive.
Our shards are partitioned into different sections - say "S1", "S2" and "S3". The sharding was done a while ago without specifying a collation.
We have a lot of different queries, most of which include the shard key, so that the queries run against only one specific shard.
db.data.find({"section": "S1", ...).explain()
{
"queryPlanner" : {
"mongosPlannerVersion" : 1,
"winningPlan" : {
"stage" : "SINGLE_SHARD",
(...)
Executing the same query, but with a collation, results in:
db.data.find({"section": "S1", ...}).collation({"locale": "en", "strength": 2}).explain()
{
"queryPlanner" : {
"mongosPlannerVersion" : 1,
"winningPlan" : {
"stage" : "SHARD_MERGE",
(...)
Because of the performance implications we don't want to make scatter-gather queries (SHARD_MERGE).
Is there a way to refine the query that is run against our mongos so that it is able to tell which shard to query, even though the collation of the query and the sharding do not match?
Lowercase your input prior to insertion, lowercase the queries, perform the queries without specifying the collation.
If you need the original case you can preserve it in another field.
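A minimal sketch of that approach, using the data collection from the question; name and name_lower are hypothetical fields (name_lower holds the lowercased copy used for querying):
// Store a lowercased copy alongside the original value at insert/update time.
db.data.insertOne({
    "section": "S1",
    "name": "Alice Smith",         // original case preserved for display
    "name_lower": "alice smith"    // lowercased copy used for lookups
})
// Lowercase the input and query the lowercased field without a collation;
// the shard key ("section") stays in the filter, so the query remains SINGLE_SHARD.
var input = "ALICE Smith";
db.data.find({ "section": "S1", "name_lower": input.toLowerCase() })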

MongoDB find and iterate vs count

I have a peculiar problem with Mongo.
We have a collection of 800k documents with the following structure.
{
"_id" : ObjectId("5bd844199114bab3b2c19fab"),
"u" : 0,
"c" : 0,
"iden" : "343754856",
"name" : "alan",
"email" : "mkasd#abc.com",
"mobile" : "987654321093456",
"expires" : ISODate("2018-11-29T11:44:25.453Z"),
"created" : ISODate("2018-10-30T11:44:25.453Z")
}
We have indexed iden and name on which we generally query.
We tried two types of queries.
db.Collection.find({"iden": "343754856", "name": "alan", "created":
{"$gt": ....}).count()
where "created" is an unindexed field.
db.Collection.find({"iden": "343754856", "name": "alan"})
and iterate over all records to filter based on created.
However, MongoDB seems to take an enormous amount of time executing the second query, while it was supposed to be an optimization over the first.
Any leads on what is going wrong here?
We are using Go library.
How could the second version be an optimization over the first?
Your first query retrieves a single number from the MongoDB server: the overall count of the query result. Your second version, on the other hand, fetches all matching documents, and you do the filtering and counting on the "client" side.
Believe me, MongoDB can count the result documents internally just as fast as you could in your Go client. Making the MongoDB server send the results, then fetching and unmarshaling them at the client, takes orders of magnitude more time (depending on a lot of factors).
Please note that if you have a composite index containing "iden" and "name", even if you add more filters (like "created" in your example), the index may still be used, but MongoDB has to iterate over the partial results to apply the rest of the query. To find out whether the index is used, execute the following command:
db.Collection.find(
    {"iden": "343754856", "name": "alan", "created": {"$gt": ....}}
).explain()
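If the explain output shows far more documents examined than returned, one option (a sketch, not tested against your data; the date below is only illustrative because the original bound was elided) is to extend the index so the created range is resolved from index keys as well:
// Equality fields first, the range field last, so the count can be answered
// almost entirely from the index instead of fetching documents.
db.Collection.createIndex({ "iden": 1, "name": 1, "created": 1 })
db.Collection.find({
    "iden": "343754856",
    "name": "alan",
    "created": { "$gt": ISODate("2018-10-01") }   // illustrative date only
}).count()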

Why is full text search of MongoDB shards directly much faster than going through the cluster manager (mongos) instance?

I have been very unhappy with full text search performance in MongoDB so I have been looking for outside-the-box solutions. With a relatively small collection of 25 million documents sharded across 8 beefy machines (4 shards with redundancy) I see some queries taking 10 seconds. That is awful. On a lark, I tried a 10 second query to the shards directly, and it seems like the mongos is sending the queries to shards serially, rather than in parallel. Across the 4 shards I saw search times of 2.5 seconds on one shard and the other 3 shards under 2 seconds each. That is a total of less than 8.5 seconds, but it took 10 through mongos. Facepalm.
Can someone confirm these queries to shards are being run serially? Or offer some other explanation?
What are the pitfalls to querying the shards directly?
We are on 4.0 and the query looks like this:
db.items.aggregate(
[
{ "$match" : {
"$text" : { "$search" : "search terms"}
}
},
{ "$project": { "type_id" : 1, "source_id": 1 } },
{ "$facet" : { "types" : [ { "$unwind" : "$type_id"} , { "$sortByCount" : "$type_id"}] , "sources" : [ { "$unwind" : "$source_id"} , { "$sortByCount" : "$source_id"}]}}
]
);
I made a mistake before; this is the query being sent that has the issue. I talked to a MongoDB expert and was clued into a big part of what's going on (I think), but I'm happy to see what others have to say so I can pay the bounty and make it official.
Can someone confirm these queries to shards are being run serially? Or offer some other explanation?
Without a shard key in the query, the query is sent to all shards and processed in parallel. However, the results from all shards will be merged at the primary shard, and thus it'll wait until the slowest shard returns.
What are the pitfalls to querying the shards directly?
You can potentially include orphaned documents. Queries routed through mongos filter out orphaned documents to ensure data consistency, so querying via mongos has more overhead than querying each shard directly.
Measured using Robo 3T's query time
Using Robo 3T doesn't measure the query time correctly. By default, Robo 3T returns the first 50 documents. In driver implementations, if the number of returned documents is larger than the default batch size, additional getMore requests are sent to the database to retrieve the remaining docs. Robo 3T only gives you the first batch, i.e. a subset of the results.
To evaluate your query, add explain('executionStats') to it. The performance hit is likely the data transfer between shards: because the query lacks a shard key, the results from all shards have to be sent to one shard before merging. The total time is not only the query time (locating the docs) in the mongo engine, but also the document retrieval time.
Execute the command below and you'll see inputStages from each shard to better evaluate your query.
db.items.explain('executionStats').aggregate(
[
{ "$match" : {
"$text" : { "$search" : "search terms"}
}
},
{ "$project": { "type_id" : 1, "source_id": 1 } },
{ "$facet" : { "types" : [ { "$unwind" : "$type_id"} , { "$sortByCount" : "$type_id"}] , "sources" : [ { "$unwind" : "$source_id"} , { "$sortByCount" : "$source_id"}]}}
]
);

$natural order avoids indexes. How does orderby affect the use of indexes?

While profiling slow queries, I found something really strange: for the following operation the entire collection was scanned (33061 documents) even though there is an index on the query parameter family_id:
{
"ts" : ISODate("2013-11-27T10:20:26.103Z"),
"op" : "query",
"ns" : "mydb.zones",
"query" : {
"$query" : {
"family_id" : ObjectId("52812295ea84d249934f3d12")
},
"$orderby" : {
"$natural" : 1
}
},
"ntoreturn" : 20,
"ntoskip" : 0,
"nscanned" : 33061,
"keyUpdates" : 0,
"numYield" : 91,
"lockStats" : {
"timeLockedMicros" : {
"r" : NumberLong(83271),
"w" : NumberLong(0)
},
"timeAcquiringMicros" : {
"r" : NumberLong(388988),
"w" : NumberLong(22362)
}
},
"nreturned" : 7,
"responseLength" : 2863,
"millis" : 393,
"client" : "127.0.0.1",
"user" : "mydb"
}
After some Google searches without results, I found out that when leaving out the "$orderby": { "$natural" : 1 } the query is very fast and only 7 documents are scanned instead of 33061. So I assume that using $orderby in my case avoids using the index on family_id. The strange thing is that the resulting order is not different in either case. As far as I understand $natural order, it is tautological to use "$orderby": { "$natural" : 1 } or no explicit order. Another very interesting observation is that this issue does not arise on capped collections!
This issue raises the following questions:
If not using any ordering/sorting, shouldn't the resulting order be the order on disk, i.e. $natural order?
Can I create a (compound-)index that would be used sorting naturally?
How can I invert the ordering of a simple query that uses an index and no sorting without severe performance losses?
What happens behind the scenes when using query parameters and orderby? Why is this not happening on capped collections? I would like to understand this strange behaviour.
Are the answers of the above questions independent of whether you use sharding/replication or not? What is the natural order of a query over multiple shards?
Note I am using MongoDB 2.2. There is a ticket related to this issue: https://jira.mongodb.org/browse/SERVER-5672. Though it seems in that ticket that the issue occurs in capped collections too, which I cannot confirm (maybe due to different mongo versions).
As far as I understand $natural order, it is tautological to use "$orderby": { "$natural" : 1 } or no explicit order.
This is a slight misdescription of $natural order. MongoDB stores records in some order on disk and keeps track of them via a doubly linked list. $natural order is the order that you get when you traverse the linked list. Even if you do not specify $natural, that is what you will get by default: not random order, not insertion order, not physical disk order, but "logical" disk order, i.e. the order the records appear in when traversing the linked list.
If not using any ordering/sorting, shouldn't the resulting order be the order on disk, i.e. $natural order?
Yes, assuming that you understand that "disk order" is not strictly physical order; it's the order they are in within the linked list of records.
Can I create a (compound-)index that would be used sorting naturally?
I don't know what you mean by sorting naturally - if you are using an index during a query, the documents are traversed in index order, not in $natural order.
How can I invert the ordering of a simple query that uses an index and no sorting without severe performance losses?
You cannot - if you are using an index, then you will get the records in index order. Your options are to get them in that order, in the reverse of that order, or to create a compound index where you index by the fields you are searching on and the field(s) you want to sort on.
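A small sketch of that last option against the zones collection from the profiler output above (created_at is a hypothetical sort field, since the schema of zones isn't shown):
// Hypothetical compound index: the filter field first, the sort field second
// (ensureIndex was the shell helper in the 2.2 era; newer shells use createIndex).
db.zones.ensureIndex({ "family_id": 1, "created_at": 1 })
// Both directions are served by the index; the -1 sort simply walks it backwards,
// so there is no in-memory sort either way.
db.zones.find({ "family_id": ObjectId("52812295ea84d249934f3d12") }).sort({ "created_at": 1 })
db.zones.find({ "family_id": ObjectId("52812295ea84d249934f3d12") }).sort({ "created_at": -1 })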
What happens behind the scenes when using query parameters and orderby? Why is this not happening on capped collections? I would like to understand this strange behaviour.
What happens depends on what indexes are available, but the query optimizer tries to use an index that helps with both filtering and sorting - if that's not possible it will pick the index that has the best actual performance.
Are the answers of the above questions independent of whether you use sharding/replication or not? What is the natural order of a query over multiple shards?
It's some non-deterministic merge of $natural orders from each individual shard.