MongoDB 4.0: Use sharding and query with different collations

We are working with a sharded MongoDB 4.0 setup and would like to use the collation option for some of our current queries, so that certain searches are done case-insensitively instead of case-sensitively.
Our shards are partitioned into different sectors - say sectors "S1", "S2" and "S3". The sharding was set up a while ago without specifying a collation.
We have a lot of different queries, most of which include the shard key, so that they are run against only one specific shard.
db.data.find({"section": "S1", ...).explain()
{
    "queryPlanner" : {
        "mongosPlannerVersion" : 1,
        "winningPlan" : {
            "stage" : "SINGLE_SHARD",
            (...)
Executing the same query, but with a collation, results in:
db.data.find({"section": "S1", ...}).collation({"locale": "en", "strength": 2}).explain()
{
    "queryPlanner" : {
        "mongosPlannerVersion" : 1,
        "winningPlan" : {
            "stage" : "SHARD_MERGE",
            (...)
Because of the performance implications we don't want to run scatter-gather queries (SHARD_MERGE).
Is there a way to refine the query that is run against our mongos so that it is able to tell which shard to query, even though the collation of the query and that of the sharding do not match?

Lowercase your input prior to insertion, lowercase the query values, and perform the queries without specifying a collation.
If you need the original case, you can preserve it in another field.
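A minimal sketch of that approach in the shell, assuming a hypothetical name field with a lowercased copy called name_lower (both field names are made up for illustration):
// Preserve the original case in "name", search on the lowercased copy.
db.data.insertOne({
    "section": "S1",
    "name": "Alice",            // original case kept for display
    "name_lower": "alice"       // lowercased copy used for searching
});

// Index the lowercased copy together with the shard key.
db.data.createIndex({ "section": 1, "name_lower": 1 });

// Lowercase the search term as well; no collation is needed, so the shard
// key keeps the query targeted at a single shard (SINGLE_SHARD).
var term = "ALICE";
db.data.find({ "section": "S1", "name_lower": term.toLowerCase() });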

Related

In MongoDB, what are the differences in the working process between the primary key (IDHACK) and a secondary key (IXSCAN)?

I have run some tests to compare the find() performance between the primary key and a secondary key.
I inserted 1 million dummy documents into a collection; a document looks like below.
{ "_id" : "1/1/1",
"1stDocumentNum : "1/1/1"
}
After inserting all the dummy data, the "_id" field already had an index (the primary key) that MongoDB automatically creates on _id, and I created an additional index on the "1stDocumentNum" field (the secondary key).
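Roughly what that setup looks like in the shell (a sketch; the collection name test is an assumption):
db.test.insertOne({ "_id" : "1/1/1", "1stDocumentNum" : "1/1/1" })  // _id is indexed automatically
db.test.createIndex({ "1stDocumentNum" : 1 })                       // additional secondary (B-tree) index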
When I tested finding an object in two different ways,
finding an object with the "_id" field
finding an object with the "1stDocumentNum" field
the results were as below.
Using JMeter, Nginx and PHP:
finding an object with "_id" field
Throughput per second: 10622
In explain(),
"winningPlan" : {
"stage" : "IDHACK"
},
finding an object with "1stDocumentNum" field
Throughput per second: 8751
In explain(),
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"1stDocumentNum" : 1
},
I want to know what the differences are between IDHACK and IXSCAN and how they work differently.
If they work exactly the same way, is the difference in throughput due to the number of stages in explain()?
Can anybody help me understand it?
Thanks in advance!
IDHACK means that your query has chosen to use the _id field index.
IXSCAN means that your query is using a regular index.
It is just about query path optimization.
"_id" is by default a field with a hashed index, and a hashed index works slowly for small data sets (the hashing overhead).
"1stDocumentNum" : 1 is a binary tree (B-tree) index and it works faster for small data sets.
That's why, in your example, the 1stDocumentNum index is faster than _id.
Try running the same tests with 10 billion documents and you will see that the hashed index works faster (usually; it depends on the hash algorithm's distribution).
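To compare the two plans yourself, explain("executionStats") shows the per-stage work and timings; a minimal sketch (the collection name test is an assumption):
db.test.find({ "_id" : "1/1/1" }).explain("executionStats")             // winningPlan stage: IDHACK
db.test.find({ "1stDocumentNum" : "1/1/1" }).explain("executionStats")  // IXSCAN followed by FETCH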

MongoDB find and iterate vs count

I have a peculiar problem with Mongo.
We have a collection of 800k documents with the following structure.
{
    "_id" : ObjectId("5bd844199114bab3b2c19fab"),
    "u" : 0,
    "c" : 0,
    "iden" : "343754856",
    "name" : "alan",
    "email" : "mkasd#abc.com",
    "mobile" : "987654321093456",
    "expires" : ISODate("2018-11-29T11:44:25.453Z"),
    "created" : ISODate("2018-10-30T11:44:25.453Z")
}
We have indexed iden and name on which we generally query.
We tried two types of queries.
db.Collection.find({"iden": "343754856", "name": "alan", "created":
{"$gt": ....}).count()
where "created" is an unindexed field.
db.Collection.find({"iden": "343754856", "name": "alan"})
and iterate over all records to filter based on created.
However, MongoDB seems to be taking an enormous amount of time executing the second query, while it was supposed to be an optimization over the first.
Any leads on what is going wrong here?
We are using Go library.
How could the second version be an optimization over the first?
Your first query retrieves a single number from the MongoDB server: the overall count of the query result. Your second version fetches all matching documents, and you do the counting on the client side.
Believe me, MongoDB can count the result documents internally just as fast as you could in your Go client. Making the MongoDB server send the results, then fetching and unmarshaling them at the client, takes orders of magnitude more time (depending on a lot of factors).
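In shell terms, the contrast looks something like this sketch (the cutoff date is just an example value):
// Option 1: the server counts; only a number crosses the wire.
db.Collection.find({ "iden": "343754856", "name": "alan",
                     "created": { "$gt": ISODate("2018-10-01T00:00:00Z") } }).count()

// Option 2: every matching document is shipped to the client and counted there.
var cutoff = ISODate("2018-10-01T00:00:00Z");
var n = 0;
db.Collection.find({ "iden": "343754856", "name": "alan" }).forEach(function (doc) {
    if (doc.created > cutoff) { n++; }
});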
Please note that if you have a composite index containing "iden" and "name", even if you add more filters (like "created" in your example), the index may still be used, but MongoDB has to iterate over the partial results to apply the rest of the query. To find out whether the index is used, execute the following command:
db.Collection.find(
    {"iden": "343754856", "name": "alan", "created": {"$gt": ....}}
).explain()
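If filtering on created is common, a compound index that also covers it would let the server resolve the date range directly from the index (a sketch; whether this suits your workload is an assumption):
db.Collection.createIndex({ "iden": 1, "name": 1, "created": 1 })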

Why is full text search of MongoDB shards directly much faster than going through the cluster manager (mongos) instance?

I have been very unhappy with full text search performance in MongoDB so I have been looking for outside-the-box solutions. With a relatively small collection of 25 million documents sharded across 8 beefy machines (4 shards with redundancy) I see some queries taking 10 seconds. That is awful. On a lark, I tried a 10 second query to the shards directly, and it seems like the mongos is sending the queries to shards serially, rather than in parallel. Across the 4 shards I saw search times of 2.5 seconds on one shard and the other 3 shards under 2 seconds each. That is a total of less than 8.5 seconds, but it took 10 through mongos. Facepalm.
Can someone confirm these queries to shards are being run serially? Or offer some other explanation?
What are the pitfalls to querying the shards directly?
We are on 4.0 and the query looks like this:
db.items.aggregate(
    [
        { "$match" : { "$text" : { "$search" : "search terms" } } },
        { "$project" : { "type_id" : 1, "source_id" : 1 } },
        { "$facet" : {
            "types"   : [ { "$unwind" : "$type_id" },   { "$sortByCount" : "$type_id" } ],
            "sources" : [ { "$unwind" : "$source_id" }, { "$sortByCount" : "$source_id" } ]
        } }
    ]
);
I made a mistake before; this is the query being sent that has the issue. I also talked to a MongoDB expert and was clued in to a big part of what's going on (I think), but I'm happy to see what others have to say so I can pay the bounty and make it official.
Can someone confirm these queries to shards are being run serially? Or offer some other explanation?
Without a shard key in the query, the query is sent to all shards and processed in parallel. However, the results from all shards will be merged at the primary shard, and thus it'll wait until the slowest shard returns.
What are the pitfalls to querying the shards directly?
You can potentially include orphaned documents. Queries via mongos also filter out orphaned documents to ensure data consistency. Therefore, querying via mongos has more overhead than querying each shard directly.
Measured using Robo 3T's query time
Using Robo 3T doesn't measure the query time correctly. By default, Robo 3T returns the first 50 documents. In driver implementations, if the number of returned documents is larger than the default batch size, getMore requests follow to the database to retrieve all the docs. Robo 3T only gives you the first batch, i.e. a subset of the results.
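To time the full result set rather than just the first batch, force the cursor to be exhausted, for example (a sketch):
db.items.find({ "$text": { "$search": "search terms" } }).itcount()   // iterates every result before returning a count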
To evaluate your query, add explain('executionStats') to it. The performance hit is likely the data transfer between shards. Because of the lack of a shard key in the query, the results from all shards have to be sent to one shard before merging. The total time is not only the query time (locating the docs) in the mongo engine, but also the document retrieval time.
Execute the command below and you'll see the inputStages from each shard, to better evaluate your query.
db.items.explain('executionStats').aggregate(
    [
        { "$match" : { "$text" : { "$search" : "search terms" } } },
        { "$project" : { "type_id" : 1, "source_id" : 1 } },
        { "$facet" : {
            "types"   : [ { "$unwind" : "$type_id" },   { "$sortByCount" : "$type_id" } ],
            "sources" : [ { "$unwind" : "$source_id" }, { "$sortByCount" : "$source_id" } ]
        } }
    ]
);

Bug for collections that are sharded over a hashed key

When querying for large amounts of data in sharded collections we benefited a lot from querying the shards in parallel.
The following problem does only occur in collections that are sharded over a hashed key.
In Mongo 2.4 it was possible to query with hash borders in order to get all data of one chunk.
We used the query from this post.
It is a range query with hash values as borders:
db.collection.find({
    "_id" : { "$gte" : -9219144072535768301,
              "$lt"  : -9214747938866076750 }
}).hint({ "_id" : "hashed" })
The same query also works in 2.6 but takes a long time.
The explain() shows that it is using the index, but the number of scanned objects is way too high.
"cursor" : "BtreeCursor _id_hashed",
Furthermore, the borders are wrong.
"indexBounds" : {
"_id" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
Was there some big change from 2.4 to 2.6 which breaks this query?
Even if the borders are interpreted as non-hash values, why does it take so long?
Is there some other way to get all documents of one chunk or hash index range?
Also, the internal MongoDB Hadoop connector has this problem with sharded collections.
Thanks!
The query above working in 2.4 was not supported behavior. See SERVER-14557, which has a similar complaint and an explanation of how to properly perform this query. Reformatted for proper behavior, your query becomes:
db.collection.find()
    .min({ "_id" : -9219144072535768301 })
    .max({ "_id" : -9214747938866076750 })
    .hint({ "_id" : "hashed" })
As reported in the SERVER ticket, there is an additional bug (SERVER-14400) that prevents this query from being targeted at a single shard. At this point in time there are no plans to address it in 2.6. This should, however, prevent the table scan you are seeing under 2.6 and allow for more efficient retrieval.
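The chunk boundaries themselves (the hashed values to plug into min() and max()) can be read from the cluster metadata; a sketch, run against mongos, with the namespace being an assumption:
db.getSiblingDB("config").chunks.find(
    { "ns" : "mydb.collection" },            // namespace of the sharded collection (assumed)
    { "min" : 1, "max" : 1, "shard" : 1 }
)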

$natural order avoids indexes. How does orderby affect the use of indexes?

Profiling slow queries, I found something really strange: for the following operation the entire collection was scanned (33061 documents) even though there is an index on the query parameter family_id:
{
    "ts" : ISODate("2013-11-27T10:20:26.103Z"),
    "op" : "query",
    "ns" : "mydb.zones",
    "query" : {
        "$query" : {
            "family_id" : ObjectId("52812295ea84d249934f3d12")
        },
        "$orderby" : {
            "$natural" : 1
        }
    },
    "ntoreturn" : 20,
    "ntoskip" : 0,
    "nscanned" : 33061,
    "keyUpdates" : 0,
    "numYield" : 91,
    "lockStats" : {
        "timeLockedMicros" : {
            "r" : NumberLong(83271),
            "w" : NumberLong(0)
        },
        "timeAcquiringMicros" : {
            "r" : NumberLong(388988),
            "w" : NumberLong(22362)
        }
    },
    "nreturned" : 7,
    "responseLength" : 2863,
    "millis" : 393,
    "client" : "127.0.0.1",
    "user" : "mydb"
}
After some fruitless Google searches I found out that when leaving out the "$orderby": { "$natural" : 1 }, the query is very fast and only 7 documents are scanned instead of 33061. So I assume that using $orderby in my case prevents the use of the index on family_id. The strange thing is that the resulting order is not different in either case. As far as I understand $natural order, it is tautological to use "$orderby": { "$natural" : 1 } or no explicit order. Another very interesting observation is that this issue does not arise on capped collections!
This issue raises the following questions:
If not using any ordering/sorting, shouldn't the resulting order be the order on disk, i.e. $natural order?
Can I create a (compound-)index that would be used sorting naturally?
How can I invert the ordering of a simple query that uses an index and no sorting without severe performance losses?
What happens behind the scenes when using query parameters and orderby? Why is this not happening on capped collections? I would like to understand this strange behaviour.
Are the answers of the above questions independent of whether you use sharding/replication or not? What is the natural order of a query over multiple shards?
Note: I am using MongoDB 2.2. There is a ticket related to this issue: https://jira.mongodb.org/browse/SERVER-5672. Though it seems in that ticket that the issue occurs in capped collections too, which I cannot confirm (maybe due to different Mongo versions).
As far as I understand $natural order, it is tautological to use "$orderby": { "$natural" : 1 } or no explicit order.
This is a misdescription of $natural order. MongoDB stores records in some order on disk and keeps track of them via a doubly linked list. $natural order is the order that you get when you traverse the linked list. However, if you do not specify $natural, that is what you will always get - not random order, not insertion order, not physical disk order, but "logical" disk order - the order they appear in when traversing the linked list.
If not using any ordering/sorting, shouldn't the resulting order be the order on disk, i.e. $natural order?
Yes, assuming that you understand that "disk order" is not strictly physical order; it's the order the records are in within the linked list.
Can I create a (compound-)index that would be used sorting naturally?
I don't know what you mean by sorting naturally - if you are using an index during a query, the documents are traversed in index order, not in $natural order.
How can I invert the ordering of a simple query that uses an index and no sorting without severe performance losses?
You cannot - if you are using an index then you will get the records in index order. Your options are to get them in that order, in reverse of that order, or to create a compound index where you index by the fields you are searching on and the field(s) you want to sort on.
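For the profiled query above, the compound-index option would look something like this sketch (sorting on created is an assumed example, since the full document shape of mydb.zones isn't shown):
db.zones.createIndex({ "family_id" : 1, "created" : -1 })

db.zones.find({ "family_id" : ObjectId("52812295ea84d249934f3d12") })
        .sort({ "created" : -1 })
        .limit(20)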
What happens behind the scenes when using query parameters and orderby? Why is this not happening on capped collections? I would like to understand this strange behaviour.
What happens depends on what indexes are available, but the query optimizer tries to use an index that helps with both filtering and sorting - if that's not possible it will pick the index that has the best actual performance.
Are the answers of the above questions independent of whether you use sharding/replication or not? What is the natural order of a query over multiple shards?
It's some non-deterministic merge of $natural orders from each individual shard.