MongoDB: Index is pretty slow on 100+ million docs

I'm running a count on a collection with more than 100 million documents.
My query is:
{
"domain": domain,
"categories" : "buzz",
"visit.timestamp" : { "$gte": date_from, "$lt": date_to },
}
I project only _id.
I have some indexes on it, for example:
{ "visit.timestamp": -1 }
and compound index like:
{ "visit.timestamp": -1, "domain": 1, "categories" : 1 }
A count over, for example, the last 30 days gives results in ~30 seconds.
An explain() shows me that the query uses the simplest index: { "visit.timestamp": -1 }
So I tried to force compound indexes with other field orders:
{ "categories" : 1, "domain": 1, "visit.timestamp": -1 }
{ "domain": 1, "categories" : 1, "visit.timestamp": -1 }
Then the query uses one of them, but the result takes much longer: ~60 seconds in the first case, and more than 241 seconds for the other one!
Note 1: The result is the same with the aggregation framework, which is not surprising.
Note 2: "visit.timestamp" is an ISODate. Each document is more recent than the previous one.
Note 3: The count returns ~1.4 million documents (out of the ~105 million), but 12 million docs were examined (see below).
Question:
1/ I don't get why the query takes longer when using an index that should cover it completely. Do you have an explanation?
2/ Do you have any hint to improve the response time of this query?
The explain() shows that the query looked at:
"totalKeysExamined": 12628476,
"totalDocsExamined": 12628476,
Because, as far as I understand, the chosen index covers only the date field visit.timestamp, so all docs within the time frame have to be examined.

Second question:
Make sure the problem is in MongoDB's scope. Isolate it from your application code and I/O by connecting to (one of) your MongoDB server(s) locally and executing the query there.
Does it still happen locally? Check the CPU and disk health of your server(s).
CPU(s) and disk(s) handle it no sweat? Make sure your index fits into RAM. Citing from MongoDB's FAQ:
What happens if an index does not fit into RAM?
When an index is too large to fit into RAM, MongoDB must read the index from disk, which is a much slower operation than reading from RAM. Keep in mind an index fits into RAM when your server has RAM available for the index combined with the rest of the working set.
In certain cases, an index does not need to fit entirely into RAM. For details, see Indexes that Hold Only Recent Values in RAM.
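To check this quickly, here is a minimal sketch in mongosh (the collection name "visits" is an assumption): compare the total index size against the configured WiredTiger cache.
// Does the index (plus the rest of the working set) fit into the WiredTiger cache?
const idxBytes = db.visits.totalIndexSize();   // total size of all indexes on the collection, in bytes
const cacheBytes = db.serverStatus().wiredTiger.cache["maximum bytes configured"];
print("indexes: " + (idxBytes / 1024 / 1024).toFixed(0) + " MB, WiredTiger cache: " + (cacheBytes / 1024 / 1024).toFixed(0) + " MB");
// Per-index sizes in bytes, to see which single index is the biggest:
db.visits.stats({ indexDetails: true }).indexSizes;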
First question:
Maybe your index doesn't fit into RAM, and making it compound may increase the number of I/O operations to disk. I'm no MongoDB expert, though.

Related

How to eliminate Query Targeting: Scanned Objects / Returned has gone above 1000 in MongoDB?

There are some questions (1, 2) that discuss the MongoDB warning Query Targeting: Scanned Objects / Returned has gone above 1000; however, my question is a different case.
The schema of our document is
{
"_id" : ObjectId("abc"),
"key" : "key_1",
"val" : "1",
"created_at" : ISODate("2021-09-25T07:38:04.985Z"),
"a_has_sent" : false,
"b_has_sent" : false,
"updated_at" : ISODate("2021-09-25T07:38:04.985Z")
}
The indexes of this collection are
{
"key" : {
"updated_at" : 1
},
"name" : "updated_at_1",
"expireAfterSeconds" : 5184000,
"background" : true
},
{
"key" : {
"updated_at" : 1,
"a_has_sent" : 1,
"b_has_sent" : 1
},
"name" : "updated_at_1_a_has_sent_1_b_has_sent_1",
"background" : true
}
The total number of documents after 2021-09-24 is over 600,000, and key has 5 distinct values.
The above warning is caused by the query
db.collectionname.find({ "updated_at": { "$gte": ISODate("2021-09-24")}, "$or": [{ "a_has_sent": false }, {"b_has_sent": false}], "key": "key_1"})
Our server sends one document to a and to b simultaneously, with a batch size of 2000. After sending to a successfully, we mark a_has_sent as true; the same logic applies to b. As the sending process goes on, the number of documents with a_has_sent: false decreases, and the above warning comes up.
Checking the explain result of this query shows that the index named updated_at_1 is used rather than updated_at_1_a_has_sent_1_b_has_sent_1.
What we have tried:
We added another index, {"updated_at": 1, "key": 1}, expecting the query to use it and reduce the number of scanned documents. Unfortunately, we failed: the index named updated_at_1 is still used.
We tried replacing find with aggregate:
aggregate([{"$match": { "updated_at": { "$gte": ISODate("2021-09-24") }, "$or": [{ "a_has_sent": false }, { "b_has_sent": false}], "key": "key_1"}}]). Unfortunately, the index named updated_at_1 is still used.
We want to know how to eliminate this warning: Scanned Objects / Returned has gone above 1000.
MongoDB 4.0 is used in our case.
Follow the ESR rule
For compound indexes, this rule of thumb is helpful in deciding the order of fields in the index:
First, add those fields against which Equality queries are run.
The next fields to be indexed should reflect the Sort order of the query.
The last fields represent the Range of data to be accessed.
We created the index {"action_key" : 1, "adjust_sent" : 1, "facebook_sent" : 1, "updated_at" : 1}, and the query can use this index now.
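For reference, a minimal sketch (mongosh) of the same ESR ordering applied to the anonymized field names from the question; verify the winning plan and the examined/returned counts with explain:
// ESR order: equality fields first (key, then the two flags), range field (updated_at) last.
db.collectionname.createIndex({ key: 1, a_has_sent: 1, b_has_sent: 1, updated_at: 1 });
// Check which plan wins and how many keys/documents are examined:
db.collectionname.find({
    key: "key_1",
    $or: [ { a_has_sent: false }, { b_has_sent: false } ],
    updated_at: { $gte: ISODate("2021-09-24") }
}).explain("executionStats");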
Update 08/15/2022
Query Targeting alerts indicate inefficient queries.
Query Targeting: Scanned Objects / Returned occurs if the number of documents examined to fulfill a query relative to the actual number of returned documents meets or exceeds a user-defined threshold. The default is 1000, which means that a query must scan more than 1000 documents for each document returned to trigger the alert.
Here are some steps to solve this issue
First, the Performance Advisor provides the easiest and quickest way to create an index. If it suggests any indexes to create, you can create the recommended index.
Then, if there is no recommended index in the Performance Advisor, you could check the Query Profiler. The Query Profiler contains several metrics you can use to pinpoint specific inefficient queries. It can show the Examined : Returned Ratio (index keys examined to documents returned) of logged queries, which might help you identify the queries that triggered a Query Targeting: Scanned / Returned alert. The chart shows the number of index keys examined to fulfill a query relative to the actual number of returned documents.
You can use the following resources to determine which query generated the alert:
The Real-Time Performance Panel monitors and displays current network traffic and database operations on machines hosting MongoDB in your Atlas clusters.
The MongoDB logs maintain an account of activity, including queries, for each mongod instance in your Atlas clusters.
The following mongod log entry shows statistics generated from an inefficient query:
<Timestamp> COMMAND <query>
planSummary: COLLSCAN keysExamined:0
docsExamined: 10000 cursorExhausted:1 numYields:234
nreturned:4 protocol:op_query 358ms
This query scanned 10,000 documents and returned only 4, for a ratio of 2500, which is highly inefficient. No index keys were examined, so MongoDB scanned all documents in the collection, which is known as a collection scan.
The cursor.explain() command for mongosh provides performance details for all queries.
The Database Profiler records operations that Atlas considers slow when compared to the average execution time of all operations on your cluster.
Note - Enabling the Database Profiler incurs a performance overhead.
MongoDB cannot use a single index to process an $or that looks at different field values.
The index on
{
"updated_at" : 1,
"a_has_sent" : 1,
"b_has_sent" : 1
}
cannot be used with the $or expression to match either a_has_sent or b_has_sent.
To minimize the number of documents examined, create 2 indexes, one for each branch of the $or, combined with the enclosing $and (the filter implicitly combines the top-level query predicates with and). Such as:
{
"updated_at" : 1,
"a_has_sent" : 1
}
and
{
"updated_at" : 1,
"b_has_sent" : 1
}
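As a concrete sketch in mongosh, using the collection name from the question:
// One index per $or branch, so the planner can satisfy each branch with its own index.
db.collectionname.createIndex({ updated_at: 1, a_has_sent: 1 });
db.collectionname.createIndex({ updated_at: 1, b_has_sent: 1 });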
Also note that the alert for Query Targeting: Scanned Objects / Returned has gone above 1000 does not refer to a single query.
The MongoDB server keeps a counter (64-bit?) that tracks the number of documents examined since the server was started, and another counter for the number of documents returned.
The scanned-per-returned ratio is derived by simply dividing the examined counter by the returned counter.
This means that if you have something like a count query that requires examining documents, you may have hundreds or thousands of documents examined but only 1 returned. It won't take many of these kinds of queries to push the ratio over the 1000 alert limit.
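If you want to see the raw numbers behind that ratio, here is a short mongosh sketch, assuming these serverStatus counters are the ones Atlas derives the alert from:
// Server-wide counters since startup:
const m = db.serverStatus().metrics;
const examined = m.queryExecutor.scannedObjects;  // documents examined by queries and plan evaluation
const returned = m.document.returned;             // documents returned to clients
print("examined/returned ratio: " + (examined / returned).toFixed(1));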

MongoDB query optimizer keeps choosing the least efficient index for the query

I have a large collection (~20M records) with moderately sized documents and ~20 indexed fields. All of those indexes are single-field. This collection also gets quite a lot of write and read traffic.
MongoDB version is 4.0.9.
I am seeing at peak times that the query optimizer keeps selecting a very inefficient index for the winning plan.
In the example query:
{
name: 'Alfred Mason',
created_at: { $gt: ... },
active: true
}
All of the fields are indexed:
{ name: 1 }
{ created_at: 1 }
{ active: 1 }
When I run explain(), the winning plan uses the created_at index, which scans ~200k documents before returning the 4 that match the query. Query execution time is ~6000 ms.
If I use $hint to force the name index, it will scan 6 documents before returning 4 that match the query. Execution time is ~2 ms.
Why does the query optimizer keep selecting the slowest index? It does seem suspicious that it only happens during peak hours, when there is more write activity on the collection, but what is the exact reasoning? What can I do about it?
Is it safe to use $hint in a production environment?
Is it reasonable to remove the index on the date field completely, since the $gt query doesn't seem any faster than a COLLSCAN? That could force the query optimizer to use an indexed field. But then again, it could also select another inefficient index (the boolean field).
I can't use compound indexes as there are a lot of use cases that use different combinations of all 20 indexes available.
There could be a number of reasons why Mongo appears to not be using the best execution plan, including:
The running time and execution plan estimate for the single-field index on the name field is not accurate. This could be due to bad statistics, i.e. Mongo is making an estimate using stale or out-of-date information.
While the created_at index is not optimal for your particular query, it would be optimal for most of the possible queries on this field in general.
My answer here is actually that you should probably be using a multiple field index, given that you are filtering on multiple fields. For the example filter you gave in the question:
{
name: 'Alfred Mason',
created_at: { $gt: ... },
active: true
}
I would suggest trying both of the following indices:
db.getCollection('your_collection').createIndex(
{ "name": 1, "created_at": 1, "active": 1 } );
and
db.getCollection('your_collection').createIndex(
{ "created_at": 1, "name": 1, "active": 1 } );
Whether you would want created_at to be first in the index, or rather name to be first, depends on which field has the higher cardinality. Cardinality basically means how unique the values in a given field are. If every name in the collection is distinct, then you would probably want name to be first. On the other hand, if every created_at timestamp is expected to be unique, then it might make sense to put that field first. As for active, it appears to be a boolean field, and as such can only take on two values (true/false). It should be last in the index (and you might even be able to omit it entirely).
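If you are unsure which field wins on cardinality, one rough way to check is to count distinct values; a sketch in mongosh (this can be slow on a ~20M document collection):
// Approximate cardinality of each candidate leading field:
db.your_collection.aggregate([ { $group: { _id: "$name" } }, { $count: "distinct_names" } ]);
db.your_collection.aggregate([ { $group: { _id: "$created_at" } }, { $count: "distinct_created_at" } ]);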
I do not think it is necessary to index all fields; it is better to choose the appropriate fields.
Prefixes in Compound Indexes may be useful to you.
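As a sketch of what prefixes buy you: a compound index such as { name: 1, created_at: 1, active: 1 } also serves queries on its prefixes (the date below is just a placeholder):
db.your_collection.find({ name: "Alfred Mason" });                          // uses the prefix { name: 1 }
db.your_collection.find({ name: "Alfred Mason",
                          created_at: { $gt: ISODate("2019-01-01") } });    // uses { name: 1, created_at: 1 }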

What is the correct way to Index in MongoDB when big combination of fields exist

Considering I have a search panel that includes multiple options, like in the picture below:
I'm working with Mongo and have created a compound index on 3-4 properties in a specific order.
But when I run different combinations of searches, I see a different execution plan (explain()) every time. Sometimes I see a collection scan (bad), and sometimes it fits the index correctly (IXSCAN).
The selective fields that should be handled by Mongo indexes are: brand, Types, Status, Warehouse, Carries, Search (only by id).
My question is:
Do I have to create every combination of all fields in different orders? That could be 10-20 compound indexes. Or 1-3 big compound indexes? But again, that will not solve the ordering problem.
What is the best strategy to deal with a large variety of field combinations?
I use the same query structure with different combinations of fields:
// Example Query.
// fields could be different every time according to user select (and order) !!
db.getCollection("orders").find({
'$and': [
{
'status': {
'$in': [
'XXX',
'YYY'
]
}
},
{
'searchId': {
'$in': [
'3859447'
]
}
},
{
'origin.brand': {
'$in': [
'aaaa',
'bbbb',
'cccc',
'ddd',
'eee',
'bundle'
]
}
},
{
'$or': [
{
'origin.carries': 'YYY'
},
{
'origin.carries': 'ZZZ'
},
{
'origin.carries': 'WWWW'
}
]
}
]
}).sort({"timestamp":1})
// My compound index is:
{ status: 1, searchId: -1, "origin.brand": 1, "origin.carries": 1, timestamp: 1 }
but that is only one combination... there could be plenty, like:
a. { status: 1 }
b. { status: 1, searchId: -1 }
c. { status: 1, searchId: -1, "origin.brand": 1 }
d. { status: 1, searchId: -1, "origin.brand": 1, "origin.carries": 1 }
........
Additionally, what will happen to write/read performance? I think write performance will decrease while reads improve...
The query patterns are:
1. find(...) with $and/$or + sort
2. Aggregation with $match/$sort
Thanks
Generally, indexes are only useful if they are over a selective field. This means the number of documents that have a particular value is small relative to the overall number of documents.
What "small" means varies on the data set and the query. A 1% selectivity is pretty safe when deciding whether an index makes sense. If an particular value exists in, say, 10% of documents, performing a table scan may be more efficient than using an index over the respective field.
With that in mind, some of your fields will be selective and some will not be. For example, I suspect filtering by "OK" will not be very selective. You can eliminate non-selective fields from indexing considerations - if someone wants all orders which are "OK" with no other conditions they'll end up doing a table scan. If someone wants orders which are "OK" and have other conditions, whatever index is applicable to other conditions will be used.
Now that you are left with selective (or at least somewhat selective) fields, consider what queries are both popular and selective. For example, perhaps brand+type would be such a combination. You could add compound indexes that match popular queries which you expect to be selective.
Now, what happens if someone filters by brand only? This could be selective or not depending on the data. If you already have a compound index on brand+type, you'd leave it up to the database to determine whether a brand only query is more efficient to fulfill via the brand+type index or via a collection scan.
Continue in this manner with other popular queries and fields.
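As a sketch (mongosh) of such a compound index for a popular, selective combination - "origin.brand" comes from the query in the question, while "type" is an assumed field name:
// Compound index for the popular brand+type combination; a brand-only query can still use the { "origin.brand": 1 } prefix.
db.getCollection("orders").createIndex({ "origin.brand": 1, "type": 1 });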
So you have subdocuments, ranged queries, and sorting by 1 field only.
That can eliminate most of the possible permutations, assuming there are no other surprises.
D. SM already covered selectivity - you should really listen to what the man says and at least upvote.
The other thing to consider is the order of the fields in the compound index:
fields that have a direct match like $eq
fields you sort on
fields with range queries: $in, $lt, $or, etc.
These are common rules for all b-trees. Now things that are specific to mongo:
A compound index can include no more than one multikey field - that is, a field indexed through an array of subdocuments, like "origin.brand". Again, I assume origin is an array of embedded docs, so the document's shape is like this:
{
_id: ...,
status: ...,
timestamp: ....,
origin: [
{brand: ..., carries: ...},
{brand: ..., carries: ...},
{brand: ..., carries: ...}
]
}
For your query the best index would be
{
searchId: 1,
timestamp: 1,
status: 1, /** only if it is selective enough **/
"origin.carries" : 1 /** or brand, depending on data **/
}
Regarding the number of indexes - it depends on data size. Ensure all indexes fit into RAM, otherwise it will be really slow.
Last but not least - indexing is not a one-off job but a lifestyle. Data changes over time, and so do queries. If you care about performance and have finite resources, you should keep an eye on the database. Check slow queries to add new indexes, collect stats from users' queries to remove unused indexes and free up some room. Basically, apply common sense.
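One way to collect those usage stats is the $indexStats aggregation stage, which reports how often each index has been used since the last restart; a short mongosh sketch:
// Per-index usage counters, useful for spotting unused indexes to drop:
db.getCollection("orders").aggregate([ { $indexStats: {} } ]);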
I noticed this one-year-old topic because I am more or less struggling with a similar issue: users can request queries with an unpredictable set of fields, which makes it nearly impossible to decide (or change) how indexes should be defined.
Even worse: the user should indicate some value (or range) for the fields that make up the sharding key, otherwise we cannot help MongoDB limit its search to only a few shards (or chunks, for that matter).
When the user needs the liberty to search on other fields that are not necessarily the ones which make up the sharding key, then we're stuck with a full-database search. Our database is some 10's of TB in size...
Indexes should fit in RAM? That can only be achieved with small databases, meaning some 100's of GB max. How about my 37 TB database? Indexes won't fit in RAM.
So I am trying out a POC inspired by the UNIX filesystem structures where we have inodes pointing to data blocks:
we have a cluster with 108 shards, each contains 100 chunks
at insert time, we take some fields of which we know they yield a good cardinality of the data, and we compute the sharding-key with those fields; the document goes into the main collection (call it "Main_col") on that computed shard, so with a certain chunk-number (equals our computed sharding-key value)
from the original document, we take a few 'crucial' fields (the list of such fields can evolve as your needs change) and store a small extra document in another collection (call these "Crucial_col_A", Crucial_col_B", etc, one for each such field): that document contains the value of this crucial field, plus an array with the chunk-number where the original full document has been stored in the 'big' collection "Main_col"; consider this as a 'pointer' to the chunk in collecton "Main_col" where this full document exists. These "Crucial_col_X" collections are sharded based on the value of the 'crucial' field.
when we insert another document that has the same value for some 'crucial' field "A", then that array in "Crucial_col_A" with chunk-numbers will be updated (with 'merge') to contain the different or same chunk number of this next full document from "Main_col"
a user can now define queries with criteria for at least one of those 'crucial' fields, plus (optionally) any other criteria on other fields in the documents; the first criterion, for the crucial field (say field "B"), will run very quickly (because sharded on the value of "B") and return the small document from "Crucial_col_B", in which we have the array of chunk-numbers in "Main_col" where any document exists that has field "B" equal to the given criterion. Then we run a second set of parallel queries, one for each shardkey-value=chunk-number (or one per shard, to be decided) that we find in the array from before. We combine the results of those parallel subqueries, and then apply further filtering if the user gave additional criteria.
Thus this involves 2 query-steps: first in the "Crucial_col_X" collection to obtain the array with chunk-numbers where the full documents exist, and then the second query on those specific chunks in "Main_col".
The first query is done with a precise value for the 'crucial' field, so the exact shard/chunk is known, thus this query goes very fast.
The second (set of) queries are done with precise values for the sharding-keys (= the chunk numbers), so these are expected to go also very fast.
This way of working would eliminate the burden of defining many index combinations.
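A minimal sketch (mongosh) of those two query steps; the collection names come from the description above, while the field names B, chunks and shard_key are assumptions for illustration:
// Step 1: look up the 'pointer' document for crucial field "B" (Crucial_col_B is sharded on B, so this targets one shard).
const ptr = db.Crucial_col_B.findOne({ B: "someValue" });
// Step 2: query the main collection, restricted to the chunks recorded in the pointer document.
const docs = db.Main_col.find({
    shard_key: { $in: ptr.chunks },   // array of chunk numbers kept in the pointer doc
    B: "someValue"                    // plus any additional user criteria
}).toArray();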

How to query data efficiently in large mongodb collection?

I have one big MongoDB collection (3 million docs, 50 GB), and it is very slow to query the data even though I have created indexes.
db.collection.find({"C123":1, "C122":2})
e.g. the query will time out or be extremely slow (10 s at least), even though I have created separate indexes for C123 and C122.
Should I create more indexes or increase the physical memory to accelerate querying?
For such a query you should create a compound index on both fields; then it should be very efficient. Creating separate indexes won't help you much, because the MongoDB engine will use the first one to get the results for the first part of the query, and the second one, if used at all, won't help much (in some cases it can even slow your query down because of the extra lookup in the index and then in the real data again). You can confirm which indexes are used with .explain() on your query in the shell.
See compound indexes:
https://docs.mongodb.com/manual/core/index-compound/
Also consider sorting directions on both your fields while making indexes.
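As a sketch of that suggestion, using the field names from the question:
// One compound index covering both equality predicates:
db.collection.createIndex({ C123: 1, C122: 1 });
// Confirm the planner uses it (look for IXSCAN and a low docsExamined):
db.collection.find({ "C123": 1, "C122": 2 }).explain("executionStats");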
The answer is really simple.
You don't need to create more indexes, you need to create the right indexes. An index on field c124 won't help queries on field c123, so there is no point in creating it.
Use better/more hardware: more RAM, more machines (sharding).
Create the right indexes and use compound indexes carefully. (You can have at most 64 indexes per collection and 31 fields in a compound index.)
Use Mongo-side pagination.
Try to find the most-used queries and build a compound index around them.
A compound index strictly follows its field sequence, so read the documentation and run trials.
Also try covered queries for 'summary'-like queries (see the sketch below).
Learned it the hard way...
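A sketch of such a covered query, reusing the fields from the earlier question: when the filter and the projection only touch indexed fields and _id is excluded, MongoDB can answer from the index alone.
// With the index { C123: 1, C122: 1 } in place, this query is covered:
db.collection.find(
    { "C123": 1, "C122": 2 },
    { _id: 0, "C123": 1, "C122": 1 }   // project only indexed fields and exclude _id
);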
Use skip and limit. Run a loop processing 50,000 documents at a time.
https://docs.mongodb.com/manual/reference/method/cursor.skip/
https://docs.mongodb.com/manual/reference/method/cursor.limit/
Example:
// Source collection name is not given in the original answer; "myCollection" is a placeholder.
var pageSize = 50000;
var page = 0;  // increment this in a loop to walk through the data in batches
db.myCollection.aggregate(
    [
        {
            $group: {
                _id: "$myDoc.homepage_domain",
                count: { $sum: 1 },
                entry: {
                    $push: {
                        location_city: "$myDoc.location_city",
                        homepage_domain: "$myDoc.homepage_domain",
                        country: "$myDoc.country",
                        employee_linkedin: "$myDoc.employee_linkedin",
                        linkedin_url: "$myDoc.linkedin_url",
                        homepage_url: "$myDoc.homepage_url",
                        industry: "$myDoc.industry",
                        read_at: "$myDoc.read_at"
                    }
                }
            }
        },
        { $skip: page * pageSize },  // skip the batches already processed
        { $limit: pageSize }         // then take the next batch of 50000
    ],
    { allowDiskUse: true }
).forEach(function (group) {
    // copy a small summary of each group into another collection
    print(
        db.Or9.insert({
            "HomepageDomain": group._id,
            "location_city": group.entry[0].location_city
        })
    );
});

Best shard key (or optimised query) for range query on sub-document array

Below is a simplified version of a document in my database:
{
_id : 1,
main_data : 100,
sub_docs: [
{
_id : a,
data : 100
},
{
_id: b,
data : 200
},
{
_id: c,
data: 150
}
]
}
So imagine I have lots of these documents with varied data values (say 0 - 1000).
Currently my query is something like:
db.myDb.find(
{ sub_docs: { $elemMatch: { data: { $gte: 110, $lt: 160 } } } }
)
Is there any shard key I could use to help this query? Currently it queries all shards.
If not, is there a better way to structure my query?
Jackson,
You are thinking about this problem the right way. The problem with broadcast queries in MongoDB is that they can't scale.
Any MongoDB query that does not filter on the shard key will be broadcast to all shards. Also, range queries are likely to either cause broadcasts or at the very least cause your queries to be sent to multiple shards.
So here are some things to think about:
Query Frequency -- Is the range query your most frequent query? What is the expected workload?
Range Logic -- Is there any intrinsic logic to how you are going to apply the ranges? Let's say 0-200 is small and 200-400 is medium. You could potentially add another field to your document and shard on it.
Additional shard key candidates -- Sometimes there are other fields that can be included in all or most of your queries and would provide good distribution. By combining filtering with your range queries you could restrict your query to fewer shards, or even one.
Break array -- You could potentially have multiple documents instead of an array (see the sketch below). In this scenario you would have multiple docs, one for each occurrence of the array, and the main data would be duplicated across multiple documents. A range query on this item would still be a problem, but it could involve multiple shards, not necessarily all (it depends on your data demographics and query patterns).
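A sketch of the "Break array" option applied to the sample document - one document per sub_docs element, with main_data duplicated ("myDb_flat" and "parent_id" are assumed names):
// Each former array element becomes its own document:
db.myDb_flat.insertMany([
    { _id: "1_a", parent_id: 1, main_data: 100, data: 100 },
    { _id: "1_b", parent_id: 1, main_data: 100, data: 200 },
    { _id: "1_c", parent_id: 1, main_data: 100, data: 150 }
]);
// The range query then targets a plain top-level field, which can be indexed and considered as a shard key candidate:
db.myDb_flat.find({ data: { $gte: 110, $lt: 160 } });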
It boils down to the nature of your data and queries. The sample document that you provided is very anonymized so it is harder to know what would be good shard key candidates in your domain.
One last piece of advice: be careful with your insert/update patterns if you plan to update your documents frequently to add more entries to the array. Growing documents present scaling problems for MongoDB. See this article on the topic.