MongoDB find and iterate vs count

I have a peculiar problem with Mongo.
We have a collection of 800k documents with the following structure.
{
  "_id" : ObjectId("5bd844199114bab3b2c19fab"),
  "u" : 0,
  "c" : 0,
  "iden" : "343754856",
  "name" : "alan",
  "email" : "mkasd#abc.com",
  "mobile" : "987654321093456",
  "expires" : ISODate("2018-11-29T11:44:25.453Z"),
  "created" : ISODate("2018-10-30T11:44:25.453Z")
}
We have indexes on iden and name, on which we generally query.
We tried two types of queries.
db.Collection.find({"iden": "343754856", "name": "alan", "created": {"$gt": ....}}).count()
where "created" is an unindexed field.
db.Collection.find({"iden": "343754856", "name": "alan"})
and iterate over all records to filter based on created.
However, MongoDB seems to take an enormous amount of time executing the second query, even though it was supposed to be an optimization over the first.
Any leads on what is going wrong here?
We are using a Go library.

How could the second version be an optimization over the first?
Your first query retrieves a single number from the MongoDB server: the count of the query result. Your second version fetches all matching documents, and you do the counting on the client side.
Believe me, MongoDB can count the result documents internally just as fast as you could in your Go client. Making the MongoDB server send the results, then fetching and unmarshaling them on the client, takes orders of magnitude more time (depending on a lot of factors).
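To see the difference concretely, both of these shell expressions yield the same number, but only the first keeps the work on the server (a sketch against the collection above; it requires a live deployment to run):

```javascript
// Server counts and ships back a single integer:
db.Collection.find({ iden: "343754856", name: "alan" }).count()

// Server ships every matching document; the client unmarshals them all
// just to take a length -- the expensive variant:
db.Collection.find({ iden: "343754856", name: "alan" }).toArray().length
```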
Please note that if you have a compound index containing "iden" and "name", the index may still be used even if you add more filters (like "created" in your example), but MongoDB has to iterate over the partial results to apply the rest of the filter. To find out whether the index is used, execute the following command:
db.Collection.find(
  {"iden": "343754856", "name": "alan", "created": {"$gt": ....}}
).explain()


How to eliminate Query Targeting: Scanned Objects / Returned has gone above 1000 in MongoDB?

There are some questions (1, 2) that discuss the MongoDB warning Query Targeting: Scanned Objects / Returned has gone above 1000; however, my question is a different case.
The schema of our document is
{
  "_id" : ObjectId("abc"),
  "key" : "key_1",
  "val" : "1",
  "created_at" : ISODate("2021-09-25T07:38:04.985Z"),
  "a_has_sent" : false,
  "b_has_sent" : false,
  "updated_at" : ISODate("2021-09-25T07:38:04.985Z")
}
The indexes of this collection are
{
  "key" : {
    "updated_at" : 1
  },
  "name" : "updated_at_1",
  "expireAfterSeconds" : 5184000,
  "background" : true
},
{
  "key" : {
    "updated_at" : 1,
    "a_has_sent" : 1,
    "b_has_sent" : 1
  },
  "name" : "updated_at_1_a_has_sent_1_b_has_sent_1",
  "background" : true
}
The total number of documents after 2021-09-24 is over 600000, and there are 5 distinct values of key.
The above warning is caused by the query
db.collectionname.find({ "updated_at": { "$gte": ISODate("2021-09-24")}, "$or": [{ "a_has_sent": false }, {"b_has_sent": false}], "key": "key_1"})
Our server sends each document to a and b simultaneously, with a batch size of 2000. After sending to a successfully, we mark a_has_sent as true; the same logic applies to b. As the sending process goes on, the number of documents with a_has_sent: false decreases, and the above warning comes up.
Checking the explain result of this query shows that the index named updated_at_1 is used rather than updated_at_1_a_has_sent_1_b_has_sent_1.
What we have tried:
We added a new index {"updated_at": 1, "key": 1}, expecting this query to use the new index and reduce the number of scanned documents. Unfortunately, we failed: the index named updated_at_1 is still used.
We tried replacing find with aggregate:
aggregate([{"$match": { "updated_at": { "$gte": ISODate("2021-09-24") }, "$or": [{ "a_has_sent": false }, { "b_has_sent": false}], "key": "key_1"}}]). Unfortunately, the index named updated_at_1 is still used.
How can we eliminate the warning Scanned Objects / Returned has gone above 1000?
Mongo 4.0 is used in our case.
Follow the ESR rule
For compound indexes, this rule of thumb is helpful in deciding the order of fields in the index:
First, add those fields against which Equality queries are run.
The next fields to be indexed should reflect the Sort order of the query.
The last fields represent the Range of data to be accessed.
We created the index {"key" : 1, "a_has_sent" : 1, "b_has_sent" : 1, "updated_at" : 1}, and the query can use this index now.
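A sketch of that ordering rule as code (esrIndexKeys is an illustrative helper, not a MongoDB API; the field classification is taken from the query in the question):

```javascript
// ESR rule sketch: a compound index lists Equality fields first,
// then Sort fields, then Range fields.
function esrIndexKeys(equality, sort, range) {
  return [...equality, ...sort, ...range];
}

// For the query above: key, a_has_sent and b_has_sent are equality
// predicates, there is no sort, and updated_at is a range ($gte) predicate.
const keys = esrIndexKeys(['key', 'a_has_sent', 'b_has_sent'], [], ['updated_at']);
// keys: ['key', 'a_has_sent', 'b_has_sent', 'updated_at']
```

Equality fields lead because they narrow the scan the most; the range field goes last so the index prefix stays tightly bounded.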
Update 08/15/2022
Query Targeting alerts indicate inefficient queries.
Query Targeting: Scanned Objects / Returned occurs if the number of documents examined to fulfill a query relative to the actual number of returned documents meets or exceeds a user-defined threshold. The default is 1000, which means that a query must scan more than 1000 documents for each document returned to trigger the alert.
Here are some steps to solve this issue
First, the Performance Advisor provides the easiest and quickest way to create an index. If there is a Create Indexes suggestion, you can create the recommended index.
Then, if there is no recommended index in the Performance Advisor, you could check the Query Profiler. It contains several metrics you can use to pinpoint specific inefficient queries, including the Examined : Returned ratio (index keys examined to documents returned) of logged queries, which might help you identify the queries that triggered a Query Targeting: Scanned / Returned alert. The chart shows the number of index keys examined to fulfill a query relative to the actual number of returned documents.
You can use the following resources to determine which query generated the alert:
The Real-Time Performance Panel monitors and displays current network traffic and database operations on machines hosting MongoDB in your Atlas clusters.
The MongoDB logs maintain an account of activity, including queries, for each mongod instance in your Atlas clusters.
The following mongod log entry shows statistics generated from an inefficient query:
<Timestamp> COMMAND <query>
planSummary: COLLSCAN keysExamined:0
docsExamined: 10000 cursorExhausted:1 numYields:234
nreturned:4 protocol:op_query 358ms
This query scanned 10,000 documents and returned only 4, for a ratio of 2500, which is highly inefficient. No index keys were examined, so MongoDB scanned all documents in the collection, known as a collection scan.
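As a quick sanity check of that arithmetic (a trivial sketch; scannedPerReturned is an illustrative helper, not a MongoDB API):

```javascript
// The Query Targeting ratio from the log entry above:
// documents examined divided by documents returned.
function scannedPerReturned(docsExamined, nreturned) {
  return docsExamined / nreturned;
}

const ratio = scannedPerReturned(10000, 4);
// 10000 / 4 = 2500, well above the default alert threshold of 1000
```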
The cursor.explain() command for mongosh provides performance details for all queries.
The Database Profiler records operations that Atlas considers slow when compared to the average execution time for all operations on your cluster.
Note - Enabling the Database Profiler incurs a performance overhead.
MongoDB cannot use a single index to process an $or that looks at different field values.
The index on
{
  "updated_at" : 1,
  "a_has_sent" : 1,
  "b_has_sent" : 1
}
cannot be used with the $or expression to match either a_has_sent or b_has_sent.
To minimize the number of documents examined, create 2 indexes, one for each branch of the $or, combined with the enclosing $and (the filter implicitly ANDs the top-level query predicates). Such as:
{
  "updated_at" : 1,
  "a_has_sent" : 1
}
and
{
  "updated_at" : 1,
  "b_has_sent" : 1
}
Also note that the alert for Query Targeting: Scanned Objects / Returned has gone above 1000 does not refer to a single query.
The MongoDB server keeps a counter (64-bit?) that tracks the number of documents examined since the server was started, and another counter for the number of documents returned.
The scanned-per-returned ratio is derived by simply dividing the examined counter by the returned counter.
This means that if you have something like a count query that requires examining documents, you may have hundreds or thousands of documents examined but only 1 returned. It won't take many of these kinds of queries to push the ratio over the 1000 alert limit.
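A toy sketch of those cumulative counters (the field names are illustrative assumptions, not actual server internals):

```javascript
// Server-wide counters: examined and returned accumulate across all
// operations since startup; the alert ratio is their quotient.
const counters = { examined: 0, returned: 0 };

function record(examined, returned) {
  counters.examined += examined;
  counters.returned += returned;
}

// A well-indexed query: one document examined per document returned.
record(500, 500);
// A count-style query: 600k documents examined, a single result returned.
record(600000, 1);

const ratio = counters.examined / counters.returned;
// roughly 1199 -- already past the 1000 threshold
```

One count-heavy operation is enough to drag the cumulative ratio over the threshold even when the surrounding queries are perfectly efficient.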

Multiple nested arrays in MongoDB

I am having difficulty figuring out an effective way of working with a document that has multiple levels of nesting. It looks like the following:
{
  "_id" : { "$oid" : "53ce46e3f0c25036e7b0ddd8" },
  "someid" : 7757099,
  "otherids" : [
    {
      "id" : 100,
      "line" : "test",
      "otherids" : [ { "id" : 129 } ]
    }
  ]
}
and there will be another level of arrays in addition.
I cannot find a way to query this structure any deeper than the "otherids" array. Is this possible to do in an effective way at all?
These arrays might grow a bit, but not hugely.
My thought was to use this structure since it makes it efficient to fetch a lot of data in one go. But this data also needs to be updated quite often. Is this a hopeless design with MongoDB?
Regards mongoDB newb
EDIT:
I would like to do it as simply and as fast as possible :-)
Like: someid.4.otherids.2.line -> somevalue
I know I would probably have to query to check whether values exist, but it would be nice to do it as an upsert. Right now I only work with objects in Java, and it takes 14 seconds to insert 10 000 records. Most of these inserts are "leaf nodes", meaning I have to query, find out what is already there, modify the document, then update the whole root. This takes too long.
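On the dot-notation path shown above (someid.4.otherids.2.line): MongoDB's update operators such as $set address nested array elements with exactly this kind of numeric dot path. As a plain-JavaScript sketch of those path semantics (setPath is an illustrative helper, not a MongoDB or driver API):

```javascript
// "otherids.0.otherids.0.id" walks object keys and array indexes alike,
// the same way $set interprets a dotted path.
function setPath(doc, path, value) {
  const parts = path.split('.');
  let cur = doc;
  for (let i = 0; i < parts.length - 1; i++) {
    const key = parts[i];
    if (cur[key] === undefined) {
      // create an array if the next segment is numeric, else an object
      cur[key] = /^\d+$/.test(parts[i + 1]) ? [] : {};
    }
    cur = cur[key];
  }
  cur[parts[parts.length - 1]] = value;
  return doc;
}

const doc = {
  someid: 7757099,
  otherids: [{ id: 100, line: 'test', otherids: [{ id: 129 }] }]
};
setPath(doc, 'otherids.0.otherids.0.id', 500);
```

The shell equivalent would be roughly db.collection.updateOne({ someid: 7757099 }, { $set: { "otherids.0.otherids.0.id": 500 } }), which updates the leaf in place without rewriting the whole root document.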

Bug for collections that are sharded over a hashed key

When querying for large amounts of data in sharded collections we benefited a lot from querying the shards in parallel.
The following problem only occurs in collections that are sharded over a hashed key.
In Mongo 2.4 it was possible to query with hash borders in order to get all data of one chunk.
We used the query from this post.
It is a range query with hash values as borders:
db.collection.find(
{ "_id" : { "$gte" : -9219144072535768301,
"$lt" : -9214747938866076750}
}).hint({ "_id" : "hashed"})
The same query also works in 2.6, but it takes a long time.
The explain() shows that it is using the index, but the number of scanned objects is way too high.
"cursor" : "BtreeCursor _id_hashed",
Furthermore, the borders are wrong.
"indexBounds" : {
  "_id" : [
    [
      { "$minElement" : 1 },
      { "$maxElement" : 1 }
    ]
  ]
},
Was there some big change from 2.4 to 2.6 that breaks this query?
Even if the borders are interpreted as non-hash values, why does it take so long?
Is there some other way to get all documents of one chunk or hash index range?
The MongoDB-internal Hadoop connector also has this problem with sharded collections.
Thanks!
The query above working in 2.4 was not supported behavior. See SERVER-14557 for a similar complaint and an explanation of how to properly perform this query. Reformatted for proper behavior, your query becomes:
db.collection.find().min({ _id : -9219144072535768301}).max({ _id : -9214747938866076750}).hint({_id : "hashed"})
As reported in the SERVER ticket, there is an additional bug (SERVER-14400) that prevents this query from being targeted at a single shard, and at this point in time there are no plans to address it in 2.6. This should, however, prevent the table scan you are seeing under 2.6 and allow for more efficient retrieval.

$natural order avoids indexes. How does orderby effect the use of indexes?

Profiling slow queries I found something really strange: For the following operation the entire collection was scanned (33061 documents) even though there is an index on the query parameter family_id:
{
  "ts" : ISODate("2013-11-27T10:20:26.103Z"),
  "op" : "query",
  "ns" : "mydb.zones",
  "query" : {
    "$query" : {
      "family_id" : ObjectId("52812295ea84d249934f3d12")
    },
    "$orderby" : {
      "$natural" : 1
    }
  },
  "ntoreturn" : 20,
  "ntoskip" : 0,
  "nscanned" : 33061,
  "keyUpdates" : 0,
  "numYield" : 91,
  "lockStats" : {
    "timeLockedMicros" : {
      "r" : NumberLong(83271),
      "w" : NumberLong(0)
    },
    "timeAcquiringMicros" : {
      "r" : NumberLong(388988),
      "w" : NumberLong(22362)
    }
  },
  "nreturned" : 7,
  "responseLength" : 2863,
  "millis" : 393,
  "client" : "127.0.0.1",
  "user" : "mydb"
}
After some fruitless Google searches I found out that leaving out the "$orderby": { "$natural" : 1 } makes the query very fast: only 7 documents are scanned instead of 33061. So I assume that using $orderby in my case prevents the index on family_id from being used. The strange thing is that the resulting order is the same in either case. As far as I understand $natural order, using "$orderby": { "$natural" : 1 } should be tautological to specifying no explicit order. Another very interesting observation is that this issue does not arise on capped collections!
This issue raises the following questions:
If not using any ordering/sorting, shouldn't the resulting order be the order on disk, i.e. $natural order?
Can I create a (compound-)index that would be used sorting naturally?
How can I invert the ordering of a simple query that uses an index and no sorting, without severe performance losses?
What happens behind the scenes when using query parameters and orderby? Why is this not happening on capped collections? I would like to understand this strange behaviour.
Are the answers of the above questions independent of whether you use sharding/replication or not? What is the natural order of a query over multiple shards?
Note: I am using MongoDB 2.2. There is a ticket related to this issue: https://jira.mongodb.org/browse/SERVER-5672. Though it seems from that ticket that the issue occurs in capped collections too, which I cannot confirm (maybe due to different MongoDB versions).
As far as I understand $natural order it is tautologically to use
"$orderby": { "$natural" : 1} or no explicit order.
This is a misdescription of $natural order. MongoDB stores records in some order on disk and keeps track of them via a doubly linked list. $natural order is the order you get when you traverse that linked list. And even if you do not specify $natural, that is what you will always get - not random order, not insertion order, not physical disk order, but "logical" disk order: the order the records appear in when traversing the linked list.
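The linked-list behaviour described here can be modelled in a few lines of plain JavaScript (a toy sketch of the record store of that era; none of these names are real MongoDB APIs):

```javascript
// Records live in a doubly linked list; $natural order is the order
// obtained by walking that list, not insertion or physical order.
function makeStore() {
  return { head: null, tail: null };
}

function append(store, rec) {
  rec.prev = store.tail;
  rec.next = null;
  if (store.tail) store.tail.next = rec; else store.head = rec;
  store.tail = rec;
}

function unlink(store, rec) {
  if (rec.prev) rec.prev.next = rec.next; else store.head = rec.next;
  if (rec.next) rec.next.prev = rec.prev; else store.tail = rec.prev;
}

function naturalOrder(store) {
  const ids = [];
  for (let r = store.head; r; r = r.next) ids.push(r._id);
  return ids;
}

// Insert A, B, C; then B outgrows its slot and is moved (unlink + re-append).
const store = makeStore();
const a = { _id: 'A' }, b = { _id: 'B' }, c = { _id: 'C' };
[a, b, c].forEach(r => append(store, r));
unlink(store, b);
append(store, b);
```

After the "move", traversal yields A, C, B even though insertion order was A, B, C - which is why $natural order is neither insertion order nor a guaranteed physical order.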
If not using any ordering/sorting, shouldn't the resulting order be
the order on disk, i.e. $natural order?
Yes, assuming you understand that "disk order" is not strictly physical order - it's the order the records are in within the linked list.
Can I create a (compound-)index that would be used sorting naturally?
I don't know what you mean by sorting naturally - if you are using an index during a query, the documents are traversed in index order, not in $natural order.
How can I invert the ordering of a simple query that uses an index and no sorting without severe performance losses?
You cannot - if you are using an index, then you will get the records in index order. Your options are to get them in that order, in reverse of that order, or to create a compound index where you index by the fields you are searching on and the field(s) you want to sort on.
What happens behind the scenes when using query parameters and orderby? Why is this not happening on capped collections? I would like to understand this strange behaviour.
What happens depends on what indexes are available, but the query optimizer tries to use an index that helps with both filtering and sorting - if that's not possible it will pick the index that has the best actual performance.
Are the answers of the above questions independent of whether you use
sharding/replication or not? What is the natural order of a query over
multiple shards?
It's a non-deterministic merge of the $natural orders of the individual shards.

MongoDB design for scalability

We want to design a scalable database. If we have N users with 1 billion user responses, which of the 2 options below would be a good design? We want to query based on userID as well as responseID.
Having 2 collections, one for the user information and another to store the responses along with the user ID. Each response is stored as its own document, so we will have 1 billion documents.
User Collection
{
  "userid" : "userid1",
  "password" : "xyz",
  ,
  "City" : "New York"
},
{
  "userid" : "userid2",
  "password" : "abc",
  ,
  "City" : "New York"
}
responses Collection
{
  "userid" : "userid1",
  "responseID" : "responseID1",
  "response" : "xyz"
},
{
  "userid" : "userid1",
  "responseID" : "responseID2",
  "response" : "abc"
},
{
  "userid" : "userid2",
  "responseID" : "responseID3",
  "response" : "mno"
}
Having 1 collection to store both kinds of information, as below. Each response is represented by a new key (responseIDX).
{
  "userid" : "userid1",
  "responseID1" : "xyz",
  "responseID2" : "abc",
  ,
  "responseN" : "mno",
  "city" : "New York"
}
If you go with your first option, I'd use a relational database (like MySQL) as opposed to MongoDB. If you're set on MongoDB, use it to your advantage:
{
  "userId": n,
  "city": "foo",
  "responses": {
    "responseId1": "response message 1",
    "responseId2": "response message 2"
  }
}
As for which would render a better performance, run a few benchmark tests.
Between the two options you've listed - I would think using a separate collection would scale better - or possibly a combination of a separate collection and still using embedded documents.
Embedded documents can be a boon to your schema design - but they do not work as well when you have an endlessly growing set of embedded documents (responses, in your case). This is because of document growth: as the document grows and outgrows the space allocated for it on disk, MongoDB must move it to a new location to accommodate the new document size. That can be expensive and carry severe performance penalties when it happens often or in high-concurrency environments.
Also, querying those embedded documents becomes troublesome when you want to selectively return only a subset of responses, especially across users - you cannot return only the matching embedded documents. Using the positional operator, it is possible to get the first matching embedded document, however.
So, I would recommend using a separate collection for the responses.
Though, as mentioned above, I would also suggest experimenting with other ways to group those responses in that collection. A document per day, per user, per ...whatever other dimensions you might have, etc.
Group them in ways that allow multiple embedded documents and complement how you would query for them. If you can find the sweet spot between still using embedded documents in that collection and minimizing document growth, you'll have fewer overall documents and smaller index sizes. Obviously this requires benchmarking and testing, as the same caveats listed above can apply.
Lastly (and optionally), with that type of data set, consider using increment counters where you can on the front end to supply any type of aggregated reporting you might need down the road. Though the Aggregation Framework in MongoDB is great, having, say, the total response count for a user pre-aggregated is far more convenient than trying to get a count by running an aggregate query on the full dataset.
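As a sketch of that last suggestion (collection and field names here are assumptions for illustration, not from the question): increment a pre-aggregated counter at write time, then read it back with a point lookup instead of aggregating over a billion documents.

```javascript
// Each time a response is written, bump the user's counter in one operation:
db.users.updateOne(
  { userid: "userid1" },
  { $inc: { responseCount: 1 } },
  { upsert: true }
)

// Reading the total is then a single indexed lookup, not an aggregation:
db.users.findOne({ userid: "userid1" }, { userid: 1, responseCount: 1 })
```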