MongoDB statistics on 5 million docs too slow

I have 5 million MongoDB documents like this:
{
  _id: xxx,
  devID: 123,
  logLevel: 5,
  logTime: 1468464358697
}
Index: devID
My aggregation pipeline:
[
  {$match: {devID: 123}},
  {$group: {_id: {level: "$logLevel"}, count: {$sum: 1}}}
]
Aggregation result:
{ "_id" : { "level" : 5 }, "count" : 5175872 }
{ "_id" : { "level" : 1 }, "count" : 200000 }
Aggregation explain:
numYields: 42305
execution time: 29399ms
Q:
If MongoDB is not writing (saving) data, the aggregation takes about 29 seconds.
If MongoDB is writing (saving) data, it takes about 2 minutes.
The aggregation result needs to be returned to the web, so 29 seconds or 2 minutes is too long.
How can I solve this? Preferably 10 seconds or less.
Thanks all.

In your example, the aggregation result for {devID: 123, logLevel: 5} has a count of 5,175,872, which suggests it counted essentially all the documents in your collection (since you mentioned you have 5 million documents).
In this particular example, I'm guessing that the {$match: {devID: 123}} stage matches pretty much every document, hence the aggregation is doing what is essentially a collection scan. Depending on your RAM size, this could have the effect of pushing your working set out of memory and slowing down every other query your server is doing.
If you cannot provide a more selective criteria for the $match stage (e.g. by using a range of logTime as well as devID), then a pre-aggregated report may be your best option.
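For illustration, a sketch of what a more selective query could look like, assuming a compound index on devID and logTime (the collection name log and the time range values below are placeholders):
  // Assumed compound index so the $match can be satisfied efficiently from the index
  db.log.createIndex({devID: 1, logTime: 1})

  // Aggregate over a bounded time window instead of every document for the device
  db.log.aggregate([
    {$match: {
      devID: 123,
      logTime: {$gte: 1468400000000, $lt: 1468464358697}  // example range, adjust as needed
    }},
    {$group: {_id: {level: "$logLevel"}, count: {$sum: 1}}}
  ])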
In general terms, a pre-aggregated report is a document that contains the aggregated information you require, and you update this document every time you insert into the related collection. For example, you could have a single document in a separate collection that looks like:
{
  log: {
    devID: 123,
    levelCount: [
      { level: 5, count: 5175872 },
      { level: 1, count: 200000 }
    ]
  }
}
where that document is updated with the relevant details every time you insert into the log collection.
Using a pre-aggregated report, you don't need to run the aggregation query anymore. The aggregated information you require is available with a single find() query instead.
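As a rough sketch of the update side (the collection name logStats and a one-document-per-devID-and-level layout are assumptions, slightly simplified from the nested document above):
  // Insert the raw log entry as before
  db.log.insertOne({devID: 123, logLevel: 5, logTime: 1468464358697})

  // ...and bump the matching counter in the pre-aggregated collection
  db.logStats.updateOne(
    {devID: 123, level: 5},
    {$inc: {count: 1}},
    {upsert: true}
  )

  // Reading the report is then a plain find() instead of an aggregation
  db.logStats.find({devID: 123})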
For more examples on pre-aggregated reports, please see https://docs.mongodb.com/ecosystem/use-cases/pre-aggregated-reports/

Related

Query performance with $in and $nin in MongoDB at scale

I have 3 collections:
users
mappers
seens
Here is a document in users, where "_id" is the id of a user and the ids array contains a list of other users' _ids:
{
  _id: "uid",
  ids: [
    "uid0",
    "uid5",
    ...
    "uid100"
  ]
}
A document in seens looks exactly like the one in users, except that the ids array contains the ids of mappers that have been seen by the user; the "_id" is that of the user who owns the array.
Here is a mapper, where "_id" is the ID of a user and map.id is an id potentially present in the ids field of a seens document:
{
  _id: "uid",
  at: 1453592,
  map: {
    id: "uid",
    ...
  }
}
I want to retrieve all mappers that meet some conditions:
_id must be in the ids of the user
at must be $lt now and $gt a given value (that is lower than now)
map.id must not be in the ids of the seens of the user
The query looks like this:
{
  "_id": {"$in": ids},
  "$and": [
    {"at": {"$lt": now}},
    {"at": {"$gt": start_date}},
    {"map.id": {"$nin": seens}}
  ]
}
Where ids is the array of the user ids and seens is the array of the mappers already seen.
I have done some experiments with this query, and it works fine with a few thousand records.
However, with 10,000 ids, 10,000 seens and 10,000 mappers, the query takes 15 seconds.
I added an index on at (descending) and map.id (ascending); it now takes 8 seconds.
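For reference, assuming this is a single compound index on the mappers collection (both assumptions), it would look roughly like:
  // Compound index on at (descending) and map.id (ascending), as described above
  db.mappers.createIndex({at: -1, "map.id": 1})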
I simply know that as my collections scale, this will only take longer and longer.
How can I make it always return results in less than 1 second, no matter how many documents are in my collections?
The underlying question is: how do I keep queries that use $in and $nin efficient at scale?

What is the proper way to query MongoDB matching $in with a large collection?

There is a deal collection containing 50 million documents with this structure:
{
  "user": <string>,
  "price": <number>
}
The collection has this index:
{
  "v" : 2,
  "unique" : false,
  "key" : {
    "user" : 1
  },
  "name" : "deal_user",
  "background" : true,
  "sparse" : true
}
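For reference, the shell command that produces an index like this is roughly:
  // Sparse, background-built index on user, matching the index document above
  db.deal.createIndex(
    {user: 1},
    {name: "deal_user", background: true, sparse: true}
  )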
I am trying to execute this aggregation:
db.getCollection('deal').aggregate([
  {
    $match: {user: {$in: [
      "u-4daf809bd",
      "u-b967c0a84",
      "u-ae03417c1",
      "u-805ce887d",
      "u-0f5160823",
      /* more items */
    ]}}
  },
  {
    $group: {
      _id: "$user",
      minPrice: {$min: "$price"}
    }
  }
])
When the array size inside $match / $in is less than ~50, the query response time is mostly under 2 seconds. For large arrays (size > 200), the response time is around 120 seconds.
I also tried chunking the array into parts of 10-50 elements and querying in parallel. There is a strange (reproducible) effect: most of the queries (~80%) respond fast (2-5 seconds), but some hang for ~100 seconds, so parallelisation did not bear fruit.
I guess there are some kind of "slow" and "normal" values, but I cannot explain what is going on, because they all belong to the same index and are expected to be fetched in approximately the same time. It correlates slightly with the number of duplicates in the user field (i.e. per grouped value), but a large number of duplicates does not always entail "slow".
Please help me understand why this MongoDB query behaves this way.
What is the proper way to do queries of this kind?

How to delete duplicates using MongoDB Aggregations in MongoDB Compass Community

I somehow created duplicates of every single entry in my database. Currently there are 176,039 documents and counting; half are duplicates. Each document is structured like so:
_id : 5b41d9ccf10fcf0014fe8917
originName : "Hartsfield Jackson Atlanta International Airport"
destinationName : "Antigua"
totalDuration : 337
Inside the MongoDB Compass Community app for Mac, under the Aggregations tab, I was able to find the duplicates using this pipeline:
[
  {$group: {
    _id: {originName: "$originName", destinationName: "$destinationName"},
    count: {$sum: 1}}},
  {$match: {count: {"$gt": 1}}}
]
I'm not sure how to move forward and delete the duplicates at this point. I'm assuming it has something to do with $out.
Edit: Something I didn't notice until now is that the values for totalDuration on each duplicate are actually different.
Add
  {$project: {_id: 0, originName: "$_id.originName", destinationName: "$_id.destinationName"}},
  {$out: "collectionname"}
This will replace the documents in your current collection with the documents produced by the aggregation pipeline. If you need totalDuration in the collection, then add that field in both the $group and $project stages before running the pipeline, as in the sketch below.
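Put together, a sketch of the full pipeline could look like the following. Keeping one totalDuration per pair via $first is an assumption about which value you want to retain, the $match on count is dropped so that non-duplicated pairs are kept as well, and the $out target is a placeholder (pointing it at the current collection replaces its contents, as noted above):
  [
    {$group: {
      _id: {originName: "$originName", destinationName: "$destinationName"},
      totalDuration: {$first: "$totalDuration"},  // keeps one of the differing durations; adjust if you need a different rule
      count: {$sum: 1}}},
    {$project: {
      _id: 0,
      originName: "$_id.originName",
      destinationName: "$_id.destinationName",
      totalDuration: 1}},
    {$out: "collectionname"}
  ]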

MongoDB sort is slow for non-indexed dynamic field

The following is my MongoDB query to show the organization listing along with the user count per organization. As per my data model, the "users" collection has an array userOrgMap which maintains the organizations (by orgId) to which the user belongs. The "organizations" collection doesn't store the list of assigned users. The "users" collection has 11,200 documents and the "organizations" collection has 10,500 documents.
db.organizations.aggregate([
  {$lookup: {from: "users", localField: "_id", foreignField: "userOrgMap.orgId", as: "user"}},
  {$project: {_id: 1, name: 1, "noOfUsers": {$size: "$user"}}},
  {$sort: {noOfUsers: -1}},
  {$limit: 15},
  {$skip: 0}
]);
Without the sort, the query is fast. With the sort, it is very slow; it takes around 200 seconds.
I also tried another way, which also takes a long time:
db.organizations.aggregate([
  {$lookup: {from: "users", localField: "_id", foreignField: "userOrgMap.orgId", as: "user"}},
  {$unwind: "$user"},
  {$group: {_id: "$_id", name: {$first: "$name"}, userCount: {$sum: 1}}},
  {$sort: {userCount: -1}},
  {$limit: 15},
  {$skip: 0}
]);
The above query takes a long time even without the $sort.
I need help on how to solve this issue.
Get the aggregation to use an index that begins with noOfUsers, as I do not see a $match stage here.
The problem is resolved. I created an index on "userOrgMap.orgId". The query is fast now.
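That is, roughly:
  // Index on the $lookup's foreignField, so the per-organization lookup doesn't scan all users
  db.users.createIndex({"userOrgMap.orgId": 1})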

MongoDB pagination with document modified date

I am implementing a stream of sorts, which stores data like below:
{_id:ObjectId(), owner:1, text:"some text1", created:"2015-11-14T01:01:00+00:00", updated:"2015-11-14T01:01:00+00:00"}
{_id:ObjectId(), owner:1, text:"some text2", created:"2015-11-14T01:01:00+00:00", updated:"2015-11-14T01:01:00+00:00"}
{_id:ObjectId(), owner:1, text:"some text3", created:"2015-11-14T01:01:00+00:00", updated:"2015-11-14T01:02:00+00:00"}
...
{_id:ObjectId(), owner:1, text:"some text4", created:"2015-11-14T01:30:00+00:00", updated:"2015-11-14T01:31:00+00:00"}
...
Note that there are frequent bulk updates, which means more than 10-20 documents can be updated at the same time.
My aggregation query looks like this:
db.stream.aggregate([
  {$match: {
    'owner': 1,
    'updated': {$lte: '2015-11-14T01:30:00+00:00'}}},
  {$sort: {'updated': -1}},
  {$limit: 10}
])
If there are more than 10 documents with the same updated date, my pagination will not continue.
For pagination I have the last document of the previous page, from which I can extract _id and updated.
Is it possible to do something like
  $match: {updated: {$lt: last_page_ele.updated}}
  $sort: {updated: -1}
  $match: { documents which come after the last element in this order }
  $limit: 10
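One common way to express the "documents which come after the last element" part is keyset pagination: sort on updated plus _id as a tie-breaker, and match against both values taken from the last element of the previous page. A rough sketch, assuming last_page_ele holds that document:
  db.stream.aggregate([
    {$match: {
      owner: 1,
      $or: [
        {updated: {$lt: last_page_ele.updated}},
        {updated: last_page_ele.updated, _id: {$lt: last_page_ele._id}}
      ]}},
    {$sort: {updated: -1, _id: -1}},
    {$limit: 10}
  ])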
Update:
It is possible to use the method suggested by @Blakes Seven, i.e. to store all the seen documents in a session array and use $nin. This was an obvious solution, but it breaks my front-end behaviour: you cannot open multiple tabs, because they would use the same session space.
Plus, in my case, storing hundreds of items in RAM seems impractical and not scalable for a lot of users.