I have 3 collections:
users
mappers
seens
Here is a document in users, where "_id" is the id of a user and the ids array contains a list of other users' _ids:
{
_id: "uid",
ids: [
"uid0",
"uid5",
...
"uid100"
]
}
A document in seens has exactly the same shape as one in users, but its ids array contains the ids of mappers the user has already seen; its "_id" is the id of the user who owns the array.
Here is a mapper, where "_id" is the id of a user and map.id is an id that may appear in the ids field of a seens document:
{
_id: "uid",
at: 1453592,
map: {
id: "uid",
...
}
}
I want to retrieve all mappers that meet these conditions:
_id must be in the ids of the user
at must be $lt now and $gt a given start value (which is lower than now)
map.id must not be in the ids of the seens of the user
The query looks like this:
{
"_id": {"$in": ids},
"$and": [
{"at": {"$lt": now}},
{"at": {"$gt": start_date}},
{"map.id": {"$nin": seens}}
],
}
Where ids is the array of the user ids and seens is the array of the mappers already seen.
I have done some experiments with this query, and it works fine with a few thousand records.
However, with 10,000 ids, 10,000 seens and 10,000 mappers, the same query takes 15 seconds.
I have added an index on at (descending) and map.id (ascending); it now takes 8 seconds.
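For reference, that index was created along these lines (a sketch; the collection name mappers is taken from the description above):
// Compound index on at (descending) and map.id (ascending)
db.mappers.createIndex({ at: -1, "map.id": 1 })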
I also know that as my collections grow, this will only take longer and longer.
How can I make it always return results in less than 1 second, no matter how many documents are in my collections?
The underlying question is: how do I keep a query that uses $in and $nin efficient at scale?
Related
Following is my MongoDB query to show the organization listing along with the user count per organization. In my data model, the "users" collection has an array userOrgMap that records the organizations (by orgId) to which a user belongs. The "organizations" collection does not store the list of assigned users. The "users" collection has 11,200 documents and "organizations" has 10,500 documents.
db.organizations.aggregate([
{$lookup : {from:"users",localField:"_id", foreignField:"userOrgMap.orgId",as:"user" }},
{ $project : {_id:1,name:1,"noOfUsers":{$size:"$user"}}},
{$sort : {noOfUsers : -1}},
{$limit : 15},
{$skip : 0}
]);
Without the sort, the query is fast. With the sort, it is very slow: it takes around 200 seconds.
I also tried another way, which takes even more time.
db.organizations.aggregate([
{$lookup : {from:"users",localField:"_id", foreignField:"userOrgMap.orgId",as:"user" }},
{$unwind : "$user"},
{$group : {_id : "$_id", name : {$first : "$name"}, userCount : {$sum : 1}}},
{$sort : {userCount : -1}},
{$limit : 15},
{$skip : 0}
]);
The above query takes a long time even without the $sort.
I need help on how to solve this issue.
Get the aggregation to use an index that begins with noOfUsers as I do not see a $match stage here.
The problem is resolved. I created an index on "userOrgMap.orgId". The query is fast now.
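For completeness, creating that index looks something like this (the collection name users matches the pipeline above):
// Index on the array field used as the $lookup foreignField
db.users.createIndex({ "userOrgMap.orgId": 1 })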
I have a document in which data are like
collection A
{
    "_id" : 69.0,
    "values" : [
        {
            "date_data" : "2016-12-16 10:00:00",
            "valueA" : 8,
            "valuB" : 9
        },
        {
            "date_data" : "2016-12-16 11:00:00",
            "valueA" : 8,
            "valuB" : 9
        },
        ...
    ]
}
collection B
{
    "_id" : 69.0,
    "values" : [
        {
            "date_data" : "2017-12-16 10:00:00",
            "valueA" : 8,
            "valuB" : 9
        },
        {
            "date_data" : "2017-12-16 11:00:00",
            "valueA" : 8,
            "valuB" : 9
        },
        ...
    ]
}
Data is stored every hour, and since it all goes into one document, a document may eventually hit the 16 MB limit. That's why I'm thinking of spreading the data across years, meaning that in each collection the ids hold one year's worth of data. But when we want to show the data combined, how can we use the aggregate function?
For example, collectionA has data from 7 Dec '16 to 7 Dec '17 and collectionB has data from 6 Dec '15 to 6 Dec '16. How can I show data between 1 Dec '16 and 1 Jan '17 when it lives in different collections?
Very simple: use MongoDB's $lookup stage, which is the equivalent of a left outer join. Every document on the left is scanned for a value in a field, and the documents from the right (the foreign collection) are matched on that value. For your case, here is the parent collection:
Parent A
Child collection B
Now all we have to do is run a query on collection A. With a very simple $lookup aggregation, you'll see the following result:
db.getCollection('uniques').aggregate([
{
"$lookup": {
"from": "values",//Get data from values table
"localField": "_id", //The field _id of the current table uniques
"foreignField": "parent_id", //The foreign column containing a matching value
"as": "related" //An array containing all items under 69
}
},
{
"$unwind": "$related" //Unwind that array
},
{
"$project": {
"value_id": "$related._id",//project only what you need
"date": "$related.date_data",
"a": "$related.valueA",
"b": "$related.valueB"
}
}
], {"allowDiskUse": true})
Remember a few things:
The local field of the lookup doesn't care whether it is indexed or not, so run the lookup from the collection with the fewest documents.
The foreign field works best when it is indexed, or when it is an _id.
There is an option to specify a pipeline and do some custom filtering while matching; I wouldn't recommend it, as such pipelines are ridiculously slow.
Don't forget "allowDiskUse" if you are going to aggregate large amounts of data.
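Coming back to the original date-range question: one way (a sketch, not taken from the answer above) is to run the same filter over both yearly collections. On MongoDB 4.4+ this can be done server-side with $unionWith; on older versions you would run the $unwind/$match part against each collection separately and concatenate the results in application code. It assumes the collection names collectionA and collectionB and that date_data is stored in the sortable string format shown above:
db.collectionA.aggregate([
    { $unionWith: "collectionB" },                 // 4.4+: append collectionB's documents
    { $unwind: "$values" },                        // one document per hourly entry
    { $match: { "values.date_data": { $gte: "2016-12-01 00:00:00",
                                      $lt:  "2017-01-01 00:00:00" } } },
    { $sort: { "values.date_data": 1 } }
])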
I have 5 million mongo docs:
{
_id: xxx,
devID: 123,
logLevel: 5,
logTime: 1468464358697
}
indexes:
devID
my aggregate:
[
{$match: {devID: 123}},
{$group: {_id: {level: "$logLevel"}, count: {$sum: 1}}}
]
aggregate result:
{ "_id" : { "level" : 5 }, "count" : 5175872 }
{ "_id" : { "level" : 1 }, "count" : 200000 }
aggregate explain:
numYields:42305
29399ms
Q:
if mongo is not writing (saving) data, it takes 29 seconds
if mongo is writing (saving) data, it takes 2 minutes
My aggregate result needs to be sent back to the web, so 29 seconds or 2 minutes is too long.
How can I solve it? Preferably 10 seconds or less.
Thanks all
In your example, the aggregation query for {devID: 123, logLevel:5} returns a count of 5,175,872 which looks like it counted all the documents in your collection (since you mentioned you have 5 million documents).
In this particular example, I'm guessing that the {$match: {devID: 123}} stage matches pretty much every document, hence the aggregation is doing what is essentially a collection scan. Depending on your RAM size, this could have the effect of pushing your working set out of memory, and slow down every other query your server is doing.
If you cannot provide a more selective criteria for the $match stage (e.g. by using a range of logTime as well as devID), then a pre-aggregated report may be your best option.
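If such a range is available, the more selective $match and a supporting compound index could look roughly like this (a sketch; the collection name logs and the time bounds are assumptions, not from the question):
// Compound index so the match can walk devID + logTime instead of scanning the collection
db.logs.createIndex({ devID: 1, logTime: 1 })

db.logs.aggregate([
    { $match: { devID: 123,
                logTime: { $gte: 1468400000000, $lt: 1468464358697 } } },  // example bounds
    { $group: { _id: { level: "$logLevel" }, count: { $sum: 1 } } }
])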
In general terms, a pre-aggregated report is a document that contains the aggregated information you require, and you update this document every time you insert into the related collection. For example, you could have a single document in a separate collection that looks like:
{log:
{devID: 123,
levelCount: [
{level: 5, count: 5175872},
{level: 1, count: 200000}
]
}}
where that document is updated with the relevant details every time you insert into the log collection.
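A rough sketch of that update (an assumption: a flatter shape with one counter document per devID/logLevel pair is used here, since it is simpler to increment than the nested array above; the collection names are made up):
// On every insert into the log collection, bump the matching counter (creating it if needed)
db.logs.insertOne({ devID: 123, logLevel: 5, logTime: Date.now() })
db.logCounts.updateOne(
    { devID: 123, level: 5 },
    { $inc: { count: 1 } },
    { upsert: true }
)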
Using a pre-aggregated report, you don't need to run the aggregation query anymore. The aggregated information you require is instead available with a single find() query.
For more examples on pre-aggregated reports, please see https://docs.mongodb.com/ecosystem/use-cases/pre-aggregated-reports/
I have two collections: persons (millions of documents) and groups. When creating a group I have a rule, which is actually the criteria for finding persons. Now, what I want to do is add the group's _id to all the matching persons.
The request to my API:
POST /groups {
"rule": {"age": {"$gt": 18}},
"description": "persons above 18"
}
On my MongoDB:
db.persons.find({"age": {"$gt": 18}})
Now I want to add the group _id to a groups array field in each of the matching persons, so that I can later get all persons in the group. Can this be done directly in the same query?
Maybe I'm missing something, but a simple update statement should do it:
db.persons.update(
{ "age" : {$gt : 18} },
{ $addToSet : { "groups" : groupId }},
false, // no upsert ($addToSet and $push still add the field)
true); // update multiple docs in this query
Note that $push will add the same value over and over, while $addToSet will ensure uniqueness in the array, which probably makes sense in this case.
You'll have to find/insert the group itself in a different statement, though.
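In more recent shell/driver versions, the same thing is usually written with updateMany (a sketch; groupId stands for the _id returned when you inserted the group document):
// Stamp the new group's _id onto every matching person
db.persons.updateMany(
    { age: { $gt: 18 } },
    { $addToSet: { groups: groupId } }
)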
I've two collections, Buildings and Orders. A Building can have many Orders (a 1:N relation).
I'm trying to build a "top ten" statistic (which Buildings have the most Orders) with the aggregation framework.
My problem is: how can I get the total Orders per Building? Is there a way to "mix" data from two collections in one aggregation?
Currently I'm doing something like this:
db.buildings.aggregate( [
    { $group : { _id : { street : "$street",
                         city : "$city",
                         orders_count : "$orders_count" } } },
    { $sort : { "_id.orders_count" : -1 } },
    { $limit : 10 }
] );
But in this case "orders_count" is a pre-calculated value. It works, but it is very inefficient and too slow for "live" aggregation.
Is there a way to count the related orders per building directly in the aggregation (I'm sure there is a way...)?
Many Thanks
You don't say how orders relate to buildings in your schema, but if an order references a building id or name, just group by that:
db.orders.aggregate( [
    { $group : { _id : "$buildingId",
                 sum : { $sum : 1 } } }
    /* $sort by sum: -1 and $limit: 10 like you already have */
] )
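Putting it together, the whole top-ten pipeline might look like this (a sketch; it assumes each order has a buildingId field referencing the building's _id, and MongoDB 3.2+ if you want the optional $lookup to pull street/city back in):
db.orders.aggregate([
    { $group: { _id: "$buildingId", orders_count: { $sum: 1 } } },  // count orders per building
    { $sort: { orders_count: -1 } },                                // most orders first
    { $limit: 10 },
    { $lookup: { from: "buildings",                                 // optional: fetch building details
                 localField: "_id",
                 foreignField: "_id",
                 as: "building" } }
])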