I hope someone can help with this slow Mongo query - it runs fine against smaller collections, but once we test it against the larger production collections it fails with the message "Not enough disk space", even though we limited the result set to 100.
I feel like there is an issue with the query structure and/or missing indexes.
Both collections contain ~5 million records.
We need help making this query fast.
// divide these by 1000 because the ts field stores epoch seconds, not JavaScript milliseconds
const startDate = (ISODate("2022-07-01T00:00:00.000Z").getTime()/1000)
const endDate = (ISODate("2022-08-10T00:00:00.000Z").getTime()/1000)
const clientId = xxxx
const ordersCollection = "orders"
const itemsCollection = "items"
db[ordersCollection].aggregate(
[
{
$lookup: {
from: itemsCollection,
localField: "data.id",
foreignField: "data.orders_id",
as: "item"
}
},
{
$unwind: "$item"
},
{
$match: {"data.client_id": clientId}
},
{
$match: {"item.ts": {$gt: startDate, $lt: endDate}}
},
{
$project: {
order_id: "$data.id",
parent_id: "$data.parent_id",
owner_id: "$data.owner_id",
client_id: "$data.client_id",
ts: "$item.ts",
status: {
$cond: {
if: {$eq: ["$item.data.status",10] },
then: 3,
else: {
$cond: { if: { $eq: ["$item.data.status",4] },
then: 2,
else: "$item.data.status"
}
}
}
}
}
},
{$group: { _id: {"order_id": "$order_id", "status": "$status"},
order_id: {$first:"$order_id"},
parent_id: {$first:"$parent_id"},
owner_id: {$first:"$owner_id"},
client_id: {$first:"$client_id"},
ts: {$first:"$ts"},
status:{$first:"$status"}
}},
{$sort: {"ts": 1}}
]
).limit(100).allowDiskUse(true)
Try pulling the $match on the main collection up.
This way you limit the number of documents the $lookup has to run against (otherwise you'll try to look up 5 million documents in another collection of 5 million documents).
Be sure to have an index on data.client_id.
db[ordersCollection].aggregate(
[
{
$match: {"data.client_id": clientId}
},
{
$lookup: {
from: itemsCollection,
localField: "data.id",
foreignField: "data.orders_id",
as: "item"
}
},
{
$unwind: "$item"
},
{
$match: {"item.ts": {$gt: startDate, $lt: endDate}}
},
...
As a side note, limiting the result set to 100 does not help: the heaviest part - the aggregation with its lookups and grouping - cannot be limited.
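For the indexes, a minimal sketch (collection and field names taken from the query above); the trailing ts key on the items index only pays off if the date filter is moved inside the $lookup as a sub-pipeline:
// supports the initial $match on the orders side
db[ordersCollection].createIndex({ "data.client_id": 1 })
// supports the $lookup's equality match on data.orders_id on the items side
db[itemsCollection].createIndex({ "data.orders_id": 1, "ts": 1 })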
DB Schema:
3 Collections - `Collection1`, `Collection2`, `Collection3`
Column Structure:
Collection1 - [ _id, rid, aid, timestamp, start_timestamp, end_timestamp ]
Collection2 - [ _id, rid, aid, task ]
Collection3 - [ _id, rid, event_type, timestamp ]
MONGO DB VERSION: 4.2.14
The problem statement is: we need to join all three collections using rid. Collection1 is the parent source from which we need the analysis.
Collection2 contains the task for each rid in Collection1; we need a "count" per rid from this collection. Collection3 contains the event log for each rid. It is quite large, so we only need to filter on two events, EventA & EventB, for each rid found earlier in the pipeline.
I came up with the query below, but it's not working: I am not getting the min date from Collection3 for each rid matched earlier in the pipeline.
Note: the min date of each event in the logs should be associated with the rid matched by the Collection1 filter.
Query:
db.getCollection("Collection1").aggregate([
{
$match: {
start_timestamp: {
$gte: new Date(ISODate().getTime() - 1000 * 60 * 15),
},
},
},
{
$lookup: {
from: "Collection2",
localField: "rid",
foreignField: "rid",
as: "tasks",
},
},
{
$lookup: {
from: "Collection3",
pipeline: [
{
$match: {
event: {
$in: ["EventA", "EventB"]
}
}
},
{
$group: {
_id: "$event",
timestamp: {
"$min": {
"updated": "$timestamp"
}
}
}
}
],
as: "eventlogs",
},
}
]);
Expected Output:
[
  {
    rid: "123",
    aid: "456",
    timestamp: ISODate("2022-06-03T09:46:39.609Z"),
    start_timestamp: ISODate("2022-06-03T09:46:39.657Z"),
    tasks: [
      { count: 5 }
    ],
    logs: [
      {
        "_id": "EventA",
        "timestamp": {
          "updated": ISODate("2022-04-27T06:10:44.323Z")
        }
      },
      {
        "_id": "EventB",
        "timestamp": {
          "updated": ISODate("2022-05-05T06:36:10.271Z")
        }
      }
    ]
  }
]
I need to write a highly optimized query that does the above in less time (assuming proper indexes are in place on the relevant fields of each collection). The query should not do a COLLSCAN over the entire Collection3, since it's going to be quite large.
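One direction worth sketching (not a tested answer): make the Collection3 lookup a correlated sub-pipeline with let/$expr so it only touches the rids that survived the initial $match, and collapse the Collection2 lookup to a count with $size. Field names (rid, event_type, timestamp) are taken from the schema above; note the min timestamp comes back directly rather than nested under "updated":
db.getCollection("Collection1").aggregate([
  {
    $match: {
      start_timestamp: { $gte: new Date(ISODate().getTime() - 1000 * 60 * 15) }
    }
  },
  {
    $lookup: {
      from: "Collection2",
      localField: "rid",
      foreignField: "rid",
      as: "tasks"
    }
  },
  // reduce the joined tasks to the per-rid count the requirement asks for
  { $addFields: { tasks: [ { count: { $size: "$tasks" } } ] } },
  {
    $lookup: {
      from: "Collection3",
      let: { rid: "$rid" },
      pipeline: [
        {
          $match: {
            // correlates the logs with the current rid; an index on
            // { rid: 1, event_type: 1 } helps, though on 4.2 the $expr
            // equality may not use it - verify with explain()
            $expr: { $eq: ["$rid", "$$rid"] },
            event_type: { $in: ["EventA", "EventB"] }
          }
        },
        { $group: { _id: "$event_type", timestamp: { $min: "$timestamp" } } }
      ],
      as: "logs"
    }
  }
]);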
I have the following collections:
phones:
{"_id": {
"$oid": "61d376c0b9887d4e736e6acb"
},
"brand": "Nokia",
"name": "Nokia 3210",
"picture": "https://fdn2.gsmarena.com/vv/bigpic/no3210b.gif",
"phoneId": "1" }
reviews:
{"_id": {
"$oid": "61d333d0ac2d25f88d0bc8fa"
},
"phoneId": "1",
"rating": "3",
"dateOfReview": {
"$date": "2008-11-18T00:00:00.000Z"
},
"title": "Ok phone to tide me over",
"userId": "47599" }
I'm running the following aggregation both in MongoDB Compass and in the Mongo shell, and it gives me the expected result:
db.phones.aggregate([{$lookup: {
from: 'reviews',
localField: 'phoneId',
foreignField: 'phoneId',
as: 'reviews'
}}])
{ _id: ObjectId("61d376c0b9887d4e736e6acb"),
brand: 'Nokia',
name: 'Nokia 3210',
picture: 'https://fdn2.gsmarena.com/vv/bigpic/no3210b.gif',
phoneId: '1',
reviews:
[ { _id: ObjectId("61d333d0ac2d25f88d0bc8fa"),
phoneId: '1',
rating: '3',
dateOfReview: 2008-11-18T00:00:00.000Z,
title: 'Ok phone to tide me over',
userId: '47599' } ] }
But when I check the collection there is no reviews field. How can I add it to the collection permanently? Also, since I have a lot of reviews for each phone, I would like to add to the reviews array in phones only the 20 most recent ones that match the phoneId - is that possible?
You can use the $merge and $out pipeline stages to write data back to your database.
Note that these stages have to be the last stage in your pipeline.
E.g.
db.phones.aggregate([
{
$lookup: {
from: 'reviews',
localField: 'phoneId',
foreignField: 'phoneId',
as: 'reviews'
}
},
{
$merge: {
into: 'phones', // collection-name
on: 'phoneId', // the identifier, used to identify the document to merge into
}
}
])
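One caveat: when $merge is given a custom on field, the target collection generally needs a unique index on that field (merging on _id works out of the box). A minimal sketch, assuming phoneId uniquely identifies a document in phones:
// required for { $merge: { into: 'phones', on: 'phoneId' } } to succeed
db.phones.createIndex({ phoneId: 1 }, { unique: true })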
Since for each phone the number of reviews is >> 20, you might wish to consider going after the reviews first, then doing a $lookup into phones. For a single known phone the following will work; it will not work for more than one phone because $limit cannot reference data fields (i.e. phoneId):
db.Xreviews.aggregate([
{$match: {"phoneId":"2"}}
,{$sort: {"dateOfReview":-1}} // no getting around the desc sort..
,{$limit: 20} // but we now limit to ONLY 20 reviews.
// Put them "back together" as an array called "reviews"
,{$group: {_id:"$phoneId", reviews: {$push: "$$CURRENT"}}}
// ... and pull in the phone data:
,{$lookup: {from: "Xphones", localField: "_id", foreignField: "phoneId", as: "X" }}
]);
The following will work across one or more phones (or all of them), but the trade-off is that the reviews array could be very large before being passed to the $slice operator to cut it back to 20:
db.Xreviews.aggregate([
// $match for anything or nothing here; then:
{$sort: {"dateOfReview":-1}}
// The order of the _id is not deterministic BUT the docs will be
// pushed onto the reviews array correctly in desc order:
,{$group: {_id:"$phoneId", reviews: {$push: "$$CURRENT"}}}
// Now simply overwrite the reviews array a shortened version:
,{$addFields: {reviews: {$slice: ["$reviews",20] }}}
,{$lookup: {from: "Xphones", localField: "_id", foreignField: "phoneId", as: "X" }}
]);
These two solutions end up with the phone details stored in a field 'X', which is an array of one item. Since we know the phoneId mapping is 1:1, if we wish to get fancy, we can add this after the $lookup:
// Pull item[0] out and merge with reviews AND make that the new
// document:
,{$replaceRoot: { newRoot: {$mergeObjects: [ "$$CURRENT", {$arrayElemAt:["$X",0]} ]} }}
,{$unset: "X"} // drop X completely
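For completeness: on MongoDB 3.6+ the $lookup stage also accepts a let/pipeline form, which sidesteps the per-phone $limit problem described above because each phone pulls only its own 20 most recent reviews. A sketch, assuming an index on reviews.phoneId (on older servers the $expr match may not use it):
db.phones.aggregate([
  {
    $lookup: {
      from: "reviews",
      let: { pid: "$phoneId" },
      pipeline: [
        { $match: { $expr: { $eq: ["$phoneId", "$$pid"] } } },
        { $sort: { dateOfReview: -1 } },  // newest first
        { $limit: 20 }                    // keep only the 20 most recent per phone
      ],
      as: "reviews"
    }
  }
  // optionally end with { $merge: { into: "phones", on: "phoneId" } } to persist
])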
I think I solved my problem combining the two solutions proposed by Buzz and MarcRo:
db.reviews.aggregate([
{
$sort: {
"dateOfReview":-1
}
},
{
$group: {
_id: '$phoneId',
reviews: {
$push: '$$CURRENT'
}
}
}, {
$addFields: {
reviews: {
$slice: [
'$reviews',
20
]
}
}
}, {
$merge: {
into: 'phones',
on: '_id'
}
}])
I've written a query using the aggregation framework in the latest version of MongoDB. The problem is that I can't get the result: it takes too long and the process crashes.
Collections
Constructora, with 50,000 documents (including embedded documents and so on):
{
_id: 1,
edificio: [
{
_id:58,
pais: "Argentina",
provincia: "Buenos Aires",
ciudad: "Tandil",
direccion: "9 de Julio 845",
departamento: [
{
_id:45648651867,
nombre_depto: "a"
},
...
]
}
...
]
},
{
_id:2,
edificio: [...],
...
}
...
variable, with 400,000 documents (including embedded documents):
{
_id:1,
medicion_departamento: [
{
_id:1,
valmax:40,
id_departamento:6,
...
},
...
]
},
{
_id: 2,
medicion_departamento: [...]
},
...
Medicion, with 8,000,000 documents:
{
_id:1,
id_departamento: 6,
id_variable: 1,
valor: 6269
},
{
_id:2,
...
},
...
Query
I want to get the address of the departments (pais, provincia, ciudad, departamento.nombre) whose valor in "medicion" exceeds five times the valmax value in "variable". My query is:
db.constructora.aggregate([
{$unwind:"$edificio"},
{$unwind:"$edificio.departamento"},
{$lookup:
{
from: "variable",
localField: "edificio.departamento._id",
foreignField: "medicion_departamento.id_departamento",
as: "var"
}
},
{$unwind: "$var"},
{$unwind: "$var.medicion_departamento"},
{$match: {"var.nombre":"electricidad"}},
{$lookup:
{
from: "medicion",
localField: "var.medicion_departamento.id_departamento",
foreignField: "id_departamento",
as: "med"
}
},
{$unwind:"$med"},
{$project:{"id_dto":"med.id_departamento", "consumo":"$med.valor","valorMult":{$multiply:["$var.valmax",5]}, "edificio.pais":1, "edificio.provincia":1, "edificio.direccion":1, "edificio.departamento.nombre_depto":1}},
{$match:{"consumo":{$gt:"valorMult"}}},
{$group:{_id:{"a": "edificio.pais", "b":"edificio.provincia", "c":"edificio.direccion", "d":"edificio.departamento.nombre_depto"}}}
]);
When I remove the last $match and $group from the pipeline, the query returns the data in 0.08 s, but when I run it with the $match and $group it runs until the process crashes. What can I do to fix (or optimize) it?
Thanks!
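Two details in the last stages are worth fixing regardless of performance: $gt: "valorMult" compares consumo against the literal string "valorMult" instead of the computed field, and the $group _id values are missing the leading $, so they group on constant strings (the same applies to "id_dto": "med.id_departamento" in the $project). A sketch of the corrected stages, using $expr for the field-to-field comparison:
// compare two fields of the same document with $expr
{ $match: { $expr: { $gt: ["$consumo", "$valorMult"] } } },
// reference the projected fields with a leading $
{ $group: { _id: {
    a: "$edificio.pais",
    b: "$edificio.provincia",
    c: "$edificio.direccion",
    d: "$edificio.departamento.nombre_depto"
} } }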
I'm using Meteor and MongoDB. I need to publish with aggregation (I'm using jcbernack:reactive-aggregate and ReactiveAggregate).
db.getCollection('Jobs').aggregate([
{
$lookup:
{
from: "JobMatches",
localField: "_id",
foreignField: "jobId",
as: "matches"
}
},
{ $project:
{
matches: {
'$filter': {
input: '$matches',
as: 'match',
cond: { '$and': [{$eq: ['$$match.userId', userId]}]}
}
}
}
},
{$match: { 'matches.score': { '$gte': 60 }}},
{$sort: { "matches.score": -1 }},
{$limit: 6}
])
On the client I only get the data part (limit 6), so I have to count the total number of matching documents on the server side. I can't use find().count(), because in a plain find() call I can't filter on a field that comes from another collection (like { 'matches.score': { '$gte': 60 } }). How can I count the documents filtered this way? Do I need a $group stage in the pipeline?
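One way to get the total alongside the page (a sketch reusing the stages above) is a $facet stage, so a single aggregation returns both the count and the six documents; a plain $count stage works if you only need the total:
// ...same $lookup / $project / $match stages as above, then:
{
  $facet: {
    total: [ { $count: "count" } ],   // e.g. [ { count: 42 } ]
    page: [ { $sort: { "matches.score": -1 } }, { $limit: 6 } ]
  }
}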
I have two collections
Posts:
{
"_Id": "1",
"_PostTypeId": "1",
"_AcceptedAnswerId": "192",
"_CreationDate": "2012-02-08T20:02:48.790",
"_Score": "10",
...
"_OwnerUserId": "6",
...
},
...
and users:
{
"_Id": "1",
"_Reputation": "101",
"_CreationDate": "2012-02-08T19:45:13.447",
"_DisplayName": "Geoff Dalgas",
...
"_AccountId": "2"
},
...
and I want to find users who wrote between 5 and 15 posts.
This is what my query looks like:
db.posts.aggregate([
{
$lookup: {
from: "users",
localField: "_OwnerUserId",
foreignField: "_AccountId",
as: "X"
}
},
{
$group: {
_id: "$X._AccountId",
posts: { $sum: 1 }
}
},
{
$match : {posts: {$gte: 5, $lte: 15}}
},
{
$sort: {posts: -1 }
},
{
$project : {posts: 1}
}
])
and it is terribly slow. For 6k users and 10k posts it took over 40 seconds to get a response, while in a relational database I get a response in a split second.
Where's the problem? I'm just getting started with MongoDB, and it's quite possible that I messed up this query.
from https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/
foreignField Specifies the field from the documents in the from
collection. $lookup performs an equality match on the foreignField to
the localField from the input documents. If a document in the from
collection does not contain the foreignField, the $lookup treats the
value as null for matching purposes.
This will be performed the same as any other query.
If you don't have an index on the field _AccountId, it will do a full collection scan for each one of the 10,000 posts. The bulk of the time will be spent in those scans.
db.users.ensureIndex({ "_AccountId": 1 })
speeds up the process so it's doing 10,000 index hits instead of 10,000 collection scans.
In addition to bauman.space's suggestion to put an index on the _accountId field (which is critical), you should also do your $match stage as early as possible in the aggregation pipeline (i.e. as the first stage). Even though it won't use any indexes (unless you index the posts field), it will filter the result set before doing the $lookup (join) stage.
The reason why your query is terribly slow is that for every post, it is doing a non-indexed lookup (sequential read) for every user. That's around 60m reads!
Check out the Pipeline Optimization section of the MongoDB Aggregation Docs.
Use $match first, then $lookup: $match reduces the number of documents that $lookup has to examine, which is more efficient.
Since you're going to group by user anyway, do the $group first on _OwnerUserId, and only run the $lookup after filtering to accounts with between 5 and 15 posts; this reduces the number of lookups:
db.posts.aggregate([{
$group: {
_id: "$_OwnerUserId",
postsCount: {
$sum: 1
},
posts: {
$push: "$$ROOT"
} //if you need to keep original posts data
}
},
{
$match: {
postsCount: {
$gte: 5,
$lte: 15
}
}
},
{
$lookup: {
from: "users",
localField: "_id",
foreignField: "_AccountId",
as: "X"
}
},
{
$unwind: "$X"
},
{
$sort: {
postsCount: -1
}
},
{
$project: {
postsCount: 1,
X: 1
}
}
])