Optimization and Indexing on Mongo Query

Help me figure out what kind of indexes need to be created and which fields to index.
I have tested multiple indexes, but the query below is still taking a long time to execute:
db.collection_1.aggregate([
  { $match: { $and: [
    { date: { $gte: new Date(1593561600000), $lt: new Date(1604966400000) } },
    { type: 0 },
    { $or: [ { partysr: 0 }, {} ] },
    { $or: [ { code: "******" }, { _id: { $type: -1 } } ] }
  ] } },
  { $sort: { date: -1 } },
  { $skip: 0 },
  { $limit: 100 },
  { $lookup: { from: "collection_2", localField: "code", foreignField: "code", as: "j" } },
  { $group: { _id: "$codeTsr" } }
]).explain("executionStats")

You gave us very little information here.
Have you tried testing this query with and without the $lookup stage?
How does the query behave, speed-wise, without the lookup?
My first guess is that your collection_2 collection does not have a proper index, which slows the query down. If your query is much faster without the $lookup stage, I would create an index on collection_2 for the "code" property.
Also, one more performance optimization might be to run the $group stage first and only then do the $lookup stage.
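If the field names above are accurate, two index sketches worth testing with explain() are a compound index backing the $match on type plus the range filter and descending sort on date, and a single-field index backing the $lookup's equality on code (the key order is an assumption to verify against your own data):

// collection_1: equality on type first, then date for the range filter and the $sort: { date: -1 }
db.collection_1.createIndex({ type: 1, date: -1 })
// collection_2: supports the $lookup's foreignField match on code
db.collection_2.createIndex({ code: 1 })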

Related

Relate and Count Between Two Collections in MongoDB

How can I count the number of completed houses designed by a specific architect in MongoDB?
I have the following two collections, "plans" and "houses".
The only relationship between houses and plans is that a house stores the id of a given plan.
Is there a way to do this in MongoDB with just one query?
plans
{
  _id: ObjectId("6388024d0dfd27246fb47a5f"),
  "hight": 10,
  "arquitec": "Aneesa Wade",
},
{
  _id: ObjectId("1188024d0dfd27246fb4711f"),
  "hight": 50,
  "arquitec": "Smith Stone",
}
houses
{
  _id: ObjectId,
  "plansId": "6388024d0dfd27246fb47a5f", // -> string
  "status": "under construction",
},
{
  _id: ObjectId,
  "plansId": "6388024d0dfd27246fb47a5f", // -> string
  "status": "completed",
}
What I tried was to use Mongo aggregations with $match and $lookup.
The "idea", which clearly has errors, would be something like this:
db.houses.aggregate([
{"$match": {"status": "completed"}},
{
"$lookup": {
"from": "plans",
"pipeline": [
{
"$match": {
"$expr": {
"$and": [
{ "$eq": [ "houses.plansId", { "$toString": "$plans._id" }]},
{ "plans.arquitec" : "Smith Stone" },
]
}
}
},
],
}
}
If it's a single join condition, simply $project the string id to an ObjectId to avoid any complicated lookup pipeline.
Example playground - https://mongoplayground.net/p/gaqxZ7SzDTg
db.houses.aggregate([
{
$match: {
status: "completed"
}
},
{
$project: {
_id: 1,
plansId: 1,
status: 1,
plans_id: {
$toObjectId: "$plansId"
}
}
},
{
$lookup: {
from: "plans",
localField: "plans_id",
foreignField: "_id",
as: "plan"
}
},
{
$project: {
_id: 1,
plansId: 1,
status: 1,
plan: {
$first: "$plan"
}
}
},
{
$match: {
"plan.arquitec": "Some One"
}
}
])
Update: as per the OP's comment, added an additional $match stage to filter the final result based on the lookup response.
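Since the original question asks for a count of completed houses for a given architect, a $count stage could be appended as the last stage of the pipeline above; a minimal sketch (the output field name "completedHouses" is just an example):

// appended after the final $match on "plan.arquitec"
{ $count: "completedHouses" } // emits one document holding the total number of matching houses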

MongoDB optimizing big aggregation queries

I have a collection of documents in MongoDB, each representing some entity. For every entity, statistics data is gathered on a daily basis. The statistics are stored as separate documents in different collections.
Entity collection schema:
{
_id: ObjectId,
filterField1: String, //indexed
filterField2: String, //indexed
}
Example schema of statistics collection:
{
_id: ObjectId,
entityId: ObjectId, //indexed
statisticsValue: Int32,
date: Date //indexed
}
There is a dashboard that needs to display aggregated statistics based on the gathered data over some time period, e.g. average value, sum, count, etc. The dashboard enables filtering entities in and out and applying different date ranges, which makes precalculating those aggregated statistics impossible.
So far, I've been using an aggregation pipeline to:
apply the filters on the entities collection (a $match stage)
run the necessary $lookup stages to acquire the statistics for aggregation
do the grouping and aggregation (avg, sum, count, etc.)
Here is the pipeline:
db.getCollection('entities').aggregate([
{ $match: { $expr: { $and: [
// ENTITIES FILTERS based on filterField1 and filterField2 fields
] } } },
{ $lookup: {
from: 'statistics',
let: { entityId: '$_id' },
pipeline: [{ $match: { $expr: { $and: [
{ $eq: ["$entityId", "$$entityId"] },
{ $gte: [ "$date", new ISODate("2022-06-01T00:00:00Z") ] },
{ $lte: [ "$date", new ISODate("2022-06-01T23:59:59Z") ] },
] } } },
as: 'stats_start_date_range',
} },
{ $lookup: {
from: 'statistics',
let: { entityId: '$_id' },
pipeline: [{ $match: { $expr: { $and: [
{ $eq: ["$entityId", "$$entityId"] },
{ $gte: [ "$date", new ISODate("2022-06-31T00:00:00Z") ] },
{ $lte: [ "$date", new ISODate("2022-06-31T23:59:59Z") ] },
] } } },
as: 'stats_end_date_range',
} },
{ $addFields:
{
start_stats: { $first: "$stats_start_date_range" },
end_stats: { $first: "$stats_end_date_range" }
}
},
{
$group: {
_id: null,
avg_start: { $avg: "$start_stats.statisticsValue" },
avg_end: { $avg: "$end_stats.statisticsValue" }
}
}
])
For this query, the expected result is the average value of the statisticsValue field at the start and end dates, across every entity matching the filters.
I applied an index on the field used to left-join the collections in the $lookup stage, as well as on the date field used for getting statistics for a specific date.
The problem is that the query takes about 1 second for the maximum number of documents after the $match stage (about 1000 documents), and I need to perform 4 such queries. The statistics collection contains 800k documents, and the number is growing every day.
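A compound index covering both the equality on entityId and the range on date inside the $lookup sub-pipeline may also be worth testing in place of the two single-field indexes; a sketch, assuming the statistics collection named above:

db.statistics.createIndex({ entityId: 1, date: 1 })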
I was wondering if I can do anything to make the query execution faster. I considered:
time series collection
reorganizing collections structure (don't know how)
merging those 4 separate queries into 1, using facet stage
But I'm not sure if MongoDB is a suitable data source for such operations, and maybe I should consider another one if I want to perform such queries.
It's hard to guess what you would like to get. An approach could be this one:
const entities = db.getCollection('entities').aggregate([
{ $match: { filterField1: "a" } }
]).toArray().map(x => x._id)
db.getCollection('statistics').aggregate([
{
$match: {
entityId: { $in: entities },
date: {
$gte: ISODate("2022-06-01T00:00:00Z"),
$lte: ISODate("2022-06-31T23:59:59Z")
}
}
},
{
$facet: {
stats_start_date_range: [
{
$match: {
date: {
$gte: ISODate("2022-06-01T00:00:00Z"),
$lte: ISODate("2022-06-01T23:59:59Z")
}
}
}
],
stats_end_date_range: [
{
$match: {
date: {
$gte: ISODate("2022-06-31T00:00:00Z"),
$lte: ISODate("2022-06-31T23:59:59Z")
}
}
}
]
}
},
{
$addFields: {
start_stats: { $first: "$stats_start_date_range" },
end_stats: { $first: "$stats_end_date_range" }
}
},
{
$group: {
_id: null,
avg_start: { $avg: "$start_stats.statisticsValue" },
avg_end: { $avg: "$end_stats.statisticsValue" }
}
}
]);
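A variant worth sketching: since the $facet stage and its sub-pipelines cannot use indexes, the initial $match could fetch only the two day ranges and a single $group with $cond could compute both averages, so every fetched statistics document is used exactly once ($avg ignores the null values produced by $cond; dates and field names are the ones used above):

db.getCollection('statistics').aggregate([
  {
    $match: {
      entityId: { $in: entities },
      $or: [
        { date: { $gte: ISODate("2022-06-01T00:00:00Z"), $lte: ISODate("2022-06-01T23:59:59Z") } },
        { date: { $gte: ISODate("2022-06-30T00:00:00Z"), $lte: ISODate("2022-06-30T23:59:59Z") } }
      ]
    }
  },
  {
    $group: {
      _id: null,
      // average over documents from the start day only
      avg_start: {
        $avg: {
          $cond: [
            { $lte: ["$date", ISODate("2022-06-01T23:59:59Z")] },
            "$statisticsValue",
            null
          ]
        }
      },
      // average over documents from the end day only
      avg_end: {
        $avg: {
          $cond: [
            { $gte: ["$date", ISODate("2022-06-30T00:00:00Z")] },
            "$statisticsValue",
            null
          ]
        }
      }
    }
  }
]);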

How to aggregate two collections and match a field with an array

I need to group the results of two collections candidatos and ofertas, and then "merge" those groups to return an array with matched values.
I've created this example with the aggregate and similar data to make this easier to test:
https://mongoplayground.net/p/m0PUfdjEye4
This is the explanation of the problem that I'm facing.
I can get both groups with the desired results independently:
ofertas collection:
db.getCollection('ofertas').aggregate([
{"$group" : {_id:"$ubicacion_puesto.provincia", countProvinciaOferta:{$sum:1}}}
]);
This is the result...
candidatos collection:
db.getCollection('candidatos').aggregate([
{"$group" : {_id:"$que_busco.ubicacion_puesto_trabajo.provincia", countProvinciaCandidato:{$sum:1}}}
]);
This is the result...
What I need to do is aggregate those groups and merge their results based on matching _id values. I think I'm on the right track with the next aggregate, but the field countOfertas always returns 0.0. I think there is something wrong in my $project $cond, but I don't know what it is. This is the aggregate:
db.getCollection('candidatos').aggregate([
{"$group" : {_id:"$que_busco.ubicacion_puesto_trabajo.provincia", countProvinciaCandidato:{$sum:1}}},
{
$lookup: {
from: 'ofertas',
let: {},
pipeline: [
{"$group" : {_id:"$ubicacion_puesto.provincia", countProvinciaOferta:{$sum:1}}}
],
as: 'ofertas'
}
},
{
$project: {
_id: 1,
countProvinciaCandidato: 1,
countOfertas: {
$cond: {
if: {
$eq: ['$ofertas._id', "$_id"]
},
then: '$ofertas.countProvinciaOferta',
else: 0,
}
}
}
},
{ $sort: { "countProvinciaCandidato": -1}},
{ $limit: 20 }
]);
And this is the result, but as you can see, the countOfertas field is always 0.
Any kind of help will be welcome.
What you have tried is much appreciated, but in the $project stage you need to use $reduce, which loops through the array and applies the condition.
Here is the code
db.candidatos.aggregate([
{
"$group": {
_id: "$que_busco.ubicacion_puesto_trabajo.provincia",
countProvinciaCandidato: { $sum: 1 }
}
},
{
$lookup: {
from: "ofertas",
let: {},
pipeline: [
{
"$group": {
_id: "$ubicacion_puesto.provincia",
countProvinciaOferta: { $sum: 1 }
}
}
],
as: "ofertas"
}
},
{
$project: {
_id: 1,
countProvinciaCandidato: 1,
countOfertas: {
"$reduce": {
"input": "$ofertas",
initialValue: 0,
"in": {
$cond: [
{ $eq: [ "$$this._id", "$_id" ] },
{ $add: [ "$$value", 1 ] },
"$$value"
]
}
}
}
}
},
{ $sort: { "countProvinciaCandidato": -1 } },
{ $limit: 20 }
])
Working Mongo playground
Note: if you only need to do this with aggregations, this is fine, but I personally feel this approach is not ideal. My suggestion is to run the two group aggregations concurrently from your service layer and merge the results programmatically, because $lookup is expensive and performance will degrade once the data gets massive.
The $eq in the $cond is comparing an array to a single value, so it never matches.
The $lookup stage results will be in the ofertas field as an array of documents, so '$ofertas._id' will be an array of all the _id values.
You will probably need to use $unwind or $reduce after the $lookup.
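As an alternative to $reduce, the matching group can also be pulled out of the "ofertas" array with $filter and $arrayElemAt; a sketch of just the replacement $project stage, using the same field names as the pipelines above:

{
  $project: {
    _id: 1,
    countProvinciaCandidato: 1,
    countOfertas: {
      $ifNull: [
        {
          $arrayElemAt: [
            {
              $map: {
                // keep only the ofertas group whose province matches this group's _id
                input: { $filter: { input: "$ofertas", cond: { $eq: ["$$this._id", "$_id"] } } },
                in: "$$this.countProvinciaOferta"
              }
            },
            0
          ]
        },
        0 // no offers in this province
      ]
    }
  }
}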

How to find duplicate records based on an id and a datetime field in MongoDB?

I have a MongoDB collection with millions of records. Sample records are shown below:
[
{
_id: ObjectId("609977b0e8e1c615cb551bf5"),
activityId: "123456789",
updateDateTime: "2021-03-24T20:12:02Z"
},
{
_id: ObjectId("739177b0e8e1c615cb551bf5"),
activityId: "123456789",
updateDateTime: "2021-03-24T20:15:02Z"
},
{
_id: ObjectId("805577b0e8e1c615cb551bf5"),
activityId: "123456789",
updateDateTime: "2021-03-24T20:18:02Z"
}
]
Multiple records can have the same activityId; in this case I want just the record that has the largest updateDateTime.
I have tried doing this and it works fine on a smaller collection but times out on a large collection.
[
{
$lookup: {
from: "MY_TABLE",
let: {
existing_date: "$updateDateTime",
existing_sensorActivityId: "$activityId"
},
pipeline: [
{
$match: {
$expr: {
$and: [
{ $eq: ["$activityId", "$$existing_sensorActivityId"] },
{ $gt: ["$updateDateTime", "$$existing_date"] }
]
}
}
}
],
as: "matched_records"
}
},
{ $match: { "matched_records.0": { $exists: true } } },
{ $project: { _id: 1 } }
]
This gives me the _ids of all the records which have the same activityId but a smaller updateDateTime.
The slowness occurs at this step -> "matched_records.0": {$exists:true}
Is there a way to speed up this step or are there any other approach to this problem?
You can find the unique documents and write the result to a new collection using $out, instead of finding the duplicate documents and deleting them.
How to find unique documents?
$sort by updateDateTime in descending order
$group by activityId and get first root record
$replaceRoot to replace record in root
$out to write query result in new collection
[
{ $sort: { updateDateTime: -1 } },
{
$group: {
_id: "$activityId",
record: { $first: "$$ROOT" }
}
},
{ $replaceRoot: { newRoot: "$record" } },
{ $out: "newCollectionName" } // set new collection name
]
Playground
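If the goal is to keep the original collection and just find the _ids of the older duplicates for deletion, a sketch along the same lines (collection name MY_TABLE as above; the duplicateIds field name is just an example):

db.MY_TABLE.aggregate([
  { $sort: { activityId: 1, updateDateTime: -1 } },
  { $group: { _id: "$activityId", ids: { $push: "$_id" } } },
  // drop the first (newest) _id per activityId, keep the rest as deletion candidates
  { $project: { duplicateIds: { $slice: ["$ids", 1, { $size: "$ids" }] } } },
  { $unwind: "$duplicateIds" }
], { allowDiskUse: true })

On a collection with millions of records, the $sort will need either an index on { activityId: 1, updateDateTime: -1 } or allowDiskUse to avoid exceeding the in-memory sort limit.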

Complex aggregation query with in clause from document array

Below is the sample MongoDB Data Model for a user collection:
{
"_id": ObjectId('58842568c706f50f5c1de662'),
"userId": "123455",
"user_name":"Bob"
"interestedTags": [
"music",
"cricket",
"hiking",
"F1",
"Mobile",
"racing"
],
"listFriends": [
"123456",
"123457",
"123458"
]
}
listFriends is an array of userId values for other users.
For a particular userId, I need to extract the listFriends (userIds) and, for those userIds, aggregate the interestedTags with their counts.
I would be able to achieve this by splitting the query into two parts:
1.) Extract the listFriends for a particular userId,
2.) Use this list in an aggregate() function, something like this
db.user.aggregate([
{ $match: { userId: { $in: [ "123456","123457","123458" ] } } },
{ $unwind: '$interestedTags' },
{ $group: { _id: '$interestedTags', countTags: { $sum : 1 } } }
])
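For reference, step 1 on its own is a single lookup of the friend list; a minimal mongosh sketch (using the userId from the sample document, and assuming that document exists):

// step 1: fetch the friend ids for one user
const friendIds = db.user.findOne({ userId: "123455" }).listFriends
// friendIds can then be substituted for the hard-coded array in the $in filter above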
The question I am trying to solve: is there a way to achieve the above functionality (both steps 1 and 2) in a single aggregate() call?
You could use $lookup to fetch the friend documents. This stage is usually used to join two different collections, but it can also join a collection with itself; in your case I think it should be fine:
db.user.aggregate([{
$match: {
userId: '123455',
}
}, {
$unwind: '$listFriends',
}, {
$lookup: {
from: 'user',
localField: 'listFriends',
foreignField: 'userId',
as: 'friend',
}
}, {
$project: {
friend: {
$arrayElemAt: ['$friend', 0]
}
}
}, {
$unwind: '$friend.interestedTags'
}, {
$group: {
_id: '$friend.interestedTags',
count: {
$sum: 1
}
}
}]);
Note: I use $lookup and $arrayElemAt, which are only available in MongoDB 3.2 or newer, so check your MongoDB version before using this pipeline.