MongoDB: aggregate IDs efficiently for bulk searches?

I have more than 8 references in a MongoDB document. They are ObjectIds stored in the origin document, and to get the real data of the foreign documents I have to run an aggregation query, something like this:
{
  $lookup: {
    from: "departments",
    let: { "department": "$_department" },
    pipeline: [
      { $match: { $expr: { $eq: ["$_id", "$$department"] } } },
    ],
    as: "department"
  }
},
{
  $unwind: { "path": "$department", "preserveNullAndEmptyArrays": true }
},
That works: instead of an ObjectId I get the real department object.
However, this takes time and makes the find queries slow.
I have noticed that the same IDs appear multiple times, so it would be better to collect all of the unique IDs, fetch them from the DB only once, and then reuse the same object.
I don't know of any plugin or service that does this with MongoDB. I can build one myself; I just want to know, before I work on something like this, whether there is already a service or package on GitHub.
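For reference, a minimal sketch of the manual approach described above, assuming the Node.js driver and a hypothetical origin collection ("employees" stands in for whatever holds the _department references): collect the distinct department ObjectIds, fetch them once with $in, and map them back onto the origin documents.

const { MongoClient, ObjectId } = require("mongodb");

async function populateDepartments(db) {
  const docs = await db.collection("employees").find({}).toArray();

  // 1. collect the unique department ObjectIds
  const ids = [...new Set(docs.map((d) => String(d._department)))]
    .map((id) => new ObjectId(id));

  // 2. fetch each department exactly once
  const departments = await db.collection("departments")
    .find({ _id: { $in: ids } })
    .toArray();
  const byId = new Map(departments.map((d) => [String(d._id), d]));

  // 3. reuse the same department object for every reference
  return docs.map((d) => ({ ...d, department: byId.get(String(d._department)) }));
}

Libraries such as Mongoose take a similar approach in populate(), batching the referenced ids of a path into a single $in query, so that may be worth a look before building your own.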

Related

Remove duplicates by field based on secondary field

I have a use case where I am working with objects that appear as such:
{
  "data": {
    "uuid": "0001-1234-5678-9101"
  },
  "organizationId": 10192432,
  "lastCheckin": ISODate("2022-03-19T08:23:02.435+00:00")
}
Due to some old bugs in our application, we've accumulated many duplicates of these items in the database. The cause of the duplicates is fixed in an upcoming release, but before that release I need to ensure there are no such duplicates, because the release adds a unique constraint on the "data.uuid" property.
I am trying to delete records based on the following criteria:
Any duplicate record based on "data.uuid" WHERE lastCheckin is NOT the most recent OR organizationId is missing.
Unfortunately, I am rather new to MongoDB and do not know how to express this in a query. I have tried an aggregation to obtain the duplicate records and, while I've been able to do that, I have so far been unable to exclude the record in each duplicate group with the most recent "lastCheckin" value, or even to include "organizationId" in the aggregation. Here's what I came up with:
db.collection.aggregate([
  { $group: {
      _id: "$data.uuid",
      "count": { "$sum": 1 }
  }},
  { $match: {
      "_id": { "$ne": null },
      "count": { "$gt": 1 }
  }},
  { $project: {
      "uuid": "$_id",
      "_id": 0
  }}
])
The above was pieced together from various other Stack Overflow posts describing how to aggregate duplicates, and I am not sure it is the right approach. One immediate problem I can see is that getting only the "data.uuid" values, without any additional criteria identifying the invalid duplicates, makes it hard to envision a single query that deletes the invalid records without also taking the valid ones.
Thanks for any help.
I am not sure if this is possible in a single query, but this is how I would approach it: first sort the documents by lastCheckin, then group them by data.uuid, like this:
db.collection.aggregate([
  {
    $sort: { lastCheckin: -1 }
  },
  {
    $group: {
      _id: "$data.uuid",
      "docs": { "$push": "$$ROOT" }
    }
  }
]);
Once you have these results, you can filter out the documents you want to delete according to your criteria and collect their _id values. The documents in each group will be sorted by lastCheckin in descending order, so filtering should be easy.
Finally, delete the documents with a query like this:
db.collection.remove({ _id: { $in: [ /* array of _ids collected above */ ] } });
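Putting the pieces together, here is a sketch of the whole filter-and-delete step in the shell, taking the stated criteria literally ("not the most recent lastCheckin OR organizationId is missing"). Field names are from the question; the wiring around them is an assumption:

const idsToDelete = [];
db.collection.aggregate([
  { $sort: { lastCheckin: -1 } },
  { $group: { _id: "$data.uuid", docs: { $push: "$$ROOT" }, count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } }   // only groups that actually contain duplicates
]).forEach((group) => {
  group.docs.forEach((doc, i) => {
    // docs arrive in sorted order, so index 0 holds the most recent lastCheckin
    const notMostRecent = i > 0;
    const missingOrg = doc.organizationId == null;
    if (notMostRecent || missingOrg) idsToDelete.push(doc._id);
  });
});
db.collection.deleteMany({ _id: { $in: idsToDelete } });

Note that, taken literally, the criteria can delete every member of a group if even the newest document lacks an organizationId, so you may want to always spare the first (newest) document per group.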

Mongodb aggregation that returns all documents where $lookup foreign doc DOESN'T exist

I'm working with a CMS right now where, if pages are deleted, the associated content isn't. For one of my clients this has become burdensome: we have accumulated millions of content docs over time, and it's making daily operations like restoring and backing up databases prohibitive.
Consider this structure:
Page document:
{
  _id: pageId,
  contentDocumentId: someContentDocId
}
Content document:
{
  _id: someContentDocId,
  page_id: pageId,
  content: [someContent, ...etc]
}
Is there a way to craft a MongoDB aggregation where we aggregate Content docs based on checking page_id, and if our check for page_id returns null, then we aggregate that doc? It's not something as simple as foreignField in a $lookup being set to null, is it?
This should do the trick:
db.content.aggregate([
  {
    "$lookup": {
      "from": "pages",
      "localField": "page_id",
      "foreignField": "_id",
      "as": "pages"
    }
  },
  {
    "$addFields": {
      "pages_length": { "$size": "$pages" }
    }
  },
  {
    "$match": { "pages_length": 0 }
  },
  {
    "$unset": ["pages", "pages_length"]
  }
])
We create an aggregation on the content collection and do a normal $lookup against the pages collection. When no matching page is found, the pages array will be [], so we simply filter for documents where that array is empty.
You can't use $size directly inside $match (without $expr) to filter on array length, so we create a temporary field pages_length to hold the length of the array.
In the end we remove the temporary fields with $unset.
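As an aside, a slightly shorter variant is possible because $expr does allow aggregation operators such as $size inside $match; a sketch along the same lines, not taken from the original answer:

db.content.aggregate([
  {
    "$lookup": {
      "from": "pages",
      "localField": "page_id",
      "foreignField": "_id",
      "as": "pages"
    }
  },
  // $expr evaluates $size directly, so no temporary pages_length field is needed
  { "$match": { "$expr": { "$eq": [{ "$size": "$pages" }, 0] } } },
  { "$unset": "pages" }
])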

Mongo $lookup, which way is the fastest?

It has been a while since I began using MongoDB aggregation.
It's a great way to perform complex queries and it has improved my app's performance in ways I never thought possible.
However, I came across $lookup and it appears that there are 3 ways of performing it. I would like to know the advantages and drawbacks of each of them.
For the examples below, I am starting from collectionA, using fieldA to match documents from collectionB via fieldB.
What I'd call preset $lookup
{
  $lookup: {
    from: 'collectionB',
    localField: 'fieldA',
    foreignField: 'fieldB',
    as: 'documentsB'
  }
}
What I'd call custom $lookup
{
  $lookup: {
    from: 'collectionB',
    let: { valueA: '$fieldA' },
    pipeline: [
      {
        $match: {
          $expr: { $eq: ['$$valueA', '$fieldB'] }
        }
      }
    ],
    as: 'documentsB'
  }
}
Performing a find then an aggregate on collectionB
const docsA = await db.collection('collectionA').find({}).toArray();
// Basically I will extract all possible values for the query to collectionB
const valuesForB = docsA.map((docA) => docA.fieldA);
db.collection('collectionB').aggregate([
  {
    $match: {
      fieldB: { $in: valuesForB }
    }
  }
]);
I'd like to know which one is the fastest, whether any parameters make one faster than the others, and whether there are any limitations to any of them.
From what I can tell, find + aggregate is faster than the preset $lookup, which is faster than the custom $lookup.
But then I wonder why the custom $lookup exists...
If the data is large, the preset $lookup will be faster.
Why?
All of the joining happens at the database level, so the intermediate data never has to be held in a separate variable.
With find + aggregate, the larger the data the longer it takes: you first pull the values out and then push an ever larger list back into the aggregation.
Tip:
If you want to use find + aggregate, have a look at MongoDB's distinct query.
Example
var arr = db.collection('collectionA').distinct('fieldA', {});
db.collection('collectionB').aggregate([
  {
    $match: {
      fieldB: { $in: arr }
    }
  }
]);
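As a side note on why the custom (pipeline) $lookup exists at all: it can carry conditions beyond the join key (extra filters, projections, limits inside the joined collection). And, if I recall the release notes correctly, since MongoDB 5.0 the shorthand and pipeline forms can be combined, so a plain equality join can still use the concise syntax; a sketch, with the version requirement being an assumption to verify:

{
  $lookup: {
    from: 'collectionB',
    localField: 'fieldA',
    foreignField: 'fieldB',
    pipeline: [
      // extra, non-join filters on collectionB go here (illustrative field)
      { $match: { someFlag: true } }
    ],
    as: 'documentsB'
  }
}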

Incredibly slow query performance with $lookup and "sub" aggregation pipeline

Let's say I have two collections, tasks and customers.
Customers have a 1:n relation with tasks via a "customerId" field on the task documents.
I now have a view where I need to display tasks with customer names, AND I also need to be able to filter and sort by customer name. This means I can't put the $limit or $match stage before the $lookup in the following query.
So here is my example query:
db.task.aggregate([
  {
    "$match": { "_deleted": false }
  },
  {
    "$lookup": {
      "from": "customer",
      "let": { "foreignId": "$customerId" },
      "pipeline": [
        {
          "$match": {
            "$expr": {
              "$and": [
                { "$eq": ["$_id", "$$foreignId"] },
                { "$eq": ["$_deleted", false] }
              ]
            }
          }
        }
      ],
      "as": "customer"
    }
  },
  {
    "$unwind": {
      "path": "$customer",
      "preserveNullAndEmptyArrays": true
    }
  },
  {
    "$match": { "customer.name": 'some_search_string' }
  },
  {
    "$sort": { "customer.name": -1 }
  },
  {
    "$limit": 35
  },
  {
    "$project": {
      "_id": 1,
      "customer._id": 1,
      "customer.name": 1,
      "description": 1,
      "end": 1,
      "start": 1,
      "title": 1
    }
  }
])
This query gets incredibly slow as the collections grow. With 1000 tasks and 20 customers it already takes about 500ms to deliver a result.
I'm aware that this happens because the $lookup operator has to do a table scan for each row that enters the aggregation pipeline's lookup stage.
I have tried to set indexes as described here: Poor lookup aggregation performance, but that doesn't seem to have any impact.
My next guess was that the "sub"-pipeline in the $lookup stage is not capable of using indexes, so I replaced it with a simple
"$lookup": {
"from": "customer",
"localField": "customerId",
"foreignField": "_id",
"as": "customer"
}
But still the indexes are not used, or they don't have any impact on performance. (To be honest, I don't know which of the two it is, since I haven't managed to get useful .explain() output for the aggregation pipeline.)
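For reference, explain output can be requested for aggregations too, which would show whether the $lookup and $match stages hit an index; a sketch using the standard shell helpers:

// Either of these returns the query planner / execution stats for the pipeline
db.task.explain("executionStats").aggregate([ /* pipeline from above */ ])
db.task.aggregate([ /* pipeline from above */ ], { explain: true })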
I have tried the following indexes:
Ascending, descending, hashed and text indexes on customerId
Ascending, descending, hashed and text indexes on customer.name
I'm grateful for any ideas on what I'm doing wrong or how I could achieve the same thing with a better aggregation pipeline.
Additional info:
I'm using a three-member replica set. I'm on MongoDB 4.0.
Please note: I'm aware that I'm using a non-relational database to achieve highly relational objectives, but in this project MongoDB was our choice due to its Change Streams feature. If anybody knows a different database with a comparable feature (realtime push notifications on changes) that can be run on-premise (so Firebase drops out), I would love to hear about it!
Thanks in advance!
I found out why my indexes weren't used.
I queried the collection using a different collation than the collection's own collation.
But the default indexes on a collection (including the _id index) are always built with the collection's default collation.
Therefore the indexes were not used.
I changed the collection's collation to match the one used by the queries, and now the query takes just a fraction of the time (but is still slow :)).
(Yes, you have to recreate the collection to change its collation; no on-the-fly change is possible.)
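A sketch of what that looks like in the shell; the locale and strength values here are placeholders, not taken from the question:

// Recreate the collection with an explicit default collation...
db.createCollection("customer", { collation: { locale: "en", strength: 2 } })
// ...and run the aggregation with the same collation so the default indexes can be used
db.task.aggregate([ /* pipeline from above */ ], { collation: { locale: "en", strength: 2 } })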
Have you considered having a single collection for customers, with tasks as an embedded array in each document? That way, you would be able to index searches on both customer and task fields.
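A rough sketch of that embedded shape and a supporting index; field names are borrowed from the question's $project, the rest is illustrative:

// One customer document carries its tasks, so no $lookup is needed
db.customer.insertOne({
  name: "Some customer",
  _deleted: false,
  tasks: [
    { title: "A task", start: ISODate("2022-01-01"), end: ISODate("2022-01-02"), _deleted: false }
  ]
})
// A compound index supporting the usual filter (_deleted) plus filtering/sorting by name
db.customer.createIndex({ _deleted: 1, name: 1 })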

How can I compare two fields in two different collections in MongoDB?

I am a beginner with MongoDB.
Right now, I am trying to build one query using Mongo. Please look at this and let me know whether it is possible, and if so, how I can do it.
collection: students
[{ id: "a", name: "a-name" }, { id: "b", name: "b-name" }, { id: "c", name: "c-name" }]
collection: school
[{
  name: "schoolA",
  students: ["a", "b", "c"]
}]
collection: room
[{
  name: "roomA",
  students: ["c", "a"]
}]
Expected result for roomA:
{
  name: "roomA",
  students: [
    { id: "a", name: "a-name", isRoom: "YES" },
    { id: "b", name: "b-name", isRoom: "NO" },
    { id: "c", name: "c-name", isRoom: "YES" }
  ]
}
Not sure about the isRoom property, but to perform a join across collections, you'd have two basic options:
code it yourself, with multiple queries
use the aggregation pipeline with the $lookup operator
As a quick example of $lookup, you can take a given room, unwind its students array (meaning separate out each student element into its own entity), and then look up the corresponding student id in the students collection.
Assuming a slight tweak to your room collection document:
[{
  name: "roomA",
  students: [ { studentId: "c" }, { studentId: "a" } ]
}]
Something like:
db.room.aggregate([
  {
    $unwind: "$students"
  },
  {
    $lookup: {
      from: "students",
      localField: "students.studentId",
      foreignField: "id",
      as: "classroomStudents"
    }
  },
  {
    $project: { _id: 0, name: 1, classroomStudents: 1 }
  }
])
That would yield something like:
{
  name: "roomA",
  classroomStudents: [
    { id: "a", name: "a-name" },
    { id: "c", name: "c-name" }
  ]
}
Disclaimer: I haven't actually run this aggregation, so there may be a few slight issues. Just trying to give you an idea of how you'd go about solving this with $lookup.
More info on $lookup is available in the MongoDB documentation.
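To get all the way to the expected output with the isRoom flag, one option is to start from the school (so every student appears) and test membership in the room's students array with $in. A sketch against the original array-of-ids schema; untested, and the join on the custom id field is an assumption:

db.school.aggregate([
  { $match: { name: "schoolA" } },
  { $unwind: "$students" },
  // resolve each student id to the full student document
  { $lookup: { from: "students", localField: "students", foreignField: "id", as: "student" } },
  { $unwind: "$student" },
  // pull in the room whose membership we want to check (uncorrelated lookup)
  { $lookup: { from: "room", pipeline: [ { $match: { name: "roomA" } } ], as: "room" } },
  { $unwind: "$room" },
  { $project: {
      _id: 0,
      roomName: "$room.name",
      id: "$student.id",
      name: "$student.name",
      isRoom: { $cond: [ { $in: ["$student.id", "$room.students"] }, "YES", "NO" ] }
  }},
  { $group: {
      _id: "$roomName",
      name: { $first: "$roomName" },
      students: { $push: { id: "$id", name: "$name", isRoom: "$isRoom" } }
  }}
])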