MongoDB - add index on 'calculated' fields

I have a query that includes an $expr operator with a $cond in it.
Basically, I want to find documents with a timestamp from a certain year. If the timestamp is not set, I'll use the creation date instead.
{
  $expr: {
    $eq: [
      {
        $cond: {
          if: { $eq: [{ $type: '$TimeStamp' }, 'null'] },
          then: { $year: '$Created' },
          else: { $year: '$TimeStamp' }
        }
      },
      <wanted-year>
    ]
  }
}
It would be nice to have this query use an index. But is that possible? Should I just add an index on both the TimeStamp and Created fields? Or is it possible to create an index for a Year field that doesn't really exist on the document itself...?

Not possible. Indexes are built from fields that exist on the documents and are stored on disk before the query executes, so a value that is only computed at query time cannot be indexed.
Workaround: On-Demand Materialized Views
You store your calculated data in a separate collection (with its own indexes), as sketched below.
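A minimal sketch of that workaround, assuming MongoDB 4.2+ for $merge; the view name collection_byYear and the field name effectiveYear are illustrative:

// Rebuild the materialized view on demand: compute the effective year
// (TimeStamp if set, otherwise Created) and merge it into a side collection.
db.collection.aggregate([
  { $set: { effectiveYear: { $year: { $ifNull: ['$TimeStamp', '$Created'] } } } },
  { $merge: { into: 'collection_byYear', whenMatched: 'replace' } }
])
// The computed field can then be indexed and queried directly.
db.collection_byYear.createIndex({ effectiveYear: 1 })
db.collection_byYear.find({ effectiveYear: 2020 })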

This can't be done today without precomputing that information and storing it in a field on the document. The closest alternative would probably be to use MongoDB 4.2's aggregation pipeline-powered updates to precompute and store a createdOrTimestamp field whenever your documents are updated. You could then create an index on createdOrTimestamp that would be used when querying for documents that match a certain year.
What this would look like when updating or after inserting your document:
db.collection.update({ _id: ObjectId("5e8523e7ea740b14fb16b5c3") }, [
  {
    $set: {
      createdOrTimestamp: {
        $cond: {
          // $gt against null is true only when TimeStamp exists and is not null
          if: { $gt: ['$TimeStamp', null] },
          then: '$TimeStamp',
          else: '$Created'
        }
      }
    }
  }
])
If documents already exist, you could also send off an updateMany operation with that aggregation to get that computed field into all your existing documents.
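A minimal sketch of that backfill, assuming MongoDB 4.2+ (it uses a pipeline update); the year-range query at the end just illustrates how the index would then be used:

// Backfill createdOrTimestamp across the whole collection, then index it.
db.collection.updateMany({}, [
  {
    $set: {
      createdOrTimestamp: {
        $cond: {
          if: { $gt: ['$TimeStamp', null] },
          then: '$TimeStamp',
          else: '$Created'
        }
      }
    }
  }
])
db.collection.createIndex({ createdOrTimestamp: 1 })
// Querying a given year as a date range lets the index be used:
db.collection.find({
  createdOrTimestamp: { $gte: ISODate('2020-01-01'), $lt: ISODate('2021-01-01') }
})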
It would be really nice to be able to define computed fields declaratively on a collection just like indexes, so that they take care of keeping themselves up to date!

Related

Remove duplicates by field based on secondary field

I have a use case where I am working with objects that look like this:
{
  "data": {
    "uuid": "0001-1234-5678-9101"
  },
  "organizationId": 10192432,
  "lastCheckin": "2022-03-19T08:23:02.435+00:00"
}
Due to some old bugs in our application, we've accumulated many duplicates for these items in the database. The origin of the duplicates has been resolved in an upcoming release, but I need to ensure that prior to the release there are no such duplicates because the release includes a unique constraint on the "data.uuid" property.
I am trying to delete records based on the following criteria:
Any duplicate record based on "data.uuid" WHERE lastCheckin is NOT the most recent OR organizationId is missing.
Unfortunately, I am rather new to using MongoDB and do not know how to express this in a query. I have tried aggregation to obtain the duplicate records and, while I've been able to do so, I have so far been unable to exclude the records in each duplicate group containing the most recent "lastCheckin" value, or even to include "organizationId" as part of the aggregation. Here's what I came up with:
db.collection.aggregate([
  { $group: {
      _id: "$data.uuid",
      "count": { "$sum": 1 }
  }},
  { $match: {
      "_id": { "$ne": null },
      "count": { "$gt": 1 }
  }},
  { $project: {
      "uuid": "$_id",
      "_id": 0
  }}
])
The above was mangled together from various other Stack Overflow posts describing the aggregation of duplicates, and I am not sure whether this is the right way to approach the problem. One immediate problem I can identify is that fetching only the "data.uuid" property, without additional criteria to identify the invalid duplicates, makes it hard to envision a single query that deletes the invalid records without also taking the valid ones.
Thanks for any help.
I am not sure if this is possible via a single query, but this is how I would approach it: first sort the documents by lastCheckin and then group them by data.uuid, like this:
db.collection.aggregate([
  {
    $sort: { lastCheckin: -1 }
  },
  {
    $group: {
      _id: "$data.uuid",
      "docs": { "$push": "$$ROOT" }
    }
  }
]);
Once you have these results, you can filter out the documents you want to delete according to your criteria and collect their _ids. The documents in each group will be sorted by lastCheckin in descending order, so filtering should be easy.
Finally, delete the documents, using this query:
db.collection.remove({ _id: { $in: [ /* array of _ids collected above */ ] } });
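As a rough sketch of how that collect-and-delete step could look in the shell (this keeps the newest document per data.uuid and deletes the rest; also dropping documents that lack organizationId would need an extra check, noted in a comment):

// Re-run the sort + group pipeline, then walk each duplicate group:
// docs[0] is the most recent check-in, everything after it is deletable.
const deletableIds = [];
db.collection.aggregate([
  { $sort: { lastCheckin: -1 } },
  { $group: { _id: "$data.uuid", docs: { $push: "$$ROOT" } } }
]).forEach(group => {
  // add a check on doc.organizationId here if it should also be a criterion
  group.docs.slice(1).forEach(doc => deletableIds.push(doc._id));
});
db.collection.deleteMany({ _id: { $in: deletableIds } });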

MongoDB $elemMatch comparison to field in same document

I want to create an aggregation step to match documents where the value of a field in a document exists within an array in the same document.
As a worked example (note this is very simplified; it will fit into a larger existing pipeline), given the documents:
{
  "_id": {"$oid": "61a9085af9733d0274c41990"},
  "myArray": [
    {"$oid": "61a9085af9733d0274c41991"},
    {"$oid": "61a9085af9733d0274c41992"},
    {"$oid": "61a9085af9733d0274c41993"}
  ],
  "myField": {"$oid": "61a9085af9733d0274c41991"} // < in 'myArray'
}
and
{
  "_id": {"$oid": "61a9085af9733d0274c41990"},
  "myArray": [
    {"$oid": "61a9085af9733d0274c41991"},
    {"$oid": "61a9085af9733d0274c41992"},
    {"$oid": "61a9085af9733d0274c41993"}
  ],
  "myField": {"$oid": "61a9085af9733d0274c41994"} // < not in 'myArray'
}
I want to match the first one, because the value of myField exists in the array, but not the second document.
It feels like this should be a really simple $elemMatch operation with an $eq operator, but I can't make it work, and every example I've found uses literals. What I've got currently is below; I've tried various combinations of quotes and dollar signs around myField.
[{
  $match: {
    myArray: {
      $elemMatch: {
        $eq: '$this.myField'
      }
    }
  }
}]
Am I doing something very obviously wrong? Is it not possible to use the value of a field in the same document with an $eq?
Hoping that someone can come along and point out where I'm being stupid :)
Thanks
You can simply use $in in an aggregation pipeline. In a plain $match, the value given to $eq is treated as the literal string '$this.myField', not as a field reference; wrapping the comparison in $expr switches to aggregation expressions, which can reference other fields of the same document.
db.collection.aggregate([
  {
    "$match": {
      $expr: {
        "$in": ["$myField", "$myArray"]
      }
    }
  }
])

Delete all but one duplicate from a MongoDB database

So I made the mistake of saving a lot of documents twice because I messed up my document id. Because I did an insert, I multiplied my documents every time I saved them. I want to delete all duplicates except the first one I wrote. Luckily the documents have an implicit unique key (match._id), and I should be able to tell which one was first because I am using the ObjectId.
The documents look like this:
{
  _id: "5e8e2d28ca6e660006f263e6",
  match: {
    _id: 2345,
    ...
  },
  ...
}
So, right now I have an aggregation that tells me which elements are duplicated and stores them in a collection. There is surely a more elegant way, but I am still learning.
[
  { $sort: { "_id": 1 } },
  { $group: {
      _id: "$match._id",
      duplicateIds: { $push: "$_id" },
      count: { $sum: 1 }
  }},
  { $match: {
      count: { $gt: 1 }
  }},
  { $addFields: {
      deletableIds: { $slice: ["$duplicateIds", 1, 1000] }
  }},
  { $out: 'DeleteableIds' }
]
Now I do not know how to proceed, as there does not seem to be a "delete" operation in aggregations, and I do not want to write that temp data to the db just so I can run a delete command with it; I want to delete the duplicates in one go. Is there any other way to do this? I am still learning MongoDB and feel a little overwhelmed :/
Rather than doing all of that, you can just pick the first document in each group for _id: "$match._id" and make it the root document. Also, I don't think you need the sort in your case:
db.collection.aggregate([
  {
    $group: {
      _id: "$match._id",
      doc: { $first: "$$ROOT" }
    }
  },
  {
    $replaceRoot: { newRoot: "$doc" }
  },
  { $out: 'DeleteableIds' }
])
I think you're on the right track. However, to delete the duplicates you've found, you can use a bulk write on the collection.
So if we imagine your aggregation query saved the following in the DeleteableIds collection:
> db.DeleteableIds.insertMany([
... {deletableIds: [1,2,3,4]},
... {deletableIds: [103,35,12]},
... {deletableIds: [345,311,232,500]}
... ]);
We can now take them and write a bulk write command:
const bulkwrite = db.DeleteableIds.find().map(x => ({ deleteMany : { filter: { _id: { $in: x.deletableIds } } } }))
Then we can execute that against the database:
> db.collection1.bulkWrite(bulkwrite)
This will then delete all the duplicates.

MongoDB querying aggregation in one single document

I have a short but important question. I am new to MongoDB and querying.
My database looks like the following: I only have one document stored in it (shown in a screenshot with some fields blurred).
The document consists of different fields:
two are blurred and not important
datum -> a date
instance -> an array of embedded documents; each instance entry has an id, two unimportant fields, and a code.
Now I want to query how many objects in my instance array have the group "a" and the text "sample".
Is this even possible?
I only found methods to count how many documents have something...
I am using MongoDB Compass, but I can also use PyMongo, MongoEngine, or any other tool for querying MongoDB.
Thank you in advance and if you have more questions please leave a comment!
You can try this
db.collection.aggregate([
  { $unwind: "$instance" },
  { $unwind: "$instance.label" },
  {
    $match: {
      "instance.label.group": "a",
      "instance.label.text": "sample"
    }
  },
  {
    $group: {
      _id: {
        group: "$instance.label.group",
        text: "$instance.label.text"
      },
      count: { $sum: 1 }
    }
  }
])
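Since the original screenshot is not available, this is the document shape the pipeline above assumes (field names inferred from the pipeline, values purely illustrative):

// Each entry of 'instance' carries a 'label' array whose elements
// have 'group' and 'text' fields.
{
  datum: ISODate("2021-01-01T00:00:00Z"),
  instance: [
    {
      id: 1,
      label: [
        { group: "a", text: "sample" },
        { group: "b", text: "other" }
      ]
    }
  ]
}
// For this document the pipeline would return:
// { _id: { group: "a", text: "sample" }, count: 1 }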

MongoDB aggregate with limit and without limit

There is a collection in Mongo with 40 million records.
db.getCollection('feedposts').aggregate([
  {
    "$match": {
      "$or": [
        { "isOfficial": true },
        {
          "creator": ObjectId("537f267c984539401ff448d2"),
          type: { $nin: ['challenge_answer', 'challenge_win'] }
        }
      ]
    }
  },
  {
    $sort: { timeline: -1 }
  }
])
This request never ends.
But if you add a $limit before the sort, with the limit higher than the total number of records (for example, 10000000000000000), the request is processed instantly:
db.getCollection('feedposts').aggregate([
  {
    "$match": {
      "$or": [
        { "isOfficial": true },
        {
          "creator": ObjectId("537f267c984539401ff448d2"),
          type: { $nin: ['challenge_answer', 'challenge_win'] }
        }
      ]
    }
  },
  {
    $limit: 10000000000000000
  },
  {
    $sort: { timeline: -1 }
  }
])
Please tell me why this is happening?
What problems can I expect in the future if I leave it this way?
TLDR: Mongo is using the wrong index for the query
Why is this happening?
Well, basically, for every query you run, Mongo simulates a quick "competition" between the relevant indexes in order to choose which one to use; the first index to retrieve 1001 documents "wins".
Usually this wrong-index situation occurs when a sort on an ascending or descending field has a matching index that, under certain conditions, wins the fetching competition. This is very risky, as stable code can suddenly become a huge bottleneck.
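You can see which plan actually won with explain; a quick check from the shell (the simplified filter here is just for brevity):

// Inspect the winning plan and the index it used for this pipeline.
db.getCollection('feedposts').explain('executionStats').aggregate([
  { "$match": { "isOfficial": true } },
  { $sort: { timeline: -1 } }
])
// Check queryPlanner.winningPlan in the output for the chosen index.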
What can we do?
You have a few options:
Use the hint option to make Mongo use the compound index you have ready for this pipeline (see the sketch after this list).
Drop the rogue index to ensure this never happens again elsewhere (which is my recommended option).
Keep doing what you're doing: by adding this huge $limit stage you're throwing Mongo's competition off and ensuring the right index is picked.
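A minimal sketch of the hint option (the index name creator_1_timeline_-1 is illustrative; use the name of your actual compound index):

// Bypass the plan competition and force a specific index.
db.getCollection('feedposts').aggregate(
  [
    {
      "$match": {
        "$or": [
          { "isOfficial": true },
          {
            "creator": ObjectId("537f267c984539401ff448d2"),
            type: { $nin: ['challenge_answer', 'challenge_win'] }
          }
        ]
      }
    },
    { $sort: { timeline: -1 } }
  ],
  { hint: "creator_1_timeline_-1" }
)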