Remove duplicates by field based on secondary field - mongodb

I have a use case where I am working with objects that appear as such:
{
  "data": {
    "uuid": "0001-1234-5678-9101"
  },
  "organizationId": 10192432,
  "lastCheckin": "2022-03-19T08:23:02.435+00:00"
}
Due to some old bugs in our application, we've accumulated many duplicates for these items in the database. The origin of the duplicates has been resolved in an upcoming release, but I need to ensure that prior to the release there are no such duplicates because the release includes a unique constraint on the "data.uuid" property.
I am trying to delete records based on the following criteria:
Any duplicate record based on "data.uuid" WHERE lastCheckin is NOT the most recent OR organizationId is missing.
Unfortunately, I am rather new to MongoDB and do not know how to express this in a query. I have tried aggregation to obtain the duplicate records and, while I've been able to do so, I have so far been unable to exclude the record in each duplicate group containing the most recent "lastCheckin" value, or even to include "organizationId" as part of the aggregation. Here's what I came up with:
db.collection.aggregate([
  { $match: {
    "_id": { "$ne": null },
    "count": { "$gt": 1 }
  }},
  { $group: {
    _id: "$data.uuid",
    "count": { "$sum": 1 }
  }},
  { $project: {
    "uuid": "$_id",
    "_id": 0
  }}
])
The above was mangled together from various other Stack Overflow posts describing the aggregation of duplicates. I am not sure whether this is the right way to approach the problem. One immediate problem I can identify: simply getting the "data.uuid" property, without any additional criteria to identify the invalid duplicates, makes it hard to envision a single query that can delete the invalid records without also taking the valid ones.
Thanks for any help.

I am not sure if this is possible via a single query, but this is how I would approach it: first sort the documents by lastCheckin, and then group them by data.uuid, like this:
db.collection.aggregate([
  { $sort: { lastCheckin: -1 } },
  { $group: {
    _id: "$data.uuid",
    "docs": { "$push": "$$ROOT" }
  }}
]);
Once you have these results, you can filter out the documents you want to delete according to your criteria and collect their _id values. The documents in each group will be sorted by lastCheckin in descending order, so filtering should be easy.
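For example, here is a minimal mongosh sketch of that filtering step (field names follow the question; the "keep" rule is the literal criteria, so a document survives only if it is the newest in its group and has an organizationId):
const idsToDelete = [];
db.collection.aggregate([
  { $sort: { lastCheckin: -1 } },
  { $group: { _id: "$data.uuid", docs: { $push: "$$ROOT" } } }
]).forEach(group => {
  group.docs.forEach((doc, i) => {
    const isMostRecent = i === 0;               // docs are sorted newest-first
    const hasOrg = doc.organizationId != null;  // missing or null counts as absent
    if (!(isMostRecent && hasOrg)) idsToDelete.push(doc._id);
  });
});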
Finally, delete the documents, using this query (in newer shells, deleteMany() is the non-deprecated equivalent of remove()):
db.collection.remove({ _id: { $in: [ /* array of _ids collected above */ ] } });

Related

mongoDB updating values to sum and record when record already exists in collection

I'm struggling wrapping my head around how to do this, but hopefully I can get some help here.
I have a collection in MongoDB that has values aggregated over a day. I have an index in the collection that enforces each record to be unique (name, date).
Because of issues I don't control, there is occasionally data that is split in two when it should be one.
What I want to do is when an insert is attempted but fails because the unique condition would fail, I want to update the record with an aggregated value.
This is what I have so far...
update = db.collection.aggregate([
  {
    "$addFields": {
      "views": { "$sum": ["$views", "$views"] },
      "avg_time": { "$avg": ["$avg_time", "$avg_time"] }
    }
  },
  {
    "$out": { "db": "collection" }
  }
])
I think where I'm confused is that I don't see how MongoDB knows which record I'm attempting to update, and the way I refer to the old value in the query just can't be correct.
You should replace the $out stage with a $merge stage, with the whenMatched option set based on your requirement.
update = db.collection.aggregate([
  {
    "$addFields": {
      "views": { "$sum": ["$views", "$views"] },
      "avg_time": { "$avg": ["$avg_time", "$avg_time"] }
    }
  },
  {
    "$merge": {
      into: "collectionName",       // collection name you want to merge with
      on: "_id",                    // the unique indexed key causing the conflict
      whenMatched: "keepExisting",  // action to perform when the key already exists
      whenNotMatched: "insert"      // action to perform when there is no conflict
    }
  }
])
Refer to the MongoDB $merge documentation for more info on the various match options available.
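Since the question asks to combine the incoming values with the stored ones, note that whenMatched also accepts a pipeline, in which $$new refers to the incoming document and a bare field path refers to the stored one. A hedged sketch (the "staging" and "collectionName" collection names are assumptions; the on key follows the question's unique (name, date) index):
db.staging.aggregate([
  {
    "$merge": {
      into: "collectionName",      // assumed target collection
      on: ["name", "date"],        // the question's unique compound index
      whenMatched: [
        { "$set": {
          "views": { "$sum": ["$views", "$$new.views"] },          // stored + incoming
          "avg_time": { "$avg": ["$avg_time", "$$new.avg_time"] }  // simple average of the two
        }}
      ],
      whenNotMatched: "insert"
    }
  }
])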

Is there an easy way to select the most recent document per unique value for a particular field?

For example, if you had a dataset that had the fields _id, uuid, and timestamp, and the data contained many thousands of documents, spread across let's say 200 different uuids, and you wanted to return 200 documents, one per uuid, with each being the most recent (timestamp descending etc), how would you go about this?
I've tried a few solutions and searched through StackOverflow without much luck. I'm sure there is some way to do this with aggregate.
Any tips or nods in the right direction appreciated.
Thanks
Well, it turns out the solution is actually quite simple. Use the distinct field as the $group _id, sort first, and then select the last item with $last. Like so:
db.example.aggregate([
  { $sort: { "timestamp": 1 } },
  { $group: { _id: "$uuid", timestamp: { $last: "$timestamp" } } }
])
Note that the ascending $sort is what makes $last mean "most recent"; without a sort, $last just picks whichever document happens to come through the pipeline last, so don't drop it. And if you want the entire most recent document rather than just its timestamp, take $$ROOT with $last:
db.example.aggregate([
  { $sort: { "timestamp": 1 } },
  { $group: { _id: "$uuid", doc: { $last: "$$ROOT" } } }
])
Well, that settles that.

MongoDb aggregate with limit and without limit

There is a collection in Mongo with 40 million records.
db.getCollection('feedposts').aggregate([
  {
    "$match": {
      "$or": [
        { "isOfficial": true },
        {
          "creator": ObjectId("537f267c984539401ff448d2"),
          type: { $nin: ['challenge_answer', 'challenge_win'] }
        }
      ]
    }
  },
  { $sort: { timeline: -1 } }
])
This request never finishes. But if you add a limit before the sort, with the limit known in advance to be higher than the total number of records, for example 1,000,000,000,000,000, the request completes instantly:
db.getCollection('feedposts').aggregate([
  {
    "$match": {
      "$or": [
        { "isOfficial": true },
        {
          "creator": ObjectId("537f267c984539401ff448d2"),
          type: { $nin: ['challenge_answer', 'challenge_win'] }
        }
      ]
    }
  },
  { $limit: 10000000000000000 },
  { $sort: { timeline: -1 } }
])
Please tell me why this is happening?
What problems can I expect in the future if I leave it this way?
TLDR: Mongo is using the wrong index for the query
Why is this happening?
Well, basically for every query you run, Mongo holds a quick "competition" between the relevant indexes in order to choose which one to use: the first index to retrieve 1001 documents "wins".
This wrong-index situation usually occurs when an ascending or descending sort field has a matching index of its own: under certain conditions that index wins the fetching competition even though it is a poor fit for the rest of the query. This is very risky, as stable code can suddenly become a huge bottleneck.
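To see which index the planner actually picked, you can run the pipeline through explain() (a sketch; "executionStats" also reports how many index keys were examined):
db.getCollection('feedposts').explain("executionStats").aggregate([
  { $match: { /* same $or predicate as above */ } },
  { $sort: { timeline: -1 } }
])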
What can we do?
You have a few options:
Use the hint option to make Mongo use the compound index you have ready for this pipeline (see the sketch after this list).
Drop the rogue index to ensure this never happens again elsewhere (which is my recommended option).
Keep doing what you're doing: by adding this huge $limit stage you throw Mongo's competition off and ensure the right index is picked.
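A minimal sketch of the hint option, assuming a compound index such as { isOfficial: 1, timeline: -1 } already exists for this pipeline:
db.getCollection('feedposts').aggregate(
  [
    { $match: { /* same $or predicate as above */ } },
    { $sort: { timeline: -1 } }
  ],
  { hint: { isOfficial: 1, timeline: -1 } }  // force the intended index by its key pattern
)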

Update Array Children Sorted Order

I have a collection containing objects with the following structure
{
  "dep_id": "some_id",
  "departament": "dep name",
  "employees": [
    { "name": "emp1", "age": 31 },
    { "name": "emp2", "age": 35 }
  ]
}
I would like to sort and save the array of employees for the object with id "some_id", by employees.age, descending. The best outcome would be to do this atomically using mongodb's query language. Is this possible?
If not, how can I rearrange the subdocuments without affecting the parent's other data or the data of the subdocuments? In case I have to download the data from the database and save back the sorted array of children, what would happen if something else performs an update to one of the children or children are added or removed in the meantime?
In the end, the data should be persisted to the database like this:
{
  "dep_id": "some_id",
  "departament": "dep name",
  "employees": [
    { "name": "emp2", "age": 35 },
    { "name": "emp1", "age": 31 }
  ]
}
The best way to do this is to actually apply the $sort modifier as you add items to the array. As you say in your comment, "My actual objects have a 'rank' and 'created_at'", which means you really should have asked that in your question instead of writing a "contrived" case (don't know why people do that).
So for "sorting" by multiple properties, the following reference would adjust like this:
db.collection.update(
  { },
  { "$push": { "employees": { "$each": [], "$sort": { "rank": -1, "created_at": -1 } } } },
  { "multi": true }
)
But to update all the data you presently have, as shown in the question, you would sort on "age" with:
db.collection.update(
  { },
  { "$push": { "employees": { "$each": [], "$sort": { "age": -1 } } } },
  { "multi": true }
)
Which oddly uses $push to actually "modify" an array? Yes, it's true: since the $each modifier says we are not actually adding anything new, the $sort modifier simply applies to the array in place and "re-orders" it.
Of course this would then explain how "new" updates to the array should be written in order to apply that $sort and ensure that the "largest age" is always "first" in the array:
db.collection.update(
  { "dep_id": "some_id" },
  { "$push": {
    "employees": {
      "$each": [{ "name": "emp3", "age": 32 }],
      "$sort": { "age": -1 }
    }
  }}
)
So what happens here is as you add the new entry to the array on update, the $sort modifier is applied and re-positions the new element between the two existing ones since that is where it would sort to.
This is a common pattern with MongoDB and is typically used in combination with the $slice modifier in order to keep arrays at a "maximum" length as new items are added, yet retain "ordered" results. And quite often "ranking" is the exact usage.
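For instance, here is a hedged sketch of that $slice combination (the new entry and the limit of 10 are made up for illustration):
db.collection.update(
  { "dep_id": "some_id" },
  { "$push": {
    "employees": {
      "$each": [{ "name": "emp4", "age": 40 }],  // hypothetical new entry
      "$sort": { "age": -1 },                    // keep the array ordered by age
      "$slice": 10                               // and trim it to the 10 highest
    }
  }}
)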
So overall, you can actually "update" your existing data and re-order it with "one simple atomic statement". No looping or collection renaming required. Furthermore, you now have a simple atomic method to "update" the data and maintain that order as you add new array items, or remove them.
In order to get what you want you can use the following query:
db.collection.aggregate({
  $unwind: "$employees" // flatten the employees array
}, {
  $sort: {
    "employees.age": -1 // sort all documents by employee age (descending)
  }
}, {
  $group: { // restore the previous structure
    _id: "$_id",
    "dep_id": { $first: "$dep_id" },
    "departament": { $first: "$departament" },
    "employees": { $push: "$employees" }
  }
}, {
  $out: "output" // write everything out to a separate collection
})
After this step you would want to drop your source collection and rename the "output" collection to match the source collection's name.
This solution will, however, not deal with the concurrency issue. So you should remove write access from the collection first so nobody modifies it during the process and then restore it once you're done with the migration.
You could alternatively query all the data first, sort the employees array on the client side, and then use either single update queries or - faster but more complicated - a bulk write operation with all the individual update calls in order to update the existing documents. Here, you can use the entire document that you initially read as the filter for the update operation. So if an individual update does not modify any document, you know straight away that some other change must have modified the document you read before. Those cases you'd need to retry later (or straight away, until the update does actually modify a document). See the sketch below.
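A minimal mongosh sketch of that optimistic pattern (assuming the documents are small enough to be used whole as filters):
db.collection.find().forEach(original => {
  // sort a copy of the array by age, descending
  const sorted = original.employees.slice().sort((a, b) => b.age - a.age);
  const result = db.collection.updateOne(
    original,                          // matches only if the document is unchanged
    { $set: { employees: sorted } }
  );
  if (result.matchedCount === 0) {
    // another writer modified the document since we read it; retry this one
    print(`retry needed for ${original._id}`);
  }
});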

Order_by length of listfield in mongoengine

I want to run a query to get all Articles that have more than 6 com entries and then sort by the length of the com list. For this I am doing:
ArticleModel.objects.filter(com__6__exists=True).order_by('-com.length')[:50]
com is a ListField, but the ordering does not work. How can I fix it? Thanks
Standard queries cannot do this, as the "sort" needs to be done on a physical field present in the document. The best approach is to keep a count of your "list" as another field in the document. That also makes your query more efficient, as that "counter" field can be indexed, so the basic query becomes (raw MongoDB syntax):
{ "comLength": { "$gt": 6 } }
If you cannot or do not want to change the document structure then the only way to otherwise sort on the length of your list is to $project it via .aggregate():
ArticleModel._get_collection().aggregate([
    { "$match": { "com.6": { "$exists": True } } },
    { "$project": {
        "com": 1,
        "otherField": 1,
        "comLength": { "$size": "$com" }
    }},
    { "$sort": { "comLength": -1 } }
])
And that considers that you have MongoDB 2.6 at least for the use of the $size aggregation operator. If you don't then you have to $unwind and $group in order to calculate the length of arrays:
ArticleModel._get_collection().aggregate([
    { "$match": { "com.6": { "$exists": True } } },
    { "$unwind": "$com" },
    { "$group": {
        "_id": "$_id",
        "otherField": { "$first": "$otherField" },
        "com": { "$push": "$com" },
        "comLength": { "$sum": 1 }
    }},
    { "$sort": { "comLength": -1 } }
])
So if you are going to go down that route, take a good look at the documentation, since you are possibly not used to the raw MongoDB syntax and have been using the query DSL that MongoEngine provides.
Overall, only the aggregation providers in .aggregate() or .mapReduce() can actually "create a field" that is not present in the document. There is also no test for the "current" length available to standard projection or sorting of documents.
Your best option is to add another field and keep it in sync with the actual array length (see the sketch below). Failing that, the above shows you the general approach.
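A hedged sketch of keeping that counter in sync at write time (raw MongoDB syntax; "comLength", the "articles" collection name, and the two variables are assumptions for illustration):
db.articles.update(
    { "_id": articleId },                 // hypothetical article being commented on
    {
        "$push": { "com": newComment },   // add the comment...
        "$inc": { "comLength": 1 }        // ...and bump the counter in the same atomic update
    }
)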
If you're creating the database and you know such a query will be requested a lot, it's recommended to add a "com_length" field to ArticleModel and populate it automatically on every save via the save() method.
Add this inside your ArticleModel in models.py:
def save(self, *args, **kwargs):
    self.com_length = len(self.com)
    return super(ArticleModel, self).save(*args, **kwargs)
Then, to answer the question asked:
ArticleModel.objects.filter(com__6__exists=True).order_by('-com_length')[:50]