How to use two MongoDB aggregations to perform an updateMany - mongodb

I am trying to write a script that uses 2 aggregates and saves the results as an array to be used for an updateMany.
The first aggregate finds any documents that have a firstTrackingId and a secondTrackingId on them. I save this into an array. This aggregate is working correctly when tested alone.
The second aggregate uses the first aggregate's result array, pulling all documents that have a firstTrackingId from the first aggregate's results. It pulls any documents that do NOT have a secondTrackingId on them, and saves the unique mongo _id/ObjectId values to an array.
The updateMany will use all of the results from the second aggregation to update all relevant documents with a status of void.
All these functions are working when I give them hard-coded data, but I can't figure out how to pull the data from the arrays. I am not even sure if I'm "saving" it correctly, or if there is something else I should be doing aside from just initializing the aggregation as an array.
var ids = db.getCollection('Test').aggregate([
{
$match: {
"firstTrackingId": { "$ne": "" },
"secondTrackingId": { "$exists": true }
}
},
{
$group: {
_id: "$firstTrackingId",
}
},
])
var secondIds = db.getCollection('Test').aggregate([
{
$match: {
"firstTrackingId": { $in: ids },
"secondTrackingId": { $exists: false }
}
},
{
$group: {
"_id": "$_id",
}
},
])
db.getCollection('Test').updateMany({
"_id": {
"$in": secondIds
},
}, { $set: {
"status": "VOID"
} })
I tried printing the first aggregation's results out... can't really figure out how... so for the first one if I do:
print(ids.next(ids._id))
I get:
[object BSON]
Which leads me to believe I need to somehow perform an $objectToArray. If anyone has any insight, that'd be awesome. Thank you!

If you are using MongoDB 4.4+, you can do that with a single aggregation pipeline:
match documents with both first and second tracking ID
lookup an array of all documents with the same first tracking ID
unwind the array
consider the array elements as the root document
match to eliminate any that have a second tracking ID
set the desired status field
merge the results with the original collection
db.getCollection('Test').aggregate([
    {$match: {
        firstTrackingId: { $ne: "" },
        secondTrackingId: { $exists: true }
    }},
    {$lookup: {
        from: "Test",
        localField: "firstTrackingId",
        foreignField: "firstTrackingId",
        as: "matched"
    }},
    {$unwind: "$matched"},
    {$replaceRoot: {newRoot: "$matched"}},
    {$match: {secondTrackingId: {"$exists": false}}},
    {$addFields: {status: "VOID"}},
    {$merge: {into: "Test"}}
])
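If you would rather keep the original three-step script, the missing piece is that aggregate() returns a cursor of documents, not a plain array of values, so $in never sees the tracking IDs. A minimal shell sketch of that approach (assuming the same Test collection and field names; toArray() and map() convert the cursor to a flat array):
var ids = db.getCollection('Test').aggregate([
    { $match: { firstTrackingId: { $ne: "" }, secondTrackingId: { $exists: true } } },
    { $group: { _id: "$firstTrackingId" } }
]).toArray().map(function (doc) { return doc._id; }); // flat array of tracking IDs
var secondIds = db.getCollection('Test').aggregate([
    { $match: { firstTrackingId: { $in: ids }, secondTrackingId: { $exists: false } } }
]).toArray().map(function (doc) { return doc._id; }); // flat array of ObjectIds
db.getCollection('Test').updateMany(
    { _id: { $in: secondIds } },
    { $set: { status: "VOID" } }
);
Note the second $group from the question is dropped: _id is already unique, so grouping on it adds nothing.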

Related

Looking for proper way to prioritize certain documents in Mongodb query

I was looking all over the place and couldn't find a proper source for the problem I need to solve.
Given record data, I need to prioritize some documents over others when I query them all.
For example, let's say I'm doing this search:
db.users.find().limit(10)
and my documents have data with id = 1, 2, 3, ..., 50.
How can I prioritize the query so that id=12 or id=49 come first?
What I would want to get back:
array({id=12}, {id=49} ... fill the rest until the pager limit)
I tried using $or like this:
{
"$or": [
{'_id': {'$in': [id=12,id=49]}},
{}
]
}
But I don't think this is the proper way of doing this and it's not working
Any help would be greatly appreciated
You can use the aggregate() method:
$addFields to add a new field hasId for sorting purposes: if the field _id is in your input ids, return 1; otherwise remove the field with $$REMOVE
$sort by hasId in descending order
$limit the documents
db.collection.aggregate([
{
$addFields: {
hasId: {
$cond: [
{ $in: ["$_id", [8, 5]] },
1,
"$$REMOVE"
]
}
}
},
{ $sort: { hasId: -1 } },
{ $limit: 5 }
])
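Adapted to the ids and page size from the question (values assumed from the question, same pattern):
db.users.aggregate([
    {
        $addFields: {
            hasId: {
                $cond: [
                    { $in: ["$_id", [12, 49]] }, // the ids to prioritize
                    1,
                    "$$REMOVE" // documents without hasId sort last in descending order
                ]
            }
        }
    },
    { $sort: { hasId: -1 } },
    { $limit: 10 }
])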

How to set one date field to another date field in the same object of the collection - mongodb

I am trying to edit the fields of entries in a collection. I am checking whether the lastUpdated date is less than the published date. If it is, the entry is probably faulty and I need to make the lastUpdated date the same as the published date. I have created the following mongo query for it:
db.runCommand({ aggregate: "collectionNameHere",pipeline: [
{
$project: {
isFaulty: {$lt: ["$lastUpdated","$published"]}
}
},{
$match: {
isFaulty: true
}
},{
$addFields: {
lastUpdated: "$published"
}
}]
})
I am able to get the list of documents which have this fault, but I am not able to update the field. The last $addFields does not seem to be working. There is no error either. Can someone help me with this, or provide a better query for my use case?
Thanks a lot.
You're making a mistake by trying to update with a plain aggregation, which does not write anything back. You have to use an update command to achieve your goal; note that copying one field into another requires the pipeline form of update (MongoDB 4.2+), otherwise "$published" would be stored as a literal string.
Cannot test it right now, but this should do the job:
db.collection.updateMany(
  { $expr: { $lt: ["$lastUpdated", "$published"] } },
  // Note the array: the update is an aggregation pipeline, so "$published"
  // is evaluated as a field reference rather than a literal string (4.2+)
  [ { $set: { lastUpdated: "$published" } } ]
)
Before 4.2 it was not possible to update a document from the value of one of its own fields. You can use a $out aggregation:
db.collection.aggregate([
  { "$match": { "$expr": { "$lt": ["$lastUpdated", "$published"] }}},
  { "$addFields": { "lastUpdated": "$published" }},
  { "$out": "fixedCollection" } // target collection name is a placeholder
])
but $out always creates a new collection as the "output", which is also not a solution here.
So in the end you have to iterate: first find the documents with a find query, then update each one. With async/await this kind of nested asynchronous looping is now quite easy.
const data = await db.collection
    .find({ "$expr": { "$lt": ["$lastUpdated", "$published"] }})
    .project({ published: 1 }) // we need the published value for the update below
    .toArray()
await Promise.all(data.map(async (d) => {
    await db.collection.updateOne({ _id: d._id }, { $set: { lastUpdated: d.published }})
}))
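If many documents are faulty, one round trip per document gets slow. A hedged variant of the same loop that batches everything into a single bulkWrite (same Node.js driver setup assumed):
const ops = data.map(d => ({
    updateOne: {
        filter: { _id: d._id },
        update: { $set: { lastUpdated: d.published } }
    }
}));
if (ops.length) await db.collection.bulkWrite(ops); // one batched round trip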

MongooseJS: Best approach for a derived/calculated value

I am creating a college football betting app for my family.
Here are my schemas:
const GameSchema = new mongoose.Schema({
home: {
type: String,
required: true
},
opponent: {
type: String,
required: true
},
homeScore: Number,
opponentScore: Number,
week:{
type: Number,
required: true
},
winner: String,
userPicks: [
{
user: {
type: mongoose.Schema.Types.ObjectId,
ref: 'User'
},
chosenTeam: String
}
]
});
const UserSchema = new mongoose.Schema({
name: String
});
I need to be able to calculate each user's weekly score (i.e. the number of football games they predict correctly each week) and their cumulative score (i.e. the number of games each user predicts correctly overall).
I am still very new to MongoDB and Mongoose, so I am unsure how to handle this. Since the Game collection will never grow beyond 200 documents, I think both scores should be derived or calculated from the data stored in the database.
Here are the possible solutions that I have thought of so far:
Make both scores virtual attributes; I am not sure how this would work with multiple users, though.
Persist the attributes to the document, but use middleware to re-calculate the scores, when the results for the week's games are saved to the database.
Use a static method to calculate the scores.
Any advice would be appreciated.
You could use the aggregation framework for calculating the aggregates. This is a faster alternative to Map/Reduce for common aggregation operations.
In MongoDB, a pipeline consists of a series of special operators applied to a collection to process data records and return computed results. Aggregation operations group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result. For more details, please consult the documentation.
Consider running the following pipeline to get the desired result:
var pipeline = [
{ "$unwind": "$userPicks" },
{
"$group": {
"_id": {
"week": "$week",
"user": "$userPicks.user"
},
"weeklyScore": {
"$sum": {
"$cond": [
{ "$eq": ["$userPicks.chosenTeam", "$winner"] },
1, 0
]
}
}
}
},
{
"$group": {
"_id": "$_id.user",
"weeklyScores": {
"$push": {
"week": "$_id.week",
"score": "$weeklyScore"
}
},
"totalScores": { "$sum": "$weeklyScore" }
}
}
];
Game.aggregate(pipeline, function(err, results){
User.populate(results, { "path": "_id" }, function(err, results) {
if (err) throw err;
console.log(JSON.stringify(results, undefined, 4));
});
})
In the above pipeline, the first step is the $unwind operator
{ "$unwind": "$userPicks" }
which comes in quite handy when the data is stored as an array. When the unwind operator is applied on a list data field, it will generate a new record for each and every element of the list data field on which unwind is applied. It basically flattens the data.
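For example, a hypothetical game document with two picks (values invented for illustration):
// Before { "$unwind": "$userPicks" }: one document holding an array of two picks
{ "week": 1, "winner": "Navy", "userPicks": [
    { "user": ObjectId("..."), "chosenTeam": "Navy" },
    { "user": ObjectId("..."), "chosenTeam": "Army" }
] }
// After $unwind: two documents, one per array element
{ "week": 1, "winner": "Navy", "userPicks": { "user": ObjectId("..."), "chosenTeam": "Navy" } }
{ "week": 1, "winner": "Navy", "userPicks": { "user": ObjectId("..."), "chosenTeam": "Army" } }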
This is a necessary operation for the next pipeline stage, the $group step where you group the flattened documents by the fields week and the "userPicks.user"
{
"$group": {
"_id": {
"week": "$week",
"user": "$userPicks.user"
},
"weeklyScore": {
"$sum": {
"$cond": [
{ "$eq": ["$userPicks.chosenTeam", "$winner"] },
1, 0
]
}
}
}
}
The $group pipeline operator is similar to the SQL's GROUP BY clause. In SQL, you can't use GROUP BY unless you use any of the aggregation functions. The same way, you have to use an aggregation function in MongoDB as well. You can read more about the aggregation functions here.
In this $group operation, the logic to calculate each user's weekly score (i.e. the number of football games they predict correctly each week) is done through the ternary operator $cond, which takes a logical condition as its first argument (if) and then returns the second argument where the evaluation is true (then) or the third argument where false (else). This turns true/false returns into 1 and 0 respectively, to feed into $sum:
"$cond": [
{ "$eq": ["$userPicks.chosenTeam", "$winner"] },
1, 0
]
So, if within the document being processed the "$userPicks.chosenTeam" field is the same as the "$winner" field, the $cond operator feeds the value 1 to the sum else it sums zero value.
The second group pipeline:
{
"$group": {
"_id": "$user",
"weeklyScores": {
"$push": {
"week": "$_id.week",
"score": "$weeklyScore"
}
},
"totalScores": { "$sum": "$weeklyScore" }
}
}
takes the documents from the previous pipeline and groups them further by the user field and calculates another aggregate i.e. the total score, using the $sum accumulator operator. Within the same pipeline, you can aggregate a list of the weekly scores by using the $push operator which returns an array of expression values for each group.
One thing to note here is when executing a pipeline, MongoDB pipes operators into each other. "Pipe" here takes the Linux meaning: the output of an operator becomes the input of the following operator. The result of each operator is a new collection of documents. So Mongo executes the above pipeline as follows:
collection | $unwind | $group | $group => result
Now, when you run the aggregation pipeline in Mongoose, the results will have an _id key which is the user id and you need to populate the results on this field i.e. Mongoose will perform a "join" on the users collection and return the documents with the user schema in the results.
As a side note, to help with understanding the pipeline or to debug it should you get unexpected results, run the aggregation with just the first pipeline operator. For example, run the aggregation in mongo shell as:
db.games.aggregate([
{ "$unwind": "$userPicks" }
])
Check the result to see if the userPicks array is deconstructed properly. If that gives the expected result, add the next:
db.games.aggregate([
{ "$unwind": "$userPicks" },
{
"$group": {
"_id": {
"week": "$week",
"user": "$userPicks.user"
},
"weeklyScore": {
"$sum": {
"$cond": [
{ "$eq": ["$userPicks.chosenTeam", "$winner"] },
1, 0
]
}
}
}
}
])
Repeat the steps till you get to the final pipeline step.

Get single array from mongoDB collection where the status is current

I want to find the accepted body parts which have the status "current".
I tried this:
db.patients.find({
    "injury.injurydata.injuryinformation.dateofinjury": {
        "$gte": ISODate("2014-05-21T08:00:00Z"),
        "$lt": ISODate("2014-06-03T08:00:00Z")
    }
},
{
    "injury.injurydata.acceptedbodyparts": 1,
    "injury.injurydata.injuryinformation.dateofinjury": 1,
    "injury": {
        $elemMatch: {
            "injury.injurydata.acceptedbodyparts.status": "current"
        }
    }
})
but I still get both arrays back.
If acceptedbodyparts is an array, you can't query acceptedbodyparts.status. If status is a field on the documents contained in the array, you would need to use another $elemMatch clause in your query. So the last part would look something like this:
{"injury":{ "$elemMatch": { "injurydata.acceptedbodyparts": {"$elemMatch": {"status":"current"} }} }}
I also removed the injury. prefix in the first $elemMatch because you're querying data within the injury array.
Note that this will return the entire document with the full array, as long as it contains the document you're searching for. If your intention is to retrieve a particular element in an array, $elemMatch is the wrong approach.
Standard projection will not work with nested arrays or limiting any fields inside arrays. For that you need the aggregation framework:
db.patients.aggregate([
// First match, Matches documents
{ "$match": {
"injury.injurydata.injuryinformation.dateofinjury": {
"$gte": ISODate("2014-05-21T08:00:00Z"),
"$lt": ISODate("2014-06-03T08:00:00Z")
}
}},
// Un-wind the arrays
{ "$unwind": "$injury" },
{ "$unwind": "$injury.injurydata" },
{ "$unwind": "$injury.injurydata.acceptedbodyparts" },
// Now match the required data in the array
{ "$match": {
"injury.injurydata.acceptedbodyparts.status": "current"
}},
// Group only wanted fields
{ "$group": {
"_id": "$_id",
"acceptedbodyparts": {
"$push": "injury.injurydata.acceptedbodyparts"
}
}}
])
You can add other fields from outside of the array either by using $first or by making them part of the _id in the grouping.
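For instance, a hedged tweak to the $group stage above that also carries the injury date through with $first (field path assumed from the question):
{ "$group": {
    "_id": "$_id",
    "dateofinjury": { "$first": "$injury.injurydata.injuryinformation.dateofinjury" },
    "acceptedbodyparts": {
        "$push": "$injury.injurydata.acceptedbodyparts"
    }
}}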
This is simply outside the scope of standard projection; the aggregation framework, with its extended manipulation capabilities, solves it.

Fastest way to remove duplicate documents in mongodb

I have approximately 1.7M documents in mongodb (10M+ in the future). Some of them are duplicate entries, which I do not want. The structure of a document is something like this:
{
_id: 14124412,
nodes: [
12345,
54321
],
name: "Some beauty"
}
A document is a duplicate if it has at least one node in common with another document with the same name. What is the fastest way to remove the duplicates?
dropDups: true option is not available in 3.0.
I have a solution using the aggregation framework: collect the duplicates and then remove them in one go.
It might be somewhat slower than system-level "index" changes, but it is good considering the way you want to remove the duplicate documents.
a. Remove all documents in one go
var duplicates = [];
db.collectionName.aggregate([
{ $match: {
name: { "$ne": '' } // discard selection criteria
}},
{ $group: {
_id: { name: "$name"}, // can be grouped on multiple properties
dups: { "$addToSet": "$_id" },
count: { "$sum": 1 }
}},
{ $match: {
count: { "$gt": 1 } // Duplicates considered as count greater than one
}}
],
{allowDiskUse: true} // For faster processing if set is larger
) // You can display result until this and check duplicates
.forEach(function(doc) {
doc.dups.shift(); // First element skipped for deleting
doc.dups.forEach( function(dupId){
duplicates.push(dupId); // Getting all duplicate ids
}
)
})
// If you want to Check all "_id" which you are deleting else print statement not needed
printjson(duplicates);
// Remove all duplicates in one go
db.collectionName.remove({_id:{$in:duplicates}})
b. You can delete documents one by one.
db.collectionName.aggregate([
// discard selection criteria, You can remove "$match" section if you want
{ $match: {
"source_references.key": { "$ne": '' }
}},
{ $group: {
_id: { key: "$source_references.key" }, // can be grouped on multiple properties
dups: { "$addToSet": "$_id" },
count: { "$sum": 1 }
}},
{ $match: {
count: { "$gt": 1 } // Duplicates considered as count greater than one
}}
],
{allowDiskUse: true} // For faster processing if set is larger
) // You can display result until this and check duplicates
.forEach(function(doc) {
doc.dups.shift(); // First element skipped for deleting
db.collectionName.remove({_id : {$in: doc.dups }}); // Delete remaining duplicates
})
Assuming you want to permanently delete docs that contain a duplicate name + nodes entry from the collection, you can add a unique index with the dropDups: true option:
db.test.ensureIndex({name: 1, nodes: 1}, {unique: true, dropDups: true})
As the docs say, use extreme caution with this as it will delete data from your database. Back up your database first in case it doesn't do exactly as you're expecting.
UPDATE
This solution is only valid through MongoDB 2.x as the dropDups option is no longer available in 3.0 (docs).
Create collection dump with mongodump
Clear collection
Add unique index
Restore collection with mongorestore
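A rough command-line sketch of those four steps (database, collection, and path names are placeholders):
mongodump --db mydb --collection test --out /backup
mongo mydb --eval 'db.test.drop()'
mongo mydb --eval 'db.test.createIndex({name: 1, nodes: 1}, {unique: true})'
mongorestore --db mydb --collection test /backup/mydb/test.bson
Duplicate documents then simply fail to insert during the restore, leaving one copy of each.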
I found this solution that works with MongoDB 3.4:
I'll assume the field with duplicates is called fieldX
db.collection.aggregate([
{
// only match documents that have this field
// you can omit this stage if you don't have missing fieldX
$match: {"fieldX": {$nin:[null]}}
},
{
$group: { "_id": "$fieldX", "doc" : {"$first": "$$ROOT"}}
},
{
$replaceRoot: { "newRoot": "$doc"}
}
],
{allowDiskUse:true})
Being new to mongoDB, I spent a lot of time and used other lengthy solutions to find and delete duplicates. However, I think this solution is neat and easy to understand.
It works by first matching documents that contain fieldX (I had some documents without this field, and I got one extra empty result).
The next stage groups documents by fieldX, and only inserts the $first document in each group using $$ROOT. Finally, it replaces the whole aggregated group by the document found using $first and $$ROOT.
I had to add allowDiskUse because my collection is large.
You can add this after any number of pipeline stages. Although the documentation for $first mentions a sort stage prior to using $first, it worked for me without one.
You can save the results to a new collection by adding an $out stage...
Alternatively, if one is only interested in a few fields, e.g. field1 and field2, rather than the whole document, you can take them in the group stage and skip replaceRoot:
db.collection.aggregate([
{
// only match documents that have this field
$match: {"fieldX": {$nin:[null]}}
},
{
$group: { "_id": "$fieldX", "field1": {"$first": "$$ROOT.field1"}, "field2": { "$first": "$field2" }}
}
],
{allowDiskUse:true})
The following Mongo aggregation pipeline does the deduplication and outputs it back to the same or different collection.
collection.aggregate([
{ $group: {
_id: '$field_to_dedup',
doc: { $first: '$$ROOT' }
} },
{ $replaceRoot: {
newRoot: '$doc'
} },
{ $out: 'collection' }
], { allowDiskUse: true })
My DB had millions of duplicate records. #somnath's answer did not work as is, so here is the solution that worked for me, for anyone looking to delete millions of duplicate records.
/** Create a array to store all duplicate records ids*/
var duplicates = [];
/** Start Aggregation pipeline*/
db.collection.aggregate([
{
$match: { /** Add any filter here. Add index for filter keys*/
filterKey: {
$exists: false
}
}
},
{
$sort: { /** Sort it in such a way that you want to retain first element*/
createdAt: -1
}
},
{
$group: {
_id: {
key1: "$key1", key2:"$key2" /** These are the keys which define the duplicate. Here document with same value for key1 and key2 will be considered duplicate*/
},
dups: {
$push: {
_id: "$_id"
}
},
count: {
$sum: 1
}
}
},
{
$match: {
count: {
"$gt": 1
}
}
}
],
{
allowDiskUse: true
}).forEach(function(doc){
doc.dups.shift();
doc.dups.forEach(function(dupId){
duplicates.push(dupId._id);
})
})
/** Delete the duplicates*/
var i,j,temparray,chunk = 100000;
for (i=0,j=duplicates.length; i<j; i+=chunk) {
temparray = duplicates.slice(i,i+chunk);
db.collection.bulkWrite([{deleteMany:{"filter":{"_id":{"$in":temparray}}}}])
}
Here is a slightly more 'manual' way of doing it:
Essentially, first get a list of all the unique keys you are interested in.
Then perform a search using each of those keys and delete if that search returns more than one document.
db.collection.distinct("key").forEach((num) => {
    var i = 0;
    db.collection.find({key: num}).forEach((doc) => {
        if (i) db.collection.remove({_id: doc._id}); // keep the first doc, delete the rest by _id
        i++;
    });
});
Tips to speed things up, when only a small portion of your documents are duplicated:
You need an index on the field used to detect duplicates.
$group does not use the index, but it can take advantage of a preceding $sort, and $sort does use the index. So put a $sort step at the beginning.
Do an in-place delete_many() instead of $out to a new collection; this saves a lot of IO time and disk space.
if you use pymongo you can do:
import itertools
import pymongo
from pymongo import IndexModel

index_uuid = IndexModel(
    [
        ('uuid', pymongo.ASCENDING)
    ],
)
col.create_indexes([index_uuid])
pipeline = [
{"$sort": {"uuid":1}},
{
"$group": {
"_id": "$uuid",
"dups": {"$addToSet": "$_id"},
"count": {"$sum": 1}
}
},
{
"$match": {"count": {"$gt": 1}}
},
]
it_cursor = col.aggregate(
pipeline, allowDiskUse=True
)
# skip 1st dup of each dups group
dups = list(itertools.chain.from_iterable(map(lambda x: x["dups"][1:], it_cursor)))
col.delete_many({"_id":{"$in": dups}})
performance
I tested it on a database containing 30M documents, 1TB large.
Without index/sort it takes more than an hour just to get the cursor (I did not even have the patience to wait for it).
With index/sort but using $out to output to a new collection: this is safer if your filesystem does not support snapshots, but it requires lots of disk space and takes more than 40 minutes to finish even though we are using SSDs. It will be much slower on HDD RAID.
With index/sort and in-place delete_many, it takes around 5 minutes in total.
The following method merges documents with the same name while only keeping the unique nodes without duplicating them.
I found using the $out operator to be a simple way. I unwind the array and then group it by adding to a set. The $out operator allows the aggregation result to persist.
If you put the name of the collection itself it will replace the collection with the new data. If the name does not exist it will create a new collection.
Hope this helps.
allowDiskUse may have to be added to the pipeline.
db.collectionName.aggregate([
    {
        $unwind: { path: "$nodes" }
    },
    {
        $group: {
            _id: "$name",
            nodes: {
                $addToSet: "$nodes"
            }
        }
    },
    {
        $project: {
            _id: 0,
            name: "$_id",
            nodes: 1
        }
    },
    {
        $out: "collectionNameWithoutDuplicates"
    }
])
Using pymongo this should work.
Add the fields that need to be unique for the collection in unique_field:
unique_field = {"field1": "$field1", "field2": "$field2"}
cursor = DB.COL.aggregate(
    [
        {"$group": {"_id": unique_field, "dups": {"$push": "$uuid"}, "count": {"$sum": 1}}},
        {"$match": {"count": {"$gt": 1}}},
        {"$group": {"_id": None, "dups": {"$addToSet": {"$arrayElemAt": ["$dups", 1]}}}}
    ],
    allowDiskUse=True,
)
Slice the dups array depending on the duplication count (here I had exactly one extra duplicate for each):
items = list(cursor)
removeIds = items[0]['dups']
DB.COL.delete_many({"uuid": {"$in": removeIds}})
I don't know whether this is going to answer the main question, but it will be useful for others.
1. Query the duplicate row using the findOne() method and store it as an object.
const User = db.User.findOne({_id: "duplicateid"});
2. Execute the deleteMany() method to remove all the rows with the id "duplicateid".
db.User.deleteMany({_id: "duplicateid"});
3. Insert the values stored in the User object.
db.User.insertOne(User);
Easy and fast!!!!
First, you can find all the duplicates and then remove them from the DB. Here we use the id field to check for and remove duplicates.
db.collection.aggregate([
    { "$group": { "_id": "$id", "count": { "$sum": 1 } } },
    { "$match": { "_id": { "$ne": null }, "count": { "$gt": 1 } } },
    { "$sort": { "count": -1 } },
    { "$project": { "name": "$_id", "_id": 0 } }
]).toArray().then(data => {
    var dr = data.map(d => d.name);
    console.log("duplicate Records:: ", dr);
    db.collection.deleteMany({ id: { $in: dr } }).then(removedD => {
        console.log("Removed duplicate Data:: ", removedD);
    });
});
The general idea is to use findOne (https://docs.mongodb.com/manual/reference/method/db.collection.findOne/) to retrieve one random id from the duplicate records in the collection, then delete all the records in the collection other than the one with the id we retrieved from the findOne call.
You can do something like this if you are trying to do it in pymongo.
def _run_query():
    try:
        for record in aggregate_based_on_field(collection):
            if not record:
                continue
            _logger.info("Working on Record %s", record)
            try:
                retain = db.collection.find_one({'field1': 'x', 'field2': 'y'}, {'_id': 1})
                _logger.info("_id to retain from duplicates %s", retain['_id'])
                # delete every matching document except the one we just retained
                db.collection.remove({'field1': 'x', 'field2': 'y', '_id': {'$ne': retain['_id']}})
            except Exception as ex:
                _logger.error("Error when retaining the record %s Exception: %s", record, str(ex))
    except Exception as e:
        _logger.error("Mongo error when deleting duplicates %s", str(e))
def aggregate_based_on_field(collection):
    return collection.aggregate([{'$group': {'_id': "$fieldX"}}])
From the shell:
Replace find_one with findOne.
The same remove command should work.