Here is a sample of my MongoDB document (you can use jsonformatter.com to analyse it):
{"_id":"6278686","playerName":"Rohit Lal","tournamentId":"197831","score":[{"_id":"1611380","runsScored":0,"ballFaced":0,"fours":0,"sixes":0,"strikeRate":0,"oversBowled":0,"runsConceded":0,"economyRate":0,"wickets":0,"maiden":0,"howToOut":"-","catches":["Mohit Mishra"],"stumping":[],"runout":[],"participatedRunout":[]},{"_id":"1602732","runsScored":0,"ballFaced":0,"fours":0,"sixes":0,"strikeRate":0,"oversBowled":0,"runsConceded":0,"economyRate":0,"wickets":0,"maiden":0,"howToOut":"-","catches":[],"stumping":[],"runout":[],"participatedRunout":[]},{"_id":"1536514","runsScored":1,"ballFaced":3,"fours":0,"sixes":0,"strikeRate":33.33,"oversBowled":0,"runsConceded":0,"economyRate":0,"wickets":0,"maiden":0,"howToOut":"run out Sameer Baveja","catches":[],"stumping":[],"runout":[],"participatedRunout":[]},{"_id":"1536474","runsScored":2,"ballFaced":7,"fours":0,"sixes":0,"strikeRate":28.57,"oversBowled":0,"runsConceded":0,"economyRate":0,"wickets":0,"maiden":0,"howToOut":"c Rajesh b Prasad Naik","catches":[],"stumping":[],"runout":[],"participatedRunout":[]},{"_id":"1536467","runsScored":0,"ballFaced":0,"fours":0,"sixes":0,"strikeRate":0,"oversBowled":0,"runsConceded":0,"economyRate":0,"wickets":0,"maiden":0,"howToOut":"-","catches":[],"stumping":[],"runout":[],"participatedRunout":[]},{"_id":"1500825","runsScored":0,"ballFaced":0,"fours":0,"sixes":0,"strikeRate":0,"oversBowled":0,"runsConceded":0,"economyRate":0,"wickets":0,"maiden":0,"howToOut":"-","catches":[],"stumping":[],"runout":[],"participatedRunout":[]},{"_id":"1461428","runsScored":18,"ballFaced":6,"fours":1,"sixes":2,"strikeRate":300,"oversBowled":0,"runsConceded":0,"economyRate":0,"wickets":0,"maiden":0,"howToOut":"not out","catches":[],"stumping":[],"runout":[],"participatedRunout":[]},{"_id":"1461408","runsScored":0,"ballFaced":1,"fours":0,"sixes":0,"strikeRate":0,"oversBowled":0,"runsConceded":0,"economyRate":0,"wickets":0,"maiden":0,"howToOut":"c Sudhir b Vinay Kasat *vk*","catches":[],"stumping":[],"runout":[],"participatedRunout":[]},{"_id":"1451175","runsScored":0,"ballFaced":0,"fours":0,"sixes":0,"strikeRate":0,"oversBowled":0,"runsConceded":0,"economyRate":0,"wickets":0,"maiden":0,"howToOut":"-","catches":[],"stumping":[],"runout":[],"participatedRunout":[]},{"_id":"1451146","runsScored":0,"ballFaced":0,"fours":0,"sixes":0,"strikeRate":0,"oversBowled":0,"runsConceded":0,"economyRate":0,"wickets":0,"maiden":0,"howToOut":"-","catches":[],"stumping":[],"runout":[],"participatedRunout":[]},{"_id":"1392796","runsScored":0,"ballFaced":1,"fours":0,"sixes":0,"strikeRate":0,"oversBowled":0,"runsConceded":0,"economyRate":0,"wickets":0,"maiden":0,"howToOut":"c †Vinay Kedia b Lalit","catches":[],"stumping":[],"runout":[],"participatedRunout":[]}],"__v":0}
I want to sum the lengths of the catches array field across all objects inside the score array. I know I can achieve this with the aggregation framework, but I am a beginner in MongoDB and do not know many aggregation operators. Here is the aggregation pipeline I have tried, but it returns the number of times this field exists, not the sum of the lengths of the arrays:
[
  {
    $project: {
      "totalCatches": {
        $size: "$score.catches"
      }
    }
  }
]
$unwind - Deconstruct the score array field into multiple documents.
$group - Group by null (i.e. all documents), then $sum the $size of score.catches.
db.collection.aggregate([
  {
    $unwind: "$score"
  },
  {
    $group: {
      _id: null,
      "totalCatches": {
        $sum: {
          $size: "$score.catches"
        }
      }
    }
  }
])
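Against the sample document above, this should produce a single result (only the score entry with _id 1611380 has a non-empty catches array), something like:
{ "_id" : null, "totalCatches" : 1 }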
Sample Mongo Playground
Note: If you want the result to be based on each document (not combine all documents), then you need to change the $group's _id as:
{
  $group: {
    _id: "$_id",
    ...
  }
}
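If you would rather avoid $unwind entirely, a per-document total can also be computed in a single $project stage with $map and the expression form of $sum (a sketch; the expression form of $sum requires MongoDB 3.2+):
db.collection.aggregate([
  {
    $project: {
      playerName: 1,
      totalCatches: {
        // map each score entry to the size of its catches array,
        // then add those sizes up with the expression form of $sum
        $sum: {
          $map: {
            input: "$score",
            as: "s",
            in: { $size: "$$s.catches" }
          }
        }
      }
    }
  }
])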
I need a MongoDB query that returns a list or map where each unique value of a field (f) in the collection is the key, and the count of documents having that same value in field (f) is the mapped value. How can I achieve this?
Example:
Document1: {"id":"1","name":"n1","city":"c1"}
Document2: {"id":"2","name":"n2","city":"c2"}
Document3: {"id":"3","name":"n1","city":"c3"}
Document4: {"id":"4","name":"n1","city":"c5"}
Document5: {"id":"5","name":"n2","city":"c2"}
Document6: {"id":"6,""name":"n1","city":"c8"}
Document7: {"id":"7","name":"n3","city":"c9"}
Document8: {"id":"8","name":"n2","city":"c6"}
Query result should be something like this if group by field is "name":
{"n1":"4",
"n2":"3",
"n3":"1"}
It would be nice if the list is also sorted in the descending order.
It's worth noting that using data points as field names (keys) is generally considered an anti-pattern and makes tooling difficult. Nonetheless, if you insist on having data points as field names, you can use this somewhat involved aggregation to produce the output you desire...
Aggregation
db.collection.aggregate([
  {
    $group: { _id: "$name", "count": { "$sum": 1 } }
  },
  {
    $sort: { "count": -1 }
  },
  {
    $group: { _id: null, "values": { "$push": { "name": "$_id", "count": "$count" } } }
  },
  {
    $project: {
      _id: 0,
      results: {
        $arrayToObject: {
          $map: {
            input: "$values",
            as: "pair",
            in: ["$$pair.name", "$$pair.count"]
          }
        }
      }
    }
  },
  {
    $replaceRoot: { newRoot: "$results" }
  }
])
Aggregation Explanation
This is a 5 stage aggregation consisting of the following...
$group - count the documents per name.
$sort - sort the results by count, descending.
$group - collect the results into an array for the next stage.
$project - use $arrayToObject and $map to pivot the data so that a data point becomes a field name.
$replaceRoot - promote the results to the top-level fields.
Sample Results
{ "n1" : 4, "n2" : 3, "n3" : 1 }
For whatever reason, you show desired results having count as a string, but my results show the count as an integer. I assume that is not an issue, and may actually be preferred.
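If you can live with an array of name/count pairs instead of data points as keys, a much simpler pipeline avoids the anti-pattern altogether (a sketch against the same sample data):
db.collection.aggregate([
  { $group: { _id: "$name", count: { $sum: 1 } } },
  { $sort: { count: -1 } },
  { $project: { _id: 0, name: "$_id", count: 1 } }
])
// => [ { "name": "n1", "count": 4 }, { "name": "n2", "count": 3 }, { "name": "n3", "count": 1 } ]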
I have a collection in MongoDB with a field displayInCategories. The collection contains thousands of documents across different displayInCategories values.
Is it possible to limit the records to <=5 for each of the available displayInCategories?
I don't want to limit the whole result set; I need to limit the records per displayInCategories.
This might get you going:
db.collection.aggregate([{
$group: {
_id: "$displayInCategories", // group by displayInCategories
"docs": { $push: "$$ROOT" } // remember all documents for this category
}
}, {
$project: {
"docs": { $slice: [ "$docs", 5 ] } // limit the items in each "docs" array to 5
}
}])
You might want to apply a $sort stage at the start to make sure you don't get random documents but rather the "top 5" based on some criteria.
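For example, if the documents had some ranking field you wanted the "top 5" by, say a hypothetical score field, the $sort would go first (a sketch):
db.collection.aggregate([
  { $sort: { score: -1 } },  // 'score' is a hypothetical ranking field
  { $group: { _id: "$displayInCategories", "docs": { $push: "$$ROOT" } } },
  { $project: { "docs": { $slice: [ "$docs", 5 ] } } }
])
Because documents reach $group in sorted order, each "docs" array is built in that order, so $slice keeps the top 5 per category.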
In the MongoDB aggregation pipeline, do records flow from stage to stage one at a time (or in batches), or does each stage wait until it has processed the whole collection before passing results to the next stage?
For example, I have a collection classtest with the following sample records:
{name: "Person1", marks: 20}
{name: "Person2", marks: 20}
{name: "Person1", marks: 20}
I have about 1000 records in total for about 100 students, and I have the following aggregate query:
db.classtest.aggregate([
    {$sort: {name: 1}},
    {$group: {_id: '$name', total: {$sum: '$marks'}}},
    {$limit: 5}
])
I have following questions.
The sort order is lost in the final results. If I place another sort after $group, then the results are sorted properly. Does that mean $group does not maintain the previous sort order?
I would like to limit the results to 5. Does the group operation have to complete (for all 1000 records) before passing anything to $limit, or does it pass records to the limit stage as soon as it has them and stop processing once the limit stage's requirement is met?
My actual idea is to do pagination on the results of the aggregate. In the above scenario, if $group maintains sort order and processes only the required number of records, I want to apply a $match condition like {$gt: 'lastPersonName'} in subsequent page queries.
I do not want to apply $limit before $group, as I want results for 5 students, not the first 5 records.
I do not want to use $skip either, as that effectively means traversing that many records.
I have solved the problem without maintaining another collection and even without $group traversing the whole collection, hence posting my own answer.
As others have pointed out:
$group doesn't retain order, hence early sorting is not of much help.
$group doesn't do any optimization even if there is a following $limit, i.e., it runs $group on the entire collection.
My use case has the following characteristics, which helped me solve it:
There will be a maximum of 10 records per student (and a minimum of 1).
I am not very particular about page size; the front end is capable of handling varying page sizes.
The following is the aggregation command I have used.
db.classtest.aggregate([
    {$sort: {name: 1}},
    {$limit: 5 * 10},
    {$group: {_id: '$name', total: {$sum: '$marks'}}},
    {$sort: {_id: 1}}
])
Explaining the above:
If $sort immediately precedes $limit, the framework optimizes the amount of data to be sent to the next stage (see the aggregation pipeline optimization page in the MongoDB docs).
To get a minimum of 5 records (the page size), I need to pass at least 5 (page size) * 10 (max records per student) = 50 records to the $group stage. With this, the size of the final result may be anywhere between 0 and 50.
If the result has fewer than 5 entries, then no further pagination is required.
If the result size is greater than 5, there is a chance that the last student's records were not completely processed (i.e., not all of that student's records were grouped), hence I discard the last record from the result.
The name in the last record (among the retained results) is then used as the $match criteria in the subsequent page request, as shown below.
db.classtest.aggregate([
    {$match: {name: {$gt: lastRecordName}}},
    {$sort: {name: 1}},
    {$limit: 5 * 10},
    {$group: {_id: '$name', total: {$sum: '$marks'}}},
    {$sort: {_id: 1}}
])
In the above, the framework will still optimize $match, $sort and $limit together as a single operation, which I have confirmed through the explain plan.
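If you want to double-check the coalescing yourself, the shell's explain helper can be pointed at the same pipeline (a sketch; lastRecordName is the boundary value from the previous page, as above):
db.classtest.explain("executionStats").aggregate([
  { $match: { name: { $gt: lastRecordName } } },
  { $sort: { name: 1 } },
  { $limit: 5 * 10 },
  { $group: { _id: '$name', total: { $sum: '$marks' } } },
  { $sort: { _id: 1 } }
])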
The first thing to consider here is that the aggregation framework works with a "pipeline" of stages to be applied in order to get a result. If you are familiar with processing things on the "command line" or "shell" of your operating system, then you might have some experience with the "pipe" or | operator.
Here is a common unix idiom:
ps -ef | grep mongod | tee "out.txt"
In this case the output of the first command here, ps -ef, is being "piped" to the next command, grep mongod, which in turn has its output "piped" to tee out.txt, which both outputs to the terminal as well as the specified file name. This is a "pipeline" where each stage "feeds" the next, in the "order" of the sequence they are written in.
The same is true of the aggregation pipeline. A "pipeline" here is in fact an "array", which is an ordered set of instructions to be passed in processing the data to a result.
db.classtest.aggregate([
{ "$group": {
"_id": "$name",
"total": { "$sum": "$marks"}
}},
{ "$sort": { "name": 1 } },
{ "$limit": 5 }
])
So what happens here is that all of the items in the collection are first processed by $group to get their totals. There is no specified "order" to grouping so there is not much sense in pre-ordering the data. Neither is there any point in doing so because you are yet to get to your later stages.
Then you would $sort the results and also $limit as required.
For your next "page" of data you will want ideally $match on the last unique name found, like so:
db.classtest.aggregate([
{ "$match": { "name": { "$gt": lastNameFound } }},
{ "$group": {
"_id": "$name",
"total": { "$sum": "$marks"}
}},
{ "$sort": { "name": 1 } },
{ "$limit": 5 }
])
It's not the best solution, but there really are no alternatives for this type of grouping. It will however notably get "faster" with each iteration towards the end. Alternatively, storing all the unique names (or reading them out of another collection) and "paging" through that list with a "range query" on each aggregation statement may be a viable option, if your data permits it.
Something like:
db.classtest.aggregate([
{ "$match": { "name": { "$gte": "Allan", "$lte": "David" } }},
{ "$group": {
"_id": "$name",
"total": { "$sum": "$marks"}
}},
{ "$sort": { "name": 1 } },
])
Unfortunately there is no "limit grouping up until x results" option, so unless you can work with another list, you are basically grouping up everything (and possibly a gradually smaller set each time) with each aggregation query you send.
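For completeness, one way of building and paging through such a list of unique names from the shell might look like this (a sketch; pageSize is an assumed parameter):
// build the list of unique names once (or read it from another collection)
var names = db.classtest.distinct("name").sort();
var pageSize = 5;

// each "page" is then a range (here: a slice) of names from that list
db.classtest.aggregate([
  { "$match": { "name": { "$in": names.slice(0, pageSize) } } },
  { "$group": { "_id": "$name", "total": { "$sum": "$marks" } } },
  { "$sort": { "_id": 1 } }
])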
"$group does not order its output documents." See http://docs.mongodb.org/manual/reference/operator/aggregation/group/
$limit limits the number of processed elements of an immediately preceding $sort operation, not only the number of elements passed to the next stage. See the note at http://docs.mongodb.org/manual/reference/operator/aggregation/limit/
For the very first question you asked, I am not sure, but it appears (see the previous point) that a stage n+1 can influence the behaviour of stage n: the $limit will restrict the $sort operation to its first n elements, and the sort will not run to completion the way it would if the following $limit stage did not exist.
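In other words, a pipeline such as the following only has to track the top 5 documents in memory while sorting, rather than sorting all 1000 first (a minimal sketch of the coalescing behaviour described above):
db.classtest.aggregate([
  { $sort: { marks: -1 } },
  { $limit: 5 }   // coalesced with the preceding $sort: only the top 5 are kept
])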
Pagination on grouped data in MongoDB -
You can't directly apply pagination to $group output, but the following trick can be used.
If you want pagination on grouped data -
for example, I want to group products category-wise and then keep only 5 products per category:
Step 1 - write an aggregation on the products collection and group by category:
{ $group: { _id: '$prdCategoryId', products: { $push: '$$ROOT' } } },
Step 2 - use prdSkip for skipping and prdLimit for limiting the data; pass them in dynamically:
{
$project: {
// pagination for products
products: {
$slice: ['$products', prdSkip, prdLimit],
}
}
},
Finally, the query looks like the following.
Params: limit and skip for category pagination,
and prdSkip and prdLimit for product pagination.
db.products.aggregate([
{ $group: { _id: '$prdCategoryId', products: { $push: '$$ROOT' } } },
{
$lookup: {
from: 'categories',
localField: '_id',
foreignField: '_id',
as: 'categoryProducts',
},
},
{
$replaceRoot: {
newRoot: {
$mergeObjects: [{ $arrayElemAt: ['$categoryProducts', 0] }, '$$ROOT'],
},
},
},
{
$project: {
// pagination for products
products: {
$slice: ['$products', prdSkip, prdLimit],
},
_id: 1,
catName: 1,
catDescription: 1,
},
},
])
.limit(limit) // pagination for category
.skip(skip);
I used $replaceRoot here to pull out the category fields.
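If you are using the plain mongo shell rather than a driver that exposes .skip()/.limit() helpers on aggregate(), the category pagination can be expressed as pipeline stages instead (a sketch; skip, limit, prdSkip and prdLimit are the same parameters as above, and the $lookup/$replaceRoot stages are omitted for brevity):
db.products.aggregate([
  { $group: { _id: '$prdCategoryId', products: { $push: '$$ROOT' } } },
  { $sort: { _id: 1 } },   // a stable order so category pages don't shuffle
  { $skip: skip },         // pagination for category
  { $limit: limit },
  { $project: { products: { $slice: ['$products', prdSkip, prdLimit] } } }
])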
I have approximately 1.7M documents in MongoDB (10M+ in the future). Some of them are duplicate entries which I do not want. The structure of a document is something like this:
{
_id: 14124412,
nodes: [
12345,
54321
],
name: "Some beauty"
}
A document is a duplicate if it has at least one node the same as another document with the same name. What is the fastest way to remove the duplicates?
The dropDups: true option is not available in 3.0.
I have a solution using the aggregation framework for collecting duplicates and then removing them in one go.
It might be somewhat slower than system-level "index" changes, but it is good considering the way you want to remove duplicate documents.
a. Remove all duplicates in one go
var duplicates = [];
db.collectionName.aggregate([
{ $match: {
name: { "$ne": '' } // discard selection criteria
}},
{ $group: {
_id: { name: "$name"}, // can be grouped on multiple properties
dups: { "$addToSet": "$_id" },
count: { "$sum": 1 }
}},
{ $match: {
count: { "$gt": 1 } // Duplicates considered as count greater than one
}}
],
{allowDiskUse: true} // For faster processing if set is larger
) // You can display result until this and check duplicates
.forEach(function(doc) {
doc.dups.shift(); // First element skipped for deleting
doc.dups.forEach( function(dupId){
duplicates.push(dupId); // Getting all duplicate ids
}
)
})
// If you want to check all the "_id"s being deleted; otherwise this print statement is not needed
printjson(duplicates);
// Remove all duplicates in one go
db.collectionName.remove({_id:{$in:duplicates}})
b. You can delete documents one by one.
db.collectionName.aggregate([
// discard selection criteria, You can remove "$match" section if you want
{ $match: {
"source_references.key": { "$ne": '' }
}},
{ $group: {
_id: { key: "$source_references.key" }, // can be grouped on multiple properties
dups: { "$addToSet": "$_id" },
count: { "$sum": 1 }
}},
{ $match: {
count: { "$gt": 1 } // Duplicates considered as count greater than one
}}
],
{allowDiskUse: true} // For faster processing if set is larger
) // You can display result until this and check duplicates
.forEach(function(doc) {
doc.dups.shift(); // First element skipped for deleting
db.collectionName.remove({_id : {$in: doc.dups }}); // Delete remaining duplicates
})
Assuming you want to permanently delete docs that contain a duplicate name + nodes entry from the collection, you can add a unique index with the dropDups: true option:
db.test.ensureIndex({name: 1, nodes: 1}, {unique: true, dropDups: true})
As the docs say, use extreme caution with this as it will delete data from your database. Back up your database first in case it doesn't do exactly as you're expecting.
UPDATE
This solution is only valid through MongoDB 2.x as the dropDups option is no longer available in 3.0 (docs).
Create collection dump with mongodump
Clear collection
Add unique index
Restore collection with mongorestore (a rough sketch of these four steps follows below)
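A rough sketch of those four steps (database, collection and index fields are placeholders; mongorestore should report duplicate-key errors and keep going rather than abort, unless --stopOnError is passed):
// 1. from the OS shell: mongodump --db=mydb --collection=test --out=dump/
// 2. clear the collection and 3. add the unique index, in the mongo shell:
db.test.drop()
db.test.createIndex({ name: 1, nodes: 1 }, { unique: true })
// 4. from the OS shell: mongorestore --db=mydb --collection=test dump/mydb/test.bson
//    documents that violate the unique index are skipped during the restore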
I found this solution that works with MongoDB 3.4:
I'll assume the field with duplicates is called fieldX
db.collection.aggregate([
{
// only match documents that have this field
// you can omit this stage if you don't have missing fieldX
$match: {"fieldX": {$nin:[null]}}
},
{
$group: { "_id": "$fieldX", "doc" : {"$first": "$$ROOT"}}
},
{
$replaceRoot: { "newRoot": "$doc"}
}
],
{allowDiskUse:true})
Being new to mongoDB, I spent a lot of time and used other lengthy solutions to find and delete duplicates. However, I think this solution is neat and easy to understand.
It works by first matching documents that contain fieldX (I had some documents without this field, and I got one extra empty result).
The next stage groups documents by fieldX, and only inserts the $first document in each group using $$ROOT. Finally, it replaces the whole aggregated group by the document found using $first and $$ROOT.
I had to add allowDiskUse because my collection is large.
You can add this after any number of pipeline stages. Although the documentation for $first mentions a sort stage prior to using $first, it worked for me without it.
You can save the results to a new collection by adding an $out stage...
Alternatively, if one is only interested in a few fields, e.g. field1 and field2, rather than the whole document, they can be picked in the group stage without replaceRoot:
db.collection.aggregate([
{
// only match documents that have this field
$match: {"fieldX": {$nin:[null]}}
},
{
$group: { "_id": "$fieldX", "field1": {"$first": "$$ROOT.field1"}, "field2": { "$first": "$field2" }}
}
],
{allowDiskUse:true})
The following Mongo aggregation pipeline does the deduplication and outputs it back to the same or different collection.
collection.aggregate([
{ $group: {
_id: '$field_to_dedup',
doc: { $first: '$$ROOT' }
} },
{ $replaceRoot: {
newRoot: '$doc'
} },
{ $out: 'collection' }
], { allowDiskUse: true })
My DB had millions of duplicate records. #somnath's answer did not work as is, so I am writing the solution that worked for me for people looking to delete millions of duplicate records.
/** Create a array to store all duplicate records ids*/
var duplicates = [];
/** Start Aggregation pipeline*/
db.collection.aggregate([
{
$match: { /** Add any filter here. Add index for filter keys*/
filterKey: {
$exists: false
}
}
},
{
$sort: { /** Sort it in such a way that you want to retain first element*/
createdAt: -1
}
},
{
$group: {
_id: {
key1: "$key1", key2:"$key2" /** These are the keys which define the duplicate. Here document with same value for key1 and key2 will be considered duplicate*/
},
dups: {
$push: {
_id: "$_id"
}
},
count: {
$sum: 1
}
}
},
{
$match: {
count: {
"$gt": 1
}
}
}
],
{
allowDiskUse: true
}).forEach(function(doc){
doc.dups.shift();
doc.dups.forEach(function(dupId){
duplicates.push(dupId._id);
})
})
/** Delete the duplicates*/
var i,j,temparray,chunk = 100000;
for (i=0,j=duplicates.length; i<j; i+=chunk) {
temparray = duplicates.slice(i,i+chunk);
db.collection.bulkWrite([{deleteMany:{"filter":{"_id":{"$in":temparray}}}}])
}
Here is a slightly more 'manual' way of doing it:
Essentially, first get a list of all the unique keys you are interested in.
Then perform a search using each of those keys and delete every document after the first one returned.
db.collection.distinct("key").forEach((num)=>{
    var i = 0;
    db.collection.find({key: num}).forEach((doc)=>{
        // keep the first document for this key, remove every later one by its _id
        if (i) db.collection.remove({_id: doc._id}, { justOne: true })
        i++
    })
});
Tips to speed things up when only a small portion of your documents are duplicated:
You need an index on the field used to detect duplicates.
$group does not use the index, but it can take advantage of a preceding $sort, and $sort does use the index, so you should put a $sort step at the beginning.
Do an in-place delete_many() instead of $out to a new collection; this will save a lot of IO time and disk space.
If you use pymongo you can do:
import itertools

import pymongo
from pymongo import IndexModel

index_uuid = IndexModel(
    [
        ('uuid', pymongo.ASCENDING)
    ],
)
col.create_indexes([index_uuid])
pipeline = [
{"$sort": {"uuid":1}},
{
"$group": {
"_id": "$uuid",
"dups": {"$addToSet": "$_id"},
"count": {"$sum": 1}
}
},
{
"$match": {"count": {"$gt": 1}}
},
]
it_cursor = col.aggregate(
pipeline, allowDiskUse=True
)
# skip 1st dup of each dups group
dups = list(itertools.chain.from_iterable(map(lambda x: x["dups"][1:], it_cursor)))
col.delete_many({"_id":{"$in": dups}})
Performance
I tested it on a database containing 30M documents and 1TB of data.
Without index/sort, it takes more than an hour just to get the cursor (I did not even have the patience to wait for it).
With index/sort but using $out to output to a new collection: this is safer if your filesystem does not support snapshots, but it requires a lot of disk space and takes more than 40 minutes to finish despite the fact that we are using SSDs; it will be much slower on HDD RAID.
With index/sort and an in-place delete_many, it takes around 5 minutes in total.
The following method merges documents with the same name while only keeping the unique nodes without duplicating them.
I found using the $out stage to be a simple way: I unwind the array and then group it with $addToSet. The $out stage allows the aggregation result to be persisted (see the docs).
If you put the name of the collection itself it will replace the collection with the new data. If the name does not exist it will create a new collection.
Hope this helps.
allowDiskUse may have to be added to the pipeline.
db.collectionName.aggregate([
    {
        $unwind: { path: "$nodes" }
    },
    {
        $group: {
            _id: "$name",
            nodes: {
                $addToSet: "$nodes"
            }
        }
    },
    {
        $project: {
            _id: 0,
            name: "$_id",
            nodes: 1
        }
    },
    {
        $out: "collectionNameWithoutDuplicates"
    }
])
Using pymongo this should work.
Add the fields that need to be unique for the collection to unique_field:
unique_field = {"field1": "$field1", "field2": "$field2"}
cursor = DB.COL.aggregate(
    [
        {"$group": {"_id": unique_field, "dups": {"$push": "$uuid"}, "count": {"$sum": 1}}},
        {"$match": {"count": {"$gt": 1}}},
        {"$group": {"_id": None, "dups": {"$addToSet": {"$arrayElemAt": ["$dups", 1]}}}}
    ],
    allowDiskUse=True
)
Slice the dups array depending on the duplication count (here I had only one extra duplicate of each document):
items = list(cursor)
removeIds = items[0]['dups']
DB.COL.delete_many({"uuid": {"$in": removeIds}})
I don't know whether this is going to answer the main question, but it will be useful for others.
1. Query the duplicate row using the findOne() method and store it as an object.
const User = db.User.findOne({_id:"duplicateid"});
2. Execute the deleteMany() method to remove all the rows with the id "duplicateid".
db.User.deleteMany({_id:"duplicateid"});
3. Insert the values stored in the User object.
db.User.insertOne(User);
Easy and fast!!!!
First, you can find all the duplicates and then remove those duplicates from the DB. Here we use the id field to check for and remove duplicates. (This snippet assumes a driver where aggregate() returns a thenable, such as Mongoose; in the plain shell you would iterate the cursor instead.)
db.collection.aggregate([
{ "$group": { "_id": "$id", "count": { "$sum": 1 } } },
{ "$match": { "_id": { "$ne": null }, "count": { "$gt": 1 } } },
{ "$sort": { "count": -1 } },
{ "$project": { "name": "$_id", "_id": 0 } }
]).then(data => {
var dr = data.map(d => d.name);
console.log("duplicate Recods:: ", dr);
db.collection.remove({ id: { $in: dr } }).then(removedD => {
console.log("Removed duplicate Data:: ", removedD);
})
})
The general idea is to use findOne (https://docs.mongodb.com/manual/reference/method/db.collection.findOne/)
to retrieve one random _id from the duplicate records in the collection.
Then delete all the records in the collection other than the random _id that we retrieved from findOne.
You can do something like this if you are trying to do it in pymongo.
def _run_query():
    try:
        for record in aggregate_based_on_field(collection):
            if not record:
                continue
            _logger.info("Working on Record %s", record)
            try:
                # 'field1'/'field2' and their values are placeholders for the keys
                # that identify one group of duplicates
                retain = db.collection.find_one({'field1': 'x', 'field2': 'y'}, {'_id': 1})
                _logger.info("_id to retain from duplicates %s", retain['_id'])
                db.collection.remove({'field1': 'x', 'field2': 'y', '_id': {'$ne': retain['_id']}})
            except Exception as ex:
                _logger.error("Error when retaining the record %s Exception: %s", record, str(ex))
    except Exception as e:
        _logger.error("Mongo error when deleting duplicates %s", str(e))

def aggregate_based_on_field(collection):
    return collection.aggregate([{'$group': {'_id': "$fieldX"}}])
From the shell:
Replace find_one with findOne.
The same remove command should work.
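A rough mongo-shell translation of the pymongo snippet above (the field names and filter values remain placeholders, as in the original):
var retain = db.collection.findOne({ field1: 'x', field2: 'y' }, { _id: 1 });
db.collection.remove({ field1: 'x', field2: 'y', _id: { $ne: retain._id } });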