I have a collection in mongodb, the collection has a field displayInCategories. The collection contains 1000's of data wrt to different displayInCategories.
Is it possible to limit the records to <=5 for all the available displayInCategories.
I didn't want to limit the record on whole result, I need to limit the record as per the displayInCategories
This might get you going:
db.collection.aggregate([{
$group: {
_id: "$displayInCategories", // group by displayInCategories
"docs": { $push: "$$ROOT" } // remember all documents for this category
}
}, {
$project: {
"docs": { $slice: [ "$docs", 5 ] } // limit the items in each "docs" array to 5
}
}])
You might want to apply a $sort stage at the start to make sure you don't get random documents but rather the "top 5" based on some criteria.
I have a simple query that's giving me unexpected results when I query a public mongodb server,
mongodb://steemit:steemit#mongo1.steemdata.com:27017/SteemData
db.getCollection('Posts')
.aggregate([
{ $match: { "created" : { $gte: ISODate("2017-10-01T00:00:00.000Z"), $lt: ISODate("2017-10-30T00:00:00.000Z") } } },
{ $group: { "_id": "author", "itemcount": { "$sum": 1 } } },
{ $sort: { "itemcount": -1 } }
])
I'm trying to get a count of the number of articles by author for the month of October.
I would hope to see a result set something like,
Bob 10
Joe 9
Sam 3
Tim 1
Instead I'm getting,
author 23
Can someone explain what's wrong with my aggregation pipeline?
Thanks
There is a small issue with your group stage; the name of the field in the group stage ('author') should be prefixed with a dollar sign, like in this example in the documentation for $group.
Try updating the group stage to the following:
{ $group: { "_id": "$author", "itemcount": { "$sum": 1 } } },
Seems it's nothing to do with the query...
This one run at 5:32pm (ignore the single quotes, I tried with both)
And then a few mins later at 5:41pm
I have a set with 2m hashtags. However only around 200k are distinct values. I want to know wich hashtags are more repeated on my data.
I used this to find how many times each hashtag is repeated on my dataset:
db.hashtags.aggregate([{ "$group": {"_id": "$hashtag","count": { "$sum": 1 }}}]);
However, I would like to save the values in a distinct collection only with the unique values and its correspondency number of occurency.
How should I do that?
Please, if possible provide me some information in order that I can UNDERSTAND how to do it not only the code.
Thank you.
You can use the $out pipeline operator to write the output of the aggregation to another collection.
db.hashtags.aggregate([
{ "$group": {"_id": "$hashtag", "count": { "$sum": 1 }}},
{ "$out": "newcoll" }
]);
Note that this feature was added in MongoDB 2.6
Using the aggregation framework the following will, for hashtag with multiple records, return the duplicate hashtag and the corresponding record count:
db.hashtags.aggregate([
{
$group: {
_id: "$hashtag",
count: { $sum: 1 }
}
},
{ $match: { count: { $gt: 1 } } },
{ $sort : { count : -1} },
{ $limit : 200 },
{ $out: "duphashtags" }
])
The $sum operator adds up the values of the fields passed to it, in this case the constant 1 - thereby counting the number of grouped records into the count field. The $match filters documents with a count greater than 1, i.e. duplicates. $sort sorts the most frequent duplicates first, and limit the results to the top 200. The $out operator writes the documents returned by the aggregation pipeline to a specified collection, say "duphashtags".
I have approximately 1.7M documents in mongodb (in future 10m+). Some of them represent duplicate entry which I do not want. Structure of document is something like this:
{
_id: 14124412,
nodes: [
12345,
54321
],
name: "Some beauty"
}
Document is duplicate if it has at least one node same as another document with same name. What is the fastest way to remove duplicates?
dropDups: true option is not available in 3.0.
I have solution with aggregation framework for collecting duplicates and then removing in one go.
It might be somewhat slower than system level "index" changes. But it is good by considering way you want to remove duplicate documents.
a. Remove all documents in one go
var duplicates = [];
db.collectionName.aggregate([
{ $match: {
name: { "$ne": '' } // discard selection criteria
}},
{ $group: {
_id: { name: "$name"}, // can be grouped on multiple properties
dups: { "$addToSet": "$_id" },
count: { "$sum": 1 }
}},
{ $match: {
count: { "$gt": 1 } // Duplicates considered as count greater than one
}}
],
{allowDiskUse: true} // For faster processing if set is larger
) // You can display result until this and check duplicates
.forEach(function(doc) {
doc.dups.shift(); // First element skipped for deleting
doc.dups.forEach( function(dupId){
duplicates.push(dupId); // Getting all duplicate ids
}
)
})
// If you want to Check all "_id" which you are deleting else print statement not needed
printjson(duplicates);
// Remove all duplicates in one go
db.collectionName.remove({_id:{$in:duplicates}})
b. You can delete documents one by one.
db.collectionName.aggregate([
// discard selection criteria, You can remove "$match" section if you want
{ $match: {
source_references.key: { "$ne": '' }
}},
{ $group: {
_id: { source_references.key: "$source_references.key"}, // can be grouped on multiple properties
dups: { "$addToSet": "$_id" },
count: { "$sum": 1 }
}},
{ $match: {
count: { "$gt": 1 } // Duplicates considered as count greater than one
}}
],
{allowDiskUse: true} // For faster processing if set is larger
) // You can display result until this and check duplicates
.forEach(function(doc) {
doc.dups.shift(); // First element skipped for deleting
db.collectionName.remove({_id : {$in: doc.dups }}); // Delete remaining duplicates
})
Assuming you want to permanently delete docs that contain a duplicate name + nodes entry from the collection, you can add a unique index with the dropDups: true option:
db.test.ensureIndex({name: 1, nodes: 1}, {unique: true, dropDups: true})
As the docs say, use extreme caution with this as it will delete data from your database. Back up your database first in case it doesn't do exactly as you're expecting.
UPDATE
This solution is only valid through MongoDB 2.x as the dropDups option is no longer available in 3.0 (docs).
Create collection dump with mongodump
Clear collection
Add unique index
Restore collection with mongorestore
I found this solution that works with MongoDB 3.4:
I'll assume the field with duplicates is called fieldX
db.collection.aggregate([
{
// only match documents that have this field
// you can omit this stage if you don't have missing fieldX
$match: {"fieldX": {$nin:[null]}}
},
{
$group: { "_id": "$fieldX", "doc" : {"$first": "$$ROOT"}}
},
{
$replaceRoot: { "newRoot": "$doc"}
}
],
{allowDiskUse:true})
Being new to mongoDB, I spent a lot of time and used other lengthy solutions to find and delete duplicates. However, I think this solution is neat and easy to understand.
It works by first matching documents that contain fieldX (I had some documents without this field, and I got one extra empty result).
The next stage groups documents by fieldX, and only inserts the $first document in each group using $$ROOT. Finally, it replaces the whole aggregated group by the document found using $first and $$ROOT.
I had to add allowDiskUse because my collection is large.
You can add this after any number of pipelines, and although the documentation for $first mentions a sort stage prior to using $first, it worked for me without it. " couldnt post a link here, my reputation is less than 10 :( "
You can save the results to a new collection by adding an $out stage...
Alternatively, if one is only interested in a few fields e.g. field1, field2, and not the whole document, in the group stage without replaceRoot:
db.collection.aggregate([
{
// only match documents that have this field
$match: {"fieldX": {$nin:[null]}}
},
{
$group: { "_id": "$fieldX", "field1": {"$first": "$$ROOT.field1"}, "field2": { "$first": "$field2" }}
}
],
{allowDiskUse:true})
The following Mongo aggregation pipeline does the deduplication and outputs it back to the same or different collection.
collection.aggregate([
{ $group: {
_id: '$field_to_dedup',
doc: { $first: '$$ROOT' }
} },
{ $replaceRoot: {
newRoot: '$doc'
} },
{ $out: 'collection' }
], { allowDiskUse: true })
My DB had millions of duplicate records. #somnath's answer did not work as is so writing the solution that worked for me for people looking to delete millions of duplicate records.
/** Create a array to store all duplicate records ids*/
var duplicates = [];
/** Start Aggregation pipeline*/
db.collection.aggregate([
{
$match: { /** Add any filter here. Add index for filter keys*/
filterKey: {
$exists: false
}
}
},
{
$sort: { /** Sort it in such a way that you want to retain first element*/
createdAt: -1
}
},
{
$group: {
_id: {
key1: "$key1", key2:"$key2" /** These are the keys which define the duplicate. Here document with same value for key1 and key2 will be considered duplicate*/
},
dups: {
$push: {
_id: "$_id"
}
},
count: {
$sum: 1
}
}
},
{
$match: {
count: {
"$gt": 1
}
}
}
],
{
allowDiskUse: true
}).forEach(function(doc){
doc.dups.shift();
doc.dups.forEach(function(dupId){
duplicates.push(dupId._id);
})
})
/** Delete the duplicates*/
var i,j,temparray,chunk = 100000;
for (i=0,j=duplicates.length; i<j; i+=chunk) {
temparray = duplicates.slice(i,i+chunk);
db.collection.bulkWrite([{deleteMany:{"filter":{"_id":{"$in":temparray}}}}])
}
Here is a slightly more 'manual' way of doing it:
Essentially, first, get a list of all the unique keys you are interested.
Then perform a search using each of those keys and delete if that search returns bigger than one.
db.collection.distinct("key").forEach((num)=>{
var i = 0;
db.collection.find({key: num}).forEach((doc)=>{
if (i) db.collection.remove({key: num}, { justOne: true })
i++
})
});
tips to speed up, when only small portion of your documents are duplicated:
you need an index on the field to detect duplicates.
$group does not use the index, but it can take advantage of $sort and $sort use the index. so you should put a $sort step at the beginning
do inplace delete_many() instead of $out to new collection, this will save lots of IO time and disk space.
if you use pymongo you can do:
index_uuid = IndexModel(
[
('uuid', pymongo.ASCENDING)
],
)
col.create_indexes([index_uuid])
pipeline = [
{"$sort": {"uuid":1}},
{
"$group": {
"_id": "$uuid",
"dups": {"$addToSet": "$_id"},
"count": {"$sum": 1}
}
},
{
"$match": {"count": {"$gt": 1}}
},
]
it_cursor = col.aggregate(
pipeline, allowDiskUse=True
)
# skip 1st dup of each dups group
dups = list(itertools.chain.from_iterable(map(lambda x: x["dups"][1:], it_cursor)))
col.delete_many({"_id":{"$in": dups}})
performance
I test it on a database contain 30M documents and 1TB large.
Without index/sort it takes more than an hour to get the cursor (I do not even have the patient to wait for it).
with index/sort but use $out to output to a new collection. This is safer if your filesystem does not support snapshot. But it requires lots of disk space and takes more than 40mins to finish despite the fact that we are using SSDs. It will be much slower if you are on HDD RAID.
with index/sort and inplace delete_many, it takes around 5mins in total.
The following method merges documents with the same name while only keeping the unique nodes without duplicating them.
I found using the $out operator to be a simple way. I unwind the array and then group it by adding to set. The $out operator allows the aggregation result to persist [docs].
If you put the name of the collection itself it will replace the collection with the new data. If the name does not exist it will create a new collection.
Hope this helps.
allowDiskUse may have to be added to the pipeline.
db.collectionName.aggregate([
{
$unwind:{path:"$nodes"},
},
{
$group:{
_id:"$name",
nodes:{
$addToSet:"$nodes"
}
},
{
$project:{
_id:0,
name:"$_id.name",
nodes:1
}
},
{
$out:"collectionNameWithoutDuplicates"
}
])
Using pymongo this should work.
Add the fields that need to be unique for the collection in unique_field
unique_field = {"field1":"$field1","field2":"$field2"}
cursor = DB.COL.aggregate([{"$group":{"_id":unique_field, "dups":{"$push":"$uuid"}, "count": {"$sum": 1}}},{"$match":{"count": {"$gt": 1}}},{"$group":"_id":None,"dups":{"$addToSet":{"$arrayElemAt":["$dups",1]}}}}],allowDiskUse=True)
slice the dups array depending on the duplications count(here i had only one extra duplicate for all)
items = list(cursor)
removeIds = items[0]['dups']
hold.remove({"uuid":{"$in":removeIds}})
I don't know whether is it going to answer main question, but for others it'll be usefull.
1.Query the duplicate row using findOne() method and store it as an object.
const User = db.User.findOne({_id:"duplicateid"});
2.Execute deleteMany() method to remove all the rows with the id "duplicateid"
db.User.deleteMany({_id:"duplicateid"});
3.Insert the values stored in User object.
db.User.insertOne(User);
Easy and fast!!!!
First, you can find all the duplicates and remove those duplicates in the DB. Here we take the id column to check and remove duplicates.
db.collection.aggregate([
{ "$group": { "_id": "$id", "count": { "$sum": 1 } } },
{ "$match": { "_id": { "$ne": null }, "count": { "$gt": 1 } } },
{ "$sort": { "count": -1 } },
{ "$project": { "name": "$_id", "_id": 0 } }
]).then(data => {
var dr = data.map(d => d.name);
console.log("duplicate Recods:: ", dr);
db.collection.remove({ id: { $in: dr } }).then(removedD => {
console.log("Removed duplicate Data:: ", removedD);
})
})
General idea is to use findOne https://docs.mongodb.com/manual/reference/method/db.collection.findOne/
to retrieve one random id from the duplicate records in the collection.
Delete all the records in the collection other than the random-id that we retrieved from findOne option.
You can do something like this if you are trying to do it in pymongo.
def _run_query():
try:
for record in (aggregate_based_on_field(collection)):
if not record:
continue
_logger.info("Working on Record %s", record)
try:
retain = db.collection.find_one(find_one({'fie1d1': 'x', 'field2':'y'}, {'_id': 1}))
_logger.info("_id to retain from duplicates %s", retain['_id'])
db.collection.remove({'fie1d1': 'x', 'field2':'y', '_id': {'$ne': retain['_id']}})
except Exception as ex:
_logger.error(" Error when retaining the record :%s Exception: %s", x, str(ex))
except Exception as e:
_logger.error("Mongo error when deleting duplicates %s", str(e))
def aggregate_based_on_field(collection):
return collection.aggregate([{'$group' : {'_id': "$fieldX"}}])
From the shell:
Replace find_one to findOne
Same remove command should work.