Remove Duplicates from MongoDB - mongodb

Hi, I have ~5 million documents in MongoDB (a replica set), each with 43 fields. How can I remove the duplicate documents? I tried:
db.testkdd.ensureIndex({
    duration : 1 , protocol_type : 1 , service : 1 ,
    flag : 1 , src_bytes : 1 , dst_bytes : 1 ,
    land : 1 , wrong_fragment : 1 , urgent : 1 ,
    hot : 1 , num_failed_logins : 1 , logged_in : 1 ,
    num_compromised : 1 , root_shell : 1 , su_attempted : 1 ,
    num_root : 1 , num_file_creations : 1 , num_shells : 1 ,
    num_access_files : 1 , num_outbound_cmds : 1 , is_host_login : 1 ,
    is_guest_login : 1 , count : 1 , srv_count : 1 ,
    serror_rate : 1 , srv_serror_rate : 1 , rerror_rate : 1 ,
    srv_rerror_rate : 1 , same_srv_rate : 1 , diff_srv_rate : 1 ,
    srv_diff_host_rate : 1 , dst_host_count : 1 , dst_host_srv_count : 1 ,
    dst_host_same_srv_rate : 1 , dst_host_diff_srv_rate : 1 ,
    dst_host_same_src_port_rate : 1 , dst_host_srv_diff_host_rate : 1 ,
    dst_host_serror_rate : 1 , dst_host_srv_serror_rate : 1 ,
    dst_host_rerror_rate : 1 , dst_host_srv_rerror_rate : 1 , lable : 1
},
{ unique: true, dropDups: true }
)
When I run this code I get this error:
{
"ok" : 0,
"errmsg" : "namespace name generated from index name \"project.testkdd.$duration_1_protocol_type_1_service_1_flag_1_src_bytes_1_dst_bytes_1_land_1_wrong_fragment_1_urgent_1_hot_1_num_failed_logins_1_logged_in_1_num_compromised_1_root_shell_1_su_attempted_1_num_root_1_num_file_creations_1_num_shells_1_num_access_files_1_num_outbound_cmds_1_is_host_login_1_is_guest_login_1_count_1_srv_count_1_serror_rate_1_srv_serror_rate_1_rerror_rate_1_srv_rerror_rate_1_same_srv_rate_1_diff_srv_rate_1_srv_diff_host_rate_1_dst_host_count_1_dst_host_srv_count_1_dst_host_same_srv_rate_1_dst_host_diff_srv_rate_1_dst_host_same_src_port_rate_1_dst_host_srv_diff_host_rate_1_dst_host_serror_rate_1_dst_host_srv_serror_rate_1_dst_host_rerror_rate_1_dst_host_srv_rerror_rate_1_lable_1\" is too long (127 byte max)",
"code" : 67
}
How can I solve this problem?

The "dropDups" syntax for index creation has been "deprecated" as of MongoDB 2.6 and removed in MongoDB 3.0. It is not a very good idea in most cases to use this as the "removal" is arbitrary and any "duplicate" could be removed. Which means what gets "removed" may not be what you really want removed.
Anyhow, you are running into this error because the namespace name generated from the index name, which concatenates all of those field names, is longer than the 127-byte limit. Generally speaking, you are not "meant" to index 43 fields in any normal application.
If you want to remove the "duplicates" from a collection then your best bet is to run an aggregation query to determine which documents contain "duplicate" data and then cycle through that list removing "all but one" of the already "unique" _id values from the target collection. This can be done with "Bulk" operations for maximum efficiency.
NOTE: I do find it hard to believe that your documents actually contain 43 "unique" fields. It is likely that "all you need" is to simply identify only those fields that make the document "unique" and then follow the process as outlined below:
var bulk = db.testkdd.initializeOrderedBulkOp(),
    count = 0;

// List "all" fields that make a document "unique" in the `_id`
// I am only listing some for example purposes to follow

db.testkdd.aggregate([
    { "$group": {
        "_id": {
            "duration" : "$duration",
            "protocol_type": "$protocol_type",
            "service": "$service",
            "flag": "$flag"
        },
        "ids": { "$push": "$_id" },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } } }
],{ "allowDiskUse": true}).forEach(function(doc) {
    doc.ids.shift();                                    // remove first match
    bulk.find({ "_id": { "$in": doc.ids } }).remove();  // removes all in $in list
    count++;

    // Execute 1 in 1000 and re-init
    if ( count % 1000 == 0 ) {
        bulk.execute();
        bulk = db.testkdd.initializeOrderedBulkOp();
    }
});

if ( count % 1000 != 0 )
    bulk.execute();
If you have a MongoDB version "lower" than 2.6 and don't have bulk operations, then you can try a standard .remove() inside the loop as well. Note also that .aggregate() will not return a cursor here and the looping must change to:
db.testkdd.aggregate([
    // pipeline as above
]).result.forEach(function(doc) {
    doc.ids.shift();
    db.testkdd.remove({ "_id": { "$in": doc.ids } });
});
But do make sure to look at your documents closely and only include "just" the "unique" fields you expect to be part of the grouping _id. Otherwise you end up removing nothing at all, since there are no duplicates there.
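On MongoDB 3.2 or newer you can also skip the Bulk builder and use deleteMany directly; this is a minimal sketch of the same approach, again grouping only on a few example fields assumed to define uniqueness:
db.testkdd.aggregate([
    { "$group": {
        "_id": {
            "duration": "$duration",
            "protocol_type": "$protocol_type",
            "service": "$service",
            "flag": "$flag"
        },
        "ids": { "$push": "$_id" },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } } }
], { "allowDiskUse": true }).forEach(function(doc) {
    doc.ids.shift();                                       // keep the first _id
    db.testkdd.deleteMany({ "_id": { "$in": doc.ids } });  // remove the remaining duplicates
});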

Related

Paginated aggregation with two fields

I'm trying to do an aggregate operation to find distinct property pairs in a collection of objects, paginated, but $skip and $limit don't seem to work.
I have a collection with the following object type
{
    "_id" : {
        "expiration" : ISODate("2021-06-30T00:00:00.000Z"),
        "product" : "proda",
        "site" : "warehouse1",
        "type" : "AVAILABLE"
    },
    "quantity" : 2,
    "date" : ISODate("2021-06-28T00:00:00.000Z")
}
I'm trying to find distinct product/site pairs, but only 2 at a time with the following aggregation:
db.getCollection('OBJECT').aggregate([
    { $group: { "_id": { product: "$_id.product", site: "$_id.site" } } },
    { $skip: 0 },
    { $limit: 2 }
])
With skip being 0 it returns 2 distinct product-site pairs as expected, but when I increase the skip value to 2 or more for the next pages, the query does not return anything, even though I have many objects with distinct product-site pairs that should be returned.
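One thing that can make output like this unpredictable is that $group does not guarantee any output order, so paginating its results with $skip/$limit is only stable if a $sort stage is placed in between. A sketch of the same pipeline with an explicit sort (the sort keys are just an assumption for illustration):
db.getCollection('OBJECT').aggregate([
    { $group: { "_id": { product: "$_id.product", site: "$_id.site" } } },
    { $sort: { "_id.product": 1, "_id.site": 1 } },   // fix the order before paginating
    { $skip: 2 },
    { $limit: 2 }
])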

Solving "BSONObj size: 17582686 (0x10C4A5E) is invalid" when doing aggregation in MongoDB?

I'm trying to remove duplicate documents in MongoDB in a large collection according to the approach described here:
db.events.aggregate([
    { "$group": {
        "_id": { "firstId": "$firstId", "secondId": "$secondId" },
        "dups": { "$push": "$_id" },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } }}
], { allowDiskUse: true, cursor: { batchSize: 100 } }).forEach(function(doc) {
    doc.dups.shift();
    db.events.remove({ "_id": { "$in": doc.dups } });
});
I.e. I want to remove events that have the same "firstId - secondId" combination. However, after a while MongoDB responds with this error:
2016-11-30T14:13:57.403+0000 E QUERY [thread1] Error: getMore command failed: {
"ok" : 0,
"errmsg" : "BSONObj size: 17582686 (0x10C4A5E) is invalid. Size must be between 0 and 16793600(16MB)",
"code" : 10334
}
Is there any way to get around this? I'm using MongoDB 3.2.6.
The error message indicates that some part of the process is attempting to create a document that is larger than the 16 MB document size limit in MongoDB.
Without knowing your data set, I would guess that the size of the collection is sufficiently large that the number of unique firstId / secondId combinations is growing the result set past the document size limit.
If the size of the collection prevents finding all duplicates values in one operation, you may want to try breaking it up and iterating through the collection and querying to find duplicate values:
db.events.find({}, { "_id" : 0, "firstId" : 1, "secondId" : 1 }).forEach(function(doc) {
    var cnt = db.events.find(
        { "firstId" : doc.firstId, "secondId" : doc.secondId },
        { "_id" : 0, "firstId" : 1, "secondId" : 1 } // explicitly selecting only the key fields so the index can cover the query
    ).count();
    if ( cnt > 1 )
        print('Dupe Keys: firstId: ' + doc.firstId + ', secondId: ' + doc.secondId);
})
It's probably not the most efficient implementation, but you get the idea.
Note that this approach heavily relies upon the existence of the index { 'firstId' : 1, 'secondId' : 1 }
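If that index is not already in place, it can be created up front so each count() lookup is covered; a small sketch (createIndex is available on the 3.2.x version mentioned above):
// Covering index for the firstId / secondId lookups in the loop above
db.events.createIndex({ "firstId" : 1, "secondId" : 1 })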

Update query for the given case in MongoDB

{ "_id" :,
"final_terms" : [
{
"np" : "the role",
"tf" : 28571.000,
"idf" : 0
}]
}
How do I update these documents, setting a flag of 1 for the top 30% when sorted in decreasing order by final_terms.idf, and 0 for the rest?
{ "_id" :,
"final_terms" : [
{
"np" : "the role",
"tf" : 28571.000,
"idf" : 0
"flag": 0
}]
}
I am new to MongoDB, and I need to do this for NLP. The MongoDB docs are not very detail oriented and it is difficult to get a grip on MongoDB using them.
I would do this in steps. Firstly, you need to know how many documents will be in your result set, so that you can figure out what the top 30% is. Secondly, you do a query that will sort the documents in decreasing order by final_terms.idf and figure out what the value of final_terms.idf is for the last document in the top 30% of the result set. Once you know that, you can update all documents with a final_terms.idf value greater than or equal to that with flag: 1 and all others with flag: 0. The exact implementation would depend on your programming language, but an implementation in the mongo shell would look as follows:
// Get count
> db.collection.find().count();
100
Now you know that you have 100 documents, so the top 30% will be the first 30 documents. Skip the first 29 in the sorted results and find the value for the 30th document:
// Sort and get value for 30th document
> db.collection.find({}, { "final_terms.idf" : 1, "_id" : 0} ).sort({ "final_terms.idf" : -1 }).skip(29).limit(1);
{ "final_terms" : { "idf" : "<SOME_VALUE>" } }
You now have the value at the bottom limit of the first 30%. Use that value to do the respective updates:
// Update top 30%
db.collection.update({ "final_terms.idf" : { $gte : <SOME_VALUE> }}, { $set : { "final_terms.flag" : 1 } }, { "multi" : true });
// Update bottom 70%
db.collection.update({ "final_terms.idf" : { $lt : <SOME_VALUE> }}, { $set : { "final_terms.flag" : 0 } }, { "multi" : true });
That should give you an idea of how to solve your problem.
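To avoid copying the cutoff value over by hand, the same steps can also be strung together in one mongo shell script. This is only a rough sketch: the collection name is the placeholder used above, and because final_terms is an array in the question's documents, it uses the positional $ operator to set the flag on the matching element:
// Rough sketch: compute the 30% cutoff and apply both updates
var total = db.collection.find().count();                 // total number of documents
var skipTo = Math.max(Math.ceil(total * 0.3) - 1, 0);     // index of the last document in the top 30%
var cutoffDoc = db.collection.find({}, { "final_terms.idf" : 1, "_id" : 0 })
    .sort({ "final_terms.idf" : -1 })
    .skip(skipTo).limit(1)
    .toArray()[0];
var cutoff = cutoffDoc.final_terms[0].idf;                // idf value at the cutoff
db.collection.update({ "final_terms.idf" : { $gte : cutoff } },
    { $set : { "final_terms.$.flag" : 1 } }, { "multi" : true });
db.collection.update({ "final_terms.idf" : { $lt : cutoff } },
    { $set : { "final_terms.$.flag" : 0 } }, { "multi" : true });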

Return range of documents around ID in MongoDB

I have an ID of a document and need to return the document plus the 10 documents that come before and the 10 documents after it. 21 docs total.
I do not have a start or end value from any key. Only the limit in either direction.
Best way to do this? Thank you in advance.
Did you know that ObjectIds contain a timestamp? And that they therefore represent the natural insertion order? So if you are looking for documents before and after a known document _id, you can do this:
Our documents:
{ "_id" : ObjectId("5307f2d80f936e03d1a1d1c8"), "a" : 1 }
{ "_id" : ObjectId("5307f2db0f936e03d1a1d1c9"), "b" : 1 }
{ "_id" : ObjectId("5307f2de0f936e03d1a1d1ca"), "c" : 1 }
{ "_id" : ObjectId("5307f2e20f936e03d1a1d1cb"), "d" : 1 }
{ "_id" : ObjectId("5307f2e50f936e03d1a1d1cc"), "e" : 1 }
{ "_id" : ObjectId("5307f2e90f936e03d1a1d1cd"), "f" : 1 }
{ "_id" : ObjectId("5307f2ec0f936e03d1a1d1ce"), "g" : 1 }
{ "_id" : ObjectId("5307f2ee0f936e03d1a1d1cf"), "h" : 1 }
{ "_id" : ObjectId("5307f2f10f936e03d1a1d1d0"), "i" : 1 }
{ "_id" : ObjectId("5307f2f50f936e03d1a1d1d1"), "j" : 1 }
{ "_id" : ObjectId("5307f3020f936e03d1a1d1d2"), "j" : 1 }
So we know the _id of "f", get it and the next 2 documents:
> db.items.find({ _id: {$gte: ObjectId("5307f2e90f936e03d1a1d1cd") } }).limit(3)
{ "_id" : ObjectId("5307f2e90f936e03d1a1d1cd"), "f" : 1 }
{ "_id" : ObjectId("5307f2ec0f936e03d1a1d1ce"), "g" : 1 }
{ "_id" : ObjectId("5307f2ee0f936e03d1a1d1cf"), "h" : 1 }
And do the same in reverse:
> db.items.find({ _id: {$lte: ObjectId("5307f2e90f936e03d1a1d1cd") } })
.sort({ _id: -1 }).limit(3)
{ "_id" : ObjectId("5307f2e90f936e03d1a1d1cd"), "f" : 1 }
{ "_id" : ObjectId("5307f2e50f936e03d1a1d1cc"), "e" : 1 }
{ "_id" : ObjectId("5307f2e20f936e03d1a1d1cb"), "d" : 1 }
And that's a much better approach than scanning a collection.
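Putting the two directions together for the original requirement (10 before, the document itself, 10 after) is then just a matter of stitching the results; a sketch using the sample _id of "f" above:
// Sketch: build the 21-document window around a known _id
var targetId = ObjectId("5307f2e90f936e03d1a1d1cd");
var before = db.items.find({ _id: { $lt: targetId } })
    .sort({ _id: -1 }).limit(10).toArray().reverse();    // 10 docs before, put back in ascending order
var target = db.items.find({ _id: targetId }).toArray(); // the document itself
var after = db.items.find({ _id: { $gt: targetId } })
    .sort({ _id: 1 }).limit(10).toArray();               // 10 docs after
var windowDocs = before.concat(target, after);            // up to 21 documents in order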
Neil's answer is a good answer to the question as stated (assuming that you are using automatically generated ObjectIds), but keep in mind that there's some subtlety around the concept of the 10 documents before and after a given document.
The complete format for an ObjectId is documented here. Note that it consists of the following fields:
timestamp to 1-second resolution,
machine identifier
process id
counter
Generally, if you don't specify your own _ids, they are automatically generated by the driver on the client machine. So as long as the ObjectIds are generated by a single process on a single client machine, their order does indeed reflect the order in which they were generated, which in a typical application will also be the insertion order (but need not be). However, if you have multiple processes or multiple client machines, the order of the ObjectIds generated within a given second by those multiple sources has an unpredictable relationship to the insertion order.
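As a side note, the embedded timestamp can be inspected directly in the shell, which is a quick way to sanity-check the ordering for your own data:
// Extract the creation time encoded in the first four bytes of an ObjectId
ObjectId("5307f2e90f936e03d1a1d1cd").getTimestamp()   // returns an ISODate with 1-second resolution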

Mongo DB sorting exception - too much data for sort() with no index

Using MongoDB version 2.4.4, I have a profile collection containing profile documents.
I have the following query:
Query: { "loc" : { "$near" : [ 32.08290052711715 , 34.80888522811172] , "$maxDistance" : 0.0089992800575954}}
Fields: { "friendsCount" : 1 , "tappsCount" : 1 , "imageUrl" : 1 , "likesCount" : 1 , "lastActiveTime" : 1 , "smallImageUrl" : 1 , "loc" : 1 , "pid" : 1 , "firstName" : 1}
Sort: { "lastActiveTime" : -1}
Limited to 100 documents.
loc - an embedded document containing the keys (lat, lon)
I am getting the exception:
org.springframework.data.mongodb.UncategorizedMongoDbException: too much data for sort() with no index. add an index or specify a smaller limit;
As stated in the exception, when I reduce the limit to 50 it works, but that is not an option for me.
I have the following 2 relevant indexes on the profile document:
{'loc':'2d'}
{'lastActiveTime':-1}
I have also tried a compound index, as below, but without success.
{'loc':'2d', 'lastActiveTime':-1}
This is example document (with the relevant keys):
{
"_id" : "5d5085601208aa918bea3c1ede31374d",
"gender" : "female",
"isCreated" : true,
"lastActiveTime" : ISODate("2013-04-08T11:30:56.615Z"),
"loc" : {
"lat" : 32.082230499955806,
"lon" : 34.813542940344945,
"locTime" : NumberLong(0)
}
}
There are other fields in the profile documents; the average profile document size is about 0.5 MB. Correct me if I am wrong, but since I am selecting only the relevant response fields (as I do), document size should not be the cause of the problem.
I don't know if it helps, but when I reduce the limit to 50 and the query succeeds, I get the following explain information (via the MongoVUE client):
cursor : GeoSearchCursor
isMultyKey : False
n : 50
nscannedObjects : 50
nscanned : 50
nscannedObjectsAllPlans : 50
nscannedAllPlans : 50
scanAndOrder : True
indexOnly : False
nYields : 0
nChunkSkips : 0
millis : 10
indexBounds :
This is a blocker for me and I would appreciate your help. What am I doing wrong? How can I make the query run with the needed limit size?
Try creating a compound index instead of two indexes.
db.collection.ensureIndex( { 'loc':'2d','lastActiveTime':-1 } )
You can also suggest the query which index to use:
db.collection.find(...).hint('myIndexName')
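If you are not sure what name the compound index was given, you can list the collection's indexes first and pass the reported name to hint(); a small illustrative snippet (the collection name is a placeholder):
// List existing indexes and their generated names
db.collection.getIndexes().forEach(function(ix) { printjson({ name: ix.name, key: ix.key }); });
// the compound index created above would typically be named "loc_2d_lastActiveTime_-1"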