MongoDB: Removing duplicate document based on ObjectId? - mongodb

This is an open question. I am sorry if it is a little vague, but I am trying to collect thoughts from other people since I am very new to Mongo.
Situation
I realized that my collection has multiple duplicate documents (based on the name key).
These documents may be identical or may have changed during subsequent dumps from file (we want to keep the later changes).
Since there is no insert date, it is hard to tell by looking at a document which one is the latest (bad schema design).
Wanted
To remove the documents which were inserted earlier.
I read that each document in a collection is assigned an ObjectId (here) that makes the document unique.
Question
Is it possible to know which document was inserted earlier based on ObjectId and remove it using Map Reduce?
Any other thoughts and advice?

I'm bored this evening, so here we go.
Step 1. Let's prepare our test data.
> db.users.insert({name: 'John', other_field: Math.random()})
> db.users.insert({name: 'Bob', other_field: Math.random()})
> db.users.insert({name: 'Mary', other_field: Math.random()})
> db.users.insert({name: 'John', other_field: Math.random()})
> db.users.insert({name: 'Jeff', other_field: Math.random()})
> db.users.insert({name: 'Ivan', other_field: Math.random()})
> db.users.insert({name: 'Mary', other_field: Math.random()})
> db.users.find()
{
"_id" : ObjectId("501976e9bee9b253265bba8b"),
"name" : "John",
"other_field" : 0.9884713875252772
}
{
"_id" : ObjectId("501976e9bee9b253265bba8c"),
"name" : "Bob",
"other_field" : 0.048004131996396415
}
{
"_id" : ObjectId("501976e9bee9b253265bba8d"),
"name" : "Mary",
"other_field" : 0.20415803582615222
}
{
"_id" : ObjectId("501976e9bee9b253265bba8e"),
"name" : "John",
"other_field" : 0.5514446987265585
}
{
"_id" : ObjectId("501976e9bee9b253265bba8f"),
"name" : "Jeff",
"other_field" : 0.8685077449753242
}
{
"_id" : ObjectId("501976e9bee9b253265bba90"),
"name" : "Ivan",
"other_field" : 0.2842514340422925
}
{
"_id" : ObjectId("501976eabee9b253265bba91"),
"name" : "Mary",
"other_field" : 0.984048520281136
}
Step 2. The map-reduce
var map = function() {
emit(this.name, this);
};
var reduce = function(name, vals) {
var last_obj = null;
vals.forEach(function(v) {
if(!last_obj || v._id > last_obj._id) {
last_obj = v;
}
});
return last_obj;
};
db.users.mapReduce(map, reduce, {out: 'temp_coll'})
It basically groups all documents by name and then selects the one with the largest _id.
Step 3. Do something with unique data.
> db.temp_coll.find()
{
"_id" : "Bob",
"value" : {
"_id" : ObjectId("501976e9bee9b253265bba8c"),
"name" : "Bob",
"other_field" : 0.048004131996396415
}
}
{
"_id" : "Ivan",
"value" : {
"_id" : ObjectId("501976e9bee9b253265bba90"),
"name" : "Ivan",
"other_field" : 0.2842514340422925
}
}
{
"_id" : "Jeff",
"value" : {
"_id" : ObjectId("501976e9bee9b253265bba8f"),
"name" : "Jeff",
"other_field" : 0.8685077449753242
}
}
{
"_id" : "John",
"value" : {
"_id" : ObjectId("501976e9bee9b253265bba8e"),
"name" : "John",
"other_field" : 0.5514446987265585
}
}
{
"_id" : "Mary",
"value" : {
"_id" : ObjectId("501976eabee9b253265bba91"),
"name" : "Mary",
"other_field" : 0.984048520281136
}
}
For example, drop the original collection, iterate this one and insert values into new collection. Don't forget to drop the temp collection when you're done.
Important
I didn't bother extracting a timestamp from the ObjectId, because I assumed that you don't run your import jobs twice a second (maybe not even every second).
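If you would rather delete the older duplicates in place than rebuild the collection, the same "largest _id wins" rule can be sketched in plain JavaScript. This is only an illustration: findOlderDuplicateIds is a hypothetical helper, and the ids are assumed to be 24-character hex strings (same length, lowercase), so comparing them as strings orders them the same way as the underlying bytes.

```javascript
// Sketch only: ids are plain 24-character hex strings here (an assumption),
// standing in for the ObjectId values of the documents.
function findOlderDuplicateIds(docs) {
  const newestByName = new Map(); // name -> hex id of the newest document seen so far
  for (const doc of docs) {
    const current = newestByName.get(doc.name);
    if (current === undefined || doc._id > current) {
      newestByName.set(doc.name, doc._id);
    }
  }
  // Every document that is not the newest one for its name is a removal candidate.
  return docs
    .filter((doc) => newestByName.get(doc.name) !== doc._id)
    .map((doc) => doc._id);
}

const sample = [
  { _id: "501976e9bee9b253265bba8b", name: "John" },
  { _id: "501976e9bee9b253265bba8c", name: "Bob" },
  { _id: "501976e9bee9b253265bba8d", name: "Mary" },
  { _id: "501976e9bee9b253265bba8e", name: "John" },
  { _id: "501976eabee9b253265bba91", name: "Mary" },
];
const idsToRemove = findOlderDuplicateIds(sample);
console.log(idsToRemove);
```

The resulting ids would still have to be turned back into ObjectId values before being used in a remove query with $in. Try it on a copy of the collection first.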

OK, since an ObjectId uses a timestamp as its leading four bytes, you can do this with a bit of math.
Thankfully the mongo shell has a way to get the timestamp from an ObjectId, but you will need some more JavaScript to first query your documents with the same name, then store them in a temp variable (if using the command line) or in a temp collection (if using drivers), and parse each individual id using the timestamp getter shown in the link below.
http://www.mongodb.org/display/DOCS/Optimizing+Object+IDs#OptimizingObjectIDs-Extractinsertiontimesfromidratherthanhavingaseparatetimestampfield.
Remember that ObjectIds are only accurate to the second, so this still doesn't help in rapid insertion mode.
Either way, what you are asking for is doable, either in a map-reduce function or in the way described above, which does it through the command line.
Give that a shot and if you get stuck, let me know. If I know your collection structure I can probably whip up something real quick, but only after you bang your head on it a couple of times :)
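The "bit of math" can be sketched like this. It is a minimal illustration in plain JavaScript, assuming the id is available as its 24-character hex string; objectIdHexToDate is a hypothetical helper name. In the mongo shell itself, ObjectId values expose getTimestamp(), which does the same job.

```javascript
// The first 4 bytes (8 hex characters) of an ObjectId are seconds since the Unix epoch.
function objectIdHexToDate(hexId) {
  const seconds = parseInt(hexId.substring(0, 8), 16);
  return new Date(seconds * 1000); // Date expects milliseconds
}

const first = objectIdHexToDate("501976e9bee9b253265bba90");
const second = objectIdHexToDate("501976eabee9b253265bba91");

// The two ids differ by 1 in their timestamp bytes, i.e. by one second.
console.log(second.getTime() - first.getTime()); // 1000
```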

Related

MongoDB: How to get the object names in collection?

Thank you in advance for the help. I have recently started using MongoDB for a personal project and I'm interested in finding a better way to query my data.
My question is: I have the following collection:
{
"_id" : ObjectId("5dbd77f7a204d21119cfc758"),
"Toyota" : {
"Founder" : "Kiichiro Toyoda",
"Founded" : "28 August 1937",
"Subsidiaries" : [
"Lexus",
"Daihatsu",
"Subaru",
"Hino"
]
}
}
{
"_id" : ObjectId("5dbd78d3a204d21119cfc759"),
"Volkswagen" : {
"Founder" : "German Labour Front",
"Founded" : "28 May 1937",
"Subsidiaries" : [
"Audi",
"Volkswagen",
"Skoda",
"SEAT"
]
}
}
I want to get the object names. For example, here I want to return
[Toyota, Volkswagen]
I have used this method:
var names = {}
db.cars.find().forEach(function(doc){Object.keys(doc).forEach(function(key){names[key]=1})});
names;
which gave me the following result:
{ "_id" : 1, "Toyota" : 1, "Volkswagen" : 1 }
However, is there a better way to get the same result, and to return just the names of the objects? Thank you.
I would suggest changing the schema design to something like:
{
_id: ...,
company: {
name: 'Volkswagen',
founder: ...,
subsidiaries: ...,
...<other fields>...
}
}
You can then use the aggregation framework to achieve a similar result:
> db.test.find()
{ "_id" : 0, "company" : { "name" : "Volkswagen", "founder" : "German Labour Front" } }
{ "_id" : 1, "company" : { "name" : "Toyota", "founder" : "Kiichiro Toyoda" } }
> db.test.aggregate([ {$group: {_id: null, companies: {$push: '$company.name'}}} ])
{ "_id" : null, "companies" : [ "Volkswagen", "Toyota" ] }
For more details, see:
Aggregation framework
$group
Accumulator operators
As a bonus, you can create an index on the company.name field, whereas you cannot create an index on varying field names like in your example.
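If the schema cannot be changed, the approach from the question can at least be tidied up so that it skips _id and returns an array. This is a sketch over plain objects; collectTopLevelNames is a hypothetical helper, and in the shell the input could come from something like db.cars.find().toArray().

```javascript
// Collect the distinct top-level field names of all documents, ignoring _id.
function collectTopLevelNames(docs) {
  const names = new Set();
  for (const doc of docs) {
    for (const key of Object.keys(doc)) {
      if (key !== "_id") {
        names.add(key);
      }
    }
  }
  return Array.from(names); // insertion order: first time each name was seen
}

const cars = [
  { _id: "5dbd77f7a204d21119cfc758", Toyota: { Founder: "Kiichiro Toyoda" } },
  { _id: "5dbd78d3a204d21119cfc759", Volkswagen: { Founder: "German Labour Front" } },
];
const carNames = collectTopLevelNames(cars);
console.log(carNames);
```

This still reads every document on the client, so the schema change above remains the better fix for anything but small collections.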

Querying with array of parameters in mongodb

I have the collection below in the DB. I want to retrieve the documents whose birth month equals either of two given months, let's say [1,2] or [4,5].
{
"_id" : ObjectId("55aa1e526fea82e9a4188f38"),
"name" : "Nilmini",
"birthDate" : 6,
"birthMonth" : 1
},
{
"_id" : ObjectId("55aa1e526fea82e9a4188f39"),
"name" : "Ruwan",
"birthDate" : 6,
"birthMonth" : 1
},{
"_id" : ObjectId("55aa1e526fea82e9a4188f40"),
"name" : "Malith",
"birthDate" : 6,
"birthMonth" : 1
},
{
"_id" : ObjectId("55aa1e526fea82e9a4188f7569"),
"name" : "Pradeep",
"birthDate" : 6,
"birthMonth" : 7
}
I use the query below to get the result set. I can get the result for one given month; now I want to get results for multiple months.
var currentDay = moment().date();
var currentMonths = [];
var currentMonth = moment().month();
if(currentDay > 20){
currentMonths.push(moment().month());
currentMonths.push(moment().month()+1);
}else{
currentMonths.push(currentMonth);
}
// In the query below I am trying to pass the array to 'birthMonth'.
I'm getting nothing when I pass an array to the query. I think there should be another way to do this:
Employee.find(
{
"birthDate": {$gte:currentDay}, "birthMonth": currentMonths
}, function(err, birthDays) {
res.json(birthDays);
});
I would really appreciate it if you could help me figure this out.
You can use the $in operator to match against multiple values in an array like currentMonths.
So your query would be:
Employee.find(
{
"birthDate": {$gte:currentDay}, "birthMonth": {$in: currentMonths}
}, function(err, birthDays) {
res.json(birthDays);
});
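One thing to watch when building currentMonths: moment().month() is zero-based (0 to 11), so month()+1 in December gives 12, and the values may not line up with what is stored. Here is a sketch of building the filter with a wrap-around, using the built-in Date instead of moment. It assumes birthMonth is stored as 1 to 12, which the sample data suggests but does not confirm; buildBirthdayFilter is a hypothetical helper.

```javascript
// Assumption: birthMonth is stored as 1-12. Date#getMonth() is 0-11, so add 1.
function buildBirthdayFilter(today) {
  const day = today.getDate();
  const month = today.getMonth() + 1;
  const months = [month];
  if (day > 20) {
    // Late in the month: also include the next month, wrapping December to January.
    months.push(month === 12 ? 1 : month + 1);
  }
  // The birthDate condition is kept exactly as in the question.
  return { birthDate: { $gte: day }, birthMonth: { $in: months } };
}

const filter = buildBirthdayFilter(new Date(2015, 11, 25)); // 25 December 2015, local time
console.log(JSON.stringify(filter));
```

The resulting object can be passed as the first argument to Employee.find in place of the literal filter.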

Get specific object in array of array in MongoDB

I need to get a specific object in an array of arrays in MongoDB.
I need to get only the task object with _id = ObjectId("543429a2cb38b1d83c3ff2c2").
My document (projects):
{
"_id" : ObjectId("543428c2cb38b1d83c3ff2bd"),
"name" : "new project",
"author" : ObjectId("5424ac37eb0ea85d4c921f8b"),
"members" : [
ObjectId("5424ac37eb0ea85d4c921f8b")
],
"US" : [
{
"_id" : ObjectId("5434297fcb38b1d83c3ff2c0"),
"name" : "Test Story",
"author" : ObjectId("5424ac37eb0ea85d4c921f8b"),
"tasks" : [
{
"_id" : ObjectId("54342987cb38b1d83c3ff2c1"),
"name" : "teste3",
"author" : ObjectId("5424ac37eb0ea85d4c921f8b")
},
{
"_id" : ObjectId("543429a2cb38b1d83c3ff2c2"),
"name" : "jklasdfa_XXX",
"author" : ObjectId("5424ac37eb0ea85d4c921f8b")
}
]
}
]
}
Result expected:
{
"_id" : ObjectId("543429a2cb38b1d83c3ff2c2"),
"name" : "jklasdfa_XXX",
"author" : ObjectId("5424ac37eb0ea85d4c921f8b")
}
But I am not getting it.
I am still testing, with no success:
db.projects.find({
"US.tasks._id" : ObjectId("543429a2cb38b1d83c3ff2c2")
}, { "US.tasks.$" : 1 })
I tried with $elemMatch too, but it returns nothing.
db.projects.find({
"US" : {
"tasks" : {
$elemMatch : {
"_id" : ObjectId("543429a2cb38b1d83c3ff2c2")
}
}
}
})
Can I get ONLY my expected result using find()? If not, what should I use, and how?
Thanks!
You will need an aggregation for that:
db.projects.aggregate([{$unwind:"$US"},
{$unwind:"$US.tasks"},
{$match:{"US.tasks._id":ObjectId("543429a2cb38b1d83c3ff2c2")}},
{$project:{_id:0,"task":"$US.tasks"}}])
should return
{ task : {
"_id" : ObjectId("543429a2cb38b1d83c3ff2c2"),
"name" : "jklasdfa_XXX",
"author" : ObjectId("5424ac37eb0ea85d4c921f8b")
}
}
Explanation:
$unwind creates a new (virtual) document for each array element
$match is the query part of your find
$project is similar to the projection part of find, i.e. it specifies the fields you want to get in the results
You might want to add a second $match before the $unwind if you know the document you are searching for (look at performance metrics).
Edit: added a second $unwind since US is an array.
I don't know what you are doing (so I really can't tell and am just suggesting), but you might want to examine whether your schema (and MongoDB) is ideal for your task: the document looks just like denormalized relational data, so a relational database would probably be better for you.
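To see what each stage of the pipeline above does, here is a rough plain-JavaScript imitation of it over an in-memory array. It is only an illustration, not how MongoDB executes the pipeline: findTask is a hypothetical helper, ids are shortened to plain strings, and every project is assumed to have a US array whose stories each have a tasks array.

```javascript
function findTask(projects, taskId) {
  return projects
    // $unwind "$US": one document per user story
    .flatMap((project) => project.US.map((story) => ({ ...project, US: story })))
    // $unwind "$US.tasks": one document per task within each story
    .flatMap((doc) => doc.US.tasks.map((task) => ({ ...doc, US: { ...doc.US, tasks: task } })))
    // $match on "US.tasks._id"
    .filter((doc) => doc.US.tasks._id === taskId)
    // $project {_id: 0, task: "$US.tasks"}
    .map((doc) => ({ task: doc.US.tasks }));
}

const projects = [
  {
    _id: "543428c2cb38b1d83c3ff2bd",
    name: "new project",
    US: [
      {
        _id: "5434297fcb38b1d83c3ff2c0",
        name: "Test Story",
        tasks: [
          { _id: "54342987cb38b1d83c3ff2c1", name: "teste3" },
          { _id: "543429a2cb38b1d83c3ff2c2", name: "jklasdfa_XXX" },
        ],
      },
    ],
  },
];
const found = findTask(projects, "543429a2cb38b1d83c3ff2c2");
console.log(JSON.stringify(found));
```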

How can I select a number of records per a specific field using mongodb?

I have a collection of documents in MongoDB, each of which has a "group" field that refers to a group that owns the document. The documents look like this:
{
group: <objectID>
name: <string>
contents: <string>
date: <Date>
}
I'd like to construct a query which returns the most recent N documents for each group. For example, suppose there are 5 groups, each of which has 20 documents. I want to write a query which will return the top 3 for each group, which would return 15 documents, 3 from each group. Each group gets 3, even if another group has a 4th that's more recent.
In the SQL world, I believe this type of query is done with "partition by" and a counter. Is there such a thing in mongodb, short of doing N+1 separate queries for N groups?
You cannot do this using the aggregation framework yet. You can get the $max or top date value for each group, but the aggregation framework does not yet have a way to accumulate the top N, and there is no way to push the entire document into the result set (only individual fields).
So you have to fall back on MapReduce. Here is something that would work, but I'm sure there are many variants (all require somehow sorting an array of objects based on a specific attribute; I borrowed my solution from one of the answers in this question).
Map function - outputs group name as a key and the entire rest of the document as the value - but it outputs it as a document containing an array because we will try to accumulate an array of results per group:
map = function () {
emit(this.name, {a:[this]});
}
The reduce function will accumulate all the documents belonging to the same group into one array (via concat). Note that if you optimize reduce to keep only the top five array elements by checking date, then you won't need the finalize function, and you will use less memory while running mapReduce (it will also be faster).
reduce = function (key, values) {
result={a:[]};
values.forEach( function(v) {
result.a = v.a.concat(result.a);
} );
return result;
}
Since I'm keeping all values for each key, I need a finalize function to pull out only the latest five elements per key.
final = function (key, value) {
Array.prototype.sortByProp = function(p){
return this.sort(function(a,b){
return (a[p] < b[p]) ? 1 : (a[p] > b[p]) ? -1 : 0;
});
}
value.a.sortByProp('date');
return value.a.slice(0,5);
}
Using a template document similar to the one you provided, you run this by calling the mapReduce command:
> db.top5.mapReduce(map, reduce, {finalize:final, out:{inline:1}})
{
"results" : [
{
"_id" : "group1",
"value" : [
{
"_id" : ObjectId("516f011fbfd3e39f184cfe13"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.498Z"),
"contents" : 0.23778377776034176
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe0e"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.467Z"),
"contents" : 0.4434165076818317
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe09"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.436Z"),
"contents" : 0.5935856597498059
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe04"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.405Z"),
"contents" : 0.3912118375301361
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfdff"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.372Z"),
"contents" : 0.221651989268139
}
]
},
{
"_id" : "group2",
"value" : [
{
"_id" : ObjectId("516f011fbfd3e39f184cfe14"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.504Z"),
"contents" : 0.019611883210018277
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe0f"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.473Z"),
"contents" : 0.5670706110540777
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe0a"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.442Z"),
"contents" : 0.893193120136857
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe05"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.411Z"),
"contents" : 0.9496864483226091
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe00"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.378Z"),
"contents" : 0.013748752186074853
}
]
},
{
"_id" : "group3",
...
}
]
}
],
"timeMillis" : 15,
"counts" : {
"input" : 80,
"emit" : 80,
"reduce" : 5,
"output" : 5
},
"ok" : 1,
}
Each result has _id as the group name and value as an array of the five most recent documents from the collection for that group name.
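The optimization mentioned earlier (trimming inside reduce so that finalize is not needed) could look roughly like this. It is a sketch under the same assumptions as the functions above: every emitted value has the shape {a: [...]} and each document carries a date field holding a Date. reduceTopN and TOP_N are names made up for this example.

```javascript
var TOP_N = 5;

// Merge the partial arrays, sort newest first, and keep only the top N.
// The output has the same shape as the input values, so re-reducing is safe.
function reduceTopN(key, values) {
  var merged = [];
  values.forEach(function (v) {
    merged = merged.concat(v.a);
  });
  merged.sort(function (x, y) {
    return y.date - x.date; // Date objects subtract as millisecond timestamps
  });
  return { a: merged.slice(0, TOP_N) };
}

// Small in-memory check with eight documents dated 1-8 January 2013.
var docs = [1, 2, 3, 4, 5, 6, 7, 8].map(function (d) {
  return { date: new Date(2013, 0, d) };
});
var top = reduceTopN("group1", [{ a: docs.slice(0, 4) }, { a: docs.slice(4) }]);
console.log(top.a.map(function (d) { return d.date.getDate(); })); // [ 8, 7, 6, 5, 4 ]
```

Note that with this variant the result value is the {a: [...]} wrapper rather than a bare array, so a trivial finalize would still be needed if you want the output in exactly the same shape as above.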
You need the aggregation framework: a $group stage piped into a $limit stage.
You also want to $sort the records in some way, or else the limit will have undefined behaviour and the returned documents will be pseudo-random (the order used internally by Mongo).
Something like this:
db.collection.aggregate([{$group:...},{$sort:...},{$limit:...}])
See the documentation if you want to know more.
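For what it's worth, on newer servers a per-group top N can be expressed in the aggregation framework by sorting first, pushing whole documents per group, and slicing the array. The pipeline below is a sketch and is untested against the asker's data: it assumes a server version that supports $$ROOT and the $slice aggregation operator (3.2 or later, as far as I know), and it uses the group and date field names from the question.

```javascript
var N = 3; // how many documents to keep per group

var topNPerGroupPipeline = [
  { $sort: { date: -1 } },                                   // newest first
  { $group: { _id: "$group", docs: { $push: "$$ROOT" } } },  // whole documents per group
  { $project: { docs: { $slice: ["$docs", N] } } }           // keep the first N of each array
];

// In the shell this would be run as: db.collection.aggregate(topNPerGroupPipeline)
console.log(JSON.stringify(topNPerGroupPipeline));
```

Pushing every document of a group into one array can hit memory and document size limits on large groups, so treat this as a starting point rather than a drop-in solution.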

mongodb get elements which was inserted after some document

I have a document and I need to query the MongoDB database to return all the documents which were inserted after the current document.
Is it possible and how to do that query?
If you do not override the default _id field, you can use that ObjectId (see the MongoDB docs) to make a comparison by time. For instance, the following query will find all the documents inserted after curDoc was inserted (assuming none overwrite the _id field):
>db.test.find({ _id : {$gt : curDoc._id}})
Note that these timestamps are not very granular. If you would like a finer-grained view of the time that documents are inserted, I encourage you to add your own timestamp field to the documents you are inserting and use that field to make such queries.
If you are using an insert timestamp as one of the fields, you can query as below:
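The same _id comparison also works when you only have a point in time rather than a document. A sketch, assuming default ObjectIds: build the smallest possible id for that second (timestamp bytes followed by zeros) and compare against it. objectIdHexFromDate is a hypothetical helper, and the precision is whole seconds only.

```javascript
// Build the hex form of the lowest ObjectId for the given time:
// 4 bytes of seconds since the epoch, followed by 8 zero bytes.
function objectIdHexFromDate(date) {
  const seconds = Math.floor(date.getTime() / 1000);
  return seconds.toString(16).padStart(8, "0") + "0000000000000000";
}

const boundary = objectIdHexFromDate(new Date("2013-03-22T06:22:51.422Z"));
console.log(boundary.length); // 24

// In the shell, something like:
//   db.test.find({ _id: { $gt: ObjectId(boundary) } })
```

Because the boundary id is the lowest one for that second, $gt will also return documents inserted during that same second.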
> db.foo.find()
{ "_id" : ObjectId("514bf8bbbe11e483111af213"), "Name" : "abc", "Insert_time" : ISODate("2013-03-22T06:22:51.422Z") }
{ "_id" : ObjectId("514bf8c5be11e483111af214"), "Name" : "xyz", "Insert_time" : ISODate("2013-03-22T06:23:01.310Z") }
{ "_id" : ObjectId("514bf8cebe11e483111af215"), "Name" : "pqr", "Insert_time" : ISODate("2013-03-22T06:23:10.006Z") }
{ "_id" : ObjectId("514bf8eabe11e483111af216"), "Name" : "ijk", "Insert_time" : ISODate("2013-03-22T06:23:38.410Z") }
>
Here my Insert_time corresponds to the time the document was inserted, and the following query will give you the documents after a particular Insert_time:
> db.foo.find({Insert_time:{$gt:ISODate("2013-03-22T06:22:51.422Z")}})
{ "_id" : ObjectId("514bf8c5be11e483111af214"), "Name" : "xyz", "Insert_time" : ISODate("2013-03-22T06:23:01.310Z") }
{ "_id" : ObjectId("514bf8cebe11e483111af215"), "Name" : "pqr", "Insert_time" : ISODate("2013-03-22T06:23:10.006Z") }
{ "_id" : ObjectId("514bf8eabe11e483111af216"), "Name" : "ijk", "Insert_time" : ISODate("2013-03-22T06:23:38.410Z") }
>