MongoDB: Several fields to a list

I currently have a collection that follows a format like this:
{ "_id": ObjectId(...),
"name" : "Name",
"red": 0,
"blue": 0,
"yellow": 1,
"green": 0,
...}
and so on (a bunch of colors). What I would like to do is to create a new array named colors, whose elements are those colors that have a value of 1.
For example:
{ "_id": ObjectId(...),
"name" : "Name",
"colors": ["yellow"]
}
Is this something I can do on the Mongo shell? Or should I do it in a program?
I'm pretty sure I can do it using Python, however I am having difficulties trying to do it directly in the shell. If it can be done in the shell, can anyone point me in the right direction?
Thanks.

Yes, this can easily be done in the shell, and the same example can be adapted to basically any language.
The key here is to look at the fields that represent "colors", then construct an update statement that removes those fields from the document, tests each one to see if it qualifies for inclusion in the array, and adds the resulting array to the document update as well:
var bulk = db.collection.initializeOrderedBulkOp(),
    count = 0;

db.collection.find().forEach(function(doc) {
    // Keep any colors that were already stored
    doc.colors = doc.colors || [];

    var update = { "$unset": {} };

    // Every key other than _id, name and colors is a candidate color field
    Object.keys(doc).filter(function(key) {
        return !/^(_id|name|colors)$/.test(key);
    }).forEach(function(key) {
        update.$unset[key] = "";        // remove the flat color field
        if ( doc[key] == 1 )
            doc.colors.push(key);       // a value of 1 means the color is set
    });

    update["$addToSet"] = { "colors": { "$each": doc.colors } };

    bulk.find({ "_id": doc._id }).updateOne(update);
    count++;

    // Send to the server in batches of 1000
    if ( count % 1000 == 0 ) {
        bulk.execute();
        bulk = db.collection.initializeOrderedBulkOp();
    }
});

// Flush any remaining operations
if ( count % 1000 != 0 )
    bulk.execute();
The Bulk Operations usage means that batches of updates are sent rather than one request and response per document, so this will process a lot faster than merely issuing singular updates back and forth.
The main operators here are $unset to remove the existing fields and $addToSet to add the newly evaluated array. Both are built up by cycling through the keys of the document that make up the possible colors, excluding the keys you don't want to modify via a regex filter.
Also, $addToSet and this line:
doc.colors = doc.colors || [];
are used to make sure that if any document was already partially converted, or had been touched by a code change that had already started storing the correct array, it will not be adversely affected or overwritten by the update process.
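For the sample document above, the update statement the loop builds would look roughly like this (a sketch; the exact list of $unset fields depends on which color fields each document actually has):
bulk.find({ "_id": ObjectId(...) }).updateOne({
    "$unset": { "red": "", "blue": "", "yellow": "", "green": "" },
    "$addToSet": { "colors": { "$each": [ "yellow" ] } }
});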

tl;dr, spoiler
The MongoDB shell exposes JavaScript methods on its objects. You can query your collection with db.yourCollectionName.find(), which returns a cursor (see the cursor methods). Iterate it to get each document, iterate over the document's keys, filter out keys like _id and name, and for each remaining key check whether the value is 1; if so, collect that key into an array.
Once done, you'd probably want to use db.yourCollectionName.update() or db.yourCollectionName.findAndModify() to find the record by _id and use $set to add a new field whose value is the collected keys.
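A minimal sketch of that approach in the shell might look like the following (the collection name and the excluded fields are taken from the example above; adjust them to your data):
db.yourCollectionName.find().forEach(function(doc) {
    var colors = [];
    Object.keys(doc).forEach(function(key) {
        // skip non-color fields and collect keys whose value is 1
        if (key !== "_id" && key !== "name" && doc[key] === 1)
            colors.push(key);
    });
    db.yourCollectionName.update(
        { "_id": doc._id },
        { "$set": { "colors": colors } }
    );
});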

Related

Elegantly return only Subdocuments satisfying elemMatch in MongoDb aggregation result [duplicate]

I've tried several ways of creating an aggregation pipeline which returns just the matching entries from a document's embedded array and not found any practical way to do this.
Is there some MongoDB feature which would avoid my very clumsy and error-prone approach?
A document in the 'workshop' collection looks like this...
{
    "_id": ObjectId("57064a294a54b66c1f961aca"),
    "type": "normal",
    "version": "v1.4.5",
    "invitations": [],
    "groups": [
        {
            "_id": ObjectId("57064a294a54b66c1f961acb"),
            "role": "facilitator"
        },
        {
            "_id": ObjectId("57064a294a54b66c1f961acc"),
            "role": "contributor"
        },
        {
            "_id": ObjectId("57064a294a54b66c1f961acd"),
            "role": "broadcaster"
        },
        {
            "_id": ObjectId("57064a294a54b66c1f961acf"),
            "role": "facilitator"
        }
    ]
}
Each entry in the groups array provides a unique ID so that a group member is assigned the given role in the workshop when they hit a URL with that salted ID.
Given an _id matching an entry in the groups array, like ObjectId("57064a294a54b66c1f961acb"), I need to return a single record like this from the aggregation pipeline - basically returning only the matching entry from the embedded groups array.
{
    "_id": ObjectId("57064a294a54b66c1f961acb"),
    "role": "facilitator",
    "workshopId": ObjectId("57064a294a54b66c1f961aca")
}
In this example, the workshopId has been added as an extra field to identify the parent document, but the rest should be ALL the fields from the original group entry having the matching _id.
The approach I have adopted can just about achieve this but has lots of problems and is probably inefficient (with repetition of the filter clause).
return workshopCollection.aggregate([
    { $match: { groups: { $elemMatch: { _id: groupId } } } },
    { $unwind: "$groups" },
    { $match: { "groups._id": groupId } },
    { $project: {
        _id: "$groups._id",
        role: "$groups.role",
        workshopId: "$_id"
    }}
]).toArray();
Worse, since it explicitly includes named fields from the entry, it will omit any future fields which are added to the records. I also can't generalise this lookup operation to the case of 'invitations' or other embedded named arrays unless I can know what the array entries' fields are in advance.
I have wondered if using the $ or $elemMatch operators within a $project stage of the pipeline is the right approach, but so far they have either been ignored or have triggered operator validity errors when running the pipeline.
QUESTION
Is there another aggregation operator or alternative approach which would help me with this fairly mainstream problem - to return only the matching entries from a document's array?
The implementation below can handle arbitrary queries, serves results as a 'top-level document' and avoids duplicate filtering in the pipeline.
function retrieveArrayEntry(collection, arrayName, itemMatch) {
    var match = {};
    match[arrayName] = { $elemMatch: itemMatch };
    var project = {};
    project[arrayName + ".$"] = true;
    return collection.findOne(match, project).then(function(doc) {
        if (doc !== null) {
            var result = doc[arrayName][0];
            result._docId = doc._id;
            return result;
        } else {
            return null;
        }
    });
}
It can be invoked like so...
retrieveArrayEntry(workshopCollection, "groups", {_id:ObjectId("57064a294a54b66c1f961acb")})
However, it relies on the collection findOne(...) method instead of aggregate(...) so will be limited to serving the first matching array entry from the first matching document. Projections referencing an array match clause are apparently not possible through aggregate(...) in the same way they are through findXXX() methods.
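For the sample workshop document, the call above would be expected to resolve to something roughly like this (note that the helper attaches the parent document's id as _docId rather than workshopId):
{
    "_id": ObjectId("57064a294a54b66c1f961acb"),
    "role": "facilitator",
    "_docId": ObjectId("57064a294a54b66c1f961aca")
}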
A still more general (but more confusing and less efficient) implementation allows retrieval of multiple matching documents and subdocuments. It works around MongoDB's inconsistent syntax for document versus subdocument matching through the unpackMatch method, so that an incorrect 'equality' criterion, e.g. ...
{greetings:{_id:ObjectId("437908743")}}
...gets translated into the required syntax for a 'match' criterion (as discussed at Within a mongodb $match, how to test for field MATCHING, rather than field EQUALLING)...
{"greetings._id":ObjectId("437908743")}
Leading to the following implementation...
function unpackMatch(pathPrefix, match) {
    var unpacked = {};
    Object.keys(match).forEach(function(key) {
        unpacked[pathPrefix + "." + key] = match[key];
    });
    return unpacked;
}

function retrieveArrayEntries(collection, arrayName, itemMatch) {
    var matchDocs = {},
        projectItems = {},
        unwindItems = {},
        matchUnwoundByMap = {};
    matchDocs.$match = {};
    matchDocs.$match[arrayName] = { $elemMatch: itemMatch };
    projectItems.$project = {};
    projectItems.$project[arrayName] = true;
    unwindItems.$unwind = "$" + arrayName;
    matchUnwoundByMap.$match = unpackMatch(arrayName, itemMatch);
    return collection.aggregate([matchDocs, projectItems, unwindItems, matchUnwoundByMap])
        .toArray()
        .then(function(docs) {
            return docs.map(function(doc) {
                var result = doc[arrayName];
                result._docId = doc._id;
                return result;
            });
        });
}
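It can be invoked in the same way as the earlier helper; for example (a sketch, using the group _id from the question):
retrieveArrayEntries(workshopCollection, "groups", {
    _id: ObjectId("57064a294a54b66c1f961acb")
}).then(function(entries) {
    // each entry is a matching group subdocument with _docId added
    console.log(entries);
});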

how to drop duplicate embedded document

I have a users collection containing many lists of subdocuments. The schema is something like this:
{
    _id: ObjectId(),
    name: "aaa",
    age: 20,
    transactions: [
        {
            trans_id: 1,
            product: "mobile",
            price: 30
        },
        {
            trans_id: 2,
            product: "tv",
            price: 10
        },
        ...
    ],
    ...
}
So I have one doubt: trans_id in the transactions list is unique over all the products, but it is possible that I copied the same transaction again with the same trans_id (due to bad ETL programming). Now I want to drop those duplicate subdocuments. I have indexed trans_id, though not as unique. I read about the dropDups option, but will it delete a particular duplicate that exists in the DB, or will it drop the whole document (which I definitely don't want)? If not, how do I do it?
PS: I am using MongoDB 2.6.6 version.
The nearest case to what is presented here is that you need a way of defining the "distinct" items within the array, where some items are in fact an "exact copy" of other items in the array.
The best case is to use $addToSet along with the $each modifier within a looping operation for the collection. Ideally you use the Bulk Operations API to take advantage of the reduced traffic when doing so:
var bulk = db.collection.initializeOrderedBulkOp();
var count = 0;

// Read the docs
db.collection.find({}).forEach(function(doc) {
    // Blank the array
    bulk.find({ "_id": doc._id })
        .updateOne({ "$set": { "transactions": [] } });
    // Resend as a "set"
    bulk.find({ "_id": doc._id })
        .updateOne({
            "$addToSet": {
                "transactions": { "$each": doc.transactions }
            }
        });
    count++;
    // Execute once every 500 documents ( actually 1000 statements )
    if ( count % 500 == 0 ) {
        bulk.execute();
        bulk = db.collection.initializeOrderedBulkOp();
    }
});

// If there is a remainder then execute the remaining stack
if ( count % 500 != 0 )
    bulk.execute();
So as long as the "duplicate" content is "entirely the same", this approach will work. If the only thing that is actually "duplicated" is the "trans_id" field, then you need an entirely different approach, since none of the "whole documents" are duplicated; that means you need more logic in place to do this.
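That extra logic might look roughly like the sketch below, which keeps the first occurrence of each trans_id within a document and rewrites the array (a sketch only; it assumes trans_id is the sole duplicate key and simply overwrites the transactions array with $set):
db.collection.find({}).forEach(function(doc) {
    var seen = {};
    var deduped = doc.transactions.filter(function(tx) {
        if (seen[tx.trans_id]) return false;   // already kept one with this trans_id
        seen[tx.trans_id] = true;
        return true;
    });
    if (deduped.length !== doc.transactions.length) {
        db.collection.update(
            { "_id": doc._id },
            { "$set": { "transactions": deduped } }
        );
    }
});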

How to store an ordered set of documents in MongoDB without using a capped collection

What's a good way to store a set of documents in MongoDB where order is important? I need to easily insert documents at an arbitrary position and possibly reorder them later.
I could assign each item an increasing number and sort by that, or I could sort by _id, but I don't know how I could then insert another document in between other documents. Say I want to insert something between an element with a sequence of 5 and an element with a sequence of 6?
My first guess would be to increment the sequence of all of the following elements so that there would be space for the new element using a query something like db.items.update({"sequence":{$gte:6}}, {$inc:{"sequence":1}}). My limited understanding of Database Administration tells me that a query like that would be slow and generally a bad idea, but I'm happy to be corrected.
I guess I could set the new element's sequence to 5.5, but I think that would get messy rather quickly. (Again, correct me if I'm wrong.)
I could use a capped collection, which has a guaranteed order, but then I'd run into issues if I needed to grow the collection. (Yet again, I might be wrong about that one too.)
I could have each document contain a reference to the next document, but that would require a query for each item in the list. (You'd get an item, push it onto the results array, and get another item based on the next field of the current item.) Aside from the obvious performance issues, I would also not be able to pass a sorted mongo cursor to my {#each} spacebars block expression and let it live update as the database changed. (I'm using the Meteor full-stack javascript framework.)
I know that everything has its advantages and disadvantages, and I might just have to use one of the options listed above, but I'd like to know if there is a better way to do things.
Based on your requirement, one approach could be to design your schema in such a way that each document can hold more than one document and in itself act as a capped container.
{
    "_id": Number,
    "doc": Array
}
Each document in the collection will act as a capped container, and the documents will be stored as an array in the doc field. The doc field, being an array, will maintain the order of insertion.
You can limit the number of documents to n. So the _id field of each container document will be incremented by n, indicating the number of documents a container document can hold.
By doing this you avoid adding extra fields to the document, extra indices, and unnecessary sorts.
Inserting the very first record
i.e when the collection is empty.
var record = {"name" : "first"};
db.col.insert({"_id":0,"doc":[record]});
Inserting subsequent records
Identify the last container document's _id and the number of documents it holds. If the number of documents it holds is less than n, then update the container document with the new document; otherwise, create a new container document.
Say that each container document can hold 5 documents at most, and we want to insert a new document.
var record = {"name" : "newlyAdded"};
// using aggregation, get the _id of the last inserted container and the
// number of records it currently holds.
db.col.aggregate([
    {
        $group: {
            "_id": null,
            "max": { $max: "$_id" },
            "lastDocSize": { $last: "$doc" }
        }
    },
    {
        $project: {
            "currentMaxId": "$max",
            "capSize": { $size: "$lastDocSize" },
            "_id": 0
        }
    }
// once obtained, check if you need to update the last container or
// create a new container and insert the document in it.
]).forEach(function(check) {
    if (check.capSize < 5) {
        print("updating");
        // UPDATE
        db.col.update(
            { "_id": check.currentMaxId },
            { $push: { "doc": record } }
        );
    } else {
        print("inserting");
        // INSERT
        db.col.insert({
            "_id": check.currentMaxId + 5,
            "doc": [ record ]
        });
    }
});
Note that the aggregation runs on the server side and is very efficient. Also note that in versions prior to 2.6 the aggregation returns a single document rather than a cursor, so you would need to modify the above code to select from that single document rather than iterating a cursor.
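On those older versions the modification might look roughly like this (a sketch; pre-2.6 shell aggregation returns its results in a "result" array on a single document):
var agg = db.col.aggregate([ /* same $group and $project stages as above */ ]);
var check = agg.result[0];

if (check.capSize < 5) {
    db.col.update({ "_id": check.currentMaxId }, { $push: { "doc": record } });
} else {
    db.col.insert({ "_id": check.currentMaxId + 5, "doc": [ record ] });
}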
Inserting a new document in between documents
Now, if you would like to insert a new document between documents 1 and 2, we know that the document should fall inside the container with _id=0 and should be placed in the second position in the doc array of that container.
so, we make use of the $each and $position operators for inserting into specific positions.
var record = {"name" : "insertInMiddle"};
db.col.update(
    { "_id": 0 },
    {
        $push: {
            "doc": {
                $each: [ record ],
                $position: 1
            }
        }
    }
);
Handling Overflow
Now we need to take care of documents overflowing in each container. Say we insert a new document in between, in the container with _id=0. If the container already has 5 documents, we need to move its last document to the next container, and do so until all the containers hold documents within their capacity; if required, at the end we need to create a new container to hold the overflowing documents.
This complex operation should be done on the server side. To handle it, we can create a script such as the one below and register it with MongoDB.
db.system.js.save({
    "_id": "handleOverFlow",
    "value": function handleOverFlow(id) {
        var currDocArr = db.col.find({ "_id": id })[0].doc;
        print(currDocArr);
        var count = currDocArr.length;
        var nextColId = id + 5;
        // check if the container size has exceeded its capacity
        if (count <= 5)
            return;
        else {
            // need to take the last doc and push it to the next capped
            // container's array
            print("updating collection: " + id);
            var record = currDocArr.splice(currDocArr.length - 1, 1);
            // update the next container
            db.col.update(
                { "_id": nextColId },
                { $push: { "doc": { $each: record, $position: 0 } } }
            );
            // write back the original container without the moved document
            db.col.update(
                { "_id": id },
                { "doc": currDocArr }
            );
            // check overflow for the subsequent containers, recursively.
            handleOverFlow(nextColId);
        }
    }
});
So after every insertion in between, we can invoke this function by passing the container id: handleOverFlow(containerId).
Fetching all the records in order
Just use the $unwind operator in the aggregate pipeline.
db.col.aggregate([{$unwind:"$doc"},{$project:{"_id":0,"doc":1}}]);
Re-Ordering Documents
You can store each document in a capped container with an "_id" field:
.."doc":[{"_id":0,","name":"xyz",...}..]..
Get hold of the "doc" array of the capped container whose items you want to reorder:
var docArray = db.col.find({"_id": 0})[0].doc;
Update their _ids so that after sorting the order of the items will change, then sort the array based on their _ids:
docArray.sort( function(a, b) {
return a._id - b._id;
});
Update the capped container back with the new doc array.
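That final write-back might look like this (a sketch, reusing the container with _id 0 from the examples above):
db.col.update(
    { "_id": 0 },
    { "$set": { "doc": docArray } }
);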
But then again, everything boils down to which approach is feasible and suits your requirement best.
Coming to your questions:
What's a good way to store a set of documents in MongoDB where order is important? I need to easily insert documents at an arbitrary position and possibly reorder them later.
Documents as Arrays.
Say I want to insert something between an element with a sequence of 5 and an element with a sequence of 6?
use the $each and $position operators in the db.collection.update() function as depicted in my answer.
My limited understanding of Database Administration tells me that a query like that would be slow and generally a bad idea, but I'm happy to be corrected.
Yes. It would impact performance, unless the collection has very little data.
I could use a capped collection, which has a guaranteed order, but then I'd run into issues if I needed to grow the collection. (Yet again, I might be wrong about that one too.)
Yes. With Capped Collections, you may lose data.
An _id field in MongoDB is a unique, indexed key similar to a primary key in relational databases. If there is an inherent order in your documents, ideally you should be able to associate a unique key to each document, with the key value reflecting the order. So while preparing your document for insertion, explicitly add an _id field as this key (if you do not, mongo creates it automatically with a BSON objectid).
As far as retrieving the results is concerned, MongoDB does not guarantee the order of returned documents unless you explicitly use .sort(). If you do not use .sort(), the results are usually returned in natural order (order of insertion). Again, there is no guarantee on this behavior.
I'd advise you to override _id with your order while inserting, and use a sort while retrieving. Since _id is a necessary and auto-indexed entity, you will not be wasting any space defining a sort key, and storing the index for it.
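A minimal sketch of that idea (the _id values here are just illustrative sequence numbers):
// insert with an explicit _id that encodes the desired order
db.items.insert({ "_id": 1, "name": "first" });
db.items.insert({ "_id": 2, "name": "second" });

// retrieve in order by sorting on _id, which is already indexed
db.items.find().sort({ "_id": 1 });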
For arbitrary sorting of any collection, you'll need a field to sort it on. I call mine "sequence".
schema:
{
_id: ObjectID,
sequence: Number,
...
}
db.items.ensureIndex({sequence:1});
db.items.find().sort({sequence:1})
Here is a link to some general sorting database answers that may be relevant:
https://softwareengineering.stackexchange.com/questions/195308/storing-a-re-orderable-list-in-a-database/369754
I suggest going with the floating-point solution - adding a position field:
Use a floating-point number for the position field.
You can then reorder the list by changing only the position value in the "moved" document.
If your user wants to position "red" after "blue" but before "yellow", then you just need to calculate
red.position = ((yellow.position - blue.position) / 2) + blue.position
After a few re-positions in the same place (cutting the gap in half every time) you may hit a precision wall, so it's better to re-sort and renumber the list once you reach a certain threshold.
When retrieving, you can simply sort on the position field to get the list in order, with no need for any client-side code (as there would be with a linked-list solution).
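A shell sketch of that calculation (the collection and field names are illustrative):
// look up the two neighbours
var blue   = db.items.findOne({ "name": "blue" });    // e.g. position: 2.0
var yellow = db.items.findOne({ "name": "yellow" });  // e.g. position: 3.0

// place "red" halfway between them
var newPos = ((yellow.position - blue.position) / 2) + blue.position;
db.items.update({ "name": "red" }, { "$set": { "position": newPos } });

// read the list back in order
db.items.find().sort({ "position": 1 });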

Can MongoDB aggregate "top x" results in this document schema?

{
    "_id": "user1_20130822",
    "metadata": {
        "date": ISODate("2013-08-22T00:00:00.000Z"),
        "username": "user1"
    },
    "tags": {
        "abc": 19,
        "123": 2,
        "bca": 64,
        "xyz": 14,
        "zyx": 12,
        "321": 7
    }
}
Given the schema example above, is there a way to query this to retrieve the top "x" tags: E.g., Top 3 "tags" sorted descending?
Is this possible in a single document? e.g., top tags for a user on a given day
What if i have multiple documents that need to be combined together before getting the top? e.g., top tags for a user in a given month
I know this can be done by using a "document per user per tag per day" or by making "tags" an array, but I'd like to be able to do this as above, as it makes in place $inc's easier (many more of these happening than reads).
Or do I need to return back the whole document, and defer to the client on the sorting/limiting?
When you use object keys as tag names, you make this kind of reporting very difficult. The aggregation framework has no $unwind equivalent for objects, but there is always MapReduce.
Have your map function emit one document for each key/value pair in the tags subdocument. It should look something like this:
var mapFunction = function() {
    for (var key in this.tags) {
        emit(key, this.tags[key]);
    }
};
Your reduce-function would then sum up the values emitted for the same key.
var reduceFunction = function(key, values) {
    var sum = 0;
    for (var i = 0; i < values.length; i++) {
        sum += values[i];
    }
    return sum;
};
The complete MapReduce command would look something like this:
db.runCommand({
    mapReduce: "yourcollection",       // the collection where your data is stored
    query: { _id: "user1_20130822" },  // or however you want to limit the results
    map: mapFunction,
    reduce: reduceFunction,
    out: { inline: 1 }                 // means that the output is returned directly
});
This will return all tags in unpredictable order. MapReduce has a sort and a limit option, but these only work on a field which has an index in the original collection, so you can't use them on a computed field. To get only the top 3, you would have to sort the results at the application level. If you insist on doing the sorting and limiting in the database, define an output collection to store the mapReduce results (with the out option set to out: { replace: "temporaryCollectionName" }) and then query that collection with sort and limit afterwards.
Keep in mind that when you use an intermediate collection, you must make sure that no two users run MapReduce jobs with different queries into the same collection. When you have multiple users who want to view your top-3 list, you could let them query the output collection and run the MapReduce in the background at regular intervals.
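The output-collection variant might look roughly like this (the collection name topTagsTemp and the username filter are just illustrative):
db.runCommand({
    mapReduce: "yourcollection",
    query: { "metadata.username": "user1" },   // e.g. all documents for one user
    map: mapFunction,
    reduce: reduceFunction,
    out: { replace: "topTagsTemp" }            // write the results to a collection
});

// then query the results, sorted by total and limited to the top 3
db.topTagsTemp.find().sort({ value: -1 }).limit(3);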

How to remove duplicates based on a key in Mongodb?

I have a collection in MongoDB with around 3 million records. My sample record would look like:
{ "_id" = ObjectId("50731xxxxxxxxxxxxxxxxxxxx"),
"source_references" : [
"_id" : ObjectId("5045xxxxxxxxxxxxxx"),
"name" : "xxx",
"key" : 123
]
}
I have a lot of duplicate records in the collection having the same source_references.key. (By duplicate I mean the same source_references.key, not the same _id.)
I want to remove duplicate records based on source_references.key; I'm thinking of writing some PHP code to traverse each record and remove it if a duplicate exists.
Is there a way to remove the duplicates from the Mongo command line itself?
This answer is obsolete: the dropDups option was removed in MongoDB 3.0, so a different approach will be required in most cases. For example, you could use aggregation as suggested in: MongoDB duplicate documents even after adding unique key.
If you are certain that the source_references.key identifies duplicate records, you can ensure a unique index with the dropDups:true index creation option in MongoDB 2.6 or older:
db.things.ensureIndex({'source_references.key' : 1}, {unique : true, dropDups : true})
This will keep the first unique document for each source_references.key value, and drop any subsequent documents that would otherwise cause a duplicate key violation.
Important Note: Any documents missing the source_references.key field will be considered as having a null value, so subsequent documents missing the key field will be deleted. You can add the sparse:true index creation option so the index only applies to documents with a source_references.key field.
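For example, the sparse variant of the same index would look like this:
db.things.ensureIndex(
    { 'source_references.key': 1 },
    { unique: true, dropDups: true, sparse: true }
)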
Obvious caution: Take a backup of your database, and try this in a staging environment first if you are concerned about unintended data loss.
This is the easiest query I used on my MongoDB 3.2
db.myCollection.find({}, { myCustomKey: 1 }).sort({ _id: 1 }).forEach(function(doc) {
    db.myCollection.remove({ _id: { $gt: doc._id }, myCustomKey: doc.myCustomKey });
});
Index your customKey before running this to increase speed
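For example, creating that index beforehand:
db.myCollection.createIndex({ myCustomKey: 1 })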
While #Stennie's is a valid answer, it is not the only way. In fact, the MongoDB manual asks you to be very cautious while doing that. There are two other options:
Let MongoDB do it for you using Map Reduce
Another way
Do it programmatically, which is less efficient
Here is a slightly more 'manual' way of doing it:
Essentially, first get a list of all the unique keys you are interested in.
Then perform a search using each of those keys, and delete all but one document when that search returns more than one.
db.collection.distinct("key").forEach((num) => {
    var i = 0;
    db.collection.find({ key: num }).forEach((doc) => {
        if (i) db.collection.remove({ key: num }, { justOne: true });
        i++;
    });
});
I had a similar requirement but I wanted to retain the latest entry. The following query worked with my collection which had millions of records and duplicates.
/** Create an array to store all duplicate record ids */
var duplicates = [];

/** Start the aggregation pipeline */
db.collection.aggregate([
    {
        $match: { /** Add any filter here. Add an index for the filter keys */
            filterKey: {
                $exists: false
            }
        }
    },
    {
        $sort: { /** Sort it in such a way that you retain the element you want to keep first */
            createdAt: -1
        }
    },
    {
        $group: {
            _id: {
                key1: "$key1", key2: "$key2" /** These are the keys which define a duplicate. Here documents with the same value for key1 and key2 are considered duplicates */
            },
            dups: {
                $push: {
                    _id: "$_id"
                }
            },
            count: {
                $sum: 1
            }
        }
    },
    {
        $match: {
            count: {
                "$gt": 1
            }
        }
    }
],
{
    allowDiskUse: true
}).forEach(function(doc) {
    doc.dups.shift();
    doc.dups.forEach(function(dupId) {
        duplicates.push(dupId._id);
    });
});

/** Delete the duplicates in chunks */
var i, j, temparray, chunk = 100000;
for (i = 0, j = duplicates.length; i < j; i += chunk) {
    temparray = duplicates.slice(i, i + chunk);
    db.collection.bulkWrite([{ deleteMany: { "filter": { "_id": { "$in": temparray } } } }]);
}
Expanding on Fernando's answer, I found that it was taking too long, so I modified it.
var x = 0;
db.collection.distinct("field").forEach(fieldValue => {
    var i = 0;
    db.collection.find({ "field": fieldValue }).forEach(doc => {
        if (i) {
            db.collection.remove({ _id: doc._id });
        }
        i++;
        x += 1;
        if (x % 100 === 0) {
            print(x); // Every time we process 100 docs.
        }
    });
});
The improvement is basically using the document id for removal, which should be faster, and also printing the progress of the operation; you can change the iteration value to your desired amount.
Also, indexing the field before the operation helps.
pip install mongo_remove_duplicate_indexes
Create a script in any language and iterate over your collection.
Create a new collection and create a new unique index in it; remember, this index has to be the same as the index you wish to remove duplicates from in your original collection, with the same name.
For example, you have a collection gaming, and in this collection you have a field genre which contains duplicates that you wish to remove, so just create a new collection:
db.createCollection("cname")
Create the new index:
db.cname.createIndex({'genre': 1}, {unique: true})
Now when you insert documents with a similar genre, only the first will be accepted; the others will be rejected with a duplicate key error.
Now just insert the JSON-format values you received into the new collection and handle the exceptions using exception handling, for example pymongo.errors.DuplicateKeyError.
Check out the source code of the mongo_remove_duplicate_indexes package for a better understanding.
If you have enough memory, you can do something like this in Scala:
cole.find().groupBy(_.customField).filter(_._2.size > 1).map(_._2.tail).flatten.map(_.id)
    .foreach(x => cole.remove({id $eq x}))