Mongo Oplog - Extracting Specifics of Updates - mongodb

Say I have an entry in the inventory collection that looks like
{ _id: 1, item: "polarizing_filter", tags: [ "electronics", "camera" ]}
and I issue the command
db.inventory.update(
    { _id: 1 },
    { $addToSet: { tags: "accessories" } }
)
I have an oplog tailer, and would like to know that, specifically, "accessories" has been added to this document's tags field. As far as I can tell, the oplog always normalizes commands to use $set and $unset to maintain idempotency. In this case, the field of the entry describing the update would show something like
{$set : { tags : ["electronics", "camera", "accessories"] } }
which makes it impossible to know which tags were actually added by this update. Is there any way to do this? I'm also curious about the analogous case in which fields are modified through deletion, e.g. through $pull. Solutions outside the realm of an oplog tailer are welcome, as well as pointers to documentation of this command normalization process - I can't find it.
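For what it's worth, the only approach I can think of is diffing the normalized $set value against a locally cached copy of the field, along the lines of the rough sketch below (which assumes the classic oplog update entry layout { op: 'u', o2: { _id: ... }, o: { $set: ... } }):
// Rough sketch only: recover added/removed tags by diffing the normalized
// $set value against the last tags array this tailer saw for the document.
var lastSeenTags = {}; // _id -> last known tags array

function onOplogEntry(entry) {
    if (entry.op !== 'u' || !entry.o.$set || !entry.o.$set.tags) return;
    var id = entry.o2._id;
    var newTags = entry.o.$set.tags;
    var oldTags = lastSeenTags[id] || [];
    var added = newTags.filter(function (t) { return oldTags.indexOf(t) === -1; });
    var removed = oldTags.filter(function (t) { return newTags.indexOf(t) === -1; });
    lastSeenTags[id] = newTags;
    printjson({ _id: id, added: added, removed: removed });
}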
Thanks!

Related

mongodb update an existing field without overwriting

{
    questions: {
        q1: ""
    }
}
result after updating:
{
    questions: {
        q1: "",
        q2: ""
    }
}
I want to add q2 inside questions without overwriting what's already inside it (q1).
One solution I found is to fetch the whole document, modify it on my backend, and send the whole document back to replace the current one. But that seems really inefficient, as I have other fields in the document as well.
Is there a query that does this more efficiently? I looked at the MongoDB docs and didn't seem to find one.
As turivishal said, you can use $set with dot notation, roughly like this (the filter and collection name below are placeholders):
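db.collection.updateOne(
    { "questions.q1": "" },              // filter is just a placeholder here
    { $set: { "questions.q2": "" } }     // dot notation adds q2 without touching q1
)
Dot notation in $set only creates the new nested field; the sibling q1 is left untouched.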
But you also can use $addFields in this way:
db.collection.aggregate([
    {
        "$match": {
            "questions.q1": "q1"
        }
    },
    {
        "$addFields": {
            "questions.q2": "q2"
        }
    }
])
Also, reading your comment where you say Cannot create field 'q2' in element {questions: "q1"}, it seems the schema in your question is not the same as what you actually have in your DB.
Your schema says that questions is an object with a field q1, but your error says that questions is the field and "q1" its value.
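In other words (hypothetical documents, just to illustrate the mismatch):
// Shape the update expects:
{ questions: { q1: "" } }
// Shape implied by the error message - questions holds the plain string "q1",
// so there is no sub-document to create q2 inside:
{ questions: "q1" }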

Mongo error 16996 during aggregation - too large document produced

I am parsing Wikipedia dumps in order to play with the link-oriented metadata. One of the collections is named articles and it is in the following form:
{
    _id: "Tree",
    id: "18955875",
    linksFrom: [
        {
            name: "Forest",
            count: 6
        },
        [...]
    ],
    categories: [
        "Trees",
        "Forest_ecology",
        [...]
    ]
}
The linksFrom field stores all articles this article points to, and how many times that happens. Next, I want to create another field linksTo with all the articles that point to this article. In the beginning, I went through the whole collection and updated every article, but since there's lots of them it takes too much time. I switched to aggregation for performance purposes and tried it on a smaller set - works like a charm and is super fast in comparison with the older method. The aggregation pipeline is as follows:
db.runCommand(
    {
        aggregate: "articles",
        pipeline: [
            {
                $unwind: "$linksFrom"
            },
            {
                $sort: { "linksFrom.count": -1 }
            },
            {
                $project: {
                    name: "$_id",
                    linksFrom: "$linksFrom"
                }
            },
            {
                $group: {
                    _id: "$linksFrom.name",
                    linksTo: { $push: { name: "$name", count: { $sum: "$linksFrom.count" } } }
                }
            },
            {
                $out: "TEMPORARY"
            }
        ],
        allowDiskUse: true
    }
)
However, on a large dataset being the english Wikipedia I get the following error after a few minutes:
{
    "ok" : 0,
    "errmsg" : "insert for $out failed: { connectionId: 24, err: \"BSONObj size: 24535193 (0x1766099) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: \"United_States\"\", code: 10334, n: 0, ok: 1.0 }",
    "code" : 16996
}
I understand that too many articles link to the United_States article, so the corresponding document grows above the 16MB limit (currently almost 24MB). Unfortunately, I cannot even check if that's the case (error messages sometimes tend to lie)... Because of that, I'm trying to change the model so that the relationships between articles are stored with IDs rather than long names, but I'm afraid that might not be enough - especially because my plan is to merge the two collections for every article later...
The question is: does anyone have a better idea? I don't want to try to increase the limit, I'm rather thinking about a different approach of storing this data in the database.
UPDATE after comment by Markus
Markus is correct, I am using a SAX parser and, as a matter of fact, I'm already storing all the links in a similar way. Apart from articles I have three more collections - one with links and two others, labels and stemmed-labels. The first one stores all links that occur in the dump in the following way:
{
    _id: "tree",
    stemmedName: "tree",
    targetArticle: "Christmas_tree"
}
_id stores the text that is used to represent a given link, stemmedName represents stemmed _id and targetArticle marks what article this text pointed to. I'm in the middle of adding sourceArticle to this one, because it's obviously a good idea.
The second collection labels contains documents as follows:
{
    _id: "tree",
    targetArticles: [
        {
            name: "Christmas_tree",
            count: 1
        },
        {
            name: "Tree",
            count: 166
        },
        [...]
    ]
}
The third stemmed-labels is analogous to the labels with its _id being a stemmed version of the root label.
So far, the first collection links serves as a baseline for the two other collections. I group the labels together by their name so that I only do one lookup for every phrase and can then immediately get all target articles with one query. Then I use the articles and labels collections in order to (sketched below):
Look for a label with a given name.
Get all articles it might point to.
Compare the incoming and outgoing links for these articles.
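A rough sketch of that lookup, using the collection names above (the exact field handling is an assumption):
// One findOne per phrase: all candidate target articles come back at once.
var label = db.labels.findOne({ _id: "tree" });
label.targetArticles.forEach(function (t) {
    // Compare incoming and outgoing links for each candidate article.
    var article = db.articles.findOne({ _id: t.name });
    if (article) {
        printjson({ candidate: t.name, count: t.count, linksFrom: article.linksFrom });
    }
});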
This is where the main question comes in. I thought it better to store all possible articles for a given phrase in one document rather than leave them scattered in the links collection. Only now did it occur to me that - as long as the lookups are indexed - the overall performance might be the same for one big document as for many smaller ones! Is this a correct assumption?
I think your data model is wrong. It may well be (albeit a bit theoretical) that individual articles (let's stick with the wikipedia example) are linked more often than you could store in a document. Embedding only works with One-To(-Very)-Few™ relationships.
So basically, I think you should change your model. I will show you how I would do it.
I will use the mongo shell and JavaScript in this example, since it is the lingua franca. You might need to translate accordingly.
The questions
Let's begin with the questions you want to have answered:
For a given article, which other articles link to that article?
For a given article, to which other articles does that article link?
For a given article, how many articles link to it?
Optional: For a given article, to how many articles does it link?
The crawling
What I would do basically is to implement a SAX parser on the articles, creating a new document for each article link you encounter. The document itself should be rather simple:
{
    "_id": new ObjectId(),
    // optional, for recrawling or pointing out a given state
    "date": new ISODate(),
    "article": wikiUrl,
    "linksTo": otherWikiUrl
}
Note that you should not do an insert, but an upsert. The reason for this is that we do not want to document the number of links, but the articles linked to. If we did an insert, the same combination of article and linksTo could occur multiple times.
So our statement when encountering a link would look like this for example:
db.links.update(
    { "article": "HMS_Warrior_(1860)", "linksTo": "Royal_Navy" },
    { "date": new ISODate(), "article": "HMS_Warrior_(1860)", "linksTo": "Royal_Navy" },
    { upsert: true }
)
Answering the questions
As you might already guess, answering the questions becomes pretty straightforward now. I have used the following statements for creating a few documents:
db.links.update(
    { "article": "HMS_Warrior_(1860)", "linksTo": "Royal_Navy" },
    { "date": new ISODate(), "article": "HMS_Warrior_(1860)", "linksTo": "Royal_Navy" },
    { upsert: true }
)
db.links.update(
    { "article": "Royal_Navy", "linksTo": "Mutiny_on_the_Bounty" },
    { "date": new ISODate(), "article": "Royal_Navy", "linksTo": "Mutiny_on_the_Bounty" },
    { upsert: true }
)
db.links.update(
    { "article": "Mutiny_on_the_Bounty", "linksTo": "Royal_Navy" },
    { "date": new ISODate(), "article": "Mutiny_on_the_Bounty", "linksTo": "Royal_Navy" },
    { upsert: true }
)
For a given article, which other articles link to that article?
We found out that we should not use an aggregation, since that might exceed the size limit. But we don't have to. We simply use a cursor and gather the results:
var toLinks = []
var cursor = db.links.find({ "linksTo": "Royal_Navy" }, { "_id": 0, "article": 1 })
cursor.forEach(
    function (doc) {
        toLinks.push(doc.article);
    }
)
printjson(toLinks)
// Output: [ "HMS_Warrior_(1860)", "Mutiny_on_the_Bounty" ]
For a given article, to which other articles does that article link to?
This works pretty much like the first question – we basically only change the query:
var fromLinks = []
var cursor = db.links.find({ "article": "Royal_Navy" }, { "_id": 0, "linksTo": 1 })
cursor.forEach(
    function (doc) {
        fromLinks.push(doc.linksTo)
    }
)
printjson(fromLinks)
// Output: [ "Mutiny_on_the_Bounty" ]
For a given article, how many articles link to it?
It should be obvious that if you have already answered question 1, you could simply check toLinks.length. But let's assume you haven't. There are two other ways of doing this:
Using .count()
You can use this method on replica sets. On sharded clusters, this doesn't work well. But it is easy:
db.links.find({ "linksTo":"Royal_Navy" }).count()
// Output: 2
Using an aggregation
This works on any environment and isn't much more complicated:
db.links.aggregate([
    { "$match": { "linksTo": "Royal_Navy" }},
    { "$group": { "_id": "$linksTo", "isLinkedFrom": { "$sum": 1 }}}
])
// Output: { "_id" : "Royal_Navy", "isLinkedFrom" : 2 }
Optional: For a given article, to how many articles does it link to?
Again, you can answer this question by reading the length of the array from question 2 or by using the .count() method. The aggregation again is simple:
db.links.aggregate([
    { "$match": { "article": "Royal_Navy" }},
    { "$group": { "_id": "$article", "linksTo": { "$sum": 1 }}}
])
// Output: { "_id" : "Royal_Navy", "linksTo" : 1 }
Indices
As for the indices, I haven't really checked them, but individual indices on the fields are probably what you want:
db.links.createIndex({"article":1})
db.links.createIndex({"linksTo":1})
A compound index will not help much, since order matters and we do not always ask for the first field. So this is probably as optimized as it can get.
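If you wanted covered queries for both query shapes anyway, one option (my own suggestion, not benchmarked here) would be to pay the extra space for two compound indices instead:
// Each query shape gets an index whose first field matches its filter, and the
// projected field rides along so the query can be answered from the index alone.
db.links.createIndex({ "article": 1, "linksTo": 1 })
db.links.createIndex({ "linksTo": 1, "article": 1 })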
Conclusion
We are using an extremely simple, scalable model and rather simple queries and aggregations to answer the questions you have about the data.

Best practice for obtaining embedded document metadata?

So, my schema design requires that I use an embedded document format. While I recognize that what I'm about to ask could be made easier by redesigning the schema, the current design meets all of the other requirements in place so I'm doing my best to make it work.
Consider the following rudimentary schema:
{
    "_id" : "01234ABCD",
    "type" : "thing",
    "resources" : {
        foo : [
            {
                "herp" : "derp"
            }
        ],
        bar : [
            {
                "herp" : "derp"
            },
            {
                "derp" : "herp"
            }
        ]
    }
}
Obviously the value that corresponds to the "resources" key is an embedded document. I would like to be able to efficiently calculate the count of keys in that document, and derive results based upon tests of that value. It's important to note that the length and content of the embedded doc are unknown quantities - hence my reason for wanting to query this metadata. Being a complete js idiot, I've managed to cobble together the following query. For example, if I were to look for documents with more than 3 keys in the "resources" document:
db.coll.find({$where: function(){
    // Count the keys of the embedded resources document by iterating them.
    var total = 0;
    for(var i in this['resources']){
        ++total;
        if(total > 3){
            return true;
        }
    }
    return false;
}})
As I'm pretty new to Mongo and terrible at js, I feel like there may be a smarter way to do this. I'm also very curious to hear opinions on whether or not this goes against the Mongo ethos a bit by not pushing this processing to the client. Any feedback or criticism of this approach and implementation are most welcome.
Thanks for reading.
You can use an aggregate pipeline to assemble metadata about the docs and then filter on them.
db.coll.aggregate([
    {$project: {
        // Compute a total count of the keys in the resources docs
        keys: {$add: [{$size: '$resources.foo'}, {$size: '$resources.bar'}]},
        // Project the original doc
        doc: '$$ROOT'
    }},
    // Only include the docs that have more than 3 keys
    {$match: {keys: {$gt: 3}}}
])
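If the keys under resources are not fixed (the question implies they are unknown in advance), a variant worth considering on MongoDB 3.4.4+ is $objectToArray, which counts the keys without naming them - a sketch:
db.coll.aggregate([
    {$project: {
        // Turn the resources sub-document into an array of {k, v} pairs and
        // count the entries, i.e. the keys, without hard-coding their names
        keys: {$size: {$objectToArray: '$resources'}},
        doc: '$$ROOT'
    }},
    {$match: {keys: {$gt: 3}}}
])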

Maintaining total count of a set in another collection

I've got a simple scenario with two entities: posts and bumps (i.e. upvotes).
Example of a post:
{_id: 'happy_days', 'title': 'Happy days', text: '...', bumps: 2}
Example of a bump:
{_id: {user: 'jimmy', post: 'happy_days'}}
{_id: {user: 'hans', post: 'happy_days'}}
Question: how do I maintain a correct bump count in the post under all circumstances (and failures)?
The method I have come up with so far is:
To bump, upsert and check for existence. Only if inserted, increase the bump count.
To unbump, delete and check for existence. Only if deleted, decrease the bump count.
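In shell terms the bump step might look like this (a sketch; the collection names posts and bumps are my own):
// Upsert the bump; nUpserted tells us whether it was actually inserted.
var res = db.bumps.update(
    { _id: { user: 'jimmy', post: 'happy_days' } },
    { _id: { user: 'jimmy', post: 'happy_days' } },
    { upsert: true }
);
if (res.nUpserted === 1) {
    // Only a genuinely new bump increments the counter.
    db.posts.update({ _id: 'happy_days' }, { $inc: { bumps: 1 } });
}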
This fails if the app crashes between the two ops, and the only way to correct the bump stats is to query all documents in the bump collection and recalculate everything offline (i.e. there is no way to know which posts have an incorrect bump count).
I suggest that you stick with what you already have. The worst that can happen if there is a failover/connection issue between your two operations is that your bump count is wrong. So what? This is not the end of the world, and nobody is going to care too much whether a bump count is 812 or 813. You can always recreate the count anyway by checking how many bumps you have for each post with an aggregation query if something went wrong. Embrace eventual consistency!
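Such a recount is a one-liner with the aggregation framework, since the post name is embedded in each bump's _id (collection name bumps assumed):
// Recompute bump counts per post straight from the bump documents.
db.bumps.aggregate([
    { $group: { _id: '$_id.post', bumps: { $sum: 1 } } }
])
// e.g. { "_id" : "happy_days", "bumps" : 2 }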
As an alternative to updating the data in multiple places (which will probably be best for read performance, but as you noticed complicates updates), it may be worth considering storing the uids of the bumps in an array (here called bump_uids) directly on the post, and just counting the bumps when needed using the aggregation framework:
> db.test.aggregate([
    { $match: { _id: 'happy_days' } },
    { $project: { bump_uids: 1 } },
    { $unwind: '$bump_uids' },
    { $group: { _id: '$_id', bumps: { $sum: 1 } } }
])
{ "result" : [ { "_id" : "happy_days", "bumps" : 3 } ], "ok" : 1 }
Since MongoDB does not yet support triggers ( https://jira.mongodb.org/browse/SERVER-124 ) you have to do this the gritty way with application logic.
As a brief example:
// Record the relationship itself...
db.follower.insert({ fromId: u, toId: c });
// ...then bump the denormalized counters on both user documents.
db.user.update({ _id: u }, { $inc: { totalFollowing: 1 } });
db.user.update({ _id: c }, { $inc: { totalFollowers: 1 } });
Yes, it is not atomic, etc., however it is the way to do it. In reality, many systems update counters like this, whether in MongoDB or not.

Removing Non-Collection Embedded Document via Mongo Shell

So I've got an application that uses a lot of embedded documents, which is fine. However, I've noticed that some embedded documents aren't displayed when you run show collections in the Mongo shell.
Normally this isn't an issue, but whilst messing with setting embedded documents, I accidentally added an empty entry to one of them. I'd normally do something like db.collection.remove({_id: ObjectId('<OBJECT_ID>')}) to remove the entry, but since some of these aren't actual collections, I'm unable to do it like this.
I'm also not sure how I'd splice this out successfully while actually removing the document. I could splice it out of the entry, but I have a feeling it would leave that embedded document floating around somewhere in the DB.
Any ideas how to do this?
To give you an idea what I'm talking about, an example of the entry:
entry = {
    _id: ObjectId('blah blah blah'),
    name: {
        first: 'example',
        last: 'city'
    },
    log: [
        {
            _id: ObjectId('some id'),
            action: 'whatever',
            someField: 'etc.'
        },
        {
            _id: ObjectId('another id')
        },
        {
            _id: ObjectId('yet another id'),
            action: 'who cares',
            someField: 'data'
        }
    ]
}
If you want to remove one of the log entries, then you want the $pull operator.
The format would be something like:
db.collection.update(
    { _id: <id-of-document-to-update> },
    { $pull: { log: { _id: <id-of-log-entry-to-remove> } } }
)
This says: find the document with a certain _id, and remove from its log array the entry whose own _id matches. Note that $pull takes a condition on the array elements themselves, so the match on _id goes inside the log value rather than using dot notation.
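Applied to the example entry above, removing the middle log entry would look like this (keeping the placeholder ids from the question):
db.collection.update(
    { _id: ObjectId('blah blah blah') },
    { $pull: { log: { _id: ObjectId('another id') } } }
)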