What do the oplog fields actually mean?

I've seen this question posed before, but the answers were very vague. I have been doing some research on oplog, and am trying to understand exactly how it works. In particular, I want to have a good understanding of the fields in an oplog document and what data they store.
These are the fields I have found through tests and what I think they mean as well as what I am still unsure of:
ts: timestamp of the write operation / oplog entry
h: a unique identifier for the oplog entry (but why is it sometimes positive and sometimes negative?)
op: type of operation performed (usually i/u/d for insert, update or delete)
ns: database & collection affected
o: the new state of the document after performing the change
o2: Seems to contain the _id field of the document during an update operation. Why is this needed when that same field is present as part of the o field, which also contains the rest of the document?
b: Seems to be a bool that appears for delete operations. What is the significance of this field?
I would like to confirm whether the points above are accurate and to get clarification on the bits that aren't clear. I am also interested to know if there are any other fields that can appear in an oplog document.

h is a hash (signed Long)
ts is the internal timestamp format (the "\x11" type shown at bsonspec.org; search the API docs for your driver at api.mongodb.org for further information)
you are correct on op, ns, o, and o2
there's also a "v" field (I'd speculate that this is a version number, which would allow them to update the schema for the oplog).
b is True for all the delete operations I could find, so I can't provide any further information on it.
The best source of documentation I've found is this. It was a presentation by a company called Stripe at 2014's MongoDB World conference, and it includes some sample Ruby code.
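To make the field meanings above concrete, here is a minimal sketch that turns an oplog-style entry into a readable summary. It works on a plain Python dict shaped like the entries described above, not on a live replica set. The sample entry and the describe_oplog_entry helper are hypothetical, and real entries may carry other fields depending on the server version.

```python
# Map the single-letter "op" codes mentioned above to readable names.
OP_NAMES = {"i": "insert", "u": "update", "d": "delete"}


def describe_oplog_entry(entry):
    """Summarize an oplog-style dict (hypothetical helper, not a driver API)."""
    op = OP_NAMES.get(entry["op"], "other (" + entry["op"] + ")")
    # "ns" is "<database>.<collection>"; collection names may contain dots,
    # so split only on the first one.
    database, _, collection = entry["ns"].partition(".")
    # For updates, "o2" selects the target document (its _id) while "o"
    # describes the change; inserts and deletes carry the _id in "o".
    target = entry.get("o2", entry["o"]).get("_id")
    return f"{op} on {database}/{collection}, _id={target}"


# Hand-written sample entry; all values are made up for illustration.
# A real driver would return "ts" as a BSON Timestamp, not a tuple.
sample = {
    "ts": (1400000000, 1),
    "h": -1234567890123456789,
    "v": 2,
    "op": "u",
    "ns": "shop.orders",
    "o2": {"_id": 42},
    "o": {"$set": {"status": "shipped"}},
}
print(describe_oplog_entry(sample))  # update on shop/orders, _id=42
```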

Related

What is a secure way to prevent a user from updating specific fields in a MongoDB document?

I am trying to prevent users from updating certain fields in a mongodb object for which they are allowed to edit every other field. For instance the user should be able to edit/add/remove all fields except the field "permissions". My current approach is to test each key the user is trying to "$set" and see if it starts with the substring "permissions" (to cover dot notation). Example in python:
def sanitize_set(son):
    return {"$set": {k: v for k, v in son.get("$set", {}).items()
                     if not k.startswith("permissions")}}
This approach is beautifully simple and seems to work. I wanted to reach out to the community to see if anyone else has tackled this issue before or sees obvious flaws in my approach. Thank you,
Joshua

Without seeing some example data with an explanation of what should/shouldn't be updatable - it's hard to say for sure, but the way I would prevent this would be to not allow the user to directly supply the fields they will be updating. For example say you had a function called update_employee which updated information in an employee document. If you implement it like this:
def update_employee(employee):
    # Unsafe: every field the caller supplies is written to the database.
    db.employees.update_one({"_id": session.user_id}, {"$set": employee})
Whatever gets passed in as the employee object is what will be updated. Instead you could create the update object using the values passed in like so:
def update_employee(employee):
    # Copy only the fields the user is allowed to change.
    updated_employee = {
        "email": employee["email"],
        "address": employee["address"],
        "phone": employee["phone"],
    }
    db.employees.update_one({"_id": session.user_id}, {"$set": updated_employee})
This way you have complete control over what is being updated in your database. So if an extra field (such as salary) is passed in, it will be ignored.
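The same idea can be written as a reusable allowlist filter. This is a minimal sketch that only builds the update document and never touches a database. ALLOWED_FIELDS and build_safe_update are illustrative names, not part of any driver.

```python
# Fields a user may change; everything else is dropped (illustrative list).
ALLOWED_FIELDS = {"email", "address", "phone"}


def build_safe_update(user_input):
    """Return a $set document containing only allowlisted top-level fields."""
    safe = {k: v for k, v in user_input.items() if k in ALLOWED_FIELDS}
    # Return None when nothing is left so the caller can skip the update.
    return {"$set": safe} if safe else None


print(build_safe_update({"email": "a@example.com", "salary": 1000000}))
# {'$set': {'email': 'a@example.com'}}
```

Because keys are compared for exact membership, dotted keys such as "permissions.admin" are also rejected unless they are listed explicitly.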

Since MongoDB (as far as I know) does not have a field lock, what you can do in this case is create a routine that picks up the specific document and presents it to the user in any way you wish, but only shows the fields they are allowed to edit.
You can present the entire JSON representation to the user (editor) and have a routine that simply does not allow changes to the locked fields. In other words, if you don't want the field {"name": "Sam"} to be edited, then even if the editor changes this value to {"name": "Joe"}, just discard that change before updating and only update the fields that are allowed to be edited. Since it is all done in memory before the actual update (upsert), you have total control over what is being edited and what is not.
If you follow a naming scheme with a prefix, say e_address, where you have decided that any field starting with e_ allows editing, the job is that much easier programmatically.
Even in user-defined roles I have not seen any possibility of locking specific fields in a collection. (I could be wrong here.)
The programming constructs here are simple though.
A. Pick up field to memory
B. Editor does editing
C. Only update the fields that are allowed to be edited. Ignore any other changes.
(I kept this answer generic as I do not use Python, though the construct should apply to any language.)
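Steps A to C can be sketched in a few lines, here in Python and entirely in memory. The allowed_changes helper and the sample documents are hypothetical; the e_ prefix rule follows the naming scheme suggested above.

```python
def allowed_changes(original, edited, is_editable):
    """Return only the edited fields that differ and are allowed to change."""
    return {
        key: value
        for key, value in edited.items()
        if is_editable(key) and original.get(key) != value
    }


# A. The document as picked up into memory.
original = {"name": "Sam", "e_address": "1 Old Road"}
# B. The document after the editor's changes.
edited = {"name": "Joe", "e_address": "2 New Street"}
# C. Keep only changes to fields whose name starts with "e_".
changes = allowed_changes(original, edited, lambda key: key.startswith("e_"))
print(changes)  # {'e_address': '2 New Street'}
```

The resulting dict could then be used as the body of a $set update, so the change to "name" never reaches the database.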

MongoDB: When to denormalize and when to use $lookup [duplicate]

I want to design a question structure with some comments. Which relationship should I use for comments: embed or reference?
A question with some comments, like stackoverflow, would have a structure like this:
Question
title = 'aaa'
content = 'bbb'
comments = ???
At first, I thought of using embedded comments (I think embed is recommended in MongoDB), like this:
Question
title = 'aaa'
content = 'bbb'
comments = [ { content = 'xxx', createdAt = 'yyy'},
{ content = 'xxx', createdAt = 'yyy'},
{ content = 'xxx', createdAt = 'yyy'} ]
It is clear, but I'm worried about this case: If I want to edit a specified comment, how do I get its content and its question? There is no _id to let me find one, nor question_ref to let me find its question. (Is there perhaps a way to do this without _id and question_ref?)
Do I have to use ref rather than embed? Do I then have to create a new collection for comments?

This is more an art than a science. The Mongo Documentation on Schemas is a good reference, but here are some things to consider:
Put as much in as possible
The joy of a Document database is that it eliminates lots of Joins. Your first instinct should be to place as much in a single document as you can. Because MongoDB documents have structure, and because you can efficiently query within that structure (this means that you can take the part of the document that you need, so document size shouldn't worry you much) there is no immediate need to normalize data like you would in SQL. In particular any data that is not useful apart from its parent document should be part of the same document.
Separate data that can be referred to from multiple places into its own collection.
This is not so much a "storage space" issue as it is a "data consistency" issue. If many records will refer to the same data it is more efficient and less error prone to update a single record and keep references to it in other places.
Document size considerations
MongoDB imposes a 4MB (16MB as of version 1.8) size limit on a single document. In a world of GB of data this sounds small, but it is also 30 thousand tweets or 250 typical Stack Overflow answers or 20 Flickr photos. On the other hand, this is far more information than one might want to present at one time on a typical web page. First consider what will make your queries easier. In many cases concern about document sizes will be premature optimization.
Complex data structures:
MongoDB can store arbitrary deep nested data structures, but cannot search them efficiently. If your data forms a tree, forest or graph, you effectively need to store each node and its edges in a separate document. (Note that there are data stores specifically designed for this type of data that one should consider as well)
It has also been pointed out that it is impossible to return a subset of elements in a document. If you need to pick and choose a few bits of each document, it will be easier to separate them out.
Data Consistency
MongoDB makes a trade-off between efficiency and consistency. The rule is that changes to a single document are always atomic, while updates to multiple documents should never be assumed to be atomic. There is also no way to "lock" a record on the server (you can build this into the client's logic using, for example, a "lock" field). When you design your schema, consider how you will keep your data consistent. Generally, the more that you keep in a document the better.
For what you are describing, I would embed the comments, and give each comment an id field with an ObjectID. The ObjectID has a time stamp embedded in it so you can use that instead of created at if you like.
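As a rough illustration of the embedded timestamp: the first 4 bytes of an ObjectID encode its creation time in seconds since the Unix epoch, so it can be decoded from the hex string alone. The objectid_created_at helper below is a hypothetical stand-in written in Python; drivers expose the same information directly (PyMongo, for instance, has ObjectId.generation_time).

```python
from datetime import datetime, timezone


def objectid_created_at(object_id_hex):
    """Decode the creation time from a 24-character ObjectId hex string."""
    if len(object_id_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    # The first 4 bytes (8 hex characters) are seconds since the Unix epoch.
    seconds = int(object_id_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Example id chosen for illustration only.
created = objectid_created_at("507f1f77bcf86cd799439011")
print(created.isoformat())
```

Note that the resolution is one second, so comments created within the same second are not ordered by this value alone.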

In general, embed is good if you have one-to-one or one-to-many relationships between entities, and reference is good if you have many-to-many relationships.

Well, I'm a bit late but still would like to share my way of schema creation.
I have schemas for everything that can be described by a word, like you would do it in the classical OOP.
E.G.
Comment
Account
User
Blogpost
...
Every schema can be saved as a Document or Subdocument, so I declare this for each schema.
Document:
Can be used as a reference. (E.g. the user made a comment -> comment has a "made by" reference to user)
Is a "Root" in your application. (E.g. the blogpost -> there is a page about the blogpost)
Subdocument:
Can only be used once / is never a reference. (E.g. Comment is saved in the blogpost)
Is never a "Root" in your application. (The comment just shows up in the blogpost page but the page is still about the blogpost)

I came across this small presentation while researching this question on my own. I was surprised at how well it was laid out, both the info and the presentation of it.
http://openmymind.net/Multiple-Collections-Versus-Embedded-Documents
It summarized:
As a general rule, if you have a lot of [child documents] or if they are large, a separate collection might be best.
Smaller and/or fewer documents tend to be a natural fit for embedding.

Actually, I'm quite curious why nobody mentioned the UML specifications. A rule of thumb is that if you have an aggregation, then you should use references. But if it is a composition, then the coupling is stronger, and you should use embedded documents.
And you will quickly understand why this is logical. If an object can exist independently of the parent, then you will want to access it even if the parent doesn't exist. As you can't embed it in a non-existent parent, you have to make it live in its own data structure. And if a parent exists, just link them together by adding a ref to the object in the parent.
Not sure what the difference between the two relationships is?
Here is a link explaining them:
Aggregation vs Composition in UML

If I want to edit a specified comment, how to get its content and its question?
You can query by sub-document: db.question.find({'comments.content' : 'xxx'}).
This will return the whole Question document. To edit the specified comment, you then have to find the comment on the client, make the edit and save that back to the DB.
In general, if your document contains an array of objects, you'll find that those sub-objects will need to be modified client side.

Yes, we can use a reference in the document and populate the referenced document, similar to a join in SQL. MongoDB does not have SQL-style joins to map one-to-many relationships between documents. Instead, we can use Mongoose's populate to cover this scenario.
var mongoose = require('mongoose')
  , Schema = mongoose.Schema;

var personSchema = Schema({
  _id     : Number,
  name    : String,
  age     : Number,
  stories : [{ type: Schema.Types.ObjectId, ref: 'Story' }]
});

var storySchema = Schema({
  _creator : { type: Number, ref: 'Person' },
  title    : String,
  fans     : [{ type: Number, ref: 'Person' }]
});
Population is the process of automatically replacing the specified paths in the document with the document(s) from other collection(s). We may populate a single document, multiple documents, plain objects, multiple plain objects, or all objects returned from a query.
For examples and more information, see: http://mongoosejs.com/docs/populate.html

I know this is quite old, but if you are looking for the answer to the OP's question on how to return only the specified comment, you can use the positional $ operator in the projection of a find, like this:
db.question.find({'comments.content': 'xxx'}, {'comments.$': true})
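The positional $ operator is also available on the update side, which allows a single embedded comment to be edited on the server. The sketch below only builds the filter and update documents in Python. It assumes each comment carries its own _id field, and build_comment_edit is an illustrative helper, not a driver function.

```python
def build_comment_edit(question_id, comment_id, new_content):
    """Build (filter, update) documents that edit one embedded comment."""
    # The filter must match the array element so "$" knows which one is meant.
    query = {"_id": question_id, "comments._id": comment_id}
    # "comments.$" refers to the first array element matched by the filter.
    update = {"$set": {"comments.$.content": new_content}}
    return query, update


query, update = build_comment_edit("q1", "c2", "edited text")
print(query)   # {'_id': 'q1', 'comments._id': 'c2'}
print(update)  # {'$set': {'comments.$.content': 'edited text'}}
# With PyMongo these would be passed as: collection.update_one(query, update)
```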

MongoDB gives you the freedom to be schema-less, and this feature can cause pain in the long term if the schema is not thought through or planned well.
There are two options: embed or reference. I will not go through the definitions, as the answers above have defined them well.
When embedding, you should answer one question: is your embedded document going to grow, and if so, by how much? (Remember there is a limit of 16 MB per document.) If you have something like comments on a post, what is the limit on the comment count if that post goes viral and people start adding comments? In such cases, a reference could be a better option (but even an array of references can grow and reach the 16 MB limit).
So how do you balance it? The answer is a combination of different patterns. Check these links and create your own mix based on your use case.
https://www.mongodb.com/blog/post/building-with-patterns-a-summary
https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1

If I want to edit a specified comment, how do I get its content and its question?
If you had kept track of the number of comments and the index of the comment you wanted to alter, you could use the dot operator (SO example).
For example:
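db.questions.update(
  {
    "title": "aaa"
  },
  {
    // $set changes only the named field; without it the update document
    // would not be applied as a field-level change.
    "$set": { "comments.0.content": "new text" }
  }
)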
(as another way to edit the comments inside the question)

Does MongoDB's update atomicity apply to both query and modification?

MongoDB has support for atomic updates. I.e. I can be sure that when a document is updated no other update will overwrite my previous change. My question relates to the combination of query and update statement, and is best illustrated by the example shown below.
db.foo.update(
  { state : 1, players: { $size: 2 } },
  { $push: { players : { new player document } } },
  false, true );
In the above example, I only want to push a new player into a collection of players, if the number of players equals 2. Given the above query and update statement, is it possible that two simultaneous updates both push a player onto the same document, because at the time of reading the document its players $size is 2? I.e. does the atomicity span across the query and update part of the update statement or not?
Edit: a more in-depth sequence of events:
Consider firing the same update twice (U1 and U2) at the same time. Is the following sequence of events possible or not?
U1 finds that document #1 matches the query portion of the update statement.
U2 finds that document #1 matches the query portion of the update statement.
U1 pushes a new player in document #1.
U2 pushes a new player in document #1.
The end result is that document #1 contains one more player than expected, because both U1 and U2 were under the impression that document #1 contains only two players.

I've asked this question on the mongodb-user group. http://groups.google.com/group/mongodb-user/browse_thread/thread/e61e220dc0f6f64c
According to the answer by Marc (who works at 10gen), the situation I described cannot occur:
"The situation that you described is not possible; there is no danger of both updates modifying the same document."
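The behavior Marc describes can be modeled conceptually, assuming that matching and modifying a single document happen as one indivisible step. The Python sketch below is a toy in-memory model with a hypothetical Document class; it says nothing about how MongoDB implements this internally.

```python
import threading


class Document:
    """Toy stand-in for one stored document; not how MongoDB is implemented."""

    def __init__(self, players):
        self.players = list(players)
        self._lock = threading.Lock()

    def push_if_size(self, expected_size, player):
        # Match and modify happen under the same lock, so no other update
        # can slip in between the size check and the push.
        with self._lock:
            if len(self.players) != expected_size:
                return False
            self.players.append(player)
            return True


doc = Document(["p1", "p2"])
results = []
threads = [
    threading.Thread(target=lambda name=name: results.append(doc.push_if_size(2, name)))
    for name in ("U1", "U2")
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results), len(doc.players))  # [False, True] 3
```

Whichever update runs second no longer matches the size condition, so exactly one player is added.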

Update: not sure of my knowledge anymore... See "The ABA Nuance". Please don't accept this answer (or my comment below) as it is probably not correct. Would love to be corrected.
Your explanation of atomic is incorrect (I can be sure that when a document is updated no other update will overwrite my previous change). Other updates can (and will) overwrite your change. But they won't do it in a way that would interfere with integrity of your query.
It is important to know that MongoDB updates are atomic on a single document. So when a document matches your query, it is "locked" and ready for an update. Note that your update ($push) works inside the same document that was locked. When the update is finished, the lock is released.
I am not sure I understand "does the atomicity span across the query and update part of the update statement or not", but atomic means that other queries can't interfere with our query. Our query can change data that it has "locked" itself.
Disclaimer: I am not privy to internal mechanisms MongoDB uses to ensure this atomicity, so this description might be lacking from technical viewpoint (especially in connection to locking) - but it is valid conceptually. This is how it works from external viewpoint.

With the sequence of events that you write down, you can indeed have one player too many. The update's "find" and "update" work very much like doing it yourself with a "find" and then an "update" on each of the documents that you're iterating over. You probably want to have a look at the "$atomic" operator: http://www.mongodb.org/display/DOCS/Atomic+Operations#AtomicOperations-ApplyingtoMultipleObjectsAtOnce
