MongoDB: Are bulk operations written to the oplog as a whole?

When I issue an ordered bulk operation in MongoDB 3, is the bulk operation as a whole written to the oplog so that it can be replayed as a whole after a server crash?
The rationale for this question is the following:
I know that there are no real transactions but that I can use the $isolated keyword to have some read consistency (in some cases).
Apart from being good schema design or not, let's assume that I would have to update multiple documents in possibly different collections in one go, in what would be a transaction in SQL. I do not care about the data being in an inconsistent state at any near moment, however I do require the data to be consistent eventually. So, while I may not care about errors and missing rollbacks during the operation, I require the sequence of updates to be performed entirely or not performed at all at some point, in order to have them survive unexpected server failures or shutdowns in the middle of the bulk operation (because of, say, random CoreOS updates).

I will preface this with the general caveat that I admit I have not actually gone and checked the output, but the basic principles seem valid to me from the start.
What you need to consider here is what is actually happening "under the hood" of the nice syntactic sugar you are presented with in the general calls. That basically means looking at what the "command form" of the operations you are calling actually does, in this case update.
With that in mind, consider the following "Bulk" update form:
var bulk = db.collection.initializeOrderedBulkOp();
bulk.find({ "_id": 1 }).updateOne({ "$set": { "a": 1 } });
bulk.find({ "_id": 2 }).updateOne({ "$set": { "b": 2 } });
bulk.execute();
Now you already know this is being sent to the server as one request, but what you are likely not considering is that the actual "request" made "under the hood" is this:
db.runCommand({
"update": "collection",
"updates": [
{ "q": { "_id": 1 }, "u": { "$set": { "a": 1 } } },
{ "q": { "_id": 2 }, "u": { "$set": { "b": 2 } } }
],
"ordered": true
})
Therefore it stands to reason that what you actually see in the oplog under the "update" operation is something like this (abbreviated from the full output to just the query):
{ "q": { "_id": 1 }, "u": { "$set": { "a": 1 } } }
{ "q": { "_id": 2 }, "u": { "$set": { "b": 2 } } }
Which therefore means that each of those actions, with its associated command, is in the oplog for "replay" on replication and/or on other actions you might perform, such as specifically "replaying" the oplog entries.
I'd be sure that is what actually happens without even looking, because I know that is how the drivers implement the actual calls, and it makes sense that each call is kept in the oplog in this way.
Therefore "as a whole": no. These are not "transactions" and are always distinct operations, even if their submission and return happen within a single request. They are not a singular operation, and therefore will not and should not be recorded as such.

Related

Update Array Children Sorted Order

I have a collection containing objects with the following structure
{
"dep_id": "some_id",
"departament": "dep name",
"employees": [{
"name": "emp1",
"age": 31
},{
"name": "emp2",
"age": 35
}]
}
I would like to sort and save the array of employees for the object with id "some_id", by employees.age, descending. The best outcome would be to do this atomically using mongodb's query language. Is this possible?
If not, how can I rearrange the subdocuments without affecting the parent's other data or the data of the subdocuments? In case I have to download the data from the database and save back the sorted array of children, what would happen if something else performs an update to one of the children or children are added or removed in the meantime?
In the end, the data should be persisted to the database like this:
{
"dep_id": "some_id",
"departament": "dep name",
"employees": [{
"name": "emp2",
"age": 35
},{
"name": "emp1",
"age": 31
}]
}
The best way to do this is to actually apply the $sort modifier as you add items to the array. As you say in your comment, "My actual objects have a 'rank' and 'created_at'", which really should have been stated in your question rather than a "contrived" case.
So for "sorting" on multiple properties, the statement would adjust like this:
db.collection.update(
{ },
{ "$push": { "employees": { "$each": [], "$sort": { "rank": -1, "created_at": -1 } } } },
{ "multi": true }
)
But to update all the data you presently have, as shown in the question, you would sort on "age" with:
db.collection.update(
{ },
{ "$push": { "employees": { "$each": [], "$sort": { "age": -1 } } } },
{ "multi": true }
)
Which oddly uses $push to actually "modify" an array? Yes, it's true: the $each modifier says we are not actually adding anything new, yet the $sort modifier is going to apply to the array in place and "re-order" it.
Of course this would then explain how "new" updates to the array should be written in order to apply that $sort and ensure that the "largest age" is always "first" in the array:
db.collection.update(
{ "dep_id": "some_id" },
{ "$push": {
"employees": {
"$each": [{ "name": "emp": 3, "age": 32 }],
"$sort": { "age": -1 }
}
}}
)
So what happens here is that as you add the new entry to the array on update, the $sort modifier is applied and re-positions the new element between the two existing ones, since that is where it sorts to.
This is a common pattern with MongoDB and is typically used in combination with the $slice modifier in order to keep arrays at a "maximum" length as new items are added, yet retain "ordered" results. And quite often "ranking" is the exact usage.
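As a minimal sketch of that combined pattern (the new entry and the $slice value here are assumptions for illustration):
db.collection.update(
    { "dep_id": "some_id" },
    { "$push": {
        "employees": {
            "$each": [{ "name": "emp4", "age": 40 }],
            "$sort": { "age": -1 },
            "$slice": 10    // keep only the first ten entries after sorting
        }
    }}
)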
So overall, you can actually "update" your existing data and re-order it with "one simple atomic statement". No looping or collection renaming required. Furthermore, you now have a simple atomic method to "update" the data and maintain that order as you add new array items, or remove them.
In order to get what you want you can use the following query:
db.collection.aggregate({
$unwind: "$employees" // flatten employees array
}, {
$sort: {
"employees.name": -1 // sort all documents by employee name (descending)
}
}, {
$group: { // restore the previous structure
_id: "$_id",
"dep_id": {
$first: "$dep_id"
},
"departament": {
$first: "$departament"
},
"employees": {
$push: "$employees"
},
}
}, {
$out: "output" // write everything out to a separate collection
})
After this step you would want to drop your source collection and rename the "output" collection to match your source collection's name.
This solution will, however, not deal with the concurrency issue. So you should remove write access from the collection first so nobody modifies it during the process and then restore it once you're done with the migration.
You could alternatively query all the data first, sort each employees array on the client side, and then write it back with either single update queries or, faster but more complicated, a bulk write operation containing all the individual update calls. Here you could use the entire document that you initially read as the filter for the update operation. If an individual update does not modify any document, you know straight away that some other change must have modified the document you read before. Those cases you would need to retry (either later, or straight away until the update does actually modify a document).
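A minimal sketch of that optimistic approach in the shell (the retry handling is elided; the collection and field names come from the question):
db.collection.find({ "dep_id": "some_id" }).forEach(function(doc) {
    // sort a copy of the array by age, descending, on the client side
    var sorted = doc.employees.slice().sort(function(a, b) {
        return b.age - a.age;
    });

    // use the employees array exactly as read as part of the filter;
    // if a concurrent write changed it, nothing matches and nModified is 0
    var result = db.collection.update(
        { "_id": doc._id, "employees": doc.employees },
        { "$set": { "employees": sorted } }
    );

    if (result.nModified === 0) {
        // somebody else modified the document: re-read and retry here
    }
});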

Upsert Document and/or add a Sub-Document

I've been wrestling with the asynchronous nature of MongoDB, Mongoose and JavaScript and how to best make multiple updates to a collection.
I have an Excel sheet of client and contact data. There are some clients with multiple contacts, one per line, and the client data is the same (so the client name can be used as a unique key - in fact in the schema it's defined with unique: true).
The logic I want to achieve is:
Search the Client collection for the client with clientName as the key
If a matching clientName isn't found then create a new document for that client (not an upsert, I don't want to change anything if the client document is already in the database)
Check to see if the contact is already present in the array of contacts within the client document using firstName and lastName as the keys
If the contact isn't found then $push that contact onto the array
Of course, we could easily have a situation where the client doesn't exist (and so is created) and then immediately, on the very next row of the sheet, there is another contact for the same client, so then I'd want to find that existing (just created) client and $push that 2nd new contact into the array.
I've tried this but it's not working:
Client.findOneAndUpdate(
{clientName: obj.client.clientname},
{$set: obj.client, $push: {contacts: obj.contact}},
{upsert: true, new: true},
function(err, client){
console.log(client)
}
)
and I've had a good look at other questions, e.g.:
create mongodb document with subdocuments atomically?
https://stackoverflow.com/questions/28026197/upserting-complex-collections-through-mongoose-via-express
but can't get to a solution. I'm coming to the conclusion that maybe I have to use some app logic: do the find, make decisions in my code, then write, rather than use a single Mongoose/Mongo statement. But then the issues of asynchronicity rear their ugly head.
Any suggestions?
The approach to handling this is not a simple one, as mixing "upserts" with adding items to "arrays" can easily lead to undesired results. It also depends on whether you want logic to set other fields, such as a "counter" indicating how many contacts there are within an array, which you only want to increment/decrement as items are added or removed respectively.
In the most simple case however, if the "contacts" only contained a singular value, such as an ObjectId linking to another collection, then the $addToSet modifier works well, as long as there are no "counters" involved:
Client.findOneAndUpdate(
{ "clientName": clientName },
{ "$addToSet": { "contacts": contact } },
{ "upsert": true, "new": true },
function(err,client) {
// handle here
}
);
And that is all fine as you are only testing to see if a document matches on the "clientName", and if not, upserting it. Whether there is a match or not, the $addToSet operator will take care of unique "singular" values, meaning any "object" that is truly unique.
The difficulties come in where you have something like:
{ "firstName": "John", "lastName": "Smith", "age": 37 }
Already in the contacts array, and then you want to do something like this:
{ "firstName": "John", "lastName": "Smith", "age": 38 }
Where your actual intention is that this is the "same" John Smith, and it is just the "age" that is different. Ideally you want to just "update" that array entry, and neither create a new array element nor a new document.
Working this with .findOneAndUpdate() where you want the updated document in return can be difficult. So if you don't really need the modified document in the response, then the Bulk Operations API of MongoDB and the core driver are of most help here.
Considering the statements:
var bulk = Client.collection.initializeOrderedBulkOp();
// First try the upsert and set the array
bulk.find({ "clientName": clientName }).upsert().updateOne({
"$setOnInsert": {
// other valid client info in here
"contacts": [contact]
}
});
// Try to set the array where it exists
bulk.find({
"clientName": clientName,
"contacts": {
"$elemMatch": {
"firstName": contact.firstName,
"lastName": contact.lastName
}
}
}).updateOne({
"$set": { "contacts.$": contact }
});
// Try to "push" the array where it does not exist
bulk.find({
"clientName": clientName,
"contacts": {
"$not": { "$elemMatch": {
"firstName": contact.firstName,
"lastName": contact.lastName
}}
}
}).updateOne({
"$push": { "contacts": contact }
});
bulk.execute(function(err,response) {
// handle in here
});
This is nice, since the Bulk Operations here mean that all statements are sent to the server at once and there is only one response. Also note that the logic means at most two operations will actually modify anything.
In the first instance, the $setOnInsert modifier makes sure that nothing is changed when the document is just a match. As the only modifications here are within that block, this only affects a document where an "upsert" occurs.
Also note that in the next two statements you do not try to "upsert" again. This accounts for the first statement either succeeding where it had to, or otherwise not mattering.
The other reason for no "upsert" there is that the conditions needed to test the presence of the element in the array would lead to the "upsert" of a new document when they were not met. That is not desired, therefore no "upsert".
What they do in fact is respectively check whether the array element is present or not, and either update the existing element or create a new one. Therefore in total, all operations mean you either modify "once" or at most "twice" in the case where an upsert occurred. The possible "twice" creates very little overhead and no real problem.
Also in the third statement the $not operator reverses the logic of the $elemMatch to determine that no array element with the query condition exists.
Translating this with .findOneAndUpdate() becomes a bit more of an issue. Not only is it the "success" that matters now, it also determines how the eventual content is returned.
So the best idea here is to run the events in "series", and then work a little magic with the result in order to return the end "updated" form.
The help we will use here comes from both the async.waterfall method and the lodash library:
var _ = require('lodash'); // letting you know where _ is coming from
async.waterfall(
[
function(callback) {
Client.findOneAndUpdate(
{ "clientName": clientName },
{
"$setOnInsert": {
// other valid client info in here
"contacts": [contact]
}
},
{ "upsert": true, "new": true },
callback
);
},
function(client,callback) {
Client.findOneAndUpdate(
{
"clientName": clientName,
"contacts": {
"$elemMatch": {
"firstName": contact.firstName,
"lastName": contact.lastName
}
}
},
{ "$set": { "contacts.$": contact } },
{ "new": true },
function(err,newClient) {
client = client || {};
newClient = newClient || {};
client = _.merge(client,newClient);
callback(err,client);
}
);
},
function(client,callback) {
Client.findOneAndUpdate(
{
"clientName": clientName,
"contacts": {
"$not": { "$elemMatch": {
"firstName": contact.firstName,
"lastName": contact.lastName
}}
}
},
{ "$push": { "contacts": contact } },
{ "new": true },
function(err,newClient) {
newClient = newClient || {};
client = _.merge(client,newClient);
callback(err,client);
}
);
}
],
function(err,client) {
if (err) throw err;
console.log(client);
}
);
That follows the same logic as before, in that only one or two of those statements will actually do anything, with the possibility that the "new" document returned is going to be null. The "waterfall" here passes the result from each stage on to the next, including the end, to which any error will also immediately branch.
In this case the null is swapped for an empty object {}, and the _.merge() method combines the two objects into one at each later stage. This gives you the final result, which is the modified object, no matter which preceding operations actually did anything.
Of course there would be a different manipulation required for $pull, and your question also has input data in an object form in itself. But those are really answers in themselves.
This should at least get you started on how to approach your update pattern.
I don't use mongoose so I'll post a mongo shell update; sorry for that. I think the following would do:
db.clients.update({$and:[{'clientName':'apple'},{'contacts.firstName': {$ne: 'nick'}},{'contacts.lastName': {$ne: 'white'}}]},
{$set:{'clientName':'apple'}, $push: {contacts: {'firstName': 'nick', 'lastName':'white'}}},
{upsert: true });
So:
if the client "apple" does not exists, it is created, with a contact with given first and last name. If it exists, and does not have the given contact, it has it pushed. If it exists, and already has the given contact, nothing happens.

how to restrict $push in mongodb?

I am learning MongoDB and wondering if I can restrict $push by matching values.
For example:
field1 = {
id:123,
title:123,
likes: [{by:1,type:'like'}, {by:2, type:'like'}]
}
Can I restrict push by id in likes?
You may have already tried the $addToSet operator, but then found it does not suit the case here, as the combination of "id" and "type" can vary. For instance, what you don't want is the same "id" value with both types "like" and "dislike".
This is, however, a typical "voting" model, and the current structure is not the best one. A better model for this is as follows, with just the basic fields for example:
{
"_id": 123,
"likeCount": 2,
"dislikeCount": 0,
"likes": [456,789]
"dislikes": []
}
Having separate arrays is important to the atomic update process, since you cannot both $pull and $push on the same array in a single update. More than that, it reinforces the logic behind keeping the "count" values, which are useful for simple queries such as sorting, as opposed to calculating the array length.
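For illustration, an attempt to do both on the same array in one statement is simply rejected (a sketch; the exact error text varies by version):
// conflicts: both operators target the same "likes" array path
db.collection.update(
    { "_id": 123 },
    { "$push": { "likes": 456 }, "$pull": { "likes": 789 } }
)
// fails with a conflicting-update write error, so nothing is applied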
In order to post a "like" for a user who you don't want to duplicate in the array, the $addToSet operator is still not be best one despite the values now being truly unique. You want to contrain the "count" as well, so add the conditions to the query in the update instead:
db.collection.update(
{ "_id": 123, "likes": { "$ne": 456 } },
{
"$push": { "likes": 456 },
"$inc": { "likeCount": 1 }
}
)
That way, if the user has already voted their "like" then not only is nothing added but the "count" is kept at the correct total as well. Basically the query condition on the update was not met as there already was an element in the array matching that value. So the document does not match and nothing is updated.
That is a good approach, but we can make it better still. What if the user already posted a "dislike" and now changes their mind to "like" instead? What you really need here are "two" update statements to cover the possible conditions, and this is where the Bulk Operations API comes in, to handle that logic in a single request:
var bulk = db.collection.initializeOrderedBulkOp();
// match and update where a dislike is present
bulk.find({
"_id": 123,
"likes": { "$ne": 456 },
"dislikes": 456
}).updateOne({
"$push": { "likes": 456 },
"$pull": { "dislikes": 456 }
"$inc": {
"likeCount": 1,
"dislikeCount": -1
}
});
// match and update where no dislike exists
bulk.find({
"_id": 123,
"likes": { "$ne": 456 },
"dislikes": { "$ne": 456 }
}).updateOne({
"$push": { "likes": 456 },
"$inc": { "likeCount": 1 }
});
// Send requests to server and respond
bulk.execute();
In this case if the first statement did not match because there was no dislike then nothing would be updated, but if there was a dislike then the correct adjustments would be made.
With the second request, this one would be applied if there was nothing in the "dislikes" array to match, and there was also no matching item in the "likes" array. So this applies to a new vote and does not conflict with the previous statement. Despite the two statements, the update is only ever applied once, or not at all, depending on the state conditions.
That is the basic pattern for handling this kind of voting properly, as you keep lists of each vote type as well as maintaining the counts for ease of access. The "dislikes" process is pretty much just the reverse of the logic for the elements you need to check for, and removing votes has similar conditions as well.
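As a rough sketch, the reversed "dislike" logic under the same assumed values looks like this:
var bulk = db.collection.initializeOrderedBulkOp();

// match and update where a like is present
bulk.find({
    "_id": 123,
    "dislikes": { "$ne": 456 },
    "likes": 456
}).updateOne({
    "$push": { "dislikes": 456 },
    "$pull": { "likes": 456 },
    "$inc": {
        "dislikeCount": 1,
        "likeCount": -1
    }
});

// match and update where no like exists
bulk.find({
    "_id": 123,
    "dislikes": { "$ne": 456 },
    "likes": { "$ne": 456 }
}).updateOne({
    "$push": { "dislikes": 456 },
    "$inc": { "dislikeCount": 1 }
});

bulk.execute();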

How to get (or aggregate) distinct keys of array in MongoDB

I'm trying to get MongoDB to aggregate for me over an array with different key-value pairs, without knowing the keys (just a simple sum would be OK).
Example docs:
{data: [{a: 3}, {b: 7}]}
{data: [{a: 5}, {c: 12}, {f: 25}]}
{data: [{f: 1}]}
{data: []}
So basically each doc (or its array, really) can have 0 or many entries, and I don't know the keys for those objects, but I want to sum and average the values over those keys.
Right now I'm just loading a bunch of docs and doing it myself in Node, but I'd like to offload that work to MongoDB.
I know I can unwind those first, but how to proceed from there? How to sum/avg/min/max the values if I don't know the keys?
If you do not know the keys or cannot make a reasonably educated guess, then you are basically stuck and cannot go any further with the aggregation framework. You could supply "all of the keys" for consideration, but I suspect your actual data looks more like this:
{ "data": [{ "film": 10 }, { "televsion": 5 },{ "boardGames": 1 }] }
So there would be little point here in finding out all the "key names" and then throwing them at an aggregation statement.
For the record though, "this is why you do not structure your data storage like this". Information like "film" here should not be used as a "key" name, because it is useful "data" that could be searched upon and most importantly "indexed" in a database system.
So your data should really look like this:
{
"data": [
{ "type": "film", "value": 10 },
{ "type": "televsion", "valule": 5 },
{ "type": "boardGames", "value": 1 }
]
}
Then the aggregation statement is simple, as are many other things:
db.collection.aggregate([
{ "$unwind": "$data" },
{ "$group": {
"_id": null,
"sum": { "$sum": "$data.value" },
"avg": { "$avg": "$data.value" }
}}
])
But since the key names are constantly changing between documents and do not have a uniform structure, you need JavaScript processing on the server to traverse the keys, and that means mapReduce:
db.collection.mapReduce(
function() {
this.data.forEach(function(data) {
Object.keys(data).forEach(function(key) {
emit(null,data[key]); // emit the value regardless of key name
});
});
},
function(key,values) {
return Array.sum(values); // Just summing for example
},
{ "out": { "inline": 1 } }
)
And of course the JavaScript execution here will work much more slowly than the native coded operators available to the aggregation framework.
So this should be an abject lesson as to why you do not use "data" as "key names" when storing data in a database. The aggregation framework works with standard structures and is fast; falling back to JavaScript processing is more flexible, but the cost is mostly in speed and other features.

How to store option data in mongo?

I have a site that I'm using Mongo on. So far everything is going well. I've got several fields that are static option data, for example a field for animal breeds and another field for animal registrars.
Breeds
Arabian
Quarter Horse
Saddlebred
Registrars
AQHA
American Arabians
There are maybe 5 or 6 different collections like this that range from 5-15 elements.
What is the best way to put these in Mongo? Right now, I've got a separate collection for each group. That is a breeds collection, a registrars collection etc.
Is that the best way, or would it make more sense to have a single static data collection with a "type" field specifying the option type?
Or something else completely different?
Since this data is static, it's better to just embed it in your documents. This way you don't have to do manual joins.
Also store it in a separate collection (one or several, it doesn't matter; choose what's easier) to facilitate presentation (rendering combo-boxes, etc.).
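For instance, a rough sketch of what that could look like (the collection and field names here are just assumptions):
// an animal document simply embeds the chosen option values
{ "name": "Dakota", "breed": "Quarter Horse", "registrar": "AQHA" }

// while an "options" collection holds the master lists for rendering forms
{ "type": "breeds", "values": ["Arabian", "Quarter Horse", "Saddlebred"] }
{ "type": "registrars", "values": ["AQHA", "American Arabians"] }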
I believe creating multiple collections has storage implications? (Under the MMAP storage engine, MongoDB preallocates data files at doubling sizes [db.0 = 64MB, db.1 = 128MB, and so on], though these files are per database rather than per collection.)
Here's what I can think of:
1. Storing as single collection
The benefits here are:
You only need one call to Mongo to fetch, and if you can cache the call you quickly have the data.
You avoid duplication: create a single schema that deals with all your options. You can just nest suboptions if there are any.
Of course, you also avoid duplication in statics/methods to modify options.
I have something similar on a project that I'm working on: categories and subcategories all stored in one collection. In all the data where I need to store my 'options' (station categories in my case), I simply use the _id. Here's a JSON/BSON dump as an example:
{
"status": {
"code": 200
},
"response": {
"categories": [
{
"cat": "Taxi",
"_id": "50b92b585cf34cbc0f000004",
"subcat": []
},
{
"cat": "Bus",
"_id": "50b92b585cf34cbc0f000005",
"subcat": [
{
"cat": "Bus Rapid Transit",
"_id": "50b92b585cf34cbc0f00000b"
},
{
"cat": "Express Bus Service",
"_id": "50b92b585cf34cbc0f00000a"
},
{
"cat": "Public Transport Bus",
"_id": "50b92b585cf34cbc0f000009"
},
{
"cat": "Tour Bus",
"_id": "50b92b585cf34cbc0f000008"
},
{
"cat": "Shuttle Bus",
"_id": "50b92b585cf34cbc0f000007"
},
{
"cat": "Intercity Bus",
"_id": "50b92b585cf34cbc0f000006"
}
]
},
{
"cat": "Rail",
"_id": "50b92b585cf34cbc0f00000c",
"subcat": [
{
"cat": "Intercity Train",
"_id": "50b92b585cf34cbc0f000012"
},
{
"cat": "Rapid Transit/Subway",
"_id": "50b92b585cf34cbc0f000011"
},
{
"cat": "High-speed Rail",
"_id": "50b92b585cf34cbc0f000010"
},
{
"cat": "Express Train Service",
"_id": "50b92b585cf34cbc0f00000f"
},
{
"cat": "Passenger Train",
"_id": "50b92b585cf34cbc0f00000e"
},
{
"cat": "Tram",
"_id": "50b92b585cf34cbc0f00000d"
}
]
}
]
}
}
I have a call to my API that gets me that document (app.ly/api/v1/stationcategories). I find this much easier to code with.
In your case you could have something like:
{
"option": "Breeds",
"_id": "xxx",
"suboption": [
{
"option": "Arabian",
"_id": "xxx"
},
{
"option": "Quarter House",
"_id": "xxx"
},
{
"option": "Saddlebred",
"_id": "xxx"
}
]
},
{
"option": "Registrars",
"_id": "xxx",
"suboption": [
{
"option": "AQHA",
"_id": "xxx"
},
{
"option": "American Arabians",
"_id": "xxx"
}
]
}
Whenever you need them, either loop through them, or pull specific options from your collection.
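For example, under the sketch above (field names assumed):
// fetch one option group, projecting away the _id
db.options.findOne({ "option": "Breeds" }, { "suboption": 1, "_id": 0 })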
2. Storing as a static JSON document
This, as #Sergio mentioned, is a viable and more simplistic approach. You can then either have separate docs for separate options, or put them all in one document.
You do lose some flexibility here, because you can't reference options by id (which I prefer, since changing an option's name then doesn't affect all your other data).
Prone to typos (though if you know what you're doing this shouldn't be a problem).
For Node.js users: this might leave you with a headache from require('../../../options.json') similar to PHP.
The reader will note that I'm being negative about this approach; it works, but is rather inflexible.
Though we're discouraged from using joins unnecessarily on MongoDB, referencing by ObjectId is sometimes useful and extensible.
An example: if your website becomes popular in one region of the world, and say people from Poland start accounting for 50% of your site visits, and you decide to add Polish translations, you would need to go back to all your documents and add Polish names (where they exist) to your options. If using approach 1, it's as easy as adding a Polish name to each option, and then plucking the Polish name from your options collection at runtime.
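As a rough sketch of that change under approach 1 (the translations field and the Polish label are hypothetical):
// add a Polish label to a single suboption in place; repeat per option
db.options.update(
    { "option": "Breeds", "suboption.option": "Arabian" },
    { "$set": { "suboption.$.translations.pl": "arabski" } }
)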
I could only think of 2 options other than storing each option as a collection
UPDATE: If anyone has positives or negatives for either approach, please add them. My bias might be unhelpful to some people, as there are benefits to storing static JSON files.
MongoDB is schemaless and does not support JOINs, so you have to move away from RDBMS-style normalization, given that this is a different kind of database entirely.
Here are a few rules you can apply as design guidelines. Of course, you always have the choice of keeping data in a separate collection when needed.
Static Master/Reference Data:
You should generally embed this data in your documents wherever it is required. Since the data is not going to change, it is not at all a bad idea to keep it in the same collection. If the data is too large, group it and store it in a separate collection instead of creating multiple collections for this master data.
NOTE: When embedding data as sub-documents, always make sure you are never going to exceed the 16MB limit. That is the limit (at this point) for each document in MongoDB.
Dynamic Master/Reference Data:
Try to keep it in a separate collection, as master data like this tends to change often.
Always remember: there is NO join support, so keep the data in a way that lets you access it without querying the database too many times.
So there is NO single suggested best way; it always depends on your needs, and the design can go either way.