So I understand that MongoDB (and by extension Mongoose) does not support transactions, but that operations involving a single document are always atomic. In looking over the Mongoose docs, I ran into Model.create, which allows one to pass an array of documents and store them in a single action, like so:
var array = [{ type: 'jelly bean' }, { type: 'snickers' }];
Candy.create(array, function (err, jellybean, snickers) {
    // ...
});
Is this action atomic? Does Mongo save all the documents at once, or does the Mongoose ODM loop through the array, saving one document at a time? Sources (or source code) would be greatly appreciated. (Also, I'm new, so please, don't shoot!)
The MongoDB Wire Protocol accepts either a single document or multiple documents with OP_INSERT. However, on the server they are still inserted one at a time.
In other words, if the server were to crash part-way through the insert, some documents would be inserted and others would not be. Within each individual document you are guaranteed a consistent view: either it's fully inserted or it's not. But for multiple documents, no such guarantee exists.
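To make that concrete, here's what the batch form looks like at the shell level (the candies collection is made up for illustration):
db.candies.insert([ { type: 'jelly bean' }, { type: 'snickers' } ])
// The server still writes these one at a time; a crash mid-batch can
// leave the first document inserted and the second missing.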
I have been scouring the internet for a few hours looking for an efficient way to do bulk upserts in Meteor.js's smart collections.
Scenario:
I am hitting an API to get updated info for 200 properties asynchronously every 12 hours. For each property I get an array of about 300 JSON objects on average. About 70% of the objects might not have been updated, but the remaining 30% need to be updated in the database. As there is no way to find those 30% of objects without matching them against the documents in the database, I decided to upsert all documents.
My options:
Run a loop over the objects array and upsert each document in the database.
Remove all documents from collection and bulk insert the new objects.
For option 1, running a loop and upserting 60K documents (a number that will only grow with time) takes a lot of time, but at the moment it seems like the only plausible option.
For option 2, Meteor.js does not allow bulk inserts into its smart collections, so even then we would have to loop over the array of objects.
Is there another option where I can achieve this efficiently?
MongoDB supports inserting an array of documents, so you can insert all your documents in one call from Meteor's rawCollection.
MyCollection.remove({}); // it's empty now
var theRaw = MyCollection.rawCollection();
var mongoInsertSync = Meteor.wrapAsync(theRaw.insert, theRaw);
var result = mongoInsertSync(theArrayOfDocs);
In production code you would wrap this in a try/catch to get hold of the error if the insert fails; result is only meaningful if inserting the array of documents succeeds.
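A minimal sketch of that wrapping, using the names from the snippet above:
var result;
try {
    result = mongoInsertSync(theArrayOfDocs);
    console.log('inserted ' + theArrayOfDocs.length + ' documents');
} catch (err) {
    // the insert failed part-way through; result stays undefined here
    console.error('bulk insert failed: ' + err.message);
}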
The above solution with the rawCollection does insert, but it appears not to support the ordered:false directive to continue processing if one document fails, the raw collection exits on the first error, which is unfortunate.
"If false, perform an unordered insert, and if an error occurs with one of documents, continue processing the remaining documents in the array."
What are Meteor.Collection and Meteor.Collection.Cursor? How do these two relate to each other? Does
new Meteor.Collection("name")
create a MongoDB collection with the parameter name?
Does new Meteor.Collection("name") create a MongoDB collection with the parameter name?
Not exactly. A Meteor.Collection represents a MongoDB collection that may or may not exist yet; the underlying MongoDB collection isn't created until you insert the first document.
A Meteor.Collection.Cursor is a reactive data source that represents a changing subset of documents that exist within a MongoDB collection. This subset of documents is specified by the selector and options arguments you pass to the Meteor.Collection.find(selector, options) method. This find() method returns the cursor object. I think the Meteor Docs explain cursors well:
find returns a cursor. It does not immediately access the database or return documents. Cursors provide fetch to return all matching documents, map and forEach to iterate over all matching documents, and observe and observeChanges to register callbacks when the set of matching documents changes.
Collection cursors are not query snapshots. If the database changes between calling Collection.find and fetching the results of the cursor, or while fetching results from the cursor, those changes may or may not appear in the result set.
Cursors are a reactive data source. The first time you retrieve a cursor's documents with fetch, map, or forEach inside a reactive computation (eg, a template or autorun), Meteor will register a dependency on the underlying data. Any change to the collection that changes the documents in a cursor will trigger a recomputation. To disable this behavior, pass {reactive: false} as an option to find.
The reactivity of cursors is important. If I have a cursor object, I can retrieve the current set of documents it represents by calling fetch() on it. If the data changes in between calls, the fetch() method will actually return a different array of documents. Many things in Meteor natively understand the reactivity of cursors. This is why we can return a cursor object from a template helper function:
Template.foo.documents = function() {
return MyCollection.find(); // returns a cursor object, rather than an array of documents
};
Behind the scenes, Meteor's templating system knows to call fetch() on this cursor object. When the server sends the client updates telling it that the collection has changed, the cursor is informed of this change, which causes the template helper to be recomputed, which causes the template to be rerendered.
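The same reactivity works outside templates too. A small sketch using Tracker.autorun (called Deps.autorun in older Meteor releases; the done field is made up):
Tracker.autorun(function () {
    // fetch() inside a computation registers a dependency on the cursor
    var docs = MyCollection.find({ done: false }).fetch();
    console.log('matching documents: ' + docs.length);
});
// Any insert, update, or remove that changes the matching set reruns this function.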
A Meteor.Collection is an object that you would define like this:
var collection = new Meteor.Collection("collection");
This object then lets you store data in your mongo database. Note that just defining a collection this way does not create a collection in your mongo database; the collection is created once you insert a document into it.
So you would not have a collection called name until you insert a document into it.
A cursor is the result of a .find() operation:
var cursor = collection.find();
You may have thousands of documents; the cursor lets you go through them one by one without having to load all of them into your server's RAM.
You can then loop through using forEach, or use some of the other operations as specified in the docs : http://docs.meteor.com/#meteor_collection_cursor
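For example, a quick sketch of iterating a cursor (the score selector is invented for illustration):
collection.find({ score: { $gt: 10 } }).forEach(function (doc) {
    // documents are handed over one at a time, so the whole result set
    // never has to sit in memory at once
    console.log(doc._id);
});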
A Cursor is also a reactive data source on the client, so if data changes, you can use the same query to update your DOM.
As Neil mentions, it's also worthwhile knowing that Mongo is a NoSQL database. This means you don't really have to create tables/collections. You would just define a collection as above, then insert a document into it. That way the collection is created if it didn't exist; if it already existed, the document is inserted into the existing collection.
Browsing your local database
You don't really need to concern yourself with MongoDB until you are publishing your app; you can just interact with it using Meteor alone. In case you want to have a look at what it looks like:
While meteor is running, use meteor mongo in the same directory to bring up a mongo shell, or use a tool like Robomongo (a GUI tool) to connect to localhost on port 3002 to have a peek at what your mongo database looks like.
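A quick session might look like this (the tasks collection is a made-up example):
$ meteor mongo               // in the app directory, while the app is running
> show collections           // list the collections Meteor has created so far
> db.tasks.findOne()         // peek at one document in an example collection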
I have a 'reports' collection in MongoDB. The collection holds the ObjectId of the user that each report belongs to. I am currently doing an aggregate so that I can group the reports by user.
db.reports.aggregate(
    {
        $group: {
            _id: "$user",
            stuff: {
                $push: {
                    things: {
                        properties: "$properties"
                    }
                }
            }
        }
    },
    {
        $project: {
            _id: 0,
            user: "$_id",
            stuff: "$stuff"
        }
    }
)
This gives me an array of user ids and the report 'stuff'. Instead of just the userId, I am wondering if I can form the aggregate so that I can hit the users collection and return more information about each user.
Is that possible? I am using Mongoose as an ODM, but looking at the Mongoose docs, aggregate looks to be a straight pass-through to MongoDB's aggregate function. I don't think I can take advantage of Mongoose's populate, as aggregate is not dealing with a schema, but I could be wrong.
Lastly, reports are computer generated; each user could have millions. This prevents me from storing the reports in an array within the users collection.
No, you can't do that as there are no joins in MongoDB (not in normal queries, MapReduce, or the Aggregation Framework).
Only one collection can be accessed at a time from the aggregation functions.
Mongoose won't directly help, or necessarily make the query for additional user information any more efficient than doing an $in on a large batch of Users at a time (an array of userIds).
There really aren't workarounds for this, as the lack of joins is currently an intentional design decision in MongoDB (i.e., it's how it works). Two paths you might consider:
You may find that another database platform would be better suited to the types of queries that you're trying to run.
You might try using $in as suggested above after the aggregation results are returned to your client code (it's one of the ways Mongoose handles fetching related documents): grab the userIds and request the associated User documents in batches (sketched below). While it's a bit more network traffic, depending on how you want to display the results, you may consider it an optimization to only fetch extra User information as it's shown to a user (if that's possible), by incrementally loading the extra data.
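A sketch of that second path with Mongoose, reusing the aggregation from the question (the Report and User model names are assumptions):
Report.aggregate(
    { $group: { _id: "$user", stuff: { $push: { things: { properties: "$properties" } } } } },
    { $project: { _id: 0, user: "$_id", stuff: "$stuff" } },
    function (err, results) {
        if (err) throw err;
        var userIds = results.map(function (r) { return r.user; });
        // one extra round trip fetches the whole batch of users
        User.find({ _id: { $in: userIds } }, function (err, users) {
            if (err) throw err;
            // merge users back into results by matching user._id to result.user
        });
    }
);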
You could combine the two collections into one and run the aggregation over that. You don't need to have the same top-level document structure within a collection. It wouldn't be too difficult to tag each document as type A or type B, and adapt your queries involving type-A documents to filter on that tag.
You might not like how it looks, but it would set you up for the "join".
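A rough sketch of what that could look like (the docType tag and the combined collection name are inventions for this example):
// Both user and report documents live in one collection, tagged by type:
db.combined.insert({ docType: "user", name: "Will" })
db.combined.insert({ docType: "report", user: someUserId, properties: { } })
// someUserId: the _id of the matching user document

// Report-only queries gain a $match on the tag:
db.combined.aggregate(
    { $match: { docType: "report" } },
    { $group: { _id: "$user", stuff: { $push: { things: { properties: "$properties" } } } } }
)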
I know that we can bulk update documents in MongoDB with
db.collection.update( criteria, objNew, upsert, multi )
in one db call, but it's homogeneous, i.e. all the documents affected match a single criteria object. What I'd like to do is something like
db.collection.update([{criteria1, objNew1}, {criteria2, objNew2}, ...])
to send multiple update requests that would update possibly completely different documents, or classes of documents, in a single db call.
What I want to do in my app is to insert/update a bunch of objects with a compound primary key; if the key already exists, update the document, otherwise insert it.
Can I do all of this combined in one MongoDB call?
Those are two separate questions. On the first: there is no native MongoDB mechanism to bulk-send criteria/update pairs, although technically doing that in a loop yourself is bound to be about as efficient as any native bulk support would be.
Checking for the existence of a document based on an embedded document (what you refer to as a compound key; in the interest of correct terminology, and to avoid confusion, it's better to use the Mongo name in this case) and inserting/updating depending on that existence check can be done with an upsert:
Document A:
{
    _id: ObjectId(...),
    key: {
        name: "Will",
        age: 20
    }
}
db.users.update({"key.name": "Will", "key.age": 20}, {$set: {"key.age": 21}}, true, false)
This upsert (an update that inserts if no document matches the criteria) will do one of two things depending on the existence of document A:
Exists: performs the update $set: {"key.age": 21} on the existing document.
Doesn't exist: creates a new document with the fields "key.name" and "key.age" set to "Will" and 20 respectively (basically, the criteria are copied into the new doc), and then the update is applied ($set: {"key.age": 21}). The end result is a document with key.name = "Will" and key.age = 21.
Hope that helps
We are seeing some benefits of the $in clause.
Our use case was to update the 'status' field in documents for a large number of records.
In our first cut, we were doing a for loop and updating them one by one. But then we switched to using the $in clause, and that made a huge improvement.
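Roughly what the $in approach looks like in the shell (the collection and field names are placeholders):
db.records.update(
    { _id: { $in: updatedIds } },       // updatedIds: an array of _ids to touch
    { $set: { status: "processed" } },  // the same change is applied to every match
    { multi: true }                     // without this, only the first match is updated
)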
There is no real benefit from doing updates the way you suggest.
The reason that there is a bulk insert API and that it is faster is that Mongo can write all the new documents sequentially to memory, and update indexes and other bookkeeping in one operation.
A similar thing happens with updates that affect more than one document: the update will traverse the index only once and update objects as they are found.
Sending multiple criteria with multiple updates cannot benefit from any of these optimizations. Each criteria means a separate query, just as if you issued each update separately. The only possible benefit would be sending slightly fewer bytes over the connection. The database would still have to do each query separately and update each document separately.
All that would happen is that Mongo would queue the updates internally and execute them sequentially (because only one update can happen at any one time); this is exactly the same as if all the updates were sent separately.
It's unlikely that the overhead of sending the queries separately would be significant; Mongo's global write lock will be the limiting factor anyway.
Consider the following scenario with MongoDB:
Three writers (A,B,C) insert a document into the same collection.
A inserts first, followed by B, followed by C.
How can we guarantee A retrieves the ObjectId of the document he inserted and not B's document or C's document? Do we need to serialize the writes (i.e., only permit B to write after A inserts and retrieves the ObjectId), or does MongoDB offer some native functionality for this scenario?
Thanks!
We're on Rails.
The normal pattern here is for the driver to allocate the ObjectId, so you know what it is for the insert even before the server gets it.
You can generate the _id value in your client applications (the writers) before inserting the document. This way you don't need to rely on the server generating the ObjectId and then retrieving the correct value. Most MongoDB language drivers will do this for you automatically if you leave the _id blank.
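For example, with the Node.js driver (connection details and names are placeholders; the Ruby driver used from Rails behaves the same way):
var MongoClient = require('mongodb').MongoClient;
var ObjectID = require('mongodb').ObjectID;

MongoClient.connect('mongodb://localhost:27017/mydb', function (err, db) {
    if (err) throw err;
    var doc = { _id: new ObjectID(), writer: 'A' };  // the id is allocated client-side
    db.collection('things').insert(doc, function (err) {
        if (err) throw err;
        // doc._id was known before the server ever saw the insert, so writer A
        // can never pick up B's or C's id by accident
        console.log('A inserted ' + doc._id);
        db.close();
    });
});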