MongoDB background indexing and unique indexes

When you create an index in MongoDB, there are two options:
Do foreground indexing and lock all write operations while doing so
Do background indexing and still allow records to be written in the meantime
My question is:
How can something like a unique index be built in the background? What if a duplicate document is inserted while the index is building?

I believe this is the most relevant excerpt from the MongoDB docs:
Background indexing operations run in the background so that other database operations can run while creating the index. However, the mongo shell session or connection where you are creating the index will block until the index build is complete. To continue issuing commands to the database, open another connection or mongo instance.
Queries will not use partially-built indexes: the index will only be usable once the index build is complete.
So this means the client where you issued the command to create the index will remain blocked until the index is fully created. If, from another client, you're doing something like adding a duplicate document while the index is being built, it will insert the document without an error, but eventually your initial client will encounter an error that it was unable to complete the index because there is a duplicate key for the unique index.
Now, I actually ended up here while trying to understand what Mongoid's index(..., {background: true}) option does, because it seems to imply that every write may perform the indexing portion of the write in the background. My understanding now is that this option only applies to the initial creation of the index. This is explained in the introduction to the docs for the background option of MongoDB's createIndex method (which is not technically the same thing as Mongoid's background option, but it clarifies the concept behind that option):
MongoDB provides several options that only affect the creation of the index [...] This section describes the uses of these creation options and their behavior.
Related: Some options that you can specify to createIndex() options control the properties of the index, which are not index creation options. For example, the unique option affects the behavior of the index after creation.

Referring to the MongoDB docs:
If a background index build is in progress when the mongod process terminates, when the instance restarts the index build will restart as foreground index build. If the index build encounters any errors, such as a duplicate key error, the mongod will exit with an error.
So there are two possibilities:
If index creation has already completed, the document you are trying to insert will be rejected immediately with an error.
If index creation is still in progress in the background, you will be able to insert the document (because at the time of insertion the index does not fully exist yet). Later, when the index build reaches your duplicate document, it will exit with an error. This is the same behavior as when you already have duplicate documents and try to create a foreground index.

@mltsy
If, from another client, you're doing something like adding a duplicate document while the index is being built, it will insert the document without an error.
I am not sure this is correct, as the MongoDB docs state:
When building an index on a collection, the database that holds the collection is unavailable for read or write operations until the index build completes.
I used mongoose to test this:
var uniqueUsernameSchema = new Schema({
  username: {
    index: { unique: true, background: false },
    type: String,
    required: true
  }
})
var U = mongoose.model('U1', uniqueUsernameSchema)
var dup = [{ username: 'Val' }, { username: 'Val' }]
U.create(dup, function (e, d) {
  console.log(e, d)
})
The unique index failed to build, which shows that the foreground option did not block the write operations in MongoDB.

Related

Is it possible to run a "dummy" query to see how many documents _would_ be inserted

I am using MongoDB to track unique views of a resource.
Every time a user views a specific resource for the first time, a new view is logged in the db.
If that same user views the same resource again, the unique compound index on the collection blocks the insert of the duplicate.
For bulk inserts, with { ordered: false }, Mongo allows the new views through and blocks the duplicates. The return value of the insert is an object with an insertedCount property, telling me how many docs made it past the unique index.
In some cases, I want to know how many docs would be inserted before running the query. Then, based on the dummy insertedCount, I would choose to run the query, or not.
Is there a way to test a query and have it do everything except actually inserting the docs?
I could solve this by running some JS server-side to get the answer I need, but I would prefer to let the db do those checks.
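One way to approximate this is a read-only pre-check: count how many of the batch's unique keys already exist and subtract. Below is a minimal Python sketch, assuming a PyMongo-style collection object that exposes count_documents() and a unique compound index on exactly the fields passed as key_fields. The function name and the user_id/resource_id field names are hypothetical. The result is only an estimate, since another client can insert between the check and the real insert.

```python
def count_would_insert(collection, docs, key_fields):
    """Estimate how many docs an unordered bulk insert would add.

    Assumes a unique index on exactly `key_fields`, that every doc
    contains all of those fields, and that their values are hashable.
    """
    # Duplicates inside the batch itself can only be inserted once.
    unique_keys = {tuple(doc[f] for f in key_fields) for doc in docs}
    if not unique_keys:
        return 0  # an empty $or clause is rejected by the server

    # One $or branch per distinct key; the unique index guarantees each
    # branch matches at most one stored document.
    query = {"$or": [dict(zip(key_fields, key)) for key in unique_keys]}
    already_present = collection.count_documents(query)
    return len(unique_keys) - already_present
```

A call such as count_would_insert(views, batch, ["user_id", "resource_id"]) could then decide whether the real insert with { ordered: false } is worth running.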

How does MongoDB background indexing work?

Say we want to add an index to the zipcode field of a MongoDB collection named people. In order not to affect any operations, we write the following line: db.people.createIndex( { zipcode: 1}, {background: true} ). I'm having a hard time understanding what exactly that does.
It's a command to create an index. When we specify {background: true}, does that mean that it runs in the background only on the index's initial creation (after we press enter), or every time a new record is added?
Background index creation starts immediately (when you "press enter"), but it will be done in the background and you can continue updating the collection while this is being done.
Any documents you add while index creation is still ongoing will make it into the final index, but this will not happen immediately when you insert the document (it also happens "in the background" if you will, but really the index does not properly exist yet at this point).
Once the index has been fully created (i.e. is up-to-date with the collection), it works just like a normal index.
That means adding new documents to the collection will also add them to the index at the same time (not sometime later).

When are mongodb indexes updated?

Question
Are mongodb indexes updated before the success of a write operation is reported to the application or do index updates run in the background? If they run in the background: is there a way to wait for an index update to complete?
Background
I have a document
person1obj = {
  email: 'user@domain.tld',
  [...]
}
in a people collection where a unique index is applied to the email field. Now I'd like to insert another document
person2obj = {
  email: 'user@domain.tld',
  [...]
}
Obviously, I have to change the email field of person1 before person2 can be inserted. With mongoose, the code looks like
mongoose.model('Person').create(person1obj, function (err, person1) {
  // person1 has been saved to the db and 'user@domain.tld' is
  // added to the *unique* email field index
  // change email for person1 and save
  person1.email = 'otheruser@domain.tld';
  person1.save(function(err, person1) {
    // person1 has been updated in the db
    // QUESTION: is it guaranteed that 'user@domain.tld' has been removed from
    // the index?
    // inserting person2 could fail if the index has not yet been updated
    mongoose.model('Person').create(person2obj, function (err, person2) {
      // ...
    });
  });
});
I have seen a random fail of my unit tests with the error E11000 duplicate key error index which made me wonder if index updates run in the background.
This question probably is related to mongodb's write concern but I couldn't find any documentation on the actual process for index updates.
From the FAQ (emphasis mine):
How do write operations affect indexes?
Any write operation that alters an indexed field requires an update to the index in addition to the document itself. If you update a document that causes the document to grow beyond the allotted record size, then MongoDB must update all indexes that include this document as part of the update operation.
Therefore, if your application is write-heavy, creating too many indexes might affect performance.
At the very least in the case of unique indexes, the indexing does not run in the background. This is evident from the fact that when you try to write a new document with a duplicate key that is supposed to be unique, you get a duplicate key error.
If indexing were to happen asynchronously in the background, Mongo would not be able to tell whether the write actually succeeded. Thus the indexing must happen during the write sequence.
While I have no evidence for this (though Mongo is open source, so if you have enough time you can look it up), I believe that all indexing is done during the write sequence, even if it's not a unique index. It wouldn't make sense to have special logic only for writes that affect a unique index.

Is there any way to restore a predefined schema to MongoDB?

I'm a beginner with MongoDB. I want to know whether there is any way to load a predefined schema into MongoDB (for example, like Cassandra, which uses a .cql file for this purpose).
If there is, please point me to some documentation about the structure of that file and how to restore it.
If there is not, how can I create an index only once, when I create a collection? I think it is wrong to create the index every time I call the insert method or run my program.
P.S.: I have a multi-threaded program in which every thread inserts into and updates my Mongo collection. I want to create the index only once.
Thanks.
To create an index on a collection, use the ensureIndex command (since MongoDB 3.0 the command is called createIndex, and ensureIndex remains only as a deprecated alias). You only need to call it once to create an index on a collection.
If you call ensureIndex repeatedly with the same arguments, only the first call will create an index, all subsequent calls will have no effect.
So if you know what indexes you're going to use for your database, you can create a script that will call that command.
An example insert_index.js file that creates 2 indexes for collA and collB collections:
db.collA.ensureIndex({ a : 1});
db.collB.ensureIndex({ b : -1});
You can call it from a shell like this:
mongo --quiet localhost/dbName insert_index.js
This will create those indexes on a database named dbName on your localhost. It's worth noticing that if your database and/or collections are not yet created, this will create both the database and the collections for which you're adding the indexes.
Edit
To clarify a little: MongoDB is schemaless, so you can't restore its schema.
You can only create indexes and collections (by using createCollection helper).
MongoDB is basically schemaless so there is no definition of a schema or namespaces to be restored.
In the case of indexes, these can be created at any time. Neither the collection nor the indexed fields need to exist yet, as this will all be sorted out when the collections are created and when documents that contain the indexed fields are inserted.
Commands to create an index are generally the same with each implementation language, for example:
db.collection.ensureIndex({ a: 1, b: -1 })
This defines an index on the target collection in the target database that references field "a" and field "b", the latter in descending order. It works even if the collection or the database does not exist yet; in that case it establishes a blank namespace.
Subsequent calls to the same index creation method do not actually re-create the index. When the specified index is identical to one that already exists, the call is effectively skipped as a "no-operation".
As such, you can simply feed all your required index creation statements at application startup and anything that is not already present will be created. Anything that already exists will be left alone.
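The startup approach described above can be sketched in Python. This is a minimal illustration, assuming a PyMongo-style database object whose collections expose create_index(). The helper name ensure_indexes and the INDEX_SPECS mapping are hypothetical, and the two entries simply mirror the collA/collB shell example.

```python
def ensure_indexes(db, index_specs):
    """Request every index in `index_specs`; existing ones are left alone.

    `index_specs` maps a collection name to a list of key lists, where a
    key list is [(field, direction), ...] with direction 1 or -1.
    Returns (collection_name, index_name) pairs in the order processed.
    """
    ensured = []
    for collection_name, key_lists in index_specs.items():
        for keys in key_lists:
            # create_index() is a no-op when an identical index already
            # exists, so this is safe to run on every application start.
            name = db[collection_name].create_index(keys)
            ensured.append((collection_name, name))
    return ensured


# Hypothetical spec: ascending index on collA.a, descending on collB.b.
INDEX_SPECS = {
    "collA": [[("a", 1)]],
    "collB": [[("b", -1)]],
}
```

In a multi-threaded program, one reasonable choice is to call ensure_indexes(db, INDEX_SPECS) once from the main thread before the worker threads start, so that no thread has to think about indexes at all.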

insert or ignore multiple documents in mongoDB

I have a collection in which all of my documents have at least these 2 fields, say name and url (where url is unique, so I set up a unique index on it). Now if I try to insert a document with a duplicate url, it gives an error and halts the program. I don't want this behavior; I need something like MySQL's insert or ignore, so that MongoDB skips the document with the duplicate url and continues with the next documents.
Is there some parameter I can pass to the insert command to achieve this behavior? I generally do a batch of inserts using pymongo as:
collection.insert(document_array)
Here collection is a collection and document_array is an array of documents.
So is there some way I can implement the insert or ignore functionality for a multiple document insert?
Set the continue_on_error flag when calling insert(). Note PyMongo driver 2.1 and server version 1.9.1 are required:
continue_on_error (optional): If True, the database will not stop
processing a bulk insert if one fails (e.g. due to duplicate IDs).
This makes bulk insert behave similarly to a series of single inserts,
except lastError will be set if any insert fails, not just the last
one. If multiple errors occur, only the most recent will be reported
by error().
Use insert_many(), and set ordered=False.
This will ensure that all write operations are attempted, even if there are errors:
http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.insert_many
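With ordered=False, PyMongo still raises a BulkWriteError once all writes have been attempted, and the exception's details dict lists each failure. The following is a minimal sketch of how the duplicates could be told apart from real errors. The helper name split_bulk_errors is hypothetical, and it assumes duplicate-key failures carry server error code 11000.

```python
DUPLICATE_KEY = 11000  # server error code for a unique-index violation


def split_bulk_errors(details):
    """Separate duplicate-key failures from other write errors.

    `details` is the dict found on pymongo.errors.BulkWriteError.details.
    Returns (inserted_count, duplicate_positions, other_errors), where
    duplicate_positions are indexes into the submitted document list.
    """
    write_errors = details.get("writeErrors", [])
    duplicates = [e["index"] for e in write_errors if e.get("code") == DUPLICATE_KEY]
    others = [e for e in write_errors if e.get("code") != DUPLICATE_KEY]
    return details.get("nInserted", 0), duplicates, others


# Intended use (requires pymongo and a live collection, so left as a comment):
#
#   try:
#       collection.insert_many(document_array, ordered=False)
#   except pymongo.errors.BulkWriteError as exc:
#       inserted, dup_positions, others = split_bulk_errors(exc.details)
#       if others:
#           raise  # something other than a duplicate url went wrong
```

Re-raising when anything other than a duplicate shows up keeps the "ignore" behavior from hiding unrelated failures.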
Try this:
try:
    coll.insert(
        doc_or_docs=doc_array,
        continue_on_error=True)
except pymongo.errors.DuplicateKeyError:
    pass
The insert operation will still throw an exception if an error occurs in the insert (such as trying to insert a duplicate value for a unique index), but it will not affect the other items in the array. You can then swallow the error as shown above.
Why not just put your call to .insert() inside a try: ... except: block and continue if the insert fails?
In addition, you could also use a regular update() call with the upsert flag. Details here: http://www.mongodb.org/display/DOCS/Updating#Updating-update%28%29
If you have your array of documents already in memory in your python script, why not insert them by iterating through them, and simply catch the ones that fail on insertion due to the unique index?
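for doc in docs:
    try:
        # insert_one() is the PyMongo 3+ spelling; the legacy insert()
        # method was removed in PyMongo 4
        collection.insert_one(doc)
    except pymongo.errors.DuplicateKeyError:
        print('Duplicate url %s' % doc)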
Where collection is an instance of a collection created from your connection/database instances and docs is the array of dictionaries (documents) you would currently be passing to insert.
You could also decide what to do with the duplicate keys that violate your unique index within the except block.
It is highly recommended to use upsert
stat.update({'location': d['user']['location']},
            {'$inc': {'count': 1}}, upsert=True, safe=True)
Here stat is the collection. If the visitor's location is already present in the collection, count is increased by one; otherwise a new document is created with count set to 1.
Here is the link to the documentation: http://www.mongodb.org/display/DOCS/Updating#Updating-UpsertswithModifiers
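For reference, the same upsert can be written against the newer PyMongo API, where update_one() takes the place of the legacy update() call. This is a minimal sketch, assuming a PyMongo 3+ style collection object. The function name record_visit is hypothetical.

```python
def record_visit(stats, location):
    """Increment the counter for `location`, creating it on first sight.

    Returns True when the upsert created a new document.
    """
    result = stats.update_one(
        {"location": location},    # match on the field that must be unique
        {"$inc": {"count": 1}},    # $inc on a missing field sets it to 1
        upsert=True,
    )
    return result.upserted_id is not None
```

Because the match and the increment happen in a single server-side operation, no separate existence check is needed before the write.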
What I am doing:
Generate an array of the MongoDB ids I want to insert (a hash of some values, in my case)
Remove the ids that already exist (I use a Redis queue for performance, but you can query Mongo instead)
Insert the cleaned data!
Redis is perfect for that; you can also use Memcached or MySQL Memory, according to your needs.