How mongodb background indexing works? - mongodb

Say we want to add an index on the zipcode field of a MongoDB collection people. In order not to affect any operations, we write the following line: db.people.createIndex( { zipcode: 1 }, { background: true } ). Now, I'm having a hard time understanding what exactly that does.
It's a command to create an index. When we specify {background: true}, does that mean it will run in the background only on the index's initial creation (after we press enter), or every time a new record is added?

Background index creation starts immediately (when you "press enter"), but it will be done in the background and you can continue updating the collection while this is being done.
Any documents you add while index creation is still ongoing will make it into the final index, but not immediately at insert time (that part also happens "in the background", if you will, because at that point the index does not properly exist yet).
Once the index has been fully created (i.e. is up-to-date with the collection), it works just like a normal index.
That means adding new documents to the collection will also add them to the index at the same time (not sometime later).
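The two-phase behavior described above can be sketched in plain JavaScript. This is a simulation for illustration, not MongoDB internals; the function and variable names are invented:

```javascript
// Simulation of a background index build (not MongoDB internals).
// Phase 1 scans the documents that existed when the build started; writes
// accepted during that scan succeed immediately but only enter the index in
// the catch-up phase. Once both phases finish, the index is complete, and
// from then on writes would update it synchronously.
function simulateBackgroundBuild(existingDocs, writesDuringBuild, field) {
  const index = new Map(); // field value -> document

  // phase 1: index the initial snapshot of the collection
  for (const doc of existingDocs) index.set(doc[field], doc);

  // phase 2: catch up on documents inserted while phase 1 was running
  for (const doc of writesDuringBuild) index.set(doc[field], doc);

  return index;
}

const index = simulateBackgroundBuild(
  [{ zipcode: 10001 }], // already in the collection at build start
  [{ zipcode: 94103 }], // inserted while the build was running
  'zipcode'
);
// both documents end up in the finished index
```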

Related

Is it possible to run a "dummy" query to see how many documents _would_ be inserted

I am using MongoDB to track unique views of a resource.
Every time a user views a specific resource for the first time, a new view is logged in the db.
If that same user views the same resource again, the unique compound index on the collection blocks the insert of the duplicate.
For bulk inserts, with { ordered: false }, Mongo allows the new views through and blocks the duplicates. The return value of the insert is an object with an insertedCount property, telling me how many docs made it past the unique index.
In some cases, I want to know how many docs would be inserted before running the query. Then, based on the dummy insertedCount, I would choose to run the query, or not.
Is there a way to test a query and have it do everything except actually inserting the docs?
I could solve this by running some JS server-side to get the answer I need, but I would prefer to let the db do those checks.
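One way to approximate a dry run without a server-side feature for it (a sketch under assumptions; the helper name and key shape are invented here): fetch the unique-key fields of the already-stored views with a single find() projecting only the indexed fields, then count client-side how many candidates would survive the unique index:

```javascript
// Count how many candidate docs an { ordered: false } bulk insert would add,
// given the unique keys already present in the collection. Duplicates
// *within* the batch are counted once, mirroring the unique index's behavior.
function wouldInsertCount(candidates, existingKeys, keyOf) {
  const seen = new Set(existingKeys);
  let count = 0;
  for (const doc of candidates) {
    const key = keyOf(doc);
    if (!seen.has(key)) {
      seen.add(key);
      count++;
    }
  }
  return count;
}

const keyOf = (d) => `${d.user}|${d.resource}`;
const n = wouldInsertCount(
  [{ user: 1, resource: 'a' }, { user: 1, resource: 'a' }, { user: 2, resource: 'a' }],
  ['1|a'], // keys already stored in the collection
  keyOf
);
// n === 1: only the { user: 2 } view is new
```

Note this is only an estimate: writes landing between the count and the real insert can still change the outcome.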

Fast query and deletion of documents of a large collection in MongoDB

I have a collection (let's say CollOne) with several million documents. They have a common field "id":
{...,"id":1}
{...,"id":2}
I need to delete some documents in CollOne by id. Those ids are stored in a document in another collection (CollTwo). This ids_to_delete document has the following structure:
{"action_type":"toDelete","ids":[4,8,9,....]}
As CollOne is quite large, finding and deleting each document takes quite a long time. Is there any way to speed up the process?
You can't really avoid a deletion operation in the database if you want to delete anything. If you're having performance issues, I would recommend making sure you have an index built on the id field; otherwise Mongo will use a COLLSCAN to satisfy the query, meaning it will iterate over the entire CollOne collection, which is, I guess, where you feel the pain.
Once you make sure an index is built, there is no more efficient way than using deleteMany:
db.collOne.deleteMany({id: {$in: [4, 8, 9, ....]}})
In case you don't have an index and wonder how to build one, use createIndex like so:
(Prior to version 4.2, building an index locks the entire database; at large scale this could take up to several hours if not more. To avoid this, use the background option.)
db.collOne.createIndex({id: 1})
---- EDIT ----
In Mongo shell:
The Mongo shell is JavaScript-based, so you just have to execute the same logic with JS syntax. Here's how I would do it:
let toDelete = db.collTwo.findOne({ ... })
db.collOne.deleteMany({id: {$in: toDelete.ids}})
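If toDelete.ids is very large, one common refinement (a sketch; the batch size of 1000 is an arbitrary assumption) is to split the ids into batches and issue one deleteMany per batch rather than a single huge $in:

```javascript
// Split an id list into fixed-size batches so each deleteMany gets a
// reasonably sized $in array.
function chunk(ids, size) {
  const batches = [];
  for (let i = 0; i < ids.length; i += size) {
    batches.push(ids.slice(i, i + size));
  }
  return batches;
}

// In the shell this would be used roughly as:
// for (const batch of chunk(toDelete.ids, 1000)) {
//   db.collOne.deleteMany({ id: { $in: batch } });
// }
```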

MongoDb background indexing and unique index

When you create an index in MongoDB, there are 2 options:
Do foreground indexing and lock all write operations while doing so
Do background indexing and still allow records to be written in the meantime
My question is:
How can something like unique index be built in the background? What if a duplicated document is inserted while the index is building?
I believe this is the most relevant excerpt from the MongoDB docs:
Background indexing operations run in the background so that other database operations can run while creating the index. However, the mongo shell session or connection where you are creating the index will block until the index build is complete. To continue issuing commands to the database, open another connection or mongo instance.
Queries will not use partially-built indexes: the index will only be usable once the index build is complete.
So this means the client where you issued the command to create the index will remain blocked until the index is fully created. If, from another client, you do something like adding a duplicate document while the index is being built, the document will be inserted without an error, but eventually your initial client will get an error that the index could not be completed because there is a duplicate key for the unique index.
Now, I actually ended up here while trying to understand what MongoID's index(..., {background: true}) option does, because it seems to imply that every write may perform the indexing portion of the write in the background. My understanding now is that this option only applies to the initial creation of the index. This is explained in the introduction to the docs for the background option of MongoDB's createIndex method (which is not technically the same thing as MongoID's background option, but it clarifies the concept behind that option):
MongoDB provides several options that only affect the creation of the index [...] This section describes the uses of these creation options and their behavior.
Related: Some options that you can specify to createIndex() options control the properties of the index, which are not index creation options. For example, the unique option affects the behavior of the index after creation.
Referring to the MongoDB docs:
If a background index build is in progress when the mongod process terminates, when the instance restarts the index build will restart as foreground index build. If the index build encounters any errors, such as a duplicate key error, the mongod will exit with an error.
So there are two possibilities:
If index creation has completed, then the document you are trying to insert will give you an instant error.
If index creation is still in progress in the background, then you will be able to insert the document (because at the time of insertion the index is not fully there yet). But later, when the index creation process tries to index your duplicate document, it will exit with an error. This is the same behavior as when you already have duplicate documents and try to create a foreground index.
#mltsy
If, from another client, you're doing something like adding a duplicate document while the index is being built, it will insert the document without an error.
I am not sure this is correct, as the MongoDB docs describe it as below:
When building an index on a collection, the database that holds the collection is unavailable for read or write operations until the index build completes.
I used mongoose to test this:
var uniqueUsernameSchema = new Schema({
  username: {
    index: { unique: true, background: false },
    type: String,
    required: true
  }
})
var U = mongoose.model('U1', uniqueUsernameSchema)
var dup = [{ username: 'Val' }, { username: 'Val' }]
U.create(dup, function (e, d) {
  console.log(e, d)
})
The unique index failed to build. This result showed that the foreground option did not block the write operation in MongoDB.

mongodb upsert with conditional field update

I have a script that populates a mongo db from daily server log files. Log files come from a number of servers so the chronological order of the data is not guaranteed. To make this simple, let's say that the document schema is this:
{
  _id: <username>,
  first_seen: <date>,
  last_seen: <date>,
  most_recent_ip: <string>
}
that is, documents are indexed by the name of the user who accessed the server. For each user, we keep track of the first time the user was seen and the ip from the last visit.
Right now I handle this very inefficiently: first try an insert. If it fails, retrieve the record by _id, then calculate updated values (e.g. first_seen and most_recent_ip), and finally update the record. That is 3 db calls per log entry, which makes the script's running time prohibitively long given the very high volume of data.
I'm wondering if I can replace this with an upsert instead. I can see how to handle first_seen/last_seen: probably something like {$min: {'first_seen': <log_entry_date>}} (I hope this works correctly when inserting a new doc). But how do I set most_recent_ip to the new value only when <log_entry_date> > $last_seen?
Is there generally a preferred pattern for my use case?
You can just use $set to set the most_recent_ip, e.g.
db.logs.update(
  {_id: "user1"},
  {$set: {most_recent_ip: "2.2.2.2"}, $min: {first_seen: new Date()}, $max: {last_seen: new Date()}},
  {upsert: true}
)
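Note that the $set above updates most_recent_ip unconditionally. The conditional semantics the question actually asks for (update the ip only when the entry is newer than last_seen) can be stated precisely in plain JavaScript; the function and field names below are invented for illustration, this is not a MongoDB API:

```javascript
// Merge one log entry into a user document: first_seen takes the minimum
// date, last_seen the maximum, and most_recent_ip changes only when the
// incoming entry is the newest one seen so far.
function applyLogEntry(doc, entry) {
  if (doc === null) { // upsert case: no document for this user yet
    return {
      _id: entry.username,
      first_seen: entry.date,
      last_seen: entry.date,
      most_recent_ip: entry.ip,
    };
  }
  if (entry.date < doc.first_seen) doc.first_seen = entry.date;
  if (entry.date > doc.last_seen) {
    doc.last_seen = entry.date;
    doc.most_recent_ip = entry.ip; // the newest entry wins the ip
  }
  return doc;
}
```

On MongoDB 4.2+ this kind of conditional logic can also be pushed server-side using an update with an aggregation pipeline, so out-of-order log entries can be handled in a single upsert per entry.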

When are mongodb indexes updated?

Question
Are mongodb indexes updated before the success of a write operation is reported to the application or do index updates run in the background? If they run in the background: is there a way to wait for an index update to complete?
Background
I have a document
person1obj = {
  email: 'user@domain.tld',
  [...]
}
in a people collection where a unique index is applied to the email field. Now I'd like to insert another document
person2obj = {
  email: 'user@domain.tld',
  [...]
}
Obviously, I have to change the email field of person1 before person2 can be inserted. With mongoose, the code looks like
mongoose.model('Person').create(person1obj, function (err, person1) {
  // person1 has been saved to the db and 'user@domain.tld' is
  // added to the *unique* email field index

  // change email for person1 and save
  person1.email = 'otheruser@domain.tld';
  person1.save(function (err, person1) {
    // person1 has been updated in the db
    // QUESTION: is it guaranteed that 'user@domain.tld' has been removed from
    // the index?
    // inserting person2 could fail if the index has not yet been updated
    mongoose.model('Person').create(person2obj, function (err, person2) {
      // ...
    });
  });
});
I have seen a random fail of my unit tests with the error E11000 duplicate key error index which made me wonder if index updates run in the background.
This question probably is related to mongodb's write concern but I couldn't find any documentation on the actual process for index updates.
From the FAQ (emphasis mine):
How do write operations affect indexes?
Any write operation that alters an indexed field requires an update to the index in addition to the document itself. If you update a document that causes the document to grow beyond the allotted record size, then MongoDB must update all indexes that include this document as part of the update operation.
Therefore, if your application is write-heavy, creating too many indexes might affect performance.
At the very least in the case of unique indexes, the indexing does not run in the background. This is evident from the fact that when you try to write a new document with a duplicate key that is supposed to be unique, you get a duplicate key error.
If indexing was to happen asynchronously in the background, Mongo would not be able to tell if the write actually succeeded. Thus the indexing must happen during the write sequence.
While I have no evidence for this (though Mongo is open source, so if you have enough time you can look it up), I believe that all indexing is done during the write sequence, even if it's not a unique index. It wouldn't make sense to have special logic only for writes that affect a unique index.
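The argument above, that a unique index must be consulted and updated as part of the write itself, can be sketched with a toy in-memory collection (a simulation, not MongoDB internals; all names here are invented):

```javascript
// Toy collection with a unique index on email, maintained synchronously:
// the insert consults the index before acknowledging the write, which is
// why a duplicate produces an immediate E11000-style error rather than a
// deferred one.
function makeCollection() {
  const byEmail = new Map(); // the "unique index"
  return {
    insert(doc) {
      if (byEmail.has(doc.email)) {
        throw new Error('E11000 duplicate key error');
      }
      byEmail.set(doc.email, doc); // index updated within the write itself
    },
    remove(email) {
      byEmail.delete(email); // removal also maintains the index in-line
    },
  };
}
```

In the mongoose scenario from the question, this corresponds to the insert of person2 succeeding only once the update that frees person1's old email has completed, including its index maintenance.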