MongoDB - add if not exists (non id field)

I have a large number of records to iterate over (coming from an external data source) and then insert into a MongoDB database.
I do not want to allow duplicates. How can this be done in a way that will not affect performance?
The number of records is around 2 million.

I can think of two fairly straightforward ways to do this in mongodb, although a lot depends upon your use case.
One, you can use the upsert:true option to update, using whatever you define as your unique key as the query for the update. If it does not exist it will insert it, otherwise it will update it.
http://docs.mongodb.org/manual/reference/method/db.collection.update/
Two, you could just create a unique index on that key and then insert ignoring the error generated. Exactly how to do this will be somewhat dependent on language and driver used along with version of mongodb. This has the potential to be faster when performing batch inserts, but YMMV.
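As a rough sketch of the first approach in batch form, the helper below turns incoming records into upsert operations in the shape the Node.js driver's bulkWrite accepts. The helper name buildUpsertOps, the externalId key and the sample records are assumptions for illustration. $setOnInsert is used so that an existing document is left untouched rather than overwritten (swap in $set if you want updates instead).

```javascript
// Hypothetical helper: one upsert operation per record, keyed on `keyField`.
function buildUpsertOps(records, keyField) {
  return records.map((record) => ({
    updateOne: {
      // Match on the field you treat as the unique key.
      filter: { [keyField]: record[keyField] },
      // Only write the fields when the upsert results in an insert.
      update: { $setOnInsert: record },
      upsert: true,
    },
  }));
}

// Sample data (made up): the second and third records share a key.
const ops = buildUpsertOps(
  [
    { externalId: "a-1", name: "first" },
    { externalId: "a-2", name: "second" },
    { externalId: "a-2", name: "duplicate of second" },
  ],
  "externalId"
);

console.log(JSON.stringify(ops[0]));
// With a connected driver collection (not shown here), a batch could be sent as:
//   await collection.bulkWrite(ops, { ordered: false });
```

A unique index on the key is still worth having alongside this, since concurrent upserts on the same key can otherwise race and insert twice.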

2 million is not a large enough number to affect performance; splitting your records' fields into different collections will be good enough.
I suggest creating a unique index on your unique key before inserting into MongoDB.
The unique index will reject redundant records with a duplicate key error, so those records are dropped, and you can ignore the error.

Related

Why is indexing applied to a specific column and not to every column by default?

I have read about indexing in databases. The point I don't understand is why we need to specify an index on a particular column: if it makes searches fast, why is it not applied to all columns by default? Is it possible to create multiple indexes on one table?
To summarize:
Using indexes improves the performance of read queries, because MongoDB doesn't have to read the entire collection when searching for documents. It also improves sorting.
But these improvements have a cost:
Indexes use disk/memory space.
Delete, insert and update operations take longer: on each insert, delete or update operation, MongoDB must update all the affected indexes.
There are multiple index types, and some of them (compound indexes, for example) can have plenty of combinations.
For these reasons (but not only these), by default only the index on the _id field (as it needs to be unique) is created when a collection is created.
If there are n indexes and you perform a save/update operation that modifies indexed keys, the write is applied n times, which prolongs the write lock. Under such a workload you may observe that even reads are faster with no indexes than when updating constantly with too many indexes. So when you add indexes, keep track of them, or you will run into serious performance problems (RAM usage and write locking).

Choosing the right database index type

I have a very simple Mongo database for a personal nodejs project. It's basically just records of registered users.
My most important field is an alpha-numeric string (let's call it user_id and assume it can't be only numeric) of about 15 to 20 characters.
Now the most important operation is checking whether the user exists at all or not. I do this by querying db.collection.find({"user_id": "testuser-123"})
If no record is returned, I save the user along with some other not so important data like first name, last name and signup date.
Now I obviously want to make user_id an index.
I read the Indexing Tutorials on the official MongoDB Manual.
First I tried setting a text index because I thought that would fit the alpha-numeric field. I also tried setting language:none. But it turned out that my query returned in ~12ms instead of 6ms without indexing.
Then I tried just setting an ordered index like {user_id: 1}, but I haven't seen any difference (is it only working for numeric values?).
Can anyone recommend me the best type of index for this case or quickest query to check if the user exists? Or maybe is MongoDB not the best match for this?
Some random thoughts first:
A text index is used to help full text search. Given your description this is not what is needed here, as, if I understand it well, you need to use an exact match of the whole field.
Without any index, MongoDB will use a linear search. Using big O notation, this is an O(n) operation. With an (ordered) index, the search is performed in O(log(n)). That means an index will dramatically speed up queries once you have many documents. But you will not necessarily see any improvement if you have a small number of documents. In that case, the indexed lookup can even be slower than the linear scan. Some database management systems don't even bother using the index if the optimizer estimates that it will not provide enough benefit. I don't know if MongoDB does that, though.
Given your use case, I think the proper index is a unique index. This is an ordered index that also prevents the insertion of two documents with the same user_id.
In your application, do not test before inserting. In a real application, this could lead to a race condition when you have concurrent inserts. If you use a unique index, just try to insert, and be prepared to gracefully handle the error caused by a duplicate key.
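A minimal sketch of that "insert and handle the duplicate" pattern, assuming a Node.js driver collection with a unique index on user_id already in place. The function name insertIfAbsent is hypothetical; 11000 is the server's duplicate key error code. The collection is passed in as a parameter, so nothing here connects to a database.

```javascript
const DUPLICATE_KEY = 11000; // MongoDB's duplicate key error code

// Hypothetical helper: try to insert, report whether the document was new.
async function insertIfAbsent(collection, doc) {
  try {
    await collection.insertOne(doc);
    return true; // inserted
  } catch (err) {
    if (err && err.code === DUPLICATE_KEY) {
      return false; // already present, enforced by the unique index
    }
    throw err; // anything else is a real failure
  }
}
```

Because the uniqueness check happens inside the server as part of the insert, two concurrent callers cannot both succeed for the same user_id, which is what the separate find-then-save sequence cannot guarantee.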

How to insert quickly to a very large collection

I have a collection of over 70 million documents. Whenever I add new documents in batches (let's say 2K), the insert operation is really slow. I suspect that is because the mongo engine is comparing the _ids of all the new documents with all 70 million to find any duplicate _id entries. Since the _id-based index is disk-resident, it makes the code a lot slower.
Is there any way to avoid this? I just want mongo to take the new documents and insert them as they are, without doing this check. Is it even possible?
Diagnosing "Slow" Performance
Your question includes a number of leading assumptions about how MongoDB works. I'll address those below, but I'd advise you to try to understand any performance issues based on facts such as database metrics (i.e. serverStatus, mongostat, mongotop), system resource monitoring, and information in the MongoDB log on slow queries. Metrics need to be monitored over time so you can identify what is "normal" for your deployment, so I would strongly recommend using a MongoDB-specific monitoring tool such as MMS Monitoring.
A few interesting presentations that provide very relevant background material for performance troubleshooting and debugging are:
William Zola: The (Only) Three Reasons for Slow MongoDB Performance
Asya Kamsky: Diagnostics and Debugging with MongoDB
Improving efficiency of inserts
Aside from understanding where your actual performance challenges lie and tuning your deployment, you could also improve efficiency of inserts by:
removing any unused or redundant secondary indexes on this collection
using the Bulk API to insert documents in batches
Assessing Assumptions
Whenever I add new documents in batches (let's say 2K), the insert operation is really slow. I suspect that is because the mongo engine is comparing the _ids of all the new documents with all 70 million to find any duplicate _id entries. Since the _id-based index is disk-resident, it makes the code a lot slower.
If a collection has 70 million entries, that does not mean that an index lookup involves 70 million comparisons. The indexed values are stored in B-trees, which allow for a small number of efficient comparisons. The exact number will depend on the depth of the tree, how your indexes are built, and the value you're looking up, but it will be on the order of tens (not millions) of comparisons.
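To put a rough number on that, here is a back-of-the-envelope sketch. It assumes a plain binary search over sorted keys, which is a simplification: a real B-tree has a larger fan-out, and its exact comparison count depends on node size and tree depth. The function name is made up for illustration.

```javascript
// Worst-case comparisons for a binary search over n sorted keys.
function binarySearchComparisons(n) {
  return Math.floor(Math.log2(n)) + 1;
}

console.log(binarySearchComparisons(70000000)); // 27
console.log(binarySearchComparisons(2000000));  // 21
```

Even under this simplified model, going from 2 million to 70 million keys adds only a handful of comparisons per lookup.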
If you're really curious about the internals, there are some experimental storage & index stats you can enable in a development environment: Storage-viz: Storage Visualizers and Commands for MongoDB.
Since the _id-based index is disk-resident, it makes the code a lot slower.
MongoDB loads your working set (portion of data & index entries recently accessed) into available memory.
If you are able to create your ids in an approximately ascending order (for example, the generated ObjectIds) then all the updates will occur at the right side of the B-tree and your working set will be much smaller (FAQ: "Must my working set fit in RAM").
Yes, I can let mongo use the _id for itself, but I don't want to waste a perfectly good index for it. Moreover, even if I let mongo generate _id for itself won't it need to compare still for duplicate key errors?
A unique _id is required for all documents in MongoDB. The default ObjectId is generated based on a formula that should ensure uniqueness (i.e. there is an extremely low chance of returning a duplicate key exception, so your application will not get duplicate key exceptions and have to retry with a new _id).
If you have a better candidate for the unique _id in your documents, then feel free to use this field (or collection of fields) instead of relying on the generated _id. Note that the _id is immutable, so you shouldn't use any fields that you might want to modify later.

MongoDB and composite primary keys

I'm trying to determine the best way to deal with a composite primary key in a mongo db. The main key for interacting with the data in this system is made up of 2 uuids. The combination of uuids is guaranteed to be unique, but neither of the individual uuids is.
I see a couple of ways of managing this:
Use an object for the primary key that is made up of 2 values (as suggested here)
Use a standard auto-generated mongo object id as the primary key, store my key in two separate fields, and then create a composite index on those two fields
Make the primary key a hash of the 2 uuids
Some other awesome solution that I currently am unaware of
What are the performance implications of these approaches?
For option 1, I'm worried about the insert performance due to having non sequential keys. I know this can kill traditional RDBMS systems and I've seen indications that this could be true in MongoDB as well.
For option 2, it seems a little odd to have a primary key that would never be used by the system. Also, it seems that query performance might not be as good as in option 1. In a traditional RDBMS a clustered index gives the best query results. How relevant is this in MongoDB?
For option 3, this would create one single id field, but again it wouldn't be sequential when inserting. Are there any other pros/cons to this approach?
For option 4, well... what is option 4?
Also, there's some discussion of possibly using CouchDB instead of MongoDB at some point in the future. Would using CouchDB suggest a different solution?
MORE INFO: some background about the problem can be found here
You should go with option 1.
The main reason is that you say you are worried about performance - using the _id index which is always there and already unique will allow you to save having to maintain a second unique index.
For option 1, I'm worried about the insert performance due to having
non sequential keys. I know this can kill traditional RDBMS systems
and I've seen indications that this could be true in MongoDB as well.
Your other options do not avoid this problem, they just shift it from the _id index to the secondary unique index, but now you have two indexes: one that's right-balanced and another that's random access.
There is only one reason to question option 1 and that is if you plan to access the documents by just one or just the other UUID value. As long as you are always providing both values and (this part is very important) you always order them the same way in all your queries, then the _id index will be efficiently serving its full purpose.
As an elaboration on why you have to make sure you always order the two UUID values the same way, when comparing subdocuments { a:1, b:2 } is not equal to { b:2, a:1 } - you could have a collection where two documents had those values for _id. So if you store _id with field a first, then you must always keep that order in all of your documents and queries.
The other caution is that index on _id:1 will be usable for query:
db.collection.find({_id:{a:1,b:2}})
but it will not be usable for query
db.collection.find({"_id.a":1, "_id.b":2})
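The field-order caveat can be illustrated without a database. In the sketch below JSON.stringify stands in for an order-sensitive comparison (it is not how MongoDB compares BSON, only an analogy), and makeId is a hypothetical helper that always builds the _id with the same field order.

```javascript
// Hypothetical helper: always build the composite _id as { a, b }, in that order.
function makeId(a, b) {
  return { a, b };
}

// The same values in a different field order serialize differently.
const ab = JSON.stringify({ a: 1, b: 2 });
const ba = JSON.stringify({ b: 2, a: 1 });
console.log(ab === ba); // false

// Going through the helper keeps documents and queries consistent.
console.log(JSON.stringify(makeId(1, 2)) === ab); // true
```

Routing every insert and every whole-document _id query through one such helper is a simple way to make sure the order never drifts between call sites.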
I have an option 4 for you:
Use the automatic _id field and add 2 single field indexes for both uuid's instead of a single composite index.
The _id index would be sequential (although that's less important in MongoDB), easily shardable, and you can let MongoDB manage it.
The 2 uuid indexes let you make any kind of query you need (with the first one, with the second, or with both in any order), and they take up less space than 1 compound index.
In case you use both indexes (and other ones as well) in the same query MongoDB will intersect them (new in v2.6) as if you were using a compound index.
I'd go for option 2, and here is why:
Having two separate fields, instead of one field concatenated from both uuids as suggested in option 1, leaves you the flexibility to create other combinations of indexes to support future queries, or to adapt if it turns out that the cardinality of one key is higher than the other.
Having non-sequential keys could help you avoid hotspots while inserting in a sharded environment, so it's not such a bad option. Sharding is, in my opinion, the best way to scale inserts and updates on collections, since write locking is at the database level (from 2.2 through 2.6) or the collection level (MMAPv1 in 3.0).
I would've gone with option 2. You can still make an index that handles both the UUID fields, and performance should be the same as a compound primary key, except it'll be much easier to work with.
Also, in my experience, I've never regretted giving something a unique ID, even if it wasn't strictly required. Perhaps that's an unpopular opinion though.

MongoDB insertion based on reading

In my data, I need to look up whether a combination of attributes exists in the database yet. If so, I just update it; if not, I insert a new doc. I know this is easy in a relational database if we set a combination of attributes as the primary key, but I haven't found the same thing in Mongo. For now I simply set this combination as an index, query the database to see if the count is zero, then decide whether to update or insert. However, with the growth of the data size (currently 2 million docs), the query eats up memory much faster than inserting without the query. Does anyone have a good idea?