MongoDB insertion based on reading - mongodb

In my data, I need to check whether a combination of attributes already exists in the database. If it does, I just update the document; if not, I insert a new one. I know this is easy in a relational database, where you can make the combination of attributes a primary key, but I can't find the same thing in Mongo. For now I simply set this combination as an index, query the database to see whether the count is zero, and then decide whether to update or insert. However, as the data grows (currently 2 million docs), the query eats up memory much faster than inserting without the query does. Does anyone have a better idea?
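For reference, this is roughly the pattern described above, sketched in the mongo shell (the collection name items and the attribute names are placeholders for illustration):

// compound index on the attribute combination
db.items.ensureIndex({ attrA: 1, attrB: 1 })

// current approach: read first, then decide whether to insert or update
if (db.items.count({ attrA: "x", attrB: "y" }) === 0) {
    db.items.insert({ attrA: "x", attrB: "y", value: 1 })
} else {
    db.items.update({ attrA: "x", attrB: "y" }, { $set: { value: 1 } })
}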

Related

Choosing the right database index type

I have a very simple Mongo database for a personal nodejs project. It's basically just records of registered users.
My most important field is an alpha-numeric string (let's call it user_id and assume it can't be only numeric) of about 15 to 20 characters.
Now the most important operation is checking whether the user exists or not. I do this by querying db.collection.find({user_id: "testuser-123"});
if no record is returned, I save the user along with some other, less important data like first name, last name and signup date.
Now I obviously want to make user_id an index.
I read the Indexing Tutorials on the official MongoDB Manual.
First I tried setting a text index because I thought that would fit the alpha-numeric field. I also tried setting language: none. But it turned out that my query returned in ~12ms with the text index instead of ~6ms without any indexing.
Then I tried just setting an ordered index like {user_id: 1}, but I haven't seen any difference (does it only work for numeric values?).
Can anyone recommend the best type of index for this case, or the quickest query to check whether the user exists? Or is MongoDB perhaps not the best match for this?
Some random thoughts first:
A text index is used to support full text search. Given your description, this is not what is needed here, since, if I understand it correctly, you need an exact match on the whole field.
Without any index, MongoDB will use a linear search. In big O notation, this is an O(n) operation. With an (ordered) index, the search is performed in O(log(n)). That means an index will dramatically speed up queries when you have many documents, but you will not necessarily see any improvement if you have only a small number of documents; in that case, the O(log(n)) lookup can even be slower than the O(n) scan because of its constant overhead. Some database management systems don't even bother using the index if the optimizer estimates that it will not provide enough benefit. I don't know whether MongoDB does that, though.
Given your use case, I think the proper index is a unique index. This is an ordered index that also prevents insertion of two documents with the same user_id.
In your application, do not test before insert. In a real application, this could lead to a race condition when you have concurrent inserts. If you use a unique index, just try to insert -- and be prepared to gracefully handle an error caused by a duplicate key.
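A minimal sketch of that in the mongo shell (the collection name users and the extra fields are assumptions for illustration; exactly how the duplicate-key error surfaces depends on your driver and shell version):

// create the unique index once; the database now enforces uniqueness of user_id
db.users.ensureIndex({ user_id: 1 }, { unique: true })

// just insert; a second document with the same user_id is rejected
var res = db.users.insert({ user_id: "testuser-123", first_name: "Test", signup_date: new Date() })
if (res.hasWriteError() && res.getWriteError().code === 11000) {
    // duplicate key: the user already exists -- handle it gracefully instead of failing
    print("user already exists")
}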

How to insert quickly to a very large collection

I have a collection of over 70 million documents. Whenever I add new documents in batches (let's say 2K), the insert operation is really slow. I suspect that is because the mongo engine is comparing the _id's of all the new documents with all 70 million existing ones to find duplicate _id entries. Since the _id-based index is disk-resident, this makes everything a lot slower.
Is there any way to avoid this? I just want mongo to take new documents and insert them as they are, without doing this check. Is that even possible?
Diagnosing "Slow" Performance
Your question includes a number of leading assumptions about how MongoDB works. I'll address those below, but I'd advise you to try to understand any performance issues based on facts such as database metrics (e.g. serverStatus, mongostat, mongotop), system resource monitoring, and the information in the MongoDB log on slow queries. Metrics need to be monitored over time so you can identify what is "normal" for your deployment, so I would strongly recommend using a MongoDB-specific monitoring tool such as MMS Monitoring.
A few interesting presentations that provide very relevant background material for performance troubleshooting and debugging are:
William Zola: The (Only) Three Reasons for Slow MongoDB Performance
Asya Kamsky: Diagnostics and Debugging with MongoDB
Improving efficiency of inserts
Aside from understanding where your actual performance challenges lie and tuning your deployment, you could also improve efficiency of inserts by:
removing any unused or redundant secondary indexes on this collection
using the Bulk API to insert documents in batches
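For example, a rough sketch of an unordered bulk insert in the mongo shell (2.6+ Bulk API; the collection name and fields are made up):

// build one batch of ~2000 documents and send it with a single execute() call
var bulk = db.mycollection.initializeUnorderedBulkOp();
for (var i = 0; i < 2000; i++) {
    bulk.insert({ seq: i, createdAt: new Date() });
}
bulk.execute();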
Assessing Assumptions
Whenever I add new documents in batches (let's say 2K), the insert operation is really slow. I suspect that is because the mongo engine is comparing the _id's of all the new documents with all 70 million existing ones to find duplicate _id entries. Since the _id-based index is disk-resident, this makes everything a lot slower.
If a collection has 70 million entries, that does not mean an index lookup involves 70 million comparisons. The indexed values are stored in B-trees, which allow for a small number of efficient comparisons. The exact number will depend on the depth of the tree, how your indexes are built, and the value you're looking up, but it will be on the order of tens (not millions) of comparisons.
If you're really curious about the internals, there are some experimental storage & index stats you can enable in a development environment: Storage-viz: Storage Visualizers and Commands for MongoDB.
Since the _id-based index is disk-resident, this makes everything a lot slower.
MongoDB loads your working set (portion of data & index entries recently accessed) into available memory.
If you are able to create your ids in approximately ascending order (for example, the generated ObjectIds), then all the index updates will occur at the right-hand side of the B-tree and your working set will be much smaller (FAQ: "Must my working set fit in RAM").
Yes, I can let mongo generate the _id for itself, but I don't want to waste a perfectly good index on it. Moreover, even if I let mongo generate the _id, won't it still need to compare for duplicate key errors?
A unique _id is required for all documents in MongoDB. The default ObjectId is generated based on a formula that should ensure uniqueness (i.e. there is an extremely low chance of returning a duplicate key exception, so your application will not get duplicate key exceptions and have to retry with a new _id).
If you have a better candidate for a unique _id in your documents, then feel free to use this field (or combination of fields) instead of relying on the generated _id. Note that the _id is immutable, so you shouldn't use any field that you might want to modify later.
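For instance, a quick sketch of supplying your own _id (the values here are made up):

// use your natural unique key as _id instead of a generated ObjectId
db.users.insert({ _id: "testuser-123", first_name: "Test", signup_date: new Date() })

// the mandatory _id index is unique, so this second insert fails with a duplicate key error
db.users.insert({ _id: "testuser-123", first_name: "Other" })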

Mongodb - add if not exists (non id field)

I have a large number of records to iterate over (coming from an external data source) and then insert into a mongo db.
I do not want to allow duplicates. How can this be done in a way that will not affect performance?
The number of records is around 2 million.
I can think of two fairly straightforward ways to do this in mongodb, although a lot depends upon your use case.
One, you can use the upsert:true option to update, using whatever you define as your unique key as the query for the update. If the document does not exist it will be inserted; otherwise it will be updated.
http://docs.mongodb.org/manual/reference/method/db.collection.update/
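A minimal sketch of this first approach in the mongo shell (the collection and field names are placeholders):

// query by whatever you treat as the unique key; insert if absent, update if present
db.records.update(
    { external_id: "abc-123" },
    { $set: { name: "Example", updatedAt: new Date() } },
    { upsert: true }
)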
Two, you could just create a unique index on that key and then insert, ignoring the error generated. Exactly how to do this will depend somewhat on the language and driver used, along with the version of mongodb. This has the potential to be faster when performing batch inserts, but YMMV.
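And a rough sketch of the second approach with the 2.6+ shell Bulk API (names are again made up; exactly how duplicate-key errors surface varies by driver and version):

// unique index on the key that must not be duplicated
db.records.ensureIndex({ external_id: 1 }, { unique: true })

// unordered bulk insert: documents that hit a duplicate key are rejected,
// but the remaining documents are still inserted
var docs = [ { external_id: "a-1" }, { external_id: "a-2" } ];  // stand-in for your external records
var bulk = db.records.initializeUnorderedBulkOp();
docs.forEach(function (doc) { bulk.insert(doc); });
try {
    bulk.execute();
} catch (e) {
    // duplicate key errors are reported here and can be ignored
}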
2 million is not a huge number that will affect performance; splitting your record fields into different collections would be good enough.
I suggest creating a unique index on your unique key before inserting into mongodb.
A unique index will filter out the redundant data (you will lose the duplicate records), and you can ignore the resulting errors.

How often shall we reindex the geospatial data in mongodb?

I wonder whether it is necessary to reindex the geospatial data in mongodb after new geo-data has been inserted, in order to be able to search it. Say we have a document which looks like:
{user:'a',loc:[363.236,-45.365]}, and it is indexed. Later on, I insert document b, which looks like: {user:'b',loc:[42.3654,-56.3]}. In order to search, do I have to reindex (using ensureIndex()) the collection every time a new document is inserted? Will frequent reindexing affect the overall application performance?
Thanks.
You only need to ensureIndex once; after that, MongoDB maintains the index on every insert. I'm not 100% sure the index is maintained for deletes, though - I imagine it must be.
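A minimal sketch using documents like the ones in the question (the collection name places is an assumption, and a legacy 2d index is shown; coordinates must be within valid longitude/latitude ranges):

// create the geospatial index once
db.places.ensureIndex({ loc: "2d" })

// later inserts are indexed automatically -- no reindexing needed
db.places.insert({ user: "b", loc: [42.3654, -56.3] })

// the new document is immediately searchable
db.places.find({ loc: { $near: [42.3, -56.0] } }).limit(5)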
You can defragment an index and rebuild it to make it smaller, hence the existence of the functionality. A useful post:
http://jasonwilder.com/blog/2012/02/08/optimizing-mongodb-indexes/

How can I use Mongo $hint so the query would NOT use any index

This question is somewhat strange, but I bumped into it in a current implementation of mine:
I want to privilege inserts over everything else in my application, and it came to my mind that the $hint command could also be used to make mongo NOT use an index.
Is that possible? is that a sound question, considering what $hint is supposed to do?
Thanks
To force the query optimizer to not use indexes (do a table scan), use:
db.collection.find().hint({$natural:1})
Not sure if this achieves what you want (prioritizing inserts over other activity), though.
I don't think inserts work the way you think.
An insert writes its indexed fields into the B-tree of every index on the collection. As such, to privilege inserts you would have to drop all indexes on the collection.
Using a $natural order hint will therefore make no difference to the order of reads and writes. Not to mention that $natural order is the on-disk insertion order, not an index you can effectively use in a query, so it will force a full table scan.
However, that does not actually privilege anything, since maintaining the B-tree is part of inserting data; as such, there is no way, via indexes, to prioritise inserts.
Also the write and read lock are two completely different things so again I am not sure if your question makes sense.
Are you instead looking for an atomic lock to ensure that you update or insert data before it is read?