I would like to understand which of the below queries would be faster, while doing updates, in mongo db? I want to update few thousands of records at one stretch.
Accumulating the object ids of those records and firing them using $in or using bulk update?
Using one or two fields in the collection which are common for those few thousand records - akin to "where" in sql and firing an update using those fields. These fields might or might not be indexed.
I know that query will be much smaller in the 2nd case as every single "_id" (oid) is not accumulated. Does accumulating _ids and using those to update documents offer any practical performance advantages?
Does accumulating _ids and using those to update documents offer any practical performance advantages?
Yes because MongoDB will certainly use the _id index (idhack).
In the second method - as you observed - you can't tell whether or not an index will be used for a certain field.
So the answer will be: it depends.
If your collection has million of documents or more, and / or the number of search fields is quite large you should prefer the first search method. Especially if the id list size is not small and / or the id values are adjacent.
If your collection is pretty small and you can tolerate a full scan you may prefer the second approach.
In any case, you should testify both methods using explain().
Related
I have a collection of ~500M documents.
Every time when I execute a query, I receive one or more documents from this collection. Let's say I have a counter for each document, and I increase this counter by 1 whenever this document is returned from the query. After a few months of running the system in production, I discover that the counter of only 5% of the documents is greater than 0 (zero). Meaning, 95% of the documents are not used.
My question is: Is there an efficient way to arrange these documents to speedup the query execution time, based on the fact that 95% of the documents are not used?
What is the best practice in this case?
If - for example - I will add another boolean field for each document named "consumed" and index this field. Can I improve the query execution time somehow?
~500M documents That is quite a solid figure, good job if that's true. So here is how I see the solution of the problem:
If you want to re-write/re-factor and rebuild the DB of an app. You could use versioning pattern.
How does it looks like?
Imagine you have a two collections (or even two databases, if you are using micro service architecture)
Relevant docs / Irrelevant docs.
Basically you could use find only on relevant docs collection (which store 5% of your useful docs) and if there is nothing, then use Irrelevant.find(). This pattern will allows you to store old/historical data. And manage it via TTL index or capped collection.
You could also add some Redis magic to it. (Which uses precisely the same logic), take a look:
This article can also be helpful (as many others, like this SO question)
But don't try to replace Mongo with Redis, team them up instead.
Using Indexes and .explain()
If - for example - I will add another boolean field for each document named "consumed" and index this field. Can I improve the query execution time somehow?
Yes, it will deal with your problem. To take a look, download MongoDB Compass, create this boolean field in your schema, (don't forget to add default value), index the field and then use Explain module with some query. But don't forget about compound indexes! If you create field on one index, measure the performance by queering only this one field.
The result should been looks like this:
If your index have usage (and actually speed-up) Compass will shows you it.
To measure the performance of the queries (with and without indexing), use Explain tab.
Actually, all this part can be done without Compass itself, via .explain and .index queries. But Compass got better visuals of this process, so it's better to use it. Especially since he becomes absolutely free for all.
I am designing a MongoDB collection that will have 50 million documents and every field in the document will be searchable and sortable. The searching and sorting logics will be sent from the frontend so could be a lot of fields searchings and sorting combinations. I've made some tests and concluded that when there is searching and sorting only in indexed fields the query runs very fast, but when searching or sorting non-indexed fields the query runs very slow.
Considering that will have a lot of possible searching/sorting combinations, how can I build indexes in this collection in this case to get a better performance?
Indexing comes at a cost of extra memory space and possible increased execution time of database write(insert and update) operations. However, like you rightly pointed out, indexing makes database reads(and sorting) super fast.
Creating indexes is easy and straight forward, however, you need to consider the tradeoffs, most times, this is usually the read-write ration of the fields in your documents.
If you frequently read(or sort) documents from a very large collection(like the 50million examples you mentioned), it makes a lot of sense to add indexing to all the fields you use to identify(or sort) your documents, you just need to ensure you don't run out of memory space in the DB. Not indexing the fields would be very frustrating, just imagine if you need to get the last document by a field that is not indexed, you would have to search through 49,999,999 documents to find it.
I hope this helps.
I have have a Python application that is iteratively going through every document in a MongoDB (3.0.2) collection (typically between 10K and 1M documents), and adding new fields (probably doubling/tripling the number of fields in the document).
My initial thought was that I would use upsert the entire of the revised documents (using pyMongo) - now I'm questioning that:
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
this is actually a great question that can be solved a few different ways depending on how you are managing your data.
if you are upserting additional fields does this mean your data is appending additional fields at a later point in time with the only changes being the addition of the additional fields? if so you could set the ttl on your documents so that the old ones drop off over time. keep in mind that if you do this you will want to set an index that sorts your results by descending _id so that the most recent additions are selected before the older ones.
the benefit of this of doing it this way is that your are continually writing data as opposed to seeking and updating data so it is faster.
in regards to upserts vs bulk inserts. bulk inserts are always faster than upserts since bulk upserting requires you to find the original document first.
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
you really need to understand your data fully to determine what is best but if only change to the data is additional fields or changes that only need to be considered from that point forward then bulk inserting and setting a ttl on your older data is the better method from the stand point of write operations as opposed to seek, find and update. when using this method you will want to db.document.find_one() as opposed to db.document.find() so that only your current record is returned.
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
bulk inserts will be faster than inserting each one sequentially.
Since MongoDB 3.x introduces lock per record and not on collection or database, does it make sense to write all of your data to single collection with one extra identifier field "documentType".
It will help simulate "join" through map-reduce operation.
Couchbase does the same thing with "buckets" instead of collection.
Does anybody see any disadvatanges with this approach ?
There's one big general-case disadvantage: indexes.
With Mongo, you generally want to set up indexes so that most, if not all, queries you make, use them. So in addition to the one on _id, you'll set up indexes on the primary fields you search by (often compounded with those you sort by).
If you're storing everything in one single collection, that means you need to have all those indexes on that collection. Which means two things:
The indexes are be bigger, since there's more documents to index. Granted, this can be somewhat mitigated by using sparse indexes.
Inserting or modifying documents in the collection requires Mongo to update all these indexes (where it'd just update the relevant indexes in the standard use-many-collections approach). This kills your write performance.
Furthermore, if you have in your application a query that somehow doesn't use one of those many indexes, it needs to scan through the entire collection, which is O(n) where n is the number of documents in the collection -- in your case, that means the number of documents in the entire database.
Collections are cheap. Use them ;)
I have a collection of over 70 million documents. Whenever I add new documents in batches (lets say 2K), the insert operation is really slow. I suspect that is because, the mongo engine is comparing the _id's of all the new documents with all the 70 million to find out any _id duplicate entries. Since the _id based index is disk-resident, it'll make the code a lot slow.
Is there anyway to avoid this. I just want mongo to take new documents and insert it as they are, without doing this check. Is it even possible?
Diagnosing "Slow" Performance
Your question includes a number of leading assumptions about how MongoDB works. I'll address those below, but I'd advise you to try to understand any performance issues based on facts such as database metrics (i.e. serverStatus, mongostat, mongotop), system resource monitoring, and information in the MongoDB log on slow queries. Metrics need to be monitored over time so you can identify what is "normal" for your deployment, so I would strongly recommend using a MongoDB-specific monitoring tool such as MMS Monitoring.
A few interesting presentations that provide very relevant background material for performance troubleshooting and debugging are:
William Zola: The (Only) Three Reasons for Slow MongoDB Performance
Aska Kamsky: Diagnostics and Debugging with MongoDB
Improving efficiency of inserts
Aside from understanding where your actual performance challenges lie and tuning your deployment, you could also improve efficiency of inserts by:
removing any unused or redundant secondary indexes on this collection
using the Bulk API to insert documents in batches
Assessing Assumptions
Whenever I add new documents in batches (lets say 2K), the insert operation is really slow. I suspect that is because, the mongo engine is comparing the _id's of all the new documents with all the 70 million to find out any _id duplicate entries. Since the _id based index is disk-resident, it'll make the code a lot slow.
If a collection has 70 million entries, that does not mean that an index lookup involves 70 million comparisons. The indexed values are stored in B-trees which allow for a small number of efficient comparisons. The exact number will depend on the depth of the tree and how your indexes are built and the value you're looking up .. but will be on the order of 10s (not millions) of comparisons.
If you're really curious about the internals, there are some experimental storage & index stats you can enable in a development environment: Storage-viz: Storage Visualizers and Commands for MongoDB.
Since the _id based index is disk-resident, it'll make the code a lot slow.
MongoDB loads your working set (portion of data & index entries recently accessed) into available memory.
If you are able to create your ids in an approximately ascending order (for example, the generated ObjectIds) then all the updates will occur at the right side of the B-tree and your working set will be much smaller (FAQ: "Must my working set fit in RAM").
Yes, I can let mongo use the _id for itself, but I don't want to waste a perfectly good index for it. Moreover, even if I let mongo generate _id for itself won't it need to compare still for duplicate key errors?
A unique _id is required for all documents in MongoDB. The default ObjectId is generated based on a formula that should ensure uniqueness (i.e. there is an extremely low chance of returning a duplicate key exception, so your application will not get duplicate key exceptions and have to retry with a new _id).
If you have a better candidate for the unique _id in your documents, then feel free to use this field (or collection of fields) instead of relying on the generated _id. Note that the _id is immutable, so you shouldn't use any fields that you might want to modify later.