Performance difference between update_one() and insert_one() in MongoDB

I've been reading about MongoDB and realize I can insert a new document either with insert_one() or with update_one(upsert=True).
Since I manage my own IDs, it is easier for me to query by a specific ID and create the document if it does not exist, without having to handle two different methods (create and update).
Queries should be very fast in MongoDB, so what do I trade for this convenience?

Actually, you trade nothing. You should use insert_one() only for explicit creation of objects.
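For illustration, a minimal PyMongo sketch of the two approaches; the collection name "items" and the custom id value are made up for this example:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
items = client.mydb.items

# Explicit creation: raises DuplicateKeyError if the _id already exists.
items.insert_one({"_id": "my-custom-id", "name": "first version"})

# Create-or-update in one call: inserts the document when the _id is missing,
# otherwise applies the $set to the existing document.
items.update_one(
    {"_id": "my-custom-id"},
    {"$set": {"name": "latest version"}},
    upsert=True,
)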

Related

Does having a lot of collections matter?

Do many MongoDB collections have a big impact on performance, memory and capacity? I am designing an API with the MVC pattern, and a collection is being created for each model. I am questioning the way I am doing it now.
MongoDB with the WiredTiger engine supports an unlimited number of collections, so you are not going to run into any hard technical limitation.
When you wonder if something should be in one collection or in multiple collections, these are some of the considerations you need to keep in mind:
More collections = more maintenance work. Sharding is configured at the collection level, so having a large number of collections will make shard configuration a lot more work. You also need to set up indexes for each collection separately, but this is quite easy to automate, because createIndex on an index which already exists does nothing (see the sketch after these considerations).
The MongoDB API is designed in a way that every database query operates on one collection at a time. That means when you need to search for a document in n different collections, you need to perform n queries. When you need to aggregate data stored in multiple collections, you run into even more problems. So any data which is queried together should be stored together in the same collection.
Creating one collection for each class in your model is usually a good rule of thumb, but it is not a golden-hammer solution. There are situations where you want to embed objects in their parent-object documents instead of putting them into a separate collection. There are also cases where you want to put all objects with the same base class in the same collection to benefit from MongoDB's ability to handle heterogeneous collections. But that goes beyond the scope of this question.
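To illustrate the point about automating index setup from the first consideration (collection and field names here are invented), create_index can simply be re-run at application startup:

from pymongo import MongoClient, ASCENDING

db = MongoClient("mongodb://localhost:27017").mydb

# Re-running create_index is harmless: if an identical index already exists,
# MongoDB treats the call as a no-op.
for name in ("customers", "orders", "shipments"):   # hypothetical collections
    db[name].create_index([("tenant_id", ASCENDING)])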
Why don't you use this and test your application?
https://docs.mongodb.com/manual/tutorial/evaluate-operation-performance/
By the way, your question is not completely clear... it is more like a "discussion" than a question, and you're asking others to evaluate your work instead of searching the web for the right approach.

Denormalization vs Parent Referencing vs MapReduce

I have a highly normalized data model. Currently I'm using manual referencing by storing the _id and running sequential queries to fetch details from the deepest collection.
The referencing is one-way and the flow spans around 5-6 collections. For one particular use case, I have to query down to the deepest collection by looking up subsequent "_id" values from the higher-level collections. So technically I'm hitting the database every time I run a
db.collection_name.find({ _id: **** })
My prime goal is to optimize reads without hugely affecting the atomicity of the other collections. I have read about denormalization, but it does not make sense to me because I want to keep the option of changing the cardinality down the line, and hence want to maintain separate collections.
I was initially thinking of using MapReduce to aggregate from the back and keep a collection primarily for this particular use case, but even that does not sound that good.
In a relational database, I would break the query into sub-queries and perform a join to get the data sets that intersect with the initial results. Since MongoDB does not support joins, I'm having a tough time figuring anything out.
Please help if you have faced anything like this earlier or have any idea how to resolve it.
Denormalize your data.
MongoDB does not do JOINs - period.
There is no operation on the database which gets data from more than one collection. Not find(), not aggregate() and not MapReduce. When you need to puzzle your data together from more than one collection, there is no other way than doing it on the application layer. For that reason you should organize your data in a way that any common and performance-relevant query can be resolved by querying just a single collection.
In order to do that you might have to create redundancies and transitive dependencies. This is normal in MongoDB.
When this feels "dirty" to you, you should either accept the fact that your performance will be sub-optimal or use a different kind of database, such as a classic relational database or a graph database.
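To make the denormalization suggestion concrete, here is a rough PyMongo sketch (all collection and field names are invented) contrasting the chained lookups with a denormalized document that answers the read in one query:

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").mydb
order_id = ...  # assumed to be a known _id value

# Application-layer "join": one round trip per level of referencing.
order = db.orders.find_one({"_id": order_id})
customer = db.customers.find_one({"_id": order["customer_id"]})
address = db.addresses.find_one({"_id": customer["address_id"]})

# Denormalized alternative: copy the fields the read path needs into the
# order document, at the cost of keeping these copies up to date on writes.
db.orders.update_one(
    {"_id": order_id},
    {"$set": {
        "customer_name": customer["name"],
        "shipping_city": address["city"],
    }},
)
order = db.orders.find_one({"_id": order_id})  # a single query now suffices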

What is the preferred way to add many fields to all documents in a MongoDB collection?

I have a Python application that iterates through every document in a MongoDB (3.0.2) collection (typically between 10K and 1M documents) and adds new fields (probably doubling or tripling the number of fields in each document).
My initial thought was to upsert the entire revised documents (using PyMongo) - now I'm questioning that:
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
This is actually a great question that can be solved in a few different ways, depending on how you are managing your data.
If you are upserting additional fields, does this mean your data is appending additional fields at a later point in time, with the only change being the addition of those fields? If so, you could set a TTL on your documents so that the old ones drop off over time. Keep in mind that if you do this, you will want to sort your results by descending _id so that the most recent additions are selected before the older ones.
The benefit of doing it this way is that you are continually writing data, as opposed to seeking and updating data, so it is faster.
In regards to upserts vs. bulk inserts: bulk inserts are always faster than upserts, since upserting requires finding the original document first.
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
You really need to understand your data fully to determine what is best, but if the only change to the data is additional fields, or changes that only need to be considered from that point forward, then bulk inserting and setting a TTL on your older data is the better method from the standpoint of write operations, as opposed to seek, find and update. When using this method you will want to use db.document.find_one() as opposed to db.document.find(), so that only your current record is returned.
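A minimal sketch of the append-plus-TTL approach described above, assuming a created_at datetime field, a doc_key business key, and an events collection (all invented for this example):

import datetime
from pymongo import MongoClient, DESCENDING

events = MongoClient("mongodb://localhost:27017").mydb.events

# TTL index: MongoDB removes documents roughly 30 days after created_at.
events.create_index("created_at", expireAfterSeconds=30 * 24 * 3600)

# Append a new revision instead of updating the old document in place.
events.insert_one({
    "doc_key": "some-business-key",
    "created_at": datetime.datetime.utcnow(),
    "new_field": "value",
})

# Read only the most recent revision for that key.
latest = events.find_one({"doc_key": "some-business-key"},
                         sort=[("_id", DESCENDING)])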
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
Bulk inserts will be faster than inserting each document sequentially.
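For comparison, a rough PyMongo sketch of applying the new fields in bulk with $set rather than one round trip per document (collection and field names are made up):

from pymongo import MongoClient, UpdateOne

coll = MongoClient("mongodb://localhost:27017").mydb.documents

# Build one UpdateOne per document, setting only the newly computed fields
# instead of replacing the whole document.
requests = []
for doc in coll.find({}, {"_id": 1}):
    new_fields = {"derived_a": 1, "derived_b": 2}   # placeholder computation
    requests.append(UpdateOne({"_id": doc["_id"]}, {"$set": new_fields}))

    # Flush in batches to keep memory bounded.
    if len(requests) == 1000:
        coll.bulk_write(requests, ordered=False)
        requests = []

if requests:
    coll.bulk_write(requests, ordered=False)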

MongoDB indexing user-defined schemas

We are currently using MongoDB to allow tenants in a SaaS application to define entities that they can use in the application. We do not know upfront how each tenant is going to define the fields for the entities that they are creating. Each entity will have a collection dynamically created for it in a separate database that belongs to the tenant.
For example, one tenant might define a Customer as First Name, Last Name, Email. Another tenant might define a Shipment as Shipment Ref, Ship Date, Owner, etc. Each tenant will have many entities/collections in their tenant database.
We have one field (ID) which we will always force the user to include in each entity/collection. We will index this field upfront when creating the collection.
However, how do we handle the case where we want to allow the tenant to sort/search/order/query large collections/entities quickly when/if the dataset becomes too large?
That is, since we do not know upfront what fields the user will be sorting/filtering/ordering by, what is the indexing strategy to use in this case with Mongo?
First of all, Mongo requires you to have an _id for each document and indexes it automatically. You should take advantage of this and not create yet another ID field if you require your clients to have one; I'm not sure whether that's the case in your application.
What you are asking for can't have a perfect solution, or even a most optimal one, but I can suggest a couple of options:
Create a single-field index for each field in the document and let the Mongo query optimizer decide which index to use depending on the query (see the sketch after these options). Disadvantages: it takes lots of space on disk and in memory and makes inserts slower, and Mongo can use only one index per condition clause, so it will not be able to use a compound index. You can easily extract the schema with a tool like this. I wrote this little prototype that analyzes and prints a Mongo schema.
Let your application learn what indexes to create. Get slow queries from the Mongo profiler (in the Mongo log), analyze the common parts (automatically?) and create indexes on the most commonly used fields. That's not so easy to implement, and the efficiency might change over time if your client changes queries or data. The application will be slow at the start until it learns about itself :).
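As a rough sketch of the first option (the tenant database, collection name, and field list are all assumptions here), single-field indexes could be created when the tenant defines an entity:

from pymongo import MongoClient, ASCENDING

db = MongoClient("mongodb://localhost:27017")["tenant_db"]

def index_every_field(collection_name, field_names):
    # One single-field index per user-defined field; create_index is a
    # no-op if an identical index already exists.
    coll = db[collection_name]
    for field in field_names:
        coll.create_index([(field, ASCENDING)])

# Example: a tenant defines a "customer" entity with three fields.
index_every_field("customer", ["first_name", "last_name", "email"])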
I would just like to emphasize, in choosing your design, that if the ID field you mention (as opposed to _id) is actually some unique entity identifier, then you are better off putting it in _id.
The reason is that the performance trade-off of maintaining another unique index on top of the required _id is considerable overhead. Also, since _id is required, it is the first thing MongoDB looks at when determining which index to use. Otherwise, consider a compound _id field containing your entity information plus some other useful uniqueness.
As for the user-defined fields, which are kind of the essence of Mongo documents, for my money I would make it part of the API to set up indexes as required. Depending on the type of searching that is happening, you'll probably want compound indexes and generated queries that make sense for these.
Simply indexing every field will probably be of limited use, as only one index is going to be picked for the find anyhow, and the query optimizer is going to try all of them. As has been mentioned, a longer-term option could be to set indexes according to the usage patterns, but it could take some work to do.
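For example, a compound _id along the lines suggested above could look like this (the tenant/entity structure is invented for illustration):

from pymongo import MongoClient

customers = MongoClient("mongodb://localhost:27017").tenant_db.customer

# Pack the natural identifier into _id instead of adding a second unique index.
customers.insert_one({
    "_id": {"tenant": "acme", "entity_id": "CUST-0001"},
    "first_name": "Ada",
    "last_name": "Lovelace",
})

# Lookups hit the built-in _id index; note that an embedded-document _id must
# be matched field-for-field and in the same field order.
doc = customers.find_one({"_id": {"tenant": "acme", "entity_id": "CUST-0001"}})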

Many to many update in MongoDB without transactions

I have two collections with a many-to-many relationship. I want to store an array of linked ObjectIds in both documents so that I can take Document A and retrieve all linked Document B's quickly, and vice versa.
Creating this link is a two-step process:
Add Document A's ObjectId to Document B
Add Document B's ObjectId to Document A
After watching a MongoDB video, I found this to be the recommended way of storing a many-to-many relationship between two collections.
I need to be sure that both updates are made. What is the recommended way of robustly dealing with this crucial two step process without a transaction?
I could condense this relationship into a single link collection, the advantage being a single update with no chance of Document B missing the link to Document A. The disadvantage is that I'm not really using MongoDB as intended. But because there is only a single update, it seems more robust to have a link collection that defines the many-to-many relationship.
Should I use safe mode and manually check the data went in afterwards and try again on failure? Or should I represent the many-to-many relationship in just one of the collections and rely on an index to make sure I can still quickly get the linked documents?
Any recommendations? Thanks
@Gareth, you have multiple legitimate ways to do this, so the key concern is how you plan to query the data (i.e., which queries need to be fast).
Here are a couple of methods.
Method #1: the "links" collection
You could build a collection that simply contains mappings between the collections.
Pros:
Supports atomic updates so that data is not lost
Cons:
Extra query when trying to move between collections
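A rough sketch of such a links collection (collection and field names invented):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").mydb
a_id, b_id = ..., ...  # assumed to be known ObjectIds

# One document per A<->B link; a unique compound index prevents duplicates,
# and each link is created with a single, atomic insert.
db.links.create_index([("a_id", 1), ("b_id", 1)], unique=True)
db.links.insert_one({"a_id": a_id, "b_id": b_id})

# Moving from A to B costs the extra query: first the links, then the targets.
b_ids = [link["b_id"] for link in db.links.find({"a_id": a_id})]
docs_b = list(db.collection_b.find({"_id": {"$in": b_ids}}))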
Method #2: store copies of smaller mappings in larger collection
For example: you have millions of Products, but only a hundred Categories. Then you would store the Categories as an array inside each Product.
Pros:
Smallest footprint
Only need one update
Cons:
Extra query if you go the "wrong way"
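A sketch of method #2 with the Products/Categories example (all names assumed):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").mydb
product_id, category_id = ..., ...  # assumed to be known ObjectIds

# Store the small side of the relationship as an array in the large side;
# $addToSet keeps the array duplicate-free in a single atomic update.
db.products.update_one(
    {"_id": product_id},
    {"$addToSet": {"category_ids": category_id}},
)

# Going the "right way" (product -> its categories) is a single document read.
product = db.products.find_one({"_id": product_id})

# Going the "wrong way" (category -> its products) needs a query on the array;
# a multikey index on category_ids keeps it fast.
db.products.create_index("category_ids")
products_in_category = db.products.find({"category_ids": category_id})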
Method #3: store copies of all mappings in both collections
(what you're suggesting)
Pros:
Single query access to move between either collection
Cons:
Potentially large indexes
Needs transactions (?)
Let's talk about "needs transactions". There are several ways to do transactions and it really depends on what type of safety you require.
Should I use safe mode and manually check the data went in afterwards and try again on failure?
You can definitely do this. You'll have to ask yourself, what's the worst that happens if only one of the saves fails?
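One hedged sketch of that check-and-repair idea for method #3 (collection names, array fields, and ids are assumptions):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").mydb

def link_both_sides(a_id, b_id):
    # Acknowledged writes (the default) raise on failure, and $addToSet makes
    # each step idempotent, so the whole operation can simply be retried.
    db.collection_a.update_one({"_id": a_id}, {"$addToSet": {"b_ids": b_id}})
    db.collection_b.update_one({"_id": b_id}, {"$addToSet": {"a_ids": a_id}})

def link_is_consistent(a_id, b_id):
    # Manual check afterwards: both sides must reference each other.
    a_ok = db.collection_a.find_one({"_id": a_id, "b_ids": b_id}) is not None
    b_ok = db.collection_b.find_one({"_id": b_id, "a_ids": a_id}) is not None
    return a_ok and b_ok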
Method #4: queue the change
I don't know if you've ever worked with queues, but if you have some leeway you can build a simple queue and have different jobs that update their respective collections.
This is a much more advanced solution. I would tend to go with #2 or #3.
Why don't you create a dedicated collection holding the relations between A and B as dedicated rows/documents, as one would do in an RDBMS? You can modify the relation table with one operation, which is of course atomic.
Should I use safe mode and manually check the data went in afterwards and try again on failure?
Yes, this is an approach, but there is another - you can implement an optimistic transaction. It has some overhead and limitations, but it guarantees data consistency. I wrote an example and some explanation on a GitHub page.
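This is not the linked GitHub example, but as one common flavour of the idea, a compare-and-swap style update on a version field can serve as the optimistic building block (collection and field names are assumptions, and documents are assumed to already exist with a numeric version field):

from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017").mydb.collection_a

def add_link_optimistically(a_id, b_id):
    # Read the current version, then update only if nobody else has changed
    # the document in the meantime; retry if the race is lost.
    while True:
        doc = coll.find_one({"_id": a_id}, {"version": 1})
        result = coll.update_one(
            {"_id": a_id, "version": doc["version"]},
            {"$addToSet": {"b_ids": b_id}, "$inc": {"version": 1}},
        )
        if result.modified_count == 1:
            return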