I am trying to replace all documents with new values in bulk.
Example
We have 500k docs in the database, and we have the same 500k docs with updated properties inside. Now we need to replace the old data.
The idea was to use insertMany (with the lean option) to write into a new collection and then remove the old one, to keep the number of reads/writes low.
The question: is there something easier for such a scenario?
Maybe import/export is even better in this case?
PS
Model.updateMany() takes a filter, but we do not need one here: we know for sure that every document has updated properties, so we just need to replace them all.
There are different options:
1. insertMany inserts many documents.
2. updateMany with upsert will insert or update documents.
3. replaceOne will replace a matching document.
If your new documents have an altogether new set of fields, you could use replace.
If only values change, or new fields are added on top of the existing ones, you could use updateMany.
There is also bulkWrite, which lets you batch these operations in one call; it is worth exploring too.
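To make the bulkWrite idea concrete, here is a minimal sketch in PyMongo (the Mongoose Model.bulkWrite() call accepts an equivalent array of operations); the collection name, the batch size, and matching on _id are assumptions about your data:

```python
# A minimal sketch, assuming documents are matched on _id and live in a
# collection called "products"; the batch size is arbitrary.
from pymongo import MongoClient, ReplaceOne

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["products"]

def bulk_replace(updated_docs, batch_size=1000):
    """Replace existing documents with their updated versions in batches."""
    ops = []
    for doc in updated_docs:
        # Each operation matches one old document by _id and swaps in the new one.
        ops.append(ReplaceOne({"_id": doc["_id"]}, doc))
        if len(ops) == batch_size:
            coll.bulk_write(ops, ordered=False)  # unordered: one failure doesn't stop the batch
            ops = []
    if ops:
        coll.bulk_write(ops, ordered=False)
```

Sending ReplaceOne operations in unordered batches keeps the whole refresh down to a handful of round trips instead of 500k individual writes.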
Edit:
The best option for this use case is creating a new collection every day; 500k documents is a number that is easy to handle. Then swap the new and old collections.
Otherwise you could use the replace option.
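For the collection-swap approach, a rough PyMongo sketch (the collection names are made up, and any indexes the application relies on would have to be recreated on the staging collection before the swap):

```python
# Sketch of "insert into a fresh collection, then swap"; names are made up.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]

def rebuild_collection(updated_docs):
    staging = db["products_staging"]
    staging.drop()                                    # start from an empty staging collection
    staging.insert_many(updated_docs, ordered=False)  # one bulk load of the new data
    # (recreate any indexes the application needs on `staging` here)
    staging.rename("products", dropTarget=True)       # swap: the old "products" is dropped
```

The rename with dropTarget=True makes the cut-over a single server-side operation, so readers are never pointed at a half-written collection.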
Related
I am trying to update many documents in a single query. How can I do that so that I don't have to loop over a list and update each document individually?
You can create an array of the operations that you want and use a bulkWrite (view the docs here).
That way you don't need to make a lot of requests to get all the updates done. You can choose whether you want the operations to be ordered or unordered, and each type of operation has its own behavior. You can also choose which write concern level you want.
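As an illustration (shown in PyMongo here; other drivers expose the same operation types), one bulkWrite can carry inserts, updates and deletes together, with ordered/unordered behavior and the write concern of your choice. The collection and field names below are placeholders:

```python
# Placeholder filters/fields; shows mixed operations, unordered execution and
# an explicit write concern.
from pymongo import MongoClient, InsertOne, UpdateOne, DeleteOne
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["items"].with_options(write_concern=WriteConcern(w="majority"))

ops = [
    InsertOne({"sku": "A-1", "qty": 10}),
    UpdateOne({"sku": "B-2"}, {"$set": {"qty": 5}}, upsert=True),
    DeleteOne({"sku": "C-3"}),
]

# ordered=False lets the remaining operations run even if one of them fails.
result = coll.bulk_write(ops, ordered=False)
print(result.inserted_count, result.modified_count, result.deleted_count)
```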
I'm working with the Mongo Java Driver, but looking through Mongo's documentation, this doesn't look driver-specific.
update(filter, update) can update multiple documents but returns a WriteResult, which only provides flags/counts.
findOneAndUpdate(filter, update) returns the actual document that was modified, but it can only update one document at a time.
Is there no way to do this in one call? If not, the client would have to call find(filter), then update(filter, update), then find(...) with a new filter matching the IDs obtained in the initial find (since the update can potentially change document values that were in the initial filter).
Is there a better way?
I am unaware of any write commands that return a cursor, which is essentially what you are asking for, nor do I see anything relevant in the driver source.
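The workaround described in the question can be sketched like this (PyMongo shown for brevity; the Java driver has the same find/update primitives, and the filter and update used here are placeholders):

```python
# Placeholder filter/update; the point is to pin down the matched documents by
# _id before the update can change the fields the original filter relied on.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["orders"]

filter_ = {"status": "pending"}
update = {"$set": {"status": "processed"}}

# 1. Remember which documents match right now.
ids = [d["_id"] for d in coll.find(filter_, {"_id": 1})]

# 2. Update all of them in a single command.
coll.update_many({"_id": {"$in": ids}}, update)

# 3. Re-read the modified documents by _id (the original filter may no longer match them).
modified_docs = list(coll.find({"_id": {"$in": ids}}))
```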
Is there a way to write a script that updates a document by adding a duplicate field with a different value? I cannot use $set, as that replaces the existing value. I cannot use $push, as the field is in an object, not an array. I even tried creating the new field with a different name and renaming it, which also replaces the existing field.
You cannot have duplicate fields in a Mongo record. A Mongo collection is a collection of documents, otherwise known as objects. You cannot have a duplicate field in an object and Mongo is no different.
MongoDB (and any other database that I have come across so far) is built around the idea that individual fields are identifiable so they can be filtered by, grouped by, sorted by, etc... That also explains why MongoDB does not provide support for the scenario you're facing. That being said, MongoDB can be used as a dumb datastore for arbitrary JSON data. And the JSON specification does not say anything about duplicate field names which is probably why you can actually store such a document in MongoDB in the first place.
Anyway, there is no way to achieve what you want without loading the entire document, changing it (by adding the duplicate field(s)) and then replacing the whole document. That, however, will work.
I personally cannot think of a reasonable scenario where this sort of document could make sense, though. So I would strongly suggest you revisit your document structure.
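For what it's worth, the load-modify-replace pattern described above looks roughly like this in PyMongo. Note that PyMongo documents are plain Python dicts, which cannot hold duplicate keys, so this sketch only shows the replace-the-whole-document part, not the duplicate field itself:

```python
# Only the load-and-replace part is shown; a plain Python dict cannot carry
# duplicate keys, so the duplicate field itself is not representable here.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["things"]

def change_and_replace(doc_filter, field, value):
    doc = coll.find_one(doc_filter)                   # load the entire document
    if doc is not None:
        doc[field] = value                            # change it in application code
        coll.replace_one({"_id": doc["_id"]}, doc)    # write the whole document back
```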
I have a Python application that iterates over every document in a MongoDB (3.0.2) collection (typically between 10K and 1M documents) and adds new fields (probably doubling/tripling the number of fields in the document).
My initial thought was to upsert the entire revised documents (using PyMongo), but now I'm questioning that:
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
This is actually a great question, and it can be solved in a few different ways depending on how you manage your data.
If you are upserting additional fields, does that mean your data gains extra fields at a later point in time, with the only change being the addition of those fields? If so, you could set a TTL on your documents so that the old ones drop off over time. Keep in mind that if you do this, you will want an index that sorts your results by descending _id, so that the most recent additions are selected before the older ones.
The benefit of doing it this way is that you are continually writing data, as opposed to seeking and updating data, so it is faster.
In regards to upserts vs. bulk inserts: bulk inserts are always faster than upserts, since upserting requires you to find the original document first.
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
You really need to understand your data fully to determine what is best, but if the only change to the data is additional fields, or changes that only need to be considered from that point forward, then bulk inserting and setting a TTL on your older data is the better method from the standpoint of write operations, as opposed to seek, find and update. When using this method you will want to use db.document.find_one() rather than db.document.find(), so that only your current record is returned (see the sketch at the end of this answer).
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
Bulk inserts will be faster than inserting each document sequentially.
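A rough sketch of the append-and-expire approach in PyMongo; the createdAt field, the 7-day TTL and the key field identifying the logical record are all assumptions about the data:

```python
# Assumed: a "createdAt" timestamp on every document, a 7-day TTL, and a "key"
# field identifying the logical record.
import datetime
from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["records"]

# TTL index: the server removes documents ~7 days after their createdAt value.
coll.create_index("createdAt", expireAfterSeconds=7 * 24 * 3600)

def append_revisions(revised_docs):
    now = datetime.datetime.utcnow()
    for d in revised_docs:
        d["createdAt"] = now
    coll.insert_many(revised_docs, ordered=False)     # bulk insert: no seek/update

def current_record(key):
    # Newest _id first, so find_one returns only the latest revision.
    return coll.find_one({"key": key}, sort=[("_id", DESCENDING)])
```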
I am using MongoDB to save data about products. After writing the initial large dataset (24 million items) I would like to change all the items in the collection.
Therefore I use a cursor to iterate over the whole collection, and I want to add a "row" or field to every item. With large datasets this is not working: only 180,000 items were updated. On a small scale it works. Is that normal behavior?
Is MongoDB not supposed to support writes while iterating with a cursor over the whole collection?
What would be a good practice to do that instead?
For larger collections, you might run into snapshotting problems. When you add data to a document and save it, it grows, forcing MongoDB to move the document around on disk. You might then encounter the same document twice while iterating.
You can either use $snapshot in your query, or use a stable order such as sort({"_id":1}). Note that you can't use both.
Also make sure to use at least acknowledged write concern.
When we had a similar problem, we fetched the data in 100k chunks (after some testing). It's a quick and simple solution.
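A sketch of that chunked approach in PyMongo: walk the collection in a stable _id order, remember the last _id seen, and apply the new field in batches. The chunk size, the field name and the derived value are placeholders:

```python
# Chunk size, field name and derived value are placeholders.
from pymongo import MongoClient, ASCENDING, UpdateOne

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["products"]

CHUNK = 100_000

def new_value(doc):
    # placeholder for whatever the new field should contain
    return len(doc)

last_id = None
while True:
    query = {} if last_id is None else {"_id": {"$gt": last_id}}
    batch = list(coll.find(query).sort("_id", ASCENDING).limit(CHUNK))
    if not batch:
        break
    ops = [UpdateOne({"_id": d["_id"]}, {"$set": {"newField": new_value(d)}})
           for d in batch]
    coll.bulk_write(ops, ordered=False)
    last_id = batch[-1]["_id"]       # resume after the last _id already processed
```

Because each pass re-queries from the last processed _id in a fixed sort order, moved documents are never visited twice, which sidesteps the snapshotting issue described above.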