I'm using an update operation with upsert. I want to retrieve all documents that have been modified after an update.
for key in categories_links:
    collection.update({"name": key}, {"name": key, "url": categories_links[key]}, True)
You should use a timestamp field in your documents if you ever need to find which ones were updated and when. There is a BSON type for that.
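For example, here is a rough pymongo sketch of that idea (pymongo 3+ and MongoDB 2.6+ assumed for update_one and $currentDate; the lastModified field name and the sample categories_links dict are my own, not from the question):

from datetime import datetime, timedelta
from pymongo import MongoClient

collection = MongoClient().testdb.categories               # hypothetical db/collection names
categories_links = {"books": "http://example.com/books"}   # stand-in for the dict in the question

for key, url in categories_links.items():
    collection.update_one(
        {"name": key},
        {
            "$set": {"url": url},
            # $currentDate stamps the server's clock on every insert and update
            "$currentDate": {"lastModified": True},
        },
        upsert=True,
    )

# later: fetch everything touched in, say, the last five minutes
cutoff = datetime.utcnow() - timedelta(minutes=5)
recently_modified = list(collection.find({"lastModified": {"$gte": cutoff}}))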
To my knowledge, pymongo will not return a list of all of the records which have been modified by an update.
However, if you are using a replica set, you might be able to accomplish this by looking at the oplog.
According to the documentation:
The oplog must translate multi-updates into individual operations in
order to maintain idempotency. This can use a great deal of oplog
space without a corresponding increase in data size or disk use.
If you want to keep track of each element being updated, you might instead do a find(), and then loop through those to do an individual update() on each. Obviously this would be much slower, but it may be an acceptable tradeoff for your specific use case.
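A rough sketch of that find-then-update-each approach with pymongo 3+, recording which documents were actually changed (the setup names are hypothetical and the dict mirrors the question):

from pymongo import MongoClient

collection = MongoClient().testdb.categories               # hypothetical db/collection names
categories_links = {"books": "http://example.com/books"}   # stand-in for the dict in the question
modified_ids = []

for key, url in categories_links.items():
    for doc in collection.find({"name": key}):
        result = collection.update_one({"_id": doc["_id"]}, {"$set": {"url": url}})
        if result.modified_count:               # only count documents that actually changed
            modified_ids.append(doc["_id"])

# modified_ids now holds the _id of every document the loop really modified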
I'm using Node.js and the MongoDB driver.
A two-part question:
I have a collection called openOffers, whose documents I want to expire when they hit the time in closeOfferAt. I know that MongoDB offers TTL via expireAt and expireAfterSeconds. But when I use these, the same TTL is applied to all the documents in a particular collection. I'm not sure if I'm doing this correctly. I want document-level custom expiry. Any syntax would be very useful!
Docs in openOffers
{
  "_id": "12345",
  "data": {...},
  "closeOfferAt": "2022-02-21T23:22:34.023Z"
}
I want to push these expired documents to another collection, openOffersLog, so that I can retain the documents for later analysis.
Current approach:
I haven't figured out a way to have a customized TTL on each doc in openOffers. Instead, I currently insert docs into both openOffers and openOffersLog together, and any data change in openOffers has to be separately propagated to openOffersLog to ensure consistency. I suppose there has to be a more scalable approach.
EDIT-1:
I'm looking for some syntax logic that I can use for the above use case. If not possible with the current MongoDB driver, I'm looking for an alternative solution with NodeJS code I can experiment with. I'm new to both NodeJS and MongoDB -- so any reasoning supporting the solution would be super useful as well.
There are two ways to implement TTL indexes.
Delete after a certain amount of time - this is what you have already implemented
Delete at a specific clock time - for the details, you can visit the MongoDB docs
The second option fulfills your requirement.
Just set expireAfterSeconds to 0 (zero) when creating the index:
db.collection.createIndex({ "closeOfferAt": 1 }, { expireAfterSeconds: 0 })
Then set the expiration date in closeOfferAt when inserting the document; MongoDB will remove the document once that timestamp passes.
db.collection.insert({
  "_id": "12345",
  "data": {...},
  "closeOfferAt": ISODate("2022-02-23T06:14:15.840Z")
})
Do not make your application logic depend on TTL indexes: the background task that removes expired documents only runs periodically (every 60 seconds by default), so documents can outlive their expiry time.
In your app you should have a scheduler that runs periodic tasks; one of them can move the finished offers to the other collection and delete them from the original, even in bulk, with no need for a TTL index.
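For illustration, a minimal sketch of such a periodic pass, written with pymongo for brevity (the Node.js driver has equivalent insertMany/deleteMany calls; the database name is hypothetical, and the collection names come from the question):

from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient().offersdb   # hypothetical database name

def archive_closed_offers():
    now = datetime.now(timezone.utc)
    expired = list(db.openOffers.find({"closeOfferAt": {"$lte": now}}))
    if not expired:
        return
    db.openOffersLog.insert_many(expired)                 # copy to the log collection first
    ids = [doc["_id"] for doc in expired]
    db.openOffers.delete_many({"_id": {"$in": ids}})      # then remove from the live collection

# call archive_closed_offers() from cron or any job scheduler, e.g. once a minute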
For consistency, nothing beats a single source of truth, so if you can, avoid deleting and instead just change a status flag and timestamps.
A good use of a TTL index is to automatically clear old data after a relatively long time, like one month or more. This keeps collection and index sizes in check.
I have a Python application that iteratively goes through every document in a MongoDB (3.0.2) collection (typically between 10K and 1M documents) and adds new fields (probably doubling/tripling the number of fields in the document).
My initial thought was that I would upsert the entire revised document (using PyMongo) - now I'm questioning that:
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
This is actually a great question, and it can be solved a few different ways depending on how you are managing your data.
If you are upserting additional fields, does this mean your data is appending additional fields at a later point in time, with the only change being the addition of those fields? If so, you could set a TTL on your documents so that the old ones drop off over time. Keep in mind that if you do this, you will want an index that lets you sort your results by descending _id, so that the most recent additions are selected before the older ones.
The benefit of doing it this way is that you are continually writing data as opposed to seeking and updating data, so it is faster.
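A rough pymongo sketch of that append-plus-TTL pattern (collection and field names here are illustrative, not from the question):

from datetime import datetime, timezone
from pymongo import MongoClient, DESCENDING

coll = MongoClient().mydb.products   # hypothetical names

# let the TTL monitor drop versions older than ~30 days automatically
coll.create_index("createdAt", expireAfterSeconds=30 * 24 * 3600)

# instead of updating in place, append a new, enriched version of the document
coll.insert_one({
    "sku": "ABC-1",
    "price": 10,
    "extra_field": 42,
    "createdAt": datetime.now(timezone.utc),
})

# read with a descending _id sort so the most recent version is returned first
latest = coll.find_one({"sku": "ABC-1"}, sort=[("_id", DESCENDING)])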
In regards to upserts vs. bulk inserts: bulk inserts are always faster than upserts, since an upsert requires finding the original document first.
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
You really need to understand your data fully to determine what is best, but if the only change to the data is additional fields, or changes that only need to be considered from that point forward, then bulk inserting and setting a TTL on your older data is the better method from the standpoint of write operations, as opposed to seek, find and update. When using this method you will want to use db.document.find_one() as opposed to db.document.find(), so that only your current record is returned.
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
Bulk inserts will be faster than inserting each document sequentially.
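For example, a quick pymongo sketch contrasting the two write paths (collection and field names are hypothetical):

from pymongo import MongoClient, UpdateOne

coll = MongoClient().mydb.products   # hypothetical names
docs = [{"sku": i, "extra_field": i * 2} for i in range(10_000)]

# one round trip per document: slow
# for d in docs:
#     coll.insert_one(d)

# a single batched call: far fewer round trips
coll.insert_many(docs, ordered=False)

# if you really do need to update existing documents, batch those too with bulk_write
ops = [UpdateOne({"sku": d["sku"]}, {"$set": {"extra_field": d["extra_field"]}}) for d in docs]
coll.bulk_write(ops, ordered=False)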
I've got a MongoDB instance with a collection in it which has around 17 million records.
I wish to alter the document structure (to add a new attribute to the document) of all 17 million documents, so that I don't have to programmatically deal with different structures, as well as to make queries easier to write.
I've been told though that if I run an update script to do that, it will lock the whole database, potentially taking down our website.
What is the easiest way to alter the document without this happening? (I don't mind if the update happens slowly, as long as it eventually happens)
The query I'm attempting to do is:
db.history.update(
  { type: { $exists: false } },
  { $set: { type: 'PROGRAM' } },
  { multi: true }
)
You can update the collection in batches (say, half a million per batch); this will distribute the load.
I created a collection with 20,000,000 records and ran your query on it. It took ~3 minutes to update on a virtual machine, and I could still read from the db in a separate console.
> for(var i=0;i<20000000;i++){db.testcoll.insert({"somefield":i});}
The locking in mongo is quite lightweight, and it is not going to be held for the whole duration of the update. Think of it like 20000000 separate updates. You can read more here:
http://docs.mongodb.org/manual/faq/concurrency/
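If you do want to drive the batching yourself, here is a rough pymongo sketch of the idea (the collection and field names come from your query; the database name and batch size are arbitrary and worth tuning):

from pymongo import MongoClient

history = MongoClient().mydb.history   # hypothetical database name
BATCH = 50_000

while True:
    # grab the next batch of _ids that still lack the field
    ids = [d["_id"] for d in history.find({"type": {"$exists": False}}, {"_id": 1}).limit(BATCH)]
    if not ids:
        break
    history.update_many({"_id": {"$in": ids}}, {"$set": {"type": "PROGRAM"}})
    # optionally sleep here between batches to further smooth the load on the server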
You do actually care if your update query is slow, because of the write lock problem on the database that you are aware of; the two are tightly linked. It's not a simple read query here - you really want this write query to be as fast as possible.
Updating the "find" part is part of the key here. First, since your collection has millions of documents, it's a good idea to keep the field name size as small as possible (ideally one single character : type => t). This helps because of the schemaless nature of mongodb collections.
Second, and more importantly, you need to make your query use a proper index. For that you need to work around the $exists operator, which is not well optimized (there are actually several ways to do this).
Third, you can work on the field values themselves. Use http://bsonspec.org/#/specification to estimate the size of the value you want to store, and possibly pick a better choice (in your case, you could for example replace the 'PROGRAM' string with a numeric constant and gain a few bytes per document, multiplied by the number of documents touched by each multi-update query). The smaller the data you want to write, the faster the operation will be.
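For a quick size comparison you can encode candidate documents with pymongo's bson module (bson.encode is available from pymongo 3.9 onward):

import bson

print(len(bson.encode({"type": "PROGRAM"})))   # string value under the full field name
print(len(bson.encode({"t": 1})))              # one-character key and an int32: several bytes smaller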
A few links to other questions which can inspire you:
Can MongoDB use an index when checking for existence of a field with $exists operator?
Improve querying fields exist in MongoDB
I am using MongoDB to save data about products. After writing the initial large data-set (24 million items) I would like to change all the items in the collection.
Therefore I use a cursor to iterate over the whole collection. Then I want to add a "row" or field to every item in the collection. With large data-sets this is not working: only 180,000 items were updated. On a small scale it works. Is that normal behavior?
Is MongoDB not supposed to support writes while iterating with a cursor over the whole collection?
What would be a good practice to do that instead?
For larger collections, you might run into snapshotting problems. When you add data to the object and save it, it will grow, forcing MongoDB to move the document around. Then the cursor might return the same object twice.
You can either use $snapshot in your query, or use a stable order such as sort({"_id":1}). Note that you can't use both.
Also make sure to use at least acknowledged write concern.
When we had a similar problem, we fetched data in 100k chunks (a size we settled on after some testing). It's a quick and simple solution.
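A rough pymongo sketch of that chunked, _id-ordered approach (the collection name and the per-document computation are placeholders):

from pymongo import MongoClient, ASCENDING

coll = MongoClient().mydb.products    # hypothetical names
CHUNK = 100_000

def compute_new_field(doc):
    # stand-in for whatever enrichment you actually do per item
    return len(doc)

last_id = None
while True:
    query = {"_id": {"$gt": last_id}} if last_id is not None else {}
    # each chunk is a fresh, _id-ordered query, so moved documents are never revisited
    batch = list(coll.find(query).sort("_id", ASCENDING).limit(CHUNK))
    if not batch:
        break
    for doc in batch:
        coll.update_one({"_id": doc["_id"]}, {"$set": {"new_field": compute_new_field(doc)}})
    last_id = batch[-1]["_id"]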