Mongo delete and insert vs update - mongodb

I am using MongoDB 3.0 with the Java driver. I have a collection with 100,000+ entries. Each day there will be approximately 500 updates and 500 inserts, which should be done in a batch. I will receive the updated documents containing the old fields plus some new ones that I have to store. I don't know which fields are newly added, and for each field I maintain a summary statistic. Since I don't know what changed, I will have to fetch the records that already exist to see the difference between the updated documents and the new ones, so that I can set the summary statistics appropriately. So I wanted input on how this can be done efficiently.
Should I delete the existing records and insert them again, or should I update the 500 records? And should I consider doing 1000 upserts if that has potential advantages?
Example use case
The initial record contains f = [185, 75, 186]. I will get the update request as f = [185, 75, 186, 1, 2, 3] for the same record. The summary statistics mentioned above store the counts of the ids in f, so the counts for 1, 2, 3 will be increased and the counts for 185, 75, 186 will remain the same.

Upserts are used to add a document if it does not exist. So if you're expecting new documents then yes, set {upsert: true}.
In order to update your statistics I think the easiest way is to redo the statistics if you were doing it in mongo (e.g. using the aggregation framework). If you index your documents properly it should be fine. I assume that your statistics update is an offline operation.
If you weren't doing the statistics in mongo then you can add another collection where you can save the updates along with the old fields (of course you update your current collection too) so you will know which documents have changed during the day. At the end of the day you can just remove this temporary/log collection once you've extracted the needed information.
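To make that fetch-and-diff flow concrete, here is a minimal PyMongo sketch (the question uses the Java driver, so treat this as pseudocode for the same calls). The collection names items and field_stats and the field name f come from the example use case; everything else is an assumption.

from pymongo import MongoClient, UpdateOne

client = MongoClient()
db = client["mydb"]                 # hypothetical database name
items = db["items"]                 # documents holding the "f" array
field_stats = db["field_stats"]     # one counter document per id seen in "f"

def apply_update(doc_id, new_f):
    existing = items.find_one({"_id": doc_id}, {"f": 1})
    old_f = set(existing["f"]) if existing else set()
    added = set(new_f) - old_f      # only newly added ids affect the statistics

    if added:
        # Bump the counter for each newly seen id; upsert creates missing counters.
        field_stats.bulk_write(
            [UpdateOne({"_id": fid}, {"$inc": {"count": 1}}, upsert=True) for fid in added]
        )

    # Upsert the document itself, so brand-new documents are inserted too.
    items.update_one({"_id": doc_id}, {"$set": {"f": list(new_f)}}, upsert=True)

apply_update("doc-1", [185, 75, 186, 1, 2, 3])

With ~1000 changes per day this stays cheap as long as the key you match on (here _id) is indexed.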

Mongo keeps a log of every change in the oplog.rs capped collection in the local database. We create a tailable cursor on oplog.rs starting from a timestamp, and every change to the db/collection is streamed through it. I believe this is the best way to identify changes in mongo. You can of course skip the document changes you are not interested in.
Further reading: http://docs.mongodb.org/manual/reference/glossary/#term-oplog
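For illustration, a minimal PyMongo sketch of such a tailable cursor; it assumes a replica set, and the namespace filter "mydb.items" is a placeholder.

import time
from pymongo import MongoClient, CursorType
from bson.timestamp import Timestamp

client = MongoClient()
oplog = client["local"]["oplog.rs"]

start = Timestamp(int(time.time()), 0)   # only stream changes from now onwards
cursor = oplog.find(
    {"ts": {"$gt": start}, "ns": "mydb.items"},
    cursor_type=CursorType.TAILABLE_AWAIT,
    oplog_replay=True,
)

while cursor.alive:
    for entry in cursor:
        # entry["op"] is "i", "u" or "d"; entry["o"]/"o2" hold the change details.
        print(entry["op"], entry.get("o"), entry.get("o2"))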


Related

How to convert MongoDB Oplog file into actual query

I want to convert the MongoDB local Oplog file into actual queries, so I can execute those queries and get an exact copy of the database.
Is there any package, file, built-in tool, or script for it?
It's not possible to get the exact query from the oplog entry because MongoDB doesn't save the query.
The oplog has an entry for each atomic modification performed. Multi-inserts/updates/deletes performed on the mongo instance using a single query are converted to multiple entries and written to the oplog collection. For example, if we insert 10,000 documents using Bulk.insert(), 10,000 new entries will be created in the oplog collection. Now the same can also be done by firing 10,000 Collection.insertOne() queries. The oplog entries would look identical! There is no way to tell which one actually happened.
Sorry, but that is impossible.
The reason is that the oplog doesn't store queries. The oplog contains only the changes (insert, update, delete) to the data, and it exists for replication and redo.
Getting an exact copy of a DB is called "replication", and that is of course supported by the system.
To "replicate" changes to, for example, one DB or collection, you can use https://www.mongodb.com/docs/manual/changeStreams/.
You can reconstruct the operations from the oplog. The oplog defines multiple op types, for instance op: "i", "u", "d" for insert, update, and delete. For these types, check the "o"/"o2" fields, which hold the corresponding data and filters.
Then, based on the op type, call the corresponding driver APIs: db.collection.insert()/update()/delete().
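As a rough illustration (not a complete replayer), here is a PyMongo sketch that dispatches plain oplog entries by op type onto the corresponding driver calls. The target database name is an assumption, and commands, transactions and the newer "$v": 2 diff-format updates are not handled.

from pymongo import MongoClient

client = MongoClient()
target_db = client["copydb"]    # hypothetical target database

def replay(entry):
    _, coll_name = entry["ns"].split(".", 1)
    coll = target_db[coll_name]
    if entry["op"] == "i":
        coll.insert_one(entry["o"])
    elif entry["op"] == "u":
        update = entry["o"]
        if any(key.startswith("$") for key in update):
            # Operator-style entry, e.g. {"$set": {...}}; "o2" holds the filter.
            coll.update_one(entry["o2"], update)
        else:
            coll.replace_one(entry["o2"], update)   # full-document replacement
    elif entry["op"] == "d":
        coll.delete_one(entry["o"])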

Managing transaction documents with MongoDB

Imagine you have millions of users who perform transactions on your platform. Assuming each transaction is a document in your MongoDB collection, millions of documents would be generated every day, exploding your database in no time. I have received the following suggestions from friends and family.
Having a TTL index on the documents - This won't work because we need those documents stored somewhere so that they can be retrieved at a later point in time when the user asks for them.
Sharding the collection with timestamp as the key - This won't help us control the time frame for which we want the data to be backed up.
I would like to understand and implement a strategy somewhat similar to what banks follow. They keep your transactions up to a certain point (e.g. 6 months), after which you have to request them via support or another channel. I am assuming they follow a hot/cold storage pattern, but I am not completely sure about it.
The entire point is to manage transaction documents and, on a daily basis, back up or move the older records to another place where they can be read from. Any idea how that is possible with MongoDB?
Update: Sample document (please note that a few other keys have been redacted from the document)
{
"_id" : ObjectId("5d2c92d547d273c1329b49f0"),
"transactionType" : "type_3",
"transactionTimestamp" : ISODate("2019-07-15T14:51:54.444Z"),
"transactionValue" : 0.2,
"userId" : ObjectId("5d2c92f947d273c1329b49f1")
}
First, create a collection where you want to keep all the records (using your sample, let's say the current entries are stored in a collection named A).
Then take a backup every day at midnight and, after a successful backup, restore it into a collection named with that timestamp.
Once the archived entries are stored, you can truncate the original collection.
With this approach the working collection stays small, while all records remain available.
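One possible shape for that daily job, as a minimal PyMongo sketch: copy documents older than the hot window into a dated archive collection, then remove them from the working collection. The database/collection names and the 6-month cutoff are assumptions based on the question.

from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient()
db = client["payments"]                           # hypothetical database name
cutoff = datetime.utcnow() - timedelta(days=180)  # keep ~6 months "hot"

archive = db["transactions_archive_" + datetime.utcnow().strftime("%Y%m%d")]
old_docs = list(db["transactions"].find({"transactionTimestamp": {"$lt": cutoff}}))

if old_docs:
    archive.insert_many(old_docs)                 # cold copy, read only on demand
    db["transactions"].delete_many({"transactionTimestamp": {"$lt": cutoff}})

For large volumes you would page through the old documents in batches rather than materialising them in one list, but the flow is the same.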

Duplicate efficiently documents in MongoDB

I would like to find out the most efficient way to duplicate documents in MongoDB, given that I want to take a bunch of documents from an existing collection, update one of their fields, unset _id to generate a new one, and push them back into the collection to create duplicates.
This is typically to create a "branching" feature in MongoDB, allowing users to modify data in two separate branches at the same time.
I've tried the following things:
On my server, fetch the data in chunks across multiple threads, modify it, and insert the modified data with a new _id back into the database.
This basically works, but performance is not great (~20 s for 1 million elements).
In a future MongoDB version (tested on version 4.1.10), use the new $out aggregation mechanism to insert into the same collection.
This does not seem to work and raises the error message "errmsg" : "$out with mode insertDocuments is not supported when the output collection is the same as the aggregation collection"
Any ideas how to be faster than the first approach? Thanks!
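For reference, a minimal PyMongo sketch of the first approach (read in chunks, drop _id, tweak one field, re-insert with unordered insert_many); the collection name and the "branch" field are assumptions for illustration.

from pymongo import MongoClient

client = MongoClient()
coll = client["mydb"]["documents"]

batch = []
for doc in coll.find({"branch": "main"}, batch_size=10_000):
    doc.pop("_id")                    # let MongoDB assign a fresh _id
    doc["branch"] = "feature-x"       # the one field that changes per copy
    batch.append(doc)                 # new docs don't match the filter, so no re-read
    if len(batch) == 10_000:
        coll.insert_many(batch, ordered=False)   # unordered inserts are faster
        batch = []
if batch:
    coll.insert_many(batch, ordered=False)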

How to get inserted or updated documents only in MongoDB

I am using MongoDB in my web API. The database is updated and inserted into by other sources.
How do I query MongoDB to get only newly inserted or updated documents?
I can get sorted documents with the query below, but this doesn't solve my purpose:
db.collectionName.findOne({}, {sort:{$natural:-1}})
Is there any log or other mechanism, like INSERTED and UPDATED in SQL?
What are these newly inserted/updated documents in your context?
Assuming that you need to fetch newly inserted/updated documents relative to a point in time, you need a field in your collection that holds the timestamp (for example, inserted and lastUpdated fields); the $currentDate update operator can help with maintaining it. But this needs application changes.
Or you can use change streams for trigger-like functionality: you can listen for changes and take action as they are made.
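A minimal PyMongo sketch of the timestamp-field idea: stamp every write with $currentDate, then query by that field. The collection and field names are assumptions.

from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient()
coll = client["mydb"]["items"]

# Writers set/refresh the timestamp on every insert or update.
coll.update_one(
    {"_id": 42},
    {"$set": {"status": "processed"}, "$currentDate": {"lastUpdated": True}},
    upsert=True,
)

# Readers fetch everything touched since they last looked.
since = datetime.utcnow() - timedelta(minutes=5)
recent = list(coll.find({"lastUpdated": {"$gt": since}}))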

What is the preferred way to add many fields to all documents in a MongoDB collection?

I have a Python application that iterates over every document in a MongoDB (3.0.2) collection (typically between 10K and 1M documents) and adds new fields (probably doubling/tripling the number of fields in each document).
My initial thought was that I would upsert the entirety of the revised documents (using PyMongo) - now I'm questioning that:
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
This is actually a great question that can be solved in a few different ways, depending on how you are managing your data.
If you are upserting additional fields, does that mean your data gains additional fields at a later point in time, with the only changes being those added fields? If so, you could set a TTL on your documents so that the old ones drop off over time. Keep in mind that if you do this, you will want an index that sorts your results by descending _id so that the most recent additions are selected before the older ones.
The benefit of doing it this way is that you are continually writing data as opposed to seeking and updating data, so it is faster.
With regard to upserts vs bulk inserts: bulk inserts are always faster than upserts, since upserting requires finding the original document first.
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
You really need to understand your data fully to determine what is best, but if the only changes to the data are additional fields, or changes that only need to be considered from that point forward, then bulk inserting and setting a TTL on your older data is the better method from the standpoint of write operations, as opposed to seek, find and update. When using this method you will want db.document.find_one() as opposed to db.document.find() so that only your current record is returned.
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
Bulk inserts will be faster than inserting each document sequentially.
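A minimal PyMongo sketch of the difference (the "docs" collection name and the sample documents are placeholders):

from pymongo import MongoClient

client = MongoClient()
coll = client["mydb"]["docs"]
revised_docs = [{"key": i, "value": i * 2, "derived": True} for i in range(10_000)]

# Slow: one round trip (and one write acknowledgement) per document.
# for doc in revised_docs:
#     coll.insert_one(doc)

# Fast: the driver batches these into large bulk messages for the server.
coll.insert_many(revised_docs, ordered=False)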