I am trying to load data from a file created with mongoexport on one server into another server. New documents must be inserted, and existing ones must be updated with the data in the file. I see that the documentation refers to upsert and merge options, but the difference is not obvious to me.
Related
I want to convert the local MongoDB oplog into actual queries, so I can execute those queries and get an exact copy of the database.
Is there any package, file, built-in tool, or script for this?
It's not possible to get the exact query from the oplog entry because MongoDB doesn't save the query.
The oplog has an entry for each atomic modification performed. Multi-inserts/updates/deletes performed on the mongo instance using a single query are converted to multiple entries and written to the oplog collection. For example, if we insert 10,000 documents using Bulk.insert(), 10,000 new entries will be created in the oplog collection. Now the same can also be done by firing 10,000 Collection.insertOne() queries. The oplog entries would look identical! There is no way to tell which one actually happened.
Sorry, but that is impossible.
The reason is that the oplog doesn't contain queries. It includes only the changes (insert, update, delete) made to the data, and it exists for replication and redo.
Getting an exact copy of a DB is called "replication", and that is of course supported by the system.
To "replicate" changes to, for example, one DB or collection, you can use change streams: https://www.mongodb.com/docs/manual/changeStreams/.
You can reconstruct equivalent operations from the oplog. Each oplog entry has an op type: for instance, op: "i", "u", and "d" stand for insert, update, and delete. For these types, check the "o" and "o2" fields, which hold the corresponding data and filters.
Then, based on the op type, call the corresponding driver API: db.collection.insert(), update(), or remove().
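As a rough illustration of that mapping, here is a minimal Python sketch. The function name oplog_entry_to_operation and the returned dict layout are hypothetical, chosen only for this example. It assumes the classic entry layout (op, ns, o, o2) and modifier-style updates such as $set. Newer servers may record updates in a "diff" format that cannot be passed to a driver as-is, so the sketch returns None for those.

```python
def oplog_entry_to_operation(entry):
    """Translate one oplog entry into a description of a driver call.

    Returns a dict naming the pymongo method and its arguments,
    or None for entries this sketch does not handle.
    """
    # "ns" is "<database>.<collection>"; collection names may contain dots.
    db_name, _, coll_name = entry["ns"].partition(".")
    target = {"db": db_name, "collection": coll_name}
    op = entry["op"]

    if op == "i":
        # "o" holds the full inserted document.
        return {**target, "method": "insert_one", "args": [entry["o"]]}

    if op == "u":
        # "o2" holds the filter (the _id), "o" holds the change.
        update = dict(entry["o"])
        if "diff" in update:
            return None  # delta format: not handled in this sketch
        update.pop("$v", None)  # internal version marker, not a modifier
        is_modifier = any(key.startswith("$") for key in update)
        method = "update_one" if is_modifier else "replace_one"
        return {**target, "method": method, "args": [entry["o2"], update]}

    if op == "d":
        # "o" holds the filter (the _id) of the deleted document.
        return {**target, "method": "delete_one", "args": [entry["o"]]}

    return None  # no-ops ("n"), commands ("c"), and anything else
```

The result can then be applied with something like getattr(client[db][collection], method)(*args). Note that this replays individual changes; it does not recover the query the application originally sent.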
I'm trying to write a mongodump / mongorestore script that would copy our data from the production environment to staging once a week.
Problem is, I need to filter out one of the collections.
I was sure I'd find a way to apply a query only on a specific collection during the mongodump, but it seems like the query statement affects all cloned collections.
So currently I'm running one dump-restore for all the other collections, and one for this specific collection with a query on it.
Am I missing something? Is there a better way to achieve this goal?
Thanks!
It is possible.
--excludeCollection=<string>
Excludes the specified collection from the mongodump output. To exclude multiple collections, specify the --excludeCollection multiple times.
Example
mongodump --db=test --excludeCollection=users --excludeCollection=salaries
See Details here.
Important: by default, mongodump writes to a dump/ folder in the current directory. If that folder already exists, mongodump overwrites the files in it.
If you need that data, rename the folder or give mongodump a different --out directory. Otherwise you don't need to worry.
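For the weekly script in the question, the two runs could be assembled roughly like this in Python. This is only a sketch: the helper name build_dump_commands and the example database, collection, and query are made up for illustration. It uses --excludeCollection for the first run and --collection with --query for the second, since --query applies only when a single collection is named.

```python
import json


def build_dump_commands(db, filtered_collection, query, out_dir):
    """Build two mongodump invocations as argument lists.

    The first dumps every collection except the filtered one;
    the second dumps only that collection, restricted by the query.
    """
    dump_rest = [
        "mongodump",
        f"--db={db}",
        f"--excludeCollection={filtered_collection}",
        f"--out={out_dir}",
    ]
    dump_filtered = [
        "mongodump",
        f"--db={db}",
        f"--collection={filtered_collection}",
        f"--query={json.dumps(query)}",  # the query is passed as a JSON string
        f"--out={out_dir}",
    ]
    return [dump_rest, dump_filtered]


# Hypothetical names; replace with your own database, collection, and filter.
commands = build_dump_commands("prod", "events", {"archived": False}, "/tmp/weekly-dump")
```

Each list can be passed to subprocess.run(cmd, check=True). Both runs write into the same --out directory, so a single mongorestore of that directory picks up everything.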
So, straight to the problem. We have many clients that each have a local MongoDB. Every day new data is generated and stored in .TSV files, and these files are loaded into the client's database using mongoimport (insert, update and merge) to achieve, let's say, an incremental load.
We already have an _id field that works as the key for mongo, so mongo can automatically detect whether a document already exists. If it doesn't, mongo imports that document, which makes this a kind of incremental load (again, using the mongoimport mentioned above).
Since we already have the insert and update working correctly, what we are trying to do right now is the following:
How to automatically delete the documents that are in the local mongo and are not in the .TSV files?
Remembering that we already have the _id created and maybe we can use it as a comparison key.
Basically, what we want to achieve is that the data stored in the client's local mongo is the same as the data stored in the .TSV files that we import, so that mongo is a "mirror" of the client's data. All of that without deleting and re-uploading everything every day.
I hope it was clear enough to understand what we want to do.
Thanks!
What I'd be inclined to do is replace the mongoimport with an equivalent pymongo load routine (which will have to be developed) that loads the data and sets a "LastUpdated" field to the current date/time on every document it writes.
Once the load has completed, delete any documents that have not been updated since the start of the load.
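A minimal sketch of such a routine is below. It assumes the TSV file has a header row that includes an _id column, and that collection is a pymongo Collection; the function name mirror_tsv is made up for this example. Every row is stamped with the same load timestamp, so "not updated since the start of the load" becomes "LastUpdated differs from this run's stamp", which also catches older documents that never had the field.

```python
import csv
from datetime import datetime, timezone


def mirror_tsv(collection, tsv_path):
    """Upsert every TSV row, then delete documents the file no longer contains.

    Returns the number of deleted documents.
    """
    # One stamp for the whole run; MongoDB dates have millisecond precision,
    # so drop the microseconds to keep comparisons exact.
    load_started = datetime.now(timezone.utc).replace(microsecond=0)

    with open(tsv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            doc_id = row.pop("_id")
            row["LastUpdated"] = load_started
            # Insert the document if it is new, otherwise update its fields.
            collection.update_one({"_id": doc_id}, {"$set": row}, upsert=True)

    # Anything not stamped by this run was absent from the file.
    result = collection.delete_many({"LastUpdated": {"$ne": load_started}})
    return result.deleted_count
```

For large files, the per-row update_one calls could be grouped into bulk writes, but the delete step stays the same. All TSV values arrive as strings here, so any type conversion that mongoimport was doing would have to be added.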
Good luck!
I'm reading mosql code and how it uses oplog collection from mongodb to copy to postgresql.
For updating case, for example:
I saw that mosql always writes the whole document to postgresql instead of only the modified fields. That seems odd, because it makes no sense to update every field in the postgresql table when I only want to update 1 or 2 fields. It is a problem for me because I'm using large documents.
Looking at the code, I saw that mosql uses the o field from the oplog, but it keeps the whole document. That is why mosql updates all the fields in postgresql, and there is no way to know which fields were updated.
Is there a way to figure out which fields were updated, so that only those fields are updated instead of the complete document?
I have the following two documents in a mongo collection:
{
_id: "123",
name: "n1"
}
{
_id: "234",
name: "n2"
}
Let's suppose I read those two documents, and make changes, for example, add "!" to the end of the name.
I now want to save the two documents back.
For a single document, there's save. For new documents, I can use insert to save an array of documents.
What is the solution for saving updates to those two documents? The update command asks for a query, but I don't need a query, I already have the documents, I just want to save them back...
I can update one by one, but if that was 2 million documents instead of just two this would not work so well.
One thing to add: we are currently using Mongo v2.4, we can move to 2.6 if Bulk operations are the only solution for this (as that was added in 2.6)
For this you have two options (both available in 2.6):
Bulk operations, such as mongoimport or mongorestore.
An upsert command for each document.
The first option works better with a huge number of documents (which is your case). With mongoimport you can use the --upsert flag to overwrite existing documents, or the --upsert --drop flags to drop the existing data and load the new documents.
This option scales well with large amounts of data in terms of I/O and system utilization.
The upsert command works on the in-place update principle. You can use it with a filter, but the drawback is that it runs serially, so it shouldn't be used for huge data sizes. It performs well only with small amounts of data.
When you switch off write acknowledgement, a save doesn't wait for the database to finish writing and returns almost immediately. So with WriteConcern.Unacknowledged, storing 2 million documents with save is a lot quicker than you would think. The drawback of unacknowledged writes is that you won't get any errors back from the database.
When you don't want to save them one-by-one, bulk operations are the way to go.
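With a current pymongo driver, saving already-loaded documents back in bulk could look roughly like the sketch below. It assumes a server and driver with bulk write support (so not 2.4). The names save_all and chunked, and the make_op parameter, are hypothetical helpers for this example; make_op exists only so the batching can be exercised without a database. By default each document becomes a ReplaceOne keyed on its _id.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def save_all(collection, docs, batch_size=1000, make_op=None):
    """Write modified documents back in batches; returns the batch count."""
    if make_op is None:
        # Imported lazily so the batching logic works without pymongo installed.
        from pymongo import ReplaceOne

        def make_op(doc):
            # The "query" is just the _id of the document we already hold.
            return ReplaceOne({"_id": doc["_id"]}, doc, upsert=True)

    batches = 0
    for batch in chunked(docs, batch_size):
        # ordered=False lets the server continue past individual failures.
        collection.bulk_write([make_op(doc) for doc in batch], ordered=False)
        batches += 1
    return batches
```

Each bulk_write call sends one batch to the server instead of one round trip per document, which is where most of the gain over saving one by one comes from.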