Delete documents using their _ids as a comparison - mongodb

So, straight to the problem. We have many clients, each with their own local MongoDB. Every day new data is generated and stored in .TSV files, and these files are loaded into their databases using mongoimport (insert, upsert and merge) to achieve a, let's say, incremental load.
We already have an _id field that works as the key for Mongo, so Mongo can automatically detect whether a document already exists or not; if it doesn't, it imports that document. It is kind of an incremental load (again, with the mongoimport mentioned above).
Since we already have the insert and update working correctly, what we are trying to do right now is the following:
How to automatically delete the documents that are in the local mongo and are not in the .TSV files?
Keep in mind that we already have the _id created, and maybe we can use it as the comparison key.
Basically, what we want to achieve is that the data stored in the client's local Mongo is the same as the data stored in the .TSV files that we import, so Mongo will be a "mirror" of the client's data - all that without deleting and re-uploading everything every day.
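Roughly, the kind of _id comparison we have in mind would be something like this (just a sketch; the file, database and collection names are made up, and we assume the first TSV column holds the _id):

    import csv
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")   # made-up connection string
    coll = client["clientdb"]["data"]                   # made-up db/collection names

    # Collect every _id present in today's .TSV file (first column assumed to be the _id).
    with open("daily_load.tsv", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader, None)                              # skip the header row (assumes the file has one)
        tsv_ids = [row[0] for row in reader]

    # Delete documents that exist locally but are no longer present in the file.
    result = coll.delete_many({"_id": {"$nin": tsv_ids}})
    print("deleted", result.deleted_count)

(For very large files it may be gentler to stage the ids in a temporary collection instead of building one huge $nin list, but the idea is the same.)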
I hope it was clear enough to understand what we want to do.
Thanks!

What I'd be inclined to do is replace the mongoimport with an equivalent pymongo load routine (which would have to be developed) that loads the data and adds a "LastUpdated" field set to the current date/time.
Once completed, delete any documents that have not been updated since the start of the load.
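Something along these lines, as an untested sketch (the connection string, db/collection names and .TSV layout are assumptions):

    import csv
    from datetime import datetime, timezone
    from pymongo import MongoClient, UpdateOne

    client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
    coll = client["clientdb"]["data"]                   # placeholder db/collection names

    load_start = datetime.now(timezone.utc)

    # Upsert every row of today's .TSV, stamping each document with the load time.
    with open("daily_load.tsv", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")      # assumes a header row containing an "_id" column
        ops = []
        for row in reader:
            doc = dict(row)
            _id = doc.pop("_id")
            doc["LastUpdated"] = load_start
            ops.append(UpdateOne({"_id": _id}, {"$set": doc}, upsert=True))
        if ops:
            coll.bulk_write(ops, ordered=False)

    # Anything this load did not touch was absent from the file, so remove it.
    coll.delete_many({"LastUpdated": {"$lt": load_start}})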
Good luck!

Related

What is the difference between mongoimport upsert and merge

I am trying to load data from a file created using mongoexport on one server into another. New documents must be inserted, and existing ones must be updated with the data in the file. I see that the documentation refers to upsert and merge options, but the difference is not obvious to me.

Purge documents in MongoDB without impacting the working set

We have a collection of documents, and each document has a time window associated with it (for example, fields like 'fromDate' and 'toDate'). Once a document has expired (i.e. its toDate is in the past), it isn't accessed by our clients anymore.
So we wanted to purge these documents to reduce the number of documents in the collection and thus make our queries faster. However, we later realized that this past data could be important for analyzing the pattern of data changes, so we decided to archive it instead of purging it completely. This is what I've come up with.
Let's say we have a "collectionA" which holds these past versions of documents:
1. Query all the past documents in "collectionA" (queries are made on a secondary server).
2. Insert them into a separate collection called "collectionA-archive".
3. Delete the documents from "collectionA" that were successfully inserted into the archive.
4. Delete documents in "collectionA-archive" that meet a certain condition (we do not want to keep a huge archive).
My question here is: even though I'm making the queries on a secondary, since the insertions happen on the primary, do the documents inserted into the archive collection make it into the primary's working set? The last thing we need is these past documents being kept in the primary's RAM, which could affect the performance of our live API.
I know one solution could be to insert the past documents into a separate DB server, but acquiring another server is a bit of a hassle, so I would like to know if this is achievable within one server.
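For context, the job I have in mind looks roughly like this (a simplified sketch; the hosts, database name and field names are placeholders):

    from datetime import datetime, timedelta, timezone
    from pymongo import MongoClient, ReadPreference

    client = MongoClient("mongodb://host1,host2/?replicaSet=rs0")   # placeholder hosts
    db = client["appdb"]                                            # placeholder database name

    # 1. Read expired documents from a secondary.
    source = db.get_collection("collectionA", read_preference=ReadPreference.SECONDARY)
    expired = list(source.find({"toDate": {"$lt": datetime.now(timezone.utc)}}))

    if expired:
        # 2. Copy them into the archive collection (writes always go to the primary).
        db["collectionA-archive"].insert_many(expired)

        # 3. Remove the successfully archived documents from the live collection.
        db["collectionA"].delete_many({"_id": {"$in": [d["_id"] for d in expired]}})

    # 4. Trim the archive itself, e.g. drop anything older than a year.
    cutoff = datetime.now(timezone.utc) - timedelta(days=365)
    db["collectionA-archive"].delete_many({"toDate": {"$lt": cutoff}})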

Faster way to remove 2TB of data from single collection (without sharding)

We collect a lot of data and recently decided to migrate it from MongoDB into a data lake. We are going to keep only a portion of our data in Mongo and use it as our operational database that holds only the newest, most relevant data. We have a replica set, but we don't use sharding. I suspect that if we had a sharded cluster we could achieve the necessary result much more simply, but this is a one-time operation, so setting up a cluster just for it looks like a very complex solution (plus I also suspect that converting such a collection into a sharded collection would be a very long-running operation, but I can be completely wrong here).
One of our collections is about 2 TB in size right now. We want to remove the old data from the original database as fast as possible, but the standard remove operation is very slow, even when we use an unordered bulk operation.
I found a few suggestions to copy the data we want to keep into another collection and then just drop the original collection, instead of trying to remove data (i.e. migrate the data we want to keep instead of removing the data we don't). There are a few different ways I found to copy a portion of data from the original collection into another one:
1. Extract the data and insert it into the other collection one document at a time, or extract a portion of the data and insert it in bulk using insertMany(). It looks faster than just removing data, but still not fast enough.
2. Use the $out operator with the aggregation framework to extract portions of data. It's very fast! But it writes every portion into a separate collection and has no ability to append data in the current MongoDB version, so we would need to combine all the exported portions into one final collection, which is slow again. I see that $out will be able to append data in the next release of Mongo (https://jira.mongodb.org/browse/SERVER-12280), but we need a solution now, and unfortunately we won't be able to do a quick upgrade of our Mongo version anyway.
3. mongoexport / mongoimport - export a portion of data into a JSON file and append it to another collection using import. It's quite fast too, so it looks like a good option.
Currently the best choice to improve the performance of the migration looks like a combination of the $out and mongoexport/mongoimport approaches (options 2 and 3), plus multithreading to run several of the described operations at once.
But is there an even faster option that I might have missed?
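For reference, the basic "copy what we keep, then drop" idea (option 2) would look roughly like this, assuming the data to keep can be selected with a single $match (the collection names and the date filter are invented):

    from datetime import datetime, timedelta, timezone
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")       # placeholder connection string
    db = client["bigdb"]                                    # placeholder database name

    cutoff = datetime.now(timezone.utc) - timedelta(days=365)   # keep only data newer than this

    # Copy the documents we want to keep into a new collection in one server-side pass.
    db["events"].aggregate(
        [
            {"$match": {"createdAt": {"$gte": cutoff}}},
            {"$out": "events_keep"},
        ],
        allowDiskUse=True,
    )

    # $out only creates the _id index, so rebuild the indexes we need before switching over.
    db["events_keep"].create_index("createdAt")

    # Swap: drop the 2 TB collection and rename the trimmed copy into its place.
    db["events"].drop()
    db["events_keep"].rename("events")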

Import/Export of Mongo Collections, Preserving _id

I have a MEAN database application with a number of Mongo collections with hierarchical relationships via ObjectId. A copy of the application works locally offline, and another copy runs on the production server.
The data collectively describe rules and content that drive a complex process. These data need to be entered offline so that these processes can be tested before the data go into the production environment.
What I assumed I would be able to easily do is to export selected documents as JSON, then relatively simply import them into the production database. So, the system would have a big "Export" button that would take the current document and all subdocuments and related documents, and export them as a single JSON file. Then, my "Import" button would parse that JSON file on the production server.
So, exporting is no problem. Did that in a couple of hours.
But, I quickly found that when I import a document, its _id field value is not preserved. This breaks relationships, obviously.
I have considered writing parsing routines that preserve these relationships by programmatically setting ObjectIds in parent documents after the child documents have been saved. This will be a huge headache, though.
I'm hoping there is either:
a) ... an easy way to import a JSON document with _id fields intact, or ...
b) ... another way to accomplish this entirely that is easier than I am making it.
I appreciate any advice.
There's always got to be someone that doesn't know the answer who complains about the question. The question is clear and the problem is familiar.
Indeed, Mongoose will overwrite any value you provide for _id when you create a document either via the create() method or using the constructor (var thing = new Thing()).
Also, mongoexport/mongoimport will not meet the need to do this programmatically, at least not easily.
If I'm understanding correctly, you want to export a subset of documents, along with any related documents, keeping references intact. Then, you want to import this data into a remote system, again, keeping references intact.
The approach you took would work just fine except it will destroy all references, as you found out.
I've worked on a similar problem and I believe that the best way to do this is to do what it sounds like you wanted to avoid. That is, you'll iterate over your collections and let Mongo generate its _ids as it will. Add your child documents first, then set the references correctly in your parent documents. I really don't think there is a better way that still gives you granular control.
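Very roughly, the idea looks like this (a sketch with made-up collection names and a made-up export shape; the app is MEAN, but the pymongo version below just illustrates the insertion order and the _id remapping):

    from pymongo import MongoClient

    client = MongoClient("mongodb://production-host:27017")    # placeholder host
    db = client["appdb"]                                       # placeholder database name

    def import_tree(export):
        """Insert children first, then parents, rewriting the ObjectId references.

        `export` is assumed to look like {"children": [...], "parents": [...]},
        where each parent holds its children's old _ids in a "childIds" array.
        """
        id_map = {}   # old _id from the export -> new _id generated by Mongo

        # 1. Let Mongo assign fresh _ids to the child documents.
        for child in export["children"]:
            old_id = child.pop("_id")
            new_id = db["children"].insert_one(child).inserted_id
            id_map[old_id] = new_id

        # 2. Insert the parents, translating their references through the map.
        for parent in export["parents"]:
            parent.pop("_id", None)
            parent["childIds"] = [id_map[c] for c in parent.get("childIds", [])]
            db["parents"].insert_one(parent)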
In the current version of MongoDB you can use db.copyDatabase(). Start the mongo shell on the instance where you want the database copied to and run the following command:
db.copyDatabase(fromDB, toDB).
For more options and details refer to db.copyDatabase()

Matching documents of mongodb with json

Is there any way to verify that all documents of a MongoDB collection were entered correctly, i.e. to check that the data in the JSON file and the inserted data are the same?
If yes, how do I do it? Consider that there are 3 million documents in the db.
I want to do it with JavaScript.
Thanks
You will have to run a find for every document that you expect to be in the database, verifying that there is in fact an exact match present (just use the entire document as the match criteria).
In the future, you can use safe mode (safe=True in most drivers, but the syntax varies slightly) to make sure your writes do not fail. Using safe mode will alert you as to the results of the write.
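A sketch of that check (the question asks for JavaScript, but the same idea is shown here with pymongo; the file, database and collection names are placeholders, and the file is assumed to be mongoexport-style with one document per line):

    from bson import json_util
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
    coll = client["mydb"]["mycollection"]               # placeholder db/collection names

    missing = 0
    with open("data.json") as f:
        for line in f:
            # json_util understands mongoexport's extended JSON (ObjectId, dates, ...).
            doc = json_util.loads(line)
            # Use the whole document as the query so any field mismatch counts as "not found".
            if coll.count_documents(doc, limit=1) == 0:
                missing += 1
                print("no exact match for _id:", doc.get("_id"))

    print(missing, "documents did not match")

With 3 million documents this will take a while, so it may be worth batching the lookups or comparing per-document hashes instead if runtime matters.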