MongoDB query for deleted documents

I will be running a nightly cron job to query a collection and then send results to another system.
I need to sync this collection between two systems.
Documents can be removed from the host and this deletion needs to be reflected on the client system.
So my question is: is there a way to query for documents that were recently deleted?
I'm looking for something like db.Collection.find({RECORDS_THAT_WERE_DELETED_YESTERDAY});
I was reading about parsing the oplog. However, I don't have one set up yet. Is that something you can introduce into an existing DB?
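For reference, if the deployment is (or becomes) a replica set, deletes are recorded in the oplog and can be queried from the local database. A minimal shell sketch, assuming the oplog retains at least a day of history; the namespace and the 24-hour window are placeholders:

use local
// "d" marks delete operations; ns is the "database.collection" namespace.
var since = new Date(Date.now() - 24 * 60 * 60 * 1000);
db.oplog.rs.find({
    "op": "d",
    "ns": "db_name.collection_name",
    "ts": { $gte: Timestamp(Math.floor(since.getTime() / 1000), 0) }  // oplog timestamps are in seconds
})

Each matching entry carries the deleted document's _id in its o field, which is what you would forward to the client system.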

Related

Purge documents in MongoDB without impacting the working set

We have a collection of documents and each document has a time window associated with it (for example, fields like 'fromDate' and 'toDate'). Once a document has expired (i.e. toDate is in the past), it is no longer accessed by our clients.
So we wanted to purge these documents to reduce the number of documents in the collection and thus make our queries faster. However, we later realized that this past data could be important for analyzing patterns of data changes, so we decided to archive it instead of purging it completely. This is what I've come up with.
Let's say we have a "collectionA" which contains these past documents. The plan is to:
1. Query all the past documents in "collectionA" (queries are made on the Secondary server).
2. Insert them into a separate collection called "collectionA-archive".
3. Delete the documents from "collectionA" that were successfully inserted into the archive.
4. Delete documents in "collectionA-archive" that meet a certain condition (we do not want to keep a huge archive).
My question here is: even though I'm making the queries on the Secondary server, since the insertions happen on the Primary, do the documents inserted into the archive collection make it into the working set of the Primary? The last thing we need is these past documents being kept in the Primary's RAM, which could affect the performance of our live API.
I know one solution could be to insert the past documents into a separate DB server, but acquiring another server is a bit of a hassle. So I would like to know if this is achievable within one server.
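For concreteness, here is a minimal mongo-shell sketch of the archive pass described above (insertMany/deleteMany require MongoDB 3.2+). The field name toDate and the collection names come from the question; the retention window is an assumption, reading from the Secondary (read preference) is omitted, and a real pass would work in batches rather than loading every expired document into memory:

var now = new Date();
// 1. Find the expired documents.
var expired = db.collectionA.find({ toDate: { $lt: now } }).toArray();
if (expired.length > 0) {
    // 2. Copy them into the archive collection (the hyphenated name needs getCollection).
    db.getCollection("collectionA-archive").insertMany(expired);
    // 3. Remove from collectionA only the documents that were copied.
    var ids = expired.map(function (d) { return d._id; });
    db.collectionA.deleteMany({ _id: { $in: ids } });
}
// 4. Trim the archive itself; the 90-day retention window is just an example.
var cutoff = new Date(Date.now() - 90 * 24 * 60 * 60 * 1000);
db.getCollection("collectionA-archive").deleteMany({ toDate: { $lt: cutoff } });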

Why update operation on one collection blocks read from another?

I have an issue with my MongoDB. I am running one mongo instance on the server and have several collections in the db. The problem I'm running into is that when I run a long update operation (150k records) on the myRecords collection, any read query on myDetails is blocked until that long update operation is complete.
This doesn't make much sense to me. I can see how reading from the same collection might be blocked during the update, but why would another collection be affected? Am I missing something?
More details:
- running Node.js and performing operations with JavaScript
- db version v3.0.11
- mmapv1 storage engine

All documents in the collection magically disappeared. Can I find out what happened?

I cloned 2 of my collections from localhost to a remote location on the MongoLab platform yesterday. While trying to debug my (MEAN stack) application (with the WebStorm IDE), I realized one of those collections has no data in it. Well, there were 7800 documents this morning...
I am pretty much the only one who works on the database and especially with this collection. I didn't run any query to remove all of the documents from this collection. On MongoLab's website there is a button that says 'delete all documents from collection'. I am pretty sure I didn't hit that button. I asked my teammates; no one even opened that web page today.
Assuming that my team is telling the truth and that I didn't remove everything and have a blackout...
Is there a way to find out what happened?
And is there a way to keep a query history (like Unix command-line history) for a mongo database that runs on a remote server? If yes, how?
So, I am just curious about what happened. Also note that I don't have any DBA responsibilities or experience in that field.
MongoDB replica sets have a special collection called oplog. This collection stores all write operations for all databases in that replica set.
Here are instructions on how to access the oplog on MongoLab:
Accessing the MongoDB oplog
Here is a query that will find all delete operations:
use local
db.oplog.rs.find({"op": "d", "ns" : "db_name.collection_name"})

How does mongo rename collection work?

I am confused by how mongo renames collections and how much time it will take to rename a very large collection.
Here is the scenario: I have a mongo collection with too much data (588 million documents), which slows down finds and insertions, so I am creating an archive collection to keep all this data.
For this I am thinking of renaming the old collection to oldcollectionname_archive and starting with a fresh collection named oldcollectionname.
I am planning to do this with the following command:
db.oldCollectionName.renameCollection("oldCollectionName_archive")
But I am not sure how much time it will take.
I read the mongo docs and many Stack Overflow answers regarding collection renaming, but I could not find any data on whether the size of the collection affects the time required to rename it.
Please help if anyone has any knowledge regarding this or similar experience.
Note: I have read about the other issues which can occur during renaming, in the mongo documentation and other SO answers.
From the MongoDB documentation (https://docs.mongodb.com/manual/reference/command/renameCollection/):
renameCollection has different performance implications depending on the target namespace.
If the target database is the same as the source database, renameCollection simply changes the namespace. This is a quick operation.
If the target database differs from the source database, renameCollection copies all documents from the source collection to the target collection. Depending on the size of the collection, this may take longer to complete. Other operations which require exclusive access to the affected databases will be blocked until the rename completes. See What locks are taken by some common client operations? for operations which require exclusive access to all databases.
Note that:
* renameCollection is not compatible with sharded collections.
* renameCollection fails if target is the name of an existing collection and you do not specify dropTarget: true.
I have renamed multiple collections with around 500M documents. It completes in ~0 time.
This is true for MongoDB 3.2 and 3.4, and I would guess also for older versions.
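As a side note, if the target name already exists, the shell helper accepts a second dropTarget argument, matching the dropTarget: true behaviour quoted above (collection names taken from the question):

// Drops any existing "oldCollectionName_archive" before renaming (use with care).
db.oldCollectionName.renameCollection("oldCollectionName_archive", true)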

data synchronization between clients

I don't know if this is the right place to ask my question, but here it is.
Inspired by Firebase, I decided to write a little framework to synchronize data between clients. That should simplify the development of web applications such as chats, forums, etc...
Let's assume that there are one or more servers. A client can connect to one server and access a particular collection (a list of chat messages, for instance). If and when the client modifies the collection, those modifications will be sent to the other clients who requested access to the same collection.
I'd like my solution to be fast and general. The propagation of the modifications should be very fast and the collections should be persisted on a DB.
The collections may be very large but the clients may request just a view of the collection (for instance, the chat messages of the last 20 minutes).
Possible solution
We have n servers, 1 node with a fast in-memory DB (Redis) and a cluster with a NoSQL DB.
The cluster will contain the full collections.
When a client connects to a server and is given access to a collection for the first time, the requested part of the collection is read directly from the cluster.
When a client modifies a collection C, the modification is written to the in-memory DB which will contain something like:
123 added "message..."
124 deleted id235
125 modified id143 "new message..."
where 123, 124 and 125 are the versions of the collection.
In this case, the cluster contains the entire collection C and its version number which is 122.
When a client first connects to a server and accesses the collection C, the server reads the requested part of the collection from the cluster and then reads the updates from the in-memory DB in order to update the collection from version 122 to version 125.
When a client modifies the collection C:
1. the description of the modification is inserted into the in-memory DB;
2. the other servers are informed that a new version of C is available;
3. the client is sent the update.
Of course, the other servers, once informed, will send the updates to their clients as well.
Another process in the background will update the cluster the following way:
while (the in-memory database contains less than K updates for the collection C)
    read the next update, U, from the in-memory database;
    use U to update the collection C and its version number in the cluster ATOMICALLY.
The updates must be linearizable, i.e. no server should be able to see the collection C in a state where an update has been applied before a previous update.
When the cluster is fully consistent, we remove the updates from the in-memory DB from lowest to highest version.
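As a rough illustration of that flush process, here is a self-contained sketch in plain JavaScript; the in-memory DB and the cluster are replaced by simple local structures, and K, the update shape, the ids and the action names are assumptions based on the example log above:

// Stand-in for the in-memory DB: ordered updates for collection C (versions 123..125).
var updateLog = [
    { version: 123, action: "add",    id: "id501", msg: "message..." },
    { version: 124, action: "delete", id: "id235" },
    { version: 125, action: "modify", id: "id143", msg: "new message..." }
];

// Stand-in for the cluster: the full collection C plus its version number (122).
var cluster = { version: 122, docs: new Map() };

// In a real system this step must be atomic: the document change and the
// version bump have to be committed together.
function applyToCluster(update) {
    if (update.action === "add" || update.action === "modify") {
        cluster.docs.set(update.id, { id: update.id, msg: update.msg });
    } else if (update.action === "delete") {
        cluster.docs.delete(update.id);
    }
    cluster.version = update.version;
}

var K = 100; // assumed threshold from the pseudocode above

// Background flush: apply updates strictly in version order, then remove them
// from the in-memory DB (lowest version first) once the cluster has them.
while (updateLog.length > 0 && updateLog.length < K) {
    var next = updateLog[0];
    if (next.version !== cluster.version + 1) break; // gap: wait for the missing update
    applyToCluster(next);
    updateLog.shift();
}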
Problem
My solution requires a DB (for the cluster) which supports transactions (ACID?) and offers strong consistency. For instance, I can't use MongoDB.
Question
Can you think of a simpler solution?
or
If my solution is acceptable, what DB do you recommend for the cluster?
Thank you for your patience.
If each element of the collection is assigned a unique id and the updates in the in-memory DB include those ids, then one doesn't need a version number for the collection in the cluster and so transactions are unnecessary.
The idea is that the ids can be used to decide if an update is needed or not. For instance, if an update says
version=123 action=ADD id=05276 msg="Message"
and the collection in the cluster already contains a document with id=05276, then this update is old and was already applied to the collection in the cluster.
We only need to pay attention to this: before we remove some updates from the in-memory DB, we must make sure that those updates have been applied to the collection in the cluster and that the cluster is fully consistent with respect to that collection.
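A small sketch of that id-based check, in plain JavaScript; the cluster is represented as a Map keyed by document id, and the update shape follows the example above (how deletes and modifications are recognized as already applied is left open here):

// An ADD whose id is already present in the cluster was applied in an earlier pass.
function isAlreadyApplied(clusterDocs, update) {
    if (update.action === "ADD") {
        return clusterDocs.has(update.id);
    }
    // DELETE/MODIFY would need an analogous check (id absent, per-document version, ...).
    return false;
}

// Example matching the update shown above:
var clusterDocs = new Map([["05276", { id: "05276", msg: "Message" }]]);
var update = { version: 123, action: "ADD", id: "05276", msg: "Message" };
isAlreadyApplied(clusterDocs, update);   // true, so this update is skipped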
When a client requests access to a collection, the server it connected to needs to:
1. read all the updates from the in-memory DB and save them in memory;
2. read (the relevant part of) the collection from the cluster;
3. update the read collection with the updates saved in memory.
It's important to first read all the updates from the in-memory DB to avoid a race condition. Consider this scenario:
1. we read an old version of the collection from the cluster
2. the collection is updated
3. the cluster becomes fully consistent
4. some updates are removed from the in-memory DB
5. we update our read collection with the new updates
The problem is that in step 5 we'd miss some updates.
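A minimal sketch of the safe ordering, reusing the same stand-ins as above (the collection is a Map keyed by document id; the update shape is illustrative):

// Snapshot the pending updates BEFORE reading the cluster, so updates that are
// flushed and removed from the in-memory DB while we read cannot be missed.
function readCollectionView(inMemoryUpdates, clusterDocs) {
    var updates = inMemoryUpdates.slice();        // step 1: copy the pending updates first
    var view = new Map(clusterDocs);              // step 2: then read from the cluster
    updates.forEach(function (u) {                // step 3: apply the saved updates
        if (u.action === "add" || u.action === "modify") {
            view.set(u.id, { id: u.id, msg: u.msg });
        } else if (u.action === "delete") {
            view.delete(u.id);
        }
    });
    return view;
}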