Keep only n documents in collection - Meteor - mongodb

I want to keep only the latest n documents in my activityFeed collection, in order to speed up the application. I know that I could subscribe only to n activityFeed elements in my iron-router configs, but it is not necessary to keep all the entries.
How can I do this?
Do I need to check on every insert, or is there a better way to do this?
Any help would be greatly appreciated.

As you point out, your subscription could handle this on the client side, but if you also want to purge your DB there are two obvious solutions:
Use Collection hooks to delete the oldest item(s) on every insert (see the sketch after this list).
Use a cron job to find the nth oldest element every so often (15 minutes or whatever), and then delete everything older than that. I'd use synced-cron for this.
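A minimal sketch of the first option, assuming the matb33:collection-hooks package, that ActivityFeed is your Meteor collection, that each document has a createdAt field, and a cap of 100 entries (all assumptions to adapt to your app):

// Server-side code
var MAX_FEED_ITEMS = 100; // assumed cap

ActivityFeed.after.insert(function (userId, doc) {
  // Remove everything beyond the newest MAX_FEED_ITEMS entries.
  ActivityFeed.find({}, { sort: { createdAt: -1 }, skip: MAX_FEED_ITEMS })
    .forEach(function (old) {
      ActivityFeed.remove(old._id);
    });
});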

Better would be to create a "capped collection": https://docs.mongodb.org/manual/core/capped-collections/
In Meteor you can use it like this:
var coll = new Meteor.Collection("myCollection");
coll._createCappedCollection(numBytes, maxDocuments);
Here maxDocuments is the maximum number of documents to store in the collection and numBytes is the maximum size of the collection in bytes. When either limit is reached, MongoDB automatically drops the oldest inserted documents from the collection.
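For example (illustrative values, run on the server), to cap the feed at roughly 2 MB or 1000 documents, whichever limit is hit first:

var activityFeed = new Meteor.Collection("activityFeed");
activityFeed._createCappedCollection(2 * 1024 * 1024, 1000);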
No need for additional scripts and cron jobs.

Related

A good way to expire specific documents in one collection and add them to another collection on expiry

I'm using Node.js and the MongoDB driver.
A two-part question:
I have a collection called openOffers whose documents I want to expire at the time given in closeOfferAt. I know that MongoDB offers TTL indexes with expireAt and expireAfterSeconds, but when I use these, the same TTL is applied to all the documents in a particular collection. I'm not sure I'm doing this correctly; I want document-level custom expiry. Any syntax would be very useful!
Docs in openOffers
{
  "_id": "12345",
  "data": {...},
  "closeOfferAt": "2022-02-21T23:22:34.023Z"
}
I want to push these expired documents to another collection, openOffersLog. This is so that I can retain the documents for later analysis.
Current approach:
I haven't figured out a way to have a customized TTL on each doc in openOffers. Currently I insert docs into both openOffers and openOffersLog together, and any data change in openOffers has to be separately propagated to openOffersLog to ensure consistency. There has to be a more scalable approach, I suppose.
EDIT-1:
I'm looking for some syntax logic that I can use for the above use case. If it's not possible with the current MongoDB driver, I'm looking for an alternative solution with Node.js code I can experiment with. I'm new to both Node.js and MongoDB, so any reasoning supporting the solution would be super useful as well.
There are two ways to implement TTL indexes.
Delete after a certain amount of time - this is what you have already implemented
Delete at a specific clock time - for the details you can visit the MongoDB docs
The second option fulfills your expectation. Just set expireAfterSeconds to 0 (zero) while creating the index:
db.collection.createIndex({ "closeOfferAt": 1 }, { expireAfterSeconds: 0 })
Then set the expiration date in closeOfferAt while inserting the document; this will remove the document at that particular timestamp:
db.collection.insert({
  "_id": "12345",
  "data": {...},
  "closeOfferAt": ISODate("2022-02-23T06:14:15.840Z")
})
Do not make your application logic depend on TTL indexes.
In your app you should have a scheduler that runs periodic tasks; one of those tasks can move the finished offers to the other collection and delete them from the original, even in bulk, with no need for a TTL index.
To keep consistency there is nothing better than a single source of truth, so if you can, avoid deleting and instead only change a status flag and some timestamps.
A good use of a TTL index is to automatically clear old data after a relatively long time, like one month or more. This keeps collection and index sizes in check.
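A minimal sketch of the scheduler approach with the MongoDB Node.js driver (the connection string, database name, collection names and the 15-minute interval are assumptions to adapt):

const { MongoClient } = require('mongodb');

async function archiveClosedOffers(db) {
  // Find every offer whose close time has passed.
  const expired = await db.collection('openOffers')
    .find({ closeOfferAt: { $lte: new Date() } })
    .toArray();
  if (expired.length === 0) return;

  // Copy to the log collection first, then remove from the source in bulk.
  await db.collection('openOffersLog').insertMany(expired);
  await db.collection('openOffers').deleteMany({
    _id: { $in: expired.map((doc) => doc._id) },
  });
}

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('mydb');
  // Run every 15 minutes; swap in a proper job scheduler for production use.
  setInterval(() => archiveClosedOffers(db).catch(console.error), 15 * 60 * 1000);
}

main().catch(console.error);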

Using Mongo to continuously rollup data

I've been experimenting with the Mongo Aggregation Framework and, with help from folks on here, am able to generate the right set of output docs for a given input. I have a couple of conceptual issues though that I'm hoping folks can help me design around.
The application I have is a runtime system that collects data for all the transactions it processes. All this data is written to a distributed, sharded collection in Mongo. What I need to do is periodically (every 5 seconds at this point) run a job that traverses this data, rolling it up by various categories and appending the rolled-up documents to a set of existing collections (or one existing collection).
I have a couple of challenges with the way Mongo Aggregation works:
1 - the $out pipeline stage doesn’t append to the target collection, it overwrites it - I need to append to a constantly growing collection. It also can't write to a sharded collection, but I don't think this is that big an issue.
2 - I don't know how I can configure it to essentially "tail" the input collection. Right now I would need to run it from a server and would have to mark the set of documents it's going to process with a query before running the aggregate() command and then have another job that periodically goes back through the source collection deleting documents that have been marked for processing (this assumes the aggregate worked and rolled them up properly - there is no transactionality).
Anyone have any suggestions for a better way to do this?
Thanks,
Ian
I recommend looking at version 3.6 (released last Nov) and the feature known as change streams. Change streams are effectively the "tail" you seek. A compact program in pseudo-code would look like this. Note also how we iterate over the aggregation on inputCollection and write doc by doc to outputCollection.
tailableCursor = db.inputCollection.watch()
while (true) {
    // Block until the next change event comes in
    document = tailableCursor.next();
    // Examine the document to ensure it is of interest
    if (of interest) {
        cur2 = db.inputCollection.aggregate([pipeline]);
        while (cur2.hasNext()) {
            db.outputCollection.insert(cur2.next());
        }
    }
}

Automatically remove document from the collection when array field becomes empty

I have a collection whose documents contain an array field of event triggers, and when there are no triggers left I want to remove the document. As I understand it, Mongo doesn't have trigger support. Is there any way I can delegate this job to Mongo?
You are correct, there are no triggers in Mongo, so there is no built-in way to do this. You have to use application logic to achieve it. One way would be to do a cleanup every n minutes, removing documents whose array has size zero. Another way (which I like more) is, after each update to a document, to remove it if its array is now empty. A sketch of both follows.
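A minimal sketch of both options in the mongo shell, assuming the array field is called triggers and updatedId stands for the _id you just modified:

// Option 1: periodic cleanup - remove every document whose array is empty.
db.myCollection.deleteMany({ triggers: { $size: 0 } });

// Option 2: right after updating a specific document, remove it if its array is now empty.
db.myCollection.deleteOne({ _id: updatedId, triggers: { $size: 0 } });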
The only feature I know of that MongoDB provides to expire data is a TTL (expiration) index; see "Expire Data from Collections by Setting TTL" in the MongoDB docs.
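Note that a TTL index expires documents based on a date field, not on array contents; for reference, it is created like this (the field name and one-hour lifetime are illustrative):

db.myCollection.createIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 });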

MongoDB cursor and write operations

I am using MongoDB to save data about products. After writing the initial large data-set (24 million items) I would like to change all the items in the collection.
Therefore I use a cursor to iterate over the whole collection, and I want to add a "row" or field to every item. With large data-sets this is not working: only 180,000 items were updated. On a small scale it works. Is that normal behavior?
Is MongoDB not supposed to support writes while iterating with a cursor over the whole collection?
What would be a good practice to do that instead?
For larger collections, you might run into snapshotting problems. When you add data to a document and save it, it grows, which can force MongoDB to move the document around, and then the cursor might return the same document twice.
You can either use $snapshot in your query, or use a stable order such as sort({"_id": 1}). Note that you can't use both.
Also make sure to use at least an acknowledged write concern.
When we had a similar problem, we fetched data in chunks of 100k (a size we settled on after some testing). It's a quick and simple solution.
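A minimal sketch combining both suggestions with the MongoDB Node.js driver: iterate in stable _id order and flush bulk updates in chunks. Here db is an already-connected database handle and computeValue() is a hypothetical stand-in for whatever per-document change you need:

async function addFieldToAllProducts(db) {
  const cursor = db.collection('products')
    .find({})
    .sort({ _id: 1 }); // stable order, so moved documents are not revisited

  let batch = [];
  while (await cursor.hasNext()) {
    const doc = await cursor.next();
    batch.push({
      updateOne: {
        filter: { _id: doc._id },
        update: { $set: { newField: computeValue(doc) } }, // hypothetical helper
      },
    });
    if (batch.length === 100000) {
      await db.collection('products').bulkWrite(batch); // acknowledged by default
      batch = [];
    }
  }
  if (batch.length > 0) {
    await db.collection('products').bulkWrite(batch); // flush the final partial chunk
  }
}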

How do I split/sample from a cursor in MongoDB?

I have a database with millions of documents. I want to perform a relatively time-consuming operation on each document and then update it. I have two (related) questions:
If I want a random sample of, say, 1000 documents from a given cursor, how do I do that?
I want to compute on and update a million documents. I am on a cluster and I want to dispatch a separate job for each batch of, say 1000 documents. What's the easiest way to do something like this?
Thanks!
Uri
In order to do this, you'd have to push things to a worker manager. I would suggest Gearman for this. In that case, a script would:
1. Query all documents that you want to update, and return their _id.
2. Push all object IDs into a Gearman server.
3. Have a Gearman worker process running on each of the machines in your cluster.
4. Have each Gearman worker pick up a new object ID from the queue, process the document, and save it back into MongoDB.
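A minimal sketch of steps 1 and 2 with the MongoDB Node.js driver, batching 1000 IDs per job. Here db is an already-connected database handle and submitToGearman() is a hypothetical stand-in for whatever Gearman client call you use:

async function dispatchJobs(db) {
  // Step 1: fetch only the _id of every document that needs updating.
  const cursor = db.collection('documents').find({}, { projection: { _id: 1 } });

  let batch = [];
  while (await cursor.hasNext()) {
    batch.push((await cursor.next())._id);
    if (batch.length === 1000) {
      // Step 2: hand a batch of 1000 IDs to the worker queue.
      await submitToGearman('processBatch', batch); // hypothetical client call
      batch = [];
    }
  }
  if (batch.length > 0) {
    await submitToGearman('processBatch', batch); // flush the final partial batch
  }
}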