What's the benefit of MongoDB's TTL collections vs. purging data with a housekeeper?

I have been thinking about using the built-in TTL feature, but it's not easy to dynamically change the expiration date.
Since MongoDB uses a background task to purge the data, is there any downside to coding my own purging function based on "> certain_date" and running it, say, once a day?
This way I can dynamically change the TTL value, and the date field won't need a single-field index of its own; I can reuse it as part of a compound index to minimize the number of indexes.
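Something like the following is what I have in mind (a rough sketch; the collection, field, and index names are just placeholders):
// the expiry date is the leading key of a compound index, so the daily purge
// below can still use it and no dedicated single-field index is needed
db.events.createIndex({ expireAt: 1, tenantId: 1 })
// housekeeper job, run once a day (e.g. from cron)
db.events.deleteMany({ expireAt: { $lt: new Date() } })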

There are 2 ways to set the expiration date on a TTL collection:
at a global level, when creating the index
per document, as a field in the document
These two modes are mutually exclusive.
Global expiry
If you want all your documents to expire 3 months after creation, use the first mode by creating the index like the following:
db.events.ensureIndex({ "createdAt": 1 }, { expireAfterSeconds: 7776000 })
If you later decide to change the expiry to "4 months", you just need to update the expireAfterSeconds value using the collMod command:
db.runCommand({"collMod" : "events" , "index" : { "keyPattern" : {"createdAt" : 1 } , "expireAfterSeconds" : 10368000 } })
Per-document expiry
If you want every document to have its own expiration date, save that date in a field like "expiresAt", then index your collection with:
db.events.ensureIndex({ "expiresAt": 1 }, { expireAfterSeconds: 0 })

I have been thinking about using the built-in TTL feature, but it's not easy to dynamically change the expiration date
That's odd. Why would that be a problem? If your document has a field Expires, you can update that field at any time to dynamically prolong or shorten the life of the document.
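For example, prolonging a document's life is a single update (a sketch; the expires field is assumed to back a TTL index with expireAfterSeconds: 0, and someId is a placeholder for the document's _id):
db.events.updateOne(
  { _id: someId },
  // push the expiry out by another 30 days
  { $set: { expires: new Date(Date.now() + 30 * 24 * 60 * 60 * 1000) } }
)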
Is there any downside just coding my own purging function based on "> certain_date" and running it, say, once a day?
You have to code, document and maintain it
Deleting a whole lot of documents can be expensive and lead to a lot of re-ordering. It's probably helpful to run the purging more often
Minimizing the number of indexes is a good thing, but the question is whether it's really worth the effort. Only you can answer that. My advice: start with what's already there if at all possible, and come up with something better only if you really have to.

Related

How should I efficiently delete a lot of records from a MongoDB collection?

I am using Mongo to store multi-tenant data. As part of data cleanup for a tenant I want to delete everything related to that tenant. The tenantId is indexed, but there are a lot of rows, the deletion takes a long time, and I have no easy way to track its progress.
Currently I do something like:
db.records.deleteMany({tenantId: x})
Is there a better way?
I'm thinking of doing it in batches: query for x records, build a list of their ids, then delete those. It seems very manual, but is it the recommended way?
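Roughly, the batched idea I have in mind looks like this (a rough sketch; names and the batch size are just placeholders):
var tenantId = "tenant-123";   // placeholder value
var batchSize = 1000;
var total = 0;
while (true) {
  // fetch one page of _ids for this tenant
  var ids = db.records.find({ tenantId: tenantId }, { _id: 1 })
                      .limit(batchSize)
                      .toArray()
                      .map(function (doc) { return doc._id; });
  if (ids.length === 0) break;
  // delete just that page, so progress can be reported
  total += db.records.deleteMany({ _id: { $in: ids } }).deletedCount;
  print("deleted so far: " + total);
}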
Some options that I can think of:
Drop the index before deleting. You can recreate the index after the deletion.
Change the write concern to a lower value, possibly 0, so the request won't wait for any write acknowledgement.
db.records.deleteMany({ tenantId: x }, { writeConcern: { w: 0 } });
If there is another field with enough cardinality to reduce the number of documents, try including that in the query.
Ex: if anotherField has 0, 1, 2, 3 as values, then execute the delete command 4 times, each time with a different value.
db.records.deleteMany({ tenantId: x, anotherField: 0 }, { writeConcern: { w: 0 } });
db.records.deleteMany({ tenantId: x, anotherField: 1 }, { writeConcern: { w: 0 } });
db.records.deleteMany({ tenantId: x, anotherField: 2 }, { writeConcern: { w: 0 } });
db.records.deleteMany({ tenantId: x, anotherField: 3 }, { writeConcern: { w: 0 } });
The performance may depend on a variety of factors, but here are some options you can try to improve it.
Bulk operations
Bulk operations might help here. bulk.find(query).remove() is a version of db.collection.remove(query) that is optimized for large numbers of operations. You can read more about it here.
You can use the following way:
Declare a search query:
var query= {tenantId: x};
Initialize and use a bulk:
var bulk = db.yourCollection.initializeUnorderedBulkOp()
bulk.find(query).remove() // or try delete() instead of remove()
bulk.execute()
The idea here is not so much to speed up the removal as to produce less load.
Also you could try bulkWrite()
db.yourCollection.bulkWrite([
  { deleteMany: {
      "filter": query
  } }
])
TTL indexes
It may not be suitable for your use case, but there's an entirely different approach that avoids doing the removal yourself at all.
If it is acceptable for you to delete data based on a timestamp, then a TTL index might help you. The idea is that a record is removed automatically when its TTL expires.
Implemented as a special index type, TTL collections make it possible
to store data in MongoDB and have the mongod automatically remove data
after a specified period of time.
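A sketch of that approach, assuming documents carry a createdAt date field (the 30-day window is just an example):
// documents are removed automatically roughly 30 days after their createdAt value
db.records.createIndex({ createdAt: 1 }, { expireAfterSeconds: 2592000 })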
DeleteMany: I think there must be something common between all the rows that you want to remove from the collection.
You can find that commonality and then create a query accordingly;
this will help you remove those records fast.
Let me give you one example. I want to remove all the records where the username field does not exist.
db.collection.deleteMany({ username: {$exists: false} })
The best place to start is to find something that all records have in common, in order to remove them all at once.
For example, the following code deletes all entries that don't contain an email address.
db.users.deleteMany({ email: { $exists: false } })
The MongoDB documentation has great examples; link provided below.
https://www.mongodb.com/docs/manual/reference/method/db.collection.deleteMany/#delete-multiple-documents
You might also want to consider dropping the index, since it can be recreated after you're done with the operation.
Finally, you might want to lower the write concern of your operation in order to speed things up. A complete list of options can be found here:
https://www.mongodb.com/docs/v5.0/reference/write-concern/#w-option
I found a good tutorial on https://www.geeksforgeeks.org/mongodb-delete-multiple-documents-using-mongoshell/ that might help you further.
I would suggest two solutions. Also, please export your data first so you have a backup if anything goes wrong, or try this in your test DB first.
1) You can use tenantId as the condition: rather than matching on _id, delete every record that has a tenantId, so all of your tenant data is removed with a single query.
db.records.deleteMany({tenantId: {$exists: true}})
// suggestion: if any of your tenant data has a tenantId field that is null, also check for a null value to delete those records.
 
2) Find a field that is common to all of the tenant's records; if there is one, use it as a condition to delete those records.
For example, if all of your tenant data has a common field called type with the same value, use a delete statement like:
db.records.deleteMany({type : 1})

How to search values in real time on a badly designed database?

I have a collection named Company which has the following structure:
{
  "_id": ObjectId("57336ea1a7454c0100d889e4"),
  "currentMonth": 62,
  "variables1": { ... },
  ...
  "variables61": { ... },
  "variables62": {
    "name": "Test",
    "email": "email#test.com",
    ...
  },
  "country": "US"
}
My need is to be able to search for companies by name with up-to-date data. I don't have permission to change this data structure because many applications still use it. For the moment I haven't found a way to index these variables with this data structure, which makes the search slow.
Today each of these documents can be several megabytes in size and there are over 20,000 of them in this collection.
The system I want to implement uses a search engine to index the names of companies, but for that it needs to be able to detect changes in the collection.
MongoDB's change stream seems like a viable option but I'm not sure how to make it scalable and efficient.
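For context, the basic change stream usage I'm looking at is roughly this (a sketch, not yet tuned for scale; the pipeline is just what I described above):
var watchCursor = db.Company.watch([
  // only the operations that can affect a company name
  { $match: { operationType: { $in: ["insert", "update", "replace"] } } }
]);
while (!watchCursor.isExhausted()) {
  if (watchCursor.hasNext()) {
    var change = watchCursor.next();
    // here I would push change.documentKey._id (and the new name, if present)
    // to the external search index
    printjson(change.documentKey);
  }
}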
Do you have any suggestions that would help me solve this problem? Any suggestion on the steps needed to set up the above system?
Usually with MongoDB you can add new fields to documents and existing applications would simply ignore the extra fields (though they naturally would not be populated by old code). Therefore:
1. Create a task that is regularly executed, goes through all documents in your collection, figures out the name for each document from its fields, and writes the name into a top-level field (see the sketch below).
2. Add an index on that field.
3. In your search code, look up documents by the values of that field.
4. Compare the calculated name to the source-of-truth name. If different, discard the document.
If names don't change once set, step 1 only needs to go through documents that are missing the top-level name, and step 4 is not needed.
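A minimal sketch of that task and the lookup, for the simpler case where names don't change once set (the companyName field and the variables-based layout are assumptions based on the question):
// periodic task: copy the nested name into a top-level field
db.Company.find({ companyName: { $exists: false } }).forEach(function (doc) {
  var current = doc["variables" + doc.currentMonth];
  if (current && current.name) {
    db.Company.updateOne({ _id: doc._id }, { $set: { companyName: current.name } });
  }
});
// index the denormalised field so name lookups are fast
db.Company.createIndex({ companyName: 1 });
// the search code then queries the indexed field
db.Company.find({ companyName: "Test" });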
Using the change detection pattern with monstache, I was able to synchronise MongoDB with Elasticsearch in real time, filtering on the current month and then mapping the resulting variables to be indexed 🎊

Remove document if timestamp is too old

I know it is possible to remove documents whose saved date has passed with this command:
db.course.deleteMany({date: {"$lt": ISODate()}})
I am trying to do the same thing, but checking whether a timestamp has passed.
All saved timestamps look like this one:
1492466400000
Is it possible to write a command with a condition that deletes all documents whose timestamp is too old?
EDIT
I use millisecond timestamps
In MongoDB 3.2 you can also use TTL indexes, where you tell MongoDB "please remove all documents whose {fieldDate} is older than 3600 seconds". This is pretty useful for logging (remove all logs older than 3 months).
Maybe it's not your use case, but I think it's good to know.
TTL indexes are special single-field indexes that MongoDB can use to automatically remove documents from a collection after a certain amount of time or at a specific clock time. Data expiration is useful for certain types of information like machine generated event data, logs, and session information that only need to persist in a database for a finite amount of time.
db.eventlog.createIndex( { "lastModifiedDate": 1 }, { expireAfterSeconds: 3600 } )
https://docs.mongodb.com/v3.2/core/index-ttl/
I found a way to do this:
db.course.deleteMany({date: {"$lt": ISODate()*1}})
The ISODate()*1 expression converts the Date into a millisecond timestamp, and new Date(timestamp) converts a millisecond timestamp back into a Date.
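Equivalently, since the stored values are plain millisecond numbers, the current time can be taken from Date.now() directly (a sketch of the same deletion):
// Date.now() returns the current time in milliseconds, matching the stored format
db.course.deleteMany({ date: { $lt: Date.now() } })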

Should I use the timestamp in "_id"?

I need to monitor when records are created, for further querying and modification.
The first thing that came to mind was giving the document a "createDateTime" field with a default value of "new Date()", but MongoDB's documentation says the document _id has a timestamp embedded in it and is generated when the document is created, so it seems redundant to add a new field for that.
Many times I've seen people set a "createDateTime" on their data, and I don't know whether they know about these details of MongoDB's _id.
I want to know: should I use the _id as a "createDateTime" field? What is the best practice, and what are the pros and cons?
Thanks for any tips.
I'd actually say it depends on how you want to use the date.
For example, the timestamp inside _id is not usable with the aggregation framework's Date operators.
This will fail for example:
db.test.aggregate( { $group : { _id: { $year: "$_id" } } })
The following error occurs:
"errmsg" : "exception: can't convert from BSON type OID to Date"
(The date cannot be extracted from the ObjectId.)
So, operations that are normally simple date operations become much more complex if you wanted to do any sort of date math in an aggregation. It would be far easier to have a createDateTime stamp. Counting the number of documents created in a particular year and month would be simple using aggregation with a distinct createdDateTime field.
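For instance, with a dedicated date field this kind of grouping is straightforward (a sketch; the collection and field names are assumed):
// count documents per creation year and month using a dedicated date field
db.test.aggregate([
  { $group: {
      _id: { year: { $year: "$createdDateTime" }, month: { $month: "$createdDateTime" } },
      count: { $sum: 1 }
  } }
])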
You can sort on an ObjectId to some degree, since the first 4 bytes are a timestamp, but the remaining 8 bytes aren't sortable in a meaningful way. Most MongoDB drivers default to creating the ObjectId within the driver and not on the database. So, if you've got multiple clients (like web servers, for example) creating new documents (and new ObjectIds), the timestamps will only be as accurate as the clocks of those various servers.
Also, depending on the precision you need, note that an ISODate value is stored using 8 bytes, rather than the 4 used for the timestamp in an ObjectId.
Yes, you should. There is no reason not to, apart from human readability when looking directly into the database. See also here and here.
If you want to use the aggregation framework to group by the date within _id, this is not possible yet, as WiredPrairie correctly said. There is an open JIRA ticket for that which you might watch. But of course you can do this with Map-Reduce and ObjectId.getTimestamp(). An example for that can be found here.
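A minimal sketch of that Map-Reduce workaround, grouping documents by the creation year embedded in their _id (collection name assumed):
db.test.mapReduce(
  // emit the year from the ObjectId's embedded timestamp
  function () { emit(this._id.getTimestamp().getFullYear(), 1); },
  // sum the counts per year
  function (key, values) { return Array.sum(values); },
  { out: { inline: 1 } }
)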

How to implement persisted sorted list which is often updated and you need maintain the order

I need to display the members of a community sorted by last visit. There are millions of communities, each of which can have millions of members. The list should be scrollable. Because of the sorting by last visit time, the order is updated very often.
In an RDBMS this functionality could simply be done with an ordinary B-tree index. But how can I do it with a NoSQL approach?
My current thoughts are:
The standard NoSQL scrollable-list approach, which uses chained buckets of fixed length, doesn't help much because of the reordering requirement.
Cassandra keeps values ordered by column name, so theoretically I could use the last visit time as the column key, but for each update I would need to delete the existing column and insert a new one, which doesn't sound very efficient.
Apache Lucene is not a NoSQL store but is also an option because it creates a sorted index. But I'm not sure how well it scales with massive updates.
Redis Sorted Sets sound really promising, but I have no experience with them.
What other options do I have?
If you keep the last modification date in the object you could sort at query time in many NoSQL db's:
MongoDB (see docs on indexes):
db.collection.find({ ... spec ... }).sort({ key: 1 })
db.collection.ensureIndex( { "username" : 1, "timestamp" : -1 } )
Elasticsearch has sorting in queries too:
{
  "sort": [
    { "date": { "order": "asc" } }
  ],
  "query": {
    ...
  }
}
Some stores, like CouchDB, seem to lack a built-in sorting feature altogether, so it pays to have a look at a particular solution before investing in it.