Imagine you have millions of users who perform transactions on your platform. Assuming each transaction is a document in your MongoDB collection, millions of documents would be generated every day, blowing up your database in no time. I have received the following suggestions from friends and family.
Having a TTL index on the documents - This won't work because we need those documents stored somewhere so they can be retrieved later when the user asks for them.
Sharding the collection with the timestamp as the key - This won't help us control the time frame for which we want the data backed up.
I would like to understand and implement a strategy similar to what banks follow. They keep your transactions up to a certain point (e.g. 6 months), after which you have to request them via support or some other channel. I am assuming they follow a hot/cold storage pattern, but I am not completely sure about it.
The whole point is to manage the transaction documents and, on a daily basis, back up or move the older records to another place where they can still be read from. Any idea how that is possible with MongoDB?
Update: Sample document (please note that a few other keys in the document have been redacted):
{
"_id" : ObjectId("5d2c92d547d273c1329b49f0"),
"transactionType" : "type_3",
"transactionTimestamp" : ISODate("2019-07-15T14:51:54.444Z"),
"transactionValue" : 0.2,
"userId" : ObjectId("5d2c92f947d273c1329b49f1")
}
First, create a collection where you want to save all the records (taking your sample, let's say these entries are stored in a collection named A).
Then take a backup every day at midnight and, after the backup succeeds, restore it into a collection named with that day's timestamp.
Once the entries are safely stored there, you can truncate the original collection.
With this approach the live collection stays bounded in size, and you still keep every record.
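A rough mongo shell sketch of that nightly job, assuming the live collection is named A and using the transactionTimestamp field from the sample document; the archive collection name and the cutoff logic are illustrative, not prescriptive:

var cutoff = new Date();
cutoff.setHours(0, 0, 0, 0);                                  // start of today
var archiveName = "A_" + cutoff.toISOString().slice(0, 10);   // e.g. "A_2019-07-15"

// Copy everything older than the cutoff into a timestamp-named collection.
db.A.aggregate([
    { $match: { transactionTimestamp: { $lt: cutoff } } },
    { $out: archiveName }
]);

// Only after verifying the copy, remove the archived documents from the live collection.
if (db.getCollection(archiveName).count() === db.A.count({ transactionTimestamp: { $lt: cutoff } })) {
    db.A.deleteMany({ transactionTimestamp: { $lt: cutoff } });
}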
Let's say I have a document in the following format in MongoDB:
students: Array
grades: Array
I currently have about 10,000 students and grades that are constantly changing. The number of students is constantly growing and students are removed from the document. I have a process to update the document every 30 minutes. At the same time, I've built an ExpressJS API where various teachers query the database as often as every minute to view info about their students.
What is the best way to update the data? Since there are so many possibilities of students being added, removed, and with grades changed, should I just overwrite the document every 30 minutes? The dataset is only a couple MB in size overall.
How can I ensure that the teachers will have no downtime if I happen to be updating at the same time they're making a GET request?
If you have that many changes, it is better to overwrite the document with all the changes at once than to update the document for every single student change.
The teachers will have no downtime: they will simply read the previous document until the change lands, and the new document after it.
With so many elements in the arrays (10k students/grades), it may be better to have a single document per student/grade in the collection, so that you only update the relevant student's document.
You need to adapt the schema to your use cases. I would guess teachers do not need the full list of students/grades every time, just the students per class, lesson, or school; but I am guessing here since I don't see your exact use case and document example.
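Sketched below in the mongo shell (the collection and field names are placeholders, not from the question): option 1 replaces the whole snapshot document atomically, so a concurrent GET sees either the old or the new version; option 2 keeps one document per student so each change stays small.

var updatedStudents = ["alice", "bob"];   // placeholder data for the refreshed snapshot
var updatedGrades   = [90, 85];

// Option 1: overwrite the single snapshot document in one atomic replace.
db.classData.replaceOne(
    { _id: "school-42" },
    { _id: "school-42", students: updatedStudents, grades: updatedGrades, updatedAt: new Date() },
    { upsert: true }
);

// Option 2: one document per student, so an update touches only that student.
db.students.createIndex({ studentId: 1 }, { unique: true });
db.students.updateOne(
    { studentId: "s-1001" },
    { $set: { grades: { math: "A", physics: "B+" } } },
    { upsert: true }
);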
We have a collection of documents and each document has a time window associated to it. (For example, fields like 'fromDate' and 'toDate'). Once the document is expired (i.e. toDate is in the past), the document isn't accessed anymore by our clients.
So we wanted to purge these documents to reduce the number of documents in the collection and thus making our queries faster. However we later realized that this past data could be important to analyze the pattern of data changes, so we decided to archive it instead of purging it completely. So this is what I've come up with.
Let's say we have a "collectionA" which contains these past documents. The plan (sketched in the shell below) is:
1. Query all the past documents in "collectionA" (queries are made on the secondary server).
2. Insert them into a separate collection called "collectionA-archive".
3. Delete the documents from "collectionA" that were successfully inserted into the archive.
4. Delete documents in "collectionA-archive" that meet a certain condition (we do not want to keep a huge archive).
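A rough mongo shell sketch of those four steps, using the toDate field from the question; the retention window and the secondary read preference are illustrative assumptions:

var now = new Date();
var retention = new Date(now.getTime() - 365 * 24 * 3600 * 1000);   // keep one year in the archive

// Steps 1-2: read the expired documents (preferring a secondary) and copy them to the archive.
db.getMongo().setReadPref("secondary");
db.collectionA.find({ toDate: { $lt: now } }).forEach(function (doc) {
    db.getCollection("collectionA-archive").insertOne(doc);          // writes still go to the primary
    // Step 3: delete only what was successfully copied.
    db.collectionA.deleteOne({ _id: doc._id });
});

// Step 4: trim the archive itself.
db.getCollection("collectionA-archive").deleteMany({ toDate: { $lt: retention } });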
My question here is: even though I'm making the queries on the secondary, since the insertions happen on the primary, do the documents inserted into the archive collection end up in the primary's working set? The last thing we need is these past documents occupying RAM on the primary, which could affect the performance of our live API.
I know one solution could be to insert the past documents into a separate DB server, but acquiring another server is a bit of a hassle, so I would like to know whether this is achievable within one server.
I am looking into using Mongo to store my data. I want to store one document per change to a record. For example, a record represents a calendar event; each time this event is updated (via a web form), I want to store the new version in a new document. This will allow the historical details of this event to be retrieved on request.
If I were to store this data using a relational database, I would have an 'events' table and an 'events_history' table:
'events' table:
event_id
last_event_history_id
'events_history' table:
event_history_id
event_id
event_date
So, when I want to retrieve a list of events (showing the latest history for each event), I would do this:
SELECT * FROM events_history eh, events e
WHERE
eh.event_history_id = e.last_event_history_id
However, I am unsure how to approach storing the data and generating this list using Mongo.
Joe,
Your question is a frequent one for folks coming from an RDBMS background to MongoDB (which is, BTW, exactly how I personally came to MongoDB), so I can relate to it.
If I were to restate your question in a generic way, I would say:
How to model one-to-many relationships in MongoDB?
There are basically two approaches:
Embedded Documents
You can have a "events" collection. The documents in this collection can contain a "key" called "Event_history where each entry is an "old version" of the event itself.
Embedded documents are discussed in more detail in the MongoDB data modeling documentation.
Document References
This is very similar to what you do in relational databases. You can have two collections, each with its own documents: one collection for "active" events and one collection for historical events.
Document references are likewise covered in the MongoDB data modeling documentation.
Now back to your question: which of these two approaches is better?
There are a couple of factors to consider:
1 - MongoDB does not currently have database-level joins. If your workload is primarily reads and your documents/events do not change frequently, the approach with embedded documents will be easier and perform better.
2 - Avoid growing documents. If your events change frequently, causing the MongoDB documents to grow, then you should opt for design #2 with references. Document growth at scale is usually not the best option for MongoDB performance; the reasons for avoiding it are discussed in depth in the MongoDB documentation.
Without knowing the details of your app, I am inclined to guess that document references would be better for an event management system where history is an important feature: keep two separate collections and perform the join inside your app.
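A small mongo shell sketch of both designs; field names other than the ones from your schema (event_id, event_date, events_history, last_event_history_id) are illustrative:

// #1 Embedded: one document per event; each update pushes the previous
// version onto an embedded event_history array.
var prev = db.events.findOne({ event_id: 123 });
db.events.updateOne(
    { event_id: 123 },
    {
        $set:  { title: "Team meeting", event_date: new Date() },
        $push: { event_history: { title: prev ? prev.title : null, event_date: prev ? prev.event_date : null } }
    },
    { upsert: true }
);

// #2 References: each version is its own document in events_history, and the
// events document only points at the latest one.
var histId = db.events_history.insertOne(
    { event_id: 123, event_date: new Date(), title: "Team meeting" }
).insertedId;

db.events.updateOne(
    { event_id: 123 },
    { $set: { last_event_history_id: histId } },
    { upsert: true }
);

// Listing events with their latest history then means one extra query per batch
// of events - the "join" done in your app.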
I am using MongoDB 3.0 with the Java driver. I have a collection with 100,000+ entries. Each day there will be approximately 500 updates and approximately 500 inserts, which should be done in a batch. I will receive the updated documents containing the old fields plus some new ones that I have to store, but I don't know which fields are newly added. I also maintain a summary statistic for each field. Since I don't know what changed, I will have to fetch the records that already exist to see the difference between the updated documents and the stored ones, and set the summary statistics accordingly. I would like input on how this can be done efficiently.
Should I delete the existing records and insert them again, or should I update the 500 records? And should I consider doing 1,000 upserts if that has potential advantages?
Example use case:
The initial record contains f = [185, 75, 186]. I will get an update request with f = [185, 75, 186, 1, 2, 3] for the same record. The summary statistics mentioned above store the counts of the ids in f, so the counts for 1, 2, 3 should be increased while those for 185, 75, 186 remain the same.
Upserts are used to add a document if it does not exist. So if you're expecting new documents then yes, set {upsert: true}.
In order to update your statistics I think the easiest way is to redo the statistics if you were doing it in mongo (e.g. using the aggregation framework). If you index your documents properly it should be fine. I assume that your statistics update is an offline operation.
If you weren't doing the statistics in mongo then you can add another collection where you can save the updates along with the old fields (of course you update your current collection too) so you will know which documents have changed during the day. At the end of the day you can just remove this temporary/log collection once you've extracted the needed information.
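A sketch of both pieces, shown in the mongo shell for brevity (the question uses the Java driver, which has equivalent bulkWrite and aggregate calls); the collection name items and the recordId key are made up:

var updates = [ { recordId: 1, f: [185, 75, 186, 1, 2, 3] } ];   // the day's batch (placeholder)

// Apply the batch as upserts instead of delete + re-insert.
db.items.bulkWrite(updates.map(function (u) {
    return { replaceOne: { filter: { recordId: u.recordId }, replacement: u, upsert: true } };
}));

// Then recompute the per-id counts of f with the aggregation framework.
db.items.aggregate([
    { $unwind: "$f" },
    { $group: { _id: "$f", count: { $sum: 1 } } },
    { $out: "f_counts" }                                          // overwrite the summary collection
]);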
MongoDB keeps a log of every change in the oplog.rs capped collection in the local database. We create a tailable cursor on oplog.rs starting from a timestamp, and every change to a database/collection is streamed through it. I believe this is the best way to identify changes in MongoDB, and you can simply discard the document changes you are not interested in.
Further reading: http://docs.mongodb.org/manual/reference/glossary/#term-oplog
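A rough sketch of tailing the oplog from the mongo shell; this assumes a replica set (standalone servers have no oplog), and the namespace filter is a placeholder:

var local = db.getSiblingDB("local");
var lastTs = local.oplog.rs.find().sort({ $natural: -1 }).limit(1).next().ts;

var cursor = local.oplog.rs
    .find({ ts: { $gt: lastTs }, ns: "mydb.myCollection" })   // placeholder namespace
    .addOption(DBQuery.Option.tailable)
    .addOption(DBQuery.Option.awaitData);

while (cursor.hasNext()) {
    var entry = cursor.next();
    // entry.op is "i" (insert), "u" (update) or "d" (delete); entry.o holds the document or diff.
    printjson(entry);
}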
I am wondering what is the best way to expire only a subset of a collection.
In one collection I store conversion data and click data.
I would like to store the click data for, let's say, a week,
and the conversion data for a year.
In my collection "customers" I store something like:
{ "_id" : ObjectId("53f5c0cfeXXXXXd"), "appid" : 2, "action" : "conversion", "uid" : "2_b2f5XXXXXX3ea3", "iid" : "2_2905040001", "t" : ISODate("2014-07-18T15:01:00.001Z") }
And
{ "_id" : ObjectId("53f5c0cfe4b0d9cd24847b7d"), "appid" : 2, "action" : "view", "uid" : "2_b2f58679e6f73ea3", "iid" : "2_2905040001", "t" : ISODate("2014-07-18T15:01:00.001Z") }
for the click data
So should I execute an ensureIndex, or something like a cron job?
Thank you in advance
There are a couple of built in techniques you can use. The most obvious is a TTL collection which will automatically remove documents based on a date/time field. The caveat here is that for that convenience, you lose some control. You will be automatically doing deletes all the time that you have no control over, and deletes are not free - they require a write lock, they need to be flushed to disk etc. Basically you will want to test to see if your system can handle the level of deletes you will be doing and how it impacts your performance.
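A minimal sketch of the TTL option against the "customers" samples above, with the caveats noted in the comments:

// A TTL index on the t field from the sample documents; the seven-day value
// matches the click retention mentioned in the question.
// Caveat: a TTL index expires every document that has the indexed field, so with
// clicks and conversions in one collection you would either split them into two
// collections or (on newer servers) add a partialFilterExpression on { action: "view" }.
db.customers.createIndex(
    { t: 1 },
    { expireAfterSeconds: 60 * 60 * 24 * 7 }
);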
Another option is a capped collection - capped collections are pre-allocated on disk and don't grow (except for indexes), they don't have the same overheads as TTL deletes do (though again, not free). If you have a consistent insert rate and document size, then you can work out how much space corresponds to the time frame you wish to keep data. Perhaps 20GiB is 5 days, so to be safe you allocate 30GiB and make sure to monitor from time to time to make sure your data size has not changed.
After that you are into more manual options. For example, you could simply have a field that marks a document as expired or not, perhaps a boolean - that would mean that expiring a document would be an in-place update and about as efficient as you can get in terms of a MongoDB operation. You could then do a batch delete of your expired documents at a quiet time for your system when the deletes and their effect on performance are less of a concern.
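A sketch of that manual pattern; the expired field name is illustrative:

// The application flips a flag when a document should no longer be served
// (an in-place update, cheap compared to a delete).
var someId = ObjectId("53f5c0cfe4b0d9cd24847b7d");   // _id from the sample click document
db.customers.updateOne({ _id: someId }, { $set: { expired: true } });

// A nightly job (cron or similar) then removes the flagged documents in one batch.
db.customers.deleteMany({ expired: true });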
Another alternative: you could start writing to a new database every X days in a predictable pattern so that your application knows what the name of the current database is and can determine the names of the previous 2. When you create your new database, you delete the one older than the previous two and essentially always just have 3 (sub in numbers as appropriate). This sounds like a lot of work, but the benefit is that the removal of the old data is just a drop database command, which just unlinks/deletes the data files at the OS level and is far more efficient from an IO perspective than randomly removing documents from within a series of large files. This model also allows for a very clean backup model - mongodump the old database, compress and archive, then drop etc.
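A rough sketch of the rotating-database idea; the clicks_ prefix and the five-day bucket are made-up values for illustration:

// Derive a database name from the date so the application always knows where to write.
function dbNameForDate(d) {
    var bucket = Math.floor(d.getTime() / (5 * 24 * 3600 * 1000));   // five-day buckets
    return "clicks_" + bucket;
}

var current = db.getSiblingDB(dbNameForDate(new Date()));
current.customers.insertOne({ appid: 2, action: "view", t: new Date() });

// Keep the current bucket plus the previous two, and drop anything older.
var keep = {};
for (var i = 0; i < 3; i++) {
    keep[dbNameForDate(new Date(Date.now() - i * 5 * 24 * 3600 * 1000))] = true;
}
db.adminCommand({ listDatabases: 1 }).databases.forEach(function (d) {
    if (/^clicks_\d+$/.test(d.name) && !keep[d.name]) {
        db.getSiblingDB(d.name).dropDatabase();
    }
});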
As you can see, there are a lot of trade-offs here - you can go for convenience, IO efficiency, database efficiency, or something in between - it all depends on what your requirements are and what fits best for your particular use case and system.