I have a collection in MongoDB that stores every user action of my application, and it is very large (about 3 million documents per day). In the UI I have a requirement to show user actions for a period of at most 6 months.
The queries on this collection are becoming very slow because of all the historic data, even though there are indexes in place. So I want to move the documents that are older than 6 months to a separate collection.
Is this the right way to handle my issue?
Following are some of the techniques you can use to manage data growth in MongoDB:
Using capped collections
Using TTL indexes (see the sketch below)
Using multiple collections, one per month
Using different databases on the same host
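For the TTL approach, a minimal pymongo sketch (assuming each action document carries a created_at date field, which is an assumption about your schema):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # adjust to your deployment
actions = client["mydb"]["user_actions"]           # hypothetical db/collection names

# TTL index: a background task deletes each document once its created_at
# value is older than expireAfterSeconds (here roughly 6 months).
actions.create_index("created_at", expireAfterSeconds=183 * 24 * 3600)
```

Keep in mind that TTL deletion removes the documents for good; if you still need the older data for occasional reporting, copy it to an archive collection before it expires, or go with the per-month collections instead.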
I need to store application transaction logs and decided to use MongoDB. Every day roughly 200,000 documents are stored in a single-node MongoDB.
We have some reports and operations (if something happened, then do something) that depend on those logs, so I need to find documents matching various criteria. At this pace, will this become a problem? Will queries be slow to execute?
Any suggestions for using MongoDB efficiently here?
By the way, all of that data is in a single collection, and the MongoDB server version is 4.2.6.
Mongo collections can grow to many terabytes without much issue. To be able to query that data quickly, you will have to analyze your queries and create indexes for the fields used in those queries.
Indexes are not free, though. They take up both disk space and RAM, because for indexes to be useful they need to fit entirely in RAM.
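As a minimal sketch (the field names service, level and ts are placeholders I made up; use whatever fields your report queries actually filter and sort on):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # adjust to your deployment
logs = client["mydb"]["transaction_logs"]          # hypothetical db/collection names

# Compound index matching a typical report query:
# equality filters on service and level, range/sort on the timestamp.
logs.create_index([("service", 1), ("level", 1), ("ts", -1)])

# Verify that a representative query actually uses the index.
plan = logs.find({"service": "payments", "level": "ERROR"}).sort("ts", -1).explain()
print(plan["queryPlanner"]["winningPlan"])
```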
In most cases, if indexes and collections grow beyond what your hardware can handle, you will have to archive/evict old data and trim down the collections.
If your queries need to include that evicted data in order to generate your reports, you will have to keep another collection with summarized values/data of the evicted records, which you then combine with the current data when generating the reports.
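One way to build such a summary collection is a scheduled aggregation; a hedged sketch (MongoDB 4.2+ supports the $merge stage, and the field names below are assumptions):

```python
from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
logs = client["mydb"]["transaction_logs"]  # hypothetical db/collection names

cutoff = datetime.utcnow() - timedelta(days=180)

# Roll up per-day, per-level counts of old log entries into a summary
# collection; the raw documents older than the cutoff can then be archived.
logs.aggregate([
    {"$match": {"ts": {"$lt": cutoff}}},
    {"$group": {
        "_id": {"day": {"$dateToString": {"format": "%Y-%m-%d", "date": "$ts"}},
                "level": "$level"},
        "count": {"$sum": 1},
    }},
    {"$merge": {"into": "transaction_logs_summary", "whenMatched": "replace"}},
])
```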
Alternatively, sharding can help with big data, but there are some limitations on the queries you can run against sharded collections.
We are planning to use MongoDB for a general purpose system and it seems well suited to the particular data and use cases we have.
However, we have one use case where we will need to compare every document (of which there could be tens of millions) with every other document. The 'distance measure' could be precomputed offline by another system, but we are concerned about the online performance of MongoDB when we want to query - e.g. when we want to see the top 10 closest documents in the entire collection to a list of specific documents...
Is this likely to be slow? Also, can this be done across collections (e.g. query for the top 10 closest documents in one collection to a document in another collection)?
Thanks in advance,
FK
In MongoDB, we have one large collection that holds every possible bit of information we use, and it is updated daily from different sources.
Now my question is: is it a good idea to create multiple smaller collections from the original collection and use those collections for the transactions coming from the UI? These smaller collections would be kept up to date daily using a cron job.
I am trying to build an event tracking system for my mobile apps. I am evaluating MongoDB for my needs, and I don't have any hands-on experience with NoSQL databases. I have read the MongoDB documentation thoroughly and have come up with the following schema design for my needs.
1. Must have a horizontally scalable data store
2. Data store must execute group queries quickly in a sharded environment
3. Must have extremely high write throughput
Collections:
Events:
{name: '<name>', happened_at: '<timestamp>', user: {imei: '<imei>', model_id: '<model_id>'}}
Devices:
{model_id:'<model_id>', device_width:<width>, memory: '<memory>', cpu: '<cpu>'}
I do not want to store devices as an embedded document within events.user, to save storage space in my fastest-growing collection, i.e. events. The devices collection is not going to grow much and should never hold more than about 30k records, while the events collection will have a few million documents added every day.
My data growth requires a sharded environment, and we want to plan for that from day 1, so we will not use anything that doesn't work in a sharded system.
For example, group functions don't work with shards, so we will always write MongoDB map/reduce commands for such needs.
Problem: What is the best way to get all users who performed a particular event (name = 'abc happened') on devices having device_width < 300?
My solution: Find all models having device_width < 300 and use the result to filter the events documents on those models (see the sketch below).
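A rough pymongo sketch of what I mean (the database name is made up; collection and field names follow the schema above):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # adjust to the deployment
db = client["tracking"]                            # hypothetical database name

# Step 1: small query against the ~30k-document devices collection.
narrow_models = db.devices.distinct("model_id", {"device_width": {"$lt": 300}})

# Step 2: filter the large events collection by event name and those models.
users = db.events.distinct(
    "user.imei",
    {"name": "abc happened", "user.model_id": {"$in": narrow_models}},
)
print(len(users), "users")
```

I assume an index on {name: 1, 'user.model_id': 1} in events would be needed for step 2 to stay fast.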
Problem: Return the count of users for whom a particular event (name = 'abc happened') happened on a device, grouped by the cpu of the device.
My solution: Get the count of users for the given event, grouped by model_id (< 30k records, I know). Then group further by the cpu associated with each model_id and return the final result (see the sketch below).
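A rough sketch of that approach, again with a made-up database name, grouping in the database and mapping model_id to cpu in application code (I understand the aggregation pipeline, unlike the old group command, does run against sharded collections on current versions):

```python
from collections import defaultdict
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["tracking"]  # hypothetical database name

# Count distinct users per model for the given event.
per_model = db.events.aggregate([
    {"$match": {"name": "abc happened"}},
    {"$group": {"_id": {"model": "$user.model_id", "imei": "$user.imei"}}},
    {"$group": {"_id": "$_id.model", "users": {"$sum": 1}}},
])

# Map model_id -> cpu from the small devices collection.
cpu_of = {d["model_id"]: d["cpu"]
          for d in db.devices.find({}, {"model_id": 1, "cpu": 1})}

# Combine: sum the per-model user counts by cpu.
by_cpu = defaultdict(int)
for row in per_model:
    by_cpu[cpu_of.get(row["_id"], "unknown")] += row["users"]
print(dict(by_cpu))
```

One caveat with this shape: a user seen on two different models that share the same cpu is counted twice; if strictly distinct users per cpu are needed, the grouping key would have to include the cpu itself.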
Please let me know if I am doing it the right way. If not, what is the right way to do it at scale?
EDIT: Please also point out any possible caveats, such as indexes not being used to maximum effect with map/reduce.
In my current project, I am using two databases.
A MongoDB instance gathering data from different data providers (about 15M documents)
Another (relational) database instance holding only the data that is needed for the application, i.e. a subset of the data in the MongoDB instance (about 5M rows)
As part of the synchronisation process, I need to regularly check for new entries in the MongoDB depending on data in the relational DB.
Let's say this is about songs and artists; a document in the MongoDB might look like this:
{_id:1,artists:["Simon","Garfunkel"],"name":"El Condor Pasa"}
Part of the sync process is to import/update all songs from those artists that already exist in the relational DB, which are currently about 1M artists.
So how do I retrieve all songs of 1M named artists from MongoDB for import?
My first thought (and attempt) was to loop over all artists and query all songs for each artist (of course, there's an index on the "artists" field). But this takes several minutes for each batch of 1,000 artists, which would make this a very long-running process.
My second thought was to write all existing artists to a separate MongoDB collection and have a single query which only retrieves songs of the artists stored in there. But so far I have not been able to retrieve data based on two collections.
Is this a good use case for map/reduce? If yes, can someone please give me a hint on how to achieve this? (I am not completely new to NoSQL, but sort of a newbie when it comes to map/reduce.)
Or is this idea just crazy, and do I have to stick with a process that runs for several days?
Thanks in advance for any hints.
If you regularly need to check for changes, then add a timestamp to your data, and incorporate that timestamp into your query. For example, if you add a "created_ts" attribute, then you can look for records that were created since the last time your batch ran.
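A minimal sketch of that incremental check, assuming a created_ts date field has been added as suggested and that the last successful run time is tracked somewhere on your side:

```python
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # adjust to your deployment
songs = client["music"]["songs"]                   # hypothetical db/collection names

# Index created_ts so the range scan below stays cheap.
songs.create_index("created_ts")

last_run = datetime(2012, 1, 1)  # in practice: read from your sync bookkeeping
for doc in songs.find({"created_ts": {"$gt": last_run}}):
    print(doc["_id"])  # hand the new/changed song to the relational import here
```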
Here are a few ideas for making the mongo interaction more efficient:
Reduce network overhead by using an "$in" query. Play around with the size of the array of artist IDs in order to determine what works best for your case (see the sketch after this list).
Reduce network overhead by only selecting or reading the attributes that you need.
Make sure that your documents are indexed by artist.
On the Mongo server, make sure that as much of your data as possible fits into memory. Retrieving data from disk is going to be slow no matter what else you do. If it doesn't fit into memory, then you have a few options: buy more memory; shrink your data set (e.g. drop attributes that you don't actually need); shard; etc.
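Here is a hedged sketch combining the $in batching and projection ideas (batch size and field names are assumptions you would tune for your case):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
songs = client["music"]["songs"]  # hypothetical db/collection names

# In practice: the ~1M artist names read from the relational DB.
known_artists = ["Simon", "Garfunkel"]
BATCH_SIZE = 1000  # tune: bigger batches mean fewer round trips but larger queries

for i in range(0, len(known_artists), BATCH_SIZE):
    batch = known_artists[i:i + BATCH_SIZE]
    cursor = songs.find(
        {"artists": {"$in": batch}},          # uses the index on "artists"
        {"_id": 1, "artists": 1, "name": 1},  # projection: only what the import needs
    )
    for song in cursor:
        print(song["name"])  # upsert into the relational DB here
```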