This is going to be a somewhat general question, but I have a reason for that: I am not sure what kind of approach I should take to make things faster.
I have a MongoDB server running on a big AWS instance (r3.4xlarge, 16 vCPUs and 122 GB of memory). The database has one huge collection with 293,017,368 documents. Some of them have a field called paid, which holds a string value, and some do not; likewise, some have an array called payment_history and some do not. I need to perform some tasks on this database, but any document that is missing paid, payment_history, or both is irrelevant to me. So I thought of cleaning (shrinking) the database before I proceed with the actual operations. Since the first step is to delete records matching something like ({paid: {$exists: false}}), I decided to create an index on paid. At the present rate, I can see that the index will take about 80 days to finish building.
I am not sure what my approach should be in this situation. Should I write a map-reduce job that visits every document, performs whatever I need in one pass, and writes the resulting documents to a different collection? Should I somehow (I am not sure how) split the huge database into smaller sections, apply the transforms to each of them, and then merge the resulting records from each server into a final cleaned record set? Or should I somehow (again, not sure how) load the data into Elastic MapReduce or Redshift and operate on it there? In a nutshell, what do you think is the best route to take in such a situation?
I am sorry in advance if this question sounds a little bit vague. I tried to explain the real situation as much as I could.
Thanks a lot in advance for your help :)
EDIT
Following the comment about sparse indexing, I am now building a partial index with this command: db.mycol.createIndex({paid: 1}, {partialFilterExpression: {paid: {$exists: true}}}). It indexes roughly 53 documents per second, so at this rate I am not sure how long it will take for the entire collection to be indexed. I am leaving it running overnight and will come back tomorrow to update this question. I intend to keep this question updated throughout the whole journey, for the sake of people who hit the same problem in the same situation in the future.
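For reference, one way to realize the "write the relevant documents to a different collection" idea mentioned above is an aggregation with $match and $out. This is only a sketch, assuming that only documents carrying both fields should be kept; mycol_cleaned is a placeholder name, and $out overwrites its target collection.

// Hypothetical sketch: copy only the documents that have both fields
// into a new collection, then continue working on that collection.
db.mycol.aggregate([
    { $match: { paid: { $exists: true }, payment_history: { $exists: true } } },
    { $out: "mycol_cleaned" }
], { allowDiskUse: true })

This avoids building an index on the original collection at all, at the cost of one full collection scan and the disk space for the copy.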
Related
I am using MongoDB with Mongoose, and I need to create a schema "advertisement" which has a field called "views" that will be an array of
{userId: String, date: Date}.
I want to know if this is good practice. Although I know how much it will grow right now (up to 1500 entries, and then it gets reset), in the future I will not. I want to know whether it would seriously affect the performance of the application if that array could hold 50,000 or 100,000 elements or more (it is an unbounded array). In such cases, what would be the best practice? I thought about just storing an increasing counter, but the business decision is to know who saw the ad and when.
I know that there is a limit only for the document (16 MB), but not for the fields themselves. However, my question is more about performance than about the document size limit.
Thank you!
Edit => In the end, it is definitely not a good idea to let an array grow unbounded. I looked at the answer provided first, and it is a good approach. However, since I will be querying the whole document with the array property quite often, I did not want to split it out. And since I do not want to keep data in the array for longer than 3 days, I will pull all elements that are 3 days old or older, which I hope keeps the array small.
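A minimal sketch of that cleanup, assuming the collection is called advertisements (a placeholder name) and that each views entry has the date field from the schema above:

// Hypothetical sketch: drop view entries older than 3 days from every advertisement.
var cutoff = new Date(Date.now() - 3 * 24 * 60 * 60 * 1000);
db.advertisements.updateMany(
    {},
    { $pull: { views: { date: { $lt: cutoff } } } }
)

Running something like this on a schedule keeps the array bounded in practice.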
I know that there is a limit only for the document (16 MB), but not for the fields themselves.
Fields and their values are part of the document, so they have a direct impact on the document size.
Besides that, having such big arrays is usually not the best approach: it decreases performance and complicates queries.
In your case, it is much better to have a separate views collection whose documents reference the advertisements by their _id.
Also, if you expect advertisement.views to be queried pretty often or, for example, you often need to show the last 10 or 20 views, then the Outlier pattern may also work for you.
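A rough sketch of that separate collection, with placeholder names, might look like this:

// Hypothetical sketch of a separate "views" collection that references the ad by _id.
db.views.insertOne({
    advertisementId: adId,   // the _id of the advertisement document (placeholder variable)
    userId: "someUser",
    date: new Date()
})

// The last 10 views of one advertisement; an index on
// { advertisementId: 1, date: -1 } would support this query.
db.views.find({ advertisementId: adId }).sort({ date: -1 }).limit(10)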
I have a collection named "test" with 132K documents in it. When I get the first document of the collection it takes 2-5 ms, but it is not the same for the last document: that one takes 100-200 ms to pull.
So I've decided to ask the community.
My questions
What is the best number of documents to keep in one collection for performance?
Why does it take so long to get the last document from the collection? (I don't really know how MongoDB works internally.)
What should I do about this issue and similar problems in the future?
After some research into how MongoDB works, I found the solution. I had not created any indexes on my collection, so every query had to scan every document. After creating the indexes I needed, queries are much faster than before, around 1 ms.
Conclusion
Create indexes that fit your collection and your query patterns; it will make both read and write operations more effective. Also keep researching, because there are options such as background, which prevents the index build from blocking other operations.
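For example, a background index build looks roughly like this (the field name is just a placeholder; on MongoDB 4.2 and newer the option is ignored because all index builds use the newer non-blocking process):

// Hypothetical sketch: build an index without blocking other operations on the collection.
db.test.createIndex({ someField: 1 }, { background: true })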
I am building a web-based system for my organization using MongoDB. I have gone through the documentation provided by MongoDB and came to the following conclusions:
find: Cannot pull data from a sub-array.
group: Does not work in a sharded environment.
aggregate: Best for sub-arrays, but has performance issues when the data set is large.
Map-reduce: Too risky to write the map and reduce functions.
So, could someone help me out with the best approach for working with sub-array documents in a production environment with a sharded cluster?
Example:
{"testdata":{"studdet":[{"id","name":"xxxx","marks",80}.....]}}
now my "studdet" is a huge collection of more than 1000, rows for each document,
So suppose my query is:
"Find all the "name" from "studdet" where marks is greater than 80"
It is definitely going to be an aggregate query, so is it feasible to go with aggregate in this case? find cannot do this and group will not work in a sharded environment, so if I go with aggregate, what will the performance impact be? I need to run this query most of the time.
Please have a look at:
http://docs.mongodb.org/manual/core/data-modeling/
and
http://docs.mongodb.org/manual/tutorial/model-embedded-one-to-many-relationships-between-documents/#data-modeling-example-one-to-many
These documents describe the decisions involved in creating a good document schema in MongoDB. That is one of the hardest things to do in MongoDB, and one of the most important: it will affect your performance, among other things.
In your case, a database with a student collection that has an array of grades looks to be the best bet.
{_id: ..., grades: [{type: "test", grade: 80}, ...]}
In general, and given your sample data set, the aggregation framework is the best choice. The aggregation framework is faster than map-reduce in most cases (certainly in execution speed; it is C++ for aggregation versus JavaScript for map-reduce).
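As a sketch against the original studdet layout from the question, the query "all name values where marks is greater than 80" could look something like this (the collection name is an assumption):

// Hypothetical sketch, assuming the documents live in a "students" collection.
db.students.aggregate([
    { $unwind: "$testdata.studdet" },
    { $match: { "testdata.studdet.marks": { $gt: 80 } } },
    { $project: { _id: 0, name: "$testdata.studdet.name" } }
])

The same pipeline shape works against the grades array suggested above, and it runs on a sharded collection as well.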
If your data's working set becomes so large that you have to shard, then aggregation, and everything else, will be slower. Not, however, slower than putting everything on a single machine that has a lot of page faults. Generally you need a working set larger than the RAM available on a modern computer for sharding to be the correct way to go, so that you can keep everything in RAM. (At that point a commercial support contract for Mongo will cost less than the hardware, and it includes extensive help with schema design.)
If you need anything else please don’t hesitate to ask.
Best,
Charlie
We collect and store instrumentation data from a large number of hosts.
Our storage is MongoDB - several shards with replicas. Everything is stored in a single large collection.
Each document we insert is a time-based observation with some attributes (measurements). The timestamp is the most important attribute because all queries are based on time, at a minimum. Documents are never updated, so it is a pure write-then-look-up model. Right now it works reasonably well with several billion documents.
Now,
We want to grow a bit and hold up to 12 months of data, which may amount to a scary trillion-plus observations (documents).
I was wondering whether dumping everything into a single monstrous collection is the best choice, or whether there is a more intelligent way to go about it.
By more intelligent I mean - use less hardware while still providing fast inserts and (importantly) fast queries.
So I thought about splitting the large collection into smaller pieces, hoping to save memory on indexes and gain insertion and query speed.
I looked into shards, but sharding by the timestamp sounds like a bad idea, because all writes would go to one node, canceling the benefits of sharding.
The insert rates are pretty high, so we need sharding to work properly here.
I also thought about creating a new collection every month and then picking the relevant collection for a user query.
Collections older than 12 month will be either dropped or archived.
There is also the option of creating an entirely new database every month and doing a similar rotation.
Other options? Or is one large collection THE way to grow really big?
Please share your experience and considerations in similar apps.
It really depends on the use-case for your queries.
If it is something that can be aggregated, I would say do it through a scheduled map/reduce job and store the smaller result set in a separate collection (or collections).
If everything should stay in the same collection and all the data must be queried together to generate the desired results, then you need to go with sharding. Then, depending on the data size of your queries, you could go with an in-memory map/reduce or even do it at the application layer.
As you yourself pointed out, sharding based on time is a very bad idea: it makes all the writes go to one shard, so define your shard key carefully. The MongoDB docs have a very good explanation of this.
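As an illustration only (the database, collection, and field names below are assumptions, not taken from your setup), a compound shard key on an attribute with reasonable cardinality plus the timestamp spreads writes across shards while keeping time-range queries for a single source targeted:

// Hypothetical sketch of a compound shard key; all names are assumptions.
sh.enableSharding("metrics")
db.getSiblingDB("metrics").observations.createIndex({ host: 1, timestamp: 1 })
sh.shardCollection("metrics.observations", { host: 1, timestamp: 1 })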
If you can elaborate more on your specific needs for the queries, it will be easier to suggest something.
Hope it helps.
I think a collection per month will give you some boost, but I was wondering why you cannot use the hour field of your timestamp for sharding. You can add a field that holds the HOUR part of the timestamp, and when you shard on it the data will be distributed nicely because the hours repeat every day. I have not tested it, but I thought it might help you.
I would suggest going ahead with a single collection. As suggested by @Devesh, an hour-based shard key should be fine; just take care to include the new 'hour' key when querying to get better performance.
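A rough sketch of that hour-based idea, with assumed names throughout:

// Hypothetical sketch: store the hour (0-23) of the timestamp in its own field at insert time...
db.observations.insertOne({
    timestamp: ts,          // ts is a placeholder Date variable
    hour: ts.getHours(),    // 0-23, repeats every day, so the hot chunk rotates over the day
    // ... other measurements
})

// ...and shard the collection on that field. The low cardinality (only 24 values)
// is a known trade-off: chunks for a given hour cannot be split further by this key alone.
db.observations.createIndex({ hour: 1 })
sh.shardCollection("metrics.observations", { hour: 1 })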
I have a Mongo collection with a little under 2 million documents in it, and I have a query that I wish to run that will delete around 700,000 of them, based on a date field.
The remove query looks something like this:
db.collection.remove({'timestamp': { $lt: ISODate('XXXXX') }})
The exact date is not important in this case; the syntax is correct and I know it will work. However, I also know it is going to take forever (the last time we did something similar it took a little under 2 hours).
There is another process inserting and updating records at the same time that I cannot stop. However, as long as those insertions/updates "eventually" get executed, I don't mind them being deferred.
My question is: is there any way to set the priority of a specific query/operation so that it runs faster or ahead of all the queries sent afterwards? In this case, I assume Mongo has to do a lot of swapping of data in and out of memory, which is not helping performance.
I don't know whether the priority can be fine-tuned, so there might be a better answer.
A simple workaround might be what is suggested in the documentation:
Note: For large deletion operations it may be more effect [sic] to copy the documents that you want to save to a new collection and then use drop() on the original collection.
Another approach is to write a simple script that fetches, say, 500 documents and then deletes them using $in. You can add some kind of sleep() call to throttle the deletion process. This was recommended in the newsgroup.
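A minimal sketch of that batched approach in the shell (the batch size and sleep interval are arbitrary; the cutoff is the same placeholder as in the question):

// Hypothetical sketch: remove old documents in small batches, pausing between batches
// so that concurrent inserts/updates still get through.
var cutoff = ISODate('XXXXX');   // placeholder cutoff date, as in the original query
while (true) {
    var ids = db.collection.find({ timestamp: { $lt: cutoff } }, { _id: 1 })
                           .limit(500)
                           .toArray()
                           .map(function (doc) { return doc._id; });
    if (ids.length === 0) break;
    db.collection.remove({ _id: { $in: ids } });
    sleep(200);   // milliseconds
}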
If you encounter this problem again in the future, you might want to
use a day-by-day collection so you can simply drop the entire collection once the data becomes old enough (this makes aggregation harder), or
use a TTL collection where items time out automatically and don't need to be deleted in bulk.
If your application needs to delete data older than a certain amount of time, I suggest using TTL indexes. Example (from the MongoDB site):
db.log.events.ensureIndex( { "status": 1 }, { expireAfterSeconds: 3600 } )
This works like a capped collection, except data is deleted by time. The biggest win for you is that it runs in a background thread, so your inserts/updates will be mostly unaffected. I use this technique on a SaaS product in production, and it works like a charm.
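Applied to the collection from the question, a minimal sketch might be (the 30-day retention period is an assumption, and the timestamp field must hold a BSON Date for TTL to work):

// Hypothetical sketch: expire documents roughly 30 days after their timestamp.
db.collection.ensureIndex({ timestamp: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 30 })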
This may not be your use case, but I hope that helped.