MongoDb "Working Set" exceeding RAM - mongodb

I'm collecting time-series data in MongoDB. Eventually my working set will be larger than my RAM, but I mostly need to access the recent data.
If I put everything in just one collection, would that still be possible? The index will keep growing if I put all the data in one collection.
I was thinking of creating a new collection every month and putting that month's data there. This way, the very old data will not be loaded into RAM unless someone (rarely) needs that archived data.
So my question is: is it better to manually partition the data like that, or to just leave everything up to MongoDB?
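
For what it's worth, the manual partitioning the question describes is easy to drive from the application side. Below is a minimal pymongo sketch of the idea; the database name, collection naming scheme, and field names (sensor_id, value, ts) are placeholders for illustration, not from the question.

# Route each reading into a per-month collection so only the current month's
# collection and index need to stay resident in RAM.
from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING

db = MongoClient("mongodb://localhost:27017")["metrics"]  # placeholder names

def collection_for(ts):
    # e.g. readings_2024_05: one collection (and one small index) per calendar month
    return db["readings_{:%Y_%m}".format(ts)]

def insert_reading(sensor_id, value, ts=None):
    ts = ts or datetime.now(timezone.utc)
    coll = collection_for(ts)
    # create_index is idempotent; in practice you would run it once when a new month starts
    coll.create_index([("sensor_id", ASCENDING), ("ts", ASCENDING)])
    coll.insert_one({"sensor_id": sensor_id, "value": value, "ts": ts})

Queries for recent data hit only the current month's collection, and older collections (with their indexes) simply fall out of memory until the rare archive query touches them.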

Related

Place fast-growing table on another data file

I am a Postgres newbie and could not find the answer in the documentation or on Google. Please help me out.
We have a fast-growing table whose content will be periodically off-loaded to a file, after which the table will be vacuumed and cleaned.
The table grows rapidly and can go from 20% of the anticipated database size to 80% or more; in some cases it could reach 100% of the volume size due to bad capacity planning. This table also has the highest write volume, so it would be better to off-load it onto another volume with better write performance. Hence we want to handle this table specifically, in another location.
The growth rate is about 1 GB/hour, though there is only about one hour of activity per day. The cleanup happens at 45 days, so about 45 GB of data accumulates. We then delete a month's worth of data, so the database size goes back down to about 15 GB.
Is it possible to place a fast-growing table (which is periodically cleaned and vacuumed) into a separate file on a separate volume in PostgreSQL? I want it to remain part of the same database, just in a different file.
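
What the question asks for is what PostgreSQL tablespaces provide: the table stays in the same database, but its files live in a directory on another volume. A minimal sketch using psycopg2 follows; the connection string, tablespace name, mount point, and table/index names are all placeholders.

# Put one fast-growing table (and its index) on a separate, faster volume.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder connection string
conn.autocommit = True  # CREATE TABLESPACE cannot run inside a transaction block
cur = conn.cursor()

# The directory must already exist, be empty, and be owned by the postgres OS user.
cur.execute("CREATE TABLESPACE fastdisk LOCATION '/mnt/fast_volume/pgdata'")

# Move the hot table and its index onto the new tablespace (this rewrites the table,
# so it is cheapest right after one of the periodic cleanups).
cur.execute("ALTER TABLE event_log SET TABLESPACE fastdisk")
cur.execute("ALTER INDEX event_log_pkey SET TABLESPACE fastdisk")

New tables can also be created on the new volume directly with CREATE TABLE ... TABLESPACE fastdisk, which avoids the rewrite entirely.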

TorQ: How to update disk database populated with .loader.loadallfiles?

I populate a disk database from large CSV files using TorQ's .loader.loadallfiles in a cumulative fashion, and it works great. However, I now need to also append data coming from a streaming source, and I'm not sure of the best way to go about it.
I know how to update or append data to the in-memory database. However, I do not know what API there is to consistently bring the delta updates into the on-disk database previously populated with .loader.loadallfiles.
I call .loader.loadallfiles like this:
rawdatadir:hsym `$("" sv (getenv[`KDBRAWDATA]; "fwdcurve"));
.loader.loadallfiles[`headers`types`separator`tablename`dbdir`partitioncol`partitiontype!(`date`ccypair`ftype;"ZSS";enlist ",";`fwdcurve;target;`date;`month); rawdatadir];
The best idea, as Jonathon commented, is to maintain an RDB (real-time database) for storing the data from your streaming source. When kdb+ saves data to disk it saves entire columns in one go, so given 1000 records with 5 columns it is better to ask it to save 5 lists of 1000 entries each than to ask it to save 5 columns one entry at a time, 1000 times over.
To illustrate the amount of time this takes, suppose I have two on-disk lists, x and y.
Upserting 10000 elements at once is very fast:
q)\t `:x upsert 10000#1
0
Doing them one at a time is much slower:
q)\t:10000 `:y upsert 1
126
It might be worth looking into using the full TorQ framework. It's designed specifically for this kind of situation: it has RDB and HDB functionality and can be found here: http://aquaqanalytics.github.io/TorQ/
If you wish to append data as you describe, there currently isn't any API to do that. What you can do is modify the RDB or WDB to append to the on-disk database. Using .loader.writedatapartition followed by a call to .loader.finish should be helpful, I think.

How to handle large mongodb collection

We have a collection that is potentially going to be very large. This collection is used to store bill-related data, so it is often used for reporting/analytics purposes.
Please let me know the best approach to handle this large collection:
1) Can I split off and archive the old data (say, older than 12 months)? But the old data is still required for analytic reports; I want to query it to show sales comparisons for the past 2 years.
2) Can I move the old data (12 months' worth) into a new collection, creating a new collection every 12 months? For report generation I would then have to query across all of these collections, so would this cause performance problems?
3) Should I go for sharding?
There are many variables to account for, the clearest being what hardware you use, how the data is structured, and how it is queried. A distributed network ought to be able to chew through your data faster than a single machine, but before diving into that solution I recommend generating an absurd amount of mock data comparable to what you are expecting, and then testing various approaches. Seriously. Create a bunch of data, and try to break things. It's fun! Soon enough you'll know more about what your problem requires than any website could tell you.
As for direct responses:
Perhaps, before archiving the data, appropriate summary statistics can be generated (or updated). Those summaries/simplifications can then be used for sales comparisons without reloading all of the archived data they represent (a sketch of this idea follows after these points).
This strikes me as sensible. By splitting up the sales data, you have more control over how much data needs to be accessed. After all, a user won't always wish to see 3 years of data; they may only wish to see last week's.
Move to sharding when you actually need it. As is stated on the MongoDB site:
Converting an unsharded database to a sharded cluster is easy and seamless, so there is little advantage in configuring sharding while your data set is small.
You'll know it's time when your memory-map approaches the server's RAM limit. MongoDB supports reading and writing to databases too large to keep in memory, but I'm sure you already know that is SLOW.
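
On the first point, the pre-aggregation might look something like the following pymongo sketch. It assumes a bills collection whose documents carry date, product, and amount fields and writes the rollups to a monthly_sales_summary collection; all of those names, and the use of $merge (MongoDB 4.2+), are assumptions for illustration rather than details from the question.

# Roll one month of bills up into a small, report-friendly summary collection.
from datetime import datetime
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["billing"]  # placeholder names

def summarize_month(year, month):
    start = datetime(year, month, 1)
    end = datetime(year + 1, 1, 1) if month == 12 else datetime(year, month + 1, 1)
    db.bills.aggregate([
        {"$match": {"date": {"$gte": start, "$lt": end}}},
        {"$group": {
            "_id": {"year": year, "month": month, "product": "$product"},
            "total_sales": {"$sum": "$amount"},
            "bill_count": {"$sum": 1},
        }},
        # Refresh the rollup in place so reports never have to touch raw (or archived) bills.
        {"$merge": {"into": "monthly_sales_summary", "whenMatched": "replace"}},
    ])

A two-year sales comparison then reads 24 summary documents per product instead of scanning every archived bill.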

Why does MongoDB take up so much space?

I am trying to store records with a set of doubles and ints (around 15-20 fields) in MongoDB. The records mostly (99.99%) have the same structure.
When I store the data in ROOT, which is a very structured data storage format, the file is around 2.5 GB for 22.5 million records. In MongoDB, however, the database size (from the command show dbs) is around 21 GB, whereas the data size (from db.collection.stats()) is around 13 GB.
This is a huge overhead (to clarify: 13 GB vs 2.5 GB, I'm not even talking about the 21 GB), and I guess it is because Mongo stores both keys and values in every document. So the question is: why and how could Mongo do a better job of making it smaller?
But the main question is, what is the performance impact of this? I have 4 indexes and they come to about 3 GB, so running the server on a single 8 GB machine could become a problem if I double the amount of data and try to keep a large working set in memory.
Any guesses as to whether I should be using SQL or some other DB? Or maybe just keep working with ROOT files, if anyone has tried them?
Basically, this is Mongo preparing for the insertion of data. Mongo preallocates storage for data to prevent (or minimize) fragmentation on disk. This preallocation shows up as the data files that the mongod instance creates.
First it creates a 64 MB file, then 128 MB, and so on, doubling each time until it reaches 2 GB files (the maximum size of a preallocated data file).
There are other things Mongo does that can use more disk space, such as journaling...
For much, much more info on how MongoDB uses storage space, you can take a look at this page, in particular the section titled "Why are the files in my data directory larger than the data in my database?"
There are some things you can do to minimize the space that is used, but these techniques (such as using the --smallfiles option) are usually only recommended for development and testing, never for production.
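
To see where the space goes on your own deployment, it helps to put the logical data size, the allocated storage, and the index size side by side. A minimal pymongo sketch follows; the connection string, database name, and collection name are placeholders.

# Compare document bytes vs. allocated storage vs. index size for a database.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]  # placeholder names

dbstats = db.command("dbstats")
collstats = db.command("collstats", "records")  # placeholder collection name

print("dataSize    :", dbstats["dataSize"])      # bytes of the documents themselves
print("storageSize :", dbstats["storageSize"])   # space allocated to hold them (preallocation, padding)
print("indexSize   :", dbstats["indexSize"])     # space used by all indexes
print("collection storageSize:", collstats["storageSize"])

The gap between dataSize and storageSize is the preallocation and padding described above; the gap between the 2.5 GB ROOT file and dataSize is largely the field names and BSON type information repeated in every document.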
Question: Should you use SQL or MongoDB?
Answer: It depends.
A better way to ask the question: Should you use a relational database or a document database?
Answer:
If your data is highly structured (every row has the same fields), or you rely heavily on foreign keys and you need strong transactional integrity on operations that use those related records... use a relational database.
If your records are heterogeneous (different fields per document) or have variable length fields (arrays) or have embedded documents (hierarchical)... use a document database.
My current software project uses both. Use the right tool for the job!

Is there a way to configure Heroku PostgreSQL to not bother loading a particular column into RAM?

This may be a long shot, but I thought I'd ask anyway.
I am looking at using Heroku's new Crane Postgres DB (400 MB RAM cache) in conjunction with an app I'm deploying on Heroku. The 400 MB cache size should be plenty for our needs... except for one column of one table, in which we store a cached PDF file as a string. The PDFs could easily use up the 400 MB of RAM pretty quickly if Heroku's cache holds them.
If I were on an actual server, I'd just store the PDF as a file, but given Heroku's ephemeral file system, my life is much simpler if I just store the PDF in the DB rather than rigging up a connection to S3 just for this one thing. (It is further complicated by the fact that we're looking at deploying multiple Heroku instances, one for each client... so using the DBs is simpler than creating a new bucket for each one.) I don't really care about speed here. If people are getting the file, they will expect speeds as if it were coming from a file system anyhow, since that's how most file downloads are done. Is there any way to tell Postgres not to bother caching this column?
Or maybe I'm asking the wrong question, and there is some other way to solve the problem or design alternatives that make it irrelevant.
You don't have to do anything. PostgreSQL will automatically use TOAST on values larger than 8 kB.
From http://www.postgresql.org/docs/9.1/static/storage-toast.html
PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only small impact on most of the backend code. The technique is affectionately known as TOAST (or "the best thing since sliced bread").
PostgreSQL caching is also done at the page level so TOAST does not have to be cached with the rest of the row (http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf).
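
If you want to check how much of the table actually lives out-of-line (and therefore only gets pulled into cache when a PDF is requested), you can compare relation sizes. A small psycopg2 sketch follows; the connection string and the documents table name are placeholders.

# How much of the table is in the main heap vs. TOAST + indexes?
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()
cur.execute("""
    SELECT pg_size_pretty(pg_relation_size('documents'))       AS main_heap,
           pg_size_pretty(pg_total_relation_size('documents')
                          - pg_relation_size('documents'))     AS toast_and_indexes
""")
print(cur.fetchone())

If the PDF column dominates the table, most of its bytes show up in the second number (which also includes indexes); those TOAST pages are only read when the PDF column is actually selected.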
The fact that Postgres can TOAST large field values doesn't mean it's the best thing to do.
If you store big fields in your main database, it will make many things harder, such as creating forks or followers, and creating and restoring backups in particular. I would strongly suggest storing the PDF files in S3 instead, and simply investing in automated onboarding of new clients (create the Heroku app, provision the database, provision/create an S3 bucket).
I'm not quite sure how you're managing to store large PDFs, since Postgres imposes a maximum field size (or at least a maximum page size). However, you might be able to get around this by using TOAST. TOASTed items are stored in a separate (physical) table, so if you're not selecting them frequently they shouldn't be cached.
If you are selecting them frequently, then I'm not sure if what you want is possible. Remember that Postgres only supplies one "level" of caching - the Linux VFS does caching also.