I am learning how to use kdb at the moment.
My question is how do I create a new directory inside the directory that I start q from.
This is to resolve the problem of not having enough RAM to store large financial data sets.
One way of getting around the problem of not having enough memory is writing your data sets to disk as partitioned tables, and then mapping them back into memory. This will allow you to query them whilst only loading the relevant columns and partitions into memory.
If you're feeling relatively comfortable with q you could read this chapter from the book Q for Mortals, which explains how to work with large datasets.
Related
I am trying to store records with a set of doubles and ints (around 15-20) in mongoDB. The records mostly (99.99%) have the same structure.
When I store the data in a root which is a very structured data storing format, the file is around 2.5GB for 22.5 Million records. For Mongo, however, the database size (from command show dbs) is around 21GB, whereas the data size (from db.collection.stats()) is around 13GB.
This is a huge overhead (Clarify: 13GB vs 2.5GB, I'm not even talking about the 21GB), and I guess it is because it stores both keys and values. So the question is, why and how Mongo doesn't do a better job in making it smaller?
But the main question is, what is the performance impact in this? I have 4 indexes and they come out to be 3GB, so running the server on a single 8GB machine can become a problem if I double the amount of data and try to keep a large working set in memory.
Any guesses into if I should be using SQL or some other DB? or maybe just keep working with ROOT files if anyone has tried them?
Basically, this is mongo preparing for the insertion of data. Mongo performs prealocation of storage for data to prevent (or minimize) fragmentation on the disk. This prealocation is observed in the form of a file that the mongod instance creates.
First it creates a 64MB file, next 128MB, next 512MB, and on and on until it reaches files of 2GB (the maximum size of prealocated data files).
There are some more things that mongo does that might be suspect to using more disk space, things like journaling...
For much, much more info on how mongoDB uses storage space, you can take a look at this page and in specific the section titled Why are the files in my data directory larger than the data in my database?
There are some things that you can do to minimize the space that is used, but these tequniques (such as using the --smallfiles option) are usually only recommended for development and testing use - never for production.
Question: Should you use SQL or MongoDB?
Answer: It depends.
Better way to ask the question: Should you use use a relational database or a document database?
Answer:
If your data is highly structured (every row has the same fields), or you rely heavily on foreign keys and you need strong transactional integrity on operations that use those related records... use a relational database.
If your records are heterogeneous (different fields per document) or have variable length fields (arrays) or have embedded documents (hierarchical)... use a document database.
My current software project uses both. Use the right tool for the job!
I am using a standard splayed format for my trade data where i have directories for each date and each column as separate file in there.
I am reading from csv files and storing using the below code. I am using the trial version 32 bit on win 7, 64 bit.
readDat: {[x]
tmp: read data from csv file(x)
tmp: `sym`time`trdId xasc tmp;
/trd: update `g#sym from trd;
trade:: trd;
.Q.dpft[`:/kdb/ndb; dt; `sym; `trade];
.Q.gc[];
};
\t readDat each 50#dtlist
I have tried both using the `g#sym and without it. Data has typically 1.5MM rows per date. select time for this is from 0.5 to 1 second for a day
Is there a way to improve times for either of the below queries.
\t select from trade where date=x
\t select from trade where date=x, sym=y
I have read the docs on segmentation, partitioning etc. but not sure if anything would help here.
On second thoughts, will creating a table for each sym speed up things? I am trying that out but wanted to know if there are memory/space tradeoffs i should be aware of.
Have you done any profiling to see what the actual bottleneck is? If you find that the problem has to do with disk read speed (using something like iostat) you can either get a faster disk (SSD), more memory (for bigger disk cache), or use par.txt to shard your database across multiple disks such that the query happens on multiple disks and cores in parallel.
As you are using .Q.dpft, you are already partitioning your DB. If your use case is always to pass one date in your queries, then segmenting by date will not provide any performance improvements. You could possibly segment by symbol range (see here), although this is never something I've tried.
One basic way to improve performance would be to select a subset of the columns. Do you really need to read all of the fields when querying? Depending on the width of your table this can have a large impact as it now can ignore some files completely.
Another way to improve performance would be to apply `u# to the sym file. This will speed up your second query as the look up on the sym file will be faster. Although this really depends on the size of your universe. The benefit of this would be marginal in comparison to reducing the number of columns requested I would imagine.
As user1895961 mentioned, selecting only certain columns will be faster.
KDB splayed\partitioned tables are almost exactly just files on the filesystem, the smaller the files and the fewer you have to read, the faster it will be. The balance between the number of folders and the number of files is key. 1.5mln per partition is ok, but is on the large side. Perhaps you might want to partition by something else.
You may also want to normalise you data, splitting it into multiple tables and using linked columns to join it back again on the fly. Linked columns, if set up correctly, can be very powerful and can help avoid reading too much data back from disk if filtering is added in.
Also try converting your data to char instead of sym, i found big performance increases from doing so.
i'm trying to find the best solution to create scalable storage for big files. File size can vary from 1-2 megabytes and up to 500-600 gigabytes.
I have found some information about Hadoop and it's HDFS, but it looks a little bit complicated, because i don't need any Map/Reduce jobs and many other features. Now i'm thinking to use MongoDB and it's GridFS as file storage solution.
And now the questions:
What will happen with gridfs when i try to write few files
concurrently. Will there be any lock for read/write operations? (I will use it only as file storage)
Will files from gridfs be cached in ram and how it will affect read-write perfomance?
Maybe there are some other solutions that can solve my problem more efficiently?
Thanks.
I can only answer for MongoDB here, I will not pretend I know much about HDFS and other such technologies.
The GridFs implementation is totally client side within the driver itself. This means there is no special loading or understanding of the context of file serving within MongoDB itself, effectively MongoDB itself does not even understand they are files ( http://docs.mongodb.org/manual/applications/gridfs/ ).
This means that querying for any part of the files or chunks collection will result in the same process as it would for any other query, whereby it loads the data it needs into your working set ( http://en.wikipedia.org/wiki/Working_set ) which represents a set of data (or all loaded data at that time) required by MongoDB within a given time frame to maintain optimal performance. It does this by paging it into RAM (well technically the OS does).
Another point to take into consideration is that this is driver implemented. This means that the specification can vary, however, I don't think it does. All drivers will allow you to query for a set of documents from the files collection which only houses the files meta data allowing you to later serve the file itself from the chunks collection with a single query.
However that is not the important thing, you want to serve the file itself, including its data; this means that you will be loading the files collection and its subsequent chunks collection into your working set.
With that in mind we have already hit the first snag:
Will files from gridfs be cached in ram and how it will affect read-write perfomance?
The read performance of small files could be awesome, directly from RAM; the writes would be just as good.
For larger files, not so. Most computers will not have 600 GB of RAM and it is likely, quite normal in fact, to house a 600 GB partition of a single file on a single mongod instance. This creates a problem since that file, in order to be served, needs to fit into your working set however it is impossibly bigger than your RAM; at this point you could have page thrashing ( http://en.wikipedia.org/wiki/Thrashing_%28computer_science%29 ) whereby the server is just page faulting 24/7 trying to load the file. The writes here are no better as well.
The only way around this is to starting putting a single file across many shards :\.
Note: one more thing to consider is that the default average size of a chunks "chunk" is 256KB, so that's a lot of documents for a 600GB file. This setting is manipulatable in most drivers.
What will happen with gridfs when i try to write few files concurrently. Will there be any lock for read/write operations? (I will use it only as file storage)
GridFS, being only a specification uses the same locks as on any other collection, both read and write locks on a database level (2.2+) or on a global level (pre-2.2). The two do interfere with each other as well, i.e. how can you ensure a consistent read of a document that is being written to?
That being said the possibility for contention exists based on your scenario specifics, traffic, number of concurrent writes/reads and many other things we have no idea about.
Maybe there are some other solutions that can solve my problem more efficiently?
I personally have found that S3 (as #mluggy said) in reduced redundancy format works best storing a mere portion of meta data about the file within MongoDB, much like using GridFS but without the chunks collection, let S3 handle all that distribution, backup and other stuff for you.
Hopefully I have been clear, hope it helps.
Edit: Unlike what I accidently said, MongoDB does not have a collection level lock, it is a database level lock.
Have you considered saving meta data onto MongoDB and writing actual files to Amazon S3? Both have excellent drivers and the latter is highly redundant, cloud/cdn-ready file storage. I would give it a shot.
I'll start by answering the first two:
There is a write lock when writing in to GridFS, yes. No lock for reads.
The files wont be cached in memory when you query them, but their metadata will.
GridFS may not be the best solution for your problem. Write locks can become something of a pain when you're dealing with this type of situation, particularly for huge files. There are other databases out there that may solve this problem for you. HDFS is a good choice, but as you say, it is very complicated. I would recommend considering a storage mechanism like Riak or Amazon's S3. They're more oriented around being storage for files, and don't end up with major drawbacks. S3 and Riak both have excellent admin facilities, and can handle huge files. Though with Riak, last I knew, you had to do some file chunking to store files over 100mb. Despite that, it generally is a best practice to do some level of chunking for huge file sizes. There are a lot of bad things that can happen when transferring files in to DBs- From network time outs, to buffer overflows, etc. Either way, your solution is going to require a fair amount of tuning for massive file sizes.
Context:
I want to store some temporary results in some temporary tables. These tables may be reused in several queries that may occur close in time, but at some point the evolutionary algorithm I'm using may not need some old tables any more and keep generating new tables. There will be several queries, possibly concurrently, using those tables. Only one user doing all those queries. I don't know if that clarifies everything about sessions and so on, I'm still uncertain about how that works.
Objective:
What I would like to do is to create temporary tables (if they don't exist already), store them on memory as far as that is possible and if at some point there is not enough memory, delete those that would be committed to the HDD (I guess those will be the least recently used).
Examples:
The client will be doing queries for EMAs with different parameters and an aggregation of them with different coefficients, each individual may vary in terms of the coefficients used and so the parameters for the EMAs may repeat as they are still in the gene pool, and may not be needed after a while. There will be similar queries with more parameters and the genetic algorithm will find the right values for the parameters.
Questions:
Is that what "on commit drop" means? I've seen descriptions about
sessions and transactions but I don't really understand those
concepts. Sorry if the question is stupid.
If it is not, do you know about any simple way to get Postgres to do
this?
Workaround:
In the worst case I should be able to make a guesstimation about how many tables I can keep on memory and try to implement the LRU by myself, but it's never going to be as good as what Postgres could do.
Thank you very much.
This is a complicated topic and probably one to discuss in some depth. I think it is worth both explaining why PostgreSQL doesn't support this and also what you can do instead with recent versions to approach what you are trying to do.
PostgreSQL has a pretty good approach to caching diverse data sets across multiple users. In general you don't want to allow a programmer to specify that a temporary table must be kept in memory if it becomes very large. Temporary tables however are managed quite differently from normal tables in that they are:
Buffered by the individual back-end, not the shared buffers
Locally visible only, and
Unlogged.
What this means is that typically you aren't generating a lot of disk I/O for temporary tables. The tables do not normally flush WAL segments, and they are managed by the local back-end so they don't affect shared buffer usage. This means that only occasionally is data going to be written to disk and only when necessary to free memory for other (usually more frequent) tasks. You certainly aren't forcing disk writes and only need disk reads when something else has used up memory.
The end result is that you don't really need to worry about this. PostgreSQL already tries, to a certain extent, to do what you are asking it to do, and temporary tables have much lower disk I/O requirements than standard tables do. It does not force the tables to stay in memory though and if they become large enough, the pages may expire into the OS disk cache, and eventually on to disk. This is an important feature because it ensures that performance gracefully degrades when many people create many large temporary tables.
We're currently using Postgres 9 on Amazon's EC2 and are very satisfied with the performance. Now we're looking at adding ~2TB of data to Postgres, which is larger than our EC2 small instance can hold.
I found S3QL and am considering using it in conjunction with moving the Postgres data directory to S3 storage. Has anyone had experience with doing this? I'm mainly concerned with performance (frequent reads, less frequent writes). Any advice is welcome, thanks.
My advice is "don't do that". I don't know anything about the context of your problem, but I guess that a solution doesn't have to involve doing bulk data processing through PostgreSQL. The whole reason grid processing systems were invented was to solve the problem of analyzing large data sets. I think you should consider building a system that follows standard BI practices around extracting dimensional data. Then take that normalized data and, assuming it's still pretty large, load it into Hadoop/Pig. Do your analysis and aggregation there. Dump the resulting aggregate data into a file and load that into your PG database along-side the dimensions.