TorQ: How to update disk database populated with .loader.loadallfiles? - kdb

I populate a disk database from large CSV files using TorQ's .loader.loadallfiles in a cumulative fashion, and it works great. However, I now need to also append data coming from a streaming source, and I'm not sure of the best way to go about it.
I know how to update or append data to the in-memory database. However, I do not know what API exists to consistently bring those delta updates into the disk database previously populated with .loader.loadallfiles.
I call .loader.loadallfiles like so:
rawdatadir:hsym `$("" sv (getenv[`KDBRAWDATA]; "fwdcurve"));
.loader.loadallfiles[`headers`types`separator`tablename`dbdir`partitioncol`partitiontype!(`date`ccypair`ftype;"ZSS";enlist ",";`fwdcurve;target;`date;`month); rawdatadir];

The best idea, as Jonathon commented, is to maintain an RDB (real-time database) to buffer the data from your streaming source. When kdb+ saves data to disk it saves entire columns in one go, so given 1000 records with 5 columns it is better to ask it to save 5 lists of 1000 entries each than to save 5 columns one entry at a time, 1000 times.
To illustrate the amount of time this takes, suppose I have two on-disk lists, x and y.
Upserting 10000 elements at once is very fast
q)\t `:x upsert 10000#1
0
Doing them one at a time is much slower
q)\t:10000 `:y upsert 1
126
It might be worth looking into using the full TorQ framework. It's designed specifically for this kind of situation, has RDB and HDB functionality, and can be found here: http://aquaqanalytics.github.io/TorQ/
If you wish to append data as you describe, there currently isn't a dedicated API for it. What you can do is modify the RDB or WDB process to append to the existing database. Using .loader.writedatapartition followed by a call to .loader.finish should help.
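Outside of the loader API, one hedged option is to append the buffered rows directly to the relevant month partition with upsert and .Q.en. A minimal sketch, assuming `buffer` is an in-memory table of new fwdcurve rows (a name made up here) and `target` is the dbdir from the call above:
mnth:`$string `month$exec first date from buffer;   / month partition the buffered rows belong to
dir:` sv (target; mnth; `fwdcurve; `);              / e.g. `:/path/to/hdb/2015.01/fwdcurve/
dir upsert .Q.en[target] buffer;                    / enumerate syms against the HDB, then append the rows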

Related

What are some of the most efficient workflows for processing "big data" (250+ GB) from postgreSQL databases?

I am constructing a script that will process well over 250 GB of data from a single PostgreSQL table. The table's shape is roughly 150 columns x 74M rows. My goal is to somehow sift through all the data and make sure that each cell entry meets certain criteria that I will be tasked with defining. After the data has been processed I want to pipeline it into an AWS instance. Here are some scenarios I will need to consider:
How can I ensure that each cell entry meets certain criteria of the column it resides in? For example, all entries in the 'Date' column should be in the format 'yyyy-mm-dd', etc.
What tools/languages are best for handling such large data? I use Python and the Pandas module often for DataFrame manipulation, and am aware of the read_sql function, but I think that this much data will simply take too long to process in Python.
I know how to manually process the data chunk-by-chunk in Python, however I think that this is probably too inefficient and the script could take well over 12 hours.
Simply put (TL;DR): I'm looking for a simple, streamlined solution for manipulating and performing QC analysis on PostgreSQL data.

Create HDB from Existing Tables

I have numerous tables stored in memory in kdb. I am hoping to create an HDB from these tables so I can free up memory. I am a bit confused about the process of creating an HDB - splaying tables, etc. Can someone help me with the process of creating an HDB, and then what needs to be done moving forward - i.e. to upload whatever new data I have at end of day?
Thanks.
There are many ways to create an HDB depending on the scenario. General practices are:
For small tables, just write them as flat/serialised files using
`:/path/to/dbroot/flat set inMemTable;
or
`:/path/to/dbroot/flat upsert inMemTable;
The latter will add new rows while the former overwrites. However, since you're trying to free up memory, flat/serialised files won't be all that useful, as they get pulled into memory in full anyway.
For larger tables (tens of millions of rows) that aren't growing too much on a daily basis, you can splay them using set along with .Q.en (enumeration is required when the table is not saved flat/serialised):
`:/path/to/dbroot/splay/ set .Q.en[`:/path/to/dbroot] inMemTable;
or
`:/path/to/dbroot/splay/ upsert .Q.en[`:/path/to/dbroot] inMemTable;
again depending on whether you want to overwrite or add new rows.
For tables that grow on a daily basis and have a natural date separation, you would write them as a date-partitioned table. While you can also use set and .Q.en for date-partitioned tables (since they are the same as splayed tables, just separated into physical date directories), the easier method might be to use .Q.dpft or dsave if you're using a recent version of kdb+. These will do a lot of the work for you.
It's up to you then to maintain the tables, ensure the savedowns occur on a regular basis (usually daily), append to tables if necessary, and so on.
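As a rough sketch of the partitioned route, an end-of-day savedown with .Q.dpft could look like this (the dbroot path, trade table and sym column are assumed names for illustration):
.Q.dpft[`:/path/to/dbroot; .z.D; `sym; `trade];   / enumerate, splay and apply `p#sym under today's date partition
delete trade from `.;                             / drop the in-memory copy to free memory
.Q.gc[];                                          / return the freed memory to the OS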

Performance improvement for fetching records from a Table of 10 million records in Postgres DB

I have an analytics table that contains 10 million records, and to produce charts I have to fetch records from it. Several other tables are also joined to this table when the data is fetched. Currently it takes around 10 minutes even though I have indexed the joined columns and used materialized views in Postgres, and even the select query against the materialized view takes 5 minutes.
Please suggest some techniques to get the result within 5 seconds. I don't want to change the DB storage structure, as too many code changes would be needed to support it. I would like to know if there are some built-in methods for query speed improvement.
Thanks in advance.
In general you can take care of this issue by creating a better data structure (most engines do this to an extent for you with keys).
If you create a sorting column of sorts and build a tree-like structure over it, you are left with a search cost on the order of N*log(N) rather than what you may be facing right now. This will ensure you always have a large speed-up in your searches.
This is in regards to binary trees, red-black trees and so on.
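The same principle is easy to see in q (the language used elsewhere on this page): marking a column as sorted lets the engine binary-search instead of scanning. A purely illustrative comparison, with made-up data:
q)t:([] id:asc 10000000?100000; v:10000000?1e)   / 10 million rows, id already ascending
q)\t select from t where id=12345                / linear scan, no attribute
q)update `s#id from `t                           / mark id as sorted, enabling binary search
q)\t select from t where id=12345                / same query, now much faster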
Another option for a speed-up may be to make use of something along the lines of Redis, i.e. a database caching layer.
For analytical workloads in the past I have also chosen to make use of technologies related to Hadoop, though that may be a larger migration in your case at this point.

historic data storage and retrieval

I am using a standard splayed format for my trade data, where I have a directory for each date and each column as a separate file within it.
I am reading from CSV files and storing the data using the code below. I am using the 32-bit trial version on 64-bit Windows 7.
readDat:{[x]
 trd:(types; enlist",") 0: x;              / read the csv file; types is the column type string for this data
 trd:`sym`time`trdId xasc trd;
 /trd:update `g#sym from trd;
 trade::trd;
 .Q.dpft[`:/kdb/ndb; dt; `sym; `trade];    / dt is the date this file belongs to
 .Q.gc[];
 };
\t readDat each 50#dtlist
I have tried both with and without the `g#sym attribute. The data typically has 1.5MM rows per date. The select time is 0.5 to 1 second for a day.
Is there a way to improve the times for either of the below queries?
\t select from trade where date=x
\t select from trade where date=x, sym=y
I have read the docs on segmentation, partitioning, etc., but I'm not sure if anything would help here.
On second thought, would creating a table for each sym speed things up? I am trying that out, but wanted to know if there are memory/space trade-offs I should be aware of.
Have you done any profiling to see what the actual bottleneck is? If you find that the problem is disk read speed (check using something like iostat) you can either get a faster disk (SSD), get more memory (for a bigger disk cache), or use par.txt to shard your database across multiple disks so that the query runs on multiple disks and cores in parallel.
As you are using .Q.dpft, you are already partitioning your DB. If your use case is always to pass one date in your queries, then segmenting by date will not provide any performance improvement. You could possibly segment by symbol range (see here), although this is not something I've ever tried.
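For reference, segmenting/sharding is driven by a par.txt file in the database root which simply lists the directories that hold the partitions, one per line; something like the following (paths are purely illustrative):
/disk1/ndbsegment
/disk2/ndbsegment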
One basic way to improve performance would be to select a subset of the columns. Do you really need to read all of the fields when querying? Depending on the width of your table this can have a large impact, as kdb can then ignore some column files completely.
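For example, with the columns from this question, a query like the one below only touches the sym, time and trdId files for that date (x and y as in the queries above):
q)\t select time, trdId from trade where date=x, sym=y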
Another way to improve performance would be to apply `u# to the sym file. This will speed up your second query as the lookup in the sym file will be faster, although this really depends on the size of your universe. I would imagine the benefit would be marginal in comparison to reducing the number of columns requested.
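A minimal sketch of applying the attribute, assuming the HDB root used above (back up the sym file first, as this rewrites the enumeration domain):
q)`:/kdb/ndb/sym set `u#get `:/kdb/ndb/sym   / re-save the sym list with the unique attribute applied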
As user1895961 mentioned, selecting only certain columns will be faster.
kdb splayed/partitioned tables are almost exactly just files on the filesystem; the smaller the files and the fewer you have to read, the faster it will be. The balance between the number of folders and the number of files is key. 1.5mln rows per partition is OK, but it is on the large side, so perhaps you might want to partition by something else.
You may also want to normalise your data, splitting it into multiple tables and using linked columns to join it back together on the fly. Linked columns, if set up correctly, can be very powerful and can help avoid reading too much data back from disk when filtering is added.
Also try converting your data to char instead of sym where it makes sense; I found big performance increases from doing so.

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns are numerical values that we need to aggregate on (avg, stddev, and use R to calculate statistics)
a few columns are "type_id" columns, since in a particular query we may want to only include certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since I can easily add/drop/replace partitions
databases are nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof of concept, I can see this will scale nicely, but I can also see that hadoop/hdfs has latency; I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users within 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute the summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2]- (E[X])^2)
This implies that you can get the stddev of the combined data A and B as sqrt(E[(AB)^2] - (E[AB])^2), where E[(AB)^2] is (sum(A^2) + sum(B^2)) / (|A| + |B|) and E[AB] is (sum(A) + sum(B)) / (|A| + |B|).
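As a small sketch of that idea in q (the language used elsewhere on this page; the function and the numbers are made up for illustration), keeping per-dataset sums, sums of squares and counts is enough to recover the combined mean and standard deviation on the fly:
/ s: per-dataset sums, sq: per-dataset sums of squares, n: per-dataset counts
combstat:{[s;sq;n]
 m:(sum s)%cnt:sum n;          / combined mean E[X]
 sd:sqrt((sum sq)%cnt)-m*m;    / combined stddev sqrt(E[X^2]-(E[X])^2)
 `avg`dev!(m;sd)}
q)combstat[22.0 18.0; 125.0 90.0; 5 4]   / two pre-aggregated datasets, purely illustrative numbers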
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is a serious problem without an immediate good solution in the open source space. In the commercial space, MPP databases like Greenplum/Netezza should do.
Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory; that should give an idea of what size of cluster you need.
If I understand you correctly, and you only need to aggregate on single columns at a time, you can store your data differently for better results.
In HBase that would look something like:
a table per data column in today's setup, plus another single table for the filtering fields (type_ids)
a row for each key in today's setup - you may want to think about how to incorporate your filter fields into the key for efficient filtering, otherwise you'd have to do a two-phase read
a column for each table in today's setup (i.e. a few thousand columns)
HBase doesn't mind if you add new columns, and it is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you get all the relevant values, from which you can compute avg. etc. quite easily.
You might want to use a plain old database for this. It doesn't sound like you have a transactional system, so you can probably use just one or two large tables. SQL has problems when you need to join over large data, but since your data set doesn't sound like it needs joins, you should be fine. You can set up the indexes to find the data sets and then do the math either in SQL or in the application.