Cannot allocate memory for a column of compound floats on a partitioned table - kdb

I have a partitioned table in my hdb that includes a column containing large lists of floats (at most 400 floats per element); e.g. each element looks like
(100.0 1.0 ...)
When trying to select on this column from days where there are particularly high numbers of rows I get an error saying
'./2015.02.07/table/column# Cannot allocate memory
The same error arises from a query like:
select column[;0] from table where date=2015.02.07
even though on days with fewer rows this query returns the first value of each element in the column.
Is there a way to stream this column in a select to decrease the memory requirements of holding the whole column in memory for a large day?
EDIT
.Q.ind on large days fails with the same error,
i.e. given that I can work with 2015.02.01 but not 2015.02.02:
.Q.ind[select from table where date=2015.02.01;enlist 1]
is fine but
.Q.ind[select from table where date=2015.02.02;enlist 1]
fails with
{0!$[#.Q.pm;p3;(?).]#[x;0;p1[;y;z]]}
'./2015.02.10/table/column2#: Cannot allocate memory
#
.[?]
(+`time`sym`column1`column2!`:./2015.02.02/table;();0b;())
I should note I am using the free 32-bit version

I think this is all just a combination of the free 32-bit memory limitation, the fact that your row counts are possibly large, and the fact that (unavoidably) something must be pulled entirely into memory when retrieving data from a column - whether it's the column itself that gets pulled in entirely (in the non-nested case) or the nested-index column that gets pulled in entirely.
Another thing to consider is that kdb uses powers-of-two (buddy) memory allocation. Even if today's table contains only one more row than yesterday's, the memory requirements per column could double. Take a simple example:
In the free 32-bit version (Windows) you can create this many floats and it only uses ~1.07GB of memory:
q)\ts 134217726?1.0
3093 1073741952
However, try to generate one extra float and you hit a memory limit
q)\ts 134217727?1.0
wsfull
So even a small difference in row count between one day and the next can be very significant if you're near the boundary of an allocatable power of two.
--DISCLAIMER-- the following is hacky and is only intended for debugging!
You can actually manually try to access the data from the nested list, though you may still have memory issues here anyway.
Create a nested table and splay it
q)tab:([] col1:(101 102 103f;104 105f;106 107 108 109 110f;111 112f))
q)tab
col1
--------------------
101 102 103f
104 105f
106 107 108 109 110f
111 112f
q)
q)`:test/ set tab
`:test/
You can try to read in the indices from the nested-index file
q)2_first (enlist "j";enlist 8)1:`:test/col1
3 5 10 12
So the indices for splitting the full list of floats (the col1# file) are 3, 5, 10, etc.
Say I want the first 3 rows
q)myrows:3#2_first (enlist "j";enlist 8)1:`:test/col1
q)myrows
3 5 10
Then I know that I need the first 10 floats from the col1# file, split at indices 3 and 5. So I can read the col1# file partially and split it correctly:
q)(0,-1_myrows) cut raze (enlist "f";enlist 8)1:(`$":test/col1#";0;8*last myrows)
101 102 103f
104 105f
106 107 108 109 110f
But this is precisely what KDB does under the covers anyway so I suspect that you'll still have trouble even reading in the nested-index file in the first place.
Check this debug/hack and see if you can partially read that way. But obviously it's not a long-term solution!

Nested columns make querying in the usual way difficult, as the # file also needs to be loaded into memory (even with a [;0])
Your best bet is to map a date partition in with a select, and then query that chunk by chunk, e.g. a million rows at a time (or whatever is sensible given the size of the nested floats).
Perhaps also consider 32-bit floats, if some decimal accuracy can be sacrificed.
EDIT
So after the comments I guess the best way is to go through each partition a chunk of rows at a time with .Q.ind.
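Something along these lines, for example (a rough sketch only: the 1m chunk size and the per-chunk function f are placeholders to adapt, and it leans on the standard .Q.pv/.Q.cn/.Q.ind utilities for the partition offsets):
d:2015.02.02
c:.Q.cn table                / per-partition row counts
o:sum c til .Q.pv?d          / global row offset of the first row of date d
n:c .Q.pv?d                  / number of rows on date d
sz:1000000                   / chunk size - tune to available memory
f:{count x}                  / placeholder: whatever processing you need per chunk
r:{f .Q.ind[table;o+(x*sz)+til sz&n-x*sz]} each til ceiling n%sz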

Just to give my 2 cents on this, I had a similar error but with a 64-bit instance.
I suspected that the memory needed to be defragmented as the instance had been running for almost a year.
Bouncing the instance solved the issue and released a lot of virtual memory.

Related

Attributes internal working in aj for performance benefits in kdb

Considering the trade table 't' and quotes table 'q' in memory:
q)t:([] sym:`GOOG`AMZN`GOOG`AMZN; time:10:01 10:02 10:02 10:03; px:10 20 11 19)
q)q:([] sym:`GOOG`AMZN`AMZN`GOOG`AMZN; time:10:01 10:01 10:02 10:02 10:03; vol:100 200 210 110 220)
To get the performance benefit, I apply the grouped attribute to the 'sym' column of the q table and make the 'time' column sorted within sym, giving q1.
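For example, q1 can be built from q along these lines:
q)/ one way to build q1: sort by sym,time then apply the grouped attribute to sym
q)q1:update `g#sym from `sym`time xasc q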
Using this, I can clearly see the performance benefits from it:
q)\t:1000000 aj[`sym`time;t;q]
9573
q)\t:1000000 aj[`sym`time;t;q1]
8761
q)\t:100000 aj[`sym`time;t;q]
968
q)\t:100000 aj[`sym`time;t;q1]
893
And in large tables the performance is far better.
Now, I'm trying to understand how it works internally when we apply the grouped attribute to the sym column and sort time within sym.
My understanding of how the aj should work internally is below; can someone please confirm or correct it?
* Since the grouped attribute is applied on sym, it creates a hash table for table q1; and since time is sorted within sym, the internal q1 table might look like:
GOOG|(10:01;10:02)|(100;110)
AMZN|(10:01;10:02;10:03)|(200;210;220)
So in the case of q1, if the interpreter has to join (AMZN;10:02) from table 't', it will find it directly in q1's hash table in less time; but to join the same value (AMZN;10:02) of table 't' against table 'q', the interpreter has to search linearly through table 'q', hence taking more time.
I believe you're on the right track, though we can't know for sure as we don't have access to the kdb source code to see precisely what it does.
If you look at the definition of aj you'll see that it's based on bin:
q)aj
k){.Q.ft[{d:x_z;$[&/j:-1<i:(x#z)bin x#y;y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y}
specifically,
(`sym`time#q)bin `sym`time#t
and the bin documentation provides some more details on how bin behaves: https://code.kx.com/q/ref/bin/
I believe in the two-column case it will first match on the sym column and then use bin on the second column. Like you said, the grouped attribute on sym speeds up the matching of syms and the sorting on time ensures that bin returns the correct results. Note that for on-disk queries it's better to put `p# on sym rather than `g#, as the parted attribute is optimal for matching/retrieving by sym from disk.
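For a date-partitioned quote table this is typically done per partition on disk, something like the following (the HDB path and table name here are only illustrative):
/ illustrative only - path and table name are examples
`sym xasc `:hdb/2015.01.01/quote     / sort the splayed partition on disk by sym
@[`:hdb/2015.01.01/quote;`sym;`p#]   / then apply the parted attribute to sym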

What's the impact of TOAST on performance? (adding a hundred varchar columns)

Consider a table with the following data:
id bigint Auto Increment
name character varying(255) NULL
category character varying(255) NULL
english character varying(255) NULL
french character varying(255) NULL
pivot character varying(255) NULL
credits character varying(255) NULL
hash character varying(20) NULL
The english column contains data of the following size (in bytes): max 116, min 5, average 42, median: 40.
The number of rows in the table is around 30,000 and will hardly change.
The new 107 columns will be translations of the English.
Will adding 107 columns hurt performance?
The Postgres site says the maximum number of columns on a Postgres table is
250-1600 depending on column types
and
The maximum number of columns for a table is further reduced as the tuple being stored must fit in a single 8192-byte heap page
Will the data fall under that limit?
Size of the largest row
What is the actual storage size of the table's rows? pg_column_size is the
Number of bytes used to store a particular value (possibly compressed)
SELECT id, pg_column_size(t.*) FROM "table" AS t ORDER BY pg_column_size(t.*) DESC
-- Some stats derived from the query:
-- Min 87 bytes
-- Max 514 bytes
-- Average 216 bytes
-- Median: 209 bytes
But no compression is actually happening here, because:
When a row that is to be stored is "too wide" (the threshold for that is 2KB by default), the TOAST mechanism first attempts to compress any wide field values. If that isn't enough to get the row under 2KB, it breaks up the wide field values into chunks that get stored in the associated TOAST table. Each original field value is replaced by a small pointer that shows where to find this "out of line" data in the TOAST table. TOAST will attempt to squeeze the user-table row down to 2KB in this way, but as long as it can get below 8KB, that's good enough and the row can be stored successfully.
Compression would start to kick in once the table gets bigger and those new columns are added.
It's unclear to me what the compression ratio would be for such data?
I wonder how effective it'll be on lots of short multilingual sentences. Also, I tried to find the exact name of the compression algorithm used by Postgres: the docs say "the LZ family of compression techniques", but which one – LZ77? LZ78? A twist on one of them?
The best way to find out how much compression will achieve here is certainly to try… once I've got the translations. But I'd rather get an idea of it beforehand, as I won't get all the data at once.
TOAST'ed?
If the size of a row goes beyond the page size limit, then Postgres will rely on TOAST not just to compress but also to split the data for "out-of-line" storage.
I understand this will increase fetch times for those rows that don't fit… But what's the impact of TOAST on performance? Is it negligible for such a use case?
Bottom-line
At the end of the day…
Is adding those 107 columns a good idea, or should I use a different approach?
If fine, how important is it to be fetching only those columns the user needs? (No user will need all of them.)
Or am I approaching this the wrong way, i.e. is it a case of premature optimization where I'd have been better off just adding the columns and only investigate later if faced with problems?
Using Postgres 9.6. Upgrading is an option if needed.
The best way to find out how much compression will achieve here is certainly to try… once I've got the translations. But I'd rather get an idea of it beforehand, as I won't get all the data at once.
I'd just copy the English version into each of the 107 columns. That should be good enough to get some useful findings. You might worry that the repetition would cause the compression to be idiosyncratic; but each value is compressed in isolation so won't "know" it is identical to some other value.
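If it helps, one rough way to run that experiment (the table and column names below are placeholders; pg_total_relation_size includes the TOAST table):
-- sketch of the experiment; "source_table" and the added column names are placeholders
CREATE TABLE translation_width_test AS SELECT * FROM source_table;
ALTER TABLE translation_width_test
    ADD COLUMN german  varchar(255),
    ADD COLUMN spanish varchar(255);   -- ...and so on for the other languages
UPDATE translation_width_test SET german = english, spanish = english;
SELECT pg_size_pretty(pg_total_relation_size('translation_width_test'));  -- includes TOAST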
It's unclear to me what the compression ratio would be for such data?
Not very much. For example, the paragraph of yours I quoted first doesn't get any benefit from compression (when I copied it into 107 other columns). Short segments of ordinary text do not have enough repetition in them to be very compressible. Translating them to other languages is unlikely to change this.
If fine, how important is it to be fetching only those columns the user needs? (No user will need all of them.)
This question has a very clear answer. You should absolutely select only what you need. Assembling a row from 100+ toasted columns, just to throw most of them away, will slow you down.
I don't know if this falls under "premature optimization" so much as under poor design. One way or another you will need some method of knowing which of the 108 versions you need. But what happens when you need to add a 108th translation, or delete, say, the 93rd? So use this information to form a key to a translation table, something like Translation_Test (for_ref_in bigint, language text, translation text). Then access the necessary text (including perhaps the English version) from that table.
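A minimal sketch of that shape (the parent table name and its id column are assumptions):
-- sketch of the suggested normalized design; "parent_table" is a placeholder
CREATE TABLE translation_test (
    for_ref_in  bigint NOT NULL REFERENCES parent_table (id),
    language    text   NOT NULL,
    translation text,
    PRIMARY KEY (for_ref_in, language)
);

-- fetch only the translation a user actually needs
SELECT translation
FROM translation_test
WHERE for_ref_in = 42 AND language = 'french';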

kdb q - efficiently count tables in flatfiles

I have a lot of tables stored in flat files (in a directory called basepath) and I want to check their number of rows. The best I can do right now is:
c:([] filename:system "ls ",basepath;
tablesize:count each get each hsym `$basepath,/:system "ls ",basepath)
which loads each table entirely into memory and then performs the count (that's quite slow). Is saving as splayed tables the only way to make this faster (because I would only load 1 column and count that) or is there a trick in q that I can use?
Thanks for the help
If you have basepath defined as a string of the path to directory where all your flat tables are stored then you can create a dictionary of the row counts as follows:
q)cnt:{count get hsym x}
q)filename:key hsym `$basepath
q)filename!cnt each filename
t| 2
g| 3
This is where I have flat tables t and g saved in my basepath directory. This stops you from having to use system commands, which are often less efficient.
The function cnt takes the path of each flat table (as a symbol) and returns the number of rows without assigning the table to a variable.
The best solution, if you have control of the process of saving such files down, is to add an extra step of saving the meta information of the row count somewhere separate at the same time as saving the raw data. This would allow you to quickly access the table size from this file instead of reading the full table in each time.
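For example, a pair of helpers along these lines (the names and paths here are illustrative) could write the count next to the table at save time and read it back cheaply later:
/ illustrative sketch: save the table and its row count side by side
savetab:{[dir;nm;t]
  (hsym `$dir,"/",string nm) set t;                   / the table itself
  (hsym `$dir,"/",string[nm],".count") set count t }  / its row count
/ later, read just the tiny count file instead of the whole table
getcount:{[dir;nm] get hsym `$dir,"/",string[nm],".count"}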
However, note that to avoid pulling them into memory at all you would have to instead use read1 and look at the headers on the binary data. As you said it would be better to save as a splayed table and read in one column.
UPDATE: I would not recommend doing this and strongly suggest doing the above, but out of curiosity, after looking into read1, here's an example of what a hacky solution might look like:
f:{
  b:read1(y;0;x);
  if[not 0x62630b~b[2 4 5];'`$"not a table"];
  cc:first first((),"i";(),4)1:b 7+til 4;
  if[null ce:first where cc=sums 0x0=11 _ b;:.z.s[x*2;y]];
  c:`$"\000" vs "c"$b[11+til ce];
  n:first first((),"i";(),4)1:b[(20+ce)+til 4];
  :`columns`rows!(c;n);
  }[2000]
The q binary file format isn't documented anywhere; the only way to figure it out is to save different things and see how the bytes change. It's also subject to change between versions - the above is written for 3.5 and is probably valid for 3.0-3.5 only, not the latest 3.6 release or anything 2.X.
The given code works in the following way:
* reads a chunk from the front of the file
* validates that it looks like a flat unkeyed table (flip[98] of a dict[99] with symbol[11] keys)
* reads the count of symbols in the list of columns as a little-endian 4-byte int
* scans through the null-terminated strings for that many zero bytes
* if the columns are so numerous or verbose that we don't have them all in this chunk, it doubles the size of the chunk and tries again
* turns the strings into symbols
* using the offset we get from the end of the column list, skips a bit more of the header for the mixed list of columns
* then reads the count from the header of the first column
Hope this answers your question!
From experimenting with the binary files, it seems that the table count is saved as part of the binary file when you save down a flat file, taking up 4 bytes after the initial object type and column headings which will vary from table to table.
`:test set ([]a:1 2 3;b:4 5 6;c:7 8 9;aa:10 11 12;bb:13 14 15)
q)read1 `:test
0xff016200630b0005000000610062006300616100626200000005000000090003000000
bytes   | example                    | meaning
--------|----------------------------|------------------------------------------------------------
0 - 5   | 0xff016200630b             | object is a flat table
7 - 10  | 0x05000000                 | number of columns (5) as a little-endian int
11 - 22 | 0x610062006300616100626200 | the ascii bytes of each column name ("a","b","c","aa","bb"), each followed by a one-byte null separator
23 - 30 | 0x0000050000000900         | 8 bytes that can be skipped
31 - 34 | 0x03000000                 | 4 bytes for the row count of the first column (3)
This should help you understand the function that Fiona posted.
The binary is saved down little-endian, meaning the least-significant byte comes first and the most-significant byte is right-most - doing this in decimal for the number 100 would give 001, with the hundreds (most significant) on the right, then the tens and finally the units on the left. In the binary file, each group of 2 hex digits is a byte.
You can use 1: to read in the contents of a binary file, with additional arguments in the list specifying the offset - where to start reading from, and how many bytes to read. In our case we want to start at byte 31 and read in 4 bytes, specifying the output should be an integer and to cut the input into separate 4 byte chunks.
q)first first (enlist "i";enlist 4)1:(`:test;31;4)
3i
Converting the little-endian bytes into a long gives us the row count. Since this only has to read in 4 bytes instead of the whole file it is a lot quicker.
For a table with 10000 rows and 2 columns there is not much difference:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10000;31;4)
0
q)\t count get `:test10000
0
For a table with 100m rows and 2 columns:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10m;31;4)
0
q)\t count get `:test10m
2023
If you have a splayed table instead you can read in the number of elements in one of the columns from bytes 8-11 like so, assuming the column is a simple list:
q)first first (enlist "i";enlist 4)1:(`:a;8;4)
3i
You can read more about reading in from binary files here https://code.kx.com/q/ref/filenumbers/#1-binary-files
You can make what you currently have more efficient by using the following
counttables:{count each get each hsym `$x,/:system "ls ",x}
This will improve the speed of the count by dropping the extra directory listing and the table construction/join which you are currently doing - call it as counttables basepath. You are correct though that if the tables were saved splayed you would only have to read in the one column, making it much more efficient.
If your tables are stored uncompressed there's probably something quite hacky you could do with a read1 on the headers within the file until you find the first column header.
But v hacky :-(
Are you responsible for saving these down? Can you keep a running state as you do?

Orange3 Frequent Itemsets performance

I just realized that the performance of Frequent Itemsets is strongly correlated with the number of items per basket. I ran the following code:
from orangecontrib.associate.fpgrowth import *
%%time
T = [[1,3, 4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]]
itemsets = frequent_itemsets(T, 1)
a=list(itemsets)
As I increased the number of item in T the running time increased as follow:
Item running time
21 3.39s
22 9.14s
23 15.8s
24 37.4s
25 1.2 min
26 10 min
27 35 min
28 95 min
For 31 items it took 10 hours without returning any results. I am wondering if there is any way to run it for more than 31 items in reasonable time? In this case I just need pairwise itemsets (A-->B), while my understanding is that frequent_itemsets counts all possible combinations, which is probably why its running time is so strongly correlated with the number of items. Is there any way to tell the method to limit the size of the itemsets it counts, e.g. just pairs instead of all combinations?
You could use other software that allows you to specify constraints on the itemsets, such as a length constraint. For example, you could consider the SPMF data mining library (disclosure: I am the founder), which offers about 120 algorithms for itemset and pattern mining. It will let you use FPGrowth with a length constraint, so you could for example mine only the patterns with 2 or 3 items. You could also try other features such as mining association rules. The software works on text files, can be called from the command line, and is quite fast.
A database of one single transaction of 21 items results in 2097151 itemsets (2^21 - 1).
>>> T = [list(range(21))]
>>> len(list(frequent_itemsets(T, 1)))
2097151
Perhaps instead of an absolute support as low as a single transaction (1), choose the support to be e.g. 5% of all transactions (.05).
You should also limit the returned itemsets to contain exactly two items (antecedent and consequent for later association rule discovery), but the runtime will still be high due to, as you understand, sheer combinatorics.
len([itemset
     for itemset, support in frequent_itemsets(T, 1)
     if len(itemset) == 2])
At the moment, there is no such filtering available inside the algorithm, but the source is open to tinkering.
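Putting the two suggestions together (relative support plus a post-hoc length filter), a minimal sketch:
# minimal sketch combining both suggestions: 5% relative support,
# then keep only the 2-item sets for later rule discovery
from orangecontrib.associate.fpgrowth import frequent_itemsets

pairs = [(itemset, support)
         for itemset, support in frequent_itemsets(T, .05)
         if len(itemset) == 2]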

Spark window functions: how to implement complex logic with good performance and without looping

I have a data set that lends itself to window functions, 3M+ rows that once ranked can be partitioned into groups of ~20 or less rows. Here is a simplified example:
id date1 date2 type rank
171 20090601 20090601 attempt 1
171 20090701 20100331 trial_fail 2
171 20090901 20091101 attempt 3
171 20091101 20100201 attempt 4
171 20091201 20100401 attempt 5
171 20090601 20090601 fail 6
188 20100701 20100715 trial_fail 1
188 20100716 20100730 trial_success 2
188 20100731 20100814 trial_fail 3
188 20100901 20100901 attempt 4
188 20101001 20101001 success 5
The data is ranked by id and date1, and the window created with:
Window.partitionBy("id").orderBy("rank")
In this example the data has already been ranked by (id, date1). I could also work on the unranked data and rank it within Spark.
I need to implement some logic on these rows, for example, within a window:
1) Identify all rows that end during a failed trial (i.e. a row's date2 is between date1 and date2 of any previous row within the same window of type "trial_fail").
2) Identify all trials after a failed trial (i.e. any row with type "trial_fail" or "trial_success" after a row within the same window of type "trial_fail").
3) Identify all attempts before a successful attempt (i.e. any row with type "attempt" with date1 earlier than date1 of another later row of type "success").
The exact logic of these conditions is not important to my question (and there will be other, different conditions); what's important is that the logic depends on values in many rows of the window at once. This can't be handled by the simple Spark SQL functions like first, last, lag, lead, etc. and isn't as simple as the typical example of finding the largest/smallest 1 or n rows in the window.
What's also important is that the partitions don't depend on one another, so this seems like a great candidate for Spark to do in parallel: 3 million rows with 150,000 partitions of 20 rows each. In fact I wonder if this is too many partitions.
I can implement this with a loop something like (in pseudocode):
for i in 1..20:
for j in 1..20:
// compare window[j]'s type and dates to window[i]'s etc
// add a Y/N flag to the DF to identify target rows
This would require 400+ iterations (the choice of 20 for the max i and j is an educated guess based on the data set and could actually be larger), which seems needlessly brute force.
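Made concrete, the brute-force version might look roughly like this in PySpark (a sketch only: df is the ranked DataFrame above, only condition 1 is shown, and the flag name is made up):
# rough sketch of the brute-force idea: attach the whole window to each row
# with collect_list, then compare inside a UDF; implements condition 1 only
from pyspark.sql import functions as F, types as T
from pyspark.sql.window import Window

w = (Window.partitionBy("id").orderBy("rank")
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

@F.udf(T.BooleanType())
def ends_in_failed_trial(rank, date2, rows):
    # flag the current row if its date2 falls inside the [date1, date2] range
    # of any earlier trial_fail row in the same window
    return any(r["type"] == "trial_fail"
               and r["rank"] < rank
               and r["date1"] <= date2 <= r["date2"]
               for r in rows)

flagged = (df
    .withColumn("w_rows", F.collect_list(F.struct("rank", "type", "date1", "date2")).over(w))
    .withColumn("ends_in_failed_trial",
                ends_in_failed_trial(F.col("rank"), F.col("date2"), F.col("w_rows")))
    .drop("w_rows"))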
However I am at a loss for a better way to implement it. I think this will essentially collect() in the driver, which I suppose might be ok if it is not much data. I thought of trying to implement the logic as sub-queries, or by creating a series of sub-DF's each with a subset or reduction of data.
If anyone is aware of any APIs or techniques that I am missing, any info would be appreciated.
Edit: This is somewhat related:
Spark SQL window function with complex condition