kdb q - efficiently count tables in flatfiles


I have a lot of tables stored in flat files (in a directory called basepath) and I want to check their number of rows. The best I can do right now is:
c:([] filename:system "ls ",basepath;
tablesize:count each get each hsym `$basepath,/:system "ls ",basepath)
which loads each table entirely into memory and then performs the count (that's quite slow). Is saving as splayed tables the only way to make this faster (because I would only load 1 column and count that) or is there a trick in q that I can use?
Thanks for the help

If you have basepath defined as a string of the path to directory where all your flat tables are stored then you can create a dictionary of the row counts as follows:
q)cnt:{count get hsym x}
q)filename:key hsym `$basepath
q)filename!cnt each filename
t| 2
g| 3
This is where I have flat tables t and g saved in my basepath directory. This stops you from having to use system commands, which are often less efficient.
The function cnt takes the path of each flat table (as a symbol) and returns the number of rows without assigning the table to a variable, although get still reads the whole file in order to count it.
The best solution, if you have control of the process that saves these files down, is to add an extra step that saves the row count somewhere separate at the same time as the raw data. This would allow you to quickly access the table size from that file instead of reading the full table in each time.
However, note that to avoid pulling them into memory at all you would have to instead use read1 and look at the headers on the binary data. As you said it would be better to save as a splayed table and read in one column.
UPDATE: I would not recommend doing this and strongly suggest doing the above, but for curiosity, after looking into using read1, here's an example of what a hacky solution might look like:
f:{
b:read1(y;0;x);
if[not 0x62630b~b[2 4 5];'`$"not a table"];
cc:first first((),"i";(),4)1:b 7+til 4;
if[null ce:first where cc=sums 0x0=11 _ b;:.z.s[x*2;y]];
c:`$"\000" vs "c"$b[11+til ce];
n:first first((),"i";(),4)1:b[(20+ce)+til 4];
:`columns`rows!(c;n);
}[2000]
The q binary file format isn’t documented anywhere, the only way to figure it out is to save different things and see how the bytes change. It’s also subject to changes between versions - the above is written for 3.5 and is probably valid for 3.0-3.5 only, not the latest 3.6 release or anything 2.X.
The given code works in the following way:
reads a chunk from the front of the file
validates that it looks like a flat unkeyed table (flip[98] of a dict[99] with symbol[11] keys)
reads the count of symbols in the list of columns as a little endian 4 byte int
scans through the null terminated strings for that many zero bytes
if the columns are so numerous or verbose that we don’t have them all in this chunk it will double the size of the chunk and try again
turn the strings into symbols
using the offset we get from the end of the column list, skip a bit more of the header for the mixed list of columns
then read the count from the header of the first column
Hope this answers your question!

From experimenting with the binary files, it seems that the table count is saved as part of the binary file when you save down a flat file, taking up 4 bytes after the initial object type and column headings which will vary from table to table.
q)`:test set ([]a:1 2 3;b:4 5 6;c:7 8 9;aa:10 11 12;bb:13 14 15)
q)read1 `:test
0xff016200630b000500000061006200630061610062620000000500000009000300000
0 7 11 31
bytes   | example                    | meaning
--------+----------------------------+-------------------------------------------------------
0 - 6   | 0xff016200630b00           | object is a flat table
7 - 10  | 0x05000000                 | number of columns (5)
11 - 22 | 0x610062006300616100626200 | ASCII column names a, b, c, aa, bb, each followed by a one-byte null separator
23 - 30 | 0x0000050000000900         | 8 bytes that can be skipped
31 - 34 | 0x03000000                 | 4 bytes for row count of first column (3)
This should help you understand the function that Fiona posted.
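For anyone who finds the byte layout easier to follow outside q, here is a rough Python sketch of the same walk through the header. It assumes the layout shown for the example file (marker bytes at positions 2, 4 and 5, column count at offset 7, null-terminated names from offset 11, then 8 bytes to skip before the row count), and the final byte of the sample header is completed as 00 on the assumption that the row count is a 4-byte int. The function name is made up for illustration, and the layout is not guaranteed for other q versions.

```python
import struct

def flat_table_header(b: bytes):
    """Parse column names and row count from the start of a flat table file.

    Assumes the undocumented layout described above; illustrative only.
    """
    # Same sanity check as the q function: bytes 2, 4 and 5 mark a flat table.
    if (b[2], b[4], b[5]) != (0x62, 0x63, 0x0B):
        raise ValueError("not a table")
    # Number of columns: little-endian 4-byte int at offset 7.
    (ncols,) = struct.unpack_from("<i", b, 7)
    # Column names: null-terminated ASCII strings starting at offset 11.
    pos, names = 11, []
    for _ in range(ncols):
        end = b.index(0, pos)
        names.append(b[pos:end].decode("ascii"))
        pos = end + 1
    # Skip 8 bytes of list headers, then read the first column's length.
    (nrows,) = struct.unpack_from("<i", b, pos + 8)
    return names, nrows

# First 35 bytes of the example file, split at the offsets in the table.
sample = bytes.fromhex(
    "ff016200630b00" "05000000" "610062006300616100626200"
    "0000050000000900" "03000000"
)
columns, rows = flat_table_header(sample)
print(columns, rows)
```

Unlike the q version this does not retry with a bigger chunk, so it needs to be given enough bytes to cover all the column names.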
The binary is saved down little-endian meaning the most-significant byte is the right-hand most digit - doing this in decimal for the number 100 would give 001, with the 100's (most significant) on the right and then 10s and finally 1s on the left. In the binary file, each group of 2 digits is a byte.
You can use 1: to read in the contents of a binary file, with additional arguments in the list specifying the offset - where to start reading from, and how many bytes to read. In our case we want to start at byte 31 and read in 4 bytes, specifying the output should be an integer and to cut the input into separate 4 byte chunks.
q)first first (enlist "i";enlist 4)1:(`:test;31;4)
3i
Converting the little-endian bytes into a long gives us the row count. Since this only has to read in 4 bytes instead of the whole file it is a lot quicker.
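The same partial read can be sketched in Python: seek to the offset, read 4 bytes, and decode them as a little-endian integer. The offset of 31 only holds for this particular example table (it depends on the column names), and the temporary file below is just a stand-in built from the header bytes shown earlier, with the last byte assumed to be 00.

```python
import os
import tempfile

def row_count_at(path, offset=31):
    """Read a 4-byte little-endian int at `offset` without loading the file."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        raw = fh.read(4)
    return int.from_bytes(raw, "little", signed=True)

# Stand-in for the saved table: only the header bytes from the example.
demo = bytes.fromhex(
    "ff016200630b00" "05000000" "610062006300616100626200"
    "0000050000000900" "03000000"
)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(demo)
    demo_path = tmp.name

demo_rows = row_count_at(demo_path)
os.remove(demo_path)
print(demo_rows)
```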
For a table with 10000 rows and 2 columns there is not much difference:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10000;31;4)
0
q)\t count get `:test10000
0
For a table with 100m rows and 2 columns:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10m;31;4)
0
q)\t count get `:test10m
2023
If you have a splayed table instead you can read in the number of elements in one of the columns from the 4 bytes starting at offset 8 (bytes 8-11) like so, assuming the column is a simple list:
q)first first (enlist "i";enlist 4)1:(`:a;8;4)
3i
You can read more about reading in from binary files here https://code.kx.com/q/ref/filenumbers/#1-binary-files

You can make what you currently have more efficient by using the following
counttables:{count each get each hsym `$basepath}
This will improve the speed of the count by not including the extra read in of the data as well as the join which you are currently doing. You are correct though that if the tables were saved splayed you would only have to read in the one column, making it much more efficient.

If your tables are stored uncompressed there's probably something quite hacky you could do with a read1 on the headers within the file until you find the first column header.
But v hacky :-(
Are you responsible for saving these down? Can you keep a running state as you do?

Related

Powershell: How can you remove an arbitrary range of rows from a csv?

I'm working with a very large CSV (700 GB). As a part of processing this file, I'd like to break it into smaller chunks and process each chunk individually.
My primary goal is to avoid reading the entire file upfront! I want the processing to happen gradually through the file, so I want to avoid having to wait for the entire 700 GB file to be read (and allocated to memory) before I can start processing the data.
If I'm understanding it right, and based on my testing, both Get-Content | Select -skip and StreamReader.ReadLine() require iterating over each row of the file, starting from the beginning. So if I were breaking a 10-row csv into 3-row chunks, PowerShell reads the rows like this:
          PASS 1   | PASS 2
-------------------+----------------
Line 1    READ     | CONTINUE
Line 2    READ     | CONTINUE
Line 3    READ     | CONTINUE
Line 4    BREAK    | READ
Line 5    --       | READ
Line 6    --       | READ
Line 7    --       | BREAK
Line 8    --       | --
Line 9    --       | --
Line 10   --       | --
Because of this, as you get into the hundreds-of-millions rows, the runtime is significantly impacted because of all the CONTINUEs. It would be a lot more optimal if I could pop off a few hundred thousand rows from the top of the CSV and remove them from the original CSV, just like you would if you were popping something off of the beginning of an array. Is this possible?
What I do not want to do:
I don't want to read through the entire CSV and split off chunks as I go, giving me one single pass through the rows. That's a huge upfront cost in terms of runtime, but also in terms of storage space. That's 1.4 TB of data before I even process it!
I don't want to write a copy of the CSV to another file without the first 100,000 rows; that would involve reading the entire CSV at once (many times).

Cannot allocate memory for a column of compound floats on a partitioned table

I have a partitioned table in my hdb that includes a column containing large lists of floats (at most 400 floats per element). eg each element looks like
(100.0 1.0 ...)
When trying to select on this column from days where there are particularly high numbers of rows I get an error saying
'./2015.02.07/table/column# Cannot allocate memory
The same error arises from a query like:
select column[;0] from table where date=2015.02.07
even though on days with fewer rows this query returns the first value of each element in the column.
Is there a way to stream this column in a select to decrease the memory requirements of holding the whole column in memory for a large day?
EDIT
.Q.ind on large days fails with the same error.
ie given I can work with 2015.02.01 but not 2015.02.02:
.Q.ind[select from table where date=2015.02.01;enlist 1]
is fine but
.Q.ind[select from table where date=2015.02.02;enlist 1]
fails with
{0!$[#.Q.pm;p3;(?).]#[x;0;p1[;y;z]]}
'./2015.02.10/table/column2#: Cannot allocate memory
#
.[?]
(+`time`sym`column1`column2!`:./2015.02.02/table;();0b;())
I should note I am using the free 32-bit version

I think this is all just a combination of the free 32-bit memory limitation, the fact that your row counts are possibly large, and the fact that (unavoidably) something must be pulled entirely into memory when retrieving data from a column, whether it is the column itself that gets entirely pulled in (in the non-nested case) or the nested-index column that gets entirely pulled in.
Another thing to consider is that kdb uses powers-of-two (buddy) memory allocation. Even if today's table only contains one more row than yesterday's, the memory requirements per column could double. Take a simple example:
In the free 32bit version (windows) you can create this many floats and it only uses ~1.07gb of memory
q)\ts 134217726?1.0
3093 1073741952
However, try to generate one extra float and you hit a memory limit
q)\ts 134217727?1.0
wsfull
So even a small amount of rows in the difference between one day and the next can be very significant if you're near the boundary of allocatable powers of two.
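To make the arithmetic concrete, here is a small Python sketch of the rounding. It assumes each float takes 8 bytes and that the vector carries a 16-byte header; the header size is my assumption and is not stated above. The function name is made up for illustration.

```python
def block_size(n_floats, header_bytes=16, item_bytes=8):
    """Smallest power-of-two block that fits a float vector of n_floats items.

    header_bytes=16 is an assumed vector header size, not taken from the post.
    """
    needed = header_bytes + item_bytes * n_floats
    size = 1
    while size < needed:
        size *= 2
    return size

fits = block_size(134217726)      # exactly fills a 2**30 byte block
too_big = block_size(134217727)   # one more float spills into 2**31
print(fits, too_big)
```

Under those assumptions the first vector needs exactly 2^30 bytes, which is close to the 1073741952 reported by \ts, while one extra float asks for a 2^31 byte block, which a 32-bit process cannot provide.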
--DISCLAIMER-- the following is hacky and is only intended for debugging!
You can actually manually try to access the data from the nested list, though you may still have memory issues here anyway.
Create a nested table and splay it
q)tab:([] col1:(101 102 103f;104 105f;106 107 108 109 110f;111 112f))
q)tab
col1
--------------------
101 102 103f
104 105f
106 107 108 109 110f
111 112f
q)
q)`:test/ set tab
`:test/
You can try to read in the indices from the nested-index file
q)2_first (enlist "j";enlist 8)1:`:test/col1
3 5 10 12
So the indices for splitting the full list of floats (the col1# file) is index 3, index 5, 10 etc etc
Say I want the first 3 rows
q)myrows:3#2_first (enlist "j";enlist 8)1:`:test/col1
q)myrows
3 5 10
then I know that I need the first 10 floats from the col1# file and need to split them at index 3 and 5. Then I can read the col1# file partially and split it correctly
q)(0,-1_myrows) cut raze (enlist "f";enlist 8)1:(`$":test/col1#";0;8*last myrows)
101 102 103f
104 105f
106 107 108 109 110f
But this is precisely what KDB does under the covers anyway so I suspect that you'll still have trouble even reading in the nested-index file in the first place.
Check this debug/hack and see if you can partially read that way. But obviously it's not a long-term solution!
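The index arithmetic in that hack can be sketched in Python, using the same end offsets (3 5 10 12) and values as the example. Plain lists stand in for the two files here, so this only shows how the cut works, not how to read the files.

```python
def cut_rows(flat, ends):
    """Split a flat list into rows, given the end offset of each row."""
    starts = [0] + list(ends[:-1])
    return [flat[s:e] for s, e in zip(starts, ends)]

ends = [3, 5, 10, 12]                        # contents of the nested-index file
flat = [float(v) for v in range(101, 113)]   # contents of the col1# file

myrows = ends[:3]                  # end offsets of the first 3 rows
needed = flat[:myrows[-1]]         # only the first 10 floats are required
first_three = cut_rows(needed, myrows)
print(first_three)
```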

Nested columns make querying in the usual way difficult, as the # file also needs to be loaded into memory (even with a [;0]).
Your best bet is to map in a single date partition and then select within it chunk by chunk, e.g. a million rows at a time (or whatever is sensible given the size of the nested floats).
Perhaps also consider 32bit floats, if some decimal accuracy can be sacrificed.
EDIT
So after the comments, I guess the best way is to go through each partition a number of rows at a time with .Q.ind

Just to give my 2 cents on this, I had a similar error but with a 64-bit instance.
I suspected that the memory needed to be de-fragmented as it was running for almost a year.
Bouncing the instance solved the issue, and released a lot of virtual memory

KDB/KX appending table to a file without reading the entire file

I'm new to KDB ( sorry if this question is dumb). I'm creating the following table
q)dsPricing:([id:`int$(); date:`date$()] open:`float$();close:`float$();high:`float$();low:`float$();volume:`int$())
q)`dsPricing insert(123;2003.03.23;1.0;3.0;4.0;2.0;1000)
q)`dsPricing insert(123;2003.03.24;1.0;3.0;4.0;2.0;2000)
q)save `:dsPricing
Let's say after saving I exit. After starting q, I like to add another pricing item in there without loading the entire file because the file could be large
q)`dsPricing insert(123;2003.03.25;1.0;3.0;4.0;2.0;1500)
I've been looking at .Q.dpft but I can't really figure it out. Also this table/file doesn't need to be partitioned.
Thanks

You can upsert with the file handle of a table to append on disk, your example would look like this:
`:dsPricing upsert(123;2003.03.25;1.0;3.0;4.0;2.0;1500)
You can load the table into your q session using get, load or \l
q)get `:dsPricing
id date | open close high low volume
--------------| --------------------------
123 2003.03.23| 1 3 4 2 1000
123 2003.03.24| 1 3 4 2 2000
123 2003.03.25| 1 3 4 2 1500
.Q.dpft will save a table splayed (one file for each column in the table and a .d file containing the column names) with a parted attribute (p#) on one of the symbol columns. Any symbol columns will also be enumerated by .Q.en.

Process two space delimited text files into one by common column [duplicate]

I have two text files that look like:
col1 primary col3 col4
blah 1 blah 4
1 2 5 6
...
and
colA primary colC colD
1 1 7 27
foo 2 11 13
I want to merge them into a single wider table, such as:
primary col1 col3 col4 colA colC colD
1 blah blah 4 a 7 27
2 1 5 6 foo 11 13
I'm pretty new to Perl, so I'm not sure what the best way is to do this.
Note that column order does not matter, and there are a couple million rows. Also my files are unfortunately not sorted.
My current plan unless there's an alternative:
For a given line in one of the files, scan the other file for the matching row and append them both as necessary into the new file. This sounds slow and cumbersome though.
Thanks!

Solution 1.
Read the smaller of the two files line by line, using a standard CPAN delimited-file parser like Text::CSV_XS to parse out columns.
Save each record (as an arrayref of columns) in a hash, with your merge column being the hash key.
When done, read the larger of the two files line by line, using a standard CPAN delimited-file parser like Text::CSV_XS to parse out columns.
For each record, find the join key field, find the matching record from your hash storing the data from file#1, merge the 2 records as needed, and print.
NOTE: This is pretty memory intensive as the entire smaller file will live in memory, but it won't require you to read one of the files millions of times.
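As a rough illustration of Solution 1 (in Python rather than Perl, and splitting on whitespace instead of using a CSV parser), the hash join could look like the sketch below. The function names are made up, rows without a match in the smaller file are simply skipped, and the sample lines are the ones from the question.

```python
def split_key(fields, k):
    """Return (key value, remaining fields) for one parsed row."""
    return fields[k], fields[:k] + fields[k + 1:]

def hash_join(small, large, key="primary"):
    """Join two iterables of space-delimited lines on a shared column."""
    small, large = iter(small), iter(large)
    head1 = next(small).split()
    k1 = head1.index(key)
    # Load the smaller input into a dict keyed by the join column.
    lookup = {}
    for line in small:
        if line.strip():
            kv, rest = split_key(line.split(), k1)
            lookup[kv] = rest
    head2 = next(large).split()
    k2 = head2.index(key)
    yield [key] + split_key(head1, k1)[1] + split_key(head2, k2)[1]
    # Stream the larger input and merge each row with its match, if any.
    for line in large:
        if line.strip():
            kv, rest = split_key(line.split(), k2)
            if kv in lookup:
                yield [kv] + lookup[kv] + rest

file1 = ["col1 primary col3 col4", "blah 1 blah 4", "1 2 5 6"]
file2 = ["colA primary colC colD", "1 1 7 27", "foo 2 11 13"]
merged = list(hash_join(file1, file2))
for row in merged:
    print(" ".join(row))
```

Because both arguments are just iterables of lines, open file handles can be passed in directly, so the larger file is never held in memory.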
Solution 2.
1. Sort file1 (using Unix sort or some simple Perl code) into "file1.sorted"
2. Sort file2 (using Unix sort or some simple Perl code) into "file2.sorted"
3. Open both files for reading. Loop until both are fully read:
   a. Read 1 line from each file into the buffer if the buffer for that file is empty (buffer being simply a variable containing the next record).
   b. Compare indexes between 2 lines.
   c. If index1 < index2, write the record for file1 into output (without merging) and empty buffer1. Repeat step 3.
   d. If index1 > index2, write the record for file2 into output (without merging) and empty buffer2. Repeat.
   e. If index1 == index2, merge 2 records, write the merged record into output and empty out both buffers (assuming the join index column is unique. If not unique, this step is more complicated).
NOTE: this does NOT require you to keep entire file in memory, aside from sorting the files (which CAN be done in memory constrained way if you need to).
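The merge loop of Solution 2 can be sketched in Python as follows. It assumes both inputs are already sorted by a unique key and have been parsed into (key, fields) pairs; the function name and the rows with keys 3 and 4 are made up to show the unmatched cases.

```python
def merge_join(recs1, recs2):
    """Merge two key-sorted streams of (key, fields) with unique keys."""
    it1, it2 = iter(recs1), iter(recs2)
    buf1, buf2 = next(it1, None), next(it2, None)
    while buf1 is not None or buf2 is not None:
        if buf2 is None or (buf1 is not None and buf1[0] < buf2[0]):
            # Only file1 has this key: write it unmerged, refill buffer1.
            yield buf1
            buf1 = next(it1, None)
        elif buf1 is None or buf2[0] < buf1[0]:
            # Only file2 has this key: write it unmerged, refill buffer2.
            yield buf2
            buf2 = next(it2, None)
        else:
            # Same key in both: merge the records and refill both buffers.
            yield buf1[0], buf1[1] + buf2[1]
            buf1, buf2 = next(it1, None), next(it2, None)

sorted1 = [(1, ["blah", "blah", "4"]), (2, ["1", "5", "6"]), (4, ["x", "y", "z"])]
sorted2 = [(1, ["1", "7", "27"]), (2, ["foo", "11", "13"]), (3, ["p", "q", "r"])]
joined = list(merge_join(sorted1, sorted2))
for key, fields in joined:
    print(key, *fields)
```

Only one record per file is buffered at any time, which is what keeps the memory use flat.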

Making sense of Postgres row sizes

I got a large (>100M rows) Postgres table with structure {integer, integer, integer, timestamp without time zone}. I expected the size of a row to be 3*integer + 1*timestamp = 3*4 + 1*8 = 20 bytes.
In reality the row size is pg_relation_size(tbl) / count(*) = 52 bytes. Why?
(No deletes are done against the table: pg_relation_size(tbl, 'fsm') ~= 0)

Calculation of row size is much more complex than that.
Storage is typically partitioned in 8 kB data pages. There is a small fixed overhead per page, possible remainders not big enough to fit another tuple, and more importantly dead rows or a percentage initially reserved with the FILLFACTOR setting.
And there is even more overhead per row (tuple): an item identifier of 4 bytes at the start of the page, the HeapTupleHeader of 23 bytes and alignment padding. The start of the tuple header as well as the start of tuple data are aligned at a multiple of MAXALIGN, which is 8 bytes on a typical 64-bit machine. Some data types require alignment to the next multiple of 2, 4 or 8 bytes.
Quoting the manual on the system table pg_type:
typalign is the alignment required when storing a value of this type.
It applies to storage on disk as well as most representations of the
value inside PostgreSQL. When multiple values are stored
consecutively, such as in the representation of a complete row on
disk, padding is inserted before a datum of this type so that it
begins on the specified boundary. The alignment reference is the
beginning of the first datum in the sequence.
Possible values are:
c = char alignment, i.e., no alignment needed.
s = short alignment (2 bytes on most machines).
i = int alignment (4 bytes on most machines).
d = double alignment (8 bytes on many machines, but by no means all).
Read about the basics in the manual here.
Your example
This results in 4 bytes of padding after your 3 integer columns, because the timestamp column requires double alignment and needs to start at the next multiple of 8 bytes.
So, one row occupies:
23 -- heaptupleheader
+ 1 -- padding or NULL bitmap
+ 12 -- 3 * integer (no alignment padding here)
+ 4 -- padding after 3rd integer
+ 8 -- timestamp
+ 0 -- no padding since tuple ends at multiple of MAXALIGN
Plus item identifier per tuple in the page header (as pointed out by @A.H. in the comment):
+ 4 -- item identifier in page header
------
= 52 bytes
So we arrive at the observed 52 bytes.
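The same sum can be reproduced with a few lines of Python that apply the alignment rules step by step. The sizes and alignments are the ones used above (23-byte header, MAXALIGN of 8, 4-byte integers, 8-byte timestamp, 4-byte item identifier); the helper name is made up for illustration.

```python
def align(offset, boundary):
    """Round offset up to the next multiple of boundary."""
    return -(-offset // boundary) * boundary

MAXALIGN = 8
offset = align(23, MAXALIGN)              # header plus 1 byte of padding = 24

# (size, alignment) for integer, integer, integer, timestamp
for size, alignment in [(4, 4), (4, 4), (4, 4), (8, 8)]:
    offset = align(offset, alignment) + size

tuple_size = align(offset, MAXALIGN)      # 48, already a multiple of MAXALIGN
row_size = tuple_size + 4                 # plus item identifier in the page
print(row_size)
```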
The calculation pg_relation_size(tbl) / count(*) is a pessimistic estimation. pg_relation_size(tbl) includes bloat (dead rows) and space reserved by fillfactor, as well as overhead per data page and per table. (And we didn't even mention compression for long varlena data in TOAST tables, since it doesn't apply here.)
You can install the additional module pgstattuple and call SELECT * FROM pgstattuple('tbl_name'); for more information on table and tuple size.
Related:
Table size with page layout
Calculating and saving space in PostgreSQL

Each row has metadata associated with it. The correct formula is (assuming naïve alignment):
3 * 4 + 1 * 8 == your data
24 bytes == row overhead
total size per row: 24 + 20 = 44 bytes, before alignment padding and the item identifier
Or roughly 52 bytes once those are included. I actually wrote postgresql-varint specifically to help with this problem with this exact use case. You may want to look at a similar post for additional details re: tuple overhead.
