KDB/KX appending table to a file without reading the entire file

I'm new to KDB (sorry if this question is dumb). I'm creating the following table:
q)dsPricing:([id:`int$(); date:`date$()] open:`float$();close:`float$();high:`float$();low:`float$();volume:`int$())
q)`dsPricing insert(123;2003.03.23;1.0;3.0;4.0;2.0;1000)
q)`dsPricing insert(123;2003.03.24;1.0;3.0;4.0;2.0;2000)
q)save `:dsPricing
Let's say after saving I exit. After starting q again, I'd like to add another pricing item without loading the entire file, because the file could be large:
q)`dsPricing insert(123;2003.03.25;1.0;3.0;4.0;2.0;1500)
I've been looking at .Q.dpft but I can't really figure it out. Also this table/file doesn't need to be partitioned.
Thanks

You can upsert to the file handle of a table to append on disk; your example would look like this:
`:dsPricing upsert(123;2003.03.25;1.0;3.0;4.0;2.0;1500)
You can load the table into your q session using get, load or \l
q)get `:dsPricing
id  date      | open close high low volume
--------------| --------------------------
123 2003.03.23| 1    3     4    2   1000
123 2003.03.24| 1    3     4    2   2000
123 2003.03.25| 1    3     4    2   1500
.Q.dpft will save a table splayed (one file for each column in the table and a .d file containing the column names) with a parted attribute (p#) applied to one of the symbol columns. Any symbol columns will also be enumerated by .Q.en.
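For completeness, since .Q.dpft came up: it is aimed at partitioned databases, so if you don't need partitioning you can simply keep the flat file above, or save the table splayed (one file per column) and upsert to the directory instead. Below is a minimal sketch of the splayed route; the directory name dsPricingSplay is just an example, the key is dropped with 0! because splayed tables are stored unkeyed, and no .Q.en enumeration is needed here because dsPricing has no symbol columns:
`:dsPricingSplay/ set 0!dsPricing          / splay: one file per column plus a .d file
`:dsPricingSplay/ upsert ([]id:enlist 123i;date:enlist 2003.03.25;open:enlist 1.0;close:enlist 3.0;high:enlist 4.0;low:enlist 2.0;volume:enlist 1500i)   / append a conforming row (123i and 1500i match the int columns declared above)
get `:dsPricingSplay/                      / map the splayed table back in to check it
Appending with upsert to the splayed directory just appends to each column file, so, like the flat-file upsert, it doesn't re-read the whole table.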

Related

updating the cells in a partitioned table using .Q.ind[] in q

I have a partitioned table and can read it using a get command as such:
get `:hdb/2018.01.01/trade
and will give me:
sym size exchange
-----------------
0   100  2
1   200  2
1   300  2
I'd like to modify a cell value, e.g. size from 200 and 300 to 1000, given an index or list of rows. So I am using
.Q.ind[`:hdb/2018.01.01/trade; 1 2j]
to get the rows and then change the cell. But I am getting a `rank error when running .Q.ind[].
The error you're getting is because the first input param to .Q.ind should be the mapped table itself, not a symbol representing the table name/location.
I'm not sure .Q.ind is going to help you here though; it's more useful for data retrieval than for rewriting data.
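For illustration, a minimal sketch of how .Q.ind would normally be called for your table (assuming the HDB root directory is hdb, as in your paths):
\l hdb              / load the HDB so that trade is a mapped partitioned table
.Q.ind[trade;1 2j]  / first argument is the mapped table itself, not a file symbol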
A couple of approaches you could take:
Pull in the whole date slice (select from table where date=X), modify it in memory and then write it back down using `:hdb/2018.01.01/trade/ set delete date from modifiedTable. This assumes you're not modifying any enumerated/symbol columns, and you'd have to be careful to maintain the same schema, compression etc. (see the sketch after this list)
Use the dbmaint package to handle the changes: https://github.com/KxSystems/kdb/blob/master/utils/dbmaint.md
If you're careful enough you could pull in only the column itself, modify it and write it back down: p set @[get p:`:hdb/2018.01.01/trade/col1;1 2;:;1000]
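As referenced above, here is a minimal sketch of the first approach, written against the paths and row indices from the question and reading the partition directly with get (as in your own example) rather than loading the whole HDB; the usual caveats about symbol columns, schema and compression still apply:
t:get `:hdb/2018.01.01/trade              / pull the whole partition slice into memory
t:update size:1000 from t where i in 1 2  / modify the cells you want
`:hdb/2018.01.01/trade/ set t             / write the slice back down over the original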
You could also use an amend operation to update the values.
@[`:hdb/2018.01.01/trade;`size;@[;1 2;:;1000]]
This will edit your table on disk.
q)get`:hdb/2018.01.01/trade
sym size exchange
-----------------
0   100  2
1   200  2
1   300  2
q)@[`:hdb/2018.01.01/trade;`size;@[;1 2;:;1000]]
`:hdb/2018.01.01/trade
q)get `:hdb/2018.01.01/trade/
sym size exchange
-----------------
0   100  2
1   1000 2
1   1000 2

kdb q - efficiently count tables in flatfiles

I have a lot of tables stored in flat files (in a directory called basepath) and I want to check their number of rows. The best I can do right now is:
c:([] filename:system "ls ",basepath;
tablesize:count each get each hsym `$basepath,/:system "ls ",basepath)
which loads each table entirely into memory and then performs the count (that's quite slow). Is saving as splayed tables the only way to make this faster (because I would only load 1 column and count that) or is there a trick in q that I can use?
Thanks for the help
If you have basepath defined as a string of the path to directory where all your flat tables are stored then you can create a dictionary of the row counts as follows:
q)cnt:{count get hsym x}
q)filename:key hsym `$basepath
q)filename!cnt each filename
t| 2
g| 3
This is where I have flat tables t and g saved in my basepath directory. This stops you from having to use system commands, which are often less efficient.
The function cnt takes the path of each flat table (as a symbol) and returns the number of rows without keeping the table in your session.
The best solution, if you have control of the process that saves these files down, is to add an extra step of saving the meta information of the row count somewhere separate at the same time as saving the raw data. This would allow you to quickly access the table size from that file instead of reading the full table in each time.
However, note that to avoid pulling them into memory at all you would have to instead use read1 and look at the headers on the binary data. As you said it would be better to save as a splayed table and read in one column.
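Going back to the bookkeeping suggestion, a minimal sketch of what that could look like; savetab and the .counts side file are hypothetical names, not an existing utility:
savetab:{[dir;nm;t]
  hsym[`$dir,"/",string nm] set t;                      / save the flat table as before
  cnts:$[()~key cf:hsym`$dir,"/.counts";()!();get cf];  / existing counts, or an empty dict if the side file doesn't exist yet
  cf set cnts,(enlist nm)!enlist count t }              / upsert this table's row count and write the side file back
savetab["basepath";`t;([]a:1 2 3;b:4 5 6)]
get hsym`$"basepath/.counts"                            / row counts without reading the tables themselves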
UPDATE: I would not recommend doing this and strongly suggest doing the above, but for curiosity, after looking into using read1, here's an example of what a hacky solution might look like:
f:{
  b:read1(y;0;x);                                           / read a chunk of x bytes from the front of file y
  if[not 0x62630b~b[2 4 5];'`$"not a table"];               / flat unkeyed table: flip (0x62) of a dict (0x63) with symbol (0x0b) keys
  cc:first first((),"i";(),4)1:b 7+til 4;                   / column count, little-endian 4-byte int at offset 7
  if[null ce:first where cc=sums 0x0=11 _ b;:.z.s[x*2;y]];  / end of the null-terminated column names; retry with a bigger chunk if they aren't all present
  c:`$"\000" vs "c"$b[11+til ce];                           / split the names into symbols
  n:first first((),"i";(),4)1:b[(20+ce)+til 4];             / row count from the header of the first column
  :`columns`rows!(c;n);
  }[2000]
The q binary file format isn't documented anywhere; the only way to figure it out is to save different things and see how the bytes change. It's also subject to change between versions - the above is written for 3.5 and is probably valid for 3.0-3.5 only, not the latest 3.6 release or anything 2.X.
The given code works in the following way:
reads a chunk from the front of the file
validates that it looks like a flat unkeyed table (flip[98] of a dict[99] with symbol[11] keys)
reads the count of symbols in the list of columns as a little endian 4 byte int
scans through the null-terminated strings for that many zero bytes; if the columns are so numerous or verbose that we don't have them all in this chunk, it doubles the size of the chunk and tries again
turns the strings into symbols
using the offset we get from the end of the column list, skips a bit more of the header for the mixed list of columns, then reads the count from the header of the first column
Hope this answers your question!
From experimenting with the binary files, it seems that the table count is saved as part of the binary file when you save down a flat file, taking up 4 bytes after the initial object type and the column headings (which will vary from table to table).
`:test set ([]a:1 2 3;b:4 5 6;c:7 8 9;aa:10 11 12;bb:13 14 15)
q)read1 `:test
0xff016200630b0005000000610062006300616100626200000005000000090003000000
(byte offsets 0, 7, 11 and 31 are the positions described in the table below)
bytes   | example                  | meaning
---------------------------------------------------------------------------------------
0 - 5   | 0xff016200630b           | object is a flat table
7 - 11  | 0x05000000               | number of columns (5)
12 - 22 | 0x6100620063006161006262 | ascii values of the column names ("a","b","c","aa","bb"), each name followed by a one-byte null separator
23 - 30 | 0x0000050000000900       | 8 bytes that can be skipped
31 - 34 | 0x03000000               | 4 bytes for the row count of the first column (3)
This should help you understand the function that Fiona posted.
The binary is saved down little-endian, meaning the most significant byte is the right-most one. Doing this in decimal for the number 100 would give 001, with the hundreds (most significant) digit on the right, then the tens, and finally the ones on the left. In the binary file, each group of 2 hex digits is one byte.
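As a small illustration of that byte order (this is the same 0x0 sv reverse trick used in the timings below), the 4 little-endian bytes holding the row count have to be reversed before being joined back into a number:
q)0x0 sv reverse 0x03000000   / 0x03000000 is little-endian for 3
3i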
You can use 1: to read in the contents of a binary file, with additional arguments in the list specifying the offset - where to start reading from, and how many bytes to read. In our case we want to start at byte 31 and read in 4 bytes, specifying the output should be an integer and to cut the input into separate 4 byte chunks.
q)first first (enlist "i";enlist 4)1:(`:test;31;4)
3i
Converting the little-endian bytes into an integer gives us the row count. Since this only has to read in 4 bytes instead of the whole file, it is a lot quicker.
For a table with 10000 rows and 2 columns there is not much difference:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10000;31;4)
0
q)\t count get `:test10000
0
For a table with 100m rows and 2 columns:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10m;31;4)
0
q)\t count get `:test10m
2023
If you have a splayed table instead, you can read the number of elements in one of the columns from the 4 bytes starting at offset 8 like so, assuming the column is a simple list:
q)first first (enlist "i";enlist 4)1:(`:a;8;4)
3i
You can read more about reading in from binary files here https://code.kx.com/q/ref/filenumbers/#1-binary-files
You can make what you currently have more efficient by using the following
counttables:{count each get each hsym `$basepath,/:system "ls ",basepath}
This will improve the speed of the count by dropping the extra directory listing and the table construction which you are currently doing. You are correct though that if the tables were saved splayed you would only have to read in the one column, making it much more efficient.
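For reference, a minimal sketch of that splayed case (the directory and column names here are just examples): once a table is splayed, counting a single column file gives the row count without reading the rest of the table:
q)`:basepath/t/ set ([]a:1 2 3;b:4 5 6)   / example splayed save, one file per column
`:basepath/t/
q)count get `:basepath/t/a                / row count from a single column file
3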
If your tables are stored uncompressed there's probably something quite hacky you could do with a read1 on the headers within the file until you find the first column header.
But v hacky :-(
Are you responsible for saving these down? Can you keep a running state as you do?

Tableau collate data from multiple columns

I am not sure what the technical term for what I am trying to do is.
Hoping raw data and output below will clearly define the use case.
Raw data :
This is what my raw data looks like
Output 1 :
This is what I am trying to extract first.
Here I am trying to get a table where the first column has the name of the guests and 2nd column has the count of times they have featured in the table as a guest.
Output 2 :
This is what I am trying to extract next.
Here I am trying to map months against names and see how many nights one has collected in which month.
One way to achieve this would be to create a temp table with 5 columns: column 1 with Guest names,
column 2 with count of occurrence in guest 1 column in raw data table,
column 3 with count of occurrence in guest 2 column in raw data table,
column 4 with count of occurrence in guest 3 column in raw data table,
column 5 with total of previous 3 columns.
But I am trying to find a proper solution through Tableau, if possible, because this way would not help me achieve Output 2.
Plain text raw data if you'd like to work on it :
booking by,Guest 1,Guest 2,Guest 3,stay start,stay end,hotel code
Ram,Seema,Ram,,May 1 2018,May 2 2018,BBST
Karan,Ram,Seema,,May 6 2018,May 7 2018,BRRLY
Mahesh,Mahesh,Seema,Ram,June 2 2018,June 4 2018,BBST
Krishna,Krishna,,,June 2 2018,June 3 2018,BRRLY
Seema,Seema,,,June 7 2018,June 8 2018,BRRLY

Pentaho spoon transformation from excel file

I have yearly data in my excel file in such format:
Country \ Years 1980 1981 ... 2010
Abkhazia        234  334  ... 456
Afghanistan     466  789  ... 732
...
And I want to transform my data into 3 different tables and load them into a postgres database.
The tables should look something like this:
First table - country:
id | name
1 | Abkhazia
2 | Afghanistan
Second table dates:
id | date
1 | 1980
2 | 1981
And third is a table where all data is stored depending on country and date:
country_id date_id data
1 1 234
1 2 334
2 1 466
2 2 789
... ... ...
Any ideas how I could achieve my goal?
Assuming the source excel structure is as described above (I have custom built this):
There are basically 3 parts to your question. I'll break the transformation down into parts for better understanding:
1. Loading Table - Country
This is pretty straightforward based on the data given in the excel. Simply take an Excel Input >> Add a Sequence step (give the sequence name as Country ID) >> Select only the Country Name and Country ID >> Load into the Country Table using Table Output.
2. Loading Table - Year:
The idea here is to display the Year ID in Row wise format instead of the columns given the excel source data. PDI version 5 and above provides you with a very useful step called Metadata Structure. This step allows you to get the structure of your table. In this case, we need to have the year columns pulled, ignoring the country column.
Follow the steps as below:
Read the Excel Data >> Get the Metadata structure of your source >> Filter Out the Country Column (which is available in row at position=1) >> Add a Sequence Number. Name it YearID >> Finally Load the Year Table.
3. Loading the Final Table - Country and Year along with Data:
The way to get all the column data values down to row level in PDI is to use the Row Normalizer step. Use this step to produce a normalized output. Now follow the steps below:
Read the Excel source data >> use Row Normalizer Step to normalize the rows based on the Years >> Do a Stream Lookup with the Above Country and Year tables to fetch the CountryID and YearID respectively >> Finally Load the necessary column data into Table Output
Hope it helps :)
I have placed the code in a github repo along with the data file which I have used. It's here.
Also, I just realized that I have used the wrong naming conventions as per your question. Consider date_id as YearID, and instead of plain ids I have given countryid and yearid.

Process two space delimited text files into one by common column [duplicate]

This question already has answers here:
merge two files by key if exists in the first file / bash script [duplicate]
(2 answers)
Closed 9 years ago.
I have two text files that look like:
col1 primary col3 col4
blah 1       blah 4
1    2       5    6
...
and
colA primary colC colD
1    1       7    27
foo  2       11   13
I want to merge them into a single wider table, such as:
primary col1 col3 col4 colA colC colD
1       blah blah 4    a    7    27
2       1    5    6    foo  11   13
I'm pretty new to Perl, so I'm not sure what the best way is to do this.
Note that column order does not matter, and there are a couple million rows. Also my files are unfortunately not sorted.
My current plan unless there's an alternative:
For a given line in one of the files, scan the other file for the matching row and append them both as necessary into the new file. This sounds slow and cumbersome though.
Thanks!
Solution 1.
Read the smaller of the two files line by line, using a standard CPAN delimited-file parser like Text::CSV_XS to parse out the columns.
Save each record (as arrayref of columns) in a hash, with your merge column being the hash key
When done, read the larger of the two files line by line, using a standard CPAN delimited-file parser like Text::CSV_XS to parse out the columns.
For each record, find the join key field, find the matching record from your hash storing the data from file#1, merge the 2 records as needed, and print.
NOTE: This is pretty memory intensive as the entire smaller file will live in memory, but it won't require you to read one of the files millions of times.
Solution 2.
Sort file1 (using Unix sort or some simple Perl code) into "file1.sorted"
Sort file2 (using Unix sort or some simple Perl code) into "file2.sorted"
Open both files for reading. Loop until both are fully read:
Read 1 line from each file into the buffer if the buffer for that file is empty (buffer being simply a variable containing the next record).
Compare indexes between 2 lines.
If index1 < index2, write the record for file1 into output (without merging) and empty buffer1. Repeat step 3
If index1 > index2, write the record for file2 into output (without merging) and empty buffer2. Repeat.
If index1 == index2, merge 2 records, write the merged record into output and empty out both buffers (assuming the join index column is unique. If not unique, this step is more complicated).
NOTE: this does NOT require you to keep an entire file in memory, aside from sorting the files (which CAN be done in a memory-constrained way if you need to).