updating the cells in a partitioned table using .Q.ind[] in q - kdb

I have a partitioned table and can read it using a get command like this:
get `:hdb/2018.01.01/trade
which gives me:
sym size exchange
-----------------
0 100 2
1 200 2
1 300 2
I'd like to modify a cell value, e.g. change size from 200 and 300 to 1000, given an index or list of rows. So I am using
.Q.ind[`:hdb/2018.01.01/trade; 1 2j]
to get the rows and then change the cell, but I am getting a `rank error when running .Q.ind[].

The error you're getting is because the first input parameter to .Q.ind should be the mapped table itself, not a symbol representing the table name/location.
I'm not sure .Q.ind is going to help you here though; it's more useful for data retrieval than for (re)writing data.
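For reference, typical usage looks something like this (a minimal sketch, assuming the HDB root directory is hdb and that it has been loaded into the session):
q)\l hdb
q).Q.ind[trade;1 2j]    / first argument is the mapped table itself, indices are longs
Note the indices here are global row numbers counted across all partitions in order, not row numbers within the 2018.01.01 slice.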
A couple of approaches you could take:
Pull in the whole date slice (select from table where date=X), modify it in memory and then write it back down using `:hdb/2018.01.01/trade/ set delete date from modifiedTable (see the sketch after this list). This assumes you're not modifying any enumerated/symbol columns. You'd have to be careful to maintain the same schema, the same compression, etc.
Use the dbmaint package to handle the changes: https://github.com/KxSystems/kdb/blob/master/utils/dbmaint.md
If you're careful enough you could pull in only the column itself, modify it and write it back down: p set @[get p:`:hdb/2018.01.01/trade/col1;1 2;:;1000]
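A minimal sketch of the first approach, assuming the HDB root is hdb, the HDB has been loaded into the session, and size is not an enumerated/symbol column:
q)\l hdb
q)t:select from trade where date=2018.01.01        / pull the whole date slice into memory
q)t:update size:1000 from t where i in 1 2         / modify the rows you want
q)`:hdb/2018.01.01/trade/ set delete date from t   / write it back down splayed
If the table has symbol columns you'd need to make sure they stay enumerated (e.g. via .Q.en) before setting, and you'd have to reapply any attributes/compression yourself.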

You could also use an amend operation to update the values.
@[`:hdb/2018.01.01/trade;`size;@[;1 2;:;1000]]
This will edit your table on disk.
q)get`:hdb/2018.01.01/trade
sym size exchange
-----------------
0 100 2
1 200 2
1 300 2
q)@[`:hdb/2018.01.01/trade;`size;@[;1 2;:;1000]]
`:hdb/2018.01.01/trade
q)get `:hdb/2018.01.01/trade/
sym size exchange
-----------------
0 100 2
1 1000 2
1 1000 2
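If you needed to apply the same fix to more than one date partition, a rough sketch (the second date here is hypothetical, and the row indices are per partition) would be to run the amend over a list of partition handles:
q)@[;`size;@[;1 2;:;1000]] each `:hdb/2018.01.01/trade`:hdb/2018.01.02/trade
Again, this is only safe for plain (non-enumerated) columns like size.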

Related

SPSS aggregate on 2 variables

I am trying to compute an N_BREAK that has to "satisfy" a condition. I have a variable which indicates 1 or 0; let's call that variable "HT". Each lopnr also appears on multiple rows, so the first 10 rows can be ID nr 1, the next 20 can be ID nr 2, and so on.
My question is: how do I create an N_BREAK with lopnr as the break variable that only counts cases with HT=1? I am not allowed to select only the 1s on variable HT beforehand, since I need the 0s in the file.
A few simple ways to do this:
1 - USE FILTER
filter cases by HT.
aggregate ....
when you get back to the original dataset, use:
filter off.
use all.
2 - COPY DATASET
dataset name orig.
dataset copy foragg.
dataset activate foragg.
select if HT.
aggregate....
3 - TEMPORARY SELECTION
temporary.
select if HT.
aggregate....

kdb q - efficiently count tables in flatfiles

I have a lot of tables stored in flat files (in a directory called basepath) and I want to check their number of rows. The best I can do right now is:
c:([] filename:system "ls ",basepath;
tablesize:count each get each hsym `$basepath,/:system "ls ",basepath)
which loads each table entirely into memory and then performs the count (that's quite slow). Is saving as splayed tables the only way to make this faster (because I would only load 1 column and count that) or is there a trick in q that I can use?
Thanks for the help
If you have basepath defined as a string of the path to the directory where all your flat tables are stored, then you can create a dictionary of the row counts as follows:
q)cnt:{count get hsym x}
q)filename:key hsym `$basepath
q)filename!cnt each filename
t| 2
g| 3
This is where I have flat tables t and g saved in my basepath directory. This stops you from having to use system commands, which are often less efficient.
The function cnt takes the path of each flat table (as a symbol) and returns the number of rows without keeping the table in memory.
The best solution, if you have control of the process of saving such files down, is to add an extra step of saving the meta information of the row count somewhere separate at the same time as saving the raw data. This would allow you to quickly access the table size from this file instead of reading the full table in each time.
However, note that to avoid pulling them into memory at all you would have to instead use read1 and look at the headers on the binary data. As you said it would be better to save as a splayed table and read in one column.
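As a rough sketch of that row-count-metadata idea (the savetab/tabcount names and the ".count" suffix are made up purely for illustration), you could write the count to a small companion file at save time and later read only that file:
savetab:{[dir;nm;t]
 (hsym `$dir,"/",string nm) set t;                   / the table itself
 (hsym `$dir,"/",string[nm],".count") set count t;   / its row count alongside it
 }
tabcount:{[dir;nm] get hsym `$dir,"/",string[nm],".count"}
q)savetab["basepath";`t;([]a:1 2 3;b:4 5 6)]
q)tabcount["basepath";`t]
3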
UPDATE: I would not recommend doing this and strongly suggest the approaches above, but out of curiosity, after looking into read1, here's an example of what a hacky solution might look like:
f:{
b:read1(y;0;x);
if[not 0x62630b~b[2 4 5];'`$"not a table"];
cc:first first((),"i";(),4)1:b 7+til 4;
if[null ce:first where cc=sums 0x0=11 _ b;:.z.s[x*2;y]];
c:`$"\000" vs "c"$b[11+til ce];
n:first first((),"i";(),4)1:b[(20+ce)+til 4];
:`columns`rows!(c;n);
}[2000]
The q binary file format isn't documented anywhere; the only way to figure it out is to save different things and see how the bytes change. It's also subject to changes between versions - the above is written for 3.5 and is probably valid for 3.0-3.5 only, not the latest 3.6 release or anything 2.X.
The given code works in the following way:
reads a chunk from the front of the file
validates that it looks like a flat unkeyed table (flip[98] of a dict[99] with symbol[11] keys)
reads the count of symbols in the list of columns as a little-endian 4-byte int
scans through the null-terminated strings for that many zero bytes
if the columns are so numerous or verbose that we don't have them all in this chunk, it will double the size of the chunk and try again
turns the strings into symbols
using the offset we get from the end of the column list, skips a bit more of the header for the mixed list of columns
then reads the count from the header of the first column
Hope this answers your question!
From experimenting with the binary files, it seems that the table count is saved as part of the binary file when you save down a flat file, taking up 4 bytes after the initial object type and column headings which will vary from table to table.
`:test set ([]a:1 2 3;b:4 5 6;c:7 8 9;aa:10 11 12;bb:13 14 15)
q)read1 `:test
0xff016200630b0005000000610062006300616100626200000005000000090003000000
The interesting byte offsets here are 0, 7, 11 and 31, broken down below:
bytes | example | meaning
---------------------------------------------------------------------------------------
0 - 5 | 0xff016200630b | object is a flat table
7 - 10 | 0x05000000 | number of columns (5)
11 - 22 | 0x610062006300616100626200 | ASCII values of each column name ("a", "b", "c", "aa", "bb"), each followed by a one-byte null separator
23 - 30 | 0x0000050000000900 | 8 bytes that can be skipped
31 - 34 | 0x03000000 | 4 bytes for the row count of the first column (3)
This should help you understand the function that Fiona posted.
The binary is saved down little-endian, meaning the least-significant byte comes first and the most-significant byte is right-most. Doing this in decimal for the number 100 would give 001, with the 1s first, then the 10s, and finally the 100s (most significant) on the right. In the binary file, each group of 2 hex digits is one byte.
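As a quick illustration, taking the 4 row-count bytes from the dump above:
q)0x0 sv reverse 0x03000000    / reverse the little-endian bytes, then convert to an integer
3i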
You can use 1: to read in the contents of a binary file, with additional arguments in the list specifying the offset - where to start reading from, and how many bytes to read. In our case we want to start at byte 31 and read in 4 bytes, specifying the output should be an integer and to cut the input into separate 4 byte chunks.
q)first first (enlist "i";enlist 4)1:(`:test;31;4)
3i
Converting the little-endian bytes into an integer gives us the row count. Since this only has to read in 4 bytes instead of the whole file, it is a lot quicker.
For a table with 10000 rows and 2 columns there is not much difference:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10000;31;4)
0
q)\t count get `:test10000
0
For a table with 100m rows and 2 columns:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10m;31;4)
0
q)\t count get `:test10m
2023
If you have a splayed table instead, you can read in the number of elements in one of the columns from bytes 8-11 (offset 8, reading 4 bytes) like so, assuming the column is a simple list:
q)first first (enlist "i";enlist 4)1:(`:a;8;4)
3i
You can read more about reading in from binary files here https://code.kx.com/q/ref/filenumbers/#1-binary-files
You can make what you currently have a little more efficient by using something like the following:
counttables:count each get each hsym `$basepath,/:system "ls ",basepath
This improves the speed of the count by dropping the second directory listing and the construction of the filename/tablesize table which you are currently doing. You are correct though that if the tables were saved splayed you would only have to read in one column, making it much more efficient.
If your tables are stored uncompressed there's probably something quite hacky you could do with a read1 on the headers within the file until you find the first column header.
But v hacky :-(
Are you responsible for saving these down? Can you keep a running state as you do?

Tableau collate data from multiple columns

I am not sure what the technical term for what I am trying to do is.
Hoping raw data and output below will clearly define the use case.
Raw data :
This is what my raw data looks like
Output 1 :
this is what I am trying to extract first
Here I am trying to get a table where the first column has the name of the guests and 2nd column has the count of times they have featured in the table as a guest.
Output 2 :
this is what I am trying to extract next
Here I am trying to map months against names and see how many nights one has collected in which month.
One way to achieve this would be to create a temp table with 5 columns: column 1 with Guest names,
column 2 with count of occurrence in guest 1 column in raw data table,
column 3 with count of occurrence in guest 2 column in raw data table,
column 4 with count of occurrence in guest 3 column in raw data table,
column 5 with total of previous 3 columns.
But I am trying to find a proper solution through Tableau, if possible, because this way would not help me achieve Output 2.
Plain text raw data if you'd like to work on it :
booking by,Guest 1,Guest 2,Guest 3,stay start,stay end,hotel code
Ram,Seema,Ram,,May 1 2018,May 2 2018,BBST
Karan,Ram,Seema,,May 6 2018,May 7 2018,BRRLY
Mahesh,Mahesh,Seema,Ram,June 2 2018,June 4 2018,BBST
Krishna,Krishna,,,June 2 2018,June 3 2018,BRRLY
Seema,Seema,,,June 7 2018,June 8 2018,BRRLY

KDB/KX appending table to a file without reading the entire file

I'm new to KDB (sorry if this question is dumb). I'm creating the following table:
q)dsPricing:([id:`int$(); date:`date$()] open:`float$();close:`float$();high:`float$();low:`float$();volume:`int$())
q)`dsPricing insert(123;2003.03.23;1.0;3.0;4.0;2.0;1000)
q)`dsPricing insert(123;2003.03.24;1.0;3.0;4.0;2.0;2000)
q)save `:dsPricing
Let's say after saving I exit. After starting q again, I'd like to add another pricing item without loading the entire file, because the file could be large:
q)`dsPricing insert(123;2003.03.25;1.0;3.0;4.0;2.0;1500)
I've been looking at .Q.dpft but I can't really figure it out. Also this table/file doesn't need to be partitioned.
Thanks
You can upsert to the file handle of a table to append on disk; your example would look like this:
`:dsPricing upsert(123;2003.03.25;1.0;3.0;4.0;2.0;1500)
You can load the table into your q session using get, load or \l
q)get `:dsPricing
id date | open close high low volume
--------------| --------------------------
123 2003.03.23| 1 3 4 2 1000
123 2003.03.24| 1 3 4 2 2000
123 2003.03.25| 1 3 4 2 1500
.Q.dpft will save a table splayed (one file for each column in the table and a .d file containing the column names) with a parted attribute (`p#) on one of the symbol columns. Any symbol columns will also be enumerated by .Q.en.
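For completeness, a rough sketch of how .Q.dpft is normally called, using a hypothetical trade-style table (dsPricing has no symbol column to apply the parted attribute to, so it isn't a natural fit here):
q)trade:([]sym:`a`b`a`b;size:100 200 300 400;exchange:2 2 2 2)
q).Q.dpft[`:hdb;2018.01.01;`sym;`trade]    / db root, partition value, parted column, table name
`trade
This writes hdb/2018.01.01/trade/ splayed, enumerating sym against the sym file in hdb and applying the `p# attribute to the sym column.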

How to pick items from warehouse to minimise travel in TSQL?

I am looking at this problem from a TSQL point of view, however any advice would be appreciated.
Scenario
I have 2 sets of criteria which identify items in a warehouse to be selected.
Query 1 returns 100 items
Query 2 returns 100 items
I need to pick any 25 of the 100 items returned in query 1.
I need to pick any 25 of the 100 items returned in query 2.
- The items in query 1/2 will not be the same, ever.
Each item is stored in a segment of the warehouse.
A segment of the warehouse may contain numerous items.
I wish to select the 50 items (25 from each query) in a way as to reduce the number of segments I must visit to select the items.
Suggested Approach
My initial idea has been to combined the 2 result sets and produce a list of
Segment ID, NumberOfItemsRequiredInSegment
I would then select 25 items from each query, giving preference to those in segments with the highest NumberOfItemsRequiredInSegment. I know this would not be optimal, but it would be an easy-to-implement heuristic.
Questions
1) I suspect this is a standard combinatorial problem, but I don't recognise it... perhaps multiple knapsack? Does anyone recognise it?
2) Is there a better (easy-ish to implement) heuristic or solution - ideally in TSQL?
Many thanks.
This might also not be optimal, but I think it would at least perform fairly well.
Calculate this set for query 1.
Segment ID, NumberOfItemsRequiredInSegment
Take the top 25, just by sorting by NumberOfItemsRequiredInSegment. Call this subset A.
Take the top 25 from query 2 by joining to A and sorting by "case when A.segmentID is not null then 1 else 0 end, NumberOfItemsRequiredInSegmentFromQuery2".
Repeat this but take the top 25 from query 2 first, and return the better-performing of the 2 sets.
The one scenario where I think this fails would be if you got something like this:
Segment | Count Query 1 | Count Query 2
A       | 10            | 1
B       | 5             | 1
C       | 5             | 1
D       | 5             | 4
E       | 5             | 4
F       | 4             | 4
G       | 4             | 5
H       | 1             | 5
J       | 1             | 5
K       | 1             | 10
You need to make sure you choose A, D and E (rather than B or C) when choosing the best segments from query 1. To deal with this you'd almost certainly still need to join to query 2, so you can get the count from there to use as a tie-breaker.