How to sort columns in an HDB to apply the p attribute - kdb

I have an HDB that is date partitioned. I want to apply the p attribute historically to a specific column. As far as I am aware, to do this I first need to ensure the column is sorted so that all common occurrences are adjacent, which is currently not the case. How can I sort this HDB so that, within each partition, this column has common values adjacent to each other?
Thank you!

You can use xasc on disk.
https://code.kx.com/q/ref/asc/#xasc
You'd want to sort each partition and apply the parted attribute. You could build up the partition paths with .Q.PD & .Q.PV (which conform pairwise, one location per partition value), as I don't think this functionality exists in dbmaint.q.
This is just a general idea; it is untested, so run it on some test data first and modify it to fit your HDB structure if needed.
You may need to modify the xasc part if you want additional sorting within each parted group.
{[tbl;sortPartCol]
 {[sortPartCol;path]
  sortPartCol xasc path;            / sort the partition's table on disk
  @[path;sortPartCol;`p#]           / apply the parted attribute to the sorted column
  }[sortPartCol] each distinct ` sv/:(.Q.PD,'`$string .Q.PV),\:tbl
 }
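For example, assuming the HDB has been loaded with \l (so .Q.PD and .Q.PV are populated) and contains a table trade with a sym column (hypothetical names), assign the function above to a name, say sortAndPart, and run:
sortAndPart[`trade;`sym]   / sorts each trade partition by sym and sets `p# on sym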
https://code.kx.com/q/ref/dotq/#qpv-partition-values
https://code.kx.com/q/ref/dotq/#qpd-partition-locations

Related

kdb: Is there a way to efficiently merge two sorted tables?

Say we have two tables both sorted on the time column:
t1:`time xasc ([]time:5?100;v:5?1000)
t2:`time xasc ([]time:5?100;v:5?1000)
Is there an efficient way to get the same result as `time xasc t1,t2 , using the fact that the two tables are already sorted? I looked at aj but I wasn't able to find the "combine two tables" functionality I need here.
There is no native merge-sort/binary-sort in kdb+, so the optimal available approach is asc x,y (or `time xasc t1,t2 for tables). If you go down the path of replicating a merge/binary sort in q, you're unlikely to make it faster than the native asc x,y. Alternatively, you could write a merge/binary sort in C and import it as a shared library for use in kdb+.
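For instance, a quick timing of the recommended approach (a sketch with hypothetical sizes):
n:1000000;
t1:`time xasc ([]time:n?10000;v:n?1000);
t2:`time xasc ([]time:n?10000;v:n?1000);
\t `time xasc t1,t2   / native concatenate-and-sort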

kdb: getting one row from HDB

For a normal table, we can select one row using select[1] from t. How can I do this for an HDB?
I tried select[1] from t where date=2021.02.25 but it gives an error.
This is not yet implemented for partitioned tables: "it probably makes sense, but it's not defined nor implemented, and needs more thinking about as the language evolves".
The select[n] syntax works only if the table is already loaded into memory.
The easiest way to get the first row of an HDB table is:
1#select from t where date=2021.02.25
select[n] will work if applied on already loaded data, e.g.
select[1] from select from t where date=2021.02.25
I've done this before for ad-hoc queries by using the virtual index column i, which should avoid the cost of pulling all data into memory just to select a couple of rows. If your query needs to apply its constraints first before pulling a subset, this is a reasonable solution.
It will, however, pull n rows for each date partition selected, due to the way q queries work under the covers. So YMMV; this might not be the best solution behind an API, for example.
/ 5 rows (i[5] is the 6th row)
select from t where date=2021.02.25, sym=`abcd, price=1234.5, i<i[5]
If your table is date partitioned, you can simply run
select col1,col2 from t where date=2021.02.25,i=0
That will get the first record from 2021.02.25's partition, and avoid loading every record into memory.
Per your first request (which is different from the above), select[1] from t, you can achieve that with
.Q.ind[t;enlist 0]
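Note that .Q.ind takes global row numbers across all partitions, so, for example (a sketch):
.Q.ind[t;til 5]   / first five rows of the entire partitioned table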

kdb q - group table within partition

Starting from a fresh partitioned database mydb, I am saving the following three tables table1, table2, table3 in partitions 2018.01.01, 2018.01.02, 2018.01.03, respectively:
npertable:10000000;
table1:([]date:npertable?2018.01.01+til 25;acc:npertable?`C123`C132`C321`C121`C131;c:npertable?til 100);
table2:([]date:npertable?2018.02.01+til 25;acc:npertable?`C123`C132`C321`C121`C131;c:npertable?til 100);
table3:([]date:npertable?2018.03.01+til 25;acc:npertable?`C123`C132`C321`C121`C131;c:npertable?til 100);
table1:`date xasc table1;
table2:`date xasc table2;
table3:`date xasc table3;
`:mydb/2018.01.01/t/ set .Q.en[`:mydb;table1];
`:mydb/2018.01.02/t/ set .Q.en[`:mydb;table2];
`:mydb/2018.01.03/t/ set .Q.en[`:mydb;table3];
You can see that I have different acc groups that I will later select on.
When I additionally sort the tables by acc before storing, I get a slight speedup (253 vs 391 milliseconds). So if I later want to query
select from t where date=2018.01.01, acc=`C123
is sorting by acc before storing the best I can do? Or is there something in how the partitions are stored that will create an index for the different acc groups?
Thanks for the help
I think you should use the parted attribute to optimize your queries.
For example, you can use this snippet to sort each partition by acc and apply the attribute:
{@[`acc xasc .Q.par[`:mydb;x;`t];`acc;`p#]}'[2018.01.01 2018.01.02 2018.01.03]
For more details about the parted attribute and its effects you can read this whitepaper from KX -> https://kx.com/media/2017/11/Columnar_database_and_query_optimization.pdf
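After running the above and reloading the HDB, you can confirm the attribute took effect, for example (a sketch, assuming you load the database from its root):
\l mydb
meta t   / the a (attribute) column should now show p against acc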
Also please be aware that you could use a month partition to suit your needs.
If I understand your example correctly, the day component of your partition values would always be 01, so you can reduce the yyyy.mm.dd date partitions to yyyy.mm month partitions.
I.e. instead of using
`:mydb/2018.01.01/t/ set .Q.en[`:mydb;table1];
you can simply use
`:mydb/2018.01/t/ set .Q.en[`:mydb;table1];
You can find more details about achieving this here -> https://code.kx.com/wiki/JB:KdbplusForMortals/partitioned_tables#1.3.7.2_Monthly

Perl : Tracking duplicates

I am trying to figure out the best way to locate duplicates in six-column CSV data. The real data has more than a million rows in it.
The six columns are as follows:
Name, address, city, post-code, phone number, machine number
The data does not have fixed lengths, and data in certain columns might be missing in certain instances.
I am thinking of using Perl to first normalize all the short forms used in names, cities and addresses. Fellow Perl enthusiasts from Stack Overflow have helped me a lot.
But there would still be a lot of data which would be difficult to match.
So I am wondering: is it possible to match content based on "likeness/similarity" (e.g. google similar to gugl)? The similarity matching would be needed to overcome errors that crept in while collecting the data.
I have two tasks in hand w.r.t. the data:
Flag duplicate rows with a certain identifier
Mention the percentage match between similar rows
I would really appreciate suggestions as to what methods could be employed and which would probably be best on their merits.
You could write a Perl program to do this, but it will be easier and faster to put it into a SQL database and use that.
Most SQL databases have a way to import CSV. For this answer, I suggest PostgreSQL because it has very powerful string functions which you will need to find your fuzzy duplicates. Create your table with an auto-incremented ID column if your CSV data doesn't already have unique IDs.
Once the import is done, add indexes on the columns you want to check for duplicates.
CREATE INDEX name ON whatever (name);
You can do a self-join to look for duplicates in whatever way you like. Here's an example that finds duplicate names.
SELECT t1.id, t2.id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE t1.name = t2.name;
PostgreSQL has powerful string functions including regexes to do the comparisons.
Indexes will have a hard time working on expressions like lower(t1.name). Depending on the sorts of duplicates you want to find, you can add indexes on those expressions (this is a feature of PostgreSQL). For example, if you want to search case-insensitively, you can add an index on the lower-cased name. (Thanks @asjo for pointing that out.)
CREATE INDEX ON whatever ((lower(name)));
-- This will be much faster
SELECT t1.id, t2.id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE lower(t1.name) = lower(t2.name);
A "likeness" match can be achieved in several ways, a simple one would be to use the fuzzystrmatch functions like metaphone(). Same trick as before, add a column with the transformed row and index it.
Other simple things like data normalization are better done on the data itself before adding indexes and looking for duplicates. For example, trim and squish extra whitespace (note the 'g' flag, so every run of whitespace is replaced, not just the first):
UPDATE whatever SET name = trim(both from name);
UPDATE whatever SET name = regexp_replace(name, '[[:space:]]+', ' ', 'g');
Finally, you can use the Postgres trigram module, pg_trgm, to add fuzzy indexing to your table (thanks again to @asjo).
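For example, a sketch using pg_trgm's similarity operator; the similarity score also gives you the percentage-style match mentioned in the question (the match threshold defaults to 0.3 and can be tuned via pg_trgm.similarity_threshold):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- trigram index supporting the % similarity operator
CREATE INDEX ON whatever USING gin (name gin_trgm_ops);
-- pairs of rows with similar names, scored between 0 and 1
SELECT t1.id, t2.id, similarity(t1.name, t2.name) AS score
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE t1.name % t2.name;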

How to correctly enum and partition a kdb table?

I put together a few lines to partition my kdb table, which contains string columns of course and thus must be enumerated.
I wonder if this code is completely correct or if it can be simplified further. In particular, I have some doubts about the need to create a partitioned table schema, given that the memory table and the disk table will have exactly the same layout. Also, there might be a way to avoid creating the temporary tbl_mem and tbl_mem_enum tables:
...
tbl_mem: select ts,sym,msg_type from oms_mem lj sym_mem;
tbl_mem_enum: .Q.en[`$sym_path] tbl_mem;
delete tbl_mem from `.;
(`$db;``!((17;2;9);(17;2;9))) set ([]ts:`time$(); ticker:`symbol$(); msg_type:`symbol$());
(`$db) upsert (select ts,ticker:sym,msg_type from tbl_mem_enum)
delete tbl_mem_enum from `.;
PS: I know I shouldn't use "_" to name variables, but then what do I use to separate words in a variable or function name? "." is also a kdb function.
I think you mean that your table contains symbol columns - these are the columns that you need to enumerate (strings don't need enumeration). You can do the write and the enumeration in a single step. Also, if you are using the same compression algorithm/level on all columns, it may be easier to just use .z.zd:
.z.zd:17 2 9i;
(`$db) set .Q.en[`$sym_path] select ts, ticker:sym, msg_type from oms_mem lj sym_mem;
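To verify the compression settings were applied, -21! can be used to inspect a compressed column file, for example (a sketch; exact path construction depends on your db variable):
-21! ` sv (`$db),`ts   / returns compressedLength, uncompressedLength, algorithm, logicalBlockSize, zipLevel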
It's generally recommended to use camelCase instead of '_'. Some useful info here: http://www.timestored.com/kdb-guides/q-coding-standards