kdb q - group table within partition

Starting from a fresh partitioned database mydb, I am saving the following three tables table1, table2, table3 in partitions 2018.01.01, 2018.01.02, 2018.01.03, respectively:
npertable:10000000;
table1:([]date:npertable?2018.01.01+til 25;acc:npertable?`C123`C132`C321`C121`C131;c:npertable?til 100);
table2:([]date:npertable?2018.02.01+til 25;acc:npertable?`C123`C132`C321`C121`C131;c:npertable?til 100);
table3:([]date:npertable?2018.03.01+til 25;acc:npertable?`C123`C132`C321`C121`C131;c:npertable?til 100);
table1:`date xasc table1;
table2:`date xasc table2;
table3:`date xasc table3;
`:mydb/2018.01.01/t/ set .Q.en[`:mydb;table1];
`:mydb/2018.01.02/t/ set .Q.en[`:mydb;table2];
`:mydb/2018.01.03/t/ set .Q.en[`:mydb;table3];
You can see that I have different acc groups that I will later select on.
When I additionally sort the tables by acc before storing, I get a slight speedup (253 vs 391 milliseconds). So if I later want to query
select from t where date=2018.01.01, acc=`C123
is sorting by acc before storing the best I can do? Or is there something in storing the partitions that will create an index for the different acc groups?
Thanks for the help

I think you should use the parted attribute for optimizing your queries.
For example, you can use this bit to sort by acc and apply the attribute.
{@[`acc xasc .Q.par[`:mydb;x;`t];`acc;`p#]}'[2018.01.01 2018.01.02 2018.01.03]
For more details about the parted attribute and its effects you can read this whitepaper from KX -> https://kx.com/media/2017/11/Columnar_database_and_query_optimization.pdf
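As a quick check (a hypothetical session; assumes you reload the HDB after applying the attribute), meta should then report the attribute for acc:
\l mydb
meta t   / the a column should now show p against acc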
Also be aware that a month partition may suit your needs.
If I understand your example properly, your partition names effectively encode year.day.month - the middle component is always 01 while the last one tracks the month of the data - so you can reduce this to year.month partitions.
i.e. instead of using
`:mydb/2018.01.01/t/ set .Q.en[`:mydb;table1];
you can simply use
`:mydb/2018.01/t/ set .Q.en[`:mydb;table1];
You can find more details about achieving this here -> https://code.kx.com/wiki/JB:KdbplusForMortals/partitioned_tables#1.3.7.2_Monthly

Related

How to sort columns in an HDB to apply the p attribute

I have an HDB that is date partitioned. I want to apply the p attribute historically to a specific column. As far as I am aware, to do this I first need to ensure the column is sorted so that all common occurrences are adjacent. Currently, this is not the case. How can I sort this HDB so that this column in each partition has common values adjacent to each other?
Thank you!
You can use xasc on disk.
https://code.kx.com/q/ref/asc/#xasc
You'd want to sort each partition and apply the parted attribute. You could build up the paths with .Q.PD & .Q.PV, as I don't think this is something that exists in dbmaint.q.
This is just a general idea; it is untested, so try it on some test data first and modify it to fit your HDB structure if needed.
You may need to modify the xasc part if you want additional sorting within each partition.
{[tbl;sortPartCol]
 / build the path to tbl in every partition: partitionDir/partitionValue/tbl
 paths:` sv/: (.Q.PD,'`$string .Q.PV),\:tbl;
 / sort each splayed partition on disk, then set the parted attribute
 {[c;path] c xasc path; @[path;c;`p#]}[sortPartCol] each paths
 }
https://code.kx.com/q/ref/dotq/#qpv-partition-values
https://code.kx.com/q/ref/dotq/#qpd-partition-locations
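Hypothetical usage, assuming the HDB is already loaded (so .Q.PD and .Q.PV are populated) and the lambda above has been assigned to a name such as sortAndPart:
sortAndPart[`t;`acc]   / sort t on acc in every partition and set the parted attribute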

kdb: Is there a way to efficiently merge two sorted tables?

Say we have two tables both sorted on the time column:
t1:`time xasc ([]time:5?100;v:5?1000)
t2:`time xasc ([]time:5?100;v:5?1000)
Is there an efficient way to get the same result as `time xasc t1,t2, using the fact that the two tables are already sorted? I looked at aj but I wasn't able to find the "combine two tables" functionality I need here.
There is no native merge-sort/binary-sort in kdb, so the optimal available approach is asc x,y. If you go down the path of replicating a merge/binary sort in q, you're unlikely to get it faster than the native asc x,y. Alternatively, you could write a merge/binary sort in C and import it as a shared library to use in kdb.
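A minimal sketch of that approach; the last expression just illustrates that the native xasc result matches the reordering given by iasc on the combined time column:
t1:`time xasc ([]time:5?100;v:5?1000)
t2:`time xasc ([]time:5?100;v:5?1000)
r:`time xasc t1,t2              / native sort of the concatenation
r~(t1,t2) iasc (t1,t2)`time     / 1b: same rows via the sort permutation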

kdb: getting one row from HDB

For a normal table, we can select one row using select[1] from t. How can I do this for an HDB?
I tried select[1] from t where date=2021.02.25 but it gives an error.
This is the nyi error, which the kdb+ error reference describes as: "not yet implemented: it probably makes sense, but it's not defined nor implemented, and needs more thinking about as the language evolves"
select[n] syntax works only if the table is already loaded in memory.
The easiest way to get the first row of an HDB table is:
1#select from t where date=2021.02.25
select[n] will work if applied on already loaded data, e.g.
select[1] from select from t where date=2021.02.25
I've done this before for ad-hoc queries by using the virtual index i, which should avoid the cost of pulling all data into memory just to select a couple of rows. If your query needs to map constraints in first before pulling a subset, this is a reasonable solution.
It will however pull N rows for each date partition selected due to the way that q queries work under the covers. So YMMV and this might not be the best solution if it was behind an API for example.
/ first 5 rows (i=5 would be the 6th row)
select from t where date=2021.02.25, sym=`abcd, price=1234.5, i<5
If your table is date partitioned, you can simply run
select col1,col2 from t where date=2021.02.25,i=0
That will get the first record from 2021.02.25's partition, and avoid loading every record into memory.
Per your first request, select[1] from t (which is different to the above), you can achieve that with
.Q.ind[t;enlist 0]
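If you want the first row of one specific partition via .Q.ind, the global row index can be derived from the per-partition counts. A sketch, assuming a loaded date-partitioned HDB containing table t:
offset:sum (.Q.cn t) where .Q.pv<2021.02.25   / rows in all earlier partitions
.Q.ind[t;enlist offset]                       / first row of the 2021.02.25 partition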

Postgres extended statistics with partitioning

I am using Postgres 13 and have created a table with columns A, B and C. The table is partitioned by A with 2 possible values. Partition 1 contains 100 possible values each for B and C, whereas partition 2 has 100 completely different values for B, and 1 different value for C. I have set the statistics target for both columns to the maximum so that this definitely isn't the cause of any issue.
If I group by B and C on either partition, Postgres estimates the number of groups correctly. However if I run the query against the base table, where I really want it, it estimates what I assume is no functional dependency between A, B and C, i.e. (p1B + p2B) * (p1C + p2C) = 200 * 101, as opposed to the reality of p1B * p1C + p2B * p2C = 10000 + 100.
I guess I was half expecting it to sum the underlying partitions rather than use the full count of 200 B's and 101 C's that the base table can see. Moreover, if I also add A into the group by then the estimate erroneously doubles further still, as it then thinks that this set will also be duplicated for each value of A.
This all made me think that I need an extended statistic to tell it that A influences either B or C or both. However if I create one on the base table and analyze, the value in pg_statistic_ext_data.stxdndistinct is null. Whereas if I create it on the partitions themselves, this does appear to work, though it isn't particularly useful because the estimation is already correct at that level. How do I go about having Postgres estimate against the base table correctly without having to run the query against all of the partitions and unioning them together?
You can define extended statistics on a partitioned table, but PostgreSQL doesn't collect any data in that case. You'll have to create extended statistics on all partitions individually.
You can confirm that by querying the collected data after an ANALYZE:
SELECT s.stxrelid::regclass AS table_name,
       s.stxname AS statistics_name,
       d.stxdndistinct AS ndistinct,
       d.stxddependencies AS dependencies
FROM pg_statistic_ext AS s
   JOIN pg_statistic_ext_data AS d
      ON d.stxoid = s.oid;
There is certainly room for improvement here; perhaps don't allow defining extended statistics on a partitioned table in the first place.
I found that I just had to turn enable_partitionwise_aggregate on to get this to estimate correctly.
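The setting is off by default and can be enabled per session. A minimal illustration (the table and column names here are hypothetical):
SET enable_partitionwise_aggregate = on;
-- the planner may now aggregate each partition separately, so the
-- per-partition statistics drive the group estimate
EXPLAIN SELECT b, c, count(*) FROM base_table GROUP BY b, c;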

How to implement application level pagination over ScalarDB

This question is part Cassandra and part ScalarDB. I am using ScalarDB, which provides ACID support on top of Cassandra. The library seems to be working well! Unfortunately, ScalarDB doesn't support pagination though, so I have to implement it in the application.
Consider this scenario in which P is primary key, C is clustering key and E is other data within the partition
Partition => { P,C1,E1
P,C2,E1
P,C2,E2
P,C2,E3
P,C2,E4
P,C3,E1
...
P,Cm,En
}
In ScalarDB, I can specify start and end values of keys, so I suppose ScalarDB will get data only from the specified rows. I can also limit the number of entries fetched.
https://scalar-labs.github.io/scalardb/javadoc/com/scalar/db/api/Scan.html
Say I want to get entries E3 and E4 from P,C2. For smaller values, I can specify start and end clustering keys as C2 and set fetch limit to say 4 and ignore E1 and E2. But if there are several hundred records then this method will not scale.
For example say P,C1 has 10 records, P,C2 has 100 records and I want to implement pagination of 20 records per query. Then to implement this, I'll have to
Query 1 – Scan – primary key will be P, clustering start will be C1, clustering end will be Cn as I don’t know how many records are there.
get P,C1. This will give 10 records
get P,C2. This will give me 20 records. I'll ignore last 10 and combine P,C1's 10 with P,C2's first 10 and return the result.
I'll also have to maintain that the last cluster key queried was C2 and also that 10 records were fetched from it.
Query 2 (for next pagination request) - Scan – primary key will be P, clustering start will be C2, clustering end will be Cn as I don’t know how many records are there.
Now I'll fetch P,C2 and get 20, ignore 1st 10 (as they were sent last time), take the remaining 10, do another fetch using same Scan and take first 10 from that.
Is this how it should be done, or is there a better way? My concern with the above implementation is that every time I'll have to fetch loads of records and dump them. For example, say I want to get records 70-90 from P,C2: then I'll still query up to record 60 and dump the result!
The partition key and the clustering keys together compose the primary key, so your example above doesn't look right.
Let's assume the following data structure.
P, C1, ...
P, C2, ...
P, C3, ...
...
Anyway, I think one of the ways could be as follows, assuming the page size is 2 (a code sketch follows these steps).
Scan with start (P, C1) inclusive, ascending and with limit 2. Results stored in R1
Get the last record of R1 -> (P, C2).
Scan with start the previous last record (P, C2) not inclusive, ascending with limit 2.
...
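A rough sketch of those steps in Java, using the Scan builder methods from the linked javadoc (withStart, withOrdering, withLimit); the column names, value types and transaction wiring here are assumptions, not taken from the question:

import java.util.List;
import com.scalar.db.api.DistributedTransaction;
import com.scalar.db.api.Result;
import com.scalar.db.api.Scan;
import com.scalar.db.exception.transaction.CrudException;
import com.scalar.db.io.Key;
import com.scalar.db.io.TextValue;

public class PaginationSketch {
  // Fetch one page of rows from partition p, starting after the clustering
  // key of the previous page's last row (pass null for the first page).
  static List<Result> nextPage(DistributedTransaction tx, String p,
                               String lastClusteringKey, int pageSize)
      throws CrudException {
    Scan scan = new Scan(new Key(new TextValue("p", p)))   // partition key
        .withOrdering(new Scan.Ordering("c", Scan.Ordering.Order.ASC))
        .withLimit(pageSize);
    if (lastClusteringKey != null) {
      // exclusive start: skip the row the previous page ended on
      scan.withStart(new Key(new TextValue("c", lastClusteringKey)), false);
    }
    return tx.scan(scan);
  }
}

The clustering key to pass into the next call comes from the last Result of the returned page (e.g. via its getValue("c")).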