What is the meaning of `s attribute on a table? - kdb

In the Abridged Q Language Manual Arthur mentioned:
`s#table marks the table to use binary search and marks first column sorted
And if we look into 3.6 version:
N:1000000;
t1:t2:([]n:til N; m:N?`6);
t1:update `p#n from t1;
t2:`s#t2;
(meta t1)[`n]`a / `p
(meta t2)[`n]`a / `p
attr t1 / `
attr t2 / `s
\ts:10000 select count i from t1 where n in 1000?N
/ ~7000
\ts:10000 select count i from t2 where n in 1000?N
/ ~7000
we find that yes, t2 has the `s attribute.
But for some reason the attribute on the first column is not `s but `p. The search times are also the same, and so are the sizes of both tables with attributes (I checked with the objsize function described in the AquaQ blog post).
So are there any differences in 3.6+ versions of q between `s#table and a table with a `p# attribute on the first column?

I think the only way that the s# on the table itself would improve search times is if you were doing lookups using ? as described here: https://code.kx.com/q/ref/find/#searching-tables
q)\ts:100000 t1?t1[0]
105 800
q)\ts:100000 t2?t2[0]
86 800
q)
q)\ts:100000 t1?t1[500000]
108 800
q)\ts:100000 t2?t2[500000]
83 800
q)
q)\ts:100000 t1?t1[999999]
107 800
q)\ts:100000 t2?t2[999999]
83 800
It behaves differently for a keyed table (turns it into a step function) but I think that's beyond the scope of your original question.
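To illustrate why a sorted marker can help find-style lookups, here is a minimal Python sketch. It is not q and makes no claim about how kdb+ implements `s# internally; the function names find_linear and find_sorted are made up for this example. It only contrasts a left-to-right scan with a binary search that is valid when the keys are known to be ascending.

```python
from bisect import bisect_left

def find_linear(keys, target):
    """Scan left to right; return len(keys) when the target is absent."""
    for i, k in enumerate(keys):
        if k == target:
            return i
    return len(keys)

def find_sorted(keys, target):
    """Binary search; only correct when keys are in ascending order."""
    i = bisect_left(keys, target)
    if i < len(keys) and keys[i] == target:
        return i
    return len(keys)

keys = list(range(100_000))
# Both strategies agree on the answer; only the amount of work differs.
first_hit = (find_linear(keys, 0), find_sorted(keys, 0))
last_hit = (find_linear(keys, 99_999), find_sorted(keys, 99_999))
miss = (find_linear(keys, -5), find_sorted(keys, -5))
```

The scan touches every element for a late or missing key, while the binary search needs roughly log2(n) comparisons wherever the key sits.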

How can I return all numeric values in PostgreSQL?

I need to select all the values from a column that contain numbers (NULL values are not of interest). How can I query my table using something like this:
SELECT c1
FROM t1
WHERE c1 ...;
I tried to find suitable regular expressions, which are popular in many imperative programming languages, but they didn't fit for PostgreSQL, unfortunately (or I used them incorrectly)
Possible solution:
SELECT c1 FROM t1 WHERE c1 ~ '^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$';
From Pattern Matching:
SELECT
fld
FROM (
VALUES ('1'),
('test1'),
('34'),
('dog'),
('3cat')) AS t (fld)
WHERE
regexp_match(fld, '^[[:digit:]]{1,}$') IS NOT NULL;
fld
-----
1
34
Another possibility:
SELECT c1 FROM t1 WHERE c1 ~ '[0-9]';
From this table named STACKOVERFLOW:
id_stack | name
----------+-------------------
1 | Niklaus Wirth
2 | Bjarne Stroustrup
3 | Linus Torvalds
4 | Donald Knuth
5 | C3PO
6 | R2D2
7 | ZX Spectrum +3
The Query SELECT NAME FROM STACKOVERFLOW WHERE NAME ~ '[0-9]'; will return:
name
C3PO
R2D2
ZX Spectrum +3
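The answers above differ mainly in whether the pattern is anchored to the whole value or may match anywhere inside it. The following Python sketch shows that difference using Python's re module, not PostgreSQL's regex engine (for example, Python does not support the [[:digit:]] class, so [0-9] is used instead). The sample values are the ones from the VALUES list above plus one made-up entry, "-2.5e3".

```python
import re

values = ["1", "test1", "34", "dog", "3cat", "-2.5e3"]  # "-2.5e3" is an added example

# Whole value must be a number (sign, decimal point and exponent are optional).
number = re.compile(r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?")
# Whole value must consist of digits only.
digits_only = re.compile(r"[0-9]+")
# Value merely has to contain a digit somewhere.
any_digit = re.compile(r"[0-9]")

# fullmatch plays the role of the ^...$ anchors; search is the unanchored case.
numeric = [v for v in values if number.fullmatch(v)]
integers = [v for v in values if digits_only.fullmatch(v)]
contains_digit = [v for v in values if any_digit.search(v)]
```

Which variant is right depends on whether "contain numbers" means the value is a number or just has a digit in it.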
SELECT * FROM <table> WHERE <column> ~ 'PRF:\d{11}-\d{4}'

Enums for tables

I found no information about what the enum is over the table domain on https://code.kx.com/q/ref/enumerate/. But something interesting exists there: https://code.kx.com/q/kb/linking-columns. I tried those examples and found an enum structure that behaves in some situations like a normal enum, but has a strange behaviour in others.
q)kt:1!t:([]a:`a`b`c;b:10 20 30)
q)tt:([]k:`a`a`a`b;d:11 21 31 41)
q)show et1:`t!t[`a]?tt[`k]
`t!0 0 0 1
q)show et2:`kt$tt[`k]
`kt$`a`a`a`b
q)meta select k,d,et1,et2 from tt
c | t f a
---| ------
k | s
d | j
et1| j t
et2| s kt
q)select r1.a, r1.b, r2.a, r2.b from update r1:et1, r2:et2 from tt
a b a1 b1
----------
a 10 a 10
a 10 a 10
a 10 a 10
b 20 b 20
From this perspective et1 and et2 both have similar behaviour. But if we check other enum properties, we see differences:
q)et2[0]
`kt$`a
q)et2[0]:`a
q)
q)et1[0]
`t!0
q)et1[0]:0 / neither this works
't
[0] et1[0]:0
^
q)et1[0]:(`a`b!(`a;10)) / nor that
't
[0] et1[0]:(`a`b!(`a;10))
^
The situation seems even stranger if we build enums over keyed tables only. Compare a table with one key column and a table with two:
q)kkt:2!t:([]a:`a`b`c;b:10 20 30;c:11 22 33)
q)kt:1!0!kkt
q)show ekkt:`kkt$((`a;10);(`b;20);(`b;20))
`kkt!0 1 1
q)show ekt:`kt$(`a`b`b)
`kt$`a`b`b
For kkt we get the same hardcoded (with !) enum notation.
So the question is: what are these enums over a table, the ones with the familiar $ notation and the ones with the hardcoded ! notation? Is it possible to apply the enum-extend technique (?) to them, and how? And is there any documentation for them?
What you're seeing is the difference between a simple foreign key and a linked column. As mentioned in the documentation, differences include:
a foreign key is specifically designed to link to the keys of a keyed table
a foreign key does not allow the link if there is an "unknown key" that isn't one of the keys in the keyed table
linked columns can link to any arbitrary column, even if a value doesn't appear in the other table (so they do not guarantee referential integrity)
linked columns are generally used for on-disk tables
q)kt:([eid:1001 1002 1003] name:`Dent`Beeblebrox`Prefect; iq:98 42 126)
q)tdetails2:([] eid:1003 1001 1002 1001 1002 1001 777;sc:126 36 92 39 98 42 7)
q)update linker:`kt!((0!kt)`eid)?eid from `tdetails2
`tdetails2
q)select linker.name from tdetails2
name
----------
Prefect
Dent
Beeblebrox
Dent
Beeblebrox
Dent
The last row (eid 777, which is not a key of kt and so resolves to a null name) would not have been allowed for a simple foreign key.
Also, I don't know why you would want to modify or edit the values of an enumeration. Don't do that!
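The contrast between the two kinds of link can be sketched in Python. This is an analogy only, not q semantics in full: the helpers find, foreign_key, linked_column and follow are hypothetical names for this example. It assumes, as in the linker example above, that a lookup miss yields an index one past the end and that following such an index yields a null.

```python
employee_ids = [1001, 1002, 1003]
employee_names = ["Dent", "Beeblebrox", "Prefect"]
detail_ids = [1003, 1001, 1002, 1001, 1002, 1001, 777]

def find(keys, value):
    """Index of value in keys, or len(keys) when it is missing."""
    try:
        return keys.index(value)
    except ValueError:
        return len(keys)

def foreign_key(keys, values):
    """Strict link: refuse to build it if any value is not a known key."""
    unknown = [v for v in values if v not in keys]
    if unknown:
        raise ValueError(f"unknown keys: {unknown}")
    return [keys.index(v) for v in values]

def linked_column(keys, values):
    """Loose link: store row indices, including out-of-range ones for misses."""
    return [find(keys, v) for v in values]

def follow(column, link):
    """Dereference a link; out-of-range indices resolve to None (a null)."""
    return [column[i] if i < len(column) else None for i in link]

link = linked_column(employee_ids, detail_ids)
names = follow(employee_names, link)
```

The loose link accepts 777 and simply dereferences to a null, which is why it gives no referential-integrity guarantee.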

kdb: best way to delete all rows from a table

Given table t:([c1:1 2]c2:3 4) I've come across two ways of clearing its content:
t:0#t
delete from `t
Apart from the fact that option (2) returns symbol t, are there other differences between the two?
One note on the timing test you did: you should not re-assign the table when running several iterations. After the first iteration the data is already gone, so the table is really only emptied once.
q)n:100000000
q)tbl:([]a:til n;b:n?`3;c:n?1000.0)
q)\ts:3 0N!"Count tbl : ",string count tbl;tbl:0#tbl
"Count tbl : 100000000"
"Count tbl : 0"
"Count tbl : 0"
0 2544
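The same pitfall can be reproduced in a few lines of Python. This is only an illustration of the benchmarking mistake, not of kdb+ behaviour; sizes_seen is a made-up helper that records how many rows each iteration actually had to clear.

```python
def sizes_seen(make_table, iterations, rebuild_each_time):
    """Record the table size at the start of every timed iteration."""
    table = make_table()
    seen = []
    for _ in range(iterations):
        if rebuild_each_time:
            table = make_table()  # restore the data before clearing again
        seen.append(len(table))
        table = table[:0]         # rough analogue of tbl:0#tbl
    return seen

def make_table():
    return list(range(5))

without_rebuild = sizes_seen(make_table, 3, rebuild_each_time=False)
with_rebuild = sizes_seen(make_table, 3, rebuild_each_time=True)
```

Without a rebuild only the first iteration does real work, so averaging over the iterations understates the cost.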
Doing it again with more data and just once shows:
q)n:100000000
q)tbl:([]a:til n;b:n?`3;c:n?1000.0)
q)\ts delete from `tbl
69 268435856
q)tbl:([]a:til n;b:n?`3;c:n?1000.0)
q).Q.gc[]
3489660928
q)\ts tbl:0#tbl
0 944
So it looks more efficient to re-assign using : (that is, tbl:0#tbl).
There is no actual difference in the data returned, however the operations differ in speed slightly.
q)b:([c1:1 2]c2:3 4)
q)\ts:1000000 delete from `b
644 720
q)a:([c1:1 2]c2:3 4)
q)\ts:1000000 a:0#a
195 944
q)b~a
1b

Can PostgreSQL LAG() function refer to itself?

I've just discovered the LAG() function in PostgreSQL and I've been experimenting to see what it can achieve. I thought that I might calculate a factorial with it, so I wrote
SELECT i, i * lag(factorial, 1, 1) OVER (ORDER BY i, 1) as factorial FROM generate_series(1, 10) as i;
But the online IDE complains with error 42703: column "factorial" does not exist.
Is there any way I can access the result of previous LAG call?
You can't refer to the column recursively in its definition.
However, you can express the factorial calculation as:
SELECT i, EXP(SUM(LN(i)) OVER w)::int factorial
FROM generate_series(1, 10) i
WINDOW w AS (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);
-- outputs:
i | factorial
----+-----------
1 | 1
2 | 2
3 | 6
4 | 24
5 | 120
6 | 720
7 | 5040
8 | 40320
9 | 362880
10 | 3628800
(10 rows)
PostgreSQL also supports an advanced SQL feature called recursive queries, which can be used to express the factorial table recursively:
WITH RECURSIVE series AS (
SELECT i FROM generate_series(1, 10) i
)
, rec AS (
SELECT i, 1 factorial FROM series WHERE i = 1
UNION ALL
SELECT series.i, series.i * rec.factorial
FROM series
JOIN rec ON series.i = rec.i + 1
)
SELECT *
FROM rec;
what EXP(SUM(LN(i)) OVER w) does:
This exploits the mathematical identities that:
[1]: log(a * b * c) = log (a) + log (b) + log (c)
[2]: exp (log a) = a
[combining 1&2]: exp(log a + log b + log c) = a * b * c
SQL does not have an aggregate multiply operation, so to perform one we first take the log of each value, then use the sum aggregate function to get the log of the values' product. This we invert with the final exponentiation step.
This works as long as the values being multiplied are positive, because log is undefined for 0 and negative numbers. If you have negative numbers or zero, the trick is: if any value is 0, the whole aggregation is 0; otherwise aggregate the absolute values, and the result is positive if the number of negative values is even and negative if it is odd. Alternatively, you could move to the complex plane and use the identity Log(z) = ln(r) + iθ, where θ = π for a negative real number.
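The zero and sign handling described above can be sketched in Python. This is an illustration of the arithmetic only, not SQL; product_via_logs is a made-up name, and the result is a float that is only approximately equal to the exact product.

```python
import math

def product_via_logs(values):
    """Multiply values using exp(sum(log(|v|))), fixing up zeros and signs."""
    if any(v == 0 for v in values):
        return 0.0                      # any zero makes the whole product zero
    negatives = sum(1 for v in values if v < 0)
    magnitude = math.exp(sum(math.log(abs(v)) for v in values))
    # An odd number of negative factors makes the product negative.
    return -magnitude if negatives % 2 else magnitude

all_positive = product_via_logs([1, 2, 3, 4])
two_negatives = product_via_logs([-2, 3, -4])
one_negative = product_via_logs([-2, 3, 4])
with_zero = product_via_logs([5, 0, -1])
```

Because of floating-point rounding, compare the result with a tolerance or round it before casting to an integer.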
what ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW does
This declares an expanding window frame that includes all preceding rows, and the current row.
e.g.
when i equals 1 the values in this window frame are {1}
when i equals 2 the values in this window frame are {1,2}
when i equals 3 the values in this window frame are {1,2,3}
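As a rough Python analogy for the expanding frame (not how PostgreSQL evaluates window functions), each row sees a prefix of the ordered input, and an aggregate over those prefixes is a running aggregate:

```python
from itertools import accumulate
from operator import mul

series = list(range(1, 11))

# The frame for row k is every row up to and including row k.
frames = [series[: k + 1] for k in range(len(series))]

# A product over each expanding frame is just a running product.
running_factorial = list(accumulate(series, mul))
```

Here frames[2] is [1, 2, 3], matching the third bullet above, and the running product reproduces the factorial column.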
what is a recursive query
A recursive query lets you express recursive logic using SQL. Recursive queries are often used to generate parent-child relationships from relational data (think manager-report, or product classification hierarchy), but they can generally be used to query any tree like structure.
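For intuition, the factorial query above can be mimicked in Python as a loop that starts from the anchor row and keeps joining the newest rows back to the series until nothing new is produced. This is a simplified mental model under that assumption, not a description of PostgreSQL's executor.

```python
series = list(range(1, 11))

# Anchor member: SELECT i, 1 factorial FROM series WHERE i = 1
rec = [(i, 1) for i in series if i == 1]
working = list(rec)

# Recursive member: join the rows from the previous step to series on i = rec.i + 1
while working:
    next_rows = [
        (i, i * factorial)
        for (prev_i, factorial) in working
        for i in series
        if i == prev_i + 1
    ]
    rec.extend(next_rows)
    working = next_rows  # only the newest rows feed the next step
```

The loop ends when the join finds no row for i = 11, which is what terminates the recursion in the SQL version too.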
Here is a SO answer I wrote a while back that illustrates and explains some of the capabilities of recursive queries.
There are also a tonne of useful tutorials on recursive queries. It is a very powerful SQL language feature and solves a type of problem that is very difficult to do without recursion.
Hope this gives you more insight into what the code does. Happy learning!

slow aggregate on multiple columns

This kdb query that aggregates multiple columns takes approximately 31 seconds compared to 3 seconds with J
Is there a faster way to do the sum in kdb?
Ultimately this will be running against a partitioned database on the 32-bit version
/test 1 - using symbols
n: 13000000;
cust: n?`8;
prod: n?`8;
v: n?100
a:([]cust:cust; prod:prod ;v:v)
/query 1 - using simple by
q)\t select sum(v) by cust, prod from a
31058
/query 2 - grouping manually
\t {sum each x[`v][group[flip (x[`cust]; x[`prod])]]}(select v, cust, prod from a)
12887
/query 3 - simpler method of accessing
\t {sum each a.v[group x]} (flip (a.cust;a.prod))
11576
/test 2 - using strings, very slow
n: 13000000;
cust: string n?`8;
prod: string n?`8;
v: n?100
a:([]cust:cust; prod:prod ;v:v)
q)\t select sum(v) by cust, prod from a
116745
comparison J code
n=:13000000
cust=: _8[\ a. {~ (65+?(8*n)#26)
prod=: _8[\ a. {~ (65+?(8*n)#26)
v=: ?.n#100
agg=: 3 : 0
keys=:i.~ |: i.~ every (cust;prod)
c=.((~.keys) { cust)
p=.((~.keys) { prod)
s=.keys +//. v
c;p;s
)
NB. 3.57 seconds
6!:2 'r=.agg 0'
3.57139
({.#$) every r
13000000 13000000 13000000
Update:
From the kdbplus forums, we can get down to about 2x the speed difference
q)\t r:(`cust`prod xkey a inds) + select sum v by cust,prod from a til[count a] except inds:(select cust,prod from a) ? d:distinct select cust,prod from a
6809
Update 2: added another dataset per #user3576050
This dataset has the same overall number of rows, but is distributed 4 instances per group
n: 2500000
g: 4
v: (g*n)?100
cust: (g*n)#(n?`8)
prod: (g*n)#(n?`8)
b:([]cust:cust; prod:prod ;v:v)
q)\ts select sum v by cust, prod from b
9737 838861968
The previous query runs poorly on the new dataset
q)\ts r:(`cust`prod xkey b inds) + select sum v by cust,prod from b til[count b] except inds:(select cust,prod from b) ? d:distinct select cust,prod from b
17181 671090384
If you update this data less frequently than you query it, how about pre-computing a group index? It’s about the same cost to create as a single query, and it allows querying at ~30x the speed.
q)\ts select sum v by cust,prod from b
14014 838861360
q)\ts update g:`g#{(key group x)?x}flip(cust;prod)from`b
14934 1058198384
q)\ts select first cust,first prod,sum v by g from b
473 201327488
q)
The results match up to row order and schema details:
q)(select sum v by cust,prod from b)~`cust`prod xasc 2!delete g from select first cust,first prod,sum v by g from b
1b
q)
(BTW, I know essentially nothing about J, but my guess would be that it’s computing a similar multi-column group index. q’s g index is unfortunately (currently?) limited to plain vector data—if it were possible to somehow apply it to the combination of cust and prod, I expect we’d see results like mine from the simple query.)
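The idea of paying for the multi-column grouping once and reusing it can be sketched in Python. This is an analogy with tiny made-up data, not the q implementation: the group id of a row is the position of its (cust, prod) pair among the distinct pairs in order of first appearance, which is what {(key group x)?x} computes above.

```python
from collections import defaultdict

# Made-up sample data for illustration.
cust = ["a", "b", "a", "c", "b", "a"]
prod = ["x", "y", "x", "z", "y", "w"]
v = [1, 2, 3, 4, 5, 6]

# Build once: map each distinct (cust, prod) pair to a small integer id.
pair_ids = {}
g = [pair_ids.setdefault(pair, len(pair_ids)) for pair in zip(cust, prod)]

# Query many times: aggregate on the single integer column instead of the pair.
sums = defaultdict(int)
for gid, value in zip(g, v):
    sums[gid] += value

# Map the ids back to the original key columns for presentation.
result = {pair: sums[gid] for pair, gid in pair_ids.items()}
```

The index has to be rebuilt or extended whenever rows are added, so it pays off only when queries outnumber updates.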
You are using a pathological dataset: a set of random symbols of length 8 will have few duplicates, which makes the grouping redundant.
q)n:13000000; (count distinct n?`8)%n
0.9984848
The `p#/`g# attributes (mentioned in the comments above) will have no impact on performance, for the same reason.
You will see better performance with more appropriate data.
q)n:1000000
q)
q)a:([]cust:n?`8; prod:n?`8; v:n?100)
q)b:([]cust:n?`3; prod:n?`3; v:n?100)
q)
q)\ts select sum v by cust, prod from a
3779 92275568
q)
q)\ts select sum v by cust, prod from b
762 58786352