How can I drop the first 252 rows by sym? - kdb

I have a table with sym-date indexing.
I'm trying to get the same table back, but skipping the first 252 rows for each symbol.
I expected it would be:
ungroup 252_select by sym from t
but this doesn't work. What am I doing wrong?

You are looking for something like this:
select from t where 252<=(rank;i) fby sym
Here rank returns each item's position in the sorted list, and fby applies that function to each subset of i when split on sym.
Reasons why your attempt wasn't working:
select by sym from t returns only the last row for each sym.
Therefore when you drop rows using 252_ you are dropping the first 252 of those per-sym last rows, not the first 252 rows of each sym.
ungroup is then likely failing because you have two or more columns whose vector elements have different lengths.
If you wanted to do this via ungroup you could do the following, using xgroup so as to keep all the rows in the grouping:
ungroup 252_/:/:`sym xgroup t
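To make the fby and xgroup routes concrete, here is a toy version of the same idea (a hypothetical 6-row table, skipping the first 2 rows per sym instead of 252):

```q
t:([] sym:`A`A`A`B`B`B; px:1 2 3 4 5 6)   / toy table
select from t where 2<=(rank;i) fby sym    / keep each sym's rows from position 2 onwards
ungroup 2_/:/:`sym xgroup t                / same rows, via grouping then dropping
```

Both queries return the rows with px 3 and 6.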

select from t where 1=({252<=til count x};i) fby sym / lambda form: flag positions at or past 252 within each sym

I came up with an admittedly more convoluted solution:
t:([] date:.z.D+til 1008;sym:(504#`A),(504#`B);px:1008?1.0); / test table
s:252; / # of elements to skip
ungroup (key tt)!{flip (x)_flip y}/: [s;tt[key tt:?[t;();((,)`sym)!(,)`sym;`date`px!`date`px]]]
The logic involves:
grouping by sym
assigning the result on the fly to table tt
processing the grouped dictionary one by one
reconstructing the table
Now, I initially benchmarked against the fby solution proposed above using a very small table, and the solution using fby is roughly twice as fast:
t:([] date:.z.D+til 10;sym:(5#`A),(5#`B);px:10?1.0);
s:2;
\t:100000 ungroup (key tt)!{flip (x)_flip y}/: [s;tt[key tt:?[t;();((,)`sym)!(,)`sym;`date`px!`date`px]]]
796
\t:100000 select from t where s<=(rank;i) fby sym
396
However, when the larger table proposed at the beginning (1008 rows in total, first 252 skipped per ticker) is used, the performance ranking changes:
\t:100000 select from t where s<=(rank;i) fby sym
2384
\t:100000 ungroup (key tt)!{flip (x)_flip y}/: [s;tt[key tt:?[t;();((,)`sym)!(,)`sym;`date`px!`date`px]]]
1679


Why does select and exec give different results for aggregate-function column

At first glance the next two queries should give the same result:
q)exec a from select sum[a] from ([]a:1 2)
,3
q)exec sum[a] from ([]a:1 2)
3
but as we see, their return types are different.
Why does the exec in this example not act like a regular select (just without the column name)?
In case 1 you are first creating a table of length 1 by applying the sum function to the a column within the select statement (the output of a select statement is always a table).
You are then running exec to pull the raw column (which is a list of length 1) from that table.
In case 2 you are directly accessing the a column within the exec statement and performing the sum aggregation on this list. The result will thus be a scalar.
The select in the first piece of code is creating an intermediary table which is not present in the second piece of code.
With regard to the return types, this is also a somewhat special-case example of exec where you are only requesting one column. If you aggregated multiple columns the result would be a dictionary (again comprised of scalar values):
q)t:([]a:1 2 3;b:4 5 6)
q)exec sum a,sum b from t
a| 6
b| 15
select returns columns which must be lists, not atoms. Your sum returns an atom, but kdb recognizes this aggregation and automatically enlists your atom under the covers. E.g.
/this works even though sum would return an atom
select sum[a] from ([]a:1 2)
/this doesn't work because kdb doesn't recognize the aggregation function
select {sum x}[a] from ([]a:1 2)
/this works by manually enlisting
select {enlist sum x}[a] from ([]a:1 2)
The list of recognized aggregations for which this occurs is .Q.a0.
So in case 1 you've forced a hidden enlist, and when you exec it out it's still enlisted. In case 2 there's no hidden enlist, since exec doesn't mandate column/list output, i.e. it allows atom output.
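The type difference is easy to check directly in a session:

```q
q)type exec a from select sum[a] from ([]a:1 2)   / a one-item long list
7h
q)type exec sum[a] from ([]a:1 2)                 / a long atom
-7h
```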

Optimize the query using group by on certain condition in q kdb

We have a table t as below
q)t:([] sym:10?`GOOG`AMZN`IBM; px:10?100.; size:10?1000; mkt:10?`ab`cd`ef)
Our requirement is to group table t by column sym where the mkt value is `ef; for the rest of the markets (`ab`cd) we need all the rows (no group by).
For this use case I have written below query which works as expected,
q)(select px, size, sym, mkt from select by sym from t where mkt=`ef), select px, size, sym, mkt from t where mkt in `ab`cd
Please help me optimize the above query, i.e. in pseudocode:
if mkt=`ef:
then use group by on table
else if mkt in `ab`cd:
don't use group by on table
I have found two ways to write your query that differ from the one you have provided.
You can use the following query to accomplish what you want in one select statement:
select from t where (mkt<>`ef)|(mkt=`ef)&i=(last;i)fby ([]sym;mkt)
However if you compare its speed:
q)\t:1000 select from t where (mkt<>`ef)|(mkt=`ef)&i=(last;i)fby ([]sym;mkt)
68
to your original query:
q)\t:1000 (select px, size, sym, mkt from select by sym from t where mkt=`ef), select px, size, sym, mkt from t where mkt in `ab`cd
40
You can see that your query is faster.
Additionally, you can try this, which does not require explicitly listing every mkt in t that you wish to leave ungrouped:
(0!select by sym from t where mkt=`ef),select from t where mkt<>`ef
But again this ends up being around the same speed as your original solution:
q)\t:1000 (0!select by sym from t where mkt=`ef),select from t where mkt<>`ef
42
So in terms of optimization it seems your query works well for what you want it to accomplish.
This isn't any quicker either (as Rob says, your query is already good in terms of speed), but it is shorter at least:
delete x from select by sym,(1+i)*`ef<>mkt from t
...provided you don't mind the order changing a little.
In fby form
select from t where i=(last;i)fby([]sym;(1+i)*`ef<>mkt)
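The (1+i)*`ef<>mkt grouping key works because every `ef row is multiplied down to 0, so all `ef rows for a sym share one group and collapse to the last row, while every other row keeps a unique key 1+i and survives on its own. A minimal sketch on a hypothetical 3-row table:

```q
q)t:([] sym:`a`a`b; px:1 2 3; mkt:`ef`ef`cd)
q)select by sym,(1+i)*`ef<>mkt from t   / the two `ef rows collapse; the `cd row stays
sym x| px
-----| --
a   0| 2
b   3| 3
```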

select query with where condition using variables instead of column names in q kdb

I have a table with columns sym, px, size
t:([] sym:`GOOG`IBM`APPL; px:10 20 30; size:1000 2000 3000)
Now, if I assign sym column to variable ab
ab:`sym
Then running the query below does not give the proper output:
select [ab],px from t where [ab]=`IBM / returns empty table
?[t;(=;`sym;`IBM);0b;[ab]`px![ab]`px] / 'type error
Got understanding here and here but could not create a working query.
The answer above is close but there are some things to consider. The query you are running is basically:
q)parse"select sym,px from t where sym=`IBM"
?
`t
,,(=;`sym;,`IBM)
0b
`sym`px!`sym`px
The key thing here is that , usually indicates that a term needs to be enlisted. Additionally, for the dictionary of column names you just need to join the value ab to `px. With all that in mind I have modified your query above:
q)?[t;enlist(=;`sym;enlist`IBM);0b;(ab,`px)!ab,`px]
sym px
------
IBM 20
And assuming the where clause should also refer to ab:
q)?[t;enlist(=;ab;enlist`IBM);0b;(ab,`px)!ab,`px]
sym px
------
IBM 20
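Because the column dictionary is just symbols mapped to symbols, the same functional form extends to any number of columns held in a variable (a sketch; cs is a hypothetical list of column names):

```q
q)t:([] sym:`GOOG`IBM`APPL; px:10 20 30; size:1000 2000 3000)
q)cs:`sym`px`size
q)?[t;enlist(=;`sym;enlist`IBM);0b;cs!cs]
sym px size
-----------
IBM 20 2000
```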

Optimal use of LIKE on indexed column

I have a large table (about 1 million rows, 7 columns including the primary key). The table contains two columns (i.e. symbol_01 and symbol_02) that are indexed and used for querying. This table contains rows such as:
id symbol_01 symbol_02 value_01 value_02
1 aaa bbb 12 15
2 bbb aaa 12 15
3 ccc ddd 20 50
4 ddd ccc 20 50
As per the example, rows 1 and 2 are identical except that symbol_01 and symbol_02 are swapped, but they have the same values for value_01 and value_02. That is true once again with rows 3 and 4. This is the case for the entire table: there are essentially two rows for each combination of symbol_01+symbol_02.
I need to figure out a better way of handling this to get rid of the duplication. So far the solution I am considering is to just have one column called symbol which would be a combination of the two symbols, so the table would be as follows:
id symbol value_01 value_02
1 ,aaa,bbb, 12 15
2 ,ccc,ddd, 20 50
This would cut the number of rows in half. As a side note, every value in the symbol column will be unique. Results always need to be queried for using both symbols, so I would do:
select value_01, value_02
from my_table
where symbol like '%,aaa,%' and symbol like '%,bbb,%'
This would work, but my question is around performance. This is still going to be a big table (and will get bigger soon). So my question is: is this the best solution for this scenario, given that symbol will be indexed, every symbol combination will be unique, and I will need to use LIKE to query results?
Is there a better way to do this? I'm not sure how great LIKE is for performance, but I don't see an alternative.
There's no high performance solution, because your problem is shoehorning multiple values into one column.
Create a child table (with a foreign key to your current/main table) to separately hold all the individual values you want to search on, index that column and your query will be simple and fast.
With this index:
create index symbol_index on t (
least(symbol_01, symbol_02),
greatest(symbol_01, symbol_02)
)
The query would be:
select *
from t
where
least(symbol_01, symbol_02) = least('aaa', 'bbb')
and
greatest(symbol_01, symbol_02) = greatest('aaa', 'bbb')
Or simply delete the duplicates:
delete from t
using (
select distinct on (
greatest(symbol_01, symbol_02),
least(symbol_01, symbol_02),
value_01, value_02
) id
from t
order by
greatest(symbol_01, symbol_02),
least(symbol_01, symbol_02),
value_01, value_02
) s
where id = s.id
Depending on the column semantics it might be better to normalize the table as suggested by @Bohemian.

T-SQL - CROSS APPLY to a PIVOT? (using pivot with a table-valued function)?

I have a table-valued function, basically a split-type function, that returns up to 4 rows per string of data.
So I run:
select * from dbo.split('a','1,a15,b20,c40;2,a25,d30;3,e50')
I get:
Seq Data
1 15
2 25
However, my end data needs to look like
15 25
so I do a pivot.
select [1],[2],[3],[4]
from dbo.split('a','1,a15,b20,c40;2,a25,d30;3,e50')
pivot (max(data) for seq in ([1],[2],[3],[4]))
as pivottable
which works as expected:
1 2
--- ---
15 25
HOWEVER, that's great for one row. I now need to do it for several hundred records at once. My thought is to do a CROSS APPLY, but not sure how to combine a CROSS APPLY and a PIVOT.
(yes, obviously the easy answer is to write a modified version that returns 4 columns, but that's not a great option for other reasons)
Any help greatly appreciated.
And the reason I'm doing this: the current query uses a scalar-valued version of SPLIT, called 12 times within the same SELECT against the same million rows (where the data string is 500+ bytes).
So far as I know, that would require scanning the same 500 bytes × 1,000,000 rows, 12 times.
This is how you use CROSS APPLY. Assume table1 is your table and Line is the field in your table you want to split:
SELECT * FROM table1 as a
cross apply dbo.split('a', a.Line) as b
pivot (max(data) for seq in ([1],[2],[3],[4])) as p