Compare two compound columns in two tables - kdb

I have two tables (see below) with sym and lp. I want to pull out every row from tab1 that doesn't have the complete set of symbols corresponding to the same sym from tab2.
tab1:([]sym:`EUR`AUD`GBP;lp:(`aa`bb`cc;`dd`ee;`ff`gg`aa`ee))
tab2:([]sym:`EUR`AUD`GBP;lp:(`aa`bb`ff`cc;`ee`dd;`gg`ff`ee`aa`rr`xx))
i.e. my result should be:
tab3:([]sym:`EUR`GBP;lp:(`ff;`rr`xx))
Thanks

I think this might fit what you're looking for:
q)b: where 0 <> count each a: (exec lp from tab2) except' (exec lp from tab1)
q)update lp: a b from tab1 b
sym lp
----------
EUR ,`ff
GBP `rr`xx
One assumption I've made is that the syms are always in the same order in both tables. Is this always true?
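As a cross-check outside q, here is a minimal Python sketch of the same per-sym difference. It matches rows by sym rather than by position, so it does not rely on the ordering assumption. The dictionaries and the helper name missing_by_sym are illustrative only and are not part of the original answer.

```python
# Data copied from the question; plain dicts stand in for the q tables.
tab1 = {"EUR": ["aa", "bb", "cc"],
        "AUD": ["dd", "ee"],
        "GBP": ["ff", "gg", "aa", "ee"]}
tab2 = {"EUR": ["aa", "bb", "ff", "cc"],
        "AUD": ["ee", "dd"],
        "GBP": ["gg", "ff", "ee", "aa", "rr", "xx"]}


def missing_by_sym(have, reference):
    """For each sym in `have`, list the reference lps it lacks (reference order kept)."""
    result = {}
    for sym, lps in have.items():
        seen = set(lps)
        missing = [lp for lp in reference.get(sym, []) if lp not in seen]
        if missing:  # drop syms that already have the complete set
            result[sym] = missing
    return result


tab3 = missing_by_sym(tab1, tab2)
print(tab3)  # {'EUR': ['ff'], 'GBP': ['rr', 'xx']}
```

Because the lookup is keyed on sym, reordering the rows of either table leaves the result unchanged.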

An except on keyed tables might work here also:
q)(1!tab2)except''1!tab1
sym| lp
---| ----------
EUR| ,`ff
AUD| `symbol$()
GBP| `rr`xx
Performance wouldn't be good though.


How can I drop the first 252 rows by sym?

I have a table with sym-date indexing.
I'm trying to get the same table back, but skipping the first 252 rows for each symbol.
I expected it would be:
ungroup 252_select by sym from t
but this doesn't work. What am I doing wrong?
You are looking for something like this
select from t where 252<=(rank;i) fby sym
where rank returns the position in the sorted list and fby is used to apply this function to each subset of i when split on sym
Reasons why your attempt wasn't working
select by sym from t returns only the last row for each sym,
so 252_ then drops 252 of those per-sym rows rather than the first 252 rows of each sym.
ungroup is then likely failing because you have two or more columns whose vector elements differ in length.
If you wanted to do this via ungroup, you could do the following, using xgroup so as to keep all the rows in the grouping:
ungroup 252_/:/:`sym xgroup t
select from t where ({252<=til count x};i) fby sym
I came up with an admittedly more convoluted solution:
t:([] date:.z.D+til 1008;sym:(504#`A),(504#`B);px:1008?1.0); / test table
s:252; / # of elements to skip
ungroup (key tt)!{flip (x)_flip y}/: [s;tt[key tt:?[t;();((,)`sym)!(,)`sym;`date`px!`date`px]]]
The logic involves:
grouping by sym
assigning the result on the fly to table tt
processing the grouped dictionary one by one
reconstructing the table
I initially benchmarked this against the fby solution proposed above using a very small table, where the fby solution takes about half the time:
t:([] date:.z.D+til 10;sym:(5#`A),(5#`B);px:10?1.0);
s:2;
\t:100000 ungroup (key tt)!{flip (x)_flip y}/: [s;tt[key tt:?[t;();((,)`sym)!(,)`sym;`sym`px!`sym`px]]]
796
\t:100000 select from t where s<=(rank;i) fby sym
396
However, when the larger table proposed at the beginning (1008 rows in total, first 252 skipped per ticker) is used, the performance ranking changes:
\t:100000 select from t where s<=(rank;i) fby sym
2384
\t:100000 ungroup (key tt)!{flip (x)_flip y}/: [s;tt[key tt:?[t;();((,)`sym)!(,)`sym;`sym`px!`sym`px]]]
1679

SQL: How to prevent double summing

I'm not exactly sure what the term is for this but, when you have a many-to-many relationship when joining 2 tables and you want to sum up one of the variables, I believe that you can sum the same values over and over again.
What I want to accomplish is to prevent this from happening. How do I make sure that my sum function is returning the correct number?
I'm using PostgreSQL
Example:
Table 1                   Table 2
SampleID  DummyName       SampleID  DummyItem
1         John            1         5
1         John            1         4
2         Doe             1         5
3         Jake            2         3
3         Jake            2         3
                          3         2
If I join these two tables ON SampleID, and I want to sum the DummyItem for each DummyName, how can I do this without double summing?
The solution is to first aggregate and then do the join:
select t1.sampleid, t1.dummyname, t.total_items
from table_1 t1
join (
select t2.sampleid, sum(dummyitem) as total_items
from table_2 t2
group by t2.sampleid
) t ON t.sampleid = t1.sampleid;
The real question is however: why are the duplicates in table_1?
I would take a step back and try to assess the database design. Specifically, what rules allow such duplicate data?
To address your specific issue given your data, here's one option: create a temp table that contains unique rows from Table 1, then join the temp table with Table 2 to get the sums I think you are expecting.
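To make the effect concrete, here is a minimal sketch that loads the question's sample data into an in-memory SQLite database (an assumption made only so the example is self-contained; the asker uses PostgreSQL) and compares a naive join-then-sum with the aggregate-first approach. A select distinct subquery stands in for the temp table of unique rows suggested above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table table_1 (sampleid integer, dummyname text);
create table table_2 (sampleid integer, dummyitem integer);
insert into table_1 values (1,'John'),(1,'John'),(2,'Doe'),(3,'Jake'),(3,'Jake');
insert into table_2 values (1,5),(1,4),(1,5),(2,3),(2,3),(3,2);
""")

# Naive: join first, then sum. Each duplicate row in table_1 repeats the sum.
naive = conn.execute("""
    select t1.dummyname, sum(t2.dummyitem)
    from table_1 t1
    join table_2 t2 on t2.sampleid = t1.sampleid
    group by t1.dummyname
    order by t1.dummyname
""").fetchall()

# Aggregate table_2 first, and join against the unique rows of table_1.
fixed = conn.execute("""
    select t1.dummyname, t.total_items
    from (select distinct sampleid, dummyname from table_1) t1
    join (
        select sampleid, sum(dummyitem) as total_items
        from table_2
        group by sampleid
    ) t on t.sampleid = t1.sampleid
    order by t1.dummyname
""").fetchall()

print(naive)  # [('Doe', 6), ('Jake', 4), ('John', 28)]
print(fixed)  # [('Doe', 6), ('Jake', 2), ('John', 14)]
```

John and Jake each appear twice in Table 1, so the naive query doubles their totals, while the aggregate-first query reports each total once.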

How do I add several columns at once in kdb?

Somehow, I can only find examples that show how to add one column.
So I have written this code, which works, but I know there is a much better way to do this:
table t already exists with columns filled with data, and I need to add new columns that are initially null.
t: update column1:` from t;
t: update column2:` from t;
t: update column3:` from t;
t: update column4:` from t;
I tried making it a function:
colNames:`column1`column2`column3`column4;
t:{update x:` from t}each colNames;
But this only added one column and called it x.
Any suggestions to improve this code will be greatly appreciated. I have to add a lot more than just 4 columns and my code is very long because of this. Thank you!
Various ways to achieve this....
q)newcols:`col3`col4;
q)@[tab;newcols;:;`]
col1 col2 col3 col4
-------------------
a 1
b 2
c 3
You can also specify different types:
q)@[tab;newcols;:;(`;0N)]
col1 col2 col3 col4
-------------------
a 1
b 2
c 3
Or do a functional update
q)![`tab;();0b;newcols!count[newcols]#enlist (),`]
`tab

kdb Update entire column with data from another table

I have two partitioned tables. Table A is my main table and Table B is full of columns that are exact copies of some of the columns in Table A. However, there is one column in Table B that has data I need, because the matching column in Table A is full of nulls.
I would like to get rid of Table B completely, since most of it is redundant, and update the matching column in Table A with the data from the one column in Table B.
Visually,
Table A:                  Table B:
a  b     c   d            a  b    d
__________________        ______________
1  null  11  A            1  joe  A
2  null  22  B            2  bob  B
3  null  33  C            3  sal  C
I want to fill the b column in Table A with the values from the b column in Table B, and then I no longer need Table B and can delete it. I will have to do this repeatedly since these two tables are given to me daily from two separate sources.
I cannot key these tables, since they are both partitioned.
I have tried:
update columnb:(exec columnb from TableB) from TableA;
but I get a `length error.
Suggestions on how to approach this in any manner are appreciated.
To replace a column in memory you would do the following.
t1:([]a:1 2 3;b:0N)
a b
---
1
2
3
t2:([]c:`aa`bb`cc;b:5 6 7)
c b
----
aa 5
bb 6
cc 7
t1,'t2
a b c
------
1 5 aa
2 6 bb
3 7 cc
If you are getting length errors, then the two tables do not have the same row count, and the following would work around it. The obvious problem with this solution is that it will start to repeat data if t2 has fewer rows than t1. You will have to find out why that is.
t1,'count[t1]#t2
Now for partitions, you will use the amend function to change the b column of the partitioned table, table A, at date 2007.02.23 (or whatever date your partition is).
This loads the b column of tableB into memory to perform the amend. You must perform the amend for each partition.
@[`:2007.02.23/tableA/;`b;:;count[tableA]#exec b from select b from tableB where date=2007.02.23]

what is count(*) % 2 = 1

I see a query like
select *
from Table1
group by Step
having count(*) % 2 = 1
What is the trick about having count(*) % 2 = 1
Can anyone explain?
edit: What are the common usage areas?
Well, % is the modulo operator, which gives the remainder of a division, so it gives 0 when the number is exactly divisible by 2 (even) and 1 when it is not (i.e. it is odd). So the query basically selects elements for which the count is odd (as said above).
Would that not be checking if you have an odd number of entries per step?
It will return all the steps which had an odd number of rows.
just test it
declare @t1 table (step char(1))
insert into @t1(step)
select 'a'
union all select 'b'
union all select 'b'
union all select 'c'
union all select 'c'
union all select 'c'
union all select 'd'
union all select 'd'
union all select 'd'
union all select 'd'
select * from @t1
group by step
having count(*)%2 = 1
That will return the values of column step that occur an odd number of times.
In this example it will return
'a'
'c'
The select * is confusing here though, and I would rather write it as
select step from @t1
group by step
having count(*)%2 = 1
or, for even more visibility,
select step, count(*) from @t1
group by step
having count(*)%2 = 1
A reason to do this:
Say you want to separate the odd and even entries into two columns. You could use the even one for one of them and the odd for the other.
I also put this in a comment but wasn't getting a response.
The COUNT(*) will count all the rows in the database. The % is the modulus character, which will give you the remainder of a division problem. So this is dividing all rows by two and returning those which have a remainder of 1 (meaning an odd number of rows.)
As Erik pointed out, that would not be all the rows, but rather the ones grouped by step, meaning this is all the odd rows per step.
It's impossible for us to answer your question without knowing what the tables are used for.
For a given "Step" it might be that it is required to have an equal amount of "something" and that this will produce a list of elements to be displayed in some interface where this is not the case.
Example:
Let's forget "Steps" for a moment and assume this was a table of students and that "Step" was instead the "Groups" the students are divided into. A requirement for a group is that it has an even number of students, because the students will work in pairs. For an administrative tool you could write a query like this to see a list of groups where this is not true.
Group: Count
A, 10
B, 9
C, 17
D, 8
E, 4
F, 5
And the query will return groups B, C, F
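The student-group example can be checked with a small sketch. It uses an in-memory SQLite database, and the table name students and column name grp are made up for illustration; the group sizes are the ones listed above.

```python
import sqlite3

# Group sizes from the example above.
counts = {"A": 10, "B": 9, "C": 17, "D": 8, "E": 4, "F": 5}

conn = sqlite3.connect(":memory:")
conn.execute("create table students (grp text)")
for grp, n in counts.items():
    # One row per student; only the group label matters here.
    conn.executemany("insert into students values (?)", [(grp,)] * n)

# Keep only the groups whose row count leaves remainder 1 when divided by 2.
odd_groups = [row[0] for row in conn.execute("""
    select grp
    from students
    group by grp
    having count(*) % 2 = 1
    order by grp
""")]

print(odd_groups)  # ['B', 'C', 'F']
```

The having clause filters after grouping, so count(*) here is the per-group row count, not the size of the whole table.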
Thanks to everybody. All of you said the query returns the grouped rows that have an odd count,
but that is not the point! I will continue to inspect this case and will write up what the programmer had in mind (if I find who wrote this).
Lessons learned: Programmers must write comments about stupid logic like that...