kdb: best way to delete all rows from a table

Given table t:([c1:1 2]c2:3 4), I've come across two ways of clearing its content:
t:0#t
delete from `t
Apart from the fact that option (2) returns the symbol `t, are there any other differences between the two?

Just on that timing test you did: you should not re-assign the table when running multiple iterations. After the first iteration you have lost your data, so it really only deleted from the table once.
q)n:100000000
q)tbl:([]a:til n;b:n?`3;c:n?1000.0)
q)\ts:3 0N!"Count tbl : ",string count tbl;tbl:0#tbl
"Count tbl : 100000000"
"Count tbl : 0"
"Count tbl : 0"
0 2544
Doing it again, timing each method just once on a freshly built table, shows:
q)n:100000000
q)tbl:([]a:til n;b:n?`3;c:n?1000.0)
q)\ts delete from `tbl
69 268435856
q)tbl:([]a:til n;b:n?`3;c:n?1000.0)
q).Q.gc[]
3489660928
q)\ts tbl:0#tbl
0 944
So it looks more efficient to re-assign with tbl:0#tbl.

There is no actual difference in the data returned; however, the operations differ slightly in speed.
q)b:([c1:1 2]c2:3 4)
q)\ts:1000000 delete from `b
644 720
q)a:([c1:1 2]c2:3 4)
q)\ts:1000000 a:0#a
195 944
q)b~a
1b
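One further practical difference, illustrated in a minimal sketch below: delete from `t amends the global table by name (hence the symbol return), so it cannot empty a function-local table, whereas re-assignment with 0# works anywhere. Applied by value, both leave an identical empty table with the schema intact:
q)t:([c1:1 2]c2:3 4)
q)(0#t)~delete from t                      / by value, both return the same empty table
1b
q)meta 0#t                                 / column names and types are preserved
c | t f a
--| -----
c1| j
c2| j
q){t2:([c1:1 2]c2:3 4); count t2:0#t2}[]   / re-assignment works on a local table
0
q){t2:([c1:1 2]c2:3 4); delete from `t2}[] / by-name delete looks for a global t2
't2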

Related

Removing duplicated rows in PostgreSQL

I have thousands of lines of duplicate data in a PostgreSQL database. To find out which rows are duplicated, I am using this code:
SELECT "Date" FROM stockdata
group by "Date"
having count("Date")>1
This again produced thousands of lines of the date column which have more than 1 entry. How can I remove the rows so that just 1 entry of each duplicated date remains?
P.S. I cannot use a primary key when entering data.
Update
As per the comments: there is no primary key. Also, the Date should be unique, so there should never be 2 or more rows with the same date.
The df looks like this:
Date High Low Open Close Volume Adj Close
0 2017-04-03 893.489990 885.419983 888.000000 891.510010 3422300 891.510010
1 2017-04-04 908.539978 890.280029 891.500000 906.830017 4984700 906.830017
2 2017-04-05 923.719971 905.619995 910.820007 909.280029 7508400 909.280029
3 2017-04-06 917.190002 894.489990 913.799988 898.280029 6344100 898.280029
4 2017-04-07 900.090027 889.309998 899.650024 894.880005 3710900 894.880005
... ... ... ... ... ... ... ...
12595 2022-03-28 1097.880005 1053.599976 1065.099976 1091.839966 34168700 1091.839966
12596 2022-03-29 1114.770020 1073.109985 1107.989990 1099.569946 24538300 1099.569946
12597 2022-03-30 1113.949951 1084.000000 1091.170044 1093.989990 19955000 1093.989990
12598 2022-03-31 1103.140015 1076.640015 1094.569946 1077.599976 16265600 1077.599976
12599 2022-04-01 1094.750000 1066.640015 1081.150024 1076.352783 11449987 1076.352783
12600 rows × 7 columns
The data is repeated a few times in places. However, rows with the same date will have the same data.
This is not stock data (I am using it as a troubleshooting example) but data from a Yokogawa datalogger: https://www.yokogawa.com/in/solutions/products-platforms/data-acquisition/data-logger/#Overview
There are redundancies in the system, and the earlier integrator just dumped all the data into 1 database; thus, whenever a redundant logger comes online, the database gets multiple entries. I need to remove them so we can actually use the data. I don't have access to their software.
Further Update:
Using this code as suggested in the comments:
delete from stockdata s
using
(SELECT "Date" , max(ctid) as max_ctid from stockdata group by "Date") t
where s.ctid<>t.max_ctid
and s."Date"=t."Date";
It was able to do the job, but going forward, is this a dangerous solution for production?
This should do the trick:
DELETE FROM stockdata a
USING stockdata b
WHERE a.ctid < b.ctid
AND a."Date" = b."Date";
But be careful: this will immediately delete all duplicates, and there is no way to restore them.

What is the meaning of `s attribute on a table?

In the Abridged Q Language Manual Arthur mentioned:
`s#table marks the table to use binary search and marks first column sorted
And if we test in version 3.6:
N:1000000;
t1:t2:([]n:til N; m:N?`6);
t1:update `p#n from t1;
t2:`s#t2;
(meta t1)[`n]`a / `p
(meta t2)[`n]`a / `p
attr t1 / `
attr t2 / `s
\ts:10000 select count i from t1 where n in 1000?N
/ ~7000
\ts:10000 select count i from t2 where n in 1000?N
/ ~7000
we find that yes, t2 has the s attribute.
But for some reason the attribute on the first column is not s but p. The search times are also the same, and the sizes of both tables with their attributes are the same - I used the objsize function described in an AquaQ blog post to make sure.
So are there any differences in q 3.6+ between `s#table and a table with the `p# attribute on its first column?
I think the only way that `s# on the table itself would improve search times is if you were doing lookups using ? as described here: https://code.kx.com/q/ref/find/#searching-tables
q)\ts:100000 t1?t1[0]
105 800
q)\ts:100000 t2?t2[0]
86 800
q)
q)\ts:100000 t1?t1[500000]
108 800
q)\ts:100000 t2?t2[500000]
83 800
q)
q)\ts:100000 t1?t1[999999]
107 800
q)\ts:100000 t2?t2[999999]
83 800
It behaves differently for a keyed table (it turns lookup into a step function), but I think that's beyond the scope of your original question.
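For reference, a minimal sketch of that step-function behaviour on a plain dictionary (a keyed table is itself a dictionary, so its lookups behave analogously): with `s# applied, looking up a missing key returns the value of the largest key not exceeding it rather than a null.
q)d:1 5 10!`a`b`c
q)d 7         / plain dictionary: no exact match, so a null is returned
`
q)(`s#d) 7    / sorted dictionary acts as a step function: the largest key <= 7 is 5
`b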

Length error in update statement in q kdb

I have a table where I want to update a few columns of a row based on a condition:
q)t:([] id:10 20; l1:("Blue hor";"Antop"); l2:("Malad"; "KC"); pcd:("NCD";"FRB") )
When I use an update statement, it throws a 'length error:
q)update l1:"Chin", l2:"Gor" from t where id=10
'length
q)update l1:"Chin", l2:"Gor" from `t where id=10
'length
I read the below in Q for Mortals, but is there any way to update a few columns of a row based on a condition?
The actions in the Where phrase and the Update phrase are vector
operations on entire column lists. This is the Zen of update.
Please try the statement below:
update l1:count[i]#enlist"Chin", l2:count[i]#enlist"Gor" from t where id=10
It works regardless of how many rows match the where clause.
On update, the length of the assigned list should equal the number of updated rows. Q treats a string as a list of characters. This is why, when you assign "Chin" to l1, q tries to assign a list of length 4 where a list of length 1 is expected, causing the 'length error.
count[i]#enlist"Chin" creates a list of N repeated values, ("Chin";"Chin";...), where N is the number of updated rows. This fixes the issue.
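For instance, here is a quick sketch with a condition matching both rows of the example table:
q)update l1:count[i]#enlist"Chin" from t where id in 10 20
id l1     l2      pcd
-----------------------
10 "Chin" "Malad" "NCD"
20 "Chin" "KC"    "FRB"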
As you are dealing with char lists here (rather than symbols), you need to use enlist:
q)update l1:enlist "Chin", l2:enlist "Gor" from t where id=10
id l1      l2    pcd
----------------------
10 "Chin"  "Gor" "NCD"
20 "Antop" "KC"  "FRB"
Otherwise you are trying to update a vector of length 1 (t where id=10) with a vector of length 4 ("Chin"), or 3 ("Gor").
To update the table like this, you need to add the enlist keyword:
q)update l1:enlist "Chin", l2:enlist "Gor" from t where id=10
id l1      l2    pcd
----------------------
10 "Chin"  "Gor" "NCD"
20 "Antop" "KC"  "FRB"
This is because you need to assign lists of strings rather than bare strings.

Outputting conditionally from merge

I want to update a history file in SAS. I have new observations, which may overlap with existing data lines.
What is needed is a file which has lines from the new dataset (new_data) where they exist, and otherwise lines from the old dataset (old_data). What I've come up with is a clunky merge operation which is conditional on the order of the datasets (it works only if New_data comes after Old_data).
data new_data;
input key value;
datalines;
1 10
1 11
2 20
2 21
;
run;
data old_data;
input key value;
datalines;
2 50
2 51
3 30
3 31
;
run;
So I'd like to have the following:
key value
1 10
1 11
2 20
2 21
3 30
3 31
However the following does not work. It produces the output below it.
data updated_history;
merge New_data(in=a) old_data(in=b) ;
by key;
if a or (b and not a );
run;
....
2 50
2 51
...
But for some reason this does:
data updated_history;
merge old_data(in=b) New_data(in=a);
by key;
if a or (b and not a );
run;
Question: Is there an intelligent way to manage which dataset the values are selected from? Something like: if a then value_from_dataset a;
The order in which you list the datasets in the MERGE statement is the order in which the data is taken. So when the order is old, new, values from old are read first and then values from new overwrite them. This is why your second version works and the first does not.
Since you have multiple observations per key value, you probably do NOT want to use MERGE to combine these files. You could do it using SET, reading the data twice with two DOW loops. In that case the order of the datasets in the SET statement won't matter, since the records are interleaved instead of joined. The first loop below determines which of the two input datasets has any observations for the current KEY value.
data want ;
anyold=0;
anynew=0;
do until (last.key);
set old_data (in=inold) new_data(in=innew);
by key ;
if inold then anyold=1;
if innew then anynew=1;
end;
do until (last.key);
set old_data (in=inold) new_data(in=innew);
by key ;
if not (anyold and anynew and inold) then output;
end;
drop anyold anynew;
run;
This type of combination is probably easier to code using SQL.
proc sql ;
create table want as
select key,value from new_data
union
select key,value from old_data
where key in (select key from old_data except select key from new_data)
order by 1
;
quit;

Divide records into groups - quick solution

I need to divide rows of a PostgreSQL table (selected by a subselect) into groups using an UPDATE command; the groups will be identified by an integer value in one of the columns. The groups should all be the same size. The source table contains billions of records.
For example, if I need to divide 213 selected rows into groups of 50 records each, the result would be:
1 - 50. row => 1
51 - 100. row => 2
101 - 150. row => 3
151 - 200. row => 4
201 - 213. row => 5
There is no problem doing this with a loop (or with PostgreSQL window functions), but I need to do it very efficiently and quickly. I can't derive the group from the id, because there can be gaps in those ids.
I had the idea of using a random integer generator and setting it as the default value for the column, but that is not usable when I need to adjust the group size.
The query below should display 213 rows with a group number from 0-4. Just add 1 if you want 1-5:
SELECT i, (row_number() OVER () - 1) / 50 AS grp
FROM generate_series(1001,1213) i
ORDER BY i;
create temporary sequence s minvalue 0 start with 0;
select *, nextval('s') / 50 grp
from t;
drop sequence s;
I think it has the potential to be faster than the row_number version from @Richard, but the difference may not be relevant depending on the specifics.