kdb: Update entire column with data from another table

I have two partitioned tables. Table A is my main table and Table B is full of columns that are exact copies of some of the columns in Table A. However, there is one column in Table B that has data I need, because the matching column in Table A is full of nulls.
I would like to get rid of Table B completely, since most of it is redundant, and update the matching column in Table A with the data from the one column in Table B.
Visually,
Table A:                Table B:
a  b     c   d          a  b    d
--------------          ---------
1  null  11  A          1  joe  A
2  null  22  B          2  bob  B
3  null  33  C          3  sal  C
I want to fill the b column in Table A with the values from the b column in Table B, and then I no longer need Table B and can delete it. I will have to do this repeatedly since these two tables are given to me daily from two separate sources.
I cannot key these tables, since they are both partitioned.
I have tried:
update columnb:(exec columnb from TableB) from TableA;
but I get a 'length error.
Suggestions on how to approach this in any manner are appreciated.

To replace a column in memory you would do the following.
t1:([]a:1 2 3;b:0N)
a b
---
1
2
3
t2:([]c:`aa`bb`cc;b:5 6 7)
c b
----
aa 5
bb 6
cc 7
t1,'t2
a b c
------
1 5 aa
2 6 bb
3 7 cc
If you are getting a length error then the columns do not have
the same count, and the following would work around it. The obvious
problem with this workaround is that it will start to repeat
data if t2 has a lower row count than t1, so you should find out why the counts differ.
t1,'count[t1]#t2
Now for partitions: you will use the amend function to change
the b column of the partitioned table, tableA, at date 2007.02.23 (or whatever date your partition is).
This loads the b column of tableB into memory to perform the amend. You must perform the amend for each partition.
@[`:2007.02.23/tableA/;`b;:;count[select from tableA where date=2007.02.23]#exec b from select b from tableB where date=2007.02.23]
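To repeat this for every partition in one pass, here is a minimal sketch, assuming a date-partitioned database rooted at the hypothetical path /db, loaded so that the virtual date list is available:
{[d]
  n:count select from tableA where date=d;                / row count of this date's partition
  v:n#exec b from select b from tableB where date=d;      / take exactly n values of b (repeats if tableB is shorter)
  @[hsym `$"/db/",string[d],"/tableA/";`b;:;v]            / overwrite the on-disk b column
 } each date;
Once the amends are verified, the tableB directory in each partition can be removed.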

Related

How to efficiently join two huge tables by nearest timestamp?

I have two huge tables, A and B. Table A has around 500 million rows of time-series data. Table B has around 10 million rows of time-series data. To simplify, we can assume they consist of the following columns:
Table A
factory  machine  timestamp_1          part  suplement
1        1        2022-01-01 23:54:01  1     1
1        1        2022-01-01 23:54:05  1     2
1        1        2022-01-01 23:54:10  1     3
...      ...      ...                  ...   ...
Table B
machine  timestamp_2          measure
1        2022-01-01 23:54:00  0
1        2022-01-01 23:54:07  10
1        2022-01-01 23:54:08  0
...      ...                  ...
I want to create a table C that results from "joining" both tables by matching each value of timestamp_1 of table A to the nearest value of timestamp_2 of table B whose measure is 0, for the same factory and machine. I also only need this for the part = 1 rows of table A. For the small example above, the resulting table C would have the same number of rows as A and would look like:
Table C
machine  timestamp_1          time_since_measure_0
1        2022-01-01 23:54:01  1
1        2022-01-01 23:54:05  5
1        2022-01-01 23:54:10  2
...      ...                  ...
Some things that are also important to consider are:
Table A has an index on columns (factory, machine, timestamp_1, part, suplement). That index is essential and works great for other queries not related to this one. Table B has indexes on columns (machine, timestamp_2, measure).
Table A is a compressed TimescaleDB table partitioned by (factory, timestamp_1). This is also because of other queries. Table B is a vanilla PostgreSQL table.
I used the following statement to create table C:
create table C (
    machine int4 not null,
    timestamp_1 timestamptz,
    time_since_measure_0 interval,
    constraint C primary key (machine, timestamp_1)
);
I then tried this code to select and insert data into table C:
insert into C
select
    machine,
    timestamp_1,
    timestamp_1 - (
        select timestamp_2
        from B
        where A.machine = B.machine
          and B.measure = 0
          and B.timestamp_2 <= A.timestamp_1
        order by B.timestamp_2 desc
        limit 1
    ) as time_since_measure_0
from A
where A.part = 1;
However, this takes a very long time. I know I am dealing with very big tables, but is there something I am missing, or how could I optimize this?
Because of course we don't have access to your tables, and you haven't posted a query plan, it's difficult to do more than make some general observations. The indexes you describe as being in place do not appear to be useful to this query. Looking at your query, it appears to me that you need to add the following indexes:
Table A: index on (machine, timestamp_1)
Table B: index on (machine, measure, timestamp_2)
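For concreteness, that DDL might look like this (the index names are my own invention):
create index a_machine_ts1 on A (machine, timestamp_1);
create index b_machine_measure_ts2 on B (machine, measure, timestamp_2);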
Give that a shot and see what happens.
What you want is called an "as-of join": it joins each timestamp to the nearest value in the other table.
Some time-series databases, like ClickHouse, support this directly. This is the only way to make it fast. It is quite similar to a merge join, with a few modifications: the engine must scan both tables in timestamp order and join to the nearest-value row instead of the equal-value row.
I've looked into it briefly and it doesn't look like TimescaleDB supports it, but this post shows a workaround using a lateral join and a covering index. This is likely to have performance similar to your query, because it will use a nested loop and an index-only scan to pick the nearest value for each row.
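To illustrate, a minimal sketch of that lateral-join workaround against the schema assumed above (the left join lateral ... on true keeps rows of A that have no preceding measure = 0):
select
    a.machine,
    a.timestamp_1,
    a.timestamp_1 - b.timestamp_2 as time_since_measure_0
from A a
left join lateral (
    select timestamp_2
    from B
    where machine = a.machine
      and measure = 0
      and timestamp_2 <= a.timestamp_1
    order by timestamp_2 desc
    limit 1
) b on true
where a.part = 1;
With the (machine, measure, timestamp_2) index in place, the inner subquery can be answered by an index-only scan, one probe per row of A.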

In Postgres, is there a way to have a derived column automatically update when I change the original data that calculated it?

Say I have two tables A and B.
A has two columns:
x | y
__|__
1 | 2
3 | 4
5 | 6
and B has one column, which is the product of the columns in A:
z
_
2
12
30
So, say I changed the value 4 in table A to 3; then the 12 in table B should change to 9 (3*3), but it doesn't. How do I make B automatically update when the original values are changed?
This is the sort of thing that lends itself to a View:
CREATE VIEW results AS
SELECT a.x * a.y AS product
FROM a;
Then you would select from the view for your results. The downside is that this query is run every time you select from the view. There is also the option of a Materialized View, where you determine when the values are updated with REFRESH MATERIALIZED VIEW.
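For completeness, a minimal sketch of that materialized-view option (the name results_mat is made up):
CREATE MATERIALIZED VIEW results_mat AS
SELECT a.x * a.y AS product
FROM a;
-- re-run this whenever the underlying data in a has changed
REFRESH MATERIALIZED VIEW results_mat;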

Optimal use of LIKE on indexed column

I have a large table (roughly 1 million rows, 7 columns including the primary key). The table contains two columns (i.e. symbol_01 and symbol_02) that are indexed and used for querying. The table contains rows such as:
id symbol_01 symbol_02 value_01 value_02
1 aaa bbb 12 15
2 bbb aaa 12 15
3 ccc ddd 20 50
4 ddd ccc 20 50
As per the example, rows 1 and 2 are identical except that symbol_01 and symbol_02 are swapped, yet they have the same values for value_01 and value_02. The same is true of rows 3 and 4. This holds for the entire table: there are essentially two rows for each combination of symbol_01 + symbol_02.
I need to figure out a better way of handling this to get rid of the duplication. So far the solution I am considering is to just have one column called symbol which would be a combination of the two symbols, so the table would be as follows:
id symbol value_01 value_02
1 ,aaa,bbb, 12 15
2 ,ccc,ddd, 20 50
This would cut the number of rows in half. As a side note, every value in the symbol column will be unique. Results always need to be queried for using both symbols, so I would do:
select value_01, value_02
from my_table
where symbol like '%,aaa,%' and symbol like '%,bbb,%'
This would work but my question is around performance. This is still going to be a big table (and will get bigger soon). So my question is, is this the best solution for this scenario given that symbol will be indexed, every symbol combination will be unique, and I will need to use LIKE to query results.
Is there a better way to do this? I'm not sure how good LIKE is for performance, but I don't see an alternative.
There's no high-performance solution, because your problem is shoehorning multiple values into one column.
Create a child table (with a foreign key to your current/main table) to separately hold all the individual values you want to search on, index that column, and your query will be simple and fast.
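A sketch of that normalized design, with assumed names and types (symbol_link and its columns are my own; my_table and id come from the question):
create table symbol_link (
    main_id int not null references my_table (id),
    symbol  text not null
);
create index symbol_link_idx on symbol_link (symbol, main_id);
-- rows whose symbol set contains both 'aaa' and 'bbb'
select m.value_01, m.value_02
from my_table m
join symbol_link s1 on s1.main_id = m.id and s1.symbol = 'aaa'
join symbol_link s2 on s2.main_id = m.id and s2.symbol = 'bbb';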
With this index:
create index symbol_index on t (
least(symbol_01, symbol_02),
greatest(symbol_01, symbol_02)
)
The query would be:
select *
from t
where
least(symbol_01, symbol_02) = least('aaa', 'bbb')
and
greatest(symbol_01, symbol_02) = greatest('aaa', 'bbb')
Or simply delete the duplicates:
delete from t
using (
select distinct on (
greatest(symbol_01, symbol_02),
least(symbol_01, symbol_02),
value_01, value_02
) id
from t
order by
greatest(symbol_01, symbol_02),
least(symbol_01, symbol_02),
value_01, value_02
) s
where id = s.id
Depending on the columns' semantics, it might be better to normalize the table as suggested by @Bohemian.

SQL: How to prevent double summing

I'm not exactly sure what the term is for this, but when you join two tables in a many-to-many relationship and sum one of the variables, you can end up summing the same values over and over again.
What I want to accomplish is to prevent this from happening. How do I make sure that my sum function is returning the correct number?
I'm using PostgreSQL
Example:
Table 1              Table 2
SampleID  DummyName  SampleID  DummyItem
1         John       1         5
1         John       1         4
2         Doe        1         5
3         Jake       2         3
3         Jake       2         3
                     3         2
If I join these two tables ON SampleID, and I want to sum the DummyItem for each DummyName, how can I do this without double summing?
The solution is to first aggregate and then do the join:
select t1.sampleid, t1.dummyname, t.total_items
from table_1 t1
join (
    select t2.sampleid, sum(t2.dummyitem) as total_items
    from table_2 t2
    group by t2.sampleid
) t on t.sampleid = t1.sampleid;
The real question is however: why are the duplicates in table_1?
I would take a step back and try to assess the database design. Specifically, what rules allow such duplicate data?
To address your specific issue given your data, here's one option: create a temp table that contains unique rows from Table 1, then join the temp table with Table 2 to get the sums I think you are expecting.
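For example, a sketch of that approach, reusing the table_1/table_2 names from the answer above:
create temp table table_1_unique as
select distinct sampleid, dummyname
from table_1;

select u.dummyname, sum(t2.dummyitem) as total_items
from table_1_unique u
join table_2 t2 on t2.sampleid = u.sampleid
group by u.dummyname;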

How to fuse values from 2 separate columns into a common column in PostgreSQL?

Is there a simple way to fuse values from two separate (albeit similar) columns in PostgreSQL?
For example, the following statement:
SELECT a, b FROM stuff;
would currently result in:
a b
-----------
1 2
1 3
1 4
However, I'd like to have the two columns fused in the following way:
ab
---
1
1
1
2
3
4
If you need to get two result sets from the same complex query without losing performance, try something like:
WITH source AS (
    SELECT a, b
    FROM your_complex_query
)
SELECT a AS ab FROM source
UNION ALL
SELECT b AS ab FROM source;
select a as ab from stuff
union all
select b from stuff
order by 1