SQL (Redshift) to get the intersect of multiple tables - amazon-redshift

I'm using Redshift and have 6 tables of IDs. I want to get the intersection between each pair of tables.
So my final output would look something like this:
Table 1 & Table 2 have 10% common IDs
Table 1 & Table 3 have 50% common IDs
.....
.....
Table 6 & Table 4 have 20% common IDs
Table 6 & Table 5 have 3% common IDs
I can easily get the data, but it would mean a lot of repeating the same SQL, so I've tried to create some tables of all the IDs and the tables they are in, but I'm stuck as to how to get the data in one or two SQL statements.
Any ideas welcome!

You could try to full join all these tables by ID in a subquery and then use a conditional aggregate, so that "Table 1 & Table 2 have 10% common IDs" would be expressed as
100.0*sum(case when id1 is not null and id2 is not null then 1 end)/count(id1)
(taking Table 1 row count as denominator)
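A minimal sketch of that idea across three of the tables (table and column names here are placeholders; extend the same pattern to cover all six):

-- Full join the ID tables, then compute each pairwise overlap with
-- conditional aggregation. Each denominator is that table's row count.
select
    100.0 * sum(case when t1.id is not null and t2.id is not null then 1 end) / count(t1.id) as pct_t1_t2,
    100.0 * sum(case when t1.id is not null and t3.id is not null then 1 end) / count(t1.id) as pct_t1_t3,
    100.0 * sum(case when t2.id is not null and t3.id is not null then 1 end) / count(t2.id) as pct_t2_t3
from table1 t1
full join table2 t2 on t1.id = t2.id
full join table3 t3 on t3.id = coalesce(t1.id, t2.id);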

Related

Best usage of indexes and primary key on joined and filtered data in PostgreSQL

I have 2 tables with the exact same number of rows and the same non-repeated id. Because the data comes from 2 sources I want to keep it as 2 tables and not combine them. I assume the best approach would be to leave the unique id as the primary key and join on it?
SELECT * FROM tableA INNER JOIN tableB ON tableA.pk = tableB.pk
The data is used by an application that forces the user to select 1 or many values from 5 drop-downs in cascading order:
select 1 or many values from tableA column1.
select 1 or many values from tableA column2 but filtered from the first filter.
select 1 or many values from tableA column3 but filtered from the second filter which in turn is filtered from the first filter.
For example:
pk   Column 1  Column 2  Column 3
123  Doe       Jane      2022-01
234  Doe       Jane      2021-12
345  Doe       John      2022-03
456  Jones     Mary      2022-04
Selecting "Doe" from column1 would limit the second filter to ("Jane","John"). And selecting "Jane" from column2 would filter column3 to ("2022-01","2021-12")
And the last part of the question:
The application has 3 selection options for column3:
picking the exact value (for example "2022-01"), picking the year ("2022"), or picking the quarter that the month falls into ("Q1", which equates to "01", "02", "03").
What would be the best usage of indexes AND/OR additional columns for this scenario?
Volume of data would be 20-100 million rows.
Each filter is in the range of 5-25 distinct values.
Which version of Postgres are you running?
The volume you state is rather daunting for a use case of populating drop-down boxes from live data in a Postgres database.
No kidding, it is possible; Kibana/Elastic even has a filter widget that works exactly this way, for instance.
My guess is you may consider storing the different combinations of the search columns in another table, simply to speed up populating the drop-downs. You can maintain that table with triggers on the two main tables. So instead of additional columns/indexes you may end up with an additional table ;)
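A minimal sketch of that idea, assuming the drop-downs come from tableA's column1/column2/column3 (all names below are placeholders, only tableA is shown, and EXECUTE FUNCTION needs Postgres 11+):

-- Hypothetical combinations table used only to populate the drop-downs.
create table filter_combinations (
    column1 text not null,
    column2 text not null,
    column3 text not null,
    primary key (column1, column2, column3)
);

-- Trigger function: record each new combination, ignoring ones already seen.
create function record_filter_combination() returns trigger
language plpgsql as $$
begin
    insert into filter_combinations (column1, column2, column3)
    values (new.column1, new.column2, new.column3)
    on conflict do nothing;
    return new;
end;
$$;

create trigger tablea_record_combination
after insert on tableA
for each row execute function record_filter_combination();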
Regarding indexing strategy and given the hints you stated (AND/OR), I'd say there's no silver bullet. Index the columns that will be queried the most often.
Index each column individually, because Postgres can combine multiple indexes (via bitmap index scans) to answer conjunctive/disjunctive conditions in WHERE clauses.
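For example (index names are placeholders):

-- One index per filter column; the planner can bitmap-AND/OR them as needed.
create index tablea_column1_idx on tableA (column1);
create index tablea_column2_idx on tableA (column2);
create index tablea_column3_idx on tableA (column3);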
Hope this helps

How to efficiently join two huge tables by nearest timestamp?

I have two huge tables, A and B. Table A has around 500 million rows of time-series data. Table B has around 10 million rows of time-series data. To simplify, we can assume they are constituted by the following columns:
Table A
factory  machine  timestamp_1          part  suplement
1        1        2022-01-01 23:54:01  1     1
1        1        2022-01-01 23:54:05  1     2
1        1        2022-01-01 23:54:10  1     3
...      ...      ...                  ...   ...
Table B
machine  timestamp_2          measure
1        2022-01-01 23:54:00  0
1        2022-01-01 23:54:07  10
1        2022-01-01 23:54:08  0
...      ...                  ...
I want to create a table C that results from "joining" both tables by matching each value of timestamp_1 of table A to the nearest value of timestamp_2 of table B whose measure is 0, for the same factory and machine. I also only need this for the part = 1 values of table A. For the small example above, the resulting table C would have the same number of rows as A and would look like:
Table C
machine  timestamp_1          time_since_measure_0
1        2022-01-01 23:54:01  1
1        2022-01-01 23:54:05  5
1        2022-01-01 23:54:10  2
...      ...                  ...
Some things that are also important to consider are:
Table A has an index on columns (factory, machine, timestamp_1, part, suplement). That index is essential and works great for other queries not related to this. Table B has indexes on columns (machine, timestamp_2, measure).
Table A is a compressed TimescaleDB table partitioned by (factory, timestamp_1). This is also because of other queries. Table B is a vanilla PostgreSQL table.
I used the following statement to create table C:
create table C (
machine int4 not null,
timestamp_1 timestamptz,
time_since_measure_0 interval,
constraint C primary key (machine,timestamp_1)
)
I then tried this code to select and insert data into table C:
insert into C (
select
machine,
timestamp_1,
timestamp_1 - (
select timestamp_2
from B
where
A.machine = B.machine
and B.measure = 0
and B.timestamp_2 <= A.timestamp_1
order by B.timestamp_2 desc
limit 1
) as "time_since_measure_0"
from A
where A.part = 1
)
However, this seems to take a loooot of time. I know I am dealing with very big tables, but is there something I am missing, or how could I optimize this?
Because of course we don't have access to your tables and you haven't posted a query plan, it's difficult to do more than make some general observations. The indexes you describe as being in place do not appear to be useful to this query. Looking at your query, it appears to me that you need to add the following indexes:
Table A
Index on (machine, timestamp_1)
Table B
Index on (machine, measure, timestamp_2)
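For instance (index names are placeholders; since A is a compressed TimescaleDB hypertable, adding an index there may need extra care):

create index a_machine_ts_idx on A (machine, timestamp_1);
create index b_machine_measure_ts_idx on B (machine, measure, timestamp_2);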
Give that a shot and see what happens.
What you want is called an "as-of join". That joins each timestamp to the nearest value in the other table.
Some time-series databases, like ClickHouse, support this directly. This is the only way to make it fast. It is quite similar to a merge join, with a few modifications: the engine must scan both tables in timestamp order and join to the nearest-value row instead of the equal-value row.
I've looked into it briefly and it doesn't look like TimescaleDB supports it, but this post shows a workaround using a lateral join and a covering index. This is likely to have similar performance to your query, because it will use a nested loop and an index-only scan to pick the nearest value for each row.
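A minimal sketch of that lateral-join workaround, using the same tables and assumptions as the query above:

-- For each row of A (part = 1), pick the nearest earlier measure-0 row of B
-- for the same machine via a LATERAL subquery; the covering index on
-- (machine, measure, timestamp_2) lets this be an index-only scan.
select
    a.machine,
    a.timestamp_1,
    a.timestamp_1 - nearest.timestamp_2 as time_since_measure_0
from A a
left join lateral (
    select b.timestamp_2
    from B b
    where b.machine = a.machine
      and b.measure = 0
      and b.timestamp_2 <= a.timestamp_1
    order by b.timestamp_2 desc
    limit 1
) nearest on true
where a.part = 1;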

PostgreSQL: How to use a lookup table to select data across multiple tables?

I have a schema with many large tables which all have the same structure. Each table has an index on its id. I also have a separate table with all the id's across the other tables, pointing to their tablename; for example, the tables in the schema:
Table 'A'
id content
1 ...
2 ...
3 ...
Table 'B'
id content
4 ...
5 ...
6 ...
Table 'C'
id content
5 ...
6 ...
7 ...
(As you can see, the ids are not always unique across the tables.) And then a lookup table:
Table 'lookup'
id tablename
1 'A'
2 'A'
3 'A'
4 'B'
5 'B'
5 'C'
6 'B'
6 'C'
7 'C'
Now, how can I make a view like this?
SELECT
id, content
FROM
view
WHERE
id = 6
where it would select the content from B and C (where id is 6). Also, it should only do an index scan on B and C to reduce search time. Once again, there are many tables and they are very large. By far most of the ids are unique across the tables.
How can I do this? (Or maybe, should I do it this way?)
PS The content of the tables is not stored in a single table because the volume is constantly growing, and inserting/copying into this indexed table becomes very slow after a while. Also, it is easier to remove specific data by just truncating separate tables.
You are looking for table partitioning:
CREATE TABLE object (
country text,
id bigint,
content bytea,
PRIMARY KEY (country, id)
) PARTITION BY LIST (country);
CREATE TABLE object_a PARTITION OF object FOR VALUES IN ('A');
CREATE TABLE object_b PARTITION OF object FOR VALUES IN ('B');
CREATE TABLE object_c PARTITION OF object FOR VALUES IN ('C');
This alone won't give you quick access by id, but simplifies managing the union and querying by country name. You'd still need the lookup table for the countries:
CREATE MATERIALIZED VIEW lookup AS SELECT country, id FROM object;
Then you can do
SELECT content
FROM object
JOIN lookup ON object.country = lookup.country AND object.id = lookup.id
WHERE lookup.id = 6
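One assumption worth stating here: for the id lookup to stay an index scan and to reflect newly inserted rows, the materialized view needs its own index and periodic refreshes, for example:

-- Index the materialized view so the join on lookup.id is an index scan.
create unique index lookup_id_country_idx on lookup (id, country);

-- Refresh whenever the partitions change (CONCURRENTLY needs a unique index).
refresh materialized view concurrently lookup;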

Optimal use of LIKE on indexed column

I have a large table (+- 1 million rows, 7 columns including the primary key). The table contains two columns (ie: symbol_01 and symbol_02) that are indexed and used for querying. This table contains rows such as:
id symbol_01 symbol_02 value_01 value_02
1 aaa bbb 12 15
2 bbb aaa 12 15
3 ccc ddd 20 50
4 ddd ccc 20 50
As per the example, rows 1 and 2 are identical except that symbol_01 and symbol_02 are swapped, but they have the same values for value_01 and value_02. That is true once again with rows 3 and 4. This is the case for the entire table; there are essentially two rows for each combination of symbol_01+symbol_02.
I need to figure out a better way of handling this to get rid of the duplication. So far the solution I am considering is to just have one column called symbol which would be a combination of the two symbols, so the table would be as follows:
id symbol value_01 value_02
1 ,aaa,bbb, 12 15
2 ,ccc,ddd, 20 50
This would cut the number of rows in half. As a side note, every value in the symbol column will be unique. Results always need to be queried for using both symbols, so I would do:
select value_01, value_02
from my_table
where symbol like '%,aaa,%' and symbol like '%,bbb,%'
This would work but my question is around performance. This is still going to be a big table (and will get bigger soon). So my question is, is this the best solution for this scenario given that symbol will be indexed, every symbol combination will be unique, and I will need to use LIKE to query results.
Is there a better way to do this? I'm not sure how great LIKE is for performance, but I don't see an alternative?
There's no high performance solution, because your problem is shoehorning multiple values into one column.
Create a child table (with a foreign key to your current/main table) to separately hold all the individual values you want to search on, index that column and your query will be simple and fast.
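A minimal sketch of that normalization (names and types below are assumptions based on the example table):

-- Child table: one row per (main row, symbol) pair.
create table my_table_symbol (
    my_table_id int  not null references my_table (id),
    symbol      text not null,
    primary key (symbol, my_table_id)
);

-- Find rows that contain both symbols.
select t.value_01, t.value_02
from my_table t
join my_table_symbol s1 on s1.my_table_id = t.id and s1.symbol = 'aaa'
join my_table_symbol s2 on s2.my_table_id = t.id and s2.symbol = 'bbb';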
With this index:
create index symbol_index on t (
least(symbol_01, symbol_02),
greatest(symbol_01, symbol_02)
)
The query would be:
select *
from t
where
least(symbol_01, symbol_02) = least('aaa', 'bbb')
and
greatest(symbol_01, symbol_02) = greatest('aaa', 'bbb')
Or simply delete the duplicates:
delete from t
using (
select distinct on (
greatest(symbol_01, symbol_02),
least(symbol_01, symbol_02),
value_01, value_02
) id
from t
order by
greatest(symbol_01, symbol_02),
least(symbol_01, symbol_02),
value_01, value_02
) s
where t.id = s.id
Depending on the columns' semantics it might be better to normalize the table as suggested by @Bohemian

SQL: How to prevent double summing

I'm not exactly sure what the term for this is, but when you join 2 tables that have a many-to-many relationship and you want to sum up one of the variables, I believe you can end up summing the same values over and over again.
What I want to accomplish is to prevent this from happening. How do I make sure that my sum function is returning the correct number?
I'm using PostgreSQL
Example:
Table 1              Table 2
SampleID  DummyName  SampleID  DummyItem
1         John       1         5
1         John       1         4
2         Doe        1         5
3         Jake       2         3
3         Jake       2         3
                     3         2
If I join these two tables ON SampleID, and I want to sum the DummyItem for each DummyName, how can I do this without double summing?
The solution is to first aggregate and then do the join:
select t1.sampleid, t1.dummyname, t.total_items
from table_1 t1
join (
select t2.sampleid, sum(dummyitem) as total_items
from table_2 t2
group by t2.sampleid
) t ON t.sampleid = t1.sampleid;
The real question is however: why are the duplicates in table_1?
I would take a step back and try to assess the database design. Specifically, what rules allow such duplicate data?
To address your specific issue given your data, here's one option: create a temp table that contains unique rows from Table 1, then join the temp table with Table 2 to get the sums I think you are expecting.
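A minimal sketch of that approach, using the column names from the example:

-- Step 1: de-duplicate Table 1 into a temp table.
create temp table t1_unique as
select distinct sampleid, dummyname
from table_1;

-- Step 2: join the de-duplicated names to Table 2 and sum each item once.
select u.sampleid, u.dummyname, sum(t2.dummyitem) as total_items
from t1_unique u
join table_2 t2 on t2.sampleid = u.sampleid
group by u.sampleid, u.dummyname;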