DB2: Joining a large physical table with small global temp tables

I have a requirement to join 3 tables:
a) Table T1 - a large physical table with 100 million rows.
   Index columns: C1, C2, C3, in this order.
b) Table T2 - a temp table with 50 records.
   Contains C2 and additional columns. No index.
c) Table T3 - a temp table with 100 records.
   Contains C3 and additional columns. No index.
Tables T2 and T3 have no common columns.
I tried to extract data from T1, T2, and T3 as below:
Select T1.*, T2.*, T3.*
from T1
Inner join T2 on T1.C2 = T2.C2
Inner join T3 on T1.C3 = T3.C3
where T1.C1 = <a constant value coming from my program>
Explain output for the above query shows that an index scan was performed on T1 matching on only 1 column (I believe it is T1.C1, since that is the column in my WHERE clause).
The query executes fine but takes slightly longer than expected. Is there a better way to code the query for this requirement?
Any input is greatly appreciated.

You mention you're using temp tables. Did you run RUNSTATS on the temporary tables, including collecting column statistics?
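For illustration, a minimal sketch of such a RUNSTATS invocation, assuming Db2 for LUW, where RUNSTATS accepts declared temporary tables in the SESSION schema (adjust schema and table names to your setup; applicability differs on other platforms):
-- collect table and distribution statistics on the small temp tables
RUNSTATS ON TABLE session.T2 WITH DISTRIBUTION;
RUNSTATS ON TABLE session.T3 WITH DISTRIBUTION;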
An index scan matching on one column has to be against T1 matching on C1, since that is the leading column of the index. When examining the explain output, you should also pay attention to PRIMARY_ACCESSTYPE: Db2 may choose to scan one or all of T1, T2, and T3 and build a sparse index, which would be reflected with PRIMARY_ACCESSTYPE = T in the PLAN_TABLE.
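As a sketch of how those columns could be inspected, assuming a Db2 for z/OS-style PLAN_TABLE populated by EXPLAIN (the QUERYNO value is illustrative):
-- run EXPLAIN PLAN SET QUERYNO = 100 FOR <your query> first to populate PLAN_TABLE
SELECT QBLOCKNO, PLANNO, METHOD, TNAME,
       ACCESSTYPE, MATCHCOLS, ACCESSNAME, PRIMARY_ACCESSTYPE
FROM PLAN_TABLE
WHERE QUERYNO = 100
ORDER BY QBLOCKNO, PLANNO;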
What is the cardinality of the 3-column index on the 100 million row table? Is it unique? Is it highly selective (cardinality close to the table size of 100 million rows), or would a significant number of duplicate rows qualify for every probe?
Accurate statistics are important in this scenario. The cost of a Cartesian join is quite high, so it's important that Db2 understands how small the temporary tables are and how selective the join columns are when choosing an access path. If no statistics are collected on tables T2 and T3, Db2 by default assumes 10,000 rows per table. A Cartesian join of T2 and T3 would then be estimated at 10,000 * 10,000 = 100 million rows, in which case it would make sense for Db2 to just access T1 once using the local filter on C1 and then join to T2 and T3, possibly with a sparse index.
If collecting statistics does not resolve the issue, please update the question with the plan table results.

Related

Joining too many tables makes Postgres query extremely slow

I've been trying to optimize this simple query on Postgres 12 that joins several tables to a base relation. They each have a 1-to-1 relation to the base and have anywhere from 10 thousand to 10 million rows.
SELECT *
FROM base
LEFT JOIN t1 ON t1.id = base.t1_id
LEFT JOIN t2 ON t2.id = base.t2_id
LEFT JOIN t3 ON t3.id = base.t3_id
LEFT JOIN t4 ON t4.id = base.t4_id
LEFT JOIN t5 ON t5.id = base.t5_id
LEFT JOIN t6 ON t6.id = base.t6_id
LEFT JOIN t7 ON t7.id = base.t7_id
LEFT JOIN t8 ON t8.id = base.t8_id
LEFT JOIN t9 ON t9.id = base.t9_id
(the actual relations are a bit more complicated than this, but for demonstration purposes this is fine)
I noticed that the query is still very slow even when I only do SELECT base.id, which seems odd, because then the query planner should know that the joins are unnecessary and that they shouldn't affect performance.
Then I noticed that 8 seems to be some kind of magic number. If I remove any single one of the joins, the query time goes from 500 ms to 1 ms. With EXPLAIN I was able to see that Postgres does index-only scans when joining 8 tables, but with 9 tables it starts doing sequential scans.
That happens even when I only do SELECT base.id, so somehow the number of tables is tripping up the query planner.
We finally found out that there is indeed a configuration setting in Postgres called join_collapse_limit, which is set to 8 by default.
https://www.postgresql.org/docs/current/runtime-config-query.html
The planner will rewrite explicit JOIN constructs (except FULL JOINs) into lists of FROM items whenever a list of no more than this many items would result. Smaller values reduce planning time but might yield inferior query plans. By default, this variable is set the same as from_collapse_limit, which is appropriate for most uses. Setting it to 1 prevents any reordering of explicit JOINs. Thus, the explicit join order specified in the query will be the actual order in which the relations are joined. Because the query planner does not always choose the optimal join order, advanced users can elect to temporarily set this variable to 1, and then specify the join order they desire explicitly.
After reading this documentation we decided to increase the limit, along with other values such as from_collapse_limit and geqo_threshold. Beware that query planning time increases exponentially with the number of joins, so the limit is there for a reason and should not be increased carelessly.
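For reference, the limits can be inspected and raised per session like this (the value 12 is only an example):
SHOW join_collapse_limit;           -- 8 by default
SET join_collapse_limit = 12;       -- let the planner reorder larger explicit join lists
SET from_collapse_limit = 12;       -- keep the related FROM-list limit in step
SET geqo_threshold = 14;            -- optional: delay the genetic query optimizer accordingly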

PostgreSQL 12.4 query planner ignores sub-partition constraint, resulting in table scan

I have a table
T (A int, B int, C bigint, D varchar)
partitioned by each value of A and sub-partitioned by each value of B (i.e. list partitions with a single value each). A has a cardinality of <10 and B has a cardinality of <100. T has about 6 billion rows.
When I run the query
select distinct B from T where A = 1;
it prunes the top-level partitions (those where A != 1) but performs a table scan on all sub-partitions to find distinct values of B. I thought it would know, based on the partition design, that it would only have to check the partition constraint to determine the possible values of B given A, but alas, that is not the case.
There are no indexes on A or B, but there is a primary key on (C,D) at each partition, which seems immaterial, but figured I should mention it. I also have a BRIN index on C. Any idea why the Postgres query planner is not consulting the sub-partition constraints to avoid the table scan?
The reason is that nobody has implemented such an optimization in the query planner. I cannot say that that surprises me, since it is a rather unusual query. Every such optimization built into the optimizer would mean that each query on a partitioned table that has a DISTINCT would need some extra query planning time, while only a few queries would profit. Apart from the expense of writing and maintaining the code, that would be a net loss for most users.
Maybe you could use a metadata query:
CREATE TABLE list (id bigint NOT NULL, p integer NOT NULL) PARTITION BY LIST (p);
CREATE TABLE list_42 PARTITION OF list FOR VALUES IN (42);
CREATE TABLE list_101 PARTITION OF list FOR VALUES IN (101);
SELECT regexp_replace(
          pg_get_expr(p.relpartbound, p.oid),
          '^FOR VALUES IN \((.*)\)$',
          '\1'
       )::integer
FROM pg_class AS p
JOIN pg_inherits AS i ON p.oid = i.inhrelid
WHERE i.inhparent = 'list'::regclass;
regexp_replace
----------------
42
101
(2 rows)

Inner join on tables with 50M and 30K entries

I have two tables A and B. A contains 50 million entries and B contains just 30 thousand. I have created default indexes (B-tree) on the columns used to join the tables. The join field is of type character varying.
I am querying the database with this query:
SELECT count(*)
from B INNER JOIN A
ON B.id = A.id;
The execution time of the above query is approximately 8 seconds. Looking at the execution plan, the planner applies a sequential scan to table A, scanning all 50 million entries (this takes most of the time), and an index scan on table B.
How can I speed up the query?
You cannot speed up this query if you want an exact result.
The most efficient join strategy will probably be a hash or merge join, depending on your work_mem setting.
You might be able to get some speed improvement with an index only scan; try to VACUUM both tables before querying.
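A sketch of what that could look like, assuming the join column is id and the B-tree indexes from the question already cover it (table names as in the question):
-- refresh the visibility map and statistics so index-only scans become attractive
VACUUM ANALYZE a;
VACUUM ANALYZE b;
-- the existing B-tree indexes on the join columns are what make an
-- index-only scan possible, e.g. CREATE INDEX ON a (id);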
The only tuning method would be to make sure both tables are cached in RAM.
There are ways to get estimated counts, see my blog for details.

PostgreSQL 9.4.5: Limit number of results on INNER JOIN

I'm trying to implement a many-to-many relationship using PostgreSQL's array type, because it scales better for my use case than a join table would. I have two tables: table1 and table2. table1 is the parent in the relationship and has the column child_ids bigint[] default array[]::bigint[]. A single row in table1 can have upwards of tens of thousands of references to table2 in its child_ids column, so I want to limit the number of rows returned by my query to a maximum of 10 per parent. How would I structure this query?
My query to dereference the child ids is SELECT *, json_agg(table2.*) as children FROM table1 INNER JOIN table2 ON table2.id = ANY(table1.child_ids). I don't see a way I could set a limit without limiting the entire result as a whole. Is there a way to either limit this INNER JOIN, or at least use a subquery so that I can use LIMIT to restrict the number of results from table2?
This would have been dead simple with properly normalized tables, but here goes with arrays:
SELECT *
FROM table1 t1, LATERAL (
   SELECT json_agg(sub) AS children
   FROM (
      SELECT *
      FROM table2
      WHERE id = ANY (t1.child_ids)
      LIMIT 10
   ) sub
) t2;
Of course, you have no influence over which 10 rows of table2 will be selected for each row of table1.
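If a deterministic choice is wanted, an ORDER BY inside the derived table pins down which rows qualify; a sketch, assuming you want the 10 lowest ids (the ordering column is up to you):
SELECT *
FROM table1 t1, LATERAL (
   SELECT json_agg(sub) AS children
   FROM (
      SELECT *
      FROM table2
      WHERE id = ANY (t1.child_ids)
      ORDER BY id   -- pick whichever column defines the "first 10"
      LIMIT 10
   ) sub
) t2;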

Calculating correlation coefficient using PostgreSQL?

I have worked out how to calculate the correlation coefficient between two fields if both are in the same table:
SELECT corr(column1, column2) FROM table WHERE <my filters>;
...but I can't work out how to do it when the columns are from different tables (I need to apply the same filters to both tables).
Any hints, please?
If the tables are related to one another such that you can join them, it's fairly simple. Just join them and do the correlation:
SELECT corr(t1.col1, t2.col2)
FROM table1 t1
JOIN table2 t2
ON t1.join_field = t2.join_field
WHERE
<filters for t1>
AND
<filters for t2>
If they're not, then how are you supposed to find out which combination of fields from each table you want to run corr on?
Try this:
SELECT corr(t1.column1, t2.column2)
FROM table1 t1
join table2 t2 on t1.SomeColumn = t2.SomeColumn
WHERE t1.<my filters>
AND t2.<my filters>;
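For instance, a purely hypothetical instance of the same pattern, correlating a daily temperature with daily revenue over the same date range (table and column names are made up):
-- hypothetical tables: weather(day, temperature), sales(day, revenue)
SELECT corr(w.temperature, s.revenue)
FROM weather w
JOIN sales s ON w.day = s.day
WHERE w.day >= DATE '2024-01-01'
  AND w.day <  DATE '2025-01-01';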