How to optimize a query that searches a many-to-many table

I have 3 tables:
table1:{id, uid}
table2:{id, uid}
table1_table2:{table1_id, table2_id}
I need to execute the following queries:
SELECT 1 FROM table1_table2
LEFT JOIN table1 ON table1.id = table1_table2.table1_id
LEFT JOIN table2 ON table2.id = table1_table2.table2_id
WHERE table1.uid = ? and table2.uid = ?
I have unique indices on the UUID columns, so I expected the search to be fast. On an almost empty database, the select takes 0 ms. With 50,000 records in table1, 100 records in table2 and 110,000 records in table1_table2, the select takes 10 ms, which is a lot, because I have to make 400,000 queries. Can I get O(1) on the select?
I'm currently using Hibernate (Spring Data) and Postgres.

You have unique indices, but have you also updated the statistics with ANALYZE?
What type is the UID column, and what type are you passing to it from Java?
Is there any difference when you run the query from Hibernate/Java and from the Postgres console?
Run the query with EXPLAIN to get the execution plan, from Java as well as from the Postgres console, and compare the two. See How to get query plan information from Postgres into JDBC
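
As a rough sketch of the checks suggested above, something like the following could be run from the Postgres console. The literal UUIDs and the index name table1_table2_pair_idx are placeholders, and the composite index is an assumption, since the question does not say how the join table is indexed. Because the WHERE clause filters on both joined tables, the LEFT JOINs behave like inner joins, so the sketch writes them that way.

```sql
-- Refresh planner statistics for the three tables.
ANALYZE table1;
ANALYZE table2;
ANALYZE table1_table2;

-- Assumption: the join table has no index covering both columns yet.
-- The index name is hypothetical.
CREATE INDEX IF NOT EXISTS table1_table2_pair_idx
    ON table1_table2 (table1_id, table2_id);

-- Show the actual plan, timings and buffer usage for one lookup.
EXPLAIN (ANALYZE, BUFFERS)
SELECT 1
FROM table1_table2 tt
JOIN table1 t1 ON t1.id = tt.table1_id
JOIN table2 t2 ON t2.id = tt.table2_id
WHERE t1.uid = '00000000-0000-0000-0000-000000000001'  -- placeholder value
  AND t2.uid = '00000000-0000-0000-0000-000000000002'; -- placeholder value
```

If the plan shows sequential scans on table1 or table2 despite the unique indices, a type mismatch between the column and the bound parameter is one possible cause worth checking.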

Related

Inner join on tables with 50M and 30K entries

I have two tables A and B. A contains 50 million entries and B contains just 30 thousand. I have created default indexes (B-tree) on the columns used to join the tables. The join field is of type character varying.
I am querying the database with this query:
SELECT count(*)
from B INNER JOIN A
ON B.id = A.id;
The execution time of the above query is approximately 8 seconds. According to the execution plan, the planner applies a sequential scan to table A, scanning all 50 million entries (this takes most of the time), and an index scan on table B.
How can I speed up the query?

You cannot speed up this query if you want an exact result.
The most efficient join strategy will probably be a hash or merge join, depending on your work_mem setting.
You might be able to get some speed improvement with an index only scan; try to VACUUM both tables before querying.
The only other tuning option would be to make sure both tables are cached in RAM.
There are ways to get estimated counts, see my blog for details.
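
A minimal sketch of these suggestions in Postgres, using the table names a and b from the question. The pg_class lookup is one common way to get an estimate and is only an assumption about what the answer has in mind; note that it estimates the size of a single table, not of the join result.

```sql
-- Vacuum so the visibility map is current (needed for index-only scans)
-- and refresh statistics at the same time.
VACUUM (ANALYZE) a;
VACUUM (ANALYZE) b;

-- Check which join strategy and scan types the planner now picks.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM b INNER JOIN a
ON b.id = a.id;

-- Planner's row estimate for table a, without scanning the table.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE oid = 'a'::regclass;
```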

Very long query planning times for database with lots of partitions in PostgreSQL

I have a PostgreSQL 10 database that contains two tables which both have two levels of partitioning (by list).
The data is now stored within 5K to 10K partitioned tables (grand-children of the two tables mentioned above) depending on the day.
There are three indexes per grand-child partition table but the two columns on which partitioning is done aren't indexed.
(I don't think this is needed, is it?)
The issue I'm observing is that the query planning time is very slow but the query execution time very fast.
Even when the partition values were hard-coded in the query.
Researching the issue, I thought that the linear search used by PostgreSQL 10 to find the partition metadata was the cause.
cf: https://blog.2ndquadrant.com/partition-elimination-postgresql-11/
So I decided to try out PostgreSQL 11, which includes the following two patches:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=499be013de65242235ebdde06adb08db887f0ea5
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9fdb675fc5d2de825414e05939727de8b120ae81
Alas, the version change doesn't seem to make any difference.
Now I know that having lots of partitions isn't greatly appreciated by PostgreSQL but I still would like to understand why the query planner is so slow in PostgreSQL 10 and now PostgreSQL 11.
An example of a query would be:
EXPLAIN ANALYZE
SELECT
    table_a.a,
    table_b.a
FROM
    (
        SELECT
            a,
            b
        FROM
            table_a
        WHERE
            partition_level_1_column = 'foo'
            AND
            partition_level_2_column = 'bar'
    )
    AS table_a
INNER JOIN
    (
        SELECT
            a,
            b
        FROM
            table_b
        WHERE
            partition_level_1_column = 'baz'
            AND
            partition_level_2_column = 'bat'
    )
    AS table_b
ON table_b.b = table_a.b
LIMIT
    10;
Running it on a database with 5K partitions returns Planning Time: 7155.647 ms but Execution Time: 2.827 ms.

MS SQL Server not using index when WHERE has IN with more than 6 values

I have a very strange issue. If I run a query like this:
SELECT *
FROM tbl_x
WHERE tbl_x.SomeCode IN ('1','2','3','4','5','6')
The query uses the index on the table. However, if I run this query:
SELECT *
FROM tbl_x
WHERE tbl_x.SomeCode IN ('1','2','3','4','5','6','7')
The query ignores the index and does a Table Scan instead.
Why is MS SQL Server not using the index when there are more than 6 values in the Where clause?

I guess that, based on data distribution and cardinality, the query optimizer decides to use a full table scan because it is cheaper than using the index.
You could check:
SELECT COUNT(*)
FROM tbl_x;
and
SELECT COUNT(*)
FROM tbl_x
WHERE tbl_x.SomeCode IN ('1','2','3','4','5','6','7');
You have probably exceeded about 20% of the rows in the entire table.
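
To test that guess, one option is to compare the optimizer's chosen plan with a plan that is forced to use the index. This is only a sketch: the index name IX_tbl_x_SomeCode is hypothetical, and the hint is meant for a one-off comparison, not as a permanent fix.

```sql
-- Report logical reads for each statement.
SET STATISTICS IO ON;

-- Plan chosen by the optimizer (table scan in the question).
SELECT *
FROM tbl_x
WHERE tbl_x.SomeCode IN ('1','2','3','4','5','6','7');

-- Same query, forced onto the index. IX_tbl_x_SomeCode is a placeholder name.
SELECT *
FROM tbl_x WITH (INDEX(IX_tbl_x_SomeCode))
WHERE tbl_x.SomeCode IN ('1','2','3','4','5','6','7');

SET STATISTICS IO OFF;
```

If the forced-index version reports more logical reads than the table scan, the optimizer's choice is the cheaper one for this data.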

Update with join condition is taking too long in redshift

I have a table in a Redshift cluster with 5 billion rows. I have a job that tries to update some column values based on some filter. Updating anything at all in this table is incredibly slow. Here's an example:
Update tbl1
set price=tbl2.price, flag=true
from tbl2 join tbl1 on tbl1.id=tbl2.id
where tbl1.time between (some value) and
tbl2.createtime between (some value)
I have a sort key on time and a dist key on id. When I checked the stl_scan table, it showed that my query is scanning 50 million rows on each slice and only returning 50K rows on each slice. I stopped the query after 20 mins.
For testing, I created the same table with 1 billion rows, and the same update query took 3 mins.
When I run a select with the same condition, I get the results in a few seconds. Is there anything I am doing wrong?

I believe the correct syntax is:
Update tbl1
set price = tbl2.price,
flag = true
from tbl2
where tbl1.id = tbl2.id and
tbl1.time between (some value) and
tbl2.createtime between (some value);
Note that tbl1 is only mentioned once, in the update clause. There is no join, just a correlation clause.

way to reduce the cost in db2 for count(*)

I have a DB2 query as below:
select count(*) as count from
table_a a,
table_b b,
table_c c
where
b.xxx=234 AND
b.yyy=c.wedf
Result set:
Count
618543562
For the above query I even tried count(1), but when I looked at the access plan, the cost was the same.
select count(1) as count from
table_a a,
table_b b,
table_c c
where
b.xxx=234 AND
b.yyy=c.wedf
Result set:
Count
618543562
Is there any other way to reduce the cost?
PS: b.xxx, b.yyy and c.wedf are indexed.
Thanks in advance.

I think one of the problems is the statistics on the table. Did you execute RUNSTATS? Probably the data distribution or the number of rows that have to be read is large, and DB2 concludes that it is better to read the whole table instead of processing an index and then fetching the rows from the table.
It seems that both queries get the same access plan, and I think they are doing table scans.
Are the three columns part of the same index, or are they indexed separately? If they are in different indexes, is there any index ANDing in the access plan? If there is no ANDing across the different indexes, the columns have to be read from the table in order to process the predicates.
The reason count(1) and count(*) give the same cost is that both have to do a table scan.
Please take a look at the access plan: not only the cost in timerons, but also the steps. Does the access plan use the indexes? How many sorts does it execute?
Try changing the optimization level, and you will see that the access plan changes. I think you are executing with the default one (5).
If you want to force the query to take an index into account, you can create an optimization profile.
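
As a hedged sketch of the first two suggestions, the following could be run from the DB2 command line processor. The schema name myschema is a placeholder, and optimization level 7 is just one value to try against the default of 5.

```sql
-- Refresh table, distribution and index statistics for the filtered table.
-- "myschema" is a hypothetical schema name; RUNSTATS needs a qualified name.
RUNSTATS ON TABLE myschema.table_b WITH DISTRIBUTION AND DETAILED INDEXES ALL;

-- Try a different optimization class for this session, then re-check the access plan.
SET CURRENT QUERY OPTIMIZATION = 7;
```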

What is the relation between the (B, C) tables and table A? In your query you just use a CROSS JOIN between A and (B, C). That is the main performance issue.
If you really need this count, just multiply the counts for A and (B, C):
select
(select count(*) from a)
*
(select count(*) from b, c where b.xxx=234 AND b.yyy=c.wedf )
For DB2, use this:
select a1.cnt1 *
(select count(*) as cnt2 from b, c where b.xxx=234 AND b.yyy=c.wedf )
from
(select count(*) as cnt1 from a) a1