Inner join on tables with 50M and 30K entries - postgresql

I have two tables A and B. A contains 50 million entries and B contains just 30 thousand. I have created default indexes (B-tree) on the columns used to join the tables. The join field is of type character varying.
I am querying the database with this query:
SELECT count(*)
from B INNER JOIN A
ON B.id = A.id;
The execution time of the above query is approximately 8 seconds. Looking at the execution plan, I see that the planner applies a sequential scan to table A, scanning all 50 million entries (this takes most of the time), and an index scan on table B.
How can I speed up the query?

You cannot speed up this query if you want an exact result.
The most efficient join strategy will probably be a hash or merge join, depending on your work_mem setting.
You might be able to get some speed improvement with an index-only scan; VACUUM both tables before querying so that the visibility maps are up to date.
Apart from that, the only tuning option is to make sure both tables are cached in RAM.
There are ways to get estimated counts, see my blog for details.
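For illustration, one common estimation technique (a sketch; not necessarily what the blog describes) reads the planner's statistics from pg_class. Note that this estimates a single table's row count, not the count of the join:
-- Fast approximate count for table a, kept roughly current by ANALYZE/autovacuum.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE oid = 'a'::regclass;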

Related

How to optimize the following query by adding more indexes?

I am trying to optimize a query which has been destroying my DB.
https://explain.depesz.com/s/isM1
If you have any insights into how to make this better please let me know.
We are using RDS/Postgres 11.9
explain analyze SELECT "src_rowdifference"."key",
"src_rowdifference"."port_id",
"src_rowdifference"."shipping_line_id",
"src_rowdifference"."container_type_id",
"src_rowdifference"."shift_id",
"src_rowdifference"."prev_availability_id",
"src_rowdifference"."new_availability_id",
"src_rowdifference"."date",
"src_rowdifference"."prev_last_update",
"src_rowdifference"."new_last_update"
FROM "src_rowdifference"
INNER JOIN "src_containertype" ON ("src_rowdifference"."container_type_id" = "src_containertype"."key")
WHERE ("src_rowdifference"."container_type_id" IN
(SELECT U0."key"
FROM "src_containertype" U0
INNER JOIN "notification_tablenotification_container_types" U1 ON (U0."key" = U1."containertype_id")
WHERE U1."tablenotification_id" = 'test#test.com')
AND "src_rowdifference"."new_last_update" >= '2020-01-15T03:11:06.291947+00:00'::timestamptz
AND "src_rowdifference"."port_id" IN
(SELECT U0."key"
FROM "src_port" U0
INNER JOIN "notification_tablenotification_ports" U1 ON (U0."key" = U1."port_id")
WHERE U1."tablenotification_id" = 'test#test.com')
AND "src_rowdifference"."shipping_line_id" IN
(SELECT U0."key"
FROM "src_shippingline" U0
INNER JOIN "notification_tablenotification_shipping_lines" U1 ON (U0."key" = U1."shippingline_id")
WHERE U1."tablenotification_id" = 'test#test.com')
AND "src_rowdifference"."prev_last_update" IS NOT NULL
AND NOT ("src_rowdifference"."prev_availability_id" = 'na'
AND "src_rowdifference"."prev_availability_id" IS NOT NULL)
AND NOT ("src_rowdifference"."key" IN
(SELECT V1."rowdifference_id"
FROM "notification_tablenotificationtrigger_row_differences" V1
WHERE V1."tablenotificationtrigger_id" IN
(SELECT U0."id"
FROM "notification_tablenotificationtrigger" U0
WHERE U0."notification_id" = 'test#test.com'))));
All my indexes are btree + btree(varchar_pattern_ops)
"src_rowdifference_port_id_shipping_line_id_9b3465fc_uniq" UNIQUE CONSTRAINT, btree (port_id, shipping_line_id, container_type_id, shift_id, date, new_last_update)
Edit: A somewhat unrelated change I made was adding some more SSD disk space to my RDS instance. That made a huge difference to the CPU usage and, in turn, to the number of connections we have.
It is hard to think about the plan as a whole, as I don't understand what it is looking for. But looking at the individual pieces, there are two which together dominate the run time.
One is the index scan on src_rowdifference_port_id_shipping_line_id_9b3465fc, which seems pretty slow given the number of rows returned. Comparing the Index Condition to the index columns, I can see that the condition on new_last_update cannot be applied efficiently in the index because two columns in the index come before it and have no equality conditions in the node. So instead that >= is applied as an "in-index filter" where it needs to test each row and reject it, rather than just skipping it in bulk. I don't know how many rows that removes as the "Rows Removed by Filter" does not count in-index filters, but it is potentially large. So one thing to try would be to make a new index on (port_id, shipping_line_id, container_type_id, new_last_update). Or maybe replace that index with a reordered version (port_id, shipping_line_id, container_type_id, new_last_update, shift_id, date) but of course that might make some other query worse.
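A sketch of that suggested index (the index name is made up; CONCURRENTLY avoids blocking writes while it builds):
-- Equality columns first, then the range column new_last_update, so the
-- >= condition can be used as an index boundary instead of an in-index filter.
CREATE INDEX CONCURRENTLY src_rowdifference_plsc_nlu_idx
ON src_rowdifference (port_id, shipping_line_id, container_type_id, new_last_update);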
The other time-consuming thing is executing the materialize node 47 thousand times (each one looping over up to 22 thousand rows) to implement NOT (SubPlan 1). That should be using a hashed subplan rather than a linearly searched one. The only reason I can think of for it not using a hashed subplan is that work_mem is not large enough for the planner to anticipate the hash table fitting in memory. What is your setting for work_mem? What happens if you bump it up to "100MB" or so?
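For example, at the session level (on RDS the setting can also be changed in the parameter group):
SHOW work_mem;          -- check the current value
SET work_mem = '100MB'; -- affects the current session only
-- then re-run EXPLAIN ANALYZE and check whether the subplan is now hashed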
The NOT (SubPlan 1) from the EXPLAIN corresponds to the part of your query AND NOT ("src_rowdifference"."key" IN (...)). If bumping up work_mem doesn't work, you could try rewriting that into a NOT EXISTS clause instead.
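A sketch of that rewrite, folding the nested IN into a single correlated NOT EXISTS (identifiers taken from your query; note that NOT IN and NOT EXISTS treat NULLs differently, which matters only if rowdifference_id can be NULL):
AND NOT EXISTS
(SELECT 1
FROM "notification_tablenotificationtrigger_row_differences" V1
INNER JOIN "notification_tablenotificationtrigger" U0 ON (V1."tablenotificationtrigger_id" = U0."id")
WHERE U0."notification_id" = 'test#test.com'
AND V1."rowdifference_id" = "src_rowdifference"."key")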

Very long query planning times for database with lots of partitions in PostgreSQL

I have a PostgreSQL 10 database that contains two tables which both have two levels of partitioning (by list).
The data is now stored within 5K to 10K partitioned tables (grand-children of the two tables mentioned above) depending on the day.
There are three indexes per grand-child partition table, but the two columns on which partitioning is done aren't indexed (I don't think this is needed, is it?).
The issue I'm observing is that query planning is very slow while query execution is very fast, even when the partition values are hard-coded in the query.
Researching the issue, I thought that the linear search used by PostgreSQL 10 to find the partition metadata was the cause.
cf: https://blog.2ndquadrant.com/partition-elimination-postgresql-11/
So I decided to try out PostgreSQL 11 which includes the two aforementioned patches:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=499be013de65242235ebdde06adb08db887f0ea5
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9fdb675fc5d2de825414e05939727de8b120ae81
Alas, it seems that the version change doesn't change anything.
Now, I know that PostgreSQL doesn't cope well with lots of partitions, but I would still like to understand why the query planner is so slow in PostgreSQL 10 and still in PostgreSQL 11.
An example of a query would be:
EXPLAIN ANALYZE
SELECT table_a.a, table_b.a
FROM (
    SELECT a, b
    FROM table_a
    WHERE partition_level_1_column = 'foo'
      AND partition_level_2_column = 'bar'
) AS table_a
INNER JOIN (
    SELECT a, b
    FROM table_b
    WHERE partition_level_1_column = 'baz'
      AND partition_level_2_column = 'bat'
) AS table_b
    ON table_b.b = table_a.b
LIMIT 10;
Running it on a database with 5K partitions returns Planning Time: 7155.647 ms but Execution Time: 2.827 ms.

Optimal way of using joins in Redshift

I have 2 tables in AWS Redshift. The details are as below:
a) impressions (to count the number of impressions of a particular ad)
Number of rows (170 million)
distribution key(ad_campaign)
sort key (created_on)
b) clicks (to count the number of clicks of a particular ad).
Number of rows (80 million)
distribution key(ad_campaign)
sort key (created_on)
I have a single DC1 Large cluster with 2 slices.
I am trying to run the below query
select impressions.offer_id, count(imp_cnt)
from bidsflyer.tblImpressionLog_Opt impressions
full join bidsflyer.tblTrackingLinkLog_Opt clicks
on impressions.offer_id = clicks.offer_id
and date_trunc('week', impressions.created_on) = date_trunc('week', clicks.created_on)
where impressions.created_on >= '2017-07-27 00:00:00'
group by 1
This query takes more than 8 minutes to run. I think this is quite long considering the volume of data, which I feel is not huge.
The query plan looks like something below
XN HashAggregate (cost=2778257688268.43..2778257688268.60 rows=67 width=12)
-> XN Hash Left Join DS_DIST_NONE (cost=179619.84..2778170875920.65 rows=17362469555 width=12)
Hash Cond: (("outer".offer_id = "inner".offer_id) AND (date_trunc('week'::text, "outer".created_on) = date_trunc('week'::text, "inner".created_on)))
-> XN Seq Scan on tblimpressionlog_opt impressions (cost=0.00..724967.36 rows=57997389 width=20)
Filter: (created_on >= '2017-07-27 00:00:00'::timestamp without time zone)
-> XN Hash (cost=119746.56..119746.56 rows=11974656 width=12)
-> XN Seq Scan on tbltrackinglinklog_opt clicks (cost=0.00..119746.56 rows=11974656 width=12)
Can anyone provide guidance on the correct usage of distribution keys and sort keys?
How should I design my query?
Table setup:
1) According to the plan, the most expensive operation is grouping by offer_id. This makes sense, because you didn't sort or distribute your data by offer_id. Your tables are quite large, so you can recreate them with an interleaved sort key on (offer_id, created_on); interleaved keys give equal, order-independent weight to the included columns and are known to have a positive effect on larger tables. A sketch follows below.
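A DDL sketch under those assumptions (only the columns mentioned in the question are shown, and their types are guesses):
-- Hypothetical recreation; copy the data over, then vacuum and analyze.
CREATE TABLE bidsflyer.tblImpressionLog_New (
    ad_campaign varchar(64), -- assumed type
    offer_id bigint,         -- assumed type
    created_on timestamp,
    imp_cnt int
    -- ... remaining columns ...
)
DISTKEY (ad_campaign)
INTERLEAVED SORTKEY (offer_id, created_on);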
2) If you join by weeks, you can materialize your week column (create a physical column and populate it with the date_trunc output); see the sketch below. That might save some computation effort in deriving these values dynamically during the join. However, this operation is cheap, and if your table is already sorted by the timestamp column, Redshift might already scan only the appropriate blocks. Also, if each offer runs for a short period of time (meaning the offer column has high cardinality and high correlation with the time column), you can use a compound sort key on (offer_id, week_created); that will allow a merge join, which is faster, and the aggregate will run fast as well.
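A sketch of materializing the week (week_created is a hypothetical column name, and an UPDATE of this size is an expensive one-off; a deep copy into a new table may be cheaper):
ALTER TABLE bidsflyer.tblImpressionLog_Opt ADD COLUMN week_created timestamp;
UPDATE bidsflyer.tblImpressionLog_Opt
SET week_created = date_trunc('week', created_on);
-- remember to vacuum afterwards, since UPDATE leaves deleted rows behind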
3) If you don't use ad_campaign in other queries, you can distribute both tables by offer_id. Having the join column in the distribution key is good practice, though it's unlikely that your query will benefit from this since you have a single node, and distribution style mostly affects multi-node setups.
All these recommendations are just assumptions made without knowing the exact nature of your data; they require benchmarks (create a table with the recommended configuration, copy the data, vacuum, analyze, run the same query at least 3 times, and compare timings with the original setup). I would appreciate it if you did this and posted the results here.
Regarding the query itself, you can replace FULL JOIN with JOIN, because you don't need it. FULL JOIN should be used when you want not only the intersection of both tables but also the impressions that have no related clicks and vice versa. That doesn't seem to be the case here, since you filter by impressions.created_on and group by impressions.offer_id; all you need is the intersection. Replacing FULL JOIN with a simple JOIN might also improve query performance (see the rewrite below). If you want to see offers that have zero clicks, use LEFT JOIN instead.
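The rewritten query would look like this:
select impressions.offer_id, count(imp_cnt)
from bidsflyer.tblImpressionLog_Opt impressions
join bidsflyer.tblTrackingLinkLog_Opt clicks
on impressions.offer_id = clicks.offer_id
and date_trunc('week', impressions.created_on) = date_trunc('week', clicks.created_on)
where impressions.created_on >= '2017-07-27 00:00:00'
group by 1;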
Merge join is faster than hash join, so you should try to achieve a merge join. Your sort key looks okay, but is your data actually sorted? Redshift does not automatically keep a table's rows sorted by the sort key, and with unsorted data there is no way for Redshift to perform a merge join on your table. After running a full vacuum on the table, Redshift will start performing merge joins.
select * from svv_table_info where "table" = 'impressions';
select * from svv_table_info where "table" = 'clicks';
Use the queries above to check the amount of unsorted data in each table (the unsorted column shows the percentage of unsorted rows).
Run a full vacuum on both your tables. Depending on the amount of unsorted data, this might take a while and use a lot of your cluster's resources.
VACUUM FULL impressions TO 100 PERCENT;
VACUUM FULL clicks TO 100 PERCENT;
If I’ve made a bad assumption please comment and I’ll refocus my answer.

Optimise LEFT JOIN in PostgreSQL (PGAdmin4)

I have 2 tables in PostgreSQL, one of which has 16 million rows and the other around 3,000. They both share 2 common IDs, but the larger table has thousands of rows with the same ID.
I'm trying to do a LEFT JOIN with a few conditions as follows:
SELECT LT.Col1, LT.Col2, LT.Col3, ST.Col1, ST.Col2
FROM large_table as LT
LEFT JOIN small_table as ST
ON LT.id1 = ST.id1 AND LT.id2 = ST.id2
WHERE LT.Col1 > 30
AND LT.Col2 > 2
AND LT.Col3 BETWEEN '11:00:00'::time AND '21:00:00'::time
I have created multi-column indexes based on id1 and id2 for each table, but the query just runs and runs. I'm using pgAdmin 4 on a MacBook Pro (16 GB RAM, 2.9 GHz quad-core i7). I've checked the computer's performance and it's not struggling. Does anybody have any advice on how to speed up the query? Am I just asking too much of it?
Since this is a left outer join, your best bet is to use indexes on large_table that reduce the number of rows early on.
Unfortunately none of your conditions checks for equality, so a combined index is useless.
You could create indexes on the three columns of large_table and see if PostgreSQL uses them (e.g. using a bitmap index scan and combining the results).
Those indexes that PostgreSQL chooses not to use can be dropped again.
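A sketch of those single-column indexes (PostgreSQL generates the index names; column names taken from your query):
CREATE INDEX ON large_table (col1);
CREATE INDEX ON large_table (col2);
CREATE INDEX ON large_table (col3);
-- afterwards, check with EXPLAIN which ones the planner actually uses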
You might try creating a combined index on the tuple (id1, id2) in both tables, then joining with ON (LT.id1, LT.id2) = (ST.id1, ST.id2), as sketched below.
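A sketch of that suggestion (index names are auto-generated; the query is yours from above):
CREATE INDEX ON large_table (id1, id2);
CREATE INDEX ON small_table (id1, id2);
-- then join with the row-value syntax:
SELECT LT.Col1, LT.Col2, LT.Col3, ST.Col1, ST.Col2
FROM large_table AS LT
LEFT JOIN small_table AS ST
ON (LT.id1, LT.id2) = (ST.id1, ST.id2)
WHERE LT.Col1 > 30
AND LT.Col2 > 2
AND LT.Col3 BETWEEN '11:00:00'::time AND '21:00:00'::time;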

Postgres slow bitmap heap scan

I have tables messages and phones with around 6M rows, and this query's performance is very poor:
SELECT t1.id, t2.number, t1.name, t1.gender
FROM messages t1
INNER JOIN phones t2 ON t2.id = t1.parent_id
INNER JOIN regions t6 ON t6.id = t1.region_id
WHERE t2.number IS NOT NULL AND t1.entity AND NOT t2.type AND t1.region_id = 50
ORDER BY t1.id LIMIT 100
EXPLAIN ANALYZE result: http://explain.depesz.com/s/Pd6D
There are B-tree indexes on all columns in the WHERE condition, primary keys on all id columns, and foreign keys in the messages table on parent_id and region_id as well. VACUUM has been run on all tables, too.
But over 15 seconds for just 100 rows is too slow. What is wrong?
Postgres 9.3, Ubuntu 13.10, CPU 2x 2.5 GHz, 4 GB RAM, pg config: http://pastebin.com/mPVH1YJi
This completely depends on your read vs. write load, but one solution may be to create composite indexes for the most common / general cases.
For example, BTREE(parent_id, region_id) to turn that heap scan into an index scan would be huge. Since you have dynamic queries, there might be a few other combinations of composite indexes you might need for other queries, but I would recommend using only two columns in your composite indexes for now (as each query is different). Note that BTREE(parent_id, region_id) can also be scanned when only parent_id is needed, so there is no need to carry a BTREE(parent_id) index as well.
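A sketch of that composite index (the name is hypothetical):
-- Serves the join on parent_id plus the region_id = 50 filter, and can
-- also be scanned by queries that need parent_id alone.
CREATE INDEX messages_parent_id_region_id_idx
ON messages (parent_id, region_id);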