Query simplification based on selected columns - postgresql

I'm trying to understand how PostgreSQL simplifies a query: let's say I have 2 tables ("tb_thing" and "tb_thing_template"), where each thing points to a template, and that I run a query like this:
EXPLAIN SELECT
tb_thing.id
FROM
tb_thing,
tb_thing_template
WHERE
tb_thing_template.id = tb_thing.template_id
;
This is the result:
QUERY PLAN
---------------------------------------------------------------------------------
Hash Join (cost=34.75..64.47 rows=788 width=4)
Hash Cond: (tb_thing.template_id = tb_thing_template.id)
-> Seq Scan on tb_thing (cost=0.00..18.88 rows=788 width=8)
-> Hash (cost=21.00..21.00 rows=1100 width=4)
-> Seq Scan on tb_thing_template (cost=0.00..21.00 rows=1100 width=4)
The planner is joining the two tables even though I'm only selecting one field from "tb_thing" and nothing from "tb_thing_template". I was hoping the planner was smart enough to figure out it didn't need to actually join the "tb_thing_template" table because I'm not selecting anything from it.
Why does it do the join anyway? Why isn't the column selection taken into account when the query is planned?
Thanks!

Semantically your query and a simple SELECT tb_thing.id FROM tb_thing are not the same.
Assume, for instance, that table tb_thing_template has 4 rows with an identical id value that also appears as a tb_thing.template_id. The result of your query will then have 4 rows with the same tb_thing.id. Conversely, if a tb_thing.template_id is not present in tb_thing_template.id, then that row will not be output at all.
Only when tb_thing_template.id is a PRIMARY KEY (so unique) and tb_thing.template_id is a FOREIGN KEY to that id with just a single row for each PRIMARY KEY, so a 1:1 relationship, are both queries semantically the same. Even a 1:N relationship, which is more typical in a PK-FK relationship, would require the join in a semantic sense. But the planner has no way of knowing if the relationship is 1:1, so you get the join.
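As a quick illustration (throwaway tables and made-up values, not taken from the question), duplicates and missing ids on the template side change the result, so the planner cannot drop the join:
CREATE TABLE demo_thing          (id int, template_id int);
CREATE TABLE demo_thing_template (id int);
INSERT INTO demo_thing          VALUES (1, 10), (2, 99);  -- 99 has no matching template
INSERT INTO demo_thing_template VALUES (10), (10);        -- duplicate template id
SELECT demo_thing.id
FROM demo_thing, demo_thing_template
WHERE demo_thing_template.id = demo_thing.template_id;
-- returns id = 1 twice and id = 2 not at all, unlike a plain SELECT id FROM demo_thing.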
But you should not try to trick the query planner; it is smart, but it cannot compensate for a query that says something different from what you actually mean.

Related

What is the proper postgresql index for listing all distinct json array values?

I have the following query
select distinct c1::text
from (select json_array_elements((value::jsonb -> 'boundaries')::json) as c1 from geoinfo) t1;
And I get this query plan:
QUERY PLAN
----------------------------------------------------------------------------------------------------
HashAggregate (cost=912918.87..912921.87 rows=200 width=32)
Group Key: (t1.c1)::text
-> Subquery Scan on t1 (cost=1000.00..906769.25 rows=2459849 width=32)
-> Gather (cost=1000.00..869871.51 rows=2459849 width=32)
Workers Planned: 2
-> ProjectSet (cost=0.00..622886.61 rows=102493700 width=32)
-> Parallel Seq Scan on geoinfo (cost=0.00..89919.37 rows=1024937 width=222)
There are ~500 rows returned from a table with 2.5 Million rows.
What index can I create that will cause this query to execute much faster?
I tried the somewhat obvious, and it didn't work:
# create index gin_boundaries_array on geoinfo using gin (json_array_elements((value::jsonb -> 'boundaries')::json));
ERROR: set-returning functions are not allowed in index expressions
LINE 1: ... index gin_boundaries_array on geoinfo using gin (json_array...
+1000 on Bergi's comment. JSON is a fantastic interchange format, but for searchable storage it's an edge case that has become mainstream. That doesn't mean it's always ill-advised, but when you start having to do joins against nested/embedded elements and spend a lot of mental bandwidth (and syntactic sugar) to get things done, it's worth asking: "is the convenience of stuffing things into JSON outweighed by the cost and hassle of seeing inside the objects and getting things out?"
Specifically in Postgres, you can index JSON elements, but the planner cannot maintain useful statistics on them the way it can on full columns. (As I understand it; I haven't tested this myself. I use JSONB for raw storage of JSON and search by other columns.)
As you may already know, Postgres has a lot of handy utilities for dealing with JSON. I have one JSONB field that I expand using jsonb_to_recordset and a cross join. If you're in the mood, it would be worth setting yourself up a test database. Unpack the JSON into a real table, and then try your queries against that. See for yourself if it's a better fit for your needs or not.
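For example, here is a rough sketch of that unpacking applied to the question's table (it assumes geoinfo has some key column, here called id, which is purely a guess):
CREATE TABLE geoinfo_boundaries AS
SELECT id,
       jsonb_array_elements_text(value::jsonb -> 'boundaries') AS boundary
FROM geoinfo;
CREATE INDEX geoinfo_boundaries_boundary_idx ON geoinfo_boundaries (boundary);
-- the original query then becomes a plain, indexable scan:
SELECT DISTINCT boundary FROM geoinfo_boundaries;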

Postgres doing a sort on simple join

I have two tables in my database (address and person_address). Address has a PK on address_id; person_address has a PK on (address_id, person_id, usage_code).
When joining these two tables on address_id, my expectation is that the PK index is used in both cases. However, Postgres is adding sort and materialize steps to the plan, which slows down the execution of the query. I have tried dropping indexes (person_address had an index on address_id) and analyzing stats, without success.
I would appreciate any help isolating this, since these queries run slower than expected in our production environment.
This is the query:
select *
from person_addresses pa
join address a
on pa.address_id = a.address_id
This is the plan:
Merge Join (cost=1506935.96..2416648.39 rows=16033774 width=338)
Merge Cond: (pa.address_id = ((a.address_id)::numeric))
-> Index Scan using person_addresses_pkey on person_addresses pa (cost=0.43..592822.76 rows=5256374 width=104)
-> Materialize (cost=1506935.53..1526969.90 rows=4006874 width=234)
-> Sort (cost=1506935.53..1516952.71 rows=4006874 width=234)
Sort Key: ((a.address_id)::numeric)
-> Seq Scan on address a (cost=0.00..163604.74 rows=4006874 width=234)
Thanks.
Edit 1: After the comment, I checked the data types and found a discrepancy. Fixing the data type changed the plan to the following:
Hash Join (cost=343467.18..881125.47 rows=5256374 width=348)
Hash Cond: (pa.address_id = a.address_id)
-> Seq Scan on person_addresses pa (cost=0.00..147477.74 rows=5256374 width=104)
-> Hash (cost=159113.97..159113.97 rows=4033697 width=244)
-> Seq Scan on address_normalization a (cost=0.00..159113.97 rows=4033697 width=244)
The performance improvement is evident in the plan, but I am wondering if the sequential scans are expected without any filters.
So there are two questions here:
why did Postgres choose the (expensive) "Merge Join" in the first query?
The reason is that it could not use the more efficient "Hash Join": the hash values of an integer and an equal numeric value are different, so the rows could not be matched by hash. A merge join instead requires that the inputs are sorted, and that's where the "Sort" step in the first execution plan comes from. Given the number of rows, a "Nested Loop" would have been even more expensive.
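A hedged sketch of the kind of fix the edit describes (it assumes person_addresses.address_id is the numeric column and address.address_id is an integer, as the cast in the plan suggests; verify the actual types before running anything like this):
ALTER TABLE person_addresses
    ALTER COLUMN address_id TYPE integer
    USING address_id::integer;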
The second question is:
I am wondering if the sequential scans are expected without any filters
Yes, they are expected. The query retrieves all matching rows from both tables, and that is done most efficiently by scanning all rows. An index scan requires about 2-3 I/O operations per row retrieved. A sequential scan usually requires less than one I/O operation per row, because one block (the smallest unit the database reads from disk) contains multiple rows.
You can run explain (analyze, buffers) to see how many "logical reads" each step performs.
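For example, for the query from the question:
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM person_addresses pa
JOIN address a ON pa.address_id = a.address_id;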

PostgreSQL - existence of index causes hash-join

I was looking at the EXPLAIN output of a natural join query of two simple tables. At first, the PostgreSQL planner uses a merge join. Then I add an index on the join attribute, and the planner switches to a hash join instead (and with a sequential scan of the data!).
So my question is: why does the existence of an index cause a hash join?
Additional data & code:
I defined two relations: R(A,B) and S(B,C) (without primary keys or such).
I filled the tables with a few rows of data (~5 each, such that there are common values of the attribute B in R and S).
Then I executed:
EXPLAIN VERBOSE SELECT * FROM R NATURAL JOIN S;
which resulted in:
Merge Join (cost=317.01..711.38 rows=25538 width=12)...
and finally, executed:
CREATE INDEX SI on S(B);
EXPLAIN VERBOSE SELECT * FROM R NATURAL JOIN S;
which resulted in:
Hash Join (cost=1.09..42.62 rows=45 width=12)...
Seq Scan on "user".s (cost=0.00..1.04 rows=4 width=8)

Optimal way of using joins in Redshift

I have 2 tables in AWS Redshift. The details are as below:
a) impressions (to count the number of impressions of a particular ad)
Number of rows (170 million)
distribution key(ad_campaign)
sort key (created_on)
b) clicks (to count the number of clicks of a particular ad).
Number of rows (80 million)
distribution key(ad_campaign)
sort key (created_on)
I have a single DC1 Large cluster with 2 slices.
I am trying to run the below query
select impressions.offer_id, count(imp_cnt) from
bidsflyer.tblImpressionLog_Opt impressions
full join bidsflyer.tblTrackingLinkLog_Opt clicks
on impressions.offer_id=clicks.offer_id and date_trunc('week',
impressions.created_on)=date_trunc('week', clicks.created_on)
where impressions.created_on >= '2017-07-27 00:00:00'
group by 1
This query takes more than 8 minutes to run. I think this is quite long considering the volume of data, which I feel is not huge.
The query plan looks like something below
XN HashAggregate (cost=2778257688268.43..2778257688268.60 rows=67 width=12)
-> XN Hash Left Join DS_DIST_NONE (cost=179619.84..2778170875920.65 rows=17362469555 width=12)
Hash Cond: (("outer".offer_id = "inner".offer_id) AND (date_trunc('week'::text, "outer".created_on) = date_trunc('week'::text, "inner".created_on)))
-> XN Seq Scan on tblimpressionlog_opt impressions (cost=0.00..724967.36 rows=57997389 width=20)
Filter: (created_on >= '2017-07-27 00:00:00'::timestamp without time zone)
-> XN Hash (cost=119746.56..119746.56 rows=11974656 width=12)
-> XN Seq Scan on tbltrackinglinklog_opt clicks (cost=0.00..119746.56 rows=11974656 width=12)
Can anyone provide guidance on the correct usage of distribution keys and sort keys?
How should I design my query?
Table setup:
1) According to the plan, the most expensive operation is grouping by offer_id. This makes sense because you didn't sort or distribute your data by offer_id. Your tables are quite large, so you can recreate the table with an interleaved sort key on (offer_id, created_on); interleaved keys are supposed to give equal, order-independent weight to the included columns and are known to have a positive effect on larger tables (see the DDL sketch below).
2) If you join by weeks, you can materialize your week column (create a physical column and populate it with the date_trunc output). That might save you some computation effort to get these values dynamically during the join. However, this operation is cheap, and if your table is already sorted by the timestamp column, Redshift might already be scanning only the appropriate blocks. Also, if each offer runs for a short period of time (meaning the offer column has high cardinality and high correlation with the time column), you can use a compound sort key on (offer_id, week_created) that will allow a merge join, which is faster, and the aggregate will run fast as well.
3) If you don't use ad_campaign in other queries, you can distribute both tables by offer_id. Having the join column in the distribution key is good practice, although it's unlikely that your query will benefit from this since you have a single node, and distribution style mostly affects multi-node setups.
All these recommendations are just assumptions made without knowing the exact nature of your data; they require running benchmarks (create a table with the recommended configuration, copy the data, vacuum, analyze, run the same query at least 3 times and compare the timings with the original setup). I would appreciate it if you did this and posted the results here.
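A hedged sketch of what a recreated impressions table from points 1) and 2) could look like (the CTAS approach, the table name, and keeping all original columns are assumptions; adapt to your real schema before benchmarking):
CREATE TABLE bidsflyer.tblImpressionLog_test
DISTKEY (offer_id)
INTERLEAVED SORTKEY (offer_id, created_on)
AS
SELECT i.*,
       DATE_TRUNC('week', i.created_on) AS week_created   -- materialized week column from point 2)
FROM bidsflyer.tblImpressionLog_Opt i;
The clicks table would be rebuilt the same way so both sides share the offer_id distribution and sort layout.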
Regarding the query itself, you can replace FULL JOIN with JOIN because you don't need it. FULL JOIN should be used when you want to get not only the intersection of both tables but also impressions that don't have related clicks and vice versa. That doesn't seem to be the case here, because you are filtering by impressions.created_on and grouping by impressions.offer_id, so all you need is the intersection. Replacing FULL JOIN with a plain JOIN may also improve query performance. If you want to see offers that have zero clicks, use LEFT JOIN instead.
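A sketch of that rewrite (identical to the query in the question except for the join type):
select impressions.offer_id, count(imp_cnt)
from bidsflyer.tblImpressionLog_Opt impressions
join bidsflyer.tblTrackingLinkLog_Opt clicks
  on impressions.offer_id = clicks.offer_id
 and date_trunc('week', impressions.created_on) = date_trunc('week', clicks.created_on)
where impressions.created_on >= '2017-07-27 00:00:00'
group by 1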
Merge join is faster than hash join, so you should try to achieve a merge join. Your sort key looks okay, but is your data actually sorted? Redshift does not automatically keep a table's rows sorted by the sort key, and if the data is unsorted there is no way for Redshift to perform a merge join on your table. After running a full vacuum on the table, Redshift will start performing merge joins.
select * from svv_table_info where "table" = 'impressions'
select * from svv_table_info where "table" = 'clicks'
Use the queries above to check the amount of unsorted data you have in your tables.
Run a full vacuum on both of your tables. Depending on the amount of unsorted data this might take a while and use a lot of your cluster resources.
VACUUM impressions to 100 percent
VACUUM clicks to 100 percent
If I’ve made a bad assumption please comment and I’ll refocus my answer.

Optimize query in PostgreSQL

SELECT count(*)
FROM contacts_lists
JOIN plain_contacts
ON contacts_lists.contact_id = plain_contacts.contact_id
JOIN contacts
ON contacts.id = plain_contacts.contact_id
WHERE plain_contacts.has_email
AND NOT contacts.email_bad
AND NOT contacts.email_unsub
AND contacts_lists.list_id =67339
How can I optimize this query? Could you please explain?
Reformatting your query plan for clarity:
QUERY PLAN
Aggregate (cost=126377.96..126377.97 rows=1 width=0)
-> Hash Join (cost=6014.51..126225.38 rows=61033 width=0)
Hash Cond: (contacts_lists.contact_id = plain_contacts.contact_id)
-> Hash Join (cost=3067.30..121828.63 rows=61033 width=8)
Hash Cond: (contacts_lists.contact_id = contacts.id)
-> Index Scan using index_contacts_lists_on_list_id_and_contact_id
on contacts_lists (cost=0.00..116909.97 rows=61033 width=4)
Index Cond: (list_id = 66996)
-> Hash (cost=1721.41..1721.41 rows=84551 width=4)
-> Seq Scan on contacts (cost=0.00..1721.41 rows=84551 width=4)
Filter: ((NOT email_bad) AND (NOT email_unsub))
-> Hash (cost=2474.97..2474.97 rows=37779 width=4)
-> Seq Scan on plain_contacts (cost=0.00..2474.97 rows=37779 width=4)
Filter: has_email
Two partial indexes might eliminate seq scans depending on your data distribution:
-- if many contacts have bad emails or are unsubscribed:
CREATE INDEX contacts_valid_email_idx ON contacts (id)
WHERE (NOT email_bad AND NOT email_unsub);
-- if many contacts have no email:
CREATE INDEX plain_contacts_valid_email_idx ON plain_contacts (id)
WHERE (has_email);
You might be missing an index on a foreign key:
CREATE INDEX plain_contacts_contact_id_idx ON plain_contacts (contact_id);
Last but not least, if you've never analyzed your data, you need to run:
VACUUM ANALYZE;
If it's still slow once all that is done, there isn't much you can do short of merging your plain_contacts and your contacts tables: getting the above query plan in spite of the above indexes means most/all of your subscribers are subscribed to that particular list -- in which case the above query plan is the fastest you'll get.
This is already a very simple query that the database will run in the most efficient way, provided that statistics are up to date.
So in terms of the query itself there's not much to do.
In terms of database administration, you can add indexes - there should be indexes in the database for all the join conditions and also for the most selective part of the where clause (list_id, and contact_id as a FK in plain_contacts and contacts_lists). This is the most significant opportunity to improve the performance of this query (orders of magnitude). Still, as SpliFF notes, you probably already have those indexes, so check.
Also, Postgres has a good EXPLAIN command that you should learn and use; it will help with optimizing queries.
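For example, prefixing the query from the question with EXPLAIN ANALYZE shows actual row counts and timings next to the planner's estimates:
EXPLAIN ANALYZE
SELECT count(*)
FROM contacts_lists
JOIN plain_contacts ON contacts_lists.contact_id = plain_contacts.contact_id
JOIN contacts ON contacts.id = plain_contacts.contact_id
WHERE plain_contacts.has_email
  AND NOT contacts.email_bad
  AND NOT contacts.email_unsub
  AND contacts_lists.list_id = 67339;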
Since you only want to include rows that have certain flags set in the joined tables, I would move those conditions into the join clauses:
SELECT count(*)
FROM contacts_lists
JOIN plain_contacts
ON contacts_lists.contact_id = plain_contacts.contact_id
AND plain_contacts.has_email
JOIN contacts
ON contacts.id = plain_contacts.contact_id
AND NOT contacts.email_unsub
AND NOT contacts.email_bad
WHERE contacts_lists.list_id =67339
I'm not sure whether this would make a great impact on performance, but it's worth a try. You should probably have indexes on the joined tables as well for optimal performance, like this (sketched below):
plain_contacts: contact_id, has_email
contacts: id, email_unsub, email_bad
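A sketch of those indexes (the index names are made up; the column lists mirror the suggestion above):
CREATE INDEX plain_contacts_contact_id_has_email_idx ON plain_contacts (contact_id, has_email);
CREATE INDEX contacts_id_email_flags_idx ON contacts (id, email_unsub, email_bad);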
Have you run ANALYZE on the database recently? Do the row counts in the EXPLAIN plan look like they make sense? (Looks like you ran only EXPLAIN. EXPLAIN ANALYZE gives both estimated and actual timings.)
You can use SELECT count(1) ... but other than that I'd say it looks fine. You could always cache some parts of the query using views or put indexes on contact_id and list_id if you're really struggling (I assume you have one on id already).