postgres chooses an awful query plan, how can that be fixed - postgresql

I'm trying to optimize this query:
EXPLAIN ANALYZE
select
dtt.matching_protein_seq_ids
from detected_transcript_translation dtt
join peptide_spectrum_match psm
on psm.detected_transcript_translation_id =
dtt.detected_transcript_translation_id
join peptide_spectrum_match_sequence psms
on psm.peptide_spectrum_match_sequence_id =
psms.peptide_spectrum_match_sequence_id
WHERE
dtt.matching_protein_seq_ids && ARRAY[654819, 294711]
;
When sequential scans are allowed (set enable_seqscan = on), the optimizer chooses a pretty awful plan that runs in 49.85 seconds:
https://explain.depesz.com/s/WKbew
With set enable_seqscan = off, the chosen plan uses the proper indexes and the query runs instantly.
https://explain.depesz.com/s/ISHV
Note that I did run ANALYZE on all three tables...

Your problem is that PostgreSQL cannot estimate the selectivity of the WHERE condition well, so it estimates it as a fixed percentage of the table's total rows, which is far too much.
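You can see how badly the estimate misses by running the filter on its own and comparing the planner's row estimate with the actual count (a quick check, using the table and column from your query):
EXPLAIN ANALYZE
SELECT detected_transcript_translation_id
FROM detected_transcript_translation
WHERE matching_protein_seq_ids && ARRAY[654819, 294711];
-- compare the planner's "rows=" estimate on the scan node with the actual rows returned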
If you know that there will always be few result rows for a query like this, you could cheat by defining a function:
CREATE OR REPLACE FUNCTION matching_transcript_translations(integer[])
RETURNS SETOF detected_transcript_translation
LANGUAGE SQL
STABLE STRICT
ROWS 2 /* pretend there are always exactly two matching rows */
AS
'SELECT * FROM detected_transcript_translation
WHERE matching_protein_seq_ids && $1';
You could use it like this:
select
dtt.matching_protein_seq_ids
from matching_transcript_translations(ARRAY[654819, 294711]) dtt
join peptide_spectrum_match psm
on psm.detected_transcript_translation_id =
dtt.detected_transcript_translation_id
join peptide_spectrum_match_sequence psms
on psm.peptide_spectrum_match_sequence_id =
psms.peptide_spectrum_match_sequence_id;
Then PostgreSQL should be tricked into thinking that there will be exactly two matching rows.
However, if there are a lot of matching rows, the resulting plan will be even worse than your current plan is…
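If that happens, the estimate can be adjusted without redefining the function; something like this (the value 10 is just a placeholder, pick a number close to your typical result size):
ALTER FUNCTION matching_transcript_translations(integer[]) ROWS 10;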

Related

PostgreSQL 11.5 doing sequential scan for SELECT EXISTS query

I have a multi-tenant environment where each tenant (customer) has its own schema to isolate their data. Not ideal, I know, but it was a quick port of a legacy system.
Each tenant has a "reading" table, with a composite index of 4 columns:
site_code char(8), location_no int, sensor_no int, reading_dtm timestamptz.
When a new reading is added, a function is called which first checks whether there has already been a reading in the last minute (for the same site_code, location_no and sensor_no):
IF EXISTS (
SELECT
FROM reading r
WHERE r.site_code = p_site_code
AND r.location_no = p_location_no
AND r.sensor_no = p_sensor_no
AND r.reading_dtm > p_reading_dtm - INTERVAL '1 minute'
)
THEN
RETURN;
END IF;
Now, bear in mind there are many tenants, all behaving fine except one. In that one tenant, the call takes nearly half a second rather than the usual few milliseconds, because it does a sequential scan on a table with nearly 2 million rows instead of an index scan.
My random_page_cost is set to 1.5.
I could understand a sequential scan if the query could return many rows, but it is only checking for the existence of any.
I've tried ANALYZE on the table, VACUUM FULL, etc., but it makes no difference.
If I put "SET LOCAL enable_seqscan = off" before the query, it works perfectly... It feels wrong, but it will have to do as a temporary solution, as this is a live system and it needs to work.
What else can I do to help Postgres make what is clearly the better decision of using the index?
EDIT: If I do a similar query manually (outside of a function) it chooses an index.
My guess is that the engine evaluates the predicate and considers it not selective enough (it thinks too many rows will be returned), so it decides to use a table scan instead.
I would do two things:
Make sure you have the correct index in place:
create index ix1 on reading (site_code, location_no,
sensor_no, reading_dtm);
Trick the optimizer by making the selectivity look better. You can do that by adding the extra (redundant) predicate and r.reading_dtm < :p_reading_dtm:
select 1
from reading r
where r.site_code = :p_site_code
and r.location_no = :p_location_no
and r.sensor_no = :p_sensor_no
and r.reading_dtm > :p_reading_dtm - interval '1 minute'
and r.reading_dtm < :p_reading_dtm
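Applied inside the original function, the check might look like this (a sketch using the parameter names from the question; the extra upper bound assumes no readings are stored with a timestamp later than p_reading_dtm):
IF EXISTS (
    SELECT
    FROM reading r
    WHERE r.site_code = p_site_code
      AND r.location_no = p_location_no
      AND r.sensor_no = p_sensor_no
      AND r.reading_dtm > p_reading_dtm - INTERVAL '1 minute'
      AND r.reading_dtm < p_reading_dtm  -- redundant upper bound, only there to improve the estimate
)
THEN
    RETURN;
END IF;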

Slow running Postgres query

I have this query that takes a very long time on my database. The SQL is generated by an ORM (Hibernate) inside an application; I don't have access to the source code.
I was wondering if anyone can take a look at the following EXPLAIN ANALYZE output and suggest any Postgres tweaks I can make.
I don't know where to start or how to tune my database to service this query.
The query looks like this:
select
resourceta0_.RES_ID as col_0_0_
from
HFJ_RESOURCE resourceta0_
left outer join HFJ_RES_LINK myresource1_ on resourceta0_.RES_ID = myresource1_.TARGET_RESOURCE_ID
left outer join HFJ_SPIDX_DATE myparamsda2_ on resourceta0_.RES_ID = myparamsda2_.RES_ID
left outer join HFJ_SPIDX_TOKEN myparamsto3_ on resourceta0_.RES_ID = myparamsto3_.RES_ID
where
(myresource1_.SRC_RESOURCE_ID in ('4954427' ... many more))
and myparamsda2_.HASH_IDENTITY='5247847184787287691' and
(myparamsda2_.SP_VALUE_LOW>='1950-07-01 11:30:00' or myparamsda2_.SP_VALUE_HIGH>='1950-07-01 11:30:00')
and myparamsda2_.HASH_IDENTITY='5247847184787287691'
and (myparamsda2_.SP_VALUE_LOW<='1960-06-30 12:29:59.999' or myparamsda2_.SP_VALUE_HIGH<='1960-06-30 12:29:59.999')
and (myparamsto3_.HASH_VALUE in ('-5305902187566578701'))
limit '500'
And the execution plan looks like this: https://explain.depesz.com/s/EJgOq
Edit - updated to add the depesz link.
Edit 2 - added more information about the query.
The cause of the slowness is the bad row count estimates, which make PostgreSQL choose a nested loop join. Almost all your time is spent in the index scan on hfj_res_link, which is repeated 1113 times.
My first attempt would be to ANALYZE hfj_spidx_date and see if that helps. If it does, make sure that autoanalyze processes that table more frequently.
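One way to do that is to lower the autoanalyze threshold for just that table (0.02 is only an illustrative value, meaning analyze after roughly 2% of the rows have changed):
ALTER TABLE hfj_spidx_date SET (autovacuum_analyze_scale_factor = 0.02);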
The next attempt would be to
SET default_statistics_target = 1000;
and then ANALYZE as above. If that helps, use ALTER TABLE to increase the STATISTICS on the hash_identity and sp_value_high columns.
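For example (1000 just mirrors the value tried above; adjust to taste):
ALTER TABLE hfj_spidx_date ALTER COLUMN hash_identity SET STATISTICS 1000;
ALTER TABLE hfj_spidx_date ALTER COLUMN sp_value_high SET STATISTICS 1000;
ANALYZE hfj_spidx_date;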
If that doesn't help either, and you have a recent version of PostgreSQL, you could try extended statistics:
CREATE STATISTICS myparamsda2_stats (dependencies)
ON hash_identity, sp_value_high FROM hfj_spidx_date;
Then ANALYZE the table again and see if that helps.
If all that doesn't help, and you cannot get the estimates correct, you have to try a different angle:
CREATE INDEX ON hfj_res_link (target_resource_id, src_resource_id);
That should speed up the index scan considerably and give you good response times.
Finally, if none of the above has any effect, you could use the crude measure of disallowing nested loop joins for this query:
BEGIN;
SET LOCAL enable_nestloop = off;
SELECT /* your query goes here */;
COMMIT;

PostgreSQL - PostGIS query optimization

I have a query which creates the input to the pgRouting pgr_drivingDistance function:
CREATE TEMP TABLE tmp_edge AS
SELECT
e."Id" as id,
e."Source" as source,
e."Target" as target,
e."Length" / (1000*LEAST("Speed", "SpeedMin")/60) as cost
FROM "Edge" e,
"SpeedLimit" sl
WHERE sl."VehicleKindId" = 1
AND e.the_geom &&
ST_MakeEnvelope(
x1-(1000*GREATEST("Speed", "SpeedMax")/60)*13,
y1-(1000*GREATEST("Speed", "SpeedMax")/60)*13,
x1+(1000*GREATEST("Speed", "SpeedMax")/60)*13,
y1+(1000*GREATEST("Speed", "SpeedMax")/60)*13, 3857)
AND sl."RoadCategoryId" = e."CategoryId";
In the WHERE clause I calculate the same thing several times to get bounding box coordinates.
I tried to put the calculation into the FROM part and use an alias for the calculated column, but then the whole execution time doubles.
The Edge table is quite large (1 million rows) and SpeedLimit has several dozen records.
Is there any way to enhance this query?
The recommended way is to join tables with explicit JOIN syntax, and then restrict the resulting set with WHERE. What does ST_MakeEnvelope do here? You can use an index on an expression in PostgreSQL ;)
Expression indexes in PostgreSQL
Since you are using expressions, you might benefit from them.
And you can use EXPLAIN ANALYZE to spot the bottlenecks in the query.
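As a sketch of the JOIN form, with the envelope radius computed only once per SpeedLimit row via a LATERAL subquery (untested; it assumes Speed, SpeedMin and SpeedMax come from SpeedLimit, x1 and y1 are the parameters from your original query, and given that your earlier rewrite got slower it should be checked with EXPLAIN ANALYZE):
CREATE TEMP TABLE tmp_edge AS
SELECT
    e."Id"     AS id,
    e."Source" AS source,
    e."Target" AS target,
    e."Length" / (1000 * LEAST(sl."Speed", sl."SpeedMin") / 60) AS cost
FROM "SpeedLimit" sl
CROSS JOIN LATERAL (
    SELECT (1000 * GREATEST(sl."Speed", sl."SpeedMax") / 60) * 13 AS radius
) r
JOIN "Edge" e
  ON e."CategoryId" = sl."RoadCategoryId"
 AND e.the_geom && ST_MakeEnvelope(x1 - r.radius, y1 - r.radius,
                                   x1 + r.radius, y1 + r.radius, 3857)
WHERE sl."VehicleKindId" = 1;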

optimize sql "With ismember Select From query"

I have a query which runs extremely slowly when checking IS_MEMBER, in comparison to just loading the whole dataset. This view acts as a security check: it checks whether you are a member of a particular group (e.g. group 1), and the next column then states what access that group has (e.g. division 2).
This view is then joined with the fact table, so that it only retrieves division 2 rows.
The question is: does IS_MEMBER execute for each row of fact data? That is just my theory, because the query runs 1000 times faster without this view. Can anyone suggest an alternative structure?
WITH group_security AS (SELECT DISTINCT division_code FROM dbo.dim_group_security_division AS gsd
WHERE (IS_MEMBER(group_name) = 1))
SELECT dbo.dim_division.dim_division_key, dbo.dim_division.division_ID, dbo.dim_division.division_code, dbo.dim_division.division_name
FROM dbo.dim_division INNER JOIN
group_security ON dbo.dim_division.division_code = group_security.division_code OR group_security.division_code = 'ALL'
Since you JOIN on dbo.dim_division.division_code, do you have an index on this column?
Alternatively you could give this a try:
SELECT dim.dim_division_key,
dim.division_ID,
dim.division_code,
dim.division_name
FROM dim_division dim
WHERE EXISTS ( SELECT *
FROM dbo.dim_group_security_division gsd
WHERE gsd.division_code IN ('ALL', dim.division_code)
AND IS_MEMBER(gsd.group_name) = 1 )
This way the system can stop at the first 'match' in dim_group_security_division instead of having to find all matches and then aggregate the result because of the DISTINCT.
In this case, it might also be useful to have an index on gsd.division_code to speed things up a bit.
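A minimal sketch of indexes that could help here (hypothetical names; the INCLUDE column lets the EXISTS probe avoid extra lookups):
CREATE INDEX IX_gsd_division_code
    ON dbo.dim_group_security_division (division_code)
    INCLUDE (group_name);
CREATE INDEX IX_dim_division_division_code
    ON dbo.dim_division (division_code);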

Query Optimization. Why did TOAD do this?

SQL Server 2008. I have this very large query which has a high cost associated with it. TOAD has Query Tuning functionality in it, and the only change it made was the following:
Before:
LEFT OUTER JOIN (SELECT RIN_EXT.rejected,
RIN_EXT.scar,
RIN.fcreceiver,
RIN.fcitemno
FROM RCINSP_EXT RIN_EXT
INNER JOIN dbo.rcinsp RIN
ON RIN_EXT.fkey_id = RIN.identity_column) RIN1
ON RCI.freceiver = RIN1.fcreceiver
AND RCI.fitemno = RIN1.fcitemno
WHERE RED.[YEAR] = '2009'
After:
LEFT OUTER JOIN (SELECT RIN_EXT.rejected,
RIN_EXT.scar,
RIN.fcreceiver,
RIN.fcitemno
FROM dbo.rcinsp RIN
INNER JOIN RCINSP_EXT RIN_EXT
ON RIN.identity_column = COALESCE (RIN_EXT.fkey_id , RIN_EXT.fkey_id)) RIN1
ON RCI.freceiver = RIN1.fcreceiver
AND RCI.fitemno >= RIN1.fcitemno -- ***** RIGHT HERE
AND RCI.fitemno <= RIN1.fcitemno
WHERE RED.[YEAR] = '2009'
The field is a char(3) field and this is SQL Server 2008.
Any idea why theirs is so much faster than mine?
You didn't show the ON condition in the "Before" query, so I don't know what TOAD changed. However, I'll take a guess about what happened.
The SQL Server query optimizer uses cost estimates to choose the query plan. The cost estimates are based on rowcount estimates. If the rowcount estimates are not accurate, the optimizer might not choose the best plan.
Some rowcount estimates are typically accurate, like those of the form (column = value) for a column with statistics. However, some rowcount estimates can only be guessed at, like (column = othercolumn) if the columns aren't related by a foreign key constraint, or (expression = value), where the expression isn't trivial or involves more than one column.
When statistics don't guide a rowcount estimate, SQL Server uses generic estimates. If you compare the rowcount estimates in an estimated plan to the actual rowcounts in an actual plan, you can sometimes see the problem and "trick" the optimizer into changing its rowcount estimate.
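One way to compare them without reading graphical plans is to request the actual plan in text form, for example:
SET STATISTICS PROFILE ON;
-- run the query here; each plan operator is returned with both
-- the EstimateRows and the actual Rows columns, side by side
SET STATISTICS PROFILE OFF;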
If you add predicates with AND that don't actually restrict the results, you may lower the rowcount estimate if the optimizer can't recognize that they are superfluous. Similarly, if you add predicates with OR that don't actually yield additional results, you may raise the rowcount estimate.
Perhaps here the rowcount estimate was too high, and the extra predicates are correcting it, resulting in better cost estimates for the plans being considered and a better plan choice in the end.
Looks like an ascending index search-argument thing, since it added a >=. We're not seeing enough of the rest of your query, but evidently there is further information about RCI.fitemno that it was able to deduce from the rest of your query.
It's odd that this:
AND RCI.fitemno >= RIN1.fcitemno -- ***** RIGHT HERE
AND RCI.fitemno <= RIN1.fcitemno
was not turned into this:
AND RCI.fitemno = RIN1.fcitemno
Since they are equivalent.
Adding greater-than and less-than predicates to a query is an old trick that sometimes nudges the query optimizer into using an index on that column. So this trick:
AND RCI.fitemno >= RIN1.fcitemno
AND RCI.fitemno <= RIN1.fcitemno
forces the database to use indexes on the fitemno columns of RIN1 and RCI, if present. I'm not sure whether temporary indexes get created on the fly when you do this.
I used to do these tricks with a DB2 database, and they worked nicely.