PostgreSQL - PostGIS query optimization - postgresql

I have a query which creates an input to pgRouting pgr_drivingDistance function:
CREATE TEMP TABLE tmp_edge AS
SELECT
    e."Id"     AS id,
    e."Source" AS source,
    e."Target" AS target,
    e."Length" / (1000 * LEAST("Speed", "SpeedMin") / 60) AS cost
FROM "Edge" e,
     "SpeedLimit" sl
WHERE sl."VehicleKindId" = 1
  AND e.the_geom && ST_MakeEnvelope(
        x1 - (1000 * GREATEST("Speed", "SpeedMax") / 60) * 13,
        y1 - (1000 * GREATEST("Speed", "SpeedMax") / 60) * 13,
        x1 + (1000 * GREATEST("Speed", "SpeedMax") / 60) * 13,
        y1 + (1000 * GREATEST("Speed", "SpeedMax") / 60) * 13,
        3857)
  AND sl."RoadCategoryId" = e."CategoryId";
In the WHERE clause I calculate the same expression several times to get the bounding box coordinates.
I tried moving the calculation into the FROM part and using an alias for the computed column, but then the whole execution time doubled.
The Edge table is quite large (about 1 million rows) and SpeedLimit has several dozen records.
Is there any way to improve this query?

The recommended way is to join tables using explicit JOIN syntax, and then restrict the result set with WHERE. What is ST_MakeEnvelope? You can use an index on an expression in PostgreSQL ;)
Expression indexes in PostgreSQL
Since you are filtering on expressions, you might benefit from them.
And you can use EXPLAIN ANALYZE to find the bottlenecks in your query.
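One way to avoid repeating the radius expression is a LATERAL subquery that computes it once per SpeedLimit row. A sketch, under the assumptions that x1/y1 are the query point's coordinates and that Speed, SpeedMin, and SpeedMax all come from SpeedLimit:

```sql
-- Sketch: compute the search radius once per SpeedLimit row via LATERAL,
-- so the envelope arithmetic is written (and read) only once.
CREATE TEMP TABLE tmp_edge AS
SELECT
    e."Id"     AS id,
    e."Source" AS source,
    e."Target" AS target,
    e."Length" / (1000 * LEAST(sl."Speed", sl."SpeedMin") / 60) AS cost
FROM "SpeedLimit" sl
CROSS JOIN LATERAL (
    SELECT (1000 * GREATEST(sl."Speed", sl."SpeedMax") / 60) * 13 AS r
) radius
JOIN "Edge" e
  ON e."CategoryId" = sl."RoadCategoryId"
 AND e.the_geom && ST_MakeEnvelope(x1 - radius.r, y1 - radius.r,
                                   x1 + radius.r, y1 + radius.r, 3857)
WHERE sl."VehicleKindId" = 1;
```

Since SpeedLimit has only a few dozen rows, the per-row LATERAL evaluation is cheap; whether the planner produces a better plan than the original should be checked with EXPLAIN ANALYZE.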

Related

create 2 indexes on same column

I have a table with geometry column.
I have 2 indexes on this column:
create index idg1 on tbl using gist(geom);
create index idg2 on tbl using gist(st_geomfromewkb((geom)::bytea));
I have a lot of queries using the geom (geometry) field.
Which index is used? (When, and why?)
If there are two indexes on the same column (as shown here), can SELECT queries run slower than with just one index defined on the column?
The use of an index depends on how the index was defined, and how the query is invoked. If you SELECT <cols> FROM tbl WHERE geom = <some_value>, then you will use the idg1 index. If you SELECT <cols> FROM tbl WHERE st_geomfromewkb((geom)::bytea) = <some_value>, then you will use the idg2 index.
A good way to know which index will be used for a particular query is to call the query with EXPLAIN (i.e., EXPLAIN SELECT <cols> FROM tbl WHERE geom = <some_value>) -- this will print out the query plan, which access methods, which indexes, which joins, etc. will be used.
For your question regarding performance, the SELECT queries could run slower because there are more indexes to consider in the query planning phase. In terms of executing a given query plan, a SELECT query will not run slower because by then the query plan has been established and the decision of which index to use has been made.
You will certainly experience performance impact upon INSERT/UPDATE/DELETE of the table, as all indexes will need to be updated with respect to the changes in the table. As such, there will be extra I/O activity on disk to propagate the changes, slowing down the database, especially at scale.
Which index is used depends on the query.
Any query that has
WHERE geom && '...'::geometry
or
WHERE st_intersects(geom, '...'::geometry)
or similar will use the first index.
The second index will only be used for queries that have the expression st_geomfromewkb((geom)::bytea) in them.
This is completely useless: it converts the geometry to EWKB format and back. You should find and rewrite all queries that have this weird construct, then you should drop that index.
Having two indexes on a single column does not slow down your queries significantly (planning will take a bit longer, but I doubt if you can measure that). You will have a performance penalty for every data modification though, which will take almost twice as long as with a single index.
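To see this concretely, EXPLAIN shows which of the two indexes the planner picks for a typical spatial filter. A sketch against the tbl/idg1/idg2 setup from the question:

```sql
-- Sketch: inspect the plan for a bounding-box filter on geom.
-- The plan should reference idg1 (as an Index Scan or Bitmap Index Scan);
-- idg2 can only match queries that literally contain
-- st_geomfromewkb((geom)::bytea) in the predicate.
EXPLAIN
SELECT * FROM tbl
WHERE geom && ST_MakeEnvelope(0, 0, 10, 10, 4326);
```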

PostgreSQL 11.5 doing sequential scan for SELECT EXISTS query

I have a multi tenant environment where each tenant (customer) has its own schema to isolate their data. Not ideal I know, but it was a quick port of a legacy system.
Each tenant has a "reading" table, with a composite index of 4 columns:
site_code char(8), location_no int, sensor_no int, reading_dtm timestamptz.
When a new reading is added, a function is called which first checks if there has already been a reading in the last minute (for the same site_code.location_no.sensor_no):
IF EXISTS (
SELECT
FROM reading r
WHERE r.site_code = p_site_code
AND r.location_no = p_location_no
AND r.sensor_no = p_sensor_no
AND r.reading_dtm > p_reading_dtm - INTERVAL '1 minute'
)
THEN
RETURN;
END IF;
Now, bear in mind there are many tenants, all behaving fine except one. In one of the tenants, the call takes nearly half a second instead of the usual few milliseconds, because it does a sequential scan on a table with nearly 2 million rows instead of an index scan.
My random_page_cost is set to 1.5.
I could understand a sequential scan if the query possibly returned many rows, but it is only checking for the existence of any.
I've tried ANALYZE on the table, VACUUM FULL, etc but it makes no difference.
If I put "SET LOCAL enable_seqscan = off" before the query, it works perfectly. It feels wrong, though, so it will have to be a temporary solution, as this is a live system and it needs to work.
What else can I do to help Postgres make what is clearly the better decision of using the index?
EDIT: If I do a similar query manually (outside of a function) it chooses an index.
My guess is that the engine evaluates the predicate and considers it not selective enough (it thinks too many rows will be returned), so it decides to use a table scan instead.
I would do two things:
Make sure you have the correct index in place:
create index ix1 on reading (site_code, location_no,
sensor_no, reading_dtm);
Trick the optimizer by making the selectivity look better. You can do that by adding the extra [redundant] predicate and r.reading_dtm < :p_reading_dtm:
select 1
from reading r
where r.site_code = :p_site_code
and r.location_no = :p_location_no
and r.sensor_no = :p_sensor_no
and r.reading_dtm > :p_reading_dtm - interval '1 minute'
and r.reading_dtm < :p_reading_dtm
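Since the slow plan only shows up inside the function (the manual query picks the index, per the question's edit), it can help to see the plan the function actually used. A sketch using the auto_explain extension, which can log plans of statements nested inside functions:

```sql
-- Sketch: log the plans of statements executed inside functions,
-- so the function's plan can be compared with the manual query's plan.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;        -- log every statement's plan
SET auto_explain.log_nested_statements = on;  -- include statements inside functions
-- now call the function; the plan it used appears in the server log
```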

How to know which argument of function is a column in the table in PostgreSQL?

I'm using PostgreSQL with GiST and PostGIS, and I want to find geometries that are within a threshold distance of a query geometry. So first I should expand the bounding box of the query, and second I should pass the expanded bounding box to the GiST index.
I think the meaning of the two queries SELECT * FROM table WHERE ST_DWithin(querygeom, table.col) and SELECT * FROM table WHERE ST_DWithin(table.col, querygeom) is the same, where table.col is a geometry column and querygeom is a static geometry I pass in. However, since I have a GiST index on table.col, I always want to expand the query geometry, not the column, in order to use the index. (If I understand correctly, if I expand the boxes of the column, the index cannot be used?)
Is there any way to do this? For example, using a rule to rewrite the query?
I think I found the answer...
The definition of ST_DWithin uses a symmetric formulation:
'SELECT $1 OPERATOR(#extschema#.&&) #extschema#.ST_Expand($2,$3) AND $2 OPERATOR(#extschema#.&&) #extschema#.ST_Expand($1,$3) AND #extschema#._ST_DWithin($1, $2, $3)'
so either side can use the index (with a higher computation cost, I guess).
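Because ST_DWithin expands into that symmetric && formulation, the argument order should not matter for index use. A sketch, assuming a GiST index on tbl.col and a literal query point:

```sql
-- Sketch: both argument orders can use a GiST index on tbl.col,
-- since ST_DWithin tests col && ST_Expand(point) as well as
-- point && ST_Expand(col) internally.
SELECT *
FROM tbl
WHERE ST_DWithin(tbl.col, 'SRID=4326;POINT(1 2)'::geometry, 0.01);
```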

postgres chooses an awful query plan, how can that be fixed

I'm trying to optimize this query :
EXPLAIN ANALYZE
select
dtt.matching_protein_seq_ids
from detected_transcript_translation dtt
join peptide_spectrum_match psm
on psm.detected_transcript_translation_id =
dtt.detected_transcript_translation_id
join peptide_spectrum_match_sequence psms
on psm.peptide_spectrum_match_sequence_id =
psms.peptide_spectrum_match_sequence_id
WHERE
dtt.matching_protein_seq_ids && ARRAY[654819, 294711]
;
When sequential scans are allowed (set enable_seqscan = on), the optimizer chooses a pretty awful plan that runs in 49.85 seconds:
https://explain.depesz.com/s/WKbew
With set enable_seqscan = off, the plan chosen uses the proper indexes and the query runs instantly.
https://explain.depesz.com/s/ISHV
Note that I did run ANALYZE on all three tables...
Your problem is that PostgreSQL cannot estimate the WHERE condition well, so it estimates it as a fixed percentage of the estimated total rows, which is way too high.
If you know that there will always be few result rows for a query like this, you could cheat by defining a function:
CREATE OR REPLACE FUNCTION matching_transcript_translations(integer[])
RETURNS SETOF detected_transcript_translation
LANGUAGE SQL
STABLE STRICT
ROWS 2 /* pretend there are always exactly two matching rows */
AS
'SELECT * FROM detected_transcript_translation
WHERE matching_protein_seq_ids && $1';
You could use that like
select
dtt.matching_protein_seq_ids
from matching_transcript_translations(ARRAY[654819, 294711]) dtt
join peptide_spectrum_match psm
on psm.detected_transcript_translation_id =
dtt.detected_transcript_translation_id
join peptide_spectrum_match_sequence psms
on psm.peptide_spectrum_match_sequence_id =
psms.peptide_spectrum_match_sequence_id;
Then PostgreSQL should be cheated into thinking that there will be exactly two matching rows.
However, if there are a lot of matching rows, the resulting plan will be even worse than your current plan is…
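For reference, the kind of index the fast plan relies on for the && (array overlap) predicate is a GIN index on the array column. A sketch, in case such an index is not already in place:

```sql
-- Sketch: a GIN index on the integer-array column lets the
-- && (overlap) predicate be answered with an index scan
-- instead of a sequential scan.
CREATE INDEX idx_dtt_matching_protein_seq_ids
    ON detected_transcript_translation
    USING gin (matching_protein_seq_ids);
```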

PostgreSQL Optimize Query with ST_Transform, ST_MakePoint, and ST_Contains

I have the following query:
UPDATE DestinTable
SET destin = geomVal
FROM GeomTable
WHERE st_contains(st_transform(geom, 4326), st_setsrid(
st_makepoint(d_lon::double precision, d_lat::double precision), 4326));
This query works, but it is very slow. I have to run an update on a very large table, and it is taking 8+ hours to complete (I run this on 5 different columns). I wanted to know if there is a way to optimize this query to make it run faster. I am unaware of the behind-the-scenes work of ST_Contains(), so there may be an obvious solution that I am missing.
The easiest way is to create an index on the ST_Transform expression. Note that the indexed expression must match the one used in the query exactly, including the SRID (4326 here):
CREATE INDEX idx_geom_4326_geomtable
ON GeomTable
USING gist
(ST_Transform(geom, 4326))
WHERE geom IS NOT NULL;
If all the geometries in the table share one SRID, it is even easier to create a normal GiST index on the geometry column and transform the point you're supplying to the local SRID instead.
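That second approach could look like the following sketch, under the (hypothetical) assumption that geom is stored in SRID 26986: index the raw column and transform the point rather than the column.

```sql
-- Sketch: a plain GiST index on the stored geometry...
CREATE INDEX idx_geomtable_geom ON GeomTable USING gist (geom);

-- ...and the point transformed to the table's SRID (assumed 26986 here),
-- so ST_Contains can use the index on geom directly.
UPDATE DestinTable d
SET destin = g.geomVal
FROM GeomTable g
WHERE ST_Contains(
    g.geom,
    ST_Transform(
        ST_SetSRID(ST_MakePoint(d.d_lon::double precision,
                                d.d_lat::double precision), 4326),
        26986));
```

Transforming the point is done once per DestinTable row, whereas ST_Transform(geom, 4326) without an expression index must be recomputed for every candidate geometry.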