ST_Intersects() query took too long - postgresql

I'm working on a query using the PostGIS extension that implements a 'spatial join' work. Running the query took an incredibly long time and failed in the end. The query is as follows:
CREATE INDEX buffer_table_geom_idx ON buffer_table USING GIST (geom);
CREATE INDEX point_table_geom_idx ON point_table USING GIST (geom);
SELECT
point_table.*,
buffer_table.something
FROM
point_table
LEFT JOIN buffer_table ON ST_Intersects (buffer_table.geom, point_table.geom);
where the point_table stands for a table that contains over 10 million rows of point records; the buffer_table stands for a table that contains only one multi-polygon geometry.
I would want to know if there is anything wrong with my code and ways to adjust. Thanks in advance.

With a LEFT JOIN you're going through every single record of point_table and therefore ignoring the index. Try this and see the difference:
SELECT point_table.*
FROM point_table
JOIN buffer_table ON ST_Contains(buffer_table.geom, point_table.geom);
Divide and conquer with ST_SubDivide
Considering the size of your multipolygon (see comments), it might be interesting to divide it into smaller pieces, so that the number of vertices for each containment/intersection calculation also gets reduced, consequently making the query less expensive.
First divide the large geometry into smaller pieces and store in another table (you can also use a CTE/Subquery)
CREATE TABLE buffer_table_divided AS
SELECT ST_SubDivide(geom) AS geom FROM buffer_table
CREATE INDEX buffer_table_geom_divided_idx ON buffer_table_divided USING GIST (geom);
.. and perform your query once again against this new table:
SELECT point_table.*
FROM point_table
JOIN buffer_table_divided d ON ST_Contains (d.geom, point_table.geom);
Demo: db<>fiddle

Related

select 10000 records take too long time in PostgreSQL

my table contains 1 billion records. It is also partitioned by month.Id and datetime is the primary key for the table. When I select
select col1,col2,..col8
from mytable t
inner join cte on t.Id=cte.id and dtime>'2020-01-01' and dtime<'2020-10-01'
It uses index scan, but takes more than 5 minutes to select.
Please suggest me.
Note: I have set work_mem to 1GB. cte table results comes with in 3 seconds.
Well it's the nature of join and it is usually known as a time consuming operation.
First of all, I recommend to use in rather than join. Of course they have got different meanings, but in some cases technically you can use them interchangeably. Check this question out.
Secondly, according to the relation algebra whenever you use join each rows of mytable table is combined with each rows from the second table, and DBMS needs to make a huge temporary table, and finally igonre unsuitable rows. Undoubtedly all the steps and the result would take much time. Before using the Join opeation, it's better to filter your tables (for example mytable based date) and make them smaller, and then use the join operations.

PostgreSQL different index creation time for same datatype

I have a table with three columns A, B, C, all of type bytea.
There are around 180,000,000 rows in the table. A, B and C all have exactly 20 bytes of data, C sometimes contains NULLs
When creating indexes for all columns with
CREATE INDEX index_A ON transactions USING hash (A);
CREATE INDEX index_B ON transactions USING hash (B);
CREATE INDEX index_C ON transactions USING hash (C);
index_A is created in around 10 minutes, while B and C are taking over 10 hours after which I aborted them. I ran every CREATE INDEX on their own, so no indices were created in parallel. There are also no other queries running in the database.
When running
SELECT * FROM pg_stat_activity;
wait_event_type and wait_event are both NULL, state is active.
Why are the second index creations taking so long, and can I do anything to speed them up?
Ensure the statistics on your table are up-to-date.
Then execute the following query:
SELECT attname, n_distinct, correlation
from pg_stats
where tablename = '<Your table name here>'
Basically, the database will have more work to create indexes when:
The number of distinct values gets higher.
The correlation (= are values in the field physically stored in order) is close to 0.
I suspect you will see field A is different in terms of distinct values and/or a higher correlation than the other 2 fields.
Edit: Basically, creating an index = FULL SCAN of the table and create entries in the index as you progress. With the stats you have shared below that means:
Column A: it was detected as unique
A single scan is enough as the DB knows 1 record = 1 index entry.
Columns B & C : it was detected as having very few distinct values + abs(correlation) is very low.
Each index entry takes an entire FULL SCAN of the table.
Note: the description is simplified to highlight the difference.
Solution 1:
Do not create indexes for B and C.
It might sound stupid but in fact and as explained here, a small correlation means the indexes will probably not be used (an index is useful only when entries are not scattered in all the table blocks).
Solution 2:
Order records on the disk.
The initialization would be something like this:
CREATE TABLE Transactions_order as SELECT * FROM Transactions;
TRUNCATE TABLE Transactions;
INSERT INTO Transactions SELECT * FROM Transactions_order ORDER BY B,C,A;
DROP TABLE Transactions_order;
The tricky part comes next: with insert/update/delete records, you need to keep track of the correlation and ensure it does not drop too much.
If you can't guarantee that, stick to solution 1.
Solution3:
Create partitions and enjoy partition pruning.
There are quite a lot of efforts being made for partitioning recently in postgresql. It could be worth having a look into it.

ST_contains taking too much time

I am trying to match latitude/longitude to a particular neighbor location using below query
create table address_classification as (
select distinct buildingid,street,city,state,neighborhood,borough
from master_data
join
Borough_GEOM
on st_contains(st_astext(geom),coordinates) = 'true'
);
In this, coordinates is of below format
ST_GeometryFromText('POINT('||longitude||' '||latitude||')') as coordinates
and geom is of column type geometry.
i have already created indexes as below
CREATE INDEX coordinates_gix ON master_data USING GIST (coordinates);
CREATE INDEX boro_geom_indx ON Borough_GEOM USING gist(geom);
I have almost 3 million records in the main table and 200 geometric information in the GEOM table. Explain analyze of the query is taking so much time (2 hrs).
Please let me know, how can i optimize this query.
Thanks in advance.
As mentioned in the comments, don't use ST_AsText(): that doesn't belong there. It's casting the geom to text, and then going back to geom. But, more importantly, that process is likely to fumble the index.
If you're unique on only column then use DISTINCT ON, no need to compare the others.
If you're unique on the ID column and your only joining to add selectivity then consider using EXISTS. Do any of these columns come from the borough_GEOM other than geom?
I'd start with something like this,
CREATE TABLE address_classification AS
SELECT DISTINCT ON (buildingid),
buildingid,
street,
city,
state,
neighborhood,
borough
FROM master_data
JOIN borough_GEOM
ON ST_Contains(geom,coordinates);

PostgreSQL - PostGIS query optimization

I have a query which creates an input to pgRouting pgr_drivingDistance function:
CREATE TEMP TABLE tmp_edge AS
SELECT
e."Id" as id,
e."Source" as source,
e."Target" as target,
e."Length" / (1000*LEAST("Speed", "SpeedMin")/60) as cost
FROM "Edge" e,
"SpeedLimit" sl
WHERE sl."VehicleKindId" = 1
AND e.the_geom &&
ST_MakeEnvelope(
x1-(1000*GREATEST("Speed", "SpeedMax")/60)*13,
y1-(1000*GREATEST("Speed", "SpeedMax")/60)*13,
x1+(1000*GREATEST("Speed", "SpeedMax")/60)*13,
y1+(1000*GREATEST("Speed", "SpeedMax")/60)*13, 3857)
AND sl."RoadCategoryId" = e."CategoryId";
In the WHERE clause I calculate the same thing several times to get bounding box coordinates.
I tried to put calculations into FROM part and use alias for calculated column, but then whole execution time increases twice.
Edge table is quite large (1 milion) and SpeedLimit is several dozen record.
Is there any way to enhance this query?
It is recommended way to join tables using JOIN syntax. And then later restrict given set wit WHERE. What is ST_MakeEnvelope? You can use Index on expression in PostgreSQL ;)
Expression indexes in PostgreSQL
Since you are using expressions you might benefit from them.
And you might use Explain analyize to notice your bottlenecks in the query

Optimization of count query for PostgreSQL

I have a table in postgresql that contains an array which is updated constantly.
In my application i need to get the number of rows for which a specific parameter is not present in that array column. My query looks like this:
select count(id)
from table
where not (ARRAY['parameter value'] <# table.array_column)
But when increasing the amount of rows and the amount of executions of that query (several times per second, possibly hundreds or thousands) the performance decreses a lot, it seems to me that the counting in postgresql might have a linear order of execution (I’m not completely sure of this).
Basically my question is:
Is there an existing pattern I’m not aware of that applies to this situation? what would be the best approach for this?
Any suggestion you could give me would be really appreciated.
PostgreSQL actually supports GIN indexes on array columns. Unfortunately, it doesn't seem to be usable for NOT ARRAY[...] <# indexed_col, and GIN indexes are unsuitable for frequently-updated tables anyway.
Demo:
CREATE TABLE arrtable (id integer primary key, array_column integer[]);
INSERT INTO arrtable(1, ARRAY[1,2,3,4]);
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);
-- Use the following *only* for testing whether Pg can use an index
-- Do not use it in production.
SET enable_seqscan = off;
explain (buffers, analyze) select count(id)
from arrtable
where not (ARRAY[1] <# arrtable.array_column);
Unfortunately, this shows that as written we can't use the index. If you don't negate the condition it can be used, so you can search for and count rows that do contain the search element (by removing NOT).
You could use the index to count entries that do contain the target value, then subtract that result from a count of all entries. Since counting all rows in a table is quite slow in PostgreSQL (9.1 and older) and requires a sequential scan this will actually be slower than your current query. It's possible that on 9.2 an index-only scan can be used to count the rows if you have a b-tree index on id, in which case this might actually be OK:
SELECT (
SELECT count(id) FROM arrtable
) - (
SELECT count(id) FROM arrtable
WHERE (ARRAY[1] <# arrtable.array_column)
);
It's guaranteed to perform worse than your original version for Pg 9.1 and below, because in addition to the seqscan your original requires it also needs an GIN index scan. I've now tested this on 9.2 and it does appear to use an index for the count, so it's worth exploring for 9.2. With some less trivial dummy data:
drop index arrtable_arraycolumn_gin_arr_idx ;
truncate table arrtable;
insert into arrtable (id, array_column)
select s, ARRAY[1,2,s,s*2,s*3,s/2,s/4] FROM generate_series(1,1000000) s;
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);
Note that a GIN index like this will slow updates down a LOT, and is quite slow to create in the first place. It is not suitable for tables that get updated much at all - like your table.
Worse, the query using this index takes up to twice times as long as your original query and at best half as long on the same data set. It's worst for cases where the index is not very selective like ARRAY[1] - 4s vs 2s for the original query. Where the index is highly selective (ie: not many matches, like ARRAY[199]) it runs in about 1.2 seconds vs the original's 3s. This index simply isn't worth having for this query.
The lesson here? Sometimes, the right answer is just to do a sequential scan.
Since that won't do for your hit rates, either maintain a materialized view with a trigger as #debenhur suggests, or try to invert the array to be a list of parameters that the entry does not have so you can use a GiST index as #maniek suggests.
Is there an existing pattern I’m not aware of that applies to this
situation? what would be the best approach for this?
Your best bet in this situation might be to normalize your schema. Split the array out into a table. Add a b-tree index on the table of properties, or order the primary key so it's efficiently searchable by property_id.
CREATE TABLE demo( id integer primary key );
INSERT INTO demo (id) SELECT id FROM arrtable;
CREATE TABLE properties (
demo_id integer not null references demo(id),
property integer not null,
primary key (demo_id, property)
);
CREATE INDEX properties_property_idx ON properties(property);
You can then query the properties:
SELECT count(id)
FROM demo
WHERE NOT EXISTS (
SELECT 1 FROM properties WHERE demo.id = properties.demo_id AND property = 1
)
I expected this to be a lot faster than the original query, but it's actually much the same with the same sample data; it runs in the same 2s to 3s range as your original query. It's the same issue where searching for what is not there is much slower than searching for what is there; if we're looking for rows containing a property we can avoid the seqscan of demo and just scan properties for matching IDs directly.
Again, a seq scan on the array-containing table does the job just as well.
I think with Your current data model You are out of luck. Try to think of an algorithm that the database has to execute for Your query. There is no way it could work without sequential scanning of data.
Can You arrange the column so that it stores the inverse of data (so that the the query would be select count(id) from table where ARRAY[‘parameter value’] <# table.array_column) ? This query would use a gin/gist index.