GiST index on Postgres/PostGIS still slow

I am not an expert on Postgres/GIS and I have an issue with a large database (over 20 million records) of geometries. First of all, my setup looks like this:
mmt=# select version();
-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------
version | PostgreSQL 13.2 (Debian 13.2-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
mmt=# select PostGIS_Version();
-[ RECORD 1 ]---+--------------------------------------
postgis_version | 3.1 USE_GEOS=1 USE_PROJ=1 USE_STATS=1
The table that I am querying contains the following columns:
mmt=# \d titles
Table "public.titles"
Column | Type | Collation | Nullable | Default
----------------------+--------------------------+-----------+----------+-----------------------------------------
ogc_fid | integer | | not null | nextval('titles_ogc_fid_seq'::regclass)
wkb_geometry | bytea | | |
timestamp | timestamp with time zone | | |
end | timestamp with time zone | | |
gml_id | character varying | | |
validfrom | character varying | | |
beginlifespanversion | character varying | | |
geom_bounding_box | geometry(Geometry,4326) | | |
Indexes:
"titles_pkey" PRIMARY KEY, btree (ogc_fid)
"geom_idx" gist (geom_bounding_box)
The geom_bounding_box column holds the bounding box of the wkb_geometry. I created that bounding box column because the WKB geometries exceed the default size limit for items in a GiST index; some of them are quite complex geometries with several dozen points making up a polygon. Using a bounding box instead meant I was able to put an index on that column as a way of speeding up the search... at least, that's the theory.
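For context, a column like that could be populated along these lines (a hedged sketch; it assumes the bytea column holds WKB that casts cleanly to geometry with SRID 4326):
-- derive the bounding box from the raw WKB geometry
UPDATE titles
SET geom_bounding_box = ST_Envelope(wkb_geometry::geometry);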
My search aims to find geometries that fall within 100 metres of a given point, as follows. However, this takes well over two minutes to return, and I want to get it under one second:
select ogc_fid, wkb_geometry from titles where ST_DWithin(geom_bounding_box, 'SRID=4326;POINT(-0.145872 51.509691)'::geography, 100);
Below is a basic explain output. What can I do to speed this thing up?
Thank you!
mmt=# explain select ogc_fid from titles where ST_DWithin(geom_bounding_box, 'SRID=4326;POINT(-0.145872 51.509691)'::geography, 100);
-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Gather (cost=1000.00..243806855.33 rows=2307 width=4)
-[ RECORD 2 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Workers Planned: 2
-[ RECORD 3 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Parallel Seq Scan on titles (cost=0.00..243805624.63 rows=961 width=4)
-[ RECORD 4 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Filter: st_dwithin((geom_bounding_box)::geography, '0101000020E61000006878B306EFABC2BF6308008E3DC14940'::geography, '100'::double precision, true)
-[ RECORD 5 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | JIT:
-[ RECORD 6 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Functions: 4
-[ RECORD 7 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Options: Inlining true, Optimization true, Expressions true, Deforming true

The problem is that you are mixing geometry and geography, and PostgreSQL casts geom_bounding_box to geography so that they match.
Now you have indexed geom_bounding_box, but not geom_bounding_box::geography, which is something different.
Either use 'SRID=4326;POINT(-0.145872 51.509691)'::geometry as the second operand, or create the GiST index on ((geom_bounding_box::geography)) (note the double parentheses).
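Sketched out, the two options look like this (note that with the geometry variant the distance is in SRID units, degrees here, which the next answer discusses; the 0.001 is an assumed stand-in for roughly 100 m):
-- option 1: compare geometry to geometry (distance is then in degrees)
SELECT ogc_fid FROM titles
WHERE ST_DWithin(geom_bounding_box,
                 'SRID=4326;POINT(-0.145872 51.509691)'::geometry, 0.001);
-- option 2: index the geography cast so the original query can use it
CREATE INDEX geog_idx ON titles USING gist ((geom_bounding_box::geography));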

EDIT:
As pointed out by mlinth, my answer below is not really valid. It raises a danger, though: beware of the arguments given to the ST_DWithin function, because the unit of the distance argument is inferred differently depending on whether you pass geographies (meters) or geometries (SRID units).
According to the ST_DWithin doc, the distance is specified in SRID units. In your case the spatial reference system is a geographic one, so your value of 100 means a 100 degree radius, not 100 meters. That covers approximately the entire world, and in that case efficiently using the index is impossible.
If you want to find geometries within a 100 meter radius, you must convert 100 meters into degrees, but that conversion depends on latitude (if you want to be accurate).
To start, I'd recommend a (very) approximate shortcut: 100 meters at the equator is (very) approximately equal to 0.001 degrees. Replace your distance value with that, and if it speeds things up (and I'm pretty convinced it will), you will then be able to refine your query to be more accurate.
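To refine that shortcut for a given latitude: one degree of longitude spans roughly 111320 metres at the equator and shrinks with the cosine of the latitude, so a back-of-the-envelope conversion (assuming a spherical earth) can be computed inline:
-- rough meters-to-degrees (longitude) conversion at the query point's latitude
SELECT 100.0 / (111320.0 * cos(radians(51.509691))) AS radius_in_degrees;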

I did resolve this and it was a combination of all of the above things, although not any one of them alone. As a quick summary:
Laurenz Albe was right in spotting the mix of geography and geometry types, which was easy to fix by removing the cast.
Ian Turton was also right in spotting that dozens of points shouldn't be an issue for a GiST index, so I abandoned the bounding-box approximation and went back to exploring the index issues. What I found was that the geometry column was defined with a data type of byte array (bytea), which prevents creation of an SP-GiST index with the error 'no default operator class for access method "spgist"'. This was resolved by changing the column type as follows:
mmt=# ALTER TABLE titles
ALTER COLUMN wkb_geometry
TYPE geometry
USING wkb_geometry::geometry;
The index then creates successfully (either GiST or SP-GiST) and I have been able to benchmark the two side by side, finding GiST to be slightly more efficient in my use case.
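For reference, the two candidates can be built side by side like this (index names are mine):
CREATE INDEX titles_geom_gist   ON titles USING gist (wkb_geometry);
CREATE INDEX titles_geom_spgist ON titles USING spgist (wkb_geometry);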
Amanin was also right to point out the difference between meters and degrees depending on the spatial reference system. In some of my tests I was erroneously using the latter with very large radii. Since I'm indexing and searching with geometry types, that radius value needs to be very small in degrees in order to cover quite large areas. Fixed!
All put together, searches across 26 million records consistently complete in 200 ms to 500 ms, with occasional spikes up to 1.1 s. This is pretty good.
Thanks all who contributed input, ideas and discussion.

Postgres Full Text Search - Find Other Similar Documents

I am looking for a way to use Postgres (version 9.6+) Full Text Search to find other documents similar to an input document - essentially looking for a way to produce results similar to Elasticsearch's more_like_this query. As far as I can tell, Postgres offers no way to compare ts_vectors to each other.
I've tried various techniques, like converting the source document back into a ts_query or reprocessing the original doc, but that requires too much overhead.
Would greatly appreciate any advice - thanks!
Looks like the only option is to use pg_trgm instead of Postgres's built-in full text search. Here is how I ended up implementing this:
Using a simple table (or a materialized view in this case) that holds the primary key of the post and the full text body in two columns.
Materialized view "public.text_index"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+----------+--------------+-------------
id | integer | | | | plain | |
post | text | | | | extended | |
View definition:
SELECT posts.id,
posts.body AS text
FROM posts
ORDER BY posts.publication_date DESC;
Then using a lateral join we can match rows and order them by similarity to find posts that are "close to" or "related" to any other post:
select * from text_index tx
left join lateral (
  select similarity(tx.text, t.text) from text_index t where t.id = 12345
) s on true
order by similarity desc
limit 10;
This is of course a naive way to match documents and may require further tuning. Additionally, a pg_trgm GIN index on the text column will speed up the searches significantly.
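A sketch of that index (the index name is mine, and the column name follows the queries above; GIN indexes can be created on materialized views):
-- assumption: the pg_trgm extension is available
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX text_index_trgm ON text_index USING gin (text gin_trgm_ops);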

PostgreSQL UPDATE doesn't seem to update some rows

I am trying to update a table from another table, but a few rows simply don't update, while the other million rows work just fine.
The statement I am using is as follows:
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql AND l.quali_ambiental IS NULL;
It says 647 rows were updated, but I can't see the change.
I've also tried without the IS NULL clause; the results are the same.
If I do a join, it seems to work as expected; the join query I used is this one:
SELECT sql, l.quali_ambiental, c.quali_ambiental FROM lotes_infos l
JOIN sirgas_lotes_centroid c
USING (sql)
WHERE l.quali_ambiental IS NULL;
It returns 787 rows (some are both null, that's OK). This is a sample from the result of the join:
sql | quali_ambiental | quali_ambiental
------------+-----------------+-----------------
1880040001 | | PA 10
1880040001 | | PA 10
0863690003 | | PA 4
0850840001 | | PA 4
3090500003 | | PA 4
1330090001 | | PA 10
1201410001 | | PA 9
0550620002 | | PA 6
0430790001 | | PA 1
1340180002 | | PA 9
I used QGIS to visualize the results and could not find any clues as to why this is happening. The sirgas_lotes_centroid table comes from the other table, its geometry being the centroid of the polygon. I used the centroids to perform faster spatial joins and now need to place the information back into the table with the original polygons.
The sql column is of type text and quali_ambiental is varchar(6) in both tables.
If I directly update one row using the following query, it works just fine:
UPDATE lotes_infos
SET quali_ambiental = 'PA 1'
WHERE sql LIKE '0040510001';
If you don't see the results of a seemingly sound data-modifying query, the first question to ask is:
Did you commit your transaction?
Many clients work with auto-commit by default, but some do not. And even in the standard client psql you can start an explicit transaction with BEGIN (or syntax variants) to disable auto-commit. Results are then not visible to other transactions until the transaction is actually committed with COMMIT. Such a transaction might hang open indefinitely (which creates additional problems) or be rolled back by some later interaction.
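A minimal illustration of that failure mode, using the single-row update from the question:
BEGIN;  -- explicit transaction, auto-commit no longer applies
UPDATE lotes_infos SET quali_ambiental = 'PA 1' WHERE sql LIKE '0040510001';
-- no other session sees the change until:
COMMIT;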
That said, you mention: some are both null, that's ok. You'll want to avoid costly empty updates with something like:
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql
AND l.quali_ambiental IS NULL
AND s.quali_ambiental IS NOT NULL; --!
Related:
How do I (or can I) SELECT DISTINCT on multiple columns?
The duplicate 1880040001 in your sample can have two explanations: either lotes_infos.sql is not UNIQUE (even after filtering with l.quali_ambiental IS NULL), or sirgas_lotes_centroid.sql is not UNIQUE. Or both.
If it's just lotes_infos.sql, your query should still work. But duplicates in sirgas_lotes_centroid.sql make the query non-deterministic (as jjanes also pointed out): a target row in lotes_infos can have multiple candidates in sirgas_lotes_centroid, and the outcome is arbitrary for lack of definition. If one of them has quali_ambiental IS NULL, it can explain what you observed.
My suggested query fixes the observed problem only superficially, in that it excludes NULL values in the source table. But if there can be more than one non-null, distinct quali_ambiental for the same sirgas_lotes_centroid.sql, your query remains broken, as the result is arbitrary. You'll have to define which source row to pick and translate that into SQL.
Here is one example of how to do that (chapter "Multiple matches..."):
Updating the value of a column
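A minimal sketch of that approach with DISTINCT ON (the ORDER BY tie-breaker is an assumption; define your own preference):
UPDATE lotes_infos l
SET    quali_ambiental = s.quali_ambiental
FROM  (
   SELECT DISTINCT ON (sql) sql, quali_ambiental
   FROM   sirgas_lotes_centroid
   WHERE  quali_ambiental IS NOT NULL
   ORDER  BY sql, quali_ambiental  -- arbitrary tie-breaker, adjust as needed
   ) s
WHERE  l.sql = s.sql
AND    l.quali_ambiental IS NULL;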
Always include exact table definitions (CREATE TABLE statements) with any such question. That would save a lot of time otherwise wasted on speculation.
Aside: why are the sql columns of type text? Values like 1880040001 strike me as integer or bigint. If so, text is a costly design error.

PostGIS DB Structure: Identical tables for multiple years?

I have multiple datasets for different years as shapefiles and have converted them to Postgres tables, which leaves me with the following situation:
I got tables boris2018, boris2017, boris2016 and so on.
They all share an identical schema; for now, let's focus on the following columns (the example is one row out of the boris2018 table). The rows represent actual PostGIS geometries with certain properties.
brw | brwznr | gema | entw | nuta
-----+--------+---------+------+------
290 | 285034 | Sieglar | B | W
The brwznr column is an ID of some kind, but it does not seem to be entirely consistent across all years for each geometry.
Then again, most of the tables contain duplicate information. The geometry should be identical in every year, although this is not guaranteed either.
What I first did was to match the brwznr of each year with the 2018 data, adding brw17, brw16, ... columns to my boris2018 data, like so:
brw18 | brw17 | brw16 | brwznr | gema | entw | nuta
-------+-------+-------+--------+---------+------+------
290 | 260 | 250 | 285034 | Sieglar | B | W
This led to some data getting lost (because no matching brwznr was found), some data being matched wrongly (due to inconsistencies in the data), and it didn't feel right.
What I actually want to achieve is fast queries that get me the different brw values for a given coordinate, something along the lines of
SELECT ortst, brw, gema, gena
FROM boris2018, boris2017, boris2016
WHERE ST_Intersects(geom,ST_SetSRID(ST_Point(7.130577, 50.80292),4326));
or
SELECT ortst, brw18, brw17, brw16, gema, gena
FROM boris
WHERE ST_Intersects(geom,ST_SetSRID(ST_Point(7.130577, 50.80292),4326));
although this is obviously wrong/has its deficits.
Since I am new to databases in general, I can't really tell whether this is a querying problem or a database structure problem.
I hope someone can help; your time and effort are highly appreciated!
Tim
Have you tried using a CTE?
WITH j AS (
    SELECT ortst, brw, gema, gena, geom FROM boris2016  -- geom must be selected
    UNION                                               -- inside the CTE for the
    SELECT ortst, brw, gema, gena, geom FROM boris2017  -- outer filter to see it
    UNION
    SELECT ortst, brw, gema, gena, geom FROM boris2018)
SELECT * FROM j
WHERE ST_Intersects(j.geom, ST_SetSRID(ST_Point(7.130577, 50.80292), 4326));
Depending on your needs, you might want to use UNION ALL. Note that this approach might not be the fastest when dealing with very large tables. If that is the case, consider merging the results of these three queries into another table and creating an index on the geom field, as sketched below. Let me know in the comments if that is the case.
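That merged-table idea could look roughly like this (a hedged sketch; the table name, index name, and year column are mine):
-- one table with a year discriminator instead of a table per year
CREATE TABLE boris AS
SELECT 2016 AS year, ortst, brw, gema, gena, geom FROM boris2016
UNION ALL
SELECT 2017, ortst, brw, gema, gena, geom FROM boris2017
UNION ALL
SELECT 2018, ortst, brw, gema, gena, geom FROM boris2018;
CREATE INDEX boris_geom_idx ON boris USING gist (geom);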

optimize a postgres query that updates a big table [duplicate]

I have two huge tables:
Table "public.tx_input1_new" (100,000,000 rows)
Column | Type | Modifiers
----------------+-----------------------------+----------
blk_hash | character varying(500) |
blk_time | timestamp without time zone |
tx_hash | character varying(500) |
input_tx_hash | character varying(100) |
input_tx_index | smallint |
input_addr | character varying(500) |
input_val | numeric |
Indexes:
"tx_input1_new_h" btree (input_tx_hash, input_tx_index)
Table "public.tx_output1_new" (100,000,000 rows)
Column | Type | Modifiers
--------------+------------------------+-----------
tx_hash | character varying(100) |
output_addr | character varying(500) |
output_index | smallint |
input_val | numeric |
Indexes:
"tx_output1_new_h" btree (tx_hash, output_index)
I want to update the first table from the other one:
UPDATE tx_input1 as i
SET
input_addr = o.output_addr,
input_val = o.output_val
FROM tx_output1 as o
WHERE
i.input_tx_hash = o.tx_hash
AND i.input_tx_index = o.output_index;
Before executing this SQL command, I had already created indexes on these two tables:
CREATE INDEX tx_input1_new_h ON tx_input1_new (input_tx_hash, input_tx_index);
CREATE INDEX tx_output1_new_h ON tx_output1_new (tx_hash, output_index);
I used the EXPLAIN command to see the query plan, but it didn't use the indexes I created.
It took about 14-15 hours to complete this UPDATE.
What is the problem?
How can I shorten the execution time, or tune my database/tables?
Thank you.
Since you are joining two large tables and there are no conditions that could filter out rows, the only efficient join strategy will be a hash join, and no index can help with that.
First there will be a sequential scan of one of the tables, from which a hash structure is built, then there will be a sequential scan over the other table, and the hash will be probed for each row found. How could any index help with that?
You can expect such an operation to take a long time, but there are some ways in which you could speed up the operation:
Remove all indexes and constraints on tx_input1 before you begin. Your query is one of the examples where an index does not help at all, but actually hurts performance, because the indexes have to be updated along with the table. Recreate the indexes and constraints after you are done with the UPDATE. Depending on the number of indexes on the table, you can expect a decent to massive performance gain.
Increase the work_mem parameter for this one operation with the SET command, as high as you can; the more memory the hash operation can use, the faster it will be (see the sketch after this list). With a table that big you'll probably still end up with temporary files, but you can still expect a decent performance gain.
Increase checkpoint_segments (or max_wal_size from version 9.6 on) to a high value so that there are fewer checkpoints during the UPDATE operation.
Make sure that the table statistics on both tables are accurate, so that PostgreSQL can come up with a good estimate for the number of hash buckets to create.
After the UPDATE, if it affects a big number of rows, you might consider running VACUUM (FULL) on tx_input1 to get rid of the resulting table bloat. This will lock the table for a long time, so do it during a maintenance window. It will reduce the size of the table and consequently speed up sequential scans.
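Putting the first two suggestions together, a hedged sketch of the session (the work_mem value is an assumption, table and index names follow the question's mixed naming, and checkpoint_segments/max_wal_size are server settings changed in postgresql.conf, so they are not shown):
DROP INDEX tx_input1_new_h;   -- recreate after the UPDATE is done
SET work_mem = '2GB';         -- session-level, as high as your RAM allows
UPDATE tx_input1 AS i
SET    input_addr = o.output_addr,
       input_val  = o.output_val
FROM   tx_output1 AS o
WHERE  i.input_tx_hash  = o.tx_hash
AND    i.input_tx_index = o.output_index;
CREATE INDEX tx_input1_new_h ON tx_input1_new (input_tx_hash, input_tx_index);
ANALYZE tx_input1;            -- refresh statistics afterwards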

Which proximity function is fastest and makes use of the spatial index in PostGIS?

I am new to PostgreSQL/PostGIS. I am evaluating it to solve a simple problem: find all points within a radius (in meters). Here is my table:
=> \d+ theuser;
Table "public.theuser"
Column | Type | Modifiers | Storage | Description
----------+------------------------+-----------+----------+-------------
id | bigint | not null | plain |
point | geometry | | main |
Indexes:
"theuser_pkey" PRIMARY KEY, btree (id)
"point_index" gist (point)
Referenced by:
...
Has OIDs: no
I added a GiST index to the point column; I don't know if that is the correct design.
All points inserted have SRID=4326.
It seems there are two ways to get nearby points:
ST_Distance, ST_Distance_Sphere.
Take the second one, for example:
select * from theuser where
ST_distance_sphere(point , ST_GeomFromText('POINT(120.9982 24.788)',4326)) < 100;
I wonder which of them makes use of point_index? If there are millions of points, can both execute quickly?
Another question: how can I query the SRID of a geometry (I searched but found no answer)?
All I can do is go through hibernate-spatial-postgis, get a com.vividsolutions.jts.geom.Point, and read the SRID from the returned point. How do I query it in SQL? Thanks.
Environment :
=> select version();
version
-----------------------------------------------------------------------------------------------------------
PostgreSQL 8.4.9 on i486-pc-linux-gnu, compiled by GCC gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 32-bit
=> SELECT postgis_lib_version();
postgis_lib_version
---------------------
1.4.0
---- updated ----
Thanks @filiprem, I tried this:
=> explain select * from theuser where
ST_distance_sphere(point , ST_GeomFromText('POINT(120.9982 24.788)',4326)) < 100;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Seq Scan on theuser (cost=0.00..1.15 rows=3 width=2644)
Filter: (st_distance_sphere(point, '0101000020E610000080B74082E23F5E407D3F355EBAC93840'::geometry) < 100::double precision)
(2 rows)
How do I know if it makes use of the point_index GiST index? Will it hold up when searching over a high data volume?
I heard once that ST_DWithin is the fastest; indeed, the documentation says that ST_DWithin has been tuned in newer versions:
Prior to 1.3, ST_Expand was commonly used in conjunction with && and ST_Distance to achieve the same effect, and in pre-1.3.4 this function was basically short-hand for that construct. From 1.3.4, ST_DWithin uses a more short-circuit distance function which should make it more efficient than prior versions for larger buffer regions.
It also uses bounding box comparisons and indexes:
This function call will automatically include a bounding box comparison that will make use of any indexes that are available on the geometries.