ST_DistanceSphere very slow after Postgres/PostGIS upgrade

We've been using Postgres 9.6 and PostGIS 2.4.1 successfully on Compose for several years, but the service is shutting down so we're trying to move to Google Cloud and Cloud SQL. Our Cloud SQL instance is running Postgres 14 and PostGIS 3.1.4. The new database has more CPU, disk, and memory than the old database.
We've exported the data from Postgres 9.6 (Compose) like this:
pg_dump -h <ip> -U <username> -p <port> -d <database name> > data.sql
and imported it like this to Postgres 14 (on Google Cloud):
psql -U <username> -h <ip> --set ON_ERROR_STOP=on -f data.sql
This works without any errors.
The problem is that when we run queries such as the one below, in parallel, around 5/s, it's really slow on Postgres 14/PostGIS 3.1 (Google Cloud):
SELECT ST_DistanceSphere('SRID=4326;POINT(13.154672331767976 55.673222697684935)'::geometry, mt.geofence) AS distance
FROM my_table mt
WHERE ST_DistanceSphere('SRID=4326;POINT(13.543852374474474 55.93984692695315)'::geometry, mt.geofence) <= 2000
ORDER BY distance;
There are roughly 13100 rows in my_table. In our old database, queries like this took roughly 200-300 ms, but it can take up to 4 seconds in our new database.
We have an index defined like this (in both the new and old DB):
CREATE INDEX geofence_index ON my_table USING GIST (geofence);
Running explain (analyze, buffers, format text) on Postgres 9.6/PostGIS 2.4.1 returns:
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Sort (cost=6085.40..6096.24 rows=4338 width=8) (actual time=204.360..204.361 rows=1 loops=1) |
| Sort Key: (_st_distance('0101000020E61000003C143D36314F2A40A8BD4E292CD64B40'::geography, geography(geofence), '0'::double precision, false)) |
| Sort Method: quicksort Memory: 25kB |
| Buffers: shared hit=1392 |
| -> Seq Scan on my_table mt (cost=0.00..5823.32 rows=4338 width=8) (actual time=95.714..204.330 rows=1 loops=1) |
| Filter: (_st_distance('0101000020E61000008B7084D173162B40444173E74CF84B40'::geography, geography(geofence), '0'::double precision, false) <= '2000'::double precision) |
| Rows Removed by Filter: 13033 |
| Buffers: shared hit=1389 |
| Planning time: 0.626 ms |
| Execution time: 204.404 ms |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
EXPLAIN 10
Time: 0.261s
Running explain (analyze, buffers, format text) on Postgres 14/PostGIS 3.1.4 returns:
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Gather Merge (cost=257890.05..258183.87 rows=2555 width=8) (actual time=1893.020..1919.665 rows=1 loops=1) |
| Workers Planned: 1 |
| Workers Launched: 1 |
| Buffers: shared hit=1591 |
| -> Sort (cost=256890.04..256896.42 rows=2555 width=8) (actual time=1834.941..1834.943 rows=0 loops=2) |
| Sort Key: (st_distance('0101000020E61000003C143D36314F2A40A8BD4E292CD64B40'::geography, geography(geofence), false)) |
| Sort Method: quicksort Memory: 25kB |
| Buffers: shared hit=1591 |
| Worker 0: Sort Method: quicksort Memory: 25kB |
| -> Parallel Seq Scan on my_table mt (cost=0.00..256745.43 rows=2555 width=8) (actual time=1290.257..1834.816 rows=0 loops=2) |
| Filter: (st_distance('0101000020E61000008B7084D173162B40444173E74CF84B40'::geography, geography(geofence), false) <= '2000'::double precision) |
| Rows Removed by Filter: 6516 |
| Buffers: shared hit=1533 |
| Planning Time: 0.212 ms |
| Execution Time: 1919.704 ms |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
EXPLAIN 15
Time: 2.055s (2 seconds), executed in: 2.053s (2 seconds)
FWIW we've tried running SELECT postgis_extensions_upgrade(); (twice) in the new DB in case some data needed to be upgraded, but this has had no impact on query performance.
What could be causing the query to be slow in Postgres 14/PostGIS 3.1.4 and how can it be solved?

Related

How can I optimize the database to reduce query time?

There are two large tables, num and pre, in the database:
\d num
Table "public.num"
Column | Type | Collation | Nullable | Default
----------+-------------------------+-----------+----------+---------------------------------
id | integer | | not null | nextval('num_id_seq'::regclass)
adsh | character varying(20) | | |
tag | character varying(256) | | |
version | character varying(20) | | |
coreg | character varying(256) | | |
ddate | date | | |
qtrs | numeric(8,0) | | |
uom | character varying(20) | | |
value | numeric(28,4) | | |
footnote | character varying(1024) | | |
\d pre
Table "public.pre"
Column | Type | Collation | Nullable | Default
----------+------------------------+-----------+----------+---------------------------------
id | integer | | not null | nextval('pre_id_seq'::regclass)
adsh | character varying(20) | | |
report | numeric(6,0) | | |
line | numeric(6,0) | | |
stmt | character varying(2) | | |
inpth | boolean | | |
rfile | character(1) | | |
tag | character varying(256) | | |
version | character varying(20) | | |
plabel | character varying(512) | | |
negating | boolean | | |
Check how many records in the tables:
select count(*) from num;
count
----------
83862587
(1 row)
Time: 204945.436 ms (03:24.945)
select count(*) from pre;
count
----------
36738034
(1 row)
Time: 100604.085 ms (01:40.604)
Execute a long-running query:
explain analyze select tag,uom,qtrs,value,ddate from num
where adsh='0000320193-22-000108' and tag in
(select tag from pre where stmt='IS' and
adsh='0000320193-22-000108') and ddate='2022-09-30';
It took almost 7 minutes 30 seconds.
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Semi Join (cost=2000.00..2909871.29 rows=2 width=59) (actual time=357717.922..450523.035 rows=45 loops=1)
Join Filter: ((num.tag)::text = (pre.tag)::text)
Rows Removed by Join Filter: 61320
-> Gather (cost=1000.00..1984125.01 rows=32 width=59) (actual time=190.355..92987.731 rows=678 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on num (cost=0.00..1983121.81 rows=13 width=59) (actual time=348.753..331304.671 rows=226 loops=3)
Filter: (((adsh)::text = '0000320193-22-000108'::text) AND (ddate = '2022-09-30'::date))
Rows Removed by Filter: 27953970
-> Materialize (cost=1000.00..925725.74 rows=43 width=33) (actual time=0.097..527.331 rows=91 loops=678)
-> Gather (cost=1000.00..925725.53 rows=43 width=33) (actual time=65.880..357527.133 rows=96 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on pre (cost=0.00..924721.22 rows=18 width=33) (actual time=201.713..357490.037 rows=32 loops=3)
Filter: (((adsh)::text = '0000320193-22-000108'::text) AND ((stmt)::text = 'IS'::text))
Rows Removed by Filter: 12245979
Planning Time: 0.632 ms
JIT:
Functions: 27
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 5.870 ms, Inlining 272.489 ms, Optimization 367.828 ms, Emission 213.288 ms, Total 859.474 ms
Execution Time: 450524.974 ms
(22 rows)
Time: 450526.084 ms (07:30.526)
How can I optimize the database to reduce the query time? Should I add an index, or turn something off (the database runs on my local PC without any other users)?
It's performing a full table scan on both tables, as you can see from the plan nodes that say "Parallel Seq Scan" (i.e. sequential scan) on num and pre.
This means it is looking at every row to check whether it should be used.
To speed it up, massively, you need to create indexes on the columns used in the where clauses (pre.stmt, pre.adsh, num.adsh and num.ddate). The query can then use the indexes to decide which rows to include, and as indexes are specially organised for this task, performance will increase.
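As a minimal sketch of that advice, the indexes could be created as below. The index names are made up for illustration, and grouping the columns into one composite index per table (rather than one index per column) is an assumption that fits this particular query, which filters each table on both columns with equality.

```sql
-- Hypothetical index names; the column choice follows the WHERE clauses of the query.
-- num is filtered on adsh and ddate.
CREATE INDEX num_adsh_ddate_idx ON num (adsh, ddate);

-- pre is filtered on adsh and stmt.
CREATE INDEX pre_adsh_stmt_idx ON pre (adsh, stmt);

-- Refresh planner statistics so the new indexes are costed correctly.
ANALYZE num;
ANALYZE pre;
```

Re-running the explain analyze from the question afterwards should show whether the Parallel Seq Scan nodes have been replaced by index or bitmap scans.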
Rewriting the subquery as a CTE can also help:
WITH t AS (
SELECT DISTINCT tag FROM pre WHERE stmt = 'yy' AND adsh = 'xxxx'
)
SELECT n.tag, n.uom, n.qtrs, n.value, n.ddate
FROM num n
JOIN t ON t.tag = n.tag
WHERE n.adsh = 'xxxx' AND n.ddate = '2022-09-30';
You can also create a partial index on one of the columns if the values queried most frequently are known in advance:
CREATE INDEX idx_num_adshxxxx ON num(adsh) WHERE adsh='xxxx';
This creates a small, very fast index that covers just that portion of the table.
It is important to note that indexes are of limited use for ad hoc selects that were not planned for. For large tables it is often better to select a small set of rows through an index scan and re-query those results from a CTE than to load the entire table. As an example, this is a very common predicate in everyday database usage, along with its problems:
WHERE LOWER(adsh) = 'xxxx' ;
Notice how important it is for your queries to match the indexed expression exactly.
If you wrap or change the column being searched, the index must be defined on the same expression or it doesn't get used; the same applies to integercolumn::text = 'x' or date::text = '2019-10-01'.
Adding an index on the matching expression solves this.
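A minimal sketch of such a matching index for the LOWER(adsh) predicate is shown below. The index name is hypothetical, and the query is only an illustration of a predicate that lines up with the indexed expression.

```sql
-- Hypothetical name; the indexed expression must be the same one used in the WHERE clause.
CREATE INDEX num_lower_adsh_idx ON num (lower(adsh));

-- This predicate matches the indexed expression, so the planner can consider the index.
SELECT tag, uom, qtrs, value, ddate
FROM num
WHERE lower(adsh) = 'xxxx';
```

A plain index on num (adsh) would not be considered for this predicate, because the planner matches the expression lower(adsh), not the bare column.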
As tables get bigger, fewer and fewer ad hoc filters can be afforded: a table scan first fills the operating system's memory cache, then the same data is duplicated in the PostgreSQL cache, and only later does the situation stabilize.
If the server's cache memory is already at its limit, every new ad hoc query that is not yet cached slows down the queries that were previously served from cache.
The subselect in the where clause needs to be rewritten into a table join, which will be much faster. The DISTINCT may not be necessary, but I don't know the cardinality of your data, so I included it.
select DISTINCT
num.tag
, num.uom
, num.qtrs
, num.value
, num.ddate
from num
inner join pre
on num.adsh = pre.adsh
and num.tag = pre.tag
and num.adsh = 'xxxx'
and pre.stmt = 'yy'
and num.ddate = '2022-09-30';

What's the best way to Select the first N digits in an Integer (PostgreSQL)

I have a bigint column which is a 19 digit CAS, the first 13 digits of which are a unix timestamp. I need to extract the first 13 digits.
What is the best way to extract the first n digits from the bigint?
Both of these work:
select
cast(left(cast(1641394918760185856 as varchar), 13) as bigint) as withcasting,
1641394918760185856/1000000 as integerdiv;
OUTPUT:1641394918760, 1641394918760
Is there an obviously better way?
Which one of these is better as far as performance?
Which one, if any, is the canonical way to do it?
Intuitively, I'd think integer division is more performant because it's one operation.
But I like the cast because it's very simple (doesn't require figuring out the divisor), expresses most clearly what I'm looking to do (extract the leading 13 digits), and also lends itself to a generalized UDF abstraction (LEFT_DIGITS(number,n)).
So I guess, assuming that it is indeed less performant, the question is really how much less performant?
Actually this way is simpler: you really don't need the left function, just cast to the number of characters that you need:
select cast(1641394918760185856 as varchar(3))
Well, my initial assumption was not right, since you actually have to cast it back to bigint. After looking at the execution plans, you can see that the arithmetic operation is actually done more cheaply (with less memory usage):
select 1641394918760185856/1000000 as ii
| QUERY PLAN |
| :--------------------------------------------------------------------------------- |
| Result (cost=0.00..0.01 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=1) |
| Output: '1641394918760'::bigint |
| Planning Time: 0.021 ms |
| Execution Time: 0.007 ms |
select cast(cast(1641394918760185856 as varchar(13)) as bigint) as ii
| QUERY PLAN |
| :--------------------------------------------------------------------------------- |
| Result (cost=0.00..0.01 rows=1 width=8) (actual time=0.000..0.001 rows=1 loops=1) |
| Output: '1641394918760'::bigint |
| Planning: |
| Buffers: shared hit=12 read=3 |
| Planning Time: 0.097 ms |
| Execution Time: 0.005 ms |
db<>fiddle here
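The generalized LEFT_DIGITS(number, n) abstraction mentioned in the question could be sketched with the cast-based approach as follows. The function name and signature are hypothetical, and the sketch assumes a non-negative input and n of at least 1 (a leading minus sign would count as a character, and n = 0 would try to cast an empty string).

```sql
-- Hypothetical helper built on the cast approach discussed above.
-- Assumes num >= 0 and n >= 1.
CREATE FUNCTION left_digits(num bigint, n integer)
RETURNS bigint
LANGUAGE sql
IMMUTABLE
STRICT
AS $$
    SELECT left(num::text, n)::bigint
$$;

-- Usage with the value from the question:
SELECT left_digits(1641394918760185856, 13) AS ts_millis;
```

When the number of digits is fixed, as with the 19-digit value here, plain integer division by a constant remains the simpler option.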

Why is a select query very slow in Postgres?

I have a simple Postgres table. A simple query to count total records takes ages. I have 7.5 million records in the table, and I am using a machine with 8 vCPUs and 32 GB of memory. The database is on the same machine.
Edit: add query.
Following query is very slow:
SELECT * FROM import_csv WHERE processed = False ORDER BY id ASC OFFSET 1 LIMIT 10000
Output of explain
$ explain SELECT * FROM import_csv WHERE processed = False ORDER BY id ASC OFFSET 1 LIMIT 10000
---------------------------------------------------------------------------------------------------------
Limit (cost=5.42..49915.17 rows=10000 width=1985)
-> Index Scan using import_csv_id_idx on import_csv (cost=0.43..19144730.02 rows=3835870 width=1985)
Filter: (NOT processed)
(3 rows)
My table is as below:
Column | Type | Collation | Nullable | Default
-------------------+----------------+-----------+----------+---------
id | integer | | |
name | character(500) | | |
domain | character(500) | | |
year_founded | real | | |
industry | character(500) | | |
size_range | character(500) | | |
locality | character(500) | | |
country | character(500) | | |
linkedinurl | character(500) | | |
employees | integer | | |
processed | boolean | | not null | false
employee_estimate | integer | | |
Indexes:
"import_csv_id_idx" btree (id)
"processed_idx" btree (processed)
Thank you
Edit 3:
# explain analyze SELECT * FROM import_csv WHERE processed = False ORDER BY id ASC OFFSET 1 LIMIT 10000;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=5.42..49915.33 rows=10000 width=1985) (actual time=8331.070..8355.556 rows=10000 loops=1)
-> Index Scan using import_csv_id_idx on import_csv (cost=0.43..19144790.06 rows=3835870 width=1985) (actual time=8331.067..8354.874 rows=10001 loops=1)
Filter: (NOT processed)
Rows Removed by Filter: 3482252
Planning time: 0.081 ms
Execution time: 8355.925 ms
(6 rows)
explain (analyze, buffers)
# explain (analyze, buffers) SELECT * FROM import_csv WHERE processed = False ORDER BY id ASC OFFSET 1 LIMIT 10000;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=5.42..49915.33 rows=10000 width=1985) (actual time=8236.899..8260.941 rows=10000 loops=1)
Buffers: shared hit=724036 read=2187905 dirtied=17 written=35
-> Index Scan using import_csv_id_idx on import_csv (cost=0.43..19144790.06 rows=3835870 width=1985) (actual time=8236.896..8260.104 rows=10001 loops=1)
Filter: (NOT processed)
Rows Removed by Filter: 3482252
Buffers: shared hit=724036 read=2187905 dirtied=17 written=35
Planning time: 0.386 ms
Execution time: 8261.406 ms
(8 rows)
It is slow because it has to dig through 3482252 rows which fail the processed = False criterion before finding the 10001st one that passes, and apparently all those failing rows are scattered randomly about the table, leading to a lot of slow IO.
You either need an index on (processed, id), or on (id) where processed = false.
If you do the first of these, you can drop the index on processed alone, as it would no longer be independently useful (if it ever were to start with).
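The two index options above could be written as follows. This is a sketch: the index names are made up, and only one of the two indexes is needed.

```sql
-- Option 1: composite index; rows with processed = false are stored in id order.
CREATE INDEX import_csv_processed_id_idx ON import_csv (processed, id);

-- Option 2: partial index containing only the unprocessed rows, ordered by id.
CREATE INDEX import_csv_unprocessed_id_idx ON import_csv (id) WHERE processed = false;

-- If option 1 is chosen, the single-column index on processed becomes redundant:
-- DROP INDEX processed_idx;
```

With either index, the scan can walk the unprocessed rows directly in id order instead of filtering out millions of processed rows first.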

Slow regex query on 80M records in PostgreSQL

I have a read-only table with 80 million rows :
Column | Type | Modifiers | Storage | Stats target | Description
-------------+------------------------+-----------+----------+--------------+-------------
id | character(11) | not null | extended | |
gender | character(1) | | extended | |
postal_code | character varying(10) | | extended | |
operator | character varying(5) | | extended | |
Indexes:
"categorised_phones_pkey" PRIMARY KEY, btree (id)
"operator_idx" btree (operator)
"postal_code_trgm_idx" gin (postal_code gin_trgm_ops)
id is Primary Key and contains unique mobile numbers. Table rows looks like this:
id | gender | postal_code | operator
----------------+--------------+----------------+------------
09567849087 | m | 7414776788 | mtn
09565649846 | f | 1268398732 | mci
09568831245 | f | 7412556443 | mtn
09469774390 | m | 5488312790 | mci
This query takes ~65 seconds the first time and ~8 seconds on subsequent runs:
select operator,count(*) from categorised_phones where postal_code like '1%' group by operator;
And the output looks like this:
operator | count
----------+---------
mci | 4050314
mtn | 6235778
And the output of explain analyze:
HashAggregate (cost=1364980.61..1364980.63 rows=2 width=10) (actual time=8257.026..8257.026 rows=2 loops=1)
Group Key: operator
-> Bitmap Heap Scan on categorised_phones (cost=95969.17..1312915.34 rows=10413054 width=2) (actual time=1140.803..6332.534 rows=10286092 loops=1)
Recheck Cond: ((postal_code)::text ~~ '1%'::text)
Rows Removed by Index Recheck: 25105697
Heap Blocks: exact=50449 lossy=237243
-> Bitmap Index Scan on postal_code_trgm_idx (cost=0.00..93365.90 rows=10413054 width=0) (actual time=1129.270..1129.270 rows=10287127 loops=1)
Index Cond: ((postal_code)::text ~~ '1%'::text)
Planning time: 0.540 ms
Execution time: 8257.392 ms
How can I make this query faster?
Any ideas would be greatly appreciated.
P.S:
I'm using PostgreSQL 9.6.1
UPDATE
I just updated the question. I disabled Parallel Query and results changed.
For queries that involve comparisons of the form LIKE 'start%', and following PostgreSQL's own advice, you can use the following index:
CREATE INDEX postal_code_idx ON categorised_phones (postal_code varchar_pattern_ops) ;
With that index in place, and some simulated data, your execution plan could very likely look like:
| QUERY PLAN |
| :------------------------------------------------------------------------------------------------------------------------------------- |
| HashAggregate (cost=2368.65..2368.67 rows=2 width=12) (actual time=18.093..18.094 rows=2 loops=1) |
| Group Key: operator |
| -> Bitmap Heap Scan on categorised_phones (cost=536.79..2265.83 rows=20564 width=4) (actual time=2.564..12.061 rows=22171 loops=1) |
| Filter: ((postal_code)::text ~~ '1%'::text) |
| Heap Blocks: exact=1455 |
| -> Bitmap Index Scan on postal_code_idx (cost=0.00..531.65 rows=21923 width=0) (actual time=2.386..2.386 rows=22171 loops=1) |
| Index Cond: (((postal_code)::text ~>=~ '1'::text) AND ((postal_code)::text ~<~ '2'::text)) |
| Planning time: 0.119 ms |
| Execution time: 18.122 ms |
You can check it at dbfiddle here
If you have both queries with LIKE 'start%' and LIKE '%middle%', you should add this index, but keep the one already in place. Trigram indexes might prove useful with this second kind of match.
Why?
From PostgreSQL documentation on operator classes:
The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops support B-tree indexes on the types text, varchar, and char respectively. The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX regular expressions) when the database does not use the standard "C" locale.
From PostgreSQL documentation on Index Types
The optimizer can also use a B-tree index for queries involving the pattern matching operators LIKE and ~ if the pattern is a constant and is anchored to the beginning of the string — for example, col LIKE 'foo%' or col ~ '^foo', but not col LIKE '%bar'. However, if your database does not use the C locale you will need to create the index with a special operator class to support indexing of pattern-matching queries; see Section 11.9 below. It is also possible to use B-tree indexes for ILIKE and ~*, but only if the pattern starts with non-alphabetic characters, i.e., characters that are not affected by upper/lower case conversion.
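Whether the special operator class is needed depends on the database locale, which can be checked with a catalog query like the sketch below (it is not part of the original answer).

```sql
-- Shows the collation and character classification of the current database.
SELECT datname, datcollate, datctype
FROM pg_database
WHERE datname = current_database();
```

If datcollate is C (or POSIX), a plain B-tree index on postal_code can already serve LIKE '1%'; with any other locale, the varchar_pattern_ops index shown earlier is the one the planner can use for prefix matches.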
UPDATE
If the queries performed always involve a fixed (and relatively small) number of LIKE 'x%' expressions, consider using partial indexes.
For instance, for LIKE '1%', you'd have the following index, and the following query plan (it shows about a 3x improvement):
CREATE INDEX idx_1 ON categorised_phones (operator) WHERE postal_code LIKE '1%';
VACUUM categorised_phones ;
| QUERY PLAN |
| :-------------------------------------------------------------------------------------------------------------------------------------------- |
| GroupAggregate (cost=0.29..658.74 rows=3 width=12) (actual time=3.235..6.493 rows=2 loops=1) |
| Group Key: operator |
| -> Index Only Scan using idx_1 on categorised_phones (cost=0.29..554.10 rows=20921 width=4) (actual time=0.028..3.266 rows=22290 loops=1) |
| Heap Fetches: 0 |
| Planning time: 0.293 ms |
| Execution time: 6.517 ms |

Slow Select Query with DATE Between

I searched for my problem a bit, but couldn't find a solution.
I run a PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit and my Query is pretty simple.
EXPLAIN (ANALYZE) SELECT CUSTOMER, PRICE, BUYDATE FROM dbo.Invoice WHERE CUSTOMER = 11111111 AND BUYDATE BETWEEN '2012-11-01' AND '2013-10-31';
Output:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on Invoice (cost=88193.54..152981.03 rows=20699 width=14) (actual time=987.205..987.242 rows=36 loops=1)
Recheck Cond: ((CUSTOMER = 11111111) AND (BUYDATE >= '2012-11-01'::date) AND (BUYDATE <= '2013-10-31'::date))
-> BitmapAnd (cost=88193.54..88193.54 rows=20699 width=0) (actual time=987.189..987.189 rows=0 loops=1)
-> Bitmap Index Scan on ix_Invoice (cost=0.00..1375.69 rows=74375 width=0) (actual time=0.043..0.043 rows=40 loops=1)
Index Cond: (CUSTOMER = 11111111)
-> Bitmap Index Scan on ix_Invoice3 (cost=0.00..86807.24 rows=4139736 width=0) (actual time=986.562..986.562 rows=4153999 loops=1)
Index Cond: ((BUYDATE >= '2012-11-01'::date) AND (BUYDATE <= '2013-10-31'::date))
Total runtime: 987.294 ms
(8 rows)
The Table Structure:
Column | Type | Modifiers | Storage | Stats target | Description
-----------------+---------------------------+-----------+----------+--------------+-------------
profitcenter | character varying(5) | not null | extended | |
invoicenumber | character varying(10) | not null | extended | |
invoiceposition | character varying(6) | not null | extended | |
buydate | date | not null | plain | |
customer | integer | | plain | |
nettowert | numeric(18,2) | | main | |
Indexes:
"filialbons_key" PRIMARY KEY, btree (profitcenter, invoicenumber, invoiceposition, buydate)
"ix_Invoice" btree (customer)
"ix_Invoice2" btree (invoicenumber)
"ix_Invoice3" btree (buydate)
"ix_Invoice4" btree (articlenumber)
Has OIDs: no
Example Output from the Query:
customer | price | buydate
--------------+-----------+----
11111111 | 8.32 | 2013-02-06
11111111 | 5.82 | 2013-02-06
11111111 | 16.64 | 2013-02-06
I ran the same Query on a MSSQL 2010? Express with the Date Column as varchar() and it was much faster.
Thanks for your help
With an index on (customer, buydate), the query should run much faster.
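A sketch of that composite index is shown below. The index name is hypothetical, and the table is referenced as dbo.Invoice as in the question's query.

```sql
-- Hypothetical index name; customer comes first because it is compared with equality,
-- and buydate second because it is a range condition.
CREATE INDEX ix_invoice_customer_buydate ON dbo.Invoice (customer, buydate);

-- Refresh statistics after creating the index.
ANALYZE dbo.Invoice;
```

With this index, the planner can locate the rows for one customer within the date range in a single index scan, instead of building a bitmap of roughly 4 million buydate matches and combining it with the customer bitmap.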
You may also try to help the planner choose a better plan by collecting more detailed statistics:
ALTER TABLE Invoice ALTER COLUMN customer SET STATISTICS 1000;
ALTER TABLE Invoice ALTER COLUMN buydate SET STATISTICS 1000;
ANALYZE Invoice;