PostgreSQL slow: primary key on 80M row table

I have a machine with 64GB of RAM and 20 CPUs, and this is the command that creates a primary key on a table with 80M rows:
ALTER TABLE ONLY wikipedia_article ADD CONSTRAINT pagelinks_pkey PRIMARY KEY (language, title);
The problems I see with this are:
only 1 CPU is used, up to 100% (avg load 99%); the rest aren't used at all
the write speed during the primary key creation is quite low (2.6 MB/s), and the read speed is missing completely from the report produced by pg_activity
This is the table structure:
Table "public.wikipedia_article"
Column | Type | Modifiers | Storage | Stats target | Description
--------------+------------------+-----------+----------+--------------+-------------
language | text | not null | extended | |
title | text | not null | extended | |
langcount | integer | | plain | |
othercount | integer | | plain | |
totalcount | integer | | plain | |
lat | double precision | | plain | |
lon | double precision | | plain | |
importance | double precision | | plain | |
osm_type | character(1) | | extended | |
osm_id | bigint | | plain | |
infobox_type | text | | extended | |
population | bigint | | plain | |
website | text | | extended | |
Has OIDs: no
The import happens automatically via pg_restore; this is the source:
http://www.nominatim.org/data/wikipedia_article.sql.bin
Any ideas on what I can try to make things faster? I changed the value of the config variable "maintenance_work_mem" to half of the RAM size. I also tried to change some of the kernel settings, with no joy:
sysctl -w kernel.shmmax=17179869184
sysctl -w kernel.shmall=1048576
sysctl vm.overcommit_memory=1
sysctl vm.swappiness=10
The OS is running on a VM at DigitalOcean, on SSD drives; I was expecting this to work faster on such a configuration.
Server info: PostgreSQL 9.3.13 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4, 64-bit
Many thanks, VG

PostgreSQL uses one CPU per statement, so 100% utilization of a single CPU is expected (parallel index builds only arrived in PostgreSQL 11, so a 9.3 server cannot spread this work across cores). You can try setting maintenance_work_mem to a higher value; the default is too low. You can try '2GB', for example.
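For example, setting it just for the session that runs the ALTER TABLE avoids touching postgresql.conf; a minimal sketch, with '2GB' purely as an illustration to be sized against your RAM:
-- Session-only setting; it reverts when the connection closes.
SET maintenance_work_mem = '2GB';
-- The sort that builds the unique index is where the time goes.
ALTER TABLE ONLY wikipedia_article ADD CONSTRAINT pagelinks_pkey PRIMARY KEY (language, title);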


Postgres query taking long to execute

I am using libpq to connect to the Postgres server from C++ code. The Postgres server version is 12.10.
My table schema is defined below:
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
---------------------+----------+-----------+----------+------------+----------+--------------+-------------
event_id | bigint | | not null | | plain | |
event_sec | integer | | not null | | plain | |
event_usec | integer | | not null | | plain | |
event_op | smallint | | not null | | plain | |
rd | bigint | | not null | | plain | |
addr | bigint | | not null | | plain | |
masklen | bigint | | not null | | plain | |
path_id | bigint | | | | plain | |
attribs_tbl_last_id | bigint | | not null | | plain | |
attribs_tbl_next_id | bigint | | not null | | plain | |
bgp_id | bigint | | not null | | plain | |
last_lbl_stk | bytea | | not null | | extended | |
next_lbl_stk | bytea | | not null | | extended | |
last_state | smallint | | | | plain | |
next_state | smallint | | | | plain | |
pkey | integer | | not null | 1654449420 | plain | |
Partition key: LIST (pkey)
Indexes:
"event_pkey" PRIMARY KEY, btree (event_id, pkey)
"event_event_sec_event_usec_idx" btree (event_sec, event_usec)
Partitions: event_spl_1651768781 FOR VALUES IN (1651768781),
event_spl_1652029140 FOR VALUES IN (1652029140),
event_spl_1652633760 FOR VALUES IN (1652633760),
event_spl_1653372439 FOR VALUES IN (1653372439),
event_spl_1653786420 FOR VALUES IN (1653786420),
event_spl_1654449420 FOR VALUES IN (1654449420)
When I execute the following query, it takes 1-2 milliseconds.
time is provided as a parameter to the function executing this query; it contains epoch seconds and microseconds.
SELECT event_id FROM event WHERE (event_sec > time.seconds) OR ((event_sec = time.seconds) AND (event_usec >= time.useconds)) ORDER BY event_sec, event_usec LIMIT 1
This query is executed every 30 seconds on the same client connection (which is persistent for weeks). The process runs for weeks, but sometimes the same query starts taking more than 10 minutes.
If I restart the process, it re-creates the connection to the server, and the execution time falls back to 1-2 milliseconds. The issue is intermittent: sometimes it triggers after a week of running, sometimes after 2-3 weeks.
We add a new partition to table every Sunday and write new data in new partition.
I don't know why the performance is inconsistent; there are many possibilities we can't distinguish with the info provided. For example, does the plan change when the performance changes, or does the same plan just perform worse?
But your query is not written to take maximal advantage of the index. In my hands it can use the index for ordering, but then it still needs to read, and individually skip over, rows that fail the WHERE clause until it finds the first one that passes. And due to partitioning, I think it is even worse than that: it has to do this read-and-skip until it finds the first passing row in each partition.
You could rewrite it to do a tuple comparison, which can use the index to determine both the order, and where to start:
SELECT event_id FROM event
WHERE (event_sec, event_usec) >= (:seconds, :useconds)
ORDER BY event_sec, event_usec LIMIT 1;
Now this might also degrade, or might not, or maybe will degrade but still be so fast that it doesn't matter.
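If it does degrade again, the first thing worth capturing is the plan in both the fast and the slow state; something along these lines (a sketch, with literal values standing in for time.seconds / time.useconds):
-- Run once while fast and once while slow, then diff the two plans.
EXPLAIN (ANALYZE, BUFFERS)
SELECT event_id FROM event
WHERE (event_sec, event_usec) >= (1651768781, 0)  -- substitute the real seconds/useconds
ORDER BY event_sec, event_usec
LIMIT 1;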

What Postgres 13 index types support distance searches?

Original Question
We've had great results using a K-NN search with a GiST index with gist_trgm_ops. Pure magic. I've got other situations, with other datatypes like timestamp where distance functions would be quite useful. If I didn't dream it, this is, or was, available through pg_catalog. Looking around, I can't find a way to search on indexes by such properties. I think what I'm after, in this case, is AMPROP_DISTANCE_ORDERABLE under-the-hood.
Just checked, and prior to 9.6, pg_am had a lot more attributes than it does now.
Is there another way to figure out what options various indexes have with a catalog search?
Catalogs
jjanes' answer inspired me to look at the system information functions some more, and to spend a day in the pg_catalog tables. The catalogs for indexes and operators are complicated. The system information functions are a big help. This piece proved super useful for getting a handle on things:
https://postgrespro.com/blog/pgsql/4161264
I think the conclusion is "no, you can't readily figure out what data types and indexes support proximity searches." The relevant attribute is a property of a column in a specific index. However, it looks like nearest-neighbor searching requires a GiST index, and that there are readily-available index operator classes to add K-NN searching to a huge range of common types. Happy for corrections on these conclusions, or the details below.
Built-in Distance Support
https://www.postgresql.org/docs/current/gist-builtin-opclasses.html
From various bits of the docs, it sounds like there are distance (proximity, nearest neighbor, K-NN) operators for GiST indexes on a handful of built-in geometric types.
box
circle
point
poly
B-tree Operator Classes
Not listed as such in the docs, but visible with this query:
select am.amname AS index_method
, opc.opcname AS opclass_name
, opc.opcintype::regtype AS indexed_type
, opc.opcdefault AS is_default
from pg_am am
, pg_opclass opc
where opc.opcmethod = am.oid
and am.amname = 'btree'
order by 1,2;
B-tree GiST Distance Support
https://www.postgresql.org/docs/current/btree-gist.html
The btree_gist extension provides GiST operator classes that emulate plain B-tree behavior, which is what adds distance support for these scalar types. The docs say these native types are supported (there's a quick verification sketch after the list):
int2
int4
int8
float4
float8
timestamp with time zone
timestamp without time zone
time without time zone
date
interval
oid
money
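Installing the extension and re-running the opclass query above for GiST is a quick way to confirm what it adds; a sketch (the opclass names in the comment are what I'd expect to see, and may vary by version):
-- btree_gist ships with contrib; this is a no-op if it is already installed.
create extension if not exists btree_gist;
-- Same shape as the B-tree query earlier, restricted to GiST; expect entries
-- like gist_int4_ops, gist_timestamp_ops, gist_date_ops, and so on.
select am.amname AS index_method
, opc.opcname AS opclass_name
, opc.opcintype::regtype AS indexed_type
from pg_am am
join pg_opclass opc on opc.opcmethod = am.oid
where am.amname = 'gist'
order by 2;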
BRIN Built-in Operator Classes
https://www.postgresql.org/docs/current/brin-builtin-opclasses.html
There are over 70 listed in the internals docs.
GIN Built-in Operator Classes
https://www.postgresql.org/docs/12/gin-builtin-opclasses.html
array_ops
jsonb_ops
jsonb_path_ops
tsvector_ops
Alternative Text Ops
https://www.postgresql.org/docs/current/indexes-opclass.html
There are special operator classes for text comparisons made character-by-character, rather than through a collation. Or so the docs say:
text_pattern_ops
varchar_pattern_ops
bpchar_pattern_ops
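None of these provide <->, but for reference, a pattern opclass is named per column at index creation time; a minimal sketch, using a made-up place(name) table:
-- Hypothetical table. text_pattern_ops lets WHERE name LIKE 'abc%' use the
-- index even when the database collation is not "C".
create index place_name_pattern_idx on place (name text_pattern_ops);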
pg_trgm
Beyond this, the bundled pg_trgm module provides operator classes for GIN and GiST, with the GiST version supporting <->. I think this shows up as:
text
Note: Postgres 14 modifies pg_trgm to allow you to adjust the "signature length" for the index entry. Longer is possibly more accurate; shorter signatures are smaller on disk. If you've been using pg_trgm, it might be worth experimenting with the signature length in PG 14.
https://www.postgresql.org/docs/current/pgtrgm.html
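If I'm reading the docs right, the signature length is passed as an opclass parameter when the index is created; a sketch, with a made-up articles(title) table and an arbitrary 256-byte value:
-- Requires the pg_trgm extension. Larger siglen = bigger index entries,
-- potentially better selectivity; smaller = more compact on disk.
create index articles_title_trgm_idx
on articles using gist (title gist_trgm_ops(siglen = 256));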
SP-GiST Built-in Operator Classes
box_ops
kd_point_ops
network_ops
poly_ops
quad_point_ops
range_ops
text_ops
pg_operator search
Here's a search on pg_operator to look for matches starting from the <-> operator itself:
select oprnamespace::regnamespace::text as schema_name,
oprowner::regrole as owner,
oprname as operator,
oprleft::regtype as left,
oprright::regtype as right,
oprresult::regtype as result,
oprcom::regoperator as commutator
from pg_operator
where oprname = '<->'
order by 1
Output from one of our servers:
| schema_name | owner | operator | left | right | result | commutator |
+-------------+----------+----------+-----------------------------+-----------------------------+------------------+--------------------------------------------------------------+
| extensions | postgres | <-> | text | text | real | <->(text,text) |
| extensions | postgres | <-> | money | money | money | <->(money,money) |
| extensions | postgres | <-> | date | date | integer | <->(date,date) |
| extensions | postgres | <-> | real | real | real | <->(real,real) |
| extensions | postgres | <-> | double precision | double precision | double precision | <->(double precision,double precision) |
| extensions | postgres | <-> | smallint | smallint | smallint | <->(smallint,smallint) |
| extensions | postgres | <-> | integer | integer | integer | <->(integer,integer) |
| extensions | postgres | <-> | bigint | bigint | bigint | <->(bigint,bigint) |
| extensions | postgres | <-> | interval | interval | interval | <->(interval,interval) |
| extensions | postgres | <-> | oid | oid | oid | <->(oid,oid) |
| extensions | postgres | <-> | time without time zone | time without time zone | interval | <->(time without time zone,time without time zone) |
| extensions | postgres | <-> | timestamp without time zone | timestamp without time zone | interval | <->(timestamp without time zone,timestamp without time zone) |
| extensions | postgres | <-> | timestamp with time zone | timestamp with time zone | interval | <->(timestamp with time zone,timestamp with time zone) |
| pg_catalog | postgres | <-> | box | box | double precision | <->(box,box) |
| pg_catalog | postgres | <-> | path | path | double precision | <->(path,path) |
| pg_catalog | postgres | <-> | line | line | double precision | <->(line,line) |
| pg_catalog | postgres | <-> | lseg | lseg | double precision | <->(lseg,lseg) |
| pg_catalog | postgres | <-> | polygon | polygon | double precision | <->(polygon,polygon) |
| pg_catalog | postgres | <-> | circle | circle | double precision | <->(circle,circle) |
| pg_catalog | postgres | <-> | point | circle | double precision | <->(circle,point) |
| pg_catalog | postgres | <-> | circle | point | double precision | <->(point,circle) |
| pg_catalog | postgres | <-> | point | polygon | double precision | <->(polygon,point) |
| pg_catalog | postgres | <-> | polygon | point | double precision | <->(point,polygon) |
| pg_catalog | postgres | <-> | circle | polygon | double precision | <->(polygon,circle) |
| pg_catalog | postgres | <-> | polygon | circle | double precision | <->(circle,polygon) |
| pg_catalog | postgres | <-> | point | point | double precision | <->(point,point) |
| pg_catalog | postgres | <-> | box | line | double precision | <->(line,box) |
| pg_catalog | postgres | <-> | tsquery | tsquery | tsquery | 0 |
| pg_catalog | postgres | <-> | line | box | double precision | <->(box,line) |
| pg_catalog | postgres | <-> | point | line | double precision | <->(line,point) |
| pg_catalog | postgres | <-> | line | point | double precision | <->(point,line) |
| pg_catalog | postgres | <-> | point | lseg | double precision | <->(lseg,point) |
| pg_catalog | postgres | <-> | lseg | point | double precision | <->(point,lseg) |
| pg_catalog | postgres | <-> | point | box | double precision | <->(box,point) |
| pg_catalog | postgres | <-> | box | point | double precision | <->(point,box) |
| pg_catalog | postgres | <-> | lseg | line | double precision | <->(line,lseg) |
| pg_catalog | postgres | <-> | line | lseg | double precision | <->(lseg,line) |
| pg_catalog | postgres | <-> | lseg | box | double precision | <->(box,lseg) |
| pg_catalog | postgres | <-> | box | lseg | double precision | <->(lseg,box) |
| pg_catalog | postgres | <-> | point | path | double precision | <->(path,point) |
| pg_catalog | postgres | <-> | path | point | double precision | <->(point,path) |
+-------------+----------+----------+-----------------------------+-----------------------------+------------------+--------------------------------------------------------------+
Did I miss any index opts worth knowing about?
Checking Out Live Indexes
Here's a longer-than-it-should-be-because-I-still-find-the-catalogs-confusing query to pull out the columns from each user index, and figure out their more interesting properties. For a nice, short catalog search of much utility, see https://dba.stackexchange.com/questions/186944/how-to-list-all-the-indexes-along-with-their-type-btree-brin-hash-etc
with
basic_details as (
select relnamespace::regnamespace::text as schema_name,
indrelid::regclass::text as table_name,
indexrelid::regclass::text as index_name,
unnest(indkey) as column_ordinal_position, -- WITH ORDINALITY would be nice here, didn't get it working.
generate_subscripts(indkey, 1) + 1 as column_position_in_index
from pg_index
join pg_class on pg_class.oid = pg_index.indrelid
),
enriched_details as (
select basic_details.schema_name,
basic_details.table_name,
basic_details.index_name,
basic_details.column_ordinal_position,
basic_details.column_position_in_index,
columns.column_name,
columns.udt_name as column_type_name
from basic_details
join information_schema.columns as columns
on columns.table_schema = basic_details.schema_name
and columns.table_name = basic_details.table_name
and columns.ordinal_position = basic_details.column_ordinal_position
where schema_name not like 'pg_%'
)
select *,
-- https://postgrespro.com/blog/pgsql/4161264
coalesce(pg_index_column_has_property(index_name,column_position_in_index,'distance_orderable'), false) as supports_knn_searches,
coalesce(pg_index_column_has_property(index_name,column_position_in_index,'search_array'), false) as supports_in_searches,
coalesce(pg_index_column_has_property(index_name,column_position_in_index,'returnable'), false) as supports_index_only_scans,
(select indexdef
from pg_indexes
where pg_indexes.schemaname = enriched_details.schema_name
and pg_indexes.indexname = enriched_details.index_name) as index_definition
from enriched_details
order by supports_in_searches desc,
schema_name,
table_name,
index_name
The timestamp type supports KNN with GiST indexes, using the <-> operator created by the btree_gist extension.
You can check if a specific column of a specific index supports it, like this:
select pg_index_column_has_property('pgbench_history_mtime_idx'::regclass,1,'distance_orderable');
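For context, that index would have been built with the GiST opclass that btree_gist supplies for timestamps, roughly like this (a sketch against pgbench's history table; adjust names as needed):
-- btree_gist makes its opclasses the GiST defaults for the plain scalar types.
create extension if not exists btree_gist;
create index pgbench_history_mtime_idx on pgbench_history using gist (mtime);
-- K-NN: the ten rows whose mtime is closest to a given moment, driven by the
-- index rather than a sort of the whole table.
select * from pgbench_history
order by mtime <-> timestamp '2021-10-05 12:00:00'
limit 10;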
As best as I can tell, here's the state of play as of PG 14:
GiST indexes may support nearest-neighbor (K-NN) proximity <-> search, and always have.
SP-GiST added such support as of PG 12.
RUM indexes (not in core) also support K-NN.
In all cases, the support comes from the operator class:
https://www.postgresql.org/docs/current/indexes-opclass.html
That's what determines whether distance_orderable works for a specific data type on a specific kind of index. Built in, some of the geometric and text-vector types work out of the box. Beyond that small set, many more types are supported via specific operator classes, such as:
https://www.postgresql.org/docs/current/btree-gist.html
https://www.postgresql.org/docs/current/pgtrgm.html
In the case of SP-GiST, there are a lot fewer types supported than with GiST, once you've installed btree_gist:
https://www.postgresql.org/docs/14/spgist-builtin-opclasses.html
It looks like text_ops and range_ops do not support proximity searches. However, for tsrange, etc., there are likely enough options with other tools.
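One more catalog cross-check: ordering ("distance") operators are recorded in pg_amop with amoppurpose = 'o', so a query along these lines should list every operator family on a server that can drive an ORDER BY ... <-> scan (a sketch; I've only tried this shape on recent versions):
select am.amname as index_method
, opf.opfname as operator_family
, amop.amoplefttype::regtype as left_type
, amop.amoprighttype::regtype as right_type
, amop.amopopr::regoperator as ordering_operator
from pg_amop amop
join pg_opfamily opf on opf.oid = amop.amopfamily
join pg_am am on am.oid = opf.opfmethod
where amop.amoppurpose = 'o' -- 'o' = ordering (K-NN); 's' = plain search
order by 1, 2;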

Timescaledb: retention policy isn't removing data from hypertable

(note: I've also posted this as a github issue https://github.com/timescale/timescaledb/issues/3653)
I have a hypertable request_logs configured with a 24 hour retention policy. The retention policy is being marked as running successfully, however no old data from the table is being removed. The table continues to grow day by day.
I checked and don't see any errors in the postgresql log files.
I could use additional guidance on where to look for information to troubleshoot this issue.
request_logs table structure
\d+ request_logs;
Table "public.request_logs"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-----------+--------------------------+-----------+----------+---------+---------+--------------+-------------
time | timestamp with time zone | | not null | | plain | |
referer | bigint | | | | plain | |
useragent | bigint | | | | plain | |
Indexes:
"request_logs_time_idx" btree ("time" DESC)
Triggers:
ts_insert_blocker BEFORE INSERT ON request_logs FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()
Child tables: _timescaledb_internal._hyper_1_37_chunk,
_timescaledb_internal._hyper_1_38_chunk,
_timescaledb_internal._hyper_1_40_chunk
Access method: heap
This is the hypertable description retrieved by running select * from _timescaledb_catalog.hypertable;
id | schema_name | table_name | associated_schema_name | associated_table_prefix | num_dimensions | chunk_sizing_func_schema | chunk_sizing_func_name | chunk_target_size | compression_state | compressed_hypertable_id | replication_factor
----+-------------+--------------+------------------------+-------------------------+----------------+--------------------------+--------------------------+-------------------+-------------------+--------------------------+--------------------
1 | public | request_logs | _timescaledb_internal | _hyper_1 | 1 | _timescaledb_internal | calculate_chunk_interval | 0 | 0 | |
(1 row)
This is the retention policy status, retrieved by running SELECT * FROM timescaledb_information.job_stats;
hypertable_schema | hypertable_name | job_id | last_run_started_at | last_successful_finish | last_run_status | job_status | last_run_duration | next_start | total_runs | total_successes | total_failures
-------------------+-----------------+--------+-------------------------------+-------------------------------+-----------------+------------+-------------------+-------------------------------+------------+-----------------+----------------
public | request_logs | 1002 | 2021-10-05 23:59:01.601404+00 | 2021-10-05 23:59:01.638441+00 | Success | Scheduled | 00:00:00.037037 | 2021-10-06 23:59:01.638441+00 | 8 | 8 | 0
| | 1 | 2021-10-05 08:38:20.473945+00 | 2021-10-05 08:38:21.153468+00 | Success | Scheduled | 00:00:00.679523 | 2021-10-06 08:38:21.153468+00 | 45 | 45 | 0
(2 rows)
Relevant system information:
OS: Ubuntu 20.04.3 LTS
PostgreSQL version (output of postgres --version): 12
TimescaleDB version (output of \dx in psql): 2.4.1
Installation method: apt install process described https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/ubuntu/installation-apt-ubuntu/#installation-apt-ubuntu
It looks as though this might be related to a bug that has been fixed in version 2.4.2 of TimescaleDB. The GitHub report has been updated; if you find that the issue remains after upgrading, please update the issue on GitHub with your example. Thanks for reporting!
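Until you can upgrade, the chunks the policy should have removed can be dropped by hand, and the stored policy configuration is visible in timescaledb_information.jobs; a sketch for TimescaleDB 2.x, reusing the 24-hour interval from the policy described above:
-- Inspect the retention job's stored configuration (drop_after, hypertable).
SELECT job_id, hypertable_name, config
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_retention';
-- One-off manual clean-up of chunks that are entirely older than 24 hours.
SELECT drop_chunks('request_logs', INTERVAL '24 hours');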
Transparency: I work for Timescale

GIST index creation too slow on PostgreSQL

I have a table in PostgreSQL with the following structure:
Column | Type | Collation | Nullable | Default
-------------+-----------------------+-----------+----------+------------------------------------------------
vessel_hash | integer | | not null | nextval('samplecol_vessel_hash_seq'::regclass)
status | character varying(50) | | |
station | character varying(50) | | |
speed | character varying(10) | | |
longitude | numeric(12,8) | | |
latitude | numeric(12,8) | | |
course | character varying(50) | | |
heading | character varying(50) | | |
timestamp | character varying(50) | | |
the_geom | geometry | | |
Check constraints:
"enforce_dims_the_geom" CHECK (st_ndims(the_geom) = 2)
"enforce_geotype_geom" CHECK (geometrytype(the_geom) = 'POINT'::text OR the_geom IS NULL)
"enforce_srid_the_geom" CHECK (st_srid(the_geom) = 4326)
The database contains ~146,000,000 records, and the size of the table that contains the data is:
public | samplecol | table | postgres | 31 GB |
I try to create a GIST index on the geometry field the_geom with this command:
create index samplecol_the_geom_gist on samplecol using gist (the_geom);
but it takes too long; it has already been running for 2 hours.
Based on the question "Slow indexing of 300GB Postgis table", before index creation I executed this in the psql console:
ALTER SYSTEM SET maintenance_work_mem = '1GB';
ALTER SYSTEM
SELECT pg_reload_conf();
pg_reload_conf
----------------
t
(1 row)
But index creation takes too long. Does anyone know why? And how to fix it?
I am afraid you'll have to sit it out.
Apart from high maintenance_work_mem, there is not really a tuning option.
Increasing max_wal_size will help somewhat, since you will get fewer checkpoints.
If you can't live with an ACCESS EXCLUSIVE lock for that long, try CREATE INDEX CONCURRENTLY, which will be even slower, but won't block concurrent database activity.
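Putting those suggestions together, a session-level maintenance_work_mem plus a larger max_wal_size, and optionally CONCURRENTLY, looks roughly like this (a sketch; the values are illustrative, not recommendations):
-- Session-level; affects only this connection, no reload needed.
SET maintenance_work_mem = '1GB';
-- Server-wide; picked up after a reload, no restart required.
ALTER SYSTEM SET max_wal_size = '10GB';
SELECT pg_reload_conf();
-- Slower overall, but avoids holding an ACCESS EXCLUSIVE lock on samplecol.
CREATE INDEX CONCURRENTLY samplecol_the_geom_gist ON samplecol USING gist (the_geom);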

Postgres Changing column from TEXT to INTEGER increases table size

I have a Postgres table that has a schema like this:
Table "am.old_product"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-----------------+--------------------------+-----------+----------+---------+----------+--------------+-------------
p_config_sku | text | | | | extended | |
p_simple_sku | text | | | | extended | |
p_merchant_id | text | | | | extended | |
p_country | character varying(2) | | | | extended | |
p_discount_rate | numeric(10,2) | | | | main | |
p_black_price | numeric(10,2) | | | | main | |
p_red_price | numeric(10,2) | | | | main | |
p_received_at | timestamp with time zone | | | | plain | |
p_event_id | uuid | | | | plain | |
p_is_deleted | boolean | | | | plain | |
Indexes:
"product_p_simple_sku_p_country_p_merchant_id_idx" UNIQUE, btree (p_simple_sku, p_country, p_merchant_id)
"config_sku_country_idx" btree (p_config_sku, p_country)
We decided that it would be a better idea to remove the TEXT field merchant_id, move it to another table, and reference it in the product table using a foreign key. So the new schema looks like this:
Table "am.product"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-------------------+--------------------------+-----------+----------+---------+----------+--------------+-------------
p_config_sku | text | | not null | | extended | |
p_simple_sku | text | | not null | | extended | |
p_country | character varying(2) | | not null | | extended | |
p_discount_rate | numeric(10,2) | | | | main | |
p_black_price | numeric(10,2) | | | | main | |
p_red_price | numeric(10,2) | | | | main | |
p_received_at | timestamp with time zone | | not null | | plain | |
p_event_id | uuid | | not null | | plain | |
p_is_deleted | boolean | | | false | plain | |
p_merchant_id_new | integer | | not null | | plain | |
Indexes:
"new_product_p_simple_sku_p_country_p_merchant_id_new_idx" UNIQUE, btree (p_simple_sku, p_country, p_merchant_id_new)
"p_config_sku_country_idx" btree (p_config_sku, p_country)
Foreign-key constraints:
"fk_merchant_id" FOREIGN KEY (p_merchant_id_new) REFERENCES am.merchant(m_id)
Now this should make the product table size drop, right? We are using a 4-byte integer instead of TEXT. Well, not really: the two tables have the exact same number of rows, but the product table (the one with the integer field) is 34.3 GB, while the old table (with TEXT) is 19.7 GB.
Does anyone have an explanation for that?
At a wild guess, you have done this with various ALTER TABLE commands, forcing at least one rewrite of the entire table.
The unused space will gradually be re-used; for a more prompt change, try a CLUSTER or VACUUM FULL on the table.
Look at the VACUUM command.
A database file is an organized collection of tuples (row versions), and a single row can be represented by more than one tuple. When you added the new column, new tuples were written to the table file; when you dropped the old column, the space occupied by the old tuples remained, because removing them from the file immediately would be a costly operation. They are dead tuples.
VACUUM FULL am.product;
This will unfortunately take an exclusive lock on the table, and you won't be able to query it in the process.
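To verify the effect, compare the table's on-disk size before and after; pg_total_relation_size includes the indexes, which VACUUM FULL also rebuilds (a sketch):
-- Size including TOAST and indexes, and the heap alone, for comparison.
SELECT pg_size_pretty(pg_total_relation_size('am.product')) AS table_plus_indexes,
pg_size_pretty(pg_relation_size('am.product')) AS heap_only;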