Index creation taking forever on postgres - postgresql

I was trying to create an index to an integer column using a btree index, but it was taking forever (more than 2 hours!).
The table has 17.514.879 lines. I didn't expect it to take that long.
After almost 2.5 hours, the connection to the database just died. When I reconnected to it, the index was there, but I don't know how good is this index.
How can I be sure the index was not messed up by the connection lost?

How to Check Whether the Index is Fine
Connect to the database via psql and run \d table_name (where table_name is the name of your table). For example:
grn=# \d users
Table "public.users"
Column | Type | Modifiers
--------+------------------------+-----------
name | character varying(255) |
Indexes:
"users_name_idx" btree (name)
You'll see indexes listed below the table schema. If the index is corrupt it'll be marked as so.
How to Create Indexes Without Locking the Whole Table
You can create an index in a way that doesn't lock the whole table but is even slower. To do so you need to add CONCURRENTLY to CREATE INDEX. For example:
CREATE INDEX CONCURRENTLY users_name_idx ON users(name);
How to Fix a Corrupt Index
If the index is corrupt you can either drop it and recreate CONCURRENTLY or use REINDEX INDEX index_name. For example:
REINDEX INDEX users_name_idx
will recreate the users_name_idx.

Related

Serial column takes up disproportional amount of space in PostgreSQL

I would like to create an auto-incrementing id column that is not a primary key in a PostgreSQL table. The table is currently just over 200M rows and contains 14 columns.
SELECT pg_size_pretty(pg_total_relation_size('mytable'));
The above query reveals that mytable takes up 57 GB on disk. I currently have 30 GB free space remaining on disk after checking with df -h (on Ubuntu 20.04)
What I don't understand is why, after trying to create a SERIAL column, I completely run out of disk space - the query ends up never finishing. I run the following command:
ALTER TABLE mytable ADD COLUMN id SERIAL;
and then see how gradually, my disk space runs out until there is nothing left and the query fails. I am no database expert but it does not make sense. Why would a simple serialized column take up more than half of the space of the table itself, especially when it is not a primary key and therefore has no index? Is there a known workaround to creating such an auto-incrementing id column?
As a proof of concept:
create table id_test(pk_fld integer primary key generated always as identity);
--FYI, in Postgres 14+ the overriding system value won't be needed.
--That is a hack around a bug in 13-
insert into id_test overriding system value values (default), (default);
select * from id_test;
pk_fld
--------
1
2
alter table id_test add column id_fld integer ;
update id_test set id_fld = 0;
alter table id_test alter COLUMN id_fld set not null;
alter table id_test alter COLUMN id_fld add generated always as identity;
update id_test set id_fld = default;
select * from id_test;
pk_fld | id_fld
--------+--------
1 | 1
2 | 2
Basically this breaks the process down into steps. Obviously this is just a toy table and not representative of your setup. I would try it on test table that is a subset of you actual table to see what happens to disk space consumption. It would not hurt to use VACUUM after the updates to return rows to the database.
Adding a serial column is adding an integer column with a non-constant DEFAULT value. This will cause PostgreSQL to rewrite the table, because the new column value has to be added to all existing rows. So PostgreSQL writes a new copy of the table and discards the old one after it is done. This will require more than double the disk space of the original table temporarily, which explains why you run out of disk space.
You can split the operation into several steps:
ALTER TABLE mytable ADD id bigint;
CREATE SEQUENCE mytable_id_seq OWNED BY mytable.id;
ALTER TABLE mytable ALTER id SET DEFAULT nextval('mytable_id_seq');
This will not rewrite the table, and it will leave the existing rows untouched. The value of id for these columns will be NULL.
You probably want to update the existing rows to be NOT NULL, but be careful: if you update them all at once, you will run out of disk space as well, because in PostgreSQL an UPDATE writes a complete new version of the row to the table. You'd have to update the rows in batches and run VACUUM between these updates.
All in all, this is rather annoying and complicated. So do yourself a favor and increase the disk space. That is the simple and best solution.

Unable to drop index on a db2 table

I have a table MAIN_SCHEMA.TEST in which I created a Index on a column CHECK_ID.
CHECK_ID is also a FOREIGN_KEY constraint in TEST table.
This table contains only 50 records.
By Mistake the index got created in Default schema DEFAULT_SCHEMA.CHECK_ID_IDX.
CREATE INDEX DEFAULT_SCHEMA.CHECK_ID_IDX(CHECK_ID ASC);
So I am trying to drop this index but the drop query gets stuck for long time.
DROP INDEX DEFAULT_SCHEMA.CHECK_ID_IDX.
there are no locks on this table when I checked.
Instead of dropping and recreating the index with the right schema, could you just try to RENAME the index? It requires the existing SCHEMA.NAME pair together with the new as input. It will not move any data, but just update the metadata.

duplicate key in a PostgreSQL index

I want to move my OwnCloud database to a new server, but the operation fails during restore.
pg_restore: [archive program (db)] COPY failed for table "oc_storages": ERROR: value of a duplicate key breaks unique constraint "storages_id_index"
DETAIL: The key "(id) = (local :: / var / www / owncloud_data /)" already exists.
Indeed, a simple query on the oc_sorages database shows that there is a duplicate.
ocl=# select * from oc_storages where id ~* 'owncloud_data';
id | numeric_id | available | last_checked
--------------------------------+------------+-----------+--------------
local::/var/www/owncloud_data/ | 491 | 1 |
local::/var/www/owncloud_data/ | 838 | 1 |
(2 rows)
but at the same time, postgresql managed to create an index for this table based on the id (storages_id_index). How is it possible that PostgreSQL accepts this duplicate in this table?
ocl=# SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'oc_storages';
indexname | indexdef
-------------------+-------------------------------------------------------------------------------------
oc_storages_pkey | CREATE UNIQUE INDEX oc_storages_pkey ON public.oc_storages USING btree (numeric_id)
storages_id_index | CREATE UNIQUE INDEX storages_id_index ON public.oc_storages USING btree (id)
(2 rows)
What to do to get out of this impasse: delete one of the two values? which ?
Thanks in advance.
Ernest.
There are usually two explanation for this:
Hardware problems leading to data corruption. Then remove conflicting rows manually, export the database and import it into a newly created cluster to get rid of potential lurking data corruption.
You upgraded the C library on the operating system and the collations changed, corrupting the index. Then remove conflicting rows manually and REINDEX the indexes with string columns.
This is one of those semantic annoyances I have with Postgres, but creating a UNIQUE INDEX on a table does not actually add an enforced table constraint.
You need to explicitly add each constraint USING the created index, e.g.:
CREATE UNIQUE INDEX oc_storages_pkey ON public.oc_storages USING btree (numeric_id);
ALTER TABLE public.oc_storages ADD CONSTRAINT oc_storages_pkey UNIQUE USING INDEX oc_storages_pkey;
If you do have such a table constraint already, then this would be a case of corruption.

How to safely reindex primary key on postgres?

We have a huge table that contains bloat on the primary key index. We constantly archive old records on that table.
We reindex other columns by recreating the index concurrently and dropping the old one. This is to avoid interfering with production traffic.
But this is not possible for a primary key since there are foreign keys depending on it. At least based on what we have tried.
What's the right way to reindex the primary key safely without blocking DML statements on the table?
REINDEX CONCURRENTLY seems to work as well. I tried it on my database and didn't get any error.
REINDEX INDEX CONCURRENTLY <indexname>;
I think it possibly does something similar to what #jlandercy has described in his answer. While the reindex was running I saw an index with suffix _ccnew and the existing one was intact as well. Eventually I guess that index was renamed as the original index after dropping the older one and I eventually see a unique primary index on my table.
I am using postgres v12.7.
You can use pg_repack for this.
pg_repack is a PostgreSQL extension which lets you remove bloat from tables and indexes, and optionally restore the physical order of clustered indexes.
It doesn't hold exclusive locks during the whole process. It still does execute some locks, but this should be for a short period of time only. You can check the details here: https://reorg.github.io/pg_repack/
To perform repack on indexes, you can try:
pg_repack -t table_name --only-indexes
TL;DR
Just reindex it as other index using its index name:
REINDEX INDEX <indexname>;
MCVE
Let's create a table with a Primary Key constraint which is also an Index:
CREATE TABLE test(
Id BIGSERIAL PRIMARY KEY
);
Looking at the catalogue we see the constraint name:
SELECT conname FROM pg_constraint WHERE conname LIKE 'test%';
-- "test_pkey"
Having the name of the index, we can reindex it:
REINDEX INDEX test_pkey;
You can also fix the Constraint Name at the creation:
CREATE TABLE test(
Id BIGSERIAL NOT NULL
);
ALTER TABLE test ADD CONSTRAINT myconstraint PRIMARY KEY(Id);
If you must address concurrence, then use the method a_horse_with_no_name suggested, create a unique index concurrently:
-- Ensure Uniqueness while recreating the Primary Key:
CREATE UNIQUE INDEX CONCURRENTLY tempindex ON test USING btree(Id);
-- Drop PK:
ALTER TABLE test DROP CONSTRAINT myconstraint;
-- Recreate PK:
ALTER TABLE test ADD CONSTRAINT myconstraint PRIMARY KEY(Id);
-- Drop redundant Index:
DROP INDEX tempindex;
To check Index existence:
SELECT * FROM pg_index WHERE indexrelid::regclass = 'tempindex'::regclass

About clustered index in postgres

I'm using psql to access a postgres database. When viewing the metadata of a table, is there any way to see whether an index of a table is a clustered index?
I heard that the PRIMARY KEY of a table is automatically associated with a clustered index, is it true?
Note that PostgreSQL uses the term "clustered index" to use something vaguely similar and yet very different to SQL Server.
If a particular index has been nominated as the clustering index for a table, then psql's \d command will indicate the clustered index, e.g.,
Indexes:
"timezone_description_pkey" PRIMARY KEY, btree (timezone) CLUSTER
PostgreSQL does not nominate indices as clustering indices by default. Nor does it automatically arrange table data to correlate with the clustered index even when so nominated: the CLUSTER command has to be used to reorganise the table data.
In PostgreSQL the clustered attribute is held in the metadata of the corresponding index, rather than the relation itself. It is the indisclustered attribute in pg_index catalogue. Note, however, that clustering relations within postgres is a one-time action: even if the attribute is true, updates to the table do not maintain the sorted nature of the data. To date, automatic maintenance of data clustering remains a popular TODO item.
There is often confusion between clustered and integrated indexes, particularly since the popular textbooks use conflicting names, and the terminology is different again in the manuals of postgres and SQL server (to name just two). When I talk about an integrated index (also called a main index or primary index) I mean one in which the relation data is contained in the leaves of the index, as opposed an external or secondary index in which the leaves contain index entries that point to the table records. The former type is necessarily always clustered. Unfortunately postgres only supports the latter type. Anyhow, the fact that an integrated (primary) index is always clustered may have given rise to the belief that "a PRIMARY KEY of a table is automatically associated with a clustered index". The two statements sound similar, but are different.
PostgreSQL does not have direct implementation of CLUSTER index like Microsoft SQL Server.
Reference Taken from this Blog:
In PostgreSQL, we have one CLUSTER command which is similar to Cluster Index.
Once you create your table primary key or any other Index, you can execute the CLUSTER command by specifying that Index name to achieve the physical order of the Table Data.
When a table is clustered, it is physically reordered based on the index information. Clustering is a one-time operation: when the table is subsequently updated, the changes are not clustered. That is, no attempt is made to store new or updated rows according to their index order.
Syntax of Cluster:
First time you must execute CLUSTER using the Index Name.
CLUSTER table_name USING index_name;
Cluster the table:
Once you have executed CLUSTER with Index, next time you should execute only CLUSTER TABLE because It knows that which index already defined as CLUSTER.
CLUSTER table_name;
is there any way to see whether an index of a table is a clustered index
PostgreSQL does not have a clustered index, so you won't be able to see them.
I heard that the PRIMARY KEY of a table is automatically associated with a clustered index, is it true?
No, that's not true (see above)
You can manually cluster a table along an index, but this is nothing that will be maintained automatically (as e.g. with SQL Server's clustered indexes).
For more details, see the description of the CLUSTER command in the manual.
Cluster Indexing
A cluster index means telling the database to store the close values actually close to one another on the disk. They can uniquely identify the rows in the SQL table. Every table can have exactly one one clustered index. A cluster index can cover more than one column. By default, a column with a primary key already has a clustered index.
dictionary
A dictionary itself is a table with clustered index. Because all the data is physically stored in alphabetical order.
Non-Cluster Indexing
Non-clustered indexing is like simple indexing of a book. They are just used for fast retrieval of data. Not sure to have unique data. A non-clustered index contains the non-clustered index keys and their corresponding data location pointer. For example, a book's content index contains the key of a topic or chapter and the page location of that.
book content index
A book's content table holds the content name and its page location. It is not sure that the data is unique. Because same paragraph or text line or word can be placed many times.
PostgreSQL Indexing
PostgreSQL automatically creates indexes for PRIMARY KEY and every UNIQUE constraints of a table. Login to a database in PostgreSQL terminal and type \d table_name. All stored indexes will be visualized. If there is a clustered index then it will also be identified.
Creating a table
CREATE TABLE IF NOT EXISTS profile(
uid serial NOT NULL UNIQUE PRIMARY KEY,
username varchar(30) NOT NULL UNIQUE,
phone varchar(11) NOT NULL UNIQUE,
age smallint CHECK(age>12),
address text NULL
);
3 index will be created automatically. All these indexes are non clustered
"profile_pkey" PRIMARY KEY, btree (uid)
"profile_phone_key" UNIQUE CONSTRAINT, btree (phone)
"profile_username_key" UNIQUE CONSTRAINT, btree (username)
Create our own index with uid and username
CREATE INDEX profile_index ON profile(uid, username);
This actually creates a non-clustered index. To make it clustered, run the next part.
Transform a non-clustered index into a clustered one
ALTER TABLE profile CLUSTER ON profile_index;
Check the table with \d profile. It will be like this:
Table "public.profile"
Column | Type | Collation | Nullable | Default
----------+-----------------------+-----------+----------+--------------------------------------
uid | integer | | not null | nextval('profile_uid_seq'::regclass)
username | character varying(30) | | not null |
phone | character varying(11) | | not null |
age | smallint | | |
address | text | | |
Indexes:
"profile_pkey" PRIMARY KEY, btree (uid)
"profile_phone_key" UNIQUE CONSTRAINT, btree (phone)
"profile_username_key" UNIQUE CONSTRAINT, btree (username)
"profile_index" btree (uid, username) CLUSTER
Check constraints:
"profile_age_check" CHECK (age > 12)
Notice that the profile_index is now "CLUSTER"
Now, re-cluster the table so that the table can follow the cluster index role
CLUSTER profile;
If you want to know if a given table is CLUSTERed using SQL, you can use the following query to show the index being used (tested in Postgres versions 9.5 and 9.6):
SELECT
i.relname AS index_for_cluster
FROM
pg_index AS idx
JOIN
pg_class AS i
ON
i.oid = idx.indexrelid
WHERE
idx.indisclustered
AND idx.indrelid::regclass = 'your_table_name'::regclass;