Range check error when creating GIST index on tsrange value - postgresql

Fixing range bound data and creating gist tsrange index causes exception. I can guess the PostgreSQL sees old version of records and takes them into account when creating the gist index.
You can reproduce it using this script:
BEGIN;
CREATE TABLE _test_gist_range_index("from" timestamp(0) WITHOUT time zone, "till" timestamp(0) WITHOUT time zone);
--let's enter some invalid data
INSERT INTO _test_gist_range_index VALUES ('2021-01-02', '2021-01-01');
--let's fix the data
DELETE FROM _test_gist_range_index;
CREATE INDEX idx_range_idx_test2 ON _test_gist_range_index USING gist (tsrange("from", "till", '[]'));
COMMIT;
The result is:
SQL Error [22000]: ERROR: range lower bound must be less than or equal to range upper bound
db<>fiddle
I've tested this on all versions of PostgreSQL starting from v9.5 and ending with v13 using db<>fiddle. The result is the same on all of them.
The same error is received if we fix the data using "UPDATE".
Is there a way to fix the data and have range index on it? Maybe there is a way to clean the table somehow from those old values?..
EDIT
It seems that the exception is raised only if data correcting statements (DELETE in my example) and CREATE INDEX statement are in the same transaction. If I DELETE and COMMIT first, and then creating the index succeeds.

That is working as expected and not a bug.
When you delete a row in a PostgreSQL table, it is not actually deleted, but marked as invisible. Similarly, updating a row creates a new version of the row, but the old version is retained and marked invisible. This is the way how PostgreSQL implements multiversioning: concurrent transactions can still see the "invisible" data. Eventually, invisible rows are reclaimed by VACUUM.
Now a B-tree or GiST index contains one entry for each row version in the table, unless the row version is not visible by anybody (is dead). This explains why a deleted row will still cause an error if the data don't form a valid range.
If you run the statements in autocommit mode on an otherwise idle database, the deleted rows are dead, and no index entry has to be created.

Related

PostgreSQL 13: error number of items mismatch in GIN entry tuple, 2 in tuple header, 19 decoded

I stuck with a problem with PostgreSQL at GCP that at some moment PostgreSQL starts to raise an error
number of items mismatch in GIN entry tuple, 2 in tuple header, 19 decoded
It happens after insert in the table with GIN index on JSONB field
After the first occurrence of the error, it will occur regularly, but not on every insert operation
I tried to reinsert to test DB the same rows, several rows before and after the error appeared, everything was inserted successfully
Also, I see in the Postgres logs before the error starts to appear: automatic analyze of table <table-name>, maybe it can be helpful
That is some kind of data corruption. If you want that investigated, you'll have to ask your hosting provider.
To get rid of the message, rebuild the index using REINDEX.

How to add a column to a table on production PostgreSQL with zero downtime?

Here
https://stackoverflow.com/a/53016193/10894456
is an answer provided for Oracle 11g,
My question is the same:
What is the best approach to add a not null column with default value
in production oracle database when that table contain one million
records and it is live. Does it create any locks if we do the column
creation , adding default value and making it as not null in a single
statement?
but for PostgreSQL ?
This prior answer essentially answers your query.
Cross referencing the relevant PostgreSQL doc with the PostgreSQL sourcecode for AlterTableGetLockLevel mentioned in the above answer shows that ALTER TABLE ... ADD COLUMN will always obtain an ACCESS EXCLUSIVE table lock, precluding any other transaction from accessing the table for the duration of the ADD COLUMN operation.
This same exclusive lock is obtained for any ADD COLUMN variation; ie. it doesn't matter whether you add a NULL column (with or without DEFAULT) or have a NOT NULL with a default.
However, as mentioned in the linked answer above, adding a NULL column with no DEFAULT should be very quick as this operation simply updates the catalog.
In contrast, adding a column with a DEFAULT specifier necessitates a rewrite the entire table in PostgreSQL 10 or less.
This operation is likely to take a considerable time on your 1M record table.
According to the linked answer, PostgreSQL >= 11 does not require such a rewrite for adding such a column, so should perform more similarly to the no-DEFAULT case.
I should add that for PostgreSQL 11 and above, the ALTER TABLE docs note that table rewrites are only avoided for non-volatile DEFAULT specifiers:
When a column is added with ADD COLUMN and a non-volatile DEFAULT is specified, the default is evaluated at the time of the statement and the result stored in the table's metadata. That value will be used for the column for all existing rows. If no DEFAULT is specified, NULL is used. In neither case is a rewrite of the table required.
Adding a column with a volatile DEFAULT [...] will require the entire table and its indexes to be rewritten. [...] Table and/or index rebuilds may take a significant amount of time for a large table; and will temporarily require as much as double the disk space.

Migrating `int` to `bigint` in PostgresSQL without any downtime?

I have a database that is going to experience the integer exhaustion problem that Basecamp famously faced back in November. I have several months to figure out what to do.
Is there a no-downtime-required, proactive solution to migrating this column type? If so what is it? If not, is it just a matter of eating the downtime and migrating the column when I can?
Is this article sufficient, assuming I have several days/weeks to perform the migration now before I'm forced to do it when I run out of ids?
Use logical replication.
With logical replication you can have different data types at primary and standby.
Copy the schema with pg_dump -s, change the data types on the copy and then start logical replication.
Once all data is copied over, switch the application to use the standby.
For zero down time, the application has to be able to reconnect and retry, but that's always a requirement in such a case.
You need PostgreSQL v10 or better for that, and your database
shouldn't modify the schema, as DDL is not replicated.
should not use sequence (SERIAL or IDENTITY), as the last used value would not be replicated
Another solution for pre-v10 databases where all transactions are short:
Add a bigint column to the table.
Create a BEFORE trigger that sets the new column whenever a row is added or updated.
Run a series of updates that set the new column from the old one where it IS NULL. Keep those batches short so you don't lock long and don't deadlock much. Make sure these transaction run with session_replication_role = replica so they don't trigger triggers.
Once all rows are updated, create a unique index CONCURRENTLY on the new column.
Add a unique constraint USING the index you just created. That will be fast.
Perform the switch:
BEGIN;
ALTER TABLE ... DROP oldcol;
ALTER TABLE ... ALTER newcol RENAME TO oldcol;
COMMIT;
That will be fast.
Your new column has no NOT NULL set. This cannot be done without a long invasive lock. But you can add a check constraint IS NOT NULL and create it NOT VALID. That is good enough, and you can later validate it without disruptions.
If there are foreign key constraints, things get a little more complicated. You have to drop these and create NOT VALID foreign keys to the new column.
Create a copy of the old table but with modified ID field. Next create a trigger on the old table that inserts new data to both tables. Finally copy data from the old table to the new one (it would be a good idea to distinguish pre-trigger data with post-trigger for example by id if it is sequential). Once you are done switch tables and delete the old one.
This obviously requires twice as much space (and time for copy) but will work without any downtime.

Postgres parallel/efficient load huge amount of data psycopg

I want to load many rows from a CSV file.
The file​s​ contain​ data like these​ "article​_name​,​article_time,​start_time,​end_time"
There is a contraint on the table: for the same article name, i don't insert a new row if the new ​article_time falls in an existing range​ [start_time,​end_time]​ for the same article.
ie: don't insert row y if exists [​start_time_x,​end_time_x] for which time_article_y inside range [​start_time_x,​end_time_x] , with article_​name_​y = article_​name_​x
I tried ​with psycopg by selecting the existing article names ad checking manually if there is an overlap --> too long
I tried again with psycopg, this time by setting a condition 'exclude using...' and tryig to insert with specifying "on conflict do nothing" (so that it does not fail) but still too long
I tried the same thing but this time trying to insert many values at each call of execute (psycopg): it got a little better (1M rows processed in almost 10minutes)​, but still not as fast as it needs to be for the amount of data ​I have (500M+)
I tried to parallelize by calling the same script many time, on different files but the timing didn't get any better, I guess because of the locks on the table each time we want to write something
Is there any way to create a lock only on rows containing the same article_name? (and not a lock on the whole table?)
Could you please help with any idea to make this parallellizable and/or more time efficient?
​Lots of thanks folks​
Your idea with the exclusion constraint and INSERT ... ON CONFLICT is good.
You could improve the speed as follows:
Do it all in a single transaction.
Like Vao Tsun suggested, maybe COPY the data into a staging table first and do it all with a single SQL statement.
Remove all indexes except the exclusion constraint from the table where you modify data and re-create them when you are done.
Speed up insertion by disabling autovacuum and raising max_wal_size (or checkpoint_segments on older PostgreSQL versions) while you load the data.

ADD COLUMN with DEFAULT value to a huge table

I have a postgresql DB and a table with almost billion of rows.
when I try to add a new column with default value:
ALTER TABLE big_table
ADD COLUMN some_flag integer NOT NULL DEFAULT 0;
The transaction goes on for 30+ min .. and the DB logs starts to shoots warnings.
Any way to optimize the query ?
Besides doing it in batches (which will still take a while):
You could dump the table as COPY statements and write a script to edit the contents of the COPY statements to insert another column (COPY can be CSV IIRC).
Then you just reload your altered COPY dump and it should in theory be faster than the ALTER because COPY will not log transactions.
The other option is to turn off fsync while you run the command... just remember to turn it back on.
You can also do both of the above in batches.
Starting from PostgreSQL 11 this behaviour will change.
Waiting for PostgreSQL 11 – Fast ALTER TABLE ADD COLUMN with a non-NULL default:
So, for the longest time, when you did:
alter table x add column z text;
it was virtually instantaneous. Get a lock on table, add information about new column to system catalogs, and it's done.
But when you tried:
alter table x add column z text default 'some value';
then it took long time. How long it did depend on size of table.
This was because postgresql was actually rewriting the whole table, adding the column to each row, and filling it with default value.
"What happens if you want to set the column to NOT NULL also? Are we back to the slow version in that case or does this handle that as well?"
not null doesn’t change anything. it is a constraint for new rows. so adding a column with “not null default ‘xxx'” will be fast.
I'd consider creating the column without the default and manually updating the rows in batches with intermittent commits to apply the default.