OS: macOS 11.4 (Big Sur)
PostgreSQL: 13.4
I would expect the default behavior of sequence numbers (that is, auto-generated sequences used typically for PK generation on record-inserts) to be straightforward on server re-starts: namely, that sequence numbers always "start where they left off". If the last record inserted had an auto-sequenced ID of 5, then the next record-insert should have ID of 6. And so on.
But recently, more than once, I have observed less than desirable default behavior for sequence numbers. Here are two different observations, both presumably resulting from the same suspect behavior after database server re-starts:
Observation 1: Suppose the record in your table with ID 1 was deleted, but records with IDs 2-5 exist. Then on server re-start, the sequence starts at 1 again. The first record insert works (that is, a record with ID 1 is successfully inserted). But the next few inserts result in PK-duplicate exceptions! Once the sequence reaches 6, inserts start working again.
Observation 2: Again, suppose records exist for IDs 2-5. Then after a server re-start, the sequence starts at some much larger number, like 35! In this case, a large swath of IDs between 5 and 35 (exclusive) goes unused (making it look as if records with those IDs had been deleted).
This certainly seems awkward behavior. Is there some way to set up sequence numbers to avoid this behavior?
Sample sequence from my database:
mydb=# \dS+ birthday_id_seq
Sequence "public.birthday_id_seq"
Type | Start | Minimum | Maximum | Increment | Cycles? | Cache
--------+-------+---------+---------------------+-----------+---------+-------
bigint | 1 | 1 | 9223372036854775807 | 1 | no | 1
mydb=# \dS+ birthdays
Table "public.birthdays"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------------+-----------------------------+-----------+----------+--------------------------------------+----------+--------------+-------------
id | bigint | | not null | nextval('birthday_id_seq'::regclass) | plain | |
birthdate | date | | | | plain | |
Indexes:
"birthdays_pkey" PRIMARY KEY, btree (id)
Access method: heap
mydb=# \d+
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+---------------------+----------+-------------+-------------+------------+-------------
public | birthday_id_seq | sequence | kodecharlie | permanent | 8192 bytes |
(1 row)
That is normal behavior:
Any sequence values that were already fetched by nextval, but never used in an INSERT that got committed, will be lost. That could happen if you perform a fast (or an immediate) shutdown while the INSERT was taking place.
Moreover, the first time you run nextval, PostgreSQL logs a WAL entry that consumes the next 32 values, so that it doesn't have to log each individual nextval. These values are lost after a restart.
As for the sequence going backwards after a restart:
Sequences, like all other objects, are WAL logged. WAL is guaranteed to be flushed during commit. Now if you start a transaction, fetch a sequence value and perform an insert, but don't commit the transaction yet, the changes to the sequence may still be in WAL buffers and not flushed to disk.
A crash that interrupts the transaction will cause the sequence to be reset to the last committed value, so you may get the same sequence number again. That is fine, because any sequence values fetched from the sequence since have not been committed either.
Which of the two behaviors you see depends on concurrent transactions: Typically, you will see missing values after a restart. But if you start a transaction, call nextval and crash the database without committing, you may see the same sequence value again after a restart.
You may want to read my article for more details.
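If a restart ever leaves the sequence behind the table, as in the first observation, a quick way to realign it is to set it just past the current maximum ID. This is only a sketch, reusing the table and sequence names from the question:

SELECT setval('birthday_id_seq',
              COALESCE((SELECT max(id) FROM birthdays), 0) + 1,
              false);   -- false: the next nextval() returns exactly this value

After this, the next insert gets max(id) + 1 and the duplicate-key errors stop.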
Problem Statement
In order to ensure disk size isn't growing unnecessarily, I want to be able to delete rows that have been replicated from my outbox table.
Context
Postgres is at v12
We are using a Kafka source connector to stream changes made to a Postgres table. These changes are insert-only and thus are no longer needed once written to Kafka. The source connector uses logical replication to stream the changes to the connector, and the state of the replication can be seen in pg_replication_slots.
When looking at pg_replication_slots you can see useful data that it stores in order to know which WAL it has to keep to ensure replication can still happen for the client.
For example when I run:
select * from pg_replication_slots;
I might see:
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+----------+-----------+--------+--------------------+-----------+--------+------------+------+--------------+-------------+---------------------
debezium | wal2json | logical | 26593 | database_name | f | t | 7404 | | 26729 | 0/DCD98E8 | 0/DCD9920
(1 row)
What I'm interested in knowing is whether I can reliably use that data, together with the PostgreSQL metadata on the table, to select all rows that have already been replicated from that slot.
For example, this doesn't work as far as I can tell, but ideally would return rows that have been replicated and are now safe to prune from the table:
select * from outbox where age(xmin) < (select age(catalog_xmin) from pg_replication_slots);
Any guidance would be sweet! Cheers!
I have been implementing the Outbox pattern using Debezium with MySQL and delete the outbox record straight after inserting it, which I saw done here: https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/ The insert is picked up and sent, and the delete is ignored. So essentially there should never be anything in the outbox table (outside of the transaction).
I also pre-generate the primary keys for the entries (which I use for the event ID in Kafka), so I can bulk insert and delete.
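A minimal sketch of that flow in plain SQL (the table and column names and the literal values here are only illustrative, not taken from the question):

BEGIN;

-- the business change itself
INSERT INTO purchase_orders (id, total)
VALUES (1001, 99.90);

-- the outbox event; the pre-generated key doubles as the Kafka event ID
INSERT INTO outbox (id, aggregate_id, event_type, payload)
VALUES ('0b1f8a9e-4d2c-4c1a-9a57-3f2e6c8d1b42', '1001', 'OrderCreated', '{"total": 99.90}');

-- remove it straight away: the connector emits the insert and ignores the delete
DELETE FROM outbox
WHERE id = '0b1f8a9e-4d2c-4c1a-9a57-3f2e6c8d1b42';

COMMIT;

Because all three statements share one transaction, the event is only published if the business change actually commits.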
Circling back around to this, I had to think a bit differently about how we could tie the replication's progress to our outbox table. Previously in my question I was trying to glean progress from pg_replication_slots, but in this working example I switched to using pg_stat_replication. This view can be queried by the slot_name we care about and can return lag results. For example:
SELECT *
FROM outbox
WHERE created_at < (
    SELECT (NOW() - COALESCE(replay_lag, interval '60 seconds')) AS stale_time
    FROM pg_stat_replication
    WHERE pg_stat_replication.slot_name = 'outbox_slot'
);
So this returns rows from our outbox table that were inserted more than replay_lag ago (or more than one minute ago if no lag is reported).
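To tie that back to pruning, here is a hedged variant written as a DELETE. In case pg_stat_replication does not expose a slot_name column in your environment (on stock PostgreSQL the walsender is identified only by pid), it reaches the slot via pg_replication_slots instead; the slot name and created_at column are the ones used above:

DELETE FROM outbox
WHERE created_at < (
    SELECT now() - COALESCE(s.replay_lag, interval '60 seconds')
    FROM pg_replication_slots r
    JOIN pg_stat_replication s ON s.pid = r.active_pid
    WHERE r.slot_name = 'outbox_slot'
);

If the slot has no active walsender, the subquery returns no row, the comparison is against NULL, and nothing is deleted, which is a safe failure mode.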
We are using PostgreSQL 9.5 (AWS RDS) and have hundreds of tables, but only one table's sequence is behaving weirdly: the current value in the table where the sequence is used is 670, but nextval('seq1') returns 20000, and when I run nextval again it is something like 20090, then 20122, and so on.
So there is no fixed increment (i.e. 1, 2, or 100); the sequence increments by varying amounts on its own. The above is what happens when I call nextval explicitly, but the sequence value also advances without nextval being called. I have checked all the system logs and the code as well, but could not find how the sequence gets updated without any inserts into the table. This sequence is not used by any other table; there is only one table with this sequence.
 Type   | Start | Minimum | Maximum             | Increment | Cycles? | Cache
--------+-------+---------+---------------------+-----------+---------+-------
 bigint | 1     | 1       | 9223372036854775807 | 1         | no      | 1
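One way to see what the sequence itself currently stores (a sketch; on 9.5 the sequence relation can still be selected from directly, and seq1 is the name used above):

SELECT last_value, increment_by, cache_value, log_cnt, is_called
FROM seq1;

Here log_cnt shows how many pre-logged values are still available before a new WAL record must be written, and is_called shows whether last_value has already been handed out.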
I have two huge tables:
Table "public.tx_input1_new" (100,000,000 rows)
Column | Type | Modifiers
----------------|-----------------------------|----------
blk_hash | character varying(500) |
blk_time | timestamp without time zone |
tx_hash | character varying(500) |
input_tx_hash | character varying(100) |
input_tx_index | smallint |
input_addr | character varying(500) |
input_val | numeric |
Indexes:
"tx_input1_new_h" btree (input_tx_hash, input_tx_index)
Table "public.tx_output1_new" (100,000,000 rows)
Column | Type | Modifiers
--------------+------------------------+-----------
tx_hash | character varying(100) |
output_addr | character varying(500) |
output_index | smallint |
input_val | numeric |
Indexes:
"tx_output1_new_h" btree (tx_hash, output_index)
I want to update the first table from the other one:
UPDATE tx_input1 as i
SET
input_addr = o.output_addr,
input_val = o.output_val
FROM tx_output1 as o
WHERE
i.input_tx_hash = o.tx_hash
AND i.input_tx_index = o.output_index;
Before I execute this SQL command, I had already created the indexes for these two tables:
CREATE INDEX tx_input1_new_h ON tx_input1_new (input_tx_hash, input_tx_index);
CREATE INDEX tx_output1_new_h ON tx_output1_new (tx_hash, output_index);
I used the EXPLAIN command to see the query plan, but it didn't use the indexes I created.
It took about 14-15 hours to complete this UPDATE.
What is the problem within it?
How can I shorten the execution time, or tune my database/table?
Thank you.
Since you are joining two large tables and there are no conditions that could filter out rows, the only efficient join strategy will be a hash join, and no index can help with that.
First there will be a sequential scan of one of the tables, from which a hash structure is built, then there will be a sequential scan over the other table, and the hash will be probed for each row found. How could any index help with that?
You can expect such an operation to take a long time, but there are some ways in which you could speed up the operation:
Remove all indexes and constraints on tx_input1 before you begin. Your query is one of the examples where an index does not help at all, but actually hurts performance, because the indexes have to be updated along with the table. Recreate the indexes and constraints after you are done with the UPDATE. Depending on the number of indexes on the table, you can expect a decent to massive performance gain.
Increase the work_mem parameter for this one operation with the SET command as high as you can. The more memory the hash operation can use, the faster it will be. With a table that big you'll probably still end up having temporary files, but you can still expect a decent performance gain.
Increase checkpoint_segments (or max_wal_size from version 9.5 on) to a high value so that there are fewer checkpoints during the UPDATE operation.
Make sure that the table statistics on both tables are accurate, so that PostgreSQL can come up with a good estimate for the number of hash buckets to create.
After the UPDATE, if it affects a big number of rows, you might consider running VACUUM (FULL) on tx_input1 to get rid of the resulting table bloat. This will lock the table for a longer time, so do it during a maintenance window. It will reduce the size of the table and as a consequence speed up sequential scans. A consolidated sketch of these steps follows below.
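Putting those points together, a rough consolidated sketch (the index and table names are the ones from the question; the memory and WAL sizes are placeholders to adjust to your hardware, and ALTER SYSTEM needs superuser rights):

-- 1. drop the index on the target table so it doesn't have to be maintained
DROP INDEX tx_input1_new_h;

-- 2. give the hash join as much memory as possible, for this session only
SET work_mem = '1GB';

-- 3. fewer checkpoints during the bulk write (reload to apply)
ALTER SYSTEM SET max_wal_size = '20GB';
SELECT pg_reload_conf();

-- 4. fresh statistics so the planner sizes the hash well
ANALYZE tx_input1_new;
ANALYZE tx_output1_new;

-- 5. run the UPDATE from the question here

-- 6. recreate the index and reclaim the bloat afterwards
CREATE INDEX tx_input1_new_h ON tx_input1_new (input_tx_hash, input_tx_index);
VACUUM (FULL, ANALYZE) tx_input1_new;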
I have a table with a few million records.
___________________________________________________________
| col1 | col2 | col3 | some_indicator | last_updated_date |
-----------------------------------------------------------
| | | | yes | 2009-06-09.12.2345|
-----------------------------------------------------------
| | | | yes | 2009-07-09.11.6145|
-----------------------------------------------------------
| | | | no | 2009-06-09.12.2345|
-----------------------------------------------------------
I have to delete records which are older than a month with some_indicator=no.
Again, I have to delete records older than a year with some_indicator=yes. This job will run every day.
Can I use the DB2 partitioning feature for the above requirement?
How can I partition the table using the last_updated_date column and the two some_indicator values above?
One partition should contain records falling under the monthly delete criterion, whereas the other should contain the yearly delete criterion records.
Are there any performance issues associated with table partitioning if this table is being frequently read and upserted?
Any other best practices for the above requirement will surely help.
I haven't done much with partitioning (I've mostly worked with DB2 on the iSeries), but from what I understand, you don't generally want to be shuffling things between partitions (i.e., making the partition boundary '1 month ago'). I'm not even sure it's possible. If it were, you'd have to scan some (potentially large) portion of your table every day just to move it (select, insert, delete, in a transaction).
Besides which, partitioning is a DB Admin problem, and it sounds like you just have a DB User problem - namely, deleting 'old' records. I'd just do this in a couple of statements:
DELETE FROM myTable
WHERE some_indicator = 'no'
AND last_updated_date < TIMESTAMP(CURRENT_DATE - 1 MONTH, TIME('00:00:00'))
and
DELETE FROM myTable
WHERE some_indicator = 'yes'
AND last_updated_date < TIMESTAMP(CURRENT_DATE - 1 YEAR, TIME('00:00:00'))
.... and you can pretty much ignore using a transaction, as you want the rows gone.
(as a side note, using 'yes' and 'no' for indicators is terrible. If you're not on a version that has a logical (boolean) type, store character '0' (false) and '1' (true))
Is there a simple (i.e. non-hacky) and race-condition-free way to create a partitioned sequence in PostgreSQL? Example:
Using a normal sequence in Issue:
| Project_ID | Issue |
| 1 | 1 |
| 1 | 2 |
| 2 | 3 |
| 2 | 4 |
Using a partitioned sequence in Issue:
| Project_ID | Issue |
| 1 | 1 |
| 1 | 2 |
| 2 | 1 |
| 2 | 2 |
I do not believe there is a simple way that is as easy as regular sequences, because:
A sequence stores only one number stream (next value, etc.). You want one for each partition.
Sequences have special handling that bypasses the current transaction (to avoid the race condition). It is hard to replicate this at the SQL or PL/pgSQL level without using tricks like dblink.
The DEFAULT column property can use a simple expression or a function call like nextval('myseq'); but it cannot refer to other columns to inform the function which stream the value should come from.
You can make something that works, but you probably won't think it simple. Addressing the above problems in turn:
Use a table to store the next value for all partitions, with a schema like multiseq (partition_id, next_val).
Write a multinextval(seq_table, partition_id) function that does something like the following:
Create a new transaction independent of the current transaction (one way of doing this is through dblink; I believe some other server languages can do it more easily).
Lock the table mentioned in seq_table.
Update the row where the partition id is partition_id, with an incremented value. (Or insert a new row with value 2 if there is no existing one.)
Commit that transaction and return the previous stored id (or 1).
Create an insert trigger on your projects table that uses a call to multinextval('projects_table', NEW.Project_ID) for insertions.
I have not used this entire plan myself, but I have tried something similar to each step individually. Examples of the multinextval function and the trigger can be provided if you want to attempt this...
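For illustration, here is a minimal sketch of steps 1 and 3 that skips the dblink part, which means the counter update stays inside the inserting transaction: rolled-back inserts leave no gaps, but concurrent inserts for the same project block each other until commit, unlike a real sequence. The issue table and its project_id/issue_number columns are assumptions for the example, and ON CONFLICT needs PostgreSQL 9.5 or later:

CREATE TABLE multiseq (
    partition_id bigint PRIMARY KEY,
    next_val     bigint NOT NULL
);

-- hands out the next value for one partition, creating the counter row on first use
CREATE FUNCTION multinextval(p_partition_id bigint) RETURNS bigint
LANGUAGE sql AS $$
    INSERT INTO multiseq (partition_id, next_val)
    VALUES (p_partition_id, 1)
    ON CONFLICT (partition_id)
        DO UPDATE SET next_val = multiseq.next_val + 1
    RETURNING next_val;
$$;

-- fills in the per-project issue number before each insert
CREATE FUNCTION set_issue_number() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    NEW.issue_number := multinextval(NEW.project_id);
    RETURN NEW;
END;
$$;

CREATE TRIGGER issue_number_trigger
    BEFORE INSERT ON issue
    FOR EACH ROW EXECUTE PROCEDURE set_issue_number();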