PostgreSQL: Prune replicated data from outbox table

Problem Statement
To keep disk usage from growing unnecessarily, I want to be able to delete rows from my outbox table once they have been replicated.
Context
Postgres is at v12
We are using a Kafka source connector to stream changes made to a Postgres table. These changes are insert-only and thus are no longer needed once written to Kafka. The connector uses logical replication to stream the changes, and the state of that replication can be inspected in pg_replication_slots.
Looking at pg_replication_slots, you can see the data Postgres keeps in order to know which WAL it must retain so that replication can still continue for the client.
For example when I run:
select * from pg_replication_slots;
I might see:
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+----------+-----------+--------+--------------------+-----------+--------+------------+------+--------------+-------------+---------------------
debezium | wal2json | logical | 26593 | database_name | f | t | 7404 | | 26729 | 0/DCD98E8 | 0/DCD9920
(1 row)
What I'd like to know is whether I can reliably combine that data with the PostgreSQL metadata on the table to select all rows that have already been replicated through that slot.
For example, this doesn't work as far as I can tell, but ideally would return rows that have been replicated and are now safe to prune from the table:
select * from outbox where age(xmin) < (select age(catalog_xmin) from pg_replication_slots);
Any guidance would be sweet! Cheers!

I have been implementing the outbox pattern using Debezium with MySQL, and I delete the outbox record straight after inserting it, as shown here: https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/ The insert is picked up and sent, and the delete is ignored, so essentially there should never be anything in the outbox table (outside of the transaction).
I also pre-generate the primary keys for the entries (which I use for the event ID in Kafka) so I can bulk insert and delete.
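A minimal sketch of that flow (table names, columns, and IDs here are illustrative, not taken from the question): the inserts and the immediate delete happen in one transaction, so the connector still sees the INSERT in the log while the table stays empty.
BEGIN;
-- the application's normal state change ...
INSERT INTO orders (id, status) VALUES (42, 'CREATED');
-- ... plus the outbox event, written in the same transaction with a pre-generated ID
INSERT INTO outbox (id, aggregate_id, event_type, payload)
VALUES ('e1000000-0000-0000-0000-000000000001', '42', 'OrderCreated', '{"orderId": 42}');
-- delete it again immediately: the INSERT still reaches the transaction log
-- and is streamed by the connector, while the DELETE is ignored
DELETE FROM outbox WHERE id = 'e1000000-0000-0000-0000-000000000001';
COMMIT;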

Circling back around to this, I had to think a bit differently about how we could tie the replication's progress to our outbox table. Previously, in my question, I was trying to glean progress from pg_replication_slots, but in this working example I switched to using pg_stat_replication. This view can be queried for the slot we care about and returns lag figures. For example:
SELECT *
FROM outbox
WHERE created_at < (
    SELECT (NOW() - COALESCE(replay_lag, interval '60 seconds')) AS stale_time
    FROM pg_stat_replication
    WHERE pg_stat_replication.slot_name = 'outbox_slot'
);
This returns the rows from our outbox table that were inserted longer ago than the current replay_lag (falling back to 60 seconds when no lag is reported), and which should therefore already have been replicated.
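The actual prune can then be a periodic DELETE built on the same idea; here is a sketch that joins pg_replication_slots to pg_stat_replication via the walsender pid (the slot name, table, and created_at column follow the example above):
DELETE FROM outbox
WHERE created_at < (
    -- anything older than NOW() minus the current replay lag should already
    -- have been replayed; fall back to 60 seconds if no lag is reported
    SELECT NOW() - COALESCE(s.replay_lag, interval '60 seconds')
    FROM pg_stat_replication s
    JOIN pg_replication_slots r ON r.active_pid = s.pid
    WHERE r.slot_name = 'outbox_slot'
);
If the connector is disconnected, the subquery returns no row, the comparison is against NULL, and nothing gets deleted, which errs on the safe side.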

Related

Cloud Data Fusion ETL from PostGres to BigQuery - idempotent load

I'm trying to use Google's Cloud Data Fusion (CDF) to perform an ETL of some OLTP data in Postgres into BigQuery (BQ). We will copy the contents of a few select tables into an equivalent table in BQ every night, adding one column with the datestamp.
So imagine we have a table with two columns A & B, and one row of data like this in Postgres:
|-----|------|
| A   | B    |
|-----|------|
| egg | milk |
|-----|------|
Then over two days, the BigQuery table would look like this
|----------|-----|------|
| ds       | A   | B    |
|----------|-----|------|
| 22-01-01 | egg | milk |
|----------|-----|------|
| 22-01-02 | egg | milk |
|----------|-----|------|
However, I'm worried that the way I am doing this in CDF is not idempotent: if the pipeline runs twice, I'll have duplicate data for a given day in BQ (not desired).
One idea is to delete rows for that day in BQ before doing the ETL (as part of the same pipeline). However, not sure how to do this, or if it is best practice. Any ideas?
You could delete the data in a BigQuery action at the start of the pipeline, though that runs into other issues if people are actively querying the table, or if the delete action succeeds but the rest of the pipeline fails.
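If you do go the delete-first route, the cleanup statement itself is simple; a sketch, with an illustrative project/dataset/table, assuming ds is stored as a DATE (or date-formatted string) and the hard-coded value stands in for the run's load date:
-- remove any rows already loaded for this date before re-loading it
DELETE FROM `my_project.my_dataset.my_table`
WHERE ds = '2022-01-02';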
The BigQuery sink allows you to configure it to upsert data instead of inserting. This should make it idempotent as long as your data has a key that can be used.
Some other possibilities are to place a BigQuery execute after the sink that runs a BigQuery MERGE, or to write a custom Condition plugin that queries BigQuery and only runs the rest of the pipeline if data for the date does not already exist.
You can use one of these 2 options, depending on what you want to do with the information:
Option 1
You can create a blank new_table with the same schema (ds, A, B). Data Fusion keeps inserting into the old_table. With the MERGE statement you then compare old_table against new_table: rows that do not exist in new_table are inserted, and rows that already exist but differ are updated.
MERGE merge_example.new_table T
USING dataset.old_table S
ON T.ds = S.ds
WHEN MATCHED THEN
  UPDATE SET T.A = S.A, T.B = S.B
WHEN NOT MATCHED THEN
  INSERT (ds, A, B) VALUES (S.ds, S.A, S.B)
Option 2
It is the same process as Option 1, but this query only inserts the data that does not already exist in the new_table.
insert into `dataset.new_table`
select ds, A, B from `dataset.old_table`
where ds not in (select ds from `dataset.new_table`)
The difference is that Option 1 updates existing rows whose values differ in the new_table and inserts the new data, while Option 2 only inserts the new data.
You can execute these queries with a scheduled query once a day; see the BigQuery scheduled queries documentation for details.

PostgreSQL: on database restart, why is starting sequence number unpredictable?

OS: macOS 11.4 (Big Sur)
PostgreSQL: 13.4
I would expect the default behavior of sequence numbers (that is, auto-generated sequences used typically for PK generation on record-inserts) to be straightforward on server re-starts: namely, that sequence numbers always "start where they left off". If the last record inserted had an auto-sequenced ID of 5, then the next record-insert should have ID of 6. And so on.
But recently, more than once, I have observed less than desirable default behavior for sequence numbers. Here are two different observations, both presumably resulting from the same suspect behavior after database server re-starts:
Let's suppose the record in your table with ID 1 was deleted, but records with IDs 2-5 exist. Then on server restart, the sequence started again at 1. The first insert works (a record with ID 1 is successfully inserted), but the next few inserts raise PK-duplicate exceptions! Once the sequence reaches 6, inserts start working again.
Again, let's suppose records exist for IDs 2-5. Then after a server restart, the sequence starts at some larger number, like 35! In this case, a large swath of IDs between 5 and 35 (exclusive) is unused, making it look as if records with those IDs had been deleted.
This certainly seems awkward behavior. Is there some way to set up sequence numbers to avoid this behavior?
Sample sequence number from my database:
mydb=# \dS+ birthday_id_seq
Sequence "public.birthday_id_seq"
Type | Start | Minimum | Maximum | Increment | Cycles? | Cache
--------+-------+---------+---------------------+-----------+---------+-------
bigint | 1 | 1 | 9223372036854775807 | 1 | no | 1
mydb=# \dS+ birthdays
Table "public.birthdays"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------------+-----------------------------+-----------+----------+--------------------------------------+----------+--------------+-------------
id | bigint | | not null | nextval('birthday_id_seq'::regclass) | plain | |
birthdate | date | | | | plain | |
Indexes:
"birthdays_pkey" PRIMARY KEY, btree (id)
Access method: heap
mydb=# \d+
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+---------------------+----------+-------------+-------------+------------+-------------
public | birthday_id_seq | sequence | kodecharlie | permanent | 8192 bytes |
(1 rows)
That is normal behavior:
Any sequence values that were already fetched by nextval, but never used in an INSERT that got committed, will be lost. That could happen if you perform a fast (or an immediate) shutdown while the INSERT was taking place.
Moreover, the first time you run nextval, PostgreSQL logs a WAL entry that consumes the next 32 values, so that it doesn't have to log each individual nextval. These values are lost after a restart.
As for the sequence going backwards after a restart:
Sequences, like all other objects, are WAL logged. WAL is guaranteed to be flushed during commit. Now if you start a transaction, fetch a sequence value and perform an insert, but don't commit the transaction yet, the changes to the sequence may still be in WAL buffers and not flushed to disk.
A crash that interrupts the transaction will cause the sequence to be reset to the last committed value, so you may get the same sequence number again. That is fine, because any sequence values fetched from the sequence since have not been committed either.
Which of the two behaviors you see depends on concurrent transactions: Typically, you will see missing values after a restart. But if you start a transaction, call nextval and crash the database without committing, you may see the same sequence value again after a restart.
You may want to read my article for more details.
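If the duplicate-key errors from the first scenario do become a problem after a crash, one possible recovery step (a sketch, using the table and sequence names from the question) is to realign the sequence with the highest id already present:
-- move the sequence past the largest existing id;
-- the next nextval() will then be one higher
SELECT setval('birthday_id_seq', COALESCE(MAX(id), 1)) FROM birthdays;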

Is this a PostgreSQL bug? One row cannot be found with an equality query but can with LIKE

I have a table where a single row cannot be found with an equality query, but can be found with LIKE (without any %).
PostgreSQL server version: 90513 (i.e. 9.5.13).
# select id,external_id,username,external_id from users where username = 'oFIC94vdidrrKHpi5lc1_2Ibv-OA';
id | external_id | username | external_id
----+-------------+----------+-------------
(0 rows)
# select id,external_id,username,external_id from users where username like 'oFIC94vdidrrKHpi5lc1_2Ibv-OA';
id | external_id | username | external_id
--------------------------------------+------------------------------+------------------------------+------------------------------
61ebea19-74f5-4713-9a30-63eb5af8ac8f | oFIC94vdidrrKHpi5lc1_2Ibv-OA | oFIC94vdidrrKHpi5lc1_2Ibv-OA | oFIC94vdidrrKHpi5lc1_2Ibv-OA
(1 row)
If I dump this table and restore it, the problem is fixed. But why?
Is this a PostgreSQL bug? How can I work around it? I've run into it twice.
Do you have an index on this table? If yes, this looks like a corrupted index: PostgreSQL uses the index in the first (equality) case, and a corrupt index can return no result.
This is usually a bug, either in software or in hardware (data loss on power loss, or memory issues). Try dropping and recreating the index, or rebuilding it with REINDEX: https://www.postgresql.org/docs/9.3/sql-reindex.html
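A sketch of the rebuild on that version (the index name is illustrative, assuming a unique index on username):
-- rebuild one suspect index
REINDEX INDEX users_username_key;
-- or rebuild every index on the table
REINDEX TABLE users;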

How to properly index strings for lookup and excepts, the PostgreSQL way

Due to infrastructure costs, I've been studying the possibility of migrating a few databases to PostgreSQL. So far I am loving it, but there are a few topics on which I am quite lost. I need some guidance on one of them.
I have an ETL process that queries "deltas" in my database and imports the new data. To do so, I use lookup tables that store hashbytes of some strings to facilitate the lookup. This works in SQL Server, but apparently things work quite differently in PostgreSQL. In SQL Server, using hashbytes + except is suggested when working with millions of rows.
Let's suppose the following table
+----+-------+------------------------------------------+
| Id | Name | hash_Name |
+----+-------+------------------------------------------+
| 1 | Mark | 31e9697d43a1a66f2e45db652019fb9a6216df22 |
| 2 | Pablo | ce7169ba6c7dea1ca07fdbff5bd508d4bb3e5832 |
| 3 | Mark | 31e9697d43a1a66f2e45db652019fb9a6216df22 |
+----+-------+------------------------------------------+
And my lookup table
+------------------------------------------+
| hash_Name |
+------------------------------------------+
| 31e9697d43a1a66f2e45db652019fb9a6216df22 |
+------------------------------------------+
When querying for new data (Pablo's hash), I can start from the simplified query below:
SELECT hash_name
FROM mytable
EXCEPT
SELECT hash_name
FROM mylookup
Thinking the PostgreSQL way, how could I achieve this? Should I index and use EXCEPT? Or is there a better way of doing so?
From my research, I couldn't find much about storing hash bytes specifically. Apparently it comes down to creating indexes and choosing the right index type for the job: a plain B-tree index for single-column equality lookups like this one, and GIN for multi-valued data (arrays, jsonb, full-text search).
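As a sketch of what that can look like in PostgreSQL (table and column names follow the example above, the index name is illustrative): a plain B-tree index on the hash column supports both EXCEPT and an equivalent anti-join.
-- a B-tree index on the lookup column keeps the anti-join cheap
CREATE INDEX IF NOT EXISTS mylookup_hash_name_idx ON mylookup (hash_name);
-- EXCEPT works just as in SQL Server; NOT EXISTS is an equivalent
-- formulation that PostgreSQL typically plans as a hash anti-join
SELECT t.hash_name
FROM mytable t
WHERE NOT EXISTS (
    SELECT 1
    FROM mylookup l
    WHERE l.hash_name = t.hash_name
);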

DB2 table partitioning and delete old records based on condition

I have a table with few million records.
|------|------|------|----------------|---------------------|
| col1 | col2 | col3 | some_indicator | last_updated_date   |
|------|------|------|----------------|---------------------|
|      |      |      | yes            | 2009-06-09.12.2345  |
|      |      |      | yes            | 2009-07-09.11.6145  |
|      |      |      | no             | 2009-06-09.12.2345  |
|------|------|------|----------------|---------------------|
I have to delete records older than a month where some_indicator = 'no'.
I also have to delete records older than a year where some_indicator = 'yes'. This job will run every day.
Can I use the DB2 table partitioning feature for the above requirement?
How can I partition the table using the last_updated_date column and the two some_indicator values?
One partition should contain the records falling under the monthly delete criterion, and the other the records falling under the yearly one.
Are there any performance issues associated with table partitioning if this table is frequently read and upserted?
Any other best practices for the above requirement would also help.
I haven't done much with partitioning (I've mostly worked with DB2 on the iSeries), but from what I understand you don't generally want to be shuffling things between partitions (i.e. making a partition mean '1 month ago'). I'm not even sure it's possible. If it were, you'd have to scan some (potentially large) portion of your table every day just to move rows around (select, insert, delete, in a transaction).
Besides which, partitioning is a DB Admin problem, and it sounds like you just have a DB User problem - namely, deleting 'old' records. I'd just do this in a couple of statements:
DELETE FROM myTable
WHERE some_indicator = 'no'
AND last_updated_date < TIMESTAMP(CURRENT_DATE - 1 MONTH, TIME('00:00:00'))
and
DELETE FROM myTable
WHERE some_indicator = 'yes'
AND last_updated_date < TIMESTAMP(CURRENT_DATE - 1 YEAR, TIME('00:00:00'))
.... and you can pretty much ignore using a transaction, as you want the rows gone.
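On the performance side, since this runs every day against a frequently read table, an index matching both delete predicates keeps the job from scanning everything; a sketch (the index name is illustrative):
-- lets each daily DELETE find its rows without a full table scan
CREATE INDEX mytable_prune_idx ON myTable (some_indicator, last_updated_date);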
(as a side note, using 'yes' and 'no' for indicators is terrible. If you're not on a version that has a logical (boolean) type, store character '0' (false) and '1' (true))