"Missing" column in Postgres logical replication update message - postgresql

We have a replication slot set up on a database in Postgres 12.8. For one of the tables, when a row is created, we see a series of records come through - an insert followed by a handful of updates. All of these messages in the WAL files show all columns, except for the last update message, which is missing one of the columns. (The column in question is a JSONB field, and the data is not overly large - less than 1000 chars.) This is causing the data to come into our data lake with that column as NULL.
My understanding is that all insert and update messages written to the replication logs are supposed to include all columns. But I can't find a definitive answer on whether it is expected or acceptable behavior for an update message to exclude any columns / values.
Is this normal? If not, any thoughts on what may cause that?
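For reference, the raw output of the slot can be inspected directly to confirm which columns the update message actually carries. A minimal sketch, assuming a slot created with the test_decoding plugin and hypothetical slot/table names (a pgoutput or wal2json slot would need its own consumer instead):

    -- Peek at pending changes without consuming them (test_decoding output).
    SELECT lsn, xid, data
    FROM pg_logical_slot_peek_changes('my_slot', NULL, NULL)
    WHERE data LIKE '%my_table%'   -- hypothetical table name
    ORDER BY lsn;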

Related

Postgresql logical replication, duplicate key errors on subscriber

I am still quite new to postgres logical replication and am having trouble when replicating a large data set.
In our development setup there is one publisher with 5 subscribers, all tables in one schema are being replicated. All servers are running pg 13.1, the subscribers are basically all clones of the same system.
Once a month or so we have to clear down most of the tables being replicated and re-populate them from a legacy system, a process that starts with deleting a chunk of data from the table (as defined by part of the key) and then copying that chunk of data across for each table. The size of the data is around 90GB all told.
Every time we do this, one or more of the subscribers will get stuck (not always the same ones). The logs on the publisher show "could not send data to client: Connection reset by peer" for the stuck subscriber(s).
The logs on the subscriber(s) show duplicate key errors ("ERROR: duplicate key value violates unique constraint"), but judging from the key reported, it is a different row on each server (though often the same table).
Deleting the offending row on the subscriber simply makes it then fall over at the next row (so it's obviously more than just one row).
This makes no sense to me, nothing else is writing to the tables on these subscribers and I can't really picture a situation where replication would be trying to write the same data twice.
So far the only solution I have is to drop the bad subscriber(s) and restart replication on them.
Does anyone have any advice or ideas as to why this happens or how to fix it?
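For what it's worth, the drop-and-restart workaround mentioned above usually looks something like the following on an affected subscriber (subscription, publication, connection, and table names here are made up):

    -- Remove the broken subscription (this also drops its replication slot
    -- on the publisher).
    DROP SUBSCRIPTION my_sub;

    -- Clear out the partially replicated data so the initial copy cannot
    -- collide with existing rows.
    TRUNCATE TABLE my_schema.my_table;   -- repeat for each replicated table

    -- Re-create the subscription; copy_data = true re-runs the initial sync.
    CREATE SUBSCRIPTION my_sub
        CONNECTION 'host=publisher dbname=mydb user=repl'
        PUBLICATION my_pub
        WITH (copy_data = true);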

Can I configure a table such that inserted rows always have a greater primary key

I would like to configure a table in Postgres to behave like an append only log. This table will have an automatically generated primary ID.
Workers will work on the items in the table in order and should only need to store the last row ID that they have completed.
How can I prevent rows being written to the table where the row ID is less than the greatest value already in the table (perhaps because some transactions take longer than others)?
There is no way to prevent concurrent inserts in the table (short of locking the table, which is a bad idea, because it breaks autovacuum).
So there is no way to guarantee that rows are inserted in a certain order. The order in which rows are inserted isn't really a meaningful concept in PostgreSQL.
If you really want that, you have to use a different mechanism to serialize inserts, for example using PostgreSQL advisory locks or synchronization mechanisms on the client side.
Except the numbers assigned are session-specific, so a session that starts earlier but lasts longer can write to the table with an ID that is less than one from a session that started later but finished sooner. So either you create your own sequence-number generation that involves locking, or you use an INSERT timestamp.
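As a rough illustration of the advisory-lock approach mentioned above (the table name and lock key are made up), serializing the whole insert transaction makes the generated IDs become visible in increasing order, at the cost of allowing only one writer at a time:

    BEGIN;
    -- All writers take the same transaction-level advisory lock, so only one
    -- insert transaction runs at a time; the lock is released at COMMIT.
    SELECT pg_advisory_xact_lock(42);
    INSERT INTO work_queue (payload) VALUES ('...');
    COMMIT;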

Postgres table partitioning based on table name

I have a table that stores information about weather for specific events and for specific timestamps. I do insert, update and select (more often than delete) on this table. All of my queries query on timestamp and event_id. Since this table is blowing up, I was considering doing table partitioning in postgres.
I could also think of having multiple tables and naming them "table_<event_id>_<timestamp>" to store specific timestamp information, instead of using Postgres declarative/inheritance partitioning. But I noticed that no one on the internet has done or written about any approach like this. Is there something I am missing?
I see that with Postgres partitioning, the data is kept both in the master table and in the child tables. Why keep it in both places? It seems less efficient for inserts and updates to me.
Is there a generic limit on the number of tables at which Postgres will start to choke?
Thank you!
re 1) Don't do it. Why re-invent the wheel if the Postgres devs have already done it for you by providing declarative partitioning?
re 2) You are mistaken. The data is only kept in the partition to which it belongs. It just looks as if it is stored in the "master".
re 3) There is no built-in limit, but anything beyond a "few thousand" partitions is probably too much. It will still work, but query planning in particular will be slower. And sometimes query execution might also suffer, because runtime partition pruning is not as efficient any more.
Given your description you probably want to do hash partitioning on the event ID and then create range sub-partitions on the timestamp value (so each partition for an event is again partitioned on the range of the timestamps).
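A hedged sketch of that layout (table, column, and partition names are illustrative, as are the modulus and date ranges):

    CREATE TABLE weather_events (
        event_id  bigint      NOT NULL,
        ts        timestamptz NOT NULL,
        payload   jsonb,
        PRIMARY KEY (event_id, ts)
    ) PARTITION BY HASH (event_id);

    -- One hash partition, itself range-partitioned on the timestamp.
    CREATE TABLE weather_events_h0
        PARTITION OF weather_events
        FOR VALUES WITH (MODULUS 4, REMAINDER 0)
        PARTITION BY RANGE (ts);

    CREATE TABLE weather_events_h0_2024_01
        PARTITION OF weather_events_h0
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

    -- ...repeat for remainders 1-3 and for each time range.

Queries that filter on both the event ID and the timestamp can then prune down to a single sub-partition.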

DB2 updated rows since last check

I want to periodically export data from db2 and load it in another database for analysis.
In order to do this, I would need to know which rows have been inserted/updated since the last time I've exported things from a given table.
A simple solution would probably be to add a timestamp to every table and use that as a reference, but I don't have such a TS at the moment, and I would like to avoid adding it if possible.
Is there any other solution for finding the rows which have been added/updated after a given time (or something else that would solve my issue)?
There is an easy option for a timestamp in Db2 (for LUW) called ROW CHANGE TIMESTAMP.
This is managed by Db2 and can be defined as HIDDEN, so existing SELECT * FROM queries will not retrieve the new column, which would otherwise cause extra costs.
Check out the Db2 CREATE TABLE documentation
This functionality was originally added for optimistic locking but can be used for such situations as well.
There is a similar concept for Db2 z/OS - you have to check that out as I have not tried this one.
Of course there are other ways to solve this, like replication, etc.
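A sketch of what that can look like on Db2 LUW (schema, table, and column names are made up; the exact column options are worth verifying against the CREATE/ALTER TABLE documentation mentioned above):

    -- Db2 maintains the value automatically; IMPLICITLY HIDDEN keeps it out
    -- of SELECT * results.
    ALTER TABLE myschema.mytable
      ADD COLUMN row_changed TIMESTAMP NOT NULL
          GENERATED ALWAYS FOR EACH ROW ON UPDATE AS ROW CHANGE TIMESTAMP
          IMPLICITLY HIDDEN;

    -- Export only rows inserted or updated since the previous run:
    SELECT *
    FROM myschema.mytable
    WHERE row_changed > ?;   -- bind the timestamp of the last export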
That is not possible if you do not have a timestamp column. With a timestamp, you can know which are new or modified rows.
You can also use the TimeTravel feature, in order to get the new values, but that implies a timestamp column.
Another option is to put the tables in append mode and then get the rows after a given one. However, this option is not reliable after a reorg, and it affects performance and space utilisation.
One possible option is to use SQL replication, but that needs extra tables for staging.
Finally, another option is to read the logs with the db2ReadLog API, but that requires development work. Also, just applying the archived logs to the new database is possible; however, the database will remain in roll-forward pending state.

How should I keep track of delete operations in a database without using triggers?

The application polls the database at certain intervals. On each poll, the application would read all the tables.
As part of an optimization, we want the application to read a table only if an INSERT/UPDATE/DELETE has happened. So I want to use the timestamp concept.
Having a separate timestamp column can help me track any row modifications.
While querying a table, I can check whether the in-memory stored timestamp is less than the maximum timestamp in the table. If so, it means that some row has been modified.
But if a row gets deleted, the latest timestamp associated with that row is no longer present. Hence the above algorithm fails in this case, since the maximum timestamp does not give the correct value.
Is there a way in which I can track the delete operations as well, without using triggers?
Any help would be highly appreciated.
I am using Sybase ASA database.
Maybe you could implement logical deletion. Instead of removing a record, you simply mark it as deleted with a specific flag, for example.
You still have the max timestamp and you can exclude the flagged records from the selection queries (maybe create some views on top of the table to do the job automatically).
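A minimal sketch of that idea in generic SQL (table, column, and view names are illustrative, the syntax and types should be adapted to SQL Anywhere, and last_modified stands in for the timestamp column discussed above):

    -- Add a soft-delete flag instead of physically removing rows.
    ALTER TABLE orders ADD deleted_flag BIT DEFAULT 0 NOT NULL;

    -- A "delete" becomes an update, so the row and its timestamp survive
    -- and the max-timestamp polling check still notices the change.
    UPDATE orders
    SET deleted_flag = 1, last_modified = CURRENT_TIMESTAMP
    WHERE order_id = 123;

    -- A view hides flagged rows from normal application queries.
    CREATE VIEW active_orders AS
    SELECT * FROM orders WHERE deleted_flag = 0;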