Does debezium support capture postgres schema change like 'alter table xxx add/drop/alter column xxx'?
Seems like an old question but in any way the short answer is yes. checkout the documentation here https://debezium.io/documentation/reference/connectors/postgresql.html .
With some exceptions:
The PostgreSQL connector retrieves schema information as part of the events sent by the logical decoding plug-in. However, the connector does not retrieve information about which columns compose the primary key. The connector obtains this information from the JDBC metadata (side channel). If the primary key definition of a table changes (by adding, removing or renaming primary key columns), there is a tiny period of time when the primary key information from JDBC is not synchronized with the change event that the logical decoding plug-in generates. During this tiny period, a message could be created with an inconsistent key structure. To prevent this inconsistency, update primary key structures as follows:
Put the database or an application into a read-only mode.
Let Debezium process all remaining events.
Stop Debezium.
Update the primary key definition in the relevant table.
Put the database or the application into read/write mode.
Restart Debezium.
Related
I'm working on kafka-connect JDBC sink connector for a database table.
I'm having trouble configuring the pk.mode to the proper one that supports auto-increment. Originally I set the pk.mode to the default "none" and hope the database would automatically insert a new record with the primary key incremented by one. However I get error complaining that the primary key cannot be "null".
I tried almost all other modes and running out of ideas now. I wonder if the sink connector ever supports pk.mode to be "auto-increment"?
I just realized that I made a mistake in the schema I created for the sink connector.
I include the primary key field in the schema and this field was unset, so when it reached to the sink connector it will complain that the primary key cannot be null.
In order to rely on the auto-increment feature from the db, the schema for the sink connector MUST NOT include the primary key. So after I removed the primary key from the schema and set the pk.mode to none, everything works properly.
I have an old MSAccess DB and I want to convert it to PostgreSQL. I found DBeaver very useful. Some operations may be done by hand. This is the case of Primary Keys. You must manually set the primary keys. I didn't found another way to do it. So in PGAdmin, I'm setting all this stuff. The client is using constantly this db so I can get data only on holiday when the client is not working and import data on PostgreSQL. My goal is to set the database ready to receive data from the production database. So far, in PGAdmin, I'm setting the Primary Key identity on "Always" and starting from the last number of primary key, setting it by hand. I think this is not the right way to do it. And when I'll start to import the production data, I don't want to set by hand all that stuff. How can I set the primary key ready to start the autoincrement from max of ID?
Using an identity column is the good way. Just set the sequences to a value safely above the current maximum. It doesn't matter if you lose a million sequence values.
I have a use case where I want to audit the DB table data changes into another table for compliance purposes. Primarily, any changes to the data like Inserts/Updates/Deletes should be audited. I found different options like JaVers, Hibernate Envers, Database triggers, and Debezium.
I am avoiding using JaVers, and Hibernate Envers as this will not capture any data change that happens through direct SQL queries and any data change that happens through other applications. The other issue I see is we need to add the audit-related code to the main application code in the same transaction boundary.
I am also avoiding the usage of database triggers as we are not using triggers at all for any of the deployments.
Then I left with Debezium which is promising. But, the only concern that I have is that we need to use Kafka to leverage Debezium. Is Kafka's usage is necessary to use Debezium if both the primary table and the audit table sit in the same DB instance?
Debezium is perfect for auditing, but given it is a source Connector, it represents just one part of the data pipeline in your use case. You will capture every table change event (c=create, r=read, u=update, d=delete), store it on a Kafka topic or local disk and then you need a Sink Connector (i.e. Camel Kafka SQL or JDBC, kafka-connect-jdbc) to insert into the target table.
For the same transaction boundary requirement you can use the Outbox Pattern if the eventual consistency is fine. There is also an Outbox Event Router SMT component that is part of the project.
Note that Debezium can also run embedded in a standalone Java application, storing the offset on local disk, but you lose the HA capability given by KafkaConnect running in distributed mode. With the embedded mode, you are also swtiching from a configuration-driven approach to a code-driven one.
I found Debezium to be a very comprehensive solution, and it is open source backed by Redhat. That gives it not only the credibility, but the security that it is going to be supported.
It provides a rich configuration to whitelist, blacklist databases/tables/columns (with wild card patterns), along with controls to limit data in really large columns.
Since it is driven from BinLogs, you not only get the current state, you also get the previous state. This is ideal for audit trails, and you can customize building a proper Sync to elastic topics (one for table).
Use of Kafka is necessary to account for HA and latency when bulk updates are made on DB, even though Primary and Audit tables are in same DB.
I am trying to get a stream of updates for certain tables from my PostgreSQL database. The regular way of getting all updates looks like this:
You create a logical replication slot
pg_create_logical_replication_slot('my_slot', 'wal2json');
And either connect to it using pg_recvlogical or making special SQL queries. This allows you to get all the actions from the database in json (if you used wal2json plugin or similar) and then do whatever you want with that data.
But in PostgreSQL 10 we have Publication/Subscription mechanism which allows us to replicate selected tables only. This is very handy because a lot of useless data is not being sent. The process looks like this:
First, you create a publication
CREATE PUBLICATION foo FOR TABLE herp, derp;
Then you subscribe to that publication from another database
CREATE SUBSCRIPTION mysub CONNECTION <connection stuff> PUBLICATION foo;
This creates a replication slot on a master database under the hood and starts listening to updates and commit them to the same tables on a second database. This is fine if your job was to replicate some tables, but want to get a raw stream for my stuff.
As I mentioned, the CREATE SUBSCRIPTION query is creating a replication slot on the master database under the hood, but how can I create one manually without the subscription and a second database? Here the docs say:
To make this work, create the replication slot separately (using the function pg_create_logical_replication_slot with the plugin name pgoutput)
According to the docs, this is possible, but pg_create_logical_replication_slot only creates a regular replication slot. Is the pgoutput plugin responsible for all the magic? If yes, then it becomes impossible to use other plugins like wal2json with publications.
What am I missing here?
I have limited experience with logical replication and logical decoding in Postgres, so please correct me if below is wrong. That being said, here is what I have found:
Publication support is provided by pgoutput plugin. You use it via plugin-specific options. It may be that other plugins have possibility to add the support, but I do not know whether the logical decoding plugin interface exposes sufficient details. I tested it against wal2json plugin version 9e962ba and it doesn't recognize this option.
Replication slots are created independently from publications. Publications to be used as a filter are specified when fetching changes stream. It is possible to peek changes for one publication, then peek changes for another publication and observe different set of changes despite using the same replication slot (I did not find it documented and I was testing on Aurora with Postgres compatibility, so behavior could potentially vary).
Plugin output seems to include all entries for begin and commit, even if transaction did not touch any of tables included in publication of interest. It does not however include changes to other tables than included in the publication.
Here is an example how to use it in Postgres 10+:
-- Create publication
CREATE PUBLICATION cdc;
-- Create slot
SELECT pg_create_logical_replication_slot('test_slot_v1', 'pgoutput');
-- Create example table
CREATE TABLE replication_test_v1
(
id integer NOT NULL PRIMARY KEY,
name text
);
-- Add table to publication
ALTER PUBLICATION cdc ADD TABLE replication_test_v1;
-- Insert example data
INSERT INTO replication_test_v1(id, name) VALUES
(1, 'Number 1')
;
-- Peak changes (does not consume changes)
SELECT pg_logical_slot_peek_binary_changes('test_slot_v1', NULL, NULL, 'publication_names', 'cdc', 'proto_version', '1');
-- Get changes (consumes changes)
SELECT pg_logical_slot_get_binary_changes('test_slot_v1', NULL, NULL, 'publication_names', 'cdc', 'proto_version', '1');
To stream changes out of Postgres to other systems, you can consider using Debezium project. It is an open source distributed platform for change data capture, which among others provides a PostgreSQL connector. In version 0.10 they added support for pgoutput plugin. Even if your use case is very different from what the project offers, you can look at their code to see how they interact with replication API.
After you have created the logical replication slot and the publication, you can create a subscription this way:
CREATE SUBSCRIPTION mysub
CONNECTION <conn stuff>
PUBLICATION foo
WITH (slot_name=my_slot, create_slot=false);
Not sure if this answers your question.
I'm currently working on dumping one of our customer's database in a way that allows us to create new databases from this customer's basic structure, but without bringing along their private data.
So far, I've had success with pg_dump combined with the --exclude_table and exclude-table-data commands, which allowed me to bring only the data I'll effectively need for this task.
However, there are a few tables that mix lines which references some of the data I left behind with other lines that references data that I had to bring, and this is causing me a few issues during the restore operation. Specifically, when the dump tries to enforce FOREIGN KEY constraints for certain columns on these tables, it fails because there are some lines with keys that have no matching data on the respective foreign table - because I chose to not bring this table's data!
I know I can log into the database after the dump is complete, delete any rows that reference data that no longer exists and create the constraint myself, but I'd like to automate the process as much as possible. Is there a way to tell pg_dump or pg_restore (or any other program) to not bring rows from table A if they reference table B if and table B's data was excluded from the backup? Or to tell Postgres that I'd like to have that specific foreign key to be active before importing the table's data?
For reference, I'm working with PostgreSQL 9.2 on a HREL 7 server.
What if you disable foreign key checking when you restore your database dump? And after that remove lonely rows from the referring table.
By the way, I recommend you to fix you database schema so there is no chance wrong tuples being inserted into your database.