Is it possible to define a different Reroute.key.field.name per postgres table? - debezium

Debezium by default uses the primary key as the partition key; however, some of my tables should be partitioned by a different key (e.g. user),
hence I wanted to use transforms.Reroute.key.field.name=user_id for that specific table only, while all of the rest keep using the primary key.
Docs:
https://debezium.io/documentation/reference/configuration/topic-routing.html#_example
However, it's not clear to me how to apply that transform to only one table and not all the others.

Instead of re-routing, you could specify the message.key.columns connector option for customizing the columns that make up the message key for specific tables.
message.key.columns=inventory.customers:user_id
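For example, this is roughly how it might look in a full connector configuration (a sketch only; the connector class and database properties below are illustrative placeholders, and the last line is the option discussed here):
connector.class=io.debezium.connector.postgresql.PostgresConnector
database.hostname=localhost
database.port=5432
database.dbname=inventory
database.server.name=dbserver1
# key inventory.customers messages by user_id; every other table keeps its primary key as the message key
message.key.columns=inventory.customers:user_id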

Related

SQLAlchemy, directly inserting primary keys seems to disable key auto generation

I am trying to populate some tables using data that I extracted from Google BigQuery. For that purpose I essentially normalized a flattened table into multiple tables, carrying over the primary key of each row into those tables. The important point is that I need to load those primary keys in order to satisfy foreign key references.
Having inserted this data into tables, I then try to add new rows to these tables. I don't specify the primary key, presuming that Postgres will auto-generate those key values.
However, I always get a "duplicate key value violates unique constraint" type error, e.g.:
duplicate key value violates unique constraint "collection_pkey"
DETAIL: Key (id)=(1) already exists.
It seems this is triggered by including the primary key in the data when initializing the table. That is, explicitly setting primary keys somehow seems to disable or reset the expected auto-generation of the primary key, i.e. I was expecting new rows to be assigned primary keys starting from the highest value already in the table.
Interestingly I get the same error whether I try to add a row via SQLAlchemy or from the psql console.
So, is this expected? And if so, is there some way to get the system to auto-generate keys again? There must be some hidden Postgres state that controls this: the schema is unchanged by directly inserting keys, but the behavior is changed by that action.
I am happy to provide additional information.
Thanks
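A minimal sketch of the behaviour described above, using a hypothetical collection table with a serial primary key: explicitly inserted ids bypass the backing sequence, so it never advances, and the next auto-generated value collides with an existing row.
-- hypothetical table; Postgres creates the sequence collection_id_seq for the serial column
CREATE TABLE collection (id serial PRIMARY KEY, name text);
-- loading data with explicit ids bypasses the sequence entirely
INSERT INTO collection (id, name) VALUES (1, 'a'), (2, 'b');
-- the sequence is still at its starting value, so this tries id = 1 again:
-- ERROR: duplicate key value violates unique constraint "collection_pkey"
INSERT INTO collection (name) VALUES ('c');
-- one common fix: advance the sequence past the highest existing id
SELECT setval('collection_id_seq', (SELECT max(id) FROM collection));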

In PostgreSQL 12, does creating partitioning via inheritance improve query performance if queries are contained within a child table?

Using PostgreSQL 12, I'd like to take advantage of partitioning to 1) aid query performance, and 2) allow removing historic data more easily to mitigate database growth.
Unfortunately, declarative partitioning requires the partition key to be part of the primary key. A temporal field as primary key doesn't work well for my model, so I'm exploring inheritance instead (as per the docs).
My question is whether using this approach will similarly limit the number of rows that my SELECT statement is exposed to when a condition in my WHERE clause restricts the results to a single child table.
eg.
Books => BooksJan2020, BooksFeb2020, BooksMar2020.
SELECT * FROM Books WHERE created < '01 20 2020' and author LIKE 'John%';
In declarative partitioning, I would expect the 'LIKE' statement to only be exposed to rows within the January table. Can I expect the same with inheritance? When studying how to create inherited tables, I don't see a mechanism that would tell the planner which child table to pull from.
SteveJ
You can do that by creating the appropriate check constraints on the inheritance children and leaving constraint_exclusion at its default value of partition.
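For illustration, a minimal sketch of that setup with hypothetical table names; the CHECK constraint on each child is what lets the planner exclude children that cannot match the WHERE clause:
CREATE TABLE books (id bigint, created timestamptz, author text);
-- each child carries a CHECK constraint describing the rows it holds
CREATE TABLE books_jan2020 (
    CHECK (created >= '2020-01-01' AND created < '2020-02-01')
) INHERITS (books);
-- with constraint_exclusion at its default, a query such as
--   SELECT * FROM books WHERE created < '2020-01-20' AND author LIKE 'John%';
-- only scans children whose CHECK constraint could be satisfied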
But I want to dissuade you from using anything but declarative partitioning in v12. Partitioning by inheritance hurts. Besides, you cannot get a true primary key on anything that does not contain the partitioning key that way: even though you have a primary key on all partitions, nothing can prevent you from inserting the same key in different partitions.
My advice is to go with a primary key on (id, created). True, that does not guarantee global uniqueness of id, but it goes a long way towards that goal. With values generated from a single sequence, the risk of duplicates is marginal.
The remaining downside of a composite primary key is that you have to include both columns in any table that has a foreign key constraint to the partitioned table, but I'd say that is the price you pay for the advantages of partitioning. Besides, with inheritance partitioning you couldn't have foreign keys pointing to the partitioned table at all.
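A sketch of that suggestion with declarative partitioning (table and column names are illustrative): the partition key created is included in the primary key, and id values come from a single sequence.
CREATE SEQUENCE books_id_seq;
CREATE TABLE books (
    id      bigint NOT NULL DEFAULT nextval('books_id_seq'),
    created timestamptz NOT NULL,
    author  text,
    PRIMARY KEY (id, created)
) PARTITION BY RANGE (created);
CREATE TABLE books_jan2020 PARTITION OF books
    FOR VALUES FROM ('2020-01-01') TO ('2020-02-01');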

kafka sink connector creation for table having primary key as three columns

I have created a source JDBC connector for a table that has no primary key (the table has columns a, b, c, d, e) and is part of an external database. I have a replica table in my database, and I created a primary key on columns a, b and c, since those three combined form unique data and can be used as the primary key. I am trying to create an upsert sink connector and set pk.fields to a,b,c, but when I launch the sink connector it goes to a degraded state and I am not able to see any proper error in connect.log either. I have set pk.mode to record_value and pk.fields to a,b,c. Can someone please let me know if there is anything missing in the setup?
Note: it works if I change the mode to insert and remove pk.fields; the pk.mode stays record_value.
Update:
Hi Robin, the source table named AccountDetails has columns accNumber, bankABA, bankOrigAccNumber, SpendingLimit and ExpirationDate, and there is no primary key for this table. The target table is AccountInformation and has the same columns, but with the primary key (accNumber, bankABA, bankOrigAccNumber), since we need a primary key at the target for use in a different application. I have created a source connector, which works fine to pull the data once every 24 hours. I am trying to create a sink connector with the mode set to upsert for pushing the data from the topic to the table, the primary key mode as record_value and the primary key fields as "accNumber,bankABA,bankOrigAccNumber". When I launch the sink, it goes to a degraded state.
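For reference, a hedged sketch of the sink configuration being described (connection details are placeholders; the property names are standard Confluent JDBC sink connector options):
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
topics=AccountDetails
connection.url=jdbc:postgresql://localhost:5432/targetdb
table.name.format=AccountInformation
insert.mode=upsert
pk.mode=record_value
# record_value requires all three fields to be present in the message value
pk.fields=accNumber,bankABA,bankOrigAccNumber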

How to write another query in an IN clause when partitioning

I have 2 local Docker PostgreSQL 10.7 servers set up. On my hot instance, I have a huge table that I wanted to partition by date (I achieved that). The data from the partitioned table (let's call it PART_TABLE) is stored on the other server; only PART_TABLE_2019 is stored on the hot instance. And here comes the problem: I don't know how to partition 2 other tables that have foreign keys to PART_TABLE, based on that FK. PART_TABLE and TABLE2_PART are both stored on the hot instance.
I was thinking something like this:
create table TABLE2_PART_2019 partition of TABLE2_PART for values in (select uuid from PART_TABLE_2019);
But the query doesn't work, and I don't know if this is a good idea (performance-wise or logically).
Let me just mention that I can solve this with either function or script etc. but I would like to do this without scripting.
From doc at https://www.postgresql.org/docs/current/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE
"While primary keys are supported on partitioned tables, foreign keys
referencing partitioned tables are not supported. (Foreign key
references from a partitioned table to some other table are
supported.)"
With PostgreSQL v10, you cannot define foreign keys on partitioned tables at all, but you can create foreign keys on each individual partition.
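For example, assuming uuid is the primary key of PART_TABLE_2019, the foreign key could be added against that concrete partition instead of the partitioned parent (the constraint name is illustrative):
-- v10 workaround: reference an individual partition rather than the partitioned table
ALTER TABLE table2_part_2019
    ADD CONSTRAINT table2_part_2019_uuid_fkey
    FOREIGN KEY (uuid) REFERENCES part_table_2019 (uuid);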
You could upgrade to PostgreSQL v11, which allows foreign keys to be defined on partitioned tables.
Can you explain what a HOT instance is and why it would make this difficult?

cassandra 2.0.9: best practices for write-heavy columns

I am a little confused by clustering in Cassandra. I have an application that is very write-heavy and update-heavy. With a traditional relational database, I'd partition data into two tables: one table for data that changes infrequently; and one table (with shorter rows) for the columns that change frequently:
For example:
create table user_def ( id int primary key, email list<varchar> );  -- stable
create table user_var ( id int primary key, state int );            -- changes all the time
But Cassandra seems to be optimized for accessing sparsely-populated columns, so I'm not sure there is any advantage in mimicking this approach for Cassandra schemas.
With Cassandra, is there any advantage in separating frequently-updated columns to a separate table/column-family (away from infrequently-updated columns) or should I combine all the columns together into one table/column-family? Do circumstances change if I have a compound primary key and clustering comes into play?
Cassandra treats primary keys like this:
The first key in the primary key (which can be a composite) is used to partition your data. This defines which node(s) your data is saved on (and replicated to). The other fields in the primary key are then used to sort entries within a partition. The whole partition is always stored on one node (and its replica nodes) in its entirety, and each entry within a partition is sorted by the "other" fields in the primary key. [The first element of the primary key is called the partition key, while the other fields in the primary key are called clustering keys.]
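For illustration, a small sketch with a hypothetical table showing that distinction:
-- user_id is the partition key; event_time is a clustering key
CREATE TABLE user_events (
    user_id    int,
    event_time timestamp,
    state      int,
    PRIMARY KEY ((user_id), event_time)
);
-- all rows for a given user_id live in one partition, sorted by event_time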
Based on that, I'd say you might as well simply have a single table with id, state and email. It looks like you're using skinny rows, and I don't think you'd gain anything from creating the separate tables.
I had accepted ashic's answer until I came upon this:
http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
which states (for delete-heavy access):
...consider partitioning data with heavy churn rate into separate rows and deleting the entire rows when you no longer need them. Alternatively, partition it into separate tables and truncate them when they aren’t needed anymore...
This falls under the 'queue' anti-pattern for the product.