Debezium: Reduce load from WalSenderMain activity - PostgreSQL

I'm currently using Debezium for our Postgres databases, which are on AWS Aurora (RDS). As you know, Debezium makes use of a publication (with the default name 'dbz_publication').
One thing I've noticed is that in AWS, the WALSenderMain (write-ahead log sender) activity appears to be constant, with a load of 1 AAS (Average Active Sessions), i.e. 100%.
Drilling down further, this is caused by continuous queries against the publication table.
Note that in the configuration I've set 'publication.autocreate.mode' to 'filtered' rather than the default 'all_tables'.
Is this normal? Is there any way for me to fine-tune Debezium to lower the WALSenderMain activity usage? Thanks.
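For context, the relevant part of the connector configuration looks roughly like this (a trimmed-down sketch; the hostname, database name and table list are placeholders, not my exact values):

connector.class=io.debezium.connector.postgresql.PostgresConnector
plugin.name=pgoutput
database.hostname=aurora-cluster.example.internal
database.dbname=appdb
publication.name=dbz_publication
publication.autocreate.mode=filtered
table.include.list=public.orders,public.customers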

Related

Kafka Connect: Single connector or connector per table approach

I have a database, say test, and we have multiple Kafka Debezium connectors on it.
Each connector is associated with one table.
My question is in terms of memory usage, which is a better approach:
One connector per database OR
One connector per table
I think it really depends on your use case. I don't think there is a general approach for all use cases.
For example, at my current job we decided to have 4 connectors that stream changes from the same database, but each of them streams from a subset of the tables. The main reason is that we don't want a single point of failure where one bad record in the DB can break all the use cases that rely on CDC, hence we divided the tables up and assigned them to connectors. Note that it's also not good to have a lot of replication slots on the database. So it really depends on your use case.
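As a rough sketch of that split (the table, slot and publication names here are invented for illustration; the key point is that each connector gets its own replication slot and publication and a disjoint table.include.list):

# Connector 1: order-related tables
connector.class=io.debezium.connector.postgresql.PostgresConnector
slot.name=dbz_orders
publication.name=dbz_orders_pub
table.include.list=public.orders,public.order_items

# Connector 2: billing tables
connector.class=io.debezium.connector.postgresql.PostgresConnector
slot.name=dbz_billing
publication.name=dbz_billing_pub
table.include.list=public.invoices,public.payments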
Considering all performance factors, the generally recommended approach is a single source connector (with multiple instances to share the load) and a replicator or configuration file per database instance (test1, test2, test3, etc.) covering multiple tables, so the data ingress is 1 table -> 1 topic.
You can get a better view of the equivalent Oracle GoldenGate implementation pattern here:
https://rmoff.net/2018/12/12/streaming-data-from-oracle-into-kafka/

How do I make sure I can keep the ingestion client running if I have heavy read operations on a table?

I am using the InfluxDB line protocol to insert records into a table in QuestDB at a constant and high rate. I have multiple Postgres clients attached performing read operations; some are Grafana dashboards which do heavy aggregations across the table. It looks like when I refresh the dashboards, I'm hitting some issues:
... t.LineTcpConnectionContext [31] queue full, consider increasing queue size or number of writer jobs
Is there a way to make sure I don't kick the insert client out, or to increase the queue as mentioned in the error?
If you have one client writing InfluxDB line protocol over TCP, it's possible to have a dedicated worker thread for this purpose. The config key for this is line.tcp.worker.count, and it should be set in a configuration file or via an environment variable. Setting one dedicated thread in server.conf would look like the following:
line.tcp.worker.count=1

Multiple log streams in CloudWatch - PostgreSQL RDS

I've got logs publishing from an Amazon RDS Postgres instance to CloudWatch (shown in the console under Published logs / CloudWatch Logs as PostgreSQL).
Going to the CloudWatch service, I see there are 4 different "streams" which contain similar data from the same time period.
What's the difference between those 4?
I've checked, and all of the files contain statements like UPDATE, SELECT, etc.
I'm not sure which one I should analyze.
Better late than never, but I was told by AWS support that in later versions of PostgreSQL on RDS, they write to the logs in parallel to increase performance.
I also navigated to the log the way you did, by going directly to CloudWatch, but apparently the correct way to access the log is to go to your RDS instance in the AWS console, click on the Configuration tab, and in the rightmost column there should be a clickable link (PostgreSQL in my case) under the Published Logs/CloudWatch Logs section.
Clicking on that link should take you back to CloudWatch, but now all the log entries from the four parallel streams should be in order.
Through experimentation I have found that the solution offered by #krispyjala does not guarantee that the log messages in CloudWatch will be sorted in the same way they are written to the logs on the instance (the logs one can download from the console or through the RDS API).
The only viable solution would be for Amazon to enable logging with milliseconds, which is controlled by the RDS parameter log_line_prefix that cannot be changed.
Unfortunately, at this moment there is NO RELIABLE WAY to join multiple CloudWatch streams into a single stream of messages that preserves the original ordering.
CloudWatch records three bits of information: the original message timestamp, the message itself, and the ingestion timestamp. The original message timestamp is only given to the second, so we cannot rely on it. The ingestion timestamp can be up to several days off from the original timestamp and is subject to random network and processing delays, so it cannot be relied upon when joining the streams.
Sorry, but there is no solution until logging with milliseconds is turned on, which is something AWS representatives assured me is in the works (as of Apr 8, 2021).

Integration of Kafka in Web Application

I have a Java-based web application which uses 2 backend Microsoft SQL database servers (one server is the live transactional database and the other is the reporting database). The lag between the transactional and reporting databases is around 30 minutes, and incremental data is loaded using a SQL job which runs every 30 minutes and takes around 20-25 minutes to execute. This job executes an SSIS package, and through this package the data from the reporting database is further processed and stored in HDFS and HBase, which is eventually used for analytics.
Now, I want to reduce this lag, and to do this I am thinking of implementing a messaging framework. After doing some research, I learned that Kafka could serve my purpose since Kafka can also work as an ETL tool apart from being a messaging framework.
How should I proceed? Should I create topics similar to the table structures in SQL Server and perform operations on those? Should I redirect my application to write any change to Kafka first and then to the transactional database? Please advise on the usage of Kafka for this use case.
There are a couple of ways to do this that require minimal code, and then there's always the option to write your own code.
(Some coworkers just finished looking at this, with SQL Server and Oracle, so I know a little about it.)
If you're using the Enterprise edition of SQL Server, you could use Change Data Capture and Confluent Kafka Connect to read all the changes to the data. This (seems to) require an Enterprise license and may include some other additional cost (I was fuzzy on the details here; this may have been because we're using an older version of SQL Server or because we have many database servers).
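For illustration, a CDC source connector configuration looks roughly like the sketch below. This assumes the Debezium SQL Server connector (one of the CDC connectors you can run on Kafka Connect) with its 1.x-era property names; the hostname, credentials and table names are placeholders:

connector.class=io.debezium.connector.sqlserver.SqlServerConnector
database.hostname=sqlserver.example.internal
database.port=1433
database.user=cdc_reader
database.password=********
database.dbname=transactional_db
database.server.name=sqlserver01
table.include.list=dbo.Orders,dbo.LineItems
database.history.kafka.bootstrap.servers=kafka:9092
database.history.kafka.topic=schema-changes.transactional_db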
If you're not / can't use the CDC stuff, Kafka Connect's JDBC support also has a mode where it polls the database for changes. This works best if your records have some kind of timestamp column, which is usually the case.
A poll-only mode without CDC means you won't get every change: if you poll every 30 seconds and a record changes twice in that window, you won't get individual messages about each change, but one message with those two changes combined, if that makes sense. This is probably acceptable for your business domain, but something to be aware of.
Anyway, Kafka Connect is pretty cool: it will auto-create Kafka topics for you based on your table names, including posting the Avro schemas to Schema Registry. (The topic names are knowable, so if you're in an environment with auto topic creation disabled, you can create the topics manually yourself based on the table names.) Starting from no Kafka Connect knowledge, it took me maybe 2 hours to figure out enough of the configuration to dump a large SQL Server database to Kafka.
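As a sketch of the polling approach (not a drop-in config: the JDBC URL, credentials, table list and column names are placeholders), a Confluent JDBC source connector in timestamp mode looks something like the following, and the topics it writes to are simply topic.prefix plus the table name:

connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:sqlserver://sqlserver.example.internal:1433;databaseName=transactional_db
connection.user=kafka_reader
connection.password=********
mode=timestamp+incrementing
timestamp.column.name=last_modified
incrementing.column.name=id
table.whitelist=Orders,LineItems
topic.prefix=sqlserver01-
poll.interval.ms=30000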
I found additional documentation in a GitHub repository of a Confluent employee describing all this, with documentation of the settings, etc.
There's always the option of having your web app be a Kafka producer itself and ignoring the lower-level database stuff. This may be a better solution, for example when a request creates a number of records across the data store but they really represent one related event (an Order may spawn off some LineItem records in your relational database, but the downstream database only cares that an order was made).
On the consumer end (i.e. "next to" your other database) you could either use Kafka Connect to pick up the changes, maybe even writing a custom plugin if required, or write your own Kafka consumer microservice to put the changes into the other database.
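For the Kafka Connect route on the consumer side, a minimal sketch (assuming the Confluent JDBC sink connector; the connection details, topic list and key column are placeholders) would look something like:

connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
connection.url=jdbc:sqlserver://reporting.example.internal:1433;databaseName=reporting_db
connection.user=kafka_writer
connection.password=********
topics=sqlserver01-Orders,sqlserver01-LineItems
insert.mode=upsert
pk.mode=record_key
pk.fields=id
auto.create=true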

Using HADR Standby as replication source

Trying to figure out if there's a way to replicate a subset of tables (columns) from an HADR Standby with ROS (reads on standby) enabled. A latency of O(10) seconds can be tolerated. We are using LUW V10.5 FP8 right now and will upgrade to V11 at some point.
I understand the read-only limitation on the HADR Standby, and that eliminates some options, e.g. Q Replication / InfoSphere CDC, which write monitoring/metadata back to the source.
Furthermore, assuming we still limit ourselves to a DB2-only environment and the replication user has read access to all the tables/columns, is there a replication tool that doesn't depend on a constant source database connection? Meaning such a tool would only scan the transaction log files and write to its own external metadata/files, without touching the source database at all. It would be even better if it could capture once and replay to multiple targets.
I'd really appreciate your input.