Best way to stream/logically replicate RDS Postgres data to kinesis - postgresql

Our primary datastore is an RDS Postgres database. It would be nice if we could stream all changes that happen in Postgres to some sink - whether that's Kinesis, Elasticsearch or any other data store.
We use Postgres 9.5, which has support for 'logical replication'. However, all the extensions that tap into this stream are blocked on RDS. There's a tutorial for streaming the MySQL RDS flavor to Kinesis - the Postgres equivalent would be ideal. Is this possible currently?

Have a look at https://github.com/disneystreaming/pg2k4j . It takes all changes made to your database and streams them to Kinesis. See the README for an example of how to set this up with RDS. We've been using it in production and have found it very useful for solving this exact problem. Disclaimer: I wrote https://github.com/disneystreaming/pg2k4j

Integrate a central Amazon Relational Database Service (Amazon RDS) for PostgreSQL database with other systems by streaming its modifications into Amazon Kinesis Data Streams. An earlier post, Streaming Changes in a Database with Amazon Kinesis, described how to integrate a central RDS for MySQL database with other systems by streaming modifications through Kinesis. In this post, I take it a step further and explain how to use an AWS Lambda function to capture the changes in Amazon RDS for PostgreSQL and stream those changes to Kinesis Data Streams.
https://aws.amazon.com/blogs/database/stream-changes-from-amazon-rds-for-postgresql-using-amazon-kinesis-data-streams-and-aws-lambda/
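For reference, a minimal sketch of that polling approach might look like the following in Java. It assumes a wal2json logical replication slot named kinesis_slot already exists (wal2json is available on newer RDS PostgreSQL versions when rds.logical_replication is enabled), a Kinesis stream called postgres-changes, and the Postgres JDBC driver plus the AWS SDK v2 Kinesis client on the classpath. Hostnames, credentials and names are placeholders; the blog post itself packages this logic as a Lambda function.

// Sketch only: poll a wal2json replication slot and forward each change to Kinesis.
// The slot can be created with:
//   SELECT * FROM pg_create_logical_replication_slot('kinesis_slot', 'wal2json');
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

public class WalToKinesisPoller {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://my-rds-host:5432/mydb", "replicator", "secret");
             KinesisClient kinesis = KinesisClient.create()) {

            // pg_logical_slot_get_changes consumes the pending changes and
            // advances the slot, so each change is returned only once here.
            String sql = "SELECT lsn, data FROM pg_logical_slot_get_changes("
                       + "'kinesis_slot', NULL, NULL)";
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    String json = rs.getString("data"); // wal2json change payload
                    kinesis.putRecord(PutRecordRequest.builder()
                            .streamName("postgres-changes")
                            .partitionKey(rs.getString("lsn"))
                            .data(SdkBytes.fromUtf8String(json))
                            .build());
                }
            }
        }
    }
}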

Related

how to stream data from AWS MSK (kafka) to snowflake using MSK connect

I'm trying to set up an MSK connector for Snowflake and I could hardly find any documentation on how to do it. Unfortunately, the AWS support person also referred me to the Snowflake documentation page.
By following that, I can create an EC2 instance and spin up a connector, but I wanted to go serverless and use MSK Connect.
I'm having a hard time with the connector properties for Snowflake, and AWS doesn't provide much information about them.
As described on the MSK Connect plugins page, you'd need to upload the Snowflake ZIP/JAR plugins to S3, from where they'd be downloaded prior to the connector starting:
https://docs.aws.amazon.com/msk/latest/developerguide/msk-connect-plugins.html
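The part AWS doesn't document is the connector configuration itself, which comes from the Snowflake Kafka connector documentation. A hedged sketch of what that configuration might look like is below, built as a Java map purely for illustration; the account URL, user, database, schema and topic names are placeholders, and the property names are as I recall them from the Snowflake docs.

// Illustrative only: the key/value pairs you would paste into the MSK Connect
// "connector configuration" box, expressed as a Java map.
import java.util.LinkedHashMap;
import java.util.Map;

public class SnowflakeConnectorConfig {
    public static Map<String, String> build() {
        Map<String, String> config = new LinkedHashMap<>();
        config.put("connector.class", "com.snowflake.kafka.connector.SnowflakeSinkConnector");
        config.put("tasks.max", "2");
        config.put("topics", "orders,customers");
        config.put("snowflake.url.name", "myaccount.snowflakecomputing.com:443");
        config.put("snowflake.user.name", "KAFKA_CONNECTOR_USER");
        config.put("snowflake.private.key", "<private key on a single line, no header/footer>");
        config.put("snowflake.database.name", "KAFKA_DB");
        config.put("snowflake.schema.name", "KAFKA_SCHEMA");
        config.put("key.converter", "org.apache.kafka.connect.storage.StringConverter");
        config.put("value.converter", "com.snowflake.kafka.connector.records.SnowflakeJsonConverter");
        return config;
    }

    public static void main(String[] args) {
        // Print in properties form for easy copy/paste.
        build().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}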

AWS: What is the right way of PostgreSQL integration with Kinesis?

The aim that I want to achieve is to be notified about DB data updates; for this reason, I want to build the following chain: PostgreSQL -> Kinesis -> Lambda.
But I am not sure how to notify Kinesis properly about DB changes.
I saw a few examples where people try to use PostgreSQL triggers to send data to Kinesis, and some people use the wal2json approach.
So I have some doubts about which option to choose; that's why I am looking for advice.
You can leverage Debezium to do this.
Debezium connectors can also be integrated within your code using the Debezium Engine, and you can add transformation or filtering logic (if you need it) before pushing the changes out to Kinesis.
Here's a link that explains the Debezium Postgres connector.
There is also Debezium Server (internally, I believe it makes use of the Debezium Engine). It currently supports Kinesis, Google Pub/Sub and Apache Pulsar as sinks for CDC from the databases that Debezium supports.
Here is an article that you can refer to for a step-by-step configuration of Debezium Server:
https://xyzcoder.github.io/2021/02/19/cdc-using-debezium-server-mysql-kinesis.html
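To make the Debezium Engine option more concrete, here is a rough sketch (not production code) of embedding the engine and forwarding each change event to Kinesis. It assumes the debezium-embedded and Postgres connector artifacts plus the AWS SDK v2 Kinesis client are on the classpath; hostnames, credentials and the stream name are placeholders, and older Debezium versions use database.server.name instead of topic.prefix.

import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

public class PostgresToKinesis {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("name", "pg-to-kinesis");
        props.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
        props.setProperty("database.hostname", "my-postgres-host");
        props.setProperty("database.port", "5432");
        props.setProperty("database.user", "debezium");
        props.setProperty("database.password", "secret");
        props.setProperty("database.dbname", "mydb");
        props.setProperty("topic.prefix", "mydb"); // "database.server.name" on older Debezium versions
        props.setProperty("plugin.name", "pgoutput");

        KinesisClient kinesis = KinesisClient.create();

        // Each captured insert/update/delete arrives here as a JSON change event;
        // this callback is where transformation or filtering logic would go.
        DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
                .using(props)
                .notifying(event -> {
                    if (event.value() == null) return; // skip tombstones
                    kinesis.putRecord(PutRecordRequest.builder()
                            .streamName("postgres-changes")
                            .partitionKey(event.destination()) // per-table destination name
                            .data(SdkBytes.fromUtf8String(event.value()))
                            .build());
                })
                .build();

        // The engine is a Runnable; run it on its own thread.
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine);
    }
}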

Is there a way to connect to multiple databases in multiple hosts using Kafka Connect?

I need to get data from an Informix database using Kafka Connect. The scenario is this: I have 50 Informix databases residing on 50 hosts. What I have understood from reading about Kafka Connect is that we need to install Kafka Connect on each host to get the data from the database residing on that host. My question is this: is there a way in which I can create the connectors centrally for these 50 hosts, instead of installing Kafka Connect on each of them, and pull data from the databases?
Kafka Connect JDBC does not have to run on the database host, just as other JDBC clients don't, so you can have a Kafka Connect cluster that is larger or smaller than your database pool.
Informix seems to have a thing called "CDC Replication Engine for Kafka", however, which might be worth looking into, as CDC overall causes less load on the database.
You don't need any additional software installation on the system where the Informix server is running. I am not fully clear about the question or the type of operation you plan to do. If you are planning to set up a real-time replication type of scenario, then you may have to invoke the CDC API; a one-time setup of the CDC API at the server is needed, and then the API can be invoked using any Informix database driver. If you plan to read existing data from table(s) and pump it into Kafka topic(s), then no additional setup is needed on the server side. You could connect to all 50 database servers from a single program (remotely) and then pump those records to the Kafka topic(s). Based on the programming language you are using, you may choose the appropriate Informix database driver.
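For illustration, a rough sketch of that single-program approach in Java might look like this; the host list, credentials, table and topic names are invented for the example, and it assumes the Informix JDBC driver and kafka-clients are on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class InformixToKafka {
    public static void main(String[] args) throws Exception {
        List<String> hosts = List.of("ifx-host-01", "ifx-host-02" /* ... up to 50 */);

        Properties kafkaProps = new Properties();
        kafkaProps.put("bootstrap.servers", "broker1:9092");
        kafkaProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        kafkaProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(kafkaProps)) {
            for (String host : hosts) {
                // Informix JDBC URL: jdbc:informix-sqli://<host>:<port>/<db>:INFORMIXSERVER=<server>
                String url = "jdbc:informix-sqli://" + host + ":9088/mydb:INFORMIXSERVER=myserver";
                try (Connection conn = DriverManager.getConnection(url, "informix", "secret");
                     Statement stmt = conn.createStatement();
                     ResultSet rs = stmt.executeQuery("SELECT id, payload FROM orders")) {
                    while (rs.next()) {
                        // Key by host + primary key so consumers can tell which
                        // of the 50 databases a record came from.
                        String key = host + ":" + rs.getLong("id");
                        producer.send(new ProducerRecord<>("informix.orders", key, rs.getString("payload")));
                    }
                }
            }
        }
    }
}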

How do I read a Table In Postgresql Using Flink

I want to do some analytics using Flink on data in PostgreSQL. How and where should I give the port, address, username and password? I was trying with the table source as mentioned in the link: https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/table/common.html#register-tables-in-the-catalog.
final static ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
final static TableSource csvSource = new CsvTableSource("localhost", port);
I am unable to even get started, actually. I went through all the documentation but could not find a detailed explanation of this.
The tables and catalog referred to in the link you've shared are part of Flink's SQL support, wherein you can use SQL to express computations (queries) to be performed on data ingested into Flink. This is not about connecting Flink to a database, but rather it's about having Flink behave somewhat like a database.
To the best of my knowledge, there is no Postgres source connector for Flink. There is a JDBC table sink, but it only supports append mode (via INSERTs).
The CSVTableSource is for reading data from CSV files, which can then be processed by Flink.
If you want to operate on your data in batches, one approach you could take would be to export the data from Postgres to CSV, and then use a CSVTableSource to load it into Flink. On the other hand, if you wish to establish a streaming connection, you could connect Postgres to Kafka and then use one of Flink's Kafka connectors.
Reading a Postgres instance directly isn't supported as far as I know. However, you can get realtime streaming of Postgres changes by using a Kafka server and a Debezium instance that replicates from Postgres to Kafka.
Debezium connects using the native Postgres replication mechanism on the DB side and emits all record inserts, updates or deletes as a message on the Kafka side. You can then use the Kafka topic(s) as your input in Flink.
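To make the batch CSV route concrete, here is a rough sketch against the Flink 1.4-era Table API that the question links to; the file path and schema are placeholders for whatever you actually export from Postgres.

import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.table.sources.CsvTableSource;
import org.apache.flink.types.Row;

public class PostgresCsvIntoFlink {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        BatchTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

        // CSV produced from Postgres, e.g. with: \copy users TO '/data/users.csv' WITH CSV
        CsvTableSource csvSource = CsvTableSource.builder()
                .path("/data/users.csv")
                .field("id", BasicTypeInfo.INT_TYPE_INFO)
                .field("name", BasicTypeInfo.STRING_TYPE_INFO)
                .fieldDelimiter(",")
                .build();

        // Register the file as a table and query it with the Table API.
        tableEnv.registerTableSource("users", csvSource);
        Table result = tableEnv.scan("users").select("id, name");
        tableEnv.toDataSet(result, Row.class).print();
    }
}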

Postgres streaming using JDBC Kafka Connect

I am trying to stream changes in my Postgres database using the Kafka Connect JDBC Connector. I am running into issues at startup, as the database is quite big and the query dies every time because rows change in between.
What is the best practice for starting off the JDBC Connector on really huge tables?
Assuming you can't pause the workload on the database that you're streaming the contents in from to allow the initialisation to complete, I would look at Debezium.
In fact, depending on your use case, I would look at Debezium regardless :) It lets you do true CDC against Postgres (and MySQL and MongoDB), and is a Kafka Connect plugin just like the JDBC Connector is so you retain all the benefits of that.
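For reference, a hedged sketch of switching to the Debezium Postgres connector by registering it with a Kafka Connect worker over the standard Connect REST API (default port 8083). The worker URL, database details, slot name and topic prefix are placeholders, and older Debezium versions use database.server.name instead of topic.prefix.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterDebeziumConnector {
    public static void main(String[] args) throws Exception {
        // Connector definition posted to the Kafka Connect REST API.
        String body = """
            {
              "name": "pg-cdc",
              "config": {
                "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
                "database.hostname": "my-postgres-host",
                "database.port": "5432",
                "database.user": "debezium",
                "database.password": "secret",
                "database.dbname": "mydb",
                "topic.prefix": "mydb",
                "plugin.name": "pgoutput",
                "slot.name": "debezium_slot"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://connect-worker:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}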