I want to capture data changes from a few tables in a huge PostgreSQL database.
Initially I planned to use the logical decoding feature with Debezium. But this solution has significant overhead since it's necessary to decode the entire WAL. Another solution uses triggers and PgQ.
Is there any general way to integrate PgQ with Kafka or perhaps a Kafka connector for this purpose?
You either go transaction log, or you go query-based.
Which you use depends on your use of the data. Query-based polls the DB, log-based uses the log (WAL).
I'm interested in your assertion that Debezium has "significant overhead"—have you quantified this? I know there are lots of people using it and it's not usually raised as an issue.
For query-based capture use the Kafka Connect JDBC source connector.
You can see pros and cons of each approach here: http://rmoff.dev/ksny19-no-more-silos
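For the query-based route, a minimal sketch of what a JDBC source connector registration could look like, written in Python only to assemble the JSON. The property names come from the Confluent JDBC source connector; the connection URL, table name, column names and topic prefix are made-up placeholders, and the Kafka Connect REST address in the comment is an assumption about a default local setup.

```python
import json


def jdbc_source_config(connection_url, table, timestamp_column, id_column, topic_prefix):
    """Build a JDBC source connector config that polls one table for changes."""
    return {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": connection_url,
        # Only poll the tables you care about, not the whole database.
        "table.whitelist": table,
        # Detect new and updated rows via a timestamp column plus a growing id.
        "mode": "timestamp+incrementing",
        "timestamp.column.name": timestamp_column,
        "incrementing.column.name": id_column,
        "topic.prefix": topic_prefix,
        "poll.interval.ms": "5000",
    }


def connector_request_body(name, config):
    """JSON body for POST /connectors on the Kafka Connect REST API."""
    return json.dumps({"name": name, "config": config}, indent=2)


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    config = jdbc_source_config(
        connection_url="jdbc:postgresql://db.example.internal:5432/shop",
        table="orders",
        timestamp_column="updated_at",
        id_column="id",
        topic_prefix="pg-",
    )
    # Would typically be sent to something like http://localhost:8083/connectors
    print(connector_request_body("orders-jdbc-source", config))
```

The table needs a reliable "last updated" column for this mode to pick up updates; without one, only inserts can be detected via the incrementing id.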
Related
As the title says, I have 2 separate servers and I want both connectors to read from the same source and write to their respective topics. A single connector works well. When I create another one on a different server, both seem to be running but no data flows for either.
My question is: is it possible to run 2 Debezium connectors that read from the same source? I couldn't find any information about this in the documentation.
Edit: I've tested it with an Oracle database and have never seen it work well. I definitely wouldn't recommend it, especially with Oracle.
So generally speaking, Debezium does not recommend that you use multiple connectors per database source and prefers that you adjust your connector configuration instead. We understand that isn't always possible when you have different business use cases at play.
That said, it's important that if you do deploy multiple connectors you properly configure each connector so that it doesn't share state such as the same database history topic, etc.
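As a rough illustration of the "doesn't share state" point, here is a small Python check that compares two connector configs and reports properties that probably should not be identical. The list of keys is an assumption: which names apply depends on the connector type and the Debezium version (for example, older releases use `database.server.name` and `database.history.kafka.topic`, newer ones use `topic.prefix` and `schema.history.internal.kafka.topic`), so treat it as a starting point rather than a complete list.

```python
# Properties that would normally need a distinct value per connector.
# Assumption: adjust this list to your connector type and Debezium version.
UNIQUE_KEYS = (
    "database.server.name",                  # logical name, older releases
    "topic.prefix",                          # logical name, newer releases
    "database.history.kafka.topic",          # history topic, older releases
    "schema.history.internal.kafka.topic",   # history topic, newer releases
    "database.server.id",                    # MySQL replication client id
    "slot.name",                             # PostgreSQL replication slot
)


def shared_state_conflicts(config_a, config_b):
    """Return the keys that both configs set to the same value."""
    return [
        key
        for key in UNIQUE_KEYS
        if key in config_a and key in config_b and config_a[key] == config_b[key]
    ]


if __name__ == "__main__":
    # Hypothetical configs for two connectors reading the same MySQL source.
    first = {
        "database.server.name": "inventory_a",
        "database.history.kafka.topic": "dbhistory.inventory",
        "database.server.id": "5401",
    }
    second = {
        "database.server.name": "inventory_b",
        "database.history.kafka.topic": "dbhistory.inventory",  # same as first
        "database.server.id": "5402",
    }
    print(shared_state_conflicts(first, second))
```

In this made-up example the two connectors would be flagged for sharing the history topic, which is exactly the kind of overlap to avoid.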
For certain database platforms, such as MySQL, having multiple source connectors doesn't place any real burden on the database. But on other databases, like Oracle, running multiple connectors can have a pretty substantial impact.
When an Oracle connector streams changes, it starts an Oracle LogMiner mining session. This session is responsible for loading, reading, parsing, and preparing the contents of the data read in a special in-memory table that the connector uses to generate change events. When you run multiple connectors, you will have concurrent Oracle LogMiner sessions happening and each session will be consuming its own share of PGA memory to support the steps taken by Oracle LogMiner. Depending on your database's volatility, this can be stressful on the database server since Oracle specifically assigns one LogMiner session to a CPU.
For an Oracle environment, I highly recommend you avoid using multiple connectors unless you are needing to stream changes from different PDBs within the same instance since there is really no technical reason why you should want to read, load, parse, and generate change data for the same redo entries multiple times, once per connector deployment.
I occasionally need to send records from an Oracle DB to a Kafka topic. Is there a way to do so using a procedure etc., without using extra tools?
You have several options, detailed in this blog and this talk.
tl;dr:
"Query-based" CDC, poll the database for changes, using Kafka Connect JDBC Source connector
"Log-based" CDC, use the redo log to capture every change into Kafka, using GoldenGate / Qlik Attunity / etc.
There's also a REST API if you want the lowest footprint, and then you just push records from a PL/SQL REST call over to Kafka. Probably fine for small infrequent volumes but it's going to be inflexible and brittle in the long run.
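To give an idea of what such a REST call carries, here is a sketch of the request in Python, assuming the REST API in question is the Confluent REST Proxy (v2 JSON format). The proxy URL, topic name and record contents are placeholders; a PL/SQL caller would send the same method, path, headers and body with whatever HTTP facility is available in the database.

```python
import json
import urllib.request


def build_produce_request(proxy_url, topic, records):
    """Build (but do not send) a REST Proxy v2 produce request for JSON records."""
    body = json.dumps({"records": [{"value": record} for record in records]})
    return urllib.request.Request(
        url=f"{proxy_url.rstrip('/')}/topics/{topic}",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/vnd.kafka.json.v2+json",
            "Accept": "application/vnd.kafka.v2+json",
        },
    )


if __name__ == "__main__":
    # Hypothetical proxy address, topic and record.
    request = build_produce_request(
        "http://localhost:8082", "oracle-events", [{"id": 42, "status": "NEW"}]
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(request)
```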
I have a Java-based web application which uses 2 backend Microsoft SQL Server database servers (one is the live, transactional database and the other is the reporting database). The lag between the transactional and reporting databases is around 30 minutes, and incremental data is loaded using a SQL job which runs every 30 minutes and takes around 20-25 minutes to execute. This job executes an SSIS package, and using this package, data from the reporting database is further processed and stored in HDFS and HBase, which is eventually used for analytics.
Now, I want to reduce this lag, and to do this I am thinking of implementing a messaging framework. After doing some research, I learned that Kafka could serve my purpose, since Kafka can also work as an ETL tool apart from being a messaging framework.
How should I proceed? Should I create topics similar to the table structures in SQL Server and perform operations on those? Should I redirect my application to write any change to Kafka first and then to the transactional database? Please advise on the usage of Kafka for this use case.
There's a couple ways to do this that require minimal code, and then there's always the option to write your own code.
(Some coworkers just got finished looking at this, with SQL Server and Oracle, so I know a little about this here)
If you're using the Enterprise edition of SQL Server you could use Change Data Capture and Confluent Kafka Connect to read all the changes to the data. This (seems to) require an Enterprise license and may include some additional cost (I was fuzzy on the details here. This may have been because we're using an older version of SQL Server or because we have many database servers).
If you're not / can't use the CDC stuff, Kafka Connect's JDBC support also has a mode where it polls the database for changes. This works best if your records have some kind of timestamp column, but usually this is the case.
A poll-only mode without CDC means you won't get every change, i.e. if you poll every 30 seconds and the record changes twice in that window, you won't get an individual message for each change; you'll get one message reflecting both changes, if that makes sense. This is probably acceptable for your business domain, but something to be aware of.
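To make the difference concrete, here is a small self-contained simulation (not connector code) of the two capture styles. Changes are `(timestamp_seconds, row_id, value)` tuples; the row ids, values and the 30-second interval are made up for illustration.

```python
def log_based_capture(changes):
    """Log-based capture: every change becomes its own event, in order."""
    return sorted(changes, key=lambda change: change[0])


def poll_based_capture(changes, poll_interval):
    """Polling: each poll only sees the latest state of every changed row."""
    events = []
    pending = {}  # row_id -> latest value written since the previous poll
    next_poll = poll_interval
    for ts, row_id, value in sorted(changes, key=lambda change: change[0]):
        # Run every poll that happens before this change is written.
        while ts > next_poll:
            events.extend((next_poll, rid, val) for rid, val in pending.items())
            pending.clear()
            next_poll += poll_interval
        pending[row_id] = value  # overwrites any earlier, unpolled value
    # Final poll picks up whatever is left.
    events.extend((next_poll, rid, val) for rid, val in pending.items())
    return events


if __name__ == "__main__":
    changes = [(5, 1, "pending"), (20, 1, "shipped"), (40, 1, "delivered")]
    print(log_based_capture(changes))       # three events
    print(poll_based_capture(changes, 30))  # two events, "pending" is never seen
```

With a 30-second poll, the intermediate "pending" state is overwritten before the first poll runs, so downstream consumers only ever see "shipped" and "delivered".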
Anyway, Kafka Connect is pretty cool - it will auto create Kafka topics for you based on your table names, including posting the Avro schemas to Schema Registry. (The topic names are knowable, so if you're in an environment with auto topic creation = false, well you can create the topics manually yourself based on the table names). Starting from no Kafka Connect knowledge it took me maybe 2 hours to figure out enough of the configuration to dump a large SQL Server database to Kafka.
I found additional documentation in a Github repository of a Confluent employee describing all this, with documentation of the settings, etc.
There's always the option of having your web app be a Kafka producer itself, and ignore the lower level database stuff. This may be a better solution, like if a request creates a number of records across the data store, but really it's one related event (an Order may spawn off some LineItem records in your relational database, but the downstream database only cares that an order was made).
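A minimal sketch of that idea: instead of mirroring each inserted row, the application builds one event per business action. The event shape, field names and the "OrderPlaced" type are invented for illustration, and the actual send is left out because it depends on which Kafka client library the application uses.

```python
import json


def order_placed_event(order, line_items):
    """Collapse an order row and its line item rows into one business event."""
    return {
        "type": "OrderPlaced",  # hypothetical event name
        "order_id": order["id"],
        "customer_id": order["customer_id"],
        "items": [
            {"sku": item["sku"], "quantity": item["quantity"]} for item in line_items
        ],
        "total_quantity": sum(item["quantity"] for item in line_items),
    }


def serialize(event):
    """Return (key, value) bytes; keying by order id keeps one order's events in order."""
    key = str(event["order_id"]).encode("utf-8")
    value = json.dumps(event, sort_keys=True).encode("utf-8")
    return key, value


if __name__ == "__main__":
    order = {"id": 1001, "customer_id": 7}
    items = [{"sku": "A-1", "quantity": 2}, {"sku": "B-9", "quantity": 1}]
    key, value = serialize(order_placed_event(order, items))
    # A Kafka producer would send (key, value) to a topic here.
    print(key, value)
```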
On the consumer end (ie "next to" your other database) you could either use Kafka Connect on the other end to pick up changes, maybe even writing a custom plugin if required, or write your own Kafka consumer microservice to put the changes into the other database.
We have a use case where we are using Kafka Connect to source and sink data. It's like a typical ETL.
We want to understand if Kafka connect can identify the delta changes between previous streams. i.e. we want to send only the changed data to client and not the whole table or view. Also, we prefer not to execute explicit code to identify changes via a query on source and destination DB.
Is there any preferred approach towards it?
As gasparms said, use a CDC tool to pull all change events from your database. You can then use Kafka Streams or KSQL to filter, join, and aggregate as required by your ETL.
What's your source system that you want to get data from? For Oracle (and several other sources), as of GoldenGate 12.3.1 they actually bundle the Kafka Connect handler as part of the download itself. You also have other options such as DBVisit.
For open-source DBs then Debezium definitely fits the bill, and there is a nice tutorial here.
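As an illustration of the filtering step, here is a Python sketch of the logic one might express in Kafka Streams or KSQL, applied to Debezium-style change events with `op`, `before` and `after` fields. It assumes `before` is populated for updates, which depends on the source database and its configuration; the sample events are invented.

```python
def changed_columns(event):
    """Return {column: (old, new)} for columns whose value differs in an update."""
    before = event.get("before") or {}
    after = event.get("after") or {}
    return {
        column: (before.get(column), new_value)
        for column, new_value in after.items()
        if before.get(column) != new_value
    }


def only_real_changes(events):
    """Drop update events that did not actually change any column value."""
    for event in events:
        if event["op"] == "u" and not changed_columns(event):
            continue
        yield event


if __name__ == "__main__":
    events = [
        {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
        {"op": "u", "before": {"id": 1, "status": "new"},
         "after": {"id": 1, "status": "new"}},   # no-op update
        {"op": "u", "before": {"id": 1, "status": "new"},
         "after": {"id": 1, "status": "paid"}},
    ]
    for event in only_real_changes(events):
        print(event["op"], changed_columns(event))
```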
Have you looked at a CDC (Change Data Capture) approach? There are several connectors that read a commit log or similar in a database and stream events. Using these events you'll get every change in a table.
Examples
Oracle Golden Gate -
http://www.oracle.com/technetwork/middleware/goldengate/oracle-goldengate-exchange-3805527.html
Postgres - https://github.com/debezium
MySQL - https://github.com/debezium
I wonder if it is possible, or if someone has tried, to set up Apache Kafka as a consumer of the PostgreSQL logical log stream? Does that even make sense?
https://wiki.postgresql.org/wiki/Logical_Log_Streaming_Replication
I have a legacy source system that I need to build a realtime dashboard from. For some reasons I can't hook into the application events (btw, it's a Java app). Instead, I'm thinking of some kind of lambda architecture: when the dashboard initializes, it reads persisted data from a "data warehouse", which gets there after some ETL. Change events are then streamed via Kafka to the dashboard.
Another use of the events stored in Kafka would be a kind of change data capture approach for data warehouse population. This is necessary because there is no commercial CDC tool that supports PostgreSQL, and the source application updates tables without keeping history.
A combination of xstevens' PostgreSQL WAL to protobuf project - decoderbufs (https://github.com/xstevens/decoderbufs) - and his pg_kafka producer (https://github.com/xstevens/pg_kafka) might be a start.
Take a look at Bottled Water which:
uses the logical decoding feature (introduced in PostgreSQL 9.4) to extract a consistent snapshot and a continuous stream of change events from a database. The data is extracted at a row level, and encoded using Avro. A client program connects to your database, extracts this data, and relays it to Kafka
They also have Docker images so looks like it'd be easy to try it out.
The Debezium project provides a CDC connector for streaming data changes from Postgres into Apache Kafka. Currently it supports Decoderbufs and wal2json as logical decoding plug-ins. Bottled Water referenced in Steve's answer is comparable, but it is not actively maintained any longer.
Disclaimer: I'm the project lead of Debezium
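For orientation, a sketch of what a Debezium Postgres connector config restricted to a few tables might look like, assembled in Python. Property names vary between Debezium releases (for instance `table.include.list` was previously `table.whitelist`, and the logical name moved from `database.server.name` to `topic.prefix`), so check them against the version you run; host, credentials, slot name and table names are placeholders.

```python
import json


def debezium_postgres_config(host, dbname, user, password, tables,
                             plugin="decoderbufs", logical_name="pgserver1"):
    """Build a Debezium Postgres connector config limited to the given tables."""
    return {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": host,
        "database.port": "5432",
        "database.user": user,
        "database.password": password,
        "database.dbname": dbname,
        # Logical decoding plug-in installed on the server, e.g. decoderbufs or wal2json.
        "plugin.name": plugin,
        "slot.name": "debezium_" + logical_name,
        # Assumption: newer releases use topic.prefix instead of database.server.name.
        "database.server.name": logical_name,
        # Emit change events only for these schema-qualified tables.
        "table.include.list": ",".join(tables),
    }


if __name__ == "__main__":
    # All values are hypothetical.
    config = debezium_postgres_config(
        host="db.example.internal",
        dbname="shop",
        user="debezium",
        password="change-me",
        tables=["public.orders", "public.customers"],
    )
    print(json.dumps({"name": "shop-postgres-cdc", "config": config}, indent=2))
```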