Kafka Connect - Handling delete table events

We're currently in the process of syncing data from an Oracle database into a new PostgreSQL instance.
For inserts/updates this works fine, but the deletes are problematic:
When something, like a customer, gets deleted, the customer is removed from the customer table and moved into a customer_deleted table.
So I thought: I can listen on both tables to get the inserts, updates and deletes.
Then I would write a Kafka streams application which merges those two topics into one.
But that could result in a race condition, e.g. when an update happens after a delete.
So what would be a way to handle this? Maybe by joining the streams in a time window? Is this actually solvable at all?
P.S. I know Debezium can capture deletes, but $20k for the Golden Gate license is just too much for my case :(.
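One possible way to make the merge insensitive to arrival order is to keep only the newest event per key. The sketch below assumes that events from both tables carry a comparable, monotonically increasing source marker (called `version` here, for example a commit sequence number or a reliable modification timestamp). The question does not say such a column exists, and `ChangeEvent`, `merge_changes` and `live_rows` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    key: str                        # primary key of the customer row
    version: int                    # ASSUMPTION: increasing source-side marker
    deleted: bool                   # True if read from the customer_deleted topic
    payload: Optional[dict] = None  # row contents for inserts/updates


def merge_changes(events: Iterable[ChangeEvent]) -> Dict[str, ChangeEvent]:
    """Keep, per key, only the event with the highest source version.

    Arrival order does not matter: a stale update that shows up after a
    newer delete is ignored because its version is lower.
    """
    latest: Dict[str, ChangeEvent] = {}
    for event in events:
        current = latest.get(event.key)
        if current is None or event.version > current.version:
            latest[event.key] = event
    return latest


def live_rows(latest: Dict[str, ChangeEvent]) -> Dict[str, Optional[dict]]:
    """Rows that should exist in the target after the merge."""
    return {k: e.payload for k, e in latest.items() if not e.deleted}
```

Without a trustworthy ordering marker from the source, this approach does not apply, and a time-window join can only narrow the race rather than remove it.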

Related

Debezium: Produce messages only upon changes to columns in column.include.list

I am running Debezium with column.include.list configured to a subset of columns on each of the observed tables of the source MySQL database. Changes to records on the source tables are being successfully published to Kafka, with each message's values before & after only containing that subset of columns.
However, Debezium still publishes messages when changes occur on columns of the observed tables that are not in column.include.list. Those events are unnecessary to my downstream consumers, so I'd like to prevent them from being produced.
I only want changes to columns found in column.include.list to produce messages to Kafka. What is the preferred way to do this?
Using SMT Filtering seems like one way to do it—a filter that compares the before & after values and filters out any messages in which there is no difference. Is there a simpler way? Maybe a config for this behavior I missed in my search?

(From our discussion in the comments)
There's an open issue describing exactly what you expect, but it has not yet been picked up for development.
https://issues.redhat.com/browse/DBZ-2979
Thus it seems currently you need to rely on SMTs to filter events/messages not related to monitored columns.
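The comparison such a filter would perform can be sketched in plain Python. This is only an illustration of the predicate, not Debezium's filter SMT itself; `has_relevant_change` is a hypothetical helper, and the event is assumed to have been reduced to its `before` and `after` row images as dictionaries.

```python
from typing import Any, Iterable, Mapping, Optional

Row = Optional[Mapping[str, Any]]


def has_relevant_change(before: Row, after: Row, columns: Iterable[str]) -> bool:
    """Return True if the event should be forwarded to Kafka."""
    # Inserts (no before image) and deletes (no after image) are always kept.
    if before is None or after is None:
        return True
    # Updates are kept only if at least one monitored column differs.
    return any(before.get(col) != after.get(col) for col in columns)
```

For example, an update where `before` and `after` agree on every monitored column returns `False` and would be dropped.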

What is the correct streaming pattern to replace database table polling?

I am trying to architect an event streaming system to replace our existing database table polling mechanism. We currently have a process where Application ABC will query/scan the entire XYZ (MySQL) table every 5 minutes so that we may get any updates to our data and cache them on Application ABC. As our data grows this will not be scalable or performant.
Instead, I want to have Application ABC read from a Kafka stream that contains any new events around the XYZ table, and use that to modify Application ABC's in-memory cache.
Where I'm having a hard time formulating a good solution is the initial database table load onto the Kafka stream. Since all the XYZ data that would be consumed by Application ABC is cached, we lose that data when we redeploy all of the Application ABC nodes. So we would need some kind of mechanism to be able to get all the XYZ data from the initial load onto the stream. I know Kafka streams are supposed to allow for infinite retention but I'm not sure if infinite retention is a realistic solution in this case due to cost.
What's the usually prescribed solution around this initial load case where Application ABC would need to reload the entire database again off of the stream (every time a new instance is spun up)? Also trying to think about what is the most performant solution here so that Application ABC has the lowest latency to be able to gather all the data it needs from XYZ Table.
Another constraint to mention is that Application ABC needs to have this data in memory for performance reasons. We need to be able to iterate over the entire XYZ data set at all times. We cannot do simple queries by ID.

There is a bit to unpack here, but here is some info.
Instead of polling the DB, consider using a source connector to get the data into Kafka. Debezium is made for this. You haven't specified what type of database you are using, but it does support quite a few variants. The mechanism is called CDC (Change Data Capture), and it needs to be enabled on the database and on each of the tables first.
As for the Application ABC side, consider using a distributed cache with persistence enabled. Redis is a good option for this. This way it will retain the data even if your application is restarted. Reloading all the data back from Kafka is not a good idea: it will take a long time (depending on the amount of data), and the application will be unavailable for that duration after a restart.
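Whichever cache is used, the consumer side boils down to applying each change event to a keyed store. The following is a minimal sketch using a plain dictionary in place of Redis. It assumes a simplified Debezium-style event with `op`, `before` and `after` fields and a primary key column named `id`; both the reduced event shape and the `apply_change` helper are assumptions made for illustration.

```python
from typing import Any, Dict, Mapping


def apply_change(cache: Dict[Any, dict], event: Mapping[str, Any],
                 key_column: str = "id") -> None:
    """Apply one change event to an in-memory cache.

    Assumed event shape:
        {"op": "c" | "u" | "d" | "r", "before": {...} | None, "after": {...} | None}
    """
    op = event["op"]
    if op in ("c", "u", "r"):      # create, update, or snapshot read
        row = dict(event["after"])
        cache[row[key_column]] = row
    elif op == "d":                # delete: drop the row if we have it
        cache.pop(event["before"][key_column], None)
    else:
        raise ValueError(f"unknown op: {op!r}")
```

Because every event overwrites or removes the entry for its key, replaying the same events twice leaves the cache in the same state, which helps after a consumer restart.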

System architecture - ETL

We are in the process of designing an ETL process, where we’ll be getting a daily account file (maybe half a million records, could grow) from client and we’ll be loading that file to our database.
Our current process splits the file into smaller files and loads them to staging. If the process fails, which happens sometimes, we try to figure out how many records we have processed and then start again from that point. Is there a better alternative to this?
We are thinking about using Kafka. I’m pretty new to Kafka. I would really appreciate some feedback on whether Kafka is the way to go, or whether we’re just over-engineering a simple ETL process where we load the data to a staging table and finally to the destination table.

Apache Kafka® is a distributed streaming platform. What exactly does that mean?
A streaming platform has three key capabilities:
Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
Store streams of records in a fault-tolerant durable way.
Process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
Building real-time streaming data pipelines that reliably get data between systems or applications
Building real-time streaming applications that transform or react to the streams of data
https://kafka.apache.org/intro
If errors force you to check the last committed record in your staging database, and you need a system that manages this automatically, Kafka can ease the process.
Though Kafka is built to work with massive data loads spread across a cluster, you can certainly use it for smaller problems and use its queuing functionality and offset management, even with one broker (server) and a low number of partitions (level of parallelism).
If you don't anticipate any scale at all, I would suggest considering RabbitMQ.
RabbitMQ is a message-queueing software also known as a message broker or queue manager. Simply said; it is software where queues are defined, to which applications connect in order to transfer a message or messages.
https://www.cloudamqp.com/blog/2015-05-18-part1-rabbitmq-for-beginners-what-is-rabbitmq.html
“How to know if Apache Kafka is right for you” by Amit Rathi
https://link.medium.com/enGzNaNvT4
In case you choose Kafka:
When you receive a file, create a process which iterates over all its lines and sends them to Kafka (Kafka Producer).
Create another process which continuously receives events from Kafka (Kafka Consumer) and writes them in mini batches to the database (similar to your small files).
Setup Kafka:
https://dzone.com/articles/kafka-setup
Kafka Consumer/Producer simple example:
http://www.stackframelayout.com/programowanie/kafka-simple-producer-consumer-example/
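The mini-batch consumer described above can be sketched without a broker. In the sketch below, a plain list of `(offset, account_id, payload)` tuples stands in for a Kafka topic partition, and SQLite stands in for the staging database. Storing the next offset in the same transaction as each batch is a design choice made here so that a restart resumes exactly where the last committed batch ended. The table names and the `init_db`, `next_offset` and `load_in_batches` helpers are all hypothetical.

```python
import sqlite3
from typing import Sequence, Tuple

# (offset, account_id, payload); the offset plays the role of a Kafka offset.
Record = Tuple[int, str, str]


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging "
        "(account_id TEXT PRIMARY KEY, payload TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS load_progress "
        "(id INTEGER PRIMARY KEY CHECK (id = 1), next_offset INTEGER NOT NULL)"
    )
    conn.execute(
        "INSERT OR IGNORE INTO load_progress (id, next_offset) VALUES (1, 0)"
    )
    conn.commit()


def next_offset(conn: sqlite3.Connection) -> int:
    row = conn.execute(
        "SELECT next_offset FROM load_progress WHERE id = 1"
    ).fetchone()
    return row[0]


def load_in_batches(conn: sqlite3.Connection, records: Sequence[Record],
                    batch_size: int = 1000) -> None:
    """Write records to staging in mini batches, resuming after a failure."""
    start = next_offset(conn)
    pending = [r for r in records if r[0] >= start]
    for i in range(0, len(pending), batch_size):
        batch = pending[i:i + batch_size]
        # One transaction per batch: rows and progress commit or roll back together.
        with conn:
            conn.executemany(
                "INSERT OR REPLACE INTO staging (account_id, payload) VALUES (?, ?)",
                [(account_id, payload) for _, account_id, payload in batch],
            )
            conn.execute(
                "UPDATE load_progress SET next_offset = ? WHERE id = 1",
                (batch[-1][0] + 1,),
            )
```

If the process dies in the middle of a batch, that batch is rolled back and `next_offset` still points at its first record, so there is no need to work out by hand how many records were processed.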

Don't assume importing data is as easy as dumping it in your database and having the computer handle all the processing work. As you've discovered, an automated load can have problems.
First, database ELT processes depreciate the hard drive. Do not stage the data into one table prior to inserting it in its native table. Your process should only import the data one time to its native table to protect hardware.
Second, you don't need third-party software to middle-man the work. You need control so you're not manually inspecting what was inserted. This means your process is to first clean / transform the data prior to import. You want to prevent all problems prior to load by cleaning and structuring and even processing the data. The load should only be an SQL insert script. I have torn apart many T-SQL scripts where someone thought it convenient to integrate processing with database commands. Don't do it.
Here's how I manage imports from spreadsheet reports. Excel formulas are better than learning ETL tools like SSIS. I use cell formulas to validate whether the record is valid to go into our system. This result is its own column, and then if that column is true, a concatenation column displays an insert script.
=if(J1, concatenate("('", A1, "', ", B1, "),"), "")
If the column is false, the concat column shows nothing. This allows me to copy/paste the inserts into SSMS and conduct mass inserts via "insert into table values" scripts.
If this is actually updating existing records, as your comment appears to suggest, then you need to master the data, organizing what's changed in logs for your users.
Synchronization steps:
1. Log what is there before you update
2. Download and compare local vs remote copies for differences; you cannot compare the two without a) having them both in the same physical location or b) controlling the other system
3. Log what you're updating with, and timestamp when you're updating it
4. Save and close the logs
5. Only when 1-4 are done should you post an update to production
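The compare step can be sketched as a keyed diff of two snapshots. This is a minimal illustration that assumes both copies have already been loaded into dictionaries keyed by primary key; `diff_snapshots` is a hypothetical helper, and the logging steps are left out.

```python
from typing import Any, Dict, Tuple

Snapshot = Dict[Any, dict]


def diff_snapshots(
    local: Snapshot, remote: Snapshot
) -> Tuple[Snapshot, Dict[Any, Tuple[dict, dict]], Snapshot]:
    """Split the differences between two snapshots into creates, updates, deletes.

    `local` is what we currently hold; `remote` is the incoming copy.
    """
    creates = {k: remote[k] for k in remote.keys() - local.keys()}
    deletes = {k: local[k] for k in local.keys() - remote.keys()}
    updates = {
        k: (local[k], remote[k])          # (old row, new row) for the log
        for k in local.keys() & remote.keys()
        if local[k] != remote[k]
    }
    return creates, updates, deletes
```

The old and new rows returned for each update are the values one would write to the logs before posting anything to production.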
My guide to synchronizing data sources and handling Creates/Updates/Deletes:
sync local files with server files

Integration of Kafka in Web Application

I have a Java-based web application that uses two Microsoft SQL Server backend databases (one is the live, transactional database and the other is a reporting database). The lag between the transactional and reporting databases is around 30 minutes; incremental data is loaded by a SQL job that runs every 30 minutes and takes around 20-25 minutes to execute. This job executes an SSIS package, through which data from the reporting database is further processed and stored in HDFS and HBase, which is eventually used for analytics.
Now, I want to reduce this lag and to do this, I am thinking of implementing a messaging framework. After doing some research, I learned that Kafka could solve my purpose since Kafka can also work as an ETL tool apart from being a messaging framework.
How should I proceed? Should I create topics similar to the table structures in SQL Server and perform operations on those? Should I redirect my application to write any change to Kafka first and then to the transactional database? Please advise on usage of Kafka considering the mentioned use case.

There are a couple of ways to do this that require minimal code, and there's always the option to write your own code.
(Some coworkers just finished looking at this, with SQL Server and Oracle, so I know a little about it.)
If you're using the Enterprise edition of SQL Server, you could use Change Data Capture and Confluent Kafka Connect to read all the changes to the data. This (seems to) require an Enterprise license and may include some additional cost (I was fuzzy on the details here. This may have been because we're using an older version of SQL Server or because we have many database servers).
If you aren't using, or can't use, CDC, Kafka Connect's JDBC support also has a mode where it polls the database for changes. This works best if your records have some kind of timestamp column, but that is usually the case.
A poll-only mode without CDC means you won't get every change, i.e. if you poll every 30 seconds and the record changes twice, you won't get an individual message for each change, but one message reflecting both changes, if that makes sense. This is probably acceptable for your business domain, but something to be aware of.
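The effect of timestamp-based polling can be shown with a small simulation. This is not the Kafka Connect JDBC connector itself, only a sketch of the idea: `poll_changes` is a hypothetical helper, and the "table" is a dictionary mapping a key to its current value and last-modified timestamp.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory "table": key -> (current value, updated_at)
Table = Dict[str, Tuple[str, int]]


def poll_changes(table: Table, last_seen: int) -> Tuple[List[Tuple[str, str, int]], int]:
    """Return rows modified after `last_seen`, plus the new high-water mark."""
    changed = sorted(
        ((key, value, ts) for key, (value, ts) in table.items() if ts > last_seen),
        key=lambda row: row[2],
    )
    new_mark = max((ts for _, _, ts in changed), default=last_seen)
    return changed, new_mark
```

If a row goes from `NEW` to `PAID` to `SHIPPED` between two polls, the second poll returns a single row with `SHIPPED`; the intermediate `PAID` state was overwritten in the table and is never seen.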
Anyway, Kafka Connect is pretty cool - it will auto create Kafka topics for you based on your table names, including posting the Avro schemas to Schema Registry. (The topic names are knowable, so if you're in an environment with auto topic creation = false, well you can create the topics manually yourself based on the table names). Starting from no Kafka Connect knowledge it took me maybe 2 hours to figure out enough of the configuration to dump a large SQL Server database to Kafka.
I found additional documentation in a Github repository of a Confluent employee describing all this, with documentation of the settings, etc.
There's always the option of having your web app be a Kafka producer itself, and ignore the lower level database stuff. This may be a better solution, like if a request creates a number of records across the data store, but really it's one related event (an Order may spawn off some LineItem records in your relational database, but the downstream database only cares that an order was made).
On the consumer end (i.e. "next to" your other database) you could either use Kafka Connect to pick up changes, maybe even writing a custom plugin if required, or write your own Kafka consumer microservice to put the changes into the other database.

Informix to Postgres, continuous data replication algorithm

The master server is Informix, version varies from 9.40 to the latest, database is unlogged by design that can't be changed. Slave server is the latest PostgreSQL. Master and slave are separate machines, network latency is unpredictable. Master schema is statically defined, well known and does not change, so it's only the data that needs to be replicated. In the master, there are three types of tables:
Numeric data tables, usually one date column, one time column and 15-300 int columns keyed by 2-3 primary keys. The data is never changed, only added once in a set interval (15, 30, or 60 minutes) and deleted when the retention point is reached. Replication data set can be up to 80,000 rows but usually is in the range of hundreds. This data needs to be replicated one way, master to slave. There are about 30 tables of this type and they need to be replicated all at once and as fast as possible, typically in under one minute after a new interval set has been committed to the master.
Mixed data tables, with date, time, int, and string types, 30-100 columns, again 2-3 primary keys. This data is also never changed, added continuously and is deleted when the retention point is reached. The data set is up to 100,000 rows per hour. One way replication is needed, master to slave. There are a few tables like that, less than 5 usually.
Mixed data tables, with int and string types, less than 10 columns, 2-3 primary keys. The data largely stays intact, with occasional additions, edits or deletions. The usual replication set size is unpredictable, but probably will be in low hundreds of rows. This data needs to be replicated both ways, as fast as possible. There are a few tables of this type, and they need to be synched independently.
I've been looking for an existing tool that could do what I need, but it looks like there is none that is open source. I'm probably going to write one for my needs, and I'm looking for advice from DB gurus on how to approach this task.
In my estimate, there's probably no single algorithm that would cover all the use cases so I may be in fact looking for two or three algorithms. Here's what I found so far:
Fire trigger on master changes, record row OIDs (does Informix have them?) to temp table, dump the changed rows to a file, transfer it and load up. Question: how to buffer the trigger? The master DB is unlogged (no transactions), so trigger will fire upon each INSERT. Additional strain on the master, not good.
Add a cron job on the slave that will pull latest date/time keys from the master, and if the data is newer, pull it. Problem: although the update interval is defined, in reality it's based on the data source clock (not master DB clock) which is guaranteed to vary from slave server clock. Moreover, there can be several data sources, each with varying clocks, and the data needs to be replicated ASAP. The only way here that I see is to constantly poll the master from the slave, hoping that by the time the poll comes in, the data is all committed (no transactions, remember?). Kludgy, slow, not good.
Add Informix as foreign data wrapper in the Postgres and run queries directly instead of bothering with replication. Pros: simplicity. Cons: Informix connector seems to be in alpha stage, and the whole approach is an unknown factor at best.
I've been researching this topic for some time, and it seems that the core of the problem is the lack of transactions on the master side. If the master DB was logged, it would be much easier to replicate it, but without transactions the task suddenly becomes much more complicated. For one, how do I ensure that there are no dupes? Another one, how to avoid update loops in type 3 tables? Considering all that, how to make replication as fast-reacting as possible? I mean the delay between data update and sync start here, data transfer is another topic altogether.
Any input is appreciated.

If you can't change the master in any significant way you are going to have a heck of a time with any sort of replication. Your basic problem is that you have no real way to handle replicating changes in real time without tracking which changes have been replicated, and if you can't change the master, you can't add that. So the short answer is that replication is not a solution which can work for you. Given some of Informix's other features I would think twice about going about this as continuous replication.
This leads to other approaches. The big unknown factor is that networks may not be reliable enough to just link the databases. This could lead to anything from transactions hanging while waiting for data off a high-latency connection to all kinds of other problems. You might be able to get this to work with an odbc fdw and an informix provider or with DBI-Link and DBD::Informix, but this strikes me as a problem in your current environment. You could, however, use these in a cron job to periodically populate a second PostgreSQL server closer to your own location, so I would not write the approach off entirely.
One way or another it seems to me you need to get a copy of the data to your PostgreSQL server. You may want to do an ETL job to import the data periodically. You may want to use a secondary PostgreSQL server and FDWs or DBI-Link to pull in the data. But this is not likely to be real-time, and it is not likely to be continuous.
The tl;dr is that your environment isn't really set up to do this. For my money I would recommend an ETL approach and accept that your slave will not be in sync with the master.
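For the append-only tables, a periodic ETL pull can be made safe against duplicates by re-reading a trailing window from the master on every run and letting the primary key on the PostgreSQL side reject rows that were already copied. The sketch below illustrates only that idea. It uses SQLite as a stand-in for the target, a hypothetical `samples` table keyed by `(interval_ts, source_id)`, and hypothetical helpers `init_replica` and `apply_pull`; the query against Informix is not shown and would depend on the driver in use.

```python
import sqlite3
from typing import Iterable, Tuple

# (interval_ts, source_id, value); the first two columns form the primary key.
Row = Tuple[str, int, int]


def init_replica(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "interval_ts TEXT NOT NULL, source_id INTEGER NOT NULL, value INTEGER, "
        "PRIMARY KEY (interval_ts, source_id))"
    )
    conn.commit()


def count_rows(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]


def apply_pull(conn: sqlite3.Connection, rows: Iterable[Row]) -> int:
    """Insert pulled rows, skipping any whose key already exists.

    Returns the number of rows that were actually new.
    """
    before = count_rows(conn)
    with conn:  # commit the whole pull, or nothing
        conn.executemany("INSERT OR IGNORE INTO samples VALUES (?, ?, ?)", list(rows))
    return count_rows(conn) - before
```

Because overlapping pulls are harmless, each run can start somewhat before the last key it saw, which also picks up rows that reached the master late. This does not cover the two-way tables, where edits and deletions would need a different approach.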
