How to create Kafka events from a database? - apache-kafka

There is a legacy service that writes values to the database.
I need to convert those values to events and then send them to Kafka.
I'm going to build a service that checks for new records on a fixed delay, sends them to Kafka, and writes the IDs of the submitted records to a technical table. But maybe there is some other way, a best practice, or a pattern for this?
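For reference, a minimal sketch of the polling approach described above might look like the following (the table names orders and published_ids, the columns, the JDBC URL, and the topic name are all hypothetical placeholders):

```java
import java.sql.*;
import java.util.Properties;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

public class PollingPublisher {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/appdb", "app", "secret")) {

            while (true) {
                // Select rows that have not been published yet (hypothetical schema).
                try (PreparedStatement select = conn.prepareStatement(
                        "SELECT id, payload FROM orders o " +
                        "WHERE NOT EXISTS (SELECT 1 FROM published_ids p WHERE p.id = o.id)");
                     ResultSet rs = select.executeQuery()) {

                    while (rs.next()) {
                        long id = rs.getLong("id");
                        String payload = rs.getString("payload");

                        // Send the event and wait for the broker ack before marking it as published.
                        producer.send(new ProducerRecord<>("orders-events", String.valueOf(id), payload)).get();

                        try (PreparedStatement mark = conn.prepareStatement(
                                "INSERT INTO published_ids (id) VALUES (?)")) {
                            mark.setLong(1, id);
                            mark.executeUpdate();
                        }
                    }
                }
                Thread.sleep(5_000); // fixed delay between polls
            }
        }
    }
}
```

Because the send is acknowledged before the ID is recorded, a crash between the two steps can at worst produce a duplicate event, not a lost one.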

You may want to look into Debezium, which implements Change Data Capture (CDC) on relational and NoSQL data stores and streams the changes into Kafka.
https://github.com/debezium/debezium
https://debezium.io/documentation
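For illustration, assuming a running Kafka Connect cluster with the Debezium plugin installed and a source database set up for CDC (e.g. logical decoding on PostgreSQL), a connector can be registered through Connect's REST API. The sketch below (Java 15+ text block) uses the PostgreSQL connector with placeholder hosts, credentials, and table list; exact property names differ between Debezium versions and some required settings are omitted, so treat it as a rough outline and check the documentation linked above:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterDebeziumConnector {

    public static void main(String[] args) throws Exception {
        // Connector configuration as plain JSON; property names follow the Debezium
        // PostgreSQL connector docs and vary between Debezium versions
        // (e.g. "topic.prefix" was "database.server.name" before Debezium 2.0).
        String config = """
            {
              "name": "appdb-connector",
              "config": {
                "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
                "database.hostname": "db-host",
                "database.port": "5432",
                "database.user": "cdc_user",
                "database.password": "cdc_password",
                "database.dbname": "appdb",
                "topic.prefix": "appdb",
                "table.include.list": "public.orders"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://connect-host:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```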

Related

SQL Server Data to Kafka in real time

I would like to add real-time data from SQL Server to Kafka directly, and I found there is a SQL Server connector provided by https://debezium.io/docs/connectors/sqlserver/
The documentation says that it will create one topic per table. I am trying to understand the architecture: I have 500 clients, which means 500 databases, each with 500 tables. Does that mean it will create 250,000 topics, or do I need a separate Kafka cluster for each client, where each cluster/node has 500 topics based on the number of tables in the database?
Is this the best way to send SQL data to Kafka, or should we send an event to Kafka through code whenever there is an insert/update/delete on a table?
With Debezium you are stuck with a one-table-to-one-topic mapping. However, there are creative ways to get around it.
Based on the description, it looks like you have a product with a SQL Server backend that has 500 tables, and that this product is used by 500 or more clients, each with their own instance of the database.
You can create a connector for one client, read all 500 tables, and publish them to Kafka. At this point you will have 500 Kafka topics. You can route the data from all other database instances to the same 500 topics by creating separate connectors for each client / database instance. I am assuming that since this is the backend database for a product, the table names, schema names, etc. are all the same, and the Debezium connector will generate the same topic names for the tables. If that is not the case, you can use the topic-routing SMT.
You can differentiate the data in Kafka by adding a few metadata columns in the topic. This can easily be done in the connector by adding SMTs. The metadata columns could be client_id, client_name, or something else.
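As a rough illustration of those two ideas, each per-client connector configuration could carry an SMT chain like the fragment below, written here as a Java map of Connect properties. RegexRouter and InsertField are the standard Kafka Connect transforms; the topic pattern, shared prefix, and client value are made-up placeholders:

```java
import java.util.Map;

public class ClientConnectorSmtConfig {

    // Fragment of a per-client connector configuration (values are placeholders).
    static final Map<String, String> SMT_CONFIG = Map.of(
            "transforms", "route,addClient",

            // Route client-specific topic names onto the shared per-table topics,
            // e.g. "client001.dbo.orders" -> "shared.dbo.orders".
            "transforms.route.type", "org.apache.kafka.connect.transforms.RegexRouter",
            "transforms.route.regex", "client001\\.(.*)",
            "transforms.route.replacement", "shared.$1",

            // Stamp every record value with the client it came from.
            "transforms.addClient.type", "org.apache.kafka.connect.transforms.InsertField$Value",
            "transforms.addClient.static.field", "client_id",
            "transforms.addClient.static.value", "client001"
    );
}
```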
As for your other question,
Is this the best way to send SQL data to Kafka, or should we send an event to Kafka through code whenever there is an insert/update/delete on a table?
The answer is "it depends!".
If it is a simple transactional application, I would simply write the data to the database and not worry about anything else.
The answer also depends on why you want to deliver data to Kafka. If you are looking to deliver data / business events to Kafka to perform some downstream business processing requiring transactional integrity and strict SLAs, writing the data from the application may make sense. However, if you are publishing data to Kafka to make it available for others to use for analytical or other reasons, the Kafka Connect approach makes sense.
There is a licensed alternative, Qlik Replicate, which is capable of something very similar.

Reprocess enriched kstream data on insert/updates of KTable

Here is a hypothetical but analogous scenario to what I am trying to achieve using Kafka Streams.
I have streaming data, sales, that I want to enrich with infrequently changing lookup data, say users and items, for which I am planning to create KTables. I plan to push this enriched data to a topic and to a search engine using a Connect sink.
How do I ensure that updates to the user/item data trigger enrichment of past sales data as well, not only the new data ingested into the stream? As I understand it, KTable inserts/updates don't trigger any reprocessing of past data in the stream.
I believe this may be a common use case; at least I may not be the first one to have such a need. Any guidance on a solution or workarounds?
If you want to update old data, it implies that you want to do a table-table join. Note though that, for this case, all data from both inputs needs to be held by the application.
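A minimal sketch of such a table-table join, assuming (purely for illustration) that the sales topic is keyed by user ID so both tables share a key, with String serdes and placeholder topic names:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class SalesEnrichmentTopology {

    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Both inputs are read as tables, so their full state is materialized locally.
        KTable<String, String> salesByUser = builder.table("sales-by-user",
                Consumed.with(Serdes.String(), Serdes.String()));
        KTable<String, String> users = builder.table("users",
                Consumed.with(Serdes.String(), Serdes.String()));

        // A KTable-KTable join re-emits an updated enriched record whenever
        // either the sale or the user row for that key changes.
        KTable<String, String> enriched = salesByUser.join(users,
                (sale, user) -> sale + " | " + user);

        enriched.toStream().to("enriched-sales",
                Produced.with(Serdes.String(), Serdes.String()));
        return builder;
    }
}
```

Because both inputs are KTables, an update to a user row re-emits the enriched result for that key, which is the behaviour asked about; the trade-off, as noted above, is that both inputs are fully materialized in the application's state stores.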

Why do we need a database when using Apache Kafka?

According to the schema, data comes into Kafka, then into a stream, and then into MapR-DB.
After the data is stored in the DB, the user can display it on the map.
The question is: why do we use a DB to display data on the map if Kafka is already a DB?
It seems to me that it is slower to get real-time data from MapR-DB than from Kafka.
What do you think, why does this example use this approach?
The core abstraction Kafka provides for a stream of records is known as a topic. You can imagine topics as the tables in a database. A database (Kafka) can have multiple tables (topics). As in databases, a topic can contain any kind of records, depending on the use case. But note that Kafka is not a database.
Also note that in most cases you will have to configure a retention policy. This means that messages will at some point be deleted, based on a configurable time- or size-based retention policy. Therefore, you need to store the data in a persistent storage system, and in this case that is your database.
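For example, a topic's time-based retention can be adjusted with the AdminClient; the bootstrap address, topic name, and the seven-day value below are placeholders:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetTopicRetention {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "sensor-events");
            // Keep records for seven days; older records become eligible for deletion.
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention))).all().get();
        }
    }
}
```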
You can read more about how Kafka works in this blog post.

JDBC sink connector insert/upsert based on max timestamp?

I'm very new to Kafka Connect.
I am inserting records from multiple sources into one table.
In some cases, it may be possible for some records to reach before others.
Since I cannot control which source will pull which record first, I want to add a check on the timestamp key of the record.
I have a key called "LastModified_timestamp" in my schema where I store the timestamp of the latest state of my record.
I want to add a check to my JDBC sink connector so that I can upsert a record based on comparing the value of LastModified_timestamp.
I want to ignore records which have an older timestamp and only upsert/insert the latest one. I couldn't find any configuration to achieve this.
Is there any way by which I can achieve this?
Will writing a custom query help in this case?
The JDBC Sink connector does not support this kind of feature. You have two options to consider:
Single Message Transforms (SMTs) - these apply logic to records as they pass through Kafka Connect. SMTs are great for things like dropping columns, changing datatypes, etc., BUT they are not appropriate for more complex processing and logic, including logic which needs to span multiple records, as yours does here.
Process the data in the source Kafka topic first, to apply the necessary logic. You can do this with Kafka Streams, KSQL, and several other stream processing frameworks (e.g. Spark, Flink, etc.). You'd need some kind of stateful logic that could work out whether a record is older than one already processed (see the sketch at the end of this answer).
Can you describe more about your upstream source for the data? It might be there's a better way to orchestrate the data coming through to enforce the ordering.
A final idea would be to land all records to your target DB and then use logic in your database query consuming it to select the most recent (based on LastModified_timestamp) record for a given key.
Disclaimer: I work for Confluent, the company behind the open-source KSQL project.
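Expanding on the second option above (processing the topic before the sink), here is a hedged Kafka Streams sketch that keeps only the record with the greatest LastModified_timestamp per key. It assumes String-serialized JSON values, that Jackson is on the classpath, and placeholder topic names; the JDBC sink connector would then consume the output topic:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.Properties;

public class LatestByTimestamp {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Extracts LastModified_timestamp from a JSON value (assumed to be epoch millis).
    static long lastModified(String json) {
        try {
            return MAPPER.readTree(json).get("LastModified_timestamp").asLong();
        } catch (Exception e) {
            return Long.MIN_VALUE; // treat unparseable records as oldest
        }
    }

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("raw-records", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                // Keep whichever record has the greater LastModified_timestamp.
                .reduce((current, incoming) ->
                        lastModified(incoming) >= lastModified(current) ? incoming : current)
                .toStream()
                .to("latest-records", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "latest-by-timestamp");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        new KafkaStreams(builder.build(), props).start();
    }
}
```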

How to process a Kafka KStream and write to a database directly instead of sending it to another topic

I don't want to write the processed KStream to another topic; I want to write the enriched KStream directly to a database. How should I proceed?
You can implement a custom Processor that opens a DB connection and apply it via KStream#process(). Cf. https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#applying-processors-and-transformers-processor-api-integration
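A rough sketch of such a custom processor, assuming the typed Processor API from Kafka Streams 3.3+ and a plain JDBC connection (the URL, credentials, and table name are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

// Writes each enriched record synchronously to a database table.
public class DbWriterProcessor implements Processor<String, String, Void, Void> {

    private Connection connection;

    @Override
    public void init(ProcessorContext<Void, Void> context) {
        try {
            connection = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/appdb", "app", "secret");
        } catch (Exception e) {
            throw new RuntimeException("Could not open DB connection", e);
        }
    }

    @Override
    public void process(Record<String, String> record) {
        // Synchronous write: the statement completes before the next record is processed.
        try (PreparedStatement stmt = connection.prepareStatement(
                "INSERT INTO enriched_events (event_key, payload) VALUES (?, ?)")) {
            stmt.setString(1, record.key());
            stmt.setString(2, record.value());
            stmt.executeUpdate();
        } catch (Exception e) {
            throw new RuntimeException("DB write failed", e);
        }
    }

    @Override
    public void close() {
        try {
            if (connection != null) connection.close();
        } catch (Exception ignored) {
        }
    }
}
```

It could then be wired in with enrichedStream.process(DbWriterProcessor::new), at which point the caveats below apply.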
Note that you will need to do synchronous writes into your DB to guard against data loss.
Thus, not writing back to a topic has multiple disadvantages:
reduced throughput because of sync writes
you cannot use exactly-once semantics
coupling your application to the database (if the DB goes down, your app goes down too, as it can't write its results anymore)
Therefore, it's recommended to write the results back into a topic and use Connect API to get the data into your database.