How do we check the number of records loaded so far into the DB from a Kafka topic? - postgresql

I'm trying to load data from a Kafka topic into Postgres using the JDBC sink connector. How can we find out the number of records loaded into Postgres so far? At the moment I keep checking the number of records in the DB with a SQL query. Is there any other way to find this out?

Kafka Connect doesn't track this. I see nothing wrong with SELECT COUNT(*) on the table; however, the count will also include rows written by any other process that writes to that table.

This is not possible in Kafka itself, because once the records have been sunk into the target DB, Kafka has done its job. You can, however, track the number of records you are writing by counting the sink record collections and writing the count to a local file or inserting it into a Kafka state store.
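As a rough complement to the SQL count, the sink's progress can be estimated from consumer offsets. The sketch below is only illustrative: it assumes you have already obtained the committed offsets of the connector's consumer group (by default named connect-<connector name>) and the topic's beginning and end offsets, for example from the kafka-consumer-groups tool. The function names and the sample numbers are hypothetical, and offsets count records consumed, not rows present in Postgres (upserts, deletes, and offset gaps make the two differ).

```python
def consumed_so_far(committed, beginning):
    """Records the sink's consumer group has moved past, summed over partitions.

    Both arguments map partition number -> offset. A partition with no
    beginning offset listed is assumed to start at 0.
    """
    return sum(
        max(0, offset - beginning.get(partition, 0))
        for partition, offset in committed.items()
    )


def remaining(committed, end):
    """Records still waiting in the topic (the lag), summed over partitions."""
    return sum(
        max(0, end_offset - committed.get(partition, 0))
        for partition, end_offset in end.items()
    )


# Hypothetical offsets for a three-partition topic.
committed = {0: 120, 1: 95, 2: 0}
beginning = {0: 0, 1: 0, 2: 0}
end = {0: 150, 1: 95, 2: 40}

print(consumed_so_far(committed, beginning))  # 215
print(remaining(committed, end))              # 70
```

A lag of zero suggests the connector has caught up with the topic, which is often the more useful signal than an absolute row count.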

Related

Is it a good practice to use the existing topic for multiple connectors?

I am using the Debezium PostgreSQL connector to get the users table into a Kafka Topic.
I have a JDBC Sink Connector that then reads the data from the topic and pushes it into its own database.
Now, I need a subset of the data for another Microservice Database. So I am planning to write another JDBC Sink Connector.
The question: is it good practice to use the existing users table topic? If yes, how can I make sure that the new JDBC connector gets a snapshot of the entire users table?
 
If Debezium snapshotted the table and data hasn't been lost in the topic due to retention, then that's what any sink or other consumer will read.
Any unique sink connector name will read unique offsets from its topic. Nothing bad will happen with multiple consumers reading the same topic; this is how Kafka is intended to be used.
You may need to ensure consumer.auto.offset.reset=earliest for Connect to read from the start of the topic.
To get a subset of fields, you'll need to "replace" them - https://docs.confluent.io/platform/current/connect/transforms/replacefield.html#replacefield
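The points above can be sketched as a connector definition. This is a minimal sketch under several assumptions, not a tested configuration: the connector name, connection URL, and field list are hypothetical; the per-connector consumer.override.* setting only takes effect if the worker's override policy allows it; the ReplaceField property is called include in recent versions and whitelist in older ones; and the topic values are assumed to be flat rows (a Debezium change envelope would normally need to be unwrapped first). The Python code only builds the JSON body that would be sent to the Connect REST API.

```python
import json


def subset_sink_config(name, topic, fields):
    """Build a JDBC sink connector definition that keeps only `fields`."""
    return {
        # A unique name gives the connector its own consumer group and offsets.
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
            "topics": topic,
            # Hypothetical target database for the second microservice.
            "connection.url": "jdbc:postgresql://localhost:5432/microservice_db",
            # Start from the beginning of the topic when no offsets exist yet.
            "consumer.override.auto.offset.reset": "earliest",
            # Keep only a subset of the value fields.
            "transforms": "keepFields",
            "transforms.keepFields.type":
                "org.apache.kafka.connect.transforms.ReplaceField$Value",
            "transforms.keepFields.include": ",".join(fields),
        },
    }


config = subset_sink_config("users-subset-sink", "users", ["id", "email"])
print(json.dumps(config, indent=2))
```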

In Kafka, how to handle deleted rows from source table that are already reflected in Kafka topic?

I am using a JDBC source connector with mode timestamp+incrementing to fetch a table from Postgres, using Kafka Connect. Updates to the data are reflected in the Kafka topic, but the deletion of records has no effect. So, my questions are:
Is there some way to handle deleted records?
How to handle records that are deleted but still present in kafka topic?
The recommendation is one of two options. Option 1) Make your source database append/update-only as well, marking deletions with a boolean or timestamp column that is filtered out when Kafka Connect queries the table.
If your database is running out of space, then you can delete old records, which should already have been processed by Kafka.
Option 2) Use CDC tools to capture delete events immediately rather than missing them in a periodic table scan. Debezium is a popular option for Postgres.
A Kafka topic can be seen as an "append-only" log. It keeps all messages for as long as you like, but Kafka is not built to delete individual messages out of a topic.
In the scenario you are describing it is common that the downstream application (consuming the topic) handles the information on a deleted record.
As an alternative, you could set the cleanup.policy of your topic to compact, which means it will eventually keep only the latest value for each key. If you define the key of a message as the primary key of the Postgres table, your topic will eventually delete the record when you produce a message with the same key and a null value into the topic. However:
I am not sure whether your connector is flexible enough to do this.
Depending on what you do with the data in the Kafka topic, this may still not solve your problem, as the downstream application will still read both records: the original one and the null message marking the deletion.
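The compaction behaviour described above can be illustrated with a small simulation. This is a toy model in plain Python, not Kafka code: the log is a list of (key, value) pairs, a None value stands for a tombstone, and the sample rows are made up. It shows that a consumer replaying the uncompacted log must handle the tombstone itself, and that it ends up with the same state the compacted topic would eventually hold.

```python
def compact(log):
    """Model of a fully compacted topic: latest value per key, tombstones dropped."""
    latest = {}
    for key, value in log:
        latest[key] = value
    return {key: value for key, value in latest.items() if value is not None}


def replay(log):
    """Model of a downstream consumer that applies every message in order."""
    table = {}
    for key, value in log:
        if value is None:
            table.pop(key, None)  # tombstone: remove the row if present
        else:
            table[key] = value    # insert or update
    return table


# Keys are the table's primary key; the last message deletes row 1.
log = [
    (1, {"name": "alice"}),
    (2, {"name": "bob"}),
    (1, {"name": "alice2"}),
    (1, None),
]

print(compact(log))  # {2: {'name': 'bob'}}
print(replay(log))   # {2: {'name': 'bob'}}
```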

Is it possible to use Kafka Connect to mirror an RDBMS table to a Kafka Stream?

I know it's possible to push updates from a database to a Kafka stream using Kafka Connect. My question is, can I create a consumer to write changes from that same stream back into the table without creating an infinite loop?
I'm assuming if I create a consumer that writes updates into the database table, it would trigger Connect to push that update to the stream, etc. Is there a way around this so I can mirror a database table to a stream?
You can stream from a Kafka topic to a database using the JDBC Sink connector for Kafka Connect.
You'd need to code in your business logic for avoiding an infinite replication loop into either the connectors or your consumer. For example:
JDBC Source connector uses a WHERE clause to only pull records with a flag set to indicate they are the original record
Custom Single Message Transform in the source connector to drop records with a flag set to indicate they are not the original record
Stream application (e.g. KSQL / Kafka Streams) processes the inbound stream of all database changes and keeps only those with a flag set to indicate they are the original record. This is inefficient, because you're still streaming everything from the database.
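The flag-based options above share one idea, which can be sketched in a few lines. This is a hedged illustration in plain Python rather than connector or Kafka Streams code: the is_original field name and the sample records are hypothetical, and in practice the filter would live in the source query, a Single Message Transform, or the stream application.

```python
def only_originals(changes, flag="is_original"):
    """Keep only change records that originated in the database itself."""
    return [change for change in changes if change.get(flag)]


def mark_as_replica(record, flag="is_original"):
    """Copy a record for writing back to the table, flagged as not original."""
    replica = dict(record)
    replica[flag] = False
    return replica


changes = [
    {"id": 1, "name": "alice", "is_original": True},
    {"id": 2, "name": "bob", "is_original": True},
]

# The consumer writes flagged copies back into the table ...
written_back = [mark_as_replica(change) for change in changes]

# ... so when the source side sees those rows, the filter drops them
# and the replication loop stops after one round.
print(only_originals(changes))       # both original records
print(only_originals(written_back))  # []
```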
Yes. It is possible to configure synchronisation/replication.

Querying MySQL tables using Apache Kafka

I am trying to use Kafka Streams for achieving a use-case.
I have two tables in MySQL - User and Account. And I am getting events from MySQL into Kafka using a Kafka MySQL connector.
I need to get all user-IDs within an account from within Kafka itself.
So I was planning to use a KStream on the MySQL output topic, process it to form an output, and publish it to a topic with the account-id as the key and the userIds, separated by commas (,), as the value.
Then I can use interactive query to get all userIds using account id, with the get() method of ReadOnlyKeyValueStore class.
Is this the right way to do this? Is there a better way?
Can KSQL be used here?
You can use Kafka Connect to stream data in from MySQL, e.g. using Debezium. From here you can use KStreams, or KSQL, to transform the data, including re-keying which I think is what you're looking to do here, as well as join it to other streams.
If you ingest the data from MySQL into a topic with log compaction enabled, Kafka retains at least the latest value for every key in the topic.
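The re-keying and aggregation step can be sketched outside Kafka to make the intended output concrete. This is a plain-Python model of what a Kafka Streams or KSQL aggregation would compute, not Streams code: the event field names (account_id, user_id, deleted) are assumptions for illustration and do not reflect any particular connector's message format.

```python
def users_by_account(events):
    """Fold user change events into account_id -> comma-separated user ids."""
    members = {}
    for event in events:
        ids = members.setdefault(event["account_id"], [])
        user = event["user_id"]
        if event.get("deleted"):
            if user in ids:
                ids.remove(user)   # user removed from the account
        elif user not in ids:
            ids.append(user)       # new user; updates do not duplicate
    return {
        account: ",".join(str(user) for user in ids)
        for account, ids in members.items()
    }


events = [
    {"account_id": 10, "user_id": 1},
    {"account_id": 10, "user_id": 2},
    {"account_id": 20, "user_id": 3},
    {"account_id": 10, "user_id": 1, "deleted": True},
]

print(users_by_account(events))  # {10: '2', 20: '3'}
```

The resulting mapping corresponds to what an interactive query against the state store would return for a given account id.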
I would take a look at Striim if you want built-in CDC and interactive continuous SQL queries on the streaming data in one UI. More info here:
http://www.striim.com/blog/2017/08/making-apache-kafka-processing-preparation-kafka/

Oracle GoldenGate for Big Data Kafka adapter: grouping data to Kafka

Source: Oracle Database
Target: kafka
I am moving data from source to target using the Oracle GoldenGate adapter for Big Data. The data moves fine, but when I insert 5 records they arrive as one file in the topic.
I want to control the grouping: if I make 5 inserts, I need five separate entries in the Kafka topic.
Kafka handler, GG for Big Data version 12.3.1.
I am inserting five records in the source, and in Kafka I am getting all the inserts as below:
{"table":"MYSCHEMATOPIC.ELASTIC_TEST","op_type":"I","op_ts":"2017-10-24 08:52:01.000000","current_ts":"2017-10-24T12:52:04.960000","pos":"00000000030000001263","after":{"TEST_ID":2,"TEST_NAME":"Francis","TEST_NAME_AR":"Francis"}}
{"table":"MYSCHEMATOPIC.ELASTIC_TEST","op_type":"I","op_ts":"2017-10-24 08:52:01.000000","current_ts":"2017-10-24T12:52:04.961000","pos":"00000000030000001437","after":{"TEST_ID":3,"TEST_NAME":"Ashfak","TEST_NAME_AR":"Ashfak"}}
{"table":"MYSCHEMATOPIC.ELASTIC_TEST","op_type":"U","op_ts":"2017-10-24 08:55:04.000000","current_ts":"2017-10-24T12:55:07.252000","pos":"00000000030000001734","before":{"TEST_ID":null,"TEST_NAME":"Francis"},"after":{"TEST_ID":null,"TEST_NAME":"updatefrancis"}}
{"table":"MYSCHEMATOPIC.ELASTIC_TEST","op_type":"D","op_ts":"2017-10-24 08:56:11.000000","current_ts":"2017-10-24T12:56:14.365000","pos":"00000000030000001865","before":{"TEST_ID":2}}
{"table":"MYSCHEMATOPIC.ELASTIC_TEST","op_type":"U","op_ts":"2017-10-24 08:57:43.000000","current_ts":"2017-10-24T12:57:45.817000","pos":"00000000030000002152","before":{"TEST_ID":3},"after":{"TEST_ID":4}}
I would recommend using the Kafka Connect Handler, since it then registers the data's schema with the Confluent Schema Registry, making it much easier to stream onwards to targets such as Elasticsearch (using Kafka Connect).
In Kafka each record from Oracle will be one Kafka message.
I made the change below in the .props file:
gg.handler.kafkahandler.mode=op
And it worked!
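With one operation per message, a consumer can handle each change individually. The sketch below is a minimal illustration in plain Python: it parses three of the JSON messages quoted in the question as separate payloads and counts them by op_type. The count_ops helper is hypothetical, and reading the messages from Kafka is left out.

```python
import json
from collections import Counter

# Three of the messages from the question, each treated as one Kafka message.
messages = [
    '{"table":"MYSCHEMATOPIC.ELASTIC_TEST","op_type":"I","op_ts":"2017-10-24 08:52:01.000000","current_ts":"2017-10-24T12:52:04.960000","pos":"00000000030000001263","after":{"TEST_ID":2,"TEST_NAME":"Francis","TEST_NAME_AR":"Francis"}}',
    '{"table":"MYSCHEMATOPIC.ELASTIC_TEST","op_type":"I","op_ts":"2017-10-24 08:52:01.000000","current_ts":"2017-10-24T12:52:04.961000","pos":"00000000030000001437","after":{"TEST_ID":3,"TEST_NAME":"Ashfak","TEST_NAME_AR":"Ashfak"}}',
    '{"table":"MYSCHEMATOPIC.ELASTIC_TEST","op_type":"D","op_ts":"2017-10-24 08:56:11.000000","current_ts":"2017-10-24T12:56:14.365000","pos":"00000000030000001865","before":{"TEST_ID":2}}',
]


def count_ops(raw_messages):
    """Count messages per operation type (I = insert, U = update, D = delete)."""
    return Counter(json.loads(raw)["op_type"] for raw in raw_messages)


print(count_ops(messages))  # Counter({'I': 2, 'D': 1})
```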