With the Kafka JDBC source connector I am getting only 1000 records/sec. How can I improve the record pulling rate?

I am using Kafka Connect with the JDBC source connector. The connector is working fine, but I am only able to get about 1000 messages/sec into the topic from the Oracle DB. I have tried most of the configuration settings, but no luck, in both standalone and distributed modes. Please help with this. Below is my JDBC source connector configuration:
curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d '{
"name": "ORA_SRC_DEVDB",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"connection.url": "jdbc:oracle:thin:#xxxxxxx/DBDEV",
"connection.user": "xxxxxx",
"connection.password": "xxxxxx",
"query": "select * from A.LOG_AUDIT",
"topic.prefix": "Topic_POC",
"tasks.max": "1",
"poll.interval.ms": "5000",
"batch.max.rows": "1000",
"table.poll.interval.ms": "60000",
"mode": "timestamp",
"timestamp.column.name": "MODIFIED_DATEnTIME" }
}'
The destination topic "Topic_POC" was created with 3 partitions and 3 replicas.

poll.interval.ms: frequency in ms to poll for new data in each table (default: 5000)
batch.max.rows: maximum number of rows to include in a single batch (default: 100)
In your case, every 5 seconds you are polling at most 1000 records from the DB. Decreasing poll.interval.ms and increasing batch.max.rows could improve the fetch rate.
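For illustration only, a hedged sketch of what the tuned part of the "config" block might look like (the exact values are assumptions and need to be measured against your Oracle DB; the rest of the connector config stays as in the question):

"poll.interval.ms": "1000",
"batch.max.rows": "5000"

batch.max.rows only bounds how many rows go into one batch handed back to Kafka Connect, so raising it mostly helps when a single query execution returns many rows.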
Not only that; the factors below also impact your fetch rate:
the rate of incoming data into the database
the I/O rate from the DB through the JDBC connector to Kafka
DB table performance, e.g. whether you have a proper index on the timestamp column
After all, it uses JDBC to fetch data from the database, so it is subject to everything you would face with a single JDBC application.
In my experience, the JDBC connector is pretty fast.

Related

Slow down kafka connector JDBC source - control throughput

I want to control the throughput of a JDBC source Kafka connector.
I have a lot of data stored in a PostgreSQL table and I want to ingest it into a Kafka topic. However, I would like to avoid a huge "peak" in the ingestion.
My config looks like:
{
"name": "my-connector",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"tasks.max": "1",
"topic.prefix": "my-topic",
"connection.url": "jdbc:postgresql://localhost:5432/my-db",
"connection.user": "user",
"connection.password": "password",
"mode": "timestamp",
"timestamp.column.name": "time",
"poll.interval.ms": "10000",
"batch.max.rows": "100",
"query": "SELECT * FROM my-table",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false"
}
}
I guess I would need to play with these parameters:
poll.interval.ms
batch.max.rows
I don't understand how they impact the throughput. With these values it goes really fast.
How can I configure it properly to slow it down?
Edit: the idea looks like KIP-731 and its proposal to limit the record rate.
You currently have batch.max.rows=100, which is the default:
https://docs.confluent.io/kafka-connectors/jdbc/current/source-connector/source_config_options.html#connector
Once 100 rows have been read into a batch, the connector sends the batch to the Kafka topic. If you want to increase throughput, you should try increasing this value.
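For the opposite direction (slowing ingestion down, which is what the question asks), a hedged sketch would be to keep the batch small and lengthen the poll interval:

"poll.interval.ms": "60000",
"batch.max.rows": "50"

Note, though, that these two settings mainly control the batch size handed to Connect and the idle time between query executions, so they will not strictly cap the rate while a large existing backlog is being drained. If you need a hard limit, broker-side client quotas (e.g. a producer byte-rate quota applied to the Connect producer's client id) are the usual mechanism.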

Debezium connector with TimescaleDB extension

I'm having trouble detecting changes on a PostgreSQL hypertable (TimescaleDB extension).
Setup:
I have PostgreSQL (ver. 11.10) installed with the TimescaleDB (ver. 1.7.1) extension.
I have 2 tables that I want to monitor with the Debezium (ver. 1.3.1) connector installed on Kafka Connect, for the purpose of CDC (change data capture).
The tables are table1 and table2hyper, where table2hyper is a hypertable.
After creating the Debezium connector in Kafka Connect, I can see 2 topics created (one for each table):
(A) kconnect.public.table1
(B) kconnect.public.table2hyper
When consuming messages with kafka-console-consumer for topic A, I can see the messages after a row update in table1.
But when consuming messages from topic B (table2hyper table changes), nothing is emitted after, for example, a row update in the table2hyper table.
Initially, the Debezium connector does a snapshot of rows from the table2hyper table and sends them to topic B (I can see the messages in topic B when using kafka-console-consumer), but changes that I make after the initial snapshot are not emitted.
Why am I unable to see subsequent changes (after the initial snapshot) from table2hyper?
Connector creation payload:
{
"name": "alarm-table-connector7",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "xxx",
"database.port": "5432",
"database.user": "xxx",
"database.password": "xxx",
"database.dbname": "xxx",
"database.server.name": "kconnect",
"database.whitelist": "public.dev_db",
"table.include.list": "public.table1, public.table2hyper",
"plugin.name": "pgoutput",
"tombstones.on.delete":"true",
"slot.name": "slot3",
"transforms": "unwrap",
"transforms.unwrap.type":"io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones":"false",
"transforms.unwrap.delete.handling.mode":"rewrite",
"transforms.unwrap.add.fields":"table,lsn,op"
}
}
Thx in advance!
After trying for a while, I did not succeed in streaming data from the hypertable with the Debezium connector. I was using version 1.3.1, and upgrading to the latest 1.4.1 did not help.
However, I did succeed with the Confluent JDBC connector.
As far as my research and testing go, this is the conclusion, and feel free to correct me if necessary:
Debezium works on ordinary tables for INSERT, UPDATE and DELETE events
the Confluent connector captures only INSERT events (unless you combine some columns for detecting changes; see the sketch below) and works on both ordinary and hyper (TimescaleDB) tables
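For reference, a minimal sketch of what I mean by combining columns with the JDBC source connector: timestamp+incrementing mode, which picks up new rows via the incrementing column and updated rows via the timestamp column. The column names, connection details and topic prefix below are placeholders, and it assumes the hypertable has an increasing id plus a modified-timestamp column that is updated on every change:

{
"name": "table2hyper-jdbc-source",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"tasks.max": "1",
"connection.url": "jdbc:postgresql://xxx:5432/xxx",
"connection.user": "xxx",
"connection.password": "xxx",
"table.whitelist": "table2hyper",
"mode": "timestamp+incrementing",
"incrementing.column.name": "id",
"timestamp.column.name": "modified_ts",
"topic.prefix": "jdbc."
}
}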
We have never tested Debezium with TimescaleDB. I recommend you check whether the TimescaleDB updates are present in the logical replication slot. If yes, it should be technically possible to have Debezium process the events. If not, then it is not possible at all.
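A minimal sketch of that check, assuming you can create a throwaway replication slot with the test_decoding output plugin (the slot name is arbitrary); note that TimescaleDB stores hypertable rows in internal chunk tables, so the changes, if present, may show up under chunk table names rather than table2hyper:

SELECT * FROM pg_create_logical_replication_slot('debug_slot', 'test_decoding');
-- in another session: UPDATE a row in table2hyper, then:
SELECT * FROM pg_logical_slot_peek_changes('debug_slot', NULL, NULL);
-- clean up when done:
SELECT pg_drop_replication_slot('debug_slot');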

Retry Attempt without data loss when sink side solr is down during runtime

curl -X POST -H "Content-Type: application/json" --data '{
"name": "t1",
"config": {
"tasks.max": "1",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"key.converter.schemas.enable": "false",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"connector.class": "com.github.jcustenborder.kafka.connect.solr.HttpSolrSinkConnector",
"topics": "TRAN",
"solr.queue.size": "100",
"solr.commit.within": "10",
"solr.url": "http://192.168.2.221:27052/solr/TRAN",
"errors.retry.delay.max.ms":"5000",
"errors.retry.timeout":"600000",
"errors.tolerance":"all",
"errors.log.enable":"true",
"errors.log.include.messages":"false",
"errors.deadletterqueue.topic.name":"DEAD_TRAN",
"errors.deadletterqueue.topic.replication.factor":"1",
"retry.backoff.ms":"1000",
"reconnect.backoff.ms":"5000",
"reconnect.backoff.max.ms":"600000"
}
}' http://localhost:8083/connectors
I need to retry (without any data loss), based on a count from the connector config, if the Solr server goes down during runtime.
In my case, it works perfectly when both the connector and Solr are in the running state [Active].
But when only the Solr server is down, there is no retry; the data that should have been passed to Solr is lost.
Error information and connector config from the Kafka Connect log (attached as images).
I've just checked the SinkTask implementation of that specific connector and it does throw a RetriableException in the put() method.
In theory, and according to your connector configuration, it should block for 10 minutes ("errors.retry.timeout": "600000"). If your Solr instance recovers within those 10 minutes, there shouldn't be any problem in terms of data loss.
If you want to fully block your connector until Solr is back on its feet, have you tried setting "errors.retry.timeout": "-1"?
As per the documentation of errors.retry.timeout:
The maximum duration in milliseconds that a failed operation will be reattempted. The default is 0, which means no retries will be attempted. Use -1 for infinite retries.
PS: IMHO this might lead to a deadlock situation if, for some reason, a single message permanently fails its sink operation (i.e. if the sink keeps rejecting it).
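For completeness, a hedged sketch of the error-handling block with infinite retries (only these keys change; the rest of your sink config stays the same, and the deadlock caveat above still applies):

"errors.retry.timeout": "-1",
"errors.retry.delay.max.ms": "5000",
"errors.tolerance": "all",
"errors.log.enable": "true",
"errors.deadletterqueue.topic.name": "DEAD_TRAN",
"errors.deadletterqueue.topic.replication.factor": "1"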

how to override key.serializer in kafka connect jdbc

I am connecting MySQL to Kafka using the Kafka JDBC source connector. Everything is working fine. Now I need to pass key.serializer and value.serializer to encrypt data, as shown at macronova, but I didn't see any changes in the output.
POST API call to start the source connector:
curl -X POST -H "Content-Type: application/json" --data '{
"name": "jdbc-source-connector-2",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"key.serializer": "org.apache.kafka.common.serialization.StringSerializer",
"value.serializer": "org.apache.kafka.common.serialization.StringSerializer",
"tasks.max": 10,
"connection.url": "jdbc:mysql://localhost:3306/connect_test?user=roo&password=roo",
"mode": "incrementing",
"table.whitelist" : "test",
"incrementing.column.name": "id",
"timestamp.column.name": "modified",
"topic.prefix": "table-",
"poll.interval.ms": 1000
}
}' http://localhost:8083/connectors
Connectors only take Converters, not serializers, via the key and value properties.
If you want to encrypt a whole string, you'd need to implement your own Converter, or edit the code that writes into the database to write into Kafka instead, then consume and write to the database as well as to other downstream systems.
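A hedged sketch of where converters (not serializers) go in the connector config; the classes shown are the stock Connect converters and merely stand in for whatever custom Converter implementation you build and place on the worker's plugin path:

"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false"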

kafka connect jdbc source setup is not reading data from db and so no data in kafka topic

We configured the Kafka Connect JDBC source to read data from DB2 and publish it to a Kafka topic, and we are using a column of type timestamp as timestamp.column.name. But I see that Kafka Connect is not publishing any data to the Kafka topic. Even though there is no new data coming in after the Kafka Connect setup was done, there is a huge amount of existing data in DB2, so at least that should be published to the Kafka topic, but that is not happening either. Below is my connector source configuration:
{
"name": "next-error-msg",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"tasks.max": "1",
"connection.url": "DB_DATA_SOURCE_URL",
"connection.user": "DB_DATA_SOURCE_USERNAME",
"connection.password": "DB_DATA_SOURCE_PASSWORD",
"schema.pattern": "DB_DATA_SCHEMA_PATTERN",
"mode": "timestamp",
"query": "SELECT SEQ_I AS error_id, SEND_I AS scac , to_char(CREATE_TS,'YYYY-MM-DD-HH24.MI.SS.FF6') AS create_timestamp, CREATE_TS, MSG_T AS error_message FROM DB_ERROR_MEG",
"timestamp.column.name": "CREATE_TS",
"validate.non.null": false,
"topic.prefix": "DB_ERROR_MSG_TOPIC_NAME"
}
}
My doubts are: why is it not reading the data? It should read the existing data already present in the DB, but that is not happening. Is there something I need to configure or add?