Kafka SAP connector bulk mode doesn't work correctly

I am using this connector:
https://www.confluent.io/connector/sap-hana-connector/
This is my configuration:
{
  "name": "sap",
  "config": {
    "connector.class": "com.sap.kafka.connect.source.hana.HANASourceConnector",
    "tasks.max": "10",
    "topics": "sap_OHEM,sap_OHST",
    "connection.url": "jdbc:sap://10.236.242.1:30015/",
    "connection.user": "user",
    "connection.password": "pass",
    "mode": "bulk",
    "poll.interval.ms": "86400000",
    "sap_OHEM.table.name": "\"public\".\"OHEM\"",
    "sap_OHEM.poll.interval.ms": "86400000",
    "sap_OHST.table.name": "\"public\".\"OHST\"",
    "sap_OHST.poll.interval.ms": "86400000"
  }
}
OHEM has over 300K rows and OHST has about 500.
My problem is that the connector copies just 100 rows of each table and pushes them to the topic over and over again, without doing an incremental step. It doesn't copy the whole table, only the first 100 rows, and then copies those again and again.
I tried batch.max.rows with 100K as the limit. This solved the problem for small tables, but not for the big ones, because it blew up the JVM memory.
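For reference, this is roughly how I added the limit on top of the config above (only the relevant keys are shown; 100000 is simply the value I experimented with, not a recommended setting):
  "mode": "bulk",
  "poll.interval.ms": "86400000",
  "batch.max.rows": "100000"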
Has anyone faced the same problem?

Related

Slow down Kafka JDBC source connector - control throughput

I want to control the throughput of a JDBC source Kafka connector.
I have a lot of data stored in a PostgreSQL table and I want to ingest it into a Kafka topic. However, I would like to avoid a huge "peak" in the ingestion.
My config looks like:
{
  "name": "my-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "1",
    "topic.prefix": "my-topic",
    "connection.url": "jdbc:postgresql://localhost:5432/my-db",
    "connection.user": "user",
    "connection.password": "password",
    "mode": "timestamp",
    "timestamp.column.name": "time",
    "poll.interval.ms": "10000",
    "batch.max.rows": "100",
    "query": "SELECT * FROM my-table",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false"
  }
}
I guess I would need to play with these parameters:
poll.interval.ms
batch.max.rows
I don't understand how they impact the throughput. With these values it goes really fast.
How can I configure it properly to slow it down?
Edit: the idea is similar to KIP-731 and its proposal to limit the record rate.
You currently have batch.max.rows=100, which is the default:
https://docs.confluent.io/kafka-connectors/jdbc/current/source-connector/source_config_options.html#connector
Once 100 rows have been collected into a batch, the connector sends the batch to the Kafka topic. If you want to increase throughput, you should try increasing this value.
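To make that concrete, the relevant excerpt of the config above would change to something like the following (a sketch; 1000 is an arbitrary illustrative value, not a tuned number):
  "poll.interval.ms": "10000",
  "batch.max.rows": "1000"
Conversely, keeping batch.max.rows small and raising poll.interval.ms spaces out how often the table is re-queried, but note that neither setting is a hard records-per-second limit of the kind KIP-731 proposes.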

Configure Debezium CDC -> Kafka -> JDBC Sink (multiple tables)

We have around 100 tables in a SQL Server DB (the application DB) which need to be synced to another SQL Server DB (for analytics) in near real time.
Future use case: scale the proof of concept to 30 source DBs feeding one destination DB (for analytics) in near real time.
I am thinking of using one sink connector, or a few sink connectors, for multiple tables. Please let me know if this is a good idea.
But I am not sure how to configure the sink to cater for multiple tables, especially since each table might have its own primary key. The internet seems to have only very simple sink connector examples that don't address complex use cases.
Debezium CDC(Source) config
{ "name": "wwi",
"config": {
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"database.dbname": "************************",
"database.history": "io.debezium.relational.history.MemoryDatabaseHistory",
"database.hostname": "**********************",
"database.password": "**********************",
"database.port": "1433",
"database.server.name": "******",
"database.user": "*********",
"decimal.handling.mode": "string",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "true",
"snapshot.mode": "schema_only",
"table.include.list": "Sales.Orders,Warehouse.StockItems",
"tasks.max": "1",
"tombstones.on.delete": "false",
"transforms": "route,unwrap",
"transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
"transforms.route.replacement": "$3",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"value.converter.schemas.enable": "true",
"value.convertor": "org.apache.kafka.connect.json.JsonConverter"
}
}
JDBC Sink config
{
  "name": "sqlsinkcon",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "orders",
    "tasks.max": "1",
    "auto.evolve": "true",
    "connection.user": "********",
    "auto.create": "true",
    "connection.url": "jdbc:sqlserver://************",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "pk.fields": "OrderID",
    "db.name": "kafkadestination"
  }
}
The sink writes one table per consumed topic. topics or topics.regex can be used to consume multiple topics at once.
Regarding scalability (or at least fault tolerance), I prefer one sink task with one topic (therefore writing to one table). Otherwise, if you consume multiple topics and the connector fails, it can potentially crash all the task threads due to consumer rebalancing.
Also, using JSON / plaintext formats in Kafka isn't the most optimal in terms of network bandwidth. I'd suggest a binary format like Avro or Protobuf.
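If you do want a single sink to handle several topics, a sketch might look like this (using the topic names produced by the RegexRouter in the source config above; it assumes each Debezium record key contains only that table's primary key columns, so pk.fields is omitted and the whole record key is used as the primary key):
{
  "name": "sqlsinkcon-multi",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "Orders,StockItems",
    "tasks.max": "1",
    "connection.url": "jdbc:sqlserver://************",
    "connection.user": "********",
    "auto.create": "true",
    "auto.evolve": "true",
    "insert.mode": "upsert",
    "pk.mode": "record_key"
  }
}
Each consumed topic still maps to its own destination table; the per-table primary keys then come from the record keys rather than from a single hard-coded pk.fields list.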

No CDC generated by Kafka Debezium connector for Postgres

I previously succeeded in generating CDC events for one Postgres DB.
Today I used the same steps to set up a Kafka Debezium connector for another Postgres DB.
First I ran
POST http://localhost:8083/connectors
with body:
{
  "name": "postgres-kafkaconnector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "example.com",
    "database.port": "5432",
    "database.dbname": "my_db",
    "database.user": "xxx",
    "database.password": "xxx",
    "database.server.name": "postgres_server",
    "table.include.list": "public.products",
    "plugin.name": "pgoutput"
  }
}
which succeeded without error.
Then I ran
GET http://localhost:8083/connectors/postgres-kafkaconnector/status
to check the status. It returned this result without any error:
{
  "name": "postgres-kafkaconnector",
  "connector": {
    "state": "RUNNING",
    "worker_id": "10.xx.xx.xx:8083"
  },
  "tasks": [
    {
      "id": 0,
      "state": "RUNNING",
      "worker_id": "10.xx.xx.xx:8083"
    }
  ],
  "type": "source"
}
However, this time, when I update anything in the products table, no CDC events get generated.
Any ideas? Any suggestions to help debug further would be appreciated. Thanks!
Found the issue! It was because my Kafka connector postgres-kafkaconnector was initially pointing to one DB (stage1), and then I switched it to another DB (stage2) by updating
"database.hostname": "example.com",
"database.port": "5432",
"database.dbname": "my_db",
"database.user": "xxx",
"database.password": "xxx",
However, both deployments were using the same configuration properties in the Kafka Connect cluster I deployed at the very beginning:
config.storage.topic
offset.storage.topic
status.storage.topic
Since the connector with the different DB config shared the same Kafka Connect storage topics above, and the database table schemas are the same, things became a mess because the same Kafka offsets were shared.
One simple fix: when deploying Kafka connectors to test against different DBs, use different connector names, such as postgres-kafkaconnector-stage1 and postgres-kafkaconnector-stage2, to avoid the offset mess.
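For example, the stage2 deployment would simply be registered under its own name (a sketch reusing the request body from above; stage2.example.com and the _stage2 suffix on database.server.name are hypothetical, the suffix being there so the two stages also don't write to the same topics):
POST http://localhost:8083/connectors
{
  "name": "postgres-kafkaconnector-stage2",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "stage2.example.com",
    "database.port": "5432",
    "database.dbname": "my_db",
    "database.user": "xxx",
    "database.password": "xxx",
    "database.server.name": "postgres_server_stage2",
    "table.include.list": "public.products",
    "plugin.name": "pgoutput"
  }
}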

Kafka MongoDB sink writes only one record from the table

I am using ksqlDB, where I have created a table from a stream. When I run a SELECT query on that table, it returns all the records properly. Now I want to sink that table into MongoDB. I am able to create a sink from the Kafka table to MongoDB, but somehow it sinks only one record into MongoDB, whereas the table has 100 records. Below is my MongoDB sink connector config.
{
  "name": "MongoSinkConnectorConnector_1",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "topics": "FEEDS",
    "connection.uri": "mongodb://xxx:xxx#x.x.x.x:27017/",
    "database": "xxx",
    "max.num.retries": "1000000",
    "writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneTimestampsStrategy",
    "value.projection.type": "allowlist",
    "value.projection.list": "id",
    "document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
    "buffer.capacity": "20000",
    "value.converter.schema.registry.url": "http://x.x.x.x:8081",
    "key.converter.schemas.enable": "false",
    "insert.mode": "upsert"
  }
}
I am not able to understand the reason behind this. Any help is appreciated. Thank you.
You can set the "batch.size" property in the sink connector configuration, and you can also choose a better write model strategy, which you can read about in the official MongoDB Source connector documentation: https://docs.confluent.io/cloud/current/connectors/cc-mongo-db-source.html
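A minimal sketch of that addition on top of the sink config above (100 is an arbitrary illustrative batch size; whether a different writemodel.strategy is appropriate depends on how your id values behave, so treat this as a starting point rather than a fix):
  "writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneTimestampsStrategy",
  "batch.size": "100"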

Performance issues when replicating tables with Kafka Connect & Debezium to Kafka

I'm experiencing performance issues when replicating tables using Debezium and Kafka Connect.
The slow replication is only experienced during the initial snapshot of the database. One table that I tested with contains 3.4M rows, and the replication took 2 hours to complete.
At this stage, the entire database was locked and I was unable to commit data to other tables that were not being synced at the time.
My configuration (Debezium config deployed via curl request):
{
  "name": "debezium-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "redacted",
    "database.port": "3306",
    "database.user": "redacted",
    "database.password": "redacted",
    "database.server.id": "54005",
    "database.server.name": "redacted",
    "database.include.list": "redacted",
    "table.include.list": "redacted",
    "database.history.consumer.security.protocol": "SSL",
    "database.history.producer.security.protocol": "SSL",
    "database.history.kafka.bootstrap.servers": "redacted",
    "database.history.kafka.topic": "schema-changes.debezium-test",
    "snapshot.mode": "when_needed",
    "max.queue.size": 81290,
    "max.batch.size": 20480
  }
}
Kafka Connect configuration that was changed:
CONNECT_OFFSET_FLUSH_INTERVAL_MS: 10000
CONNECT_OFFSET_FLUSH_TIMEOUT_MS: 60000
Questions:
1 - How can I improve performance during the initial snapshot of the database?
2 - How can I replicate a limited number of tables from a database, without locking the entire database?
If you can make sure that the database schema will not change during the snapshot process, then you can avoid locking the database via https://debezium.io/documentation/reference/1.3/connectors/mysql.html#mysql-property-snapshot-locking-mode
Also check the https://debezium.io/documentation/reference/1.3/connectors/mysql.html#mysql-property-min-row-count-to-stream-results option; there might also be some performance gain from using it properly.
You can also increase max.batch.size together with max.queue.size even further than you have them right now.
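Put together, the additions might look roughly like this (a sketch, not a tuned configuration: snapshot.locking.mode=none is only safe under the no-schema-change assumption above, min.row.count.to.stream.results=0 is intended to make the snapshot stream result sets instead of loading them into memory, and the queue/batch numbers are only illustrative):
  "snapshot.locking.mode": "none",
  "min.row.count.to.stream.results": "0",
  "max.queue.size": 162560,
  "max.batch.size": 40960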