Kafka Connect sink to MongoDB: write only the last result per key, with a delay

I have an aggregation query that counts page views grouped by country and pushes the results to an output topic.
A Kafka connector then sinks that topic to MongoDB:
{
"connector.class": "MongoDbAtlasSink",
"name": "confluent-mongodb-sink",
"input.data.format" : "JSON",
"connection.host": "ip",
"topics": "viewPageCountByUsers",
"max.num.retries": "3",
"retries.defer.timeout": "5000",
"max.batch.size": "0",
"database": "test",
"collection": "ViewPagesCountByUsers",
"tasks.max": "1"
}
The problem is that the updates are very frequent and put a heavy load on MongoDB. How can I configure Kafka Connect so that it sends only the last value per key as a batch, for example with a 5-second delay?
Example: it is pointless to update the database 5 times:
{countryID:7, viewCount: 111}
{countryID:7, viewCount: 112}
{countryID:7, viewCount: 113}
{countryID:7, viewCount: 114}
{countryID:7, viewCount: 115}
If I could send only the last result per key with a 5-second delay, I would update the database just once:
// collect batch 5 sec and flush:
{countryID:7, viewCount: 115}
{countryID:8, viewCount: 573}
How can I do this?

Sink connectors just take whatever is in the topic, generally without batching.
You'd need to use a stream processor such as Kafka Streams or ksqlDB to run a windowed aggregation and write its output to a new topic, which the sink connector would then read.
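To make the idea concrete, here is a minimal sketch of what "last value per key in a 5-second window" means. It is plain Python that only simulates the semantics; it is not Kafka Streams or ksqlDB code, and the function name and the timestamps are made up for illustration.

```python
from collections import OrderedDict


def coalesce_by_window(records, window_ms=5000):
    """Group (timestamp_ms, key, value) records into tumbling windows and
    keep only the last value seen per key inside each window."""
    windows = OrderedDict()  # window start (ms) -> {key: last value}
    for ts, key, value in sorted(records, key=lambda r: r[0]):
        start = ts - (ts % window_ms)  # start of the window this record falls in
        windows.setdefault(start, {})[key] = value  # later values overwrite earlier ones
    return list(windows.items())


# Hypothetical input: (timestamp_ms, countryID, viewCount)
records = [
    (1000, 7, 111),
    (1500, 7, 112),
    (2000, 7, 113),
    (3000, 8, 573),
    (4000, 7, 114),
    (4900, 7, 115),
    (6000, 7, 116),
]

batches = coalesce_by_window(records)
for start, batch in batches:
    print(start, batch)
```

Each window yields one record per key, so the sink would issue one write per country per window instead of one per intermediate count.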

Related

MongoDB Kafka Connect ChangeStreamHandler does not support truncatedArrays

I am using ChangeStreamHandler in the MongoDB Kafka sink connector to stream changes from a MongoDB source to a sink collection:
"change.data.capture.handler": "com.mongodb.kafka.connect.sink.cdc.mongodb.ChangeStreamHandler"
On update events from the source MongoDB collection, the change stream handler fails with this exception:
ERROR Unable to process record SinkRecord{kafkaOffset=3, timestampType=CreateTime} ConnectRecord{topic='quickstart.sampleData', kafkaPartition=0, key={"_id": {"_data": "8262A5CD4B000000012B022C0100296E5A1004B80560BF7F114B04962A5F523CEAB5D046645F6964006462A5CC9B84956FD488691BF10004"}}, keySchema=Schema{STRING}, value={"_id": {"_data": "8262A5CD4B000000012B022C0100296E5A1004B80560BF7F114B04962A5F523CEAB5D046645F6964006462A5CC9B84956FD488691BF10004"}, "operationType": "update", "clusterTime": {"$timestamp": {"t": 1655033163, "i": 1}}, "ns": {"db": "quickstart", "coll": "sampleData"}, "documentKey": {"_id": {"$oid": "62a5cc9b84956fd488691bf1"}}, "updateDescription": {"updatedFields": {"hello": "moto"}, "removedFields": [], "truncatedArrays": []}}, valueSchema=Schema{STRING}, timestamp=1655033166742, headers=ConnectHeaders(headers=)} (com.mongodb.kafka.connect.sink.MongoProcessedSinkRecordData)
org.apache.kafka.connect.errors.DataException: Warning unexpected field(s) in updateDescription [truncatedArrays]. {"updatedFields": {"hello": "moto"}, "removedFields": [], "truncatedArrays": []}. Cannot process due to risk of data loss.
at com.mongodb.kafka.connect.sink.cdc.mongodb.operations.OperationHelper.getUpdateDocument(OperationHelper.java:99)
at com.mongodb.kafka.connect.sink.cdc.mongodb.operations.Update.perform(Update.java:57)
at com.mongodb.kafka.connect.sink.cdc.mongodb.ChangeStreamHandler.handle(ChangeStreamHandler.java:84)
at com.mongodb.kafka.connect.sink.MongoProcessedSinkRecordData.lambda$buildWriteModelCDC$3(MongoProcessedSinkRecordData.java:99)
at java.base/java.util.Optional.flatMap(Optional.java:294)
Below is the change stream event received on the sink side:
{"schema":{"type":"string","optional":false},"payload":"{\"_id\": {\"_data\": \"8262A5CD4B000000012B022C0100296E5A1004B80560BF7F114B04962A5F523CEAB5D046645F6964006462A5CC9B84956FD488691BF10004\"}, \"operationType\": \"update\", \"clusterTime\": {\"$timestamp\": {\"t\": 1655033163, \"i\": 1}}, \"ns\": {\"db\": \"quickstart\", \"coll\": \"sampleData\"}, \"documentKey\": {\"_id\": {\"$oid\": \"62a5cc9b84956fd488691bf1\"}}, \"updateDescription\": {\"updatedFields\": {\"hello\": \"moto\"}, \"removedFields\": [], \"truncatedArrays\": []}}"}
Looking at the code in
com.mongodb.kafka.connect.sink.cdc.mongodb.operations.OperationHelper.getUpdateDocument(OperationHelper.java:99)
it appears that updateDescription handling only covers updatedFields and removedFields; support for truncatedArrays is not present.
Is this a bug, or do I need to tune my source connector so that it stops sending truncatedArrays in change events?
I had the same issue and solved it by setting the following configuration on the source connector:
"change.stream.full.document": "updateLookup"
A full example:
{
"name": "mongo-simple-source",
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
"connection.uri": "yourMongodbUri",
"database": "yourDataBase",
"collection": "yourCollection",
"change.stream.full.document": "updateLookup"
}
}

Debezium connector filter only "partly" working

We have a Debezium connector that runs without any errors. Two filter conditions are applied; one works as intended, but the other seems to have no effect. These are the important parts of the config:
"connector.class": "io.debezium.connector.oracle.OracleConnector",
"transforms.filter.topic.regex": "topicname",
"database.connection.adapter": "logminer",
"transforms": "filter",
"schema.include.list": "xxxx",
"transforms.filter.type": "io.debezium.transforms.Filter",
"transforms.filter.language": "jsr223.groovy",
"tombstones.on.delete": "false",
"transforms.filter.condition": "value.op == \"c\" && value.after.QUEUELOCATIONTYPE == 5",
"table.include.list": "xxxxxx",
"skipped.operations": "u,d,r",
"snapshot.mode": "initial",
"topics": "xxxxxxx"
As you can see, we want only records with op "c" and QUEUELOCATIONTYPE 5. In the Kafka topic all records have op "c", but the second condition does not work: there are records with QUEUELOCATIONTYPE 2, 3, 4, etc.
A sample record is given below.
"payload": {
"before": null,
"after": {
"EVENTOBJECTID": "749dc9ea-a7aa-44c2-9af7-10574769c7db",
"QUEUECODE": "STDQSTDBKP",
"STATE": 6,
"RECORDDATE": 1638964344000,
"RECORDREQUESTOBJECTID": "32b7f617-60e8-4020-98b0-66f288433031",
"QUEUELOCATIONTYPE": 4,
"RETRYCOUNT": 0,
"RECORDCHANNELCODE": null,
"MESSAGEBROKERSERVERID": 1
},
"op": "c",
"ts_ms": 1638953572392,
"transaction": null
}
}
What could be the problem? I didn't expect it to help, but I tried swapping the order of the conditions anyway. There are no errors, and the connector is running.
OK, solved it. I was using a pre-made config. While reading the documentation, I saw that "skipped.operations": "u,d,r" is not an Oracle connector option; it was in the MySQL documentation. So I deleted it and changed the connector name (cached state causes problems quite often). It looks like it's working now.
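For reference, this is what the filter condition is expected to do with the sample record, written as a small Python sketch. It only mirrors the logic of the Groovy expression in the config; the function name `keep` and the trimmed-down records are illustrative, not part of Debezium.

```python
def keep(value):
    """Python equivalent of the configured condition:
    value.op == "c" && value.after.QUEUELOCATIONTYPE == 5"""
    return value["op"] == "c" and value["after"]["QUEUELOCATIONTYPE"] == 5


# Trimmed copy of the sample record from the question.
sample = {
    "before": None,
    "after": {"QUEUECODE": "STDQSTDBKP", "STATE": 6, "QUEUELOCATIONTYPE": 4},
    "op": "c",
}

# Hypothetical record that should pass the filter.
matching = {
    "before": None,
    "after": {"QUEUECODE": "STDQSTDBKP", "STATE": 6, "QUEUELOCATIONTYPE": 5},
    "op": "c",
}

print(keep(sample))    # the sample has QUEUELOCATIONTYPE 4, so it should be dropped
print(keep(matching))
```

A record like the sample reaching the topic therefore suggests the filter was not being applied as configured, rather than a mistake in the expression itself.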

Kafka Connect - Not working for UPDATE operation

I am new to Kafka Connect sources and sinks. I created an application that transfers table data from one schema (Schema1) to another (Schema2), using Oracle as the database. Rows from INSERT operations are transferred successfully from table "Schema1.Header" to table "Schema2.Header", but UPDATE operations are not picked up with the config below.
SOURCE Config:
{
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"connection.url": "jdbc:oracle:thin:@localhost:1524:XE",
"connection.user": "USER",
"connection.password": "user1234",
"dialect.name": "OracleDatabaseDialect",
"topic.prefix": "Schema1.Header",
"incrementing.column.name": "SC_NO",
"mode": "incrementing",
"query": "SELECT * FROM (SELECT HEADER_V1.* FROM Schema1.Header HEADER_V1 INNER JOIN Schema1.LINE_V1 LINE_V1 ON HEADER_V1.SC_NO = LINE_V1.SC_NO AND LINE_V1.CLNAME_CODE ='XXXXXX' AND HEADER_V1.ITEM_TYPE = 'XXX')",
"transforms": "ReplaceField",
"transforms.ReplaceField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"transforms.ReplaceField.blacklist": "col_3,col_10"
}
SINK Config:
{
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"connection.url": "jdbc:oracle:thin:@localhost:1524:XE",
"connection.user": "USER2",
"connection.password": "user21234",
"dialect.name": "OracleDatabaseDialect",
"topics": "Schema1.Header",
"table.name.format": "Schema2.Header",
"tasks.max": "1"
}
Please help me fix this issue.
Note: all CRUD operations happen only on the tables in Schema1; Kafka Connect transfers that data to the tables in Schema2. Newly inserted rows are transferred, but updated rows are not. What do I have to do to achieve this?
According to this blog, you need to set the mode to timestamp (or better, timestamp+incrementing, if you want to capture both new and updated rows) in your source config.
In addition, you then need to set timestamp.column.name so that it points to a timestamp column that is updated every time the row is updated.
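A rough way to see why incrementing mode misses updates is to model the two polling strategies. The sketch below is a simplified Python model, not the connector's actual query; the column names `id`, `ts` and `status` are hypothetical.

```python
def poll_incrementing(rows, last_id):
    """Rows an 'incrementing' poll would pick up: only ids above the last seen id."""
    return [r for r in rows if r["id"] > last_id]


def poll_timestamp_incrementing(rows, last_ts, last_id):
    """Rows a 'timestamp+incrementing' poll would pick up: a newer timestamp,
    or the same timestamp with a higher id."""
    return [
        r for r in rows
        if r["ts"] > last_ts or (r["ts"] == last_ts and r["id"] > last_id)
    ]


table = [
    {"id": 1, "ts": 100, "status": "NEW"},
    {"id": 2, "ts": 110, "status": "NEW"},
]

# State after the first poll: both rows have already been copied.
last_id, last_ts = 2, 110

# An UPDATE changes row 1 and bumps its timestamp column; the id stays the same.
table[0].update(status="SHIPPED", ts=120)

missed = poll_incrementing(table, last_id)
caught = poll_timestamp_incrementing(table, last_ts, last_id)

print(missed)  # the update is invisible to id-based polling
print(caught)  # the bumped timestamp makes the row visible again
```

This only works if the timestamp column really changes on every update, for example through a trigger or the application itself.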

Why do I receive a lot of duplicates with Debezium?

I'm testing the Debezium platform in a local deployment with docker-compose. Here's my test case:
Run postgres, kafka, zookeeper and 3 replicas of debezium/connect:1.3.
Configure the connector on one of the replicas with the following config:
{
"name": "database-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"plugin.name": "wal2json",
"slot.name": "database",
"database.hostname": "debezium_postgis_1",
"database.port": "5432",
"database.user": "postgres",
"database.password": "postgres",
"database.dbname" : "database",
"database.server.name": "database",
"heartbeat.interval.ms": 5000,
"table.whitelist": "public.outbox",
"transforms.outbox.table.field.event.id": "event_uuid",
"transforms.outbox.table.field.event.key": "event_name",
"transforms.outbox.table.field.event.payload": "payload",
"transforms.outbox.table.field.event.payload.id": "event_uuid",
"transforms.outbox.route.topic.replacement": "${routedByValue}",
"transforms.outbox.route.by.field": "topic",
"transforms": "outbox",
"transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
"max.batch.size": 1,
"offset.commit.policy": "io.debezium.engine.spi.OffsetCommitPolicy.AlwaysCommitOffsetPolicy",
"binary.handling.mode": "bytes"
}
}
Run a script that executes 2000 inserts into the outbox table by calling this method from another class:
@Transactional
public void write(String eventName, String topic, byte[] payload) {
Outbox newRecord = new Outbox(eventName, topic, payload);
repository.save(newRecord);
repository.delete(newRecord);
}
After a few seconds (when I see the first messages on Kafka), I kill the replica that is handling the stream. Let's say it had successfully delivered 200 messages to the right topic.
From the topic where Debezium stores offsets, I read the last offset message:
{
"transaction_id": null,
"lsn_proc": 24360992,
"lsn": 24495808,
"txId": 560,
"ts_usec": 1595337502556806
}
Then I open a db shell and run the following:
SELECT slot_name, restart_lsn - pg_lsn('0/0') as restart_lsn, confirmed_flush_lsn - pg_lsn('0/0') as confirmed_flush_lsn FROM pg_replication_slots;
and Postgres replies:
[
{
"slot_name": "database",
"restart_lsn": 24360856,
"confirmed_flush_lsn": 24360992
}
]
Five minutes after I killed the replica, Kafka Connect rebalances the connectors and deploys a new running task on one of the surviving replicas.
The new task starts handling the stream, but it seems to start from the beginning, because after it finishes I find 2200 messages on Kafka.
With that configuration (max.batch.size: 1 and AlwaysCommitOffsetPolicy) I expected to see at most 2001 messages.
Where am I wrong?
I found the problem in my configuration:
"offset.commit.policy": "io.debezium.engine.spi.OffsetCommitPolicy.AlwaysCommitOffsetPolicy" works only with the Embedded API.
Moreover, the debezium/connect:1.3 Docker image has a default OFFSET_FLUSH_INTERVAL_MS of 1 minute. So if I stop the container within its first minute, no offsets are stored on Kafka.
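The numbers in the question fit this explanation. As a back-of-the-envelope sketch in Python (a simplified model that assumes the new task resumes from the last flushed offset and re-sends everything after it; the function name is made up):

```python
def total_messages(total_events, delivered_before_crash, last_flushed_offset):
    """Messages on the topic after a crash and restart, assuming the new task
    resumes from the last flushed offset and re-sends everything after it."""
    sent_after_restart = total_events - last_flushed_offset
    return delivered_before_crash + sent_after_restart


# Nothing flushed before the crash: all 200 delivered messages are sent again.
print(total_messages(2000, 200, 0))    # 2200, as observed

# Offset flushed after every message except the last one: one duplicate at most.
print(total_messages(2000, 200, 199))  # 2001, the expected worst case
```

Lowering the flush interval shrinks the window of possible duplicates, but a crash between delivery and flush can still produce some.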

Kafka Connect: Topic shows 3x the number of events than expected

We are using Kafka Connect JDBC to sync tables between two databases (Debezium would be perfect for this but is out of the question).
The sync generally works fine, but there seem to be 3x as many events / messages stored in the topic as expected.
What could be the reason for this?
Some additional information
The target database contains the exact number of rows (the message count in the topics / 3).
Most of the topics are split into 3 partitions (Key is set via SMT, DefaultPartitioner is used).
JDBC Source Connector
{
"name": "oracle_source",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"connection.url": "jdbc:oracle:thin:@dbdis01.allesklar.de:1521:stg_cdb",
"connection.user": "****",
"connection.password": "****",
"schema.pattern": "BBUCH",
"topic.prefix": "oracle_",
"table.whitelist": "cdc_companies, cdc_partners, cdc_categories, cdc_additional_details, cdc_claiming_history, cdc_company_categories, cdc_company_custom_fields, cdc_premium_custom_field_types, cdc_premium_custom_fields, cdc_premiums, cdc, cdc_premium_redirects, intermediate_oz_data, intermediate_oz_mapping",
"table.types": "VIEW",
"mode": "timestamp+incrementing",
"incrementing.column.name": "id",
"timestamp.column.name": "ts",
"key.converter": "org.apache.kafka.connect.converters.IntegerConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"validate.non.null": false,
"numeric.mapping": "best_fit",
"db.timezone": "Europe/Berlin",
"transforms":"createKey, extractId, dropTimestamp, deleteTransform",
"transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields": "id",
"transforms.extractId.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractId.field": "id",
"transforms.dropTimestamp.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"transforms.dropTimestamp.blacklist": "ts",
"transforms.deleteTransform.type": "de.meinestadt.kafka.DeleteTransformation"
}
}
JDBC Sink Connector
{
"name": "postgres_sink",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"connection.url": "jdbc:postgresql://writer.branchenbuch.psql.integration.meinestadt.de:5432/branchenbuch",
"connection.user": "****",
"connection.password": "****",
"key.converter": "org.apache.kafka.connect.converters.IntegerConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.schemas.enable": true,
"insert.mode": "upsert",
"pk.mode": "record_key",
"pk.fields": "id",
"delete.enabled": true,
"auto.create": true,
"auto.evolve": true,
"topics.regex": "oracle_cdc_.*",
"transforms": "dropPrefix",
"transforms.dropPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.dropPrefix.regex": "oracle_cdc_(.*)",
"transforms.dropPrefix.replacement": "$1"
}
}
Strange Topic Count
This isn't an answer per se, but it's easier to format here than in the comments box.
It's not clear why you'd be getting duplicates. Some possibilities:
You have more than one instance of the connector running
You have one instance of the connector running but have previously run other instances which loaded the same data to the topic
Data is coming from multiple tables and being merged into one topic (not possible here based on your config, but if you were using a Single Message Transform to modify the target topic name it could be a possibility)
In terms of investigation I would suggest:
Isolate the problem by splitting the connector into one connector per table.
Examine each topic and locate examples of the duplicate messages. See if there is a pattern to which topics have duplicates. KSQL will be useful here:
SELECT ROWKEY, COUNT(*) FROM source GROUP BY ROWKEY HAVING COUNT(*) > 1
I'm guessing at ROWKEY (the key of the Kafka message) - you'll know your data and which columns should be unique and can be used to detect duplicates.
Once you've found a duplicate message, use kafkacat to examine the duplicate instances. Do they have the exact same Kafka message timestamp?
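If you'd rather check a dump of the topic offline, the same duplicate count as the KSQL query can be sketched in a few lines of Python. This assumes you have already exported the messages as (key, value) pairs; the sample data and the function name are made up.

```python
from collections import Counter


def duplicate_keys(messages):
    """Return {key: count} for every message key that occurs more than once."""
    counts = Counter(key for key, _ in messages)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical dump of a topic as (key, value) pairs.
messages = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (1, "a")]

print(duplicate_keys(messages))
```

A key that shows up exactly three times for every row would point at the same data being loaded three times rather than at random redelivery.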
For more back and forth, StackOverflow isn't such an appropriate platform - I'd recommend heading to http://cnfl.io/slack and the #connect channel.