Kafka Connect topic.key.ignore not working as expected - apache-kafka

As I understand from the Kafka Connect documentation, this configuration should ignore the keys for the metricbeat and filebeat topics, but not for alarms. However, Kafka Connect does not ignore any key.
This is the full JSON config that I am pushing to Kafka Connect over REST:
{
"auto.create.indices.at.start": false,
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"connection.url": "http://elasticsearch:9200",
"connection.timeout.ms": 5000,
"read.timeout.ms": 5000,
"tasks.max": "5",
"topics": "filebeat,metricbeat,alarms",
"behavior.on.null.values": "delete",
"behavior.on.malformed.documents": "warn",
"flush.timeout.ms":60000,
"max.retries":42,
"retry.backoff.ms": 100,
"max.in.flight.requests": 5,
"max.buffered.records":20000,
"batch.size":4096,
"drop.invalid.message": true,
"schema.ignore": true,
"topic.key.ignore": "metricbeat,filebeat",
"key.ignore": false
"name": "elasticsearch-ecs-connector",
"type.name": "_doc",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"transforms":"routeTS",
"transforms.routeTS.type":"org.apache.kafka.connect.transforms.TimestampRouter",
"transforms.routeTS.topic.format":"${topic}-${timestamp}",
"transforms.routeTS.timestamp.format":"YYYY.MM.dd",
"errors.tolerance": "all" ,
"errors.log.enable": false ,
"errors.log.include.messages": false,
"errors.deadletterqueue.topic.name":"logstream-dlq",
"errors.deadletterqueue.context.headers.enable":true ,
"errors.deadletterqueue.topic.replication.factor": 1
}
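For reference, a config like this can be pushed with a single REST call — a sketch only, assuming the file is saved as connector.json and the Connect REST API is reachable on localhost:8083 (adjust host and connector name to your environment):
curl -X PUT -H "Content-Type: application/json" \
     --data @connector.json \
     http://localhost:8083/connectors/elasticsearch-ecs-connector/config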
This is the logging during startup of the connector:
[2020-05-01 21:07:49,960] INFO ElasticsearchSinkConnectorConfig values:
auto.create.indices.at.start = false
batch.size = 4096
behavior.on.malformed.documents = warn
behavior.on.null.values = delete
compact.map.entries = true
connection.compression = false
connection.password = null
connection.timeout.ms = 5000
connection.url = [http://elasticsearch:9200]
connection.username = null
drop.invalid.message = true
elastic.https.ssl.cipher.suites = null
elastic.https.ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
elastic.https.ssl.endpoint.identification.algorithm = https
elastic.https.ssl.key.password = null
elastic.https.ssl.keymanager.algorithm = SunX509
elastic.https.ssl.keystore.location = null
elastic.https.ssl.keystore.password = null
elastic.https.ssl.keystore.type = JKS
elastic.https.ssl.protocol = TLS
elastic.https.ssl.provider = null
elastic.https.ssl.secure.random.implementation = null
elastic.https.ssl.trustmanager.algorithm = PKIX
elastic.https.ssl.truststore.location = null
elastic.https.ssl.truststore.password = null
elastic.https.ssl.truststore.type = JKS
elastic.security.protocol = PLAINTEXT
flush.timeout.ms = 60000
key.ignore = false
linger.ms = 1
max.buffered.records = 20000
max.in.flight.requests = 5
max.retries = 42
read.timeout.ms = 5000
retry.backoff.ms = 100
schema.ignore = true
topic.index.map = []
topic.key.ignore = [metricbeat, filebeat]
topic.schema.ignore = []
type.name = _doc
write.method = insert
I am using Confluent Platform 5.5.0.

Let's recap here, because there have been several edits to your question and problem statement :)
You want to stream multiple topics to Elasticsearch with a single connector
You want to use the message key for some topics as the Elasticsearch document ID, and for others you don't and want to use the Kafka message coordinates instead (topic+partition+offset)
You are trying to do this with key.ignore and topic.key.ignore settings
Here's my test data in three topics, test01, test02, test03:
ksql> PRINT test01 from beginning;
Key format: KAFKA_STRING
Value format: AVRO or KAFKA_STRING
rowtime: 2020/05/12 11:08:32.441 Z, key: X, value: {"COL1": 1, "COL2": "FOO"}
rowtime: 2020/05/12 11:08:32.594 Z, key: Y, value: {"COL1": 2, "COL2": "BAR"}
ksql> PRINT test02 from beginning;
Key format: KAFKA_STRING
Value format: AVRO or KAFKA_STRING
rowtime: 2020/05/12 11:08:50.865 Z, key: X, value: {"COL1": 1, "COL2": "FOO"}
rowtime: 2020/05/12 11:08:50.936 Z, key: Y, value: {"COL1": 2, "COL2": "BAR"}
ksql> PRINT test03 from beginning;
Key format: ¯\_(ツ)_/¯ - no data processed
Value format: AVRO or KAFKA_STRING
rowtime: 2020/05/12 11:16:15.166 Z, key: <null>, value: {"COL1": 1, "COL2": "FOO"}
rowtime: 2020/05/12 11:16:46.404 Z, key: <null>, value: {"COL1": 2, "COL2": "BAR"}
With this data I create a connector (I'm using ksqlDB but it's the same as if you use the REST API directly):
CREATE SINK CONNECTOR SINK_ELASTIC_TEST WITH (
'connector.class' = 'io.confluent.connect.elasticsearch.ElasticsearchSinkConnector',
'connection.url' = 'http://elasticsearch:9200',
'key.converter' = 'org.apache.kafka.connect.storage.StringConverter',
'type.name' = '_doc',
'topics' = 'test02,test01,test03',
'key.ignore' = 'false',
'topic.key.ignore'= 'test02,test03',
'schema.ignore' = 'false'
);
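(Since the connector is created through ksqlDB, its state can also be checked there before querying Elasticsearch — a quick optional check:)
DESCRIBE CONNECTOR SINK_ELASTIC_TEST;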
The resulting indices are created and populated in Elasticsearch. Here's the index and document ID of the documents:
➜ curl -s http://localhost:9200/test01/_search \
-H 'content-type: application/json' \
-d '{ "size": 5 }' |jq -c '.hits.hits[] | [._index, ._id]'
["test01","Y"]
["test01","X"]
➜ curl -s http://localhost:9200/test02/_search \
-H 'content-type: application/json' \
-d '{ "size": 5 }' |jq -c '.hits.hits[] | [._index, ._id]'
["test02","test02+0+0"]
["test02","test02+0+1"]
➜ curl -s http://localhost:9200/test03/_search \
-H 'content-type: application/json' \
-d '{ "size": 5 }' |jq -c '.hits.hits[] | [._index, ._id]'
["test03","test03+0+0"]
["test03","test03+0+1"]
So key.ignore=false is the default and is in effect for test01, which means that the key of each message is used as the document ID.
Topics test02 and test03 are listed in topic.key.ignore, which means that the key of the message is ignored for them (i.e. key.ignore=true is in effect), and thus the document ID is the topic/partition/offset of the message.
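Mapping that back to the topics in your question, the pair of settings that controls this behaviour is just the fragment below (everything else in your config can stay as it is):
"key.ignore": false,
"topic.key.ignore": "metricbeat,filebeat"
With that in place, documents from alarms should be keyed by the Kafka message key, while metricbeat and filebeat documents get topic+partition+offset IDs.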
Given that I've shown above that this does work, I would recommend that you start your test again from scratch and double-check your working.

Related

Model JSON input data for JDBC Sink Connector

My goal is to process an input stream that contains DB updates coming from a legacy system and replicate it into a different DB. The messages have the following format:
Insert - The after struct contains the state of the table entry after the insert operation is complete. No before struct present;
Update - The before struct contains the state of the table entry before the update operation and the after struct only lists the modified fields along with the table key;
Delete - The before struct contains the state of the table entry before the delete operation and the after struct is not present in the message;
Examples:
Insert
{"table":"SCHEMA.TEST_TABLE","op_type":"I","after":{"id_test":8934,"test_value_1":26910,"test_value_2":"XYZ","test_value_3":"2023-01-12T18:32:13.781217"}}
Update
{"table":"SCHEMA.TEST_TABLE","op_type":"U","before":{"id_test":8934,"test_value_1":26910,"test_value_2":"XYZ","test_value_3":"2023-01-12T18:32:13.781217"},"after":{"id_test":8934,"test_value_1":null,"test_value_3":"2023-01-12T18:32:18.787337"}}
Delete
{"table":"SCHEMA.TEST_TABLE","op_type":"D","before":{"id_test":8934,"test_value_1":1499,"test_value_2":"XYZ","test_value_3":"2023-01-12T18:32:18.787337"}}
After struggling a lot with the lack of a schema in the input messages, I turned to ksqlDB to overcome this and was able - though I'm not sure it's the best approach - to get it to generate an output topic that can be processed by a Kafka JDBC Sink Connector using the JSON_SR format.
The problem is that the output format I get in the output topic from ksqlDB does not allow for a correct interpretation of all the scenarios - only the insert scenario is correctly interpreted.
The flow in ksql was:
Process the input topic to map it as a JSON object by defining a schema. Created a stream where the input object is defined (TEST_TABLE_INPUT):
CREATE OR REPLACE STREAM TEST_TABLE_INPUT (
"table" VARCHAR,
op_type VARCHAR,
after STRUCT<
id_test INTEGER, test_value_1 INTEGER, test_value_2 VARCHAR, test_value_3 VARCHAR
>,
before STRUCT<
id_test INTEGER
>
) WITH (kafka_topic='input_topic', value_format='JSON');
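As a quick sanity check at this point (just a sketch, not part of the flow itself), the parsed struct fields can be inspected before flattening:
SELECT op_type, before->id_test AS b_id_test, after->id_test AS a_id_test
FROM TEST_TABLE_INPUT
EMIT CHANGES LIMIT 3;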
Process TEST_TABLE_INPUT to flatten the input structures (tried using the BEFORE->* and AFTER->* strategy instead but got a Caused by: line 3:11: mismatched input '*' expecting ... error):
CREATE OR REPLACE STREAM TEST_TABLE_INTERNAL
AS SELECT
OP_TYPE,
BEFORE->id_test AS B_id_test,
AFTER->id_test AS A_id_test,
AFTER
FROM TEST_TABLE_INPUT;
With this the output stream is created:
CREATE OR REPLACE STREAM TEST_TABLE_UPDATES
WITH (
VALUE_FORMAT='JSON_SR',
KEY_FORMAT='JSON_SR',
KAFKA_TOPIC='T_TEST_UPDATES'
)
AS SELECT
COALESCE(B_id_test, A_id_test) AS id_test,
AFTER->test_value_1 AS test_value_1,
AFTER->test_value_2 AS test_value_2,
AFTER->test_value_3 AS test_value_3
FROM TEST_TABLE_INTERNAL
PARTITION BY COALESCE(B_id_test, A_id_test)
;
When processing the example input topic messages below
{"table":"SCHEMA.TEST_TABLE","op_type":"I","after":{"id_test":0,"test_value_1":17315,"test_value_2":"XYZ","test_value_3":"2023-01-12T19:06:49.694383"}}
{"table":"SCHEMA.TEST_TABLE","op_type":"U","before":{"id_test":0,"test_value_1":17315,"test_value_2":"XYZ","test_value_3":"2023-01-12T19:06:49.694383"},"after":{"id_test":0,"test_value_1":null,"test_value_3":"2023-01-12T19:06:54.702107"}}
{"table":"SCHEMA.TEST_TABLE","op_type":"D","before":{"id_test":0,"test_value_1":28605,"test_value_2":"XYZ","test_value_3":"2023-01-12T19:06:54.702107"}}
{"table":"SCHEMA.TEST_TABLE","op_type":"I","after":{"id_test":1,"test_value_1":15601,"test_value_2":"KLM","test_value_3":"2023-01-12T19:07:04.716303"}}
{"table":"SCHEMA.TEST_TABLE","op_type":"U","before":{"id_test":1,"test_value_1":15601,"test_value_2":"KLM","test_value_3":"2023-01-12T19:07:04.716303"},"after":{"id_test":1,"test_value_1":null,"test_value_3":"2023-01-12T19:07:09.723473"}}
{"table":"SCHEMA.TEST_TABLE","op_type":"I","after":{"id_test":2,"test_value_1":13386,"test_value_2":"ABC","test_value_3":"2023-01-12T19:14:23.633204"}}
to simulate the I, I-U and I-U-D scenarios the output I get in KSQL is:
ksql> SELECT * FROM TEST_TABLE_UPDATES EMIT CHANGES;
+----------+-------------+--------------+----------------------------+
|ID_TEST |TEST_VALUE_1 |TEST_VALUE_2 |TEST_VALUE_3 |
+----------+-------------+--------------+----------------------------+
|0 |17315 |XYZ |2023-01-12T19:06:49.694383 |
|0 |null |null |2023-01-12T19:06:54.702107 |
|0 |null |null |null |
|1 |15601 |KLM |2023-01-12T19:07:04.716303 |
|1 |null |null |2023-01-12T19:07:09.723473 |
|2 |13386 |ABC |2023-01-12T19:14:23.633204 |
And outside of ksqlDB:
kafka-console-consumer.sh --property print.key=true --topic TEST_TABLE_UPDATES --from-beginning --bootstrap-server localhost:9092
0 {"TEST_VALUE_1":17315,"TEST_VALUE_2":"XYZ","TEST_VALUE_3":"2023-01-12T19:06:49.694383"}
0 {"TEST_VALUE_1":null,"TEST_VALUE_2":null,"TEST_VALUE_3":"2023-01-12T19:06:54.702107"}
0 {"TEST_VALUE_1":null,"TEST_VALUE_2":null,"TEST_VALUE_3":null}
1 {"TEST_VALUE_1":15601,"TEST_VALUE_2":"KLM","TEST_VALUE_3":"2023-01-12T19:07:04.716303"}
1 {"TEST_VALUE_1":null,"TEST_VALUE_2":null,"TEST_VALUE_3":"2023-01-12T19:07:09.723473"}
2 {"TEST_VALUE_1":13386,"TEST_VALUE_2":"ABC","TEST_VALUE_3":"2023-01-12T19:14:23.633204"}
On top of this, the connector is instantiated with the configuration below:
{
"name": "t_test_updates-v2",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"errors.log.enable": true,
"errors.log.include.messages": true,
"topics": "TEST_TABLE_UPDATES",
"key.converter": "io.confluent.connect.json.JsonSchemaConverter",
"key.converter.schemas.enable": false,
"key.converter.schema.registry.url": "http://schema-registry:8081",
"value.converter": "io.confluent.connect.json.JsonSchemaConverter",
"value.converter.schemas.enable": false,
"value.converter.schema.registry.url": "http://schema-registry:8081",
"connection.url": "jdbc:oracle:thin:#database:1521/xe",
"connection.user": "<REDACTED>",
"connection.password": "<REDACTED>",
"dialect.name": "OracleDatabaseDialect",
"insert.mode": "upsert",
"delete.enabled": true,
"table.name.format": "TEST_TABLE_REPL",
"pk.mode": "record_key",
"pk.fields": "ID_TEST",
"auto.create": false,
"error.tolerance": "all"
}
}
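For completeness, the connector and task state can be checked over the Connect REST API once it has been created — a sketch, assuming Connect listens on localhost:8083 and jq is available:
curl -s http://localhost:8083/connectors/t_test_updates-v2/status | jq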
This processes the topic - apparently - without a problem, but the output in the DB does not match the expected result:
SQL> select * from test_table_repl;
ID_TEST TEST_VALUE_1 TEST_VALUE_2 TEST_VALUE_3
__________ _______________ __________________ _____________________________
1 2023-01-12T19:07:09.723473
0
2 13386 ABC 2023-01-12T19:14:23.633204
I would expect the DB query to return:
ID_TEST TEST_VALUE_1 TEST_VALUE_2 TEST_VALUE_3
__________ _______________ __________________ _____________________________
1 KLM 2023-01-12T19:07:09.723473
2 13386 ABC 2023-01-12T19:14:23.633204
As can be seen, only the insert scenario works correctly.
I cannot understand what I am doing wrong.
Is it the data format, the connector configuration, both, or something else I am missing?

Configuration of JDBC Sink Connector in KsqlDB from MySQL database to PostgreSQL database

I wanted to copy a table from a MySQL database to PostgreSQL. I have ksqlDB, which acts as the stream processor. To start, I just want to copy a simple table from the source 'inventory' database to the sink database (PostgreSQL). The following is the structure of the inventory database:
mysql> show tables;
+---------------------+
| Tables_in_inventory |
+---------------------+
| addresses |
| customers |
| geom |
| orders |
| products |
| products_on_hand |
+---------------------+
I have logged into ksqlDB and registered a source connector using the following configuration:
CREATE SOURCE CONNECTOR inventory_connector WITH (
'connector.class' = 'io.debezium.connector.mysql.MySqlConnector',
'database.hostname' = 'mysql',
'database.port' = '3306',
'database.user' = 'debezium',
'database.password' = 'dbz',
'database.allowPublicKeyRetrieval' = 'true',
'database.server.id' = '223344',
'database.server.name' = 'dbserver',
'database.whitelist' = 'inventory',
'database.history.kafka.bootstrap.servers' = 'broker:9092',
'database.history.kafka.topic' = 'schema-changes.inventory',
'transforms' = 'unwrap',
'transforms.unwrap.type'= 'io.debezium.transforms.UnwrapFromEnvelope',
'key.converter'= 'org.apache.kafka.connect.json.JsonConverter',
'key.converter.schemas.enable'= 'false',
'value.converter'= 'org.apache.kafka.connect.json.JsonConverter',
'value.converter.schemas.enable'= 'false'
);
The following are the topics created
ksql> LIST TOPICS;
Kafka Topic | Partitions | Partition Replicas
-----------------------------------------------------------------------
_ksql-connect-configs | 1 | 1
_ksql-connect-offsets | 25 | 1
_ksql-connect-statuses | 5 | 1
dbserver | 1 | 1
dbserver.inventory.addresses | 1 | 1
dbserver.inventory.customers | 1 | 1
dbserver.inventory.geom | 1 | 1
dbserver.inventory.orders | 1 | 1
dbserver.inventory.products | 1 | 1
dbserver.inventory.products_on_hand | 1 | 1
default_ksql_processing_log | 1 | 1
schema-changes.inventory | 1 | 1
-----------------------------------------------------------------------
Now I just need to copy the contents of 'dbserver.inventory.customers' to the PostgreSQL database. The following is the structure of the data:
ksql> PRINT 'dbserver.inventory.customers' FROM BEGINNING;
Key format: JSON or HOPPING(KAFKA_STRING) or TUMBLING(KAFKA_STRING) or KAFKA_STRING
Value format: JSON or KAFKA_STRING
rowtime: 2022/08/29 02:39:20.772 Z, key: {"id":1001}, value: {"id":1001,"first_name":"Sally","last_name":"Thomas","email":"sally.thomas@acme.com"}, partition: 0
rowtime: 2022/08/29 02:39:20.773 Z, key: {"id":1002}, value: {"id":1002,"first_name":"George","last_name":"Bailey","email":"gbailey@foobar.com"}, partition: 0
rowtime: 2022/08/29 02:39:20.773 Z, key: {"id":1003}, value: {"id":1003,"first_name":"Edward","last_name":"Walker","email":"ed@walker.com"}, partition: 0
rowtime: 2022/08/29 02:39:20.773 Z, key: {"id":1004}, value: {"id":1004,"first_name":"Anne","last_name":"Kretchmar","email":"annek@noanswer.org"}, partition: 0
I have tried the following configuration of the sink connector:
CREATE SINK CONNECTOR postgres_sink WITH (
'connector.class'= 'io.confluent.connect.jdbc.JdbcSinkConnector',
'connection.url'= 'jdbc:postgresql://postgres:5432/inventory',
'connection.user' = 'postgresuser',
'connection.password' = 'postgrespw',
'topics'= 'dbserver.inventory.customers',
'transforms'= 'unwrap',
'transforms.unwrap.type'= 'io.debezium.transforms.ExtractNewRecordState',
'transforms.unwrap.drop.tombstones'= 'false',
'key.converter'= 'org.apache.kafka.connect.json.JsonConverter',
'key.converter.schemas.enable'= 'false',
'value.converter'= 'org.apache.kafka.connect.json.JsonConverter',
'value.converter.schemas.enable'= 'false',
'auto.create'= 'true',
'insert.mode'= 'upsert',
'auto.evolve' = 'true',
'table.name.format' = '${topic}',
'pk.mode' = 'record_key',
'pk.fields' = 'id',
'delete.enabled'= 'true'
);
It creates the connector but shows the following errors:
ksqldb-server | Caused by: org.apache.kafka.connect.errors.ConnectException: Sink connector 'POSTGRES_SINK' is configured with 'delete.enabled=true' and 'pk.mode=record_key' and therefore requires records with a non-null key and non-null Struct or primitive key schema, but found record at (topic='dbserver.inventory.customers',partition=0,offset=0,timestamp=1661740760772) with a HashMap key and null key schema.
What should the configuration of the sink connector be to copy this data to PostgreSQL?
I have also tried creating a stream first in Avro and then using the Avro key and value converters, but it did not work. I think it has something to do with using the right SMTs, but I am not sure.
My ultimate aim is to join different streams and then store them in PostgreSQL as part of implementing a CQRS architecture, so if someone can share a framework I could use in such a case it would be really useful.
As the error says, the key must be a primitive, not a JSON object, and not Avro either.
Based on the JSON you've shown, you'd need an ExtractField transform on your key:
transforms=getKey,unwrap
transforms.getKey.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.getKey.field=id
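Applied to the ksqlDB-managed sink connector from the question, that would look roughly like the sketch below — it is your original config with only the transforms lines changed, so treat it as a starting point rather than a verified fix:
CREATE SINK CONNECTOR postgres_sink WITH (
  'connector.class' = 'io.confluent.connect.jdbc.JdbcSinkConnector',
  'connection.url' = 'jdbc:postgresql://postgres:5432/inventory',
  'connection.user' = 'postgresuser',
  'connection.password' = 'postgrespw',
  'topics' = 'dbserver.inventory.customers',
  'transforms' = 'getKey,unwrap',
  'transforms.getKey.type' = 'org.apache.kafka.connect.transforms.ExtractField$Key',
  'transforms.getKey.field' = 'id',
  'transforms.unwrap.type' = 'io.debezium.transforms.ExtractNewRecordState',
  'transforms.unwrap.drop.tombstones' = 'false',
  'key.converter' = 'org.apache.kafka.connect.json.JsonConverter',
  'key.converter.schemas.enable' = 'false',
  'value.converter' = 'org.apache.kafka.connect.json.JsonConverter',
  'value.converter.schemas.enable' = 'false',
  'auto.create' = 'true',
  'insert.mode' = 'upsert',
  'auto.evolve' = 'true',
  'table.name.format' = '${topic}',
  'pk.mode' = 'record_key',
  'pk.fields' = 'id',
  'delete.enabled' = 'true'
);
Note that the extracted key is still schemaless JSON here; if the sink still complains about a null key schema, the alternative in the next paragraph (changing the key converter at the source) is the way to go.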
Or, you might be able to change your source connector to use IntegerConverter rather than JsonConverter for the keys.
Debezium also has an old blog post covering this exact use case - https://debezium.io/blog/2017/09/25/streaming-to-another-database/

Debezium with Postgres | Kafka Consumer not able to consume any message

Here is my docker-compose file:
version: '3.7'
services:
  postgres:
    image: debezium/postgres:12
    container_name: postgres
    networks:
      - broker-kafka
    environment:
      POSTGRES_PASSWORD: admin
      POSTGRES_USER: antriksh
    ports:
      - 5499:5432
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    container_name: zookeeper
    networks:
      - broker-kafka
    ports:
      - 2181:2181
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  kafka:
    image: confluentinc/cp-kafka:latest
    container_name: kafka
    networks:
      - broker-kafka
    depends_on:
      - zookeeper
    ports:
      - 9092:9092
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_LOG_CLEANER_DELETE_RETENTION_MS: 5000
      KAFKA_BROKER_ID: 1
      KAFKA_MIN_INSYNC_REPLICAS: 1
  connector:
    image: debezium/connect:latest
    container_name: kafka_connect_with_debezium
    networks:
      - broker-kafka
    ports:
      - "8083:8083"
    environment:
      GROUP_ID: 1
      CONFIG_STORAGE_TOPIC: my_connect_configs
      OFFSET_STORAGE_TOPIC: my_connect_offsets
      BOOTSTRAP_SERVERS: kafka:29092
    depends_on:
      - zookeeper
      - kafka
networks:
  broker-kafka:
    driver: bridge
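Once the stack is up with docker-compose up -d, a quick way to confirm the Connect worker is reachable (and that the Debezium Postgres plugin is loaded) is to hit the REST port published above — just a sketch:
curl -s http://localhost:8083/connector-plugins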
I am able to create tables and insert data into them. I am also able to initialise the connector using the following config:
curl -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '
{
"name": "payment-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "antriksh",
"database.password": "admin",
"database.dbname" : "payment",
"database.server.name": "dbserver1",
"database.whitelist": "payment",
"database.history.kafka.bootstrap.servers": "localhost:9092",
"database.history.kafka.topic": "schema-changes.payment",
"publication.name": "mytestpub",
"publication.autocreate.mode": "all_tables"
}
}'
I start my Kafka Consumer like this
kafka-console-consumer --bootstrap-server kafka:29092 --from-beginning --topic dbserver1.public.transaction --property print.key=true --property key.separator="-"
But whenever I insert or update anything in my DB, I don't see the messages being relayed to the Kafka consumer.
I have set the config property "publication.autocreate.mode": "all_tables", which should create a publication automatically for all tables. But when I run select * from pg_publication I see nothing; it's an empty table.
There is a replication slot named debezium, so I don't know which config or step I am missing that is preventing the Kafka consumer from consuming the messages.
Update:
I found out that in order for Debezium to create publications automatically, pgoutput is needed as the output plugin. Also, after OneCricketer's comment, this is my connector's config:
curl -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '
{
"name": "payment-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "antriksh",
"database.password": "admin",
"database.dbname" : "payment",
"database.server.name": "dbserver1",
"database.whitelist": "payment",
"database.history.kafka.bootstrap.servers": "kafka:29092",
"database.history.kafka.topic": "schema-changes.payment",
"plugin.name": "pgoutput",
"publication.autocreate.mode": "all_tables",
"publication.name": "my_publication"
}
}'
Now I am able to see the publication being created.
16395 | my_publication | 10 | t | t | t | t | t
The issue now seems to be that the LSN is not moving ahead when I check the replication_slots
select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
debezium | pgoutput | logical | 16385 | payment | f | t | 260 | | 491 | 0/176F268 | 0/176F268
(1 row)
It's stuck at 0/176F268 ever since the payment db was created.
When I list the topics I can see that Debezium has created the topic for the transaction table:
[appuser#a112a33992d1 ~]$ kafka-topics --zookeeper zookeeper:2181 --list
__consumer_offsets
connect-status
dbserver1.public.transaction
my_connect_configs
my_connect_offsets
I am unable to understand where it is going wrong.
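One way to narrow this down (a sketch, using the REST port published in the compose file; jq is optional) is to check the connector and task state, which surfaces any exception the task has hit:
curl -s http://localhost:8083/connectors/payment-connector/status | jq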

How to select subvalue of json in topic as ksql stream

I have many topics in Kafka with a format such as:
value: {big json string with many subkeys etc}.
Printing the topic looks like:
rowtime: 3/10/20 7:10:43 AM UTC, key: , value: {"#timestamp": "XXXXXXXX", "beat": {"hostname": "xxxxxxxxxx","name": "xxxxxxxxxx","version": "5.2.1"}, "input_type": "log", "log_dc": "xxxxxxxxxxx", "message": "{\"server_name\":\"xxxxxxxxxxxxxxx\",\"remote_address\":\"10.x.x.x\",\"user\":\"xxxxxx\",\"timestamp_start\":\"xxxxxxxx\",\"timestamp_finish\":\"xxxxxxxxxx\",\"time_start\":\"10/Mar/2020:07:10:39 +0000\",\"time_finish\":\"10/Mar/2020:07:10:39 +0000\",\"request_method\":\"PUT\",\"request_uri\":\"xxxxxxxxxxxxxxxxxxxxxxx\",\"protocol\":\"HTTP/1.1\",\"status\":200,\"response_length\":\"0\",\"request_length\":\"0\",\"user_agent\":\"xxxxxxxxx\",\"request_id\":\"zzzzzzzzzzzzzzzzzzzzz\",\"request_type\":\"zzzzzzzz\",\"stat\":{\"c_wait\":0.004,\"s_wait\":0.432,\"digest\":0.0,\"commit\":31.878,\"turn_around_time\":0.0,\"t_transfer\":32.319},\"object_length\":\"0\","o_name\":\"xxxxx\",\"https\":{\"protocol\":\"TLSv1.2\",\"cipher_suite\":\"TLS_RSA_WITH_AES_256_GCM_SHA384\"},\"principals\":{\"identity\":\"zzzzzz\",\"asv\":\"dddddddddd\"},\"type\":\"http\",\"format\":1}", "offset": 70827770, "source": "/var/log/xxxx.log", "type": "topicname" }
I have tried using
CREATE STREAM test
(value STRUCT<
server_name VARCHAR,
remote_address VARCHAR,
forwarded_for VARCHAR,
remote_user VARCHAR,
timestamp_start VARCHAR
..
WITH (KAFKA_TOPIC='testing', VALUE_FORMAT='JSON');
But I get a stream with all values as NULL.
Is there a way to grab the fields under the value key?
The escaped JSON is not valid JSON, which is probably going to have made this more difficult :)
In this snippet:
…\"object_length\":\"0\","o_name\":\"xxxxx\",\"https\":{\"protocol\":\…
the leading double-quote for o_name is not escaped. You can validate this with something like jq:
echo '{"message": "{\"server_name\":\"xxxxxxxxxxxxxxx\",\"remote_address\":\"10.x.x.x\",\"user\":\"xxxxxx\",\"timestamp_start\":\"xxxxxxxx\",\"timestamp_finish\":\"xxxxxxxxxx\",\"time_start\":\"10/Mar/2020:07:10:39 +0000\",\"time_finish\":\"10/Mar/2020:07:10:39 +0000\",\"request_method\":\"PUT\",\"request_uri\":\"xxxxxxxxxxxxxxxxxxxxxxx\",\"protocol\":\"HTTP/1.1\",\"status\":200,\"response_length\":\"0\",\"request_length\":\"0\",\"user_agent\":\"xxxxxxxxx\",\"request_id\":\"zzzzzzzzzzzzzzzzzzzzz\",\"request_type\":\"zzzzzzzz\",\"stat\":{\"c_wait\":0.004,\"s_wait\":0.432,\"digest\":0.0,\"commit\":31.878,\"turn_around_time\":0.0,\"t_transfer\":32.319},\"object_length\":\"0\","o_name\":\"xxxxx\",\"https\":{\"protocol\":\"TLSv1.2\",\"cipher_suite\":\"TLS_RSA_WITH_AES_256_GCM_SHA384\"},\"principals\":{\"identity\":\"zzzzzz\",\"asv\":\"dddddddddd\"},\"type\":\"http\",\"format\":1}"}' | jq '.message|fromjson'
parse error: Invalid numeric literal at line 1, column 685
With the JSON fixed this then parses successfully:
➜ echo '{"message": "{\"server_name\":\"xxxxxxxxxxxxxxx\",\"remote_address\":\"10.x.x.x\",\"user\":\"xxxxxx\",\"timestamp_start\":\"xxxxxxxx\",\"timestamp_finish\":\"xxxxxxxxxx\",\"time_start\":\"10/Mar/2020:07:10:39 +0000\",\"time_finish\":\"10/Mar/2020:07:10:39 +0000\",\"request_m
ethod\":\"PUT\",\"request_uri\":\"xxxxxxxxxxxxxxxxxxxxxxx\",\"protocol\":\"HTTP/1.1\",\"status\":200,\"response_length\":\"0\",\"request_length\":\"0\",\"user_agent\":\"xxxxxxxxx\",\"request_id\":\"zzzzzzzzzzzzzzzzzzzzz\",\"request_type\":\"zzzzzzzz\",\"stat\":{\"c_wait\":0.004,\"s_
wait\":0.432,\"digest\":0.0,\"commit\":31.878,\"turn_around_time\":0.0,\"t_transfer\":32.319},\"object_length\":\"0\",\"o_name\":\"xxxxx\",\"https\":{\"protocol\":\"TLSv1.2\",\"cipher_suite\":\"TLS_RSA_WITH_AES_256_GCM_SHA384\"},\"principals\":{\"identity\":\"zzzzzz\",\"asv\":\"dddd
dddddd\"},\"type\":\"http\",\"format\":1}"}' | jq '.message|fromjson'
{
"server_name": "xxxxxxxxxxxxxxx",
"remote_address": "10.x.x.x",
"user": "xxxxxx",
"timestamp_start": "xxxxxxxx",
"timestamp_finish": "xxxxxxxxxx",
"time_start": "10/Mar/2020:07:10:39 +0000",
"time_finish": "10/Mar/2020:07:10:39 +0000",
"request_method": "PUT",
"request_uri": "xxxxxxxxxxxxxxxxxxxxxxx",
"protocol": "HTTP/1.1",
"status": 200,
…
So now let's get this into ksqlDB. I'm using kafkacat to load it into a topic:
kafkacat -b localhost:9092 -t testing -P<<EOF
{ "#timestamp": "XXXXXXXX", "beat": { "hostname": "xxxxxxxxxx", "name": "xxxxxxxxxx", "version": "5.2.1" }, "input_type": "log", "log_dc": "xxxxxxxxxxx", "message": "{\"server_name\":\"xxxxxxxxxxxxxxx\",\"remote_address\":\"10.x.x.x\",\"user\":\"xxxxxx\",\"timestamp_start\":\"xxxxxxxx\",\"timestamp_finish\":\"xxxxxxxxxx\",\"time_start\":\"10/Mar/2020:07:10:39 +0000\",\"time_finish\":\"10/Mar/2020:07:10:39 +0000\",\"request_method\":\"PUT\",\"request_uri\":\"xxxxxxxxxxxxxxxxxxxxxxx\",\"protocol\":\"HTTP/1.1\",\"status\":200,\"response_length\":\"0\",\"request_length\":\"0\",\"user_agent\":\"xxxxxxxxx\",\"request_id\":\"zzzzzzzzzzzzzzzzzzzzz\",\"request_type\":\"zzzzzzzz\",\"stat\":{\"c_wait\":0.004,\"s_wait\":0.432,\"digest\":0.0,\"commit\":31.878,\"turn_around_time\":0.0,\"t_transfer\":32.319},\"object_length\":\"0\",\"o_name\":\"xxxxx\",\"https\":{\"protocol\":\"TLSv1.2\",\"cipher_suite\":\"TLS_RSA_WITH_AES_256_GCM_SHA384\"},\"principals\":{\"identity\":\"zzzzzz\",\"asv\":\"dddddddddd\"},\"type\":\"http\",\"format\":1}", "offset": 70827770, "source": "/var/log/xxxx.log", "type": "topicname" }
EOF
Now with ksqlDB let's declare the outline schema, in which the message field is just a lump of VARCHAR:
CREATE STREAM TEST (BEAT STRUCT<HOSTNAME VARCHAR, NAME VARCHAR, VERSION VARCHAR>,
INPUT_TYPE VARCHAR,
MESSAGE VARCHAR,
OFFSET BIGINT,
SOURCE VARCHAR)
WITH (KAFKA_TOPIC='testing', VALUE_FORMAT='JSON');
We can query this stream to check that it's working:
SET 'auto.offset.reset' = 'earliest';
SELECT BEAT->HOSTNAME,
BEAT->VERSION,
SOURCE,
MESSAGE
FROM TEST
EMIT CHANGES LIMIT 1;
+-----------------+---------------+--------------------+--------------------------------------------------------------------+
|BEAT__HOSTNAME |BEAT__VERSION |SOURCE |MESSAGE |
+-----------------+---------------+--------------------+--------------------------------------------------------------------+
|xxxxxxxxxx |5.2.1 |/var/log/xxxx.log |{"server_name":"xxxxxxxxxxxxxxx","remote_address":"10.x.x.x","user":|
| | | |"xxxxxx","timestamp_start":"xxxxxxxx","timestamp_finish":"xxxxxxxxxx|
| | | |","time_start":"10/Mar/2020:07:10:39 +0000","time_finish":"10/Mar/20|
| | | |20:07:10:39 +0000","request_method":"PUT","request_uri":"xxxxxxxxxxx|
| | | |xxxxxxxxxxxx","protocol":"HTTP/1.1","status":200,"response_length":"|
| | | |0","request_length":"0","user_agent":"xxxxxxxxx","request_id":"zzzzz|
| | | |zzzzzzzzzzzzzzzz","request_type":"zzzzzzzz","stat":{"c_wait":0.004,"|
| | | |s_wait":0.432,"digest":0.0,"commit":31.878,"turn_around_time":0.0,"t|
| | | |_transfer":32.319},"object_length":"0","o_name":"xxxxx","https":{"pr|
| | | |otocol":"TLSv1.2","cipher_suite":"TLS_RSA_WITH_AES_256_GCM_SHA384"},|
| | | |"principals":{"identity":"zzzzzz","asv":"dddddddddd"},"type":"http",|
| | | |"format":1} |
Limit Reached
Query terminated
Now let's extract the embedded JSON fields using the EXTRACTJSONFIELD function (I've not done every field, just a handful of them to illustrate the pattern to follow):
SELECT EXTRACTJSONFIELD(MESSAGE,'$.remote_address') AS REMOTE_ADDRESS,
EXTRACTJSONFIELD(MESSAGE,'$.time_start') AS TIME_START,
EXTRACTJSONFIELD(MESSAGE,'$.protocol') AS PROTOCOL,
EXTRACTJSONFIELD(MESSAGE,'$.status') AS STATUS,
EXTRACTJSONFIELD(MESSAGE,'$.stat.c_wait') AS STAT_C_WAIT,
EXTRACTJSONFIELD(MESSAGE,'$.stat.s_wait') AS STAT_S_WAIT,
EXTRACTJSONFIELD(MESSAGE,'$.stat.digest') AS STAT_DIGEST,
EXTRACTJSONFIELD(MESSAGE,'$.stat.commit') AS STAT_COMMIT,
EXTRACTJSONFIELD(MESSAGE,'$.stat.turn_around_time') AS STAT_TURN_AROUND_TIME,
EXTRACTJSONFIELD(MESSAGE,'$.stat.t_transfer') AS STAT_T_TRANSFER
FROM TEST
EMIT CHANGES LIMIT 1;
+----------------+--------------------------+----------+--------+------------+-------------+------------+------------+----------------------+----------------+
|REMOTE_ADDRESS |TIME_START |PROTOCOL |STATUS |STAT_C_WAIT |STAT_S_WAIT |STAT_DIGEST |STAT_COMMIT |STAT_TURN_AROUND_TIME |STAT_T_TRANSFER |
+----------------+--------------------------+----------+--------+------------+-------------+------------+------------+----------------------+----------------+
|10.x.x.x |10/Mar/2020:07:10:39 +0000|HTTP/1.1 |200 |0.004 |0.432 |0 |31.878 |0 |32.319 |
We can persist this to a new Kafka topic, and for good measure reserialise it to Avro to make it easier for downstream applications to use:
CREATE STREAM BEATS WITH (VALUE_FORMAT='AVRO') AS
SELECT EXTRACTJSONFIELD(MESSAGE,'$.remote_address') AS REMOTE_ADDRESS,
EXTRACTJSONFIELD(MESSAGE,'$.time_start') AS TIME_START,
EXTRACTJSONFIELD(MESSAGE,'$.protocol') AS PROTOCOL,
EXTRACTJSONFIELD(MESSAGE,'$.status') AS STATUS,
EXTRACTJSONFIELD(MESSAGE,'$.stat.c_wait') AS STAT_C_WAIT,
EXTRACTJSONFIELD(MESSAGE,'$.stat.s_wait') AS STAT_S_WAIT,
EXTRACTJSONFIELD(MESSAGE,'$.stat.digest') AS STAT_DIGEST,
EXTRACTJSONFIELD(MESSAGE,'$.stat.commit') AS STAT_COMMIT,
EXTRACTJSONFIELD(MESSAGE,'$.stat.turn_around_time') AS STAT_TURN_AROUND_TIME,
EXTRACTJSONFIELD(MESSAGE,'$.stat.t_transfer') AS STAT_T_TRANSFER
FROM TEST
EMIT CHANGES;
ksql> DESCRIBE BEATS;
Name : BEATS
Field | Type
---------------------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
REMOTE_ADDRESS | VARCHAR(STRING)
TIME_START | VARCHAR(STRING)
PROTOCOL | VARCHAR(STRING)
STATUS | VARCHAR(STRING)
STAT_C_WAIT | VARCHAR(STRING)
STAT_S_WAIT | VARCHAR(STRING)
STAT_DIGEST | VARCHAR(STRING)
STAT_COMMIT | VARCHAR(STRING)
STAT_TURN_AROUND_TIME | VARCHAR(STRING)
STAT_T_TRANSFER | VARCHAR(STRING)
---------------------------------------------------
For runtime statistics and query details run: DESCRIBE EXTENDED <Stream,Table>;
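To confirm data is actually landing on the new Avro-serialised topic you can print it straight from ksqlDB (a quick optional check):
PRINT 'BEATS' FROM BEGINNING;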
To debug issues with ksqlDB returning NULLs check out this article. A lot of the time it's down to serialisation errors. For example, if you look at the ksqlDB server log you'll see this error when it tries to parse the badly-formed escaped JSON before I fixed it:
WARN Exception caught during Deserialization, taskId: 0_0, topic: testing, partition: 0, offset: 1 (org.apache.kafka.streams.processor.internals.StreamThread:36)
org.apache.kafka.common.errors.SerializationException: mvn value from topic: testing
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('o' (code 111)): was expecting comma to separate Object entries
at [Source: (byte[])"{"#timestamp": "XXXXXXXX", "beat": {"hostname": "xxxxxxxxxx","name": "xxxxxxxxxx","version": "5.2.1"}, "input_type": "log", "log_dc": "xxxxxxxxxxx", "message": "{\"server_name\":\"xxxxxxxxxxxxxxx\",\"remote_address\":\"10.x.x.x\",\"user\":\
"xxxxxx\",\"timestamp_start\":\"xxxxxxxx\",\"timestamp_finish\":\"xxxxxxxxxx\",\"time_start\":\"10/Mar/2020:07:10:39 +0000\",\"time_finish\":\"10/Mar/2020:07:10:39 +0000\",\"request_method\":\"PUT\",\"request_uri\":\"xxxxxxxxxxxxxxxxxxxxxxx\",\"protocol\":\"HT"[truncated 604 bytes];
line: 1, column: 827]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1804)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:693)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:591)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextFieldName(UTF8StreamJsonParser.java:986)
…

Need to filter out Kafka Records based on a certain keyword

I have a Kafka topic which has around 3 million records. I want to pick out a single record from this which has a certain parameter. I have been trying to query this using Lenses, but am unable to form the correct query. Below are the contents of one message:
{
"header": {
"schemaVersionNo": "1",
},
"payload": {
"modifiedDate": 1552334325212,
"createdDate": 1552334325212,
"createdBy": "A",
"successful": true,
"source_order_id": "1111111111111",
}
}
Now I want to filter out a record with a particular source_order_id, but I am not able to figure out the right way to do so.
We have tried via Lenses as well as Kafka Tool.
A sample query that we tried in Lenses is below:
SELECT * FROM `TEST`
WHERE _vtype='JSON' AND _ktype='BYTES'
AND _sample=2 AND _sampleWindow=200 AND payload.createdBy='A'
This query works; however, if we try with the source order id as shown below, we get an error:
SELECT * FROM `TEST`
WHERE _vtype='JSON' AND _ktype='BYTES'
AND _sample=2 AND _sampleWindow=200 AND payload.source_order_id='1111111111111'
Error : "Invalid syntax at line=3 and column=41.Invalid syntax for 'payload.source_order_id'. Field 'payload' resolves to primitive type STRING.
Consuming all 3 million records via a custom consumer and then iterating over them doesn't seem like an optimal approach to me, so I am looking for any available solutions for such a use case.
Since you said you are open to other solutions, here is one built using KSQL.
First, let's get some sample records into a source topic:
$ kafkacat -P -b localhost:9092 -t TEST <<EOF
{ "header": { "schemaVersionNo": "1" }, "payload": { "modifiedDate": 1552334325212, "createdDate": 1552334325212, "createdBy": "A", "successful": true, "source_order_id": "3411976933214" } }
{ "header": { "schemaVersionNo": "1" }, "payload": { "modifiedDate": 1552334325412, "createdDate": 1552334325412, "createdBy": "B", "successful": true, "source_order_id": "3411976933215" } }
{ "header": { "schemaVersionNo": "1" }, "payload": { "modifiedDate": 1552334325612, "createdDate": 1552334325612, "createdBy": "C", "successful": true, "source_order_id": "3411976933216" } }
EOF
Using KSQL we can inspect the topic with PRINT:
ksql> PRINT 'TEST' FROM BEGINNING;
Format:JSON
{"ROWTIME":1552476232988,"ROWKEY":"null","header":{"schemaVersionNo":"1"},"payload":{"modifiedDate":1552334325212,"createdDate":1552334325212,"createdBy":"A","successful":true,"source_order_id":"3411976933214"}}
{"ROWTIME":1552476232988,"ROWKEY":"null","header":{"schemaVersionNo":"1"},"payload":{"modifiedDate":1552334325412,"createdDate":1552334325412,"createdBy":"B","successful":true,"source_order_id":"3411976933215"}}
{"ROWTIME":1552476232988,"ROWKEY":"null","header":{"schemaVersionNo":"1"},"payload":{"modifiedDate":1552334325612,"createdDate":1552334325612,"createdBy":"C","successful":true,"source_order_id":"3411976933216"}}
Then declare a schema on the topic, which enables us to run SQL against it:
ksql> CREATE STREAM TEST (header STRUCT<schemaVersionNo VARCHAR>,
payload STRUCT<modifiedDate BIGINT,
createdDate BIGINT,
createdBy VARCHAR,
successful BOOLEAN,
source_order_id VARCHAR>)
WITH (KAFKA_TOPIC='TEST',
VALUE_FORMAT='JSON');
Message
----------------
Stream created
----------------
Tell KSQL to work with all the data in the topic:
ksql> SET 'auto.offset.reset' = 'earliest';
Successfully changed local property 'auto.offset.reset' to 'earliest'. Use the UNSET command to revert your change.
And now we can select all the data:
ksql> SELECT * FROM TEST;
1552475910106 | null | {SCHEMAVERSIONNO=1} | {MODIFIEDDATE=1552334325212, CREATEDDATE=1552334325212, CREATEDBY=A, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933214}
1552475910106 | null | {SCHEMAVERSIONNO=1} | {MODIFIEDDATE=1552334325412, CREATEDDATE=1552334325412, CREATEDBY=B, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933215}
1552475910106 | null | {SCHEMAVERSIONNO=1} | {MODIFIEDDATE=1552334325612, CREATEDDATE=1552334325612, CREATEDBY=C, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933216}
^CQuery terminated
or we can selectively query it, using the -> notation to access nested fields in the schema:
ksql> SELECT * FROM TEST
WHERE PAYLOAD->CREATEDBY='A';
1552475910106 | null | {SCHEMAVERSIONNO=1} | {MODIFIEDDATE=1552334325212, CREATEDDATE=1552334325212, CREATEDBY=A, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933214}
As well as selecting all records, you can return just the fields of interest:
ksql> SELECT payload FROM TEST
WHERE PAYLOAD->source_order_id='3411976933216';
{MODIFIEDDATE=1552334325612, CREATEDDATE=1552334325612, CREATEDBY=C, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933216}
With KSQL you can write the results of any SELECT statement to a new topic, which populates it with all existing messages along with every new message on the source topic filtered and processed per the declared SELECT statement:
ksql> CREATE STREAM TEST_CREATED_BY_A AS
SELECT * FROM TEST WHERE PAYLOAD->CREATEDBY='A';
Message
----------------------------
Stream created and running
----------------------------
List the topics on the Kafka cluster:
ksql> SHOW TOPICS;
Kafka Topic | Registered | Partitions | Partition Replicas | Consumers | ConsumerGroups
----------------------------------------------------------------------------------------------------
orders | true | 1 | 1 | 1 | 1
pageviews | false | 1 | 1 | 0 | 0
products | true | 1 | 1 | 1 | 1
TEST | true | 1 | 1 | 1 | 1
TEST_CREATED_BY_A | true | 4 | 1 | 0 | 0
Print the contents of the new topic:
ksql> PRINT 'TEST_CREATED_BY_A' FROM BEGINNING;
Format:JSON
{"ROWTIME":1552475910106,"ROWKEY":"null","HEADER":{"SCHEMAVERSIONNO":"1"},"PAYLOAD":{"MODIFIEDDATE":1552334325212,"CREATEDDATE":1552334325212,"CREATEDBY":"A","SUCCESSFUL":true,"SOURCE_ORDER_ID":"3411976933214"}}