Model JSON input data for JDBC Sink Connector

My goal is to process an input stream that contains DB updates coming from a legacy system and replicate it into a different DB. The messages have the following format:
Insert - The after struct contains the state of the table entry after the insert operation is complete. No before struct present;
Update - The before struct contains the state of the table entry before the update operation and the after struct only lists the modified fields along with the table key;
Delete - The before struct contains the state of the table entry before the delete operation and the after struct is not present in the message;
Examples:
Insert
{"table":"SCHEMA.TEST_TABLE","op_type":"I","after":{"id_test":8934,"test_value_1":26910,"test_value_2":"XYZ","test_value_3":"2023-01-12T18:32:13.781217"}}
Update
{"table":"SCHEMA.TEST_TABLE","op_type":"U","before":{"id_test":8934,"test_value_1":26910,"test_value_2":"XYZ","test_value_3":"2023-01-12T18:32:13.781217"},"after":{"id_test":8934,"test_value_1":null,"test_value_3":"2023-01-12T18:32:18.787337"}}
Delete
{"table":"SCHEMA.TEST_TABLE","op_type":"D","before":{"id_test":8934,"test_value_1":1499,"test_value_2":"XYZ","test_value_3":"2023-01-12T18:32:18.787337"}}
After struggling with the lack of a schema in the input messages, I turned to ksqlDB to overcome this and was able - not sure if it is the best approach - to get it to generate an output topic that can be processed by the Kafka JDBC Sink Connector using the JSON_SR format.
The problem is that the format I get in the output topic from ksqlDB does not allow for a correct interpretation of all the scenarios - only the Insert scenario is interpreted correctly.
The flow in ksqlDB was:
Process the input topic and map it to a JSON object by defining a schema. Created a stream (TEST_TABLE_INPUT) where the input object is defined:
CREATE OR REPLACE STREAM TEST_TABLE_INPUT (
"table" VARCHAR,
op_type VARCHAR,
after STRUCT<
id_test INTEGER, test_value_1 INTEGER, test_value_2 VARCHAR, test_value_3 VARCHAR
>,
before STRUCT<
id_test INTEGER
>
) WITH (kafka_topic='input_topic', value_format='JSON');
Process TEST_TABLE_INPUT to flatten the input structures (I tried using the BEFORE->* and AFTER->* strategy instead but got a Caused by: line 3:11: mismatched input '*' expecting ... error):
CREATE OR REPLACE STREAM TEST_TABLE_INTERNAL
AS SELECT
OP_TYPE,
BEFORE->id_test AS B_id_test,
AFTER->id_test AS A_id_test,
AFTER
FROM TEST_TABLE_INPUT;
With this the output stream is created:
CREATE OR REPLACE STREAM TEST_TABLE_UPDATES
WITH (
VALUE_FORMAT='JSON_SR',
KEY_FORMAT='JSON_SR',
KAFKA_TOPIC='T_TEST_UPDATES'
)
AS SELECT
COALESCE(B_id_test, A_id_test) AS id_test,
AFTER->test_value_1 AS test_value_1,
AFTER->test_value_2 AS test_value_2,
AFTER->test_value_3 AS test_value_3
FROM TEST_TABLE_INTERNAL
PARTITION BY COALESCE(B_id_test, A_id_test)
;
When processing the example input topic messages below
{"table":"SCHEMA.TEST_TABLE","op_type":"I","after":{"id_test":0,"test_value_1":17315,"test_value_2":"XYZ","test_value_3":"2023-01-12T19:06:49.694383"}}
{"table":"SCHEMA.TEST_TABLE","op_type":"U","before":{"id_test":0,"test_value_1":17315,"test_value_2":"XYZ","test_value_3":"2023-01-12T19:06:49.694383"},"after":{"id_test":0,"test_value_1":null,"test_value_3":"2023-01-12T19:06:54.702107"}}
{"table":"SCHEMA.TEST_TABLE","op_type":"D","before":{"id_test":0,"test_value_1":28605,"test_value_2":"XYZ","test_value_3":"2023-01-12T19:06:54.702107"}}
{"table":"SCHEMA.TEST_TABLE","op_type":"I","after":{"id_test":1,"test_value_1":15601,"test_value_2":"KLM","test_value_3":"2023-01-12T19:07:04.716303"}}
{"table":"SCHEMA.TEST_TABLE","op_type":"U","before":{"id_test":1,"test_value_1":15601,"test_value_2":"KLM","test_value_3":"2023-01-12T19:07:04.716303"},"after":{"id_test":1,"test_value_1":null,"test_value_3":"2023-01-12T19:07:09.723473"}}
{"table":"SCHEMA.TEST_TABLE","op_type":"I","after":{"id_test":2,"test_value_1":13386,"test_value_2":"ABC","test_value_3":"2023-01-12T19:14:23.633204"}}
to simulate the I, I-U and I-U-D scenarios, the output I get in ksqlDB is:
ksql> SELECT * FROM TEST_TABLE_UPDATES EMIT CHANGES;
+----------+-------------+--------------+----------------------------+
|ID_TEST |TEST_VALUE_1 |TEST_VALUE_2 |TEST_VALUE_3 |
+----------+-------------+--------------+----------------------------+
|0 |17315 |XYZ |2023-01-12T19:06:49.694383 |
|0 |null |null |2023-01-12T19:06:54.702107 |
|0 |null |null |null |
|1 |15601 |KLM |2023-01-12T19:07:04.716303 |
|1 |null |null |2023-01-12T19:07:09.723473 |
|2 |13386 |ABC |2023-01-12T19:14:23.633204 |
And outside of ksqlDB:
kafka-console-consumer.sh --property print.key=true --topic TEST_TABLE_UPDATES --from-beginning --bootstrap-server localhost:9092
0 {"TEST_VALUE_1":17315,"TEST_VALUE_2":"XYZ","TEST_VALUE_3":"2023-01-12T19:06:49.694383"}
0 {"TEST_VALUE_1":null,"TEST_VALUE_2":null,"TEST_VALUE_3":"2023-01-12T19:06:54.702107"}
0 {"TEST_VALUE_1":null,"TEST_VALUE_2":null,"TEST_VALUE_3":null}
1 {"TEST_VALUE_1":15601,"TEST_VALUE_2":"KLM","TEST_VALUE_3":"2023-01-12T19:07:04.716303"}
1 {"TEST_VALUE_1":null,"TEST_VALUE_2":null,"TEST_VALUE_3":"2023-01-12T19:07:09.723473"}
2 {"TEST_VALUE_1":13386,"TEST_VALUE_2":"ABC","TEST_VALUE_3":"2023-01-12T19:14:23.633204"}
On top of this, the connector is instantiated with the configuration below:
{
"name": "t_test_updates-v2",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"errors.log.enable": true,
"errors.log.include.messages": true,
"topics": "TEST_TABLE_UPDATES",
"key.converter": "io.confluent.connect.json.JsonSchemaConverter",
"key.converter.schemas.enable": false,
"key.converter.schema.registry.url": "http://schema-registry:8081",
"value.converter": "io.confluent.connect.json.JsonSchemaConverter",
"value.converter.schemas.enable": false,
"value.converter.schema.registry.url": "http://schema-registry:8081",
"connection.url": "jdbc:oracle:thin:#database:1521/xe",
"connection.user": "<REDACTED>",
"connection.password": "<REDACTED>",
"dialect.name": "OracleDatabaseDialect",
"insert.mode": "upsert",
"delete.enabled": true,
"table.name.format": "TEST_TABLE_REPL",
"pk.mode": "record_key",
"pk.fields": "ID_TEST",
"auto.create": false,
"error.tolerance": "all"
}
}
It processes the topic - apparently - without a problem, but the output in the DB does not match the expected result:
SQL> select * from test_table_repl;
ID_TEST TEST_VALUE_1 TEST_VALUE_2 TEST_VALUE_3
__________ _______________ __________________ _____________________________
1 2023-01-12T19:07:09.723473
0
2 13386 ABC 2023-01-12T19:14:23.633204
I would expect the DB query to return:
ID_TEST TEST_VALUE_1 TEST_VALUE_2 TEST_VALUE_3
__________ _______________ __________________ _____________________________
1 KLM 2023-01-12T19:07:09.723473
2 13386 ABC 2023-01-12T19:14:23.633204
As can be seen, only the insert scenario is working correctly.
I cannot understand what I am doing wrong.
Is it the data format, the connector configuration, both, or something else I am missing?

Related

The stream created in ksqlDB shows NULL value

I am trying to create a stream in ksqlDB to get the data from a Kafka topic and run queries on it.
CREATE STREAM test_location (
id VARCHAR,
name VARCHAR,
location VARCHAR
)
WITH (KAFKA_TOPIC='public.location',
VALUE_FORMAT='JSON',
PARTITIONS=10);
The data in the topic public.location is in JSON format.
UPDATE - topic message:
print 'public.location' from beginning limit 1;
Key format: ¯\_(ツ)_/¯ - no data processed
Value format: JSON or KAFKA_STRING
rowtime: 2021/05/23 11:27:39.429 Z, key: <null>, value: {"sourceTable":{"id":"1","name":Sam,"location":Manchester,"ConnectorVersion":null,"connectorId":null,"ConnectorName":null,"DbName":null,"DbSchema":null,"TableName":null,"payload":null,"schema":null},"ConnectorVersion":null,"connectorId":null,"ConnectorName":null,"DbName":null,"DbSchema":null,"TableName":null,"payload":null,"schema":null}, partition: 3
After the stream is created, performing a SELECT on it returns NULL in the output, although the topic has the data.
select * from test_location
>EMIT CHANGES limit 5;
+-----------------------------------------------------------------+-----------------------------------------------------------------+-----------------------------------------------------------------+
|ID |NAME |LOCATION |
+-----------------------------------------------------------------+-----------------------------------------------------------------+-----------------------------------------------------------------+
|null |null |null |
|null |null |null |
|null |null |null |
|null |null |null |
|null |null |null |
Limit Reached
Query terminated
Here are the details from the docker-compose file:
version: '2'
services:
ksqldb-server:
image: confluentinc/ksqldb-server:0.18.0
hostname: ksqldb-server
container_name: ksqldb-server
depends_on:
- schema-registry
ports:
- "8088:8088"
environment:
KSQL_LISTENERS: "http://0.0.0.0:8088"
KSQL_BOOTSTRAP_SERVERS: "broker:29092"
KSQL_KSQL_SCHEMA_REGISTRY_URL: "http://schema-registry:8081"
KSQL_KSQL_LOGGING_PROCESSING_STREAM_AUTO_CREATE: "true"
KSQL_KSQL_LOGGING_PROCESSING_TOPIC_AUTO_CREATE: "true"
# Configuration to embed Kafka Connect support.
KSQL_CONNECT_GROUP_ID: "ksql-connect-01"
KSQL_CONNECT_BOOTSTRAP_SERVERS: "broker:29092"
KSQL_CONNECT_KEY_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
KSQL_CONNECT_VALUE_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
KSQL_CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL: "http://schema-registry:8081"
KSQL_CONNECT_CONFIG_STORAGE_TOPIC: "_ksql-connect-01-configs"
KSQL_CONNECT_OFFSET_STORAGE_TOPIC: "_ksql-connect-01-offsets"
KSQL_CONNECT_STATUS_STORAGE_TOPIC: "_ksql-connect-01-statuses"
KSQL_CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1
KSQL_CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1
KSQL_CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1
KSQL_CONNECT_PLUGIN_PATH: "/usr/share/kafka/plugins"
Update:
Here is a message in the topic as I see it in Kafka:
{
"sourceTable": {
"id": "1",
"name": Sam,
"location": Manchester,
"ConnectorVersion": null,
"connectorId": null,
"ConnectorName": null,
"DbName": null,
"DbSchema": null,
"TableName": null,
"payload": null,
"schema": null
},
"ConnectorVersion": null,
"connectorId": null,
"ConnectorName": null,
"DbName": null,
"DbSchema": null,
"TableName": null,
"payload": null,
"schema": null
}
Which step or configuration am I missing?
Given your payload, you would need to declare the schema as nested, because id, name, and location are not "top level" fields in the JSON; they are nested within sourceTable.
CREATE STREAM est_location (
sourceTable STRUCT<id VARCHAR, name VARCHAR, location VARCHAR>
) WITH (KAFKA_TOPIC='public.location', VALUE_FORMAT='JSON');
It's not possible to "unwrap" the data when defining the schema; the schema must match what is in the topic. In addition to sourceTable you could also add ConnectorVersion etc. to the schema, as they are also "top level" fields in your JSON. The bottom line is that columns in ksqlDB can only be declared on top-level fields. Everything else is nested data that you can access using the STRUCT type.
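For example, a fuller declaration along those lines might look like this (a sketch; the stream name est_location_full is just illustrative, and the extra columns are assumed to be VARCHAR since the sample message only shows null for them):
CREATE STREAM est_location_full (
sourceTable STRUCT<id VARCHAR, name VARCHAR, location VARCHAR>,
ConnectorVersion VARCHAR,
connectorId VARCHAR,
ConnectorName VARCHAR,
DbName VARCHAR,
DbSchema VARCHAR,
TableName VARCHAR
) WITH (KAFKA_TOPIC='public.location', VALUE_FORMAT='JSON');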
Of course later, when you query est_location you can refer to individual fields via sourceTable->id etc.
It would also be possible to declare a derived STREAM if you want to unnest the schema:
CREATE STREAM unnested_est_location AS
SELECT sourceTable->id AS id,
sourceTable->name AS name,
sourceTable->location AS location
FROM est_location;
Of course, this would write the data into a new topic.
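If you want to control the name or format of that new topic rather than let ksqlDB derive them from the stream name, the same statement can take a WITH clause, for example (the topic name here is just an illustration):
CREATE STREAM unnested_est_location
WITH (KAFKA_TOPIC='public.location.unnested', VALUE_FORMAT='JSON') AS
SELECT sourceTable->id AS id,
sourceTable->name AS name,
sourceTable->location AS location
FROM est_location;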

KSQLDB coalesce always returns null despite parameters

I have the following ksql query:
SELECT
event->acceptedevent->id as id1,
event->refundedevent->id as id2,
coalesce(event->acceptedevent->id, event->refundedevent->id) as coalesce_col
FROM events
EMIT CHANGES;
Based on the documentation (https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/scalar-functions/#coalesce), COALESCE returns the first non-null parameter.
Query returns the following:
+-----------------------------------------------+-----------------------------------------------+-----------------------------------------------+
|ID1 |ID2 |COALESCE_COL |
+-----------------------------------------------+-----------------------------------------------+-----------------------------------------------+
|1 |null |null |
|2 |null |null |
|3 |null |null |
I was expecting that, since ID1 is clearly not null and is the first parameter to the call, COALESCE would return the same value as ID1, but it returns null. What am I missing?
I am using confluentinc/cp-ksqldb-server:6.1.1 and Avro for the value serde.
EventMessage.avsc:
{
"type": "record",
"name": "EventMessage",
"namespace": "com.example.poc.processor2.avro",
"fields": [
{
"name": "event",
"type": [
"com.example.poc.processor2.avro.AcceptedEvent",
"com.example.poc.processor2.avro.RefundedEvent"
]
}
]
}
Probably a bug in how the data is deserialized, or in the COALESCE function.
What KSQL version are you running?
How is your data serialized in the topic?
I tried with a JSON format and it worked.
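For reference, a stream declaration along these lines reproduces that schema (a sketch - the exact statement isn't shown, but it matches the DESCRIBE output below):
CREATE STREAM events (
event STRUCT<
acceptedevent STRUCT<id INTEGER>,
refundedevent STRUCT<id INTEGER>>
) WITH (KAFKA_TOPIC='events', VALUE_FORMAT='JSON');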
ksql> describe events;
Name : EVENTS
Field | Type
------------------------------------------------------------------------------------
EVENT | STRUCT<ACCEPTEDEVENT STRUCT<ID INTEGER>, REFUNDEDEVENT STRUCT<ID INTEGER>>
------------------------------------------------------------------------------------
ksql> print 'events' from BEGINNING;
Key format: ¯\_(ツ)_/¯ - no data processed
Value format: JSON or KAFKA_STRING
rowtime: 2021/03/24 13:57:27.403 Z, key: <null>, value: {"event":{"acceptedevent":{"id":1}, "refundedevent":{}}}, partition:
ksql> select event->acceptedevent->id, event->refundedevent->id, coalesce(event->acceptedevent->id, event->refundedevent->id) from events emit changes;
+----------------------------------------------------------+----------------------------------------------------------+----------------------------------------------------------+
|ID |ID_1 |KSQL_COL_0 |
+----------------------------------------------------------+----------------------------------------------------------+----------------------------------------------------------+
|1 |null |1 |

How to select subvalue of json in topic as ksql stream

I have many topics in Kafka with a format like this:
value: {big json string with many subkeys etc}
print topic looks like:
rowtime: 3/10/20 7:10:43 AM UTC, key: , value: {"#timestamp": "XXXXXXXX", "beat": {"hostname": "xxxxxxxxxx","name": "xxxxxxxxxx","version": "5.2.1"}, "input_type": "log", "log_dc": "xxxxxxxxxxx", "message": "{\"server_name\":\"xxxxxxxxxxxxxxx\",\"remote_address\":\"10.x.x.x\",\"user\":\"xxxxxx\",\"timestamp_start\":\"xxxxxxxx\",\"timestamp_finish\":\"xxxxxxxxxx\",\"time_start\":\"10/Mar/2020:07:10:39 +0000\",\"time_finish\":\"10/Mar/2020:07:10:39 +0000\",\"request_method\":\"PUT\",\"request_uri\":\"xxxxxxxxxxxxxxxxxxxxxxx\",\"protocol\":\"HTTP/1.1\",\"status\":200,\"response_length\":\"0\",\"request_length\":\"0\",\"user_agent\":\"xxxxxxxxx\",\"request_id\":\"zzzzzzzzzzzzzzzzzzzzz\",\"request_type\":\"zzzzzzzz\",\"stat\":{\"c_wait\":0.004,\"s_wait\":0.432,\"digest\":0.0,\"commit\":31.878,\"turn_around_time\":0.0,\"t_transfer\":32.319},\"object_length\":\"0\","o_name\":\"xxxxx\",\"https\":{\"protocol\":\"TLSv1.2\",\"cipher_suite\":\"TLS_RSA_WITH_AES_256_GCM_SHA384\"},\"principals\":{\"identity\":\"zzzzzz\",\"asv\":\"dddddddddd\"},\"type\":\"http\",\"format\":1}", "offset": 70827770, "source": "/var/log/xxxx.log", "type": "topicname" }
I have tried using
CREATE STREAM test
(value STRUCT<
server_name VARCHAR,
remote_address VARCHAR,
forwarded_for VARCHAR,
remote_user VARCHAR,
timestamp_start VARCHAR
..
WITH (KAFKA_TOPIC='testing', VALUE_FORMAT='JSON');
But I get a stream with the value as NULL.
Is there a way to grab the fields under the value key?
The escaped JSON is not valid JSON, which is probably going to have made this more difficult :)
In this snippet:
…\"object_length\":\"0\","o_name\":\"xxxxx\",\"https\":{\"protocol\":\…
the leading double-quote for o_name is not escaped. You can validate this with something like jq:
echo '{"message": "{\"server_name\":\"xxxxxxxxxxxxxxx\",\"remote_address\":\"10.x.x.x\",\"user\":\"xxxxxx\",\"timestamp_start\":\"xxxxxxxx\",\"timestamp_finish\":\"xxxxxxxxxx\",\"time_start\":\"10/Mar/2020:07:10:39 +0000\",\"time_finish\":\"10/Mar/2020:07:10:39 +0000\",\"request_method\":\"PUT\",\"request_uri\":\"xxxxxxxxxxxxxxxxxxxxxxx\",\"protocol\":\"HTTP/1.1\",\"status\":200,\"response_length\":\"0\",\"request_length\":\"0\",\"user_agent\":\"xxxxxxxxx\",\"request_id\":\"zzzzzzzzzzzzzzzzzzzzz\",\"request_type\":\"zzzzzzzz\",\"stat\":{\"c_wait\":0.004,\"s_wait\":0.432,\"digest\":0.0,\"commit\":31.878,\"turn_around_time\":0.0,\"t_transfer\":32.319},\"object_length\":\"0\","o_name\":\"xxxxx\",\"https\":{\"protocol\":\"TLSv1.2\",\"cipher_suite\":\"TLS_RSA_WITH_AES_256_GCM_SHA384\"},\"principals\":{\"identity\":\"zzzzzz\",\"asv\":\"dddddddddd\"},\"type\":\"http\",\"format\":1}"}' | jq '.message|fromjson'
parse error: Invalid numeric literal at line 1, column 685
With the JSON fixed this then parses successfully:
➜ echo '{"message": "{\"server_name\":\"xxxxxxxxxxxxxxx\",\"remote_address\":\"10.x.x.x\",\"user\":\"xxxxxx\",\"timestamp_start\":\"xxxxxxxx\",\"timestamp_finish\":\"xxxxxxxxxx\",\"time_start\":\"10/Mar/2020:07:10:39 +0000\",\"time_finish\":\"10/Mar/2020:07:10:39 +0000\",\"request_m
ethod\":\"PUT\",\"request_uri\":\"xxxxxxxxxxxxxxxxxxxxxxx\",\"protocol\":\"HTTP/1.1\",\"status\":200,\"response_length\":\"0\",\"request_length\":\"0\",\"user_agent\":\"xxxxxxxxx\",\"request_id\":\"zzzzzzzzzzzzzzzzzzzzz\",\"request_type\":\"zzzzzzzz\",\"stat\":{\"c_wait\":0.004,\"s_
wait\":0.432,\"digest\":0.0,\"commit\":31.878,\"turn_around_time\":0.0,\"t_transfer\":32.319},\"object_length\":\"0\",\"o_name\":\"xxxxx\",\"https\":{\"protocol\":\"TLSv1.2\",\"cipher_suite\":\"TLS_RSA_WITH_AES_256_GCM_SHA384\"},\"principals\":{\"identity\":\"zzzzzz\",\"asv\":\"dddd
dddddd\"},\"type\":\"http\",\"format\":1}"}' | jq '.message|fromjson'
{
"server_name": "xxxxxxxxxxxxxxx",
"remote_address": "10.x.x.x",
"user": "xxxxxx",
"timestamp_start": "xxxxxxxx",
"timestamp_finish": "xxxxxxxxxx",
"time_start": "10/Mar/2020:07:10:39 +0000",
"time_finish": "10/Mar/2020:07:10:39 +0000",
"request_method": "PUT",
"request_uri": "xxxxxxxxxxxxxxxxxxxxxxx",
"protocol": "HTTP/1.1",
"status": 200,
…
So now let's get this into ksqlDB. I'm using kafkacat to load it into a topic:
kafkacat -b localhost:9092 -t testing -P<<EOF
{ "#timestamp": "XXXXXXXX", "beat": { "hostname": "xxxxxxxxxx", "name": "xxxxxxxxxx", "version": "5.2.1" }, "input_type": "log", "log_dc": "xxxxxxxxxxx", "message": "{\"server_name\":\"xxxxxxxxxxxxxxx\",\"remote_address\":\"10.x.x.x\",\"user\":\"xxxxxx\",\"timestamp_start\":\"xxxxxxxx\",\"timestamp_finish\":\"xxxxxxxxxx\",\"time_start\":\"10/Mar/2020:07:10:39 +0000\",\"time_finish\":\"10/Mar/2020:07:10:39 +0000\",\"request_method\":\"PUT\",\"request_uri\":\"xxxxxxxxxxxxxxxxxxxxxxx\",\"protocol\":\"HTTP/1.1\",\"status\":200,\"response_length\":\"0\",\"request_length\":\"0\",\"user_agent\":\"xxxxxxxxx\",\"request_id\":\"zzzzzzzzzzzzzzzzzzzzz\",\"request_type\":\"zzzzzzzz\",\"stat\":{\"c_wait\":0.004,\"s_wait\":0.432,\"digest\":0.0,\"commit\":31.878,\"turn_around_time\":0.0,\"t_transfer\":32.319},\"object_length\":\"0\",\"o_name\":\"xxxxx\",\"https\":{\"protocol\":\"TLSv1.2\",\"cipher_suite\":\"TLS_RSA_WITH_AES_256_GCM_SHA384\"},\"principals\":{\"identity\":\"zzzzzz\",\"asv\":\"dddddddddd\"},\"type\":\"http\",\"format\":1}", "offset": 70827770, "source": "/var/log/xxxx.log", "type": "topicname" }
EOF
Now with ksqlDB let's declare the outline schema, in which the message field is just a lump of VARCHAR:
CREATE STREAM TEST (BEAT STRUCT<HOSTNAME VARCHAR, NAME VARCHAR, VERSION VARCHAR>,
INPUT_TYPE VARCHAR,
MESSAGE VARCHAR,
OFFSET BIGINT,
SOURCE VARCHAR)
WITH (KAFKA_TOPIC='testing', VALUE_FORMAT='JSON');
We can query this stream to check that it's working:
SET 'auto.offset.reset' = 'earliest';
SELECT BEAT->HOSTNAME,
BEAT->VERSION,
SOURCE,
MESSAGE
FROM TEST
EMIT CHANGES LIMIT 1;
+-----------------+---------------+--------------------+--------------------------------------------------------------------+
|BEAT__HOSTNAME |BEAT__VERSION |SOURCE |MESSAGE |
+-----------------+---------------+--------------------+--------------------------------------------------------------------+
|xxxxxxxxxx |5.2.1 |/var/log/xxxx.log |{"server_name":"xxxxxxxxxxxxxxx","remote_address":"10.x.x.x","user":|
| | | |"xxxxxx","timestamp_start":"xxxxxxxx","timestamp_finish":"xxxxxxxxxx|
| | | |","time_start":"10/Mar/2020:07:10:39 +0000","time_finish":"10/Mar/20|
| | | |20:07:10:39 +0000","request_method":"PUT","request_uri":"xxxxxxxxxxx|
| | | |xxxxxxxxxxxx","protocol":"HTTP/1.1","status":200,"response_length":"|
| | | |0","request_length":"0","user_agent":"xxxxxxxxx","request_id":"zzzzz|
| | | |zzzzzzzzzzzzzzzz","request_type":"zzzzzzzz","stat":{"c_wait":0.004,"|
| | | |s_wait":0.432,"digest":0.0,"commit":31.878,"turn_around_time":0.0,"t|
| | | |_transfer":32.319},"object_length":"0","o_name":"xxxxx","https":{"pr|
| | | |otocol":"TLSv1.2","cipher_suite":"TLS_RSA_WITH_AES_256_GCM_SHA384"},|
| | | |"principals":{"identity":"zzzzzz","asv":"dddddddddd"},"type":"http",|
| | | |"format":1} |
Limit Reached
Query terminated
Now let's extract the embedded JSON fields using the EXTRACTJSONFIELD function (I've not done every field, just a handful of them to illustrate the pattern to follow):
SELECT EXTRACTJSONFIELD(MESSAGE,'$.remote_address') AS REMOTE_ADDRESS,
EXTRACTJSONFIELD(MESSAGE,'$.time_start') AS TIME_START,
EXTRACTJSONFIELD(MESSAGE,'$.protocol') AS PROTOCOL,
EXTRACTJSONFIELD(MESSAGE,'$.status') AS STATUS,
EXTRACTJSONFIELD(MESSAGE,'$.stat.c_wait') AS STAT_C_WAIT,
EXTRACTJSONFIELD(MESSAGE,'$.stat.s_wait') AS STAT_S_WAIT,
EXTRACTJSONFIELD(MESSAGE,'$.stat.digest') AS STAT_DIGEST,
EXTRACTJSONFIELD(MESSAGE,'$.stat.commit') AS STAT_COMMIT,
EXTRACTJSONFIELD(MESSAGE,'$.stat.turn_around_time') AS STAT_TURN_AROUND_TIME,
EXTRACTJSONFIELD(MESSAGE,'$.stat.t_transfer') AS STAT_T_TRANSFER
FROM TEST
EMIT CHANGES LIMIT 1;
+----------------+--------------------------+----------+--------+------------+-------------+------------+------------+----------------------+----------------+
|REMOTE_ADDRESS |TIME_START |PROTOCOL |STATUS |STAT_C_WAIT |STAT_S_WAIT |STAT_DIGEST |STAT_COMMIT |STAT_TURN_AROUND_TIME |STAT_T_TRANSFER |
+----------------+--------------------------+----------+--------+------------+-------------+------------+------------+----------------------+----------------+
|10.x.x.x |10/Mar/2020:07:10:39 +0000|HTTP/1.1 |200 |0.004 |0.432 |0 |31.878 |0 |32.319 |
We can persist this to a new Kafka topic, and for good measure reserialise it to Avro to make it easier for downstream applications to use:
CREATE STREAM BEATS WITH (VALUE_FORMAT='AVRO') AS
SELECT EXTRACTJSONFIELD(MESSAGE,'$.remote_address') AS REMOTE_ADDRESS,
EXTRACTJSONFIELD(MESSAGE,'$.time_start') AS TIME_START,
EXTRACTJSONFIELD(MESSAGE,'$.protocol') AS PROTOCOL,
EXTRACTJSONFIELD(MESSAGE,'$.status') AS STATUS,
EXTRACTJSONFIELD(MESSAGE,'$.stat.c_wait') AS STAT_C_WAIT,
EXTRACTJSONFIELD(MESSAGE,'$.stat.s_wait') AS STAT_S_WAIT,
EXTRACTJSONFIELD(MESSAGE,'$.stat.digest') AS STAT_DIGEST,
EXTRACTJSONFIELD(MESSAGE,'$.stat.commit') AS STAT_COMMIT,
EXTRACTJSONFIELD(MESSAGE,'$.stat.turn_around_time') AS STAT_TURN_AROUND_TIME,
EXTRACTJSONFIELD(MESSAGE,'$.stat.t_transfer') AS STAT_T_TRANSFER
FROM TEST
EMIT CHANGES;
ksql> DESCRIBE BEATS;
Name : BEATS
Field | Type
---------------------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
REMOTE_ADDRESS | VARCHAR(STRING)
TIME_START | VARCHAR(STRING)
PROTOCOL | VARCHAR(STRING)
STATUS | VARCHAR(STRING)
STAT_C_WAIT | VARCHAR(STRING)
STAT_S_WAIT | VARCHAR(STRING)
STAT_DIGEST | VARCHAR(STRING)
STAT_COMMIT | VARCHAR(STRING)
STAT_TURN_AROUND_TIME | VARCHAR(STRING)
STAT_T_TRANSFER | VARCHAR(STRING)
---------------------------------------------------
For runtime statistics and query details run: DESCRIBE EXTENDED <Stream,Table>;
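Note from the DESCRIBE output that EXTRACTJSONFIELD returns VARCHAR, so if downstream consumers need numeric types, the extracted values can be cast, for example (a sketch):
SELECT CAST(EXTRACTJSONFIELD(MESSAGE,'$.status') AS INTEGER) AS STATUS,
CAST(EXTRACTJSONFIELD(MESSAGE,'$.stat.c_wait') AS DOUBLE) AS STAT_C_WAIT
FROM TEST
EMIT CHANGES LIMIT 1;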
To debug issues with ksqlDB returning NULLs check out this article. A lot of the time it's down to serialisation errors. For example, if you look at the ksqlDB server log you'll see this error when it tries to parse the badly-formed escaped JSON before I fixed it:
WARN Exception caught during Deserialization, taskId: 0_0, topic: testing, partition: 0, offset: 1 (org.apache.kafka.streams.processor.internals.StreamThread:36)
org.apache.kafka.common.errors.SerializationException: mvn value from topic: testing
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('o' (code 111)): was expecting comma to separate Object entries
at [Source: (byte[])"{"#timestamp": "XXXXXXXX", "beat": {"hostname": "xxxxxxxxxx","name": "xxxxxxxxxx","version": "5.2.1"}, "input_type": "log", "log_dc": "xxxxxxxxxxx", "message": "{\"server_name\":\"xxxxxxxxxxxxxxx\",\"remote_address\":\"10.x.x.x\",\"user\":\
"xxxxxx\",\"timestamp_start\":\"xxxxxxxx\",\"timestamp_finish\":\"xxxxxxxxxx\",\"time_start\":\"10/Mar/2020:07:10:39 +0000\",\"time_finish\":\"10/Mar/2020:07:10:39 +0000\",\"request_method\":\"PUT\",\"request_uri\":\"xxxxxxxxxxxxxxxxxxxxxxx\",\"protocol\":\"HT"[truncated 604 bytes];
line: 1, column: 827]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1804)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:693)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:591)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextFieldName(UTF8StreamJsonParser.java:986)
…

Need to filter out Kafka Records based on a certain keyword

I have a Kafka topic which has around 3 million records. I want to pick out a single record from this which has a certain parameter. I have been trying to query this using Lenses, but am unable to form the correct query. Below are the record contents of one message.
{
"header": {
"schemaVersionNo": "1",
},
"payload": {
"modifiedDate": 1552334325212,
"createdDate": 1552334325212,
"createdBy": "A",
"successful": true,
"source_order_id": "1111111111111",
}
}
Now I want to filter out a record with a particular source_order_id, but am not able to figure out the right way to do so.
We have tried via Lenses as well as Kafka Tool.
A sample query that we tried in Lenses is below:
SELECT * FROM `TEST`
WHERE _vtype='JSON' AND _ktype='BYTES'
AND _sample=2 AND _sampleWindow=200 AND payload.createdBy='A'
This query works; however, if we try with the source order id as shown below we get an error:
SELECT * FROM `TEST`
WHERE _vtype='JSON' AND _ktype='BYTES'
AND _sample=2 AND _sampleWindow=200 AND payload.source_order_id='1111111111111'
Error : "Invalid syntax at line=3 and column=41.Invalid syntax for 'payload.source_order_id'. Field 'payload' resolves to primitive type STRING.
Consuming all 3 million records via a custom consumer and then iterating over them doesn't seem like an optimal approach to me, so I am looking for any available solutions for such a use case.
Since you said you are open to other solutions, here is one built using KSQL.
First, let's get some sample records into a source topic:
$ kafkacat -P -b localhost:9092 -t TEST <<EOF
{ "header": { "schemaVersionNo": "1" }, "payload": { "modifiedDate": 1552334325212, "createdDate": 1552334325212, "createdBy": "A", "successful": true, "source_order_id": "3411976933214" } }
{ "header": { "schemaVersionNo": "1" }, "payload": { "modifiedDate": 1552334325412, "createdDate": 1552334325412, "createdBy": "B", "successful": true, "source_order_id": "3411976933215" } }
{ "header": { "schemaVersionNo": "1" }, "payload": { "modifiedDate": 1552334325612, "createdDate": 1552334325612, "createdBy": "C", "successful": true, "source_order_id": "3411976933216" } }
EOF
Using KSQL we can inspect the topic with PRINT:
ksql> PRINT 'TEST' FROM BEGINNING;
Format:JSON
{"ROWTIME":1552476232988,"ROWKEY":"null","header":{"schemaVersionNo":"1"},"payload":{"modifiedDate":1552334325212,"createdDate":1552334325212,"createdBy":"A","successful":true,"source_order_id":"3411976933214"}}
{"ROWTIME":1552476232988,"ROWKEY":"null","header":{"schemaVersionNo":"1"},"payload":{"modifiedDate":1552334325412,"createdDate":1552334325412,"createdBy":"B","successful":true,"source_order_id":"3411976933215"}}
{"ROWTIME":1552476232988,"ROWKEY":"null","header":{"schemaVersionNo":"1"},"payload":{"modifiedDate":1552334325612,"createdDate":1552334325612,"createdBy":"C","successful":true,"source_order_id":"3411976933216"}}
Then declare a schema on the topic, which enables us to run SQL against it:
ksql> CREATE STREAM TEST (header STRUCT<schemaVersionNo VARCHAR>,
payload STRUCT<modifiedDate BIGINT,
createdDate BIGINT,
createdBy VARCHAR,
successful BOOLEAN,
source_order_id VARCHAR>)
WITH (KAFKA_TOPIC='TEST',
VALUE_FORMAT='JSON');
Message
----------------
Stream created
----------------
Tell KSQL to work with all the data in the topic:
ksql> SET 'auto.offset.reset' = 'earliest';
Successfully changed local property 'auto.offset.reset' to 'earliest'. Use the UNSET command to revert your change.
And now we can select all the data:
ksql> SELECT * FROM TEST;
1552475910106 | null | {SCHEMAVERSIONNO=1} | {MODIFIEDDATE=1552334325212, CREATEDDATE=1552334325212, CREATEDBY=A, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933214}
1552475910106 | null | {SCHEMAVERSIONNO=1} | {MODIFIEDDATE=1552334325412, CREATEDDATE=1552334325412, CREATEDBY=B, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933215}
1552475910106 | null | {SCHEMAVERSIONNO=1} | {MODIFIEDDATE=1552334325612, CREATEDDATE=1552334325612, CREATEDBY=C, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933216}
^CQuery terminated
or we can selectively query it, using the -> notation to access nested fields in the schema:
ksql> SELECT * FROM TEST
WHERE PAYLOAD->CREATEDBY='A';
1552475910106 | null | {SCHEMAVERSIONNO=1} | {MODIFIEDDATE=1552334325212, CREATEDDATE=1552334325212, CREATEDBY=A, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933214}
As well as selecting all records, you can return just the fields of interest:
ksql> SELECT payload FROM TEST
WHERE PAYLOAD->source_order_id='3411976933216';
{MODIFIEDDATE=1552334325612, CREATEDDATE=1552334325612, CREATEDBY=C, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933216}
With KSQL you can write the results of any SELECT statement to a new topic, which populates it with all existing messages along with every new message on the source topic filtered and processed per the declared SELECT statement:
ksql> CREATE STREAM TEST_CREATED_BY_A AS
SELECT * FROM TEST WHERE PAYLOAD->CREATEDBY='A';
Message
----------------------------
Stream created and running
----------------------------
List the topics on the Kafka cluster:
ksql> SHOW TOPICS;
Kafka Topic | Registered | Partitions | Partition Replicas | Consumers | ConsumerGroups
----------------------------------------------------------------------------------------------------
orders | true | 1 | 1 | 1 | 1
pageviews | false | 1 | 1 | 0 | 0
products | true | 1 | 1 | 1 | 1
TEST | true | 1 | 1 | 1 | 1
TEST_CREATED_BY_A | true | 4 | 1 | 0 | 0
Print the contents of the new topic:
ksql> PRINT 'TEST_CREATED_BY_A' FROM BEGINNING;
Format:JSON
{"ROWTIME":1552475910106,"ROWKEY":"null","HEADER":{"SCHEMAVERSIONNO":"1"},"PAYLOAD":{"MODIFIEDDATE":1552334325212,"CREATEDDATE":1552334325212,"CREATEDBY":"A","SUCCESSFUL":true,"SOURCE_ORDER_ID":"3411976933214"}}

In Kafka how to transform a topic into a table? I need to copy a remote table

I configured a connection to the database and all the data transfers over the topic, because when I run the consumer it returns data.
How can I transform this topic into a table and persist the data inside KSQL?
Thanks very much.
You don't persist data in KSQL. KSQL is simply an engine for querying and transforming data in Kafka. The source for KSQL queries is Kafka topic(s), and the output of KSQL queries is either interactive, or back out to another Kafka topic.
If you have data in your Kafka topics - which it sounds like you have - then in KSQL run LIST TOPICS;
ksql> LIST TOPICS;
Kafka Topic | Registered | Partitions | Partition Replicas | Consumers | ConsumerGroups
---------------------------------------------------------------------------------------------------------
_confluent-metrics | false | 12 | 1 | 0 | 0
asgard.demo.accounts | false | 1 | 1 | 0 | 0
To see your Kafka topics. From there, pick your topic, and you can run PRINT 'my-topic' FROM BEGINNING;
ksql> PRINT 'asgard.demo.accounts' FROM BEGINNING;
Format:AVRO
10/11/18 9:24:45 AM UTC, null, {"account_id": "a42", "first_name": "Robin", "last_name": "Moffatt", "email": "robin#confluent.io", "phone": "+44 123 456 789", "address": "22 Acacia Avenue", "country": "United Kingdom", "create_ts": "2018-10-11T09:23:22Z", "update_ts": "2018-10-11T09:23:22Z", "messagetopic": "asgard.demo.accounts", "messagesource": "Debezium CDC from MySQL on asgard"}
10/11/18 9:24:45 AM UTC, null, {"account_id": "a081", "first_name": "Sidoney", "last_name": "Lafranconi", "email": "slafranconi0#cbc.ca", "phone": "+44 908 687 6649", "address": "40 Kensington Pass", "country": "United Kingdom", "create_ts": "2018-10-11T09:23:22Z", "update_ts": "2018-10-11T09:23:22Z", "messagetopic": "asgard.demo.accounts", "messagesource": "Debezium CDC from MySQL on asgard"}
10/11/18 9:24:45 AM UTC, null, {"account_id": "a135", "first_name": "Mick", "last_name": "Edinburgh", "email": "medinburgh1#eepurl.com", "phone": "+44 301 837 6535", "address": "27 Blackbird Lane", "country": "United Kingdom", "create_ts": "2018-10-11T09:23:22Z", "update_ts": "2018-10-11T09:23:22Z", "messagetopic": "asgard.demo.accounts", "messagesource": "Debezium CDC from MySQL on asgard"}
to see the contents of it. Press Ctrl-C to cancel the PRINT statement and return to the command line.
Note the Format on the output of the PRINT statement. This is the serialisation format of your data.
If the data's serialised in Avro then you can run:
CREATE STREAM mydata WITH (KAFKA_TOPIC='asgard.demo.accounts', VALUE_FORMAT='AVRO');
If it's in JSON you'll need to also specify the column names and datatypes
CREATE STREAM mydata (col1 INT, col2 VARCHAR) WITH (KAFKA_TOPIC='asgard.demo.accounts', VALUE_FORMAT='JSON');
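For illustration, if this topic's data were in JSON rather than Avro, a fuller declaration might look like the sketch below (field names taken from the PRINT output above; all types assumed to be VARCHAR, matching the DESCRIBE output that follows):
CREATE STREAM mydata (account_id VARCHAR,
first_name VARCHAR,
last_name VARCHAR,
email VARCHAR,
phone VARCHAR,
address VARCHAR,
country VARCHAR,
create_ts VARCHAR,
update_ts VARCHAR,
messagetopic VARCHAR,
messagesource VARCHAR)
WITH (KAFKA_TOPIC='asgard.demo.accounts', VALUE_FORMAT='JSON');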
Now that you've 'registered' this topic with KSQL, you can view its schema with DESCRIBE:
ksql> DESCRIBE mydata;
Name : MYDATA
Field | Type
-------------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
ACCOUNT_ID | VARCHAR(STRING)
FIRST_NAME | VARCHAR(STRING)
LAST_NAME | VARCHAR(STRING)
EMAIL | VARCHAR(STRING)
PHONE | VARCHAR(STRING)
ADDRESS | VARCHAR(STRING)
COUNTRY | VARCHAR(STRING)
CREATE_TS | VARCHAR(STRING)
UPDATE_TS | VARCHAR(STRING)
MESSAGETOPIC | VARCHAR(STRING)
MESSAGESOURCE | VARCHAR(STRING)
-------------------------------------------
and then use KSQL to query and manipulate the data:
ksql> SET 'auto.offset.reset'='earliest';
ksql> SELECT FIRST_NAME + ' ' + LAST_NAME AS FULL_NAME, EMAIL FROM mydata WHERE COUNTRY='United Kingdom';
Robin Moffatt | robin#confluent.io
Sidoney Lafranconi | slafranconi0#cbc.ca
Mick Edinburgh | medinburgh1#eepurl.com
Merrill Stroobant | mstroobant2#china.com.cn
Press Ctrl-C to cancel a SELECT query.
KSQL can persist this to a new Kafka topic:
CREATE STREAM UK_USERS AS SELECT FIRST_NAME + ' ' + LAST_NAME AS FULL_NAME, EMAIL FROM mydata WHERE COUNTRY='United Kingdom';
If you list your KSQL topics again, you'll see the new one created and populated:
ksql> LIST TOPICS;
Kafka Topic | Registered | Partitions | Partition Replicas | Consumers | ConsumerGroups
---------------------------------------------------------------------------------------------------------
_confluent-metrics | false | 12 | 1 | 0 | 0
asgard.demo.accounts | true | 1 | 1 | 2 | 2
UK_USERS | true | 4 | 1 | 0 | 0
---------------------------------------------------------------------------------------------------------
ksql>
Every event coming into the source topic (asgard.demo.accounts) gets read and filtered by KSQL and written to the target topic (UK_USERS) based on the SQL you've executed.
For more info see the KSQL syntax docs, and tutorials.
Disclaimer: I work for Confluent, the company behind the open-source KSQL project.