In Kafka, how do I transform a topic into a table? I need to copy a remote table - apache-kafka

I configured a connection to a database, and all the data is transferred over the topic, because when I run the consumer it returns data.
How can I transform this topic into a table and persist the data inside KSQL?
Thanks very much.

You don't persist data in KSQL. KSQL is simply an engine for querying and transforming data in Kafka. The source for KSQL queries is one or more Kafka topics, and the output of KSQL queries is either interactive, or back out to another Kafka topic.
If you have data in your Kafka topics—which it sounds like you have—then in KSQL run LIST TOPICS;:
ksql> LIST TOPICS;
Kafka Topic | Registered | Partitions | Partition Replicas | Consumers | ConsumerGroups
---------------------------------------------------------------------------------------------------------
_confluent-metrics | false | 12 | 1 | 0 | 0
asgard.demo.accounts | false | 1 | 1 | 0 | 0
to see your Kafka topics. From there, pick your topic and run PRINT 'my-topic' FROM BEGINNING;:
ksql> PRINT 'asgard.demo.accounts' FROM BEGINNING;
Format:AVRO
10/11/18 9:24:45 AM UTC, null, {"account_id": "a42", "first_name": "Robin", "last_name": "Moffatt", "email": "robin#confluent.io", "phone": "+44 123 456 789", "address": "22 Acacia Avenue", "country": "United Kingdom", "create_ts": "2018-10-11T09:23:22Z", "update_ts": "2018-10-11T09:23:22Z", "messagetopic": "asgard.demo.accounts", "messagesource": "Debezium CDC from MySQL on asgard"}
10/11/18 9:24:45 AM UTC, null, {"account_id": "a081", "first_name": "Sidoney", "last_name": "Lafranconi", "email": "slafranconi0#cbc.ca", "phone": "+44 908 687 6649", "address": "40 Kensington Pass", "country": "United Kingdom", "create_ts": "2018-10-11T09:23:22Z", "update_ts": "2018-10-11T09:23:22Z", "messagetopic": "asgard.demo.accounts", "messagesource": "Debezium CDC from MySQL on asgard"}
10/11/18 9:24:45 AM UTC, null, {"account_id": "a135", "first_name": "Mick", "last_name": "Edinburgh", "email": "medinburgh1#eepurl.com", "phone": "+44 301 837 6535", "address": "27 Blackbird Lane", "country": "United Kingdom", "create_ts": "2018-10-11T09:23:22Z", "update_ts": "2018-10-11T09:23:22Z", "messagetopic": "asgard.demo.accounts", "messagesource": "Debezium CDC from MySQL on asgard"}
to see the contents of it. Press Ctrl-C to cancel the PRINT statement and return to the command line.
Note the Format on the output of the PRINT statement. This is the serialisation format of your data.
If the data's serialised in Avro then you can run:
CREATE STREAM mydata WITH (KAFKA_TOPIC='asgard.demo.accounts', VALUE_FORMAT='AVRO');
If it's in JSON you'll also need to specify the column names and datatypes:
CREATE STREAM mydata (col1 INT, col2 VARCHAR) WITH (KAFKA_TOPIC='asgard.demo.accounts', VALUE_FORMAT='JSON');
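For the accounts topic shown above, that declaration might look something like the following. This is just a sketch: the column names are taken from the PRINT output earlier, and the types are assumed to all be strings (in this walkthrough the data is actually Avro, so the shorter statement above is the one to use).
CREATE STREAM mydata (ACCOUNT_ID VARCHAR,
                      FIRST_NAME VARCHAR,
                      LAST_NAME VARCHAR,
                      EMAIL VARCHAR,
                      PHONE VARCHAR,
                      ADDRESS VARCHAR,
                      COUNTRY VARCHAR,
                      CREATE_TS VARCHAR,
                      UPDATE_TS VARCHAR,
                      MESSAGETOPIC VARCHAR,
                      MESSAGESOURCE VARCHAR)
  WITH (KAFKA_TOPIC='asgard.demo.accounts', VALUE_FORMAT='JSON');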
Now that you've 'registered' this topic with KSQL, you can view its schema with DESCRIBE:
ksql> DESCRIBE mydata;
Name : MYDATA
Field | Type
-------------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
ACCOUNT_ID | VARCHAR(STRING)
FIRST_NAME | VARCHAR(STRING)
LAST_NAME | VARCHAR(STRING)
EMAIL | VARCHAR(STRING)
PHONE | VARCHAR(STRING)
ADDRESS | VARCHAR(STRING)
COUNTRY | VARCHAR(STRING)
CREATE_TS | VARCHAR(STRING)
UPDATE_TS | VARCHAR(STRING)
MESSAGETOPIC | VARCHAR(STRING)
MESSAGESOURCE | VARCHAR(STRING)
-------------------------------------------
and then use KSQL to query and manipulate the data:
ksql> SET 'auto.offset.reset'='earliest';
ksql> SELECT FIRST_NAME + ' ' + LAST_NAME AS FULL_NAME, EMAIL FROM mydata WHERE COUNTRY='United Kingdom';
Robin Moffatt | robin#confluent.io
Sidoney Lafranconi | slafranconi0#cbc.ca
Mick Edinburgh | medinburgh1#eepurl.com
Merrill Stroobant | mstroobant2#china.com.cn
Press Ctrl-C to cancel a SELECT query.
KSQL can persist this to a new Kafka topic:
CREATE STREAM UK_USERS AS SELECT FIRST_NAME + ' ' + LAST_NAME AS FULL_NAME, EMAIL FROM mydata WHERE COUNTRY='United Kingdom';
If you list your KSQL topics again, you'll see the new one created and populated:
ksql> LIST TOPICS;
Kafka Topic | Registered | Partitions | Partition Replicas | Consumers | ConsumerGroups
---------------------------------------------------------------------------------------------------------
_confluent-metrics | false | 12 | 1 | 0 | 0
asgard.demo.accounts | true | 1 | 1 | 2 | 2
UK_USERS | true | 4 | 1 | 0 | 0
---------------------------------------------------------------------------------------------------------
ksql>
Every event coming into the source topic (asgard.demo.accounts) gets read and filtered by KSQL and written to the target topic (UK_USERS) based on the SQL you've executed.
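If what you're after is specifically a KSQL TABLE (i.e. state, such as the latest value or an aggregate per key) rather than a stream, you can derive one from the stream you registered. A minimal sketch, using the columns declared above (the table name ACCOUNTS_BY_COUNTRY is just illustrative):
CREATE TABLE ACCOUNTS_BY_COUNTRY AS
  SELECT COUNTRY, COUNT(*) AS ACCOUNT_COUNT
  FROM mydata
  GROUP BY COUNTRY;
KSQL maintains the aggregate as new events arrive on the source topic, and writes the table's changelog to a backing Kafka topic as well.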
For more info see the KSQL syntax docs, and tutorials.
Disclaimer: I work for Confluent, the company behind the open-source KSQL project.

Related

Configuration of JDBC Sink Connector in KsqlDB from MySQL database to PostgreSQL database

I wanted to copy a table from a MySQL database to PostgreSQL. I have ksqlDB, which acts as a stream processor. For a start, I just want to copy a simple table from the source 'inventory' database to the sink database (PostgreSQL). The following is the structure of the inventory database:
mysql> show tables;
+---------------------+
| Tables_in_inventory |
+---------------------+
| addresses |
| customers |
| geom |
| orders |
| products |
| products_on_hand |
+---------------------+
I logged into ksqlDB and registered a source connector using the following configuration:
CREATE SOURCE CONNECTOR inventory_connector WITH (
'connector.class' = 'io.debezium.connector.mysql.MySqlConnector',
'database.hostname' = 'mysql',
'database.port' = '3306',
'database.user' = 'debezium',
'database.password' = 'dbz',
'database.allowPublicKeyRetrieval' = 'true',
'database.server.id' = '223344',
'database.server.name' = 'dbserver',
'database.whitelist' = 'inventory',
'database.history.kafka.bootstrap.servers' = 'broker:9092',
'database.history.kafka.topic' = 'schema-changes.inventory',
'transforms' = 'unwrap',
'transforms.unwrap.type'= 'io.debezium.transforms.UnwrapFromEnvelope',
'key.converter'= 'org.apache.kafka.connect.json.JsonConverter',
'key.converter.schemas.enable'= 'false',
'value.converter'= 'org.apache.kafka.connect.json.JsonConverter',
'value.converter.schemas.enable'= 'false'
);
The following are the topics created
ksql> LIST TOPICS;
Kafka Topic | Partitions | Partition Replicas
-----------------------------------------------------------------------
_ksql-connect-configs | 1 | 1
_ksql-connect-offsets | 25 | 1
_ksql-connect-statuses | 5 | 1
dbserver | 1 | 1
dbserver.inventory.addresses | 1 | 1
dbserver.inventory.customers | 1 | 1
dbserver.inventory.geom | 1 | 1
dbserver.inventory.orders | 1 | 1
dbserver.inventory.products | 1 | 1
dbserver.inventory.products_on_hand | 1 | 1
default_ksql_processing_log | 1 | 1
schema-changes.inventory | 1 | 1
-----------------------------------------------------------------------
Now I just need to copy the contents of 'dbserver.inventory.customers' to the PostgreSQL database. The following is the structure of the data:
ksql> PRINT 'dbserver.inventory.customers' FROM BEGINNING;
Key format: JSON or HOPPING(KAFKA_STRING) or TUMBLING(KAFKA_STRING) or KAFKA_STRING
Value format: JSON or KAFKA_STRING
rowtime: 2022/08/29 02:39:20.772 Z, key: {"id":1001}, value: {"id":1001,"first_name":"Sally","last_name":"Thomas","email":"sally.thomas#acme.com"}, partition: 0
rowtime: 2022/08/29 02:39:20.773 Z, key: {"id":1002}, value: {"id":1002,"first_name":"George","last_name":"Bailey","email":"gbailey#foobar.com"}, partition: 0
rowtime: 2022/08/29 02:39:20.773 Z, key: {"id":1003}, value: {"id":1003,"first_name":"Edward","last_name":"Walker","email":"ed#walker.com"}, partition: 0
rowtime: 2022/08/29 02:39:20.773 Z, key: {"id":1004}, value: {"id":1004,"first_name":"Anne","last_name":"Kretchmar","email":"annek#noanswer.org"}, partition: 0
I have tried the following configuration of the sink connector:
CREATE SINK CONNECTOR postgres_sink WITH (
'connector.class'= 'io.confluent.connect.jdbc.JdbcSinkConnector',
'connection.url'= 'jdbc:postgresql://postgres:5432/inventory',
'connection.user' = 'postgresuser',
'connection.password' = 'postgrespw',
'topics'= 'dbserver.inventory.customers',
'transforms'= 'unwrap',
'transforms.unwrap.type'= 'io.debezium.transforms.ExtractNewRecordState',
'transforms.unwrap.drop.tombstones'= 'false',
'key.converter'= 'org.apache.kafka.connect.json.JsonConverter',
'key.converter.schemas.enable'= 'false',
'value.converter'= 'org.apache.kafka.connect.json.JsonConverter',
'value.converter.schemas.enable'= 'false',
'auto.create'= 'true',
'insert.mode'= 'upsert',
'auto.evolve' = 'true',
'table.name.format' = '${topic}',
'pk.mode' = 'record_key',
'pk.fields' = 'id',
'delete.enabled'= 'true'
);
It creates the connector but shows the following errors:
ksqldb-server | Caused by: org.apache.kafka.connect.errors.ConnectException: Sink connector 'POSTGRES_SINK' is configured with 'delete.enabled=true' and 'pk.mode=record_key' and therefore requires records with a non-null key and non-null Struct or primitive key schema, but found record at (topic='dbserver.inventory.customers',partition=0,offset=0,timestamp=1661740760772) with a HashMap key and null key schema.
What should the configuration of the sink connector be to copy this data to PostgreSQL?
I have also tried creating a stream first in Avro and then using the Avro key and value converters, but it did not work. I think it has something to do with using the right SMTs, but I am not sure.
My ultimate aim is to join different streams and then store the result in PostgreSQL as part of implementing a CQRS architecture. So if someone can share a framework I could use in such a case, it would be really useful.
As the error says, the key must be a primitive, not a JSON object, and not Avro either.
From the JSON you've shown, you'd need an ExtractField transform on your key:
transforms=getKey,unwrap
transforms.getKey.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.getKey.field=id
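In the ksqlDB CREATE SINK CONNECTOR syntax used in the question, those properties would go in as quoted key/value pairs alongside the existing unwrap transform. A sketch of just the transform-related lines:
'transforms' = 'getKey,unwrap',
'transforms.getKey.type' = 'org.apache.kafka.connect.transforms.ExtractField$Key',
'transforms.getKey.field' = 'id',
'transforms.unwrap.type' = 'io.debezium.transforms.ExtractNewRecordState',
'transforms.unwrap.drop.tombstones' = 'false',
Whether this alone satisfies the JDBC sink's key-schema requirement depends on the key converter in use (see the alternative below).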
Or, you might be able to change your source connector to use IntegerConverter rather than JsonConverter for the keys.
Debezium also has an old blog post covering this exact use case - https://debezium.io/blog/2017/09/25/streaming-to-another-database/

KSQL left join giving 'null' result even when data is present

I'm learning KSQL/ksqlDB and currently exploring joins. Below is the issue where I'm stuck.
I have one stream, 'DRIVERSTREAMREPARTITIONEDKEYED', and one table, 'COUNTRIES'; below are their descriptions.
ksql> describe DRIVERSTREAMREPARTITIONEDKEYED;
Name: DRIVERSTREAMREPARTITIONEDKEYED
Field | Type
--------------------------------------
COUNTRYCODE | VARCHAR(STRING) (key)
NAME | VARCHAR(STRING)
RATING | DOUBLE
--------------------------------------
ksql> describe countries;
Name : COUNTRIES
Field | Type
----------------------------------------------
COUNTRYCODE | VARCHAR(STRING) (primary key)
COUNTRYNAME | VARCHAR(STRING)
----------------------------------------------
This is the sample data that they have,
ksql> select * from DRIVERSTREAMREPARTITIONEDKEYED emit changes;
+---------------------------------------------+---------------------------------------------+---------------------------------------------+
|COUNTRYCODE |NAME |RATING |
+---------------------------------------------+---------------------------------------------+---------------------------------------------+
|SGP |Suresh |3.5 |
|IND |Mahesh |2.4 |
ksql> select * from countries emit changes;
+---------------------------------------------------------------------+---------------------------------------------------------------------+
|COUNTRYCODE |COUNTRYNAME |
+---------------------------------------------------------------------+---------------------------------------------------------------------+
|IND |INDIA |
|SGP |SINGAPORE |
I'm trying to do a 'left outer' join on them with the stream being on the left side, but below is the output I get,
select d.name,d.rating,c.COUNTRYNAME from DRIVERSTREAMREPARTITIONEDKEYED d left join countries c on d.COUNTRYCODE=c.COUNTRYCODE emit changes;
+---------------------------------------------+---------------------------------------------+---------------------------------------------+
|NAME |RATING |COUNTRYNAME |
+---------------------------------------------+---------------------------------------------+---------------------------------------------+
|Suresh |3.5 |null |
|Mahesh |2.4 |null |
In the ideal scenario I should get data in the 'COUNTRYNAME' column, as the 'COUNTRYCODE' column in both the stream and the table has matching values.
I tried searching a lot, but to no avail.
I'm using Confluent Platform 6.1.1.
For a join to work, it is our responsibility to ensure that the keys of both entities being joined lie in the same partition; ksqlDB can't verify whether the partitioning strategies are the same for both join inputs.
In my case, my 'Drivers' topic had 2 partitions, on which I had created a stream 'DriversStream' that in turn also had 2 partitions, but the table 'Countries' I wanted to join it with had only 1 partition. Because of this I re-keyed 'DriversStream' and created another stream, 'DRIVERSTREAMREPARTITIONEDKEYED', shown in the question.
But the data of the table and the stream still did not end up in the same partitions, so the join was failing.
I created another topic, 'DRIVERINFO', with 1 partition:
kafka-topics --bootstrap-server localhost:9092 --create --partitions 1 --replication-factor 1 --topic DRIVERINFO
Then I created a stream, 'DRIVERINFOSTREAM', over it:
CREATE STREAM DRIVERINFOSTREAM (NAME STRING, RATING DOUBLE, COUNTRYCODE STRING) WITH (KAFKA_TOPIC='DRIVERINFO', VALUE_FORMAT='JSON');
Finally, I joined it with the 'COUNTRIES' table, which worked:
ksql> select d.name,d.rating,c.COUNTRYNAME from DRIVERINFOSTREAM d left join countries c on d.COUNTRYCODE=c.COUNTRYCODE EMIT CHANGES;
+-------------------------------------------+-------------------------------------------+-------------------------------------------+
|NAME |RATING |COUNTRYNAME |
+-------------------------------------------+-------------------------------------------+-------------------------------------------+
|Suresh |2.4 |SINGAPORE |
|Mahesh |3.6 |INDIA |
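An alternative to creating and reloading a new single-partition topic by hand is to let ksqlDB repartition the stream for you, by deriving a copy whose partition count matches the table. A sketch, assuming the rekeyed stream from the question (which is already keyed by COUNTRYCODE); the new stream name is just illustrative:
CREATE STREAM DRIVERS_SINGLE_PARTITION
  WITH (PARTITIONS=1) AS
  SELECT * FROM DRIVERSTREAMREPARTITIONEDKEYED;
The join can then be run against DRIVERS_SINGLE_PARTITION instead, since it is now co-partitioned with the COUNTRIES table.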
Refer to the links below for details:
KSQL join
Partitioning data for Joins

How to select subvalue of json in topic as ksql stream

I have many topics in Kafka with a format like this:
value: {big json string with many subkeys etc}.
PRINT on the topic looks like:
rowtime: 3/10/20 7:10:43 AM UTC, key: , value: {"#timestamp": "XXXXXXXX", "beat": {"hostname": "xxxxxxxxxx","name": "xxxxxxxxxx","version": "5.2.1"}, "input_type": "log", "log_dc": "xxxxxxxxxxx", "message": "{\"server_name\":\"xxxxxxxxxxxxxxx\",\"remote_address\":\"10.x.x.x\",\"user\":\"xxxxxx\",\"timestamp_start\":\"xxxxxxxx\",\"timestamp_finish\":\"xxxxxxxxxx\",\"time_start\":\"10/Mar/2020:07:10:39 +0000\",\"time_finish\":\"10/Mar/2020:07:10:39 +0000\",\"request_method\":\"PUT\",\"request_uri\":\"xxxxxxxxxxxxxxxxxxxxxxx\",\"protocol\":\"HTTP/1.1\",\"status\":200,\"response_length\":\"0\",\"request_length\":\"0\",\"user_agent\":\"xxxxxxxxx\",\"request_id\":\"zzzzzzzzzzzzzzzzzzzzz\",\"request_type\":\"zzzzzzzz\",\"stat\":{\"c_wait\":0.004,\"s_wait\":0.432,\"digest\":0.0,\"commit\":31.878,\"turn_around_time\":0.0,\"t_transfer\":32.319},\"object_length\":\"0\","o_name\":\"xxxxx\",\"https\":{\"protocol\":\"TLSv1.2\",\"cipher_suite\":\"TLS_RSA_WITH_AES_256_GCM_SHA384\"},\"principals\":{\"identity\":\"zzzzzz\",\"asv\":\"dddddddddd\"},\"type\":\"http\",\"format\":1}", "offset": 70827770, "source": "/var/log/xxxx.log", "type": "topicname" }
I have tried using
CREATE STREAM test
(value STRUCT<
server_name VARCHAR,
remote_address VARCHAR,
forwarded_for VARCHAR,
remote_user VARCHAR,
timestamp_start VARCHAR
..
WITH (KAFKA_TOPIC='testing', VALUE_FORMAT='JSON');
But I get a stream with the value as NULL.
Is there a way to grab the fields nested under the value key?
The escaped JSON is not valid JSON, which is probably going to have made this more difficult :)
In this snippet:
…\"object_length\":\"0\","o_name\":\"xxxxx\",\"https\":{\"protocol\":\…
the leading double-quote for o_name is not escaped. You can validate this with something like jq:
echo '{"message": "{\"server_name\":\"xxxxxxxxxxxxxxx\",\"remote_address\":\"10.x.x.x\",\"user\":\"xxxxxx\",\"timestamp_start\":\"xxxxxxxx\",\"timestamp_finish\":\"xxxxxxxxxx\",\"time_start\":\"10/Mar/2020:07:10:39 +0000\",\"time_finish\":\"10/Mar/2020:07:10:39 +0000\",\"request_method\":\"PUT\",\"request_uri\":\"xxxxxxxxxxxxxxxxxxxxxxx\",\"protocol\":\"HTTP/1.1\",\"status\":200,\"response_length\":\"0\",\"request_length\":\"0\",\"user_agent\":\"xxxxxxxxx\",\"request_id\":\"zzzzzzzzzzzzzzzzzzzzz\",\"request_type\":\"zzzzzzzz\",\"stat\":{\"c_wait\":0.004,\"s_wait\":0.432,\"digest\":0.0,\"commit\":31.878,\"turn_around_time\":0.0,\"t_transfer\":32.319},\"object_length\":\"0\","o_name\":\"xxxxx\",\"https\":{\"protocol\":\"TLSv1.2\",\"cipher_suite\":\"TLS_RSA_WITH_AES_256_GCM_SHA384\"},\"principals\":{\"identity\":\"zzzzzz\",\"asv\":\"dddddddddd\"},\"type\":\"http\",\"format\":1}"}' | jq '.message|fromjson'
parse error: Invalid numeric literal at line 1, column 685
With the JSON fixed this then parses successfully:
➜ echo '{"message": "{\"server_name\":\"xxxxxxxxxxxxxxx\",\"remote_address\":\"10.x.x.x\",\"user\":\"xxxxxx\",\"timestamp_start\":\"xxxxxxxx\",\"timestamp_finish\":\"xxxxxxxxxx\",\"time_start\":\"10/Mar/2020:07:10:39 +0000\",\"time_finish\":\"10/Mar/2020:07:10:39 +0000\",\"request_m
ethod\":\"PUT\",\"request_uri\":\"xxxxxxxxxxxxxxxxxxxxxxx\",\"protocol\":\"HTTP/1.1\",\"status\":200,\"response_length\":\"0\",\"request_length\":\"0\",\"user_agent\":\"xxxxxxxxx\",\"request_id\":\"zzzzzzzzzzzzzzzzzzzzz\",\"request_type\":\"zzzzzzzz\",\"stat\":{\"c_wait\":0.004,\"s_
wait\":0.432,\"digest\":0.0,\"commit\":31.878,\"turn_around_time\":0.0,\"t_transfer\":32.319},\"object_length\":\"0\",\"o_name\":\"xxxxx\",\"https\":{\"protocol\":\"TLSv1.2\",\"cipher_suite\":\"TLS_RSA_WITH_AES_256_GCM_SHA384\"},\"principals\":{\"identity\":\"zzzzzz\",\"asv\":\"dddd
dddddd\"},\"type\":\"http\",\"format\":1}"}' | jq '.message|fromjson'
{
"server_name": "xxxxxxxxxxxxxxx",
"remote_address": "10.x.x.x",
"user": "xxxxxx",
"timestamp_start": "xxxxxxxx",
"timestamp_finish": "xxxxxxxxxx",
"time_start": "10/Mar/2020:07:10:39 +0000",
"time_finish": "10/Mar/2020:07:10:39 +0000",
"request_method": "PUT",
"request_uri": "xxxxxxxxxxxxxxxxxxxxxxx",
"protocol": "HTTP/1.1",
"status": 200,
…
So now let's get this into ksqlDB. I'm using kafkacat to load it into a topic:
kafkacat -b localhost:9092 -t testing -P<<EOF
{ "#timestamp": "XXXXXXXX", "beat": { "hostname": "xxxxxxxxxx", "name": "xxxxxxxxxx", "version": "5.2.1" }, "input_type": "log", "log_dc": "xxxxxxxxxxx", "message": "{\"server_name\":\"xxxxxxxxxxxxxxx\",\"remote_address\":\"10.x.x.x\",\"user\":\"xxxxxx\",\"timestamp_start\":\"xxxxxxxx\",\"timestamp_finish\":\"xxxxxxxxxx\",\"time_start\":\"10/Mar/2020:07:10:39 +0000\",\"time_finish\":\"10/Mar/2020:07:10:39 +0000\",\"request_method\":\"PUT\",\"request_uri\":\"xxxxxxxxxxxxxxxxxxxxxxx\",\"protocol\":\"HTTP/1.1\",\"status\":200,\"response_length\":\"0\",\"request_length\":\"0\",\"user_agent\":\"xxxxxxxxx\",\"request_id\":\"zzzzzzzzzzzzzzzzzzzzz\",\"request_type\":\"zzzzzzzz\",\"stat\":{\"c_wait\":0.004,\"s_wait\":0.432,\"digest\":0.0,\"commit\":31.878,\"turn_around_time\":0.0,\"t_transfer\":32.319},\"object_length\":\"0\",\"o_name\":\"xxxxx\",\"https\":{\"protocol\":\"TLSv1.2\",\"cipher_suite\":\"TLS_RSA_WITH_AES_256_GCM_SHA384\"},\"principals\":{\"identity\":\"zzzzzz\",\"asv\":\"dddddddddd\"},\"type\":\"http\",\"format\":1}", "offset": 70827770, "source": "/var/log/xxxx.log", "type": "topicname" }
EOF
Now with ksqlDB let's declare the outline schema, in which the message field is just a lump of VARCHAR:
CREATE STREAM TEST (BEAT STRUCT<HOSTNAME VARCHAR, NAME VARCHAR, VERSION VARCHAR>,
INPUT_TYPE VARCHAR,
MESSAGE VARCHAR,
OFFSET BIGINT,
SOURCE VARCHAR)
WITH (KAFKA_TOPIC='testing', VALUE_FORMAT='JSON');
We can query this stream to check that it's working:
SET 'auto.offset.reset' = 'earliest';
SELECT BEAT->HOSTNAME,
BEAT->VERSION,
SOURCE,
MESSAGE
FROM TEST
EMIT CHANGES LIMIT 1;
+-----------------+---------------+--------------------+--------------------------------------------------------------------+
|BEAT__HOSTNAME |BEAT__VERSION |SOURCE |MESSAGE |
+-----------------+---------------+--------------------+--------------------------------------------------------------------+
|xxxxxxxxxx |5.2.1 |/var/log/xxxx.log |{"server_name":"xxxxxxxxxxxxxxx","remote_address":"10.x.x.x","user":|
| | | |"xxxxxx","timestamp_start":"xxxxxxxx","timestamp_finish":"xxxxxxxxxx|
| | | |","time_start":"10/Mar/2020:07:10:39 +0000","time_finish":"10/Mar/20|
| | | |20:07:10:39 +0000","request_method":"PUT","request_uri":"xxxxxxxxxxx|
| | | |xxxxxxxxxxxx","protocol":"HTTP/1.1","status":200,"response_length":"|
| | | |0","request_length":"0","user_agent":"xxxxxxxxx","request_id":"zzzzz|
| | | |zzzzzzzzzzzzzzzz","request_type":"zzzzzzzz","stat":{"c_wait":0.004,"|
| | | |s_wait":0.432,"digest":0.0,"commit":31.878,"turn_around_time":0.0,"t|
| | | |_transfer":32.319},"object_length":"0","o_name":"xxxxx","https":{"pr|
| | | |otocol":"TLSv1.2","cipher_suite":"TLS_RSA_WITH_AES_256_GCM_SHA384"},|
| | | |"principals":{"identity":"zzzzzz","asv":"dddddddddd"},"type":"http",|
| | | |"format":1} |
Limit Reached
Query terminated
Now let's extract the embedded JSON fields using the EXTRACTJSONFIELD function (I've not done every field, just a handful of them to illustrate the pattern to follow):
SELECT EXTRACTJSONFIELD(MESSAGE,'$.remote_address') AS REMOTE_ADDRESS,
EXTRACTJSONFIELD(MESSAGE,'$.time_start') AS TIME_START,
EXTRACTJSONFIELD(MESSAGE,'$.protocol') AS PROTOCOL,
EXTRACTJSONFIELD(MESSAGE,'$.status') AS STATUS,
EXTRACTJSONFIELD(MESSAGE,'$.stat.c_wait') AS STAT_C_WAIT,
EXTRACTJSONFIELD(MESSAGE,'$.stat.s_wait') AS STAT_S_WAIT,
EXTRACTJSONFIELD(MESSAGE,'$.stat.digest') AS STAT_DIGEST,
EXTRACTJSONFIELD(MESSAGE,'$.stat.commit') AS STAT_COMMIT,
EXTRACTJSONFIELD(MESSAGE,'$.stat.turn_around_time') AS STAT_TURN_AROUND_TIME,
EXTRACTJSONFIELD(MESSAGE,'$.stat.t_transfer') AS STAT_T_TRANSFER
FROM TEST
EMIT CHANGES LIMIT 1;
+----------------+--------------------------+----------+--------+------------+-------------+------------+------------+----------------------+----------------+
|REMOTE_ADDRESS |TIME_START |PROTOCOL |STATUS |STAT_C_WAIT |STAT_S_WAIT |STAT_DIGEST |STAT_COMMIT |STAT_TURN_AROUND_TIME |STAT_T_TRANSFER |
+----------------+--------------------------+----------+--------+------------+-------------+------------+------------+----------------------+----------------+
|10.x.x.x |10/Mar/2020:07:10:39 +0000|HTTP/1.1 |200 |0.004 |0.432 |0 |31.878 |0 |32.319 |
We can persist this to a new Kafka topic, and for good measure reserialise it to Avro to make it easier for downstream applications to use:
CREATE STREAM BEATS WITH (VALUE_FORMAT='AVRO') AS
SELECT EXTRACTJSONFIELD(MESSAGE,'$.remote_address') AS REMOTE_ADDRESS,
EXTRACTJSONFIELD(MESSAGE,'$.time_start') AS TIME_START,
EXTRACTJSONFIELD(MESSAGE,'$.protocol') AS PROTOCOL,
EXTRACTJSONFIELD(MESSAGE,'$.status') AS STATUS,
EXTRACTJSONFIELD(MESSAGE,'$.stat.c_wait') AS STAT_C_WAIT,
EXTRACTJSONFIELD(MESSAGE,'$.stat.s_wait') AS STAT_S_WAIT,
EXTRACTJSONFIELD(MESSAGE,'$.stat.digest') AS STAT_DIGEST,
EXTRACTJSONFIELD(MESSAGE,'$.stat.commit') AS STAT_COMMIT,
EXTRACTJSONFIELD(MESSAGE,'$.stat.turn_around_time') AS STAT_TURN_AROUND_TIME,
EXTRACTJSONFIELD(MESSAGE,'$.stat.t_transfer') AS STAT_T_TRANSFER
FROM TEST
EMIT CHANGES;
ksql> DESCRIBE BEATS;
Name : BEATS
Field | Type
---------------------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
REMOTE_ADDRESS | VARCHAR(STRING)
TIME_START | VARCHAR(STRING)
PROTOCOL | VARCHAR(STRING)
STATUS | VARCHAR(STRING)
STAT_C_WAIT | VARCHAR(STRING)
STAT_S_WAIT | VARCHAR(STRING)
STAT_DIGEST | VARCHAR(STRING)
STAT_COMMIT | VARCHAR(STRING)
STAT_TURN_AROUND_TIME | VARCHAR(STRING)
STAT_T_TRANSFER | VARCHAR(STRING)
---------------------------------------------------
For runtime statistics and query details run: DESCRIBE EXTENDED <Stream,Table>;
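Note that EXTRACTJSONFIELD returns VARCHAR, which is why every derived column above shows as VARCHAR(STRING). If you want proper types in the target stream you can CAST within the same query. A sketch for a couple of the fields (the stream name BEATS_TYPED is just illustrative):
CREATE STREAM BEATS_TYPED WITH (VALUE_FORMAT='AVRO') AS
  SELECT EXTRACTJSONFIELD(MESSAGE,'$.remote_address') AS REMOTE_ADDRESS,
         CAST(EXTRACTJSONFIELD(MESSAGE,'$.status') AS INTEGER) AS STATUS,
         CAST(EXTRACTJSONFIELD(MESSAGE,'$.stat.t_transfer') AS DOUBLE) AS STAT_T_TRANSFER
  FROM TEST
  EMIT CHANGES;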
To debug issues with ksqlDB returning NULLs check out this article. A lot of the time it's down to serialisation errors. For example, if you look at the ksqlDB server log you'll see this error when it tries to parse the badly-formed escaped JSON before I fixed it:
WARN Exception caught during Deserialization, taskId: 0_0, topic: testing, partition: 0, offset: 1 (org.apache.kafka.streams.processor.internals.StreamThread:36)
org.apache.kafka.common.errors.SerializationException: mvn value from topic: testing
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('o' (code 111)): was expecting comma to separate Object entries
at [Source: (byte[])"{"#timestamp": "XXXXXXXX", "beat": {"hostname": "xxxxxxxxxx","name": "xxxxxxxxxx","version": "5.2.1"}, "input_type": "log", "log_dc": "xxxxxxxxxxx", "message": "{\"server_name\":\"xxxxxxxxxxxxxxx\",\"remote_address\":\"10.x.x.x\",\"user\":\
"xxxxxx\",\"timestamp_start\":\"xxxxxxxx\",\"timestamp_finish\":\"xxxxxxxxxx\",\"time_start\":\"10/Mar/2020:07:10:39 +0000\",\"time_finish\":\"10/Mar/2020:07:10:39 +0000\",\"request_method\":\"PUT\",\"request_uri\":\"xxxxxxxxxxxxxxxxxxxxxxx\",\"protocol\":\"HT"[truncated 604 bytes];
line: 1, column: 827]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1804)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:693)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:591)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextFieldName(UTF8StreamJsonParser.java:986)
…
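ksqlDB can also route these deserialisation errors to its processing log, which you can query from ksqlDB itself rather than tailing the server log. A sketch, assuming the processing log stream is enabled under its default name:
SELECT message->deserializationError->errorMessage
FROM KSQL_PROCESSING_LOG
EMIT CHANGES;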

Data is duplicated when I create a flattened stream

I have a stream derived from a topic that contains 271 total messages.
The stream also contains 271 total messages, but when I create another stream from that previous stream to flatten it, I get a total message count of 542 (= 271 * 2).
This is the stream derived from the topic:
Name : TRANSACTIONSPURE
Type : STREAM
Key field :
Key format : STRING
Timestamp field : Not set - using <ROWTIME>
Value format : JSON
Kafka topic : mongo_conn.digi.transactions (partitions: 1, replication: 1)
Field | Type
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
PAYLOAD | STRUCT<SENDER VARCHAR(STRING), RECEIVER VARCHAR(STRING),
RECEIVERWALLETID VARCHAR(STRING), STATUS VARCHAR(STRING), TYPE
VARCHAR(STRING), AMOUNT DOUBLE, TOTALFEE DOUBLE, CREATEDAT
VARCHAR(STRING), UPDATEDAT VARCHAR(STRING), ID VARCHAR(STRING),
ORDERID
VARCHAR(STRING), __V VARCHAR(STRING), TXID VARCHAR(STRING),
SENDERWALLETID VARCHAR(STRING)>
Local runtime statistics
------------------------
consumer-messages-per-sec: 0 consumer-total-bytes: 361356 consumer-total-messages: 271 last-message: 2019-09-02T10:44:14.003Z
And this is my flattened stream derived from the previous stream:
Name : TRANSACTIONSRAW
Type : STREAM
Key field :
Key format : STRING
Timestamp field : Not set - using <ROWTIME>
Value format : JSON
Kafka topic : TRANSACTIONSRAW (partitions: 4, replication: 1)
Field | Type
----------------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
SENDER | VARCHAR(STRING)
RECEIVER | VARCHAR(STRING)
RECEIVERWALLETID | VARCHAR(STRING)
STATUS | VARCHAR(STRING)
TYPE | VARCHAR(STRING)
AMOUNT | DOUBLE
TOTALFEE | DOUBLE
CREATEDAT | VARCHAR(STRING)
UPDATEDAT | VARCHAR(STRING)
ID | VARCHAR(STRING)
ORDERID | VARCHAR(STRING)
__V | VARCHAR(STRING)
TXID | VARCHAR(STRING)
SENDERWALLETID | VARCHAR(STRING)
----------------------------------------------
Queries that write into this STREAM
-----------------------------------
CSAS_TRANSACTIONSRAW_10 : CREATE STREAM transactionsraw
with(value_format='JSON') as SELECT payload->sender as sender,
payload->receiver as receiver, payload->receiverWalletId as
receiverWalletId, payload->status as status, payload->type as type,
payload->amount as amount, payload->totalFee as totalFee,
payload->createdAt as createdAt, payload->updatedAt as updatedAt,
payload->id as id, payload->orderId as orderId , payload-> __v as __v,
payload->txId as txId, payload->senderWalletId as senderWalletId from
transactionspure;
For query topology and execution plan please run: EXPLAIN <QueryId>
Local runtime statistics
------------------------
consumer-messages-per-sec: 0 consumer-total-bytes: 315500 consumer-total-messages: 542 messages-per-sec: 0 total-messages: 271 last-message: 2019-09-02T10:44:15.493Z

Need to filter out Kafka Records based on a certain keyword

I have a Kafka topic which has around 3 million records. I want to pick out a single record from this which has a certain parameter. I have been trying to query this using Lenses, but I'm unable to form the correct query. Below are the contents of one message:
{
  "header": {
    "schemaVersionNo": "1"
  },
  "payload": {
    "modifiedDate": 1552334325212,
    "createdDate": 1552334325212,
    "createdBy": "A",
    "successful": true,
    "source_order_id": "1111111111111"
  }
}
Now I want to filter out a record with a particular source_order_id, but I'm not able to figure out the right way to do so.
We have tried via Lenses as well as Kafka Tool.
A sample query that we tried in Lenses is below:
SELECT * FROM `TEST`
WHERE _vtype='JSON' AND _ktype='BYTES'
AND _sample=2 AND _sampleWindow=200 AND payload.createdBy='A'
This query works; however, if we try with the source order id as shown below, we get an error:
SELECT * FROM `TEST`
WHERE _vtype='JSON' AND _ktype='BYTES'
AND _sample=2 AND _sampleWindow=200 AND payload.source_order_id='1111111111111'
Error : "Invalid syntax at line=3 and column=41.Invalid syntax for 'payload.source_order_id'. Field 'payload' resolves to primitive type STRING.
Consuming all 3 million records via a custom consumer and then iterating over them doesn't seem like an optimised approach to me, so I'm looking for any available solutions for such a use case.
Since you said you are open to other solutions, here is one built using KSQL.
First, let's get some sample records into a source topic:
$ kafkacat -P -b localhost:9092 -t TEST <<EOF
{ "header": { "schemaVersionNo": "1" }, "payload": { "modifiedDate": 1552334325212, "createdDate": 1552334325212, "createdBy": "A", "successful": true, "source_order_id": "3411976933214" } }
{ "header": { "schemaVersionNo": "1" }, "payload": { "modifiedDate": 1552334325412, "createdDate": 1552334325412, "createdBy": "B", "successful": true, "source_order_id": "3411976933215" } }
{ "header": { "schemaVersionNo": "1" }, "payload": { "modifiedDate": 1552334325612, "createdDate": 1552334325612, "createdBy": "C", "successful": true, "source_order_id": "3411976933216" } }
EOF
Using KSQL we can inspect the topic with PRINT:
ksql> PRINT 'TEST' FROM BEGINNING;
Format:JSON
{"ROWTIME":1552476232988,"ROWKEY":"null","header":{"schemaVersionNo":"1"},"payload":{"modifiedDate":1552334325212,"createdDate":1552334325212,"createdBy":"A","successful":true,"source_order_id":"3411976933214"}}
{"ROWTIME":1552476232988,"ROWKEY":"null","header":{"schemaVersionNo":"1"},"payload":{"modifiedDate":1552334325412,"createdDate":1552334325412,"createdBy":"B","successful":true,"source_order_id":"3411976933215"}}
{"ROWTIME":1552476232988,"ROWKEY":"null","header":{"schemaVersionNo":"1"},"payload":{"modifiedDate":1552334325612,"createdDate":1552334325612,"createdBy":"C","successful":true,"source_order_id":"3411976933216"}}
Then declare a schema on the topic, which enables us to run SQL against it:
ksql> CREATE STREAM TEST (header STRUCT<schemaVersionNo VARCHAR>,
payload STRUCT<modifiedDate BIGINT,
createdDate BIGINT,
createdBy VARCHAR,
successful BOOLEAN,
source_order_id VARCHAR>)
WITH (KAFKA_TOPIC='TEST',
VALUE_FORMAT='JSON');
Message
----------------
Stream created
----------------
Tell KSQL to work with all the data in the topic:
ksql> SET 'auto.offset.reset' = 'earliest';
Successfully changed local property 'auto.offset.reset' to 'earliest'. Use the UNSET command to revert your change.
And now we can select all the data:
ksql> SELECT * FROM TEST;
1552475910106 | null | {SCHEMAVERSIONNO=1} | {MODIFIEDDATE=1552334325212, CREATEDDATE=1552334325212, CREATEDBY=A, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933214}
1552475910106 | null | {SCHEMAVERSIONNO=1} | {MODIFIEDDATE=1552334325412, CREATEDDATE=1552334325412, CREATEDBY=B, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933215}
1552475910106 | null | {SCHEMAVERSIONNO=1} | {MODIFIEDDATE=1552334325612, CREATEDDATE=1552334325612, CREATEDBY=C, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933216}
^CQuery terminated
or we can selectively query it, using the -> notation to access nested fields in the schema:
ksql> SELECT * FROM TEST
WHERE PAYLOAD->CREATEDBY='A';
1552475910106 | null | {SCHEMAVERSIONNO=1} | {MODIFIEDDATE=1552334325212, CREATEDDATE=1552334325212, CREATEDBY=A, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933214}
As well as selecting all records, you can return just the fields of interest:
ksql> SELECT payload FROM TEST
WHERE PAYLOAD->source_order_id='3411976933216';
{MODIFIEDDATE=1552334325612, CREATEDDATE=1552334325612, CREATEDBY=C, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933216}
With KSQL you can write the results of any SELECT statement to a new topic, which populates it with all existing messages along with every new message on the source topic filtered and processed per the declared SELECT statement:
ksql> CREATE STREAM TEST_CREATED_BY_A AS
SELECT * FROM TEST WHERE PAYLOAD->CREATEDBY='A';
Message
----------------------------
Stream created and running
----------------------------
List the topics on the Kafka cluster:
ksql> SHOW TOPICS;
Kafka Topic | Registered | Partitions | Partition Replicas | Consumers | ConsumerGroups
----------------------------------------------------------------------------------------------------
orders | true | 1 | 1 | 1 | 1
pageviews | false | 1 | 1 | 0 | 0
products | true | 1 | 1 | 1 | 1
TEST | true | 1 | 1 | 1 | 1
TEST_CREATED_BY_A | true | 4 | 1 | 0 | 0
Print the contents of the new topic:
ksql> PRINT 'TEST_CREATED_BY_A' FROM BEGINNING;
Format:JSON
{"ROWTIME":1552475910106,"ROWKEY":"null","HEADER":{"SCHEMAVERSIONNO":"1"},"PAYLOAD":{"MODIFIEDDATE":1552334325212,"CREATEDDATE":1552334325212,"CREATEDBY":"A","SUCCESSFUL":true,"SOURCE_ORDER_ID":"3411976933214"}}