Confluent 4.1.0 -> KSQL: STREAM-TABLE join -> table data null - apache-kafka

STEP 1: Run the producer to create sample data
./bin/kafka-avro-console-producer \
--broker-list localhost:9092 --topic stream-test-topic \
--property schema.registry.url=http://localhost:8081 \
--property value.schema='{"type":"record","name":"dealRecord","fields":[{"name":"DEAL_ID","type":"string"},{"name":"DEAL_EXPENSE_CODE","type":"string"},{"name":"DEAL_BRANCH","type":"string"}]}'
Sample data:
{"DEAL_ID":"deal002", "DEAL_EXPENSE_CODE":"EXP002", "DEAL_BRANCH":"AMSTERDAM"}
{"DEAL_ID":"deal003", "DEAL_EXPENSE_CODE":"EXP003", "DEAL_BRANCH":"AMSTERDAM"}
{"DEAL_ID":"deal004", "DEAL_EXPENSE_CODE":"EXP004", "DEAL_BRANCH":"AMSTERDAM"}
{"DEAL_ID":"deal005", "DEAL_EXPENSE_CODE":"EXP005", "DEAL_BRANCH":"AMSTERDAM"}
{"DEAL_ID":"deal006", "DEAL_EXPENSE_CODE":"EXP006", "DEAL_BRANCH":"AMSTERDAM"}
{"DEAL_ID":"deal007", "DEAL_EXPENSE_CODE":"EXP001", "DEAL_BRANCH":"AMSTERDAM"}
{"DEAL_ID":"deal008", "DEAL_EXPENSE_CODE":"EXP002", "DEAL_BRANCH":"AMSTERDAM"}
{"DEAL_ID":"deal009", "DEAL_EXPENSE_CODE":"EXP003", "DEAL_BRANCH":"AMSTERDAM"}
{"DEAL_ID":"deal010", "DEAL_EXPENSE_CODE":"EXP004", "DEAL_BRANCH":"AMSTERDAM"}
{"DEAL_ID":"deal011", "DEAL_EXPENSE_CODE":"EXP005", "DEAL_BRANCH":"AMSTERDAM"}
{"DEAL_ID":"deal012", "DEAL_EXPENSE_CODE":"EXP006", "DEAL_BRANCH":"AMSTERDAM"}
STEP 2: Open another terminal and run the consumer to test the data.
./bin/kafka-avro-console-consumer --topic stream-test-topic \
--bootstrap-server localhost:9092 \
--property schema.registry.url=http://localhost:8081 \
--from-beginning
STEP 3: Open another terminal and run the producer.
./bin/kafka-avro-console-producer \
--broker-list localhost:9092 --topic expense-test-topic \
--property "parse.key=true" \
--property "key.separator=:" \
--property schema.registry.url=http://localhost:8081 \
--property key.schema='"string"' \
--property value.schema='{"type":"record","name":"dealRecord","fields":[{"name":"EXPENSE_CODE","type":"string"},{"name":"EXPENSE_DESC","type":"string"}]}'
Data:
"pk1":{"EXPENSE_CODE":"EXP001", "EXPENSE_DESC":"Regulatory Deposit"}
"pk2":{"EXPENSE_CODE":"EXP002", "EXPENSE_DESC":"ABC - Sofia"}
"pk3":{"EXPENSE_CODE":"EXP003", "EXPENSE_DESC":"Apple Corporation"}
"pk4":{"EXPENSE_CODE":"EXP004", "EXPENSE_DESC":"Confluent Europe"}
"pk5":{"EXPENSE_CODE":"EXP005", "EXPENSE_DESC":"Air India"}
"pk6":{"EXPENSE_CODE":"EXP006", "EXPENSE_DESC":"KLM International"}
STEP 4: Open another terminal and run the consumer.
./bin/kafka-avro-console-consumer --topic expense-test-topic \
--bootstrap-server localhost:9092 \
--property "parse.key=true" \
--property "key.separator=:" \
--property schema.registry.url=http://localhost:8081 \
--from-beginning
STEP 5: Log in to the KSQL client.
./bin/ksql http://localhost:8088
Create the following stream and table, then run the join query.
KSQL:
STREAM:
CREATE STREAM SAMPLE_STREAM
(DEAL_ID VARCHAR, DEAL_EXPENSE_CODE varchar, DEAL_BRANCH VARCHAR)
WITH (kafka_topic='stream-test-topic',value_format='AVRO', key = 'DEAL_ID');
TABLE:
CREATE TABLE SAMPLE_TABLE
(EXPENSE_CODE varchar, EXPENSE_DESC VARCHAR)
WITH (kafka_topic='expense-test-topic',value_format='AVRO', key = 'EXPENSE_CODE');
The following is the output:
ksql> SELECT STREAM1.DEAL_EXPENSE_CODE, TABLE1.EXPENSE_DESC
from SAMPLE_STREAM STREAM1 LEFT JOIN SAMPLE_TABLE TABLE1
ON STREAM1.DEAL_EXPENSE_CODE = TABLE1.EXPENSE_CODE
WINDOW TUMBLING (SIZE 3 MINUTE)
GROUP BY STREAM1.DEAL_EXPENSE_CODE, TABLE1.EXPENSE_DESC;
EXP001 | null
EXP001 | null
EXP002 | null
EXP003 | null
EXP004 | null
EXP005 | null
EXP006 | null
EXP002 | null
EXP002 | null

tl;dr: Your table data needs to be keyed on the column on which you're joining.
Using the sample data above, here's how to investigate and fix.
Use KSQL to check the data in the topics (no need for kafka-avro-console-consumer). The format of the output is timestamp, key, value.
stream:
ksql> print 'stream-test-topic' from beginning;
Format:AVRO
30/04/18 15:59:13 BST, null, {"DEAL_ID": "deal002", "DEAL_EXPENSE_CODE": "EXP002", "DEAL_BRANCH": "AMSTERDAM"}
30/04/18 15:59:13 BST, null, {"DEAL_ID": "deal003", "DEAL_EXPENSE_CODE": "EXP003", "DEAL_BRANCH": "AMSTERDAM"}
30/04/18 15:59:13 BST, null, {"DEAL_ID": "deal004", "DEAL_EXPENSE_CODE": "EXP004", "DEAL_BRANCH": "AMSTERDAM"}
table:
ksql> print 'expense-test-topic' from beginning;
Format:AVRO
30/04/18 16:10:52 BST, pk1, {"EXPENSE_CODE": "EXP001", "EXPENSE_DESC": "Regulatory Deposit"}
30/04/18 16:10:52 BST, pk2, {"EXPENSE_CODE": "EXP002", "EXPENSE_DESC": "ABC - Sofia"}
30/04/18 16:10:52 BST, pk3, {"EXPENSE_CODE": "EXP003", "EXPENSE_DESC": "Apple Corporation"}
30/04/18 16:10:52 BST, pk4, {"EXPENSE_CODE": "EXP004", "EXPENSE_DESC": "Confluent Europe"}
30/04/18 16:10:52 BST, pk5, {"EXPENSE_CODE": "EXP005", "EXPENSE_DESC": "Air India"}
30/04/18 16:10:52 BST, pk6, {"EXPENSE_CODE": "EXP006", "EXPENSE_DESC": "KLM International"}
At this point, note that the key (pk<x>) does not match the column on which we will be joining.
Register the two topics:
ksql> CREATE STREAM deals WITH (KAFKA_TOPIC='stream-test-topic', VALUE_FORMAT='AVRO');
Message
----------------
Stream created
----------------
ksql> CREATE TABLE expense_codes_table WITH (KAFKA_TOPIC='expense-test-topic', VALUE_FORMAT='AVRO', KEY='EXPENSE_CODE');
Message
---------------
Table created
---------------
Tell KSQL to query events from the beginning of each topic:
ksql> SET 'auto.offset.reset' = 'earliest';
Successfully changed local property 'auto.offset.reset' from 'null' to 'earliest'
Validate that the table's declared key per the DDL (KEY='EXPENSE_CODE') matches the actual key of the underlying Kafka messages (available through the ROWKEY system column):
ksql> SELECT ROWKEY, EXPENSE_CODE FROM expense_codes_table;
pk1 | EXP001
pk2 | EXP002
pk3 | EXP003
pk4 | EXP004
pk5 | EXP005
pk6 | EXP006
The keys don't match. Our join is doomed!
Magic workaround—let's rekey the topic using KSQL!
Register the table's source topic as a KSQL STREAM:
ksql> CREATE STREAM expense_codes_stream WITH (KAFKA_TOPIC='expense-test-topic', VALUE_FORMAT='AVRO');
Message
----------------
Stream created
----------------
Create a derived stream, keyed on the correct column. This is underpinned by a re-keyed Kafka topic.
ksql> CREATE STREAM EXPENSE_CODES_REKEY AS SELECT * FROM expense_codes_stream PARTITION BY EXPENSE_CODE;
Message
----------------------------
Stream created and running
----------------------------
Re-register the KSQL _TABLE_ on top of the re-keyed topic:
ksql> DROP TABLE expense_codes_table;
Message
----------------------------------------
Source EXPENSE_CODES_TABLE was dropped
----------------------------------------
ksql> CREATE TABLE expense_codes_table WITH (KAFKA_TOPIC='EXPENSE_CODES_REKEY', VALUE_FORMAT='AVRO', KEY='EXPENSE_CODE');
Message
---------------
Table created
---------------
Check the keys (declared vs message) match on the new table:
ksql> SELECT ROWKEY, EXPENSE_CODE FROM expense_codes_table;
EXP005 | EXP005
EXP001 | EXP001
EXP002 | EXP002
EXP003 | EXP003
EXP006 | EXP006
EXP004 | EXP004
Successful join:
ksql> SELECT D.DEAL_EXPENSE_CODE, E.EXPENSE_DESC \
FROM deals D \
LEFT JOIN expense_codes_table E \
ON D.DEAL_EXPENSE_CODE = E.EXPENSE_CODE \
WINDOW TUMBLING (SIZE 3 MINUTE) \
GROUP BY D.DEAL_EXPENSE_CODE, E.EXPENSE_DESC;
EXP006 | KLM International
EXP003 | Apple Corporation
EXP002 | ABC - Sofia
EXP004 | Confluent Europe
EXP001 | Regulatory Deposit
EXP005 | Air India
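If you want to persist the enriched records rather than just query them interactively, you can create a derived stream on top of the same join. This is a minimal sketch (the name DEALS_ENRICHED is illustrative, not from the original post):
ksql> CREATE STREAM DEALS_ENRICHED AS \
SELECT D.DEAL_EXPENSE_CODE, E.EXPENSE_DESC \
FROM deals D \
LEFT JOIN expense_codes_table E \
ON D.DEAL_EXPENSE_CODE = E.EXPENSE_CODE;
This writes the joined records to a new Kafka topic (by default named after the stream), which downstream consumers can read.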

Related

NoOffsetForPartitionException when using Flink SQL with Kafka-backed table

This is my table:
CREATE TABLE orders (
`id` STRING,
`currency_code` STRING,
`total` DECIMAL(10,2),
`order_time` TIMESTAMP(3),
WATERMARK FOR `order_time` AS order_time - INTERVAL '30' SECONDS
) WITH (
'connector' = 'kafka',
'topic' = 'orders',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'groupgroup',
'value.format' = 'json'
);
I can insert into the table:
INSERT into orders
VALUES ('001', 'EURO', 9.10, TO_TIMESTAMP('2022-01-12 12:50:00', 'yyyy-MM-dd HH:mm:ss'));
And I have verified that the data is there:
$ kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic orders --from-beginning
{"id":"001","currency_code":"EURO","total":9.1,"order_time":"2022-01-12 12:50:00"}
But when I try to query the table I get an error:
Flink SQL> select * from orders;
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.NoOffsetForPartitionException: Undefined offset with no reset policy for partitions: [orders-0]
The default value of scan.startup.mode is group-offsets, and there are no committed group offsets. You can fix this by setting scan.startup.mode to another value, e.g.,
'scan.startup.mode' = 'earliest-offset',
in the table definition, or by arranging to commit the offsets before making a query.
Flink will commit the offsets as part of checkpointing, but if you are using something like sql-client.sh or Zeppelin to experiment with Flink SQL interactively, setting scan.startup.mode to earliest-offset is a good solution.
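For example, the table definition above with the startup mode overridden would look like this (a sketch; only the added 'scan.startup.mode' line differs from the original DDL):
CREATE TABLE orders (
`id` STRING,
`currency_code` STRING,
`total` DECIMAL(10,2),
`order_time` TIMESTAMP(3),
WATERMARK FOR `order_time` AS order_time - INTERVAL '30' SECONDS
) WITH (
'connector' = 'kafka',
'topic' = 'orders',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'groupgroup',
'scan.startup.mode' = 'earliest-offset',
'value.format' = 'json'
);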

How should I connect ClickHouse to Kafka?

CREATE TABLE readings_queue
(
`readid` Int32,
`time` DateTime,
`temperature` Decimal(5,2)
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'serverIP:9092',
kafka_topic_list = 'newtest',
kafka_format = 'CSV',
kafka_group_name = 'clickhouse_consumer_group',
kafka_num_consumers = 3
Above is the code I used to set up the connection between Kafka and ClickHouse. But after I execute it, only the table is created; no data is retrieved.
~/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic newtest --from-beginning
1,"2020-05-16 23:55:44",14.2
2,"2020-05-16 23:55:45",20.1
3,"2020-05-16 23:55:51",12.9
Is there something wrong with my query in ClickHouse?
When I checked the log, I found the warning below.
2021.05.10 10:19:50.534441 [ 1534 ] {} <Warning> (readings_queue): Can't get assignment. It can be caused by some issue with consumer group (not enough partitions?). Will keep trying.
Here is an example to work with:
-- Kafka engine table which consumes the data from the 'data_topic' topic of Apache Kafka
CREATE TABLE data_kafka(
a DateTime,
b Int64,
c String
) ENGINE = Kafka
SETTINGS kafka_broker_list = '10.1.2.3:9092,10.1.2.4:9092,10.1.2.5:9092',
kafka_topic_list = 'data_topic',
kafka_num_consumers = 1,
kafka_group_name = 'ch-result5',
kafka_format = 'JSONEachRow';
-- Actual table to store the data fetched from the Apache Kafka topic
CREATE TABLE data(
a DateTime,
b Int64,
c String
) ENGINE = MergeTree ORDER BY (a, b);
-- Materialized view to insert any data consumed by the Kafka engine into the 'data' table
CREATE MATERIALIZED VIEW data_mv TO data AS
SELECT
a,
b,
c
FROM data_kafka;
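Once these are in place, queries should go against the MergeTree table (data), not against the Kafka engine table; selecting from the Kafka table directly consumes the messages and is mainly useful for debugging. A quick sanity check might look like:
SELECT count(), max(a) FROM data;
SELECT * FROM data ORDER BY a DESC LIMIT 10;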

Is there a KSQL statement to update values in table?

I have a stream of data in a topic that should be treated as a ksqlDB table (only the last value for a given key matters), and this data contains updates to specific fields of records in another topic. Is there any way in ksqlDB to process a stream that updates values in another stream/table/topic? The target topic has entities with, say, 20 fields, but my update stream carries updates for only 3 of them, so I want to update just those 3 fields and leave the other 17 unchanged in the target topic (treated as a table).
You can solve this with a JOIN statement and a small adjustment. Follow the sample below: it creates a table with five fields, but only the skill and level fields need to be updated from another table.
1. Create the table from the source topic:
CREATE TABLE TBL_EMPLOYEE( `employee_id` VARCHAR, `name` varchar, `lastName` varchar, `age` INT, `skill` VARCHAR, `level` VARCHAR ) WITH ( KAFKA_TOPIC = 'employee-topic-input', PARTITIONS = 3, VALUE_FORMAT = 'JSON', KEY = '`employee_id`');
2. Create the table to handle the desired updates (it can be a stream or a table, resulting from another query):
CREATE TABLE TBL_EMPLOYEE_DESIRED_UPDATES (`employee_id` VARCHAR, `skill` VARCHAR, `level` VARCHAR) WITH( KAFKA_TOPIC = 'employee-desired-updates-topic', PARTITIONS = 3, VALUE_FORMAT ='JSON', KEY = '`employee_id`');
3. Create the final table that updates the required fields. The LEFT JOIN keeps every element of the first table; if there is no update in the second table, the skill and level fields remain the same.
SET 'auto.offset.reset' = 'earliest';
CREATE TABLE TBL_EMPLOYEE_FINAL AS
SELECT
EMP.`employee_id` AS `employee_id`,
EMP.`name` AS `name`,
EMP.`lastName` AS `lastName`,
IFNULL(UPD.`skill`, EMP.`skill`) as `skill`,
IFNULL(UPD.`level`, EMP.`level`) as `level`
FROM TBL_EMPLOYEE AS EMP
LEFT JOIN TBL_EMPLOYEE_DESIRED_UPDATES UPD ON EMP.ROWKEY = UPD.ROWKEY EMIT CHANGES;
Example:
INSERT INTO TBL_EMPLOYEE (`employee_id`, `name`, `lastName`, `age`, `skill`, `level`) VALUES ('117', 'John', 'Constantine', 30, 'java', 'jr');
INSERT INTO TBL_EMPLOYEE (`employee_id`, `name`, `lastName`, `age`, `skill`, `level`) VALUES ('118', 'Anthony', 'Stark', 40, 'AWS', 'architect');
INSERT INTO TBL_EMPLOYEE (`employee_id`, `name`, `lastName`, `age`, `skill`, `level`) VALUES ('119', 'Clark', 'Kent', 35, 'python', 'senior');
ksql> SELECT * FROM TBL_EMPLOYEE_FINAL EMIT CHANGES;
+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
|ROWTIME |ROWKEY |employee_id |name |lastName |skill |level |
+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
|1611440363833 |119 |119 |Clark |Kent |python |senior |
|1611440361284 |117 |117 |John |Constantine |java |jr |
|1611440361408 |118 |118 |Anthony |Stark |AWS |architect |
Then send an update:
INSERT INTO TBL_EMPLOYEE_DESIRED_UPDATES (`employee_id`, `skill`, `level` ) VALUES ('118', 'mongo', 'senior');
The result:
ksql> SELECT * from TBL_EMPLOYEE_FINAL EMIT CHANGES;
+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
|ROWTIME |ROWKEY |employee_id |name |lastName |skill |level |
+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
|1611440363833 |119 |119 |Clark |Kent |python |senior |
|1611440361284 |117 |117 |John |Constantine |java |jr |
|1611440361408 |118 |118 |Anthony |Stark |AWS |architect |
|1611440585726 |118 |118 |Anthony |Stark |mongo |senior |
Consider the latest element in the table as the current value, carrying the two modifications. The earlier row is part of the table's changelog; the records themselves are immutable.

KSQL: stream select query is not populating the data

A SELECT query against a stream created over a topic in KSQL returns no results, whereas the underlying topic does hold messages.
Step 1: Producer command:
root#masbidw1.usa.corp.ad:/usr/hdp/3.0.1.0-187/confluentclient/confluent-5.2.2> bin/kafka-console-producer --broker-list masbidw1.usa.corp.ad:9092 --topic emptable3
>TEST1,TEST2,TEST3
Step 2: Print command in the KSQL terminal:
ksql> print 'emptable3' from beginning;
Format:STRING
8/16/19 10:57:45 AM EDT , NULL , TEST1,TEST2,TEST3
8/16/19 11:00:09 AM EDT , NULL , TEST1,TEST2,TEST3
8/16/19 11:00:16 AM EDT , NULL , TEST1,TEST2,TEST3
8/16/19 2:17:36 PM EDT , NULL , TEST1,TEST2,TEST3
The consumer returns nothing without a partition number:
root#masbidw1.usa.corp.ad:/usr/hdp/3.0.1.0-187/confluentclient/confluent-5.2.2> bin/kafka-console-consumer --bootstrap-server masbidw1.usa.corp.ad:9092 --topic emptable3 --from-beginning
After adding the partition number to the consumer command, the messages come out:
root#masbidw1.usa.corp.ad:/usr/hdp/3.0.1.0-187/confluentclient/confluent-5.2.2> bin/kafka-console-consumer --bootstrap-server masbidw1.usa.corp.ad:9092 --topic emptable3 --from-beginning --partition 0
TEST1,TEST2,TEST3
TEST1,TEST2,TEST3
TEST1,TEST2,TEST3
TEST1,TEST2,TEST3
Setting the earliest property in the KSQL terminal:
'auto.offset.reset'='earliest'
Creating the KSQL stream:
CREATE STREAM emptable2_stream (column1 STRING, column2 STRING, column3 STRING) WITH (KAFKA_TOPIC='emptable3', VALUE_FORMAT='DELIMITED');
Selecting values from the KSQL stream:
ksql> select * from EMPTABLE3_STREAM;
Press CTRL-C to interrupt
But there is no response. Even newly produced data is not returned from the stream.
Describe extended:
ksql> describe extended EMPTABLE3_STREAM;
Name : EMPTABLE3_STREAM
Type : STREAM
Key field :
Key format : STRING
Timestamp field : Not set - using <ROWTIME>
Value format : DELIMITED
Kafka topic : emptable3 (partitions: 1, replication: 1)
Field | Type
-------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
COLUMN1 | VARCHAR(STRING)
COLUMN2 | VARCHAR(STRING)
COLUMN3 | VARCHAR(STRING)
-------------------------------------
Local runtime statistics
------------------------
(Statistics of the local KSQL server interaction with the Kafka topic emptable3)

String field in KSQL TABLE or STREAM containing part of original JSON message

Is it possible to add a string field to a KSQL table/stream that will contain part of the original message's JSON?
For example,
Original message:
{"userId": 12345,
"service": "service-1",
"debug": {
"msg": "Debug message",
"timer": 11.12}
}
So we need to map userId to userId BIGINT, service to service STRING, and debug to debug STRING, which will contain {"msg":"Debug message", "timer": 11.12} as a string.
Yes, you can simply declare it as a VARCHAR. From there you can treat it as just a string that happens to be JSON, or you can manipulate it further with the EXTRACTJSONFIELD function.
Send sample message to topic:
echo '{"userId":12345, "service":"service-1", "debug":{ "msg":"Debug message", "timer": 11.12} }' | kafkacat -b localhost:9092 -t test_topic -P
Declare the stream:
ksql> CREATE STREAM demo (userid BIGINT, service VARCHAR, debug VARCHAR) WITH (KAFKA_TOPIC='test_topic', VALUE_FORMAT='JSON');
Message
----------------
Stream created
----------------
Query the columns:
ksql> SET 'auto.offset.reset' = 'earliest';
Successfully changed local property 'auto.offset.reset' to 'earliest'. Use the UNSET command to revert your change.
ksql> SELECT USERID, SERVICE, DEBUG FROM demo;
12345 | service-1 | {"msg":"Debug message","timer":11.12}
Access nested JSON fields:
ksql> SELECT USERID, SERVICE, EXTRACTJSONFIELD(DEBUG,'$.msg') FROM demo;
12345 | service-1 | Debug message
ksql> SELECT USERID, SERVICE, EXTRACTJSONFIELD(DEBUG,'$.timer') FROM demo;
12345 | service-1 | 11.12
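Note that EXTRACTJSONFIELD returns a VARCHAR, so if you need the nested value as a number you can cast it. A minimal sketch:
ksql> SELECT USERID, SERVICE, CAST(EXTRACTJSONFIELD(DEBUG,'$.timer') AS DOUBLE) FROM demo;
This returns the timer value as a DOUBLE rather than a string.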