How should I connect ClickHouse to Kafka? - apache-kafka

CREATE TABLE readings_queue
(
`readid` Int32,
`time` DateTime,
`temperature` Decimal(5,2)
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'serverIP:9092',
kafka_topic_list = 'newtest',
kafka_format = 'CSV',
kafka_group_name = 'clickhouse_consumer_group',
kafka_num_consumers = 3
Above is the code I used to set up the connection between Kafka and ClickHouse. After I execute it, the table is created but no data is retrieved, although the topic does contain messages:
~/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic newtest --from-beginning
1,"2020-05-16 23:55:44",14.2
2,"2020-05-16 23:55:45",20.1
3,"2020-05-16 23:55:51",12.9
Is there something wrong with my query in ClickHouse?
When I checked the log, I found the following warning:
2021.05.10 10:19:50.534441 [ 1534 ] {} <Warning> (readings_queue): Can't get assignment. It can be caused by some issue with consumer group (not enough partitions?). Will keep trying.

Here is an example to work with:
-- Actual table to store the data fetched from an Apache Kafka topic
CREATE TABLE data(
a DateTime,
b Int64,
c String
) Engine=MergeTree ORDER BY (a,b);
-- Kafka Engine which consumes the data from 'data_topic' of Apache Kafka
CREATE TABLE data_kafka(
a DateTime,
b Int64,
c String
) ENGINE = Kafka
SETTINGS kafka_broker_list = '10.1.2.3:9092,10.1.2.4:9092,10.1.2.5:9092',
kafka_topic_list = 'data_topic',
kafka_num_consumers = 1,
kafka_group_name = 'ch-result5',
kafka_format = 'JSONEachRow';
-- Materialized View to insert any data consumed by the Kafka Engine into the 'data' table
-- (created last, because it selects from data_kafka, which must already exist)
CREATE MATERIALIZED VIEW data_mv TO data AS
SELECT
a,
b,
c
FROM data_kafka;
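Applied to the table from the question, the same three-part pattern might look like the sketch below. A Kafka engine table on its own stores nothing; rows only land somewhere once a materialized view forwards them to a storage table. The names readings and readings_mv are hypothetical, and kafka_num_consumers = 1 assumes the topic newtest has a single partition: the number of consumers should not exceed the number of partitions, which is what the "not enough partitions?" hint in the warning points at.

```sql
-- Kafka engine table: one consumer, assuming 'newtest' has one partition
CREATE TABLE readings_queue
(
    readid Int32,
    time DateTime,
    temperature Decimal(5,2)
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'serverIP:9092',
         kafka_topic_list = 'newtest',
         kafka_format = 'CSV',
         kafka_group_name = 'clickhouse_consumer_group',
         kafka_num_consumers = 1;

-- Hypothetical storage table that actually keeps the rows
CREATE TABLE readings
(
    readid Int32,
    time DateTime,
    temperature Decimal(5,2)
)
ENGINE = MergeTree
ORDER BY (time, readid);

-- Hypothetical materialized view that moves consumed rows into 'readings'
CREATE MATERIALIZED VIEW readings_mv TO readings AS
SELECT readid, time, temperature
FROM readings_queue;
```

After this, query readings, not readings_queue, to see the data.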

Related

Flink SQL not rolling iceberg files to hdfs while flink sql streaming job running

I am working on a project that uses Flink and Iceberg to write data from Kafka to an Iceberg Hive table or to HDFS using a Hadoop catalog. When I publish a message to Kafka, I can see it in the Kafka table, but no file is added in HDFS and no row is added to the Hive table.
Files appear in HDFS only after I cancel the job. What is the reason? Why can't I see one file per Kafka message in HDFS? I also cannot see any data in the sink table (flink_iceberg_tbl3).
The code I used is below:
CREATE CATALOG hadoop_iceberg WITH (
'type'='iceberg',
'catalog-type'='hadoop',
'warehouse'='hdfs://localhost:9000/flink_iceberg_warehouse'
);
create table hadoop_iceberg.iceberg_db.flink_iceberg_tbl3
(id int
,name string
,age int
,loc string
) partitioned by(loc);
create table kafka_input_table(
id int,
name varchar,
age int,
loc varchar
) with (
'connector' = 'kafka',
'topic' = 'test_topic',
'properties.bootstrap.servers'='localhost:9092',
'scan.startup.mode'='latest-offset',
'properties.group.id' = 'my-group-id',
'format' = 'csv'
);
ALTER TABLE kafka_input_table SET ('scan.startup.mode'='earliest-offset');
set table.dynamic-table-options.enabled = true;
insert into hadoop_iceberg.iceberg_db.flink_iceberg_tbl3 select id,name,age,loc from kafka_input_table;
select * from hadoop_iceberg.iceberg_db.flink_iceberg_tbl3 /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s')*/;
I tried following many blog posts, but I get the same issue.
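One likely cause, stated here as an assumption because the job configuration is not shown: the Iceberg Flink sink commits data files only when a Flink checkpoint completes. With checkpointing disabled, nothing becomes visible while the job is running, and files are produced per checkpoint, not per Kafka message. A minimal sketch for the SQL client, with an illustrative 10-second interval (the exact SET syntax depends on the Flink version; older clients use the unquoted form shown in the question):

```sql
-- Enable periodic checkpoints before submitting the streaming INSERT;
-- the interval value is only an example
SET 'execution.checkpointing.interval' = '10s';

INSERT INTO hadoop_iceberg.iceberg_db.flink_iceberg_tbl3
SELECT id, name, age, loc FROM kafka_input_table;
```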

NoOffsetForPartitionException when using Flink SQL with Kafka-backed table

This is my table:
CREATE TABLE orders (
`id` STRING,
`currency_code` STRING,
`total` DECIMAL(10,2),
`order_time` TIMESTAMP(3),
WATERMARK FOR `order_time` AS order_time - INTERVAL '30' SECONDS
) WITH (
'connector' = 'kafka',
'topic' = 'orders',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'groupgroup',
'value.format' = 'json'
);
I can insert into the table:
INSERT into orders
VALUES ('001', 'EURO', 9.10, TO_TIMESTAMP('2022-01-12 12:50:00', 'yyyy-MM-dd HH:mm:ss'));
And I have verified that the data is there:
$ kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic orders --from-beginning
{"id":"001","currency_code":"EURO","total":9.1,"order_time":"2022-01-12 12:50:00"}
But when I try to query the table I get an error:
Flink SQL> select * from orders;
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.NoOffsetForPartitionException: Undefined offset with no reset policy for partitions: [orders-0]
The default value of scan.startup.mode is group-offsets, and there are no committed group offsets. You can fix this by setting scan.startup.mode to another value, e.g.,
'scan.startup.mode' = 'earliest-offset',
in the table definition, or by arranging to commit the offsets before making a query.
Flink will commit the offsets as part of checkpointing, but if you are using something like sql-client.sh or Zeppelin to experiment with Flink SQL interactively, setting scan.startup.mode to earliest-offset is a good solution.
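For reference, this is the table definition from the question with that one option added; everything else is unchanged:

```sql
CREATE TABLE orders (
  `id` STRING,
  `currency_code` STRING,
  `total` DECIMAL(10,2),
  `order_time` TIMESTAMP(3),
  WATERMARK FOR `order_time` AS order_time - INTERVAL '30' SECONDS
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'groupgroup',
  -- read from the start of the topic instead of relying on committed group offsets
  'scan.startup.mode' = 'earliest-offset',
  'value.format' = 'json'
);
```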

Apache Flink : Handle bad avro records in confluent-avro from Kafka

I've created a table using Flink's table APIs.
CREATE TABLE recommendations (
...
) WITH (
'connector' = 'kafka',
'topic' = 'my_kafka_topic',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'testGroup',
'properties.security.protocol' = 'SASL_PLAINTEXT',
'properties.sasl.kerberos.service.name' = 'kafka',
'scan.startup.mode' = 'latest-offset',
'value.format' = 'avro-confluent',
'value.avro-confluent.url' = 'http://schema-registry-address',
'value.fields-include' = 'EXCEPT_KEY'
);
When running the SQL to view the records, I'm getting:
Flink SQL> select * from default_catalog.default_database.recommendations ;
[ERROR] Could not execute SQL statement. Reason:
java.lang.ArrayIndexOutOfBoundsException: -25
Flink SQL> select * from default_catalog.default_database.recommendations ;
[ERROR] Could not execute SQL statement. Reason:
java.io.IOException: Failed to deserialize Avro record.
I'm aware there are some BAD avro records being pushed into the Kafka topic. In JSON format, there's an option to skip/filter these records by setting
'json.ignore-parse-errors' = 'true'. Is there any way we can skip these records when reading from confluent-avro format?
It's not ideal but unfortunately, I can't control what's being pushed to Kafka despite having a schema registry.
There's currently no such option for AVRO. There's an open ticket for it at https://issues.apache.org/jira/browse/FLINK-20091

Why is ADD COLUMN on a Kafka table not supported in ClickHouse?

I have a problem adding a column to the Kafka queue in ClickHouse.
I've created a table with the command
CREATE TABLE my_db.my_queue ON CLUSTER my_cluster
(
`ts` String,
.... some other columns
)
ENGINE = Kafka()
SETTINGS
kafka_broker_list = '172.21.0.3:9092',
kafka_topic_list = 'my_topic',
kafka_group_name = 'my_group',
kafka_format = 'JSONEachRow',
kafka_row_delimiter = '\n',
kafka_num_consumers = 1,
kafka_skip_broken_messages = 10;
I then tried to add a column:
ALTER TABLE my_db.my_queue ON CLUSTER my_cluster ADD COLUMN new_column String;
But I get an error:
SQL Error [48]: ClickHouse exception, code: 48, host: 172.21.0.4, port: 8123; Code: 48,
e.displayText() = DB::Exception: There was an error on [clickhouse-server:9000]: Code: 48,
e.displayText() = DB::Exception: Alter of type 'ADD COLUMN' is not supported by storage Kafka
(version 20.11.4.13 (official build)) (version 20.11.4.13 (official build))
I am not familiar with ClickHouse or any other analytical database, so I am wondering why this is not supported. Or should I add the column in another way?
One way to support messages with different schemas from a Kafka queue is to store the raw JSON messages like this:
CREATE TABLE my_db.my_queue ON CLUSTER my_cluster
(
`message` String
)
ENGINE = Kafka()
SETTINGS
kafka_broker_list = '172.21.0.3:9092',
kafka_topic_list = 'my_topic',
kafka_group_name = 'my_group',
kafka_format = 'JSONAsString',
kafka_row_delimiter = '\n',
kafka_num_consumers = 1,
kafka_skip_broken_messages = 10;
The JSONAsString format will store the raw JSON in the message column. This way from the Kafka table you can post-process each new row through materialized views and JSON functions.
For instance:
CREATE TABLE my_db.post_processed_data (
`ts` String,
`another_column` String
)
-- use a proper engine
Engine=Log;
CREATE MATERIALIZED VIEW my_db.my_queue_mv TO my_db.post_processed_data
AS
SELECT
JSONExtractString(message, 'ts') AS ts,
JSONExtractString(message, 'another_column') AS another_column
FROM my_db.my_queue;
If there's any change in the JSON schema of the Kafka queue, you can react by running ALTER TABLE .. ADD COLUMN .. on the post_processed_data table and updating the materialized view to match. That way the Kafka table remains as it is.
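A sketch of such a schema change, using the new_column name from the question. It assumes post_processed_data was created with a MergeTree-family engine (the "use a proper engine" comment above), since the Log engine does not support ADD COLUMN. Dropping and recreating the view is one simple way to update it; dropping a view created with TO does not touch the target table's data.

```sql
-- 1. Remove the view so nothing is written while the schema changes
DROP TABLE my_db.my_queue_mv;

-- 2. Add the column to the storage table (assumes a MergeTree-family engine)
ALTER TABLE my_db.post_processed_data ADD COLUMN new_column String;

-- 3. Recreate the view, now extracting the new field from the raw JSON
CREATE MATERIALIZED VIEW my_db.my_queue_mv TO my_db.post_processed_data
AS
SELECT
    JSONExtractString(message, 'ts') AS ts,
    JSONExtractString(message, 'another_column') AS another_column,
    JSONExtractString(message, 'new_column') AS new_column
FROM my_db.my_queue;
```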
The Kafka engine does not support it.
Just drop the table and recreate it with the new schema.
ALTER is not supported because the author of the Kafka engine did not need it.

Data type for null and string with \ in Clickhouse

{"DEVICE_GROUP":null,"DEVICE":"10.84.130.44","RULE":"check-in-traffic",
"TOPIC1":"interface.statistics","FIELDS_HIGH_THRESHOLD":800000000,
"FIELDS_LOW_THRESHOLD":500000000,"FIELDS_OUT_OCTETS_STATS_VALUE":null,
"FIELDS_TANDINGESTTIMESTAMP":1598127990844870700,
"FIELDS_TANDTIMEOFFSET":"719.508643ms","KEYS_INTERFACE_NAME":"em3",
"KEYS_PLAYBOOK_NAME":"interface-kpis-playbook",
"KEYS_INSTANCE_ID":"[\"i1\"]"
Above is the JSON I am getting from Kafka. I am able to create a table using most of the keys; I just want to know what data type I should use for KEYS_INSTANCE_ID when creating the tables in ClickHouse with the MergeTree and Kafka engines and a materialized view. I tried String, but it didn't work for me.
#to create table using mergetree engine:
CREATE TABLE IF NOT EXISTS readings_hb_trial_11
(
KEYS_INSTANCE_ID String
)
ENGINE = MergeTree
ORDER BY KEYS_INSTANCE_ID
#to create table using kafka engine:
CREATE TABLE IF NOT EXISTS readings_queue_hb_trial_11
(
KEYS_INSTANCE_ID String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = '10########2', kafka_topic_list = 'R########B', kafka_group_name = 'readings_consumer_group3', kafka_format = 'JSONEachRow', kafka_max_block_size = 1048576
#to materialize the table:
CREATE MATERIALIZED VIEW readings_queue_mv_hb_trial_11 TO readings_hb_trial_11 AS
SELECT
KEYS_INSTANCE_ID
FROM readings_queue_hb_trial_11
I suspect the JSON is wrong where the array is wrapped in double quotes; the correct form should look like "KEYS_INSTANCE_ID":["i1"].
In this case, the type Array(String) should help.
Let's test it:
/* Emulate the table with Kafka-engine */
CREATE TABLE readings_queue_hb_trial_11
(
`KEYS_INSTANCE_ID` Array(String)
)
ENGINE = Memory
/* MV takes just the first item from an array (as I understood it is your case). */
CREATE MATERIALIZED VIEW readings_queue_mv_hb_trial_11 TO readings_hb_trial_11 AS
SELECT
empty(KEYS_INSTANCE_ID) ? '' : KEYS_INSTANCE_ID[1] AS KEYS_INSTANCE_ID
FROM readings_queue_hb_trial_11
Emulate processing some messages:
INSERT INTO readings_queue_hb_trial_11
SELECT JSONExtractArrayRaw('{"KEYS_INSTANCE_ID":["i1"]}', 'KEYS_INSTANCE_ID')
UNION ALL
SELECT JSONExtractArrayRaw('{"KEYS_INSTANCE_ID":[]}', 'KEYS_INSTANCE_ID')
UNION ALL
SELECT JSONExtractArrayRaw('{"KEYS_INSTANCE_ID":["i1", "i2"]}', 'KEYS_INSTANCE_ID')
The result of processing:
SELECT *
FROM readings_hb_trial_11
┌─KEYS_INSTANCE_ID─┐
│ "i1" │
│ │
│ "i1" │
└──────────────────┘
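If the producer cannot be changed and the field really does arrive as an escaped JSON string ("[\"i1\"]"), an alternative is to keep the column as String in the Kafka table and parse it in the materialized view. This is a minimal sketch, assuming the target table readings_hb_trial_11 declares KEYS_INSTANCE_ID as Array(String) while the queue table keeps it as String, as in the question. JSONExtract called with only a return type and no path parses the whole string:

```sql
-- Standalone check: parse a JSON array that is held in a plain string
SELECT JSONExtract('["i1","i2"]', 'Array(String)') AS instance_ids;

-- The same parsing inside the materialized view
CREATE MATERIALIZED VIEW readings_queue_mv_hb_trial_11 TO readings_hb_trial_11 AS
SELECT
    JSONExtract(KEYS_INSTANCE_ID, 'Array(String)') AS KEYS_INSTANCE_ID
FROM readings_queue_hb_trial_11;
```

For keys that can be null, such as DEVICE_GROUP, Nullable(String) is the usual column type.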