Apache Flink: Handle bad Avro records in confluent-avro from Kafka

I've created a table using Flink's table APIs.
CREATE TABLE recommendations (
...
) WITH (
'connector' = 'kafka',
'topic' = 'my_kafka_topic',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'testGroup',
'properties.security.protocol' = 'SASL_PLAINTEXT',
'properties.sasl.kerberos.service.name' = 'kafka',
'scan.startup.mode' = 'latest-offset',
'value.format' = 'avro-confluent',
'value.avro-confluent.url' = 'http://schema-registry-address',
'value.fields-include' = 'EXCEPT_KEY'
);
When running the SQL to view the records, I'm getting:
Flink SQL> select * from default_catalog.default_database.recommendations ;
[ERROR] Could not execute SQL statement. Reason:
java.lang.ArrayIndexOutOfBoundsException: -25
Flink SQL> select * from default_catalog.default_database.recommendations ;
[ERROR] Could not execute SQL statement. Reason:
java.io.IOException: Failed to deserialize Avro record.
I'm aware there are some BAD avro records being pushed into the Kafka topic. In JSON format, there's an option to skip/filter these records by setting
'json.ignore-parse-errors' = 'true'. Is there any way we can skip these records when reading from confluent-avro format?
It's not ideal but unfortunately, I can't control what's being pushed to Kafka despite having a schema registry.

There's currently no such option for Avro. There's an open ticket for it at https://issues.apache.org/jira/browse/FLINK-20091

Related

Flink SQL not rolling iceberg files to hdfs while flink sql streaming job running

I am working on a project that uses Flink and Iceberg to write data from Kafka to an Iceberg Hive table, or to HDFS using the Hadoop catalog. When I publish a message to Kafka, I can see the message in the Kafka table, but no file is added in HDFS and no row is added to the Hive table.
Files appear in HDFS only after I cancel the job. What is the reason? Why can't I see one file per Kafka message in HDFS? I also can't see any data in the sink table (flink_iceberg_tbl3).
The code I used is below:
CREATE CATALOG hadoop_iceberg WITH (
'type'='iceberg',
'catalog-type'='hadoop',
'warehouse'='hdfs://localhost:9000/flink_iceberg_warehouse'
);
create table hadoop_iceberg.iceberg_db.flink_iceberg_tbl3
(id int
,name string
,age int
,loc string
) partitioned by(loc);
create table kafka_input_table(
id int,
name varchar,
age int,
loc varchar
) with (
'connector' = 'kafka',
'topic' = 'test_topic',
'properties.bootstrap.servers'='localhost:9092',
'scan.startup.mode'='latest-offset',
'properties.group.id' = 'my-group-id',
'format' = 'csv'
);
ALTER TABLE kafka_input_table SET ('scan.startup.mode'='earliest-offset');
set table.dynamic-table-options.enabled = true;
insert into hadoop_iceberg.iceberg_db.flink_iceberg_tbl3 select id,name,age,loc from kafka_input_table;
select * from hadoop_iceberg.iceberg_db.flink_iceberg_tbl3 /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s')*/;
I tried following several blog posts but ran into the same issue.

NoOffsetForPartitionException when using Flink SQL with Kafka-backed table

This is my table:
CREATE TABLE orders (
`id` STRING,
`currency_code` STRING,
`total` DECIMAL(10,2),
`order_time` TIMESTAMP(3),
WATERMARK FOR `order_time` AS order_time - INTERVAL '30' SECONDS
) WITH (
'connector' = 'kafka',
'topic' = 'orders',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'groupgroup',
'value.format' = 'json'
);
I can insert into the table:
INSERT into orders
VALUES ('001', 'EURO', 9.10, TO_TIMESTAMP('2022-01-12 12:50:00', 'yyyy-MM-dd HH:mm:ss'));
And I have verified that the data is there:
$ kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic orders --from-beginning
{"id":"001","currency_code":"EURO","total":9.1,"order_time":"2022-01-12 12:50:00"}
But when I try to query the table I get an error:
Flink SQL> select * from orders;
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.NoOffsetForPartitionException: Undefined offset with no reset policy for partitions: [orders-0]
The default value of scan.startup.mode is group-offsets, and there are no committed group offsets. You can fix this by setting scan.startup.mode to another value, e.g.,
'scan.startup.mode' = 'earliest-offset',
in the table definition, or by arranging to commit the offsets before making a query.
Flink will commit the offsets as part of checkpointing, but if you are using something like sql-client.sh or Zeppelin to experiment with Flink SQL interactively, setting scan.startup.mode to earliest-offset is a good solution.
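As a minimal sketch of the first fix, this is the table definition from the question with only the 'scan.startup.mode' option added; everything else is copied unchanged. If the table already exists in your session, you would need to drop it first or change the option with ALTER TABLE orders SET (...).

```sql
-- Same table as in the question, with an explicit startup mode.
CREATE TABLE orders (
  `id` STRING,
  `currency_code` STRING,
  `total` DECIMAL(10,2),
  `order_time` TIMESTAMP(3),
  WATERMARK FOR `order_time` AS order_time - INTERVAL '30' SECONDS
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'groupgroup',
  -- Start from the beginning of the topic instead of relying on
  -- committed group offsets (which do not exist yet).
  'scan.startup.mode' = 'earliest-offset',
  'value.format' = 'json'
);

SELECT * FROM orders;
```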

Delete in Apache Hudi - Glue Job

I have to build a Glue job that updates and deletes old rows in an Athena table.
When I run the job to delete rows, it returns an error:
AnalysisException: 'Unable to infer schema for Parquet. It must be specified manually.;'
My Glue Job:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table", transformation_ctx = "datasource0")
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table_output", transformation_ctx = "datasource1")
datasource0.toDF().createOrReplaceTempView("view_dyf")
datasource1.toDF().createOrReplaceTempView("view_dyf_output")
ds = spark.sql("SELECT * FROM view_dyf_output where id in (select id from view_dyf where op like 'D')")
hudi_delete_options = {
'hoodie.table.name': 'test_table_output',
'hoodie.datasource.write.recordkey.field': 'id',
'hoodie.datasource.write.table.name': 'test_table_output',
'hoodie.datasource.write.operation': 'delete',
'hoodie.datasource.write.precombine.field': 'name',
'hoodie.upsert.shuffle.parallelism': 1,
'hoodie.insert.shuffle.parallelism': 1
}
from pyspark.sql.functions import lit
deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
df = spark.sparkContext.parallelize(deletes).toDF(['id']).withColumn('name', lit(0.0))
df.write.format("hudi"). \
options(**hudi_delete_options). \
mode("append"). \
save('s3://data/test-output/')
roAfterDeleteViewDF = spark. \
read. \
format("hudi"). \
load("s3://data/test-output/")
roAfterDeleteViewDF.registerTempTable("test_table_output")
spark.sql("SELECT * FROM view_dyf_output where id in (select distinct id from view_dyf where op like 'D')").count()
I have two data sources: the first is the old Athena table whose rows have to be updated or deleted, and the second is the table that receives the new updated or deleted data.
In ds I have selected all rows that have to be deleted in the old table.
op is for operation; 'D' for delete, 'U' for update.
Does anyone know what I am missing here?
The value for hoodie.datasource.write.operation looks invalid in your code: the documented write operations are upsert, insert, and bulk_insert. Check the Hudi docs for your version, since newer releases also accept delete.
Also, what is your intention for deleting records: a hard delete or a soft delete?
For a hard delete, you have to provide
{'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload'}

How should I connect clickhouse to Kafka?

CREATE TABLE readings_queue
(
`readid` Int32,
`time` DateTime,
`temperature` Decimal(5,2)
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'serverIP:9092',
kafka_topic_list = 'newtest',
kafka_format = 'CSV',
kafka_group_name = 'clickhouse_consumer_group',
kafka_num_consumers = 3
Above is the code I used to set up the connection between Kafka and ClickHouse. But after I execute it, only the table is created; no data is retrieved.
~/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic newtest --from-beginning
1,"2020-05-16 23:55:44",14.2
2,"2020-05-16 23:55:45",20.1
3,"2020-05-16 23:55:51",12.9
Is there something wrong with my query in ClickHouse?
When I checked the log, I found the warning below.
2021.05.10 10:19:50.534441 [ 1534 ] {} <Warning> (readings_queue): Can't get assignment. It can be caused by some issue with consumer group (not enough partitions?). Will keep trying.
Here is an example to work with:
-- Actual table to store the data fetched from an Apache Kafka topic
CREATE TABLE data(
a DateTime,
b Int64,
c String
) Engine=MergeTree ORDER BY (a,b);
-- Kafka Engine which consumes the data from 'data_topic' of Apache Kafka
CREATE TABLE data_kafka(
a DateTime,
b Int64,
c String
) ENGINE = Kafka
SETTINGS kafka_broker_list = '10.1.2.3:9092,10.1.2.4:9092,10.1.2.5:9092',
kafka_topic_list = 'data_topic',
kafka_num_consumers = 1,
kafka_group_name = 'ch-result5',
kafka_format = 'JSONEachRow';
-- Materialized View to insert any consumed data by Kafka Engine to 'data' table
-- (created last, because it selects from data_kafka)
CREATE MATERIALIZED VIEW data_mv TO data AS
SELECT
a,
b,
c
FROM data_kafka;

Why is ADD COLUMN on a Kafka table not supported in ClickHouse?

I have a problem adding a column to the Kafka queue in ClickHouse.
I've created a table with the command
CREATE TABLE my_db.my_queue ON CLUSTER my_cluster
(
`ts` String,
.... some other columns
)
ENGINE = Kafka()
SETTINGS
kafka_broker_list = '172.21.0.3:9092',
kafka_topic_list = 'my_topic',
kafka_group_name = 'my_group',
kafka_format = 'JSONEachRow',
kafka_row_delimiter = '\n',
kafka_num_consumers = 1,
kafka_skip_broken_messages = 10;
And then trying to add a column
ALTER TABLE my_db.my_queue ON CLUSTER my_cluster ADD COLUMN new_column String;
But getting an error
SQL Error [48]: ClickHouse exception, code: 48, host: 172.21.0.4, port: 8123; Code: 48,
e.displayText() = DB::Exception: There was an error on [clickhouse-server:9000]: Code: 48,
e.displayText() = DB::Exception: Alter of type 'ADD COLUMN' is not supported by storage Kafka
(version 20.11.4.13 (official build)) (version 20.11.4.13 (official build))
I am not familiar with ClickHouse or any other analytical database.
So I am wondering: why is it not supported? Or should I add the column in another way?
One way to support messages with different schemas from a Kafka queue is to store the raw JSON messages like this:
CREATE TABLE my_db.my_queue ON CLUSTER my_cluster
(
`message` String
)
ENGINE = Kafka()
SETTINGS
kafka_broker_list = '172.21.0.3:9092',
kafka_topic_list = 'my_topic',
kafka_group_name = 'my_group',
kafka_format = 'JSONAsString',
kafka_row_delimiter = '\n',
kafka_num_consumers = 1,
kafka_skip_broken_messages = 10;
The JSONAsString format will store the raw JSON in the message column. This way from the Kafka table you can post-process each new row through materialized views and JSON functions.
For instance:
CREATE TABLE my_db.post_processed_data (
`ts` String,
`another_column` String
)
-- use a proper engine
Engine=Log;
CREATE MATERIALIZED VIEW my_db.my_queue_mv TO my_db.post_processed_data
AS
SELECT
JSONExtractString(message, 'ts') AS ts,
JSONExtractString(message, 'another_column') AS another_column
FROM my_db.my_queue;
If there's any change in the JSON schema of the Kafka queue, you can react by running ALTER TABLE .. ADD COLUMN .. on the post_processed_data table and updating the materialized view to match. That way the Kafka table remains as it is.
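A rough sketch of that schema change, assuming the new JSON field is called new_column (a hypothetical name) and that post_processed_data uses an engine that supports ALTER, such as one from the MergeTree family rather than the Log engine used as a placeholder above. Dropping and re-creating the view is only one way to update it. The tables here were created without ON CLUSTER, so add that clause if yours are clustered.

```sql
-- 1. Drop the materialized view so nothing writes to the target table
--    while its schema is changing (DROP TABLE works for views too).
DROP TABLE my_db.my_queue_mv;

-- 2. Add the new column to the target table.
--    new_column is a hypothetical name used for illustration.
ALTER TABLE my_db.post_processed_data ADD COLUMN new_column String;

-- 3. Re-create the view so it also extracts the new field.
CREATE MATERIALIZED VIEW my_db.my_queue_mv TO my_db.post_processed_data
AS
SELECT
    JSONExtractString(message, 'ts') AS ts,
    JSONExtractString(message, 'another_column') AS another_column,
    JSONExtractString(message, 'new_column') AS new_column
FROM my_db.my_queue;
```

The Kafka table my_db.my_queue is not touched at any point, which is the main advantage of keeping the raw message in a single String column.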
The Kafka engine does not support it.
Just drop the table and re-create it with the new schema.
ALTER is not supported because the author of the Kafka engine did not need it.
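A sketch of the drop-and-re-create approach, reusing the settings from the question. The column list in the question is elided, so the comment below stands in for it, and new_column is the column from the failed ALTER. A Kafka engine table stores no rows itself, and with the same kafka_group_name consumption should resume from the offsets committed for that consumer group, but verify this on your version before relying on it.

```sql
-- Drop the existing Kafka engine table on every node of the cluster.
DROP TABLE my_db.my_queue ON CLUSTER my_cluster;

-- Re-create it with the extra column; settings are copied from the question.
CREATE TABLE my_db.my_queue ON CLUSTER my_cluster
(
    `ts` String,
    -- ... the other existing columns go here, unchanged ...
    `new_column` String
)
ENGINE = Kafka()
SETTINGS
    kafka_broker_list = '172.21.0.3:9092',
    kafka_topic_list = 'my_topic',
    kafka_group_name = 'my_group',
    kafka_format = 'JSONEachRow',
    kafka_row_delimiter = '\n',
    kafka_num_consumers = 1,
    kafka_skip_broken_messages = 10;
```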