NoOffsetForPartitionException when using Flink SQL with Kafka-backed table

This is my table:
CREATE TABLE orders (
`id` STRING,
`currency_code` STRING,
`total` DECIMAL(10,2),
`order_time` TIMESTAMP(3),
WATERMARK FOR `order_time` AS order_time - INTERVAL '30' SECONDS
) WITH (
'connector' = 'kafka',
'topic' = 'orders',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'groupgroup',
'value.format' = 'json'
);
I can insert into the table:
INSERT into orders
VALUES ('001', 'EURO', 9.10, TO_TIMESTAMP('2022-01-12 12:50:00', 'yyyy-MM-dd HH:mm:ss'));
And I have verified that the data is there:
$ kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic orders --from-beginning
{"id":"001","currency_code":"EURO","total":9.1,"order_time":"2022-01-12 12:50:00"}
But when I try to query the table I get an error:
Flink SQL> select * from orders;
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.NoOffsetForPartitionException: Undefined offset with no reset policy for partitions: [orders-0]

The default value of scan.startup.mode is group-offsets, and there are no committed group offsets. You can fix this by setting scan.startup.mode to another value, e.g.,
'scan.startup.mode' = 'earliest-offset',
in the table definition, or by arranging to commit the offsets before making a query.
Flink will commit the offsets as part of checkpointing, but if you are using something like sql-client.sh or Zeppelin to experiment with Flink SQL interactively, setting scan.startup.mode to earliest-offset is a good solution.
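As a minimal sketch of the first option, the table from the question could be re-created with the startup mode set explicitly. Everything except the 'scan.startup.mode' line is copied from the question. Dropping and re-creating the table is an assumption about how you would apply the change in an interactive session.

```sql
-- Assumption: it is acceptable to drop and re-create the table definition.
-- This removes only Flink's catalog entry, not the Kafka topic or its data.
DROP TABLE IF EXISTS orders;

CREATE TABLE orders (
  `id` STRING,
  `currency_code` STRING,
  `total` DECIMAL(10,2),
  `order_time` TIMESTAMP(3),
  WATERMARK FOR `order_time` AS order_time - INTERVAL '30' SECONDS
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'groupgroup',
  -- Read from the start of the topic instead of relying on committed group offsets
  'scan.startup.mode' = 'earliest-offset',
  'value.format' = 'json'
);

SELECT * FROM orders;
```

If you would rather leave the table definition alone, the same option can probably be supplied per query with a dynamic table options hint. Depending on your Flink version, you may first need to enable table.dynamic-table-options.enabled.

```sql
-- Overrides the startup mode for this query only
SELECT * FROM orders /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */;
```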

Related

Flink SQL not rolling iceberg files to hdfs while flink sql streaming job running

I am working on a project that uses Flink and Iceberg to write data from Kafka to an Iceberg Hive table, or to HDFS using a Hadoop catalog. When I publish a message to Kafka, I can see the message in the Kafka table, but no file is added in HDFS and no row is added to the Hive table.
Files appear in HDFS only after I cancel the job. What is the reason? Why can't I see one file per Kafka message in HDFS? I also cannot see any data in the sink table (flink_iceberg_tbl3).
The code I used is below:
CREATE CATALOG hadoop_iceberg WITH (
'type'='iceberg',
'catalog-type'='hadoop',
'warehouse'='hdfs://localhost:9000/flink_iceberg_warehouse'
);
create table hadoop_iceberg.iceberg_db.flink_iceberg_tbl3
(id int
,name string
,age int
,loc string
) partitioned by(loc);
create table kafka_input_table(
id int,
name varchar,
age int,
loc varchar
) with (
'connector' = 'kafka',
'topic' = 'test_topic',
'properties.bootstrap.servers'='localhost:9092',
'scan.startup.mode'='latest-offset',
'properties.group.id' = 'my-group-id',
'format' = 'csv'
);
ALTER TABLE kafka_input_table SET ('scan.startup.mode'='earliest-offset');
set table.dynamic-table-options.enabled = true;
insert into hadoop_iceberg.iceberg_db.flink_iceberg_tbl3 select id,name,age,loc from kafka_input_table;
select * from hadoop_iceberg.iceberg_db.flink_iceberg_tbl3 /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s')*/;
I tried the approaches from many blogs but hit the same issue.

Apache Flink : Handle bad avro records in confluent-avro from Kafka

I've created a table using Flink's table APIs.
CREATE TABLE recommendations (
...
) WITH (
'connector' = 'kafka',
'topic' = 'my_kafka_topic',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'testGroup',
'properties.security.protocol' = 'SASL_PLAINTEXT',
'properties.sasl.kerberos.service.name' = 'kafka',
'scan.startup.mode' = 'latest-offset',
'value.format' = 'avro-confluent',
'value.avro-confluent.url' = 'http://schema-registry-address',
'value.fields-include' = 'EXCEPT_KEY'
);
When running the SQL to view the records, I'm getting:
Flink SQL> select * from default_catalog.default_database.recommendations ;
[ERROR] Could not execute SQL statement. Reason:
java.lang.ArrayIndexOutOfBoundsException: -25
Flink SQL> select * from default_catalog.default_database.recommendations ;
[ERROR] Could not execute SQL statement. Reason:
java.io.IOException: Failed to deserialize Avro record.
I'm aware there are some BAD avro records being pushed into the Kafka topic. In JSON format, there's an option to skip/filter these records by setting
'json.ignore-parse-errors' = 'true'. Is there any way we can skip these records when reading from confluent-avro format?
It's not ideal but unfortunately, I can't control what's being pushed to Kafka despite having a schema registry.

There's currently no such option for AVRO. There's an open ticket for it at https://issues.apache.org/jira/browse/FLINK-20091

Delete in Apache Hudi - Glue Job

I have to build a Glue job for updating and deleting old rows in an Athena table.
When I run my job for deleting it returns an error:
AnalysisException: 'Unable to infer schema for Parquet. It must be specified manually.;'
My Glue Job:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table", transformation_ctx = "datasource0")
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table_output", transformation_ctx = "datasource1")
datasource0.toDF().createOrReplaceTempView("view_dyf")
datasource1.toDF().createOrReplaceTempView("view_dyf_output")
ds = spark.sql("SELECT * FROM view_dyf_output where id in (select id from view_dyf where op like 'D')")
hudi_delete_options = {
'hoodie.table.name': 'test_table_output',
'hoodie.datasource.write.recordkey.field': 'id',
'hoodie.datasource.write.table.name': 'test_table_output',
'hoodie.datasource.write.operation': 'delete',
'hoodie.datasource.write.precombine.field': 'name',
'hoodie.upsert.shuffle.parallelism': 1,
'hoodie.insert.shuffle.parallelism': 1
}
from pyspark.sql.functions import lit
deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
df = spark.sparkContext.parallelize(deletes).toDF(['id']).withColumn('name', lit(0.0))
df.write.format("hudi"). \
options(**hudi_delete_options). \
mode("append"). \
save('s3://data/test-output/')
roAfterDeleteViewDF = spark. \
read. \
format("hudi"). \
load("s3://data/test-output/")
roAfterDeleteViewDF.registerTempTable("test_table_output")
spark.sql("SELECT * FROM view_dyf_output where id in (select distinct id from view_dyf where op like 'D')").count()
I have two data sources: the first is the old Athena table whose data has to be updated or deleted, and the second is a table that receives the new updated or deleted data.
In ds I have selected all rows that have to be deleted from the old table.
op is the operation: 'D' for delete, 'U' for update.
Does anyone know what I am missing here?

The value for hoodie.datasource.write.operation is invalid in your code; the supported write operations are upsert, insert, and bulk_insert. Check the Hudi docs for the version you are running.
Also, what is your intention for deleting records: a hard delete or a soft delete?
For a hard delete, you have to provide:
{'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload'}

How should I connect clickhouse to Kafka?

CREATE TABLE readings_queue
(
`readid` Int32,
`time` DateTime,
`temperature` Decimal(5,2)
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'serverIP:9092',
kafka_topic_list = 'newtest',
kafka_format = 'CSV',
kafka_group_name = 'clickhouse_consumer_group',
kafka_num_consumers = 3
Above is the code I used to set up the connection between Kafka and ClickHouse. But after I execute it, only the table is created; no data is retrieved.
~/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic newtest --from-beginning
1,"2020-05-16 23:55:44",14.2
2,"2020-05-16 23:55:45",20.1
3,"2020-05-16 23:55:51",12.9
Is there something wrong with my query in ClickHouse?
When I checked the log, I found the warning below.
2021.05.10 10:19:50.534441 [ 1534 ] {} <Warning> (readings_queue): Can't get assignment. It can be caused by some issue with consumer group (not enough partitions?). Will keep trying.

Here is an example to work with:
-- Actual table to store the data fetched from an Apache Kafka topic
CREATE TABLE data(
a DateTime,
b Int64,
c String
) Engine=MergeTree ORDER BY (a,b);
-- Kafka Engine which consumes the data from 'data_topic' of Apache Kafka
CREATE TABLE data_kafka(
a DateTime,
b Int64,
c String
) ENGINE = Kafka
SETTINGS kafka_broker_list = '10.1.2.3:9092,10.1.2.4:9092,10.1.2.5:9092',
kafka_topic_list = 'data_topic',
kafka_num_consumers = 1,
kafka_group_name = 'ch-result5',
kafka_format = 'JSONEachRow';
-- Materialized View to insert any consumed data by Kafka Engine to 'data' table
-- (created last because it selects from data_kafka, which must already exist)
CREATE MATERIALIZED VIEW data_mv TO data AS
SELECT
a,
b,
c
FROM data_kafka;
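Applied to the table from the question, the same pattern might look like the sketch below. The names readings and readings_mv are hypothetical; readings_queue and its columns come from the question. A Kafka engine table only consumes messages, so the rows have to be copied into a regular table by a materialized view before you can query them repeatedly.

```sql
-- Hypothetical storage table with the same columns as readings_queue
CREATE TABLE readings
(
    `readid` Int32,
    `time` DateTime,
    `temperature` Decimal(5,2)
)
ENGINE = MergeTree
ORDER BY (time, readid);

-- Hypothetical materialized view that moves consumed rows into 'readings'
CREATE MATERIALIZED VIEW readings_mv TO readings AS
SELECT readid, time, temperature
FROM readings_queue;

-- Query the storage table, not the Kafka engine table
SELECT * FROM readings ORDER BY time LIMIT 10;
```

The warning in the log mentions "not enough partitions?", so it may also be worth lowering kafka_num_consumers to match the partition count of the topic. That is a guess from the message text, not something confirmed here.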

KSQL: stream select query is not populating the data

A SELECT query against a stream created on a topic in KSQL returns no results, even though the topic holds messages.
Step 1: producer command:
root#masbidw1.usa.corp.ad:/usr/hdp/3.0.1.0-187/confluentclient/confluent-5.2.2> bin/kafka-console-producer --broker-list masbidw1.usa.corp.ad:9092 --topic emptable3
>TEST1,TEST2,TEST3
Step 2: print command in the KSQL terminal:
ksql> print 'emptable3' from beginning;
Format:STRING
8/16/19 10:57:45 AM EDT , NULL , TEST1,TEST2,TEST3
8/16/19 11:00:09 AM EDT , NULL , TEST1,TEST2,TEST3
8/16/19 11:00:16 AM EDT , NULL , TEST1,TEST2,TEST3
8/16/19 2:17:36 PM EDT , NULL , TEST1,TEST2,TEST3
The console consumer returns nothing without a partition number:
root#masbidw1.usa.corp.ad:/usr/hdp/3.0.1.0-187/confluentclient/confluent-5.2.2> bin/kafka-console-consumer --bootstrap-server masbidw1.usa.corp.ad:9092 --topic emptable3 --from-beginning
So I added the partition number to the consumer command, and it returns the messages:
root#masbidw1.usa.corp.ad:/usr/hdp/3.0.1.0-187/confluentclient/confluent-5.2.2> bin/kafka-console-consumer --bootstrap-server masbidw1.usa.corp.ad:9092 --topic emptable3 --from-beginning --partition 0
TEST1,TEST2,TEST3
TEST1,TEST2,TEST3
TEST1,TEST2,TEST3
TEST1,TEST2,TEST3
Setting the earliest property in the KSQL terminal:
'auto.offset.reset'='earliest'
Creating the KSQL stream:
CREATE STREAM emptable3_stream (column1 STRING, column2 STRING, column3 STRING) WITH (KAFKA_TOPIC='emptable3', VALUE_FORMAT='DELIMITED');
Selecting from the KSQL stream:
ksql> select * from EMPTABLE3_STREAM;
Press CTRL-C to interrupt
But there is no response. Even new data is not returned by the query on the stream.
Describe extended output:
ksql> describe extended EMPTABLE3_STREAM;
Name : EMPTABLE3_STREAM
Type : STREAM
Key field :
Key format : STRING
Timestamp field : Not set - using <ROWTIME>
Value format : DELIMITED
Kafka topic : emptable3 (partitions: 1, replication: 1)
Field | Type
-------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
COLUMN1 | VARCHAR(STRING)
COLUMN2 | VARCHAR(STRING)
COLUMN3 | VARCHAR(STRING)
-------------------------------------
Local runtime statistics
------------------------
(Statistics of the local KSQL server interaction with the Kafka topic emptable3)
