NoOffsetForPartitionException when using Flink SQL with Kafka-backed table - apache-kafka

This is my table:
CREATE TABLE orders (
`id` STRING,
`currency_code` STRING,
`total` DECIMAL(10,2),
`order_time` TIMESTAMP(3),
WATERMARK FOR `order_time` AS order_time - INTERVAL '30' SECONDS
) WITH (
'connector' = 'kafka',
'topic' = 'orders',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'groupgroup',
'value.format' = 'json'
);
I can insert into the table:
INSERT into orders
VALUES ('001', 'EURO', 9.10, TO_TIMESTAMP('2022-01-12 12:50:00', 'yyyy-MM-dd HH:mm:ss'));
And I can have verified that the data is there:
$ kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic orders --from-beginning
{"id":"001","currency_code":"EURO","total":9.1,"order_time":"2022-01-12 12:50:00"}
But when I try to query the table I get an error:
Flink SQL> select * from orders;
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.NoOffsetForPartitionException: Undefined offset with no reset policy for partitions: [orders-0]

The default value of scan.startup.mode is group-offsets, and there are no committed group offsets. You can fix this by setting scan.startup.mode to another value, e.g.,
'scan.startup.mode' = 'earliest-offset',
in the table definition, or by arranging to commit the offsets before making a query.
Flink will commit the offsets as part of checkpointing, but if you are using something like sql-client.sh or Zeppelin to experiment with Flink SQL interactively, setting scan.startup.mode to earliest-offset is a good solution.

Related

Flink SQL not rolling iceberg files to hdfs while flink sql streaming job running

I am working in project using flink and iceberg to write data from kafka to iceberg hive table or hdfs using hadoop catalog when i publish message to kafka i can see message in kafka table but there is no file added in hdfs or row added in hive table
files added only if i canceled job i can see files in hdfs. what is reason ? why i can not see one file per kafka message in hdfs ? also i can not see data in sink table(flink_iceberg_tbl3).
************************* code i used below ********************************
CREATE CATALOG hadoop_iceberg WITH (
'type'='iceberg',
'catalog-type'='hadoop',
'warehouse'='hdfs://localhost:9000/flink_iceberg_warehouse'
);
create table hadoop_iceberg.iceberg_db.flink_iceberg_tbl3
(id int
,name string
,age int
,loc string
) partitioned by(loc);
create table kafka_input_table(
id int,
name varchar,
age int,
loc varchar
) with (
'connector' = 'kafka',
'topic' = 'test_topic',
'properties.bootstrap.servers'='localhost:9092',
'scan.startup.mode'='latest-offset',
'properties.group.id' = 'my-group-id',
'format' = 'csv'
);
ALTER TABLE kafka_input_table SET ('scan.startup.mode'='earliest-offset');
set table.dynamic-table-options.enabled = true
insert into hadoop_iceberg.iceberg_db.flink_iceberg_tbl3 select id,name,age,loc from kafka_input_table
select * from hadoop_iceberg.iceberg_db.flink_iceberg_tbl3 /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s')*/
i tried with many blogs but same issue

Apache Flink : Handle bad avro records in confluent-avro from Kafka

I've created a table using Flink's table APIs.
CREATE TABLE recommendations (
...
) WITH (
'connector' = 'kafka',
'topic' = 'my_kafka_topic',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'testGroup',
'properties.security.protocol' = 'SASL_PLAINTEXT',
'properties.sasl.kerberos.service.name' = 'kafka',
'scan.startup.mode' = 'latest-offset',
'value.format' = 'avro-confluent',
'value.avro-confluent.url' = 'http://schema-registry-address',
'value.fields-include' = 'EXCEPT_KEY'
);
When running the SQL to view the records, I'm getting:
Flink SQL> select * from default_catalog.default_database.recommendations ;
[ERROR] Could not execute SQL statement. Reason:
java.lang.ArrayIndexOutOfBoundsException: -25
Flink SQL> select * from default_catalog.default_database.recommendations ;
[ERROR] Could not execute SQL statement. Reason:
java.io.IOException: Failed to deserialize Avro record.
I'm aware there are some BAD avro records being pushed into the Kafka topic. In JSON format, there's an option to skip/filter these records by setting
'json.ignore-parse-errors' = 'true'. Is there any way we can skip these records when reading from confluent-avro format?
It's not ideal but unfortunately, I can't control what's being pushed to Kafka despite having a schema registry.
There's currently no such option for AVRO. There's an open ticket for it at https://issues.apache.org/jira/browse/FLINK-20091

Delete in Apache Hudi - Glue Job

I have to build a Glue Job for updating and deleting old rows in Athena table.
When I run my job for deleting it returns an error:
AnalysisException: 'Unable to infer schema for Parquet. It must be specified manually.;'
My Glue Job:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table", transformation_ctx = "datasource0")
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table_output", transformation_ctx = "datasource1")
datasource0.toDF().createOrReplaceTempView("view_dyf")
datasource1.toDF().createOrReplaceTempView("view_dyf_output")
ds = spark.sql("SELECT * FROM view_dyf_output where id in (select id from view_dyf where op like 'D')")
hudi_delete_options = {
'hoodie.table.name': 'test_table_output',
'hoodie.datasource.write.recordkey.field': 'id',
'hoodie.datasource.write.table.name': 'test_table_output',
'hoodie.datasource.write.operation': 'delete',
'hoodie.datasource.write.precombine.field': 'name',
'hoodie.upsert.shuffle.parallelism': 1,
'hoodie.insert.shuffle.parallelism': 1
}
from pyspark.sql.functions import lit
deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
df = spark.sparkContext.parallelize(deletes).toDF(['id']).withColumn('name', lit(0.0))
df.write.format("hudi"). \
options(**hudi_delete_options). \
mode("append"). \
save('s3://data/test-output/')
roAfterDeleteViewDF = spark. \
read. \
format("hudi"). \
load("s3://data/test-output/")
roAfterDeleteViewDF.registerTempTable("test_table_output")
spark.sql("SELECT * FROM view_dyf_output where id in (select distinct id from view_dyf where op like 'D')").count()
I have 2 data sources; first old Athena table where data has to updated or deleted, and the second table in which are coming new updated or deleted data.
In ds I have selected all rows that have to be deleted in old table.
op is for operation; 'D' for delete, 'U' for update.
Does anyone know what am I missing here?
The value for hoodie.datasource.write.operation is invalid in your code, the supported write operations are: UPSERT/Insert/Bulk_insert. check Hudi Doc.
Also what is your intention for deleting records: hard delete or soft ?
For Hard delete, you have to provide
{'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload}

How should I connect clickhouse to Kafka?

CREATE TABLE readings_queue
(
`readid` Int32,
`time` DateTime,
`temperature` Decimal(5,2)
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'serverIP:9092',
kafka_topic_list = 'newtest',
kafka_format = 'CSV',
kafka_group_name = 'clickhouse_consumer_group',
kafka_num_consumers = 3
Above is the code that I build up connection of Kafka and Clickhouse. But after I execute the code, only the table is created, no data is retrieved.
~/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic newtest --from-beginning
1,"2020-05-16 23:55:44",14.2
2,"2020-05-16 23:55:45",20.1
3,"2020-05-16 23:55:51",12.9
Is there something wrong with my query in clickhouse?
When I checked the log, I found below warning.
2021.05.10 10:19:50.534441 [ 1534 ] {} <Warning> (readings_queue): Can't get assignment. It can be caused by some issue with consumer group (not enough partitions?). Will keep trying.
Here is an example to work with:
-- Actual table to store the data fetched from an Apache Kafka topic
CREATE TABLE data(
a DateTime,
b Int64,
c Sting
) Engine=MergeTree ORDER BY (a,b);
-- Materialized View to insert any consumed data by Kafka Engine to 'data' table
CREATE MATERIALIZED VIEW data_mv TO data AS
SELECT
a,
b,
c
FROM data_kafka;
-- Kafka Engine which consumes the data from 'data_topic' of Apache Kafka
CREATE TABLE data_kafka(
a DateTime,
b Int64,
c Sting
) ENGINE = Kafka
SETTINGS kafka_broker_list = '10.1.2.3:9092,10.1.2.4:9092,
10.1.2.5:9092',
kafka_topic_list = 'data_topic'
kafka_num_consumers = 1,
kafka_group_name = 'ch-result5',
kafka_format = 'JSONEachRow';

KSQL: stream select query is not populating the data

Select query against created stream on topic in KSQL , is not giving the result where as respective topics are holing the message .
Step1: producer command :
root#masbidw1.usa.corp.ad:/usr/hdp/3.0.1.0-187/confluentclient/confluent-5.2.2> bin/kafka-console-producer --broker-list masbidw1.usa.corp.ad:9092 --topic emptable3
>TEST1,TEST2,TEST3
Step-2:Print command in KSQL terminal :
ksql> print 'emptable3' from beginning;
Format:STRING
8/16/19 10:57:45 AM EDT , NULL , TEST1,TEST2,TEST3
8/16/19 11:00:09 AM EDT , NULL , TEST1,TEST2,TEST3
8/16/19 11:00:16 AM EDT , NULL , TEST1,TEST2,TEST3
8/16/19 2:17:36 PM EDT , NULL , TEST1,TEST2,TEST3
Consumer is not giving the result without partition number :
root#masbidw1.usa.corp.ad:/usr/hdp/3.0.1.0-187/confluentclient/confluent-5.2.2> bin/kafka-console-consumer --bootstrap-server masbidw1.usa.corp.ad:9092 --topic emptable3 --from-beginning
So added partition number in consumer command and it is giving the message out .
root#masbidw1.usa.corp.ad:/usr/hdp/3.0.1.0-187/confluentclient/confluent-5.2.2> bin/kafka-console-consumer --bootstrap-server masbidw1.usa.corp.ad:9092 --topic emptable3 --from-beginning --partition 0
TEST1,TEST2,TEST3
TEST1,TEST2,TEST3
TEST1,TEST2,TEST3
TEST1,TEST2,TEST3
Setting up the earliest property in KSQL terminal
'auto.offset.reset'='earliest'
creating the KSQL stream:
CREATE STREAM emptable2_stream (column1 STRING, column2 STRING, column3 STRING) WITH (KAFKA_TOPIC='emptable3', VALUE_FORMAT='DELIMITED');
selecting the value from KSQL stream:
ksql> select * from EMPTABLE3_STREAM;
Press CTRL-C to interrupt
But No response . Even new data itself is not getting queried from the stream
Describe extended
ksql> describe extended EMPTABLE3_STREAM;
Name : EMPTABLE3_STREAM
Type : STREAM
Key field :
Key format : STRING
Timestamp field : Not set - using <ROWTIME>
Value format : DELIMITED
Kafka topic : emptable3 (partitions: 1, replication: 1)
Field | Type
-------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
COLUMN1 | VARCHAR(STRING)
COLUMN2 | VARCHAR(STRING)
COLUMN3 | VARCHAR(STRING)
-------------------------------------
Local runtime statistics
------------------------
(Statistics of the local KSQL server interaction with the Kafka topic emptable3)