Cannot consume Confluent Kafka data using Flink SQL.
Flink version: 1.12-csadh1.3.0.0
Cluster: Cloudera (CDP)
Kafka: Confluent Kafka
SQL:
CREATE TABLE consumer_session_created (
***
) WITH (
'connector' = 'kafka',
'topic' = '***',
'scan.startup.mode' = 'earliest-offset',
'properties.bootstrap.servers' = '***:9092',
'properties.group.id' = '***',
'properties.security.protocol' = 'SASL_SSL',
'properties.sasl.mechanism' = 'PLAIN',
'properties.sasl.jaas.config' = 'org.apache.kafka.common.security.plain.PlainLoginModule required username="***" password="***"',
'properties.avro-confluent.basic-auth.credentials-source' = '***',
'properties.avro-confluent.basic-auth.user-info' = '***',
'value.format' = 'avro-confluent',
'value.fields-include' = 'EXCEPT_KEY',
'value.avro-confluent.schema-registry.url' = 'https://***',
'value.avro-confluent.schema-registry.subject' = '***'
)
Error Msg:
java.io.IOException: Failed to deserialize Avro record.
at org.apache.flink.formats.avro.AvroRowDataDeserializationSchema.deserialize(AvroRowDataDeserializationSchema.java:101)
at org.apache.flink.formats.avro.AvroRowDataDeserializationSchema.deserialize(AvroRowDataDeserializationSchema.java:44)
at org.apache.flink.api.common.serialization.DeserializationSchema.deserialize(DeserializationSchema.java:82)
at org.apache.flink.streaming.connectors.kafka.table.DynamicKafkaDeserializationSchema.deserialize(DynamicKafkaDeserializationSchema.java:113)
at org.apache.flink.streaming.connectors.kafka.internals.KafkaFetcher.partitionConsumerRecordsHandler(KafkaFetcher.java:179)
at org.apache.flink.streaming.connectors.kafka.internals.KafkaFetcher.runFetchLoop(KafkaFetcher.java:142)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:826)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:66)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:241)
Caused by: java.io.IOException: Could not find schema with id 79 in registry
at org.apache.flink.formats.avro.registry.confluent.ConfluentSchemaRegistryCoder.readSchema(ConfluentSchemaRegistryCoder.java:77)
at org.apache.flink.formats.avro.RegistryAvroDeserializationSchema.deserialize(RegistryAvroDeserializationSchema.java:70)
at org.apache.flink.formats.avro.AvroRowDataDeserializationSchema.deserialize(AvroRowDataDeserializationSchema.java:98)
... 9 more
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Unauthorized; error code: 401
at io.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:292)
at io.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:352)
at io.confluent.kafka.schemaregistry.client.rest.RestService.getId(RestService.java:660)
at io.confluent.kafka.schemaregistry.client.rest.RestService.getId(RestService.java:642)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getSchemaByIdFromRegistry(CachedSchemaRegistryClient.java:217)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getSchemaBySubjectAndId(CachedSchemaRegistryClient.java:291)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getSchemaById(CachedSchemaRegistryClient.java:276)
at io.confluent.kafka.schemaregistry.client.SchemaRegistryClient.getById(SchemaRegistryClient.java:64)
at org.apache.flink.formats.avro.registry.confluent.ConfluentSchemaRegistryCoder.readSchema(ConfluentSchemaRegistryCoder.java:74)
... 11 more
I thought I had used the avro-confluent.basic-auth.* parameters in the wrong way, according to the Flink docs here, so I removed the properties. prefix:
WITH (
'connector' = 'kafka',
***
'avro-confluent.basic-auth.credentials-source' = '***',
'avro-confluent.basic-auth.user-info' = '***',
***
)
However, another exception was raised:
org.apache.flink.table.api.ValidationException: Unsupported options found for connector 'kafka'.
Unsupported options:
avro-confluent.basic-auth.credentials-source
avro-confluent.basic-auth.user-info
Note:
we can consume and deserialize the Kafka data correctly using the DataStream API with the same parameters, and this topic has been consumed by others for a long time.
I'm not sure about the Flink build packaged into the Cloudera distribution, but in plain (vanilla) Flink I connect to Kafka using SQL:
CREATE TABLE my_flink_table (
event_date AS TO_TIMESTAMP(eventtime_string_field, 'yyyyMMddHHmmssX')
,field1
,field2
...
,WATERMARK FOR event_date AS event_date - INTERVAL '10' MINUTE
) WITH (
'connector' = 'kafka',
'topic' = 'my_events_topic',
'scan.startup.mode' = 'earliest-offset',
'format' = 'avro-confluent',
'avro-confluent.schema-registry.url' = 'http://my_confluent_schema_reg_host:8081/',
'properties.group.id' = 'flink-test-001',
'properties.bootstrap.servers' = 'kafka_host01:9092'
);
I'm using Flink SQL to read Debezium Avro data from Kafka and store it as Parquet files in S3. Here is my code:
import os
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import TableConfig, StreamTableEnvironment
exec_env = StreamExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
# start a checkpoint every 12 s
exec_env.enable_checkpointing(12000)
t_config = TableConfig()
t_env = StreamTableEnvironment.create(exec_env, t_config)
INPUT_TABLE = 'source'
KAFKA_TOPIC = os.environ['KAFKA_TOPIC']
KAFKA_BOOTSTRAP_SERVER = os.environ['KAFKA_BOOTSTRAP_SERVER']
OUTPUT_TABLE = 'sink'
S3_BUCKET = os.environ['S3_BUCKET']
OUTPUT_S3_LOCATION = os.environ['OUTPUT_S3_LOCATION']
ddl_source = f"""
CREATE TABLE {INPUT_TABLE} (
`event_time` TIMESTAMP(3) METADATA FROM 'timestamp' VIRTUAL,
`id` BIGINT,
`price` DOUBLE,
`type` INT,
`is_reinvite` INT
) WITH (
'connector' = 'kafka',
'topic' = '{KAFKA_TOPIC}',
'properties.bootstrap.servers' = '{KAFKA_BOOTSTRAP_SERVER}',
'scan.startup.mode' = 'earliest-offset',
'format' = 'debezium-avro-confluent',
'debezium-avro-confluent.schema-registry.url' = 'http://kafka-production-schema-registry:8081'
)
"""
ddl_sink = f"""
CREATE TABLE {OUTPUT_TABLE} (
`event_time` TIMESTAMP,
`id` BIGINT,
`price` DOUBLE,
`type` INT,
`is_reinvite` INT
) WITH (
'connector' = 'filesystem',
'path' = 's3://{S3_BUCKET}/{OUTPUT_S3_LOCATION}',
'format' = 'parquet'
)
"""
t_env.execute_sql(ddl_source)
t_env.execute_sql(ddl_sink)
t_env.execute_sql(f"""
INSERT INTO {OUTPUT_TABLE}
SELECT *
FROM {INPUT_TABLE}
""")
When I submit the job, I get the following error message,
pyflink.util.exceptions.TableException: Table sink 'default_catalog.default_database.sink' doesn't support consuming update and delete changes which is produced by node TableSourceScan(table=[[default_catalog, default_database, source]], fields=[id, price, type, is_reinvite, timestamp])
I'm using Flink 1.12.1. The source is working properly and I have tested it using a 'print' connector in the sink. Here is a sample data set extracted from the task manager logs when using the 'print' connector as the table sink:
-D(2021-02-20T17:07:27.298,14091764,26.0,9,0)
-D(2021-02-20T17:07:27.298,14099765,26.0,9,0)
-D(2021-02-20T17:07:27.299,14189806,16.0,9,0)
-D(2021-02-20T17:07:27.299,14189838,37.0,9,0)
-D(2021-02-20T17:07:27.299,14089840,26.0,9,0)
-D(2021-02-20T17:07:27.299,14089847,26.0,9,0)
-D(2021-02-20T17:07:27.300,14189859,26.0,9,0)
-D(2021-02-20T17:07:27.301,14091808,37.0,9,0)
-D(2021-02-20T17:07:27.301,14089911,37.0,9,0)
-D(2021-02-20T17:07:27.301,14099937,26.0,9,0)
-D(2021-02-20T17:07:27.302,14091851,37.0,9,0)
How can I make my table sink work with the filesystem connector?
What happens is that:
when receiving the Debezium records, Flink updates a logical table by adding, updating and removing rows based on their primary key.
the only sinks that can handle that kind of information are those that have a concept of update by key. JDBC would be a typical example: there it is straightforward for Flink to translate "a Flink row with key foo has been updated to bar" into "the JDBC row with key foo should be updated to value bar" (a sketch of such a sink follows below). The filesystem sink does not support that kind of operation, since files are append-only.
See also the Flink documentation on append and update queries.
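For illustration, a minimal sketch of such an update-capable sink declaration, assuming Flink's JDBC connector (the database, table and column names below are made up); the declared primary key is what lets Flink apply updates and deletes by key:
CREATE TABLE jdbc_sink (
  `id` BIGINT,
  `price` DOUBLE,
  PRIMARY KEY (`id`) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:postgresql://localhost:5432/mydb',
  'table-name' = 'prices'
);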
In practice, in order to do the conversion, we first have to decide what exactly we want to have in this append-only file.
If what we want is the latest version of each item any time an id is updated, then to my knowledge the way to go is to convert the table to a stream first and then output that with a FileSink. Note that in that case the result contains a boolean indicating whether each row is an update or a deletion, and we have to decide how we want this information to be visible in the resulting file.
Note: I used this other CDC example from the Flink SQL cookbook to reproduce a similar setup:
// assuming a Flink retract table of claims built from a CDC stream:
tableEnv.executeSql("" +
" CREATE TABLE accident_claims (\n" +
" claim_id INT,\n" +
" claim_total FLOAT,\n" +
" claim_total_receipt VARCHAR(50),\n" +
" claim_currency VARCHAR(3),\n" +
" member_id INT,\n" +
" accident_date VARCHAR(20),\n" +
" accident_type VARCHAR(20),\n" +
" accident_detail VARCHAR(20),\n" +
" claim_date VARCHAR(20),\n" +
" claim_status VARCHAR(10),\n" +
" ts_created VARCHAR(20),\n" +
" ts_updated VARCHAR(20)" +
") WITH (\n" +
" 'connector' = 'postgres-cdc',\n" +
" 'hostname' = 'localhost',\n" +
" 'port' = '5432',\n" +
" 'username' = 'postgres',\n" +
" 'password' = 'postgres',\n" +
" 'database-name' = 'postgres',\n" +
" 'schema-name' = 'claims',\n" +
" 'table-name' = 'accident_claims'\n" +
" )"
);
// convert it to a stream
Table accidentClaims = tableEnv.from("accident_claims");
DataStream<Tuple2<Boolean, Row>> accidentClaimsStream = tableEnv
.toRetractStream(accidentClaims, Row.class);
// and write to file
final FileSink<Tuple2<Boolean, Row>> sink = FileSink
// TODO: adapt the output format here:
.forRowFormat(new Path("/tmp/flink-demo"),
(Encoder<Tuple2<Boolean, Row>>) (element, stream) -> stream.write((element.toString() + "\n").getBytes(StandardCharsets.UTF_8)))
.build();
accidentClaimsStream.sinkTo(sink);
streamEnv.execute();
Note that during the conversion, you obtain a boolean telling you whether that row is a new value for that accident claim, or a deletion of such a claim. My basic FileSink config there simply includes that boolean in the output; how to handle deletions is to be decided case by case (one option is sketched after the sample output below).
The result in the file then looks like this:
head /tmp/flink-demo/2021-03-09--09/.part-c7cdb74e-893c-4b0e-8f69-1e8f02505199-0.inprogress.f0f7263e-ec24-4474-b953-4d8ef4641998
(true,1,4153.92,null,AUD,412,2020-06-18 18:49:19,Permanent Injury,Saltwater Crocodile,2020-06-06 03:42:25,IN REVIEW,2021-03-09 06:39:28,2021-03-09 06:39:28)
(true,2,8940.53,IpsumPrimis.tiff,AUD,323,2019-03-18 15:48:16,Collision,Blue Ringed Octopus,2020-05-26 14:59:19,IN REVIEW,2021-03-09 06:39:28,2021-03-09 06:39:28)
(true,3,9406.86,null,USD,39,2019-04-28 21:15:09,Death,Great White Shark,2020-03-06 11:20:54,INITIAL,2021-03-09 06:39:28,2021-03-09 06:39:28)
(true,4,3997.9,null,AUD,315,2019-10-26 21:24:04,Permanent Injury,Saltwater Crocodile,2020-06-25 20:43:32,IN REVIEW,2021-03-09 06:39:28,2021-03-09 06:39:28)
(true,5,2647.35,null,AUD,74,2019-12-07 04:21:37,Light Injury,Cassowary,2020-07-30 10:28:53,REIMBURSED,2021-03-09 06:39:28,2021-03-09 06:39:28)
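If deletions should simply not appear in the file at all, one option (a sketch that keeps the Tuple2 type, so no extra type information is needed) is to filter the retract stream before sinking it:
// drop retractions (f0 == false) and keep only inserts/updates
DataStream<Tuple2<Boolean, Row>> upsertsOnly =
        accidentClaimsStream.filter(t -> t.f0);
upsertsOnly.sinkTo(sink);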
I am trying to create a table in the Apache Flink SQL client. I want to filter my JSON data in Flink, which arrives continuously from a Kafka cluster.
The JSON looks like this:
{"lat":25.77,"lon":-80.19,"timezone":"America\/New_York",
"timezone_offset":-14400,
"current.dt":1592151550,
"current.sunrise":1592130546,
"current.sunset":1592179999,
"current.temp":302.77,
"current.feels_like":306.9,
"current.pressure":1017,
"current.humidity":78,
"current.dew_point":298.52,
"current.uvi":11.97,
"current.clouds":75,
"current.visibility":16093,
"current.wind_speed":3.6,
"current.wind_deg":60,
"current.weather.0.id":803,
"current.weather.0.main":"Clouds",
"current.weather.0.description":"broken clouds",
"current.weather.0.icon":"04d"}
The part I am interested in:
"current.weather.0.description":"broken clouds"
I want to filter my data whenever the current.weather description is "moderate rain". I tried to create two tables in Flink:
the Rain table, where the whole JSON arrives, and
the ProcessedRain table, where my filtered data will be stored and sent back to another Kafka cluster.
CREATE TABLE Rain (current.weather.0.description varchar) WITH ('connector.type' = 'kafka',
'connector.version' = 'universal',
'connector.topic' = 'WeatherRawData',
'format.type' = 'json',
'connector.properties.0.key' = 'bootstrap.servers',
'connector.properties.0.value' = 'kafka:9092',
'connector.properties.1.key' = 'group.id',
'connector.properties.1.value' = 'flink-input-group',
'connector.startup-mode' = 'earliest-offset'
);
CREATE TABLE ProcessedRain(
current.weather.0.description varchar
) WITH (
'connector.type' = 'kafka',
'connector.version' = 'universal',
'connector.topic' = 'WeatherProcessedData',
'format.type' = 'json',
'connector.properties.0.key' = 'bootstrap.servers',
'connector.properties.0.value' = 'kafka:9092',
'connector.properties.1.key' = 'group.id',
'connector.properties.1.value' = 'flink-output-group'
);
The error message I get:
[ERROR] Could not execute SQL statement. Reason: org.apache.flink.table.api.SqlParserException: SQL parse failed. Encountered "current" at line 1, column 20. Was expecting one of:
"PRIMARY" ...
"UNIQUE" ...
"WATERMARK" ...
<BRACKET_QUOTED_IDENTIFIER> ...
<QUOTED_IDENTIFIER> ...
<BACK_QUOTED_IDENTIFIER> ...
<IDENTIFIER> ...
<UNICODE_QUOTED_IDENTIFIER> ...
What is the correct way to write my CREATE TABLE statement?
I think it should be
CREATE TABLE ProcessedRain (
`current.weather.0.description` VARCHAR
) WITH (
'connector.type' = 'kafka',
'connector.version' = 'universal',
'connector.topic' = 'WeatherProcessedData',
'format.type' = 'json',
'connector.properties.bootstrap.servers' = 'kafka:9092',
'connector.properties.group.id' = 'flink-output-group'
);
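With the column name quoted like that, the filtering itself can be a plain INSERT ... SELECT; a sketch, assuming the Rain table is declared with the same backticked column:
INSERT INTO ProcessedRain
SELECT `current.weather.0.description`
FROM Rain
WHERE `current.weather.0.description` = 'moderate rain';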
I'm using Scrapy (which is built on Twisted) and Postgres as a database.
After a while my connections seem to fill up and then my script gets stuck. I checked this with the query SELECT * FROM pg_stat_activity; and read that it's caused by Postgres having no connection pool of its own.
I read about txpostgres and PgBouncer; PgBouncer regrettably isn't an option. What else can I do to avoid this problem?
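To see where the connections actually go, grouping pg_stat_activity by state can help, for example:
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;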
So far I use the following pipeline:
import psycopg2
from twisted.enterprise import adbapi
import logging
from datetime import datetime
import scrapy
from scrapy.exceptions import DropItem
class PostgreSQLPipeline(object):
    """ PostgreSQL pipeline class """

    def __init__(self, dbpool):
        self.logger = logging.getLogger(__name__)
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbargs = dict(
            host=settings['POSTGRESQL_HOST'],
            database=settings['POSTGRESQL_DATABASE'],
            user=settings['POSTGRESQL_USER'],
            password=settings['POSTGRESQL_PASSWORD'],
        )
        dbpool = adbapi.ConnectionPool('psycopg2', **dbargs)
        return cls(dbpool)

    def process_item(self, item, spider):
        d = self.dbpool.runInteraction(self._insert_item, item, spider)
        d.addErrback(self._handle_error, item, spider)
        d.addBoth(lambda _: item)
        return d

    def _insert_item(self, txn, item, spider):
        """Perform an insert or update."""
        now = datetime.utcnow().replace(microsecond=0).isoformat(' ')
        txn.execute(
            """
            SELECT EXISTS(
                SELECT 1
                FROM expose
                WHERE expose_id = %s
            )
            """, (
                item['expose_id'],
            )
        )
        ret = txn.fetchone()[0]
        if ret:
            self.logger.info("Item already in db: %r" % (item,))
            txn.execute(
                """
                UPDATE expose
                SET last_seen=%s, offline=0
                WHERE expose_id=%s
                """, (
                    now,
                    item['expose_id']
                )
            )
        else:
            self.logger.info("Item stored in db: %r" % (item,))
            txn.execute(
                """
                INSERT INTO expose (
                    expose_id,
                    title
                ) VALUES (%s, %s)
                """, (
                    item['expose_id'],
                    item['title']
                )
            )
        # Write image info (path, original url, ...) to db, constrained to expose.expose_id
        for image in item['images']:
            txn.execute(
                """
                INSERT INTO image (
                    expose_id,
                    name
                ) VALUES (%s, %s)
                """, (
                    item['expose_id'],
                    image['path'].replace('full/', '')
                )
            )

    def _handle_error(self, failure, item, spider):
        """Handle an error raised during the db interaction."""
        # do nothing, just log the failure and its traceback
        self.logger.error(failure)
        failure.printTraceback()
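From the Twisted adbapi docs, ConnectionPool also accepts cp_min/cp_max to bound how many connections it opens, and cp_reconnect to replace dropped ones; a sketch of a bounded pool (the limits here are arbitrary):
dbpool = adbapi.ConnectionPool(
    'psycopg2',
    cp_min=1,           # keep at least one connection open
    cp_max=5,           # never open more than five connections
    cp_reconnect=True,  # transparently replace dropped connections
    **dbargs,
)
Capping cp_max at least keeps the connection count from this process bounded, though it doesn't replace a real external pooler.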
Would it be possible to convert a PostgreSQL database (including data) to a SQLite database with SQLAlchemy?
I tried the code below. It looks like it works.
What do you think about it? Could this be an answer?
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sqlalchemy as sa
import sqlalchemy.ext.declarative as sad
import sqlalchemy.orm as sao
import sqlalchemy.orm.session as sas
from sqlalchemy_utils import create_database
_Base = sad.declarative_base()

class Child(_Base):
    __tablename__ = 'Child'
    _oid = sa.Column('oid', sa.Integer, primary_key=True)
    _name = sa.Column('name', sa.String)

    def __init__(self, name):
        self._name = name

class Parent(_Base):
    __tablename__ = 'Parent'
    _oid = sa.Column('oid', sa.Integer, primary_key=True)
    _name = sa.Column('name', sa.String)
    _child_fk = sa.Column('child', sa.Integer, sa.ForeignKey('Child.oid'))
    _child = sao.relationship('Child')

    def __init__(self, name):
        super(Parent, self).__init__()
        self._name = name

pstr = 'postgresql://postgres@localhost/Family'
sstr = 'sqlite:///family.db'

pengine = sa.create_engine(pstr, echo=True)
sengine = sa.create_engine(sstr, echo=True)

def createPostgreSQL_Family():
    """Create for PostgreSQL the schema and the data for testing."""
    # create schema
    create_database(pengine.url)
    _Base.metadata.create_all(pengine)
    psession = sao.sessionmaker(bind=pengine)()
    # child
    c = Child('Jim Bob')
    psession.add(c)
    psession.commit()
    # parent
    p = Parent('Mr. Doe')
    p._child = c
    psession.add(p)
    psession.commit()
    psession.close()

def convert():
    # get one object from the PostgreSQL database
    psession = sao.sessionmaker(bind=pengine)()
    p = psession.query(Parent).first()
    sas.make_transient(p)
    #p._oid = None
    c = psession.query(Child).first()
    sas.make_transient(c)
    #c._oid = None
    psession.close()
    # create and open the SQLite database
    create_database(sengine.url)
    _Base.metadata.create_all(sengine)
    # add/convert the one object to the new database
    ssession = sao.sessionmaker(bind=sengine)()
    ssession.add(c)
    ssession.add(p)
    ssession.commit()

if __name__ == '__main__':
    #createPostgreSQL_Family()
    convert()
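A possibly more general variant (a sketch, assuming SQLAlchemy 1.4+ and reusing pengine/sengine from above): skip the ORM and make_transient entirely, reflect the schema with Core, and bulk-copy every table in dependency order:
def copy_all_tables():
    meta = sa.MetaData()
    meta.reflect(bind=pengine)       # read the table definitions from PostgreSQL
    meta.create_all(bind=sengine)    # recreate them in SQLite
    with pengine.connect() as src, sengine.begin() as dst:
        for table in meta.sorted_tables:  # foreign-key parents first
            rows = [dict(row._mapping) for row in src.execute(table.select())]
            if rows:
                dst.execute(table.insert(), rows)
This way new tables and columns are picked up automatically, at the cost of loading each table fully into memory.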