Failed to deserialize Confluent Avro record using Flink-SQL - apache-kafka

Cannot consume the Confluent Kafka data using Flink-sql.
Fink version:1.12-csadh1.3.0.0
Cluster:Cloudera(CDP)
Kafka:Confluent Kafka
SQL:
CREATE TABLE consumer_session_created (
***
) WITH (
'connector' = 'kafka',
'topic' = '***',
'scan.startup.mode' = 'earliest-offset',
'properties.bootstrap.servers' = '***:9092',
'properties.group.id' = '***',
'properties.security.protocol' = 'SASL_SSL',
'properties.sasl.mechanism' = 'PLAIN',
'properties.sasl.jaas.config' = 'org.apache.kafka.common.security.plain.PlainLoginModule required username="***" password="***"',
'properties.avro-confluent.basic-auth.credentials-source' = '***',
'properties.avro-confluent.basic-auth.user-info' = '***',
'value.format' = 'avro-confluent',
'value.fields-include' = 'EXCEPT_KEY',
'value.avro-confluent.schema-registry.url' = 'https://***',
'value.avro-confluent.schema-registry.subject' = '***'
)
Error Msg:
java.io.IOException: Failed to deserialize Avro record.
at org.apache.flink.formats.avro.AvroRowDataDeserializationSchema.deserialize(AvroRowDataDeserializationSchema.java:101)
at org.apache.flink.formats.avro.AvroRowDataDeserializationSchema.deserialize(AvroRowDataDeserializationSchema.java:44)
at org.apache.flink.api.common.serialization.DeserializationSchema.deserialize(DeserializationSchema.java:82)
at org.apache.flink.streaming.connectors.kafka.table.DynamicKafkaDeserializationSchema.deserialize(DynamicKafkaDeserializationSchema.java:113)
at org.apache.flink.streaming.connectors.kafka.internals.KafkaFetcher.partitionConsumerRecordsHandler(KafkaFetcher.java:179)
at org.apache.flink.streaming.connectors.kafka.internals.KafkaFetcher.runFetchLoop(KafkaFetcher.java:142)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:826)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:66)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:241)
Caused by: java.io.IOException: Could not find schema with id 79 in registry
at org.apache.flink.formats.avro.registry.confluent.ConfluentSchemaRegistryCoder.readSchema(ConfluentSchemaRegistryCoder.java:77)
at org.apache.flink.formats.avro.RegistryAvroDeserializationSchema.deserialize(RegistryAvroDeserializationSchema.java:70)
at org.apache.flink.formats.avro.AvroRowDataDeserializationSchema.deserialize(AvroRowDataDeserializationSchema.java:98)
... 9 more
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Unauthorized; error code: 401
at io.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:292)
at io.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:352)
at io.confluent.kafka.schemaregistry.client.rest.RestService.getId(RestService.java:660)
at io.confluent.kafka.schemaregistry.client.rest.RestService.getId(RestService.java:642)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getSchemaByIdFromRegistry(CachedSchemaRegistryClient.java:217)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getSchemaBySubjectAndId(CachedSchemaRegistryClient.java:291)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getSchemaById(CachedSchemaRegistryClient.java:276)
at io.confluent.kafka.schemaregistry.client.SchemaRegistryClient.getById(SchemaRegistryClient.java:64)
at org.apache.flink.formats.avro.registry.confluent.ConfluentSchemaRegistryCoder.readSchema(ConfluentSchemaRegistryCoder.java:74)
... 11 more
I thought I used the parameters avro-confluent.basic-auth.* in the wrong way, according to the flink Doc here. So I removed the prefix properties.:
WITH (
'connector',
***
'avro-confluent.basic-auth.credentials-source' = '***',
'avro-confluent.basic-auth.user-info' = '***',
***
)
However, another exception raised:
org.apache.flink.table.api.ValidationException: Unsupported options found for connector 'kafka'.
Unsupported options:
avro-confluent.basic-auth.credentials-source
avro-confluent.basic-auth.user-info
Tips:
we can consume/de-ser the kafka data correctly using DataStream API with same parameters, and this topic has been used for a long time by others'.

I'm not sure about Flink packed into Cloudera distributive, but in simple (original) Flink I connect to Kafka using SQL:
CREATE TABLE my_flink_table (
event_date AS TO_TIMESTAMP(eventtime_string_field, 'yyyyMMddHHmmssX')
,field1
,field2
...
,WATERMARK FOR event_date AS event_date - INTERVAL '10' MINUTE
) WITH (
'connector' = 'kafka',
'topic' = 'my_events_topic',
'scan.startup.mode' = 'earliest-offset',
'format' = 'avro-confluent',
'avro-confluent.schema-registry.url' = 'http://my_confluent_schma_reg_host:8081/',
'properties.group.id' = 'flink-test-001',
'properties.bootstrap.servers' = 'kafka_host01:9092'
);

Related

Apache flink connect to postgresql

I'm trying to connect to a postgresql with pyflink on windows and I'm using the following code:
from pyflink.table import EnvironmentSettings, TableEnvironment
env_settings = EnvironmentSettings.in_streaming_mode()
table_env = TableEnvironment.create(env_settings)
table_env.execute_sql("""
CREATE TABLE test_nifi (
codecountry VARCHAR(50),
name VARCHAR(50),
PRIMARY KEY (codecountry) NOT ENFORCED
) WITH (
'connector' = 'jdbc',
'url' = 'jdbc:postgresql://localhost:5432/TestDS',
'table-name' = 'public.test_nifi',
'username' = 'postgres',
'password' = 'postgres'
)
""")
result = table_env.from_path("test_nifi").select("codecountry, name")
print(result.to_pandas())
and I'm getting the following error:
Caused by: org.apache.flink.table.api.ValidationException: Could not find any factory for identifier 'jdbc' that implements 'org.apache.flink.table.factories.DynamicTableFactory' in the classpath.
Any idea why is this happening?
add following line:
table_env.get_config().get_configuration().set_string("pipeline.jars", "file:///C:/Users/Admin/Desktop/Flink/flink-connector-jdbc_2.12-1.14.3.jar;file:///C:/Users/Admin/Desktop/Flink/postgresql-42.3.1.jar")
Since Flink is a Java/Scala-based project, for both connectors and formats, implementations are available as jars
postgresql in pyflink relies on Java's flink-connector-jdbc implementation and you need to add this jar in stream_execution_environment
stream_execution_environment.add_jars("file:///my/jar/path/connector1.jar", "file:///my/jar/path/connector2.jar")

Pyflink 1.14 table connectors - Kafka authentication

I've only seen Pyflink table API examples of kafka connections which does not contain authentication details in the connection establishment (doc ref), namely source table connection:
source_ddl = """
CREATE TABLE source_table(
a VARCHAR,
b INT
) WITH (
'connector' = 'kafka',
'topic' = 'source_topic',
'properties.bootstrap.servers' = 'kafka:9092',
'properties.group.id' = 'test_3',
'scan.startup.mode' = 'latest-offset',
'format' = 'json'
)
"""
I however need to connect to kafka sources with authentication enabled. By 'interpreting' that all property.XXX are dedicated as kafka config, I altered the examples as follows and tested:
import os
from pyflink.datastream.stream_execution_environment import StreamExecutionEnvironment
from pyflink.table import TableEnvironment, EnvironmentSettings, environment_settings
from pyflink.table.table_environment import StreamTableEnvironment
KAFKA_SERVERS = 'localhost:9092'
KAFKA_USERNAME = "user"
KAFKA_PASSWORD = "XXX"
KAFKA_SOURCE_TOPIC = 'source'
KAFKA_SINK_TOPIC = 'dest'
def log_processing():
env = StreamExecutionEnvironment.get_execution_environment()
env.add_jars("file:///opt/flink/lib_py/kafka-clients-2.4.1.jar")
env.add_jars("file:///opt/flink/lib_py/flink-connector-kafka_2.11-1.14.0.jar")
env.add_jars("file:///opt/flink/lib_py/flink-sql-connector-kafka_2.12-1.14.0.jar")
settings = EnvironmentSettings.new_instance()\
.in_streaming_mode()\
.use_blink_planner()\
.build()
t_env = StreamTableEnvironment.create(stream_execution_environment= env, environment_settings=settings)
source_ddl = f"""
CREATE TABLE source_table(
Cylinders INT,
Displacement INT,
Horsepower INT,
Weight INT,
Acceleration INT,
Model_Year INT,
USA INT,
Europe INT,
Japan INT
) WITH (
'connector' = 'kafka',
'topic' = '{KAFKA_SOURCE_TOPIC}',
'properties.bootstrap.servers' = '{KAFKA_SERVERS}',
'properties.group.id' = 'testgroup12',
'properties.sasl.mechanism' = 'PLAIN',
'properties.security.protocol' = 'SASL_PLAINTEXT',
'properties.sasl.jaas.config' = 'org.apache.kafka.common.security.plain.PlainLoginModule required username=\"{KAFKA_USERNAME}\" password=\"{KAFKA_PASSWORD}\";',
'scan.startup.mode' = 'latest-offset',
'format' = 'json'
)
"""
sink_ddl = f"""
CREATE TABLE sink_table(
Cylinders INT,
Displacement INT,
Horsepower INT,
Weight INT,
Acceleration INT,
Model_Year INT,
USA INT,
Europe INT,
Japan INT
) WITH (
'connector' = 'kafka',
'topic' = '{KAFKA_SINK_TOPIC}',
'properties.bootstrap.servers' = '{KAFKA_SERVERS}',
'properties.group.id' = 'testgroup12',
'properties.sasl.mechanism' = 'PLAIN',
'properties.security.protocol' = 'SASL_PLAINTEXT',
'properties.sasl.jaas.config' = 'org.apache.kafka.common.security.plain.PlainLoginModule required username=\"{KAFKA_USERNAME}\" password=\"{KAFKA_PASSWORD}\";',
'scan.startup.mode' = 'latest-offset',
'format' = 'json'
)
"""
t_env.execute_sql(source_ddl)
t_env.execute_sql(sink_ddl)
t_env.sql_query("SELECT * FROM source_table").execute_insert("sink_table").wait()
t_env.execute("kafka-table")
if __name__ == '__main__':
log_processing()
By adding this job from the cli, there is no response or indication that a job in instantiated with a respective job id:
Respectively no job created when viewing flink UI
If I'm incorrectly configuring the connection, can someone please correct me, or point me to a relative source of documentation (I've googled quite a bit...)
Found the problem, as suggested by #DavidAnderson. The code from my question works as is... just required to update the dependency jars respectively. If using Scala 2.12 and flink version 1.14, the following dependencies are applicable (with the jar dependencies downloaded and available on you jobManager in the respective directory):
env.add_jars("file:///opt/flink/lib_py/kafka-clients-2.4.1.jar")
env.add_jars("file:///opt/flink/lib_py/flink-connector-kafka_2.12-1.14.0.jar")
env.add_jars("file:///opt/flink/lib_py/flink-sql-connector-kafka_2.12-1.14.0.jar")
A useful site to reference, which I found later on is https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka_2.12/1.14.0

Flink SQL Client connect to secured kafka cluster

I want to execute a query on Flink SQL Table backed by kafka topic of secured kafka cluster. I'm able to execute the query programmatically but unable to do the same through Flink SQL client. I'm not sure on how to pass JAAS config (java.security.auth.login.config) and other system properties through Flink SQL client.
Flink SQL query programmatically
private static void simpleExec_auth() {
// Create the execution environment.
final EnvironmentSettings settings = EnvironmentSettings.newInstance()
.inStreamingMode()
.withBuiltInCatalogName(
"default_catalog")
.withBuiltInDatabaseName(
"default_database")
.build();
System.setProperty("java.security.auth.login.config","client_jaas.conf");
System.setProperty("sun.security.jgss.native", "true");
System.setProperty("sun.security.jgss.lib", "/usr/libexec/libgsswrap.so");
System.setProperty("javax.security.auth.useSubjectCredsOnly","false");
TableEnvironment tableEnvironment = TableEnvironment.create(settings);
String createQuery = "CREATE TABLE test_flink11 ( " + "`keyid` STRING, " + "`id` STRING, "
+ "`name` STRING, " + "`age` INT, " + "`color` STRING, " + "`rowtime` TIMESTAMP(3) METADATA FROM 'timestamp', " + "`proctime` AS PROCTIME(), " + "`address` STRING) " + "WITH ( "
+ "'connector' = 'kafka', "
+ "'topic' = 'test_flink10', "
+ "'scan.startup.mode' = 'latest-offset', "
+ "'properties.bootstrap.servers' = 'kafka01.nyc.com:9092', "
+ "'value.format' = 'avro-confluent', "
+ "'key.format' = 'avro-confluent', "
+ "'key.fields' = 'keyid', "
+ "'value.fields-include' = 'EXCEPT_KEY', "
+ "'properties.security.protocol' = 'SASL_PLAINTEXT', 'properties.sasl.kerberos.service.name' = 'kafka', 'properties.sasl.kerberos.kinit.cmd' = '/usr/local/bin/skinit --quiet', 'properties.sasl.mechanism' = 'GSSAPI', "
+ "'key.avro-confluent.schema-registry.url' = 'http://kafka-schema-registry:5037', "
+ "'key.avro-confluent.schema-registry.subject' = 'test_flink6', "
+ "'value.avro-confluent.schema-registry.url' = 'http://kafka-schema-registry:5037', "
+ "'value.avro-confluent.schema-registry.subject' = 'test_flink4')";
System.out.println(createQuery);
tableEnvironment.executeSql(createQuery);
TableResult result = tableEnvironment
.executeSql("SELECT name,rowtime FROM test_flink11");
result.print();
}
This is working fine.
Flink SQL query through SQL client
Running this giving the following error.
Flink SQL> CREATE TABLE test_flink11 (`keyid` STRING,`id` STRING,`name` STRING,`address` STRING,`age` INT,`color` STRING) WITH('connector' = 'kafka', 'topic' = 'test_flink10','scan.startup.mode' = 'earliest-offset','properties.bootstrap.servers' = 'kafka01.nyc.com:9092','value.format' = 'avro-confluent','key.format' = 'avro-confluent','key.fields' = 'keyid', 'value.avro-confluent.schema-registry.url' = 'http://kafka-schema-registry:5037', 'value.avro-confluent.schema-registry.subject' = 'test_flink4', 'value.fields-include' = 'EXCEPT_KEY', 'key.avro-confluent.schema-registry.url' = 'http://kafka-schema-registry:5037', 'key.avro-confluent.schema-registry.subject' = 'test_flink6', 'properties.security.protocol' = 'SASL_PLAINTEXT', 'properties.sasl.kerberos.service.name' = 'kafka', 'properties.sasl.kerberos.kinit.cmd' = '/usr/local/bin/skinit --quiet', 'properties.sasl.mechanism' = 'GSSAPI');
Flink SQL> select * from test_flink11;
[ERROR] Could not execute SQL statement. Reason:
java.lang.IllegalArgumentException: Could not find a 'KafkaClient' entry in the JAAS configuration. System property 'java.security.auth.login.config' is /tmp/jaas-6309821891889949793.conf
There is nothing in /tmp/jaas-6309821891889949793.conf except the following comment
# We are using this file as an workaround for the Kafka and ZK SASL implementation
# since they explicitly look for java.security.auth.login.config property
# Please do not edit/delete this file - See FLINK-3929
SQL client run command
bin/sql-client.sh embedded --jar flink-sql-connector-kafka_2.11-1.12.0.jar --jar flink-sql-avro-confluent-registry-1.12.0.jar
Flink cluster command
bin/start-cluster.sh
How to pass this java.security.auth.login.config and other system properties (that I'm setting in the above java code snippet), for SQL client?
flink-conf.yaml
security.kerberos.login.use-ticket-cache: true
security.kerberos.login.principal: XXXXX#HADOOP.COM
security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: /path/to/kafka.keytab
security.kerberos.login.principal: XXXX#HADOOP.COM
security.kerberos.login.contexts: Client,KafkaClient
I haven't really tested whether this solution is feasible, you can try it out, hope it will help you.

Flink SQL CLI client CREATE TABLE from Kafka

I am trying to create a table in Apache Flink SQL client. I want to filter my JSON data in Flink, which arrives continously from a Kafka cluster.
The JSON looks like this:
{"lat":25.77,"lon":-80.19,"timezone":"America\/New_York",
"timezone_offset":-14400,
"current.dt":1592151550,
"current.sunrise":1592130546,
"current.sunset":1592179999,
"current.temp":302.77,
"current.feels_like":306.9,
"current.pressure":1017,
"current.humidity":78,
"current.dew_point":298.52,
"current.uvi":11.97,
"current.clouds":75,
"current.visibility":16093,
"current.wind_speed":3.6,
"current.wind_deg":60,
"current.weather.0.id":803,
"current.weather.0.main":"Clouds",
"current.weather.0.description":"broken clouds",
"current.weather.0.icon":"04d"}
The part I am interested in :
"current.weather.0.description":"broken clouds"
I want to filter my data whenever the current.weather description is "moderate rain". I tried to create two tables in Flink:
the Rain table, where the whole JSON arrives, and
where my filtered data will be stored and sent back to another Kafka cluster.
CREATE TABLE Rain (current.weather.0.description varchar) WITH ('connector.type' = 'kafka',
'connector.version' = 'universal',
'connector.topic' = 'WeatherRawData',
'format.type' = 'json',
'connector.properties.0.key' = 'bootstrap.servers',
'connector.properties.0.value' = 'kafka:9092',
'connector.properties.1.key' = 'group.id',
'connector.properties.1.value' = 'flink-input-group',
'connector.startup-mode' = 'earliest-offset'
);
CREATE TABLE ProcessedRain(
current.weather.0.description varchar
) WITH (
'connector.type' = 'kafka',
'connector.version' = 'universal',
'connector.topic' = 'WeatherProcessedData',
'format.type' = 'json',
'connector.properties.0.key' = 'bootstrap.servers',
'connector.properties.0.value' = 'kafka:9092',
'connector.properties.1.key' = 'group.id',
'connector.properties.1.value' = 'flink-output-group'
);
The error message I get :
[ERROR] Could not execute SQL statement. Reason: org.apache.flink.table.api.SqlParserException: SQL parse failed. Encountered "current" at line 1, column 20. Was expecting one of:
"PRIMARY" ...
"UNIQUE" ...
"WATERMARK" ...
<BRACKET_QUOTED_IDENTIFIER> ...
<QUOTED_IDENTIFIER> ...
<BACK_QUOTED_IDENTIFIER> ...
<IDENTIFIER> ...
<UNICODE_QUOTED_IDENTIFIER> ...
How should my CREATE TABLE be created correctly?
I think it should be
CREATE TABLE ProcessedRain (
`current.weather.0.description` VARCHAR
) WITH (
'connector.type' = 'kafka',
'connector.version' = 'universal',
'connector.topic' = 'WeatherProcessedData',
'format.type' = 'json',
'connector.properties.bootstrap.servers' = 'kafka:9092',
'connector.properties.group.id' = 'flink-output-group'
);

WSO2 SP - Kafka source with JSON attributes

I'm trying to read JSON data from Kafka, using following code:
#source(type = 'kafka', bootstrap.servers = 'localhost:9092', topic.list = 'TestTopic',
group.id = 'test', threading.option = 'single.thread', #map(type = 'json'))
define stream myDataStream (json object);
But failed with following error:
[2019-03-27_11-39-32_103] ERROR
{org.wso2.extension.siddhi.map.json.sourcemapper.JsonSourceMapper} -
Stream "myDataStream" does not have an attribute named "ABC",
but the received event {"event":{"ABC":"1"}} does. Hence dropping the message.
Check whether the json string is in a correct format for default mapping.
I've tried adding the attributes
#source(type = 'kafka', bootstrap.servers = 'localhost:9092',
topic.list = 'TestTopic', group.id = 'test',
threading.option = 'single.thread',
#map(type = 'json', #attributes(ABC = '$.ABC')))
Syntax error:
Error at 'json' defined at stream 'myDataStream', attribute 'json' is
not mapped
Any help would be greatly appreciated.
There is an error in the syntax of the stream,
define stream myDataStream (ABC string);
Here the attribute name is the key of the JSON messages, in this case, ABC