Flink SQL CLI: bringing header records - apache-kafka

I'm new to the Flink SQL CLI and I want to create a sink from my Kafka cluster.
I've read the documentation and, as I understand it, the headers are a MAP<STRING, BYTES> type and all the important information comes through them.
When using the SQL CLI I try to create a sink table with this command:
CREATE TABLE KafkaSink (
  `headers` MAP<STRING, BYTES> METADATA
) WITH (
  'connector' = 'kafka',
  'topic' = 'MyTopic',
  'properties.bootstrap.servers' = 'LocalHost',
  'properties.group.id' = 'MyGroypID',
  'scan.startup.mode' = 'earliest-offset',
  'value.format' = 'json'
);
But when I try to read the data with select * from KafkaSink limit 10; it returns null records.
I've tried to run queries like
select headers.col1 from a limit 10;
And also, I've tried to create the sink table with different structures in the column definition part:
...
`headers` STRING
...
...
`headers` MAP<STRING, STRING>
...
...
`headers` ROW(COL1 VARCHAR, COL2 VARCHAR...)
...
But it returns nothing; however, when I bring in the offset metadata column from the Kafka cluster, it brings the offset but not the headers.
Can someone explain my error?
I want to create a Kafka sink with the Flink SQL CLI.

OK, as far as I can tell, when I changed to
'format' = 'debezium-json'
I could see the JSON in a better way.
I followed the JSON schema; in my case it was
{
"data": {...},
"metadata":{...}
}
So instead of bringing the headers, I'm bringing the data with all the columns that I need: the data as a string, and the columns as, for example,
data.col1, data.col2
In order to see the records, just with a
select
json_value(data, '$.Col1') as Col1
from Table;
it works!
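For reference, JSON_VALUE is a built-in Flink SQL function, so the same pattern extends to several fields at once. A minimal sketch, where the KafkaSource table name, the data column, and the Col1/Col2/Col3 field names are only illustrative and not from the original setup:
SELECT
  JSON_VALUE(`data`, '$.Col1') AS Col1,
  JSON_VALUE(`data`, '$.Col2') AS Col2,
  CAST(JSON_VALUE(`data`, '$.Col3') AS INT) AS Col3  -- JSON_VALUE returns STRING, so cast when a numeric type is needed
FROM KafkaSource;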

Related

KSQLDB Create Stream Select AS with Headers

Is there a way in ksqlDB to add headers while creating a stream with AS SELECT?
For example, I have a stream DomainOrgs(DomainId INT, OrgId INT, someId INT); now I need to create a stream with all the values in DomainOrgs, and DomainId should also go into a header.
I tried to create it like
CREATE STREAM DomainOrgs_with_header AS
SELECT DomainId,
OrgId,
someId,
DomainId HEADER('DomainId')
FROM DomainOrgs
EMIT CHANGES;
I also tried
CREATE STREAM DomainOrgs_with_header
(
DomainId INT,
OrgId INT,
someId INT,
DomainId_Header Header('DomainId')
)
INSERT INTO DomainOrgs_with_header
SELECT DomainId,OrgId,someId,DomainId FROM DomainOrgs
Here, the stream gets created but the INSERT INTO fails.
Is there any way to select data into the stream with headers?

Insert the evaluation of a function in Cassandra with JSON

I am using Kafka Connect to sink some data from a Kafka topic to a Cassandra table. I want to add a column with a timestamp when the insert/update happens in Cassandra. That is easy in Postgres with functions and triggers, but it is not going to work the same way in Cassandra: I cannot add Java code to the Cassandra cluster.
So I am thinking about how to inject a now() at some point of the Kafka Connect pipeline, so that it gets inserted into the Cassandra table as the result of the function's execution.
I read that the Cassandra Kafka Connect sink uses the Cassandra JSON INSERT API:
https://docs.datastax.com/en/cql-oss/3.3/cql/cql_using/useInsertJSON.html
I tried to insert with different formats but nothing worked.
INSERT INTO keyspace1.table1 JSON '{ "lhlt" : "key 1", "last_updated_on_cassandra" : now() }';
InvalidRequest: Error from server: code=2200 [Invalid query] message="Could not decode JSON string as a map: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'now': was expecting 'null', 'true', 'false' or NaN
at [Source: { "lhlt" : "key 1", "last_updated_on_cassandra" : now() }; line: 1, column: 74]. (String was: { "lhlt" : "key 1", "last_updated_on_cassandra" : now() })"
The CQL grammar does not support the use of functions when inserting data in JSON format.
The INSERT INTO ... JSON command only accepts valid JSON. For example:
INSERT INTO tstamp_tbl JSON '{
"id":1,
"tstamp":"2022-10-29"
}';
You will need to use the more generic INSERT syntax if you want to call CQL functions. For example:
INSERT INTO tstamp_tbl (id, tstamp)
VALUES ( 1, toTimestamp(now()) );
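For context, a minimal sketch of a table the two INSERT examples above could run against; the answer does not show the actual schema, so the column types here are assumptions:
CREATE TABLE tstamp_tbl (
    id int PRIMARY KEY,   -- matches "id":1 in the JSON example
    tstamp timestamp      -- can be filled from a date string or via toTimestamp(now())
);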
Cheers!

Get s3 folder partitions with AWS Glue create_dynamic_frame.from_options

I'm creating a dynamic frame with create_dynamic_frame.from_options that pulls data directly from s3.
However, I have multiple partitions in my raw data that don't show up in the schema this way, because they aren't actually part of the data; they're part of the S3 folder structure.
If the partitions were part of the data I could use partitionKeys=["date"] , but date is not a column, it is a folder.
How can I detect these partitions using from_options or some other mechanism?
I can't use a glue crawler to detect the full schema including partitions, because my data is too nested for glue crawlers to handle.
Another way to deal with partitions is to register them in the Glue Data Catalog, either manually or programmatically.
You can use the Glue client in a Lambda function.
The example below comes from a Lambda function triggered when a file lands in a data lake; it parses the file name to create partitions in a target table in the Glue catalog. (You can find a use case with Adobe Analytics in the AWS archives, which is where I got the code: amazon-archives/athena-adobe-datafeed-splitter.)
import os
import boto3

def create_glue_client():
    """Create and return a Glue client for the region this AWS Lambda job is running in"""
    current_region = os.environ['AWS_REGION']  # With Glue, we only support writing to the region where this code runs
    return boto3.client('glue', region_name=current_region)

def does_partition_exist(glue_client, database, table, part_values):
    """Test if a specific partition exists in a database.table"""
    try:
        glue_client.get_partition(DatabaseName=database, TableName=table, PartitionValues=part_values)
        return True
    except glue_client.exceptions.EntityNotFoundException:
        return False

def add_partition(glue_client, database_name, s3_report_base, key):
    """Add a partition to the target table for a specific date"""
    # key_example = 01-xxxxxxxx_2019-11-07.tsv.gz
    partition_date = key.split("_")[1].split(".")[0]
    # partition_date = "2019-11-07"
    year, month, day = partition_date.split('-')
    # year, month, day = ["2019", "11", "07"]
    part = key[:2]
    # part = "01"
    if does_partition_exist(glue_client, database_name, "target_table", [year, month, day, part]):
        return
    # get headers python list from csv file in S3 (get_headers and s3_headers_path are helpers from the original code)
    headers = get_headers(s3_headers_path)
    # create new partition
    glue_client.create_partition(
        DatabaseName=database_name,
        TableName="target_table",
        PartitionInput={
            "Values": [year, month, day, part],
            "StorageDescriptor": storage_descriptor(
                [{"Type": "string", "Name": name} for name in headers],  # columns
                '%s/year=%s/month=%s/day=%s/part=%s' % (s3_report_base, year, month, day, part)  # location
            ),
            "Parameters": {}
        }
    )

def storage_descriptor(columns, location):
    """Data Catalog storage descriptor with the desired columns and S3 location"""
    return {
        "Columns": columns,
        "Location": location,
        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
            "Parameters": {
                "separatorChar": "\t"
            }
        },
        "BucketColumns": [],  # Required or SHOW CREATE TABLE fails
        "Parameters": {}  # Required or create_dynamic_frame.from_catalog fails
    }
In your case, with an existing file structure, you can use the function below to iterate over the file keys in your raw bucket and parse them in order to create the partitions:
s3 = boto3.client('s3')  # list_objects_v2 is a client method, not available on boto3.resource

def get_s3_keys(bucket):
    """Get a list of keys in an S3 bucket."""
    # Note: list_objects_v2 returns at most 1000 keys per call; use a paginator for larger buckets
    keys = []
    resp = s3.list_objects_v2(Bucket=bucket)
    for obj in resp['Contents']:
        keys.append(obj['Key'])
    return keys
Then you can build your dynamic frame as follows :
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="database_name",
    table_name="target_table",
    transformation_ctx="datasource1",
    push_down_predicate="(year='2020' and month='5' and day='4')"
)
Another workaround is to expose the file path of each record as a column using the attachFilename option (similar to input_file_name in Spark) and parse the partitions manually from the path strings.
df = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={"paths": paths, "groupFiles": "none"},
    format="csv",
    format_options={
        "withHeader": True,
        "multiLine": True,
        "quoteChar": '"',
        "attachFilename": "source_file"
    },
    transformation_ctx="table_source_s3"
)

ksqlDB: get a message's value as a string column from a JSON topic

There is a topic containing plain JSON messages. I am trying to create a new stream by extracting a few columns from the JSON, plus another VARCHAR column holding the whole message value.
Here is a sample message in topic_json:
{
"db": "mydb",
"collection": "collection",
"op": "update"
}
Creating a stream like this -
CREATE STREAM test1 (
  db VARCHAR,
  collection VARCHAR,
  VAL STRING
) WITH (
  KAFKA_TOPIC = 'topic_json',
  VALUE_FORMAT = 'JSON'
);
The output of this stream will contain only the db and collection columns. How can I add another column containing the message's value - "{\"db\":\"mydb\",\"collection\":\"collection\",\"op\":\"update\"}"?

ClickHouse JSON parse exception: Cannot parse input: expected ',' before

I'm trying to add JSON data to ClickHouse from Kafka. Here's simplified JSON:
{
...
"sendAddress":{
"sendCommChannelTypeId":4,
"sendCommChannelTypeCode":"SMS",
"sendAddress":"789345345945"},
...
}
Here are the steps: creating a table in ClickHouse, creating another table using the Kafka engine, and creating a MATERIALIZED VIEW to connect these two tables and also to connect ClickHouse with Kafka.
Creating the first table:
CREATE TABLE tab
(
...
sendAddress Tuple (sendCommChannelTypeId Int32, sendCommChannelTypeCode String, sendAddress String),
...
)Engine = MergeTree()
PARTITION BY applicationId
ORDER BY (applicationId);
Creating a second table with Kafka Engine SETTINGS:
CREATE TABLE tab_kfk
(
...
sendAddress Tuple (sendCommChannelTypeId Int32, sendCommChannelTypeCode String, sendAddress String),
...
)ENGINE = Kafka
SETTINGS kafka_broker_list = 'localhost:9092',
kafka_topic_list = 'topk2',
kafka_group_name = 'group1',
kafka_format = 'JSONEachRow',
kafka_row_delimiter = '\n';
Create MATERIALIZED VIEW
CREATE MATERIALIZED VIEW tab_mv TO tab AS
SELECT ... sendAddress, ...
FROM tab_kfk;
Then I try to SELECT all or specific items from the first table, tab, and get nothing (the logs show the parse exception from the title).
OK, I just added '[]' around the curly braces in sendAddress, like this:
"authkey":"some_value",
"sendAddress":[{
"sendCommChannelTypeId":4,
"sendCommChannelTypeCode":"SMS",
"sendAddress":"789345345945"
}]
And I still get an error, but a slightly different one.
What should I do to fix this problem? Thanks!
There are 3 ways to fix it:
1. Don't use nested objects, and flatten the messages before inserting them into the Kafka topic, for example like this (a table sketch for this flattened layout is shown after the three options):
{
..
"authkey":"key",
"sendAddress_CommChannelTypeId":4,
"sendAddress_CommChannelTypeCode":"SMS",
"sendAddress":"789345345945",
..
}
2. Use a Nested data structure, which requires changing the JSON message schema and the table schema:
{
..
"authkey":"key",
"sendAddress.sendCommChannelTypeId":[4],
"sendAddress.sendCommChannelTypeCode":["SMS"],
"sendAddress.sendAddress":["789345345945"],
..
}
CREATE TABLE tab_kfk
(
applicationId Int32,
..
sendAddress Nested(
sendCommChannelTypeId Int32,
sendCommChannelTypeCode String,
sendAddress String),
..
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'localhost:9092',
kafka_topic_list = 'topk2',
kafka_group_name = 'group1',
kafka_format = 'JSONEachRow',
kafka_row_delimiter = '\n',
input_format_import_nested_json = 1 /* <--- */
Take into account the setting input_format_import_nested_json.
3. Interpret the input JSON message as a string and parse it manually (see GitHub issue #16969):
CREATE TABLE tab_kfk
(
message String
)
ENGINE = Kafka
SETTINGS
..
kafka_format = 'JSONAsString', /* <--- */
..
CREATE MATERIALIZED VIEW tab_mv TO tab
AS
SELECT
..
JSONExtractString(message, 'authkey') AS authkey,
JSONExtract(message, 'sendAddress', 'Tuple(Int32,String,String)') AS sendAddress,
..
FROM tab_kfk;
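For the first option, here is a sketch of what the Kafka-engine table could look like for the flattened layout; the column list is an assumption based on the flattened message above, while the SETTINGS are carried over from the earlier examples:
CREATE TABLE tab_kfk_flat
(
    authkey String,
    sendAddress_CommChannelTypeId Int32,
    sendAddress_CommChannelTypeCode String,
    sendAddress String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'localhost:9092',
         kafka_topic_list = 'topk2',
         kafka_group_name = 'group1',
         kafka_format = 'JSONEachRow',
         kafka_row_delimiter = '\n';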
Reading this commit, I believe that with release 23.1 a new setting, input_format_json_read_objects_as_strings, will allow converting nested JSON objects to String.
Example:
SET input_format_json_read_objects_as_strings = 1;
CREATE TABLE test (id UInt64, obj String, date Date) ENGINE=Memory();
INSERT INTO test FORMAT JSONEachRow {"id" : 1, "obj" : {"a" : 1, "b" : "Hello"}, "date" : "2020-01-01"};
SELECT * FROM test;
Result:
id    obj                         date
1     {"a" : 1, "b" : "Hello"}    2020-01-01
(See the ClickHouse docs for this setting.)
Of course, you can still use a materialized view to convert the String object into correctly typed columns, using the same techniques as for parsing the JSONAsString format.
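A minimal sketch of such a materialized view, reusing the test table from the example above; the typed target table test_typed and its column types are assumptions:
CREATE TABLE test_typed (id UInt64, a Int32, b String, date Date) ENGINE = Memory();

CREATE MATERIALIZED VIEW test_typed_mv TO test_typed AS
SELECT
    id,
    JSONExtractInt(obj, 'a') AS a,       -- parse the numeric field out of the String column
    JSONExtractString(obj, 'b') AS b,    -- parse the text field
    date
FROM test;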