ksqlDB get message's value as string column from Json topic - apache-kafka

There is a topic containing plain JSON messages. I am trying to create a new stream by extracting a few columns from the JSON, plus one more VARCHAR column holding the entire message value.
Here is a sample message in topic_json:
{
  "db": "mydb",
  "collection": "collection",
  "op": "update"
}
Creating a stream like:
CREATE STREAM test1 (
  db VARCHAR,
  collection VARCHAR,
  VAL STRING
) WITH (
  KAFKA_TOPIC = 'topic_json',
  VALUE_FORMAT = 'JSON'
);
The output of this stream contains only the db and collection columns. How can I add another column holding the whole message value, i.e. "{\"db\":\"mydb\",\"collection\":\"collection\",\"op\":\"update\"}"?
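One approach that should work here (a sketch, not from the original thread): declare a second stream over the same topic with VALUE_FORMAT='KAFKA', which deserializes the raw message value into a single VARCHAR column, then derive the JSON fields with EXTRACTJSONFIELD while keeping the raw string alongside them. The stream names below are hypothetical.

-- Hypothetical stream name; reads each message value as one raw string
CREATE STREAM test1_raw (val VARCHAR) WITH (
  KAFKA_TOPIC = 'topic_json',
  VALUE_FORMAT = 'KAFKA'
);

-- Derive the JSON fields and keep the full message as a column
CREATE STREAM test1_with_raw AS SELECT
  EXTRACTJSONFIELD(val, '$.db') AS db,
  EXTRACTJSONFIELD(val, '$.collection') AS collection,
  val
FROM test1_raw
EMIT CHANGES;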

Related

Flink SQL CLI: bring in header records

I'm new to the Flink SQL CLI and I want to create a sink from my Kafka cluster.
I've read the documentation, and as I understand it, the headers are a MAP<STRING, BYTES> type and they carry all the important information.
When I'm using the SQL CLI, I try to create a sink table with this command:
CREATE TABLE KafkaSink (
  `headers` MAP<STRING, BYTES> METADATA
) WITH (
  'connector' = 'kafka',
  'topic' = 'MyTopic',
  'properties.bootstrap.servers' = 'LocalHost',
  'properties.group.id' = 'MyGroypID',
  'scan.startup.mode' = 'earliest-offset',
  'value.format' = 'json'
);
But when I try to read the data with select * from KafkaSink limit 10; it returns null records.
I've tried to run queries like
select headers.col1 from KafkaSink limit 10;
And I've also tried to create the sink table with different structures in the column-definition part:
...
`headers` STRING
...
...
`headers` MAP<STRING, STRING>
...
...
`headers` ROW(COL1 VARCHAR, COL2 VARCHAR...)
...
But it returns nothing. However, when I bring in the offset column from the Kafka cluster, it gives me the offset but not the headers.
Can someone explain my error?
I want to create a Kafka sink with the Flink SQL CLI.
OK, as far as I could see, when I changed to
'format' = 'debezium-json'
I could see the JSON much better.
I followed the JSON schema; in my case it was
{
  "data": {...},
  "metadata": {...}
}
So instead of bringing in the headers, I'm bringing in the data with all the columns I need: the data as a string, and the columns as, for example, data.col1, data.col2.
In order to see the records, a simple
select
  json_value(data, '$.Col1') as Col1
from Table;
works!
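Pulling the pieces of this answer together, the revised table definition would look roughly like this (a sketch only: the table name is hypothetical, the connection options are the placeholders from the question, and reading the nested data object as a STRING column is what the answer describes, so results may vary by Flink version and format handling):

-- Hypothetical table name; WITH options mirror the question's
CREATE TABLE KafkaSource (
  `data` STRING,
  `metadata` STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'MyTopic',
  'properties.bootstrap.servers' = 'LocalHost',
  'properties.group.id' = 'MyGroypID',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'debezium-json'
);

-- JSON_VALUE extracts a scalar from the JSON string held in `data`
SELECT JSON_VALUE(`data`, '$.Col1') AS Col1 FROM KafkaSource;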

Insert the evaluation of a function in Cassandra with JSON

I am using Kafka Connect to sink some data from a Kafka topic to a Cassandra table. I want to add a column with a timestamp of when the insert/update happens in Cassandra. That is easy in Postgres with functions and triggers, but the same approach won't work in Cassandra: I cannot add Java code to the Cassandra cluster.
So I am thinking about how to inject a now() at some point in Kafka Connect, so that the result of evaluating the function gets inserted into the Cassandra table.
I read that the Kafka Connect Cassandra sink uses the Cassandra JSON insert API.
https://docs.datastax.com/en/cql-oss/3.3/cql/cql_using/useInsertJSON.html
I tried to insert with different formats but nothing worked.
INSERT INTO keyspace1.table1 JSON '{ "lhlt" : "key 1", "last_updated_on_cassandra" : now() }';
InvalidRequest: Error from server: code=2200 [Invalid query] message="Could not decode JSON string as a map: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'now': was expecting 'null', 'true', 'false' or NaN
at [Source: { "lhlt" : "key 1", "last_updated_on_cassandra" : now() }; line: 1, column: 74]. (String was: { "lhlt" : "key 1", "last_updated_on_cassandra" : now() })"
The CQL grammar does not support the use of functions when inserting data in JSON format.
The INSERT INTO ... JSON command only accepts valid JSON. For example:
INSERT INTO tstamp_tbl JSON '{
"id":1,
"tstamp":"2022-10-29"
}';
You will need to use the more generic INSERT syntax if you want to call CQL functions. For example:
INSERT INTO tstamp_tbl (id, tstamp)
VALUES ( 1, toTimestamp(now()) );
Cheers!
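For reference, both inserts above assume a table shaped roughly like this (inferred from the examples; the question's actual table is not shown):

-- Hypothetical table matching the id/tstamp examples
CREATE TABLE tstamp_tbl (
  id int PRIMARY KEY,
  tstamp timestamp
);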

ksqldb stream created with EXTRACTJSONFIELD fields contains null values for these fields

Hi everyone. I am trying to create a stream from another stream (using CREATE STREAM ... AS SELECT) that contains JSON data stored as VARCHAR. My first intention was to use EXTRACTJSONFIELD to extract specific JSON fields into separate columns. But when I query the new stream, I get null values for the fields created using EXTRACTJSONFIELD.
Example:
My payload looks like this:
{
  "data": {
    "some_value": 3702
  },
  "metadata": {
    "timestamp": "2022-05-23T23:54:38.477934Z",
    "table-name": "some_table"
  }
}
I create the stream using:
CREATE STREAM SOME_STREAM (data VARCHAR, metadata VARCHAR)
  WITH (KAFKA_TOPIC='SOME_TOPIC', VALUE_FORMAT='JSON');
Then I am trying to create a new stream using:
CREATE STREAM SOME_STREAM_QUERYABLE AS
SELECT
  DATA,
  METADATA,
  EXTRACTJSONFIELD(METADATA, '$.timestamp') AS timestamp
FROM SOME_STREAM
EMIT CHANGES;
The attempt to query this new stream gives me null for the timestamp field. The data and metadata fields contain valid JSON.
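Not from the original thread, but a quick way to narrow this down is to test the extraction directly against the source stream before materializing the derived one; if this also returns null, the problem is in the stored string or the JSON path rather than in the new stream:

-- Debugging sketch using the names from the question
SELECT METADATA,
       EXTRACTJSONFIELD(METADATA, '$.timestamp') AS ts
FROM SOME_STREAM
EMIT CHANGES
LIMIT 5;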

How to parse array of struct when creating stream in KSQL

I'm trying to access a key called session_id; here is what the JSON of my message looks like:
{
  "events": [
    {
      "data": {
        "session_id": "-8479590123628083695"
      }
    }
  ]
}
This is my KSQL code to create the stream:
CREATE STREAM stream_test (
  events ARRAY< data STRUCT< session_id varchar > >
) WITH (
  KAFKA_TOPIC='my_topic',
  VALUE_FORMAT='JSON'
);
But I get this error,
KSQLError: ("line 4:22: mismatched input 'STRUCT' expecting {'(', 'ARRAY', '>'}", 40001, None)
Does anyone know how to unpack this kind of structure? I'm struggling to find examples.
My solution:
events ARRAY< STRUCT<data STRUCT< session_id varchar >> >
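Spelled out in full, the corrected definition plus a query that drills into the nested field looks like this (a sketch reusing the names from the question; note that ksqlDB arrays are indexed from 1):

CREATE STREAM stream_test (
  events ARRAY< STRUCT< data STRUCT< session_id VARCHAR > > >
) WITH (
  KAFKA_TOPIC='my_topic',
  VALUE_FORMAT='JSON'
);

-- Pull the first event's session_id out with the -> operator
SELECT events[1]->data->session_id AS session_id
FROM stream_test
EMIT CHANGES;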

How to create ksqlDB stream fields from a nested JSON object

I have a topic over which I am sending JSON in the following format:
{
  "schema": {
    "type": "string",
    "optional": true
  },
  "payload": "CustomerData{version='1', customerId='76813432', phone='76813432'}"
}
and I would like to create a stream with customerId and phone, but I am not sure how to define the stream in terms of a nested JSON object.
CREATE STREAM customer (
payload.version VARCHAR,
payload.customerId VARCHAR,
payload.phone VARCHAR
) WITH (
KAFKA_TOPIC='customers',
VALUE_FORMAT='JSON'
);
Would it be something like that? How do I de-reference the nested object when defining the stream's fields?
Actually, the above does not work for field definitions; it says:
Caused by: line 2:12:
extraneous input '.' expecting {'EMIT', 'CHANGES',
'INTEGER', 'DATE', 'TIME', 'TIMESTAMP', 'INTERVAL', 'YEAR', 'MONTH', 'DAY',
Applying the EXTRACTJSONFIELD function
There is a ksqlDB function called EXTRACTJSONFIELD that you can use.
First, you need to extract the schema and payload fields:
CREATE STREAM customer (
schema VARCHAR,
payload VARCHAR
) WITH (
KAFKA_TOPIC='customers',
VALUE_FORMAT='JSON'
);
Then you can select the nested fields within the json:
SELECT EXTRACTJSONFIELD(payload, '$.version') AS version FROM customer;
However, it looks like your payload data does not have a valid JSON format.
Applying a STRUCT schema
If your entire payload were valid nested JSON, meaning your data looked like:
{
  "schema": {
    "type": "string",
    "optional": true
  },
  "payload": {
    "version": "1",
    "customerId": "76813432",
    "phone": "76813432"
  }
}
you can define STRUCT as below:
CREATE STREAM customer (
schema STRUCT<
type VARCHAR,
optional BOOLEAN>,
payload STRUCT<
version VARCHAR,
customerId VARCHAR,
phone VARCHAR>
)
WITH (
KAFKA_TOPIC='customers',
VALUE_FORMAT='JSON'
);
and finally referencing individual fields can be done like this:
CREATE STREAM customer_analysis AS
SELECT
payload->version as VERSION,
payload->customerId as CUSTOMER_ID,
payload->phone as PHONE
FROM customer
EMIT CHANGES;
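To verify, you can query the derived stream; with the reshaped sample payload above it should emit one row with version '1' and the customerId/phone values (a usage sketch):

SELECT VERSION, CUSTOMER_ID, PHONE
FROM customer_analysis
EMIT CHANGES
LIMIT 1;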