I have a topic over which I am sending JSON in the following format:
{
"schema": {
"type": "string",
"optional": true
},
"payload": “CustomerData{version='1', customerId=‘76813432’, phone=‘76813432’}”
}
and I would like to create a stream with customerId and phone, but I am not sure how to define the stream in terms of a nested JSON object.
CREATE STREAM customer (
payload.version VARCHAR,
payload.customerId VARCHAR,
payload.phone VARCHAR
) WITH (
KAFKA_TOPIC='customers',
VALUE_FORMAT='JSON'
);
Would it be something like that? How do I de-reference the nested object when defining the stream's fields?
Actually, the above does not work for field definitions; it says:
Caused by: line 2:12:
extraneous input '.' expecting {'EMIT', 'CHANGES',
'INTEGER', 'DATE', 'TIME', 'TIMESTAMP', 'INTERVAL', 'YEAR', 'MONTH', 'DAY',
Applying the function EXTRACTJSONFIELD
There is a ksqlDB function called EXTRACTJSONFIELD that you can use.
First, you need to extract the schema and payload fields:
CREATE STREAM customer (
schema VARCHAR,
payload VARCHAR
) WITH (
KAFKA_TOPIC='customers',
VALUE_FORMAT='JSON'
);
Then you can select the nested fields within the JSON:
SELECT EXTRACTJSONFIELD(payload, '$.version') AS version FROM customer;
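Building on that, a sketch that pulls all three fields into a derived stream in one statement (the stream name customer_extracted is made up here, and this assumes the payload string holds valid JSON):
CREATE STREAM customer_extracted AS
SELECT
EXTRACTJSONFIELD(payload, '$.version') AS version,
EXTRACTJSONFIELD(payload, '$.customerId') AS customer_id,
EXTRACTJSONFIELD(payload, '$.phone') AS phone
FROM customer
EMIT CHANGES;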
However, it looks like your payload value is not valid JSON, so EXTRACTJSONFIELD will return NULL here.
Applying a STRUCT schema
If your entire payload is encoded as proper JSON, meaning your data looks like this:
{
"schema": {
"type": "string",
"optional": true
},
"payload": {
"version"="1",
"customerId"="76813432",
"phone"="76813432"
}
}
you can define a STRUCT schema as below:
CREATE STREAM customer (
schema STRUCT<
type VARCHAR,
optional BOOLEAN>,
payload STRUCT<
version VARCHAR,
customerId VARCHAR,
phone VARCHAR>
)
WITH (
KAFKA_TOPIC='customers',
VALUE_FORMAT='JSON'
);
and finally referencing individual fields can be done like this:
CREATE STREAM customer_analysis AS
SELECT
payload->version as VERSION,
payload->customerId as CUSTOMER_ID,
payload->phone as PHONE
FROM customer
EMIT CHANGES;
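To sanity-check the result, a simple push query against the derived stream (just a sketch):
SELECT VERSION, CUSTOMER_ID, PHONE FROM customer_analysis EMIT CHANGES;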
Related
Hi everyone. I am trying to create a stream from another stream (using CREATE STREAM ... AS SELECT) that contains JSON data stored as VARCHAR. My first intention was to use EXTRACTJSONFIELD to extract specific JSON fields into separate columns, but when I query the new stream I get null values for the fields created using EXTRACTJSONFIELD.
Example:
My payload looks like this:
{
"data": {
"some_value": 3702,
},
"metadata": {
"timestamp": "2022-05-23T23:54:38.477934Z",
"table-name": "some_table"
}
}
I create the stream using:
CREATE STREAM SOME_STREAM (data VARCHAR, metadata VARCHAR) WITH (KAFKA_TOPIC='SOME_TOPIC', VALUE_FORMAT='JSON');
Then I am trying to create a new stream using:
CREATE STREAM SOME_STREAM_QUERYABLE
AS SELECT
DATA,
METADATA,
EXTRACTJSONFIELD(METADATA, '$.timestamp') as timestamp
FROM SOME_STREAM
EMIT CHANGES;
The attempt to query this new stream gives me null for the timestamp field. The data and metadata fields contain valid JSON.
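One way to narrow this down (a diagnostic sketch, not a confirmed fix) is to run EXTRACTJSONFIELD directly against the source stream and inspect what METADATA actually contains. EXTRACTJSONFIELD returns NULL when the string is not valid JSON or when the (case-sensitive) path does not match a field:
-- Compare the raw VARCHAR value with what the extraction sees.
SELECT
METADATA,
EXTRACTJSONFIELD(METADATA, '$.timestamp') AS ts
FROM SOME_STREAM
EMIT CHANGES
LIMIT 5;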
I'm trying to access a key called session_id. Here is what the JSON of my message looks like:
{
  "events": [
    {
      "data": {
        "session_id": "-8479590123628083695"
      }
    }
  ]
}
This is my KSQL code to create the stream,
CREATE STREAM stream_test(
events ARRAY< data STRUCT< session_id varchar> >
) WITH (KAFKA_TOPIC='my_topic',
VALUE_FORMAT='JSON')
But I get this error,
KSQLError: ("line 4:22: mismatched input 'STRUCT' expecting {'(', 'ARRAY', '>'}", 40001, None)
Does anyone know how to unpack this kind of structure? I'm struggling to find examples.
My solution:
events ARRAY< STRUCT<data STRUCT< session_id varchar >> >
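Spelled out in full (a sketch; note that ksqlDB arrays are 1-indexed, so the first event is events[1]):
CREATE STREAM stream_test (
events ARRAY<STRUCT<data STRUCT<session_id VARCHAR>>>
) WITH (
KAFKA_TOPIC='my_topic',
VALUE_FORMAT='JSON'
);
SELECT events[1]->data->session_id AS session_id
FROM stream_test
EMIT CHANGES;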
There is a topic containing plain JSON messages. I am trying to create a new stream by extracting a few columns from the JSON, plus another VARCHAR column holding the whole message value.
Here is a sample message in topic_json:
{
"db": "mydb",
"collection": "collection",
"op": "update"
}
Creating a stream like this:
CREATE STREAM test1 (
db VARCHAR,
collection VARCHAR,
VAL STRING
) WITH (
KAFKA_TOPIC = 'topic_json',
VALUE_FORMAT = 'JSON'
);
The output of this stream will contain only the db and collection columns. How can I add another column holding the message's whole value, i.e. "{\"db\":\"mydb\",\"collection\":\"collection\",\"op\":\"update\"}"?
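One possible approach (a sketch, not verified against your setup; the names topic_json_raw and test1_with_val are made up): declare a second stream over the same topic with VALUE_FORMAT='KAFKA', which reads the whole message value as a single string column, then derive db and collection from it with EXTRACTJSONFIELD:
CREATE STREAM topic_json_raw (
VAL VARCHAR
) WITH (
KAFKA_TOPIC = 'topic_json',
VALUE_FORMAT = 'KAFKA'
);
CREATE STREAM test1_with_val AS
SELECT
EXTRACTJSONFIELD(VAL, '$.db') AS db,
EXTRACTJSONFIELD(VAL, '$.collection') AS collection,
VAL
FROM topic_json_raw
EMIT CHANGES;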
I want to store variables in a table, with a definition like the one below:
variable
--------
id int
var_type int 0: number, 1: string, 2: json
var_name int
var_value ??? varchar or jsonb?
If I use varchar, how do I store the JSON-type variable? And if I use jsonb, how do I store the int and string values?
An example JSON value that would be stored:
[{"name": "Andy", "email" : "andy#mail.id"},{"name": "Cindy", "email" : "cindy#mail.id"}]
TIA
Beny
When you have data and you don't know the structure, use a single jsonb column. JSON can handle strings, numbers, and more JSON.
{
"string": "basset hounds got long ears",
"number": 23.42,
"json": [1,2,3,4,5]
}
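Applied to your variable table, that could look roughly like this (a sketch; the column types are guesses from your description):
create table variable (
id bigserial primary key,
var_type int not null, -- 0: number, 1: string, 2: json
var_name text not null,
var_value jsonb
);
insert into variable (var_type, var_name, var_value) values
(0, 'some_number', '23.42'),
(1, 'some_string', '"basset hounds got long ears"'),
(2, 'some_json', '[1,2,3,4,5]');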
Don't try to cram them all into a single array. Put them in separate rows.
One row: {"name": "Andy", "email" : "andy#mail.id"}
Another row: {"name": "Cindy", "email" : "cindy#mail.id"}
However, your example feels like it's avoiding designing a schema. JSONB is useful, but overusing it defeats the point of a relational database.
create table people (
id bigserial primary key,
-- Columns for known keys, which can have constraints.
name text not null,
email text not null,
-- JSONB for extra keys you can't predict.
data jsonb
);
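For example (hypothetical rows mirroring the two above; any attribute you can't predict goes into data):
insert into people (name, email, data) values
('Andy', 'andy#mail.id', '{"favorite dog breed": "basset hound"}'),
('Cindy', 'cindy#mail.id', null);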
Use the JSON operators to query individual pairs.
select
name, email, data->>'favorite dog breed'
from people;
My objective is to break the results of a query on a table with a JSON column containing an array out into individual rows, but I'm not sure about the syntax for this. I'm using the following query:
SELECT
jobs.id,
templates.Id,
templates.Version,
templates.StepGroupId,
templates.PublicVersion,
templates.PlannedDataSheetIds,
templates.SnapshottedDataSheetValues
FROM jobs,
jsonb_to_recordset(jobs.source_templates) AS templates(Id, Version, StepGroupId, PublicVersion,
PlannedDataSheetIds, SnapshottedDataSheetValues)
On the following table:
create table jobs
(
id uuid default uuid_generate_v4() not null
constraint jobs_pkey
primary key,
source_templates jsonb
);
with the jsonb column containing data in this format:
[
{
"Id":"94729e08-7d5c-459d-9244-f66e17059fc4",
"Version":1,
"StepGroupId":"0274590b-c08d-4963-b37e-8fc8f25151d2",
"PublicVersion":1,
"PlannedDataSheetIds":null,
"SnapshottedDataSheetValues":null
},
{
"Id":"66791bfd-8cdb-43f7-92e6-bfb45b0f780f",
"Version":4,
"StepGroupId":"126404c5-ed1e-4796-80b1-ca68ad486682",
"PublicVersion":1,
"PlannedDataSheetIds":null,
"SnapshottedDataSheetValues":null
},
{
"Id":"e3b31b98-8052-40dd-9405-c316b9c62942",
"Version":4,
"StepGroupId":"bc6a9dd3-d527-449e-bb36-39f03eaf87b9",
"PublicVersion":1,
"PlannedDataSheetIds":null,
"SnapshottedDataSheetValues":null
}
]
I get an error:
[42601] ERROR: a column definition list is required for functions returning "record"
What is the right way to do this without generating the error?
You need to define the data types in the column definition list:
SELECT
jobs.id,
templates.Id,
templates.Version,
templates.StepGroupId,
templates.PublicVersion,
templates.PlannedDataSheetIds,
templates.SnapshottedDataSheetValues
FROM jobs,
jsonb_to_recordset(jobs.source_templates)
AS templates(Id UUID, Version INT, StepGroupId UUID, PublicVersion INT,
PlannedDataSheetIds INT, SnapshottedDataSheetValues INT);
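One caveat worth checking (a sketch, based on standard Postgres identifier folding): unquoted column names in the definition list fold to lowercase, while jsonb_to_recordset matches JSON keys by exact, case-sensitive name. With keys spelled "Id", "Version", and so on, the columns may need double quotes to line up:
SELECT
jobs.id,
templates."Id",
templates."Version",
templates."StepGroupId",
templates."PublicVersion",
templates."PlannedDataSheetIds",
templates."SnapshottedDataSheetValues"
FROM jobs,
jsonb_to_recordset(jobs.source_templates)
AS templates("Id" UUID, "Version" INT, "StepGroupId" UUID, "PublicVersion" INT,
"PlannedDataSheetIds" INT, "SnapshottedDataSheetValues" INT);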