The INPUT_STREAM in Kafka was created with the ksqlDB statement below:
CREATE STREAM INPUT_STREAM (year STRUCT<month STRUCT<day STRUCT<hour INTEGER, minute INTEGER>>>) WITH (KAFKA_TOPIC = 'INPUT_TOPIC', VALUE_FORMAT = 'JSON');
It defines a four-level nested JSON schema with the fields year, month, day, hour, and minute, like the one below:
{
"year": {
"month": {
"day": {
"hour": string,
"minute": string
}
}
}
}
I want to create a second OUTPUT_STREAM that will read the messages from INPUT_STREAM and re-map its field names to some custom ones. I want to grab the hour and minute values and place them in a nested JSON object under the fields one and two, like the one below:
{
"one": {
"two": {
"hour": string,
"minute": string
}
}
}
I go ahead and put together a ksqlDB statement to create OUTPUT_STREAM:
CREATE STREAM OUTPUT_STREAM WITH (KAFKA_TOPIC='OUTPUT_TOPIC', REPLICAS=3) AS SELECT YEAR->MONTH->DAY->HOUR ONE->TWO->HOUR FROM INPUT_STREAM EMIT CHANGES;
The statement fails with an error. Is there a syntax error in this statement? Is it possible to specify the destination field name like I do here with
...AS SELECT YEAR->MONTH->DAY->HOUR ONE->TWO->HOUR FROM... ?
I've tried to use STRUCT instead of ONE->TWO->HOUR:
CREATE STREAM OUTPUT_STREAM WITH (KAFKA_TOPIC='OUTPUT_TOPIC', REPLICAS=3) AS SELECT YEAR->MONTH->DAY->HOUR ONE STRUCT<TWO STRUCT<HOUR VARCHAR>> FROM INPUT_STREAM EMIT CHANGES;
It errors out too.
To create a struct in ksqlDB, use the STRUCT constructor:
SELECT
STRUCT(
COLUMN_NEW_NAME_1 := OLD_COLUMN_NAME_1,
COLUMN_NEW_NAME_2 := OLD_COLUMN_NAME_2
) as STRUCT_COLUMN_NAME
FROM OLD_TABLE_OR_STREAM
So the answer to your question is: yes, you have a syntax error. For your case, the query could look like this:
SELECT
STRUCT(
TWO := STRUCT(
HOUR := YEAR->MONTH->DAY->HOUR,
MINUTE := YEAR->MONTH->DAY->MINUTE
)
) as ONE
FROM INPUT_STREAM;
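To persist the remapped fields to a new stream and topic, as in the original attempt, the same projection can be wrapped in a CREATE STREAM ... AS SELECT. A sketch only; the WITH options are copied from the statement in the question and can be adjusted:
CREATE STREAM OUTPUT_STREAM WITH (KAFKA_TOPIC='OUTPUT_TOPIC', REPLICAS=3) AS
SELECT
    STRUCT(
        TWO := STRUCT(
            HOUR := YEAR->MONTH->DAY->HOUR,
            MINUTE := YEAR->MONTH->DAY->MINUTE
        )
    ) AS ONE
FROM INPUT_STREAM
EMIT CHANGES;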
First of all, if someone has a better way to phrase my question, feel free to comment.
I want to translate this query into Golang:
SELECT
mou."id",
mou."name",
mou.description,
mou.img_url,
um.favorite
FROM
majors_of_universities mou
JOIN field_of_studies fos ON mou.field_of_studies_id = fos."id"
JOIN univ_major um ON mou."id" = um.majors_of_universities_id
WHERE
mou."name" ILIKE '%%'
AND fos."name" IN ( 'IT & Software', 'Analisis Data & Statistik' )
ORDER BY
mou."name"
LIMIT 99 OFFSET 0;
This query works well, by the way. I'm using sqlc as a generator and, following its rules (correct me if I'm wrong), I changed:
'%%' to $1
'IT & Software', 'Analisis Data & Statistik' to $2
99 to $3
0 to $4
so they become variables.
Little did I know, $2 was generated as a string data type. What I want is for it to be generated as an array of strings, because I found out that Go can translate a slice of strings like ["1", "2", "3"] into '1', '2', '3', which is exactly what I want to put inside the Postgres IN parentheses.
On the Go side, I made a custom struct like this:
type SearchMajorReq struct {
Name string `json:"name"`
FieldOfStudies []string `json:"field_of_studies"`
Limit int32 `json:"limit"`
Page int32 `json:"page"`
}
in the hope that this is the correct data type for sending a JSON request body like this:
{
"name":"",
"field_of_studies": ["1", "2", "3"],
"limit": 10,
"page": 1
}
but it doesn't work. I get an error on the FieldOfStudies part.
How can I solve this?
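For context, one commonly suggested sqlc pattern for IN lists is to switch to = ANY over an array parameter, which sqlc then generates as a []string parameter in Go. A sketch only, untested against this schema; the query name SearchMajors is made up, and it assumes the pgx or lib/pq driver:
-- name: SearchMajors :many
SELECT
    mou."id",
    mou."name",
    mou.description,
    mou.img_url,
    um.favorite
FROM
    majors_of_universities mou
    JOIN field_of_studies fos ON mou.field_of_studies_id = fos."id"
    JOIN univ_major um ON mou."id" = um.majors_of_universities_id
WHERE
    mou."name" ILIKE $1
    AND fos."name" = ANY($2::text[])
ORDER BY
    mou."name"
LIMIT $3 OFFSET $4;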
Hi everyone. I am trying to create a stream from another stream (using CREATE STREAM ... AS SELECT) that contains JSON data stored as VARCHAR. My first intention was to use EXTRACTJSONFIELD to extract specific JSON fields into separate columns. But when I query the new stream, I get null values for the fields created using EXTRACTJSONFIELD.
Example:
My payload looks like this:
{
"data": {
"some_value": 3702,
},
"metadata": {
"timestamp": "2022-05-23T23:54:38.477934Z",
"table-name": "some_table"
}
}
I create the stream using:
CREATE STREAM SOME_STREAM (data VARCHAR, metadata VARCHAR) WITH (KAFKA_TOPIC='SOME_TOPIC', VALUE_FORMAT='JSON');
Then I am trying to create a new stream using:
CREATE STREAM SOME_STREAM_QUERYABLE
AS SELECT
DATA,
METADATA,
EXTRACTJSONFIELD(METADATA, '$.timestamp') as timestamp
FROM SOME_STREAM
EMIT CHANGES
The attempt to query this new stream gives me null for the timestamp field. The data and metadata fields contain valid JSON.
I'm creating a dynamic frame with create_dynamic_frame.from_options that pulls data directly from S3.
However, I have multiple partitions on my raw data that don't show up in the schema this way, because they aren't actually part of the data; they're part of the S3 folder structure.
If the partitions were part of the data I could use partitionKeys=["date"], but date is not a column, it is a folder.
How can I detect these partitions using from_options or some other mechanism?
I can't use a glue crawler to detect the full schema including partitions, because my data is too nested for glue crawlers to handle.
Another way to deal with partitions is to register them in the Glue Data Catalog, either manually or programmatically.
You can use the Glue client in a Lambda function.
The example below is from a Lambda function, triggered when a file lands in a data lake, that parses the file name to create partitions in a target table in the Glue catalog. (You can find a use case with Adobe Analytics in the AWS archives, where I got the code: amazon-archives/athena-adobe-datafeed-splitter.)
import os
import boto3

def create_glue_client():
    """Create and return a Glue client for the region this AWS Lambda job is running in"""
    current_region = os.environ['AWS_REGION']  # With Glue, we only support writing to the region where this code runs
    return boto3.client('glue', region_name=current_region)
def does_partition_exist(glue_client, database, table, part_values):
"""Test if a specific partition exists in a database.table"""
try:
glue_client.get_partition(DatabaseName=database, TableName=table, PartitionValues=part_values)
return True
except glue_client.exceptions.EntityNotFoundException:
return False
def add_partition(glue_client, database_name, s3_report_base, key):
"""Add a partition to the target table for a specific date"""
# key_example = 01-xxxxxxxx_2019-11-07.tsv.gz
partition_date = key.split("_")[1].split(".")[0]
# partition_date = "2019-11-07"
year, month, day = partition_date.split('-')
# year, month, day = [2019,11,07]
part = key[:2]
# part = "01"
if does_partition_exist(glue_client, database_name, "target_table", [year, month, day, part]):
return
# get headers python list from csv file in S3
headers = get_headers(s3_headers_path)
# create new partition
glue_client.create_partition(
DatabaseName=database_name,
TableName="target_table",
PartitionInput={
"Values": [year, month, day, part],
"StorageDescriptor": storage_descriptor(
[{"Type": "string", "Name": name} for name in headers], #columns
'%s/year=%s/month=%s/day=%s/part=%s' % (s3_report_base, year, month, day, part) #location
),
"Parameters": {}
}
)
def storage_descriptor(columns, location):
"""Data Catalog storage descriptor with the desired columns and S3 location"""
return {
"Columns": columns,
"Location": location,
"InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
"OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
"SerdeInfo": {
"SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
"Parameters": {
"separatorChar": "\t"
}
},
"BucketColumns": [], # Required or SHOW CREATE TABLE fails
"Parameters": {} # Required or create_dynamic_frame.from_catalog fails
}
In your case, with an existing file structure, you can use the function below to iterate over the file keys in your raw bucket and parse them to create partitions (a sketch of the driving loop follows the listing):
s3 = boto3.client('s3')  # list_objects_v2 is a client method, not available on the resource

def get_s3_keys(bucket):
    """Get a list of keys in an S3 bucket (note: list_objects_v2 returns at most 1000 keys per call)."""
    keys = []
    resp = s3.list_objects_v2(Bucket=bucket)
    for obj in resp['Contents']:
        keys.append(obj['Key'])
    return keys
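For example, the partition registration could then be driven from those keys. A sketch only; the bucket name and S3 base path are made up, and it assumes the same file naming as in add_partition above:
glue_client = create_glue_client()
for key in get_s3_keys("my-raw-bucket"):
    # strip any prefix so add_partition sees just the file name, e.g. 01-xxxxxxxx_2019-11-07.tsv.gz
    fname = key.split("/")[-1]
    add_partition(glue_client, "database_name", "s3://my-raw-bucket/reports", fname)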
Then you can build your dynamic frame as follows:
datasource = glueContext.create_dynamic_frame.from_catalog(
database = "database_name",
table_name = "target_table",
transformation_ctx = "datasource1",
push_down_predicate = "(year='2020' and month='5' and day='4')"
)
Another workaround is to expose the file path of each record as a column using the attachFilename option (similar to input_file_name in Spark) and parse the partition values manually from the string paths; a parsing sketch follows the listing below.
df = glueContext.create_dynamic_frame.from_options(
connection_type='s3',
connection_options = {"paths": paths, "groupFiles": "none"},
format="csv",
format_options = {"withHeader": True,
"multiLine": True,
"quoteChar": '"',
"attachFilename": "source_file"
},
transformation_ctx="table_source_s3"
)
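From there, one way to derive the partition columns is a regex over the attached path. A sketch only, assuming Hive-style path segments such as date=2020-05-04; adjust the pattern to your actual folder layout:
from pyspark.sql.functions import regexp_extract

# convert the DynamicFrame to a Spark DataFrame and pull the partition value out of each file path,
# e.g. s3://bucket/raw/date=2020-05-04/part-0000.csv -> "2020-05-04"
sdf = df.toDF()
sdf = sdf.withColumn("date", regexp_extract("source_file", r"date=([^/]+)/", 1))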
I'm trying to add JSON data to ClickHouse from Kafka. Here's simplified JSON:
{
...
"sendAddress":{
"sendCommChannelTypeId":4,
"sendCommChannelTypeCode":"SMS",
"sendAddress":"789345345945"},
...
}
Here are the steps for creating a table in ClickHouse, creating a second table using the Kafka engine, creating a MATERIALIZED VIEW to connect these two tables, and connecting ClickHouse with Kafka.
Creating the first table:
CREATE TABLE tab
(
...
sendAddress Tuple (sendCommChannelTypeId Int32, sendCommChannelTypeCode String, sendAddress String),
...
)Engine = MergeTree()
PARTITION BY applicationId
ORDER BY (applicationId);
Creating a second table with Kafka Engine SETTINGS:
CREATE TABLE tab_kfk
(
...
sendAddress Tuple (sendCommChannelTypeId Int32, sendCommChannelTypeCode String, sendAddress String),
...
)ENGINE = Kafka
SETTINGS kafka_broker_list = 'localhost:9092',
kafka_topic_list = 'topk2',
kafka_group_name = 'group1',
kafka_format = 'JSONEachRow',
kafka_row_delimiter = '\n';
Creating the MATERIALIZED VIEW:
CREATE MATERIALIZED VIEW tab_mv TO tab AS
SELECT ... sendAddress, ...
FROM tab_kfk;
Then I try to SELECT everything or specific rows from the first table, tab, and get nothing; the logs show an error.
OK. Just adding '[' and ']' around the curly braces in sendAddress, like this:
"authkey":"some_value",
"sendAddress":[{
"sendCommChannelTypeId":4,
"sendCommChannelTypeCode":"SMS",
"sendAddress":"789345345945"
}]
And I still get an error, but a slightly different one.
What should I do to fix this problem? Thanks!
There are 3 ways to fix it:
Don't use nested objects; flatten the messages before inserting them into the Kafka topic, for example like this (a matching flat table sketch follows the example):
{
..
"authkey":"key",
"sendAddress_CommChannelTypeId":4,
"sendAddress_CommChannelTypeCode":"SMS",
"sendAddress":"789345345945",
..
}
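A Kafka-engine table for this flattened message might then look like the sketch below; the table name tab_kfk_flat is made up, only the fields shown above are included, and the SETTINGS are copied from the question:
CREATE TABLE tab_kfk_flat
(
    authkey String,
    sendAddress_CommChannelTypeId Int32,
    sendAddress_CommChannelTypeCode String,
    sendAddress String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'localhost:9092',
         kafka_topic_list = 'topk2',
         kafka_group_name = 'group1',
         kafka_format = 'JSONEachRow',
         kafka_row_delimiter = '\n';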
Use the Nested data structure, which requires changing both the JSON-message schema and the table schema:
{
..
"authkey":"key",
"sendAddress.sendCommChannelTypeId":[4],
"sendAddress.sendCommChannelTypeCode":["SMS"],
"sendAddress.sendAddress":["789345345945"],
..
}
CREATE TABLE tab_kfk
(
applicationId Int32,
..
sendAddress Nested(
sendCommChannelTypeId Int32,
sendCommChannelTypeCode String,
sendAddress String),
..
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'localhost:9092',
kafka_topic_list = 'topk2',
kafka_group_name = 'group1',
kafka_format = 'JSONEachRow',
kafka_row_delimiter = '\n',
input_format_import_nested_json = 1 /* <--- */
Take into account the setting input_format_import_nested_json.
Interpret the input JSON message as a string and parse it manually (see GitHub issue #16969):
CREATE TABLE tab_kfk
(
message String
)
ENGINE = Kafka
SETTINGS
..
kafka_format = 'JSONAsString', /* <--- */
..
CREATE MATERIALIZED VIEW tab_mv TO tab
AS
SELECT
..
JSONExtractString(message, 'authkey') AS authkey,
JSONExtract(message, 'sendAddress', 'Tuple(Int32,String,String)') AS sendAddress,
..
FROM tab_kfk;
Reading this commit, I believe that, as of release 23.1, a new setting, input_format_json_read_objects_as_strings, allows converting nested JSON to String.
Example:
SET input_format_json_read_objects_as_strings = 1;
CREATE TABLE test (id UInt64, obj String, date Date) ENGINE=Memory();
INSERT INTO test FORMAT JSONEachRow {"id" : 1, "obj" : {"a" : 1, "b" : "Hello"}, "date" : "2020-01-01"};
SELECT * FROM test;
Result:
id | obj                      | date
1  | {"a" : 1, "b" : "Hello"} | 2020-01-01
The setting is described in the docs.
Of course, you can still use a materialized view to convert the String object to the correct column types, using the same technique as for the JSONAsString format; a minimal sketch follows.
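A minimal sketch, building on the test table from the example above; the target table test_parsed and the view name are hypothetical:
CREATE TABLE test_parsed (id UInt64, a Int32, b String, date Date)
ENGINE = MergeTree ORDER BY id;

CREATE MATERIALIZED VIEW test_parsed_mv TO test_parsed AS
SELECT
    id,
    JSONExtractInt(obj, 'a') AS a,
    JSONExtractString(obj, 'b') AS b,
    date
FROM test;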
I have a topic over which I am sending JSON in the following format:
{
"schema": {
"type": "string",
"optional": true
},
"payload": “CustomerData{version='1', customerId=‘76813432’, phone=‘76813432’}”
}
and I would like to create a stream with customerId and phone, but I am not sure how to define the stream in terms of a nested JSON object.
CREATE STREAM customer (
payload.version VARCHAR,
payload.customerId VARCHAR,
payload.phone VARCHAR
) WITH (
KAFKA_TOPIC='customers',
VALUE_FORMAT='JSON'
);
Would it be something like that? How do I dereference the nested object when defining the stream's fields?
Actually, the above does not work; for the field definitions it says:
Caused by: line 2:12:
extraneous input '.' expecting {'EMIT', 'CHANGES',
'INTEGER', 'DATE', 'TIME', 'TIMESTAMP', 'INTERVAL', 'YEAR', 'MONTH', 'DAY',
Applying function extractjsonfield
There is a ksqlDB function called EXTRACTJSONFIELD that you can use.
First, you need to extract the schema and payload fields:
CREATE STREAM customer (
schema VARCHAR,
payload VARCHAR
) WITH (
KAFKA_TOPIC='customers',
VALUE_FORMAT='JSON'
);
Then you can select the nested fields within the json:
SELECT EXTRACTJSONFIELD(payload, '$.version') AS version FROM customer;
However, it looks like your payload data does not have a valid JSON format.
Applying a STRUCT schema
If your entire payload is encoded as proper JSON, which means your data looks like:
{
"schema": {
"type": "string",
"optional": true
},
"payload": {
"version"="1",
"customerId"="76813432",
"phone"="76813432"
}
}
you can define a STRUCT as below:
CREATE STREAM customer (
schema STRUCT<
type VARCHAR,
optional BOOLEAN>,
payload STRUCT<
version VARCHAR,
customerId VARCHAR,
phone VARCHAR>
)
WITH (
KAFKA_TOPIC='customers',
VALUE_FORMAT='JSON'
);
and finally referencing individual fields can be done like this:
CREATE STREAM customer_analysis AS
SELECT
payload->version as VERSION,
payload->customerId as CUSTOMER_ID,
payload->phone as PHONE
FROM customer
EMIT CHANGES;