I need to copy messages from one Kafka topic to another based on a specific JSON property. That is, if the property value is "A", copy the message; otherwise do not. I'm trying to figure out the simplest way to do this with KSQL. My source messages all have my test property, but otherwise have very different and complex schemas. Is there a way to have a "schemaless" setup for this?
Source message (example):
{
"data": {
"propertyToCheck": "value",
... complex structure ...
}
}
If I define my "data" as VARCHAR in the stream I can examine the property further on with EXTRACTJSONFIELD.
CREATE OR REPLACE STREAM Test1 (
`data` VARCHAR
)
WITH (
kafka_topic = 'Source_Topic',
value_format = 'JSON'
);
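Filtering on that property then works fine, e.g. with something like this (using the example value "A" from above):
SELECT *
FROM Test1
WHERE EXTRACTJSONFIELD(`data`, '$.propertyToCheck') = 'A'
EMIT CHANGES;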
In this case, however, my "select" stream will produce data as a JSON string instead of raw JSON (which is what I want).
CREATE OR REPLACE STREAM Test2 WITH (
kafka_topic = 'Target_Topic',
value_format = 'JSON'
) AS
SELECT
`data` AS `data`
FROM Test1
EMIT CHANGES;
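For example, with the source message above, Target_Topic ends up with the inner JSON escaped as a string, roughly:
{"data": "{\"propertyToCheck\":\"value\", ... }"}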
Any ideas how to make this work?
This is a bit of a workaround, but you can achieve your desired behavior as follows: instead of defining your message schema as VARCHAR, use the BYTES type. Then use FROM_BYTES in combination with EXTRACTJSONFIELD to read the property you'd like to filter on from the bytes representation.
Here's an example. First, a source stream with nested JSON data, and one example row of data:
CREATE STREAM test (data STRUCT<FOO VARCHAR, BAR VARCHAR>) with (kafka_topic='test', value_format='json', partitions=1);
INSERT INTO test (data) VALUES (STRUCT(FOO := 'foo', BAR := 'bar'));
Now, represent the data as bytes (using the KAFKA format), instead of as JSON:
CREATE STREAM test_bytes (data BYTES) WITH (kafka_topic='test', value_format='kafka');
Next, perform the filter based on the nested JSON data:
CREATE STREAM test_filtered_bytes WITH (kafka_topic='test_filtered') AS SELECT * FROM test_bytes WHERE extractjsonfield(from_bytes(data, 'utf8'), '$.DATA.FOO') = 'foo';
The newly created topic "test_filtered" now has data in proper JSON format, analogous to the source stream "test". We can verify this by declaring a stream over the filtered topic in the original format and reading it back:
CREATE STREAM test_filtered (data STRUCT<FOO VARCHAR, BAR VARCHAR>) WITH (kafka_topic='test_filtered', value_format='json');
SELECT * FROM test_filtered EMIT CHANGES;
I verified that these example statements work for me as of the latest ksqlDB version at the time of writing (0.27.2). They should work the same on all ksqlDB versions since the BYTES type and the relevant built-in functions were introduced.
Using ksqlDB scalar functions such as EXTRACTJSONFIELD or JSON_RECORDS might help you.
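For instance, a quick sketch against the Test1 stream above (assuming ksqlDB 0.24 or newer, where the JSON functions are available):
-- EXTRACTJSONFIELD pulls a single value out of the JSON text;
-- JSON_RECORDS splits the object into a MAP<STRING, STRING> of its
-- top-level key/value pairs.
SELECT
  EXTRACTJSONFIELD(`data`, '$.propertyToCheck') AS property_to_check,
  JSON_RECORDS(`data`) AS top_level_fields
FROM Test1
EMIT CHANGES;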
I want to reformat the keys of my records in a topic: the key is 1-2MDWE7JT, and I want to convert it into key: {"some_id" : "1-2MDWE7JT"}.
How can I wrap a field into JSON with ksql functions? I didn't find a reverse function for "EXTRACTJSONFIELD", something like "INCLUDEJSONFIELD".
If the data is written as key: 1-2MDWE7JT in the input topic, you can only read it this way, i.e., the schema would be a plain VARCHAR type. You could wrap the key into a STRUCT dynamically at query time though:
SELECT STRUCT(f1 := v1, f2 := v2) FROM s1;
Cf https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/select-pull-query/#struct-output
Thus, you would create an output STREAM that contains the key wrapped the way you want (if you use the JSON key format).
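A rough sketch of what that could look like (untested; the stream and column names are made up, and this assumes a ksqlDB version whose JSON key format can serialize STRUCT keys):
-- Hypothetical input stream: a plain STRING key plus one value column.
CREATE STREAM s1 (k STRING KEY, v STRING)
  WITH (kafka_topic='input_topic', value_format='JSON');

-- Re-key by a STRUCT built from the old key; with key_format='JSON'
-- the new key is written as a JSON object, e.g. {"SOME_ID":"1-2MDWE7JT"}.
CREATE STREAM s1_wrapped
  WITH (kafka_topic='output_topic', key_format='JSON', value_format='JSON') AS
  SELECT STRUCT(some_id := k) AS wrapped_key, v
  FROM s1
  PARTITION BY STRUCT(some_id := k);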
I just started using KSQL. When I do print topic from beginning, I get data in the format below.
rowtime: 4/12/20, 9:00:05 AM MDT, key: {"messageId":null}, value: {"WHS":[{"Character Set":"UTF-8","action":"finished","Update-Date-Time":"2020-04-11 09:00:02:25","Number":0,"Abbr":"","Name":"","Name2":"","Country-Code":"","Addr-1":"","Addr-2":"","Addr-3":"","Addr-4":"","City":"","State":""}]}
But all the KSQL examples have the data in the format below:
{"ROWTIME":1537436551210,"ROWKEY":"3375","rating_id":3375,"user_id":2,"stars":3,"route_id":6972,"rating_time":1537436551210,"channel":"web","message":"airport refurb looks great, will fly outta here more!"}
so I'm not able to perform any operations; the format is showing as
Key format: JSON or SESSION(KAFKA_STRING) or HOPPING(KAFKA_STRING) or TUMBLING(KAFKA_STRING) or KAFKA_STRING
Value format: JSON or KAFKA_STRING
on my topic. How can I modify the data into that specific format?
Thanks
ksqlDB does not yet support JSON message keys (see the tracking GitHub issue).
However, you can still access the data, both in the key and the value. The JSON key is just a string after all!
The value, when reformatted, looks like this:
{
"WHS":[
{
"Character Set":"UTF-8",
"action":"finished",
"Update-Date-Time":"2020-04-11 09:00:02:25",
"Number":0,
"Abbr":"",
"Name":"",
"Name2":"",
"Country-Code":"",
"Addr-1":"",
"Addr-2":"",
"Addr-3":"",
"Addr-4":"",
"City":"",
"State":""
}
]
}
Which, assuming all rows share a common format, ksqlDB can easily handle.
To import your stream you should be able to run something like this:
-- assuming ksqlDB v0.9
create stream stuff
(
ROWKEY STRING KEY,
WHS ARRAY<
STRUCT<
`Character Set` STRING,
action STRING,
`Update-Date-Time` STRING,
Number STRING,
... etc
>
>
)
WITH (kafka_topic='?', value_format='JSON');
The value column WHS is an array of structs (where there will be only one element), and the struct defines all the fields you need to access. Note that some field names needed quoting because they contain invalid characters, e.g. spaces and dashes.
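From there, a query along these lines should let you at the data (a sketch based on the schema above; note that ksqlDB array indexes start at 1, and EXTRACTJSONFIELD can dig into the stringified JSON key):
SELECT
  EXTRACTJSONFIELD(ROWKEY, '$.messageId') AS message_id,
  WHS[1]->action AS action,
  WHS[1]->`Update-Date-Time` AS update_time
FROM stuff
EMIT CHANGES;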
Desired functionality: for a given key, key123, numerous services are running in parallel and reporting their results to a single location; once all results are gathered for key123, they are passed to a new downstream consumer.
Original idea: Using AWS DynamoDB to hold all results for a given entry. Every time a result is ready a micro-service does a PATCH operation to the database on key123. An output stream checks each UPDATE to see if the entry is complete, if so, it is forwarded downstream.
New idea: use Kafka Streams and KSQL to reach the same goal. All services write their output to the results topic, which forms a changelog KStream that we query with KSQL for completed entries. Something like:
CREATE STREAM completed_results FROM results_stream SELECT * WHERE (all results != NULL).
The part I'm not sure how to do is the PATCH-like operation on the stream: how do I have the output stream show the accumulation of all messages for key123, instead of just the most recent one?
KSQL users, does this even make sense? Am I close to a solution that someone has done before?
If you can produce all your events to the same topic, with the key set, then you can collect all of the events for a specific key using an aggregation in ksqlDB such as:
CREATE STREAM source (
KEY INT KEY, -- example key to group by
EVENT STRING -- example event to collect
) WITH (
kafka_topic='source', -- or whatever your source topic is called.
value_format='json' -- or whatever value format you need.
);
CREATE TABLE agg AS
SELECT
key,
COLLECT_LIST(event) as events
FROM source
GROUP BY key;
This will create a changelog topic called AGG by default. As new events are received for a specific key on the source topic, ksqlDB will produce messages to the AGG topic, with the key set to key and the value containing the list of all the events seen for that key.
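For example (illustrative values only): after events 'a' and 'b' arrive on the source topic for key 1, the AGG topic holds a message for key 1 whose value looks something like:
{"EVENTS": ["a", "b"]}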
You can then import this changelog as a stream:
CREATE STREAM agg_stream (
KEY INT KEY,
EVENTS ARRAY<STRING>
) WITH (
kafka_topic='AGG',
value_format='json'
);
And you can then apply some criteria to filter the stream to only include your final results:
CREATE STREAM completed_results AS
SELECT
*
FROM agg_stream
WHERE ARRAY_LEN(EVENTS) = 5; -- example 'complete' criteria.
You may even want to use a user-defined function to define your complete criteria:
CREATE STREAM completed_results AS
SELECT
*
FROM agg_stream
WHERE IS_COMPLETE(EVENTS);
I have a custom UDF called getCityStats(string city, double distance) which takes two arguments and returns an array of JSON strings (objects) as follows:
{"zipCode":"90921","mode":3.54}
{"zipCode":"91029","mode":7.23}
{"zipCode":"96928","mode":4.56}
{"zipCode":"90921","mode":6.54}
{"zipCode":"91029","mode":4.43}
{"zipCode":"96928","mode":3.96}
I would like to process them in a KSQL table creation query as
create table city_stats
as
select
zipCode,
avg(mode) as mode
from
(select
getCityStats(city,distance) as (zipCode,mode)
from
city_data_stream
) t
group by zipCode;
In other words, can KSQL handle a tuple type, where an array of JSON strings can be processed and returned as indicated above in a table creation query?
No, KSQL doesn't currently support the syntax you're suggesting. Whilst KSQL can work with arrays, it doesn't yet support any kind of explode function, so you can only reference specific index points in the array.
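For example, something along these lines is possible (a sketch; city_stats_raw is a made-up intermediate stream, and note this only ever reads a fixed element, it can't fan the array out into rows):
-- Materialize the UDF result once...
CREATE STREAM city_stats_raw AS
  SELECT getCityStats(city, distance) AS stats
  FROM city_data_stream;

-- ...then pick a fixed index and parse the JSON text at that position
-- (array indexing was 0-based in KSQL at the time).
SELECT EXTRACTJSONFIELD(stats[0], '$.zipCode') AS zip_code,
       EXTRACTJSONFIELD(stats[0], '$.mode') AS mode
FROM city_stats_raw;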
Feel free to view, and upvote if appropriate, these issues: #527 and #1830. Or indeed raise your own if they don't cover what you want to do.
My Kafka topic is pushing data in this format (coming from collectd):
[{"values":[100.000080140372],"dstypes":["derive"],"dsnames":["value"],"time":1529970061.145,"interval":10.000,"host":"k5.orch","plugin":"cpu","plugin_instance":"23","type":"cpu","type_instance":"idle","meta":{"network:received":true}}]
It's a combination of arrays, ints, and floats, and the whole thing is inside a JSON array. As a result, I'm having a heck of a time using KSQL to do anything with this data.
When I create a 'default' stream as
create stream cd_temp with (kafka_topic='ctd_test', value_format='json');
I get this result:
ksql> describe cd_temp;
Field | Type
-------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
-------------------------------------
Any select will return the ROWTIME and an 8-digit hex value for ROWKEY.
I've spent some time trying to extract the JSON fields, to no avail. What concerns me is this:
ksql> print 'ctd_test' from beginning;
Format:JSON
com.fasterxml.jackson.databind.node.ArrayNode cannot be cast to com.fasterxml.jackson.databind.node.ObjectNode
Is it possible that this topic can't be used in ksql? Is there a technique for unpacking the outer array to get to the interesting bits inside?
At the time of writing (June 2018), KSQL can't handle a JSON message where the whole thing is embedded inside a top-level array. There is a GitHub issue to track this; I'd suggest adding a +1 vote on the issue to raise its priority.
Also, I notice that your create stream statement doesn't define the schema of the JSON message. While this won't help in this particular situation, it's something you'll need for other JSON input formats, i.e. your create statement should be something like:
create stream cd_temp (values ARRAY<DOUBLE>, dstypes ARRAY<VARCHAR>, etc) with (kafka_topic='ctd_test', value_format='json');