ksql - creating a stream from a json array - apache-kafka

My Kafka topic is pushing data in this format (coming from collectd):
[{"values":[100.000080140372],"dstypes":["derive"],"dsnames":["value"],"time":1529970061.145,"interval":10.000,"host":"k5.orch","plugin":"cpu","plugin_instance":"23","type":"cpu","type_instance":"idle","meta":{"network:received":true}}]
It's a combination of arrays, ints, and floats... and the whole thing is inside a JSON array. As a result, I'm having a heck of a time using KSQL to do anything with this data.
When I create a 'default' stream as
create stream cd_temp with (kafka_topic='ctd_test', value_format='json');
I get this result:
ksql> describe cd_temp;
Field | Type
-------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
-------------------------------------
Any select will return the ROWTIME and an 8-digit hex value for ROWKEY.
I've spent some time trying to extract the json fields to no avail. What concerns me is this:
ksql> print 'ctd_test' from beginning;
Format:JSON
com.fasterxml.jackson.databind.node.ArrayNode cannot be cast to com.fasterxml.jackson.databind.node.ObjectNode
Is it possible that this topic can't be used in ksql? Is there a technique for unpacking the outer array to get to the interesting bits inside?

At the time of writing (June 2018), KSQL can't handle a JSON message where the whole payload is embedded inside a top-level array. There is a GitHub issue tracking this; I'd suggest adding a +1 vote on that issue to raise its priority.
Also, I notice that your CREATE STREAM statement doesn't define the schema of the JSON message. While that won't help in this situation, it is something you'll need for other JSON input formats, i.e. your CREATE statement should be something like:
create stream cd_temp (values ARRAY<DOUBLE>, dstypes ARRAY<VARCHAR>, etc) with (kafka_topic='ctd_test', value_format='json');
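For reference, a fuller column list derived from the sample collectd record above might look like the sketch below. The type choices are assumptions based on that one record, and this still won't read the topic until the top-level-array limitation is fixed:
-- sketch only: types inferred from the sample record; names such as values, time,
-- interval and type may clash with reserved words and need quoting or renaming
create stream cd_temp
(
  values ARRAY<DOUBLE>,
  dstypes ARRAY<VARCHAR>,
  dsnames ARRAY<VARCHAR>,
  time DOUBLE,
  interval DOUBLE,
  host VARCHAR,
  plugin VARCHAR,
  plugin_instance VARCHAR,
  type VARCHAR,
  type_instance VARCHAR,
  meta MAP<VARCHAR, BOOLEAN>
)
with (kafka_topic='ctd_test', value_format='json');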

Related

KSQL: How to cast JSON string to raw JSON

I need to copy messages from one Kafka topic to another based on a specific JSON property. That is, if the property value is "A", copy the message; otherwise do not copy. I'm trying to figure out the simplest way to do it with KSQL. My source messages all have my test property, but otherwise have very different and complex schemas. Is there a way to have a "schemaless" setup for this?
Source message (example):
{
  "data": {
    "propertyToCheck": "value",
    ... complex structure ...
  }
}
If I define my "data" as VARCHAR in the stream I can examine the property further on with EXTRACTJSONFIELD.
CREATE OR REPLACE STREAM Test1 (
`data` VARCHAR
)
WITH (
kafka_topic = 'Source_Topic',
value_format = 'JSON'
);
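For example, the property check itself would be something along these lines (a sketch, using the propertyToCheck field and the "A" value from my example above):
-- sketch: filter on the nested property via the VARCHAR column
SELECT *
FROM Test1
WHERE EXTRACTJSONFIELD(`data`, '$.propertyToCheck') = 'A'
EMIT CHANGES;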
In this case, however, my "select" stream will produce the data as a JSON string instead of raw JSON (which is what I want).
CREATE OR REPLACE STREAM Test2 WITH (
kafka_topic = 'Target_Topic',
value_format = 'JSON'
) AS
SELECT
`data` AS `data`
FROM Test1
EMIT CHANGES;
Any ideas how to make this work?
This is a bit of a workaround, but you can achieve the desired behavior as follows: instead of defining your message schema as VARCHAR, use the BYTES type. Then use FROM_BYTES in combination with EXTRACTJSONFIELD to read the property you'd like to filter on from the bytes representation.
Here's an example:
Here's a source stream, with nested JSON data, and one example row of data:
CREATE STREAM test (data STRUCT<FOO VARCHAR, BAR VARCHAR>) with (kafka_topic='test', value_format='json', partitions=1);
INSERT INTO test (data) VALUES (STRUCT(FOO := 'foo', BAR := 'bar'));
Now, represent the data as bytes (using the KAFKA format), instead of as JSON:
CREATE STREAM test_bytes (data BYTES) WITH (kafka_topic='test', value_format='kafka');
Next, perform the filter based on the nested JSON data:
CREATE STREAM test_filtered_bytes WITH (kafka_topic='test_filtered') AS SELECT * FROM test_bytes WHERE extractjsonfield(from_bytes(data, 'utf8'), '$.DATA.FOO') = 'foo';
The newly created topic "test_filtered" now has data in proper JSON format, analogous to the source stream "test". We can verify by representing the stream in the original format and reading it back to check:
CREATE STREAM test_filtered (data STRUCT<FOO VARCHAR, BAR VARCHAR>) WITH (kafka_topic='test_filtered', value_format='json');
SELECT * FROM test_filtered EMIT CHANGES;
I verified that these example statements work for me as of the latest ksqlDB version (0.27.2). They should work the same on all ksqlDB versions ever since the BYTES type and relevant built-in functions were introduced.
Using ksqlDB scalar functions such as EXTRACTJSONFIELD or JSON_RECORDS might help you.
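For instance, a rough sketch (assuming ksqlDB 0.24 or later, where the JSON scalar functions are available; JSON_RECORDS returns the top-level members of a JSON string as a map):
-- sketch: inspect the top-level members of the JSON string column
-- (EXTRACTJSONFIELD, shown earlier, reads a single property instead)
SELECT JSON_RECORDS(`data`) AS top_level_fields
FROM Test1
EMIT CHANGES;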

With KSQL, why does my table keep data with older ROWTIME and discard updates with newer ROWTIME?

I have a process that feeds relatively simple vehicle data into a Kafka topic. The records are keyed by registration, and the values contain things like latitude/longitude etc., plus a field called DateTime, which is a timestamp from the sensor that took the readings (not from the producer or the cluster).
My data arrives out of order in general, and especially so if I keep pumping the same test data set into the vehicle-update-log topic over and over. My data set contains two records for the vehicle I'm testing with.
My expectation is that when I do a select on the table, that it will return one row with the most recent data based on the ROWTIME of the records. (I've verified that the ROWTIME is getting set correctly.)
What happens instead is that the result has both rows (for the same primary KEY) and the last value is the oldest ROWTIME.
I'm confused; I thought KSQL would keep only the most recent update. Must I now write additional logic on the client side to pick the latest of the data I get?
I created the table like this:
CREATE TABLE vehicle_updates
(
Latitude DOUBLE,
Longitude DOUBLE,
DateTime BIGINT,
Registration STRING PRIMARY KEY
)
WITH
(
KAFKA_TOPIC = 'vehicle-update-log',
VALUE_FORMAT = 'JSON_SR',
TIMESTAMP = 'DateTime'
);
Here is my query:
SELECT
registration,
ROWTIME,
TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss.SSS', 'Africa/Johannesburg') AS rowtime_formatted
FROM vehicle_updates
WHERE registration = 'BT66MVE'
EMIT CHANGES;
Results while no data is flowing:
+------------------------------+------------------------------+------------------------------+
|REGISTRATION |ROWTIME |ROWTIME_FORMATTED |
+------------------------------+------------------------------+------------------------------+
|BT66MVE |1631532052000 |2021-09-13 13:20:52.000 |
|BT66MVE |1631527147000 |2021-09-13 11:59:07.000 |
Here's the same query, but this time I'm pumping the data set into the topic again while the query is running. I'm surprised to be getting the older record as an update.
Results while feeding data:
+------------------------------+------------------------------+------------------------------+
|REGISTRATION |ROWTIME |ROWTIME_FORMATTED |
+------------------------------+------------------------------+------------------------------+
|BT66MVE |1631532052000 |2021-09-13 13:20:52.000 |
|BT66MVE |1631527147000 |2021-09-13 11:59:07.000 |
|BT66MVE |1631527147000 |2021-09-13 11:59:07.000 |
What gives?
In the end, it's an issue in Kafka Streams that is not easy to resolve: https://issues.apache.org/jira/browse/KAFKA-10493 (we are already working on a long-term solution for it, though).
While event-time-based processing is a central design pillar, there are still some gaps that need to be closed.
The underlying issue is that Kafka itself was originally designed around log-append order only. Timestamps were added later (in the 0.10 release). There are still some cases today (e.g., https://issues.apache.org/jira/browse/KAFKA-7061) in which "offset order" dominates. You are hitting one of those cases.

Topic data format in Kafka for KSQL operations

I just started using KSQL. When I do print topic from beginning, I get data in the format below.
rowtime: 4/12/20, 9:00:05 AM MDT, key: {"messageId":null}, value: {"WHS":[{"Character Set":"UTF-8","action":"finished","Update-Date-Time":"2020-04-11 09:00:02:25","Number":0,"Abbr":"","Name":"","Name2":"","Country-Code":"","Addr-1":"","Addr-2":"","Addr-3":"","Addr-4":"","City":"","State":""}]}
But all the KSQL examples have the data in the format below:
{"ROWTIME":1537436551210,"ROWKEY":"3375","rating_id":3375,"user_id":2,"stars":3,"route_id":6972,"rating_time":1537436551210,"channel":"web","message":"airport refurb looks great, will fly outta here more!"}
so I'm not able to perform any operations; the format is showing as
Key format: JSON or SESSION(KAFKA_STRING) or HOPPING(KAFKA_STRING) or TUMBLING(KAFKA_STRING) or KAFKA_STRING
Value format: JSON or KAFKA_STRING
on my topic. How can I modify the data into the required format?
Thanks
ksqlDB does not yet support JSON message keys (see the tracking GitHub issue).
However, you can still access the data, both in the key and the value. The JSON key is just a string after all!
The value, when reformatted, looks like this:
{
  "WHS": [
    {
      "Character Set": "UTF-8",
      "action": "finished",
      "Update-Date-Time": "2020-04-11 09:00:02:25",
      "Number": 0,
      "Abbr": "",
      "Name": "",
      "Name2": "",
      "Country-Code": "",
      "Addr-1": "",
      "Addr-2": "",
      "Addr-3": "",
      "Addr-4": "",
      "City": "",
      "State": ""
    }
  ]
}
Which, assuming all rows share a common format, ksqlDB can easily handle.
To import your stream you should be able to run something like this:
-- assuming v0.9 of ksqlDB
create stream stuff
(
  ROWKEY STRING KEY,
  WHS ARRAY<
    STRUCT<
      `Character Set` STRING,
      action STRING,
      `Update-Date-Time` STRING,
      Number STRING,
      ... etc
    >
  >
)
WITH (kafka_topic='?', value_format='JSON');
The value column WHS is an array of structs (where there will be only one element), and the struct defines all the fields you need to access. Note that some field names needed quoting because they contain invalid characters, e.g. spaces and dashes.
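Because the key is treated as a plain string, you can also pull fields out of it with EXTRACTJSONFIELD. A hypothetical query (not part of the original answer) might look like:
-- sketch: parse the JSON-formatted string key
SELECT
  EXTRACTJSONFIELD(ROWKEY, '$.messageId') AS message_id,
  WHS
FROM stuff
EMIT CHANGES;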

Hbase reading freezes for a few records when reading with partial rowkey

I am reading data from HBase through Spark. The code runs fine when reading the data using a prefix filter with a complete rowkey, or using GET, but it freezes if I use a partial prefix of the rowkey. The rowkey structure is md5OfAkey_Akey_txDate_someKey. I want to read all data matching "Akey" into a data frame. The table has a single column family, 50 column qualifiers, and around 200 million records. When I read using md5OfAkey_Akey_txDate the code gets stuck, while if I construct the whole key it runs fine. But I do not want to pass the whole rowkey, as I want to read all data for a particular account (Akey) and transaction date (txDate). Any help would be appreciated.

How to convert a response from KSQL - UDF returning JSON array to columns

I have a custom UDF called getCityStats(string city, double distance) which takes 2 arguments and returns an array of JSON strings (objects) as follows:
{"zipCode":"90921","mode":3.54}
{"zipCode":"91029","mode":7.23}
{"zipCode":"96928","mode":4.56}
{"zipCode":"90921","mode":6.54}
{"zipCode":"91029","mode":4.43}
{"zipCode":"96928","mode":3.96}
I would like to process them in a KSQL table creation query as
create table city_stats
as
select
zipCode,
avg(mode) as mode
from
(select
getCityStats(city,distance) as (zipCode,mode)
from
city_data_stream
) t
group by zipCode;
In other words, can KSQL handle a tuple type, where an array of JSON strings is processed and returned as indicated above in a table-creation query?
No, KSQL doesn't currently support the syntax you're suggesting. Whilst KSQL can work with arrays, it doesn't yet support any kind of explode function, so you can only reference specific index points in the array.
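For example, what is possible today is to index into the returned array and parse a single element; a rough sketch (array indexing may be 0- or 1-based depending on your KSQL version):
-- sketch: access one element of the UDF's result and parse it with EXTRACTJSONFIELD
SELECT
  EXTRACTJSONFIELD(getCityStats(city, distance)[0], '$.zipCode') AS zipCode,
  CAST(EXTRACTJSONFIELD(getCityStats(city, distance)[0], '$.mode') AS DOUBLE) AS mode
FROM city_data_stream;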
Feel free to view, and upvote if appropriate, these issues: #527 and #1830, or indeed raise your own if they don't cover what you want to do.