I just started using ksql. When I print the topic from the beginning I get data in the format below.
rowtime: 4/12/20, 9:00:05 AM MDT, key: {"messageId":null}, value: {"WHS":[{"Character Set":"UTF-8","action":"finished","Update-Date-Time":"2020-04-11 09:00:02:25","Number":0,"Abbr":"","Name":"","Name2":"","Country-Code":"","Addr-1":"","Addr-2":"","Addr-3":"","Addr-4":"","City":"","State":""}]}
But all the examples for KSQL have the data in the format below:
{"ROWTIME":1537436551210,"ROWKEY":"3375","rating_id":3375,"user_id":2,"stars":3,"route_id":6972,"rating_time":1537436551210,"channel":"web","message":"airport refurb looks great, will fly outta here more!"}
so I'm not able to perform any operations. The format is showing as
Key format: JSON or SESSION(KAFKA_STRING) or HOPPING(KAFKA_STRING) or TUMBLING(KAFKA_STRING) or KAFKA_STRING
Value format: JSON or KAFKA_STRING
on my topic. How can I convert the data into the required format?
Thanks
ksqlDB does not yet support JSON message keys (see the tracking GitHub issue).
However, you can still access the data, both in the key and the value. The JSON key is just a string after all!
The value, when reformatted, looks like this:
{
  "WHS": [
    {
      "Character Set": "UTF-8",
      "action": "finished",
      "Update-Date-Time": "2020-04-11 09:00:02:25",
      "Number": 0,
      "Abbr": "",
      "Name": "",
      "Name2": "",
      "Country-Code": "",
      "Addr-1": "",
      "Addr-2": "",
      "Addr-3": "",
      "Addr-4": "",
      "City": "",
      "State": ""
    }
  ]
}
Assuming all rows share a common format, this is something ksqlDB can handle easily.
To import your stream you should be able to run something like this:
-- assuming v0.9 of ksqlDB
create stream stuff
  (
    ROWKEY STRING KEY,
    WHS ARRAY<
      STRUCT<
        `Character Set` STRING,
        action STRING,
        `Update-Date-Time` STRING,
        Number STRING,
        ... etc
      >
    >
  )
  WITH (kafka_topic='?', value_format='JSON');
The value column WHS is an array of structs (where there will be only one element), and the struct defines all the fields you need to access. Note that some field names needed quoting because they contain invalid characters, e.g. spaces and dashes.
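Once the stream is defined, both the nested value fields and the JSON string key should be accessible. For example (a sketch against the stream definition above; the 1-based array indexing and EMIT CHANGES assume a recent ksqlDB version):

-- arrays are 1-indexed, '->' dereferences the struct, and EXTRACTJSONFIELD
-- digs the messageId out of the JSON string key
SELECT
  EXTRACTJSONFIELD(ROWKEY, '$.messageId') AS message_id,
  WHS[1]->action AS action,
  WHS[1]->`Update-Date-Time` AS update_date_time
FROM stuff
EMIT CHANGES;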
I need to copy messages from one Kafka topic to another based on a specific JSON property. That is, if the property value is "A" - copy the message, otherwise do not copy. I'm trying to figure out the simplest way to do it with KSQL. My source messages all have my test property, but otherwise have very different and complex schemas. Is there a way to have a "schemaless" setup for this?
Source message (example):
{
  "data": {
    "propertyToCheck": "value",
    ... complex structure ...
  }
}
If I define my "data" as VARCHAR in the stream I can examine the property further on with EXTRACTJSONFIELD.
CREATE OR REPLACE STREAM Test1 (
  `data` VARCHAR
)
WITH (
  kafka_topic = 'Source_Topic',
  value_format = 'JSON'
);
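For example, the check on the property would then look something like this (just a sketch of the idea):

SELECT `data`
FROM Test1
WHERE EXTRACTJSONFIELD(`data`, '$.propertyToCheck') = 'A'
EMIT CHANGES;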
In this case, however, my "select" stream will produce data as a JSON string instead of raw JSON (which is what I want).
CREATE OR REPLACE STREAM Test2 WITH (
  kafka_topic = 'Target_Topic',
  value_format = 'JSON'
) AS
SELECT
  `data` AS `data`
FROM Test1
EMIT CHANGES;
Any ideas how to make this work?
This is a bit of a workaround, but you can achieve your desired behavior as follows: rather than defining your message schema as VARCHAR, use the BYTES type instead. Then use FROM_BYTES in combination with EXTRACTJSONFIELD to read the property you'd like to filter on from the bytes representation.
Here's an example:
Here's a source stream, with nested JSON data, and one example row of data:
CREATE STREAM test (data STRUCT<FOO VARCHAR, BAR VARCHAR>) with (kafka_topic='test', value_format='json', partitions=1);
INSERT INTO test (data) VALUES (STRUCT(FOO := 'foo', BAR := 'bar'));
Now, represent the data as bytes (using the KAFKA format), instead of as JSON:
CREATE STREAM test_bytes (data BYTES) WITH (kafka_topic='test', value_format='kafka');
Next, perform the filter based on the nested JSON data:
CREATE STREAM test_filtered_bytes WITH (kafka_topic='test_filtered') AS SELECT * FROM test_bytes WHERE extractjsonfield(from_bytes(data, 'utf8'), '$.DATA.FOO') = 'foo';
The newly created topic "test_filtered" now has data in proper JSON format, analogous to the source stream "test". We can verify by representing the stream in the original format and reading it back to check:
CREATE STREAM test_filtered (data STRUCT<FOO VARCHAR, BAR VARCHAR>) WITH (kafka_topic='test_filtered', value_format='json');
SELECT * FROM test_filtered EMIT CHANGES;
I verified that these example statements work for me as of the latest ksqlDB version (0.27.2). They should work the same on all ksqlDB versions ever since the BYTES type and relevant built-in functions were introduced.
Using ksqlDB scalar functions such as EXTRACTJSONFIELD or JSON_RECORDS might help you.
I want to format my records in the topic: the key is 1-2MDWE7JT, and I want to convert it into key: {"some_id" : "1-2MDWE7JT"}.
How can I wrap a field into JSON with ksql functions? I couldn't find a reverse function for "EXTRACTJSONFIELD", like an "INCLUDEJSONFIELD".
If the data is written as key: 1-2MDWE7JT in the input topic, you can only read it this way, i.e., the schema would be a plain VARCHAR type. You could wrap the key into a STRUCT dynamically at query time, though:
SELECT STRUCT(f1 := v1, f2 := v2) FROM s1;
Cf https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/select-pull-query/#struct-output
Thus, you would create an output STREAM that contains the key wrapped the way you want (if you use the JSON key format).
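A rough sketch of what that could look like (the stream and column names here are made up, and the exact re-keying syntax depends on your ksqlDB version, since STRUCT keys require a newer release with KEY_FORMAT='JSON'):

-- hypothetical input stream: plain string key plus one value column
CREATE STREAM s1 (old_key VARCHAR KEY, payload VARCHAR)
  WITH (kafka_topic='input_topic', value_format='JSON');

-- wrap the key into a STRUCT and write it out as a JSON object key;
-- the backticks keep the field name lowercase ("some_id")
CREATE STREAM s1_rekeyed
  WITH (kafka_topic='output_topic', key_format='JSON', value_format='JSON') AS
  SELECT STRUCT(`some_id` := old_key) AS new_key, payload
  FROM s1
  PARTITION BY STRUCT(`some_id` := old_key)
  EMIT CHANGES;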
I have a streaming application which is processing a streaming DataFrame with column "body" that contains a JSON string.
So the body contains something like the following (these are four input rows):
{"id":1, "ts":1557994974, "details":[{"id":1,"attr2":3,"attr3":"something"}, {"id":2,"attr2":3,"attr3":"something"}]}
{"id":2, "ts":1557994975, "details":[{"id":1,"attr2":"3","attr3":"something"}, {"id":2,"attr2":"3","attr3":"something"},{"id":3,"attr2":"3","attr3":"something"}]}
{"id":3, "ts":1557994976, "details":[{"id":1,"attr2":3,"attr3":"something"}, {"id":2,"attr2":3}]}
{"id":4, "ts":1557994977, "details":[]}
I would like to check that each row has the correct schema (data types and contains all attributes). I would like to filter out and log the invalid records somewhere (like a Parquet file). I am especially interested in the "details" array - each of the nested documents must have specified fields and correct data types.
So in the example above only row id = 1 is valid.
I was thinking about a case class such as:
case class Detail(
  id: Int,
  attr2: Int,
  attr3: String
)

case class Input(
  id: Int,
  ts: Long,
  details: Seq[Detail]
)
and using Try, but I'm not sure how to go about it.
Could someone help, please?
Thanks
One approach is to use JSON Schema, which can help you with schema validation on the data. The getting started page is a good place to start if you're new to it.
The other approach would roughly work as follows:
Build models (case classes) for each of the objects, like you've attempted in your question.
Use a JSON library like Spray JSON / Play-JSON to parse the input JSON (see the sketch after this list).
Any input that fails to parse into a valid record is most likely invalid, and you can partition that output into a different sink in your Spark code. It would also make this more robust if you had an isValid method on the objects that can validate whether a parsed record is correct.
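A minimal sketch of that second approach, assuming Spray JSON and the case classes from the question (the parseBody helper and the commented usage are illustrative only):

import scala.util.Try
import spray.json._

case class Detail(id: Int, attr2: Int, attr3: String)
case class Input(id: Int, ts: Long, details: Seq[Detail])

object InputJsonProtocol extends DefaultJsonProtocol {
  implicit val detailFormat: RootJsonFormat[Detail] = jsonFormat3(Detail)
  implicit val inputFormat: RootJsonFormat[Input] = jsonFormat3(Input)
}
import InputJsonProtocol._

// Returns Some(parsed) when the body matches the expected schema, None otherwise.
// Rows with wrong types or missing fields (ids 2 and 3 above) fail to parse;
// an extra isValid check (e.g. requiring non-empty details) could reject rows like id 4.
def parseBody(body: String): Option[Input] =
  Try(body.parseJson.convertTo[Input]).toOption

// In the streaming job you could then split valid and invalid rows, e.g.:
// val flagged = df.select($"body").as[String].map(b => (b, parseBody(b).isDefined))
// val valid   = flagged.filter(_._2).map(_._1)
// val invalid = flagged.filter(!_._2).map(_._1)  // write these out, e.g. to Parquet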
The easiest way for me is to create a DataFrame with a schema and then filter with id == 1. This is not the most efficient way.
Here you can find an example of creating a DataFrame with a schema: https://blog.antlypls.com/blog/2016/01/30/processing-json-data-with-sparksql/
Edit
I can't find a pre-filtering option to speed up the JSON search in Scala, but you can use one of these three options:
spark.read.schema(mySchema).format("json").load("myPath").filter($"id" === 1)
or
spark.read.schema(mySchema).json("myPath").filter($"id" === 1)
or
spark.read.json("myPath").filter($"id" === 1)
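If it helps, the mySchema referenced above could be derived from the question's case classes instead of being written by hand (a sketch, assuming the Input case class is in scope):

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

// Derive the expected schema from the Input case class defined in the question.
val mySchema: StructType = Encoders.product[Input].schema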
My kafka topic is pushing data in this format (coming from collectd):
[{"values":[100.000080140372],"dstypes":["derive"],"dsnames":["value"],"time":1529970061.145,"interval":10.000,"host":"k5.orch","plugin":"cpu","plugin_instance":"23","type":"cpu","type_instance":"idle","meta":{"network:received":true}}]
It's a combination of arrays, ints, and floats... and the whole thing is inside a JSON array. As a result I'm having a heck of a time using ksql to do anything with this data.
When I create a 'default' stream as
create stream cd_temp with (kafka_topic='ctd_test', value_format='json');
I get this result:
ksql> describe cd_temp;
 Field   | Type
-------------------------------------
 ROWTIME | BIGINT           (system)
 ROWKEY  | VARCHAR(STRING)  (system)
-------------------------------------
Any select will return the ROWTIME and an 8-digit hex value for ROWKEY.
I've spent some time trying to extract the json fields to no avail. What concerns me is this:
ksql> print 'ctd_test' from beginning;
Format:JSON
com.fasterxml.jackson.databind.node.ArrayNode cannot be cast to com.fasterxml.jackson.databind.node.ObjectNode
Is it possible that this topic can't be used in ksql? Is there a technique for unpacking the outer array to get to the interesting bits inside?
At the time of writing (June 2018), KSQL can't handle a JSON message where the whole thing is embedded inside a top-level array. There is a GitHub issue to track this. I'd suggest adding a +1 vote on that issue to raise its priority.
Also, I notice that your create stream statement does not define the schema of the JSON message. While this won't help in this situation, it is something that you'll need for other JSON input formats, i.e. your create statement should be something like:
create stream cd_temp (values ARRAY<DOUBLE>, dstypes ARRAY<VARCHAR>, etc) with (kafka_topic='ctd_test', value_format='json');
I'm trying to use the sstablekeys utility for Cassandra to retrieve all the keys currently in the sstables for a Cassandra cluster. They come back in what appears to be a serialized format when I run sstablekeys on a single sstable. Does anyone know how to deserialize the keys or get them back into their original format? They were inserted into Cassandra using Astyanax, where the serialized type is a tuple in Scala. The key_validation_class for the column family is a CompositeType.
Thanks to the comment by Thomas Stets, I was able to figure out that the keys are actually just converted to hex and printed out. See here for a way of getting them back to their original format.
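As a small illustration (not the code behind the "here" link), converting that hex output back into readable text could look like this, assuming the bytes are plain UTF-8:

// Decode a hex key string, as printed by sstablekeys, back into a UTF-8 string.
def unhexify(hex: String): String = {
  val bytes = hex.grouped(2).map(pair => Integer.parseInt(pair, 16).toByte).toArray
  new String(bytes, "UTF-8")
}
// e.g. unhexify("666f6f") == "foo"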
For the specific problem of figuring out the format of a CompositeType row key and unhexifying it, see the Cassandra source, which describes the format of a CompositeType key as output by sstablekeys. With CompositeType(UTF8Type, UTF8Type, Int32Type), the UTF8Type components treat bytes as ASCII characters (so the function linked above works in this case), but with Int32Type you must interpret the bytes as a single number.
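To make the component layout concrete, here is a rough sketch of walking such a key, assuming the layout described in the Cassandra source (a 2-byte length, the component bytes, and a trailing end-of-component byte per component); the helper name is made up:

import java.nio.ByteBuffer

// Split a raw CompositeType key into its component byte arrays.
def components(key: Array[Byte]): List[Array[Byte]] = {
  val buf = ByteBuffer.wrap(key)
  var parts = List.empty[Array[Byte]]
  while (buf.hasRemaining) {
    val len = buf.getShort & 0xFFFF   // 2-byte big-endian component length
    val part = new Array[Byte](len)
    buf.get(part)
    buf.get()                         // skip the end-of-component byte
    parts = parts :+ part
  }
  parts
}

// Usage for CompositeType(UTF8Type, UTF8Type, Int32Type), where rawKeyBytes are
// the unhexified key bytes:
//   val parts = components(rawKeyBytes)
//   new String(parts(0), "UTF-8"); new String(parts(1), "UTF-8")
//   ByteBuffer.wrap(parts(2)).getInt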