I am starting to get into Apache Kafka (Confluent) and have some questions regarding the use of schemas.
First, is my general understanding correct that a schema is used for validating the data? My understanding of schemas is that when data is "produced", the keys and values are checked against the predefined structure and split accordingly.
My current technical setup is as follows:
Python:
from confluent_kafka import Producer
from config import conf
import json
# create producer
producer = Producer(conf)
producer.produce("datagen-topic", json.dumps({"product":"table","brand":"abc"}))
producer.flush()
In Confluent, I set up a JSON key schema for my topic:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"properties": {
"brand": {
"type": "string"
},
"product": {
"type": "string"
}
},
"required": [
"product",
"brand"
],
"type": "object"
}
Now, when I produce the data, the message in Confluent contains only content in "Value". Key and Header are null:
{
"product": "table",
"brand": "abc"
}
Basically, it doesn't make a difference whether I have this schema set up or not, so I guess it's just not working the way I set it up. Can you help me figure out where my thinking is wrong or what my code is missing?
The Confluent Python library's Producer class doesn't interact with the Schema Registry in any way, so your message won't be validated.
You'll want to use SerializingProducer like in the example - https://github.com/confluentinc/confluent-kafka-python/blob/master/examples/json_producer.py
If you want non-null keys and headers, you'll need to pass those to the produce() call.
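For reference, here is a minimal sketch along the lines of the linked json_producer.py example, using the JSON Schema serializer so the value is validated against the registered schema (if you want the key handled the same way, set key.serializer to a JSONSerializer as well). The topic name and schema come from the question; the Schema Registry URL, credentials and bootstrap servers are placeholders you would take from your own cluster settings:
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.json_schema import JSONSerializer
from confluent_kafka.serialization import StringSerializer

schema_str = """
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "brand": {"type": "string"},
    "product": {"type": "string"}
  },
  "required": ["product", "brand"]
}
"""

# Placeholder connection details -- substitute your own
schema_registry_client = SchemaRegistryClient({
    "url": "https://<your-schema-registry-endpoint>",
    "basic.auth.user.info": "<sr-api-key>:<sr-api-secret>",
})

producer = SerializingProducer({
    "bootstrap.servers": "<your-bootstrap-servers>",
    "key.serializer": StringSerializer("utf_8"),
    "value.serializer": JSONSerializer(schema_str, schema_registry_client),
})

# Key, value and headers are all passed explicitly here, so they won't be null
producer.produce(
    "datagen-topic",
    key="table-abc",
    value={"product": "table", "brand": "abc"},
    headers=[("source", b"example")],
)
producer.flush()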
I'm building a recommender system using AWS Personalize. The User-Personalization recipe has 3 dataset inputs: interactions, user metadata and item metadata. I am having trouble importing the user metadata, which contains a boolean field.
I created the following schema:
user_schema = {
"type": "record",
"name": "Users",
"namespace": "com.amazonaws.personalize.schema",
"fields": [
{
"name": "USER_ID",
"type": "string"
},
{
"name": "type",
"type": [
"null",
"string"
],
"categorical": True
},
{
"name": "lang",
"type": [
"null",
"string"
],
"categorical": True
},
{
"name": "is_active",
"type": "boolean"
}
],
"version": "1.0"
}
The dataset CSV file content looks like this:
USER_ID,type,lang,is_active
1234#gmail.com ,,geo,True
01027061015#mail.ru ,facebook,eng,True
03dadahda#gmail.com ,facebook,geo,True
040168fadw#gmail.com ,facebook,geo,False
I uploaded the CSV file to an S3 bucket.
When I try to create a dataset import job, it gives me the following exception:
InvalidInputException: An error occurred (InvalidInputException) when calling the CreateDatasetImportJob operation: Input csv has rows that do not conform to the dataset schema. Please ensure all required data fields are present and that they are of the type specified in the schema.
I tested it, and it works without the boolean field is_active. There are no NaN values in the column!
It would be nice to have a way to directly test whether your pandas DataFrame or CSV file conforms to a given schema, and possibly get a more detailed error message.
Does anybody know how to format the boolean field to fix this issue?
I found a solution after many trials. I checked the AWS Personalize documentation (https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html#dataset-requirements), which says: boolean (values true and false must be lower case in your data).
Then I tried several things, and one of them worked, but it was still a hard way to find a solution and it took hours.
Solution:
Convert the column in the pandas DataFrame to string (object) format.
Lowercase the "True" and "False" string values to get true and false.
Store the pandas DataFrame as a CSV file.
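A small sketch of these steps (the column name comes from the sample above; the file names are assumptions):
import pandas as pd

df = pd.read_csv("users.csv")

# Cast the boolean column to string and lowercase it ("True" -> "true")
df["is_active"] = df["is_active"].astype(str).str.lower()

# Write back to CSV for the Personalize import job
df.to_csv("users_fixed.csv", index=False)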
This results in lowercase true and false values:
USER_ID,type,lang,is_active
1234#gmail.com ,,geo,true
01027061015#mail.ru ,facebook,eng,true
03dadahda#gmail.com ,facebook,geo,true
040168fadw#gmail.com ,facebook,geo,false
That's all! There is no need to change the "boolean" type in the schema to "string"!
Hopefully they'll fix this soon; I have contacted AWS technical support about the same issue.
I have a JSON document where I need to set the document ID to a combination of two fields.
{
"Event_start_time": "2021-05-16T08:27:21.164Z",
"allbeat": {
"heartbeat": {
"pkt_loss_pct": 0,
"type": "ping",
"bu_id": 1,
"minimum_rtt": 32.248,
"jitter": 0.09999999999999788,
"target_state": "Up",
"average_rtt": 32.35,
"maximum_rtt": 32.436,
"tenant_id": 1,
"target": "google.com",
"port": 0
}
}
}
From the above document, can we set a key as the combination of Event_start_time and allbeat.heartbeat.target using the available SMTs?
There is no available Single Message Transform that I'm aware of that will do this. You could write your own, or you could use stream processing (e.g. Kafka Streams, ksqlDB) to do it.
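For illustration only, here is a rough Python sketch of the rekeying idea: consume the original messages, build the composite key, and write them to a new topic that a sink can then use. Kafka Streams or ksqlDB would be the more usual tools; the topic names and bootstrap server below are assumptions:
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumption
    "group.id": "rekey-heartbeats",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})

consumer.subscribe(["heartbeats"])            # source topic name is assumed

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    doc = json.loads(msg.value())
    # Combine the two fields from the question into one key
    key = f'{doc["Event_start_time"]}_{doc["allbeat"]["heartbeat"]["target"]}'
    producer.produce("heartbeats-keyed", key=key, value=msg.value())
    producer.poll(0)                          # serve delivery callbacks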
I'm using Azure Data Factory and a data flow transformation. I have a CSV that contains a column with a JSON object string; below is an example including the header:
"Id","Name","Timestamp","Value","Metadata"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-18 05:53:00.0000000","0","{""unit"":""%""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-19 05:53:00.0000000","4","{""jobName"":""RecipeB""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-16 02:12:30.0000000","state","{""jobEndState"":""negative""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-19 06:33:00.0000000","23","{""unit"":""kg""}"
Want to store the data in a json like this:
{
"id": "99c9347ab7c34733a4fe0623e1496ffd",
"name": "data1",
"values": [
{
"timestamp": "2021-03-18 05:53:00.0000000",
"value": "0",
"metadata": {
"unit": "%"
}
},
{
"timestamp": "2021-03-19 05:53:00.0000000",
"value": "4",
"metadata": {
"jobName": "RecipeB"
}
}
....
]
}
The challenge is that the metadata has dynamic content, meaning it will always be a JSON object but its content can vary, so I cannot define a schema for it. Currently the column "Metadata" in the sink schema is defined as an object, but whenever I run the transformation I run into an exception:
Conversion from ArrayType(StructType(StructField(timestamp,StringType,false),
StructField(value,StringType,false), StructField(metadata,StringType,false)),true) to ArrayType(StructType(StructField(timestamp,StringType,true),
StructField(value,StringType,true), StructField(metadata,StructType(StructField(,StringType,true)),true)),false) not defined
We can get the output you expected; we need an expression to get the value out of the Metadata object.
Please follow my steps; here's my source:
Derived column expressions, create a JSON schema to convert the data:
#(id=Id,
name=Name,
values=#(timestamp=Timestamp,
value=Value,
metadata=#(unit=substring(split(Metadata,':')[2], 3, length(split(Metadata,':')[2])-6))))
Sink mapping and output data preview:
The key point is that your metadata value is an object whose schema and content can vary (it may contain unit, jobName, or another key). We can only build the schema manually; it doesn't support expressions. That's the limitation.
We can't achieve that within Data Factory.
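As a possible workaround outside of Data Factory, the same reshaping can be done with a short pandas script. This is only a sketch; the file names are assumptions and the column names come from the sample CSV above:
import json
import pandas as pd

df = pd.read_csv("input.csv")

records = []
for (id_, name), group in df.groupby(["Id", "Name"]):
    records.append({
        "id": id_,
        "name": name,
        "values": [
            {
                "timestamp": row["Timestamp"],
                "value": str(row["Value"]),
                # Metadata arrives as a JSON string with varying keys, so just parse it
                "metadata": json.loads(row["Metadata"]),
            }
            for _, row in group.iterrows()
        ],
    })

with open("output.json", "w") as f:
    json.dump(records, f, indent=2)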
HTH.
I've been using Kafka Connect for the last couple of months, and recently I added the ActiveMQ source connector in order to read some JMS topic messages that contain a JSON document, put them in a Kafka topic, and then create a stream/table in ksqlDB that uses some of the keys from that JSON as columns.
The problem is that the connector inserts the JMS message as text with escaped double quotes, so it's not recognized properly in ksqlDB.
I've tried various configuration changes to fix this, but nothing has worked so far.
I also want to use JSON formatting and not Avro in Kafka Connect (there is no Schema Registry running either).
For testing purposes, I also tried sending JMS messages with the header content type set to "application/json", and still no luck.
Here's what my ActiveMQ connector configuration looks like:
"config": {"connector.class":"ActiveMQSourceConnector", "tasks.max":"1", "kafka.topic":"activemq", "activemq.url":"tcp://localhost:61616","activemq.username":"admin","activemq.password":"admin","jms.destination.name":"topic.2","jms.destination.type":"topic","jms.message.format":"json","jms.message.converter":"org.apache.kafka.connect.json.JsonConverter","confluent.license":"","confluent.topic.bootstrap.servers":"localhost:9092"}}
and here's what my Kafka Connect worker configuration looks like:
bootstrap.servers=localhost:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.topic=connect-offsets
offset.storage.replication.factor=1
config.storage.topic=connect-configs
config.storage.replication.factor=1
status.storage.topic=connect-status
status.storage.replication.factor=1
offset.flush.interval.ms=10000
plugin.path=/opt/kafka_2.13-2.5.0/plugins
Also, here's an example of how the messages look when consumed from Kafka:
{
"messageID": "ID:plato-46377-1596636746117-4:4:1:1:1",
"text": "{\"widget\": { \"debug\": \"on\", \"window\": { \"title\": \"Sample Konfabulator Widget\", \"name\": \"main_window\", \"width\": 500, \"height\": 500 }, \"image\": { \"src\": \"Images/Sun.png\", \"name\": \"sun1\", \"hOffset\": 250, \"vOffset\": 250, \"alignment\": \"center\" }, \"text\": { \"data\": \"Click Here\", \"size\": 36, \"style\": \"bold\", \"name\": \"text1\", \"hOffset\": 250, \"vOffset\": 100, \"alignment\": \"center\", \"onMouseUp\": \"sun1.opacity = 39\"} }}\n"
}
If any other info is needed, please let me know. Any help would be much appreciated.
UPDATE: Ultimately the best solution would be to somehow configure the connector not to escape the quotes in the payload.
Also, unfortunately, the escaped quotes are generated by ActiveMQ itself and are not part of the initial message.
So the message would look like this:
{
"messageID": "ID:plato-46377-1596636746117-4:4:1:1:1",
"text": {
"widget": {
"debug": "on",
"window": {
"title": "Sample Konfabulator Widget",
"name": "main_window",
"width": 500,
"height": 500
},
"image": {
"src": "Images/Sun.png",
"name": "sun1",
"hOffset": 250,
"vOffset": 250,
"alignment":
"center"
}
}
Welcome Elen1no1Yami!
Looks to me like the issue is that the text field of the message is a string containing the JSON payload you're interested in, but that payload has its double-quotes escaped with a \ char.
I'm assuming the data in ActiveMQ itself does not have the \ char, but it would be good if you could clarify this.
The approaches I see to solving this are to either:
configure the connector to NOT escape the quotes in the payload, so that the message looks more like:
{
"messageID": "ID:plato-46377-1596636746117-4:4:1:1:1",
"text": {
"widget": {
"debug": "on",
"window": {
"title": "Sample Konfabulator Widget",
"name": "main_window",
"width": 500,
"height": 500
},
"image": {
"src": "Images/Sun.png",
"name": "sun1",
"hOffset": 250,
"vOffset": 250,
"alignment":
"center"
},
... etc
}
or somehow have ksqlDB handle the message as it is and still get access to the JSON within the text field.
Does that summarise what you're looking for? If so, please update your question to reflect this. (It's good to include such details in your question so that it's clear what you're asking.)
As for an answer...
I'm no Connect expert, so I can't really comment, and I can't see anything in the connector's config that would allow you to change the contents of text. Others who know more about Connect may be able to help more.
To be able to access the embedded/escaped JSON in ksqlDB you would first need to remove the escaping. See below for ways to do this using ksqlDB.
Using ksqlDB to access escaped JSON
Before we can access the JSON document in text we must remove the escaping.
I can think of two ways off the top of my head:
Write a custom UDF
The best way would be to write a custom UDF, e.g. unescape_json, that could remove the escaping.
-- Import raw stream with value as simple STRING containing all the payload
CREATE STREAM RAW (
message STRING
) WITH (
KAFKA_TOPIC=<something>,
VALUE_FORMAT='KAFKA'
);
-- Use custom UDF to process this and write it back as a properly formatted JSON document:
CREATE STREAM JSONIFIED AS
SELECT MY_CUSTOM_UDF(message) FROM RAW;
If written correctly, the custom UDF approach would not suffer from the potential data corruption issues the REPLACE based solution suffers from.
Using REPLACE to remove escaping
NOTE: this solution is brittle: the character replacement can match and replace things it shouldn't, depending on the content of your message!
Let's work with simpler test data to explain what's needed, e.g. we want to convert:
{
"messageID": "ID:plato-46377-1596636746117-4:4:1:1:1",
"text": "{\"widget\": 10}"
}
To:
{
"messageID": "ID:plato-46377-1596636746117-4:4:1:1:1",
"text": {"widget": 10}
}
This requires three things:
Replace opening "text": "{ with "text": {
Replace all \" with ".
Replace closing }" with }
We can use the REPLACE function to do this, or the REGEXP_REPLACE function:
-- Import raw stream with value as simple STRING containing all the payload
CREATE STREAM RAW (
message STRING
) WITH (
KAFKA_TOPIC=<something>,
VALUE_FORMAT='KAFKA'
);
-- Use REPLACE to remove the escaping:
CREATE STREAM JSONIFIED AS
SELECT
REPLACE(
REPLACE(
REPLACE(message,
'"text": "{', '"text": {'),
'\"', '"'),
'"}', '}')
FROM RAW;
Of course, this solution can corrupt your data if any of the search terms ("text": "{, \" or "}) appear anywhere else in your data, e.g.
{
"messageID": "ID:plato-46377-1596636746117-4:4:1:1:1",
"text": "{\"widget\": \"hello \\\"} world\"}"
}
Would incorrectly be converted to
{
"messageID": "ID:plato-46377-1596636746117-4:4:1:1:1",
"text": {"widget": "hello \\}world"}
}
This is why a custom UDF would be preferable.
Once you've corrected the contents of your input (and written it to a new topic), then you can import your data as normal:
CREATE STREAM DATA (
messageId STRING,
text STRUCT<Widget INT>
) WITH (
kafka_topic='JSONIFIED',
value_format='JSON'
);
Sample JSON payload:
'{
"Stub1": "XXXXX",
"Stub2": "XXXXX-3047-4ed3-b73b-83fbcc0c2aa9",
"Code": "CodeX",
"people": [
{
"ID": "XXXXX-6425-EA11-A94A-A08CFDCA6C02"
"customer": {
"Id": 173,
"Account": 275,
"AFile": "tel"
},
"products": [
{
"product": 1,
"type": "A",
"stub1": "XXXXX-42E1-4A13-8190-20C2DE39C0A5",
"Stub2": "XXXXX-FC4F-41AB-92E7-A408E7F4C632",
"stub3": "XXXXX-A2B4-4ADF-96C5-8F3CDCF5821D",
"Stub4": "XXXXX-1948-4B3C-987F-B5EC4D6C2824"
},
{
"product": 2,
"type": "B",
"stub1": "XXXXX-42E1-4A13-8190-20C2DE39C0A5",
"Stub2": "XXXXX-FC4F-41AB-92E7-A408E7F4C632",
"stub3": "XXXXX-A2B4-4ADF-96C5-8F3CDCF5821D",
"Stub4": "XXXXX-1948-4B3C-987F-B5EC4D6C2824"
}
]
}
]
}'
I am working on a POST call. Is there any way to feed multiple JSON files as the payload in Gatling? I am using body(RawFileBody("file.json")) as the JSON here.
This works fine for a single JSON file. I want to check the response for multiple JSON files. Is there any way we can parameterize this and get a response for each of the JSON files?
As far as I can see, there are a couple of ways you could do this.
Use a JSON feeder (https://gatling.io/docs/current/session/feeder#json-feeders). This would need your multiple JSON files to be in a single file, with the root element being a JSON array. Essentially, you'd put the JSON objects you have inside an array within a single JSON file.
Create a Scala Iterator containing the names of the JSON files you're going to use, e.g.:
val fileNames = Iterator("file1.json", "file2.json")
// and later, in your scenario
body(RawFileBody(fileNames.next()))
Note that this method cannot be used across users, as the iterator will initialize separately for each user. You'd have to use repeat or something similar to send multiple files as a single user.
You could do something similar by maintaining the file names as a list inside Gatling's session variable, but this session would still not be shared between different users you inject into your scenario.