Avro uint64 data type - encoding

I need to serialize a uint64 into an Avro field.
However, the docs only list signed integers:
The set of primitive type names is:
null: no value
boolean: a binary value
int: 32-bit signed integer
long: 64-bit signed integer
float: single precision (32-bit) IEEE 754 floating-point number
double: double precision (64-bit) IEEE 754 floating-point number
bytes: sequence of 8-bit unsigned bytes
string: unicode character sequence
What is the "canonical" way to serialize a uint64 in Avro? As bytes?
{
  "name": "payload",
  "type": "record",
  "fields": [
    {
      "name": "my_uint64",
      "type": "bytes"
    }
  ]
}
Edit:
Or should the data be encoded as a long and then be cast on the consumer side?
{
  "name": "payload",
  "type": "record",
  "fields": [
    {
      "name": "my_uint64",
      "type": "long"
    }
  ]
}
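For illustration, the "cast on the consumer side" would just be a reinterpretation of the same 64-bit pattern; a sketch in Python (the struct-based round trip is an assumption, nothing Avro itself prescribes):

# Store a uint64 in Avro's signed long by reinterpreting the bit pattern.
import struct

def uint64_as_long(value: int) -> int:
    # Producer side: same 64 bits, read back as a signed value.
    return struct.unpack(">q", struct.pack(">Q", value))[0]

def long_as_uint64(value: int) -> int:
    # Consumer side: undo the reinterpretation.
    return struct.unpack(">Q", struct.pack(">q", value))[0]

assert long_as_uint64(uint64_as_long(2**64 - 1)) == 2**64 - 1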
My problem with both approaches is that the receiver has to know that some bytes/longs are really uint64 values. Where do I store this information so that the consumer can rely on the schema?
My tendency is toward using bytes with a magic byte in front that indicates a uint64 within.
Has anyone had similar issues and come to a conclusion?

The Avro mailing list recommends using the fixed type (https://avro.apache.org/docs/1.10.2/spec.html#Fixed) to support unsigned integers.
{
  "name": "payload",
  "type": "record",
  "fields": [
    {
      "name": "my_uint64",
      "type": {
        "name": "myFixed",
        "type": "fixed",
        "size": 8
      }
    }
  ]
}
See https://www.mail-archive.com/user@avro.apache.org/msg01731.html.
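For completeness, getting a uint64 into and out of that 8-byte fixed field is just a byte conversion on each side; a minimal sketch in Python (big-endian byte order is an assumption here, not something the fixed type prescribes):

# Pack a uint64 into the fixed(8) field and unpack it on the consumer side.
import struct

def uint64_to_fixed8(value: int) -> bytes:
    # Producer side: 8 bytes, most significant byte first.
    return struct.pack(">Q", value)

def fixed8_to_uint64(raw: bytes) -> int:
    # Consumer side: reverse the conversion.
    return struct.unpack(">Q", raw)[0]

assert fixed8_to_uint64(uint64_to_fixed8(2**64 - 1)) == 2**64 - 1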

Related

JDBC sink topic with multiple structs to postgres

I am trying to sink a few topics to a Postgres database. However, the topic schema defines an array at the top level with multiple structs inside it. Auto-mapping does not work and I cannot find any reference on how to handle this. I need all the structs because they are dependent types: the second struct references the first struct as a field.
Currently it breaks when hitting the second struct, stating that statusChangeEvent (struct) has no mapping to a SQL column type. This is because auto.create makes a table (probably called ProcessStatus) from the first entry, so when the second entry arrives there is of course no matching column.
[
  {
    "type": "record",
    "name": "processStatus",
    "namespace": "company.some.process",
    "fields": [
      {
        "name": "code",
        "doc": "The code of the processStatus",
        "type": "string"
      },
      {
        "name": "name",
        "doc": "The name of the processStatus",
        "type": "string"
      },
      {
        "name": "description",
        "type": "string"
      },
      {
        "name": "isCompleted",
        "type": "boolean"
      },
      {
        "name": "isSuccessfullyCompleted",
        "type": "boolean"
      }
    ]
  },
  {
    "type": "record",
    "name": "StatusChangeEvent",
    "namespace": "company.some.process",
    "fields": [
      {
        "name": "contNumber",
        "type": "string"
      },
      {
        "name": "processId",
        "type": "string"
      },
      {
        "name": "processVersion",
        "type": "int"
      },
      {
        "name": "extProcessId",
        "type": [
          "null",
          "string"
        ],
        "default": null
      },
      {
        "name": "fromStatus",
        "type": "process.status"
      },
      {
        "name": "toStatus",
        "doc": "The new status of the process",
        "type": "company.some.process.processStatus"
      },
      {
        "name": "changeDateTime",
        "type": "long",
        "logicalType": "timestamp-millis"
      },
      {
        "name": "isPublic",
        "type": "boolean"
      }
    ]
  }
]
I am not using ksql atm. Which connector settings are suited for this task? If there is a ksql alternative it would be nice to know but the current requirement is to use the JDBC connector.
I tried using flatten, but it does not support struct fields that have a schema, which seems kind of weird. Aren't schemas the whole selling point of Connect with Kafka? Or is it more of a constraint you have to work around?
Aren't schemas the whole selling point of Connect with Kafka?
Yes, but Postgres (or the JDBC Sink in general) doesn't really support nested objects within columns. For that, you're better off with a document database, for example via the Mongo Sink Connector.
Which connector settings are suited for this task?
None, really, other than transforms. You could write your own if flatten doesn't work.
You could try pre-defining your table to use JSONB for the two status columns; however, that's more of a workaround.
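If you do go the pre-defined-table route, the DDL side might look something like this (a sketch only: the column names come from the StatusChangeEvent schema above, while psycopg2 and the connection string are assumptions; whether the sink can then populate the JSONB columns still depends on your converter setup):

# Pre-create the target table so auto.create never has to invent a column type
# for the nested status structs.
import psycopg2  # assumed driver; any SQL client works

ddl = """
CREATE TABLE IF NOT EXISTS "StatusChangeEvent" (
    "contNumber"     TEXT,
    "processId"      TEXT,
    "processVersion" INTEGER,
    "extProcessId"   TEXT,
    "fromStatus"     JSONB,
    "toStatus"       JSONB,
    "changeDateTime" TIMESTAMP,
    "isPublic"       BOOLEAN
);
"""

with psycopg2.connect("dbname=sink user=postgres") as conn, conn.cursor() as cur:
    cur.execute(ddl)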

ksqldb keeps saying - VALUE_FORMAT should support schema inference when VALUE_SCHEMA_ID is provided. Current format is JSON

I'm trying to create a stream in ksqlDB on a Kafka topic using an Avro schema.
The command looks like this:
CREATE STREAM customer_stream WITH (KAFKA_TOPIC='customers', VALUE_FORMAT='JSON', VALUE_SCHEMA_ID=1);
Topic customers looks like this:
Using the command - print 'customers';
Key format: ¯\_(ツ)_/¯ - no data processed
Value format: JSON or KAFKA_STRING
rowtime: 2022/09/29 12:34:53.440 Z, key: , value: {"Name":"John Smith","PhoneNumbers":["212 555-1111","212 555-2222"],"Remote":false,"Height":"62.4","FicoScore":" > 640"}, partition: 0
rowtime: 2022/09/29 12:34:53.440 Z, key: , value: {"Name":"Jane Smith","PhoneNumbers":["269 xxx-1111","269 xxx-2222"],"Remote":false,"Height":"69.9","FicoScore":" > 690"}, partition: 0
To this topic an Avro schema has been added.
{
  "type": "record",
  "name": "Customer",
  "namespace": "com.acme.avro",
  "fields": [
    {
      "name": "ficoScore",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "height",
      "type": ["null", "double"],
      "default": null
    },
    {
      "name": "name",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "phoneNumbers",
      "type": [
        "null",
        {
          "type": "array",
          "items": ["null", "string"]
        }
      ],
      "default": null
    },
    {
      "name": "remote",
      "type": ["null", "boolean"],
      "default": null
    }
  ]
}
When I run the command below, I get this reply:
CREATE STREAM customer_stream WITH (KAFKA_TOPIC='customers', VALUE_FORMAT='JSON', VALUE_SCHEMA_ID=1);
VALUE_FORMAT should support schema inference when VALUE_SCHEMA_ID is provided. Current format is JSON.
Any suggestion?
JSON doesn't use schema IDs. The JSON_SR format does, but if you want Avro, then you need to use AVRO as the format.
You don't "add schemas" to topics. You can only register them in the registry.
Example of converting JSON to Avro with ksqlDB:
CREATE STREAM sensor_events_json (sensor_id VARCHAR, temperature INTEGER, ...)
WITH (KAFKA_TOPIC='events-topic', VALUE_FORMAT='JSON');
CREATE STREAM sensor_events_avro WITH (VALUE_FORMAT='AVRO') AS SELECT * FROM sensor_events_json;
Notice that you don't need to refer to any ID, as the serializer will auto-register the necessary schema.

Default value for a record in AVRO?

I wanted to add a new field of type "record" to an AVRO schema; the field cannot be null and therefore has a default value. The topic is set to compatibility type "Full_Transitive".
The schema did not change from the last version except that the final field, produktType, was added:
{
  "type": "record",
  "name": "Finished",
  "namespace": "com.domain.finishing",
  "doc": "Schema to indicate the end of the ongoing saga...",
  "fields": [
    {
      "name": "numberOfAThing",
      "type": [
        "null",
        {
          "type": "string",
          "avro.java.string": "String"
        }
      ],
      "default": null
    },
    {
      "name": "previousNumbersOfThings",
      "type": {
        "type": "array",
        "items": {
          "type": "string",
          "avro.java.string": "String"
        }
      },
      "default": []
    },
    {
      "name": "produktType",
      "type": {
        "type": "record",
        "name": "ProduktType",
        "fields": [
          {
            "name": "art",
            "type": "int",
            "default": 1
          },
          {
            "name": "code",
            "type": "int",
            "default": 10003
          }
        ]
      },
      "default": { "art": 1, "code": 10003 }
    }
  ]
}
I've checked with the schema-registry that the new version of the schema is compatible.
But when we try to read old messages that do not contain the new field using the new schema (where the defaults are), there is an EOFException and it does not work.
The part that causes headaches is the newly added field "produktType". It cannot be null, so we tried adding defaults, which is possible for primitive-type fields ("int" and so on). The line "default": { "art": 1, "code": 10003 } seems to be accepted by the schema registry but does not appear to have any effect when we read messages from the topic that do not contain this field.
The schema registry also marks the schema as incompatible when the last "default": { "art": 1, "code": 10003 } line is missing (although "default": true also passes the compatibility check...).
The AVRO specification for complex types contains an example for type "record" with default {"a": 1}, so that is where we got the idea from. But since it is not working, something must still be wrong.
There are similar questions, like this one claiming records can only have null as a default, or this unanswered one.
Is this supposed to work? And if so, how can defaults for these "type": "record" fields be defined? Or is it still true that records can only have null as a default?
Thanks!
Update on the compatibility cases:
Schema V1 (the old one, without the new field): can read both v1 and v2 records.
Schema V2 (new field added): cannot read v1 records, but can read v2 records.
The surprising case is a consumer using schema v2 encountering records written with v1 - I thought the defaults were there for exactly that purpose.
Even weirder: when I don't set the new field's values at all, the v2 record still contains some values:
I have no idea where the value for code comes from; the schema uses other numbers for its defaults:
So one of them seems to work, the other does not.
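A quick way to check whether record defaults are applied during schema resolution, independent of the registry and consumer stack, is to write with the old schema and read with the new one; a minimal sketch assuming fastavro (the trimmed-down schemas here are illustrations, not the originals):

# Write a "v1" record (no produktType) and read it back with the "v2" schema
# that declares the field with a record default.
import io
from fastavro import schemaless_writer, schemaless_reader

v1_schema = {
    "type": "record",
    "name": "Finished",
    "fields": [
        {"name": "numberOfAThing", "type": ["null", "string"], "default": None},
    ],
}

v2_schema = {
    "type": "record",
    "name": "Finished",
    "fields": [
        {"name": "numberOfAThing", "type": ["null", "string"], "default": None},
        {
            "name": "produktType",
            "type": {
                "type": "record",
                "name": "ProduktType",
                "fields": [
                    {"name": "art", "type": "int"},
                    {"name": "code", "type": "int"},
                ],
            },
            "default": {"art": 1, "code": 10003},
        },
    ],
}

buf = io.BytesIO()
schemaless_writer(buf, v1_schema, {"numberOfAThing": "abc"})
buf.seek(0)

# Schema resolution should fill the missing field from the reader-side default.
print(schemaless_reader(buf, v1_schema, v2_schema))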

Convert CSV to AVRO in NiFi 1.13.0

I would like to convert my CSV dataflow to AVRO in NiFi 1.13.0 and send it to a Kafka topic with a key and its schema.
So I have multiple problems:
Convert my file to AVRO (compatible with a Kafka topic and Kafka Streams use)
Send my AVRO message to my Kafka topic with its schema
Attach a custom key to my message
I saw many things about processors that do not exist anymore, so I would like a clear answer for NiFi version 1.13.0.
Here is my dataflow:
Project,Price,Charges,hours spent,Days spent,price/day
API,75000,2500,1500,187.5,1000
Here is the AVRO Schema I'd like to have at the end :
{
  "name": "projectClass",
  "type": "record",
  "fields": [
    {
      "name": "Project",
      "type": "string"
    },
    {
      "name": "Price",
      "type": "int"
    },
    {
      "name": "Charges",
      "type": "int"
    },
    {
      "name": "hours spent",
      "type": "int"
    },
    {
      "name": "Days spent",
      "type": "double"
    },
    {
      "name": "price/day",
      "type": "int"
    }
  ]
}
The associated key must be a unique ID (int or double).
Thanks for your answers.
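While the question is specifically about NiFi processors, it can help to sanity-check the target schema against the sample row outside NiFi first; a sketch with fastavro (an assumption, not part of the NiFi flow), with the field names adjusted because Avro names may only contain letters, digits and underscores, so "hours spent", "Days spent" and "price/day" are not valid as-is:

# Sanity check outside NiFi: build one Avro record from the sample CSV row.
# Field names are adjusted (hours_spent, Days_spent, price_per_day) because
# Avro names must match [A-Za-z_][A-Za-z0-9_]*.
import csv
import io
from fastavro import parse_schema, writer

schema = parse_schema({
    "name": "projectClass",
    "type": "record",
    "fields": [
        {"name": "Project", "type": "string"},
        {"name": "Price", "type": "int"},
        {"name": "Charges", "type": "int"},
        {"name": "hours_spent", "type": "int"},
        {"name": "Days_spent", "type": "double"},
        {"name": "price_per_day", "type": "int"},
    ],
})

header = ["Project", "Price", "Charges", "hours_spent", "Days_spent", "price_per_day"]
values = next(csv.reader(["API,75000,2500,1500,187.5,1000"]))
record = dict(zip(header, values))
for int_field in ("Price", "Charges", "hours_spent", "price_per_day"):
    record[int_field] = int(record[int_field])
record["Days_spent"] = float(record["Days_spent"])

buf = io.BytesIO()
writer(buf, schema, [record])  # serializes to an Avro object container file
print(len(buf.getvalue()), "bytes of Avro written")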

'int' is a primitive and doesn't support nested properties: Azure Data Factory v2

I am trying to find a substring of a string as part of an activity in ADF. Say, for instance, I want to extract the substring 'de' out of the string 'abcde'. I have tried:
#substring(variables('target_folder_name'), 3, (int(variables('target_folder_name_length'))-3))
where int(variables('target_folder_name_length')) has a value of 5 and variables('target_folder_name') has a value of 'abcde'
But it gives me: Unrecognized expression: (int(variables('target_folder_name_length'))-3)
On the other hand, if I try this: #substring(variables('target_folder_name'), 2, int(variables('target_folder_name_length'))-3)
This gives me: 'int' is a primitive and doesn't support nested properties
Where am I going wrong?
Use indexof. See my example below:
#substring(variables('testString'), indexof(variables('testString'), variables('de')), length(variables('de')))
Output of 'result' variable:
Since your preceding values are static, you can use the dynamic expression below to get the substring you need. Note that ADF expressions have no infix arithmetic operators, so the subtraction is written with sub() rather than '-'.
#substring(variables('varInputFolderName'), 3, sub(length(variables('varInputFolderName')), 3))
where varInputFolderName is a String with the value abcde in this sample.
Here is the pipeline JSON payload for this sample. You can play around with it for further testing.
{
  "name": "pipeline_FindSubstring",
  "properties": {
    "activities": [
      {
        "name": "setSubstringValue",
        "type": "SetVariable",
        "dependsOn": [],
        "userProperties": [],
        "typeProperties": {
          "variableName": "varSubstringOutput",
          "value": {
            "value": "#substring(variables('varInputFolderName'), 3, sub(length(variables('varInputFolderName')), 3))",
            "type": "Expression"
          }
        }
      }
    ],
    "variables": {
      "varInputFolderName": {
        "type": "String",
        "defaultValue": "abcde"
      },
      "varSubstringOutput": {
        "type": "String"
      }
    },
    "annotations": []
  }
}