How to produce a tombstone for a Kafka Avro topic - Scala

I am trying to produce tombstone messages to a compacted Kafka topic with an Avro schema, using Scala (v2.13.10) and the FS2 Kafka library (v3.0.0-M8) with the Vulcan module.
The app consumes from a topic A and produces a tombstone back to the same topic A for values that match some condition.
A sample snippet:
val producerSettings =
  ProducerSettings(
    keySerializer = keySerializer,
    valueSerializer = Serializer.unit[IO]
  ).withBootstrapServers("localhost:9092")

def processRecord(
    committableRecord: CommittableConsumerRecord[IO, KeySchema, ValueSchema],
    producer: KafkaProducer.Metrics[IO, KeySchema, Unit]
): IO[CommittableOffset[IO]] = {
  val key = committableRecord.record.key
  val value = committableRecord.record.value
  if (value.filterColumn.field1 == "<removable>") {
    val tombStone = ProducerRecord(committableRecord.record.topic, key, ())
    val producerRecord: ProducerRecords[CommittableOffset[IO], KeySchema, Unit] =
      ProducerRecords.one(tombStone, committableRecord.offset)
    // wait for the send to complete and return the committable offset passthrough
    producer.produce(producerRecord).flatten.map(_.passthrough)
  } else
    IO(committableRecord.offset)
}
The above snippet works fine when I produce a valid key/value message.
However, I get the error below when I try to produce a null/empty (tombstone) value:
java.lang.IllegalArgumentException: Invalid Avro record: bytes is null or empty
at fs2.kafka.vulcan.AvroDeserializer$.$anonfun$using$4(AvroDeserializer.scala:32)
at defer # fs2.kafka.vulcan.AvroDeserializer$.$anonfun$using$3(AvroDeserializer.scala:29)
at defer # fs2.kafka.vulcan.AvroDeserializer$.$anonfun$using$3(AvroDeserializer.scala:29)
at mapN # fs2.kafka.KafkaProducerConnection$$anon$1.withSerializersFrom(KafkaProducerConnection.scala:141)
at map # fs2.kafka.ConsumerRecord$.fromJava(ConsumerRecord.scala:184)
at map # fs2.kafka.internal.KafkaConsumerActor.$anonfun$records$2(KafkaConsumerActor.scala:265)
at traverse # fs2.kafka.KafkaConsumer$$anon$1.$anonfun$partitionsMapStream$26(KafkaConsumer.scala:267)
at defer # fs2.kafka.vulcan.AvroDeserializer$.$anonfun$using$3(AvroDeserializer.scala:29)
at defer # fs2.kafka.vulcan.AvroDeserializer$.$anonfun$using$3(AvroDeserializer.scala:29)
at mapN # fs2.kafka.KafkaProducerConnection$$anon$1.withSerializersFrom(KafkaProducerConnection.scala:141)
A sample Avro Schema:
{
  "type": "record",
  "name": "SampleOrder",
  "namespace": "com.myschema.global",
  "fields": [
    {
      "name": "cust_id",
      "type": "int"
    },
    {
      "name": "month",
      "type": "int"
    },
    {
      "name": "expenses",
      "type": "double"
    },
    {
      "name": "filterColumn",
      "type": {
        "type": "record",
        "name": "filterColumn",
        "fields": [
          {
            "name": "id",
            "type": "string"
          },
          {
            "name": "field1",
            "type": "string"
          }
        ]
      }
    }
  ]
}
I've tried different serializers for the producer, but they all result in the same exception shown above.
Thanks in advance.

First, a producer would use a Serializer, yet your stack trace points at a Deserializer. Unless your keys are Avro, you don't need an Avro schema to send null values into a topic: use ByteArraySerializer and simply send a null value...
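For illustration, a minimal sketch of that approach in Scala with the plain kafka-clients producer (the topic name and key bytes are placeholders; on a compacted topic the key bytes must be byte-identical to the Avro-encoded key of the record you want deleted):
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.ByteArraySerializer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", classOf[ByteArraySerializer].getName)
props.put("value.serializer", classOf[ByteArraySerializer].getName)

val producer = new KafkaProducer[Array[Byte], Array[Byte]](props)

// keyBytes must be the exact serialized key of the record you want compacted away
val keyBytes: Array[Byte] = ??? // e.g. the raw key bytes of the consumed record
val tombstone = new ProducerRecord[Array[Byte], Array[Byte]]("topic-a", keyBytes, null)

producer.send(tombstone).get() // a null value is the tombstone
producer.close()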
But this does seem like a bug: if the incoming record key/value is null, the deserializer should return null rather than explicitly throw an error.
https://github.com/fd4s/fs2-kafka/blob/series/2.x/modules/vulcan/src/main/scala/fs2/kafka/vulcan/AvroDeserializer.scala#L29
Compare with the Confluent implementation.
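As a workaround on the consuming side, you could wrap the Avro value deserializer so that null/empty bytes decode to None instead of throwing. A rough sketch, assuming fs2-kafka's Deserializer.instance and deserialize signatures:
import cats.effect.IO
import fs2.kafka.Deserializer

// Sketch: treat null/empty bytes (tombstones) as None, delegate everything else
// to the Vulcan-backed deserializer.
def tolerant[A](underlying: Deserializer[IO, A]): Deserializer[IO, Option[A]] =
  Deserializer.instance[IO, Option[A]] { (topic, headers, bytes) =>
    if (bytes == null || bytes.isEmpty) IO.pure(None)
    else underlying.deserialize(topic, headers, bytes).map(Some(_))
  }
If I recall correctly, fs2-kafka also ships an Option deserializer instance that maps null to None, so consuming values as Option[ValueSchema] may already cover this.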

Related

ksqldb keeps saying - VALUE_FORMAT should support schema inference when VALUE_SCHEMA_ID is provided. Current format is JSON

I'm trying to create a stream in ksqlDB over a Kafka topic using an Avro schema.
The command looks like this:
CREATE STREAM customer_stream WITH (KAFKA_TOPIC='customers', VALUE_FORMAT='JSON', VALUE_SCHEMA_ID=1);
The customers topic looks like this (output of print 'customers';):
Key format: ¯\_(ツ)_/¯ - no data processed
Value format: JSON or KAFKA_STRING
rowtime: 2022/09/29 12:34:53.440 Z, key: , value: {"Name":"John Smith","PhoneNumbers":["212 555-1111","212 555-2222"],"Remote":false,"Height":"62.4","FicoScore":" > 640"}, partition: 0
rowtime: 2022/09/29 12:34:53.440 Z, key: , value: {"Name":"Jane Smith","PhoneNumbers":["269 xxx-1111","269 xxx-2222"],"Remote":false,"Height":"69.9","FicoScore":" > 690"}, partition: 0
An Avro schema has been registered for this topic:
{
  "type": "record",
  "name": "Customer",
  "namespace": "com.acme.avro",
  "fields": [
    {
      "name": "ficoScore",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "height",
      "type": ["null", "double"],
      "default": null
    },
    {
      "name": "name",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "phoneNumbers",
      "type": ["null", {
        "type": "array",
        "items": ["null", "string"]
      }],
      "default": null
    },
    {
      "name": "remote",
      "type": ["null", "boolean"],
      "default": null
    }
  ]
}
When I run the command below, I get this reply:
CREATE STREAM customer_stream WITH (KAFKA_TOPIC='customers', VALUE_FORMAT='JSON', VALUE_SCHEMA_ID=1);
VALUE_FORMAT should support schema inference when VALUE_SCHEMA_ID is provided. Current format is JSON.
Any suggestions?
JSON doesn't use schema IDs. The JSON_SR format does, but if you want Avro, then you need to use AVRO as the format.
You don't "add schemas" to topics; you can only register them in the registry.
Example of converting JSON to Avro with kSQL:
CREATE STREAM sensor_events_json (sensor_id VARCHAR, temperature INTEGER, ...)
WITH (KAFKA_TOPIC='events-topic', VALUE_FORMAT='JSON');
CREATE STREAM sensor_events_avro WITH (VALUE_FORMAT='AVRO') AS SELECT * FROM sensor_events_json;
Notice that you don't need to refer to any ID, as the serializer will auto-register the necessary schema.

Convert CSV to AVRO in NiFi 1.13.0

I would like to convert my CSV data flow to Avro in NiFi 1.13.0 and send it to a Kafka topic with a key and its schema.
So I have multiple problems:
Convert my file to Avro (compatible with Kafka topic and Kafka Streams use)
Send my Avro message to my Kafka topic with its schema
Attach a custom key to my message
I've seen many references to processors that no longer exist, so I would like a clear answer for NiFi version 1.13.0.
Here is my data flow:
Project,Price,Charges,hours spent,Days spent,price/day
API,75000,2500,1500,187.5,1000
Here is the AVRO Schema I'd like to have at the end :
{
  "name": "projectClass",
  "type": "record",
  "fields": [
    {
      "name": "Project",
      "type": "string"
    },
    {
      "name": "Price",
      "type": "int"
    },
    {
      "name": "Charges",
      "type": "int"
    },
    {
      "name": "hours spent",
      "type": "int"
    },
    {
      "name": "Days spent",
      "type": "double"
    },
    {
      "name": "price/day",
      "type": "int"
    }
  ]
}
The associated key must be a unique ID (int or double).
Thanks for your answers.

Avro Schema: Build Avro Schema from Schema Fields

I am trying to write a function to calculate a diff between two avro schemas and generate another schema.
schema_one = {
  "type": "record",
  "name": "schema_one",
  "namespace": "test",
  "fields": [
    {
      "name": "type",
      "type": "string"
    },
    {
      "name": "id",
      "type": "string"
    }
  ]
}

schema_two = {
  "type": "record",
  "name": "schema_two",
  "namespace": "test",
  "fields": [
    {
      "name": "type",
      "type": "string"
    }
  ]
}
To get the fields that are in schema_one but not in schema_two:
import scala.jdk.CollectionConverters._
import org.apache.avro.{Schema, SchemaBuilder}
val diff: Set[Schema.Field] = schema_one.getFields.asScala.toSet.filterNot(schema_two.getFields.asScala.toSet)
So far, so good.
I want to build a new schema from diff and I expect it to be:
schema_three = {
  "type": "record",
  "name": "schema_three",
  "namespace": "test",
  "fields": [
    {
      "name": "id",
      "type": "string"
    }
  ]
}
I can't seem to find any method within Avro's SchemaBuilder to achieve this without explicitly providing named fields, i.e. building a Schema from given Schema.Fields.
For example:
SchemaBuilder.record("schema_three").namespace("test").fromFields(diff)
Is there a way to achieve this? Appreciate comments.
I was able to achieve this using the Kite SDK: "org.kitesdk" % "kite-data-core" % "1.1.0"
val schema_namespace = schema_one.getNamespace
val schema_name = schema_one.getName

val schemas = diff.map { f =>
  SchemaBuilder
    .record(schema_name)
    .namespace(schema_namespace)
    .fields()
    .name(f.name())
    .`type`(f.schema())
    .noDefault()
    .endRecord()
}

val schema_three = SchemaUtil.merge(schemas.asJava)
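For what it's worth, a sketch of the same idea without the Kite SDK, assuming Avro 1.9+ (Schema.Field objects cannot be reused once attached to a schema, hence the copies):
import scala.jdk.CollectionConverters._
import org.apache.avro.Schema

// Copy each diffed field, then build the record schema directly.
val copiedFields = diff.toList.map { f =>
  new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal())
}
val schema_three: Schema =
  Schema.createRecord("schema_three", null, "test", false, copiedFields.asJava)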

How to set array of records Using GenericRecordBuilder

I'm trying to turn a Scala object (i.e. a case class) into a byte array.
In order to do so, I insert the object's content into a GenericRecordBuilder using its specific schema, and eventually use a GenericDatumWriter to turn it into a byte array.
I have no problem setting primitive types, or arrays of primitive types, on the GenericRecordBuilder.
But I need help with inserting an array of records into the GenericRecordBuilder and creating a byte array from it.
What is the right way to insert an array of records into the GenericRecordBuilder?
Here is part of what I'm trying to do:
This is the Schema:
{
  "type": "record",
  "name": "test1",
  "namespace": "ns",
  "fields": [
    {
      "name": "t_name",
      "type": "string",
      "default": "a"
    },
    {
      "name": "t_num",
      "type": "int",
      "default": 0
    },
    {
      "name": "t_arr",
      "type": [
        "null",
        {
          "type": "array",
          "items": {
            "name": "t_arr_a",
            "type": "record",
            "fields": [
              {
                "name": "t_arr_f1",
                "type": "int",
                "default": 0
              },
              {
                "name": "t_arr_f2",
                "type": "int",
                "default": 0
              }
            ]
          }
        }
      ]
    }
  ]
}
This is the Scala class that populates the GenericRecordBuilder and turns it into a byte array:
package utils

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecordBuilder}
import org.apache.avro.io.EncoderFactory

object CheckRecBuilder extends App {
  val avroSchema: Schema = new Schema.Parser().parse(this.getClass.getResourceAsStream("/data/myschema.avsc"))
  val recordBuilder = new GenericRecordBuilder(avroSchema)
  recordBuilder.set("t_name", "X")
  recordBuilder.set("t_num", 100)
  recordBuilder.set("t_arr", ???)
  val record = recordBuilder.build()
  val w = new GenericDatumWriter[GenericData.Record](avroSchema)
  val outputStream = new ByteArrayOutputStream()
  val e = EncoderFactory.get.binaryEncoder(outputStream, null)
  w.write(record, e)
  e.flush() // the binary encoder buffers internally; flush before reading the bytes
  val barr = outputStream.toByteArray
  println("End")
}
I managed to set the array of objects.
I wonder if there is a better or more correct way of doing it.
Here is what I did:
Created a case class:
case class t_arr_a(t_arr_f1: Int, t_arr_f2: Int)
Created a method that transforms a case class into a GenericData.Record:
def caseClassToGenericDataRecord(cc: Product, schema: Schema): GenericData.Record = {
  // schema here is the array schema; its element type is the t_arr_a record schema
  val childRecord = new GenericData.Record(schema.getElementType)
  val values = cc.productIterator
  cc.getClass.getDeclaredFields.foreach(f => childRecord.put(f.getName, values.next()))
  childRecord
}
Updated the class CheckRecBuilder above, replacing:
recordBuilder.set("t_arr", ???)
with:
val childSchema = avroSchema.getField("t_arr").schema().getTypes.get(1)
val tArray = Array(t_arr_a(2, 4), t_arr_a(25, 14))
val tArrayGRecords: java.util.List[GenericData.Record] =
  Some(tArray.map(x => caseClassToGenericDataRecord(x, childSchema))).map(arr => java.util.Arrays.asList(arr: _*)).orNull
recordBuilder.set("t_arr", tArrayGRecords)
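An alternative sketch without reflection: build each element record explicitly and wrap them in a GenericData.Array typed with the array schema (this reuses avroSchema and recordBuilder from the snippet above, and the field names from the schema in the question):
import scala.jdk.CollectionConverters._
import org.apache.avro.generic.GenericData

// The union's second branch is the array schema; its elements are t_arr_a records.
val arraySchema = avroSchema.getField("t_arr").schema().getTypes.get(1)
val elementSchema = arraySchema.getElementType

def element(f1: Int, f2: Int): GenericData.Record = {
  val r = new GenericData.Record(elementSchema)
  r.put("t_arr_f1", f1)
  r.put("t_arr_f2", f2)
  r
}

val tArr = new GenericData.Array[GenericData.Record](
  arraySchema,
  List(element(2, 4), element(25, 14)).asJava
)
recordBuilder.set("t_arr", tArr)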

Avro genericdata.Record ignores data types

I have the following Avro schema:
{ "namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
I use the following snippet to set up a Record:
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData

val schema = new Schema.Parser().parse(new File("data/user.avsc"))
val user1 = new GenericData.Record(schema) // strangely this schema only checks for valid fields NOT types
user1.put("name", "Fred")
user1.put("favorite_number", "Jones")
I would have thought that this would fail to validate against the schema.
When I add the line
user1.put("last_name", 100)
It generates a runtime error, which is what I would expect in the first case as well:
Exception in thread "main" org.apache.avro.AvroRuntimeException: Not a valid schema field: last_name
at org.apache.avro.generic.GenericData$Record.put(GenericData.java:125)
at csv2avro$.main(csv2avro.scala:40)
at csv2avro.main(csv2avro.scala)
What's going on here?
It won't fail when adding it to the Record; it will fail when it tries to serialize, because that is the point at which it tries to match the type. As far as I'm aware, that is the only place it does type checking.
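If you want the mismatch surfaced earlier, one option (assuming the record and schema from the question) is to validate the populated record against the schema explicitly with GenericData.validate before serializing:
import org.apache.avro.generic.GenericData

// Returns false here, since favorite_number holds a String instead of an int or null.
val isValid = GenericData.get().validate(schema, user1)
println(s"record matches schema: $isValid")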