Differentiating an AVRO union type - scala

I'm consuming Avro serialized messages from Kafka using the "automatic" deserializer like:
props.put(
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
"io.confluent.kafka.serializers.KafkaAvroDeserializer"
);
props.put("schema.registry.url", "https://example.com");
This works brilliantly, and is right out of the docs at https://docs.confluent.io/current/schema-registry/serializer-formatter.html#serializer.
The problem I'm facing is that I actually just want to forward these messages, but to do the routing I need some metadata from inside them. Some technical constraints mean that I can't feasibly compile in generated class files and use KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG => true, so instead I read each value as an Array[Byte] and pass it to a manually constructed deserializer:
var maxSchemasToCache = 1000;
var schemaRegistryURL = "https://example.com/"
var specificDeserializerProps = Map(
"schema.registry.url"
-> schemaRegistryURL,
KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG
-> "false"
);
var client = new CachedSchemaRegistryClient(
schemaRegistryURL,
maxSchemasToCache
);
var deserializer = new KafkaAvroDeserializer(
client,
specificDeserializerProps.asJava
);
The messages are a "container" type; the really interesting part is the msg field, which is a union of roughly 25 types (union { A, B, C } msg):
record Event {
timestamp_ms created_at;
union {
Online,
Offline,
Available,
Unavailable,
...
...Failed,
...Updated
} msg;
}
So I'm successfully reading an Array[Byte] from each consumer record and feeding it into the deserializer like this:
var genericRecord = deserializer.deserialize(topic, consumerRecord.value())
.asInstanceOf[GenericRecord];
var schema = genericRecord.getSchema();
var msgSchema = schema.getField("msg").schema();
The problem however is that I can find no way to discern, discriminate or "resolve" the "type" of the msg field through the union:
System.out.printf(
"msg.schema = %s msg.schema.getType = %s\n",
msgSchema.getFullName(),
msgSchema.getType().name());
=> msg.schema = union msg.schema.getType = union
How can I discriminate the types in this scenario? The Confluent registry knows them: these things have names, they have "types", even if I'm treating them as GenericRecords.
My goal here is to know that record.msg is of "type" Online | Offline | Available rather than just knowing it's a union.

After having looked into the implementation of the Avro Java library, I think it's safe to say that this is impossible given the current API. I've found the following way of extracting the types while parsing, using a custom GenericDatumReader subclass, but it needs a lot of polishing before I'd use something like this in production code :D
So here's the subclass:
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.io.ResolvingDecoder;
import java.io.IOException;
import java.util.List;
public class CustomReader<D> extends GenericDatumReader<D> {
private final List<Schema> unionTypes;
// vvv This is the constructor I've modified: it takes a list that collects the union branch schemas seen while decoding
public CustomReader(Schema schema, List<Schema> unionTypes) {
super(schema, schema, GenericData.get());
this.unionTypes = unionTypes;
}
@Override
protected Object readWithoutConversion(Object old, Schema expected, ResolvingDecoder in) throws IOException {
switch (expected.getType()) {
case RECORD:
return super.readRecord(old, expected, in);
case ENUM:
return super.readEnum(expected, in);
case ARRAY:
return super.readArray(old, expected, in);
case MAP:
return super.readMap(old, expected, in);
case UNION:
// vvv The magic happens here
Schema type = expected.getTypes().get(in.readIndex());
unionTypes.add(type);
return super.read(old, type, in);
case FIXED:
return super.readFixed(old, expected, in);
case STRING:
return super.readString(old, expected, in);
case BYTES:
return super.readBytes(old, expected, in);
case INT:
return super.readInt(old, expected, in);
case LONG:
return in.readLong();
case FLOAT:
return in.readFloat();
case DOUBLE:
return in.readDouble();
case BOOLEAN:
return in.readBoolean();
case NULL:
in.readNull();
return null;
default:
return super.readWithoutConversion(old, expected, in);
}
}
}
I've added comments to the code for the interesting parts, as it's mostly boilerplate.
Then you can use this custom reader like this:
List<Schema> unionTypes = new ArrayList<>();
DatumReader<GenericRecord> datumReader = new CustomReader<GenericRecord>(schema, unionTypes);
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(eventFile, datumReader);
GenericRecord event = null;
while (dataFileReader.hasNext()) {
event = dataFileReader.next(event);
}
System.out.println(unionTypes);
This will print, for each union parsed, the type of that union. Note that you'll have to figure out which element of that list is interesting to you depending on how many unions you have in a record, etc.
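For instance, assuming msg is the only union in each Event record (so entry i of the list corresponds to the i-th record read), a minimal sketch of inspecting the collected branches:
// Sketch: with a single union per record, unionTypes.get(i) is the branch decoded for record i
for (Schema branch : unionTypes) {
    System.out.printf("decoded msg branch: %s (%s)%n", branch.getFullName(), branch.getType());
}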
Not pretty tbh :D

I was able to come up with a single-use solution after a lot of digging:
val records: ConsumerRecords[String, Array[Byte]] = consumer.poll(100);
for (consumerRecord <- asScalaIterator(records.iterator)) {
var genericRecord = deserializer.deserialize(topic, consumerRecord.value()).asInstanceOf[GenericRecord];
var msgSchema = genericRecord.get("msg").asInstanceOf[GenericRecord].getSchema();
System.out.printf("%s \n", msgSchema.getFullName());
Prints com.myorg.SomeSchemaFromTheEnum and works perfectly in my use-case.
The confusing thing is that, because of the use of GenericRecord, .get("msg") returns Object, which in the general case I have no way to safely typecast. In this limited case, I know the cast is safe.
In my limited use-case the solution in the few lines above is suitable, but for a more general solution the answer https://stackoverflow.com/a/59844401/119669 posted by https://stackoverflow.com/users/124257/fresskoma seems more appropriate.
Whether to use a DatumReader or GenericRecord is probably a matter of preference, and of whether the Kafka ecosystem is in mind; with Avro alone I'd probably prefer a DatumReader solution, but in this instance I can live with having Kafka-esque nomenclature in my code.
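Either way, a slightly more defensive version of the cast is possible, because record, enum and fixed values all implement org.apache.avro.generic.GenericContainer and expose their schema; a sketch in Java (the same idea works in Scala with a pattern match):
Object msg = genericRecord.get("msg");
if (msg instanceof GenericContainer) {
    // records, enums and fixed values carry their concrete branch schema
    Schema branchSchema = ((GenericContainer) msg).getSchema();
    System.out.println(branchSchema.getFullName());
} else {
    // primitive branches (string, int, null, ...) don't carry a schema of their own
    System.out.println("primitive or null branch: " + msg);
}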

To retrieve the schema of the value of a field, you can use
new GenericData().induce(genericRecord.get("msg"))
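A quick sketch of using it with the genericRecord from the question; induce derives a schema from the runtime value, so a record branch returns that record's full schema, while a primitive branch yields the corresponding primitive schema:
Schema branchSchema = new GenericData().induce(genericRecord.get("msg"));
System.out.println(branchSchema.getFullName());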

Related

Use mongodb BsonSerializer to serialize and deserialize data

I have complex classes like this:
abstract class Animal { ... }
class Dog: Animal{ ... }
class Cat: Animal{ ... }
class Farm{
public List<Animal> Animals {get;set;}
...
}
My goal is to send objects from computer A to computer B
I was able to achieve my goal by using BinaryFormatter serialization. It enabled me to serialize complex classes like Animal in order to transfer objects from computer A to computer B. Serialization was very fast and I only had to worry about placing a serializable attribute on top of my classes. But now BinaryFormatter is obsolete, and if you read around on the internet, future versions of .NET may remove it.
As a result I have these options:
Use System.Text.Json
This approach does not work well with polymorphism. In other words I cannot deserialize an array of cats and dogs. So I will try to avoid it.
Use protobuf
I do not want to create protobuf map files for every class. I have over 40 classes, so this is a lot of work. Or maybe there is a converter that I am not aware of? But still, how would the converter be smart enough to know that my array of animals can contain cats and dogs?
Use Newtonsoft (json.net)
I could use this solution and build something like this: https://stackoverflow.com/a/19308474/637142. Or even better, serialize the objects with a type, like this: https://stackoverflow.com/a/71398251/637142. So this will probably be my go-to option.
Use MongoDB.Bson.Serialization.BsonSerializer. Because I am dealing with a lot of complex objects, we are using MongoDB. MongoDB is able to store a Farm object easily. My goal is to retrieve objects from the database in binary format, send that binary data to another computer, and use BsonSerializer to deserialize it back into objects.
Have computer B connect to the database remotely. I cannot use this option because one of our requirements is to do everything through an API. For security reasons we are not allowed to connect remotely to the database.
I am hoping I can use option 4. It will be the most efficient, because we are already using MongoDB. If we use option 3 (which will work) we are doing extra steps: we do not need the data in JSON format. Why not just send it in binary and deserialize it once it is received by computer B? MongoDB.Driver is already doing this; I wish I knew how it does it.
This is what I have worked so far:
MongoClient m = new MongoClient("mongodb://localhost:27017");
var db = m.GetDatabase("TestDatabase");
var collection = db.GetCollection<BsonDocument>("Farms");
// I have 1s and 0s in here.
var binaryData = collection.Find("{}").ToBson();
// this is not readable
var t = System.Text.Encoding.UTF8.GetString(binaryData);
Console.WriteLine(t);
// how can I convert those 0s and 1s to a Farm object?
var collection = db.GetCollection<RawBsonDocument>(nameof(this.Calls));
var sw = new Stopwatch();
var sb = new StringBuilder();
sw.Start();
// get items
IEnumerable<RawBsonDocument>? objects = collection.Find("{}").ToList();
sb.Append("TimeToObtainFromDb: ");
sb.AppendLine(sw.Elapsed.TotalMilliseconds.ToString());
sw.Restart();
var ms = new MemoryStream();
var largestSize = 0;
// write data to the memory stream for demo purposes; in the real example I will write this to a TCP socket
foreach (var item in objects)
{
var bsonType = item.BsonType;
// write object
var bytes = item.ToBson();
ushort sizeOfBytes = (ushort)bytes.Length;
if (bytes.Length > largestSize)
largestSize = bytes.Length;
var size = BitConverter.GetBytes(sizeOfBytes);
ms.Write(size);
ms.Write(bytes);
}
sb.Append("time to serialze into bson to memory: ");
sb.AppendLine(sw.Elapsed.TotalMilliseconds.ToString());
sw.Restart();
// now, on the client side on computer B, let's pretend we are deserializing the stream
ms.Position = 0;
var clones = new List<Call>();
byte[] sizeOfArray = new byte[2];
byte[] buffer = new byte[102400]; // make this large, because if a document is larger than 102400 bytes this will fail!
while (true)
{
var i = ms.Read(sizeOfArray, 0, 2);
if (i < 1)
break;
var sizeOfBuffer = BitConverter.ToUInt16(sizeOfArray);
int position = 0;
while (position < sizeOfBuffer)
position += ms.Read(buffer, position, sizeOfBuffer - position);
//using var test = new RawBsonDocument(buffer);
using var test = new RawBsonDocumentWrapper(buffer , sizeOfBuffer);
var identityBson = test.ToBsonDocument();
var cc = BsonSerializer.Deserialize<Call>(identityBson);
clones.Add(cc);
}
sb.Append("time to deserialize from memory into clones: ");
sb.AppendLine(sw.Elapsed.TotalMilliseconds.ToString());
sw.Restart();
var serializedjs = new List<string>();
foreach(var item in clones)
{
var foo = item.SerializeToJsStandards();
if (foo.Contains("jaja"))
throw new Exception();
serializedjs.Add(foo);
}
sb.Append("time to serialze into js: ");
sb.AppendLine(sw.Elapsed.TotalMilliseconds.ToString());
sw.Restart();
foreach(var item in serializedjs)
{
try
{
var obj = item.DeserializeUsingJsStandards<Call>();
if (obj is null)
throw new Exception();
if (obj.IdAccount.Contains("jsfjklsdfl"))
throw new Exception();
}
catch(Exception ex)
{
Console.WriteLine(ex);
throw;
}
}
sb.Append("time to deserialize js: ");
sb.AppendLine(sw.Elapsed.TotalMilliseconds.ToString());
sw.Restart();

How do I update a MongoDB document with new value using reactors Mono? (Kotlin)

So the context is that I need to update a value in a single document. I have a Mono whose parameter object contains values such as the username (to find the correct user by their unique username) and an amount value.
The problem is that this value (due to other components of my application) is the amount by which I need to increase/decrease the user's balance, as opposed to being the new balance itself. I intend to do this using two Monos: one finds the user, and it is then combined with the Mono of the inbound request, where I can perform a simple sum (i.e. balance + changeRequest.amount) and then write the result back to the document database.
override fun increaseBalance(changeRequest: Mono<ChangeBalanceRequestResource>): Mono<ChangeBalanceResponse> {
val changeAmount: Mono<Decimal128> = changeRequest.map { it.transactionAmount }
val user: Mono<User> = changeRequest.flatMap { rxUserRepository.findByUsername(it.username) }
val newBalance = user.map {
val r = changeAmount.block()
it.balance = sumBalance(it.balance!!, r!!)
rxUserRepository.save(it)
}
.flatMap { it }
.map { it.balance!! }
return Mono.just(ChangeBalanceResponse("success", newBalance.block()!!))
}
Obviously I'm trying to achieve this in a non-blocking fashion. I'm also open to using only a single Mono if that's possible/optimal. I also appreciate I've truly butchered the example and used .block as a placeholder to illustrate what I'm trying to achieve.
P.S this is my first post, so any tips on how to express my problem clearer would be useful.
Here's how I would do this in Java (Using Double instead of Decimal128):
public Mono<ChangeBalanceResponse> increaseBalance(Mono<ChangeBalanceRequestResource> changeRequest) {
Mono<Double> changeAmount = changeRequest.map(a -> a.transactionAmount());
Mono<User> user = changeRequest.map(a -> a.username()).flatMap(rxUserRepository::findByUsername);
return Mono.zip(changeAmount, user).flatMap(t2 -> {
Double amount = t2.getT1();
User u = t2.getT2();
// assumes User is chained (balance(...) returns the User)
return rxUserRepository.save(u.balance(sumBalance(amount, u.balance())));
}).map(res -> new ChangeBalanceResponse("success", res.newBalance()));
}
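One thing to note: Mono.zip above subscribes to changeRequest twice (once through changeAmount and once through user). If the inbound request Mono should only be consumed once, a single flatMap chain avoids that; a rough sketch using the same hypothetical accessors as above (username(), transactionAmount(), the chained balance(...) setter and sumBalance(...)):
public Mono<ChangeBalanceResponse> increaseBalance(Mono<ChangeBalanceRequestResource> changeRequest) {
    return changeRequest.flatMap(req ->
            rxUserRepository.findByUsername(req.username())
                // balance(...) is assumed to set the value and return the User
                .map(u -> u.balance(sumBalance(req.transactionAmount(), u.balance())))
                .flatMap(rxUserRepository::save))
        .map(saved -> new ChangeBalanceResponse("success", saved.balance()));
}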

Using a Beakerx Custom Magic

I've created a custom magic command with the intention of generating a Spark query programmatically. Here's the relevant part of my class that implements MagicCommandFunctionality:
MagicCommandOutcomeItem execute(MagicCommandExecutionParam magicCommandExecutionParam) {
// get the string that was entered:
String input = magicCommandExecutionParam.command.substring(MAGIC.length())
// use the input to generate a query
String generatedQuery = Interpreter.interpret(input)
MIMEContainer result = Text(generatedQuery);
return new MagicCommandOutput(MagicCommandOutcomeItem.Status.OK, result.getData().toString());
}
This works splendidly. It returns the command that I generated (as text).
My question is: how do I coerce the notebook into evaluating that value in the cell? My guess is that a SimpleEvaluationObject and TryResult are involved, but I can't find any examples of their use.
Rather than creating the MagicCommandOutput I probably want the Kernel to create one for me. I see that the KernelMagicCommand has an execute method that would do that. Anyone have any ideas?
Okay, I found one way to do it. Here's my solution:
You can ask the current kernelManager for the kernel you're interested in,
then call PythonEntryPoint.evaluate. It seems to do the job!
@Override
MagicCommandOutcomeItem execute(MagicCommandExecutionParam magicCommandExecutionParam) {
String input = magicCommandExecutionParam.command.substring(MAGIC.length() + 1)
// this is the Scala code I want to evaluate:
String codeToExecute = <your code here>
KernelFunctionality kernel = KernelManager.get()
PythonEntryPoint pep = kernel.getPythonEntryPoint(SCALA_KERNEL)
pep.evaluate(codeToExecute)
pep.getShellMsg()
List<Message> messages = new ArrayList<>()
// collect messages from the iopub channel into the response until none are available
while (true) {
String iopubMsg = pep.getIopubMsg()
if (iopubMsg == "null") break
try {
Message msg = parseMessage(iopubMsg) //(I didn't show this part)
messages.add(msg)
String commId = (String) msg.getContent().get("comm_id")
if (commId != null) {
kernel.addCommIdManagerMapping(commId, SCALA_KERNEL)
}
} catch (IOException e) {
log.error("There was an error: ${e.getMessage()}")
return new MagicKernelResponse(MagicCommandOutcomeItem.Status.ERROR, messages)
}
}
return new MagicKernelResponse(MagicCommandOutcomeItem.Status.OK, messages)
}

Kafka streams: adding dynamic fields at runtime to avro record

I want to implement a configurable Kafka stream which reads a row of data and applies a list of transforms, like applying functions to the fields of the record, renaming fields, etc. The stream should be completely configurable, so I can specify which transforms should be applied to which field. I'm using Avro to encode the data as GenericRecords. My problem is that I also need transforms which create new columns. Instead of overwriting the previous value of a field, they should append a new field to the record, which means the schema of the record changes. The solution I came up with so far is to iterate over the list of transforms first to figure out which fields I need to add to the schema, and then create a new schema with the old and new fields combined.
The list of transforms (there is always a source field which gets passed to the transform method, and the result is then written back to the targetField):
val transforms: List[Transform] = List(
FieldTransform(field = "referrer", targetField = "referrer", method = "mask"),
FieldTransform(field = "name", targetField = "name_clean", method = "replaceUmlauts")
)
case class FieldTransform(field: String, targetField: String, method: String)
The method that creates the new schema, based on the old schema and the list of transforms:
def getExtendedSchema(schema: Schema, transforms: List[Transform]): Schema = {
var newSchema = SchemaBuilder
.builder(schema.getNamespace)
.record(schema.getName)
.fields()
// create new schema with existing fields from schemas and new fields which are created through transforms
val fields = schema.getFields ++ getNewFields(schema, transforms)
fields
.foldLeft(newSchema)((newSchema, field: Schema.Field) => {
newSchema
.name(field.name)
.`type`(field.schema())
.noDefault()
// TODO: find way to differentiate between explicitly set null defaults and fields which have no default
//.withDefault(field.defaultValue())
})
newSchema.endRecord()
}
def getNewFields(schema: Schema, transforms: List[Transform]): List[Schema.Field] = {
transforms
.filter { // only select targetFields which are not in schema
case FieldTransform(field, targetField, method) => schema.getField(targetField) == null
case _ => false
}
.distinct
.map { // create new Field object for each targetField
case FieldTransform(field, targetField, method) =>
val sourceField = schema.getField(field)
new Schema.Field(targetField, sourceField.schema(), sourceField.doc(), sourceField.defaultValue())
}
}
Instantiating a new GenericRecord based on an old record:
val extendedSchema = getExtendedSchema(row.getSchema, transforms)
val extendedRow = new GenericData.Record(extendedSchema)
for (field <- row.getSchema.getFields) {
extendedRow.put(field.name, row.get(field.name))
}
I tried to look for other solutions but couldn't find any example which had changing data types. It feels to me like there must be a simpler, cleaner solution to handle changing Avro schemas at runtime. Any ideas are appreciated.
Thanks,
Paul
I have implemented passing dynamic values to an Avro schema and validating the union types in the schema.
Example:
RestTemplate template = new RestTemplate();
HttpHeaders headers = new HttpHeaders();
headers.setContentType(MediaType.APPLICATION_JSON);
HttpEntity<String> entity = new HttpEntity<String>(headers);
ResponseEntity<String> response = template.exchange(""+registryUrl+"/subjects/"+topic+"/versions/"+version+"", HttpMethod.GET, entity, String.class);
String responseData = response.getBody();
JSONObject jsonObject = new JSONObject(responseData); // the schema definition fetched from the registry
JSONObject jsonObjectResult = new JSONObject(jsonResult); // jsonResult is the JSON payload you pass in (e.g. from Postman)
String getData = jsonObject.get("schema").toString();
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(getData);
GenericRecord genericRecord = new GenericData.Record(schema);
schema.getFields().stream().forEach(field->{
genericRecord.put(field.name(),jsonObjectResult.get(field.name()));
});
GenericDatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
boolean valid = reader.getData().validate(schema, genericRecord);
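If the record validates, the usual next step is to serialize it. A minimal sketch using the plain Avro binary encoding (note this is not the Confluent wire format, which additionally prepends a magic byte and the schema id), with GenericDatumWriter, BinaryEncoder and EncoderFactory from the Avro library:
// Sketch: write the populated GenericRecord as raw Avro binary
ByteArrayOutputStream out = new ByteArrayOutputStream();
GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(genericRecord, encoder);
encoder.flush();
byte[] avroPayload = out.toByteArray();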

Scala/Akka How do you reference the message being received?

I have a Java program that I must implement in Scala, but I am extremely new to Scala. After reading a number of SO questions and answers, as well as a number of Google-retrieved resources on case classes, I am still having trouble grasping how to acquire a reference to the message I received. Example code is below:
case class SpecialMessage(key: Int) {
val id: Int = Main.idNum.getAndIncrement().intValue()
def getId(): Int = {
return id
}
}
Then in another class's receive I am trying to reference that number with:
def receive() = {
case SpecialMessage(key) => {
val empID = ?? getId() // Get the id stored in the Special Message
// Do stuff with empID
}
}
I cannot figure out what to put on the right side of empID = in order to get that id. Is this really simple, or is it something that isn't normally done?
These are two ways to do what you want; pick the one that suits you best:
case msg: SpecialMessage => {
val empID = msg.getId() // Get the id stored in the Special Message
// Do stuff with empID
}
case msg @ SpecialMessage(key) => {
val empID = msg.getId() // Get the id stored in the Special Message
// Do stuff with empID
}
Pim's answer is good.
But maybe you can modify the structure of SpecialMessage like this:
case class SpecialMessage(key: Int, id: Int = Main.idNum.getAndIncrement().intValue())
so you can get the id directly from pattern matching:
def receive() = {
case SpecialMessage(key, empID) => {
// Do stuff with empID
}
}