Kafka streams: adding dynamic fields at runtime to avro record - scala

I want to implement a configurable Kafka stream that reads a row of data and applies a list of transforms, such as applying functions to the fields of the record or renaming fields. The stream should be completely configurable, so that I can specify which transforms are applied to which field. I'm using Avro to encode the data as GenericRecords. My problem is that I also need transforms that create new columns: instead of overwriting the previous value of a field, they should append a new field to the record, which means the schema of the record changes. The solution I have come up with so far is to iterate over the list of transforms first to figure out which fields I need to add to the schema, and then create a new schema that combines the old fields with the new ones.
The list of transforms (there is always a source field that is passed to the transform method, and the result is written to the targetField):
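// Transform is not shown in the question; it is assumed here to be the common supertype of all transforms
sealed trait Transform
case class FieldTransform(field: String, targetField: String, method: String) extends Transform
val transforms: List[Transform] = List(
  FieldTransform(field = "referrer", targetField = "referrer", method = "mask"),
  FieldTransform(field = "name", targetField = "name_clean", method = "replaceUmlauts")
)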
Method to create the new schema, based on the old schema and the list of transforms:
import org.apache.avro.{Schema, SchemaBuilder}
import scala.collection.JavaConverters._
def getExtendedSchema(schema: Schema, transforms: List[Transform]): Schema = {
  // existing fields of the schema plus the new fields created by the transforms
  val fields = schema.getFields.asScala ++ getNewFields(schema, transforms)
  val assembler = SchemaBuilder
    .builder(schema.getNamespace)
    .record(schema.getName)
    .fields()
  fields
    .foldLeft(assembler) { (acc, field) =>
      acc
        .name(field.name)
        .`type`(field.schema())
        .noDefault()
      // TODO: find way to differentiate between explicitly set null defaults and fields which have no default
      //.withDefault(field.defaultValue())
    }
    .endRecord()
}
def getNewFields(schema: Schema, transforms: List[Transform]): List[Schema.Field] = {
  transforms
    .filter { // only select targetFields which are not in schema
      case FieldTransform(_, targetField, _) => schema.getField(targetField) == null
      case _ => false
    }
    .distinct
    .map { // create new Field object for each targetField
      case FieldTransform(field, targetField, _) =>
        val sourceField = schema.getField(field)
        new Schema.Field(targetField, sourceField.schema(), sourceField.doc(), sourceField.defaultValue())
    }
}
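Instantiating a new GenericRecord based on an old record:
import scala.collection.JavaConverters._
val extendedSchema = getExtendedSchema(row.getSchema, transforms)
val extendedRow = new GenericData.Record(extendedSchema)
// copy the existing values; fields added by transforms are still unset at this point
for (field <- row.getSchema.getFields.asScala) {
  extendedRow.put(field.name, row.get(field.name))
}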
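The copy loop above carries the old values over but does not show how a transform then overwrites or appends a field. A rough sketch of that step follows, in plain Java with a LinkedHashMap standing in for the GenericRecord so it runs without Avro. The class name TransformSketch, the METHODS table, and the behavior of "mask" and "replaceUmlauts" are all assumptions made for illustration; the question does not define them.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

final class TransformSketch {
    // Stand-in for the question's FieldTransform case class.
    static final class FieldTransform {
        final String field;
        final String targetField;
        final String method;

        FieldTransform(String field, String targetField, String method) {
            this.field = field;
            this.targetField = targetField;
            this.method = method;
        }
    }

    // Hypothetical transform implementations, looked up by method name.
    static final Map<String, UnaryOperator<String>> METHODS = new HashMap<>();
    static {
        METHODS.put("mask", s -> "***");
        METHODS.put("replaceUmlauts", s -> s
                .replace("\u00e4", "ae")
                .replace("\u00f6", "oe")
                .replace("\u00fc", "ue"));
    }

    // Copies the row, then writes each transform result to its target field.
    // An existing target is overwritten in place; a new target is appended.
    static Map<String, String> apply(Map<String, String> row, List<FieldTransform> transforms) {
        Map<String, String> out = new LinkedHashMap<>(row);
        for (FieldTransform t : transforms) {
            String source = row.get(t.field);
            out.put(t.targetField, METHODS.get(t.method).apply(source));
        }
        return out;
    }
}
```

With the two transforms from the question, "referrer" is overwritten, "name" is left untouched, and "name_clean" is appended after the existing fields, which mirrors the field order produced by getExtendedSchema.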
I looked for other solutions but couldn't find any example that dealt with changing data types. It feels like there must be a simpler, cleaner way to handle changing Avro schemas at runtime. Any ideas are appreciated.
Thanks,
Paul

I have implemented passing dynamic values into an Avro schema and validating them, unions included, against the schema.
Example:
// fetch the schema for the given subject and version from the schema registry
RestTemplate template = new RestTemplate();
HttpHeaders headers = new HttpHeaders();
headers.setContentType(MediaType.APPLICATION_JSON);
HttpEntity<String> entity = new HttpEntity<>(headers);
ResponseEntity<String> response = template.exchange(registryUrl + "/subjects/" + topic + "/versions/" + version, HttpMethod.GET, entity, String.class);
String responseData = response.getBody();
JSONObject jsonObject = new JSONObject(responseData);
// jsonResult is the JSON string with the field values, e.g. the one you pass from Postman
JSONObject jsonObjectResult = new JSONObject(jsonResult);
String getData = jsonObject.get("schema").toString();
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(getData);
// fill a record with the values from the JSON and validate it against the schema
GenericRecord genericRecord = new GenericData.Record(schema);
schema.getFields().forEach(field -> genericRecord.put(field.name(), jsonObjectResult.get(field.name())));
GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
boolean data = reader.getData().validate(schema, genericRecord);

Related

MinBy does not return any result Apache Storm

I have built a Storm topology that retrieves data from Kafka, and I would like to build an aggregation that computes the minimum of one of the fields for each batch. I tried to use the maxBy function on the stream; however, it does not display any results, although data is flowing through the system and the output function worked with other aggregations. How can this be implemented differently, or what can be fixed in the current implementation?
Here is my current implementation:
val tridentTopology = new TridentTopology()
val stream = tridentTopology.newStream("kafka_spout",
    new KafkaTridentSpoutOpaque(spoutConfig))
  .map(new ParserMapFunction, new Fields("created_at", "id", "text", "source", "timestamp_ms",
    "user.id", "user.name", "user.location", "user.url", "user.description", "user.followers_count",
    "user.friends_count", "user.favorite_count", "user.lang", "entities.hashtags"))
  .maxBy("user.followers_count")
  .map(new OutputFunction)
My custom output function:
class OutputFunction extends MapFunction {
  override def execute(input: TridentTuple): Values = {
    val values = input.getValues.asScala.toList.toString
    println(s"TWEET: $values")
    new Values(values)
  }
}

Differentiating an AVRO union type

I'm consuming Avro serialized messages from Kafka using the "automatic" deserializer like:
props.put(
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
"io.confluent.kafka.serializers.KafkaAvroDeserializer"
);
props.put("schema.registry.url", "https://example.com");
This works brilliantly, and is right out of the docs at https://docs.confluent.io/current/schema-registry/serializer-formatter.html#serializer.
The problem I'm facing is that I actually just want to forward these messages, but to do the routing I need some metadata from inside them. Some technical constraints mean that I can't feasibly compile in generated class files to use KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG => true, so I am using a regular decoder without being tied into Kafka: I read the bytes as an Array[Byte] and pass them to a manually constructed deserializer:
val maxSchemasToCache = 1000
val schemaRegistryURL = "https://example.com/"
val specificDeserializerProps = Map(
  "schema.registry.url" -> schemaRegistryURL,
  KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG -> "false"
)
val client = new CachedSchemaRegistryClient(schemaRegistryURL, maxSchemasToCache)
val deserializer = new KafkaAvroDeserializer(client, specificDeserializerProps.asJava)
The messages are a "container" type, where the really interesting part is the msg record field, a union { A, B, C } of about 25 types:
record Event {
  timestamp_ms created_at;
  union {
    Online,
    Offline,
    Available,
    Unavailable,
    ...
    ...Failed,
    ...Updated
  } msg;
}
So I'm successfully reading an Array[Byte] and feeding it into the deserializer like this:
var genericRecord = deserializer.deserialize(topic, consumerRecord.value())
.asInstanceOf[GenericRecord];
var schema = genericRecord.getSchema();
var msgSchema = schema.getField("msg").schema();
The problem, however, is that I can find no way to discern, discriminate or "resolve" the "type" of the msg field through the union:
System.out.printf(
"msg.schema = %s msg.schema.getType = %s\n",
msgSchema.getFullName(),
msgSchema.getType().name());
=> msg.schema = union msg.schema.getType = union
How can I discriminate the types in this scenario? The Confluent registry knows these things: they have names, they have "types", even if I'm treating them as GenericRecords.
My goal here is to know that record.msg is of "type" Online | Offline | Available rather than just knowing it's a union.
After having looked into the implementation of the Avro Java library, I think it's safe to say that this is impossible given the current API. I've found the following way of extracting the types while parsing, using a custom GenericDatumReader subclass, but it needs a lot of polishing before I'd use something like this in production code :D
So here's the subclass:
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.io.ResolvingDecoder;
import java.io.IOException;
import java.util.List;
public class CustomReader<D> extends GenericDatumReader<D> {
    // collects the schema of the branch chosen for every union that is read
    private final List<Schema> unionTypes;
    // vvv This is the constructor I've modified, added a list of types
    public CustomReader(Schema schema, List<Schema> unionTypes) {
        super(schema, schema, GenericData.get());
        this.unionTypes = unionTypes;
    }
    @Override
    protected Object readWithoutConversion(Object old, Schema expected, ResolvingDecoder in) throws IOException {
        if (expected.getType() == Schema.Type.UNION) {
            // vvv The magic happens here
            Schema type = expected.getTypes().get(in.readIndex());
            unionTypes.add(type);
            return read(old, type, in);
        }
        // every other type is handled exactly as in GenericDatumReader
        return super.readWithoutConversion(old, expected, in);
    }
}
I've added comments to the code for the interesting parts; everything except the UNION case is delegated to the parent class.
Then you can use this custom reader like this:
List<Schema> unionTypes = new ArrayList<>();
DatumReader<GenericRecord> datumReader = new CustomReader<GenericRecord>(schema, unionTypes);
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(eventFile, datumReader);
GenericRecord event = null;
while (dataFileReader.hasNext()) {
    event = dataFileReader.next(event);
}
System.out.println(unionTypes);
This will print, for each union parsed, the type of that union. Note that you'll have to figure out which element of that list is interesting to you depending on how many unions you have in a record, etc.
Not pretty tbh :D
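For background on why in.readIndex() can name the branch at all: in Avro's binary encoding a union value is written as the zero-based position of the chosen branch, encoded as a zig-zag varint, followed by the value itself. The sketch below decodes such an index by hand in plain Java, so it runs without Avro on the classpath. The class and method names are illustrative and not part of the Avro API, and it assumes the byte array starts exactly at the union; in a real message the preceding fields and any framing bytes would have to be skipped first.

```java
import java.util.List;

final class UnionIndexSketch {
    // Decodes one zig-zag varint starting at the given offset
    // (the wire format Avro uses for int and long values).
    static long readZigZagVarint(byte[] buf, int offset) {
        long raw = 0;
        int shift = 0;
        int i = offset;
        while (true) {
            int b = buf[i++] & 0xFF;
            raw |= (long) (b & 0x7F) << shift;   // low 7 bits carry data
            if ((b & 0x80) == 0) {               // high bit clear: last byte
                break;
            }
            shift += 7;
        }
        return (raw >>> 1) ^ -(raw & 1);         // undo zig-zag encoding
    }

    // Maps the encoded branch index to the name of the union branch.
    static String branchName(List<String> branches, byte[] encodedUnion) {
        int index = (int) readZigZagVarint(encodedUnion, 0);
        return branches.get(index);
    }
}
```

For a union declared as Online, Offline, Available, the single byte 0x02 decodes to index 1, so branchName returns "Offline". This is the same lookup that expected.getTypes().get(in.readIndex()) performs inside the reader.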
I was able to come up with a single-use solution after a lot of digging:
val records: ConsumerRecords[String, Array[Byte]] = consumer.poll(100)
for (consumerRecord <- asScalaIterator(records.iterator)) {
  val genericRecord = deserializer.deserialize(topic, consumerRecord.value()).asInstanceOf[GenericRecord]
  val msgSchema = genericRecord.get("msg").asInstanceOf[GenericRecord].getSchema()
  System.out.printf("%s \n", msgSchema.getFullName())
}
Prints com.myorg.SomeSchemaFromTheEnum and works perfectly in my use-case.
The confusing thing is that, because of the use of GenericRecord, .get("msg") returns Object, which in general I have no way to cast safely. In this limited case, I know the cast is safe.
In my limited use-case the solution in the few lines above is suitable, but for a more general solution the answer https://stackoverflow.com/a/59844401/119669 posted by https://stackoverflow.com/users/124257/fresskoma seems more appropriate.
Whether to use a DatumReader or GenericRecord is probably a matter of preference and of whether the Kafka ecosystem is in mind. With Avro alone I'd probably prefer a DatumReader solution, but in this instance I can live with having Kafka-esque nomenclature in my code.
To retrieve the schema of the value of a field, you can use
new GenericData().induce(genericRecord.get("msg"))

Build dynamic LINQ queries from a string - Use Reflection?

I have some Word templates (maybe thousands). Each template has merge fields that are filled from a database. I don't want to write separate code for every template and then rebuild and redeploy the application whenever a template changes or a field is added to a template!
Instead, I'm trying to define all merge fields in a separate XML file, and for each field I want to write the "query" that will be called when needed. Example:
mergefield1 will call query "Case.Parties.FirstOrDefault.NameEn"
mergefield2 will call query "Case.CaseNumber"
mergefield3 will call query "Case.Documents.FirstOrDefault.DocumentContent.DocumentType"
Etc,
So, for a particular template I scan its merge fields, and for each merge field I take its "query definition" and make that request to the database using Entity Framework and LINQ. For example, it works for these queries: "TimeSlots.FirstOrDefault.StartDateTime" or
"Case.CaseNumber"
This will be an engine that generates Word documents and fills in their merge fields based on the XML definitions. In addition, it will work for any new template or new merge field.
Now, I have a working version that uses reflection.
public string GetColumnValueByObjectByName(Expression<Func<TEntity, bool>> filter = null, string objectName = "", string dllName = "", string objectID = "", string propertyName = "")
{
    string objectDllName = objectName + ", " + dllName;
    Type type = Type.GetType(objectDllName);
    Guid oID = new Guid(objectID);
    dynamic Entity = context.Set(type).Find(oID); // get Object by Type and ObjectID
    string value = ""; //the value which will be filled with data from database
    IEnumerable<string> linqMethods = typeof(System.Linq.Enumerable).GetMethods(BindingFlags.Static | BindingFlags.Public).Select(s => s.Name).ToList(); //get all linq methods and save them as list of strings
    if (propertyName.Contains('.'))
    {
        string[] properties = propertyName.Split('.');
        dynamic object1 = Entity;
        IEnumerable<dynamic> Child = new List<dynamic>();
        for (int i = 0; i < properties.Length; i++)
        {
            if (i < properties.Length - 1 && linqMethods.Contains(properties[i + 1]))
            {
                Child = type.GetProperty(properties[i]).GetValue(object1, null);
            }
            else if (linqMethods.Contains(properties[i]))
            {
                object1 = Child.Cast<object>().FirstOrDefault(); //for now works only with FirstOrDefault - Later it will be changed to work with ToList or other linq methods
                type = object1.GetType();
            }
            else
            {
                if (linqMethods.Contains(properties[i]))
                {
                    object1 = type.GetProperty(properties[i + 1]).GetValue(object1, null);
                }
                else
                {
                    object1 = type.GetProperty(properties[i]).GetValue(object1, null);
                }
                type = object1.GetType();
            }
        }
        value = object1.ToString(); //.StartDateTime.ToString();
    }
    return value;
}
I'm not sure if this is the best approach. Does anyone have a better suggestion, or has someone already done something like this?
To shorten it: The idea is to make generic linq queries to database from a string like: "Case.Parties.FirstOrDefault.NameEn".
Your approach is very good. I have no doubt that it already works.
Another approach is to use expression trees, as @Egorikas has suggested.
Disclaimer: I'm the owner of the project Eval-Expression.NET
In short, this library allows you to evaluate almost any C# code at runtime (which is exactly what you want to do).
I would suggest you use my library instead, to make the code:
more readable
easier to support
more flexible
Example
public string GetColumnValueByObjectByName(Expression<Func<TEntity, bool>> filter = null, string objectName = "", string dllName = "", string objectID = "", string propertyName = "")
{
    string objectDllName = objectName + ", " + dllName;
    Type type = Type.GetType(objectDllName);
    Guid oID = new Guid(objectID);
    object entity = context.Set(type).Find(oID); // get Object by Type and ObjectID
    var value = Eval.Execute("x." + propertyName, new { x = entity });
    return value.ToString();
}
The library also allows you to use dynamic strings with IQueryable
Wiki: LINQ-Dynamic

Mapping PostgreSQL integer[] data type to Grails

In my PostgreSQL DB I have a field with the data type int[]. When I try to map it to a Grails domain class property of type Integer[], the application fails to start:
org.hibernate.type.SerializationException: could not deserialize
Is there any other way to achieve this?
I also tried this: //insurance column: 'rs_insurance', sqlType: "integer[]"
The byte[] type works out of the box and is mapped to the respective BLOB type.
If that doesn't suit you, you can serialize your array upon saving and deserialize it upon loading:
void setValue( v ) {
  ByteArrayOutputStream baos = new ByteArrayOutputStream()
  baos.withObjectOutputStream{ it.writeObject v }
  blob = baos.toByteArray()
}
def getValue() {
  def out = null
  if( blob ) new ByteArrayInputStream( blob ).withObjectInputStream{ out = it.readObject() }
  out
}

How to count new element from stream by using spark-streaming

I have implemented a daily batch computation. Here is some pseudo-code.
A "newUser" may also be called a first-activated user.
// Get today log from hbase or somewhere else
val log = getRddFromHbase(todayDate)
// Compute active user
val activeUser = log.map(line => ((line.uid, line.appId), line)).reduceByKey(distinctStrategyMethod)
// Get history user from hdfs
val historyUser = loadFromHdfs(path + yesterdayDate)
// Compute new user from active user and historyUser
val newUser = activeUser.subtractByKey(historyUser)
// Get new history user
val newHistoryUser = historyUser.union(newUser)
// Save today history user
saveToHdfs(path + todayDate)
Computation of "activeUser" can be converted to spark-streaming easily. Here is some code:
val transformedLog = sdkLogDs.map(sdkLog => {
  val time = System.currentTimeMillis()
  val timeToday = ((time - (time + 3600000 * 8) % 86400000) / 1000).toInt
  ((sdkLog.appid, sdkLog.bcode, sdkLog.uid), (sdkLog.channel_no, sdkLog.ctime.toInt, timeToday))
})
// per key, keep the first entry with the smallest ctime inside the window
val activeUser = transformedLog
  .groupByKeyAndWindow(Seconds(86400), Seconds(60))
  .mapValues(_.minBy(_._2))
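The timeToday expression above is compact, so a worked example may help. Adding 8 hours shifts the clock to UTC+8, the remainder modulo one day is then the time elapsed since local midnight, and subtracting that from the original timestamp gives local midnight as a UTC epoch value. The sketch below, in plain Java, checks this arithmetic against java.time. The class and method names are illustrative, and the UTC+8 offset is an assumption read off the constant 3600000 * 8.

```java
import java.time.Instant;
import java.time.ZoneOffset;

final class DayStartSketch {
    static final long HOUR_MS = 3_600_000L;
    static final long DAY_MS = 86_400_000L;

    // Same arithmetic as the Scala expression:
    // epoch seconds of the most recent local midnight in UTC+8.
    static long dayStartArithmetic(long epochMillis) {
        return (epochMillis - (epochMillis + 8 * HOUR_MS) % DAY_MS) / 1000;
    }

    // Reference computation with java.time, for comparison.
    static long dayStartJavaTime(long epochMillis) {
        ZoneOffset utcPlus8 = ZoneOffset.ofHours(8);
        return Instant.ofEpochMilli(epochMillis)
                .atZone(utcPlus8)
                .toLocalDate()
                .atStartOfDay(utcPlus8)
                .toEpochSecond();
    }

    public static void main(String[] args) {
        long t = 1_700_000_000_000L; // 2023-11-14T22:13:20Z, i.e. 06:13:20 on Nov 15 in UTC+8
        System.out.println(dayStartArithmetic(t)); // 1699977600 (2023-11-14T16:00:00Z)
        System.out.println(dayStartJavaTime(t));   // 1699977600
    }
}
```

The two methods agree for non-negative timestamps. For negative input the % operator can return a negative remainder, so the arithmetic version should not be relied on there.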
But the approach for "newUser" and "historyUser" confuses me.
I think my question can be summarized as "how to count new elements from a stream". As in my pseudo-code above, "newUser" is a subset of "activeUser", and I must maintain a set of "historyUser" to know which users are new.
I have considered the following approach, but I think it may not work correctly:
Load the history users as an RDD. For each RDD in the "activeUser" DStream, find the elements that don't exist in "historyUser". A problem here is when I should update the "historyUser" RDD to make sure I get the right "newUser" for a window.
Updating the "historyUser" RDD means adding "newUser" to it, just like in the pseudo-code above, where "historyUser" is updated once a day. Another problem is how to perform this RDD update from a DStream. I think updating "historyUser" when the window slides is appropriate, but I haven't found a suitable API to do this.
So what is the best practice to solve this problem?
updateStateByKey would help here, as it allows you to set an initial state (your historical users) and then update it on each interval of your main stream. I put some code together to explain the concept:
val historyUsers = loadFromHdfs(path + yesterdayDate).map(UserData(...))
case class UserStatusState(isNew: Boolean, values: UserData)
// this will prepare the RDD of already known historical users
// to pass into updateStateByKey as initial state
val initialStateRDD = historyUsers.map(user => UserStatusState(false, user))
// stateful stream
val trackUsers = sdkLogDs.updateStateByKey(updateState, new HashPartitioner(sdkLogDs.ssc.sparkContext.defaultParallelism), true, initialStateRDD)
// only new users
val newUsersStream = trackUsers.filter(_._2.isNew)
def updateState(newValues: Seq[UserData], prevState: Option[UserStatusState]): Option[UserStatusState] = {
  if (newValues.isEmpty) {
    // updateStateByKey also calls this for known keys that received no data in the current batch;
    // keep the stored data and mark the user as no longer new
    prevState.map(_.copy(isNew = false))
  } else {
    // Group all values for specific user as needed
    val groupedUserData: UserData = newValues.reduce(...)
    // prevState is defined only for users previously seen in the stream
    // or loaded as initial state from historyUsers RDD
    // For new users it is None
    val isNewUser = prevState.isEmpty
    // as you return state here for the user - prevState won't be None on next iterations
    Some(UserStatusState(isNewUser, groupedUserData))
  }
}
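To make the state transitions concrete without a Spark cluster, here is a small plain-Java simulation of the same idea. It is only a sketch of the semantics, not Spark code: a map plays the role of the per-key state, it is seeded with the history users, and each call to processBatch stands for one batch interval. The class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class NewUserStateSketch {
    // key -> isNew flag; plays the role of the state kept by updateStateByKey
    private final Map<String, Boolean> state = new HashMap<>();

    // Seed the state with already known users (the "historyUser" set).
    NewUserStateSketch(Set<String> historyUsers) {
        for (String user : historyUsers) {
            state.put(user, false);
        }
    }

    // Process one batch of active users and return the users seen for the first time.
    Set<String> processBatch(Set<String> activeUsers) {
        // users that were new in an earlier batch count as known from now on
        state.replaceAll((user, wasNew) -> false);
        Set<String> newUsers = new TreeSet<>();
        for (String user : activeUsers) {
            if (!state.containsKey(user)) { // no previous state, like prevState == None
                state.put(user, true);
                newUsers.add(user);
            }
        }
        return newUsers;
    }
}
```

Seeded with "alice", a first batch containing "alice" and "bob" reports only "bob" as new, and a second batch containing "bob" and "carol" reports only "carol". Because the history lives in the state itself, there is no separate "historyUser" RDD to refresh when the window slides.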