How to send an Avro schema ONLY once in Kafka

I'm using the following code (not really, but let's assume it) to create a schema and send it to kafka by a producer.
public static final String USER_SCHEMA = "{"
+ "\"type\":\"record\","
+ "\"name\":\"myrecord\","
+ "\"fields\":["
+ " { \"name\":\"str1\", \"type\":\"string\" },"
+ " { \"name\":\"str2\", \"type\":\"string\" },"
+ " { \"name\":\"int1\", \"type\":\"int\" }"
+ "]}";
public static void main(String[] args) throws InterruptedException {
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(USER_SCHEMA);
Injection<GenericRecord, byte[]> recordInjection = GenericAvroCodecs.toBinary(schema);
KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);
for (int i = 0; i < 1000; i++) {
GenericData.Record avroRecord = new GenericData.Record(schema);
avroRecord.put("str1", "Str 1-" + i);
avroRecord.put("str2", "Str 2-" + i);
avroRecord.put("int1", i);
byte[] bytes = recordInjection.apply(avroRecord);
ProducerRecord<String, byte[]> record = new ProducerRecord<>("mytopic", bytes);
producer.send(record);
Thread.sleep(250);
}
producer.close();
}
The thing is the code allows me to send only 1 message with this schema. Then I need to change the schema name in order to send the next message, so the name string is a randomly generated one right now so that I can send more messages. This is a hack, so I'd like to know the proper way to do this.
I've also looked at how to send messages without a schema (i.e. having already sent 1 message with the schema to Kafka, all other messages would not need the schema anymore), but new GenericData.Record(..) expects a schema parameter. If it's null, it throws an error.
So what's the correct way to send Avro messages with a schema to Kafka?
Here is another code sample - pretty identical to mine:
https://github.com/confluentinc/examples/blob/kafka-0.10.0.1-cp-3.0.1/kafka-clients/producer/src/main/java/io/confluent/examples/producer/ProducerExample.java
It also doesn't show how to send without setting a schema.

I didn't understand the line:
The thing is the code allows me to send only 1 message with this
schema. Then I need to change the schema name in order to send the
next message.
In both of the examples, yours and the Confluent example you supplied, the schema is not sent to Kafka.
In the example you supplied, the schema is used to create a GenericRecord object. You supply the schema because you want to validate the record against it (for example, to ensure that only an integer can be put into the int1 field of the GenericRecord object).
In your code, the only difference is that you decided to serialize the data to byte[] yourself, which is probably not needed since you can delegate this responsibility to KafkaAvroSerializer, as you can see in the Confluent example.
GenericRecord is an Avro object; it is not something Kafka enforces. If you want to send any kind of object to Kafka (with or without a schema), you just need to create (or use an existing) serializer that converts your object to byte[] and set this serializer in the properties you create for the producer.
Usually it is good practice to send a pointer to the schema with the Avro message itself. You can find the reasoning for it at the following link:
http://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one/
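To make the "pointer to the schema" idea concrete, here is a minimal sketch in plain Java. It assumes the framing used by Confluent's Avro serializer: one magic byte (0), then a 4-byte big-endian schema ID, then the Avro binary payload. The class name SchemaPointerFraming and its methods are hypothetical and only illustrate the layout; in practice KafkaAvroSerializer builds this frame for you and obtains the ID from the Schema Registry.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper: shows how a small schema ID can travel with every
// message instead of the full schema text.
final class SchemaPointerFraming {
    static final byte MAGIC_BYTE = 0x0;
    static final int HEADER_SIZE = 1 + Integer.BYTES;

    // Prefix the Avro-encoded payload with the magic byte and the schema ID.
    static byte[] frame(int schemaId, byte[] avroPayload) {
        return ByteBuffer.allocate(HEADER_SIZE + avroPayload.length)
                .put(MAGIC_BYTE)
                .putInt(schemaId) // ByteBuffer is big-endian by default
                .put(avroPayload)
                .array();
    }

    // Read the schema ID back so the consumer knows which schema to look up.
    static int schemaIdOf(byte[] framed) {
        if (framed.length < HEADER_SIZE || framed[0] != MAGIC_BYTE) {
            throw new IllegalArgumentException("Not a framed Avro message");
        }
        return ByteBuffer.wrap(framed, 1, Integer.BYTES).getInt();
    }

    // Strip the header and return only the Avro binary payload.
    static byte[] payloadOf(byte[] framed) {
        schemaIdOf(framed); // validates the header
        return Arrays.copyOfRange(framed, HEADER_SIZE, framed.length);
    }
}
```

With this layout each message carries 5 extra bytes, and the schema itself is stored once, outside the topic that holds the data.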

Related

How to make Serdes work with multi-step kafka streams

I am new to Kafka and I'm building a starter project using the Twitter API as a data source. I have created a Producer which queries the Twitter API and sends the data to my Kafka topic, with a string serializer for both key and value. My Kafka Streams application reads this data and does a word count, but also groups by the date of the tweet. This part is done through a KTable called wordCounts to make use of its upsert functionality. The structure of this KTable is:
Key: {word: exampleWord, date: exampleDate}, Value: numberOfOccurences
I then attempt to restructure the data in the KTable stream to a flat structure so I can later send it to a database. You can see this in the wordCountsStructured KStream object. This restructures the data to look like the structure below. The value is initially a JsonObject, but I convert it to a string to match the Serdes which I set.
Key: null, Value: {word: exampleWord, date: exampleDate, Counts: numberOfOccurences}
However, when I try to send this to my second kafka topic, I get the error below.
A serializer (key:
org.apache.kafka.common.serialization.StringSerializer / value:
org.apache.kafka.common.serialization.StringSerializer) is not
compatible to the actual key or value type (key type:
com.google.gson.JsonObject / value type: com.google.gson.JsonObject).
Change the default Serdes in StreamConfig or provide correct Serdes
via method parameters.
I'm confused by this since the KStream I am sending to the topic is of type <String, String>. Does anyone know how I might fix this?
public class TwitterWordCounter {
private final JsonParser jsonParser = new JsonParser();
public Topology createTopology(){
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> textLines = builder.stream("test-topic2");
KTable<JsonObject, Long> wordCounts = textLines
//parse each tweet as a tweet object
.mapValues(tweetString -> new Gson().fromJson(jsonParser.parse(tweetString).getAsJsonObject(), Tweet.class))
//map each tweet object to a list of json objects, each of which containing a word from the tweet and the date of the tweet
.flatMapValues(TwitterWordCounter::tweetWordDateMapper)
//update the key so it matches the word-date combination so we can do a groupBy and count instances
.selectKey((key, wordDate) -> wordDate)
.groupByKey()
.count(Materialized.as("Counts"));
/*
In order to structure the data so that it can be ingested into SQL, the value of each item in the stream must be straightforward: property, value
so we have to:
1. take the columns which include the dimensional data and put this into the value of the stream.
2. label the count with 'count' as the column name
*/
KStream<String, String> wordCountsStructured = wordCounts.toStream()
.map((key, value) -> new KeyValue<>(null, MapValuesToIncludeColumnData(key, value).toString()));
KStream<String, String> wordCountsPeek = wordCountsStructured.peek(
(key, value) -> System.out.println("key: " + key + "value:" + value)
);
wordCountsStructured.to("test-output2", Produced.with(Serdes.String(), Serdes.String()));
return builder.build();
}
public static void main(String[] args) {
Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-application1111");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "myIPAddress");
config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
TwitterWordCounter wordCountApp = new TwitterWordCounter();
KafkaStreams streams = new KafkaStreams(wordCountApp.createTopology(), config);
streams.start();
// shutdown hook to correctly close the streams application
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}
//this method is used for taking a tweet and transforming it to a representation of the words in it plus the date
public static List<JsonObject> tweetWordDateMapper(Tweet tweet) {
try{
List<String> words = Arrays.asList(tweet.tweetText.split("\\W+"));
List<JsonObject> tweetsJson = new ArrayList<JsonObject>();
for(String word: words) {
JsonObject tweetJson = new JsonObject();
tweetJson.add("date", new JsonPrimitive(tweet.formattedDate().toString()));
tweetJson.add("word", new JsonPrimitive(word));
tweetsJson.add(tweetJson);
}
return tweetsJson;
}
catch (Exception e) {
System.out.println(e);
System.out.println(tweet.serialize().toString());
return new ArrayList<JsonObject>();
}
}
public JsonObject MapValuesToIncludeColumnData(JsonObject key, Long countOfWord) {
key.addProperty("count", countOfWord); //new JsonPrimitive(count));
return key;
}
}
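Because you are performing a key-changing operation (selectKey()) before the grouping, Kafka Streams creates a repartition topic, and for that topic it relies on the default key and value serdes, which you have set to the String serde.
You can change the groupByKey() call to pass explicit serdes, i.e. groupByKey(Grouped.with(keySerde, valueSerde)), using serdes that match the new key and value types (both are JsonObject at that point, so a JSON serde rather than the String serde), and this should help.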

Kafka stream : class cast exception during left join

I am new to Kafka. I am trying to leftJoin a Kafka stream (named inputStream) to a Kafka table (named detailTable), where the stream is built as:
//The consumer to consume the input topic
Consumed<String, NotifyRecipient> inputNotificationEventConsumed = Consumed
.with(Constants.CONSUMER_KEY_SERDE, Constants.CONSUMER_VALUE_SERDE);
//Now create the stream that is directly reading from the topic
KStream<NotifyKey, NotifyVal> initialInputStream =
streamsBuilder.stream(properties.getInputTopic(), inputNotificationEventConsumed);
//Now re-key the above stream for the purpose of left join
KStream<String, NotifyVal> inputStream = initialInputStream
.map((notifyKey,notifyVal) ->
KeyValue.pair(notifyVal.getId(),notifyVal)
);
And the kafka-table is created this way:
//The consumer for the table
Consumed<String, Detail> notifyDetailConsumed =
Consumed.with(Serdes.String(), Constants.DET_CONSUMER_VALUE_SERDE);
//Now consume from the topic into ktable
KTable<String, Detail> detailTable = streamsBuilder
.table(properties.getDetailTopic(), notifyDetailConsumed);
Now I am trying to join the inputStream to the detailTable as:
//Now join
KStream<String,Pair<Long, SendCmd>> joinedStream = inputStream
.leftJoin(detailTable, valJoiner)
.filter((key,value)->value!=null);
I am getting an error which suggests that, during the join, Kafka Streams tried to handle the key and value of inputStream with the default key serde and default value serde, resulting in a class cast exception.
Not sure how to fix this and need help there.
Let me know if I should provide more info.
Because you use map(), the key and value types might have changed, and thus you need to specify the correct Serdes via Joined.with(...) as the third parameter to .leftJoin().

How to write the ValueJoiner when joining two Kafka Streams defined using Avro Schemas?

I am building an ecommerce application, where I am currently dealing with two data feeds: order executions, and broken sales. A broken sale would be an invalid execution, for a variety of reasons. A broken sale would have the same order ref number as the order, so the join is on order ref # and line item #.
Currently, I have two topics - orders, and broken. Both have been defined using Avro Schemas, and built using SpecificRecord. The key is OrderReferenceNumber.
Fields for orders: OrderReferenceNumber, Timestamp, OrderLine, ItemNumber, Quantity
Fields for broken: OrderReferenceNumber, OrderLine, Timestamp
Corresponding Java classes were generated by running
mvn clean package
I need to left-join orders with broken and include the following fields in the output: OrderReferenceNumber, Timestamp, BrokenSaleTimestamp, OrderLine, ItemNumber, Quantity
Here is my code:
public static void main(String[] args) {
// Declare variables
final Map<String, String> avroSerdeConfig = Collections.singletonMap(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
// Add Kafka Streams Properties
Properties streamsProperties = new Properties();
streamsProperties.put(StreamsConfig.APPLICATION_ID_CONFIG, "orderProcessor");
streamsProperties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
streamsProperties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
streamsProperties.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
streamsProperties.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, SpecificAvroSerde.class);
streamsProperties.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "localhost:8081");
// Specify Kafka Topic Names
String orderTopic = "com.ecomapp.input.OrderExecuted";
String brokenTopic = "com.ecomapp.input.BrokenSale";
// Specify Serializer-Deserializer or Serdes for each Message Type
Serdes.StringSerde stringSerde = new Serdes.StringSerde();
Serdes.LongSerde longSerde = new Serdes.LongSerde();
// For the Order Executed Message
SpecificAvroSerde<OrderExecuted> ordersSpecificAvroSerde = new SpecificAvroSerde<OrderExecuted>();
ordersSpecificAvroSerde.configure(avroSerdeConfig, false);
// For the Broken Sale Message
SpecificAvroSerde<BrokenSale> brokenSpecificAvroSerde = new SpecificAvroSerde<BrokenSale>();
brokenSpecificAvroSerde.configure(avroSerdeConfig, false);
StreamsBuilder streamBuilder = new StreamsBuilder();
KStream<String, OrderExecuted> orders = streamBuilder
.stream(orderTopic, Consumed.with(stringSerde, ordersSpecificAvroSerde))
.selectKey((key, orderExec) -> orderExec.getMatchNumber().toString());
KStream<String, BrokenSale> broken = streamBuilder
.stream(brokenTopic, Consumed.with(stringSerde, brokenSpecificAvroSerde))
.selectKey((key, brokenS) -> brokenS.getMatchNumber().toString());
KStream<String, JoinOrdersExecutedNonBroken> joinOrdersNonBroken = orders
.leftJoin(broken,
(orderExec, brokenS) -> JoinOrdersExecutedNonBroken.newBuilder()
.setOrderReferenceNumber((Long) orderExec.get("OrderReferenceNumber"))
.setTimestamp((Long) orderExec.get("Timestamp"))
.setBrokenSaleTimestamp((Long) brokenS.get("Timestamp"))
.setOrderLine((Long) orderExec.get("OrderLine"))
.setItemNumber((String) orderExec.get("ItemNumber"))
.setQuantity((Long) orderExec.get("Quantity"))
.build(),
JoinWindows.of(TimeUnit.MILLISECONDS.toMillis(1)),
Joined.with(stringSerde, ordersSpecificAvroSerde, brokenSpecificAvroSerde))
.peek((key, value) -> System.out.println("key = " + key + ", value = " + value));
KafkaStreams orderStreams = new KafkaStreams(streamBuilder.build(), streamsProperties);
orderStreams.start();
// print the topology
System.out.println(orderStreams.localThreadsMetadata());
// shutdown hook to correctly close the streams application
Runtime.getRuntime().addShutdownHook(new Thread(orderStreams::close));
}
When I run this, I get the following maven compile error:
[ERROR] /Tech/Projects/jCom/src/main/java/com/ecomapp/kafka/orderProcessor.java:[96,26] incompatible types: cannot infer type-variable(s) VO,VR,K,V,VO
(argument mismatch; org.apache.kafka.streams.kstream.Joined<K,V,com.ecomapp.input.BrokenSale> cannot be converted to org.apache.kafka.streams.kstream.Joined<java.lang.String,com.ecomapp.OrderExecuted,com.ecomapp.input.BrokenSale>)
The issue really is in defining my ValueJoiner. The Confluent documentation is not very clear on how to do this when Avro schemas are involved (I can't find examples either). What is the right way to define this?
I am not sure why Java cannot infer the types here. Try specifying them explicitly:
Joined.<String,OrderExecuted,BrokenSale>with(stringSerde, ordersSpecificAvroSerde, brokenSpecificAvroSerde))
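For readers unfamiliar with the syntax, here is a small standalone sketch of explicit type arguments on a generic static method. The Triple class is a hypothetical stand-in for a generic holder such as Joined<K, V, VO>; it is not part of Kafka Streams and only shows where the type arguments go.

```java
// Hypothetical demo class: illustrates the explicit type-argument syntax only.
final class TypeWitnessDemo {

    // Stand-in for a generic holder with a static factory, similar in shape
    // to Joined.with(keySerde, valueSerde, otherValueSerde).
    static final class Triple<K, V, VO> {
        final K key;
        final V value;
        final VO other;

        private Triple(K key, V value, VO other) {
            this.key = key;
            this.value = value;
            this.other = other;
        }

        static <K, V, VO> Triple<K, V, VO> with(K key, V value, VO other) {
            return new Triple<>(key, value, other);
        }
    }

    // A method that requires fully resolved type parameters.
    static String describe(Triple<String, Integer, Long> triple) {
        return triple.key + "/" + triple.value + "/" + triple.other;
    }

    public static void main(String[] args) {
        // The type arguments go between the dot and the method name, so the
        // compiler does not have to infer them from the call context.
        System.out.println(describe(Triple.<String, Integer, Long>with("k", 1, 2L)));
    }
}
```

The same form applies to any generic static factory: the compiler uses the given types as written instead of trying to infer them from a long argument list.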

Kafka KStream to GlobalKTable join does not work with same key used

I have a very frustrating problem trying to join a KStream, that is populated by a java driver program using KafkaProducer, to a GlobalKTable that is populated from a Topic that, in turn, is populated using the JDBCConnector pulling data from a MySQL Table. No matter what I try to do the join between the KStream and the GlobalKTable, which both are keyed on the same value, will not work. What I mean is that the ValueJoiner is never called. I'll try and explain by showing the relevant config and code below. I appreciate any help.
I am using the latest version of the confluent platform.
The topic that the GlobalKTable is populated from is pulled from a single MySQL table:
Column Name/Type:
pk/bigint(20)
org_name/varchar(255)
orgId/varchar(10)
The JDBCConnector configuration for this is:
name=my-demo
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
connection.url=jdbc:mysql://localhost:3306/reporting?user=root&password=XXX
table.whitelist=organisation
mode=incrementing
incrementing.column.name=pk
topic.prefix=my-
transforms=keyaddition
transforms.keyaddition.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.keyaddition.fields=orgId
I am running the JDBC connector using the command line:
connect-standalone /home/jim/platform/confluent/etc/schema-registry/connect-avro-standalone.properties /home/jim/prg/kafka/config/my.mysql.properties
This gives me a topic called my-organisation, that is keyed on orgId ..... so far so good!
(note, the namespace does not seem to be set by JDBCConnector but I don't think this is an issue but I don't know for sure)
Now, the code. Here is how I initialise and create the GlobalKTable (relevant code shown):
final Map<String, String> serdeConfig =
Collections.singletonMap(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG,
schemaRegistryUrl);
final StreamsBuilder builder = new StreamsBuilder();
final SpecificAvroSerde<Organisation> orgSerde = new SpecificAvroSerde<>();
orgSerde.configure(serdeConfig, false);
// Create the GlobalKTable from the topic that was populated using the connect-standalone command line
final GlobalKTable<String, Organisation>
orgs =
builder.globalTable(ORG_TOPIC, Materialized.<String, Organisation, KeyValueStore<Bytes, byte[]>>as(ORG_STORE)
.withKeySerde(Serdes.String())
.withValueSerde(orgSerde));
The Avro schema from which the Organisation class is generated is defined as:
{"namespace": "io.confluent.examples.streams.avro",
"type":"record",
"name":"Organisation",
"fields":[
{"name": "pk", "type":"long"},
{"name": "org_name", "type":"string"},
{"name": "orgId", "type":"string"}
]
}
Note: as described above the orgId is set as the key on the topic using the single message transform (SMT) operation.
So, that is the GlobalKTable setup.
Now for the KStream setup (the left-hand side of the join). This has the same key (orgId) as the GlobalKTable. I use a simple driver program for this:
(The use case is that this topic will contain events associated with each organisation)
public class UploadGenerator {
public static void main(String[] args){
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put("schema.registry.url", "http://localhost:8081");
KafkaProducer producer = new KafkaProducer(props);
// This schema is also used in the consumer application or more specifically a class generated from it.
String mySchema = "{\"namespace\": \"io.confluent.examples.streams.avro\"," +
"\"type\":\"record\"," +
"\"name\":\"DocumentUpload\"," +
"\"fields\":[{\"name\":\"orgId\",\"type\":\"string\"}," +
"{\"name\":\"date\",\"type\":\"long\",\"logicalType\":\"timestamp-millis\"}]}";
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(mySchema);
// Just using three fictional organisations with the following orgIds/keys
String[] ORG_ARRAY = {"002", "003", "004"};
long count = 0;
String key = ""; // key is the realm
while(true) {
count++;
try {
TimeUnit.SECONDS.sleep(5);
} catch (InterruptedException e) {
}
GenericRecord avroRecord = new GenericData.Record(schema);
int orgId = ThreadLocalRandom.current().nextInt(0, 2 + 1);
avroRecord.put("orgId",ORG_ARRAY[orgId]);
avroRecord.put("date",new Date().getTime());
key = ORG_ARRAY[orgId];
ProducerRecord<Object, Object> record = new ProducerRecord<>("topic_uploads", key, avroRecord);
try {
producer.send(record);
producer.flush();
} catch(SerializationException e) {
System.out.println("Exccccception was generated! + " + e.getMessage());
} catch(Exception el) {
System.out.println("Exception: " + el.getMessage());
}
}
}
}
So, this generates a new event representing an upload for an organisation represented by the orgId but also specifically set in the key variable used in the ProducerRecord.
Here is the code that sets up the KStream for these events:
final SpecificAvroSerde<DocumentUpload> uploadSerde = new SpecificAvroSerde<>();
uploadSerde.configure(serdeConfig, false);
// Get the stream of uploads
final KStream<String, DocumentUpload> uploadStream = builder.stream(UPLOADS_TOPIC, Consumed.with(Serdes.String(), uploadSerde));
// Debug output to see the contents of the stream
uploadStream.foreach((k, v) -> System.out.println("uploadStream: Key: " + k + ", Value: " + v));
// Note, I tried to re-key the stream with the orgId field (even though it was set as the key in the driver but same problem)
final KStream<String, DocumentUpload> keyedUploadStream = uploadStream.selectKey((key, value) -> value.getOrgId());
keyedUploadStream.foreach((k, v) -> System.out.println("keyedUploadStream: Key: " + k + ", Value: " + v));
// Java 7 form used as it was easier to put in debug statements
// OrgPK is just a helper class defined in the same class
KStream<String, OrgPk> joined = keyedUploadStream.leftJoin(orgs,
new KeyValueMapper<String, DocumentUpload, String>() { /* derive a (potentially) new key by which to lookup against the table */
@Override
public String apply(String key, DocumentUpload value) {
System.out.println("1. The key passed in is: " + key);
System.out.println("2. The upload realm passed in is: " + value.getOrgId());
return value.getOrgId();
}
},
// THIS IS NEVER CALLED WITH A join() AND WHEN CALLED WITH A leftJoin() HAS A NULL ORGANISATION
new ValueJoiner<DocumentUpload, Organisation, OrgPk>() {
@Override
public OrgPk apply(DocumentUpload leftValue, Organisation rightValue) {
System.out.println("3. Value joiner has been called...");
if( null == rightValue ) {
// THIS IS ALWAYS CALLED, SO THERE IS NEVER A "MATCH"
System.out.println(" 3.1. Orgnisation is NULL");
return new OrgPk(leftValue.getOrgId(), 1L);
}
System.out.println(" 3.1. Org is OK");
// Never reaches here - this is the issue i.e. there is never a match
return new OrgPk(leftValue.getOrgId(), rightValue.getPk());
}
});
So, the above join (or leftJoin) never matches, even though the two keys are the same! This is the main issue.
Finally, the avro schema for the DocumentUpload is:
{"namespace": "io.confluent.examples.streams.avro",
"type":"record",
"name":"DocumentUpload",
"fields":[
{"name": "orgId", "type":"string"},
{"name":"date", "type":"long", "logicalType":"timestamp-millis"}
]
}
So, in summary:
I have a KStream on a topic with a String key of OrgId
I have a GlobalKTable on a topic with a String key of OrgId also.
The join never works, even though the keys are in the GlobalKTable (at least they are in the topic underlying the GlobalKTable)
Can someone help me? I am pulling my hair out trying to figure this out.
I was able to solve this issue on Windows/IntelliJ by providing a state directory config in the streams properties:
StreamsConfig.STATE_DIR_CONFIG
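As a rough sketch of what that looks like, the snippet below adds the setting to a plain Properties object. It uses the literal key "state.dir", which is the value of StreamsConfig.STATE_DIR_CONFIG, so that it compiles without Kafka on the classpath. The class name StateDirConfigSketch and the directory name are hypothetical; pick any directory the application can write to.

```java
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper: copies existing streams properties and adds an
// explicit, absolute state directory.
final class StateDirConfigSketch {

    static Properties withStateDir(Properties base, String directory) {
        Properties config = new Properties();
        config.putAll(base);
        // "state.dir" is the key behind StreamsConfig.STATE_DIR_CONFIG.
        config.put("state.dir", Paths.get(directory).toAbsolutePath().toString());
        return config;
    }

    public static void main(String[] args) {
        // "kafka-streams-state" is an assumed directory name for illustration.
        Properties config = withStateDir(new Properties(), "kafka-streams-state");
        System.out.println(config.getProperty("state.dir"));
    }
}
```

The resulting Properties object would then be passed to the KafkaStreams constructor together with the other settings.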

Avro with Kafka - Deserializing with changing schema

Based on an Avro schema, I generated a class (Data) that corresponds to the schema.
After that, I encode the data and send it to another application, "A", using Kafka:
Data data; // <- The object was initialized before . Here it is only the declaration "for example"
EncoderFactory encoderFactory = EncoderFactory.get();
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = encoderFactory.directBinaryEncoder(out, null);
DatumWriter<Data> writer;
writer = new SpecificDatumWriter<Data>(Data.class);
writer.write(data, encoder);
byte[] avroByteMessage = out.toByteArray();
On the other side (in application "A"), I deserialize the data by implementing Deserializer:
class DataDeserializer implements Deserializer<Data> {
private String encoding = "UTF8";
@Override
public void configure(Map<String, ?> configs, boolean isKey) {
// nothing to do
}
@Override
public Data deserialize(String topic, byte[] data) {
try {
if (data == null)
{
return null;
}
else
{
DatumReader<Data> reader = new SpecificDatumReader<Data>(Data.class);
DecoderFactory decoderFactory = DecoderFactory.get();
BinaryDecoder decoder = decoderFactory.binaryDecoder( data, null);
Data decoded = reader.read(null, decoder);
return decoded;
}
} catch (Exception e) {
throw new SerializationException("Error when deserializing byte[] to Data", e);
}
}
}
The problem is that this approach requires the use of SpecificDatumReader, i.e. the Data class has to be integrated with the application code. This could be problematic: the schema could change, and then the Data class would have to be regenerated and integrated once more.
2 questions:
Should I use GenericDatumReader in the application? How do I do that correctly? (I can simply save the schema in the application.)
Is there a simple way to work with SpecificDatumReader if Data changes? How could it be integrated without much trouble?
Thanks
I use GenericDatumReader -- well, actually I derive my reader class from it, but you get the point. To use it, I keep my schemas in a special Kafka topic -- Schema surprisingly enough. Consumers and producers both, on startup, read from this topic and configure their respective parsers.
If you do it like this, you can even have your consumers and producers update their schemas on the fly, without having to restart them. This was a design goal for me -- I didn't want to have to restart my applications in order to add or change schemas. Which is why SpecificDatumReader doesn't work for me, and honestly why I use Avro in the first place instead of something like Thrift.
Update
The normal way to do Avro is to store the schema in the file with the records. I don't do it that way, primarily because I can't. I use Kafka, so I can't store the schemas directly with the data -- I have to store the schemas in a separate topic.
The way I do it, first I load all of my schemas. You can read them from a text file; but like I said, I read them from a Kafka topic. After I read them from Kafka, I have an array like this:
val schemaArray: Array[String] = Array(
"""{"name":"MyObj","type":"record","fields":[...]}""",
"""{"name":"MyOtherObj","type":"record","fields":[...]}"""
)
Apologies for the Scala, BTW, but it's what I've got.
At any rate, you then need to create a parser and, for each schema, parse it, create readers and writers, and save them off to Maps:
val parser = new Schema.Parser()
val schemas = Map(schemaArray.map{s => parser.parse(s)}.map(s => (s.getName, s)):_*)
val readers = schemas.map(s => (s._1, new GenericDatumReader[GenericRecord](s._2)))
val writers = schemas.map(s => (s._1, new GenericDatumWriter[GenericRecord](s._2)))
var decoder: BinaryDecoder = null
I do all of that before I parse an actual record -- that's just to configure the parser. Then, to decode an individual record I would do:
val byteArray: Array[Byte] = ... // <-- Avro encoded record
val schemaName: String = ... // <-- name of the Avro schema
val reader = readers.get(schemaName).get
decoder = DecoderFactory.get.binaryDecoder(byteArray, decoder)
val record = reader.read(null, decoder)
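For Java readers, the lookup-by-schema-name idea, including replacing a schema's reader at runtime without a restart, can be sketched roughly as below. This is not a translation of the Scala above: the Avro reader is abstracted as a plain Function<byte[], T>, and the class ReaderRegistry is hypothetical. In a real application each function would wrap a GenericDatumReader built from the parsed schema.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical registry: maps a schema name to the decoder for that schema.
final class ReaderRegistry<T> {
    private final Map<String, Function<byte[], T>> readers = new ConcurrentHashMap<>();

    // Add a reader, or replace the existing one when a schema changes.
    void register(String schemaName, Function<byte[], T> reader) {
        readers.put(schemaName, reader);
    }

    // Decode one record with the reader currently registered for its schema.
    T decode(String schemaName, byte[] bytes) {
        Function<byte[], T> reader = readers.get(schemaName);
        if (reader == null) {
            throw new IllegalArgumentException("No reader registered for schema: " + schemaName);
        }
        return reader.apply(bytes);
    }
}
```

A consumer of the schema topic would call register(...) whenever a new schema version arrives, and the record-processing code would keep calling decode(...) unchanged.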