Can serialVersionUID be defined in a class that implements the Kafka Serializer/Deserializer interfaces (org.apache.kafka.common.serialization), similar to what we do for classes that implement Serializable? Would we face a NotSerializableException during a Java or Kafka upgrade?
For example, in a KTable the key/value is serialized with a custom serializer class. So, if we persisted some data and an upgrade takes place for Kafka, Java, or both, will we be able to read (deserialize) the key/value that was persisted prior to the upgrade, without any backward-compatibility constraints?
Reference:
What is a serialVersionUID and why should I use it?
If you write a custom Kafka Serializer/Deserializer, there is no need to specify Java's serialVersionUID. You can still do it, but it would be unused.
The purpose of serialVersionUID is to support Java's built-in serialization mechanism, and that mechanism is not used when you implement a custom Serializer/Deserializer for Kafka.
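To make that concrete, here is a minimal Scala sketch of a custom Serializer/Deserializer pair (the Foo type and the choice of Jackson for the byte format are just illustrative, not from the question). Notice there is no serialVersionUID anywhere: Kafka only ever sees the byte arrays you return, so compatibility across Java/Kafka upgrades depends on the byte format you choose, not on Java serialization.

import java.util
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule // assumes jackson-module-scala
import org.apache.kafka.common.serialization.{Deserializer, Serializer}

// Hypothetical payload type -- not from the original question.
case class Foo(id: Long, name: String)

// No serialVersionUID anywhere: Kafka only ever sees the byte[] we produce.
class FooSerializer extends Serializer[Foo] {
  private val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
  override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = ()
  override def serialize(topic: String, data: Foo): Array[Byte] =
    if (data == null) null else mapper.writeValueAsBytes(data)
  override def close(): Unit = ()
}

class FooDeserializer extends Deserializer[Foo] {
  private val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
  override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = ()
  override def deserialize(topic: String, data: Array[Byte]): Foo =
    if (data == null) null else mapper.readValue(data, classOf[Foo])
  override def close(): Unit = ()
}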
So I'm trying to implement Akka Persistence, but I get an error: Persistence failure when replaying events for persistenceId [1]. Last known sequence number [0]
java.io.InvalidClassException: kz.dar.arena.domain.Bullet$Evt; local class incompatible: stream classdesc serialVersionUID = -5880771744357396366, local class serialVersionUID = 1502307324601793787
What is the problem?
UPD: I've solved this problem. I just cleared the journal.
I see that you solved it but let me explain what is wrong:
You are using Java serialization to serialize the events to bytes to store in the journal. Java serialization by default decides how to serialize the data based on the structure of the class. This means that if you change the structure of the class (add a field, remove a field, rename a field, etc.), it may no longer know how to deserialize data written with the old structure. The serialVersionUID field is like a protocol version number that makes it easy for the serialization library to detect that the version of the data is incompatible with the class (see more here: What is a serialVersionUID and why should I use it?).
Most importantly, you should almost never use Java serialization for Akka Persistence, or you will have problems in the future. Read more in this section of the Akka docs: https://doc.akka.io/docs/akka/current/persistence-schema-evolution.html#picking-the-right-serialization-format
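For completeness, and only to illustrate the mechanism (the class is hypothetical, and switching to a proper serializer is still the right long-term fix): in Scala you can pin the version explicitly, so the UID no longer changes whenever the class structure does.

// Hypothetical event class. Pinning the UID stops Java from recomputing it when
// the class changes, so old journal entries remain readable as long as the change
// itself is tolerated by Java serialization (e.g. a newly added field is simply
// left at its JVM default when old data is read).
@SerialVersionUID(1L)
final case class Evt(data: String) extends Serializable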
I am trying to build a Kafka pipeline which will read JSON input data into a Kafka topic.
I am using Avro serialization with the Schema Registry, as my schema changes on a regular basis.
As of now, GenericRecord is used to parse the schema.
But I recently came to know that avro-tools is available to read a schema and generate Java classes, which can be used to create producer code.
I am confused about choosing between these two options.
Can you please suggest which one is better, as my schema changes frequently?
avro-tools is available to read a schema and generate Java classes, which can be used to create producer code
They create specific Avro classes, not producer code. But regarding the question: both will work.
The way I see it:
GenericRecord - Think of it as a HashMap<String, Object>. As a consumer, you need to know which fields to get. If, as a producer or schema creator, you are not able to send your classes as a library to your consumers, this is essentially the best you can get. I believe you'll always be able to get the latest data, though (all possible fields can be accessed by a get("fieldname") call). See example here
SpecificRecord (what avro-tools generates) - It is just a generated class with getter methods and builder objects / setter methods. Any consumer will be able to import your producer classes as dependencies, deserialize the message, then immediately know what fields are available. You are not guaranteed to get the latest schema here - you will be "downgraded" and limited to whatever schema was used to generate those classes.
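To make the difference concrete, here is a rough Scala sketch of the consumer-side code for each (the field names and the generated example.avro.User class are made up for illustration):

import org.apache.avro.generic.GenericRecord

// GenericRecord: untyped access by field name; values come back as Object.
def handleGeneric(record: GenericRecord): Unit = {
  val name = record.get("name").toString           // you must know the field name
  val age  = record.get("age").asInstanceOf[Int]   // ...and the expected type
  println(s"$name is $age")
}

// SpecificRecord: example.avro.User is assumed to be the class generated by
// avro-tools / avro-maven-plugin from the schema, so the consumer gets typed getters.
def handleSpecific(user: example.avro.User): Unit =
  println(s"${user.getName} is ${user.getAge}")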
I generally use the avro-maven-plugin to create the classes, just as in this example.
You could also use Avro's reflection support (ReflectData) to build an Avro schema from a Java class rather than the other way around. Annotations can be used on fields to set @Union or @AvroDefault settings.
Further Reading about using the Confluent Schema Registry
We are considering a serialization approach for our scala-based Akka Persistence app. We consider it likely that our persisted events will "evolve" over time, so we want to support schema evolution, and are considering Avro first.
We'd like to avoid including the full schema with every message. However, for the foreseeable future, this Akka Persistence app is the only app that will be serializing and deserializing these messages, so we don't see a need for a separate schema registry.
Checking the docs for avro and the various scala libs, I see ways to include the schema with messages, and also how to use it "schema-less" by using a schema registry, but what about the in-between case? What's the correct approach for going schema-less, but somehow including an identifier to be able to look up the correct schema (available in the local deployed codebase) for the deserialized object? Would I literally just create a schema that represents my case class, but with an additional "identifier" field for schema version, and then have some sort of in-memory map of identifier->schema at runtime?
Also, is the correct approach to have one serializer/deserialize class for each version of the schema, so it knows how to translate every version to/from the most recent version?
Finally, are there recommendations on how to unit-test schema evolutions? For instance, store a message in akka-persistence, then actually change the definition of the case class, and then kill the actor and make sure it properly evolves. (I don't see how to change the definition of the case class at runtime.)
After spending more time on this, here are the answers I came up with.
Using avro4s, you can use the default data output stream to include the schema with every serialized message. Or, you can use the binary output stream, which simply omits the schema when serializing each message. ('binary' is a bit of a misnomer here since all it does is omit the schema. In either case it is still an Array[Byte].)
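A Scala sketch of the two styles (the OrderPlaced case class is invented, and the exact builder calls differ between avro4s releases; this is roughly the avro4s 3.x/4.x API):

import java.io.ByteArrayOutputStream
import com.sksamuel.avro4s.AvroOutputStream

case class OrderPlaced(orderId: String, amount: Double) // hypothetical event

object Avro4sStreamsExample extends App {
  val event = OrderPlaced("o-1", 9.99)

  // "data" stream: the writer schema is embedded in the output.
  val withSchema = new ByteArrayOutputStream()
  val dataOut = AvroOutputStream.data[OrderPlaced].to(withSchema).build()
  dataOut.write(event); dataOut.close()

  // "binary" stream: same Avro encoding, but the schema is omitted.
  val schemaless = new ByteArrayOutputStream()
  val binaryOut = AvroOutputStream.binary[OrderPlaced].to(schemaless).build()
  binaryOut.write(event); binaryOut.close()

  // Both results are still just an Array[Byte]; the binary one is simply smaller.
}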
Akka itself supplies a Serializer trait and a SerializerWithStringManifest trait; the latter stores a string manifest alongside every payload it serializes, and that manifest can serve as your "schema identifier".
So when you create your custom serializer, you can extend the appropriate trait, define your schema identifier, and use the binary output stream. When those techniques are combined, you'll successfully be using schema-less serialization while including a schema identifier.
One common technique is to "fingerprint" your schema - treat it as a string and then calculate its digest (MD5, SHA-256, whatever). If you construct an in-memory map of fingerprint to schema, that can serve as your application's in-memory schema registry.
So then, when deserializing, your incoming object carries the identifier of the schema that was used to serialize it (the "writer"), and your deserializing code knows the identifier of the schema it wants to read it into (the "reader"). Avro4s supports a way for you to specify both using a builder pattern, so Avro can translate the object from the old format to the new. That's how you support "schema evolution". Because of how that works, you don't need a separate serializer for each schema version. Your custom serializer will know how to evolve your objects, because that's the part that Avro gives you for free.
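Putting those pieces together, a custom Akka serializer might look roughly like this (the event, the identifier value, and the schema map are all invented; the avro4s builder API shown is the 3.x/4.x style and differs in older releases):

import java.io.ByteArrayOutputStream
import akka.serialization.SerializerWithStringManifest
import com.sksamuel.avro4s.{AvroInputStream, AvroOutputStream, AvroSchema}
import org.apache.avro.{Schema, SchemaNormalization}

case class OrderPlaced(orderId: String, amount: Double) // hypothetical event

class OrderEventSerializer extends SerializerWithStringManifest {
  override val identifier: Int = 481516 // any unique, stable number

  private val currentSchema: Schema = AvroSchema[OrderPlaced]

  // In-memory "schema registry": fingerprint -> schema. Older schemas would be
  // loaded from .avsc files bundled with the application.
  private val knownSchemas: Map[String, Schema] =
    Map(SchemaNormalization.parsingFingerprint64(currentSchema).toString -> currentSchema)

  // The manifest Akka stores next to each payload is the writer schema's fingerprint.
  override def manifest(o: AnyRef): String =
    SchemaNormalization.parsingFingerprint64(currentSchema).toString

  override def toBinary(o: AnyRef): Array[Byte] = {
    val baos = new ByteArrayOutputStream()
    val out = AvroOutputStream.binary[OrderPlaced].to(baos).build() // schema-less bytes
    out.write(o.asInstanceOf[OrderPlaced])
    out.close()
    baos.toByteArray
  }

  override def fromBinary(bytes: Array[Byte], manifest: String): AnyRef = {
    val writerSchema = knownSchemas(manifest) // schema the bytes were written with
    // avro4s resolves the writer schema against the current case class (the reader).
    val in = AvroInputStream.binary[OrderPlaced].from(bytes).build(writerSchema)
    val event = in.iterator.next()
    in.close()
    event
  }
}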
As for unit testing, your best bet is exploratory testing. Actually define multiple versions of a case class in your test, and multiple accompanying versions of its schema, and then explore how Avro works by writing tests that will evolve an object between different versions of that schema.
Unfortunately that won't be directly relevant to the code you are writing, because it's hard to simulate actually changing the code you are testing as you test it.
I developed a prototype that demonstrates several of these answers, and it's available on github. It uses avro, avro4s, and akka persistence. For this one, I demonstrated a changing codebase by actually changing it across commits - you'd check out commit #1, run the code, then move to commit #2, etc. It runs against cassandra so it will demonstrate replaying events that need to be evolved using new schema, all without using an external schema registry.
I'm using Akka Persistence, with LevelDB as storage plugin, in an application written in Scala. On the query-side, the current implementation uses PersistentView, which polls messages from a PersistentActor's journal by just knowing the identifier of the actor.
Now I've learned that PersistentView is deprecated, and one is encouraged to use Persistent Query instead. However, I haven't found any thorough description on how to adapt the code from using PersistentView to support the preferred Persistence Query implementation.
Any help would be appreciated!
From the 2.4.x-to-2.5.x migration guide:
Removal of PersistentView
After being deprecated for a long time, and replaced by Persistence Query, PersistentView has now been removed.
The corresponding query type is EventsByPersistenceId. There are several alternatives for connecting the Source to an actor corresponding to a previous PersistentView actor, which are documented in Integration.
The consuming actor may be a plain Actor or a PersistentActor if it needs to store its own state (e.g. a fromSequenceNr offset).
Please note that Persistence Query is no longer experimental/may-change in Akka 2.5.0, so you can safely upgrade to it.
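Concretely, for the LevelDB journal mentioned in the question, the query-side replacement might look roughly like this (Akka 2.5-era API; the persistence id and what you do with each event are placeholders):

import akka.actor.ActorSystem
import akka.persistence.query.PersistenceQuery
import akka.persistence.query.journal.leveldb.scaladsl.LeveldbReadJournal
import akka.stream.ActorMaterializer

object ReadSideExample extends App {
  implicit val system: ActorSystem = ActorSystem("example")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  val readJournal =
    PersistenceQuery(system).readJournalFor[LeveldbReadJournal](LeveldbReadJournal.Identifier)

  // Live stream of events for one persistent actor, replacing the old PersistentView polling.
  readJournal
    .eventsByPersistenceId("my-persistence-id", fromSequenceNr = 0L, toSequenceNr = Long.MaxValue)
    .runForeach(env => println(s"seqNr=${env.sequenceNr} event=${env.event}"))
}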
I am replacing an in-house caching system with memcached, but the memcached client cannot cache JsonNode objects since they don't implement Serializable.
Is there any way to serialize a JsonNode object? Does Jackson provide a Serializable equivalent of this class?
JSON is best serialized by writing it out as bytes. In Jackson, this is done using ObjectMapper, for example:
byte[] raw = objectMapper.writeValueAsBytes(root);
Memcached does not really need Serializable since it only stores raw bytes, although Java clients may try to be helpful and handle serialization for you.
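So a round trip for the cache might look like this sketch (shown in Scala here, but the ObjectMapper calls are identical in Java; the memcached client itself is left out):

import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

val mapper = new ObjectMapper()

// Before storing in memcached: JsonNode -> raw bytes.
val root: JsonNode = mapper.readTree("""{"user":"alice","visits":3}""")
val raw: Array[Byte] = mapper.writeValueAsBytes(root)

// After fetching from memcached: raw bytes -> JsonNode again.
val restored: JsonNode = mapper.readTree(raw)
assert(restored == root)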