I'm using Scala 2.12, implementing some producers and consumers with this library:
"org.apache.kafka" % "kafka-clients" % "2.4.1"
For both key and value I'm using classOf[StringDeserializer].
Let's say every message is a json string of a case class like this:
case class Person(name: String, age: Int, id: UUID)
So value in every message would be something like this:
{"name":"Joe", "age": ...}
How can I write a custom serializer for this?
How can I write a custom serializer
You'd implement the interface...
class MySerializer extends Serializer[Person] {
  override def serialize(topic: String, data: Person): Array[Byte] = { ... }
}
Let's say every message is a json string
Kafka already ships a JsonSerializer class (in the Connect JSON module) you can use
Or you can use the ones provided by Confluent or Spring-Kafka.
Otherwise, StringSerializer will work fine if you pre-serialize the data before producing
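For example, a minimal sketch of such a serializer/deserializer pair, assuming circe is on the classpath for the JSON encoding (the class names here are illustrative, not from the question):

import java.nio.charset.StandardCharsets
import java.util.UUID
import io.circe.generic.auto._
import io.circe.parser.decode
import io.circe.syntax._
import org.apache.kafka.common.errors.SerializationException
import org.apache.kafka.common.serialization.{Deserializer, Serializer}

case class Person(name: String, age: Int, id: UUID)

class PersonSerializer extends Serializer[Person] {
  // Only serialize needs overriding; configure and close have default implementations.
  override def serialize(topic: String, data: Person): Array[Byte] =
    if (data == null) null
    else data.asJson.noSpaces.getBytes(StandardCharsets.UTF_8)
}

class PersonDeserializer extends Deserializer[Person] {
  override def deserialize(topic: String, data: Array[Byte]): Person =
    if (data == null) null.asInstanceOf[Person]
    else decode[Person](new String(data, StandardCharsets.UTF_8))
      .fold(err => throw new SerializationException("Invalid Person JSON", err), identity)
}

You would then configure the producer with classOf[PersonSerializer] as the value serializer (keeping classOf[StringSerializer] for the key), instead of pre-serializing to a String yourself.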
Related
I have a case class in Apache Flink with Scala that I want to turn into JSON and send to a Kafka topic.
It comes from a DataStream that went through a RichMapFunction.
What would be the best strategy?
Example:
case class MySchema(
  someObj1: Some[String],
  someObj2: Some[String],
  someObj3: Some[String],
  obj1: String,
  obj2: String,
  obj3: String,
  intObj1: Int,
  intObj2: Int
)
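One strategy that keeps the Kafka side simple is to encode the case class to a JSON string inside the stream and hand plain strings to the Kafka sink. A minimal sketch, assuming circe's generic derivation covers the fields and the universal Kafka connector's FlinkKafkaProducer with SimpleStringSchema (topic name and broker address are placeholders):

import java.util.Properties
import io.circe.generic.auto._
import io.circe.syntax._
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer

val props = new Properties()
props.setProperty("bootstrap.servers", "localhost:9092") // placeholder broker

// The DataStream produced by your RichMapFunction
val stream: DataStream[MySchema] = ???

stream
  .map(record => record.asJson.noSpaces) // case class -> JSON string
  .addSink(new FlinkKafkaProducer[String]("my-topic", new SimpleStringSchema(), props))

Alternatively, you can keep the DataStream[MySchema] as-is and do the JSON encoding inside a custom serialization schema for the Kafka sink, which avoids the intermediate String stream.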
I have a persistent actor which can receive one type of command, Persist(event), where event is a subtype of the trait Event (there are numerous implementations of it). On success, it responds with Persisted(event) to the sender.
The event itself is serializable, since this is the data we store in the persistence storage, and the serialization is implemented with a custom serializer which internally uses classes generated from Google protobuf .proto files. This custom serializer is configured in application.conf and bound to the base trait Event. This is already working fine.
Note: the implementations of Event are not classes generated by protobuf. They are normal Scala classes that also have protobuf equivalents, but those are mapped through the custom serializer bound to the base Event type. This was done by my predecessors for versioning (which probably isn't required, because it could also be handled with plain protobuf classes plus custom serialization, but that's a different matter) and I don't wish to change it at the moment.
We're now trying to implement cluster sharding for this actor, which means that my commands (viz. Persist and Persisted) also need to be serializable, since they may be forwarded to other nodes.
This is the domain model:
sealed trait PersistenceCommand {
  def event: Event
}
final case class Persisted(event: Event) extends PersistenceCommand
final case class Persist(event: Event) extends PersistenceCommand
The problem is, I do not see an elegant way to make them serializable. Following are the options I have considered.
Approach 1. Define a new proto file for Persist and Persisted, but what do I use as the datatype for event? I didn't find a way to define something like this:
message Persist {
  "com.example.Event" event = 1 // this doesn't work
}
Such that I can use the existing Scala trait Event as a data type. If this works, I guess (it's far-fetched though) I could bind the generated code (after compiling this proto file) to Akka's built-in serializer for Google protobuf and it might work. The note above explains why I cannot use the oneof construct in my proto file.
Approach 2. This is what I've implemented and it works (but I don't like it).
Basically, I wrote a new serializer for the commands and delegated serialization and de-serialization of the event part of the command to the existing serializer.
class PersistenceCommandSerializer extends SerializerWithStringManifest {
  val eventSerializer: ManifestAwareEventSerializer = new ManifestAwareEventSerializer()

  val PersistManifest = Persist.getClass.getName
  val PersistedManifest = Persisted.getClass.getName
  val Separator = "~"

  override def identifier: Int = 808653986

  override def manifest(o: AnyRef): String = o match {
    case Persist(event)   => s"$PersistManifest$Separator${eventSerializer.manifest(event)}"
    case Persisted(event) => s"$PersistedManifest$Separator${eventSerializer.manifest(event)}"
  }

  override def toBinary(o: AnyRef): Array[Byte] = o match {
    case command: PersistenceCommand => eventSerializer.toBinary(command.event)
  }

  override def fromBinary(bytes: Array[Byte], manifest: String): AnyRef = {
    val (commandManifest, dataManifest) = splitIntoCommandAndDataManifests(manifest)
    val event = eventSerializer.fromBinary(bytes, dataManifest).asInstanceOf[Event]
    commandManifest match {
      case PersistManifest   => Persist(event)
      case PersistedManifest => Persisted(event)
    }
  }

  private def splitIntoCommandAndDataManifests(manifest: String) = {
    val commandAndDataManifests = manifest.split(Separator)
    (commandAndDataManifests(0), commandAndDataManifests(1))
  }
}
The problem with this approach is what I'm doing in def manifest and def fromBinary: I had to make sure that I carry both the command's manifest and the event's manifest while serializing and de-serializing. Hence I had to use ~ as a separator, which is effectively my own custom encoding of the manifest information.
Is there a better or perhaps, a right way, to implement this?
For context: I'm using ScalaPB for generating Scala classes from .proto files, and classic Akka actors.
Any kind of guidance is hugely appreciated!
If you delegate serialization of the nested object to whichever serializer you have configured, the nested field should have bytes for the serialized data, but also an int32 with the id of the serializer used and bytes for the message manifest. This ensures that you will be able to version/replace the nested serializers, which is important for data that will be stored for a longer time period.
You can see how this is done internally in Akka for our own wire formats here: https://github.com/akka/akka/blob/6bf20f4117a8c27f8bd412228424caafe76a89eb/akka-remote/src/main/protobuf/WireFormats.proto#L48 and here https://github.com/akka/akka/blob/6bf20f4117a8c27f8bd412228424caafe76a89eb/akka-remote/src/main/scala/akka/remote/MessageSerializer.scala#L45
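A sketch of what that delegation could look like for the commands above. This is not a drop-in implementation: it assumes Akka's SerializationExtension and Serializers.manifestFor are available, and it packs the serializer id, the event manifest and the event bytes into a hand-rolled length-prefixed envelope where, in practice, you would describe the same three fields in a small .proto message as in the linked WireFormats.proto:

import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import akka.actor.ExtendedActorSystem
import akka.serialization.{SerializationExtension, Serializers, SerializerWithStringManifest}

class DelegatingPersistenceCommandSerializer(system: ExtendedActorSystem) extends SerializerWithStringManifest {
  private lazy val serialization = SerializationExtension(system)

  override def identifier: Int = 808653987 // must be unique among configured serializers

  // The command type alone is enough for the manifest; the event's own
  // manifest travels inside the payload together with its serializer id.
  override def manifest(o: AnyRef): String = o match {
    case _: Persist   => "P"
    case _: Persisted => "PD"
  }

  override def toBinary(o: AnyRef): Array[Byte] = o match {
    case command: PersistenceCommand =>
      val event           = command.event
      val eventSerializer = serialization.findSerializerFor(event)
      val eventManifest   = Serializers.manifestFor(eventSerializer, event).getBytes(StandardCharsets.UTF_8)
      val eventBytes      = eventSerializer.toBinary(event)
      val buffer = ByteBuffer.allocate(4 + 4 + eventManifest.length + eventBytes.length)
      buffer.putInt(eventSerializer.identifier) // id of the serializer used for the event
      buffer.putInt(eventManifest.length)       // length prefix for the manifest
      buffer.put(eventManifest)
      buffer.put(eventBytes)
      buffer.array()
  }

  override def fromBinary(bytes: Array[Byte], manifest: String): AnyRef = {
    val buffer        = ByteBuffer.wrap(bytes)
    val serializerId  = buffer.getInt()
    val manifestBytes = new Array[Byte](buffer.getInt())
    buffer.get(manifestBytes)
    val eventBytes    = new Array[Byte](buffer.remaining())
    buffer.get(eventBytes)
    val event = serialization
      .deserialize(eventBytes, serializerId, new String(manifestBytes, StandardCharsets.UTF_8))
      .get
      .asInstanceOf[Event]
    manifest match {
      case "P"  => Persist(event)
      case "PD" => Persisted(event)
    }
  }
}

Because the serializer id and the event manifest are stored alongside the event bytes, you can later rebind Event to a different serializer and old data will still be readable.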
I am using a case class which has nested case classes and Seqs of nested case classes.
The problem is that when I try to serialize it using KafkaAvroSerializer it throws:
Caused by: java.lang.IllegalArgumentException: Unsupported Avro type. Supported types are null, Boolean, Integer, Long, Float, Double, String, byte[] and IndexedRecord
  at io.confluent.kafka.serializers.AbstractKafkaAvroSerDe.getSchema(AbstractKafkaAvroSerDe.java:115)
  at io.confluent.kafka.serializers.AbstractKafkaAvroSerializer.serializeImpl(AbstractKafkaAvroSerializer.java:71)
  at io.confluent.kafka.serializers.KafkaAvroSerializer.serialize(KafkaAvroSerializer.java:54)
  at org.apache.kafka.common.serialization.Serializer.serialize(Serializer.java:60)
  at org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:879)
  at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:841)
  at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:728)
If you want to use Avro with Scala constructs like case classes, I recommend you use Avro4s. It has native support for all Scala features and can even create the schema from your model if that is what you want.
There are some gotchas with automatic type class derivation. This is what I learned:
Use at least avro4s version 2.0.4
Some of the macros generate code with compiler warnings and also break WartRemover. We had to add the following annotations to get our code to compile (sometimes the error is "cannot find implicit", but it's caused by an error in the macro-generated code):
@com.github.ghik.silencer.silent
@SuppressWarnings(Array("org.wartremover.warts.Null", "org.wartremover.warts.AsInstanceOf", "org.wartremover.warts.StringPlusAny"))
Next, automatic type class derivation only works one level at a time. I created an object to hold all my SchemaFor, Decoder and Encoder instances for my schema. Then I built up the type class instances explicitly, starting from the innermost types. I also used implicitly to verify that each ADT would resolve before moving on to the next one. For example:
sealed trait Notification
object Notification {
  final case class Outstanding(attempts: Int) extends Notification
  final case class Complete(attempts: Int, completedAt: Instant) extends Notification
}

sealed trait Job
final case class EnqueuedJob(id: String, enqueuedAt: Instant) extends Job
final case class RunningJob(id: String, enqueuedAt: Instant, startedAt: Instant) extends Job
final case class FinishedJob(id: String, enqueuedAt: Instant, startedAt: Instant, completedAt: Instant) extends Job

object Schema {
  // Explicitly define schema for ADT instances
  implicit val schemaForNotificationComplete: SchemaFor[Notification.Complete] = SchemaFor.applyMacro
  implicit val schemaForNotificationOutstanding: SchemaFor[Notification.Outstanding] = SchemaFor.applyMacro

  // Verify the Notification ADT is defined
  implicitly[SchemaFor[Notification]]
  implicitly[Decoder[Notification]]
  implicitly[Encoder[Notification]]

  // Explicitly define schema, decoder and encoder for ADT instances
  implicit val schemaForEnqueuedJob: SchemaFor[EnqueuedJob] = SchemaFor.applyMacro
  implicit val decodeEnqueuedJob: Decoder[EnqueuedJob] = Decoder.applyMacro
  implicit val encodeEnqueuedJob: Encoder[EnqueuedJob] = Encoder.applyMacro

  implicit val schemaForRunningJob: SchemaFor[RunningJob] = SchemaFor.applyMacro
  implicit val decodeRunningJob: Decoder[RunningJob] = Decoder.applyMacro
  implicit val encodeRunningJob: Encoder[RunningJob] = Encoder.applyMacro

  implicit val schemaForFinishedJob: SchemaFor[FinishedJob] = SchemaFor.applyMacro
  implicit val decodeFinishedJob: Decoder[FinishedJob] = Decoder.applyMacro
  implicit val encodeFinishedJob: Encoder[FinishedJob] = Encoder.applyMacro

  // Verify the Job ADT is defined
  implicitly[Encoder[Job]]
  implicitly[Decoder[Job]]
  implicitly[SchemaFor[Job]]

  // And so on until the complete nested ADT is defined
}
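If the goal is still to produce through KafkaAvroSerializer, one way to bridge from the case classes is avro4s's RecordFormat, which converts a case class to an Avro GenericRecord (an IndexedRecord, one of the types the serializer accepts). A sketch, assuming the Job instances resolve at the call site (for example by also adding explicit SchemaFor/Encoder/Decoder vals for Job to the Schema object):

import com.sksamuel.avro4s.RecordFormat
import org.apache.avro.generic.GenericRecord
import Schema._ // the explicitly derived instances from above

val jobFormat: RecordFormat[Job] = RecordFormat[Job]

// GenericRecord implements IndexedRecord, so KafkaAvroSerializer can handle it.
val record: GenericRecord = jobFormat.to(EnqueuedJob("job-1", java.time.Instant.now()))
// producer.send(new ProducerRecord[String, GenericRecord]("jobs", record))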
I am trying to get familiar with the semantics of Flink after having started with Spark. I would like to write a DataSet[IndexNode] to persistent storage in HDFS so that it can be read later by another process. Spark has a simple ObjectFile API that provides such a functionality, but I cannot find a similar option in Flink.
case class IndexNode(vec: Vector[IndexNode], id: Int) extends Serializable {
  // Getters and setters etc. here
}
The built-in sinks tend to serialize the instance based on the toString method, which is not suitable here due to the nested structure of the class. I imagine the solution is to use a FileOutputFormat and provide a translation of the instances to a byte stream. However, I am not sure how to serialize the vector, which is of an arbitrary length and can be many levels deep.
You can achieve this by using SerializedOutputFormat and SerializedInputFormat.
Try the following steps:
Make IndexNode extend the IOReadableWritable interface from Flink. Make unserialisable fields @transient. Implement the write(DataOutputView out) and read(DataInputView in) methods. The write method will write out all data from IndexNode, and the read method will read it back and rebuild all the internal data fields. For example, instead of serialising all the data of the arr field in the Result class, I write out the values it is built from and then read them back and rebuild the array in the read method.
class Result(var name: String, var count: Int) extends IOReadableWritable {

  @transient
  var arr = Array(count, count)

  def this() = this("", 1)

  override def write(out: DataOutputView): Unit = {
    out.writeInt(count)
    out.writeUTF(name)
  }

  override def read(in: DataInputView): Unit = {
    count = in.readInt()
    name = in.readUTF()
    arr = Array(count, count)
  }

  override def toString: String = s"$name, $count, ${arr.mkString(" ")}"
}
Write out data with
myDataSet.write(new SerializedOutputFormat[Result], "/tmp/test")
and read it back with
env.readFile(new SerializedInputFormat[Result], "/tmp/test")
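Applied to the IndexNode case from the question, the same pattern can handle the arbitrary depth and width by recursing in write and read. A rough sketch (the case class becomes a plain class with vars and a no-arg constructor so read can populate it):

import org.apache.flink.core.io.IOReadableWritable
import org.apache.flink.core.memory.{DataInputView, DataOutputView}

class IndexNode(var vec: Vector[IndexNode], var id: Int) extends IOReadableWritable {

  // Flink needs a no-arg constructor to instantiate the object before calling read()
  def this() = this(Vector.empty, 0)

  override def write(out: DataOutputView): Unit = {
    out.writeInt(id)
    out.writeInt(vec.length)  // length prefix so read() knows how many children follow
    vec.foreach(_.write(out)) // recurse into nested nodes
  }

  override def read(in: DataInputView): Unit = {
    id = in.readInt()
    val children = in.readInt()
    vec = Vector.fill(children) {
      val child = new IndexNode()
      child.read(in)          // recurse to rebuild the nested structure
      child
    }
  }
}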
I am trying to create an API for objects like user, location, and visits, and I want to have POST methods for adding and updating values of these objects. However, I do not know how to pass the object to the route of the API. A part of my route for the location update:
trait ApiRoute extends MyDatabases with FailFastCirceSupport {
  val routes =
    pathPrefix("locations") {
      pathSingleSlash {
        pathPrefix(LongNumber) { id =>
          post {
            entity(as[Location]) { location =>
              onSuccess(locationRepository.update(location)) { result =>
                complete(result.asJson)
              }
            }
          }
        }
However, when I try to build the update in this way I get the following error:
could not find implicit value for parameter um: akka.http.scaladsl.unmarshalling.FromRequestUnmarshaller[models.Location]
[error] post { entity(as[Location]){
Json encoder for the location:
package object application {
  implicit val locationEncoder = new Encoder[Location] {
    final def apply(location: Location): Json = Json.obj(
      ("id", Json.fromLong(location.id)),
      ("place", Json.fromString(location.place)),
      ("country", Json.fromString(location.country)),
      ("city", Json.fromString(location.city)),
      ("distance", Json.fromLong(location.distance))
    )
  }
I am using Slick to model and get all the data from the database:
case class Location(id: Long, place: String, country: String, city: String, distance: Long)

class LocationTable(tag: Tag) extends Table[Location](tag, "locations") {
  val id = column[Long]("id", O.PrimaryKey)
  val place = column[String]("place")
  val country = column[String]("country")
  val city = column[String]("city")
  val distance = column[Long]("distance")

  def * = (id, place, country, city, distance) <> ((Location.apply _).tupled, Location.unapply)
}

object LocationTable {
  val table = TableQuery[LocationTable]
}

class LocationRepository(db: Database) {
  val locationTableQuery = TableQuery[LocationTable]

  def create(location: Location): Future[Location] =
    db.run(locationTableQuery returning locationTableQuery += location)

  def update(location: Location): Future[Int] =
    db.run(locationTableQuery.filter(_.id === location.id).update(location))
}
So what should I add or change in my code to get rid of the exception and make it work?
If you are adding or updating a Location, that Location needs a Decoder as well, to read the serialized data that comes across the wire from the client as the HTTP entity. Akka HTTP also needs a FromRequestUnmarshaller in conjunction with the Decoder to decode the request entity, which in this example is the Location you want to add or update, and the one that you extract the id from.
Since you are using Scala's Circe library for JSON handling, the Akka HTTP JSON project has part of what you need. As that project indicates, add the following to your build.sbt:
// All releases including intermediate ones are published here,
// final ones are also published to Maven Central.
resolvers += Resolver.bintrayRepo("hseeberger", "maven")
libraryDependencies ++= List(
"de.heikoseeberger" %% "akka-http-circe" % "1.18.0"
)
and then you can mix in the support you need using FailFastCirceSupport or ErrorAccumulatingCirceSupport. Add that to the class that defines your routes with
class SomeClassWithRoutes extends ErrorAccumulatingCirceSupport
or
class SomeClassWithRoutes extends FailFastCirceSupport
depending on whether you want to fail on the first error, if any, or accumulate them.
You still need to have a Decoder[Location] in scope as well. For that you can see Circe's documentation, but one quick way to define it, if you want the default field names, is to use the following imports in your route definition class or file so that Circe derives the necessary Decoder for you:
import io.circe.generic.auto._
import io.circe.Decoder
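If you prefer to keep the style of the hand-written encoder instead of generic derivation, an explicit Decoder[Location] might look like this (a sketch assuming the same field names as the encoder above):

import io.circe.{Decoder, HCursor}

implicit val locationDecoder: Decoder[Location] = new Decoder[Location] {
  final def apply(c: HCursor): Decoder.Result[Location] =
    for {
      id       <- c.downField("id").as[Long]
      place    <- c.downField("place").as[String]
      country  <- c.downField("country").as[String]
      city     <- c.downField("city").as[String]
      distance <- c.downField("distance").as[Long]
    } yield Location(id, place, country, city, distance)
}

With the FromRequestUnmarshaller provided by FailFastCirceSupport and this Decoder in scope, entity(as[Location]) in the route should compile.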