Deserialize Protobuf kafka messages with Flink - scala

I am trying to read and print Protobuf messages from Kafka using Apache Flink.
I followed the official docs with no success: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/serialization/third_party_serializers/
The Flink consumer code is:
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI()
  env.enableCheckpointing(5000)
  env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
  env.getCheckpointConfig.setCheckpointStorage(s"$targetPath/checkpoints")
  env.getConfig.registerTypeWithKryoSerializer(classOf[User], classOf[ProtobufSerializer])

  val source = KafkaSource.builder[User]
    .setBootstrapServers(brokers)
    .setTopics(topic)
    .setGroupId(consumerGroupId)
    .setValueOnlyDeserializer(new ProtoDeserializer())
    .setStartingOffsets(OffsetsInitializer.earliest)
    .build

  val stream = env.fromSource(source, WatermarkStrategy.noWatermarks[User], kafkaTableName)
  stream.print()
  env.execute()
}
The deserializer code is:
class ProtoDeserializer extends DeserializationSchema[User] {
  override def getProducedType: TypeInformation[User] = null
  override def deserialize(message: Array[Byte]): User = User.parseFrom(message)
  override def isEndOfStream(nextElement: User): Boolean = false
}
I get the following error when the stream is executed:
Protocol message contained an invalid tag (zero).
It's important to mention that I manage to read and deserialize the messages successfully using the Confluent protobuf consumer, so it seems the messages are not corrupted.

The Confluent protobuf serializer doesn't produce content that can be directly deserialized by other deserializers. The format is described in Confluent's documentation: it starts with a magic byte (which is always zero), followed by a four-byte schema ID. The protobuf payload follows, starting at byte 5.
The getProducedType method should return appropriate TypeInformation, in this case TypeInformation.of(classOf[User]). Without this you may run into problems at runtime.
Deserializers used with KafkaSource don't need to implement isEndOfStream, but it won't hurt anything.
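For illustration, a deserializer along those lines could look like the following. This is a minimal sketch: it assumes the protobuf payload starts right after the 5-byte header described above and that User.parseFrom accepts a byte array, as in the question.

import org.apache.flink.api.common.serialization.DeserializationSchema
import org.apache.flink.api.common.typeinfo.TypeInformation

class ConfluentProtoDeserializer extends DeserializationSchema[User] {

  // return real type information instead of null
  override def getProducedType: TypeInformation[User] = TypeInformation.of(classOf[User])

  override def deserialize(message: Array[Byte]): User =
    // skip the Confluent header (1 magic byte + 4-byte schema ID) before parsing
    User.parseFrom(message.drop(5))

  override def isEndOfStream(nextElement: User): Boolean = false
}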

Related

Serializing message with protobuf for akka actor which contains serializable data

I have a persistent actor which can receive one type of command, Persist(event), where event is an implementation of the trait Event (there are numerous implementations of it). On success, it responds with Persisted(event) to the sender.
The event itself is serializable since this is the data we store in persistence storage, and the serialization is implemented with a custom serializer which internally uses classes generated from Google protobuf .proto files. This custom serializer is configured in application.conf and bound to the base trait Event. This is already working fine.
Note: The implementations of Event are not classes generated by protobuf. They are normal Scala classes, and they have protobuf equivalents too, but those are mapped through the custom serializer that's bound to the base Event type. This was done by my predecessors for versioning (which probably isn't required, because this could be handled with plain protobuf classes plus custom serialization too, but that's a different matter) and I don't wish to change it at the moment.
We're now trying to implement cluster sharding for this actor, which means that my commands (viz. Persist and Persisted) also need to be serializable since they may be forwarded to other nodes.
This is the domain model:
sealed trait PersistenceCommand {
  def event: Event
}

final case class Persisted(event: Event) extends PersistenceCommand
final case class Persist(event: Event) extends PersistenceCommand
The problem is, I do not see an elegant way to make them serializable. The following are the options I have considered.
Approach 1. Define a new proto file for Persist and Persisted, but what do I use as the datatype for event? I didn't find a way to define something like this:
message Persist {
  "com.example.Event" event = 1 // this doesn't work
}
such that I can use the existing Scala trait Event as a data type. If this worked, I guess (it's far-fetched, though) I could bind the generated code (after compiling this proto file) to Akka's built-in serializer for Google protobuf, and it might work. The note above explains why I cannot use the oneof construct in my proto file.
Approach 2. This is what I've implemented, and it works (but I don't like it).
Basically, I wrote a new serializer for the commands and delegated serialization and deserialization of the event part of the command to the existing serializer.
class PersistenceCommandSerializer extends SerializerWithStringManifest {
  val eventSerializer: ManifestAwareEventSerializer = new ManifestAwareEventSerializer()

  val PersistManifest = Persist.getClass.getName
  val PersistedManifest = Persisted.getClass.getName
  val Separator = "~"

  override def identifier: Int = 808653986

  override def manifest(o: AnyRef): String = o match {
    case Persist(event)   => s"$PersistManifest$Separator${eventSerializer.manifest(event)}"
    case Persisted(event) => s"$PersistedManifest$Separator${eventSerializer.manifest(event)}"
  }

  override def toBinary(o: AnyRef): Array[Byte] = o match {
    case command: PersistenceCommand => eventSerializer.toBinary(command.event)
  }

  override def fromBinary(bytes: Array[Byte], manifest: String): AnyRef = {
    val (commandManifest, dataManifest) = splitIntoCommandAndDataManifests(manifest)
    val event = eventSerializer.fromBinary(bytes, dataManifest).asInstanceOf[Event]
    commandManifest match {
      case PersistManifest   => Persist(event)
      case PersistedManifest => Persisted(event)
    }
  }

  private def splitIntoCommandAndDataManifests(manifest: String) = {
    val commandAndDataManifests = manifest.split(Separator)
    (commandAndDataManifests(0), commandAndDataManifests(1))
  }
}
The problem with this approach is what I'm doing in def manifest and def fromBinary. I had to make sure I carry both the command's manifest and the event's manifest while serializing and deserializing, hence the ~ separator - in effect, my own custom encoding scheme for the manifest information.
Is there a better, or perhaps the right, way to implement this?
For context: I'm using ScalaPB to generate Scala classes from .proto files, and classic Akka actors.
Any kind of guidance is hugely appreciated!
If you delegate serialization of the nested object to whichever serializer you have configured, the nested field should carry not only the bytes of the serialized data, but also an int32 with the id of the serializer that was used and the bytes of the message manifest. This ensures that you will be able to version/replace the nested serializer, which is important for data that will be stored for a longer time period.
You can see how this is done internally in Akka for its own wire formats here: https://github.com/akka/akka/blob/6bf20f4117a8c27f8bd412228424caafe76a89eb/akka-remote/src/main/protobuf/WireFormats.proto#L48 and here: https://github.com/akka/akka/blob/6bf20f4117a8c27f8bd412228424caafe76a89eb/akka-remote/src/main/scala/akka/remote/MessageSerializer.scala#L45
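A rough sketch of that delegation in Scala, using Akka's SerializationExtension (assuming Akka 2.5+, where Serializers.manifestFor is available; the Event trait and all names below are stand-ins for the question's types). The three fields of SerializedEvent correspond to the bytes, serializer id, and manifest described above, and in practice they would live inside the protobuf messages for Persist and Persisted.

import akka.actor.ActorSystem
import akka.serialization.{SerializationExtension, Serializers}

object NestedEventSerialization {

  trait Event // stand-in for the question's Event trait

  // keep the serializer id and manifest next to the bytes so the Event
  // serializer can later be replaced or versioned without breaking old data
  final case class SerializedEvent(bytes: Array[Byte], serializerId: Int, manifest: String)

  def serialize(system: ActorSystem, event: Event): SerializedEvent = {
    val serialization = SerializationExtension(system)
    val serializer = serialization.findSerializerFor(event)
    val manifest = Serializers.manifestFor(serializer, event)
    SerializedEvent(serializer.toBinary(event), serializer.identifier, manifest)
  }

  def deserialize(system: ActorSystem, s: SerializedEvent): Event =
    SerializationExtension(system)
      .deserialize(s.bytes, s.serializerId, s.manifest)
      .get
      .asInstanceOf[Event]
}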

Read data from several Kafka topics (generic list class design)

I am trying to change my Flink runner code so it reads data from several Kafka topics and writes the data to different HDFS folders accordingly, without joining. I have a lot of Java and Scala generic methods and generic object instantiations inside the main process method, plus reflection.
It works correctly with one Avro schema, but when I try to handle an unknown number of Avro schemas I run into problems with the generics and reflection constructs.
How can I resolve this? What design pattern can help me?
The model (Avro schemas) is in Java classes:
public enum Types implements MessageType {
    RECORD_1("record1", "01", Record1.getClassSchema(), Record1.class),
    RECORD_2("record2", "02", Record2.getClassSchema(), Record2.class);

    private String topicName;
    private String dataType;
    private Schema schema;
    private Class<? extends SpecificRecordBase> clazz;
}

public class Record1 extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
    public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse("???");
    public static org.apache.avro.Schema getClassSchema() { return SCHEMA$; }
    ...
}
public class Record1 ...
The process trait with main process methods.
import org.apache.avro.specific.SpecificRecordBase
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.fs.Writer
import tests.{Record1, Record2, Types}

import scala.reflect.ClassTag

trait Converter[T] extends Serializable {
  def convertToModel(message: KafkaSourceType): T
}

trait FlinkRunner extends Serializable {
  val kafkaTopicToModelMapping: Map[String, Class[_ <: SpecificRecordBase]] =
    Map(
      "record_1" -> Types.RECORD_1.getClassType,
      "record_2" -> Types.RECORD_2.getClassType
    )

  def buildAvroSink1(path: String, writer1: Writer[Record1]): BucketingSink[Record1] = ???
  def buildAvroSink2(path: String, writer2: Writer[Record2]): BucketingSink[Record2] = ???

  def process(topicList: List[String], env: StreamExecutionEnvironment): Unit = {
    // producer kafka source building
    val clazz1: Class[Record1] = ClassTag(kafkaTopicToModelMapping(topicList.head)).runtimeClass.asInstanceOf[Class[Record1]]
    val clazz2: Class[Record2] = ClassTag(kafkaTopicToModelMapping(topicList.tail.head)).runtimeClass.asInstanceOf[Class[Record2]]
    // How do I make a list of these classes, e.g. val clazzes: List[Class[???]] = ???

    val avroTypeInfo1: TypeInformation[Record1] = TypeInformation.of(clazz1)
    val avroTypeInfo2: TypeInformation[Record2] = TypeInformation.of(clazz2)
    // How do I make a list of these, e.g. val avroTypeInfos = ???

    val stream: DataStream[KafkaSourceType] = ???

    // consumer avro paths building
    val converter1: Converter[Record1] = new Converter[Record1] {
      override def convertToModel(message: KafkaSourceType): Record1 = deserializeAvro[Record1](message.value)
    }
    val converter2: Converter[Record2] = new Converter[Record2] {
      override def convertToModel(message: KafkaSourceType): Record2 = deserializeAvro[Record2](message.value)
    }
    // How do I make a list of these converters?

    val outputResultStream1 = stream
      .filter(_.topic == topicList.head)
      .map(record => converter1.convertToModel(record))(avroTypeInfo1)
    val outputResultStream2 = stream
      .filter(_.topic == topicList.tail.head)
      .map(record => converter2.convertToModel(record))(avroTypeInfo2)

    val writer1 = new AvroSinkWriter[Record1](???)
    val writer2 = new AvroSinkWriter[Record2](???)
    // add sink and start process
  }
}
AS IS
There are several different topics in Kafka. The Kafka version is 10.2 without Confluent. Every Kafka topic works with only one Avro schema class, written in Java.
A single Flink job (written in Scala) reads a single topic, converts it with one Avro schema, and writes the data to a single folder in HDFS. The name, path, and output folder name are in the config.
For example there are 3 job flows with parameters:
First Job Flow
--brokersAdress …
--topic record1
--folderName folder1
-- avroClassName Record1
--output C:/….
--jobName SingleTopic1
--number_of_parallel 2
--number_of_task 1
--mainClass Runner
….
Second Job Flow
--brokersAdress …
--topic record1
--folderName folder1
-- avroClassName Record1
--output C:/….
--jobName SingleTopic2
--number_of_parallel 2
--number_of_task 1
--mainClass Runner
….
Third Job Flow
…
TO BE
A single Flink job should be able to read more than one Kafka topic, convert each with its own Avro schema, and write the data to different folders, without joining.
For example, I can start just one job flow that does the same work:
--brokersAdress …
--topic record1, record2, record3
--folderName folder1, folder2,
-- avroClassName Record1, Record2
--output C:/….
--jobName MultipleTopics
--number_of_parallel 3
--number_of_task 3
--mainClass Runner
...
Ok, thank you. There are several questions about code organization:
1) How do I generalize the variables and the parameters of the method (called process) so I can initialize a List of several classes that inherit from SpecificRecordBase, if that is possible at all?
val clazz1: Class[Record1] = ClassTag(kafkaTopicToModelMapping(topicList.head)).runtimeClass.asInstanceOf[Class[Record1]]
val clazz2: Class[Record2] = ClassTag(kafkaTopicToModelMapping(topicList.tail.head)).runtimeClass.asInstanceOf[Class[Record2]]
2) The same question applies to avroTypeInfo1, avroTypeInfo2, ..., converter1, converter2, ..., buildAvroSink1, buildAvroSink2, ... .
I also have questions about the architecture. I tried to execute this code and Flink worked correctly with different topics and Avro schema classes.
Which Flink tools can help me route the different Avro schema classes to several output streams and add sinks for them?
Do you have code examples for this?
Also, what could I use instead of Flink to generate several Avro files from different Kafka topics? Perhaps Confluent?
I'm a bit lost on your motivation. The general idea is that if you want to use a generic approach, go with GenericRecord. If you have specific code for the different types, go with SpecificRecord, but then don't use the generic code around it.
Further, if you don't need to, try your best not to mix different events in the same topic/topology. Rather, spawn a separate topology in the same main for each subtype:
def createTopology[T <: SpecificRecordBase](topic: String, clazz: Class[T]): Unit = {
  // env, properties and folder come from the surrounding context
  implicit val typeInfo: TypeInformation[T] = TypeInformation.of(clazz)
  val stream: DataStream[T] = env.addSource(
    new FlinkKafkaConsumer[T](topic, AvroDeserializationSchema.forSpecific(clazz), properties))
  stream.addSink(
    StreamingFileSink.forBulkFormat(
      Path.fromLocalFile(folder),
      ParquetAvroWriters.forSpecificRecord(clazz)).build())
}
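The per-topic topologies could then be spawned from the topic-to-class mapping already shown in the question. This is only an illustration; it assumes the createTopology signature sketched above and the kafkaTopicToModelMapping value from the FlinkRunner trait:

// one topology per (topic, Avro class) pair, with no joins between them
kafkaTopicToModelMapping.foreach { case (topic, clazz) =>
  createTopology(topic, clazz)
}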

Create Source from Lagom/Akka Kafka Topic Subscriber for Websocket

I want my Lagom subscriber-only service to subscribe to a Kafka Topic and stream the messages to a websocket. I have a service defined as follows using this documentation (https://www.lagomframework.com/documentation/1.4.x/scala/MessageBrokerApi.html#Subscribe-to-a-topic) as a guideline:
// service call
def stream(): ServiceCall[Source[String, NotUsed], Source[String, NotUsed]]

// service implementation
override def stream() = ServiceCall { req =>
  req.runForeach(str => log.info(s"client: $str"))
  kafkaTopic().subscribe.atLeastOnce(Flow.fromFunction(
    // add message to a Source and return Done
  ))
  Future.successful(/* some Source[String, NotUsed] */)
}
However, I can't quite figure out how to handle my Kafka messages. Flow.fromFunction returns a Flow[String, Done, _], which implies that I need to add those messages (strings) to a Source that has been created outside of the subscriber.
So my question is twofold:
1) How do I create an akka stream source to receive messages from a kafka topic subscriber at runtime?
2) How do I append kafka messages to said source while in the Flow?
You seem to be misunderstanding the service API of Lagom. If you're trying to materialize a stream from the body of your service call, there's no input to your call; i.e.,
def stream(): ServiceCall[Source[String, NotUsed], Source[String, NotUsed]]
implies that when the client provides a Source[String, NotUsed], the service will respond in kind. Your client is not directly providing this; therefore, your signature should likely be
def stream(): ServiceCall[NotUsed, Source[String, NotUsed]]
Now to your question...
This actually doesn't exist in the Scala giter8 template, but the Java version contains what they call an autonomous stream, which does approximately what you want to do.
In Scala, this code would look something like...
override def autonomousStream(): ServiceCall[
  Source[String, NotUsed],
  Source[String, NotUsed]
] = ServiceCall { hellos =>
  Future {
    hellos.mapAsync(8, ...)
  }
}
Since your call isn't mapping over the input stream, but rather a kafka topic, you'll want to do something like this:
override def stream(): ServiceCall[NotUsed, Source[String, NotUsed]] = ServiceCall { _ =>
  Future {
    kafkaTopic()
      .subscribe
      .atMostOnce
      .mapAsync(...)
  }
}

custom serialization in kafka using scala [duplicate]

This question already has answers here: Writing Custom Kafka Serializer (3 answers).
I am trying to build a POC with Kafka 0.10. I am using my own Scala domain class as a Kafka message, which has a bunch of String data types. I cannot use the default serializer class or the String serializer class that comes with the Kafka library. I guess I need to write my own serializer and feed it to the producer properties. If you are aware of an example custom serializer in Kafka (using Scala), please do share. Appreciate it a lot, thanks much.
You can write a custom serializer easily in Scala. You just need to extend kafka.serializer.Encoder, override the toBytes method, and provide your serialization logic there. Example code:
import com.google.gson.Gson
import kafka.utils.VerifiableProperties

class MessageSerializer(props: VerifiableProperties) extends kafka.serializer.Encoder[Message] {
  private val gson: Gson = new Gson()

  override def toBytes(t: Message): Array[Byte] = {
    val jsonStr = gson.toJson(t)
    jsonStr.getBytes
  }
}
In this code we are using Google Gson to serialize Message to JSON; however, you can use any other serialization framework.
Now you just need to provide this serializer in the properties when instantiating the producer, i.e.:
props.put("serializer.class","codec.MessageSerializer");
Edit:
There is another way to do it, by extending Serializer directly:
import java.util

import com.google.gson.Gson
import org.apache.kafka.common.serialization.Serializer

class MessageSerializer extends Serializer[Message] {
  private val gson: Gson = new Gson()

  override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = {
    // nothing to do
  }

  override def serialize(topic: String, data: Message): Array[Byte] = {
    if (data == null) null
    else gson.toJson(data).getBytes
  }

  override def close(): Unit = {
    // nothing to do
  }
}
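With the second (Serializer-based) approach, the class is typically wired in through the standard producer configuration. A minimal sketch, assuming the MessageSerializer above and placeholder values for the topic name and the Message instance:

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", classOf[StringSerializer].getName)
props.put("value.serializer", classOf[MessageSerializer].getName)

val producer = new KafkaProducer[String, Message](props)

val msg: Message = ??? // some instance of the domain class from the question
producer.send(new ProducerRecord[String, Message]("my-topic", "key-1", msg))
producer.close()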

Spark-Streaming from an Actor

I would like to have a consumer actor subscribe to a Kafka topic and stream data for further processing with Spark Streaming outside the consumer. Why an actor? Because I read that its supervisor strategy would be a great way to handle Kafka failures (e.g., restart on a failure).
I found two options:
The Java KafkaConsumer class: its poll() method returns a Map[String, Object]. I would like a DStream to be returned just like KafkaUtils.createDirectStream would, and I don't know how to fetch the stream from outside the actor.
Extend the ActorHelper trait and use actorStream(), as shown in this example. This latter option doesn't show a connection to a topic but to a socket.
Could anyone point me in the right direction?
For handling Kafka failures, I used the Apache Curator framework and the following workaround:
val client: CuratorFramework = ... // see docs
val zk: CuratorZookeeperClient = client.getZookeeperClient

/**
 * This method returns false if kafka or zookeeper is down.
 */
def isKafkaAvailable: Boolean =
  Try {
    if (zk.isConnected) {
      val xs = client.getChildren.forPath("/brokers/ids")
      xs.size() > 0
    }
    else false
  }.getOrElse(false)
For consuming Kafka topics, I used the com.softwaremill.reactivekafka library. For example:
class KafkaConsumerActor extends Actor {
  val kafka = new ReactiveKafka()
  val config: ConsumerProperties[Array[Byte], Any] = ... // see docs

  override def preStart(): Unit = {
    super.preStart()

    val publisher = kafka.consume(config)
    Source.fromPublisher(publisher)
      .map(handleKafkaRecord)
      .to(Sink.ignore).run()
  }

  /**
   * This method will be invoked for every incoming kafka record.
   */
  def handleKafkaRecord(r: ConsumerRecord[Array[Byte], Any]) = {
    // handle record
  }
}
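Regarding the supervision motivation in the question, the consumer actor above could be placed under a supervisor that restarts it on failure. A minimal sketch with classic Akka supervision; the names and restart policy are illustrative:

import akka.actor.{Actor, ActorRef, OneForOneStrategy, Props, SupervisorStrategy}
import scala.concurrent.duration._

class KafkaConsumerSupervisor extends Actor {
  // restart the consumer on any failure, up to 10 times per minute
  override val supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case _: Exception => SupervisorStrategy.Restart
    }

  private val consumer: ActorRef =
    context.actorOf(Props[KafkaConsumerActor], "kafka-consumer")

  override def receive: Receive = {
    case msg => consumer.forward(msg)
  }
}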