Custom serialization in Kafka using Scala [duplicate]

This question already has answers here: Writing Custom Kafka Serializer (3 answers). Closed 2 years ago.
I am trying to build a POC with Kafka 0.10. I am using my own Scala domain class as the Kafka message; it has a bunch of String fields. I cannot use the default serializer class or the String serializer class that comes with the Kafka library. I guess I need to write my own serializer and feed it to the producer properties. If you are aware of an example of writing a custom serializer for Kafka (using Scala), please do share. Appreciate it a lot, thanks.

You can write a custom serializer easily in Scala. You just need to extend kafka.serializer.Encoder, override the toBytes method, and put your serialization logic there. Example code:
import com.google.gson.Gson
import kafka.utils.VerifiableProperties

class MessageSerializer(props: VerifiableProperties) extends kafka.serializer.Encoder[Message] {

  private val gson: Gson = new Gson()

  override def toBytes(t: Message): Array[Byte] = {
    val jsonStr = gson.toJson(t)
    jsonStr.getBytes
  }
}
In this code we are using Google Gson to serialize Message to JSON; however, you can use any other serialization framework.
Now you just need to provide this serializer in the properties while instantiating the producer, i.e.:
props.put("serializer.class", "codec.MessageSerializer")
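For context, a minimal sketch of how this might be wired into the old Scala producer API that ships with Kafka (the broker address, topic name, and message value below are placeholders, not from the original question):

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

val props = new Properties()
// old-producer settings: broker list plus the custom value encoder from above
props.put("metadata.broker.list", "localhost:9092")
props.put("serializer.class", "codec.MessageSerializer")

val producer = new Producer[String, Message](new ProducerConfig(props))
val message: Message = ???  // your domain object
producer.send(new KeyedMessage[String, Message]("my-topic", message))
producer.close()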
Edit:
There is also another way to do it, by extending org.apache.kafka.common.serialization.Serializer directly. Code:
import java.util

import com.google.gson.Gson
import org.apache.kafka.common.serialization.Serializer

class MessageSerializer extends Serializer[Message] {

  private val gson: Gson = new Gson()

  override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = {
    // nothing to do
  }

  override def serialize(topic: String, data: Message): Array[Byte] = {
    if (data == null)
      null
    else
      gson.toJson(data).getBytes
  }

  override def close(): Unit = {
    // nothing to do
  }
}
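To use this second variant you reference the class in the new producer's configuration. A hedged sketch (the bootstrap server, package name, topic, and key are placeholders):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[MessageSerializer].getName)

val producer = new KafkaProducer[String, Message](props)
val message: Message = ???  // your domain object
producer.send(new ProducerRecord[String, Message]("my-topic", "some-key", message))
producer.close()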

Related

In twitter Kryo/Chill serializer, how to modify a class such that initialization code can run after every deserialization?

If I use the Java serializer, I can write the following Scala code:
import java.io.{ObjectInputStream, ObjectOutputStream}

trait BeforeAndAfterSerialize extends Serializable {

  def beforeWrite(): Unit = {}
  def afterRead(): Unit = {}

  private def writeObject(out: ObjectOutputStream): Unit = {
    beforeWrite()
    out.defaultWriteObject()
  }

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    afterRead()
  }
}
Such that after a new instance is created through deserialization, the afterRead part of the code can be run to initialize the new instance.
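For illustration only, a hypothetical class mixing in that trait might rebuild a transient cache after every round trip (class and field names are made up):

class CachedLookup(val entries: Seq[String]) extends BeforeAndAfterSerialize {

  // not serialized; rebuilt on construction and again after every deserialization
  @transient private var index: Map[String, Int] = _
  rebuild()

  private def rebuild(): Unit = { index = entries.zipWithIndex.toMap }

  override def afterRead(): Unit = rebuild()

  def lookup(word: String): Option[Int] = index.get(word)
}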
How to do the same in Kryo?
P.S. In this situation I'm not allowed to get or modify the Kryo instance being used, because it is created and managed by a third party software (e.g. Apache Spark)
UPDATE 1: the above is a more general case of the following, more specific problem.
For any Kryo instance, the behaviour of BeforeAndAfterSerialize can always be customised by implementing KryoSerializable, which has 2 abstract methods:
import com.esotericsoftware.kryo.{Kryo, KryoSerializable}
import com.esotericsoftware.kryo.io.{Input, Output}

trait BeforeAndAfterSerialize extends Serializable with KryoSerializable {
  ....

  override def write(kryo: Kryo, output: Output): Unit = ???
  override def read(kryo: Kryo, input: Input): Unit = ???
}
If I only need to customise "write", I need to implement "read" with a default implementation. Now the problem is: what is this default implementation?

How to use Avro serialization for scala case classes with Flink 1.7?

We've got a Flink job written in Scala using case classes (generated from avsc files by avrohugger) to represent our state. We would like to use Avro for serialising our state so that state migration will work when we update our models. We understood that since Flink 1.7 Avro serialization is supported OOTB. We added the flink-avro module to the classpath, but when restoring from a saved snapshot we notice that it is still trying to use Kryo serialization. Relevant code snippet:
case class Foo(id: String, timestamp: java.time.Instant)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val conf = env.getConfig
conf.disableForceKryo()
conf.enableForceAvro()
val rawDataStream: DataStream[String] = env.addSource(MyFlinkKafkaConsumer)
val parsedDataStream: DataStream[Foo] = rawDataStream.flatMap(new JsonParser[Foo])
// do something useful with it
env.execute("my-job")
When performing a state migration on Foo (e.g. by adding a field and deploying the job) I see that it tries to deserialize using Kryo, which obviously fails. How can I make sure Avro serialization is being used?
UPDATE
Found out about https://issues.apache.org/jira/browse/FLINK-10897, so POJO state serialization with Avro is only supported from 1.8 afaik. I tried it using the latest RC of 1.8 with a simple WordCount POJO that extends from SpecificRecord:
/** MACHINE-GENERATED FROM AVRO SCHEMA. DO NOT EDIT DIRECTLY */
import scala.annotation.switch
case class WordWithCount(var word: String, var count: Long)
  extends org.apache.avro.specific.SpecificRecordBase {

  def this() = this("", 0L)

  def get(field$: Int): AnyRef = {
    (field$: @switch) match {
      case 0 => {
        word
      }.asInstanceOf[AnyRef]
      case 1 => {
        count
      }.asInstanceOf[AnyRef]
      case _ => new org.apache.avro.AvroRuntimeException("Bad index")
    }
  }

  def put(field$: Int, value: Any): Unit = {
    (field$: @switch) match {
      case 0 => this.word = {
        value.toString
      }.asInstanceOf[String]
      case 1 => this.count = {
        value
      }.asInstanceOf[Long]
      case _ => new org.apache.avro.AvroRuntimeException("Bad index")
    }
    ()
  }

  def getSchema: org.apache.avro.Schema = WordWithCount.SCHEMA$
}

object WordWithCount {
  val SCHEMA$ = new org.apache.avro.Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"WordWithCount\",\"fields\":[{\"name\":\"word\",\"type\":\"string\"},{\"name\":\"count\",\"type\":\"long\"}]}")
}
This, however, also didn’t work out of the box. We then tried to define our own type information using flink-avro’s AvroTypeInfo but this fails because Avro looks for a SCHEMA$ property (SpecificData:285) in the class and is unable to use Java reflection to identify the SCHEMA$ in the Scala companion object.
I could never get reflection to work due to Scala's fields being private under the hood. AFAIK the only solution is to update Flink to use avro's non-reflection-based constructors in AvroInputFormat (compare).
In a pinch, other than Java, one could fall back to Avro's GenericRecord, maybe using avro4s to generate them from avrohugger's Standard format (note that avro4s will generate its own schema from the generated Scala types).
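A hedged sketch of that GenericRecord fallback with avro4s's RecordFormat (the case class here is illustrative, not the job's actual model):

import com.sksamuel.avro4s.RecordFormat
import org.apache.avro.generic.GenericRecord

case class WordCountRecord(word: String, count: Long)

val format = RecordFormat[WordCountRecord]
val record: GenericRecord = format.to(WordCountRecord("flink", 42L))  // case class -> GenericRecord
val back: WordCountRecord = format.from(record)                       // GenericRecord -> case class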

How To Mock Out KafkaProducer Used Inside a Scala Class

I want to write a unit test for a Scala class. The purpose of the class is to collect metrics and post them on a Kafka topic. I am trying to mock the producer in the unit test to ensure sanity of the rest of the code. Below is a simplified version of my class:
class MyEmitter(sparkConf: SparkConf) {

  <snip> -- member variables

  private val kafkaProducer = createProducer()

  def createProducer(): Producer[String, MyMetricClass] = {
    val props = new Properties()
    ...
    Code to initialize properties
    ...
    new KafkaProducer[String, MyMetricClass](props)
  }

  def initEmitter(metricName: String): SomeClass = {
    // Some implementation
  }

  def collect(key: String, value: String): Unit = {
    // Some implementation
  }

  def emit(): Unit = {
    val record = new ProducerRecord("<topic name>", "<key>", "<value>")
    kafkaProducer.send(record)
  }
}
What I would like to do in my unit test is mock out the producer and check whether send() has been called and, if so, whether the producer record matches the expectation. I have been unable to find a solution on my own, and googling has also been unfruitful. If anyone knows how this problem could be solved, I will be most grateful.
'new' is generally an enemy of testing, so you should extract the creation of that object so that you can pass in either a real KafkaProducer or a mock.
One way to do it without changing the interface could be:
def createProducer(
  producer: Properties => Producer[String, MyMetricClass] =
    props => new KafkaProducer[String, MyMetricClass](props)
): Producer[String, MyMetricClass] = {
  val props = new Properties()
  producer(props)
}
So then in real code you keep calling
myEmitter.createProducer()
but in a test you'd do
val producerMock = mock[KafkaProducer[String, MyMetricClass]]
myEmitter.createProducer(_ => producerMock)
Another good thing about this is that you could also stub the function itself, so you can verify that the props your method creates are the expected ones.
Hope it helps.
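Building on that, a hedged sketch of the verification step with plain Mockito (assuming producerMock is typed as KafkaProducer[String, MyMetricClass] and was injected as above):

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.mockito.ArgumentCaptor
import org.mockito.Mockito.verify

// capture the record that emit() handed to send()
val recordCaptor: ArgumentCaptor[ProducerRecord[String, MyMetricClass]] =
  ArgumentCaptor.forClass(classOf[ProducerRecord[String, MyMetricClass]])

verify(producerMock).send(recordCaptor.capture())
assert(recordCaptor.getValue.topic() == "<topic name>")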

Kafka Streams - Convert Expression to Single Abstract Method

I am building an application using Kafka Streams with Scala. In it, I have a use case where I have to apply map() on a KStream.
Since Kafka Streams does not provide a Scala API, I have to write the map function like this:
val builder = new KStreamBuilder()
val originalStream = builder.stream("SourceTopic")

val mappedStream =
  originalStream.map[String, Integer] {
    new KeyValueMapper[String, String, KeyValue[String, Integer]] {
      override def apply(key: String, value: String): KeyValue[String, Integer] = {
        new KeyValue(key, new Integer(value.length))
      }
    }
  }
The above code compiles and runs fine, but gives a warning: Convert Expression to Single Abstract Method.
So, my question is: how do I convert the above map expression into a SAM?
Any help is appreciated!
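One possible shape, assuming Scala 2.12+ where lambdas convert to Java functional interfaces (a sketch, not a verified answer): give the lambda the mapper's explicit type and pass it to map.

// explicit type ascription lets the compiler apply the SAM conversion
val mapper: KeyValueMapper[String, String, KeyValue[String, Integer]] =
  (key: String, value: String) => new KeyValue(key, new Integer(value.length))

val mappedStream = originalStream.map[String, Integer](mapper)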

How to attach a HashMap to a Configuration object in Flink?

I want to share a HashMap across every node in Flink and allow the nodes to update that HashMap. I have this code so far:
object ParallelStreams {

  val env = StreamExecutionEnvironment.getExecutionEnvironment

  // Is there a way to attach a HashMap to this config variable?
  val config = new Configuration()
  config.setClass("HashMap", classOf[CustomGlobal])

  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

  class CustomGlobal extends ExecutionConfig.GlobalJobParameters {
    override def toMap: util.Map[String, String] = {
      new HashMap[String, String]()
    }
  }

  class MyCoMap extends RichCoMapFunction[String, String, String] {

    var users: HashMap[String, String] = null

    // How do I get access to the HashMap I attach to the global config here?
    override def open(parameters: Configuration): Unit = {
      super.open(parameters)
      val globalParams = getRuntimeContext.getExecutionConfig.getGlobalJobParameters
      val globalConf = globalParams.asInstanceOf[Configuration]
      val hashMap = globalConf.getClass
    }

    // Other functions to override here
  }
}
I was wondering if you can attach a custom object to the config variable created here via val config = new Configuration()? (Please see the comments in the code above.)
I noticed you can only attach primitive values. I created a custom class that extends ExecutionConfig.GlobalJobParameters and attached it by doing config.setClass("HashMap", classOf[CustomGlobal]), but I am not sure if that is how you are supposed to do it.
The common way to distribute parameters to operators is to have them as regular member variables in the function class. The function object that is created and assigned during plan construction is serialized and shipped to all workers. So you don't have to pass parameters via a configuration.
This would look as follows:
class MyMapper(map: HashMap[String, String]) extends MapFunction[String, String] {
  // class definition
}

val inStream: DataStream[String] = ???
val myHashMap: HashMap[String, String] = ???
val myMapper: MyMapper = new MyMapper(myHashMap)

val mappedStream: DataStream[String] = inStream.map(myMapper)
The myMapper object is serialized (using Java serialization) and shipped for execution. So the type of map must implement the Java Serializable interface.
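A small, self-contained sketch of that pattern (class name and lookup logic are illustrative):

import org.apache.flink.api.common.functions.MapFunction
import scala.collection.immutable.HashMap

// the map is an ordinary constructor argument; it gets serialized with the function object
class UserEnricher(users: HashMap[String, String]) extends MapFunction[String, String] {
  override def map(value: String): String =
    users.getOrElse(value, "unknown")
}

// usage: inStream.map(new UserEnricher(HashMap("alice" -> "admin")))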
EDIT: I missed the part that you want the map to be updatable from all parallel tasks. That is not possible with Flink. You would have to either fully replicate the map and all updates (by broadcasting) or use an external system (a key-value store) for that.