Apply generic to deserialize in Kafka using Flink - apache-kafka

I am trying to deserialize objects consumed from Kafka so I can process them in Flink.
Since the records are consumed from Kafka, I used the KafkaDeserializationSchema class.
Whenever a new object type is added, I have to add another deserialization class like the ones below.
//Existing deserialize class
class ADeserialize extends KafkaDeserializationSchema[TypeAClass] {
  val mapper: ObjectMapper = new ObjectMapper
  override def isEndOfStream(nextElement: TypeAClass): Boolean = false
  override def deserialize(record: ConsumerRecord[Array[Byte], Array[Byte]]): TypeAClass = {
    val jsonNode = mapper.readValue(record.value(), classOf[JsonNode])
    mapper.treeToValue(jsonNode, classOf[TypeAClass])
  }
  override def getProducedType: TypeInformation[TypeAClass] = Types.CASE_CLASS[TypeAClass]
}
//Newly added deserialize class
class BDeserialize extends KafkaDeserializationSchema[TypeBClass] {
  val mapper: ObjectMapper = new ObjectMapper
  override def isEndOfStream(nextElement: TypeBClass): Boolean = false
  override def deserialize(record: ConsumerRecord[Array[Byte], Array[Byte]]): TypeBClass = {
    val jsonNode = mapper.readValue(record.value(), classOf[JsonNode])
    mapper.treeToValue(jsonNode, classOf[TypeBClass])
  }
  override def getProducedType: TypeInformation[TypeBClass] = Types.CASE_CLASS[TypeBClass]
}
As you can see, every time a different source is added I have to create yet another deserializer, which duplicates code. To avoid this, I think converting the deserializer into a generic one is the right idea, but I simply failed to do that with KafkaDeserializationSchema. My Flink version is 1.11 since it is a legacy setup.
Any help will be appreciated. Thanks.

What you want is something like:
class MyJsonDeserializationSchema[T](implicit typeInfo: TypeInformation[T], classTag: ClassTag[T])
  extends KafkaDeserializationSchema[T] {
  val mapper: ObjectMapper = new ObjectMapper
  override def isEndOfStream(nextElement: T): Boolean = false
  override def deserialize(record: ConsumerRecord[Array[Byte], Array[Byte]]): T =
    mapper.readValue(record.value(), classTag.runtimeClass.asInstanceOf[Class[T]])
  override def getProducedType: TypeInformation[T] = typeInfo
}
Note that a scala.reflect.ClassTag[T] is needed as well, because classOf[T] cannot be used with an unbound type parameter.
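Assuming org.apache.flink.api.scala._ is imported so that an implicit TypeInformation[T] can be derived, a minimal usage sketch with the legacy Flink 1.11 consumer might look like the following; the topic name, properties and environment are placeholders, not from the original question.
// Hypothetical wiring of the generic schema with the legacy FlinkKafkaConsumer.
val consumerA = new FlinkKafkaConsumer[TypeAClass](
  "my-topic",                                  // placeholder topic
  new MyJsonDeserializationSchema[TypeAClass], // TypeInformation and ClassTag resolved implicitly
  kafkaProps                                   // placeholder java.util.Properties
)
val streamA = env.addSource(consumerA)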

Related

Flink KafkaSource & KafkaSink with GenericRecords and Confluent Schema Registry

I was faced with the problem of reading from and writing to Kafka using KafkaSource and KafkaSink with Flink v1.16 (Scala 2.12) and the Confluent Schema Registry. The events should be read and written as GenericRecords. Below is an overview of the approach that I came up with. It is far from perfect, but I hope it helps someone to get a general idea of how things can be tied together.
KafkaSource
First I created a POJO class to deserialize the event into (Why POJO). Be careful to adhere to the POJO conventions.
class InputEvent
(
var timestamp: Long
, var someValue: String
) extends Serializable {
def this() = this(0L, "")
def canEqual(other: Any): Boolean = ...
override def equals(other: Any): Boolean = ...
override def hashCode(): Int = ...
override def toString = ...
}
I then created a companion object that extends the SchemaProjectable[T] trait. It is important that the schema is kept as a string and only parsed later in the getSchema method (to avoid serialization issues).
object InputEvent extends SchemaProjectable[InputEvent] with Serializable {
val SCHEMA = "{\"type\":\"record\",\"name\":\"InputEvent\",\"namespace\":\"com.blog.post\",\"doc\":\"Input Events\",\"fields\":[{\"name\":\"timestamp\",\"type\":\"long\",\"doc\":\"Event timestamp in seconds\"},{\"name\":\"someValue\",\"type\":\"string\",\"doc\":\"Some Value Field\"}]}"
override def getSchema: Schema = new Schema.Parser().parse(SCHEMA)
override def projectFromGeneric(in: GenericRecord): InputEvent = {
new InputEvent(
in.getLong("timestamp"),
in.getString("someValue")
)
}
override def projectToGeneric(in: InputEvent): GenericRecord = new GenericData.Record(getSchema)
}
The SchemaProjectable[T] trait is used later for the de/serializer to have a common interface and to allow an event to be converted from and to a GenericRecord easily.
abstract class SchemaProjectable[T] extends Serializable {
def getSchema : Schema
def projectFromGeneric(in: GenericRecord) : T
def projectToGeneric(in: T) : GenericRecord
}
Subsequently, the builder of the KafkaSource is put into a separate object. The getKafkaSource[T] method requires the configuration properties as well as an implementation of trait SchemaProjectable[T].
object GenericKafkaSource {
def getKafkaSource[T]
(properties: Properties, schemaProjectable: SchemaProjectable[T])
(implicit tInfo: TypeInformation[T]): KafkaSource[T] = {
KafkaSource.builder[T]
.setProperties(properties)
.setTopics(properties.getProperty("topicName"))
.setStartingOffsets(configToOffset(properties.getProperty("offset")))
.setDeserializer(
new GenericDeserializationSchema(schemaProjectable, properties.getProperty("schema.registry.url"))
)
.build
}
}
As a last step, the generic deserializer is implemented using the ConfluentRegistryAvroDeserializationSchema.
class GenericDeserializationSchema[T]
(schemaProjectable: SchemaProjectable[T], url: String)
(implicit tInfo: TypeInformation[T]) extends KafkaRecordDeserializationSchema[T]{
private val deserializationSchema = ConfluentRegistryAvroDeserializationSchema.forGeneric(schemaProjectable.getSchema,url)
override def deserialize(record: ConsumerRecord[Array[Byte], Array[Byte]], out: Collector[T]): Unit = {
out.collect(
schemaProjectable.projectFromGeneric(deserializationSchema.deserialize(record.value()))
)
}
override def getProducedType: TypeInformation[T] = Types.of[T]
}
Finally, the KafkaSource can be used as follows:
val source = GenericKafkaSource.getKafkaSource(consumerProperties, InputEvent)
KafkaSink
For the KafkaSink, we take a similar approach to the one used for the KafkaSource. Again, we create a class and companion object for the output event.
class OutputEvent
(
var timestamp: Long
, var resultValue: String
) extends Serializable {
def this() = this(0L, "")
def canEqual(other: Any): Boolean = ...
override def equals(other: Any): Boolean = ...
override def hashCode(): Int = ...
override def toString = ...
}
object OutputEvent extends SchemaProjectable[OutputEvent] with Serializable{
val SCHEMA = "{\"type\":\"record\",\"name\":\"OutputEvent\",\"namespace\":\"com.blog.post\",\"fields\":[{\"name\":\"timestamp\",\"type\":[\"null\",\"long\"],\"doc\":\"Timestamp of the event\"},{\"name\":\"resultValue\",\"type\":[\"null\",\"string\"],\"doc\":\"Result value\"}]}"
override def getSchema: Schema = new Schema.Parser().parse(SCHEMA)
override def projectFromGeneric(in: GenericRecord): OutputEvent = new OutputEvent()
override def projectToGeneric(in: OutputEvent): GenericRecord = {
val record = new GenericData.Record(getSchema)
record.put("timestamp", in.timestamp)
record.put("resultValue", in.resultValue)
record
}
}
The KafkaSink builder is factored out to a separate object.
object GenericKafkaSink {
def getGenericKafkaSink[T]
(properties: Properties, schemaProjectable: SchemaProjectable[T])
(implicit tInfo: TypeInformation[T]): KafkaSink[T] = {
val topicName = properties.getProperty("topicName")
val url = properties.getProperty("schema.registry.url")
KafkaSink.builder[T]
.setKafkaProducerConfig(properties)
.setBootstrapServers(properties.getProperty("bootstrap.servers"))
.setRecordSerializer(
new GenericSerializationSchema[T](topicName, schemaProjectable, url)
)
.setTransactionalIdPrefix(properties.getProperty("transactionId"))
.build
}
}
Subsequently, the SerializationSchema is implemented as follows using the ConfluentRegistryAvroSerializationSchema.
class GenericSerializationSchema[T]
(topicName: String, schemaProjectable: SchemaProjectable[T], url: String)
(implicit tInfo: TypeInformation[T]) extends KafkaRecordSerializationSchema[T] with Serializable {
lazy private val serializationSchema: ConfluentRegistryAvroSerializationSchema[GenericRecord] = ConfluentRegistryAvroSerializationSchema.forGeneric(topicName+"-value",schemaProjectable.getSchema,url)
override def serialize(element: T, context: KafkaRecordSerializationSchema.KafkaSinkContext, timestamp: lang.Long): ProducerRecord[Array[Byte], Array[Byte]] = {
new ProducerRecord[Array[Byte], Array[Byte]](
topicName
, serializationSchema.serialize(schemaProjectable.projectToGeneric(element))
)
}
}
Finally, the KafkaSink can be used as follows:
val sink = GenericKafkaSink.getGenericKafkaSink(producerProperties, OutputEvent)
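To give an idea of how the two pieces could be tied together in a job, here is a rough sketch; the environment setup, the no-watermark strategy and the placeholder map function are my own assumptions (as is the implicit TypeInformation derivation from the Flink Scala API being in scope), not part of the original setup.
// Hypothetical end-to-end wiring of the generic source and sink.
val env = StreamExecutionEnvironment.getExecutionEnvironment
val source = GenericKafkaSource.getKafkaSource(consumerProperties, InputEvent)
val sink = GenericKafkaSink.getGenericKafkaSink(producerProperties, OutputEvent)
env
  .fromSource(source, WatermarkStrategy.noWatermarks[InputEvent](), "generic-kafka-source")
  .map(e => new OutputEvent(e.timestamp, e.someValue)) // placeholder transformation
  .sinkTo(sink)
env.execute("generic-kafka-job")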
Any opinions on the approach? I'm glad for any feedback!
Kind Regards
Dominik

How to create a custom collection that extends Set in Scala?

I want to create a new custom Scala collection based on the existing Set collection, which I can later extend with some additional functions.
For the moment, I am just trying to make it behave like the standard Set collection. I followed this guide: https://docs.scala-lang.org/overviews/core/custom-collections.html, where a custom collection extending Map is built. However, extending Set seems a bit different, and I am running into a ton of type-incompatibility errors that I am not able to resolve.
Here's my starting code; the parts I am not able to define are addOne, subtractOne, contains and iterator.
import scala.collection._
class TestSet[A]
extends mutable.Set[A]
with mutable.SetOps[A, mutable.Set, TestSet[A]] {
override def empty: TestSet[A] = new TestSet
// Members declared in scala.collection.mutable.Clearable
override def clear(): Unit = immutable.Set.empty
// Members declared in scala.collection.IterableOps
override protected def fromSpecific(coll: IterableOnce[A]): TestSet[A] = TestSet.from(coll)
override protected def newSpecificBuilder: mutable.Builder[A, TestSet[A]] = TestSet.newBuilder
override def className = "TestSet"
//override def subtractOne(elem: A): TestSet.this.type = ???
//override def addOne(elem: A): TestSet.this.type = ???
//override def contains(elem: A): Boolean = ???
//override def iterator: Iterator[A] = {
//}
}
object TestSet {
def empty[A] = new TestSet[A]
def from[A](source: IterableOnce[A]): TestSet[A] =
source match {
case pm: TestSet[A] => pm
case _ => (newBuilder ++= source).result()
}
def apply[A](elem: A*): TestSet[A] = from(elem)
def newBuilder[A]: mutable.Builder[A, TestSet[A]] =
new mutable.GrowableBuilder[A, TestSet[A]](empty)
import scala.language.implicitConversions
implicit def toFactory[A](self: this.type): Factory[A, TestSet[A]] =
new Factory[A, TestSet[A]] {
def fromSpecific(it: IterableOnce[A]): TestSet[A] = self.from(it)
def newBuilder: mutable.Builder[A, TestSet[A]] = self.newBuilder
}
}
I will interpret your question as "I want to do something similar to the Map example from https://docs.scala-lang.org/overviews/core/custom-collections.html for a mutable Set, but now I am stuck with this code" and will try to answer that (ignoring any other aspects of your question).
What you need to understand is that mutable.Set and mutable.SetOps are just traits that provide some reusable parts of implementation, but they do not contain any actual data structures.
So if you want to implement your own implementation, you will have to provide the actual underlying data structure yourself (similar to how the PrefixMap from that link has private var suffixes and private var value).
For example, you could use an underlying immutable Set like this:
class TestSet[A]
extends mutable.Set[A]
with mutable.SetOps[A, mutable.Set, TestSet[A]] {
// the underlying data structure
private var data = immutable.Set.empty[A]
// ATTENTION: your implementation was different and buggy
override def clear(): Unit = {
data = mutable.Set.empty
}
override def subtractOne(elem: A): TestSet.this.type = {
data = data - elem
this
}
override def addOne(elem: A): TestSet.this.type = {
data = data + elem
this
}
override def contains(elem: A): Boolean = data.contains(elem)
override def iterator: Iterator[A] = {
data.iterator
}
// ...
}
Note that the above is just an example of what you could do in order to get your code to work - I'm not saying that it's a good idea.
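As a quick sanity check, a hypothetical usage of the resulting collection (assuming the implementation above) could be:
// Hypothetical usage of TestSet; iteration order is not guaranteed.
val s = TestSet(1, 2, 3)
s += 4 // delegates to addOne
s -= 2 // delegates to subtractOne
println(s.contains(3)) // true
println(s.iterator.toList) // e.g. List(1, 3, 4)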

Configuring implicits in Scala

I have a type class:
trait ProcessorTo[T]{
def process(s: String): T
}
and its implementation
class DefaultProcessor extends ProcessorTo[String]{
def process(s: String): String = s
}
trait DefaultProcessorSupport{
implicit val p: ProcessorTo[String] = new DefaultProcessor
}
To make it available for use, I created
object ApplicationContext
extends DefaultProcessorSupport
with //Some other typeclasses
But now I have to add a processor which performs a database read. The DB URL etc. are placed in a configuration file that is available only at runtime. For now I did the following.
class DbProcessor extends ProcessorTo[Int]{
private var config: Config = _
def start(config: Config) = //set the configuration, open connections etc
//Other implementation
}
object ApplicationContext{
implicit val p: ProcessorTo[Int] = new DbProcessor
def configure(config: Config) = p.asInstanceOf[DbProcessor].start(config)
}
It works for me, but I'm not sure about this technique; it looks a little strange to me. Is it bad practice? If so, what would be a good solution?
I am a bit confused by the requirements, as DbProcessor is missing the process implementation and trait ProcessorTo[T] is missing the start method that is defined in DbProcessor. So I will assume the following while answering: the type class has both process and start methods.
Define a type class:
trait ProcessorTo[T]{
def start(config: Config): Unit
def process(s: String): T
}
Provide implementations for the type class in the companion objects:
object ProcessorTo {
implicit object DbProcessor extends ProcessorTo[Int] {
override def start(config: Config): Unit = ???
override def process(s: String): Int = ???
}
implicit object DefaultProcessor extends ProcessorTo[String] {
override def start(config: Config): Unit = ???
override def process(s: String): String = s
}
}
and use it in your ApplicationContext as follows:
object ApplicationContext {
def configure[T](config: Config)(implicit ev: ProcessorTo[T]) = ev.start(config)
}
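A hypothetical usage, assuming a Config value is available at runtime; the implicit instances are picked up from ProcessorTo's companion object:
// Hypothetical usage; `config` is assumed to be loaded elsewhere at runtime.
ApplicationContext.configure[Int](config) // starts the DbProcessor
val s: String = implicitly[ProcessorTo[String]].process("hello") // "hello"
val n: Int = implicitly[ProcessorTo[Int]].process("some key") // whatever DbProcessor returns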
This is a nice blog post about Type Classes: http://danielwestheide.com/blog/2013/02/06/the-neophytes-guide-to-scala-part-12-type-classes.html
I don't really see why you need start. If your implicit DbProcessor has a dependency, why not make it an explicit dependency via constructor? I mean something like this:
class DbConfig(val settings: Map[String, Object]) {}
class DbProcessor(config: DbConfig) extends ProcessorTo[Int] {
// here goes actual configuration of the processor using config
private val mappings: Map[String, Int] = config.settings("DbProcessor").asInstanceOf[Map[String, Int]]
override def process(s: String): Int = mappings.getOrElse(s, -1)
}
object ApplicationContext {
// first create config then pass it explicitly
val config = new DbConfig(Map[String, Object]("DbProcessor" -> Map("1" -> 123)))
implicit val p: ProcessorTo[Int] = new DbProcessor(config)
}
Or if you like Cake pattern, you can do something like this:
trait DbConfig {
def getMappings(): Map[String, Int]
}
class DbProcessor(config: DbConfig) extends ProcessorTo[Int] {
// here goes actual configuration of the processor using config
private val mappings: Map[String, Int] = config.getMappings()
override def process(s: String): Int = mappings.getOrElse(s, -1)
}
trait DbProcessorSupport {
self: DbConfig =>
implicit val dbProcessor: ProcessorTo[Int] = new DbProcessor(self)
}
object ApplicationContext extends DbConfig with DbProcessorSupport {
override def getMappings(): Map[String, Int] = Map("1" -> 123)
}
So the only thing you do in your ApplicationContext is provide the actual implementation of the DbConfig trait.
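A hypothetical check of the cake-pattern wiring above:
// Hypothetical usage; importing ApplicationContext brings the implicit dbProcessor into scope.
import ApplicationContext._
val hit: Int = implicitly[ProcessorTo[Int]].process("1") // 123, from the mappings
val miss: Int = implicitly[ProcessorTo[Int]].process("2") // -1, the default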

How can I predict which implementation will be chosen when mixing in multiple traits with conflicting abstract overrides?

Consider this example:
abstract class Writer {
def write(message: String): Unit
}
trait UpperCaseFilter extends Writer {
abstract override def write(message: String) =
super.write(message.toUpperCase)
}
trait LowerCaseFilter extends Writer {
abstract override def write(message: String) =
super.write(message.toLowerCase)
}
class StringWriter extends Writer {
val sb = new StringBuilder
override def write(message: String) =
sb.append(message)
override def toString = sb.toString
}
object Main extends App {
val writer = new StringWriter with UpperCaseFilter with LowerCaseFilter
writer.write("Hello, world!")
println(writer)
}
I was surprised by the output “HELLO, WORLD!” Why is the output not “hello, world!” or a compilation error?
The logic that decides this is called linearization. You can find more information about it here:
http://www.artima.com/pins1ed/traits.html#12.6
In your case the whole class hierarchy would be linearized like this:
LowerCaseFilter > UpperCaseFilter > StringWriter > Writer > AnyRef > Any
So, as you can see, LowerCaseFilter's write is called first, its super call goes to UpperCaseFilter, and UpperCaseFilter applies the last transformation before the message reaches StringWriter, which is why you see "HELLO, WORLD!".
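A quick way to see the effect of linearization (an illustrative sketch, not from the original answer) is to reverse the mixin order, which reverses the super-call chain and therefore the result:
// With the order reversed, UpperCaseFilter's write runs first and LowerCaseFilter
// applies the final transformation before the message reaches StringWriter.
val writer2 = new StringWriter with LowerCaseFilter with UpperCaseFilter
writer2.write("Hello, world!")
println(writer2) // prints "hello, world!"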

Problems using Nothing bottom type while trying to create generic zeros for parametrized monoids

Here's my code. It permits creating typesafe MongoDB queries using Casbah:
trait TypesafeQuery[ObjectType, BuildType] {
def build: BuildType
}
trait TypesafeMongoQuery[ObjectType] extends TypesafeQuery[ObjectType, DBObject]
case class TypesafeMongoQueryConjunction[ObjectType](queries: Seq[TypesafeMongoQuery[ObjectType]]) extends TypesafeMongoQuery[ObjectType] {
override def build(): DBObject = $and(queries.map(_.build))
}
case class TypesafeMongoQueryDisjunction[ObjectType](queries: Seq[TypesafeMongoQuery[ObjectType]]) extends TypesafeMongoQuery[ObjectType] {
override def build(): DBObject = $or(queries.map(_.build))
}
object TypesafeMongoQuery {
// TODO could probably be reworked? see http://stackoverflow.com/questions/23917459/best-way-to-create-a-mongo-expression-that-never-matches
val AlwaysMatchingQuery: DBObject = $()
val NeverMatchingQuery: DBObject = $and($("_id" -> 1), $("_id" -> -1))
def AlwaysMatchingTypesafeQuery[ObjectType] = new TypesafeMongoQuery[ObjectType] { override def build(): DBObject = AlwaysMatchingQuery }
def NeverMatchingTypesafeQuery[ObjectType] = new TypesafeMongoQuery[ObjectType] { override def build(): DBObject = NeverMatchingQuery }
def and[ObjectType](queries: TypesafeMongoQuery[ObjectType]*) = TypesafeMongoQueryConjunction(queries)
def or[ObjectType](queries: TypesafeMongoQuery[ObjectType]*) = TypesafeMongoQueryDisjunction(queries)
// TODO maybe define Scalaz Monoids
def foldAnd[ObjectType](queries: Seq[TypesafeMongoQuery[ObjectType]]) = {
queries.foldLeft(AlwaysMatchingTypesafeQuery[ObjectType]) { (currentQuery, queryInList) =>
TypesafeMongoQuery.and(currentQuery, queryInList)
}
}
def foldOr[ObjectType](base: TypesafeMongoQuery[ObjectType], queries: Seq[TypesafeMongoQuery[ObjectType]]) = {
queries.foldLeft(NeverMatchingTypesafeQuery[ObjectType]) { (currentQuery, queryInList) =>
TypesafeMongoQuery.or(currentQuery, queryInList)
}
}
}
It works fine, except I'm not satisfied with these lines:
def AlwaysMatchingTypesafeQuery[ObjectType] = new TypesafeMongoQuery[ObjectType] { override def build(): DBObject = AlwaysMatchingQuery }
def NeverMatchingTypesafeQuery[ObjectType] = new TypesafeMongoQuery[ObjectType] { override def build(): DBObject = NeverMatchingQuery }
I think It would be possible to not create a new instance of these 2 objects for each folding operation, but rather using a val / singleton of type TypesafeMongoQuery[Nothing] since the underlying DBObject being built would always be the same.
I've tried some things, like replacing my signatures everywhere by [ObjectType,T <% ObjectType] but with no great success.
Any idea on how to solve my problem?
Can you make ObjectType covariant?
trait TypesafeQuery[+ObjectType, BuildType] {
def build: BuildType
}
trait TypesafeMongoQuery[+ObjectType] extends TypesafeQuery[ObjectType, DBObject]
object AlwaysMatchingTypesafeQuery extends TypesafeMongoQuery[Nothing] { override def build(): DBObject = AlwaysMatchingQuery }
object NeverMatchingTypesafeQuery extends TypesafeMongoQuery[Nothing] { override def build(): DBObject = NeverMatchingQuery }
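With that in place, the folds can start from the shared singletons instead of creating a new instance per call; here is a sketch, under the assumption that the covariant hierarchy above compiles against the rest of the original code:
// Sketch: with covariance, TypesafeMongoQuery[Nothing] conforms to
// TypesafeMongoQuery[ObjectType] for any ObjectType, so one singleton suffices.
def foldAnd[ObjectType](queries: Seq[TypesafeMongoQuery[ObjectType]]): TypesafeMongoQuery[ObjectType] =
  queries.foldLeft(AlwaysMatchingTypesafeQuery: TypesafeMongoQuery[ObjectType]) { (acc, q) =>
    TypesafeMongoQuery.and(acc, q)
  }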