Apache Flink - Case class to Json for Kafka producer - scala

I have a case class in Apache Flink with Scala, and I want to turn it into JSON to send it to a Kafka topic.
It comes from a DataStream that went through a RichMapFunction.
What would be the best strategy?
Example:
case class MySchema(
  someObj1: Some[String],
  someObj2: Some[String],
  someObj3: Some[String],
  obj1: String,
  obj2: String,
  obj3: String,
  intObj1: Int,
  intObj2: Int
)
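
One possible strategy, sketched below under a few assumptions: encode the case class to a JSON string inside the pipeline with a JSON library such as circe, then hand the resulting DataStream[String] to Flink's KafkaSink with a plain string serializer. The object name, broker address, and topic name are placeholders, the case class is trimmed to three fields for brevity, and the Some[String] fields are modelled as Option[String], which is the more idiomatic type for optional values:

import io.circe.generic.auto._ // derives Encoder[MySchema] at compile time
import io.circe.syntax._       // provides .asJson
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.connector.kafka.sink.{KafkaRecordSerializationSchema, KafkaSink}
import org.apache.flink.streaming.api.scala._

case class MySchema(
  someObj1: Option[String], // Option[...] instead of Some[...] so absent values serialize cleanly
  obj1: String,
  intObj1: Int
)

object CaseClassToJsonJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Stand-in for the DataStream coming out of your RichMapFunction.
    val records: DataStream[MySchema] =
      env.fromElements(MySchema(Some("a"), "b", 1))

    // Turn each case class instance into a compact JSON string.
    val json: DataStream[String] = records.map(_.asJson.noSpaces)

    // Placeholder broker list and topic name.
    val sink = KafkaSink.builder[String]()
      .setBootstrapServers("localhost:9092")
      .setRecordSerializer(
        KafkaRecordSerializationSchema.builder[String]()
          .setTopic("my-topic")
          .setValueSerializationSchema(new SimpleStringSchema())
          .build()
      )
      .build()

    json.sinkTo(sink)
    env.execute("case-class-to-json")
  }
}

Any other JSON library (json4s, jackson-module-scala, play-json) works the same way; the only Flink-specific part is mapping to String before the sink.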

Related

Deserialize Protobuf kafka messages with Flink

I am trying to read and print Protobuf message from Kafka using Apache Flink.
I followed the official docs with no success: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/serialization/third_party_serializers/
The Flink consumer code is:
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI()
  env.enableCheckpointing(5000)
  env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
  env.getCheckpointConfig.setCheckpointStorage(s"$targetPath/checkpoints")
  env.getConfig.registerTypeWithKryoSerializer(classOf[User], classOf[ProtobufSerializer])

  val source = KafkaSource.builder[User]
    .setBootstrapServers(brokers)
    .setTopics(topic)
    .setGroupId(consumerGroupId)
    .setValueOnlyDeserializer(new ProtoDeserializer())
    .setStartingOffsets(OffsetsInitializer.earliest)
    .build

  val stream = env.fromSource(source, WatermarkStrategy.noWatermarks[User], kafkaTableName)
  stream.print()
  env.execute()
}
The deserializer code is:
class ProtoDeserializer extends DeserializationSchema[User] {
  override def getProducedType: TypeInformation[User] = null
  override def deserialize(message: Array[Byte]): User = User.parseFrom(message)
  override def isEndOfStream(nextElement: User): Boolean = false
}
I get the following error when the job is executed:
Protocol message contained an invalid tag (zero).
It's important to mention that I manage to read and deserialize the messages successfully using the Confluent protobuf consumer, so it seems the messages are not corrupted.
The Confluent protobuf serializer doesn't produce content that can be directly deserialized by other deserializers. The format is described in Confluent's documentation: it starts with a magic byte (which is always zero), followed by a four-byte schema ID. The protobuf payload follows, starting at byte 5.
The getProducedType method should return appropriate TypeInformation, in this case TypeInformation.of(classOf[User]). Without this you may run into problems at runtime.
Deserializers used with KafkaSource don't need to implement isEndOfStream, but it won't hurt anything.
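
Putting both points together, a sketch of what the deserializer could look like (the five-byte header skip follows the Confluent wire format described above; User is the generated protobuf class from the question):

import org.apache.flink.api.common.serialization.DeserializationSchema
import org.apache.flink.api.common.typeinfo.TypeInformation

class ConfluentProtoDeserializer extends DeserializationSchema[User] {

  private val HeaderLength = 5 // magic byte (always 0) + 4-byte schema ID

  override def deserialize(message: Array[Byte]): User = {
    // The actual protobuf bytes start after the Confluent header.
    val payload = java.util.Arrays.copyOfRange(message, HeaderLength, message.length)
    User.parseFrom(payload)
  }

  override def isEndOfStream(nextElement: User): Boolean = false

  // Report real type information instead of null.
  override def getProducedType: TypeInformation[User] = TypeInformation.of(classOf[User])
}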

How to implement custom serializer

I'm using Scala 2.12, implementing some producers and consumers with this library:
"org.apache.kafka" % "kafka-clients" % "2.4.1"
For both key and value I'm using classOf[StringDeserializer].
Let's say every message is a json string of a case class like this:
case class Person(name: String, age: Int, id: UUID)
So value in every message would be something like this:
{"name":"Joe", "age": ...}
How can I write a custom serializer for this?
How can I write a custom serializer
You'd implement the interface...
class MySerializer extends Serializer[Person] {
  override def serialize(topic: String, p: Person): Array[Byte] = { ... }
}
Let's say every message is a json string
Kafka already ships a JsonSerializer class (in the Connect JSON module) you can use
Or you can use the ones provided by Confluent or Spring-Kafka.
Otherwise, StringSerializer will work fine if you pre-serialize the data before producing
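
As a rough sketch (circe is just one possible JSON library here; the Person class comes from the question), a matching serializer/deserializer pair might look like this:

import java.nio.charset.StandardCharsets
import java.util.UUID

import io.circe.generic.auto._ // derives Encoder[Person] and Decoder[Person]
import io.circe.parser.decode
import io.circe.syntax._
import org.apache.kafka.common.serialization.{Deserializer, Serializer}

case class Person(name: String, age: Int, id: UUID)

// Serializes a Person to its JSON representation as UTF-8 bytes.
class PersonSerializer extends Serializer[Person] {
  override def serialize(topic: String, data: Person): Array[Byte] =
    if (data == null) null
    else data.asJson.noSpaces.getBytes(StandardCharsets.UTF_8)
}

// Counterpart for the consumer side.
class PersonDeserializer extends Deserializer[Person] {
  override def deserialize(topic: String, bytes: Array[Byte]): Person =
    if (bytes == null) null
    else decode[Person](new String(bytes, StandardCharsets.UTF_8))
      .fold(err => throw new org.apache.kafka.common.errors.SerializationException(err), identity)
}

You would then pass these classes via the value.serializer / value.deserializer properties (or the corresponding constructor arguments) instead of StringSerializer.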

How to use Avro serialization for scala case classes with Flink 1.7?

We've got a Flink job written in Scala using case classes (generated from avsc files by avrohugger) to represent our state. We would like to use Avro to serialize our state so that state migration works when we update our models. We understood that since Flink 1.7, Avro serialization is supported out of the box. We added the flink-avro module to the classpath, but when restoring from a saved snapshot we notice that it's still trying to use Kryo serialization.
Relevant code snippet:
case class Foo(id: String, timestamp: java.time.Instant)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val conf = env.getConfig
conf.disableForceKryo()
conf.enableForceAvro()
val rawDataStream: DataStream[String] = env.addSource(MyFlinkKafkaConsumer)
val parsedDataSteam: DataStream[Foo] = rawDataStream.flatMap(new JsonParser[Foo])
// do something useful with it
env.execute("my-job")
When performing a state migration on Foo (e.g. by adding a field and deploying the job) I see that it tries to deserialize using Kryo, which obviously fails. How can I make sure Avro serialization is being used?
UPDATE
Found out about https://issues.apache.org/jira/browse/FLINK-10897, so POJO state serialization with Avro is only supported from 1.8 afaik. I tried it using the latest RC of 1.8 with a simple WordCount POJO that extends from SpecificRecord:
/** MACHINE-GENERATED FROM AVRO SCHEMA. DO NOT EDIT DIRECTLY */
import scala.annotation.switch

case class WordWithCount(var word: String, var count: Long) extends org.apache.avro.specific.SpecificRecordBase {
  def this() = this("", 0L)

  def get(field$: Int): AnyRef = {
    (field$: @switch) match {
      case 0 => word.asInstanceOf[AnyRef]
      case 1 => count.asInstanceOf[AnyRef]
      case _ => new org.apache.avro.AvroRuntimeException("Bad index")
    }
  }

  def put(field$: Int, value: Any): Unit = {
    (field$: @switch) match {
      case 0 => this.word = value.toString.asInstanceOf[String]
      case 1 => this.count = value.asInstanceOf[Long]
      case _ => new org.apache.avro.AvroRuntimeException("Bad index")
    }
    ()
  }

  def getSchema: org.apache.avro.Schema = WordWithCount.SCHEMA$
}

object WordWithCount {
  val SCHEMA$ = new org.apache.avro.Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"WordWithCount\",\"fields\":[{\"name\":\"word\",\"type\":\"string\"},{\"name\":\"count\",\"type\":\"long\"}]}")
}
This, however, also didn’t work out of the box. We then tried to define our own type information using flink-avro’s AvroTypeInfo but this fails because Avro looks for a SCHEMA$ property (SpecificData:285) in the class and is unable to use Java reflection to identify the SCHEMA$ in the Scala companion object.
I could never get reflection to work due to Scala's fields being private under the hood. AFAIK the only solution is to update Flink to use avro's non-reflection-based constructors in AvroInputFormat (compare).
In a pinch, other than Java, one could fall back to avro's GenericRecord, maybe use avro4s to generate them from avrohugger's Standard format (note that avro4s will generate its own schema from the generated Scala types).
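
For the GenericRecord fallback, a minimal sketch with avro4s (assuming avro4s-core is on the classpath and can derive a schema for these field types, which it does for primitives and java.time types in recent versions):

import com.sksamuel.avro4s.RecordFormat
import org.apache.avro.generic.GenericRecord

object GenericRecordExample {
  case class Foo(id: String, timestamp: java.time.Instant)

  // avro4s derives the Avro schema from Foo and converts to/from GenericRecord.
  val format: RecordFormat[Foo] = RecordFormat[Foo]

  def toRecord(foo: Foo): GenericRecord = format.to(foo)
  def fromRecord(record: GenericRecord): Foo = format.from(record)
}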

What is best practice to represent data by Scala case classes

For example, in the scope of a web service, case classes are used for storing and returning data from a REST API.
Sometimes different endpoints need similar classes with slight differences (for example, only one field differs).
For example:
case class Example(id: Int, value: Int)
case class ExampleWithName(id: Int, value: Int, name: String)
case class ExampleWithNameAndDate(id: Int, value: Int, name: String, createdOn: LocalDate)
I would like to ask whether creating a new case class for each return type is the best solution, or whether there is a better one, because this leads to a lot of similar code/classes.
There are several options:
1) Copypaste
case class Example (id: Int, value:Int)
case class ExampleWithName (id: Int, value:Int, name:String)
pros and cons: we all know them :)
2) Optional values
case class ExampleWithName (id: Int, value:Int, name: Option[String])
pros: no duplication
cons: if in a particular use case you expect the name to be present, this is not compile-time checked. And if you don't need the name, you still carry a redundant, noisy field.
3) Aggregation
case class Example (id: Int, value:Int)
case class ExampleWithName (example: Example, name:String)
pros: no flaws from previous approaches
cons: may require additional effort to marshal this nested structure into flat JSON (see the sketch below).
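
For completeness, that flattening can be handled with a small custom encoder; a sketch with circe (any JSON library that supports custom encoders works similarly):

import io.circe.{Encoder, Json}
import io.circe.generic.semiauto.deriveEncoder
import io.circe.syntax._

object FlatJsonExample {
  case class Example(id: Int, value: Int)
  case class ExampleWithName(example: Example, name: String)

  implicit val exampleEncoder: Encoder[Example] = deriveEncoder

  // Flatten the nested Example fields into the top-level JSON object.
  implicit val exampleWithNameEncoder: Encoder[ExampleWithName] =
    Encoder.instance { e =>
      e.example.asJson.deepMerge(Json.obj("name" -> Json.fromString(e.name)))
    }

  // ExampleWithName(Example(1, 2), "foo").asJson yields a flat object
  // like {"id":1,"value":2,"name":"foo"} instead of a nested one.
}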
It really depends on how you are going to use these case classes. From your example, one way to do it is to declare the optional fields as Option.
case class Example(id: Int, value: Int, name: Option[String], createdOn: Option[LocalDate])

Map JSON to case class for scalatest

I am trying to write test cases for the components using ScalaTest.
My application takes JSON through the REST endpoints and maps it to the case class via Akka HTTP entity mapping. Now, while writing the test cases, all I want to do is map my JSON to the case class and use the case class object without going through the REST interface.
case class Sample(
  projectName: String,
  modelName: String,
  field2: String,
  field3: FieldConf,
  field4: String,
  field5: String,
  field6: Seq[field7]
)
//FieldConf is another case class
How do I map my JSON string to this case class?
When you configured akka-http to unmarshal JSON into your case class, you had to configure some JSON library as the marshaller.
You can use the same library directly to parse and decode your case class.
For example, here's how you would do it using Circe:
import io.circe.generic.auto._ // brings a derived Decoder for your case class into scope
import io.circe.parser.decode

decode[MyCaseClass]("{...}") // returns Either[io.circe.Error, MyCaseClass]
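
In a ScalaTest spec this could look like the sketch below; the simplified Sample and FieldConf definitions are hypothetical stand-ins for the real ones, and circe's generic derivation is assumed:

import io.circe.generic.auto._ // derives Decoder instances for the case classes
import io.circe.parser.decode
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

// Hypothetical simplified stand-ins for the Sample / FieldConf classes from the question.
case class FieldConf(key: String, value: String)
case class Sample(projectName: String, modelName: String, field3: FieldConf)

class SampleDecodingSpec extends AnyFlatSpec with Matchers {

  "A JSON payload" should "decode into the Sample case class" in {
    val json =
      """{
        |  "projectName": "p1",
        |  "modelName": "m1",
        |  "field3": { "key": "k", "value": "v" }
        |}""".stripMargin

    // decode returns Either[io.circe.Error, Sample]
    decode[Sample](json) shouldBe Right(Sample("p1", "m1", FieldConf("k", "v")))
  }
}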