I'm trying to change my Flink runner code so that it reads data from several Kafka topics and writes it to different HDFS folders accordingly, without joining. I have a lot of Java and Scala generic methods and generic object instantiations inside the main process method, plus reflection.
It works correctly with one Avro schema, but when I try to support an unknown number of Avro schemas I run into problems with the generics and reflection constructions.
How can I resolve this? What design pattern could help me?
The model (Avro schema) is in Java classes.
public enum Types implements MessageType {
    RECORD_1("record1", "01", Record1.getClassSchema(), Record1.class),
    RECORD_2("record2", "02", Record2.getClassSchema(), Record2.class);

    private final String topicName;
    private final String dataType;
    private final Schema schema;
    private final Class<? extends SpecificRecordBase> clazz;

    // constructor and getters (including getClassType(), which returns clazz) omitted
}
public class Record1 extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord
{
public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse("???");
public static org.apache.avro.Schema getClassSchema() { return SCHEMA$; }
... }
public class Record2 ... // the second schema class, analogous to Record1
The process trait with main process methods.
import org.apache.avro.specific.SpecificRecordBase
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.fs.Writer
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink
import tests.{Record1, Record2, Types}
import scala.reflect.ClassTag
trait Converter[T] extends Serializable {
def convertToModel(message: KafkaSourceType): T
}
trait FlinkRunner extends Serializable {
val kafkaTopicToModelMapping: Map[String, Class[_ <: SpecificRecordBase]] =
Map(
"record_1" -> Types.RECORD_1.getClassType,
"record_2" -> Types.RECORD_2.getClassType
)
def buildAvroSink1(path: String, writer1: Writer[Record1]): BucketingSink[Record1] = ???
def buildAvroSink2(path: String, writer2: Writer[Record2]): BucketingSink[Record2] = ???
def process(topicList: List[String], env: StreamExecutionEnvironment): Unit = {
// producer kafka source building
val clazz1: Class[Record1] = ClassTag(kafkaTopicToModelMapping(topicList.head)).runtimeClass.asInstanceOf[Class[Record1]]
val clazz2: Class[Record2] = ClassTag(kafkaTopicToModelMapping(topicList.tail.head)).runtimeClass.asInstanceOf[Class[Record2]]
// How to make a list of classes from these? val clazzes: List[Class[???]] = ???
val avroTypeInfo1: TypeInformation[Record1] = TypeInformation.of(clazz1)
val avroTypeInfo2: TypeInformation[Record2] = TypeInformation.of(clazz2)
// How to make a list of type infos from these? val avroTypeInfos = ???
val stream: DataStream[KafkaSourceType] = ???
// consumer avro paths building
val converter1: Converter[Record1] = new Converter[Record1] {
override def convertToModel(message: KafkaSourceType): Record1 = deserializeAvro[Record1](message.value)
}
val converter2: Converter[Record2] = new Converter[Record2] {
override def convertToModel(message: KafkaSourceType): Record2 = deserializeAvro[Record2](message.value)
}
// How to make a list of converters from these?
val outputResultStream1 = stream
.filter(_.topic == topicList.head)
.map(record => converter1.convertToModel(record))(avroTypeInfo1)
val outputResultStream2 = stream
.filter(_.topic == topicList.tail.head)
.map(record => converter2.convertToModel(record))(avroTypeInfo2)
val writer1 = new AvroSinkWriter[Record1](???)
val writer2 = new AvroSinkWriter[Record2](???)
// add sink and start process
}
}
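For reference, a minimal sketch of the list-building those inline comments ask about, reusing topicList, kafkaTopicToModelMapping and the imports from the trait above. Note that the element types collapse to the SpecificRecordBase bound, which is why the answer below suggests one topology per topic instead:

// Hedged sketch: collect the per-topic classes and type infos into lists.
// The concrete record types are only known as "_ <: SpecificRecordBase" at compile time.
val clazzes: List[Class[_ <: SpecificRecordBase]] =
  topicList.map(kafkaTopicToModelMapping)
val avroTypeInfos: List[TypeInformation[_ <: SpecificRecordBase]] =
  clazzes.map(clazz => TypeInformation.of(clazz))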
AS IS
There are several different topics in Kafka. The Kafka version is 10.2, without Confluent. Every Kafka topic works with only one Avro schema class, written in Java.
A single Flink job (written in Scala) reads only one topic, converts it with one Avro schema and writes the data to only one folder in HDFS. The name, path and output folder name come from the config.
For example there are 3 job flows with parameters:
First Job Flow
--brokersAdress …
--topic record1
--folderName folder1
--avroClassName Record1
--output C:/….
--jobName SingleTopic1
--number_of_parallel 2
--number_of_task 1
--mainClass Runner
….
Second Job Flow
--brokersAdress …
--topic record2
--folderName folder2
--avroClassName Record2
--output C:/….
--jobName SingleTopic2
--number_of_parallel 2
--number_of_task 1
--mainClass Runner
….
Third Job Flow
…
TO BE
One Flink job can read more than one Kafka topic, convert each with a different Avro schema and write the data to different folders without joining.
For example, I could start only one job flow which would do the same work:
--brokersAdress …
--topic record1, record2, record3
--folderName folder1, folder2,
--avroClassName Record1, Record2
--output C:/….
--jobName MultipleTopics
--number_of_parallel 3
--number_of_task 3
--mainClass Runner
...
OK, thank you. I have several questions about code organization:
1) How can I generalize the variables and the parameters of the method (called process) so that I can initialize a List of several classes inherited from SpecificRecordBase, if that is possible at all?
val clazz1: Class[Record1] = ClassTag(kafkaTopicToModelMapping(topicList.head)).runtimeClass.asInstanceOf[Class[Record1]]
val clazz2: Class[Record2] = ClassTag(kafkaTopicToModelMapping(topicList.tail.head)).runtimeClass.asInstanceOf[Class[Record2]]
2) The same question applies to avroTypeInfo1, avroTypeInfo2, ..., converter1, converter2, ..., buildAvroSink1, buildAvroSink2, ...
I also have questions about the architecture. I tried to execute this code and Flink worked correctly with the different topics and Avro schema classes.
Which Flink tools can help me route the different Avro schema classes to several output streams and add sinks for them?
Do you have code examples for this?
And what could I use instead of Flink to solve the problem of generating several Avro files from different Kafka topics? Perhaps Confluent?
I'm a bit lost on your motivation. The general idea is: if you want a generic approach, go with GenericRecord. If you have specific code for the different types, go with SpecificRecord, but then don't wrap generic code around it.
Further, if you don't need to, do your best not to mix different events in the same topic/topology. Rather, spawn a separate topology in your same main for each subtype:
def createTopology[T <: SpecificRecordBase](topic: String, clazz: Class[T]): Unit = {
  implicit val typeInfo: TypeInformation[T] = TypeInformation.of(clazz)
  val stream: DataStream[T] =
    env.addSource(new FlinkKafkaConsumer[T](topic, AvroDeserializationSchema.forSpecific(clazz), properties))
  stream.addSink(StreamingFileSink.forBulkFormat(
    Path.fromLocalFile(folder),
    ParquetAvroWriters.forSpecificRecord(clazz)).build())
}
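And a minimal usage sketch, assuming env, folder and properties are in scope as above, and reusing topicList and kafkaTopicToModelMapping from the question:

// One topology per topic; the concrete record class comes from the topic-to-model map.
topicList.foreach { topic =>
  createTopology(topic, kafkaTopicToModelMapping(topic))
}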
Related
I would like to find a Spark custom data source implementation which itself simply selects from and returns some existing data source implementation, based on dynamic configuration. For example, given an arbitrary configuration key "MyDataSource", in one case it may return a Parquet data source, and in another case it may return an Avro data source, depending on the configuration files at runtime.
Has anybody already done this?
You don't need to implement a data source; you can put this logic in the driver to choose the format to read. For example:
val spark: SparkSession = ...
trait Source
case class Parquet(path: String) extends Source
case class Jdbc(url: String, port: Int) extends Source
def readSource(s: Source)(spark: SparkSession): DataFrame = s match {
case Parquet(path) => spark.read.parquet(path)
case Jdbc(url, port) => ...
}
val aSourceFromConfig: Source = ...
val df: DataFrame = readSource(aSourceFromConfig)(spark)
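To cover the "based on dynamic configuration" part of the question, a small hedged sketch of deriving the Source from plain config values (the config keys below are purely illustrative):

// Hypothetical config keys ("format", "path", "url", "port"); adapt to your configuration mechanism.
def sourceFromConfig(conf: Map[String, String]): Source =
  conf("format") match {
    case "parquet" => Parquet(conf("path"))
    case "jdbc"    => Jdbc(conf("url"), conf("port").toInt)
  }

val df2: DataFrame = readSource(sourceFromConfig(Map("format" -> "parquet", "path" -> "/data/in")))(spark)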
I have a persistent actor which can receive one type of command, Persist(event), where event is of the trait type Event (there are numerous implementations of it). On success, it responds with Persisted(event) to the sender.
The event itself is serializable since this is the data we store in the persistence storage, and the serialization is implemented with a custom serializer which internally uses classes generated from Google protobuf .proto files. This custom serializer is configured in application.conf and bound to the base trait Event. This is already working fine.
Note: The implementations of Event are not classes generated by protobuf. They are normal scala classes and they have a protobuf equivalent of them too, but that's mapped through the custom serializer that's bound to the base Event type. This was done by my predecessors for versioning (which probably isn't required because this can be handled with plain protobuf classes + custom serialization combi too, but that's a different matter) and I don't wish to change that atm.
We're now trying to implement cluster sharding for this actor which also means that my commands (viz. Persist and Persisted) also need to be serializable since they may be forwarded to other nodes.
This is the domain model :
sealed trait PersistenceCommand {
def event: Event
}
final case class Persisted(event: Event) extends PersistenceCommand
final case class Persist(event: Event) extends PersistenceCommand
The problem is, I do not see an elegant way to make them serializable. These are the options I have considered:
Approach 1. Define a new proto file for Persist and Persisted, but what do I use as the datatype for event? I didn't find a way to define something like this:
message Persist {
"com.example.Event" event = 1 // this doesn't work
}
Such that I can use the existing Scala trait Event as a data type. If this works, I guess (it's far-fetched though) I could bind the generated code (after compiling this proto file) to Akka's built-in serializer for Google protobuf and it might work. The note above explains why I cannot use the oneof construct in my proto file.
Approach 2. This is what I've implemented and it works (but I don't like it).
Basically, I wrote a new serializer for the commands and delegated serialization and deserialization of the event part of the command to the existing serializer.
class PersistenceCommandSerializer extends SerializerWithStringManifest {
val eventSerializer: ManifestAwareEventSerializer = new ManifestAwareEventSerializer()
val PersistManifest = Persist.getClass.getName
val PersistedManifest = Persisted.getClass.getName
val Separator = "~"
override def identifier: Int = 808653986
override def manifest(o: AnyRef): String = o match {
case Persist(event) => s"$PersistManifest$Separator${eventSerializer.manifest(event)}"
case Persisted(event) => s"$PersistedManifest$Separator${eventSerializer.manifest(event)}"
}
override def toBinary(o: AnyRef): Array[Byte] = o match {
case command: PersistenceCommand => eventSerializer.toBinary(command.event)
}
override def fromBinary(bytes: Array[Byte], manifest: String): AnyRef = {
val (commandManifest, dataManifest) = splitIntoCommandAndDataManifests(manifest)
val event = eventSerializer.fromBinary(bytes, dataManifest).asInstanceOf[Event]
commandManifest match {
case PersistManifest =>
Persist(event)
case PersistedManifest =>
Persisted(event)
}
}
private def splitIntoCommandAndDataManifests(manifest: String) = {
val commandAndDataManifests = manifest.split(Separator)
(commandAndDataManifests(0), commandAndDataManifests(1))
}
}
The problem with this approach is what I'm doing in def manifest and def fromBinary. I had to make sure I have both the command's manifest and the event's manifest while serializing and deserializing. Hence I had to use ~ as a separator, which is effectively my own custom encoding for the manifest information.
Is there a better, or perhaps the right, way to implement this?
For context: I'm using ScalaPB for generating scala classes from .proto files and classic akka actors.
Any kind of guidance is hugely appreciated!
If you delegate serialization of the nested object to whichever serializer you have configured, the nested field should contain the bytes of the serialized data, but also an int32 with the id of the used serializer and bytes for the message manifest. This ensures that you will be able to version/replace the nested serializers, which is important for data that will be stored for a longer time period.
You can see how this is done internally in Akka for our own wire formats here: https://github.com/akka/akka/blob/6bf20f4117a8c27f8bd412228424caafe76a89eb/akka-remote/src/main/protobuf/WireFormats.proto#L48 and here https://github.com/akka/akka/blob/6bf20f4117a8c27f8bd412228424caafe76a89eb/akka-remote/src/main/scala/akka/remote/MessageSerializer.scala#L45
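In Scala that delegation could look roughly like the following sketch. PersistEnvelope is a hypothetical ScalaPB-generated message (bytes payload, int32 serializerId, string eventManifest); the Akka calls are the standard SerializationExtension/Serializers APIs:

import akka.actor.ActorSystem
import akka.serialization.{SerializationExtension, Serializers}

// PersistEnvelope is hypothetical, generated by ScalaPB from a proto with:
//   bytes payload = 1; int32 serializerId = 2; string eventManifest = 3;
def toEnvelope(event: Event)(implicit system: ActorSystem): PersistEnvelope = {
  val serialization = SerializationExtension(system)
  val serializer    = serialization.findSerializerFor(event)
  PersistEnvelope(
    payload       = com.google.protobuf.ByteString.copyFrom(serializer.toBinary(event)),
    serializerId  = serializer.identifier,
    eventManifest = Serializers.manifestFor(serializer, event)
  )
}

def fromEnvelope(env: PersistEnvelope)(implicit system: ActorSystem): Event =
  SerializationExtension(system)
    .deserialize(env.payload.toByteArray, env.serializerId, env.eventManifest)
    .get
    .asInstanceOf[Event]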
I implemented a ProcessFunction that uses a Guava cache to filter a stream of incoming events. The code looks like this :
object myJob {
  private def updateCache(cache: Cache[String, String], someValue: String): Unit = {}
  private def getCacheValue(cache: Cache[String, String], someKey: String): String = ???

  def run(params: ParameterTool, executionEnv: StreamExecutionEnvironment): Unit = {
    val inputStream: DataStream[String] = ??? // built from executionEnv and params
    // the cache is created here, on the client, and captured by the ProcessFunction closure below
    val c: Cache[String, String] = CacheBuilder.newBuilder().build[String, String]()
    val outStream = inputStream.process(new ProcessFunction[String, String] {
      override def processElement(value: String,
                                  ctx: ProcessFunction[String, String]#Context,
                                  out: Collector[String]): Unit = {
        updateCache(c, value)
        out.collect(getCacheValue(c, value))
      }
    })
  }
}
On submitting the job to Flink, I'm getting the following error :
Caused by: org.apache.flink.api.common.InvalidProgramException: The implementation of the ProcessFunction is not serializable. The object probably contains or references non serializable fields.
at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:99)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.clean(StreamExecutionEnvironment.java:1560)
at org.apache.flink.streaming.api.datastream.DataStream.clean(DataStream.java:185)
at org.apache.flink.streaming.api.datastream.DataStream.process(DataStream.java:666)
at org.apache.flink.streaming.api.scala.DataStream.process(DataStream.scala:686)
Any idea of what I'm doing wrong? How can I resolve this serialization error?
The error basically says that you are depending on an object that is not serializable for Flink. In the case you have shown, marking the field holding the cache as lazy should fix the issue:
lazy val c = CacheBuilder.newBuilder()
Generally, in such cases you should refer to Flink's documentation, which explains the issue.
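A closely related variant, sketched under the assumption that a per-task cache is acceptable: keep the cache as a @transient lazy field of the function class itself, so it is built on each task manager after deserialization rather than shipped with the closure. The key/value types and the filtering logic below are illustrative.

import com.google.common.cache.{Cache, CacheBuilder}
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.util.Collector

class CacheFilter extends ProcessFunction[String, String] {
  // Built lazily on the worker; @transient keeps it out of the serialized closure.
  @transient private lazy val cache: Cache[String, String] =
    CacheBuilder.newBuilder().maximumSize(10000).build[String, String]()

  override def processElement(value: String,
                              ctx: ProcessFunction[String, String]#Context,
                              out: Collector[String]): Unit = {
    if (cache.getIfPresent(value) == null) { // emit only values not seen recently
      cache.put(value, value)
      out.collect(value)
    }
  }
}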
We've got a Flink job written in Scala using case classes (generated from avsc files by avrohugger) to represent our state. We would like to use Avro for serialising our state so that state migration will work when we update our models. We understood that since Flink 1.7 Avro serialization is supported OOTB. We added the flink-avro module to the classpath, but when restoring from a saved snapshot we notice that it's still trying to use Kryo serialization. Relevant code snippet:
case class Foo(id: String, timestamp: java.time.Instant)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val conf = env.getConfig
conf.disableForceKryo()
conf.enableForceAvro()
val rawDataStream: DataStream[String] = env.addSource(MyFlinkKafkaConsumer)
val parsedDataStream: DataStream[Foo] = rawDataStream.flatMap(new JsonParser[Foo])
// do something useful with it
env.execute("my-job")
When performing a state migration on Foo (e.g. by adding a field and deploying the job) I see that it tries to deserialize using Kryo, which obviously fails. How can I make sure Avro serialization is being used?
UPDATE
Found out about https://issues.apache.org/jira/browse/FLINK-10897, so POJO state serialization with Avro is only supported from 1.8 AFAIK. I tried it using the latest RC of 1.8 with a simple WordCount POJO that extends SpecificRecord:
/** MACHINE-GENERATED FROM AVRO SCHEMA. DO NOT EDIT DIRECTLY */
import scala.annotation.switch
case class WordWithCount(var word: String, var count: Long) extends
org.apache.avro.specific.SpecificRecordBase {
def this() = this("", 0L)
def get(field$: Int): AnyRef = {
(field$: #switch) match {
case 0 => {
word
}.asInstanceOf[AnyRef]
case 1 => {
count
}.asInstanceOf[AnyRef]
case _ => new org.apache.avro.AvroRuntimeException("Bad index")
}
}
def put(field$: Int, value: Any): Unit = {
(field$: #switch) match {
case 0 => this.word = {
value.toString
}.asInstanceOf[String]
case 1 => this.count = {
value
}.asInstanceOf[Long]
case _ => new org.apache.avro.AvroRuntimeException("Bad index")
}
()
}
def getSchema: org.apache.avro.Schema = WordWithCount.SCHEMA$
}
object WordWithCount {
  val SCHEMA$ = new org.apache.avro.Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"WordWithCount\",\"fields\":" +
    "[{\"name\":\"word\",\"type\":\"string\"}," +
    "{\"name\":\"count\",\"type\":\"long\"}]}")
}
This, however, also didn’t work out of the box. We then tried to define our own type information using flink-avro’s AvroTypeInfo but this fails because Avro looks for a SCHEMA$ property (SpecificData:285) in the class and is unable to use Java reflection to identify the SCHEMA$ in the Scala companion object.
I could never get reflection to work due to Scala's fields being private under the hood. AFAIK the only solution is to update Flink to use avro's non-reflection-based constructors in AvroInputFormat (compare).
In a pinch, other than Java, one could fall back to Avro's GenericRecord, and maybe use avro4s to generate them from avrohugger's Standard format (note that avro4s will generate its own schema from the generated Scala types).
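A minimal sketch of that GenericRecord fallback via avro4s's RecordFormat (the exact API can differ between avro4s versions; the case class is just illustrative):

import com.sksamuel.avro4s.RecordFormat
import org.apache.avro.generic.GenericRecord

case class WordWithCount(word: String, count: Long) // plain case class, not the generated one

val format: RecordFormat[WordWithCount] = RecordFormat[WordWithCount]
val record: GenericRecord = format.to(WordWithCount("flink", 1L)) // case class -> GenericRecord
val back: WordWithCount   = format.from(record)                   // GenericRecord -> case class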
I want to share a HashMap across every node in Flink and allow the nodes to update that HashMap. I have this code so far:
object ParallelStreams {
val env = StreamExecutionEnvironment.getExecutionEnvironment
//Is there a way to attach a HashMap to this config variable?
val config = new Configuration()
config.setClass("HashMap", Class[CustomGlobal])
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
class CustomGlobal extends ExecutionConfig.GlobalJobParameters {
override def toMap: util.Map[String, String] = {
new HashMap[String, String]()
}
}
class MyCoMap extends RichCoMapFunction[String, String, String] {
var users: HashMap[String, String] = null
//How do I access the HashMap I attached to the global config here?
override def open(parameters: Configuration): Unit = {
super.open(parameters)
val globalParams = getRuntimeContext.getExecutionConfig.getGlobalJobParameters
val globalConf = globalParams.asInstanceOf[Configuration]
val hashMap = globalConf.getClass
}
//Other functions to override here
}
}
I was wondering if you can attach a custom object to the config variable created here: val config = new Configuration()? (Please see the comments in the code above.)
I noticed you can only attach primitive values. I created a custom class that extends ExecutionConfig.GlobalJobParameters and attached that class by doing config.setClass("HashMap", classOf[CustomGlobal]), but I am not sure if that is how you are supposed to do it.
The common way to distribute parameters to operators is to have them as regular member variables in the function class. The function object that is created and assigned during plan construction is serialized and shipped to all workers. So you don't have to pass parameters via a configuration.
This would look as follows
class MyMapper(map: java.util.HashMap[String, String]) extends MapFunction[String, String] {
  // class definition, e.g. override def map(value: String): String = ...
}
val inStream: DataStream[String] = ???
val myHashMap: java.util.HashMap[String, String] = ???
val myMapper: MyMapper = new MyMapper(myHashMap)
val mappedStream: DataStream[String] = inStream.map(myMapper)
The myMapper object is serialized (using Java serialization) and shipped for execution. So the type of map must implement the Java Serializable interface.
EDIT: I missed the part that you want the map to be updatable from all parallel tasks. That is not possible with Flink. You would have to either fully replicate the map and all updates (by broadcasting) or use an external system (a key-value store) for that. A sketch of the broadcasting option follows.
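For the broadcasting option, a hedged sketch using Flink's broadcast state (available since Flink 1.5); the update stream and the key/value types are illustrative:

import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Descriptor for the replicated map; every parallel task sees the same broadcast state.
val mapDescriptor = new MapStateDescriptor[String, String](
  "shared-map", BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO)

val updates: DataStream[(String, String)] = ??? // stream of key/value updates to the map
val events: DataStream[String] = ???            // main data stream

val enriched: DataStream[String] = events
  .connect(updates.broadcast(mapDescriptor))
  .process(new BroadcastProcessFunction[String, (String, String), String] {
    override def processElement(
        value: String,
        ctx: BroadcastProcessFunction[String, (String, String), String]#ReadOnlyContext,
        out: Collector[String]): Unit = {
      // read-only view of the replicated map on this task
      out.collect(Option(ctx.getBroadcastState(mapDescriptor).get(value)).getOrElse(value))
    }
    override def processBroadcastElement(
        update: (String, String),
        ctx: BroadcastProcessFunction[String, (String, String), String]#Context,
        out: Collector[String]): Unit = {
      // every task receives every update and applies it to its copy of the state
      ctx.getBroadcastState(mapDescriptor).put(update._1, update._2)
    }
  })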