Spark-Streaming from an Actor - scala

I would like to have a consumer actor subscribe to a Kafka topic and stream data for further processing with Spark Streaming outside the consumer. Why an actor? Because I read that its supervisor strategy would be a great way to handle Kafka failures (e.g., restart on a failure).
I found two options:
1) The Java KafkaConsumer class: its poll() method returns plain consumer records rather than a stream. I would like a DStream to be returned, just like KafkaUtils.createDirectStream provides, and I don't know how to expose the stream outside the actor.
2) Extend the ActorHelper trait and use actorStream(), as shown in this example. The downside is that the example doesn't show a connection to a Kafka topic but to a socket.
Could anyone point me in the right direction?

For handling Kafka failures, I used the Apache Curator framework and the following workaround:
import scala.util.Try
import org.apache.curator.framework.CuratorFramework
import org.apache.curator.CuratorZookeeperClient

val client: CuratorFramework = ... // see docs
val zk: CuratorZookeeperClient = client.getZookeeperClient

/**
 * Returns false if Kafka or ZooKeeper is down.
 */
def isKafkaAvailable: Boolean =
  Try {
    if (zk.isConnected) {
      val xs = client.getChildren.forPath("/brokers/ids")
      xs.size() > 0
    } else false
  }.getOrElse(false)
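Since the question's motivation was to use an actor's supervisor strategy to recover from Kafka failures, here is a minimal, hypothetical sketch of how a check like isKafkaAvailable could drive that: the consumer actor throws when Kafka is unreachable and a parent restarts it. KafkaUnavailableException and ConsumerSupervisor are illustrative names, not part of any library.

import akka.actor.{Actor, ActorRef, OneForOneStrategy, Props, SupervisorStrategy}
import scala.concurrent.duration._

// Illustrative exception thrown by the consumer when isKafkaAvailable is false.
class KafkaUnavailableException extends RuntimeException("Kafka/ZooKeeper is unreachable")

class ConsumerSupervisor(consumerProps: Props) extends Actor {
  // Restart the child (up to 10 times per minute) when Kafka is reported down.
  override val supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case _: KafkaUnavailableException => SupervisorStrategy.Restart
    }

  val consumer: ActorRef = context.actorOf(consumerProps, "kafka-consumer")

  override def receive: Receive = {
    case msg => consumer.forward(msg)
  }
}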
For consuming Kafka topics, I used the com.softwaremill.reactivekafka library. For example:
import akka.actor.Actor
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import com.softwaremill.react.kafka.{ConsumerProperties, ReactiveKafka}
import org.apache.kafka.clients.consumer.ConsumerRecord

class KafkaConsumerActor extends Actor {
  // Materializer needed to run the stream, bound to the actor's system.
  implicit val materializer: ActorMaterializer = ActorMaterializer()(context.system)

  val kafka = new ReactiveKafka()
  val config: ConsumerProperties[Array[Byte], Any] = ... // see docs

  override def preStart(): Unit = {
    super.preStart()
    val publisher = kafka.consume(config)
    Source.fromPublisher(publisher)
      .map(handleKafkaRecord)
      .to(Sink.ignore).run()
  }

  /**
   * Invoked for every Kafka record that arrives.
   */
  def handleKafkaRecord(r: ConsumerRecord[Array[Byte], Any]) = {
    // handle the record
  }
}
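For completeness, a brief usage sketch showing how the consumer actor could be started; the system and actor names are illustrative.

import akka.actor.{ActorSystem, Props}

val system = ActorSystem("kafka-consumers")
val consumerActor = system.actorOf(Props[KafkaConsumerActor], name = "kafka-consumer-actor")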

Related

Deserialize Protobuf kafka messages with Flink

I am trying to read and print Protobuf message from Kafka using Apache Flink.
I followed the official docs with no success: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/serialization/third_party_serializers/
The Flink consumer code is:
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI()
  env.enableCheckpointing(5000)
  env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
  env.getCheckpointConfig.setCheckpointStorage(s"$targetPath/checkpoints")
  env.getConfig.registerTypeWithKryoSerializer(classOf[User], classOf[ProtobufSerializer])

  val source = KafkaSource.builder[User]
    .setBootstrapServers(brokers)
    .setTopics(topic)
    .setGroupId(consumerGroupId)
    .setValueOnlyDeserializer(new ProtoDeserializer())
    .setStartingOffsets(OffsetsInitializer.earliest)
    .build

  val stream = env.fromSource(source, WatermarkStrategy.noWatermarks[User], kafkaTableName)
  stream.print()
  env.execute()
}
The deserializer code is:
class ProtoDeserializer extends DeserializationSchema[User] {
  override def getProducedType: TypeInformation[User] = null
  override def deserialize(message: Array[Byte]): User = User.parseFrom(message)
  override def isEndOfStream(nextElement: User): Boolean = false
}
I get the following error when the streamer is executed:
Protocol message contained an invalid tag (zero).
It's important to mention that I can read and deserialize the messages successfully using the Confluent protobuf consumer, so the messages don't seem to be corrupted.
The Confluent protobuf serializer doesn't produce content that can be directly deserialized by other deserializers. The wire format is described in Confluent's documentation: it starts with a magic byte (always zero), followed by a four-byte schema ID; the protobuf payload follows after this header.
The getProducedType method should return appropriate TypeInformation, in this case TypeInformation.of(classOf[User]). Without this you may run into problems at runtime.
Deserializers used with KafkaSource don't need to implement isEndOfStream, but implementing it doesn't hurt anything.
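Putting the advice above together, a minimal sketch of the corrected deserializer, assuming the topic carries plain protobuf payloads (no Confluent framing) and that User is the generated protobuf class from the question:

import org.apache.flink.api.common.serialization.DeserializationSchema
import org.apache.flink.api.common.typeinfo.TypeInformation

class ProtoDeserializer extends DeserializationSchema[User] {
  // Return real type information instead of null.
  override def getProducedType: TypeInformation[User] = TypeInformation.of(classOf[User])

  override def deserialize(message: Array[Byte]): User = User.parseFrom(message)

  // Not used by KafkaSource, but required by the interface.
  override def isEndOfStream(nextElement: User): Boolean = false
}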

Read data from several Kafka topics (generic list class design)

I'm trying to change the Flink runner code so that it reads data from several Kafka topics and writes them to different HDFS folders accordingly, without joining. I have a lot of Java and Scala generic methods, generic object instantiations and reflection inside the main process method.
It works correctly with one Avro schema, but as soon as I try to support an unknown number of Avro schemas I run into problems with the generics and reflection constructs.
How can I resolve this? What design pattern could help?
The model (Avro schema) is defined in Java classes.
public enum Types implements MessageType {
    RECORD_1("record1", "01", Record1.getClassSchema(), Record1.class),
    RECORD_2("record2", "02", Record2.getClassSchema(), Record2.class);

    private String topicName;
    private String dataType;
    private Schema schema;
    private Class<? extends SpecificRecordBase> clazz;
}

public class Record1 extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
    public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse("???");
    public static org.apache.avro.Schema getClassSchema() { return SCHEMA$; }
    ...
}

public class Record2 ...
The process trait with main process methods.
import org.apache.avro.specific.SpecificRecordBase
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.fs.Writer
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink
import tests.{Record1, Record2, Types}

import scala.reflect.ClassTag

trait Converter[T] extends Serializable {
  def convertToModel(message: KafkaSourceType): T
}

trait FlinkRunner extends Serializable {

  val kafkaTopicToModelMapping: Map[String, Class[_ <: SpecificRecordBase]] =
    Map(
      "record_1" -> Types.RECORD_1.getClassType,
      "record_2" -> Types.RECORD_2.getClassType
    )

  def buildAvroSink1(path: String, writer1: Writer[Record1]): BucketingSink[Record1] = ???
  def buildAvroSink2(path: String, writer2: Writer[Record2]): BucketingSink[Record2] = ???

  def process(topicList: List[String], env: StreamExecutionEnvironment): Unit = {
    // producer kafka source building
    val clazz1: Class[Record1] = ClassTag(kafkaTopicToModelMapping(topicList.head)).runtimeClass.asInstanceOf[Class[Record1]]
    val clazz2: Class[Record2] = ClassTag(kafkaTopicToModelMapping(topicList.tail.head)).runtimeClass.asInstanceOf[Class[Record2]]
    // How can I build a list of classes from these?
    // val clazzes: List[Class[_ <: SpecificRecordBase]] = ???

    val avroTypeInfo1: TypeInformation[Record1] = TypeInformation.of(clazz1)
    val avroTypeInfo2: TypeInformation[Record2] = TypeInformation.of(clazz2)
    // How can I build a list of TypeInformation from these?
    // val avroTypeInfos = ???

    val stream: DataStream[KafkaSourceType] = ???

    // consumer avro paths building
    val converter1: Converter[Record1] = new Converter[Record1] {
      override def convertToModel(message: KafkaSourceType): Record1 = deserializeAvro[Record1](message.value)
    }
    val converter2: Converter[Record2] = new Converter[Record2] {
      override def convertToModel(message: KafkaSourceType): Record2 = deserializeAvro[Record2](message.value)
    }
    // How can I build a list of converters from these?

    val outputResultStream1 = stream
      .filter(_.topic == topicList.head)
      .map(record => converter1.convertToModel(record))(avroTypeInfo1)
    val outputResultStream2 = stream
      .filter(_.topic == topicList.tail.head)
      .map(record => converter2.convertToModel(record))(avroTypeInfo2)

    val writer1 = new AvroSinkWriter[Record1](???)
    val writer2 = new AvroSinkWriter[Record2](???)
    // add sinks and start the process
  }
}
AS IS
There are several different topics in Kafka. The Kafka version is 0.10.2, without Confluent. Every Kafka topic works with exactly one Avro schema class, written in Java.
A single Flink job (written in Scala) reads a single topic, converts it with one Avro schema and writes the data to a single folder in HDFS. The topic name, path and output folder name come from the config.
For example there are 3 job flows with parameters:
First Job Flow
--brokersAdress …
--topic record1
--folderName folder1
--avroClassName Record1
--output C:/….
--jobName SingleTopic1
--number_of_parallel 2
--number_of_task 1
--mainClass Runner
….
Second Job Flow
--brokersAdress …
--topic record1
--folderName folder1
--avroClassName Record1
--output C:/….
--jobName SingleTopic2
--number_of_parallel 2
--number_of_task 1
--mainClass Runner
….
Third Job Flow
…
TO BE
A single Flink job can read more than one Kafka topic, convert each with its own Avro schema and write the data to different folders, without joining.
For example, I could start just one job flow that does the same work:
--brokersAdress …
--topic record1, record2, record3
--folderName folder1, folder2,
--avroClassName Record1, Record2
--output C:/….
--jobName MultipleTopics
--number_of_parallel 3
--number_of_task 3
--mainClass Runner
...
Ok, thank you. I have several questions about code organization:
1) How can I generalize the variables in the process method and its parameters so that I can build a List of classes inheriting from SpecificRecordBase, if that is possible at all?
val clazz1: Class[Record1] = ClassTag(kafkaTopicToModelMapping(topicList.head)).runtimeClass.asInstanceOf[Class[Record1]]
val clazz2: Class[Record2] = ClassTag(kafkaTopicToModelMapping(topicList.tail.head)).runtimeClass.asInstanceOf[Class[Record2]]
2) The same question applies to avroTypeInfo1, avroTypeInfo2, ..., converter1, converter2, ..., buildAvroSink1, buildAvroSink2, ...
I also have questions about the architecture. I executed this code and Flink handled the different topics with their Avro schema classes correctly.
Which Flink tools can help me route the different Avro schema classes to several output streams and attach sinks to them?
Do you have code examples for this?
And what could I use instead of Flink to solve the problem of generating several Avro files from different Kafka topics? Perhaps Confluent.
I'm a bit lost on your motivation. The general idea is: if you want a generic approach, go with GenericRecord. If you have specific code for the different types, go with SpecificRecord, but then don't wrap generic code around it.
Further, if you don't need to, try hard not to mix different events in the same topic/topology. Rather, spawn a separate topology in the same main for each subtype:
def createTopology[T <: SpecificRecordBase](topic: String, folder: String)
                                           (implicit tag: ClassTag[T], info: TypeInformation[T]): Unit = {
  val clazz = tag.runtimeClass.asInstanceOf[Class[T]]
  // needs flink-avro, flink-connector-kafka and flink-parquet on the classpath
  val stream: DataStream[T] =
    env.addSource(new FlinkKafkaConsumer[T](topic, AvroDeserializationSchema.forSpecific(clazz), properties))
  stream.addSink(
    StreamingFileSink
      .forBulkFormat(Path.fromLocalFile(new java.io.File(folder)), ParquetAvroWriters.forSpecificRecord(clazz))
      .build())
}
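A hedged usage sketch, assuming createTopology above sits in the same object as env and properties; the topic and folder names are taken from the question's "TO BE" example and are illustrative:

def main(args: Array[String]): Unit = {
  // one topology per subtype, all in the same job
  createTopology[Record1]("record1", "/output/folder1")
  createTopology[Record2]("record2", "/output/folder2")
  env.execute("MultipleTopics")
}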

Create Source from Lagom/Akka Kafka Topic Subscriber for Websocket

I want my Lagom subscriber-only service to subscribe to a Kafka Topic and stream the messages to a websocket. I have a service defined as follows using this documentation (https://www.lagomframework.com/documentation/1.4.x/scala/MessageBrokerApi.html#Subscribe-to-a-topic) as a guideline:
// service call
def stream(): ServiceCall[Source[String, NotUsed], Source[String, NotUsed]]

// service implementation
override def stream() = ServiceCall { req =>
  req.runForeach(str => log.info(s"client: $str"))
  kafkaTopic().subscribe.atLeastOnce(Flow.fromFunction(
    // add message to a Source and return Done
  ))
  Future.successful(/* some Source[String, NotUsed] */)
}
However, I can't quite figure out how to handle my Kafka messages. Flow.fromFunction returns a Flow[String, Done, _], which implies that I need to add those messages (strings) to a Source that was created outside of the subscriber.
So my question is twofold:
1) How do I create an akka stream source to receive messages from a kafka topic subscriber at runtime?
2) How do I append kafka messages to said source while in the Flow?
You seem to be misunderstanding the service API of Lagom. If you're trying to materialize a stream from the body of your service call, there's no input to your call; i.e.,
def stream(): ServiceCall[Source[String, NotUsed], Source[String, NotUsed]]
implies that when the client provides a Source[String, NotUsed], the service will respond in kind. Your client is not directly providing this; therefore, your signature should likely be
def stream(): ServiceCall[NotUsed, Source[String, NotUsed]]
Now to your question...
This actually doesn't exist in the scala giter8 template, but the java version contains what they call an autonomous stream which does approximately what you want to do.
In Scala, this code would look something like...
override def autonomousStream(): ServiceCall[
  Source[String, NotUsed],
  Source[String, NotUsed]
] = ServiceCall { hellos =>
  Future {
    hellos.mapAsync(8)(/* ... */)
  }
}
Since your call isn't mapping over the input stream, but rather a kafka topic, you'll want to do something like this:
override def stream(): ServiceCall[NotUsed, Source[String, NotUsed]] = ServiceCall { _ =>
  Future {
    kafkaTopic()
      .subscribe
      .atMostOnceSource
      .mapAsync(...)
  }
}
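For example, a minimal sketch, assuming the topic's message type is already String so no async transformation is needed; mapMaterializedValue is only there to match the Source[String, NotUsed] return type, and the imports from the service above are reused:

override def stream(): ServiceCall[NotUsed, Source[String, NotUsed]] = ServiceCall { _ =>
  Future.successful(
    kafkaTopic()
      .subscribe
      .atMostOnceSource
      .map(identity)                       // transform each Kafka message here if needed
      .mapMaterializedValue(_ => NotUsed)  // discard the subscriber's materialized value
  )
}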

How to Mock akka Actor for unit Test?

@Singleton
class EventPublisher @Inject() (@Named("rabbit-mq-event-update-actor") rabbitControlActor: ActorRef)
                               (implicit ctx: ExecutionContext) {

  def publish(event: Event): Unit = {
    logger.info("Publishing Event: {} with routing key {}", toJsObject(event), routingKey)
    rabbitControlActor ! Message.topic(shipmentStatusUpdate, routingKey = "XXX")
  }
}
I want to write a unit test that verifies that when this publish function is called,
rabbitControlActor ! Message.topic(shipmentStatusUpdate, routingKey = "XXX")
is sent exactly once.
I am using Spingo to publish messages to RabbitMQ.
I am using Play Framework 2.6.x and Scala 2.12.
You can create a TestProbe actor with:
val myActorProbe = TestProbe()
and get its ref with myActorProbe.ref.
Then you can verify that it receives exactly one message with:
myActorProbe.expectMsg("myMsg")
myActorProbe.expectNoMsg()
You should probably take a look at this page: https://doc.akka.io/docs/akka/2.5/testing.html
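For this specific case, a minimal sketch (assuming ScalaTest with akka-testkit; someEvent is a placeholder for an Event instance, and Message is Spingo's message type from the question) that injects the probe's ref into EventPublisher and asserts that publish sends exactly one message:

import akka.actor.ActorSystem
import akka.testkit.{TestKit, TestProbe}
import org.scalatest.WordSpecLike
import scala.concurrent.ExecutionContext.Implicits.global

class EventPublisherSpec extends TestKit(ActorSystem("test")) with WordSpecLike {

  "EventPublisher.publish" should {
    "send exactly one message to the rabbit control actor" in {
      val probe = TestProbe()
      val publisher = new EventPublisher(probe.ref)   // inject the probe instead of the real actor
      publisher.publish(someEvent)                    // someEvent: placeholder Event instance
      probe.expectMsgClass(classOf[Message])          // exactly one Message arrives...
      probe.expectNoMsg()                             // ...and nothing else
    }
  }
}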
It depends on whether you only want to check that the message is received by the actor, or you want to test the actor's functionality.
If you just want to check that the message got delivered to the actor, you can go with TestProbe, i.e.:
val probe = TestProbe()
probe.ref ! Message
Then do:
probe.expectMsgType[Message]
You can make use of TestActorRef in the case where you have a supervisor actor that performs some DB operations, so you can override its receive method and stop the flow from going all the way to the DB, i.e.:
val testActor = TestActorRef(Props(new Actor {
  override def receive: Receive = {
    case m: Message =>
      // some DB operation in the real flow; here you can reply with whatever
      // your actor would return from the DB call, e.g. some case class
    case _ => // do something
  }
}))
Assuming your method returns a Future of Boolean:
// requires `import akka.pattern.ask` and an implicit Timeout in scope
val testResult = (testActor ? Message).mapTo[Boolean]
// then assert on the result

Akka's actor based custom Event Bus implementation causes bottleneck

I'm trying to implement the Event Bus (pub-sub) pattern on top of Akka's actor model.
Akka's "native" EventBus implementation doesn't meet some of my requirements, e.g. the possibility of retaining only the last message in a topic, which is specific to the MQTT protocol; I'm implementing a message broker for it (https://github.com/butaji/JetMQ).
Current interface of my EventBus is the following:
object Bus {
  case class Subscribe(topic: String, actor: ActorRef)
  case class Unsubscribe(topic: String, actor: ActorRef)
  case class Publish(topic: String, payload: Any, retain: Boolean = false)
}
And usage looks like this:
val system = ActorSystem("System")
val bus = system.actorOf(Props[MqttEventBus], name = "bus")
val device1 = system.actorOf(Props(new DeviceActor(bus)))
val device2 = system.actorOf(Props(new DeviceActor(bus)))
All the devices hold a reference to a single Bus actor. The Bus actor is responsible for storing all subscription and topic state (e.g. retained messages).
Device actors can decide for themselves whether to Publish, Subscribe or Unsubscribe to topics.
After some performance benchmarks, I realized that my current design hurts the processing time between a Publish and its delivery to subscribers, because:
my EventBus is effectively a singleton,
which causes a huge queue of processing load to pile up on it.
How can I distribute (parallelize) the workload of my event bus implementation?
Is the current solution a good fit for akka-cluster?
Currently, I'm thinking about routing through several instances of Bus, as follows:
val paths = (1 to 5).map(x => {
  system.actorOf(Props[EventBusActor], name = "event-bus-" + x).path.toString
})
val bus_publisher = system.actorOf(RoundRobinGroup(paths).props())
val bus_manager = system.actorOf(BroadcastGroup(paths).props())
Where:
bus_publisher will be responsible for handling Publish,
bus_manager will be responsible for handling Subscribe / Unsubscribe.
This would replicate the state across all the buses and reduce the queue per actor by distributing the load.
You could route inside of your singleton bus instead of outside. Your bus could be responsible for routing messages and establishing topics, while sub-Actors could be responsible for distributing the messages. A basic example that demonstrates what I'm describing but without unsubscribe functionality, duplicate subscription checks, or supervision:
import scala.collection.mutable
import akka.actor.{Actor, ActorRef, Props}

// assumes the Subscribe and Publish messages from the question's Bus object
// are in scope (e.g. via import Bus._)
class HashBus() extends Actor {
  // one DistributionActor per topic
  val topicActors = mutable.Map.empty[String, ActorRef]

  def createDistributionActor = {
    context.actorOf(Props[DistributionActor])
  }

  override def receive = {
    case subscribe: Subscribe =>
      topicActors.getOrElseUpdate(subscribe.topic, createDistributionActor) ! subscribe
    case publish: Publish =>
      topicActors.get(publish.topic).foreach(_ ! publish)
  }
}

class DistributionActor extends Actor {
  val recipients = mutable.ListBuffer.empty[ActorRef]

  override def receive = {
    case Subscribe(topic: String, actorRef: ActorRef) =>
      recipients += actorRef
    case publish: Publish =>
      recipients.foreach(_ ! publish)
  }
}
This would ensure that your bus Actor's mailbox doesn't get saturated because the bus's job is simply to do hash lookups. The DistributionActors would be responsible for mapping over the recipients and distributing the payload. Similarly, the DistributionActor could retain any state for a topic.
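A brief usage sketch of the hash-routed bus, reusing the Bus messages and DeviceActor from the question; the topic name is illustrative:

import akka.actor.{ActorSystem, Props}

val system = ActorSystem("System")
val bus = system.actorOf(Props[HashBus], name = "bus")
val device1 = system.actorOf(Props(new DeviceActor(bus)))

bus ! Bus.Subscribe("sensors/temperature", device1)
bus ! Bus.Publish("sensors/temperature", payload = "21.5C")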