java.io.NotSerializableException using Apache Flink with Lagom - scala

I am writing a Flink CEP program inside Lagom's microservice implementation. My Flink CEP program runs perfectly fine in a simple Scala application, but when I use the same code inside the Lagom service implementation I receive a java.io.NotSerializableException.
Lagom Service Implementation
override def start = ServiceCall[NotUsed, String] {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  var executionConfig = env.getConfig
  env.setParallelism(1)
  executionConfig.disableSysoutLogging()

  var topic_name = "topic_test"
  var props = new Properties
  props.put("bootstrap.servers", "localhost:9092")
  props.put("acks", "all")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
  props.put("block.on.buffer.full", "false")

  val kafkaSource = new FlinkKafkaConsumer010(topic_name, new KafkaDeserializeSchema, props)
  val stream = env.addSource(kafkaSource)

  val deliveryPattern = Pattern.begin[XYZ]("begin").where(_.ABC == 5)
    .next("next").where(_.ABC == 10).next("end").where(_.ABC == 5)
  val deliveryPatternStream = CEP.pattern(stream, deliveryPattern)

  def selectFn(pattern: collection.mutable.Map[String, XYZ]): String = {
    val startEvent = pattern.get("begin").get
    val nextEvent = pattern.get("next").get
    "Alert Detected"
  }

  val deliveryResult = deliveryPatternStream.select(selectFn(_)).print()
  env.execute("CEP")

  req => Future.successful("Done")
}
}
I don't understand how to resolve this issue.
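One common cause of a NotSerializableException in this kind of setup is that functions defined inline inside the ServiceCall (such as selectFn and the pattern conditions) capture the enclosing Lagom service instance, which is not serializable, so Flink fails when it tries to serialize the job's user functions. A hedged sketch of one typical remedy, moving the select function into a standalone object so nothing from the service is captured (XYZ and the field names are taken from the question; this is illustrative, not a verified fix for this exact program):

// Sketch only: keep Flink user functions in a top-level object so they do not
// drag the (non-serializable) Lagom service instance into Flink's serialized job graph.
object DeliverySelect extends Serializable {
  def selectFn(pattern: collection.mutable.Map[String, XYZ]): String = {
    val startEvent = pattern("begin")
    val nextEvent = pattern("next")
    "Alert Detected"
  }
}

// ...and inside the service call, reference the object's method instead of an inline closure:
// val deliveryResult = deliveryPatternStream.select(DeliverySelect.selectFn _).print()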

Related

Scala spark kafka code - functional approach

I have the following code in Scala. I am using Spark SQL to pull data from Hadoop, perform a group by on the result, serialize it, and then write that message to Kafka.
I've written the code, but I want to write it in a functional way. Should I create a new class with a function 'getCategories' to get the categories from Hadoop? I am not sure how to approach this.
Here is the code
class ExtractProcessor {
  def process(): Unit = {
    implicit val formats = DefaultFormats
    val spark = SparkSession.builder().appName("test app").getOrCreate()
    try {
      val df = spark.sql("SELECT DISTINCT SUBCAT_CODE, SUBCAT_NAME, CAT_CODE, CAT_NAME " +
        "FROM CATEGORY_HIERARCHY " +
        "ORDER BY CAT_CODE, SUBCAT_CODE ")
      val result = df.collect().groupBy(row => (row(2), row(3)))
      val categories = result.map(cat =>
        category(cat._1._1.toString(), cat._1._2.toString(),
          cat._2.map(subcat =>
            subcategory(subcat(0).toString(), subcat(1).toString())).toList))
      val jsonMessage = write(categories)
      val kafkaKey = java.security.MessageDigest.getInstance("SHA-1")
        .digest(jsonMessage.getBytes("UTF-8")).map("%02x".format(_)).mkString.toString()
      val key = write(kafkaKey)
      Logger.log.info(s"Json Message: ${jsonMessage}")
      Logger.log.info(s"Kafka Key: ${key}")
      KafkaUtil.apply.send(key, jsonMessage, "testTopic")
    }
And here is the Kafka Code
class KafkaUtil {
  def send(key: String, message: String, topicName: String): Unit = {
    val properties = new Properties()
    properties.put("bootstrap.servers", "localhost:9092")
    properties.put("client.id", "test publisher")
    properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](properties)
    try {
      val record = new ProducerRecord[String, String](topicName, key, message)
      producer.send(record)
    }
    finally {
      producer.close()
      Logger.log.info("Kafka producer closed...")
    }
  }
}

object KafkaUtil {
  def apply: KafkaUtil = {
    new KafkaUtil
  }
}
Also, for writing unit tests, what should I be testing in the functional approach? In OOP we unit test the business logic, but in my Scala code there is hardly any business logic.
Any help is appreciated.
Thanks in advance,
Suyog
Your code consists of:
1) Loading the data into spark df
2) Crunching the data
3) Creating a json message
4) Sending json message to kafka
Unit tests are good for testing pure functions.
You can extract step 2) into a method with a signature like
def getCategories(df: DataFrame): Seq[Category]
and cover it with a test. In the test, the data frame can be generated from a plain hard-coded in-memory sequence.
Step 3) can also be covered by a unit test if you feel it is error-prone.
Steps 1) and 4) are best covered by an end-to-end test.
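For illustration, a minimal sketch of that extraction plus a matching unit test, assuming ScalaTest and the category/subcategory shape from the question (the names CategoryExtractor and CategoryExtractorSpec are illustrative):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Domain types mirroring the question's category/subcategory case classes (illustrative).
case class Subcategory(code: String, name: String)
case class Category(code: String, name: String, subcategories: List[Subcategory])

object CategoryExtractor {
  // Step 2) as a function of the DataFrame only, so it can be unit tested in isolation.
  def getCategories(df: DataFrame): Seq[Category] =
    df.collect()
      .groupBy(row => (row.getString(2), row.getString(3)))
      .map { case ((catCode, catName), rows) =>
        Category(catCode, catName,
          rows.map(r => Subcategory(r.getString(0), r.getString(1))).toList)
      }
      .toSeq
}

// A unit test that builds the DataFrame from a hard-coded in-memory sequence.
class CategoryExtractorSpec extends org.scalatest.FunSuite {
  test("groups subcategories under their category") {
    val spark = SparkSession.builder().master("local[1]").appName("test").getOrCreate()
    import spark.implicits._
    val df = Seq(
      ("S1", "Sub one", "C1", "Cat one"),
      ("S2", "Sub two", "C1", "Cat one")
    ).toDF("SUBCAT_CODE", "SUBCAT_NAME", "CAT_CODE", "CAT_NAME")

    val categories = CategoryExtractor.getCategories(df)

    assert(categories.size == 1)
    assert(categories.head.subcategories.size == 2)
    spark.stop()
  }
}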
By the way,
val result = df.collect().groupBy(row => (row(2), row(3))) is inefficient: it collects every row to the driver and then groups there. It is better to do the grouping on the Spark side and only collect the already-grouped result.
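A sketch of what that could look like with the DataFrame API, assuming the column names from the question's query (collect_list and struct are standard Spark SQL functions):

import org.apache.spark.sql.functions.{collect_list, struct}

// Group on the executors; only the aggregated rows are brought to the driver.
val grouped = df
  .groupBy("CAT_CODE", "CAT_NAME")
  .agg(collect_list(struct("SUBCAT_CODE", "SUBCAT_NAME")).as("subcategories"))
  .collect()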
Also, there is no need to create a new KafkaProducer for every single message; a single producer instance can be reused.
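For example, a sketch of a KafkaUtil that reuses one producer instance (configuration values copied from the question; error handling omitted):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaUtil {
  // One producer for the whole application instead of one per message.
  private lazy val producer: KafkaProducer[String, String] = {
    val properties = new Properties()
    properties.put("bootstrap.servers", "localhost:9092")
    properties.put("client.id", "test publisher")
    properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](properties)
  }

  def send(key: String, message: String, topicName: String): Unit =
    producer.send(new ProducerRecord[String, String](topicName, key, message))

  // Call once when the application shuts down, not after every send.
  def close(): Unit = producer.close()
}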

Akka Streams Reactive Kafka - OutOfMemoryError under high load

I am running an Akka Streams Reactive Kafka application which should be functional under heavy load. After running the application for around 10 minutes, the application goes down with an OutOfMemoryError. I tried to debug the heap dump and found that akka.dispatch.Dispatcher is taking ~5GB of memory. Below are my config files.
Akka version: 2.4.18
Reactive Kafka version: 2.4.18
1.application.conf:
consumer {
  num-consumers = "2"
  c1 {
    bootstrap-servers = "localhost:9092"
    bootstrap-servers = ${?KAFKA_CONSUMER_ENDPOINT1}
    groupId = "testakkagroup1"
    subscription-topic = "test"
    subscription-topic = ${?SUBSCRIPTION_TOPIC1}
    message-type = "UserEventMessage"
    poll-interval = 100ms
    poll-timeout = 50ms
    stop-timeout = 30s
    close-timeout = 20s
    commit-timeout = 15s
    wakeup-timeout = 10s
    use-dispatcher = "akka.kafka.default-dispatcher"
    kafka-clients {
      enable.auto.commit = true
    }
  }
2.build.sbt:
java -Xmx6g \
-Dcom.sun.management.jmxremote.port=27019 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Djava.rmi.server.hostname=localhost \
-Dzookeeper.host=$ZK_HOST \
-Dzookeeper.port=$ZK_PORT \
-jar ./target/scala-2.11/test-assembly-1.0.jar
3.Source and Sink actors:
class EventStream extends Actor with ActorLogging {
  implicit val actorSystem = context.system
  implicit val timeout: Timeout = Timeout(10 seconds)
  implicit val materializer = ActorMaterializer()
  val settings = Settings(actorSystem).KafkaConsumers

  override def receive: Receive = {
    case StartUserEvent(id) =>
      startStreamConsumer(consumerConfig("EventMessage" + ".c" + id))
  }

  def startStreamConsumer(config: Map[String, String]) = {
    val consumerSource = createConsumerSource(config)
    val consumerSink = createConsumerSink()
    val messageProcessor = startMessageProcessor(actorA, actorB, actorC)
    log.info("Starting The UserEventStream processing")
    val future = consumerSource.map { message =>
      val m = s"${message.record.value()}"
      messageProcessor ? m
    }.runWith(consumerSink)
    future.onComplete {
      case _ => actorSystem.stop(messageProcessor)
    }
  }

  def startMessageProcessor(actorA: ActorRef, actorB: ActorRef, actorC: ActorRef) = {
    actorSystem.actorOf(Props(classOf[MessageProcessor], actorA, actorB, actorC))
  }

  def createConsumerSource(config: Map[String, String]) = {
    val kafkaMBAddress = config("bootstrap-servers")
    val groupID = config("groupId")
    val topicSubscription = config("subscription-topic").split(',').toList
    println(s"Subscriptiontopics $topicSubscription")

    val consumerSettings = ConsumerSettings(actorSystem, new ByteArrayDeserializer, new StringDeserializer)
      .withBootstrapServers(kafkaMBAddress)
      .withGroupId(groupID)
      .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
      .withProperty(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true")

    Consumer.committableSource(consumerSettings, Subscriptions.topics(topicSubscription: _*))
  }

  def createConsumerSink() = {
    Sink.foreach(println)
  }
}
In this case actorA, actorB, and actorC do some business logic processing and database interaction. Is there anything I am missing in handling the Akka Reactive Kafka consumers, such as commit, error, or throttling configuration? Looking at the heap dump, my guess is that the messages are piling up.
One thing I would change is the following:
val future = consumerSource.map { message =>
  val m = s"${message.record.value()}"
  messageProcessor ? m
}.runWith(consumerSink)
In the above code, you're using ask to send messages to the messageProcessor actor and expect replies, but in order for ask to function as a backpressure mechanism, you need to use it with mapAsync (more information is in the documentation). Something like the following:
val future =
  consumerSource
    .mapAsync(parallelism = 5) { message =>
      val m = s"${message.record.value()}"
      messageProcessor ? m
    }
    .runWith(consumerSink)
Adjust the level of parallelism as needed.
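If you also want to cap the overall consumption rate (the question mentions throttling), Akka Streams has a built-in throttle stage that can be combined with mapAsync. A sketch with illustrative numbers, reusing the names from the code above:

import scala.concurrent.duration._
import akka.stream.ThrottleMode

val future =
  consumerSource
    // at most 100 messages per second, then bounded in-flight asks via mapAsync
    .throttle(elements = 100, per = 1.second, maximumBurst = 100, ThrottleMode.Shaping)
    .mapAsync(parallelism = 5) { message =>
      messageProcessor ? s"${message.record.value()}"
    }
    .runWith(consumerSink)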

How to Test Kafka Consumer

I have a Kafka Consumer (built in Scala) which extracts latest records from Kafka. The consumer looks like this:
val consumerProperties = new Properties()
consumerProperties.put("bootstrap.servers", "localhost:9092")
consumerProperties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
consumerProperties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
consumerProperties.put("group.id", "something")
consumerProperties.put("auto.offset.reset", "latest")
val consumer = new KafkaConsumer[String, String](consumerProperties)
consumer.subscribe(java.util.Collections.singletonList("topic"))
Now, I want to write an integration test for it. Is there any way or any best practice for Testing Kafka Consumers?
You need to start ZooKeeper and Kafka programmatically for integration tests.
1.1 Start ZooKeeper (ZooKeeperServer)
def startZooKeeper(zooKeeperPort: Int, zkLogsDir: Directory): ServerCnxnFactory = {
  val tickTime = 2000
  val zkServer = new ZooKeeperServer(zkLogsDir.toFile.jfile, zkLogsDir.toFile.jfile, tickTime)
  val factory = ServerCnxnFactory.createFactory
  factory.configure(new InetSocketAddress("0.0.0.0", zooKeeperPort), 1024)
  factory.startup(zkServer)
  factory
}
1.2 Start Kafka (KafkaServer)
case class StreamConfig(streamTcpPort: Int = 9092,
                        streamStateTcpPort: Int = 2181,
                        stream: String,
                        numOfPartition: Int = 1,
                        nodes: Map[String, String] = Map.empty)

def startKafkaBroker(config: StreamConfig,
                     kafkaLogDir: Directory): KafkaServer = {
  val syncServiceAddress = s"localhost:${config.streamStateTcpPort}"
  val properties: Properties = new Properties
  properties.setProperty("zookeeper.connect", syncServiceAddress)
  properties.setProperty("broker.id", "0")
  properties.setProperty("host.name", "localhost")
  properties.setProperty("advertised.host.name", "localhost")
  properties.setProperty("port", config.streamTcpPort.toString)
  properties.setProperty("auto.create.topics.enable", "true")
  properties.setProperty("log.dir", kafkaLogDir.toAbsolute.path)
  properties.setProperty("log.flush.interval.messages", 1.toString)
  properties.setProperty("log.cleaner.dedupe.buffer.size", "1048577")

  config.nodes.foreach {
    case (key, value) => properties.setProperty(key, value)
  }

  val broker = new KafkaServer(new KafkaConfig(properties))
  broker.startup()
  println(s"KafkaStream Broker started at ${properties.get("host.name")}:${properties.get("port")} at ${kafkaLogDir.toFile}")
  broker
}
2. Emit some events to the stream using a KafkaProducer.
3. Then consume them with your consumer under test and verify that it works.
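A minimal sketch of that emit step, assuming the embedded broker above is listening on localhost:9092 (the helper name emitTestEvent is illustrative):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

def emitTestEvent(topic: String, payload: String): Unit = {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)
  try producer.send(new ProducerRecord[String, String](topic, payload)).get() // block until the record is acknowledged
  finally producer.close()
}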
You can use scalatest-eventstream, which has a startBroker method that will start ZooKeeper and Kafka for you.
It also has destroyBroker, which will clean up your Kafka broker after the tests.
For example:
class MyStreamConsumerSpecs extends FunSpec with BeforeAndAfterAll with Matchers {
  implicit val config =
    StreamConfig(streamTcpPort = 9092, streamStateTcpPort = 2181, stream = "test-topic", numOfPartition = 1)
  val kafkaStream = new KafkaEmbeddedStream

  override protected def beforeAll(): Unit = {
    kafkaStream.startBroker
  }

  override protected def afterAll(): Unit = {
    kafkaStream.destroyBroker
  }

  describe("Kafka Embedded stream") {
    it("does consume some events") {
      // uses application.properties
      //   emitter.broker.endpoint=localhost:9092
      //   emitter.event.key.serializer=org.apache.kafka.common.serialization.StringSerializer
      //   emitter.event.value.serializer=org.apache.kafka.common.serialization.StringSerializer
      kafkaStream.appendEvent("test-topic", """{"MyEvent" : { "myKey" : "myValue"}}""")

      val consumerProperties = new Properties()
      consumerProperties.put("bootstrap.servers", "localhost:9092")
      consumerProperties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      consumerProperties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      consumerProperties.put("group.id", "something")
      consumerProperties.put("auto.offset.reset", "earliest")
      val myConsumer = new KafkaConsumer[String, String](consumerProperties)
      myConsumer.subscribe(java.util.Collections.singletonList("test-topic"))

      val events = myConsumer.poll(2000)
      events.count() shouldBe 1
      events.iterator().next().value() shouldBe """{"MyEvent" : { "myKey" : "myValue"}}"""
      println("=================" + events.count())
    }
  }
}

stopping spark streaming after reading first batch of data

I am using Spark Streaming to consume Kafka messages. I want to get some messages as a sample from Kafka instead of reading all of them, so I want to read one batch of messages, return them to the caller, and stop Spark Streaming. Currently I am passing the batchInterval time to the awaitTermination method of the streaming context. I don't know how to return the processed data to the caller from Spark Streaming. Here is the code that I am using currently:
def getsample(params: scala.collection.immutable.Map[String, String]): Unit = {
  if (params.contains("zookeeperQourum"))
    zkQuorum = params.get("zookeeperQourum").get
  if (params.contains("userGroup"))
    group = params.get("userGroup").get
  if (params.contains("topics"))
    topics = params.get("topics").get
  if (params.contains("numberOfThreads"))
    numThreads = params.get("numberOfThreads").get
  if (params.contains("sink"))
    sink = params.get("sink").get
  if (params.contains("batchInterval"))
    interval = params.get("batchInterval").get.toInt

  val sparkConf = new SparkConf().setAppName("KafkaConsumer").setMaster("spark://cloud2-server:7077")
  val ssc = new StreamingContext(sparkConf, Seconds(interval))
  val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap

  var consumerConfig = scala.collection.immutable.Map.empty[String, String]
  consumerConfig += ("auto.offset.reset" -> "smallest")
  consumerConfig += ("zookeeper.connect" -> zkQuorum)
  consumerConfig += ("group.id" -> group)

  var data = KafkaUtils.createStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](ssc, consumerConfig, topicMap, StorageLevel.MEMORY_ONLY).map(_._2)
  val streams = data.window(Seconds(interval), Seconds(interval)).map(x => new String(x))

  streams.foreach(rdd => rdd.foreachPartition(itr => {
    while (itr.hasNext && size >= 0) {
      var msg = itr.next
      println(msg)
      sample.append(msg)
      sample.append("\n")
      size -= 1
    }
  }))

  ssc.start()
  ssc.awaitTermination(5000)
  ssc.stop(true)
}
So instead of saving the messages in a StringBuilder called "sample", I want to return them to the caller.
You can implement a StreamingListener and then, inside its onBatchCompleted callback, call ssc.stop():
private class MyJobListener(ssc: StreamingContext) extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) = synchronized {
    ssc.stop(true)
  }
}
This is how you attach the listener to your StreamingContext:
val listen = new MyJobListener(ssc)
ssc.addStreamingListener(listen)
ssc.start()
ssc.awaitTermination()
We can get sample messages using the following piece of code:
var sampleMessages = streams.repartition(1).mapPartitions(x => x.take(10))
and if we want to stop after the first batch, we should implement our own StreamingListener and stop the streaming context in its onBatchCompleted method.
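Putting the two answers together, a sketch of returning the first sampled batch to the caller (the helper name sampleFirstBatch and the buffer handling are illustrative, not from the question):

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

def sampleFirstBatch(ssc: StreamingContext, stream: DStream[String], size: Int): Seq[String] = {
  val sample = ArrayBuffer.empty[String]
  stream.foreachRDD { rdd =>
    // take() brings the results to the driver, so appending to the local buffer is safe here
    sample ++= rdd.take(size)
  }
  ssc.addStreamingListener(new StreamingListener {
    override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit =
      ssc.stop(true)
  })
  ssc.start()
  ssc.awaitTermination()
  sample.toSeq
}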

Spark Streaming using Scala to insert to Hbase Issue

I am trying to read records from Kafka messages and put them into HBase. Though the Scala script runs without any issue, the inserts are not happening. Please help me.
Input:
rowkey1,1
rowkey2,2
Here is the code which I am using:
object Blaher {
  def blah(row: Array[String]) {
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    val thePut = new Put(Bytes.toBytes(row(0)))
    thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(row(1)))
    hTable.put(thePut)
  }
}

object TheMain extends Serializable {
  def run() {
    val ssc = new StreamingContext(sc, Seconds(1))
    val topicmap = Map("test" -> 1)
    val lines = KafkaUtils.createStream(ssc, "127.0.0.1:2181", "test-consumer-group", topicmap).map(_._2)
    val words = lines.map(line => line.split(",")).map(line => (line(0), line(1)))
    val store = words.foreachRDD(rdd => rdd.foreach(Blaher.blah))
    ssc.start()
  }
}
TheMain.run()
From the API doc for HTable's flushCommits() method: "Executes all the buffered Put operations". You should call this at the end of your blah() method -- it looks like they're currently being buffered but never executed or executed at some random time.
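For reference, a sketch of the question's Blaher.blah with the suggested flushCommits() call added (and the table closed afterwards), using the same old HTable API as the question:

object Blaher {
  def blah(row: Array[String]): Unit = {
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    try {
      val thePut = new Put(Bytes.toBytes(row(0)))
      thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(row(1)))
      hTable.put(thePut)
      hTable.flushCommits() // force the buffered Put out to the region server
    } finally {
      hTable.close()
    }
  }
}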