How to make it pure? - scala

I have the following Scala code:
import akka.Done
import akka.actor.ActorSystem
import akka.kafka.ConsumerMessage.CommittableOffsetBatch
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import scala.concurrent.Future
object TestConsumer {
  def main(args: Array[String]): Unit = {
    implicit val system = ActorSystem("KafkaConsumer")
    implicit val materializer = ActorMaterializer()

    val consumerSettings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId("group1")
      .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

    val result = Consumer
      .committableSource(consumerSettings, Subscriptions.topics("test"))
      .mapAsync(2)(rec => Future.successful(rec.record.value()))
      .runWith(Sink.foreach(ele => {
        print(ele)
        system.terminate()
      }))
  }
}
As you can see, the application consumes messages from Kafka and prints them to the shell.
runWith is not pure; it produces side effects: it prints each received message and shuts down the actor system.
The question is: how can I make this pure with cats IO effects? Is it possible?

You don't need cats IO to make it pure. Note that your sink is already pure, because it's just a value that describes what will happen when it is used (in this case, "used" means connecting it to the Source and running the stream).
val sink: Sink[String, Future[Done]] = Sink.foreach(ele => {
  print(ele)
  // system.terminate() // PROBLEM: terminating the system before the stream completes!
})
The problem you described has nothing to do with purity. The problem is that the sink above closes over the value of system, and then tries to terminate it when processing each element of the source.
Terminating the system means that you are destroying the whole runtime environment (used by ActorMaterializer) that is used to run the stream. This should only be done when your stream completes.
val result: Future[Done] = Consumer
  .committableSource(consumerSettings, Subscriptions.topics("test"))
  .mapAsync(2)(rec => Future.successful(rec.record.value()))
  .runWith(sink)

import system.dispatcher // ExecutionContext needed by onComplete
result.onComplete(_ => system.terminate())
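That said, if you also want to suspend running the stream itself, here is a minimal sketch with cats-effect, purely for illustration (assumptions: cats-effect 3 is on the classpath, and the consumerSettings and sink values from above are in scope):

import cats.effect.IO

// Hedged sketch: nothing happens until this IO is evaluated (e.g. by an IOApp),
// so building the program remains a pure description of the stream run.
def program(implicit system: ActorSystem, mat: ActorMaterializer): IO[Unit] =
  for {
    _ <- IO.fromFuture(IO {
           Consumer
             .committableSource(consumerSettings, Subscriptions.topics("test"))
             .mapAsync(2)(rec => Future.successful(rec.record.value()))
             .runWith(sink)
         })
    _ <- IO.fromFuture(IO(system.terminate()))
  } yield ()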

Related

Alpakka UDP: How can I respond to received datagrams via the already bound socket?

I'm using Alpakka's UDP.bindFlow to forward incoming UDP datagrams to a Kafka broker. The legacy application that is sending these datagrams requires a UDP response from the same port the message was sent to. I am struggling to model this behaviour, as it would require me to connect the output of the flow to its input.
I tried this solution, but it does not work because the response datagram is sent from a different source port:
import java.net.InetSocketAddress
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.ActorMaterializer
import akka.stream.alpakka.udp.Datagram
import akka.stream.alpakka.udp.scaladsl.Udp
import akka.stream.scaladsl.{Flow, Source}
import akka.util.ByteString
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer
object UdpInput extends App {
  implicit val system: ActorSystem = ActorSystem()
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  val socket = new InetSocketAddress("0.0.0.0", 40000)
  val udpBindFlow = Udp.bindFlow(socket)
  val producerSettings = ProducerSettings(system, new StringSerializer, new StringSerializer)
  val kafkaSink = Flow[Datagram].map(toProducerRecord).to(Producer.plainSink(producerSettings))

  def toProducerRecord(datagram: Datagram) = new ProducerRecord[String, String]("udp", datagram.data.utf8String)
  def toResponseDatagram(datagram: Datagram) = Datagram(ByteString("OK"), datagram.remote)

  // Does not model the behaviour I'm looking for because
  // the response datagram is sent from a different source port
  Source.asSubscriber
    .via(udpBindFlow)
    .alsoTo(kafkaSink)
    .map(toResponseDatagram)
    .to(Udp.sendSink)
    .run
}
I ended up using GraphDSL to implement a cyclic graph. Thanks to dvim for pointing me in the right direction!
import java.net.InetSocketAddress
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.alpakka.udp.Datagram
import akka.stream.alpakka.udp.scaladsl.Udp
import akka.stream.scaladsl.GraphDSL.Implicits._
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, MergePreferred, RunnableGraph, Source}
import akka.stream.{ActorMaterializer, ClosedShape}
import akka.util.ByteString
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer
object UdpInput extends App {
  implicit val system: ActorSystem = ActorSystem()
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  val producerSettings = ProducerSettings(system, new StringSerializer, new StringSerializer)
  val socket = new InetSocketAddress("0.0.0.0", 40000)
  val udpBindFlow = Udp.bindFlow(socket)
  val udpResponseFlow = Flow[Datagram].map(toResponseDatagram)
  val kafkaSink = Flow[Datagram].map(toProducerRecord).to(Producer.plainSink(producerSettings))

  def toProducerRecord(datagram: Datagram) = new ProducerRecord[String, String]("udp", datagram.data.utf8String)
  def toResponseDatagram(datagram: Datagram) = Datagram(ByteString("OK"), datagram.remote)

  RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
    val merge = b.add(MergePreferred[Datagram](1))
    val bcast = b.add(Broadcast[Datagram](2))

    Source.asSubscriber ~> merge ~> udpBindFlow ~> bcast ~> kafkaSink
    merge.preferred <~ udpResponseFlow <~ bcast

    ClosedShape
  }).run
}

param message.send.max.retries does not work in kafka producer

I have a Scala project built with sbt, and in it a producer.
I am trying to retry messages if Kafka is unreachable.
package com.example
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.scaladsl.Source
import akka.stream.{ActorMaterializer, ActorMaterializerSettings, Supervision}
import org.apache.kafka.clients.producer.{ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer
import scala.concurrent.Await
import scala.concurrent.duration._
import scala.util.control.NonFatal
object producer extends App {
  private val decider: Supervision.Decider = {
    case NonFatal(ex) =>
      println("Non fatal exception in flow. Skip message and resuming flow.", ex)
      Supervision.Restart
    case ex: Throwable =>
      println("Other exception in flow. Stopping flow.", ex)
      Supervision.Stop
  }

  implicit val system = ActorSystem("QuickStart")
  private val strategy =
    ActorMaterializerSettings(system).withSupervisionStrategy(decider)
  implicit val materializer = ActorMaterializer(strategy)

  val config = system.settings.config.getConfig("akka.kafka.producer")
  val producerSettings =
    ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("10.20.10.193:9092")
      .withProperty("message.send.max.retries", "3")
      .withProperty(ProducerConfig.MAX_BLOCK_MS_CONFIG, "5000")
      //.withProperty(ProducerConfig.RECONNECT_BACKOFF_MS_CONFIG, "5000")

  val done =
    Source
      .single("11")
      .map(value => new ProducerRecord[String, String]("example", value))
      .runWith(Producer.plainSink(producerSettings))

  Await.result(done, 1000 seconds)
}
I defined the property:
.withProperty("message.send.max.retries", "3")
but it does not work. When I run the producer with a bad Kafka host, the output is:
[INFO ] - 2018-07-30 23:00:36,951 - suppression - akka.event.slf4j.Slf4jLogger - Slf4jLogger started
(Non fatal exception in flow. Skip message and resuming flow.,org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 5000 ms.)
(Non fatal exception in flow. Skip message and resuming flow.,org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 5000 ms.)
There are only two retries in the log instead of three.
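A hedged side note (an assumption, not part of the original question): the modern Java producer that akka-kafka wraps is configured through the "retries" key (ProducerConfig.RETRIES_CONFIG) rather than the legacy "message.send.max.retries" property, so a sketch of the settings might look like this:

// Hedged sketch: use the Java producer's own retry keys.
// ProducerConfig.RETRIES_CONFIG is "retries"; RETRY_BACKOFF_MS_CONFIG is "retry.backoff.ms".
val retryingSettings =
  ProducerSettings(system, new StringSerializer, new StringSerializer)
    .withBootstrapServers("10.20.10.193:9092")
    .withProperty(ProducerConfig.RETRIES_CONFIG, "3")
    .withProperty(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, "1000")
    .withProperty(ProducerConfig.MAX_BLOCK_MS_CONFIG, "5000")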

Spark streaming: How to write cumulative output?

I have to write a single output file for my streaming job.
Question: when will my job actually stop? I killed the server, but it did not work.
I want to stop my job from the command line (if that is possible).
Code:
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.Seconds
import org.apache.spark._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import scala.collection.mutable.ArrayBuffer
object MAYUR_BELDAR_PROGRAM5_V1 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", args(0).toInt)
    val words = lines.flatMap(_.split(" "))

    val class1 = words.filter(a => a.charAt(0).toInt % 2 == 0).map(a => a).filter(a => a.length % 2 == 0)
    val class2 = words.filter(a => a.charAt(0).toInt % 2 == 0).map(a => a).filter(a => a.length % 2 == 1)
    val class3 = words.filter(a => a.charAt(0).toInt % 2 == 1).map(a => a).filter(a => a.length % 2 == 0)
    val class4 = words.filter(a => a.charAt(0).toInt % 2 == 1).map(a => a).filter(a => a.length % 2 == 1)

    class1.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class1", "txt")
    class2.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class2", "txt")
    class3.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class3", "txt")
    class4.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class4", "txt")

    ssc.start() // Start the computation
    ssc.awaitTermination()
    ssc.stop()
  }
}
A stream by definition does not have an end, so it will not stop unless you call the method that stops it. In my case I have a business condition that tells me when the process is finished, and when I reach that point I call JavaStreamingContext.close(). I also have a monitor that checks whether the process has stopped receiving data for the past few minutes, in which case it also closes the stream.
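For illustration, a minimal sketch of that idea against the question's Scala StreamingContext (the finished flag is a hypothetical stand-in for the business condition):

import java.util.concurrent.atomic.AtomicBoolean

// Hedged sketch: watch a placeholder condition on a side thread and stop the
// streaming context gracefully once it holds.
val finished = new AtomicBoolean(false) // placeholder: set this from your own logic

new Thread("streaming-stopper") {
  override def run(): Unit = {
    while (!finished.get()) Thread.sleep(10000)
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}.start()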
In order to accumulate data you have to use the method updateStateByKey (on a PairDStream). This method requires checkpointing to be enabled.
I have checked the Spark code and found that saveAsTextFiles uses foreachRDD, so in the end it saves each RDD separately and previous RDDs are not taken into account. With updateStateByKey you will still get multiple files, but each file will reflect all RDDs that were processed before it.
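For illustration, a minimal sketch of that approach, reusing ssc and words from the question (the checkpoint and output paths are hypothetical):

// Hedged sketch: keep a running count per word across batches with updateStateByKey.
// Stateful operators require a checkpoint directory.
ssc.checkpoint("hdfs://hadoop1:9000/mbeldar/checkpoint") // hypothetical path

val cumulativeCounts = words
  .map(word => (word, 1))
  .updateStateByKey[Int]((newValues: Seq[Int], state: Option[Int]) =>
    Some(newValues.sum + state.getOrElse(0)))

cumulativeCounts.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/cumulative", "txt")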

Reactive-Kafka Stream Consumer: Dead letters occurred

I am trying to consume messages from Kafka using Akka's reactive-kafka library. I get one message printed, and after that I see:
[INFO] [01/24/2017 10:36:52.934] [CommittableSourceConsumerMain-akka.actor.default-dispatcher-5] [akka://CommittableSourceConsumerMain/system/kafka-consumer-1] Message [akka.kafka.KafkaConsumerActor$Internal$Stop$] from Actor[akka://CommittableSourceConsumerMain/deadLetters] to Actor[akka://CommittableSourceConsumerMain/system/kafka-consumer-1#-1726905274] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
This is the code I am executing
import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import org.apache.kafka.clients.consumer.ConsumerConfig
import play.api.libs.json._
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
object CommittableSourceConsumerMain extends App {
  implicit val system = ActorSystem("CommittableSourceConsumerMain")
  implicit val materializer = ActorMaterializer()

  val consumerSettings =
    ConsumerSettings(system, new ByteArrayDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId("CommittableSourceConsumer")
      .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

  val done =
    Consumer.committableSource(consumerSettings, Subscriptions.topics("topic1"))
      .mapAsync(1) { msg =>
        val record = msg.record.value()
        val data = Json.parse(record)
        val recordType = data \ "data" \ "event" \ "type"
        val actualData = data \ "data" \ "row"
        if (recordType.as[String] == "created") {
          "Some saving logic"
        } else {
          "Some logic"
        }
        msg.committableOffset.commitScaladsl()
      }
      .runWith(Sink.ignore)
}
I finally figured out the solution. A runtime exception in the stream produces a failed Future, which terminates the stream immediately.
Akka Streams does not log or display the runtime exception on its own. To see the exception:
import scala.util.control.NonFatal

done.onFailure {
  case NonFatal(e) => println(e)
}
The exception was in the if-else block.
One can also use a supervision strategy to resume the stream if an exception occurs.
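For example, a minimal sketch of such a strategy, used in place of the plain ActorMaterializer() in the question (resuming on every non-fatal exception is an assumption; choose the cases that fit your logic):

import akka.stream.{ActorMaterializer, ActorMaterializerSettings, Supervision}
import scala.util.control.NonFatal

// Hedged sketch: resume the stream on non-fatal element failures instead of
// letting the failed Future terminate it.
val decider: Supervision.Decider = {
  case NonFatal(_) => Supervision.Resume
  case _           => Supervision.Stop
}

implicit val materializer = ActorMaterializer(
  ActorMaterializerSettings(system).withSupervisionStrategy(decider))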

Why is AssertionError: assertion failed: Executor has not been attached to this receiver?

I'm trying to build a simple Spark Streaming custom receiver where messages are stored directly in the Spark stream in order to be processed. However, I get:
java.lang.AssertionError: assertion failed: Executor has not been attached to this receiver
I'm integrating with a third-party Java library that generates method calls based on a socket it is listening to. By implementing an interface from that library, I plan to call the store method on the custom Spark receiver.
I've created a simple cut-down example that reproduces the error without referencing the third-party library.
package com.custom.spark
import org.apache.spark.streaming.receiver.Receiver
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
object CustomSparkReceiver {
  def main(args: Array[String]) {
    // create stream with custom receiver
    val conf = new SparkConf().setMaster("local[*]").setAppName("CustomSparkReceiver")
    val ssc = new StreamingContext(conf, Seconds(1))
    val customReceiver = new CustomReceiver
    val stream = ssc.receiverStream(customReceiver)

    // print values in spark stream
    stream.print

    // start pushing data into stream
    ssc.start
    Thread.sleep(1000)
    List.range(1, 10)
      .foreach { number => customReceiver.store(number) }
    Thread.sleep(1000)
    ssc.stop()
  }

  class CustomReceiver extends Receiver[Integer](StorageLevel.MEMORY_AND_DISK_2) {
    def onStart() = {}
    def onStop() = {}
  }
}
I think it is some sort of threading issue, but I'm not sure how to fix it. Any pointers on the above would be great.
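Not a verified fix, but one hedged pointer: Receiver.store is only usable on the receiver instance that Spark has started on an executor, so a common pattern is to push data from inside onStart (for example from a thread, or from the third-party callback) rather than calling store on the driver-side instance. A rough sketch:

// Hedged sketch: feed the stream from inside onStart, where the receiver has
// been attached to its executor, instead of from the driver.
class CustomReceiver extends Receiver[Integer](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart(): Unit = {
    new Thread("custom-receiver-feeder") {
      override def run(): Unit =
        List.range(1, 10).foreach(number => store(number))
    }.start()
  }
  def onStop(): Unit = {}
}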