I can't find a lifecycle description for the high-level consumer. I'm on 0.8.2.2 and I can't use the "modern" consumer from kafka-clients. Here is my code:
def consume(numberOfEvents: Int, await: Duration = 100.millis): List[MessageEnvelope] = {
  val consumerProperties = new Properties()
  consumerProperties.put("zookeeper.connect", kafkaConfig.zooKeeperConnectString)
  consumerProperties.put("group.id", consumerGroup)
  consumerProperties.put("auto.offset.reset", "smallest")

  val consumer = Consumer.create(new ConsumerConfig(consumerProperties))
  try {
    val messageStreams = consumer.createMessageStreams(
      Predef.Map(kafkaConfig.topic -> 1),
      new DefaultDecoder,
      new MessageEnvelopeDecoder)

    val receiveMessageFuture = Future[List[MessageEnvelope]] {
      messageStreams(kafkaConfig.topic)
        .flatMap(stream => stream.take(numberOfEvents).map(_.message()))
    }

    Await.result(receiveMessageFuture, await)
  } finally {
    consumer.shutdown()
  }
}
It's not clear to me: should I shut down the consumer after each message retrieval, or can I keep the instance and reuse it for fetching messages? I suppose reusing the instance is the right way, but I can't find any articles / best practices on it.
I'm trying to reuse the consumer and / or messageStreams. It doesn't work well for me and I can't find the reason.
If I try to reuse messageStreams, I get an exception:
2017-04-17_19:57:57.088 ERROR MessageEnvelopeConsumer - Error while awaiting for messages java.lang.IllegalStateException: Iterator is in failed state
java.lang.IllegalStateException: Iterator is in failed state
at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:54)
at scala.collection.IterableLike$class.take(IterableLike.scala:134)
at kafka.consumer.KafkaStream.take(KafkaStream.scala:25)
Happens here:
def consume(numberOfEvents: Int, await: Duration = 100.millis): List[MessageEnvelope] = {
  try {
    val receiveMessageFuture = Future[List[MessageEnvelope]] {
      messageStreams(kafkaConfig.topic)
        .flatMap(stream => stream.take(numberOfEvents).map(_.message()))
    }

    Try(Await.result(receiveMessageFuture, await)) match {
      case Success(result) => result
      case Failure(_: TimeoutException) => List.empty
      case Failure(e) =>
        // ===> never got any message from topic
        logger.error(s"Error while awaiting for messages ${e.getClass.getName}: ${e.getMessage}", e)
        List.empty
    }
  } catch {
    case e: Exception =>
      logger.warn(s"Error while consuming messages", e)
      List.empty
  }
}
I tried to create messageStreams each time. No luck:
2017-04-17_20:02:44.236 WARN MessageEnvelopeConsumer - Error while consuming messages
kafka.common.MessageStreamsExistException: ZookeeperConsumerConnector can create message streams at most once
at kafka.consumer.ZookeeperConsumerConnector.createMessageStreams(ZookeeperConsumerConnector.scala:151)
at MessageEnvelopeConsumer.consume(MessageEnvelopeConsumer.scala:47)
Happens here:
def consume(numberOfEvents: Int, await: Duration = 100.millis): List[MessageEnvelope] = {
  try {
    val messageStreams = consumer.createMessageStreams(
      Predef.Map(kafkaConfig.topic -> 1),
      new DefaultDecoder,
      new MessageEnvelopeDecoder)

    val receiveMessageFuture = Future[List[MessageEnvelope]] {
      messageStreams(kafkaConfig.topic)
        .flatMap(stream => stream.take(numberOfEvents).map(_.message()))
    }

    Try(Await.result(receiveMessageFuture, await)) match {
      case Success(result) => result
      case Failure(_: TimeoutException) => List.empty
      case Failure(e) =>
        logger.error(s"Error while awaiting for messages ${e.getClass.getName}: ${e.getMessage}", e)
        List.empty
    }
  } catch {
    case e: Exception =>
      // ===> now exception raised here
      logger.warn(s"Error while consuming messages", e)
      List.empty
  }
}
UPD
I switched to an iterator-based approach. It looks like this:
// consumerProperties.put("consumer.timeout.ms", "100")

private lazy val consumer: ConsumerConnector = Consumer.create(new ConsumerConfig(consumerProperties))

private lazy val messageStreams: Seq[KafkaStream[Array[Byte], MessageEnvelope]] =
  consumer.createMessageStreamsByFilter(Whitelist(kafkaConfig.topic), 1, new DefaultDecoder, new MessageEnvelopeDecoder)

private lazy val iterator: ConsumerIterator[Array[Byte], MessageEnvelope] = {
  val stream = messageStreams.head
  stream.iterator()
}

def consume(): List[MessageEnvelope] = {
  try {
    if (iterator.hasNext) {
      val fromKafka: MessageAndMetadata[Array[Byte], MessageEnvelope] = iterator.next
      List(fromKafka.message())
    } else {
      List.empty
    }
  } catch {
    case _: ConsumerTimeoutException =>
      List.empty
    case e: Exception =>
      logger.warn(s"Error while consuming messages", e)
      List.empty
  }
}
Now I'm trying to figure out if it automatically commits offsets to ZK...
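From what I can tell from the 0.8 consumer configs (an assumption on my side, not verified yet), auto-commit to ZooKeeper is enabled by default for the old high-level consumer and is controlled by a couple of properties; offsets can also be committed manually on the connector:
// assumption: names/defaults taken from the 0.8 consumer config docs, please double-check
consumerProperties.put("auto.commit.enable", "true")       // auto-commit is on by default
consumerProperties.put("auto.commit.interval.ms", "60000") // how often offsets are written to ZK
// with auto-commit disabled, offsets can be committed explicitly:
// consumer.commitOffsets()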
Constant shutdowns cause unnecessary consumer group rebalances, which hurts performance a lot. See this article for best practices: https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example
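The gist of that example, adapted to your decoders, is to create the connector and the streams once, consume them in long-running loops, and shut down only when the application stops. A rough sketch (not tested against 0.8.2.2; MessageEnvelope and MessageEnvelopeDecoder are your own types):
import java.util.Properties
import java.util.concurrent.{ExecutorService, Executors}
import kafka.consumer.{Consumer, ConsumerConfig, ConsumerConnector, KafkaStream}
import kafka.serializer.DefaultDecoder

class EnvelopeConsumerLoop(zkConnect: String, groupId: String, topic: String) {

  private val consumer: ConsumerConnector = {
    val props = new Properties()
    props.put("zookeeper.connect", zkConnect)
    props.put("group.id", groupId)
    props.put("auto.offset.reset", "smallest")
    Consumer.create(new ConsumerConfig(props))
  }

  // createMessageStreams may only be called once per connector
  private val streams: Seq[KafkaStream[Array[Byte], MessageEnvelope]] =
    consumer.createMessageStreams(Map(topic -> 1), new DefaultDecoder, new MessageEnvelopeDecoder)(topic)

  private val executor: ExecutorService = Executors.newFixedThreadPool(streams.size)

  def start(handle: MessageEnvelope => Unit): Unit =
    streams.foreach { stream =>
      executor.submit(new Runnable {
        // blocks waiting for messages (or throws ConsumerTimeoutException if consumer.timeout.ms is set)
        override def run(): Unit = stream.iterator().foreach(m => handle(m.message()))
      })
    }

  // call once, on application shutdown
  def stop(): Unit = {
    consumer.shutdown()
    executor.shutdown()
  }
}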
My answer is in the latest question update (UPD): the iterator-based approach works for me as expected.
Related
How can we consume SSE in the Scala Play framework? Most of the resources that I could find were about making an SSE source. I want to reliably listen to SSE events from other services (with autoconnect). The most relevant article was https://doc.akka.io/docs/alpakka/current/sse.html. I implemented this but it does not seem to work (code below). Also the event that I am su
@Singleton
class SseConsumer @Inject()(implicit ec: ExecutionContext) {

  implicit val system = ActorSystem()

  val send: HttpRequest => Future[HttpResponse] = foo

  def foo(x: HttpRequest) = {
    try {
      println("foo")
      val authHeader = Authorization(BasicHttpCredentials("user", "pass"))
      val newHeaders = x.withHeaders(authHeader)
      Http().singleRequest(newHeaders)
    } catch {
      case e: Exception => {
        println("Exception", e.printStackTrace())
        throw e
      }
    }
  }

  val eventSource: Source[ServerSentEvent, NotUsed] =
    EventSource(
      uri = Uri("https://abc/v1/events"),
      send,
      initialLastEventId = Some("2"),
      retryDelay = 1.second
    )

  def orderStatusEventStable() = {
    val events: Future[immutable.Seq[ServerSentEvent]] =
      eventSource
        .throttle(elements = 1, per = 500.milliseconds, maximumBurst = 1, ThrottleMode.Shaping)
        .take(10)
        .runWith(Sink.seq)

    events.map(_.foreach(x => {
      println("456")
      println(x.data)
    }))
  }

  Future {
    blocking {
      while (true) {
        try {
          Thread.sleep(2000)
          orderStatusEventStable()
        } catch {
          case e: Exception => {
            println("Exception", e.printStackTrace())
          }
        }
      }
    }
  }
}
This does not give any exceptions and println("456") is never printed.
EDIT:
Future {
  blocking {
    while (true) {
      try {
        Await.result(orderStatusEventStable() recover {
          case e: Exception => {
            println("exception", e)
            throw e
          }
        }, Duration.Inf)
      } catch {
        case e: Exception => {
          println("Exception", e.printStackTrace())
        }
      }
    }
  }
}
I added an Await and it started working; I am able to read 10 messages at a time. But now I am faced with another problem.
I have a producer which can at times produce faster than I can consume, and with this code I have 2 issues:
I have to wait until 10 messages are available. How can we take a max. of 10 and a min. of 0 messages?
When the production rate > consumption rate, I am missing few events. I am guessing this is due to throttling. How do we handle it using backpressure?
The issue in your code is that the events: Future would only complete when the stream (eventSource) completes.
I'm not familiar with SSE but the stream likely never completes in your case as it's always listening for new events.
You can learn more in Akka Stream documentation.
Depending on what you want to do with the events, you could just map on the stream like:
eventSource
...
.map(/* do something */)
.runWith(...)
Basically, you need to work with the Akka Stream Source as data is going through it but don't wait for its completion.
EDIT: I didn't notice the take(10); my answer applies only if the take was not there. Your code should work after 10 events are sent.
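To illustrate, a minimal sketch that handles each event as it arrives instead of waiting for the stream to complete (handleEvent is a hypothetical placeholder for whatever you do with an event, e.g. writing it to a DB):
// sketch only: the stream keeps running; backpressure propagates from handleEvent's Future to the SSE source
def handleEvent(e: ServerSentEvent): Future[Unit] = Future {
  println(e.data) // stand-in for real processing
}

eventSource
  .throttle(elements = 1, per = 500.milliseconds, maximumBurst = 1, ThrottleMode.Shaping)
  .mapAsync(parallelism = 1)(handleEvent) // no take(10), so the stream never "completes"
  .runWith(Sink.ignore)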
When doing the re-partitioning, Spark breaks the lazy evaluation chain and triggers the error, which I cannot control/catch.
//simulation of reading a stream from s3
def readFromS3(partition: Int): Iterator[(Int, String)] = {
  Iterator.tabulate(3) { idx =>
    // simulate an error only on partition 3, record 2
    (idx, if (partition == 3 && idx == 2) throw new RuntimeException("error") else s"elem $idx on partition $partition")
  }
}

val rdd = sc.parallelize(Seq(1, 2, 3, 4))
  .mapPartitionsWithIndex((partitionIndex, iter) => readFromS3(partitionIndex))

// I can do whatever I want here

//this is what triggers the evaluation of the iterator
val partitionedRdd = rdd.partitionBy(new HashPartitioner(2))

// I can do whatever I want here

//desperately trying to catch the exception
partitionedRdd.foreachPartition { iter =>
  try {
    iter.foreach(println)
  } catch {
    case _ => println("error caught")
  }
}
Before you comment, be aware that:
This is an over-simplification of my real-world application.
I know the reading from S3 can be done differently and that I should use sc.textFile. I don't have control over this; I cannot change it.
I understand what the problem is: when partitioning, Spark breaks the lazy evaluation chain and triggers the error. I have to do it!
I don't claim there is a bug in Spark; Spark needs to evaluate the records for shuffling.
I can only do whatever I want:
between the reading from S3 and partitioning
after partitioning
I can write my own custom partitioner
Given the restrictions mentioned above, can I work around this? Is there a solution?
The only solution I could find was to use an EvalAheadIterator (one which evaluates the head of the buffer before iterator.next is called):
import scala.collection.AbstractIterator
import scala.util.control.NonFatal

class EvalAheadIterator[+A](iter: Iterator[A]) extends AbstractIterator[A] {

  private val bufferedIter: BufferedIterator[A] = iter.buffered

  override def hasNext: Boolean =
    if (bufferedIter.hasNext) {
      try {
        bufferedIter.head // evaluate the head and trigger potential exceptions
        true
      } catch {
        case NonFatal(e) =>
          println("caught exception ahead of time")
          false
      }
    } else {
      false
    }

  override def next(): A = bufferedIter.next()
}
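Outside of Spark, the behaviour can be sanity-checked on a plain Scala iterator (illustrative only):
// an iterator whose third element throws when evaluated
val failing: Iterator[Int] = Iterator.tabulate(3) { idx =>
  if (idx == 2) throw new RuntimeException("boom") else idx
}

// the wrapper reports hasNext = false instead of propagating the failure,
// so downstream consumers simply see a shorter iterator
new EvalAheadIterator(failing).foreach(println) // prints 0 and 1, then "caught exception ahead of time"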
Now we apply the EvalAheadIterator in a mapPartitions call:
//simulation of reading a stream from s3
def readFromS3(partition: Int): Iterator[(Int, String)] = {
  Iterator.tabulate(3) { idx =>
    // simulate an error only on partition 3, record 2
    (idx, if (partition == 3 && idx == 2) throw new RuntimeException("error") else s"elem $idx on partition $partition")
  }
}

val rdd = sc.parallelize(Seq(1, 2, 3, 4))
  .mapPartitionsWithIndex((partitionIndex, iter) => readFromS3(partitionIndex))
  .mapPartitions { iter => new EvalAheadIterator(iter) }

// I can do whatever I want here

//this is what triggers the evaluation of the iterator
val partitionedRdd = rdd.partitionBy(new org.apache.spark.HashPartitioner(2))

// I can do whatever I want here

//desperately trying to catch the exception
partitionedRdd.foreachPartition { iter =>
  try {
    iter.foreach(println)
  } catch {
    case _ => println("error caught")
  }
}
I am trying to open a stream, try to decompress it as gzip, if that fails try to decompress it as zlib, and return the stream for further use. The underlying stream must be closed in the case of exceptions creating the wrapping decompression streams, as otherwise I will run into resource exhaustion issues.
pbData is a standard, non-resettable InputStream.
There has to be a cleaner way to do this.
val input = {
  var pb = pbData.open()
  try {
    log.trace("Attempting to create GZIPInputStream")
    new GZIPInputStream(pb)
  } catch {
    case e: ZipException => {
      log.trace("Attempting to create InflaterInputStream")
      pb.close()
      pb = pbData.open()
      try {
        new InflaterInputStream(pb)
      } catch {
        case e: ZipException => {
          pb.close()
          throw e
        }
      }
    }
  }
}
Your process is actually iteration over InputStream instance generators. Here is a far more idiomatic solution compared to a lot of nested try-catches:
import java.io.{BufferedInputStream, InputStream}
import java.util.zip.{GZIPInputStream, InflaterInputStream}
import scala.util.control.NonFatal
import scala.util.{Failure, Try}

val bis = new BufferedInputStream(pbData.open())
// allows us to read and reset 16 bytes, as in your answer
bis.mark(16)

// list of functions, because we need lazy evaluation of streams
val streamGens: List[() => InputStream] = List(
  () => new GZIPInputStream(bis),
  () => new InflaterInputStream(bis),
  () => bis
)

def firstStreamOf(streamGens: List[() => InputStream]): Try[InputStream] =
  streamGens match {
    case x :: xs =>
      Try(x()).recoverWith {
        case NonFatal(_) =>
          // reset in case of failure
          bis.reset()
          firstStreamOf(xs)
      }
    case Nil =>
      // shouldn't get to this line because of the last streamGens element
      Failure(new Exception)
  }

val tryStream = firstStreamOf(streamGens)
tryStream.foreach { stream =>
  // do something with stream...
}
As a bonus, if you need to try more stream generators, you only have to add one line to the streamGens initialization. Also, you won't need to add the bis.reset() invocation manually.
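For example, if you also wanted to attempt raw DEFLATE data (no zlib header) before falling back to the plain stream, that would be one extra generator line (illustrative; only useful if raw deflate is actually relevant for your data):
import java.util.zip.Inflater

val streamGensWithRawDeflate: List[() => InputStream] = List(
  () => new GZIPInputStream(bis),
  () => new InflaterInputStream(bis),
  () => new InflaterInputStream(bis, new Inflater(true)), // nowrap = true: raw deflate without zlib header
  () => bis
)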
Something like this, perhaps:
def tryDecompress(pbData: PBData, ds: (InputStream => InputStream)*): InputStream = {
  def tryIt(s: InputStream, dc: InputStream => InputStream) = try {
    dc(s)
  } catch { case NonFatal(e) =>
    s.close()
    throw e
  }

  val (first, rest) = ds.head -> ds.tail
  try {
    tryIt(pbData.open, first)
  } catch {
    case _: ZipException if rest.nonEmpty =>
      tryDecompress(pbData, rest: _*)
  }
}

val stream = tryDecompress(pbData, new GZIPInputStream(_), new InflaterInputStream(_))
(Too bad scala's Try does not have onFailure ... this could look much nicer if it did :/)
val result: Try[InflaterInputStream] = Try(new GZIPInputStream(pb)) match {
  case res @ Success(x) => res
  case Failure(e) => e match {
    case e: ZipException =>
      Try(new InflaterInputStream(pb)) match {
        case res2 @ Success(x2) => res2
        case Failure(e2) =>
          pb.close()
          Failure(e2)
      }
    case ex =>
      pb.close()
      Failure(ex)
  }
}
I found that using a BufferedInputStream to wrap the underlying stream, then resetting it between each decompression library attempt, looks pretty clean.
val bis = new BufferedInputStream(pbData.open())
// allows us to read and reset 16 bytes
bis.mark(16)

val input: InputStream = {
  try {
    log.trace("attempting to open as gzip")
    new GZIPInputStream(bis)
  } catch {
    case e: ZipException => try {
      bis.reset()
      log.trace("attempting to open as zlib")
      new InflaterInputStream(bis)
    } catch {
      case e: ZipException => {
        bis.reset()
        log.trace("attempting to open as uncompressed")
        bis
      }
    }
  }
}
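Whichever attempt succeeds, input can then be consumed like any other InputStream, for example (illustrative):
import scala.io.Source

// read the decompressed bytes as text; adjust the charset and handling to your data
val text = Source.fromInputStream(input, "UTF-8").mkString
input.close()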
I am toying around with using a Source.queue from an Actor. I am stuck on pattern matching the result of an offer operation.
class MarcReaderActor(file: File, sourceQueue: SourceQueueWithComplete[Record]) extends Actor {

  val inStream = file.newInputStream
  val reader = new MarcStreamReader(inStream)

  override def receive: Receive = {
    case Process => {
      if (reader.hasNext()) {
        val record = reader.next()
        pipe(sourceQueue.offer(record)) to self
      }
    }
    case f: Future[QueueOfferResult] =>
  }
}
I don't know how to check whether it was Enqueued, Dropped, or Failure.
If I write f: Future[QueueOfferResult.Enqueued], the compiler complains.
Since you use pipeTo, you do not need to match on futures: the contents of the future will be sent to the actor when the future completes, not the future itself. Do this:
override def receive: Receive = {
  case Process =>
    if (reader.hasNext()) {
      val record = reader.next()
      pipe(sourceQueue.offer(record)) to self
    }

  case r: QueueOfferResult =>
    r match {
      case QueueOfferResult.Enqueued    => // element has been consumed
      case QueueOfferResult.Dropped     => // element has been ignored because of backpressure
      case QueueOfferResult.QueueClosed => // the queue upstream has terminated
      case QueueOfferResult.Failure(e)  => // the queue upstream has failed with an exception
    }

  case Status.Failure(e) => // future has failed, e.g. because of invalid usage of `offer()`
}
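For context, the sourceQueue handed to the actor is the materialized value of a Source.queue, typically set up along these lines (buffer size and the downstream sink are illustrative; assumes an implicit materializer / actor system is in scope):
import akka.actor.Props
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.{Sink, Source}

// materializes a SourceQueueWithComplete[Record]; with OverflowStrategy.backpressure,
// offer() completes its Future only once the element is accepted
val sourceQueue: SourceQueueWithComplete[Record] =
  Source.queue[Record](100, OverflowStrategy.backpressure)
    .to(Sink.foreach(record => println(record))) // stand-in for the real downstream
    .run()

val marcReader = system.actorOf(Props(new MarcReaderActor(file, sourceQueue)))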
I started playing around with Scala and came across this particular boilerplate for a WebSocket chat room in Scala.
It uses MergeHub.source() and BroadcastHub.sink() as the Source and Sink for sending the messages to all connected clients.
The example is working fine for exchanging messages as it is.
private val (chatSink, chatSource) = {
  // Don't log MergeHub$ProducerFailed as error if the client disconnects.
  // recoverWithRetries -1 is essentially "recoverWith"
  val source = MergeHub.source[WSMessage]
    .log("source")
    .recoverWithRetries(-1, { case _: Exception ⇒ Source.empty })

  val sink = BroadcastHub.sink[WSMessage]
  source.toMat(sink)(Keep.both).run()
}

private val userFlow: Flow[WSMessage, WSMessage, _] = {
  Flow.fromSinkAndSource(chatSink, chatSource)
}

def chat(): WebSocket = {
  WebSocket.acceptOrResult[WSMessage, WSMessage] {
    case rh if sameOriginCheck(rh) =>
      Future.successful(userFlow).map { flow =>
        Right(flow)
      }.recover {
        case e: Exception =>
          val msg = "Cannot create websocket"
          logger.error(msg, e)
          val result = InternalServerError(msg)
          Left(result)
      }

    case rejected =>
      logger.error(s"Request ${rejected} failed same origin check")
      Future.successful {
        Left(Forbidden("forbidden"))
      }
  }
}
I want to store the messages that are exchanged in the chatroom in a DB.
I tried adding map and fold functions to source and sink to get hold of the messages that are sent but I wasn't able to.
I tried adding a Flow stage between MergeHub and BroadcastHub like below
val flow = Flow[WSMessage].map(element => println(s"Message: $element"))
source.via(flow).toMat(sink)(Keep.both).run()
But it throws a compilation error saying toMat cannot be referenced with that signature.
Can someone help or point me to how I can get hold of the messages that are sent and store them in a DB?
Link for full template:
https://github.com/playframework/play-scala-chatroom-example
Let's look at your flow:
val flow = Flow[WSMessage].map(element => println(s"Message: $element"))
It takes elements of type WSMessage, and returns nothing (Unit). Here it is again with the correct type:
val flow: Flow[WSMessage, Unit, NotUsed] = Flow[WSMessage].map(element => println(s"Message: $element"))
This will clearly not work as the sink expects WSMessage and not Unit.
Here's how you can fix the above problem:
val flow = Flow[WSMessage].map { element =>
println(s"Message: $element")
element
}
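With the flow emitting WSMessage again, the wiring from your question should now type-check (a sketch, using the same source and sink names as in your snippet):
val loggingFlow = Flow[WSMessage].map { element =>
  println(s"Message: $element")
  element
}

// same wiring as before, with the pass-through stage between MergeHub and BroadcastHub
source.via(loggingFlow).toMat(sink)(Keep.both).run()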
Note that for persisting messages in the database, you will most likely want to use an async stage, roughly:
val flow = Flow[WSMessage].mapAsync(parallelism) { element =>
println(s"Message: $element")
// assuming DB.write() returns a Future[Unit]
DB.write(element).map(_ => element)
}