Hi, here is a problem I have been stumbling over for a few days and cannot solve by myself.
I am using the Kafka Streams Scala API v2.0.0.
I have two incoming streams, branched over two handlers for segregation, and both declare a Transformer using a common StateStore.
To give a quick overview, it looks like this:
def buildStream(builder: StreamsBuilder, config: Config) = {
  val store = Stores.keyValueStoreBuilder[String, AggregatedState](Stores.persistentKeyValueStore(config.storeName), ...)
  builder.addStateStore(store)

  val handlers = List(handler1, handler2)

  builder
    .stream(config.topic)
    .branch(handlers.map(_.accepts): _*)                       // Dispatch each event to the first handler accepting it
    .zip(handlers)                                             // (KStream[K, V], Handler)
    .map { case (stream, handler) => handler.handle(stream) }  // Process the events with the matching handler
    .reduce((s1, s2) => s1.merge(s2))                          // Merge the branches back, as they return the same type
    .to(config.output)

  builder
}
Each of my handlers looks the same: take an event, do some operations, then pass it through the transform() method to derive a state and emit an aggregate:
class Handler1(config: Config) {
  def accepts(key: String, value: Event): Boolean = ??? // Implementation not needed

  def handle(stream: KStream[String, Event]) = {
    stream
      .(join/map/filter)
      .transform(new Transformer1(config.storeName))
  }
}

class Handler2(config: Config) {
  def accepts(key: String, value: Event): Boolean = ??? // Implementation not needed

  def handle(stream: KStream[String, Event]) = {
    stream
      .(join/map/filter)
      .transform(new Transformer2(config.storeName))
  }
}
The transformers use the same StateStore with the following logic: for a new event, check whether its aggregate already exists; if it does, update it, store it and emit the new aggregate, otherwise build the aggregate, store it and emit it.
class Transformer1(storeName: String) extends Transformer[String, Event, (String, AggregatedState)] {

  private var store: KeyValueStore[String, AggregatedState] = _

  override def init(context: ProcessorContext): Unit = {
    store = context.getStateStore(storeName).asInstanceOf[KeyValueStore[String, AggregatedState]]
  }

  override def transform(key: String, value: Event): (String, AggregatedState) = {
    val existing: Option[AggregatedState] = Option(store.get(key))
    val agg = existing.map(_.updateWith(value)).getOrElse(new AggregatedState(value))

    store.put(key, agg)
    if (agg.isTerminal) {
      store.delete(key)
    }
    if (isDuplicate(existing, agg)) {
      null // Tombstone, we have a duplicate
    } else {
      (key, agg) // Emit the new aggregate
    }
  }

  override def close(): Unit = ()
}
class Transformer2(storeName: String) extends Transformer[String, Event, (String, AggregatedState)] {

  private var store: KeyValueStore[String, AggregatedState] = _

  override def init(context: ProcessorContext): Unit = {
    store = context.getStateStore(storeName).asInstanceOf[KeyValueStore[String, AggregatedState]]
  }

  override def transform(key: String, value: Event): (String, AggregatedState) = {
    val existing: Option[AggregatedState] = Option(store.get(key))
    val agg = existing.map(_.updateWith(value)).getOrElse(new AggregatedState(value))

    store.put(key, agg)
    if (agg.isTerminal) {
      store.delete(key)
    }
    if (isDuplicate(existing, agg)) {
      null // Tombstone, we have a duplicate
    } else {
      (key, agg) // Emit the new aggregate
    }
  }

  override def close(): Unit = ()
}
Transformer2 is the same; only the business logic changes (how a new event is merged with the aggregated state).
The problem I have is that on stream startup I can get either a normal startup or a boot exception:
15:07:23,420 ERROR org.apache.kafka.streams.processor.internals.AssignedStreamsTasks - stream-thread [job-tracker-prod-5ba8c2f7-d7fd-48b5-af4a-ac78feef71d3-StreamThread-1] Failed to commit stream task 1_0 due to the following error:
org.apache.kafka.streams.errors.ProcessorStateException: task [1_0] Failed to flush state store KSTREAM-AGGREGATE-STATE-STORE-0000000003
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.flush(ProcessorStateManager.java:242)
at org.apache.kafka.streams.processor.internals.AbstractTask.flushState(AbstractTask.java:198)
at org.apache.kafka.streams.processor.internals.StreamTask.flushState(StreamTask.java:406)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:380)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:368)
at org.apache.kafka.streams.processor.internals.AssignedTasks$1.apply(AssignedTasks.java:67)
at org.apache.kafka.streams.processor.internals.AssignedTasks.applyToRunningTasks(AssignedTasks.java:362)
at org.apache.kafka.streams.processor.internals.AssignedTasks.commit(AssignedTasks.java:352)
at org.apache.kafka.streams.processor.internals.TaskManager.commitAll(TaskManager.java:401)
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:1035)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:845)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:767)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:736)
Caused by: java.lang.IllegalStateException: This should not happen as timestamp() should only be called while a record is processed
at org.apache.kafka.streams.processor.internals.AbstractProcessorContext.timestamp(AbstractProcessorContext.java:161)
at org.apache.kafka.streams.state.internals.StoreChangeLogger.logChange(StoreChangeLogger.java:59)
at org.apache.kafka.streams.state.internals.ChangeLoggingKeyValueBytesStore.put(ChangeLoggingKeyValueBytesStore.java:66)
at org.apache.kafka.streams.state.internals.ChangeLoggingKeyValueBytesStore.put(ChangeLoggingKeyValueBytesStore.java:31)
at org.apache.kafka.streams.state.internals.InnerMeteredKeyValueStore.put(InnerMeteredKeyValueStore.java:206)
at org.apache.kafka.streams.state.internals.MeteredKeyValueBytesStore.put(MeteredKeyValueBytesStore.java:117)
at com.mycompany.streamprocess.Transformer1.transform(Transformer1.scala:49) // Line with store.put(key, agg)
I have already searched and found results mentioning "the transformer uses a Factory Pattern", which is what is used here (as .transform takes the transformer and creates a TransformerSupplier under the hood).
As the error is pseudo-random (I could reproduce it sometimes), I guess it could be a race condition on startup, but I found nothing conclusive.
Is it because I use the same state store from different transformers?
I assume you are hitting https://issues.apache.org/jira/browse/KAFKA-7250.
It's fixed in versions 2.0.1 and 2.1.0.
If you cannot upgrade, you need to pass in the TransformerSupplier explicitly, because the Scala API constructs the supplier incorrectly in 2.0.0:
.transform(() => new Transformer1(config.storeName))
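Applied to the handlers above, that means handing .transform a factory rather than an instance. This is only a sketch; depending on the API version you may also need to pass config.storeName as the state-store-name argument so the store gets connected to the transformer:

def handle(stream: KStream[String, Event]) = {
  stream
    .(join/map/filter)
    // A fresh Transformer1 is now created per task instead of one shared instance
    .transform(() => new Transformer1(config.storeName), config.storeName)
}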
Related
I am trying to build an Akka Streams Source which receives data by making Future-based API calls (the API is scrolling in nature, incrementally fetching results). To build such a Source, I am using a GraphStage.
I have modified the NumbersSource example, which simply pushes an Int at a time. The only change I made was to replace that Int with getvalue(): Future[Int] (to simulate the API call):
class NumbersSource extends GraphStage[SourceShape[Int]] {
  val out: Outlet[Int] = Outlet("NumbersSource")
  override val shape: SourceShape[Int] = SourceShape(out)

  // simple example of future API call
  private def getvalue(): Future[Int] = Future.successful(Random.nextInt())

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      setHandler(out, new OutHandler {
        override def onPull(): Unit = {
          // Future API call
          getvalue().onComplete {
            case Success(value) =>
              println("Pushing value received..") // this is currently being printed just once
              push(out, value)
            case Failure(exception) =>
          }
        }
      })
    }
}
// Using the Source and Running the stream
val sourceGraph: Graph[SourceShape[Int], NotUsed] = new NumbersSource
val mySource: Source[Int, NotUsed] = Source.fromGraph(sourceGraph)
val done: Future[Done] = mySource.runForeach { num =>
  println(s"Received: $num") // This is currently not printed
}
done.onComplete(_ => system.terminate())
The above code doesn't work: the println statement inside setHandler is executed just once and nothing is pushed downstream.
How should such Future calls be handled? Thanks.
UPDATE
I tried to use getAsyncCallback by making changes as follow:
class NumbersSource(futureNum: Future[Int]) extends GraphStage[SourceShape[Int]] {
  val out: Outlet[Int] = Outlet("NumbersSource")
  override val shape: SourceShape[Int] = SourceShape(out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      override def preStart(): Unit = {
        val callback = getAsyncCallback[Int] { (_) =>
          completeStage()
        }
        futureNum.foreach(callback.invoke)
      }

      setHandler(out, new OutHandler {
        override def onPull(): Unit = {
          val value: Int = ??? // How to get this value ??
          push(out, value)
        }
      })
    }
}
// Using the Source and Running the Stream
def random(): Future[Int] = Future.successful(Random.nextInt())
val sourceGraph: Graph[SourceShape[Int], NotUsed] = new NumbersSource(random())
val mySource: Source[Int, NotUsed] = Source.fromGraph(sourceGraph)
val done: Future[Done] = mySource.runForeach{
num => println(s"Received: $num") // This is currently not printed
}
done.onComplete(_ => system.terminate())
But now I am stuck on how to grab the value computed by the Future. In the case of a GraphStage for a Flow, I could use:
val value = grab(in) // where in is the Inlet of the Flow
But what I have is a GraphStage for a Source, so I have no idea how to grab the Int value of the completed Future above.
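For completeness, if you do want to keep a custom GraphStage, the usual pattern is to register an async callback that performs the push itself once the Future completes. This is only a sketch, assuming a getvalue(): Future[Int] as in the first version and an implicit ExecutionContext in scope:

import scala.util.{Failure, Success, Try}

override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
  new GraphStageLogic(shape) {
    // The callback body runs on the stage's own thread, so push/failStage are safe here
    private val pushCallback = getAsyncCallback[Try[Int]] {
      case Success(value) => push(out, value)
      case Failure(error) => failStage(error)
    }

    setHandler(out, new OutHandler {
      override def onPull(): Unit =
        getvalue().onComplete(pushCallback.invoke) // one Future per downstream pull
    })
  }

This keeps backpressure intact: each downstream pull triggers exactly one Future, and the element is only pushed once both the demand and the result are there.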
I'm not sure if I understand correctly, but if you are trying to implement an infinite source out of elements computed in Futures then there is really no need to do it with your own GraphStage. You can do it simply as below:
Source.repeat(())
.mapAsync(parallelism) { _ => Future.successful(Random.nextInt()) }
Source.repeat(()) is simply an infinite source of arbitrary values (of type Unit in this case, but you can change () to whatever you want, since it is ignored here). mapAsync is then used to integrate the asynchronous computations into the flow.
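Wired into the question's example, this could look like the following sketch (parallelism = 1 keeps the calls strictly sequential, and take(10) is only there to make the demo finite):

val mySource: Source[Int, NotUsed] =
  Source.repeat(())
    .mapAsync(parallelism = 1)(_ => getvalue()) // getvalue(): Future[Int] from the question

val done: Future[Done] = mySource.take(10).runForeach(num => println(s"Received: $num"))
done.onComplete(_ => system.terminate())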
I second the other answer in trying to avoid creating your own GraphStage. After some experimentation, this is what seems to work for me:
type Data = Int

trait DbResponse {
  // this is just a callback for a compact solution
  def nextPage: Option[() => Future[DbResponse]]
  def data: List[Data]
}

def createSource(dbCall: DbResponse): Source[Data, NotUsed] = {
  val thisPageSource = Source.apply(dbCall.data)
  val nextPageSource = dbCall.nextPage match {
    case Some(dbCallBack) => Source.lazySource(() => Source.future(dbCallBack()).flatMapConcat(createSource))
    case None             => Source.empty
  }
  thisPageSource.concat(nextPageSource)
}

val dataSource: Source[Data, NotUsed] = Source
  .future(???: Future[DbResponse]) // the first db call
  .flatMapConcat(createSource)
I tried it out and it works almost perfectly. I couldn't find out why, but the second page is requested instantaneously; the rest, however, works as expected (with backpressure and so on).
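To see it in action without a real database, here is a hypothetical two-page stub (PagedResponse and the page contents are invented purely for illustration):

case class PagedResponse(data: List[Data], nextPage: Option[() => Future[DbResponse]]) extends DbResponse

val page2: DbResponse = PagedResponse(List(4, 5, 6), None)
val page1: DbResponse = PagedResponse(List(1, 2, 3), Some(() => Future.successful(page2)))

Source.future(Future.successful(page1))
  .flatMapConcat(createSource)
  .runForeach(println) // prints 1 to 6, page by page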
Suppose we have a transformer (written in Scala)
new Transformer[String, V, (String, V)]() {
  var context: ProcessorContext = _

  override def init(context: ProcessorContext): Unit = {
    this.context = context
  }

  override def transform(key: String, value: V): (String, V) = {
    val timestamp = toTimestamp(value)
    context.forward(key, value, To.all().withTimestamp(timestamp))
    key -> value
  }

  override def close(): Unit = ()
}
where toTimestamp is just a function which returns a timestamp fetched from the record value. Once it gets executed, there's an NPE:
Exception in thread "...-6f3693b9-4e8d-4e65-9af6-928884320351-StreamThread-5" java.lang.NullPointerException
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:110)
at CustomTransformer.transform()
at CustomTransformer.transform()
at org.apache.kafka.streams.scala.kstream.KStream$$anon$1$$anon$2.transform(KStream.scala:302)
at org.apache.kafka.streams.scala.kstream.KStream$$anon$1$$anon$2.transform(KStream.scala:300)
at
what essentially happens is that ProcessorContextImpl fails in:
public <K, V> void forward(final K key, final V value, final To to) {
    toInternal.update(to);
    if (toInternal.hasTimestamp()) {
        recordContext.setTimestamp(toInternal.timestamp());
    }
    final ProcessorNode previousNode = currentNode();
because the recordContext was not initialized (and it can only be initialized internally by Kafka Streams).
This is a follow-up question to Set timestamp in output with Kafka Streams.
If you work with a Transformer, you need to make sure that a new Transformer object is created when TransformerSupplier#get() is called (cf. https://docs.confluent.io/current/streams/faq.html#why-do-i-get-an-illegalstateexception-when-accessing-record-metadata).
In the original question, I thought it was your context variable that results in the NPE, but now I realize it's about the Kafka Streams internals.
The Scala API has a bug in 2.0.0 that may result in the same Transformer instance being reused (https://issues.apache.org/jira/browse/KAFKA-7250). I think you are hitting this bug. Rewriting your code a little bit should fix the issue. Note that Kafka 2.0.1 and Kafka 2.1.0 contain a fix.
@matthias-j-sax Same behavior if the processor is reused in Java code.
Topology topology = new Topology();
MyProcessor myProcessor = new MyProcessor();

topology.addSource("source", "topic-1")
        .addProcessor(
            "processor",
            () -> {
                return myProcessor;
            },
            "source"
        )
        .addSink("sink", "topic-2", "processor");
KafkaStreams streams = new KafkaStreams(topology, config);
streams.start();
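Returning a fresh instance from the supplier, i.e. () -> new MyProcessor() instead of the captured myProcessor field, avoids this, because every task then gets its own processor instance and record context.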
I am trying to implement a customized Akka Sink, but I cannot find a way to handle a Future inside it.
class EventSink(...) extends GraphStage[SinkShape[EventEnvelope2]] {

  val in: Inlet[EventEnvelope2] = Inlet("EventSink")
  override val shape: SinkShape[EventEnvelope2] = SinkShape(in)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = {
    new GraphStageLogic(shape) {
      // This requests one element at the Sink startup.
      override def preStart(): Unit = pull(in)

      setHandler(in, new InHandler {
        override def onPush(): Unit = {
          val future = handle(grab(in))
          Await.ready(future, Duration.Inf)
          /*
          future.onComplete {
            case Success(_) =>
              logger.info("pulling next events")
              pull(in)
            case Failure(failure) =>
              logger.error(failure.getMessage, failure)
              throw failure
          }*/
          pull(in)
        }
      })
    }
  }

  private def handle(envelope: EventEnvelope2): Future[Unit] = {
    val EventEnvelope2(query.Sequence(offset), _/*persistenceId*/, _/*sequenceNr*/, event) = envelope
    ...
    db.run(statements.transactionally)
  }
}
I have to go with a blocking Future at the moment, which does not look good. The non-blocking version I commented out only works for the first event. Could anyone please help?
Updated: Thanks @ViktorKlang. It seems to be working now.
override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = {
  new GraphStageLogic(shape) {

    val callback = getAsyncCallback[Try[Unit]] {
      case Success(_) =>
        //completeStage()
        pull(in)
      case Failure(error) =>
        failStage(error)
    }

    // This requests one element at the Sink startup.
    override def preStart(): Unit = {
      pull(in)
    }

    setHandler(in, new InHandler {
      override def onPush(): Unit = {
        val future = handle(grab(in))
        future.onComplete { result =>
          callback.invoke(result)
        }
      }
    })
  }
}
I am trying to implement a relational DB event sink connecting to ReadJournal.eventsByTag. So this is a continuous stream which will never end unless there is an error, which is what I want. Is my approach correct?
Two more questions:
Will the GraphStage never end unless I manually invoke completeStage or failStage?
Is it right (or normal) to declare the callback outside the preStart method, and am I right to invoke pull(in) in preStart in this case?
Thanks,
Cheng
Avoid Custom Stages
In general, you should try to exhaust all possibilities with the given methods of the library's Source, Flow, and Sink. Custom stages are almost never necessary and make your code difficult to maintain.
Writing Your "Custom" Stage Using Standard Methods
Based on the details of your question's example code I don't see any reason why you would use a custom Sink to begin with.
Given your handle method, you could slightly modify it to do the logging that you specified in the question:
val loggedHandle: EventEnvelope2 => Future[Unit] =
  (envelope: EventEnvelope2) => handle(envelope).transform {
    case Success(_) => {
      logger.info("pulling next events")
      Success(())
    }
    case Failure(failure) => {
      logger.error(failure.getMessage, failure)
      Failure(failure)
    }
  }
Then just use Sink.foreachParallel to handle the envelopes:
val createEventEnvelope2Sink: Int => Sink[EventEnvelope2, Future[Done]] =
  (parallelism) =>
    Sink.foreachParallel[EventEnvelope2](parallelism)(loggedHandle(_))
Now, even if you want each EventEnvelope2 to be sent to the db in order you can just use 1 for parallelism:
val inOrderDBInsertSink : Sink[EventEnvelope2, Future[Done]] =
createEventEnvelope2Sink(1)
Also, if the database throws an exception you can still get a hold of it when the foreachParallel completes:
val someEnvelopeSource: Source[EventEnvelope2, _] = ???

someEnvelopeSource
  .runWith(createEventEnvelope2Sink(1))
  .andThen {
    case Failure(throwable) => { /*deal with db exception*/ }
    case Success(_)         => { /*all inserts succeeded*/ }
  }
Trying out the newly minted Akka Streams. It seems to be working except for one small thing - there's no output.
I have the following table definition:
case class my_stream(id: Int, value: String)

class Streams(tag: Tag) extends Table[my_stream](tag, "my_stream") {
  def id = column[Int]("id")
  def value = column[String]("value")
  def * = (id, value) <> (my_stream.tupled, my_stream.unapply)
}
And I'm trying to output the contents of the table to stdout like this:
def main(args: Array[String]): Unit = {
  implicit val system = ActorSystem("Subscriber")
  implicit val materializer = ActorMaterializer()

  val strm = TableQuery[Streams]
  val db = Database.forConfig("pg-postgres")

  try {
    var src = Source.fromPublisher(db.stream(strm.result))
    src.runForeach(r => println(s"${r.id},${r.value}"))(materializer)
  } finally {
    system.shutdown
    db.close
  }
}
I have verified that the query is being run by configuring debug logging. However, all I get is this:
08:59:24.099 [main] INFO com.zaxxer.hikari.HikariDataSource - pg-postgres - is starting.
08:59:24.428 [main] INFO com.zaxxer.hikari.pool.HikariPool - pg-postgres - is closing down.
The cause is that Akka Streams is asynchronous: runForeach returns a Future which is completed once the stream completes, but that Future is never handled, so system.shutdown and db.close execute immediately instead of after the stream completes.
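A minimal sketch of the fix, keeping the names from the question: hold on to the Future returned by runForeach and only shut down once it has completed (here simply by blocking on it).

import scala.concurrent.Await
import scala.concurrent.duration.Duration

try {
  val src  = Source.fromPublisher(db.stream(strm.result))
  val done = src.runForeach(r => println(s"${r.id},${r.value}"))(materializer)
  Await.result(done, Duration.Inf) // block the main thread until the stream has finished
} finally {
  system.shutdown
  db.close
}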
Just in case it helps anyone searching for this very same issue but with MySQL: take into account that you have to enable the driver's streaming support "manually":
def enableStream(statement: java.sql.Statement): Unit = {
  statement match {
    case s: com.mysql.jdbc.StatementImpl => s.enableStreamingResults()
    case _ =>
  }
}
val publisher = sourceDb.stream(query.result.withStatementParameters(statementInit = enableStream))
Source: http://www.slideshare.net/kazukinegoro5/akka-streams-100-scalamatsuri
Ended up using @ViktorKlang's answer and just wrapped the run with an Await.result. I also found an alternative approach in the docs which demonstrates using the Reactive Streams publisher and subscriber interfaces:
The stream method returns a DatabasePublisher[T] and Source.fromPublisher returns a Source[T, NotUsed]. This means you have to attach a subscriber instead of using runForeach; according to the release notes, NotUsed is a replacement for Unit, which means nothing gets passed to the Sink.
Since Slick implements the Reactive Streams interfaces and not the Akka Streams interfaces, you need to use the fromPublisher and fromSubscriber integration points. That means you need to implement the org.reactivestreams.Subscriber[T] interface.
Here's a quick and dirty Subscriber[T] implementation which simply calls println:
class MyStreamWriter extends org.reactivestreams.Subscriber[my_stream] {

  private var sub: Option[Subscription] = None

  override def onNext(t: my_stream): Unit = {
    println(t.value)
    if (sub.nonEmpty) sub.head.request(1)
  }

  override def onError(throwable: Throwable): Unit = {
    println(throwable.getMessage)
  }

  override def onSubscribe(subscription: Subscription): Unit = {
    sub = Some(subscription)
    sub.head.request(1)
  }

  override def onComplete(): Unit = {
    println("ALL DONE!")
  }
}
You need to make sure you call the Subscription.request(Long) method in onSubscribe and then again in onNext to ask for more data; otherwise nothing will be sent, or you won't get the full set of results.
And here's how you use it:
def main(args: Array[String]): Unit = {
  implicit val system = ActorSystem("Subscriber")
  implicit val materializer = ActorMaterializer()

  val strm = TableQuery[Streams]
  val db = Database.forConfig("pg-postgres")

  try {
    val src = Source.fromPublisher(db.stream(strm.result))
    val flow = src.to(Sink.fromSubscriber(new MyStreamWriter()))
    flow.run()
  } finally {
    system.shutdown
    db.close
  }
}
I'm still trying to figure this out so I welcome any feedback. Thanks!
Suppose this API is given and we cannot change it:
object ProviderAPI {

  trait Receiver[T] {
    def receive(entry: T)
    def close()
  }

  def run(r: Receiver[Int]) {
    new Thread() {
      override def run() {
        (0 to 9).foreach { i =>
          r.receive(i)
          Thread.sleep(100)
        }
        r.close()
      }
    }.start()
  }
}
In this example, ProviderAPI.run takes a Receiver, calls receive(i) 10 times and then closes. Typically, ProviderAPI.run would call receive(i) based on a collection which could be infinite.
This API is intended to be used in imperative style, like an external iterator. If our application needs to filter, map and print this input, we need to implement a Receiver which mixes all these operations:
object Main extends App {

  class MyReceiver extends ProviderAPI.Receiver[Int] {
    def receive(entry: Int) {
      if (entry % 2 == 0) {
        println("Entry#" + entry)
      }
    }
    def close() {}
  }

  ProviderAPI.run(new MyReceiver())
}
Now, the question is how to use the ProviderAPI in a functional style, as an internal iterator (without changing the implementation of ProviderAPI, which is given to us). Note that ProviderAPI could also call receive(i) an infinite number of times, so collecting everything in a list is not an option (also, we should handle each result one by one, instead of collecting all the input first and processing it afterwards).
I am asking how to implement such a ReceiverToIterator, so that we can use the ProviderAPI in functional style:
object Main extends App {
  val iterator = new ReceiverToIterator[Int] // how to implement this?
  ProviderAPI.run(iterator)

  iterator
    .view
    .filter(_ % 2 == 0)
    .map("Entry#" + _)
    .foreach(println)
}
Update
Here are four solutions:
IteratorWithSemaphorSolution: The workaround solution I proposed first attached to the question
QueueIteratorSolution: Using the BlockingQueue[Option[T]] based on the suggestion of nadavwr.
It allows the producer to continue producing up to queueCapacity before being blocked by the consumer.
PublishSubjectSolution: Very simple solution, using PublishSubject from Netflix RxJava-Scala API.
SameThreadReceiverToTraversable: Very simple solution, by relaxing the constraints of the question
Updated: BlockingQueue of 1 entry
What you've implemented here is essentially Java's BlockingQueue, with a queue size of 1.
Main characteristic: uber-blocking. A slow consumer will kill your producer's performance.
Update: #gzm0 mentioned that BlockingQueue doesn't cover EOF. You'll have to use BlockingQueue[Option[T]] for that.
Update: Here's a code fragment. It can be made to fit with your Receiver.
Some of it is inspired by Iterator.buffered. Note that peek is a misleading name, as it may block -- and so will hasNext.
// fairness enabled -- you probably want to preserve order...
// alternatively, disable fairness and increase buffer to be 'big enough'
private val queue = new java.util.concurrent.ArrayBlockingQueue[Option[T]](1, true)

// the following block provides you with a potentially blocking peek operation
// it should `queue.take` when the previous peeked head has been invalidated
// specifically, it will `queue.take` and block when the queue is empty
private var head: Option[T] = _
private var headDefined: Boolean = false

private def invalidateHead() { headDefined = false }

private def peek: Option[T] = {
  if (!headDefined) {
    head = queue.take()
    headDefined = true
  }
  head
}

def iterator = new Iterator[T] {

  // potentially blocking; only false upon taking `None`
  def hasNext = peek.isDefined

  // peeks and invalidates head; throws NoSuchElementException as appropriate
  def next: T = {
    val opt = peek; invalidateHead()
    if (opt.isEmpty) throw new NoSuchElementException
    else opt.get
  }
}
Alternative: Iteratees
Iterator-based solutions will generally involve more blocking. Conceptually, you could use continuations on the thread doing the iteration to avoid blocking the thread, but continuations mess with Scala's for-comprehensions, so no joy down that road.
Alternatively, you could consider an iteratee-based solution. Iteratees are different than iterators in that the consumer isn't responsible for advancing the iteration -- the producer is. With iteratees, the consumer basically folds over the entries pushed by the producer over time. Folding each next entry as it becomes available can take place in a thread pool, since the thread is relinquished after each fold completes.
You won't get nice for-syntax for iteration, and the learning curve is a little challenging, but if you feel confident using a foldLeft you'll end up with a non-blocking solution that does look reasonable on the eye.
To read more about iteratees, I suggest taking a peek at PlayFramework 2.X's iteratee reference. The documentation describes their stand-alone iteratee library, which is 100% usable outside the context of Play. Scalaz 7 also has a comprehensive iteratee library.
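As a tiny taste of the style, here is a sketch using Play's standalone iteratee library (Play 2.x; Scalaz's iteratees look different):

import play.api.libs.iteratee.{Enumerator, Iteratee}
import scala.concurrent.ExecutionContext.Implicits.global

// The Enumerator (producer) pushes elements into the Iteratee (consumer),
// which folds over them as they arrive; no thread sits blocked waiting for the next element.
val producer = Enumerator(1, 2, 3, 4, 5)
val consumer = Iteratee.foreach[Int](i => println("Entry#" + i))

producer |>>> consumer // a Future[Unit], completed when the producer is done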
IteratorWithSemaphorSolution
The first workaround solution that I proposed attached to the question.
I moved it here as an answer.
import java.util.concurrent.Semaphore

object Main extends App {
  val iterator = new ReceiverToIterator[Int]
  ProviderAPI.run(iterator)

  iterator
    .filter(_ % 2 == 0)
    .map("Entry#" + _)
    .foreach(println)
}

class ReceiverToIterator[T] extends ProviderAPI.Receiver[T] with Iterator[T] {

  var lastEntry: T = _
  var waitingToReceive = new Semaphore(1)
  var waitingToBeConsumed = new Semaphore(1)
  var eof = false

  waitingToReceive.acquire()

  def receive(entry: T) {
    println("ReceiverToIterator.receive(" + entry + "). START.")
    waitingToBeConsumed.acquire()
    lastEntry = entry
    waitingToReceive.release()
    println("ReceiverToIterator.receive(" + entry + "). END.")
  }

  def close() {
    println("ReceiverToIterator.close().")
    eof = true
    waitingToReceive.release()
  }

  def hasNext = {
    println("ReceiverToIterator.hasNext().START.")
    waitingToReceive.acquire()
    waitingToReceive.release()
    println("ReceiverToIterator.hasNext().END.")
    !eof
  }

  def next = {
    println("ReceiverToIterator.next().START.")
    waitingToReceive.acquire()
    if (eof) { throw new NoSuchElementException }
    val entryToReturn = lastEntry
    waitingToBeConsumed.release()
    println("ReceiverToIterator.next().END.")
    entryToReturn
  }
}
QueueIteratorSolution
The second workaround solution that I proposed attached to the question. I moved it here as an answer.
Solution using the BlockingQueue[Option[T]] based on the suggestion of nadavwr.
It allows the producer to continue producing up to queueCapacity before being blocked by the consumer.
I implement a QueueToIterator that uses an ArrayBlockingQueue with a given capacity.
BlockingQueue has a take() method, but no peek or hasNext, so I need an OptionNextToIterator as follows:
trait OptionNextToIterator[T] extends Iterator[T] {
  def getOptionNext: Option[T] // abstract
  def hasNext = { ... }
  def next = { ... }
}
Note: I am using the synchronized block inside OptionNextToIterator, and I am not sure it is totally correct
Solution:
import java.util.concurrent.ArrayBlockingQueue

object Main extends App {
  val receiverToIterator = new ReceiverToIterator[Int](queueCapacity = 3)
  ProviderAPI.run(receiverToIterator)
  Thread.sleep(3000) // test that ProviderAPI.run can produce 3 items ahead before being blocked by the consumer

  receiverToIterator.filter(_ % 2 == 0).map("Entry#" + _).foreach(println)
}

class ReceiverToIterator[T](val queueCapacity: Int = 1) extends ProviderAPI.Receiver[T] with QueueToIterator[T] {
  def receive(entry: T) { queuePut(entry) }
  def close() { queueClose() }
}

trait QueueToIterator[T] extends OptionNextToIterator[T] {
  val queueCapacity: Int
  val queue = new ArrayBlockingQueue[Option[T]](queueCapacity)
  var queueClosed = false

  def queuePut(entry: T) {
    if (queueClosed) { throw new IllegalStateException("The queue has already been closed.") }
    queue.put(Some(entry))
  }

  def queueClose() {
    queueClosed = true
    queue.put(None)
  }

  def getOptionNext = queue.take
}

trait OptionNextToIterator[T] extends Iterator[T] {
  def getOptionNext: Option[T]

  var answerReady: Boolean = false
  var eof: Boolean = false
  var element: T = _

  def hasNext = {
    prepareNextAnswerIfNecessary()
    !eof
  }

  def next = {
    prepareNextAnswerIfNecessary()
    if (eof) { throw new NoSuchElementException }
    val retVal = element
    answerReady = false
    retVal
  }

  def prepareNextAnswerIfNecessary() {
    if (answerReady) {
      return
    }
    synchronized {
      getOptionNext match {
        case None => eof = true
        case Some(e) => element = e
      }
      answerReady = true
    }
  }
}
PublishSubjectSolution
A very simple solution using PublishSubject from Netflix RxJava-Scala API:
// libraryDependencies += "com.netflix.rxjava" % "rxjava-scala" % "0.20.7"
import rx.lang.scala.subjects.PublishSubject

class MyReceiver[T] extends ProviderAPI.Receiver[T] {
  val channel = PublishSubject[T]()
  def receive(entry: T) { channel.onNext(entry) }
  def close() { channel.onCompleted() }
}

object Main extends App {
  val myReceiver = new MyReceiver[Int]()
  ProviderAPI.run(myReceiver)
  myReceiver.channel.filter(_ % 2 == 0).map("Entry#" + _).subscribe { n => println(n) }
}
ReceiverToTraversable
This Stack Overflow question came up when I wanted to list and process an SVN repository using the svnkit.com API as follows:
SvnList svnList = new SvnOperationFactory().createList();
svnList.setReceiver(new ISvnObjectReceiver<SVNDirEntry>() {
    public void receive(SvnTarget target, SVNDirEntry dirEntry) throws SVNException {
        // do something with dirEntry
    }
});
svnList.run();
The API used a callback function, and I wanted to use a functional style instead, as follows:
svnList
  .filter(e => "pom.xml".compareToIgnoreCase(e.getName()) == 0)
  .map(_.getURL)
  .map(getMavenArtifact)
  .foreach(insertArtifact)
I thought of having a class ReceiverToIterator[T] extends ProviderAPI.Receiver[T] with Iterator[T], but this required the svnkit API to run in another thread.
That's why I asked how to solve this problem with a ProviderAPI.run method that runs in a new thread. But that was not very wise: if I had explained the real case, someone might have found a better solution sooner.
Solution
If we tackle the real problem (so there is no need to use a thread for the svnkit), a simpler solution is to implement a scala.collection.Traversable instead of a scala.collection.Iterator. While Iterator requires next and hasNext defs, Traversable requires a foreach def, which is very similar to the svnkit callback!
Note that by using view, we make the transformers lazy, so elements are passed one by one through the whole chain to foreach(println). This allows processing an infinite collection.
object ProviderAPI {

  trait Receiver[T] {
    def receive(entry: T)
    def close()
  }

  // Later I found out that I don't need a thread
  def run(r: Receiver[Int]) {
    (0 to 9).foreach { i => r.receive(i); Thread.sleep(100) }
  }
}

object Main extends App {
  new ReceiverToTraversable[Int](r => ProviderAPI.run(r))
    .view
    .filter(_ % 2 == 0)
    .map("Entry#" + _)
    .foreach(println)
}

class ReceiverToTraversable[T](val runProducer: (ProviderAPI.Receiver[T] => Unit)) extends Traversable[T] {
  override def foreach[U](f: (T) => U) = {
    object MyReceiver extends ProviderAPI.Receiver[T] {
      def receive(entry: T) = f(entry)
      def close() = {}
    }
    runProducer(MyReceiver)
  }
}