Is there a FIFO stream in Scala? - scala

I'm looking for a FIFO stream in Scala, i.e., something that provides the functionality of
immutable.Stream (a stream that can be finite and memoizes the elements that have already been read)
mutable.Queue (which allows elements to be added to the FIFO)
The stream should be closable and should block access to the next element until the element has been added or the stream has been closed.
Actually I'm a bit surprised that the collection library does not (seem to) include such a data structure, since it is IMO quite a classic one.
My questions:
1) Did I overlook something? Is there already a class providing this functionality?
2) OK, if it's not included in the collection library then it might be just a trivial combination of existing collection classes. However, I tried to find this trivial code but my implementation still looks quite complex for such a simple problem. Is there a simpler solution for such a FifoStream?
import java.io.Closeable
import scala.collection.mutable.Queue

class FifoStream[T] extends Closeable {
  val queue = new Queue[Option[T]]

  lazy val stream = nextStreamElem

  private def nextStreamElem: Stream[T] = next() match {
    case Some(elem) => Stream.cons(elem, nextStreamElem)
    case None => Stream.empty
  }

  /** Returns next element in the queue (may wait for it to be inserted). */
  private def next() = {
    queue.synchronized {
      while (queue.isEmpty) queue.wait() // while, not if: guards against spurious wakeups
      queue.dequeue()
    }
  }

  /** Adds new elements to this stream. */
  def enqueue(elems: T*) {
    queue.synchronized {
      queue.enqueue(elems.map{Some(_)}: _*)
      queue.notify()
    }
  }

  /** Closes this stream. */
  def close() {
    queue.synchronized {
      queue.enqueue(None)
      queue.notify()
    }
  }
}
Paradigmatic's solution (slightly modified)
Thanks for your suggestions. I slightly modified paradigmatic's solution so that toStream returns an immutable stream (allowing for repeatable reads), which fits my needs. Just for completeness, here is the code:
import collection.JavaConversions._
import java.util.concurrent.{LinkedBlockingQueue, BlockingQueue}

class FIFOStream[A]( private val queue: BlockingQueue[Option[A]] = new LinkedBlockingQueue[Option[A]]() ) {
  lazy val toStream: Stream[A] = queue2stream
  private def queue2stream: Stream[A] = queue take match {
    case Some(a) => Stream cons ( a, queue2stream )
    case None    => Stream empty
  }
  def close() = queue add None
  def enqueue( as: A* ) = queue addAll as.map( Some(_) )
}

In Scala, streams are "functional iterators". People expect them to be pure (no side effects) and immutable. In your case, every time you iterate over the stream you modify the queue (so it's not pure). This can create a lot of misunderstandings, because iterating over the same stream twice will yield two different results.
That being said, you should use Java's BlockingQueues rather than rolling your own implementation. They are considered well implemented in terms of safety and performance. Here is the cleanest code I can think of (using your approach):
import java.util.concurrent.{BlockingQueue, LinkedBlockingQueue}
import scala.collection.JavaConversions._

class FIFOStream[A]( private val queue: BlockingQueue[Option[A]] ) {
  def toStream: Stream[A] = queue take match {
    case Some(a) => Stream cons ( a, toStream )
    case None    => Stream empty
  }
  def close() = queue add None
  def enqueue( as: A* ) = queue addAll as.map( Some(_) )
}

object FIFOStream {
  def apply[A]() = new FIFOStream[A]( new LinkedBlockingQueue[Option[A]]() )
}

I'm assuming you're looking for something like java.util.concurrent.BlockingQueue?
Akka has a BoundedBlockingQueue implementation of this interface. There are of course the implementations available in java.util.concurrent.
You might also consider using Akka's actors for whatever it is you are doing. Use Actors to be notified of, or pushed, new events or messages instead of pulling.

1) It seems you're looking for a dataflow stream as seen in languages like Oz, which supports the producer-consumer pattern. Such a collection is not available in the collections API, but you could always create one yourself.
2) The dataflow stream relies on the concept of single-assignment variables (they don't have to be initialized at the point of declaration, and reading them prior to initialization blocks):
val x: Int           // single-assignment variable (hypothetical syntax)
startThread {
  println(x)
}
println("The other thread waits for the x to be assigned")
x = 1
It would be straightforward to implement such a stream if single-assignment (or dataflow) variables were supported in the language (see the link). Since they are not a part of Scala, you have to use the wait-synchronized-notify pattern just like you did.
Concurrent queues from Java can be used to achieve that as well, as the other user suggested.
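For completeness, here is a minimal sketch (an illustration, not taken from the answer) of such a single-assignment variable built with the wait/notify pattern; scala.concurrent.SyncVar in the standard library offers similar blocking get/put semantics:

// Illustrative single-assignment ("dataflow") variable: reads block until
// the value has been assigned, and it can be assigned at most once.
class DataflowVar[A] {
  private var value: Option[A] = None

  def get: A = synchronized {
    while (value.isEmpty) wait() // block readers until assignment
    value.get
  }

  def set(a: A): Unit = synchronized {
    require(value.isEmpty, "already assigned")
    value = Some(a)
    notifyAll() // wake up all blocked readers
  }
}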

Related

How to backpressure an ActorPublisher

I'm writing a few samples to understand Akka Streams and backpressure. I'm trying to see how a slow consumer backpressures an AkkaPublisher.
My code is as follows.
class DataPublisher extends ActorPublisher[Int] {
  import akka.stream.actor.ActorPublisherMessage._

  var items: List[Int] = List.empty

  def receive = {
    case s: String =>
      println(s"Producer buffer size ${items.size}")
      if (totalDemand == 0)
        items = items :+ s.toInt
      else
        onNext(s.toInt)

    case Request(demand) =>
      if (demand > items.size) {
        items foreach (onNext)
        items = List.empty
      }
      else {
        val (send, keep) = items.splitAt(demand.toInt)
        items = keep
        send foreach (onNext)
      }

    case other =>
      println(s"got other $other")
  }
}
and
Source.fromPublisher(ActorPublisher[Int](dataPublisherRef)).runWith(sink)
Where the sink is a Subscriber with a sleep to emulate a slow consumer, and the publisher keeps producing data regardless.
--EDIT--
My question is: when the demand is 0, the code programmatically buffers data. How can I make use of backpressure to slow down the publisher?
Something like
throttledSource().buffer(10, OverflowStrategy.backpressure).runWith(throttledSink())
This does not affect the publisher, and its buffer keeps growing.
Thanks,
Sajith
Don't use ActorPublisher
Firstly, don't use ActorPublisher - it is a very low-level and deprecated API. We decided to deprecate it because users should not be working at such a low level of abstraction in Akka Streams.
One of the tricky things is exactly what you're asking about -- handling backpressure is entirely in the hands of the developer writing the ActorPublisher if they use this API. So you have to receive the Request(n) messages and make sure that you never signal more elements than you got requests for. This behaviour is specified in the Reactive Streams Specification which you then have to implement correctly. Basically, you're exposed to all the complexities of Reactive Streams (which is a full specification, with many edge cases -- disclaimer: I was/am part of developing Reactive Streams as well as Akka Streams).
Showing how back-pressure manifests in GraphStage
Secondly, to build custom stages you should be using the API designed for it: GraphStage. Please note that such a stage is also pretty low level. Normally users of Akka Streams don't need to write custom stages; however, it is absolutely expected and fine to write your own stages if they implement some logic that the built-in stages don't provide.
Here's a simplified Filter implementation from the Akka codebase:
case class Filter[T](p: T ⇒ Boolean) extends SimpleLinearGraphStage[T] {
  override def initialAttributes: Attributes = DefaultAttributes.filter

  override def toString: String = "Filter"

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) with OutHandler with InHandler {
      override def onPush(): Unit = {
        val elem = grab(in)
        if (p(elem)) push(out, elem)
        else pull(in)
      }

      // this method will NOT be called, if the downstream has not signalled enough demand!
      // this method being NOT called is how back-pressure manifests in stages
      override def onPull(): Unit = pull(in)

      setHandlers(in, out, this)
    }
}
As you can see, instead of implementing the entire Reactive Streams logic and rules yourself (which is hard), you get simple callbacks like onPush and onPull. Akka Streams handles the demand management, and it will automatically call onPull if the downstream has signaled demand, and it will NOT call it, if there is no demand -- which would mean the downstream is applying backpressure to this stage.
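For everyday use you would not write such a stage at all: the built-in operator already implements this contract. A minimal sketch (assuming an implicit materializer is in scope):

import akka.stream.scaladsl.{Sink, Source}

// The built-in filter stage handles demand internally, so backpressure from
// a (possibly slow) Sink propagates upstream automatically.
Source(1 to 100)
  .filter(_ % 2 == 0)
  .runWith(Sink.foreach(println))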
As for buffering, this can be accomplished with an intermediate Flow.buffer:
val flowBuffer = Flow[Int].buffer(10, OverflowStrategy.dropHead)

Source
  .fromPublisher(ActorPublisher[Int](dataPublisherRef))
  .via(flowBuffer)
  .runWith(sink)

Scala concurrency on iterators as queues

I'm not really sure of the correct terminology for my problem, so feel free to provide me with the right terms.
Say I have a process A, which outputs an iterator (lazy evaluation)
This produces Iterator[A]
I then have another process B, which maps the events returning
Iterator[B]
This continues for several more processes
Iterator[A] -> Iterator[B] -> Iterator[C] -> ---
Now eventually I evaluate this stream into a List[Z].
This saves me the memory hit of having a List[A] -> List[B] -> List[C] etc
Now I want to improve performance by introducing parallelisation, but I don't want to parallelise the evaluation of each element across the iterators, but rather each iterator stack. So in this case a thread for process A fills a Queue[A] for Iterator[A], a thread for process B takes from Queue[A], applies whatever mapping, and then adds to Queue[B] for Iterator[B] to read from.
I have done this before in other languages by designing my own async queues, so I was wondering what Scala offers to solve this.
Here's a first-stab solution I made using an actor.
It's fully blocking, so maybe an implementation using futures could be developed.
import scala.actors.Actor
import scala.actors.Actor._

case class AsyncIterator[T](iterator: Iterator[T]) extends Iterator[T] {
  private val queue = new scala.collection.mutable.SynchronizedQueue[T]()
  private var end = !iterator.hasNext

  def hasNext(): Boolean = {
    if (end) false
    else if (!queue.isEmpty) true
    else hasNext
  }

  def next() = {
    while (queue.isEmpty) {
      if (end) throw new Exception("blah")
    }
    queue.dequeue()
  }

  private val producer: Actor = actor {
    loop {
      if (!iterator.hasNext) {
        end = true
        exit()
      }
      else {
        queue.enqueue(iterator.next)
      }
    }
  }
  producer.start()
}
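For reference, here is a sketch of the same idea without the busy-wait, backed by a bounded LinkedBlockingQueue and a plain thread (the names are made up for illustration; Option is used as the end-of-stream marker):

import java.util.concurrent.LinkedBlockingQueue

class QueuedIterator[T](source: Iterator[T], capacity: Int = 16) extends Iterator[T] {
  private val queue = new LinkedBlockingQueue[Option[T]](capacity)

  // Producer thread drains the source iterator into the bounded queue,
  // then enqueues None as the end-of-stream marker.
  new Thread(new Runnable {
    def run() = {
      source.foreach(t => queue.put(Some(t)))
      queue.put(None)
    }
  }).start()

  // One element of lookahead lets hasNext block instead of busy-waiting.
  private var lookahead: Option[T] = queue.take()

  def hasNext = lookahead.isDefined

  def next() = {
    val t = lookahead.getOrElse(throw new NoSuchElementException("next on empty iterator"))
    lookahead = queue.take()
    t
  }
}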
Since you're open to alternative languages, how about Go?
There was a discussion recently about how to construct an event-driven pipeline, which would achieve the same thing as you describe but in a completely different way.
It's arguably easier to think about and design an event pipeline than it is to reason about lazy iterators because it becomes a data flow system in which the key question at each stage is 'what does this stage do with a single entity?' rather than 'how can I iterate efficiently over many entities?'
Once an event-driven pipeline has been implemented, the question of how to make it concurrent or parallel is moot - you've already done it.

Is there an easy way to get a Stream as output of a RowParser?

Given rowParser of type RowParser[Photo], this is how you would parse a list of rows coming from a table photo, according to the code samples I have seen so far:
def getPhotos(album: Album): List[Photo] = DB.withConnection { implicit c =>
  SQL("select * from photo where album = {album}").on(
    'album -> album.id
  ).as(rowParser *)
}
Where the * operator creates a parser of type ResultSetParser[List[Photo]]. Now, I was wondering if it was equally possible to get a parser that yields a Stream (thinking that being more lazy is always better), but I only came up with this:
def getPhotos(album: Album): Stream[Photo] = DB.withConnection { implicit c =>
  SQL("select * from photo where album = {album}").on(
    'album -> album.id
  )() collect (rowParser(_) match { case Success(photo) => photo })
}
It works, but it seems overly complicated. I could of course just call toStream on the List I get from the first function, but my goal was to only apply rowParser on rows that are actually read. Is there an easier way to achieve this?
EDIT: I know that limit should be used in the query, if the number of rows of interest is known beforehand. I am also aware that, in many cases, you are going to use the whole result anyway, so being lazy will not improve performance. But there might be a case where you save a few cycles, e.g. if for some reason, you have search criteria that you cannot or do not want to express in SQL. So I thought it was odd that, given the fact that anorm provides a way to obtain a Stream of SqlRow, I didn't find a straightforward way to apply a RowParser on that.
I ended up creating my own stream method which corresponds to the list method:
def stream[A](p: RowParser[A]) = new ResultSetParser[Stream[A]] {
  def apply(rows: SqlParser.ResultSet): SqlResult[Stream[A]] = rows.headOption.map(p(_)) match {
    case None => Success(Stream.empty[A])
    case Some(Success(a)) => {
      val s: Stream[A] = a #:: rows.tail.flatMap(r => p(r) match {
        case Success(r) => Some(r)
        case _          => None
      })
      Success(s)
    }
    case Some(Error(msg)) => Error(msg)
  }
}
Note that the Play SqlResult can only be either Success/Error while each row can also be Success/Error. I handle this for the first row only, assuming the rest will be the same. This may or may not work for you.
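Usage then mirrors the list-based version from the question (a sketch reusing the rowParser and query assumed above):

def getPhotos(album: Album): Stream[Photo] = DB.withConnection { implicit c =>
  SQL("select * from photo where album = {album}").on(
    'album -> album.id
  ).as(stream(rowParser))
}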
You're better off making smaller (paged) queries using limit and offset.
Anorm would need some modification if you're going to keep your (large) result around in memory and stream it from there. Then the other concern would be the new memory requirements for your JVM. And how would you deal with caching on the service level? See, previously you could easily cache something like photos?page=1&size=10, but now you just have photos, and the caching technology would have no idea what to do with the stream.
Even worse (and possibly at the JDBC level), you could wrap a Stream around limited and offset-ed execute statements, making multiple calls to the database behind the scenes. But this sounds like it would need a fair bit of work to port the Stream code that Scala generates to Java land (to work with Groovy, JRuby, etc.), and then get it approved for the JDBC 5 or 6 roadmap. This idea will probably be shunned as being too complicated, which it is.
You could wrap Stream around your entire DAO (where the limit and offset trickery would happen), but this almost sounds like more trouble than it's worth :-)
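A rough sketch of what such a paged query might look like (parameter names are made up, and the LIMIT/OFFSET syntax may need adapting to your database):

def getPhotosPage(album: Album, page: Int, pageSize: Int): List[Photo] =
  DB.withConnection { implicit c =>
    SQL("select * from photo where album = {album} limit {limit} offset {offset}").on(
      'album  -> album.id,
      'limit  -> pageSize,
      'offset -> page * pageSize
    ).as(rowParser *)
  }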
I ran into a similar situation, but hit a stack overflow exception when the built-in anorm function to convert to Streams attempted to parse the result set.
In order to get around this I elected to abandon the anorm ResultSetParser paradigm, and fall back to the java.sql.ResultSet object.
I wanted to use anorm's internal classes for parsing result set rows but, ever since version 2.4, all of the pertinent classes and methods have been made private to their package, and several other methods that would have been more straightforward to use have been deprecated.
I used a combination of Promises and Futures to work around the ManagedResource that anorm now returns. I avoided all deprecated functions.
import anorm._
import java.sql.ResultSet
import scala.concurrent._

def SqlStream[T](sql: SqlQuery)(parse: ResultSet => T)(implicit ec: ExecutionContext): Future[Stream[T]] = {
  val conn = db.getConnection()
  val mr = sql.preparedStatement(conn, false)
  val p = Promise[Unit]()
  val p2 = Promise[ResultSet]()

  // Keep the ManagedResource open until the stream has been fully consumed,
  // then close the connection.
  Future {
    mr.map({ stmt =>
      p2.success(stmt.executeQuery)
      Await.ready(p.future, duration.Duration.Inf)
    }).acquireAndGet(identity).andThen { case _ => conn.close() }
  }

  def _stream(rs: ResultSet): Stream[T] = {
    if (rs.next()) parse(rs) #:: _stream(rs)
    else {
      p.success(())
      Stream.empty
    }
  }

  p2.future.map { rs =>
    rs.beforeFirst()
    _stream(rs)
  }
}
A rather trivial usage of this function would be something like this:
def getText(implicit ec: ExecutionContext): Future[Stream[String]] = {
  SqlStream(SQL("select FIELD from TABLE")) { rs => rs.getString("FIELD") }
}
There are, of course, drawbacks to this approach, however, this got around my problem and did not require inclusion of any other libraries.

Parallel file processing in Scala

Suppose I need to process files in a given folder in parallel. In Java I would create a FolderReader thread to read file names from the folder and a pool of FileProcessor threads. FolderReader reads file names and submits the file processing function (Runnable) to the pool executor.
In Scala I see two options:
create a pool of FileProcessor actors and schedule a file processing function with Actors.Scheduler.
create an actor for each file name while reading the file names.
Does it make sense? What is the best option?
Depending on what you're doing, it may be as simple as
for (file <- files.par) {
  // process the file
}
I strongly suggest keeping as far away from raw threads as you can. Luckily we have better abstractions which take care of what's happening underneath, and in your case it appears to me that you do not need to use actors (though you can); you can use a simpler abstraction called Futures. They are part of the Akka open source library, and I think in the future they will be part of the Scala standard library as well.
A Future[T] is simply something that will return a T in the future.
All you need to run a future is an implicit ExecutionContext, which you can derive from a Java executor service. Then you will be able to enjoy the elegant API and the fact that a future is a monad: you can transform collections into collections of futures, collect the results, and so on. I suggest you take a look at http://doc.akka.io/docs/akka/2.0.1/scala/futures.html
import java.util.concurrent.Executors
import akka.dispatch.{Await, ExecutionContext, Future}
import akka.util.duration._

object TestingFutures {
  implicit val executorService = Executors.newFixedThreadPool(20)
  implicit val executorContext = ExecutionContext.fromExecutorService(executorService)

  def testFutures(myList: List[String]): List[String] = {
    val listOfFutures: Future[List[String]] = Future.traverse(myList) { aString =>
      Future {
        aString.reverse
      }
    }
    val result: List[String] = Await.result(listOfFutures, 1 minute)
    result
  }
}
There's a lot going on here:
I am using Future.traverse, which receives as its first parameter an M[T] <: Traversable[T] and as its second parameter a T => Future[T] (or, if you prefer, a Function1[T, Future[T]]), and returns a Future[M[T]]
I am using the Future.apply method to create an anonymous class of type Future[T]
There are many other reasons to look at Akka futures.
Futures can be mapped because they are monads, i.e. you can chain future executions:
Future { 3 }.map { _ * 2 }.map { _.toString }
Futures have callbacks: future.onComplete, onSuccess, onFailure, andThen, etc.
Futures support not only traverse, but also for comprehensions
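For instance, a small sketch of those features (same Akka 2.0-era futures, assuming an implicit ExecutionContext is in scope):

// chaining with map
val f = Future { 3 }.map { _ * 2 }.map { _.toString }

// callbacks
f.onSuccess { case s => println("computed " + s) }
f.onFailure { case e => println("failed with " + e) }

// for comprehensions desugar to map/flatMap on futures
val sum = for {
  a <- Future { 1 }
  b <- Future { 2 }
} yield a + b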
Ideally you should use two actors. One for reading the list of files, and one for actually reading the file.
You start the process by simply sending a single "start" message to the first actor. The actor can then read the list of files, and send a message to the second actor. The second actor then reads the file and processes the contents.
Having multiple actors, which might seem complicated, is actually a good thing in the sense that you have a bunch of objects communicating with each other, like in a theoretical OO system.
Edit: you REALLY shouldn't be doing concurrent reading of a single file.
I was going to write up exactly what @Edmondo1984 did except he beat me to it. :) I second his suggestion in a big way. I'll also suggest that you read the documentation for Akka 2.0.2. As well, I'll give you a slightly more concrete example:
import akka.dispatch.{ExecutionContext, Future, Await}
import akka.util.duration._
import java.util.concurrent.Executors
import java.io.File

val execService = Executors.newCachedThreadPool()
implicit val execContext = ExecutionContext.fromExecutorService(execService)

val tmp = new File("/tmp/")
val files = tmp.listFiles()

val workers = files.map { f =>
  Future {
    f.getAbsolutePath()
  }
}.toSeq

val result = Future.sequence(workers)

result.onSuccess {
  case filenames =>
    filenames.foreach { fn =>
      println(fn)
    }
}

// Artificial just to make things work for the example
Thread.sleep(100)
execContext.shutdown()
Here I use sequence instead of traverse, but the difference is going to depend on your needs.
Go with the Future, my friend; the Actor is just a more painful approach in this instance.
But if we use actors, what's wrong with that?
Say we have to read/write some property files. Here is my Java example, but still with Akka actors.
Let's say we have an actor ActorFile that represents one file. Hmm... probably it cannot represent one file, right? (It would be nice if it could.) So then it represents several files, like a PropertyFilesActor.
Why not use something like this:
public class PropertyFilesActor extends UntypedActor {
  Map<String, String> filesContent = new LinkedHashMap<String, String>();

  { // here we should use real files of course
    filesContent.put("file1.xml", "");
    filesContent.put("file2.xml", "");
  }

  @Override
  public void onReceive(Object message) throws Exception {
    if (message instanceof WriteMessage) {
      WriteMessage writeMessage = (WriteMessage) message;
      String content = filesContent.get(writeMessage.fileName);
      String newContent = content + writeMessage.stringToWrite;
      filesContent.put(writeMessage.fileName, newContent);
    }
    else if (message instanceof ReadMessage) {
      ReadMessage readMessage = (ReadMessage) message;
      String currentContent = filesContent.get(readMessage.fileName);
      // Send the current content back to the sender
      getSender().tell(new ReadMessage(readMessage.fileName, currentContent), getSelf());
    }
    else unhandled(message);
  }
}
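The WriteMessage and ReadMessage types are not shown in the answer; as a purely illustrative sketch, in Scala they could simply be case classes with the fields the handler reads:

// Hypothetical message types matching the fields used in onReceive above
case class WriteMessage(fileName: String, stringToWrite: String)
case class ReadMessage(fileName: String, content: String = "")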
...a message will carry a parameter (the fileName).
It has its own in-box, accepting messages like:
WriteLine(fileName, string)
ReadLine(fileName, string)
Those messages will be stored in the in-box in order, one after another. The actor does its work by receiving messages from the box - storing/reading - and meanwhile sends feedback (sender ! message) back.
Thus, let's say we write to the property file and then show the content on a web page. We can start rendering the page (right after we send the message to store data to the file) and, as soon as we receive the feedback, update part of the page with data from the just-updated file (via Ajax).
Well, grab your files and stick them in a parallel structure
scala> new java.io.File("/tmp").listFiles.par
res0: scala.collection.parallel.mutable.ParArray[java.io.File] = ParArray( ... )
Then...
scala> res0 map (_.length)
res1: scala.collection.parallel.mutable.ParArray[Long] = ParArray(4943, 1960, 4208, 103266, 363 ... )

Can I transform this asynchronous java network API into a monadic representation (or something else idiomatic)?

I've been given a java api for connecting to and communicating over a proprietary bus using a callback based style. I'm currently implementing a proof-of-concept application in scala, and I'm trying to work out how I might produce a slightly more idiomatic scala interface.
A typical (simplified) application might look something like this in Java:
DataType type = new DataType();
BusConnector con = new BusConnector();

con.waitForData(type.getClass()).addListener(new IListener<DataType>() {
  public void onEvent(DataType t) {
    // some stuff happens in here, and then we need some more data
    con.waitForData(anotherType.getClass()).addListener(new IListener<anotherType>() {
      public void onEvent(anotherType t) {
        // we do more stuff in here, and so on
      }
    });
  }
});

// now we've got the behaviours set up we call
con.start();
In Scala I can obviously define an implicit conversion from (T => Unit) into an IListener, which certainly makes things a bit simpler to read:
implicit def func2Ilistener[T](f: (T => Unit)): IListener[T] = new IListener[T] {
  def onEvent(t: T) = f(t)
}
val con = new BusConnector

con.waitForData(DataType.getClass).addListener( (d: DataType) => {
  // some stuff, then another wait for stuff
  con.waitForData(OtherType.getClass).addListener( (o: OtherType) => {
    // etc
  })
})
Looking at this reminded me of both Scalaz promises and F# async workflows.
My question is this:
Can I convert this into either a for comprehension or something similarly idiomatic? (I feel like this should map to actors reasonably well too.)
Ideally I'd like to see something like:
for (
  d <- con.waitForData(DataType.getClass);
  val _ = doSomethingWith(d);
  o <- con.waitForData(OtherType.getClass)
  // etc
)
If you want to use a for comprehension for this, I'd recommend looking at the Scala Language Specification for how for comprehensions are expanded to map, flatMap, etc. This will give you some clues about how this structure relates to what you've already got (with nested calls to addListener). You can then add an implicit conversion from the return type of the waitForData call to a new type with the appropriate map, flatMap, etc methods that delegate to addListener.
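Roughly, assuming such an implicit conversion adds foreach/flatMap methods that delegate to addListener, the for comprehension from the question would expand into something very close to the nested-callback version (an illustrative sketch, not actual library code):

con.waitForData(DataType.getClass).foreach { d =>
  doSomethingWith(d)
  con.waitForData(OtherType.getClass).foreach { o =>
    // etc
  }
}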
Update
I think you can use scala.Responder[T] from the standard library:
Assuming the class with the addListener is called Dispatcher[T]:
trait Dispatcher[T] {
  def addListener(listener: IListener[T]): Unit
}

trait IListener[T] {
  def onEvent(t: T): Unit
}

implicit def dispatcher2Responder[T](d: Dispatcher[T]): Responder[T] = new Responder[T] {
  def respond(k: T => Unit) = d.addListener(new IListener[T] {
    def onEvent(t: T) = k(t)
  })
}
You can then use this as requested
for (
  d <- con.waitForData(DataType.getClass);
  val _ = doSomethingWith(d);
  o <- con.waitForData(OtherType.getClass)
  // etc
) ()
See the Scala wiki and this presentation on using Responder[T] for a Comet chat application.
I have very little Scala experience, but if I were implementing something like this I'd look to leverage the actor mechanism rather than using callback listener classes. Actors were made for asynchronous communication, and they nicely separate those different parts of your app for you. You can also have them send messages to multiple listeners.
We'll have to wait for a "real" Scala programmer to flesh this idea out, though. ;)