Scala concurrency on iterators as queues - scala

I'm not really sure of the correct terminology for my problem, so feel free to provide me with the right terms.
Say I have a process A which outputs an iterator (lazy evaluation); this produces an Iterator[A].
I then have another process B, which maps the events, returning an Iterator[B].
This continues for several more processes:
Iterator[A] -> Iterator[B] -> Iterator[C] -> ...
Now eventually I evaluate this stream into a List[Z].
This saves me the memory hit of materializing a List[A] -> List[B] -> List[C], and so on.
Now I want to improve performance by introducing parallelisation, but I don't want to parallelise the evaluation of each element across the iterators, but rather each iterator stack. So in this case a thread for process A fills a Queue[A] for Iterator[A], a thread for process B takes from Queue[A], applies whatever mapping, and then adds to Queue[B] for Iterator[B] to read from.
Now I have done this before in other languages by designing my own async queues; I was wondering what Scala has to solve this.

Here's a first-stab solution I made using an actor.
It's fully blocking, so maybe an implementation using futures could be developed.
import scala.actors.Actor
import scala.actors.Actor._

case class AsyncIterator[T](iterator: Iterator[T]) extends Iterator[T] {
  // Buffer filled by the producer actor, drained by whoever consumes this iterator.
  private val queue = new scala.collection.mutable.SynchronizedQueue[T]()
  @volatile private var end = !iterator.hasNext

  def hasNext: Boolean = {
    if (!queue.isEmpty) true // check the buffer before the end flag, or we drop the tail
    else if (end) false
    else hasNext             // spin until the producer catches up
  }

  def next(): T = {
    while (queue.isEmpty) {  // busy-wait for the producer
      if (end) throw new NoSuchElementException("next on exhausted AsyncIterator")
    }
    queue.dequeue()
  }

  private val producer: Actor = actor {
    loop {
      if (!iterator.hasNext) {
        end = true
        exit()
      } else {
        queue.enqueue(iterator.next())
      }
    }
  }

  producer.start()
}
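A hypothetical usage of the class above, just to show how it composes into the pipeline from the question (the stage values are illustrative): each AsyncIterator wraps one stage, so every stage's producer actor runs concurrently with the stages downstream of it.
// Hypothetical pipeline mirroring the question's A -> B -> Z chain.
val a: Iterator[Int]    = Iterator.range(0, 1000)          // process A
val b: Iterator[String] = AsyncIterator(a).map(_.toString) // process B, fed by A's actor
val z: List[String]     = AsyncIterator(b).toList          // final evaluation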

Since you're open to alternative languages, how about Go?
There was a discussion recently about how to construct an event-driven pipeline, which would achieve the same thing as you describe but in a completely different way.
It's arguably easier to think about and design an event pipeline than it is to reason about lazy iterators because it becomes a data flow system in which the key question at each stage is 'what does this stage do with a single entity?' rather than 'how can I iterate efficiently over many entities?'
Once an event-driven pipeline has been implemented, the question of how to make it concurrent or parallel is moot - you've already done it.
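The same channel-style pipeline can be sketched in Scala with plain java.util.concurrent blocking queues. This is only an illustration of the idea, not code from the discussion; the stage function and the None-as-poison-pill convention are made up:
import java.util.concurrent.LinkedBlockingQueue

// One thread per stage: pull from the upstream queue, apply f, push downstream.
// None is the (made-up) end-of-stream marker that each stage forwards.
def stage[A, B](in: LinkedBlockingQueue[Option[A]],
                out: LinkedBlockingQueue[Option[B]])(f: A => B): Thread = {
  val t = new Thread(new Runnable {
    def run(): Unit = {
      var done = false
      while (!done) in.take() match {
        case Some(a) => out.put(Some(f(a)))
        case None    => out.put(None); done = true
      }
    }
  })
  t.start()
  t
}
Chaining three such stages reproduces the Queue[A] -> Queue[B] design from the question, with each mapping running on its own thread.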

Related

Reentrant locks within monads in Scala

A colleague of mine stated the following, about using a Java ReentrantReadWriteLock in some Scala code:
Acquiring the lock here is risky. It's "reentrant", but that internally depends on the thread context. F may run different stages of the same computation in different threads. You can easily cause a deadlock.
F here refers to some effectful monad.
Basically what I'm trying to do is to acquire the same reentrant lock twice, within the same monad.
Could somebody clarify why this could be a problem?
The code is split into two files. The outermost one:
val lock: Resource[F, Unit] = for {
  // some other resource
  _ <- store.writeLock
} yield ()

lock.use { _ =>
  for {
    // stuff
    _ <- EitherT(store.doSomething())
    // other stuff
  } yield ()
}
Then, in the store:
import java.util.concurrent.locks.{Lock, ReentrantReadWriteLock}
import cats.effect.{Resource, Sync}

private def lockAsResource[F[_]](lock: Lock)(implicit F: Sync[F]): Resource[F, Unit] =
  Resource.make {
    F.delay(lock.lock())
  } { _ =>
    F.delay(lock.unlock())
  }

private val lock = new ReentrantReadWriteLock
val writeLock: Resource[F, Unit] = lockAsResource(lock.writeLock())

def doSomething(): F[Either[Throwable, Unit]] = writeLock.use { _ =>
  // etc etc
}
The writeLock in the two pieces of code is the same, and it's a cats.effect.Resource[F, Unit] wrapping a ReentrantReadWriteLock's writeLock. There are some reasons why I was writing the code this way, so I wouldn't want to dig into that. I would just like to understand why (according to my colleague, at least), this could potentially break stuff.
Also, I'd like to know if there is some alternative in Scala that would allow something like this without the risk for deadlocks.
IIUC your question:
You expect that, for each interaction with the Resource, the lock.lock and lock.unlock actions happen on the same thread.
1) There is no guarantee of that at all, since you are using an arbitrary effect F here.
It's possible to write an implementation of F that executes every action in a new thread.
2) Even if we assume that F is IO, someone could call IO.shift in the body of doSomething, so the subsequent actions, including unlock, would happen on another thread. It's probably not possible with the current signature of doSomething, but you get the idea.
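To make point 2 concrete, here is a minimal cats-effect-2-style sketch (the extra thread pool and the printlns are purely illustrative) of a single monadic computation hopping threads mid-flight. Since a ReentrantReadWriteLock throws IllegalMonitorStateException when unlock is called from a thread that doesn't hold the lock, an acquire/release pair straddling such a shift is exactly how the Resource above can break:
import java.util.concurrent.Executors
import cats.effect.IO
import scala.concurrent.ExecutionContext

val otherPool = ExecutionContext.fromExecutor(Executors.newCachedThreadPool())

val demo: IO[Unit] = for {
  _ <- IO(println(s"'lock' would run on ${Thread.currentThread.getName}"))
  _ <- IO.shift(otherPool) // the rest of the computation moves to another pool
  _ <- IO(println(s"'unlock' would run on ${Thread.currentThread.getName}"))
} yield ()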
Also, I'd like to know if there is some alternative in Scala that would allow something like this without the risk for deadlocks.
You can take a look at scalaz zio STM.

How to backpressure an ActorPublisher

I'm writing a few samples to understand akka streams and backpressure. I'm trying to see how a slow consumer backpressures an ActorPublisher.
My code is as follows.
class DataPublisher extends ActorPublisher[Int] {
  import akka.stream.actor.ActorPublisherMessage._

  var items: List[Int] = List.empty

  def receive = {
    case s: String =>
      println(s"Producer buffer size ${items.size}")
      if (totalDemand == 0)
        items = items :+ s.toInt
      else
        onNext(s.toInt)

    case Request(demand) =>
      if (demand > items.size) {
        items foreach (onNext)
        items = List.empty
      } else {
        val (send, keep) = items.splitAt(demand.toInt)
        items = keep
        send foreach (onNext)
      }

    case other =>
      println(s"got other $other")
  }
}
and
Source.fromPublisher(ActorPublisher[Int](dataPublisherRef)).runWith(sink)
Where the sink is a Subscriber with a sleep to emulate a slow consumer, and the publisher keeps producing data regardless.
--EDIT--
My question is: when the demand is 0, my code programmatically buffers data. How can I make use of backpressure to slow down the publisher?
Something like
throttledSource().buffer(10, OverflowStrategy.backpressure).runWith(throttledSink())
This will not affect the publisher, and its buffer keeps growing.
Thanks,
Sajith
Don't use ActorPublisher
Firstly, don't use ActorPublisher - it is a very low-level and deprecated API. We decided to deprecate it because users should not be working at such a low level of abstraction in Akka Streams.
One of the tricky things is exactly what you're asking about -- handling backpressure is entirely in the hands of the developer writing the ActorPublisher if they use this API. So you have to receive the Request(n) messages and make sure that you never signal more elements than you got requests for. This behaviour is specified in the Reactive Streams Specification which you then have to implement correctly. Basically, you're exposed to all the complexities of Reactive Streams (which is a full specification, with many edge cases -- disclaimer: I was/am part of developing Reactive Streams as well as Akka Streams).
Showing how back-pressure manifests in GraphStage
Secondly, to build custom stages you should be using the API designed for that: GraphStage. Please note that such a stage is also pretty low-level. Normally users of Akka Streams don't need to write custom stages; however, it is absolutely expected and fine to write your own stages if they implement some logic that the built-in stages don't provide.
Here's a simplified Filter implementation from the Akka codebase:
case class Filter[T](p: T ⇒ Boolean) extends SimpleLinearGraphStage[T] {
  override def initialAttributes: Attributes = DefaultAttributes.filter
  override def toString: String = "Filter"

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) with OutHandler with InHandler {
      override def onPush(): Unit = {
        val elem = grab(in)
        if (p(elem)) push(out, elem)
        else pull(in)
      }

      // this method will NOT be called, if the downstream has not signalled enough demand!
      // this method being NOT called is how back-pressure manifests in stages
      override def onPull(): Unit = pull(in)

      setHandlers(in, out, this)
    }
}
As you can see, instead of implementing the entire Reactive Streams logic and rules yourself (which is hard), you get simple callbacks like onPush and onPull. Akka Streams handles the demand management, and it will automatically call onPull if the downstream has signaled demand, and it will NOT call it, if there is no demand -- which would mean the downstream is applying backpressure to this stage.
This can be accomplished with an intermediate Flow.buffer:
val flowBuffer = Flow[Int].buffer(10, OverflowStrategy.dropHead)

Source
  .fromPublisher(ActorPublisher[Int](dataPublisherRef))
  .via(flowBuffer)
  .runWith(sink)
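Neither answer names it here, but a commonly suggested non-deprecated way to push elements into a stream while honouring backpressure is Source.queue. A minimal sketch, assuming an implicit ActorSystem and materializer are in scope (the buffer size and the throttled stand-in for a slow consumer are illustrative):
import akka.stream.scaladsl.{Keep, Sink, Source}
import akka.stream.{OverflowStrategy, ThrottleMode}
import scala.concurrent.duration._

// queue.offer(n) returns a Future that completes only when the stream has
// capacity, so a fast producer is slowed to the consumer's pace instead of
// buffering without bound.
val (queue, done) = Source
  .queue[Int](bufferSize = 10, OverflowStrategy.backpressure)
  .throttle(1, 100.millis, 1, ThrottleMode.Shaping) // stand-in for a slow sink
  .toMat(Sink.foreach(println))(Keep.both)
  .run()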

Scala: Thread safe mutable lazy Iterator with append

For an immutable flavour, Iterator does the job.
val x = Iterator.fill(100000)(someFn)
Now I want to implement a mutable version of Iterator, with three guarantees:
thread-safe on all transformations (fold, foldLeft, ...) and append
lazy evaluated
traversable only once! Once used, an object from this Iterator should be destroyed.
Is there an existing implementation to give me these guarantees? Any library or framework example would be great.
Update
To illustrate the desired behaviour.
class SomeThing {}

class Test(val list: Iterator[SomeThing]) {
  def add(thing: SomeThing): Test =
    new Test(list ++ Iterator(thing))
}

new Test(Iterator.empty).add(new SomeThing).add(new SomeThing)
In this example, SomeThing is an expensive construct; it needs to be lazy.
Re-iterating over list is never required, so Iterator is a good fit.
This is supposed to asynchronously and lazily sequence 10 million SomeThing instances without depleting the executor (a cached thread pool executor) or running out of memory.
You don't need a mutable Iterator for this, just daisy-chain the immutable form:
class SomeThing {}

case class Test(list: Iterator[SomeThing]) {
  def add(thing: => SomeThing) = Test(list ++ Iterator(thing))
}

Test(Iterator.empty).add(new SomeThing).add(new SomeThing)
Although you don't really need the extra boilerplate of Test here:
Iterator(new SomeThing) ++ Iterator(new SomeThing)
Note that Iterator.++ takes a by-name param, so the ++ operation is already lazy.
You might also want to try this, to avoid building intermediate Iterators:
Iterator.continually(new SomeThing) take 2
UPDATE
If you don't know the size in advance, then I'll often use a tactic like this:
def mkSomething = if(cond) Some(new Something) else None
Iterator.continually(mkSomething) takeWhile (_.isDefined) map { _.get }
The trick is to have your generator function wrap its output in an Option, which then gives you a way to flag that the iteration is finished by returning None.
Of course... if you're really pushing the boat out, you can even use the dreaded null:
def mkSomething = if(cond) { new Something } else null
Iterator.continually(mkSomething) takeWhile (_ != null)
Seems like you need to hide the fact that the iterator is mutable but at the same time allow it to grow mutably. What I'm going to propose is the same sort of trick I've used to speed up ::: in the past:
abstract class AppendableIterator[A] extends Iterator[A] {
  protected var inner: Iterator[A]

  def hasNext = inner.hasNext
  def next() = inner.next()

  def append(that: Iterator[A]) = synchronized {
    inner = new JoinedIterator(inner, that)
  }
}

// You might need to add some more things, this is a skeleton
class JoinedIterator[A](first: Iterator[A], second: Iterator[A]) extends Iterator[A] {
  def hasNext = first.hasNext || second.hasNext
  def next() =
    if (first.hasNext) first.next()
    else if (second.hasNext) second.next()
    else Iterator.empty.next() // throws NoSuchElementException, as an exhausted Iterator should
}
So what you're really doing is leaving the Iterator at whatever place in its iteration it might be, while preserving the thread safety of the append by "joining in" another Iterator non-destructively. You avoid the need to recompute the two together because you never actually force them through a CanBuildFrom.
This is also a generalization of just adding one item. You can always wrap some A in an Iterator[A] of one element if you so choose.
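A hypothetical concrete subclass (the name GrowableIterator is made up), just to show how the skeleton above is meant to be used:
class GrowableIterator[A](initial: Iterator[A]) extends AppendableIterator[A] {
  protected var inner: Iterator[A] = initial
}

val it = new GrowableIterator(Iterator(1, 2, 3))
it.append(Iterator(4, 5)) // safe to call while another thread is consuming
println(it.toList)        // List(1, 2, 3, 4, 5)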
Have you looked at the mutable.ParIterable in the collection.parallel package?
To access an iterator over elements you can do something like
val x = ParIterable.fill(100000)(someFn).iterator
From the docs:
Parallel operations are implemented with divide and conquer style algorithms that parallelize well. The basic idea is to split the collection into smaller parts until they are small enough to be operated on sequentially.
...
The higher-order functions passed to certain operations may contain side-effects. Since implementations of bulk operations may not be sequential, this means that side-effects may not be predictable and may produce data-races, deadlocks or invalidation of state if care is not taken. It is up to the programmer to either avoid using side-effects or to use some form of synchronization when accessing mutable data.

Parallel file processing in Scala

Suppose I need to process files in a given folder in parallel. In Java I would create a FolderReader thread to read file names from the folder and a pool of FileProcessor threads. FolderReader reads file names and submits the file processing function (Runnable) to the pool executor.
In Scala I see two options:
create a pool of FileProcessor actors and schedule a file processing function with Actors.Scheduler.
create an actor for each file name while reading the file names.
Does it make sense? What is the best option?
Depending on what you're doing, it may be as simple as
for (file <- files.par) {
  // process the file
}
I suggest with all my energies that you keep as far away from threads as you can. Luckily we have better abstractions which take care of what's happening below, and in your case it appears to me that you do not need to use actors (while you can); you can use a simpler abstraction, called Futures. They are part of the Akka open-source library, and I think in the future they will be part of the Scala standard library as well.
A Future[T] is simply something that will return a T in the future.
All you need to run a future is an implicit ExecutionContext, which you can derive from a Java executor service. Then you will be able to enjoy the elegant API and the fact that a future is a monad, to transform collections into collections of futures, collect the results, and so on. I suggest you take a look at http://doc.akka.io/docs/akka/2.0.1/scala/futures.html
import java.util.concurrent.Executors
import akka.dispatch.{Await, ExecutionContext, Future}
import akka.util.duration._

object TestingFutures {
  implicit val executorService = Executors.newFixedThreadPool(20)
  implicit val executorContext = ExecutionContext.fromExecutorService(executorService)

  def testFutures(myList: List[String]): List[String] = {
    // Run each transformation on the pool, collecting the results
    // back into a single Future[List[String]].
    val listOfFutures: Future[List[String]] = Future.traverse(myList) { aString =>
      Future {
        aString.reverse
      }
    }
    val result: List[String] = Await.result(listOfFutures, 1 minute)
    result
  }
}
There's a lot going on here:
I am using Future.traverse, which takes an M[T] <: Traversable[T] as its first parameter and a T => Future[T] (that is, a Function1[T, Future[T]]) as its second, and returns a Future[M[T]].
I am using the Future.apply method to create an anonymous class of type Future[T].
There are many other reasons to look at Akka futures.
Futures can be mapped because they are monads, i.e. you can chain Future executions:
Future { 3 }.map { _ * 2 }.map { _.toString }
Futures have callbacks: future.onComplete, onSuccess, onFailure, andThen, etc.
Futures support not only traverse, but also for-comprehensions.
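For example, a minimal sketch of the for-comprehension style, assuming the same implicit ExecutionContext as the snippet above:
val combined: Future[Int] = for {
  a <- Future { 1 + 1 } // runs on the pool
  b <- Future { 2 + 2 } // created (and run) only after a completes
} yield a + b           // completes with 6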
Ideally you should use two actors. One for reading the list of files, and one for actually reading the file.
You start the process by simply sending a single "start" message to the first actor. The actor can then read the list of files, and send a message to the second actor. The second actor then reads the file and processes the contents.
Having multiple actors, which might seem complicated, is actually a good thing in the sense that you have a bunch of objects communicating with each other, like in a theoretical OO system.
Edit: you REALLY shouldn't be doing concurrent reading of a single file.
I was going to write up exactly what @Edmondo1984 did, except he beat me to it. :) I second his suggestion in a big way. I'll also suggest that you read the documentation for Akka 2.0.2. As well, I'll give you a slightly more concrete example:
import akka.dispatch.{ExecutionContext, Future, Await}
import akka.util.duration._
import java.util.concurrent.Executors
import java.io.File

val execService = Executors.newCachedThreadPool()
implicit val execContext = ExecutionContext.fromExecutorService(execService)

val tmp = new File("/tmp/")
val files = tmp.listFiles()

val workers = files.map { f =>
  Future {
    f.getAbsolutePath()
  }
}.toSeq

val result = Future.sequence(workers)

result.onSuccess {
  case filenames =>
    filenames.foreach { fn =>
      println(fn)
    }
}

// Artificial just to make things work for the example
Thread.sleep(100)
execContext.shutdown()
Here I use sequence instead of traverse, but the difference is going to depend on your needs.
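For reference, the traverse form fuses the map and the sequence into a single step (same imports as above):
// Equivalent to files.map(f => Future(...)).toSeq followed by Future.sequence
val result2 = Future.traverse(files.toSeq) { f =>
  Future { f.getAbsolutePath() }
}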
Go with the Future, my friend; the Actor is just a more painful approach in this instance.
But if we use actors, what's wrong with that?
Say we have to read/write some property files. Here is my Java example, but still with Akka actors.
Let's say we have an actor ActorFile that represents one file. Hm... probably it cannot represent one file, right? (It would be nice if it could.) So it represents several files instead, like a PropertyFilesActor; then:
Why not use something like this:
public class PropertyFilesActor extends UntypedActor {
    Map<String, String> filesContent = new LinkedHashMap<String, String>();

    { // here we should use real files of course
        filesContent.put("file1.xml", "");
        filesContent.put("file2.xml", "");
    }

    @Override
    public void onReceive(Object message) throws Exception {
        if (message instanceof WriteMessage) {
            WriteMessage writeMessage = (WriteMessage) message;
            String content = filesContent.get(writeMessage.fileName);
            String newContent = content + writeMessage.stringToWrite;
            filesContent.put(writeMessage.fileName, newContent);
        }
        else if (message instanceof ReadMessage) {
            ReadMessage readMessage = (ReadMessage) message;
            String currentContent = filesContent.get(readMessage.fileName);
            // Send the current content back to the sender
            getSender().tell(new ReadMessage(readMessage.fileName, currentContent), getSelf());
        }
        else unhandled(message);
    }
}
...a message will go with a parameter (fileName).
It has its own in-box, accepting messages like:
WriteLine(fileName, string)
ReadLine(fileName, string)
Those messages are stored in the in-box in order, one after another. The actor does its work by receiving messages from the box (storing/reading) and meanwhile sends feedback to the sender with sender ! message.
Thus, say we write to the property file and then show the content on a web page: we can start rendering the page right after we send the message to store the data, and as soon as we receive the feedback, update part of the page with the data from the just-updated file (via Ajax).
Well, grab your files and stick them in a parallel structure
scala> new java.io.File("/tmp").listFiles.par
res0: scala.collection.parallel.mutable.ParArray[java.io.File] = ParArray( ... )
Then...
scala> res0 map (_.length)
res1: scala.collection.parallel.mutable.ParArray[Long] = ParArray(4943, 1960, 4208, 103266, 363 ... )

Is there a FIFO stream in Scala?

I'm looking for a FIFO stream in Scala, i.e., something that provides the functionality of
immutable.Stream (a stream that can be finite and memorizes the elements that have already been read)
mutable.Queue (which allows for added elements to the FIFO)
The stream should be closable and should block access to the next element until the element has been added or the stream has been closed.
Actually I'm a bit surprised that the collection library does not (seem to) include such a data structure, since it is IMO a quite classical one.
My questions:
1) Did I overlook something? Is there already a class providing this functionality?
2) OK, if it's not included in the collection library then it might be just a trivial combination of existing collection classes. However, I tried to write that trivial code, and my implementation still looks quite complex for such a simple problem. Is there a simpler solution for such a FifoStream?
import java.io.Closeable
import scala.collection.mutable.Queue

class FifoStream[T] extends Closeable {
  val queue = new Queue[Option[T]]

  lazy val stream = nextStreamElem

  private def nextStreamElem: Stream[T] = next() match {
    case Some(elem) => Stream.cons(elem, nextStreamElem)
    case None       => Stream.empty
  }

  /** Returns next element in the queue (may wait for it to be inserted). */
  private def next() = {
    queue.synchronized {
      while (queue.isEmpty) queue.wait() // loop guards against spurious wakeups
      queue.dequeue()
    }
  }

  /** Adds new elements to this stream. */
  def enqueue(elems: T*) {
    queue.synchronized {
      queue.enqueue(elems.map{Some(_)}: _*)
      queue.notify()
    }
  }

  /** Closes this stream. */
  def close() {
    queue.synchronized {
      queue.enqueue(None)
      queue.notify()
    }
  }
}
Paradigmatic's solution (slightly modified)
Thanks for your suggestions. I slightly modified paradigmatic's solution so that toStream returns an immutable stream (allowing repeatable reads), which fits my needs. Just for completeness, here is the code:
import collection.JavaConversions._
import java.util.concurrent.{LinkedBlockingQueue, BlockingQueue}

class FIFOStream[A]( private val queue: BlockingQueue[Option[A]] = new LinkedBlockingQueue[Option[A]]() ) {
  lazy val toStream: Stream[A] = queue2stream

  private def queue2stream: Stream[A] = queue take match {
    case Some(a) => Stream cons ( a, queue2stream )
    case None    => Stream empty
  }

  def close()           = queue add None
  def enqueue( as: A* ) = queue addAll as.map( Some(_) )
}
In Scala, streams are "functional iterators". People expect them to be pure (no side effects) and immutable. In your case, every time you iterate over the stream you modify the queue (so it's not pure). This can create a lot of misunderstandings, because iterating over the same stream twice will give two different results.
That being said, you should rather use Java's BlockingQueues than roll your own implementation. They are considered well implemented in terms of safety and performance. Here is the cleanest code I can think of (using your approach):
import java.util.concurrent.{BlockingQueue, LinkedBlockingQueue}
import scala.collection.JavaConversions._

class FIFOStream[A]( private val queue: BlockingQueue[Option[A]] ) {
  def toStream: Stream[A] = queue take match {
    case Some(a) => Stream cons ( a, toStream )
    case None    => Stream empty
  }

  def close()           = queue add None
  def enqueue( as: A* ) = queue addAll as.map( Some(_) )
}

object FIFOStream {
  // Wrap the queue in a FIFOStream rather than returning the bare queue
  def apply[A]() = new FIFOStream[A]( new LinkedBlockingQueue[Option[A]]() )
}
I'm assuming you're looking for something like java.util.concurrent.BlockingQueue?
Akka has a BoundedBlockingQueue implementation of this interface. There are of course the implementations available in java.util.concurrent.
You might also consider using Akka's actors for whatever it is you are doing. Use actors to be notified of (or pushed) a new event or message instead of pulling.
1) It seems you're looking for a dataflow stream, as seen in languages like Oz, which supports the producer-consumer pattern. Such a collection is not available in the collections API, but you could always create one yourself.
2) The data flow stream relies on the concept of single-assignment variables (such that they don't have to be initialized upon declaration point and reading them prior to initialization causes blocking):
val x: Int
startThread {
  println(x)
}
println("The other thread waits for the x to be assigned")
x = 1
It would be straightforward to implement such a stream if single-assignment (or dataflow) variables were supported in the language (see the link). Since they are not a part of Scala, you have to use the wait-synchronized-notify pattern just like you did.
Concurrent queues from Java can be used to achieve that as well, as the other user suggested.