Parallel file processing in Scala

Suppose I need to process files in a given folder in parallel. In Java I would create a FolderReader thread to read file names from the folder and a pool of FileProcessor threads. FolderReader reads file names and submits the file processing function (Runnable) to the pool executor.
In Scala I see two options:
create a pool of FileProcessor actors and schedule a file processing function with Actors.Scheduler.
create an actor for each file name while reading the file names.
Does it make sense? What is the best option?

Depending on what you're doing, it may be as simple as
for (file <- files.par) {
  // process the file
}
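Here `files` is assumed to be whatever collection of files you already have; a minimal sketch, assuming the folder is simply listed with `listFiles`:
import java.io.File
import scala.io.Source

val files = new File("/some/folder").listFiles()

for (file <- files.par) {
  // Each file is processed on the default parallel-collections thread pool.
  val lineCount = Source.fromFile(file).getLines().size
  println(file.getName + ": " + lineCount + " lines")
}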

I strongly suggest staying as far away from raw threads as you can. Luckily we have better abstractions that take care of what happens underneath, and in your case it seems to me that you do not need actors (though you could use them); a simpler abstraction, Futures, is enough. They are part of the open-source Akka library, and I think in the future they will become part of the Scala standard library as well.
A Future[T] is simply something that will return a T in the future.
All you need to run a future is an implicit ExecutionContext, which you can derive from a Java executor service. Then you can enjoy the elegant API and the fact that a future is a monad: transform collections into collections of futures, collect the results, and so on. I suggest you have a look at http://doc.akka.io/docs/akka/2.0.1/scala/futures.html
import java.util.concurrent.Executors
import akka.dispatch.{Await, ExecutionContext, Future}
import akka.util.duration._

object TestingFutures {
  implicit val executorService = Executors.newFixedThreadPool(20)
  implicit val executorContext = ExecutionContext.fromExecutorService(executorService)

  def testFutures(myList: List[String]): List[String] = {
    val listOfFutures: Future[List[String]] = Future.traverse(myList) { aString =>
      Future {
        aString.reverse
      }
    }
    val result: List[String] = Await.result(listOfFutures, 1 minute)
    result
  }
}
There's a lot going on here:
I am using Future.traverse, which takes as its first parameter an M[T] <: Traversable[T] and as its second parameter a T => Future[T] (a Function1[T, Future[T]], if you prefer), and returns Future[M[T]].
I am using the Future.apply method to create an anonymous class of type Future[T]
There are many other reasons to look at Akka futures.
Futures can be mapped because they are a monad, i.e. you can chain future executions:
Future { 3 }.map { _ * 2 }.map { _.toString }
Futures have callbacks: future.onComplete, onSuccess, onFailure, andThen, etc.
Futures support not only traverse but also for comprehensions; see the sketch below.
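For example, a small sketch of a for comprehension over futures with the same Akka 2.0 API (note that futures created inside the for block run one after another; create them before the for if you want them to run in parallel):
import java.util.concurrent.Executors
import akka.dispatch.{Await, ExecutionContext, Future}
import akka.util.duration._

implicit val ec = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

// Start both futures first so they run concurrently, then combine them.
val fa = Future { 1 + 1 }
val fb = Future { 2 * 2 }

val sum = for {
  a <- fa
  b <- fb
} yield a + b

Await.result(sum, 1 minute) // 6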

Ideally you should use two actors. One for reading the list of files, and one for actually reading the file.
You start the process by simply sending a single "start" message to the first actor. The actor can then read the list of files, and send a message to the second actor. The second actor then reads the file and processes the contents.
Having multiple actors, which might seem complicated, is actually a good thing in the sense that you have a bunch of objects communicating with each other, like in a theoretical OO system.
Edit: you REALLY shouldn't be doing concurrent reading of a single file.
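A rough sketch of that two-actor setup with Akka (the names FolderScanner and FileWorker are just placeholders for illustration):
import java.io.File
import akka.actor.{Actor, ActorRef, ActorSystem, Props}

case object Start
case class ProcessFile(file: File)

// Reads the list of files and hands each one off as a message.
class FolderScanner(worker: ActorRef, folder: File) extends Actor {
  def receive = {
    case Start => folder.listFiles().foreach(f => worker ! ProcessFile(f))
  }
}

// Reads and processes a single file per message.
class FileWorker extends Actor {
  def receive = {
    case ProcessFile(file) => println("processing " + file.getName)
  }
}

val system = ActorSystem("files")
val worker = system.actorOf(Props[FileWorker])
val scanner = system.actorOf(Props(new FolderScanner(worker, new File("/some/folder"))))
scanner ! Start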

I was going to write up exactly what @Edmondo1984 did except he beat me to it. :) I second his suggestion in a big way. I'll also suggest that you read the documentation for Akka 2.0.2. As well, I'll give you a slightly more concrete example:
import akka.dispatch.{ExecutionContext, Future, Await}
import akka.util.duration._
import java.util.concurrent.Executors
import java.io.File

val execService = Executors.newCachedThreadPool()
implicit val execContext = ExecutionContext.fromExecutorService(execService)

val tmp = new File("/tmp/")
val files = tmp.listFiles()

val workers = files.map { f =>
  Future {
    f.getAbsolutePath()
  }
}.toSeq

val result = Future.sequence(workers)
result.onSuccess {
  case filenames =>
    filenames.foreach { fn =>
      println(fn)
    }
}

// Artificial just to make things work for the example
Thread.sleep(100)
execContext.shutdown()
Here I use sequence instead of traverse, but the difference is going to depend on your needs.
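For the record, the same result with Future.traverse, which fuses the map and the sequence steps into one (a sketch against the same files value and implicit execution context as above):
val traversed = Future.traverse(files.toSeq) { f =>
  Future { f.getAbsolutePath() }
}

traversed.onSuccess {
  case filenames => filenames.foreach(println)
}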
Go with the Future, my friend; the Actor is just a more painful approach in this instance.

But if we use actors, what's wrong with that?
Say we have to read/write some property files. Here is my Java example, but still with Akka actors.
Let's say we have an actor, ActorFile, that represents one file. Hmm... probably it cannot represent just one file, right? (It would be nice if it could.) So then it represents several files, say a PropertyFilesActor:
Why not use something like this:
import java.util.LinkedHashMap;
import java.util.Map;
import akka.actor.UntypedActor;

public class PropertyFilesActor extends UntypedActor {
    Map<String, String> filesContent = new LinkedHashMap<String, String>();
    { // here we should use real files, of course
        filesContent.put("file1.xml", "");
        filesContent.put("file2.xml", "");
    }

    @Override
    public void onReceive(Object message) throws Exception {
        if (message instanceof WriteMessage) {
            WriteMessage writeMessage = (WriteMessage) message;
            String content = filesContent.get(writeMessage.fileName);
            String newContent = content + writeMessage.stringToWrite;
            filesContent.put(writeMessage.fileName, newContent);
        }
        else if (message instanceof ReadMessage) {
            ReadMessage readMessage = (ReadMessage) message;
            String currentContent = filesContent.get(readMessage.fileName);
            // Send the current content back to the sender
            getSender().tell(new ReadMessage(readMessage.fileName, currentContent), getSelf());
        }
        else unhandled(message);
    }
}
...a message will carry the file name as a parameter.
It has its own in-box, accepting messages like:
WriteLine(fileName, string)
ReadLine(fileName, string)
Those messages are stored in the in-box in order, one after another. The actor does its work by receiving messages from the box (storing/reading), and meanwhile sends feedback back with sender ! message.
Thus, say we write to the property file and then show the content on a web page. We can start rendering the page right after we send the message to store the data in the file, and as soon as we receive the feedback, update part of the page (by Ajax) with the data from the just-updated file.
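For completeness, the message types assumed by the example above could be simple immutable value classes; sketched here in Scala (the Java actor would need equivalent POJOs with public final fields):
// Hypothetical message types assumed by the PropertyFilesActor example.
case class WriteMessage(fileName: String, stringToWrite: String)
case class ReadMessage(fileName: String, content: String)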

Well, grab your files and stick them in a parallel structure
scala> new java.io.File("/tmp").listFiles.par
res0: scala.collection.parallel.mutable.ParArray[java.io.File] = ParArray( ... )
Then...
scala> res0 map (_.length)
res1: scala.collection.parallel.mutable.ParArray[Long] = ParArray(4943, 1960, 4208, 103266, 363 ... )
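From there you can run the actual per-file work directly on the parallel array; a sketch, with process standing in for whatever you do per file:
def process(f: java.io.File): Unit =
  println(f.getName + ": " + f.length + " bytes")

new java.io.File("/tmp").listFiles.par foreach process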

Related

scala ZIO foreachPar

I'm new to parallel programming and ZIO. I'm trying to get data from an API with parallel requests.
import sttp.client._
import zio.{Task, ZIO}

ZIO.foreach(files) { file =>
  getData(file)
  Task(file.getName)
}

def getData(file: File) = {
  val data: String = readData(file)
  val request = basicRequest.body(data).post(uri"$url")
    .headers(content -> "text", char -> "utf-8")
    .response(asString)
  implicit val backend: SttpBackend[Identity, Nothing, NothingT] = HttpURLConnectionBackend()
  request.send().body
  resquest.Response match {
    case Success(value) => {
      val src = new PrintWriter(new File(filename))
      src.write(value.toString)
      src.close()
    }
    case Failure(exception) => log error
  }
}
When I execute the program sequentially, it works as expected. If I try to run it in parallel by changing ZIO.foreach to ZIO.foreachPar, the program terminates prematurely. I get that I'm missing something basic here; any help in figuring out the issue is appreciated.
Generally speaking, I wouldn't recommend mixing synchronous blocking code, as you have, with asynchronous non-blocking code, which is ZIO's primary domain. There are some great talks out there on how to effectively use ZIO with the "world", so to speak.
There are two key points I would make: one, ZIO lets you manage resources effectively by attaching allocation and finalization steps; and two, "effects", which we could say are "things which actually interact with the world", should be wrapped in the tightest scope possible*.
So let's go through this example a bit. First of all, I would not suggest using the default Identity-backed backend with ZIO; I would recommend using the AsyncHttpClientZioBackend instead.
import java.io.{File, IOException, PrintWriter}
import sttp.client._
import sttp.client.asynchttpclient.zio.AsyncHttpClientZioBackend
import zio.{Managed, Task, UIO, ZIO, ZManaged}
import zio.blocking.effectBlocking

// Extract the common elements of the request
val baseRequest = basicRequest.post(uri"$url")
  .headers(content -> "text", char -> "utf-8")
  .response(asString)

// Produces a writer which is wrapped in a `Managed`, allowing it to be properly
// closed after being used
def managedWriter(filename: String): Managed[IOException, PrintWriter] =
  ZManaged.fromAutoCloseable(UIO(new PrintWriter(new File(filename))))

// This returns an effect which produces an `SttpBackend`, so we flatMap over it
// to bring the backend into scope.
val program = AsyncHttpClientZioBackend().flatMap { implicit backend =>
  ZIO.foreachPar(files) { file =>
    for {
      // Wrap the synchronous read in an effect that runs on the blocking
      // thread pool instead of blocking the main one.
      data <- effectBlocking(readData(file))
      // `send` returns a `Task` because it uses the implicit backend in scope
      resp <- baseRequest.body(data).send()
      // Build the managed writer, then `use` it; at the end of `use` the
      // writer is closed automatically.
      _ <- managedWriter("").use(w => Task(w.write(resp.body.toString)))
    } yield ()
  }
}
At this point you will just have the program, which you will need to run using one of the unsafe methods, or through the main method if you are using a zio.App.
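For example, a minimal sketch of wiring the program value above into a zio.App (assuming ZIO 1.x, where run returns an ExitCode):
import zio.{App, ExitCode, URIO, ZEnv}

object Main extends App {
  // Fold errors into a failure exit code instead of letting them escape.
  def run(args: List[String]): URIO[ZEnv, ExitCode] =
    program.fold(_ => ExitCode.failure, _ => ExitCode.success)
}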
* Not always possible or convenient, but it is useful because it prevents resource hogging by yielding tasks back to the runtime for scheduling.
When you use a purely functional IO library like ZIO, you must not call any side-effecting functions (like getData) except when calling factory methods like Task.effect or Task.apply.
ZIO.foreach(files) { file =>
  Task {
    getData(file)
    file.getName
  }
}

Reading multiple Files Asynchronously using Akka Streams, Scala

I want to read many .CSV files inside a folder asynchronously and return an Iterable of a custom case class.
Can I achieve this with Akka Streams, and how?
I have tried to somehow balance the job according to the documentation, but it's a little hard to manage...
Or
Is it good practice to use actors instead? (A parent actor with children, where every child reads a file and returns an Iterable to the parent, and the parent then combines all the Iterables?)
Mostly the same as @paul's answer, but with small improvements:
import java.nio.file.Paths
import akka.stream.scaladsl.{FileIO, Framing, Source}
import akka.util.ByteString

def files = new java.io.File("").listFiles().map(_.getAbsolutePath).to[scala.collection.immutable.Iterable]

Source(files)
  .flatMapConcat { filename => // you could use flatMapMerge if you don't care about line ordering
    FileIO.fromPath(Paths.get(filename))
      .via(Framing.delimiter(ByteString("\n"), 256, allowTruncation = true).map(_.utf8String))
  }
  .map { csvLine =>
    // parse csv here
    println(csvLine)
  }
First of all you need to read/learn how Akka Streams work, with Source, Flow and Sink. Then you can start learning the operators.
To run multiple actions in parallel you can use the mapAsync operator, in which you specify the degree of parallelism.
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import org.junit.Test
import scala.concurrent.{ExecutionContext, Future}

/**
 * Using the mapAsync operator, we pass a function which returns a Future; the number of
 * futures run in parallel is determined by the argument passed to the operator.
 */
@Test def readAsync(): Unit = {
  implicit val system: ActorSystem = ActorSystem() // provides the materializer for runWith
  implicit val ec: ExecutionContext = system.dispatcher

  Source(0 to 10) // --> Your files
    .mapAsync(5) { value => // -> It will run 5 reads in parallel
      Future {
        // Here read your file
        Thread.sleep(500)
        println(s"Process in Thread:${Thread.currentThread().getName}")
        value
      }
    }
    .runWith(Sink.foreach(value => println(s"Item emitted:$value in Thread:${Thread.currentThread().getName}")))
}
You can learn more about akka and akka stream here https://github.com/politrons/Akka

Possible to offload route completion DSL to actors in akka-http?

I have become interested in the akka-http implementation, but one thing that strikes me as something of an anti-pattern is the creation of a DSL for all routing, parsing of parameters, error handling and so on. The examples given in the documentation are trivial in the extreme. However, I saw a route of a real product on the market and it was a massive 10k-line file with nesting many levels deep and a ton of business logic in the route. Real-world systems have to deal with users passing bad parameters, not having the right permissions and so on, so the simple DSL explodes fast in real life. To me the optimal solution would be to hand off the route completion to actors, each with the same API, which would then do what is needed to complete the route. This would spread out the logic and enable maintainable code, but after hours I have been unable to manage it. With the low-level API I can pass off the HttpRequest and handle it the old way, but that leaves me without most of the tools in the DSL. So is there a way I could pass something to an actor that would enable it to continue the DSL at that point, handling route-specific stuff? I.e. I am talking about something like this:
class MySlashHandler() extends Actor {
  def receive = {
    case ctxt: ContextOfSomeKind =>
      decodeRequest {
        // unmarshal with in-scope unmarshaller
        entity(as[Order]) { order =>
          sender ! "Order received"
        }
        context.stop(self)
      }
  }
}

val route =
  pathEndOrSingleSlash {
    get { ctxt =>
      val actor = actorSystem.actorOf(Props(classOf[MySlashHandler]))
      complete(actor ? ctxt)
    }
  }
Naturally that won't even compile. Despite my best efforts I haven't found a type for ContextOfSomeKind or a way to re-enter the DSL once I am inside the actor. It could be that this isn't possible. If not, I don't think I like the DSL, because it encourages what I would consider horrible programming methodology. Then the only problem with the low-level API is getting access to the entity marshallers, but I would rather do that than build a massive app in a single source file.
Offload Route Completion
Answering your question directly: a Route is nothing more than a function. The definition being:
type Route = (RequestContext) => Future[RouteResult]
Therefore you can simply write a function that does what you are looking for, e.g. sends the RequestContext to an Actor and gets back the result:
import akka.pattern.{ask, pipe}
import akka.util.Timeout
import scala.concurrent.duration._

class MySlashHandler extends Actor {
  import context.dispatcher // ExecutionContext for pipeTo

  // complete(...) is itself a Route, i.e. RequestContext => Future[RouteResult]
  val routeHandler: Route = complete("Actor Complete")

  override def receive: Receive = {
    case requestContext: RequestContext => routeHandler(requestContext) pipeTo sender
  }
}

implicit val timeout: Timeout = 5.seconds

val actorRef: ActorRef = actorSystem actorOf (Props[MySlashHandler])

val route: Route =
  (requestContext: RequestContext) => (actorRef ? requestContext).mapTo[RouteResult]
Actors Don't Solve Your Problem
The problem you are trying to deal with is the complexity of the real world and modelling that complexity in code. I agree this is an issue, but an Actor isn't your solution. There are several reasons to avoid Actors for the design solution you seek:
There are compelling arguments against putting business logic in Actors.
Akka-http uses akka-stream under the hood. Akka stream uses Actors under the hood. Therefore, you are trying to escape a DSL based on composable Actors by using Actors. Water isn't usually the solution for a drowning person...
The akka-http DSL provides a lot of compile time checks that are washed away once you revert to the non-typed receive method of an Actor. You'll get more run time errors, e.g. dead letters, by using Actors.
Route Organization
As stated previously: a Route is a simple function, the building block of scala. Just because you saw an example of a sloppy developer keeping all of the logic in 1 file doesn't mean that is the only solution.
I could write all of my application code in the main method, but that doesn't make it good functional programming design. Similarly, I could write all of my collection logic in a single for-loop, but I usually use filter/map/reduce.
The same organizing principles apply to Routes. Just from the most basic perspective you could break up the Route logic according to method type:
//GetLogic.scala
object GetLogic {
  val getRoute = get {
    complete("get received")
  }
}

//PutLogic.scala
object PutLogic {
  val putRoute = put {
    complete("put received")
  }
}
Another common organizing principle is to keep your business logic separate from your Route logic:
object BusinessLogic {
  type UserName = String
  type UserId = String

  //isolated business logic
  def dbLookup(userId: UserId): UserName = ???

  val businessRoute = get {
    entity(as[String]) { userId => complete(dbLookup(userId)) }
  }
}
These can then all be combined in your main method:
val finalRoute : Route =
GetLogic.getRoute ~ PutLogic.putRoute ~ BusinessLogic.businessRoute
The routing DSL can be misleading because it sort of looks like magic at times, but underneath it's just plain-old functions which Scala can organize and isolate just fine...
I encountered a problem like this last week as well. Eventually I ended up at this blog and decided to go the same way as described there.
I created a custom directive which makes it possible for me to pass request contexts to Actors.
def imperativelyComplete(inner: ImperativeRequestContext => Unit): Route = { ctx: RequestContext =>
  val p = Promise[RouteResult]()
  inner(new ImperativeRequestContext(ctx, p))
  p.future
}
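For reference, the ImperativeRequestContext used here is not part of akka-http itself; a sketch of what it might look like, following the blog's idea:
import scala.concurrent.{ExecutionContext, Promise}
import akka.http.scaladsl.marshalling.ToResponseMarshallable
import akka.http.scaladsl.server.{RequestContext, RouteResult}

// Completes (or fails) the wrapped RequestContext and fulfils the promise
// that imperativelyComplete is waiting on.
final class ImperativeRequestContext(ctx: RequestContext, promise: Promise[RouteResult]) {
  private implicit val ec: ExecutionContext = ctx.executionContext

  def complete(obj: ToResponseMarshallable): Unit =
    ctx.complete(obj).onComplete(promise.complete)

  def fail(error: Throwable): Unit =
    ctx.fail(error).onComplete(promise.complete)
}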
Now I can use this in my Routes file like this:
val route =
  put {
    imperativelyComplete { ctx =>
      val actor = actorSystem.actorOf(Props(classOf[RequestHandler], ctx))
      actor ! HandleRequest
    }
  }
With my RequestHandler Actor looking like the following:
class RequestHandler(ctx: ImperativeRequestContext) extends Actor {
  def receive: Receive = {
    case handleRequest: HandleRequest =>
      someActor ! DoSomething() // will return SomethingDone to the sender
    case somethingDone: SomethingDone =>
      ctx.complete("Done handling request")
      context.stop(self)
  }
}
I hope this brings you into the direction of finding a better solution. I am not sure if this solution should be the way to go, but up until now it works out really well for me.

Is there a FIFO stream in Scala?

I'm looking for a FIFO stream in Scala, i.e., something that provides the functionality of
immutable.Stream (a stream that can be finite and memoizes the elements that have already been read)
mutable.Queue (which allows for added elements to the FIFO)
The stream should be closable and should block access to the next element until the element has been added or the stream has been closed.
Actually I'm a bit surprised that the collection library does not (seem to) include such a data structure, since it is IMO a quite classical one.
My questions:
1) Did I overlook something? Is there already a class providing this functionality?
2) OK, if it's not included in the collection library then it might be just a trivial combination of existing collection classes. However, when I tried to write this supposedly trivial code, my implementation still looks quite complex for such a simple problem. Is there a simpler solution for such a FifoStream?
import java.io.Closeable
import scala.collection.mutable.Queue

class FifoStream[T] extends Closeable {
  val queue = new Queue[Option[T]]

  lazy val stream = nextStreamElem

  private def nextStreamElem: Stream[T] = next() match {
    case Some(elem) => Stream.cons(elem, nextStreamElem)
    case None => Stream.empty
  }

  /** Returns next element in the queue (may wait for it to be inserted). */
  private def next() = {
    queue.synchronized {
      if (queue.isEmpty) queue.wait()
      queue.dequeue()
    }
  }

  /** Adds new elements to this stream. */
  def enqueue(elems: T*) {
    queue.synchronized {
      queue.enqueue(elems.map{ Some(_) }: _*)
      queue.notify()
    }
  }

  /** Closes this stream. */
  def close() {
    queue.synchronized {
      queue.enqueue(None)
      queue.notify()
    }
  }
}
Paradigmatic's solution (slightly modified)
Thanks for your suggestions. I slightly modified paradigmatic's solution so that toStream returns an immutable stream (allows for repeatable reads) so that it fits my needs. Just for completeness, here is the code:
import collection.JavaConversions._
import java.util.concurrent.{LinkedBlockingQueue, BlockingQueue}

class FIFOStream[A]( private val queue: BlockingQueue[Option[A]] = new LinkedBlockingQueue[Option[A]]() ) {
  lazy val toStream: Stream[A] = queue2stream

  private def queue2stream: Stream[A] = queue take match {
    case Some(a) => Stream cons ( a, queue2stream )
    case None    => Stream empty
  }

  def close() = queue add None
  def enqueue( as: A* ) = queue addAll as.map( Some(_) )
}
In Scala, streams are "functional iterators". People expect them to be pure (no side effects) and immutable. In your case, every time you iterate over the stream you modify the queue (so it's not pure). This can create a lot of misunderstanding, because iterating over the same stream twice will give two different results.
That being said, you should use Java's BlockingQueues rather than rolling your own implementation. They are considered well implemented in terms of safety and performance. Here is the cleanest code I can think of (using your approach):
import java.util.concurrent.{BlockingQueue, LinkedBlockingQueue}
import scala.collection.JavaConversions._

class FIFOStream[A]( private val queue: BlockingQueue[Option[A]] ) {
  def toStream: Stream[A] = queue take match {
    case Some(a) => Stream cons ( a, toStream )
    case None    => Stream empty
  }
  def close() = queue add None
  def enqueue( as: A* ) = queue addAll as.map( Some(_) )
}

object FIFOStream {
  def apply[A]() = new FIFOStream( new LinkedBlockingQueue[Option[A]]() )
}
I'm assuming you're looking for something like java.util.concurrent.BlockingQueue?
Akka has a BoundedBlockingQueue implementation of this interface. There are of course the implementations available in java.util.concurrent.
You might also consider using Akka's actors for whatever it is you are doing. Use Actors to be notified or pushed a new event or message instead of pulling.
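A tiny sketch of that push style (names are placeholders): the producer tells a consumer actor about each element, so nothing ever blocks waiting to pull.
import akka.actor.{Actor, ActorSystem, Props}

case class Element(value: String)
case object Closed

// The consumer reacts to elements as they are pushed to it.
class Consumer extends Actor {
  def receive = {
    case Element(v) => println("got " + v)
    case Closed     => context.stop(self)
  }
}

val system = ActorSystem("fifo")
val consumer = system.actorOf(Props[Consumer])
consumer ! Element("a")
consumer ! Element("b")
consumer ! Closed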
1) It seems you're looking for a dataflow stream seen in languages like Oz, which supports the producer-consumer pattern. Such a collection is not available in the collections API, but you could always create one yourself.
2) The data flow stream relies on the concept of single-assignment variables (such that they don't have to be initialized upon declaration point and reading them prior to initialization causes blocking):
val x: Int
startThread {
  println(x)
}
println("The other thread waits for the x to be assigned")
x = 1
It would be straightforward to implement such a stream if single-assignment (or dataflow) variables were supported in the language (see the link). Since they are not a part of Scala, you have to use the wait-synchronized-notify pattern just like you did.
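As an aside, Scala's Promise/Future pair (scala.concurrent, Scala 2.10+) already behaves like a single-assignment variable, so the blocking-read semantics can be sketched without hand-rolled wait/notify:
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// A Promise is a single-assignment cell: reads (via its Future) block
// until it is written exactly once.
val x = Promise[Int]()

val reader = Future {
  println("The other thread waits for x to be assigned")
  println(Await.result(x.future, 10.seconds))
}

x.success(1) // the single assignment unblocks the reader
Await.ready(reader, 10.seconds)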
Concurrent queues from Java can be used to achieve that as well, as the other user suggested.

Iterate over lines in a file in parallel (Scala)?

I know about the parallel collections in Scala. They are handy! However, I would like to iterate over the lines of a file that is too large for memory in parallel. I could create threads and set up a lock over a Scanner, for example, but it would be great if I could run code such as:
Source.fromFile(path).getLines.par foreach { line =>
Unfortunately, however
error: value par is not a member of Iterator[String]
What is the easiest way to accomplish some parallelism here? For now, I will read in some lines and handle them in parallel.
You could use grouping to easily slice the iterator into chunks you can load into memory and then process in parallel.
val chunkSize = 128 * 1024
val iterator = Source.fromFile(path).getLines.grouped(chunkSize)
iterator.foreach { lines =>
  lines.par.foreach { line => process(line) }
}
In my opinion, something like this is the simplest way to do it.
I'll put this as a separate answer since it's fundamentally different from my last one (and it actually works)
Here's an outline for a solution using actors, which is basically what Kim Stebel's comment describes. There are two actor classes, a single FileReader actor that reads individual lines from the file on demand, and several Worker actors. The workers all send requests for lines to the reader, and process lines in parallel as they are read from the file.
I'm using Akka actors here but using another implementation is basically the same idea.
case object LineRequest
case object BeginProcessing

class FileReader extends Actor {
  //reads a single line from the file or returns None if EOF
  def getLine: Option[String] = ...

  def receive = {
    case LineRequest => self.sender.foreach{ _ ! getLine } //sender is an Option[ActorRef]
  }
}

class Worker(reader: ActorRef) extends Actor {
  def process(line: String) = ...

  def receive = {
    case BeginProcessing => reader ! LineRequest
    case Some(line) => {
      process(line)
      reader ! LineRequest
    }
    case None => self.stop
  }
}

val reader = actorOf[FileReader].start
val workers = Vector.fill(4)(actorOf(new Worker(reader)).start)
workers.foreach{ _ ! BeginProcessing }
//wait for the workers to stop...
This way, no more than 4 (or however many workers you have) unprocessed lines are in memory at a time.
The following helped me achieve it:
source.getLines.toStream.par.foreach( line => println(line))
The comments on Dan Simon's answer got me thinking. Why don't we try wrapping the Source in a Stream:
import scala.io.Source

def src(source: Source): Stream[String] = {
  if (source.hasNext) Stream.cons(source.takeWhile( _ != '\n' ).mkString, src(source))
  else Stream.empty
}
Then you could consume it in parallel like this:
src(Source.fromFile(path)).par foreach process
I tried this out, and it compiles and runs at any rate. I'm not honestly sure if it's loading the whole file into memory or not, but I don't think it is.
I realize this is an old question, but you may find the ParIterator implementation in the iterata library to be a useful no-assembly-required implementation of this:
scala> import com.timgroup.iterata.ParIterator.Implicits._
scala> val it = (1 to 100000).toIterator.par().map(n => (n + 1, Thread.currentThread.getId))
scala> it.map(_._2).toSet.size
res2: Int = 8 // addition was distributed over 8 threads