Scala ZIO foreachPar

I'm new to parallel programming and ZIO. I'm trying to get data from an API using parallel requests.
import sttp.client._
import zio.{Task, ZIO}
ZIO.foreach(files) { file =>
getData(file)
Task(file.getName)
}
def getData(file: File) = {
val data: String = readData(file)
val request = basicRequest.body(data).post(uri"$url")
.headers(content -> "text", char -> "utf-8")
.response(asString)
implicit val backend: SttpBackend[Identity, Nothing, NothingT] = HttpURLConnectionBackend()
request.send().body
resquest.Response match {
case Success(value) => {
val src = new PrintWriter(new File(filename))
src.write(value.toString)
src.close()
}
case Failure(exception) => log error
}
When I execute the program sequentially, it works as expected.
When I try to run it in parallel by changing ZIO.foreach to ZIO.foreachPar,
the program terminates prematurely. I get that I'm missing something basic here;
any help figuring out the issue is appreciated.

Generally speaking, I wouldn't recommend mixing synchronous blocking code, as you have here, with asynchronous non-blocking code, which is what ZIO is primarily designed for. There are some great talks out there on how to effectively use ZIO with the "world", so to speak.
There are two key points I would make. First, ZIO lets you manage resources effectively by attaching allocation and finalization steps. Second, "effects", which we could describe as "things which actually interact with the world", should be wrapped in the tightest scope possible*.
So let's go through this example a bit. First of all, I would not suggest using the default Identity-based backend with ZIO; I would recommend using the AsyncHttpClientZioBackend instead.
import java.io.{File, IOException, PrintWriter}
import sttp.client._
import sttp.client.asynchttpclient.zio.AsyncHttpClientZioBackend
import zio.{Managed, Task, ZIO, ZManaged}
import zio.blocking.effectBlocking
// Extract the common elements of the request
val baseRequest = basicRequest.post(uri"$url")
  .headers(content -> "text", char -> "utf-8")
  .response(asString)
// Produces a writer which is wrapped in a `Managed`, allowing it to be properly
// closed after being used. Opening the file can fail, so the error is captured
// with `ZIO.effect` and narrowed to `IOException` rather than hidden in a `UIO`.
def managedWriter(filename: String): Managed[IOException, PrintWriter] =
  ZManaged.fromAutoCloseable(
    ZIO.effect(new PrintWriter(new File(filename))).refineToOrDie[IOException]
  )
// This returns an effect which produces an `SttpBackend`, thus we flatMap over it
// to extract the backend.
val program = AsyncHttpClientZioBackend().flatMap { implicit backend =>
  ZIO.foreachPar(files) { file =>
    for {
      // Wrap the synchronous reading of data in an effect that runs on a "blocking"
      // thread pool instead of blocking the main one.
      data <- effectBlocking(readData(file))
      // `send` will return a `Task` because it is using the implicit backend in scope
      resp <- baseRequest.body(data).send()
      // Build the managed writer, then "use" it to produce an effect; at the end of
      // `use` it will automatically close the writer. The "" is a placeholder for
      // the output file name.
      _ <- managedWriter("").use(w => Task(w.write(resp.body.toString)))
    } yield ()
  }
}
At this point you will just have the program, which you will need to run using one of the unsafe methods or, if you are using a zio.App, through the main method.
* Not always possible or convenient, but it is useful because it prevents resource hogging by yielding tasks back to the runtime for scheduling.

When you use a purely functional IO library like ZIO, you must not call any side-effecting functions (like getData) except inside factory methods like Task.effect or Task.apply, which suspend them:
ZIO.foreach(files) { file =>
  Task {
    getData(file)
    file.getName
  }
}

Related

Functional scala log accumulator

I'm working on a Scala project using cats library, mainly. In there, we have calls like
for {
_ <- initSomeServiceAndLog("something from a far away service")
_ <- initSomeOtherServiceAndLog("something from another far away service")
a <- b()
c <- d(a)
} yield c
Imagine that b also logs something or might throw a business error (I know, we avoid throwing in Scala, but that's not the case right now). I'm looking for a solution to accumulate logs and print them all at the end, in a single message.
For the happy path, I saw that the Writer monad from Cats might be an acceptable solution.
But what if the b method throws? The requirement is to log everything (all previous logs and the error message) in a single message, with some kind of unique trace ID.
Any thoughts? Thanks in advance
Implementing functional logging (in a way that preserves logs even if an error happened) using monad transformers like Writer (WriterT) or State (StateT) is hard. However, if we are not dogmatic about the FP approach, we could do the following:
use some IO monad
with it, create something like an in-memory storage for logs
but implement it in a functional way
Personally I would pick either cats.effect.concurrent.Ref or monix.eval.TaskLocal.
Example using Ref (and Task):
type Log = Ref[Task, Chain[String]]
type FunctionalLogger = String => Task[Unit]
val createLog: Task[Log] = Ref.of[Task, Chain[String]](Chain.empty)
def createAppender(log: Log): FunctionalLogger =
entry => log.update(chain => chain.append(entry))
def outputLog(log: Log): Task[Chain[String]] = log.get
with helpers like that I could:
def doOperations(logger: FunctionalLogger) = for {
  _ <- operation1(logger) // logging is a side effect managed by IO monad
  _ <- operation2(logger) // so it is referentially transparent
} yield result
createLog.flatMap { log =>
  doOperations(createAppender(log))
    .recoverWith(...)
    .flatMap { result =>
      outputLog(log)
      ...
    }
}
However, making sure that output is called is a bit of a pain so we could use some form of Bracket or Resource to handle it:
val loggerResource: Resource[Task, FunctionalLogger] = Resource.make {
  createLog // acquiring resource - IO operation that accesses something
} { log =>
  outputLog(log) // releasing resource - works like finally in try-catch, so it should
    .flatMap(... /* log entries or sth */) // be called no matter if an error occurred
}.map(createAppender)
loggerResource.use { logger =>
  doSomething(logger)
}
If you don't like passing this appender around explicitly you could use Kleisli to inject it:
type WithLogger[A] = Kleisli[Task, FunctionalLogger, A]
// def operation1: WithLogger[A]
// def operation2: WithLogger[B]
def doSomething: WithLogger[C] = for {
a <- operation1
b <- operation2
} yield c
loggerResource.use { logger =>
doSomething(logger)
}
TaskLocal would be used in a very similar way.
At the end of the day you would end up with:
type that says that it is logging
mutability managed through IO, so referential transparency would not be lost
certainty that even if IO fails, log will be preserved and the results sent
I believe some purist would not like this solution, but it has all the benefits of FP, so I would personally use it.

How do I create a seq of string from a file that is opened with managed?

Tried this to create a seq from file:
def getFileAsList(bufferedReader: BufferedReader): Seq[String] ={
import resource._
for(source <- managed(bufferedReader)){
for(line<-source.lines())
yield line
}
}
I don't think you are using Scala-ARM the way it was designed to be used. Unless you use the imperative style, i.e. consume your managed resource in place, you are using the monadic style, so what you get is a result wrapped in an ExtractableManagedResource, which is a delayed (lazy) computation rather than an immediate result. So this is not a direct substitute for the Java try-with-resources construct. The monadic style is more useful if you have a method that wants to return some lazy resource that also happens to be managed, i.e. requires some kind of explicit close after usage. But this means that the managed resource is created inside the method rather than passed in from the outside, as in your case.
Still you probably can achieve something similar to what you want with a construction like
def getFileAsList(bufferedReader: BufferedReader): java.util.stream.Stream[String] = {
  import resource._
  val managedWrapper = for (source <- managed(bufferedReader))
    yield for (line <- source.lines())
      yield line
  managedWrapper.tried.get
}
The tried method converts ExtractableManagedResource into a Try and get on that will either get you the result or (re-)throw the exception that happened during result calculation.
Please also note that java.util.stream.Stream is a beast quite different from scala.collection.Seq or scala.collection.immutable.Stream. If you want to get a Scala-specific Stream, you should use some Scala-specific code such as
def getFileAsList(bufferedReader: BufferedReader): scala.collection.immutable.Stream[String] = {
  import resource._
  val managedWrapper = for (source <- managed(bufferedReader))
    yield Stream.continually(source.readLine()).takeWhile(_ != null)
  managedWrapper.tried.get
}

How to save & return data from within Future callback

I've been facing an issue the past few days regarding saving & handling data from Futures in Scala. I'm new to the language and the concept of both. Lagom's documentation on Cassandra says to implement roughly 9 files of code and I want to ensure my database code works before spreading it out over that much code.
Specifically, I'm currently trying to implement a proof of concept to send data to/from the cassandra database that lagom implements for you. So far I'm able to send and retrieve data to/from the database, but I'm having trouble returning that data since this all runs asynchronously, and also returning that the data returned successfully.
I've been playing around for a while; The retrieval code looks like this:
override def getBucket(logicalBucket: String) = ServiceCall[NotUsed, String] {
request => {
val returnList = ListBuffer[String]()
println("Retrieving Bucket " + logicalBucket)
val readingFromTable = "SELECT * FROM data_access_service_impl.s3buckets;"
//DB query
var rowsFuture: Future[Seq[Row]] = cassandraSession.selectAll(readingFromTable)
println(rowsFuture)
Await.result(rowsFuture, 10 seconds)
rowsFuture onSuccess {
case rows => {
println(rows)
for (row <- rows) println(row.getString("name"))
for (row <- rows) returnList += row.getString("name")
println("ReturnList: " + returnList.mkString)
}
}
rowsFuture onFailure {
case e => println("An error has occured: " + e.getMessage)
Future {"An error has occured: " + e.getMessage}
}
Future.successful("ReturnList: " + returnList.mkString)
}
}
When this runs, I get the expected database values to 'println' in the onSuccess callback. However, that same variable, which I use in the return statement, outside of the callback prints as empty (and returns empty data as well). This also happens in the 'insertion' function I use, where it doesn't always return variables I set within callback functions.
If I try to put the statement within the callback function, I'm given an error of 'returns Unit, expects Future[String]'. So I'm stuck: I can't return from within the callback functions, so I can't guarantee I'm returning data.
The goal for me is to return a string to the API so that it shows a list of all the s3 bucket names within the DB. That would mean iterating through the Future[Seq[Row]] datatype, and saving the data into a concatenated string. If somebody could help with that, they'll solve 2 weeks of problems I've had reading through Lagom, Akka, Datastax, and Cassandra documentation. I'm flabbergasted at this point (information overload) and there's no clearcut guide I've found on this.
For reference, here's the cassandraSession documentation:
LagomTutorial/Documentation Style Information with their only cassandra-query example
CassandraSession.scala code
The key thing to understand about Future, (and Option, and Either, and Try) is that you do not (in general) get values out of them, you bring computations into them. The most common way to do that is with the map and flatMap methods.
In your case, you want to take a Seq[Row] and transform it into a String. However, your Seq[Row] is wrapped in this opaque data structure called Future, so you can't just rows.mkString as you would if you actually had a Seq[Row]. So, instead of getting the value and performing computation on it, bring your computation rows.mkString to the data:
//DB query
val rowsFuture: Future[Seq[Row]] = cassandraSession.selectAll(readingFromTable)
val rowToString = (row: Row) => row.getString("name")
val computation = (rows: Seq[Row]) => rows.map(rowToString).mkString
// Computation to the data, rather than the other way around
val resultFuture = rowsFuture.map(computation)
Now, when rowsFuture is completed, the new future that you created by calling rowsFuture.map will be fulfilled with the result of calling computation on the Seq[Row] that you actually care about.
At that point you can just return resultFuture and everything will work as anticipated, because the code that calls getBucket is expecting a Future and will handle it as is appropriate.
Why is Future opaque?
The simple reason is because it represents a value that may not currently exist. You can only get the value when the value is there, but when you start your call it isn't there. Rather than have you poll some isComplete field yourself, the code lets you register computations (callbacks, like onSuccess and onFailure) or create new derived future values using map and flatMap.
The deeper reason is that Future is a monad, and monads encompass computation but do not have an operation to extract that computation out of them.
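As a minimal sketch of bringing the computation to the data, the following uses only scala.concurrent from the standard library. fetchNames is a hypothetical stand-in for cassandraSession.selectAll (it returns names directly rather than Row values), and the Await at the end is there only so the example prints something; a ServiceCall would return the Future itself.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureMapSketch {
  // Hypothetical stand-in for the database call; not part of any real API.
  def fetchNames(): Future[Seq[String]] = Future {
    Thread.sleep(50) // simulate query latency
    Seq("bucket-a", "bucket-b")
  }

  // Derive the final value inside the Future instead of filling a
  // mutable buffer from an onSuccess callback.
  def bucketList(): Future[String] =
    fetchNames().map(names => "ReturnList: " + names.mkString(", "))

  def main(args: Array[String]): Unit = {
    // Blocking is for demonstration only.
    println(Await.result(bucketList(), 5.seconds)) // prints "ReturnList: bucket-a, bucket-b"
  }
}
```

The buffer-based version in the question returns before the callback has run, which is why it sees an empty list; here the string is only built once the names are available.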
Replace the select with your specific select, and the field with the specific field that you want to obtain. The example is only a test, not an architecture proposal.
package ldg.com.dbmodule.model
/**
 * Created by ldipotet on 05/11/17.
 */
import com.datastax.driver.core.{Cluster, ResultSet, ResultSetFuture}
import scala.util.{Failure, Success, Try}
import java.util.concurrent.TimeUnit
import scala.collection.JavaConverters._
// Using the implicit global context is strongly discouraged! We should create
// our OWN execution context!
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
object CassandraDataStaxClient {
  // We create a callback here in Scala with the DataStax API
  implicit def resultSetFutureToScala(rf: ResultSetFuture): Future[ResultSet] = {
    val promiseResult = Promise[ResultSet]()
    val producer = Future {
      getResultSet(rf) match {
        // we complete the promise with a specific value
        case Success(resultset) => promiseResult success resultset
        // keep the original exception as the cause
        case Failure(e) => promiseResult failure new IllegalStateException(e)
      }
    }
    promiseResult.future
  }
  def getResultSet: ResultSetFuture => Try[ResultSet] = rsetFuture => {
    Try(
      // Another choice can be:
      // getUninterruptibly(long timeout, TimeUnit unit) throws TimeoutException
      // for a specific time
      // can deal with an IOException
      rsetFuture.getUninterruptibly
    )
  }
  def main(args: Array[String]): Unit = {
    def defaultFutureUnit() = TimeUnit.SECONDS
    val time = 20.seconds
    // Better to use addContactPoints and add more than one contact point
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("myOwnKeySpace")
    // session.executeAsync is asynchronous, so we get a ResultSetFuture here,
    // converted to a Future[ResultSet] by the implicit conversion
    val future: Future[ResultSet] = session.executeAsync("SELECT * FROM myOwnTable")
    // Blocking on a future is strongly discouraged!! This is a specific case,
    // to make sure that all of the futures have been completed
    val results = Await.result(future, time).all()
    results.asScala.foreach(row => println(row.getString("any_String_Field")))
    session.close()
    cluster.close()
  }
}

Nesting Futures in Play Action

I'm using Play and have an action in which I want to do two things:
firstly check my cache for a value
secondly, call a web service with the value
Since WS API returns a Future, I'm using Action.async.
My Redis cache module also returns a Future.
Assume I'm using another ExecutionContext appropriately for the potentially long running tasks.
Q. Can someone confirm if I'm on the right track by doing the following. I know I have not catered for the Exceptional cases in the below - just keeping it simple for brevity.
def token = Action.async { implicit request =>
// 1. Get Future for read on cache
val cacheFuture = scala.concurrent.Future {
cache.get[String](id)
}
// 2. Map inside cache Future to call web service
cacheFuture.map { result =>
WS.url(url).withQueryString("id" -> result).get().map { response =>
// process response
Ok(responseData)
}
}
}
My concern is that this may not be the most efficient way of doing things because I assume different threads may handle the task of completing each of the Futures.
Any recommendations for a better approach are greatly appreciated.
That's not specific to Play. I suggest you have a look at documentation explaining how Futures work.
val x: Future[FutureOp2ResType] = futureOp1(???).flatMap { res1 => futureOp2(res1, ???) }
Or with for-comprehension
val x: Future[TypeOfRes] = for {
  res1 <- futureOp1(???)
  res2 <- futureOp2(res1, ???)
  // ...
} yield res2
As for how the Futures are executed (using threads), it depends on which ExecutionContext you use (e.g. the global one, the Play one, ...).
Since WS.get returns a Future, it should not be called within cacheFuture.map, or it will return a Future[Future[...]]; use flatMap instead.
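The difference between map and flatMap here can be sketched with plain standard-library futures. readCache and callService are hypothetical stand-ins for the cache lookup and the WS call; they are not Play APIs.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FlatMapSketch {
  // Hypothetical stand-ins for the cache lookup and the web service call.
  def readCache(id: String): Future[String] = Future(s"token-for-$id")
  def callService(token: String): Future[Int] = Future(token.length)

  // map keeps the inner Future, so the result is nested.
  def nested(id: String): Future[Future[Int]] = readCache(id).map(callService)

  // flatMap flattens it into a single Future, which is the shape
  // an async action needs.
  def flat(id: String): Future[Int] = readCache(id).flatMap(callService)

  def main(args: Array[String]): Unit =
    println(Await.result(flat("42"), 5.seconds)) // prints 12
}
```

The nested version still compiles on its own, which is why the mistake tends to surface only as a type error at the point where a single Future is expected.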

Parallel file processing in Scala

Suppose I need to process files in a given folder in parallel. In Java I would create a FolderReader thread to read file names from the folder and a pool of FileProcessor threads. FolderReader reads file names and submits the file processing function (Runnable) to the pool executor.
In Scala I see two options:
create a pool of FileProcessor actors and schedule a file processing function with Actors.Scheduler.
create an actor for each file name while reading the file names.
Does it make sense? What is the best option?
Depending on what you're doing, it may be as simple as
for (file <- files.par) {
  // process the file
}
I strongly suggest staying as far away from raw threads as you can. Luckily we have better abstractions which take care of what's happening underneath, and in your case it appears to me that you do not need to use actors (although you can); you can use a simpler abstraction called Futures. They are a part of the Akka open source library, and I think in the future they will be a part of the Scala standard library as well.
A Future[T] is simply something that will return a T in the future.
All you need to run a future is an implicit ExecutionContext, which you can derive from a Java executor service. Then you will be able to enjoy the elegant API and the fact that a future is a monad to transform collections into collections of futures, collect the results and so on. I suggest you take a look at http://doc.akka.io/docs/akka/2.0.1/scala/futures.html
object TestingFutures {
  implicit val executorService = Executors.newFixedThreadPool(20)
  implicit val executorContext = ExecutionContext.fromExecutorService(executorService)
  def testFutures(myList: List[String]): List[String] = {
    val listOfFutures: Future[List[String]] = Future.traverse(myList) {
      aString => Future {
        aString.reverse
      }
    }
    val result: List[String] = Await.result(listOfFutures, 1 minute)
    result
  }
}
There's a lot going on here:
I am using Future.traverse, which takes as its first parameter an M[T] <: Traversable[T] and as its second parameter a T => Future[T] (or, if you prefer, a Function1[T, Future[T]]), and returns a Future[M[T]]
I am using the Future.apply method to create a Future[T] from a block of code
There are many other reasons to look at Akka futures.
Futures can be mapped because they are monads, i.e. you can chain Future executions:
Future { 3 }.map { _ * 2 }.map { _.toString }
Futures have callbacks: future.onComplete, onSuccess, onFailure, andThen etc.
Futures support not only traverse, but also for-comprehensions
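A small sketch of the for-comprehension point, written against scala.concurrent.Future from the standard library rather than akka.dispatch (an assumption on my part; the standard-library API grew out of the Akka one). countChars is a hypothetical helper standing in for real work on a file.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ForComprehensionSketch {
  // Hypothetical helper standing in for real work on a file.
  def countChars(name: String): Future[Int] = Future(name.length)

  def total(a: String, b: String): Future[Int] = {
    // Create both futures first so they can run concurrently...
    val fa = countChars(a)
    val fb = countChars(b)
    // ...then combine their results; this desugars to flatMap and map.
    for {
      x <- fa
      y <- fb
    } yield x + y
  }

  def main(args: Array[String]): Unit =
    println(Await.result(total("a.txt", "bb.txt"), 5.seconds)) // prints 11
}
```

If the two calls were written directly inside the for block, the second future would only be started after the first one completed.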
Ideally you should use two actors. One for reading the list of files, and one for actually reading the file.
You start the process by simply sending a single "start" message to the first actor. The actor can then read the list of files, and send a message to the second actor. The second actor then reads the file and processes the contents.
Having multiple actors, which might seem complicated, is actually a good thing in the sense that you have a bunch of objects communicating with each other, like in a theoretical OO system.
Edit: you REALLY shouldn't be doing concurrent reading of a single file.
I was going to write up exactly what @Edmondo1984 did except he beat me to it. :) I second his suggestion in a big way. I'll also suggest that you read the documentation for Akka 2.0.2. As well, I'll give you a slightly more concrete example:
import akka.dispatch.{ExecutionContext, Future, Await}
import akka.util.duration._
import java.util.concurrent.Executors
import java.io.File
val execService = Executors.newCachedThreadPool()
implicit val execContext = ExecutionContext.fromExecutorService(execService)
val tmp = new File("/tmp/")
val files = tmp.listFiles()
val workers = files.map { f =>
  Future {
    f.getAbsolutePath()
  }
}.toSeq
val result = Future.sequence(workers)
result.onSuccess {
  case filenames =>
    filenames.foreach { fn =>
      println(fn)
    }
}
// Artificial just to make things work for the example
Thread.sleep(100)
execContext.shutdown()
Here I use sequence instead of traverse, but the difference is going to depend on your needs.
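To illustrate how the two relate, here is a minimal sketch using scala.concurrent from the standard library (an assumption; the example above uses akka.dispatch). process is a hypothetical per-file function.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object SequenceVsTraverse {
  val names = List("a.txt", "b.txt", "c.txt")

  // Hypothetical per-file work.
  def process(name: String): Future[Int] = Future(name.length)

  // sequence: build the futures first, then turn
  // List[Future[Int]] into Future[List[Int]].
  def viaSequence: Future[List[Int]] = Future.sequence(names.map(process))

  // traverse: map and collect in one step.
  def viaTraverse: Future[List[Int]] = Future.traverse(names)(process)

  def main(args: Array[String]): Unit = {
    println(Await.result(viaSequence, 5.seconds)) // prints List(5, 5, 5)
    println(Await.result(viaTraverse, 5.seconds)) // prints List(5, 5, 5)
  }
}
```

sequence is convenient when you already hold a collection of futures; traverse saves the intermediate collection when you start from plain values.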
Go with the Future, my friend; the Actor is just a more painful approach in this instance.
But if we use actors, what's wrong with that?
Say we have to read from / write to some property file. Here is my Java example, still using Akka actors.
Let's say we have an actor ActorFile that represents one file. Hm.. Probably it cannot represent just one file, right? (It would be nice if it could.) So instead it represents several files, like PropertyFilesActor.
Why not use something like this:
public class PropertyFilesActor extends UntypedActor {
    Map<String, String> filesContent = new LinkedHashMap<String, String>();
    { // here we should use real files of course
        filesContent.put("file1.xml", "");
        filesContent.put("file2.xml", "");
    }
    @Override
    public void onReceive(Object message) throws Exception {
        if (message instanceof WriteMessage) {
            WriteMessage writeMessage = (WriteMessage) message;
            String content = filesContent.get(writeMessage.fileName);
            String newContent = content + writeMessage.stringToWrite;
            filesContent.put(writeMessage.fileName, newContent);
        }
        else if (message instanceof ReadMessage) {
            ReadMessage readMessage = (ReadMessage) message;
            String currentContent = filesContent.get(readMessage.fileName);
            // Send the current content back to the sender
            getSender().tell(new ReadMessage(readMessage.fileName, currentContent), getSelf());
        }
        else unhandled(message);
    }
}
...a message will go with parameter (fileName)
It has its own in-box, accepting messages like:
WriteLine(fileName, string)
ReadLine(fileName, string)
Those messages are stored in the in-box in order, one after another. The actor does its work by receiving messages from the box (storing/reading) and meanwhile sending feedback to the sender (sender ! message).
Thus, let's say we write to the property file and want to show the content on the web page. We can start showing the page right after we send the message to store data to the file, and as soon as we receive the feedback, update part of the page with the data from the just-updated file (by ajax).
Well, grab your files and stick them in a parallel structure
scala> new java.io.File("/tmp").listFiles.par
res0: scala.collection.parallel.mutable.ParArray[java.io.File] = ParArray( ... )
Then...
scala> res0 map (_.length)
res1: scala.collection.parallel.mutable.ParArray[Long] = ParArray(4943, 1960, 4208, 103266, 363 ... )