Scala: Sequential processing of file data - scala

I have a CSV file from which I read data to populate my database, and I am using Scala to do this. Instead of firing the DB inserts in parallel, I want to execute them sequentially (i.e. one after another). I do not want to use Await in a for loop. Is there any other approach apart from using Await?
P.S.: I have read the 1000 entries from the CSV into a list and am looping over the list to create the DB inserts.

Assuming you have some kind of save(entity: T): Future[_] method for your database, you can just fold your futures with flatMap (or for comprehension):
def saveAll(entities: List[T]): Future[Unit] = {
  entities.foldLeft(Future.successful(())) {
    case (f, entity) => for {
      _ <- f            // wait for all previous saves
      _ <- save(entity) // only then start this one
    } yield ()
  }
}
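To see that the fold really runs the saves one after another, here is a minimal, self-contained sketch. The `save` function, the event log and the sleep times are hypothetical stand-ins for a real database call; `Await` appears only at the edge of the demo to collect the result, not between inserts.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object SequentialSaveDemo {
  // Hypothetical event log standing in for the database.
  private var events = Vector.empty[String]
  private def record(event: String): Unit = synchronized { events :+= event }
  def recorded: Vector[String] = synchronized(events)

  // Hypothetical insert: later rows finish faster, so running them
  // in parallel would interleave the log entries.
  def save(row: Int): Future[Unit] = Future {
    record(s"start $row")
    Thread.sleep(30L - row * 10L)
    record(s"end $row")
  }

  // save(row) is only invoked once the accumulated Future has completed.
  def saveAll(rows: List[Int]): Future[Unit] =
    rows.foldLeft(Future.successful(())) { (done, row) =>
      done.flatMap(_ => save(row))
    }

  def run(): Vector[String] = {
    Await.result(saveAll(List(1, 2, 3)), 5.seconds) // demo only
    recorded
  }
}
```

The key point is that `save(row)` is called inside `flatMap`, so the next Future is not even created until the previous one has finished.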

Another option is a recursive function. It is less concise than foldLeft, but more readable to some. Just one more option for your consideration (assuming save(entity: T): Future[R]):
def saveAll(entities: List[T]): Future[List[R]] = {
entities.headOption match {
case Some(entity) =>
for {
head <- save(entity)
tail <- saveAll(entities.tail)
} yield {
head :: tail
}
case None =>
Future.successful(Nil)
}
}
Yet another option, if your save method lets you supply your own ExecutionContext, i.e. save(entity: T)(implicit ec: ExecutionContext): Future[R], is to fire the Futures concurrently but on a single-threaded execution context:
def saveAll(entities: List[T]): Future[List[R]] = {
  // one worker thread, so the submitted tasks run one at a time
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(java.util.concurrent.Executors.newSingleThreadExecutor())
  Future.sequence(entities.map(e => save(e)))
}

Related

How to Promise.allSettled with Scala futures?

I have two scala futures. I want to perform an action once both are completed, regardless of whether they were completed successfully. (Additionally, I want the ability to inspect those results at that time.)
In Javascript, this is Promise.allSettled.
Does Scala offer a simple way to do this?
One last wrinkle, if it matters: I want to do this in a JRuby application.
You can use the transform method to create a Future that will always succeed and return the result or the error as a Try object.
def toTry[A](future: Future[A])(implicit ec: ExecutionContext): Future[Try[A]] =
future.transform(x => Success(x))
To combine two Futures into one, you can use zip:
def settle2[A, B](fa: Future[A], fb: Future[B])(implicit ec: ExecutionContext)
: Future[(Try[A], Try[B])] =
toTry(fa).zip(toTry(fb))
If you want to combine an arbitrary number of Futures this way, you can use Future.traverse:
def allSettled[A](futures: List[Future[A]])(implicit ec: ExecutionContext)
: Future[List[Try[A]]] =
Future.traverse(futures)(toTry(_))
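As a rough end-to-end illustration of the helpers above, the following sketch settles three futures, one of which fails, and then inspects every outcome. The `describe` helper and the sample futures are made up for the demo, and the global execution context is assumed.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

object AllSettledDemo {
  // Turn success or failure into a value, so the outer Future always succeeds.
  def toTry[A](future: Future[A]): Future[Try[A]] =
    future.transform(result => Success(result))

  def allSettled[A](futures: List[Future[A]]): Future[List[Try[A]]] =
    Future.traverse(futures)(toTry(_))

  // Hypothetical helper: summarise each outcome as a string.
  def describe(results: List[Try[Int]]): List[String] = results.map {
    case Success(value) => s"ok: $value"
    case Failure(error) => s"failed: ${error.getMessage}"
  }

  def run(): List[String] = {
    val futures = List(
      Future(1),
      Future[Int](throw new RuntimeException("boom")),
      Future(3)
    )
    describe(Await.result(allSettled(futures), 5.seconds))
  }
}
```

The results come back in the same order as the input list, regardless of which future finished first.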
Normally in this case we would use Future.sequence to transform a collection of Futures into one single Future that you can map over, but Future.sequence short-circuits on a failed Future and doesn't wait for anything after that (one failure is treated as a failure for all), which doesn't fit your case.
In this case you need to map the failed ones to successful ones and then do the sequence, e.g.
val settledFuture = Future.sequence(List(future1, future2, ...).map(_.recoverWith { case _ => Future.unit }))
settledFuture.map { _ => /* here it is all settled */ }
EDIT
Since the results need to be kept, instead of mapping to Future.unit, we map the actual result into another layer of Try:
val settledFuture = Future.sequence(
List(Future(1), Future(throw new Exception))
.map(_.map(Success(_)).recover { case e => Failure(e) })
)
settledFuture.map(println(_))
//Output: List(Success(1), Failure(java.lang.Exception))
EDIT2
It can be further simplified with transform:
Future.sequence(listOfFutures.map(_.transform(Success(_))))
Perhaps you could use a concurrent counter to keep track of the number of completed Futures, and complete a Promise once all of them have finished:
def allSettled[T](futures: List[Future[T]])(implicit ec: ExecutionContext): Future[List[Future[T]]] = {
  val p = Promise[List[Future[T]]]()
  val length = futures.length
  val completedCount = new AtomicInteger(0)
  if (length == 0) p.trySuccess(futures) // nothing to wait for
  futures foreach {
    _.onComplete { _ =>
      if (completedCount.incrementAndGet == length) p.trySuccess(futures)
    }
  }
  p.future
}
val futures = List(
Future(-11),
Future(throw new Exception("boom")),
Future(42)
)
allSettled(futures).andThen { case result => println(result) }
// Success(List(Future(Success(-11)), Future(Failure(java.lang.Exception: boom)), Future(Success(42))))

Dealing with impure side-effects in FP, IO Monad

Trying to understand how best deal with side-effects in FP.
I implemented this rudimentary IO implementation:
trait IO[A] {
def run: A
}
object IO {
def unit[A](a: => A): IO[A] = new IO[A] { def run = a }
def loadFile(fileResourcePath: String) = IO.unit[List[String]]{
Source.fromResource(fileResourcePath).getLines.toList }
def printMessage(message: String) = IO.unit[Unit]{ println(message) }
def readLine(message:String) = IO.unit[String]{ StdIn.readLine() }
}
I have the following use case:
- load lines from log file
- parse each line to BusinessType object
- process each BusinessType object
- print process result
Case 1:
So Scala code may look like this
val load: String => List[String]
val parse: List[String] => List[BusinessType]
val process: List[BusinessType] => String
val output: String => Unit
Case 2:
I decide to use IO above:
val load: String => IO[List[String]]
val parse: IO[List[String]] => List[BusinessType]
val process: List[BusinessType] => IO[Unit]
val output: IO[Unit] => Unit
In case 1, load is impure because it reads from a file, and output is also impure because it writes the result to the console.
To be more functional I use case 2.
Questions:
- Aren't case 1 and 2 really the same thing?
- In case 2 aren't we just delaying the inevitable?
as the parse function will need to call the io.run
method and cause a side-effect?
- when they say "leave side-effects until the end of the world"
how does this apply to the example above? where is the
end of the world here?
Your IO monad seems to lack all the monad stuff, namely the part where you can flatMap over it to build a bigger IO out of smaller ones. That way, everything stays "pure" until you call run at the very end.
In case 2 aren't we just delaying the inevitable?
as the parse function will need call the io.run
method and cause a side effect?
No. The parse function should not call io.run. It should return another IO that you can then combine with its input IO.
when they say "leave side-effects until the end of the world"
how does this apply to the example above? where is the
end of the world here?
End of the world would be the last thing your program does. You only run once. The rest of your program "purely" builds one giant IO for that.
Something like
def load(): IO[Seq[String]]
def parse(data: Seq[String]): IO[Parsed] // returns IO because it has side effects
def pureComputation(data: Parsed): Result // no side effects, no need to use IO
def output(data: Result): IO[Unit]
// combining effects is "pure", so the whole thing
// can be a `val` (or a `def` if it takes some input params)
val program: IO[Unit] = for {
  data <- load() // use <- to flatMap over IO
  parsed <- parse(data)
  result = pureComputation(parsed) // = instead of <-, no I/O here
  _ <- output(result)
} yield ()
// only `run` at the end produces any effects
def main(): Unit = {
  program.run
}
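For the for-comprehension above to compile, IO needs map and flatMap. Below is a minimal sketch of what that could look like; it is not a production IO (no stack safety, no error handling), and `load`, `output` and the in-memory `printed` buffer are hypothetical stand-ins for the file and the console.

```scala
object IODemo {
  // A description of an effect; nothing happens until `run` is called.
  final class IO[A](thunk: () => A) {
    def run: A = thunk()
    def map[B](f: A => B): IO[B] = new IO(() => f(thunk()))
    def flatMap[B](f: A => IO[B]): IO[B] = new IO(() => f(thunk()).run)
  }
  object IO {
    def apply[A](a: => A): IO[A] = new IO(() => a)
  }

  // Hypothetical effects: a buffer stands in for the console.
  val printed = scala.collection.mutable.ListBuffer.empty[String]
  def load(): IO[List[String]] = IO(List("3", "4", "5"))
  def output(message: String): IO[Unit] = IO { printed += message; () }

  // Pure steps need no IO at all.
  def parse(lines: List[String]): List[Int] = lines.map(_.toInt)
  def process(numbers: List[Int]): String = s"sum = ${numbers.sum}"

  // Building the program only composes descriptions; no effect runs here.
  val program: IO[Unit] = for {
    lines  <- load()
    result  = process(parse(lines))
    _      <- output(result)
  } yield ()
}
```

Constructing `IODemo.program` leaves `printed` empty; only `IODemo.program.run`, called once at the "end of the world", appends the message.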

Scala - Batched Stream from Futures

I have instances of a case class Thing, and I have a bunch of queries to run that return a collection of Things like so:
def queries: Seq[Future[Seq[Thing]]]
I need to collect all Things from all futures (like above) and group them into equally sized collections of 10,000 so they can be serialized to files of 10,000 Things.
def serializeThings(things: Seq[Thing]): Future[Unit]
I want it to be implemented in such a way that I don't wait for all queries to run before serializing. As soon as there are 10,000 Things returned after the futures of the first queries complete, I want to start serializing.
If I do something like:
Future.sequence(queries)
It will collect the results of all the queries, but my understanding is that operations like map won't be invoked until all queries complete and all the Things must fit into memory at once.
What's the best way to implement a batched stream pipeline using Scala collections and concurrent libraries?
I think I managed to make something. The solution is based on my previous answer. It collects results from the Future[List[Thing]] results until it reaches a threshold of BatchSize. Then it calls the serializeThings future; when that finishes, the loop continues with the rest.
object BatchFutures extends App {
case class Thing(id: Int)
def getFuture(id: Int): Future[List[Thing]] = {
Future.successful {
List.fill(3)(Thing(id))
}
}
def serializeThings(things: Seq[Thing]): Future[Unit] = Future.successful {
//Thread.sleep(2000)
println("processing: " + things)
}
val ids = (1 to 4).toList
val BatchSize = 5
val future = ids.foldLeft(Future.successful[List[Thing]](Nil)) {
case (acc, id) =>
acc flatMap { processed =>
getFuture(id) flatMap { res =>
val all = processed ++ res
val (batch, rest) = all.splitAt(BatchSize)
if (batch.length == BatchSize) { // if futures filled the batch with needed amount
serializeThings(batch) map { _ =>
rest // process the rest
}
} else {
Future.successful(all) //if we need more Things for a batch
}
}
}
}.flatMap { rest =>
serializeThings(rest)
}
Await.result(future, Duration.Inf)
}
The result prints:
processing: List(Thing(1), Thing(1), Thing(1), Thing(2), Thing(2))
processing: List(Thing(2), Thing(3), Thing(3), Thing(3), Thing(4))
processing: List(Thing(4), Thing(4))
When the number of Things isn't divisible by BatchSize, we have to call serializeThings once more (the last flatMap). I hope it helps! :)
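One thing to watch with the fold above is that it writes at most one batch per query, so a single query returning more than two batches' worth of Things leaves a growing leftover. A possible variation, sketched here under the assumption that the queries are already-created Futures, drains every full batch before moving on. The `drain` helper and the in-memory `written` log are hypothetical.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BatchDemo {
  final case class Thing(id: Int)

  // Hypothetical serializer: records each batch instead of writing a file.
  private var written = Vector.empty[Seq[Thing]]
  def writtenBatches: Vector[Seq[Thing]] = synchronized(written)
  def serializeThings(things: Seq[Thing]): Future[Unit] =
    Future { synchronized { written :+= things } }

  // Serialize every full batch in `buffer`, one after another,
  // and return whatever is left over.
  def drain(buffer: List[Thing], batchSize: Int): Future[List[Thing]] =
    if (buffer.length < batchSize) Future.successful(buffer)
    else {
      val (batch, rest) = buffer.splitAt(batchSize)
      serializeThings(batch).flatMap(_ => drain(rest, batchSize))
    }

  def run(queries: List[Future[List[Thing]]], batchSize: Int): Future[Unit] =
    queries
      .foldLeft(Future.successful(List.empty[Thing])) { (acc, query) =>
        for {
          leftover <- acc
          things   <- query
          rest     <- drain(leftover ++ things, batchSize)
        } yield rest
      }
      // Flush the final, possibly short, batch.
      .flatMap(rest => if (rest.isEmpty) Future.unit else serializeThings(rest))
}
```

With a batch size of 5 and two queries returning 7 and 4 Things, this should write batches of 5, 5 and 1.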
Before you call Future.sequence, do what you want to do with each individual future, and only then use Future.sequence.
//this can be used for serializing
def doSomething(): Unit = ???
//do something with the failed future
def doSomethingElse(): Unit = ???
def doSomething(list: List[_]) = ???
val list: List[Future[_]] = List.fill(10000)(Future(doSomething()))
val newList =
list.map { f =>
f.map { result =>
doSomething()
}.recover { case throwable =>
doSomethingElse()
}
}
Future.sequence(newList).map ( list => doSomething(list)) //wait till all are complete
Instead of generating newList, you could use Future.traverse:
Future.traverse(list)(f => f.map( x => doSomething()).recover {case th => doSomethingElse() }).map ( completeListOfValues => doSomething(completeListOfValues))

Generating lazy scala streams by iteration

I'm looking for a way to generate a scala stream (the equivalent of F#'s sequence) of this form:
let allRows resultSet : seq<Row> =
seq {
while resultSet.next() do
yield new Row(resultSet)
}
Is there any way to easily do this in scala? The only way I found involved (non-tailrecursive) recursion, which for large amounts of rows in a resultSet would mean certain stackoverflow.
Thanks
You can implement it like this:
def toStream(rs:ResultSet):Stream[Row] =
if(!rs.next) Stream.Empty
else new Row(rs) #:: toStream(rs)
Note that since toStream is defined using def (as opposed to val), this solution will not keep the whole stream in memory, and the head of the stream can be garbage collected as long as nothing else holds a reference to it.
Another option you can use is to define new Iterator:
def toIterator(rs:ResultSet) = new Iterator[Row] {
override def hasNext: Boolean = rs.next()
override def next(): Row = new Row(rs)
}
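Note that the iterator above advances the cursor on every hasNext call, so calling hasNext twice in a row skips a row. A variation that avoids this, sketched here with a hypothetical `FakeResultSet` and `Row` in place of the real JDBC types, builds the iterator from Iterator.continually and takeWhile, which buffers the lookahead element.

```scala
object RowIteratorDemo {
  // Hypothetical minimal stand-ins for java.sql.ResultSet and the Row wrapper.
  final class FakeResultSet(values: List[String]) {
    private var remaining = values
    private var current: Option[String] = None
    def next(): Boolean = remaining match {
      case head :: tail => current = Some(head); remaining = tail; true
      case Nil          => current = None; false
    }
    def getString: String = current.get
  }
  final case class Row(value: String)

  // rs.next() is evaluated once per element inside takeWhile,
  // so repeated hasNext calls do not advance the cursor again.
  def rows(rs: FakeResultSet): Iterator[Row] =
    Iterator.continually(rs).takeWhile(_.next()).map(r => Row(r.getString))
}
```

The iterator is still single-pass and tied to the cursor, so each row should be consumed before asking for the next one.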
Suppose you have something like
trait ResultSet {
def next: Boolean
}
class Row(rs: ResultSet)
You can define your function as
def allRows(rs: ResultSet): Stream[Row] =
Stream.continually(if (rs.next) Some(new Row(rs)) else None)
.takeWhile(_.isDefined).map(_.get)

Play/Scala: Making unknown number of I/O calls in parallel, waiting for the results

So, I read the article here about parallel comprehension. He gives the following code example:
// Make 3 parallel async calls
val fooFuture = WS.url("http://foo.com").get()
val barFuture = WS.url("http://bar.com").get()
val bazFuture = WS.url("http://baz.com").get()
for {
foo <- fooFuture
bar <- barFuture
baz <- bazFuture
} yield {
// Build a Result using foo, bar, and baz
Ok(...)
}
All fine so far, but I am in a situation where I don't always know how many WS.get() calls I need to make; I want it to be dynamic. So for instance:
val checks = Seq(callOne(param), callTwo(param))
Where the calls are:
def callOne(param: String): Future[Boolean] = {
// do something and return the Future with a true/false value
Future(true)
}
def callTwo(param: String): Future[Boolean] = {
// do something and return the Future with a true/false value
Future(false)
}
So, my question is, how shall I react on the results of my sequence with WS calls (or database queries for that matter), in a for-yield?
I have given two example calls, but I want the same code to be able to process anywhere from one to many calls in parallel and gather the results in the for-yield, to ultimately proceed to do other things.
Important: All calls should be carried out in parallel; the quickest ones will complete before the slow ones, regardless of the order in which they are fired.
Future.sequence is likely what you want.
Example usage:
val futures = List(WS.url("http://foo.com").get(), WS.url("http://bar.com").get())
Future.sequence(futures) // => Transforms a Seq[Future[_]] to Future[Seq[_]]
The future returned from Future.sequence will not be completed until all of the futures in the input sequence are completed.
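Applied to the Future[Boolean] checks from the question, this might look like the sketch below. The bodies of `callOne` and `callTwo` and the `allPass` helper are invented for illustration; the checks are passed as functions so that any number of them can be supplied.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object DynamicChecksDemo {
  // Hypothetical checks standing in for WS calls or database queries.
  def callOne(param: String): Future[Boolean] = Future(param.nonEmpty)
  def callTwo(param: String): Future[Boolean] = Future(param.length > 3)

  def allPass(param: String, checks: Seq[String => Future[Boolean]]): Future[Boolean] = {
    // Every Future starts running as soon as it is created here,
    // so the calls proceed in parallel.
    val running = checks.map(check => check(param))
    for (results <- Future.sequence(running)) yield results.forall(identity)
  }
}
```

The for-yield receives the complete list of results in input order, however many checks were passed in.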
Bonus:
If your futures are heterogeneously typed, and you need to preserve those types, you can use an HList. I've written the following snippet, which takes an HList of futures and transforms it into a Future containing an HList of resolved values:
import shapeless._
import shapeless.ops.hlist.RightFolder
import scala.concurrent.{ExecutionContext, Future}
object FutureHelpers {
object FutureReducer extends Poly2 {
import scala.concurrent.ExecutionContext.Implicits.global
implicit def f[A, B <: HList] = at[Future[A], Future[B]] { (f, resultFuture) =>
for {
result <- resultFuture
value <- f
} yield value :: result
}
}
// Like Future.sequence, but for HList
// hsequence(Future { 1 } :: Future { "string" } :: HNil)
// => Future { 1 :: "string" :: HNil }
def hsequence[T <: HList](hlist: T)(implicit
executor: ExecutionContext,
folder: RightFolder[T, Future[HNil], FutureReducer.type]) = {
hlist.foldRight(Future.successful[HNil](HNil))(FutureReducer)
}
}