Akka streams: dealing with futures within graph stage - scala

Within an akka stream stage FlowShape[A, B] , part of the processing I need to do on the A's is to save/query a datastore with a query built with A data. But that datastore driver query gives me a future, and I am not sure how best to deal with it (my main question here).
case class Obj(a: String, b: Int, c: String)
case class Foo(myobject: Obj, name: String)
case class Bar(st: String)
//
class SaveAndGetId extends GraphStage[FlowShape[Foo, Bar]] {
val dao = new DbDao // some dao with an async driver
override def createLogic(inheritedAttributes: Attributes) = new GraphStageLogic(shape) {
setHandlers(in, out, new InHandler with Outhandler {
override def onPush() = {
val foo = grab(in)
val add = foo.record.value()
val result: Future[String] = dao.saveAndGetRecord(add.myobject)//saves and returns id as string
//the naive approach
val record = Await(result, Duration.inf)
push(out, Bar(record))// ***tests pass every time
//mapping the future approach
result.map { x=>
push(out, Bar(x))
} //***tests fail every time
The next stage depends on the id of the db record returned from query, but I want to avoid Await. I am not sure why mapping approach fails:
"it should work" in {
val source = Source.single(Foo(Obj("hello", 1, "world")))
val probe = source
.via(new SaveAndGetId))
.runWith(TestSink.probe)
probe
.request(1)
.expectBarwithId("one")//say we know this will be
.expectComplete()
}
private implicit class RichTestProbe(probe: Probe[Bar]) {
def expectBarwithId(expected: String): Probe[Bar] =
probe.expectNextChainingPF{
case r # Bar(str) if str == expected => r
}
}
When run with mapping future, I get failure:
should work ***FAILED***
java.lang.AssertionError: assertion failed: expected: message matching partial function but got unexpected message OnComplete
at scala.Predef$.assert(Predef.scala:170)
at akka.testkit.TestKitBase$class.expectMsgPF(TestKit.scala:406)
at akka.testkit.TestKit.expectMsgPF(TestKit.scala:814)
at akka.stream.testkit.TestSubscriber$ManualProbe.expectEventPF(StreamTestKit.scala:570)
The async side channels example in the docs has the future in the constructor of the stage, as opposed to building the future within the stage, so doesn't seem to apply to my case.

I agree with Ramon. Constructing a new FlowShapeis not necessary in this case and it is too complicated. It is very much convenient to use mapAsync method here:
Here is a code snippet to utilize mapAsync:
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
object MapAsyncExample {
val numOfParallelism: Int = 10
def main(args: Array[String]): Unit = {
Source.repeat(5)
.mapAsync(parallelism)(x => asyncSquare(x))
.runWith(Sink.foreach(println)) previous stage
}
//This method returns a Future
//You can replace this part with your database operations
def asyncSquare(value: Int): Future[Int] = Future {
value * value
}
}
In the snippet above, Source.repeat(5) is a dummy source to emit 5 indefinitely. There is a sample function asyncSquare which takes an integer and calculates its square in a Future. .mapAsync(parallelism)(x => asyncSquare(x)) line uses that function and emits the output of Future to the next stage. In this snipet, the next stage is a sink which prints every item.
parallelism is the maximum number of asyncSquare calls that can run concurrently.

I think your GraphStage is unnecessarily overcomplicated. The below Flow performs the same actions without the need to write a custom stage:
val dao = new DbDao
val parallelism = 10 //number of parallel db queries
val SaveAndGetId : Flow[Foo, Bar, _] =
Flow[Foo]
.map(foo => foo.record.value().myobject)
.mapAsync(parallelism)(rec => dao.saveAndGetRecord(rec))
.map(Bar.apply)
I generally try to treat GraphStage as a last resort, there is almost always an idiomatic way of getting the same Flow by using the methods provided by the akka-stream library.

Related

How to Promise.allSettled with Scala futures?

I have two scala futures. I want to perform an action once both are completed, regardless of whether they were completed successfully. (Additionally, I want the ability to inspect those results at that time.)
In Javascript, this is Promise.allSettled.
Does Scala offer a simple way to do this?
One last wrinkle, if it matters: I want to do this in a JRuby application.
You can use the transform method to create a Future that will always succeed and return the result or the error as a Try object.
def toTry[A](future: Future[A])(implicit ec: ExecutionContext): Future[Try[A]] =
future.transform(x => Success(x))
To combine two Futures into one, you can use zip:
def settle2[A, B](fa: Future[A], fb: Future[B])(implicit ec: ExecutionContext)
: Future[(Try[A], Try[B])] =
toTry(fa).zip(toTry(fb))
If you want to combine an arbitrary number of Futures this way, you can use Future.traverse:
def allSettled[A](futures: List[Future[A]])(implicit ec: ExecutionContext)
: Future[List[Try[A]]] =
Future.traverse(futures)(toTry(_))
Normally in this case we use Future.sequence to transform a collection of a Future into one single Future so you can map on it, but Scala short circuit the failed Future and doesn't wait for anything after that (Scala considers one failure to be a failure for all), which doesn't fit your case.
In this case you need to map failed ones to successful, then do the sequence, e.g.
val settledFuture = Future.sequence(List(future1, future2, ...).map(_.recoverWith { case _ => Future.unit }))
settledFuture.map(//Here it is all settled)
EDIT
Since the results need to be kept, instead of mapping to Future.unit, we map the actual result into another layer of Try:
val settledFuture = Future.sequence(
List(Future(1), Future(throw new Exception))
.map(_.map(Success(_)).recover(Failure(_)))
)
settledFuture.map(println(_))
//Output: List(Success(1), Failure(java.lang.Exception))
EDIT2
It can be further simplified with transform:
Future.sequence(listOfFutures.map(_.transform(Success(_))))
Perhaps you could use a concurrent counter to keep track of the number of completed Futures and then complete the Promise once all Futures have completed
def allSettled[T](futures: List[Future[T]]): Future[List[Future[T]]] = {
val p = Promise[List[Future[T]]]()
val length = futures.length
val completedCount = new AtomicInteger(0)
futures foreach {
_.onComplete { _ =>
if (completedCount.incrementAndGet == length) p.trySuccess(futures)
}
}
p.future
}
val futures = List(
Future(-11),
Future(throw new Exception("boom")),
Future(42)
)
allSettled(futures).andThen(println(_))
// Success(List(Future(Success(-11)), Future(Failure(java.lang.Exception: boom)), Future(Success(42))))
scastie

ConcurrentHashMap[String, AtomicInteger] or ConcurrentHashMap[String, Int] for thread-safe counters?

When incrementing concurrent counters by key in ConcurrentHashMap is it safe to use regular Int for value or do we have to use AtomicInteger? For example consider the following two implementations
ConcurrentHashMap[String, Int]
final class ExpensiveMetrics(implicit system: ActorSystem, ec: ExecutionContext) {
import scala.collection.JavaConverters._
private val chm = new ConcurrentHashMap[String, Int]().asScala
system.scheduler.schedule(5.seconds, 60.seconds)(publishAllMetrics())
def countRequest(key: String): Unit =
chm.get(key) match {
case Some(value) => chm.update(key, value + 1)
case None => chm.update(key, 1)
}
private def resetCount(key: String) = chm.replace(key, 0)
private def publishAllMetrics(): Unit =
chm foreach { case (key, value) =>
// publishMetric(key, value.doubleValue())
resetCount(key)
}
}
ConcurrentHashMap[String, AtomicInteger]
final class ExpensiveMetrics(implicit system: ActorSystem, ec: ExecutionContext) {
import scala.collection.JavaConverters._
private val chm = new ConcurrentHashMap[String, AtomicInteger]().asScala
system.scheduler.schedule(5.seconds, 60.seconds)(publishAllMetrics())
def countRequest(key: String): Unit =
chm.getOrElseUpdate(key, new AtomicInteger(1)).incrementAndGet()
private def resetCount(key: String): Unit =
chm.getOrElseUpdate(key, new AtomicInteger(0)).set(0)
private def publishAllMetrics(): Unit =
chm foreach { case (key, value) =>
// publishMetric(key, value.doubleValue())
resetCount(key)
}
}
Is the former implementation safe? If not, at what point in the snippet can race-condition be introduced and why?
The context of the question are AWS CloudWatch metrics which can get very expensive on high-frequency APIs if posted on each request. So I am trying to "batch" them up and publish them periodically.
The first implementation is not correct, because the countRequest method is not atomic. Consider this sequence of events:
Threads A and B both call countRequest with key "foo"
Thread A obtains the counter value, let's call it x
Thread B obtains the counter value. It's the same value x, because Thread A hasn't updated the counter yet.
Thread B updates the map with the new counter value, x+1
Thread A updates the map, and because it obtained the counter value before B wrote the new counter value, it also writes x+1.
The counter should be x+2, but it is x+1. It's a classic lost update problem.
The second implementation has a similar problem due to the use of the `getOrElseUpdate` method. `ConcurrentHashMap` does not have that method, therefore the Scala wrapper needs to emulate it. I think the implementation is the one inherited from `scala.collection.mutable.MapOps`, and it is defined like so:
```
def getOrElseUpdate(key: K, op: => V): V =
get(key) match {
case Some(v) => v
case None => val d = op; this(key) = d; d
}
```
This is obviously not atomic.
To implement this correctly, use the compute method on ConcurrentHashMap.
This method will execute atomically, so you won't need an AtomicInteger.

How to implement a concurrent processing in akka?

I have a method in which there are multiple calls to db. As I have not implemented any concurrent processing, a 2nd db call has to wait until the 1st db call gets completed, 3rd has to wait until the 2nd gets completed and so on.
All db calls are independent of each other. I want to make this in such a way that all DB calls run concurrently.
I am new to Akka framework.
Can someone please help me with small sample or references would help. Application is developed in Scala Lang.
There are three primary ways that you could achieve concurrency for the given example needs.
Futures
For the particular use case that is asked about in the question I would recommend Futures before any akka construct.
Suppose we are given the database calls as functions:
type Data = ???
val dbcall1 : () => Data = ???
val dbcall2 : () => Data = ???
val dbcall3 : () => Data = ???
Concurrency can be easily applied, and then the results can be collected, using Futures:
val f1 = Future { dbcall1() }
val f2 = Future { dbcall2() }
val f3 = Future { dbcall3() }
for {
v1 <- f1
v2 <- f2
v3 <- f3
} {
println(s"All data collected: ${v1}, ${v2}, ${v3}")
}
Akka Streams
There is a similar stack answer which demonstrates how to use the akka-stream library to do concurrent db querying.
Akka Actors
It is also possible to write an Actor to do the querying:
object MakeQuery
class DBActor(dbCall : () => Data) extends Actor {
override def receive = {
case _ : MakeQuery => sender ! dbCall()
}
}
val dbcall1ActorRef = system.actorOf(Props(classOf[DBActor], dbcall1))
However, in this use case Actors are less helpful because you still need to collect all of the data together.
You can either use the same technique as the "Futures" section:
val f1 : Future[Data] = (dbcall1ActorRef ? MakeQuery).mapTo[Data]
for {
v1 <- f1
...
Or, you would have to wire the Actors together by hand through the constructor and handle all of the callback logic for waiting on the other Actor:
class WaitingDBActor(dbCall : () => Data, previousActor : ActorRef) {
override def receive = {
case _ : MakeQuery => previousActor forward MakeQuery
case previousData : Data => sender ! (dbCall(), previousData)
}
}
If you want to querying database, you should use something like slick which is a modern database query and access library for Scala.
quick example of slick:
case class User(id: Option[Int], first: String, last: String)
class Users(tag: Tag) extends Table[User](tag, "users") {
def id = column[Int]("id", O.PrimaryKey, O.AutoInc)
def first = column[String]("first")
def last = column[String]("last")
def * = (id.?, first, last) <> (User.tupled, User.unapply)
}
val users = TableQuery[Users]
then your need to create configuration for your db:
mydb = {
dataSourceClass = "org.postgresql.ds.PGSimpleDataSource"
properties = {
databaseName = "mydb"
user = "myuser"
password = "secret"
}
numThreads = 10
}
and in your code you load configuration:
val db = Database.forConfig("mydb")
then run your query with db.run method which gives you future as result, for example you can get all rows by calling method result
val allRows: Future[Seq[User]] = db.run(users.result)
this query run without blocking current thread.
If you have task which take long time to execute or calling to another service, you should use futures.
Example of that is simple HTTP call to external service. you can find example in here
If you have task which take long time to execute and for doing so, you have to keep mutable states, in this case the best option is using Akka Actors which encapsulate your state inside an actor which solve problem of concurrency and thread safety as simple as possible.Example of suck tasks are:
import akka.actor.Actor
import scala.concurrent.Future
case class RegisterEndpoint(endpoint: String)
case class NewUpdate(update: String)
class UpdateConsumer extends Actor {
val endpoints = scala.collection.mutable.Set.empty[String]
override def receive: Receive = {
case RegisterEndpoint(endpoint) =>
endpoints += endpoint
case NewUpdate(update) =>
endpoints.foreach { endpoint =>
deliverUpdate(endpoint, update)
}
}
def deliverUpdate(endpoint: String, update: String): Future[Unit] = {
Future.successful(Unit)
}
}
If you want to process huge amount of live data, or websocket connection, processing CSV file which is growing over time, ... or etc, the best option is Akka stream. For example reading data from kafka topic using Alpakka:Alpakka kafka connector

Completing future with TimeoutException or Awaiting?

I have the following trait:
trait Tr{
/**
* Returns future of the number of bytes written
*/
def write(bytes: Array[Byte]): Future[Long]
}
So I can await the result like this:
I.
val bts: Array[Byte] = //...
val to: Long = //...
val tr: Tr = //...
Await.result(tr.write(bts), to)
But I also can design the Tr in a bit different way:
trait Tr{
/**
* Returns future of the number of bytes written
*/
def write(bytes: Array[Byte], timeout: Long): Future[Long]
}
II.
val bts: Array[Byte] = //...
val to: Long = //...
val tr: Tr = //...
Await.result(tr.write(bts, to), Duration.Inf)
What is the better way? I think the II case has its use in case where the actual writing-IO is not interruptible or performed by a caller thread. So for flexibility I would design the thread in the II way.
The thing is writing forever in II looks a bit wierd.
Is it correct to do so? Or I'm abuse the Future?
Upd: Consider the following possible implementation of the Tr:
class SameThreadExecutionContext extends ExecutionContext{
override def execute(runnable: Runnable): Unit = runnable.run()
override def reportFailure(cause: Throwable): Unit = ???
}
clas DummyTrImpl extends Tr{
private final implicit val ec = new SameThreadExecutionContext
override def write(bytes: Array[Byte]): Future[Long] = {
Thread.sleep(10000)
throw new RuntimeException("failed")
}
}
Now if I write this:
val bts: Array[Byte] = //...
val to: Long = 1000
val tr: Tr = new DummyTrImpl
Await.result(tr.write(bts), to) //Waiting 10 secs instead of 1
//and throwing RuntimeException instead of timeout
Option II is generally preferred (or globally set timeouts), because Futures are not generally "awaited", but tend to be chained instead.
It is unusual to write code like:
Await.result(tr.write(bts, to))
and more common to write code like:
tr.write(bts, to).then(written => ... /* write succeeded, do next action */ ..)
The two traits have different behaviours, so it depends a little on what you are trying to achieve.
Trait I. says
Write some data and take as long as necessary. Complete the Future when the write completes or fails.
Trait II. says
Try to write some data but give up if it takes too long. Complete the Future when the write completes, or fails, or times out
The second option is preferred because it is guaranteed to progress, and because the write is not left hanging if the Await.result times out.

Reading from postgres using Akka Streams 2.4.2 and Slick 3.0

Trying out the newly minted Akka Streams. It seems to be working except for one small thing - there's no output.
I have the following table definition:
case class my_stream(id: Int, value: String)
class Streams(tag: Tag) extends Table[my_stream](tag, "my_stream") {
def id = column[Int]("id")
def value = column[String]("value")
def * = (id, value) <> (my_stream.tupled, my_stream.unapply)
}
And I'm trying to output the contents of the table to stdout like this:
def main(args: Array[String]) : Unit = {
implicit val system = ActorSystem("Subscriber")
implicit val materializer = ActorMaterializer()
val strm = TableQuery[Streams]
val db = Database.forConfig("pg-postgres")
try{
var src = Source.fromPublisher(db.stream(strm.result))
src.runForeach(r => println(s"${r.id},${r.value}"))(materializer)
} finally {
system.shutdown
db.close
}
}
I have verified that the query is being run by configuring debug logging. However, all I get is this:
08:59:24.099 [main] INFO com.zaxxer.hikari.HikariDataSource - pg-postgres - is starting.
08:59:24.428 [main] INFO com.zaxxer.hikari.pool.HikariPool - pg-postgres - is closing down.
The cause is that Akka Streams is asynchronous and runForeach returns a Future which will be completed once the stream completes, but that Future is not being handled and as such the system.shutdown and db.close executes immediately instead of after the stream completes.
Just in case it helps anyone searching this very same issue but in MySQL, take into account that you should enable the driver stream support "manually":
def enableStream(statement: java.sql.Statement): Unit = {
statement match {
case s: com.mysql.jdbc.StatementImpl => s.enableStreamingResults()
case _ =>
}
}
val publisher = sourceDb.stream(query.result.withStatementParameters(statementInit = enableStream))
Source: http://www.slideshare.net/kazukinegoro5/akka-streams-100-scalamatsuri
Ended up using #ViktorKlang answer and just wrapped the run with an Await.result. I also found an alternative answer in the docs which demonstrates using the reactive streams publisher and subscriber interfaces:
The stream method returns a DatabasePublisher[T] and Source.fromPublisher returns a Source[T, NotUsed]. This means you have to attach a subscriber instead of using runForEach - according to the release notes NotUsed is a replacement for Unit. Which means nothing gets passed to the Sink.
Since Slick implements the reactive streams interface and not the Akka Stream interfaces you need to use the fromPublisher and fromSubscriber integration point. That means you need to implement the org.reactivestreams.Subscriber[T] interface.
Here's a quick and dirty Subscriber[T] implementation which simply calls println:
class MyStreamWriter extends org.reactivestreams.Subscriber[my_stream] {
private var sub : Option[Subscription] = None;
override def onNext(t: my_stream): Unit = {
println(t.value)
if(sub.nonEmpty) sub.head.request(1)
}
override def onError(throwable: Throwable): Unit = {
println(throwable.getMessage)
}
override def onSubscribe(subscription: Subscription): Unit = {
sub = Some(subscription)
sub.head.request(1)
}
override def onComplete(): Unit = {
println("ALL DONE!")
}
}
You need to make sure you call the Subscription.request(Long) method in onSubscribe and then in onNext to ask for data or nothing will be sent or you won't get the full set of results.
And here's how you use it:
def main(args: Array[String]) : Unit = {
implicit val system = ActorSystem("Subscriber")
implicit val materializer = ActorMaterializer()
val strm = TableQuery[Streams]
val db = Database.forConfig("pg-postgres")
try{
val src = Source.fromPublisher(db.stream(strm.result))
val flow = src.to(Sink.fromSubscriber(new MyStreamWriter()))
flow.run()
} finally {
system.shutdown
db.close
}
}
I'm still trying to figure this out so I welcome any feedback. Thanks!