Akka-streams: execute action on flow start - scala

Having a flow description in akka-streams
val flow: Flow[Input, Output, Unit] = ???
, how do I modify it to get a new flow description that performs a specified side-affect on start, i.e. when flow is materialized?

Starting materialization of a stream processing graph will set it in motion piece by piece, concurrently. The only way to perform an action that is guaranteed to happen before the first element is passed somewhere within that graph is to perform that action before materializing the graph. In this sense the answer by sschaef is slightly incorrect: using mapMaterializedValue runs the action pretty early, but not such that it is guaranteed to happend before the first element is processed.
If we are talking about a Flow here which only takes in inputs on one side and produces outputs on the other—i.e. it does not contain internal cycles or data sources—then one thing you can do to perform an action before the first element arrives is to attach a processing step to its input that does that:
def effectSource[T](block: => Unit) = Source.fromIterator(() => {block; Iterator.empty})
val newFlow = Flow[Input].prepend(effectSource(/* do stuff */)).via(flow)
Note
The above is using upcoming 2.0 syntax, in Akka Streams 1.0 it would be Source(() => { block; Iterator.empty }) and the prepend operation would need to be done using the FlowGraph DSL (the graph can be found here).

You said it by yourself, use the force of the materialization:
val newFlow = flow.mapMaterializedValue(_ ⇒ println("materialized"))

Related

When materialiser is actually used in Akka Streams Flows and when do we need to Keep values

I'm trying to learn Akka Streams and I'm stuck with this materialization here.
Every tutorial shows some basics source via to run examples where no real between Keep.left and Keep.right is explained. So I wrote this little piece of code, asked IntelliJ to add a type annotation to the values and started to dig the sources.
val single: Source[Int, NotUsed] = Source(Seq(1, 2, 3, 4, 5))
val flow: Flow[Int, Int, NotUsed] = Flow[Int].map(_ * 2)
val sink: Sink[Int, Future[Int]] = Sink.fold[Int, Int](0)(_ + _)
val run1: RunnableGraph[Future[Int]] =
single.viaMat(flow)(Keep.right).toMat(sink)(Keep.right)
val run2: RunnableGraph[NotUsed] =
single.viaMat(flow)(Keep.right).toMat(sink)(Keep.left)
val run3: RunnableGraph[(NotUsed, Future[Int])] =
single.viaMat(flow)(Keep.right).toMat(sink)(Keep.both)
val run4: RunnableGraph[NotUsed] =
single.viaMat(flow)(Keep.right).toMat(sink)(Keep.none)
So far I can understand that at the end of the execution we can need the value of the Sink that is of type Future[Int]. But I cannot think of any case when I gonna need to keep some of the values.
In the third example it is possible to acces both left and right values of the materialized output.
run3.run()._2 onComplete {
case Success(value) ⇒ println(value)
case Failure(exception) ⇒ println(exception.getMessage)
}
It actually works absolutely the same way if I change it to viaMat(flowMultiply)(Keep.left) or none or both.
But in what scenarios the materialized value could be used within the graph? Why would we need it if the value is flowing within anyway? Why do we need one of the values if we aren't gonna keep it?
Could you pelase provide an example where changing from left to right will not just break the compiler, but will actually bring a difference to the program logic?
For most streams, you only care about the value at the end of the stream. Accordingly, most of the Source and nearly all of the standard Flow operators have a materialized value of NotUsed, and the syntactic sugar .runWith boils down to .toMat(sink)(Keep.right).run.
Where one might care about the materialized value of a Source or Flow stage is when you want to be able to control a stage outside of the stream. An example of this is Source.actorRef, which allows you to send messages to an actor which get forwarded to the stream: you need the Source's materialized ActorRef in order to actually send a message to it. Likewise, you probably still want the materialized value of the Sink (whether to know that the stream processing happened (Future[Done]) or for an actual value at the end of the stream). In such a stream you'd probably have something like:
val stream: RunnableGraph[(ActorRef, Future[Done])] =
Source.actorRef(...)
.viaMat(calculateStuffFlow)(Keep.left) // propagates the ActorRef
.toMat(Sink.foreach { ... })(Keep.both)
val (sendToStream, done) = stream.run()
Another reasonably common use-case for this is in the Alpakka Kafka integration, where it's possible for the consumer to have a controller as a materialized value: this controller allows you to stop consuming from a topic and not unsubscribe until any pending offset commits have happened.

Detecting that back pressure is happening

My Akka HTTP application streams some data via server-sent events, and clients can request way more events than they can handle. The code looks something like this
complete {
source.filter(predicate.isMatch)
.buffer(1000, OverflowStrategy.dropTail)
.throttle(20, 1 second)
.map { evt => ServerSentEvent(evt) }
}
Is there a way to detect the fact that a stage backpressures and somehow notify the client preferably using the same sink (by emitting a different kind of output) or if not possible just make Akka framework call some sort of callback that will deal with the fact through a control side-channel?
So, I'm not sure I understand your use case. Are you asking about back pressure at .buffer or at .throttle? Another part of my confusion is that you are suggesting emitting a new "control" element in a situation where the stream is already back pressured. So your control element might not be received for some time. Also, if you emit a control element every single time you receive back pressure you will likely create a flood of control elements.
One way to build this (overly naive) solution would be to use conflate.
val simpleSink: Sink[String, Future[Done]] =
Sink.foreach(e => println(s"simple: $e"))
val cycleSource: Source[String, NotUsed] =
Source.cycle(() => List("1", "2", "3", "4").iterator).throttle(5, 1.second)
val conflateFlow: Flow[String, String, NotUsed] =
Flow[String].conflate((a, b) => {
"BACKPRESSURE CONTROL ELEMENT"
})
val backpressureFlow: Flow[String, String, NotUsed] =
Flow[String]
.buffer(10, OverflowStrategy.backpressure) throttle (2, 1.second)
val backpressureTest =
cycleSource.via(conflateFlow).via(backpressureFlow).to(simpleSink).run()
To turn this into a more usable example you could either:
Make some sort of call inside of .conflate (and then just drop one of the elements). Be careful not to do anything blocking though. Perhaps just send a message that could be de-duplicated elsewhere.
Write a custom graph stage. Doing something simple like this wouldn't be too difficult.
I think I'd have to understand more about the use case though. Take a look at all of the off the shelf backpressure aware operators and see if one of them helps.

Spark httpclient job does not parallel well [duplicate]

I build a RDD from a list of urls, and then try to fetch datas with some async http call.
I need all the results before doing other calculs.
Ideally, I need to make the http calls on differents nodes for scaling considerations.
I did something like this:
//init spark
val sparkContext = new SparkContext(conf)
val datas = Seq[String]("url1", "url2")
//create rdd
val rdd = sparkContext.parallelize[String](datas)
//httpCall return Future[String]
val requests = rdd.map((url: String) => httpCall(url))
//await all results (Future.sequence may be better)
val responses = requests.map(r => Await.result(r, 10.seconds))
//print responses
response.collect().foreach((s: String) => println(s))
//stop spark
sparkContext.stop()
This work, but Spark job never finish !
So I wonder what is are the best practices for dealing with Future using Spark (or Future[RDD]).
I think this use case looks pretty common, but didn't find any answer yet.
Best regards
this use case looks pretty common
Not really, because it simply doesn't work as you (probably) expect. Since each task operates on standard Scala Iterators these operations will be squashed together. It means that all operations will be blocking in practice. Assuming you have three URLs ["x", "y", "z"] you code will be executed in a following order:
Await.result(httpCall("x", 10.seconds))
Await.result(httpCall("y", 10.seconds))
Await.result(httpCall("z", 10.seconds))
You can easily reproduce the same behavior locally. If you want to execute your code asynchronously you should handle this explicitly using mapPartitions:
rdd.mapPartitions(iter => {
??? // Submit requests
??? // Wait until all requests completed and return Iterator of results
})
but this is relatively tricky. There is no guarantee all data for a given partition fits into memory so you'll probably need some batching mechanism as well.
All of that being said I couldn't reproduce the problem you've described to is can be some configuration issue or a problem with httpCall itself.
On a side note allowing a single timeout to kill whole task doesn't look like a good idea.
I couldnt find an easy way to achieve this. But after several iteration of retries this is what I did and its working for a huge list of queries. Basically we used this to do a batch operation for a huge query into multiple sub queries.
// Break down your huge workload into smaller chunks, in this case huge query string is broken
// down to a small set of subqueries
// Here if needed to optimize further down, you can provide an optimal partition when parallelizing
val queries = sqlContext.sparkContext.parallelize[String](subQueryList.toSeq)
// Then map each one those to a Spark Task, in this case its a Future that returns a string
val tasks: RDD[Future[String]] = queries.map(query => {
val task = makeHttpCall(query) // Method returns http call response as a Future[String]
task.recover {
case ex => logger.error("recover: " + ex.printStackTrace()) }
task onFailure {
case t => logger.error("execution failed: " + t.getMessage) }
task
})
// Note:: Http call is still not invoked, you are including this as part of the lineage
// Then in each partition you combine all Futures (means there could be several tasks in each partition) and sequence it
// And Await for the result, in this way you making it to block untill all the future in that sequence is resolved
val contentRdd = tasks.mapPartitions[String] { f: Iterator[Future[String]] =>
val searchFuture: Future[Iterator[String]] = Future sequence f
Await.result(searchFuture, threadWaitTime.seconds)
}
// Note: At this point, you can do any transformations on this rdd and it will be appended to the lineage.
// When you perform any action on that Rdd, then at that point,
// those mapPartition process will be evaluated to find the tasks and the subqueries to perform a full parallel http requests and
// collect those data in a single rdd.
If you dont want to perform any transformation on the content like parsing the response payload, etc. Then you could use foreachPartition instead of mapPartitions to perform all those http calls immediately.
I finally made it using scalaj-http instead of Dispatch.
Call are synchronous, but this match my use case.
I think the Spark Job never finish using Dispatch because the Http connection was not closed properly.
Best Regards
This wont work.
You cannot expect the request objects be distributed and responses collected over a cluster by other nodes. If you do then the spark calls for future will never end. The futures will never work in this case.
If your map() make sync(http) requests then please collect responses within the same action/transformation call and then subject the results(responses) to further map/reduce/other calls.
In your case, please rewrite logic collect the responses for each call in sync and remove the notion of futures then all should be fine.

How to parallelize several apache spark rdds?

I have the next code:
sc.parquetFile("some large parquet file with bc").registerTempTable("bcs")
sc.parquetFile("some large parquet file with imps").registerTempTable("imps")
val bcs = sc.sql("select * from bcs")
val imps = sc.sql("select * from imps")
I want to do:
bcs.map(x => wrapBC(x)).collect
imps.map(x => wrapIMP(x)).collect
but when I do this, it's running not async. I can to do it with Future, like that:
val bcsFuture = Future { bcs.map(x => wrapBC(x)).collect }
val impsFuture = Future { imps.map(x => wrapIMP(x)).collect }
val result = for {
bcs <- bcsFuture
imps <- impsFuture
} yield (bcs, imps)
Await.result(result, Duration.Inf) //this return (Array[Bc], Array[Imp])
I want to do this without Future, how can I do it?
Update This was originally composed before the question was updated. Given those updates, I agree with #stholzm's answer to use cartesian in this case.
There do exist a limited number of actions which will produce a FutureAction[A] for an RDD[A] and be executed in the background. These are available on the AsyncRDDActions class, and so long as you import SparkContext._ any RDD will can be implicitly converted to an AysnchRDDAction as needed. For your specific code example that would be:
bcs.map(x => wrapBC(x)).collectAsync
imps.map(x => wrapIMP(x)).collectAsync
In additionally to evaluating the DAG up to action in the background, the FutureAction produced has the cancel method to attempt to end processing early.
Caveat
This may not do what you think it does. If the intent is to get data from both sources and then combine them you're more likely to want to join or group the RDDs instead. For that you can look at the functions available in PairRDDFunctions, again available on RDDs through implicit conversion.
If the intention isn't to have the data graphs interact then so far in my experience then running batches concurrently might only serve to slow down both, though that may be a consequence of how the cluster is configured. If the resource manager is set up to give each execution stage a monopoly on the cluster in FIFO order (the default in standalone and YARN modes, I believe; I'm not sure about Mesos) then each of the asynchronous collects will contend with each other for that monopoly, run their tasks, then contend again for the next execution stage.
Compare this to using a Future to wrap blocking calls to downstream services or database, for example, where either the resources in question are completely separate or generally have enough resource capacity to handle multiple requests in parallel without contention.
Update: I misunderstood the question. The desired result is not the cartesian product Array[(Bc, Imp)].
But I'd argue that it does not matter how long the single map calls take because as soon as you add other transformations, Spark tries to combine them in an efficient way. As long as you only chain transformations on RDDs, nothing happens on the data. When you finally apply an action then the execution engine will figure out a way to produce the requested data.
So my advice would be to not think so much about the intermediate steps and avoid collect as much as possible because it will fetch all the data to the driver program.
It seems you are building a cartesian product yourself. Try cartesian instead:
val bc = bcs.map(x => wrapBC(x))
val imp = imps.map(x => wrapIMP(x))
val result = bc.cartesian(imp).collect
Note that collect is called on the final RDD and no longer on intermediate results.
You can use union for solve this problem. For example:
bcs.map(x => wrapBC(x).asInstanceOf[Any])
imps.map(x => wrapIMP(x).asInstanceOf[Any])
val result = (bcs union imps).collect()
val bcsResult = result collect { case bc: Bc => bc }
val impsResult = result collect { case imp: Imp => imp }
If you want to use sortBy or another operations, you can use inheritance of trait or main class.

What effect does using Action.async have, since Play uses Netty which is non-blocking

Since Netty is a non-blocking server, what effect does changing an action to using .async?
def index = Action { ... }
versus
def index = Action.async { ... }
I understand that with .async you will get a Future[SimpleResult]. But since Netty is non-blocking, will Play do something similar under the covers anyway?
What effect will this have on throughput/scalability? Is this a hard question to answer where it depends on other factors?
The reason I am asking is, I have my own custom Action and I wanted to reset the cookie timeout for every page request so I am doing this which is a async call:
object MyAction extends ActionBuilder[abc123] {
def invokeBlock[A](request: Request[A], block: (abc123[A]) => Future[SimpleResult]) = {
...
val result: Future[SimpleResult] = block(new abc123(..., result))
result.map(_.withCookies(...))
}
}
The take away from the above snippet is I am using a Future[SimpleResult], is this similar to calling Action.async but this is inside of my Action itself?
I want to understand what effect this will have on my application design. It seems like just for the ability to set my cookie on a per request basis I have changed from blocking to non-blocking. But I am confused since Netty is non-blocking, maybe I haven't really changed anything in reality as it was already async?
Or have I simply created another async call embedded in another one?
Hoping someone can clarify this with some details and how or what effect this will have in performance/throughput.
def index = Action { ... } is non-blocking you are right.
The purpose of Action.async is simply to make it easier to work with Futures in your actions.
For example:
def index = Action.async {
val allOptionsFuture: Future[List[UserOption]] = optionService.findAll()
allOptionFuture map {
options =>
Ok(views.html.main(options))
}
}
Here my service returns a Future, and to avoid dealing with extracting the result I just map it to a Future[SimpleResult] and Action.async takes care of the rest.
If my service was returning List[UserOption] directly I could just use Action.apply, but under the hood it would still be non-blocking.
If you look at Action source code, you can even see that apply eventually calls async:
https://github.com/playframework/playframework/blob/2.3.x/framework/src/play/src/main/scala/play/api/mvc/Action.scala#L432
I happened to come across this question, I like the answer from #vptheron, and I also want to share something I read from book "Reactive Web Applications", which, I think, is also great.
The Action.async builder expects to be given a function of type Request => Future[Result]. Actions declared in this fashion are not much different from plain Action { request => ... } calls, the only difference is that Play knows that Action.async actions are already asynchronous, so it doesn’t wrap their contents in a future block.
That’s right — Play will by default schedule any Action body to be executed asynchronously against its default web worker pool by wrapping the execution in a future. The only difference between Action and Action.async is that in the second case, we’re taking care of providing an asynchronous computation.
It also presented one sample:
def listFiles = Action { implicit request =>
val files = new java.io.File(".").listFiles
Ok(files.map(_.getName).mkString(", "))
}
which is problematic, given its use of the blocking java.io.File API.
Here the java.io.File API is performing a blocking I/O operation, which means that one of the few threads of Play's web worker pool will be hijacked while the OS figures out the list of files in the execution directory. This is the kind of situation you should avoid at all costs, because it means that the worker pool may run out of threads.
-
The reactive audit tool, available at https://github.com/octo-online/reactive-audit, aims to point out blocking calls in a project.
Hope it helps, too.