Akka Streams how to write a GraphStage with OrElse - scala

I have the following requirement. I am writing a GraphStage which needs to look up a SQL table (in a different db). If that lookup fails, only then should it look up a second table (in a different db), and if that lookup also fails, then look up a third table (in yet another db). If all lookups fail, use a default.
I googled and found this
http://doc.akka.io/japi/akka/current/akka/stream/scaladsl/OrElse.html
and also this thread
Alternative flows based on condition for akka stream
But broadcast and partition are not what I am looking for. I don't want to look up both tables simultaneously; what I want is that only if one flow returns a None is the second flow used to fetch the value.
Right now I have done something like this
val flow = Flow[Foo].map { foo =>
  lookup1(foo.id) orElse lookup2(foo.id) getOrElse default
}
But this makes the flow above very monolithic. It would be nice if I could break it into three separate flows and then connect them via an orElse clause in my graph stage.

Using flatMapConcat and orElse might help you make your code more generic in terms of the number of sources you want to combine. See the example below:
val altFlows: List[Flow[Foo, Option[Bar], NotUsed]] = ???
val default: Bar = ???

Flow[Foo].flatMapConcat { foo ⇒
  val altSources = altFlows.map(Source.single(foo).via(_).collect { case Some(x) ⇒ x })
  val defaultSource = Source.single(default)
  (altSources :+ defaultSource).reduce(_ orElse _)
}
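For the concrete three-table case from the question, the alternative flows can be built from the individual lookups, so each stays a separate, small flow. A minimal sketch, assuming lookup1/lookup2/lookup3 are the per-database lookup functions from the question and that foo.id is a Long (the id type is an assumption):

import akka.NotUsed
import akka.stream.scaladsl.Flow

// Hypothetical per-database lookups, each returning Option[Bar]
def lookup1(id: Long): Option[Bar] = ???
def lookup2(id: Long): Option[Bar] = ???
def lookup3(id: Long): Option[Bar] = ???

val altFlows: List[Flow[Foo, Option[Bar], NotUsed]] = List(
  Flow[Foo].map(foo => lookup1(foo.id)),
  Flow[Foo].map(foo => lookup2(foo.id)),
  Flow[Foo].map(foo => lookup3(foo.id))
)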

Related

What's the right way to "log and skip" validated transformations in spark-streaming

I have a spark-streaming application where I want to do some data transformations before my main operation, but the transformation involves some data validation.
When the validation fails, I want to log the failure cases, and then proceed on with the rest.
Currently, I have code like this:
def values: DStream[String] = ???
def validate(element: String): Either[String, MyCaseClass] = ???

val validationResults = values.map(validate)

validationResults.foreachRDD { rdd =>
  rdd.foreach {
    case Left(error) => logger.error(error)
    case _ =>
  }
}

val validatedValues: DStream[MyCaseClass] =
  validationResults.mapPartitions { partition =>
    partition.collect { case Right(record) => record }
  }
This currently works, but it feels like I'm doing something wrong.
Questions
As far as I understand, this will perform the validation twice, which is potentially wasteful.
Is it correct to use values.map(validation).persist() to solve that problem?
Even if I persist the values, it still iterates and pattern matches on all the elements twice. It feels like there should be some method I can use to solve this. On a regular scala collection, I might use some of the cats TraverseFilter api, or with fs2.Stream an evalMapFilter. What DStream api can I use for that? Maybe something with mapPartitions?
I would say that the best way to tackle this is to take advantage of the fact that flatMap accepts Option:
def values: DStream[String] = ???
def validate(element: String): Either[String, MyCaseClass] = ???

val validatedValues: DStream[MyCaseClass] =
  values.map(validate).flatMap {
    case Left(error) =>
      logger.error(error)
      None
    case Right(record) =>
      Some(record)
  }
You can also be a bit more verbose and use mapPartitions, which should be slightly more efficient; see the sketch below.
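A minimal sketch of that mapPartitions variant, with the same logic as above applied once per partition:

// Same validation logic per partition; Option converts to an Iterable
// implicitly, so flatMap drops the Nones.
val validatedValues: DStream[MyCaseClass] =
  values.mapPartitions { partition =>
    partition.map(validate).flatMap {
      case Left(error) =>
        logger.error(error)
        None
      case Right(record) =>
        Some(record)
    }
  }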
The 'best' option here depends a bit on the rest of your spark job and your version of spark.
Ideally you'd pick a mechanism natively understood by catalyst. The spark3 dataset observe listener may be what you're looking for there. (I haven't seen many examples of using this in the wild but it seems like this was the motivation behind such a thing.)
In pure spark sql, one option is to add a new column with the results of validation, e.g. a column named invalid_reason which is NULL if the record is valid or some [enumerated] string containing the reason the record failed validation. At this point, you likely want to persist/cache the dataset before doing a groupBy/count/collect/log operation, then filter where invalid_reason is null on the persisted dataframe and continue on the rest of the processing.
tl;dr: consider adding a validation column rather than just applying a 'validate' function. You then 'fork' processing here: log the records which have the invalid column specified, process the rest of your job on the records which don't. It does add some volume to your dataframe, but doesn't require processing the same records twice.
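A rough sketch of that validation-column idea in Spark SQL. The rawDf DataFrame, the column names, and the validation rules here are purely illustrative; logger is the one from the question:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}

// Tag each record with a reason it is invalid; NULL means the record is valid.
def withValidation(df: DataFrame): DataFrame =
  df.withColumn(
    "invalid_reason",
    when(col("username").isNull, lit("missing username"))
      .when(col("age") < 0, lit("negative age")) // no otherwise: defaults to NULL
  )

val checked = withValidation(rawDf).persist()

// Log/count the failures once on the persisted data...
checked.filter(col("invalid_reason").isNotNull)
  .groupBy("invalid_reason").count()
  .collect()
  .foreach(row => logger.error(row.toString))

// ...then continue the job with the valid records only.
val valid = checked.filter(col("invalid_reason").isNull)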

When materialiser is actually used in Akka Streams Flows and when do we need to Keep values

I'm trying to learn Akka Streams and I'm stuck with this materialization here.
Every tutorial shows some basic source.via(...).to(...).run() example where no real difference between Keep.left and Keep.right is explained. So I wrote this little piece of code, asked IntelliJ to add type annotations to the values, and started to dig into the sources.
val single: Source[Int, NotUsed] = Source(Seq(1, 2, 3, 4, 5))
val flow: Flow[Int, Int, NotUsed] = Flow[Int].map(_ * 2)
val sink: Sink[Int, Future[Int]] = Sink.fold[Int, Int](0)(_ + _)

val run1: RunnableGraph[Future[Int]] =
  single.viaMat(flow)(Keep.right).toMat(sink)(Keep.right)

val run2: RunnableGraph[NotUsed] =
  single.viaMat(flow)(Keep.right).toMat(sink)(Keep.left)

val run3: RunnableGraph[(NotUsed, Future[Int])] =
  single.viaMat(flow)(Keep.right).toMat(sink)(Keep.both)

val run4: RunnableGraph[NotUsed] =
  single.viaMat(flow)(Keep.right).toMat(sink)(Keep.none)
So far I can understand that at the end of the execution we may need the value of the Sink, which is of type Future[Int]. But I cannot think of any case where I would need to keep any of the other values.
In the third example it is possible to access both the left and right values of the materialized output.
run3.run()._2 onComplete {
  case Success(value)     ⇒ println(value)
  case Failure(exception) ⇒ println(exception.getMessage)
}
It actually works in exactly the same way if I change it to viaMat(flow)(Keep.left), or to none or both.
But in what scenarios could the materialized value be used within the graph? Why would we need it if the value is flowing within the stream anyway? Why do we need one of the values if we aren't going to keep it?
Could you please provide an example where changing from left to right will not just break the compiler, but will actually make a difference to the program logic?
For most streams, you only care about the value at the end of the stream. Accordingly, most of the standard Source operators and nearly all of the standard Flow operators have a materialized value of NotUsed, and the syntactic sugar .runWith boils down to .toMat(sink)(Keep.right).run.
Where one might care about the materialized value of a Source or Flow stage is when you want to be able to control a stage outside of the stream. An example of this is Source.actorRef, which allows you to send messages to an actor which get forwarded to the stream: you need the Source's materialized ActorRef in order to actually send a message to it. Likewise, you probably still want the materialized value of the Sink (whether to know that the stream processing happened (Future[Done]) or for an actual value at the end of the stream). In such a stream you'd probably have something like:
val stream: RunnableGraph[(ActorRef, Future[Done])] =
  Source.actorRef(...)
    .viaMat(calculateStuffFlow)(Keep.left) // propagates the ActorRef
    .toMat(Sink.foreach { ... })(Keep.both)

val (sendToStream, done) = stream.run()
Another reasonably common use-case for this is in the Alpakka Kafka integration, where it's possible for the consumer to have a controller as a materialized value: this controller allows you to stop consuming from a topic and not unsubscribe until any pending offset commits have happened.
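For a self-contained example where Keep.left versus Keep.right changes what the program can actually do (not just its types), consider Source.tick, which materializes a Cancellable. A minimal sketch, assuming an implicit ActorSystem/Materializer is in scope; the "tick" element is just illustrative:

import scala.concurrent.duration._
import akka.stream.scaladsl.{Keep, Sink, Source}

// Source.tick materializes a Cancellable; Sink.foreach materializes a Future[Done].
val (cancellable, done) =
  Source.tick(1.second, 1.second, "tick")
    .toMat(Sink.foreach(println))(Keep.both)
    .run()

// Keeping the left value is the only way to stop the ticking from outside the
// stream; with Keep.right alone you would only get the Future[Done] and could
// never cancel the source.
cancellable.cancel()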

How to get values from query result in slick

I am trying to use Slick (3.0.2) to operate on a database in my Scala project.
Here is part of my code:
val query = table
  .filter(_.username === "LeoAshin")
  .map { user => (user.username, user.password, user.role, user.delFlg) }
val f = db.run(query.result)
How can I read the data from f?
I have googled for a solution many times, but no answer has resolved my confusion.
Thanks a lot.
f is a Future and there are several things you can do to get at the value, depending on just how urgently you need it. If there's nothing else you can do while you're waiting then you will just have to wait for it to complete and then get the value. Perhaps the easiest is along the following lines (from the Slick documentation):
val q = for (c <- coffees) yield c.name
val a = q.result
val f: Future[Seq[String]] = db.run(a)
f.onSuccess { case s => println(s"Result: $s") }
You could then go on to do other things that don't depend on the result of f and the result of the query will be printed on the console asynchronously.
However, most of the time you'll want to use the value for some other operation, perhaps another database query. In that case, the easiest thing to do is to use a for comprehension. Something like the following:
for (r <- f; r2 <- db.run(q(r))) yield r2
where q is a function/method that takes the result of your first query and builds the next one. The result of this for comprehension is, of course, also a Future.
One thing to be aware of is that if you are running this in a program that will do an exit after all the code has been run, you will need to use Await (see the Scala API) to prevent the program exiting while one of your db queries is still working.
The type of query.result is DBIO. When you call db.run, it turns into a Future.
If you want to print the data, use
import scala.concurrent.ExecutionContext.Implicits.global
f.foreach(println)
To continue working with the data, map over the rows inside the Future, e.g. f.map(_.map { case (username, password, role, delFlg) ⇒ ... })
If you want to block on the Future and get the result (if you're playing around in REPL for example), use something like
import scala.concurrent.Await
import scala.concurrent.duration._
Await.result(f, 1.second)
Bear in mind, this is not what you want to do in production code — blocking on Futures is a bad practice.
I generally recommend learning about Scala core types and Futures specifically. Slick "responsibility" ends when you call db.run.
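Putting the pieces together, a minimal sketch; the column types are assumed to all be Strings here, since the question does not show the table definition:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Assumes all four projected columns are Strings; adjust to the actual schema.
val f: Future[Seq[(String, String, String, String)]] = db.run(query.result)

val usernames: Future[Seq[String]] =
  f.map(rows => rows.map { case (username, _, _, _) => username })

// Runs asynchronously once the query result arrives.
usernames.foreach(names => println(s"Found: ${names.mkString(", ")}"))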

Iterator of InputStream: How to close the InputStreams?

I have an Iterator[InputStream] which I map over to retrieve the individual results:
val streams: Iterator[InputStream[CustomType]] = retrieveStreams()
val results: Iterator[MyResultType] = streams flatMap (c => transformToResult(c))
This works as expected, meaning I can retrieve values of type MyResultType from the results iterator. The only problem I have is that the individual InputStreams are never being closed. Is there any way to do this?
There is no magic way to close it, or at least to guarantee that it will get closed. Thus you have to close each stream yourself. Take a look at the Loan Pattern which makes it less error prone: Loaner Pattern in Scala.
In your case you don't have a single resource to release but rather a collection of resources, so adjust your custom loan pattern accordingly.
Since you are dealing with an Iterator you might have an unlimited supply of InputStreams; in that case your transformToResult function, or something else at the element level, would have to close each stream.
It could look something like this:
val streams: Iterator[InputStream[CustomType]] = retrieveStreams()
val results: Iterator[MyResultType] =
  streams flatMap (c => yourLoaner(c)(transformToResult))
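A minimal sketch of what such a yourLoaner helper could look like, assuming the stream type is AutoCloseable (as java.io.InputStream is) or otherwise exposes close():

// Hypothetical loan helper: applies f and always closes the resource afterwards.
// Note: if transformToResult is lazy, force it (e.g. with .toList) inside the
// loan, otherwise the stream may be closed before its elements are read.
def yourLoaner[R <: AutoCloseable, A](resource: R)(f: R => A): A =
  try f(resource) finally resource.close()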

How to parallelize several apache spark rdds?

I have the following code:
sc.parquetFile("some large parquet file with bc").registerTempTable("bcs")
sc.parquetFile("some large parquet file with imps").registerTempTable("imps")
val bcs = sc.sql("select * from bcs")
val imps = sc.sql("select * from imps")
I want to do:
bcs.map(x => wrapBC(x)).collect
imps.map(x => wrapIMP(x)).collect
but when I do this, it doesn't run asynchronously. I can do it with Future, like this:
val bcsFuture = Future { bcs.map(x => wrapBC(x)).collect }
val impsFuture = Future { imps.map(x => wrapIMP(x)).collect }

val result = for {
  bcs <- bcsFuture
  imps <- impsFuture
} yield (bcs, imps)

Await.result(result, Duration.Inf) // this returns (Array[Bc], Array[Imp])
I want to do this without Future, how can I do it?
Update: This was originally composed before the question was updated. Given those updates, I agree with @stholzm's answer to use cartesian in this case.
There do exist a limited number of actions which will produce a FutureAction[A] for an RDD[A] and be executed in the background. These are available on the AsyncRDDActions class, and so long as you import SparkContext._ any RDD can be implicitly converted to an AsyncRDDActions as needed. For your specific code example that would be:
bcs.map(x => wrapBC(x)).collectAsync
imps.map(x => wrapIMP(x)).collectAsync
In addition to evaluating the DAG up to the action in the background, the FutureAction produced has a cancel method to attempt to end processing early.
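Since a FutureAction is also a scala.concurrent.Future, the two background collects can be combined with the usual Future combinators; a minimal sketch:

import org.apache.spark.SparkContext._ // brings the AsyncRDDActions conversion into scope
import scala.concurrent.ExecutionContext.Implicits.global

val bcsFuture  = bcs.map(x => wrapBC(x)).collectAsync()   // FutureAction[Seq[Bc]]
val impsFuture = imps.map(x => wrapIMP(x)).collectAsync() // FutureAction[Seq[Imp]]

// FutureAction extends Future, so for comprehensions work as usual.
val result = for {
  b <- bcsFuture
  i <- impsFuture
} yield (b, i)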
Caveat
This may not do what you think it does. If the intent is to get data from both sources and then combine them you're more likely to want to join or group the RDDs instead. For that you can look at the functions available in PairRDDFunctions, again available on RDDs through implicit conversion.
If the intention isn't to have the data graphs interact, then in my experience so far, running batches concurrently might only serve to slow down both, though that may be a consequence of how the cluster is configured. If the resource manager is set up to give each execution stage a monopoly on the cluster in FIFO order (the default in standalone and YARN modes, I believe; I'm not sure about Mesos) then each of the asynchronous collects will contend with each other for that monopoly, run their tasks, then contend again for the next execution stage.
Compare this to using a Future to wrap blocking calls to downstream services or database, for example, where either the resources in question are completely separate or generally have enough resource capacity to handle multiple requests in parallel without contention.
Update: I misunderstood the question. The desired result is not the cartesian product Array[(Bc, Imp)].
But I'd argue that it does not matter how long the single map calls take because as soon as you add other transformations, Spark tries to combine them in an efficient way. As long as you only chain transformations on RDDs, nothing happens on the data. When you finally apply an action then the execution engine will figure out a way to produce the requested data.
So my advice would be to not think so much about the intermediate steps and avoid collect as much as possible because it will fetch all the data to the driver program.
It seems you are building a cartesian product yourself. Try cartesian instead:
val bc = bcs.map(x => wrapBC(x))
val imp = imps.map(x => wrapIMP(x))
val result = bc.cartesian(imp).collect
Note that collect is called on the final RDD and no longer on intermediate results.
You can use union to solve this problem. For example:
val bcsAny  = bcs.map(x => wrapBC(x).asInstanceOf[Any])
val impsAny = imps.map(x => wrapIMP(x).asInstanceOf[Any])

val result = (bcsAny union impsAny).collect()

val bcsResult  = result collect { case bc: Bc => bc }
val impsResult = result collect { case imp: Imp => imp }
If you want to use sortBy or other operations, you can have Bc and Imp extend a common trait or base class instead of casting to Any, as sketched below.
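A small sketch of that idea; the trait and field names are illustrative, not taken from the question:

import org.apache.spark.rdd.RDD

// Illustrative common trait so the union keeps a useful static type
// and no asInstanceOf[Any] casts are needed.
sealed trait Event
case class Bc(id: Long) extends Event
case class Imp(id: Long) extends Event

val events: RDD[Event] =
  bcs.map(x => wrapBC(x): Event) union imps.map(x => wrapIMP(x): Event)

// Operations that need a concrete key, such as sortBy, now work on the union.
val sorted = events.sortBy {
  case Bc(id)  => id
  case Imp(id) => id
}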