Akka FileIO.fromPath - How to deal with IOResult and get the data instead? - scala

I looked at a lot of examples and posts about this. I got it working one way, but I haven't quite grasped the idea yet: I keep getting tripped up by Future[IOResult] when what I actually want is to read a file into a stream of record objects, one per line, i.e. a Future[List[LineRecordCaseClass]].
val source = FileIO.fromPath(Paths.get("/tmp/junk_data.csv"))
val flow = makeFlow() // Framing.delimiter->split(",")->map to LineRecordCaseClass
val sink = Sink.collection[LineRecordCaseClass, List[LineRecordCaseClass]]
val graph = source.via(flow).to(sink)
val typeMismatchError: Future[List[LineRecordCaseClass]] = graph.run()
Why does graph.run() return a Future[IOResult] instead? Perhaps I'm missing a Keep.left somewhere, or something? If so, what and where?
There's some concept here that I'm missing.

Here are the types of your vals:
val source: Source[ByteString, Future[IOResult]] =
val flow: Flow[ByteString, LineRecordCaseClass, NotUsed] =
val sink: Sink[LineRecordCaseClass, Future[List[LineRecordCaseClass]]] =
From the akka-stream docs:
By default, the materialized value of the leftmost stage is preserved
The materialized value at your leftmost stage (the source) is Future[IOResult].
In source.via(flow).to(sink), if you look at the implementation of .to, it calls .toMat with a default Keep.left
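For example, with the vals above, these two are equivalent and both keep the source's materialized value:
val g1: RunnableGraph[Future[IOResult]] = source.via(flow).to(sink)
val g2: RunnableGraph[Future[IOResult]] = source.via(flow).toMat(sink)(Keep.left)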
The type for Keep.both is
val check: RunnableGraph[(Future[IOResult], Future[List[LineRecordCaseClass]])] = source.via(flow).toMat(sink)(Keep.both)
So if you want Future[List[LineRecordCaseClass]], you can do
source.via(flow).toMat(sink)(Keep.right)
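Putting it together, a minimal sketch reusing the question's makeFlow() and case class:
val records: Future[List[LineRecordCaseClass]] =
  FileIO.fromPath(Paths.get("/tmp/junk_data.csv"))
    .via(makeFlow())
    .toMat(Sink.collection[LineRecordCaseClass, List[LineRecordCaseClass]])(Keep.right)
    .run()

// or keep both materialized values if you also want the IO result:
val (ioResult, records2) =
  FileIO.fromPath(Paths.get("/tmp/junk_data.csv"))
    .via(makeFlow())
    .toMat(Sink.collection[LineRecordCaseClass, List[LineRecordCaseClass]])(Keep.both)
    .run()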
I recommend this video, which explains materialized values.

Related

How can I use a value in an Akka Stream to instantiate a GooglePubSub Flow?

I'm attempting to create a Flow to be used with a Source queue. I would like this to work with the Alpakka Google PubSub connector: https://doc.akka.io/docs/alpakka/current/google-cloud-pub-sub.html
In order to use this connector, I need to create a Flow that depends on the topic name provided as a String, as shown in the above link and in the code snippet.
val publishFlow: Flow[PublishRequest, Seq[String], NotUsed] =
  GooglePubSub.publish(topic, config)
The question
I would like to be able to set up a Source queue that receives the topic and message required for publishing a message. I first create the necessary PublishRequest out of the message String. I then want to run this through the Flow that is instantiated by running GooglePubSub.publish(topic, config). However, I don't know how to get the topic to that part of the flow.
val gcFlow: Flow[(String, String), PublishRequest, NotUsed] = Flow[(String, String)]
  .map(messageData => {
    PublishRequest(Seq(
      PubSubMessage(new String(Base64.getEncoder.encode(messageData._1.getBytes))))
    )
  })
  .via(GooglePubSub.publish(topic, config))
val bufferSize = 10
val elementsToProcess = 5

// newSource is a Source[PublishRequest, NotUsed]
val (queue, newSource) = Source
  .queue[(String, String)](bufferSize, OverflowStrategy.backpressure)
  .via(gcFlow)
  .preMaterialize()
I'm not sure if there's a way to get the topic into the queue without it being a part of the initial data stream. And I don't know how to get the stream value into the dynamic Flow.
If I have improperly used some terminology, please keep in mind that I'm new to this.
You can achieve it by using flatMapConcat and generating a new Source within it:
// using tuple assuming (Topic, Message)
val gcFlow: Flow[(String, String), (String, PublishRequest), NotUsed] = Flow[(String, String)]
  .map(messageData => {
    val pr = PublishRequest(immutable.Seq(
      PubSubMessage(new String(Base64.getEncoder.encode(messageData._2.getBytes)))))
    // output flow shape of (String, PublishRequest)
    (messageData._1, pr)
  })

val publishFlow: Flow[(String, PublishRequest), Seq[String], NotUsed] =
  Flow[(String, PublishRequest)].flatMapConcat {
    case (topic: String, pr: PublishRequest) =>
      // Create a Source[PublishRequest]
      Source.single(pr).via(GooglePubSub.publish(topic, config))
  }
// wire it up
val (queue, newSource) = Source
  .queue[(String, String)](bufferSize, OverflowStrategy.backpressure)
  .via(gcFlow)
  .via(publishFlow)
  .preMaterialize()
Optionally, you could substitute the tuple with a case class to document it better:
case class Something(topic: String, payload: PublishRequest)

// in gcFlow, output a Something instead of a tuple
Something(messageData._1, pr)

Flow[Something].flatMapConcat { s =>
  Source.single(s.payload).via(GooglePubSub.publish(s.topic, config))
}
Explanation:
In gcFlow we output a FlowShape of tuple (String, PublishRequest), which is passed on to publishFlow. There, in flatMapConcat, we generate a new Source[PublishRequest] for each element and run it through GooglePubSub.publish for that element's topic.
There is a slight overhead in creating a new Source for every item, but it shouldn't have a measurable impact on performance.
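If strict ordering of the published elements doesn't matter, a sketch of an alternative (not in the original answer) is flatMapMerge, which runs several per-topic publishes concurrently:
val publishFlowMerged: Flow[(String, PublishRequest), Seq[String], NotUsed] =
  Flow[(String, PublishRequest)].flatMapMerge(4, {
    // publish each request to its own topic, up to 4 substreams at a time
    case (topic, pr) => Source.single(pr).via(GooglePubSub.publish(topic, config))
  })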

Change a materialized value in a source using the contents of the stream

Alpakka provides a great way to access dozens of different data sources. File-oriented sources such as HDFS and FTP sources are delivered as Source[ByteString, Future[IOResult]]. However, HTTP requests via Akka HTTP are delivered as entity streams of Source[ByteString, NotUsed]. In my use case, I would like to retrieve content from HTTP sources as Source[ByteString, Future[IOResult]] so I can build a unified resource fetcher that works with multiple schemes (hdfs, file, ftp and S3 in this case).
In particular, I would like to convert the Source[ByteString, NotUsed] source to Source[ByteString, Future[IOResult]], where I am able to calculate the IOResult from the incoming byte stream. There are plenty of methods like flatMapConcat and viaMat, but none seem to be able to extract details from the input stream (such as the number of bytes read) or initialise the IOResult structure properly. Ideally, I am looking for a method with the following signature that will update the IOResult as the stream comes in:
def matCalc(src: Source[ByteString, Any]): Source[ByteString, Future[IOResult]] = {
  src.someMatFoldMagic[ByteString, IOResult](IOResult.createSuccessful(0))((m, b) => m.withCount(m.count + b.length))
}
I can't recall any existing functionality that can do this out of the box, but you can use the alsoToMat flow function (surprisingly, I didn't find it in the akka-streams docs, although you can look it up in the source code documentation & Java API) together with Sink.fold to accumulate some value and hand it over at the very end, e.g.:
def magic(source: Source[Int, Any]): Source[Int, Future[Int]] =
  source.alsoToMat(Sink.fold(0)((acc, _) => acc + 1))((_, f) => f)
The thing is that alsoToMat combines the input materialized value with the one provided in alsoToMat. At the same time, the values produced by the source are not affected by the sink in alsoToMat:
def alsoToMat[Mat2, Mat3](that: Graph[SinkShape[Out], Mat2])(matF: (Mat, Mat2) ⇒ Mat3): ReprMat[Out, Mat3] =
  viaMat(alsoToGraph(that))(matF)
It's not that hard to adapt this function to return an IOResult, which according to the source code is:
final case class IOResult(count: Long, status: Try[Done]) { ... }
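For example, a minimal sketch of the same alsoToMat trick adapted to ByteString input and an IOResult materialized value (counting bytes instead of elements; assumes an implicit ExecutionContext is in scope for the .map):
def withIoResult(src: Source[ByteString, Any]): Source[ByteString, Future[IOResult]] =
  src.alsoToMat(
    // count bytes on a side channel; the fold materializes a Future[Long]
    Sink.fold(0L)((count: Long, bytes: ByteString) => count + bytes.length)
  )((_, byteCount) => byteCount.map(count => IOResult.createSuccessful(count)))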
One more thing you need to pay attention to: you want your source to be
Source[ByteString, Future[IOResult]]
but if you want to carry this materialized value until the very end of the stream definition and then do something based on the future's completion, that might be an error-prone approach. E.g. in this example I finish the work based on that future, so the last value will not be processed:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, Sink, Source}

import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object App extends App {
  private implicit val sys: ActorSystem = ActorSystem()
  private implicit val mat: ActorMaterializer = ActorMaterializer()
  private implicit val ec: ExecutionContext = sys.dispatcher

  val source: Source[Int, Any] = Source((1 to 5).toList)

  def magic(source: Source[Int, Any]): Source[Int, Future[Int]] =
    source.alsoToMat(Sink.fold(0)((acc, _) => acc + 1))((_, f) => f)

  val f = magic(source).throttle(1, 1.second).toMat(Sink.foreach(println))(Keep.left).run()
  f.onComplete(t => println(s"f1 completed - $t"))

  Await.ready(f, 5.minutes)

  mat.shutdown()
  sys.terminate()
}
This can be done by using a Promise for the materialized value propagation.
val completion = Promise[IOResult]()
val httpWithIoResult = http.mapMaterializedValue(_ => completion.future)
What is left now is to complete the completion promise when the relevant data becomes available.
An alternative approach would be to drop down to the GraphStage API, where you get lower-level control of materialized value propagation. But even there, using Promises is often the chosen implementation for materialized value propagation. Take a look at built-in operator implementations like Ignore.
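For instance, a minimal sketch of completing the promise from a byte-counting side sink (assuming http is the Source[ByteString, NotUsed] entity stream and an implicit ExecutionContext is in scope):
val completion = Promise[IOResult]()

val withIoResult: Source[ByteString, Future[IOResult]] =
  http
    .mapMaterializedValue(_ => completion.future)
    .alsoTo(
      // count the bytes on a side channel and complete the promise when the stream finishes
      Sink.fold(0L)((count: Long, bytes: ByteString) => count + bytes.length)
        .mapMaterializedValue { byteCount: Future[Long] =>
          completion.completeWith(byteCount.map(count => IOResult.createSuccessful(count)))
          NotUsed
        }
    )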

How to get the Partitioner in Apache Flink?

We are trying to create an extension for Apache Flink which uses custom partitioning. For some operators we want to check/retrieve the partitioner that is used. Unfortunately, I could not find any possibility to do this on a given DataSet. Have I missed something, or is there another workaround for this?
I would start with something like this:
class MyPartitioner[..](..) extends Partitioner[..] {..}
[..]
val myP = new MyPartitioner(...)
val ds = in.partitionCustom(myP, 0)
Now from another class I would like to access the partitioner (if defined). In Spark I would do it the following way:
val myP = ds.partitioner.get.asInstanceOf[MyPartitioner]
However, for Flink I could not find a possibility for this.
Edit1:
It seems to be possible with the suggestion of Fabian. However, there are two limitations:
(1) When using Scala you have to retrieve the underlying Java DataSet first to cast it to a PartitionOperator
(2) The partitioning must be the last operation. So one can not use other operations between setting and getting the partitioner. E.g. the following is not possible:
val in: DataSet[(String, Int)] = ???
val myP = new MyPartitioner()
val ds = in.partitionCustom(myP, 0)
val ds2 = ds.map(x => x)
val myP2 = ds2.asInstanceOf[PartitionOperator].getCustomPartitioner
Thank you and best regards,
Philipp
You can cast the returned DataSet into a PartitionOperator and call PartitionOperator.getCustomPartitioner():
val in: DataSet[(String, Int)] = ???
val myP = new MyPartitioner()
val ds = in.partitionCustom(myP, 0)
val myP2 = ds.asInstanceOf[PartitionOperator].getCustomPartitioner
Note that:
- getCustomPartitioner() is an internal method (i.e., not part of the public API) and might change in future versions of Flink.
- PartitionOperator is also used for other partitioning types, such as DataSet.partitionByHash(). In these cases getCustomPartitioner() might return null.
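Regarding limitation (1) from the question's edit, a minimal sketch of the Scala-side cast (this assumes the Scala DataSet exposes its underlying Java DataSet via javaSet, which may be package-private in some Flink versions, and that the partitioning is the last operation):
import org.apache.flink.api.java.operators.PartitionOperator

val in: DataSet[(String, Int)] = ???
val myP = new MyPartitioner()
val ds = in.partitionCustom(myP, 0)

// cast the underlying Java DataSet, not the Scala wrapper
val myP2 = ds.javaSet.asInstanceOf[PartitionOperator[(String, Int)]].getCustomPartitioner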

How do I dynamically add Source to existing Graph?

What can be an alternative to dynamically changing a running graph? Here is my situation: I have a graph that ingests articles into a DB. Articles come from 3 plugins in different formats, thus I have several flows:
val converterFlow1: Flow[ImpArticle, Article, NotUsed]
val converterFlow2: Flow[NewsArticle, Article, NotUsed]
val sinkDB: Sink[Article, Future[Done]]
// These are being created every time I poll plugins
val sourceContentProvider : Source[ImpArticle, NotUsed]
val sourceNews : Source[NewsArticle, NotUsed]
val sourceCit : Source[Article, NotUsed]
val merged = Source.combine(
  sourceContentProvider.via(converterFlow1),
  sourceNews.via(converterFlow2),
  sourceCit)(Merge(_))

val res = merged
  .buffer(10, OverflowStrategy.backpressure)
  .toMat(sinkDB)(Keep.both)
  .run()
The problem is that I get data from the content provider once per 24 hrs, from news once per 2 hrs, and the last source may come at any time because it's coming from humans.
I realize that graphs are immutable, but how can I periodically attach new instances of Source to my graph so that I have a single point of throttling for the ingestion process?
UPDATE: You could say my data is a stream of Sources, three sources in my case. But I cannot change that, because I get instances of Source from external classes (so-called plugins). These plugins work independently of my ingestion class. I can't combine them into one gigantic class to have a single Source.
Okay, in general the correct way would be to join a stream of sources into a single source, i.e. go from Source[Source[T, _], Whatever] to Source[T, Whatever]. This can be done with flatMapConcat or with flatMapMerge. Therefore, if you can get a Source[Source[Article, NotUsed], NotUsed], you can use one of flatMap* variants and obtain a final Source[Article, NotUsed]. Do it for each of your sources (no pun intended), and then your original approach should work.
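For example, a minimal sketch of the flattening, reusing the names from the question (the outer source is built from a plain List here purely for illustration):
val sourcesOfArticles: Source[Source[Article, NotUsed], NotUsed] =
  Source(List(
    sourceContentProvider.via(converterFlow1),
    sourceNews.via(converterFlow2),
    sourceCit))

val allArticles: Source[Article, NotUsed] =
  sourcesOfArticles.flatMapConcat(identity)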
I've implemented code based upon the answer given by Vladimir Matveev and want to share it with others, since it looks like a common use case to me.
I knew about Source.queue, which Viktor Klang mentioned, but I wasn't aware of flatMapConcat. It's pure awesomeness.
import akka.{Done, NotUsed}
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, OverflowStrategy}
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.util.{Failure, Success}

implicit val system = ActorSystem("root")
implicit val executor = system.dispatcher
implicit val materializer = ActorMaterializer()
case class ImpArticle(text: String)
case class NewsArticle(text: String)
case class Article(text: String)
val converterFlow1: Flow[ImpArticle, Article, NotUsed] = Flow[ImpArticle].map(a => Article("a:" + a.text))
val converterFlow2: Flow[NewsArticle, Article, NotUsed] = Flow[NewsArticle].map(a => Article("a:" + a.text))
val sinkDB: Sink[Article, Future[Done]] = Sink.foreach { a =>
  Thread.sleep(1000)
  println(a)
}
// These are being created every time I poll plugins
val sourceContentProvider: Source[ImpArticle, NotUsed] = Source(List(ImpArticle("cp1"), ImpArticle("cp2")))
val sourceNews: Source[NewsArticle, NotUsed] = Source(List(NewsArticle("news1"), NewsArticle("news2")))
val sourceCit: Source[Article, NotUsed] = Source(List(Article("a1"), Article("a2")))
val (queue, completionFut) = Source
  .queue[Source[Article, NotUsed]](10, OverflowStrategy.backpressure)
  .flatMapConcat(identity)
  .buffer(2, OverflowStrategy.backpressure)
  .toMat(sinkDB)(Keep.both)
  .run()
queue.offer(sourceContentProvider.via(converterFlow1))
queue.offer(sourceNews.via(converterFlow2))
queue.offer(sourceCit)
queue.complete()
completionFut.onComplete {
  case Success(res) =>
    println(res)
    system.terminate()
  case Failure(ex) =>
    ex.printStackTrace()
    system.terminate()
}
Await.result(system.whenTerminated, Duration.Inf)
I'd still check the success of the Future returned by queue.offer, but in my case these calls will be pretty infrequent.
If you cannot model it as a Source[Source[_,_],_] then I'd consider using a Source.queue[Source[T,_]](queueSize, overflowStrategy): here
What you'll have to be careful about though is what happens if submission fails.
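A minimal sketch of checking the offer result (QueueOfferResult lives in akka.stream; this reuses the queue and sources from the code above and assumes an ExecutionContext is in scope):
import akka.stream.QueueOfferResult

queue.offer(sourceCit).foreach {
  case QueueOfferResult.Enqueued    => // accepted
  case QueueOfferResult.Dropped     => // buffer was full and the overflow strategy dropped it
  case QueueOfferResult.Failure(ex) => ex.printStackTrace()
  case QueueOfferResult.QueueClosed => // the stream has already completed
}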

How can I pipe the output of an Akka Streams Merge to another Flow?

I'm playing with Akka Streams and have figured out most of the basics, but I'm not clear on how to take the results of a Merge and do further operations (map, filter, fold, etc.) on it.
I'd like to modify the following code so that, instead of piping the merge to a sink, I can manipulate the data further.
implicit val materializer = FlowMaterializer()
val items_a = Source(List(10,20,30,40,50))
val items_b = Source(List(60,70,80,90,100))
val sink = ForeachSink(println)
val materialized = FlowGraph { implicit builder =>
  import FlowGraphImplicits._
  val merge = Merge[Int]("m1")
  items_a ~> merge
  items_b ~> merge ~> sink
}.run()
I guess my primary problem is that I can't figure out how to make a flow component that doesn't have a source, and I can't figure out how to do a merge without using the special Merge object and the ~> syntax.
EDIT: This question and answer were written for, and work with, Akka Streams 0.11.
If you don't care about the semantics of Merge, where elements go downstream randomly, then you could just try concat on Source instead, like so:
items_a.concat(items_b).map(_ * 2).map(_.toString).foreach(println)
The difference here is that all items from a will flow downstream first before any elements of b. If you really need the behavior of Merge, then you could consider something like the following (keep in mind that you will eventually need a sink, but you certainly can do additional transforms after the merging):
val items_a = Source(List(10,20,30,40,50))
val items_b = Source(List(60,70,80,90,100))
val sink = ForeachSink[Double](println)
val transform = Flow[Int].map(_ * 2).map(_.toDouble).to(sink)
val materialized = FlowGraph { implicit builder =>
  import FlowGraphImplicits._
  val merge = Merge[Int]("m1")
  items_a ~> merge
  items_b ~> merge ~> transform
}.run
In this example, you can see that I use the helper from the Flow companion to create a Flow without a specific input Source. From there I can then attach that to the merge point to get my additional processing.
Use Source.combine:
val items_a :: items_b :: items_c = List(
  Source(List(10, 20, 30, 40, 50)),
  Source(List(60, 70, 80, 90, 100)),
  Source(List(110, 120, 130, 140, 1500)))

Source.combine(items_a, items_b, items_c: _*)(Merge(_))
  .map(_ + 1)
  .runForeach(println)
Or, if you need to preserve the order of the input sources (e.g. items_a must come before items_b, and items_b must come before items_c), you can use Concat instead of Merge.
val items_a :: items_b :: items_c = List(
  Source(List(10, 20, 30, 40, 50)),
  Source(List(60, 70, 80, 90, 100)),
  Source(List(110, 120, 130, 140, 1500)))

Source.combine(items_a, items_b, items_c: _*)(Concat(_))