Change a materialized value in a source using the contents of the stream - scala

Alpakka provides a great way to access dozens of different data sources. File-oriented sources such as HDFS and FTP are delivered as Source[ByteString, Future[IOResult]]. However, HTTP requests via Akka HTTP are delivered as entity streams of Source[ByteString, NotUsed]. In my use case, I would like to retrieve content from HTTP sources as Source[ByteString, Future[IOResult]] so I can build a unified resource fetcher that works across multiple schemes (hdfs, file, ftp and S3 in this case).
In particular, I would like to convert the Source[ByteString, NotUsed] source to Source[ByteString, Future[IOResult]], where I am able to calculate the IOResult from the incoming byte stream. There are plenty of methods like flatMapConcat and viaMat, but none seem to be able to extract details from the input stream (such as the number of bytes read) or initialise the IOResult structure properly. Ideally, I am looking for a method with the following signature that will update the IOResult as the stream comes in.
def matCalc(src: Source[ByteString, Any]): Source[ByteString, Future[IOResult]] = {
  src.someMatFoldMagic[ByteString, IOResult](IOResult.createSuccessful(0))((m, b) => m.withCount(m.count + b.length))
}

I can't recall any existing functionality that does this out of the box, but you can use the alsoToMat flow operator (surprisingly, I didn't find it in the Akka Streams docs, although it is documented in the source code and the Java API) together with Sink.fold to accumulate a value and provide it at the very end. E.g.:
def magic(source: Source[Int, Any]): Source[Int, Future[Int]] =
  source.alsoToMat(Sink.fold(0)((acc, _) => acc + 1))((_, f) => f)
The thing is that alsoToMat combines the input materialized value with the one provided in alsoToMat. At the same time, the values produced by the source are not affected by the sink in alsoToMat:
def alsoToMat[Mat2, Mat3](that: Graph[SinkShape[Out], Mat2])(matF: (Mat, Mat2) ⇒ Mat3): ReprMat[Out, Mat3] =
  viaMat(alsoToGraph(that))(matF)
It's not that hard to adapt this function to return an IOResult, which, according to the source code, is:
final case class IOResult(count: Long, status: Try[Done]) { ... }
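For instance, here is a minimal sketch (my own adaptation, not from the original answer) that counts the incoming bytes with Sink.fold and maps the resulting count into an IOResult:

import akka.stream.IOResult
import akka.stream.scaladsl.{Sink, Source}
import akka.util.ByteString
import scala.concurrent.{ExecutionContext, Future}

// count bytes on the side and expose the count as a Future[IOResult]
def matCalc(src: Source[ByteString, Any])(
    implicit ec: ExecutionContext): Source[ByteString, Future[IOResult]] =
  src.alsoToMat(Sink.fold(0L)((count, bytes: ByteString) => count + bytes.length)) {
    (_, futureCount) => futureCount.map(IOResult.createSuccessful)
  }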
One last thing you need to pay attention to: you want your source to be
Source[ByteString, Future[IOResult]]
but if you want to carry this materialized value to the very end of the stream definition and then do something based on the future's completion, that might be an error-prone approach. E.g., in this example I finish the work based on that future, so the last value will not be processed:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, Sink, Source}

import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object App extends App {
  private implicit val sys: ActorSystem = ActorSystem()
  private implicit val mat: ActorMaterializer = ActorMaterializer()
  private implicit val ec: ExecutionContext = sys.dispatcher

  val source: Source[Int, Any] = Source((1 to 5).toList)

  def magic(source: Source[Int, Any]): Source[Int, Future[Int]] =
    source.alsoToMat(Sink.fold(0)((acc, _) => acc + 1))((_, f) => f)

  val f = magic(source).throttle(1, 1.second).toMat(Sink.foreach(println))(Keep.left).run()
  f.onComplete(t => println(s"f1 completed - $t"))

  Await.ready(f, 5.minutes)

  mat.shutdown()
  sys.terminate()
}

This can be done by using a Promise for the materialized value propagation.
val completion = Promise[IOResult]()
val httpWithIoResult = http.mapMaterializedValue(_ => completion.future)
What is left now is to complete the completion promise when the relevant data becomes available.
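As a rough sketch of one way to do that (mine, assuming the Akka HTTP entity source is called httpEntity), you can count the bytes on the side and complete the promise from the fold sink's materialized future:

import akka.stream.IOResult
import akka.stream.scaladsl.{Sink, Source}
import akka.util.ByteString
import scala.concurrent.{ExecutionContext, Future, Promise}

def withIoResult(httpEntity: Source[ByteString, Any])(
    implicit ec: ExecutionContext): Source[ByteString, Future[IOResult]] = {
  val completion = Promise[IOResult]()
  httpEntity
    .mapMaterializedValue(_ => completion.future)
    .alsoTo(
      Sink.fold(0L)((count, bytes: ByteString) => count + bytes.length)
        // when the fold finishes, fulfil the promise with the byte count
        .mapMaterializedValue(f => completion.completeWith(f.map(IOResult.createSuccessful)))
    )
}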
An alternative approach would be to drop down to the GraphStage API, where you get lower-level control of materialized value propagation. But even there, using Promises is often the chosen implementation for materialized value propagation. Take a look at built-in operator implementations like Ignore.
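For completeness, here is a rough sketch (mine, not from the answer) of what such a stage could look like: a pass-through GraphStageWithMaterializedValue that counts bytes and completes a promise when the upstream finishes:

import akka.stream._
import akka.stream.stage._
import akka.util.ByteString
import scala.concurrent.{Future, Promise}

final class CountBytes
    extends GraphStageWithMaterializedValue[FlowShape[ByteString, ByteString], Future[IOResult]] {

  val in: Inlet[ByteString] = Inlet("CountBytes.in")
  val out: Outlet[ByteString] = Outlet("CountBytes.out")
  override val shape: FlowShape[ByteString, ByteString] = FlowShape(in, out)

  override def createLogicAndMaterializedValue(attrs: Attributes): (GraphStageLogic, Future[IOResult]) = {
    val promise = Promise[IOResult]()
    val logic = new GraphStageLogic(shape) with InHandler with OutHandler {
      private var count = 0L

      override def onPush(): Unit = {
        val bytes = grab(in)
        count += bytes.length
        push(out, bytes)
      }

      override def onPull(): Unit = pull(in)

      override def onUpstreamFinish(): Unit = {
        promise.success(IOResult.createSuccessful(count))
        completeStage()
      }

      override def onUpstreamFailure(ex: Throwable): Unit = {
        promise.failure(ex)
        failStage(ex)
      }

      setHandlers(in, out, this)
    }
    (logic, promise.future)
  }
}

// usage: source.viaMat(new CountBytes)(Keep.right)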

Related

How can I use a value in an Akka Stream to instantiate a GooglePubSub Flow?

I'm attempting to create a Flow to be used with a Source queue. I would like this to work with the Alpakka Google PubSub connector: https://doc.akka.io/docs/alpakka/current/google-cloud-pub-sub.html
In order to use this connector, I need to create a Flow that depends on the topic name provided as a String, as shown in the above link and in the code snippet.
val publishFlow: Flow[PublishRequest, Seq[String], NotUsed] =
  GooglePubSub.publish(topic, config)
The question
I would like to be able to setup a Source queue that receives the topic and message required for publishing a message. I first create the necessary PublishRequest out of the message String. I then want to run this through the Flow that is instantiated by running GooglePubSub.publish(topic, config). However, I don't know how to get the topic to that part of the flow.
val gcFlow: Flow[(String, String), PublishRequest, NotUsed] = Flow[(String, String)]
  .map(messageData => {
    PublishRequest(Seq(
      PubSubMessage(new String(Base64.getEncoder.encode(messageData._1.getBytes))))
    )
  })
  .via(GooglePubSub.publish(topic, config))

val bufferSize = 10
val elementsToProcess = 5

// newSource is a Source[PublishRequest, NotUsed]
val (queue, newSource) = Source
  .queue[(String, String)](bufferSize, OverflowStrategy.backpressure)
  .via(gcFlow)
  .preMaterialize()
I'm not sure if there's a way to get the topic into the queue without it being a part of the initial data stream. And I don't know how to get the stream value into the dynamic Flow.
If I have improperly used some terminology, please keep in mind that I'm new to this.
You can achieve it by using flatMapConcat and generating a new Source within it:
// using tuple assuming (Topic, Message)
val gcFlow: Flow[(String, String), (String, PublishRequest), NotUsed] = Flow[(String, String)]
  .map(messageData => {
    val pr = PublishRequest(immutable.Seq(
      PubSubMessage(new String(Base64.getEncoder.encode(messageData._2.getBytes)))))
    // output flow shape of (String, PublishRequest)
    (messageData._1, pr)
  })

val publishFlow: Flow[(String, PublishRequest), Seq[String], NotUsed] =
  Flow[(String, PublishRequest)].flatMapConcat {
    case (topic: String, pr: PublishRequest) =>
      // Create a Source[PublishRequest]
      Source.single(pr).via(GooglePubSub.publish(topic, config))
  }

// wire it up
val (queue, newSource) = Source
  .queue[(String, String)](bufferSize, OverflowStrategy.backpressure)
  .via(gcFlow)
  .via(publishFlow)
  .preMaterialize()
Optionally, you could substitute the tuple with a case class to document it better:
case class Something(topic: String, payload: PublishRequest)

// output flow shape of Something
Something(messageData._1, pr)

Flow[Something].flatMapConcat { s =>
  Source.single(s.payload)... // etc
}
Explanation:
In gcFlow we output a FlowShape of the tuple (String, PublishRequest), which is passed through publishFlow. The input is the tuple (String, PublishRequest), and in flatMapConcat we generate a new Source[PublishRequest] which is flowed through GooglePubSub.publish.
There would be a slight overhead from creating a new Source for every item. This shouldn't have a measurable impact on performance.
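For illustration, a hypothetical usage of the pre-materialized queue from the snippet above (the topic name and message are made up):

// offer (topic, message) pairs; offer returns a Future[QueueOfferResult]
queue.offer(("projects/my-project/topics/my-topic", "hello world"))
queue.offer(("projects/my-project/topics/other-topic", "another message"))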

Stop the fs2-stream after a timeout

I want to use a function similar to take(n: Int) but in a time dimension: consume(period: Duration). So I want a stream to terminate if a timeout occurs. I know that I could compile a stream to something like IO[List[T]] and cancel it, but then I'd lose the result. In reality I want to convert an endless stream into a limited one and preserve the results.
More on the wider scope of the problem: I have an endless stream of events from a messaging broker, but I also have rotating credentials to connect to the broker. So what I want is to consume the stream of events for some time, then stop, acquire new credentials, connect to the broker again creating a new stream, and concatenate the two streams into one.
There is a method that does exactly this:
/**
* Interrupts this stream after the specified duration has passed.
*/
def interruptAfter[F2[x] >: F[x]: Concurrent: Timer](duration: FiniteDuration): Stream[F2, O]
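For example, a minimal sketch (mine, assuming cats-effect 2 style implicits) that turns an endless stream into a time-limited one with interruptAfter while keeping the collected results:

import cats.effect.{ContextShift, IO, Timer}
import fs2.Stream
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._

implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)
implicit val timer: Timer[IO] = IO.timer(ExecutionContext.global)

// an "endless" stream of increasing numbers, one every 100 ms
val endless: Stream[IO, Long] =
  Stream.awakeEvery[IO](100.millis).zipWithIndex.map(_._2)

// terminate after 5 seconds but keep everything emitted so far
val collected: List[Long] =
  endless.interruptAfter(5.seconds).compile.toList.unsafeRunSync()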
Alternatively, wiring up the interruption manually with a signal, you need something like this:
import scala.util.Random
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._
import cats.effect.{ContextShift, IO, Timer}
import fs2._
import fs2.concurrent.SignallingRef

implicit val ex = ExecutionContext.global
implicit val t: Timer[IO] = IO.timer(ex)
implicit val cs: ContextShift[IO] = IO.contextShift(ex)

val effect: IO[Long] = IO.sleep(1.second).flatMap(_ => IO {
  val next = Random.nextLong()
  println("NEXT: " + next)
  next
})

val signal = SignallingRef[IO, Boolean](false).unsafeRunSync()

val timer = Stream.sleep(10.seconds).flatMap(_ =>
  Stream.eval(signal.set(true)).flatMap(_ =>
    Stream.emit(println("Finish")).covary[IO]))

val stream = timer concurrently
  Stream.repeatEval(effect).interruptWhen(signal)

stream.compile.drain.unsafeRunSync()
Also, if you want to save the results of the published data, you need some unbounded Queue from fs2 and can convert the published data into your result via queue.stream.
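Tying this back to the wider scope of the question, a rough sketch (mine; Credentials, Event, connect and freshCredentials are hypothetical placeholders) of consuming for a bounded period and then reconnecting with fresh credentials could look like this:

import cats.effect.{ContextShift, IO, Timer}
import fs2.Stream
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._

implicit val contextShift: ContextShift[IO] = IO.contextShift(ExecutionContext.global)
implicit val ioTimer: Timer[IO] = IO.timer(ExecutionContext.global)

case class Credentials(token: String)
case class Event(payload: String)

// stand-ins for acquiring credentials and connecting to the broker
def freshCredentials: IO[Credentials] = IO(Credentials("token"))
def connect(creds: Credentials): Stream[IO, Event] =
  Stream.awakeEvery[IO](1.second).map(_ => Event("event from " + creds.token))

// consume each connection for a bounded period, then reconnect with new credentials
val events: Stream[IO, Event] =
  Stream.eval(freshCredentials)
    .flatMap(creds => connect(creds).interruptAfter(30.seconds))
    .repeat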

FS2 stream to unread InputStream

I'd like to convert fs2.Stream to java.io.InputStream so I can pass that input stream to an http framework (Finch and Akka Http).
I found fs2.io.toInputStream, but this doesn't work (it prints nothing):
import java.io.{ByteArrayInputStream, InputStream}
import cats.effect.IO
import scala.concurrent.ExecutionContext.Implicits.global

object IOTest {
  def main(args: Array[String]): Unit = {
    val is: InputStream = new ByteArrayInputStream("test".getBytes)
    val stream: fs2.Stream[IO, Byte] = fs2.io.readInputStream(IO(is), 128)
    val test: Seq[InputStream] = stream.through(fs2.io.toInputStream).compile.toList.unsafeRunSync()
    println(scala.io.Source.fromInputStream(test.head).mkString)
  }
}
As far as I understand, when I run .unsafeRunSync() it consumes the whole stream, so even though it returns a Seq[InputStream], the underlying input stream is already consumed.
Is there any way I can convert fs2.Stream[IO, Byte] to java.io.InputStream without it being consumed?
Thanks!
The problem is that compile is being invoked prematurely. I'm sure that under the hood fs2.io.toInputStream does the correct thing and brackets the created InputStream. This means that the InputStream must be accessed inside the Stream itself (e.g., in a map/flatMap call):
val wire: fs2.Stream[IO, Byte] = ???

val result: fs2.Stream[IO, String] = for {
  is <- wire.through(fs2.io.toInputStream)
  str = scala.io.Source.fromInputStream(is).mkString //<--- use the InputStream here
} yield str

println(result.compile.lastOrError.unsafeRunSync()) //<--- compile at the _very_ end
Outputs:
test
It looks like Finch has fs2 support (https://github.com/finagle/finch/tree/master/fs2), and Akka also has its own stream implementation; there are fs2 - Akka Streams interop libraries like https://github.com/krasserm/streamz/tree/master/streamz-converter.
So I recommend you take a look at those implementations, because they take care of the resources' life cycle. You probably don't need the whole library, but it serves as a guideline.
And if you are starting in the "safe zone" with fs2, why move out of it :)

Alpakka MongoDB - specify type in MongoSource

I'm currently playing around with Akka Streams and the Alpakka MongoDB connector.
Is it possible to specify the type for MongoSource?
val codecRegistry = fromRegistries(fromProviders(classOf[TodoMongo]), DEFAULT_CODEC_REGISTRY)

private val todoCollection: MongoCollection[TodoMongo] = mongoDb
  .withCodecRegistry(codecRegistry)
  .getCollection("todo")
I would like to do something like this:
val t: FindObservable[Seq[TodoMongo]] = todoCollection.find()
MongoSource(t) // Stuck here
But I get the following error:
Expected Observable[scala.Document], Actual FindObservable[Seq[TodoMongo]].
I can't find the correct documentation about this part.
This is not published yet, but in Alpakka's master branch, MongoSource.apply takes a type parameter:
object MongoSource {
  def apply[T](query: Observable[T]): Source[T, NotUsed] =
    Source.fromPublisher(ObservableToPublisher(query))
}
Therefore, with the upcoming 0.18 release of Alpakka, you'll be able to do the following:
val source: Source[TodoMongo, NotUsed] = MongoSource[TodoMongo](todoCollection.find())
Note that source here assumes that todoCollection.find() returns an Observable[TodoMongo]; adjust the types as needed.
In the meantime, you could simply add the above code manually. For example:
package akka.stream.alpakka.mongodb.scaladsl

import akka.NotUsed
import akka.stream.alpakka.mongodb.ObservableToPublisher
import akka.stream.scaladsl.Source
import org.mongodb.scala.Observable

object MyMongoSource {
  def apply[T](query: Observable[T]): Source[T, NotUsed] =
    Source.fromPublisher(ObservableToPublisher(query))
}
Note that MyMongoSource is defined to reside in the akka.stream.alpakka.mongodb.scaladsl package (like MongoSource), because ObservableToPublisher is a package-private class. You would use MyMongoSource in the same way that you would use MongoSource:
val source: Source[TodoMongo, NotUsed] = MyMongoSource[TodoMongo](todoCollection.find())

How do I dynamically add Source to existing Graph?

What can be an alternative to dynamically changing a running graph? Here is my situation: I have a graph that ingests articles into a DB. Articles come from 3 plugins in different formats, so I have several flows.
val converterFlow1: Flow[ImpArticle, Article, NotUsed]
val converterFlow2: Flow[NewsArticle, Article, NotUsed]
val sinkDB: Sink[Article, Future[Done]]

// These are being created every time I poll plugins
val sourceContentProvider: Source[ImpArticle, NotUsed]
val sourceNews: Source[NewsArticle, NotUsed]
val sourceCit: Source[Article, NotUsed]

val merged = Source.combine(
  sourceContentProvider.via(converterFlow1),
  sourceNews.via(converterFlow2),
  sourceCit)(Merge(_))

val res = merged
  .buffer(10, OverflowStrategy.backpressure)
  .toMat(sinkDB)(Keep.both)
  .run()
The problem is that I get data from the content provider once per 24 hrs, from news once per 2 hrs, and the last source may come at any time because it's coming from humans.
I realize that graphs are immutable, but how can I periodically attach new instances of Source to my graph so that I have a single point of throttling for the ingestion process?
UPDATE: You can say my data is a stream of Sources, three sources in my case. But I cannot change that because I get instances of Source from external classes (so-called plugins). These plugins work independently from my ingestion class. I can't combine them into one gigantic class to have a single Source.
Okay, in general the correct way would be to join a stream of sources into a single source, i.e. go from Source[Source[T, _], Whatever] to Source[T, Whatever]. This can be done with flatMapConcat or with flatMapMerge. Therefore, if you can get a Source[Source[Article, NotUsed], NotUsed], you can use one of the flatMap* variants and obtain a final Source[Article, NotUsed]. Do it for each of your sources (no pun intended), and then your original approach should work.
I've implemented code based on the answer given by Vladimir Matveev and want to share it with others, since it looks like a common use case to me.
I knew about Source.queue, which Viktor Klang mentioned, but I wasn't aware of flatMapConcat. It's pure awesomeness.
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, OverflowStrategy}
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}
import akka.{Done, NotUsed}

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.util.{Failure, Success}

implicit val system = ActorSystem("root")
implicit val executor = system.dispatcher
implicit val materializer = ActorMaterializer()

case class ImpArticle(text: String)
case class NewsArticle(text: String)
case class Article(text: String)

val converterFlow1: Flow[ImpArticle, Article, NotUsed] = Flow[ImpArticle].map(a => Article("a:" + a.text))
val converterFlow2: Flow[NewsArticle, Article, NotUsed] = Flow[NewsArticle].map(a => Article("a:" + a.text))

val sinkDB: Sink[Article, Future[Done]] = Sink.foreach { a =>
  Thread.sleep(1000)
  println(a)
}

// These are being created every time I poll plugins
val sourceContentProvider: Source[ImpArticle, NotUsed] = Source(List(ImpArticle("cp1"), ImpArticle("cp2")))
val sourceNews: Source[NewsArticle, NotUsed] = Source(List(NewsArticle("news1"), NewsArticle("news2")))
val sourceCit: Source[Article, NotUsed] = Source(List(Article("a1"), Article("a2")))

val (queue, completionFut) = Source
  .queue[Source[Article, NotUsed]](10, OverflowStrategy.backpressure)
  .flatMapConcat(identity)
  .buffer(2, OverflowStrategy.backpressure)
  .toMat(sinkDB)(Keep.both)
  .run()

queue.offer(sourceContentProvider.via(converterFlow1))
queue.offer(sourceNews.via(converterFlow2))
queue.offer(sourceCit)
queue.complete()

completionFut.onComplete {
  case Success(res) =>
    println(res)
    system.terminate()
  case Failure(ex) =>
    ex.printStackTrace()
    system.terminate()
}

Await.result(system.whenTerminated, Duration.Inf)
I'd still check the success of the Future returned by queue.offer, but in my case these calls will be pretty infrequent.
If you cannot model it as a Source[Source[_, _], _], then I'd consider using a Source.queue[Source[T, _]](queueSize, overflowStrategy).
What you'll have to be careful about, though, is what happens if submission fails.
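As a brief sketch (mine, reusing the names from the example above), checking the result of offer could look like this:

import akka.stream.QueueOfferResult

// offer returns a Future[QueueOfferResult]; inspect it instead of ignoring it
queue.offer(sourceCit).foreach {
  case QueueOfferResult.Enqueued    => println("enqueued")
  case QueueOfferResult.Dropped     => println("dropped by the overflow strategy")
  case QueueOfferResult.Failure(ex) => println(s"offer failed: $ex")
  case QueueOfferResult.QueueClosed => println("queue was already completed")
}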