Akka Streams: Concatenating streams that initialise resources - scala

I have two streams: A and B.
A is writing a file to disk and appends its elements during its execution.
B is reading that file and does some processing with the data.
Only AFTER A is done with processing and writing its data to the file, I want to start with B.
I tried to concat the two streams with:
A.concat(
Source.lazily { () =>
println("B is getting initialised")
getStreamForB()
}
)
But this is already initialising B BEFORE A has finished.

There is a ticket tracking the fact that Source#concat does not support lazy materialization. That ticket mentions the following work-around:
implicit class SourceLazyOps[E, M](val src: Source[E, M]) {
def concatLazy[M1](src2: => Source[E, M1]): Source[E, NotUsed] =
Source(List(() => src, () => src2)).flatMapConcat(_())
}
Applying the above implicit class to your case:
A.concatLazy(
Source.lazily { () =>
println("B is getting initialised")
getStreamForB()
}
)

The FileIO.toPath method will materialize the stream into a Future[IOResult]. If you are working with stream A that is writing to a file:
val someDataSource : Source[ByteString, _] = ???
val filePath : Path = ???
val fileWriteOptions : Set[OpenOption] = ???
val A : Future[IOResult] =
someDataSource
.to(FileIO.toPath(filePath, fileWriteOptions))
.run()
You can use the materialized Future to kick off your stream B once the writing is completed:
val fileReadOptions : Set[OpenOption] = ???
val someProcessingWithTheDataOfB : Sink[ByteString, _] = ???
A foreach { _ =>
val B : Future[IOResult] =
FileIO
.fromPath(filePath, fileReadOptions)
.to(someProcessingWithTheDataOfB)
.run()
}
Similarly, you could do some testing of the IOResult before doing the reading to make sure there were no failures during the writing process:
A.filter(ioResult => ioResult.status.isSuccess)
.foreach { _ =>
val B : Future[IOResult] =
FileIO
.fromPath(filePath, readOptions)
.to(someProcessingOfB)
.run()
}

Related

Reading multiple files with akka streams in scala

I'am trying to read multiple files with akka streams and put result in a list.
I can read one file with no problem. the return type is Future[Seq[String]]. problem is processing the sequence inside the Future must go inside an onComplete{}.
i'am trying the following code but abviously it will not work. the list acc outside of the onComplete is empty. but holds values inside the inComplete. I understand the problem but i don't know how to approach this.
// works fine
def readStream(path: String, date: String): Future[Seq[String]] = {
implicit val system = ActorSystem("Sys")
val settings = ActorMaterializerSettings(system)
implicit val materializer = ActorMaterializer(settings)
val result: Future[Seq[String]] =
FileIO.fromPath(Paths.get(path + "transactions_" + date +
".data"))
.via(Framing.delimiter(ByteString("\n"), 256, true))
.map(_.utf8String)
.toMat(Sink.seq)(Keep.right)
.run()
var aa: List[scala.Array[String]] = Nil
result.onComplete(x => {
aa = x.get.map(line => line.split('|')).toList
})
result
}
//this won't work
def concatFiles(path : String, date : String, numberOfDays : Int) :
List[scala.Array[String]] = {
val formatter = DateTimeFormatter.ofPattern("yyyyMMdd");
val formattedDate = LocalDate.parse(date, formatter);
var acc = List[scala.Array[String]]()
for( a <- 0 to numberOfDays){
val date = formattedDate.minusDays(a).toString().replace("-", "")
val transactions = readStream(path , date)
var result: List[scala.Array[String]] = Nil
transactions.onComplete(x => {
result = x.get.map(line => line.split('|')).toList
acc= acc ++ result })
}
acc}
General Solution
Given an Iterator of Paths values a Source of the file lines can be created by combining FileIO & flatMapConcat:
val lineSourceFromPaths : (() => Iterator[Path]) => Source[String, _] = pathsIterator =>
Source
.fromIterator(pathsIterator)
.flatMapConcat { path =>
FileIO
.fromPath(path)
.via(Framing.delimiter(ByteString("\n"), 256, true))
.map(_.utf8String)
}
Application to Question
The reason your List is empty is because the Future values have not completed and therefore your mutable list is not be updated before the function returns the list.
Critique of Code in Question
The organization and style of the code within the question suggest several misunderstandings related to akka & Future. I think you are attempting a rather complex workflow without understanding the fundamentals of the tools you are trying to use.
1.You should not create an ActorSystem each time a function is being called. There is usually 1 ActorSystem per application and it's created only once.
implicit val system = ActorSystem("Sys")
val settings = ActorMaterializerSettings(system)
implicit val materializer = ActorMaterializer(settings)
def readStream(...
2.You should try to avoid mutable collections and instead use Iterator with corresponding functionality:
def concatFiles(path : String, date : String, numberOfDays : Int) : List[scala.Array[String]] = {
val formattedDate = LocalDate.parse(date, DateTimeFormatter.ofPattern("yyyyMMdd"))
val pathsIterator : () => Iterator[Path] = () =>
Iterator
.range(0, numberOfDays+1)
.map(formattedDate.minusDays)
.map(_.String().replace("-", "")
.map(path => Paths.get(path + "transactions_" + date + ".data")
lineSourceFromPaths(pathsIterator)
3.Since you are dealing with Futures you should not wait for Futures to complete and should instead change the return type of concateFiles to Future[List[Array[String]]].

How to send multiple files to kafka producer using akka stream in Scala

I am trying to send multiple data to kafka producer using akka stream , meanwhile I wrote the producer itself , but struggling of how to use akka-streamIO in order to get multiple files which will be the data I want to send to my kafka Producer this is my code:
object App {
def main(args: Array[String]): Unit = {
val file = Paths.get("233339.8.1231731728115136.1722327129833578.log")
// val file = Paths.get("example.csv")
//
// val foreach: Future[IOResult] = FileIO.fromPath(file)
// .to(Sink.ignore)
// .run()
println("Hello from producer")
implicit val system:ActorSystem = ActorSystem("producer-example")
implicit val materializer:Materializer = ActorMaterializer()
val producerSettings = ProducerSettings(system,new StringSerializer,new StringSerializer)
val done: Future[Done] =
Source(1 to 955)
.map(value => new ProducerRecord[String, String]("test-topic", s"$file : $value"))
.runWith(Producer.plainSink(producerSettings))
implicit val ec: ExecutionContextExecutor = system.dispatcher
done onComplete {
case Success(_) => println("Done"); system.terminate()
case Failure(err) => println(err.toString); system.terminate()
}
}
}
Given multiple file names:
val fileNames : Iterable[String] = ???
It is possible to create a Source that emits the contents of the files concatenated together using flatMapConcat:
val chunkSize = 8192
val chunkSource : Source[ByteString, _] =
Source.apply(fileNames)
.map(fileName => Paths get fileName)
.flatMapConcat(path => FileIO.fromPath(path, chunkSize))
This will emit fixed size ByteString values that are all chunkSize length, except for possibly the last value which may be smaller.
If you want to breakup the lines by some delimiter then you can use Framing:
val delimiter : ByteString = ???
val maxFrameLength : Int = ???
val framingSource : Source[ByteString, _] =
chunkSource.via(Framing.delimiter(delimiter, maxFrameLength))

Akka Streams scala DSL and Op-Rabbit

I have started using Akka Streams and Op-Rabbit and am a bit confused.
I need to split the stream based on a predicate and then combine them much like I have done when creating graphs and using the Partition and Merge.
I have been able to do things like this using the GraphDSL.Builder, but can't seem to get it to work with AckedSource/Flow/Sink
the graph would look like:
| --> flow1 --> |
source--> partition --> | | --> flow3 --> sink
| --> flow2 --> |
I'm not sure if splitWhen is what I should use because I always need exactly 2 flows.
This is a sample that does not do the partitioning and does not use the GraphDSL.Builder:
def splitExample(source: AckedSource[String, SubscriptionRef],
queueName: String)
(implicit actorSystem: ActorSystem): RunnableGraph[SubscriptionRef] = {
val toStringFlow: Flow[AckTup[Message], AckTup[String], NotUsed] = Flow[AckTup[Message]]
.map[AckTup[String]](tup => {
val (p,m) = tup
(p, new String(m.data))
})
val printFlow1: Flow[AckTup[String], AckTup[String], NotUsed] = Flow[AckTup[String]]
.map[AckTup[String]](tup => {
val (p, s) = tup
println(s"flow1 processing $s")
tup
})
val printFlow2: Flow[AckTup[String], AckTup[String], NotUsed] = Flow[AckTup[String]]
.map[AckTup[String]](tup => {
val (p, s) = tup
println(s"flow2 processing $s")
tup
})
source
.map(Message.queue(_, queueName))
.via(AckedFlow(toStringFlow))
// partition if string.length < 10
.via(AckedFlow(printFlow1))
.via(AckedFlow(printFlow2))
.to(AckedSink.ack)
}
This is the code that I can't seem to get working:
import GraphDSL.Implicits._
def buildModelAcked(source: AckedSource[String, SubscriptionRef] , queueName: String)(implicit actorSystem: ActorSystem): Graph[ClosedShape, Future[Done]] = {
import GraphDSL.Implicits._
GraphDSL.create(Sink.ignore) { implicit builder: GraphDSL.Builder[Future[Done]] => s =>
import GraphDSL.Implicits._
source.map(Message.queue(_, queueName)) ~> AckedFlow(toStringFlow) ~> AckedSink.ack
// source.map(Message.queue(_, queueName)).via(AckedFlow(toStringFlow)).to(AckedSink.ack)
ClosedShape
}}
The compiler can't resolve the ~> operator
So my questions are:
Is there an example project that uses the scala dsl to build graphs of Acked/Source/Flow/Sink?
Is there an example project that partitions and merges that is similar to what I am trying to do here?
Keep in mind the following definitions when dealing the acked-stream.
AckedSource[Out, Mat] is a wrapper for Source[AckTup[Out], Mat]]
AckedFlow[In, Out, Mat] is a wrapper for Flow[AckTup[In], AckTup[Out], Mat]
AckedSink[In, Mat] is a wrapper for Sink[AckTup[In], Mat]
AckTup[T] is an alias for (Promise[Unit], T)
the classic flow combinators will operate on the T part of the AckTup
the .acked combinator will complete the Promise[Unit] of an AckedFlow
The GraphDSL edge operator (~>) will work against a bunch of Akka predefined shapes (see the code for GraphDSL.Implicits), but it won't work against custom shapes defined by the acked-stream lib.
You got 2 ways out:
you define your own ~> implicit operator, along the lines of the ones in GraphDSL.Implicits
you unwrap the acked stages to obtain standard stages. You are able to access the wrapped stage using .wrappedRepr - available on AckedSource, AckedFlow and AckedSink.
Based on Stefano Bonetti's excellent direction, here is a possible solution:
graph:
|--> short --|
rabbitMq --> before --| |--> after
|--> long --|
solution:
val before: Flow[AckTup[Message], AckTup[String], NotUsed] = Flow[AckTup[Message]].map[AckTup[String]](tup => {
val (p,m) = tup
(p, new String(m.data))
})
val short: Flow[AckTup[String], AckTup[String], NotUsed] = Flow[AckTup[String]].map[AckTup[String]](tup => {
val (p, s) = tup
println(s"short: $s")
tup
})
val long: Flow[AckTup[String], AckTup[String], NotUsed] = Flow[AckTup[String]].map[AckTup[String]](tup => {
val (p, s) = tup
println(s"long: $s")
tup
})
val after: Flow[AckTup[String], AckTup[String], NotUsed] = Flow[AckTup[String]].map[AckTup[String]](tup => {
val (p, s) = tup
println(s"all $s")
tup
})
def buildSplitGraph(source: AckedSource[String, SubscriptionRef]
, queueName: String
, splitLength: Int)(implicit actorSystem: ActorSystem): Graph[ClosedShape, Future[Done]] = {
GraphDSL.create(Sink.ignore) { implicit builder: GraphDSL.Builder[Future[Done]] => s =>
val toShort = 0
val toLong = 1
// junctions
val split = builder.add(Partition[AckTup[String]](2, (tup: AckTup[String]) => {
val (p, s) = tup
if (s.length < splitLength) toShort else toLong
}
))
val merge = builder.add(Merge[AckTup[String]](2))
//graph
val beforeSplit = source.map(Message.queue(_, queueName)).wrappedRepr ~> AckedFlow(before).wrappedRepr
beforeSplit ~> split
// must do short, then long since the split goes in that order
split ~> AckedFlow(short).wrappedRepr ~> merge
split ~> AckedFlow(long).wrappedRepr ~> merge
// after the last AckedFlow, be sure to '.acked' so that the message will be removed from the queue
merge ~> AckedFlow(after).acked ~> s
ClosedShape
}}
As Stefano Bonetti said, the key was to use the .wrappedRepr associated with the AckedFlow and then to use the .acked combinator as the last step.

File Upload and processing using akka-http websockets

I'm using some sample Scala code to make a server that receives a file over websocket, stores the file temporarily, runs a bash script on it, and then returns stdout by TextMessage.
Sample code was taken from this github project.
I edited the code slightly within echoService so that it runs another function that processes the temporary file.
object WebServer {
def main(args: Array[String]) {
implicit val actorSystem = ActorSystem("akka-system")
implicit val flowMaterializer = ActorMaterializer()
val interface = "localhost"
val port = 3000
import Directives._
val route = get {
pathEndOrSingleSlash {
complete("Welcome to websocket server")
}
} ~
path("upload") {
handleWebSocketMessages(echoService)
}
val binding = Http().bindAndHandle(route, interface, port)
println(s"Server is now online at http://$interface:$port\nPress RETURN to stop...")
StdIn.readLine()
binding.flatMap(_.unbind()).onComplete(_ => actorSystem.shutdown())
println("Server is down...")
}
implicit val actorSystem = ActorSystem("akka-system")
implicit val flowMaterializer = ActorMaterializer()
val echoService: Flow[Message, Message, _] = Flow[Message].mapConcat {
case BinaryMessage.Strict(msg) => {
val decoded: Array[Byte] = msg.toArray
val imgOutFile = new File("/tmp/" + "filename")
val fileOuputStream = new FileOutputStream(imgOutFile)
fileOuputStream.write(decoded)
fileOuputStream.close()
TextMessage(analyze(imgOutFile))
}
case BinaryMessage.Streamed(stream) => {
stream
.limit(Int.MaxValue) // Max frames we are willing to wait for
.completionTimeout(50 seconds) // Max time until last frame
.runFold(ByteString(""))(_ ++ _) // Merges the frames
.flatMap { (msg: ByteString) =>
val decoded: Array[Byte] = msg.toArray
val imgOutFile = new File("/tmp/" + "filename")
val fileOuputStream = new FileOutputStream(imgOutFile)
fileOuputStream.write(decoded)
fileOuputStream.close()
Future(Source.single(""))
}
TextMessage(analyze(imgOutFile))
}
private def analyze(imgfile: File): String = {
val p = Runtime.getRuntime.exec(Array("./run-vision.sh", imgfile.toString))
val br = new BufferedReader(new InputStreamReader(p.getInputStream, StandardCharsets.UTF_8))
try {
val result = Stream
.continually(br.readLine())
.takeWhile(_ ne null)
.mkString
result
} finally {
br.close()
}
}
}
}
During testing using Dark WebSocket Terminal, case BinaryMessage.Strict works fine.
Problem: However, case BinaryMessage.Streaming doesn't finish writing the file before running the analyze function, resulting in a blank response from the server.
I'm trying to wrap my head around how Futures are being used here with the Flows in Akka-HTTP, but I'm not having much luck outside trying to get through all the official documentation.
Currently, .mapAsync seems promising, or basically finding a way to chain futures.
I'd really appreciate some insight.
Yes, mapAsync will help you in this occasion. It is a combinator to execute Futures (potentially in parallel) in your stream, and present their results on the output side.
In your case to make things homogenous and make the type checker happy, you'll need to wrap the result of the Strict case into a Future.successful.
A quick fix for your code could be:
val echoService: Flow[Message, Message, _] = Flow[Message].mapAsync(parallelism = 5) {
case BinaryMessage.Strict(msg) => {
val decoded: Array[Byte] = msg.toArray
val imgOutFile = new File("/tmp/" + "filename")
val fileOuputStream = new FileOutputStream(imgOutFile)
fileOuputStream.write(decoded)
fileOuputStream.close()
Future.successful(TextMessage(analyze(imgOutFile)))
}
case BinaryMessage.Streamed(stream) =>
stream
.limit(Int.MaxValue) // Max frames we are willing to wait for
.completionTimeout(50 seconds) // Max time until last frame
.runFold(ByteString(""))(_ ++ _) // Merges the frames
.flatMap { (msg: ByteString) =>
val decoded: Array[Byte] = msg.toArray
val imgOutFile = new File("/tmp/" + "filename")
val fileOuputStream = new FileOutputStream(imgOutFile)
fileOuputStream.write(decoded)
fileOuputStream.close()
Future.successful(TextMessage(analyze(imgOutFile)))
}
}

How to download a HTTP resource to a file with Akka Streams and HTTP?

Over the past few days I have been trying to figure out the best way to download a HTTP resource to a file using Akka Streams and HTTP.
Initially I started with the Future-Based Variant and that looked something like this:
def downloadViaFutures(uri: Uri, file: File): Future[Long] = {
val request = Get(uri)
val responseFuture = Http().singleRequest(request)
responseFuture.flatMap { response =>
val source = response.entity.dataBytes
source.runWith(FileIO.toFile(file))
}
}
That was kind of okay but once I learnt more about pure Akka Streams I wanted to try and use the Flow-Based Variant to create a stream starting from a Source[HttpRequest]. At first this completely stumped me until I stumbled upon the flatMapConcat flow transformation. This ended up a little more verbose:
def responseOrFail[T](in: (Try[HttpResponse], T)): (HttpResponse, T) = in match {
case (responseTry, context) => (responseTry.get, context)
}
def responseToByteSource[T](in: (HttpResponse, T)): Source[ByteString, Any] = in match {
case (response, _) => response.entity.dataBytes
}
def downloadViaFlow(uri: Uri, file: File): Future[Long] = {
val request = Get(uri)
val source = Source.single((request, ()))
val requestResponseFlow = Http().superPool[Unit]()
source.
via(requestResponseFlow).
map(responseOrFail).
flatMapConcat(responseToByteSource).
runWith(FileIO.toFile(file))
}
Then I wanted to get a little tricky and use the Content-Disposition header.
Going back to the Future-Based Variant:
def destinationFile(downloadDir: File, response: HttpResponse): File = {
val fileName = response.header[ContentDisposition].get.value
val file = new File(downloadDir, fileName)
file.createNewFile()
file
}
def downloadViaFutures2(uri: Uri, downloadDir: File): Future[Long] = {
val request = Get(uri)
val responseFuture = Http().singleRequest(request)
responseFuture.flatMap { response =>
val file = destinationFile(downloadDir, response)
val source = response.entity.dataBytes
source.runWith(FileIO.toFile(file))
}
}
But now I have no idea how to do this with the Future-Based Variant. This is as far as I got:
def responseToByteSourceWithDest[T](in: (HttpResponse, T), downloadDir: File): Source[(ByteString, File), Any] = in match {
case (response, _) =>
val source = responseToByteSource(in)
val file = destinationFile(downloadDir, response)
source.map((_, file))
}
def downloadViaFlow2(uri: Uri, downloadDir: File): Future[Long] = {
val request = Get(uri)
val source = Source.single((request, ()))
val requestResponseFlow = Http().superPool[Unit]()
val sourceWithDest: Source[(ByteString, File), Unit] = source.
via(requestResponseFlow).
map(responseOrFail).
flatMapConcat(responseToByteSourceWithDest(_, downloadDir))
sourceWithDest.runWith(???)
}
So now I have a Source that will emit one or more (ByteString, File) elements for each File (I say each File since there is no reason the original Source has to be a single HttpRequest).
Is there anyway to take these and route them to a dynamic Sink?
I'm thinking something like flatMapConcat, such as:
def runWithMap[T, Mat2](f: T => Graph[SinkShape[Out], Mat2])(implicit materializer: Materializer): Mat2 = ???
So that I could complete downloadViaFlow2 with:
def destToSink(destination: File): Sink[(ByteString, File), Future[Long]] = {
val sink = FileIO.toFile(destination, true)
Flow[(ByteString, File)].map(_._1).toMat(sink)(Keep.right)
}
sourceWithDest.runWithMap {
case (_, file) => destToSink(file)
}
The solution does not require a flatMapConcat. If you don't need any return values from the file writing then you can use Sink.foreach:
def writeFile(downloadDir : File)(httpResponse : HttpResponse) : Future[Long] = {
val file = destinationFile(downloadDir, httpResponse)
httpResponse.entity.dataBytes.runWith(FileIO.toFile(file))
}
def downloadViaFlow2(uri: Uri, downloadDir: File) : Future[Unit] = {
val request = HttpRequest(uri=uri)
val source = Source.single((request, ()))
val requestResponseFlow = Http().superPool[Unit]()
source.via(requestResponseFlow)
.map(responseOrFail)
.map(_._1)
.runWith(Sink.foreach(writeFile(downloadDir)))
}
Note that the Sink.foreach creates Futures from the writeFile function. Therefore there's not much back-pressure involved. The writeFile could be slowed down by the hard drive but the stream would keep generating Futures. To control this you can use Flow.mapAsyncUnordered (or Flow.mapAsync) :
val parallelism = 10
source.via(requestResponseFlow)
.map(responseOrFail)
.map(_._1)
.mapAsyncUnordered(parallelism)(writeFile(downloadDir))
.runWith(Sink.ignore)
If you want to accumulate the Long values for a total count you need to combine with a Sink.fold:
source.via(requestResponseFlow)
.map(responseOrFail)
.map(_._1)
.mapAsyncUnordered(parallelism)(writeFile(downloadDir))
.runWith(Sink.fold(0L)(_ + _))
The fold will keep a running sum and emit the final value when the source of requests has dried up.
Using the play Web Services client injected in ws, remmebering to import scala.concurrent.duration._:
def downloadFromUrl(url: String)(ws: WSClient): Future[Try[File]] = {
val file = File.createTempFile("my-prefix", new File("/tmp"))
file.deleteOnExit()
val futureResponse: Future[WSResponse] =
ws.url(url).withMethod("GET").withRequestTimeout(5 minutes).stream()
futureResponse.flatMap { res =>
res.status match {
case 200 =>
val outputStream = java.nio.file.Files.newOutputStream(file.toPath)
val sink = Sink.foreach[ByteString] { bytes => outputStream.write(bytes.toArray) }
res.bodyAsSource.runWith(sink).andThen {
case result =>
outputStream.close()
result.get
} map (_ => Success(file))
case other => Future(Failure[File](new Exception("HTTP Failure, response code " + other + " : " + res.statusText)))
}
}
}