Reading multiple files with akka streams in scala - scala

I'am trying to read multiple files with akka streams and put result in a list.
I can read one file with no problem. the return type is Future[Seq[String]]. problem is processing the sequence inside the Future must go inside an onComplete{}.
i'am trying the following code but abviously it will not work. the list acc outside of the onComplete is empty. but holds values inside the inComplete. I understand the problem but i don't know how to approach this.
// works fine
def readStream(path: String, date: String): Future[Seq[String]] = {
implicit val system = ActorSystem("Sys")
val settings = ActorMaterializerSettings(system)
implicit val materializer = ActorMaterializer(settings)
val result: Future[Seq[String]] =
FileIO.fromPath(Paths.get(path + "transactions_" + date +
".data"))
.via(Framing.delimiter(ByteString("\n"), 256, true))
.map(_.utf8String)
.toMat(Sink.seq)(Keep.right)
.run()
var aa: List[scala.Array[String]] = Nil
result.onComplete(x => {
aa = x.get.map(line => line.split('|')).toList
})
result
}
//this won't work
def concatFiles(path : String, date : String, numberOfDays : Int) :
List[scala.Array[String]] = {
val formatter = DateTimeFormatter.ofPattern("yyyyMMdd");
val formattedDate = LocalDate.parse(date, formatter);
var acc = List[scala.Array[String]]()
for( a <- 0 to numberOfDays){
val date = formattedDate.minusDays(a).toString().replace("-", "")
val transactions = readStream(path , date)
var result: List[scala.Array[String]] = Nil
transactions.onComplete(x => {
result = x.get.map(line => line.split('|')).toList
acc= acc ++ result })
}
acc}

General Solution
Given an Iterator of Paths values a Source of the file lines can be created by combining FileIO & flatMapConcat:
val lineSourceFromPaths : (() => Iterator[Path]) => Source[String, _] = pathsIterator =>
Source
.fromIterator(pathsIterator)
.flatMapConcat { path =>
FileIO
.fromPath(path)
.via(Framing.delimiter(ByteString("\n"), 256, true))
.map(_.utf8String)
}
Application to Question
The reason your List is empty is because the Future values have not completed and therefore your mutable list is not be updated before the function returns the list.
Critique of Code in Question
The organization and style of the code within the question suggest several misunderstandings related to akka & Future. I think you are attempting a rather complex workflow without understanding the fundamentals of the tools you are trying to use.
1.You should not create an ActorSystem each time a function is being called. There is usually 1 ActorSystem per application and it's created only once.
implicit val system = ActorSystem("Sys")
val settings = ActorMaterializerSettings(system)
implicit val materializer = ActorMaterializer(settings)
def readStream(...
2.You should try to avoid mutable collections and instead use Iterator with corresponding functionality:
def concatFiles(path : String, date : String, numberOfDays : Int) : List[scala.Array[String]] = {
val formattedDate = LocalDate.parse(date, DateTimeFormatter.ofPattern("yyyyMMdd"))
val pathsIterator : () => Iterator[Path] = () =>
Iterator
.range(0, numberOfDays+1)
.map(formattedDate.minusDays)
.map(_.String().replace("-", "")
.map(path => Paths.get(path + "transactions_" + date + ".data")
lineSourceFromPaths(pathsIterator)
3.Since you are dealing with Futures you should not wait for Futures to complete and should instead change the return type of concateFiles to Future[List[Array[String]]].

Related

Akka Streams: Concatenating streams that initialise resources

I have two streams: A and B.
A is writing a file to disk and appends its elements during its execution.
B is reading that file and does some processing with the data.
Only AFTER A is done with processing and writing its data to the file, I want to start with B.
I tried to concat the two streams with:
A.concat(
Source.lazily { () =>
println("B is getting initialised")
getStreamForB()
}
)
But this is already initialising B BEFORE A has finished.
There is a ticket tracking the fact that Source#concat does not support lazy materialization. That ticket mentions the following work-around:
implicit class SourceLazyOps[E, M](val src: Source[E, M]) {
def concatLazy[M1](src2: => Source[E, M1]): Source[E, NotUsed] =
Source(List(() => src, () => src2)).flatMapConcat(_())
}
Applying the above implicit class to your case:
A.concatLazy(
Source.lazily { () =>
println("B is getting initialised")
getStreamForB()
}
)
The FileIO.toPath method will materialize the stream into a Future[IOResult]. If you are working with stream A that is writing to a file:
val someDataSource : Source[ByteString, _] = ???
val filePath : Path = ???
val fileWriteOptions : Set[OpenOption] = ???
val A : Future[IOResult] =
someDataSource
.to(FileIO.toPath(filePath, fileWriteOptions))
.run()
You can use the materialized Future to kick off your stream B once the writing is completed:
val fileReadOptions : Set[OpenOption] = ???
val someProcessingWithTheDataOfB : Sink[ByteString, _] = ???
A foreach { _ =>
val B : Future[IOResult] =
FileIO
.fromPath(filePath, fileReadOptions)
.to(someProcessingWithTheDataOfB)
.run()
}
Similarly, you could do some testing of the IOResult before doing the reading to make sure there were no failures during the writing process:
A.filter(ioResult => ioResult.status.isSuccess)
.foreach { _ =>
val B : Future[IOResult] =
FileIO
.fromPath(filePath, readOptions)
.to(someProcessingOfB)
.run()
}

How can I dynamically invoke the same scala function in cascading manner with output of previous call goes as input to the next call

I am new to Spark-Scala and trying following thing but I am stuck up and not getting on how to achieve this requirement. I shall be really thankful if someone can really help in this regards.
We have to invoke different rules on different columns of given table. The list of column names and rules is being passed as argument to the program
The resultant of first rule should go as input to the next rule input.
question : How can I execute exec() function in cascading manner with dynamically filling the arguments for as many rules as specified in arguments.
I have developed a code as follows.
object Rules {
def main(args: Array[String]) = {
if (args.length != 3) {
println("Need exactly 3 arguments in format : <sourceTableName> <destTableName> <[<colName>=<Rule> <colName>=<Rule>,...")
println("E.g : INPUT_TABLE OUTPUT_TABLE [NAME=RULE1,ID=RULE2,TRAIT=RULE3]");
System.exit(-1)
}
val conf = new SparkConf().setAppName("My-Rules").setMaster("local");
val sc = new SparkContext(conf);
val srcTableName = args(0).trim();
val destTableName = args(1).trim();
val ruleArguments = StringUtils.substringBetween(args(2).trim(), "[", "]");
val businessRuleMappings = ruleArguments.split(",").map(_.split("=")).map(arr => arr(0) -> arr(1)).toMap;
val sqlContext : SQLContext = new org.apache.spark.sql.SQLContext(sc) ;
val hiveContext : HiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
val dfSourceTbl = hiveContext.table("TEST.INPUT_TABLE");
def exec(dfSource: DataFrame,columnName :String ,funName: String): DataFrame = {
funName match {
case "RULE1" => TransformDF(columnName,dfSource,RULE1);
case "RULE2" => TransformDF(columnName,dfSource,RULE2);
case "RULE3" => TransformDF(columnName,dfSource,RULE3);
case _ =>dfSource;
}
}
def TransformDF(x:String, df:DataFrame, f:(String,DataFrame)=>DataFrame) : DataFrame = {
f(x,df);
}
def RULE1(column : String, sourceDF: DataFrame): DataFrame = {
//put businees logic
return sourceDF;
}
def RULE2(column : String, sourceDF: DataFrame): DataFrame = {
//put businees logic
return sourceDF;
}
def RULE3(column : String,sourceDF: DataFrame): DataFrame = {
//put businees logic
return sourceDF;
}
// How can I call this exec() function with output casacing and arguments for variable number of rules.
val finalResultDF = exec(exec(exec(dfSourceTbl,"NAME","RULE1"),"ID","RULE2"),"TRAIT","RULE3);
finalResultDF.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto("DB.destTableName")
}
}
I would write all the rules as functions transforming one dataframe to another:
val rules: Seq[(DataFrame) => DataFrame] = Seq(
RULE1("NAME",_:DataFrame),
RULE2("ID",_:DataFrame),
RULE3("TRAIT",_:DataFrame)
)
Not you can apply them using folding
val finalResultDF = rules.foldLeft(dfSourceTbl)(_ transform _)

File Upload and processing using akka-http websockets

I'm using some sample Scala code to make a server that receives a file over websocket, stores the file temporarily, runs a bash script on it, and then returns stdout by TextMessage.
Sample code was taken from this github project.
I edited the code slightly within echoService so that it runs another function that processes the temporary file.
object WebServer {
def main(args: Array[String]) {
implicit val actorSystem = ActorSystem("akka-system")
implicit val flowMaterializer = ActorMaterializer()
val interface = "localhost"
val port = 3000
import Directives._
val route = get {
pathEndOrSingleSlash {
complete("Welcome to websocket server")
}
} ~
path("upload") {
handleWebSocketMessages(echoService)
}
val binding = Http().bindAndHandle(route, interface, port)
println(s"Server is now online at http://$interface:$port\nPress RETURN to stop...")
StdIn.readLine()
binding.flatMap(_.unbind()).onComplete(_ => actorSystem.shutdown())
println("Server is down...")
}
implicit val actorSystem = ActorSystem("akka-system")
implicit val flowMaterializer = ActorMaterializer()
val echoService: Flow[Message, Message, _] = Flow[Message].mapConcat {
case BinaryMessage.Strict(msg) => {
val decoded: Array[Byte] = msg.toArray
val imgOutFile = new File("/tmp/" + "filename")
val fileOuputStream = new FileOutputStream(imgOutFile)
fileOuputStream.write(decoded)
fileOuputStream.close()
TextMessage(analyze(imgOutFile))
}
case BinaryMessage.Streamed(stream) => {
stream
.limit(Int.MaxValue) // Max frames we are willing to wait for
.completionTimeout(50 seconds) // Max time until last frame
.runFold(ByteString(""))(_ ++ _) // Merges the frames
.flatMap { (msg: ByteString) =>
val decoded: Array[Byte] = msg.toArray
val imgOutFile = new File("/tmp/" + "filename")
val fileOuputStream = new FileOutputStream(imgOutFile)
fileOuputStream.write(decoded)
fileOuputStream.close()
Future(Source.single(""))
}
TextMessage(analyze(imgOutFile))
}
private def analyze(imgfile: File): String = {
val p = Runtime.getRuntime.exec(Array("./run-vision.sh", imgfile.toString))
val br = new BufferedReader(new InputStreamReader(p.getInputStream, StandardCharsets.UTF_8))
try {
val result = Stream
.continually(br.readLine())
.takeWhile(_ ne null)
.mkString
result
} finally {
br.close()
}
}
}
}
During testing using Dark WebSocket Terminal, case BinaryMessage.Strict works fine.
Problem: However, case BinaryMessage.Streaming doesn't finish writing the file before running the analyze function, resulting in a blank response from the server.
I'm trying to wrap my head around how Futures are being used here with the Flows in Akka-HTTP, but I'm not having much luck outside trying to get through all the official documentation.
Currently, .mapAsync seems promising, or basically finding a way to chain futures.
I'd really appreciate some insight.
Yes, mapAsync will help you in this occasion. It is a combinator to execute Futures (potentially in parallel) in your stream, and present their results on the output side.
In your case to make things homogenous and make the type checker happy, you'll need to wrap the result of the Strict case into a Future.successful.
A quick fix for your code could be:
val echoService: Flow[Message, Message, _] = Flow[Message].mapAsync(parallelism = 5) {
case BinaryMessage.Strict(msg) => {
val decoded: Array[Byte] = msg.toArray
val imgOutFile = new File("/tmp/" + "filename")
val fileOuputStream = new FileOutputStream(imgOutFile)
fileOuputStream.write(decoded)
fileOuputStream.close()
Future.successful(TextMessage(analyze(imgOutFile)))
}
case BinaryMessage.Streamed(stream) =>
stream
.limit(Int.MaxValue) // Max frames we are willing to wait for
.completionTimeout(50 seconds) // Max time until last frame
.runFold(ByteString(""))(_ ++ _) // Merges the frames
.flatMap { (msg: ByteString) =>
val decoded: Array[Byte] = msg.toArray
val imgOutFile = new File("/tmp/" + "filename")
val fileOuputStream = new FileOutputStream(imgOutFile)
fileOuputStream.write(decoded)
fileOuputStream.close()
Future.successful(TextMessage(analyze(imgOutFile)))
}
}

Compilation error not found: value x but I declared it already, what's wrong

result.map { res =>
val totaldocs: Int = res.value
// do something with this number
}
//val totaldocs = 60
val totalpages:Int = (totaldocs/ipp)+1
Compilation error not found: value x but I declared it already, what is wrong with my implementation, sorry I am new to play framework and scala programming language.
I would say this line is the problem:
val totalpages:Int = (totaldocs/ipp)+1
because totaldocs is only defined inside the map scope
maybe you want something like:
private def getTotalPages(query:BSONDocument, ipp:Int) (implicit ec: ExecutionContext) = {
val key = collectionName + ":" + BSONDocument.pretty(query)
Logger.debug("Query key = "+key)
val command = Count(query)
val result: Future[CountResult] = collection.runCommand(command)
result.map { res =>
val totaldocs: Int = res.value
// do something with this number
val totalpages:Int = (totaldocs/ipp)+1
Logger.debug(s"Total docs $totaldocs, Total pages $totalpages, Items per page, $ipp")
totalpages
}
}
but now it will return a Future[Int] and you will have to deal with the future on the caller.
Note: this is just one solution, depending on your code it may not be the most adequate one

How to download a HTTP resource to a file with Akka Streams and HTTP?

Over the past few days I have been trying to figure out the best way to download a HTTP resource to a file using Akka Streams and HTTP.
Initially I started with the Future-Based Variant and that looked something like this:
def downloadViaFutures(uri: Uri, file: File): Future[Long] = {
val request = Get(uri)
val responseFuture = Http().singleRequest(request)
responseFuture.flatMap { response =>
val source = response.entity.dataBytes
source.runWith(FileIO.toFile(file))
}
}
That was kind of okay but once I learnt more about pure Akka Streams I wanted to try and use the Flow-Based Variant to create a stream starting from a Source[HttpRequest]. At first this completely stumped me until I stumbled upon the flatMapConcat flow transformation. This ended up a little more verbose:
def responseOrFail[T](in: (Try[HttpResponse], T)): (HttpResponse, T) = in match {
case (responseTry, context) => (responseTry.get, context)
}
def responseToByteSource[T](in: (HttpResponse, T)): Source[ByteString, Any] = in match {
case (response, _) => response.entity.dataBytes
}
def downloadViaFlow(uri: Uri, file: File): Future[Long] = {
val request = Get(uri)
val source = Source.single((request, ()))
val requestResponseFlow = Http().superPool[Unit]()
source.
via(requestResponseFlow).
map(responseOrFail).
flatMapConcat(responseToByteSource).
runWith(FileIO.toFile(file))
}
Then I wanted to get a little tricky and use the Content-Disposition header.
Going back to the Future-Based Variant:
def destinationFile(downloadDir: File, response: HttpResponse): File = {
val fileName = response.header[ContentDisposition].get.value
val file = new File(downloadDir, fileName)
file.createNewFile()
file
}
def downloadViaFutures2(uri: Uri, downloadDir: File): Future[Long] = {
val request = Get(uri)
val responseFuture = Http().singleRequest(request)
responseFuture.flatMap { response =>
val file = destinationFile(downloadDir, response)
val source = response.entity.dataBytes
source.runWith(FileIO.toFile(file))
}
}
But now I have no idea how to do this with the Future-Based Variant. This is as far as I got:
def responseToByteSourceWithDest[T](in: (HttpResponse, T), downloadDir: File): Source[(ByteString, File), Any] = in match {
case (response, _) =>
val source = responseToByteSource(in)
val file = destinationFile(downloadDir, response)
source.map((_, file))
}
def downloadViaFlow2(uri: Uri, downloadDir: File): Future[Long] = {
val request = Get(uri)
val source = Source.single((request, ()))
val requestResponseFlow = Http().superPool[Unit]()
val sourceWithDest: Source[(ByteString, File), Unit] = source.
via(requestResponseFlow).
map(responseOrFail).
flatMapConcat(responseToByteSourceWithDest(_, downloadDir))
sourceWithDest.runWith(???)
}
So now I have a Source that will emit one or more (ByteString, File) elements for each File (I say each File since there is no reason the original Source has to be a single HttpRequest).
Is there anyway to take these and route them to a dynamic Sink?
I'm thinking something like flatMapConcat, such as:
def runWithMap[T, Mat2](f: T => Graph[SinkShape[Out], Mat2])(implicit materializer: Materializer): Mat2 = ???
So that I could complete downloadViaFlow2 with:
def destToSink(destination: File): Sink[(ByteString, File), Future[Long]] = {
val sink = FileIO.toFile(destination, true)
Flow[(ByteString, File)].map(_._1).toMat(sink)(Keep.right)
}
sourceWithDest.runWithMap {
case (_, file) => destToSink(file)
}
The solution does not require a flatMapConcat. If you don't need any return values from the file writing then you can use Sink.foreach:
def writeFile(downloadDir : File)(httpResponse : HttpResponse) : Future[Long] = {
val file = destinationFile(downloadDir, httpResponse)
httpResponse.entity.dataBytes.runWith(FileIO.toFile(file))
}
def downloadViaFlow2(uri: Uri, downloadDir: File) : Future[Unit] = {
val request = HttpRequest(uri=uri)
val source = Source.single((request, ()))
val requestResponseFlow = Http().superPool[Unit]()
source.via(requestResponseFlow)
.map(responseOrFail)
.map(_._1)
.runWith(Sink.foreach(writeFile(downloadDir)))
}
Note that the Sink.foreach creates Futures from the writeFile function. Therefore there's not much back-pressure involved. The writeFile could be slowed down by the hard drive but the stream would keep generating Futures. To control this you can use Flow.mapAsyncUnordered (or Flow.mapAsync) :
val parallelism = 10
source.via(requestResponseFlow)
.map(responseOrFail)
.map(_._1)
.mapAsyncUnordered(parallelism)(writeFile(downloadDir))
.runWith(Sink.ignore)
If you want to accumulate the Long values for a total count you need to combine with a Sink.fold:
source.via(requestResponseFlow)
.map(responseOrFail)
.map(_._1)
.mapAsyncUnordered(parallelism)(writeFile(downloadDir))
.runWith(Sink.fold(0L)(_ + _))
The fold will keep a running sum and emit the final value when the source of requests has dried up.
Using the play Web Services client injected in ws, remmebering to import scala.concurrent.duration._:
def downloadFromUrl(url: String)(ws: WSClient): Future[Try[File]] = {
val file = File.createTempFile("my-prefix", new File("/tmp"))
file.deleteOnExit()
val futureResponse: Future[WSResponse] =
ws.url(url).withMethod("GET").withRequestTimeout(5 minutes).stream()
futureResponse.flatMap { res =>
res.status match {
case 200 =>
val outputStream = java.nio.file.Files.newOutputStream(file.toPath)
val sink = Sink.foreach[ByteString] { bytes => outputStream.write(bytes.toArray) }
res.bodyAsSource.runWith(sink).andThen {
case result =>
outputStream.close()
result.get
} map (_ => Success(file))
case other => Future(Failure[File](new Exception("HTTP Failure, response code " + other + " : " + res.statusText)))
}
}
}