How to download an HTTP resource to a file with Akka Streams and HTTP?

Over the past few days I have been trying to figure out the best way to download an HTTP resource to a file using Akka Streams and HTTP.
Initially I started with the Future-Based Variant and that looked something like this:
def downloadViaFutures(uri: Uri, file: File): Future[Long] = {
val request = Get(uri)
val responseFuture = Http().singleRequest(request)
responseFuture.flatMap { response =>
val source = response.entity.dataBytes
source.runWith(FileIO.toFile(file))
}
}
That was kind of okay but once I learnt more about pure Akka Streams I wanted to try and use the Flow-Based Variant to create a stream starting from a Source[HttpRequest]. At first this completely stumped me until I stumbled upon the flatMapConcat flow transformation. This ended up a little more verbose:
def responseOrFail[T](in: (Try[HttpResponse], T)): (HttpResponse, T) = in match {
case (responseTry, context) => (responseTry.get, context)
}
def responseToByteSource[T](in: (HttpResponse, T)): Source[ByteString, Any] = in match {
case (response, _) => response.entity.dataBytes
}
def downloadViaFlow(uri: Uri, file: File): Future[Long] = {
val request = Get(uri)
val source = Source.single((request, ()))
val requestResponseFlow = Http().superPool[Unit]()
source.
via(requestResponseFlow).
map(responseOrFail).
flatMapConcat(responseToByteSource).
runWith(FileIO.toFile(file))
}
Then I wanted to get a little tricky and use the Content-Disposition header.
Going back to the Future-Based Variant:
def destinationFile(downloadDir: File, response: HttpResponse): File = {
val fileName = response.header[ContentDisposition].get.value
val file = new File(downloadDir, fileName)
file.createNewFile()
file
}
def downloadViaFutures2(uri: Uri, downloadDir: File): Future[Long] = {
val request = Get(uri)
val responseFuture = Http().singleRequest(request)
responseFuture.flatMap { response =>
val file = destinationFile(downloadDir, response)
val source = response.entity.dataBytes
source.runWith(FileIO.toFile(file))
}
}
But now I have no idea how to do this with the Flow-Based Variant. This is as far as I got:
def responseToByteSourceWithDest[T](in: (HttpResponse, T), downloadDir: File): Source[(ByteString, File), Any] = in match {
case (response, _) =>
val source = responseToByteSource(in)
val file = destinationFile(downloadDir, response)
source.map((_, file))
}
def downloadViaFlow2(uri: Uri, downloadDir: File): Future[Long] = {
val request = Get(uri)
val source = Source.single((request, ()))
val requestResponseFlow = Http().superPool[Unit]()
val sourceWithDest: Source[(ByteString, File), Unit] = source.
via(requestResponseFlow).
map(responseOrFail).
flatMapConcat(responseToByteSourceWithDest(_, downloadDir))
sourceWithDest.runWith(???)
}
So now I have a Source that will emit one or more (ByteString, File) elements for each File (I say each File since there is no reason the original Source has to be a single HttpRequest).
Is there any way to take these and route them to a dynamic Sink?
I'm thinking something like flatMapConcat, such as:
def runWithMap[T, Mat2](f: T => Graph[SinkShape[Out], Mat2])(implicit materializer: Materializer): Mat2 = ???
So that I could complete downloadViaFlow2 with:
def destToSink(destination: File): Sink[(ByteString, File), Future[Long]] = {
val sink = FileIO.toFile(destination, true)
Flow[(ByteString, File)].map(_._1).toMat(sink)(Keep.right)
}
sourceWithDest.runWithMap {
case (_, file) => destToSink(file)
}

The solution does not require a flatMapConcat. If you don't need any return values from the file writing then you can use Sink.foreach:
def writeFile(downloadDir : File)(httpResponse : HttpResponse) : Future[Long] = {
val file = destinationFile(downloadDir, httpResponse)
httpResponse.entity.dataBytes.runWith(FileIO.toFile(file))
}
def downloadViaFlow2(uri: Uri, downloadDir: File) : Future[Unit] = {
val request = HttpRequest(uri=uri)
val source = Source.single((request, ()))
val requestResponseFlow = Http().superPool[Unit]()
source.via(requestResponseFlow)
.map(responseOrFail)
.map(_._1)
.runWith(Sink.foreach(writeFile(downloadDir)))
}
Note that Sink.foreach only invokes writeFile, which returns a Future immediately; the stream does not wait for that Future to complete. There is therefore very little back-pressure: the hard drive may slow down the actual writes, but the stream keeps creating new Futures. To bound the number of writes in flight, use Flow.mapAsyncUnordered (or Flow.mapAsync):
val parallelism = 10
source.via(requestResponseFlow)
.map(responseOrFail)
.map(_._1)
.mapAsyncUnordered(parallelism)(writeFile(downloadDir))
.runWith(Sink.ignore)
If you want to accumulate the Long values for a total count you need to combine with a Sink.fold:
source.via(requestResponseFlow)
.map(responseOrFail)
.map(_._1)
.mapAsyncUnordered(parallelism)(writeFile(downloadDir))
.runWith(Sink.fold(0L)(_ + _))
The fold will keep a running sum and emit the final value when the source of requests has dried up.
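A side note on the destinationFile helper from the question: it uses the header's value as the file name, and that value is typically the whole field (for example attachment; filename="report.csv") rather than the bare name. The following is a minimal sketch, using only the Scala standard library, of pulling the filename parameter out of such a string. The names ContentDispositionSketch and fileNameFrom are made up for illustration, and the sketch deliberately ignores the extended filename*= form.

```scala
object ContentDispositionSketch {
  // Matches filename="report.csv" or filename=report.csv (key is case-insensitive).
  private val FileNameParam = """(?i)filename\s*=\s*(?:"([^"]*)"|([^;\s]+))""".r

  /** Extracts the filename parameter from a raw Content-Disposition value, if present. */
  def fileNameFrom(headerValue: String): Option[String] =
    FileNameParam.findFirstMatchIn(headerValue).flatMap { m =>
      // Group 1 is the quoted form, group 2 the unquoted form.
      val raw = Option(m.group(1)).getOrElse(m.group(2))
      // Keep only the last path segment so the name cannot point outside the download directory.
      raw.split("[/\\\\]").lastOption.filter(n => n.nonEmpty && n != "." && n != "..")
    }
}
```

Returning an Option leaves it to the caller to decide what to do when the header is missing or carries no file name, for instance falling back to the last segment of the request URI.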

Using the Play Web Services client injected as ws, and remembering to import scala.concurrent.duration._:
def downloadFromUrl(url: String)(ws: WSClient): Future[Try[File]] = {
val file = File.createTempFile("my-prefix", null, new File("/tmp")) // a null suffix defaults to ".tmp"
file.deleteOnExit()
val futureResponse: Future[WSResponse] =
ws.url(url).withMethod("GET").withRequestTimeout(5 minutes).stream()
futureResponse.flatMap { res =>
res.status match {
case 200 =>
val outputStream = java.nio.file.Files.newOutputStream(file.toPath)
val sink = Sink.foreach[ByteString] { bytes => outputStream.write(bytes.toArray) }
res.bodyAsSource.runWith(sink).andThen {
case result =>
outputStream.close()
result.get
} map (_ => Success(file))
case other => Future(Failure[File](new Exception("HTTP Failure, response code " + other + " : " + res.statusText)))
}
}
}

Related

How to use Akka Streams with Akka HTTP to stream the response

I'm new to Akka Stream. I used following code for CSV parsing.
class CsvParser(config: Config)(implicit system: ActorSystem) extends LazyLogging with NumberValidation {
import system.dispatcher
private val importDirectory = Paths.get(config.getString("importer.import-directory")).toFile
private val linesToSkip = config.getInt("importer.lines-to-skip")
private val concurrentFiles = config.getInt("importer.concurrent-files")
private val concurrentWrites = config.getInt("importer.concurrent-writes")
private val nonIOParallelism = config.getInt("importer.non-io-parallelism")
def save(r: ValidReading): Future[Unit] = {
Future.successful(()) // placeholder: nothing is persisted yet
}
def parseLine(filePath: String)(line: String): Future[Reading] = Future {
val fields = line.split(";")
val id = fields(0).toInt
try {
val value = fields(1).toDouble
ValidReading(id, value)
} catch {
case t: Throwable =>
logger.error(s"Unable to parse line in $filePath:\n$line: ${t.getMessage}")
InvalidReading(id)
}
}
val lineDelimiter: Flow[ByteString, ByteString, NotUsed] =
Framing.delimiter(ByteString("\n"), 128, allowTruncation = true)
val parseFile: Flow[File, Reading, NotUsed] =
Flow[File].flatMapConcat { file =>
val src = FileSource.fromFile(file).getLines()
val source : Source[String, NotUsed] = Source.fromIterator(() => src)
// val gzipInputStream = new GZIPInputStream(new FileInputStream(file))
source
.mapAsync(parallelism = nonIOParallelism)(parseLine(file.getPath))
}
val computeAverage: Flow[Reading, ValidReading, NotUsed] =
Flow[Reading].grouped(2).mapAsyncUnordered(parallelism = nonIOParallelism) { readings =>
Future {
val validReadings = readings.collect { case r: ValidReading => r }
val average = if (validReadings.nonEmpty) validReadings.map(_.value).sum / validReadings.size else -1
ValidReading(readings.head.id, average)
}
}
val storeReadings: Sink[ValidReading, Future[Done]] =
Flow[ValidReading]
.mapAsyncUnordered(concurrentWrites)(save)
.toMat(Sink.ignore)(Keep.right)
val processSingleFile: Flow[File, ValidReading, NotUsed] =
Flow[File]
.via(parseFile)
.via(computeAverage)
def importFromFiles = {
implicit val materializer = ActorMaterializer()
val files = importDirectory.listFiles.toList
logger.info(s"Starting import of ${files.size} files from ${importDirectory.getPath}")
val startTime = System.currentTimeMillis()
val balancer = GraphDSL.create() { implicit builder =>
import GraphDSL.Implicits._
val balance = builder.add(Balance[File](concurrentFiles))
val merge = builder.add(Merge[ValidReading](concurrentFiles))
(1 to concurrentFiles).foreach { _ =>
balance ~> processSingleFile ~> merge
}
FlowShape(balance.in, merge.out)
}
Source(files)
.via(balancer)
.withAttributes(ActorAttributes.supervisionStrategy { e =>
logger.error("Exception thrown during stream processing", e)
Supervision.Resume
})
.runWith(storeReadings)
.andThen {
case Success(_) =>
val elapsedTime = (System.currentTimeMillis() - startTime) / 1000.0
logger.info(s"Import finished in ${elapsedTime}s")
case Failure(e) => logger.error("Import failed", e)
}
}
}
I want to use Akka HTTP to return all the ValidReading entities parsed from the CSV, but I can't figure out how to do that.
The above code reads files from a directory on the server and parses each line to generate a ValidReading.
How can I upload a CSV via akka-http, parse the file, and stream the result back as the response?
The "essence" of the solution is something like this:
import akka.http.scaladsl.server.Directives._
val route = fileUpload("csv") {
case (metadata, byteSource) =>
val source = byteSource.map(x => x)
complete(HttpResponse(entity = HttpEntity(ContentTypes.`text/csv(UTF-8)`, source)))
}
You detect that the upload is multipart/form-data with a part named "csv". You get the byteSource from that. Do the calculation (insert your logic in the .map(x => x) part). Convert your data back to ByteString. Complete the request with the new source. This will make your endpoint behave like a proxy.
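As an illustration of what could go in place of .map(x => x), here is a hedged sketch of the per-line logic as plain functions, assuming the "id;value" line format from the question's parseLine. The object name CsvLineSketch and the render function are hypothetical; the Reading types are redefined here only so the example stands alone.

```scala
object CsvLineSketch {
  sealed trait Reading { def id: Int }
  final case class ValidReading(id: Int, value: Double) extends Reading
  final case class InvalidReading(id: Int) extends Reading

  /** Parses one "id;value" line, mirroring parseLine from the question (without the Future). */
  def parse(line: String): Reading = {
    val fields = line.split(";")
    val id = fields(0).trim.toInt // a non-numeric id still throws, as in the question
    try ValidReading(id, fields(1).trim.toDouble)
    catch { case _: Exception => InvalidReading(id) } // missing or malformed value
  }

  /** Renders a reading back to a CSV line; invalid readings get an empty value column. */
  def render(reading: Reading): String = reading match {
    case ValidReading(id, value) => s"$id;$value\n"
    case InvalidReading(id)      => s"$id;\n"
  }
}
```

In the route, the uploaded bytes would first have to be split into lines (for example with the Framing.delimiter flow already used in the question) and decoded to strings; each line could then go through parse and render before being wrapped in a ByteString again.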

Parsing stops with Akka Streams mapAsync

I am parsing 50000 records which contain their titles and URLs on the web page. While parsing, I am writing them to the database, which is PostgreSQL. I deployed my application using docker-compose. However, it keeps stopping on some page without any reason. I tried to write some logs to figure out what's happening, but there is no connection error or anything like that.
Here is my code for parsing and writing to the database:
object App {
val db = Database.forURL("jdbc:postgresql://db:5432/toloka?user=user&password=password")
val browser = JsoupBrowser()
val catRepo = new CategoryRepo(db)
val torrentRepo = new TorrentRepo(db)
val torrentForParseRepo = new TorrentForParseRepo(db)
val parallelismFactor = 10
val groupFactor = 10
implicit val system = ActorSystem("TolokaParser")
implicit val materializer = ActorMaterializer()
implicit val executionContext = system.dispatcher
def parseAndWriteTorrentsForParseToDb(doc: App.browser.DocumentType) = {
Source(getRecordsLists(doc))
.grouped(groupFactor)
.mapAsync(parallelismFactor) { torrentForParse: Seq[TorrentForParse] =>
torrentForParseRepo.createInBatch(torrentForParse)
}
.runWith(Sink.ignore)
}
def getRecordsLists(doc: App.browser.DocumentType) = {
val pages = generatePagesFromHomePage(doc)
println("torrent links generated")
println(pages.size)
val result = for {
page <- pages
} yield {
println(s"Parsing torrent list...$page")
val tmp = getTitlesAndLinksTuple(getTitlesList(browser.get(page)), getLinksList(browser.get(page)))
println(tmp.size)
tmp
}
println("torrent links and names tupled")
result.flatten
}
}
What may be the cause of such problems?
Add a supervision strategy so that the stream is not terminated when an element fails, for example:
val decider: Supervision.Decider = {
case _ => Supervision.Resume
}
def parseAndWriteTorrentsForParseToDb = {
Source.fromIterator(() => List(1,2,3).toIterator)
.grouped(1)
.mapAsync(1) { torrentForParse: Seq[Int] =>
Future { 0 }
}
.withAttributes(ActorAttributes.supervisionStrategy(decider))
.runWith(Sink.ignore)
}
With Supervision.Resume in place, a failed element is dropped and the stream should carry on instead of stopping.

File Upload and processing using akka-http websockets

I'm using some sample Scala code to make a server that receives a file over websocket, stores the file temporarily, runs a bash script on it, and then returns stdout by TextMessage.
Sample code was taken from this github project.
I edited the code slightly within echoService so that it runs another function that processes the temporary file.
object WebServer {
def main(args: Array[String]) {
implicit val actorSystem = ActorSystem("akka-system")
implicit val flowMaterializer = ActorMaterializer()
val interface = "localhost"
val port = 3000
import Directives._
val route = get {
pathEndOrSingleSlash {
complete("Welcome to websocket server")
}
} ~
path("upload") {
handleWebSocketMessages(echoService)
}
val binding = Http().bindAndHandle(route, interface, port)
println(s"Server is now online at http://$interface:$port\nPress RETURN to stop...")
StdIn.readLine()
binding.flatMap(_.unbind()).onComplete(_ => actorSystem.shutdown())
println("Server is down...")
}
implicit val actorSystem = ActorSystem("akka-system")
implicit val flowMaterializer = ActorMaterializer()
val echoService: Flow[Message, Message, _] = Flow[Message].mapConcat {
case BinaryMessage.Strict(msg) => {
val decoded: Array[Byte] = msg.toArray
val imgOutFile = new File("/tmp/" + "filename")
val fileOuputStream = new FileOutputStream(imgOutFile)
fileOuputStream.write(decoded)
fileOuputStream.close()
TextMessage(analyze(imgOutFile))
}
case BinaryMessage.Streamed(stream) => {
stream
.limit(Int.MaxValue) // Max frames we are willing to wait for
.completionTimeout(50 seconds) // Max time until last frame
.runFold(ByteString(""))(_ ++ _) // Merges the frames
.flatMap { (msg: ByteString) =>
val decoded: Array[Byte] = msg.toArray
val imgOutFile = new File("/tmp/" + "filename")
val fileOuputStream = new FileOutputStream(imgOutFile)
fileOuputStream.write(decoded)
fileOuputStream.close()
Future(Source.single(""))
}
TextMessage(analyze(imgOutFile))
}
private def analyze(imgfile: File): String = {
val p = Runtime.getRuntime.exec(Array("./run-vision.sh", imgfile.toString))
val br = new BufferedReader(new InputStreamReader(p.getInputStream, StandardCharsets.UTF_8))
try {
val result = Stream
.continually(br.readLine())
.takeWhile(_ ne null)
.mkString
result
} finally {
br.close()
}
}
}
}
During testing using Dark WebSocket Terminal, case BinaryMessage.Strict works fine.
Problem: However, case BinaryMessage.Streamed doesn't finish writing the file before running the analyze function, resulting in a blank response from the server.
I'm trying to wrap my head around how Futures are being used here with the Flows in Akka-HTTP, but I'm not having much luck outside trying to get through all the official documentation.
Currently, .mapAsync seems promising, or basically finding a way to chain futures.
I'd really appreciate some insight.
Yes, mapAsync will help you in this case. It is a combinator that executes Futures (potentially in parallel) in your stream and presents their results on the output side.
In your case, to make the types homogeneous and keep the type checker happy, you'll need to wrap the result of the Strict case in Future.successful.
A quick fix for your code could be:
val echoService: Flow[Message, Message, _] = Flow[Message].mapAsync(parallelism = 5) {
case BinaryMessage.Strict(msg) => {
val decoded: Array[Byte] = msg.toArray
val imgOutFile = new File("/tmp/" + "filename")
val fileOuputStream = new FileOutputStream(imgOutFile)
fileOuputStream.write(decoded)
fileOuputStream.close()
Future.successful(TextMessage(analyze(imgOutFile)))
}
case BinaryMessage.Streamed(stream) =>
stream
.limit(Int.MaxValue) // Max frames we are willing to wait for
.completionTimeout(50 seconds) // Max time until last frame
.runFold(ByteString(""))(_ ++ _) // Merges the frames
.flatMap { (msg: ByteString) =>
val decoded: Array[Byte] = msg.toArray
val imgOutFile = new File("/tmp/" + "filename")
val fileOuputStream = new FileOutputStream(imgOutFile)
fileOuputStream.write(decoded)
fileOuputStream.close()
Future.successful(TextMessage(analyze(imgOutFile)))
}
}
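The shape of that fix can be shown without Akka at all. The sketch below uses only scala.concurrent and stand-in types (Incoming, Strict, Streamed and analyze here are simplified placeholders, not the Akka HTTP classes) to show both branches producing a Future, with the analysis chained after the streamed data has fully arrived.

```scala
import scala.concurrent.{ExecutionContext, Future}

object FutureChainingSketch {
  sealed trait Incoming
  final case class Strict(data: String) extends Incoming
  final case class Streamed(chunks: Future[List[String]]) extends Incoming

  // Stand-in for "write the file, then run the analysis on it".
  private def analyze(payload: String): String = s"analyzed ${payload.length} chars"

  /** Both branches return Future[String], which is the shape mapAsync expects. */
  def handle(msg: Incoming)(implicit ec: ExecutionContext): Future[String] = msg match {
    case Strict(data) =>
      // The data is already complete, so wrap the result without scheduling anything.
      Future.successful(analyze(data))
    case Streamed(chunks) =>
      // analyze runs only after all chunks have arrived and been merged.
      chunks.map(parts => analyze(parts.mkString))
  }
}
```

The original Streamed branch called analyze outside the Future chain, so it ran before the fold had finished; moving the call inside map (or flatMap) is what enforces the ordering.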

Scala Play 2.5.0 2.5.1 Filtering with access to result body

I couldn't find a way to get the body within a Filter in Play 2.5.x.
I want to create a "BadRequestLogFilter", which should log the request AND the result if my application returns a status code of 400-500.
In Play 2.4.x, I used Iteratees and it worked.
I was not able to migrate this piece of code to Play 2.5.x. Can somebody give me a hint here? Maybe the whole approach of getting the body in a Filter is a bad idea?
Here is my old Filter, which worked properly in Play 2.4.x:
class BadRequestLogFilter @Inject() (implicit val mat: Materializer, ec: ExecutionContext) extends Filter {
val logger = Logger("bad_status").underlyingLogger
override def apply(next: (RequestHeader) => Future[Result])(request: RequestHeader): Future[Result] = {
val resultFuture = next(request)
resultFuture.foreach(result => {
val status = result.header.status
if (status < 200 || status >= 400) {
val c = Try(request.tags(Router.Tags.RouteController))
val a = Try(request.tags(Router.Tags.RouteActionMethod))
val body = result.body.run(Iteratee.fold(Array.empty[Byte]) { (memo, nextChunk) => memo ++ nextChunk })
val futResponse = body.map(bytes => new String(bytes))
futResponse.map { response =>
val m = Map("method" -> request.method,
"uri" -> request.uri,
"status" -> status,
"response" -> response,
"request" -> request,
"controller" -> c.getOrElse("empty"),
"actionMethod" -> a.getOrElse("empty"))
val msg = m.map { case (k, v) => s"$k=$v" }.mkString(", ")
logger.info(appendEntries(m), msg)
}
}
})
resultFuture
}
}
I guess I just need a valid replacement for this line here:
val body = result.body.run(Iteratee.fold(Array.empty[Byte]) { (memo, nextChunk) => memo ++ nextChunk })
The body of a result in Play 2.5.x is of type HttpEntity.
So once you have the result you can get the body and then materialize it:
val body = result.body.consumeData(mat)
Here mat is the implicit Materializer you have. This returns a Future[ByteString], which you can then map over and decode to get a String representation:
val bodyAsString = body.map(_.decodeString("UTF-8"))
bodyAsString.foreach(s => logger.info(s))

Convert HttpEntity.Chunked to Array[String]

I have the following problem.
I am querying a server for some data and getting it back as HttpEntity.Chunked.
The response string looks like this, with up to 10,000,000 lines:
[{"name":"param1","value":122343,"time":45435345},
{"name":"param2","value":243,"time":4325435},
......]
Now I want to get the incoming data into an Array[String] where each String is a line from the response, because later on it should be imported into an Apache Spark DataFrame.
Currently I am doing it like this:
//For the http request
trait StartHttpRequest {
implicit val system: ActorSystem
implicit val materializer: ActorMaterializer
def httpRequest(data: String, path: String, targetPort: Int, host: String): Future[HttpResponse] = {
val connectionFlow: Flow[HttpRequest, HttpResponse, Future[OutgoingConnection]] = {
Http().outgoingConnection(host, port = targetPort)
}
val responseFuture: Future[HttpResponse] =
Source.single(RequestBuilding.Post(uri = path, entity = HttpEntity(ContentTypes.`application/json`, data)))
.via(connectionFlow)
.runWith(Sink.head)
responseFuture
}
}
//result of the request
val responseFuture: Future[HttpResponse] = httpRequest(.....)
//convert to string
responseFuture.flatMap { response =>
response.status match {
case StatusCodes.OK =>
Unmarshal(response.entity).to[String]
}
}
//and then something like this, but with even more stupid stuff
responseFuture.onSuccess { str:String =>
masterActor! str.split("""\},\{""")
}
My question is, what would be a better way to get the result into an array?
How can I unmarshall the response entity directly? .to[Array[String]], for example, did not work. And because there are so many lines coming in, could I do it with a stream, to be more efficient?
Answering your questions out of order:
How can I unmarshall the response entity directly?
There is an existing question & answer related to unmarshalling an Array of case classes.
what would be a better way to get the result into an array?
I would take advantage of the Chunked nature and use streams. This allows you to do string processing and json parsing concurrently.
First you need a container class and parser:
case class Data(name : String, value : Int, time : Long)
object MyJsonProtocol extends DefaultJsonProtocol {
implicit val dataFormat = jsonFormat3(Data)
}
Then you have to do some manipulations to get the json objects to look right:
//Drops the '[' and the ']' characters
val dropArrayMarkers =
Flow[ByteString].map(_.filterNot(b => b == '['.toByte || b == ']'.toByte))
val prependBrace =
Flow[String].map(s => if(!s.startsWith("{")) "{" + s else s)
val appendBrace =
Flow[String].map(s => if(!s.endsWith("}")) s + "}" else s)
val parseJson =
Flow[String].map(_.parseJson.convertTo[Data])
Finally, combine these Flows to convert a Source of ByteString into a Source of Data objects:
def strSourceToDataSource(source : Source[ByteString,_]) : Source[Data, _] =
source.via(dropArrayMarkers)
.via(Framing.delimiter(ByteString("},{"), 256, true))
.map(_.utf8String)
.via(prependBrace)
.via(appendBrace)
.via(parseJson)
This source can then be drained into a Seq of Data objects:
val dataSeq : Future[Seq[Data]] =
responseFuture flatMap { response =>
response.status match {
case StatusCodes.OK =>
strSourceToDataSource(response.entity.dataBytes).runWith(Sink.seq)
}
}
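To see what the framing and brace-repair stages do to the text, here is a small standard-library sketch that applies the same steps to a whole string instead of a stream. JsonSplitSketch and splitObjects are hypothetical names. Unlike the literal "},{" delimiter above, this version also tolerates whitespace between objects, on the assumption that the response may contain line breaks as in the question's sample. It would break on values that themselves contain brackets or the delimiter sequence.

```scala
object JsonSplitSketch {
  /** Mirrors the flow stages on a whole string: drop brackets, split between objects, restore braces. */
  def splitObjects(raw: String): Vector[String] =
    raw
      .filterNot(c => c == '[' || c == ']')            // same idea as dropArrayMarkers
      .split("\\},\\s*\\{")                            // like Framing.delimiter, but allowing whitespace
      .toVector
      .map(_.trim)
      .filter(_.nonEmpty)                              // an empty array yields no objects
      .map(s => if (s.startsWith("{")) s else "{" + s) // same idea as prependBrace
      .map(s => if (s.endsWith("}")) s else s + "}")   // same idea as appendBrace
}
```

Each resulting string is a complete JSON object again, which is what the parseJson stage needs as input.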