Large file download with Play framework - scala

I have sample download code that works fine as long as the file is not zipped, because I know the length and provide it; I think that while streaming, Play then does not have to bring the whole file into memory. The code below works:
def downloadLocalBackup() = Action {
  var pathOfFile = "/opt/mydir/backups/big/backup"
  val file = new java.io.File(pathOfFile)
  val path: java.nio.file.Path = file.toPath
  val source: Source[ByteString, _] = FileIO.fromPath(path)
  logger.info("from local backup set the length in header as " + file.length())
  Ok.sendEntity(HttpEntity.Streamed(source, Some(file.length()), Some("application/zip")))
    .withHeaders("Content-Disposition" -> s"attachment; filename=backup")
}
I don't know how the streaming above handles the difference in speed between disk reads (which are faster than the network), but it never runs out of memory, even for large files. However, when I use the code below, which wraps the output in a ZipOutputStream, I do run out of memory and I'm not sure why. The same 3 GB file fails when routed through the zip stream.
def downloadLocalBackup2() = Action {
  var pathOfFile = "/opt/mydir/backups/big/backup"
  val file = new java.io.File(pathOfFile)
  val path: java.nio.file.Path = file.toPath
  val enumerator = Enumerator.outputStream { os =>
    val zipStream = new ZipOutputStream(os)
    zipStream.putNextEntry(new ZipEntry("backup2"))
    val is = new BufferedInputStream(new FileInputStream(pathOfFile))
    val buf = new Array[Byte](1024)
    var len = is.read(buf)
    var totalLength = 0L
    var logged = false
    while (len >= 0) {
      zipStream.write(buf, 0, len)
      len = is.read(buf)
      if (!logged) {
        logged = true
        logger.info("logging the while loop just one time")
      }
    }
    is.close()
    zipStream.close()
  }
  logger.info("log right before sendEntity")
  val kk = Ok.sendEntity(HttpEntity.Streamed(
    Source.fromPublisher(Streams.enumeratorToPublisher(enumerator)).map { x =>
      val kk = Writeable.wByteArray.transform(x); kk
    },
    None, Some("application/zip"))
  ).withHeaders("Content-Disposition" -> s"attachment; filename=backupfile.zip")
  kk
}

In the first example, Akka Streams handles all the details for you. It knows how to read the input stream without loading the complete file into memory. That is the advantage of using Akka Streams, as explained in the docs:
The way we consume services from the Internet today includes many instances of streaming data, both downloading from a service as well as uploading to it or peer-to-peer data transfers. Regarding data as a stream of elements instead of in its entirety is very useful because it matches the way computers send and receive them (for example via TCP), but it is often also a necessity because data sets frequently become too large to be handled as a whole. We spread computations or analyses over large clusters and call it “big data”, where the whole principle of processing them is by feeding those data sequentially—as a stream—through some CPUs.
...
The purpose [of Akka Streams] is to offer an intuitive and safe way to formulate stream processing setups such that we can then execute them efficiently and with bounded resource usage—no more OutOfMemoryErrors. In order to achieve this our streams need to be able to limit the buffering that they employ, they need to be able to slow down producers if the consumers cannot keep up. This feature is called back-pressure and is at the core of the Reactive Streams initiative of which Akka is a founding member.
In the second example, you are handling the input/output streams yourself, using the standard blocking API. I'm not 100% sure how writing to a ZipOutputStream behaves here, but it is possible that it is not flushing the writes and is accumulating everything before close.
The good thing is that you don't need to handle this manually, since Akka Streams provides a way to gzip a Source of ByteStrings:
import javax.inject.Inject

import akka.util.ByteString
import akka.stream.scaladsl.{Compression, FileIO, Source}
import play.api.http.HttpEntity
import play.api.mvc.{BaseController, ControllerComponents}

class FooController @Inject()(val controllerComponents: ControllerComponents) extends BaseController {

  def download = Action {
    val pathOfFile = "/opt/mydir/backups/big/backup"
    val file = new java.io.File(pathOfFile)
    val path: java.nio.file.Path = file.toPath
    val source: Source[ByteString, _] = FileIO.fromPath(path)
    // Compression.gzip produces a gzip stream (not a zip archive), and the compressed
    // length is not known up front, so no content length is declared here.
    val gzipped = source.via(Compression.gzip)
    Ok.sendEntity(HttpEntity.Streamed(gzipped, None, Some("application/gzip")))
      .withHeaders("Content-Disposition" -> s"attachment; filename=backup.gz")
  }
}
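If an actual zip archive (rather than gzip) is required, one hedged option is to bridge the blocking ZipOutputStream API into an Akka Source with StreamConverters.asOutputStream, which back-pressures by blocking the writer when the downstream is slower. This is only a sketch under those assumptions, not the answer's code; the entry name and buffer size are illustrative:

import java.io.{BufferedInputStream, FileInputStream}
import java.util.zip.{ZipEntry, ZipOutputStream}

import akka.stream.scaladsl.{Source, StreamConverters}
import akka.util.ByteString
import scala.concurrent.{ExecutionContext, Future}

// Produce a zipped Source from a file on disk. The zipping runs on a separate
// (ideally blocking-IO) execution context; writes to the OutputStream block when
// the HTTP client cannot keep up, so memory stays bounded. Note that
// asOutputStream has a default write timeout of 5 seconds per element.
def zippedSource(pathOfFile: String)(implicit ec: ExecutionContext): Source[ByteString, _] =
  StreamConverters.asOutputStream().mapMaterializedValue { os =>
    Future {
      val zipStream = new ZipOutputStream(os)
      val is = new BufferedInputStream(new FileInputStream(pathOfFile))
      try {
        zipStream.putNextEntry(new ZipEntry("backup"))
        val buf = new Array[Byte](8192)
        var len = is.read(buf)
        while (len >= 0) {
          zipStream.write(buf, 0, len) // blocks until downstream demand arrives
          len = is.read(buf)
        }
      } finally {
        is.close()
        zipStream.close() // closes the underlying stream and completes the Source
      }
    }
  }

The resulting Source could then be passed to HttpEntity.Streamed exactly as in the gzip example above, with the length left as None.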

Related

Scala making parallel network calls using Futures

I'm new to Scala. I have a method that reads data from a given list of files, makes API calls with the data, and writes the responses to a file.
listOfFiles.map { file =>
  val bufferedSource = Source.fromFile(file)
  val data = bufferedSource.mkString
  bufferedSource.close()
  val response = doApiCall(data) // time consuming task
  if (response.nonEmpty) writeFile(response, outputLocation)
}
The above method takes too much time during the network call, so I tried to use parallel processing to reduce it. I wrapped the time-consuming block of code in a Future, but the program ends quickly and does not generate any output, unlike the code above.
import scala.concurrent.ExecutionContext.Implicits.global

listOfFiles.map { file =>
  val bufferedSource = Source.fromFile(file)
  val data = bufferedSource.mkString
  bufferedSource.close()
  Future {
    val response = doApiCall(data) // time consuming task
    if (response.nonEmpty) writeFile(response, outputLocation)
  }
}
Any suggestions would be helpful. (I also tried using "par", which works fine; I'm exploring options other than 'par', including frameworks like 'akka', 'cats', etc.)
Building on Jatin's answer: instead of using the default execution context, which is backed by daemon threads,
import scala.concurrent.ExecutionContext.Implicits.global
define an execution context with non-daemon threads:
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

implicit val nonDaemonEc = ExecutionContext.fromExecutor(Executors.newCachedThreadPool)
You can also use Future.traverse and Await, like so:
val resultF = Future.traverse(listOfFiles) { file =>
  val bufferedSource = Source.fromFile(file)
  val data = bufferedSource.mkString
  bufferedSource.close()
  Future {
    val response = doApiCall(data) // time consuming task
    if (response.nonEmpty) writeFile(response, outputLocation)
  }
}
Await.result(resultF, Duration.Inf)
Future.traverse takes a List[A] and a function A => Future[B] and returns a Future[List[B]]; the related Future.sequence converts a List[Future[A]] to a Future[List[A]].
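As a further, hedged variation: if the worry is overwhelming the API, a fixed-size pool bounds how many doApiCall invocations run concurrently. This is only a sketch; the pool size of 8 is illustrative, and doApiCall, writeFile, listOfFiles and outputLocation are the question's own placeholders.

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import scala.io.Source

// At most 8 API calls in flight at any time; the pool uses non-daemon threads,
// so the program stays alive until Await returns.
implicit val boundedEc: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

val resultF = Future.traverse(listOfFiles) { file =>
  Future {
    // the file read also moves onto the pool here
    val bufferedSource = Source.fromFile(file)
    val data = try bufferedSource.mkString finally bufferedSource.close()
    val response = doApiCall(data) // time consuming task
    if (response.nonEmpty) writeFile(response, outputLocation)
  }
}
Await.result(resultF, Duration.Inf)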

What is the most efficient way to copy a S3 "folder"?

I want to find an efficient way to copy an S3 folder/prefix with lots of objects to another folder/prefix on the same bucket. This is what I have tried.
Test data: around 200 objects, around 100 MB each.
1) aws s3 cp --recursive. It took around 150 secs.
2) s3-dist-cp. It took around 59 secs.
3) spark & aws jdk, 2 threads. It took around 440 secs.
4) spark & aws jdk, 64 threads. It took around 50 secs.
The threads definitely helped, but with a single thread the AWS Java SDK approach does not seem as efficient as aws s3 cp. Is there a single-threaded programming API with performance comparable to that of aws s3 cp, or is there a better way to copy the data?
Ideally I would prefer a programming API, for more flexibility.
Below is the code I used.
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import java.net.URI

def listAllFiles(rootPath: String): Seq[String] = {
  val fileSystem = FileSystem.get(URI.create(rootPath), new Configuration())
  val it = fileSystem.listFiles(new Path(rootPath), true)
  var files = List[String]()
  while (it.hasNext) {
    files = it.next().getPath.toString :: files
  }
  files
}

def s3CopyFiles(spark: SparkSession, fromPath: String, toPath: String): Unit = {
  val fromFiles = listAllFiles(fromPath)
  val toFiles = fromFiles.map(_.replaceFirst(fromPath, toPath))
  val fileMap = fromFiles.zip(toFiles)
  s3CopyFiles(spark, fileMap)
}

def s3CopyFiles(spark: SparkSession, fileMap: Seq[(String, String)]): Unit = {
  val sc = spark.sparkContext
  val filePairRdd = sc.parallelize(fileMap.toList, sc.defaultParallelism)
  filePairRdd.foreachPartition { it =>
    val p = "s3://([^/]*)/(.*)".r
    val s3 = AmazonS3ClientBuilder.defaultClient()
    while (it.hasNext) {
      val (p(fromBucket, fromKey), p(toBucket, toKey)) = it.next()
      s3.copyObject(fromBucket, fromKey, toBucket, toKey)
    }
  }
}
The AWS SDK transfer manager is multithreaded; you tell it the block size you want to split the copy up by and it will do it across threads and coalesce the output at the end. Your code shouldn't have to care about how the thread/http pool is working.
Remember that the COPY call isn't doing IO on the client; each thread issues the HTTP request and then blocks awaiting the answer, so you can have many, many of them blocked at the same time.
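As a hedged illustration of that TransferManager suggestion (AWS SDK for Java v1; the bucket and key variables are placeholders), the copy could look roughly like this:

import com.amazonaws.services.s3.transfer.TransferManagerBuilder

// TransferManager splits large objects into multipart copies across its own
// thread pool, so the calling code does not manage threads itself.
val transferManager = TransferManagerBuilder.standard().build()
try {
  val copy = transferManager.copy(fromBucket, fromKey, toBucket, toKey)
  copy.waitForCompletion() // blocks until the server-side copy finishes
} finally {
  transferManager.shutdownNow(false) // false: leave any shared S3 client open
}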
I would recommend an async approach, for example reactive-aws-clients. You will still be limited by S3 throttling and bandwidth, but you won't need the brute force of a huge number of threads on the client side. For example, you could create a Monix app with a structure like:
val future = listS3filesTask.flatMap(key => Task.now(getS3Object(key))).runAsync
Await.result(future, 100.seconds)
Another possible optimization could be the S3 torrent feature: if you have multiple consumers, you can distribute data files across them with just one S3 GetObject operation per file.

How to combine multiple TIFFs into one large GeoTiff in Scala?

I am working on a project for finding water depth and extent using a digital ground model (DGM). I have multiple TIFF files covering the area of interest and I want to combine them into a single TIFF file for quick processing. How can I combine them, using my own code below or any other methodology?
I have tried to concatenate the tiles by reading them as input one by one and then combining them, but it throws a GC error, probably because there is something wrong with the code itself. The code is provided below.
import geotrellis.proj4._
import geotrellis.raster._
import geotrellis.raster.io.geotiff._

object waterdepth {
  val directories = List("data")

  // constants to differentiate which bands to use
  val R_BAND = 0
  val G_BAND = 1
  val NIR_BAND = 2

  // Path to our landsat band geotiffs.
  def bandPath(directory: String) = s"../biggis-landuse/radar_data/${directory}"

  def main(args: Array[String]): Unit = {
    directories.map(directory => generateMultibandGeoTiffFile(directory))
  }

  def generateMultibandGeoTiffFile(directory: String) = {
    val tiffFiles = new java.io.File(bandPath(directory)).listFiles.map(_.toString)
    val singleBandGeoTiffArray = tiffFiles.foldLeft(Array[SinglebandGeoTiff]())((acc, el: String) => {
      acc :+ SinglebandGeoTiff(el)
    })
    val tileArray = ArrayMultibandTile(singleBandGeoTiffArray.map(_.tile))
    println(s"Writing out $directory multispectral tif")
    MultibandGeoTiff(tileArray, singleBandGeoTiffArray(0).extent, singleBandGeoTiffArray(0).crs).write(s"data/$directory.tif")
  }
}
It should be able to create a single TIFF file from all the separate files, but it throws a memory error.
The idea you are following is correct; the OOM probably happens because you're loading lots of TIFFs into memory, so it is not surprising. One solution is to allocate more memory to the JVM. However, you can also try this small optimization (which will probably work):
import geotrellis.proj4._
import geotrellis.raster._
import geotrellis.raster.io.geotiff._
import geotrellis.raster.io.geotiff.reader._
import java.io.File

def generateMultibandGeoTiffFile(directory: String) = {
  val tiffs =
    new File(bandPath(directory))
      .listFiles
      .map(_.toString)
      // streaming = true won't force all bytes to load into memory
      // only tiff metadata is fetched here
      .map(GeoTiffReader.readSingleband(_, streaming = true))

  val (extent, crs) = {
    val tiff = tiffs.head
    tiff.extent -> tiff.crs
  }

  // TIFF segments bytes fetch will start only during the write
  MultibandGeoTiff(
    MultibandTile(tiffs.map(_.tile)),
    extent, crs
  ).write(s"data/$directory.tif")
}
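On the "allocate more memory to the JVM" point, a minimal sketch of how that might be done when the job is run through sbt (the 4 GB figure is purely illustrative):

// build.sbt
fork := true                   // run the app in a forked JVM so the options apply
javaOptions ++= Seq("-Xmx4G")  // raise the maximum heap of the forked JVM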

Akka Streams: How to group a list of files in a source by size?

So I currently have an akka stream to read a list of files, and a sink to concatenate them, and that works just fine:
val files = List("a.txt", "b.txt", "c.txt") // and so on
val source = Source(files).flatMapConcat(f => FileIO.fromPath(Paths.get(f)))
val sink = Sink.fold[ByteString, ByteString](ByteString(""))(_ ++ ByteString("\n") ++ _) // Concatenate
source.toMat(sink)(Keep.right).run().flatMap(concatByteStr => writeByteStrToFile(concatByteStr, "an-output-file.txt"))
While this is fine for a simple case, the files are rather large (on the order of GBs) and can't fit in the memory of the machine I'm running this application on, so I'd like to chunk the output after the byte string has reached a certain size. An option is Source.grouped(N), but the files vary greatly in size (from 1 KB to 2 GB), so there's no guarantee of normalizing the size of the output.
My question is whether there's a way to chunk the written files by the size of the ByteString. The Akka Streams documentation is quite overwhelming and I'm having trouble figuring out the library. Any help would be greatly appreciated. Thanks!
The FileIO module from Akka Streams provides you with a streaming IO Sink to write to files, and utility methods to chunk a stream of ByteString. Your example would become something along the lines of
val files = List("a.txt", "b.txt", "c.txt") // and so on;
val source = Source(files).flatMapConcat(f => FileIO.fromPath(Paths.get(f)))
val chunking = Framing.delimiter(ByteString("\n"), maximumFrameLength = 256, allowTruncation = true)
val sink: Sink[ByteString, Future[IOResult]] = FileIO.toPath(Paths.get("an-output-file.txt"))
source.via(chunking).runWith(sink)
Using the FileIO.toPath sink avoids storing the whole folded ByteString in memory (hence allowing proper streaming).
More details on this Akka module can be found in the docs.
I think Stefano Bonetti has already offered a great solution. I just wanted to add that one could also consider building a custom GraphStage to address a specific chunking need. In essence, create a chunk-emitting method like the one below for the in/out handlers, as described in the Akka Streams documentation (a fuller sketch of the surrounding stage follows after the snippet):
private def emitChunk(): Unit = {
  if (buffer.isEmpty) {
    if (isClosed(in)) completeStage()
    else pull(in)
  } else {
    val (chunk, nextBuffer) = buffer.splitAt(chunkSize)
    buffer = nextBuffer
    push(out, chunk)
  }
}
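For context, here is a hedged sketch of the surrounding stage that such an emitChunk method would live in, closely following the chunking example in the Akka Streams cookbook; the class name Chunker is illustrative:

import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}
import akka.util.ByteString

// Buffers incoming ByteStrings and re-emits them in pieces of at most chunkSize bytes.
class Chunker(chunkSize: Int) extends GraphStage[FlowShape[ByteString, ByteString]] {
  val in: Inlet[ByteString] = Inlet("Chunker.in")
  val out: Outlet[ByteString] = Outlet("Chunker.out")
  override val shape: FlowShape[ByteString, ByteString] = FlowShape(in, out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      private var buffer = ByteString.empty

      setHandler(out, new OutHandler {
        override def onPull(): Unit = emitChunk()
      })

      setHandler(in, new InHandler {
        override def onPush(): Unit = {
          buffer ++= grab(in)
          emitChunk()
        }
        override def onUpstreamFinish(): Unit = {
          // if bytes remain buffered, keep the stage open so they drain on later pulls
          if (buffer.isEmpty) completeStage()
        }
      })

      private def emitChunk(): Unit = {
        if (buffer.isEmpty) {
          if (isClosed(in)) completeStage()
          else pull(in)
        } else {
          val (chunk, nextBuffer) = buffer.splitAt(chunkSize)
          buffer = nextBuffer
          push(out, chunk)
        }
      }
    }
}

It can then be dropped into a pipeline with .via(new Chunker(chunkSize)).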
After a week of tinkering with the Akka Streams libraries, the solution I ended up with was a combination of Stefano's answer and a solution provided here. I read the source files line by line via the Framing.delimiter function and then simply use the LogRotatorSink provided by Alpakka. The meat of determining the log rotation is here:
import java.nio.file.{Files, Paths}

import akka.Done
import akka.stream.alpakka.file.scaladsl.LogRotatorSink
import akka.stream.scaladsl.{FileIO, Framing, Sink, Source}
import akka.util.ByteString
import scala.concurrent.Future

val fileSizeRotationFunction = () => {
  val max = 10 * 1024 * 1024 // 10 MB, but whatever you really want; I had it at our HDFS block size
  var size: Long = max
  (element: ByteString) => {
    if (size + element.size > max) {
      val path = Files.createTempFile("out-", ".log")
      size = element.size
      Some(path)
    } else {
      size += element.size
      None
    }
  }
}

val sizeRotatorSink: Sink[ByteString, Future[Done]] =
  LogRotatorSink(fileSizeRotationFunction)

val source = Source(files).flatMapConcat(f => FileIO.fromPath(Paths.get(f)))
val chunking = Framing.delimiter(ByteString("\n"), maximumFrameLength = 256, allowTruncation = true)

source.via(chunking).runWith(sizeRotatorSink)
And that's it. Hope this was helpful to others.

How to use Java libraries asynchronously in a Scala Play 2.0 application?

I see in the Play 2.0 Scala doc for calling web services that the idiomatic approach is to use Scala's asynchronous mechanisms to call web services. So if I'm using Java libraries for, say, downloading images from S3 and uploading to Facebook and Twitter (restfb and twitter4j), does this make for a highly inefficient use of resources (what resources?) or does it not make much difference (or no difference at all)?
If it makes a difference, how would I go about making something like the following asynchronous? Is there a quick way, or would I have to write libraries from scratch?
Note this will be running on heroku, if that matters in this discussion.
def tweetJpeg = Action(parse.urlFormEncoded) { request =>
  val form = request.body
  val folder = form("folder").head
  val mediaType = form("type").head
  val photo = form("photo").head
  val path = folder + "/" + mediaType + "/" + photo

  val config = Play.current.configuration
  val awsAccessKey = config.getString("awsAccessKey").get
  val awsSecretKey = config.getString("awsSecretKey").get
  val awsBucket = config.getString("awsBucket").get
  val awsCred = new BasicAWSCredentials(awsAccessKey, awsSecretKey)
  val amazonS3Client = new AmazonS3Client(awsCred)
  val obj = amazonS3Client.getObject(awsBucket, path)
  val stream = obj.getObjectContent()

  val twitterKey = config.getString("twitterKey").get
  val twitterSecret = config.getString("twitterSecret").get
  val token = form("token").head
  val secret = form("secret").head
  val tweet = form("tweet").head

  val cb = new ConfigurationBuilder()
  cb.setDebugEnabled(true)
    .setOAuthConsumerKey(twitterKey)
    .setOAuthConsumerSecret(twitterSecret)
    .setOAuthAccessToken(token)
    .setOAuthAccessTokenSecret(secret)
  val tf = new TwitterFactory(cb.build())
  val twitter = tf.getInstance()

  val status = new StatusUpdate(tweet)
  status.media(photo, stream)
  val twitResp = twitter.updateStatus(status)

  Logger.info("Tweeted " + twitResp.getText())
  Ok("Tweeted " + twitResp.getText())
}
def facebookJpeg = Action(parse.urlFormEncoded) { request =>
  val form = request.body
  val folder = form("folder").head
  val mediaType = form("type").head
  val photo = form("photo").head
  val path = folder + "/" + mediaType + "/" + photo

  val config = Play.current.configuration
  val awsAccessKey = config.getString("awsAccessKey").get
  val awsSecretKey = config.getString("awsSecretKey").get
  val awsBucket = config.getString("awsBucket").get
  val awsCred = new BasicAWSCredentials(awsAccessKey, awsSecretKey)
  val amazonS3Client = new AmazonS3Client(awsCred)
  val obj = amazonS3Client.getObject(awsBucket, path)
  val stream = obj.getObjectContent()

  val token = form("token").head
  val msg = form("msg").head
  val facebookClient = new DefaultFacebookClient(token)
  val fbClass = classOf[FacebookType]
  val param = com.restfb.Parameter.`with`("message", msg)
  val attachment = com.restfb.BinaryAttachment.`with`(photo + ".png", stream)
  val fbResp = facebookClient.publish("me/photos", fbClass, attachment, param)

  Logger.info("Posted " + fbResp.toString())
  Ok("Posted " + fbResp.toString())
}
My attempt at a guess:
Yes, it's better to do things asynchronously; you tie up threads if you do everything synchronously. Threads are memory hogs, so your server can only use so many; the more that are tied up waiting, the fewer requests your server can respond to.
No, it's not a huge issue. With node.js (and Rails? Django?) it is a huge issue, because there's only one thread, so blocking it blocks your whole web server. A JVM server is multithreaded, so you can still service new requests.
You can easily wrap the whole thing in a Future, or do it more granularly, but that doesn't really buy you anything, because you're calling the same methods; you're just shifting the wait from one thread to another.
If those Java libraries offer asynchronous methods, you can wrap those in a Future to get the real benefits of asynchrony (how exactly is the open question; a sketch follows below). Otherwise, yes, you're looking at writing things from the ground up.
I don't really know whether running on Heroku matters. Is one dyno == one simultaneous request?
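To illustrate the "wrap it in a Future" idea in hedged form: in Play versions from 2.2 onward the idiom is Action.async; tweetOnce is a hypothetical helper holding the blocking S3 + Twitter code from the question, and the execution context should be one sized for blocking work.

import scala.concurrent.{ExecutionContext, Future}

// blockingEc: an ExecutionContext meant for blocking IO (assumed to be configured elsewhere)
def tweetJpegAsync(implicit blockingEc: ExecutionContext) =
  Action.async(parse.urlFormEncoded) { request =>
    Future {
      val twitResp = tweetOnce(request.body) // blocking calls run off the default thread pool
      Ok("Tweeted " + twitResp.getText())
    }
  }

The answer below takes an actor-based route to the same end.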
I think it's best to do these requests asynchronously for two main reasons:
high latency (network calls)
failures
With Play, you should use Akka actors to back your actions; Akka provides great ways to deal with these two concerns.
The problem with synchronous code is that it blocks the web server, so it isn't available for other requests. Here we move the waiting onto other threads, unrelated to the web server.
You could do something like:
// you will have to write the TwitterActor
val twitterActor = Akka.system.actorOf(Props[TwitterActor], name = "twitter-actor")

def tweetJpeg = Action(parse.urlFormEncoded) { request =>
  val futureMessage = (twitterActor ? request.body).map {
    // Do something with the response from the actor
    case ... => ...
  }
  Async {
    futureMessage.map(message =>
      Ok("Tweeted " + message)
    )
  }
}
Your actor would receive the body and send back the response of the service.
Moreover, with Akka you can tune your process to have several actors available, add a circuit breaker, and so on.
To go further: http://doc.akka.io/docs/akka/2.1.2/scala/actors.html
PS: I never tried Play on Heroku, so I don't know the impact of a single dyno.