Transform AWS S3 Object into ByteString issue - scala

I'm working on a process of uploading files from S3 to Facebook using Akka. According to the Facebook API docs, files should be uploaded in small parts, or chunks. Based on the file size, Facebook gives you information about the byte offsets it expects to receive in the next request.
First, I make a GetObjectRequest to S3 via the Java AWS SDK in order to receive a chunk of the required byte size:
val objChunkReq = new GetObjectRequest(get.s3ObjId.bucketName, get.s3ObjId.key)
objChunkReq.setRange(get.fbUploadSession.from, get.fbUploadSession.to)
Try(s3Client.getObject(objChunkReq)) match {
  case Success(s3ObjChunk) => Right(S3ObjChunk(s3ObjChunk, get.fbUploadSession))
  case Failure(ex)         => Left(S3Exception(ex.getMessage))
}
Then, if the S3 response is successful, I can work with the received chunk as an InputStream and pass it on to the Facebook HTTP request:
private def inputStreamToArrayByte(is: InputStream) = {
  Try {
    val reads: Int = is.read()
    val byteStringBuilder = ByteString.newBuilder
    while (is.read() != -1) {
      byteStringBuilder.asOutputStream.write(reads)
      is.read()
    }
    is.close()
    byteStringBuilder.result()
  }
}
The issue I faced is that the s3ObjChunk from the first code snippet is twice as large in bytes as the resulting ByteString from the second one:
s3ObjChunk.getObjectMetadata.getContentLength == n
byteStringBuilder.result().length == n / 2
I have two assumptions:
a) I transform the InputStream into ByteString incorrectly
b) The ByteString compresses the InputStream
How to transform an S3 object InputStream into a ByteString correctly?

The issue with n vs n / 2 in the resulting output is explained by a bug in the implementation.
is.read() is called twice per loop iteration (once in the condition and once in the body), and neither of those bytes is written to the output stream; instead, only the first byte, read before the loop and stored in val reads, is written over and over. Since two bytes are consumed from the stream for every byte written, the result is about half the expected size (and contains the wrong data as well).
The implementation should change to something like:
val byteStringBuilder = ByteString.newBuilder
val output = byteStringBuilder.asOutputStream
try {
  var reads: Int = is.read() // note "var" instead of "val"
  while (reads != -1) {
    output.write(reads)
    reads = is.read()
  }
} finally {
  output.close()
  is.close() // should it be here or closed by the caller?
}
byteStringBuilder.result()
Or, another approach would be to use slightly more idiomatic stream reading with scala.io.Source (keeping in mind that Source decodes characters, so for binary data the explicit byte loop above is the safer choice), for example:
val byteStringBuilder = ByteString.newBuilder
val output = byteStringBuilder.asOutputStream
scala.io.Source.fromInputStream(is).foreach(output.write(_))
byteStringBuilder.result()
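As a side note, reading one byte per call is slow for larger chunks. A buffered variant of the same idea (a sketch only, using Akka's ByteStringBuilder as above) could read into a fixed-size array instead:

import java.io.InputStream
import akka.util.ByteString

// Sketch: read the stream in 8 KB blocks instead of byte-by-byte.
def inputStreamToByteString(is: InputStream): ByteString = {
  val builder = ByteString.newBuilder
  val buffer = new Array[Byte](8192)
  try {
    var read = is.read(buffer)
    while (read != -1) {
      builder.putBytes(buffer, 0, read) // append only the bytes actually read
      read = is.read(buffer)
    }
  } finally is.close()
  builder.result()
}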

Related

Chunk file into multiple streams in scala

I have a use case in my program where I need to take a file, split it into N roughly equal parts, and upload them remotely.
I'd like a function that takes, say, a File and outputs a list of BufferedReaders, which I can then distribute and pass to another function that uses some API to store them.
I've seen examples where authors utilize the .lines() method of a BufferedReader:
import java.io.{BufferedReader, ByteArrayInputStream, InputStreamReader}

def splitFile: List[java.util.stream.Stream[String]] = {
  val temp = "Test mocked file contents\nTest"
  val is = new ByteArrayInputStream(temp.getBytes)
  val br = new BufferedReader(new InputStreamReader(is))
  // Chunk the file into two sort-of equal parts.
  // Stream 1
  val test = br.lines().skip(1).limit(1)
  // Stream 2
  val test2 = br.lines().skip(2).limit(1)
  List(test, test2)
}
I suppose that the above example works; it's not beautiful, but it works.
My questions:
Is there a way to split a BufferedReader into multiple streams?
I don't control the format of the file, so the file contents could potentially be one single line. Wouldn't that just mean that .lines() loads all of it into a Stream of one element?
It will be much easier to read the file once and then split it up. You're not really losing anything either, since reading the file is a serial operation anyway. From there, just devise a scheme to slice up the list. In this case, I group everything based on its index modulo the number of desired output lists, then pull out the lists. If there are fewer lines than you ask for, it will just place each line in a separate list.
import scala.collection.JavaConverters._

// br.lines() returns a java.util.stream.Stream, so convert it to a Scala List first
val lines: List[String] = br.lines.iterator.asScala.toList
val outputListQuantity: Int = 2
val data: List[List[String]] = lines.zipWithIndex
  .groupBy(_._2 % outputListQuantity)
  .values.map(_.map(_._1)).toList
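To illustrate the modulo grouping on a small, made-up list:

val sample = List("a", "b", "c", "d", "e")
val grouped = sample.zipWithIndex.groupBy(_._2 % 2).values.map(_.map(_._1)).toList
// grouped contains List("a", "c", "e") and List("b", "d") (the order of the outer lists is not guaranteed)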
Well, if you don't mind reading the whole stream into memory, it's easy enough (assuming the file contains text, since you are talking about Readers, but it would be the same idea with binary data):
Source.fromFile("filename")
  .mkString
  .getBytes
  .grouped(chunkSize)
  .map { chunk => new BufferedReader(new InputStreamReader(new ByteArrayInputStream(chunk))) }
But that seems to sorta defeat the purpose: if the file is small enough to be loaded into memory entirely, why bother splitting it to begin with?
So, a more practical solution is a little bit more involved:
import java.io.{ByteArrayInputStream, InputStream}
import scala.collection.AbstractIterator

def splitFile(
    input: InputStream,
    chunkSize: Int
): Iterator[InputStream] = new AbstractIterator[InputStream] {
  var hasNext = true
  def next = {
    val buffer = new Array[Byte](chunkSize)
    val bytes = input.read(buffer)  // may read fewer than chunkSize bytes, or -1 at end of input
    hasNext = bytes == chunkSize    // a short (or empty) read means the input is exhausted
    new ByteArrayInputStream(buffer, 0, bytes max 0)
  }
}
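A hypothetical usage sketch, wrapping each binary chunk back into a BufferedReader as the question asked (the file name and chunk size are made up):

import java.io.{BufferedReader, FileInputStream, InputStreamReader}

// Split a file into 1 KB chunks and expose each chunk as a BufferedReader
val readers: Iterator[BufferedReader] =
  splitFile(new FileInputStream("in.txt"), chunkSize = 1024)
    .map(chunk => new BufferedReader(new InputStreamReader(chunk)))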

Using Iteratees and Enumerators in Play Scala to Stream Data to S3

I am building a Play Framework application in Scala where I would like to stream an array of bytes to S3. I am using the Play-S3 library to do this. The "Multipart file upload" section of the documentation is what's relevant here:
// Retrieve an upload ticket
val result: Future[BucketFileUploadTicket] =
  bucket initiateMultipartUpload BucketFile(fileName, mimeType)
// Upload the parts and save the tickets
val result: Future[BucketFilePartUploadTicket] =
  bucket uploadPart (uploadTicket, BucketFilePart(partNumber, content))
// Complete the upload using both the upload ticket and the part upload tickets
val result: Future[Unit] =
  bucket completeMultipartUpload (uploadTicket, partUploadTickets)
I am trying to do the same thing in my application but with Iteratees and Enumerators.
The streams and asynchronicity make things a little complicated, but here is what I have so far (note that uploadTicket is defined earlier in the code):
val partNumberStream = Stream.iterate(1)(_ + 1).iterator
val partUploadTicketsIteratee =
  Iteratee.fold[Array[Byte], Future[Vector[BucketFilePartUploadTicket]]](Future.successful(Vector.empty[BucketFilePartUploadTicket])) { (partUploadTickets, bytes) =>
    bucket.uploadPart(uploadTicket, BucketFilePart(partNumberStream.next(), bytes)).flatMap(partUploadTicket => partUploadTickets.map(_ :+ partUploadTicket))
  }
(body |>>> partUploadTicketsIteratee).andThen {
  case result =>
    result.map(_.map(partUploadTickets => bucket.completeMultipartUpload(uploadTicket, partUploadTickets))) match {
      case Success(x) => x.map(d => println("Success"))
      case Failure(t) => throw t
    }
}
Everything compiles and runs without incident. In fact, "Success" gets printed, but no file ever shows up on S3.
There might be multiple problems with your code. It's a bit unreadable because of the nested map calls. You might have a problem with your future composition. Another problem might be caused by the fact that all chunks (except for the last) must be at least 5MB.
The code below has not been tested, but shows a different approach. The iteratee approach is one where you can create small building blocks and compose them into a pipe of operations.
To make the code compile, I added a trait and a few methods:
trait BucketFilePartUploadTicket
val uploadPart: (Int, Array[Byte]) => Future[BucketFilePartUploadTicket] = ???
val completeUpload: Seq[BucketFilePartUploadTicket] => Future[Unit] = ???
val body: Enumerator[Array[Byte]] = ???
Here we create a few parts
// Create 5MB chunks
val chunked = {
  val take5MB = Traversable.takeUpTo[Array[Byte]](1024 * 1024 * 5)
  Enumeratee.grouped(take5MB transform Iteratee.consume())
}
// Add a counter, used as part number later on
val zipWithIndex = Enumeratee.scanLeft[Array[Byte]](0 -> Array.empty[Byte]) {
  case ((counter, _), bytes) => (counter + 1) -> bytes
}
// Map the (Int, Array[Byte]) tuple to a BucketFilePartUploadTicket
val uploadPartTickets = Enumeratee.mapM[(Int, Array[Byte])](uploadPart.tupled)
// Construct the pipe to connect to the enumerator.
// The ><> operator is an alias for compose; it is more intuitive because of
// its arrow-like structure.
val pipe = chunked ><> zipWithIndex ><> uploadPartTickets
// Create a consumer that ends by finishing the upload
val consumeAndComplete =
  Iteratee.getChunks[BucketFilePartUploadTicket] mapM completeUpload
Running it is done by simply connecting the parts
// This is the result, a Future[Unit]
val result = body through pipe run consumeAndComplete
Note that I did not test any of this code and might have made some mistakes in my approach. This does, however, show a different way of dealing with the problem and should help you find a good solution.
Note that this approach waits for one part to complete uploading before it takes on the next part. If the connection from your server to Amazon is slower than the connection from the browser to your server, this mechanism will slow down the input.
You could take another approach where you do not wait for the Future of the part upload to complete. This would result in another step where you use Future.sequence to convert the sequence of upload futures into a single future containing a sequence of the results. The result would be a mechanism that sends a part to Amazon as soon as you have enough data.
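A rough, untested sketch of that variant, reusing the chunked and zipWithIndex enumeratees above together with the same hypothetical uploadPart and completeUpload helpers (and assuming an implicit ExecutionContext is in scope):

// Start each part upload as soon as its chunk is available, keeping the Future instead of waiting on it
val startUploads = Enumeratee.map[(Int, Array[Byte])] { case (partNumber, bytes) => uploadPart(partNumber, bytes) }
// Collect all the futures, sequence them, then complete the upload
val sequenceAndComplete =
  Iteratee.getChunks[Future[BucketFilePartUploadTicket]] mapM { futures =>
    Future.sequence(futures).flatMap(tickets => completeUpload(tickets))
  }
val eagerResult: Future[Unit] =
  body through (chunked ><> zipWithIndex ><> startUploads) run sequenceAndComplete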

Bucketed Sink in scalaz-stream

I am trying to make a sink that writes a stream to bucketed files: when a particular condition (time, size of file, etc.) is reached, the current output stream is closed and a new one is opened to a new bucket file.
I checked how the different sinks were created in the io object, but there aren't many examples. So I tried to follow how resource and chunkW were written. I ended up with the following bit of code, where for simplicity buckets are just represented by an Int for now, but would eventually be some type of output stream.
val buckets: Channel[Task, String, Int] = {
  //recursion to step through the stream
  def go(step: Task[String => Task[Int]]): Process[Task, String => Task[Int]] = {
    // Emit the value and repeat
    def next(msg: String => Task[Int]) =
      Process.emit(msg) ++
        go(step)
    Process.await[Task, String => Task[Int], String => Task[Int]](step)(
      next
      , Process.halt // TODO ???
      , Process.halt) // TODO ???
  }
  //starting bucket
  val acquire: Task[Int] = Task.delay {
    val startBuck = nextBucket(0)
    println(s"opening bucket $startBuck")
    startBuck
  }
  //the write step
  def step(os: Int): Task[String => Task[Int]] =
    Task.now((msg: String) => Task.delay {
      write(os, msg)
      val newBuck = nextBucket(os)
      if (newBuck != os) {
        println(s"closing bucket $os")
        println(s"opening bucket $newBuck")
      }
      newBuck
    })
  //start the Channel
  Process.await(acquire)(
    buck => go(step(buck))
    , Process.halt, Process.halt)
}

def write(bucket: Int, msg: String) { println(s"$bucket\t$msg") }
def nextBucket(b: Int) = b + 1
There are a number of issues in this:
step is passed the bucket once at the start, and this never changes during the recursion. I am not sure how, in the recursive go, to create a new step task that uses the bucket (Int) from the previous task, as I have to provide a String to get to that task.
the fallback and cleanup callbacks of the await calls do not receive the result of rcv (if there is one). In the io.resource function this works fine, as the resource is fixed; however, in my case the resource might change at any step. How would I go about passing the reference to the currently open bucket to these callbacks?
Well, for one of the options (i.e. time) you may use a simple go on the sink. This one is time-based, essentially reopening the file every hour:
val metronome = Process.awakeEvery(1.hour).map(true)

def writeFileSink(file: String): Sink[Task, ByteVector] = ???

def timeBasedSink(prefix: String) = {
  def go(index: Int): Sink[Task, ByteVector] = {
    metronome.wye(writeFileSink(prefix + "_" + index))(wye.interrupt) ++ go(index + 1)
  }
  go(0)
}
For the other options (e.g. bytes written) you can use a similar technique: just keep a signal of the bytes written and combine it with the Sink.

Play 2.x : Reactive file upload with Iteratees

I will start with the question: how do I use the Scala API's Iteratee to upload a file to cloud storage (Azure Blob Storage in my case, but I don't think that's the most important part right now)?
Background:
I need to chunk the input into blocks of about 1 MB in order to store large media files (300 MB+) as Azure BlockBlobs. Unfortunately, my Scala knowledge is still poor (my project is Java-based and the only use for Scala in it will be an Upload controller).
I tried with this code: Why makes calling error or done in a BodyParser's Iteratee the request hang in Play Framework 2.0? (as an Input Iteratee). It works quite well, but each Element that I could use has a size of 8192 bytes, which is too small for sending files of several hundred megabytes to the cloud.
I must say that's quite a new approach for me, and most probably I misunderstood something (I don't want to say that I misunderstood everything ;> )
I will appreciate any hint or link which will help me with this topic. If there is any sample of similar usage, that would be the best option for me to get the idea.
Basically, what you need first is to rechunk the input into bigger chunks of 1024 * 1024 bytes.
First, let's have an Iteratee that will consume up to 1MB of bytes (it's OK for the last chunk to be smaller):
val consumeAMB =
Traversable.takeUpTo[Array[Byte]](1024*1024) &>> Iteratee.consume()
Using that, we can construct an Enumeratee (adapter) that will regroup chunks, using an API called grouped:
val rechunkAdapter:Enumeratee[Array[Byte],Array[Byte]] =
Enumeratee.grouped(consumeAMB)
Here, grouped uses an Iteratee to determine how much to put in each chunk. It uses our consumeAMB for that, which means the result is an Enumeratee that rechunks the input into Array[Byte]s of 1MB.
Now we need to write the BodyParser, which will use the Iteratee.foldM method to send each chunk of bytes:
// `ConnectionHandle` is a placeholder for whatever handle type your storage client returns
val writeToStore: Iteratee[Array[Byte], ConnectionHandle] =
  Iteratee.foldM[Array[Byte], ConnectionHandle](connectionHandle) { (c, bytes) =>
    // write bytes and return the next handle, probably in a Future
  }
foldM passes a state along and uses it in its supplied function (S, Array[Byte]) => Future[S] to return a new Future of state. foldM will not call the function again until the Future is completed and there is an available chunk of input.
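For example, a minimal sketch (assuming Play 2.x iteratees) where the state is simply a running count of the bytes written so far:

import play.api.libs.iteratee.Iteratee
import play.api.libs.concurrent.Execution.Implicits.defaultContext
import scala.concurrent.Future

val countingWriter: Iteratee[Array[Byte], Long] =
  Iteratee.foldM[Array[Byte], Long](0L) { (bytesWritten, chunk) =>
    // asynchronously write `chunk` to the store here, then return the new state
    Future.successful(bytesWritten + chunk.length)
  }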
And the body parser will be rechunking input and pushing it into the store:
BodyParser( rh => (rechunkAdapter &>> writeToStore).map(Right(_)))
Returning a Right indicates that you are returning a body at the end of the body parsing (which here happens to be the handle).
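For completeness, a hypothetical controller action wiring in such a body parser (the names are illustrative, not from the question) might look like:

// Hypothetical: use the custom body parser in an Action; request.body is the final fold state.
def upload = Action(BodyParser(rh => (rechunkAdapter &>> writeToStore).map(Right(_)))) { request =>
  Ok("Upload finished, final handle: " + request.body)
}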
If your goal is to stream to S3, here is a helper that I have implemented and tested:
def uploadStream(bucket: String, key: String, enum: Enumerator[Array[Byte]])
                (implicit ec: ExecutionContext): Future[CompleteMultipartUploadResult] = {
  import scala.collection.JavaConversions._

  val initRequest = new InitiateMultipartUploadRequest(bucket, key)
  val initResponse = s3.initiateMultipartUpload(initRequest)
  val uploadId = initResponse.getUploadId

  val rechunker: Enumeratee[Array[Byte], Array[Byte]] = Enumeratee.grouped {
    Traversable.takeUpTo[Array[Byte]](5 * 1024 * 1024) &>> Iteratee.consume()
  }

  val uploader = Iteratee.foldM[Array[Byte], Seq[PartETag]](Seq.empty) { case (etags, bytes) =>
    val uploadRequest = new UploadPartRequest()
      .withBucketName(bucket)
      .withKey(key)
      .withPartNumber(etags.length + 1)
      .withUploadId(uploadId)
      .withInputStream(new ByteArrayInputStream(bytes))
      .withPartSize(bytes.length)

    val etag = Future { s3.uploadPart(uploadRequest).getPartETag }
    etag.map(etags :+ _)
  }

  val futETags = enum &> rechunker |>>> uploader

  futETags.map { etags =>
    val compRequest = new CompleteMultipartUploadRequest(bucket, key, uploadId, etags.toBuffer[PartETag])
    s3.completeMultipartUpload(compRequest)
  }.recoverWith { case e: Exception =>
    s3.abortMultipartUpload(new AbortMultipartUploadRequest(bucket, key, uploadId))
    Future.failed(e)
  }
}
Add the following to your config file:
play.http.parser.maxMemoryBuffer=256K
For those who are also trying to figure out a solution to this streaming problem, instead of writing a whole new BodyParser you can also use what has already been implemented in parse.multipartFormData.
You can implement something like the code below to override the default handler handleFilePartAsTemporaryFile.
def handleFilePartAsS3FileUpload: PartHandler[FilePart[String]] = {
  handleFilePart {
    case FileInfo(partName, filename, contentType) =>
      (rechunkAdapter &>> writeToS3).map {
        _ =>
          val compRequest = new CompleteMultipartUploadRequest(...)
          amazonS3Client.completeMultipartUpload(compRequest)
          ...
      }
  }
}

def multipartFormDataS3: BodyParser[MultipartFormData[String]] = multipartFormData(handleFilePartAsS3FileUpload)
I am able to make this work, but I am still not sure whether the whole upload process is streamed. I tried some large files, and it seems the S3 upload only starts when the whole file has been sent from the client side.
I looked at the above parser implementation, and I think everything is connected using Iteratees, so the file should be streamed.
If someone has some insight on this, that will be very helpful.

Modifying a large file in Scala

I am trying to modify a large PostScript file in Scala (some are as large as 1GB in size). The file is a group of batches, with each batch containing a code that represents the batch number, number of pages, etc.
I need to:
Search the file for the batch codes (which always start with the same line in the file)
Count the number of pages until the next batch code
Modify the batch code to include how many pages are in each batch.
Save the new file in a different location.
My current solution uses two iterators (iterA and iterB), created from Source.fromFile("file.ps").getLines. The first iterator (iterA) traverses in a while loop to the beginning of a batch code (with iterB.next being called each time as well). iterB then continues searching until the next batch code (or the end of the file), counting the number of pages it passes as it goes. Then it updates the batch code at iterA's position, and the process repeats.
This seems very non-Scala-like, and I still haven't designed a good way to save these changes into a new file.
What is a good approach to this problem? Should I ditch iterators entirely? I'd prefer to do it without holding the entire input or output in memory at once.
Thanks!
You could probably implement this with Scala's Stream class. I am assuming that you don't mind holding one "batch" in memory at a time.
import scala.annotation.tailrec
import scala.io._

def isBatchLine(line: String): Boolean = ...
def batchLine(size: Int): String = ...

val it = Source.fromFile("in.ps").getLines
// cannot use it.toStream here because of SI-4835
def inLines = Stream.continually(it).takeWhile(_.hasNext).map(_.next)

// Note: using `def` instead of `val` here means we don't hold
// the entire stream in memory
def batchedLinesFrom(stream: Stream[String]): Stream[String] = {
  val (batch, remainder) = stream span { !isBatchLine(_) }
  if (batch.isEmpty && remainder.isEmpty) {
    Stream.empty
  } else {
    batchLine(batch.size) #:: batch #::: batchedLinesFrom(remainder.drop(1))
  }
}

def newLines = batchedLinesFrom(inLines dropWhile isBatchLine)

val ps = new java.io.PrintStream(new java.io.File("out.ps"))
newLines foreach ps.println
ps.close()
If you're not in pursuit of functional Scala enlightenment, I'd recommend a more imperative style using java.util.Scanner#findWithinHorizon. My example is quite naive, iterating through the input twice.
val scanner = new Scanner(inFile)
val writer = new BufferedWriter(...)

def loop() = {
  // you might want to limit the horizon to prevent OutOfMemoryError
  Option(scanner.findWithinHorizon(".*YOUR-BATCH-MARKER", 0)) match {
    case Some(batch) =>
      val pageCount = countPages(batch)
      writePageCount(writer, pageCount)
      writer.write(batch)
      loop()
    case None =>
  }
}

loop()
scanner.close()
writer.close()
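The countPages and writePageCount helpers are left undefined above; hypothetical sketches (assuming pages are marked with the standard %%Page: DSC comment, which the question does not specify) might look like:

// Hypothetical helpers; the real markers depend on the PostScript your system produces.
def countPages(batch: String): Int =
  batch.split("\n").count(_.startsWith("%%Page:"))
def writePageCount(writer: java.io.Writer, pageCount: Int): Unit =
  writer.write(s"Pages in batch: $pageCount\n")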
Maybe you can use span and duplicate effectively. Assuming the iterator is positioned at the start of a batch, you take the span before the next batch, duplicate it so that you can count the pages, write the modified batch line, and then write the pages using the duplicated iterator. Then process the next batch recursively...
def batch(i: Iterator[String]) {
  if (i.hasNext) {
    assert(i.next() == "batch")
    val (current, next) = i.span(_ != "batch")
    val (forCounting, forWriting) = current.duplicate
    val count = forCounting.filter(_ == "p").size
    println("batch " + count)
    forWriting.foreach(println)
    batch(next)
  }
}
Assuming the following input:
val src = Source.fromString("head\nbatch\np\np\nbatch\np\nbatch\np\np\np\n")
You position the iterator at the start of a batch and then process the batches:
val (head, next) = src.getLines.span(_ != "batch")
head.foreach(println)
batch(next)
This prints:
head
batch 2
p
p
batch 1
p
batch 3
p
p
p