Directive to complete with RandomAccessFile read - scala

I have a large data file and respond to GET requests with very small portions of that file as Array[Byte].
The directive is:
get {
  dataRepo.load(param).map(data =>
    complete(
      HttpResponse(
        entity = HttpEntity(myContentType, data),
        headers = List(gzipContentEncoding)
      )
    )
  ).getOrElse(complete(HttpResponse(status = StatusCodes.NoContent)))
}
Where dataRepo.load is a function along the lines of:
val pointers: Option[(Long, Int)] = calculateFilePointers(param)
pointers.map { case (index, length) =>
  val dataReader = new RandomAccessFile(dataFile, "r")
  dataReader.seek(index)
  val data = Array.ofDim[Byte](length)
  dataReader.readFully(data)
  data
}
Is there a more efficient way to pipe the RandomAccessFile read directly back in the response, rather than having to read it fully first?

Instead of reading the data into an Array[Byte], you could create an Iterator[Array[Byte]] which reads chunks of the file at a time:
val dataReader = new RandomAccessFile(dataFile, "r")
val chunkSize = 1024
Iterator
  .range(0, length, chunkSize)
  .map { offset =>
    val currentBytes =
      Array.ofDim[Byte](Math.min(chunkSize, length - offset))
    dataReader.seek(index + offset)
    dataReader.readFully(currentBytes)
    currentBytes
  }
This iterator can now feed an akka Source:
val source: Source[Array[Byte], _] =
  Source.fromIterator(() => dataRepo.load(param))
Which can then feed an HttpEntity:
val byteStrSource: Source[ByteString, _] = source.map(ByteString.apply)
val httpEntity = HttpEntity(myContentType, byteStrSource)
Now each request only uses 1024 bytes of memory at a time instead of the full length of the file read. This makes your server much more efficient at handling multiple concurrent requests, and it allows dataRepo.load to return immediately with a lazy Source value instead of a Future.
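For completeness, here is a sketch (not part of the original answer) of how the route could wire this together, assuming dataRepo.load(param) now returns Option[Iterator[Array[Byte]]]:
get {
  dataRepo.load(param) match {
    case Some(chunks) =>
      // Wrap the lazy iterator in a Source and stream it as a chunked entity
      val byteStrSource: Source[ByteString, _] =
        Source.fromIterator(() => chunks).map(ByteString.apply)
      complete(
        HttpResponse(
          entity = HttpEntity(myContentType, byteStrSource),
          headers = List(gzipContentEncoding)
        )
      )
    case None =>
      complete(HttpResponse(status = StatusCodes.NoContent))
  }
}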

Scala: How to get the content of PortableDataStream instance from an RDD

I want to extract data from binary files, so I read them using
val dataRDD = sc.binaryFiles("Path")
and get the result as org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)].
I want to extract the content of my files, which is in the form of a PortableDataStream.
For that I tried: val data = dataRDD.map(x => x._2.open()).collect()
but I get the following error:
java.io.NotSerializableException: org.apache.hadoop.hdfs.client.HdfsDataInputStream
If you have an idea how I can solve my issue, please help!
Many thanks in advance.
Actually, the PortableDataStream is Serializable. That's what it is meant for. Yet, open() returns a simple DataInputStream (HdfsDataInputStream in your case because your file is on HDFS) which is not Serializable, hence the error you get.
In fact, when you open the PortableDataStream, you just need to read the data right away. In Scala, you can use scala.io.Source.fromInputStream:
val data: RDD[Array[String]] = sc
  .binaryFiles("path/.../")
  .map { case (fileName, pds) =>
    scala.io.Source.fromInputStream(pds.open())
      .getLines().toArray
  }
This code assumes that the data is textual. If it is not, you can adapt it to read any kind of binary data. Here is an example that creates a sequence of bytes, which you can then process the way you want.
val rdd: RDD[Seq[Byte]] = sc.binaryFiles("...")
  .map { case (file, pds) =>
    val dis = pds.open()
    val buffer = Array.ofDim[Byte](1024)
    val all = scala.collection.mutable.ArrayBuffer[Byte]()
    var read = dis.read(buffer)
    while (read != -1) {
      all ++= buffer.take(read) // only keep the bytes actually read
      read = dis.read(buffer)
    }
    dis.close()
    all.toSeq
  }
See the javadoc of DataInputStream for more possibilities; for instance, it provides readLong, readDouble, and similar methods.
val bf = sc.binaryFiles("...")
val bytes = bf.map { case (file, pds) =>
  val dis = pds.open()
  val len = dis.available()
  val buf = Array.ofDim[Byte](len)
  dis.readFully(buf)
  buf
}
bytes: org.apache.spark.rdd.RDD[Array[Byte]] = MapPartitionsRDD[21] at map at <console>:26
scala> bytes.take(1)(0).size
res15: Int = 5879609 // this happened to be the size of my first binary file
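If commons-io happens to be on the classpath (it shows up in a later question here), the same byte-slurping map can be written more compactly. This is a sketch, not part of the original answer:
import org.apache.commons.io.IOUtils
import org.apache.spark.rdd.RDD

val bytes: RDD[Array[Byte]] = sc.binaryFiles("...")
  .map { case (file, pds) =>
    val dis = pds.open()
    try IOUtils.toByteArray(dis) // read the whole stream into one array
    finally dis.close()
  }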

Akka Streams: How to group a list of files in a source by size?

So I currently have an akka stream to read a list of files, and a sink to concatenate them, and that works just fine:
val files = List("a.txt", "b.txt", "c.txt") // and so on;
val source = Source(files).flatMapConcat(f => FileIO.fromPath(Paths.get(f)))
val sink = Sink.fold[ByteString, ByteString](ByteString(""))(_ ++ ByteString("\n") ++ _) // Concatenate with newlines
source.toMat(sink)(Keep.right).run().flatMap(concatByteStr => writeByteStrToFile(concatByteStr, "an-output-file.txt"))
While this is fine for a simple case, the files are rather large (on the order of GBs) and can't fit in the memory of the machine I'm running this application on. So I'd like to chunk it after the byte string has reached a certain size. An option is doing it with Source.grouped(N), but files vary greatly in size (from 1 KB to 2 GB), so there's no guarantee on normalizing the size of the file.
My question is whether there's a way to chunk the file writes by the size of the ByteString. The documentation for Akka Streams is quite overwhelming and I'm having trouble figuring out the library. Any help would be greatly appreciated. Thanks!
The FileIO module from Akka Streams provides a streaming IO Sink to write to files, and the Framing utilities provide a way to chunk a stream of ByteString. Your example would become something along the lines of:
val files = List("a.txt", "b.txt", "c.txt") // and so on;
val source = Source(files).flatMapConcat(f => FileIO.fromPath(Paths.get(f)))
val chunking = Framing.delimiter(ByteString("\n"), maximumFrameLength = 256, allowTruncation = true)
val sink: Sink[ByteString, Future[IOResult]] = FileIO.toPath(Paths.get("an-output-file.txt"))
source.via(chunking).runWith(sink)
Using the FileIO.toPath sink avoids storing the whole folded ByteString in memory (hence allowing proper streaming).
More details on this Akka module can be found in the docs.
I think Stefano Bonetti already offered a great solution. Just wanted to add that one could also consider building a custom GraphStage to address a specific chunking need. In essence, create a chunk-emitting method like the one below for the in/out handlers, as described in the Akka Streams documentation:
private def emitChunk(): Unit = {
  if (buffer.isEmpty) {
    if (isClosed(in)) completeStage()
    else pull(in)
  } else {
    val (chunk, nextBuffer) = buffer.splitAt(chunkSize)
    buffer = nextBuffer
    push(out, chunk)
  }
}
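For context, here is a sketch of the GraphStage such an emitChunk method would live in, loosely following the chunking example from the Akka Streams documentation (the class name and structure are illustrative, not taken from the answer):
import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}
import akka.util.ByteString

class Chunker(chunkSize: Int) extends GraphStage[FlowShape[ByteString, ByteString]] {
  val in = Inlet[ByteString]("Chunker.in")
  val out = Outlet[ByteString]("Chunker.out")
  override val shape = FlowShape.of(in, out)

  override def createLogic(attributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      private var buffer = ByteString.empty

      setHandler(out, new OutHandler {
        override def onPull(): Unit = emitChunk()
      })
      setHandler(in, new InHandler {
        override def onPush(): Unit = {
          buffer ++= grab(in) // accumulate incoming bytes
          emitChunk()
        }
        override def onUpstreamFinish(): Unit = {
          if (buffer.isEmpty) completeStage()
          else if (isAvailable(out)) emitChunk() // flush whatever is left
        }
      })

      private def emitChunk(): Unit = {
        if (buffer.isEmpty) {
          if (isClosed(in)) completeStage()
          else pull(in)
        } else {
          val (chunk, nextBuffer) = buffer.splitAt(chunkSize)
          buffer = nextBuffer
          push(out, chunk)
        }
      }
    }
}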
After a week of tinkering with the Akka Streams libraries, the solution I ended up with was a combination of Stefano's answer and a solution provided elsewhere. I read the source files line by line via the Framing.delimiter function, and then simply use the LogRotatorSink provided by Alpakka. The meat of determining the log rotation is here:
val fileSizeRotationFunction = () => {
  val max = 10 * 1024 * 1024 // 10 MB, but whatever you really want; I had it at our HDFS block size
  var size: Long = max
  (element: ByteString) => {
    if (size + element.size > max) {
      val path = Files.createTempFile("out-", ".log")
      size = element.size
      Some(path)
    } else {
      size += element.size
      None
    }
  }
}
val sizeRotatorSink: Sink[ByteString, Future[Done]] =
LogRotatorSink(fileSizeRotationFunction)
val source = Source(files).flatMapConcat(f => FileIO.fromPath(Paths.get(f)))
val chunking = Framing.delimiter(ByteString("\n"), maximumFrameLength = 256, allowTruncation = true)
source.via(chunking).runWith(sizeRotatorSink)
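One detail worth noting for anyone copying this: LogRotatorSink lives in the Alpakka file connector, so (assuming a reasonably recent Alpakka version) the imports would look roughly like:
import akka.Done
import akka.stream.alpakka.file.scaladsl.LogRotatorSink
import akka.stream.scaladsl.{FileIO, Framing, Sink, Source}
import akka.util.ByteString
import java.nio.file.{Files, Paths}
import scala.concurrent.Future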
And that's it. Hope this was helpful to others.

Is it possible in Scala/Akka to read an .xls or .xlsx file as chunks?

I upload a file in chunks to a server, including additional fields.
def readFile(): Seq[ExcelFile] = {
  logger.info(" readSales method initiated: ")
  val source_test = source("E:/dd.xlsx")
  println(" source_test " + source_test)
  val source_test2 = Source.fromFile(source_test)
  println(" source_test2 " + source_test)
  //logger.info(" source: "+source)
  for {
    line <- source_test2.getLines().drop(1).toVector
    values = line.split(",").map(_.trim)
    // logger.info(" values are the: "+values)
  } yield ExcelFile(Option(values(0)), Option(values(1)), Option(values(2)), Option(values(3)))
}

def source(filePath: String): String = {
  implicit val codec = Codec("UTF-8")
  codec.onMalformedInput(CodingErrorAction.REPLACE)
  codec.onUnmappableCharacter(CodingErrorAction.REPLACE)
  Source.fromFile(filePath).mkString
}
The upload route:
path("upload"){
(post & extractRequestContext) { ctx => {
implicit val materializer = ctx.materializer
implicit val ec = ctx.executionContext
fileUpload("fileUploads") {
case (fileInfo, fileStream) =>
val path = "E:\\"
val sink = FileIO.toPath(Paths.get(path).resolve(fileInfo.fileName))
val wResult = fileStream.runWith(sink)
onSuccess(wResult) { rep => rep.status match {
case Success(_) =>
var ePath = path + File.separator + fileInfo.fileName
readFile(ePath)
_success message_
case Failure(e) => _faillure message_
} }
}
} }
}
I am using the above code. Is it possible in Scala or Akka to read the Excel file in chunks?
After looking at your code, it looks like you are having an issue with the post-processing of the file (after the upload).
If uploading a 3 GB file is working even for one user, then I assume that it is already chunked or multipart.
The first problem is here: source_test2.getLines().drop(1).toVector, which creates a Vector (> 3 GB) with all the lines in the file.
The other problem is that you are keeping the whole Seq[ExcelFile] in memory, which should be bigger than 3 GB (because of Java object overhead).
So whenever you call this readFile function, you are using more than 6 GB of memory.
You should try to avoid creating such large objects in your application and use things like Iterator instead of Seq:
def readFile(): Iterator[ExcelFile] = {
  val lineIterator = Source.fromFile("your_file_path").getLines
  lineIterator.drop(1).map(line => {
    val values = line.split(",").map(_.trim)
    ExcelFile(
      Option(values(0)),
      Option(values(1)),
      Option(values(2)),
      Option(values(3))
    )
  })
}
The advantage of Iterator is that it will not load everything into memory at once, and you can keep using Iterators for further steps.
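As an illustration of that last point (the batch size and the saveBatch call below are hypothetical), the iterator can then be consumed in bounded batches:
readFile()
  .grouped(1000)                        // pull 1000 ExcelFile rows at a time
  .foreach(batch => saveBatch(batch))   // saveBatch is a hypothetical persistence step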

Extract range of bytes from file in Scala

I have a binary file that I need to extract some range of bytes from: start: Long - end: Long. I need Long because there are several gigabytes. My app needs to give the result back as a ByteString. I tried
val content: Array[Byte] = Array()
val stream: FileInputStream = new FileInputStream(file: File)
stream.skip(start)
stream.read(content, 0, end-start)
but already I cannot use Long in read, only Int (is this a bug? skip is ok with Long...). Also I would need to convert the result to ByteString. I would love to do this, too:
val stream: FileInputStream = new FileInputStream(file: File)
stream.skip(start)
org.apache.commons.io.IOUtils.toByteArray(stream)
but how do I tell it where to end? stream has no method takeWhile or take. Then I tried
val source = scala.io.Source.fromFile(file: File)
source.drop(start).take(end-start)
Again, only Int in drop...
How can I do that?
Use IOUtils.toByteArray(InputStream input, long size)
val stream = new FileInputStream(file)
stream.skip(start)
val bytesICareAbout = IOUtils.toByteArray(stream, end-start)
// form the ByteString from bytesICareAbout
Note this will throw if end - start is greater than Integer.MAX_VALUE, for a good reason! You wouldn't want a 2 GB array to be allocated in memory.
If for some reason your end - start > Integer.MAX_VALUE, you should definitely avoid allocating a single ByteString to represent the data. Instead, you should do something like:
import org.apache.commons.io.input.BoundedInputStream
val stream = new FileInputStream(file)
stream.skip(start)
val boundedStream = new BoundedInputStream(stream, end - start)
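The answer stops at building the bounded stream. Since the question mentions ByteString, one way to finish the thought (a sketch, assuming Akka Streams is available; not from the original answer) is to expose the range as a Source of chunked ByteStrings instead of one big array:
import java.io.FileInputStream
import akka.stream.scaladsl.{Source, StreamConverters}
import akka.util.ByteString
import org.apache.commons.io.input.BoundedInputStream

// Stream the byte range as chunked ByteStrings instead of allocating it all at once
val rangeSource: Source[ByteString, _] =
  StreamConverters.fromInputStream { () =>
    val stream = new FileInputStream(file)
    stream.skip(start)
    new BoundedInputStream(stream, end - start) // limit reads to the requested range
  }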

Play 2.1: Await result from enumerator

I'm working on testing my WebSocket code in Play Framework 2.1. My approach is to get the iteratee/enumerator pair that is used for the actual WebSocket, and just test pushing data in and pulling data out.
Unfortunately, I just cannot figure out how to get data out of an Enumerator. Right now my code looks roughly like this:
val (in, out) = createClient(FakeRequest("GET", "/myendpoint"))
in.feed(Input.El("My input here"))
in.feed(Input.EOF)
//no idea how to get data from "out"
As far as I can tell, the only way to get data out of an enumerator is through an iteratee. But I can't figure out how to just wait until I get the full list of strings coming out of the enumerator. What I want is a List[String], not a Future[Iteratee[A,String]] or an Expectable[Iteratee[String]] or yet another Iteratee[String]. The documentation is confusing at best.
How do I do that?
You can consume an Enumerator like this:
val out = Enumerator("one", "two")
val consumer = Iteratee.getChunks[String]
val appliedEnumeratorFuture = out.apply(consumer)
val appliedEnumerator = Await.result(appliedEnumeratorFuture, 1.seconds)
val result = Await.result(appliedEnumerator.run, 1.seconds)
println(result) // List("one", "two")
Note that you need to await a Future twice because the Enumerator and the Iteratee control the speed of producing and consuming values, respectively.
A more elaborate example for an Iteratee -> Enumerator chain where feeding the Iteratee results in the Enumerator producing values:
// create an enumerator to which messages can be pushed
// using a channel
val (out, channel) = Concurrent.broadcast[String]
// create the input stream. When it receives a string, it
// will push the info into the channel
val in =
  Iteratee.foreach[String] { s =>
    channel.push(s)
  }.map(_ => channel.eofAndEnd())
// Instead of using the complex `feed` method, we just create
// an enumerator that we can use to feed the input stream
val producer = Enumerator("one", "two").andThen(Enumerator.eof)
// Apply the input stream to the producer (feed the input stream)
val producedElementsFuture = producer.apply(in)
// Create a consumer for the output stream
val consumer = Iteratee.getChunks[String]
// Apply the consumer to the output stream (consume the output stream)
val consumedOutputStreamFuture = out.apply(consumer)
// Await the construction of the input stream chain
Await.result(producedElementsFuture, 1.second)
// Await the construction of the output stream chain
val consumedOutputStream = Await.result(consumedOutputStreamFuture, 1.second)
// Await consuming the output stream
val result = Await.result(consumedOutputStream.run, 1.second)
println(result) // List("one", "two")
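If this pattern shows up in several tests, it can be folded into a small helper (a sketch; the defaultContext import may be unnecessary depending on the exact Play version):
import play.api.libs.concurrent.Execution.Implicits.defaultContext
import play.api.libs.iteratee.{Enumerator, Iteratee}
import scala.concurrent.Await
import scala.concurrent.duration._

// Collect everything an Enumerator produces into a List, for use in tests
def collectAll[A](enumerator: Enumerator[A], timeout: FiniteDuration = 1.second): List[A] = {
  val appliedIteratee = Await.result(enumerator.apply(Iteratee.getChunks[A]), timeout)
  Await.result(appliedIteratee.run, timeout)
}

// collectAll(Enumerator("one", "two")) // List("one", "two")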