Extract range of bytes from file in Scala

Extract range of bytes from file in Scala - scala

I have a binary file that I need to extract some range of bytes from: start: Long - end: Long. I need Long because there are several gigagbytes. My app needs to give the result back as a ByteString. I tried
val content: Array[Byte] = Array()
val stream: FileInputStream = new FileInputStream(file: File)
stream.skip(start)
stream.read(content, 0, end-start)
but already I cannot use Long in read, only Int (is this a bug? skip is ok with Long...). Also I would need to convert the result to ByteString. I would love to do this, too:
val stream: FileInputStream = new FileInputStream(file: File)
stream.skip(start)
org.apache.commons.io.IOUtils.toByteArray(stream)
but how do I tell it where to end? stream has no method takeWhile or take. Then I tried
val source = scala.io.Source.fromFile(file: File)
source.drop(start).take(end-start)
Again, only Int in drop...
How can I do that ?

Use IOUtils.toByteArray(InputStream input, long size)
val stream = new FileInputStream(file)
stream.skip(start)
val bytesICareAbout = IOUtils.toByteArray(stream, end-start)
// form the ByteString from bytesICareAbout
Note this will throw if end - start is greater than Integer.MAX_VALUE, for a good reason! You wouldn't want a 2GB array to be allocated in-memory.
If for some reason your end - start > Integer.MAX_VALUE, you should definitely avoid allocating a single ByteString to represent the data. Instead, you should do something like:
import org.apache.commons.io.input.BoundedInputStream
val stream = new FileInputStream(file)
stream.skip(start)
val boundedStream = new BoundedInputStream(stream, start - end)

Related

Write data in single file n spark scala

I am trying to write data in single file using spark scala:
while (loop > 0) {
val getReq = new HttpGet(ww.url.com)
val httpResponse = client.execute(getReq)
val data = Source.fromInputStream(httpResponse.getEntity.getContent()).getLines.mkString
val parser = JSON.parseFull(data)
val globalMap = parser.get.asInstanceOf[Map[String, Any]]
val reviewMap = globalMap.get("payload").get.asInstanceOf[Map[String, Any]]
val df = context.sparkContext.parallelize(Seq(reviewMap.get("records").get.toString())).toDF()
if (startIndex == 0) {
df.coalesce(1).write.mode(SaveMode.Overwrite).json("C:\\Users\\kh\\Desktop\\spark\\raw\\data\\final")
} else {
df.coalesce(1).write.mode(SaveMode.Append).json("C:\\Users\\kh\\Desktop\\spark\\raw\\data\\final")
}
startIndex = startIndex + limit
loop = loop - 1
httpResponse.close()
}
The number of file created is the number of loops and I want to create one file only.
and it is also creating CRC file as well I want to remove those:
I tried below config but it only stops creation of Success files:
.config("dfs.client.read.shortcircuit.skip.checksum", "true")
.config("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
.config("fs.file.impl.disable.cache", true)
Any ideas to create one file only without crc and success files?

Re: "The number of file created is the number of loops"
Even though you are using df.coalesce(1) in your code, it is still being executed as many number of times as you run the while loop.
I want to create one file only
From your code it seems that you are trying to invoke HTTP GET requests to some URL and save the content after parsing.
If this understanding is right then I believe you should not be using a while loop to do this task. There is map transformation that you could use in the following manner.
Please find below the Psuedo-Code for your reference.
val urls = List("a.com","b.com","c.com")
val sourcedf = sparkContext.parallelize(urls).toDF
//this could be map or flatMap based on your requirement.
val yourprocessedDF = sourcedf.map(<< do your parsing here and emit data>>)
yourprocessedDF.repartition(1).write(<<whichever format you need>>)

Scala: How to get the content of PortableDataStream instance from an RDD

As I want to extract data from binaryFiles I read the files using
val dataRDD = sc.binaryRecord("Path") I get the result as org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)]
I want to extract the content of my files which is under the form of PortableDataStream
For that I tried: val data = dataRDD.map(x => x._2.open()).collect()
but I get the following error:
java.io.NotSerializableException:org.apache.hadoop.hdfs.client.HdfsDataInputStream
If you have an idea how can I solve my issue, please HELP!
Many Thanks in advance.

Actually, the PortableDataStream is Serializable. That's what it is meant for. Yet, open() returns a simple DataInputStream (HdfsDataInputStream in your case because your file is on HDFS) which is not Serializable, hence the error you get.
In fact, when you open the PortableDataStream, you just need to read the data right away. In scala, you can use scala.io.Source.fromInputStream:
val data : RDD[Array[String]] = sc
.binaryFiles("path/.../")
.map{ case (fileName, pds) => {
scala.io.Source.fromInputStream(pds.open())
.getLines().toArray
}}
This code assumes that the data is textual. If it is not, you can adapt it to read any kind of binary data. Here is an example to create a sequence of bytes, that you could process the way you want.
val rdd : RDD[Seq[Byte]] = sc.binaryFiles("...")
.map{ case (file, pds) => {
val dis = pds.open()
val bytes = Array.ofDim[Byte](1024)
val all = scala.collection.mutable.ArrayBuffer[Byte]()
while( dis.read(bytes) != -1) {
all ++= bytes
}
all.toSeq
}}
See the javadoc of DataInputStream for more possibilities. For instance, it possesses readLong, readDouble (and so on) methods.

val bf = sc.binaryFiles("...")
val bytes = bf.map{ case(file, pds) => {
val dis = pds.open()
val len = dis.available();
val buf = Array.ofDim[Byte](len)
pds.open().readFully(buf)
buf
}}
bytes: org.apache.spark.rdd.RDD[Array[Byte]] = MapPartitionsRDD[21] at map at <console>:26
scala> bytes.take(1)(0).size
res15: Int = 5879609 // this happened to be the size of my first binary file

This Akka stream sometimes doesn't finish

I have a graph that reads lines from multiple gzipped files and writes those lines to another set of gzipped files, mapped according to some value in each line.
It works correctly against small data sets, but fails to terminate on larger data. (It may not be the size of the data that's to blame, as I have not run it enough times to be sure - it takes a while).
def files: Source[File, NotUsed] =
Source.fromIterator(
() =>
Files
.fileTraverser()
.breadthFirst(inDir)
.asScala
.filter(_.getName.endsWith(".gz"))
.toIterator)
def extract =
Flow[File]
.mapConcat[String](unzip)
.mapConcat(s =>
(JsonMethods.parse(s) \ "tk").extract[Array[String]].map(_ -> s).to[collection.immutable.Iterable])
.groupBy(1 << 16, _._1)
.groupedWithin(1000, 1.second)
.map { lines =>
val w = writer(lines.head._1)
w.println(lines.map(_._2).mkString("\n"))
w.close()
Done
}
.mergeSubstreams
def unzip(f: File) = {
scala.io.Source
.fromInputStream(new GZIPInputStream(new FileInputStream(f)))
.getLines
.toIterable
.to[collection.immutable.Iterable]
}
def writer(tk: String): PrintWriter =
new PrintWriter(
new OutputStreamWriter(
new GZIPOutputStream(
new FileOutputStream(new File(outDir, s"$tk.json.gz"), true)
))
)
val process = files.via(extract).toMat(Sink.ignore)(Keep.right).run()
Await.result(process, Duration.Inf)
The thread dump shows that the process is WAITING at Await.result(process, Duration.Inf) and nothing else is happening.
OpenJDK v11 with Akka v2.5.15

Most likely it's stuck in groupBy because it ran out of available threads in dispatcher to gather items into 2^16 groups for all sources.
So if I were you I'd probably implement grouping in extract semi-manually using statefulMapConcat with mutable Map[KeyType, List[String]]. Or buffer lines with groupedWithin first and split them into groups that you would write to different files in Sink.foreach.

Directive to complete with RandomAccessFile read

I have a large data file and respond to GET requests with very small portions of that file as Array[Byte]
The directive is:
get {
dataRepo.load(param).map(data =>
complete(
HttpResponse(
entity = HttpEntity(myContentType, data),
headers = List(gzipContentEncoding)
)
)
).getOrElse(complete(HttpResponse(status = StatusCodes.NoContent)))
}
Where dataRepo.load is a function along the lines of:
val pointers: Option[Long, Int] = calculateFilePointers(param)
pointers.map { case (index, length) =>
val dataReader = new RandomAccessFile(dataFile, "r")
dataReader.seek(index)
val data = Array.ofDim[Byte](length)
dataReader.readFully(data)
data
}
Is there a more efficient way to pipe the RandomAccessFile read directly back in the response, rather than having to read it fully first?

Instead of reading the data into a Array[Byte] you could create an Iterator[Array[Byte]] which reads chunks of the file at a time:
val dataReader = new RandomAccessFile(dataFile, 'r')
val chunkSize = 1024
Iterator
.range(index, index + length, chunkSize)
.map { currentIndex =>
val currentBytes =
Array.ofDim[Byte](Math.min(chunkSize, length - currentIndex))
dataReader seek currentIndex
dataReader readFully currentBytes
currentBytes
}
This iterator can now feed an akka Source:
val source : Source[Array[Byte], _] =
Source fromIterator (() => dataRepo.load(param))
Which can then feed an HttpEntity:
val byteStrSource : Source[ByteString, _] = source.map(ByteString.apply)
val httpEntity = HttpEntity(myContentType, byteStrSource)
Now each client will only use 1024 Bytes of memory at-a-time instead of the full length of your file read. This will make your server much more efficient at handling multiple concurrent requests as well as allowing your dataRepo.load to return immediately with a lazy Source value instead of utilizing a Future.

Akka Streams: How to group a list of files in a source by size?

So I currently have an akka stream to read a list of files, and a sink to concatenate them, and that works just fine:
val files = List("a.txt", "b.txt", "c.txt") // and so on;
val source = Source(files).flatMapConcat(f => FileIO.fromPath(Paths.get(f)))
val sink = Sink.fold[ByteString, ByteString](ByteString(""))(_ ++ ByteString("\n" ++ _) // Concatenate
source.toMat(sink)(Keep.right).run().flatMap(concatByteStr => writeByteStrToFile(concatByteStr, "an-output-file.txt"))
While this is fine for a simple case, the files are rather large (on the order of GBs, and can't fit in the memory of the machine I'm running this application on. So I'd like to chunk it after the byte string has reached a certain size. An option is doing it with Source.grouped(N), but files vary greatly in size (from 1 KB to 2 GB), so there's no guarantee on normalizing the size of the file.
My question is if there's a way to chunk writing files by the size of the bytestring. The documentation of akka streams are quite overwhelming and I'm having trouble figuring out the library. Any help would be greatly appreciated. Thanks!

The FileIO module from Akka Streams provides you with a streaming IO Sink to write to files, and utility methods to chunk a stream of ByteString. Your example would become something along the lines of
val files = List("a.txt", "b.txt", "c.txt") // and so on;
val source = Source(files).flatMapConcat(f => FileIO.fromPath(Paths.get(f)))
val chunking = Framing.delimiter(ByteString("\n"), maximumFrameLength = 256, allowTruncation = true)
val sink: Sink[ByteString, Future[IOResult]] = FileIO.toPath(Paths.get("an-output-file.txt"))
source.via(chunking).runWith(sink)
Using FileIO.toPath sink avoids storing the whole folded ByteString into memory (hence allowing proper streaming).
More details on this Akka module can be found in the docs.

I think #Stefano Bonetti already offered a great solution. Just wanted to add that, one could also consider building custom GraphStage to address specific chunking need. In essence, create a chunk emitting method like below for the In/Out handlers as described in this Akka Stream link:
private def emitChunk(): Unit = {
if (buffer.isEmpty) {
if (isClosed(in)) completeStage()
else pull(in)
} else {
val (chunk, nextBuffer) = buffer.splitAt(chunkSize)
buffer = nextBuffer
push(out, chunk)
}
}

After a week of tinkering in the Akka Streams libraries, the solution I ended with was a combination of Stefano's answer along with a solution provided here. I read the source of files line by line via the Framing.delimiter function, and then just simply use the LogRotatorSink provided by Alpakka. The meat of the determining log rotation is here:
val fileSizeRotationFunction = () => {
val max = 10 * 1024 * 1024 // 10 MB, but whatever you really want; I had it at our HDFS block size
var size: Long = max
(element: ByteString) =>
{
if (size + element.size > max) {
val path = Files.createTempFile("out-", ".log")
size = element.size
Some(path)
} else {
size += element.size
None
}
}
}
val sizeRotatorSink: Sink[ByteString, Future[Done]] =
LogRotatorSink(fileSizeRotationFunction)
val source = Source(files).flatMapConcat(f => FileIO.fromPath(Paths.get(f)))
val chunking = Framing.delimiter(ByteString("\n"), maximumFrameLength = 256, allowTruncation = true)
source.via(chunking).runWith(sizeRotatorSink)
And that's it. Hope this was helpful to others.