Scala: Source.fromInputStream fails for bigger input

I am reading data from AWS S3. The following code works fine if the input file is small, but it fails when the input file is big. Is there any parameter I can modify to increase the buffer size, or anything else, so it can handle bigger input files as well? Thanks!
val s3Object = s3Client.getObject(new GetObjectRequest("myBucket", "myPath/myFile.csv"))
val myData = Source.fromInputStream(s3Object.getObjectContent()).getLines()
for (line <- myData) {
  val data = line.split(",")
  myMap.put(data(0), data(1).toDouble)
}
println(" my map : " + myMap.toString())

If you look at the source code, you can see that internally it calls Source.createBufferedSource. You can use that to create your own version with a bigger buffer size.
These are the relevant lines of code from the Scala standard library:
def createBufferedSource(
  inputStream: InputStream,
  bufferSize: Int = DefaultBufSize,
  reset: () => Source = null,
  close: () => Unit = null
)(implicit codec: Codec): BufferedSource = {
  // workaround for default arguments being unable to refer to other parameters
  val resetFn = if (reset == null) () => createBufferedSource(inputStream, bufferSize, reset, close)(codec) else reset

  new BufferedSource(inputStream, bufferSize)(codec) withReset resetFn withClose close
}

def fromInputStream(is: InputStream, enc: String): BufferedSource =
  fromInputStream(is)(Codec(enc))

def fromInputStream(is: InputStream)(implicit codec: Codec): BufferedSource =
  createBufferedSource(is, reset = () => fromInputStream(is)(codec), close = () => is.close())(codec)
Edit: Now that I have thought about your issue a bit more, you can increase the buffer size this way, but I'm not sure it will actually fix your issue.
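For reference, here is a minimal sketch of that approach, reusing the s3Client and myMap from the question; it calls createBufferedSource directly so you can choose the buffer size (the 8 MB buffer and UTF-8 codec are just example choices):
import scala.io.{Codec, Source}

val s3Object = s3Client.getObject(new GetObjectRequest("myBucket", "myPath/myFile.csv"))
// Build the BufferedSource ourselves so we can pick the buffer size instead of DefaultBufSize.
val bigBuffered = Source.createBufferedSource(
  s3Object.getObjectContent(),
  bufferSize = 8 * 1024 * 1024,
  close = () => s3Object.close()
)(Codec.UTF8)

for (line <- bigBuffered.getLines()) {
  val data = line.split(",")
  myMap.put(data(0), data(1).toDouble)
}
bigBuffered.close()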

Related

Why does the "Inflater" algorithm fail for UTF-8 encoding?

I wrote the following code in order to decompress messages that were compressed using the Deflater algorithm:
def decompressMsg[V: StringDecoder](msg: String): Try[V] = {
  if (msg.startsWith(CompressionHeader)) {
    logger.debug(s"Message before decompression is: ${msg}")
    val compressedByteArray =
      msg.drop(CompressionHeader.length).getBytes(StandardCharsets.UTF_8)
    val inflaterInputStream = new InflaterInputStream(
      new ByteArrayInputStream(compressedByteArray)
    )
    val decompressedByteArray = readDataFromInflaterInputStream(inflaterInputStream)
    StringDecoder.decode[V](new String(decompressedByteArray, StandardCharsets.UTF_8).tap {
      decompressedMsg => logger.info(s"Message after decompression is: ${decompressedMsg}")
    })
  } else {
    StringDecoder.decode[V](msg)
  }
}
private def readDataFromInflaterInputStream(
  inflaterInputStream: InflaterInputStream
): Array[Byte] = {
  val outputStream = new ByteArrayOutputStream
  var runLoop = true
  while (runLoop) {
    val buffer = new Array[Byte](BufferSize)
    val len = inflaterInputStream.read(buffer) // ERROR IS THROWN FROM THIS LINE!!
    outputStream.write(buffer, 0, len)
    if (len < BufferSize) runLoop = false
  }
  outputStream.toByteArray
}
The input argument 'msg' was compressed using the Deflater. The above code fails with the error message:
java.util.zip.ZipException: invalid stored block lengths
After I saw this thread, I changed StandardCharsets.UTF_8 to StandardCharsets.ISO_8859_1 and surprisingly, the code passed and returned the desired behaviour.
I don't want to work with an encoding other than UTF_8. Do you have any idea how to make my code work with UTF_8 encoding?
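For what it's worth, a small round-trip sketch (the byte values below are just an arbitrary zlib-style prefix) illustrates the observation above: ISO_8859_1 maps every byte value 0-255 to a distinct character, so bytes survive a String round trip unchanged, whereas invalid UTF-8 sequences get replaced and the compressed data is corrupted before it ever reaches the Inflater:
import java.nio.charset.StandardCharsets

val raw: Array[Byte] = Array(0x78, 0x9c, 0xcb, 0x48, 0xcd).map(_.toByte) // arbitrary compressed-looking bytes

val viaUtf8 = new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8)
val viaLatin1 = new String(raw, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.ISO_8859_1)

println(java.util.Arrays.equals(raw, viaUtf8))   // false: invalid UTF-8 bytes were replaced with U+FFFD
println(java.util.Arrays.equals(raw, viaLatin1)) // true: the round trip is byte-for-byte lossless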

Getting a 0KB file in S3 when I write the zipInputStream

I have the following piece of code, which is producing 0KB files in S3. What's wrong with it?
def extractFilesFromZipStream(zipInputStream: ZipArchiveInputStream,
                              tgtPath: String, storageType: String): scala.collection.mutable.Map[String, (String, Long)] = {
  var filePathMap = scala.collection.mutable.Map[String, (String, Long)]()
  Try {
    storageWrapper.mkDirs(tgtPath)
    Stream.continually(zipInputStream.getNextZipEntry).takeWhile(_ != null).foreach { file =>
      val storagePathFilePath = s"$tgtPath/${file.getName}"
      storageWrapper.write(zipInputStream, storagePathFilePath)
      LOGGER.info(storagePathFilePath)
      val lineCount = Source.fromInputStream(storageWrapper.read(storagePathFilePath)).getLines().count(s => s != null)
There is nothing wrong with the storage wrapper; it takes an input stream and a path and has been working well so far. Can anyone suggest what's wrong with my use of ZipArchiveInputStream?

Scala: How to get the content of PortableDataStream instance from an RDD

As I want to extract data from binary files, I read them using:
val dataRDD = sc.binaryFiles("Path")
I get the result as org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)]
I want to extract the content of my files, which is in the form of a PortableDataStream.
For that I tried:
val data = dataRDD.map(x => x._2.open()).collect()
but I get the following error:
java.io.NotSerializableException: org.apache.hadoop.hdfs.client.HdfsDataInputStream
If you have an idea how can I solve my issue, please HELP!
Many Thanks in advance.
Actually, the PortableDataStream is Serializable. That's what it is meant for. Yet, open() returns a simple DataInputStream (HdfsDataInputStream in your case because your file is on HDFS) which is not Serializable, hence the error you get.
In fact, when you open the PortableDataStream, you just need to read the data right away. In Scala, you can use scala.io.Source.fromInputStream:
val data: RDD[Array[String]] = sc
  .binaryFiles("path/.../")
  .map { case (fileName, pds) =>
    scala.io.Source.fromInputStream(pds.open())
      .getLines().toArray
  }
This code assumes that the data is textual. If it is not, you can adapt it to read any kind of binary data. Here is an example that builds a sequence of bytes, which you can then process however you want.
val rdd: RDD[Seq[Byte]] = sc.binaryFiles("...")
  .map { case (file, pds) =>
    val dis = pds.open()
    val bytes = Array.ofDim[Byte](1024)
    val all = scala.collection.mutable.ArrayBuffer[Byte]()
    // read() may fill only part of the buffer, so keep the count and copy just that many bytes
    var n = dis.read(bytes)
    while (n != -1) {
      all ++= bytes.take(n)
      n = dis.read(bytes)
    }
    dis.close()
    all.toSeq
  }
See the javadoc of DataInputStream for more possibilities. For instance, it possesses readLong, readDouble (and so on) methods.
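For instance, a hedged sketch assuming a hypothetical record layout (a Long id followed by a Double value at the start of each file):
val headerRdd: RDD[(Long, Double)] = sc.binaryFiles("...")
  .map { case (file, pds) =>
    val dis = pds.open()
    val id = dis.readLong()      // first 8 bytes, interpreted as a big-endian Long
    val value = dis.readDouble() // next 8 bytes, interpreted as a big-endian Double
    dis.close()
    (id, value)
  }
The snippet below instead sizes a buffer with available() and reads each whole file with readFully: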
val bf = sc.binaryFiles("...")
val bytes = bf.map { case (file, pds) =>
  val dis = pds.open()
  val len = dis.available() // note: available() is only a hint; it happened to be the full length for these files
  val buf = Array.ofDim[Byte](len)
  dis.readFully(buf)
  dis.close()
  buf
}
bytes: org.apache.spark.rdd.RDD[Array[Byte]] = MapPartitionsRDD[21] at map at <console>:26
scala> bytes.take(1)(0).size
res15: Int = 5879609 // this happened to be the size of my first binary file

Directive to complete with RandomAccessFile read

I have a large data file and respond to GET requests with very small portions of that file as Array[Byte].
The directive is:
get {
  dataRepo.load(param).map(data =>
    complete(
      HttpResponse(
        entity = HttpEntity(myContentType, data),
        headers = List(gzipContentEncoding)
      )
    )
  ).getOrElse(complete(HttpResponse(status = StatusCodes.NoContent)))
}
Where dataRepo.load is a function along the lines of:
val pointers: Option[(Long, Int)] = calculateFilePointers(param)
pointers.map { case (index, length) =>
  val dataReader = new RandomAccessFile(dataFile, "r")
  dataReader.seek(index)
  val data = Array.ofDim[Byte](length)
  dataReader.readFully(data)
  data
}
Is there a more efficient way to pipe the RandomAccessFile read directly back in the response, rather than having to read it fully first?
Instead of reading the data into an Array[Byte], you could create an Iterator[Array[Byte]] which reads chunks of the file at a time:
val dataReader = new RandomAccessFile(dataFile, "r")
val chunkSize = 1024
Iterator
  .range(0, length, chunkSize)
  .map { offset =>
    // size the last chunk correctly and seek relative to the starting index
    val currentBytes = Array.ofDim[Byte](Math.min(chunkSize, length - offset))
    dataReader.seek(index + offset)
    dataReader.readFully(currentBytes)
    currentBytes
  }
This iterator can now feed an Akka Streams Source:
val source : Source[Array[Byte], _] =
Source fromIterator (() => dataRepo.load(param))
Which can then feed an HttpEntity:
val byteStrSource : Source[ByteString, _] = source.map(ByteString.apply)
val httpEntity = HttpEntity(myContentType, byteStrSource)
Now each client will only use 1024 bytes of memory at a time, instead of the full length of your file. This will make your server much more efficient at handling multiple concurrent requests, as well as allowing your dataRepo.load to return immediately with a lazy Source value instead of utilizing a Future.
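Putting it together, here is a sketch of the revised directive, assuming dataRepo.load now returns an Option[Source[ByteString, _]] built from the chunked iterator above:
get {
  dataRepo.load(param) match {
    case Some(byteStrSource) =>
      complete(
        HttpResponse(
          entity = HttpEntity(myContentType, byteStrSource), // streams 1024-byte chunks lazily
          headers = List(gzipContentEncoding)
        )
      )
    case None =>
      complete(HttpResponse(status = StatusCodes.NoContent))
  }
}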

Is it possible in Scala/Akka to read an .xls or .xlsx file in chunks?

Upload a file in chunks to a server, including additional fields
def readFile(): Seq[ExcelFile] = {
  logger.info(" readSales method initiated: ")
  val source_test = source("E:/dd.xlsx")
  println(" source_test " + source_test)
  val source_test2 = Source.fromFile(source_test)
  println(" source_test2 " + source_test)
  //logger.info(" source: "+source)
  for {
    line <- source_test2.getLines().drop(1).toVector
    values = line.split(",").map(_.trim)
    // logger.info(" values are the: "+values)
  } yield ExcelFile(Option(values(0)), Option(values(1)), Option(values(2)), Option(values(3)))
}
def source(filePath: String): String = {
  implicit val codec = Codec("UTF-8")
  codec.onMalformedInput(CodingErrorAction.REPLACE)
  codec.onUnmappableCharacter(CodingErrorAction.REPLACE)
  Source.fromFile(filePath).mkString
}
The upload route:
path("upload"){
(post & extractRequestContext) { ctx => {
implicit val materializer = ctx.materializer
implicit val ec = ctx.executionContext
fileUpload("fileUploads") {
case (fileInfo, fileStream) =>
val path = "E:\\"
val sink = FileIO.toPath(Paths.get(path).resolve(fileInfo.fileName))
val wResult = fileStream.runWith(sink)
onSuccess(wResult) { rep => rep.status match {
case Success(_) =>
var ePath = path + File.separator + fileInfo.fileName
readFile(ePath)
_success message_
case Failure(e) => _faillure message_
} }
}
} }
}
I am using the above code. Is it possible in Scala or Akka to read the Excel file in chunks?
After looking at your code, it looks like you are having an issue with the post-processing (after upload) of the file.
If uploading a 3 GB file works even for one user, then I assume that it is already chunked or multipart.
The first problem is here: source_test2.getLines().drop(1).toVector, which creates a Vector (> 3 GB) holding every line in the file.
The other problem is that you are keeping the whole Seq[ExcelFile] in memory, which will be even bigger than 3 GB (because of Java object overhead).
So whenever you call this readFile function, you are using more than 6 GB of memory.
You should try to avoid creating such large objects in your application and use things like Iterator instead of Seq:
def readFile(): Iterator[ExcelFile] = {
  val lineIterator = Source.fromFile("your_file_path").getLines
  lineIterator.drop(1).map { line =>
    val values = line.split(",").map(_.trim)
    ExcelFile(
      Option(values(0)),
      Option(values(1)),
      Option(values(2)),
      Option(values(3))
    )
  }
}
The advantage of Iterator is that it will not load everything into memory at once, and you can keep using Iterators for the further steps.
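For example, a hypothetical way to consume it without materializing the whole sequence (persist here is a placeholder for whatever sink you use, e.g. a batched database insert):
readFile()
  .grouped(1000)                     // Iterator[Seq[ExcelFile]]: at most 1000 parsed rows in memory at once
  .foreach(batch => persist(batch))  // persist is a hypothetical sink, not part of the original code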