Why does "Inflator" algorithm fails for UTF-8 encoding? - scala

I wrote the following code in order to decompress messages that were zipped using the Deflator algorithm:
def decompressMsg[V: StringDecoder](msg: String): Try[V] = {
if (msg.startsWith(CompressionHeader)) {
logger.debug(s"Message before decompression is: ${msg}")
val compressedByteArray =
msg.drop(CompressionHeader.length).getBytes(StandardCharsets.UTF_8)
val inflaterInputStream = new InflaterInputStream(
new ByteArrayInputStream(compressedByteArray)
)
val decompressedByteArray = readDataFromInflaterInputStream(inflaterInputStream)
StringDecoder.decode[V](new String(decompressedByteArray, StandardCharsets.UTF_8).tap {
decompressedMsg => logger.info(s"Message after decompression is: ${decompressedMsg}")
})
} else {
StringDecoder.decode[V](msg)
}
}
private def readDataFromInflaterInputStream(
inflaterInputStream: InflaterInputStream
): Array[Byte] = {
val outputStream = new ByteArrayOutputStream
var runLoop = true
while (runLoop) {
val buffer = new Array[Byte](BufferSize)
val len = inflaterInputStream.read(buffer) // ERROR IS THROWN FROM THIS LINE!!
outputStream.write(buffer, 0, len)
if (len < BufferSize) runLoop = false
}
outputStream.toByteArray
}
The input argument 'msg' was compressed using the Deflator. The above code fails with the error message:
invalid stored block lengths java.util.zip.ZipException: invalid
stored block lengths
After I saw this thread, I changed StandardCharsets.UTF_8 to StandardCharsets.ISO_8859_1 and surprisingly, the code passed and returned the desired behaviour.
I don't want to work with an encoding different than UTF_8. Do you have an idea how to make my code work with UTF_8 encoding?

Related

Load a .csv file from HDFS in Scala

So I basically have the following code to read a .csv file and store it in an Array[Array[String]]:
def load(filepath: String): Array[Array[String]] = {
var data = Array[Array[String]]()
val bufferedSource = io.Source.fromFile(filepath)
for (line <- bufferedSource.getLines) {
data :+ line.split(",").map(_.trim)
}
bufferedSource.close
return data.slice(1,data.length-1) //skip header
}
Which works for files that are not stored on HDFS. However, when I try the same thing on HDFS I get
No such file or directory found
When writing to a file on HDFS I also had to change my original code and added some FileSystem and Path arguments to PrintWriter, but this time I have no idea at all how to do it.
I am this far:
def load(filepath: String, sc: SparkContext): Array[Array[String]] = {
var data = Array[Array[String]]()
val fs = FileSystem.get(sc.hadoopConfiguration)
val stream = fs.open(new Path(filepath))
var line = ""
while ((line = stream.readLine()) != null) {
data :+ line.split(",").map(_.trim)
}
return data.slice(1,data.length-1) //skip header
}
This should work, but I get a NullPointerException when comparing line to null or if its length is over 0.
This code will read a .csv file from HDFS:
def read(filepath: String, sc: SparkContext): ArrayBuffer[Array[String]] = {
var data = ArrayBuffer[Array[String]]()
val fs = FileSystem.get(sc.hadoopConfiguration)
val stream = fs.open(new Path(filepath))
var line = stream.readLine()
while (line != null) {
val row = line.split(",").map(_.trim)
data += row
line = stream.readLine()
}
stream.close()
return data // or return data.slice(1,data.length-1) to skip header
}
Please read this post about reading CSV by Alvin Alexander, writer of the Scala Cookbook:
object CSVDemo extends App {
println("Month, Income, Expenses, Profit")
val bufferedSource = io.Source.fromFile("/tmp/finance.csv")
for (line <- bufferedSource.getLines) {
val cols = line.split(",").map(_.trim)
// do whatever you want with the columns here
println(s"${cols(0)}|${cols(1)}|${cols(2)}|${cols(3)}")
}
bufferedSource.close
}
You just have to get an InputStream from your HDFS and replace in this snippet

How to stream zipped file (on the fly) via Play Framework 2.5 in scala?

I want to stream some files and zip them on the fly, so users can download multiple files into a single zipped file without writing anything to the local disk. However, my current implementation holds everything in the memory, and will no work for large files. Is there any way to fix it?
I was looking at this implementation: https://gist.github.com/kirked/03c7f111de0e9a1f74377bf95d3f0f60, but couldn't figure out how to use it.
import java.io.{BufferedOutputStream, ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{ZipEntry, ZipOutputStream}
import akka.stream.scaladsl.{StreamConverters}
import org.apache.commons.io.FileUtils
import play.api.mvc.{Action, Controller}
class HomeController extends Controller {
def single() = Action {
Ok.sendFile(
content = new java.io.File("C:\\Users\\a.csv"),
fileName = _ => "a.csv"
)
}
def zip() = Action {
Ok.chunked(StreamConverters.fromInputStream(fileByteData)).withHeaders(
CONTENT_TYPE -> "application/zip",
CONTENT_DISPOSITION -> s"attachment; filename = test.zip"
)
}
def fileByteData(): ByteArrayInputStream = {
val fileList = List(
new java.io.File("C:\\Users\\a.csv"),
new java.io.File("C:\\Users\\b.csv")
)
val baos = new ByteArrayOutputStream()
val zos = new ZipOutputStream(new BufferedOutputStream(baos))
try {
fileList.map(file => {
zos.putNextEntry(new ZipEntry(file.toPath.getFileName.toString))
zos.write(FileUtils.readFileToByteArray(file))
zos.closeEntry()
})
} finally {
zos.close()
}
new ByteArrayInputStream(baos.toByteArray)
}
}
Instead of using a ByteArrayOutputStream to buffer the contents in an array then putting them into a ByteArrayInputStream you could use Java's piping mechanism.
Here's a sketch solution:
def zip() = Action {
// Create Source that listens to an OutputStream
// and pass it to `fileByteData` method.
val zipSource: Source[ByteString, Unit] =
StreamConverters
.asOutputStream()
.mapMaterializedValue(fileByteData)
Ok.chunked(zipSource).withHeaders(
CONTENT_TYPE -> "application/zip",
CONTENT_DISPOSITION -> s"attachment; filename = test.zip")
}
// Send the file data, given an OutputStream to write to.
def fileByteData(os: OutputStream): Unit = {
val fileList = List(
new java.io.File("C:\\Users\\a.csv"),
new java.io.File("C:\\Users\\b.csv")
)
val zos = new ZipOutputStream(os)
val buffer: Array[Byte] = new Array[Byte](2048)
try {
for (file <- fileList) {
zos.putNextEntry(new ZipEntry(file.toPath.getFileName.toString))
val fis = new Files.newInputStream(file.toPath)
try {
#tailrec
def zipFile(): Unit = {
val bytesRead = fis.read(buffer)
if (bytesRead == -1) () else {
zos.write(buffer, 0, bytesRead)
zipFile()
}
}
zipFile()
} finally fis.close()
zos.closeEntry()
}
} finally {
zos.close()
}
}
This is just an outline of an approach. You'll also want to make sure:
- the threading is OK - the fileByteData will hopefully run on a different thread to the sending thread
- the error handling is OK - e.g. all streams are closed properly if there's an error on either the server (e.g. file not found) or client side (early disconnect)

Scala bufferedinput stream reader

I have a scala program as below
object TestApp {
def main(args: Array[String]): Unit = {
val in = new BufferedInputStream(new FileInputStream("testBytes.txt"))
val buffer = Array.ofDim[Byte](15)
while (in.read(buffer)>0) {
println(new String(buffer))
}
}
}
The input file contains "AAAAAAAAAABBBBBBBBBB"
When I run this program I'm getting the below results
AAAAAAAAAABBBBB
BBBBBAAAAABBBBB
I'm confused why the buffer keeps old read data still, Or any way to avoid this?
I'm expecting something like this
AAAAAAAAAABBBBB
BBBBB
The read method returns the number of bytes read. Only use these bytes in the buffer. E.g.
var len = 0
while ({len = in.read(buffer); len} > 0) {
println(new String(buffer, 0, len))
}

Scala: Source.fromInputStream failed for bigger input

I am reading data from AWS S3. The following code works fine if the input file is small. It failed when the input file is big. Is there any parameter I can modify to increase the buffer size or anything so it can handle bigger input file as well? Thanks!
val s3Object= s3Client.getObject(new GetObjectRequest("myBucket", "myPath/myFile.csv"));
val myData = Source.fromInputStream(s3Object.getObjectContent()).getLines()
for (line <- myData) {
val data = line.split(",")
myMap.put(data(0), data(1).toDouble)
}
println(" my map : " + myMap.toString())
If you look at the source code you can see that internally it calls Source.createBufferedSource. You can use that to create your own version with a bigger buffer size.
These are the lines of code from scala:
def createBufferedSource(
inputStream: InputStream,
bufferSize: Int = DefaultBufSize,
reset: () => Source = null,
close: () => Unit = null
)(implicit codec: Codec): BufferedSource = {
// workaround for default arguments being unable to refer to other parameters
val resetFn = if (reset == null) () => createBufferedSource(inputStream, bufferSize, reset, close)(codec) else reset
new BufferedSource(inputStream, bufferSize)(codec) withReset resetFn withClose close
}
def fromInputStream(is: InputStream, enc: String): BufferedSource =
fromInputStream(is)(Codec(enc))
def fromInputStream(is: InputStream)(implicit codec: Codec): BufferedSource =
createBufferedSource(is, reset = () => fromInputStream(is)(codec), close = () => is.close())(codec)
Edit: Now that I have thought about your issue a bit more, you can increase the buffer size in this way, but I'm not sure that this will actually fix your issue

Reading lines and raw bytes from the same source in scala

I need to write code that does the following:
Connect to a tcp socket
Read a line ending in "\r\n" that contains a number N
Read N bytes
Use those N bytes
I am currently using the following code:
val socket = new Socket(InetAddress.getByName(host), port)
val in = socket.getInputStream;
val out = new PrintStream(socket.getOutputStream)
val reader = new DataInputStream(in)
val baos = new ByteArrayOutputStream
val buffer = new Array[Byte](1024)
out.print(cmd + "\r\n")
out.flush
val firstLine = reader.readLine.split("\\s")
if(firstLine(0) == "OK") {
def read(written: Int, max: Int, baos: ByteArrayOutputStream): Array[Byte] = {
if(written >= max) baos.toByteArray
else {
val count = reader.read(buffer, 0, buffer.length)
baos.write(buffer, 0, count)
read(written + count, max, baos)
}
}
read(0, firstLine(1).toInt, baos)
} else {
// RAISE something
}
baos.toByteArray()
The problem with this code is that the use of DataInputStream#readLine raises a deprecation warning, but I can't find a class that implements both read(...) and readLine(...). BufferedReader for example, implements read but it reads Chars and not Bytes. I could cast those chars to bytes but I don't think it's safe.
Any other ways to write something like this in scala?
Thank you
be aware that on the JVM a char has 2 bytes, so "\r\n" is 4 bytes. This is generally not true for Strings stored outside of the JVM.
I think the safest way would be to read your file in raw bytes until you reache your Binary representation of "\r\n", now you can create a Reader (makes bytes into JVM compatible chars) on the first bytes, where you can be shure that there is Text only, parse it, and contiue safely with the rest of the binary data.
You can achive the goal to use read(...) and readLine(...) in one class. The idea is use BufferedReader.read():Int. The BufferedReader class has buffered the content so you can read one byte a time without performance decrease.
The change can be: (without scala style optimization)
import java.io.BufferedInputStream
import java.io.BufferedReader
import java.io.ByteArrayOutputStream
import java.io.PrintStream
import java.net.InetAddress
import java.net.Socket
import java.io.InputStreamReader
object ReadLines extends App {
val host = "127.0.0.1"
val port = 9090
val socket = new Socket(InetAddress.getByName(host), port)
val in = socket.getInputStream;
val out = new PrintStream(socket.getOutputStream)
// val reader = new DataInputStream(in)
val bufIns = new BufferedInputStream(in)
val reader = new BufferedReader(new InputStreamReader(bufIns, "utf8"));
val baos = new ByteArrayOutputStream
val buffer = new Array[Byte](1024)
val cmd = "get:"
out.print(cmd + "\r\n")
out.flush
val firstLine = reader.readLine.split("\\s")
if (firstLine(0) == "OK") {
def read(written: Int, max: Int, baos: ByteArrayOutputStream): Array[Byte] = {
if (written >= max) {
println("get: " + new String(baos.toByteArray))
baos.toByteArray()
} else {
// val count = reader.read(buffer, 0, buffer.length)
var count = 0
var b = reader.read()
while(b != -1){
buffer(count) = b.toByte
count += 1
if (count < max){
b = reader.read()
}else{
b = -1
}
}
baos.write(buffer, 0, count)
read(written + count, max, baos)
}
}
read(0, firstLine(1).toInt, baos)
} else {
// RAISE something
}
baos.toByteArray()
}
for test, below is a server code:
object ReadLinesServer extends App {
val serverSocket = new ServerSocket(9090)
while(true){
println("accepted a connection.")
val socket = serverSocket.accept()
val ops = socket.getOutputStream()
val printStream = new PrintStream(ops, true, "utf8")
printStream.print("OK 2\r\n") // 1 byte for alpha-number char
printStream.print("ab")
}
}
Seems this is the best solution I can find:
val reader = new BufferedReader(new InputStreamReader(in))
val buffer = new Array[Char](1024)
out.print(cmd + "\r\n")
out.flush
val firstLine = reader.readLine.split("\\s")
if(firstLine(0) == "OK") {
def read(readCount: Int, acc: List[Byte]): Array[Byte] = {
if(readCount <= 0) acc.toArray
else {
val count = reader.read(buffer, 0, buffer.length)
val asBytes = buffer.slice(0, count).map(_.toByte)
read(readCount - count, acc ++ asBytes)
}
}
read(firstLine(1).toInt, List[Byte]())
} else {
// RAISE
}
That is, use buffer.map(_.toByte).toArray to transform a char Array into a Byte Array without caring about the encoding.