ZipInputStream.read in ZipEntry - scala

I am reading a zip file using ZipInputStream. The zip file has 4 CSV files. Some files are written completely, some only partially. Please help me find the issue with the code below. Is there any limit on the buffer size that ZipInputStream.read can fill?
val zis = new ZipInputStream(inputStream)
Stream.continually(zis.getNextEntry).takeWhile(_ != null).foreach { file =>
  if (!file.isDirectory && file.getName.endsWith(".csv")) {
    val buffer = new Array[Byte](file.getSize.toInt)
    zis.read(buffer)
    val fo = new FileOutputStream("c:\\temp\\input\\" + file.getName)
    fo.write(buffer)
  }

You have not closed/flushed the files you attempted to write. It should be something like this (assuming Scala syntax, or is this Kotlin/Ceylon?):
val fo = new FileOutputStream("c:\\temp\\input\\" + file.getName)
try {
  fo.write(buffer)
} finally {
  fo.close()
}
Also, you should check the read count and keep reading until the whole buffer is filled, something like this:
var readBytes = 0
while (readBytes < buffer.length) {
  val r = zis.read(buffer, readBytes, buffer.length - readBytes)
  r match {
    case -1 => throw new IllegalStateException("Read terminated before reading everything")
    case _  => readBytes += r
  }
}
PS: Your example also seems to be missing some of the required closing }s.
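Putting both fixes together, a minimal sketch of the whole extraction loop could look like the following (assuming the same inputStream and output directory as in the question; a fixed-size copy buffer avoids relying on file.getSize or on a single read call):
import java.io.{BufferedOutputStream, FileOutputStream}
import java.util.zip.ZipInputStream

val zis = new ZipInputStream(inputStream)
Stream.continually(zis.getNextEntry).takeWhile(_ != null).foreach { entry =>
  if (!entry.isDirectory && entry.getName.endsWith(".csv")) {
    val fo = new BufferedOutputStream(
      new FileOutputStream("c:\\temp\\input\\" + entry.getName))
    try {
      // copy in chunks instead of trusting a single read() call
      val buffer = new Array[Byte](8192)
      var n = zis.read(buffer)
      while (n != -1) {
        fo.write(buffer, 0, n)
        n = zis.read(buffer)
      }
    } finally {
      fo.close() // flushes and closes, so partially written files no longer occur
    }
  }
}
zis.close()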

Related

decompress (unzip/extract) util using spark scala

I have customer_input_data.tar.gz in HDFS, which contains data for 10 different tables in CSV file format. I need to unzip this file to /my/output/path using Spark/Scala.
Please suggest how to unzip the customer_input_data.tar.gz file using Spark/Scala.
gzip is not a splittable format in Hadoop. Consequently, the file is not really going to be distributed across the cluster and you don't get any benefit of distributed compute/processing in Hadoop or Spark.
A better approach may be to uncompress the file on the OS and then send the individual files back to Hadoop.
If you still want to uncompress it in Scala, you can simply resort to the Java class GZIPInputStream via
new GZIPInputStream(new FileInputStream("your file path"))
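As a rough sketch of that approach (the paths here are placeholders, and this only strips the gzip layer to a local .tar file; untarring is a separate step):
import java.io.{BufferedOutputStream, FileInputStream, FileOutputStream}
import java.util.zip.GZIPInputStream

val in  = new GZIPInputStream(new FileInputStream("/local/customer_input_data.tar.gz"))
val out = new BufferedOutputStream(new FileOutputStream("/local/customer_input_data.tar"))
try {
  val buffer = new Array[Byte](8192)
  var n = in.read(buffer)
  while (n != -1) {          // copy until the gzip stream is exhausted
    out.write(buffer, 0, n)
    n = in.read(buffer)
  }
} finally {
  out.close()
  in.close()
}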
I developed the code below to decompress the files using Scala. You need to pass the input path, the output path, and the Hadoop FileSystem.
/* Below method is used for processing tar.gz files.
   `out` and `BUFFER_SIZE`, as well as the commons-compress, commons-io and
   hadoop-fs imports, are assumed to come from the enclosing class. */
@throws[IOException]
private def processTargz(fullpath: String, houtPath: String, fs: FileSystem): Unit = {
  val path = new Path(fullpath)
  val gzipIn = new GzipCompressorInputStream(fs.open(path))
  val tarIn = new TarArchiveInputStream(gzipIn)
  try {
    var entry: TarArchiveEntry = null
    out.println("Tar entry")
    out.println("Tar Name entry :" + FilenameUtils.getName(fullpath))
    val fileName1 = FilenameUtils.getName(fullpath)
    val tarNamesFolder = fileName1.substring(0, fileName1.indexOf('.'))
    out.println("Folder Name : " + tarNamesFolder)
    // In Scala an assignment has type Unit, so the Java idiom
    // `while ((entry = in.getNextEntry) != null)` must be written assign-then-test.
    while ( {
      entry = tarIn.getNextEntry.asInstanceOf[TarArchiveEntry]
      entry != null
    }) {
      // entry name is the tsv file name inside the compressed tar file
      out.println("ENTITY NAME : " + entry.getName)
      /** If the entry is a directory, create the directory. **/
      out.println("While")
      if (entry.isDirectory) {
        val f = new File(entry.getName)
        val created = f.mkdir
        out.println("mkdir")
        if (!created) {
          out.printf("Unable to create directory '%s', during extraction of archive contents.%n", f.getAbsolutePath)
          out.println("Absolute path")
        }
      } else {
        var count = 0
        val slash = "/"
        val targetPath = houtPath + slash + tarNamesFolder + slash + entry.getName
        val hdfswritepath = new Path(targetPath)
        val fos = fs.create(hdfswritepath, true)
        val dest = new BufferedOutputStream(fos, BUFFER_SIZE)
        try {
          val data = new Array[Byte](BUFFER_SIZE)
          // same assign-then-test pattern for the read loop
          while ( {
            count = tarIn.read(data, 0, BUFFER_SIZE)
            count != -1
          }) dest.write(data, 0, count)
        } finally {
          dest.close() // also closes the underlying HDFS output stream
        }
      }
    }
    out.println("Untar completed successfully!")
  } catch {
    case e: IOException =>
      out.println("catch Block")
  } finally {
    out.println("FINAL Block")
    if (tarIn != null) tarIn.close() // closing tarIn also closes gzipIn
  }
}
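For completeness, this is roughly how the method might be invoked from a Spark job (the SparkSession, app name and paths are placeholders; processTargz only needs a Hadoop FileSystem handle and runs on the driver, so the extraction itself is not distributed):
import org.apache.hadoop.fs.FileSystem
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("untar-gz").getOrCreate()
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
processTargz("/data/customer_input_data.tar.gz", "/my/output/path", fs)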

How to generate a random Scala Int in Chisel code?

I am trying to implement the way-prediction technique in the RocketChip core (in-order). For this, I need to access each way separately. So this is how the SRAM for tags looks after modification (a separate SRAM for each way):
val tag_arrays = Seq.fill(nWays) { SeqMem(nSets, UInt(width = tECC.width(1 + tagBits))) }
val tag_rdata = Reg(Vec(nWays, UInt(width = tECC.width(1 + tagBits))))
for ((tag_array, i) <- tag_arrays zipWithIndex) {
  tag_rdata(i) := tag_array.read(s0_vaddr(untagBits-1,blockOffBits), !refill_done && s0_valid)
}
And I want to access it like
when (refill_done) {
  val enc_tag = tECC.encode(Cat(tl_out.d.bits.error, refill_tag))
  tag_arrays(repl_way).write(refill_idx, enc_tag)
  ccover(tl_out.d.bits.error, "D_ERROR", "I$ D-channel error")
}
where repl_way is a random Chisel UInt generated by an LFSR. But a Seq element can only be accessed with a Scala Int index, which causes a compilation error. So then I tried to access it like this:
when (refill_done) {
  val enc_tag = tECC.encode(Cat(tl_out.d.bits.error, refill_tag))
  for (i <- 0 until nWays) {
    when (repl_way === i.U) { tag_arrays(i).write(refill_idx, enc_tag) }
  }
  ccover(tl_out.d.bits.error, "D_ERROR", "I$ D-channel error")
}
But then this assertion fires:
assert(PopCount(s1_tag_hit zip s1_tag_disparity map { case (h, d) => h && !d }) <= 1)
I am trying to modify the ICache.scala file. Any ideas on how to do this properly? Thanks!
I think you can just use a Vec here instead of a Seq:
val tag_arrays = Vec(nWays, SeqMem(nSets, UInt(width = tECC.width(1 + tagBits))))
The Vec allows indexing with a UInt.
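As a generic illustration of that idiom (not the rocket-chip code itself, and written in current Chisel3 syntax rather than the Chisel2-compatibility style above), indexing a hardware Vec with a UInt looks roughly like this:
import chisel3._
import chisel3.util.log2Ceil

class WayStore(nWays: Int, width: Int) extends Module {
  val io = IO(new Bundle {
    val way   = Input(UInt(log2Ceil(nWays).W)) // hardware index, e.g. from an LFSR
    val wen   = Input(Bool())
    val wdata = Input(UInt(width.W))
    val rdata = Output(UInt(width.W))
  })

  // A Reg of Vec (unlike a Scala Seq) can be indexed by a UInt;
  // the elaborated hardware is a read mux and a decoded write enable per way.
  val ways = Reg(Vec(nWays, UInt(width.W)))

  when(io.wen) {
    ways(io.way) := io.wdata
  }
  io.rdata := ways(io.way)
}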

write immutable code for storing data in listBuffer in scala

I have the code below where I am using a mutable ListBuffer to store files received from a Kafka consumer, and then when the list size reaches 15 I insert them into Cassandra.
But is there any way to do the same thing using an immutable list?
val filesList = ListBuffer[SystemTextFile]()
storeservSparkService.configFilesTopicInBatch.subscribe.atLeastOnce(Flow[SystemTextFile].mapAsync(4) { file: SystemTextFile =>
  filesList += file
  if (filesList.size == 15) {
    storeServSystemRepository.config.insertFileInBatch(filesList.toList)
    filesList.clear()
  }
  Future(Done)
})
Something along these lines?
Flow[SystemTextFile].grouped(15).mapAsync(4) { files =>
  storeServSystemRepository.config.insertFileInBatch(files)
}
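Wired into the original subscription, that idea might look roughly like this (assuming insertFileInBatch accepts a List and returns a Future, and that an implicit ExecutionContext is in scope; grouped(15) emits immutable batches, so no shared mutable state is needed):
import akka.Done
import akka.stream.scaladsl.Flow
import scala.concurrent.ExecutionContext.Implicits.global

storeservSparkService.configFilesTopicInBatch.subscribe.atLeastOnce(
  Flow[SystemTextFile]
    .grouped(15) // batches of up to 15 immutable elements
    .mapAsync(4) { files =>
      storeServSystemRepository.config.insertFileInBatch(files.toList).map(_ => Done)
    }
)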
Have you tried using Vector?
var filesList = Vector[SystemTextFile]()
storeservSparkService.configFilesTopicInBatch.subscribe.
  atLeastOnce(Flow[SystemTextFile].mapAsync(4) { file: SystemTextFile =>
    filesList = filesList :+ file
    if (filesList.length == 15) {
      storeServSystemRepository.config.insertFileInBatch(filesList.toList)
      filesList = Vector[SystemTextFile]() // reset the batch, like clear() in the original
    }
    Future(Done)
  })

Uploading multiple large files and other form data in playframework 2/scala

I am trying to upload multiple large files in the Play framework using Scala. I'm still a Scala and Play noob.
I got some great code from here which got me 90% of the way, but now I'm stuck again.
The main issue I have now is that I can only read the file data, not any other data that's been uploaded, and after poking around the Play docs I'm unclear as to how to get at that from here. Any suggestions appreciated!
def directUpload(projectId: String) = Secured(parse.multipartFormData(myFilePartHandler)) { implicit request =>
  Ok("Done");
}

def myFilePartHandler: BodyParsers.parse.Multipart.PartHandler[MultipartFormData.FilePart[Result]] = {
  parse.Multipart.handleFilePart {
    case parse.Multipart.FileInfo(partName, filename, contentType) =>
      println("Handling Streaming Upload:" + filename + "/" + partName + ", " + contentType);
      //Set up the PipedOutputStream here, give the input stream to a worker thread
      val pos: PipedOutputStream = new PipedOutputStream();
      val pis: PipedInputStream = new PipedInputStream(pos);
      val worker: UploadFileWorker = new UploadFileWorker(pis, contentType.get);
      worker.start();
      //Read content to the POS
      play.api.libs.iteratee.Iteratee.fold[Array[Byte], PipedOutputStream](pos) { (os, data) =>
        os.write(data)
        os
      }.mapDone { os =>
        os.close()
        worker.join()
        if (worker.success)
          Ok("upload done. Size: " + worker.size)
        else
          Status(503)("Upload Failed");
      }
  }
}
You have to handle the data part as well. As you can guess (or look up in the documentation), the function to handle the data part is called handleDataPart:
def myFilePartHandler: BodyParsers.parse.Multipart.PartHandler[MultipartFormData.FilePart[Result]] = {
  parse.Multipart.handleFilePart {
    // ... handle the uploaded file parts, as above ...
  }
  parse.Multipart.handleDataPart {
    // ... handle the plain form-data parts here ...
  }
}
Another way would be the handlePart method. Check the documentation for more details.

java.nio.BufferUnderflowException when processing files in Scala

I got a similar problem to this guy while processing a 4 MB log file. Actually I'm processing multiple files simultaneously, but since I keep getting this exception, I decided to just test it on a single file:
val temp = Source.fromFile("./datasource/input.txt")
val dummy = new PrintWriter("test.txt")
var itr = 0
println("Default Buffer size: " + Source.DefaultBufSize)
try {
  for (chr <- temp) {
    dummy.print(chr.toChar)
    itr += 1
    if (itr == 75703) println("Passed line 85")
    if (itr % 256 == 0) { print("..." + itr); temp.reset; System.gc; }
    if (itr == 75703) println("Passed line 87")
    if (itr % 2048 == 0) println("")
    if (itr == 75703) println("Passed line 89")
  }
} finally {
  println("\nFailed at itr = " + itr)
}
What I always get is that it fails at itr = 75703, and my output file is always 64 KB (65536 bytes exactly). No matter where I put temp.reset or System.gc, all experiments end up the same.
It seems like the problem is related to some buffer or memory allocation, but I cannot find any useful information on it. Any idea on how to solve this one?
All your help is greatly appreciated.
EDIT: Actually I want to process it as a binary file, so this technique is not a good solution; many have recommended that I use a BufferedInputStream instead.
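For reference, a minimal sketch of reading the file as raw bytes with a BufferedInputStream instead of Source (the paths mirror the ones above; the buffer size is arbitrary):
import java.io.{BufferedInputStream, BufferedOutputStream, FileInputStream, FileOutputStream}

val in  = new BufferedInputStream(new FileInputStream("./datasource/input.txt"))
val out = new BufferedOutputStream(new FileOutputStream("test.txt"))
try {
  val buffer = new Array[Byte](8192)
  var n = in.read(buffer)
  while (n != -1) {          // read until end of stream; no reset/gc tricks needed
    out.write(buffer, 0, n)
    n = in.read(buffer)
  }
} finally {
  out.close()
  in.close()
}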
Why are you calling reset on the Source before it has finished iterating through the file?
val temp = Source.fromFile("./datasource/input.txt")
try {
  for (line <- temp.getLines) {
    //whatever
  }
} finally temp.reset
Should work just fine with no underflows. See also this question