Decompress a single file from a zip folder without writing to disk - Scala

Is it possible to decompress a single file from a zip folder and return the decompressed file without storing the data on the server?
I have a zip file with an unknown structure and I would like to develop a service that serves the content of a given file on demand, without decompressing the whole zip and without writing to disk.
So, if I have a zip file like this
zip_folder.zip
| folder1
|   | file1.txt
|   | file2.png
| folder2
|   | file3.jpg
|   | file4.pdf
| ...
I would like my service to receive the name and path of the file so that it can send back its content.
For example, fileName could be folder1/file1.txt
def getFileContent(fileName: String): IBinaryContent = {
  val content: IBinaryContent = getBinaryContent(...)
  val zipInputStream: ZipInputStream = new ZipInputStream(content.getInputStream)
  val outputStream: FileOutputStream = new FileOutputStream(fileName)
  var zipEntry: ZipEntry = null
  var founded: Boolean = false

  while ({
    zipEntry = zipInputStream.getNextEntry
    Option(zipEntry).isDefined && !founded
  }) {
    if (zipEntry.getName.equals(fileName)) {
      val buffer: Array[Byte] = Array.ofDim(9000) // FIXME how to get the dimension of the array
      var length = 0
      while ({
        length = zipInputStream.read(buffer)
        length != -1
      }) {
        outputStream.write(buffer, 0, length)
      }
      outputStream.close()
      founded = true
    }
  }

  zipInputStream.close()
  outputStream /* how can I return the value? */
}
How can I do this without writing the content to disk?

You can use a ByteArrayOutputStream instead of the FileOutputStream to uncompress the zip entry into memory. Then call toByteArray() on it.
Also note that, technically, you would not even need to decompress the zip entry if you can transmit it over a protocol (think HTTP(S)) that supports deflate encoding for its transport, since deflate is usually the compression used inside zip files.
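For instance, a minimal sketch of the first suggestion (decompressing the requested entry into a ByteArrayOutputStream instead of a file; transferTo requires Java 9+):
import java.io.ByteArrayOutputStream
import java.util.zip.ZipInputStream

// Minimal sketch: scan the entries until the requested one is found,
// then copy its decompressed bytes into memory instead of onto disk.
def readEntry(zis: ZipInputStream, entryName: String): Option[Array[Byte]] =
  Iterator
    .continually(zis.getNextEntry)
    .takeWhile(_ != null)
    .collectFirst {
      case entry if entry.getName == entryName =>
        val out = new ByteArrayOutputStream()
        zis.transferTo(out) // reads only the current entry (Java 9+)
        out.toByteArray
    }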

So, basically I did the same thing @cbley recommended: I returned an array of bytes and set the content type so that the browser can do the magic!
def getFileContent(fileName: String): Array[Byte] = {
  val content: IBinaryContent = getBinaryContent(...)
  val zipInputStream: ZipInputStream = new ZipInputStream(content.getInputStream)
  val outputStream: ByteArrayOutputStream = new ByteArrayOutputStream()
  var zipEntry: ZipEntry = null
  var founded: Boolean = false

  while ({
    zipEntry = zipInputStream.getNextEntry
    Option(zipEntry).isDefined && !founded
  }) {
    if (zipEntry.getName.equals(fileName)) {
      val buffer: Array[Byte] = Array.ofDim(zipEntry.getSize.toInt) // getSize returns a Long
      var length = 0
      while ({
        length = zipInputStream.read(buffer)
        length != -1
      }) {
        outputStream.write(buffer, 0, length)
      }
      outputStream.close()
      founded = true
    }
  }

  zipInputStream.close()
  outputStream.toByteArray
}
// in my REST service
@GET
@Path("/content/{fileName}")
def content(@PathVariable fileName: String): Response = {
  val content = getFileContent(fileName)
  Response.ok(content)
    .header("Content-type", new Tika().detect(fileName)) // I'm using Tika but it's possible to use other libraries
    .build()
}

Related

Decompress (unzip/extract) util using Spark Scala

I have customer_input_data.tar.gz in HDFS, which contains data for 10 different tables in CSV format. I need to unzip this file to /my/output/path using Spark and Scala.
Please suggest how to unzip the customer_input_data.tar.gz file using Spark and Scala.
gzip is not a splittable format in Hadoop. Consequently, the file is not really going to be distributed across the cluster and you don't get any benefit of distributed compute/processing in Hadoop or Spark.
A better approach may be to uncompress the file at the OS level and then send the individual files back to Hadoop.
If you still want to uncompress in Scala, you can simply resort to the Java class GZIPInputStream via
new GZIPInputStream(new FileInputStream("your file path"))
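For illustration, a minimal local sketch of that idea (assuming Apache commons-compress is on the classpath for the tar layer; the local file name is a placeholder):
import java.io.FileInputStream
import java.util.zip.GZIPInputStream
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream

// Sketch: open the tar.gz locally and list its entries (no Spark involved).
val tarIn = new TarArchiveInputStream(
  new GZIPInputStream(new FileInputStream("customer_input_data.tar.gz")))
Iterator
  .continually(tarIn.getNextTarEntry)
  .takeWhile(_ != null)
  .foreach(entry => println(entry.getName))
tarIn.close()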
I developed the code below to decompress the files using Scala. You need to pass the input path, the output path, and the Hadoop FileSystem.
/* Below method is used for processing tar.gz files. */
// Note: `out` (a PrintStream) and `BUFFER_SIZE` are assumed to be defined elsewhere in the class.
@throws[IOException]
private def processTargz(fullpath: String, houtPath: String, fs: FileSystem): Unit = {
  val path = new Path(fullpath)
  val gzipIn = new GzipCompressorInputStream(fs.open(path))
  try {
    val tarIn = new TarArchiveInputStream(gzipIn)
    try {
      var entry: TarArchiveEntry = null
      out.println("Tar entry")
      out.println("Tar Name entry :" + FilenameUtils.getName(fullpath))
      val fileName1 = FilenameUtils.getName(fullpath)
      val tarNamesFolder = fileName1.substring(0, fileName1.indexOf('.'))
      out.println("Folder Name : " + tarNamesFolder)
      while ({
        entry = tarIn.getNextEntry.asInstanceOf[TarArchiveEntry]
        entry != null
      }) {
        // the entry name is the tsv/csv file name inside the compressed tar file
        out.println("ENTITY NAME : " + entry.getName)
        /* If the entry is a directory, create the directory. */
        out.println("While")
        if (entry.isDirectory) {
          val f = new File(entry.getName)
          val created = f.mkdir
          out.println("mkdir")
          if (!created) {
            out.printf("Unable to create directory '%s', during extraction of archive contents.%n", f.getAbsolutePath)
            out.println("Absolute path")
          }
        } else {
          var count = 0
          val slash = "/"
          val targetPath = houtPath + slash + tarNamesFolder + slash + entry.getName
          val hdfswritepath = new Path(targetPath)
          val fos = fs.create(hdfswritepath, true)
          try {
            val dest = new BufferedOutputStream(fos, BUFFER_SIZE)
            try {
              val data = new Array[Byte](BUFFER_SIZE)
              while ({
                count = tarIn.read(data, 0, BUFFER_SIZE)
                count != -1
              }) dest.write(data, 0, count)
            } finally if (dest != null) dest.close()
          }
        }
      }
      out.println("Untar completed successfully!")
    } catch {
      case e: IOException =>
        out.println("catch Block")
    } finally {
      out.println("FINAL Block")
      if (tarIn != null) tarIn.close()
    }
  }
}
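A hedged usage sketch for the method above (the HDFS input path and the `spark` session name are assumptions; the output path is the one from the question):
// Hypothetical call site: obtain the Hadoop FileSystem from the active Spark session
// and untar the archive from HDFS into the output directory.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
processTargz("hdfs:///data/customer_input_data.tar.gz", "/my/output/path", fs)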

How to read a .tar file containing parquets on S3 as dataframes in Spark?

I need to load a .tar file on S3 that contains multiple parquet files with different schemas using Scala/Spark. Ideally I'd like to read one of these parquets into a Spark dataframe. I tried to get the S3 object and then convert it to a tar input stream using org.apache.commons.compress.archivers.tar.TarArchiveInputStream, and it was able to create the tar input stream but failed to read the tar entries.
val s3client: AmazonS3 = AmazonS3ClientBuilder
  .standard()
  .withCredentials(new InstanceProfileCredentialsProvider())
  .withRegion(my_region)
  .build()

val tarFile = s3client.getObject(my_bucket, my_tar_file)
val tarInputStream = new TarArchiveInputStream(tarFile.getObjectContent)
tarInputStream.getNextTarEntry() // <-- error thrown on this line
Error:
java.io.IOException: Error detected parsing the header
at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:240)
... 52 elided
Caused by: java.lang.IllegalArgumentException: Invalid byte 48 at offset 7 in '00755{NUL}00' len=8
at org.apache.commons.compress.archivers.tar.TarUtils.parseOctal(TarUtils.java:127)
at org.apache.commons.compress.archivers.tar.TarUtils.parseOctalOrBinary(TarUtils.java:171)
at org.apache.commons.compress.archivers.tar.TarArchiveEntry.parseTarHeader(TarArchiveEntry.java:935)
at org.apache.commons.compress.archivers.tar.TarArchiveEntry.parseTarHeader(TarArchiveEntry.java:924)
at org.apache.commons.compress.archivers.tar.TarArchiveEntry.<init>(TarArchiveEntry.java:328)
at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:238)
Does anyone know the proper way to extract part of a tar file on S3 in Spark?
Follow this example. I assume you are using tar.gz:
AWSCredentials credentials = new BasicAWSCredentials("accessKey", "secretKey");
AWSCredentialsProvider credentialsProvider = new AWSStaticCredentialsProvider(credentials);
AmazonS3 s3Client = AmazonS3ClientBuilder.standard().withRegion(Regions.US_EAST_1).withCredentials(credentialsProvider).build();

S3Object object = s3Client.getObject("bucketname", "file.tar.gz");
S3ObjectInputStream objectContent = object.getObjectContent();
TarArchiveInputStream tarInputStream = new TarArchiveInputStream(new GZIPInputStream(objectContent));

TarArchiveEntry currentEntry;
while ((currentEntry = tarInputStream.getNextTarEntry()) != null) {
    if (currentEntry.getName().equals("1/foo.bar") && currentEntry.isFile()) {
        FileOutputStream entryOs = new FileOutputStream("foo.bar");
        IOUtils.copy(tarInputStream, entryOs);
        entryOs.close();
        break;
    }
}
objectContent.abort(); // Warning at this line
tarInputStream.close(); // warning at this line
The Scala equivalent is:
val credentials: AWSCredentials =
  new BasicAWSCredentials("accessKey", "secretKey")
val credentialsProvider: AWSCredentialsProvider =
  new AWSStaticCredentialsProvider(credentials)
val s3Client: AmazonS3 = AmazonS3ClientBuilder
  .standard()
  .withRegion(Regions.US_EAST_1)
  .withCredentials(credentialsProvider)
  .build()

val s3object: S3Object = s3Client.getObject("bucketname", "file.tar.gz")
val objectContent: S3ObjectInputStream = s3object.getObjectContent
val tarInputStream: TarArchiveInputStream = new TarArchiveInputStream(
  new GZIPInputStream(objectContent))

var currentEntry: TarArchiveEntry = null
while ({
  currentEntry = tarInputStream.getNextTarEntry
  currentEntry != null
}) {
  if (currentEntry.getName == "1/foo.bar" && currentEntry.isFile) {
    val entryOs: FileOutputStream = new FileOutputStream("foo.bar")
    IOUtils.copy(tarInputStream, entryOs)
    entryOs.close()
  }
}

objectContent.abort()
tarInputStream.close()
Update: since you are using plain tar, not gzip, you have to read it like this (getObjectContent already returns an InputStream, so no extra wrapper is needed):
val tarInputStream = new TarArchiveInputStream(tarFile.getObjectContent)
In your case you are passing the object content as a plain InputStream. My suggestion is to wrap it in a GZIPInputStream first and then read the entries:
val tarInputStream = new TarArchiveInputStream(tarFile.getObjectContent)                      // what you have
val tarInputStream = new TarArchiveInputStream(new GZIPInputStream(tarFile.getObjectContent)) // suggested

val entry: TarArchiveEntry = readEntries(tarInputStream)

def readEntries(tarInputStream: TarArchiveInputStream): TarArchiveEntry = {
  var currentEntry = Option(tarInputStream.getNextTarEntry())
  // you can use a functional approach with foldLeft, reduce, or something else, or a while loop
  // implementation details here
}
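One possible way to fill in that sketch, shown here returning all entries as a list rather than a single entry (an assumption about what readEntries should produce):
import org.apache.commons.compress.archivers.tar.{TarArchiveEntry, TarArchiveInputStream}

// Sketch: walk the archive lazily and collect every entry's metadata.
def readEntries(tarInputStream: TarArchiveInputStream): List[TarArchiveEntry] =
  Iterator
    .continually(tarInputStream.getNextTarEntry)
    .takeWhile(_ != null)
    .toList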
You can find an example of how to use TarArchiveInputStream here.
You can use GetObjectRequest to create an S3Object
val s3FullObject: S3Object = s3client.getObject(new GetObjectRequest(s3Bucket, s3TarPath))
val tis = new TarArchiveInputStream(s3FullObject.getObjectContent)
var entry: TarArchiveEntry = tis.getNextTarEntry

How to check file size before sinking a file into the destination in Play Framework?

For learning purposes I was writing code with Scala and the Play Framework to upload files by using a FilePartHandler in the multipartFormData parser.
My code is here:
type FilePartHandler[A] = FileInfo => Accumulator[ByteString, FilePart[A]]

def filePartHandler: FilePartHandler[File] = {
  case FileInfo(partname, fileName, contentType) =>
    // val isAppropriateFileType: Boolean = List("image/png", "image/jpg", "image/jpeg").contains(contentType.getOrElse(""))
    val file = new File("/tmp/picture/" + fileName)
    val sink = FileIO.toPath(file.toPath)
    val accumulator = Accumulator.apply(sink)
    accumulator.mapFuture {
      case IOResult(_, scala.util.Failure(exception)) => Future.failed(exception)
      case IOResult(size, Success(Done)) => Future.successful(FilePart(partname, fileName, contentType, file))
    }
}
I can check the file size in accumulator.mapFuture(...), but this step is too late, because by the Accumulator step the file has already been moved to the destination. Yes, I can delete the file from the destination in the accumulator.mapFuture step.
But my question is: is there any Play-provided way of checking the file size before moving the file to the destination?
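A hedged sketch of one possible approach (not necessarily a built-in Play feature): place a size-limiting Flow in front of the file sink, so the stream fails before more than a hypothetical maxSize bytes reach the destination. An implicit ExecutionContext and Akka Streams are assumed, and the FilePartHandler alias is the one defined above.
import java.io.File
import akka.stream.scaladsl.{FileIO, Flow, Keep}
import akka.util.ByteString
import play.api.libs.streams.Accumulator
import play.api.mvc.MultipartFormData.FilePart
import play.core.parsers.Multipart.FileInfo

// Sketch: reject the part once more than maxSize bytes have arrived,
// before the excess is ever written to the destination.
def limitedFilePartHandler(maxSize: Long): FilePartHandler[File] = {
  case FileInfo(partname, fileName, contentType) =>
    val file = new File("/tmp/picture/" + fileName)
    // Fails the stream with a StreamLimitReachedException when the limit is exceeded.
    val limit = Flow[ByteString].limitWeighted(maxSize)(_.size.toLong)
    val sink = limit.toMat(FileIO.toPath(file.toPath))(Keep.right)
    Accumulator(sink).map { _ =>
      FilePart(partname, fileName, contentType, file)
    }
}
Note that even with the limit, a partially written file may remain on disk and should be cleaned up when the accumulator's future fails. If a limit on the whole request body is enough, Play's parse.maxLength body-parser wrapper may also be an option.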

ZipInputStream.read in ZipEntry

I am reading a zip file using ZipInputStream. The zip file has 4 CSV files. Some files are written completely, some only partially. Please help me find the issue with the code below. Is there any limit on the read buffer for the ZipInputStream.read method?
val zis = new ZipInputStream(inputStream)
Stream.continually(zis.getNextEntry).takeWhile(_ != null).foreach { file =>
  if (!file.isDirectory && file.getName.endsWith(".csv")) {
    val buffer = new Array[Byte](file.getSize.toInt)
    zis.read(buffer)
    val fo = new FileOutputStream("c:\\temp\\input\\" + file.getName)
    fo.write(buffer)
  }
You have not closed/flushed the files you attempted to write. It should be something like this (assuming Scala syntax, or is this Kotlin/Ceylon?):
val fo = new FileOutputStream("c:\\temp\\input\\" + file.getName)
try {
  fo.write(buffer)
} finally {
  fo.close()
}
Also you should check the read count and read more if necessary, something like this:
var readBytes = 0
while (readBytes < buffer.length) {
  val r = zis.read(buffer, readBytes, buffer.length - readBytes)
  r match {
    case -1 => throw new IllegalStateException("Read terminated before reading everything")
    case _  => readBytes += r
  }
}
PS: Your example seems to have fewer closing }s than required.
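Putting both fixes together, a hedged sketch of the full extraction loop could look like this (it keeps the question's assumptions that inputStream is available and that file.getSize is known, i.e. not -1):
import java.io.FileOutputStream
import java.util.zip.ZipInputStream

// Sketch: read each entry fully into the buffer and always close the output file.
val zis = new ZipInputStream(inputStream)
Stream.continually(zis.getNextEntry).takeWhile(_ != null).foreach { entry =>
  if (!entry.isDirectory && entry.getName.endsWith(".csv")) {
    val buffer = new Array[Byte](entry.getSize.toInt)
    var readBytes = 0
    while (readBytes < buffer.length) {
      val r = zis.read(buffer, readBytes, buffer.length - readBytes)
      if (r == -1) throw new IllegalStateException("Read terminated before reading everything")
      readBytes += r
    }
    val fo = new FileOutputStream("c:\\temp\\input\\" + entry.getName)
    try fo.write(buffer) finally fo.close()
  }
}
zis.close()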

File not downloading properly

I am downloading a file from a URL and saving it to a directory on my phone.
The path is: /private/var/mobile/Applications/17E4F0B0-0781-4259-B39D-37057D44B778/Documents/samplefile.txt
When I debug, the file is created and downloaded. But when I run an ad-hoc build, samplefile.txt is created but it's blank.
Code:
String directory = Environment.GetFolderPath (Environment.SpecialFolder.MyDocuments);
var filename = Path.Combine (directory, "samplefile.txt");
if (!File.Exists (filename)) {
    File.Create (filename);
    var webClient = new WebClient ();
    webClient.DownloadStringCompleted += (s, e) => {
        var text = e.Result; // get the downloaded text
        File.WriteAllText (filename, text);
    };
    var url = new Uri (/**myURL**/);
    webClient.Encoding = Encoding.UTF8;
    webClient.DownloadStringAsync (url);
}
I modified your sample slightly and the following works for me.
The StreamReader is only there to re-read the contents of the file and confirm that the file contains the same contents as the downloaded text.
If you put a breakpoint there, you can also manually inspect that the contents match what was downloaded.
string directory = System.Environment.GetFolderPath(System.Environment.SpecialFolder.Personal);
var filename = Path.Combine(directory, "samplefile.txt");
if (!File.Exists(filename))
{
    var webClient = new WebClient();
    webClient.DownloadStringCompleted += (s, e) =>
    {
        // Write contents of downloaded file to device:-
        var text = e.Result; // get the downloaded text
        StreamWriter sw = new StreamWriter(filename);
        sw.Write(text);
        sw.Flush();
        sw.Close();
        sw = null;

        // Read in contents from device and validate same as downloaded:-
        StreamReader sr = new StreamReader(filename);
        string strFileContentsOnDevice = sr.ReadToEnd();
        System.Diagnostics.Debug.Assert(strFileContentsOnDevice == text);
    };
    var url = new Uri("**url here**", UriKind.Absolute);
    webClient.Encoding = Encoding.UTF8;
    webClient.DownloadStringAsync(url);
}