Decompress (unzip/extract) util using Spark Scala

I have customer_input_data.tar.gz in HDFS, which contains data for 10 different tables in CSV format. I need to unzip this file to /my/output/path using Spark Scala.
Please suggest how to unzip the customer_input_data.tar.gz file using Spark Scala.

gzip is not a splittable format in Hadoop. Consequently, the file is not really going to be distributed across the cluster and you don't get any benefit of distributed compute/processing in Hadoop or Spark.
A better approach may be to uncompress the file on the OS and then send the individual files back to Hadoop.
If you still want to uncompress in Scala, you can simply resort to the Java class GZIPInputStream via
new GZIPInputStream(new FileInputStream("your file path"))
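For illustration, a minimal sketch of that approach on a local file (the paths are made up, and note that GZIPInputStream only strips the gzip layer; the tar archive inside still has to be unpacked, as in the answer below):
import java.io.{FileInputStream, FileOutputStream}
import java.util.zip.GZIPInputStream

// Strip the gzip layer from a local .tar.gz; the result here is still a .tar
val in = new GZIPInputStream(new FileInputStream("/tmp/customer_input_data.tar.gz"))
val out = new FileOutputStream("/tmp/customer_input_data.tar")
val buffer = new Array[Byte](4096)
var n = in.read(buffer)
while (n != -1) {
  out.write(buffer, 0, n)
  n = in.read(buffer)
}
out.close()
in.close()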

I developed the code below to decompress the files using Scala. You need to pass the input path, the output path, and the Hadoop FileSystem.
/* The method below extracts a .tar.gz archive stored in HDFS into HDFS */
import java.io.{BufferedOutputStream, IOException}
import org.apache.commons.compress.archivers.tar.{TarArchiveEntry, TarArchiveInputStream}
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.commons.io.FilenameUtils
import org.apache.hadoop.fs.{FileSystem, Path}

private val BUFFER_SIZE = 4096 // copy buffer size (any reasonable value works)

@throws[IOException]
private def processTargz(fullpath: String, houtPath: String, fs: FileSystem): Unit = {
  val path = new Path(fullpath)
  val gzipIn = new GzipCompressorInputStream(fs.open(path))
  val tarIn = new TarArchiveInputStream(gzipIn)
  try {
    // Use the archive name (without extension) as the output folder name
    val fileName = FilenameUtils.getName(fullpath)
    val tarNamesFolder = fileName.substring(0, fileName.indexOf('.'))
    println("Tar name entry: " + fileName)
    println("Folder name: " + tarNamesFolder)
    // Note: in Scala an assignment evaluates to Unit, so the Java idiom
    // `while ((entry = tarIn.getNextEntry) != null)` does not work here.
    var entry: TarArchiveEntry = tarIn.getNextTarEntry
    while (entry != null) {
      // The entry name is the csv/tsv file name inside the compressed tar file
      println("Entry name: " + entry.getName)
      if (entry.isDirectory) {
        // Nothing to do: fs.create below creates parent directories as needed
        println("Skipping directory entry: " + entry.getName)
      } else {
        val targetPath = houtPath + "/" + tarNamesFolder + "/" + entry.getName
        val hdfsWritePath = new Path(targetPath)
        val dest = new BufferedOutputStream(fs.create(hdfsWritePath, true), BUFFER_SIZE)
        try {
          val data = new Array[Byte](BUFFER_SIZE)
          var count = tarIn.read(data, 0, BUFFER_SIZE)
          while (count != -1) {
            dest.write(data, 0, count)
            count = tarIn.read(data, 0, BUFFER_SIZE)
          }
        } finally {
          dest.close()
        }
      }
      entry = tarIn.getNextTarEntry
    }
    println("Untar completed successfully!")
  } catch {
    case e: IOException =>
      println("Failed to extract " + fullpath + ": " + e.getMessage)
  } finally {
    tarIn.close()
  }
}
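A minimal sketch of calling this method (assuming an active SparkSession named spark; the archive path is illustrative):
import org.apache.hadoop.fs.FileSystem

// Obtain the HDFS FileSystem from the Spark context's Hadoop configuration
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
processTargz("/data/customer_input_data.tar.gz", "/my/output/path", fs)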

Related

How to read a .tar file containing parquets on S3 as dataframes in Spark?

I need to load a .tar file on S3 that contains multiple parquets with different schemas using Scala/Spark. Ideally I'd like to read one of these parquets into a Spark dataframe. I tried to get the S3 object and then convert it to a tar input stream using org.apache.commons.compress.archivers.tar.TarArchiveInputStream; it was able to create the tar input stream but failed to read the tar entries.
val s3client: AmazonS3 = AmazonS3ClientBuilder
  .standard()
  .withCredentials(new InstanceProfileCredentialsProvider())
  .withRegion(my_region)
  .build()
val tarFile = s3client.getObject(my_bucket, my_tar_file)
val tarInputStream = new TarArchiveInputStream(tarFile.getObjectContent)
tarInputStream.getNextTarEntry() // <-- error thrown at this line
Error:
java.io.IOException: Error detected parsing the header
at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:240)
... 52 elided
Caused by: java.lang.IllegalArgumentException: Invalid byte 48 at offset 7 in '00755{NUL}00' len=8
at org.apache.commons.compress.archivers.tar.TarUtils.parseOctal(TarUtils.java:127)
at org.apache.commons.compress.archivers.tar.TarUtils.parseOctalOrBinary(TarUtils.java:171)
at org.apache.commons.compress.archivers.tar.TarArchiveEntry.parseTarHeader(TarArchiveEntry.java:935)
at org.apache.commons.compress.archivers.tar.TarArchiveEntry.parseTarHeader(TarArchiveEntry.java:924)
at org.apache.commons.compress.archivers.tar.TarArchiveEntry.<init>(TarArchiveEntry.java:328)
at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:238)
Does anyone know the proper way to extract part of a tar file on S3 in Spark?
Follow this example (I am assuming you are using tar.gz):
AWSCredentials credentials = new BasicAWSCredentials("accessKey", "secretKey");
AWSCredentialsProvider credentialsProvider = new AWSStaticCredentialsProvider(credentials);
AmazonS3 s3Client = AmazonS3ClientBuilder.standard().withRegion(Regions.US_EAST_1).withCredentials(credentialsProvider).build();
S3Object object = s3Client.getObject("bucketname", "file.tar.gz");
S3ObjectInputStream objectContent = object.getObjectContent();
TarArchiveInputStream tarInputStream = new TarArchiveInputStream(new GZIPInputStream(objectContent));
TarArchiveEntry currentEntry;
while ((currentEntry = tarInputStream.getNextTarEntry()) != null) {
    if (currentEntry.getName().equals("1/foo.bar") && currentEntry.isFile()) {
        FileOutputStream entryOs = new FileOutputStream("foo.bar");
        IOUtils.copy(tarInputStream, entryOs);
        entryOs.close();
        break;
    }
}
objectContent.abort(); // warning at this line
tarInputStream.close(); // warning at this line
The Scala equivalent is:
val credentials: AWSCredentials =
  new BasicAWSCredentials("accessKey", "secretKey")
val credentialsProvider: AWSCredentialsProvider =
  new AWSStaticCredentialsProvider(credentials)
val s3Client: AmazonS3 = AmazonS3ClientBuilder
  .standard()
  .withRegion(Regions.US_EAST_1)
  .withCredentials(credentialsProvider)
  .build()
val s3object: S3Object = s3Client.getObject("bucketname", "file.tar.gz")
val objectContent: S3ObjectInputStream = s3object.getObjectContent
val tarInputStream: TarArchiveInputStream =
  new TarArchiveInputStream(new GZIPInputStream(objectContent))

// Note: in Scala an assignment evaluates to Unit, so the Java-style
// `while ((x = f()) != null)` pattern does not work; read the entry
// before the loop and again at the end of each iteration instead.
var currentEntry: TarArchiveEntry = tarInputStream.getNextTarEntry
var done = false
while (currentEntry != null && !done) {
  if (currentEntry.getName == "1/foo.bar" && currentEntry.isFile) {
    val entryOs: FileOutputStream = new FileOutputStream("foo.bar")
    IOUtils.copy(tarInputStream, entryOs)
    entryOs.close()
    done = true // equivalent of the `break` in the Java version
  }
  currentEntry = tarInputStream.getNextTarEntry
}
objectContent.abort()
tarInputStream.close()
Update:
Since you are using only tar, not gzip, you have to read it like this, passing the object content directly (FileInputStream only accepts files, not streams):
val tarInputStream = new TarArchiveInputStream(tarFile.getObjectContent)
In your case you are passing the object content as a plain InputStream. My suggestion is to wrap it in a GZIPInputStream first, then read the entries:
// instead of
val tarInputStream = new TarArchiveInputStream(tarFile.getObjectContent)
// wrap the object content in a GZIPInputStream
val tarInputStream = new TarArchiveInputStream(new GZIPInputStream(tarFile.getObjectContent))
val entry: TarArchiveEntry = readEntries(tarInputStream)

def readEntries(tarInputStream: TarArchiveInputStream): TarArchiveEntry = {
  var currentEntry = Option(tarInputStream.getNextTarEntry())
  // you can use a functional approach with foldLeft, reduce or something else, or a while loop
  // implementation details here
}
You can find more on how to use TarArchiveInputStream here.
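As a hedged sketch, one way to fill in readEntries is to collect every entry with Iterator.continually (note this changes the return type of the signature above to a list, purely for illustration):
import org.apache.commons.compress.archivers.tar.{TarArchiveEntry, TarArchiveInputStream}

// Collect all tar entries; getNextTarEntry returns null at the end of the archive
def readEntries(tarInputStream: TarArchiveInputStream): List[TarArchiveEntry] =
  Iterator
    .continually(tarInputStream.getNextTarEntry)
    .takeWhile(_ != null)
    .toList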
You can use GetObjectRequest to create an S3Object
val s3FullObject: S3Object = s3client.getObject(new GetObjectRequest(s3Bucket, s3TarPath))
val tis = new TarArchiveInputStream(s3FullObject.getObjectContent)
var entry: TarArchiveEntry = tis.getNextTarEntry
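From there, a hedged sketch of walking the remaining entries (reusing tis and entry from the snippet above) could look like:
// List every entry in the tar; getNextTarEntry returns null at the end
while (entry != null) {
  println(s"${entry.getName} (${entry.getSize} bytes)")
  entry = tis.getNextTarEntry
}
tis.close()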

Decompress a single file from a zip folder without writing to disk

Is it possible to decompress a single file from a zip folder and return the decompressed file without storing the data on the server?
I have a zip file with an unknown structure and I would like to develop a service that serves the content of a given file on demand, without decompressing the whole zip and without writing to disk.
So, if I have a zip file like this:
zip_folder.zip
| folder1
|   file1.txt
|   file2.png
| folder2
|   file3.jpg
|   file4.pdf
| ...
So, I would like my service to receive the name and path of the file so that I can send back its content.
For example, fileName could be folder1/file1.txt
def getFileContent(fileName: String): IBinaryContent = {
  val content: IBinaryContent = getBinaryContent(...)
  val zipInputStream: ZipInputStream = new ZipInputStream(content.getInputStream)
  val outputStream: FileOutputStream = new FileOutputStream(fileName)
  var zipEntry: ZipEntry = null
  var founded: Boolean = false
  while ({
    zipEntry = zipInputStream.getNextEntry
    Option(zipEntry).isDefined && !founded
  }) {
    if (zipEntry.getName.equals(fileName)) {
      val buffer: Array[Byte] = Array.ofDim(9000) // FIXME how to get the dimension of the array
      var length = 0
      while ({
        length = zipInputStream.read(buffer)
        length != -1
      }) {
        outputStream.write(buffer, 0, length)
      }
      outputStream.close()
      founded = true
    }
  }
  zipInputStream.close()
  outputStream /* how can I return the value? */
}
How can I do it without writing the content to disk?
You can use a ByteArrayOutputStream instead of the FileOutputStream to uncompress the zip entry into memory. Then call toByteArray() on it.
Also note that, technically, you would not even need to decompress the zip part if you can transmit it over a protocol (think: HTTP(S)) which supports the deflate encoding for its transport (which is usually the standard compression used in Zip files).
So, basically I did the same thing that @cbley recommended. I returned an array of bytes and defined the content type so that the browser can do the magic!
def getFileContent(fileName: String): Array[Byte] = {
  val content: IBinaryContent = getBinaryContent(...)
  val zipInputStream: ZipInputStream = new ZipInputStream(content.getInputStream)
  val outputStream: ByteArrayOutputStream = new ByteArrayOutputStream()
  var zipEntry: ZipEntry = null
  var founded: Boolean = false
  while ({
    zipEntry = zipInputStream.getNextEntry
    Option(zipEntry).isDefined && !founded
  }) {
    if (zipEntry.getName.equals(fileName)) {
      // getSize returns a Long, so convert it to build the buffer
      val buffer: Array[Byte] = Array.ofDim(zipEntry.getSize.toInt)
      var length = 0
      while ({
        length = zipInputStream.read(buffer)
        length != -1
      }) {
        outputStream.write(buffer, 0, length)
      }
      outputStream.close()
      founded = true
    }
  }
  zipInputStream.close()
  outputStream.toByteArray
}
// in my REST service
@GET
@Path("/content/${fileName}")
def content(@PathVariable fileName: String): Response = {
  val content = getFileContent(fileName)
  Response.ok(content)
    .header("Content-type", new Tika().detect(fileName)) // I'm using Tika, but other libraries work too
    .build()
}

Scala - Read file from S3 bucket

I want to read a specific file from an S3 bucket. My S3 bucket will have many objects (directories and subdirectories). I want to traverse all the objects but read only that one file.
I am trying the code below:
val s3Client: AmazonS3Client = getS3Client()
try {
  log.info("Listing objects from S3")
  var counter = 0
  val listObjectsRequest = new ListObjectsRequest()
    .withBucketName(bucketName)
    .withMaxKeys(2)
    .withPrefix("Test/" + "Client_cd" + "/" + "DM1" + "/")
    .withMarker("Test/" + "Client_cd" + "/" + "DM1" + "/")
  var objectListing: ObjectListing = null
  do {
    objectListing = s3Client.listObjects(listObjectsRequest)
    import scala.collection.JavaConversions._
    for (objectSummary <- objectListing.getObjectSummaries) {
      println(objectSummary.getKey + "\t" + StringUtils.fromDate(objectSummary.getLastModified))
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker())
  } while (objectListing.isTruncated())
} catch {
  case e: Exception =>
    log.error("Failed listing files. ", e)
    throw e
}
Under this path I have to read only the .gz files from the latest month's folders. File path:
"Mybucket/Test/Client_cd/Dm1/20181010_xxxxx/*.gz"
Here, I have to pass Client_cd as a parameter for a particular client.
How do I filter the objects and get only those particular files?
If you are using EMR or your S3 configs are set up correctly, you can also use sc.textFile("s3://bucket/Test/Client_cd/Dm1/20181010_xxxxx/*.gz").
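As a hedged sketch of that suggestion with the DataFrame API (the bucket and prefix come from the question, the date folder is left as a wildcard, and a configured SparkSession named spark is assumed; selecting only the latest month's folder would still require listing the keys first, as in the question's code):
// Read all gzipped CSVs under the client's prefix into a DataFrame.
// Spark decompresses .gz transparently, one partition per file since gzip is not splittable.
val clientCd = "Client_cd" // illustrative parameter
val df = spark.read
  .option("header", "true")
  .csv(s"s3://Mybucket/Test/$clientCd/Dm1/*/*.gz")
df.show(5)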

Download all the files from an S3 bucket using Scala

I tried the code below and it downloads one file successfully, but I am unable to download the whole list of files.
client.getObject(
new GetObjectRequest(bucketName, "TestFolder/TestSubfolder/Psalm/P.txt"),
new File("test.txt"))
Thanks in advance
Update
I tried the code below, but I am getting a list of directories; I want a list of files instead.
val listObjectsRequest = new ListObjectsRequest()
  .withBucketName("tivo-hadoop-dev")
  .withPrefix("prefix")
  .withDelimiter("/")
client.listObjects(listObjectsRequest).getCommonPrefixes
It's a simple thing, but I struggled quite a bit before arriving at the answer below.
I found some Java code, converted it to Scala accordingly, and it worked:
import scala.collection.JavaConversions._

val client = new AmazonS3Client(credentials)
val listObjectsRequest = new ListObjectsRequest()
  .withBucketName("bucket-name")
  .withPrefix("path/of/dir")
  .withDelimiter("/")
var objects = client.listObjects(listObjectsRequest)
do {
  for (objectSummary <- objects.getObjectSummaries()) {
    val key = objectSummary.getKey()
    println(key)
    val arr = key.split("/")
    val file_name = arr(arr.length - 1)
    client.getObject(
      new GetObjectRequest("bucket-name", key),
      new File("some/path/" + file_name))
  }
  objects = client.listNextBatchOfObjects(objects)
} while (objects.isTruncated())
The code below is fast and useful, especially when you want to download all objects to a specific local directory. It maintains the files under the exact same S3 prefix hierarchy.
val xferMgrForAws: TransferManager = TransferManagerBuilder.standard().withS3Client(awsS3Client).build()
val objectListing: ObjectListing = awsS3Client.listObjects(awsBucketName, prefix)
val summaries: java.util.List[S3ObjectSummary] = objectListing.getObjectSummaries()
if (summaries.size() > 0) {
  val xfer: MultipleFileDownload = xferMgrForAws.downloadDirectory(awsBucketName, prefix, new File(localDirPath))
  xfer.waitForCompletion()
  println("All files downloaded successfully!")
} else {
  println("No object present in the bucket!")
}

ZipInputStream.read in ZipEntry

I am reading a zip file using ZipInputStream. The zip file has 4 CSV files. Some files are written completely, some only partially. Please help me find the issue with the code below. Is there any limit on how much the ZipInputStream.read method can read into the buffer in one call?
val zis = new ZipInputStream(inputStream)
Stream.continually(zis.getNextEntry).takeWhile(_ != null).foreach { file =>
if (!file.isDirectory && file.getName.endsWith(".csv")) {
val buffer = new Array[Byte](file.getSize.toInt)
zis.read(buffer)
val fo = new FileOutputStream("c:\\temp\\input\\" + file.getName)
fo.write(buffer)
}
You have not closed/flushed the files you attempted to write. It should be something like this (assuming Scala syntax, or is this Kotlin/Ceylon?):
val fo = new FileOutputStream("c:\\temp\\input\\" + file.getName)
try {
fo.write(buffer)
} finally {
fo.close
}
Also you should check the read count and read more if necessary, something like this:
var readBytes = 0
while (readBytes < buffer.length) {
val r = zis.read(buffer, readBytes, buffer.length - readBytes)
r match {
case -1 => throw new IllegalStateException("Read terminated before reading everything")
case _ => readBytes += r
}
}
PS: Your example seems to have fewer closing }s than required.
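Putting both fixes together, a hedged sketch of the corrected loop might look like this (it assumes file.getSize is known, i.e. not -1, as in the original code):
import java.io.FileOutputStream
import java.util.zip.ZipInputStream

val zis = new ZipInputStream(inputStream)
Stream.continually(zis.getNextEntry).takeWhile(_ != null).foreach { file =>
  if (!file.isDirectory && file.getName.endsWith(".csv")) {
    val buffer = new Array[Byte](file.getSize.toInt)
    // a single read may return fewer bytes than requested, so loop until the buffer is full
    var readBytes = 0
    while (readBytes < buffer.length) {
      val r = zis.read(buffer, readBytes, buffer.length - readBytes)
      if (r == -1) throw new IllegalStateException("Read terminated before reading everything")
      readBytes += r
    }
    val fo = new FileOutputStream("c:\\temp\\input\\" + file.getName)
    try fo.write(buffer) finally fo.close()
  }
}
zis.close()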