I have a .Z file stored in Azure ADLS Gen2. I want to decompress the file in ADLS. I tried decompressing with ADF and C#, but found that .Z is not supported. I also tried the Apache Commons Compress library for decompression, but was unable to read the file into an InputStream.
Does anyone have an idea how to decompress the file using the Apache library in Scala?
.Z files can be treated like gzip files, so you could try this approach:
import java.io.{BufferedReader, File, FileInputStream, InputStreamReader}
import java.util.zip.GZIPInputStream

object UnzipFiles {

  // Wraps a gzip-compressed file in a BufferedReader using the given encoding.
  def decompressGzipOrZFiles(file: File, encoding: String): BufferedReader = {
    val fis = new FileInputStream(file)
    val gzis = new GZIPInputStream(fis)
    val isr = new InputStreamReader(gzis, encoding)
    new BufferedReader(isr)
  }

  def main(args: Array[String]): Unit = {
    val path = new File("/home/cloudera/files/my_file.Z")
    // print the decompressed lines to the console
    decompressGzipOrZFiles(path, "UTF-8").lines().toArray.foreach(println)
  }
}
Or you could follow this approach too:
def uncompressGzip(myFileDotZorGzip: String): Unit = {
  import java.io.FileInputStream
  import java.util.zip.GZIPInputStream

  val gzipInputStream = new GZIPInputStream(new FileInputStream(myFileDotZorGzip))
  try {
    val buffer = new Array[Byte](128)
    var bytesRead = gzipInputStream.read(buffer)
    while (bytesRead != -1) {
      // do something with the data
      buffer.take(bytesRead).foreach(b => print(b.toChar))
      bytesRead = gzipInputStream.read(buffer)
    }
  } finally {
    if (gzipInputStream != null) gzipInputStream.close()
  }
}
I hope this helps.
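If GZIPInputStream refuses the stream (a true Unix-compress .Z file is LZW-encoded rather than gzip), a hedged alternative is Apache Commons Compress, which the question already mentions: its ZCompressorInputStream reads .Z directly. A minimal sketch:

import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import org.apache.commons.compress.compressors.z.ZCompressorInputStream

val zIn = new ZCompressorInputStream(new FileInputStream("/home/cloudera/files/my_file.Z"))
val reader = new BufferedReader(new InputStreamReader(zIn, "UTF-8"))
try {
  // stream the decompressed lines to the console
  Iterator.continually(reader.readLine()).takeWhile(_ != null).foreach(println)
} finally {
  reader.close()
}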
I want to stream some files and zip them on the fly, so users can download multiple files as a single zipped file without anything being written to the local disk. However, my current implementation holds everything in memory and will not work for large files. Is there any way to fix it?
I was looking at this implementation: https://gist.github.com/kirked/03c7f111de0e9a1f74377bf95d3f0f60, but couldn't figure out how to use it.
import java.io.{BufferedOutputStream, ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{ZipEntry, ZipOutputStream}
import akka.stream.scaladsl.StreamConverters
import org.apache.commons.io.FileUtils
import play.api.mvc.{Action, Controller}

class HomeController extends Controller {

  def single() = Action {
    Ok.sendFile(
      content = new java.io.File("C:\\Users\\a.csv"),
      fileName = _ => "a.csv"
    )
  }

  def zip() = Action {
    Ok.chunked(StreamConverters.fromInputStream(fileByteData)).withHeaders(
      CONTENT_TYPE -> "application/zip",
      CONTENT_DISPOSITION -> s"attachment; filename = test.zip"
    )
  }

  def fileByteData(): ByteArrayInputStream = {
    val fileList = List(
      new java.io.File("C:\\Users\\a.csv"),
      new java.io.File("C:\\Users\\b.csv")
    )
    val baos = new ByteArrayOutputStream()
    val zos = new ZipOutputStream(new BufferedOutputStream(baos))
    try {
      fileList.foreach { file =>
        zos.putNextEntry(new ZipEntry(file.toPath.getFileName.toString))
        zos.write(FileUtils.readFileToByteArray(file))
        zos.closeEntry()
      }
    } finally {
      zos.close()
    }
    new ByteArrayInputStream(baos.toByteArray)
  }
}
Instead of using a ByteArrayOutputStream to buffer the contents in an array and then putting them into a ByteArrayInputStream, you can write directly to an OutputStream that Akka Streams exposes via StreamConverters.asOutputStream.
Here's a sketch solution:
import java.io.OutputStream
import java.nio.file.Files
import java.util.zip.{ZipEntry, ZipOutputStream}
import scala.annotation.tailrec
import akka.stream.scaladsl.{Source, StreamConverters}
import akka.util.ByteString

def zip() = Action {
  // Create a Source that listens to an OutputStream
  // and pass that OutputStream to the `fileByteData` method.
  val zipSource: Source[ByteString, Unit] =
    StreamConverters
      .asOutputStream()
      .mapMaterializedValue(fileByteData)

  Ok.chunked(zipSource).withHeaders(
    CONTENT_TYPE -> "application/zip",
    CONTENT_DISPOSITION -> s"attachment; filename = test.zip")
}

// Send the file data, given an OutputStream to write to.
def fileByteData(os: OutputStream): Unit = {
  val fileList = List(
    new java.io.File("C:\\Users\\a.csv"),
    new java.io.File("C:\\Users\\b.csv")
  )
  val zos = new ZipOutputStream(os)
  val buffer: Array[Byte] = new Array[Byte](2048)
  try {
    for (file <- fileList) {
      zos.putNextEntry(new ZipEntry(file.toPath.getFileName.toString))
      val fis = Files.newInputStream(file.toPath)
      try {
        @tailrec
        def zipFile(): Unit = {
          val bytesRead = fis.read(buffer)
          if (bytesRead == -1) () else {
            zos.write(buffer, 0, bytesRead)
            zipFile()
          }
        }
        zipFile()
      } finally fis.close()
      zos.closeEntry()
    }
  } finally {
    zos.close()
  }
}
This is just an outline of an approach. You'll also want to make sure:
- the threading is OK: fileByteData should run on a different thread from the one sending the response (one way to handle this is sketched below)
- the error handling is OK: all streams are closed properly if there's an error on either the server side (e.g. file not found) or the client side (early disconnect)
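For example, a minimal sketch of the threading concern, assuming a dedicated execution context for blocking I/O (blockingEc is a name introduced here, not part of the original answer):

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical dedicated pool for blocking zip work; size it for your workload.
implicit val blockingEc: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))

def zip() = Action {
  val zipSource: Source[ByteString, Unit] =
    StreamConverters
      .asOutputStream()
      .mapMaterializedValue { os =>
        // Run the blocking zip work off the request thread and close the stream
        // even if fileByteData throws, so the client isn't left hanging.
        Future(fileByteData(os)).onComplete(_ => os.close())
      }

  Ok.chunked(zipSource).withHeaders(
    CONTENT_TYPE -> "application/zip",
    CONTENT_DISPOSITION -> s"attachment; filename = test.zip")
}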
My sample code for reading a text file is:
val text = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)

var rddwithPath = text.asInstanceOf[HadoopRDD[LongWritable, Text]].mapPartitionsWithInputSplit { (inputSplit, iterator) =>
  val file = inputSplit.asInstanceOf[FileSplit]
  iterator.map { tpl => (file.getPath.toString, tpl._2.toString) }
}.reduceByKey((a, b) => a)
How can I read PDF and XML files in the same way?
PDF and XML can be parsed using Tika.
Look at Apache Tika, a content analysis toolkit, and in particular:
- https://tika.apache.org/1.9/api/org/apache/tika/parser/xml/
- http://tika.apache.org/0.7/api/org/apache/tika/parser/pdf/PDFParser.html
- https://tika.apache.org/1.9/api/org/apache/tika/parser/AutoDetectParser.html
Below is an example integration of Spark with Tika:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.input.PortableDataStream
import org.apache.tika.metadata._
import org.apache.tika.parser._
import org.apache.tika.sax.WriteOutContentHandler
import java.io._

object TikaFileParser {

  def tikaFunc(a: (String, PortableDataStream)) = {
    // a._1 is the file path as a URI ("file:/..."), so drop the "file:" prefix
    val file: File = new File(a._1.drop(5))
    val myparser: AutoDetectParser = new AutoDetectParser()
    val stream: InputStream = new FileInputStream(file)
    val handler: WriteOutContentHandler = new WriteOutContentHandler(-1)
    val metadata: Metadata = new Metadata()
    val context: ParseContext = new ParseContext()

    myparser.parse(stream, handler, metadata, context)
    stream.close()

    println(handler.toString())
    println("------------------------------------------------")
  }

  def main(args: Array[String]) {
    val filesPath = "/home/user/documents/*"
    val conf = new SparkConf().setAppName("TikaFileParser")
    val sc = new SparkContext(conf)
    val fileData = sc.binaryFiles(filesPath)
    fileData.foreach(x => tikaFunc(x))
  }
}
PDFs can be parsed in PySpark as follows:
If the PDF is stored in HDFS, use sc.binaryFiles(), since PDFs are stored in binary format.
The binary content can then be sent to pdfminer for parsing.
import io
import pdfminer
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

def return_device_content(cont):
    fp = io.BytesIO(cont)
    parser = PDFParser(fp)
    document = PDFDocument(parser)
    return document

filesPath = "/user/root/*.pdf"
fileData = sc.binaryFiles(filesPath)
file_content = fileData.map(lambda content: content[1])
file_content1 = file_content.map(return_device_content)
Further parsing can be done using the functionality provided by pdfminer.
You can simply use spark-shell with Tika and run the code below, either sequentially or distributed, depending on your use case:
spark-shell --jars tika-app-1.8.jar

val binRDD = sc.binaryFiles("/data/")
val textRDD = binRDD.map(file => new org.apache.tika.Tika().parseToString(file._2.open()))
textRDD.saveAsTextFile("/output/")
System.exit(0)
I am using Play 2.3 and want to store uploaded files in S3, so I use the Play-S3 module.
However, I got stuck because I need to create a BucketFile to upload to S3 with this module, and a BucketFile is created from an in-memory Array[Byte] of the file, while the Play body parser gives me a temporary file on disk. How can I put this file into a BucketFile?
Here is my controller Action:
def upload = Action.async(parse.multipartFormData) { implicit request =>
  request.body.file("file").map { file =>
    implicit val credential = AwsCredentials.fromConfiguration
    val bucket = S3("bucketName")
    val result = bucket + BucketFile(file.filename, file.contentType.get, file.ref.file.toString.getBytes)
    result.map { unit =>
      Ok("File uploaded")
    }
  }.getOrElse {
    Future.successful {
      Redirect(routes.Application.index).flashing(
        "error" -> "Missing file"
      )
    }
  }
}
This code does not work because file.ref.file.toString() does not return the contents of the file, only a string representation of the File object (its path).
Import the following:
import java.nio.file.{Paths, Files}
To create the Array[Byte] do:
val byteArray = Files.readAllBytes(Paths.get(file.ref.file.getPath))
Then upload with:
BucketFile(file.filename, file.contentType.get, byteArray, None, None)
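Putting it together in the original action (a sketch based on the question's code, unchanged apart from the byte-array fix):

def upload = Action.async(parse.multipartFormData) { implicit request =>
  request.body.file("file").map { file =>
    implicit val credential = AwsCredentials.fromConfiguration
    val bucket = S3("bucketName")
    // Read the temporary file on disk into memory and hand the bytes to BucketFile
    val byteArray = Files.readAllBytes(Paths.get(file.ref.file.getPath))
    val result = bucket + BucketFile(file.filename, file.contentType.get, byteArray, None, None)
    result.map(_ => Ok("File uploaded"))
  }.getOrElse {
    Future.successful {
      Redirect(routes.Application.index).flashing("error" -> "Missing file")
    }
  }
}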
In Scala, how does one uncompress the text contained in file.gz so that it can be processed? I would be happy with either having the contents of the file stored in a variable, or saving it as a local file so that it can be read by the program afterwards.
Specifically, I am using Scalding to process compressed log data, but Scalding does not define a way to read them in FileSource.scala.
Here's my version:
import java.io.BufferedReader
import java.io.InputStreamReader
import java.util.zip.GZIPInputStream
import java.io.FileInputStream

class BufferedReaderIterator(reader: BufferedReader) extends Iterator[String] {
  override def hasNext() = reader.ready
  override def next() = reader.readLine()
}

object GzFileIterator {
  def apply(file: java.io.File, encoding: String) = {
    new BufferedReaderIterator(
      new BufferedReader(
        new InputStreamReader(
          new GZIPInputStream(
            new FileInputStream(file)), encoding)))
  }
}
Then do:
val iterator = GzFileIterator(new java.io.File("test.txt.gz"), "UTF-8")
iterator.foreach(println)
Could anyone post a simple snippet that zips a set of files in Scala?
The files are text files, so compression would be nice rather than just archiving them.
I have the filenames stored in an iterable.
There's not currently any way to do this kind of thing from the standard Scala library, but it's pretty easy to use java.util.zip:
def zip(out: String, files: Iterable[String]) = {
  import java.io.{ BufferedInputStream, FileInputStream, FileOutputStream }
  import java.util.zip.{ ZipEntry, ZipOutputStream }

  val zip = new ZipOutputStream(new FileOutputStream(out))

  files.foreach { name =>
    zip.putNextEntry(new ZipEntry(name))
    val in = new BufferedInputStream(new FileInputStream(name))
    var b = in.read()
    while (b > -1) {
      zip.write(b)
      b = in.read()
    }
    in.close()
    zip.closeEntry()
  }
  zip.close()
}
I'm focusing on simplicity instead of efficiency here (no error checking and reading and writing one byte at a time isn't ideal), but it works, and can very easily be improved.
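As one example of an easy improvement (a sketch, assuming Java 9+ for InputStream.transferTo), the per-byte loop can be replaced with a buffered copy:

// inside the files.foreach { name => ... } loop, replacing the read/write loop
val in = new BufferedInputStream(new FileInputStream(name))
try in.transferTo(zip) // copies with an internal buffer instead of one byte at a time
finally in.close()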
I recently had to work with zip files too and found this very nice utility: https://github.com/zeroturnaround/zt-zip
Here's an example of zipping all files inside a directory:
import java.io.File
import org.zeroturnaround.zip.ZipUtil

ZipUtil.pack(new File("/tmp/demo"), new File("/tmp/demo.zip"))
Very convenient.
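The reverse direction works the same way; a minimal sketch (the output directory name is just an example):

// Unzip an archive into a directory
ZipUtil.unpack(new File("/tmp/demo.zip"), new File("/tmp/demo-out"))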
This is a bit more Scala-style, in case you prefer a functional approach:
def compress(zipFilepath: String, files: List[File]) {
  // Lazily read the file character by character (fine for the single-byte text files in question)
  def readByte(bufferedReader: BufferedReader): Stream[Int] = {
    bufferedReader.read() #:: readByte(bufferedReader)
  }

  val zip = new ZipOutputStream(new FileOutputStream(zipFilepath))
  try {
    for (file <- files) {
      // add a zip entry to the output stream
      zip.putNextEntry(new ZipEntry(file.getName))
      val in = Source.fromFile(file.getCanonicalPath).bufferedReader()
      try {
        readByte(in).takeWhile(_ > -1).toList.foreach(zip.write(_))
      }
      finally {
        in.close()
      }
      zip.closeEntry()
    }
  }
  finally {
    zip.close()
  }
}
and don't forget the imports:
import java.io.{BufferedReader, FileOutputStream, File}
import java.util.zip.{ZipEntry, ZipOutputStream}
import io.Source
Travis's answer is correct, but I have tweaked it a little to get a faster version of his code:
val Buffer = 2 * 1024

def zip(out: String, files: Iterable[String], retainPathInfo: Boolean = true) = {
  val data = new Array[Byte](Buffer)
  val zip = new ZipOutputStream(new FileOutputStream(out))

  files.foreach { name =>
    if (!retainPathInfo)
      // strip the directory part of the path and keep only the file name
      zip.putNextEntry(new ZipEntry(name.splitAt(name.lastIndexOf(File.separatorChar) + 1)._2))
    else
      zip.putNextEntry(new ZipEntry(name))
    val in = new BufferedInputStream(new FileInputStream(name), Buffer)
    var b = in.read(data, 0, Buffer)
    while (b != -1) {
      zip.write(data, 0, b)
      b = in.read(data, 0, Buffer)
    }
    in.close()
    zip.closeEntry()
  }
  zip.close()
}
A slightly modified (shorter) version using NIO.2:
private def zip(out: Path, files: Iterable[Path]) = {
  val zip = new ZipOutputStream(Files.newOutputStream(out))

  files.foreach { file =>
    zip.putNextEntry(new ZipEntry(file.toString))
    Files.copy(file, zip)
    zip.closeEntry()
  }
  zip.close()
}
As suggested by Gabriele Petronella, you additionally need to add the Maven dependency below to pom.xml, as well as the following imports:

import org.zeroturnaround.zip.ZipUtil
import java.io.File

<dependency>
    <groupId>org.zeroturnaround</groupId>
    <artifactId>zt-zip</artifactId>
    <version>1.13</version>
    <type>jar</type>
</dependency>
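If the project builds with sbt instead of Maven, the equivalent dependency would presumably be:

// build.sbt
libraryDependencies += "org.zeroturnaround" % "zt-zip" % "1.13"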