Uncompress and read a gzip file in Scala

In Scala, how does one uncompress the text contained in file.gz so that it can be processed? I would be happy with either having the contents of the file stored in a variable or saving it as a local file so that it can be read by the program afterwards.
Specifically, I am using Scalding to process compressed log data, but Scalding does not define a way to read them in FileSource.scala.

Here's my version:
import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.util.zip.GZIPInputStream

class BufferedReaderIterator(reader: BufferedReader) extends Iterator[String] {
  // ready() reports whether a read would not block; for a local gzip file this is a
  // pragmatic end-of-stream check, though not a bulletproof one for arbitrary streams
  override def hasNext = reader.ready
  override def next() = reader.readLine()
}

object GzFileIterator {
  def apply(file: java.io.File, encoding: String) = {
    new BufferedReaderIterator(
      new BufferedReader(
        new InputStreamReader(
          new GZIPInputStream(
            new FileInputStream(file)), encoding)))
  }
}
Then do:
val iterator = GzFileIterator(new java.io.File("test.txt.gz"), "UTF-8")
iterator.foreach(println)
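If you want the whole decompressed file in a single variable instead of iterating over it, one hedged option (it loads everything into memory) is to collect the iterator:
val contents: String = GzFileIterator(new java.io.File("test.txt.gz"), "UTF-8").mkString("\n")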

How to unzip a file in a fs2 stream

Given a gzipped file, say t.gz, I want to be able to read its contents line by line.
I've been able to read the contents using:
Source.fromInputStream(new GZIPInputStream(new BufferedInputStream(new FileInputStream(s))))
However, I'm looking for a way to process these files in a functional style instead, which has brought me to fs2.
If I unzip the file, I can do something like this:
import cats.effect._
import fs2.io.file.{Files, Path}
import fs2.{Stream, text, compression, io}
object Main extends IOApp.Simple {
  def doThing(inPath: Path): Stream[IO, Unit] = {
    Files[IO]
      .readAll(inPath)
      .through(text.utf8.decode)
      .through(text.lines)
      .map(line => line)
      .intersperse("\n")
      .through(text.utf8.encode)
      .through(io.stdout)
  }

  val run = doThing(Path("t")).compile.drain
}
where we just go to the console in the end for simplicity.
If instead I leave it in the zipped format, I can't quite seem to find anywhere that shows how these operations would fit together to provide this as a Stream.
fs2 seems to have a Compression object (https://www.javadoc.io/doc/co.fs2/fs2-docs_2.13/latest/fs2/compression/Compression.html) that looks like it should do what is needed, but if it does, I haven't figured out how to integrate it.
As such, the question is this: How do I read a zipped file into a stream to work with fs2 in a functional paradigm?
You probably want this:
// depending on your fs2 version, you may also need: import fs2.compression.Compression
object Main extends IOApp.Simple {
  def doThing(inPath: Path): Stream[IO, Unit] = {
    Files[IO]
      .readAll(inPath)
      .through(Compression[IO].gunzip())
      .flatMap(_.content) // gunzip() emits a GunzipResult; .content is the decompressed byte stream
      .through(text.utf8.decode)
      .through(text.lines)
      .map(line => line)
      .intersperse("\n")
      .through(text.utf8.encode)
      .through(io.stdout)
  }

  override final val run =
    doThing(Path("t")).compile.drain
}
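A hedged variant of the same pipeline, in case you want the decompressed bytes written to another file rather than stdout (gunzipToFile and outPath are illustrative names; Files[IO].writeAll comes from fs2-io):
def gunzipToFile(inPath: Path, outPath: Path): Stream[IO, Nothing] =
  Files[IO]
    .readAll(inPath)
    .through(Compression[IO].gunzip())
    .flatMap(_.content)
    .through(Files[IO].writeAll(outPath))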

Decompressing .Z file stored in Azure ADLS Gen2

I have a .Z file stored in Azure ADLS Gen2. I want to decompress the file in ADLS. I tried decompressing using ADF and C#, but found that .Z is not supported. I also tried using the Apache Commons Compress library for decompression, but I was unable to read the file into an InputStream.
Does anyone have an idea how we can decompress the file using the Apache library in Scala?
Some .Z files are actually gzip data, so you could first try this approach:
import java.io.{BufferedReader, File, FileInputStream, InputStreamReader}
import java.util.zip.GZIPInputStream
object UnzipFiles {
  def decompressGzipOrZFiles(file: File, encode: String): BufferedReader = {
    val fis = new FileInputStream(file)
    val gzis = new GZIPInputStream(fis)
    val isr = new InputStreamReader(gzis, encode)
    new BufferedReader(isr)
  }

  def main(args: Array[String]): Unit = {
    val path = new File("/home/cloudera/files/my_file.Z")
    // print to the console
    decompressGzipOrZFiles(path, "UTF-8").lines().toArray.foreach(println)
  }
}
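If GZIPInputStream complains that the data is not in GZIP format, the file is most likely a true compress(1) .Z archive (LZW), which the JDK cannot read. Apache Commons Compress, which the question already mentions, handles that format with ZCompressorInputStream; a minimal sketch, assuming commons-compress is on the classpath (decompressZFile is an illustrative name):
import java.io.{BufferedInputStream, BufferedReader, FileInputStream, InputStreamReader}
import org.apache.commons.compress.compressors.z.ZCompressorInputStream

def decompressZFile(file: java.io.File, encoding: String): BufferedReader = {
  // ZCompressorInputStream understands the LZW format produced by the Unix compress tool
  val zIn = new ZCompressorInputStream(new BufferedInputStream(new FileInputStream(file)))
  new BufferedReader(new InputStreamReader(zIn, encoding))
}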
Or you could follow this approach too:
def uncompressGzip(myFileDotZorGzip: String): Unit = {
  import java.io.FileInputStream
  import java.util.zip.GZIPInputStream

  val gzipInputStream = new GZIPInputStream(new FileInputStream(myFileDotZorGzip))
  try {
    val buffer = new Array[Byte](128)
    var bytesRead = gzipInputStream.read(buffer)
    while (bytesRead != -1) {
      // do something with the decompressed bytes; here we just print them as text
      print(new String(buffer, 0, bytesRead))
      bytesRead = gzipInputStream.read(buffer)
    }
  } finally {
    gzipInputStream.close()
  }
}
I hope this helps.

Scalacache with redis support

I am trying to integrate Redis with ScalaCache. Keys are usually strings, but values can be objects, Set[String], etc. The cache is initialized by this:
val cache: RedisCache = RedisCache(config.host, config.port)
private implicit val scalaCache: ScalaCache[Array[Byte]] = ScalaCache(cacheService.cache)
But while calling put, I am getting this error: "Could not find any Codecs for type Set[String] and Repr". It looks like I need to provide a codec for my cache input, as suggested here, so I added:
class A extends Codec[Set[String], Array[Byte]] with GZippingBinaryCodec[Set[String]]
Even after adding class A, I am getting the same error. What am I missing?
As you mentioned in the link, you can either serialize values in a binary format:
import scalacache.serialization.binary._
or as JSON using circe:
import scalacache.serialization.circe._
import io.circe.generic.auto._
It looks like it's solved in the next release by binary and circe serialization. I am on version 10 and solved it with the following:
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

implicit object SetBinaryCodec extends Codec[Any, Array[Byte]] {
  override def serialize(value: Any): Array[Byte] = {
    val stream: ByteArrayOutputStream = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(stream)
    oos.writeObject(value)
    oos.close()
    stream.toByteArray
  }

  override def deserialize(data: Array[Byte]): Any = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(data))
    val value = ois.readObject
    ois.close()
    value
  }
}
Perks of being up to date. I will upgrade the version; I posted this just in case somebody needs it.

How to read PDF files and XML files in Apache Spark (Scala)?

My sample code for reading a text file is:
val text = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)
var rddwithPath = text.asInstanceOf[HadoopRDD[LongWritable, Text]].mapPartitionsWithInputSplit { (inputSplit, iterator) ⇒
  val file = inputSplit.asInstanceOf[FileSplit]
  iterator.map { tpl ⇒ (file.getPath.toString, tpl._2.toString) }
}.reduceByKey((a, b) => a)
In the same way, how can I use PDF and XML files?
PDF & XML can be parsed using Tika:
look at Apache Tika - a content analysis toolkit
look at
- https://tika.apache.org/1.9/api/org/apache/tika/parser/xml/
- http://tika.apache.org/0.7/api/org/apache/tika/parser/pdf/PDFParser.html
- https://tika.apache.org/1.9/api/org/apache/tika/parser/AutoDetectParser.html
Below is example integration of Spark with Tika :
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.input.PortableDataStream
import org.apache.tika.metadata._
import org.apache.tika.parser._
import org.apache.tika.sax.WriteOutContentHandler
import java.io._
object TikaFileParser {
  def tikaFunc(a: (String, PortableDataStream)) = {
    val file: File = new File(a._1.drop(5))
    val myparser: AutoDetectParser = new AutoDetectParser()
    val stream: InputStream = new FileInputStream(file)
    val handler: WriteOutContentHandler = new WriteOutContentHandler(-1)
    val metadata: Metadata = new Metadata()
    val context: ParseContext = new ParseContext()

    myparser.parse(stream, handler, metadata, context)
    stream.close

    println(handler.toString())
    println("------------------------------------------------")
  }

  def main(args: Array[String]) {
    val filesPath = "/home/user/documents/*"
    val conf = new SparkConf().setAppName("TikaFileParser")
    val sc = new SparkContext(conf)
    val fileData = sc.binaryFiles(filesPath)
    fileData.foreach(x => tikaFunc(x))
  }
}
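If you want to keep the extracted text instead of just printing it, a hedged variant (tikaToText is an illustrative name) returns (path, text) pairs, so you can build an RDD directly from fileData:
def tikaToText(a: (String, PortableDataStream)): (String, String) = {
  val stream = a._2.open() // read the content directly from the PortableDataStream
  val handler = new WriteOutContentHandler(-1)
  try new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext())
  finally stream.close()
  (a._1, handler.toString)
}
// usage: val parsedTexts = fileData.map(tikaToText)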
PDF can be parsed in PySpark as follows:
If the PDF is stored in HDFS, use sc.binaryFiles(), since the PDF is stored in binary format.
Then the binary content can be sent to pdfminer for parsing.
import io
import pdfminer
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

def return_device_content(cont):
    fp = io.BytesIO(cont)
    parser = PDFParser(fp)
    document = PDFDocument(parser)
    return document

filesPath = "/user/root/*.pdf"
fileData = sc.binaryFiles(filesPath)
file_content = fileData.map(lambda content: content[1])
file_content1 = file_content.map(return_device_content)
Further parsing can be done using the functionality provided by pdfminer.
You can simply use spark-shell with Tika and run the code below, either sequentially or in a distributed manner, depending on your use case:
spark-shell --jars tika-app-1.8.jar
val binRDD = sc.binaryFiles("/data/")
val textRDD = binRDD.map(file => new org.apache.tika.Tika().parseToString(file._2.open()))
textRDD.saveAsTextFile("/output/")
System.exit(0)

How to convert Source[ByteString, Any] to InputStream

akka-http represents a file uploaded using multipart/form-data encoding as Source[ByteString, Any]. I need to unmarshal it using a Java library that expects an InputStream.
How can a Source[ByteString, Any] be turned into an InputStream?
As of version 2.x you achieve this with the following code:
import akka.stream.scaladsl.StreamConverters
...
val inputStream: InputStream = entity.dataBytes
  .runWith(
    StreamConverters.asInputStream(FiniteDuration(3, TimeUnit.SECONDS))
  )
See: http://doc.akka.io/docs/akka-stream-and-http-experimental/2.0.1/scala/migration-guide-1.0-2.x-scala.html
Note: was broken in version 2.0.2 and fixed in 2.4.2
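For reference, a self-contained sketch of the same idea (InputStreamExample, source, and the actor-system name are illustrative; assumes an Akka Streams 2.4.x setup where ActorMaterializer is available):
import java.io.InputStream
import scala.concurrent.duration._
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Source, StreamConverters}
import akka.util.ByteString

object InputStreamExample extends App {
  implicit val system = ActorSystem("example")
  implicit val materializer = ActorMaterializer()

  // a stand-in for the Source[ByteString, Any] you get from akka-http
  val source: Source[ByteString, Any] = Source.single(ByteString("hello"))

  // materializing StreamConverters.asInputStream yields a java.io.InputStream
  val inputStream: InputStream = source.runWith(StreamConverters.asInputStream(3.seconds))
}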
You could try using an OutputStreamSink that writes to a PipedOutputStream and feed that into a PipedInputStream that your other code uses as its input stream. It's a rough idea, but it could work. The code would look like this:
import akka.util.ByteString
import akka.stream.scaladsl.Source
import java.io.PipedInputStream
import java.io.PipedOutputStream
import akka.stream.io.OutputStreamSink
import java.io.BufferedReader
import java.io.InputStreamReader
import akka.actor.ActorSystem
import akka.stream.ActorFlowMaterializer
object PipedStream extends App {
  implicit val system = ActorSystem("flowtest")
  implicit val mater = ActorFlowMaterializer()

  val lines = for (i <- 1 to 100) yield ByteString(s"This is line $i\n")
  val source = Source(lines)

  val pipedIn = new PipedInputStream()
  val pipedOut = new PipedOutputStream(pipedIn)
  val flow = source.to(OutputStreamSink(() => pipedOut))
  flow.run()

  val reader = new BufferedReader(new InputStreamReader(pipedIn))
  var line: String = reader.readLine
  while (line != null) {
    println(s"Reader received line: $line")
    line = reader.readLine
  }
}
You could extract an iterator from the ByteString and then get an InputStream from it. Something like this (pseudocode):
source.map { data: ByteString =>
  data.iterator.asInputStream
}
Update
A more elaborated sample starting with a Multipart.FormData
def isSourceFromFormData(formData: Multipart.FormData): Source[InputStream, Any] =
  formData.parts.map { part =>
    part.entity.dataBytes
      .map(_.iterator.asInputStream)
  }.flatten(FlattenStrategy.concat)