Decompress data in Scala using gzip

When I try to decompress a gzip file I get an error.
My code:
val file_inp = new FileInputStream("Textfile.txt.gzip")
val file_out = new FileOutputStream("Textfromgzip.txt")
val gzInp = new GZIPInputStream(new BufferedInputStream(file_inp))
while (gzInp.available != -1) {
  file_out.write(gzInp.read)
}
file_out.close()
Output:
scala:25: error: ambiguous reference to overloaded definition,
both method read in class GZIPInputStream of type (x$1: Array[Byte], x$2: Int, x$3:
Int)Int
and method read in class InflaterInputStream of type ()Int
match expected type ?
file_out.write(gzInp.read)
^
one error found
If anyone knows about this error, please help me.

Working model for gzip compression and decompression in Scala:
import java.io.{FileInputStream, FileOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// create a text file and write some sample lines
val file_txt = new FileOutputStream("Textfile1.txt")
file_txt.write("hello this is a sample text ".getBytes)
file_txt.write(10) // 10 is the byte value of '\n' (newline)
file_txt.write("hello it is second line".getBytes)
file_txt.write(10)
file_txt.write("the good day".getBytes)
file_txt.close()

// gzip the file (readAllBytes requires Java 9+)
val file_ip = new FileInputStream("Textfile1.txt")
val file_gzOut = new GZIPOutputStream(new FileOutputStream("Textfile1.txt.gz"))
file_gzOut.write(file_ip.readAllBytes)
file_ip.close()
file_gzOut.close()

// extract the gzip file
val gzFile = new GZIPInputStream(new FileInputStream("Textfile1.txt.gz"))
val txtFile = new FileOutputStream("Textfrom_gz.txt")
txtFile.write(gzFile.readAllBytes)
gzFile.close()
txtFile.close()
Don't forget to close the streams; leaving them open leads to a "truncated gzip input" error.
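As for the error in the question itself: the bare gzInp.read is ambiguous between the no-argument and the array overloads, and available is not a reliable end-of-stream test. A minimal sketch of a corrected copy loop, reusing the file names from the question:
import java.io.{BufferedInputStream, FileInputStream, FileOutputStream}
import java.util.zip.GZIPInputStream

val gzInp = new GZIPInputStream(new BufferedInputStream(new FileInputStream("Textfile.txt.gzip")))
val fileOut = new FileOutputStream("Textfromgzip.txt")
var b = gzInp.read() // explicit () selects the no-argument overload, which returns one byte as an Int
while (b != -1) {    // read() returns -1 at end of stream
  fileOut.write(b)
  b = gzInp.read()
}
gzInp.close()
fileOut.close()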

Related

Scala: How to get the content of PortableDataStream instance from an RDD

As I want to extract data from binary files, I read them using
val dataRDD = sc.binaryRecord("Path")
and I get the result as org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)].
I want to extract the content of my files, which comes in the form of a PortableDataStream.
For that I tried:
val data = dataRDD.map(x => x._2.open()).collect()
but I get the following error:
java.io.NotSerializableException: org.apache.hadoop.hdfs.client.HdfsDataInputStream
If you have an idea how I can solve my issue, please help!
Many thanks in advance.
Actually, the PortableDataStream is Serializable; that's what it is meant for. Yet, open() returns a plain DataInputStream (an HdfsDataInputStream in your case, because your file is on HDFS), which is not Serializable, hence the error you get.
In fact, when you open the PortableDataStream, you just need to read the data right away. In Scala, you can use scala.io.Source.fromInputStream:
val data: RDD[Array[String]] = sc
  .binaryFiles("path/.../")
  .map { case (fileName, pds) =>
    scala.io.Source.fromInputStream(pds.open())
      .getLines().toArray
  }
This code assumes that the data is textual. If it is not, you can adapt it to read any kind of binary data. Here is an example that builds a sequence of bytes, which you can then process however you want.
val rdd: RDD[Seq[Byte]] = sc.binaryFiles("...")
  .map { case (file, pds) =>
    val dis = pds.open()
    val buf = Array.ofDim[Byte](1024)
    val all = scala.collection.mutable.ArrayBuffer[Byte]()
    var n = dis.read(buf)
    while (n != -1) {
      all ++= buf.take(n) // append only the bytes actually read, not the whole buffer
      n = dis.read(buf)
    }
    dis.close()
    all.toSeq
  }
See the javadoc of DataInputStream for more possibilities. For instance, it possesses readLong, readDouble (and so on) methods.
val bf = sc.binaryFiles("...")
val bytes = bf.map { case (file, pds) =>
  val dis = pds.open()
  val len = dis.available() // assumes the full length is reported up front
  val buf = Array.ofDim[Byte](len)
  dis.readFully(buf) // read from the stream that was already opened above
  dis.close()
  buf
}
bytes: org.apache.spark.rdd.RDD[Array[Byte]] = MapPartitionsRDD[21] at map at <console>:26
scala> bytes.take(1)(0).size
res15: Int = 5879609 // this happened to be the size of my first binary file
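To illustrate the readDouble remark above, here is a purely hypothetical sketch that assumes each binary file contains nothing but big-endian IEEE-754 doubles (it reuses the available()-based length estimate from the snippet above):
val doubles: RDD[Array[Double]] = sc.binaryFiles("...")
  .map { case (file, pds) =>
    val dis = pds.open()
    val count = dis.available() / 8 // 8 bytes per double
    val values = Array.fill(count)(dis.readDouble()) // readDouble consumes 8 bytes per call
    dis.close()
    values
  }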

Scala Either[type1, type2]

Here is a working example for the usage of Either:
val a: Either[Int, String] = {
  if (true)
    Left(42) // return an Int
  else
    Right("Hello, world") // return a String
}
But the snippet below doesn't work. The condition text just determines whether the input file is a text file or a Parquet file:
val a: Either[org.apache.spark.rdd.RDD[String], org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]] = {
  if (text)
    spark.sparkContext.textFile(input_path + "/lineitem.tbl") // read in text file as RDD
  else
    sparkSession.read.parquet(input_path + "/lineitem").rdd // read in parquet file as DF, convert to RDD
}
It gives me type mismatch errors:
<console>:33: error: type mismatch;
found : org.apache.spark.rdd.RDD[String]
required: scala.util.Either[org.apache.spark.rdd.RDD[String],org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]]
spark.sparkContext.textFile(input_path + "/lineitem.tbl") // read in text file as rdd
^
<console>:35: error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: scala.util.Either[org.apache.spark.rdd.RDD[String],org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]]
sparkSession.read.parquet(input_path + "/lineitem").rdd //read in parquet file as df, convert to rdd
Your working example tells you exactly what to do: just wrap the two expressions returned by Spark in Left and Right:
val a: Either[org.apache.spark.rdd.RDD[String], org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]] = {
  if (text)
    Left(spark.sparkContext.textFile(input_path + "/lineitem.tbl")) // read in text file as RDD
  else
    Right(sparkSession.read.parquet(input_path + "/lineitem").rdd) // read in parquet file as DF, convert to RDD
}
Left and Right are two classes that both extend Either. You can create instances with new Left(expression) and new Right(expression). Since both of them are case classes, the new keyword can be omitted, and you simply write Left(expression) and Right(expression).
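A minimal sketch of how the Either can be consumed afterwards, either by pattern matching or with fold (count() is just a stand-in for whatever you do with the RDD):
val rowCount: Long = a match {
  case Left(textRdd)     => textRdd.count()    // RDD[String] branch
  case Right(parquetRdd) => parquetRdd.count() // RDD[Row] branch
}
// or equivalently:
val rowCount2: Long = a.fold(_.count(), _.count())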

Saving data to sequence file

I'm trying to do some filtering on a sequence file and save the result back to another sequence file, for example:
val subset = ???
val hc = sc.hadoopConfiguration
val serializers = List(
  classOf[WritableSerialization].getName,
  classOf[ResultSerialization].getName
).mkString(",")
hc.set("io.serializations", serializers)

subset.saveAsNewAPIHadoopFile(
  "output/sequence",
  classOf[ImmutableBytesWritable],
  classOf[Result],
  classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
  hc
)
After compilation I receive the following error:
found: Class[org.apache.hadoop.mapred.SequenceFileOutputFormat[org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.hbase.client.Result]](classOf[org.apache.hadoop.mapred.SequenceFileOutputFormat])
required: Class[_ <: org.apache.hadoop.mapreduce.OutputFormat[_, _]]
  classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
To my knowledge SequenceFileOutputFormat extends FileOutputFormat, which extends OutputFormat, but I am missing something.
Can you please help?
I raised an issue with the Spark team at https://issues.apache.org/jira/browse/SPARK-25405
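Judging from the found/required types in the compiler output, the SequenceFileOutputFormat being picked up comes from the old org.apache.hadoop.mapred package, while saveAsNewAPIHadoopFile expects a new-API org.apache.hadoop.mapreduce.OutputFormat. A hedged sketch of the import swap, assuming the rest of the snippet stays unchanged:
// new-API output format; the old-API class of the same name lives in org.apache.hadoop.mapred
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

subset.saveAsNewAPIHadoopFile(
  "output/sequence",
  classOf[ImmutableBytesWritable],
  classOf[Result],
  classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
  hc
)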

Typesafe config - parse from map/file and resolve

I can resolve substitutions when I parse the config from a string, but not when parsing from a map or a file.
import java.io.File
import com.typesafe.config.{Config, ConfigFactory}
import scala.collection.JavaConversions.mapAsJavaMap
val s: String = "a = test, b = another ${a}"
val m: Map[String, String] = Map("a" -> "test", "b" -> "another ${a}")
val f: File = new File("test.properties") // contains "a = test\nb = another ${a}"
val cs: Config = ConfigFactory.parseString(s).resolve
val cm: Config = ConfigFactory.parseMap(mapAsJavaMap(m)).resolve
val cf: Config = ConfigFactory.parseFile(f).resolve
println("b from string = " + cs.getString("b"))
println("b from map = " + cm.getString("b"))
println("b from file = " + cf.getString("b"))
> b from string = another test
> b from map = another ${a}
> b from file = another ${a}
When I do not resolve immediately, it's visible that the variable placeholders are not treated the same way:
val cs: Config = ConfigFactory.parseString(s)
val cm: Config = ConfigFactory.parseMap(mapAsJavaMap(m))
val cf: Config = ConfigFactory.parseFile(f)
> cs: com.typesafe.config.Config = Config(SimpleConfigObject({"a":"test","b":"another "${a}}))
> cm: com.typesafe.config.Config = Config(SimpleConfigObject({"a":"test","b":"another ${a}"}))
> cf: com.typesafe.config.Config = Config(SimpleConfigObject({"a":"test","b":"another ${a}"}))
I could maybe just convert map/file to string, but is there a way to make the library handle it?
The ConfigFactory.parseMap method leads to fromAnyRef; the relevant part for us is:
if (object instanceof String)
return new ConfigString.Quoted(origin, (String) object);
It never parses the value as ConfigReference, so there is no way the resolve could work.
Their rationale could be that if you "control" the data structure, you can leverage Scala string interpolation instead, as sketched right below.
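For instance (a minimal sketch, not taken from the original answer), building the substituted value yourself with an interpolated string makes resolve unnecessary for the map case:
// parseMap stores plain strings, so do the substitution in Scala instead of in HOCON
val aValue = "test"
val m2 = Map("a" -> aValue, "b" -> s"another $aValue")
val cm2: Config = ConfigFactory.parseMap(mapAsJavaMap(m2)) // no resolve needed
println(cm2.getString("b")) // another test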
For parseFile the situation is easier. Java properties files do not support ${} substitutions, and the file type is guessed from the .properties extension.
You can just use the HOCON format instead: rename the file (e.g. to test.conf), and ${} substitutions should then work out of the box.
More information here: HOCON
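A minimal sketch of that rename, assuming test.conf holds the same two keys as test.properties did:
// test.conf contains: a = test\nb = another ${a}
val cConf: Config = ConfigFactory.parseFile(new File("test.conf")).resolve()
println(cConf.getString("b")) // another test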

Scala: Source.fromInputStream failed for bigger input

I am reading data from AWS S3. The following code works fine if the input file is small, but it fails when the input file is big. Is there any parameter I can modify, such as a buffer size, so it can handle bigger input files as well? Thanks!
val s3Object = s3Client.getObject(new GetObjectRequest("myBucket", "myPath/myFile.csv"))
val myData = Source.fromInputStream(s3Object.getObjectContent()).getLines()
for (line <- myData) {
  val data = line.split(",")
  myMap.put(data(0), data(1).toDouble)
}
println(" my map : " + myMap.toString())
If you look at the source code, you can see that internally it calls Source.createBufferedSource. You can use that to create your own version with a bigger buffer size.
These are the relevant lines from the Scala standard library:
def createBufferedSource(
  inputStream: InputStream,
  bufferSize: Int = DefaultBufSize,
  reset: () => Source = null,
  close: () => Unit = null
)(implicit codec: Codec): BufferedSource = {
  // workaround for default arguments being unable to refer to other parameters
  val resetFn = if (reset == null) () => createBufferedSource(inputStream, bufferSize, reset, close)(codec) else reset
  new BufferedSource(inputStream, bufferSize)(codec) withReset resetFn withClose close
}

def fromInputStream(is: InputStream, enc: String): BufferedSource =
  fromInputStream(is)(Codec(enc))

def fromInputStream(is: InputStream)(implicit codec: Codec): BufferedSource =
  createBufferedSource(is, reset = () => fromInputStream(is)(codec), close = () => is.close())(codec)
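For example, a sketch of wiring a bigger buffer into the S3 code above (the 1 MiB size is an arbitrary assumption, not a recommendation from the original answer):
import scala.io.{Codec, Source}

val is = s3Object.getObjectContent()
val myData = Source.createBufferedSource(
  is,
  bufferSize = 1 << 20, // 1 MiB instead of the default buffer size
  close = () => is.close()
)(Codec.UTF8).getLines()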
Edit: Now that I have thought about your issue a bit more, I should add that while you can increase the buffer size this way, I'm not sure it will actually fix your issue.