Read from GZIPInputStream to String without using Source - scala

I am using Scala. I need to read a large gzip file and turn it into a String, and I need to remove the first line.
This is how I read the file:
val fis = new FileInputStream(filename)
val gz = new GZIPInputStream(fis)
Then I tried Source.fromInputStream(gz).getLines.drop(1).mkString(""), but it causes an out-of-memory error.
So I am thinking of reading it line by line, maybe putting the lines into a byte array, and then converting that into a single String at the end.
But I have no idea how to do this. Any suggestion, or any better method, is welcome.

If your gzipped file is huge, you can go with a BufferedReader. Here is an example: it copies all the characters from the gzipped file to an uncompressed one, but skips the first line.
import java.util.zip.GZIPInputStream
import java.io._
import java.nio.charset.StandardCharsets
import scala.annotation.tailrec
import scala.util.Try
val bufferSize = 4096
val pathToGzFile = "/tmp/text.txt.gz"
val pathToOutputFile = "/tmp/text_without_first_line.txt"
val charset = StandardCharsets.UTF_8
val inStream = new FileInputStream(pathToGzFile)
val outStream = new FileOutputStream(pathToOutputFile)
try {
val inGzipStream = new GZIPInputStream(inStream)
val inReader = new InputStreamReader(inGzipStream, charset)
val outWriter = new OutputStreamWriter(outStream, charset)
val bufferedReader = new BufferedReader(inReader)
val closeables = Array[Closeable](inGzipStream, inReader,
outWriter, bufferedReader)
// Read first line, so copy method will not get this - it will be skipped
val firstLine = bufferedReader.readLine()
println(s"First line: $firstLine")
@tailrec
def copy(in: Reader, out: Writer, buffer: Array[Char]): Unit = {
// Copy while it's not end of file
val readChars = in.read(buffer, 0, buffer.length)
if (readChars > 0) {
out.write(buffer, 0, readChars)
copy(in, out, buffer)
}
}
// Copy chars from bufferReader to outWriter using buffer
copy(bufferedReader, outWriter, Array.ofDim[Char](bufferSize))
// Close all closeables
closeables.foreach(c => Try(c.close()))
}
finally {
Try(inStream.close())
Try(outStream.close())
}
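If you really need the result as a single String in memory, as the question asks, the same streaming approach can append into a StringBuilder instead of writing a file. This is only a minimal sketch under the same assumptions (UTF-8, a 4096-char buffer); it still needs enough heap to hold the final String, and it keeps the original line breaks.
import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPInputStream
import scala.annotation.tailrec
import scala.util.Try

def gzipToStringSkippingFirstLine(path: String): String = {
  val reader = new BufferedReader(new InputStreamReader(
    new GZIPInputStream(new FileInputStream(path)), StandardCharsets.UTF_8))
  try {
    reader.readLine() // read and discard the first line
    val builder = new java.lang.StringBuilder()
    val buffer = new Array[Char](4096)
    @tailrec
    def append(): Unit = {
      val readChars = reader.read(buffer, 0, buffer.length)
      if (readChars > 0) {
        builder.append(buffer, 0, readChars)
        append()
      }
    }
    append()
    builder.toString
  } finally Try(reader.close())
}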

Related

download gz file on clicking a url and convert to csv using scala

I am really struggling with the syntax here and need help...
I have a URL which, when clicked, downloads a sample.csv.gz file.
Could someone please help me fill the syntactic gaps below:
val outputFile = "C:\\sampleNew" + ".csv"
val inputFile = "C:\\sample.csv.gz"
val fileUrl = "someSamplehttpUrl"
// On hitting this Url, sample.csv.gz file should download at destination 'outputFile'
val in = new URL(fileUrl).openStream()
Files.copy(in, Paths.get(outputFile), StandardCopyOption.REPLACE_EXISTING)
val filePath = new File(outputFile)
if(filePath.exists()) filePath.delete()
val fw = new FileWriter(outputFile, true)
var bf = new BufferedReader(new InputStreamReader(new GZIPInputStream(new FileInputStream(inputFile)), "UTF-8"))
while (bf.ready()) fw.append(bf.readLine() + "\n")
I have been getting several syntax errors... Any corrections here? I basically have an HTTP GET request that returns a URL, which I must open to download this gz file.
Thanks!
Here are a few possible solutions:
import java.io.{File, PrintWriter}
import scala.io.Source
val outputFile = "out.csv"
val inputFile = "/tmp/marks.csv"
val fileUrl = s"file:///$inputFile"
// Method 1, a traditional copy from the input to the output.
val in = Source.fromURL(fileUrl)
val out = new PrintWriter(outputFile)
for (line <- in.getLines)
out.println(line)
out.close
in.close
Here is a one liner which basically pipes the data from the input to the output.
import sys.process._
import java.net.URL
val outputFile = "out.csv"
val inputFile = "/tmp/marks.csv"
val fileUrl = s"file:///$inputFile"
// Method 2, pipe the content of the URL to the output file.
(new URL(fileUrl) #> new File(outputFile)).!!
Here is a version using Files.copy
val outputFile = "out.csv"
val inputFile = "/tmp/marks.csv"
val fileUrl = s"file:///$inputFile"
import java.nio.file.{Files, Paths, StandardCopyOption}
import java.net.URL
val in = new URL(fileUrl).openStream
val out = Paths.get(outputFile)
Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING)
Hopefully one (or more) of the above will address your needs.
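None of the methods above actually decompresses the .gz part of the question. If the downloaded file really is gzipped, one way, sketched here with the paths from the question, is to wrap the stream in a GZIPInputStream and let Files.copy write the decompressed bytes:
import java.io.FileInputStream
import java.nio.file.{Files, Paths, StandardCopyOption}
import java.util.zip.GZIPInputStream

val inputFile = "C:\\sample.csv.gz"   // the downloaded archive
val outputFile = "C:\\sampleNew.csv"  // the decompressed csv

// GZIPInputStream decompresses on the fly; Files.copy streams it to disk.
val gzIn = new GZIPInputStream(new FileInputStream(inputFile))
try Files.copy(gzIn, Paths.get(outputFile), StandardCopyOption.REPLACE_EXISTING)
finally gzIn.close()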

Why does <feff> always appear in my output file when I use PrintWriter or BufferedWriter in Scala?

I read data from a file and write it to a destination after some transformations.
The case:
val schema1file = Source.fromFile(s"$path", "UTF-8")
val writer = new PrintWriter(new File(s"$targetPath"))
val lines = schema1file.getLines().toArray
for (i <- 0 until lines.length) {
val line = lines(i).toString()
// todo
writer.println(line)
}
writer.close
It works and saves my output correctly, but I always find 'FEFF' at the very beginning of the output file and at the places where I use s"$var$var1".
Could you tell me why, and how to fix it? Thanks!
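No answer was recorded for this one, but FEFF is the Unicode byte order mark (BOM): most likely the input file starts with a UTF-8 BOM, Source.fromFile hands it back as an ordinary first character, and PrintWriter faithfully writes it out again (which is also why it shows up wherever those values are interpolated). A minimal sketch that strips a leading BOM, assuming path and targetPath as in the question:
import java.io.{File, PrintWriter}
import scala.io.Source

val source = Source.fromFile(s"$path", "UTF-8")
val writer = new PrintWriter(new File(s"$targetPath"))
try {
  for ((line, index) <- source.getLines().zipWithIndex) {
    // Drop the BOM (U+FEFF) if the very first line carries one
    val clean = if (index == 0 && line.startsWith("\uFEFF")) line.drop(1) else line
    writer.println(clean)
  }
} finally {
  writer.close()
  source.close()
}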

What happens when you do java data manipulations in Spark outside of an RDD

I am reading a csv file from HDFS using Spark. It goes into an FSDataInputStream object. I can't use the textFile() method because it splits up the csv file by line feed, and I am reading a csv file that has line feeds inside the text fields. opencsv from SourceForge handles line feeds inside the cells; it's a nice project, but it accepts a Reader as input. I need to convert the stream to a String so that I can pass it to opencsv as a StringReader. So, HDFS file -> FSDataInputStream -> String -> StringReader -> an opencsv list of strings. Below is the code...
import java.io._
import org.apache.spark.sql.SQLContext
import org.apache.hadoop.fs._
import org.apache.hadoop.conf._
import com.opencsv._
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import java.lang.StringBuilder
val conf = new Configuration()
val hdfsCoreSitePath = new Path("core-site.xml")
val hdfsHDFSSitePath = new Path("hdfs-site.xml")
conf.addResource(hdfsCoreSitePath)
conf.addResource(hdfsHDFSSitePath)
val fileSystem = FileSystem.get(conf)
val csvPath = new Path("/raw_data/project_name/csv/file_name.csv")
val csvFile = fileSystem.open(csvPath)
val fileLen = fileSystem.getFileStatus(csvPath).getLen().toInt
var b = Array.fill[Byte](2048)(0)
var j = 1
val stringBuilder = new StringBuilder()
var bufferString = ""
csvFile.seek(0)
csvFile.read(b)
bufferString = new String(b, "UTF-8")
stringBuilder.append(bufferString)
while(j != -1) {b = Array.fill[Byte](2048)(0);j=csvFile.read(b);bufferString = new String(b,"UTF-8");stringBuilder.append(bufferString)}
// Trim the trailing NUL padding left over from the fixed-size reads
val stringBuilderClean = stringBuilder.substring(0, fileLen)
val reader: Reader = new StringReader(stringBuilderClean)
val csv = new CSVReader(reader)
val javaContext = new JavaSparkContext(sc)
val sqlContext = new SQLContext(sc)
val javaRDD = javaContext.parallelize(csv.readAll())
//do a bunch of transformations on the RDD
It works, but I doubt it is scalable. It makes me wonder how big a limitation it is to have a driver program which pipes all the data through one JVM. My questions to anyone very familiar with Spark are:
What happens when you do data manipulations across your whole data set like this, before it even gets dropped into the input RDD? Is it just treated like any other program, swapping out like crazy, I guess?
How would you then make any Spark program scalable? Do you always NEED to extract the data directly into an input RDD?
Your code loads the data into memory, and then the Spark driver splits it and sends each part to the executors; of course, this is not scalable.
There are two ways to resolve this.
Write a custom InputFormat to support the CSV file format.
import java.io.{InputStreamReader, IOException}
import com.google.common.base.Charsets
import com.opencsv.{CSVParser, CSVReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Seekable, Path, FileSystem}
import org.apache.hadoop.io.compress._
import org.apache.hadoop.io.{ArrayWritable, Text, LongWritable}
import org.apache.hadoop.mapred._
class CSVInputFormat extends FileInputFormat[LongWritable, ArrayWritable] with JobConfigurable {
private var compressionCodecs: CompressionCodecFactory = _
def configure(conf: JobConf) {
compressionCodecs = new CompressionCodecFactory(conf)
}
protected override def isSplitable(fs: FileSystem, file: Path): Boolean = {
val codec: CompressionCodec = compressionCodecs.getCodec(file)
if (null == codec) {
return true
}
codec.isInstanceOf[SplittableCompressionCodec]
}
@throws(classOf[IOException])
def getRecordReader(genericSplit: InputSplit, job: JobConf, reporter: Reporter): RecordReader[LongWritable, ArrayWritable] = {
reporter.setStatus(genericSplit.toString)
val delimiter: String = job.get("textinputformat.record.delimiter")
var recordDelimiterBytes: Array[Byte] = null
if (null != delimiter) {
recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8)
}
new CsvLineRecordReader(job, genericSplit.asInstanceOf[FileSplit], recordDelimiterBytes)
}
}
class CsvLineRecordReader(job: Configuration, split: FileSplit, recordDelimiter: Array[Byte])
extends RecordReader[LongWritable, ArrayWritable] {
private val compressionCodecs = new CompressionCodecFactory(job)
private val maxLineLength = job.getInt(org.apache.hadoop.mapreduce.lib.input.
LineRecordReader.MAX_LINE_LENGTH, Integer.MAX_VALUE)
private var filePosition: Seekable = _
private val file = split.getPath
private val codec = compressionCodecs.getCodec(file)
private val isCompressedInput = codec != null
private val fs = file.getFileSystem(job)
private val fileIn = fs.open(file)
private var start = split.getStart
private var pos: Long = 0L
private var end = start + split.getLength
private var reader: CSVReader = _
private var decompressor: Decompressor = _
private lazy val CSVSeparator =
if (recordDelimiter == null)
CSVParser.DEFAULT_SEPARATOR
else
recordDelimiter(0).asInstanceOf[Char]
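// Initialization: pick the underlying CSVReader source depending on whether the input is compressed and whether the codec is splittable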
if (isCompressedInput) {
decompressor = CodecPool.getDecompressor(codec)
if (codec.isInstanceOf[SplittableCompressionCodec]) {
val cIn = (codec.asInstanceOf[SplittableCompressionCodec])
.createInputStream(fileIn, decompressor, start, end, SplittableCompressionCodec.READ_MODE.BYBLOCK)
reader = new CSVReader(new InputStreamReader(cIn), CSVSeparator)
start = cIn.getAdjustedStart
end = cIn.getAdjustedEnd
filePosition = cIn
}else {
reader = new CSVReader(new InputStreamReader(codec.createInputStream(fileIn, decompressor)), CSVSeparator)
filePosition = fileIn
}
} else {
fileIn.seek(start)
reader = new CSVReader(new InputStreamReader(fileIn), CSVSeparator)
filePosition = fileIn
}
@throws(classOf[IOException])
private def getFilePosition: Long = {
if (isCompressedInput && null != filePosition) {
filePosition.getPos
}else
pos
}
private def nextLine: Option[Array[String]] = {
if (getFilePosition < end){
// readNext automatically splits the line into elements
reader.readNext() match {
case null => None
case elems => Some(elems)
}
} else
None
}
override def next(key: LongWritable, value: ArrayWritable): Boolean =
nextLine
.exists { elems =>
key.set(pos)
val lineLength = elems.foldRight(0)((a, b) => a.length + 1 + b)
pos += lineLength
value.set(elems.map(new Text(_)))
lineLength < maxLineLength
}
@throws(classOf[IOException])
def getProgress: Float =
if (start == end)
0.0f
else
Math.min(1.0f, (getFilePosition - start) / (end - start).toFloat)
override def getPos: Long = pos
override def createKey(): LongWritable = new LongWritable
override def close(): Unit = {
try {
if (reader != null) {
reader.close
}
} finally {
if (decompressor != null) {
CodecPool.returnDecompressor(decompressor)
}
}
}
override def createValue(): ArrayWritable = new ArrayWritable(classOf[Text])
}
Simple test example:
val arrayRdd = sc.hadoopFile("source path", classOf[CSVInputFormat], classOf[LongWritable], classOf[ArrayWritable],
sc.defaultMinPartitions).map(_._2.get().map(_.toString))
arrayRdd.collect().foreach(e => println(e.mkString(",")))
The other way, which I prefer, uses spark-csv written by Databricks, which supports the CSV file format well; you can find usage examples on its GitHub page.
Update for spark-csv: using univocity as the parserLib, it can handle multi-line cells:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("parserLib", "univocity")
.option("inferSchema", "true") // Automatically infer data types
.load("source path")
What happens when you do data manipulations across your whole data set like this, before it even gets dropped into the input RDD? Is it just treated like any other program, swapping out like crazy, I guess?
You load the whole dataset into local memory. So if you have the memory, it works.
How would you then make any Spark program scalable?
You have to select a data format that Spark can load, or change your application so that it can load its data format into Spark directly, or a bit of both.
In this case you could look at creating a custom InputFormat that splits on something other than newlines. I think you would also want to look at how you write your data, so it is partitioned in HDFS at record boundaries, not newlines.
However, I suspect the simplest answer is to encode the data differently: JSON Lines, escaping the newlines in the CSV file during the write, Avro, or... anything that fits better with Spark & HDFS.
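For example, with JSON Lines every record is exactly one physical line (newlines inside values are escaped as \n in the JSON strings), so the default line-based splitting is safe again. A sketch, assuming the data were re-written as .jsonl and a Spark version whose sqlContext.read supports json (1.4+); the path is hypothetical:
// Each line is one complete JSON record, so HDFS splits and line-based readers are safe.
val df = sqlContext.read.json("/raw_data/project_name/jsonl/file_name.jsonl")
// or stay at the RDD level and parse each line yourself:
val lines = sc.textFile("/raw_data/project_name/jsonl/file_name.jsonl")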

Writing data generated in scala to a text file

I was hoping somebody could help: I'm new to Scala and I'm having some issues writing my output to a text file.
I have a data table, and I've written some code to read it in one line at a time and do what I want it to do; now I need it to write that line to a text file.
So for example, I have the following table of data type
Name, Date, goX, goY, stopX, stopY
1, 12/01/01, 1166, 2299, 3300, 4477
My code takes the first characters of goX and goY and creates a new number, in this instance 1.2, and does the same for stopX and stopY, so in this case you get 3.4.
What I want to get in the text file is essentially the following:
go, stop
1.2, 3.4
and I want it to go through hundreds of lines doing this until I have a long list of on and off in the text file.
My current code is as follows; this is almost certainly not the most elegant solution, but it is my first ever Scala/Java code:
import scala.io.Source
object FT2 extends App {
for(line<-Source.fromFile("C://Users//Data.csv").getLines){
var array = line.split(",")
val gox = (array(2));
val xStringGo = gox.toString
val goX =xStringGo.dropRight(1|2)
val goy = (array(3));
val yStringGo = goy.toString
val goY = yStringGo.dropRight(1|2)
val goXY = goX+"."+goY
val stopx = (array(4));
val xStringStop = stopx.toString
val stopX =xStringStop.dropRight(1|2)
val stopy = (array(5));
val yStringStop = stopy.toString
val stopY = yStringStop.dropRight(1|2)
val stopXY = stopX+"."+stopY
val GoStop = List(goXY,stopXY)
//This is where I want to print GoStop to a text file
}
}
Any help is much appreciated!
This should do it:
import java.io._
val data = List("everything", "you", "want", "to", "write", "to", "the", "file")
val file = "whatever.txt"
val writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)))
for (x <- data) {
writer.write(x + "\n") // however you want to format it
}
writer.close()
But you can make it a little nicer by creating a method that will automatically close stuff for you:
def using[T <: Closeable, R](resource: T)(block: T => R): R = {
try { block(resource) }
finally { resource.close() }
}
using(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)))) {
writer =>
for (x <- data) {
writer.write(x + "\n") // however you want to format it
}
}
So:
using(new BufferedWriter(new OutputStreamWriter(new FileOutputStream("output.txt")))) {
writer =>
for(line <- io.Source.fromFile("input.txt").getLines) {
writer.write(line + "\n") // however you want to format it
}
}
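Applied to the question's loop, the go/stop pairs could then be written out directly. A sketch reusing the using helper from above; the output path and the header line are assumptions, and take(1) is one way to read "the first character" of each column:
using(new BufferedWriter(new OutputStreamWriter(new FileOutputStream("C://Users//GoStop.txt")))) {
  writer =>
    writer.write("go, stop\n")
    // drop(1) skips the header row of the data table
    for (line <- io.Source.fromFile("C://Users//Data.csv").getLines.drop(1)) {
      val cols = line.split(",").map(_.trim)
      val goXY = cols(2).take(1) + "." + cols(3).take(1)
      val stopXY = cols(4).take(1) + "." + cols(5).take(1)
      writer.write(goXY + ", " + stopXY + "\n")
    }
}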

How to delete the last line of the file in scala?

I am trying to append to a file, but first I want to delete the last line and then start appending. However, I can't figure out how to delete the last line of the file.
I am appending the file as follows:
val fw = new FileWriter("src/file.txt", true) ;
fw.write("new item");
Can anybody please help me?
EDIT:
val lines_list = Source.fromFile("src/file.txt").getLines().toList
val new_lines = lines_list.dropRight(1)
val pw = new PrintWriter(new File("src/file.txt" ))
new_lines.foreach(pw.write); pw.write("\n")
pw.close()
After following your method, I am trying to write back to the file, but when I do this all the contents (with the last line deleted) end up on a single line; I want them on separate lines.
For very large files a simple solution relies on OS-level tools, for instance sed (the stream editor), so consider a call like this,
import sys.process._
Seq("sed", "-i", "$ d", "src/file1.txt").!
which will remove the last line of the text file. This approach is not very Scala-ish, yet it solves the problem without leaving Scala.
Return a random access file positioned at the start of the last line:
import java.io.{RandomAccessFile, File}
def randomAccess(file: File) = {
val random = new RandomAccessFile(file, "rw")
val result = findLastLine(random, 0, 0)
random.seek(result)
random
}
def findLastLine(random: RandomAccessFile, position: Long, previous: Long): Long = {
val pointer = random.getFilePointer
if (random.readLine == null) {
previous
} else {
findLastLine(random, previous, pointer)
}
}
val file = new File("build.sbt")
val random = randomAccess(file)
And test:
val line = random.readLine()
println(s"Last line: $line")
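To actually delete the last line and then append, as the question asks, one possible follow-up, sketched here with setLength (the "new item" text comes from the question), is to truncate the file at that position before writing:
import java.nio.charset.StandardCharsets

// Open a fresh handle positioned at the start of the last line, cut the file there, then append.
val raf = randomAccess(new File("build.sbt"))
raf.setLength(raf.getFilePointer)                        // drop the last line
raf.write("new item\n".getBytes(StandardCharsets.UTF_8)) // append the new content
raf.close()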
My scala is way off, so people can probably give you a nicer solution:
import scala.io.Source
import java.io._
object Test00 {
def main(args: Array[String]) = {
val lines = Source.fromFile("src/file.txt").getLines().toList.dropRight(1)
val pw = new PrintWriter(new File("src/out.txt" ))
(lines :+ "another line").foreach(pw.println)
pw.close()
}
}
Sorry for the hardcoded appending; I used it just to test that everything worked fine.