Writing data generated in Scala to a text file

I was hoping somebody could help; I'm new to Scala and I'm having some issues writing my output to a text file.
I have a data table and I've written some code that reads it in one line at a time and does what I want it to do, and now I need it to write that line to a text file.
So, for example, I have the following table of data:
Name, Date, goX, goY, stopX, stopY
1, 12/01/01, 1166, 2299, 3300, 4477
My code takes the first character of goX and of goY and creates a new number, in this instance 1.2, and does the same for stopX and stopY, so in this case you get 3.4.
What I want to get in the text file is essentially the following:
go, stop
1.2, 3.4
and I want it to go through hundreds of lines doing this until I have a long list of go and stop values in the text file.
My current code is as follows; this is almost certainly not the most elegant solution, but it is my first ever Scala/Java code:
import scala.io.Source

object FT2 extends App {
  for (line <- Source.fromFile("C://Users//Data.csv").getLines) {
    val array = line.split(",")
    val gox = array(2)
    val xStringGo = gox.toString
    val goX = xStringGo.dropRight(1|2) // note: 1|2 is bitwise OR, i.e. dropRight(3) -- keeps only the leading digit of a 4-digit value
    val goy = array(3)
    val yStringGo = goy.toString
    val goY = yStringGo.dropRight(1|2)
    val goXY = goX + "." + goY
    val stopx = array(4)
    val xStringStop = stopx.toString
    val stopX = xStringStop.dropRight(1|2)
    val stopy = array(5) // was array(3), which re-read goY instead of stopY
    val yStringStop = stopy.toString
    val stopY = yStringStop.dropRight(1|2)
    val stopXY = stopX + "." + stopY
    val GoStop = List(goXY, stopXY)
    //This is where I want to print GoStop to a text file
  }
}
Any help is much appreciated!

This should do it:
import java.io._
val data = List("everything", "you", "want", "to", "write", "to", "the", "file")
val file = "whatever.txt"
val writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)))
for (x <- data) {
  writer.write(x + "\n") // however you want to format it
}
writer.close()
But you can make it a little nicer by creating a method that will automatically close stuff for you:
def using[T <: Closeable, R](resource: T)(block: T => R): R = {
  try { block(resource) }
  finally { resource.close() }
}
using(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)))) { writer =>
  for (x <- data) {
    writer.write(x + "\n") // however you want to format it
  }
}
So:
using(new BufferedWriter(new OutputStreamWriter(new FileOutputStream("output.txt")))) { writer =>
  for (line <- io.Source.fromFile("input.txt").getLines) {
    writer.write(line + "\n") // however you want to format it
  }
}
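Putting the two pieces together for your specific case, a minimal sketch could look like the following (it reuses the using helper above; the input path and column positions come from your question, while the output file name and the take(1) trick for "first character" are my assumptions, so adjust as needed):
import java.io._
import scala.io.Source
using(new BufferedWriter(new OutputStreamWriter(new FileOutputStream("C://Users//GoStop.txt")))) { writer =>
  writer.write("go, stop\n")
  for (line <- Source.fromFile("C://Users//Data.csv").getLines.drop(1)) { // drop(1) skips the header row, if there is one
    val array = line.split(",").map(_.trim)
    val goXY = array(2).take(1) + "." + array(3).take(1)   // leading digit of goX and goY
    val stopXY = array(4).take(1) + "." + array(5).take(1) // leading digit of stopX and stopY
    writer.write(goXY + ", " + stopXY + "\n")
  }
}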

Process Interaction through stdin/stdout

I am trying to build a class that starts a system process which waits for stdin. The class should have another method which takes a string, feeds it into the system process, and returns the process's output.
The reason is that starting the process involves loading a lot of data and hence takes a while.
I am trying to dummy-test this with bc, so that bc is started and waits for input. I would envision an interface like this:
case class BcWrapper(executable: File) {
  var bc: Option[???] = None
  def startBc(): Unit = bc = Some(???)
  def calc(input: String): String = bc.get.???
  def stopBc(): Unit = bc.get.???
}
I would like to be able to use it like this:
val wrapper = BcWrapper(new File("/usr/bin/bc"))
wrapper.startBc()
val result1 = wrapper.calc("1 + 1") // should be "2"
val result2 = wrapper.calc(???)
[...]
wrapper.stopBc()
This topic has been touched in multiple questions, but never fully answered for a use case like this one. This question or this one seems to come close. However, I am not sure how to implement the ProcessLogger, nor whether to use one in the first place.
Unfortunately, the Scala documentation is not very elaborate either.
Note that I do not want to read from stdin, but want to call a function.
The background is that I want to read a large file, read it line by line, preprocess the lines, pass them to the external process, and post-process the output.
You can get something similar, but simpler, like so.
import sys.process._
import util.Try
class StdInReader(val reader :String) {
  def send(input :String) :Try[String] =
    Try(s"/bin/echo $input".#|(reader).!!.trim)
}
usage:
val bc = new StdInReader("/usr/bin/bc")
bc.send("2 * 8") //res0: scala.util.Try[String] = Success(16)
bc.send("12 + 8") //res1: scala.util.Try[String] = Success(20)
bc.send("22 - 8") //res2: scala.util.Try[String] = Success(14)
Programs that exit with a non-zero exit code (bc doesn't) will result in a Failure().
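For example (purely illustrative, using grep here because bc always exits with 0):
val g = new StdInReader("grep xyz")
g.send("hello") //roughly: Failure(java.lang.RuntimeException: Nonzero exit value: 1)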
If you need more fine-grained control you might start with something like this and expand on it.
import sys.process._
class ProcHandler(val cmnd :String) {
  private val resbuf = collection.mutable.Buffer.empty[String]

  def run(data :Seq[String]) :Unit = {
    cmnd.run(new ProcessIO(
      in => {
        val writer = new java.io.PrintWriter(in)
        data.foreach(writer.println)
        writer.close()
      },
      out => {
        val src = io.Source.fromInputStream(out)
        src.getLines().foreach(resbuf += _)
        src.close()
      },
      _.close() //maybe create separate buffer for stderr?
    )).exitValue()
  }

  def results() :Seq[String] = {
    val rs = collection.mutable.Buffer.empty[String]
    resbuf.copyToBuffer(rs)
    resbuf.clear()
    rs
  }
}
usage:
val bc = new ProcHandler("/usr/bin/bc")
bc.run(List("4+5","6-2","2*5"))
bc.run(List("99/3","11*77"))
bc.results() //res0: Seq[String] = ArrayBuffer(9, 4, 10, 33, 847)
OK, I did some more research and found this. It appears to get at what you want, but there are limitations. In particular, the process stays open for input until you want to get output. At that point the IO streams are closed to ensure all buffers are flushed.
import sys.process._
import util.Try
class ProcHandler(val cmnd :String) {
  private val procInput = new java.io.PipedOutputStream()
  private val procOutput = new java.io.PipedInputStream()

  private val proc = cmnd.run( new ProcessIO(
    { in => // attach to the process's internal input stream
      val istream = new java.io.PipedInputStream(procInput)
      val buf = Array.fill(100)(0.toByte)
      Iterator.iterate(istream.read(buf)){ br =>
        in.write(buf, 0, br)
        istream.read(buf)
      }.takeWhile(_>=0).toList
      in.close()
    },
    { out => // attach to the process's internal output stream
      val ostream = new java.io.PipedOutputStream(procOutput)
      val buf = Array.fill(100)(0.toByte)
      Iterator.iterate(out.read(buf)){ br =>
        ostream.write(buf, 0, br)
        out.read(buf)
      }.takeWhile(_>=0).toList
      out.close()
    },
    _ => () // ignore stderr
  ))

  private val procO = new java.io.BufferedReader(new java.io.InputStreamReader(procOutput))
  private val procI = new java.io.PrintWriter(procInput, true)

  def feed(str :String) :Unit = procI.println(str)
  def feed(ss :Seq[String]) :Unit = ss.foreach(procI.println)

  def read() :List[String] = {
    procI.close() //close input before reading output
    val lines = Stream.iterate(Try(procO.readLine)){_ =>
      Try(procO.readLine)
    }.takeWhile(_.isSuccess).map(_.get).toList
    procO.close()
    lines
  }
}
usage:
val bc = new ProcHandler("/usr/bin/bc")
bc.feed(List("9*3","4+11")) //res0: Unit = ()
bc.feed("4*13") //res1: Unit = ()
bc.read() //res2: List[String] = List(27, 15, 52)
bc.read() //res3: List[String] = List()
OK, this is my final word on the subject. I think this ticks every item on your wish list: start the process only once, it stays alive until actively closed, allows alternating the writing and reading.
import sys.process._
class ProcHandler(val cmnd :Seq[String]) {
  private var os: java.io.OutputStream = null
  private var is: java.io.InputStream = null
  private val pio = new ProcessIO(os = _, is = _, _.close())
  private val proc = cmnd.run(pio)

  def feed(ss :String*) :Unit = {
    ss.foreach(_.foreach(os.write(_)))
    os.flush()
  }

  def ready :Boolean = is.available() > 0

  def read() :String = {
    Seq.fill[Char](is.available())(is.read().toChar).mkString
  }

  def close() :Unit = {
    proc.exitValue()
    os.close()
    is.close()
  }
}
There are still issues and much room for improvement. IO is handled at a basic level (streams), and I'm not sure what I'm doing here is completely safe and correct. The input, feed(), is required to supply the necessary newline terminations, and the output, read(), is just a raw String and not separated into a nice collection of string results.
Note that this will bleed system resources if the client code fails to close() all processes.
Note also that reading doesn't wait for content (i.e. there is no blocking): after writing, the response might not be immediately available (see the small polling sketch after the usage example below).
usage:
val bc = new ProcHandler(Seq("/usr/bin/bc","-q"))
bc.feed("44-21\n", "21*4\n")
bc.feed("67+11\n")
if (bc.ready) bc.read() else "not ready" // "23\n84\n78\n"
bc.feed("67-11\n")
if (bc.ready) bc.read() else "not ready" // "56\n"
bc.feed("67*11\n", "1+2\n")
if (bc.ready) bc.read() else "not ready" // "737\n3\n"
if (bc.ready) bc.read() else "not ready" // "not ready"
bc.close()
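If the non-blocking read is a problem in practice, one way around it is a small polling helper that waits for output up to a timeout. This is only a sketch built on the ready/read() methods above; readWhenReady, the timeout, and the poll interval are names and values I made up:
def readWhenReady(bc: ProcHandler, timeoutMs: Long = 1000L, pollMs: Long = 10L): Option[String] = {
  val deadline = System.currentTimeMillis() + timeoutMs
  while (!bc.ready && System.currentTimeMillis() < deadline) Thread.sleep(pollMs)
  if (bc.ready) Some(bc.read()) else None
}
// usage, in place of the bare bc.read() calls above:
// bc.feed("5*5\n")
// readWhenReady(bc)  // Some("25\n"), or None if nothing arrived in time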

Not able to store result in HDFS when code runs for a second iteration

Well, I am new to Spark and Scala and have been trying to implement cleaning of data in Spark. The code below checks for a missing value in one column, stores it in outputrdd, and runs loops for calculating the missing value. The code works well when there is only one missing value in the file. Since HDFS does not allow writing again to the same location, it fails if there is more than one missing value. Can you please assist in writing finalrdd to a particular location once calculating missing values for all occurrences is done?
def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("app").setMaster("local")
  val sc = new SparkContext(conf)
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val files = sc.wholeTextFiles("/input/raw_files/")
  val file = files.map { case (filename, content) => filename }
  file.collect.foreach(filename => {
    cleaningData(filename)
  })

  def cleaningData(file: String) = {
    //headers has column headers of the files
    var hdr = headers.toString()
    var vl = hdr.split("\t")
    sqlContext.clearCache()
    if (hdr.contains("COLUMN_HEADER")) {
      //Checks for missing values in the dataframe and stores the missing values in outputrdd
      if (!outputrdd.isEmpty()) {
        logger.info("value is zero then performing further operation")
        val outputdatetimedf = sqlContext.sql("select date,'/t',time from cpc where kwh = 0")
        val outputdatetimerdd = outputdatetimedf.rdd
        val strings = outputdatetimerdd.map(row => row.mkString).collect()
        for (i <- strings) {
          if (Coddition check) {
            //Calculates missing value and stores it in finalrdd
            finalrdd.map { x => x.mkString("\t") }.saveAsTextFile("/output")
            logger.info("file is written in file")
          }
        }
      }
    }
  }
}
It is not clear how (Coddition check) works in your example.
In any case, the function .saveAsTextFile("/output") should be called only once.
So I would rewrite your example into this:
val strings = outputdatetimerdd
  .map(row => row.mkString("\t")) // keep this as an RDD (no .collect()), so saveAsTextFile below still works
val finalrdd = strings
  .filter(str => Coddition check str) // don't know how this Coddition works

// this part is called only once, not in a loop
finalrdd.saveAsTextFile("/output")
logger.info("file is written in file")
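If you also need this to work across all of your input files (the outer file.collect.foreach loop), one option is to have cleaningData return its cleaned lines instead of saving them, collect those per-file RDDs, union them, and write exactly once at the end. This is only a rough sketch under that assumption (the names are hypothetical):
// assumes cleaningData(filename) now returns an RDD[String] instead of calling saveAsTextFile
val perFileResults: Seq[org.apache.spark.rdd.RDD[String]] =
  file.collect().toSeq.map(filename => cleaningData(filename))
// one union, one write -- this is the only saveAsTextFile call in the whole job
sc.union(perFileResults).saveAsTextFile("/output")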

Looping through Map Spark Scala

Within this code we have two files: athletes.csv, which contains names, and twitter.test, which contains the tweet messages. We want to find, for every single line in twitter.test, the name that matches a name in athletes.csv. We applied a map function to store the names from athletes.csv and want to compare every name against every line in the test file.
import scala.io.{Codec, Source}
import java.nio.charset.CodingErrorAction
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext

object twitterAthlete {

  def loadAthleteNames() : Map[String, String] = {
    // Handle character encoding issues:
    implicit val codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

    // Create a Map keyed by athlete name, and populate it from athletes.csv.
    var athleteInfo: Map[String, String] = Map()
    val lines = Source.fromFile("../athletes.csv").getLines()
    for (line <- lines) {
      var fields = line.split(',')
      if (fields.length > 1) {
        athleteInfo += (fields(1) -> fields(7))
      }
    }
    return athleteInfo
  }

  def parseLine(line: String): (String) = {
    var athleteInfo = loadAthleteNames()
    var hello = new String
    for ((k, v) <- athleteInfo) {
      if (line.toString().contains(k)) {
        hello = k
      }
    }
    return (hello)
  }

  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val sc = new SparkContext("local[*]", "twitterAthlete")
    val lines = sc.textFile("../twitter.test")
    var athleteInfo = loadAthleteNames()
    val splitting = lines.map(x => x.split(";")).map(x => if (x.length == 4 && x(2).length <= 140) x(2))
    var hello = new String()
    val container = splitting.map(x => for ((key, value) <- athleteInfo) if (x.toString().contains(key)) { key }).cache
    container.collect().foreach(println)
    // val mapping = container.map(x => (x,1)).reduceByKey(_+_)
    // mapping.collect().foreach(println)
  }
}
The first file looks like:
id,name,nationality,sex,height........
001,Michael,USA,male,1.96 ...
002,Json,GBR,male,1.76 ....
003,Martin,female,1.73 . ...
The second file looks like:
time, id , tweet .....
12:00, 03043, some message that contain some athletes names , .....
02:00, 03023, some message that contain some athletes names , .....
something like this ...
but I got an empty result after running this code; any suggestions are much appreciated.
The result I got is empty:
()....
()...
()...
but the result I expected is something like:
(name,1)
(other name,1)
You need to use yield to return a value from the for comprehension inside your map:
val container = splitting.map(x => for((key,value) <- athleteInfo ; if(x.toString().contains(key)) ) yield (key, 1)).cache
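To see why yield matters here, compare the two forms on a plain Scala collection (a small, self-contained illustration, not your actual data):
val names = Map("Michael" -> "USA", "Json" -> "GBR")
val tweet = "great race by Michael today"
// Without yield, the for loop is just a statement and evaluates to Unit -- hence the empty () lines you saw.
val noYield = for ((k, _) <- names) if (tweet.contains(k)) { k }           // noYield: Unit
// With yield, it builds a collection of the matches instead.
val withYield = for ((k, _) <- names; if tweet.contains(k)) yield (k, 1)   // withYield: Map(Michael -> 1)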
I think you should just start with the simplest option first...
I would use DataFrames so you can use the built-in CSV parsing and leverage Catalyst, Tungsten, etc.
Then you can use the built-in Tokenizer to split the tweets into words, explode, and do a simple join. Depending on how small the data with the athlete names is, you'll end up with a more optimized broadcast join and avoid a shuffle (see the explicit broadcast variant after the code below).
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.Tokenizer
import spark.implicits._ // assuming a SparkSession named spark (e.g. in spark-shell); needed for the 'symbol column syntax
val tweets = spark.read.format("csv").load(...)
val athletes = spark.read.format("csv").load(...)
val tokenizer = new Tokenizer()
tokenizer.setInputCol("tweet")
tokenizer.setOutputCol("words")
val tokenized = tokenizer.transform(tweets)
val exploded = tokenized.withColumn("word", explode('words))
val withAthlete = exploded.join(athletes, 'word === 'name)
withAthlete.select(exploded("id"), 'name).show()
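If the athlete table is small (it usually is), you can also make the broadcast join explicit instead of relying on the optimizer. A minimal variation of the last two lines above, using the broadcast hint from org.apache.spark.sql.functions:
val withAthlete = exploded.join(broadcast(athletes), 'word === 'name) // explicitly broadcast the small side
withAthlete.select(exploded("id"), 'name).show()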

Understanding the operation of map function

I came across the following example in the book "Fast Data Processing with Spark" by Holden Karau. I did not understand what the following lines of code do in the program:
val splitLines = inFile.map(line => {
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
})
val numericData = splitLines.map(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
The program is :
package pandaspark.examples
import spark.SparkContext
import spark.SparkContext._
import spark.SparkFiles;
import au.com.bytecode.opencsv.CSVReader
import java.io.StringReader
object LoadCsvExample {
  def main(args: Array[String]) {
    if (args.length != 2) {
      System.err.println("Usage: LoadCsvExample <master> <inputfile>")
      System.exit(1)
    }
    val master = args(0)
    val inputFile = args(1)
    val sc = new SparkContext(master, "Load CSV Example",
      System.getenv("SPARK_HOME"),
      Seq(System.getenv("JARS")))
    sc.addFile(inputFile)
    val inFile = sc.textFile(inputFile)
    val splitLines = inFile.map(line => {
      val reader = new CSVReader(new StringReader(line))
      reader.readNext()
    })
    val numericData = splitLines.map(line => line.map(_.toDouble))
    val summedData = numericData.map(row => row.sum)
    println(summedData.collect().mkString(","))
  }
}
I broadly know the functionality of the above program: it parses the input CSV and sums all the rows. But how exactly those three lines of code work to achieve that is what I am unable to understand.
Also, could anyone explain how the output would change if those lines were replaced with flatMap? Like:
val splitLines = inFile.flatMap(line => {
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
})
val numericData = splitLines.flatMap(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
val splitLines = inFile.map(line => {
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
})
val numericData = splitLines.map(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
So this code is basically reading data from a CSV file and adding up its values.
Suppose your CSV file is something like this:
10,12,13
1,2,3,4
1,2
Here, in inFile, we are fetching the data from the CSV file like this:
val inFile = sc.textFile("your CSV file path")
So inFile is an RDD which has text-formatted data.
When you apply collect on it, it will look like this:
Array[String] = Array("10,12,13", "1,2,3,4", "1,2")
When you apply map over it, then you will find:
line = 10,12,13
line = 1,2,3,4
line = 1,2
For reading this data in CSV format it is using:
val reader = new CSVReader(new StringReader(line))
reader.readNext()
So after reading the data in CSV format, splitLines looks like:
Array(
Array(10,12,13),
Array(1,2,3,4),
Array(1,2)
)
On splitLines, it's applying
splitLines.map(line => line.map(_.toDouble))
Here in line you will get Array(10,12,13), and on it the code uses
line.map(_.toDouble)
which changes the type of every element from String to Double.
So in numericData you will get the same structure,
Array(Array(10.0, 12.0, 13.0), Array(1.0, 2.0, 3.0, 4.0), Array(1.0, 2.0))
but with all elements now Doubles.
Finally it sums each individual row (array), so the answer is something like
Array(35.0, 10.0, 3.0)
which you will see when you apply summedData.collect().
First of all, there is no flatMap operation in your original code sample, so the title is a bit misleading. But in general, map called on a collection returns a new collection with the function applied to each element of the original collection.
Going line by line through your code snippet:
val splitLines = inFile.map(line => {
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
})
The type of inFile is RDD[String]. You take every such string, create a CSV reader out of it and call readNext (which returns an array of strings). So at the end you will get RDD[Array[String]].
val numericData = splitLines.map(line => line.map(_.toDouble))
A slightly trickier line, with two map operations nested. Again, you take each element of the RDD collection (which is now an Array[String]) and apply the _.toDouble function to every element of that Array[String]. At the end you will get RDD[Array[Double]].
val summedData = numericData.map(row => row.sum)
You take the elements of the RDD and apply the sum function to them. Since every element is an Array[Double], sum will produce a single Double value. At the end you will get RDD[Double].
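As for the flatMap variant in the question: map preserves the structure (one output element per input element), while flatMap flattens the results into a single collection. A small illustration on plain Scala collections (the same semantics apply to RDDs):
val rows = List(Array("10", "12", "13"), Array("1", "2"))
rows.map(_.map(_.toDouble))     // List(Array(10.0, 12.0, 13.0), Array(1.0, 2.0)) -- nesting kept, so row.sum still works
rows.flatMap(_.map(_.toDouble)) // List(10.0, 12.0, 13.0, 1.0, 2.0)               -- flattened into single Doubles
// In the flatMap version of the program, splitLines would already be a flat RDD of field Strings,
// so line.map(_.toDouble) would then map over the *characters* of each String (giving character codes),
// and numericData.map(row => row.sum) would no longer compile, since a single Double has no sum method.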

What happens when you do Java data manipulations in Spark outside of an RDD

I am reading a CSV file from HDFS using Spark. It goes into an FSDataInputStream object. I can't use the textFile() method because it splits up the CSV file by line feed, and I am reading a CSV file that has line feeds inside the text fields. OpenCSV from SourceForge handles line feeds inside the cells; it's a nice project, but it accepts a Reader as input. I need to convert the stream to a String so that I can pass it to OpenCSV as a StringReader. So: HDFS file -> FSDataInputStream -> String -> StringReader -> an OpenCSV list of strings. Below is the code...
import java.io._
import org.apache.spark.sql.SQLContext
import org.apache.hadoop.fs._
import org.apache.hadoop.conf._
import com.opencsv._
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import java.lang.StringBuilder
val conf = new Configuration()
val hdfsCoreSitePath = new Path("core-site.xml")
val hdfsHDFSSitePath = new Path("hdfs-site.xml")
conf.addResource(hdfsCoreSitePath)
conf.addResource(hdfsHDFSSitePath)
val fileSystem = FileSystem.get(conf)
val csvPath = new Path("/raw_data/project_name/csv/file_name.csv")
val csvFile = fileSystem.open(csvPath)
val fileLen = fileSystem.getFileStatus(csvPath).getLen().toInt
var b = Array.fill[Byte](2048)(0)
var j = 1
val stringBuilder = new StringBuilder()
var bufferString = ""
csvFile.seek(0)
csvFile.read(b)
bufferString = new String(b, "UTF-8")
stringBuilder.append(bufferString)
while (j != -1) {
  b = Array.fill[Byte](2048)(0)
  j = csvFile.read(b)
  bufferString = new String(b, "UTF-8")
  stringBuilder.append(bufferString)
}
// drop the zero padding appended by the final, partially filled buffer
val stringBuilderClean = stringBuilder.substring(0, fileLen)
val reader: Reader = new StringReader(stringBuilderClean)
val csv = new CSVReader(reader)
val javaContext = new JavaSparkContext(sc)
val sqlContext = new SQLContext(sc)
val javaRDD = javaContext.parallelize(csv.readAll())
//do a bunch of transformations on the RDD
It works, but I doubt it is scalable. It makes me wonder how big a limitation it is to have a driver program which pipes in all the data through one JVM. My questions to anyone very familiar with Spark are:
What happens when you do data manipulations across your whole data set like this, before it even gets dropped into the input RDD? Is it just treated as any other program and would be swapping out like crazy, I guess?
How would you then make any Spark program scalable? Do you always NEED to extract the data directly into an input RDD?
Your code loads the data into memory, and then the Spark driver will split it and send each part of the data to an executor; of course, this is not scalable.
There are two ways to resolve your question.
First, write a custom InputFormat to support the CSV file format:
import java.io.{InputStreamReader, IOException}
import com.google.common.base.Charsets
import com.opencsv.{CSVParser, CSVReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Seekable, Path, FileSystem}
import org.apache.hadoop.io.compress._
import org.apache.hadoop.io.{ArrayWritable, Text, LongWritable}
import org.apache.hadoop.mapred._
class CSVInputFormat extends FileInputFormat[LongWritable, ArrayWritable] with JobConfigurable {
  private var compressionCodecs: CompressionCodecFactory = _

  def configure(conf: JobConf) {
    compressionCodecs = new CompressionCodecFactory(conf)
  }

  protected override def isSplitable(fs: FileSystem, file: Path): Boolean = {
    val codec: CompressionCodec = compressionCodecs.getCodec(file)
    if (null == codec) {
      return true
    }
    codec.isInstanceOf[SplittableCompressionCodec]
  }

  @throws(classOf[IOException])
  def getRecordReader(genericSplit: InputSplit, job: JobConf, reporter: Reporter): RecordReader[LongWritable, ArrayWritable] = {
    reporter.setStatus(genericSplit.toString)
    val delimiter: String = job.get("textinputformat.record.delimiter")
    var recordDelimiterBytes: Array[Byte] = null
    if (null != delimiter) {
      recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8)
    }
    new CsvLineRecordReader(job, genericSplit.asInstanceOf[FileSplit], recordDelimiterBytes)
  }
}
class CsvLineRecordReader(job: Configuration, split: FileSplit, recordDelimiter: Array[Byte])
extends RecordReader[LongWritable, ArrayWritable] {
private val compressionCodecs = new CompressionCodecFactory(job)
private val maxLineLength = job.getInt(org.apache.hadoop.mapreduce.lib.input.
LineRecordReader.MAX_LINE_LENGTH, Integer.MAX_VALUE)
private var filePosition: Seekable = _
private val file = split.getPath
private val codec = compressionCodecs.getCodec(file)
private val isCompressedInput = codec != null
private val fs = file.getFileSystem(job)
private val fileIn = fs.open(file)
private var start = split.getStart
private var pos: Long = 0L
private var end = start + split.getLength
private var reader: CSVReader = _
private var decompressor: Decompressor = _
private lazy val CSVSeparator =
if (recordDelimiter == null)
CSVParser.DEFAULT_SEPARATOR
else
recordDelimiter(0).asInstanceOf[Char]
if (isCompressedInput) {
decompressor = CodecPool.getDecompressor(codec)
if (codec.isInstanceOf[SplittableCompressionCodec]) {
val cIn = (codec.asInstanceOf[SplittableCompressionCodec])
.createInputStream(fileIn, decompressor, start, end, SplittableCompressionCodec.READ_MODE.BYBLOCK)
reader = new CSVReader(new InputStreamReader(cIn), CSVSeparator)
start = cIn.getAdjustedStart
end = cIn.getAdjustedEnd
filePosition = cIn
}else {
reader = new CSVReader(new InputStreamReader(codec.createInputStream(fileIn, decompressor)), CSVSeparator)
filePosition = fileIn
}
} else {
fileIn.seek(start)
reader = new CSVReader(new InputStreamReader(fileIn), CSVSeparator)
filePosition = fileIn
}
@throws(classOf[IOException])
private def getFilePosition: Long = {
if (isCompressedInput && null != filePosition) {
filePosition.getPos
}else
pos
}
private def nextLine: Option[Array[String]] = {
if (getFilePosition < end){
//readNext automatically splits the line into elements
reader.readNext() match {
case null => None
case elems => Some(elems)
}
} else
None
}
override def next(key: LongWritable, value: ArrayWritable): Boolean =
nextLine
.exists { elems =>
key.set(pos)
val lineLength = elems.foldRight(0)((a, b) => a.length + 1 + b)
pos += lineLength
value.set(elems.map(new Text(_)))
if (lineLength < maxLineLength) true else false
}
@throws(classOf[IOException])
def getProgress: Float =
if (start == end)
0.0f
else
Math.min(1.0f, (getFilePosition - start) / (end - start).toFloat)
override def getPos: Long = pos
override def createKey(): LongWritable = new LongWritable
override def close(): Unit = {
try {
if (reader != null) {
reader.close
}
} finally {
if (decompressor != null) {
CodecPool.returnDecompressor(decompressor)
}
}
}
override def createValue(): ArrayWritable = new ArrayWritable(classOf[Text])
}
Simple test example:
val arrayRdd = sc.hadoopFile("source path", classOf[CSVInputFormat], classOf[LongWritable], classOf[ArrayWritable],
sc.defaultMinPartitions).map(_._2.get().map(_.toString))
arrayRdd.collect().foreach(e => println(e.mkString(",")))
The other way, which I prefer, uses spark-csv written by Databricks, which supports the CSV file format well; you can find some usage examples on its GitHub page.
Updated for spark-csv, using univocity as the parserLib, which can handle multi-line cells:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("parserLib", "univocity")
.option("inferSchema", "true") // Automatically infer data types
.load("source path")
What happens when you do data manipulations across your whole data set like this, before it even gets dropped into the input RDD? It is just treated as any other program and would be swapping out like crazy I guess?
You load the whole dataset into local memory. So if you have the memory, it works.
How would you then make any spark program scalable?
You have to select a data format that Spark can load, or you change your application so that it can load its data format into Spark directly, or a bit of both.
In this case you could look at creating a custom InputFormat that splits on something other than newlines. I think you would also want to look at how you write your data, so it is partitioned in HDFS at record boundaries rather than at newlines.
However, I suspect the simplest answer is to encode the data differently: JSON Lines, or encoding the newlines in the CSV file during the write, or Avro, or... anything that fits better with Spark and HDFS.
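For what it's worth, newer Spark releases make the "let Spark parse it" route fairly painless. A minimal sketch, assuming Spark 2.2+ with a SparkSession named spark (the JSON Lines path is hypothetical):
// the built-in CSV reader with multiLine enabled handles line feeds inside quoted fields
val csvDf = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .csv("/raw_data/project_name/csv/file_name.csv")
// or, after re-encoding the data as JSON Lines on write, the problem goes away entirely
val jsonDf = spark.read.json("/raw_data/project_name/jsonl/")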