Load a .csv file from HDFS in Scala - scala

So I basically have the following code to read a .csv file and store it in an Array[Array[String]]:
def load(filepath: String): Array[Array[String]] = {
var data = Array[Array[String]]()
val bufferedSource = io.Source.fromFile(filepath)
for (line <- bufferedSource.getLines) {
data :+ line.split(",").map(_.trim)
}
bufferedSource.close
return data.slice(1,data.length-1) //skip header
}
Which works for files that are not stored on HDFS. However, when I try the same thing on HDFS I get
No such file or directory found
When writing to a file on HDFS I also had to change my original code and added some FileSystem and Path arguments to PrintWriter, but this time I have no idea at all how to do it.
I am this far:
def load(filepath: String, sc: SparkContext): Array[Array[String]] = {
var data = Array[Array[String]]()
val fs = FileSystem.get(sc.hadoopConfiguration)
val stream = fs.open(new Path(filepath))
var line = ""
while ((line = stream.readLine()) != null) {
data :+ line.split(",").map(_.trim)
}
return data.slice(1,data.length-1) //skip header
}
This should work, but I get a NullPointerException when comparing line to null or if its length is over 0.

This code will read a .csv file from HDFS:
def read(filepath: String, sc: SparkContext): ArrayBuffer[Array[String]] = {
var data = ArrayBuffer[Array[String]]()
val fs = FileSystem.get(sc.hadoopConfiguration)
val stream = fs.open(new Path(filepath))
var line = stream.readLine()
while (line != null) {
val row = line.split(",").map(_.trim)
data += row
line = stream.readLine()
}
stream.close()
return data // or return data.slice(1,data.length-1) to skip header
}

Please read this post about reading CSV by Alvin Alexander, writer of the Scala Cookbook:
object CSVDemo extends App {
println("Month, Income, Expenses, Profit")
val bufferedSource = io.Source.fromFile("/tmp/finance.csv")
for (line <- bufferedSource.getLines) {
val cols = line.split(",").map(_.trim)
// do whatever you want with the columns here
println(s"${cols(0)}|${cols(1)}|${cols(2)}|${cols(3)}")
}
bufferedSource.close
}
You just have to get an InputStream from your HDFS and replace in this snippet

Related

Nested for loop in Scala not iterating through out the file?

I have first file with data as
A,B,C
B,E,F
C,N,P
And second file with data as below
A,B,C,YES
B,C,D,NO
C,D,E,TRUE
D,E,F,FALSE
E,F,G,NO
I need every record in the first file to iterate with all records in the second file. But it's happening only for the first record.
Below is the code:
import scala.io.Source.fromFile
object TestComparision {
def main(args: Array[String]): Unit = {
val lines = fromFile("C:\\Users\\nreddy26\\Desktop\\Spark\\PRI.txt").getLines
val lines2 = fromFile("C:\\Users\\nreddy26\\Desktop\\Spark\\LKP.txt").getLines
var l = 0
var cnt = 0
for (line <- lines) {
for (line2 <- lines2) {
val cols = line.split(",").map(_.trim)
println(s"${cols(0)}|${cols(1)}|${cols(2)}")
val cols2 = line2.split(",").map(_.trim)
println(s"${cols2(0)}|${cols2(1)}|${cols2(2)}|${cols2(3)}")
}
}
}
}
As rightly suggested by #Luis, get the lines in List form by using toList:
val lines = fromFile("C:\\Users\\nreddy26\\Desktop\\Spark\\PRI.txt").getLines.toList
val lines2 = fromFile("C:\\Users\\nreddy26\\Desktop\\Spark\\LKP.txt").getLines.toList

remove header from csv while reading from from txt or csv file in spark scala

I am trying to remove header from given input file. But I couldn't make it.
Th is what I have written. Can someone help me how to remove headers from the txt or csv file.
import org.apache.spark.{SparkConf, SparkContext}
object SalesAmount {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName(getClass.getName).setMaster("local")
val sc = new SparkContext(conf)
val salesRDD = sc.textFile(args(0),2)
val salesPairRDD = salesRDD.map(rec => {
val fieldArr = rec.split(",")
(fieldArr(1), fieldArr(3).toDouble)
})
val totalAmountRDD = salesPairRDD.reduceByKey(_+_).sortBy(_._2,false)
val discountAmountRDD = totalAmountRDD.map(t => {
if (t._2 > 1000) (t._1,t._2 * 0.9)
else t
})
discountAmountRDD.foreach(println)
}
}
Skipping the first row when manually parsing text files using the RDD API is a bit tricky:
val salesPairRDD =
salesRDD
.mapPartitionsWithIndex((i, it) => if (i == 0) it.drop(1) else it)
.map(rec => {
val fieldArr = rec.split(",")
(fieldArr(1), fieldArr(3).toDouble)
})
The header line will be the first item in the first partition, so mapPartitionsWithIndex is used to iterate over the partitions and to skip the first item if the partition index is 0.

not able to store result in hdfs when code runs for second iteration

Well I am new to spark and scala and have been trying to implement cleaning of data in spark. below code checks for the missing value for one column and stores it in outputrdd and runs loops for calculating missing value. code works well when there is only one missing value in file. Since hdfs does not allow writing again on the same location it fails if there are more than one missing value. can you please assist in writing finalrdd to particular location once calculating missing values for all occurrences is done.
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("app").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val files = sc.wholeTextFiles("/input/raw_files/")
val file = files.map { case (filename, content) => filename }
file.collect.foreach(filename => {
cleaningData(filename)
})
def cleaningData(file: String) = {
//headers has column headers of the files
var hdr = headers.toString()
var vl = hdr.split("\t")
sqlContext.clearCache()
if (hdr.contains("COLUMN_HEADER")) {
//Checks for missing values in dataframe and stores missing values' in outputrdd
if (!outputrdd.isEmpty()) {
logger.info("value is zero then performing further operation")
val outputdatetimedf = sqlContext.sql("select date,'/t',time from cpc where kwh = 0")
val outputdatetimerdd = outputdatetimedf.rdd
val strings = outputdatetimerdd.map(row => row.mkString).collect()
for (i <- strings) {
if (Coddition check) {
//Calculates missing value and stores in finalrdd
finalrdd.map { x => x.mkString("\t") }.saveAsTextFile("/output")
logger.info("file is written in file")
}
}
}
}
}
}``
It is not clear how (Coddition check) works in your example.
In any case function .saveAsTextFile("/output") should be called only once.
So I would rewrite your example into this:
val strings = outputdatetimerdd
.map(row => row.mkString)
.collect() // perhaps '.collect()' is redundant
val finalrdd = strings
.filter(str => Coddition check str) //don't know how this Coddition works
.map (x => x.mkString("\t"))
// this part is called only once but not in a loop
finalrdd.saveAsTextFile("/output")
logger.info("file is written in file")

Read a file from HDFS and assign the contents to string

In Scala, How to read a file in HDFS and assign the contents to a variable. I know how to read a file and I am able to print it. But If I try assign the content to a string, It giving output as Unit(). Below is the codes I tried.
val dfs = org.apache.hadoop.fs.FileSystem.get(config);
val snapshot_file = "/path/to/file/test.txt"
val stream = dfs.open(new Path(snapshot_file))
def readLines = Stream.cons(stream.readLine, Stream.continually( stream.readLine))
readLines.takeWhile(_ != null).foreach(line => println(line))
The above code printing the output properly. But If I tried assign the output to a string, I am getting correct output.
val snapshot_id = readLines.takeWhile(_ != null).foreach(line => println(line))
snapshot_id: Unit = ()
what is the correct way to assign the contents to a variable ?
You need to use mkString. Since println returns Unit() which gets stored to your variable if you call println on you stream
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://namenode:port/"), new org.apache.hadoop.conf.Configuration())
val path = new org.apache.hadoop.fs.Path("/user/cloudera/file.txt")
val stream = hdfs.open(path)
def readLines = scala.io.Source.fromInputStream(stream)
val snapshot_id : String = readLines.takeWhile(_ != null).mkString("\n")
I used org.apache.commons.io.IOUtils.toString to convert stream in to string
def getfileAsString( file: String): String = {
import org.apache.hadoop.fs.FileSystem
val config: Configuration = new Configuration();
config.set("fs.hdfs.impl", classOf[DistributedFileSystem].getName)
config.set("fs.file.impl", classOf[LocalFileSystem].getName)
val dfs = FileSystem.get(config)
val filePath: FSDataInputStream = dfs.open(new Path(file))
logInfo("file.available " + filePath.available)
val outputxmlAsString: String = org.apache.commons.io.IOUtils.toString(filePath, "UTF-8")
outputxmlAsString
}

What happens when you do java data manipulations in Spark outside of an RDD

I am reading a csv file from hdfs using Spark. It's going into an FSDataInputStream object. I cant use the textfile() method because it splits up the csv file by line feed, and I am reading a csv file that has line feeds inside the text fields. Opencsv from sourcefourge handles line feeds inside the cells, its a nice project, but it accepts a Reader as an input. I need to convert it to a string so that I can pass it to opencsv as a StringReader. So, HDFS File -> FSdataINputStream -> String -> StringReader -> an opencsv list of strings. Below is the code...
import java.io._
import org.apache.spark.sql.SQLContext
import org.apache.hadoop.fs._
import org.apache.hadoop.conf._
import com.opencsv._
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import java.lang.StringBuilder
val conf = new Configuration()
val hdfsCoreSitePath = new Path("core-site.xml")
val hdfsHDFSSitePath = new Path("hdfs-site.xml")
conf.addResource(hdfsCoreSitePath)
conf.addResource(hdfsHDFSSitePath)
val fileSystem = FileSystem.get(conf)
val csvPath = new Path("/raw_data/project_name/csv/file_name.csv")
val csvFile = fileSystem.open(csvPath)
val fileLen = fileSystem.getFileStatus(csvPath).getLen().toInt
var b = Array.fill[Byte](2048)(0)
var j = 1
val stringBuilder = new StringBuilder()
var bufferString = ""
csvFile.seek(0)
csvFile.read(b)
var bufferString = new String(b,"UTF-8")
stringBuilder.append(bufferString)
while(j != -1) {b = Array.fill[Byte](2048)(0);j=csvFile.read(b);bufferString = new String(b,"UTF-8");stringBuilder.append(bufferString)}
val stringBuilderClean = new StringBuilder()
stringBuilderClean = stringBuilder.substring(0,fileLen)
val reader: Reader = new StringReader(stringBuilderClean.toString()).asInstanceOf[Reader]
val csv = new CSVReader(reader)
val javaContext = new JavaSparkContext(sc)
val sqlContext = new SQLContext(sc)
val javaRDD = javaContext.parallelize(csv.readAll())
//do a bunch of transformations on the RDD
It works but I doubt it is scalable. It makes me wonder how big of a limitation it is to have a driver program which pipes in all the data trough one jvm. My questions to anyone very familiar with spark are:
What happens when you do data manipulations across your whole data set like this, before it even gets dropped into the input RDD? It is just treated as any other program and would be swapping out like crazy I guess?
How would you then make any spark program scalable? Do you always NEED to extract the data directly into an input RDD?
Your code loads the data into the memory, and then Spark driver will split and send each part of data to executor, of cause, it is not scalable.
There are two ways to resolve your question.
write custom InputFormat to support CSV file format
import java.io.{InputStreamReader, IOException}
import com.google.common.base.Charsets
import com.opencsv.{CSVParser, CSVReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Seekable, Path, FileSystem}
import org.apache.hadoop.io.compress._
import org.apache.hadoop.io.{ArrayWritable, Text, LongWritable}
import org.apache.hadoop.mapred._
class CSVInputFormat extends FileInputFormat[LongWritable, ArrayWritable] with JobConfigurable {
private var compressionCodecs: CompressionCodecFactory = _
def configure(conf: JobConf) {
compressionCodecs = new CompressionCodecFactory(conf)
}
protected override def isSplitable(fs: FileSystem, file: Path): Boolean = {
val codec: CompressionCodec = compressionCodecs.getCodec(file)
if (null == codec) {
return true
}
codec.isInstanceOf[SplittableCompressionCodec]
}
#throws(classOf[IOException])
def getRecordReader(genericSplit: InputSplit, job: JobConf, reporter: Reporter): RecordReader[LongWritable, ArrayWritable] = {
reporter.setStatus(genericSplit.toString)
val delimiter: String = job.get("textinputformat.record.delimiter")
var recordDelimiterBytes: Array[Byte] = null
if (null != delimiter) {
recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8)
}
new CsvLineRecordReader(job, genericSplit.asInstanceOf[FileSplit], recordDelimiterBytes)
}
}
class CsvLineRecordReader(job: Configuration, split: FileSplit, recordDelimiter: Array[Byte])
extends RecordReader[LongWritable, ArrayWritable] {
private val compressionCodecs = new CompressionCodecFactory(job)
private val maxLineLength = job.getInt(org.apache.hadoop.mapreduce.lib.input.
LineRecordReader.MAX_LINE_LENGTH, Integer.MAX_VALUE)
private var filePosition: Seekable = _
private val file = split.getPath
private val codec = compressionCodecs.getCodec(file)
private val isCompressedInput = codec != null
private val fs = file.getFileSystem(job)
private val fileIn = fs.open(file)
private var start = split.getStart
private var pos: Long = 0L
private var end = start + split.getLength
private var reader: CSVReader = _
private var decompressor: Decompressor = _
private lazy val CSVSeparator =
if (recordDelimiter == null)
CSVParser.DEFAULT_SEPARATOR
else
recordDelimiter(0).asInstanceOf[Char]
if (isCompressedInput) {
decompressor = CodecPool.getDecompressor(codec)
if (codec.isInstanceOf[SplittableCompressionCodec]) {
val cIn = (codec.asInstanceOf[SplittableCompressionCodec])
.createInputStream(fileIn, decompressor, start, end, SplittableCompressionCodec.READ_MODE.BYBLOCK)
reader = new CSVReader(new InputStreamReader(cIn), CSVSeparator)
start = cIn.getAdjustedStart
end = cIn.getAdjustedEnd
filePosition = cIn
}else {
reader = new CSVReader(new InputStreamReader(codec.createInputStream(fileIn, decompressor)), CSVSeparator)
filePosition = fileIn
}
} else {
fileIn.seek(start)
reader = new CSVReader(new InputStreamReader(fileIn), CSVSeparator)
filePosition = fileIn
}
#throws(classOf[IOException])
private def getFilePosition: Long = {
if (isCompressedInput && null != filePosition) {
filePosition.getPos
}else
pos
}
private def nextLine: Option[Array[String]] = {
if (getFilePosition < end){
//readNext automatical split the line to elements
reader.readNext() match {
case null => None
case elems => Some(elems)
}
} else
None
}
override def next(key: LongWritable, value: ArrayWritable): Boolean =
nextLine
.exists { elems =>
key.set(pos)
val lineLength = elems.foldRight(0)((a, b) => a.length + 1 + b)
pos += lineLength
value.set(elems.map(new Text(_)))
if (lineLength < maxLineLength) true else false
}
#throws(classOf[IOException])
def getProgress: Float =
if (start == end)
0.0f
else
Math.min(1.0f, (getFilePosition - start) / (end - start).toFloat)
override def getPos: Long = pos
override def createKey(): LongWritable = new LongWritable
override def close(): Unit = {
try {
if (reader != null) {
reader.close
}
} finally {
if (decompressor != null) {
CodecPool.returnDecompressor(decompressor)
}
}
}
override def createValue(): ArrayWritable = new ArrayWritable(classOf[Text])
}
Simple test example:
val arrayRdd = sc.hadoopFile("source path", classOf[CSVInputFormat], classOf[LongWritable], classOf[ArrayWritable],
sc.defaultMinPartitions).map(_._2.get().map(_.toString))
arrayRdd.collect().foreach(e => println(e.mkString(",")))
The other way which I prefer uses spark-csv written by databricks, which is well supported for CSV file format, you can take some practices in the github page.
Updated for spark-csv, using univocity as parserLib, which can handle multi-line cells
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("parserLib", "univocity")
.option("inferSchema", "true") // Automatically infer data types
.load("source path")
What happens when you do data manipulations across your whole data set like this, before it even gets dropped into the input RDD? It is just treated as any other program and would be swapping out like crazy I guess?
You load the whole dataset into local memory. So if you have the memory, it works.
How would you then make any spark program scalable?
You have select the a data format that spark can load, or you change your application so that it can load the data format into spark directly or a bit of both.
In this case you could look at creating a custom InputFormat that splits on something other than newlines. I think you would want to also look at how you write your data so it is partitioned in HDFS at record boundaries not new lines.
However I suspect the simplest answer is to encode the data differently. JSON Lines or encode the newlines in the CSV file during the write or Avro or... Anything that fits better with Spark & HDFS.