I am a beginner in Scala. I am trying to read a file but I am getting a java.io.FileNotFoundException. Can someone help?
package standardscala
case class TempData(day: Int, doy: Int, month: Int, year: Int, precip: Double, snow: Double, tave: Double, tmax: Double, tmin: Double)
object TempData {
  def main(args: Array[String]): Unit = {
    val source = scala.io.Source.fromFile("DATA/MN212.csv")
    val lines = source.getLines().drop(1) // to get the lines of files, drop(1) to drop the header
    val data = lines.map { line =>
      val p = line.split(",")
      TempData(p(0).toInt, p(1).toInt, p(2).toInt, p(4).toInt, p(5).toDouble, p(6).toDouble, p(7).toDouble, p(8).toDouble, p(9).toDouble)
    }.toArray
    source.close() // Closing the connection
    data.take(5) foreach println
  }
}
Try using an absolute path, and the problem will disappear.
One option would be to move your csv file into a resources folder and load it as a resource like:
val f = new File(getClass.getClassLoader.getResource("your/csv/file.csv").getPath)
Or you could try loading it from an absolute path!
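If you are on Scala 2.12 or newer, scala.io.Source also has a fromResource helper, so a minimal sketch (assuming you copy MN212.csv directly into src/main/resources) would be:
import scala.io.Source

val source = Source.fromResource("MN212.csv") // resolved from the classpath, not the working directory
val lines = source.getLines().drop(1)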
Please read this post about reading CSV files by Alvin Alexander, author of the Scala Cookbook:
object CSVDemo extends App {
println("Month, Income, Expenses, Profit")
val bufferedSource = io.Source.fromFile("/tmp/finance.csv")
for (line <- bufferedSource.getLines) {
val cols = line.split(",").map(_.trim)
// do whatever you want with the columns here
println(s"${cols(0)}|${cols(1)}|${cols(2)}|${cols(3)}")
}
bufferedSource.close
}
As Silvio Manolo pointed out, you should not use fromFile with an absolute path, as your code will then require the same file hierarchy to run. In a first draft this is acceptable, so you can move on and test the real job!
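If you are not sure which directory a relative path like "DATA/MN212.csv" is resolved against, a quick check is to print the JVM's working directory from main:
// fromFile resolves relative paths against the current working directory
println(new java.io.File(".").getCanonicalPath)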
Well, I am new to Spark and Scala and have been trying to implement data cleaning in Spark. The code below checks one column for missing values, stores them in outputrdd, and runs a loop to calculate each missing value. The code works well when there is only one missing value in the file. Since HDFS does not allow writing to the same location again, it fails if there is more than one missing value. Can you please assist in writing finalrdd to a particular location once the missing values for all occurrences have been calculated?
def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("app").setMaster("local")
  val sc = new SparkContext(conf)
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val files = sc.wholeTextFiles("/input/raw_files/")
  val file = files.map { case (filename, content) => filename }
  file.collect.foreach(filename => {
    cleaningData(filename)
  })

  def cleaningData(file: String) = {
    //headers has column headers of the files
    var hdr = headers.toString()
    var vl = hdr.split("\t")
    sqlContext.clearCache()
    if (hdr.contains("COLUMN_HEADER")) {
      //Checks for missing values in dataframe and stores missing values in outputrdd
      if (!outputrdd.isEmpty()) {
        logger.info("value is zero then performing further operation")
        val outputdatetimedf = sqlContext.sql("select date,'/t',time from cpc where kwh = 0")
        val outputdatetimerdd = outputdatetimedf.rdd
        val strings = outputdatetimerdd.map(row => row.mkString).collect()
        for (i <- strings) {
          if (Coddition check) {
            //Calculates missing value and stores it in finalrdd
            finalrdd.map { x => x.mkString("\t") }.saveAsTextFile("/output")
            logger.info("file is written in file")
          }
        }
      }
    }
  }
}
It is not clear how (Coddition check) works in your example.
In any case, the function .saveAsTextFile("/output") should be called only once.
So I would rewrite your example into this:
val strings = outputdatetimerdd
  .map(row => row.mkString("\t")) // keep this as an RDD; collect() here would prevent saveAsTextFile below

val finalrdd = strings
  .filter(str => Coddition check str) // don't know how this Coddition works

// this part is called only once, not inside a loop
finalrdd.saveAsTextFile("/output")
logger.info("file is written in file")
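If you really do need a separate output per occurrence, the usual workaround is to give every write its own directory, because saveAsTextFile refuses to overwrite an existing path. A rough sketch (str.nonEmpty is only a stand-in for the unspecified Coddition check):
strings.collect().zipWithIndex.foreach { case (str, i) =>
  if (str.nonEmpty) { // stand-in for the question's Coddition check
    sc.parallelize(Seq(str)).saveAsTextFile(s"/output/part_$i")
  }
}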
I am monitoring a directory in HDFS, and when a file is put into it I want to retrieve the name of the file entering the directory and apply pattern matching, so as to sort the files based on their names. So far I have been able to retrieve the file path, which gives me the file name, but I don't know how to proceed further with the pattern matching.
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.{NewHadoopRDD}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
object path {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)
    val fc = classOf[TextInputFormat]
    val kc = classOf[LongWritable]
    val vc = classOf[Text]
    val path: String = "/home/hduser/Desktop/foldername"
    val text = sc.newAPIHadoopFile(path, fc, kc, vc, sc.hadoopConfiguration)
    println("++++++++++++++ " + text)
    val linesWithFileNames = text.asInstanceOf[NewHadoopRDD[LongWritable, Text]].mapPartitionsWithInputSplit((inputSplit, iterator) => {
      val file = inputSplit.asInstanceOf[FileSplit]
      println(">>>>>>>>>>>>>>>> " + file.getPath)
      iterator.map(tup => (file.getPath, tup._2))
    })
    linesWithFileNames.collect()
  }
}
This gives me a path like
/home/hduser/Desktop/folder_name/XYZ_123_pqr_abc_FILENAME
and I want to apply pattern matching based on the start of the file name; here it would be based on XYZ.
Any help will be appreciated.
You are not specific enough, I think. From the context of your question I assume XYZ stands for digits and that you intend to sort the files by those numbers; if that's the case, consider the following example:
val matchStart = """^(\d{3})""".r.unanchored
val number = file match {
case matchStart(number, _*) => number.toInt
case _ => ??? // error handling, not necessarily unimplemented error
}
Now, having extracted the number from the beginning of the file name, you can perform the sorting etc.
Edit
Following the information in the comment, the answer would be similar:
val matchStart = """^([a-zA-Z1-9]{3})""".r.unanchored
val prefix = file match {
case matchStart(a, _*) => a
case _ => ??? // error handling, not necessarily unimplemented error
}
Having the prefix extracted you can do anything you want with it.
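For instance, a minimal sorting sketch (the file names below are made up and prefixOf is just a hypothetical helper) could look like this:
val matchStart = """^([a-zA-Z1-9]{3})""".r.unanchored

def prefixOf(path: String): String = {
  val name = path.split("/").last // strip the directory part so ^ anchors to the file name itself
  name match {
    case matchStart(a, _*) => a
    case _ => "" // no recognisable prefix; handle however you prefer
  }
}

val files = Seq(
  "/home/hduser/Desktop/folder_name/XYZ_123_pqr_abc_FILENAME",
  "/home/hduser/Desktop/folder_name/ABC_456_pqr_abc_FILENAME"
)
val sorted = files.sortBy(prefixOf) // ABC... sorts before XYZ...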
I think there may be a simple solution to this: I was wondering if anybody knew how to iterate over a set of files and output a value based on each file's name.
My problem is that I want to read in a set of graph edges for each month and then create a separate monthly graph for each.
Currently I've done this the long way, which is fine for doing one year's worth, but I'd like a way to automate it.
You can see my code below which hopefully clearly shows what I am doing.
//Load vertex data
val vertices= (sc.textFile("D:~vertices.csv")
.map(line => line.split(",")).map(parts => (parts.head.toLong, parts.tail)))
//Define function for creating edges from csv file
def EdgeMaker(file: RDD[String]): RDD[Edge[String]] = {
  file.flatMap { line =>
    if (!line.isEmpty && line(0) != '#') {
      val lineArray = line.split(",")
      if (lineArray.length < 3) { // skip malformed lines (the original `< 0` check could never trigger)
        None
      } else {
        val srcId = lineArray(0).toInt
        val dstId = lineArray(1).toInt
        val ID = lineArray(2)
        Some(Edge(srcId, dstId, ID))
      }
    } else {
      None
    }
  }
}
//make graphs -This is where I want automation, so I can iterate through a
//folder of edge files and output corresponding monthly graphs.
val edgesJan = EdgeMaker(sc.textFile("D:~edges2011Jan.txt"))
val graphJan = Graph(vertices, edgesJan)
val edgesFeb = EdgeMaker(sc.textFile("D:~edges2011Feb.txt"))
val graphFeb = Graph(vertices, edgesFeb)
val edgesMar = EdgeMaker(sc.textFile("D:~edges2011Mar.txt"))
val graphMar = Graph(vertices, edgesMar)
val edgesApr = EdgeMaker(sc.textFile("D:~edges2011Apr.txt"))
val graphApr = Graph(vertices, edgesApr)
val edgesMay = EdgeMaker(sc.textFile("D:~edges2011May.txt"))
val graphMay = Graph(vertices, edgesMay)
val edgesJun = EdgeMaker(sc.textFile("D:~edges2011Jun.txt"))
val graphJun = Graph(vertices, edgesJun)
val edgesJul = EdgeMaker(sc.textFile("D:~edges2011Jul.txt"))
val graphJul = Graph(vertices, edgesJul)
val edgesAug = EdgeMaker(sc.textFile("D:~edges2011Aug.txt"))
val graphAug = Graph(vertices, edgesAug)
val edgesSep = EdgeMaker(sc.textFile("D:~edges2011Sep.txt"))
val graphSep = Graph(vertices, edgesSep)
val edgesOct = EdgeMaker(sc.textFile("D:~edges2011Oct.txt"))
val graphOct = Graph(vertices, edgesOct)
val edgesNov = EdgeMaker(sc.textFile("D:~edges2011Nov.txt"))
val graphNov = Graph(vertices, edgesNov)
val edgesDec = EdgeMaker(sc.textFile("D:~edges2011Dec.txt"))
val graphDec = Graph(vertices, edgesDec)
Any help or pointers on this would be much appreciated.
You can use SparkContext's wholeTextFiles to get the file name along with the content, and use that String for naming, filtering, etc. of your values and output:
val fileLoad = sc.wholeTextFiles("hdfs:///..Path").map { case (filename, content) => ... }
The SparkContext textFile only reads the data but does not keep the file name.
----EDIT----
Sorry, I seem to have misunderstood the question; you can load multiple files using
sc.wholeTextFiles("~/path/file[0-5]*,/anotherPath/*.txt").map { case (filename, content) => ... }
The asterisk * should load all files in the path, assuming they are all supported input file types.
This read will concatenate all your files into one single large RDD and avoids repeated calls (for each call you would have to specify the path and file name, which is what you want to avoid, I think).
Reading with the file name allows you to group by the file name and apply your graph function to each group.
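As a rough sketch of that idea (the glob below just mirrors the question's paths, and collecting each file's content to the driver assumes the monthly edge files are reasonably small):
val monthlyGraphs = sc.wholeTextFiles("D:~edges2011*.txt")
  .collect() // one (fileName, content) pair per monthly file
  .map { case (fileName, content) =>
    val month = fileName.dropRight(4).takeRight(3) // e.g. "Jan" from "...edges2011Jan.txt"
    val edges = EdgeMaker(sc.parallelize(content.split("\n").toSeq))
    month -> Graph(vertices, edges)
  }.toMap

// monthlyGraphs("Jan") is then the January graph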
I was hoping somebody could help; I'm new to Scala and I'm having some issues writing my output to a text file.
I have a data table and I've written some code to read it in one line at a time, do what I want it to do, and now I need it to write that line to a text file.
So for example, I have the following table of data type
Name, Date, goX, goY, stopX, stopY
1, 12/01/01, 1166, 2299, 3300, 4477
My code takes the first characters of goX and goY and creates a new number, in this instance 1.2, and does the same for stopX and stopY, so in this case you get 3.4.
What I want to get in the text file is essentially the following:
go, stop
1.2, 3.4
and I want it to go through hundreds of lines doing this until I have a long list of these go and stop values in the text file.
My current code is as follows; this is almost certainly not the most elegant solution, but it is my first ever Scala/Java code:
import scala.io.Source
object FT2 extends App {
  for (line <- Source.fromFile("C://Users//Data.csv").getLines) {
    var array = line.split(",")
    val gox = (array(2));
    val xStringGo = gox.toString
    val goX = xStringGo.dropRight(1|2)
    val goy = (array(3));
    val yStringGo = goy.toString
    val goY = yStringGo.dropRight(1|2)
    val goXY = goX + "." + goY
    val stopx = (array(4));
    val xStringStop = stopx.toString
    val stopX = xStringStop.dropRight(1|2)
    val stopy = (array(5)); // column 5 is stopY (array(3) would re-read goY)
    val yStringStop = stopy.toString
    val stopY = yStringStop.dropRight(1|2)
    val stopXY = stopX + "." + stopY
    val GoStop = List(goXY, stopXY)
    //This is where I want to print GoStop to a text file
  }
}
Any help is much appreciated!
This should do it:
import java.io._
val data = List("everything", "you", "want", "to", "write", "to", "the", "file")
val file = "whatever.txt"
val writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)))
for (x <- data) {
writer.write(x + "\n") // however you want to format it
}
writer.close()
But you can make it a little nicer by creating a method that will automatically close stuff for you:
def using[T <: Closeable, R](resource: T)(block: T => R): R = {
try { block(resource) }
finally { resource.close() }
}
using(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)))) {
writer =>
for (x <- data) {
writer.write(x + "\n") // however you want to format it
}
}
So:
using(new BufferedWriter(new OutputStreamWriter(new FileOutputStream("output.txt")))) {
writer =>
for(line <- io.Source.fromFile("input.txt").getLines) {
writer.write(line + "\n") // however you want to format it
}
}
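Applied to the original go/stop loop, a rough sketch might look like the following (the output path GoStop.txt is made up, and take(1) is used for the first digit instead of the question's dropRight arithmetic):
using(new BufferedWriter(new OutputStreamWriter(new FileOutputStream("C://Users//GoStop.txt")))) {
  writer =>
    writer.write("go, stop\n")
    for (line <- io.Source.fromFile("C://Users//Data.csv").getLines.drop(1)) { // drop(1) skips the header row
      val cols = line.split(",").map(_.trim)
      val goXY = cols(2).take(1) + "." + cols(3).take(1)
      val stopXY = cols(4).take(1) + "." + cols(5).take(1)
      writer.write(s"$goXY, $stopXY\n")
    }
}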
I am trying to append to a file such that I first want to delete the last line and then start appending. But I can't figure out how to delete the last line of the file.
I am appending to the file as follows:
val fw = new FileWriter("src/file.txt", true) ;
fw.write("new item");
Can anybody please help me?
EDIT:
val lines_list = Source.fromFile("src/file.txt").getLines().toList
val new_lines = lines_list.dropRight(1)
val pw = new PrintWriter(new File("src/file.txt" ))
new_lines.foreach(pw.write)
pw.write("\n")
pw.close()
After following your method I am trying to write back to the file, but when I do this all the contents (with the last line deleted) come out on a single line; however, I want them to come out on separate lines.
For very large files a simple solution relies on OS-level tools, for instance sed (the stream editor), so consider a call like this,
import sys.process._
Seq("sed","-i","$ d","src/file1.txt")!
which will remove the last line of the text file. This approach is not very Scala-ish, yet it solves the problem without leaving Scala.
This returns a random access file positioned just before the last line.
import java.io.{RandomAccessFile, File}
def randomAccess(file: File) = {
val random = new RandomAccessFile(file, "rw")
val result = findLastLine(random, 0, 0)
random.seek(result)
random
}
def findLastLine(random: RandomAccessFile, position: Long, previous: Long): Long = {
val pointer = random.getFilePointer
if (random.readLine == null) {
previous
} else {
findLastLine(random, previous, pointer)
}
}
val file = new File("build.sbt")
val random = randomAccess(file)
And test:
val line = random.readLine()
logger.debug(s"$line")
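One caveat: seeking only moves the file pointer, so if the text you then write is shorter than the old last line, stale bytes remain at the end of the file. A hedged addition, using RandomAccessFile's own setLength while random is still positioned at the start of the last line, truncates first:
random.setLength(random.getFilePointer) // physically cut off the last line
random.writeBytes("new item\n") // then append the replacement content
random.close()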
My Scala is way off, so people can probably give you a nicer solution:
import scala.io.Source
import java.io._
object Test00 {
def main(args: Array[String]) = {
val lines = Source.fromFile("src/file.txt").getLines().toList.dropRight(1)
val pw = new PrintWriter(new File("src/out.txt" ))
(lines :+ "another line").foreach(pw.println)
pw.close()
}
}
Sorry for the hardcoded appending; I used it just to test that everything worked fine.