The file name is searcher.scala and I want to be able to type:
scala searcher.scala "term I want to find" "file I want to search through" "new file with new lines"
I tried this code, but it keeps saying I have an empty iterator:
import java.io.PrintWriter
val searchTerm = args(0)
val input = args(1)
val output = args(2)
val out = new PrintWriter(output);
val listLines = scala.io.Source.fromFile(input).getLines
for (line <- listLines)
{
{ out.println("Line: " + line) }
def term (x: String): Boolean = {x == searchTerm}
val newList = listLines.filter(term)
println(listLines.filter(term))
}
out.close;
You have the iterator listLines and you read it several times, but an iterator is a one-time object:
for (line <- listLines)
val newList = listLines.filter(term)
println(listLines.filter(term))
You need to revise your code so the iterator is not consumed more than once.
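One possible fix (a sketch that keeps your overall structure) is to materialize the lines into a List once, so the same collection can be traversed repeatedly:

import java.io.PrintWriter

val searchTerm = args(0)
val input = args(1)
val output = args(2)

val out = new PrintWriter(output)
// toList materializes the one-shot iterator into a reusable collection
val lines = scala.io.Source.fromFile(input).getLines().toList

for (line <- lines) out.println("Line: " + line)

// filter can now run on the same collection without exhausting anything
val matches = lines.filter(_ == searchTerm)
matches.foreach(println)

out.close()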
I have a path that contains subpaths, and each subpath contains files:
path="/data"
I implemented two functions to get the CSV files from each subpath:
import java.io.File

def getListOfSubDirectories(directoryName: String): Array[String] = {
  (new File(directoryName))
    .listFiles
    .filter(_.isDirectory)
    .map(_.getName)
}

def getListOfFiles(dir: String, extensions: List[String]): List[File] = {
  val d = new File(dir)
  d.listFiles.filter(_.isFile).toList.filter { file =>
    extensions.exists(file.getName.endsWith(_))
  }
}
Each subpath contains 5 CSV files: contextfile.csv, datafile.csv, datesfiles.csv, errors.csv, testfiles. My problem is that I work with each file in a separate DataFrame, and I need to get the name of the right file for the right DataFrame; for example, I want the name of the file that concerns context (i.e. contextfile.csv). I worked like this, but on each iteration the logic and the ordering in the List change:
val dir = getListOfSubDirectories(path)
for (sup_path <- dir) {
  val Files = getListOfFiles(path + "//" + sup_path, List(".csv"))
  val filename_context = Files(1).toString
  val filename_datavalue = Files(0).toString
  val filename_error = Files(3).toString
  val filename_testresult = Files(4).toString
}
Any help is appreciated, thanks.
I solved it with just a simple filter:
val filename_context = Files.filter(f => f.getName.contains("context")).last.toString
val filename_datavalue = Files.filter(f => f.getName.contains("data")).last.toString
val filename_error = Files.filter(f => f.getName.contains("error")).last.toString
val filename_testresult = Files.filter(f => f.getName.contains("test")).last.toString
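A small variant worth considering (a sketch, not the original solution): find returns an Option and takes the first match instead of the last, so a missing file gives None rather than an exception from calling .last on an empty list:

// find returns Option[File]; the name patterns are the same ones used above
val contextFile = Files.find(f => f.getName.contains("context"))
val dataFile = Files.find(f => f.getName.contains("data"))
val errorFile = Files.find(f => f.getName.contains("error"))
val testFile = Files.find(f => f.getName.contains("test"))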
Within this code we have two files: athletes.csv, which contains names, and twitter.test, which contains tweet messages. We want to find, for every line in twitter.test, the name that matches a name in athletes.csv. We applied a map function to store the names from athletes.csv and want to iterate over all of the names for every line in the test file.
import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext

object twitterAthlete {
def loadAthleteNames() : Map[String, String] = {
// Handle character encoding issues:
implicit val codec = Codec("UTF-8")
codec.onMalformedInput(CodingErrorAction.REPLACE)
codec.onUnmappableCharacter(CodingErrorAction.REPLACE)
// Create a Map of Ints to Strings, and populate it from u.item.
var athleteInfo:Map[String, String] = Map()
//var movieNames:Map[Int, String] = Map()
val lines = Source.fromFile("../athletes.csv").getLines()
for (line <- lines) {
var fields = line.split(',')
if (fields.length > 1) {
athleteInfo += (fields(1) -> fields(7))
}
}
return athleteInfo
}
def parseLine(line:String): (String)= {
var athleteInfo = loadAthleteNames()
var hello = new String
for((k,v) <- athleteInfo){
if(line.toString().contains(k)){
hello = k
}
}
return (hello)
}
def main(args: Array[String]){
Logger.getLogger("org").setLevel(Level.ERROR)
val sc = new SparkContext("local[*]", "twitterAthlete")
val lines = sc.textFile("../twitter.test")
var athleteInfo = loadAthleteNames()
val splitting = lines.map(x => x.split(";")).map(x => if(x.length == 4 && x(2).length <= 140)x(2))
var hello = new String()
val container = splitting.map(x => for((key,value) <- athleteInfo)if(x.toString().contains(key)){key}).cache
container.collect().foreach(println)
// val mapping = container.map(x => (x,1)).reduceByKey(_+_)
//mapping.collect().foreach(println)
}
}
The first file looks like:
id,name,nationality,sex,height........
001,Michael,USA,male,1.96 ...
002,Json,GBR,male,1.76 ....
003,Martin,female,1.73 . ...
The second file looks like:
time, id , tweet .....
12:00, 03043, some message that contain some athletes names , .....
02:00, 03023, some message that contain some athletes names , .....
something like this ...
But I got an empty result after running this code; any suggestions are much appreciated.
The result I got is empty:
()....
()...
()...
But the result I expected is something like:
(name,1)
(other name,1)
You need to use yield so that the for comprehension inside your map actually returns a value:
val container = splitting.map(x => for((key,value) <- athleteInfo ; if(x.toString().contains(key)) ) yield (key, 1)).cache
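With that change, container holds one collection of (key, 1) pairs per tweet. If the goal is the (name, 1) counts from the commented-out reduceByKey step, a possible follow-up (a sketch, assuming the same splitting and athleteInfo as above) is to use flatMap so the pairs land in a single flat RDD first:

val matches = splitting.flatMap(x =>
  for ((key, _) <- athleteInfo if x.toString.contains(key)) yield (key, 1)
)
// matches is now a flat RDD of (name, 1) pairs, ready for counting
val counts = matches.reduceByKey(_ + _)
counts.collect().foreach(println)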
I think you should just start with the simplest option first...
I would use DataFrames so you can use the built-in CSV parsing and leverage Catalyst, Tungsten, etc.
Then you can use the built-in Tokenizer to split the tweets into words, explode, and do a simple join. Depending on how big or small the athlete-names data is, you may end up with a more optimized broadcast join and avoid a shuffle.
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.Tokenizer
val tweets = spark.read.format("csv").load(...)
val athletes = spark.read.format("csv").load(...)
val tokenizer = new Tokenizer()
tokenizer.setInputCol("tweet")
tokenizer.setOutputCol("words")
val tokenized = tokenizer.transform(tweets)
val exploded = tokenized.withColumn("word", explode('words))
val withAthlete = exploded.join(athletes, 'word === 'name)
withAthlete.select(exploded("id"), 'name).show()
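If the (name, 1) counts from the question are the end goal, one possible follow-up (a sketch, not part of the original answer) is a simple groupBy on the joined result:

// count how many tweet words matched each athlete name
withAthlete.groupBy('name).count().show()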
I came across the following example from the book "Fast Data Processing with Spark" by Holden Karau. I did not understand what the following lines of code do in the program:
val splitLines = inFile.map(line => {
val reader = new CSVReader(new StringReader(line))
reader.readNext()
})
val numericData = splitLines.map(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
The program is:
package pandaspark.examples
import spark.SparkContext
import spark.SparkContext._
import spark.SparkFiles;
import au.com.bytecode.opencsv.CSVReader
import java.io.StringReader
object LoadCsvExample {
def main(args: Array[String]) {
if (args.length != 2) {
System.err.println("Usage: LoadCsvExample <master>
<inputfile>")
System.exit(1)
}
val master = args(0)
val inputFile = args(1)
val sc = new SparkContext(master, "Load CSV Example",
System.getenv("SPARK_HOME"),
Seq(System.getenv("JARS")))
sc.addFile(inputFile)
val inFile = sc.textFile(inputFile)
val splitLines = inFile.map(line => {
val reader = new CSVReader(new StringReader(line))
reader.readNext()
})
val numericData = splitLines.map(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
println(summedData.collect().mkString(","))
}
}
I broadly know the functionality of the above program: it parses the input CSV and sums all the rows. But how exactly those three lines of code work to achieve that is what I am unable to understand.
Also, could anyone explain how the output would change if those lines are replaced with flatMap, like:
val splitLines = inFile.flatMap(line => {
val reader = new CSVReader(new StringReader(line))
reader.readNext()
})
val numericData = splitLines.flatMap(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
val splitLines = inFile.map(line => {
val reader = new CSVReader(new StringReader(line))
reader.readNext()
})
val numericData = splitLines.map(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
So this code is basically reading CSV file data and summing its values.
Suppose your CSV file is something like:
10,12,13
1,2,3,4
1,2
Here, inFile is where we fetch the data from the CSV file:
val inFile = sc.textFile("your CSV file path")
So here inFile is an RDD which holds text-formatted data, and when you apply collect on it, it will look like this:
Array[String] = Array(10,12,13 , 1,2,3,4 , 1,2)
When you apply map over it, each line will be:
line = 10,12,13
line = 1,2,3,4
line = 1,2
To read this data in CSV format, it uses:
val reader = new CSVReader(new StringReader(line))
reader.readNext()
After reading the data in CSV format, splitLines looks like:
Array(
Array(10,12,13),
Array(1,2,3,4),
Array(1,2)
)
On splitLines, it applies
splitLines.map(line => line.map(_.toDouble))
Here line will be Array(10,12,13), and then it uses
line.map(_.toDouble)
This changes the type of all elements from String to Double, so in numericData you get the same structure:
Array(Array(10.0, 12.0, 13.0), Array(1.0, 2.0, 3.0, 4.0), Array(1.0, 2.0))
but with every element now of type Double.
Finally, it sums each individual row (array), so the answer is something like:
Array(35.0, 10.0, 3.0)
You will get this when you apply summedData.collect().
First of all, there is no flatMap operation in your code sample, so the title is misleading. In general, map called on a collection returns a new collection with the function applied to each element.
Going line by line through your code snippet:
val splitLines = inFile.map(line => {
val reader = new CSVReader(new StringReader(line))
reader.readNext()
})
The type of inFile is RDD[String]. You take every such string, create a CSV reader out of it, and call readNext (which returns an array of strings). So at the end you get RDD[Array[String]].
val numericData = splitLines.map(line => line.map(_.toDouble))
A slightly trickier line with two nested map operations. Again, you take each element of the RDD (which is now an Array[String]) and apply the _.toDouble function to every element of that array. At the end you get RDD[Array[Double]].
val summedData = numericData.map(row => row.sum)
You take the elements of the RDD and apply the sum function to them. Since every element is an Array[Double], sum produces a single Double value. At the end you get RDD[Double].
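To make the types concrete, here is a minimal local sketch (plain Scala collections, no Spark) of the same three transformations, using the toy input from the answer above:

// splitLines: like readNext(), each line becomes an array of its fields
val inFile: List[String] = List("10,12,13", "1,2,3,4", "1,2")
val splitLines: List[Array[String]] = inFile.map(_.split(','))
// numericData: every field converted from String to Double
val numericData: List[Array[Double]] = splitLines.map(_.map(_.toDouble))
// summedData: each row collapsed to its sum
val summedData: List[Double] = numericData.map(_.sum)
println(summedData.mkString(","))  // 35.0,10.0,3.0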
In Scala, how do I read a file in HDFS and assign the contents to a variable? I know how to read a file, and I am able to print it, but if I try to assign the content to a string, the output is Unit(). Below is the code I tried:
val dfs = org.apache.hadoop.fs.FileSystem.get(config);
val snapshot_file = "/path/to/file/test.txt"
val stream = dfs.open(new Path(snapshot_file))
def readLines = Stream.cons(stream.readLine, Stream.continually( stream.readLine))
readLines.takeWhile(_ != null).foreach(line => println(line))
The above code prints the output properly, but if I try to assign the output to a string, I do not get the correct output:
val snapshot_id = readLines.takeWhile(_ != null).foreach(line => println(line))
snapshot_id: Unit = ()
What is the correct way to assign the contents to a variable?
You need to use mkString, since println returns Unit, and that Unit is what gets stored in your variable when you call println on your stream:
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://namenode:port/"), new org.apache.hadoop.conf.Configuration())
val path = new org.apache.hadoop.fs.Path("/user/cloudera/file.txt")
val stream = hdfs.open(path)
def readLines = scala.io.Source.fromInputStream(stream)
val snapshot_id: String = readLines.getLines().mkString("\n")  // getLines gives an Iterator[String] to join
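For comparison, a minimal standalone sketch (a plain list, no HDFS) of why the original assignment produced Unit:

val lines = List("a", "b", "c")
val printed: Unit = lines.foreach(println)    // side effect only, returns Unit
val contents: String = lines.mkString("\n")   // returns the joined String you want to keep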
I used org.apache.commons.io.IOUtils.toString to convert the stream into a string:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, LocalFileSystem, Path}
import org.apache.hadoop.hdfs.DistributedFileSystem

def getfileAsString(file: String): String = {
  val config: Configuration = new Configuration()
  config.set("fs.hdfs.impl", classOf[DistributedFileSystem].getName)
  config.set("fs.file.impl", classOf[LocalFileSystem].getName)
  val dfs = FileSystem.get(config)
  val filePath: FSDataInputStream = dfs.open(new Path(file))
  logInfo("file.available " + filePath.available)  // logInfo comes from the surrounding logging trait
  val outputxmlAsString: String = org.apache.commons.io.IOUtils.toString(filePath, "UTF-8")
  outputxmlAsString
}
I was hoping somebody could help; I'm new to Scala and I'm having some issues writing my output to a text file.
I have a data table and I've written some code to read it in one line at a time, do what I want it to do, and now I need it to write that line to a text file.
So, for example, I have the following table of data:
Name, Date, goX, goY, stopX, stopY
1, 12/01/01, 1166, 2299, 3300, 4477
My code takes the first character of goX and goY and creates a new number, in this instance 1.2, and does the same for stopX and stopY, so in this case you get 3.4.
What I want to get in the text file is essentially the following:
go, stop
1.2, 3.4
and I want it to go through hundreds of lines doing this until I have a long list of on and off in the text file.
My current code is as follows. This is almost certainly not the most elegant solution, but it is my first ever Scala/Java code:
import scala.io.Source
object FT2 extends App {
for(line<-Source.fromFile("C://Users//Data.csv").getLines){
var array = line.split(",")
val gox = (array(2));
val xStringGo = gox.toString
val goX =xStringGo.dropRight(1|2)
val goy = (array(3));
val yStringGo = goy.toString
val goY = yStringGo.dropRight(1|2)
val goXY = goX+"."+goY
val stopx = (array(4));
val xStringStop = stopx.toString
val stopX =xStringStop.dropRight(1|2)
val stopy = (array(3));
val yStringStop = stopy.toString
val stopY = yStringStop.dropRight(1|2)
val stopXY = stopX+"."+stopY
val GoStop = List(goXY,stopXY)
//This is where I want to print GoStop to a text file
  }
}
Any help is much appreciated!
This should do it:
import java.io._
val data = List("everything", "you", "want", "to", "write", "to", "the", "file")
val file = "whatever.txt"
val writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)))
for (x <- data) {
writer.write(x + "\n") // however you want to format it
}
writer.close()
But you can make it a little nicer by creating a method that will automatically close stuff for you:
def using[T <: Closeable, R](resource: T)(block: T => R): R = {
try { block(resource) }
finally { resource.close() }
}
using(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)))) {
writer =>
for (x <- data) {
writer.write(x + "\n") // however you want to format it
}
}
So:
using(new BufferedWriter(new OutputStreamWriter(new FileOutputStream("output.txt")))) {
writer =>
for(line <- io.Source.fromFile("input.txt").getLines) {
writer.write(line + "\n") // however you want to format it
}
}
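Putting the pieces together for the original go/stop task, a rough sketch could look like the following. The column positions and the output file name are assumptions based on the question, and take(1) is used here to keep the first character of each value:

using(new BufferedWriter(new OutputStreamWriter(new FileOutputStream("GoStop.txt")))) { writer =>
  writer.write("go, stop\n")
  for (line <- io.Source.fromFile("C://Users//Data.csv").getLines) {
    val array = line.split(",")
    // columns assumed: Name, Date, goX, goY, stopX, stopY
    val goXY = array(2).take(1) + "." + array(3).take(1)
    val stopXY = array(4).take(1) + "." + array(5).take(1)
    writer.write(goXY + ", " + stopXY + "\n")
  }
}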