I've seen this question but I'm not completely sure I can achieve what I want with the answer that was provided.
Note that this is just an exercise to study Scala. The example that I'll provide may not make much sense.
I want to open my ~/.subversion/servers file and, if I spot a line that contains the word "proxy", comment it out (basically, I just want to prepend the character "#"). Every other line must be left as is.
So, this file:
Line 1
Line 2
http-proxy-host = defaultproxy.whatever.com
Line 3
would become:
Line 1
Line 2
# http-proxy-host = defaultproxy.whatever.com
Line 3
I was able to read the file, spot the lines I want to change and print them. Here's what I've done so far:
import java.io.File

val fileToFilter = new File(filePath)
io.Source.fromFile(fileToFilter)
  .getLines
  .filter(line => !line.startsWith("#"))                 // skip lines that are already commented
  .filter(line => line.toLowerCase().contains("proxy"))  // keep only the proxy lines
  .map(line => "#" + line)
  .foreach(line => println(line))
I'm missing two things:
How to save the changes I've made to the file (can I do it directly, or do I need to write the changes to a temp file and then replace the "servers" file with it?)
How to apply the "map" conditionally (if I spot the word "proxy", I prepend the "#"; otherwise I leave the line as is).
Is this possible? Am I even following the right approach to solve this problem?
Thank you very much.
Save to a different file, then rename it back to the original one.
Use if-else.
This should work:
import java.io.File
import java.io.PrintWriter
import scala.io.Source

val f1 = "svn.txt"                // original file
val f2 = new File("/tmp/abc.txt") // temporary file
val w = new PrintWriter(f2)

Source.fromFile(f1).getLines
  .map { x => if (x.contains("proxy")) s"# $x" else x }
  .foreach(x => w.println(x))

w.close()
f2.renameTo(new File(f1)) // renameTo takes a File, not a String; note it can
                          // fail if /tmp is on a different filesystem
There is no "replace file" method in the stock Scala libraries, so you would open the file (as you are doing), make the changes, and then save it back (there are various ways to do this) to the same path.
As for updating certain lines to start with "#" if they contain "proxy":
.map {
  case l if l.contains("proxy") => s"# $l"
  case l => l
}
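Putting both pieces together, a minimal sketch of the full round trip might look like this (the hard-coded path and the in-memory read are assumptions; both are fine for a small config file like servers):
import java.nio.file.{Files, Paths}
import scala.io.Source

// Sketch only: read the whole file into memory, then overwrite it in place.
val path = Paths.get(sys.props("user.home"), ".subversion", "servers")
val src = Source.fromFile(path.toFile)
val updated =
  try src.getLines().map {
    case l if l.contains("proxy") => s"# $l" // comment out proxy lines
    case l                        => l       // leave everything else as is
  }.mkString("\n")
  finally src.close()
Files.write(path, updated.getBytes("UTF-8"))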
Related
I have a text file with the below data, which has no particular format:
abc*123 *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~
hig*0109*10052200*Rq~
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~
I want to split the file based on the string abc, with the output as two files as below.
file 1:
abc*123 *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~
hig*0109*10052200*Rq~
file 2:
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~
And the file names should be the IT name (from the line that starts with k7), so file 1's name should be IT_1234 and the second file's name should be IT_8876.
There is this little dirty trick that I used for a project:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "abc")
You can set the record delimiter of your Spark context for reading files. So you could do something like this:
import org.apache.spark.sql.functions.col
import spark.implicits._ // needed for .toDF (spark is the SparkSession in the Spark shell)

val delimit = "abc"
sc.hadoopConfiguration.set("textinputformat.record.delimiter", delimit)
val df = sc.textFile("your_original_file.txt")
  .map(x => delimit ++ x)                    // re-attach the delimiter the reader strips
  .toDF("delimit_column")
  .filter(col("delimit_column") =!= delimit) // =!= is spelled !== on Spark 1.x
Then you can map each element of your DataFrame (or RDD) to be written to a file.
It's a dirty method, but it might help you!
Have a good day
PS: The filter at the end drops the first record, which is empty apart from the concatenated delimiter.
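From there, a hedged sketch of the final write step might look like this (it assumes the chunks are small enough to collect to the driver; the k7/IT naming regex is an assumption based on the sample data):
import java.nio.file.{Files, Paths}

val itName = """k7\*IT (\w+)\*""".r // hypothetical pattern for the k7 line

df.collect().map(_.getString(0)).foreach { record =>
  val name = itName.findFirstMatchIn(record)
    .map(m => s"IT_${m.group(1)}")
    .getOrElse("IT_unknown")
  Files.write(Paths.get(s"$name.txt"), record.getBytes("UTF-8"))
}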
You can benefit from SparkContext's wholeTextFiles function to read the file, then parse it to separate the strings (here I have used #### as a distinct combination of characters that won't repeat in the text):
val rdd = sc.wholeTextFiles("path to the file")
  .flatMap(tuple => tuple._2.replace("\r\nabc", "####abc").split("####"))
  .collect() // after collect() this is a local Array[String], one element per chunk
And then loop over the array to save the texts to output:
for (str <- rdd) {
  // saving code here
}
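For instance, a minimal sketch of the saving step, reusing the IT_xxxx naming from the question (the regex is an assumption about the k7 line format):
import java.io.{File, PrintWriter}

val itName = """k7\*IT (\w+)\*""".r // hypothetical pattern for the k7 line
for (str <- rdd) {
  val name = itName.findFirstMatchIn(str)
    .map(m => s"IT_${m.group(1)}")
    .getOrElse("IT_unknown")
  val w = new PrintWriter(new File(s"$name.txt"))
  try w.print(str) finally w.close()
}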
I was able to remove the first few lines of a single file using the code below:
scala> val file = sc.textFile("file:///root/path/file.csv")
Removing first 5 lines:
scala> val Data = file.mapPartitionsWithIndex{ (idx, iter) => if (idx == 0) iter.drop(5) else iter }
The problem is: suppose I have multiple files with the same columns, and I want to load all of them into an RDD, removing the first few lines of each file.
Is this actually possible?
I'd appreciate any help. Thanks in advance!
Let's assume there are 2 files.
ravis-MacBook-Pro:files raviramadoss$ cat file.csv
first_file_first_record
first_file_second_record
first_file_third_record
first_file_fourth_record
first_file_fifth_record
first_file_sixth_record
ravis-MacBook-Pro:files raviramadoss$ cat file_2.csv
second_file_first_record
second_file_second_record
second_file_third_record
second_file_fourth_record
second_file_fifth_record
second_file_sixth_record
second_file_seventh_record
second_file_eight_record
Scala Code
sc.wholeTextFiles("/Users/raviramadoss/files")
  .flatMap(_._2.lines.drop(5)) // on Scala 2.13 / JDK 11+, use linesIterator instead of lines
  .collect()
Output:
res41: Array[String] = Array(first_file_sixth_record, second_file_sixth_record, second_file_seventh_record, second_file_eight_record)
In Spark/Hadoop, if you give the input path as the directory containing all the files, everything under it is read. One caveat, though: with sc.textFile the partition index does not correspond one-to-one to files, so the idx == 0 check in your code would only drop lines from the first partition, not from each file. The wholeTextFiles approach above processes each file's content separately, so the first few lines are removed from every file.
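If you also need to know which file each remaining line came from, a small sketch along the same lines:
// Drop the first 5 lines of each file, keeping the source path with each line.
val data = sc.wholeTextFiles("/Users/raviramadoss/files").flatMap {
  case (path, content) => content.split("\n").drop(5).map(line => (path, line))
}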
Hi guys, I am parsing an unstructured file for some keywords, but I can't seem to easily find the line numbers of the results I am getting:
val filePath: String = "myfile"
val myfile = sc.textFile(filePath)
val ora_temp = myfile.filter(line => line.contains("MyPattern")).collect
ora_temp.length
However, I not only want to find the lines that contain MyPattern, I want tuples of the form (MyPattern line, line number).
Thanks in advance,
You can use zipWithIndex, as eliasah pointed out in a comment (probably the most succinct way to do this, using the direct tuple accessor syntax), or use pattern matching in the filter, like so:
val matchingLineAndLineNumberTuples = sc.textFile("myfile").zipWithIndex().filter {
  case (line, lineNumber) => line.contains("MyPattern")
}.collect
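One small note: zipWithIndex is zero-based, so if you want 1-based line numbers as editors usually display them, shift the index:
// zipWithIndex indices start at 0; add 1 for human-style line numbers
val withLineNumbers = matchingLineAndLineNumberTuples.map {
  case (line, idx) => (line, idx + 1)
}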
I'm looking for the best way to process a file in which, based on the contents, I combine certain lines into XML and return the XML.
e.g. Given
line 1
line 2
line 3
line 4
line 5
I may want the first call to return
<msg>line 1, line 2</msg>
and a subsequent call to return
<msg>line 5, line 4</msg>
skipping line 3 for uninteresting content and exhausting the input stream. (Note: the <msg> tags will always contain contiguous lines, but the number and organization of those lines in the XML will vary.) If you'd like some criteria for choosing lines to include in a message: assume odd line #s combine with the following four lines, even line #s combine with the following two lines, mod(10) line #s combine with the following five lines, and skip lines that start with '#'.
I was thinking I should implement this as an iterator so I can just do
<root>{ for (m <- messages(inputstream)) yield m }</root>
Is that reasonable? If so, how best to implement it? If not, how best to implement it? :)
Thanks
This answer provided my solution: How do you return an Iterator in Scala?
I tried the following. One gotcha: there was a buffering issue where lines were skipped between calls to Log.next until I created the line iterator once, instead of calling src.getLines() inside hasNext.
import scala.io.Source
import scala.xml.Node

class Log(filename: String) {
  val src = Source.fromFile(filename)
  val lines = src.getLines() // create the iterator once; calling src.getLines()
                             // inside hasNext re-buffers and skips lines
  var node: Node = null

  def iterator = new Iterator[Node] {
    def hasNext: Boolean = {
      for (line <- lines) {
        // ... do stuff ...
        if (null != node) return true
      }
      src.close()
      false
    }
    def next = node
  }
}
There might be a more Scala-like way to do it, and I'd like to see it, but this is my solution to move forward for now.
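For what it's worth, one hedged sketch of a more functional shape, with placeholder grouping logic (pairs of non-'#' lines) standing in for the real message rules:
import scala.io.Source
import scala.xml.Node

// Sketch only: build the Iterator[Node] from combinators, so no mutable
// node field is needed; closing the Source is left out for brevity.
def messages(filename: String): Iterator[Node] = {
  val src = Source.fromFile(filename)
  src.getLines()
    .filterNot(_.startsWith("#"))  // skip uninteresting lines
    .grouped(2)                    // placeholder for the real combining rules
    .map(group => <msg>{group.mkString(", ")}</msg>)
}
This plugs straight into the <root>{ for (m <- messages(...)) yield m }</root> usage from the question.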
Reading lines in a foreach loop, a function looks for a value by key in a CSV-like structured text file. After the specific line is found, it is senseless to continue reading lines. How do I stop, given that there is no break statement in Scala?
Scala's Source class is lazy. You can read chars or lines using takeWhile or dropWhile, and the iteration over the input need not proceed farther than required.
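For instance, a minimal sketch of the takeWhile variant (it stops pulling lines as soon as the predicate fails; the path and key are just for illustration):
import scala.io.Source

// Collect lines lazily until the Guest entry appears, without reading the rest.
val src = Source.fromFile("/etc/passwd")
val beforeGuest = src.getLines().takeWhile(line => !line.startsWith("Guest")).toList
src.close()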
To expand on Randall's answer. For instance if the key is in the first column:
val src = Source.fromFile("/etc/passwd")
val iter = src.getLines().map(_.split(":"))
// print the uid for Guest
iter.find(_(0) == "Guest") foreach (a => println(a(2)))
// the rest of iter is not processed
src.close()
Previous answers assumed that you want to read lines from a file; I assume that you want a way to break a for-loop on demand.
Here is a solution. You can do it like this:
import scala.util.control.Breaks._

breakable {
  for (...) {
    if (...) break
  }
}
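Applied to the file-reading scenario from the question, a minimal sketch might look like this (the file path and key are assumptions for illustration):
import scala.io.Source
import scala.util.control.Breaks._

val src = Source.fromFile("/etc/passwd") // hypothetical ':'-separated, CSV-like file
breakable {
  for (line <- src.getLines()) {
    val fields = line.split(":")
    if (fields.length > 2 && fields(0) == "Guest") {
      println(fields(2)) // found the key; no need to read further
      break()
    }
  }
}
src.close()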