Need the best way to iterate a file returning batches of lines as XML - scala

I'm looking for the best way to process a file in which, based on the contents, I combine certain lines into XML and return the XML.
e.g. Given
line 1
line 2
line 3
line 4
line 5
I may want the first call to return
<msg>line 1, line 2</msg>
and a subsequent call to return
<msg>line 5, line 4</msg>
skipping line 3 for uninteresting content and exhausting the input stream. (Note: the <msg> tags will always contain contiguous lines but the number and organization of those lines in the XML will vary.) If you'd like some criteria for choosing lines to include in a message, assume odd line #s combine with the following four lines, even line #s combine with the following two lines, mod(10) line #s combine with the following five lines, skip lines that start with '#'.
I was thinking I should implement this as an iterator so I can just write
<root>{ for (m <- messages(inputstream)) yield m }</root>
Is that reasonable? If so, how best to implement it? If not, how best to implement it? :)
Thanks

This answer provided my solution: How do you return an Iterator in Scala?
I tried the following but there appears to be some sort of buffer issue and lines are skipped between calls to Log.next.
class Log(filename: String) {
  val src = io.Source.fromFile(filename)
  var node: Node = null
  def iterator = new Iterator[Node] {
    def hasNext: Boolean = {
      for (line <- src.getLines()) {
        // ... do stuff ...
        if (null != node) return true
      }
      src.close()
      false
    }
    def next = node
  }
}
There might be a more idiomatic Scala way to do it, and I'd like to see it, but this is my solution to move forward for now.
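A likely cause of the skipped lines is that hasNext calls src.getLines() afresh on every invocation, and each new line iterator may read ahead and take buffered input with it when it is discarded. A minimal sketch of one alternative follows, assuming the line iterator is obtained once and wrapped with buffered so the next line can be inspected without consuming it. The names messages and batchSize are hypothetical, the rule (skip lines starting with '#', batch the remaining contiguous lines in pairs) is only a stand-in for the real criteria, and it builds plain strings rather than scala.xml nodes to avoid an extra dependency.

```scala
// Hypothetical sketch: batch contiguous lines into <msg> strings.
// The grouping rule (skip '#' lines, batches of batchSize) is an assumption.
def messages(lines: Iterator[String], batchSize: Int = 2): Iterator[String] = {
  val it = lines.buffered // lets us peek at the next line via `head`
  new Iterator[String] {
    // Drop uninteresting lines so that hasNext reflects real content.
    private def skipComments(): Unit =
      while (it.hasNext && it.head.startsWith("#")) it.next()

    def hasNext: Boolean = { skipComments(); it.hasNext }

    def next(): String = {
      skipComments()
      if (!it.hasNext) throw new NoSuchElementException("no more messages")
      val batch = scala.collection.mutable.ListBuffer[String]()
      // Collect up to batchSize contiguous, non-comment lines.
      while (batch.size < batchSize && it.hasNext && !it.head.startsWith("#"))
        batch += it.next()
      batch.mkString("<msg>", ", ", "</msg>")
    }
  }
}
```

With a file, the idea would be to call src.getLines() once, pass the result to messages, and close src only after the returned iterator is exhausted.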

Related

Reading multiple integers from line in text file

I am using Scala and reading input from the console. I am able to regurgitate the strings that make up each line, but if my input has the following format, how can I access each integer within each line?
2 2
1 2 2
2 1 1
Currently I just regurgitate the input back to the console using
object Main {
  def main(args: Array[String]): Unit = {
    for (ln <- io.Source.stdin.getLines) println(ln)
    //how can I access each individual number within each line?
  }
}
And I need to compile this project like so:
$ scalac main.scala
$ scala Main <input01.txt
2 2
1 2 2
2 1 1
A reasonable algorithm would be:
for each line, split it into words
parse each word into an Int
An implementation of that algorithm:
io.Source.stdin.getLines // for each line...
  .flatMap(
    _.split("""\s+""") // split it into words
      .map(_.toInt) // parse each word into an Int
  )
The result of this expression will be an Iterator[Int]; if you want a Seq, you can call toSeq on that Iterator (if there's a reasonable chance there will be more than 7 or so integers, it's probably worth calling toVector instead). It will blow up with a NumberFormatException if there's a word which isn't an integer. You can handle this a few different ways... if you want to ignore words that aren't integers, you can:
import scala.util.Try
io.Source.stdin.getLines
  .flatMap(
    _.split("""\s+""")
      .flatMap(word => Try(word.toInt).toOption) // drop words that are not integers
  )
The following will give you a flat list of numbers.
val integers = (
  for {
    line <- io.Source.stdin.getLines
    number <- line.split("""\s+""").map(_.toInt)
  } yield number
)
Some care must be taken when parsing the numbers: toInt throws a NumberFormatException for any word that is not a valid integer.
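If the integers should stay grouped by line rather than flattened into one sequence, each line can be parsed on its own. This is a small sketch under that assumption; parseLine is a hypothetical helper name, and words that are not integers are silently dropped.

```scala
import scala.util.Try

// Hypothetical helper: parse one line into the integers it contains.
// Words that are not valid integers are ignored.
def parseLine(line: String): List[Int] =
  line.trim
    .split("""\s+""")
    .toList
    .flatMap(word => Try(word.toInt).toOption)

// Sample input from the question, one List[Int] per line.
val rows: List[List[Int]] = List("2 2", "1 2 2", "2 1 1").map(parseLine)
```

For console input, the same helper could be mapped over io.Source.stdin.getLines().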

Remove white spaces in scala-spark

I have a sample file record like this:
2018-01-1509.05.540000000000001000000751111EMAIL#AAA.BB.CL
The record above comes from a fixed-length file, and I want to split it based on the field lengths.
When I split it, I get the list shown below.
ListBuffer(2018-01-15, 09.05.54, 00000000000010000007, 5, 1111, EMAIL#AAA.BB.CL)
Everything looks fine so far, but I am not sure why an extra space is added before each field in the list (except the first).
Example: my data is "09.05.54", but I get " 09.05.54" in the list.
My logic for splitting is shown below:
// Logic to Split the Line based on the lengths
def splitLineBasedOnLengths(line: String, lengths: List[String]): ListBuffer[Any] = {
  var splittedLine = line
  var split = new ListBuffer[Any]()
  for (i <- lengths) yield {
    var c = i.toInt
    var fi = splittedLine.take(c)
    split += fi
    splittedLine = splittedLine.drop(c)
  }
  split
}
The code above takes the line and a List[String] of field lengths as input and returns a ListBuffer[Any] containing the line split according to those lengths.
Can anyone help me understand why I am getting an extra space before each field after splitting?
There are no extra spaces in the data. When the ListBuffer is printed, its toString puts ", " between the elements to make them easier to read.
To prove this, try the following code:
split.foreach(s => println("\"" + s + "\""))
You will see the following printed:
"2018-01-15"
"09.05.54"
"00000000000010000007"
"5"
"1111"
"EMAIL#AAA.BB.CL"

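The same fixed-width split can also be written without mutable state. The following is only a sketch: splitByLengths is a hypothetical name, the lengths are taken as Int rather than String, and the sample record is a shortened made-up one. It computes each field's start offset with scanLeft and then slices the line.

```scala
// Hypothetical immutable variant of the fixed-width split.
def splitByLengths(line: String, lengths: List[Int]): List[String] = {
  // Running start offsets: 0, l1, l1 + l2, ...
  val offsets = lengths.scanLeft(0)(_ + _)
  offsets.zip(lengths).map { case (start, len) =>
    line.slice(start, start + len) // slice tolerates out-of-range indices
  }
}

// Shortened, made-up record: a date (10), a time (8), a code (5).
val fields = splitByLengths("2018-01-1509.05.54ABCDE", List(10, 8, 5))
// mkString makes the separator explicit, so no space appears between fields.
val joined = fields.mkString("|")
```
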
Count filtered records in scala

I am new to Scala, so this problem might look very basic.
I have a file called data.txt with contents like this:
xxx.lss.yyy23.com-->mailuogwprd23.lss.com,Hub,12689,14.98904563,1549
xxx.lss.yyy33.com-->mailusrhubprd33.lss.com,Outbound,72996,1.673717588,1949
xxx.lss.yyy33.com-->mailuogwprd33.lss.com,Hub,12133,14.9381027,664
xxx.lss.yyy53.com-->mailusrhubprd53.lss.com,Outbound,72996,1.673717588,3071
I want to split the line and find the records depending upon the numbers in xxx.lss.yyy23.com
val data = io.Source.fromFile("data.txt").getLines().map { x => (x.split("-->"))}.map { r => r(0) }.mkString("\n")
which gives me
xxx.lss.yyy23.com
xxx.lss.yyy33.com
xxx.lss.yyy33.com
xxx.lss.yyy53.com
This is how I am trying to count the matching values:
data.count { x => x.contains("33")}
How do I get the count of records that do not contain 33?
The following will give you the number of lines that contain "33":
data.split("\n").count(a => a.contains("33"))
What you have above does not work because data is a single string: the previous statement concatenates the results with mkString, using a newline as the separator, so collection operations like count run over its characters rather than its lines. You need to split data back into an array of strings first.
The following will work for getting the lines that do not contain "33":
data.split("\n").count(a => !a.contains("33"))
You simply need to negate the contains operation in this case.
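An alternative is to skip the mkString step altogether and keep the host names as a collection, so that count works on them directly. This is a sketch that assumes the lines are already in memory; the sample list stands in for io.Source.fromFile("data.txt").getLines(), and the names lines and hosts are illustrative.

```scala
// Sample lines from the question, standing in for the file contents.
val lines = List(
  "xxx.lss.yyy23.com-->mailuogwprd23.lss.com,Hub,12689,14.98904563,1549",
  "xxx.lss.yyy33.com-->mailusrhubprd33.lss.com,Outbound,72996,1.673717588,1949",
  "xxx.lss.yyy33.com-->mailuogwprd33.lss.com,Hub,12133,14.9381027,664",
  "xxx.lss.yyy53.com-->mailusrhubprd53.lss.com,Outbound,72996,1.673717588,3071"
)

// Keep the part before "-->" as a List[String] instead of one big string.
val hosts = lines.map(_.split("-->")(0))

val with33 = hosts.count(_.contains("33"))     // hosts containing "33"
val without33 = hosts.count(!_.contains("33")) // hosts not containing "33"
```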

In Scala, when reading from a file how would I skip the first line?

The file is very large so I cannot store in memory. I iterate line by line as follows
for (line <- Source.fromFile(file).getLines) {
}
How can I specify that the first line should be skipped?
How about:
for (line <- Source.fromFile(file).getLines.drop(1)) {
  // ...
}
drop will simply advance the iterator (returned by getLines) past the specified number of elements.
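Since the file is large, it may also be worth closing the Source once the loop is done. A minimal sketch, assuming a hypothetical helper named processBody that streams every line after the first to a callback and never holds the whole file in memory:

```scala
import scala.io.Source

// Hypothetical helper: feed every line except the first to `handle`,
// then close the source even if `handle` throws.
def processBody(src: Source)(handle: String => Unit): Unit =
  try src.getLines().drop(1).foreach(handle)
  finally src.close()
```

For a real file, this could be called as processBody(Source.fromFile(file)) { line => ... }.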

In Scala, how to stop reading lines from a file as soon as a criterion is accomplished?

Reading lines in a foreach loop, a function looks for a value by a key in a CSV-like structured text file. After a specific line is found, it is senseless to continue reading lines looking for something there. How to stop as there is no break statement in Scala?
Scala's Source class is lazy. You can read chars or lines using takeWhile or dropWhile and the iteration over the input need not proceed farther than required.
To expand on Randall's answer: for instance, if the key is in the first column:
import scala.io.Source

val src = Source.fromFile("/etc/passwd")
val iter = src.getLines().map(_.split(":"))
// print the uid for Guest
iter.find(_(0) == "Guest") foreach (a => println(a(2)))
// the rest of iter is not processed
src.close()
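The takeWhile and dropWhile calls mentioned above work the same way on the line iterator. The sketch below uses two hypothetical helpers, linesBefore and firstMatch, and assumes the key is a prefix of the line; neither reads past the first matching line.

```scala
// Hypothetical helper: all lines before the first one starting with `key`.
def linesBefore(lines: Iterator[String], key: String): List[String] =
  lines.takeWhile(!_.startsWith(key)).toList

// Hypothetical helper: the first line starting with `key`, if any.
def firstMatch(lines: Iterator[String], key: String): Option[String] = {
  val rest = lines.dropWhile(!_.startsWith(key))
  if (rest.hasNext) Some(rest.next()) else None
}
```

Each helper consumes the iterator it is given, so a fresh getLines() result (or a fresh Source) is needed for each call.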
The previous answers assume you want to read lines from a file; I assume you want a way to break out of a for loop on demand.
You can do it like this:
import scala.util.control.Breaks._

breakable {
  for (...) {
    if (...) break
  }
}
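A filled-in version of the breakable pattern might look like the sketch below. The function name firstLineWith and the prefix-match criterion are assumptions made for illustration; break() is implemented with an exception that breakable catches, so it has to be called inside the breakable block.

```scala
import scala.util.control.Breaks.{break, breakable}

// Hypothetical example: stop the loop at the first line starting with `key`.
def firstLineWith(lines: Iterator[String], key: String): Option[String] = {
  var found: Option[String] = None
  breakable {
    for (line <- lines) {
      if (line.startsWith(key)) {
        found = Some(line)
        break() // leaves the enclosing breakable block
      }
    }
  }
  found
}
```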