Scala: scanLeft one item behind when reading from stdin - scala

If I process the input from stdin with scanLeft, the resulting output is always one line behind my last input:
io.Source.stdin
.getLines
.scanLeft("START:")((accu, line) => accu + " " + line)
.foreach(println(_))
Results in (my manual inputs are preceded by >):
> first
START:
> second
START: first
> third
START: first second
The sensible output I want is:
> first
START: first
> second
START: first second
> third
START: first second third
As you can see, the output following the first input line should already contain the string of the first input line.
I already tried it using .scanLeft(...).drop(1).foreach(...), but this leads to the following result:
> first
> second
START: first
> third
START: first second
How do I correctly omit the pure seed to get the desired result?
[UPDATE]
For the time being I am content with Andrey Tyukin's nifty workaround. Many thanks for suggesting it.
But of course, if there is any alternative to scanLeft that does not send the seed as first item into the following iteration chain, I will prefer that solution.
[UPDATE]
User jwvh understood my objective and provided an excellent solution to it. To round off their suggestion I seek a way of preprocessing the lines before sending them into the accumulation callback. Thus the readLine command should not be called in the accumulation callback but in a different chain link I can prepend.

You could move println into the body of scanLeft itself, to force immediate execution without the lag:
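io.Source.stdin
  .getLines
  .scanLeft("START:") { (accu, line) =>
    val res = accu + " " + line
    println(res) // print inside the callback so the output appears right after each input
    res
  }.foreach{_ => } // only drives the iterator; the printing already happened above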
However, this seems to behave exactly the same as a shorter and more intuitive foldLeft:
io.Source.stdin
  .getLines
  .foldLeft("START:") { (accu, line) =>
    val res = accu + " " + line
    println(res)
    res
  }
Example interaction:
first
START: first
second
START: first second
third
START: first second third
fourth
START: first second third fourth
fifth
START: first second third fourth fifth
sixth
START: first second third fourth fifth sixth
seventh
START: first second third fourth fifth sixth seventh
end
START: first second third fourth fifth sixth seventh end
EDIT
You can of course add a map-step to preprocess the lines:
io.Source.stdin
  .getLines
  .map(_.toUpperCase) // preprocessing step, applied before the accumulation callback
  .foldLeft("START:") { (accu, line) =>
    val res = accu + " " + line
    println(res)
    res
  }
Example interaction (typed lowercase, printed uppercase):
> foo
START: FOO
> bar
START: FOO BAR
> baz
START: FOO BAR BAZ
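If the goal is only to avoid emitting the bare seed, one alternative is to keep the accumulator in a local variable and rely on the iterator's lazy map, so each result is produced as soon as its input line has been read. This is a minimal sketch, assuming a hypothetical helper named running (it is not part of the standard library):

```scala
object RunningConcat {
  // Hypothetical helper: emits one accumulated string per input line
  // and never emits the seed on its own.
  def running(lines: Iterator[String], seed: String): Iterator[String] = {
    var acc = seed
    lines.map { line =>
      acc = acc + " " + line
      acc
    }
  }
}

// Possible usage with stdin, including a preprocessing step:
// RunningConcat.running(io.Source.stdin.getLines().map(_.toUpperCase), "START:")
//   .foreach(println)
```

Because the printing stays outside the accumulation callback, further chain links can be added before or after the helper.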

You can get something pretty similar with Stream.iterate() in place of scanLeft() and StdIn.readLine in place of stdin.getLines.
def input = Stream.iterate("START:"){prev =>
val next = s"$prev ${io.StdIn.readLine}"
println(next)
next
}
Since a Stream is evaluated lazily you'll need some means to materialize it.
val inStr = input.takeWhile(! _.contains("quit")).last
START: one //after input "one"<return>
START: one two //after input "two"<return>
START: one two brit //after input "brit"<return>
START: one two brit quit //after input "quit"<return>
//inStr: String = START: one two brit
You actually don't have to give up on the getLines iterator if that's a requirement.
val inItr = io.Source.stdin.getLines // a val, so the same iterator is reused on every step
def input = Stream.iterate("START:"){prev =>
  val next = s"$prev ${inItr.next}"
  println(next)
  next
}
Not sure if this addresses your comments or not. Lots depends on where possible errors might come from and how they are determined.
Stream.iterate(document()){ doc =>
val line = io.StdIn.readLine //blocks here
.trim
.filterNot(_.isControl)
//other String or Char manipulations
doc.update(line)
/* at this point you have both input line and updated document to play with */
... //handle error and logging requirements
doc //for the next iteration
}
I've assumed that .update() modifies the source document and returns nothing (returns Unit). That's the usual signature for an update() method.
Much of this can be done in a call chain (_.method1.method2. etc.) but sometimes that just makes things more complicated.
Methods that don't return a value of interest can still be added to a call chain by using something called the kestrel pattern.
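As an illustration of the kestrel pattern, here is a small sketch. The Document class and the tapWith name are hypothetical stand-ins made up for this example; Scala 2.13 also offers a similar tap method in scala.util.chaining.

```scala
object KestrelSketch {
  // Hypothetical document type whose update() returns Unit, as assumed above.
  final class Document {
    private val lines = scala.collection.mutable.ListBuffer.empty[String]
    def update(line: String): Unit = { lines += line; () }
    def contents: List[String] = lines.toList
  }

  // The kestrel: run a side effect on a value, then return the value itself.
  implicit class Kestrel[A](val self: A) {
    def tapWith(sideEffect: A => Unit): A = { sideEffect(self); self }
  }

  // update() returns Unit, yet it can still sit in the middle of a call chain.
  def demo(): List[String] =
    new Document()
      .tapWith(_.update("one"))
      .tapWith(_.update("two"))
      .contents
}
```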

Related

Trouble With Rhyming Algorithm Scala

I want to make a method that takes two lists of strings representing the sounds (phonemes and vowels) of two words as parameters. The method should determine whether or not the words rhyme based on those sounds.
Definition of a rhyme: two words rhyme if their sounds are the same from the last vowel (inclusive) onward. Words rhyme even if their last vowel sounds have different stress. Only vowels have stress levels (the trailing numbers).
My approach so far is to reverse each list so that the sounds are in reverse order, then collect everything from the start of the list up to the first vowel (inclusive), and then compare the two lists for equality. Please keep the code basic; I'm only at an elementary level of Scala and have just finished learning program execution.
Ex1: the words GEE and NEE rhyme, because the GEE sounds ("JH","IY1") become ("IY1","JH") and the NEE sounds ("N","IY1") become ("IY1","N"). Since they have the same vowel, everything after it should not be considered.
Ex2: the words GEE and JEEP do not rhyme, because the GEE sounds ("JH","IY1") become ("IY1","JH") and the JEEP sounds ("JH","IY1","P") become ("P","IY1","JH"). Since the first sound of reversed GEE is a vowel, it is compared against "P" and "IY1" in JEEP.
Ex3: the words HALF and GRAPH rhyme, because the HALF sounds ("HH","AE1","F") become ("F","AE1","HH") and the GRAPH sounds ("G","R","AE2","F") become ("F","AE2","R","G"). Although the vowels have different stress (numbers), we ignore the stress since the vowels themselves are the same.
def isRhymeSounds(soundList1: List[String], soundList2: List[String]): Boolean={
val revSound1 = soundList1.reverse
val revSound2 = soundList2.reverse
var revSoundList1:List[String] = List()
var revSoundList2:List[String] = List()
for(sound1 <- revSound1) {
if(sound1.length >= 3) {
val editVowel1 = sound1.substring(0,2)
revSoundList1 = revSoundList1 :+ editVowel1
}
else {
revSoundList1 = revSoundList1 :+ sound1
}
}
for(sound2 <- revSound2) {
if(sound2.length >= 3) {
val editVowel2 = sound2.substring(0, 2)
revSoundList2 = revSoundList2 :+ editVowel2
}
else {
revSoundList2 = revSoundList2 :+ sound2
}
}
if(revSoundList1 == revSoundList2){
true
}
else{
false
}
}
I don't think reversing is necessary.
def isRhyme(sndA :List[String], sndB :List[String]) :Boolean = {
val vowel = "[AEIOUY]+".r
sndA.foldLeft(""){case (res, s) => vowel.findPrefixOf(s).getOrElse(res+s)} ==
sndB.foldLeft(""){case (res, s) => vowel.findPrefixOf(s).getOrElse(res+s)}
}
explanation
"[AEIOUY]+".r - This is a Regular Expression (that's the .r part) that means "a String of one or more of these characters." In other words, any combination of capital letter vowels.
findPrefixOf() - This returns the first part of the test string that matches the regular expression. So vowel.findPrefixOf("AY2") returns Some("AY") because the first two letters match the regular expression. And vowel.findPrefixOf("OFF") returns Some("O") because only the first letter matches the regular expression. But vowel.findPrefixOf("BAY") returns None because the string does not start with any of the specified characters.
getOrElse() - This unwraps an Option. So Some("AY").getOrElse("X") returns "AY", and Some("O").getOrElse("X") returns "O", but None.getOrElse("X") returns "X" because there's nothing inside a None value so we go with the OrElse default return value.
foldLeft()() - This takes a collection of elements and, starting from the "left", it "folds" them in on each other until a final result is obtained.
So, consider how List("HH", "AE1", "F", "JH", "IY1", "P") would be processed.
res      s      result
===      ===    ======
""       HH     ""+HH     //s doesn't match the RE so default res+s
HH       AE1    AE        //s does match the RE, use only the matching part
AE       F      AE+F      //res+s
AEF      JH     AEF+JH    //res+s
AEFJH    IY1    IY        //only the matching part
IY       P      IY+P      //res+s
final result: "IYP"
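To check the fold against the three examples from the question, the function can be wrapped as follows. This is only a sketch: the object name RhymeCheck and the helper rhymeKey are made up here, and the logic is the same fold as above.

```scala
object RhymeCheck {
  private val vowel = "[AEIOUY]+".r

  // Same fold as above: restart at every vowel (dropping its stress digit)
  // and append every non-vowel sound that follows it.
  private def rhymeKey(sounds: List[String]): String =
    sounds.foldLeft("") { case (res, s) => vowel.findPrefixOf(s).getOrElse(res + s) }

  def isRhyme(sndA: List[String], sndB: List[String]): Boolean =
    rhymeKey(sndA) == rhymeKey(sndB)
}

// GEE / NEE:    RhymeCheck.isRhyme(List("JH", "IY1"), List("N", "IY1"))
// GEE / JEEP:   RhymeCheck.isRhyme(List("JH", "IY1"), List("JH", "IY1", "P"))
// HALF / GRAPH: RhymeCheck.isRhyme(List("HH", "AE1", "F"), List("G", "R", "AE2", "F"))
```

With these inputs the first and third calls should report a rhyme and the second should not, matching Ex1 to Ex3.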

Remove white spaces in scala-spark

I have sample file record like this
2018-01-1509.05.540000000000001000000751111EMAIL#AAA.BB.CL
and the above record is from a fixed length file and I wanted to split based on the lengths
and when I split I am getting a list as shown below.
ListBuffer(2018-01-15, 09.05.54, 00000000000010000007, 5, 1111, EMAIL#AAA.BB.CL)
Everything looks fine so far, but I am not sure why an extra space is being added in front of each field in the list (except the first).
Example: my data is "09.05.54", but I am getting " 09.05.54" in the list.
My Logic for splitting is shown below
// Logic to Split the Line based on the lengths
def splitLineBasedOnLengths(line: String, lengths: List[String]): ListBuffer[Any] = {
var splittedLine = line
var split = new ListBuffer[Any]()
for (i <- lengths) yield {
var c = i.toInt
var fi = splittedLine.take(c)
split += fi
splittedLine = splittedLine.drop(c)
}
split
}
The above code takes the line and a List[String] of field lengths as input and returns a ListBuffer[Any] containing the line split according to those lengths.
Can anyone help me understand why I am getting an extra space before each field after splitting?
There are no extra spaces in the data. The ListBuffer's toString just puts ", " between the elements when printing them, to make them easier to read.
To prove this try the following code:
split.foreach(s => println(s"\"$s\""))
You will see the following printed:
"2018-01-15"
"09.05.54"
"00000000000010000007"
"5"
"1111"
"EMAIL#AAA.BB.CL"

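For comparison, the split itself can be written without mutable state, returning List[String] instead of ListBuffer[Any]. This is a sketch only; the object and method names are made up, and it assumes the lengths are already Ints.

```scala
object FixedWidth {
  // Hypothetical helper: cut `line` into consecutive fields of the given lengths.
  def splitByLengths(line: String, lengths: List[Int]): List[String] = {
    // Running start offsets: 0, len0, len0 + len1, ...
    val offsets = lengths.scanLeft(0)(_ + _)
    offsets.zip(lengths).map { case (start, len) => line.slice(start, start + len) }
  }
}

// The ", " separator only appears in the collection's toString:
// FixedWidth.splitByLengths("abcdef", List(2, 4)).toString      // List(ab, cdef)
// FixedWidth.splitByLengths("abcdef", List(2, 4)).mkString("|") // ab|cdef
```
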
Change the contents of a file in scala

I've seen this question but I'm not completely sure I can achieve what I want with the answer that was provided.
Note that this is just an exercise for studying Scala, so the example may not make much sense.
I want to open my ~/.subversion/servers file and, if I spot a line that contains the word "proxy", comment it out (basically I just want to prepend the character "#"). Every other line must be left as is.
So, this file:
Line 1
Line 2
http-proxy-host = defaultproxy.whatever.com
Line 3
would become:
Line 1
Line 2
# http-proxy-host = defaultproxy.whatever.com
Line 3
I was able to read the file, spot the lines I want to change and print them. Here's what I've done so far:
val fileToFilter = new File(filePath)
io.Source.fromFile(fileToFilter)
.getLines
.filter( line => !line.startsWith("#"))
.filter( line => line.toLowerCase().contains("proxy") )
.map( line => "#" + line )
.foreach( line => println( line ) )
I'm missing two things:
How to save the changes I've done to the file (can I do it directly, or do I need to copy the changes to a temp file and then replace the "servers" file with that temp file?)
How can I apply the "map" conditionally (if I spot the word "proxy", I prepend the "#", otherwise I leave the line as is).
Is this possible? Am I even following the right approach to solve this problem?
Thank you very much.
Save to a different file and rename it back to original one.
Use if-else
This should work:
import java.io.File
import java.io.PrintWriter
import scala.io.Source
val f1 = "svn.txt" // Original File
val f2 = new File("/tmp/abc.txt") // Temporary File
val w = new PrintWriter(f2)
Source.fromFile(f1).getLines
.map { x => if(x.contains("proxy")) s"# $x" else x }
.foreach(x => w.println(x))
w.close()
f2.renameTo(new File(f1)) // renameTo expects a File, not a String
There is no "replace" file method in stock scala libraries: so you would open the file (as you are showing), make the changes, and then save it back (also various ways to do this) to the same path.
As for prefixing certain lines with # if they contain "proxy":
.map {
  case l if l.contains("proxy") => s"# $l"
  case l => l
}
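Putting both suggestions together, the whole read, modify and replace cycle might look like the sketch below. The object and method names are hypothetical, and it swaps File.renameTo for java.nio.file.Files.move, which can replace an existing target and throws an exception on failure instead of returning false.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}
import scala.io.Source

object CommentProxyLines {
  // Prepend "# " to uncommented lines mentioning "proxy"; leave the rest as is.
  def commentOut(line: String): String =
    if (!line.startsWith("#") && line.toLowerCase.contains("proxy")) s"# $line" else line

  def rewrite(target: Path): Unit = {
    val source = Source.fromFile(target.toFile, "UTF-8")
    // Read everything before writing, and always close the source.
    val updated = try source.getLines().map(commentOut).toList finally source.close()
    val tmp = Files.createTempFile("servers", ".tmp")
    Files.write(tmp, updated.mkString("", "\n", "\n").getBytes(StandardCharsets.UTF_8))
    // Replace the original file with the temporary one.
    Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING)
  }
}
```

Reading the whole file into a list first is an assumption that the file is small, which seems reasonable for a configuration file.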

Need the best way to iterate a file returning batches of lines as XML

I'm looking for the best way to process a file in which, based on the contents, I combine certain lines into XML and return the XML.
e.g. Given
line 1
line 2
line 3
line 4
line 5
I may want the first call to return
<msg>line 1, line 2</msg>
and a subsequent call to return
<msg>line 5, line 4</msg>
skipping line 3 for uninteresting content and exhausting the input stream. (Note: the <msg> tags will always contain contiguous lines but the number and organization of those lines in the XML will vary.) If you'd like some criteria for choosing lines to include in a message, assume odd line #s combine with the following four lines, even line #s combine with the following two lines, mod(10) line #s combine with the following five lines, skip lines that start with '#'.
I was thinking I should implement this as an iterator so I can just do
<root>{ for (m <- messages(inputstream)) yield m }</root>
Is that reasonable? If so, how best to implement it? If not, how best to implement it? :)
Thanks
This answer provided my solution: How do you return an Iterator in Scala?
I tried the following but there appears to be some sort of buffer issue and lines are skipped between calls to Log.next.
class Log(filename:String) {
val src = io.Source.fromFile(filename)
var node:Node = null
def iterator = new Iterator[Node] {
def hasNext:Boolean = {
for (line <- src.getLines()) {
// ... do stuff ...
if (null != node) return true
}
src.close()
false
}
def next = node
} // end of the anonymous Iterator
} // end of class Log
There might be a more idiomatic Scala way to do it, and I'd like to see it, but this is my solution to move forward for now.
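One possible direction for a more iterator-friendly version is to obtain the line iterator once and derive the messages from it with the standard combinators, rather than calling getLines inside hasNext. The sketch below makes several assumptions: it uses a fixed batch size instead of the content-based rules from the question, builds plain strings rather than scala.xml nodes (no escaping is done), and the names are made up.

```scala
object Messages {
  // Hypothetical helper: skip lines starting with '#', then wrap every
  // `batchSize` consecutive lines in a single <msg> element.
  def messages(lines: Iterator[String], batchSize: Int = 2): Iterator[String] =
    lines
      .filterNot(_.startsWith("#"))
      .grouped(batchSize)
      .map(batch => batch.mkString("<msg>", ", ", "</msg>"))
}

// Possible usage with a file (close the source once the iterator is exhausted):
// val src = io.Source.fromFile(filename)
// Messages.messages(src.getLines()).foreach(println)
```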

How to extract character n-grams based on a large text

Given a large text file I want to extract the character n-grams using Apache Spark (do the task in parallel).
Example input (2 line text):
line 1: (Hello World, it)
line 2: (is a nice day)
Output n-grams:
Hel - ell - llo - lo_ - o_W - _Wo - Wor - orl - rld - ld, - d,_ - ,_i - _it - it_ - t_i - _is - ... and so on. So I want the return value to be an RDD[String], with each string containing one n-gram.
Notice that the newline is treated as white space in the output n-grams. I put each line in parentheses to be clear. Also, just to be clear, the text is not a single entry in an RDD; I read the file using the sc.textFile() method.
The main idea is to take all the lines within each partition and combine them into a long String. Next, we replace " " with "_" and call sliding on this string to create the trigrams for each partition in parallel.
Note: The resulting trigrams might not be 100% accurate, since we will miss a few trigrams at the beginning and the end of each partition (those that would span a partition boundary). Given that each partition can be several million characters long, the loss in accuracy should be negligible. The main benefit here is that each partition can be processed in parallel.
Here is some toy data. Everything below can be executed in any Spark REPL:
scala> val data = sc.parallelize(Seq("Hello World, it","is a nice day"))
data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[12]
scala> val trigrams = data.mapPartitions(_.toList.mkString(" ").replace(" ","_").sliding(3))
trigrams: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[14]
Here I will collect the trigrams to show what they look like (you might not want to do this if your dataset is massive):
scala> val asCollected = trigrams.collect
asCollected: Array[String] = Array(Hel, ell, llo, lo_, o_W, _Wo, Wor, orl, rld, ld,, d,_, ,_i, _it, is_, s_a, _a_, a_n, _ni, nic, ice, ce_, e_d, _da, day)
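The per-partition step can also be tried without Spark. The sketch below (object and method names are made up) applies the same mkString, replace and sliding chain to a plain iterator of lines. When both lines are processed together, the trigrams that cross the line break appear as well, which is the part that gets lost when the lines end up in different partitions, as in the collected output above.

```scala
object TrigramSketch {
  // Same chain as inside mapPartitions above, applied to a plain Iterator.
  def ngrams(lines: Iterator[String], n: Int = 3): Iterator[String] =
    lines.mkString(" ").replace(" ", "_").sliding(n)
}

// TrigramSketch.ngrams(Iterator("Hello World, it", "is a nice day")).mkString(" - ")
```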
You could use a function like the following:
def n_gram(str:String, n:Int) = (str + " ").sliding(n)
I am assuming the newline has been stripped off when reading the line, so I've added a space to compensate for that. If, on the other hand, the newline is preserved, you could define it as:
def n_gram(str:String, n:Int) = str.replace('\n', ' ').sliding(n)
Using your example:
println(n_gram("Hello World, it", 3).map(_.replace(' ', '_')).mkString(" - "))
would return:
Hel - ell - llo - lo_ - o_W - _Wo - Wor - orl - rld - ld, - d,_ - ,_i - _it - it_
There may be shorter ways to do this.
Assuming that the entire string (including the new line) is a single entry in an RDD, returning the following from flatMap should give you the result you want.
val strings = text.foldLeft(("", List[String]())) {
case ((s, l), c) =>
if (s.length < 2) {
val ns = s + c
(ns, l)
} else if (s.length == 2) {
val ns = s + c
(ns, ns :: l)
} else {
val ns = s.tail + c
(ns, ns :: l)
}
}._2