Iterating over the lines of a file - scala

I'd like to write a simple function that iterates over the lines of a text file. I believe in 2.8 one could do:
def lines(filename: String) : Iterator[String] = {
scala.io.Source.fromFile(filename).getLines
}
and that was that, but in 2.9 the above doesn't work and instead I must do:
def lines(filename: String) : Iterator[String] = {
scala.io.Source.fromFile(new File(filename)).getLines()
}
Now, the trouble is, I want to compose the above iterators in a for comprehension:
for ( l1 <- lines("file1.txt"); l2 <- lines("file2.txt") ){
do_stuff(l1, l2)
}
This, again, used to work fine with 2.8 but causes a "too many open files"
exception to be thrown in 2.9. This is understandable -- the second lines call
in the comprehension ends up opening (and not closing) a file for each line
in the first.
In my case, I know that the "file1.txt" is big and I don't want to suck it into
memory, but the second file is small, so I can write a different linesEager
like so:
def linesEager(filename: String): Iterator[String] = {
val buf = scala.io.Source.fromFile(new File(filename))
val zs = buf.getLines().toList.toIterator
buf.close()
zs
}
and then turn my for-comprehension into:
for (l1 <- lines("file1.txt"); l2 <- linesEager("file2.txt")){
do_stuff(l1, l2)
}
This works, but is clearly ugly. Can someone suggest a uniform and clean
way of achieving the above? It seems you need a way for the iterator
returned by lines to close the file when it reaches the end, and
this must have been happening in 2.8, which is why it worked there?
Thanks!
BTW -- here is a minimal version of the full program that shows the issue:
import java.io.PrintWriter
import java.io.File
object Fail {
def lines(filename: String) : Iterator[String] = {
val f = new File(filename)
scala.io.Source.fromFile(f).getLines()
}
def main(args: Array[String]) = {
val smallFile = args(0)
val bigFile = args(1)
println("helloworld")
for ( w1 <- lines(bigFile)
; w2 <- lines(smallFile)
)
{
if (w2 == w1){
val msg = "%s=%s\n".format(w1, w2)
println("found" + msg)
}
}
println("goodbye")
}
}
On 2.9.0 I compile with scalac WordsFail.scala and then I get this:
rjhala@goto:$ scalac WordsFail.scala
rjhala@goto:$ scala Fail passwd words
helloworld
java.io.FileNotFoundException: passwd (Too many open files)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:120)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at Fail$.lines(WordsFail.scala:8)
at Fail$$anonfun$main$1.apply(WordsFail.scala:18)
at Fail$$anonfun$main$1.apply(WordsFail.scala:17)
at scala.collection.Iterator$class.foreach(Iterator.scala:652)
at scala.io.BufferedSource$BufferedLineIterator.foreach(BufferedSource.scala:30)
at Fail$.main(WordsFail.scala:17)
at Fail.main(WordsFail.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at scala.tools.nsc.util.ScalaClassLoader$$anonfun$run$1.apply(ScalaClassLoader.scala:78)
at scala.tools.nsc.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:24)
at scala.tools.nsc.util.ScalaClassLoader$URLClassLoader.asContext(ScalaClassLoader.scala:88)
at scala.tools.nsc.util.ScalaClassLoader$class.run(ScalaClassLoader.scala:78)
at scala.tools.nsc.util.ScalaClassLoader$URLClassLoader.run(ScalaClassLoader.scala:101)
at scala.tools.nsc.ObjectRunner$.run(ObjectRunner.scala:33)
at scala.tools.nsc.ObjectRunner$.runAndCatch(ObjectRunner.scala:40)
at scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:56)
at scala.tools.nsc.MainGenericRunner.process(MainGenericRunner.scala:80)
at scala.tools.nsc.MainGenericRunner$.main(MainGenericRunner.scala:89)
at scala.tools.nsc.MainGenericRunner.main(MainGenericRunner.scala)

scala-arm provides a great mechanism for automagically closing resources when you're done with them.
import resource._
import scala.io.Source
for (file1 <- managed(Source.fromFile("file1.txt"));
l1 <- file1.getLines();
file2 <- managed(Source.fromFile("file2.txt"));
l2 <- file2.getLines()) {
do_stuff(l1, l2)
}
But unless you're counting on the contents of file2.txt to change while you're looping through file1.txt, it would be best to read that into a List before you loop. There's no need to convert it into an Iterator.

Maybe you should take a look at scala-arm (https://github.com/jsuereth/scala-arm) and let the closing of the files (file input streams) happen automatically in the background.
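For the behaviour the question asks about (an iterator that closes its file once the last line has been read), a minimal sketch in plain Scala, without scala-arm, could look like the following. The object name ClosingLines is hypothetical, and the sketch assumes the iterator is consumed to the end; if iteration is abandoned early, the file stays open.

```scala
import scala.io.Source

object ClosingLines {
  // Returns the lines of `filename`, closing the underlying Source
  // as soon as the iterator reports that it is exhausted.
  def lines(filename: String): Iterator[String] = new Iterator[String] {
    private val source = Source.fromFile(filename)
    private val underlying = source.getLines()
    private var closed = false

    def hasNext: Boolean =
      if (closed) false
      else if (underlying.hasNext) true
      else {
        source.close() // no more lines: release the file handle
        closed = true
        false
      }

    def next(): String =
      if (hasNext) underlying.next()
      else throw new NoSuchElementException("next on exhausted iterator")
  }
}
```

Used as for (l1 <- ClosingLines.lines("file1.txt"); l2 <- ClosingLines.lines("file2.txt")) do_stuff(l1, l2), this still reopens file2.txt once per line of file1.txt, but each handle is released when the inner loop finishes.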

Related

ZIO Fiber orElse generate exception messages

I want to use the combinator orElse on ZIO Fibers.
From docs:
If the first fiber succeeds, the composed fiber will succeed with its result; otherwise, the composed fiber will complete with the exit value of the second fiber (whether success or failure).
import zio._
import zio.console._
object MyApp extends App {
def f1 :Task[Int] = IO.fail(new Exception("f1 fail"))
def f2 :Task[Int] = IO.succeed(2)
val myAppLogic =
for {
f1f <- f1.fork
f2f <- f2.fork
ff = f1f.orElse(f2f)
r <- ff.join
_ <- putStrLn(s"Result is [$r]")
} yield ()
def run(args: List[String]) =
myAppLogic.fold(_ => 1, _ => 0)
}
I run it with sbt in the console, and the output is:
[info] Running MyApp
Fiber failed.
A checked error was not handled.
java.lang.Exception: f1 fail
at MyApp$.f1(MyApp.scala:6)
at MyApp$.<init>(MyApp.scala:11)
at MyApp$.<clinit>(MyApp.scala)
at MyApp.main(MyApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Result is [2]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sbt.Run.invokeMain(Run.scala:93)
at sbt.Run.run0(Run.scala:87)
at sbt.Run.execute$1(Run.scala:65)
at sbt.Run.$anonfun$run$4(Run.scala:77)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at sbt.util.InterfaceUtil$$anon$1.get(InterfaceUtil.scala:10)
at sbt.TrapExit$App.run(TrapExit.scala:252)
at java.lang.Thread.run(Thread.java:748)
Fiber:Id(1574829590403,2) was supposed to continue to: <empty trace>
Fiber:Id(1574829590403,2) ZIO Execution trace: <empty trace>
Fiber:Id(1574829590403,2) was spawned by:
Fiber:Id(1574829590397,1) was supposed to continue to:
a future continuation at MyApp$.myAppLogic(MyApp.scala:12)
a future continuation at MyApp$.run(MyApp.scala:19)
Fiber:Id(1574829590397,1) ZIO Execution trace: <empty trace>
Fiber:Id(1574829590397,1) was spawned by:
Fiber:Id(1574829590379,0) was supposed to continue to:
a future continuation at zio.App.main(App.scala:57)
a future continuation at zio.App.main(App.scala:56)
[Fiber:Id(1574829590379,0) ZIO Execution trace: <empty trace>
I can see the result of the second fiber, Result is [2].
But why does it output these unnecessary exception/warning messages?
By default a fiber failure warning is generated when a fiber that is not joined back fails so that errors do not get lost. But as you correctly note in some cases this is not necessary as the error is handled internally by the program logic, in this case by the orElse combinator. We have been working through a couple of other cases of spurious warnings being generated and I just opened a ticket for this one here. I expect we will have this resolved in the next release.
This happens because the default Platform instance created by zio.App has a reportFailure implementation that prints the cause of any failed fiber that was not interrupted to the console:
def reportFailure(cause: Cause[_]): Unit =
if (!cause.interrupted)
System.err.println(cause.prettyPrint)
To avoid this, you can provide your own Platform instance which doesn't do so:
import zio._
import zio.console._
import zio.internal.{Platform, PlatformLive}
object MyApp extends App {
override val Platform: Platform = PlatformLive.Default.withReportFailure(_ => ())
def f1: Task[Int] = IO.fail(new Exception("f1 fail"))
def f2: Task[Int] = IO.succeed(2)
val myAppLogic =
for {
f1f <- f1.fork
f2f <- f2.fork
ff = f1f.orElse(f2f)
r <- ff.join
_ <- putStrLn(s"Result is [$r]")
} yield ()
def run(args: List[String]) =
myAppLogic.fold(_ => 1, _ => 0)
}
Which yields:
Result is [2]
As @Adam Fraser noted, this will probably be fixed in an upcoming release.
Edit:
Should be fixed after https://github.com/zio/zio/pull/2339 was merged

Write to process stdin

How can I get some kind of writeable stream connected to stdin (and also readable streams connected to stdout and stderr) when launching a process via the scala.sys.process library? Here's the code that doesn't work (it doesn't even print the debug messages):
val p = Process("wc -l")
val io = BasicIO.standard(true)
val lines = Seq("a", "b", "c") mkString "\n"
val buf = lines.getBytes(StandardCharsets.UTF_8)
io withInput { w =>
println("Writing")
w.write(buf)
}
io withOutput { i =>
val s = new BufferedReader(new InputStreamReader(i)).readLine()
println(s"Output is $s")
}
You have a couple of problems.
First, in your snippet you never connect your process to the io and never run it.
That can be done like this: p run io.
Second, the withInput and withOutput methods return a NEW ProcessIO; they DON'T mutate the existing one, and since you don't assign the result of those calls to a variable, they have no effect.
The following snippet fixes both problems; I hope it works for you.
import scala.io.Source
import scala.sys.process._
import java.nio.charset.StandardCharsets
val p = Process("wc -l")
val io =
BasicIO.standard(true)
.withInput { w =>
val lines = Seq("a", "b", "c").mkString("", "\n", "\n")
val buf = lines.getBytes(StandardCharsets.UTF_8)
println("Writing")
w.write(buf)
w.close()
}
.withOutput { i =>
val s = Source.fromInputStream(i)
println(s"Output is ${s.getLines.mkString(",")}")
i.close()
}
p run io
Don't hesitate to ask for clarification.
PS: it prints "Output is 3" - (Thanks to Dima for pointing the mistake).
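When the input is known up front and only the output is needed, a shorter route may be to redirect an in-memory stream into the process with #< and collect stdout with !!. This is only a sketch, assuming a wc executable is on the PATH; the names WcExample and countLines are hypothetical.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import scala.sys.process._

object WcExample {
  // Feeds `lines` to `wc -l` on stdin and returns whatever it prints, trimmed.
  def countLines(lines: Seq[String]): String = {
    val bytes = lines.mkString("", "\n", "\n").getBytes(StandardCharsets.UTF_8)
    // #< connects the stream to the process's stdin; !! runs the process,
    // blocks until it exits, and returns its stdout as a String.
    (Process("wc -l") #< new ByteArrayInputStream(bytes)).!!.trim
  }
}
```

Note that !! throws an exception if the process exits with a non-zero code, so this form suits commands that are expected to succeed.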

Scala Iterator ++ blows the stack

I recently noticed a bug that causes a StackOverflowError in Scala's Iterator ++, caused by lazy initialization. Here's the code that makes the bug appear.
var lines = Source.fromFile("file").getLines()
var line = lines.next()
lines = Array(line).toIterator ++ lines
lines.foreach { println(_) }
System.exit(0)
What I get is
Exception in thread "main" java.lang.StackOverflowError
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:219)
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:219)
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:219)
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:219)
...
It should be caused by this line in scala source (scala.collection.Iterator.scala:208)
lazy val rhs: Iterator[A] = that.toIterator
As rhs is a lazily initialized val, by the time the iterator is used, what the name "lines" refers to has already changed, which creates a circular reference and leads to the error.
I noticed this post talks about the problem in 2013. However, it seems it has not been fully fixed. I am running Scala 2.11.8 from the Maven repository.
My question: I can rename the iterator, e.g. to "lines2", to avoid this bug, but is this the only way to solve the problem? Using the name "lines" feels more natural and I don't want to give it up if possible.
If you want to reload an Iterator using the same var, this appears to work. [Tested on 2.11.7 and 2.12.1]
scala> var lines = io.Source.fromFile("file.txt").getLines()
lines: Iterator[String] = non-empty iterator
scala> var line = lines.next()
line: String = this,that,other,than
scala> lines = Iterator(line +: lines.toSeq:_*)
lines: Iterator[String] = non-empty iterator
scala> lines.foreach(println)
this,that,other,than
here,there,every,where
But it might make more sense to use a BufferedIterator where you can call head on it to peek at the next element without consuming it.
explanation
lines.toSeq <-- turn the Iterator[String] into a Seq[String] (The REPL will show this as a Stream but that's because the REPL has to compile and represent each line of input separately.)
line +: lines.toSeq <-- create a new Seq[String] with line as the first element (i.e. prepended)
(line +: lines.toSeq:_*) <-- turns a single Seq[T] into a parameter list that can be passed to the Iterator.apply() method. @som-snytt has cleverly pointed out that this can be simplified to (line +: lines.toSeq).iterator
BufferedIterator example
scala> var lines = io.Source.fromFile("file.txt").getLines.buffered
lines: scala.collection.BufferedIterator[String] = non-empty iterator
^^^^^^^^^^^^^^^^^^^^^^^^ <-- note the type
scala> lines.head
res5: String = this,that,other,than
scala> lines foreach println
this,that,other,than
here,there,every,where
Simple capture:
scala> var lines = Iterator.continually("x")
lines: Iterator[String] = non-empty iterator
scala> lines = { val z = lines ; Iterator.single("y") ++ z }
lines: Iterator[String] = non-empty iterator
scala> lines.next
res0: String = y
scala> lines.next
res1: String = x
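The capture works because Iterator's ++ takes its argument by name: in lines = Array(line).toIterator ++ lines, the right-hand lines is only evaluated later, by which time the var already points at the new iterator. Passing the old iterator through a method parameter pins it down in the same way as the local val z does. A minimal sketch, with a hypothetical helper named prepend:

```scala
object IteratorPrepend {
  // `rest` is a stable method parameter, so the by-name argument of ++
  // always refers to the original iterator, never to a reassigned var.
  def prepend[A](first: A, rest: Iterator[A]): Iterator[A] =
    Iterator.single(first) ++ rest
}
```

With this helper the var can keep its name: lines = IteratorPrepend.prepend(line, lines).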

Why do I get a MalformedInputException from this code?

I'm a newbie in Scala, and I wanted to write some source code of my own in order to improve.
I've written a simple object (with a main entry point) in order to simulate a "grep" call on all files of the current directory. (I launch the program from Eclipse Indigo, on Debian Squeeze):
package com.gmail.bernabe.laurent.scala.tests
import java.io.File
import scala.io.Source
object DealWithFiles {
def main(args:Array[String]){
for (result <- grepFilesHere(".*aur.*"))
println(result)
}
private def grepFilesHere(pattern:String):Array[String] = {
val filesHere = new File(".").listFiles
def linesOfFile(file:File) =
Source.fromFile(file).getLines.toList
for (file <- filesHere;
if file.isFile
)
yield linesOfFile(file)(0)
}
}
But I get a java.nio.charset.MalformedInputException, which I am not able to solve :
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:260)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:319)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at java.io.BufferedReader.fill(BufferedReader.java:136)
at java.io.BufferedReader.readLine(BufferedReader.java:299)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
at scala.collection.Iterator$class.foreach(Iterator.scala:772)
at scala.io.BufferedSource$BufferedLineIterator.foreach(BufferedSource.scala:43)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:130)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:242)
at scala.io.BufferedSource$BufferedLineIterator.toList(BufferedSource.scala:43)
at com.gmail.bernabe.laurent.scala.tests.DealWithFiles$.linesOfFile$1(DealWithFiles.scala:18)
at com.gmail.bernabe.laurent.scala.tests.DealWithFiles$$anonfun$grepFilesHere$2.apply(DealWithFiles.scala:23)
at com.gmail.bernabe.laurent.scala.tests.DealWithFiles$$anonfun$grepFilesHere$2.apply(DealWithFiles.scala:20)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:697)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:34)
at scala.collection.mutable.ArrayOps.foreach(ArrayOps.scala:38)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:696)
at com.gmail.bernabe.laurent.scala.tests.DealWithFiles$.grepFilesHere(DealWithFiles.scala:20)
at com.gmail.bernabe.laurent.scala.tests.DealWithFiles$.main(DealWithFiles.scala:10)
at com.gmail.bernabe.laurent.scala.tests.DealWithFiles.main(DealWithFiles.scala)
Thanks in advance for any help :)
From the JavaDoc:
MalformedInputException
thrown when an input byte sequence is not legal for given charset, or
an input character sequence is not a legal sixteen-bit Unicode
sequence.
Pass the correct encoding as a parameter to the Source.fromFile method.
You can handle this character encoding exception by adding the snippet below to your code:
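As a rough illustration of passing the encoding explicitly, the sketch below uses the Source.fromFile(file, enc) overload. The names FirstLine and firstLine are hypothetical, and which encoding is right depends on the files being read. It also returns an Option, so an empty file does not fail the way linesOfFile(file)(0) would.

```scala
import java.io.File
import scala.io.Source

object FirstLine {
  // Reads the first line of `file`, decoding its bytes with `encoding`.
  def firstLine(file: File, encoding: String): Option[String] = {
    val source = Source.fromFile(file, encoding)
    try {
      val lines = source.getLines()
      if (lines.hasNext) Some(lines.next()) else None
    } finally source.close()
  }
}
```

For example, FirstLine.firstLine(file, "ISO-8859-1") never throws a MalformedInputException, because ISO-8859-1 maps every byte to a character, but the text is only meaningful if the file really uses that encoding.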
import scala.io.Codec
import java.nio.charset.CodingErrorAction
implicit val codec = Codec("UTF-8")
codec.onMalformedInput(CodingErrorAction.REPLACE)
codec.onUnmappableCharacter(CodingErrorAction.REPLACE)
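To show where that implicit codec ends up being used, here is a minimal self-contained sketch; the names LenientRead and readLines are hypothetical. Source.fromFile picks up the implicit Codec, and with REPLACE an undecodable byte sequence becomes the Unicode replacement character U+FFFD instead of raising an exception.

```scala
import java.io.File
import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}

object LenientRead {
  // UTF-8 codec that substitutes U+FFFD for bad input instead of throwing.
  implicit val codec: Codec = Codec("UTF-8")
    .onMalformedInput(CodingErrorAction.REPLACE)
    .onUnmappableCharacter(CodingErrorAction.REPLACE)

  // Reads all lines of `file` using the lenient codec above.
  def readLines(file: File): List[String] = {
    val source = Source.fromFile(file) // the implicit codec is passed here
    try source.getLines().toList
    finally source.close()
  }
}
```

This trades correctness for robustness: binary files or files in another encoding are read without an exception, but their contents come back partly as replacement characters.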

Why do I get a java.nio.BufferUnderflowException in this Scala

I was trying to do some scripting in Scala, to process some log files:
scala> import io.Source
import io.Source
scala> import java.io.File
import java.io.File
scala> val f = new File(".")
f: java.io.File = .
scala> for (l <- f.listFiles) {
| val src = Source.fromFile(l).getLines
| println( (0 /: src) { (i, line) => i + 1 } )
| }
3658
java.nio.BufferUnderflowException
at java.nio.Buffer.nextGetIndex(Unknown Source)
at java.nio.HeapCharBuffer.get(Unknown Source)
at scala.io.BufferedSource$$anon$2.next(BufferedSource.scala:86)
at scala.io.BufferedSource$$anon$2.next(BufferedSource.scala:74)
at scala.io.Source$$anon$6.next(Source.scala:307)
at scala.io.Source$$anon$6.next(Source.scala:301)
at scala.Iterator$cla...
Why do I get this java.nio.BufferUnderflowException?
NOTE - I'm processing 10 log files, each about 1MB in size
I got a BufferUnderflowException when I opened a file with the wrong encoding. It contained illegal characters (according to the wrong encoding) and this misleading exception was thrown.
I'd also be interested as to exactly why this is happening, but I'd guess it's to do with the fact that Source is an object (i.e. a singleton) and how it gets transparently reset. You can fix the problem as follows:
for (l <- f.listFiles if !l.isDirectory) {
| val src = Source.fromFile(l)
| println( (0 /: src.getLines) { (i, line) => i + 1 } )
| src.reset
| }
The important bit is the reset - which should probably be in a try-finally block (although the isDirectory test is probably useful too)
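As a sketch of that try-finally shape: the version below swaps in close() for reset, since close() is the call that releases the file handle, and it takes an explicit encoding because the other answers point at decoding as the trigger. The names LineCounts, countLines and countAll are hypothetical, and the encoding to pass is an assumption about the log files.

```scala
import java.io.File
import scala.io.Source

object LineCounts {
  // Counts the lines of one file, always closing the Source afterwards.
  def countLines(file: File, encoding: String): Int = {
    val src = Source.fromFile(file, encoding)
    try src.getLines().size
    finally src.close()
  }

  // Counts lines for every regular file directly inside `dir`.
  def countAll(dir: File, encoding: String): Map[String, Int] = {
    // listFiles returns null when `dir` is not a readable directory.
    val files = Option(dir.listFiles).map(_.toList).getOrElse(Nil)
    files.filter(_.isFile).map(f => f.getName -> countLines(f, encoding)).toMap
  }
}
```

A call such as LineCounts.countAll(new File("."), "ISO-8859-1") skips subdirectories and returns one count per file name.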
This is essentially a restatement of Elazar's answer, but you will also get this exception if you try to read a binary file using scala.io.Source.fromFile.
I just ran into this (accidentally trying to read a .jpg with fromFile) due to a very stupid bug in something I wrote...