The following Scala code does just what I expect it to - it prints each line of some_file.txt.
import scala.io.Source
val lines = Source.fromPath("some_file.txt").mkString
for (line <- lines) print(line)
If I use println instead of print, I expect to see some_file.txt printed out with double-spacing. Instead, the program prints a newline after every character of some_file.txt. Could someone explain this to me? I'm using Scala 2.8.0 Beta 1.
lines is a single string, not an iterable container of strings, because .mkString reads the whole source into one String.
When you iterate over a string, you do so one character at a time. So the line in your for is not actually a line; it's a single character.
What you probably intended to do was call .getLines instead of .mkString.
I suspect that for (line <- lines) print(line) doesn't put a line in line but a single character. The output still looks right with print because the \n characters are printed too. When you replace print with println, every character gets its own line.
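To make the difference concrete, here is a minimal sketch that uses an in-memory string in place of some_file.txt. The sample text and the object name LinesVsChars are made up for illustration; Source.fromString is used only so the example runs without a file.

```scala
import scala.io.Source

object LinesVsChars {
  // Hypothetical stand-in for the contents of some_file.txt
  val text = "ab\ncd\n"

  // The elements a for comprehension over a String visits: one Char each,
  // newline characters included
  def asChars: List[Char] = text.toList

  // getLines() yields one String per line, without the line terminator
  def asLines: List[String] = Source.fromString(text).getLines().toList
}
```

Looping over asChars calls println six times, once per character, while looping over asLines calls it only twice.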
How can I ignore or remove blank lines, when reading from a text file using Scala?
An example is shown below. As you can see, the second line is the extra line.
I. The Period

It was the best of times,
Try this:
val file = Source.fromFile(args(0)).getLines().filter(!_.isEmpty()).mkString(" ")
It removes empty lines from the iterator of lines and then concatenates the remaining lines into one string, with a space between them.
You may want to remove whitespace-only lines too. In that case this will work:
val file = Source.fromFile(args(0)).getLines().map(_.strip).filter(!_.isEmpty()).mkString(" ")
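The same pipeline can be tried on in-memory lines without a file. This is only a sketch: the object and method names are hypothetical, and it uses trim instead of strip so that it also runs on JVMs older than Java 11 (trim removes characters up to U+0020, whereas strip follows the Unicode definition of whitespace).

```scala
object BlankLines {
  // Drops empty and whitespace-only lines, then joins the rest with one space
  def joinNonBlank(lines: Iterator[String]): String =
    lines.map(_.trim).filter(_.nonEmpty).mkString(" ")
}
```

For the example in the question, BlankLines.joinNonBlank(Iterator("I. The Period", "", "It was the best of times,")) gives a single string with the blank line gone.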
I have some lines with very long single-line comments:
# this is a comment describing the function, let's pretend it's long.
function whatever()
{
    # this is an explanation of something that happens in here.
    do_something();
}
For this example (adapting it to other numbers should be trivial) I want
each line to contain at most 33 characters (each indentation level is 4 spaces),
to be broken at the last possible space, and
each additional line to be indented exactly like the original line.
So it would end up looking like this:
# this is a comment describing
# the function, let's pretend
# it's long.
function whatever()
{
    # this is an explanation of
    # something that happens in
    # here.
    do_something();
}
I'm trying to write a sed script for that, my attempt looking like this (leaving out the attempts to make it break at a particular character count for clarity and because it didn't work):
s/\(^[^#]*# \)\(.*\) \(.*\)/\1\2\n\1\3/g;
This breaks the line only once and not repeatedly like I falsely assumed g to do (and which it actually would do if it were only s/ /\n/g or something).
Perl to the rescue!
Its Text::Wrap module does what you need:
perl -MText::Wrap='wrap,$columns' -pe '
s/^(\s*#)(.*)/$columns = 33 - length $1; wrap("$1", "$1 ", "$2")/e
' < input > output
-M uses the given module with the given parameters. Here, we'll use the wrap function and the $columns variable.
-p reads the input line by line and prints the possibly modified line (like sed)
s///e is a substitution that evaluates the replacement part as code; the matched text is replaced by the value that code returns
to calculate the width, we subtract the initial whitespace from 33. If you use tabs in your sources, you'll have to handle them specially.
wrap takes three parameters: prefix for the first line, prefix for the rest of the lines (in this case, they're almost the same: the comment prefix, we just need to add the space to the second one), and the text to wrap.
Comparing the output to yours, it seems you want 33 characters regardless of the leading whitespace. If that's true, just remove the - length $1 part.
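For readers who want to see the wrapping logic spelled out, here is a hand-written greedy wrap. It is a rough sketch and not a port of Text::Wrap: the object name, the method name and the regex are made up for illustration. It assumes the limit applies to the whole line including indentation (the "33 characters regardless of the leading whitespace" reading) and that indentation uses spaces, not tabs.

```scala
object CommentWrap {
  // Captures the indentation plus '#' as the prefix, and the comment text after it
  private val Comment = """^(\s*#) ?(.*)$""".r

  /** Greedily wraps one comment line; non-comment lines pass through unchanged. */
  def wrapLine(line: String, maxWidth: Int): List[String] = line match {
    case Comment(prefix, text) =>
      val words = text.split(" ").filter(_.nonEmpty).toList
      // The accumulator holds finished lines in reverse, current line first
      val wrapped = words.foldLeft(List.empty[String]) {
        case (Nil, w) => List(s"$prefix $w")
        case (cur :: done, w) =>
          if (cur.length + 1 + w.length <= maxWidth) s"$cur $w" :: done
          else s"$prefix $w" :: cur :: done
      }
      if (wrapped.isEmpty) List(line) else wrapped.reverse
    case _ => List(line)
  }
}
```

Applying lines.flatMap(CommentWrap.wrapLine(_, 33)) to the sample input reproduces the desired output from the question. A single word longer than the limit is left on its own overlong line.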
Running this from the terminal prompt:
$ wc data.csv
195727 15924341 201584826 data.csv
So, 195727 lines. What about Scala?
val raw_rows: Iterator[String] = scala.io.Source.fromFile("data.csv").getLines()
println(raw_rows.length)
Result: 200945
What am I facing here? I wish for it to be the same. In fact, if I use mighty csv (opencsv wrapper lib) it also reads 195727 lines.
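It might be a newline issue. wc counts only \n characters, whereas getLines also ends a line at a bare \r, so a file containing stray \r characters yields more lines in Scala. From the doc of getLines: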
Returns an iterator who returns lines (NOT including newline character(s)). It will treat any of \r\n, \r, or \n as a line separator (longest match) - if you need more refined behavior you can subclass Source#LineIterator directly
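The effect can be reproduced on a small string. The sample below is made up and assumes the cause is a bare \r inside a quoted CSV field, which may or may not be what data.csv actually contains; the object and method names are hypothetical.

```scala
import scala.io.Source

object LineCounts {
  // Made-up sample: the quoted field on the second row contains a bare \r
  val sample = "id,name\n1,\"multi\rline\"\n2,plain\n"

  // What wc -l reports: the number of \n characters
  def wcStyle(s: String): Int = s.count(_ == '\n')

  // What getLines reports: \r\n, \r and \n each end a line
  def getLinesStyle(s: String): Int = Source.fromString(s).getLines().length
}
```

Here wcStyle(sample) is 3 and getLinesStyle(sample) is 4. A CSV parser that honours quoting would also see three rows, which would be consistent with the opencsv wrapper agreeing with wc.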
Is there a shorthand for a new line character in Scala? In Java (on Windows) I usually just use "\n", but that doesn't seem to work in Scala - specifically
val s = """abcd
efg"""
val s2 = s.replace("\n", "")
println(s2)
outputs
abcd
efg
in Eclipse,
efgd
(sic) from the command line, and
abcdefg
from the REPL (GREAT SUCCESS!)
String.format("%n") works, but is there anything shorter?
A platform-specific line separator is returned by
sys.props("line.separator")
This will give you either "\n" or "\r\n", depending on your platform. You can wrap that in a val as terse as you please, but of course you can't embed it in a string literal.
If you're reading text that's not following the rules for your platform, this obviously won't help.
References:
scala.sys package scaladoc (for sys.props)
java.lang.System.getProperties javadoc (for "line.separator")
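A short sketch of wrapping the separator in a val, as suggested above. The names Eol, NL and joinLines are made up for illustration.

```scala
object Eol {
  // Platform line separator: "\n" on Unix-like systems, "\r\n" on Windows
  val NL: String = sys.props("line.separator")

  // Joins the given parts with the platform separator
  def joinLines(parts: Seq[String]): String = parts.mkString(NL)
}
```

With import Eol.NL in scope, "aaa" + NL + "bbb" is about as short as it gets without a string literal.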
Your Eclipse is using the standard Windows \r\n as the newline marker, so you've got "abcd\r\nefg". The replace turns it into "abcd\refg", and the Eclipse console treats the \r slightly differently from how the Windows shell does: the shell returns to the start of the line, so "efg" overwrites "abc" and you see "efgd". The REPL just uses \n as the newline marker, so it works as expected.
Solution 1: change Eclipse to just use \n newlines.
Solution 2: don't use triple-quoted strings when you need to control newlines; use ordinary double-quoted strings with explicit \n characters.
Solution 3: use a more sophisticated regex to replace \r\n, \n, or \r.
Try this interesting construction :)
import scala.compat.Platform.EOL
println("aaa"+EOL+"bbb")
If you're sure the file's line separator is the one used by this OS, you can do the following:
s.replaceAll(System.lineSeparator, "")
Otherwise your regex should detect all of the following newline sequences: "\n" (Linux), "\r" (classic Mac OS), "\r\n" (Windows):
s.replaceAll("(\r\n)|\r|\n", "")
The second one is shorter and, I think, is more correct.
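A quick way to check the second variant against all three conventions is sketched below. The object and method names are hypothetical; the pattern is the one above without the optional parentheses.

```scala
object StripNewlines {
  // Alternation is tried left to right, so \r\n is consumed as one unit
  // before a lone \r or \n is considered
  private val AnyNewline = "\r\n|\r|\n"

  def removeNewlines(s: String): String = s.replaceAll(AnyNewline, "")
}
```

All of removeNewlines("abcd\r\nefg"), removeNewlines("abcd\refg") and removeNewlines("abcd\nefg") give "abcdefg", whatever platform the code runs on.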
var s = """abcd
efg""".stripMargin.replaceAll("[\n\r]","")
Use \r\n instead
The following Groovy commands illustrate my problem.
First of all, this works (as seen on lotrepls.appspot.com) as expected (note that \u0061 is 'a').
>>> print "a".matches(/\u0061/)
true
Now let's say that we want to match \n, using the Unicode escape \u000A. The following, using "pattern" as a string, behaves as expected:
>>> print "\n".matches("\u000A");
Interpreter exception: com.google.lotrepls.shared.InterpreterException:
org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed,
Script1.groovy: 1: expecting anything but ''\n''; got it anyway
@ line 1, column 21.
1 error
This is expected because in Java at least, Unicode escapes are processed early (JLS 3.3), so:
print "\n".matches("\u000A")
really is the same as:
print "\n".matches("
")
The fix is to escape the Unicode escape, and let the regex engine process it, as follows:
>>> print "\n".matches("\\u000A")
true
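The same behaviour can be checked from Scala, which runs on the same java.util.regex engine. This is only an illustration of the escaped form; the object and member names are made up.

```scala
object RegexEscapes {
  // "\\u000A" is the six-character text \u000A (backslash, u, 0, 0, 0, A);
  // the regex engine, not the compiler, turns it into a line feed
  val escapedForRegex = "\\u000A"

  // True when the pattern, used as a regex, matches a single line feed
  def matchesLineFeed(pattern: String): Boolean = "\n".matches(pattern)
}
```

RegexEscapes.matchesLineFeed(RegexEscapes.escapedForRegex) is true, just like matching against the literal "\n".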
Now here's the question part: how can we get this to work with the Groovy /pattern/ syntax instead of using a string literal?
Here are some failed attempts:
>>> print "\n".matches(/\u000A/)
Interpreter exception: com.google.lotrepls.shared.InterpreterException:
org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed,
Script1.groovy: 1: expecting EOF, found '(' @ line 1, column 19.
1 error
>>> print "\n".matches(/\\u000A/)
false
>>> print "\\u000A".matches(/\\u000A/);
true
~"[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]"
Appears to be working as it should. According to the docs I've seen, the double backslashes shouldn't be required with a slashy string, so I don't know why the compiler's not happy with them.
Firstly, Groovy seems to have changed in this regard in the meantime: at least on https://groovyconsole.appspot.com/ and in a local Groovy shell, "\n".matches(/\u000A/) works perfectly fine and evaluates to true.
In case you run into a similar situation again, encode the backslash itself as a Unicode escape, as in "\n".matches(/\u005Cu000A/). The early Unicode-escape conversion turns \u005C back into a backslash, so the \u000A sequence reaches the regex parser intact.
Another option is to separate the backslash from the u, for example with "\n".matches(/${'\\'}u000A/) or "\n".matches('\\' + /u000A/).