Scala: How to remove blank lines when reading text from file

How can I ignore or remove blank lines, when reading from a text file using Scala?
An example is shown below: As you can see, the second line is the extra line.
I. The Period
It was the best of times,

Try this:
import scala.io.Source
val file = Source.fromFile(args(0)).getLines().filter(!_.isEmpty()).mkString(" ")
This removes the empty lines from the sequence of lines and then concatenates the remaining lines into one string, with a space between lines.

You may want to remove lines with whitespace only too. In that case this will work:
val file = Source.fromFile(args(0)).getLines().map(_.strip).filter(!_.isEmpty()).mkString(" ")
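For readers comparing across languages, the same two-step filter (strip each line, then drop the ones that end up empty) can be sketched in Python. This is an illustrative translation rather than part of the original answer; the name join_non_blank and the sample lines are made up for the example.

```python
def join_non_blank(lines):
    """Strip each line, drop the ones that end up empty, join the rest with spaces."""
    stripped = (line.strip() for line in lines)
    return " ".join(line for line in stripped if line)


# Hypothetical sample: one empty line and one whitespace-only line
sample = ["I. The Period", "", "   ", "It was the best of times,"]
print(join_non_blank(sample))
```

As in the Scala version with strip, the lines that are kept also lose their leading and trailing whitespace.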

Related

Replace block of text inside a file using the contents of another file using sed

I am looking to replace a block of text that is between markers with the contents of another file.
I came across this solution but it only works with one line
$ sed -n '/foo/{p;:a;N;/bar/!ba;s/.*\n/REPLACEMENT\n/};p' file
line 1
line 2
foo
REPLACEMENT
bar
line 6
line 7
I am trying to get the following working but it's not.
content=`cat file_content`
sed -n '/foo/{p;:a;N;/bar/!ba;s/.*\n/${content}\n/};p' file
output
line 1
line 2
foo
${content}
bar
line 6
line 7
How can I get ${content} to list the output of the file?
This should be a reasonably short way to replace the text between the foo and bar lines with the contents of a file (called content_file here):
sed -e '/^foo$/,/^bar$/{/^bar$/{x;r content_file
D};d}' file
This acts on the range of lines from one matching ^foo$ to one matching ^bar$. When a line matches ^bar$, x swaps the (empty) hold space into the pattern space, r queues the contents of content_file for output, and D deletes the pattern space up to the first newline and starts the next cycle with the remainder of the pattern space. Every other line in the range is simply dropped (d deletes the pattern space and moves on to the next line of input).
As for the actual question: the shell takes any string enclosed in single quotes literally, with no expansion (of variables or anything else). '${content}' means literally ${content}, and that is what gets passed to sed, whereas with double quotes ("${content}") the shell would expand the variable to its value before it becomes part of the sed arguments. Since the expanded content could still trip up sed, I would opt for the r method as the more generic and robust one.
EDIT: A version that keeps the start and end lines (I had misread the question):
sed -e '/^foo$/,/^bar$/{/^foo$/{r content_file
p};/^bar$/!d}' file
This time, within the range between the matches of ^foo$ and ^bar$: for the opening line matching ^foo$, r queues the contents of content_file for output and p prints the line itself (needed because of the delete that follows). Then every line in the range that does not match the closing pattern ^bar$ is dropped, and sed moves on.
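The logic of the two sed scripts above (drop the markers, or keep them) can be written out step by step. The following is a minimal Python sketch, assuming both marker lines are present and match the whole line, as ^foo$ and ^bar$ do. The function name replace_between and the sample data are hypothetical; in practice the replacement would be read from content_file.

```python
def replace_between(lines, start, end, replacement, keep_markers=True):
    """Replace everything from a `start` line through the next `end` line.

    Lines are compared without their trailing newline, like ^foo$ / ^bar$.
    Assumes every start marker is eventually followed by an end marker.
    """
    out = []
    inside = False
    for line in lines:
        text = line.rstrip("\n")
        if not inside and text == start:
            inside = True
            if keep_markers:
                out.append(line)
            out.extend(replacement)      # new content goes right after the start marker
        elif inside and text == end:
            inside = False
            if keep_markers:
                out.append(line)
        elif not inside:
            out.append(line)             # lines outside the block pass through
        # lines inside the block are dropped
    return out


# Hypothetical sample data
lines = ["line 1\n", "foo\n", "old a\n", "old b\n", "bar\n", "line 6\n"]
new = ["NEW 1\n", "NEW 2\n"]
print("".join(replace_between(lines, "foo", "bar", new)))
```

Because the replacement is handled as data rather than being spliced into a command string, its content cannot be misread as sed syntax, which is the same reason the answer prefers the r command.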
This might work for you (GNU sed):
sed '/foo/!b;:a;$b;N;/bar/!ba;P;s/.*\n//;e cat contentFile' file
Print all lines until one containing foo.
If this is the last line, then there will never be a line containing bar so break out and do not insert the contentFile.
Otherwise, append the next line and check for it containing bar, if not repeat.
The pattern space should now contain both foo and bar so, print the first line (containing foo), remove all other lines other than the one containing bar, print the file contentFile and then print the last line of the collection containing bar.
N.B. This does not insert the contentFile unless both foo and bar exist in file. Also the e command will evaluate the cat contentFile immediately and insert the result into the output stream before printing the line containing bar, whereas the r command always prints to the output stream after the implicit print of the sed cycle.
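The behaviour described in the N.B. (replace only when both markers exist, otherwise leave the input alone) can be sketched as follows. This is an illustration in Python, not a translation of the sed script: the markers are treated as plain substrings, the closing marker is searched for only on later lines, and the name replace_closed_blocks is hypothetical.

```python
def replace_closed_blocks(lines, start_pat, end_pat, replacement):
    """Replace the lines between a start and end marker, but only for blocks
    that are actually closed; an unterminated block is left untouched."""
    out = []
    i = 0
    while i < len(lines):
        if start_pat in lines[i]:
            # Look for the closing marker on a later line
            end = next((j for j in range(i + 1, len(lines)) if end_pat in lines[j]), None)
            if end is None:
                out.extend(lines[i:])    # no closing marker: keep the rest as-is
                break
            out.append(lines[i])         # the start marker line
            out.extend(replacement)      # the inserted content
            out.append(lines[end])       # the end marker line
            i = end + 1
        else:
            out.append(lines[i])
            i += 1
    return out


# Hypothetical sample data
closed = ["a", "foo", "x", "bar", "b"]
unterminated = ["a", "foo", "x"]
print(replace_closed_blocks(closed, "foo", "bar", ["NEW"]))
print(replace_closed_blocks(unterminated, "foo", "bar", ["NEW"]))
```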
An alternative:
sed -ne '/foo/{p;:a;n;/bar/!ba;e cat contentFile' -e '};p' file
However, if file does not have a line containing bar, this solution will only print the lines up to and including the one containing foo.
sed '/foo/,/bar/{//!d;/foo/s//&\n'"${content}"'/}' file
From foo to bar, delete lines not matching previous match //!d.
On foo line, replace match & with match followed by \n${content}

Replacing lines between two specific strings - sed equivalent in cmd

I want to replace the lines between two strings [REPORT] and [TAGS]. File looks like this
Many lines
many lines
they remain the same
[REPORT]
some text
some more text412
[TAGS]
text that I Want
to stay the same!!!
I used sed within cygwin:
sed -e '/[REPORT]/,/[TAGS]/c\[REPORT]\nmy text goes here\nAnd a new line down here\n[TAGS]' minput.txt > moutput.txt
which gave me this:
Many lines
many lines
they remain the same
[REPORT]
my text goes here
And a new line down here
[TAGS]
text that I Want
to stay the same!!!
When I do this and open the output file in Notepad, it doesn't show the line breaks. I assume this is a line-ending issue that a simple conversion (Unix2Dos) would resolve.
But because of this and also mainly due to the fact that not all of my colleagues have access to cygwin I was wondering if there's a way to do this in cmd (or Powershell if there is no way to do a batch).
Eventually, I want to run this on number of files and change this section of them (between those two aforementioned words) to the text that I am providing.
Use PowerShell, present from Windows 7 on.
## Q:\Test\2018\10\30\SO_53073481.ps1
## defining variable with a here string
$Text = @"
Many lines
many lines
they remain the same
[REPORT]
some text
some more text412
[TAGS]
text that I Want
to stay the same!!!
"@
$Text -Replace "(?sm)(?<=^\[REPORT\]`r?`n).*?(?=`r?`n\[TAGS\])",
"`nmy text goes here`nAnd a new line down here`n"
The -replace regular expression uses non-consuming lookarounds, so the [REPORT] and [TAGS] marker lines themselves are left in place.
Sample output:
Many lines
many lines
they remain the same
[REPORT]
my text goes here
And a new line down here
[TAGS]
text that I Want
to stay the same!!!
To read text from file, replace and write back (even without storing in a var) you can use:
(Get-Content ".\file.txt" -Raw) -Replace "(?sm)(?<=^\[REPORT\]`r?`n).*?(?=`r?`n\[TAGS\])",
"`nmy text goes here`nAnd a new line down here`n"|
Set-Content ".\file.txt"
The parentheses are necessary to reuse the same file name in one pipeline: they make Get-Content finish reading before Set-Content opens the file for writing.
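The same replace-between-markers idea can be sketched in Python for anyone without PowerShell at hand. Python's re module only accepts fixed-width lookbehind, and the optional \r makes this lookbehind variable-width, so this sketch swaps the lookarounds for two capture groups that are written back unchanged. The sample text and the name new_body are illustrative.

```python
import re

# Shortened sample in the shape of the question's file
text = (
    "Many lines\n"
    "[REPORT]\n"
    "some text\n"
    "some more text412\n"
    "[TAGS]\n"
    "text that I Want\n"
)
new_body = "my text goes here\nAnd a new line down here"

# Group 1: the [REPORT] line with its line ending; group 2: the line ending plus [TAGS].
# DOTALL lets '.' cross line breaks, MULTILINE lets '^' match at each line start.
pattern = re.compile(r"(^\[REPORT\]\r?\n).*?(\r?\n\[TAGS\])", re.DOTALL | re.MULTILINE)

# A function as replacement avoids backslash processing of new_body
result = pattern.sub(lambda m: m.group(1) + new_body + m.group(2), text)
print(result)
```

Both marker lines keep whatever line ending they had, so the sketch works on LF and CRLF input alike.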
Set Inp = WScript.Stdin
Set Outp = Wscript.Stdout
Set regEx = New RegExp
regEx.Pattern = "\n"
regEx.IgnoreCase = True
regEx.Global = True
Outp.Write regEx.Replace(Inp.ReadAll, vbcrlf)
To use
cscript //nologo "C:\Folder\Replace.vbs" < "C:\Windows\Win.ini" > "%userprofile%\Desktop\Test.txt"
So you can use your RegEx.

Why does Scala see more lines in a file?

Running this from the terminal prompt:
$ wc data.csv
195727 15924341 201584826 data.csv
So, 195727 lines. What about Scala?
val raw_rows: Iterator[String] = scala.io.Source.fromFile("data.csv").getLines()
println(raw_rows.length)
Result: 200945
What am I facing here? I wish for it to be the same. In fact, if I use mighty csv (opencsv wrapper lib) it also reads 195727 lines.
It might be a newline issue. From the documentation of getLines:
Returns an iterator who returns lines (NOT including newline character(s)). It will treat any of \r\n, \r, or \n as a line separator (longest match) - if you need more refined behavior you can subclass Source#LineIterator directly
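A small experiment shows how the two counts can diverge. The sketch below is in Python and uses a made-up three-record sample in which one quoted field contains a bare \r. Whether data.csv really contains such characters is an assumption; this only illustrates the mechanism the documentation describes.

```python
import re

# Hypothetical sample: three records, one quoted field containing a bare \r
data = 'a,b\r\nc,"multi\rline"\nd,e\n'

# wc reports the number of \n characters in its first column
newline_count = data.count("\n")

# Split on \r\n, \r or \n (longest match first), as the quoted documentation describes
pieces = re.split(r"\r\n|\r|\n", data)
if pieces and pieces[-1] == "":
    pieces.pop()  # the final terminator does not start another line

print(newline_count, len(pieces))
```

Here the newline count is 3 while the split yields 4 lines, because the bare \r inside the quoted field is treated as a separator. A CSV-aware reader keeps that field together, which would be consistent with opencsv reporting the smaller number.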

Very basic replace using sed

Really would appreciate help on this.
I am using sed to create a CSV file. Essentially multiple html files are all merged to a single html file and sed is then used to remove all the junk pictures etc to get to the raw columnar data.
I have all this working but am stuck on the last bit.
What I want to do is very basic - I want to replace the following lines:
"a variable string"
"end td"
"begin td"
with a single line:
"a variable string"
(with a tab character at the end of this line)
I'M USING DOS.
As you see I'm new to all this. If I could get this working would save me a lot of time in the future so would appreciate the help.
At the moment I have to inject some html headers back into the text file, open it in a html editor, select the table and then paste this into a spreadsheet which is a bit of pain.
P.S. is there an easy way to get sed to remove the parenthesis '(' and ')' from a given line?
I doubt that this is what you really want, but it's what you asked for.
sed "s/\"a variable string\"/&\t/; s/\"end td\"//; s/\"begin td\"//" inputfile
What you probably want to do is replace them when they appear consecutively. Here's how you might do that:
sed "1{N;N}; /\"a variable string\"\n\"end td\"\n\"begin td\"/ s/\n.*$/\t/;ta;bb;:a;N;N;:b;$!P;N;D" inputfile
This will remove all parentheses in a file:
sed "s/[()]//g" inputfile
To select particular lines, you could do something like this:
sed "/foo/ s/[()]//g" inputfile
which will only make the replacement if the word "foo" is somewhere on a line.
Edit: Changed single quotes to double quotes to accommodate GNUWin32 and CMD.EXE.
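The "replace them when they appear consecutively" idea may be easier to follow outside sed. Below is a minimal Python sketch, assuming that "a variable string" stands for any line that happens to be followed by the two marker lines; the function name merge_triples and the sample data are hypothetical.

```python
def merge_triples(lines, end_marker='"end td"', begin_marker='"begin td"'):
    """Collapse <any line>, end_marker, begin_marker into <any line> plus a tab."""
    out = []
    i = 0
    while i < len(lines):
        # Are the next two lines exactly the two markers, in this order?
        if lines[i + 1:i + 3] == [end_marker, begin_marker]:
            out.append(lines[i] + "\t")
            i += 3                       # skip the two marker lines
        else:
            out.append(lines[i])
            i += 1
    return out


# Hypothetical sample, lines without their newlines
sample = ['"Smith"', '"end td"', '"begin td"', '"42"', '"end td"']
print(merge_triples(sample))
```

A marker that is not part of a complete triple, like the last line of the sample, is left alone.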
A previous comment I left doesn't appear to have been saved - so will try again
The code to remove the ( and ) worked perfectly thanks
You are right: I was looking to merge the 3 lines into one line, so the second example you gave, where it looks like it's reading the next two lines into the pattern space, looks more promising. The output wasn't what I was expecting, however.
I now realize the code is going to have to be more complicated, and I don't want to trouble you any more: my manual method of injecting some html code back into the text file, opening it in OpenOffice and pasting into a spreadsheet only takes a few seconds, and I have a feeling that producing the sed code for this by hand would be a nightmare.
Essentially the rules for converting the html would need to be:
[each tag has been formatted so it appears on its own line]
I have given example of an input file and desired output file below for reference
1) if < tr > is followed by < td > on the next line completely remove the < tr > and < td > lines [i.e. do not output a carriage return] and on the NEXT line stick a " at the start of that line [it doesn't matter about a carriage return at the end of this line as it is going to be edited later]
2) if < /td > is followed by < td > completely remove both these two lines [again do not output a carriage return after these lines] and on the PREVIOUS line output a ", [do not output a carriage return] and on the NEXT line stick a " at the start of the line [don't worry about the ending carriage return, it will be edited later]
3) if < /td > is followed by < /tr > delete both of these lines and on the previous line add a " to the end of the line and a final carriage return.
I have given an example of what the input and desired output would be:
input: http://medinfo.redirectme.net/input.txt
[the wanted file will be posted in the next message - this board will not allow new users to post a message with more than one hyperlink!]
there is an added issue that the address column is on multiple lines on the input file - this could be reduced to one line by looking to see if the first character of the NEXT line is a " If it isn't then do not output the carriage return at the end of the current line
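The rules above amount to a small state machine, which is awkward in sed but short in a general-purpose language. The following Python sketch rests on several assumptions: each tag sits alone on its line and is spelled exactly <tr>, <td>, </td> or </tr>, and the lines of a multi-line cell are joined with a space (a choice made for this sketch). The linked input file was not consulted, and the name table_lines_to_csv is hypothetical.

```python
import csv
import io


def table_lines_to_csv(lines):
    """Turn tag-per-line table rows into quoted CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    row = []
    cell = None                          # None means we are not inside a <td>
    for raw in lines:
        line = raw.strip()
        if line == "<td>":
            cell = []                    # start collecting a cell
        elif line == "</td>":
            row.append(" ".join(cell or []))   # multi-line cells become one field
            cell = None
        elif line == "</tr>":
            writer.writerow(row)         # end of row: emit one CSV line
            row = []
        elif cell is not None:
            cell.append(line)
        # anything else (<tr>, text outside cells) is ignored
    return buf.getvalue()


# Hypothetical sample with a two-line address cell
sample = ["<tr>", "<td>", "Dr Smith", "</td>", "<td>", "1 High St", "Anytown", "</td>", "</tr>"]
print(table_lines_to_csv(sample), end="")
```

Letting the csv module do the quoting also covers fields that themselves contain quotes or commas, which the hand-written rules do not address.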
Phew that was a nightmare just to type out never mind actually code. But thanks again for all your help in getting this far!
:-)

Use sed to delete a matched regexp and the line (or two) underneath it

OK I found this question:
How do I delete a matching line, the line above and the one below it, using sed?
and just spent the last hour trying to write something that will match a string and delete the line containing the string and the line beneath it (or a variant - delete 2 lines beneath it).
I feel I'm now typing random strings. Please somebody help me.
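If I've understood that correctly, to delete the matching line and the one line after it:
/matchstr/{N;d;}
To delete the matching line and the two lines after it:
/matchstr/{N;N;d;}
N appends the next line of input to the pattern space.
d deletes the resulting pattern space (the matched line together with the appended lines).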
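The effect of the N and d combination can be spelled out as a loop. This is a minimal Python sketch, assuming a plain substring match instead of a regular expression; the function name delete_match_and_following and the sample data are hypothetical.

```python
def delete_match_and_following(lines, needle, extra=1):
    """Drop every line containing `needle` plus the `extra` lines after it."""
    out = []
    skip = 0
    for line in lines:
        if skip:
            skip -= 1                    # one of the lines pulled in after a match
        elif needle in line:
            skip = extra                 # drop this line and schedule the next ones
        else:
            out.append(line)
    return out


# Hypothetical sample data
sample = ["one", "two", "three", "four", "five", "six"]
print(delete_match_and_following(sample, "two", extra=2))
```

As with the sed scripts, the lines swallowed after a match are not themselves checked for the pattern.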
You can use awk, e.g. search for the word "two" and skip it together with the 2 lines after it:
$ cat file
one
two
three
four
five
six
seven
eight
$ awk -vnum=2 '/two/{for(i=0;i<=num;i++)getline}1' file
one
five
six
seven
eight