Replacing lines between two specific strings - sed equivalent in cmd - powershell

I want to replace the lines between two strings, [REPORT] and [TAGS]. The file looks like this:
Many lines
many lines
they remain the same
[REPORT]
some text
some more text412
[TAGS]
text that I Want
to stay the same!!!
I used sed within cygwin:
sed -e '/[REPORT]/,/[TAGS]/c\[REPORT]\nmy text goes here\nAnd a new line down here\n[TAGS]' minput.txt > moutput.txt
which gave me this:
Many lines
many lines
they remain the same
[REPORT]
my text goes here
And a new line down here
[TAGS]
text that I Want
to stay the same!!!
When I do this and open the output file in Notepad, it doesn't show the new lines. I assume this is a line-ending formatting issue that a simple Dos2Unix should resolve.
But because of this, and mainly because not all of my colleagues have access to Cygwin, I was wondering if there's a way to do this in cmd (or PowerShell if there is no way to do it in a batch file).
Eventually, I want to run this on a number of files and change this section of each (between the two aforementioned markers) to the text that I am providing.

Use PowerShell, which is present from Windows 7 on.
## Q:\Test\2018\10\30\SO_53073481.ps1
## defining variable with a here string
$Text = #"
Many lines
many lines
they remain the same
[REPORT]
some text
some more text412
[TAGS]
text that I Want
to stay the same!!!
"#
$Text -Replace "(?sm)(?<=^\[REPORT\]`r?`n).*?(?=`r?`n\[TAGS\])",
    "my text goes here`nAnd a new line down here"
The -replace regular expression uses non-consuming lookarounds, so the [REPORT] and [TAGS] marker lines are left untouched and only the text between them is replaced.
Sample output:
Many lines
many lines
they remain the same
[REPORT]
my text goes here
And a new line down here
[TAGS]
text that I Want
to stay the same!!!
To read the text from a file, replace it and write it back (even without storing it in a variable), you can use:
(Get-Content ".\file.txt" -Raw) -Replace "(?sm)(?<=^\[REPORT\]`r?`n).*?(?=`r?`n\[TAGS\])",
    "my text goes here`nAnd a new line down here" |
    Set-Content ".\file.txt"
The parentheses are necessary so the file is read completely before it is rewritten, which lets you reuse the same file name in one pipeline.
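To run the same replacement over a number of files, as the question asks, a rough sketch (assuming the target files all match *.txt in the current folder; adjust the filter and the replacement text to your own needs):
Get-ChildItem ".\*.txt" | ForEach-Object {
    # read each file as a single string, replace the block, write it back in place
    (Get-Content $_.FullName -Raw) -Replace "(?sm)(?<=^\[REPORT\]`r?`n).*?(?=`r?`n\[TAGS\])",
        "my text goes here`nAnd a new line down here" |
        Set-Content $_.FullName
}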

' Replace.vbs - read standard input, replace every line feed with CR+LF, write to standard output
Set Inp = WScript.Stdin
Set Outp = WScript.Stdout
Set regEx = New RegExp
regEx.Pattern = "\n"      ' match each line feed
regEx.IgnoreCase = True
regEx.Global = True       ' replace all occurrences, not just the first
Outp.Write regEx.Replace(Inp.ReadAll, vbCrLf)
To use it:
cscript //nologo "C:\Folder\Replace.vbs" < "C:\Windows\Win.ini" > "%userprofile%\Desktop\Test.txt"
Adapt the pattern and replacement so you can use your own regex.
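If PowerShell is available after all, the same LF-to-CRLF conversion can be sketched in one line (this assumes the moutput.txt produced above and only touches line feeds that are not already preceded by a carriage return):
(Get-Content ".\moutput.txt" -Raw) -replace "(?<!`r)`n", "`r`n" | Set-Content ".\moutput.txt"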

Related

Bash one-liner to insert text marker into the fourth and all consecutive tabs of fields populated with text

This is a Bash/.bat terminal script for Mac.
I'm trying to add text ("!!XX!!") to a group of tab-delimited .txt files in a folder, but I only want to add it to the 4th and all following occurrences of the tab in each .txt file, and then only if those cells have text in them. So the end result would be something like the following (assuming the 7th cell/field is blank). Turn this:
text01
text02
text03
text04
text05
text06
... into this:
text01 [TAB] text02 [TAB] text03 [TAB] text04!!XX!! [TAB] text05!!XX!! [TAB] text06!!XX!! [TAB]
The text marker "!!XX!!" is so that another script in a different system can run on the file and perform special system-compatible/custom line formatting at each occurrence of "!!XX!!", but I don't want to add it to the first three tab-delimited fields (because it's not needed there) or to the empty fields (because it's not wanted there).
I'm already replacing each line return with a tab, so it is possible to do it there, though my preference is to do it later on the tab-delimited text because of weird issues with the line returns/formatting coming in from .rtf files. Below is what I use to replace each line return with a TAB (and, yes, that is an actual line return and tab in there, which seems to work best because... Macs?):
perl -pi -w -e 's/
/ /g' *.txt;
Thanks in advance :)
This post assumes an input file that has lines with tab-separated fields, where each field starting from (and including) the fourth needs to be edited if it has something in it.
One way
perl -F"\t" -wlane'
for (3..$#F) { $F[$_] .= "!XX!" if defined $F[$_] }; print join("\t", @F)
' file
(In the tcsh shell you need to escape those ! with a backslash.) Once you've tested enough, add the -i switch to change the input file in place (-i.bak keeps a backup).
This uses Perl's -a switch to break input lines by what is given under the -F switch (or by whitespace by default), and the resulting array is in @F. See switches in perlrun.
Then it iterates from the fourth field to the last. I use the syntax $#ary for the index of the last element of array @ary.
I don't know what counts for cells that "have text in them" so above I test a field for defined-ness; thus this will append even for an empty string. Adjust as suitable.
Or use a regex, which allows more flexibility here. For example,
for (3..$#F) { $F[$_] =~ s/.+\K/!XX!/ }
This matches all characters and then adds !XX! (keeping what it matched, by the \K assertion). Using a regex allows (and demands) you to specify more precisely what is accepted there; the pattern shown will match even for whitespace alone, but not for an empty string. To not touch fields with whitespace only, and to strip trailing spaces if any:
for (3..$#F) { $F[$_] =~ s/.+\S\K\s*/!XX!/ };
Again, adjust to your details.
I don't quite understand the discussion of newlines and what is wanted of them; the above one-liner goes line by line. If that's not what you need please clarify. I don't have Macs to test, so I can't comment on all that.
A self-contained example for ready testing and tweaking:
echo -e "t1\tt2\tt3\tt4\t\tt6 \t " |
perl -F"\t" -wlanE'for (3..$#F) { $F[$_] =~ s/.+\S\K\s*/!XX!/ } say for @F'
where I print each field on a separate line for easier inspection. The last tab in input is followed by trailing spaces only -- this results in an empty field, but with no text marker added (as asked for in a comment).
with GNU sed
$ echo text{01..07}$'\t' | sed -E 's/([^\t]+)(\t|$)/\1!!xx!!/4g'
text01 text02 text03 text04!!xx!! text05!!xx!! text06!!xx!! text07!!xx!!
or
$ echo text{01..07}$'\t' | sed -E 's/\t([^\t]+)/\1!!xx!!/3g'
Assuming each text file contains 7 lines, you can do
paste -s *.txt | awk '
BEGIN {FS=OFS="\t"}
{for (i=4; i<=NF; i++) if ($i != "") $i = $i "!!XX!!"; print}
'
Here is an awk approach:
echo text{01..10}$'\t' |
awk -v OFS=$'\t' '{for(i=1;i<=NF;i++) printf "%s%s", $i, i>=4 ? "XXX\t" : i<NF ? OFS : ORS }'
With perl, I would do this:
echo text{01..10}$'\t' |
perl -lpE '$cnt=0; s/\h+/++$cnt>=4 ? "XXX\t" : "\t"/ge;'
Both print:
text01 text02 text03 text04XXX text05XXX text06XXX text07XXX text08XXX text09XXX text10XXX

Using powershell how to replace pageBreak in a text file?

What I am trying to do is create a Word document from the text file. But the text file has pageBreaks in it. I want to remove or replace those pageBreaks in the text file, so that I can add pageBreaks in the Word document that I'll subsequently create only in the places where I actually need them.
Below is the PowerShell code that I tried myself to replace the pageBreak in the text file. It doesn't work; using "`f" in place of the pageBreak has no effect.
$oldWord = "`fPage"
$newWord = "Page"
#-- replace the page breaks in the file
(Get-Content $inputFilePath) -replace '$oldWord', '$newWord' | Set-Content $inputFilePath
The symbol shown for pageBreak in the text editor UltraEdit is ♀
Replacing the character in UltraEdit is easy. I want to replace or remove this using Powershell.
Below is a related question. But still unanswered with regards to PowerShell code.
How to remove unknown line break (special character) in text file?
For page breaks, you can use:
[io.file]::ReadAllText( 'H:\oldFile.txt') | %{$_.replace("`f","")} >h:\newFile.txt
The snippet below works from PowerShell v3 on:
cat H:\oldFile.txt -raw | %{$_.replace("`f","")} >h:\newFile.txt
Thanks for the question! This one was interesting.
So the Form-Feed special character is a bugger in PowerShell. If you echo it out, you just get an odd character, or a square if you cannot display it. But if you copy and paste it back into the PowerShell terminal, it just moves your command entry point to the top of the screen. Odd.
What I did was try to find ways of replacing general special characters. You can use regexes in PowerShell using $oldWord -replace 'REGEX_GOES_HERE', 'THING_TO_REPLACE_WITH_HERE', so what I came up with is this:
$oldWord -replace '[\f]', '' #You can also use \r for carriage return, \n for new line, \t for tab, \s for ALL whitespace
This will simply remove all instances of the Form-Feed character.
Hope this helps! Cheers!
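Put together with the file handling from the question, a minimal sketch (this assumes $inputFilePath from the question points at the text file and that PowerShell 3 or later is available for -Raw):
# read the whole file as one string, strip every form feed, and write it back in place
(Get-Content $inputFilePath -Raw) -replace '\f', '' | Set-Content $inputFilePath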

Convert a CSV file with embedded commas into a bash array by line efficiently

Normally, I do something like
IFS=','
columns=( $LINE )
where $LINE is a line from a csv file I'm reading.
However, how do I handle a csv file with embedded commas? I have to handle several hundred gigs of files, so everything needs to be done quickly, i.e., no multiple readings of a line and definitely no loops (last time I tried that, it slowed things down by several factors).
The general structure of the code is as follows
FILENAME=$1
cat $FILENAME | while read LINE
do
IFS=","
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
Preferably, I need something that goes
FILENAME=$1
cat $FILENAME | while read LINE
do
IFS=","
# code to tell bash to ignore if IFS is within an open quote
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
Any tips would be appreciated. Otherwise, I'll probably switch to using another language to handle this stuff.
Embedded commas are probably just the first obvious problem you encountered while parsing those CSV files.
Future problems that might pop up are:
embedded newline separator characters
embedded utf8 chars
special treatment for whitespaces, empty fields, spaces around commas, undef values
I generally tend to follow the philosophy that if there is a (reputable) module that parses the format you have to parse, you should use it instead of making a homebrew parser.
I don't think there is such a thing for bash, but there are some for Perl. I'd go for Text::CSV_XS. Being written in C, I expect it to be very fast.
You can use sed or something similar to convert the commas within quotes to some other sequence or punctuation. If you don't care about the stuff in quotes then you do not even need to change them back. You can do this on the whole file:
sed 's/\("[^,"]*\),\([^"]*"\)/\1;\2/g' input.csv > intermediate.csv
or on each line:
line=$(echo $line | sed 's/\("[^,"]*\),\([^"]*"\)/\1;\2/g')
This isn't a complete answer, but it's a possible approach.
Find a character that never occurs in the input file. Use a C program that parses the CSV file and prints lines to standard output with a different delimiter. Writing that program is left as an exercise, but I'm sure there's CSV-parsing C source code out there. Pipe the output of the C program into your script.
For example:
FILENAME=$1
new_c_program $FILENAME | while read LINE
do
IFS="|"
# code to tell bash to ignore if IFS is within an open quote
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
A minor point: I'd pick a name other than $newline; newline suggests an end-of-line marker rather than an entire line.
Another minor point: you have a "Useless Use Of cat" in the code in your question. You could replace this:
cat $FILENAME | while read LINE
do
...
done
by this:
while read LINE
do
...
done < $FILENAME
But if you replace cat by the hypothetical C program I suggested, you still need the pipe.

sed recipe: how to do stuff between two patterns that can be either on one line or on two lines?

Let's say we want to do some substitutions only between some patterns, let them be <a> and </a> for clarity... (all right, all right, they're start and end!.. Jeez!)
So I know what to do if start and end always occur on the same line: just design a proper regex.
I also know what to do if they're guaranteed to be on different lines and I don't care about anything in the line containing end, and I'm also OK with the commands being applied to the text before start on the line containing start: just specify the address range as /start/,/end/.
This, however, doesn't sound very useful. What if I need to do a smarter job, for instance, introduce changes inside a {...} block?
One thing I can think of is breaking the input on { and } before processing and putting it back together afterwards:
sed 's/{\|}/\n/g' input | sed 'main stuff' | sed ':a $!{N;ba}; s/\n\(}\|{\)\n/\1/g'
Another option is the opposite:
cat input | tr '\n' '#' | sed 'whatever; s/#/\n/g'
Both of these are ugly, mainly because the operations are not confined within a single command. The second one is even worse because one has to use some character or substring as a "newline holder" assuming it isn't present in the original text.
So the question is: are there better ways or can the above-mentioned ones be optimized? This is quite a regular task from what I read in recent SO questions, so I'd like to choose the best practice once and for all.
P.S. I'm mostly interested in pure sed solutions: can the job be done with one invocation of sed and nothing else? Please no awk, Perl, etc.: this is more of a theoretical question, not a "need the job done asap" one.
This might work for you:
# create multiline test data
cat <<\! >/tmp/a
> this
> this { this needs
> changing to
> that } that
> that
> !
sed '/{/!b;:a;/}/!{$q;N;ba};h;s/[^{]*{//;s/}.*//;s/this\|that/\U&/g;x;G;s/{[^}]*}\([^\n]*\)\n\(.*\)/{\2}\1/' /tmp/a
this
this { THIS needs
changing to
THAT } that
that
# convert multiline test data to a single line
tr '\n' ' ' </tmp/a >/tmp/b
sed '/{/!b;:a;/}/!{$q;N;ba};h;s/[^{]*{//;s/}.*//;s/this\|that/\U&/g;x;G;s/{[^}]*}\([^\n]*\)\n\(.*\)/{\2}\1/' /tmp/b
this this { THIS needs changing to THAT } that that
Explanation:
Read the data into the pattern space (PS). /{/!b;:a;/}/!{$q;N;ba}
Copy the data into the hold space (HS). h
Strip non-data from front and back of string. s/[^{]*{//;s/}.*//
Convert data e.g. s/this\|that/\U&/g
Swap to HS and append converted data. x;G
Replace old data with converted data. s/{[^}]*}\([^\n]*\)\n\(.*\)/{\2}\1/
EDIT:
A more complicated answer which I think caters for more than one block per line.
# slurp file into pattern space (PS)
:a
$! {
N
ba
}
# check for presence of \v if so quit with exit value 1
/\v/q1
# replace original newlines with \v's
y/\n/\v/
# append a newline to PS as a delimiter
G
# copy PS to hold space (HS)
h
# starting from right to left delete everything but blocks
:b
s/\(.*\)\({.*}\).*\n/\1\n\2/
tb
# delete any non-block details from the start of the file
s/.*\n//
# PS contains only block details
# do any block processing here e.g. uppercase this and that
s/th\(is\|at\)/\U&/g
# append ps to hs
H
# swap to HS
x
# replace each original block with its processed one from right to left
:c
s/\(.*\){.*}\(.*\)\n\n\(.*\)\({.*}\)/\1\n\n\4\2\3/
tc
# delete newlines
s/\n//g
# restore original newlines
y/\v/\n/
# done!
N.B. This uses GNU specific options but could be tweaked to work with generic sed's.

Very basic replace using sed

Really would appreciate help on this.
I am using sed to create a CSV file. Essentially multiple html files are all merged to a single html file and sed is then used to remove all the junk pictures etc to get to the raw columnar data.
I have all this working but am stuck on the last bit.
What I want to do is very basic - I want to replace the following lines:
"a variable string"
"end td"
"begin td"
with a single line:
"a variable string"
(with a tab character at the end of this line)
I'M USING DOS.
As you see I'm new to all this. If I could get this working would save me a lot of time in the future so would appreciate the help.
At the moment I have to inject some html headers back into the text file, open it in a html editor, select the table and then paste this into a spreadsheet, which is a bit of a pain.
P.S. Is there an easy way to get sed to remove the parentheses '(' and ')' from a given line?
I doubt that this is what you really want, but it's what you asked for.
sed "s/\"a variable string\"/&\t/; s/\"end td\"//; s/\"begin td\"//" inputfile
What you probably want to do is replace them when they appear consecutively. Here's how you might do that:
sed "1{N;N}; /\"a variable string\"\n\"end td\"\n\"begin td\"/ s/\n.*$/\t/;ta;bb;:a;N;N;:b;$!P;N;D" inputfile
This will remove all parentheses in a file:
sed "s/[()]//g" inputfile
To select particular lines, you could do something like this:
sed "/foo/ s/[()]//g" inputfile
which will only make the replacement if the word "foo" is somewhere on a line.
Edit: Changed single quotes to double quotes to accommodate GNUWin32 and CMD.EXE.
A previous comment I left doesn't appear to have been saved, so I will try again.
The code to remove the ( and ) worked perfectly, thanks.
You are right - I was looking to merge the 3 lines into one line, so the second example you gave, where it looks like it's reading the next two lines into the pattern space, looks more promising. The output wasn't what I was expecting, however.
I now realize the code is going to have to be more complicated, and I don't want to trouble you any more: my manual method of injecting some html code back into the text file, opening it up in OpenOffice and pasting into a spreadsheet only takes a few seconds, and I have a feeling that manually producing the sed code for this would be a nightmare.
Essentially the rules for converting the html would need to be:
[each tag has been formatted so it appears on its own line]
I have given example of an input file and desired output file below for reference
1) if < tr > is followed by < td > on the next line, completely remove the < tr > and < td > lines [i.e. do not output a carriage return] and stick a " at the start of the NEXT line [it doesn't matter about a carriage return at the end of this line as it is going to be edited later]
2) if < /td > is followed by < td >, completely remove both of these lines [again do not output a carriage return after these lines], output a ", on the PREVIOUS line [do not output a carriage return] and stick a " at the start of the NEXT line [don't worry about the ending carriage return, it will be edited later]
3) if < /td > is followed by < /tr >, delete both of these lines and on the previous line add a " to the end of the line and a final carriage return (a rough sketch of these three rules appears after the examples below).
I have given an example of what the input and desired output would be:
input: http://medinfo.redirectme.net/input.txt
[the wanted file will be posted in the next message - this board will not allow new users to post a message with more than one hyperlink!]
There is an added issue that the address column is on multiple lines in the input file - this could be reduced to one line by looking to see if the first character of the NEXT line is a ". If it isn't, then do not output the carriage return at the end of the current line.
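For what it's worth, a rough PowerShell sketch of the three rules above (it assumes the tags appear literally as <tr>, <td>, </td>, </tr> on their own lines, ignores the multi-line address issue, and uses table.txt and table.csv only as placeholder file names):
$lines = Get-Content ".\table.txt"
$sb = New-Object System.Text.StringBuilder
for ($i = 0; $i -lt $lines.Count; $i++) {
    $cur  = $lines[$i].Trim()
    $next = if ($i + 1 -lt $lines.Count) { $lines[$i + 1].Trim() } else { "" }
    if ($cur -eq "<tr>"  -and $next -eq "<td>")  { [void]$sb.Append('"');     $i++; continue }  # rule 1
    if ($cur -eq "</td>" -and $next -eq "<td>")  { [void]$sb.Append('","');   $i++; continue }  # rule 2
    if ($cur -eq "</td>" -and $next -eq "</tr>") { [void]$sb.AppendLine('"'); $i++; continue }  # rule 3
    [void]$sb.Append($cur)   # cell text stays on the current output line
}
$sb.ToString() | Set-Content ".\table.csv"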
Phew, that was a nightmare just to type out, never mind actually code. But thanks again for all your help in getting this far!
:-)