Sed Range of Patterns only if it contains Pattern - sed

What I would like to know is how to print a range of lines between two patterns, but only if that range contains a specific pattern.
For example:
I have a file that contains:
HEADER 1
AAA
BBBBBBB
MSG:testing
CCCCCC
DDD
PAGE 1
HEADER 2
EEE
FFFFFF
GGG
HHH
PAGE 2
I want to print from any HEADER to any PAGE, but only if the block contains the pattern MSG.
The result I want is to print only this section:
HEADER 1
AAA
BBBBBBB
MSG:testing
CCCCCC
DDD
PAGE 1
What I have so far is: sed -n -e '/HEADER /,/PAGE /p' inputfile.txt > outputfile.txt
I'm open to any suggestions including the usage of Awk or Grep.
Thanks in advance.

This
sed '/HEADER/ { :a; N; /PAGE/!ba; /MSG/!d }' inputfile.txt
works as follows:
/HEADER/ { # in a line that contains HEADER
:a # jump label for looping
N # fetch next line, append to pattern space
/PAGE/!ba # if the pattern space doesn't contain PAGE (this
# is the case if the new line doesn't), go back to :a
/MSG/!d # if the block that's now in the pattern space doesn't
# contain MSG, discard it
}
This removes offending ranges from the file and leaves everything else intact. To print only matching ranges and discard garbage data between ranges,
sed -n '/^HEADER/ { :a; N; /PAGE/!ba; /MSG/p }' inputfile.txt
This removes the default print action with -n and uses /MSG/p to explicitly print matching ranges instead of deleting non-matching ranges.
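Since the question is open to awk, the same idea can be sketched there: collect each block line by line and print it once its PAGE line arrives, but only if MSG was seen. This is an illustrative sketch rather than part of the answer above; the function name print_blocks_with_msg and the shortened sample input are made up for the demonstration.

```shell
# Hypothetical helper: print each HEADER..PAGE block only if it contains MSG.
print_blocks_with_msg() {
  awk '
    /^HEADER / { buf = ""; inblock = 1 }   # a new block starts here
    inblock    { buf = buf $0 "\n" }       # collect every line of the block
    /^PAGE / && inblock {                  # the block is complete
      if (buf ~ /MSG/) printf "%s", buf    # keep it only if MSG occurred
      inblock = 0                          # ignore lines until the next HEADER
    }
  ' "$@"
}

printf '%s\n' 'HEADER 1' AAA 'MSG:testing' 'PAGE 1' 'HEADER 2' EEE 'PAGE 2' |
  print_blocks_with_msg
```

Lines that fall between a PAGE and the next HEADER are dropped, which matches the behaviour of the -n variant of the sed command.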

If your data blocks are separated by blank lines, you can use awk in paragraph mode:
awk -v RS= '/MSG/' file
HEADER 1
AAA
BBBBBBB
MSG:testing
CCCCCC
DDD
PAGE 1
Setting RS to the empty string puts awk in paragraph mode, where each block separated by blank lines is one record; the /MSG/ pattern then simply selects the matching block.
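A minimal sketch of paragraph mode, assuming the blocks really are separated by a blank line (the sample input below is invented to show exactly that):

```shell
# The empty line between "PAGE 1" and "HEADER 2" is what splits the records.
paragraphs_with_msg=$(
  printf 'HEADER 1\nAAA\nMSG:testing\nPAGE 1\n\nHEADER 2\nEEE\nPAGE 2\n' |
    awk -v RS= '/MSG/'
)
printf '%s\n' "$paragraphs_with_msg"
```

Without the blank line the whole input would be a single record, and both sections would be printed.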
This uses HEADER as the record separator (a multi-character RS is not specified by POSIX, but GNU awk supports it):
awk -v RS="HEADER" '/MSG/ {print RS$0}' file
HEADER 1
AAA
BBBBBBB
MSG:testing
CCCCCC
DDD
PAGE 1

sed -n '/^HEADER/,/^PAGE /!d;H;/^HEADER/h;/^PAGE / {x; /\nMSG/ p;}' YourFile
Assuming every section starts with HEADER and ends with PAGE (on different lines).
Explanation:
Don't print anything unless a print is explicitly requested (-n)
If the line is not between HEADER and PAGE (inclusive), delete it
Append the line to the hold buffer
If the line is a HEADER, copy it to the hold buffer (overwriting the previous contents)
If the line is a PAGE:
swap the hold buffer into the working buffer
print it if a line starting with MSG is inside
Start the next cycle
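The steps above can be tried on a small input that also has a stray line between two sections. This is only a sketch: it assumes a sed that accepts \n inside a regular expression (GNU sed does), the script is split into -e pieces purely for readability, and the sample data is invented.

```shell
# Steps, in order: drop lines outside HEADER..PAGE, append to the hold space,
# restart the hold space on HEADER, and on PAGE swap it in and print if a
# line inside the section starts with MSG.
msg_sections=$(
  printf '%s\n' 'HEADER 1' AAA 'MSG:testing' 'PAGE 1' stray 'HEADER 2' EEE 'PAGE 2' |
    sed -n \
      -e '/^HEADER/,/^PAGE /!d' \
      -e 'H' \
      -e '/^HEADER/h' \
      -e '/^PAGE /{' -e 'x' -e '/\nMSG/p' -e '}'
)
printf '%s\n' "$msg_sections"
```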

This might work for you (GNU sed):
sed '/HEADER/!{H;$!d};x;/MSG/!d' file
If the line does not contain HEADER, append it to the hold space and, unless it is the last line, delete it. Only lines that contain HEADER (or the last line) reach the x command, which swaps in the hold space, i.e. the block collected so far. If that block does not contain MSG it is deleted; otherwise it is printed.

Related

Delete string after '#' using sed

I have a text file that looks like:
#filelists.txt
a
# aaa
b
#bbb
c #ccc
I want to delete everything from '#' to the end of each line and then, if nothing is left (i.e. the line started with '#'), delete the whole line.
So I use 'sed' command in my shell:
sed -e "s/#*//g" -e "/^$/d" filelists.txt
I expect the result to be:
a
b
c
but the actual result is:
filelists.txt
a
aaa
b
bbb
c ccc
What's wrong with my sed command?
I understood '*' to mean "any", so I thought '#*' meant "the string after '#'".
Is that not right?
You may use
sed 's/#.*//;/^$/d' file > outfile
The s/#.*// removes # and all the rest of the line and /^$/d drops empty lines.
See an online test:
s="#filelists.txt
a
# aaa
b
#bbb
c #ccc"
sed 's/#.*//;/^$/d' <<< "$s"
Output:
a
b
c
Another idea: match lines having #, then remove # and the rest of the line there and drop if the line is empty:
sed '/#/{s/#.*//;/^$/d}' file > outfile
This way, you keep the original empty lines.
* does not mean "any" (at least not in regular expression context). * means "zero or more of the preceding pattern element". Which means you are deleting "zero or more #". Since you only have one #, you delete it, and the rest of the line is intact.
You need s/#.*//: "delete # followed by zero or more of any character".
EDIT: was suggesting grep -v, but didn't notice the third example (# in the middle of the line).
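To see the difference between the two patterns side by side, here is a small sketch that runs both substitutions on the question's input (the variable names are made up for the demonstration):

```shell
input='#filelists.txt
a
# aaa
b
#bbb
c #ccc'

# "#*" matches zero or more "#" characters, so only the "#" itself is removed.
star_only=$(printf '%s\n' "$input" | sed -e 's/#*//g' -e '/^$/d')

# "#.*" matches "#" plus everything after it on the line.
hash_to_eol=$(printf '%s\n' "$input" | sed -e 's/#.*//' -e '/^$/d')

printf 'with #*:\n%s\n\nwith #.*:\n%s\n' "$star_only" "$hash_to_eol"
```

Note that the last line keeps the space in front of the removed comment ("c "); use s/ *#.*// if that trailing space should go as well.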

Join current and next line, then the next line and its successor using sed

Given the input:
1234
5678
9abc
defg
hijk
I'd like the output:
12345678
56789abc
9abcdefg
defghijk
There are lots of examples using sed(1) for joining a pair of lines, then the next pair after that pair and so on. But I haven't found an example that joins lines 1 with 2, 2 with 3, 3 with 4, ...
sed(1) solution preferred. Other options are less interesting - e.g., awk(1), python(1) and perl(1) implementations are fairly easy. I'm specifically stumped on a successful sed(1) incantation.
sed '1h;1d;x;G;s/\n//'
I guess it can be done some other way, but this works for me:
$ cat in
1234
5678
9abc
defg
hijk
$ sed '1h;1d;x;G;s/\n//' in
12345678
56789abc
9abcdefg
defghijk
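How it works: the first line is copied to the hold space and deleted, and that's it for the first line. For every later line: swap it with the hold space (the pattern space now holds the previous line, the hold space the current one), append the hold space to the pattern space with G, then remove the newline between them.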
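For comparison, the same "remember the previous line" idea can be sketched in awk. This is an illustration only, not a replacement for the sed answer; the function name join_successive is made up, and the END rule is there so that single-line input is still printed.

```shell
# Hypothetical helper: join line N with line N+1 for every N.
join_successive() {
  awk '
    NR > 1 { print prev $0 }           # previous line followed by current line
           { prev = $0 }               # remember the current line
    END    { if (NR == 1) print prev } # single-line input: print it as is
  ' "$@"
}

printf '%s\n' 1234 5678 9abc | join_successive
```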
This does it (now improved, thanks to potong's hint):
$ sed -n 'N;s/\n\(.*\)/\1&/;P;D' infile
12345678
56789abc
9abcdefg
defghijk
In detail:
N # Append next line to pattern space
s/\n\(.*\)/\1&/ # Make 111\n222 into 111222\n222
P # Print up to first newline
D # Delete up to first newline
The substitution makes these two lines
1111
2222
which in the pattern space look like 1111\n2222 into
11112222
2222
and the P and D print/delete the first line from the pattern space.
Notice that we never hit the bottom of the script (D starts a new loop) until the very last line, where N can't fetch a new line and would just print the last line on its own, if we didn't suppress that with -n.
Tweaking another answer (full credit to @aragaer) to handle single line input (and be more portable to bsd sed as well as gnu sed than the original version - update: that answer has been edited another way for portability):
% cat >> inputfile << eof
12
34
56
eof
% sed -e '1{$p;h;d' -e '}' -e 'x;G;s/\n//' inputfile # bsd + gnu sed [1]
1234
3456
or
% cat joinsuccessive.sed
1{
$p;h;d
}
x;G;s/\n//
% sed -f joinsuccessive.sed inputfile
1234
3456
Here's an annotated version.
1{ # special case for first line only:
$p # even MORE special case: print current line for input with
# only a single line
h # add line 1 to hold space (for joining with successive lines)
d # delete pattern space and move to next line (without printing)
}
x # for lines 2+, swap pattern space (current line) and hold space
G # add newline + hold space (now has current line) to pattern space
# (previous line) giving prev line, newline, curr line in pattern
# space (and curr line is in hold space)
s/\n// # remove newline added by G (between lines) before printing the
# pattern space
[1] bsd sed(1) wants a closing brace to be on a line by itself. Use -e to "build" the script or put the commands in a sed script file (and use -f joinsuccessive.sed).

How to remove empty lines to one empty line between sentences in text files?

I have a text file with many empty lines between sentences. I tried sed, gawk and grep, but none of them worked. :( What can I do? Thanks.
Myfile:    Desired file:
a          a
b          b
c          c
.          .
d          d
e          e
f          f
g          g
.          .
           h
           i
h          j
i          k
j          .
k
.
You can use awk for this:
awk 'BEGIN{prev="x"}
/^$/ {if (prev==""){next}}
{prev=$0;print}' inputFile
or the compressed one liner:
awk 'BEGIN{p="x"}/^$/{if(p==""){next}}{p=$0;print}' inFl
This is a simple state machine that collapses multi-blank-lines into a single one.
The basic idea is this. First, set the previous line to be non-empty.
Then, for every line in the file, if it and the previous one are blank, just throw it away.
Otherwise, set the previous line to that value, print the line, and carry on.
Sample transcript, the following command:
$ echo '1
2
3
4
5
6
7
8
9
10' | awk 'BEGIN{p="x"}/^$/{if(p==""){next}}{p=$0;print}'
outputs:
1
2
3
4
5
6
7
8
9
10
Keep in mind that this is for truly blank lines (no content). If you're trying to collapse lines that have an arbitrary number of spaces or tabs, that will be a little trickier.
In that case, you could pipe the file through something like:
sed 's/^\s*$//'
to ensure lines with just whitespace become truly empty.
In other words, something like:
sed 's/^\s*$//' infile | awk 'my previous awk command'
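Put together, the two stages might look like the sketch below. It uses [[:space:]] instead of \s so that it does not depend on GNU sed; the function name collapse_blank_lines and the sample input are assumptions made for the demonstration.

```shell
# Hypothetical helper: turn whitespace-only lines into empty lines, then
# squeeze every run of empty lines down to a single one.
collapse_blank_lines() {
  sed 's/^[[:space:]]*$//' "$@" |
    awk 'BEGIN{p="x"} /^$/{if(p==""){next}} {p=$0; print}'
}

# Input: "a", two empty lines, one line holding a space and a tab,
# then "b", one empty line, "c".
printf 'a\n\n\n \t\nb\n\nc\n' | collapse_blank_lines
```

The three blank-looking lines after "a" come out as one empty line, and the single empty line between "b" and "c" is left alone.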
To suppress repeated empty output lines with GNU cat:
cat -s file1 > file2
Here's one way using sed:
sed ':a; N; $!ba; s/\n\n\+/\n\n/g' file
Otherwise, if you don't mind a trailing blank line, all you need is:
awk '1' RS= ORS="\n\n" file
The Perl solution is even shorter:
perl -00 -pe '' file
You could also do it like this:
awk -v RS="\0" '{gsub(/\n\n+/,"\n\n");}1' file
Explanation:
RS="\0" Once the null character is set as the record separator, awk reads the whole file as a single record (assuming the file itself contains no null characters).
gsub(/\n\n+/,"\n\n"); this replaces one or more blank lines with a single blank line. Note that \n\n regex matches a blank line along with the previous line's new line character.
Here is another awk solution:
awk -v p=1 '$0=="" && p=="" {next} 1; {p=$0}' file

sed or awk deleting lines between pattern matches, excluding the second token's line

I have a sed command which successfully prints the lines between two patterns:
sed -n '/PAGE 2/,/\x0c/p' filename.txt
What I haven't figured out is how to make it print all the lines from the first token up until the second token. The \x0c token is a record separator on a big flat file, and I need to keep THAT line intact.
In between the two tokens, the data is completely variable, and I do not have a reliable anchor to work with.
[CLARIFICATION]
Right now it prints all the lines between /PAGE 2/ and /\x0c/ inclusive. I want it to print from /PAGE 2/ up until the next /\x0c/ in the record.
[test data] The \x0c will be at the start of the first line, and at the beginning of the last line of this record.
I need to delete the first line of the record, through the line just before the beginning of the next record.
^L20-SEP-2006 01:54:08 PM Foobars College PAGE 2
TERM: 200610 Student Billing Statement SUMDATA
99999
Foo bar R0000000
999 Geese Rural Drive DUE: 15-OCT-2012
Columbus, NE 90210
--------------------------------------------------------------------------------
Balance equal to or greater than $5000.00 $200.00
Billing inquiries may be directed to 444/555-1212 or by
email to bursar@foobar.edu. Financial Aid inquiries should
be directed to 444/555-1212 or finaid@foobar.edu.
^L20-SEP-2006 01:54:08 PM Foobars College PAGE 1
[expected result]
^L20-SEP-2006 01:54:08 PM Foobars College PAGE 1
There will be multiple such records in the file. I can rely only on the /PAGE 2/ token and the /\x0c/ token.
[solution]:
Following Choruba's lead, I edited his command to:
sed '/PAGE [2-9]/,/\x0c/{/\x0c$/!d}'
The rule in the curly brackets was applying itself to any line containing a ^L and was selectively ignoring them.
EDIT: New answer for the new question the OP asked (how to delete records):
Given a file with control-Ls delimiting records and a desire to print specific lines from specific records, just set your record separator to control-L and your field separator to "\n" and print whatever you like. For example, getting the output the OP says he wants from the input he posted would just be:
awk -v RS='^L' -F'\n' 'NR==3{print $1}' file
^L shown here represents a literal control-L, and it's the 3rd record because there's an empty record before the first control-L in the input file.
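If typing a literal control-L is awkward, one option is to let printf produce the form feed and pass it to awk through a shell variable. A sketch, assuming a POSIX shell and a shortened, invented version of the posted record:

```shell
ff=$(printf '\f')   # a literal form feed (control-L)

# Records: 1 = empty (before the first form feed), 2 = the PAGE 2 record,
# 3 = the PAGE 1 line. With FS set to newline, $1 is a record's first line.
first_line_of_record_3=$(
  printf '\f20-SEP-2006 PAGE 2\nTERM: 200610\n\f20-SEP-2006 PAGE 1\n' |
    awk -v RS="$ff" -F'\n' 'NR==3{print $1}'
)
printf '%s\n' "$first_line_of_record_3"
```

The form feed itself is consumed as the record separator, so it is not part of the printed line; prefix it again with printf '\f%s\n' if the output must keep it.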
This is the answer to the original question the OP asked:
You want this:
awk '/PAGE 2/ {f=1} /\x0c/{f=0} f' file
but also try these to see the difference (for the future):
awk '/PAGE 2/ {f=1} f; /\x0c/{f=0}' file
awk 'f; /PAGE 2/ {f=1} /\x0c/{f=0}' file
And finally, FYI, the following idioms describe how to select a range of records given a specific pattern to match:
a) Print all records from some pattern:
awk '/pattern/{f=1}f' file
b) Print all records after some pattern:
awk 'f;/pattern/{f=1}' file
c) Print the Nth record after some pattern:
awk 'c&&!--c;/pattern/{c=N}' file
d) Print every record except the Nth record after some pattern:
awk 'c&&!--c{next}/pattern/{c=N}1' file
e) Print the N records after some pattern:
awk 'c&&c--;/pattern/{c=N}' file
f) Print every record except the N records after some pattern:
awk 'c&&c--{next}/pattern/{c=N}1' file
g) Print the N records from some pattern:
awk '/pattern/{c=N}c&&c--' file
I changed the variable name from "f" for "found" to "c" for "count" where appropriate as that's more expressive of what the variable actually IS.
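A few of these idioms run on a tiny input, with N fixed at 2, may make the counting easier to follow. The sample lines and variable names are invented for the illustration:

```shell
sample() { printf '%s\n' one two MARK three four five; }

# (b) all records after the pattern
after_mark=$(sample | awk 'f; /MARK/{f=1}')

# (e) the 2 records after the pattern
next_two=$(sample | awk 'c&&c--; /MARK/{c=2}')

# (c) only the 2nd record after the pattern
second_after=$(sample | awk 'c&&!--c; /MARK/{c=2}')

printf '(b)\n%s\n(e)\n%s\n(c)\n%s\n' "$after_mark" "$next_two" "$second_after"
```

In (e) the counter is tested and then decremented, so the condition is true for exactly two records; in (c) it is decremented first, and the condition is true only when it reaches zero.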
Tell sed not to print the line containing the character:
sed -n '/PAGE 2/,/\x0c/{/\x0c/!p}' filename.txt
I think this would do it:
awk '/PAGE 2/{a=1}/\x0c/{a=0}{if(a)print}'
In this line, the second sed deletes (d) the last line ($).
sed -n '/^START$/,/^STOP$/p' in.txt | sed '$d'

sed: joining lines depending on the second one

I have a file that, occasionally, has split lines. The split is signaled by the fact that the line starts with '+' (possibly preceded by spaces).
line 1
line 2
+ continue 2
line 3
...
I'd like to join the split line back:
line 1
line 2 continue 2
line 3
...
using sed. I'm not clear on how to join a line with the preceding one.
Any suggestion?
This might work for you:
sed 'N;s/\n\s*+//;P;D' file
These are actually four commands:
N
Append the next line from the input file to the pattern space
s/\n\s*+//
Remove newline, following whitespace and the plus
P
print line from the pattern space until the first newline
D
delete line from the pattern space until the first newline, i.e. the part which was just printed
The relevant manual page parts are
Selecting lines by numbers
Addresses overview
Multiline techniques - using D,G,H,N,P to process multiple lines
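The one-liner above joins at most one continuation line to each line. A looped variant of the same N/P/D technique can absorb several consecutive "+" lines; the following is a sketch of that idea rather than the answerer's command. It uses [[:space:]] in place of the GNU-only \s, and the sample input is invented.

```shell
# :a     loop label
# $!N    append the next line, unless we are already on the last one
# s///   glue a continuation line ("+" after optional whitespace) onto the line
# ta     if something was glued, loop and look at the following line too
# P; D   print and drop the finished first line, keep the rest for the next cycle
joined=$(
  printf '%s\n' 'line 1' 'line 2' '+ continue 2' '  + even more' 'line 3' '+ continued' |
    sed -e ':a' -e '$!N' -e 's/\n[[:space:]]*+//' -e 'ta' -e 'P' -e 'D'
)
printf '%s\n' "$joined"
```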
Doing this in sed is certainly a good exercise, but it's pretty trivial in perl:
perl -0777 -pe 's/\n\s*\+//g' input
I'm not partial to sed so this was a nice challenge for me.
sed -n '1{h;n};/^ *+ */{s// /;H;n};{x;s/\n//g;p};${x;p}'
In awk this is approximately:
awk '
NR == 1 {hold = $0; next}
/^ *\+/ {$1 = ""; hold=hold $0; next}
{print hold; hold = $0}
END {if (hold) print hold}
'
If the last line is a "+" line, the sed version will print a trailing blank line. Couldn't figure out how to suppress it.
You can use Vim in Ex mode:
ex -sc g/+/-j -cx file
g global search
- select previous line
j join with next line
x save and close
A different use of the hold space: load the entire file into the hold space before merging lines. (Note that \s is a GNU extension; a strictly POSIX sed needs [[:space:]] instead.)
sed -n '1x;1!H;${g;s/\n\s*+//g;p}'
1x on the first line, swap the line into the empty hold space
1!H on non-first lines, append to the hold space
$ on the last line:
g get the hold space (the entire file)
s/\n\s*+//g remove newlines (and any whitespace) preceding a +
p print everything
Input:
line 1
line 2
+ continue 2
+ continue 2 even more
line 3
+ continued
becomes
line 1
line 2 continue 2 continue 2 even more
line 3 continued
This (or potong's answer) might be more interesting than a sed -z implementation if you want other commands to manipulate the data as well: you can simply stick them in before 1!H. sed -z, by contrast, immediately loads the entire file into the pattern space, so you are never manipulating single lines at any point. The same goes for perl -0777.
In other words, if you want to also eliminate comment lines starting with *, add in /^\s*\*/d to delete the line
sed -n '1x;/^\s*\*/d;1!H;${g;s/\n\s*+//g;p}'
versus:
sed -z 's/\n\s*+//g;s/\n\s*\*[^\n]*\n/\n/g'
The former's accumulation in the hold space line by line keeps you in classic sed line processing land, while the latter's sed -z dumps you into what could be some painful substring regexes.
But that's sort of an edge case, and you could always just pipe sed -z back into sed. So +1 for that.
Footnote for internet searches: This is SPICE netlist syntax.
A solution for versions of sed that can read NUL separated data, like here GNU Sed's -z:
sed -z 's/\n\s*+//g'
Compared to potong's solution this has the advantage of being able to join multiple lines that start with +. For example:
line 1
line 2
+ continue 2
+ continue 2 even more
line 3
becomes
line 1
line 2 continue 2 continue 2 even more
line 3