Delete multiline pattern containing another pattern - sed

I would like to delete all instances of a multiline pattern that contain another specified pattern (conveniently on its own line):
MID:
Sample Input:
header
BEGIN:
1abc
7wurw
END:
BEGIN:
22xyz
MID:
34utov
END:
Desired Output:
header
BEGIN:
1abc
7wurw
END:
I'm looking for possible one liners. Any help would be appreciated.

Using GNU sed:
sed -e :a -e '/^BEGIN:/,/^END:/ { /END:/!{$!{N;ba};};/MID:/d;}' inputfile
For your input, it'd return:
header
BEGIN:
1abc
7wurw
END:

I would use the RS and ORS variables. Here is a one-liner:
awk -v RS="BEGIN:\n" -v ORS="" '/MID/{next}NR>1{printf RS}7' file
test with your file:
kent$ cat f
header
BEGIN:
1abc
7wurw
END:
BEGIN:
22xyz
MID:
34utov
END:
kent$ awk -v RS="BEGIN:\n" -v ORS="" '/MID/{next}NR>1{printf RS}7' f
header
BEGIN:
1abc
7wurw
END:
Note: printf RS is not very nice. I used it only because I know what RS contains here (BEGIN: plus a newline). Good practice would be printf "%s", RS

This worked for me on sample:
sed '/BEGIN:/,/END:/{/BEGIN/{h;d};H;/END:/!d;x;/MID:/d}' input.txt
I am pretty sure it can be simplified a lot.

BEGIN { in_block = 0; }
/BEGIN:/ { in_block = 1; lineno = 0; arr[lineno] = $0; must_write = 1; next; }
/END:/ {
    in_block = 0;
    if (must_write == 1) {
        for (i = 0; i <= lineno; ++i) print arr[i];
        print;
    }
    next;
}
/MID:/ { must_write = 0; next; }
in_block == 1 && must_write == 1 { lineno++; arr[lineno] = $0; next; }
in_block == 0 { print }
This should work (worked with the supplied test case). Some awk-wizards will find a denser solution, probably. But you can use this kind of processing also for other tasks.
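The buffering idea in the awk script above can also be sketched in Python. This is only a minimal sketch, not any answerer's code: the function name drop_blocks and its parameters are hypothetical, and it assumes blocks are not nested. An unterminated block at the end of the input is silently dropped here.

```python
def drop_blocks(lines, begin="BEGIN:", end="END:", marker="MID:"):
    """Yield lines, omitting every begin..end block that contains marker.

    Hypothetical helper: assumes blocks are not nested.
    """
    block = None  # None while outside a block, else the buffered lines
    for line in lines:
        if block is None:
            if line.startswith(begin):
                block = [line]      # start buffering a new block
            else:
                yield line          # outside any block: pass through
        else:
            block.append(line)
            if line.startswith(end):
                # Block complete: emit it only if no line starts with marker.
                if not any(l.startswith(marker) for l in block):
                    yield from block
                block = None


sample = ["header", "BEGIN:", "1abc", "7wurw", "END:",
          "BEGIN:", "22xyz", "MID:", "34utov", "END:"]
print("\n".join(drop_blocks(sample)))
```

Holding a whole block in memory before deciding is the same trade-off the sed hold-space answers make: nothing can be printed until END: has been seen.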

This might work for you (GNU sed):
sed '/^BEGIN:/{:a;$!{N;/END:/!ba};/MID:/d}' file

This GNU awk solution splits the input into records at END:, so it does not strictly match BEGIN:-END: pairs, but it works:
awk '!/MID:/{printf "%s%s\n",$0,RT}' RS="END:" t
header
BEGIN:
1abc
7wurw
END:

sed -n '/BEGIN:/,/END:/ {
H
/END:/ {
s/.*//
x
/\nMID:/ !p
}
}' YourFile
As a one-liner:
sed -n '/BEGIN:/,/END:/{H;/END/{s/.*//;x;/\nMID:/ !p;};}' YourFile
This does not work on my AIX with the last ; (I have to keep a newline there instead), but it should work with GNU sed.
For AIX:
sed -n '/BEGIN:/,/END:/{H;/END/{s/.*//;x;/\nMID:/ !p;}
}' YourFile

Related

How to use sed to find the line and its following line

I would like to get the line that begins with "Xboy" and the lines following it that begin with "+". How can I do this using sed?
The input looks like below:
Xapple
+apple1
+apple2
.ends
Xboy
+boy1
+boy2
V2
Xcat
+cat1
+cat2
Xcat
The output should look like below:
Xboy
+boy1
+boy2
This will do the job in sed, but really this problem is more complicated than sed is intended for. You'd be better off using perl or python.
$ cat foo.txt
Xapple
+apple1
+apple2
.ends
Xboy
+boy1
+boy2
V2
Xcat
+cat1
+cat2
Xcat
$ sed ':section;/Xboy/!d;:plusline;n;/^+/b plusline;b section' foo.txt
Xboy
+boy1
+boy2
In a proper programming language, the nested loop structure of the data becomes clearer, and we can be more confident there are no edge cases we've forgotten about.
In Perl:
my $line = <>;
while (defined($line)) {
    chomp($line);
    if ($line eq "Xboy") {
        print $line, "\n";
        $line = <>;
        while (defined($line) && $line =~ /^\+/) {
            print $line;
            $line = <>;
        }
    }
    else {
        $line = <>;
    }
}
In Python:
import fileinput
lines = fileinput.input()
line = lines.readline()
while line != '':
    line = line.rstrip('\n')
    if line == 'Xboy':
        print(line)
        line = lines.readline()
        while line != '' and line.startswith('+'):
            print(line, end='')
            line = lines.readline()
    else:
        line = lines.readline()
An awk version
awk '/Xboy/ {f=1;print;next} {/^+/?a=1:f=0} a&&f' file
Xboy
+boy1
+boy2
This might work for you (GNU sed):
sed -n ':a;/Xboy/{:b;p;n;/^+/bb;ba}' file
If a line contains Xboy, print it and any following lines that begin with +; otherwise be silent.
I guessed this is what you intended. However, if you meant that other lines beginning with a non-word character should also be ignored, use:
sed -n ':a;/Xboy/{:b;p;:c;n;/^+/bb;/^\W/bc;ba}' file
or perhaps you meant this:
sed -n ':a;/Xboy/{:b;p;:c;n;/^+/bb;/^[^[:upper:]]/bc;ba}' file
It may be that you only want to print Xboy if there is a line following that begins +, then use:
sed -n ':a;/Xboy/{$d;h;:b;n;/^+/{H;$!bb};x;/\n/p;x;ba}' file

SED code for removing newline

I am looking for a sed command which will transform the following lines:
>AT1G01020.6 | ARV1 family protein | Chr1:6788-8737 REVERSE LENGTH=944 | 201606
AGACCCGGACTCTAATTGCTCCGTATTCTTCTTCTCTTGAGAGAGAGAGAGAGAGAGAGA
GAGAGAGAGCAATGGCGGCGAGTGAACACAGATGCGTGGGATGTGGTTTTAGGGTAAAGT
CATTGTTCATTCAATACTCTCCGGGGAAATTGCAAGGAAGTAGCAGATGAGTACATCGAG
TGTGAACGCATGATTATTTTCATCGATTTAATCCTTCACAGACCAAAGGTATATAGACAC
into
>AT1G01020.6 | ARV1 family protein | Chr1:6788-8737 REVERSE LENGTH=944 | 201606
AGACCCGGACTCTAATTGCTCCGTATTCTTCTTCTCTTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGCAATGGCGGCGAGTGAACACAGATGCGTGGGATGTGGTTTTAGGGTAAAGTCATTGTTCATTCAATACTCTCCGGGGAAATTGCAAGGAAGTAGCAGATGAGTACATCGAGTGTGAACGCATGATTATTTTCATCGATTTAATCCTTCACAGACCAAAGGTATATAGACAC
That is, the newline at the end of a line starting with > should remain unchanged, while in all other cases the lines should be joined.
I have tried with the following line, but it is not working:
sed s/^!>\n$// <in.fasta>out.fasta
I have a 28MB fasta file which I need to transform.
sed is not a particularly good tool for this.
awk '/^>/ { if(prev) printf "\n"; print; next }
{ printf "%s", $0; prev = 1; }
END { if(prev) printf "\n" }' in.fasta >out.fasta
Using awk:
awk '/^>/{print (l?l ORS:"") $0;l="";next}{l=l $0}END{print l}' file
The line is printed if a > or the end of the file is reached, otherwise the line is buffered in the variable l.
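The same buffer-and-flush logic can be sketched in Python. This is a minimal sketch under stated assumptions, not part of any answer: unwrap_fasta is a hypothetical name, and the short records in sample are made up for illustration.

```python
def unwrap_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line."""
    seq = []  # sequence fragments of the current record
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if seq:
                yield "".join(seq)  # flush the previous record's sequence
                seq = []
            yield line              # header lines pass through unchanged
        else:
            seq.append(line)        # buffer until the next header or EOF
    if seq:
        yield "".join(seq)          # flush the final sequence


# Made-up sample records, for illustration only.
sample = [">seq1 description\n", "AGAC\n", "CCGG\n", ">seq2\n", "TT\n", "GA\n"]
for out in unwrap_fasta(sample):
    print(out)
```

Because it is a generator over an iterable of lines, it can be fed an open file object directly, so a 28MB file is processed one record at a time rather than loaded whole.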
The following awk may also help you here; it does not use any array or extra variable.
awk 'BEGIN{ORS=""} /^>/{if(FNR==1){print $0 RS} else {print RS $0 RS};next}1' Input_file
OR
awk 'BEGIN{ORS=""} /^>/{printf("%s",FNR==1?$0 RS:RS $0 RS);next}1' Input_file

Sed: Print lines between string and another string in one line

I have 100 html files in a directory
I need to print a line from each file that matches a regex and at the same time print the lines between 2 regex.
The commands below provide the results, correctly
sed -n '/string1/p' *.html >result.txt
sed -n '/string2/,/string3/p' *.html > result2.txt
but I need them in one result.txt file, in the format
string1
string2
string3
I have been trying with grep, awk and sed and have searched but I have not found the answer.
Any help would be appreciated.
This might work for you:
sed -n '/string1/p;/string2/,/string3/p' INPUTFILE > OUTPUTFILE
Or here's an awk solution:
awk '/string1/ { print }
/string2/ { p = 1 }
p == 1 { print }
/string3/ { p = 0 }' INPUTFILE > OUTPUTFILE
Simply put both sed expressions in one invocation:
echo $'a\nstring1\nb\nstring2\nc\nstring3\nd\n' | \
sed -n -e '/string1/p' -e '/string2/,/string3/p'
Input is:
a
string1
b
string2
c
string3
d
Output is:
string1
string2
c
string3
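For comparison, the combination of a single-line match and a range can be sketched in Python. This is a minimal sketch, not an equivalent of the sed command in every corner case: the name select and its parameters are hypothetical, and unlike the two sed p commands it prints a string1 line only once even if it falls inside the range. Looping over the 100 HTML files is left out.

```python
import re


def select(lines, single=r"string1", start=r"string2", stop=r"string3"):
    """Yield lines matching single, plus every line from start through stop."""
    in_range = False
    for line in lines:
        if in_range:
            yield line
            if re.search(stop, line):
                in_range = False    # the stop line closes the range
        elif re.search(start, line):
            in_range = True         # the start line opens the range
            yield line
        elif re.search(single, line):
            yield line              # stand-alone match outside any range


sample = ["a", "string1", "b", "string2", "c", "string3", "d"]
print("\n".join(select(sample)))
```

As in sed, the stop pattern is only tested on lines after the one that opened the range.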

Joining lines with awk and sed

I would like to join each of the lines {st, corridor, tunnel} onto the line before it, using AWK or SED
Input
abcd
efgjk
st
wer
dfgh
corridor
weerr
tunnel
twdf
Desired output
abcd
efgjk st
wer
dfgh corridor
weerr tunnel
twdf
One way using awk:
awk '!/st|corridor|tunnel/ { if (line) print line; line = $0; next } { line = line " " $0 } END { print line }' file.txt
Results:
abcd
efgjk st
wer
dfgh corridor
weerr tunnel
twdf
This might work for you (GNU sed):
sed '$!N;s/\n\(st\|corridor\|tunnel\)\s*$/ \1/;P;D' file
Or, an awk version that reads the whole file into memory first (not recommended for large files):
$ awk 'BEGIN {i=1} {line[i++] = $0} END {j=1; while (j<i) {if (match(line[j+1], /^(st|corridor|tunnel)$/)) {print line[j] " " line[j+1]; j+=2} else print line[j++];}}' streets
abcd
efgjk st
wer
dfgh corridor
weerr tunnel
twdf
I'll leave you with the exercise of doing this one-or-two-lines-at-a-time. :)
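One possible take on the one-line-at-a-time approach, sketched in Python rather than awk. This is only a sketch under stated assumptions: join_suffixes and KEYWORDS are hypothetical names, a keyword must be the whole line (ignoring surrounding whitespace), and consecutive keyword lines are all appended to the same held line.

```python
KEYWORDS = {"st", "corridor", "tunnel"}


def join_suffixes(lines, keywords=KEYWORDS):
    """Yield lines, appending each keyword line to the line before it."""
    prev = None  # the single line held back until we know what follows it
    for line in lines:
        line = line.rstrip("\n")
        if prev is not None and line.strip() in keywords:
            prev = prev + " " + line.strip()   # glue keyword onto held line
        else:
            if prev is not None:
                yield prev                     # held line is now final
            prev = line
    if prev is not None:
        yield prev                             # flush the last held line


sample = ["abcd", "efgjk", "st", "wer", "dfgh", "corridor",
          "weerr", "tunnel", "twdf"]
print("\n".join(join_suffixes(sample)))
```

Only one line is ever held in memory, which is the same idea as the N;P;D window in the sed answer.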
With awk
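BEGIN {
    s["st"] = s["corridor"] = s["tunnel"]   # referencing the elements creates the keys
}
$1 in s {
    print prev, $1
    prev = ""        # already printed, so do not print it again
    next
}
{
    if (prev != "") print prev
    prev = $1
}
END {
    if (prev != "") print prev   # flush the last held line
}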

Print Valid words with _ in between them

I have done my research, but have not been able to find a solution to my problem.
I am trying to extract all valid words (those starting with a letter) from a string and concatenate them with an underscore ("_"). I am looking for a solution using awk, sed, grep, etc.
Something like:
echo "The string under consideration" | (awk/grep/sed) (pattern match)
Example 1
Input:
1.2.3::L2 Traffic-house seen during ABCD from 2.2.4/5.2.3a to 1.2.3.X11
Desired output:
L2_Traffic_house_seen_during_ABCD_from
Example 2
Input:
XYZ-2-VRECYY_FAIL: Verify failed - Client 0x880016, Reason: Object exi
Desired Output:
XYZ_VRECYY_FAIL_Verify_failed_Client_Reason_Object_exi
Example 3
Input:
ABCMGR-2-SERVICE_CRASHED: Service "abcmgr" (PID 7582) during UPGRADE
Desired Output:
ABCMGR_SERVICE_CRASHED_Service_abcmgr_PID_during_UPGRADE
This might work for you (GNU sed):
sed 's/[[:punct:]]/ /g;s/\<[[:alpha:]]/\n&/g;s/[^\n]*\n//;s/ [^\n]*//g;y/\n/_/' file
A perl one-liner. It searches for any alphabetic character followed by any number of word characters, enclosed in word boundaries. The /g flag collects every match on each line.
Content of infile:
1.2.3::L2 Traffic-house seen during ABCD from 2.2.4/5.2.3a to 1.2.3.X11
XYZ-2-VRECYY_FAIL: Verify failed - Client 0x880016, Reason: Object exi
ABCMGR-2-SERVICE_CRASHED: Service "abcmgr" (PID 7582) during UPGRADE
Perl command:
perl -ne 'printf qq|%s\n|, join qq|_|, (m/\b([[:alpha:]]\w*)\b/g)' infile
Output:
L2_Traffic_house_seen_during_ABCD_from_to_X11
XYZ_VRECYY_FAIL_Verify_failed_Client_Reason_Object_exi
ABCMGR_SERVICE_CRASHED_Service_abcmgr_PID_during_UPGRADE
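The same word-boundary idea can be sketched in Python with re.findall. This is a minimal sketch, not part of the answer above: valid_words is a hypothetical name, and [A-Za-z] restricts the first character to ASCII letters, which is narrower than [[:alpha:]] under some locales.

```python
import re

# A letter at a word boundary, followed by any number of word characters.
WORD = re.compile(r"\b[A-Za-z]\w*\b")


def valid_words(text):
    """Join every word that starts with a letter using underscores."""
    return "_".join(WORD.findall(text))


print(valid_words("1.2.3::L2 Traffic-house seen during ABCD "
                  "from 2.2.4/5.2.3a to 1.2.3.X11"))
```

Tokens such as 3a and 0x880016 are skipped because there is no word boundary between the leading digit and the letter that follows it.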
One way using awk, with the contents of script.awk:
BEGIN {
    FS="[^[:alnum:]_]"
}
{
    for (i=1; i<=NF; i++) {
        if ($i !~ /^[0-9]/ && $i != "") {
            if (i < NF) {
                printf "%s_", $i
            }
            else {
                print $i
            }
        }
    }
}
Run like:
awk -f script.awk file.txt
Alternatively, here is the one liner:
awk -F "[^[:alnum:]_]" '{ for (i=1; i<=NF; i++) { if ($i !~ /^[0-9]/ && $i != "") { if (i < NF) printf "%s_", $i; else print $i; } } }' file.txt
Results:
L2_Traffic_house_seen_during_ABCD_from_to_X11
XYZ_VRECYY_FAIL_Verify_failed_Client_Reason_Object_exi
ABCMGR_SERVICE_CRASHED_Service_abcmgr_PID_during_UPGRADE
This solution requires some tuning, and I think one needs gawk to use a regexp as the record separator:
http://www.gnu.org/software/gawk/manual/html_node/Records.html#Records
gawk -v ORS='_' -v RS='[-: \"()]' '/^[a-zA-Z]/' file.dat