Using sed to extract data from a file: I know the string I'm looking for, but I need the whole block of data that contains it

I'm using sed to extract data from a file that contains many blocks of similarly formatted data. I want to find every occurrence of a specific string, but the string is part of a block of information, and I want to extract the whole block based on that string.
Example data in file:
123
AAA
ABC
ZZZ
123
KJG
HJY
ZZZ
123
LPC
ABC
TRY
ZZZ
In this example 123 is the start of the block of data I want and ZZZ the end. ABC is the string I search for. So from this example my output should be:
123
AAA
ABC
ZZZ
123
LPC
ABC
TRY
ZZZ
I tried:
sed -n '/ABC/{:a;p;n;/123/b;ba};' testfile.txt > testfile2.txt
The output with this is:
ABC
ZZZ
ABC
TRY
ZZZ
So I'm not getting the data before ABC in the block.

This might work for you (GNU sed):
sed -n '/123/{:a;N;/ZZZ/!ba;/ABC/p}' file
Gather up lines between 123 and ZZZ and then print them if they contain ABC.
N.B. n replaces the current line with the next one (printing the current line first unless -n is in effect), whereas N appends the next line to the pattern space, inserting a newline between them. The appended lines thus stay in the pattern space and remain searchable.
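The difference between n and N can be seen on a three-line input. This is a minimal sketch: the input lines are made up for illustration, and -n is used so that only the explicit p prints anything.

```shell
# With n, the pattern space is replaced by the next line, so "123" is gone
# by the time /ABC/ is tested.
out_n=$(printf '123\nABC\nZZZ\n' | sed -n 'n;/ABC/p')

# With N, the next line is appended, so "123" and "ABC" are printed together.
out_N=$(printf '123\nABC\nZZZ\n' | sed -n 'N;/ABC/p')

printf 'n: %s\n' "$out_n"
printf 'N: %s\n' "$out_N"
```

The first command prints only ABC; the second prints 123 followed by ABC, which is why the N-based loop can keep the start of the block.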

Related

Replace pattern with consecutive strings from list

I would like to find a specific string in one file and then replace it with consecutive strings from another file. The order of replacement should be maintained.
The first file looks like this:
>A1
NNNNNNNNNN
NNNNNNNNNN
>B2
ACGTNNNNNN
NNNGTGTNNN
NNNNNNNNNN
>B3
GGGGGGGGGG
NNNTTTTTTT
NNNNCTGNNN
And the file with strings looks like this:
Name1
Name1
Name2
Name2
Name3
Name4
So finally I would like to find lines containing '>' and replace '>' with '>string' from second file to get this output:
>Name1 A1
NNNNNNNNNN
NNNNNNNNNN
>Name1 B2
ACGTNNNNNN
NNNGTGTNNN
NNNNNNNNNN
>Name2 B3
GGGGGGGGGG
NNNTTTTTTT
NNNNCTGNNN
If you have GNU sed:
sed '/^>/R file_with_strings' first_file | sed '/^>/{N;s/>\(.*\)\n\(.*\)/>\2 \1/;}'
This might work for you (GNU sed):
sed -E '1{x;s/^/cat file2/e;x}
/^>/{G;s/^>(\S+)\n(\S+)/>\2 \1/;P;s/^[^\n]*\n//;h;d}' file1
On the first line, slurp the second file into the hold space.
If a line begins with >, append the hold space and, using pattern matching and back references, build the required header line from the first line of the hold space.
Print the result (the first line of the pattern space), remove the first line, replace the hold space and delete the current line.
Repeat.
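For comparison, the same replacement can be sketched in POSIX awk, which avoids the GNU-only R command and e flag. This is an alternative, not the method from the answers above. The file names first_file and file_with_strings follow the first answer, the data is a shortened version of the sample, and the temporary directory exists only for the demo.

```shell
dir=$(mktemp -d)
printf '%s\n' '>A1' 'NNNNNNNNNN' '>B2' 'ACGTNNNNNN' '>B3' 'GGGGGGGGGG' > "$dir/first_file"
printf '%s\n' 'Name1' 'Name1' 'Name2' 'Name2' > "$dir/file_with_strings"

# First pass (NR == FNR): store the names in order.
# Second pass: on each header line, insert the next unused name after ">".
out=$(awk 'NR == FNR { names[NR] = $0; next }
           /^>/      { $0 = ">" names[++i] " " substr($0, 2) }
           1' "$dir/file_with_strings" "$dir/first_file")

printf '%s\n' "$out"
rm -rf "$dir"
```

Building the line with substr rather than sub() means a name containing & or a backslash is inserted literally.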

How to match the last part of a line conditionally?

I am very new to Perl. Currently I am using a very simple Perl regex to print the last part of a line after the string "Lecture", reading from a file 1.txt.
cat 1.txt | perl -ne 'print "$1 \n" while /Lecture\s+(\d+\w)/g;'
It works well, but I need to add a simple condition to it:
The first preference is always to print the characters after the string "Lecture".
If the string "Lecture" is not found in a line, simply print the characters at the very end of the line.
PS: The string "Lecture" might not have a space around it. I used a word character throughout because the value is not necessarily a plain number; it can be alphanumeric.
Example
cat 1.txt
Some Topic 1 Lecture 001
Some Topic 2 Lecture 002
Topic 3 ( classroom Session ) Lecture2B
Practicals 07A
Submissions 10
Topic5Lecture4
Expected output:
001
002
2B
07A
10
4
I preferably want a solution which I can directly run in the cli/console. ( Just Like my original code - cat 1.txt | perl code ).
I don't want to execute a separate .pl file.
This regex:
(?:\w*Lecture)?([^\s]+)$
will capture ((...)) all (+) non-whitespace characters ([^\s]) at the end of the line ($),
optionally (?) preceded by a non-captured ((?:...)) "Lecture", even if there are other word characters before it (\w*).
It gets the desired output:
001
002
2B
07A
10
4
4
For the sample input:
Some Topic 1 Lecture 001
Some Topic 2 Lecture 002
Topic 3 ( classroom Session ) Lecture2B
Practicals 07A
Submissions 10
Topic5 Lecture4
Topic5Lecture4

How to find and replace every even-numbered appearance of a match in BASH?

I am using sed -i 's/AAA/ZZZ/g' filename to replace every occurrence of "AAA" with "ZZZ" in a file. Instead, I need to replace every even-numbered occurrence of "AAA" with "ZZZ", e.g.:
This is a AAA sentence. AAA
This is another AAA sentence.
This is yet AAA another AAA sentence.
This is AAA stillAAA AAA yet AAA another AAA sentence.
This would become:
This is a AAA sentence. ZZZ
This is another AAA sentence.
This is yet ZZZ another AAA sentence.
This is ZZZ stillAAA ZZZ yet AAA another ZZZ sentence.
How to replace every even-numbered appearance of a match?
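Here is a short GNU awk version (RT, the text that matched RS, is empty for the final record, so nothing is appended at the end of the file):
awk '{ORS = RT=="" ? "" : (NR%2==0 ? "ZZZ" : RT)} 1' RS="AAA" file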
This is a AAA sentence. ZZZ
This is another AAA sentence.
This is yet ZZZ another AAA sentence.
This is ZZZ stillAAA ZZZ yet AAA another ZZZ sentence.
awk is a better tool for this than sed. Consider this awk command:
awk -F 'AAA' '{for (i=1; i<NF; i++) {OFS=c%2?"ZZZ":FS; printf "%s%s", $i, OFS; c++}
print $NF}' file
This is a AAA sentence. ZZZ
This is another AAA sentence.
This is yet ZZZ another AAA sentence.
This is ZZZ stillAAA ZZZ yet AAA another ZZZ sentence.
This awk command sets the input field separator to AAA and toggles the output separator between AAA and ZZZ depending on whether a counter is even or odd: when the counter is even the separator is AAA, and when it is odd it is ZZZ. The counter is not reset between lines, so the numbering carries across the whole file.
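The same counter idea extends to every n-th occurrence. The following is a sketch rather than part of the original answer: the variable n is a hypothetical parameter introduced here, and n=2 reproduces the even-numbered case from the question.

```shell
file=$(mktemp)
printf '%s\n' 'This is a AAA sentence. AAA' \
              'This is another AAA sentence.' \
              'This is yet AAA another AAA sentence.' \
              'This is AAA stillAAA AAA yet AAA another AAA sentence.' > "$file"

# c counts separators across the whole file; every n-th one becomes ZZZ.
out=$(awk -v n=2 -F 'AAA' '{
    for (i = 1; i < NF; i++) {
        c++
        printf "%s%s", $i, (c % n == 0 ? "ZZZ" : "AAA")
    }
    print $NF
}' "$file")

printf '%s\n' "$out"
rm -f "$file"
```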
Here is a perl solution:
$ cat inp
This is a AAA sentence. AAA
This is another AAA sentence.
This is yet AAA another AAA sentence.
This is AAA stillAAA AAA yet AAA another AAA sentence.
$ perl -0777 -pe 's/(.*?AAA.*?)AAA/$1ZZZ/gs' < inp
This is a AAA sentence. ZZZ
This is another AAA sentence.
This is yet ZZZ another AAA sentence.
This is ZZZ stillAAA ZZZ yet AAA another ZZZ sentence.
Here, -0777 makes Perl read the entire file as a single string; I then replace every second occurrence of AAA with ZZZ, using non-greedy matching.
Perl:
perl -wpe 'BEGIN{$/="AAA"} $.%2 or s/AAA/ZZZ/' foo.txt
You can do it with sed too:
sed -n -e '1,$ {
:oddline s/AAA/\n/g; :odd s/\n/AAA/m; t even ;p;N;s/.*\n//;b oddline ;
:evenline s/AAA/\n/g; :even s/\n/ZZZ/m; t odd ; p;N;s/.*\n//;b evenline ;
}' << _END_
This is a AAA sentence. AAA
This is another AAA sentence.
This is yet AAA another AAA sentence.
This is AAA stillAAA AAA yet AAA another AAA sentence.
_END_
The sed script loops through all lines and remembers odd/even replacements (across lines). In the pattern space, all AAAs are first replaced by newlines and then replaced one at a time by either AAA or ZZZ. In order to switch to the next line it is first appended (N) and then the previous one is deleted (s/.*\n//).
sed "1 h;1 !H;$ {x;l;s/=/=e/g;s/²/=c/g;s/AAA/²/g;s/²\([^²]\{1,\}\)²/²\1ZZZ/g;s/²/AAA/g;s/=c/²/g;s/=e/=/g;}" YourFile
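This substitutes a placeholder character (²) for AAA, because a multi-character string like AAA cannot be excluded with a bracket expression the way a single character can. The extra translation before and after (= to =e, ² to =c, then back) ensures it still works when the placeholder character itself appears in the input.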
This might work for you (GNU sed):
sed -r ':a;$!{N;ba};/\x00/q1;s/AAA/\x00/g;s/(\x00)([^\x00]*)\1/AAA\2ZZZ/g' file
This slurps the file into memory and then replaces all occurrences of AAA with a unique character. Then each odd and even occurrence of the unique character is replaced by AAA and ZZZ respectively.
N.B. If the chosen character already occurs in the input, no change is made and sed exits with an error code of 1.
This second method is more long-winded but can be used to change the N'th value and does not rely on a unique value:
sed -r 's/AAA/\n&/g;/\n/!b;G;:a;s/$/#/;s/#{2}$//;/\n$/s/\nAAA/\nZZZ/;s/\n//;/\n.*\n/ba;P;s/^.*\n//;h;d' file
It stores the number of occurrences of the required pattern in the hold space and retrieves it when it encounters a line with such a pattern.

Tricky multiline delete in sed

Here is the input:
aaa
bbb
ccc
ddd
eee
fff
What I want is something like sed "/ccc/,/(eee)/d", but it should ALSO delete the "bbb" line (the line before "ccc"),
so that the output is:
aaa
fff
Any ideas?
This might work for you (GNU sed):
sed ':a;$!{N;/\nccc/!{P;D};/\neee/!ba;d}' file
If you are fine with awk, this should do:
$ awk '/ccc/,/eee/{if(i!=1){i=1;x="";}next}{if (x)print x;x=$0;}END{print x}' file
aaa
fff
Each line is held in x and printed only when the next line outside the range arrives, so the output lags one line behind the input. The range itself is filtered with awk's normal range pattern; when the range is first entered, x is cleared so that the line just before the range is never printed.
Update:
sed solution:
$ sed '${x;p;};/ccc/,/eee/{/ccc/{s/.*//;x;};d;};1{h;d;};x;/^$/d;' file
You could do this in a simple 2-pass approach, first pass to identify the lines to delete and the second pass to print only the lines that are not marked for deletion:
awk '/ccc/,/eee/{d[NR]=d[NR-1]=1} NR!=FNR && !d[FNR]' file file
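The two-pass command can be checked against the sample input like this. It is a small sketch in which the temporary file stands in for file.

```shell
file=$(mktemp)
printf '%s\n' aaa bbb ccc ddd eee fff > "$file"

# Pass 1 (NR == FNR): the range rule marks each line in the range and the
# line before it. Pass 2 (NR != FNR): print only the unmarked lines.
out=$(awk '/ccc/,/eee/ { d[NR] = d[NR-1] = 1 } NR != FNR && !d[FNR]' "$file" "$file")

printf '%s\n' "$out"
rm -f "$file"
```

The range rule also runs during the second pass, where NR keeps counting, so an input whose very first line starts the range would also mark the last line. For the sample data this does not arise.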

Printing text between regexps

I tried the '/pat1/,/pat2/p', but I want to print only the text between the patterns, not the whole line. How do I do that?
A pattern range is for multiline patterns. This is how you'd do that:
sed -n '/pat1/,/pat2/{/pat1\|pat2/!p}' inputfile
-n - don't print by default
/pat1/,/pat2/ - within the two patterns inclusive
/pat1\|pat2/!p - print everything that's not one of the patterns
What you may be asking for is what's between two patterns on the same line. One of the other answers will do that.
Edit:
A couple of examples:
$ cat file1
aaaa bbbb cccc
123 start 456
this is what
I want
789 end 000
xxxx yyyy zzzz
$ sed -n '/start/,/end/{/start\|end/!p}' file1
this is what
I want
You can shorten it by telling sed to use the most recent pattern again (//):
$ sed -n '/.*start.*/,/^[0-9]\{3\} end 0*$/{//!p}' file1
this is what
I want
As you can see, I didn't have to duplicate the long, complicated regex in the second part of the command.
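For comparison, the same exclusive range can be sketched in awk with a flag variable. This is an alternative idiom, not part of the answer above; f is a name chosen here for illustration.

```shell
file=$(mktemp)
printf '%s\n' 'aaaa bbbb cccc' '123 start 456' 'this is what' 'I want' \
              '789 end 000' 'xxxx yyyy zzzz' > "$file"

# f is 1 only for lines strictly between the start and end lines:
# the end test clears it before the print, the start test sets it after.
out=$(awk '/end/ { f = 0 } f; /start/ { f = 1 }' "$file")

printf '%s\n' "$out"
rm -f "$file"
```

The order of the three rules is what excludes both boundary lines.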
sed -r 's/pat1(.*)pat2/\1/g' somefile.txt
I don't know what kind of pattern you used, but I think this is also possible with a regular expression substitution that prints only the lines where it matched:
sed -rn 's/^.*pat1(.*)pat2.*$/\1/p' myfile
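A minimal sketch of the same-line case in portable (POSIX basic regex) sed syntax, on made-up input: -n together with the p flag prints only lines where the substitution succeeded, so lines without both patterns are dropped.

```shell
out=$(printf '%s\n' 'other TEXT' 'pat1 text i want pat2' 'other text' |
      sed -n 's/.*pat1\(.*\)pat2.*/\1/p')

# Brackets make the surrounding spaces visible.
printf '[%s]\n' "$out"
```

The captured text keeps the spaces that sat next to pat1 and pat2; widen the patterns (for example 'pat1 *' and ' *pat2') if those should be trimmed.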
You can use awk (GNU awk, since RT is a gawk extension):
$ cat file
other TEXT
pat1 text i want pat2
pat1 TEXT I
WANT
pat2
other text
$ awk -vRS="pat2" 'RT{gsub(/.*pat1/,"");print}' file
text i want
TEXT I
WANT
This solution also works for patterns that span multiple lines.