sed: remove between delimiters IF a string is present between them

Right now I have a multiline string matching this format:
---
some text
more text
MATCH: FIRST
more text
---
some text
more text
---
some text
MATCH: SECOND
more text
---
some
more
MATCH: THIRD
text
here
---
I'm looking for a way in bash (preferably using sed) to remove everything between --- and --- if MATCH: FIRST or MATCH: SECOND is present between them. That is, for the above example I would want my output to look like:
---
some text
more text
---
some
more
MATCH: THIRD
text
here
---
For my purposes I don't really care either way if the delimiters are removed (the ---). Any help is appreciated.
The closest I've gotten is doing something along these lines:
sed -e "/---*[MATCH: ]FIRST\|SECOND[^---]/,/---/d"
but I seem to be missing something.

The --- delimiters repeat regularly and are easy to "catch". Accumulate each block that ends with --- in the hold space, then match the whole block against the searched pattern. If it does not match, print it.
The following shell script:
cat <<EOF |
---
some text
more text
MATCH: FIRST
more text
---
some text
more text
---
some text
MATCH: SECOND
more text
---
some
more
MATCH: THIRD
text
here
---
moretest
---
andnaotherone
---
MATCH: SECOND
---
MATCH: SECOND
---
EOF
sed -n '
/^---$/!{H;b} # Accumulate one block in the hold space
H;x
# Print the block only if it does NOT contain the searched pattern
/\nMATCH: \(FIRST\|SECOND\)\n/!{
s/^\n// # remove the leading newline added by H
p
}
# Clear the hold space so it is empty for the next block
s/.*//;h
b
'
outputs:
---
some text
more text
---
some
more
MATCH: THIRD
text
here
---
moretest
---
andnaotherone
---
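As a cross-check of the same accumulate-then-test idea, here is a minimal awk sketch. It is not part of the answer above; the function name filter_blocks is hypothetical, and like the sed script it drops any trailing text that is not closed by a --- line.

```shell
#!/bin/sh
# filter_blocks (hypothetical name): print each block that ends with ---
# unless it contains a line that is exactly MATCH: FIRST or MATCH: SECOND.
filter_blocks() {
  awk '
    { block = block $0 "\n" }      # accumulate every line, delimiter included
    /^---$/ {                      # a delimiter closes the current block
      if (block !~ /(^|\n)MATCH: (FIRST|SECOND)\n/) {
        printf "%s", block         # no match: keep the block
      }
      block = ""                   # start a fresh block either way
    }
  '
}

printf '%s\n' '---' 'some text' 'MATCH: FIRST' '---' 'keep me' '---' | filter_blocks
```

The pattern is anchored with (^|\n) and a trailing \n so that only whole lines count as a match, which mirrors the \n...\n test in the sed version.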

Query about how sed works: the print command acts as a range in a multi-command script but as a single-line address when used alone

I have a query regarding a command posted in this question.
As I understand the sed flow:
sed -n '1!G;h;$p'
sed processes its commands from left to right, in a loop over every input line. That is, for every line of input it tries to execute every command in the semicolon-separated list. In the above example, sed reads the first line into the pattern space. It then has three commands to try on that line. The first command is 1!G; since the current line is the first line and the address is negated, sed skips it and moves on to the second command. It then tries the third command, $p, but since that command prints only on the last line, it is skipped for every line until the last one.
If my understanding above is right, then for the command below:
sed -n '1!G;h;7p;8p'
when the 8th line is read it should print the 7th and 8th lines; the print command should not be applied to any other line.
But it printed 15 lines, in reverse order.
It printed lines 1-8 and then lines 1-7 again.
Can anyone help clarify?
As per my understanding and the sed documentation, sed operates line by line, in left-to-right order, but the above command seems to process the entire file again.
What actually happened was that, instead of printing lines 1 to 8, it stored 7 lines in the pattern space and printed them, then read the 8th line and printed the pattern space again.
Based on the above observation, sed acts on the pattern space for each line read.
Then, for the command below:
sed -n '1!G;h;7p'
Since commands are executed in order, the print command should be executed only when the 7th line is read, and it should print that line, but it printed lines 1-7 from the pattern space.
7p should literally mean "print the 7th line".
But here it acts as a range: if the pattern space holds the 7th line of input, it prints lines 1-7.
sed '7p' -n ---> prints the 7th line
sed -n '1!G;h;7p' ---> prints lines 1-7
sed -n '1!G;h;1,7p' ---> prints 1, 1-2, 1-3, 1-4, 1-5, 1-6, 1-7
Can someone clarify why this happens?
"Means every single line in input it will try to execute every single commands specified by semicolon."
Commands in sed are separated by newlines or semicolons.
"...is supplied it will skip to second command, then it will try to execute third"
Yes, it will skip G, but it will not skip h. The second command has no address, so it always runs, and it copies the pattern space into the hold space (overwriting what was there).
1!G # when not on first line do `G`
h # do `h`, always
$p # on last line print
"it was printing 1-7 line of pattern space"
Sure: h holds the pattern space in the hold space on every line, and G grabs it back on every line after the first. So on the 7th line there are 6 lines in the hold space; G appends them to the pattern space, h saves the result, and then 7p prints it.
Because of how G + h work, the lines end up in reverse order: h puts the first line in the hold space; on the second line G appends the hold space to the pattern space, so the pattern space holds the second line followed by the first; then h puts that into the hold space, and so on.
Events:
> seq 10 | sed -n '1!G;h;3p'
- read first line to pattern space
- pattern space is `1`
- `1!G` - ignored for first line
- `h` - `1` is put into hold space
- hold space is `1`
- `3p` - ignored, not third line
- read second line to pattern space
- pattern space is `2`
- `1!G` - `G` grabs hold space to pattern space with a newline
- pattern space is `2\n1`
- `h`
- hold space is `2\n1`
- `3p` - ignored, not third line
- read third line to pattern space
- pattern space is `3`
- `1!G` - `G` grabs hold space to pattern space with a newline
- pattern space is `3\n2\n1`
- `h`
- hold space is `3\n2\n1`
- `3p` - third line, so we print pattern space with a newline
- print `3\n2\n1\n` - so you see the output in reverse
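"why it was occurring?"
Because the G and h commands run on every cycle. The address on p only selects when to print; p then prints the whole pattern space, whatever it contains at that moment.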
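A short sketch of the difference, assuming seq is available to generate numbered input. The variable names are only for illustration.

```shell
#!/bin/sh
# Without G and h the pattern space holds only the current line.
plain=$(seq 5 | sed -n '3p')

# With 1!G;h the pattern space has accumulated lines 3, 2, 1 by cycle 3.
stacked=$(seq 5 | sed -n '1!G;h;3p')

# Printing only on the last line yields the whole input reversed.
reversed=$(seq 5 | sed -n '1!G;h;$p')

# Show each result on one line, with newlines turned into spaces.
printf 'plain: %s\n' "$plain"
printf 'stacked: %s\n' "$(printf '%s' "$stacked" | tr '\n' ' ')"
printf 'reversed: %s\n' "$(printf '%s' "$reversed" | tr '\n' ' ')"
```

In all three commands the address on p is a single line number; only the contents of the pattern space differ.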
GNU sed's --debug option shows this nicely:
$ printf "%s\n" one two three four | sed --debug -n '1!G;h;3p'
SED PROGRAM:
1! G
h
3 p
INPUT: 'STDIN' line 1
PATTERN: one
COMMAND: 1! G
COMMAND: h
HOLD: one
COMMAND: 3 p
END-OF-CYCLE:
INPUT: 'STDIN' line 2
PATTERN: two
COMMAND: 1! G
PATTERN: two\none
COMMAND: h
HOLD: two\none
COMMAND: 3 p
END-OF-CYCLE:
INPUT: 'STDIN' line 3
PATTERN: three
COMMAND: 1! G
PATTERN: three\ntwo\none
COMMAND: h
HOLD: three\ntwo\none
COMMAND: 3 p
three
two
one
END-OF-CYCLE:
INPUT: 'STDIN' line 4
PATTERN: four
COMMAND: 1! G
PATTERN: four\nthree\ntwo\none
COMMAND: h
HOLD: four\nthree\ntwo\none
COMMAND: 3 p
END-OF-CYCLE:

Using sed to replace the transcript ID with the gene ID in a FASTA header

I have a FASTA file with headers like this:
ENSGACT00000000002.1 cdna scaffold:BROADS1:scaffold_154:1880:13338:1 gene:ENSGACG00000000002.1 gene_biotype:protein_coding transcript_biotype:protein_coding
I need to swap the gene field with the transcript ID at the beginning of the line. I have tried and tried, but I have been unsuccessful.
Anything helps!
If you want to swap the text before the first space with the text between "gene:" and the following space then there are four parts of the line that you need to capture:
^\([^ ][^ ]*\) is the first bit of text that doesn't contain a space - this becomes \1
\(.*gene:\) is everything from that first space up to and including the text gene: - this becomes \2
\([^ ][^ ]*\) is the text between gene: and the next space - this becomes \3
\(.*\)$ is everything on the rest of the line - this becomes \4
Then you replace those 4 pieces with the same 4 pieces - just rearranged: \3\2\1\4
So the sed command would be:
sed 's/^\([^ ][^ ]*\)\(.*gene:\)\([^ ][^ ]*\)\(.*\)$/\3\2\1\4/' file
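A quick way to try the command is to feed it the sample header from the question on standard input. This is a minimal sketch; the variable names are hypothetical. The sample line has no leading >, and the sketch assumes the same; with a leading > the first capture would include it.

```shell
#!/bin/sh
header='ENSGACT00000000002.1 cdna scaffold:BROADS1:scaffold_154:1880:13338:1 gene:ENSGACG00000000002.1 gene_biotype:protein_coding transcript_biotype:protein_coding'

# Swap the first word with the word that follows "gene:".
swapped=$(printf '%s\n' "$header" |
  sed 's/^\([^ ][^ ]*\)\(.*gene:\)\([^ ][^ ]*\)\(.*\)$/\3\2\1\4/')

printf '%s\n' "$swapped"
```

The line now starts with ENSGACG00000000002.1, and the gene: field carries ENSGACT00000000002.1.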

Copy selected content from one sheet of Notepad++ to another

I have data that is pipe-separated, e.g.
1|2|3|4|5|6|7|8|9|10|
I have to copy and paste (to a new sheet) only the part that is between pipes 6 and 9.
I have 10,000 rows like this.
How can we do this? Can we write a macro for it? Is there any other solution?
Copy the entire text into a new buffer, then edit the text to remove the unwanted parts. You can do that with a regular expression replace-all of ^(?:[^|\r\n]*\|){5}([^|\r\n]*)\|.*$ with \1.
Explanation
^ - start of line
(?: - start of a non-capturing group
[^|\r\n]* - zero or more characters that are not a | or newlines or carriage returns
\| - a |
){5} - exactly 5 occurrences of the previous group
-- the effect of the above is to match the unwanted leading characters
([^|\r\n]*) - a group containing the characters to keep
-- the wanted part of the line is saved in capture group 1
\|.*$ - a | then everything else to the end of the line
-- matches the unwanted right-hand part of the line
The final $ is not strictly needed. But, when considered with the opening ^, it serves to document that the regular expression looks at the whole line.
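Outside Notepad++, the same expression can be tried from a shell. This is only a sketch for checking the pattern, not part of the answer above; the variable names are hypothetical. The expression as written keeps the sixth field. If "between pipe 6 - 9" means fields 7 through 9, which is an assumption about the question, cut can select them directly.

```shell
#!/bin/sh
line='1|2|3|4|5|6|7|8|9|10|'

# Same shape as the Notepad++ expression: skip five fields, keep the sixth.
sixth=$(printf '%s\n' "$line" | sed -E 's/^([^|]*\|){5}([^|]*)\|.*$/\2/')

# If fields 7 through 9 are wanted instead, select them by position.
seven_to_nine=$(printf '%s\n' "$line" | cut -d'|' -f7-9)

printf '%s\n%s\n' "$sixth" "$seven_to_nine"
```

sed -E has no non-capturing groups, so the kept field is \2 here rather than \1.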

sed: how to locate between two ranges (Numeric)

Is it possible to use sed to match addresses that fall between two IP addresses?
Currently I am doing:
sed '/85.159.56/s/$/ --- API SYSTEMS/'
But I have a specific range that I want the label printed next to.
(This is part of a bigger script.)
For example:
sed '/192.200.160.0 - 192.200.191.255/s/$/ --- APIv2 SYSTEMS/'
I know this is not the correct format.
Ideally I want sed to find any address that falls within this range
and then print --- APIv2 SYSTEMS next to it.
I have tried many ways without success, which makes me think sed is not the tool for the job.
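sed has no numeric comparison, so one possible approach is to spell the range out as a regular expression. The sketch below assumes this particular range, where only the third octet is restricted (160-189 as 1[6-8][0-9], 190-191 as 19[01]) and the fourth octet may be anything. The function name tag_range is hypothetical, and a range that does not align on an octet boundary would need a longer expression or a different tool.

```shell
#!/bin/sh
# tag_range (hypothetical name): append a label to lines that contain an
# address in 192.200.160.0 - 192.200.191.255.
tag_range() {
  sed -E '/(^|[^0-9.])192\.200\.(1[6-8][0-9]|19[01])\.[0-9]{1,3}([^0-9]|$)/s/$/ --- APIv2 SYSTEMS/'
}

printf '%s\n' 192.200.159.9 192.200.160.0 192.200.191.255 192.200.192.1 | tag_range
```

Only the two middle addresses get the label. The dots are escaped because an unescaped . matches any character, and the guards on both sides stop the pattern from matching inside a longer number.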

Are sed's pattern space and hold space initialized to null or to an empty string?

From the documentation of sed:
sed maintains two data buffers: the active pattern space, and the
auxiliary hold space. Both are initially empty.
I initially thought the value of the pattern space and hold space was null (nothing). But from the following example, it seems that their initial value is a single newline character (\n).
[root#localhost ~]# cat e.txt
aa
bb
cc
dd
[root#localhost ~]# cat e.txt | sed -r '/c/{x;p;x}'
aa
bb

cc
dd
[root#localhost ~]#
Is my understanding right?
Thanks.
I think the answer is that the p command, like the default print action, is actually adding a newline to the end of the empty pattern space. This is based on this little snippet from the GNU sed documentation (just below that bit you quote, by the way):
sed operates by performing the following cycle on each line of input: first, sed reads one line from the input stream, removes any trailing newline, and places it in the pattern space.
... blah, blah blah ...
When the end of the script is reached, unless the -n option is in use, the contents of pattern space are printed out to the output stream, adding back the trailing newline if it was removed.
In other words, the line being held in the pattern (and hold) space does not have the trailing newline - the aa line is held as aa rather than aa<newline>.
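Of course, the hold space may still contain multiple lines, but the same rule applies there: no trailing newline is stored. Note that H always appends a newline followed by the pattern space, even when the hold space is empty, so executing the H command on the first two lines of your file will give you a hold space containing <newline>aa<newline>bb, not aa<newline>bb<newline>.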
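Both points can be checked with a small sketch; the variable names are hypothetical. The first command counts the bytes that p emits for an empty space, and the second makes the newlines inside the hold space visible by turning them into |.

```shell
#!/bin/sh
# x swaps in the (still empty) hold space; p then prints it plus a newline.
empty_bytes=$(printf 'cc\n' | sed -n 'x;p' | wc -c)

# H on every line, then show the hold space with newlines replaced by |.
held=$(printf 'aa\nbb\n' | sed -n 'H;${x;s/\n/|/g;p;}')

printf 'bytes printed for the empty space: %s\n' "$empty_bytes"
printf 'hold space after H on two lines: %s\n' "$held"
```

A single byte is printed for the empty space, which is the newline added by p, and the hold space shows up as |aa|bb, with a leading separator and no trailing one.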