How to parse sed regex syntax?

sed -i "0,/test/s//#test/g" file.txt
I do not know how to parse this regex. It comments out test by prefixing it with #, but my questions are:
what is "0," at the beginning?
why is it not written like "s/test/#test/g"? In other words, why is the s in the middle?
Any help is appreciated.

Let's break it down into smaller pieces:
https://www.gnu.org/software/sed/manual/sed.html#sed-script-overview
sed commands follow this syntax:
[addr]X[options]
X is a single-letter sed command. [addr] is an optional line address. If [addr] is specified, the command X will be executed only on the matched lines.
And
https://www.gnu.org/software/sed/manual/sed.html#Range-Addresses
An address range can be specified by specifying two addresses separated by a comma (,). An address range matches lines starting from where the first address matches, and continues until the second address matches (inclusively)
In the case of 0,/test/s//#test/g the address part is 0,/test/ because s is the command. An address part of 0,/test/ means the s command is only executed on lines inside that range. If the sed command was s/test/#test/g there wouldn't be an address part and the s command would be attempted on every line in the file.
https://www.gnu.org/software/sed/manual/sed.html#index-addr1_002c_002bN
A line number of 0 can be used in an address specification like 0,/regexp/ so that sed will try to match regexp in the first input line too. In other words, 0,/regexp/ is similar to 1,/regexp/, except that if addr2 matches the very first line of input the 0,/regexp/ form will consider it to end the range, whereas the 1,/regexp/ form will match the beginning of its range and hence make the range span up to the second occurrence of the regular expression.
Note that this is the only place where the 0 address makes sense; there is no 0-th line and commands which are given the 0 address in any other way will give an error.
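So in 0,/test/s//#test/g, the address part 0,/test/ selects every line from the start of the input through the first line that matches /test/, even if that is the very first line. Because the s command only changes lines that contain test, the only line actually modified is that first match.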
https://www.gnu.org/software/sed/manual/sed.html#index-empty-regular-expression
The empty regular expression ‘//’ repeats the last regular expression match (the same holds if the empty regular expression is passed to the s command).
So 0,/test/s//#test/g is the same as 0,/test/s/test/#test/g because the empty regular expression reuses the one from the address part. The regex can be left out of the s command because writing it twice just makes the whole command less readable.
In conclusion:
s/test/#test/g does the replacement on every line in the file that contains test
0,/test/s//#test/g does the replacement only on the first line in the file that contains test
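The difference between these forms can be checked on a small sample. This is a minimal sketch assuming GNU sed (the 0,/regexp/ address is a GNU extension); the four-line input is made up for illustration. The 1,/test/ variant spells out the regex in the s command so that it does not depend on a previously used regex being available on line 1.

```shell
# Hypothetical sample: "test" appears on lines 1, 3 and 4.
input=$(printf 'test\nother\ntest\ntest\n')

# No address: the substitution is attempted on every line.
all=$(printf '%s\n' "$input" | sed 's/test/#test/g')

# 0,/test/ (GNU sed): the range ends at the first match, even on line 1.
first=$(printf '%s\n' "$input" | sed '0,/test/s//#test/g')

# 1,/test/: the end address is only checked from line 2 onwards,
# so the range runs through line 3 and two lines are changed.
one=$(printf '%s\n' "$input" | sed '1,/test/s/test/#test/g')

printf 'no address:\n%s\n\n0,/test/:\n%s\n\n1,/test/:\n%s\n' "$all" "$first" "$one"
```

Only the 0,/test/ form leaves the later occurrences alone when the first match is on line 1.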

Related

Add words at beginning and end of a FASTA header line with sed

I have the following line:
>XXX-220_5004_COVID-A6
TTTATTTGACATGAGTAAATTTCCCCTTAAATTAAGGGGTACTGCTGTTATGTCTTTAAA
AGAAGGTCAAATCAATGATATGATTTTATCTCTTCTTAGTAAAGGTAGACTTATAATTAG
AGAAAACAAC
I would like to convert the first line as follows:
>INITWORD/XXX-220_5004_COVID-A6/FINALWORD
TTTATTTGACATGAGTAAATTTCCCCTTAAATTAAGGGGTACTGCTGTTATGTCTTTAAA
AGAAGGT...
So far I have managed to add the first word as follows:
sed 's/>/>INITWORD\//I'
That returns:
>INITWORD/XXX-220_5004_COVID-A6
TTTATTTGACATGAGTAAATTTCCCCTTAAATTAAGGGGTACTGCTGTTATGTCTTTAAA
AGAAGGT
How can I add the FINALWORD at the end of the first line?

Just substitute more. sed conveniently allows you to recall the text you matched with a back reference, so just embed that between the things you want to add.
sed 's%^>\(.*\)%>INITWORD/\1/FINALWORD%I' file.fasta
I also added a ^ beginning-of-line anchor, and switched to % delimiters so the slashes don't need to be escaped.
In some more detail, the s command's syntax is s/regex/replacement/flags where regex is a regular expression to match the text you want to replace, and replacement is the text to replace it with. In the regex, you can use grouping parentheses \(...\) to extract some of the matched text into the replacement; so \1 refers to whatever matched the first set of grouping parentheses, \2 to the second, etc. The /flags are optional single-character specifiers which modify the behavior of the command; so for example, a /g flag says to replace every match on a line, instead of just the first one (but we only expect one match per line so it's not necessary or useful here).
The I flag is non-standard but since you are using that, I assume it does something useful for you.
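As a quick check of the back reference approach, here is a minimal sketch run on a shortened copy of the record from the question. The I flag is left out here so the command does not rely on a non-standard flag; the sequence line is truncated for brevity.

```shell
# Shortened copy of the FASTA record from the question.
fasta=$(printf '>XXX-220_5004_COVID-A6\nTTTATTTGACATGAGTAAATTTCC\n')

# \(.*\) captures everything after the leading ">", and \1 puts it back
# between the two added words. Lines not starting with ">" are untouched.
renamed=$(printf '%s\n' "$fasta" | sed 's%^>\(.*\)%>INITWORD/\1/FINALWORD%')

printf '%s\n' "$renamed"
```

The sequence line passes through unchanged because it does not match the ^> anchor.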

Extracting substring from inside bracketed string, where the substring may have spaces

I've got an application that has no useful api implemented, and the only way to get certain information is to parse string output. This is proving to be very painful...
I'm trying to achieve this in bash on SLES12.
Given I have the following strings:
QMNAME(QMTKGW01) STATUS(Running)
QMNAME(QMTKGW01) STATUS(Ended normally)
I want to extract the STATUS value, ie "Ended normally" or "Running".
Note that the line structure can move around, so I can't count on the "STATUS" being the second field.
The closest I have managed to get so far is to extract a single word from inside STATUS like so
echo "QMNAME(QMTKGW01) STATUS(Running)" | sed "s/^.*STATUS(\(\S*\)).*/\1/"
This works for "Running" but not for "Ended normally"
I've tried switching the \S* for [\S\s]* in both "grep -o" and "sed" but it seems to corrupt the entire regex.

This is purely a regex issue. \S matches only non-whitespace characters inside (..), but the failing case contains a space, so the pattern does not match. Keep it simple by explicitly listing the characters allowed inside (..) as [a-zA-Z ]*, i.e. zero or more upper- and lower-case letters and spaces.
sed 's/^.*STATUS(\([a-zA-Z ]*\)).*/\1/'
Or use character classes [:alnum:] if you want numbers too
sed 's/^.*STATUS(\([[:alnum:] ]*\)).*/\1/'

sed 's/.*STATUS(\([^)]*\)).*/\1/' file
Output:
Running
Ended normally
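One thing to keep in mind with the s-only form is that lines without a STATUS(...) field are printed unchanged. A minimal sketch of a variant that prints only the extracted values, using -n together with the p flag of the s command; the third input line and the swapped field order on the second line are made up to illustrate this.

```shell
# Two lines modelled on the question (fields in either order) plus one
# hypothetical line that has no STATUS field at all.
lines=$(printf 'QMNAME(QMTKGW01) STATUS(Running)\nSTATUS(Ended normally) QMNAME(QMTKGW01)\nno status here\n')

# -n suppresses automatic printing; the trailing p prints a line only
# when the substitution actually happened on it.
status=$(printf '%s\n' "$lines" | sed -n 's/.*STATUS(\([^)]*\)).*/\1/p')

printf '%s\n' "$status"
```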

Extracting a substring matching a given pattern is a job for grep, not sed. We should use sed when we must edit the input string. (A lot of people use sed and even awk just to extract substrings, but that's wasteful in my opinion.)
So, here is a grep solution. We need to make some assumptions (in any solution) about your input - some are easy to relax, others are not. In your example the word STATUS is always capitalized, and it is immediately followed by the opening parenthesis (no space, no colon etc.). These assumptions can be relaxed easily. More importantly, and not easy to work around: there are no nested parentheses. You will want the longest substring of non-closing-parenthesis characters following the opening parenthesis, no matter what they are.
With these assumptions:
$ grep -oP '\bSTATUS\(\K[^)]*(?=\))' << EOF
> QMNAME(QMTKGW01) STATUS(Running)
> QMNAME(QMTKGW01) STATUS(Ended normally)
> EOF
Running
Ended normally
Explanation:
Command options: o to return only the matched substring; P to use Perl extensions (the \K marker and the lookahead). The regexp: we look for a word boundary (\b) - so the word STATUS is a complete word, not part of a longer word like SUBSTATUS; then the word STATUS and opening parenthesis. This is required for a match, but \K instructs that this part of the matched string will not be returned in the output. Then we seek zero or more non-closing-parenthesis characters ([^)]*) and we require that this be followed by a closing parenthesis - but the closing parenthesis is also not included in the returned string. That's a "lookahead" (the (?= ... ) construct).

sed match first word replace full line

I know this should be straight forward but I'm stuck, sorry.
I have two files both contain the same parameters but with different values. I'm trying to read one file line at a time, get the parameter name, use this to match in the second file and replace the whole line with that from file 1.
e.g. rw_2.core.fvbCore.Param.isEnable 1 (FVB_Params)
becomes
rw_2.core.fvbCore.Param.isEnable true (FVB_Boolean)
The lines are not always the same length but I always want to replace the whole line.
The code I have is as follows but it doesn't make the substitutions and I can't work out why not.
while read line; do
ParamName=`awk '{print $1}'`
sed -i 's/$ParamName.*/$line/g' FVB_Params.txt
done < FVB_Boolean.txt

You need your sed command within double quotes if you want those variables to be replaced with their values. You have single quotes, so sed is actually looking for strings with dollar signs to replace with the string '$line', not whatever your shell has in the $line variable.
In short, sed's not seeing the values you want. Switch to double quotes.
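Besides the quoting, the awk call in the question is given no input of its own, so it reads the loop's standard input and consumes the remaining lines of FVB_Boolean.txt. Below is a minimal sketch of a corrected loop, under several assumptions: GNU sed for -i, parameter names separated from values by a single space, and no |, & or backslash characters in the data. The second parameter line and the temporary directory are made up for the demonstration. Note that the dots in the parameter name are still treated as regex wildcards.

```shell
# Work on throwaway copies in a temporary directory (hypothetical data).
workdir=$(mktemp -d)
printf 'rw_2.core.fvbCore.Param.isEnable 1 (FVB_Params)\nrw_2.other.Param 7 (FVB_Params)\n' > "$workdir/FVB_Params.txt"
printf 'rw_2.core.fvbCore.Param.isEnable true (FVB_Boolean)\n' > "$workdir/FVB_Boolean.txt"

while IFS= read -r line; do
    # First whitespace-delimited field, taken from $line itself rather
    # than from the loop's standard input.
    param_name=${line%% *}
    # Double quotes let the shell expand the variables; | is used as the
    # delimiter so slashes in the data would not need escaping.
    sed -i "s|^${param_name} .*|${line}|" "$workdir/FVB_Params.txt"
done < "$workdir/FVB_Boolean.txt"

updated=$(cat "$workdir/FVB_Params.txt")
printf '%s\n' "$updated"
rm -rf "$workdir"
```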

At what stage is sed's pattern space printed?

I have heard that for the pattern space, the maximum number of addresses is two.
And that sed goes through each line of the text file, and for each of them, runs through all the commands in the script expression or script file.
When does sed print the pattern space? Is it at the end of the text file, after it has done the last line? Or is it as the ending part of processing each line of the text file, just after it has run through all commands, it dumps the pattern space?
Can anybody demonstrate
a) the max limit of the pattern space being two?
b) when the pattern space is printed? And, if you can, please provide a textual source that says so too.
And why does it look like the pattern space can hold a lot in my attempt below, when this tutorial says the following?
http://www.thegeekstuff.com/2009/12/unix-sed-tutorial-7-examples-for-sed-hold-and-pattern-buffer-operations/
Sed G function
The G function appends the contents of the holding area to the contents of the pattern space. The former and new contents are separated by a newline. The maximum number of addresses is two.
An example of what I found about the size of the pattern space, trying unsuccessfully to see its limit of two..
abc.txt is a text file with just the character z
sed 'h;G;G;G;G;G;G;G;G' abc.txt
prints many zs so I guess it can hold more than 2.
So I've misunderstood something.

An address is a way of selecting lines. Lines can be selected using zero, one or two addresses. This has nothing to do with the capacity of pattern space.
Consider the following input file:
aaa
bbb
ccc
ddd
eee
This sed command has zero addresses, so it processes every line:
s/./X/
Result:
Xaa
Xbb
Xcc
Xdd
Xee
This command has one address, it selects only the third line:
3s/./X/
Result:
aaa
bbb
Xcc
ddd
eee
An address of $ as in $s/./X/ would function the same way, but for the last line (regardless of the number of lines).
Here is a two-address command. In this case, it selects the lines based on their content. A single address command can do this, too.
/b/,/d/s/./X/
Result:
aaa
Xbb
Xcc
Xdd
eee
Pattern space is printed when given an explicit p or P command or when the script is complete for the current line of the input file (which includes ending the processing of the file with the q command) if the -n (suppress automatic printing) option is not in place.
Here's a demonstration of sed printing each line immediately upon receiving and processing it:
for i in {1..3}; do echo aaa$i; sleep 2; done | sed 's/./X/'
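The interaction between automatic printing, the -n option and an explicit p can also be seen without any timing involved. A minimal sketch on a made-up two-line input:

```shell
# Automatic printing plus an explicit p: every line appears twice.
twice=$(printf 'a\nb\n' | sed 'p')

# -n turns off automatic printing, so only the explicit p prints.
once=$(printf 'a\nb\n' | sed -n 'p')

# -n with no p at all: the edit happens but nothing is printed.
nothing=$(printf 'a\nb\n' | sed -n 's/./X/')

printf 'sed p:\n%s\n\nsed -n p:\n%s\n\nsed -n s/./X/: [%s]\n' "$twice" "$once" "$nothing"
```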
The capacity of pattern space (and hold space) has to do with the number of characters it can hold (and is implementation dependent) rather than the number of input lines. The newlines separating those lines are simply another character in that total. The G command simply appends a copy of hold space onto the end of what's in pattern space. Multiple applications of the G command append that many copies.
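A shortened version of the experiment from the question shows the growth. This sketch feeds a single line containing z through h followed by two G commands, so the pattern space ends up holding three copies separated by newlines.

```shell
# h copies "z" into hold space; each G appends a newline plus that copy.
grown=$(printf 'z\n' | sed 'h;G;G')

printf '%s\n' "$grown"
```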
In the tutorial that you linked to, the statement "The maximum number of addresses is two." is somewhat ambiguous. It means that you can use zero, one or two addresses to select the lines that command applies to. As in the above examples, you could apply G to all lines, one line or a range of lines. Depending on the command, it accepts no address, at most one address, or up to two addresses. See man sed under the Synopsis section for subheadings that group the commands by the number of addresses they accept.
From info sed:
3.1 How `sed' Works
'sed' maintains two data buffers: the active pattern space, and the
auxiliary hold space. Both are initially empty.
'sed' operates by performing the following cycle on each line of
input: first, 'sed' reads one line from the input stream, removes any
trailing newline, and places it in the pattern space. Then commands
are executed; each command can have an address associated to it:
addresses are a kind of condition code, and a command is only executed
if the condition is verified before the command is to be executed.
When the end of the script is reached, unless the '-n' option is in
use, the contents of pattern space are printed out to the output
stream, adding back the trailing newline if it was removed.(1) Then the
next cycle starts for the next input line.
Unless special commands (like 'D') are used, the pattern space is
deleted between two cycles. The hold space, on the other hand, keeps
its data between cycles (see commands 'h', 'H', 'x', 'g', 'G' to move
data between both buffers).

Can I use the sed command to replace multiple empty line with one empty line?

I know there is a similar question on SO, "How can I replace multiple empty lines with a single empty line in bash?", but my question is: can this be done using only the sed command?
Thanks

Give this a try:
sed '/^$/N;/^\n$/D' inputfile
Explanation:
/^$/N - match an empty line and append the next input line to pattern space (separated by a newline).
; - command delimiter, allows multiple commands on one line, can be used instead of separating commands into multiple -e clauses for versions of sed that support it.
/^\n$/D - if the pattern space contains only a newline in addition to the one at the end of the pattern space, in other words a sequence of more than one newline, then delete the first newline (more generally, the beginning of pattern space up to and including the first included newline)
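To see the effect, here is a minimal sketch that runs the command on a made-up input with three consecutive empty lines between a and b. It assumes a sed that accepts ; as a command separator, as noted above.

```shell
# Hypothetical input: "a", three empty lines, "b".
squeezed=$(printf 'a\n\n\n\nb\n' | sed '/^$/N;/^\n$/D')

# The run of empty lines collapses to a single empty line.
printf '%s\n' "$squeezed"
```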

You can do this by removing empty lines first and then appending an empty line with the G command:
sed '/^$/d;G' text.txt
Edit2: the above command adds an empty line after every non-empty line; if this is not desired, you could do:
sed -n '1{/^$/p};{/./,/^$/p}'
Or, if you don't mind that all leading empty lines will be stripped, it may be written as:
sed -n '/./,/^$/p'
since the first expression just evaluates the first line, and prints it if it is blank.
Here: the -n option suppresses pattern space auto-printing, /./,/^$/ defines a range that starts at a line with at least one character and ends at the next empty line, and p prints the lines in that range.
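A minimal sketch of the range form on a made-up input with leading empty lines and a run of empty lines in the middle:

```shell
# Hypothetical input: two leading empty lines, "a", three empty lines, "b".
ranged=$(printf '\n\na\n\n\n\nb\n' | sed -n '/./,/^$/p')

# Leading empty lines are dropped, and each range keeps only the one
# empty line that closes it.
printf '%s\n' "$ranged"
```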
