Perl - Replace all but last occurrence - perl

I have a string which contains multiple occurrences of the string <br />. I want to replace all of them, except the last one, with the slash-less form <br>.
So, if I have a string:
A<br />B<br />C<br />D<br />.
I want to have the string:
A<br>B<br>C<br>D<br />.

You can use a lookahead assertion that requires at least one more <br /> later in the string: (?=.*<br />). Here is an example:
$ perl -pe's|<br />(?=.*<br />)|<br>|g'
A<br />B<br />C<br />D<br />
A<br>B<br>C<br>D<br />

Related

sed: cut a string within a pattern

I have many XHTML files whose contents are like:
<h:panelGroup rendered="#{not accessBean.isUserLoggedIn}">
<h:form>
<p:panel style="margin-top:10px">
<table style="margin:10px">
<tbody>
<tr>
<td align="center">#{i.m['Login']}</td>
<td align="center">
<h:inputText value="#{accessBean.login}" />
</td>
</tr>
<tr>
<td align="center">#{i.m['Password']}</td>
<td align="center">
<h:inputSecret value="#{accessBean.password}" />
</td>
</tr>
</tbody>
</table>
<p:commandButton ajax="false" value="#{i.m['Submit']}" action="#{accessBean.login}" />
</p:panel>
</h:form>
</h:panelGroup>
I want to replace every occurrence of #{i.m['any-string']} with any-string, i.e., cut the string out of the pattern.
I have created the following sed command
sed -e "s/#{i.m\['\(.*\)']}/\1/g"
And to run it recursively within a directory I could execute
find . -iname '*.xhtml' -type f -exec sed -i -e "s/#{i.m\['\(.*\)']}/\1/g" {} \;
Here, any-string can contain any human-readable character that HTML can display, i.e., letters, digits, other characters, etc. That's why I used the regex (.*).
But it does not seem to work correctly.
Here are some tests I made using echo:
$ echo "<td align=\"center\">#{i.m['Login']}</td>" | sed -e "s/#{i.m\['\(.*\)']}/\1/g"
Result:
<td align="center">Login</td>
OK
$ echo "<p:commandButton ajax=\"false\" value=\"#{i.m['Submit']}\" action=\"#{accessBean.login}\" />" | sed -e "s/#{i.m\['\(.*\)']}/\1/g"
Result:
<p:commandButton ajax="false" value="Submit" action="#{accessBean.login}" />
OK
$ echo "<p:commandButton ajax=\"false\" value=\"#{i.m['Submit']}\" action=\"#{accessBean.login}\" /> <td align=\"center\">#{i.m['Login']}</td>" | sed -e "s/#{i.m\['\(.*\)']}/\1/g"
Result:
<p:commandButton ajax="false" value="Submit']}" action="#{accessBean.login}" /> <td align="center">#{i.m['Login</td>
NOK
I'm using Ubuntu 18.04.
Per your request, and as noted in my comment and the comments of others, you should definitely use a proper XML parser like xmlstarlet for proper XHTML parsing. A simple regex does no validation of what it leaves behind.
That being said, for your example (only), to replace the text leaving Login, Password and Submit you could use the following regex:
sed "s/[#][{]i[.]m[[][']\([^']*\)['][]][}]/\1/g" <file
Whenever you have to match characters that are also regex metacharacters, it helps to make sure each one is treated as a literal character and not as part of the regex syntax. To do that, you can put the character in a character class, e.g. [...], where the characters between the brackets are matched literally. (If the first character in the character class is '^', the match is inverted, i.e. it matches everything except what is in the class.)
With that explanation, the regex should become clear. The regex uses the basic substitution form:
sed "s/find/replace/" file
The 'find' REGEX
[#] - match the pound sign
[{] - match the opening brace
i - match the 'i'
[.] - explicitly match the '.' character (instead of . any character)
m - match the 'm'
[[] - match the opening bracket
['] - match the single quote
\( - begin your capture group to capture text to reinsert as a back reference
[^']* - match zero-or-more characters that are not a single-quote
\) - end your capture group
['] - match the single-quote as the next character
[]] - match the closing bracket
[}] - match the closing brace.
The 'replace' REGEX
All characters captured as part of the find capture group (between the \(....\)), are available to use as a back reference in the replace portion of the substitution. You can have more than one capture group in the find portion, which you reference in the replace part of the substitution as \1, \2, ... and so on. Here you have only a single capture group in the find portion, so whatever was matched can be used as the entire replacement, e.g.
\1 - to replace the whole mess with just the text that was captured with [^']*
Example Use/Output
For use with your example, it will properly leave Login, Password and Submit as indicated in your question, e.g.
sed "s/[#][{]i[.]m[[][']\([^']*\)['][]][}]/\1/g" file
<h:panelGroup rendered="#{not accessBean.isUserLoggedIn}">
<h:form>
<p:panel style="margin-top:10px">
<table style="margin:10px">
<tbody>
<tr>
<td align="center">Login</td>
<td align="center">
<h:inputText value="#{accessBean.login}" />
</td>
</tr>
<tr>
<td align="center">Password</td>
<td align="center">
<h:inputSecret value="#{accessBean.password}" />
</td>
</tr>
</tbody>
</table>
<p:commandButton ajax="false" value="Submit" action="#{accessBean.login}" />
</p:panel>
</h:form>
</h:panelGroup>
Again, as a disclaimer and just good common sense: don't parse X/HTML with a regex, use a proper tool like xmlstarlet. Don't parse JSON with a regex, use a proper tool for the job like jq -- you get the drift. (For this limited example the regex works well, but it is fragile: if anything in the input changes, it will break, which is why we have tools like xmlstarlet and jq.)
The problem here is that you do not take the greedy nature of regexps into account. You need to prevent .* from gobbling up extra ' characters by matching only non-quote characters instead:
sed -e "s/#{i.m\['\([^']*\)']}/\1/g"
This is also the reason why David C. Rankin's solution works. His regexp is unnecessarily complex, however.
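To make the difference concrete, here is a small sketch that runs both the greedy pattern from the question and the negated-class variant on the failing test line. The variable names are only illustrative:

```shell
# Read the failing test line from the question into a variable.
IFS= read -r line <<'EOF'
<p:commandButton ajax="false" value="#{i.m['Submit']}" action="#{accessBean.login}" /> <td align="center">#{i.m['Login']}</td>
EOF

# Greedy: .* runs up to the LAST ']} on the line, swallowing everything in between.
greedy=$(printf '%s\n' "$line" | sed -e "s/#{i.m\['\(.*\)']}/\1/g")

# Negated class: [^']* stops at the first single quote, so each placeholder
# is matched on its own.
fixed=$(printf '%s\n' "$line" | sed -e "s/#{i.m\['\([^']*\)']}/\1/g")

printf 'greedy: %s\nfixed:  %s\n' "$greedy" "$fixed"
```

The greedy version leaves a stray #{i.m[' in the output, while the negated-class version reduces both placeholders to Submit and Login.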

Sed expression to match this multiline code?

Assume the following code snippet:
<head>
<script>....</script>
<script>....</script>
</head>
<body>
<script>
some stuff
a change
more stuff
more changes
more stuff
}
}
}
}
final changes
</script>
</body>
I need to add something right before the last </script>, shown above as final changes. How can I tell sed to match that one? The final changes line doesn't exist yet; the last lines of the script are just four or five } lines, so I'd need to match across multiple lines.
All the other changes were made by matching a line and replacing it with the line plus the changes. But I don't know how to match across lines to replace </script></body> with final changes </script></body>.
I tried the same tactic I use for replacing with multiple lines, but it didn't work; it keeps reporting an unterminated substitute pattern.
sed 's|</script>\
</body>|lalalalala\
</script>\
</body>|' file.html
I've read the question "Sed regexp multiline - replace HTML", but it doesn't suit my particular case because it matches everything between the search patterns. I need to match something and then add something before the first search pattern.
sed, grep, awk etc. are NOT for XML/HTML processing.
Use a proper XML/HTML parser.
xmlstarlet is one of them.
Sample file.html:
<html>
<head>
<script>....</script>
<script>....</script>
</head>
<body>
<script>
var data = [0, 1, 2];
console.log(data);
</script>
</body>
</html>
The command:
xmlstarlet ed -O -P -u '//body/script' -v 'alert("success")' file.html
The output:
<html>
<head>
<script>....</script>
<script>....</script>
</head>
<body>
<script>alert("success")</script>
</body>
</html>
http://xmlstar.sourceforge.net/doc/UG/xmlstarlet-ug.html
Finally got this working by following xara's answer at https://unix.stackexchange.com/questions/26284/how-can-i-use-sed-to-replace-a-multi-line-string
In summary: instead of trying to do magic with sed, replace the newlines with a character that sed can match (like \r), do the replacement, and then turn that character back into newlines.
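The approach described above can be sketched as follows. This is only an illustration, assuming GNU sed (for the \r escape), an input that contains no carriage returns of its own, and a shortened made-up sample. The text final changes stands for whatever line should be inserted:

```shell
# Shortened, hypothetical sample of the end of the document.
input='<body>
<script>
}
}
</script>
</body>'

# 1. tr turns every newline into \r, so sed sees the input as one long line.
# 2. sed matches </script> directly followed by </body> and inserts a line before it.
# 3. tr turns the \r characters back into newlines.
result=$(printf '%s\n' "$input" \
    | tr '\n' '\r' \
    | sed 's|</script>\r</body>|final changes\r</script>\r</body>|' \
    | tr '\r' '\n')

printf '%s\n' "$result"
```

Because the whole file is handled as a single line, this is best kept to files of moderate size.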

How to sed stuff within pairs of quotes?

I want to change lines like:
<A HREF="classes_index_additions.html"class="hiddenlink">
to
<A HREF="classes_index_additions.html" class="hiddenlink">
(note the added ' ' before class) but it should leave lines like
<meta name="generator" content="JDiff v1.1.1">
alone. sed -e 's|\("[^"]*"\)\([^ />]\)|\1 \2|g' satisfies the first condition but it changes the other text to
<meta name="generator" content=" JDiff v1.1.1"/>
How do I get sed to process the correct pairs of double quotes?
You can try this:
sed -e 's/"\([^" ]*\)=/" \1=/g'
But with sed, it may be possible that the regular expression matches other parts of your document that you didn't intend, so best to try it and look over the results to see if there are any unintended side effects!
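As a quick check, the command can be run against both lines from the question. The third input is a made-up example of the kind of unintended match the warning refers to (an attribute value that itself contains =). The helper name is hypothetical:

```shell
# Hypothetical helper wrapping the substitution from this answer.
add_space() {
    sed -e 's/"\([^" ]*\)=/" \1=/g'
}

# The line that should be fixed and the line that should stay as it is.
fixed=$(printf '%s\n' '<A HREF="classes_index_additions.html"class="hiddenlink">' | add_space)
untouched=$(printf '%s\n' '<meta name="generator" content="JDiff v1.1.1">' | add_space)

# Made-up input showing a side effect: the value contains "=", so the
# opening quote followed by "page?id=" also matches the pattern.
side_effect=$(printf '%s\n' '<a href="page?id=1">' | add_space)

printf '%s\n' "$fixed" "$untouched" "$side_effect"
```

The first line gets its missing space and the meta line is left alone, but the third line ends up with a space inside the attribute value.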
You can try putting each attribute on a new line, then trimming trailing spaces on each line before removing the new lines again.
sed -r 's/(\w*="[^"]*")/\n\1/g; s/ *\n/\n/g; s/\n/ /g'
This works as follows:
s/(\w*="[^"]*")/\n\1/g
Puts every attribute on a new line, so your node looks like this:
<A
HREF="classes_index_additions.html"
class="hiddenlink">
After that you remove trailing spaces
s/ *\n/\n/g
And remove new lines
s/\n/ /g
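Put together, the three steps can be tried on both lines from the question. This is a sketch that assumes GNU sed (for -r, \w and \n), and the function name is made up:

```shell
# The three substitutions from above, wrapped in a hypothetical helper.
# Requires GNU sed for -r, \w and \n.
normalize_attrs() {
    sed -r 's/(\w*="[^"]*")/\n\1/g; s/ *\n/\n/g; s/\n/ /g'
}

fixed=$(printf '%s\n' '<A HREF="classes_index_additions.html"class="hiddenlink">' | normalize_attrs)
untouched=$(printf '%s\n' '<meta name="generator" content="JDiff v1.1.1">' | normalize_attrs)

printf '%s\n' "$fixed" "$untouched"
```

Since every attribute ends up preceded by exactly one space, the meta line comes out unchanged and the A line gains the missing space before class.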

Match pattern append line but exclude pattern line

I want to add
<br \>
at the end of each line in-between the tag
<div> ... </div>
source file
bla bala
<div>
bla bala
bla bala
bla bala
</div>
bla bala
I want output like this:
bla bala
<div>
bla bala <br \>
bla bala <br \>
bla bala <br \>
</div>
bla bala
I tried this but it also adds to the tag line
sed -i '' '/<pre\>/,/<\/pre\>/ s/$/<br \\>/' test.txt
also tried this
sed -i '' '/<pre\>/,/<\/pre\>/{/$/<br \\>/;}' test.txt
How can I exclude the line that has match pattern?
update: can you do this with sed?
Something like this makes it:
$ awk '/<\/div>/ {p=0} p{$0=$0"<br \\>"} /<div>/ {p=1} 1' file
bla bala
<div>
bla bala<br \>
bla bala<br \>
bla bala<br \>
</div>
bla bala
With sed:
sed '/<div>/,/<\/div>/s/\([^>]\)$/\1<br \/>/' test.html
It applies a substitute command to a range of lines, described by a beginning and an ending pattern separated by a comma:
/<div>/,/<\/div>/
The substitute command (simplified):
s/$/<br \/>/
... appends a <br /> tag at the end of each line.
Unfortunately the range includes the lines with the opening and closing <div> tags, and a range address alone cannot express "only the lines between the start and end pattern". That's why I've added \([^>]\), which requires the last character of the line to be something other than > and puts it back with \1, so that no <br /> is placed after the tags. This gives the final command:
s/\([^>]\)$/\1<br \/>/
Another solution that applies the substitution only to the lines between the <div> tags could look like this (maybe cleaner and more general):
sed '/<div>/,/<\/div>/ {/<div>/n; /<\/div>/ ! {s/$/<br \/>/}}' test.html
It selects the same range as the example above, including the opening and closing div tags, but then skips the opening <div> line using the n command and excludes the closing </div> line using the ! in front of the block in curly braces.
However, although I like to have fun with sed, I would not use regexes to manipulate HTML or XML documents in a real-world application. I would use XSLT for this.
This might work for you (GNU sed):
sed '/<div>/,/<\/div>/!b;//!s/$/ <br \\>/' file
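Sed has a feature whereby an empty regexp (//) reuses the last regexp that was used. Here that is the range address that was just tested, so //! skips the <div> and </div> lines themselves.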
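A short run on a shortened copy of the sample from the question shows the effect. This is a sketch assuming GNU sed, as stated in the answer:

```shell
# Shortened sample from the question.
input='bla bala
<div>
bla bala
bla bala
</div>
bla bala'

# Outside the <div>..</div> range: branch to the end of the script (!b).
# Inside the range: // repeats the range regexp that was just tested, so the
# <div> and </div> lines match it and are skipped by //!, while the other
# lines get " <br \>" appended.
result=$(printf '%s\n' "$input" | sed '/<div>/,/<\/div>/!b;//!s/$/ <br \\>/')

printf '%s\n' "$result"
```

Only the two lines between the tags receive the suffix. The tag lines and the text outside the block are printed unchanged.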

Replace word tag to entire file content

Assume that we have a content xml-file:
<field name="id" id="1" type="number" default="" />
Assume that we have a template file with a tag:
INCLUDE_XML
We need to replace the INCLUDE_XML tag with the entire content of the XML file. We can try:
tpl_content=$(<tpl.xml)
xml_content=$(<cnt.xml)
xml_content="$(echo "$tpl_content" | sed "s/INCLUDE_XML/"$xml_content"/g")"
echo "$xml_content" > out.xml
The problem is an "unterminated 's' command" error, because the XML file contains a lot of special characters (quotes, slashes, etc.). How can we do the replacement without having to care about the characters in the content XML file?
Just use sed's built-in facilities.
sed -e '/INCLUDE_XML/!b' -e 'r cnt.xml' -e d tpl.xml >out.xml
Translation: if the current input line doesn't match the regex, just continue. Otherwise, read in and print the other file, and delete the current line.
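Here is a self-contained sketch of that command. The file names follow the question, but the surrounding <fields> element in the template is an assumption made up for the demonstration:

```shell
# Work in a scratch directory so no real files are touched.
tmpdir=$(mktemp -d)

# Hypothetical template: the <fields> wrapper is invented for this demo.
printf '%s\n' '<fields>' 'INCLUDE_XML' '</fields>' > "$tmpdir/tpl.xml"
printf '%s\n' '<field name="id" id="1" type="number" default="" />' > "$tmpdir/cnt.xml"

# Lines without INCLUDE_XML are printed as-is (!b). On the matching line,
# "r" queues cnt.xml for output and "d" drops the placeholder line itself.
# (-e d is the same as the shorter -ed.)
(cd "$tmpdir" && sed -e '/INCLUDE_XML/!b' -e 'r cnt.xml' -e d tpl.xml > out.xml)

result=$(cat "$tmpdir/out.xml")
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

Because the file content is never passed through the s command, quotes and slashes in cnt.xml need no escaping. Note that the whole line containing INCLUDE_XML is replaced, not just the word.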