I want to add
<br \>
at the end of each line in-between the tag
<div> ... </div>
source file
bla bala
<div>
bla bala
bla bala
bla bala
</div>
bla bala
I want the output to look like this:
bla bala
<div>
bla bala <br \>
bla bala <br \>
bla bala <br \>
</div>
bla bala
I tried this, but it also appends to the lines that contain the tags:
sed -i '' '/<pre\>/,/<\/pre\>/ s/$/<br \\>/' test.txt
also tried this
sed -i '' '/<pre\>/,/<\/pre\>/{/$/<br \\>/;}' test.txt
How can I exclude the lines that match the range patterns?
update: can you do this with sed?
Something like this makes it:
$ awk '/<\/div>/ {p=0} p{$0=$0"<br \\>"} /<div>/ {p=1} 1' file
bla bala
<div>
bla bala<br \>
bla bala<br \>
bla bala<br \>
</div>
bla bala
With sed:
sed '/<div>/,/<\/div>/s/[^>]$/&<br \/>/' test.html
It applies a substitute command to a range of lines described by a beginning and an ending pattern separated by a comma:
/<div>/,/<\/div>/
The substitute command (simplified):
s/$/<br \/>/
... appends a <br /> tag to the end of each line.
Unfortunately the pattern range includes the opening and closing <div> tags, and an address range alone cannot tell sed to use only the lines between the start and end pattern. That's why I've added [^>], so that no <br /> is placed after the tags; the & in the replacement puts back the last character that [^>] matched. This is the final command:
s/[^>]$/&<br \/>/
Another solution that applies the substitution only to the lines between the <div> tags could look like this (perhaps cleaner and more general):
sed '/<div>/,/<\/div>/ {/<div>/n; /<\/div>/ ! {s/$/<br \/>/}}' test.html
It selects the range including the opening and closing <div> tags and the lines between them, as in the example above, but then skips the opening <div> line using the n command and the closing </div> line using the ! before the block in curly braces.
However, although I like to have fun using sed, I would not use regexes to manipulate HTML or XML documents in a real-world application. I would use XSLT for this.
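The range-and-skip idea can be tried on the question's sample input. This is only a minimal sketch, not the answerer's exact command: it uses the b (branch) command instead of n and ! to leave the tag lines untouched, and the variable names are just for illustration.

```shell
# Sample input from the question (shortened).
input='bla bala
<div>
bla bala
bla bala
</div>
bla bala'

# Inside the <div>..</div> range, branch past the two tag lines
# and append <br /> to every other line.
result=$(printf '%s\n' "$input" | sed '/<div>/,/<\/div>/{
/<div>/b
/<\/div>/b
s/$/<br \/>/
}')

printf '%s\n' "$result"
```

Writing one command per line avoids relying on how a particular sed parses closing braces in a one-liner.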
This might work for you (GNU sed):
sed '/<div>/,/<\/div>/!b;//!s/$/ <br \\>/' file
Sed has a feature whereby an empty regexp takes the previous regexp value.
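To see the empty-regexp behaviour in isolation, here is a small sketch on made-up input (foo and bar are placeholders, not taken from the question): the empty pattern in s//X/ reuses /foo/, the regexp that was applied last.

```shell
# /foo/ selects the line; the empty regexp in s//X/ then stands for /foo/ again.
result=$(printf 'foo\nbar\n' | sed -n '/foo/{s//X/;p;}')

printf '%s\n' "$result"
```

Only the foo line is printed, with foo replaced by X.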
Related
I have the following text:
<h2 id="title"> ABC A BBBBB0 </h2>
<h2 id="title">ABC A BBBBB1 </h2>
<h2 id="title">ABC A BBBBB2</h2>
<h2 id="title"> ABC A BBBBB3 </h2>
and want to get of it the following:
ABC A BBBBB0
ABC A BBBBB1
ABC A BBBBB2
ABC A BBBBB3
I am currently running the next command:
sed -n "s/.*\"title\">[[:space:]]*\(.*\)<.*/\1/p" ./file.txt
but get lines with spaces at the end:
ABC A BBBBB0[space][space][space][space]
ABC A BBBBB1[space]
ABC A BBBBB2
ABC A BBBBB3[space]
I understand how to ignore possible spaces at the beginning of the match, but not how to ignore possible spaces at the end. Can somebody give me a clear example of this?
The last character in the group must not be a space; after the group there may be spaces.
sed -n 's/.*"title">[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*<.*/\1/p' ./file.txt
I can not understand the concept
.* first matches everything up to the end of the whole line. Then the regex engine sees that < must come next, so it goes back from right to left until it finds a <, and then continues matching from there.
You have to put something in the pattern so that, when the engine goes back from the end of the string, it stops at the place you want. "Not a space", for example. This process of going back is called "backtracking".
I can recommend https://www.regular-expressions.info/engine.html
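The difference can be checked on the first sample line. This is a small sketch; the variable names are only for illustration, and the brackets in the output make any trailing spaces visible.

```shell
line='<h2 id="title"> ABC A BBBBB0    </h2>'

# Greedy group: .* runs to the last <, so the trailing spaces are captured.
greedy=$(printf '%s\n' "$line" |
  sed -n 's/.*"title">[[:space:]]*\(.*\)<.*/\1/p')

# Group forced to end in a non-space character: trailing spaces fall outside it.
trimmed=$(printf '%s\n' "$line" |
  sed -n 's/.*"title">[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*<.*/\1/p')

printf '[%s]\n[%s]\n' "$greedy" "$trimmed"
```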
Using sed
$ sed 's/[^>]*>[[:space:]]*\([[:alnum:][:space:]]*[[:alnum:]]\)[[:space:]]*<.*/\1/' file
ABC A BBBBB0
ABC A BBBBB1
ABC A BBBBB2
ABC A BBBBB3
$ sed -E 's/[^>]*> *([A-Z0-9 ]*[A-Z0-9]) *<.*/\1/' file
ABC A BBBBB0
ABC A BBBBB1
ABC A BBBBB2
ABC A BBBBB3
When using sed's grouping and back-referencing, you can exclude any character, including spaces, by keeping it outside the grouping parentheses.
[^>]*> - Skip everything up to and including the next >; as this is outside the parentheses, it is excluded.
 * (a space followed by *) - Likewise any leading spaces, zero or more.
([A-Z0-9 ]*[A-Z0-9]) - Everything within the parentheses is kept: capitals, digits and spaces. The group must end in a non-space character, otherwise the greedy * would also capture the trailing spaces.
 *<.* - Exclude any spaces before <, along with the rest of the line.
I'd just use awk:
$ awk -F'> *| *<' '{print $3}' file
ABC A BBBBB0
ABC A BBBBB1
ABC A BBBBB2
ABC A BBBBB3
This might work for you (GNU sed):
sed -nE 's/<h2 id="title">\s*(.*\S)\s*<\/h2>/\1/p' file
Use pattern matching to return the required strings.
N.B. \s matches whitespace and \S is its complement. Thus (.*\S) captures everything up to the last non-whitespace character before the closing tag.
I have a string which contains multiple occurrences of the string <br />. I want to replace all of them, except the last one, with the version without the slash: <br>
So, if I have a string:
A<br />B<br />C<br />D<br />.
I want to have the string:
A<br>B<br>C<br>D<br />.
You can use a lookahead assertion, that requires the string to have at least one <br /> left: (?=.*<br />). Here is an example:
$ perl -pe's|<br />(?=.*<br />)|<br>|g'
A<br />B<br />C<br />D<br />
A<br>B<br>C<br>D<br />
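If lookahead is not available, a different technique gives the same result for this input. The following sed sketch is not the lookahead approach: it replaces every <br /> and then turns the last <br> back into <br />. It assumes the string contains no plain <br> of its own after the last <br />.

```shell
input='A<br />B<br />C<br />D<br />.'

# Step 1: replace every <br /> with <br>.
# Step 2: the greedy .* runs to the last <br>, which is restored to <br />.
result=$(printf '%s\n' "$input" |
  sed -e 's|<br />|<br>|g' -e 's|\(.*\)<br>|\1<br />|')

printf '%s\n' "$result"
```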
I'd like to replace the /test src path, but only within the <img> tag.
<p>test</p> should not be touched.
For example:
$ cat test.html
<img src="/test" width="18" alt="" /><br>
<p>test</p>
For now I can run something like:
sed -i 's|/test|/hoge|g' test.html
However, it changes the word globally.
sed '/<img/s|/test|/hoge|g' test.html would work as long as each <img tag sits on a single line.
Sed allows the s///g replacement to be prefixed with another /PATTERN/ to restrict the replacement to lines matching PATTERN.
But you should really use an xml parser to be safe.
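A quick check of the address prefix on the question's two sample lines. In this sketch the substitution deliberately searches for plain test (without the slash), so that the /<img/ address is the only thing protecting the <p> line; the variable names are just for illustration.

```shell
html='<img src="/test" width="18" alt="" /><br>
<p>test</p>'

# The /<img/ address limits s|test|hoge|g to lines containing "<img".
result=$(printf '%s\n' "$html" | sed '/<img/s|test|hoge|g')

printf '%s\n' "$result"
```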
Another approach with sed:
sed -i 's|\(<img *src="/\)test|\1hoge|' test.html
<img *src="/ is captured and back-referenced with \1 in the replacement string.
The string test that follows it is replaced with hoge.
I want to change lines like:
<A HREF="classes_index_additions.html"class="hiddenlink">
to
<A HREF="classes_index_additions.html" class="hiddenlink">
(note the added ' ' before class) but it should leave lines like
<meta name="generator" content="JDiff v1.1.1">
alone. sed -e 's|\("[^"]*"\)\([^ />]\)|\1 \2|g' satisfies the first condition but it changes the other text to
<meta name="generator" content=" JDiff v1.1.1"/>
How do I get sed to process the correct pairs of double quotes?
You can try this:
sed -e 's/"\([^" ]*\)=/" \1=/g'
But with sed, it may be possible that the regular expression matches other parts of your document that you didn't intend, so best to try it and look over the results to see if there are any unintended side effects!
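One way to look over the results is to run the expression on both sample lines from the question. This is a small sketch; fix_attrs is a hypothetical helper name used only here.

```shell
# Hypothetical wrapper around the suggested expression.
fix_attrs() {
  sed -e 's/"\([^" ]*\)=/" \1=/g'
}

a=$(printf '%s\n' '<A HREF="classes_index_additions.html"class="hiddenlink">' | fix_attrs)
m=$(printf '%s\n' '<meta name="generator" content="JDiff v1.1.1">' | fix_attrs)

printf '%s\n%s\n' "$a" "$m"
```

The first line gains the missing space before class, and the meta line comes out unchanged because every quote-to-= stretch in it already contains a space.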
You can try putting each attribute on a new line, then trimming the trailing spaces on each line before removing the newlines.
sed -r 's/(\w*="[^"]*")/\n\1/g; s/ *\n/\n/g; s/\n/ /g'
This works as follows:
s/(\w*="[^"]*")/\n\1/g
Put every attribute on a new line, so your node looks like this:
<A
HREF="classes_index_additions.html"
class="hiddenlink">
After that you remove trailing spaces
s/ *\n/\n/g
And remove new lines
s/\n/ /g
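The three steps can be run together on the sample lines. This sketch assumes GNU sed, since it relies on \w and on \n in the replacement text; it uses -E, which GNU sed accepts as an equivalent of -r, and split_join is a hypothetical helper name.

```shell
# Hypothetical wrapper: split attributes onto lines, trim, then rejoin.
split_join() {
  sed -E 's/(\w*="[^"]*")/\n\1/g; s/ *\n/\n/g; s/\n/ /g'
}

a=$(printf '%s\n' '<A HREF="classes_index_additions.html"class="hiddenlink">' | split_join)
m=$(printf '%s\n' '<meta name="generator" content="JDiff v1.1.1">' | split_join)

printf '%s\n%s\n' "$a" "$m"
```

Because the attribute pattern includes the whole quoted value, spaces inside a value such as "JDiff v1.1.1" are never touched.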
Assume that we have a content xml-file:
<field name="id" id="1" type="number" default="" />
Assume that we have template file with tag:
INCLUDE_XML
We need to replace the INCLUDE_XML tag with the entire content of the XML file. We can try:
tpl_content=$(<tpl.xml)
xml_content=$(<cnt.xml)
xml_content="$(echo "$tpl_content" | sed "s/INCLUDE_XML/"$xml_content"/g")"
echo "$xml_content" > out.xml
The problem is an "unterminated 's' command" error, because the XML file has a lot of special characters (quotes, slashes, etc.). How can we do the replacement without having to care about the characters in the content XML file?
Just use sed's built-in facilities.
sed -e '/INCLUDE_XML/!b' -e 'r cnt.xml' -e 'd' tpl.xml >out.xml
Translation: if the current input line doesn't match the regex, just continue. Otherwise, read in and print the other file, and delete the current line.
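A self-contained way to try this is sketched below. The <root> wrapper lines in the template are made up for illustration; the file names match the question, and the files are created in a temporary directory.

```shell
tmpdir=$(mktemp -d)

# Template with the placeholder on its own line (the <root> wrapper is hypothetical).
printf '%s\n' '<root>' 'INCLUDE_XML' '</root>' > "$tmpdir/tpl.xml"
# Content full of quotes and slashes that would break an s/// replacement.
printf '%s\n' '<field name="id" id="1" type="number" default="" />' > "$tmpdir/cnt.xml"

# On the placeholder line: queue cnt.xml for output, then delete the line itself.
(cd "$tmpdir" && sed -e '/INCLUDE_XML/!b' -e 'r cnt.xml' -e 'd' tpl.xml > out.xml)

out=$(cat "$tmpdir/out.xml")
rm -rf "$tmpdir"
printf '%s\n' "$out"
```

Since the content is read from the file by the r command and never passes through a sed expression, none of its characters need escaping.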