Sed replacement only in a specific place - sed

For example,
I'd like to replace the /test src path only within the <img> tag.
However, <p>test</p> should not be touched.
$ cat test.html
<img src="/test" width="18" alt="" /><br>
<p>test</p>
For now I can run something like:
sed -i 's|/test|/hoge|g' test.html
However, it changes the word globally.

sed '/<img/s|/test|/hoge|g' test.html would work for single-line <img> tags.
Sed allows the s///g replacement to be prefixed with another /PATTERN/ (an address) to restrict the replacement to lines matching PATTERN.
But you should really use an XML parser to be safe.
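A minimal sketch of the address-restricted substitution, assuming the sample file from the question. The third line of the sample, the scratch directory, and the variable name result are hypothetical additions for illustration: the extra line shows that a /test outside an <img> line is left alone.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the sample file; the <a> line is a hypothetical extra.
cat > test.html <<'EOF'
<img src="/test" width="18" alt="" /><br>
<p>test</p>
<a href="/test">link</a>
EOF

# The /<img/ address limits the s command to lines containing "<img".
result=$(sed '/<img/s|/test|/hoge|g' test.html)
printf '%s\n' "$result"
```

Only the <img> line should come out changed. Without the /<img/ prefix, the href on the last line would be rewritten as well.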

Another approach with sed:
sed -i 's|\(<img *src="/\)test|\1hoge|' test.html
<img *src="/ is captured and backreferenced using \1 in the substitution string.
The string that follows it (test) is replaced with hoge.


Sed ignoring potential characters at the end of the match group

I have the following text:
<h2 id="title"> ABC A BBBBB0 </h2>
<h2 id="title">ABC A BBBBB1 </h2>
<h2 id="title">ABC A BBBBB2</h2>
<h2 id="title"> ABC A BBBBB3 </h2>
and want to get the following out of it:
ABC A BBBBB0
ABC A BBBBB1
ABC A BBBBB2
ABC A BBBBB3
I am currently running the following command:
sed -n "s/.*\"title\">[[:space:]]*\(.*\)<.*/\1/p" ./file.txt
but get lines with spaces at the end:
ABC A BBBBB0[space][space][space][space]
ABC A BBBBB1[space]
ABC A BBBBB2
ABC A BBBBB3[space]
I can not understand the concept of ignoring possible spaces at the end in my case; at the beginning of the possible matches I understand how to do it. Can somebody give me a clear example of this?
The last character in the group must not be a space; after it there may be spaces.
sed -n 's/.*"title">[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*<.*/\1/p' ./file.txt
Regarding "I can not understand the concept":
.* matches everything up to the end of the whole line. The regex engine then needs to match <, so it goes back from right to left until it finds a <, and then continues matching from there.
You have to put something in the pattern so that, when the engine goes back from the end of the string, it stops at the place you want. "Not a space", for example. This process of "going back" is called "backtracking".
I can recommend https://www.regular-expressions.info/engine.html
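To see the backtracking behaviour side by side, here is a small sketch that runs the question's expression and the corrected one on a single line. The sample line (with four trailing spaces) and the variable names greedy and trimmed are assumptions for illustration.

```shell
# One input line with four spaces before the closing tag.
line='<h2 id="title"> ABC A BBBBB0    </h2>'

# Greedy \(.*\) only backs off as far as the last "<",
# so the trailing spaces stay inside the group.
greedy=$(printf '%s\n' "$line" |
  sed -n 's/.*"title">[[:space:]]*\(.*\)<.*/\1/p')

# Forcing the group to end in a non-space character leaves
# the trailing spaces outside the group.
trimmed=$(printf '%s\n' "$line" |
  sed -n 's/.*"title">[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*<.*/\1/p')

# Brackets make the trailing spaces visible.
printf '[%s]\n' "$greedy" "$trimmed"
```

The first bracketed value still carries the four spaces, the second does not.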
Using sed
$ sed 's/[^>]*>[[:space:]]*\?\([[:alnum:][:space:]]*\)[[:space:]]\?<.*/\1/' file
ABC A BBBBB0
ABC A BBBBB1
ABC A BBBBB2
ABC A BBBBB3
$ sed -E 's/[^>]*> *?([A-Z0-9 ]*) ?<.*/\1/' file
ABC A BBBBB0
ABC A BBBBB1
ABC A BBBBB2
ABC A BBBBB3
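When using sed's grouping and back-referencing, you can exclude any character, including spaces, by leaving it outside the grouping parentheses.
[^>]*> - Skip everything up to the next >. As this is not within the parentheses, it is excluded.
 *? - The leading spaces are excluded in the same way. Note that sed has no lazy quantifiers: the ? merely makes the already optional  * optional again, so a plain  * behaves the same.
([A-Z0-9 ]*) - Everything within the parentheses is included: capital letters, digits and spaces.
 ?<.* - Intended to exclude a single space before < if one is present. Because the group before it is greedy and also accepts spaces, it will normally take the trailing spaces itself; ending the group with a non-space character, e.g. ([A-Z0-9 ]*[A-Z0-9]) *<, keeps them out.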
I'd just use awk:
$ awk -F'> *| *<' '{print $3}' file
ABC A BBBBB0
ABC A BBBBB1
ABC A BBBBB2
ABC A BBBBB3
This might work for you (GNU sed):
sed -nE 's/<h2 id="title">\s*(.*\S)\s*<\/h2>/\1/p' file
Use pattern matching to return the required strings.
N.B. \s matches whitespace and \S is its complement (any non-whitespace character). Thus (.*\S) captures one or more words, ending at the last non-whitespace character.

How to insert text after a certain complex string in a file using SED

I want to add a new line to an HTML file using the sed command.
The line I want to add is
<link href="https://newvalue.css" rel="test1" id="test2">
After
<link href="test.css" rel="test1" id="test2">
in an HTML file.
Can anyone help?
Use sed with the a (append) command, like so:
sed -i '/<link href="test.css" rel="test1" id="test2">/a<link href="https://newvalue.css" rel="test1" id="test2">' file
Search for the line using /.../ and then use a (append), followed by the string to add. The one-line form of a used here is a GNU sed extension.
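A sketch of the same append in the portable a\ form, writing to a variable instead of editing in place so it is safe to try. The three-line sample file, the scratch directory and the variable name result are assumptions; the dot in test.css is escaped here so that it only matches a literal dot.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample file containing the line to search for.
cat > file <<'EOF'
<head>
<link href="test.css" rel="test1" id="test2">
</head>
EOF

# The address selects the existing line; "a\" appends the text
# on the following line after every line that matches.
result=$(sed '/<link href="test\.css" rel="test1" id="test2">/a\
<link href="https://newvalue.css" rel="test1" id="test2">' file)
printf '%s\n' "$result"
```

Once the output looks right, the same script can be used with -i as in the answer above.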

Sed expression to match this multiline code?

Assume the following code snippet:
<head>
<script>....</script>
<script>....</script>
</head>
<body>
<script>
some stuff
a change
more stuff
more changes
more stuff
}
}
}
}
final changes
</script>
</body>
I need to add something right before the last </script>, shown above as final changes. How can I tell sed to match that one? The final changes line doesn't exist yet; the last lines of the script are four or five }, so I'd need to match multiple lines.
All the other changes were made by matching a line and replacing it with the line plus the changes. But I don't know how to match across multiple lines to replace </script></body> with final changes </script></body>.
I tried to use the same tactic I use for replacing with multiple lines, but it didn't work; it keeps reporting an unterminated substitute pattern.
sed 's|</script>\
</body>|lalalalala\
</script>\
</body>|' file.html
I've read the question "Sed regexp multiline - replace HTML", but it doesn't suit my particular case because it matches everything between the search patterns. I need to match something and then add text before the first search pattern.
sed, grep, awk etc. are NOT for XML/HTML processing.
Use a proper XML/HTML parser.
xmlstarlet is one of them.
Sample file.html:
<html>
<head>
<script>....</script>
<script>....</script>
</head>
<body>
<script>
var data = [0, 1, 2];
console.log(data);
</script>
</body>
</html>
The command:
xmlstarlet ed -O -P -u '//body/script' -v 'alert("success")' file.html
The output:
<html>
<head>
<script>....</script>
<script>....</script>
</head>
<body>
<script>alert("success")</script>
</body>
</html>
http://xmlstar.sourceforge.net/doc/UG/xmlstarlet-ug.html
Finally got this working by following xara's answer at https://unix.stackexchange.com/questions/26284/how-can-i-use-sed-to-replace-a-multi-line-string
In summary, instead of trying to do magic with sed, replace the newlines with a character that sed can handle within a line (like \r), do the replacement, and then turn that character back into newlines.
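That summary could be sketched roughly like this, assuming the file contains no carriage returns of its own (no CRLF line endings). The shortened sample file, the scratch directory and the variable names cr and result are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Shortened stand-in for the page from the question.
cat > file.html <<'EOF'
<body>
<script>
}
}
</script>
</body>
EOF

# A literal carriage return, used as a stand-in for newlines.
cr=$(printf '\r')

# 1. newlines -> CR, so sed sees the whole file as one line
# 2. insert the new text before the closing </script></body> pair
# 3. CR -> newlines again
result=$(tr '\n' '\r' < file.html |
  sed "s|</script>${cr}</body>|final changes${cr}</script>${cr}</body>|" |
  tr '\r' '\n')
printf '%s\n' "$result"
```

Since the whole file becomes a single line, this suits small files better than large ones.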

How to sed stuff within pairs of quotes?

I want to change lines like:
<A HREF="classes_index_additions.html"class="hiddenlink">
to
<A HREF="classes_index_additions.html" class="hiddenlink">
(note the added ' ' before class) but it should leave lines like
<meta name="generator" content="JDiff v1.1.1">
alone. sed -e 's|\("[^"]*"\)\([^ />]\)|\1 \2|g' satisfies the first condition but it changes the other text to
<meta name="generator" content=" JDiff v1.1.1"/>
How do I get sed to process the correct pairs of double quotes?
You can try this:
sed -e 's/"\([^" ]*\)=/" \1=/g'
But with sed, it may be possible that the regular expression matches other parts of your document that you didn't intend, so best to try it and look over the results to see if there are any unintended side effects!
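A quick way to look over the results is to run the expression on both sample lines from the question. This is only a sketch; the function name fix_attrs and the variable names are made up for the example.

```shell
# Wrap the expression from the answer so it can be reused.
fix_attrs() {
  sed -e 's/"\([^" ]*\)=/" \1=/g'
}

# A closing quote directly followed by name= gets a space inserted.
a_line=$(printf '%s\n' '<A HREF="classes_index_additions.html"class="hiddenlink">' | fix_attrs)

# Attributes that are already separated by a space do not match.
meta_line=$(printf '%s\n' '<meta name="generator" content="JDiff v1.1.1">' | fix_attrs)

printf '%s\n' "$a_line" "$meta_line"
```

The first line gains the space before class; the meta line comes back unchanged.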
You can try putting each attribute on a new line, then trimming the trailing spaces on each line before removing the newlines.
sed -r 's/(\w*="[^"]*")/\n\1/g; s/ *\n/\n/g; s/\n/ /g'
This works as follows:
s/(\w*="[^"]*")/\n\1/g
Put every attribute on a new line so your node looks like this
<A
HREF="classes_index_additions.html"
class="hiddenlink">
After that you remove trailing spaces
s/ *\n/\n/g
And remove new lines
s/\n/ /g
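The three steps can be tried out like this. The sketch assumes GNU sed, since \w and \n in the replacement are GNU extensions, and uses -E, which GNU sed accepts as a synonym for -r. The variable names line, split and joined are only for illustration.

```shell
line='<A HREF="classes_index_additions.html"class="hiddenlink">'

# Step 1 only: every attribute is moved onto its own line.
split=$(printf '%s\n' "$line" | sed -E 's/(\w*="[^"]*")/\n\1/g')

# All three steps: split, trim spaces before each newline,
# then rejoin the pieces with a single space.
joined=$(printf '%s\n' "$line" |
  sed -E 's/(\w*="[^"]*")/\n\1/g; s/ *\n/\n/g; s/\n/ /g')

printf '%s\n---\n%s\n' "$split" "$joined"
```

Because every attribute ends up preceded by exactly one space, lines that were already spaced correctly come back unchanged.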

Replace a tag word with an entire file's content

Assume that we have a content xml-file:
<field name="id" id="1" type="number" default="" />
Assume that we have template file with tag:
INCLUDE_XML
We need to replace the INCLUDE_XML tag with the entire content of the xml-file. We can try:
tpl_content=$(<tpl.xml)
xml_content=$(<cnt.xml)
xml_content="$(echo "$tpl_content" | sed "s/INCLUDE_XML/"$xml_content"/g")"
echo "$xml_content" > out.xml
The problem is an unterminated 's' command, because the xml-file has a lot of special characters (quotes, slashes, etc.). How can we do the replacement without having to care about the characters in the content xml-file?
Just use sed's built-in facilities.
sed -e '/INCLUDE_XML/!b' -e 'r cnt.xml' -ed tpl.xml >out.xml
Translation: if the current input line doesn't match the regex, just continue. Otherwise, read in and print the other file, and delete the current line.
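A self-contained sketch of that command, assuming a template in which INCLUDE_XML sits on a line of its own. The <fields> wrapper, the scratch directory and the variable name result are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Content file from the question.
cat > cnt.xml <<'EOF'
<field name="id" id="1" type="number" default="" />
EOF

# Hypothetical template with the placeholder on its own line.
cat > tpl.xml <<'EOF'
<fields>
INCLUDE_XML
</fields>
EOF

# Non-matching lines branch to the end of the script and are printed.
# On the placeholder line, r queues cnt.xml for output and d drops the line.
sed -e '/INCLUDE_XML/!b' -e 'r cnt.xml' -e 'd' tpl.xml > out.xml
result=$(cat out.xml)
printf '%s\n' "$result"
```

The content file is never parsed as part of a sed expression, so its quotes and slashes need no escaping. The whole placeholder line is replaced, so this fits templates where the tag has a line to itself.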