Xmlstarlet and sed to replace string in a file - sed

I have a huge number of HTML files. I need to replace every , and " with the HTML entities &nsbquo; and &quot; respectively.
I need two steps for this:
1) Find all the text between tags. I need to replace only in this text between tags.
2) Replace all required strings using sed
My command for this is :
xmlstarlet sel -t -v "*//p" "index.html" | sed 's/,/\&nsbquo/'
This works, but now I don't know how to write the changes back to index.html.
sed has the -i option, but that requires passing the filename to sed. In my case, I have to use | to filter the required text out of the HTML file first.
Please help. I have searched for two days with no luck.
Thank you,
Divya.

The main problem here is that in XML there is no difference between " and &quot;, so you can't use xmlstarlet to do this directly. You could replace " with a special placeholder string and then use sed to replace that with &quot;:
xmlstarlet ed -u "//p/text()" \
-x "str:replace(str:replace(., ',', '#NSBQUO#'), '\"', '#QUOT#')" \
quote.html | \
sed 's/#NSBQUO#/\&nsbquo\;/g; s/#QUOT#/\&quot\;/g' > quote-new.html
mv quote-new.html quote.html
NOTE: str:replace and the other EXSLT functions were only added to xmlstarlet ed in version 1.3.0, so they were not available at the time this question was asked.
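The sed half of that pipeline can be tried on its own, without xmlstarlet. The following is a minimal sketch, assuming a sample string (made up for illustration) that already contains the #NSBQUO# and #QUOT# placeholders. It shows why the & in the replacement is escaped: an unescaped & would insert the matched text instead of a literal ampersand.

```shell
# Hypothetical sample: text as it might look after the xmlstarlet step.
marked='<p>Hello#NSBQUO# #QUOT#world#QUOT#</p>'

# \& yields a literal ampersand; a bare & would re-insert the match.
restored=$(printf '%s\n' "$marked" |
  sed 's/#NSBQUO#/\&nsbquo;/g; s/#QUOT#/\&quot;/g')

printf '%s\n' "$restored"
```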

Related

Extracting the contents between two different strings using bash or perl

I have looked through the other posts on Stack Overflow about this but couldn't get my code to work, so I am posting a new question.
Below is the content of file temp.
<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/"><env:Body><dp:response xmlns:dp="http://www.datapower.com/schemas/management"><dp:timestamp>2015-01-
22T13:38:04Z</dp:timestamp><dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file><dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file></dp:response></env:Body></env:Envelope>
This file contains the base64-encoded contents of two files named test.txt and test1.txt. I want to extract the base64-encoded content of each file into separate files, test.txt and test1.txt respectively.
To achieve this, I have to remove the XML tags around the base64 contents. I am trying the commands below, but they do not work as expected.
sed -n '/test.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test.txt">##g'|perl -p -e 's#</dp:file>##g' > test.txt
sed -n '/test1.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test1.txt">##g'|perl -p -e 's#</dp:file></dp:response></env:Body></env:Envelope>##g' > test1.txt
Below command:
sed -n '/test.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test.txt">##g'|perl -p -e 's#</dp:file>##g'
produces output:
XJzLXJlc3VsdHMtYWN0aW9uX18i
<dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:response> </env:Body></env:Envelope>
However, in the output I am expecting only the first line, XJzLXJlc3VsdHMtYWN0aW9uX18i. Where am I making a mistake?
When I run the command below, I get the expected output:
sed -n '/test1.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test1.txt">##g'|perl -p -e 's#</dp:file></dp:response></env:Body></env:Envelope>##g'
It produces below string
lc3VsdHMtYWN0aW9uX18i
I can then easily route this to test1.txt file.
UPDATE
I have edited the question by updating the source file content. The source file doesn't contain any newline character. The current solution does not work in that case; I have tried it and it failed. wc -l temp must output 1.
OS: solaris 10
Shell: bash
sed -n 's_<dp:file name="\([^"]*\)">\([^<]*\).*_\1 -> \2_p' temp
I added \1 -> to show which file name each content belongs to; to print the content only, just remove that part.
This is the POSIX version, so with GNU sed use --posix.
It assumes that the base64-encoded content is on the same line as the surrounding tag (content spread over several lines would need some modification).
Thanks to JID for the full explanation below.
How it works
sed -n
The -n means no printing so unless explicitly told to print, then there will be no output from sed
's_
This is to substitute the following regex using _ to separate regex from the replacement.
<dp:file name=
Regular text
"\([^"]*\)"
The brackets are a capture group and must be escaped unless the -r option is used (-r is not available in POSIX sed). Everything inside the brackets is captured. [^"]* means 0 or more occurrences of any character that is not a quote. So really this just captures anything between the two quotes.
>\([^<]*\)
Again uses a capture group, this time to capture everything between the > and the next <
.*
Everything else on the line
_\1 -> \2
This is the replacement, so everything the regex matched is replaced with the first capture group, then a ->, and then the second capture group.
_p
Means print the line
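The walkthrough above assumes each dp:file element sits on its own line. For the single-line file described in the update, one possible variation is to put every tag on its own line first and then apply the same kind of substitution. This is only a sketch: the sample string is abbreviated from the question, and it assumes a tr that understands '\n' and base64 content that never contains <.

```shell
# Abbreviated single-line sample based on the question (no newline inside).
sample='<dp:response><dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file><dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file></dp:response>'

# Turn every < into a newline so each tag starts a line, then keep only
# the lines that begin with a dp:file tag and print "name -> content".
pairs=$(printf '%s\n' "$sample" | tr '<' '\n' |
  sed -n 's_^dp:file name="\([^"]*\)">\(.*\)_\1 -> \2_p')

printf '%s\n' "$pairs"
```

Because the opening < has been consumed by tr, the pattern starts at dp:file, and the closing tag ends up on the following line, so \2 holds only the content.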
Resources
http://unixhelp.ed.ac.uk/CGI/man-cgi?sed
http://www.grymoire.com/Unix/Sed.html
/usr/xpg4/bin/sed works well here.
/usr/bin/sed does not work as expected when the file contains just one line.
The command below works for a file containing only a single line:
/usr/xpg4/bin/sed -n 's_<env:Envelope\(.*\)<dp:file name="temporary://BackUpDir/backupmanifest.xml">\([^>]*\)</dp:file>\(.*\)_\2_p' securebackup.xml 2>/dev/null
Without 2>/dev/null, this sed command prints the warning sed: Missing newline at end of file.
The reason is as follows:
The default Solaris sed ignores a last line that is not terminated by a newline, so as not to break existing scripts, because the original Unix implementation required every line to end with a newline.
GNU sed is more relaxed, and the POSIX implementation accepts such a line but prints a warning.

How to extract URL from html source with sed/awk or cut?

I am writing a script that will download an html page source as a file and then read the file and extract a specific URL that is located after a specific code. (it only has 1 occurrence)
Here is a sample that I need matched:
<img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg"
The code preceding the URL will always be the same so I need to extract the part between:
<img id="sample-image" class="photo" src="
and the " after the URL.
I tried something with sed like this:
sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt
But it does not work. I would appreciate your suggestions, thanks a lot !
You can use grep like this :
grep -oP '<img\s+id="sample-image"\s+class="photo"\s+src="\K[^"]+' test.txt
or with sed :
sed -rn 's/.*<img\s+id="sample-image"\s+class="photo"\s+src="([^"]+)".*/\1/p' test.txt
or with awk :
awk -F'"' '/<img\s+id="sample-image"/{print $6}' test.txt
If you have GNU grep then you can do something like:
grep -oP "(?<=src=\")[^\"]+(?=\")" test.txt
If you wish to use awk then the following would work:
awk -F\" '{print $(NF-1)}' test.txt
With sed as
echo "$string" | sed 's/.*<img.*src="\([^"]*\)".*/\1/'
A few things about the sed command you are using:
sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt
You don't need to escape the <, " or space. The single quotes prevents the shell from doing word splitting and other stuff on your sed expression.
You are essentially doing this: sed -n '/pattern/p' test.txt (except you seem to be missing the opening slash), which says "match this pattern, then print the line which contains the match". You are not really extracting the URL.
This is minor, but you don't need to match class="photo", since the id already makes the HTML element unique (no two elements share the same id within the same HTML document).
Here's what I would do
sed -n 's/.*<img id="sample-image".*src="\([^"]*\)".*/\1/p' test.txt
The p flag tells sed to print the line where substitution (s) was performed.
\(pattern\) captures a subexpression which can be accessed via \1, \2, etc. on the right side of s///
The .* at the start of regex is in case there is something else preceding the <img> element on the line (you did mention you are parsing a HTML file)
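As a quick check of the points above, the substitution can be run against a single sample line. This is a sketch only: the surrounding div and the alt attribute are made up for illustration, and [^>]* is used instead of .* between the id and src so the match stays inside the one tag.

```shell
# Hypothetical line from the downloaded page; only the img tag is from the question.
line='<div><img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg" alt=""></div>'

# -n plus the p flag prints only lines where the substitution happened.
url=$(printf '%s\n' "$line" |
  sed -n 's/.*<img id="sample-image"[^>]*src="\([^"]*\)".*/\1/p')

printf '%s\n' "$url"
```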

How to insert strings containing slashes with sed? [duplicate]

I have a Visual Studio project, which is developed locally. Code files have to be deployed to a remote server. The only problem is the URLs they contain, which are hard-coded.
The project contains URLs such as ?page=one. For the link to be valid on the server, it must be /page/one .
I've decided to replace all URLs in my code files with sed before deployment, but I'm stuck on slashes.
I know this is not a pretty solution, but it's simple and would save me a lot of time. The total number of strings I have to replace is fewer than 10. The total number of files to check is about 30.
An example describing my situation is below:
The command I'm using:
sed -f replace.txt < a.txt > b.txt
replace.txt which contains all the strings:
s/?page=one&/pageone/g
s/?page=two&/pagetwo/g
s/?page=three&/pagethree/g
a.txt:
?page=one&
?page=two&
?page=three&
Content of b.txt after I run my sed command:
pageone
pagetwo
pagethree
What I want b.txt to contain:
/page/one
/page/two
/page/three
The easiest way would be to use a different delimiter in your search/replace lines, e.g.:
s:?page=one&:/page/one:g
You can use any character as a delimiter that's not part of either string. Or, you could escape it with a backslash:
s/\//foo/
Which would replace / with foo. You'd want to use backslash escaping in cases where you don't know what characters might occur in the replacement strings (if they are shell variables, for example).
The s command can use any character as a delimiter; whatever character comes after the s is used. I was brought up to use a #. Like so:
s#?page=one&#/page/one#g
A very useful but lesser-known fact about sed is that the familiar s/foo/bar/ command can use any punctuation, not only slashes. A common alternative is s#foo#bar#, from which it becomes obvious how to solve your problem.
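Applied to the question's replace.txt, the alternative delimiter might look like the sketch below. The script is held in a shell variable here only to keep the example self-contained; in practice the same three lines would go into replace.txt and be used with sed -f.

```shell
# The three rules from replace.txt, rewritten with # as the delimiter so
# the slashes in the replacement need no escaping.
script='s#?page=one&#/page/one#g
s#?page=two&#/page/two#g
s#?page=three&#/page/three#g'

# Same input as a.txt in the question.
result=$(printf '%s\n' '?page=one&' '?page=two&' '?page=three&' | sed -e "$script")

printf '%s\n' "$result"
```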
Add \ before special characters such as the delimiter:
s/?page=one&/\/page\/one/g
etc.
In a system I am developing, the string to be replaced by sed is input text from a user which is stored in a variable and passed to sed.
As noted earlier in this post, if the string placed in the sed command block contains the delimiter used by sed, then sed terminates with a syntax error. Consider the following example:
This works:
$ VALUE=12345
$ echo "MyVar=%DEF_VALUE%" | sed -e s/%DEF_VALUE%/${VALUE}/g
MyVar=12345
This breaks:
$ VALUE=12345/6
$ echo "MyVar=%DEF_VALUE%" | sed -e s/%DEF_VALUE%/${VALUE}/g
sed: -e expression #1, char 21: unknown option to `s'
Replacing the default delimiter is not a robust solution in my case as I did not want to limit the user from entering specific characters used by sed as the delimiter (e.g. "/").
However, escaping any occurrences of the delimiter in the input string would solve the problem.
Consider the below solution of systematically escaping the delimiter character in the input string before having it parsed by sed.
Such escaping can be implemented as a replacement using sed itself. This replacement is safe even if the input string contains the delimiter, because the input string is not part of the sed command block:
$ VALUE=$(echo ${VALUE} | sed -e "s#/#\\\/#g")
$ echo "MyVar=%DEF_VALUE%" | sed -e s/%DEF_VALUE%/${VALUE}/g
MyVar=12345/6
I have converted this to a function to be used by various scripts:
escapeForwardSlashes() {
    # Validate parameters
    if [ -z "$1" ]
    then
        echo "Error - no parameter specified!"
        return 1
    fi
    # Perform replacement: prefix every / with a backslash
    printf '%s\n' "$1" | sed -e 's#/#\\/#g'
    return 0
}
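The / delimiter is not the only character with a special meaning in the replacement part of s: & and \ are special too. A possible extension of the function above, sketched here under the assumption that the value is a single line, escapes all three. The name escape_sed_replacement is made up for this example.

```shell
# Hypothetical helper: escape /, & and \ so the value can be used as the
# replacement in s/.../.../ (single-line values only).
escape_sed_replacement() {
    printf '%s\n' "$1" | sed -e 's|[/&\\]|\\&|g'
}

VALUE='12345/6&7'
ESCAPED=$(escape_sed_replacement "$VALUE")

# The escaped value is now safe to splice into the sed expression.
result=$(printf '%s\n' 'MyVar=%DEF_VALUE%' | sed -e "s/%DEF_VALUE%/${ESCAPED}/g")

printf '%s\n' "$result"
```

The helper itself uses | as its delimiter so that / can appear unescaped inside the bracket expression.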
This line should work for your three examples:
sed -r 's#\?(page)=([^&]*)&#/\1/\2#g' a.txt
I used -r to save some escaping.
The line is generic for your one, two and three cases, so you don't have to write the substitution three times.
Test with your example (a.txt):
kent$ echo "?page=one&
?page=two&
?page=three&"|sed -r 's#\?(page)=([^&]*)&#/\1/\2#g'
/page/one
/page/two
/page/three
replace.txt should be
s/?page=/\/page\//g
s/&//g
please see this article
http://netjunky.net/sed-replace-path-with-slash-separators/
Just using | instead of /
Great answer from Anonymous. \ solved my problem when I tried to escape quotes in HTML strings.
So if you use sed to return some HTML templates (on a server), use double backslash instead of single:
var htmlTemplate = "<div style=\\"color:green;\\"></div>";
A simpler alternative is to use AWK, as in this answer:
awk '$0="prefix"$0' file > new_file
You may use an alternative regex delimiter in a search pattern by backslashing it:
sed '\,{some_path},d'
For the s command:
sed 's,{some_path},{other_path},'
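With concrete values in place of the {some_path} placeholders, the two forms might be used as in the sketch below. The paths are made up for illustration.

```shell
# Hypothetical input: two paths, one per line.
paths='/var/tmp/a
/home/user/b'

# Address with a custom delimiter: \, starts a regex delimited by commas.
kept=$(printf '%s\n' "$paths" | sed '\,^/var/tmp/,d')

# s command with a comma delimiter: no backslash needed before the s.
moved=$(printf '%s\n' "$kept" | sed 's,^/home/user,/srv/data,')

printf '%s\n' "$kept" "$moved"
```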

Replacing the last word of a path using sed

I have the following: param="/var/tmp/test"
I need to replace the word test with another word such as new_test.
I need a smart way to replace the last word after "/" with sed.
echo 'param="/var/tmp/test"' | sed 's/\/[^\/]*"/\/REPLACEMENT"/'
param="/var/tmp/REPLACEMENT"
echo '/var/tmp/test' | sed 's/\/[^\/]*$/\/REPLACEMENT/'
/var/tmp/REPLACEMENT
Extracting bits and pieces with sed is a bit messy (as Jim Lewis says, use basename and dirname if you can) but at least you don't need a plethora of backslashes to do it if you are going the sed route since you can use the fact that the delimiter character is selectable (I like to use ! when / is too awkward, but it's arbitrary):
$ echo 'param="/var/tmp/test"' | sed ' s!/[^/"]*"!/new_test"! '
param="/var/tmp/new_test"
We can also extract just the part that was substituted, though this is easier with two substitutions in the sed control script:
$ echo 'param="/var/tmp/test"' | sed ' s!.*/!! ; s/"$// '
test
You don't need sed for this...basename and dirname are a better choice for assembling or disassembling pathnames. All those escape characters give me a headache....
param="/var/tmp/test"
param_repl=$(dirname "$param")/newtest
It's not clear whether param is part of the string that you need processed or it's the variable that holds the string. Assuming the latter, you can do this using only Bash (you don't say which shell you're using):
shopt -s extglob
param="/var/tmp/test"
param="${param/%\/*([^\/])//new_test}"
If param= is part of the string:
shopt -s extglob
string='param="/var/tmp/test"'
string="${string/%\/*([^\/])\"//new}"
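If extglob is not available, plain POSIX parameter expansion may be enough for this case. The sketch below uses ${var%/*}, which strips the shortest suffix that starts with a slash, and then appends the new last component. It assumes the path contains at least one / and does not end with one.

```shell
# Case 1: the variable holds the path itself.
param="/var/tmp/test"
new_param="${param%/*}/new_test"

# Case 2: param= and the quotes are part of the string; the closing
# quote is removed together with the last word, so it is added back.
string='param="/var/tmp/test"'
new_string="${string%/*}/new_test\""

printf '%s\n' "$new_param" "$new_string"
```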
This might work for you:
echo 'param="/var/tmp/test"' | sed -r 's#(/(([^/]*/)*))[^"]*#\1newtest#'
param="/var/tmp/newtest"

How do I push `sed` matches to the shell call in the replacement pattern?

I need to replace several URLs in a text file with some content dependent on the URL itself. Let's say for simplicity it's the first line of the document at the URL.
What I'm trying is this:
sed "s/^URL=\(.*\)/TITLE=$(curl -s \1 | head -n 1)/" file.txt
This doesn't work, since \1 is not set. However, the shell is getting called. Can I somehow push the sed match variables to that subprocess?
The accepted answer is just plain wrong. Proof:
Make an executable script foo.sh:
#! /bin/bash
echo $* 1>&2
Now run it:
$ echo foo | sed -e "s/\\(foo\\)/$(./foo.sh \\1)/"
\1
$
The $(...) is expanded before sed is run.
So you are trying to call an external command from inside the replacement pattern of a sed substitution. I don't think it can be done; the $... inside a pattern just allows you to use an already existing (constant) shell variable.
I'd go with Perl; see the /e option of the search-and-replace operator (s/.../.../e).
UPDATE: I was wrong, sed plays nicely with the shell, and it allows you to do that. But then the backslash in \1 should be escaped. Try instead:
sed "s/^URL=\(.*\)/TITLE=$(curl -s \\1 | head -n 1)/" file.txt
Try this:
sed "s/^URL=\(.*\)/\1/" file.txt | while read url; do sed "s#URL=\($url\)#TITLE=$(curl -s $url | head -n 1)#" file.txt; done
If there are duplicate URLs in the original file, then there will be n^2 of them in the output. The # as a delimiter depends on the URLs not including that character.
Late reply, but to make sure people don't get thrown off by the answers here: this can be done in GNU sed using the e flag of the s command. The following, for example, decrements a number at the beginning of a line:
echo "444 foo" | sed "s/\([0-9]*\)\(.*\)/expr \1 - 1 | tr -d '\n'; echo \"\2\";/e"
will produce:
443 foo
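Where GNU sed is not available, a plain shell loop can do the same job for the original URL question, because the command is then run once per line with the actual URL as its argument. This is a sketch under stated assumptions: fetch_title is a made-up stand-in so the example runs offline, and in real use its body would be something like curl -s "$1" | head -n 1.

```shell
# Hypothetical stand-in for: curl -s "$1" | head -n 1
fetch_title() {
    printf 'Title of %s\n' "$1"
}

# Read stdin line by line; rewrite URL= lines, pass everything else through.
rewrite_urls() {
    while IFS= read -r line; do
        case $line in
            URL=*) printf 'TITLE=%s\n' "$(fetch_title "${line#URL=}")" ;;
            *)     printf '%s\n' "$line" ;;
        esac
    done
}

out=$(printf '%s\n' 'URL=http://example.com/a' 'other text' | rewrite_urls)
printf '%s\n' "$out"
```

Unlike the sed attempts, nothing from the file is spliced into a sed expression here, so slashes or # characters in the URLs need no escaping.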