Sed replace in html file

How can I append 'index.html' to all links in an HTML file that do not already end with it?
For example, href="http://mysite/" would become href="http://mysite/index.html".

I am not a sed expert, but think this works:
sed -e "s_\"\(http://[^\"]*\)/index.html\"_\"\1\"_g" \
-e "s_\"\(http://[^\"]*[^/]\)/*\"_\"\1/index.html\"_g"
The first replacement finds URLs already ending in /index.html and strips that ending.
The second replacement then adds /index.html to every URL, whether or not it ends in /.
More than one version of sed exists; I'm using the one that ships with Xcode on OS X.
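To see the two passes working together, here is a small sketch that wraps them in a shell function and runs them on a made-up line of HTML. The function name add_index and the sample links are illustrative only, not part of the answer. Note that this pattern also appends /index.html to links that point at files such as .css or .png, so treat it as a starting point.

```shell
# Hypothetical helper: apply both substitutions to stdin.
# Pass 1 strips an existing /index.html; pass 2 adds it back everywhere.
add_index() {
  sed -e 's_"\(http://[^"]*\)/index\.html"_"\1"_g' \
      -e 's_"\(http://[^"]*[^/]\)/*"_"\1/index.html"_g'
}

# Made-up sample input with three kinds of link on one line.
printf '%s\n' '<a href="http://mysite/">a</a> <a href="http://mysite/docs/index.html">b</a> <a href="http://mysite/blog">c</a>' | add_index
```

All three links should come out ending in exactly one /index.html.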

For an href ending with /:
sed 's|\(href="http://[^"]*/\)"|\1index.html"|g' YourFile
If there are folder references without a trailing /, you need to specify what counts as a file (for example, a last path component containing a dot).

What about this:
echo 'href="http://mysite/"' | awk '/http/ {sub(/\/\"/,"/index.html\"")}1'
href="http://mysite/index.html"
echo 'href="http://www.google.com/"' | awk '/http/ {sub(/\/\"/,"/index.html\"")}1'
href="http://www.google.com/index.html"

In general this is an almost unsolvable problem. If your html is "reasonably well behaved", the following expression searches for things that "look a lot like a URL"; you can see it at work at http://regex101.com/r/bZ9mR8 (this shows the search and replace for several examples; it should work for most others)
((?:(?:https?|ftp):\/{2})(?:(?:[0-9a-z_#-]+\.)+(?:[0-9a-z]){2,4})?(?:(?:\/(?:[~0-9a-z\#\+\%\#\.\/_-]+))?\/)*(?=\s|\"))(\/)?(index\.html?)?
The result of the above match should be replaced with
\1index.html
Unfortunately this requires regex wizardry that is well beyond the rather pedestrian capabilities of sed, so you will have to unleash the power of perl, as follows:
perl -p -e 's/((?:(?:https?|ftp):\/{2})(?:(?:[0-9a-z_#-]+\.)+(?:[0-9a-z]){2,4})?(?:(?:\/(?:[~0-9a-z\#\+\%\#\.\/_-]+))?\/)*(?=\s|\"))(\/)?(index\.html?)?/$1\/index.html/gi'
It looks a bit daunting, I know. But it works. The only problem: if a link already ends in /, the result will contain //index.html. You could easily take the output of the above and process it with
sed 's/\/\/index.html/\/index.html/g'
to replace a double slash before index.html with a single slash.
Some examples (several more given in the link above)
http://www.index.com/ add /index.html
http://ex.com/a/b/" add /index.html
http://www.example.com add /index.html
http://www.example.com/something do nothing
http://www.example.com/something/ add /index.html
http://www.example.com/something/index.html do nothing

Related

replace a string from first line on multiple files

I have 10,000 text files that I need to change.
The first line of every file contains a URL.
By mistake, the URL in a few files is missing 'com'.
eg:
1) http://www.supersonic./psychology
2) http://www.supersonic./social
3) http://www.supersonic.com/science
my task is to check and add 'com' if it is missing
eg:
1) http://www.supersonic.com/psychology
2) http://www.supersonic.com/social
3) http://www.supersonic.com/science
all urls are of same domain(supersonic.com)
can you suggest me any fast and easy approach ?
Tried this : replacing supersonic./ with supersonic.com
sed -e '1s/supersonic.//supersonic.com/' *
no change in the output.

Use -i to change the files instead of just outputting the changed lines.
Use a different delimiter than / if you want to use / in the regex (or use \/ in the regex).
Use \. to match a dot literally, . matches anything.
sed -i~ -e '1s=supersonic\./=supersonic.com/=' *
Some versions of sed don't support -i.
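As a quick illustration of the three points above, here is a sketch that runs the same substitution on stdin instead of editing files, so nothing on disk is touched. The function name fix_first_line and the second sample line are made up for the demonstration.

```shell
# Hypothetical helper: fix the URL on line 1 only, reading from stdin.
# '=' is the delimiter, so '/' needs no escaping; '\.' matches a literal dot.
fix_first_line() {
  sed -e '1s=supersonic\./=supersonic.com/='
}

# Line 1 is repaired; line 2 contains the same text but is left alone.
printf '%s\n' 'http://www.supersonic./psychology' 'see supersonic./ notes' | fix_first_line
```

A first line that already reads supersonic.com/ is not matched by supersonic\./ and so passes through unchanged.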

You are very close with your code, but you need to account for the trailing / char after the . char.
Assuming you are using a modern sed with the -i (inplace-edit) option you can do
sed -i '1s#supersonic\./#supersonic.com/#' *
Note that rather than escaping each / inside s/srchpat\/withSlash/replaceStr/, you can use another character after the s command as the delimiter; here I use s#...#...#. If your search pattern contained a # character, you would have to pick a different delimiter.
A backslash before the alternate delimiter is needed only when it delimits an address pattern rather than an s command, so
sed '\#srchStr#s#srchStr#ReplStr#' file
for those cases.
If you're using a sed that doesn't support the -i options, then
you'll need to loop on your file, and manage the tmp files, i.e.
for f in *.html ; do
sed '1s#supersonic\./#supersonic.com/#' "$f" > /tmp/"$f".fix \
&& /bin/mv /tmp/"$f".fix "$f"
done
Warning
But as you're talking about 10,000+files, you'll want to do some testing before using either of these solutions. Copy a good random set of those files to /tmp/mySedTest/ dir and run one of these solutions there to make sure there are no surprises.
And you're likely to exceed the maximum command-line length (ARG_MAX) with 10,000+ files, so read about find and xargs. There are many posts here about [sed] find xargs. Check them out if needed.
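One way to combine find with the in-place edit is sketched below. It assumes a sed that supports -i with an attached backup suffix (GNU and BSD sed both do, but -i is not in POSIX), and the function name fix_first_lines is made up. With -i, sed restarts line numbering for each file, so the 1 address applies to every file's first line.

```shell
# Hypothetical helper: fix line 1 of every regular file under a directory.
# find hands the file names to sed in batches, so the shell never has to
# expand 10,000 arguments at once. Originals are kept with a .bak suffix,
# and existing .bak files are skipped.
fix_first_lines() {
  find "$1" -type f ! -name '*.bak' \
    -exec sed -i.bak -e '1s#supersonic\./#supersonic.com/#' {} +
}

# Example (run on a test copy first): fix_first_lines /tmp/mySedTest
```

Once the results look right, the .bak files can be deleted.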
IHTH

Manipulate characters with sed

I have a list of usernames and I would like to add possible combinations to it.
Example: let's say this is the list I have
johna
maryb
charlesc
Is there a way to use sed to edit it so that it looks like
ajohn
bmary
ccharles
And also
john_a
mary_b
charles_c
etc...
Can anyone help me with the commands to do so? Any explanation would be great as well; I would like to understand how it works if possible. I usually get confused when I see things like 's/\.(.*.... without knowing what they mean. Anyway, thanks in advance.
EDIT: I changed the usernames.

sed 's/\(user\)\(.\)/\2\1/'
Breakdown:
sed 's/string/replacement/' replaces the first instance of string on each line with replacement (add the g flag to replace every instance).
Then, string in that sed expression is \(user\)\(.\). This can be broken down into two
parts: \(user\) and \(.\). Each of these is a capture group - bracketed by \( \). That means that once we've matched something with them, we can reuse it in the replacement string.
\(user\) matches, surprisingly enough, the user part of the string. \(.\) matches any single character - that's what the . means. Then, you have two captured groups - user and a (or b or c).
The replacement part just uses these to recreate the pattern a little differently. \2\1 says "print the second capture group, then the first capture group". Which in this case, will print out auser - since we matched user and a with each group.
ex:
$ echo "usera
> userb
> userc" | sed "s/\(user\)\(.\)/\2\1/"
auser
buser
cuser
You can change the \2\1 to use any string you want - ie. \2_\1 will give a_user, b_user, c_user.
Also, in order to match any preceding string (not just "user"), just replace the \(user\) with \(.*\). Ex:
$ echo "marya
> johnb
> alfredc" | sed "s/\(.*\)\(.\)/\2\1/"
amary
bjohn
calfred
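Putting the pieces together for the list in the question, a sketch with both rearrangements might look like this. The function names are made up; the patterns are the same capture-group idea as above, anchored with ^ and $ so the second group is always the last character of the line.

```shell
# Hypothetical helpers; both read one username per line from stdin.
# Group 1 is everything but the last character, group 2 is the last character.
last_char_first()        { sed 's/^\(.*\)\(.\)$/\2\1/'; }
underscore_before_last() { sed 's/^\(.*\)\(.\)$/\1_\2/'; }

printf '%s\n' johna maryb charlesc | last_char_first         # ajohn bmary ccharles
printf '%s\n' johna maryb charlesc | underscore_before_last  # john_a mary_b charles_c
```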

here's a partial answer to what is probably the easy part. To use sed to change usera to user_a you could use:
sed 's/user/user_/' temp
where temp is the name of the file that contains your initial list of usernames. How this works: It is finding the first instance of "user" on each line and replacing it with "user_"
Similarly for your dot example:
sed 's/user/user./' temp
will replace the first instance of "user" on each line with "user."

Sed does not offer non-greedy regex, so I suggest perl:
perl -pe 's/(.*?)(.)$/$2$1/g' file
ajohn
bmary
ccharles
perl -pe 's/(.*?)(.)$/$1_$2/g' file
john_a
mary_b
charles_c
That way you don't need to know the username before hand.

Simple solution using awk
awk '{a=$NF;$NF="";$0=a$0}1' FS="" OFS="" file
ajohn
bmary
ccharles
and
awk '{a=$NF;$NF="";$0=$0"_" a}1' FS="" OFS="" file
john_a
mary_b
charles_c
By setting FS to the empty string, every character becomes a field in awk, so you can easily manipulate it.
There is no need for capture groups, just plain field swapping.
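Splitting on an empty FS is a widely supported awk extension rather than something POSIX guarantees, so if your awk behaves differently, a sketch using substr gives the same first result without relying on it. The function name is made up for illustration.

```shell
# Hypothetical helper: move the last character of each line to the front,
# using substr instead of per-character fields.
last_char_first_awk() {
  awk '{ n = length($0); print substr($0, n) substr($0, 1, n - 1) }'
}

printf '%s\n' johna maryb charlesc | last_char_first_awk   # ajohn bmary ccharles
```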

This might work for you (GNU sed):
sed -r 's/^([^_]*)_?(.)$/\2\1/' file
This matches any characters other than underscores (in the first back reference (\1)), a possible underscore and the last character (in the second back reference (\2)) and swaps them around.

How to use SED to replace underscores with hyphens in URLs?

Hi this looks like a common problem but I can't find a solution via googling or SO. Within several HTML files kept in a local directory I need to change URLs from http://www.myblog.com from using underscores in the ending part of the URL to using hyphens instead. (Any other URLs need to remain unchanged.) I use Ubuntu Linux, I think SED will work for me here but can use other tools instead if helpful.
So if an HTML file includes:
...
I need to switch it to:
...
The URLs in question will appear only within HTML anchor tags. Also, the underscore-to-hyphen part will occur only on the ending part (i.e., after the last "/" in the URL string.) I can't just do a global search-and-replace (s/_/-/g) because there are potentially URLs to other sites and other underscores not related to URLs that I wouldn't want to alter.

You can use awk
awk -F\" '{for (i=1;i<=NF;i++) if ($i~/http/) {n=split($i,a,"/");gsub(/_/,"-",a[n]);s="";for (j=1;j<=n;j++) s=s (j>1?"/":"") a[j];$i=s}print $0}' OFS=\" file
data with_underscore
This will only replace _ if it's in the last path component of the URL
A more readable version:
awk -F\" '
{
  for (i=1;i<=NF;i++)
    if ($i~/http/) {
      n=split($i,a,"/")           # break the URL into path components
      gsub(/_/,"-",a[n])          # change underscores in the last one only
      s=""                        # reset, or earlier URLs leak into this one
      for (j=1;j<=n;j++)
        s=s (j>1?"/":"") a[j]     # rejoin the components with /
      $i=s
    }
  print $0
}' OFS=\" file
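The question asks that URLs on other sites stay unchanged, so a variant that only touches www.myblog.com links may be closer to what is needed. This is a sketch under that assumption; the function name hyphenate_myblog_links and the sample line are made up, and it relies on every URL sitting between double quotes.

```shell
# Hypothetical helper: in double-quoted http://www.myblog.com/ URLs,
# replace underscores with hyphens in the last path component only.
hyphenate_myblog_links() {
  awk -F'"' -v OFS='"' '
  {
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^http:\/\/www\.myblog\.com\//) {
        n = split($i, part, "/")
        gsub(/_/, "-", part[n])          # last component only
        url = part[1]
        for (j = 2; j <= n; j++) url = url "/" part[j]
        $i = url
      }
    }
    print
  }'
}

printf '%s\n' '<a href="http://www.myblog.com/my_first_post">my_text</a> <a href="http://other.com/a_b">x</a>' | hyphenate_myblog_links
```

Only the first link changes; the link text and the other.com URL keep their underscores.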

This might work for you (GNU sed):
sed -r ':a;/<a href="http:\/\/www.myblog.com\/[^"]*_[^"]*"/{s//\n&\n/;h;y/_/-/;G;s/.*\n(.*)\n.*\n(.*)\n.*\n/\2\1/;ba}' file
If a line contains the URL as shown in the question, surround the URL by markers (newlines in this case), copy the line and then translate the '_''s to '-''s. Append the original line and reassemble using the markers as guides. Repeat the process until all URL's are processed.

Extract CentOS mirror domain names using sed

I'm trying to extract a list of CentOS domain names only from http://mirrorlist.centos.org/?release=6.4&arch=x86_64&repo=os
Stripping the "http://" or "ftp://" prefix and everything from the first "/" onward should result in a list like
yum.phx.singlehop.com
mirror.nyi.net
bay.uchicago.edu
centos.mirror.constant.com
mirror.teklinks.com
centos.mirror.netriplex.com
centos.someimage.com
mirror.sanctuaryhost.com
mirrors.cat.pdx.edu
mirrors.tummy.com
I searched stackoverflow for the sed method but I'm still having trouble.
I tried doing this with sed
curl "http://mirrorlist.centos.org/?release=6.4&arch=x86_64&repo=os" | sed '/:\/\//,/\//p'
but doesn't look like it is doing anything. Can you give me some advice?

Here you go:
curl "http://mirrorlist.centos.org/?release=6.4&arch=x86_64&repo=os" | sed -e 's?.*://??' -e 's?/.*??'
Your sed was completely wrong:
/x/,/y/ is a range. It selects multiple lines, from a line matching /x/ until a line matching /y/
The p command prints the selected range
Since all lines match both the start and end pattern you used, you effectively selected all lines. And, since sed echoes the input by default, the p command results in duplicated lines (all lines printed twice).
In my fix:
I used s??? instead of s/// because this way I didn't need to escape all the / in the patterns, so it's a bit more readable this way
I used two expressions with the -e flag:
s?.*://?? matches everything up until :// and replaces it with nothing
s?/.*?? matches everything from / until the end and replaces it with nothing
The two expressions are executed in the given order
In modern versions of sed you can omit -e and separate the two expressions with ;. I stick to using -e because it's more portable.
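To try the two expressions without hitting the network, here is a sketch that feeds them made-up mirror URLs on stdin. The host names come from the list in the question, but the paths after them are invented for the example, and the function name extract_hosts is hypothetical.

```shell
# Hypothetical helper: reduce each URL on stdin to its host name.
# Expression 1 drops everything through "://"; expression 2 drops the
# first "/" and everything after it.
extract_hosts() {
  sed -e 's?.*://??' -e 's?/.*??'
}

printf '%s\n' 'http://mirror.nyi.net/centos/6.4/os/x86_64/' \
              'ftp://mirrors.tummy.com/pub/centos/6.4/os/x86_64/' | extract_hosts
```

This prints mirror.nyi.net and mirrors.tummy.com, one per line.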

How can I remove all non-word characters except the newline?

I have a file like this:
my line - some words & text
oh lóok i've got some characters
I want to 'normalize' it and remove all the non-word characters. I want to end up with something like this:
mylinesomewordstext
ohlóokivegotsomecharacters
I'm using Linux on the command line at the moment, and I'm hoping there's some one-liner I can use.
I tried this:
cat file | perl -pe 's/\W//'
But that removed all the newlines and put everything on one line. Is there some way I can tell Perl not to include newlines in \W? Or is there some other way?

This removes characters that don't match \w or \n:
cat file | perl -C -pe 's/[^\w\n]//g'

@sth's solution uses Perl, which is (at least on my system) not Unicode compatible, thus it loses the accented o character.
On the other hand, sed is Unicode compatible (according to the lists on this page), and gives a correct result:
$ sed 's/\W//g' a.txt
mylinesomewordstext
ohlóokivegotsomecharacters

In Perl, I'd just add the -l switch, which re-adds the newline by appending it to the end of every print():
perl -ple 's/\W//g' file
Notice that you don't need the cat.

The previous response isn't echoing the "ó" character. At least in my case.
sed 's/\W//g' file

Best practices for shell scripting dictate that you should use the tr program for replacing single characters instead of sed, because it's faster and more efficient. Obviously use sed if replacing longer strings.
tr -d '[:blank:][:punct:]' < file
When run with time I get:
real 0m0.003s
user 0m0.000s
sys 0m0.004s
When I run the sed answer (sed -e 's/\W//g' file) with time I get:
real 0m0.003s
user 0m0.004s
sys 0m0.004s
While not a "huge" difference, you'll notice the difference when running against larger data sets. Also please notice how I didn't pipe cat's output into tr, instead using I/O redirection (one less process to spawn).
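For reference, here is the tr command wrapped in a function and run on the first sample line from the question. The function name is made up. One difference from \W worth knowing: underscore counts as punctuation for tr, so it is removed here, whereas \w treats it as a word character and the sed and Perl versions keep it.

```shell
# Hypothetical helper: delete blanks and punctuation, keep newlines.
strip_blank_punct() {
  tr -d '[:blank:][:punct:]'
}

printf '%s\n' 'my line - some words & text' | strip_blank_punct   # mylinesomewordstext
```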
