How to extract URL from html source with sed/awk or cut? - sed

I am writing a script that will download an html page source as a file and then read the file and extract a specific URL that is located after a specific code. (it only has 1 occurrence)
Here is a sample that I need matched:
<img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg"
The code preceding the URL will always be the same so I need to extract the part between:
<img id="sample-image" class="photo" src="
and the " after the URL.
I tried something with sed like this:
sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt
But it does not work. I would appreciate your suggestions, thanks a lot !

You can use grep like this :
grep -oP '<img\s+id="sample-image"\s+class="photo"\s+src="\K[^"]+' test.txt
or with sed :
sed -r 's/<img\s+id="sample-image"\s+class="photo"\s+src="([^"]+)"/\1/' test.txt
or with awk :
awk -F'src="' -F'"' '/<img\s+id="sample-image"/{print $6}' test.txt

If you have GNU grep then you can do something like:
grep -oP "(?<=src=\")[^\"]+(?=\")" test.txt
If you wish to use awk then the following would work:
awk -F\" '{print $(NF-1)}' test.txt

With sed as
echo $string | sed 's/\<img.*src="\(.*\)".*/\1/'

A few things about the sed command you are using:
sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt
You don't need to escape the <, " or space. The single quotes prevents the shell from doing word splitting and other stuff on your sed expression.
You are essentially doing this sed -n '/pattern/p' test.txt (except you seemed to be missing the opening backslash) which says "match this pattern, then print the line which contain the match", you are not really extracting the URL.
This is minor, but you don't need to match class="photo" since the id already makes the HTML element unique (no two elements share the same id w/in the same HTML).
Here's what I would do
sed -n 's/.*<img id="sample-image".*src="\([^"]+\)".*/\1/p' test.txt
The p flag tells sed to print the line where substitution (s) was performed.
\(pattern\) captures a subexpression which can be accessed via \1, \2, etc. on the right side of s///
The .* at the start of regex is in case there is something else preceding the <img> element on the line (you did mention you are parsing a HTML file)

Related

search a string which contains "/" and replace using sed

How to search a pattern and remove the line using sed which contains special characters like "ranasnfs2:/SA_kits/prod"
I tried using a variable to hold the complete string and then recall the variable in sed command but it is not working.
echo $a
ranasnfs2:/SA_kits/prod
sed -i '/"$a"/d' test.txt
cat test.txt | grep -i SA
/SA_kits -rw,suid,soft,retry=4 ranasnfs2:/SA_kits/prod
You need to escape the slash character.
Use this for deleting lines which contain a /:
sed '/\//d' file

Remove all the characters from string after last '/'

I have the followiing input file and I need to remove all the characters from the strings that appear after the last '/'. I'll also show my expected output below.
input:
/start/one/two/stopone.js
/start/one/two/three/stoptwo.js
/start/one/stopxyz.js
expected output:
/start/one/two/
/start/one/two/three/
/start/one/
I have tried to use sed but with no luck so far.
You could simply use good old grep:
grep -o '.*/' file.txt
This simple expression takes advantage of the fact that grep is matching greedy. Meaning it will consume as much characters as possible, including /, until the last / in path.
Original Answer:
You can use dirname:
while read line ; do
echo dirname "$line"
done < file.txt
or sed:
sed 's~\(.*/\).*~\1~' file.txt
perl -lne 'print $1 if(/(.*)\//)' your_file
Try this GNU sed command,
$ sed -r 's~^(.*\/).*$~\1~g' file
/start/one/two/
/start/one/two/three/
/start/one/
Through awk,
awk -F/ '{sub(/.*/,"",$NF); print}' OFS="/" file

extract a substring of 11 characters from a line using sed,awk or perl

I have a file with many lines, in each line
there is either substring
whatever_blablablalsfjlsdjf;asdfjlds;f/watch?v=yPrg-JN50sw&amp,whatever_blabla
or
whatever_blablabla"/watch?v=yPrg-JN50sw&amp" class=whatever_blablablavwhate
I want to extract a substring, like the "yPrg-JN50s" above
the matching pattern is
the 11 characters after the string "/watch?="
how to extract the substring
I hope it is sed, awk in one line
if not, a pn line perl script is also ok
You can do
grep -oP '(?<=/watch\?v=).{11}'
if your grep knows Perl regex, or
sed 's/.*\/watch?v=\(.\{11\}\).*/\1/g'
$ cat file
/watch?v=yPrg-JN50sw&amp
"/watch?v=yPrg-JN50sw&amp" class=
$
$ awk 'match($0,/\/watch\?v=/) { print substr($0,RSTART+RLENGTH,11) }' file
yPrg-JN50sw
yPrg-JN50sw
Just with the shell's parameter expansion, extract the 11 chars after "watch?v=":
while IFS= read -r line; do
tmp=${line##*watch?v=}
echo ${tmp:0:11}
done < filename
You could use sed to remove the extraneous information:
sed 's/[^=]\+=//; s/&.*$//' file
Or with awk and sensible field separators:
awk -F '[=&]' '{print $2}' file
Contents of file:
cat <<EOF > file
/watch?v=yPrg-JN50sw&amp
"/watch?v=yPrg-JN50sw&amp" class=
EOF
Output:
yPrg-JN50sw
yPrg-JN50sw
Edit accommodating new requirements mentioned in the comments
cat <<EOF > file
<div id="" yt-grid-box "><div class="yt-lockup-thumbnail"><a href="/watch?v=0_NfNAL3Ffc" class="ux-thumb-wrap yt-uix-sessionlink yt-uix-contextlink contains-addto result-item-thumb" data-sessionlink="ved=CAMQwBs%3D&ei=CPTsy8bhqLMCFRR0fAodowXbww%3D%3D"><span class="video-thumb ux-thumb yt-thumb-default-185 "><span class="yt-thumb-clip"><span class="yt-thumb-clip-inner"><img src="//i1.ytimg.com/vi/0_NfNAL3Ffc/mqdefault.jpg" alt="Miniature" width="185" ><span class="vertical-align"></span></span></span></span><span class="video-time">5:15</span>
EOF
Use awk with sensible record separator:
awk -v RS='[=&"]' '/watch/ { getline; print }' file
Note, you should use a proper XML parser for this sort of task.
grep --perl-regexp --only-matching --regexp="(?<=/watch\\?=)([^&]{0,11})"
Assuming your lines have exactly the format you quoted, this should work.
awk '{print substr($0,10,11)}'
Edit: From the comment in another answer, I guess your lines are much longer and complicated than this, in which case something more comprehensive is needed:
gawk '{if(match($0, "/watch\\?v=(\\w+)",a)) print a[1]}'

Unix - Split to N files using regexp to name destination file

How do I split a file to N files using as a filename the first 2 chars on the line.
Ex input file:
AA23409234TEXT
BA23201202Other Text
AA23509234YADA
BA23202202More Text.
C1000000000000000000
Should generate 3 files:
AA.txt
AA23409234TEXT
AA23509234YADA
BA.txt
BA23201202Other Text
BA23202202More Text.
C1.txt
C1000000000000000000
I'm thinking of using a sed script similar to this
/^(..)/w \1
But what that really does is create a file named '\1' instead of the capture group.
Any ideas?
$ awk '{fname=substr($0, 0, 2); print >>fname}' input.txt
Or
$ while read line; do echo "$line" >>"${line:0:2}"; done <input.txt
The first thing you need to do is determine all of your file names:
filenames=$(sed 's/\(..\).*/\1/' listOfStrings.txt | sort | uniq)
Then, loop through those filenames
for filename in $filenames
do
sed -n '/^$filename/ p' listOfStrings.txt > $filename.txt
done
I have not tested this, but I think it should work.
This might work for you:
sed 's/\(..\).*/echo "&" >>\1.txt/' file | sh
or if you have GNU sed:
sed 's/\(..\).*/echo "&" >>\1.txt/e' file

How do I push `sed` matches to the shell call in the replacement pattern?

I need to replace several URLs in a text file with some content dependent on the URL itself. Let's say for simplicity it's the first line of the document at the URL.
What I'm trying is this:
sed "s/^URL=\(.*\)/TITLE=$(curl -s \1 | head -n 1)/" file.txt
This doesn't work, since \1 is not set. However, the shell is getting called. Can I somehow push the sed match variables to that subprocess?
The accept answer is just plain wrong. Proof:
Make an executable script foo.sh:
#! /bin/bash
echo $* 1>&2
Now run it:
$ echo foo | sed -e "s/\\(foo\\)/$(./foo.sh \\1)/"
\1
$
The $(...) is expanded before sed is run.
So you are trying to call an external command from inside the replacement pattern of a sed substitution. I dont' think it can be done, the $... inside a pattern just allows you to use an already existent (constant) shell variable.
I'd go with Perl, see the /e option in the search-replace operator (s/.../.../e).
UPDATE: I was wrong, sed plays nicely with the shell, and it allows you do to that. But, then, the backlash in \1 should be escaped. Try instead:
sed "s/^URL=\(.*\)/TITLE=$(curl -s \\1 | head -n 1)/" file.txt
Try this:
sed "s/^URL=\(.*\)/\1/" file.txt | while read url; do sed "s#URL=\($url\)#TITLE=$(curl -s $url | head -n 1)#" file.txt; done
If there are duplicate URLs in the original file, then there will be n^2 of them in the output. The # as a delimiter depends on the URLs not including that character.
Late reply, but making sure people don't get thrown off by the answers here -- this can be done in gnu sed using the e command. The following, for example, decrements a number at the beginning of a line:
echo "444 foo" | sed "s/\([0-9]*\)\(.*\)/expr \1 - 1 | tr -d '\n'; echo \"\2\";/e"
will produce:
443 foo