Retrieve text/word from HTML code using awk/sed

awk/sed newbie here. I have an HTML file and I would like to retrieve one word of text from each line of it.
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc_process.txt>abc</a> NDK Version: 4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc01_process.txt>abc01</a> NDK Version: 4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc045_process.txt>abc045</a> NDK Version: 4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/cdf_process.txt>cdf</a> NDK Version: 4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/Manhattan_process.txt>Manhattan</a> NDK Version: 4.0 </li>
For example, from the 1st line I would like to retrieve abc, which is placed between .txt> and </a>.
I have used the following command, but as you can see the number of letters in the word keeps changing: abc, abc01, abc045, cdf, Manhattan.
awk -F\/ '{print substr($4,0,3)}' list.html
This command only gets the right output for the 3-letter words. I want to extract the same information (abc, abc01, abc045, cdf, Manhattan) from all the lines in the HTML code. Please help.

Using awk:
awk -F'[<>]' '{print $7}' urls
abc
abc01
abc045
cdf
Manhattan

You could try:
perl -nE '/<a href.*?>(.*?)<\/a>/; say $1' file
Output:
abc
abc01
abc045
cdf
Manhattan

$ sed -n 's/.*txt>\([[:alnum:]]\+\)<.*/\1/p' list.html
abc
abc01
abc045
cdf
Manhattan
Or:
$ awk -F'(txt>|</a)' '{print $2}' list.html
abc
abc01
abc045
cdf
Manhattan

You can use sed or awk to extract it. Here, I saved the original data into the file /tmp/html.txt.
Both commands use a regular expression with a back-reference.
Via sed
flying@lempstacker:~$ sed -r -n 's#.*<a [^>]*>(.*)</a>.*#\1#p' /tmp/html.txt
abc
abc01
abc045
cdf
Manhattan
flying@lempstacker:~$
Via awk
using the gawk function gensub
flying@lempstacker:~$ awk '{print gensub(/.*<a [^>]*>(.*)<\/a>.*/,"\\1","g",$0)}' /tmp/html.txt
abc
abc01
abc045
cdf
Manhattan
flying@lempstacker:~$

Using GNU grep:
grep -Po "<a href.*?>\K[^<]*" file
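Several of the answers above rely on GNU extensions (\+ in sed, gensub in gawk, grep -P). A minimal portable sketch, assuming each line holds one <a ...>text</a> link as in the question, could use only POSIX basic regular expressions. The function name link_text and the temporary sample file are hypothetical, introduced just for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print the link text found on each input line.
# If a line held several links, the greedy leading .* would pick the last one.
# Only POSIX basic regular expressions are used, so no GNU extensions are needed.
link_text() {
  sed -n 's/.*<a [^>]*>\([^<]*\)<\/a>.*/\1/p' "$@"
}

# Sample input (two lines taken from the question), written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc01_process.txt>abc01</a> NDK Version: 4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/Manhattan_process.txt>Manhattan</a> NDK Version: 4.0 </li>
EOF

link_text "$sample"
rm -f "$sample"
```

Because of -n and the p flag, lines without a link print nothing instead of being echoed unchanged.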

Related

Extract data using grep/sed from html tag with special class/id

I need to grep info from website and it is stored like:
<div class="name">Mark</div>
<div class="surname">John</div>
<div class="phone">8434</div>
etc.
I tried to grep it and parse it later with sed:
grep -o '<div class="name">.*</div>' | sed -e 's?<div class="name">?|?g'
but when I try to replace with sed -e 's?<\/div><div class="phone">?|?g' I get no result,
and I would have to do the same thing for every class. I cannot delete all HTML tags (sed 's/<[^>]\+>//g'); I need to do it only for the divs with these classes.
The output format should be like
|Mark|John|8434|
I need to do it with grep/sed
Using awk should do the job:
awk -F"[<>]" '{printf "%s|",$3}' file
Mark|John|8434|
If you need a new line at the end:
awk -F"[<>]" '{printf "%s|",$3} END {print ""}' file
It splits each line into fields separated by < or >, then prints the third field followed by | as a separator.
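Since the question asks for sed and restricts the match to specific classes, here is a minimal sed-based sketch. It assumes one div per line, exactly the three class names from the question, and no nested tags inside the divs; the function name extract_fields is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: keep only the text of divs whose class is name,
# surname or phone, then join the values as |a|b|c|.
extract_fields() {
  sed -n \
    -e 's/.*<div class="name">\([^<]*\)<\/div>.*/\1/p' \
    -e 's/.*<div class="surname">\([^<]*\)<\/div>.*/\1/p' \
    -e 's/.*<div class="phone">\([^<]*\)<\/div>.*/\1/p' "$@" |
  { printf '|'; tr '\n' '|'; printf '\n'; }
}

printf '%s\n' \
  '<div class="name">Mark</div>' \
  '<div class="surname">John</div>' \
  '<div class="phone">8434</div>' \
  '<div class="other">ignored</div>' | extract_fields
```

Unlike the awk one-liner, this produces the leading | from the requested format and skips divs with other classes.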

Sed. How to print lines matching pattern from another file?

I have file1 containing some text, like:
abcdef 123456 abcdef
ghijkl 789123 abcdef
mnopqr 123456 abcdef
and I have file2 containing single line of text which I want to use as pattern:
ghijkl 789123
How can I use the second file as a pattern and print the lines containing it to a third file using sed? For example, file3 would be:
ghijkl 789123 abcdef
I've tried to use
sed -ne "s/r file2//p" file1 > file3
But the content of file3 is blank for some reason
P.S. using Windows
If you have sed, do you have access to grep?
grep -f file2 file1 > file3
This is the simplest sed solution on Linux (the quotes are needed because the pattern contains a space):
sed -n "/`<file2`/p" file1 > file3
Windows does not provide backticks, so the workaround there would be:
set /p PATTERN=<file2
sed -n "/%PATTERN%/p" file1 > file3
The sed solution is:
cat f2.txt | xargs -I {} sed -n "/{}/p" f1.txt > f3.txt
but, as @Cyrus correctly notes, grep is the proper tool for this problem and it's much nicer:
grep -f f2.txt f1.txt > f3.txt
Note: using these incredibly powerful *nix tools like sed, grep, cat, xargs, bash, etc. on Microsoft Windows can be frustrating. Consider spinning up a Linux environment, instead -- you'll save yourself many hours of grief dealing with subtle path and environment issues from emulators like Cygwin, etc.
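One detail worth spelling out: grep -f treats every line of the pattern file as a regular expression. A small sketch, assuming the pattern should be matched literally, adds -F. The function name match_lines and the temporary directory are hypothetical; the file contents come from the question. It also assumes file2 has no empty lines (an empty pattern matches every line) and no Windows-style carriage returns at line ends:

```shell
#!/bin/sh
# Hypothetical helper: print lines of INPUT_FILE that contain any line of
# PATTERN_FILE as a literal substring (-F disables regex interpretation).
match_lines() {  # usage: match_lines PATTERN_FILE INPUT_FILE
  grep -F -f "$1" "$2"
}

workdir=$(mktemp -d)
printf '%s\n' 'abcdef 123456 abcdef' 'ghijkl 789123 abcdef' 'mnopqr 123456 abcdef' > "$workdir/file1"
printf '%s\n' 'ghijkl 789123' > "$workdir/file2"

match_lines "$workdir/file2" "$workdir/file1" > "$workdir/file3"
cat "$workdir/file3"
rm -rf "$workdir"
```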

How to add new line using sed on MacOS?

I want to insert a newline between </a> and <a><a>, so that
</a><a><a>
becomes
</a>
<a><a>
I tried this:
sed 's#</a><a><a>#</a>\n<a><a>#g' filename
but it didn't work.
On a Mac there are two ways to do this, depending on which sed you use (the stock BSD sed, or GNU sed installed as gsed):
echo foo | sed 's/f/f\'$'\n/'
echo foo | gsed 's/f/f\n/g'
Some seds, notably Mac / BSD, don't interpret \n in the replacement as a newline. You need to use an actual newline, preceded by a backslash:
$ echo foo | sed 's/f/f\n/'
fnoo
$ echo foo | sed 's/f/f\
> /'
f
oo
$
Or you can use:
echo foo | sed $'s/f/f\\\n/'
...or you just brute-force it. This worked for me for an insert on Mac / OS X:
sed "2 i \\\n${TEXT}\n\n" -i ${FILE_PATH_NAME}
sed "2 i \\\nSomeText\n\n" -i textfile.txt
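Applied to the input from the question, the backslash-newline form can be sketched as follows. POSIX specifies that a backslash followed by a literal newline in the replacement inserts a newline, so this form should behave the same with BSD and GNU sed. The function name split_anchors is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: break "</a><a><a>" into two lines.
# The replacement contains a backslash followed by a real newline,
# which POSIX sed turns into a newline in the output.
split_anchors() {
  sed 's#</a><a><a>#</a>\
<a><a>#g' "$@"
}

printf '%s\n' '</a><a><a>' | split_anchors
```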

sed replace text in a XML file

I have huge XML file with data like this:
<amount quantity="1">12.00</amount>
How can I replace the 12.00 with something else using sed?
Not really enough information in your question but to replace all values of 12.00 with say 24.00 you could do:
$ sed 's/>12\.00</>24.00</g' file.xml
If you are happy with the results you can store them back using the -i option:
$ sed -i 's/>12\.00</>24.00</g' file.xml
A more robust match would be:
$ sed -r 's_(<amount quantity="[0-9]+">)12\.00(</amount>)_\124.00\2_g' file.xml
But you should really parse the XML properly and not force regexp to do something it wasn't designed for.
script.sh:
#!/bin/bash
# Single quotes keep the inner double quotes as part of the XML string.
xml='<amount quantity="1">12.00</amount>'
newxml=$(echo "$xml" | sed -n "s/\(<amount[^>]*>\)\([^<]*\)\(<\/amount>\)/\113.37\3/gp")
echo "$newxml"
result:
$ ./script.sh
<amount quantity="1">13.37</amount>
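To make the new value a parameter, the same substitution can be wrapped in a small function. This is a sketch, not a substitute for a real XML parser: it assumes the new value contains no /, & or backslash (all special in a sed replacement) and that each amount element sits on one line. The name set_amount is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: replace the text of every <amount ...> element
# with the value given as the first argument; remaining arguments are files.
set_amount() {
  new_value=$1
  shift
  # \1 and \2 are single-digit back-references, so "\1" followed by
  # "24.00" is read as group 1 and then the literal text 24.00.
  sed "s/\(<amount[^>]*>\)[^<]*\(<\/amount>\)/\1${new_value}\2/g" "$@"
}

printf '%s\n' '<amount quantity="1">12.00</amount>' | set_amount 24.00
```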

extract a substring of 11 characters from a line using sed,awk or perl

I have a file with many lines. Each line contains either the substring
whatever_blablablalsfjlsdjf;asdfjlds;f/watch?v=yPrg-JN50sw&amp,whatever_blabla
or
whatever_blablabla"/watch?v=yPrg-JN50sw&amp" class=whatever_blablablavwhate
I want to extract a substring like the "yPrg-JN50sw" above.
The matching pattern is the 11 characters after the string "/watch?v=".
How can I extract this substring?
I would prefer a one-line sed or awk command;
if that is not possible, a short Perl script is also OK.
You can do
grep -oP '(?<=/watch\?v=).{11}'
if your grep knows Perl regex, or
sed 's/.*\/watch?v=\(.\{11\}\).*/\1/g'
$ cat file
/watch?v=yPrg-JN50sw&amp
"/watch?v=yPrg-JN50sw&amp" class=
$
$ awk 'match($0,/\/watch\?v=/) { print substr($0,RSTART+RLENGTH,11) }' file
yPrg-JN50sw
yPrg-JN50sw
Just with the shell's parameter expansion, extract the 11 chars after "watch?v=":
while IFS= read -r line; do
    tmp=${line##*watch?v=}
    echo "${tmp:0:11}"
done < filename
You could use sed to remove the extraneous information:
sed 's/[^=]\+=//; s/&.*$//' file
Or with awk and sensible field separators:
awk -F '[=&]' '{print $2}' file
Contents of file:
cat <<EOF > file
/watch?v=yPrg-JN50sw&amp
"/watch?v=yPrg-JN50sw&amp" class=
EOF
Output:
yPrg-JN50sw
yPrg-JN50sw
Edit accommodating new requirements mentioned in the comments
cat <<EOF > file
<div id="" yt-grid-box "><div class="yt-lockup-thumbnail"><a href="/watch?v=0_NfNAL3Ffc" class="ux-thumb-wrap yt-uix-sessionlink yt-uix-contextlink contains-addto result-item-thumb" data-sessionlink="ved=CAMQwBs%3D&ei=CPTsy8bhqLMCFRR0fAodowXbww%3D%3D"><span class="video-thumb ux-thumb yt-thumb-default-185 "><span class="yt-thumb-clip"><span class="yt-thumb-clip-inner"><img src="//i1.ytimg.com/vi/0_NfNAL3Ffc/mqdefault.jpg" alt="Miniature" width="185" ><span class="vertical-align"></span></span></span></span><span class="video-time">5:15</span>
EOF
Use awk with sensible record separator:
awk -v RS='[=&"]' '/watch/ { getline; print }' file
Note, you should use a proper XML parser for this sort of task.
grep --perl-regexp --only-matching --regexp="(?<=/watch\\?v=)([^&]{0,11})"
Assuming your lines have exactly the format you quoted, this should work.
awk '{print substr($0,10,11)}'
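Edit: From the comment on another answer, I guess your lines are much longer and more complicated than this, in which case something more comprehensive is needed (\w alone would stop at the "-" in the example, so the character class includes it explicitly):
gawk '{if(match($0, "/watch\\?v=([[:alnum:]_-]+)",a)) print a[1]}'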
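For a variant that avoids GNU-only features, here is a minimal POSIX sed sketch. It assumes the ID is exactly 11 characters drawn from letters, digits, "_" and "-" (an assumption based on the examples in this thread) and that each line has at most one ID of interest. The function name video_id is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the 11-character ID that follows "/watch?v=".
# In a basic regular expression "?" is a literal character, and -n with
# the p flag suppresses lines that contain no match.
video_id() {
  sed -n 's/.*\/watch?v=\([A-Za-z0-9_-]\{11\}\).*/\1/p' "$@"
}

printf '%s\n' \
  'whatever_blabla/watch?v=yPrg-JN50sw&amp,whatever_blabla' \
  '<a href="/watch?v=0_NfNAL3Ffc" class="ux-thumb-wrap">' \
  'a line without any video link' | video_id
```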