Extract data using grep/sed from html tag with special class/id - sed

I need to grep info from website and it is stored like:
<div class="name">Mark</div>
<div class="surname">John</div>
<div class="phone">8434</div>
and etc.
Tried to grep it and parse it later with sed:
grep -o '<div class="name">.*</div>' | sed -e 's?<div class="name">?|?g'
but, when I try to replace with sed -e 's?<\/div><div class="phone">?|?g' - no result
and for every class do the same thing. I cannot delete all html tags (sed 's/<[^>]\+>//g'), and need to do it only for div with this classes.
The output format should be like
|Mark|John|8434|
I need to do it with grep/sed

Using awk should do the job:
awk -F"[<>]" '{printf "%s|",$3}' file
Mark|John|8434|
If you need a new line at the end:
awk -F"[<>]" '{printf "%s|",$3} END {print ""}' file
It creates filed separated by < or > then print the third field with | as separator.

Related

Parsing HTML on the command line; How to capture text in <strong></strong>?

I'm trying to grab data from HTML output that looks like this:
<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....
I'm using a pipe train to whittle down the data to the targets I'm trying to hit. Here's my approach so far:
grep "/strong" output.html | awk '{print $1}'
Grep on "/strong" to get the lines with the targets; that works fine.
Pipe to 'awk '{print $1}'. That works in case #1 when the target has no spaces, but fails in case #2 when the target has spaces..only the first word is preserved as below:
<strong>Target1NoSpaces</strong><span
<strong>Target2
Do you have any tips on hitting the target properly, either in my awk or in different command? Anything quick and dirty (grep, awk, sed, perl) would be appreciated.
Try pup, a command line tool for processing HTML. For example:
$ pup 'strong text{}' < file.html
Target1NoSpaces
Target2 With Spaces
To search via XPath, try xpup.
Alternatively, for a well-formed HTML/XML document, try html-xml-utils.
One way using mojolicious and its DOM parser:
perl -Mojo -E '
g("http://your.web")
->dom
->find("strong")
->each( sub { if ( $t = shift->text ) { say $t } } )'
Using Perl regex's look-behind and look-ahead feature in grep. It should be simpler than using awk.
grep -oP "(?<=<strong>).*?(?=</strong>)" file
Output:
Target1NoSpaces
Target2 With Spaces
Add:
This implementation of Perl's regex's multi-matching in Ruby could match values in multiple lines:
ruby -e 'File.read(ARGV.shift).scan(/(?<=<strong>).*?(?=<\/strong>)/m).each{|e| puts "----------"; puts e;}' file
Input:
<strong>Target
A
B
C
</strong><strong>Target D</strong><strong>Target E</strong>
Output:
----------
Target
A
B
C
----------
Target D
----------
Target E
Here's a solution using xmlstarlet
xml sel -t -v //strong input.html
Trying to parse HTML without a real HTML parser is a bad idea. Having said that, here is a very quick and dirty solution to the specific example you provided. It will not work when there is more than one <strong> tag on a line, when the tag runs over more than one line, etc.
awk -F '<strong>|</strong>' '/<strong>/ {print $2}' filename
You never need grep with awk and the field separator doesn't have to be whitespace:
$ awk -F'<|>' '/strong/{print $3}' file
Target1NoSpaces
Target2 With Spaces
You should really use a proper parser for this however.
Since you tagged perl
perl -ne 'if(/(?:<strong>)(.*)(?:<\/strong>)/){print $1."\n";}' input.html
I am surprised no one mensions W3C HTML-XML-utils
curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
hxnormalize -x |
hxselect -s '\n' strong
output:
<strong class="fc-black-750 mb6">Stack Overflow
for Teams</strong>
<strong>Teams</strong>
To capture only content:
curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
hxnormalize -x |
hxselect -s '\n' -c strong
Stack Overflow
for Teams
Teams

How to extract URL from html source with sed/awk or cut?

I am writing a script that will download an html page source as a file and then read the file and extract a specific URL that is located after a specific code. (it only has 1 occurrence)
Here is a sample that I need matched:
<img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg"
The code preceding the URL will always be the same so I need to extract the part between:
<img id="sample-image" class="photo" src="
and the " after the URL.
I tried something with sed like this:
sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt
But it does not work. I would appreciate your suggestions, thanks a lot !
You can use grep like this :
grep -oP '<img\s+id="sample-image"\s+class="photo"\s+src="\K[^"]+' test.txt
or with sed :
sed -r 's/<img\s+id="sample-image"\s+class="photo"\s+src="([^"]+)"/\1/' test.txt
or with awk :
awk -F'src="' -F'"' '/<img\s+id="sample-image"/{print $6}' test.txt
If you have GNU grep then you can do something like:
grep -oP "(?<=src=\")[^\"]+(?=\")" test.txt
If you wish to use awk then the following would work:
awk -F\" '{print $(NF-1)}' test.txt
With sed as
echo $string | sed 's/\<img.*src="\(.*\)".*/\1/'
A few things about the sed command you are using:
sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt
You don't need to escape the <, " or space. The single quotes prevents the shell from doing word splitting and other stuff on your sed expression.
You are essentially doing this sed -n '/pattern/p' test.txt (except you seemed to be missing the opening backslash) which says "match this pattern, then print the line which contain the match", you are not really extracting the URL.
This is minor, but you don't need to match class="photo" since the id already makes the HTML element unique (no two elements share the same id w/in the same HTML).
Here's what I would do
sed -n 's/.*<img id="sample-image".*src="\([^"]+\)".*/\1/p' test.txt
The p flag tells sed to print the line where substitution (s) was performed.
\(pattern\) captures a subexpression which can be accessed via \1, \2, etc. on the right side of s///
The .* at the start of regex is in case there is something else preceding the <img> element on the line (you did mention you are parsing a HTML file)

delete a column with awk or sed

I have a file with three columns. I would like to delete the 3rd column(in-place editing). How can I do this with awk or sed?
123 abc 22.3
453 abg 56.7
1236 hjg 2.3
Desired output
123 abc
453 abg
1236 hjg
try this short thing:
awk '!($3="")' file
With GNU awk for inplace editing, \s/\S, and gensub() to delete
1) the FIRST field:
awk -i inplace '{sub(/^\S+\s*/,"")}1' file
or
awk -i inplace '{$0=gensub(/^\S+\s*/,"",1)}1' file
2) the LAST field:
awk -i inplace '{sub(/\s*\S+$/,"")}1' file
or
awk -i inplace '{$0=gensub(/\s*\S+$/,"",1)}1' file
3) the Nth field where N=3:
awk -i inplace '{$0=gensub(/\s*\S+/,"",3)}1' file
Without GNU awk you need a match()+substr() combo or multiple sub()s + vars to remove a middle field. See also Print all but the first three columns.
This might work for you (GNU sed):
sed -i -r 's/\S+//3' file
If you want to delete the white space before the 3rd field:
sed -i -r 's/(\s+)?\S+//3' file
It seems you could simply go with
awk '{print $1 " " $2}' file
This prints the two first fields of each line in your input file, separated with a space.
Try using cut... its fast and easy
First you have repeated spaces, you can squeeze those down to a single space between columns if thats what you want with tr -s ' '
If each column already has just one delimiter between it, you can use cut -d ' ' -f-2 to print fields (columns) <= 2.
for example if your data is in a file input.txt you can do one of the following:
cat input.txt | tr -s ' ' | cut -d ' ' -f-2
Or if you better reason about this problem by removing the 3rd column you can write the following
cat input.txt | tr -s ' ' | cut -d ' ' --complement -f3
cut is pretty powerful, you can also extract ranges of bytes, or characters, in addition to columns
excerpt from the man page on the syntax of how to specify the list range
Each LIST is made up of one range, or many ranges separated by commas.
Selected input is written in the same order that it is read, and is
written exactly once. Each range is one of:
N N'th byte, character or field, counted from 1
N- from N'th byte, character or field, to end of line
N-M from N'th to M'th (included) byte, character or field
-M from first to M'th (included) byte, character or field
so you also could have said you want specific columns 1 and 2 with...
cat input.txt | tr -s ' ' | cut -d ' ' -f1,2
Try this :
awk '$3="";1' file.txt > new_file && mv new_file file.txt
or
awk '{$3="";print}' file.txt > new_file && mv new_file file.txt
Try
awk '{$3=""; print $0}'
If you're open to a Perl solution...
perl -ane 'print "$F[0] $F[1]\n"' file
These command-line options are used:
-n loop around every line of the input file, do not automatically print every line
-a autosplit mode – split input lines into the #F array. Defaults to splitting on whitespace
-e execute the following perl code

extract a substring of 11 characters from a line using sed,awk or perl

I have a file with many lines, in each line
there is either substring
whatever_blablablalsfjlsdjf;asdfjlds;f/watch?v=yPrg-JN50sw&amp,whatever_blabla
or
whatever_blablabla"/watch?v=yPrg-JN50sw&amp" class=whatever_blablablavwhate
I want to extract a substring, like the "yPrg-JN50s" above
the matching pattern is
the 11 characters after the string "/watch?="
how to extract the substring
I hope it is sed, awk in one line
if not, a pn line perl script is also ok
You can do
grep -oP '(?<=/watch\?v=).{11}'
if your grep knows Perl regex, or
sed 's/.*\/watch?v=\(.\{11\}\).*/\1/g'
$ cat file
/watch?v=yPrg-JN50sw&amp
"/watch?v=yPrg-JN50sw&amp" class=
$
$ awk 'match($0,/\/watch\?v=/) { print substr($0,RSTART+RLENGTH,11) }' file
yPrg-JN50sw
yPrg-JN50sw
Just with the shell's parameter expansion, extract the 11 chars after "watch?v=":
while IFS= read -r line; do
tmp=${line##*watch?v=}
echo ${tmp:0:11}
done < filename
You could use sed to remove the extraneous information:
sed 's/[^=]\+=//; s/&.*$//' file
Or with awk and sensible field separators:
awk -F '[=&]' '{print $2}' file
Contents of file:
cat <<EOF > file
/watch?v=yPrg-JN50sw&amp
"/watch?v=yPrg-JN50sw&amp" class=
EOF
Output:
yPrg-JN50sw
yPrg-JN50sw
Edit accommodating new requirements mentioned in the comments
cat <<EOF > file
<div id="" yt-grid-box "><div class="yt-lockup-thumbnail"><a href="/watch?v=0_NfNAL3Ffc" class="ux-thumb-wrap yt-uix-sessionlink yt-uix-contextlink contains-addto result-item-thumb" data-sessionlink="ved=CAMQwBs%3D&ei=CPTsy8bhqLMCFRR0fAodowXbww%3D%3D"><span class="video-thumb ux-thumb yt-thumb-default-185 "><span class="yt-thumb-clip"><span class="yt-thumb-clip-inner"><img src="//i1.ytimg.com/vi/0_NfNAL3Ffc/mqdefault.jpg" alt="Miniature" width="185" ><span class="vertical-align"></span></span></span></span><span class="video-time">5:15</span>
EOF
Use awk with sensible record separator:
awk -v RS='[=&"]' '/watch/ { getline; print }' file
Note, you should use a proper XML parser for this sort of task.
grep --perl-regexp --only-matching --regexp="(?<=/watch\\?=)([^&]{0,11})"
Assuming your lines have exactly the format you quoted, this should work.
awk '{print substr($0,10,11)}'
Edit: From the comment in another answer, I guess your lines are much longer and complicated than this, in which case something more comprehensive is needed:
gawk '{if(match($0, "/watch\\?v=(\\w+)",a)) print a[1]}'

AWK/SED. How to remove parentheses in simple text file

I have a text file looking like this:
(-9.1744438E-02,7.6282293E-02) (-9.1744438E-02,7.6282293E-02) ... and so on.
I would like to modify the file by removing all the parenthesis and a new line for each couple
so that it look like this:
-9.1744438E-02,7.6282293E-02
-9.1744438E-02,7.6282293E-02
...
A simple way to do that?
Any help is appreciated,
Fred
I would use tr for this job:
cat in_file | tr -d '()' > out_file
With the -d switch it just deletes any characters in the given set.
To add new lines you could pipe it through two trs:
cat in_file | tr -d '(' | tr ')' '\n' > out_file
As was said, almost:
sed 's/[()]//g' inputfile > outputfile
or in awk:
awk '{gsub(/[()]/,""); print;}' inputfile > outputfile
This would work -
awk -v FS="[()]" '{for (i=2;i<=NF;i+=2) print $i }' inputfile > outputfile
Test:
[jaypal:~/Temp] cat file
(-9.1744438E-02,7.6282293E-02) (-9.1744438E-02,7.6282293E-02)
[jaypal:~/Temp] awk -v FS="[()]" '{for (i=2;i<=NF;i+=2) print $i }' file
-9.1744438E-02,7.6282293E-02
-9.1744438E-02,7.6282293E-02
This might work for you:
echo "(-9.1744438E-02,7.6282293E-02) (-9.1744438E-02,7.6282293E-02)" |
sed 's/) (/\n/;s/[()]//g'
-9.1744438E-02,7.6282293E-02
-9.1744438E-02,7.6282293E-02
Guess we all know this, but just to emphasize:
Usage of bash commands is better in terms of time taken for execution, than using awk or sed to do the same job. For instance, try not to use sed/awk where grep can suffice.
In this particular case, I created a file 100000 lines long file, each containing characters "(" as well as ")". Then ran
$ /usr/bin/time -f%E -o log cat file | tr -d "()"
and again,
$ /usr/bin/time -f%E -ao log sed 's/[()]//g' file
And the results were:
05.44 sec : Using tr
05.57 sec : Using sed
cat in_file | sed 's/[()]//g' > out_file
Due to formatting issues, it is not entirely clear from your question whether you also need to insert newlines.