Use curl to parse XML, get an image's URL and download it - perl

I want to write a shell script to get an image from an rss feed.
Right now I have:
curl http://foo.com/rss.xml | grep -E '<img src="http://www.foo.com/full/' | head -1 | sed -e 's/<img src="//' -e 's/" alt=""//' -e 's/width="400"//' -e 's/ height="400" \/>//' | sed 's/ //g'
This I use to grab the first occurence of an image URL in the file.
Now I want to put this URL in a variable to use cURL again to download the image.
Any help appreciated! (Also you might give tipps on how to better remove everything from the line with the URL. This is the line:
<img src="http://www.nichtlustig.de/comics/full/100802.jpg" alt="" width="400" height="400" />
There's probably some better regex to remove everything except the URL than my solution.)
Thanks in advance!

Using a regexp to parse HTML/XML is a Bad Idea in general. Therefore I'd recommend that you use a proper parser.
If you don't object to using Perl, let Perl do the proper XML or HTML parsing for you using appropriate parser libraries:
HTML
curl http://BOGUS.com |& perl -e '{use HTML::TokeParser;
$parser = HTML::TokeParser->new(\*STDIN);
$img = $parser->get_tag('img') ;
print "$img->[1]->{src}\n";
}'
/content02/groups/intranetcommon/documents/image/blk_logo.gif
XML
curl http://BOGUS.com/whdata0.xml | perl -e '{use XML::Twig;
$twig=XML::Twig->new(twig_handlers =>{img => sub {
print $_[1]->att("src")."\n"; exit 0;}});
open(my $fh, "-");
$twig->parse($fh);
}'
/content02/groups/intranetcommon/documents/image/blk_logo.gif

I used wget instead of curl, but its just the same
#!/bin/bash
url='http://www.nichtlustig.de/rss/nichtrss.rss'
wget -O- -q "$url" | awk 'BEGIN{ RS="</a>" }
/<img src=/{
gsub(/.*<img src=\"/,"")
gsub(/\".[^>]*>/,"")
print
}' | xargs -i wget "{}"

Use a DOM parser and extract all img elements using getElementsByTagName. Then add them to a list/array, loop through and separately fetch them.
I would suggest using Python, but any language would have a DOM library.

#!/bin/sh
URL=$(curl http://foo.com/rss.xml | grep -E '<img src="http://www.foo.com/full/' | head -1 | sed -e 's/<img src="//' -e 's/" alt=""//' -e 's/width="400"//' -e 's/ height="400" \/>//' | sed 's/ //g')
curl -C - -O $URL
This totally does the job!
Any idea on the regex?

Here's a quick Python solution:
from BeautifulSoup import BeautifulSoup
from os import sys
soup = BeautifulSoup(sys.stdin.read())
print soup.findAll('img')[0]['src']
Usage:
$ curl http://www.google.com/`curl http://www.google.com | python get_img_src.py`
This works like a charm and will not leave you trying to find the magical regex that will parse random HTML (Hint: there is no such expression, especially not if you have a greedy matcher like sed.)

Related

Parsing HTML on the command line; How to capture text in <strong></strong>?

I'm trying to grab data from HTML output that looks like this:
<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....
I'm using a pipe train to whittle down the data to the targets I'm trying to hit. Here's my approach so far:
grep "/strong" output.html | awk '{print $1}'
Grep on "/strong" to get the lines with the targets; that works fine.
Pipe to 'awk '{print $1}'. That works in case #1 when the target has no spaces, but fails in case #2 when the target has spaces..only the first word is preserved as below:
<strong>Target1NoSpaces</strong><span
<strong>Target2
Do you have any tips on hitting the target properly, either in my awk or in different command? Anything quick and dirty (grep, awk, sed, perl) would be appreciated.
Try pup, a command line tool for processing HTML. For example:
$ pup 'strong text{}' < file.html
Target1NoSpaces
Target2 With Spaces
To search via XPath, try xpup.
Alternatively, for a well-formed HTML/XML document, try html-xml-utils.
One way using mojolicious and its DOM parser:
perl -Mojo -E '
g("http://your.web")
->dom
->find("strong")
->each( sub { if ( $t = shift->text ) { say $t } } )'
Using Perl regex's look-behind and look-ahead feature in grep. It should be simpler than using awk.
grep -oP "(?<=<strong>).*?(?=</strong>)" file
Output:
Target1NoSpaces
Target2 With Spaces
Add:
This implementation of Perl's regex's multi-matching in Ruby could match values in multiple lines:
ruby -e 'File.read(ARGV.shift).scan(/(?<=<strong>).*?(?=<\/strong>)/m).each{|e| puts "----------"; puts e;}' file
Input:
<strong>Target
A
B
C
</strong><strong>Target D</strong><strong>Target E</strong>
Output:
----------
Target
A
B
C
----------
Target D
----------
Target E
Here's a solution using xmlstarlet
xml sel -t -v //strong input.html
Trying to parse HTML without a real HTML parser is a bad idea. Having said that, here is a very quick and dirty solution to the specific example you provided. It will not work when there is more than one <strong> tag on a line, when the tag runs over more than one line, etc.
awk -F '<strong>|</strong>' '/<strong>/ {print $2}' filename
You never need grep with awk and the field separator doesn't have to be whitespace:
$ awk -F'<|>' '/strong/{print $3}' file
Target1NoSpaces
Target2 With Spaces
You should really use a proper parser for this however.
Since you tagged perl
perl -ne 'if(/(?:<strong>)(.*)(?:<\/strong>)/){print $1."\n";}' input.html
I am surprised no one mensions W3C HTML-XML-utils
curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
hxnormalize -x |
hxselect -s '\n' strong
output:
<strong class="fc-black-750 mb6">Stack Overflow
for Teams</strong>
<strong>Teams</strong>
To capture only content:
curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
hxnormalize -x |
hxselect -s '\n' -c strong
Stack Overflow
for Teams
Teams

sed replace text in a XML file

I have huge XML file with data like this:
<amount quantity="1">12.00</amount>
How can i replace the 12.00 with something else using sed?
Not really enough information in your question but to replace all values of 12.00 with say 24.00 you could do:
$ sed 's/>12\.00</>24.00</g' file.xml
If you are happy with the results you can store them back using the -i option:
$ sed -i 's/>12\.00</>24.00</g' file.xml
A more rubust match would be:
$ sed -r 's_(<amount quantity="[0-9]+">)12.00(</amount>)_\124.00\2_g' file.xml
But you should really parse the XML properly and not force regexp to do something it wasn't designed for.
script.sh:
#!/bin/bash
xml="<amount quantity="1">12.00</amount>"
newxml=`echo $xml | sed -n "s/\(<amount[^>]*>\)\([^<]*\)\(<\/amount>\)/\113.37\3/gp"`
echo "$newxml"
result:
$ ./script.sh
<amount quantity=1>13.37</amount>

AWK/SED. How to remove parentheses in simple text file

I have a text file looking like this:
(-9.1744438E-02,7.6282293E-02) (-9.1744438E-02,7.6282293E-02) ... and so on.
I would like to modify the file by removing all the parenthesis and a new line for each couple
so that it look like this:
-9.1744438E-02,7.6282293E-02
-9.1744438E-02,7.6282293E-02
...
A simple way to do that?
Any help is appreciated,
Fred
I would use tr for this job:
cat in_file | tr -d '()' > out_file
With the -d switch it just deletes any characters in the given set.
To add new lines you could pipe it through two trs:
cat in_file | tr -d '(' | tr ')' '\n' > out_file
As was said, almost:
sed 's/[()]//g' inputfile > outputfile
or in awk:
awk '{gsub(/[()]/,""); print;}' inputfile > outputfile
This would work -
awk -v FS="[()]" '{for (i=2;i<=NF;i+=2) print $i }' inputfile > outputfile
Test:
[jaypal:~/Temp] cat file
(-9.1744438E-02,7.6282293E-02) (-9.1744438E-02,7.6282293E-02)
[jaypal:~/Temp] awk -v FS="[()]" '{for (i=2;i<=NF;i+=2) print $i }' file
-9.1744438E-02,7.6282293E-02
-9.1744438E-02,7.6282293E-02
This might work for you:
echo "(-9.1744438E-02,7.6282293E-02) (-9.1744438E-02,7.6282293E-02)" |
sed 's/) (/\n/;s/[()]//g'
-9.1744438E-02,7.6282293E-02
-9.1744438E-02,7.6282293E-02
Guess we all know this, but just to emphasize:
Usage of bash commands is better in terms of time taken for execution, than using awk or sed to do the same job. For instance, try not to use sed/awk where grep can suffice.
In this particular case, I created a file 100000 lines long file, each containing characters "(" as well as ")". Then ran
$ /usr/bin/time -f%E -o log cat file | tr -d "()"
and again,
$ /usr/bin/time -f%E -ao log sed 's/[()]//g' file
And the results were:
05.44 sec : Using tr
05.57 sec : Using sed
cat in_file | sed 's/[()]//g' > out_file
Due to formatting issues, it is not entirely clear from your question whether you also need to insert newlines.

parsing a curl in the command line

I have this:
curl -H \"api_key:{key}\" http://api.wordnik.com/api/word.xml/dog/definitions
How do I parse this (within the commandline) to make it take whatever's in between <text> and </text> in this page?
$ curl ....... | awk -vRS="</text>" '/<text>/{ gsub(/.*<text/,""); print "->"$0}'
$ curl ....... | awk 'BEGIN{RS="</text>"}/<text>/{ gsub(/.*<text/,""); print "->"$0}'
Note, use GNU awk. (gawk)

How do I push `sed` matches to the shell call in the replacement pattern?

I need to replace several URLs in a text file with some content dependent on the URL itself. Let's say for simplicity it's the first line of the document at the URL.
What I'm trying is this:
sed "s/^URL=\(.*\)/TITLE=$(curl -s \1 | head -n 1)/" file.txt
This doesn't work, since \1 is not set. However, the shell is getting called. Can I somehow push the sed match variables to that subprocess?
The accept answer is just plain wrong. Proof:
Make an executable script foo.sh:
#! /bin/bash
echo $* 1>&2
Now run it:
$ echo foo | sed -e "s/\\(foo\\)/$(./foo.sh \\1)/"
\1
$
The $(...) is expanded before sed is run.
So you are trying to call an external command from inside the replacement pattern of a sed substitution. I dont' think it can be done, the $... inside a pattern just allows you to use an already existent (constant) shell variable.
I'd go with Perl, see the /e option in the search-replace operator (s/.../.../e).
UPDATE: I was wrong, sed plays nicely with the shell, and it allows you do to that. But, then, the backlash in \1 should be escaped. Try instead:
sed "s/^URL=\(.*\)/TITLE=$(curl -s \\1 | head -n 1)/" file.txt
Try this:
sed "s/^URL=\(.*\)/\1/" file.txt | while read url; do sed "s#URL=\($url\)#TITLE=$(curl -s $url | head -n 1)#" file.txt; done
If there are duplicate URLs in the original file, then there will be n^2 of them in the output. The # as a delimiter depends on the URLs not including that character.
Late reply, but making sure people don't get thrown off by the answers here -- this can be done in gnu sed using the e command. The following, for example, decrements a number at the beginning of a line:
echo "444 foo" | sed "s/\([0-9]*\)\(.*\)/expr \1 - 1 | tr -d '\n'; echo \"\2\";/e"
will produce:
443 foo