I want to download web images from the command line.
This works fine sometimes, other times it doesn't and I can't figure out why.
Here's an example (Wikimedia Commons picture of the day):
wget https://commons.wikimedia.org/wiki/Main_Page#/media/File:01_Calanche_Piana.jpg
This somehow gets me an .html file:
HTTP request sent, awaiting response... 200 OK
Length: 185986 (182K) [text/html]
Saving to: 'Main_Page'
The following, however (the same picture, but with an explicitly selected resolution), gets me a .jpg, which is what I want:
wget https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/01_Calanche_Piana.jpg/640px-01_Calanche_Piana.jpg
...
HTTP request sent, awaiting response... 200 OK
Length: 118796 (116K) [image/jpeg]
Saving to: '640px-01_Calanche_Piana.jpg'
I tried adding -O test.jpg to the first example, but the result is still an HTML file.
Does anyone know why the command works in one case but not in the other?
why the command works in one case but not in the other?
This one
https://commons.wikimedia.org/wiki/Main_Page#/media/File:01_Calanche_Piana.jpg
is, despite what its ending might suggest, a link to an HTML page. Note the #, which introduces a URI fragment: everything after it is handled by the client and is never sent to the server, so wget only requests https://commons.wikimedia.org/wiki/Main_Page. This one, on the other hand,
https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/01_Calanche_Piana.jpg/640px-01_Calanche_Piana.jpg
is the URL of the actual image. If you are wondering what type of file is behind a given URL but do not want to download it, you can run
wget -S --spider https://www.example.com
This prints the response headers. There may be many of them, but Content-Type is enough to determine the type of resource.
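To see the fragment handling in isolation, here is a minimal Perl sketch, assuming the URI module from CPAN is installed. The helper name request_url is made up for illustration; it only shows which part of the URL a client would actually send to the server.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: return the URL an HTTP client would actually request,
# i.e. the same URL with any "#fragment" part removed.
sub request_url {
    my ($url) = @_;
    my $uri = URI->new($url);
    $uri->fragment(undef);    # drop the fragment; it is never sent to the server
    return $uri->as_string;
}

my $page = 'https://commons.wikimedia.org/wiki/Main_Page#/media/File:01_Calanche_Piana.jpg';

print URI->new($page)->fragment, "\n";    # /media/File:01_Calanche_Piana.jpg
print request_url($page), "\n";           # https://commons.wikimedia.org/wiki/Main_Page
```

The second line of output matches the 'Main_Page' file name that wget reported when saving.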
I was testing some code and parsing XML was included. For simple testing I requested / of my localhost and the response was my Apache2 default page.
So far, so good.
The response is XHTML and therefore XML, so I used it for my parsing test (about 11 KB in size).
XML::LibXML->load_xml (string => $response);
It takes about 16 s to finish, with no error.
If I give it another XML file of double the size, it takes virtually no time.
So... why?
Apache/2.4.10
Debian/8.6
XML::LibXML/2.0128
EDIT
I should mention that I removed the non-XML HTTP headers.
So the string starts with
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
and ends with
</html>
EDIT
Link: http://s000.tinyupload.com/index.php?file_id=88759644475809123183
One possibility is that each time you parse the document the parser is downloading the DTD from W3C. You could confirm this using strace or similar tools depending on your platform.
The DTD contains (among other things) the named entity definitions, which map, for example, the string &nbsp; to the character U+00A0. So in order to parse HTML documents the parser does need the DTD; however, fetching it via HTTP each time is obviously not a good idea.
One approach is to install a copy of the DTD locally and use that. On Debian/Ubuntu systems you can just install the w3c-dtd-xhtml package which also sets up the appropriate XML catalog entries to allow libxml to find it.
Another approach is to use XML::LibXML->load_html instead of XML::LibXML->load_xml. In HTML parsing mode, the parser is more forgiving of markup errors and I think also always uses a local copy of the DTD.
The parser also provides options which allow you to specify your own handler routine for retrieving reference URIs.
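If the document does not rely on named entities, another option (not mentioned above, so treat it as an assumption about what fits your case) is to tell the parser not to load the external DTD at all. The sketch below uses the load_ext_dtd and no_network parser options of XML::LibXML; the sample document is a made-up stand-in for the Apache default page.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in for the fetched page: XHTML with the same DOCTYPE,
# but without any named entities such as &nbsp;.
my $xhtml = <<'END';
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Test page</title></head>
<body><p>It works</p></body>
</html>
END

# Parse a string without fetching the DTD named in its DOCTYPE.
sub parse_without_dtd {
    my ($string) = @_;
    return XML::LibXML->load_xml(
        string       => $string,
        load_ext_dtd => 0,    # do not load the external DTD
        no_network   => 1,    # refuse network access while parsing
    );
}

my $doc = parse_without_dtd($xhtml);
print $doc->documentElement->nodeName, "\n";    # html
```

With the DTD skipped, entity references such as &nbsp; cannot be resolved, so for pages that use them a local DTD copy or load_html is probably the better route.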
I am new to LWP and thanks for all the help. I am writing a small perl script to log into a website and download a file. The process works perfectly fine with a browser but not through LWP. With a browser the process is
1. Log into the website via authentication (username, password).
2. Upon successful login, the website loads another page.
3. One can then access the Downloads page and download the file.
If one is not logged in and tries to access the download page, the website loads the Registration page to create a login.
This process works perfectly fine with a browser. The URL and user/pass are real, so you can try this on the website with the details in the code.
With a script, however, I get a success code but the website does not allow access to steps 2 or 3. Instead of the file, the Registration page gets downloaded. I suspect this means that the login is not working with the script.
All help in making this work will be greatly appreciated.
Code below:
#!/usr/bin/perl -w
use strict;
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request;
use WWW::Mechanize;

my $base_url = "http://www.eoddata.com/default.aspx";
my $username = 'xcytt';
my $password = '321pass';

# create a cookie jar on disk
my $cookies = HTTP::Cookies->new(
    file     => 'cookies1.txt',
    autosave => 1,
);

my $http = LWP::UserAgent->new();
$http->cookie_jar($cookies);

my $login = $http->post(
    'http://www.eoddata.com/default.aspx',
    Content => [
        username => $username,
        password => $password,
    ]
);

# check if log in succeeded
if ( $login->is_success ) {
    print "The response from server is " . $login->status_line . "\n\n";
    print "The headers in the response are \n" . $login->headers()->as_string() . "\n\n";
    print "Logged in Successfully\n\n";
    print "Printing cookies after successful login\n\n";
    print $http->cookie_jar->as_string() . "\n";

    my $url = "http://www.eoddata.com/Data/symbollist.aspx?e=NYSE";
    print "Now trying to download " . $url . "\n\n";

    # make request to download the file
    my $file_req = HTTP::Request->new( 'GET', $url );
    print "Printing cookies before file download request\n\n";
    print $http->cookie_jar->as_string() . "\n";
    my $get_file = $http->request($file_req);

    # check request status
    if ( $get_file->is_success ) {
        print "The response from server is " . $get_file->status_line . "\n\n";
        print "The headers in the response are " . $get_file->headers()->as_string() . "\n\n";
        print "Downloaded $url, saving it to file ...\n\n";
        open my $fh, '>', 'tmp_NYSE.txt' or die "ERROR: $!\n";
        print $fh $get_file->decoded_content;
        close $fh;
    } else {
        print "File Download failure\n";
    }
} else {
    print "Login Error\n";
}
Output from the script:
The response from server is 200 OK
The headers in the response are
Cache-Control: private
Date: Sun, 12 Oct 2014 17:43:47 GMT
Server: Microsoft-IIS/7.5
Content-Length: 39356
Content-Type: text/html; charset=utf-8
Client-Date: Sun, 12 Oct 2014 17:43:48 GMT
Client-Peer: 64.182.238.14:80
Client-Response-Num: 1
Link: <styles/jquery-ui-1.10.0.custom.min.css>; rel="stylesheet"; type="text/css"
Link: <styles/main.css>; rel="stylesheet"; type="text/css"
Link: <styles/button.css>; rel="stylesheet"; type="text/css"
Link: <styles/nav.css>; rel="stylesheet"; type="text/css"
Link: </styles/colorbox.css>; rel="stylesheet"; type="text/css"
Link: </styles/slides.css>; rel="stylesheet"; type="text/css"
Set-Cookie: ASP.NET_SessionId=cjgm4oscl1xmlzwnzql4gcns; path=/; HttpOnly
Title: End of Day Stock Quote Data and Historical Stock Prices
X-AspNet-Version: 4.0.30319
X-Meta-Description: Free end of day stock market data and historical quotes for many of the world's top exchanges including NASDAQ, NYSE, AMEX, TSX, OTCBB, FTSE, SGX, HKEX, and FOREX.
X-Meta-Keywords: metastock eod,free eod,free eod data,eod download,stock,exchange,data,historical stock quotes,free,historical share prices,download,day,end,prices,market,chart,NYSE,NASDAQ,AMEX,FTSE,FOREX,ASX,SGX,NZSE,tsx stock,stock share prices,stock ticker symbol,daily prices,daily stock,historic stock price,stock futures
X-Meta-Verify-V1: cT9ZK5uSlR3GrcasqgUh7Yh3fnuRGsRY1IRvE85ffa0=
X-Powered-By: ASP.NET
Logged in Successfully
Printing cookies after successful login
Set-Cookie3: ASP.NET_SessionId=cjgm4oscl1xmlzwnzql4gcns; path="/"; domain=www.eoddata.com; path_spec; discard; HttpOnly; version=0
Now trying to download http://www.eoddata.com/Data/symbollist.aspx?e=NYSE
Printing cookies before file download request
Set-Cookie3: ASP.NET_SessionId=cjgm4oscl1xmlzwnzql4gcns; path="/"; domain=www.eoddata.com; path_spec; discard; HttpOnly; version=0
The response from server is 200 OK
The headers in the response are Cache-Control: private
Date: Sun, 12 Oct 2014 17:43:48 GMT
Server: Microsoft-IIS/7.5
Content-Length: 49880
Content-Type: text/html; charset=utf-8
Client-Date: Sun, 12 Oct 2014 17:43:49 GMT
Client-Peer: 64.182.238.14:80
Client-Response-Num: 1
Link: <styles/jquery-ui-1.10.0.custom.min.css>; rel="stylesheet"; type="text/css"
Link: <styles/main.css>; rel="stylesheet"; type="text/css"
Link: <styles/button.css>; rel="stylesheet"; type="text/css"
Link: <styles/nav.css>; rel="stylesheet"; type="text/css"
Title: Member Registration
X-AspNet-Version: 4.0.30319
X-Meta-Description: Register now for Free end of day stock market data and historical quotes for many of the world's top exchanges including NASDAQ, NYSE, AMEX, TSX, OTCBB, FTSE, ASX, SGX, HKEX, and FOREX.
X-Meta-Keywords: metastock eod,free eod,free eod data,eod download,stock,exchange,data,historical stock quotes,free,download,day,end,prices,market,chart,NYSE,NASDAQ,AMEX,FTSE,FOREX,ASX,SGX,NZSE,tsx stock,stock share prices,stock ticker symbol,daily prices,daily stock,historic stock price
X-Powered-By: ASP.NET
Downloaded http://www.eoddata.com/Data/symbollist.aspx?e=NYSE, saving it to file ...
The header from the browser is:
http://www.eoddata.com/myaccount/default.aspx
GET /Data/symbollist.aspx?e=NYSE HTTP/1.1
Host: www.eoddata.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:32.0) Gecko/20100101 Firefox/32.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: ASP.NET_SessionId=uvnqhzpzco1wpe300egm4hqj; __utma=264658075.1162754774.1412987203.1413069850.1413137050.4; __utmc=264658075; __utmz=264658075.1412987203.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); _cb_ls=1; _chartbeat2=DMtSRyBOnGNFDptR86.1412466246942.1413137060190.10011111; _chartbeat_uuniq=3; EODDataAdmin=D838F9AA985E247A47493320CC8DC14950FA6CE49C6E1079DCFA95F632CEA7A2A6A691B352C544D41D0C208077D0C23897C9EA6EF0FE9221833A7131C334A657A48F5001BF2EBDE073D98BE4FD5719943AAC94D7C3DAA5A422FD575C663C337C93D5046AF3F7987998EDD60347531460FC54DEC81394352D9EDA00B7C954CC3304BC7D4C30D1F3A82C0EE58B890E0765; __utmb=264658075.2.10.1413137050; __utmt=1
Connection: keep-alive
HTTP/1.1 200 OK
Cache-Control: private
Transfer-Encoding: chunked
Content-Type: text/plain; charset=utf-8
Server: Microsoft-IIS/7.5
Content-Disposition: attachment;filename=NYSE.txt
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Sun, 12 Oct 2014 18:05:24 GMT
A snippet of the downloaded file, which is NOT the output I want, is below. Note that the title is "Member Registration"; this is the registration page instead of the data file I am expecting.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><link rel="stylesheet" href="styles/jquery-ui-1.10.0.custom.min.css" type="text/css" /><link rel="stylesheet" href="styles/main.css" type="text/css" /><link rel="stylesheet" href="styles/button.css" type="text/css" /><link rel="stylesheet" href="styles/nav.css" type="text/css" />
<script src="../scripts/jquery-1.9.0.min.js" type="text/javascript"></script>
<script src="../scripts/jquery-ui-1.10.0.custom.min.js" type="text/javascript"></script>
<script type="text/javascript"> var _sf_startpt = (new Date()).getTime()</script>
<meta name="keywords" content="metastock eod,free eod,free eod data,eod download,stock,exchange,data,historical stock quotes,free,download,day,end,prices,market,chart,NYSE,NASDAQ,AMEX,FTSE,FOREX,ASX,SGX,NZSE,tsx stock,stock share prices,stock ticker symbol,daily prices,daily stock,historic stock price" />
<meta name="description" content="Register now for Free end of day stock market data and historical quotes for many of the world's top exchanges including NASDAQ, NYSE, AMEX, TSX, OTCBB, FTSE, ASX, SGX, HKEX, and FOREX." />
<title>
Member Registration
</title></head>
Most of those use statements are unnecessary, as LWP will generally pull in any modules that it needs.
If you are using LWP::UserAgent then you certainly don't need LWP::Simple or WWW::Mechanize, and by default LWP will create an in-memory HTTP::Cookies object.
The problem is most likely that the HTML that you are fetching from the web site contains JavaScript code that modifies it after it is retrieved. LWP won't emulate that for you, so the page remains just as it was sent from the web site.
There is no good solution to this, but WWW::Mechanize::Firefox allows you to drive an installed Firefox browser from Perl code, and will do what you need.
Your login code isn't logging you in: the data you are posting doesn't resemble the input that the login form takes.
Using WWW::Mechanize's mech-dump to examine the contents of the form at http://www.eoddata.com/default.aspx shows the following:
POST http://www.eoddata.com/default.aspx [aspnetForm]
ctl00_tsm_HiddenField= (hidden readonly)
__VIEWSTATE=/wEPDwUJNTgzMTIzMjMyD2QWAmYPZBYCAgMPZBYCAgcPZBYCAh0PZBYEAgMPZBYCAgcPDxYCHgRUZXh0ZWRkAgcPDxYCHgdWaXNpYmxlaGRkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBRpjdGwwMCRjcGgxJGxnMSRjaGtSZW1lbWJlcuq72b0jSSSEoSOAcZlLZzWMmsYqjOMTbPl/Op1ToVKf (hidden readonly)
__VIEWSTATEGENERATOR=CA0B0334 (hidden readonly)
__PREVIOUSPAGE=72Ep8BrmYqNbOSb65afxljULshovHpRLBJcMC0funBrM2g0qkkpORQb_wqNsu_2SbA5JbxbwNkpXlR_SZWwgPwwbGdBP4YGDoNJCDtPRQS81 (hidden readonly)
__EVENTVALIDATION=/wEdAAvsaJw1zF2h8PWbp8tJHjaFx+CzKn9gssNaJswg1PWksJd223BvmKj73tdq9M98Zo0JWPh42opnSCw9zAHys7YwDyn98qMl4Da8RNKOYtjmMtj1Nek/A8Dky1WNDflwB7GO1vgbcIR7aON1c4Cm5wJw0r2yvex8d7TohORX6QMo1j8IRvmRE3IYRPV0S4fj4csX1838LMsOJxqMoksh8zNIRuOmXf1pY8AyXSwvWgp1mYRx4mHFI6oep3qpPKhhA22Mc6tB5KOFIqkGgyvucIby (hidden readonly)
ctl00$Menu1$s1$txtSearch= (text)
ctl00$Menu1$s1$btnSearch=Search (submit)
ctl00$cph1$btns1=CLICK HERE (submit)
ctl00$cph1$btns2=CLICK HERE (submit)
ctl00$cph1$btns3=CLICK HERE (submit)
ctl00$cph1$lg1$txtEmail= (text)
ctl00$cph1$lg1$txtPassword= (password)
ctl00$cph1$lg1$chkRemember=<UNDEF> (checkbox) [*<UNDEF>/off|on]
ctl00$cph1$lg1$btnLogin=Login (submit)
Your POST request needs to set the appropriate fields from the form above to successfully log in to the server, unless there is documentation somewhere that specifically says that the method you are using to login is valid (I did not do a search of the website to check this).
I cheated somewhat and created a valid login request using data from Chrome's inspector panel (rather than using WWW::Mechanize to populate the form or creating the request myself). With this, I was able to login and download the file:
my $resp = $http->post(
'http://www.eoddata.com/default.aspx',
Content => 'ctl00_tsm_HiddenField=&__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUJNTgzMTIzMjMyD2QWAmYPZBYCAgMPZBYCAgcPZBYCAh0PZBYEAgMPZBYCAgcPDxYCHgRUZXh0ZWRkAgcPDxYCHgdWaXNpYmxlaGRkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBRpjdGwwMCRjcGgxJGxnMSRjaGtSZW1lbWJlcuq72b0jSSSEoSOAcZlLZzWMmsYqjOMTbPl%2FOp1ToVKf&__VIEWSTATEGENERATOR=CA0B0334&__PREVIOUSPAGE=72Ep8BrmYqNbOSb65afxljULshovHpRLBJcMC0funBrM2g0qkkpORQb_wqNsu_2SbA5JbxbwNkpXlR_SZWwgPwwbGdBP4YGDoNJCDtPRQS81&__EVENTVALIDATION=%2FwEdAAvsaJw1zF2h8PWbp8tJHjaFx%2BCzKn9gssNaJswg1PWksJd223BvmKj73tdq9M98Zo0JWPh42opnSCw9zAHys7YwDyn98qMl4Da8RNKOYtjmMtj1Nek%2FA8Dky1WNDflwB7GO1vgbcIR7aON1c4Cm5wJw0r2yvex8d7TohORX6QMo1j8IRvmRE3IYRPV0S4fj4csX1838LMsOJxqMoksh8zNIRuOmXf1pY8AyXSwvWgp1mYRx4mHFI6oep3qpPKhhA22Mc6tB5KOFIqkGgyvucIby&ctl00%24Menu1%24s1%24txtSearch=&ctl00%24cph1%24lg1%24txtEmail=xcytt&ctl00%24cph1%24lg1%24txtPassword=321pass&ctl00%24cph1%24lg1%24btnLogin=Login' );
if ($resp->is_success) {
    my $get_file = $http->get("http://www.eoddata.com/Data/symbollist.aspx?e=NYSE");
}
Dumping the contents of $get_file gave me the list of symbols and company names as expected.
You can use WWW::Mechanize to fill in the form fields, or you can scrape the form input values from http://www.eoddata.com/default.aspx (particularly the hidden fields, which change on every page load) and then create a POST request using those values and your login credentials.
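The scraping approach can be sketched with HTML::Form, which LWP-based tools commonly use for this. This is only an illustration under assumptions: the HTML below is a heavily trimmed, made-up stand-in for the real login page (the real __VIEWSTATE value is much longer and changes per page load), and build_login_request is a hypothetical helper name.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical, heavily trimmed stand-in for the login page HTML.
my $html = <<'END';
<form id="aspnetForm" method="post" action="default.aspx">
<input type="hidden" name="__VIEWSTATE" value="abc123" />
<input type="text" name="ctl00$cph1$lg1$txtEmail" />
<input type="password" name="ctl00$cph1$lg1$txtPassword" />
<input type="submit" name="ctl00$cph1$lg1$btnLogin" value="Login" />
</form>
END

# Parse the form, keep its hidden fields as they are, fill in the
# credentials, and return the HTTP::Request that "clicking" Login produces.
sub build_login_request {
    my ($page_html, $base_url, $user, $pass) = @_;
    my ($form) = HTML::Form->parse($page_html, $base_url);
    $form->value('ctl00$cph1$lg1$txtEmail',    $user);
    $form->value('ctl00$cph1$lg1$txtPassword', $pass);
    return $form->click('ctl00$cph1$lg1$btnLogin');
}

my $req = build_login_request(
    $html, 'http://www.eoddata.com/default.aspx', 'xcytt', '321pass'
);
print $req->method, ' ', $req->uri, "\n";
print $req->content, "\n";
```

In a real script the HTML would come from a GET of the login page with the same user agent, and the returned request would be sent with $http->request($req) so that the session cookie is reused.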
Also note that it is perfectly possible to get a successful response from the server without performing the action (e.g. login) that you were intending. Redirects and pages with "Login failed" will both be counted as success by LWP::UA.
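One way to guard against this is to check what actually came back instead of relying on is_success alone. The sketch below is an assumption about what a reasonable check might be, based on the browser headers shown in the question (text/plain with a Content-Disposition attachment); looks_like_data_file is a hypothetical helper and the response bodies are invented test data.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical check: does this response look like the data file
# rather than an HTML page such as "Member Registration"?
sub looks_like_data_file {
    my ($resp) = @_;
    return 0 unless $resp->is_success;
    my $disposition = $resp->header('Content-Disposition') // '';
    return 1 if $disposition =~ /attachment/i;
    return $resp->content_type eq 'text/plain' ? 1 : 0;
}

# Two canned responses modelled on the headers in the question.
my $file = HTTP::Response->new(
    200, 'OK',
    [
        'Content-Type'        => 'text/plain; charset=utf-8',
        'Content-Disposition' => 'attachment;filename=NYSE.txt',
    ],
    "Symbol\tDescription\n",    # invented body
);
my $page = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=utf-8' ],
    '<title>Member Registration</title>',
);

print looks_like_data_file($file) ? "file\n" : "not a file\n";    # file
print looks_like_data_file($page) ? "file\n" : "not a file\n";    # not a file
```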
In case anyone is still interested in this problem, I have taken another look at it and found that it is quite workable using just LWP. However, the facilities of WWW::Mechanize make it much simpler to work with HTML forms.
Here's a program that logs in to the page using the credentials provided. Being an ASP page, it has dreadful input names. For instance, the names of the username and password fields and the login button are ctl00$cph1$lg1$txtEmail, ctl00$cph1$lg1$txtPassword, and ctl00$cph1$lg1$btnLogin respectively. I have used the HTML::Form methods directly to locate these input fields using regular expressions, which I think makes the code much clearer.
I have displayed the title of the HTML page that is reached after logging in to demonstrate that it is working.
use strict;
use warnings;
use WWW::Mechanize;
my $base_url = 'http://www.eoddata.com/default.aspx';
my $username = 'xcytt';
my $password = '321pass';
my $mech = WWW::Mechanize->new;
$mech->get($base_url);
my $form = $mech->form_id('aspnetForm');
my @inputs = $form->inputs;
my ($email) = grep $_->name =~ /Email/, @inputs;
my ($pass) = grep $_->name =~ /Password/, @inputs;
my ($login) = grep $_->name =~ /Login/, @inputs;
$email->value($username);
$pass->value($password);
$mech->click_button(value => 'Login');
print $mech->title, "\n";
output
EODData - My Download
I have written a perl script that feeds data into a web service.
I have some system tests for the perl script that check that I can interact with the web service, and these work just fine, but I do not want to be running system tests when I make small changes. I want to run unit tests.
So far I have written a subclass of my importer that intercepts the web requests before it actually calls the URL in question and tests that all the inputs are of the right type and form. This works fine in all cases except where the perl script needs to read the response for instructions and then proceed to the next steps.
My problem is that I cannot fake a response object.
I've tried using HTTP::Response->new, but it keeps complaining about bad header arguments.
How do I best fake a response object?
There is no need to mock the HTTP::Response object. They are easy to construct, at least as easy as mocking would be, and less likely to introduce bugs into the tests. You need to read the documentation and not just guess at usage.
You can construct them in code, of course, but what I've done in the past more than once is just save the output of curl, or a stringified response from a request that was made against an application, and parse it back into an object.
Try playing around with these:
use warnings;
use strict;
use HTTP::Response;
my $response = HTTP::Response->new(204);
print $response->as_string;
my $other = HTTP::Response->parse(join "", <DATA>);
print $other->decoded_content, $/;
__DATA__
HTTP/1.1 200 OK
Cache-Control: public, max-age=53
Content-Type: text/html; charset=utf-8
Expires: Wed, 06 Jul 2011 19:13:54 GMT
Last-Modified: Wed, 06 Jul 2011 19:12:54 GMT
Vary: *
Date: Wed, 06 Jul 2011 19:12:59 GMT
Content-Length: 198121

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Stack Overflow</title>
</head>
<body class="home-page">
<blockquote>O HAI!</blockquote>
</body>
</html>
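For the "bad header arguments" complaint in particular, it may help to see the constructor called with headers. As far as I know, the third argument to HTTP::Response->new has to be an array reference of name/value pairs or an HTTP::Headers object; a plain string there triggers that error. A minimal sketch, where fake_response and the body text are hypothetical:

```perl
use strict;
use warnings;
use HTTP::Response;
use HTTP::Status qw(status_message);

# Hypothetical helper: build a canned response for a unit test.
# The header argument is an array reference of name/value pairs.
sub fake_response {
    my ($code, $content_type, $body) = @_;
    return HTTP::Response->new(
        $code,
        status_message($code),                  # e.g. "OK" for 200
        [ 'Content-Type' => $content_type ],
        $body,
    );
}

my $fake = fake_response(200, 'text/plain', "next_step=upload\n");
print $fake->status_line, "\n";    # 200 OK
print $fake->content;              # next_step=upload
```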
You may be looking for mock objects - in this case a mock LWP object?
See Test::Mock::LWP on CPAN.
Its documentation shows usage like this:
use Test::Mock::LWP;
# Setup fake response content and code
$Mock_response->mock( content => sub { 'foo' } );
$Mock_resp->mock( code => sub { 201 } );
# Validate args passed to request constructor
is_deeply $Mock_request->new_args, \@expected_args;
# Validate request headers
is_deeply [ $Mock_req->next_call ],
[ 'header', [ 'Accept', 'text/plain' ] ];
# Special User Agent Behaviour
$Mock_ua->mock( request => sub { die 'foo' } );
If you search CPAN for Test::Mock, there are quite a few modules for mocking/faking objects for testing.
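If pulling in a mocking module feels like too much, a hand-rolled alternative is a small LWP::UserAgent subclass that never touches the network. This is only a sketch under assumptions: it presumes your importer lets you inject the user agent object, and the class name FakeUA, the fake_seen key, the URL, and the canned body are all made up for illustration.

```perl
use strict;
use warnings;
use HTTP::Response;

package FakeUA;
use parent 'LWP::UserAgent';

# Override request(): record what was asked for and return a canned
# response instead of going out to the network.
sub request {
    my ($self, $request) = @_;
    push @{ $self->{fake_seen} }, $request;    # hypothetical bookkeeping key
    my $response = HTTP::Response->new(
        200, 'OK',
        [ 'Content-Type' => 'text/plain' ],
        "next_step=upload\n",                  # invented instruction for the importer
    );
    $response->request($request);
    return $response;
}

package main;

my $ua   = FakeUA->new;
my $resp = $ua->get('http://service.invalid/status');

print $resp->content;                           # next_step=upload
print $ua->{fake_seen}[0]->uri, "\n";           # http://service.invalid/status
```

Convenience methods such as get and post end up calling request, so code under test that uses them should receive the canned response, and the recorded requests can then be checked in the test.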