I'm having a little problem which looks very simple... but I just don't get it!
I'm trying to download the content of the website http://cspsp.gshi.org/ (if you try to access it via www.cspsp.gshi.org you get the wrong page).
For this I do the following in PowerShell:
(New-Object System.Net.WebClient).DownloadFile( 'http://cspsp.gshi.org/', 'save.htm' )
I can access the website with Firefox and download its contents easily, but PowerShell always outputs something like this:
The remote server returned an error: (404) Not Found. (translated from German)
I'm not sure what I'm doing wrong here. Other websites like Google just work fine.
It appears that the site relies on the User-Agent request header being sent by HTTP clients, and that System.Net.WebClient doesn't send even a default value (at least, it didn't when I hit my own local servers).
Either way, this worked for me:
$request = New-Object System.Net.WebClient
$request.Headers['User-Agent'] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.40 Safari/537.17"
$request.DownloadFile('http://cspsp.gshi.org/', 'saved.html')
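As a side note, on PowerShell 3.0 or later the same download can be done with Invoke-WebRequest, which also lets you override the agent explicitly; a quick sketch:

# Invoke-WebRequest (PowerShell 3.0+) with an explicit User-Agent
Invoke-WebRequest -Uri 'http://cspsp.gshi.org/' -OutFile 'saved.html' -UserAgent 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.40 Safari/537.17'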
My question: why does my Perl script, which succeeds when run from my home laptop, not work when run in the context of my hosting website? (Perhaps they have a firewall, for example. Perhaps my website needs to provide credentials. Perhaps this is in the realm of cross-site scripting. I don't know, and I'm asking for your help in understanding what the cause could be, and then the solution. Thanks!)
Note that everything works fine if I run the Perl script from my laptop at home.
But if I upload the Perl script to my web host, where I have a web page whose JavaScript successfully calls that script, the site whose URL is in the Perl script (finance.yahoo.com in this example) returns an error.
To bypass the JavaScript, I'm just typing the URL of my Perl script, e.g. http://example.com/blah/script.pl
Here is the full error message from finance.yahoo when $url starts with http:
Can't connect to finance.yahoo.com:80 nodename nor servname provided, or not known at C:/Perl/lib/LWP/Protocol/http.pm line 47.
Here is the full error message from finance.yahoo when $url starts with https:
Can't connect to finance.yahoo.com:443 nodename nor servname provided, or not known at C:/Perl/lib/LWP/Protocol/http.pm line 47.
Code:
#!/usr/bin/perl
use strict; use warnings;
use LWP 6; # one site suggested loading this "for all important LWP classes"
use HTTP::Request;
### sample of interest: to scrape historical data and feed massaged facts to my private web page via js ajax
my $url = 'http://finance.yahoo.com/quote/sbux/profile?ltr=1';
my $browser = LWP::UserAgent->new;
# one site suggested having this empty cookie jar could help
$browser->cookie_jar({});
# another site suggested I should provide WAGuess
my @ns_headers = (
'User-Agent' =>
# 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0',
'Accept' => 'text/html, */*',
'Accept-Charset' => 'iso-8859-1,*,utf-8',
'Accept-Language' => 'en-US',
);
my $response = $browser->get($url, @ns_headers);
# for now, I just want to confirm, in my web page itself, that
# the target web page's contents was returned
my $content = $response->content;
# show such content in my web page
print "Content-type: text/html\n\n" . $content;
Well, it is not obvious what your final goal is, and it is possible that you are overcomplicating the task.
You can retrieve the above-mentioned page with simpler Perl code:
#!/usr/bin/env perl
#
# vim: ai:ts=4:sw=4
#
use strict;
use warnings;
use feature 'say';
use HTTP::Tiny;
my $debug = 1;
my $url = 'https://finance.yahoo.com/quote/sbux/profile?ltr=1';
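# Note: fetching an https:// URL with HTTP::Tiny requires the optional
# IO::Socket::SSL and Net::SSLeay modules to be installed.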
my $response = HTTP::Tiny->new->get($url);
if ($response->{success}) {
    my $html = $response->{content};
    say $html if $debug;
}
In your post you indicated that JavaScript is somehow involved; it is not clear how, or what its purpose is in retrieving the page.
The error message refers to C:/Perl/lib/LWP/Protocol/http.pm line 47, which indicates that the web hosting takes place on a Windows machine; it would be nice to mention that in your question. Also, "nodename nor servname provided, or not known" means the hostname could not be resolved at all, which points toward DNS or proxy restrictions on the hosting machine rather than toward your LWP code.
Could you shed some light on the purpose of the following block in your code?
# WAGuess
$browser->env_proxy;
# WAGuess
$browser->cookie_jar({});
I do not see cookie_jar being used anywhere in your code.
Do you plan to use some authentication approach to extract data under your personal account that is not accessible otherwise?
Please state, in the first few sentences, what you are trying to achieve on a grand scale.
Perhaps it's about cookies, or about using Yahoo's "query" URL instead.
We have a proxy server here and all internet traffic goes through it. The command cpan package fails with the following error:
LWP failed with code[403] message[Browserblocked]
I think only specific browsers are let through the proxy server, so I need to set the user agent for cpan. Where can I set it? I don't see anything similar in o conf.
Rewriting the code of site\lib\LWP\UserAgent.pm from
sub _agent { "libwww-perl/$VERSION" }
to, say,
sub _agent { 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0' }
solves the problem, but is this really the official solution?
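(For a single script, as opposed to the cpan client itself, the agent can at least be set per instance without patching the installed module; a minimal sketch:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Set the agent on one LWP::UserAgent instance instead of editing
# site\lib\LWP\UserAgent.pm; this does not change what cpan itself uses.
my $ua = LWP::UserAgent->new(
    agent => 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0',
);
my $res = $ua->get('http://www.cpan.org/');
print $res->status_line, "\n";

But that still leaves the question open of where cpan itself reads its user agent from.)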
I'm switching from regular files to zip files for an upload, and was told I'd need to use a header in this format: Content-Type: application/zip.
So, I can get my file to upload properly via curl with the following:
curl --verbose --header "Content-Type: application/zip" --data-binary @C:\Junk\test.zip "http://client.xyz.com/submit?username=test@test.com&password=testpassword&job=test"
However, when I write a simple PowerShell script to do the same thing, I run into problems: the data isn't loaded. I don't know how to get a good error message returned, so I don't know the details, but the bottom line is that the data isn't getting in.
$FullFileName = "C:\Junk\test.zip"
$wc = new-object System.Net.WebClient -Verbose
$wc.Headers.Add("Content-Type: application/zip")
$URL = "http://client.xyz.com/submit?username=test#test.com&password=testpassword&job=test"
$wc.UploadFile( $URL, $FullFileName )
# $wc.UploadData( $URL, $FullFileName )
I've tried using UploadData instead of UploadFile, but that doesn't appear to work either.
Thanks,
Sylvia
I don't necessarily have a solution, but I think the issue is that you are trying to upload a binary file using the WebClient object. You most likely need the UploadData method, but I think you are going to have to read the zip file into an array of bytes to upload it. That I'm not sure of off the top of my head.
If you haven't, be sure to look at the MSDN docs for this class and its methods: http://msdn.microsoft.com/en-us/library/system.net.webclient_methods.aspx
Now that I look at it again, I think you need $wc.Headers.Add("Content-Type", "application/zip"), because the collection is key/value paired. Check out this SO question:
WebClient set headers
Also, if you're still having issues, you might need to add a user agent header. I think curl has its own.
$userAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2;)"
$wc.Headers.Add("user-agent", $userAgent)
I want to access a webpage but I'm getting a 403 failure for THIS URL.
But when I access using Firefox it shows HTTP 200 OK.
This is the code I'm using to access it:
my $agent = LWP::UserAgent->new(env_proxy => 1,keep_alive => 1, timeout => 30, agent => "Mozilla/5.0");
my $header = HTTP::Request->new(GET => $link);
my $request = HTTP::Request->new('GET', $link, $header);
my $response = $agent->request($request);
if ($response->is_success){
........
Your code worked fine on my system accessing one of my own sites. I would guess that the website you are hitting is allergic to automated requests. The user agent you are using is very minimal, and they may reject anything that does not look real. Here is a more genuine agent string:
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.71 Safari/534.24"
I'm trying to output an image from RRD Tool using Perl. I've posted the relevant part of the CGI script below:
sub graph
{
my $rrd_path = $co->param('rrd_path');
my $RRD_DIR = "../data/";
#generate a PNG from the RRD
my $png_filename = "-"; # a '-' as the filename sends the PNG to stdout
my $rrd = "$RRD_DIR/$rrd_path";
my $png = `rrdtool graph $png_filename -a PNG -r -l 0 --base 1024 --start -151200 --vertical-label 'bits per second' --width 500 --height 200 DEF:bytesInPerSec=$rrd:bytesInPerSec:AVERAGE DEF:bytesOutPerSec=$rrd:bytesOutPerSec:AVERAGE CDEF:sbytesInPerSec=bytesInPerSec,8,* CDEF:sbytesOutPerSec=bytesOutPerSec,8,* AREA:sbytesInPerSec#00cf00:AvgIn LINE1:sbytesOutPerSec#002a97:AvgOut VRULE:1246428000#ff0000:`;
#print the image header
use bytes;
print $co->header(-type=>"image/png",-Content_length=>length($png));
binmode STDOUT;
print $png;
}#end graph
This works fine on the command line (perl graph.cgi > test.png), with the header line commented out, of course, and it also works on my Ubuntu 10.04 development machine. However, when I move it to the CentOS 5 production server, it doesn't, and the browser receives a content length of 0:
Ubuntu 10.04/Apache:
Request URL:http://noc-student.nmsu.edu/grasshopper/web/graph.cgi
Request Method:GET
Status Code:200 OK
Request Headers
Accept:application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Cache-Control:max-age=0
User-Agent:Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.36 Safari/534.7
Response Headers
Connection:Keep-Alive
Content-Type:image/png
Content-length:12319
Date:Fri, 08 Oct 2010 21:40:05 GMT
Keep-Alive:timeout=15, max=97
Server:Apache/2.2.14 (Ubuntu)
And from the CentOS 5/Apache server:
Request URL:http://grasshopper-new.nmsu.edu/grasshopper/branches/michael_dev/web/graph.cgi
Request Method:GET
Status Code:200 OK
Request Headers
Accept:application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Cache-Control:max-age=0
User-Agent:Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.36 Safari/534.7
Response Headers
Connection:close
Content-Type:image/png
Content-length:0
Date:Fri, 08 Oct 2010 21:40:32 GMT
Server:Apache/2.2.3 (CentOS)
The use bytes and manual setting of the content length are in there to try to fix the problem, but it's the same without them. Same with setting binmode on STDOUT. The script works fine from the command line on both machines.
See my answer to How can I troubleshoot my Perl CGI program? Typically, the difference between running your program on the command line and from the web server is a matter of different environments. In this case I'd expect that either rrdtool is not in the path or the web server user can't run it.
The backticks only capture standard output. There is probably some standard error output in the web server error log.
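A quick way to see that output from the CGI itself is to redirect stderr into the captured text and check the exit status; a sketch, dropped into the existing sub (the arguments are abbreviated, substitute the full rrdtool invocation from the question):

# capture stderr along with stdout, and fail loudly on a non-zero exit,
# so rrdtool's own error message becomes visible
my $png = `rrdtool graph - -a PNG 2>&1`;   # abbreviated arguments
die "rrdtool failed (exit " . ($? >> 8) . "): $png" if $?;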
Are you sure your web user has access to your data? Try having the CGI write the PNG to the filesystem, so you can make sure it's generated properly. If it is, the problem is in the transmission (headers, encodings, etc.); if not, it's unrelated to the web server and probably related to permissions.
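For example, right after the backticks you could add something like this (the path is just an example):

# dump the PNG to disk so it can be inspected independently of the browser
open my $fh, '>', '/tmp/graph-test.png' or die "can't write test file: $!";
binmode $fh;
print {$fh} $png;
close $fh;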