Icon Image Proxy in Perl

I'm trying to figure out the right way to serve icon files for our site listings. Basically, an icon for a listing can come from an image file hosted by a handful of different services (Flickr, Picasa, Google Static Maps, our own internal image hosting service, etc.). The URL of the icon is stored in our database, so I'd like each listing icon to be displayed by simply calling:
http://www.example.com/listing/1234/icon
Currently I use CGI.pm to redirect to the correct icon URL; however, I want the file to be served directly rather than sending the user through a redirect. Here is the code we've been using:
my $url = "http://www.example-service.com/image-123.gif";
print $query->redirect(-url=>$url);
I would appreciate any suggestions and code examples on how I could update this to serve the file via a proxy without redirecting the user. Thanks in advance for your help!

Use LWP to get the remote file and print it out.
#!/usr/local/bin/perl
use strict;
use warnings;

use LWP::UserAgent;
use HTTP::Request;
use CGI;

my $q  = CGI->new;
my $ua = LWP::UserAgent->new;
$ua->agent("MyApp/0.1");

my $url = 'http://www.example-service.com/image-123.gif';

# Create and send the request for the remote image
my $req = HTTP::Request->new( GET => $url );
my $res = $ua->request($req);

if ( $res->is_success ) {
    # Pass the remote content type through and emit the raw image bytes
    print $q->header( $res->content_type );
    print $res->content;
}
else {
    # Report the upstream failure as plain text
    print $q->header( 'text/plain', $res->status_line );
    print $res->status_line, "\n";
}
Alternatively, you could write a trigger for your database that downloads the image for the listing when a new listing is added and stores it either somewhere in the webroot or in the database itself.
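As a rough sketch of that second approach (my own illustration, not the original answer's code; the listing id, icon URL, and webroot path below are hypothetical), the pre-fetch step could be as simple as:
use strict;
use warnings;
use LWP::Simple;

# Hypothetical values -- adjust to your schema and webroot layout.
my $listing_id = 1234;
my $icon_url   = 'http://www.example-service.com/image-123.gif';
my $local_path = "/var/www/htdocs/listing-icons/$listing_id.gif";

# mirror() only re-downloads when the remote file has changed;
# it returns the HTTP status code (200 = fetched, 304 = already current).
my $status = mirror( $icon_url, $local_path );
warn "Could not fetch $icon_url (status $status)\n"
    unless $status == 200 || $status == 304;
With the file on disk, http://www.example.com/listing/1234/icon can then be served as a static file and the per-request proxy round-trip disappears entirely.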

Related

How can I scrape images off of a webpage using Perl

I would like to scrape all images off of a webpage and am running into a problem I don't understand.
For instance, if I enter https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=%22escape%20room%22+movie+poster&safe=images into my browser and then use the browser's "View Source" option, I get a massive amount of text/code. Using "find" I get more than 400 instances of https://
So the simple code I wrote (below) gets the content and writes the result to a file. But a grep search of https:// only returns 7 instances. So obviously I am doing something incorrectly, perhaps the page is dynamic and I can't access that part?
Is there a way I can get the same source, via Perl, that I get via View Source?
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/8.0");    # pretend to be a browser

my $sstring = "https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=%22escape%20room%22+movie+poster&safe=images";
my $req = HTTP::Request->new( GET => $sstring );
$req->header( 'Accept' => 'text/html' );

my $res = $ua->request($req);

open( my $fh, '>', 'report.txt' ) or die "Cannot open report.txt: $!";
print $fh $res->decoded_content;
close $fh;
Here's the example I got from WWW::Mechanize::Chrome:
use WWW::Mechanize::Chrome;

my $mech = WWW::Mechanize::Chrome->new();
# $arrayp holds the search phrase, e.g. "escape room"
my $sstring = "https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=\"" . $arrayp . "\"" . "+movie+poster&safe=images";
$mech->get($sstring);
sleep 5;
print $_->get_attribute('href'), "\n\t-> ", $_->get_attribute('innerHTML'), "\n"
    for $mech->selector('a.download');
The Google search uses JavaScript to alter the page content after load. LWP::UserAgent does not support JavaScript, so what you get is only the initial document. (Hint: an easy way to see in the browser what LWP::UserAgent "sees" is to use a browser addon to disable JavaScript.)
You will need to use a so-called "headless browser" instead, for example WWW::Mechanize::Chrome.
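A minimal sketch of that approach (untested, reusing only the URL and output file name from the question): fetch the page through a real Chrome instance so the scripts run, then save the resulting markup for the same grep you did before.
use strict;
use warnings;
use WWW::Mechanize::Chrome;

my $sstring = "https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=%22escape%20room%22+movie+poster&safe=images";

my $mech = WWW::Mechanize::Chrome->new();
$mech->get($sstring);
sleep 5;    # give the page's JavaScript time to finish, as in your Chrome snippet

# content() returns the page markup after JavaScript has run
open( my $fh, '>', 'report.txt' ) or die "Cannot open report.txt: $!";
print $fh $mech->content;
close $fh;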

Unable to download PDFs with Perl and LWP

I'm trying to use LWP::Simple in Perl to download a number of PDF documents from the United Nations website (Security Council resolutions, etc.). Yet instead of returning PDFs, I am receiving an HTML error page. Consider this very simple example:
use LWP::Simple;
use strict;
my $url = 'https://documents-dds-ny.un.org/doc/UNDOC/GEN/N16/100/02/PDF/N1610002.pdf';
my $file = 'test.pdf';
getstore($url, $file);
If I then look at the contents of "test.pdf", I find that they are an HTML page.
I have also tried a number of tricks with LWP::UserAgent and even with cURL, but with no success. Any ideas?
OK, thanks to @SteffenUllrich and @ikegami for putting me on the right track!
It is indeed a cookie issue. The fix? Open a cookie jar, access the homepage of the site first, then access the PDF once a cookie has been stored in the jar.
This can be done without using HTTP::Cookies directly; we do need to use LWP::UserAgent instead of LWP::Simple, however.
Minimal working example below:
use strict;
use warnings 'all';
use LWP::UserAgent;
my $homeUrl = "https://documents.un.org/prod/ods.nsf/home.xsp";
my $pdfUrl = "https://documents-dds-ny.un.org/doc/UNDOC/GEN/N16/100/02/PDF/N1610002.pdf";
my $pdfOutputName = "test.pdf";
my $browser = LWP::UserAgent->new( cookie_jar => { } );
my $resp;
$resp = $browser->get( $homeUrl );
die $resp->status_line unless $resp->is_success;
$resp = $browser->get( $pdfUrl, ':content_file' => $pdfOutputName );
die $resp->status_line unless $resp->is_success;
This will produce a complete PDF file.
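If you want the cookie to survive between runs, one possible variation (untested; the cookie-file name is arbitrary) is to back the jar with a file via HTTP::Cookies:
use strict;
use warnings 'all';
use LWP::UserAgent;
use HTTP::Cookies;

# File-backed jar: cookies are saved on exit and reloaded on the next run.
my $jar = HTTP::Cookies->new(
    file     => 'un_cookies.dat',   # arbitrary path, pick your own
    autosave => 1,
);

my $browser = LWP::UserAgent->new( cookie_jar => $jar );

# Same flow as above: prime the jar on the home page, then fetch the PDF.
my $resp = $browser->get( 'https://documents.un.org/prod/ods.nsf/home.xsp' );
die $resp->status_line unless $resp->is_success;

$resp = $browser->get(
    'https://documents-dds-ny.un.org/doc/UNDOC/GEN/N16/100/02/PDF/N1610002.pdf',
    ':content_file' => 'test.pdf',
);
die $resp->status_line unless $resp->is_success;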

Processing external page with perl CGI or act as a reverse proxy

There is a page residing on a local server running Apache. I would like to submit the form via a GET request with a single name/value pair, like:
id=item1234
This GET request has to be processed by another server, which I don't have control over, and that server then returns a page which I would like to transform with a CGI script. In other words:
User submits form
MY apache proxies to external resource
EXTERNAL resource throws back a page
MY apache transforms it with a CGI (maybe another way?)
User gets a modified page
Again, this is more of an architectural question, so I'd be grateful for any hints; even pointers to relevant guides would help, as I wasn't able to phrase my Google search well enough to locate anything related.
Thanks.
Pass the id "17929632" to this CGI code ("proxy.pl?id=17929632"), and you should see this exact page in your browser.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use CGI::Pretty qw(:standard -any -no_xhtml -oldstyle_urls);
print header;
print "<html>\n";
print " <head><title>Proxy Demo</title></head>\n";
print " <body bgcolor=\"white\">\n";
my $id = param('id') || die "No CGI param 'id'\n";
my $ua = LWP::UserAgent->new;
$ua->agent("MyApp/0.1 ");
# Create a request
my $req = HTTP::Request->new(GET => "http://stackoverflow.com/questions/$id");
# Pass request to the user agent and get a response back
my $response = $ua->request($req);
# Check the outcome of the response
if ($response->is_success) {
    my $content = $response->content;
    # Modify the original content here!
    print $content;
}
else {
    print $response->status_line;
}
print "</body></html>\n";
Vague question, vague answer: write your CGI program to include an HTTP user agent, e.g. LWP.

Get files from a given URL on the basis of pattern passed using Perl on Unix

I have been told that a given URL contains several XML and text files, and I need to download all the XML files starting with AAA (that is, AAA*.xml) inside a given directory.
Credentials to access that URL are provided to me.
Please note that the size of the XML files could be in GBs.
I have used the code below to achieve this:
use strict;
use warnings;
use LWP;
my $browser = LWP::UserAgent->new;
my $username ='scott';
my $password='tiger';
# Create HTTP request object
my $req = HTTP::Request->new( GET => "https://url.com/");
# Authenticate the user
$req->authorization_basic( $username , $password);
my $res = $browser->request( $req , ':content_file' => '/fold/AAA1.xml');
print $res->status_line, "\n";
It prints a 200 OK status, but I am not able to get the file. Any suggestions?
If the server doesn't allow directory listings (i.e. Apache without "Options +Indexes"), you will not be able to GET the collection of files.
But if you can get the listing, you can filter it with a regex like /AAA.*/, and with the LWP::Simple module it's easy to fetch each match; a sketch follows below.
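A rough sketch along those lines (untested; it reuses the URL, credentials, and /fold/ directory from the question, and assumes an Apache-style index page whose links can be matched with a simple href pattern). It uses LWP::UserAgent rather than LWP::Simple so the basic-auth credentials can be sent, and streams each file to disk, which matters for GB-sized downloads:
use strict;
use warnings;
use LWP::UserAgent;

my $base     = 'https://url.com/';
my $username = 'scott';
my $password = 'tiger';

my $ua = LWP::UserAgent->new;

# 1. Fetch the directory index.
my $req = HTTP::Request->new( GET => $base );
$req->authorization_basic( $username, $password );
my $res = $ua->request($req);
die $res->status_line unless $res->is_success;

# 2. Pull out every link that looks like AAA*.xml (assumes an Apache-style index).
my @files = $res->decoded_content =~ /href="(AAA[^"]*\.xml)"/g;

# 3. Download each match; ':content_file' writes to disk instead of holding it all in memory.
for my $file (@files) {
    my $get = HTTP::Request->new( GET => $base . $file );
    $get->authorization_basic( $username, $password );
    my $r = $ua->request( $get, ':content_file' => "/fold/$file" );
    print "$file: ", $r->status_line, "\n";
}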

WWW::Mechanize Form Select

I am attempting to log in to YouTube with WWW::Mechanize and use forms() to print out all the forms on the page after logging in. My script logs in successfully and also successfully navigates to youtube.com/inbox; however, for some reason Mechanize cannot see any forms at youtube.com/inbox. It just returns blank. Here is my code:
#!"C:\Perl64\bin\perl.exe" -T
use strict;
use warnings;
use CGI;
use CGI::Carp qw/fatalsToBrowser/;
use WWW::Mechanize;
use Data::Dumper;
my $q = CGI->new;
$q->header();
my $url = 'https://www.google.com/accounts/ServiceLogin?uilel=3&service=youtube&passive=true&continue=http://www.youtube.com/signin%3Faction_handle_signin%3Dtrue%26nomobiletemp%3D1%26hl%3Den_US%26next%3D%252Findex&hl=en_US&ltmpl=sso';
my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->agent_alias( 'Windows Mozilla' );
$mechanize->get($url);
$mechanize->submit_form(
form_id => 'gaia_loginform',
fields => { Email => 'myemail',Passwd => 'mypassword' },
);
die unless ($mechanize->success);
$url = 'http://www.youtube.com/inbox';
$mechanize->get($url);
$mechanize->form_id('comeposeform');
my $page = $mechanize->content();
print Dumper($mechanize->forms());
Mechanize is unable to see any forms at youtube.com/inbox; however, like I said, I can print all of the forms from the initial link, no matter what I change it to...
Thanks in advance.
As always, one of the best debugging approaches is to print what you get and check if it is what you were expecting. This applies to your problem too.
In your case, if you print $mechanize->content() you'll see that you didn't get the page you're expecting. YouTube wants you to follow a JavaScript redirect in order to complete your cross-domain login action. You have multiple options here:
parse the returned content manually – i.e. /location\.replace\("(.+?)"/ (a rough sketch of this follows below)
try to have your code parse JavaScript (have a look at WWW::Scripter)
[recommended] use the YouTube API for managing your inbox
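For completeness, here is a rough sketch of the first option only (entirely illustrative: the exact markup of Google's interstitial page, and the escaping it uses, are assumptions and change over time):
# Continuing from the question's $mechanize object, right after the login POST.
if ( $mechanize->content =~ /location\.replace\("(.+?)"/ ) {
    my $next = $1;

    # The URL sits inside a JavaScript string, so undo its escaping
    # (e.g. \x26 for '&') -- an assumption based on pages seen in the wild.
    $next =~ s/\\x([0-9A-Fa-f]{2})/chr hex $1/ge;
    $next =~ s/\\(.)/$1/g;

    # Follow the redirect that Mechanize could not execute on its own.
    $mechanize->get($next);
}

$mechanize->get('http://www.youtube.com/inbox');
print Dumper( $mechanize->forms() );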