Getting different source code for same url - perl

I'm trying to grab the headline from a Washington Post news page on the web with a simple Perl script:
#! /usr/bin/env perl
use strict;
use warnings;
use LWP::Simple;
use Web::Scraper;
my $url = 'https://www.washingtonpost.com/outlook/why-trump-is-flirting-with-abandoning-fox-news-for-one-america/2019/10/11/785fa156-eba4-11e9-85c0-85a098e47b37_story.html';
my $scraper = scraper {
process '//h1[@data-qa="headline"]', 'headline' => 'TEXT',
};
my $html = get($url);
print $html;
my $res = $scraper->scrape ($html);
The problem I'm having is that it works only about half of the time, even when fetching the exact same URL. On the other attempts, the source code that is returned is in a completely different format.
Perhaps this is an anti-scraping measure for unknown agents? I'm not sure, but it seems like it should never work at all if that were the case.
Is there a simple workaround I might employ like accepting cookies?

I modified $scraper as follows to get it to work with both versions of the source code:
my $scraper = scraper {
process '//h1[@data-qa="headline"]', 'headline' => 'TEXT',
process '//h1[@itemprop="headline"]', 'headline2' => 'TEXT',
};
Either headline or headline2 will be populated.
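Since only one of the two keys is filled in for any given response, the calling code has to check both. A minimal sketch of that check follows. The helper name first_populated is hypothetical, and the two hash references are made-up sample data standing in for what $scraper->scrape might return for each page layout.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first non-empty value among the given keys.
sub first_populated {
    my ($res, @keys) = @_;
    for my $key (@keys) {
        my $value = $res->{$key};
        return $value if defined $value && length $value;
    }
    return;
}

# Made-up scrape results, one per page layout (assumption, not real output).
my $layout_a = { headline  => 'Sample headline from layout A' };
my $layout_b = { headline2 => 'Sample headline from layout B' };

for my $res ($layout_a, $layout_b) {
    my $headline = first_populated($res, qw(headline headline2));
    print defined $headline ? "$headline\n" : "no headline found\n";
}
```

In the real script, the call would be first_populated($res, qw(headline headline2)) on the result of $scraper->scrape($html).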

Related

Unable to download PDFs with Perl and LWP

I'm trying to use LWP::Simple in Perl to download a number of PDF documents from the United Nations website (Security Council resolutions, etc.). Yet instead of returning PDFs, I am receiving an HTML error page. Consider this very simple example:
use LWP::Simple;
use strict;
my $url = 'https://documents-dds-ny.un.org/doc/UNDOC/GEN/N16/100/02/PDF/N1610002.pdf';
my $file = 'test.pdf';
getstore($url, $file);
If I then look at the contents of "test.pdf", I find that they are an HTML page.
I have also tried a number of tricks with LWP::UserAgent and even with cURL, but with no success. Any ideas?
Ok, thanks to @SteffenUllrich and @ikegami for putting me on the right track!
It is indeed a cookie issue. The fix? Open a cookie jar, access the homepage of the site first, then access the PDF once a cookie has been stored in the jar.
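This can be done without loading HTTP::Cookies explicitly: passing an empty hash reference as cookie_jar makes LWP::UserAgent create an in-memory cookie jar for you. We do need to use LWP::UserAgent instead of LWP::Simple, however.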
Minimal working example below:
use strict;
use warnings 'all';
use LWP::UserAgent;
my $homeUrl = "https://documents.un.org/prod/ods.nsf/home.xsp";
my $pdfUrl = "https://documents-dds-ny.un.org/doc/UNDOC/GEN/N16/100/02/PDF/N1610002.pdf";
my $pdfOutputName = "test.pdf";
my $browser = LWP::UserAgent->new( cookie_jar => { } );
my $resp;
$resp = $browser->get( $homeUrl );
die $resp->status_line unless $resp->is_success;
$resp = $browser->get( $pdfUrl, ':content_file' => $pdfOutputName );
die $resp->status_line unless $resp->is_success;
This will produce a complete PDF file.
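Because the failure mode here is silent (an HTML page saved under a .pdf name), it can help to check the file signature after the download. The sketch below is one way to do that, assuming that a valid PDF starts with the bytes "%PDF-". The helper name looks_like_pdf is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the file begins with the PDF signature "%PDF-".
sub looks_like_pdf {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $read = read($fh, my $magic, 5);   # read the first five bytes
    close $fh;
    return defined $read && $read == 5 && $magic eq '%PDF-';
}

# Check the file written by the download step, if it exists.
my $file = 'test.pdf';
if (-e $file) {
    my $verdict = looks_like_pdf($file) ? 'looks like a PDF' : 'is not a PDF (maybe an HTML page)';
    print "$file $verdict\n";
}
```

Calling this right after the second get makes a cookie or redirect problem visible immediately instead of when the file is opened later.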

WWW::Mechanize Perl

I am writing a simple script to log in to a website, for learning purposes.
I get an error saying "No Form Defined".
How do I find out the form name?
Below is the code snippet (I found it on this forum).
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;
my $mech = WWW::Mechanize->new();
my $url = "http://www.something.net";
$mech->cookie_jar->set_cookie(0,"start",1,"/",".something.net");
$mech->get($url);
$mech->form_name("frmLogin");
$mech->set_fields(user=>'user',passwrd=>'password');
$mech->click();
$mech->save_content("logged_in.html");
Does the code look alright?
The names of the forms, if any, are embedded in the content that you are retrieving. If you view the source for this page, for example, you will find many form elements. This one has the id add-comment-44827103:
<form id="add-comment-44827103"
class=""
data-placeholdertext="Use comments to ask for more information or suggest improvements. Avoid answering questions in comments."></form>
You can retrieve them with $mech->forms. This call returns a list of HTML::Form objects that you can interrogate further.
my ($form) = $mech->forms; # note ($var)=... for list context
my $form_id = $form->attr("id") || die "form on page doesn't have 'id' attr";
$mech->form_id($form_id);
...
There is also the $mech->form_number( index ) call, which selects a form by its position on the page (counting from 1):
$mech->form_number(2); # select the 2nd form on the page
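To find out which name, id, or number to use, it can help to list every form on the page first. The sketch below does this with HTML::Form, the module behind the objects that $mech->forms returns (assumed to be installed, since WWW::Mechanize depends on it). The HTML snippet is made-up sample data, and describe_forms is a hypothetical helper name.

```perl
use strict;
use warnings;
use HTML::Form;

# Made-up page content; in a real script this would be $mech->content.
my $html = <<'HTML';
<html><body>
<form id="search" action="/search" method="get">
  <input type="text" name="q">
</form>
<form name="frmLogin" action="/login" method="post">
  <input type="text" name="user">
  <input type="password" name="passwrd">
</form>
</body></html>
HTML

# Hypothetical helper: summarise each form's position, name, id and field names.
sub describe_forms {
    my ($content, $base_url) = @_;
    my @summary;
    my $number = 0;
    for my $form (HTML::Form->parse($content, $base_url)) {
        $number++;   # form_number() counts from 1
        push @summary, {
            number => $number,
            name   => scalar $form->attr('name'),
            id     => scalar $form->attr('id'),
            fields => [ grep { defined } map { $_->name } $form->inputs ],
        };
    }
    return @summary;
}

for my $info (describe_forms($html, 'http://www.something.net/')) {
    printf "form %d: name=%s id=%s fields=%s\n",
        $info->{number},
        $info->{name} // '(none)',
        $info->{id}   // '(none)',
        join(',', @{ $info->{fields} });
}
```

If no form reports the name you expected, the login form is probably either unnamed (select it by number or id instead) or generated by JavaScript, in which case it is not present in the retrieved HTML at all.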

Mojolicious and directory traversal

I am new to Mojolicious and am trying to build a tiny web service with this framework.
I wrote the code below, which serves a file to remote clients:
use Mojolicious::Lite;
use strict;
use warnings;
app->static->paths->[0]='C:\results';
get '/result' => sub {
my $self = shift;
my $headers = $self->res->headers;
$headers->content_type('text/zip;charset=UTF-8');
$self->render_static('result.zip');
};
app->start;
But it seems that when I try to fetch the file using the following URL:
http://mydomain:3000/result/./../result
I still get the file.
Is there any option in Mojolicious to prevent such directory traversal?
That is, in the above case I want only
http://mydomain:3000/result
to serve the page. If someone enters this URL:
http://mydomain:3000/result/./../result
the page should not be served.
Is it possible to do this?
/$result^/ is a regular expression, and if you have not defined the scalar variable $result (which it does not appear you have), it resolves to /^/, which matches not just
http://mydomain:3000/result/./../result but also
http://mydomain:3000/john/jacob/jingleheimer/schmidt.
use strict and use warnings, even on tiny webservices.
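The effect described in this answer can be reproduced without Mojolicious. In the sketch below, $result is declared but left undefined so that the script still compiles under use strict. The list of paths and the stricter pattern \A/result\z are illustrative assumptions, not taken from the original route definition.

```perl
use strict;
use warnings;

my @paths = (
    '/result',
    '/result/./../result',
    '/john/jacob/jingleheimer/schmidt',
);

# Declared but never assigned, standing in for the undefined $result.
my $result;

# With $result undefined, /$result^/ collapses to /^/.
# The warning is silenced here only to keep the demonstration quiet.
my $loose = do { no warnings 'uninitialized'; qr/$result^/ };

# Assumed stricter alternative: anchor both ends so only the exact path matches.
my $exact = qr{\A/result\z};

for my $path (@paths) {
    printf "%-34s loose=%-3s exact=%s\n",
        $path,
        ($path =~ $loose ? 'yes' : 'no'),
        ($path =~ $exact ? 'yes' : 'no');
}
```

Without the "no warnings" line, Perl reports the uninitialized value at the point where the pattern is compiled, which is exactly the hint that use warnings is meant to give.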

Processing external page with perl CGI or act as a reverse proxy

There is a page residing on a local server running Apache. I would like to submit the form via a GET request with a single name/value pair, like:
id=item1234
This GET request has to be processed by another server, which I don't control and which returns a page that I would like to transform with a CGI script. In other words:
1. User submits the form
2. MY Apache proxies the request to the external resource
3. EXTERNAL resource sends back a page
4. MY Apache transforms it with a CGI script (maybe another way?)
5. User gets a modified page
This is more of an architectural question, so I'd be grateful for any hints. Even pointers to some guides would help, as I wasn't able to phrase my Google searches well enough to locate anything related.
Thanks.
Pass the id "17929632" to this CGI code ("proxy.pl?id=17929632"), and you should see this exact page in your browser.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use CGI::Pretty qw(:standard -any -no_xhtml -oldstyle_urls);
print header;
print "<html>\n";
print " <head><title>Proxy Demo</title></head>\n";
print " <body bgcolor=\"white\">\n";
my $id = param('id') || die "No CGI param 'id'\n";
my $ua = LWP::UserAgent->new;
$ua->agent("MyApp/0.1 ");
# Create a request
my $req = HTTP::Request->new(GET => "http://stackoverflow.com/questions/$id");
# Pass request to the user agent and get a response back
my $response = $ua->request($req);
# Check the outcome of the response
if ($response->is_success) {
my $content = $response->content;
# Modify the original content here!
print $content;
}
else {
print $response->status_line;
}
print "</body></html>\n";
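The "Modify the original content here!" step in the script above is where the transformation goes. As one possible illustration, the sketch below inserts a base element so that relative links in the fetched page still resolve against the external site. The function name transform_page and the sample markup are hypothetical, and the regex approach assumes reasonably simple HTML and a base URL without double quotes; an HTML parser would be more robust.

```perl
use strict;
use warnings;

# Hypothetical transformation: add <base href="..."> right after the opening
# <head> tag so relative URLs in the proxied page point at the original site.
sub transform_page {
    my ($content, $base_url) = @_;
    # \b keeps the pattern from matching <header>; only the first <head> is touched.
    $content =~ s{(<head\b[^>]*>)}{$1<base href="$base_url">}i;
    return $content;
}

# Made-up sample page standing in for $response->content.
my $sample = '<html><head><title>Demo</title></head><body><a href="/x">x</a></body></html>';
print transform_page($sample, 'http://stackoverflow.com/'), "\n";
```

In the CGI script, this would replace the plain print of $content with a print of the transformed content. Note that the fetched page is already a complete HTML document, so the surrounding html, head and body tags that the script prints itself may not be needed.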
Vague question, vague answer: write your CGI program to include an HTTP user agent, e.g. LWP.

WWW::Mechanize Form Select

I am attempting to log in to YouTube with WWW::Mechanize and use forms() to print out all the forms on the page after logging in. My script is logging in successfully, and it is also navigating to youtube.com/inbox successfully. However, for some reason Mechanize cannot see any forms at youtube.com/inbox; it just returns nothing. Here is my code:
#!"C:\Perl64\bin\perl.exe" -T
use strict;
use warnings;
use CGI;
use CGI::Carp qw/fatalsToBrowser/;
use WWW::Mechanize;
use Data::Dumper;
my $q = CGI->new;
print $q->header();
my $url = 'https://www.google.com/accounts/ServiceLogin?uilel=3&service=youtube&passive=true&continue=http://www.youtube.com/signin%3Faction_handle_signin%3Dtrue%26nomobiletemp%3D1%26hl%3Den_US%26next%3D%252Findex&hl=en_US&ltmpl=sso';
my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->agent_alias( 'Windows Mozilla' );
$mechanize->get($url);
$mechanize->submit_form(
form_id => 'gaia_loginform',
fields => { Email => 'myemail',Passwd => 'mypassword' },
);
die unless ($mechanize->success);
$url = 'http://www.youtube.com/inbox';
$mechanize->get($url);
$mechanize->form_id('comeposeform');
my $page = $mechanize->content();
print Dumper($mechanize->forms());
Mechanize is unable to see any forms at youtube.com/inbox, however, like I said, I can print all of the forms from the initial link, no matter what I change it to...
Thanks in advance.
As always, one of the best debugging approaches is to print what you get and check if it is what you were expecting. This applies to your problem too.
In your case, if you print $mechanize->content() you'll see that you didn't get the page you're expecting. YouTube wants you to follow a JavaScript redirect in order to complete your cross-domain login action. You have multiple options here:
- Parse the returned content manually – i.e. /location\.replace\("(.+?)"/
- Try to have your code parse JavaScript (have a look at WWW::Scripter)
- [Recommended] Use the YouTube API for managing your inbox
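The first option can be sketched as follows, using the regular expression given above. The helper name extract_js_redirect and the sample content are hypothetical. The decoding of \xHH escapes is an assumption about how the URL might be encoded inside the JavaScript string; it does nothing if no such escapes are present.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the target URL out of a location.replace("...") call.
sub extract_js_redirect {
    my ($content) = @_;
    return unless defined $content;
    my ($url) = $content =~ /location\.replace\("(.+?)"/
        or return;
    # Assumption: the URL may contain JavaScript \xHH escapes (e.g. \x3d for "=").
    $url =~ s/\\x([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $url;
}

# Made-up response body standing in for $mechanize->content.
my $content = '<script>location.replace("http:\x2F\x2Fwww.youtube.com\x2Fsignin?next\x3d\x2Finbox")</script>';
my $target  = extract_js_redirect($content);
print defined $target ? "redirect target: $target\n" : "no redirect found\n";
```

In the original script, the extracted URL would then be passed to $mechanize->get before requesting the inbox page. Whether that is enough to complete the login depends on what the site actually returns.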