Parsing Site with Perl LWP::UserAgent -- Cookies Required - perl

On a certain project in Perl, I've written several "parsers", which allow me to visit websites with LWP::UserAgent. However, I'm having a problem with one website: it's behaving exactly as if I had visited the site with my browser, having turned off Cookies, so instead of giving me the page I want, it gives me a page with the message that I must turn on cookies. The entire code of my script is below. Any ideas? Thanks in advance.
(Note that I looked at the following url, which seems to be addressing my question, but unfortunately, I was unable to get a working script based on its suggestion: Cookies in perl lwp.)
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;
my $useragent = LWP::UserAgent->new;
$useragent->cookie_jar(HTTP::Cookies->new);
my $request = HTTP::Request->new(GET => "http://www.the-site-im-trying-to-parse.com");
my $response = $useragent->request($request);
print "Content-type: text/html\n\n";
print $response->as_string;

Have you considered using the WWW::Mechanize module? It collects cookies automatically by default, and it's a bit easier to use because it includes plenty of very useful methods.

All you are doing is downloading the HTML over HTTP, so there is no browser interaction until you decide to view the result in one. That being said, the HTTP server has no way of knowing whether your request comes from a client that has cookies enabled, so enabling a cookie jar won't actually change the result.
The WWW::Mechanize module is useful for easily traversing web sites, but it won't fix the problem you are facing.
More realistically, what is going on is that some client-side JavaScript isn't working correctly once you download the file and display it in your browser. This could be any number of things, such as breaking the cross-domain policy implemented in the JavaScript code. Without the URL you are accessing, it is impossible to say.

Try setting cookie_jar to temporary in-memory storage by giving it an empty hashref:
$useragent->cookie_jar( {} );

Related

Using cookies in Perl without CGI

I am trying to create a standard forum style website using Perl but not using CGI, or any other framework for that matter. I've seen before use of a script simply titled "cookies.pl," but can't find any sort of documentation on it. Is there a way to set/read cookies with just core modules?
First, I am assuming that when you say you don't want to use "CGI", you mean the Perl module CGI.pm rather than the Common Gateway Interface (CGI) method of communicating with the web server, which is implemented by the CGI.pm module.
Second, this answer is for information and entertainment purposes only. Attempting to implement your own CGI handler for use in a production environment is not advisable. It's a Really, Really Bad Idea unless you know exactly what you're doing. And probably still a Bad Idea even if you do. And if you did know exactly what you're doing, you wouldn't have to ask questions about basic parts of the interface like how to implement cookie handling.
With all that out of the way, cookies are pretty simple to handle directly.
To set a cookie, send a Set-Cookie HTTP header to the client. In the most basic form, this looks like Set-Cookie: CookieName=CookieValue. There are many other options which can be added to this basic format, which are documented in various places around the web.
If you're now wondering "How do I send an HTTP header?", every line of text that you send to the client (i.e., print to STDOUT) prior to the first empty line is an HTTP header:
print "Content-Type: text/html\n"; # Content-Type header is mandatory!
print "Set-Cookie: CookieName=CookieValue\n"; # Header to set a cookie
print "\n"; # Blank line = end of headers
# continue on with sending the response body now that headers are done
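The other options mentioned above are attributes appended after the name=value pair, separated by semicolons. A minimal sketch, assuming a hypothetical helper named set_cookie_header and the standard Path, Max-Age and HttpOnly attributes; it does no escaping or validation of the value, which real code would need:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): joins a name=value pair
# and any attributes into a single Set-Cookie header line.
sub set_cookie_header {
    my ($name, $value, @attributes) = @_;
    return 'Set-Cookie: ' . join('; ', "$name=$value", @attributes);
}

my $header = set_cookie_header('session', 'abc123',
                               'Path=/', 'Max-Age=3600', 'HttpOnly');

print "Content-Type: text/html\n";   # headers first
print "$header\n";
print "\n";                          # blank line ends the headers
print "<p>Cookie sent.</p>\n";       # response body
```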
To read a cookie, look at the environment variable HTTP_COOKIE, which is provided by the web server as a part of its CGI implementation and will contain a semicolon-delimited list of all cookies received with the client's HTTP request. This is accessed in Perl as $ENV{HTTP_COOKIE}.
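Splitting that string into individual cookies takes only a few lines. A minimal sketch using core Perl only; parse_cookies is a hypothetical helper name, and the fallback sample value exists only so the script can also be run outside a web server:

```perl
use strict;
use warnings;

# Hypothetical helper: turns the raw Cookie header value into a hash.
sub parse_cookies {
    my ($raw) = @_;
    my %cookies;
    return %cookies unless defined $raw && length $raw;
    for my $pair (split /;\s*/, $raw) {
        # Split on the first "=" only, since a value may itself contain "=".
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $cookies{$name} = defined $value ? $value : '';
    }
    return %cookies;
}

# Under CGI the web server sets HTTP_COOKIE; the sample string is an
# assumption used only when the variable is absent.
my $raw = defined $ENV{HTTP_COOKIE}
        ? $ENV{HTTP_COOKIE}
        : 'CookieName=CookieValue; theme=dark';

my %cookies = parse_cookies($raw);
print "$_ = $cookies{$_}\n" for sort keys %cookies;
```

This does not decode any escaping the application may have applied when setting the value.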

Simple API request not working - 403 error

I am trying to run a simple API request from a Perl script, but it does not seem to work. The same request works without any problem when pasted into a web browser.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my $query = 'http://checkdnd.com/api/check_dnd_no_api.php?mobiles=9944384761';
my $result = get($query);
print $result."\n";
When I use getprint($query), it gives a 403 error.
If you take a look at the body of the response (i.e. not only at the status code 403) you will find:
The owner of this website (checkdnd.com) has banned your access based on your browser's signature (2f988642c0f02798-ua22).
This means that it is blocking the client because it probably looks too much like a non-browser. For this site a simple fix is to include some User-Agent header. The following works for me:
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
# Send a browser-like User-Agent instead of the libwww-perl default
$ua->default_header('User-Agent' => 'Mozilla/5.0');
my $resp = $ua->get('http://checkdnd.com/api/check_dnd_no_api.php?mobiles=9944384761');
my $result = $resp->decoded_content;
The site in question seems to be served by Cloudflare, which has a feature called "Browser Integrity Check". From the support page for this feature:
... looks for common HTTP headers abused most commonly by spammers and denies access to your page. It will also challenge visitors that do not have a user agent or a non standard user agent (also commonly used by abuse bots, crawlers or visitors).
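To see what the site was objecting to, it can help to print the User-Agent string that LWP sends by default. A small sketch that makes no network request; it uses the agent accessor, which reads and sets the same default User-Agent header as the default_header call above:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# The default identifies the client as libwww-perl plus a version number.
my $default_agent = $ua->agent;
print "Default User-Agent:    $default_agent\n";

# Override it with a browser-like value.
$ua->agent('Mozilla/5.0');
print "Overridden User-Agent: ", $ua->agent, "\n";
```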

How to maintain cookies during redirects using perl's LWP::UserAgent?

The HTTP Request I used to send to a specific website is now getting redirected which eventually broke my code. I realized that the cookies are not working anymore for the redirected domain (of course). I read the docs of LWP but I did not find any related option to preserve/maintain cookies automatically. Is there an easy way to do it?
As a side note: this behavior works out of the box with Python's Requests library.
The following adds support for cookies to LWP::UserAgent.
my $ua = LWP::UserAgent->new( cookie_jar => {} );
It causes cookies returned in a response to be sent with subsequent matching requests, just like a browser does.
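What "matching" means can be seen without touching the network by driving an HTTP::Cookies jar by hand, which is roughly what the user agent does for each response and request. A sketch under stated assumptions: the hosts, paths and cookie value are placeholders, and the response is constructed locally rather than received from a server:

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;
use HTTP::Response;

my $jar = HTTP::Cookies->new;

# Simulate a response from www.example.com that sets a cookie.
my $first    = HTTP::Request->new(GET => 'http://www.example.com/login');
my $response = HTTP::Response->new(200, 'OK',
                                   [ 'Set-Cookie' => 'sid=abc123; Path=/' ]);
$response->request($first);        # the jar needs to know the origin URL
$jar->extract_cookies($response);

# A later request to the same host gets the cookie attached.
my $same_site = HTTP::Request->new(GET => 'http://www.example.com/account');
$jar->add_cookie_header($same_site);

# A request to a different domain does not.
my $other_site = HTTP::Request->new(GET => 'http://www.example.org/');
$jar->add_cookie_header($other_site);

print "same site:  ", ($same_site->header('Cookie')  // '(none)'), "\n";
print "other site: ", ($other_site->header('Cookie') // '(none)'), "\n";
```

So a jar carries cookies across a redirect only when the redirect target matches the domain and path the cookie was set for.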

Use perl REST::Client and multi-part form to post image to Confluence

How can I attach an image to an existing Confluence page, using their latest REST API, from Perl via the REST::Client module?
This may be a perl question (I may be posting an image incorrectly), or a Confluence question (their API docs may be missing a required detail such as a required header that is being quietly added by curl), or both.
I have established a connection object, $client, using the REST::Client module. I have verified that $client is a valid connection by performing a $client->GET to a known Confluence page ID, which correctly returns the page's details.
I attempt to upload an image, using:
$headers = {Accept => 'application/json',
            Authorization => 'Basic ' . encode_base64($user . ':' . $password),
            X_Atlassian_Token => 'no-check',
            Content_Type => 'form-data',
            Content => [ file => ["file1.jpg"], ]};
$client->POST('rest/api/content/44073843/child/attachment', $headers);
... and the image doesn't appear on the attachments list.
I've packet-sniffed the browser whilst uploading an image there, only to find that it uses the prototype API that is being deprecated. I'd hoped that I could just stand on Atlassian's shoulders in terms of seeing exactly what their post stream looks like, and replicating that... but I don't want to use the API that's being deprecated, since they recommend against it.
The curl example of calling the Confluence API to attach a file that they give at https://developer.atlassian.com/confdev/confluence-rest-api/confluence-rest-api-examples, when my host, filename, and page ID are substituted in, does post the attachment.
I formerly specified comment in my array of Content items, but removed that during debugging to simplify things since the documentation said it was optional.
One thing I'm unclear about is getting the contents of the file into the post stream. In the curl command, the @ accomplishes that. In REST::Client, I'm not sure if I have to do something more than I did to make that happen.
I can't packet-sniff the outgoing traffic because our server only allows https, and I don't know how (or if it's even possible) to set the REST::Client module or one of its underlying modules to record the SSL info to a log file so that Wireshark can pick it up and decode the resulting TLS traffic, the way one can with the environment variable for Chrome or Firefox. I also don't have access to server logs. So I don't know what the request I'm actually sending looks like (if I did, I could probably say, "OK, it looks wrong right HERE" or "But it looks right?!"). I am therefore unfortunately at a loss as to how to debug it so blindly.
A similar question about posting multipart forms using REST::Client was asked by someone else, more generically, back in April of last year, but received no responses. I'm hoping that since mine is more specific, someone can tell me what I might be doing wrong.
Any help would be appreciated.
You should capture the response to your POST and dump it (this needs Data::Dumper loaded), like so:
my $response = $client->POST('rest/api/content/44073843/child/attachment', $headers);
print Dumper $response;
Also, the URL you are using in the POST call is incomplete and won't work; you need the full URL.
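Since the outgoing request cannot be sniffed, one way to see what a multipart upload looks like is to build it locally with HTTP::Request::Common and print it without sending anything. This is only a sketch: the host name is hypothetical, a temporary file stands in for the real image, and nothing here confirms what Confluence itself requires.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTTP::Request::Common qw(POST);

# Stand-in for the real image so the sketch runs anywhere.
my ($fh, $path) = tempfile(SUFFIX => '.jpg', UNLINK => 1);
binmode $fh;
print {$fh} "not really a JPEG";
close $fh;

# Hypothetical host; the path is the one from the question.
my $url = 'https://confluence.example.com/rest/api/content/44073843/child/attachment';

# Content_Type => 'form-data' makes POST() build a multipart/form-data
# body and read the file contents into it.
my $request = POST(
    $url,
    'X-Atlassian-Token' => 'no-check',
    Accept              => 'application/json',
    Content_Type        => 'form-data',
    Content             => [ file => [ $path, 'file1.jpg' ] ],
);

# Nothing is sent; this only shows what would go over the wire.
print $request->method, ' ', $request->uri, "\n";
print $request->headers->as_string, "\n";
print length($request->content), " bytes of multipart body\n";
```

REST::Client documents its POST as taking the body as the second argument and a headers hashref as the third, so the generated body and Content-Type header could in principle be passed along that way; whether that resolves the upload is an assumption that would need testing.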

POST to RESTful system using system authentication

My work requires an authorization for internet use. I log in, and after that it recognizes me and lets me access whatever I need.
I have been using POSTMAN to test sending to and receiving from a company RESTful service. It automatically uses my same internet-use authorization at the other end to give my user account POST and GET permissions.
Now I am trying to automate this with a Perl script, and it won't authorize. The owner of the RESTful service says that if I make a Windows/.NET application it will authorize automatically, but that isn't an option.
Any suggestions? I would think I could just do special headers or something and duplicate whatever windows is doing....
I have been asked to provide what I have done so far
#!/usr/local/bin/perl
use strict;
use LWP::UserAgent;
my $ua=LWP::UserAgent->new;
my $server_endpoint = "The post destination";
my $req= HTTP::Request->new(POST => $server_endpoint);
$req->header('content-type' => 'application/json');
my $post_data="[ SOME JSON HERE ]";
$req->content($post_data);
my $resp = $ua->request($req);
if($resp->is_success){
    my $message = $resp->decoded_content;
    print "received reply : $message\n";
}
else{
    print "post error code : ",$resp->code,"\n";
    print "post error message : ",$resp->message,"\n";
}
In the past when I had to authenticate against an IIS server I had to use LWP::Authen::Ntlm to get it to authenticate.
For more information about LWP::Authen::Ntlm, see https://metacpan.org/pod/LWP::Authen::Ntlm
The main "pitfalls" I ran into were that keepalive is required, and that newer versions of IIS now use Digest rather than NTLM.
In those cases, I simply switched to the built-in LWP::Authen::Digest (it ships with LWP).
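A minimal sketch of the user agent setup this implies, assuming placeholder values throughout: the host, port, domain, user name and password are all hypothetical. LWP picks the authentication handler based on the scheme the server asks for, and the NTLM handler additionally relies on the Authen::NTLM module being installed. No request is made here.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# keep_alive gives the agent a connection cache, so the multi-step
# handshake can stay on one connection.
my $ua = LWP::UserAgent->new(keep_alive => 1);

# Placeholders: host:port, domain, user and password are hypothetical.
# The realm is left empty, following the LWP::Authen::Ntlm documentation.
$ua->credentials('intranet.example.com:80', '', 'MYDOMAIN\\myuser', 'secret');

# Read the stored credentials back to confirm what will be offered.
my ($user, $pass) = $ua->credentials('intranet.example.com:80', '');
print "Stored credentials for $user\n";
```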
Have a look at a similar question (scroll up to the top to see the question) and see whether the included bit of Perl code helps:
LWP::UserAgent HTTP Basic Authentication
The short version is that it doesn't appear that your Perl code above includes any login information and this POSTMAN plugin may be sending over cached login info that your Perl code is not yet aware of.
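If the service turns out to accept HTTP Basic authentication (an assumption; the question does not say which scheme it uses), the credentials can be attached to the request directly. A sketch with a hypothetical endpoint, user name and password, which builds the request without sending it:

```perl
use strict;
use warnings;
use HTTP::Request;

# Hypothetical endpoint and credentials.
my $req = HTTP::Request->new(POST => 'http://service.example.com/api');
$req->header('Content-Type' => 'application/json');

# Adds an "Authorization: Basic ..." header with the base64-encoded
# "user:password" pair.
$req->authorization_basic('myuser', 'secret');

$req->content('[]');
print $req->as_string;
```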