Save directly to file and get filename using WWW::Mechanize - Perl

I would normally save to a file using this:
$mech->save_content($mech->response->filename);
But some big files cause an "Out of memory" error, so I have to use this instead:
$mech->get( $url, ":content_file"=>$tempfile );
How can I get the filename with the second method, or do I have to make one up?
I want the filename that would be returned in the response object: $mech->response->filename. I don't want to make up my own filename.

The :content_file option is inherited from LWP::UserAgent and behaves in the same way: you don't know the filename beforehand.
You could do a HEAD request to check the filename, and then do a GET request.
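A minimal sketch of that approach, assuming $url already holds the download link (the fallback filename is my own invention, not something the modules provide):
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
# HEAD fetches only the headers, so no large body is held in memory.
$mech->head( $url );
my $filename = $mech->response->filename;
# The server may not suggest a name at all; fall back to a made-up one.
$filename = 'download.bin' unless defined $filename && length $filename;
# :content_file then streams the body straight to disk.
$mech->get( $url, ':content_file' => $filename );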
Alternatively, have a look at the lwp-download utility that ships with LWP::UserAgent. It provides exactly what you need. You could use it directly, or lift the stuff you want out of it. A WWW::Mechanize object can be dropped into this code and will behave exactly the same as the LWP::UserAgent one.

Related

Perl: Passing a URL in a query string

I'm trying to pass a URL as a query string so it can be read by another website and then used:
www.example.com/domain?return_url=/another/domain
Gets returned as:
www.example.com/domain?return_url=%2Fanother%2Fdomain
Is there a way this URL can be read and parsed by the other application with the escaped characters?
The only way I can think of is to encode it somehow so it comes out like this:
www.example.com/domain?return_url=L2Fub3RoZXIvZG9tYWlu
which the other application can then decode and use?
https://www.base64encode.org/
www.example.com/domain?return_url=%2Fanother%2Fdomain
This is called URL encoding. Not because you put a URL in it, but because it encodes characters that have a special meaning in a URL.
The %2F corresponds to a slash /. You've probably also seen %20 before, which is a space.
Putting a complete URI into a URL parameter of another URI is totally fine.
http://example.org?url=http%3A%2F%2Fexample.org%2Ffoo%3Fbar%3Dbaz
The application behind the URL you are calling needs to be able to understand URL encoding, but that is a trivial thing. Typical web frameworks and interfaces to the web (like CGI.pm or Plack in Perl) will do that for you. You should not have to care about it at all.
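For example, here is a minimal PSGI sketch (using Plack::Request, my choice for illustration) in which the parameter arrives already decoded:
use Plack::Request;
my $app = sub {
    my $env = shift;
    my $req = Plack::Request->new($env);
    # Plack has already URL-decoded the query string for us.
    my $return_url = $req->parameters->{return_url};    # "/another/domain"
    return [ 302, [ Location => $return_url ], [] ];
};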
To URL-encode something in Perl, you have several options.
You could use the URI module to create the whole URI including the URL encoded query.
use URI;
my $u = URI->new("http://example.org");
$u->query_form( return_url => "http://example.org/foo/bar?baz=qrr");
print $u;
__END__
http://example.org?return_url=http%3A%2F%2Fexample.org%2Ffoo%2Fbar%3Fbaz%3Dqrr
This seems like the natural thing to do.
You could also use the URI::Encode module, which gives you a uri_encode function. That's useful if you want to encode strings without building a URI object.
use URI::Encode qw(uri_encode uri_decode);
my $data    = 'http://example.org/foo/bar?baz=qrr';
my $encoded = uri_encode($data);
my $decoded = uri_decode($encoded);
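One caveat, if I remember the URI::Encode defaults correctly: reserved characters such as / and : are left alone unless you ask for them, which matters when the data is itself a URL:
# encode_reserved is an assumption about URI::Encode's options; check
# its documentation before relying on this.
my $fully_encoded = uri_encode( $data, { encode_reserved => 1 } );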
All of this is a normal part of how the web works. There is no need to do Base64 encoding.
The correct way would be to uri-encode the second hop as you do in your first example. The URI and URI::QueryParam modules make this nice and easy:
To encode a URI, you simply create a URI object with your base URL, then add any query parameters that you want. (NOTE: they will be automatically uri-encoded by URI::QueryParam):
use strict;
use warnings;
use feature qw(say);
use URI;
use URI::QueryParam;
my $u = URI->new('http://www.example.com/domain');
$u->query_param_append('return_url', 'http://yahoo.com');
say $u->as_string;
# http://www.example.com/domain?return_url=http%3A%2F%2Fyahoo.com
To receive this url and then redirect to return_url, you simply create a new URI object then pull off the return_url query parameter with URI::QueryParam. (NOTE: again URI::QueryParam automatically uri-decodes the parameter for you):
my $u = URI->new(
    'http://www.example.com/domain?return_url=http%3A%2F%2Fyahoo.com'
);
my $return_url = $u->query_param('return_url');
say $return_url;
# http://yahoo.com

Using WWW::Mechanize, how do I add a lower case header with an underscore?

I'm using an API that requires me to add a header named "m_id" to the request.
When I use
$mech->add_header('m_id' => 'whatever')
WWW::Mechanize (or rather HTTP::Headers) “helpfully” changes the header name to “M-Id”. Which doesn't work.
Is there any way to prevent this from happening?
I thought I RTFMed before posting, but not well enough...
A second read through the HTTP::Headers perldoc told me to use:
$mech->add_header(':m_id' => 'whatever');
and that does the trick: a field name that starts with ':' is passed through without case normalization (and the colon itself is not sent).
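Put together, a minimal sketch (the endpoint URL is made up):
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
# The leading colon keeps the field name as-is, so the request
# carries "m_id" rather than the normalized "M-Id".
$mech->add_header( ':m_id' => 'whatever' );
$mech->get('http://api.example.com/endpoint');    # hypothetical endpoint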

How does one access (POST) arguments from nginx upload module in embedded lua/perl?

I've been trying to figure out how to access the results of the nginx upload module from embedded perl (using nginx-perl) or lua (using the embedded lua module). I've only been able to find examples of how to use the module with fastcgi (or similar), something I would, if possible, like to avoid having to use.
Simply pointing upload_pass at a Lua/Perl content handler does not seem to work: the body is somehow truncated to just the first line (yes, I've told it to wait for the body and made sure it's not written to a file).
At least when using Perl (I haven't tried Lua, but I'm suspecting the same thing will happen), the complete body (as raw multipart/form-data) can be made available if one does a proxy_pass to another nginx instance.
My question is threefold. Firstly, is this expected behaviour, and how are arguments passed from the upload module? Secondly, is it possible to access the results of the upload module without (re)parsing the multipart/form-data with a Perl/Lua library in the content handler?
Finally, if the latter is not possible, can I use the multipart/form-data parser used by nginx/upload without manually exporting the functions and using some form of FFI?
With Lua you can get at normal parameters via methods like this:
ngx.req.read_body()
local inputjson = ngx.req.get_body_data()
POST arguments are documented here: http://wiki.nginx.org/HttpLuaModule#ngx.req.get_post_args
Regular vars:
ngx.var.my_var
The Lua nginx module has this well documented: http://wiki.nginx.org/HttpLuaModule
The variables documented for the upload module: http://www.grid.net.ru/nginx/upload.en.html
upload_set_form_field $upload_field_name.name "$upload_file_name";
upload_set_form_field $upload_field_name.content_type "$upload_content_type";
upload_set_form_field $upload_field_name.path "$upload_tmp_path";
should be accessible via:
ngx.var.upload_field_name.path
Just log or print the variable to verify.

How can I detect the file type of image at a URL?

How do I find the file type of an image from a website URL in Perl?
For example:
$image_name = "logo";
$image_path = "http://stackoverflow.com/content/img/so/" . $image_name;
From this information, how do I find the file type? In this example it should display "png", because the full URL is http://stackoverflow.com/content/img/so/logo.png.
And if there are more files, as on the SO web site, it should show all their file types.
If you're using LWP to fetch the image, you can look at the content-type header returned by the HTTP server.
Both WWW::Mechanize and LWP::UserAgent will give you an HTTP::Response object for any GET request. So you can do something like:
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->get( "http://stackoverflow.com/content/img/so/logo.png" );
my $type = $mech->response->headers->header( 'Content-Type' );
You can't easily tell. The URL doesn't necessarily reflect the type of the image.
To get the image type you have to make a request via HTTP (GET, or more efficiently, HEAD), and inspect the Content-type header in the HTTP response.
Well, https://stackoverflow.com/content/img/so/logo is a 404. If it were not, then you could use
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my ($content_type) = head "https://stackoverflow.com/content/img/so/logo.png";
print "$content_type\n" if defined $content_type;
__END__
As Kent Fredric points out, what the web server tells you about content type need not match the actual content sent by the web server. Keep in mind that File::MMagic can also be fooled.
#!/usr/bin/perl
use strict;
use warnings;
use File::MMagic;
use LWP::UserAgent;
my $mm = File::MMagic->new;
my $ua = LWP::UserAgent->new(
    max_size => 1_000 * 1_024,
);
my $res = $ua->get('https://stackoverflow.com/content/img/so/logo.png');
if ( $res->code eq '200' ) {
    print $mm->checktype_contents( $res->content );
}
else {
    print $res->status_line, "\n";
}
__END__
You really can't make assumptions about content based on URL, or even content type headers.
They're only guides to what is being sent.
A handy trick to confuse things that use suffix matching to identify file-types is doing this:
http://example.com/someurl?q=foo#fakeheheh.png
And if you were to arbitrarily permit that image to be added to the page, it might in some cases be a doorway to an attack of some sort if the browser followed it. (For example, http://really_awful_bank.example.com/transfer?amt=1000000;from=123;to=123)
Content-type based forgery is not so detrimental, but you can do nasty things if the person who controls the server works out how you identify things and sends different content types for HEAD requests than for GET requests.
It could tell the HEAD request that it's an image, but then tell the GET request that it's application/javascript, and goodness knows where that will lead.
The only way to know for certain what it is, is to download the file and do magic-based identification, or go further (i.e., try to decode the image). Then all you have to worry about is images that are too large, and specially crafted images that could trip vulnerabilities in computers that are not yet patched for that vulnerability.
Granted all of the above is extreme paranoia, but if you know the rare possibilities you can make sure they can't happen :)
From what I understand, you're not worried about the content type of an image you already know the name+extension for; you want to find the extension for an image you know the base name of.
In order to do that you'd have to test all the image extensions you wanted individually and record which ones resolved and which ones didn't. For example, both https://stackoverflow.com/content/img/so/logo.png and https://stackoverflow.com/content/img/so/logo.gif could exist. They don't in this exact situation, but on some arbitrary server you could have multiple images with the same base name but different extensions. Unfortunately, there's no way to get a list of available extensions of a file in a remote web directory by supplying its base name, other than looping through the possibilities, as sketched below.
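A small sketch of that brute-force probe, with the candidate extension list being my own guess:
use LWP::UserAgent;
my $ua   = LWP::UserAgent->new;
my $base = 'https://stackoverflow.com/content/img/so/logo';
# Probe each candidate extension with a cheap HEAD request and keep
# the ones the server actually serves.
my @found;
for my $ext (qw(png gif jpg jpeg)) {
    my $res = $ua->head("$base.$ext");
    push @found, "$base.$ext" if $res->is_success;
}
print "$_\n" for @found;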

Transparently Handling GZip Encoded content with WWW::Mechanize

I am using WWW::Mechanize and currently handling HTTP responses with the 'Content-Encoding: gzip' header in my code by first checking the response headers and then using IO::Uncompress::Gunzip to get the uncompressed content.
However, I would like to do this transparently, so that WWW::Mechanize methods like forms(), links() etc. work on and parse the uncompressed content. Since WWW::Mechanize is a sub-class of LWP::UserAgent, I would prefer to use LWP::UserAgent handlers to do this.
While I have been partly successful (I can print the uncompressed content, for example), I am unable to do it transparently in a way that lets me call
$mech->forms();
In summary: How do I "replace" the content inside the $mech object so that from that point onwards, all WWW::Mechanize methods work as if the Content-Encoding never happened?
WWW::Mechanize::GZip, I think.
It looks to me like you can replace it by using the $res->content( $bytes ) method.
By the way, I found this stuff by looking at the source of LWP::UserAgent, then HTTP::Response, then HTTP::Message.
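An untested sketch of what that could look like with an LWP handler, so that later calls such as $mech->forms() see the decoded HTML:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->add_handler(
    response_done => sub {
        my ( $response, $ua, $handler ) = @_;
        return unless ( $response->header('Content-Encoding') // '' ) =~ /gzip/i;
        # charset => 'none' undoes only the Content-Encoding, leaving
        # the charset alone, so we get plain bytes back.
        my $bytes = $response->decoded_content( charset => 'none' );
        if ( defined $bytes ) {
            $response->content($bytes);
            $response->remove_header( 'Content-Encoding', 'Content-Length' );
        }
    },
);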
It is built in with UserAgent and thus Mechanize. One MAJOR caveat to save you some hair:
To debug, make sure you check the error variable $@ after the call to decoded_content.
$html = $r->decoded_content;
die $@ if $@;
Better yet, look through the source of HTTP::Message and make sure all the support packages are there.
In my case, decoded_content returned undef while content was raw binary, and I went on a wild goose chase. UserAgent will set the error flag on a failure to decode, but Mechanize will just ignore it (it doesn't check or log the incident as its own error/warning).
In my case $@ said: "Can't find IO/HTML.pm ... It was eval'ed".
After having to dive into the source, I find out the built-in decoding process is long, meticulous, and arduous, covering just about every scenario and making tons of guesses (Thank you Gisle!).
If you are paranoid, explicitly set the default headers to be used with every request at new():
my $browser = WWW::Mechanize->new(
    default_headers => HTTP::Headers->new(
        'Accept-Encoding' => scalar HTTP::Message::decodable(),
    ),
);