Get information about file attachment via Perl Mechanize

BACKGROUND
I am experimenting with Mechanize on a web forum. The forum has some file attachments in its threads. The attachment can be of various media types. Each attachment has a link to a server-side program called "attachment.php?" and a unique id which identifies the file. When you visit it in a normal browser, a file is returned and the browser decides what to do with it. If it's an image, the file is displayed in the browser window and the titlebar is set to the filename. If it's another type of file, the browser will ask if you want to download the file (and it automatically sets the filename to the name of the file).
QUESTION
My question is how can I explore the details of such file attachments with Mechanize so that I can determine filetype and filename?
I've already successfully downloaded a file using my program, but I have to tell Mechanize what the filename should be. I would prefer to keep the original filename, but to do that I have to be able to discover it somehow. I know it can be done because my browser is able to determine the filetype and filename.
As a secondary objective I would also like to query the size of the file, if this is possible.
I hope my question makes sense and thank you in advance to anyone who takes the time to answer.

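To inspect the attachment, use $mech->res(). It returns the HTTP::Response object for the last request, and that class provides a filename method. The method takes the name from the Content-Disposition header when the server sends one, and otherwise falls back to the Content-Location header or the request URL.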
Example:
foreach (@media)
{
    print "Fetching " . $_->url() . "\n";
    $m->get($_);
    # $m is the WWW::Mechanize object; res() is the response to the last request
    my $res = $m->res();
    if ($res->is_success)
    {
        my $filename = $res->filename();
        print "$filename\n";
    }
}
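The question also asks about file type and size. A minimal sketch, assuming the server sends Content-Type and Content-Length headers (not every server does): the same HTTP::Response object exposes them through content_type and content_length. The response below is built by hand, with made-up header values, purely to stand in for what $mech->res() would return.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built response standing in for $mech->res(); the header values
# are invented for illustration.
my $res = HTTP::Response->new(200, 'OK', [
    'Content-Type'        => 'image/png',
    'Content-Length'      => 12345,
    'Content-Disposition' => 'attachment; filename="photo.png"',
]);

my $type = $res->content_type;    # media type from the Content-Type header
my $size = $res->content_length;  # undef if the server omits Content-Length
my $name = $res->filename;        # name from Content-Disposition, else the URL

print "name: $name\n";
print "type: $type\n";
print "size: ", (defined $size ? "$size bytes" : "unknown"), "\n";
```

Since WWW::Mechanize is a subclass of LWP::UserAgent, a HEAD request may return the same headers without transferring the file body, provided the forum's attachment script answers HEAD requests sensibly.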

Related

Download file with Perl on form submit

I am trying to initiate a Save As dialog on a form submit. My form is pretty simple, I'm using Dropzonejs for a drag and drop file, and it looks like this:
<form action="action.epl" class="dropzone" id="dropzone" method="post">
</form>
So when the user drops the file it submits and kicks off action.epl. In action.epl, I handle the file and it gets saved to the server. Then I'm trying to spit back out an encrypted version of the file. The encryption is done and I have removed it to make sure it is not the source of the problem, the problem I have now is that I can't get it to download from the server. I have the following (also in action.epl):
$fileName = 'file.pdf';
$filepath= "/server/path/$fileName";
open (FILE, "<$filepath") or die "can't open : $!";
@fileholder = <FILE>;
close FILE;
print "Content-Type:application/x-downloadn";
print "Content-Disposition:attachment;filename=$fileName";
print @fileholder
It's doing /something/ because the submit takes 5x as long as it did without this snippet. What I thought I would get was the "Save As" dialog but nothing happens. This tutorial is where I got my info.
Edit, now I have:
$fileName ='file.pdf';
$filepath = "/server/path/$fileName";
print "Content-Type:application/x-download\n";
print "Content-Disposition:attachment;filename=$fileName\n\n";
open FILE, '<', $filepath or die "can't open: $!";
print while <FILE>;
close FILE;
However, there is still no dialog. I see you have the "$" sigil in your filehandle. I tried that too. But I don't think you need that, right?
I see you addressed the typo in the tutorial. The
"Content-Type:application/x-downloadn"
should be "Content-Type: application/x-download\n", to specify the type of the content as "application/x-download" and the "\n" to end the line of the header field.
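A minimal sketch of a response with well-formed headers, assuming a plain CGI-style script that writes its own headers to standard output (an Embperl page such as action.epl may need to set headers through the framework instead). The helper names download_headers and send_file are hypothetical, and the sketch assumes the file name contains no double quotes or newlines.

```perl
use strict;
use warnings;

# Build the header block. Each header ends with a newline, and an
# empty line separates the headers from the body.
sub download_headers {
    my ($file_name, $type) = @_;
    $type = 'application/octet-stream' unless defined $type;
    return "Content-Type: $type\n"
         . "Content-Disposition: attachment; filename=\"$file_name\"\n"
         . "\n";
}

# Send the headers, then stream the file in binary mode.
sub send_file {
    my ($out, $path, $file_name) = @_;
    open my $in, '<', $path or die "can't open $path: $!";
    binmode $in;
    binmode $out;
    print {$out} download_headers($file_name, 'application/x-download');
    local $/ = \8192;    # read fixed 8 KB chunks instead of lines
    print {$out} $_ while <$in>;
    close $in;
}

# Usage: perl script.pl /server/path/file.pdf
send_file(\*STDOUT, $ARGV[0], 'file.pdf') if @ARGV;
```

Reading in fixed-size chunks with binmode avoids loading the whole file into memory and keeps binary content such as a PDF byte-for-byte intact.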
After that, you're getting into how the browser handles the response. If you provide the Content-Disposition:attachment;filename=$fileName header, you're asserting that the attachment ought to be the given filename $filename. Many browsers will take a peek at the file name, and try to sniff a suitable MIME type for the extension. So, if you're specifying that $filename is a .pdf then to modern browsers, "if it looks like a pdf and smells like a pdf, then it's a pdf". Not only are you saying "this is a pdf" with your specification of the Content-Disposition, you're also providing the name for the file that they download. In most situations, this should prevent the fall-back "save as" behavior.
Your best bet would be to not provide the Content-Disposition. That way, you're not specifying any default name to save the file as, and as such there's no extension for the browser to snoop. Unfortunately, some browsers simply default to the name of the script even if the extension is absurd compared to the contents. In some of the "enterprise solutions" that I deal with in my day-to-day, I get .csv files named as "report.cgi" because they use a MIME type that only Internet Explorer recognizes and they don't provide a Content-Disposition. Buyer beware.
The bottom line is that you can't force the browser to open the "Save As" from the server side unless you have information about the browser and know how to trick it, or you simply don't give it anything to go by (and even then some browsers may have default conventions).
By specifying the Content-Disposition and a filename, you're giving hints as to what the file should be, and what it should be saved as. On the other hand, if you don't give any other hints to the browser other than that the Content-Type is 'application/x-download' then you'll probably get a "Save As" dialog box, but the user will have no idea what kind of content the data is. This puts you at the mercy of the browser's default naming conventions. This is how I get my .csv files as "report.cgi", even when the server is providing a MIME type for csv files (though an IE-only flavor).
What I do is use perl's File::Type and the mime_type function to get the mime type and simply specify a name. If you use the mime_type function for determining the MIME type, and don't specify a file to return, you'll get silly things like .xlsx files being downloaded as zip files, or other absurdity.
How important is it that they get the "Save As" dialog box? At the end of the day, what file type they choose is irrelevant if the content of the file is not appropriate for the type and they try to open an Excel file in Acrobat, or vice versa.
In all of my years of experience doing server side programming, I have always found it futile to try to control the client side.

MATLAB How to delete a specific page from a .pdf File?

I recently learned how to download .pdf files using urlwrite, but I was wondering if there is any way to specify which pages of the .pdf to save.
The files are always either 1 or 2 pages long, and I only want to keep the first page of the .pdf. Is there any way to directly download just the first page, and if not, is there a way to download the entire .pdf and then get rid of the 2nd page?
I know that it is possible to manually get rid of the second page in Preview or Adobe Acrobat and other applications, but it'd make things a lot easier if I could automate the process in MATLAB.
Any help would be greatly appreciated!
Find an appropriate command line tool (example uses pdftk), and then you can make a call to it from MATLAB. Use sprintf to assemble the appropriate command and then pass it to system. This puts the output in a temporary file then uses movefile to change the filename back:
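temp = 'sometempfile.pdf';
urlwrite(someurl, filename);   % download the full PDF
% keep only page 1, writing to a temporary file (the quotes guard against spaces in names)
system(sprintf('pdftk "%s" cat 1 output "%s" dont_ask', filename, temp));
movefile(temp, filename);      % replace the download with the one-page version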

Trouble reading text from a pdf in Perl

I am trying to read the text content of a pdf file into a Perl variable. From other SO questions/answers I get the sense that I need to use CAM::PDF. Here's my code:
#!/usr/bin/perl -w
use CAM::PDF;
my $pdf = CAM::PDF->new('1950-01-01.pdf');
print $pdf->numPages(), " pages\n\n";
my $text = $pdf->getPageText(1);
print $text, "\n";
I tried running this on this pdf file. There are no errors reported by Perl. The first print statement works; it prints "2 pages" which is the correct number of pages in this document.
The next print statement does not return anything readable. Here's what the output looks like in Emacs:
2 pages
^A^B^C^D^E^C^F^D^G^H
^D^A^K^L^C^M^D^N^C^M^O^D^P^C^Q^Q^C ^D^R^K^M^O^D ^A^B^C^D^E
^F^G^G^H^E
^K^L
^M^N^E^O^P^E^O^Q^R^S^E
.... more lines with similar codes ....
Is there something I can do to make this work? I don't understand pdf files too well, but I thought that because I can easily copy and paste the text from the PDF file using Acrobat, it must be recognized as text and not an image, so I hoped this meant I could extract it with Perl.
Any guidance would be much appreciated.
PDFs can have different kinds of content. A PDF may not have any readable text at all, only bitmaps and graphical content, for example. The PDF you linked to has compressed data in it. Open it with a text editor, and you will see that the content is in a "/Filter/FlateDecode" block. Perhaps CAM::PDF doesn't support that. Google FlateDecode for a few ideas.
Looking further into that PDF, I see that it also uses embedded subsets of fonts, with custom encodings. Even if CAM::PDF handles the compression, the custom encoding may be what's throwing it off. This may help: Web page from a software company, describing the problem
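Output like the ^A^B^C runs in the question is what text tends to look like when character codes are indexes into a custom font encoding rather than real characters. A rough sketch of a check for that symptom, so a script can flag pages it could not decode; the function name looks_unreadable and the 30% threshold are assumptions chosen for illustration, not part of CAM::PDF.

```perl
use strict;
use warnings;

# Heuristic: treat the text as undecoded if a large share of it consists
# of control characters (tab, newline and carriage return are excluded).
sub looks_unreadable {
    my ($text) = @_;
    my $total = length $text;
    return 1 unless $total;    # nothing extracted at all
    my $control = () = $text =~ /[\x00-\x08\x0B\x0C\x0E-\x1F]/g;
    return $control / $total > 0.3 ? 1 : 0;
}

for my $sample ("\x01\x02\x03\x04\x05\x03\x06\x04", "Plain readable text\n") {
    print looks_unreadable($sample) ? "unreadable\n" : "readable\n";
}
```

In the question's script this could be applied to the string returned by getPageText before printing it.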
I'm fairly certain that the issue isn't with your perl code, it is with the PDF file. I ran the same script on one of my own PDF files, and it works just fine.

Open a local web page from Perl

I'm writing a Perl script that creates HTML output and I would like to have it open in the user's preferred browser. Is there a good way to do this? I can't see a way of using ShellExecute since I don't have an http: address for it.
Assuming you saved your output to "../data/index.html",
$ret = system( 'start ..\data\index.html' );
should open the file in the default browser.
Added:
Advice here:
my $filename = "/xyzzy.html"; #whatever
system("start file://$filename");
If I understand what you're trying to do, this will not work. You would have to set up a web server, like Apache, and configure it to execute your script. This wouldn't be a trivial task if you've never done it before.
Since this is Windows, the easy option is to dump the data to a temporary file using File::Temp (making sure it has an extension .htm or .html, and that it isn't cleaned up immediately on script exit, so that the file remains, i.e., you probably want something like File::Temp->new(UNLINK => 0, SUFFIX => '.htm')). Then you ought to be able to use Win32::FileOp's ShellExecute to open the file regularly. This does make all sorts of assumptions about file types being associated with file extensions, but then, that's how Windows tends to work.
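A minimal sketch of the temporary-file approach. It uses the shell's start command from the earlier answers rather than Win32::FileOp, and adds open and xdg-open as assumed equivalents on macOS and Linux; the helper names write_html_temp and open_command are hypothetical. The launch itself is left commented out so the script can be run without opening a browser.

```perl
use strict;
use warnings;
use File::Temp;

# Write the HTML to a temporary .htm file that survives script exit.
sub write_html_temp {
    my ($html) = @_;
    my $tmp = File::Temp->new(UNLINK => 0, SUFFIX => '.htm');
    print {$tmp} $html;
    close $tmp or die "can't close temp file: $!";
    return $tmp->filename;
}

# Build the shell command that hands the file to the default handler.
sub open_command {
    my ($path) = @_;
    return qq{start "" "$path"} if $^O eq 'MSWin32';
    return qq{open "$path"}     if $^O eq 'darwin';
    return qq{xdg-open "$path"};
}

my $path = write_html_temp("<html><body><p>Hello</p></body></html>\n");
print "Wrote $path\n";
print "To view it: ", open_command($path), "\n";
# system(open_command($path));    # uncomment to launch the browser
```

The empty "" after start is the window title argument; without it, start treats a quoted path as the title instead of the file to open.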

How can I limit file types in CGI file uploads in Perl?

I am using CGI to allow the user to upload some files. I just want the user to be able to upload .txt or .csv files. If the user uploads a file in any other format, then I want to be able to put out an error message.
I saw that this can be done by javascript: http://www.codestore.net/store.nsf/unid/DOMM-4Q8H9E
But is there a better way to achieve this? Is there some functionality in Perl that allows this?
The disclaimer on the site you link to is important:
Note: This is not entirely foolproof as people can easily change the extension of a file before uploading it, or do some other trickery, as in the case of the "LoveBug" virus.
If you really want to do this right, let the user upload the file, and
then use something like File::MimeInfo::Magic (or file(1), the
UNIX utility) to guess the actual file type. If you don't like the
file type, delete the file and give the user an error message.
I just want the user to be able to upload .txt or .csv files.
Sounds easy, doesn't it? It's not. And then some.
The simple approach is just to test that the file ends in ‘.txt’ or ‘.csv’ before storing it on the filesystem. This should be part of a much more in-depth validation of what the filename is allowed to contain before you let a user-submitted filename anywhere near the filesystem.
Because the rules about what can go in a filename are complex on some platforms (especially Windows) it's usually best to create your own filename independently with a known-good name and extension.
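A rough sketch of those two steps: check the extension of the client-supplied name, then store the upload under a name generated on the server. The helper names allowed_extension and storage_name and the upload_ prefix are hypothetical, and the name format is only one of many that would work.

```perl
use strict;
use warnings;

my %ALLOWED = map { $_ => 1 } qw(txt csv);

# Return the lowercased extension if it is allowed, otherwise undef.
sub allowed_extension {
    my ($client_name) = @_;
    return undef unless defined $client_name;
    my ($ext) = $client_name =~ /\.([A-Za-z0-9]+)\z/;
    return undef unless defined $ext && $ALLOWED{ lc $ext };
    return lc $ext;
}

# Build a server-side name that ignores the client-supplied name entirely.
sub storage_name {
    my ($ext) = @_;
    return sprintf 'upload_%d_%d_%04d.%s', time, $$, int(rand 10_000), $ext;
}

for my $name ('report.CSV', 'notes.txt', 'evil.txt.exe', 'noextension') {
    my $ext = allowed_extension($name);
    print "$name: ",
        (defined $ext ? 'accept as ' . storage_name($ext) : 'reject'), "\n";
}
```

Only the extension survives from the user's input, and only after it has matched the whitelist, so nothing the user typed reaches the filesystem path.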
In any case there is no guarantee that the browser will send you a file with a usable name at all, and even if it does there is no guarantee that name will have ‘.txt’ or ‘.csv’ at the end, even if it is a text or CSV file. (Some platforms simply do not use extensions for file typing.)
Whilst you can try to sniff the contents of the file to see what type it might be, this is highly unreliable. For example:
<html>,<body>,</body>,</html>
could be plain text, CSV, HTML, XML, or a variety of other formats. Better to give the user an explicit control to say what file type they're uploading (or use one file upload field per type).
Now here's where it gets really nasty. Say you've accepted the upload and stored it as /data/mygoodfilename.txt, and the web server is correctly serving it as the Content-Type ‘text/plain’. What do you think the browser interprets it as? Plain text? You should be so lucky.
The problem is that browsers (primarily IE) don't trust your Content-Type header, and instead sniff the contents of the file to see if it looks like something else. Serve the above snippet as plain text, and IE will happily treat it as HTML. This can be a huge problem, because HTML can include client-side scripts that will take over the user's access to the site (a cross-site-scripting attack).
At this point you might be tempted to sniff the file on the server-side, for example using the ‘file’ command, to check it doesn't contain ‘<html>’. But this is doomed to failure. The ‘file’ command does not sniff for all the same HTML tags as IE does, and other browsers sniff differently anyway. It's quite easy to prepare a file that ‘file’ will claim is not HTML, but that IE will nevertheless treat as if it is (with security-disaster implications).
Content-sniffing approaches such as ‘file’ will give you only a false sense of security. This is a convenience tool for loose guessing of filetypes and not an effective security measure.
At this point your last desperate possibilities are things like:
serving all user-uploaded files from a separate hostname, so that a script injection attack can't purloin the credentials of your main site;
serving all user-uploaded files through a CGI wrapper, adding the header ‘Content-Disposition: attachment’ so that browsers won't attempt to display them directly;
only accepting uploads from trusted users.
On Unix the easiest way is to do as JRockway suggested. If not on Unix then your options are limited. You can examine the file extension and you can examine the contents to verify. I'm assuming for your specific case that you only want "* separated value" text files. So one of the Text::CSV::* modules may be useful in verifying the file is the type you asked for.
Security for this operation is a whole other ball of wax.
Try this:
$file_name = "file.txt";
$file_cmd = "file \"$file_name\"";
$file_type = `$file_cmd`;
return 0 unless ($file_type =~ /(ASCII|text)/i);
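A variant sketch of the same check that avoids interpolating the file name into a shell command, by using the list form of a piped open. It assumes a file(1) that supports the --brief and --mime-type options; the helper names sniff_mime_type and looks_like_text are hypothetical. As noted in the earlier answer, this is a convenience check and not a security measure.

```perl
use strict;
use warnings;
use File::Temp;

# Ask file(1) for a MIME type without going through a shell.
# Returns undef if the command is unavailable or produces no output.
sub sniff_mime_type {
    my ($path) = @_;
    my $pid = open(my $fh, '-|', 'file', '--brief', '--mime-type', '--', $path);
    return undef unless $pid;
    my $type = <$fh>;
    close $fh;
    return undef unless defined $type;
    chomp $type;
    return $type;
}

# True if file(1) reports some text/* type, for example text/plain.
sub looks_like_text {
    my ($path) = @_;
    my $type = sniff_mime_type($path);
    return (defined $type && $type =~ m{\Atext/}) ? 1 : 0;
}

# Demonstration on a small temporary CSV file.
my $demo = File::Temp->new(SUFFIX => '.csv');
print {$demo} "name,value\nalpha,1\nbeta,2\n";
close $demo;

my $type = sniff_mime_type($demo->filename);
print "detected type: ", (defined $type ? $type : 'unavailable'), "\n";
print "looks like text: ", looks_like_text($demo->filename), "\n";
```

Because the path is passed as a separate argument, a name containing quotes, spaces or shell metacharacters cannot change the command that runs.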