I'm writing a simple program that has to change some data on a Polish auction site.
One of the steps involves loading an edit page, changing one value, and submitting it.
A sample page can be viewed here: http://depesz.com/various/new_item.php.html - this is just a static copy of such an edit page.
The relevant part of my Perl code:
$agent->form_number( 1 );
$agent->submit();
$agent->form_number( 1 );
my $q = $agent->current_form()->find_input( 'scheme_id' );
$agent->field('scheme_id', '1025');
# $agent->field('description', encode('utf-8', $agent->value("description")));
# $agent->field('location', encode('utf-8', $agent->value("location")));
# $agent->field('transport_shipment_description', encode('utf-8', $agent->value("transport_shipment_description")));
$agent->submit;
print $agent->response->decoded_content . "\n";
After the first submit I get the page I showed. Then I change the value of the scheme_id field to 1025 and submit the form.
Afterward I get:
HTTP::Message content must be bytes at /usr/local/share/perl/5.8.8/HTTP/Request/Common.pm line 91
I tried to re-encode the values of the text fields on the form (hence the commented-out $agent->field(... encode) lines), but it didn't help.
At the moment I have no idea what on the form can make WWW::Mechanize fail in this way, and I clearly cannot fix it on my own.
Is there any way to debug this situation? Or perhaps I should do something differently?
Make sure your LWP and WWW-Mechanize modules are fully up to date. LWP fixed a number of encoding problems in late 2008, if I recall correctly.
I had the same problem.
Solved it with:
my $newcontent = encode('utf-8', $file);
before posting the content!
thanks,
mike
see http://www.perlmonks.org/?node_id=647935
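The error message can be explored without any network access. As far as I know, HTTP::Message raises it when the content cannot be downgraded to a byte string, that is, when it contains a character above 255. The sketch below uses only core functions; is_byte_string is a hypothetical helper that mimics that check and is not part of LWP.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: true if the string can be stored as plain bytes,
# i.e. it contains no character above 255.
sub is_byte_string {
    my ($string) = @_;
    my $copy = $string;    # utf8::downgrade modifies its argument, so use a copy
    return utf8::downgrade($copy, 1) ? 1 : 0;
}

my $text  = "Za\x{17C}\x{F3}\x{142}\x{107}";    # Polish text as Perl characters
my $bytes = encode('UTF-8', $text);             # the same text as UTF-8 bytes

printf "decoded text usable as content? %s\n", is_byte_string($text)  ? 'yes' : 'no';
printf "encoded text usable as content? %s\n", is_byte_string($bytes) ? 'yes' : 'no';
```

Running a check like this on each form value before submit may help find the field that still holds decoded characters.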
I'm trying to automate the extraction of a transcript found on a website. The entire transcript is found between dl tags, since the site formatted the interview as a description list. The script I have below allows me to search the site and extract the text in a plain-text format, but I'm actually looking for it to include everything between the dl tags, meaning dd's, dt's, etc. This will allow us to develop our own CSS for the interview.
Something to note about the page is that there are break statements inserted at various points during the interview. Some tools we've found that extract information from webpages using pairings have found this to be a problem since it only grabs the information up until the break statement. Just something to keep in mind if you point me in a different direction. Here's what I have so far.
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;
my $mech = WWW::Mechanize->new();
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get("http://millercenter.org/president/clinton/oralhistory/madeleine-k-albright");
# find all <dl> tags
my @list = $mech->find('dl');
foreach ( @list ) {
    print $_->as_text();
}
If there is a tool that essentially prints what I have, only this time as HTML, please let me know of it!
Your code is fine, just change the as_text() method to as_HTML() and it will show the content with HTML tags included.
I am redesigning a website where, based on the options selected by the user, I need to fetch data from a DB and then give it to the user in a downloadable format. I am fetching the data into a string variable, but I don't want to write it to a file first and then write the download code. I want to download the string to a file on the client side. I am using Perl for this.
Previously I was reading and downloading from a file using this Perl CGI code:
...
my $ID = "details.csv";
my @fileholder;
my $files_loc = "/html/details.csv";
open(DLFILE,'<',"$files_loc") || Error('open','file');
@fileholder = <DLFILE>;
close(DLFILE);
print "Content-Type:application/x-download\n";
print "Content-Disposition:attachment;filename=$ID\n\n";
print @fileholder;
Which is saved as downloadscript.cgi. But now I want to do this in a .pm file, and I am storing string values in @fileholder. I tried with:
my $ID = "details.txt";
my @fileholder = qw(name age address);
print "Content-Type:text/plain\n";
print "Content-Disposition:attachment;filename=$ID\n\n";
print @fileholder;
in the .pm file, but it is printing the above lines on the screen instead of opening the 'Save as' dialog. Both are Perl, so where am I going wrong?
EDIT: I found the reason. I have already sent an HTML content type earlier, and then in the middle of the output I am sending this "Content-Type:text/plain\n", which is where the browser gets confused. Now, could someone please tell me how to close the previous HTML content type and open this new content type for downloading?
This is because the browser acts on the content type you send with print "Content-Type:text/plain\n";. As @Julian mentions, you could try changing that line back, or adding the line print "Content-Type:application/x-download\n"; after the text/plain line, and see if this fixes things. Since that particular content type may actually need a real file to work with, you could also try other content types (see @Hunter McMillen's suggestion), since the browser may offer a download/save dialog in that case.
You might need to add the following to fool the browser (and remove the text/csv):
print "Content-Type:application/x-download\n";
print "Content-Disposition:attachment;filename=$ID\n\n";
Okay, looks like there can be only one content-type set for one response. So, I am now navigating to another page from my current page, where I am writing the download code. Thanks for your help people!
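Since a response carries a single set of headers, the download has to be a response of its own, with the headers printed before any body. A minimal sketch of building such a response from a string, without a temporary file, could look like this. download_response is a hypothetical helper name, and application/octet-stream is an assumed generic content type used here in place of application/x-download.

```perl
use strict;
use warnings;

# Hypothetical helper: build one complete CGI response that offers $body
# as a file download. Nothing else may be printed before it.
sub download_response {
    my ($filename, $body) = @_;

    # Keep only safe characters in the name that ends up in the header.
    (my $safe_name = $filename) =~ s/[^A-Za-z0-9._-]/_/g;

    return join '',
        "Content-Type: application/octet-stream\n",
        "Content-Disposition: attachment; filename=\"$safe_name\"\n",
        "\n",    # the blank line ends the headers
        $body;
}

my @fileholder = qw(name age address);
print download_response('details.csv', join(',', @fileholder) . "\n");
```

Returning the response as a string keeps the helper usable from a .pm file; the calling script decides when to print it.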
This is the code to read text of a pdf using perl
#!/usr/bin/perl
use PDF::API2;
$pdf = PDF::API2->new;
$pdf = PDF::API2->open('01443325.pdf');
$page = $pdf->page;
$pagenum=10;
$pdf->stringify;
$page = $pdf->openpage($pagenum);
print $page;
I don't get any output when I run this code. How can I fix it?
When you run $pdf->stringify above, it returns the content of the file as a string, but then you don't do anything with it. If you were to print it, though, it would not give you the text representation you are after as it is simply the original PDF bytes in a string.
Likewise, setting $pagenum to 10 has no consequences for the rest of the program as the variable is not linked to either the $pdf or $page object in any way.
I think the easiest option is to not try to do this with PDF::API2, but to look at whether you can run something like pdftotext from xpdf or poppler first and then read in the output.
If not, then there are some suggestions on the Perl Monks page http://www.perlmonks.org/?node_id=810721, and many more on Google under "perl extract text from pdf". There's even a previous SO question at How can I extract text from a PDF file in Perl?.
Good luck!
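The pdftotext route mentioned above can be sketched as follows, assuming the pdftotext command from xpdf or poppler is installed and on the PATH. pdf_page_text is a hypothetical helper name; the file name and page number are taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of one page, or undef on failure.
# Assumes the external pdftotext command is available.
sub pdf_page_text {
    my ($file, $page) = @_;

    # -f and -l select the first and last page; "-" sends the text to stdout.
    my @cmd = ('pdftotext', '-f', $page, '-l', $page, $file, '-');

    open my $fh, '-|', @cmd or return;    # command could not be started
    local $/;                             # slurp mode: read all output at once
    my $text = <$fh>;
    close $fh or return;                  # non-zero exit status means failure
    return $text;
}

my $text = pdf_page_text('01443325.pdf', 10);
print defined $text ? $text : "Could not extract text\n";
```

The list form of open avoids passing the file name through a shell.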
I have encountered a weird situation while updating/upgrading some legacy code.
I have a variable which contains HTML. Before I can output it, it has to be filled with lots of data. In essence, I have the following:
for my $line (@lines) {
    $output = loadstuff($line, $output);
}
Inside of loadstuff(), there is the following
sub loadstuff {
    my ($line, $output) = @_;
    # here the process is simplified for better understanding.
    my $stuff = getOtherStuff($line);
    my $result = $output.$stuff;
    return $result;
}
This function builds a page which consists of different areas. Each area is loaded independently, which is why there is a for loop.
Trouble starts right about here. When I load the page from ground up (click on a link, Perl executes and delivers HTML), everything is loaded fine. Whenever I load a second page via AJAX for comparison, that HTML has broken encoding.
I tracked down the problem to this line my $result = $output.$stuff. Before the concatenation, $output and $stuff are fine. But afterward, the encoding in $result is messed up.
Does somebody have a clue why concatenation messes up my encoding? While we are on the subject, why does it only happen when the call is done via AJAX?
Edit 1
The Perl and the AJAX call both execute the very same functions for building up a page. So, whenever I fix it for AJAX, it is broken for freshly reloaded pages. It really seems to happen only if AJAX starts the call.
The only difference in this particular case is that the current values for the page are compared with older ones (it is a backup/restore function). From here, everything is the same. The encoding of the variables is OK, as far as I can tell. I even tried the Encode functions only on the values loaded from AJAX, but to no avail. The files themselves seem to be UTF-8 according to "Kate".
Besides that, I have another function with the same behavior which uses the exact same functions, values and files. When the call is started from Perl/Apache, the encoding is OK. Via AJAX, again, it is messed up.
I have been examining the AJAX request (jQuery) and could not find anything odd. The encoding seems to be UTF-8 too.
Perl has a “utf8” flag for every scalar value, which may be “on” or “off”. When the flag is on, Perl treats the value as a string of Unicode characters.
If you take a string with the utf8 flag off and concatenate it with a string that has the utf8 flag on, Perl upgrades the first one to Unicode, interpreting its bytes as Latin-1. If those bytes were actually UTF-8, the result is garbled. This is the usual source of problems.
You need to either convert both variables to bytes with Encode::encode() or to perl's internal format with Encode::decode() before concatenation.
See perldoc Encode.
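A small sketch of the effect, using only the core Encode module. The variable names are illustrative; $bytes stands for data that arrived as undecoded UTF-8 and $chars for data that was already decoded.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\xA9";    # UTF-8 bytes for "e with acute", not yet decoded
my $chars = "\x{142}";     # "l with stroke" as a single Perl character

# Mixing them: the two bytes are upgraded as two separate Latin-1 characters.
my $mixed = $bytes . $chars;

# Decoding first: both sides are character strings before concatenation.
my $fixed = decode('UTF-8', $bytes) . $chars;

printf "mixed has %d characters\n", length $mixed;
printf "fixed has %d characters\n", length $fixed;
```

The mixed string ends up with three characters instead of two, which is what shows up as broken encoding once the result is encoded for output.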
Expanding on the previous answer, here's a little more information that I found useful when I started messing with character encodings in Perl.
This is an excellent introduction to Unicode in perl: http://perldoc.perl.org/perluniintro.html. The section "Perl's Unicode Model" is particularly relevant to the issue you're seeing.
A good rule to use in Perl is to decode data to Perl characters on its way in and encode it into bytes on its way out. You can do this explicitly using Encode::decode and Encode::encode. If you're reading from or writing to a filehandle, you can specify an encoding on the filehandle by using binmode and setting a layer: perldoc -f binmode
You can tell which of the strings in your example has been decoded into Perl characters using Encode::is_utf8:
use Encode qw( is_utf8 );
print is_utf8($stuff) ? 'characters' : 'bytes';
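The "decode on the way in, encode on the way out" rule can also be sketched with encoding layers. This example uses in-memory filehandles so it runs on its own; with real files the open modes would be the same, and the sample data is an assumption made for illustration.

```perl
use strict;
use warnings;

# Input: the word "lodz" with Polish diacritics, as UTF-8 bytes plus a newline.
my $raw = "\xC5\x82\xC3\xB3d\xC5\xBA\n";

# Decode on the way in: the layer turns the bytes into Perl characters.
open my $in, '<:encoding(UTF-8)', \$raw or die "Cannot open input: $!";
my $line = <$in>;
close $in;
chomp $line;

# Encode on the way out: the layer turns the characters back into bytes.
my $encoded = '';
open my $out, '>:encoding(UTF-8)', \$encoded or die "Cannot open output: $!";
print {$out} $line;
close $out;

printf "characters read: %d, bytes written: %d\n", length $line, length $encoded;
```

Inside the program the string has four characters; outside it is seven bytes, and no explicit Encode calls are needed in between.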
A colleague of mine found the answer to this problem. It really had something to do with the fact that AJAX started the call.
The file structure is as follows:
1 handler, accessed by Apache
1 handler, accessed by Apache, which contains only AJAX responders. We call it the AJAX-Handler
1 package, which contains functions relevant to the entire software and accesses other packages from our own framework
Inside of the AJAX-Handler, we print the result as such
sub handler {
my $r = shift;
# processing output
$r->print($output);
return Apache2::Const::OK;
}
Now, when I replace $r->print($output); by print($output);, the problem disappears! I know that this is not the recommended way to print stuff in mod_perl, but this seems to work.
Still, any ideas how to do this the proper way are welcome.
I understand the need to sanitize inputs from an HTML form, but when I sanitized the file upload field in a recent module of mine, the file upload started failing. It's important to sanitize all form inputs, right? Even the special file upload field?
My form output code looks something like this:
use CGI;
my $cgi = new CGI;
print $cgi->header();
# ... print some HTML here
print $cgi->start_form();
print $cgi->filefield(-name=>'uploaded_file',
-size=>50,
-maxlength=>80);
print $cgi->submit(-name=>'continue',
-value=>'Continue');
print $cgi->end_form();
# ... print some more HTML here
And my sanitization code looks something like this (it's actually earlier in the same module as above):
use HTML::Entities;
# The hyphen goes last so it is a literal character, not a range from ',' to '_'.
my $OK_CHARS = 'a-zA-Z0-9 .,_-';
foreach my $param_name ( $cgi->param() ) {
my $original_content = $cgi->param($param_name);
my $replaced_content = HTML::Entities::decode( $original_content );
$replaced_content =~ s/[^$OK_CHARS]//go;
$cgi->param( $param_name, $replaced_content );
}
When I added the sanitization code recently, the file upload started failing. The filehandle is returning undefined now in this line:
my $uploadedFilehandle = $cgi->upload('uploaded_file');
So did I do something wrong in the sanitization code? I got that code snippet from the Internet somewhere, so I don't completely understand it all. I've never seen an 'o' regex modifier before and I've never used the HTML::Entities module before.
Entities are not encoded in file uploads' content. Sanitizing a file upload is not the same as sanitizing a text field. With a file upload you check the extension and possibly the format and encoding (by attempting to open it using particular decoder, etc.) and ensure that the file is not overly large.
In your code, you are in fact attempting to perform string operations on a file handle when you hit the file field.
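The extension and size checks described above could look roughly like this. The allowed extensions, the size limit and the helper name upload_problem are all assumptions made for illustration; the file name and size would come from the upload itself.

```perl
use strict;
use warnings;

my %ALLOWED_EXT = map { $_ => 1 } qw(csv txt pdf);    # assumed whitelist
my $MAX_BYTES   = 5 * 1024 * 1024;                    # assumed 5 MB limit

# Hypothetical helper: returns an error message, or undef if the upload
# passes the extension and size checks.
sub upload_problem {
    my ($filename, $size) = @_;

    my ($ext) = $filename =~ /\.([A-Za-z0-9]+)\z/;
    return 'missing or disallowed extension'
        unless defined $ext && $ALLOWED_EXT{ lc $ext };

    return 'file too large' if $size > $MAX_BYTES;
    return;
}

for my $case ( [ 'report.CSV', 1_024 ], [ 'shell.cgi', 10 ], [ 'big.pdf', 50_000_000 ] ) {
    my $problem = upload_problem(@$case);
    printf "%-12s %s\n", $case->[0], defined $problem ? $problem : 'ok';
}
```

Checks like these replace the character stripping for the upload field; the text fields can still go through the regex-based sanitization.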
No, you should not. See the CGI.pm docs on how to process an upload field:
To be safe, use the upload() function (new in version 2.47). When called with the name of an upload field, upload() returns a filehandle-like object, or undef if the parameter is not a valid filehandle. ...