I have been tinkering with PDF::API2 and have run into a problem. I can create a PDF file and add text to it without trouble. However, when the text runs past the end of a page, the script does not continue printing on the next page. I have researched this to no avail. I would like each page to hold exactly 50 lines of text. My script is below. It prints only on the first page; it creates the other pages but writes nothing to them. Does anyone have a solution?
#!/usr/bin/perl
use PDF::API2;
use POSIX qw(setsid strftime);
my $filename = scalar(strftime('%F', localtime));
my $pdf = PDF::API2->new(-file => "$filename.pdf");
$pdf->mediabox(595,842);
my $page = $pdf->page;
my $fnt = $pdf->corefont('Arial',-encoding => 'latin1');
my $txt = $page->text;
$txt->textstart;
$txt->font($fnt, 20);
$txt->translate(100,800);
$txt->text("Lines for $filename");
my $i=0;
my $line = 780;
while($i<310)
{
if(($i%50) == 0)
{
my $page = $pdf->page;
my $fnt = $pdf->corefont('Arial',-encoding => 'latin1');
my $txt = $page->text;
}
$txt->font($fnt, 10);
$txt->translate(100,$line);
$txt->text("$i This is the first line");
$line=$line-15;
$i++;
}
$txt->textend;
$pdf->save;
$pdf->end( );
The problem is that you create a new page, but the new variables are discarded immediately:
if(($i%50) == 0)
{
my $page = $pdf->page;
my $fnt = $pdf->corefont('Arial',-encoding => 'latin1');
my $txt = $page->text;
}
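Every variable declared with my inside the block goes out of scope at the closing brace, so the $txt used by the rest of the loop is still the text object of the first page. Just remove my and you will modify the variables from the enclosing scope.
Edit: You will probably also want to reset the $line variable when starting a new page.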
The typeface, $fnt, does not have to be changed since it depends on the PDF, $pdf, and not the page, $page.
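The two fixes above (reuse the outer variables, reset $line on every new page) boil down to a little bookkeeping that can be checked without any PDF library. The following is a minimal sketch in Python rather than Perl, purely to show the arithmetic; the function name paginate and its parameters are made up for illustration, and the numbers (50 lines per page, start at y = 780, step 15) are taken from the question.

```python
def paginate(total_lines, per_page=50, top_y=780, step=15):
    """Return (page_number, y, line_index) for every line to be drawn."""
    placements = []
    page = 0
    y = top_y
    for i in range(total_lines):
        if i % per_page == 0:
            # New page: advance the page counter AND reset the cursor.
            # Resetting y is the step the original script was missing.
            page += 1
            y = top_y
        placements.append((page, y, i))
        y -= step
    return placements


placements = paginate(310)
pages_used = placements[-1][0]
```

In the Perl script, the branch taken when i % per_page == 0 corresponds to assigning a fresh page and text object to the existing $page and $txt and setting $line back to 780.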
As much as I love Perl, I learned enough Python to use the ReportLab library for PDF generation. Creating PDFs is one of Perl's weak spots compared with Python.
I have created a form in Drupal 7 with a field for uploading a file (CSV only). How can I display the uploaded CSV file as a table when the form is submitted?
Not sure about a theme function, but you can do it on your own.
For example, use:
http://php.net/manual/en/function.file-get-contents.php
To read the file and then:
http://php.net/manual/en/function.str-getcsv.php
to parse CSV.
Or, maybe:
http://php.net/manual/en/function.fgetcsv.php
to read row by row. Either way, you'll end up looping through the rows, so just print the values the way you want and add markup around them.
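The read, parse and wrap-in-markup steps described above can be sketched independently of Drupal. This is a hedged illustration in Python, not the PHP the question needs; csv_to_html_table is a hypothetical helper name, and treating the first row as the header is an assumption.

```python
import csv
import html
import io


def csv_to_html_table(csv_text):
    """Render CSV text as an HTML table; the first row is the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return "<table></table>"
    header, body = rows[0], rows[1:]
    out = ["<table>", "<tr>"]
    # Escape every cell so that CSV content cannot inject markup.
    out += [f"<th>{html.escape(cell)}</th>" for cell in header]
    out.append("</tr>")
    for row in body:
        out.append("<tr>")
        out += [f"<td>{html.escape(cell)}</td>" for cell in row]
        out.append("</tr>")
    out.append("</table>")
    return "".join(out)


table = csv_to_html_table("name,qty\nA & B,2\n")
```

Whatever language is used, escaping the cell values before printing them is worth keeping, since an uploaded file is untrusted input.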
After going through various code exercises, I came up with the following solution:
`function display_table($filename, $head=false) {
// Read the CSV: the first row becomes the header, the rest become table rows.
$handle = fopen($filename, "r");
if ($handle === false) {
return '';
}
$all_rows = array();
$header = null;
while (($row = fgetcsv($handle)) !== false) {
if ($header === null) {
$header = $row;
continue;
}
$all_rows[] = array_combine($header, $row);
}
fclose($handle);
// Render the rows with Drupal 7's table theme function.
$table = theme('table', array('header' => $header, 'rows' => $all_rows));
return $table;
}`
Hope it is helpful to others as well!
I am torn between two approaches.
I want to strip the head tag (and everything inside and before it, including the doctype and html tags), the body tag, and script tags from a page that I am importing via curl.
So my first thought was this:
$content = strip_tags($content, '<img><p><a><div><table><tbody><th><tr><td><br><span><h1><h2><h3><h4><h5><h6><code><pre><b><strong><ol><ul><li><em>'.$tags);
which, as you can see, can get even longer with HTML5 tags, video, object, etc.
Then I saw this:
https://stackoverflow.com/a/16377509/594423
Can anyone advise on the preferred method, or show your own way of doing this, and please explain why and possibly tell me which one is faster?
Thank you!
You can try something like this:
$dom = new DOMDocument();
@$dom->loadHTML($content);
$result = '';
$bodyNode = $dom->getElementsByTagName('body')->item(0);
$scriptNodes = $bodyNode->getElementsByTagName('script');
$toRemove = array();
foreach ($scriptNodes as $scriptNode) {
$toRemove[] = $scriptNode;
}
foreach($toRemove as $node) {
$node->parentNode->removeChild($node);
}
$bodyChildren = $bodyNode->childNodes;
foreach($bodyChildren as $bodyChild) {
$result .= $dom->saveHTML($bodyChild);
}
The advantage of the DOM approach is its relative reliability against several HTML traps, especially some cases of malformed tags, or tags inside JavaScript strings such as: var str = "<body>";
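To illustrate why a real parser copes with that last trap, here is a minimal sketch using Python's standard html.parser instead of PHP's DOMDocument (a swapped-in tool, chosen only because it is easy to run). The class and function names are hypothetical. The parser treats the contents of a script element as raw text, so the "<body>" inside the JavaScript string is never mistaken for a tag. Comments are dropped in this sketch.

```python
from html.parser import HTMLParser


class BodyWithoutScripts(HTMLParser):
    """Collect the markup inside <body>, skipping <script> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.in_body = False
        self.in_script = False
        self.parts = []

    def _keep(self):
        return self.in_body and not self.in_script

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag == "script":
            self.in_script = True
        elif self._keep():
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through once.
        if self._keep() and tag not in ("body", "script"):
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif tag == "body":
            self.in_body = False
        elif self._keep():
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._keep():
            self.parts.append(data)

    def handle_entityref(self, name):
        if self._keep():
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._keep():
            self.parts.append(f"&#{name};")


def body_without_scripts(page):
    parser = BodyWithoutScripts()
    parser.feed(page)
    parser.close()
    return "".join(parser.parts)


sample = ('<html><head><title>t</title></head><body><p>Hi &amp; bye</p>'
          '<script>var str = "<body>";</script><b>x</b></body></html>')
cleaned = body_without_scripts(sample)
```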
But what about speed?
If you use a regex approach, for example:
$pattern = <<<'EOD'
~
<script[^>]*> (?>[^<]++|<(?!/script>))* </script>
|
</body>.*$
|
^ (?>[^<]++|<(?!body\b))* <body[^>]*>
~xis
EOD;
$result = preg_replace($pattern, '', $content);
The result is a little faster (from 1x to 2x for an HTML file with 400 lines), but with this code the reliability decreases.
If speed is important and you have a good idea of the HTML quality, you can get the same reliability level as the regex version with:
$offset = stripos($content, '<body');
$offset = strpos($content, '>', $offset);
$result = strrev(substr($content,++$offset));
$offset = stripos($result, '>ydob/<');
$result = substr($result, $offset+7);
$offset = 0;
while(false !== $offset = stripos($result, '>tpircs/<', $offset)) {
$soffset = stripos($result, 'tpircs<', $offset);
$result = substr_replace($result, '', $offset, $soffset-$offset+7);
}
$result = strrev($result);
That is between 2x and 5x faster than the DOM version.
I am having quite an issue creating a new line with this module and feel like I am just missing something.
My Perl code looks like this:
use OpenOffice::OODoc;
my $name = "foo <br> bar";
$name=~s/<br>/\n/g;
my $outdir = "template.odt";
my $doc = ooDocument(file => $outdir);
my @pars = $doc->getParagraphList();
for my $p (@pars)
{
$doc->substituteText($p,'{TODAY}',$date);
$doc->substituteText($p,'{NAME}',$name);
...
The problem is that when I open it in Word or OpenOffice I have no newlines, although if I open it in a text editor the newlines are there. Any ideas on how to fix this?
OK, I figured it out; hopefully this will save someone hours of searching for the same thing. I added:
use Encode qw(encode);
ooLocalEncoding('utf8');
my $linebreak = encode('utf-8', "\x{2028}");
$doc->substituteText($p,'<br>', $linebreak);
So my final code looks like this:
use OpenOffice::OODoc;
use Encode qw(encode);
ooLocalEncoding('utf8');
my $linebreak = encode('utf-8', "\x{2028}");
my $outdir = "template.odt";
my $name = "foo <br> bar";
my $doc = ooDocument(file => $outdir);
my @pars = $doc->getParagraphList();
for my $p (@pars)
{
$doc->substituteText($p,'{TODAY}',$date);
$doc->substituteText($p,'{NAME}',$name);
$doc->substituteText($p,'<br>', $linebreak);
...
Maybe not the best way to do things but it worked!
You could try inserting an empty paragraph after the current one:
If the 'text' option is empty, calling this method is the equivalent
of adding a line feed.
This sequence (in a text document) inserts a linefeed immediately after paragraph 4. Replace 4 with current position.
$doc->insertElement
(
'//text:p', 4, 'text:p',
position => 'after',
text => '',
);
I am currently attempting to create a Perl webspider using WWW::Mechanize.
What I am trying to do is create a webspider that will crawl the whole site of the URL (entered by the user) and extract all of the links from every page on the site.
But I have a problem with how to spider the whole site to get every link, without duplicates.
What I have done so far (the part I'm having trouble with, anyway):
foreach (@nonduplicates) { # array contains URLs like www.tree.com/contact-us, www.tree.com/varieties....
$mech->get($_);
my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/); # find all links on this page that start with http://www.tree.com
# NOW THIS IS WHAT I WANT IT TO DO AFTER THE ABOVE (IN PSEUDOCODE), BUT CAN'T GET WORKING
#foreach (@list) {
#if $_ is already in @nonduplicates
#then do nothing because that link has already been found
#} else {
#append the link to the end of @nonduplicates so that if it has not been crawled for links already, it will be
How would I be able to do the above?
I am doing this to try and spider the whole site to get a comprehensive list of every URL on the site, without duplicates.
If you think this is not the best/easiest method of achieving the same result I'm open to ideas.
Your help is much appreciated, thanks.
Create a hash to track which links you've seen before and put any unseen ones onto @nonduplicates for processing:
$| = 1;
my $scanned = 0;
my @nonduplicates = ( $urlToSpider ); # Add the first link to the queue.
my %link_tracker = map { $_ => 1 } @nonduplicates; # Keep track of what links we've found already.
while (my $queued_link = pop @nonduplicates) {
$mech->get($queued_link);
my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);
for my $new_link (@list) {
# Add the link to the queue unless we already encountered it.
# Increment so we don't add it again.
push @nonduplicates, $new_link->url_abs() unless $link_tracker{$new_link->url_abs()}++;
}
printf "\rPages scanned: [%d] Unique Links: [%s] Queued: [%s]", ++$scanned, scalar keys %link_tracker, scalar @nonduplicates;
}
use Data::Dumper;
print Dumper(\%link_tracker);
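The seen-hash-plus-queue pattern used above can be exercised without touching the network. The following is a small sketch in Python under stated assumptions: crawl and get_links are hypothetical names, and the site dictionary is a made-up link graph standing in for what $mech->find_all_links would return for each page.

```python
def crawl(start, get_links):
    """Visit every page reachable from start, fetching each page once."""
    seen = {start}    # plays the role of %link_tracker
    queue = [start]   # plays the role of @nonduplicates
    while queue:
        url = queue.pop()
        for link in get_links(url):
            # Queue a link only the first time it is encountered.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


# Made-up link graph: page -> links found on that page.
site = {
    "/": ["/contact-us", "/varieties"],
    "/contact-us": ["/", "/varieties"],
    "/varieties": ["/varieties/oak", "/"],
    "/varieties/oak": ["/varieties"],
}

fetched = []


def get_links(url):
    fetched.append(url)  # record each simulated page fetch
    return site.get(url, [])


found = crawl("/", get_links)
```

Marking a link as seen at the moment it is queued, rather than when it is fetched, is what keeps a page from being queued twice when several pages link to it.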
use List::MoreUtils qw/uniq/;
...
my @list = $mech->find_all_links(...);
my @unique_urls = uniq( map { $_->url } @list );
Now @unique_urls contains the unique URLs from @list.
I am trying to figure out how to change text in a footer of an ODT file. The footer is kept in the styles.xml, however I can't seem to access it using selectElementsByContent or any other method:
my $a = odfContainer('test.odt');
my $styles = odfDocument(container => $a, part => 'styles');
foreach my $element ($styles->selectElementsByContent('mytest'))
{
#never runs...
}
The styles.xml in the odt is like:
<office:document-styles>
<office:master-styles>
<style:master-page>
<style:footer>
<text:p text:style-name="P49">
mytest
</text:p>
</style:footer>
</style:master-page>
</office:master-styles>
</office:document-styles>
What is the right way to change the text:p contents?
I ended up having to use odfXPath to loop through:
my $ss = odfXPath(file => 'myfile.odt', part => 'styles');
my $i = 0;
# Walk the text:p elements by position until getElement returns nothing.
while (my $para = $ss->getElement('//text:p', $i))
{
if ($ss->getText($para) eq 'mytest') { $ss->setText($para, 'foobar'); }
$i++;
}
$ss->save('mynewfile.odt');
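For comparison, the same walk-every-paragraph-and-compare idea can be sketched against the raw XML with Python's standard library instead of OpenOffice::OODoc. This is a different technique from the answer above, shown only as an illustration: replace_paragraph_text is a hypothetical helper, the sample document is a trimmed stand-in for styles.xml, and reading styles.xml out of the .odt archive and writing it back is deliberately left out.

```python
import xml.etree.ElementTree as ET

# Namespace of the text: prefix in OpenDocument files.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
ET.register_namespace("text", TEXT_NS)


def replace_paragraph_text(xml_source, old, new):
    """Replace the text of every text:p whose trimmed text equals old."""
    root = ET.fromstring(xml_source)
    changed = 0
    for para in root.iter(f"{{{TEXT_NS}}}p"):
        # Strip whitespace because the footer text sits on its own line.
        if (para.text or "").strip() == old:
            para.text = new
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Trimmed stand-in for styles.xml (assumption: real files have more nesting).
sample = (
    f'<root xmlns:text="{TEXT_NS}">'
    '<text:p text:style-name="P49">\nmytest\n</text:p>'
    '<text:p>other</text:p>'
    '</root>'
)
updated, changed = replace_paragraph_text(sample, "mytest", "foobar")
```

Note that the comparison trims whitespace first; an exact eq comparison against 'mytest' would miss a paragraph whose text is surrounded by newlines, as in the styles.xml excerpt shown in the question.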