Read meta information from PDF using Perl module Image::ExifTool - perl

I would like to read meta information from PDF using Perl module Image::ExifTool. I need to process PDFs using cross reference streams (as of PDF 1.5), and the other well established modules like PDF::API2 and CAM::PDF seem not to support them or have limited support.
Anyway, Image::ExifTool apparently reads a number of PDF tags, but if I run the following code:
use Image::ExifTool qw(:Public);
my $file = 'file.pdf';
my $exifTool = new Image::ExifTool;
$exifTool->ExtractInfo($file);
my @tagList = $exifTool->GetFoundTags('File');
for (@tagList){
print "$_\n"
}
I do not seem to be able to get more than these tags:
ExifToolVersion
FileName
Directory
FileSize
FileModifyDate
FileAccessDate
FileCreateDate
FilePermissions
FileType
FileTypeExtension
MIMEType
PDFVersion
Linearized
Author
CreateDate
Creator (1)
ModifyDate
Producer (1)
Subject
Title (1)
XMPToolkit
CreateDate (1)
CreatorTool
ModifyDate (1)
MetadataDate
Producer
Format
Title
Description
Creator
DocumentID
InstanceID
PageLayout
PageMode
PageCount
In particular, I would like to get e.g. the PDF document catalog (Root tag). However, running code like this doesn't return any value:
my $tag = 'Root';
my $exifTool = new Image::ExifTool;
my $info = $exifTool->ImageInfo($file, $tag);
for (sort keys %$info) {
print "$_ => $$info{$_}\n";
}
Help please :-)

To see the parsing, enable verbose mode: $exifTool->Options(Verbose => 1); The output shows that Root is indeed being parsed.
7) Root (SubDirectory) -->
+ [Root directory with 7 entries]
| 0) Metadata (SubDirectory) -->
| + [Metadata directory with 3 entries]
I'm unsure what data you need from the Root tag, but it's possible to get the tag table with the internal API: my $roottag = Image::ExifTool::GetTagTable('Image::ExifTool::PDF::Root');
From Image::ExifTool:
exports not part of the public API, but used by ExifTool modules:

Related

Typo3 : add file to FAL when it is not present in sys_file

I have this situation
I have a table in my db containing some file names in a field1 (eg field1: "my file.ext")
NOTE: the filename does not necessarily pass a Typo3 "sanitizeFilename" check -> it may contain spaces " " or other characters that would be removed by the sanitizeFilename () method
I have the file mentioned above, stored on the server that hosts typo3
In the sys_file table, the file is not present
the "update storage index" scheduler cannot process all the files, and if I launch it, it "destroys" the file name (my file.ext -> my_file.ext), so the name stored in the field of my table doesn't make much sense anymore.
I would need to absorb the above mentioned files into the FAL, in order to use them in a typo3 extension.
I had thought of such a solution
<?php
// read from "field1" of my table
// $filename = the name extracted from my table (e.g. : "my file.ext")
// $path = the path of the file (e.g. "/fileadmin/user_upload")
if (file_exists($_SERVER['DOCUMENT_ROOT'] . "/" . $defaultStorage->getConfiguration()['basePath'] . $path . $filename)) {
// check the folder
if ($defaultStorage->hasFolder($path)) {
$folder = $defaultStorage->getFolder($path);
} else {
throw new \Exception($path . " path not found in AbstractImportCommand in method extractFile");
}
// CHECK IF FILE IS IN FAL
$file = $folder->getStorage()->getFileInFolder($filename, $folder);
if ($file) {
// the file already exists in the FAL
} else {
// create new sys_file
$file = $defaultStorage->addFile(
$_SERVER['DOCUMENT_ROOT'] . "/" . $defaultStorage->getConfiguration()['basePath'] . $path . $filename,
$folder,
DuplicationBehavior::REPLACE
);
}
}
Any suggestion?
Put your code into a command.
(optional) create a sys_file_metadata record for your file if you have information that needs to be stored there
create a sys_file_reference to your content record (before that you should adjust TCA accordingly)
For creating the sys_file_reference there is no api. A function doing so could look like this:
/**
* @param $record
* @param $fileUid
* @param $tableName
* @param $fieldName
*/
private function createSysFileReference($record, $fileUid, $tableName, $fieldName){
$data['sys_file_reference']['NEW_' . uniqid()] = [
'table_local' => 'sys_file',
'uid_local' => $fileUid,
'tablenames' => $tableName,
'uid_foreign' => $record['uid'],
'fieldname' => $fieldName,
'pid' => $record['pid']
];
$dataHandler = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance(\TYPO3\CMS\Core\DataHandling\DataHandler::class);
$dataHandler->start($data, []);
$dataHandler->process_datamap();
}
What you consider "destroying the filename" actually preserves file functionality, so sanitizing the filenames is required.
As an example, consider a file with white space in its name, like "My Pdf.pdf". It will eventually be shown as "My%20Pdf.pdf" in the URL and may also be saved like this. If you link to it, the link text at least won't show the "%20", and the expectation that links and the file name are always in sync (no matter how the file is stored) isn't reliable, as it probably depends on several parameters such as the operating system or browser. The same problem can occur with many other characters.
Consider that the problems occur not only when a user downloads a file but also when a file is uploaded: the URL-encoded name is saved in the database, while the name in the filesystem might be stored differently, or the file might not be found even if the saved value matches the filesystem, because of the URL-encoded characters. Your file references on the server are then broken, the corresponding links might not work, and images might not be displayed.
So circumventing FAL with your own tables is a bad decision, and while it's likely possible to write your own sanitizer, I would refrain from dropping sanitization completely.

perl pdf::api2 checking if a pdf file is encrypted

I have a website using a perl script for customers to upload a pdf file for me to print and post the printed pages to them.
I am using PDF::API2 to detect the page size and number of pages in order to calculate the printing costs.
However, if the pdf file is password protected this does not work and I get this error -
Software error:
Objind 9 does not exist at index 0 at /home5/smckayws/public_html/hookincrochet.com/lib//PDF/API2/Basic/PDF/File.pm line 758.
I am trying to use the isEncrypted feature in the pdf::api2 module to catch that the file is encrypted in order to direct the customer to a different page so they can enter the page size and page number manually, but it is not working for me.
I just get the same error message as above.
I have tried the following code snippets found elsewhere.
my $pdf = PDF::API2->open( "$customer_directory/$filename" );
if ( defined $pdf && $pdf->isEncrypted )
{
print "$pdf is encrypted.\n";
exit;
}
while (glob "*.pdf") {
$pdf = PDF::API2->open($_);
print "$_ is encrypted.\n" if $pdf->isEncrypted();
}
Any help would be greatly appreciated.
My guess is that the PDFs might use a feature that your version of PDF::API2 doesn't support. This is a workaround for the problem.
Wrap the call to isEncrypted in an eval, catch the error and handle it.
This will only work if the error does not occur on unencrypted files.
my $pdf = PDF::API2->open( "$customer_directory/$filename" );
if ( defined $pdf ) {
eval { $pdf->isEncrypted };
if ($@) {
# there was some kind of error opening the file
# could abort now, or look more specific, like this:
if ($@ =~ m/Objind 9 does not exist at index 0/) {
print "$pdf is encrypted.\n";
exit;
}
}
# file is not encrypted, opening worked, continue reading it
}
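The eval pattern can be tried without PDF::API2 installed. What follows is only a minimal sketch: open_pdf() is a hypothetical stand-in that dies with the message from the question for a "protected" file, and classify_pdf() is an illustrative helper name, not part of any module. It also puts the open call itself inside the eval, on the assumption that the error may already be raised while opening the file rather than in isEncrypted.

```perl
use strict;
use warnings;

# Hypothetical stand-in for PDF::API2->open: dies for a "protected" file
# with the message from the question, otherwise returns a dummy hash ref.
sub open_pdf {
    my ($name) = @_;
    die "Objind 9 does not exist at index 0\n" if $name =~ /protected/;
    return { name => $name, encrypted => 0 };
}

# Returns 'ok', 'encrypted' or 'error' without ever dying.
sub classify_pdf {
    my ($name) = @_;
    my $encrypted;
    # The trailing 1 makes eval return true only if the block completed.
    my $ok = eval {
        my $pdf = open_pdf($name);
        $encrypted = $pdf->{encrypted};
        1;
    };
    if (!$ok) {
        my $err = $@ || 'unknown error';
        return $err =~ /Objind \d+ does not exist/ ? 'encrypted' : 'error';
    }
    return $encrypted ? 'encrypted' : 'ok';
}

print "$_: ", classify_pdf($_), "\n" for qw(plain.pdf protected.pdf);
```

With the real module, the body of the eval would call PDF::API2->open and isEncrypted instead of the stand-in. Testing the return value of eval rather than only $@ avoids misreading a stale or cleared $@.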

Perl mechanize Find all links array loop issue

I am currently attempting to create a Perl webspider using WWW::Mechanize.
What I am trying to do is create a webspider that will crawl the whole site of the URL (entered by the user) and extract all of the links from every page on the site.
But I have a problem with how to spider the whole site to get every link, without duplicates
What I have done so far (the part I'm having trouble with anyway):
foreach (@nonduplicates) { # array contains URLs like www.tree.com/contact-us, www.tree.com/varieties....
$mech->get($_);
my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/); # find all links on this page that start with http://www.tree.com
#NOW THIS IS WHAT I WANT IT TO DO AFTER THE ABOVE (IN PSEUDOCODE), BUT CANT GET WORKING
#foreach (@list) {
#if $_ is already in @nonduplicates
#then do nothing because that link has already been found
#} else {
#append the link to the end of @nonduplicates so that if it has not been crawled for links already, it will be
How would I be able to do the above?
I am doing this to try and spider the whole site to get a comprehensive list of every URL on the site, without duplicates.
If you think this is not the best/easiest method of achieving the same result I'm open to ideas.
Your help is much appreciated, thanks.
Create a hash to track which links you've seen before and put any unseen ones onto @nonduplicates for processing:
$| = 1;
my $scanned = 0;
my @nonduplicates = ( $urlToSpider ); # Add the first link to the queue.
my %link_tracker = map { $_ => 1 } @nonduplicates; # Keep track of what links we've found already.
while (my $queued_link = pop @nonduplicates) {
    $mech->get($queued_link);
    my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);
    for my $new_link (@list) {
        # Add the link to the queue unless we already encountered it.
        # Increment so we don't add it again.
        push @nonduplicates, $new_link->url_abs() unless $link_tracker{$new_link->url_abs()}++;
    }
    printf "\rPages scanned: [%d] Unique Links: [%s] Queued: [%s]", ++$scanned, scalar keys %link_tracker, scalar @nonduplicates;
}
use Data::Dumper;
print Dumper(\%link_tracker);
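To see the seen-hash idea in isolation, here is a small sketch that runs without WWW::Mechanize or a network connection. The %site hash is a made-up site map standing in for what find_all_links would return, and crawl() is an illustrative name. Unlike the loop above it takes pages from the front of the queue with shift (breadth-first) and tests with defined, so a page whose name happens to be a false value would not end the loop early.

```perl
use strict;
use warnings;

# Hypothetical site map: page => links found on that page.
my %site = (
    '/'              => [ '/contact-us', '/varieties' ],
    '/contact-us'    => [ '/', '/varieties' ],
    '/varieties'     => [ '/', '/varieties/oak' ],
    '/varieties/oak' => [ '/varieties' ],
);

# Visit every reachable page once; returns the pages in visiting order.
sub crawl {
    my ($start, $links_for) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);
    my @visited;
    while (defined(my $page = shift @queue)) {
        push @visited, $page;
        for my $link (@{ $links_for->{$page} || [] }) {
            # Post-increment is false only the first time a link is seen.
            push @queue, $link unless $seen{$link}++;
        }
    }
    return @visited;
}

my @pages = crawl('/', \%site);
print "$_\n" for @pages;
```

Each page is printed exactly once even though the pages link back to each other, because a link is queued only the first time its key is incremented in %seen.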
use List::MoreUtils qw/uniq/;
...
my @list = $mech->find_all_links(...);
my @unique_urls = uniq( map { $_->url } @list );
Now @unique_urls contains the unique urls from @list.

First 8 bytes are always wrong when downloading a file from my script

I have a Mojolicious Lite script that "gives out" an executable file (the user can download the file from the script's URL). I keep hex-encoded data in an inline template in the DATA section, then decode it and call render_data.
get '/download' => sub {
my $self = shift;
my $hex_data = $self->render_partial( 'TestEXE' );
my $bin_data;
while( $hex_data =~ /([^\n]+)\n?/g ) {
$bin_data .= pack "H".(length $1), $1;
}
my $headers = Mojo::Headers->new;
$headers->add( 'Content-Type', 'application/x-download;name=Test.exe' );
$headers->add( 'Content-Disposition', 'attachment;filename=Test.exe' );
$headers->add( 'Content-Description', 'File Transfer');
$self->res->content->headers($headers);
$self->render_data( $bin_data );
};
__DATA__
@@ TestEXE.html.ep
4d5a90000300000004000000ffff0000b8000000000000004000000000000000
00000000000000000000000000000000000000000000000000000000b0000000
0e1fba0e00b409cd21b8014ccd21546836362070726f6772616d2063616e6e6f
....
When I run this locally (via the built-in webserver on http://127.0.0.1:3000/, Win7) I get the correct file (size and contents). But when I run it in CGI mode on shared hosting (Linux), it comes back with the correct size, but the first 8 bytes of the file are always incorrect (and always different). The rest of the file is correct.
If I pass $hex_data instead of $bin_data in my sub, I get what is supposed to be there.
I'm at a loss.
render_partial isn't what you want.
First, re-encode the executable in base64 format, and specify that the template is base64 encoded (This is assuming hex is not a requirement for your app):
@@ template-name (base64)
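If the data only exists as a hex dump, one possible way to produce the base64 text is sketched below with the core MIME::Base64 module. The array @hex_lines is an assumption for illustration and holds just the first line of the dump from the question; in practice it would hold every line of the dump.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# First line of the hex dump from the question (illustrative input only).
my @hex_lines = ('4d5a90000300000004000000ffff0000b8000000000000004000000000000000');

# Decode each hex line to raw bytes; "H*" consumes the whole string.
my $bin_data = join '', map { pack 'H*', $_ } @hex_lines;

# Base64 text that can be pasted into the DATA section.
my $base64 = encode_base64($bin_data);

print $base64;
```

encode_base64 wraps its output in lines of at most 76 characters by default, which is convenient for pasting into a script.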
Also, you don't actually need a controller method at all. Mojolicious will handle the process for you - all you have to do is appropriately name the template.
use Mojolicious::Lite;
app->start;
__DATA__
@@ Test.exe (base64)
...
http://127.0.0.1:3000/Test.exe will then download the file.
-
If you still want to use a controller method for app-specific concerns, get the data template specifically:
use Mojolicious::Lite;
get '/download' => sub {
my $self = shift;
# http://mojolicio.us/perldoc/Mojolicious/Renderer.pm#get_data_template
my $data = $self->app->renderer->get_data_template({}, 'Test.exe');
# Replace content-disposition instead of adding it,
# to prevent duplication from elsewhere in the app
$self->res->headers->header(
'Content-Disposition', 'attachment;filename=name.exe');
$self->render_data($data);
};
app->start;
__DATA__
@@ Test.exe (base64)
...
http://127.0.0.1:3000/download will get the template, set the header, and then download it as name.exe.

Perl OpenOffice::OODoc - accessing header/footer elements

How do you get elements in the header/footer of an odt doc?
for example I have:
use OpenOffice::OODoc;
my $doc = odfDocument(file => 'whatever.odt');
my $t=0;
while (my $table = $doc->getTable($t))
{
print "Table $t exists\n";
$t++;
}
When I check the tables they are all from the body. I can't seem to find elements for anything in the header or footer?
I found sample code here which led me to the answer:
#! /usr/local/bin/perl
use OpenOffice::OODoc;
my $file='asdf.odt';
# odfContainer is a representation of the zipped odf file
# and all of its parts.
my $container = odfContainer("$file");
# We're going to look at the 'style' part of the container,
# because that's where the header is located.
my $style = odfDocument
(
container => $container,
part => 'styles'
);
# masterPageHeader takes the style name as its argument.
# This is not at all clear from the documentation.
my $masterPageHeader = $style->masterPageHeader('Standard');
my $headerText = $style->getText( $masterPageHeader );
print "$headerText\n"
The master page style defines the look and feel of the document -- think CSS. Apparently 'Standard' is the default name for the master page style of a document created by OpenOffice... that was the toughest nut to crack... once I found the example code, that fell out in my lap.