parsing pdf in perl


I am trying to extract some information from a PDF using getpdftext.pl from the CAM::PDF module. When I run getpdftext.pl sample.pdf from the shell, it prints the text of the PDF to stdout.
I would like to write this text to a file and then parse it for the required fields in Perl. Can someone please guide me on how to do this?
When I try to call getpdftext.pl inside my Perl script, I get a "No such file" error.
#program to extract text from pdf and save it in a text file
use PDF;
use CAM::PDF;
use CAM::PDF::PageText;
use warnings;
use IPC::System::Simple qw(system capture);
$filein = 'sample.pdf';
$fileout = 'output1.txt';
open OUT, ">$fileout" or die "error: $!";
open IN, "getpdftext.pl $filein" or die "error :$!" ;
while(<IN>)
{
print OUT $fileout;
}

It would probably be easier to adapt getpdftext.pl to do what you want.
Working from the code in getpdftext.pl, this (untested) code should write the text of the PDF to a file.
use strict;
use warnings;
use CAM::PDF;
my $filein = 'sample.pdf';
my $fileout = 'output1.txt';
my $doc = CAM::PDF->new($filein) || die "$CAM::PDF::errstr\n";
open my $fo, '>', $fileout or die "error: $!";
# Extract the text of each page and append it to the output file.
foreach my $p ( 1 .. $doc->numPages() ) {
    my $str = $doc->getPageText($p);
    if (defined $str) {
        CAM::PDF->asciify(\$str);
        print $fo $str;
    }
}
close $fo;

See perldoc -f open. You want to take the output stream of an external command and use it as an input stream inside your Perl script. That's what the -| mode is for:
open my $IN, '-|', "getpdftext.pl $filein" or die $!;
while (<$IN>) {
...
}
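
A minimal sketch of the piped-open approach, using the list form of open so that the file name is not passed through a shell. The helper name capture_lines is hypothetical, and the command used here (the running perl interpreter printing two lines) is only a stand-in for getpdftext.pl, which the real script would need to find on its PATH.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command and return its output lines.
# The list form of a piped open bypasses the shell, so spaces or
# metacharacters in the arguments are passed through literally.
sub capture_lines {
    my @cmd = @_;
    open my $in, '-|', @cmd or die "cannot run $cmd[0]: $!";
    my @lines = <$in>;
    close $in or die "$cmd[0] failed: " . ($! || "exit status $?");
    return @lines;
}

# Stand-in command; in the real script this would be
# capture_lines('getpdftext.pl', $filein).
my @lines = capture_lines($^X, '-e', 'print "page one\npage two\n"');
print scalar(@lines), " lines captured\n";
```

Checking the return value of close matters here, because for a piped handle that is where a failing exit status of the command shows up.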

Related

write a command output to a file and match a string

I am new to Perl. I'm writing the script below to print the system boot time from the Windows command "systeminfo". There seems to be a problem with it; the output I get is shown after the code. Could someone help me?
use strict;
use warnings;
my $filename = 'sysinfo.txt';
my @cmdout = `systeminfo`;
open(my $cmd, '>', $filename) or die "Could not open file '$filename' $!";
print $cmd @cmdout;
foreach my $file (@cmdout) {
    open my $cmd, '<:encoding(UTF-8)', $file or die;
    while (my $line = <$cmd>) {
        if ($line =~ m/.*System Boot.*/i) {
            print $line;
        }
    }
}
Output: Died at perl_sysboottime.pl line 8.

As indicated by the error printed, your script is dying the first time it executes
open my $cmd, '<:encoding(UTF-8)', $file or die;
This means that open failed to open a file.
I'm not familiar with Windows's commands, but I'll go by the example systeminfo output given here.
After executing line 4, the array @cmdout contains the lines output by systeminfo. When line 8 is executed, $file has been set to the first line of the output, or Host Name: COMPUTERHOPE\n in my example (note the trailing newline). This is not a filename, so open fails.
It looks like you're trying to combine two different ways of iterating through the lines of a file, one inside of another. Try something like this:
foreach my $line (@cmdout) {
    if ($line =~ m/.*System Boot.*/i) {
        print $line;
    }
}
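
The same filtering can also be written with grep. A small sketch, with made-up sample lines standing in for the real systeminfo output (the exact wording of the boot-time line on a given Windows version is an assumption):

```perl
use strict;
use warnings;

# Sample data standing in for: my @cmdout = `systeminfo`;
my @cmdout = (
    "Host Name:                 EXAMPLE-PC\n",
    "System Boot Time:          1/1/2020, 9:00:00 AM\n",
    "System Manufacturer:       Example Corp\n",
);

# grep returns the elements for which the pattern matches.
my @boot = grep { /System Boot/i } @cmdout;
print @boot;
```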

Perl Script: sorting through log files.

I'm trying to write a script that opens a directory, reads multiple log files line by line, and searches for information such as "Attendance = 0".
Previously I used grep "Attendance =" * to find this, but now I want to do it in a script.
I need your help to finish this task.
#!/usr/bin/perl
use strict;
use warnings;
my $dir = '/path/';
opendir (DIR, $dir) or die $!;
while (my $file = readdir(DIR))
{
print "$file\n";
}
closedir(DIR);
exit 0;

What's your perl experience?
I'm assuming each file is a text file. I'll give you a hint. Try to figure out where to put this code.
# Now to open and read a text file.
my $fn='file.log';
# $! is a variable which holds a possible error msg.
open(my $INFILE, '<', $fn) or die "ERROR: could not open $fn. $!";
my @filearr=<$INFILE>; # Read the whole file into an array.
close($INFILE);
# Now look in @filearr, which has one entry per line of the original file.
exit; # Normal exit

I prefer to use File::Find::Rule for things like this. It preserves path information, and it's easy to use. Here's an example that does what you want.
use strict;
use warnings;
use File::Find::Rule;
my $dir = '/path/';
my $type = '*';
my @files = File::Find::Rule->file()
                            ->name($type)
                            ->in($dir);
for my $file (@files){
    print "$file\n\n";
    open my $fh, '<', $file or die "can't open $file: $!";
    while (my $line = <$fh>){
        if ($line =~ /Attendance =/){
            print $line;
        }
    }
}

Writing results in a text file with perl

I have a problem getting the script to print the whole matching line of the text file to a result file:
use strict;
use warnings;
use autodie;
my $out = "result2.txt";
open my $outFile, ">$out" or die $!;
my %permitted = do {
open my $fh, '<', 'f1.txt';
map { /(.+?)\s+\(/, 1 } <$fh>;
};
open my $fh, '<', 'f2.txt';
while (<$fh>) {
my ($phrase) = /(.+?)\s+->/;
if ($permitted{$phrase}) {
print $outFile $fh;
}
close $outFile;
The problem is in this line
print $outFile $fh;
Any idea please?
Thank you

print $outFile $fh is printing the value of the file handle $fh to the file handle $outFile. Instead you want to print the entire current line, which is in $_.
There are a couple of other improvements that can be made:
- You should always use the three-parameter form of open, so the open mode appears on its own as the second parameter.
- There is no need to test the success of an open if autodie is in place.
- If you have a variable that contains the name of the output file, then you really should have ones for the names of the two input files as well.
This is how your program should look. I hope it helps.
use strict;
use warnings;
use autodie;
my ($in1, $in2, $out) = qw/ f1.txt f2.txt result2.txt /;
my %permitted = do {
    open my $fh, '<', $in1;
    map { /(.+?)\s+\(/, 1 } <$fh>;
};
open my $fh, '<', $in2;
open my $outfh, '>', $out;
while (<$fh>) {
    my ($phrase) = /(.+?)\s+->/;
    if ($permitted{$phrase}) {
        print $outfh $_;
    }
}
close $outfh;
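
To see the difference between printing the handle and printing the line, here is a small sketch that writes to in-memory filehandles instead of real files (the sample text is made up):

```perl
use strict;
use warnings;

my $input = "alpha -> 1\nbeta -> 2\n";
open my $fh, '<', \$input or die $!;

my ($wrong, $right) = ('', '');
open my $wrong_fh, '>', \$wrong or die $!;
open my $right_fh, '>', \$right or die $!;

while (<$fh>) {
    print $wrong_fh $fh;   # stringified handle, e.g. GLOB(0x...)
    print $right_fh $_;    # the line that was just read
}
close $_ for $wrong_fh, $right_fh, $fh;

print "wrong: $wrong\n";
print "right: $right";
```

After the loop, $right holds a copy of the input, while $wrong holds only the repeated string form of the filehandle.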

I think you want print $outFile $phrase here, don't you? The line you currently have is trying to print out a file handle reference ($fh) to a file ($outFile).
Also, just as part of perl best practices, you'll want to use the three argument open for your first open line:
open my $outFile, ">", $out or die $!;
(FWIW, you're already using 3-arg open for your other two calls to open.)

Although Borodin has provided an excellent solution to your question, here's another option where you pass your 'in' files' names to the script on the command line, and let Perl handle the opening and closing of those files:
use strict;
use warnings;
my $file2 = pop;
my %permitted = map { /(.+?)\s+\(/, 1 } <>;
push @ARGV, $file2;
while (<>) {
    my ($phrase) = /(.+?)\s+->/;
    print if $permitted{$phrase};
}
Usage: perl script.pl inFile1 inFile2 [>outFile]
The last, optional parameter directs output to a file.
The pop command implicitly removes inFile2's name from @ARGV and stores it in $file2. Then, inFile1 is read using the <> operator. The file name of inFile2 is then pushed onto @ARGV, and that file is read and a line is printed if $permitted{$phrase} is true.
Running the script without the last, optional parameter will print results (if any) to the screen. Using the last parameter saves output to a file.
Hope this helps!

How to Call .pl File inside .cgi Script

I am using getpdftext.pl from CAM::PDF to extract the text of a PDF, but in my web application I want to call getpdftext.pl from inside a .cgi script. Can you suggest what to do or how to proceed? I tried converting getpdftext.pl to getpdftext.cgi, but it doesn't work.
Thanks all
This is an extract from my request_admin.cgi script:
my $filename = $q->param('quote');
:
:
:
&parsePdf($filename);
#function to extract text from pdf ,save it in a text file and parse the required fields
sub parsePdf($)
{
my $i;
print $_[0];
$filein = "quote_uploads/$_[0]";
$fileout = 'output.txt';
print "inside parsePdf\n";
open OUT, ">$fileout" or die "error: $!";
open IN, '-|', "getpdftext.pl $filein" or die "error :$!" ;
while(<IN>)
{
print "$i";
$i++;
print OUT;
}
}

It's highly likely that one or both of the following is true:
- Your CGI script's environment isn't complete enough to locate getpdftext.pl.
- The web-server user doesn't have permission to execute it anyway.
Have a look in your web-server's error-log and see if it is reporting any pointers as to why this doesn't work.
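
One way to check both points from inside the CGI script is to resolve the helper script's full path yourself and test it with -x before running it. A rough sketch; the helper name find_executable and the extra directories are assumptions for illustration, not part of CAM::PDF:

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the first executable file called $name
# in the given directories, or undef if there is none.
sub find_executable {
    my ($name, @dirs) = @_;
    for my $dir (@dirs) {
        my $candidate = File::Spec->catfile($dir, $name);
        return $candidate if -f $candidate && -x _;
    }
    return;
}

# A CGI process often has a minimal PATH, so list likely locations
# explicitly (these directories are examples only).
my @dirs = (File::Spec->path, '/usr/local/bin', '/usr/bin');
my $script = find_executable('getpdftext.pl', @dirs);
print defined($script)
    ? "found: $script\n"
    : "getpdftext.pl not found or not executable\n";
```

Because the check runs as the web-server user, a "not found or not executable" result from the CGI script while the same check succeeds from your own shell points at one of the two causes above.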

In your particular case, it might be simpler and more direct to use CAM::PDF directly, which should have been installed along with getpdftext.pl anyway.
I had a look at this script and I think that your parsePdf sub could just as easily be written as:
#!/usr/bin/perl
use warnings;
use strict;
use CAM::PDF;
sub parsePdf {
    my $filein = "quote_uploads/$_[0]";
    my $fileout = 'output.txt';
    # Three-argument open keeps the mode separate from the file name.
    open my $out_fh, '>', $fileout or die "error: $!";
    my $doc = CAM::PDF->new($filein) || die "$CAM::PDF::errstr\n";
    my $i = 0;
    foreach my $p ($doc->rangeToArray(1,$doc->numPages()))
    {
        my $str = $doc->getPageText($p);
        if (defined $str)
        {
            CAM::PDF->asciify(\$str);
            print $i++;
            print $out_fh $str;
        }
    }
    close $out_fh or die "error: $!";
}

Perl: opening a file, and saving it under a different name after editing

I'm trying to write a configuration script.
For each customer, it will ask for variables, and then write several text files.
But each text file needs to be used more than once, so it can't overwrite them. I'd prefer it read from each file, made the changes, and then saved them to $name.originalname.
Is this possible?

You want something like Template Toolkit. You let the templating engine open a template, fill in the placeholders, and save the result. You shouldn't have to do any of that magic yourself.
For very small jobs, I sometimes use Text::Template.

Why not copy the file first and then edit the copy?

The code below expects to find a configuration template for each customer where, for example, Joe's template is joe.originaljoe and writes the output to joe:
foreach my $name (@customers) {
    my $template = "$name.original$name";
    open my $in, "<", $template or die "$0: open $template: $!";
    open my $out, ">", $name or die "$0: open $name: $!";
    # whatever processing you're doing goes here
    my $output = process_template($in);
    print $out $output or die "$0: print $name: $!";
    close $in;
    close $out or warn "$0: close $name: $!";
}

Assuming you want to read in one file, make changes to it line by line, then write to another file:
#!/usr/bin/perl
use strict;
use warnings;
# set $input_file and $output_file accordingly
# input file
open my $in_filehandle, '<', $input_file or die $!;
# output file
open my $out_filehandle, '>', $output_file or die $!;
# iterate through the input file one line at a time
while ( <$in_filehandle> ) {
    # save this line and remove the newline
    my $input_line = $_;
    chomp $input_line;
    # prepare the line to be written out
    my $output_line = do_something( $input_line );
    # write to the output file (not to STDOUT)
    print $out_filehandle $output_line . "\n";
}
close $in_filehandle;
close $out_filehandle;
