How to Extract text from MS Word? - perl

I'm trying to open a word document and just extract all the text that is in the document and display it to the user using Win32::OLE
#usr/bin/perl
#OLEWord.pl
#Use string and print warnings
use strict;use warnings;
#Using OLE + OLE constants for Variants and OLE enumeration for Enumerations
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Word';
$Win32::OLE::Warn = 3;
#set the file to be opened
my $file = '/work/Test.docx';
#Create a new instance of Win32::OLE for the Word application, die if could not open the application
my $MSWord = Win32::OLE->new('Word.Application','Quit') and "Opened Word" or die "Unable to open document ", Win32::OLE->LastError();
#Set the screen to Visible, so that you can see what is going on
$MSWord->{'Visible'} = 1;
#open the request file or die and print warning message
my $Doc = $MSWord->Documents->Open('C:\work\Test.docx') or die "Could not open ", $file, " Error:", Win32::OLE->LastError();
#$MSWord->ActiveDocument->SaveAs({Filename => 'AlteredTest.docx',
#FileFormat => wdFormatDocument});
sub ShowObjs {
my $obj = shift;
foreach (sort keys %$obj) {
print "Keys: $_ - $obj->{$_}\n"; }
}
my $paragraphs = $Doc->Paragraphs;
ShowObjs($paragraphs);
# Get and print the Text inside the opened file
my $paragraphs = $Doc->Paragraphs;
my $txt = $Doc->Range->Text;
print $txt;
$MSWord->ActiveDocument->Close;
$MSWord->Quit;
I'm getting this error code:
OLE exception from "Microsoft Word":
Command Failed
Win32::OLE(0.1709) error ox800a1066
in METHOD/PROPERTYGET "Open" at OLEWord.pl line 20
Update: I can open up the Word application fine, it's just when I try to open up the file that is the problem

I have few scripts to convert DOC to various output format using Win32::OLE. They usually start like this:
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Word';
my $wr = Win32::OLE->new('Word.Application')
or die "Failure - word. \n";
$wr->{DisplayAlerts} = wdAlertsNone;
$wr->{Visible} = 0;
my $Doc = $wr->Documents->Open({
FileName => $input_file_path,
ConfirmConversions => 0,
AddToRecentFiles => 0,
Revert => 0,
ReadOnly => 1,
OpenAndRepair => 0,
}) or exit;
...
Please note that $input_file_path has to contain absolute path to your file. You can also enable Visible and DisplayAlerts to see any error Word might give you.
Edit: You can traverse paragraphs using in enumerator:
use Win32::OLE qw(in);
...
my $paragraphs = $Doc->Paragraphs;
for my $par (in $paragraphs) {
print $par->Range->Text();
}
Or you can use Word's own exporting method and save document as one of supported formats:
$Doc->SaveAs({
FileName => 'c:\\work\\Test.txt',
FileFormat => wdFormatEncodedText,
});
The advantage of latter method is that formatting is retained if possible, which produces better results for bullets, numbering and such.

Win32::OLE can be a little funny about interaction. if anything triggers a prompt, you may get a message like this. Typically it may be that it wants to open the file read-only, for example, and put up a dialog, but these dialogs can break under the default initialization of Win32::OLE.
If this is the case, calling
Win32::OLE->Initialize(Win32::OLE::COINIT_OLEINITIALIZE);
before you do anything like instantiate any objects (i.e., before Win32::OLE->new) might do the trick.

Related

Using Try and Catch to move past errors

This is my first question so I apologise in advance if I format/ask it all wrong.
I am using Perl to extract a string from a file, submit a web form, and download a new file created by the web-page. The aim is to have it run for 30,000 files in a loop, which I estimate will take ~8 days. I am using WWW::Selenium and WWW::Mechanize to perform the web automation. The issue I have is that if for some reason a page doesn't load properly or the internet drops for a period of time then the script exits and gives an error message like(depending on which stage it failed at):
Error requesting http://localhost:4444/selenium-server/driver/:
ERROR: Could not find element attribute: link=Download PDB File#href
I would like the script to continue running, moving onto the next round of the loop so I don't have to worry if a single round of the loop throws an error. My research suggests that using Try::Tiny may be the best solution. Currently I have the script below using only try{...} which seems to suppress any error and allow the script to continue through the files. However I'm concerned that this seems to be a very blunt solution and provides me no insight into which/why files failed.
Ideally I would want to print the filename and error message for each occurence to another file that could then be reviewed once the script is complete but I am struggling to understand how to use catch{...} to do this or if that is even the correct solution.
use strict;
use warnings;
use WWW::Selenium;
use WWW::Mechanize;
use Try::Tiny;
my #fastas = <*.fasta>;
foreach my $file (#fastas) {
try{
open(my $fh, "<", $file);
my $sequence;
my $id = substr($file, 0, -6);
while (my $line = <$fh>) {
## discard fasta header line
} elsif($line =~ /^>/) { # / (turn off wrong coloring)
next;
## keep line, add to sequence string
} else {
$sequence .= $line;
}
}
close ($fh);
my $sel = WWW::Selenium->new( host => "localhost",
port => 4444,
browser => "*firefox",
browser_url => "http://www.myurl.com",
);
$sel->start;
$sel->open("http://www.myurl.com");
$sel->type("chain1", $sequence);
$sel->type("chain2", "EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS");
$sel->click("css=input.btn.btn-success");
$sel->wait_for_page_to_load("30000");
## Wait through the holding page - will timeout after 5 mins
$sel->wait_for_element_present("link=Download PDB File", "300000");
## Get the filename part of link
$sel->wait_for_page_to_load("30000");
my $pdbName = $sel->get_attribute("link=Download PDB File\#href");
## Concatenate it with the main domain
my $link = "http://www.myurl.com/" . $pdbName;
$sel->stop;
my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech -> get($link);
#print $mech -> content();
$mech -> save_content($id . ".pdb");
};
}
You are completely right that you want to see, log, and review all errors (and warnings). The mechanism and syntax provided by Try::Tiny is meant to be bare-bones and simple to use.
use warnings;
use strict;
use feature qw(say);
use Try::Tiny;
my #fastas = <*.fasta>;
my $errlog = 'error_log.txt';
open my $fh_err, '>', $errlog or die "Can't open $errlog for writing: $!";
foreach my $file (#fastas) {
try {
# processing, potentially throwing a die
}
catch {
say $fh_err "Error with $file: $_"; # NOTE, it is $_ (not $! or $#)
};
}
close $fh_err;
# Remove the log if empty
if (-z $errlog) {
say "No errors logged, removing $errlog";
unlink $errlog or warn "Can't unlink $errlog: $!";
}
You can save names of files for which the processing failed, with push #failed_files, $file inside the catch { } block. Then the code can attempt again after the main processing, if you know that errors are mostly due to random connection problems. And having the list of failed files is handy.
Note that with v5.14 the problems that this module addresses were fixed, so that a normal use of eval is fine. It is mostly a matter of preference at this point, but note that Try::Tiny has a few twists of its own. See this post for a discussion.
This addresses the question of the simple exception handling, not the rest of the code.

Handling unicode directory and filenames in Perl on Windows

I have an encoding problem with Perl and Windows. On a Windows 7 running Perl (strawberry 5.16) and a simple TK GUI I need to open files and/or access directories with non-english characters in their name/path. For opening files I've come out with this solution which seems to work fine:
#!/usr/bin/perl -w
use strict;
use warnings;
use Win32::Unicode::File;
use Encode;
use Tk;
my $mw = Tk::MainWindow->new;
my $tissue_but = $mw->Button(
-text => 'Open file',
-command => [ \&select_unicode_file ],
);
$tissue_but->grid( -row => 3, -column => 1 );
Tk::MainLoop();
sub select_unicode_file{
my $types = [ ['Txt', '.txt'],
['All Files', '*'],];
my $input_file= $mw->getOpenFile(-filetypes => $types);
my $fh = Win32::Unicode::File->new;
if ($fh->open('<', $input_file)){
while (my $line = $fh->readline()){
print "\n$line\n";
}
close $fh;
}
else{
print "Couldn't open file: $!\n";
}
}
This correctly opens files such as Поиск/Поиск.txt
What I CANNOT do is to simply get a directory path and than process it. I think I should use use Win32::Unicode::Dir but I really can't understand the documentation.
It should be something like this:
#!/usr/bin/perl -w
use strict;
use warnings;
use Win32::Unicode::Dir;
use Encode;
use Tk;
my $mw = Tk::MainWindow->new;
my $tissue_but = $mw->Button(
-text => 'Open file',
-command => [ \&select_unicode_directory ],
);
$tissue_but->grid( -row => 3, -column => 1 );
Tk::MainLoop();
sub select_unicode_directory{
my $dir = $mw->chooseDirectory( );
my $wdir = Win32::Unicode::Dir->new;
my $dir = $wdir->open($dir) || die $wdir->error;
my $dir_complete = "$dir/a.txt";
open (MYFILE, $dir_complete );
while (<MYFILE>) {
chomp;
print "$_\n";
}
close (MYFILE);
}
There is a logical error in:
my $dir = $wdir->open($dir) || die $wdir->error;
my $dir_complete = "$dir/a.txt";
$wdir->open('path') returns an object, not a string. You can't use it like a path. But that is not the worst of it. Sadly, it seems like the Tk implementation does not yet have support for Unicode file names (including chooseDirectory). I guess you will have to write a custom dir selector, but I'm not sure it's even possible.
This is capable of listing files in an ascii-chars folder (and ->fetch can list utf-8 files), and crashes when opening a folder with utf-8 chars. Well, to be fair it crashes when opening ??????.
use strict;
use warnings;
use Win32::Unicode::Dir;
use Win32::Unicode::Console;
use Encode;
use Tk;
my $mw = Tk::MainWindow->new;
my $tissue_but = $mw->Button(
-text => 'Select dir',
-command => [ \&select_unicode_directory ],
);
$tissue_but->grid( -row => 3, -column => 1 );
Tk::MainLoop();
sub select_unicode_directory {
my $wdir = Win32::Unicode::Dir->new;
my $selected = $mw->chooseDirectory(-parent =>$mw);
# http://search.cpan.org/dist/Tk/pod/chooseDirectory.pod#CAVEATS
$selected = encode("utf-8", $selected);
print "selected: $selected\n";
$wdir->open($selected) || die $wdir->error;
print "\$mw->chooseDirectory: $selected\n";
print "\$wdir->open(\$selected): $wdir\n";
# CRASH HERE, presumably because winders can't handle '?' in a file (dir) name
for ($wdir->fetch) {
# http://search.cpan.org/~xaicron/Win32-Unicode-0.38/lib/Win32/Unicode/Dir.pm
next if /^\.{1,2}$/;
my $path = "$selected/$_";
if (file_type('f', $path)) { print "file: $path\n"; }
elsif (file_type('d', $path)) { print " dir: $path\n"; }
}
print "closing \n";
$wdir->close || die $wdir->error;
}
Sample out (opening Поиск/):
Both samples below were run using: Strawberry Perl 5.12.3 built for MSWin32-x64-multi-thread
selected: C:/cygwin/home/jaroslav/tmp/so/perl/open-file-tk/?????
$mw->chooseDirectory: C:/cygwin/home/jaroslav/tmp/so/perl/open-file-tk/?????
$wdir->open($selected): Win32::Unicode::Dir=HASH(0x2e38158)
>>> perl crash <<<
Sample out (opening parent of Поиск):
selected: C:/cygwin/home/jaroslav/tmp/so/perl/open-file-tk
$mw->chooseDirectory: C:/cygwin/home/jaroslav/tmp/so/perl/open-file-tk
$wdir->open($selected): Win32::Unicode::Dir=HASH(0x2b92c10)
file: C:/cygwin/home/jaroslav/tmp/so/perl/open-file-tk/.select_uni_dir.pl.swp
file: C:/cygwin/home/jaroslav/tmp/so/perl/open-file-tk/o
file: C:/cygwin/home/jaroslav/tmp/so/perl/open-file-tk/o.dir
file: C:/cygwin/home/jaroslav/tmp/so/perl/open-file-tk/select_uni_dir.pl
file: C:/cygwin/home/jaroslav/tmp/so/perl/open-file-tk/select_uni_file.pl
dir: C:/cygwin/home/jaroslav/tmp/so/perl/open-file-tk/Поиск
Conclusion
The Tk dir selector returns ????? instead of Поиск. While looking for a way to enable Unicode in Tk, I found this:
http://search.cpan.org/dist/Tk/pod/UserGuide.pod#Perl/Tk_and_Unicode :
(...) Unfortunately, there are still places in Perl ignorant of
Unicode. One of these places are filenames. Consequently, the file selectors
in Perl/Tk do not handle encoding of filenames properly. Currently they
suppose that filenames are in iso-8859-1 encoding, at least on Unix systems.
As soon as Perl has a concept of filename encodings, then Perl/Tk will also
implement such schemes.
So at first glance it seems what you're trying to do is impossible (unless you
write or find a custom dir-selector). Actually, it may not be a bad idea to
submit this bug, because the GUI did show "Поиск" so the error is in the return value.

Perl How to merge two or more excel files in one (multiple worksheets)?

I need to merge a few excel file into one, multiple sheets.
I do not care too much about the sheet name on the new file.
I do not have Excel on the computer I plan to run this. so I cannot use Win32 OLE.
I attempted to run this code https://sites.google.com/site/mergingxlsfiles/ but it is not working, I get a new empty excel file.
I attempt to run http://www.perlmonks.org/?node_id=743574 but I only obtained one of the file in the new excel file.
My input excel files have some french characters (é for e.g.) I believe these are cp1252.
Code used :
#!/usr/bin/perl -w
use strict;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;
use File::Glob qw(bsd_glob);
use Getopt::Long;
use POSIX qw(strftime);
GetOptions(
'output|o=s' => \my $outfile,
'strftime|t' => \my $do_strftime,
) or die;
if ($do_strftime) {
$outfile = strftime $outfile, localtime;
};
my $output = Spreadsheet::WriteExcel->new($outfile)
or die "Couldn't create '$outfile': $!";
for (#ARGV) {
my ($filename,$sheetname,$targetname);
my #files;
if (m!^(.*\.xls):(.*?)(?::([\w ]+))$!) {
($filename,$sheetname,$targetname) = ($1,qr($2),$3);
warn $filename;
if ($do_strftime) {
$filename = strftime $filename, localtime;
};
#files = glob $filename;
} else {
($filename,$sheetname,$targetname) = ($_,qr(.*),undef);
if ($do_strftime) {
$filename = strftime $filename, localtime;
};
push #files, glob $filename;
};
for my $f (#files) {
my $excel = Spreadsheet::ParseExcel::Workbook->Parse($f);
foreach my $sheet (#{$excel->{Worksheet}}) {
if ($sheet->{Name} !~ /$sheetname/) {
warn "Skipping '" . $sheet->{Name} . "' (/$sheetname/)";
next;
};
$targetname ||= $sheet->{Name};
#warn sprintf "Copying %s to %s\n", $sheet->{Name}, $targetname;
my $s = $output->add_worksheet($targetname);
$sheet->{MaxRow} ||= $sheet->{MinRow};
foreach my $row ($sheet->{MinRow} .. $sheet->{MaxRow}) {
my #rowdata = map {
$sheet->{Cells}->[$row]->[$_]->{Val};
} $sheet->{MinCol} .. $sheet->{MaxCol};
$s->write($row,0,\#rowdata);
}
}
};
};
$output->close;
I have 2 excel files named: 2.xls (only 1 sheet named 2 in it), 3.xls (only 1 sheet named 3)
I launched the script as this:
xlsmerge.pl -s -o results-%Y%m%d.xls 2.xls:2 3.xls:3
Results: results-20121024.xls empty nothing in it.
Then I tried
xlsmerge.pl -s -o results-%Y%m%d.xls 2.xls 3.xls
And it worked.
I am not sure why is it failing while adding the Sheetname
It appears that there is a bug in this line of the script:
if (m!^(.*\.xls):(.*?)(?::([\w ]+))$!) {
($filename,$sheetname,$targetname) = ($1,qr($2),$3);
...
It looks to me like the goal of that line is to allow arguments either in the form
spreadsheet.xls:source_worksheet
or in another form allowing the name of the target sheet to be specified:
spreadsheet.xls:source_worksheet:target_worksheet
The last grouping appears intended to capture that last, optional argument: (?::([\w ]+)). The only problem is, this grouping was not made optional. Thus, when you only specify the source sheet and not the target, the regex fails to match and it falls to the backup behavior, which is to treat the whole argument as the filename. But this fails, too, because you don't have a file called 2.xls:2.
The solution would be to introduce the ? modifier after the last group in the regex to make it optional:
if (m!^(.*\.xls):(.*?)(?::([\w ]+))?$!) {
($filename,$sheetname,$targetname) = ($1,qr($2),$3);
...
Of course, that may not be the only problem. If the script was posted with an error, there could be other errors, too. I don't have Perl available to test it at the moment.

How to use the Windows "hook function" with Perl Win32::GUI?

I am working on Windows XP SP 3, Strawberry Perl.
I would like to let the user of my Perl program select a file; but when using Win32::GUI::GetOpenFileName(), I would like the Windows file selection dialog to open in the "Details" file list option, and not in the default "List" file list option.
Googling on the Net, it seems that I have to use the Windows "hook" function, and send certain messages to the file selection control. The documentation about that is MSDN, and I don't seem to master how to apply it in Perl.
Can anyone recommend what should be the right call syntax in Perl?
Here is my code sample, where the file selection dialog opens with the (default) "List" option:
use strict;
use warnings;
use 5.014;
use Win32::Console;
use Win32::GUI();
use autodie;
use warnings qw< FATAL utf8 >;
use Carp::Always;
use Win32API::File::Time qw{:win};
use Image::ExifTool qw(:Public);
use Date::Parse;
# ...
my ( $FileName, $ImageDir, $DIR, $TopDir);
# ...
$TopDir = 'D:\My Documents';
$ImageDir = Win32::GUI::BrowseForFolder( -root => $TopDir, -includefiles => 1,);
unless ($ImageDir) {
say '$DirName not defined after calling Win32::GUI::BrowseForFolder, ',
'Photo date set line'.__LINE__;
exit;
}
else {
say "Identified directory: $ImageDir";
}
# now select a file
$FileName = Win32::GUI::GetOpenFileName( -title => 'Select an image file', -directory => $ImageDir,
-file => "\0" . " " x 256,
-filter => ["Image files (*.jpg)" => "*.jpg;*.jpeg", "All files", "*.*", ],);
unless ($FileName) {
say '$FileName not defined after calling Win32::GUI::GetOpenFileName, ',
'Photo date set line'.__LINE__;
}
else {
say "Identified image file: $FileName";
}
# ...
Note: (somewhat) similar post at: http://www.perlmonks.org/?node_id=989418
Unfortunately the Win32::GUI API exposes neither the OFN_ENABLEHOOK flag bit or the lpfnHook field of the GetOpenFileName options.
You could perhaps get it working using the Win32::API module to work at an even lower level, but you would have to build the entire OPENFILENAME structure yourself using pack and write some XS code for the hook handler.

Getting list of hyperlinks from an Excel worksheet with Perl Win32::OLE

I want to change the path for a bunch of hyperlinks in an Excel spreadsheet. After searching Google, I came across a solutions to the problem of adding hyperlinks to spreadsheets, but not changing them. Microsoft showed how to something close with VBA here.
Since I want to edit every single hyperlink in my document, the key steps that I don't know how to solve are:
Get a list of hyperlink objects in Perl
Extract their addresses 1 by 1 and
Run a regular expression to make the path change
Store the updated path in the Hyperlink->object and repeat
I am new to using the OLE and am getting tripped up on (1). Here is what I have tried so far:
#!perl
use strict;
use warnings;
use 5.014;
use OLE;
use Win32::OLE::Const "Microsoft Excel";
my $file_name = 'C:\path\to\spreadsheet.xlsx';
my $excel = Win32::OLE->new('Excel.Application', sub {$_[0]->Quit;});
$excel->{Visible} = 1;
my $workbook = $excel->Workbooks->Open($file_name);
my $sheet = $workbook->Worksheets('Sheet 1');
foreach my $link (in $sheet->Hyperlinks ) {
say $link->Address;
}
But this gives code the error:
Win32::OLE(0.1709): GetOleEnumObject() Not a Win32::OLE::Enum object at C:/Dwimperl/perl/vendor/lib/Win32/OLE/Lite.pm line 167.
Can't call method "Hyperlinks" without a package or object reference at at script.pl line 14.
It's selecting the right worksheet, so I am not sure why it complains about an object reference. I tried several variations (Adding {} around Hyperlinks, removing the 'in', trying to store it as a list, as a hash, and as a reference to a hash) Can anyone give me some pointers? Thanks!
First, you should set $Win32::OLE::Warn=3 so your script will croak the moment something goes wrong. Second, I know you can't select sheets by name in older versions of Excel, although I do not know what things are like in the newest versions. Finally, I think you'll find it easier to use Win32::OLE::Enum.
Here is an example:
#!/usr/bin/env perl
use 5.014;
use warnings; use strict;
use Carp qw( croak );
use Path::Class;
use Try::Tiny;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Excel';
use Win32::OLE::Enum;
$Win32::OLE::Warn = 3;
my $book_file = file($ENV{TEMP}, 'test.xls');
say $book_file;
my $excel = Win32::OLE->new('Excel.Application', sub {$_[0]->Quit;});
$excel->{Visible} = 1;
my $book = $excel->Workbooks->Open("$book_file");
my $sheet = get_sheet($book, 'Sheet with Hyperlinks');
my $links = $sheet->Hyperlinks;
my $it = Win32::OLE::Enum->new($links);
while (defined(my $link = $it->Next)) {
my $address = $link->{Address};
say $address;
if ($address =~ s/example/not.example/) {
$link->{Address} = $address;
$link->{TextToDisplay} = "Changed to $address";
}
}
$book->Save;
$book->Close;
$excel->Quit;
sub get_sheet {
my ($book, $wanted_sheet) = #_;
my $sheets = $book->Worksheets;
my $it = Win32::OLE::Enum->new($sheets);
while (defined(my $sheet = $it->Next)) {
my $name = $sheet->{Name};
say $name;
if ($name eq $wanted_sheet) {
return $sheet;
}
}
croak "Could not find '$wanted_sheet'";
}
The workbook did contain a sheet with the name "Sheet with Hyperlinks". Cell A1 in that sheet contained http://example.com and A2 contained http://stackoverflow.com.