Convert Word doc or docx files into text files? - perl

I need a way to convert .doc or .docx extensions to .txt without installing anything. I also don't want to have to manually open Word to do this obviously. As long as it's running on auto.
I was thinking that either Perl or VBA could do the trick, but I can't find anything online for either.
Any suggestions?

A simple Perl only solution for docx:
Use Archive::Zip to get the word/document.xml file from your docx file. (A docx is just a zipped archive.)
Use XML::LibXML to parse it.
Then use XML::LibXSLT to transform it into text or HTML format. Search the web to find a nice docx2txt.xsl file :)
Cheers !
J.

Note that an excellent source of information for Microsoft Office applications is the Object Browser. You can access it via Tools → Macro → Visual Basic Editor. Once you are in the editor, hit F2 to browse the interfaces, methods, and properties provided by Microsoft Office applications.
Here is an example using Win32::OLE:
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec::Functions qw( catfile );
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Word';
$Win32::OLE::Warn = 3;
my $word = get_word();
$word->{Visible} = 0;
my $doc = $word->{Documents}->Open(catfile $ENV{TEMP}, 'test.docx');
$doc->SaveAs(
catfile($ENV{TEMP}, 'test.txt'),
wdFormatTextLineBreaks
);
$doc->Close(0);
sub get_word {
my $word;
eval {
$word = Win32::OLE->GetActiveObject('Word.Application');
};
die "$@\n" if $@;
unless(defined $word) {
$word = Win32::OLE->new('Word.Application', sub { $_[0]->Quit })
or die "Oops, cannot start Word: ",
Win32::OLE->LastError, "\n";
}
return $word;
}
__END__

For .doc, I've had some success with the Linux command-line tool antiword. It extracts the text from a .doc file very quickly and renders indentation well. You can then redirect its output to a text file in bash.
For .docx, I've used the OOXML SDK, as some other users mentioned. It is just a .NET library that makes it easier to work with the OOXML zipped up in an OOXML file. There is a lot of metadata that you will want to discard if you are only interested in the text. I see that other people have already written the code: DocXToText.
I have also found that Aspose.Words has a very simple API with great support.
There is also this bash command from commandlinefu.com which works by unzipping the .docx:
unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
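For a Perl-only route, the same pipeline can be sketched with the core module IO::Uncompress::Unzip, so nothing has to be installed from CPAN. This is a rough sketch, not a full converter: the helper name docx_to_text is made up for illustration, the paragraph handling (turning each closing w:p tag into a newline) is an addition that the sed expression does not have, and the result is returned as raw UTF-8 bytes.

```perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Hypothetical helper: return the text of a .docx file as raw UTF-8 bytes.
sub docx_to_text {
    my ($path) = @_;

    # Pull a single member out of the archive, like `unzip -p`.
    my $xml;
    unzip $path => \$xml, Name => 'word/document.xml'
        or die "Cannot extract word/document.xml from $path: $UnzipError\n";

    # End every paragraph with a newline before the tags disappear.
    $xml =~ s{</w:p>}{\n}g;

    # Drop all remaining tags, as the sed expression does.
    $xml =~ s/<[^>]+>//g;

    # Decode the five predefined XML entities in a single pass.
    my %entity = (lt => '<', gt => '>', quot => '"', apos => "'", amp => '&');
    $xml =~ s/&(lt|gt|quot|apos|amp);/$entity{$1}/g;

    return $xml;
}
```

Calling print docx_to_text('some.docx'); and redirecting the output to a file would then give the .txt version.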

I strongly recommend Aspose.Words if you can use Java or .NET. It can convert, without Word installed, between all major text file types.

If you have some flavour of unix installed, you can use the 'strings' utility to find and extract all readable strings from the document. There will be some mess before and after the text you are looking for, but the results will be readable.

Note that you can also use OpenOffice to convert documents, drawings, spreadsheets, etc. on both Windows and *nix platforms.
You can access OpenOffice programmatically (in a way analogous to COM on Windows) via UNO from a variety of languages for which a UNO binding exists, including from Perl via the OpenOffice::UNO module.
On the OpenOffice::UNO page you will also find a sample Perl scriptlet that opens a document. All you then need to do is export it to txt with the document.storeToURL() method; see a Python example, which can easily be adapted to Perl.

.doc files that use WordprocessingML, and .docx files with their XML format, can have their XML parsed to retrieve the actual text of the document. You'll have to read the specifications to figure out which tags contain readable text.
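In the .docx flavour of WordprocessingML, paragraphs are w:p elements and the visible text sits in w:t elements inside them. A minimal sketch of pulling those out of an already-extracted word/document.xml string follows. The function names are made up for illustration, and the sketch uses regular expressions rather than a real XML parser, so it ignores tabs, line breaks, and paragraphs nested inside text boxes.

```perl
use strict;
use warnings;

# Decode the five predefined XML entities in a single pass.
sub decode_xml_entities {
    my ($s) = @_;
    my %entity = (lt => '<', gt => '>', quot => '"', apos => "'", amp => '&');
    $s =~ s/&(lt|gt|quot|apos|amp);/$entity{$1}/g;
    return $s;
}

# Hypothetical helper: given the contents of word/document.xml,
# return one string per non-empty-element paragraph.
sub paragraphs_from_document_xml {
    my ($xml) = @_;
    my @paragraphs;

    # \b keeps <w:pPr> from matching; (?<!/) skips self-closing <w:p/>.
    while ($xml =~ m{<w:p\b[^>]*(?<!/)>(.*?)</w:p>}gs) {
        my $body = $1;

        # Collect the text of every <w:t> run inside this paragraph.
        my @runs = $body =~ m{<w:t\b[^>]*(?<!/)>(.*?)</w:t>}gs;
        push @paragraphs, join '', map { decode_xml_entities($_) } @runs;
    }
    return @paragraphs;
}
```

Joining the returned list with newlines gives a plain-text rendering of the document body.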

The method of Sinan Ünür works well.
However, it crashed on some of the files I was converting.
Another method is to use Win32::OLE and Win32::Clipboard as follows:
Open the Word document
Select all the text
Copy it to the clipboard
Write the clipboard contents to a txt file
Empty the clipboard and close the Word document
Based on the script given by Sigvald Refsu in http://computer-programming-forum.com/53-perl/c44063de8613483b.htm, I came up with the following script.
Note: I chose to save the txt file with the same basename as the .docx file and in the same folder but this can easily be changed
###########################################
use strict;
use File::Spec::Functions qw( catfile );
use FindBin '$Bin';
use Win32::OLE qw(in with);
use Win32::OLE::Const 'Microsoft Word';
use Win32::Clipboard;
my $monitor_word=0; #set 1 to watch MS Word being opened and closed
sub docx2txt {
##Note: the path shall be in the form "C:\dir\ with\ space\file.docx";
my $docx_file=shift;
#MS Word object
my $Word = Win32::OLE->new('Word.Application', 'Quit') or die "Couldn't run Word";
#Monitor what happens in MS Word
$Word->{Visible} = 1 if $monitor_word;
#Open file
my $Doc = $Word->Documents->Open($docx_file);
with ($Doc, ShowRevisions => 0); #Turn off revision marks
#Select the complete document
$Doc->Select();
my $Range = $Word->Selection();
with ($Range, ExtendMode => 1);
$Range->SelectAll();
#Copy selection to clipboard
$Range->Copy();
#Create txt file
my $txt_file=$docx_file;
$txt_file =~ s/\.docx$/.txt/;
open(my $text_fh, '>', $txt_file) or die "Error while trying to write to $txt_file ($!)";
printf {$text_fh} ("%s\n", Win32::Clipboard::Get());
close $text_fh;
#Empty the Clipboard (to prevent warning about "huge amount of data in clipboard")
Win32::Clipboard::Set("");
#Close Word file without saving
$Doc->Close({SaveChanges => wdDoNotSaveChanges});
# Disconnect OLE
undef $Word;
}
Hope it helps.

You can't do it in VBA if you don't want to start Word (or another Office application). Even if you meant VB, you'd still have to start a (hidden) instance of Word to do the processing.

I need a way to convert .doc or .docx extensions to .txt without installing anything
for I in *.doc?; do mv "$I" "$(echo "$I" | sed -E 's/\.docx?$/.txt/')"; done
Just joking.
You could use antiword for the older versions of Word documents, and try to parse the xml of the new ones.

With docxtemplater, you can easily get the full text of a Word document (works with docx only).
Here's the code (Node.js):
DocxTemplater=require('docxtemplater');
doc=new DocxTemplater().loadFromFile("input.docx");
result=doc.getFullText();
This is just three lines of code and doesn't depend on any Word instance (it is all plain JS).

Related

Reading Xlsx from another Xlsx file

I have a few xlsx files, say X.xlsx, Y.xlsx and Z.xlsx, and I have embedded those three files in another xlsx file, say A.xlsx. Now I want to read the content of the three xlsx files (X, Y, Z) in one go through A.xlsx.
Can anyone help me with this?
Thanks in advance
This is easy on Windows if your target machine also has Microsoft Excel installed.
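Use the Win32::OLE module to create an instance of Excel, open your master file A.xlsx and then iterate over the ->{OLEObjects} collection of the worksheet that holds the embedded files:
#!perl
use strict;
use warnings;
use Win32::OLE 'in';
my $ex = Win32::OLE->new('Excel.Application') or die "oops\n";
my $Axlsx = $ex->Workbooks->Open('C:\\Path\\To\\A.xlsx');
my $i = 0;
# Assumption: the embedded files sit on the first worksheet;
# OLEObjects belongs to a worksheet, not to the workbook.
for my $embedded (in $Axlsx->Worksheets(1)->OLEObjects) {
$embedded->Object->Activate();
# $i++ does not interpolate inside a string, so build the name by concatenation
$embedded->Object->SaveAs('test' . $i++ . '.xlsx');
$embedded->Object->Close;
}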
After saving them, you can treat them as normal Excel files. Alternatively, you can work directly with $embedded->Object, but as you haven't told us what exactly you need to do, it's hard to give specific advice.
See also Save as an Excel file embedded in another Excel file

Open Excel file in perl and print row count

I am using the Win32::OLE module to open an Excel file and get the row count. When I hard-code the Excel file path it works fine, but when I build the path dynamically it throws an error saying "cant call method workbooks on unblessed reference". Please find the sample code below.
use OLE;
use Win32::OLE::Const 'Microsoft Excel';
my $xapp= Win32::OLE->GetActiveObject('Excel.Application')
or do { Win32::OLE->new('Excel.Application', 'Quit')};
$xapp->{'Visible'} = 0;
my $file='excel.xlsx';
my $fileName="c:/users/mujeeb/desktop/".$file;
print $fileName;
my $wkb = $xapp->Workbooks->Open($fileName); # here I am getting the error because I am passing a dynamic fileName
my $wks = $wkb->Worksheets('Sheet1');
my $Tot_Rows=$wks->UsedRange->Rows->{'Count'};
print $Tot_Rows."\n";
$xapp->close;
Use backslashes in the filename.
The filename is given to Excel, and Excel won't understand forward slashes. Perl does not convert them because Perl doesn't know the string is a file name.
Are you sure that a method named Open exists? I don't see it in the documentation of Win32::OLE. Also, you must add use Win32::OLE; to your code.
You could use this line of code to convert the path into a form that OLE can read:
my $file='excel.xlsx';
my $fileName="c:/users/mujeeb/desktop/".$file;
$fileName=~s/[\/]/\\/g;
print $fileName;
outputs:
c:\users\mujeeb\desktop\excel.xlsx

Perl - How to crawl a directory, parse every file in the directory and extract all comments to html file

I need some serious help. I'm new to Perl and need help creating a Perl script that prompts the user for a directory containing Perl files, parses every file in that directory, and then extracts all comments from each file into individual HTML files.
Code examples or existing modules that already do this would be great.
Thank you!
PPI can be used to parse Perl code files. This should get you started on getting Perl files in a directory (assuming they have .pl extensions) and grabbing the comments. I'm not sure what you mean about the HTML piece:
use warnings;
use strict;
use PPI;
my $dir = shift;
for my $file (glob "$dir/*.pl") {
my $doc = PPI::Document->new($file);
for my $com (@{ $doc->find('PPI::Token::Comment') || [] }) {
print $com->{content};
}
}
Update: Look at HTML::Template (but it may be overkill).
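If "the HTML piece" simply means one HTML page per source file, a template module may not be needed. A minimal sketch in core Perl follows, assuming the comment strings have already been collected (for example from the PPI loop above). The function names and the "$source.html" naming scheme are made up for illustration.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML text and attributes.
sub escape_html {
    my ($s) = @_;
    my %map = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $s =~ s/([&<>"])/$map{$1}/g;
    return $s;
}

# Hypothetical helper: build a small HTML page listing the given comments.
sub comments_to_html {
    my ($title, @comments) = @_;
    my $items = '';
    for my $comment (@comments) {
        # Comment strings may carry a trailing newline; trim a copy.
        (my $text = $comment) =~ s/\s+\z//;
        $items .= '<li><code>' . escape_html($text) . "</code></li>\n";
    }
    return "<!DOCTYPE html>\n<html>\n<head><title>" . escape_html($title)
         . "</title></head>\n<body>\n<ul>\n$items</ul>\n</body>\n</html>\n";
}

# Hypothetical helper: write the page for one source file to "$source.html".
sub write_comment_page {
    my ($source, @comments) = @_;
    my $out = "$source.html";
    open my $fh, '>', $out or die "Cannot write $out: $!\n";
    print {$fh} comments_to_html($source, @comments);
    close $fh or die "Cannot close $out: $!\n";
    return $out;
}
```

Inside the per-file loop, the comments would be pushed onto an array instead of printed, and write_comment_page($file, @comments) called once per file.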
A simple CPAN search with the keyword "dir" turned up a whole slew of helpful modules. One of the ones I use a lot is:
IO::Dir
If you have a choice, here's a Ruby script
#!/usr/bin/env ruby
print "Enter directory: "
directory=File.join(gets.chomp,"*.pl")
c=0
Dir[directory].each do |file|
c+=1
o = File.open("file_#{c}.html","w")
File.open(file).each do |line|
if line[/#/]
o.write ( line.scan(/;*\s+(#.*)$/)[0].first + "\n" ) if line[/;*\s+#/]
o.write ( line.scan(/^\s+(#.*)$/)[0].first + "\n") if line[/^\s+#/]
end
end
o.close
end

How can my Perl script determine whether an Excel file is in XLS or XLSX format?

I have a Perl script that reads data from an Excel (xls) binary file. But the client that sends us these files has started sending us XLSX format files at times. I've updated the script to be able to read those as well. However, the client sometimes likes to name the XLSX files with an .xls extension, which currently confuses the heck outta my script since it uses the file name to determine which file type it is.
An XLSX file is a zip file that contains XML stuff. Is there a simple way for my script to look at the file and tell whether it's a zip file or not? If so, I can make my script go by that instead of just the file name.
Yes, it is possible by checking the magic number.
There are quite a few Perl modules for checking the magic number of a file.
An example using File::LibMagic:
use strict;
use warnings;
use File::LibMagic;
my $lm = File::LibMagic->new();
if ( $lm->checktype_filename($filename) eq 'application/zip; charset=binary' ) {
# XLSX format
}
elsif ( $lm->checktype_filename($filename) eq 'application/vnd.ms-office; charset=binary' ) {
# XLS format
}
Another example, using File::Type:
use strict;
use warnings;
use File::Type;
my $ft = File::Type->new();
if ( $ft->mime_type($file) eq 'application/zip' ) {
# XLSX format
}
else {
# probably XLS format
}
.xlsx files have the first 2 bytes as 'PK', so a simple open and examination of the first 2 characters will do.
Edit: Archive::Zip is a better solution:
use Archive::Zip qw( :ERROR_CODES );
# Read a Zip file
my $somezip = Archive::Zip->new();
unless ( $somezip->read( 'someZip.zip' ) == AZ_OK ) {
die 'read error';
}
Use File::Type:
my $file = "foo.zip";
my $filetype = File::Type->new( );
if( $filetype->mime_type( $file ) eq 'application/zip' ) {
# File is a zip archive.
...
}
I just tested it with a .xlsx file, and the mime_type() returned application/zip. Similarly, for a .xls file the mime_type() is application/octet-stream.
You can detect the xls file by checking the first bytes of the file for Excel headers.
A list of valid older Excel headers can be gotten from here (unless you know exact version of their Excel, check for all applicable possibilities):
http://toorcon.techpathways.com/uploads/headersig.txt
Zip headers are described here: http://en.wikipedia.org/wiki/ZIP_(file_format)#File_headers
but I'm not sure whether .xlsx files have the same headers.
File::Type's logic seems to be to look for "PK\003\004" as the file header to decide on zip files, but I'm not certain whether that logic would work for .xlsx, as I don't have a file to test with.
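The header check can be sketched in core Perl by reading the first eight bytes of the file. The sketch compares them with the zip local-file signature ("PK\003\004") and with the OLE2 compound-document signature (D0 CF 11 E0 A1 B1 1A E1) that binary .xls files from Excel 97 onwards normally start with. The function name sniff_container is made up for illustration, and a zip signature only shows that the file is a zip archive, not that it is specifically an .xlsx workbook.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a file by its leading bytes.
# Returns 'zip', 'ole2', or 'unknown'.
sub sniff_container {
    my ($path) = @_;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!\n";
    my $head = '';
    my $count = read($fh, $head, 8);
    die "Cannot read $path: $!\n" unless defined $count;
    close $fh;

    # Zip local file header: "PK" followed by bytes 3 and 4.
    return 'zip' if substr($head, 0, 4) eq "PK\x03\x04";

    # OLE2 compound document header, the container of binary .xls files.
    return 'ole2' if $head eq "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";

    return 'unknown';
}
```

A script could then pick the XLSX reader for 'zip', the XLS reader for 'ole2', and fall back to the file extension otherwise.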
The-Evil-MacBook:~ ivucica$ file --mime-type --brief file.zip
application/zip
Hence, probably comparing
`file --mime-type --brief $filename`
with application/zip would do the trick of detecting zips. Of course, you need to have file installed, which is quite usual on UNIX systems. I'm afraid I cannot provide a Perl example, since all knowledge of Perl has evaporated from my memory and I have no examples at hand.
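A Perl version of that comparison could look like the sketch below. It assumes the file utility is on the PATH, and the function name mime_type_via_file is made up for illustration. The list form of the pipe open bypasses the shell, so file names with spaces or quotes need no escaping.

```perl
use strict;
use warnings;

# Hypothetical helper: ask file(1) for the MIME type of a path.
sub mime_type_via_file {
    my ($path) = @_;

    # '--' ends option parsing, so a name starting with '-' is still a file.
    open my $pipe, '-|', 'file', '--mime-type', '--brief', '--', $path
        or die "Cannot run file: $!\n";
    my $type = <$pipe>;
    close $pipe;

    $type = '' unless defined $type;
    chomp $type;
    return $type;
}
```

The result can then be compared with 'application/zip'. Some versions of file may report a more specific Office MIME type for .xlsx files, so it is worth checking what the installed version prints for a known workbook before relying on a single string.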
I can't speak for Perl, but with the framework I use, .NET, there are a number of libraries available for manipulating zip files that you could use.
Another thing that I've seen people use is the command-line version of WinZip. It gives a return value that is 0 when a file is unzipped and non-zero when there is an error.
This may not be the best way to do this, but it's a start.

How can I modify an existing Excel workbook with Perl?

With Spreadsheet::WriteExcel, I can create a new workbook, but what if I want to open an existing book and modify certain columns? How would I accomplish that?
I could parse all of the data out of the sheet using Spreadsheet::ParseExcel and then write it back with new values in certain rows/columns using Spreadsheet::WriteExcel. However, is there a module that already combines the two?
Mainly I just want to open a .xls, overwrite certain rows/columns, and save it.
Spreadsheet::ParseExcel will read in existing Excel files:
my $parser = Spreadsheet::ParseExcel->new();
# $workbook is a Spreadsheet::ParseExcel::Workbook object
my $workbook = $parser->Parse('Book1.xls');
But what you really want is Spreadsheet::ParseExcel::SaveParser, which is a combination of Spreadsheet::ParseExcel and Spreadsheet::WriteExcel. There is an example near the bottom of the documentation.
If you have Excel installed, then it's almost trivial to do this with Win32::OLE. Here is the example from Win32::OLE's own documentation:
use Win32::OLE;
# use existing instance if Excel is already running
eval {$ex = Win32::OLE->GetActiveObject('Excel.Application')};
die "Excel not installed" if $@;
unless (defined $ex) {
$ex = Win32::OLE->new('Excel.Application', sub {$_[0]->Quit;})
or die "Oops, cannot start Excel";
}
# get a new workbook
$book = $ex->Workbooks->Add;
# write to a particular cell
$sheet = $book->Worksheets(1);
$sheet->Cells(1,1)->{Value} = "foo";
# write a 2 rows by 3 columns range
$sheet->Range("A8:C9")->{Value} = [[ undef, 'Xyzzy', 'Plugh' ],
[ 42, 'Perl', 3.1415 ]];
# print "XyzzyPerl"
$array = $sheet->Range("A8:C9")->{Value};
for (@$array) {
for (@$_) {
print defined($_) ? "$_|" : "<undef>|";
}
print "\n";
}
# save and exit
$book->SaveAs( 'test.xls' );
undef $book;
undef $ex;
Basically, Win32::OLE gives you everything that is available to a VBA or Visual Basic application, which includes a huge variety of things -- everything from Excel and Word automation to enumerating and mounting network drives via Windows Script Host. It has come standard with the last few editions of ActivePerl.
There's a section of the Spreadsheet::WriteExcel docs that covers Modifying and Rewriting Spreadsheets.
An Excel file is a binary file within a binary file. It contains several interlinked checksums and changing even one byte can cause it to become corrupted.
As such you cannot simply append or update an Excel file. The only way to achieve this is to read the entire file into memory, make the required changes or additions and then write the file out again.
You can read and rewrite an Excel file using the Spreadsheet::ParseExcel::SaveParser module which is a wrapper around Spreadsheet::ParseExcel and Spreadsheet::WriteExcel. It is part of the Spreadsheet::ParseExcel package.
There's an example as well.
The Spreadsheet::ParseExcel::SaveParser module is a wrapper around Spreadsheet::ParseExcel and Spreadsheet::WriteExcel.
I recently updated the documentation with what I hope is a clearer example of how to do this.